Article
Improved YOLOv8-Seg Network for Instance Segmentation of
Healthy and Diseased Tomato Plants in the Growth Stage
Xiang Yue , Kai Qi, Xinyi Na , Yang Zhang , Yanhua Liu and Cuihong Liu *

College of Engineering, Shenyang Agricultural University, Shenyang 110866, China; yuexiang@syau.edu.cn (X.Y.);
2022240067@stu.syau.edu.cn (K.Q.); 2022240052@stu.syau.edu.cn (X.N.); 2022240040@stu.syau.edu.cn (Y.Z.);
2022240006@stu.syau.edu.cn (Y.L.)
* Correspondence: cuihongliu77@syau.edu.cn

Abstract: The spread of infections and rot are crucial factors in the decrease in tomato production.
Accurately segmenting the affected tomatoes in real-time can prevent the spread of illnesses. However,
environmental factors and surface features can affect tomato segmentation accuracy. This study
suggests an improved YOLOv8s-Seg network to perform real-time and effective segmentation of
tomato fruit, surface color, and surface features. The feature fusion capability of the algorithm was
improved by replacing the C2f module with the RepBlock module (stacked RepConv blocks), adding
SimConv convolution (using the ReLU function instead of the SiLU function as the activation function)
before the two upsampling operations in the feature fusion network, and replacing the remaining
conventional convolutions with SimConv. The F1 score was 88.7%, which was 1.0%, 2.8%, 0.8%, and
1.1% higher than that of the YOLOv8s-Seg, YOLOv5s-Seg, YOLOv7-Seg, and Mask RCNN algorithms,
respectively. Meanwhile, the segment mean average precision (segment mAP@0.5) was 92.2%, which
was 2.4%, 3.2%, 1.8%, and 0.7% higher than that of the YOLOv8s-Seg, YOLOv5s-Seg, YOLOv7-Seg,
and Mask RCNN algorithms, respectively. The algorithm can perform real-time instance segmentation
of tomatoes with an inference time of 3.5 ms. This approach provides technical support for tomato
health monitoring and intelligent harvesting.

Keywords: YOLOv8; instance segmentation; disease detection; maturity segmentation


Citation: Yue, X.; Qi, K.; Na, X.; Zhang, Y.; Liu, Y.; Liu, C. Improved YOLOv8-Seg Network for Instance Segmentation of Healthy and Diseased Tomato Plants in the Growth Stage. Agriculture 2023, 13, 1643. https://doi.org/10.3390/agriculture13081643

Academic Editor: Suresh Neethirajan

Received: 13 July 2023; Revised: 16 August 2023; Accepted: 16 August 2023; Published: 21 August 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The accurate segmentation of growing and diseased tomatoes, even against complex backgrounds, is essential for effective tomato picking, fruit monitoring, and precise evaluation of tomato size and quality [1]. Accurate and timely segmentation can benefit various applications, such as machine vision and greenhouse monitoring systems [2]. Tomato instance segmentation currently faces several challenges. Environmental factors such as changes in lighting, overlapping fruit, leaf occlusion, and variations in angles can interfere with the process. Moreover, alterations in the surface color and features of the tomatoes can adversely affect segmentation outcomes [3].

Over the past decade or more, significant research and practical efforts have focused on detecting and segmenting target fruits. Conventional techniques involve analyzing single features, such as color, geometric shape, and texture. For instance, Si et al. [4] segmented apples from background images using color surface variation analysis. Methods of multi-feature fusion analysis employ either a combination of geometric shape and color attributes or a fusion of color, intensity, edge, and orientation characteristics. Yin et al. [5] designed an approach to recognize ripe tomatoes by initially reducing noise in cases of occlusion and overlap of the fruit and subsequently combining their color and shape attributes for recognition. Nonetheless, both single-feature analysis and multi-feature fusion techniques are compromised by low robustness and extensive time consumption. Detection and segmentation accuracy can be negatively impacted by changes in
environmental conditions and variations in fruit surface color and features, especially in
unstructured environments.
The accurate, efficient, and real-time instance segmentation of growing and diseased
tomatoes is essential in the complex environment of tomato greenhouses. This enables the
timely picking of ripe fruits, helps avoid spoilage, and assists in monitoring diseased fruits
to prevent bacterial infections in the planting field. In the past several years, deep learning
technology has widely been employed for tasks such as instance segmentation and object
detection, owing to its high accuracy and efficiency. To achieve object detection and instance
segmentation simultaneously, Mask RCNN [6] adds a branch for generating binary masks on top of
Faster R-CNN (Faster R-CNN: Towards real-time object detection with region proposal
networks) [7]. Jia et al. [8] used an improved Mask RCNN algorithm
for instance segmentation on overlapping apples. They fused ResNet [9] and DenseNet as
the feature extraction network of the model and achieved an accuracy rate of 97.31% on
120 images in the test set. Huang et al. [10] proposed a fuzzy Mask R-CNN to automatically
identify the maturity of tomato fruits. They distinguished the foreground and background
in the image through the fuzzy c-means model and Hough transform method, located the
edge features for automatic labeling, and achieved a 98.00% accuracy rate in 100 images.
Afonso et al. [11] used the Mask R-CNN model, with ResNet101 as the backbone, to segment
ripe and unripe tomatoes, achieving a segmentation precision of 95% and 94% for each,
respectively. Wang et al. [12] proposed an improved Mask RCNN model that integrates
the attention mechanism for segmenting apple maturity under various conditions, such as
light influence, occlusion, and overlap. The test results showed accuracy and recall rates
of 95.8% and 97.1%, respectively. In addition, the segmentation of tomato fruit [13], the
detection of tomato fruit infection areas [14], the segmentation of tomato maturity [15], and
the segmentation of soil blocks [16] based on Mask RCNN have demonstrated the high
precision and robustness of the Mask RCNN algorithm in object detection and instance
segmentation. Mask RCNN is a conventional two-stage instance segmentation model.
Masks are generated by Mask RCNN through feature positioning. The located features are
then passed to the mask predictor after performing pooling operations on the region of
interest. However, executing these operations sequentially can cause slow segmentation
speed, large model size, and an increased number of computing parameters. In contrast
to conventional instance segmentation algorithms, which rely on feature localization to
generate masks, YOLACT (You Only Look At Coefficients) [17] is a real-time method. It
can rapidly generate high-quality instance masks by parallelizing the tasks of generating
prototype masks and predicting mask coefficients. The task of instance segmentation,
which uses the YOLO framework, builds on the principles of the YOLACT network for
completion. Initially, two parallel sub-tasks are executed: generating prototype masks and
predicting mask coefficients. Subsequently, the prototype is subjected to linear weighting
based on the obtained mask coefficients, which leads to the creation of instance masks.
Mubashiru [18] proposed a lightweight YOLOv5 algorithm for accurately segmenting fruits
from four gourd family plants with similar features. The proposed algorithm achieved a
segmentation accuracy of 88.5%. Although this method attains faster segmentation speed,
there is a need to further optimize the accuracy of the segmentation. In contrast to the
anchor-based detection head of YOLOv5, YOLOv8 adopts a novel anchor-free method. This
method decreases the number of hyperparameters, which improves the model’s scalability
while enhancing segmentation performance.
This paper proposes an improved YOLOv8s-Seg algorithm for segmenting healthy
and diseased tomatoes, building on research conducted by scholars worldwide. The research
consisted of the following tasks:
(1) To enhance the edge features of the tomatoes, algorithms such as Gaussian blur,
Sobel operator, and weighted superposition were used to sharpen the 1600 photos in
the original dataset. Further data enhancement operations expanded the dataset to
9600 photos;
(2) The feature fusion capability of the algorithm was improved by adding SimConv convolution [19] before the two upsampling operations in the feature fusion network, replacing the remaining regular convolutions with SimConv convolution, and swapping the C2f module with the RepBlock module [20];
(3) An improved YOLOv8s-Seg algorithm was proposed to address the slow running time, high parameter count, and large number of calculations of the two-stage instance segmentation model. This algorithm was designed with the aim of effective, real-time instance segmentation of healthy and diseased tomatoes.

2. Materials and Methods

2.1. Data Acquisition

The dataset includes photographs of tomatoes at four stages of maturity, including young fruit, immature, half-ripe, and ripe. It also includes images of six common tomato diseases: grey mold, umbilical rot, crack, bacterial canker, late blight, and virus disease. A total of 788 photos were captured from tomato cultivation plots 26 and 29 at Shenyang Agricultural University (latitude: 41.8° N), with a seedling spacing of 0.3 m. Furthermore, an additional 812 photos demonstrating five of the previously mentioned diseases, namely grey mold, umbilical rot, bacterial canker, late blight, and virus disease, were retrieved from Wikipedia. This brings the total number of photos in the dataset to 1600. The images were taken using an iPhone 13, which captured them in JPG format with a resolution of 1280 × 720 pixels. Images retrieved from Wikipedia were preserved in the same format and resolution. The dataset was divided into training and validation sets at a 7:3 ratio. As a result, there were 1120 photos in the training set and 480 photos in the validation set. Example images are shown in Figure 1.

Figure 1. Example images. (a) young fruit, (b) immature, (c) half-ripe, (d) ripe, (e,f) ripe, immature, (g) umbilical rot, (h) grey mold, (i) crack, (j) virus disease, (k) late blight, (l) bacterial canker.

2.2. Image Preprocessing

In this study, the dataset of 1600 photos was enhanced using several techniques, including the Sobel operator and weighted overlay. These methods aimed to improve the clarity of tomato fruit edges for better image annotation and feature extraction. Initially, Gaussian blur was applied to the images to reduce noise. Then, the Sobel operator was utilized to calculate the image gradients and extract edge features. Finally, the gradient images were combined with the original images using a weighted overlay technique to enhance the edge features. Figure 2 compares the photos before and after the image sharpening process.

Figure 2. Image sharpening. (a,c) Original image. (b,d) Sharpened image.
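This three-step pipeline maps directly onto standard OpenCV calls. A minimal sketch is given below; the kernel sizes and blending weights are illustrative assumptions, since the paper does not report the exact values used:

```python
import cv2

def sharpen_edges(path: str, alpha: float = 1.0, beta: float = 0.5):
    """Sharpen fruit edges: Gaussian blur -> Sobel gradients -> weighted overlay."""
    img = cv2.imread(path)                                  # BGR image
    blurred = cv2.GaussianBlur(img, (3, 3), 0)              # denoise before taking gradients
    gx = cv2.Sobel(blurred, cv2.CV_16S, 1, 0, ksize=3)      # horizontal gradient
    gy = cv2.Sobel(blurred, cv2.CV_16S, 0, 1, ksize=3)      # vertical gradient
    grad = cv2.addWeighted(cv2.convertScaleAbs(gx), 0.5,
                           cv2.convertScaleAbs(gy), 0.5, 0) # combined gradient image
    # Overlay the gradient image on the original to emphasize edges.
    return cv2.addWeighted(img, alpha, grad, beta, 0)

sharpened = sharpen_edges("tomato.jpg")
cv2.imwrite("tomato_sharpened.jpg", sharpened)
```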
Increasing the number of images through data enhancement can help minimize overfitting during the training process and improve the robustness of the model, thereby increasing the generalization ability of the model. Lighting conditions and shooting angles can significantly influence fruit detection and segmentation in tomato greenhouses. Brightness adjustment, mirroring, and rotation operations were applied to simulate weather changes and variations in detection equipment angles [21]. Performing data enhancement on the sharpened dataset of 1600 photos expanded the dataset to 9600 photos through brightness adjustment, mirroring, and rotation, as shown in Figure 3. The training set includes 6720 photos, and the validation set contains 2880 photos with 6744 tomato fruits. Table 1 presents the detailed structure.

Figure 3. Data enhancement. (a) sharpened image; (b) sharpened image adjusted brightness (brightened); (c) sharpened image adjusted brightness (darkened); (d) sharpened image subjected to mirroring operation; (e) sharpened image rotated by 180°; (f) sharpened image rotated by 30°.
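A possible OpenCV sketch of the five offline transformations is shown below; the brightness factors are assumed for illustration, as the paper does not specify them:

```python
import cv2
import numpy as np

def augment(img: np.ndarray) -> list[np.ndarray]:
    """Generate brightened, darkened, mirrored, and rotated variants of one image."""
    h, w = img.shape[:2]
    brighter = cv2.convertScaleAbs(img, alpha=1.3)        # brightness up (assumed factor)
    darker = cv2.convertScaleAbs(img, alpha=0.7)          # brightness down (assumed factor)
    mirrored = cv2.flip(img, 1)                           # horizontal mirror
    rot180 = cv2.rotate(img, cv2.ROTATE_180)              # 180° rotation
    m = cv2.getRotationMatrix2D((w / 2, h / 2), 30, 1.0)  # 30° rotation about the center
    rot30 = cv2.warpAffine(img, m, (w, h))
    return [brighter, darker, mirrored, rot180, rot30]

img = cv2.imread("tomato_sharpened.jpg")
for i, aug in enumerate(augment(img)):
    cv2.imwrite(f"tomato_aug_{i}.jpg", aug)
```

Five augmented variants plus the sharpened original account for the 6× expansion from 1600 to 9600 photos (1120 × 6 = 6720 training and 480 × 6 = 2880 validation images).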
Table 1. Structure of the dataset.

Type | Number of Images after Sharpening | Train (Data Enhancement) | Validation (Data Enhancement) | Number of Instances (Validation)
late blight | 156 | 655 | 281 | 602
crack | 150 | 630 | 270 | 740
grey mold | 152 | 638 | 274 | 593
virus | 164 | 689 | 295 | 589
rot | 174 | 731 | 313 | 601
canker | 166 | 697 | 299 | 579
ripe | 161 | 677 | 289 | 852
half-ripe | 152 | 638 | 274 | 778
immature | 168 | 706 | 302 | 900
young | 157 | 659 | 283 | 780
Total | 1600 | 6720 | 2880 | 6744

Late blight indicates tomato late blight; crack indicates tomato crack; grey mold indicates tomato grey mold; virus indicates tomato virus disease; rot indicates tomato umbilical rot; canker indicates tomato bacterial canker; ripe indicates ripe tomato; half-ripe indicates half-ripe tomato; immature indicates immature tomato; and young indicates young fruit tomato.

Ten categories of photos were manually annotated using Labelme software (version 6.1.1) after completing the data enhancement. These categories are tomato late blight, tomato crack, tomato grey mold, tomato virus disease, tomato umbilical rot, tomato bacterial canker, ripe tomatoes, half-ripe tomatoes, immature tomatoes, and young fruit. During the annotation process, the polygon tool was selected to annotate the edges of the tomatoes that required instance segmentation. Figure 4 displays the annotation of tomatoes.

Figure 4. Annotation of tomatoes.
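Training a YOLO-style segmentation model on Labelme output requires converting each polygon to the normalized text format that the YOLO family expects (one `class x1 y1 x2 y2 ...` line per instance). A possible conversion sketch follows; the class list, label strings, and file paths are illustrative assumptions, not values taken from the paper:

```python
import json
from pathlib import Path

# Assumed class order; must match the dataset configuration used for training.
CLASSES = ["late blight", "crack", "grey mold", "virus", "rot",
           "canker", "ripe", "half-ripe", "immature", "young"]

def labelme_to_yolo_seg(json_path: str, out_dir: str) -> None:
    """Convert one Labelme JSON file to a YOLO segmentation label file."""
    data = json.loads(Path(json_path).read_text())
    w, h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:                      # one polygon per tomato instance
        cls = CLASSES.index(shape["label"])
        coords = []
        for x, y in shape["points"]:                  # normalize coordinates to [0, 1]
            coords += [f"{x / w:.6f}", f"{y / h:.6f}"]
        lines.append(" ".join([str(cls)] + coords))
    out = Path(out_dir) / (Path(json_path).stem + ".txt")
    out.write_text("\n".join(lines))
```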

2.3. Tomato Instance Segmentation Based on Improved YOLOv8s-Seg

The YOLO (you only look once) series is a deep-learning model for detecting objects. YOLOv8, developed by the same authors as YOLOv5, shares a similar overall style. YOLOv8 has made significant improvements and optimizations over the YOLOv5 network, resulting in enhanced algorithm performance. The YOLOv8 network supports object detection and tracking, as well as additional tasks, such as instance segmentation, image classification, and key point detection. Similar to YOLOv5, YOLOv8 provides five different scales of models (n, s, m, l, x), with increasing depth and width from left to right. In reference to the ELAN design philosophy [22], YOLOv8 replaces the C3 structure in the YOLOv5 backbone network with a C2f structure. This alteration enables YOLOv8 to maintain its lightweight characteristics while obtaining a greater amount of gradient flow information. Compared to YOLOv5, the head part of YOLOv8 exhibits more prominent differences due to the implementation of the widely used decoupled head structure. For loss function calculation, YOLOv8 utilizes the TaskAlignedAssigner positive sample assignment strategy [23]. Furthermore, it introduces the distribution focal loss [24]. During training, the strategy of disabling mosaic augmentation in the last 10 epochs, introduced in YOLOX [25], is incorporated to effectively improve precision. YOLOv8s-Seg is an extension of the YOLOv8 object detection model, specifically designed for segmentation tasks. The YOLOv8s-Seg network draws on the principles of the YOLACT network to achieve real-time instance segmentation of objects and maintain a high segment mean average precision. Figure 5 displays the structure of the YOLACT network.

Figure 5. Structure of YOLACT network.

The YOLOv8-Seg (ultralytics-8.0.57) network consists of two main components: backbone and head (which can be further divided into neck and segment). The GitHub repository provides five different scale models of the network, namely, YOLOv8n-Seg, YOLOv8s-Seg, YOLOv8m-Seg, YOLOv8l-Seg, and YOLOv8x-Seg. In this study, experiments were conducted on YOLOv8-Seg models of different scales to evaluate the segment mAP@0.5 and model size. Table 2 presents the results.

Table 2. Comparison of network segmentation results.

Models | Seg mAP@0.5 | Model Size (MB)
YOLOv8n-Seg | 0.853 | 6.5
YOLOv8s-Seg | 0.898 | 20.4
YOLOv8m-Seg | 0.900 | 54.8
YOLOv8l-Seg | 0.903 | 92.3
YOLOv8x-Seg | 0.907 | 143.9

Table 2 shows that YOLOv8s-Seg achieved a segment mAP@0.5 of 89.8%, a 4.5%
improvement over YOLOv8n-Seg. However, it was slightly lower than YOLOv8m-Seg,
YOLOv8l-Seg, and YOLOv8x-Seg by 0.2%, 0.5%, and 0.9%, respectively. Regarding model
size, YOLOv8s-Seg occupies 20.4 MB, an increase of 13.9 MB compared to YOLOv8n-Seg.
However, it is significantly lighter than YOLOv8m-Seg, YOLOv8l-Seg, and YOLOv8x-Seg,
with reductions of 34.4 MB, 71.9 MB, and 123.5 MB, respectively. Considering the segment
mAP@0.5 performance and lightweight requirements, YOLOv8s-Seg was selected as the
model for experimentation in this study. Figure 6 illustrates the structure of the improved
YOLOv8s-Seg network.
Figure 6. Structure of tomato segmentation network based on improved YOLOv8s-Seg. The ability to fuse features of the network was improved by replacing the C2f module with the RepBlock module, adding SimConv convolution before the two upsampling operations in the neck module, and replacing the remaining conventional convolution with SimConv.
The backbone network of YOLOv8s-Seg consists of a 3 × 3 convolution, a C2f module, and an SPPF (spatial pyramid pooling fusion) module. In contrast to the YOLOv5 network, YOLOv8s-Seg replaces the initial 6 × 6 convolution with a 3 × 3 convolution in the backbone network, making the model more lightweight. Additionally, the C3 module (Figure 7) in YOLOv5 is replaced with the C2f module in YOLOv8s-Seg. The C2f module, designed with skip connections and additional split operations, enriches the gradient flow during backpropagation and improves the performance of the model. YOLOv8s-Seg utilizes two versions of the cross stage partial network (CSP) [26]: the backbone network employs residual connections (as shown in Figure 6), while the head part uses direct connections. The SPPF structure in YOLOv8s-Seg remains the same as in YOLOv5 (version 6.1), utilizing cascaded 5 × 5 pooling kernels to accelerate network operation speed.

Figure 7. Structure of the C3 module.
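As a reading aid, a simplified PyTorch sketch of the C2f idea (split the channels, chain bottleneck blocks, and concatenate every intermediate output) is given below; it abstracts away details of the official Ultralytics implementation and is not a drop-in replacement for it:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with a residual (skip) connection."""
    def __init__(self, c: int):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1, bias=False),
                                   nn.BatchNorm2d(c), nn.SiLU())
        self.conv2 = nn.Sequential(nn.Conv2d(c, c, 3, 1, 1, bias=False),
                                   nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, x):
        return x + self.conv2(self.conv1(x))

class C2f(nn.Module):
    """Split features, run n bottlenecks, and concatenate all branches."""
    def __init__(self, c_in: int, c_out: int, n: int = 1):
        super().__init__()
        self.c = c_out // 2                               # c_out assumed even
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)         # produces the two split halves
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)  # fuses all concatenated branches
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))             # split into two halves
        for block in self.blocks:
            y.append(block(y[-1]))                        # each block feeds the next
        return self.cv2(torch.cat(y, dim=1))              # rich gradient flow via concat
```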

The head module is comprised of the neck and segment parts. The neck module incorporates the path aggregation network (PANet) [27] and feature pyramid network (FPN) [28] as the feature fusion networks. Unlike YOLOv5 and YOLOv6, YOLOv8s-Seg removes the 1 × 1 convolution before upsampling and fuses the feature maps directly from different stages of the backbone network. This study aimed to enhance the network performance of YOLOv8s-Seg by improving its neck module. Specifically, before each upsampling operation, two 1 × 1 SimConv convolutions were added, and the remaining regular convolutions in the neck part were replaced with 3 × 3 SimConv convolutions. The C2f module (Figure 8) was replaced with the RepBlock module (Figure 6). The RepBlock module is composed of stacked RepConv convolutions, and the structure of the RepConv convolution is depicted in Figure 6.

Figure 8. Structure of the C2f module in the neck.
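The two building blocks named above come from the YOLOv6/EfficientRep line of work [19,20]. A condensed PyTorch sketch of both follows, under the assumption that SimConv is a convolution-BatchNorm-ReLU triple and that RepBlock is a stack of RepConv units shown here in their deploy-time single-branch form (the training-time parallel 3 × 3, 1 × 1, and identity branches are fused away at inference):

```python
import torch.nn as nn

class SimConv(nn.Module):
    """Conv + BN + ReLU; the ReLU replaces the SiLU used by standard YOLOv8 convolutions."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class RepBlock(nn.Module):
    """A stack of RepConv-style units; each collapses to a single 3x3 conv at inference."""
    def __init__(self, c_in: int, c_out: int, n: int = 1):
        super().__init__()
        layers = [SimConv(c_in, c_out)]                  # first unit adapts the channel count
        layers += [SimConv(c_out, c_out) for _ in range(n - 1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)
```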


The YOLOv5 network employs a static allocation strategy to assign positive and negative samples based on the intersection over union (IOU) between the predicted boxes and ground truth. However, the YOLOv8s-Seg network has improved this aspect by introducing a superior dynamic allocation strategy. It incorporates the TaskAlignedAssigner (TOOD), which selects positive samples based on a weighted score that comes from the classification and regression scores. The computation is represented by Formula (1).

t = s^α × u^β (1)

where s is the prediction score for the labeled category, u is the IOU of the prediction box with the ground truth, and t is the alignment score used for sample selection.

During training, YOLOv8s-Seg performs online image enhancement to ensure that the model encounters slightly different images in each epoch. Mosaic enhancement is a crucial method of data improvement that randomly combines four images. This technique compels the model to learn how to detect partially obstructed and differently positioned objects. In the last 10 training epochs, the YOLOv8s-Seg network deactivates the mosaic enhancement, a method proven to improve network precision effectively. Figure 9 demonstrates an example of mosaic enhancement.

Figure 9. Mosaic data enhancement.
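Returning to Formula (1), a minimal sketch of how this alignment metric ranks candidate anchors is given below. The α and β values are illustrative (the TOOD paper reports α = 1 and β = 6); the defaults used by a given YOLOv8 release may differ:

```python
import torch

def alignment_scores(cls_scores: torch.Tensor, ious: torch.Tensor,
                     alpha: float = 1.0, beta: float = 6.0) -> torch.Tensor:
    """Compute t = s**alpha * u**beta for each anchor-ground-truth pair.

    cls_scores: predicted scores for the ground-truth class, shape (num_anchors,)
    ious:       IOU of each predicted box with the ground truth, shape (num_anchors,)
    Anchors with the highest t are selected as positive samples.
    """
    return cls_scores.pow(alpha) * ious.pow(beta)

# With beta > alpha, a well-localized box outranks a confident but poorly placed one.
s = torch.tensor([0.9, 0.6, 0.8])
u = torch.tensor([0.7, 0.9, 0.5])
t = alignment_scores(s, u)
print(t.topk(1).indices)  # index 1: the anchor with IOU 0.9 wins
```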
2.4. Model Training and Performance Evaluation

The examinations were performed on Windows 10 using a 12 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @2.50 GHz and an Nvidia GeForce RTX 2080Ti graphics card. The framework for deep learning was PyTorch 1.8.1 and Compute Unified Device Architecture (CUDA) 11.1, accelerated by cuDNN version 8.0.5. In this experiment, the improved YOLOv8s-Seg, YOLOv8s-Seg, YOLOv5s-Seg, and YOLOv7-Seg were trained in the same environment configuration and under the same hyperparameter settings, as indicated in Table 3. For Mask RCNN, the learning rate, batch size, learning momentum, weight decay, number of iterations, and image size were set to 0.004, 2, 0.9, 1 × 10⁻⁴, 30 epochs, and 640 × 640 pixels, respectively.

Table 3. Hyperparameters during training.

Parameter | Value
learning rate | 0.01
batch size | 16
momentum | 0.937
weight decay | 0.0005
number of iterations | 300 epochs
image size | 640 × 640 pixels
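Assuming the standard Ultralytics training API (the version cited above is ultralytics-8.0.57), the settings in Table 3 correspond to a call along the following lines; `tomato.yaml` is a hypothetical dataset configuration listing the ten classes and the train/validation paths:

```python
from ultralytics import YOLO

# Load the small segmentation variant selected in Table 2.
model = YOLO("yolov8s-seg.pt")

# Hyperparameters from Table 3; argument names follow the Ultralytics API.
model.train(
    data="tomato.yaml",      # hypothetical dataset config (classes + image paths)
    epochs=300,              # number of iterations
    imgsz=640,               # input image size
    batch=16,                # batch size
    lr0=0.01,                # initial learning rate
    momentum=0.937,          # SGD momentum
    weight_decay=0.0005,     # weight decay
    close_mosaic=10,         # disable mosaic augmentation for the last 10 epochs
)
```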

In this study, we assess the performance of the improved YOLOv8s-Seg using precision,
recall, F1 score, and segment mAP@0.5 . Tomato locations were assessed using precision,
recall, and F1 score, while segmentation results were evaluated using segment mAP [29].
Equations (2)–(5) are used to calculate the precision, recall, F1 score, and segment mAP
scores. The higher the four parameters are, the better the segmentation results.

precision = TP / (TP + FP) × 100% (2)

recall = TP / (TP + FN) × 100% (3)

F1 = 2 × (precision × recall) / (precision + recall) (4)

segmAP = ( Σ_{i=1}^{C} AP(i) ) / C (5)

where TP denotes an actual positive sample with a positive prediction, while FP indicates
an actual negative sample with a positive prediction, and FN indicates an actual positive
sample with a negative prediction. AP represents the average precision of segmentation.
The segmentation performance of the model increases with the AP score. C represents the
number of segmentation categories.
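These definitions translate directly into code; a small sketch with illustrative counts (not values from the paper) follows:

```python
def detection_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 score from raw detection counts (Equations (2)-(4))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def seg_map(ap_per_class: list[float]) -> float:
    """Segment mAP as the mean of per-class average precision (Equation (5))."""
    return sum(ap_per_class) / len(ap_per_class)

p, r, f1 = detection_metrics(tp=90, fp=8, fn=15)   # illustrative counts
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```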

3. Results and Discussion


3.1. Instance Segmentation between Growing and Diseased Tomatoes
To validate the performance of the improved YOLOv8s-Seg in segmenting tomato
growth stages and common diseases in fruits, we used 2880 photos containing 6744 tomato
fruits for validation. The experimental results showed that the model achieved precision,
recall, F1 score, and segment mAP@0.5 of 91.9%, 85.8%, 88.7%, and 92.2%, respectively.
Figure 10 presents examples of instance segmentation on tomatoes affected by several
factors like leaf occlusion, fruit overlap, lighting variations, angle changes, growth stages,
and common diseases of the improved YOLOv8s-Seg network. Table 4 shows the results of
the segmentation for growing and diseased tomatoes. As shown in Figure 10, the improved
YOLOv8s-Seg algorithm accurately segments tomatoes (Figure 10d–f,j) affected by leaf
occlusion (Figure 10a), fruit overlap (Figure 10b), lighting variations (Figure 10c), and angle
changes (Figure 10g). Overall, the algorithm exhibits precise segmentation performance on
tomatoes affected by factors such as leaf occlusion, fruit overlap, lighting variations, and
angle changes. Meanwhile, the algorithm also achieves outstanding instance segmentation
results for tomatoes (Figure 10k,l) in growth stages (Figure 10h) and those affected by
disease (Figure 10i). From Table 4, it can be seen that the precision rates obtained by
the improved YOLOv8s-Seg algorithm for instance segmentation of ripe tomato, half-ripe
tomato, immature tomato, young fruit, grey mold, umbilical rot, bacterial canker, late blight,
virus disease, and crack scores were 92.7%, 92.3%, 89.9%, 91.2%, 92.6%, 92.2%, 91.5%, 92.4%,
93.0%, and 91.3%, respectively. The analysis of instance segmentation results on tomatoes
in growth stages and diseased tomatoes reveals the effectiveness of the algorithm in
overcoming the impact of tomato surface color and features on segmentation performance.
In conclusion, the algorithm exhibits remarkable segmentation performance on tomatoes affected by environmental factors during growth stages and disease.

Figure 10. Examples of instance segmentation of tomatoes. (a): ripe tomatoes and immature tomatoes shaded by leaves, half-ripe tomatoes with intact fruit characteristics; (b): immature tomatoes with overlapping fruit and ripe tomatoes with intact fruit characteristics; (c): immature tomatoes affected by changes in light; (d): example segmentation of tomatoes shaded by leaves; (e): example segmentation of overlapping fruit; (f): example segmentation of tomatoes affected by changes in light; (g): immature and ripe tomatoes affected by changes in angle; (h): immature tomatoes, half-ripe tomatoes, and young fruit; (i): cracked tomatoes, ripe tomatoes; (j): example segmentation of tomatoes affected by changes in angle; (k): example segmentation for immature tomatoes, half-ripe tomatoes, and young fruit; (l): example segmentation for cracked tomatoes and ripe tomatoes.

Table 4. Segmentation results of healthy and diseased tomatoes.

Type | Canker | Immature | Crack | Ripe | Half-Ripe | Grey Mold | Late Blight | Rot | Young Fruit | Virus
precision (%) | 91.5 | 89.9 | 91.3 | 92.7 | 92.3 | 92.6 | 92.4 | 92.2 | 91.2 | 93.0

Late blight indicates tomato late blight; crack indicates tomato crack; grey mold indicates tomato grey mold; virus indicates tomato virus disease; rot indicates tomato umbilical rot; canker indicates tomato bacterial canker; ripe indicates ripe tomato; half-ripe indicates half-ripe tomato; immature indicates immature tomato; and young indicates young fruit tomato.

3.2. Comparison with Other Instance Segmentation Algorithms


In this paper, to investigate the segmentation capabilities of the improved YOLOv8s-
Seg for tomatoes, the performance of the network was evaluated in terms of precision, recall,
F1 score, segment mAP@0.5 , and inference time. This performance was then compared
with that of YOLOv8s-Seg, YOLOv7-Seg, YOLOv5s-Seg, and Mask RCNN algorithms. The
example segmentation results of the five networks were obtained using the same training
and validation sets during the training process. The hyperparameters of the five models
during the training process were set as follows: the hyperparameters of YOLOv8s-Seg, the
improved YOLOv8s-Seg, the YOLOv5s-Seg, and YOLOv7-Seg are shown in Table 3. For
Mask RCNN, the learning rate, batch size, learning momentum, weight decay, and number
of iterations were set to 0.004, 2, 0.9, 1 × 10⁻⁴, and 30 epochs, respectively. Table 5 displays
the segmentation results for the five models.
Table 5. Segmentation results for the five algorithms.

Method | Precision (%) | Recall (%) | F1 Score (%) | Segment mAP@0.5 | Inference Time (ms)
Mask RCNN | 89.8 | 85.5 | 87.6 | 0.915 | 90
YOLOv5s-Seg | 89.0 | 83.0 | 85.9 | 0.890 | 2.5
YOLOv7-Seg | 91.4 | 84.8 | 87.9 | 0.904 | 15.2
YOLOv8s-Seg | 90.3 | 85.4 | 87.7 | 0.898 | 3.1
Improved YOLOv8s-Seg (ours) | 91.9 | 85.8 | 88.7 | 0.922 | 3.5

In Table 5, the results show the performance of the improved YOLOv8s-Seg algorithm
compared to other models. The improved YOLOv8s-Seg algorithm achieves precision,
recall, F1 score, and segment mAP@0.5 of 91.9%, 85.8%, 88.7%, and 0.922, respectively.
Compared to the YOLOv8s-Seg algorithm, the improvements were 1.6%, 0.4%, 1.0%, and
2.4%, respectively. Compared to the YOLOv5s-Seg algorithm, the improvements were 2.9%,
2.8%, 2.8%, and 3.2%, respectively. Compared to the YOLOv7-Seg algorithm, this algorithm
showed increases of 0.5%, 1.0%, 0.8%, and 1.8%. Compared to the Mask RCNN algorithm,
this algorithm had increments of 2.1%, 0.3%, 1.1%, and 0.7%, respectively. Additionally, the
inference time of 3.5 ms signifies a minor increase over YOLOv5s-Seg and YOLOv8s-Seg
(1.0 ms and 0.4 ms, respectively) but a significant reduction over YOLOv7-Seg and Mask RCNN (11.7 ms
and 86.5 ms), supporting real-time instance segmentation. In conclusion, the improved
YOLOv8s-Seg algorithm stands out in precision, recall, F1 score, and segment mAP@0.5 ,
with effective inference time. Figure 11 provides the comparison of Segment mAP@0.5 for
five algorithms.

3.3. Comparison of the Improved YOLOv8s-Seg and YOLOv8s-Seg


In this paper, we proposed an improved YOLOv8s-Seg network designed for real-
time and effective instance segmentation of various tomato stages, including young fruit,
immature, half-ripe, ripe, and common diseases such as grey mold, umbilical rot, crack,
bacterial canker, late blight, and virus disease. The feature fusion capability of the algo-
rithm has been significantly improved through enhancements to the feature fusion network,
with modifications detailed in Table 6. Analysis of Tables 5 and 6 shows that the improved
YOLOv8s-Seg network achieved an F1 score of 88.7% and a segment mAP@0.5 of 92.2%,
representing improvements of 1.0% and 2.4%, respectively, over the original YOLOv8s-
Seg network. Notably, the segment mAP@0.5 improvements were achieved with only a
marginal increase in memory size (0.7 MB) and inference time (0.4 ms), highlighting
the network’s efficiency in real-time instance segmentation of both growing and
diseased tomatoes.
Figure 11. Comparison of Segment mAP@0.5 for five algorithms.

Table 6. Comparison of parameter variations.

Methods | Model Size (MB) | Δ Model Size (MB) | GFLOPs | Δ GFLOPs | Parameters | Δ Parameters | Inference Time (ms)
YOLOv8s-Seg | 20.4 | - | 42.5 | - | 11,783,470 | - | 3.1
Improved YOLOv8s-Seg | 21.1 | +0.7 | 47.2 | +4.7 | 10,400,750 | −1,382,720 | 3.5

3.4. Effect of Different Image Resolutions on Tomato Segmentation

To investigate the impact of photo resolution on the segmentation results of the improved YOLOv8s-Seg network, we experimented using different input image sizes during training: 416 × 416 pixels, 640 × 640 pixels, 768 × 768 pixels, and 1024 × 1024 pixels. Table 7 provides a comprehensive overview of the segment mAP@0.5 and the inference time for each image resolution. The results reveal that as the photo resolution increases from 416 × 416 pixels to 640 × 640 pixels, the inference time increases by 2.6 ms, while the segment mAP@0.5 improves by 1.1%. This indicates that the model enhances the instance segmentation performance at the cost of a slight increase in inference time. However, when the resolution is further increased to 768 × 768 pixels and 1024 × 1024 pixels, the inference time shows more substantial increments of 4.4 ms and 6.3 ms, respectively. In contrast, the segment mAP@0.5 only experiences minor improvements of 0.2% and 0.3%. It could be concluded that the resolution of 640 × 640 pixels is more suitable for training the improved YOLOv8s-Seg network. This resolution balances achieving satisfactory segmentation performance and maintaining reasonable inference time.

Table 7. Comparison of network segmentation results at different resolutions.

Resolution (Pixels) | Segment mAP@0.5 (%) | Inference Time (ms)
416 × 416 | 91.1 | 0.9
640 × 640 | 92.2 | 3.5
768 × 768 | 92.4 | 7.9
1024 × 1024 | 92.5 | 9.8

4. Conclusions
An improved YOLOv8s-Seg network based on instance segmentation for tomato
illness and maturity was suggested in this paper. The feature fusion capability of the
algorithm was improved by replacing the C2f module with the RepBlock module, adding
SimConv convolution before the two upsampling operations in the feature fusion network, and replacing
the remaining conventional convolution with SimConv. The improved YOLOv8s-Seg
network achieved a segment mAP@0.5 of 92.2% on the validation set. This showed an
improvement of 2.4% compared to the original YOLOv8s-Seg network, an improvement of
3.2% over the YOLOv5s-Seg network, an improvement of 1.8% relative to the YOLOv7-Seg
network, and an improvement of 0.7% over the Mask RCNN network. Regarding inference
time, the improved YOLOv8s-Seg network reached an inference time of 3.5 ms, an increase of 0.4 ms
and 1.0 ms compared to the YOLOv8s-Seg and YOLOv5s-Seg networks, but a significant
reduction compared to the YOLOv7-Seg and Mask RCNN algorithms, of 11.7 ms
and 86.5 ms respectively. This capability facilitates the real-time segmentation of both
healthy and diseased tomatoes. Overall, the improved YOLOv8s-Seg network exhibits
precise segmentation performance on tomatoes affected by factors such as leaf occlusion,
fruit overlap, lighting variations, and angle changes. Meanwhile, the analysis of instance
segmentation results for tomatoes at different growth stages and diseases shows that the
algorithm effectively reduces the impact of surface color and features on performance.
In conclusion, the algorithm shows notable segmentation performance on tomatoes
affected by environmental factors during growth stages and disease. Future research will
continue to optimize the algorithm to improve the segment mAP@0.5 . Efforts will also
be directed toward simplifying the YOLOv8s-Seg network structure to increase computa-
tional efficiency.

Author Contributions: Conceptualization, X.Y.; methodology, K.Q.; software, K.Q.; validation, X.N.
and Y.Z.; formal analysis, Y.L.; investigation, Y.L.; resources, Y.Z.; data curation, X.N.; writing—
original draft preparation, K.Q.; writing—review and editing, X.Y.; visualization, K.Q.; supervision,
C.L.; project administration, X.Y.; funding acquisition, X.Y. All authors have read and agreed to the
published version of the manuscript.
Funding: This work was supported in part by the Youth Program of the Liaoning Education Depart-
ment under Grant LSNQN202025.
Institutional Review Board Statement: Not applicable.
Data Availability Statement: Data will be made available on request.
Conflicts of Interest: We have no affiliations with any organization with a direct or indirect financial
interest in the subject matter discussed in the manuscript.

References
1. Lee, J.; Nazki, H.; Baek, J.; Hong, Y.; Lee, M. Artificial intelligence approach for tomato detection and mass estimation in precision
agriculture. Sustainability 2020, 12, 9138. [CrossRef]
2. Fan, Y.Y.; Zhang, Z.M.; Chen, G.P. Application of vision sensor in the target fruit recognition system of picking robot. Agric. Mech.
Res. 2019, 41, 210–214.
3. Gongal, A.; Amatya, S.; Karkee, M.; Zhang, Q.; Lewis, K. Sensors and systems for fruit detection and localization: A review.
Comput. Electron. Agric. 2015, 116, 8–19. [CrossRef]
4. Si, Y.; Liu, G.; Feng, J. Location of apples in trees using stereoscopic vision. Comput. Electron. Agric. 2015, 112, 68–74. [CrossRef]

5. Yin, H.; Chai, Y.; Yang, S.X.; Mittal, G.S. Ripe tomato recognition and localization for a tomato harvesting robotic system. In
Proceedings of the International Conference of Soft Computing and Pattern Recognition, Malacca, Malaysia, 4–7 December 2009;
pp. 557–562. [CrossRef]
6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision,
Venice, Italy, 22–29 October 2017; pp. 2961–2969. [CrossRef]
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings
of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015;
p. 28. [CrossRef]
8. Jia, W.; Tian, Y.; Luo, R.; Zhang, Z.; Lian, J.; Zheng, Y. Detection and segmentation of overlapped fruits based on optimized mask
R-CNN application in apple harvesting robot. Comput. Electron. Agric. 2020, 172, 105380. [CrossRef]
9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [CrossRef]
10. Huang, Y.P.; Wang, T.H.; Basanta, H. Using fuzzy mask R-CNN model to automatically identify tomato ripeness. IEEE Access
2020, 8, 207672–207682. [CrossRef]
11. Afonso, M.; Fonteijn, H.; Fiorentin, F.S.; Lensink, D.; Mooij, M.; Faber, N.; Polder, G.; Wehrens, R. Tomato fruit detection and
counting in greenhouses using deep learning. Front. Plant Sci. 2020, 11, 571299. [CrossRef]
12. Wang, D.; He, D. Fusion of Mask RCNN and Attention Mechanism for Instance Segmentation of Apples under Complex
Background. Comput. Electron. Agric. 2022, 196, 106864. [CrossRef]
13. Wang, C.; Yang, G.; Huang, Y.; Liu, Y.; Zhang, Y. A Transformer-based Mask R-CNN for Tomato Detection and Segmentation.
Intell. Fuzzy Syst. 2023, 44, 8585–8595. [CrossRef]
14. Wang, Q.; Qi, F.; Sun, M.; Qu, J.; Xue, J. Identification of Tomato Disease Types and Detection of Infected Areas Based on Deep
Convolutional Neural Networks and Object Detection Techniques. Comput. Intell. Neurosci. 2019, 2019, 9142753. [CrossRef]
15. Hsieh, K.W.; Huang, B.Y.; Hsiao, K.Z.; Tuan, Y.H.; Shih, F.P.; Hsieh, L.C.; Chen, S.; Yang, I.C. Fruit maturity and location
identification of beef tomato using R-CNN and binocular imaging technology. Food Meas. Charact. 2021, 15, 5170–5180. [CrossRef]
16. Liu, L.; Bi, Q.; Liang, J.; Li, Z.; Wang, W.; Zheng, Q. Farmland Soil Block Identification and Distribution Statistics Based on Deep
Learning. Agriculture 2022, 12, 2038. [CrossRef]
17. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 9157–9166. [CrossRef]
18. Mubashiru, L.O. YOLOv5-LiNet: A lightweight network for fruits instance segmentation. PLoS ONE 2023, 18, e0282297. [CrossRef]
19. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection
framework for industrial applications. arXiv 2022. [CrossRef]
20. Weng, K.; Chu, X.; Xu, X.; Huang, J.; Wei, X. EfficientRep: An Efficient Repvgg-style ConvNets with Hardware-aware Neural
Network Design. arXiv 2023. [CrossRef]
21. Magalhães, S.A.; Castro, L.; Moreira, G.; Dos Santos, F.N.; Cunha, M.; Dias, J.; Moreira, A.P. Evaluating the single-shot multibox
detector and YOLO deep learning models for the detection of tomatoes in a greenhouse. Sensors 2021, 21, 3569. [CrossRef]
[PubMed]
22. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object
Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
18–22 June 2023; pp. 7464–7475. [CrossRef]
23. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021
IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499.
[CrossRef]
24. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed
bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [CrossRef]
25. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021. [CrossRef]
26. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning
capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 390–391. [CrossRef]
27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [CrossRef]
28. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
[CrossRef]
29. Tian, Y.; Yang, G.; Wang, Z.; Li, E.; Liang, Z. Instance segmentation of apple flowers using the improved mask R–CNN model.
Biosyst. Eng. 2020, 193, 264–278. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
