SSRN-id4089053

Research on silkworm disease detection in real conditions based on CA-YOLO v3
Hongkang Shi (a), Dingyi Tian (a), Shiping Zhu (a, *), Linbo Li (b), Jianmei Wu (b)
(a) College of Engineering and Technology, Southwest University, Chongqing 400700, China
(b) Sericultural Research Institute, Sichuan Academy of Agricultural Sciences, Sichuan 637000, China
* Corresponding author. zspswu@126.com
Electronic copy available at: https://ssrn.com/abstract=4089053
Abstract: The silkworm is an important economic insect, but it often suffers from diseases during rearing, which causes large cocoon losses in China each year. Accurate recognition of diseased silkworms helps prevent the transmission of pathogens and reduce cocoon losses. However, current deep-learning methods for recognizing silkworm diseases are mainly based on image classification, which differs considerably from real rearing environments. The detection of silkworm diseases under high-density rearing conditions therefore remains an unsolved problem. In this paper, nuclear polyhedrosis virus (NPV), one of the most frequent and most infectious silkworm diseases, was selected as the detection target, and an object detection method was used. Healthy and diseased silkworms were obtained in an actual environment by rearing silkworms and infecting them with the pathogen. Images containing healthy and diseased silkworms at the same time were collected with a mobile phone and labeled with the LabelImg toolkit. The CA-YOLO v3 network was built on the structure of YOLO v3, combined with the ConvNeXt module and an image attention mechanism to extract key features. CA-YOLO v3 was trained and evaluated on the silkworm disease dataset. For healthy and diseased silkworms respectively, the recall was 82.35% and 93.92%, the precision was 95.32% and 87.90%, the F1-score was 0.88 and 0.91, and the AP was 94.81% and 95.19%; the mean Average Precision (mAP) was 95.0%. This performance was better than that of the original YOLO v3 network. A detection GUI for silkworm disease was developed with PyQt 5, and the trained CA-YOLO v3 model was embedded into it, enabling detection on real-time video, local images and local video. The results indicate that CA-YOLO v3 can detect silkworm diseases efficiently and accurately. This study can provide a theoretical reference for the development of disease early warning, and technical support for the development of precise control equipment.
Keywords: silkworm diseases; object detection; CA-YOLO v3; deep learning
1. Introduction
The silkworm is an important economic insect, mainly used for producing natural silk. In China, silkworm rearing has a history of more than 5,000 years and has always been an important part of agriculture and animal husbandry. However, mulberry leaves are picked in the open air and silkworms are reared in open indoor environments, so they are very susceptible to attack by pathogens. In addition, silkworms have a short life cycle and are reared at high density, and the pathogens are usually highly contagious. Infected silkworms are therefore difficult to cure with medicines, and they usually die or fail to form cocoons because of serious physiological damage. Statistics show that the average loss of silkworm cocoons caused by diseases exceeds 20% of total output each year in China; some severely affected areas lose more than 50% or even have no harvest at all, causing huge economic losses (Jiang et al. 2014; Xu et al. 2019).
Because silkworm diseases are harmful and incurable, timely recognition of diseased silkworms helps to cut off the transmission of pathogens, carry out precise control and reduce cocoon losses. However, silkworms are reared at high density, individual differences are subtle, and diseased and healthy silkworms look very similar, which makes them difficult to distinguish. The traditional method relies mainly on manual identification, which suffers from low efficiency, poor reliability and dependence on long-term professional experience, and cannot meet the demands of a developing sericulture industry.
With the development of computer science in recent years, deep learning has been widely used in agriculture and animal husbandry (Kamilaris et al. 2018), and there are many application cases. Karlekar and Seal (2020) proposed a classification method for soybean leaf diseases based on a convolutional neural network (CNN) and achieved a recognition accuracy of 98.14% for 16 disease categories. Wang et al. (2020) designed the CPAFNet network, whose structure is similar to Inception (Szegedy et al. 2014), for pest image recognition; its accuracy reached 92.26% on a dataset containing 73,635 images, better than several state-of-the-art networks. Altuntaş et al. (2019) adopted CNNs and transfer learning to recognize haploid and diploid maize seeds, and their results showed that CNN models could be a useful tool for recognizing haploid maize seeds. Shi et al. (2020) used MobileNet v1 (Howard et al. 2017) for image recognition of 10 silkworm varieties, with accuracy rates of 98.9% and 96%. Inspired by this trend, and because silkworms show different features after being infected, researchers have attempted to use deep learning and machine vision to recognize silkworm diseases. Xia et al. (2019) proposed a classification algorithm for silkworm diseases based on an image attention mechanism (Borji and Itti, 2013) and the DenseNet module (Huang et al. 2016), recognizing images of five categories of diseased silkworms and of healthy silkworms. The recognition accuracy reached 87.8% on a dataset of 7978 images. Ding and Chen (2019) proposed a method based on Mergencelayer-Slicelayer-Pairingloss and the AlexNet (Krizhevsky et al. 2017) architecture to classify images of silkworm diseases, with an accuracy of 84.5% on the same dataset as Xia et al. (2019). In our previous study (Shi et al. 2021), we used rearing and artificial infection to obtain healthy silkworms and silkworms with five diseases at the same growth stage: nuclear polyhedrosis virus (NPV), Nosema bombycis, Beauveria bassiana, bacterial disease and pesticide poisoning. Images were collected with a mobile phone and a disease dataset was constructed under actual conditions. An improved ResNet (He et al. 2016) algorithm was proposed to recognize images of healthy and diseased silkworms, and the average accuracy reached 94.37%. These studies indicate that deep learning can accurately recognize silkworm diseases through image recognition. However, they all used image classification, in which each image in the dataset contains only one silkworm. This is inconsistent with the actual situation of high-density rearing. Moreover, healthy and diseased silkworms are present in the rearing trays at the same time while pathogens multiply and spread, so a classification model is poorly suited to recognizing diseases in actual rearing environments.
Object detection overcomes this shortcoming of classification networks, because it recognizes the category and detects the position of each object simultaneously. Representative object detection algorithms include SSD (Wei et al. 2016), Faster R-CNN (Ren et al. 2017) and YOLO v3 (Redmon and Farhadi. 2018). In agriculture, object detection is a mainstream research topic (Chen et al. 2021). Riekert et al. (2021) proposed a method for 24/7 detection of pig position and posture based on Faster R-CNN and NASNet (Zoph and Li. 2017), which achieved 84.5% mean average precision (mAP) on day recordings and 58% mAP on night recordings. Liu and Wang (2022) proposed a two-stage method for pig face recognition based on EfficientDet-d0 (Tan et al. 2019) and DenseNet 121, which improved the mAP by 28% over an EfficientDet-d0 network trained with the one-stage method. Liu et al. (2022) presented an MSRCR-YOLO v4-tiny model, based on the MSRCR algorithm and channel pruning, to detect corn weeds in field environments. The test results showed an mAP of 96.6%, higher than Faster R-CNN and YOLO v3. Wang et al. (2022) proposed the Pest-D2Det network for pest monitoring by adding an attention module and modifying the backbone of the original D2Det model; Pest-D2Det achieved an mAP of 78.6%.
Among the emerging object detection architectures, YOLO v3 (Redmon and Farhadi. 2018) combines efficient and precise detection and has attracted extensive attention from researchers. Liu and Wang (2020) used YOLO v3 and an image feature pyramid to detect tomato leaf diseases and insect pests, reaching an mAP of 92.39%. Wang et al. (2020) employed YOLO v3 to detect six behaviors of laying hens: mating, standing, feeding, spreading, fighting and drinking. Tian et al. (2019) designed an improved YOLO v3 to detect different growth stages of apples, with better detection results than Faster R-CNN. Bai et al. (2022) used YOLO v3 and U-Net (Olaf et al. 2015) for segmentation and detection of green cucumbers and achieved good performance on both tasks. Zhang et al. (2022) proposed an improved YOLO v3 algorithm that extracts the skeleton of beef cattle by detecting 16 key nodes in actual breeding environments, with an AP of 97.18%. These studies show that YOLO v3 is suitable for a variety of detection tasks. However, silkworms are reared at high density, and healthy silkworms closely resemble diseased ones in appearance. To detect silkworm disease well with YOLO v3, the network's ability to extract key features must be enhanced.
The attention mechanism (Borji and Itti, 2013) is an important method for improving the performance of deep learning networks. It makes a network pay more attention to the features that play an important role in detection or recognition. Common visual attention models include SENet (Hu et al. 2019), CBAM (Woo et al. 2018) and ECA (Wang et al. 2020). Some scholars have combined attention mechanisms with YOLO algorithms for detection research in agriculture and achieved encouraging results. Qi et al. (2022) added the SENet module to the YOLO v5 network to detect leaf diseases of tomato plants, and the network with SENet performed better than the original one. Lu et al. (2022) added the CBAM module to YOLO v4 to detect immature and mature apple fruits, and the improved architecture was better in F1-score, precision and recall. Le et al (2021) designed an improved YOLO v4-tiny network using an adaptive spatial feature pyramid method and the CBAM module. Experiments on the detection of individual green peppers indicated that the improved algorithm performed better than SSD, Faster R-CNN, YOLO v3 and others. Zhang et al. (2021) replaced the backbone of YOLO v4 with the lightweight attention-based model MobileNet v3 (Howard et al. 2019) to detect individual potatoes in complex environments, which significantly improved detection efficiency without reducing accuracy.
In the past, computer vision and sequence modeling were two independent branches of deep learning. However, after the Transformer (Vaswani et al. 2017), a sequence model based purely on the attention mechanism, achieved great success in sequence tasks, more and more researchers applied Transformer ideas to computer vision. The milestone is the ViT model proposed by Dosovitskiy et al (2021). ViT first splits an image into patches of 16 × 16 pixels and flattens each patch into a one-dimensional vector. A Transformer encoder is then used to extract global image features based on the attention mechanism. A one-dimensional vector representing the prediction is obtained by the Softmax operation, in the same way as image classification in CNNs. ViT not only achieved a leap in image classification with a sequence network, but also reached higher accuracy on ImageNet than representative CNN algorithms after pre-training on a large dataset. Subsequently, Liu et al. (2021) proposed the Swin-Transformer network based on ViT, which builds hierarchical feature maps of the image and showed that the Transformer architecture could be used for visual tasks such as object detection and semantic segmentation. Its recognition accuracy on ImageNet was better than that of ViT and state-of-the-art CNNs. Hence, many researchers regard the Transformer architecture as the mainstream of deep learning for the next few years (Han et al. 2021). More recently, however, Liu et al. (2022) proposed a CNN model named ConvNeXt, which adopted a series of optimization methods, absorbed design ideas from Swin-Transformer, and achieved better performance on ImageNet than Swin-Transformer. ConvNeXt is therefore also regarded as the "Renaissance" of CNNs.
In summary, to detect silkworm diseases efficiently and accurately, this work studies the NPV, which is highly infectious and accounts for more than 60% of all disease cases (Jiang et al. 2014). The approach improves the existing YOLO v3 algorithm: a CA-YOLO v3 network was designed in which the backbone of YOLO v3 is replaced by one based on ConvNeXt, Swin-Transformer and a visual attention mechanism. We also constructed a silkworm disease dataset in which each image contains healthy and diseased silkworms, so that it resembles real rearing conditions. This study can support research on disease early warning and the development of precision control equipment.
The rest of this paper is organized as follows. The second section introduces data collection and dataset construction, as well as the structure and principles of CA-YOLO v3. The third section presents the experimental environment and results, and the last section gives the conclusions and future work.
2. Materials and method
2.1 Sample collection and image acquisition
2.1.1 Sample collection
There are thousands of silkworm varieties in the world, and each variety has a different appearance and disease resistance. In this experiment, the variety Chuan Shan × Shu Shui, which is the main variety reared in southwest China and is easily infected by the NPV, was used as the experimental sample. Fig. 1 shows images of healthy and NPV-infected Chuan Shan × Shu Shui silkworms at instar 4. To collect enough samples for dataset construction, silkworms were reared and manually infected with the pathogen before image collection. Co-breeding of young silkworms (Shi et al. 2018) is generally adopted in China and reduces the risk of infection at instars 1 to 3, so this paper focuses on late-instar silkworms (instars 4 to 5).
Fig. 1 Healthy and diseased images of silkworm variety Chuan Shan × Shu Shui: (a) healthy image; (b), (c), (d) diseased images.

In this experiment, more than 3,000 silkworms were reared for sample collection, about half of them as diseased samples and the rest as healthy ones. Silkworms were infected by smearing the NPV pathogen on mulberry leaves on the first day of instar 4. The dosage of pathogen was 5 ml, which makes most silkworms sick. The onset of the NPV occurs about 3 days after infection. After infection, all silkworms were reared normally, and the pathogen was applied only once.
Fig. 2 Infection with the pathogen
2.1.2 Image acquisition
From the third day after infection, an expert checked the infected silkworms one by one, and the diagnosed silkworms were used as diseased samples. During image acquisition, several healthy and diseased silkworms were placed manually on the background to imitate the scene in which healthy and diseased silkworms are mixed when a disease emerges under real rearing conditions. Image acquisition lasted four days because individual silkworms became sick at different times. The number of samples consumed and images collected each day is shown in Table 1. Silkworms that had been photographed were not used again as samples, to prevent cross infection.
Table 1 Number of samples consumed and number of images collected each day
Group          4th day of instar 4   Sleeping day   1st day of instar 5   2nd day of instar 5   Total
Healthy        407                   20             751                   315                   1493
Diseased       422                   189            680                   188                   1479
Image number   211                   90             486                   154                   941
Note: some diseased silkworms did not sleep on the sleeping day because of infection. Only a few healthy silkworms did not sleep in time.
Images of silkworms were taken with an iPhone 6S smartphone with a 12-megapixel camera. The environment was indoors with natural light, and images were collected at 9:00-11:00 and 15:00-17:00 each day. A tripod was used to fix the collection device so that the lens pointed downward and the silkworms maintained a natural posture (Fig. 3). To ensure that the body shape of the silkworms did not change after image resizing, the aspect ratio of the capture screen was set to 1:1, and the focal length was fixed so that the frame contained 8 to 10 silkworms. Mulberry leaves were used as the image background and were replaced frequently and randomly to keep the background from becoming regular. This image collection method has been used in our previous research (Shi et al. 2020, Shi et al. 2022).
Fig. 3 Image acquisition device
In this experiment, diseased and healthy silkworms were placed manually on one background to simulate the early stage of disease spread, when timely warning and prevention are needed. Hence, most of the collected images contain both healthy and diseased silkworms. In addition, some images contain only one diseased silkworm, to imitate the manic crawling of silkworms infected with the NPV. A total of 941 original images were collected, containing 2972 silkworms: 1479 diseased and 1493 healthy. Some of the original images are shown in Fig. 4.
Fig. 4 Examples of original images; most of the collected images contain healthy and diseased silkworms simultaneously.
2.2 Dataset construction
2.2.1 Image resizing and labeling
The original captured images were 3224 × 3224 pixels, which is larger than the input size of common object detection algorithms. Bilinear interpolation was used to resize the original images to 416 × 416 pixels, and no other image preprocessing was performed. All images were randomly divided into a training set and a test set in a ratio of 8:2. The annotation tool LabelImg was used to label the images according to the expert's recognition results. A rectangular box (ground-truth box) represents the position of each silkworm in the image, and the labels NP (nuclear polyhedrosis virus) and H (healthy) represent diseased and healthy silkworms respectively, as shown in Fig. 5. A total of 1493 H objects and 1479 NP objects were labeled.
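The random 8:2 split and the effect of resizing on box coordinates can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the file names, the fixed seed and the truncating `int()` for the training count are assumptions, and it further assumes that boxes drawn at the original resolution are scaled by the same factor as the image.

```python
import random

def split_dataset(image_names, train_ratio=0.8, seed=0):
    """Randomly split image names into a training list and a test list."""
    names = list(image_names)
    random.Random(seed).shuffle(names)          # seeded for reproducibility (assumption)
    n_train = int(len(names) * train_ratio)     # truncation is an assumption
    return names[:n_train], names[n_train:]

def scale_box(box, src_size=3224, dst_size=416):
    """Scale an (x_min, y_min, x_max, y_max) box when a square image is resized."""
    factor = dst_size / src_size
    return tuple(round(v * factor, 2) for v in box)

# Hypothetical file names standing in for the 941 collected images.
images = [f"img_{i:04d}.jpg" for i in range(941)]
train, test = split_dataset(images)
print(len(train), len(test))
print(scale_box((806, 1612, 1612, 3224)))
```

Because the aspect ratio is 1:1 before and after resizing, a single scale factor applies to both axes, which is why the body shape of the silkworm is preserved.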
Fig. 5 Label creation
2.2.2 Image enhancement
Data augmentation can improve the stability of neural networks. In this study, silkworms were placed on the background by hand for image collection, so the dataset inevitably contains regularities caused by personal operating habits, such as the spacing and posture of the silkworms relative to the mulberry leaves at different times, which may affect detection. Hence, several image enhancement operations were applied to the training set, including random rotation, width and height shift, and random horizontal flip. The enhancement parameters are shown in Table 2, where N is the length and width of the image.
Table 2 Parameters of image enhancement
Parameter   Rotation range (α)   Width shift range (t_x)   Height shift range (t_y)   Randomly horizontal flip   Shuffle
Value       (−90°, 90°)          (−0.2 × N, 0.2 × N)       (−0.2 × N, 0.2 × N)        True                       True
Examples of image enhancement are shown in Fig. 6. For a given training image, enhancement produces different results each time because the parameters are random numbers within a certain range, and only one enhanced image is used in place of the original image for model training. Moreover, because the enhancement operations transform the coordinates of each pixel, the ground-truth boxes of the image are transformed in the same way.
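The way ground-truth boxes follow the image under these transforms can be sketched for the shift and horizontal flip cases. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the flip is assumed to be applied with probability 0.5 (Table 2 only lists it as True), boxes are clipped to the image, and rotation is sampled but not applied to boxes here because it requires recomputing an enclosing box.

```python
import random

def sample_augmentation(n, rng):
    """Draw one random parameter set within the ranges of Table 2 (n = image side)."""
    return {
        "angle": rng.uniform(-90.0, 90.0),     # rotation angle in degrees
        "tx": rng.uniform(-0.2 * n, 0.2 * n),  # width shift in pixels
        "ty": rng.uniform(-0.2 * n, 0.2 * n),  # height shift in pixels
        "flip": rng.random() < 0.5,            # assumption: flip with probability 0.5
    }

def flip_box(box, n):
    """Mirror an (x_min, y_min, x_max, y_max) box horizontally in an n-wide image."""
    x_min, y_min, x_max, y_max = box
    return (n - x_max, y_min, n - x_min, y_max)

def shift_box(box, tx, ty, n):
    """Translate a box by (tx, ty) and clip it to the image bounds [0, n]."""
    def clamp(v):
        return max(0, min(n, v))
    x_min, y_min, x_max, y_max = box
    return (clamp(x_min + tx), clamp(y_min + ty),
            clamp(x_max + tx), clamp(y_max + ty))

params = sample_augmentation(416, random.Random(0))
print(params)
print(flip_box((100, 50, 200, 150), 416))
print(shift_box((100, 50, 200, 150), 20, -60, 416))
```

Applying the same coordinate transform to the box corners as to the pixels keeps each label aligned with its silkworm, so no manual relabeling is needed after enhancement.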
Fig. 6 Example of image enhancement: (a) original image; (b) images after data enhancement. When training the networks, only one of the images in (b) is used in place of the original image.
2.3 CA-YOLO v3
2.3.1 Network structure
The primary architecture of CA-YOLO v3 is based on the generic YOLO v3, as shown in Fig. 7. The backbone of CA-YOLO v3 consists of an input layer, a convolutional layer and four stages of CA-blocks. The input size is 416 × 416 × 3. In the convolutional layer, 96 filters of size 4 × 4 are applied with a stride of 4, so the feature map has dimensions 104 × 104 × 96 after the convolution. Four stages of CA-blocks then extract the key features of the silkworm image; the structure of the CA-block is detailed in section 2.3.2. The basic numbers of convolution kernels in the four stages are 96, 192, 384 and 768 respectively, and the stage compute ratio is set to 3, 3, 9 and 3. The ratio is the number of times the residual operation is repeated inside a CA-block, not the number of CA-blocks. This distribution of kernel numbers and stage ratios originated in Swin-Transformer and was adopted in the ConvNeXt architecture. Furthermore, compared with Darknet-53, the backbone of the generic YOLO v3, which consists of five stages of residual blocks, this design reduces the computational complexity and the number of training parameters and so improves the efficiency of the network. Another reason for adopting this stage compute ratio is that we argue the 26 × 26 feature maps play a key role in linking the preceding and following stages, so the ratio of that stage is set to 9, three times that of the other CA-blocks. The CA-block in the first stage does not change the width and height of the feature maps, while the CA-blocks in the other stages halve the width and height, so that feature maps of 52 × 52, 26 × 26 and 13 × 13 are obtained for feature fusion in the FPN. The number of channels doubles from each stage to the next (96, 192, 384, 768).
Fig. 7 CA-YOLO v3 structure. The size of the input image is 416 × 416 × 3. "Conv2D, 4×4, 96, stride=4" denotes a convolution with 96 filters of size 4 × 4 and a stride of 4. "batch_size, 104×104×96" gives the number of images in a batch and the output dimensions of the feature map after the CA-block operation, respectively. "CA-blocks, 96, 3" means that the basic number of filters in the CA-block is 96 and the residual operation in the CA-block is repeated 3 times. "Concat layer" is a feature concatenation layer. "Upsampling" is an upsampling layer. "Conv2D, 3×3, 1×1" denotes two convolutional layers with filter sizes of 3 × 3 and 1 × 1 respectively.
We replaced the original five convolutional layers in the FPN with a CA-block, to keep the network structure coherent and maintain detection efficiency. However, the CA-block in the FPN differs somewhat from the one in the backbone, as described in section 2.3.2. The YOLO head contains two convolutional layers with 3 × 3 and 1 × 1 filters respectively. The number of filters in the 3 × 3 layer is four times the basic number of filters of the CA-block in the FPN, and the number of filters in the last convolutional layer depends on the number of detection categories:
$$K = (N + 5) \times N_C \quad (1)$$
where K is the number of filters in the last convolution, N is the number of object categories to detect, and N_C is the number of anchor boxes.
This study has two categories (healthy and diseased silkworms), and each YOLO head uses 3 anchor boxes, so formula (1) gives K = 21.
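Formula (1) is easy to verify numerically. In the sketch below the function name is illustrative; the constant 5 corresponds to the four box coordinates plus one confidence value per anchor, which is the standard YOLO v3 output layout.

```python
def yolo_head_filters(num_classes, num_anchors=3):
    """Number of filters K in the last 1 x 1 convolution of a YOLO head, Eq. (1)."""
    box_and_confidence = 5  # x, y, w, h and objectness confidence
    return (num_classes + box_and_confidence) * num_anchors

print(yolo_head_filters(num_classes=2))  # healthy and diseased silkworms
```

With two classes and three anchors the result is 21, matching the value stated in the text.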
2.3.2 CA-block
We propose two different CA-blocks, one for the backbone and one for the FPN. The structure of the CA-block in the backbone is shown in Fig. 8 (a). A padding operation of 5 pixels is used to prevent the loss of edge features and to control the dimensions of the output feature maps. A depth-wise separable convolution with a 7 × 7 filter and a normal 2D convolutional layer with a 1 × 1 filter are used for feature extraction and channel adjustment, followed by a visual attention block. Then a residual structure containing a 7 × 7 depth-wise separable convolution and 3 normal convolutional layers with 1 × 1 filters is used to increase the depth and extraction capability of the block. This residual structure is repeated according to the stage compute ratio of the backbone.
Fig. 8 CA-block designs in CA-YOLO v3: (a) backbone; (b) FPN.
Fig. 8 (b) shows the structure of the CA-block in the FPN. A 1 × 1 convolutional layer is applied first for channel adjustment, because the input feature map is produced by a concatenation operation. A depth-wise separable convolutional layer with a 3 × 3 filter and 3 convolutional layers with 1 × 1 filters are then used for feature extraction and channel adjustment, followed by an attention block.
We use a large 7 × 7 convolutional kernel in the backbone of CA-YOLO v3 to enlarge the receptive field of the network as much as possible. The benefit of large kernels has been demonstrated in recent research (Szegedy et al., 2017; Ding et al., 2022). In the FPN, however, a 3 × 3 depth-wise kernel is used to extract more precise features. We adopt separable convolution to reduce the number of training parameters and maintain efficiency. CA-YOLO v3 has only 46.3 million training parameters, 15 million fewer than YOLO v3. In addition, an image attention block, detailed in section 2.3.3, is added to the CA-block to enhance the extraction of key features. The residual structure of the CA-block in the backbone contains no attention block, to avoid a surge in parameters.
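The parameter saving from separable convolution can be illustrated with a weight count for a single layer. This sketch counts convolution weights only (no bias or batch normalization) and assumes a 7 × 7 depth-wise convolution followed by a 1 × 1 point-wise convolution; the function names and the choice of 96 channels are illustrative, not the authors' accounting.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard k x k convolution mapping c_in to c_out channels."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a k x k depth-wise convolution plus a 1 x 1 point-wise convolution."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution mixes the channels
    return depthwise + pointwise

channels = 96  # basic filter number of the first CA-block stage
print(standard_conv_params(7, channels, channels))
print(depthwise_separable_params(7, channels, channels))
```

For 96 channels the separable form needs 13,920 weights against 451,584 for a standard 7 × 7 convolution, which is why a large kernel remains affordable in the backbone.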
The CA-block also uses the idea of the inverted bottleneck, which changes the way feature channels are adjusted. This design appears in Swin-Transformer and was also applied in the ConvNeXt network, as shown in Fig. 9. The inverted bottleneck expands the number of feature-map channels to four times the original and then restores it. This differs from the conventional method in ResNet, which compresses the channels to 1/4 of the input feature maps and then restores them. ConvNeXt has verified that the inverted bottleneck is beneficial for network performance.
Fig. 9 Block designs for a ResNet, a Swin Transformer, and a ConvNeXt.
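The difference between the two bottleneck styles comes down to the channel width in the middle of the block. The following sketch lists the channel counts only; the function names are hypothetical and the factor of 4 is the one stated in the text.

```python
def inverted_bottleneck_channels(channels, expansion=4):
    """Channel widths through an inverted bottleneck: expand, then restore."""
    return [channels, channels * expansion, channels]

def resnet_bottleneck_channels(channels, reduction=4):
    """Channel widths through a ResNet-style bottleneck: compress, then restore."""
    return [channels, channels // reduction, channels]

print(inverted_bottleneck_channels(96))
print(resnet_bottleneck_channels(96))
```

For a 96-channel input, the inverted form works on 384 channels in the middle of the block, whereas the ResNet form works on 24.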
Batch normalization and LeakyReLU activation are applied in each convolutional layer of the CA-blocks. The formula of LeakyReLU is as follows:
$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & \text{if } x \ge 0 \\ 0.1 \times x, & \text{otherwise} \end{cases} \quad (2)$$
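As a quick check of Eq. (2), a scalar version of the activation can be written directly. The function name and the default slope argument are illustrative; the slope of 0.1 is the value given in the formula.

```python
def leaky_relu(x, negative_slope=0.1):
    """LeakyReLU of Eq. (2): identity for x >= 0, scaled by the slope otherwise."""
    return x if x >= 0 else negative_slope * x

print([leaky_relu(v) for v in (-2.0, 0.0, 3.0)])
```

Unlike a plain ReLU, negative inputs keep a small non-zero output, so their gradient does not vanish during training.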
2.3.3 Attention block
Image attention mechanisms mainly include spatial attention models and channel attention models. A spatial attention model focuses on local spatial information in the image, while a channel attention model focuses more on information in the feature channels (Qi et al, 2022). In this article, we introduce the ECA (efficient channel attention) module proposed by Wang et al. (2019) into the CA-block, with the aim of making the algorithm focus more on the differences between healthy and diseased silkworms and on the location of each silkworm. Specifically, ECA is a channel attention mechanism that replaces the fully connected operation used in SENet (Hu et al. 2019) with a convolution, to enhance local cross-channel interaction without dimensionality reduction. The ECA module first compresses the feature maps along the spatial dimensions and then applies a one-dimensional convolution and an activation to obtain the attention weights. The refined feature map is finally obtained by multiplying the attention weights with the original feature map.
The attention weights are computed by the ECA module as follows. Given feature maps $X = [x_1, x_2, \ldots, x_C]$, $X \in \mathbb{R}^{W \times H \times C}$, where W, H and C represent the length, width and number of feature channels respectively, global average pooling is used to obtain a one-dimensional feature vector:
$$Y = \mathrm{GAP}(X) \quad (3)$$
where GAP refers to global average pooling.
The output after GAP is therefore $Y \in \mathbb{R}^{1 \times 1 \times C}$. A one-dimensional convolution and an activation are then used to obtain the attention weights of the feature channels:
$$Y' = \delta(\mathrm{C1D}_k(Y)) \quad (4)$$
where C1D refers to the one-dimensional convolution, the subscript k is the kernel size (default value 5), and $\delta$ is the activation function:
$$\delta(x) = \frac{1}{1 + e^{-x}} \quad (5)$$
Therefore, the training parameters of the ECA module are $W_k$, defined as follows:
$$W_k = \begin{bmatrix} w^{1,1} & \cdots & w^{1,k} & 0 & 0 & \cdots & 0 \\ 0 & w^{2,2} & \cdots & w^{2,k+1} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 0 & 0 & w^{C,C-k+1} & \cdots & w^{C,C} \end{bmatrix} \quad (6)$$
According to this matrix, each use of the ECA module adds 5 × C parameters. The broadcast operation in Python is used to change the shape of $Y' \in \mathbb{R}^{1 \times 1 \times C}$ to $Y' \in \mathbb{R}^{W \times H \times C}$. Multiplying Y' by the input feature maps X then weights each channel of the original feature map by its importance.
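The three steps of Eqs. (3) to (5), followed by the channel-wise multiplication, can be sketched in pure Python on nested lists. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the feature map is indexed as [H][W][C], the kernel length is assumed to be odd, zero padding keeps the channel count unchanged, and a single kernel is shared across all channels (as a standard 1-D convolution layer does), which is a simplification of the band matrix in Eq. (6).

```python
import math

def global_average_pool(x):
    """Eq. (3): average each channel of an [H][W][C] feature map over space."""
    h, w, c = len(x), len(x[0]), len(x[0][0])
    return [sum(x[i][j][ch] for i in range(h) for j in range(w)) / (h * w)
            for ch in range(c)]

def conv1d_same(y, kernel):
    """1-D convolution over the channel axis with zero padding (odd kernel length)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(y) + [0.0] * pad
    return [sum(kernel[t] * padded[ch + t] for t in range(k))
            for ch in range(len(y))]

def sigmoid(v):
    """Eq. (5): logistic activation."""
    return 1.0 / (1.0 + math.exp(-v))

def eca(x, kernel):
    """Return the refined feature map and the per-channel attention weights."""
    pooled = global_average_pool(x)
    weights = [sigmoid(v) for v in conv1d_same(pooled, kernel)]   # Eq. (4)
    refined = [[[value * weights[ch] for ch, value in enumerate(pixel)]
                for pixel in row] for row in x]                   # broadcast multiply
    return refined, weights

# Toy 2 x 2 feature map with 3 channels and a kernel of size k = 5.
feature_map = [[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
               [[5.0, 6.0, 7.0], [7.0, 8.0, 9.0]]]
refined, weights = eca(feature_map, kernel=[0.0, 0.0, 1.0, 0.0, 0.0])
print(weights)
```

Each weight lies between 0 and 1, so the multiplication rescales every channel without changing the spatial layout of the feature map.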
2.3.4 Loss function
The loss function evaluates the CNN model during training. CA-YOLO v3 employed the mean squared error (MSE) loss function for model training. The MSE loss is given in equation (7) and comprises the coordinate error of the detection objects, the confidence error of regions that do or do not contain detection objects, and the category probability error of the detection objects. Different weights were assigned to each error term, and the width and height errors of the prediction boxes were squared, so that the position of the prediction boxes, the confidence, and the category accuracy all reach a relatively ideal state.

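$$
\begin{aligned}
J ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \left[ (w_i - \hat{w}_i)^2 + (h_i - \hat{h}_i)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
& + \lambda_{nobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} I_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned} \tag{7}
$$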
where $\lambda_{coord}$ is the weight of grid cells that contain an object, with a value of 5. $S$ is the number of grid cells along each side of the input image, with values of 13, 26, and 52 for the three detection scales. $B$ is the number of anchor boxes generated by each grid cell. $I_{ij}^{obj}$ indicates that a silkworm falls into the $j$-th bounding box of the $i$-th grid cell. $x_i$ and $y_i$ are the geometric center coordinates of the silkworm in the input image, and $\hat{x}_i$ and $\hat{y}_i$ are the coordinates predicted by the network. $w_i$ and $h_i$ are the width and height of the bounding box in the input image, and $\hat{w}_i$ and $\hat{h}_i$ are the width and height of the prediction box. $C_i$ is the ground-truth confidence of the object indicated by $I_{ij}^{obj}$, and $\hat{C}_i$ is the confidence predicted by the network. $I_{ij}^{noobj}$ indicates that no silkworm falls into the $j$-th bounding box of the $i$-th grid cell, and $I_{i}^{obj}$ indicates that the geometric center of a detection object falls into the $i$-th grid cell. $\lambda_{nobj}$ is the weight of grid cells that do not contain an object, with a value of 0.5. $p_i(c)$ is the ground-truth probability that the object belongs to class $c$ (healthy or diseased), and $\hat{p}_i(c)$ is the predicted probability. By assigning these different weights, the MSE loss reduces the contribution of grid cells that contain no object to the parameter update.
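The five terms of equation (7) can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the training code of this study: the function name `yolo_mse_loss`, the tensor layout `[x, y, w, h, C, p(c)...]`, and the toy values are hypothetical, and the class term is masked per anchor box rather than per grid cell for simplicity.

```python
import numpy as np


def yolo_mse_loss(truth, pred, obj_mask, lambda_coord=5.0, lambda_nobj=0.5):
    """Sum-of-squares loss with the five terms of equation (7).

    truth, pred : arrays of shape (S, S, B, 5 + n_classes), laid out as
                  [x, y, w, h, C, p(c_1), ..., p(c_n)] (assumed layout)
    obj_mask    : array of shape (S, S, B); 1 where box j of cell i is
                  responsible for an object, 0 elsewhere
    """
    obj = obj_mask.astype(float)
    noobj = 1.0 - obj
    sq = (truth - pred) ** 2

    center = (obj * sq[..., 0:2].sum(axis=-1)).sum()   # (x, y) error
    size = (obj * sq[..., 2:4].sum(axis=-1)).sum()     # (w, h) error
    conf_obj = (obj * sq[..., 4]).sum()                # confidence, object
    conf_noobj = (noobj * sq[..., 4]).sum()            # confidence, no object
    classes = (obj * sq[..., 5:].sum(axis=-1)).sum()   # class probabilities

    return (lambda_coord * center + lambda_coord * size
            + conf_obj + lambda_nobj * conf_noobj + classes)


# Toy example: a 2 x 2 grid, one box per cell, two classes
truth = np.zeros((2, 2, 1, 7))
pred = np.zeros((2, 2, 1, 7))
mask = np.zeros((2, 2, 1))
mask[0, 0, 0] = 1
truth[0, 0, 0] = [0.5, 0.5, 0.2, 0.2, 1.0, 1.0, 0.0]
pred[0, 0, 0] = [0.4, 0.5, 0.2, 0.2, 1.0, 1.0, 0.0]   # x is off by 0.1
pred[1, 1, 0, 4] = 1.0                                # false confidence in an empty cell
print(yolo_mse_loss(truth, pred, mask))
```

In this toy case the coordinate error is weighted by 5 and the false confidence in the empty cell by 0.5, which shows how the weights shift the emphasis toward cells that contain an object.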
2.4 Experimental environment and evaluation indicators

2.4.1 Experimental environment
All experiments were run on a Dell Precision 5820 workstation with an Intel® Core i7-9800X processor, RTX 2080 Ti GPUs with 11 GB of memory, and the CUDA 10.0 computing platform. The operating system was Windows 10 Professional (64-bit), the programming language was Python 3.7, the programming environment was Jupyter Notebook, and the deep learning framework was TensorFlow GPU 1.14. The toolkits used included NumPy, Keras, and PIL, among others.
For model training, the number of epochs was 300 and the mini-batch size was 4. The initial learning rate was 0.001, and it was multiplied by 0.8 whenever the loss value did not decrease for five consecutive epochs. The MSE loss was used as the loss function and Adam as the optimizer. The IoU (Intersection over Union) threshold was 0.5, and the confidence threshold for prediction results was 0.3.
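The learning-rate rule described above can be written as a short plain-Python sketch. The function name `reduce_on_plateau` and the example loss sequence are hypothetical, and the text does not state whether the training or the validation loss was monitored, so the sketch simply takes a list of per-epoch loss values. In Keras, the `ReduceLROnPlateau` callback with `factor=0.8` and `patience=5` expresses a comparable rule.

```python
def reduce_on_plateau(losses, initial_lr=0.001, factor=0.8, patience=5):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` once the loss has failed to
    decrease for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    wait = 0                      # epochs since the last improvement
    schedule = []
    for loss in losses:
        schedule.append(lr)       # rate in effect during this epoch
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor      # reduce and start counting again
                wait = 0
    return schedule


# Hypothetical loss curve: one improvement, then five stagnant epochs
losses = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9]
print(reduce_on_plateau(losses))
```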
The anchor box sizes used during data encoding and decoding were 10 × 13, 16 × 30, 33 × 23, 30 × 61, 62 × 45, 59 × 119, 116 × 90, 156 × 198, and 373 × 326. We used the anchor boxes proposed by YOLO v3 directly because the boxes obtained by clustering the ground-truth boxes of our dataset were not evenly distributed in proportion, and experiments showed that they reduced detection accuracy.
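As an illustration of how such anchors are typically used during encoding, the sketch below picks the anchor whose shape best overlaps a ground-truth box. The paper does not describe this step, so the matching rule (IoU of widths and heights with both boxes aligned at the same corner) and the assignment of three anchors to each of the 52, 26, and 13 grids follow the common YOLO v3 convention and are assumptions here; `best_anchor` is a hypothetical helper name.

```python
import numpy as np

# Anchor sizes (width, height) listed in the text, smallest to largest
ANCHORS = np.array([[10, 13], [16, 30], [33, 23],
                    [30, 61], [62, 45], [59, 119],
                    [116, 90], [156, 198], [373, 326]], dtype=float)

# Assumed YOLO v3 convention: small anchors go to the finest grid
GRID_FOR_ANCHOR = [52, 52, 52, 26, 26, 26, 13, 13, 13]


def best_anchor(box_w, box_h):
    """Return (anchor index, grid size, IoU) for a ground-truth box size."""
    # Overlap of two boxes that share the same top-left corner
    inter = np.minimum(ANCHORS[:, 0], box_w) * np.minimum(ANCHORS[:, 1], box_h)
    union = ANCHORS[:, 0] * ANCHORS[:, 1] + box_w * box_h - inter
    iou = inter / union
    idx = int(np.argmax(iou))
    return idx, GRID_FOR_ANCHOR[idx], float(iou[idx])


print(best_anchor(12, 14))     # a small box matches a small anchor
print(best_anchor(116, 90))    # a box equal to an anchor gives IoU 1.0
```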
2.4.2 Evaluation indicators
Of the training-set images, 80% (602 images) were used for network training and 20% (150 images) for network validation. The training set includes 1221 healthy silkworms and 1216 diseased silkworms. The model weights were saved after training, and the model was then tested on the test set (189 images), which includes 272 healthy silkworms and 263 diseased silkworms. Precision, recall, F1-score, average precision (AP), and mean average precision (mAP) were used for model evaluation. They are calculated as follows:
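$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{8}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\% \tag{9}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}$$
$$\mathrm{AP} = \int_0^1 \mathrm{Precision}(\mathrm{Recall})\, dR \times 100\% \tag{11}$$
$$\mathrm{mAP} = \frac{\sum_{1}^{N} \int_0^1 \mathrm{Precision}(\mathrm{Recall})\, dR}{N} \times 100\% \tag{12}$$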
where TP (true positives) is the number of objects detected as positive samples that are actually positive, FP (false positives) is the number of objects detected as positive samples that are actually negative, and FN (false negatives) is the number of objects detected as negative samples that are actually positive. AP is the area under the precision-recall curve (P-R curve), and mAP is the average of all AP values. In this study, the object classes are healthy silkworms and diseased silkworms, so $N$ is 2.
Whether an object is a positive or a negative sample depends on the class being evaluated. For example, when the detection object is healthy silkworms, a region that contains a diseased silkworm or no silkworm is a negative sample, whereas healthy silkworms are regarded as negative samples when detecting diseased silkworms. Therefore, precision, recall, F1-score, and AP were calculated separately for healthy and diseased silkworms.
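Equations (8) to (10) and the averaging in equation (12) can be checked with a few lines of Python. The counts TP = 224, FP = 11, and FN = 48 below are not reported in this paper; they are hypothetical values chosen only because, with 272 healthy silkworms in the test set, they reproduce the healthy-class recall, precision, and F1-score of CA-YOLO v3 given in Table 3. The function names are likewise illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (8)-(10): precision and recall in percent, F1 as a fraction."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision * 100, recall * 100, f1


def mean_ap(ap_values):
    """Equation (12): mean of the per-class AP values (already in percent)."""
    return sum(ap_values) / len(ap_values)


# Hypothetical counts for the healthy class (TP + FN = 272 healthy silkworms)
precision, recall, f1 = precision_recall_f1(tp=224, fp=11, fn=48)
print(f"precision={precision:.2f}%  recall={recall:.2f}%  F1={f1:.2f}")

# AP values of CA-YOLO v3 for the healthy and diseased classes (Table 3)
print(f"mAP={mean_ap([94.81, 95.19]):.2f}%")
```

Because N is 2 in this study, the mAP is simply the midpoint of the two class-wise AP values.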
3 Experimental results

3.1 Comparison of the detection effect with YOLO v3
To verify the training and detection performance of CA-YOLO v3, the original YOLO v3 and the CA-YOLO v3 architecture were trained and tested under the same environment. The loss values of the two networks on the training set and the validation set were recorded at the same time.
Fig.10 shows the loss curves of CA-YOLO v3 and YOLO v3 during training. CA-YOLO v3 converged more slowly than YOLO v3 in the initial stage of training, because the image attention mechanism and the large kernels made the network spend more time extracting the key features. Both networks reached relative convergence after about 100 epochs, with approximately the same convergence behavior. After about 150 epochs, however, the loss value of CA-YOLO v3 was noticeably smaller than that of YOLO v3, indicating a better training effect.
(a) Training set (b) Validation set
Note: to make the convergence of the networks easier to observe, loss values greater than 100 in the initial stage of training were replaced with 100.
Fig.10 Loss curves of CA-YOLO v3 and YOLO v3 in training
After model training, CA-YOLO v3 and YOLO v3 were evaluated on the test set, and the recall, precision, F1-score, AP, and mAP for healthy and diseased silkworms were calculated, as shown in Table 3.
Table 3 The detection results of the two networks
Model        Recall (%)      Precision (%)   F1-score        AP (%)          mAP (%)
             H       NP      H       NP      H       NP      H       NP
YOLO v3      83.82   88.59   89.76   84.73   0.87    0.87    94.15   92.10   93.13
CA-YOLO v3   82.35   93.92   95.32   87.90   0.88    0.91    94.81   95.19   95.00
Note: "H" means healthy silkworms, and "NP" refers to diseased silkworms.
The test results show that the proposed CA-YOLO v3 achieved better performance on most evaluation indicators, including precision of 95.32% and 87.90%, F1-score of 0.88 and 0.91, and AP of 94.81% and 95.19% for healthy and diseased silkworms, respectively. More importantly, the mAP of CA-YOLO v3 reached 95.00%, which is 1.87 percentage points higher than the 93.13% of YOLO v3. Only one indicator of CA-YOLO v3, the recall for healthy silkworms, was slightly lower than that of YOLO v3 (82.35% versus 83.82%). However, CA-YOLO v3 achieved a recall of 93.92% for diseased silkworms, which is markedly better than the 88.59% of YOLO v3.
3.2 Ablation study of CA-block

3.2.1 Ablation study of kernel size
To evaluate the effectiveness of the CA-block under different structures, ablation studies on the kernel size, attention module, and stage ratio were performed in this section. The influence of kernel size was tested first. Kernel sizes from 3×3 to 11×11 were used for the depth-wise separable convolution to test the detection performance.
Table 4 Influence of kernel size
Methods      Recall (%)      Precision (%)   F1-score        AP (%)          mAP (%)
             H       NP      H       NP      H       NP      H       NP
3×3 dw       64.34   85.55   88.83   74.01   0.75    0.79    89.36   86.66   88.01
5×5 dw       72.06   86.69   92.45   80.85   0.81    0.84    92.56   89.75   91.16
7×7 dw       76.10   86.69   90.79   84.44   0.83    0.86    90.76   89.27   90.01
9×9 dw       77.21   90.49   94.17   84.40   0.85    0.87    93.24   92.76   93.00
11×11 dw     54.51   63.12   89.70   74.11   0.68    0.68    86.73   77.40   82.07
Note: "dw" means depth-wise separable convolution.
As shown in Table 4, the kernel size of the CA-block does affect detection performance, and the 9 × 9 dw kernel achieved the best results, with an mAP of 93.00%. As the kernel size increased from 3 × 3 to 9 × 9, the mAP increased, except that 7 × 7 dw was slightly lower than 5 × 5 dw. However, the 11 × 11 kernel obtained an mAP of only 82.07%, far below the other sizes. A possible reason is that the padding applied before the 11 × 11 dw operation was 9 pixels, which introduced more irrelevant information into feature extraction. Overall, the results show that increasing the kernel size within a certain range helps improve detection performance.
3.2.2 Ablation study of ECA module
To verify that the attention mechanism can effectively enhance key feature extraction, the ECA module was combined with different kernel sizes and tested in this section. The comparisons were conducted with kernel sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9, and the evaluation indicators were calculated for each.
Table 5 Influence of ECA module
Methods        Recall (%)      Precision (%)   F1-score        AP (%)          mAP (%)
               H       NP      H       NP      H       NP      H       NP
3×3 dw + ECA   63.38   89.73   95.38   77.12   0.80    0.83    92.81   86.63   89.72
5×5 dw + ECA   77.21   90.87   90.13   84.85   0.83    0.86    94.18   92.22   93.20
7×7 dw + ECA   82.35   93.92   95.32   87.90   0.88    0.91    94.81   95.19   95.00
9×9 dw + ECA   69.49   87.07   90.87   79.51   0.79    0.83    90.15   86.94   88.55
Table 5 shows the influence of the attention mechanism. On the one hand, adding the ECA module to the CA-block increased the mAP for the 3 × 3, 5 × 5, and 7 × 7 kernels (compare Table 4), which indicates that the attention mechanism makes the network pay more attention to the features that play an important role in detection and extracts image features more efficiently. Nevertheless, the 9 × 9 dw kernel showed some decrease in the evaluation indicators, which implies that some overfitting may have occurred. On the other hand, the CA-block with 7 × 7 dw and the ECA module achieved the best overall performance on recall, precision, F1-score, AP, and mAP compared with the other kernel sizes, so we designed the CA-blocks based on this test result.
To further verify that ECA outperforms other attention modules, two state-of-the-art attention modules, SENet and CBAM, were compared on the dataset in this section. The results are presented in Table 6.
Table 6 Influence of attention module
Methods          Recall (%)      Precision (%)   F1-score        AP (%)          mAP (%)
                 H       NP      H       NP      H       NP      H       NP
7×7 dw + SENet   72.79   91.63   95.65   80.33   0.83    0.86    93.81   91.19   92.50
7×7 dw + CBAM    58.82   76.43   91.43   76.43   0.72    0.76    87.01   79.37   83.19
7×7 dw + ECA     82.35   93.92   95.32   87.90   0.88    0.91    94.81   95.19   95.00
The CA-blocks with SENet and CBAM achieved 92.50% and 83.19% mAP, respectively, both lower than the 95.00% of the ECA module. The results imply that different attention mechanisms affect the detection results differently, and that the ECA module is an effective choice for the dataset built in this study.
3.2.3 Ablation study of stage ratio and bottleneck
The stage ratio of the backbone network of CA-YOLO v3 is 3:3:9:3, which was inspired by Swin-Transformer and differs from the original YOLO v3 network and ResNet. In addition, the CA-block adopted the inverted bottleneck from Swin-Transformer and the ConvNeXt module. To verify the impact of the stage ratio and the bottleneck on the model, the stage ratios of ResNet-50 and ResNet-101 were used for training and testing CA-YOLO v3, and two other types of bottleneck module were also tested in this section.
Table 7 Influence of stage ratio and bottleneck
Methods               Recall (%)      Precision (%)   F1-score        AP (%)          mAP (%)
                      H       NP      H       NP      H       NP      H       NP
3:4:6:3               76.47   90.11   95.85   83.16   0.85    0.86    95.91   91.76   93.84
3:4:23:3              65.44   81.75   92.27   77.62   0.78    0.80    91.79   81.02   86.40
ResNet bottleneck     76.84   87.83   96.76   84.93   0.86    0.86    95.22   91.67   93.44
Flatten bottleneck    80.88   92.78   99.10   85.16   0.89    0.89    95.97   92.15   94.06
Inverted bottleneck   82.35   93.92   95.32   87.90   0.88    0.91    94.81   95.19   95.00
As can be seen from Table 7, when the stage ratio of the CA-YOLO v3 backbone was 3:4:6:3, the same as ResNet-50, the mAP was 93.84%, and when the stage ratio was 3:4:23:3, the same as ResNet-101, the mAP was only 86.40%. These results indicate that the stage ratio used in this study, 3:3:9:3, is more effective for object detection on the dataset of this research. Moreover, with the ResNet bottleneck or a flatten bottleneck (in which the three convolution layers have the same number of filters and the number of channels does not change), the network reached 93.44% and 94.06% mAP, respectively, both lower than the inverted bottleneck structure designed in this work. It can be concluded that the design ideas absorbed into the CA-block and CA-YOLO v3 led to better performance.
3.3 Visualization of detection results

3.3.1 Visualization result analysis
To visualize the detection results, three images were selected from the test set. Their annotations made with the Labelimg toolkit and the detection results of YOLO v3 and CA-YOLO v3 are shown in Fig.11.
(a) (b) (c)
(d) (e) (f)
(g) (h) (i)
Fig.11 Comparison of detection results between YOLO v3 and CA-YOLO v3. (a), (b), and (c) were labeled using the Labelimg toolkit; silkworms in light blue boxes are diseased and those in dark red boxes are healthy. (d), (e), and (f) are the detection results of the YOLO v3 model, and (g), (h), and (i) are the detection results of CA-YOLO v3. Silkworms in pink boxes are diseased and those in green boxes are healthy; the number at the upper left of each box is the confidence value of its category.
YOLO v3 failed to detect one silkworm at the top of the image in (d), and CA-YOLO v3 misjudged the category of one silkworm on the left in (g). In (e) and (f), YOLO v3 misjudged the category of the leftmost silkworm and the uppermost silkworm, respectively, whereas the corresponding CA-YOLO v3 results in (h) and (i) are completely correct.
Although a few silkworms were missed or misclassified, as shown in Fig.11, the visualization results indicate that deep learning can both accurately identify the category of each silkworm and accurately locate it in images that mix healthy and diseased silkworms.
3.3.2 GUI for the trained CA-YOLO v3 model
To apply the research results to real rearing environments, a GUI (Graphical User Interface) for silkworm disease detection was developed with PyQt 5 after model training and testing, and the trained CA-YOLO v3 model was embedded into it. The detection GUI is intended to work together with silkworm rearing machines in real conditions, which requires it to support disease detection in real-time video, local images, and local video.
Fig.12 Detection interface of local images
Qt Designer, provided with PyQt 5, was used to design the software interface, as shown in Fig.12. The interface contains areas for working mode selection, warning mode, image visualization, text visualization, and control. QPushButton, QRadioButton, QPainter, and other controls were used to build the interface, and each control is associated with a corresponding function. The software is easy to operate: only one working mode needs to be selected before pressing the start detection button. When the warning function is enabled and a diseased silkworm is detected, the small window turns red; otherwise it remains green.
Fig.13 Workflow design of detection GUI
Fig.13 shows the workflow of the GUI. The working modes include detection on real-time video acquired by a USB camera and on local images and videos saved on the computer. To run a detection, the user first selects a working mode according to the application scenario. The trained CA-YOLO v3 model is then called to detect the image or the video frames through a slot function that connects the start button to the detection program. The visualization results are displayed on the software interface at the same time. Finally, the text results for each image or video frame are saved automatically for traceability analysis. Detection can also be paused or terminated while it is running.
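The per-frame logic of this workflow (detect, update the warning indicator, save a text record) can be sketched without any GUI code. The sketch below is a simplified illustration, not the software developed in this study: `run_detection`, the `detect` callable standing in for the trained model, the category strings, and the in-memory log are all hypothetical, and the PyQt 5 signal and slot wiring is omitted.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Detection = Tuple[str, float]   # (category, confidence)


def run_detection(frames: Iterable[object],
                  detect: Callable[[object], List[Detection]],
                  warning_enabled: bool = True,
                  log: Optional[List[str]] = None) -> List[str]:
    """Process frames one by one and return the indicator color per frame."""
    colors = []
    for index, frame in enumerate(frames):
        detections = detect(frame)                  # stand-in for the model
        diseased = any(cat == "diseased" for cat, _ in detections)
        # Red only when the warning mode is on and a diseased silkworm is found
        colors.append("red" if warning_enabled and diseased else "green")
        if log is not None:                         # text record for traceability
            log.append(f"frame {index}: {detections}")
    return colors


def fake_detect(frame):
    """Hypothetical detector: frame "b" contains a diseased silkworm."""
    if frame == "b":
        return [("diseased", 0.9)]
    return [("healthy", 0.8)]


records: List[str] = []
print(run_detection(["a", "b"], fake_detect, log=records))
print(records)
```

The same function serves a single image (a one-element list), a local video, or a camera stream, since each working mode only changes where the frames come from.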
4. Conclusion and future work

In this research, NPV, the most prevalent silkworm disease, with a high frequency of occurrence and strong infectiousness, was detected in the adult stages of silkworm using deep learning and object detection. Healthy and diseased silkworms were obtained at the same time through rearing and manual infection in actual environments. A dataset in which most images contain several healthy and diseased silkworms was built by placing silkworms on a background and photographing them with a high-resolution smartphone camera. State-of-the-art deep learning architectures, including YOLO v3, ConvNeXt, and Swin-Transformer, were combined and improved to design the CA-YOLO v3 network for effective and accurate detection of silkworm disease in complex conditions. A GUI for silkworm disease detection was developed with PyQt 5. Based on the results, the following conclusions can be drawn.
1) This research proposed a detection method for silkworm diseases based on object detection, unlike existing studies based on image classification, which is inconsistent with the actual high-density rearing of silkworms. The proposed method can not only recognize whether a silkworm is diseased or healthy, but also locate each silkworm in the image, which can provide technical support for the development of disease prevention equipment.
2) The designed CA-YOLO v3 contains two major improvements: a) The CA-block was designed based on two state-of-the-art networks, ConvNeXt and Swin-Transformer, and the ECA attention mechanism was introduced into the CA-block for extraction of key features. b) Four stages of stacked CA-blocks were used to replace DarkNet-53, which significantly reduced the number of training parameters and improved network performance. The experiments on the dataset verified the superiority of CA-YOLO v3 over the original YOLO v3, with an mAP 1.87 percentage points higher.
3) We developed a detection GUI for silkworm disease and embedded the trained CA-YOLO v3 model into it, enabling detection on real-time video, local images, and local video. The GUI can save detection results for traceability analysis and show a warning when a diseased silkworm is detected. The GUI could also be used in joint experiments with silkworm rearing machines in real conditions.
Overall, this study indicates that the proposed CA-YOLO v3 algorithm is better suited to detecting silkworm disease in images that mix healthy and diseased silkworms at the adult stage. Because all training images were collected under real conditions, CA-YOLO v3 can be used in real-time applications. This research can also provide a theoretical reference for early warning of silkworm diseases and technical support for the development of precision control equipment.
However, there are still some deficiencies in our study. a) The silkworm density in the dataset built in this experiment is lower than in real environments, and only some growth stages of a single silkworm variety were imaged and detected, so the diversity and richness of the dataset need to be enhanced. b) Object detection is a supervised learning method, which means that all diseased silkworms were diagnosed by humans before image annotation was carried out. That a silkworm can be diagnosed by eye indicates that it shows obvious morphological characteristics, which also means that the disease is in the middle or late stage of infection. The recognition results are therefore less helpful for cutting off pathogen transmission and reducing cocoon losses.
In future work, we plan to collect diseased silkworms at different growth stages and from more varieties to construct a larger dataset for detection research. More importantly, we will attach importance to early detection of silkworm diseases, focusing on behavioral differences at the early stage of pathogen infection, so as to realize early warning and accurate prevention of silkworm diseases.
Author contributions
Hongkang Shi: Image acquisition, draft writing. Dingyi Tian: Image acquisition, model training. Shiping Zhu: Image acquisition, paper review, and test guarantee. Linbo Li: Sample collection, software development. Jianmei Wu: Silkworm rearing, writing, and testing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This study was supported by the National Modern Agricultural Industrial Technology System Special Project (No. CARS-18).
References
Jiang, L., Zhao, P., Xia, Q., 2014. Research Progress and Prospect of Silkworm Molecular Breeding for Disease Resistance. Science of Sericulture, 40(04), 571-575.
Xu, A., Qian, H., Sun, P., Liu, M., Lin, C., Li, G., Li, L., Zhang, Y., Zhao, G., 2019. Breeding of a New Silkworm Variety Huakang 3 with Resistance to Bombyx mori Nucleopolyhedrosis. Science of Sericulture, 45(02), 201-211.
Kamilaris, A., Prenafeta-Boldú, F.X., 2018. Deep learning in agriculture: A survey. Comput. Electron. Agric. 147, 70-90.
Karlekar, A., Seal, A., 2020. SoyNet: Soybean leaf diseases classification. Comput. Electron. Agric. 172, 105342.
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A., 2016. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:1602.07261.
Wang, J., Li, Y., Feng, H., Ren, L., Du, X., Wu, J., 2020. Common pests image recognition based on deep convolutional neural network. Comput. Electron. Agric. 179, 105384.
Altuntaş, Y., Cömert, Z., Kocamaz, A. F., 2019. Identification of haploid and diploid maize seeds using convolutional neural networks and a transfer learning approach. Comput. Electron. Agric. 163, 104874.
Shi, H., Tian, Y., Yang, C., Chen, Y., Su, S., Zhang, Z., Zhang, J., Jiang, M., 2020. Research on Intelligent Recognition of Silkworm Larvae Races Based on Convolutional Neural Networks. Journal of Southwest University (Natural Science Edition), 42(12), 34-45.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861.
Xia, D., Yu, Z., Cheng, A., 2019. Development and Application of Silkworm Disease Recognition System Based on Mobile App. Beijing: 10th International Conference on Image and Graphics, 471-482.
Borji, A., Itti, L., 2013. State-of-the-Art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 185-207.
Huang, G., Liu, Z., Maaten, L. V. D., Weinberger, K. Q., 2016. Densely Connected Convolutional Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2261-2269.
Ding, J., Cheng, A., 2019. An Improved Similarity Algorithm Based on Deep Hash and Code Bit Independence. 2019 4th International Conference on Insulating Materials, Material Application and Electrical Engineering.
Krizhevsky, A., Sutskever, I., Hinton, G. E., 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90.
Shi, H., Huang, L., Hu, C., Hu, G., Zhang, J., 2022. Research on recognition of silkworm diseases based on Convolutional Neural Network. Journal of Chinese Agricultural Mechanization, 43(01), 150-157.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., Berg, A. C., 2016. SSD: Single Shot MultiBox Detector. arXiv preprint arXiv:1512.02325.
Ren, S., He, K., Girshick, R., Sun, J., 2017. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, 39(6), 1137-1149.
Redmon, J., Farhadi, A., 2018. YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767.
Chen, C., Zhu, W., Norton, T., 2021. Behaviour recognition of pigs and cattle: Journey from computer vision to deep learning. Comput. Electron. Agric. 187, 106255.
Riekert, M., Opderbeck, S., Wild, A., Gallmann, E., 2021. Model selection for 24/7 pig position and posture detection by 2D camera imaging and deep learning. Comput. Electron. Agric. 187, 106213.
Zoph, B., Le, Q. V., 2016. Neural Architecture Search with Reinforcement Learning. arXiv preprint arXiv:1611.01578.
Wang, Z., Liu, T., 2022. Two-stage method based on triplet margin loss for pig face recognition. Comput. Electron. Agric. 194, 106737.
Liu, C., Gao, T., Ma, Z., Song, Z., Li, F., Yan, Y., 2022. Target Detection Model of Corn Weeds in Field Environment Based on MSRCR Algorithm and YOLO v4-tiny. Transactions of the Chinese Society for Agricultural Machinery, 53(02), 246-255+335.
Wang, H., Li, Y., Dang, L. M., Moon, H., 2022. An efficient attention module for instance segmentation network in pest monitoring. Comput. Electron. Agric. 195, 106853.
Liu, J., Wang, X., 2020. Tomato Diseases and Pests Detection Based on Improved YOLO v3 Convolutional Neural Network. Frontiers in Plant Science, 11, 00898.
Wang, J., Wang, N., Li, L., Ren, Z., 2020. Real-time behavior detection and judgment of egg breeders based on YOLO v3. Neural Computing and Applications, 32, 5471-5481.
Tian, Y., Yang, G., Wang, Z., Wang, H., Li, E., Liang, Z., 2019. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 157, 417-426.
Bai, Y., Guo, Y., Zhang, Q., Cao, B., Zhang, B., 2022. Multi-network fusion algorithm with transfer learning for green cucumber segmentation and recognition under complex natural environment. Comput. Electron. Agric. 194, 106789.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, 234-241.
Zhang, H., Li, Y., Zhou, L., Wang, R., Li, S., Wang, H., Multi-objective Skeleton Extraction M of Beef Cattle Based on Improved YOLO v3. Transactions of the Chinese Society for Agricultural Machinery. https://kns.cnki.net/kcms/detail/11.1964.S.20220125.1603.005.html
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E., 2020. Squeeze-and-Excitation Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 2011-2023.
Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q., 2020. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11531-11539.
Woo, S., Park, J., Lee, J. Y., Kweon, I. S., 2018. CBAM: Convolutional Block Attention Module. Computer Vision - ECCV, 3-9.
Qi, J., Liu, X., Liu, K., Xu, F., Guo, H., Tian, X., Li, M., Bao, Z., Li, Y., 2022. An improved YOLOv5 model based on visual attention mechanism: Application to recognition of tomato virus disease. Comput. Electron. Agric. 194, 106780.
Lu, S., Chen, W., Zhang, X., Karkee, M., 2022. Canopy-attention-YOLOv4-based immature/mature apple fruit detection on dense-foliage tree architectures for early crop load estimation. Comput. Electron. Agric. 193, 106696.
Li, X., Pan, J., Xie, F., Zeng, J., Li, Q., Huang, X., Liu, D., Wang, X., 2021. Fast and accurate green pepper detection in complex backgrounds via an improved Yolov4-tiny model. Comput. Electron. Agric. 191, 106503.
Zhang, Z., Zhang, Z., Li, J., Wang, H., Li, Y., Li, D., 2021. Potato detection in complex environment based on improved YOLO v4 model. Transactions of the Chinese Society of Agricultural Engineering, 37(22), 170-178.
Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q. V., Adam, H., 2019. Searching for MobileNetV3. arXiv preprint arXiv:1905.02244.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin, I., 2017. Attention Is All You Need. arXiv preprint arXiv:1706.03762.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint arXiv:2103.14030.
Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., Yang, Z., Zhang, Y., Tao, D., 2021. A Survey on Vision Transformer. arXiv preprint arXiv:2012.12556.
Liu, Z., Mao, H., Wu, C., Feichtenhofer, C., Darrell, T., Xie, S., 2022. A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545.
Shi, H., Jiang, M., Li, L., Wu, J., Ye, J., Ma, Y., Hu, G., Zhang, J., 2018. Design of Young Silkworm Feeding Machine with Spiral Lifiting System and Its Production Test. Sericulture Science, 44(06), 891-897.
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. A., 2017. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence.
Ding, X., Zhang, X., Zhou, Y., Han, J., Ding, G., Sun, J., 2022. Scaling up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs. arXiv preprint arXiv:2203.06717.
Electronic copy available at: https://ssrn.com/abstract=4089053
