
2023 6th International Conference on Information Systems and Computer Networks (ISCON)

GLA University, Mathura, India. Mar 3-4, 2023

Self-Attention Vision Transformer with Transfer Learning for Efficient Crops and Weeds Classification

979-8-3503-4696-1/23/$31.00 ©2023 IEEE | DOI: 10.1109/ISCON57294.2023.10112049

1st Shubham Sharma, Department of Computer Science and Engineering, National Institute of Technology, Raipur, India 492010; ssharma.phd2020.cse@nitrr.ac.in; ORCID: 0000-0003-0096-9244

2nd Manu Vardhan, Department of Computer Science and Engineering, National Institute of Technology, Raipur, India 492010; mvardhan.cs@nitrr.ac.in; ORCID: 0000-0003-2944-7896

Abstract—Classifying and mapping weeds are crucial for ensuring that plantations remain free of unwanted plants, particularly in cases where plant spacing is uneven, making weed detection more challenging than it is for crops. To address this issue, Deep Neural Networks (DNNs) are often used in agriculture to identify the unique features of weeds and classify them from low-resolution imaging, which helps to control weed populations. Convolutional Neural Networks (CNNs) in Deep Learning (DL) models have advanced, achieving remarkable performance in plant-weed classification. Despite CNNs' success in computer vision, they still struggle with the need for massive labeled datasets, intra-class classification, and high processing costs in classification tasks. Meanwhile, transformer-based models have shown strong global modeling capabilities but are yet to be thoroughly investigated in agriculture. In this study, we explore Self-Attention Vision Transformers (ViT) using a Transfer Learning (TL) approach to classify weeds in benchmark datasets. The proposed model, ViT-S16, was pre-trained on ImageNet-21k, fine-tuned on the ImageNet-1k dataset, and evaluated on the public benchmark agriculture dataset DeepWeeds, which contains plants grown with various weed species under different conditions. Our experimental results show that ViT-S16 outperforms the state-of-the-art CNN models ResNeXt-50 and EfficientNet-B7, achieving a maximum accuracy of 98.54%. These results show how helpful ViT can be in different image classification tasks. In the future, researchers can look into the strengths and weaknesses of various ViT models.

Keywords—computer vision, deep learning, convolutional neural network, vision transformer, self-attention, agriculture

I. INTRODUCTION

In India, vegetables are considered among the most nutrient-dense foods due to their sufficient antioxidants, vitamins, and minerals. The yield of vegetables decreases by 55%-95% in the case of weed-vegetable competition [1]. Early diagnosis of weed species is essential for healthy plants; the life cycle of weed species is mainly categorized into three categories, i.e., biennial weeds, perennial weeds, and annual weeds, as shown in Table I. In places with little to no weed infestation, chemical herbicides are overused, negatively affecting the environment, including soil and groundwater pollution. For weed management and control to be successful, weed seeds must be accurately and effectively categorized. However, categorization relies heavily on manual examination based on destructive sampling, which is expensive and has poor throughput [2]. We investigated a non-destructive, intelligent image recognition technique to address this issue. Most research on weed detection has relied on various deep-learning methods.

Due to its advancement, the CNN has become popular in recent years for image processing and classification. The transformer is a new category of neural network architecture; with the help of the innovative transformer architecture, NLP tasks may now perform sequence-to-sequence modeling in a much more advanced way [3]. The vision community is particularly interested in learning whether transformers can successfully compete with EfficientNet [4] and ResNet [4], two of the most popular CNNs in vision applications. The ViT structure was recently made available to enhance classification applications. Pre-designed architectures like AlexNet [4], GoogLeNet [4], and ResNet are prevalent in image-processing applications. These models frequently include many trainable parameters, making a sizable dataset necessary to find the best values for the parameters. These models are consequently trained using a large-scale dataset, such as ImageNet-21k [5], and the weights obtained are then applied in a transfer learning approach. ViT can produce more remarkable outcomes than conventional topologies, but only when large amounts of data are available. Therefore, considerable labeling work is needed to train these models in a supervised manner, which is not always feasible or sustainable. Realizing self-attention mechanisms for ViT may be a means to enhance their effectiveness while also making them more straightforward to apply across a broader range of problems.

Recent research has shown that attention-based networks, such as ViT without convolutional layers, can beat CNNs on various Computer Vision (CV) tasks [6]. Understanding the distinctions between ViTs and CNNs is crucial because ViT and its variants outperform CNNs in many vision applications. One significant difference between a transformer-based NN and a CNN-based model is the receptive field size; the former is better at capturing long-distance pixel relations due to the self-attention mechanism.

The main contributions of this study are outlined as follows:

1. We present recent research based on ML, DL, and ViT on the segmentation and classification of weeds in plants.

2. ViT-S16, a type of self-attention ViT, is proposed using a transfer learning approach to decrease training time on large benchmark datasets.

3. The performance of the proposed method is evaluated on a crop-weed dataset consisting of 17,509 images capturing eight different weed species native to Australia.

4. The performance of the proposed work is evaluated against the state-of-the-art CNN models ResNeXt-50 [7] and EfficientNet-B7 [8].

TABLE I. THE LIFE CYCLE OF DIFFERENT WEED SPECIES

Perennial Weeds
  Simple Perennials: Sonchus Arvensis, Bermuda Grass, Wild Onion
  Bulbous Perennials: Allium Sp., Hedge Bindweed, Yarrow
  Corm Perennials: Timothy, Japanese Knotweed, Leafy Spurge
Biennial Weeds
  First Year: Daucus Carota, Nuicauls Biennales
  Second Year: Alternanthera Echinata, Dacus Carota
Annual Weeds
  Monsoon (Kharif): Commelina Benghalensis, Borehavia Erecta
  Winter (Ravi): Lambus Quarter (Chenopodium Album), Lambus Quarter
  Summer: Others

The remainder of this paper is organized as follows. Section 2 discusses the latest research on ML, DL, and ViT applied to image classification. Section 3 presents the main contribution of this paper, in which we summarize the Self-Attention ViT applied to the public benchmark dataset, where ViT-S16 outperformed the state-of-the-art CNN models ResNeXt-50 and EfficientNet-B7. Section 4 presents the experimental results of the proposed research, and in the final section, we discuss several research directions, give our conclusion, and provide insights on future prospects.

II. RELATED SURVEYS

Our work is related to three primary research directions: classical ML, CNN, and ViT. Here we focus on some representative methods closely related to our work.

A. Machine Learning

Image classification with machine learning exploits the great potential of algorithms to learn and detect hidden features from labeled and unlabelled datasets through different learning mechanisms. A feature-learning-based approach automatically identifies and distinguishes weed classes for rice plants [9]. The experimental results demonstrate that various image processing, feature extraction, and machine learning techniques can achieve high classification accuracy. Another set of experiments compared five shape descriptors and six supervised pattern recognition techniques [10]. The results show that the Optimum Path Forest (OPF) classifier using the Beam Angle Statistics descriptor with OCS distance (BAS-100-OCS) gives the best classification accuracy. Fourteen features for crop-weed classification were evaluated to determine the optimal combination that provides the highest classification accuracy through the Support Vector Machine (SVM) technique [11]. The findings show that SVM, when applied to a set of 224 test photos, achieves above 97% accuracy. Another study recommends using a Random Forest (RF) classifier to classify crop weeds in real time for variable-rate pesticide spraying [12]. Experimental findings demonstrate the efficacy of the suggested vision-based pesticide spraying framework in real time, using the pulse width modulation technique to manage the agrochemical flow rate.

B. Deep Learning

Using convolutional operations, a CNN can analyze images and find patterns in data. Because of automatic feature extraction, CNN is ideally suited and accurate for CV applications like object segmentation or image classification. The EfficientNet-B7 and InceptionV4 architectures are used as decision machines for weed growth estimation [13]; EfficientNet-B7 and InceptionV4 achieved 97% and 94% accuracy, respectively. GoogLeNet and AlexNet were compared to select the best method for weed classification through quantitative evaluation: GoogLeNet outperformed AlexNet, whereas AlexNet offers substantial classification accuracy with low time consumption. A data-augmentation-enhanced framework was proposed for crop-weed segmentation based on the original Random Image Cropping and Patching (RICAP) method [14], initially designed to augment data for generic image classification with the limitation of training with a large amount of data.

In some circumstances, just 1% of the fully-connected layers' weights can be utilized. Five pre-trained CNN models were used for crop-weed classification, where ResNet50 outperformed the other models [10]. A model, AgriNet-VGG19, was proposed with the AgriNet dataset for plant species classification [15]. The proposed model was pre-trained on five ImageNet architectures. Combining DL and image processing (IP) approaches allowed for a reduction of the model's complexity. A pre-trained CenterNet model was used to identify vegetables and create a bounding box around them while maintaining their classification as weed species [15]. Genetic algorithms were used to determine and assess the employed color index following the Bayesian classification error.

C. Vision Transformer

ViT-based architectures are today one of the most promising approaches in CV and are obtaining outstanding results. ViT achieves remarkable results compared to CNNs while requiring fewer computational resources for pre-training [16]. It is essentially the application of a Transformer to images, with slight modifications in the implementation to handle the different data modality. Real-time automated plant disease categorization has been proposed using a hybrid CNN and simple ViT model [17]. According to the experiment's findings, attention blocks marginally improved accuracy, while combining attention blocks with convolutional blocks had no appreciable impact on prediction speed. In CrossViT, two multi-scale vision transformers process the image at various scales while frequently cross-attending to one another [18]. The recognition accuracy for image categorization is increased using a cross-attention-based fusion method, and the outcomes demonstrate enhancements above the baseline ViT. ViT struggles to pay attention at deeper layers, and Re-attention, which involves blending each head's attention post-softmax, has been proposed as a remedy. The authors presented this novel Re-attention method (DeepViT) with a low computational and memory overhead to address the attention collapse problem of ViT [19], [20].

Through slight adjustments to current ViT models, the suggested strategy enables the training of deeper ViT models with consistent performance gains. The multi-scale vision transformer provides controllable visual encodings at many scales, while the attention mechanism of the Vision Longformer is the second component; using these two methods, the Multi-Scale Vision Longformer considerably enhances the ViT for encoding high-resolution images. The experimental outcome shows that ViT's attention mechanism outperforms other efficient attention mechanisms. Compared to convolutional architectures under the same pre-training setting, MIL-VT advocated using the vision transformer [20]. To effectively utilize the feature representations retrieved from individual patches, which vision transformers often ignore, the proposed MIL-VT framework is built on a multiple-instance learning technique with a new "MIL head." The proposed "MIL head" can be quickly plugged into an existing Vision Transformer architecture, significantly enhancing the model's performance.

III. MATERIALS AND METHODS

A. Dataset Used

The DeepWeeds dataset is used for this research. It contains nine different classes with a collection of 17,509 images, covering the weed species "Parkinsonia," "Rubber vine," "Parthenium," "Siam weed," "Prickly acacia," "Lantana," "Snake weed," and "Chinee apple," plus a negative class. The following locations in Queensland were used to gather the images: "Paluma," "McKinlay," "Kelso," "Hervey Range," "Douglas," "Cluden," "Charters Towers," and "Black River" [21]. The dataset is categorized and broken down by weed species, location, and geographical distribution, and sample images from separate classes of the dataset are shown in Figure 1 and Table II.

Fig. 1. Sample images from classes of the DeepWeeds dataset, namely: (a) Chinee apple, (b) Parkinsonia, (d) Snake weed [21].

TABLE II. THE DEEPWEEDS DATASET DISTRIBUTION BY WEED SPECIES (ROW) AND LOCATION (COLUMN) [21]

Species \ Region | Black River | Cluden | Charters Towers | Douglas | Hervey Range | Kelso | McKinlay | Paluma | Total
Chinee apple     | 0    | 0    | 0    | 718  | 340 | 20   | 0    | 47   | 1125
Parkinsonia      | 0    | 1031 | 0    | 0    | 0   | 0    | 0    | 0    | 1031
Lantana          | 0    | 0    | 0    | 9    | 0   | 0    | 0    | 1055 | 1064
Prickly acacia   | 0    | 132  | 0    | 7    | 0   | 0    | 929  | 0    | 1062
Parthenium       | 0    | 0    | 246  | 0    | 0   | 776  | 0    | 0    | 1022
Siam weed        | 1072 | 0    | 0    | 0    | 1   | 0    | 0    | 2    | 1074
Rubber vine      | 0    | 1    | 188  | 815  | 0   | 5    | 0    | 0    | 1009
Negatives        | 1200 | 1234 | 605  | 2606 | 812 | 893  | 943  | 1154 | 9106
Snake weed       | 10   | 0    | 0    | 928  | 471 | 34   | 0    | 43   | 1016
Total            | 2282 | 2398 | 1039 | 5077 | 340 | 1728 | 1872 | 2301 | 17509

B. Convolutional Neural Network

A CNN works, in general, like any other feed-forward network. It consists of an input block, one or more hidden blocks that carry out calculations through activation functions (for example, ReLU), and an output block that carries out the actual classification. The convolution layers represent the difference with respect to classic feed-forward networks. A CNN takes input images, assigns weights and biases to various objects/aspects in the image, and can differentiate one from another; refer to Figure 2.

Fig. 2. The schematic diagram of the primary convolutional neural network block architecture.

CNN works by extracting features from images. When more resources are available, CNNs are frequently scaled up to attain higher accuracy after being developed at a fixed resource cost [10].

The ResNet architecture consists of several convolutional layers (Conv), batch normalizations (BN), the ReLU activation function, and one shortcut, as shown in Fig. 3(a), where F denotes the nonlinear function of the convolutional path and H denotes the shortcut path. The output of the residual building block can be formulated as in Eq. (1):

y = F(x) + H(x)    (1)
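As an illustration of Eq. (1), the following is a minimal sketch of a residual building block, assuming a PyTorch-style implementation; the layer sizes and the 1x1 projection shortcut (used when the input and output shapes differ) are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: y = F(x) + H(x), cf. Eq. (1)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # F(x): the convolutional path (Conv -> BN -> ReLU -> Conv -> BN)
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # H(x): the shortcut path (identity, or a 1x1 projection when shapes differ)
        if stride != 1 or in_ch != out_ch:
            self.h = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.h = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.h(x))  # Eq. (1), followed by ReLU
```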

The inverted bottleneck MBConv is the primary component of the EfficientNet-B0 baseline network, as depicted in Fig. 3(b). Shortcut connections are utilized between the bottlenecks, which connect far fewer channels than the expansion layers, because each MBConv block consists of a layer that first expands and then compresses the channels. The authors suggest a novel model scaling technique that scales up CNNs in a more organized way using a straightforward but incredibly potent compound coefficient [8]. More extensive networks with larger widths, depths, or resolutions tend to achieve higher accuracy.
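To make the compound-coefficient idea concrete, here is a small sketch of how depth, width, and resolution could be scaled together; the constants alpha, beta, and gamma are the grid-searched values reported in the EfficientNet paper [8] and are not values measured in this study.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15, base_resolution=224):
    """Compound scaling: depth = alpha**phi, width = beta**phi, resolution = gamma**phi,
    with alpha * beta**2 * gamma**2 chosen to be roughly 2, so FLOPs about double per unit phi."""
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution = int(round(base_resolution * gamma ** phi))
    return depth_mult, width_mult, resolution

# Example: phi = 7 gives a resolution close to the 600 pixels used by EfficientNet-B7.
print(compound_scale(phi=7))   # -> (~3.58, ~1.95, 596)
```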

Fig. 3. The architecture of (a) the residual building block and (b) the EfficientNet-B0 baseline network.

C. Self-Attention Vision Transformer

Our approach uses Self-Attention Vision Transformers (ViT) with transfer learning for efficient crops and weeds classification. ViT is a neural network architecture that utilizes self-attention layers to extract intrinsic features from images. To classify an image, ViT partitions it into non-overlapping patches and applies a linear projection layer to create a sequence input for the transformer model. The resulting sequence contains information about each patch and its relationships with other patches within the image. The ViT model then applies self-attention layers to learn patterns from the sequence and predict objects or classes that may be present in the image. We leverage transfer learning by taking the ViT model pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k and adapting it to classify weeds in benchmark datasets such as the DeepWeeds dataset. Our results show that ViT outperforms state-of-the-art CNN models, achieving a maximum accuracy of 98.54%. Classification relies on a unique classification token (CLS) appended to the patch sequence, whose token representation is used for every prediction. This token helps the model distinguish between different objects or classes and provide more accurate predictions. By utilizing ViT with transfer learning and the CLS token, our approach achieves high accuracy in crops and weeds classification and can be applied to various other image classification tasks.
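The patch-and-token pipeline described above can be sketched as follows; this is a minimal illustration assuming a 224x224 input, 16x16 patches (as in ViT-S16), and an embedding dimension of 384, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class PatchEmbedWithCLS(nn.Module):
    """Split an image into 16x16 patches, linearly project them, and prepend a CLS token."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, embed_dim=384):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196
        # A strided convolution is equivalent to a per-patch linear projection.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                         # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)          # (B, 196, 384)
        cls = self.cls_token.expand(x.shape[0], -1, -1)           # (B, 1, 384)
        tokens = torch.cat([cls, tokens], dim=1)                  # (B, 197, 384)
        return tokens + self.pos_embed                            # sequence fed to the transformer

# The transformer encoder then processes this sequence; the CLS token's final
# representation is passed to a linear head to predict one of the 9 DeepWeeds classes.
```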
Fig. 4. The illustration of the classic Vision Transformer model architecture.

In the self-attention layer, the input vector is first converted into three distinct vectors and packed into three different matrices, Q, K, and V, respectively [3]. These vectors are q (query vector), k (key vector), and v (value vector). The following formula is used to compute the attention function between q, k, and v, Eq. (2):

Attention(Q, K, V) = softmax(QK^T / √d_k) × V    (2)
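A minimal sketch of the scaled dot-product attention in Eq. (2) is shown below; the tensor shapes are illustrative, and this is a single-head version, whereas ViT-S16 uses multi-head attention.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, cf. Eq. (2).

    Q, K, V: tensors of shape (batch, seq_len, d_k).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                  # attention weights per query token
    return weights @ V                                   # weighted sum of the value vectors

# Example: 197 tokens (196 patches + CLS) with 64-dimensional heads.
Q = K = V = torch.randn(1, 197, 64)
out = scaled_dot_product_attention(Q, K, V)              # (1, 197, 64)
```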
In this study, we apply this type of self-attention ViT, ViT-S16, pre-trained on the ImageNet-21k dataset [5] and fine-tuned on the ImageNet-1k dataset [5], to classify crop weeds on a benchmark dataset called DeepWeeds [21]. We also highlight potential routes for future development.

IV. RESULT AND DISCUSSION

A. Parameter Setting

The model parameters are optimized using the stochastic gradient descent optimizer with a batch size of 64 and a maximum of 30 training epochs. Warmup steps are set to 10, the warmup learning rate is set to 0.06, and the initial learning rate is set to 0.03; refer to Table III. Data augmentation techniques like cropping, flipping, rotation, and color augmentation are utilized throughout the training process to increase the variety of the training data. This study used the RGB version of the images, since it might result in more accurate categorization outcomes than those attained by utilizing the grayscale versions. With a score of 98.54%, the self-attention-based ViT model outperformed ResNeXt-50 and EfficientNet-B7 in this trial in terms of accuracy.

TABLE III. THE PARAMETER SETTING AND TUNING FOR VIT-S16

Parameter              | Value
Batch size             | 64
Epochs                 | 30
Initial learning rate  | 0.03
Warmup learning rate   | 0.06
Warmup steps           | 10
Num channels           | 3
Input image resolution | 224*224
Pixel values range     | [-1, 1]
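The settings in Table III could be wired up roughly as follows; this is a sketch that assumes the timm library's ViT-S/16 checkpoint name, an SGD momentum of 0.9, and a simple linear warmup, none of which is specified in the paper.

```python
import torch
import timm  # assumption: the paper does not name an implementation library

# ViT-S/16 pre-trained on ImageNet-21k and fine-tuned on ImageNet-1k (the checkpoint
# name is an assumption); the classification head is replaced for the 9 DeepWeeds classes.
model = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=9)

# Table III: SGD, batch size 64, 30 epochs, initial LR 0.03, warmup LR 0.06, 10 warmup steps.
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)

def warmup_factor(step, warmup_steps=10, warmup_lr=0.06, base_lr=0.03):
    # Linear ramp from the warmup LR to the base LR over the warmup steps is one possible
    # reading of Table III; the exact schedule is not described in the paper.
    if step >= warmup_steps:
        return 1.0
    start = warmup_lr / base_lr
    return start + (1.0 - start) * step / warmup_steps

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)
```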

B. Evaluation Metrics

Classification counts of true positives (TP), false positives (FP), and false negatives (FN) are calculated by analyzing the relevance between the predicted and ground-truth labels. We subsequently determine a model's recall, which measures how well it accurately predicts all the ground-truth classes, and its precision, which measures the proportion of correct positive predictions among all positive predictions. The metrics used in the evaluation procedure were precision, recall, and F1-score. The latter is the harmonic mean of precision and recall, considering both FP and FN to evaluate the model's performance.

Precision = TP / (TP + FP)    (3)

Recall = TP / (TP + FN)    (4)

F1-Score = 2 × (Recall × Precision) / (Recall + Precision)    (5)
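Equations (3)-(5) can be computed per class and then averaged; the sketch below is a minimal per-class implementation assuming plain label arrays, not the authors' actual evaluation code, and the choice of macro-averaging for the "average" columns of Table IV is an assumption since the paper does not state which averaging was used.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive_class):
    """Per-class precision, recall, and F1 following Eqs. (3)-(5)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive_class) & (y_true == positive_class))
    fp = np.sum((y_pred == positive_class) & (y_true != positive_class))
    fn = np.sum((y_pred != positive_class) & (y_true == positive_class))
    precision = tp / (tp + fp) if (tp + fp) else 0.0            # Eq. (3)
    recall = tp / (tp + fn) if (tp + fn) else 0.0               # Eq. (4)
    f1 = (2 * recall * precision / (recall + precision)
          if (recall + precision) else 0.0)                     # Eq. (5)
    return precision, recall, f1

# Averaging these per-class values over the 9 DeepWeeds classes is one way to obtain
# the "average" columns of Table IV.
```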
increased and were more significant than all the other
C. Results Analysis

This subsection presents the experiments on the DeepWeeds dataset with three models: the self-attention vision transformer, ResNeXt, and EfficientNet. To train the Vision Transformer structure, 17,509 weed photos from a publicly accessible dataset with nine different classes were gathered. We divided the dataset randomly, using 85% for training and 15% for validation. All weed photographs are resized to 224*224 pixels. The model is a vision transformer of type S-16 pre-trained on the ImageNet-21k dataset and fine-tuned on the ImageNet-1k dataset. Finally, we compare the results of our method with the state-of-the-art convolutional models ResNeXt-50 and EfficientNet-B7. ResNeXt is a simple, highly modularized network architecture with the same topology for image classification; unlike a ResNet, it exposes an additional cardinality dimension as an essential component alongside the depth and width dimensions. The baseline network created by AutoML MNAS is EfficientNet-B0, while EfficientNet-B1 to B7 are obtained by scaling up the baseline network.
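The split-and-resize step described above could look like the following sketch; the directory layout, random seed, and torchvision transforms are assumptions, since the paper only specifies the 85/15 split, the 224x224 input size, and the [-1, 1] pixel range.

```python
import torch
from torchvision import datasets, transforms

# Resize to 224x224 and scale pixels to [-1, 1] as listed in Table III
# (the exact augmentation parameters are not given in the paper).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # maps to [-1, 1]
])

# Hypothetical folder of DeepWeeds images arranged one sub-folder per class.
dataset = datasets.ImageFolder("deepweeds/", transform=transform)

# Random 85% / 15% train-validation split.
n_train = int(0.85 * len(dataset))
train_set, val_set = torch.utils.data.random_split(
    dataset, [n_train, len(dataset) - n_train],
    generator=torch.Generator().manual_seed(0))

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64)
```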
Fig. 5. Training and validation accuracy graph plots of (a) ResNeXt-50 and (b) EfficientNet-B7.

TABLE IV. AVERAGE F1, RECALL, PRECISION, AND ACCURACY OF RESNEXT-50, EFFICIENTNET-B7, AND THE PROPOSED VIT-S16 WITH TRANSFER LEARNING

Model                       | Average F1 | Average recall | Average precision | Accuracy
ResNeXt-50                  | 0.9437     | 0.9437         | 0.9440            | 90.7
EfficientNet-B7             | 0.9722     | 0.9722         | 0.9420            | 92.4
ViT-S16 (Transfer Learning) | 0.99       | 0.99           | 0.99              | 98.54

Table IV presents the findings of the trained models' evaluation. ViT-S16 achieved the highest precision, recall, and F1-score by applying a pre-trained transformer to sequences of image patches (tokens) to classify the complete image. The resulting precision, recall, and F1-score were higher than for all the other models, outperforming the state-of-the-art CNN models ResNeXt-50 and EfficientNet-B7 with a maximum accuracy of 98.54%; refer to Figure 6. EfficientNets significantly outperform other ConvNets: in this research, EfficientNet-B7 achieves 92.4% accuracy, the best among the evaluated CNNs, and ResNeXt-50 gives 90.7%; refer to Figure 5. With fewer parameters, ViT-S16 obtained better accuracy than the convolution-based architectures ResNeXt-50 and EfficientNet-B7 of similar complexity.

Fig. 6. Training and validation accuracy (a) and training and validation loss (b) graph plots of ViT-S16.

This experiment demonstrates that it is preferable to use transformer blocks after the CNN blocks rather than earlier. ViT outperformed both models, which use only CNN blocks, in terms of accuracy. ViT works best on massive datasets; therefore, we may still utilize ResNet and EfficientNet models, which are state-of-the-art convolutional architectures, for all datasets. ViT has far fewer parameters than the work done in the reviewed articles. Transformers, moreover, have made significant strides in computer vision as well as in natural language processing jobs like language translation.

V. CONCLUSION

This study suggested a method for intelligently classifying various weed species. It offered a technique likely to have broad commercial and technological uses. Our study compares two CNN-based architectures, ResNeXt-50 and EfficientNet-B7, with a conventional version of the self-attention vision transformer network. All the experiments are performed on the publicly available dataset "DeepWeeds," which consists of 17,509 images with nine classes of different weed species. The ViT-S16 model achieves an accuracy of 98.54% and outperforms the state-of-the-art CNN models ResNeXt-50 by 7.84% and EfficientNet-B7 by 6.14%. The ViT-based model consistently outperformed the CNNs when the models were trained from scratch and achieved an improved accuracy over the results mentioned in the literature. The ViT model is pre-trained on the ImageNet-21k dataset and fine-tuned on the ImageNet-1k dataset, and it has far fewer parameters than the work done in the evaluated articles.
In an upcoming study, object segmentation networks are needed to separate weeds from wide plant images. Multi-class classification considering many diseases also needs to be solved to progress toward an autonomous weed classification system. Future work on this research can include assembling a dataset with adequate labels for this challenge and determining a network topology for object segmentation and multi-class classification.

REFERENCES

[1] A. M. Mishra, S. Harnal, et al., "A Deep Learning-Based Novel Approach for Weed Growth Estimation," Intelligent Automation & Soft Computing, vol. 31, no. 2, pp. 1157–1173, 2022, doi: 10.32604/iasc.2022.020174.
[2] T. Luo et al., "Classification of weed seeds based on visual images and deep learning," Information Processing in Agriculture, 2021, doi: 10.1016/j.inpa.2021.10.002.
[3] A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ArXiv, vol. abs/2010.11929, 2020.
[4] A. Canziani, A. Paszke, and E. Culurciello, "An Analysis of Deep Neural Network Models for Practical Applications," ArXiv, vol. abs/1605.07678, 2016.
[5] O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," Int J Comput Vis, vol. 115, pp. 211–252, 2014.
[6] S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, and A. Veit, "Understanding Robustness of Transformers for Image Classification," CoRR, vol. abs/2103.14586, 2021. [Online]. Available: https://arxiv.org/abs/2103.14586
[7] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated Residual Transformations for Deep Neural Networks," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987–5995, 2016.
[8] M. Tan and Q. V. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," ArXiv, vol. abs/1905.11946, 2019.
[9] B. Cheng and E. T. Matson, "A Feature-Based Machine Learning Agent for Automatic Rice and Weed Discrimination," in Artificial Intelligence and Soft Computing, 2015, pp. 517–527.
[10] X. Jin, J. Che, and Y. Chen, "Weed Identification Using Deep Learning and Image Processing in Vegetable Plantation," IEEE Access, vol. 9, pp. 10940–10950, 2021, doi: 10.1109/ACCESS.2021.3050296.
[11] N. Islam et al., "Early Weed Detection Using Image Processing and Machine Learning Techniques in an Australian Chilli Farm," Agriculture, vol. 11, no. 5, 2021, doi: 10.3390/agriculture11050387.
[12] M. Alam, M. S. Alam, M. Roman, M. Tufail, M. U. Khan, and M. T. Khan, "Real-Time Machine-Learning Based Crop/Weed Detection and Classification for Variable-Rate Spraying in Precision Agriculture," in 2020 7th International Conference on Electrical and Electronics Engineering (ICEEE), 2020, pp. 273–280, doi: 10.1109/ICEEE49618.2020.9102505.
[13] K. Gupta, R. Rani, and N. K. Bahia, "Plant-Seedling Classification Using Transfer Learning-Based Deep Convolutional Neural Networks," International Journal of Agricultural and Environmental Information Systems (IJAEIS), vol. 11, no. 4, pp. 25–40, 2020.
[14] D. Su, H. Kong, Y. Qiao, and S. Sukkarieh, "Data augmentation for deep learning based semantic segmentation and crop-weed classification in agricultural robotics," Comput Electron Agric, vol. 190, p. 106418, 2021, doi: 10.1016/j.compag.2021.106418.
[15] A. N. Fountsop, J. L. Ebongue Kedieng Fendji, and M. Atemkeng, "Deep Learning Models Compression for Agricultural Plants," Applied Sciences, vol. 10, no. 19, 2020, doi: 10.3390/app10196866.
[16] Y. Bazi, L. Bashmal, M. M. al Rahhal, R. al Dayil, and N. al Ajlan, "Vision Transformers for Remote Sensing Image Classification," Remote Sens (Basel), vol. 13, no. 3, 2021, doi: 10.3390/rs13030516.
[17] X. Li and S. Li, "Transformer Help CNN See Better: A Lightweight Hybrid Apple Disease Identification Model Based on Transformers," Agriculture, vol. 12, no. 6, 2022, doi: 10.3390/agriculture12060884.
[18] C.-F. Chen, Q. Fan, and R. Panda, "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification," CoRR, vol. abs/2103.14899, 2021. [Online]. Available: https://arxiv.org/abs/2103.14899
[19] D. Zhou et al., "DeepViT: Towards Deeper Vision Transformer," CoRR, vol. abs/2103.11886, 2021. [Online]. Available: https://arxiv.org/abs/2103.11886
[20] S. Yu, K. Ma, Q. Bi, C. Bian, M. Ning, N. He, Y. Li, H. Liu, and Y. Zheng, "MIL-VT: Multiple Instance Learning Enhanced Vision Transformer for Fundus Image Classification," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, 2021, pp. 45–54.
[21] A. Olsen et al., "DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning," Sci Rep, vol. 9, no. 1, p. 2058, 2019, doi: 10.1038/s41598-018-38343-3.
