
Expert Systems with Applications 159 (2020) 113588

Contents lists available at ScienceDirect

Expert Systems with Applications


journal homepage: www.elsevier.com/locate/eswa

Grape detection with convolutional neural networks


Hubert Cecotti a,⇑, Agustin Rivera a, Majid Farhadloo a, Miguel A. Pedroza b
a Department of Computer Science, College of Science and Mathematics, Fresno State, 2576 E. San Ramon Ave., MS ST90, Fresno, CA 93740, USA
b Department of Viticulture and Enology, Jordan College of Agriculture Sciences and Technology, 2360 E. Barstow Avenue, MS VR89, Fresno, CA 93740, USA

Article info

Article history:
Received 10 July 2019
Revised 18 May 2020
Accepted 18 May 2020
Available online 26 May 2020

Keywords:
Machine learning
Deep learning
Agriculture
Viticulture

Abstract

Convolutional neural networks, as a type of deep learning approach, have revolutionized the field of computer vision and pattern recognition through state of the art performance in a large number of classification tasks. Machine learning has been recently incorporated into intelligent systems related to agricultural and food production to decrease manual processing when dealing with a large number of operations. Feedforward artificial neural networks such as convolutional neural networks can be used in agriculture for the segmentation and classification of images containing objects of interest such as fruits or leaves. It is however unknown what the best architecture to use is, whether it is necessary to propose new architectures, and what the impact of the input feature space on the classification performance is. In this paper, we propose to detect two types of grapes (Albariño white grapes and Barbera red grapes) in images. We investigate 1) the impact of the input feature space: color images, grayscale images, and color histograms using convolutional neural networks; 2) the impact of parameters such as the size of the blocks, and the impact of data augmentation; 3) the performance of 11 pre-trained deep learning architectures, i.e. using a transfer learning approach for the classification. The results support the conclusion that images of grapes can be efficiently segmented using different feature spaces, where color images provide the best performance. With convolutional neural networks using transfer learning, the best performance is achieved with Resnet networks reaching an accuracy of 99% for both red and white grapes. Finally, data augmentation, image normalization, and the input feature space have a key impact on the overall performance.
© 2020 Elsevier Ltd. All rights reserved.

⇑ Corresponding author at: Department of Computer Science, College of Science and Mathematics, Fresno State, 2576 E. San Ramon Ave., MS ST90, Fresno, CA 93740, USA.
E-mail addresses: hcecotti@csufresno.edu (H. Cecotti), eleqtriq@mail.fresnostate.edu (A. Rivera), majid_farhadloo@mail.fresnostate.edu (M. Farhadloo), miguelp@mail.fresnostate.edu (M. A. Pedroza).

https://doi.org/10.1016/j.eswa.2020.113588
0957-4174/© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Intelligent systems and data intensive science in the field of agri-technology have recently progressed thanks to the advances in machine learning, and especially artificial neural networks with deep learning. Machine learning has recently emerged with high performance computing and the availability of frameworks that simplified the application of computer vision techniques to a large number of problems that go beyond the typical applications found in computer science. Agri-technology is about using advanced monitoring and data analysis to optimize crop productivity and quality. In this sense, the implementation of intelligent systems is revolutionizing agricultural production by allowing farmers to make timely decisions and a more efficient use of valuable resources such as land and water (Lampridi et al., 2019). This field involves multiple scientific disciplines related to sensors, artificial intelligence, big data, and robotics, which are being used to improve global food supplies. Agriculture has an important role in the global economy, in relation to climate and demographic issues of this century. Agri-technology and precision farming are increasingly multi-disciplinary fields that are data driven for increasing agricultural productivity and minimizing its environmental footprint by delegating as much work as possible to machines. Typical applications include crop management (Ferentinos, 2018), livestock management (Hansen et al., 2018), and water and soil management (Mohammadi et al., 2015; Coopersmith et al., 2014). Intelligent systems performing classification tasks have become a key aspect of agri-technology.

Specifically in the field of fruit crops, multiple classification techniques have been used. Support vector machines (SVMs) were used for the automatic count of coffee fruits on a coffee branch (Ramos et al., 2017). SVM with features based on Zernike moments and color information was used for grape detection (Chamelat et al., 2006). Color and texture information features were combined with an SVM for grape bunch segmentation (Liu and Whitty, 2015). Fuzzy

clustering method by artificial swarm and the high color component was used for grape detection with an accuracy of 90.33% (Luo et al., 2015). The AdaBoost framework and multiple color components were combined for grape cluster detection giving an accuracy of 96.56% (Luo et al., 2016). Gaussian naive Bayes was employed for the detection of cherry branches with full foliage (Amatya et al., 2015). Patil and Thorat (2016) used Hidden Markov models for the early detection of grape diseases in combination with the Internet of Things (IoT).

Different color spaces were considered in multiple studies. The RGB color images were transformed into the YIQ (luma and chrominance) color space in Slaughter and Harrel (1987). In Cheng et al. (2001), the threshold of color intensity resulted from image histogram analysis. Specular spherical reflection peaks in RGB images obtained at night under artificial illumination were used to count the number of grape berries (Font et al., 2014). In Fu et al. (2015), kiwifruits were recognized at nighttime by extracting R-G color channels with an accuracy of 88.3%. In Font et al. (2015), the H component of the HSV (hue, saturation, value) color space was used for the segmentation of images containing grapes. The classification of green berries versus green leaf background was proposed based on both visual texture and shape using the radial symmetry transform (Nuske et al., 2011).

Grape cultivation has a high social and economic impact in societies. Grape yield estimation is one of the critical aspects in wine production, allowing growers to manage agricultural resources while maintaining equilibrium conditions that guarantee plant health and business sustainability in the long term. Yield forecasts have been typically estimated by considering the number of vines, grape clusters per vine, and the weight of an average grape cluster (Komm and Moyer, 2015). Current techniques for estimating yield include manual sampling of grape clusters to determine characteristics such as cluster weight and berry size. The average number of bunches per vine and the average weight per bunch are computed and then extrapolated to the whole vineyard in relation to the number of vines per acre. This manual process has multiple drawbacks. First, it is not reliable because the yield distribution may not be uniform across the vineyard. Second, it is time consuming and expensive because a subsample of the bunches is removed from the vine, and is therefore wasted. Besides, sample processing and yield estimations are not always processed in a timely manner, causing delays in harvest logistics, which have a significant impact on the quality of the fruit. The industry practice for yield prediction is currently destructive, expensive, and spatially sparse; sparse samples are taken and extrapolated to determine overall yield during the growing season (Nuske et al., 2011).

Automatic techniques that are performed in the laboratory have been proposed for harvest yield estimation. For instance, a flatbed scanner was used to take images of detached Pinot Noir berries (Battany, 2008). The grayscale images were binarized, and watershed segmentation was used to segment and count the joined berries. Laboratory-based techniques are faster but they remain destructive as samples have to be extracted. In traditional sampling techniques, the location of the sampling points is often overlooked, and samples are mixed to form a homogeneous group. In this sense, changes in grape qualities due to soil and terrain variations are often lost. In addition, the results should be extrapolated to the vineyard. The grape morphology using the Freeman chain code algorithm was used to estimate the size and weight of grapes (Tardáguila et al., 2012). For estimating the yield without any destructive aspect and by considering the variability that exists across the vineyard, it is necessary to observe all the grapes. One step toward this solution is the segmentation of grape images to isolate grapes from non-grape elements. Such an estimation can provide key information about the size and shape of the cluster, which can then be further used for determining the yield.

In this paper, we propose to research intelligent systems that perceive the environment through vision to address the difficult problem of image segmentation in agri-technology for viticulture. In particular, this study focuses on two types of grapes during the ripening period leading to grape harvest by considering images that have been gathered within a month. These images contain multiple clusters of grapes that must be detected within the image. Such a detection allows estimating the surface of a grape cluster, with a focus on the analysis of the grapes themselves. For the classification, we consider convolutional neural networks (CNNs or conv nets), as a state of the art technique in computer vision for image classification.

A typical pattern recognition system is composed of two main stages: the feature extraction stage and the classification stage. The feature extraction part aims at the creation of a feature set that provides a discriminant power for the classification task, i.e. by minimizing the intra-class variance and maximizing the inter-class variance between classes. The feature extraction stage can be achieved with prior information from the problem, it can be data driven by considering statistical properties of the images (e.g. Principal Component Analysis), or both data driven and task driven, by extracting features that aim at separating objects from different classes of the problem. With deep learning, the feature extraction and the classification steps are jointly bound, blending the components related to feature extraction and classification.

By considering deep learning, we can distinguish two main approaches: first, to create a model that will be fully trained in relation to the images that are given as an input; second, to consider the transfer learning paradigm in which knowledge from a previously solved problem is applied to a different but related problem. Knowledge can be extracted from a pre-trained network that was set on a relatively similar task, i.e. a computer vision task, and this knowledge is used to enrich a new classifier. Transfer learning can be used in different settings. First, the values extracted from a selected layer (processing stage in the neural network architecture) based on existing architectures can be directly used as input features in a separate classifier to classify a different problem. Second, it is possible to initialize an architecture with the weights from an already existing learned architecture, with a complete retraining of the classifier. Third, the first approach is used, and then the neural network is trained on the whole architecture for fine tuning the complete set of parameters. In the proposed study, only the first case is considered, where the features are extracted from an existing architecture without retraining the whole architecture.

A key question that we address here is whether it is worthwhile to retrain a complete network for the segmentation of grapes, or whether it is possible to reuse an existing model through transfer learning. Furthermore, complex architectures that are used in computer vision to classify a large number of images may be too computationally expensive for the current problem. They typically have a high memory footprint, which can become a problem for the deployment of the approach in embedded systems with limited memory and computational power. Therefore, we will consider the comparison of different convolutional network architectures and transfer learning approaches using state of the art models.

The contributions of this paper to intelligent systems and agri-technology are:

• The performance analysis of convolutional neural network architectures for the segmentation of images to detect grapes through transfer learning by considering state of the art architectures. It is important to evaluate the performance of existing state of the art architectures and estimate whether it is worthwhile to retrain a whole classifier or use transfer learning.
• The study of the impact of data augmentation techniques on the accuracy of grape image segmentation. Convolutional neural networks are known to require a large number of trials and data

augmentation through the addition of modified images is a solution. It is necessary to know, for the problem of grape detection, the extent to which data augmentation impacts the performance.
• The classification in different feature spaces using convolutional neural networks. We consider and compare three feature spaces: grayscale images, color images, and color histograms, which allows inferring the extent to which the accuracy is driven by the color information or the texture information from the grapes within the images. This problem is relevant because colors reflect various lighting conditions that may occur during data acquisition.
• The creation of a labeled database of grape images available to other researchers to validate pattern recognition and machine learning algorithms.

The remaining sections of the paper are organized as follows: First, convolutional neural networks and transfer learning are presented in Section 2. The description of the datasets and the different neural network architectures are given in Section 3. The results are then detailed in Section 4 and discussed with their impacts in Section 5. Finally, the main contributions are summarized in Section 6.

2. Convolutional neural network

Convolutional Neural Networks belong to a type of artificial neural networks commonly used in deep learning (Goodfellow et al., 2017). Their architectures are typically based on the principle of the human primary visual cortex (V1) (Grill-Spector and Malach, 2004). Their current prevalence in computer vision tasks is motivated by their high performance in many complex classification tasks (Cireşan et al., 2012). The current trends suggest that all features should be learned, with applications moving away from scale-invariant feature transform (SIFT) (Lowe, 1999) based features to embrace CNNs for instance retrieval applications, with end-to-end feature learning and extraction approaches (Zheng et al., 2018). CNNs have been considered to perform classification tasks in many problems in which the inputs have an underlying structure, such as images and multi-dimensional signals. CNNs have been successfully used in brain-computer interface (Cecotti and Gräser, 2011), robotics (Redmon and Angelova, 2015), chemistry (Ma et al., 2015), and astronomy (Dieleman et al., 2015). CNNs are typically known to require a large number of representative labeled examples to capture the variability that exists across examples. These examples should grasp all the variability that is expected during the test phase, in order to provide features that are tolerant to different geometric transformations in the case of images, e.g. translation and rotation.

One of the limitations for effective training of CNNs is the need for a substantial amount of labeled data. In the case of a training dataset with a low number of labeled examples that are described in a high dimensional space, the classifier may be unable to discover the proper features that can be useful to discriminate examples, failing to capture the complexity of the underlying structure of the data. To solve this problem, it is possible to augment the training dataset with new images that are based on existing images, in such a way that the transformed images do not change their label. Multiple approaches have been proposed in the literature to increase the number of labeled images (Baird, 1990; Simard et al., 1991). These approaches include the addition of images that have been translated and/or rotated, or where local elastic deformations have been applied on the images (Simard et al., 2003). For each classification problem, it is important to have some knowledge of the problem to properly estimate the deformations that can be considered, without modifying the label of the image. In the case of images of grapes that possess a relative structure, i.e. the shape of grapes, it is not possible to consider elastic deformation techniques as they may modify the shape of the grape. As the blocks in the images of grapes are treated as texture without prior information about any shape, there is no information related to the center and orientation of the shape that could be estimated with central 2D moments.

The history of CNNs goes back to the 1980s (Fukushima, 1980). One of the most famous architectures is LeNet5 (LeCun et al., 1998), which was used to classify digits. Then, other architectures were proposed with fewer layers and better performance through the addition of elastic deformations (Simard et al., 2003). The most influential CNN in recent years was AlexNet, which got outstanding results at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Krizhevsky et al., 2012). Recent works on CNNs have taken advantage of GPU (graphics processing unit) computing and the development of frameworks that decompose the implementation in a way that makes it easier for researchers to work on CNN architectures (Jia et al., 2014), and that allows researchers who are not computer scientists to implement and deploy CNN based classifiers on various tasks. The runner-up at the ILSVRC 2014 competition is VGGNet (Simonyan and Zisserman, 2015), based on 16 weight layers, with 3 × 3 convolutions, and a uniform architecture. It is widely used for feature extraction. VGGNet consists of 138 million parameters. The Residual Neural Network (ResNet) (He et al., 2016) added skip connections and features heavy batch normalization and ReLU activation functions. Skip connections allow the signal to jump over layers. Typical ResNet implementations include double or triple layer skips. The architecture embeds 152 layers while keeping a lower complexity than VGGNet, achieving a top-5 error rate of 3.57%, beating human-level performance. The winner of the ILSVRC 2014 competition was GoogleNet from Google, reaching human level performance by achieving a top-5 error rate of 6.67%. This CNN model embeds an inception module, batch normalization, image distortions, and the RMSprop optimizer. Their architecture consisted of a 22 layer deep CNN but reduced the number of parameters from 60 million (AlexNet) to 4 million. Szegedy et al. (2016) explored ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. They benchmarked the method on the ILSVRC 2012 classification challenge validation set and showed important gains over the state of the art.

Table 1 presents different architectures, including the number of layers, the input size, the total number of parameters that have been set through learning in the network, the layer that was used for feature extraction (name and position), and the number of features that are extracted to be used with a different classifier. These CNNs have been trained with more than a million images from the ImageNet database, and are able to classify images into 1000 object categories. The name and position are mentioned following the notations used in Matlab (The MathWorks), which considers all the functions that are used throughout the propagation stage. In this notation, each layer is represented by a sequence of functions, i.e. convolution and activation functions. Before being used, the image must be resized to fit the input size of these networks, e.g. 224 × 224. Finally, the feature set corresponds to the outputs of one of the last layers of the architecture, typically before a sequence of fully connected layers.

The convolution operation is one of the most important aspects in the feature extraction process. In a 2D convolution, the transformation is applied on the 2D feature maps in order to compute features from the 2D dimensions only. For images corresponding to a

Table 1
Pre-trained networks used for feature extraction.

Architecture        # Layers   Input size (pixels)   # Parameters (millions)   Layer name     Layer position   # Features
Alexnet             8          227 × 227             61                        Pool5          16/25            9216
Densenet201         201        224 × 224             20                        Avg-pool       706/709          1920
Googlenet           22         224 × 224             7                         Pool5-7x7-s1   140/144          1024
Inceptionresnetv2   164        299 × 299             55.9                      Avg-pool       822/825          1653
Inceptionv3         48         299 × 299             23.9                      Avg-pool       313/316          2048
Resnet18            18         224 × 224             11.7                      Pool5          69/72            512
Resnet50            50         224 × 224             25.6                      Avg-pool       174/177          2048
Resnet101           101        224 × 224             44.6                      Pool5          344/347          2048
Squeezenet          18         227 × 227             1.24                      Pool10         66/68            1000
Vgg16               16         224 × 224             138                       FC7            36/41            4096
Vgg19               19         224 × 224             144                       FC7            42/47            4096
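As a rough illustration of what the feature counts in Table 1 imply for the separate classifier, the sketch below counts the trainable parameters of a single-hidden-layer perceptron placed on top of a feature vector. The 500 hidden units and 2 outputs follow the description given in Section 3.2; the bias terms, the function name `mlp_head_parameters`, and the selection of rows are assumptions made for this example only.

```python
# Feature vector sizes copied from Table 1 (a subset of the rows).
FEATURES = {
    "Alexnet": 9216,
    "Googlenet": 1024,
    "Resnet18": 512,
    "Resnet50": 2048,
    "Vgg16": 4096,
}


def mlp_head_parameters(n_features, n_hidden=500, n_outputs=2):
    """Count weights and biases of a one-hidden-layer perceptron.

    Assumes one bias per hidden unit and per output unit, which the
    paper does not state explicitly.
    """
    hidden = n_features * n_hidden + n_hidden
    output = n_hidden * n_outputs + n_outputs
    return hidden + output


if __name__ == "__main__":
    for name, n_features in sorted(FEATURES.items()):
        total = mlp_head_parameters(n_features)
        print(f"{name:10s} {n_features:5d} features -> {total:,} parameters")
```

Under these assumptions, the size of the classifier grows linearly with the number of extracted features, so the 9216 features of Alexnet lead to a much larger perceptron than the 512 features of Resnet18.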

2D signal, we consider a mask of size $W_1 \times W_2$; the value of the unit $y(x_1, x_2)$ at the $m$th feature map in the $l$th layer is given by:

$$y^{m,l}_{x_1,x_2} = f\left(\sigma^{m,l}_{x_1,x_2}\right) + w^{m,l}_{0} \qquad (1)$$

$$\sigma^{m,l}_{x_1,x_2} = \sum_{m'=0,\; i_1=0,\; i_2=0}^{M,\; W_1,\; W_2} w^{m,l,m'}_{i_1,i_2} \; y^{m',l-1}_{x_1+i_1,\, x_2+i_2}$$

where $f$ is an activation function (e.g. tanh, rectified linear unit function (Jarrett et al., 2009)). The set of weights $w^{m,l,m'}$ represents the connections between the unit at coordinate $(x_1, x_2)$ at the $m$th feature map in the $l$th layer and its corresponding unit at the $m'$th feature map in the $(l-1)$th layer (the previous layer). $w^{m,l}_{0}$ is a threshold.

3. Methods

3.1. Datasets

Two grape types are considered in this study: Albariño (Alvarinho) white grapes (Group A) and Barbera red grapes (Group B), both grown at Fresno State Vineyards (Fresno, California). Albariño is a type of white wine grape typically grown in the north of Spain and Portugal, and used to make varietal white wines. This type of wine is also produced in multiple regions in California, USA, including the Central Valley. Barbera is a red Italian wine grape variety, which was brought to America by Italian immigrants in the 19th/20th centuries, and started in California. These grapes were selected based on their relatively long ripening period shown in the Central Valley.

Digital image acquisition was done using the built-in camera from an iPhone 8. The camera is a 12 MP with f/1.8 aperture, where f is the focal length. The distance from the camera to the clusters was held constant at approximately 18 ± 5 cm (see Fig. 1b and 1c). Image sampling was done weekly from 11:00 am to 12:00 pm. Three pictures from each cluster were taken from three different angles as shown in Fig. 1c. For each grape variety, five vines were selected within their respective blocks in the Fresno State vineyard. For each vine, ten clusters were selected and tagged as indicated in Fig. 1a; five clusters were selected from the north side exposition and five clusters from the south. 50 berries were collected from each vine for chemical analysis and measurements of diameter and colorimetry.

Fig. 1. Digital image acquisition procedure.

The images for both groups have been gathered on 5 different days. The images for group A were acquired on July 9, 11, 13, 23, and 31, 2018. The images for group B were acquired on August 2, 8, 21, 27, and 30, 2018. The original size of the images was 4032 × 3024 pixels. The images are then resized to 448 × 336. The number of images for each day and group is given in Table 2. Datasets are available by contacting the corresponding author.

Table 2
Distribution of the number of images.

Albariño                  Barbera
Date         # Images     Date         # Images
7/9/2018     149          8/2/2018     78
7/11/2018    150          8/8/2018     79
7/13/2018    147          8/21/2018    79
7/22/2018    151          8/27/2018    79
7/31/2018    71           8/30/2018    79

The ground truth for the grape images was obtained by drawing a polygon to indicate the boundary of the shapes containing



grapes. The LabelMe software, written in Python and run in an Anaconda environment, was used to label the images, inspired by Russell et al. (2008). Fig. 2a depicts an example of this process, where a green polygon indicates the grape part. Each image containing grape segments produces a JSON object. Each JSON object is then passed to a Python script provided by LabelMe in order to generate the binary segmentation per image in PNG format. Fig. 2b shows an example where the red polygon indicates the grape segment in the binary channel.

Fig. 2. Grape segmentation process of converting JSON object to segmentation class in PNG.

Each image is divided into blocks of size $N_b \times N_b$. In the decomposition process, consecutive blocks are taken with an offset equal to half of the block size in both the vertical and horizontal directions. In addition, the blocks are filtered. A block is considered as belonging to a grape if at least 98% of its pixels are part of a grape. In the same way, a block is considered as belonging to a non-grape part if at least 98% of its pixels are not part of a grape. In the subsequent sections, we consider blocks of size $N_b = 32$, $N_b = 48$, and $N_b = 64$. In this stage, a complete dataset of images in a 4-D format of $N_b \times N_b \times N_c \times N_s$ is provided, where $N_c$ represents the number of channels (3 for color images, 1 for grayscale images) and $N_s$ represents the total number of blocks per group category. In the next step, the images were divided into 60% as a training set, 20% as a validation set, and lastly 20% as a test set. The training dataset is used to train the classifier with a batch size equal to 64 and a maximum number of iterations set to 40. The number of blocks per category and block size is presented in Table 3. Before the classification, each block was z-score normalized across all the blocks from the training datasets.

Table 3
Distribution of the blocks for Albariño (A) and Barbera (B).

Type   Size   Train   Validation   Test
A      32     41578   13860        13860
A      48     12190   4064         4064
A      64     4274    1425         1425
B      32     25611   8538         8537
B      48     7967    2656         2656
B      64     2893    965          965

3.2. Architectures

We consider three architectures that are based on three types of inputs. The first type corresponds to 24-bit RGB color images; the input is $N_b \times N_b \times N_c$ with $N_c = 3$. The second type contains 8-bit grayscale images; the input is $N_b \times N_b \times N_c$ with $N_c = 1$. In the third type, we consider only the information relative to the colors of the blocks: we include the histograms in Red, Blue, and Green of the colors present in the block; the input is 128 × 3, corresponding to 3 histograms of 128 bins. The three architectures are defined in Table 4. For the second convolutional layer of the first two CNNs, strides are set to 2 in both dimensions.

For the transfer learning approach where the pre-trained networks are used, we consider a multi-layer perceptron, i.e. a regular artificial feedforward neural network with a single hidden layer. The inputs are fully connected to the hidden layer of 500 units, which is fully connected to the output layer of 2 units.

3.3. Data augmentation

The most common method to enhance a data set containing images is to use geometric transformations that will not change the class of the object. In this paper, the grapes can be presented in different orientations and positions. We consider the following 2 transformations: 1) mirroring the image along the X-axis and 2) rotations, varying by 15 degrees. Such a procedure results in up to 60 different images from a single original image given as an input, substantially increasing the size of the training dataset.

3.4. Performance analysis

For the evaluation, we select a subset of blocks obtained from all the images in such a way that there are as many blocks corresponding to a grape as blocks corresponding to non-grape elements in an image. Such an evaluation keeps the classes balanced. The overall performance is measured with 5-fold cross validation by estimating the area under the ROC curve (AUC) (Fawcett, 2006) and the accuracy (ACC). The reported measurements are defined as follows:

$$\begin{aligned}
TPR &= TP/P \\
FPR &= FP/N \\
FNR &= FN/P \\
TNR &= TN/N \\
ACC &= (TP + TN)/(P + N)
\end{aligned} \qquad (2)$$

where $TP$, $FP$, $FN$, and $TN$ correspond to the number of true positives (number of blocks corresponding to a grape and detected as a grape), false positives (number of blocks detected as grapes but being non-grapes), false negatives (number of blocks detected as non-grapes but being a grape), and true negatives (number of non-grape blocks detected as non-grapes). $P$ and $N$ represent the total number of blocks representing grapes and non-grapes, respectively, with $P = N$ in the tests.

4. Results

4.1. Features

The performance in relation to the type of inputs is presented in Fig. 3, where the error bars represent the standard deviation across the 5 folds. By considering blocks of size 32, the accuracy for the color blocks is 91.54 ± 1.96 and 97.63 ± 0.56 for Albariño grapes and Barbera red grapes, respectively. For the blocks in grayscale, the accuracy is 78.48 ± 2.62 and 77.78 ± 4.88. Finally, for the inputs corresponding to only the color information, the accuracy is 89.04 ± 1.75 and 89.59 ± 3.03. These results highlight the importance of both the texture and the color information. They suggest that the color information is more discriminant than the texture content, and the best results are obtained when both are combined. The same pattern of performance is observed across the blocks of size 48 and 64. The mean accuracy

Table 4
Architecture of the CNNs for the different types of inputs.

Input type       Layer    Size          Type                           Weights  Bias
Color image      Input    32 × 32 × 3
                 Layer 1  32 × 32 × 10  Convolution 2D (5 × 5) + ReLU  750      10
                 Layer 2  14 × 14 × 50  Convolution 2D (5 × 5) + ReLU  12500    50
                 Layer 3  1 × 1 × 100   Fully connected + ReLU         980000   100
                 Output   1 × 1 × 2     Fully connected + Softmax      200      2
Grayscale image  Input    32 × 32 × 1
                 Layer 1  32 × 32 × 10  Convolution 2D (5 × 5) + ReLU  50       10
                 Layer 2  14 × 14 × 50  Convolution 2D (5 × 5) + ReLU  12500    50
                 Layer 3  1 × 1 × 100   Fully connected + ReLU         980000   100
                 Output   1 × 1 × 2     Fully connected + Softmax      200      2
Color histogram  Input    3 × 128 × 1
                 Layer 1  3 × 128 × 1   Convolution 2D (1 × 5) + ReLU  50       10
                 Layer 2  3 × 120 × 50  Convolution 2D (1 × 5) + ReLU  2500     50
                 Layer 3  1 × 1 × 100   Fully connected + ReLU         1800000  100
                 Output   1 × 1 × 2     Fully connected + Softmax      200      2
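
The weight counts listed in Table 4 for the color-image network can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the usual counting convention (kernel height × kernel width × input channels × output channels for a convolution, inputs × outputs for a fully connected layer, biases counted separately) as well as a two-unit output layer, which is what the 200 weights and 2 biases in the table imply.

```python
def conv2d_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights (biases excluded) in a 2D convolutional layer."""
    return kernel_h * kernel_w * in_channels * out_channels


def dense_weights(in_features, out_features):
    """Number of weights (biases excluded) in a fully connected layer."""
    return in_features * out_features


def color_image_cnn_weights():
    """Weights per layer for the color-image network of Table 4.

    Layer shapes are read from the table; the counting convention is an
    assumption of this sketch, not something stated in the paper.
    """
    return {
        "Layer 1": conv2d_weights(5, 5, 3, 10),       # 5x5 kernels, 3 -> 10 channels
        "Layer 2": conv2d_weights(5, 5, 10, 50),      # 5x5 kernels, 10 -> 50 channels
        "Layer 3": dense_weights(14 * 14 * 50, 100),  # flattened 14x14x50 map -> 100 units
        "Output": dense_weights(100, 2),              # 100 units -> 2 classes (assumed)
    }


if __name__ == "__main__":
    for layer, count in color_image_cnn_weights().items():
        print(layer, count)
```

Under these assumptions the counts come out to 750, 12500, 980000, and 200, matching the Weights column for the color-image network.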

Fig. 3. Accuracy in relation to the input type.
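
The accuracies in this section are reported as a mean ± standard deviation over the 5 folds, and the error bars of Fig. 3 show that standard deviation. A minimal sketch of this summary is given below. The fold accuracies are made-up placeholder values, not results from the paper, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics


def summarize_folds(fold_accuracies):
    """Return (mean, standard deviation) of per-fold accuracies, in percent."""
    mean = statistics.mean(fold_accuracies)
    # Sample standard deviation (n - 1 denominator); this choice is an assumption.
    std = statistics.stdev(fold_accuracies)
    return mean, std


def format_summary(fold_accuracies, digits=2):
    """Format the summary in the 'mean ± std' style used in the text."""
    mean, std = summarize_folds(fold_accuracies)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


if __name__ == "__main__":
    # Hypothetical accuracies for 5 folds (placeholders for illustration only).
    hypothetical_folds = [90.0, 92.0, 94.0, 91.0, 93.0]
    print(format_summary(hypothetical_folds))
```

Replacing `statistics.stdev` with `statistics.pstdev` would give the population estimate instead, which is slightly smaller for 5 folds.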

and standard deviation in relation to the block size are depicted in Fig. 3. For blocks of size 48 × 48, the accuracy is 92.38 ± 2.82 and 98.21 ± 0.53, while it is 91.87 ± 4.82 and 97.92 ± 0.85 for the blocks of size 64 × 64.

4.2. Data augmentation

The mean accuracy and standard deviation in relation to the tested data augmentation techniques are displayed in Fig. 4. By considering blocks of size 32, the accuracy for the Albariño grape images is 92.82 ± 1.99, 92.77 ± 1.89, and 93.46 ± 2.15 for the rotation, the mirror, and the combination of the rotation and mirror transformations, respectively, representing an improvement of 2% in the accuracy compared to the default condition. For the Barbera red grape images, the accuracy is 97.67 ± 0.67, 97.71 ± 0.62, and 97.83 ± 0.56. Performance is increased for Albariño grape images through data augmentation, while there was no substantial improvement for the Barbera red grape images. The same pattern of performance can be found across the blocks of size 48 and 64, with an improvement of about 2% for Albariño grape images and no significant improvement for Barbera red grape images.

Fig. 4. Effect of data augmentation on the accuracy.

Examples of segmentation using the networks trained on color images with data augmentation are depicted in Fig. 5 for each type of grapes. The classification of the blocks is estimated using a block size of 32 pixels, every 8 pixels in both directions. The pixel values in overlapped regions are averaged in relation to the number of overlapping blocks.

Fig. 5. Original image, ground truth, and heat map based on the classification. The black color represents grapes. First row: Albariño grapes; second row: Barbera red grapes.

4.3. Architectures

The performance for the different architectures using transfer learning with color images is presented in Tables 5 and 6. The true positive rate, false positive rate, false negative rate, true negative rate, accuracy, and area under the ROC curve are given in each row of the tables. The best performance for Albariño is obtained with Resnet50, with an accuracy of 99.6 ± 0.16. The architecture Resnet101 provides the best accuracy for Barbera red grapes, with an accuracy of 99.48 ± 0.26. Overall, the results indicate the superiority of the Resnet type architecture for this classification task, showing that such a network can be efficiently used via transfer learning for the detection of grapes.

Table 5
Performance with transfer learning for Albariño grapes.

Architecture       TPR    FPR   FNR   TNR    ACC           AUC
Alexnet            96.16  3.83  5.87  94.13  95.22 ± 0.91  97.86 ± 0.34
Densenet201        98.83  1.16  2.15  97.84  98.14 ± 0.42  99.73 ± 0.11
Googlenet          95.89  4.1   7.33  92.66  94.31 ± 1.16  98.23 ± 0.51
Inceptionresnetv2  98.74  1.25  2.13  97.86  98.37 ± 0.55  99.65 ± 0.15
Inceptionv3        98.59  1.4   2.18  97.81  98.2 ± 0.48   99.65 ± 0.14
Resnet18           98.96  1.04  2.37  97.62  97.65 ± 0.69  99.52 ± 0.11
Resnet50           98.43  1.56  3.03  96.96  99.6 ± 0.16   98.06 ± 0.61
Resnet101          98.89  1.1   2.67  97.33  98.15 ± 0.5   99.68 ± 0.16
Squeezenet         95.99  4     6.24  93.75  94.76 ± 1.1   98.52 ± 0.37
Vgg16              98.23  1.77  3.54  96.46  97.34 ± 0.73  99.29 ± 0.23
Vgg19              98.34  1.65  3.6   96.4   97.14 ± 0.74  99.32 ± 0.2

Table 6
Performance with transfer learning for Barbera red grapes.

Architecture       TPR    FPR   FNR   TNR    ACC           AUC
Alexnet            99.31  0.67  0.85  99.13  99.14 ± 0.21  99.5 ± 0.55
Densenet201        99.31  0.67  0.49  99.5   99.44 ± 0.29  99.83 ± 0.25
Googlenet          99.25  0.74  0.83  99.15  98.89 ± 0.32  99.82 ± 0.15
Inceptionresnetv2  99.35  0.64  0.64  99.34  99.23 ± 0.27  99.77 ± 0.26
Inceptionv3        99.31  0.67  0.86  99.12  99.36 ± 0.25  99.8 ± 0.23
Resnet18           99.44  0.55  0.56  99.42  99.26 ± 0.29  99.81 ± 0.23
Resnet50           99.25  0.74  0.8   99.19  99.45 ± 0.3   99.83 ± 0.24
Resnet101          99.44  0.54  0.47  99.52  99.48 ± 0.26  99.79 ± 0.29
Squeezenet         98.93  1.06  0.73  99.26  98.66 ± 0.55  99.8 ± 0.23
Vgg16              99.26  0.72  0.6   99.38  99.18 ± 0.28  99.68 ± 0.46
Vgg19              98.76  1.23  0.75  99.24  99.06 ± 0.35  99.77 ± 0.18

5. Discussion

Machine learning techniques for computer vision were historically first applied to problems in the field of document analysis and recognition, with applications such as optical character and symbol recognition. Thanks to improvements in both the techniques and the available computational power, classification has since been applied to a wider range of tasks, reaching agri-technology and particularly viticulture.

In this paper, we investigated multiple aspects of the detection of grapes in images taken in vineyards. First, we evaluated and compared the performance of 11 pre-trained deep learning architectures to estimate to what extent it is worth retraining a new classifier to achieve the required task. The results clearly show that a transfer learning approach can provide a high accuracy. They indicate that it is not necessary to fully train a convolutional neural network to obtain a high performance, and that pre-trained networks can provide relevant features for the detection of grapes in images. For both types of grapes, it was possible to reach an accuracy above 99% by using an existing pre-trained network. Second,

the impact of the input feature space has been evaluated by comparing color images, grayscale images, and color histograms using appropriate convolutional neural network architectures. The obtained results showed that these three input spaces provide an accuracy superior to 75%. The performance using the color feature space was better than with the spatial information contained in the grayscale images, while the color images led to an accuracy superior to 90%. In addition, we have estimated the impact of parameters such as the size of the blocks, and the impact of data augmentation, showing that these parameters have a key influence on the accuracy. With the proposed network that included only two convolutional layers, the best results across all experiments for the Albariño white grapes were achieved with settings of normalized color, 64 × 64 blocks, and data augmentation by rotation and mirroring, resulting in an accuracy of 98.2%. The best results for Barbera red grapes were achieved with settings of normalized color, 48 × 48 blocks, and data augmentation by rotation only, resulting in 98.5% accuracy. These results show the importance of data augmentation for training a network and of the choice of the inputs.

While it is better to reach 100% for the detection of single blocks belonging to grapes or non-grapes, such a high accuracy is not necessary, as the results can be combined by considering a sliding window in which overlapped windows are combined to estimate the class of the blocks corresponding to the intersection of multiple blocks. For instance, blocks of size 32 × 32, with a sliding window overlapping blocks by 16 pixels, can provide 4 scores for a block of size 16 × 16. By considering a multi-classifier approach that merges the output scores, it is possible to substantially increase the reliability of the decisions. In addition, from a practical point of view, blocks corresponding to grapes are likely to be grouped in a cluster. Isolated blocks within a cluster of blocks belonging to grapes are likely to be part of a grape.

The limitations of this study are the types of crops that have been investigated and the open question of how these results could be transferred to other problems, such as the classification of any tree fruit, such as peaches and oranges, that has a specific shape and color. Due to the difference in performance between the types of grapes, the results suggest that the performance can be grape specific. However, in order to generalize to other grapes, the creation of a labeled dataset of images is necessary. Such work is tedious, requiring manual labor to segment images for the creation of the training dataset. Methods based on semi-supervised learning should be investigated as a complement to deep learning techniques to reduce the need to label a large number of images. The current method can provide some estimation of the yield based on a 2D image. However, future works should include 3D imaging, such as stereoscopic imaging, to obtain more information about the grape cluster. In the current experimental conditions at Fresno, CA, USA, the weather is steady and sunny; the proposed system should be tested in conditions where uneven illumination related to the weather is pronounced.

In this paper, we have focused on the identification of grapes versus other elements that are not grapes in color images. However, the non-grape part of the image could be effectively segmented to estimate the amount of leaves in the image. Moreover, other visual features from vines, such as foliar area per vine, and the identification of grapevine diseases could benefit from a similar approach to the one proposed in this paper. A future challenge to address is the ability to classify grapes in relation to their age, in order to detect the ripeness and sweetness of grapes before picking them, as sweetness varies over time. When grapes reach maturity and the cells swell, the fruit gets larger and softer. The sugar increases, while acidity in the fruit decreases. Depending on how grapes are used, it is important to detect when they should be harvested. Prior tests using the same architecture that was used in this study highlight the difficulty of the task, as the performance was not above chance level, suggesting that more information than blocks of images alone would be needed to determine the age of grapes. Moreover, intelligent systems dealing with the segmentation of grapes should include additional sensors to facilitate data acquisition, such as obtaining images from multiple angles and multiple positions for improving the yield detection.

6. Conclusion

In this paper, we have investigated different deep learning approaches for the detection of grapes in images for yield estimation. The approaches include deep learning as a means to extract features via a transfer learning approach, and the creation of architectures that are specifically tailored to the needs of the application, by considering different types of input feature spaces. The high performance of the pre-trained ResNet networks shows that it is not necessary to fully train a deep learning architecture for the detection of grapes in images. However, training a network for the specific problem of grape detection provides key insight about the type of image invariance that should be required to provide a robust system, in particular in relation to the importance of the colors for the grape identification. Finally, the proposed approaches represent one step toward the development of practical intelligent systems in agri-technology for deployment in smart vineyards.

6.1. Acknowledgment

The project has been partially supported by the Faculty Research, Scholarly, and Creative Activity program of the College of Science and Mathematics, Fresno State. The authors thank Bryan Wegley for grape image data collection and the Undergraduate Research Assistantship grants from the Department of Viticulture and Enology.

Fu, L., Wang, B., Cui, Y., Su, S., Gejima, Y., & Kobayashi, T. (2015). Kiwifruit Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E., & Svetnik, V. (2015). Deep neural nets as a
recognition at nighttime using artificial lighting based on machine vision. method for quantitative structure-activity relationships. Journal of Chemical
International Journal of Agricultural and Biological Engineering, 8, 52–59. Information and Modeling, 55, 263–274.
Fukushima, K. (1980). A self-organizing neural network model for a mechanism of Mohammadi, K., Shamshirband, S., Motamedi, S., Petkovic, D., Hashim, R., & Gocic,
pattern recognition unaffected by shift in position. Biological Cybernetics, 36, M. (2015). Extreme learning machine based prediction of daily dew point
193–202. temperature. Computers and Electronics in Agriculture, 117, 214–225.
Goodfellow, I., Bengio, Y., & Courville, A. (2017). Deep learning. The: MIT Press. Nuske, S., Achar, S., Bates, T., Narasimhan, S. & Singh, S. (2011). Yield estimation in
Grill-Spector, K., & Malach, R. (2004). The human visual cortex. Annual Review of vineyards by visual grape detection. In Proceedings of the IEEE/RSJ international
Neuroscience, 27, 649–677. conference on intelligent robots and systems (pp. 2352–2358).
Hansen, M. F., Smith, M. L., Smith, L. N., Salter, M. G., Baxter, E. M., Farish, M., & Patil, S. S. & Thorat, S. A. (2016). Early detection of grapes diseases using machine
Grieve, B. (2018). Towards on-farm pig face recognition using convolutional learning and IoT. In Proceedings of the 2nd international conference on
neural networks. Computers in Industry, 98, 145–152. cognitive computing and information processing (CCIP) (pp. 1–5).
He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep residual learning for image Ramos, P. J., Prieto, F. A., Montoya, E. C., & Oliveros, C. E. (2017). Automatic fruit
recognition. In Proceedings of the IEEE conference on computer vision and count on coffee branches using computer vision. Computers and Electronics in
pattern recognition (pp. 770–778). Agriculture, 137, 9–22.
Jarrett, K., Kavukcuoglu, K., Ranzato, M. & LeCun, Y. (2009). What is the best multi- Redmon, J. & Angelova, A. (2015). Real-time grasp detection using convolutional
stage architecture for object recognition? In Proceedings of the 12th neural networks. In Proceedings of the IEEE international conference on
international conference on computer vision (ICCV’09) (pp. 2146–2153). robotics and automation (ICRA) (pp. 1316–1322).
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S. & Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). Labelme: A
Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. database and web-based tool for image annotation. International Journal of
In Proceedings of the 22nd international conference on multimedia (pp. 675– Computer Vision, 77, 157–173.
678). Simard, P., Steinkraus, D. & Platt, J. (2003). Best practices for convolutional neural
Komm, B. & Moyer, M. (2015). Vineyard yield estimation. In Washington state networks applied to visual document analysis. In Proceedings of the 7th
university extension bulletin (pp. 1–11). international conference document analysis and recognition (ICDAR) (pp. 958–
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with 962).
deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Simard, P., Victorri, B., LeCun, Y., & Denker, J. (1991). In R. P. L. E. J. E. Moody & S. J.
Q. Weinberger (Eds.), Advances in neural information processing systems 25 Hanson (Eds.), Advances in neural information processing systems (pp. 895–903).
(pp. 1097–1105). Curran Associates Inc. Simonyan, K. & Zisserman, A. (2015). Very deep convolutional networks for large-
Lampridi, M. G., Sorensen, C. G., & Bochtis, D. (2019). Agricultural sustainability: A scale image recognition. In International conference on learning
review of concepts and methods. Sustainability, 11, 1–27. representations.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning Slaughter, D., & Harrel, R. (1987). Color vision in robotic fruit harvesting.
applied to document recognition. Proceedings of the IEEE, 86, 2278–2324. Transactions of the ASAE, 30, 1144–1148.
Liu, S., & Whitty, M. (2015). Automatic grape bunch detection in vineyards with an Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. (2016). Rethinking the
SVM classifier. Journal of Applied Logics, 13, 643–653. inception architecture for computer vision. In Proceedings of the IEEE
Lowe, D. G. (1999). Object recognition from local scale-invariant features. In conference on computer vision and pattern recognition (pp. 2818–2826).
Proceedings of the international conference on computer vision (Vol. 2, pp. Tardáguila, J., Diago, M. P., Blasco, J., Millán, B., Cubero, S., García-Navarrete, O. L. &
1150–1157). Aleixos, N. (2012). Automatic estimation of the size and weight of grapevine
Luo, L., Tang, Y., Zou, X., Wang, C., Zhang, P., & Feng, W. (2016). Robust grape cluster berries by image analysis. In Proceedings of the CIGR AgEng.
detection in a vineyard by combining the adaboost framework and multiple Zheng, L., Yang, Y., & Tian, Q. (2018). SIFT meets CNN: A decade survey of instance
color components. Sensors, 16, 2098–2118. retrieval. IEEE Transactions on PAMI, 40, 1224–1244.
Luo, L., Zou, X., Yang, Z., Li, G., Song, X., & Zhang, C. (2015). Grape image fast
segmentation based on improved artificial bee colony and fuzzy clustering.
Transactions of the CSAM, 46, 23–28.
