Computers and Electronics in Agriculture: Sciencedirect
Discovering weather periods and crop properties favorable for coffee rust incidence from feature selection approaches
Emmanuel Lasso a,⁎, David Camilo Corrales a,b, Jacques Avelino c,d,e,f, Elias de Melo Virginio Filho e,
Keywords: Coffea arabica; Hemileia vastatrix; Crop disease; Dimensionality reduction; Machine learning; Model explanation
Coffee Leaf Rust (CLR) is a disease that leads to considerable losses in the worldwide coffee industry, such as those that have been reported recently in Colombia and Central America. The early detection of favorable conditions for epidemics could be used to improve decision making for the coffee grower and thus reduce the losses due to the disease. Researchers tried to predict the occurrence of the disease earlier through statistical and machine learning models from crop properties, disease indicators and weather conditions. These studies considered the impact of weather variables in a common period for all. Assuming that the dynamics of weather that most impact the development of the disease occur in the same time periods is simplistic. We propose an approach to discover the time period (window) for each weather variable, and the crop-related features, that most explain a future observed CLR incidence, in order to obtain a prediction model through machine learning. The selection of the variables more related with coffee rust incidence and rejection of the features with no significant contribution of information in machine learning tasks were approached from Feature Selection methods (Filter, Wrapper, Embedded). In this way, a CLR incidence prediction model based on the features with the greatest impact on the development of the disease was obtained. Moreover, the use of SHapley Additive exPlanations allowed us to identify the impact of features in the model prediction. The monitoring of coffee rust incidence is the most important predictor, since it provides information about current inoculum and this determines how much the incidence can grow or decrease. Temperature is a determining driver for germination and penetration phases in days 9 to 6 and 4 to 1 before the date of prediction. Additionally, the amount of rain determines whether uredospore dispersal or washing conditions occurred. The mean absolute error expected in the model is 6.94% of incidence, trained with XGBoost algorithm and the dataset reduced by Embedded method. The estimation of the disease incidence 28 days later can be used to improve decision making in control and nutrition practices.
1. Introduction
Coffee Leaf Rust (CLR) is one of the coffee diseases that causes the most damage to trees and the greatest crop losses (Avelino et al., 2008). The causal agent is the fungus Hemileia vastatrix Berk. & Broome (1869). The disease cycle is composed of propagule germination, penetration through stomata into the leaf, colonization of leaf tissue, sporulation through stomata and dispersal, which comprises propagule release, transport and deposition on coffee leaves. The uredospore is the only known propagule. The latent period, i.e. the time between germination and sporulation, is a key parameter of the epidemic: the shorter it is, the more intense the epidemic (Avelino and Rivas, 2013). The first symptoms are yellowish spots that appear on the underside of leaves. These spots then grow and produce uredospores displaying a typical orange colour. Chlorotic spots can be observed on the upper surface of the leaves. During the last stage, lesions become necrotic (Avelino et al., 2008). The disease affects coffee leaves causing defoliation and, in the worst-case scenario, death of branches and heavy crop losses. For
⁎ Corresponding author.
E-mail addresses: eglasso@unicauca.edu.co (E. Lasso), dcorrales@unicauca.edu.co, davidcamilo.corralesmunoz@inra.fr (D.C. Corrales), jacques.avelino@cirad.fr,
javelino@catie.ac.cr (J. Avelino), eliasdem@catie.ac.cr (E. de Melo Virginio Filho), jcorral@unicauca.edu.co (J.C. Corrales).
https://doi.org/10.1016/j.compag.2020.105640
Received 8 April 2020; Received in revised form 4 June 2020; Accepted 16 July 2020
Available online 05 August 2020
0168-1699/ © 2020 Elsevier B.V. All rights reserved.
E. Lasso, et al. Computers and Electronics in Agriculture 176 (2020) 105640
example, in Colombia after the 2008 epidemic, production decreased by 30% from 2008 to 2011, compared with 2007, while in Central America production decreased by 16% after the epidemic of 2012–2013 (Avelino et al., 2008). Reductions in production have a negative impact on the livelihood of coffee growers and agricultural workers, such as harvesters. Disease control implies greater investment, which makes the farmer’s situation even more precarious. Additionally, the majority of coffee varieties planted in Latin America are still susceptible to coffee rust, covering 80% of the area in Central America in 2012 (Avelino et al., 2008). Among the cultivated species, Coffea arabica is the most severely attacked (Rivillas et al., 2011). Despite the development of CLR-resistant coffee varieties, such as in Colombia where more than 60% of its coffee crops are planted with resistant varieties (Avelino et al., 2008), new rust races have appeared capable of breaking this resistance (Talhinhas et al., 2017).
One of the most common ways to assess CLR is to calculate its incidence. A plant unit (normally, a leaf) is categorized according to whether it presents the symptoms of the disease or not (Madden et al., 2007). After classifying a representative sample of plant units, the CLRI (Coffee Leaf Rust Incidence) corresponds to the average proportion of leaves infected over the total analyzed. CLRI is a continuous variable ranging from 0 to 100. The main limitation of this measurement for CLRI monitoring is the possible error in the categorization of infected or healthy leaves. Also, the incidence is a descriptor of the dynamics of both CLR and the coffee plant (Merle et al., 2019). Although the definition of incidence is uniformly accepted, there are many different ways of choosing the set of plants and leaves to be examined, and even of determining if a leaf is diseased or not. CLRI is therefore not necessarily comparable between different trials if the sampling method was different.
Weather, shade level, fruit load and crop management are four of the principal drivers for the development of CLR (Avelino and Rivas, 2013; Rayner, 1961; Avelino et al., 2006). Each phase of CLR has its own weather requirements and specific durations for these requirements.
Temperature affects propagule germination, penetration, colonization and sporulation phases. For germination, the optimum is around 22 °C (Nutman et al., 1963), while daily average temperatures around 28 °C favor sporulation (Merle et al., 2020) and temperatures of 25 °C shorten the latent period (Leguizamon, 1985). Temperatures of 22–28 °C that favor germination and lower temperatures (13–16 °C) that favor the formation of appressoria over the stomata, structures that facilitate the penetration phase, allow the infection to occur in less than 6 h in presence of free water (De Jong et al., 1987).
The fungus requires the presence of a layer of water on the underside of the leaves to germinate (Nutman et al., 1963; Waller et al., 2007). Water is also important for dispersal, particularly via splashing, i.e. the dispersal in raindrops after impacting lesions with uredospores. However, if the rains are very abundant and intense, the uredospores can be eliminated by washing (Kushalappa and Eskes, 1989). As CLR is an obligate parasite, needing living leaves for its survival, any released uredospore that cannot reach a coffee leaf will no longer contribute to the epidemic growth.
Relative humidity is an indirect measurement of leaf wetness. This condition can be derived from the number of hours with relative humidity of the air above a specific limit, usually 90% or 95% (Sutton et al., 1984).
The physiological characteristics of the coffee tree (particularly in relation with fruit load) have an influence on the latent period of the disease. For susceptible crops with high fruit load and favorable weather conditions, the latent period can last less than 2 weeks. It is longer (up to several months) on the oldest leaves of low yielding coffee plants in cold and dry conditions (Rayner, 1961).
The presence of shade on coffee crops has an effect on the disease (Boudrot et al., 2016), since it maintains very narrow thermal amplitude values and favors a constant high air relative humidity (Arcila et al., 2007). It also affects other drivers involved in the CLR cycle, such as rain, wind, fruiting load and soil moisture (López-Bravo et al., 2012). The balance of these effects is still controversial.
Crop management (fertilization, disease control) drives CLR epidemics. However, crop management is limited by the economic capacity of the coffee grower. The continuous monitoring of the disease allows the application of fungicides with no excess, at an appropriate time as soon as CLRI reaches a certain level, usually 5% (Zambolim, 2016), reducing further CLR intensity and impacts. On the other hand, fertilizer applications contribute to the recovery of the coffee tree due to the action of nutrients on vegetative growth (Villarreyna et al., 2019).
The observed CLRI is also a result of coffee plant growth (Avelino et al., 2004; Kushalappa et al., 1983; Merle et al., 2019) and previous CLRI values, as proxies for the estimation of the inoculum stock that will potentially cause new infections if the right conditions are met (Kushalappa et al., 1983). The importance of host growth lies in the fact that, in a growing season, an apparent dilution of CLRI occurs when new healthy leaves appear, decreasing the proportion of infected leaves (Ferrandino, 2008; Merle et al., 2019; López-Bravo et al., 2012). This decrease does not imply that the conditions for the pathogen are not favorable. Similarly, fall of non-rusted leaves, for diverse reasons, will increase CLRI (Kushalappa, 1981).
The severe impacts due to CLR epidemics have pointed out the need for an early warning system that could be used to support decision making for disease control (Avelino et al., 2008; Lasso et al., 2018). In order to estimate epidemic growth, some approaches to modeling CLR have been proposed in the last decades. These models have been both descriptive and predictive. The first approach was proposed by Rayner (1961), in which an attempt was made to explain the latent period of CLR using a multiple regression. The regression was applied to data of mean maximum and minimum temperatures that occurred during this latent period. In later years, a series of approaches by Kushalappa et al. (1980), Kushalappa (1981), Kushalappa and Ludwig (1982), Kushalappa et al. (1983), Kushalappa et al. (1984), Kushalappa and Eskes (1989) through regressions were used to explain the CLR latent period, the CLR infection rate, the proportion of leaf area infected and the CLRI growth rate. The CLR infection rate corresponds to the apparent growth rate that takes into account the proportion of leaves and leaf area rusted (Kushalappa and Ludwig, 1982) and depends on the amount of healthy plant material available for infection. One of the most important contributions of Kushalappa was the development of a quasi-mechanistic model that integrated knowledge about the disease processes and the drivers that affect it, by the generation of the NSRMP (Net Survival Ratio for Monocyclic Process) variable. NSRMP quantifies the favorable events for the development of the epidemic, which occurred in a period of 28 days before the date of prediction (DP) (Kushalappa et al., 1983) to explain infection rate 14 to 28 days after DP. The study carried out by Kushalappa (1981) tried to forecast the CLR infection rate making use of stepwise regression and different prediction intervals. The variables that most explained the CLR infection rate were: proportion of leaf area occupied by visible uredospores, minimum and maximum temperatures, rainfall and new leaves formed; in a prediction interval of 56 days after DP. Santacreo et al. (1984) studied favorable conditions for CLRI. They related the same weather variables as those of the study of Kushalappa (1981), plus the total leaves present in the coffee tree and previous CLRI.
Recent research works (Meira et al., 2008; Lasso et al., 2015; Corrales et al., 2016; Lasso et al., 2017; Corrales et al., 2018) have proposed the use of supervised learning models for CLRI or CLRI growth rate as a function of weather conditions, crop management and crop properties (shade, coffee planting density, altitude). The weather variables occurring in the 28 days (a single time period or window) prior to CLRI assessment were considered, following the proposal of Kushalappa et al. (1983). For the prediction of the numerical value of the CLRI, regression-based algorithms were used, while for the CLR infection rates, expressed as nominal values, the algorithms correspond to classification
task. In Corrales et al. (2016) an unsupervised learning algorithm is used in order to group similar weather conditions and crop properties into clusters, which are integrated into the training set of a supervised learning algorithm for CLR modeling. In these approaches, the preparation of the training set for learning tasks follows the considerations of expert knowledge on CLR.
CLR expert knowledge was used in Meira et al. (2008) and Lasso et al. (2015) to generate composite variables representing favorable conditions for the disease; for example, leaf wetness was addressed in a variable that quantifies the average temperature in hours when the relative humidity was above 90% or 95%.
Fig. 1 shows the evolution of the methods used for CLR modeling, grouped by those using simple or multiple linear regression (Regression), regression based on Machine Learning algorithms (ML Regression), classification algorithms and other techniques. Since the 2000s, the rise of the use of machine learning techniques to address disease modeling can be seen.
However, most of these studies have not explored different weather windows within the 28 days proposed by Kushalappa. Assuming that all weather variables impacting the development of the disease occur in the same time periods is therefore simplistic (Coakley et al., 1988), as demonstrated by Merle et al. (2020). Statistical methods (and the modeling process itself) make it possible to identify the most influential periods of each weather variable (Bugaud et al., 2016; Leandro-Muñoz et al., 2017). This approach generates an increase in the number of variables (called features in data modeling and analysis tasks) included in the modeling processes.
The objective of this work was to model the CLRI in order to study particularly the impact of weather variables, characterized in different weather time periods, called windows. The window approach generated high-dimensional datasets. The most important variables were selected through Feature Selection (FS) methods. FS made it possible to avoid redundant and noisy data, improving the CLRI modeling. In addition to weather variables, we incorporated host growth, shade level, and management type. The monitoring of CLRI was incorporated as a response variable. Machine learning tasks allowed us to produce a model for CLRI prediction that is most sensitive to these characteristics. Moreover, the analysis of the contribution that each of the most important variables had in the model predictions allowed us a more detailed understanding of the favorable conditions for the disease development.
2. Materials and methods
We proposed an approach to discover the weather windows and variables that most explain a future observed CLRI. A dataset with the resulting features was used to obtain a prediction model through machine learning. The different stages of our approach are presented in Fig. 2.
Weather monitoring information is broken down into windows of different duration, and associated with crop property information. A feature selection process is applied to obtain the best features for the modeling of the CLRI and discard the irrelevant ones. Next, the resulting datasets are used to train different machine learning algorithms and obtain their respective models. To establish the best combination between sets of selected features and machine learning algorithms, the model with the lowest prediction error is selected. Finally, the highly correlated variables are cleaned and the impact of the values of the final features on the CLRI prediction generated by the model is analyzed. Each subprocess and element is reported below.
2.1. Data source
The data used correspond to a long term experiment of coffee-based agroforestry systems established in Costa Rica in 2000, described by Haggar et al. (2011), studying ecological processes that promote sustainability and higher coffee productivity, under different crop conditions. CLRI is one of the studied variables. This trial was carried out in the Tropical Agricultural Research and Higher Education Center (CATIE) (Rossi et al., 2011) at coordinates 9° 53′ 44″ North and 83° 38′ 07″ West. Detailed information has been continuously collected, which makes it a unique experiment in the area. The experiment has a total area of 9.2 ha, located at 685 meters above sea level, in soils with a franc-clay texture. The variety of coffee is Caturra of the species Coffea arabica, susceptible to most CLR races. The crop management (fertilization, disease and pest control) has two strategies: organic and conventional. Organic management uses chicken manure and organic matter (coffee pulp) at two intensity levels. Conventional management also has two levels. The high conventional level uses the complete technical package for maximizing productivity including pesticide and herbicide application (copper based fungicide (50% Cu) in 1 kg ha−1 doses combined with a systemic product (cyproconazole 10% WG) in 0.4 liter ha−1 doses), and fertilization (300 kg N ha−1, 20 kg P ha−1, 150 kg K ha−1). The medium conventional level is a less intense level, using a half-dose of inputs compared to high conventional level (Haggar
Fig. 2. Modules to discover the weather windows and features that most explain a future observed CLRI.
et al., 2011). There are 20 treatments configured with different combinations of six types of shaded and full sun exposed crops, and the two management strategies mentioned above. The treatments are replicated in three blocks. For our purpose, we only considered the two levels of conventional management and the following shade conditions: densely shaded crops, i.e. those that contained the combination Chloroleucon (Cashá - Ab.i) (C) + Erythrina (Poró) (E), and Full Sun (FS). CLRI assessment was done monthly through the selection of 10 branches in 10 coffee trees, counting the total number of leaves and those with CLR symptoms.
Daily weather data were obtained from the CATIE Meteorological Station located in its campus, in Turrialba, Costa Rica, at an altitude of 602 meters above sea level, at coordinates 9° 53′ North Latitude and 83° 38′ West Longitude. The average meteorology in the experiment location between 2002 and 2014 is: precipitation 3037 mm/year, air temperature 22 °C, relative humidity 89.6%.
We used the information of weather and CLRI monitoring from April 2002 to December 2014. To avoid redundancy issues, we only used the data from one of the blocks according to the process carried out: the data from block 1 for the FS and modeling process, and the data from block 2 for results validation and model explanations.
2.2.1. Weather
From the meteorological station data, we took the following weather variables: maximum (tMax) and minimum (tMin) air temperature, average (tAvg) air temperature calculated over the day, thermal amplitude (tAmp) which represents the difference between maximum and minimum temperatures, average (hAvg) and minimum (hMin) relative humidity, number of days with precipitation greater or equal to 1 mm (rDay) and daily precipitation (pre). The quality of the weather station data was evaluated by searching for outliers and out-of-range values and by checking for null (missing) values. The initial scale of weather data was daily. Since the scale of disease monitoring was monthly, a time series resample was necessary to correspond to the same time scale as that of disease monitoring.
We only considered meteorological data in a period of 14 days before the date on which CLRI was measured in each month, which would be the date of prediction (DP). This period was called the main time frame (MTF). We did not use a larger period, since CLRI at DP is already the result of the meteorological conditions that mainly occurred in the previous month, considering that the latent period is one month long on average (Waller, 1982). CLRI at DP provides a measurement of the inoculum stock available for new infections.
In order to generate features over shorter periods and identify which of them are most related with the CLRI to be predicted, i.e. at DP + 28 days, we analyzed sub-frames within the MTF sequentially, called windows (Coakley et al., 1988). Each new window begins one day after the start time of the previous one. If s is the size of the set and i is the size of the window, the MTF can be divided into s - i + 1 windows. Fig. 3 shows the windows before DP that we obtained. The date of predicted incidence (DPI) is 28 days after DP. The CLRI measured in DP was called current CLRI (cCLRI) while the one in DPI was called predicted incidence (pCLRI). The feature index represents the corresponding range of days (before DP). For example, tMax7-4 represents the maximum temperature between days 7 and 4 before DP.
The weather data were divided into 4 types of windows, according to their size (Fig. 3), in the following way:
• 14D: Single window of 14 consecutive days (i = 14); one feature for each weather variable.
• 7D: 8 windows of 7 consecutive days (i = 7); 8 features for each weather variable.
• 4D: 11 windows of 4 consecutive days (i = 4); 11 features for each weather variable.
• 3D: 12 windows of 3 consecutive days (i = 3); 12 features for each weather variable.
2.2.2. Other variables
Shade condition (shade) and management type (mgmt) were coded as dummy variables that take a binary value to indicate shaded crop (1) or full sun (0), for shade; and conventional high (1) or conventional medium (0) for mgmt. A variable related to the growth of the host (coffee tree) (hGrowth), characterized by the difference between the number of leaves in coffee trees in day 14 and day 1 before DP, was also coded as a dummy variable; its value is 1 if the number of leaves increases and 0 otherwise. Additionally, the two numerical variables related to the disease, cCLRI and pCLRI, were included in the datasets for each window. The data quality verification was similar to that done
Fig. 3. Set of windows for weather variables according to window size. The feature index represents the corresponding range of days before the date of prediction (DP
– green point). For example, for a 4 consecutive days window, tMax7-4 represents the maximum temperature between days 7 and 4 before DP. The date of predicted
incidence (DPI - orange cross) is 28 days after DP. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this
article.)
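The window scheme of Fig. 3 can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names `window_ranges` and `window_features`, the ordering of the daily list and the temperature values are assumptions. The mean is used as the default aggregate because features such as tMin4-1 and pre11-8 are described in the text as averages over their window; another aggregate can be passed in.

```python
from statistics import mean


def window_ranges(s, i):
    """Return (start_day, end_day) pairs, counted as days before DP.

    Each window starts one day after the previous one, so a frame of
    s days yields s - i + 1 windows of i days.
    """
    return [(start, start - i + 1) for start in range(s, i - 1, -1)]


def window_features(name, daily_values, i, agg=mean):
    """Aggregate one weather variable over every window of size i.

    daily_values[0] is the value s days before DP and daily_values[-1]
    the value 1 day before DP (an assumed ordering for this sketch).
    """
    s = len(daily_values)
    features = {}
    for start, end in window_ranges(s, i):
        # Day d before DP sits at index s - d in the list.
        chunk = daily_values[s - start : s - end + 1]
        features[f"{name}{start}-{end}"] = agg(chunk)
    return features


# Hypothetical daily maximum temperatures for the 14 days before DP.
t_max = [27.1, 28.4, 29.0, 26.5, 25.9, 27.7, 28.8,
         30.2, 29.5, 27.0, 26.1, 25.4, 26.8, 27.9]
features_4d = window_features("tMax", t_max, i=4)
```

With s = 14 and i = 4 the dictionary holds 11 entries, from `tMax14-11` down to `tMax4-1`, which matches the 4D case; repeating the call for each weather variable and window size gives one dataset per window type.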
for weather data. Incidence values above 100% were found and removed. In our approach, it is necessary to have the measurement of two consecutive months of the disease; thus, months with no data were discarded.
2.3. Feature selection
In a dataset, high dimensionality (a large number of features) can generate problems for data processing, since a large number of irrelevant or misleading features do not provide significant information related with the target variable in a learning process (Khalid et al., 2014). Additionally, a large number of correlated predictors (multicollinearity) is usually associated with model overfitting (Magidson et al., 2013). To solve this, from computer and data science, the Feature Selection (FS) approach was proposed. FS is based on the selection of the best features among all the features that are useful for a determined machine learning task (Khalid et al., 2014). The resulting reduced dataset can be processed more easily (because fewer features are presented and the instance size is decreased), so the models obtained are simpler and more accurate (Chrysostomou, 2009). There are several FS algorithms that can be classified into three categories, depending on the process used to achieve their objective (Corrales et al., 2018): Filter, Wrapper and Embedded methods. These methods have been applied in the agriculture domain in order to optimize the analysis of the most relevant features for a specific problem (Bocca and Rodrigues, 2016; Kale and Sonavane, 2019; Sharif et al., 2018).
The FS methods are supported by other elements such as (Venkatesh and Anuradha, 2019): search direction, search strategy, stop criteria and validation. The search direction defines the way the set of features is reduced to find the best reduced subset. In the case of forward searching, it starts from an empty set of features that will be filled recursively in each iteration. Conversely, in the case of backward searching, each iteration recursively removes a feature from the total set of features. The search strategy refers to how the features that are added or removed in each iteration are chosen. It can be sequential, based on performance criteria or randomized. The stop criteria can be chosen according to a desired number of features, number of process iterations or no limit (evaluation of all possibilities). Finally, the validation depends on the learning task that will be carried out (regression or classification). It is commonly addressed through cross validation.
Several elements of the FS process depend on the characteristics of the dataset used and the learning task for which it will be used. Our dataset had a continuous numerical target variable and numerical features (including those encoded as dummy variables). Since the target variable was numerical, the supervised learning task was regression.
Each FS method generates lists of features selected for each window. New subsets were generated from these lists. Since the FS process is done in relation to the target variable (pCLRI), this one was separated from the others for the process.
We describe in the following subsections each method and the elements of the FS process according to our needs. We applied each method in order to compare the results.
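The four supporting elements can be made concrete with a minimal sketch for a regression task, under stated assumptions: a forward search direction, a performance-based search strategy, a fixed number of features as the stop criterion and k-fold cross validation scored by mean absolute error. Ordinary least squares from NumPy stands in for the learning algorithm, the data are synthetic, and the names `cv_mae` and `forward_selection` are hypothetical; none of this is the authors' implementation.

```python
import numpy as np


def cv_mae(X, y, k=5):
    """Mean absolute error of a least-squares fit under k-fold CV."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Add an intercept column, then fit on the training rows only.
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean(np.abs(A_test @ coef - y[test_idx])))
    return float(np.mean(errors))


def forward_selection(X, y, n_features, k=5):
    """Greedy forward search: start empty, add the best feature each round."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:   # stop criterion
        scores = {j: cv_mae(X[:, selected + [j]], y, k) for j in remaining}
        best = min(scores, key=scores.get)            # search strategy
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data: only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)
chosen = forward_selection(X, y, n_features=2)
```

A backward search would start from all columns and drop, at each round, the feature whose removal degrades the cross-validated error the least.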
2.3.1. Filter method: Pearson’s correlation coefficient
A filter method based on Pearson’s correlation coefficient was used, given its performance in FS related to regression model building (Rendall et al., 2019). The features were individually correlated to the target variable by the Pearson’s correlation coefficient from the correlation function available in Pandas for Python (McKinney, 2010). The threshold to select the features was addressed by the Rule of Thumb proposed by Krehbiel (2004), which takes into account the sample size for statistical significance. Features with a correlation coefficient $|r| \geq 2/\sqrt{n}$, where n is the number of dataset features, were selected.
another. For this, we took the correlations between the features selected from the Pearson’s coefficient. In the case of finding two variables with a moderate or high correlation (absolute value > 0.5) (Koo and Li, 2016), we removed the one that had a lower importance score, given by the algorithm the model was trained with. This process was not done previously since the importance and relevance of all features within learning tasks was not known yet. In addition, we built a new dataset with the resultant features and trained a model with it, in order to compare the MAE with the best one obtained in the previous section. The testing set corresponds to data from block 2 of the CATIE experiment.
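Both correlation-based steps, the Pearson filter against the target and the later removal of mutually correlated features, can be sketched as follows. This is a hedged illustration rather than the authors' code: it uses NumPy's `corrcoef` instead of the Pandas function mentioned in the text, the data and importance scores are made up, and the names `pearson_filter` and `prune_correlated` are hypothetical. The value of n is passed in explicitly, because the text ties the rule to sample size while also describing n as the number of dataset features. Visiting features in order of importance is one possible reading of "remove the one with the lower importance score".

```python
import numpy as np


def pearson_filter(X, y, names, n):
    """Keep features whose |r| with the target is at least 2 / sqrt(n)."""
    threshold = 2.0 / np.sqrt(n)
    kept = []
    for j, name in enumerate(names):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            kept.append(name)
    return kept


def prune_correlated(X, names, importance, limit=0.5):
    """Drop the less important feature of every pair with |r| > limit."""
    corr = np.corrcoef(X, rowvar=False)
    # Visit features from most to least important so the stronger one wins.
    order = sorted(range(len(names)), key=lambda j: -importance[names[j]])
    kept = []
    for j in order:
        if all(abs(corr[j, k]) <= limit for k in kept):
            kept.append(j)
    return [names[j] for j in sorted(kept)]


# Synthetic example: b nearly duplicates a, c is unrelated noise.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)
c = rng.normal(size=300)
y = 2.0 * a + rng.normal(scale=0.5, size=300)
X = np.column_stack([a, b, c])
names = ["a", "b", "c"]

relevant = pearson_filter(X, y, names, n=len(y))
importance = {"a": 0.6, "b": 0.3, "c": 0.1}   # hypothetical model scores
final = prune_correlated(X, names, importance)
```

In this toy case `a` and `b` are strongly correlated with each other, so only the more important of the two survives the pruning step.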
Table 1
Number of features defined as relevant and irrelevant by feature selection methods and approaches.

Dataset   FS Method   Approach      Relevant F.   Irrelevant F.
14D       Filter      Pearson       2             10
14D       Embedded    LASSO         2             10
14D       Embedded    XGBoost FS    12            0
14D       Wrapper     SFS Rforest   11            1
14D       Wrapper     SFS XGBoost   8             4
14D       Wrapper     RFE Rforest   7             5
14D       Wrapper     RFE XGBoost   1             11
7D        Filter      Pearson       16            52
7D        Embedded    LASSO         2             66
7D        Embedded    XGBoost FS    24            44
7D        Wrapper     SFS Rforest   43            25
7D        Wrapper     SFS XGBoost   11            57
7D        Wrapper     RFE Rforest   49            19
7D        Wrapper     RFE XGBoost   1             67

between the features and pCLRI that was not clear in Fig. 4 (non-linear relationship). The average of daily precipitation between 11 and 8 days before DP (pre11-8) contributes positively to incidence up to 15 mm; above 15 mm, the reverse is seen. The average of minimum temperatures between 4 and 1 day before DP (tMin4-1) is positively related to the predicted incidence up to 19 °C. Above this value, the relationship is inverted. The number of rainy days between 14 and 11 days before DP (rDay14-11) contributes negatively to pCLRI. No rainy days in this window tend to favor the incidence, which decreases after every rainy day in the window. Low maximum temperatures between 9 and 6 days before DP (tMax9-6) increase the incidence while high values have the opposite effect. The predicted incidence tends to increase with a higher average of daily precipitation between 6 and 3 days before DP (pre6-3) up to 10 mm. Above this value, pCLRI decreases. For values above 19 mm, this feature has no effect on the predicted incidence.

Table 2
Feature Selection (FS) method, learning algorithm related, minimum Mean Absolute Error (MAE) and compared MAE and number of features (No. F.) for original (O) and reduced (R) dataset obtained from Feature Selection.

Window   FS Method   L. Algorithm   MAE R.   MAE O.   No. F. R.   No. F. O.
Fig. 4. Summary of SHAP values for the features according to their values. The range of values for each feature is represented in a color gradient, where red represents its highest value and blue the lowest. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
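To make the quantity plotted in Fig. 4 more concrete, the sketch below computes exact Shapley values by brute force for a toy three-feature function. It is not the SHAP library and not the authors' XGBoost model: the function `toy_model`, its coefficients, the input and the all-zero baseline are hypothetical, and replacing absent features by a single baseline value is a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = [x[k] if k in subset else baseline[k] for k in range(n)]
                with_j = list(without)
                with_j[j] = x[j]
                # Marginal contribution of feature j to this coalition.
                phi[j] += weight * (predict(with_j) - predict(without))
    return phi


def toy_model(v):
    # Hypothetical predictor with one interaction term (v[0] * v[2]).
    return 10.0 + 2.0 * v[0] - 3.0 * v[1] + v[0] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The values are additive: they sum to the difference between the prediction at `x` and the prediction at the baseline, which is what lets each feature's contribution be read off separately in a summary plot.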
day reported by Merle et al. (2020) and Avelino et al. (2020). Once the uredospores are deposited, they need water to germinate, and temperature has a great impact. Excessively high maximum temperatures (tMax9-6 > 29 °C) hinder rust development (Ribeiro et al., 1978), while lower values around 22 °C generate optimal conditions for germination and penetration (Nutman et al., 1963). The windows of precipitation in days 6 to 3 before DP (pre6-3) and of minimum temperature between 4 and 1 days before DP (tMin4-1) share 2 days. Rainfall below about 10 mm, which leaves free water on the leaves, in conjunction with minimum temperatures around 19 and 20 °C, generates conditions for the germination and penetration phases (Nutman et al., 1963). The minimum daily temperatures are normally reached just before sunrise. High minimum temperatures combined with darkness are needed for the uredospores to germinate and accomplish infection (Leguizamon, 1985).

Even though the model was not constructed with weather data after DP, it generates CLRI predictions with an acceptable error. However, including weather data after DP could help improve the model. As demonstrated by Merle et al. (2020), daily rainfall and thermal amplitude have an impact up to 11 days before symptom appearance on the coffee leaf.

Host growth (hGrowth equal to 1) generates a decrease in the predicted CLRI value, which verifies a dilution effect of the disease, as already reported by Ferrandino (2008), Merle et al. (2019) and López-Bravo et al. (2012). On the contrary, in periods of vegetative decrease the CLRI values tend to increase due to the absence of the dilution effect. This feature appears to be essential for CLRI predictions. Any model that does not include the effects of host growth will fail to predict CLRI. For shaded crops (shade equal to 1), the expected incidence is higher than for those in full sun, which is an indication of the favorable microclimatic conditions under shade. Shade has been reported to buffer temperatures, to increase wetness, favoring germination and infection and reducing the latent period (López-Bravo et al., 2012), to intercept raindrops, reducing uredospore washing (Avelino et al., 2020), and to promote uredospore dispersal in the air due to the increased kinetic energy of the raindrops in the understory (Avelino et al., 2020) that heavily hit the coffee leaves (Boudrot et al., 2016). Similarly, the observed effect of management (mgmt) was expected. Proper nutrition contributes negatively to the disease (Avelino et al., 2006) and fungicide application reduces rust area and protects the plant against new infections (Merle et al., 2019). The mgmt feature could be improved by recording whether fungicides and fertilizers were applied in the month before DP.

SHAP values contribute to the interpretation of the results of a machine learning process. Many of these processes generate so-called “black box” models, whose inputs and outputs are known, but not the process that generates the outputs from the inputs. SHAP values allowed us to get an idea of how the model generates a prediction. In addition, the graphical representation facilitated the interpretation of the relationships found between the features and the target variable in light of the scientific knowledge on the disease. The interpretation and validation were made even easier by the reduction of features, according to their importance in the modeling process and after elimination by mutual correlation analysis.

The analysis of favorable conditions for CLR was improved by considering different short consecutive windows rather than a single long period, where short phenomena can go unnoticed. Although, statistically, with shorter windows the modeling task would have more “options” for the generation of the functions in the resulting model, the 4-day window was better than the 3-day one. If the window is too short, there is no biological response related to disease phases.

In the application of the FS methods, the Wrapper RFE method is the one that selects the lowest proportion of relevant features. In the Wrapper RFE method, since cCLRI has a much greater importance than the other features, the reduction of the size of the feature set in each iteration leads to considering only that feature.

5. Conclusions

We identified the favorable conditions of rain and temperature that lead to the dispersal, germination and penetration phases of CLR. Additionally, we trained a machine learning model able to estimate the disease incidence 28 days later. The combination of the analysis of weather variables characterized in windows of short duration, CLRI
monitoring and crop properties, with feature selection methods, machine learning and a model explanation technique, allowed us to achieve it. The whole process was carried out with real data from a field experiment. We are aware that the model performance may be overestimated since the dataset corresponds to only one location. To improve the generalization of the model, the application of the same approach in other regions and countries is necessary. However, the results of our study are a promising advance for CLR modeling. This model can be used, after validation, to identify CLR epidemics in their early phases and to facilitate decision-making by institutions, organizations and individual coffee producers for timely disease control, to avoid or reduce coffee losses.

Fig. 5. Dependence plots for numeric features relating the contribution to model prediction (SHAP value) to the feature value. The red curve shows the smoothed trend, and the histograms along the axes show the distributions of values. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

CRediT authorship contribution statement

Emmanuel Lasso: Conceptualization, Methodology, Software,
Table A.3
Tuning of learning algorithms hyper-parameters.
Algorithm Hyper-parameter Range
Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.compag.2020.105640.
Haggar, J., Barrios, M., Bolaños, M., Merlo, M., Moraga, P., Munguia, R., Ponce, A., Romero, S., Soto, G., Staver, C., Virginio, E. de M.F., 2011. Coffee agroecosystem performance under full sun, shade, conventional and organic management regimes in central america. Agrofor. Syst. 82, 285–301. https://doi.org/10.1007/s10457-011-9392-5.
Kale, A.P., Sonavane, S.P., 2019. Iot based smart farming: feature subset selection for optimized high-dimensional data using improved ga based approach for elm. Comput. Electron. Agric. 161, 225–232. https://doi.org/10.1016/j.compag.2018.04.027. bigData and DSS in Agriculture, http://www.sciencedirect.com/science/article/pii/S0168169917309389.
Khalid, S., Khalil, T., Nasreen, S., 2014. A survey of feature selection and feature extraction techniques in machine learning. In: 2014 Science and Information Conference, pp. 372–378. https://doi.org/10.1109/SAI.2014.6918213.
Koo, T.K., Li, M.Y., 2016. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropractic Med. 15, 155–163.
Krehbiel, T.C., 2004. Correlation coefficient rule of thumb. Decision Sci. J. Innov. Educ. 2, 97–100.
Kushalappa, A.C., 1981. Linear models applied to variation in the rate of coffee rust development. J. Phytopathol. 101, 22–30. https://doi.org/10.1111/j.1439-0434.1981.tb03317.x. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1439-0434.1981.tb03317.x.
Kushalappa, A., Chaves, G., et al., 1980. An analysis of the development of coffee rust in the field. Fitopatologia Brasileira 5, 95–103.
Kushalappa, A.C., Eskes, A.B., 1989. Advances in coffee rust research. Annu. Rev. Phytopathol. 27, 503–531.
Kushalappa, A.C., Ludwig, A., 1982. Calculation of apparent infection rate in plant diseases: development. Phytopathology 72, 1373–1377.
Kushalappa, A., Akutsu, M., Ludwig, A., 1983. Application of survival ratio for monocyclic process of hemileia vastatrix in predicting coffee rust infection rates. Phytopathology 73, 96–103.
Kushalappa, A., Akutsu, M., Oseguera, S., Chaves, G., Melles, C., Miranda, J., Bartolo, G., 1984. Equations for predicting the rate of coffee rust development based on net survival ratio for monocyclic process of hemileia vastatrix [coffea arabica]. Fitopatologia Brasileira (Brazil).
Lasso, E., Thamada, T.T., Meira, C.A.A., Corrales, J.C., 2015. Graph Patterns as Representation of Rules Extracted from Decision Trees for Coffee Rust Detection. Springer International Publishing, Cham, pp. 405–414. doi:10.1007/978-3-319-24129-6_35.
Lasso, E., Thamada, T.T., Meira, C.A.A., Corrales, J.C., 2017. Expert system for coffee rust detection based on supervised learning and graph pattern matching. Int. J. Metadata Semant. Ontol. 12, 19–27.
Lasso, E., Valencia, O., Corrales, D.C., López, I.D., Figueroa, A., Corrales, J.C., 2018. A cloud-based platform for decision making support in colombian agriculture: A study case in coffee rust. In: Angelov, P., Iglesias, J.A., Corrales, J.C. (Eds.), Advances in Information and Communication Technologies for Adapting Agriculture to Climate Change. Springer International Publishing, Cham, pp. 182–196.
Leandro-Muñoz, M.E., Tixier, P., Germon, A., Rakotobe, V., Phillips-Mora, W., Maximova, S., Avelino, J., 2017. Effects of microclimatic variables on the symptoms and signs onset of moniliophthora roreri, causal agent of moniliophthora pod rot in cacao. PloS One 12, e0184638.
Leguizamon, C., et al., 1985. Contribution à la connaissance de la résistance incomplète du caféier arabica (coffea arabica l.) à la rouille orangée (hemileia vastatrix berk. et br.), Thèse de Doctorat. ENSA, Montpellier, France.
Liaw, A., Wiener, M., 2002. Classification and regression by randomforest. R News 2, 18–22. https://CRAN.R-project.org/doc/Rnews/.
López-Bravo, D.F., Virginio-Filho, E.d.M., Avelino, J., 2012. Shade is conducive to coffee rust as compared to full sun exposure under standardized fruit load conditions. Crop Protect. 38, 21–29.
Lundberg, S.M., Lee, S.-I., 2017. A unified approach to interpreting model predictions. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems 30. Curran Associates Inc, pp. 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., Lee, S.-I., 2019. Explainable ai for trees: From local explanations to global understanding, arXiv preprint arXiv:1905.04610.
Madden, L.V., Hughes, G., van den Bosch, F., 2007. The study of plant disease epidemics. Am. Phytopathol. Soc. 0. doi:10.1094/9780890545058. https://apsjournals.apsnet.org/doi/abs/10.1094/9780890545058.
Magidson, J., 2013. Correlated component regression: re-thinking regression in the presence of near collinearity. In: Abdi, H., Chin, W.W., Esposito Vinzi, V., Russolillo, G., Trinchera, L. (Eds.), New Perspectives in Partial Least Squares and Related Methods. Springer New York, New York, NY, pp. 65–78.
McKinney, W., 2010. Data structures for statistical computing in python. In: van der Walt, S., Millman, J. (Eds.), Proceedings of the 9th Python in Science Conference, pp. 51–56.
Meira, C.A., Rodrigues, L.H., Moraes, S.A., 2008. Análise da epidemia da ferrugem do cafeeiro com árvore de decisão. Trop. Plant Pathol. 33, 114–124.
Merle, I., Pico, J., Granados, E., Boudrot, A., Tixier, P., Virginio Filho, E.d.M., Cilas, C., Avelino, J., 2019. Unraveling the complexity of coffee leaf rust behavior and development in different coffea arabica agro-ecosystems. Phytopathology. https://doi.org/10.1094/PHYTO-03-19-0094-R. PMID: 31502519.
Merle, I., Tixier, P., Virginio Filho, E.d.M., Cilas, C., Avelino, J., 2020. Forecast models of coffee leaf rust symptoms and signs based on identified microclimatic combinations in coffee-based agroforestry systems in costa rica. Crop Protect. 130, 105046. https://doi.org/10.1016/j.cropro.2019.105046. http://www.sciencedirect.com/science/article/pii/S0261219419303928.
Nutman, F., Roberts, F., Clarke, R., 1963. Studies on the biology of hemileia vastatrix berk. & br. Trans. Br. Mycol. Soc. 46, 27–44.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Raschka, S., 2018. Mlxtend: Providing machine learning and data science utilities and extensions to python’s scientific computing stack. J. Open Source Softw. 3. https://doi.org/10.21105/joss.00638. http://joss.theoj.org/papers/10.21105/joss.00638.
Rayner, R., 1961. Germination and penetration studies on coffee rust (hemileia vastatrix b. & br). Ann. Appl. Biol. 49, 497–505.
Rendall, R., Castillo, I., Schmidt, A., Chin, S.-T., Chiang, L.H., Reis, M., 2019. Wide spectrum feature selection (wise) for regression model building. Comput. Chem. Eng. 121, 99–110.
Ribeiro, I.J.A., Monaco, L.C., Tisseli Filho, O., Sugimori, M.H., 1978. Efeito de alta temperatura no desenvolvimento de hemileia vastatrix em cafeeiro suscetível.
Rivillas, C., Serna, C., Cristancho, M., Gaitan, A., 2011. La Roya del Cafeto en Colombia. Impacto, manejo y costos de control, Cientific bot036, Cenicafe.
Rossi, E., Montagnini, F., de Melo Virginio Filho, E., 2011. Effects of management practices on coffee productivity and herbaceous species diversity in agroforestry systems in costa rica. In: Agroforestry as a tool for landscape restoration. Nova Science Publishers, New York, pp. 115–132.
Rückstieß, T., Osendorfer, C., van der Smagt, P., 2011. Sequential feature selection for classification. In: Australasian Joint Conference on Artificial Intelligence. Springer, pp. 132–141.
Santacreo, R., Reyes, P., Oseguera, S., et al., 1984. Estudio del desarrollo de la roya del cafeto hemileia vastatrix berk & br. y su relación con factores biológicos y climáticos en condiciones de campo de dos zonas cafetaleras de honduras, ca, in: 6. Simposio Latinoamericano sobre Caficultura, 24-25 Nov 1983, Panamá (Panamá), IICA ICCR-340, IICA, San José (Costa Rica). PROMECAFE.
Sharif, M., Khan, M.A., Iqbal, Z., Azam, M.F., Lali, M.I.U., Javed, M.Y., 2018. Detection and classification of citrus diseases in agriculture based on optimized weighted segmentation and feature selection. Comput. Electron. Agric. 150, 220–234. https://doi.org/10.1016/j.compag.2018.04.023. http://www.sciencedirect.com/science/article/pii/S0168169917306373.
Sutton, J., Gillespie, T., Hildebrand, P., 1984. Monitoring weather factors in relation to plant disease [crop microclimate, electrical sensors, temperature and wetness gauges, sources of error]. Plant Diseases.
Talhinhas, P., Batista, D., Diniz, I., Vieira, A., Silva, D.N., Loureiro, A., Tavares, S., Pereira, A.P., Azinheira, H.G., Guerra-Guimarães, L., et al., 2017. The coffee leaf rust pathogen hemileia vastatrix: one and a half centuries around the tropics. Molecul. Plant Pathol. 18, 1039–1051.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc.: Ser. B (Methodol.) 58, 267–288.
Venkatesh, B., Anuradha, J., 2019. A review of feature selection and its methods. Cybernet. Inf. Technol. 19, 3–26.
Villarreyna, R., Barrios, M., Vílchez, S., Cerda, R., Vignola, R., Avelino, J., 2019. Economic constraints as drivers of coffee rust epidemics in nicaragua. Crop Protect 104980.
Waller, J., 1982. Coffee rust – epidemiology and control. Crop Protect. 1, 385–404.
Waller, J.M., Bigger, M., Hillocks, R.J., 2007. Coffee Pests, Diseases and their Management. CABI.
Zambolim, L., 2016. Current status and management of coffee leaf rust in brazil. Trop. Plant Pathol. 41, 1–8.