Spatial Data Analysis of Fusarium. David Brown

Master Thesis
submitted within the UNIGIS MSc programme

at Z_GIS
University of Salzburg
Spatial Data Analysis of Fusarium Wilt Incidence in the District of San Luis de Shuaro, Peru
by
David Brown Fuentes

1123727
A thesis submitted in partial fulfilment of the requirements of the degree of
Master of Science (Geographical Information Science & Systems) – MSc (GISc)
Advisor:
Carlos Mena, PhD
Turrialba, August 2014

Dedicated to my wife and my parents.
Abstract
A Spatial Data Analysis was conducted on the disease incidence of Fusarium Wilt of banana in the
district of San Luis de Shuaro, Peru. The dataset was consolidated from raw data and comprises 76
records, each one representing a farm sampled for Fusarium Wilt of banana. The spatial distribution
was represented through a map of points, each one representing a sampled farm. Thiessen Polygons
were used as a basic interpolation method to obtain a general representation of the disease
distribution in the study area. The Spatial Data Analysis includes spatial autocorrelation tests,
which were applied through Moran’s I, Geary’s c and a modified version of Moran’s I using an
Empirical Bayes Estimate. The relationship between the Disease Incidence of Fusarium Wilt of Banana
and possible explanatory variables (altitude, soil pH, plant density, farm type, farm size, banana
variety, shade percentage, slope and the Soil Quality Index) was also analysed through a GAMLSS
model with a beta-binomial distribution. The results show no evidence of spatial autocorrelation in
the disease incidence. The regression model shows that the most significant explanatory variables
of the disease incidence are soil pH, altitude, plant density and the Soil Quality Index.
Acknowledgements
I want to recognize the support of Bioversity International for allowing me to join the Master
Degree Program at UNIGIS, especially Miguel Dita, Charles Staver, Stephan Weise and Dietmar
Stoian for supporting the development of my career in GIS at Bioversity International.
I want to recognize the valuable help of Karl Atzmanstorfer as my thesis advisor and of Gunda
Cespedes during the revision process of my thesis, both at the University of Salzburg.
I thank Jacob van Etten for his comments and for lending me a very useful book, and Philippe Tixier
for all his comments and suggestions.
A special acknowledgement goes to Carlos Román Jerí for the hard work he did in San Luis de Shuaro,
Peru, collecting data for his research, which also allowed me to conduct this work.
Finally, and most importantly, I want to acknowledge the support and valuable suggestions of my wife
Karol Paola during the development of my thesis.
Table of contents
Abstract
Acknowledgements
Table of contents
List of Figures
List of Tables
1. Introduction
1.1 Motivation
1.2 Problem Description
1.3 Objectives
1.4 Hypothesis
1.5 Scope
2. Literature review
2.1 Fusarium Wilt
2.2 Disease Incidence
2.3 Spatial Data Analysis
2.3.1 Areal Analysis
2.3.2 Geostatistical Analysis
2.3.3 Point Pattern Analysis
2.3.4 Spatial Autocorrelation
2.4 Exploratory Spatial Data Analysis
2.5 Measures of Spatial Autocorrelation
2.6 Moran’s I using Empirical Bayes Estimates
2.7 Semivariogram
2.8 Thiessen Polygons
2.9 Spatial Data Analysis Applied to Plant Disease Incidence
2.10 Linear Regression Model and some considerations about it
2.11 Generalised Linear Models
3 Methodology
3.1 Case study area: San Luis de Shuaro
3.2 Workflow diagram
3.3 Data preparation
3.5 Mapping the Spatial Distribution of Disease Incidence
3.5.1 Points Map
3.5.2 Thiessen Polygons
3.6 Data Exploration
3.6.1 Histogram
3.6.2 Normal Q-Q Plot
3.6.3 Spatial autocorrelation tests
3.6.3.1 Neighbour List and Spatial Weights
3.6.3.2 Moran’s I
3.6.3.3 Geary’s C
3.6.3.4 Modified Moran’s I – Empirical Bayes Index
3.7 GAMLSS applied to Fusarium Wilt Disease Incidence
4 Results and Analysis
4.1 Map of the Spatial Distribution of Disease Incidence in the Study Area
4.2 Results of Exploratory Data Analysis
4.2.1 Histogram
4.2.2 Normal Q-Q Plot
4.2.3 Skewness and Kurtosis
4.2.4 Neighbours list and Spatial Weights
4.2.5 Results of Moran’s I Test
4.2.6 Results of Geary’s c Test
4.2.7 Results of Global Moran’s I using Empirical Bayes Estimates
4.3 Results of the GAMLSS
4.4 Analysis of the Results
4.4.1 Spatial Distribution of Fusarium Wilt in the study area
4.4.2 Spatial Autocorrelation Tests
4.4.3 GAMLSS
4 Conclusions
5 References
6 Annexes
Annex 1 – R Code utilised for test spatial autocorrelation
Annex 2 – R Code utilised for the GAMLSS
List of Figures
Figure 1 – Location of San Luis de Shuaro District (David Brown, ArcGIS for Desktop 10.2)
Figure 2 – Elevation of the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 3 – Location of banana farms in the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 4 – Workflow diagram (David Brown, Microsoft Word 2010)
Figure 5 – Symbols and classification utilised to map the distribution of the measured disease incidence (David Brown, ArcGIS for Desktop 10.2)
Figure 6 – Map of the Spatial Distribution of Fusarium wilt incidence in the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 7 – Thiessen Polygons constructed as from the points of each sample farm (David Brown, ArcGIS for Desktop 10.2)
Figure 8 – Histogram of disease incidence rate
Figure 9 – Histogram of a normal distribution from simulated data
Figure 10 – Normal Q-Q Plot of the disease incidence
Figure 11 – Normal Q-Q Plot for a normal distribution for simulated data
Figure 12 – a) Delaunay triangulation, b) Sphere of Indifference, c) Distance based
Figure 13 – a) k nearest neighbour with k = 1, b) k nearest neighbour with k = 5, c) k nearest neighbour with k = 10
Figure 14 – Plots of the residuals for model validation
List of Tables
Table 1 – Categories of Spatial Data Analysis accordingly to the spatial data type
Table 2 – Main Characteristics and Differences between Moran’s I and Geary’s c, based on Fortin et al. (2002)
Table 3 – Variables included in the dataset consolidated from the raw data provided by Román Jerí (2012)
Table 4 – Results of Moran’s I test computations using different neighbours list methods
Table 5 – Results of Geary’s c test computations using different neighbours list methods
Table 6 – Results of Moran’s I with Empirical Bayes Estimates computations using different neighbours list methods
Table 7 – Results of the GLM model with a Binomial distribution and all the proposed explanatory variables
Table 8 – Results for the base model with all the proposed variables
Table 9 – Results of the adjusted model after applying the variable selection using stepGAIC
1. Introduction
1.1 Motivation
Bananas and plantains are very important crops in developing countries, both as a staple food and
as a commodity (Arias, Dankers, Liu & Pilkauskas, 2003). They are cultivated in more than 100
countries around the world, in both tropical and subtropical regions (Frison and Sharrock, 1998).
Often referred to simply as bananas, they are grown in two typical scenarios: production for the
export market, and small-scale production for local markets and as a staple. The former is
characterized by high input applications and just a few banana varieties (Arias, Dankers, Liu &
Pilkauskas, 2003).
Like any other crop, bananas and plantains are affected by diseases which can reduce or totally
inhibit production. One of the most important is Fusarium wilt of banana, a soilborne fungal
disease caused by Fusarium oxysporum f. sp. cubense, which affects different banana cultivars
(Brayford, 1992). In production for the export market, Fusarium Wilt ceased to be a problem with
the change from the susceptible variety Gros Michel to the resistant Cavendish. However, many small
farmers still grow susceptible varieties despite the risk that Fusarium Wilt represents to their
production, mainly because most of the susceptible varieties are highly appreciated by consumers in
local markets. With the appearance and spread of a new race of Fusarium Wilt, known as Tropical
Race 4, which affects the Cavendish variety, the disease is regaining importance in production for
the export market as well. The study of Fusarium Wilt is therefore now very important both to the
banana export industry and to small farmers in developing countries. The analysis of spatial data
on Fusarium wilt of banana is relevant to a better understanding of the factors that affect the
incidence of the disease, which could support the design of disease assessment and management
methodologies.
1.2 Problem Description
The district of San Luis de Shuaro in Peru is a zone where small farmers grow different banana
varieties as their main economic activity (Román Jerí, 2012), and many of these cultivated varieties
are susceptible to Fusarium wilt. Since most infected plants fail to produce a banana bunch, each
diseased plant represents an economic loss to the farmer. Although it has some distinctive
characteristics, the region of San Luis de Shuaro presents conditions similar to those of other
regions in Latin America and the Caribbean, which in general are characterized by the diversity of
their production systems, ranging from monocrop to agroforestry systems with mixed crops. The
findings of the present work could provide valuable insights for understanding this complex disease,
not only in the case study region but also in others with similar conditions and characteristics.
1.3 Objectives
General Objective: To conduct a spatial data analysis of Fusarium wilt incidence in the district of
San Luis de Shuaro, Peru.
Specific Objectives
• Map the spatial distribution of the disease incidence rate of Fusarium Wilt of banana in
the case study area
• Conduct an Exploratory Data Analysis including three different tests for Spatial
Autocorrelation: Moran’s I, Geary’s c and Moran’s I using Empirical Bayes Estimates
• Model the relationship between Fusarium Wilt incidence and a set of explanatory
variables using a GAMLSS (Generalised Additive Models for Location Scale and Shape)
• Analyse the relationship between Fusarium Wilt incidence and the variables selected by the
implemented GAMLSS
Research Questions
1) How could the spatial distribution of the disease incidence be represented using GIS
software and cartographic techniques to provide a general description of the region in terms
of incidence levels?
2) Are the sampled farms in the study area spatially autocorrelated with respect to disease
incidence?
3) Which combination of model and probability distribution is the most reliable for analysing
the relationship between Fusarium Wilt incidence and a set of explanatory variables?
4) Which factors influence Fusarium Wilt incidence in the study area?
1.4 Hypothesis
H0: There is no spatial autocorrelation of Fusarium Wilt incidence among the sampled farms
in the area of San Luis de Shuaro District.
H1: There is spatial autocorrelation of Fusarium Wilt incidence among the sampled farms in
the area of San Luis de Shuaro District.
1.5 Scope
The present work analyses the spatial distribution of Fusarium wilt incidence using spatial
statistics to determine whether neighbouring farms are more likely to exhibit similar levels of
incidence, which could lead to a better understanding of the disease dynamics. It also analyses the
relationship between a set of proposed variables and the disease incidence. The study comprises
information from 76 farms in the region of San Luis de Shuaro, Peru. More specific and technical
details can be found in Chapter 3. The mapping of the spatial distribution is a graphical
representation of the disease incidence using cartographic techniques, with the aim of providing a
general description of how the different levels of disease incidence are distributed across the
study region. Geographical location data were available only at farm level, so the spatial analysis
is limited to the spatial relationship between sampled farms; the relationship between plants
within each sampled farm is out of the scope of this work. In addition to the spatial analysis per
se, the results of the spatial autocorrelation tests between farms are a key input for analysing
the relationship between the set of proposed variables and the disease incidence through a
Generalised Linear Model. In this respect, the present work outlines how further analyses with
similar inputs and goals could be conducted, especially by suggesting a reliable combination of
regression model and probability distribution. Finally, the study determines a set of explanatory
variables that influence the disease incidence, based on the statistical results of the Generalised
Linear Model. This provides valuable inputs for future studies focused on the most influential
factors and contributes to the development of better methodologies and strategies for both the
assessment of a suspected area and the management of a confirmed infected area.
2. Literature review
This section condenses the concepts behind this work, from the description of the disease to the
concepts of Spatial Data Analysis applied to analyse it, drawing on both scientific papers and
theoretical books. A list of suggested literature is also provided to help readers find other works
that apply Spatial Data Analysis to plant disease incidence.
2.1 Fusarium Wilt

“Panama disease of bananas is historically one of the most infamous plant diseases, destroying the
banana production industries in areas of Central America where the highly susceptible banana
cultivar Gros Michel predominated from ca. 1900–1955” (Brayford, 1992, p. 1).
Also known as Panama disease, Fusarium Wilt is a soilborne fungal disease which infects the plant
through the roots, spreading to the xylem and causing vascular browning in the pseudostem
(Brayford, 1992). Typical symptoms include vascular discolouration and yellowing of the leaves
(Pérez-Vicente, Dita & Martinez de la Parte, 2014). The fungus can survive up to 30 years in the
soil in the absence of banana (Ploetz, 2006). The disease can be dispersed through infected plant
material, water and soil (Pérez-Vicente et al., 2014).
As indicated by Pérez-Vicente et al. (2014), a susceptible banana plant infected with Fusarium wilt
is very unlikely to recover, and if it does its growth will be deficient. According to Ploetz
(2006), the options available for managing this disease are scarce, genetic resistance being the
most effective.
Its importance in the global banana market was reduced in 1950, when Latin America shifted from the
susceptible Gros Michel variety to the resistant Cavendish (Pérez-Vicente et al., 2014). However,
susceptible varieties are still cultivated on a small scale, especially by smallholders who mix
them with other crops like coffee, cacao and trees in agroforestry systems (Pérez-Vicente et al.,
2014). This situation has also affected Fusarium Wilt research, which according to Lichtemberg,
Pocasangre, Staver, Dold and Sikora (2010) comprises two eras: the Gros Michel era, when efforts
focused on possible origins and the epidemics in the American tropics, and the Cavendish era, when
efforts focused on the diversity of the pathogen rather than on the disease.
2.2 Disease Incidence

Madden, Hughes and van den Bosch (2007) define disease incidence as the proportion of plants (or
plant units) diseased, that is, the number of diseased plants (or plant units) out of the total
assessed. The same authors indicate that disease incidence can be measured at different scales
depending on the plant units used. In this work, the plant unit is an individual plant. Within this
context, “disease incidence provides an estimate of the probability of infection” (Hughes, Munkvold
& Samita, 1998), and “it is the most common records contained spatial plant disease data
encountered in phytopathological literature” (Madden et al., 2007).
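
As a minimal sketch of this definition, the Python snippet below computes the incidence of a farm as the number of diseased plants divided by the number of plants assessed. The function name and the farm counts are hypothetical and are not taken from the thesis dataset.

```python
def disease_incidence(diseased, assessed):
    """Proportion of diseased plant units out of the total assessed."""
    if assessed <= 0:
        raise ValueError("assessed must be a positive count")
    if not 0 <= diseased <= assessed:
        raise ValueError("diseased must lie between 0 and assessed")
    return diseased / assessed


# Hypothetical (diseased, assessed) plant counts for three farms.
farm_counts = {"farm_a": (12, 400), "farm_b": (0, 250), "farm_c": (45, 300)}

incidence = {
    name: disease_incidence(diseased, assessed)
    for name, (diseased, assessed) in farm_counts.items()
}

for name, value in incidence.items():
    print(f"{name}: {value:.3f}")
```

Keeping the two counts, rather than only the ratio, matters later on: farms with few assessed plants give noisier incidence estimates than farms with many.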
2.3 Spatial Data Analysis

Spatial Data Analysis is the area of Spatial Analysis where statistical techniques are developed and
applied to analyse spatial data (Haining, 2003). According to Bailey and Gatrell (as cited in
Pfeiffer, 1996, p.83), methods used in spatial data analysis can be categorized as:
a) Methods for visualizing data
b) Methods for exploratory data analysis
c) Methods for development of statistical models
Following the classification of Cressie (as cited in Plant, 2012, p. 5), there are three categories
of spatial data: geostatistical, areal and point pattern. As presented by Krivoruchko (2011, p. 22),
these categories correspond to continuous, aggregated and discrete data respectively. Table 1
summarizes each type of data and the corresponding type of spatial data analysis.
Table 1 – Categories of Spatial Data Analysis accordingly to the spatial data type
Type of data           Spatial Data Analysis
Discrete               Point Pattern Analysis
Aggregated or Areal    Lattice or Areal
Continuous             Geostatistical
2.3.1 Areal Analysis

According to Plant (2012, p. 5), “… areal data consist of data that are defined only at a set of
locations, which may be points or polygons”. The main objective of analysing areal data is to detect
and explain spatial patterns, sometimes including their relationship with covariates (Pfeiffer, 1996).
2.3.2 Geostatistical Analysis

Geostatistical data are spatially continuous. The main objectives of analysing them are to describe
the spatial variation of an attribute variable (Pfeiffer, 1996) and to interpolate the value of the
measured attribute at points where it was not measured (Plant, 2012, p.5).
2.3.3 Point Pattern Analysis

As shown in Table 1, point pattern analysis deals with discrete data and analyses the pattern of
the registered locations. Typically this pattern is analysed through what is called analysis of
clustering (Pfeiffer, 1996).
2.3.4 Spatial Autocorrelation

“Everything is related to everything else, but near things are more related than distant things” –
Tobler’s First Law of Geography. (Tobler, 1970)
Tobler’s first law of geography is a recurrent citation in the spatial analysis literature and so
far remains the best way to define spatial autocorrelation.
Griffith (2009) defines spatial autocorrelation as “…the correlation among values of a single
variable strictly attributable to their relatively close locational positions on a two-dimensional (2-D)
surface, introducing a deviation from the independent observations assumption of classical
statistics”.
Miron (as cited in Plant, 2012, p.423) presents three different sources of spatial autocorrelation
in regression models:
a) Interaction
b) Reaction
c) Model misspecification
2.4 Exploratory Spatial Data Analysis

Haining and Cressie (as cited in Haining, 2003, p.182) define Exploratory Spatial Data Analysis
(ESDA) as a set of techniques to:
a) Explore spatial data
b) Summarize spatial properties of data
c) Detect spatial patterns in data
d) Formulate hypotheses related to the geography of the data
Exploratory spatial data analysis includes both visual and numerical methods. Visual methods
include histograms, Q-Q plots, boxplots and scatterplots (Haining, 2003, p.189). Numerical
methods include spatial autocorrelation tests like Moran’s I, which explore whether the data are
clustered or dispersed (Haining, 2003, p.226).
2.5 Measures of Spatial Autocorrelation

Moran’s I and Geary’s c are both indices used to test the null hypothesis of zero spatial
autocorrelation (Plant, 2012, p. 104). Their main features and differences are presented in
Table 2, based on Fortin, Dale and ver Hoef (2002).
Table 2 – Main Characteristics and Differences between Moran’s I and Geary’s c, based on Fortin et al. (2002)
Characteristic | Moran’s I | Geary’s c
How it is computed | Degree of correlation between values of a variable as a function of spatial lags | Difference among
Zero Spatial Autocorrelation | The expected value for zero autocorrelation is nearly zero, although more formally it is -1/(n - 1), with n as the number of areal units | The expected value for zero autocorrelation is 1
Positive Spatial Autocorrelation | Values near 1 | Values near 0
Negative Spatial Autocorrelation | Values near -1 | Values near 2
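
To make the two indices concrete, the following Python sketch computes both on a toy example. It assumes the usual textbook forms, I = (n / S0) · Σ w_ij (x_i − x̄)(x_j − x̄) / Σ (x_i − x̄)² and c = ((n − 1) / (2 S0)) · Σ w_ij (x_i − x_j)² / Σ (x_i − x̄)², with S0 the sum of all weights. The four values and the binary neighbour matrix (four units along a line) are hypothetical and only illustrate the calculation.

```python
def morans_i(x, w):
    """Moran's I for values x and a square spatial weights matrix w."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


def gearys_c(x, w):
    """Geary's c for values x and a square spatial weights matrix w."""
    n = len(x)
    mean = sum(x) / n
    s0 = sum(sum(row) for row in w)
    sq_diff = sum(w[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))
    return ((n - 1) / (2 * s0)) * sq_diff / sum((v - mean) ** 2 for v in x)


# Hypothetical example: four units in a row, each neighbouring the adjacent ones.
values = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

i_obs = morans_i(values, weights)         # about 0.333
c_obs = gearys_c(values, weights)         # 0.3
i_expected = -1 / (len(values) - 1)       # expectation of I under zero autocorrelation

print(i_obs, c_obs, i_expected)
```

Because the toy values increase steadily along the line, neighbours are similar: Moran’s I lies above its null expectation of -1/(n - 1) and Geary’s c lies below 1, both pointing to positive spatial autocorrelation.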
2.6 Moran’s I using Empirical Bayes Estimates

The traditional calculation of Moran's I for disease cases does not account for population
heterogeneity, so that, its application to disease rates or proportions may result in indication of
spatial correlation that is completely due to the spatial proximity of population sizes, but not
due to the similarity among the disease rates. (Jackson, Huang, Xie & Tiwari, 2010)
Assunção and Reis (1999) propose an Empirical Bayes modification of the calculation of Moran’s I
for rates calculated from populations of different sizes, which is the case of the data analysed
in the present work.
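
The idea can be sketched in Python as follows. The snippet standardises each rate by a variance that depends on the population size, assuming the commonly described form z_i = (p_i − b) / sqrt(v_i), with p_i the observed rate, b the overall rate, v_i = a + b / n_i, a = s² − b / (mean population) and s² the population-weighted variance of the rates. The fallback for negative v_i and all counts below are assumptions made for illustration; they are not taken from the thesis or its dataset.

```python
import math


def eb_standardised_rates(cases, population):
    """Standardise rates so that small populations do not dominate Moran's I."""
    m = len(cases)
    total_pop = sum(population)
    b = sum(cases) / total_pop                       # overall rate
    if b == 0:
        raise ValueError("no cases: standardised rates are undefined")
    rates = [c / n for c, n in zip(cases, population)]
    s2 = sum(n * (p - b) ** 2 for p, n in zip(rates, population)) / total_pop
    a = s2 - b / (total_pop / m)
    z = []
    for p, n in zip(rates, population):
        v = a + b / n
        if v < 0:                                    # assumed guard for negative variance
            v = b / n
        z.append((p - b) / math.sqrt(v))
    return z


# Hypothetical diseased-plant counts and plants assessed on three farms.
cases = [2, 10, 5]
plants = [50, 400, 120]
z_scores = eb_standardised_rates(cases, plants)
print(z_scores)
```

The standardised values would then replace the raw rates in the Moran’s I calculation, with the same spatial weights.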
2.7 Semivariogram
From the geostatistics point of view, spatial autocorrelation is assessed through the semivariogram
(Nelson, Orum, Jaime-Garcia & Nadem, 1999). The semivariogram is half the variance of the
difference between the values measured at two locations separated by a distance h, which is called
the spatial lag, and is given by the following function (Plant, 2012, p. 117):
\[ \gamma(h) = \frac{1}{2} \operatorname{var}\left[ Y(x + h) - Y(x) \right] \]
The semivariogram is commonly represented as a plot of the semivariance as a function of distance
(Bivand, Pebesma & Gomez-Rubio, 2008). In practice, rather than using the theoretical function, the
semivariogram is commonly estimated through the experimental semivariogram, which is given by the
following function (Plant, 2012, p. 118):
\[ \hat{\gamma}(h) = \frac{1}{2 m(h)} \sum_{i=1}^{m(h)} \left[ Y(x_i + h) - Y(x_i) \right]^2 \]
Bachmaier and Backes (2008) clarify the terms variogram and semivariogram and the differences
between them.
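
As an illustration, the experimental semivariogram can be sketched in Python for the simplest possible case: values measured at equally spaced points along a transect, so that the lag h is an integer number of steps and m(h) is the number of pairs separated by that lag. The transect values are hypothetical, and irregularly located data such as farm locations would additionally require grouping pairs into distance classes, which this sketch does not do.

```python
def experimental_semivariogram(values, lag):
    """Half the mean squared difference between values `lag` steps apart."""
    pairs = len(values) - lag                 # m(h): number of pairs at this lag
    if lag < 1 or pairs < 1:
        raise ValueError("lag must be between 1 and len(values) - 1")
    squared = sum((values[i + lag] - values[i]) ** 2 for i in range(pairs))
    return squared / (2 * pairs)


# Hypothetical measurements at equally spaced points along a transect.
transect = [1.0, 3.0, 2.0, 5.0, 4.0]

gamma_1 = experimental_semivariogram(transect, 1)   # (4 + 1 + 9 + 1) / 8 = 1.875
gamma_2 = experimental_semivariogram(transect, 2)   # (1 + 4 + 4) / 6 = 1.5

print(gamma_1, gamma_2)
```

Plotting these values against the lag gives the semivariogram plot described above.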
2.8 Thiessen Polygons

Also known as Proximity Polygons or Voronoi Maps (O’Sullivan & Unwin, 2010, p. 50), Thiessen
Polygons are considered by Plant (2012, p.163) a useful interpolation method when locations are too
sparse and irregular. They are areas computed geometrically from a set of points, and only the
distance to each point is taken into account to construct each polygon. More formally, according to
O’Sullivan and Unwin (2010, p. 50), “a polygon of any entity is that region of the space
which is closer to the entity than it is to any other”. Although they are useful, as pointed out by
Plant, their limitations should be taken into account, and exaggerated conclusions should not be
drawn from the resulting polygons alone.
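
The defining property of Thiessen Polygons can be sketched in Python without building the polygons themselves: any location belongs to the polygon of its nearest sample point and takes that point's value. The farm names, coordinates and incidence values below are hypothetical, and planar coordinates are assumed so that Euclidean distance is meaningful.

```python
import math


def nearest_site(query, sites):
    """Name of the sample point closest to `query` (Euclidean distance)."""
    return min(sites, key=lambda name: math.dist(query, sites[name]))


# Hypothetical farm coordinates (planar units) and measured incidence.
sites = {"farm_a": (0.0, 0.0), "farm_b": (10.0, 0.0), "farm_c": (0.0, 10.0)}
incidence = {"farm_a": 0.03, "farm_b": 0.00, "farm_c": 0.15}

location = (7.0, 2.0)
owner = nearest_site(location, sites)     # the polygon this location falls in
estimate = incidence[owner]               # value assigned by the interpolation

print(owner, estimate)
```

The sketch also shows the limitation mentioned above: the estimate jumps abruptly at polygon boundaries and ignores every sample except the nearest one.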
2.9 Spatial Data Analysis Applied to Plant Disease Incidence

Bivand et al. (2008, p. 311) indicate that “displaying the spatial variation of the incidence of a
disease can help us to detect areas where the disease is particularly prevalent, which may lead to the
detection of previously unknown risk factors”. Although this statement was made from the point of
view of human health, it is to some extent also valid for plant disease epidemiology. In the study
of a plant disease, both the spatial location and the disease status of the sampling units must be
known in order to conduct a spatial autocorrelation analysis (Madden et al., 2007).
A considerable number of studies applying spatial data analysis to plant disease were found in the
literature, including Nelson et al. (1999), who present some applications of GIS and geostatistics
to plant disease epidemiology. Selvaraja, Balassundram, Vadamalai and Husni (2012) applied
geostatistics to analyse the spatial variability of the Orange Spotting Disease in oil palm. Talei,
Safaie and Aghajani (2013) studied the spatial distribution of Soybean Charcoal Rot incidence using
geostatistics, more specifically interpolation by ordinary kriging. Alves and Pozza (2010), on the
other hand, proposed the use of indicator kriging to study the spatial variability of common bean
anthracnose. Del Ponte, Shah and Bergstrom (2003) analysed the spatial patterns of Fusarium head
blight using the index of dispersion through a beta-binomial distribution. Although that index does
not involve the spatial location of the sampled data, according to this literature review it is a
common methodology in plant pathology for estimating aggregation or dispersion patterns; relevant
works are Hughes and Madden (1993), Madden and Hughes (1994) and Hughes, Madden and Munkvold
(1996). Nelson, Felix-Gastelum, Orum, Stowell and Myers (1994) applied geostatistics to design and
validate the regional plant virus management programs in the Del Fuerte Valley, located in Sinaloa,
Mexico. Guzmán-Plazola, Gómez-Pauza, García-Espinosa and Gavi-Reyes (2004) applied geostatistics to
interpolate the spatial distribution of Fusarium solani f. sp. phaseoli, which is the cause of root
rot on common bean. Musoli et al. (2008) conducted a spatial and temporal analysis of coffee wilt
disease, which is caused by Fusarium xylarioides, also using a geostatistical approach. Oerke,
Meier, Dehne, Sulyok, Krska and Steiner (2010) analysed the spatial variability of Fusarium head
blight pathogens in wheat crops using the Spatial Analysis by Distance IndicES (SADIE) and Lloyd’s
index of patchiness.
No studies applying Spatial Data Analysis to the incidence of Fusarium Wilt of banana were found in
the search for previous work with similar objectives. Although this does not prove that none exist,
it suggests that studies of this kind are scarce. Lichtemberg et al. (2010) analysed
Fusarium Wilt incidence at smallholder level in Nicaragua using classical statistical methods
such as Pearson correlation and ANOVA. Plotting of the farms and a comparison of two different zones
were also conducted in that work. Román Jerí (2012) conducted a study in which geographical
characteristics were considered for targeting the farms to be analysed, and the spatial
distribution of Fusarium Wilt incidence was then plotted for graphical representation. Although the
study of Román Jerí (2012) does not include a strong component of spatial data analysis, the data
collected there is the basis of the present work.
As Schabenberger and Pierce (as cited in Madden et al., 2007, p. 15) indicate, “…disease incidence is
a count with a natural denominator, which could be converted into proportions”. This is important
to take into account when selecting the appropriate type of statistical analysis (Madden et al., 2007).
For example, Madden and Hughes (1994) indicate that distributions such as the Poisson and the negative
binomial are generally inappropriate for analysing disease incidence (rate) data, and propose the
beta-binomial distribution instead. Krivoruchko (2011) points out that typical indices used
to measure spatial autocorrelation, such as Moran’s I and Geary’s c, are commonly applied to rates even
though these indices assume that the mean and variance of the data are constant, a condition that is
difficult to meet in rate data such as disease incidence. Paulitz, Zhang and Cook (2003) apply what
they call a Spatial Generalised Mixed Model to account for spatial autocorrelation in a spatial point
pattern analysis and also to interpolate disease incidence rates. However, this
methodology should be approached with caution because of the high level of complexity of a GLMM
(Generalised Linear Mixed Model), which is a challenge even for statisticians (Bolker et al., 2009).
2.10 Linear Regression Model and some considerations about it

Linear Regression Models are frequently used in statistical analysis (Kongchouy, Choonpradub &
Kuning, 2010). They help researchers to explore the relationship between variables and to assess
how well a set of independent variables predicts a dependent variable (Urdan, 2010, p. 145).
Zuur, Ieno, Walker, Saveliev and Smith (2009, p. 17) call linear regression “the mother of all
models”. However, like any other model, it has limitations that should be taken into account in
order to apply it correctly. The following equation is reproduced from Zuur et al. (2009, p. 17)
and shows the linear regression model.
$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
where
$$\varepsilon_i \sim N(0, \sigma^2)$$
Following the explanation of Zuur et al. (2009, p. 17), $Y_i$ is the response or dependent variable
and $X_i$ is the explanatory or independent variable. The information that is not explained by the
model is captured by the residuals, represented in the equation by $\varepsilon_i$, while $\alpha$
and $\beta$ represent the population intercept and slope respectively; both are unknown parameters.
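As an illustration of how the unknown intercept and slope are estimated in practice, the following minimal sketch applies ordinary least squares to a small set of hypothetical values. The data and the function name `fit_simple_regression` are assumptions made for this example and do not come from the thesis dataset.

```python
def fit_simple_regression(x, y):
    """Estimate the intercept and slope of Y = alpha + beta * X by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                 # estimated slope
    alpha = mean_y - beta * mean_x   # estimated intercept
    # Residuals: the part of Y that the fitted line does not explain.
    residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    return alpha, beta, residuals

# Hypothetical explanatory and response values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
alpha, beta, residuals = fit_simple_regression(x, y)
print("intercept:", round(alpha, 3), "slope:", round(beta, 3))
```

The residuals returned by the function are the quantities on which the assumptions of the linear regression model are usually checked.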
According to Zuur et al. (2009, p. 19), there are five assumptions that should be considered in
order to apply linear regression correctly:
1. Normality: Linear Regression assumes that the data has a normal distribution. In this
sense, normality means that when a plot of the frequency of the cases (on the y axis) against the
score of the variable of interest (on the x axis) is constructed, it will exhibit a bell-shaped
curve (Urdan, 2010, p. 10).
2. Homogeneity: The homogeneity assumption means that the spread of the data should be the
same at each X value. When this condition is not met, it is called
heteroskedasticity (Bivand et al., 2008, p. 274) or heterogeneity (Zuur et al., 2009, p. 20).
3. Fixed X: This assumption means that the explanatory variables are deterministic and not
random variables (Zuur et al., 2009, p. 21).
4. Independence: According to Zuur et al. (2009, p. 21), independence means that the Y value
at $X_i$ is not influenced by other $X_i$, and violating this assumption causes the most serious
problems. Zuur et al. also point out that there are two ways in which the independence
assumption can be violated: by applying an improper model, or through a dependence structure due
to the nature of the data. Such a dependence structure could be temporal or spatial,
the latter being of special interest in the present work.
5. Correct Model Specification: This assumes that the explanatory variables have been
correctly selected.
Authors differ on how to proceed when one of these assumptions is violated. For example, when the
normality assumption is violated, some apply a transformation to the data, such as the logarithmic
transformation, to try to obtain the desired normal distribution. At the other extreme are those who
prefer to switch to another model without transforming the data. Other models include Generalised
Linear Models, Generalised Additive Models, Generalised Least Squares, etc. Each of these tackles
violations of the linear regression assumptions in a different way, depending on which of the
assumptions is violated.
The approach selected for the present work follows the suggestion of Zuur et al. (2009, p. 19):
“Always apply the simplest statistical technique on your data, but ensure it is applied correctly”.
2.11 Generalised Linear Models

When the analysed data does not fulfil the requirements of the Linear Regression Model,
GLMs (Generalised Linear Models) are presented as the most convenient solution (Plant, 2012,
p. 301). A GLM basically consists of three distinct parts (Crawley, 2007, p. 512):
a) The error structure
b) The linear predictor
c) The link function
The error structure refers to the type of distribution of the error in the analysed data, which could
also be seen as the distribution of the response variable. Instead of applying a transformation when
the analysed data has a non-normal distribution, a GLM allows different types of distributions to be
specified, such as binomial, Poisson, etc. (Crawley, 2007, p. 512). The linear predictor is also called
the systematic part (Zuur et al., 2009, p. 210) and in general terms is the set of explanatory variables
expressed as a function. Finally, the link function is the part that relates the systematic part to
the mean of the response variable. The implementation of a GLM consists of three steps (Zuur et
al., 2009, p. 210), which coincide with the three parts presented by Crawley (2007, p. 512):
a) An assumption of the distribution of the response variable
b) Specify the systematic part (The explanatory variables)
c) Specify the link function
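The role of the link function can be made concrete with a minimal sketch that uses the logit link, a common choice for proportions. The choice of link, the coefficients and the variable values below are assumptions made only for illustration; they are not estimates from the thesis data.

```python
import math

def logit(mu):
    """Link function: maps a mean proportion in (0, 1) to the whole real line."""
    return math.log(mu / (1.0 - mu))

def inverse_logit(eta):
    """Inverse link: maps a linear predictor back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(intercept, coefficients, values):
    """Systematic part: intercept plus the weighted sum of explanatory variables."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# Hypothetical coefficients for two explanatory variables (e.g. soil pH, altitude in km).
eta = linear_predictor(-1.0, [0.3, -0.8], [5.5, 1.2])
mu = inverse_logit(eta)
print("linear predictor:", round(eta, 3), "expected proportion:", round(mu, 3))
```

Whatever value the linear predictor takes, the inverse link keeps the expected proportion between 0 and 1, which a plain linear regression cannot guarantee.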
However, an undesirable but typical characteristic of disease incidence data is
overdispersion, which occurs when the observed variability is greater than that predicted by the
assumed distribution (Garrett, Madden, Hughes & Pfender, 2004). If overdispersion is not taken into
account, it will invalidate the statistical inference obtained from the model (Guimarães, 2005).
Approaches to account for overdispersion include the use of a maximum quasi-likelihood method
instead of maximum likelihood, and the use of discrete distributions such as the negative binomial
and the beta-binomial distributions (Garrett et al., 2004).
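One common way to gauge overdispersion in counts with a natural denominator is the Pearson chi-square statistic divided by its degrees of freedom. The sketch below does this for the simplest case, a binomial model with a single common proportion. The counts are hypothetical, and this diagnostic is only an illustration, not necessarily the test applied in the thesis.

```python
def pearson_dispersion(diseased, evaluated):
    """Pearson chi-square / degrees of freedom for a common-proportion binomial model."""
    m = len(diseased)
    p = sum(diseased) / sum(evaluated)   # pooled proportion of diseased plants
    chi_square = sum(
        (y - n * p) ** 2 / (n * p * (1.0 - p))
        for y, n in zip(diseased, evaluated)
    )
    return chi_square / (m - 1)          # one parameter (p) was estimated

# Hypothetical counts per farm: diseased plants and evaluated plants.
diseased = [0, 2, 15, 1, 30]
evaluated = [50, 40, 60, 45, 55]
phi = pearson_dispersion(diseased, evaluated)
print("dispersion statistic:", round(phi, 2))
```

Values close to 1 are consistent with binomial variability, whereas values well above 1 suggest overdispersion and favour alternatives such as the beta-binomial distribution.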
2.12 GAMLSS (Generalised Additive Models for Location Scale and Shape)
According to Stasinopoulos and Rigby (2007), GAMLSS are semi-parametric regression-type
models, mainly because they require a parametric distribution assumption for the response variable
while allowing non-parametric smoothing functions to model the parameters of the
distribution. Just as a GLM is a solution for cases that cannot be solved with Linear
Regression, a GAMLSS is the proposed solution for cases that cannot be solved with a GLM.
Most of these are cases in which the response variable does not follow an exponential family
distribution (Stasinopoulos & Rigby, 2007). More details on how it was used in the present work
are presented in the next chapter.
Concluding Remarks on Literature Review
No previous work analysing Fusarium Wilt incidence from the Spatial Data Analysis perspective was
found, which gives special relevance to this work and provides an outline of the
considerations to take into account in future studies.
Detailed information on how the concepts presented here were applied to the analysis of
Fusarium Wilt incidence from the point of view of Spatial Data Analysis is provided in the next
chapter.
3 Methodology
The proposed methodology is a combination of different methods and techniques applied
by the authors presented in the literature review; the specific details are explained in the present
chapter.
The first part consists of a description of the study area, including its elevation, major roads
and rivers. In the second section the data preparation process is presented, focusing on how the
data is treated in the present work. The third section consists of a graphical representation of
the disease incidence rate for each farm in the study area. Since no areal boundaries were available
for the sampled farms, each of them was represented by a point. The fourth section deals with the
exploration of the data using both graphical and numerical methods, the latter including tests for
spatial autocorrelation using different methods such as Moran’s I and Geary’s c. The Empirical Bayes
Estimate proposed by Assunção and Reis (1999) to improve the calculation of Moran’s I was also used
to test spatial autocorrelation, mainly because of the heterogeneity in the number of sampled plants
per farm.
Finally, the chapter explains why GAMLSS was selected and how it was used to model the
relationship between Fusarium Wilt incidence and variables such as soil pH, Altitude, Farm Size,
Slope, Farm Type, Banana Variety, Plant Density and Soil Quality Index.
3.1 Case study area: San Luis de Shuaro
According to the Instituto Nacional de Estadística e Informática – INEI (as cited in Román Jerí,
2012, p. 25), the district of San Luis de Shuaro is located in Peru, in the Chanchamayo province,
department of Junín. It is 187 km from Lima, the capital of Peru, and agriculture is its main
economic activity. Figure 1 shows the location of the San Luis de Shuaro district within Peru.
Figure 1 – Location of San Luis de Shuaro District (David Brown, ArcGIS for Desktop 10.2)
Although the study was focused on the San Luis de Shuaro district, it includes farms from outside
the official boundaries due to different criteria applied by Román Jerí (2012). According to the
hole-filled seamless SRTM data (Jarvis, Reuter, Nelson & Guevara, 2008), altitude ranges from 597
to 2021 metres above sea level in the study area, as shown in Figure 2.
To obtain the elevation for the study area, the area was delimited by drawing a rectangle that
includes all the analysed spatial points, each one representing a farm. Then, an extraction from the
SRTM elevation data was performed using the rectangle as a mask. This elevation model was used only
as a descriptive resource for the study area. The elevation attribute of each analysed farm was taken
from the data collected with the handheld GPS.
Figure 2 – Elevation of the study area (David Brown, ArcGIS for Desktop 10.2)
As can be observed in Figure 3, the San Luis de Shuaro District is divided in two by a river, which
also separates the 76 farms into two main groups. The district is also divided in two by two major
roads, which cross it near this river.
Figure 3 – Location of banana farms in the study area (David Brown, ArcGIS for Desktop 10.2)
3.2 Workflow diagram
The workflow diagram presented in Figure 4 shows the necessary processes and results obtained
during the development of the present work.
[Workflow diagram boxes: Literature review; Data preparation; Mapping the Spatial Distribution of
Fusarium Wilt; Exploratory Spatial Data Analysis; Test Spatial Autocorrelation; Generalised Linear
Model; Thiessen Polygons; Test Overdispersion; Generalised Additive Model for Location Scale and
Shape; Model Validation]
Figure 4 – Workflow diagram (David Brown, Microsoft Word 2010)
3.3 Data preparation
The data collected by Román Jerí (2012) was stored in several spreadsheets containing the location
of the assessed farms and is part of the Bioversity International dataset collection. Location was
registered in UTM format with a handheld GPS (Garmin eTrex Vista HCx) at the point in each assessed
farm where the first symptomatic plant was found (Román Jerí, 2012, p. 32).
A main dataset was consolidated in a shapefile so that it could be easily manipulated in ArcGIS and
in the R Language and Environment for Statistical Computing (R Core Team, 2014) through the package
maptools (Bivand & Lewin-Koh, 2014). The shapefile was generated from previous files in XLSX
format (Microsoft Office) containing the spatial location in UTM coordinates, along with the
variables listed in Table 2. Although it was contained in the raw data provided by Román Jerí (2012),
disease incidence was verified and recalculated from the available counts of total diseased plants
and total assessed plants for each farm. After this revision of all the records, corresponding to 76
farms, nine records differed from the original incidence rate, four of them
corresponding to zero disease incidence in the recalculated rate.
The following projection was used, as it was found to be the most appropriate according to the UTM
Grid Zones of the World compiled by Morton (2014).
Projected Coordinate System: WGS_1984_UTM_Zone_18S
Projection: Transverse Mercator
Geographic Coordinate System: GCS_WGS_1984
Datum: D_WGS_1984
Prime Meridian: Greenwich
Angular Unit: Degree
Table 3 – Variables included in the dataset consolidated from the raw data provided by Román Jerí (2012)
Variable              Description
Altitude              Altitude measured by the GPS unit
Slope                 Inclination measured with a clinometer
Farm Type             Monocrop, Mixed crops, Agroforestry System, Backyard
Planting density      Plants per hectare
Shade percentage      Shade percentage measured with a spherical densitometer
pH                    pH measured with soil tester KCB-300
Soil Quality Index    Modified from Altieri and Nicholls (2002)
Farm Size             Farm area in hectares
Variety               “Isla”, “Seda”, “Mixed”
Since the intention of the present work is not to repeat the study conducted by Román Jerí (2012)
but to analyse the data from the Spatial Data Analysis point of view, only a portion of the
information available in the original dataset consolidated by Román Jerí (2012) was used, and some
modifications were made. For example, the Soil Quality Index, which is a modified
version of the one proposed by Altieri and Nicholls (2002), was not calculated by Román Jerí, although
the individual variables that compose that index were used in his analysis. The main difference
between the index calculated here and the one proposed by Altieri and Nicholls (2002) is that two
variables (Humidity retention and Water infiltration) were not included in the present work, mainly
because they were not available. Although the raw data was provided by Román Jerí (2012), the
reader should be aware of the different objectives of each work and interpret the results of each
from its own point of view.
3.4 Proposed Approach to analyse spatial data
As stated by Krivoruchko (2011, p. 22), using methods intended for
geostatistical data to analyse areal or point pattern data will produce erroneous results. However,
according to Plant (2012, p. 5), on some occasions data could be treated as geostatistical for one
analysis and as areal for another. According to Fortin et al. (2002), the selection of spatial
statistical methods could be based on the research objectives or on the type of measured data.
The data used in the present analysis represents the incidence measured on each farm, in the form:
$$DI = \frac{TD}{TE} \times 100$$
DI = Disease Incidence
TD = Total of Diseased Plants
TE = Total of Evaluated Plants
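The recalculation of the incidence from the raw counts, as described in the data preparation, amounts to the following minimal sketch. The function name `disease_incidence` and the example counts are hypothetical.

```python
def disease_incidence(total_diseased, total_evaluated):
    """Disease incidence (DI) as a percentage: TD / TE * 100."""
    if total_evaluated <= 0:
        raise ValueError("the number of evaluated plants must be positive")
    if not 0 <= total_diseased <= total_evaluated:
        raise ValueError("diseased plants must lie between 0 and the evaluated total")
    return total_diseased / total_evaluated * 100

# Hypothetical farms: (diseased plants, evaluated plants).
for td, te in [(5, 40), (0, 60), (12, 48)]:
    print(td, te, disease_incidence(td, te))
```

Keeping TD and TE alongside DI matters for the later analysis, because the same percentage is more reliable when it is based on a larger number of evaluated plants.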
The location is represented by a point, which corresponds to the location of the first symptomatic
plant found in the farm (Román Jerí, 2012, p. 32). Although some considerations should be taken into
account for future sampling designs, conditions and constraints of this kind are very common in
collected datasets, and dealing with them is one of the additional
motivations for this work.
This arrangement of spatial location and incidence rate for each farm has the particularity that one
point is used to represent an area, raising the methodological question: which spatial data
analysis method should be used?
As the disease incidence rate represents an aggregation of values for each farm, intuition indicates
that the data should be treated as areas; however, polygons for each farm were not available. When
boundaries are not available for the areas being analysed, a point could be used to define their
location and methods for point objects could then be applied (Haining, 2003, p. 81). This possibility
of using a point to represent each analysed area is also supported by Bivand et al. (2008, p. 244).
Therefore, the approach taken is to conduct spatial data analysis using methods for areal data.
3.5 Mapping the Spatial Distribution of Disease Incidence
3.5.1 Points Map

The map of the distribution of Fusarium Wilt incidence of banana in the study area was produced
using the location of 76 farms surveyed by Román Jerí (2012). A classification into six classes and a
colour palette from blue to red (corresponding to Low and High respectively) were used to
represent the measured disease incidence distribution in the study area. This classification is
shown in Figure 5.
Figure 5 – Symbols and classification utilised to map the distribution of the measured disease incidence (David Brown,
ArcGIS for Desktop 10.2)
3.5.2 Thiessen Polygons
As explained in the literature review, Thiessen Polygons are geometrical constructions derived from
points. Taking into account that Thiessen Polygons do not produce a prediction map but a
representation of the proximity area of a point object, a graphical areal representation was
produced using this technique from the points which represent each sampled farm, applying the
same colours and classification for disease incidence as in the point representation.
Although it is possible to use Areal Interpolation as implemented in the Geostatistical Analyst of
ArcGIS (Krivoruchko, Gribov & Krause, 2011) with the Thiessen Polygons as input layer, the
production of a prediction map was avoided in the present work to prevent misinterpretation of the
technique and methodology, and because it could be considered an abuse of this useful tool.
3.6 Data Exploration
Exploration of the data is divided into two steps:
a) Visual inspection of the data distribution
b) Test for spatial autocorrelation
Before exploring the data, it is worth explaining why it is so important to check whether the data
exhibits a normal distribution and whether spatial autocorrelation is present.
As explained in the literature review, Linear Regression assumes the data has a normal distribution,
which takes a bell-shaped form when plotted. The main reason to explore the data is to check whether
its distribution corresponds to a normal distribution or to another kind of distribution, such as
Poisson, binomial, etc. As the data analysed in the present work is a proportion, a
non-normal distribution is to be expected. If this is the case, ordinary linear regression should be
avoided, and other models such as the GLM or GAM are possible solutions.
Testing for spatial autocorrelation is an important part of the exploratory data analysis for two
main reasons:
1) If positive spatial autocorrelation is present, nearby farms will have similar
scores for disease incidence and a clustered pattern is likely. Negative
spatial autocorrelation means that nearby farms have dissimilar scores for disease
incidence and the distribution follows a dispersed pattern. Finally, if the null hypothesis
of zero spatial autocorrelation is not rejected, there is no evidence that the disease incidence
departs from a random spatial pattern. If positive or negative spatial autocorrelation is present
in the variable of interest, in this case the disease incidence, there is a reason for that
spatial autocorrelation, and explaining it is of interest in order to understand the underlying
process.
2) If present, spatial autocorrelation violates the independence assumption required by the
Linear Regression Model and by other models such as the GLM, and the proposed model should then
account for the spatial autocorrelation.
3.6.1 Histogram
One of the easiest ways to see the shape of the data distribution is the histogram, which is defined
by Dalgaard (2008, p. 71) as: “…, a count of how many observations fall within specified divisions
(“bins”) of the x-axis”. Along with the histogram, two indicators help to determine the
shape of a distribution: skew and kurtosis (Urdan, 2010, p. 31). The skew
indicates whether the distribution is positively or negatively skewed, meaning that the distribution
has an elongated tail at the higher end in the first case, or at the lower end
in the second case (Urdan, 2010, p. 31). Kurtosis, on the other hand, tells whether the
distribution is flatter than a normal distribution, in which case it is called platykurtic, or
whether it has a higher peak than that found in a normal distribution, in which case it is called
leptokurtic (Urdan, 2010, p. 31). As a rule of thumb, skewness should ideally be 0 and
kurtosis should be 3 for normally distributed data (NIST/SEMATECH, 2014).
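The two indicators can be computed from the central moments of the data. The sketch below uses the simple moment-based definitions, in which a normal distribution has a skewness of 0 and a kurtosis of 3. Statistical software may apply small-sample corrections, so the exact figures can differ slightly. The incidence values are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness (m3 / m2^1.5) and kurtosis (m4 / m2^2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Hypothetical incidence values in percent: many low values and a few high ones.
incidence = [0.0, 0.0, 0.0, 2.0, 3.0, 5.0, 8.0, 40.0]
skew, kurt = skewness_and_kurtosis(incidence)
print("skewness:", round(skew, 2), "kurtosis:", round(kurt, 2))
```

A positive skewness, as expected for proportions with many zeros, indicates an elongated tail at the higher end of the distribution.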
3.6.2 Normal Q-Q Plot
Another way to visually check the data for normality is the Quantile-Quantile Plot (Q-Q Plot). To
understand the Q-Q plot, the Empirical Cumulative Distribution Function (c.d.f.) should first be
defined. With x being the analysed variable, the c.d.f. is defined by Dalgaard (2008, p. 73) as “the
fraction of data smaller than or equal to x”. The Q-Q plot corresponds to plotting the “kth smallest
observation against the expected value of the kth smallest observation out of n in a standard”
normal distribution (Dalgaard, 2008, p. 73).
In practice, a straight line should be expected in the Q-Q plot for normally distributed data
(Dalgaard, 2008, p. 73).
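The points of a normal Q-Q plot can be obtained by pairing each ordered observation with a quantile of the standard normal distribution. In the sketch below the plotting position (k − 0.5)/n is an assumption; statistical packages use slightly different conventions to approximate the expected value of the kth smallest observation. The incidence values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Pair each ordered observation with a standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for k, observed in enumerate(ordered, start=1):
        # Plotting position (k - 0.5) / n is an assumption of this sketch.
        theoretical = standard_normal.inv_cdf((k - 0.5) / n)
        points.append((theoretical, observed))
    return points

# Hypothetical incidence values in percent (not the thesis data).
incidence = [0.0, 0.0, 2.5, 5.0, 40.0]
for theoretical, observed in normal_qq_points(incidence):
    print(round(theoretical, 3), observed)
```

Plotting the second element of each pair against the first gives the Q-Q plot; marked curvature away from a straight line indicates a departure from normality.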
3.6.3 Spatial autocorrelation tests
Spatial autocorrelation was tested using Moran’s I and Geary’s c. An additional test of Moran’s I
was also applied using the methodology of Assunção and Reis (1999). All the R Language (R Core
Team, 2014) code used to compute the spatial autocorrelation tests is included in Annex 1.
According to Griffith (2009), when spatial autocorrelation is detected, it falls into one of the
following categories:
1) Strong Positive Spatial Autocorrelation: Present in data like the remote sensing data and
not very common in the majority of the cases
2) Moderate Positive Spatial Autocorrelation: The most common type of spatial
autocorrelation
3) Moderate Negative Spatial Autocorrelation: Not so common to find and typically
associated with geographic competition.
Moran’s I is one of the most commonly used statistics to test the null hypothesis of zero spatial
autocorrelation (Plant, 2012, p. 104) and was selected to compute the spatial autocorrelation of
Fusarium Wilt incidence as areal data. Moran’s I and all the necessary calculations were computed
using the R Language (R Core Team, 2014) with functions included in the spdep package (Bivand,
2014). Geary’s c is another statistic that also tests the null hypothesis of zero spatial
autocorrelation. Griffith and Lane (as cited in Plant, 2012, p. 106) conclude that Moran’s I is
generally preferred over Geary’s c, but that computing Geary’s c to corroborate the Moran’s I
results is desirable. That is the main reason for also calculating Geary’s c in the present work.
The Empirical Bayes Estimate, proposed by Assunção and Reis (1999) as a way to
improve Moran’s I, was also calculated because some authors, such as Jackson et al. (2010),
Assunção and Reis (1999) and Tsai (2012), state that the Moran’s I test
does not work very well for rates calculated from populations of different sizes, as is the case of
the disease incidence treated in the present work.
3.6.3.1 Neighbour List and Spatial Weights
Before calculating any of the statistics to test spatial autocorrelation, two preliminary steps are
required. First, the relation between the spatial objects should be defined using a neighbour
criterion (Bivand et al., 2008, p. 239). After defining which objects are related as neighbours, a
spatial weight should be assigned to each link (Bivand et al., 2008, p. 251). The appropriate method
should be selected depending on the type of spatial objects used to model the spatial relationship,
polygons or points. For points, as is the case of the analysed data, two common methods are
available to construct the neighbour list: k nearest neighbours and distance based neighbours.
However, there are other options for creating a neighbour list, Delaunay triangulation being one of
them. Different methods for the definition of neighbours are treated in more detail by Haining
(2003, p. 80) and Bivand et al. (2008, p. 240).
The k nearest neighbour method selects the k nearest neighbours of each point (Plant, 2012, p. 90),
k being a parameter to be provided. For example, if a k value of 2 is provided, the method will
produce a neighbour list in which each point has 2 neighbours. The distance based method
selects the neighbours of each point using a distance threshold defined by two parameters,
a minimum and a maximum bound (Bivand et al., 2008, p. 247).
The spatial weights are assigned to each neighbour link according to different styles,
row standardisation being the recommended style if not much is known about the analysed spatial
process (Bivand et al., 2008, p. 251).
In the R Language (R Core Team, 2014) a k nearest neighbour list can be constructed with the function
knn2nb along with the function knearneigh, both included in the package spdep (Bivand, 2014).
For the distance based method, the function dnearneigh can be used to construct the
neighbour list. The spatial weights list is constructed with the function nb2listw, also provided by
the package spdep (Bivand, 2014).
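The two preliminary steps can be sketched independently of any package. The thesis computations were done in R with spdep; the plain Python below only illustrates the logic of a k nearest neighbour list followed by row-standardised weights. The coordinates and function names are hypothetical.

```python
import math

def k_nearest_neighbours(coords, k):
    """For each point, return the indices of its k nearest other points."""
    neighbours = []
    for i, point in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(point, coords[j]),
        )
        neighbours.append(others[:k])
    return neighbours

def row_standardised_weights(neighbours):
    """Give each neighbour link a weight so that the weights of every point sum to 1."""
    return [{j: 1.0 / len(links) for j in links} for links in neighbours]

# Hypothetical farm coordinates in kilometres.
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
neighbours = k_nearest_neighbours(coords, k=2)
weights = row_standardised_weights(neighbours)
print(neighbours)
print(weights)
```

Note that k nearest neighbour links are not necessarily symmetric: in this example point 2 is a neighbour of point 0, but point 0 is not among the two nearest neighbours of point 2.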
With the necessary preliminary steps clarified, there is
a new question to answer: which threshold should be defined for the distance based method,
or which k should be used for the k nearest neighbour method? Moreover, which method should be used?
According to O’Sullivan and Unwin (2010, p. 205), when the analysed process is not well
understood, the definition of the spatial structure and the assignment of weights is a difficult
process. Haining (2003, p. 81) suggests that if additional information about the analysed process is
available, it should be used to define linkages, rather than defining them by geometrical or spatial
criteria alone.
There are no magic recipes for selecting the neighbouring method, and in different theoretical books
knowledge about the studied process is presented as the key input to resolve this problem.
Although applied to a different case and with a specific implementation, the work of Souris and
Bichaud (2011) suggests that the k nearest neighbour method could be appropriate
in epidemiological studies. Therefore, the k nearest neighbour method was
selected for the present work, although to support this decision a set of comparisons was conducted
against three other methods: Delaunay triangulation, Sphere of Influence and distance
based.
Delaunay triangulation and Sphere of Influence are graph based methods. Their main
difference is that the former defines the neighbours by triangulation, whereas the latter uses
circles with a radius equal to the distance from each point to its nearest neighbour (Bivand et al.,
2008, p. 245).
For the distance based method, which requires a distance threshold, the
approach proposed by Anselin (2003) was followed. The lower bound is set to 0 and the upper bound is
set to the maximum distance needed to ensure that each point has at least one neighbour. This is
obtained by extracting the maximum distance resulting from applying the k nearest neighbour
method with k = 1.
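The rule for the upper bound can be sketched as follows: the threshold is the largest of the nearest-neighbour distances, so no point is left without a neighbour. The coordinates are hypothetical, and the treatment of the bounds (strictly greater than the lower bound, up to and including the upper bound) is an assumption of this sketch.

```python
import math

def nearest_neighbour_distance(i, coords):
    """Distance from point i to its single nearest neighbour (k = 1)."""
    return min(math.dist(coords[i], q) for j, q in enumerate(coords) if j != i)

def distance_band_neighbours(coords, lower, upper):
    """Neighbours of each point whose distance d satisfies lower < d <= upper."""
    return [
        [j for j, q in enumerate(coords) if j != i and lower < math.dist(p, q) <= upper]
        for i, p in enumerate(coords)
    ]

# Hypothetical farm coordinates in kilometres.
coords = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (4.0, 2.0)]

# Upper bound: the maximum of all nearest-neighbour distances.
upper = max(nearest_neighbour_distance(i, coords) for i in range(len(coords)))
neighbours = distance_band_neighbours(coords, 0.0, upper)
print("upper bound:", upper)
print(neighbours)
```

With this threshold, points in dense areas may receive many neighbours while isolated points receive only one, which is one of the differences with respect to the k nearest neighbour method.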
3.6.3.2 Moran’s I
As mentioned in the previous section, the Moran’s I test was computed using the R Language (R
Core Team, 2014) with the function moran.test. Basically, the function needs the list of neighbours
constructed with one of the methods explained before, together with its spatial weights, and a
vector with the values of the variable to be checked for spatial autocorrelation.
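For reference, the statistic itself can be written in a few lines. The sketch below computes only the value of Moran’s I from a vector of values and a set of weights; it does not reproduce the expectation, variance and significance test that a full implementation reports. The values and the chain-shaped neighbour structure are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I: (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(w for row in weights for w in row.values())   # sum of all weights
    cross = sum(
        w * dev[i] * dev[j]
        for i, row in enumerate(weights)
        for j, w in row.items()
    )
    return (n / s0) * cross / sum(d * d for d in dev)

# Four hypothetical farms along a line, with row-standardised weights.
weights = [{1: 1.0}, {0: 0.5, 2: 0.5}, {1: 0.5, 3: 0.5}, {2: 1.0}]

print(morans_i([1.0, 2.0, 3.0, 4.0], weights))   # similar neighbours: positive value
print(morans_i([1.0, 4.0, 1.0, 4.0], weights))   # alternating values: negative value
```

Under the null hypothesis of zero spatial autocorrelation the expected value is −1/(n − 1), which approaches 0 as the number of farms increases.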
3.6.3.3 Geary’s C
As for Moran’s I, a neighbour list is also needed for Geary’s c. In this case, the same
neighbour lists constructed for the Moran’s I calculation were used. Geary’s c
was computed using the function geary.test included in the package spdep (Bivand, 2014).
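Geary’s c is based on squared differences between neighbouring values instead of cross-products of deviations from the mean. The following sketch computes only the value of the statistic, using the same hypothetical weights structure as a chain of four farms; it omits the significance test.

```python
def gearys_c(values, weights):
    """Geary's c: ((n - 1) / (2 * S0)) * sum_ij w_ij (x_i - x_j)^2 / sum_i (x_i - mean)^2."""
    n = len(values)
    mean = sum(values) / n
    s0 = sum(w for row in weights for w in row.values())   # sum of all weights
    squared_diff = sum(
        w * (values[i] - values[j]) ** 2
        for i, row in enumerate(weights)
        for j, w in row.items()
    )
    total = sum((v - mean) ** 2 for v in values)
    return ((n - 1) / (2.0 * s0)) * squared_diff / total

# Four hypothetical farms along a line, with row-standardised weights.
weights = [{1: 1.0}, {0: 0.5, 2: 0.5}, {1: 0.5, 3: 0.5}, {2: 1.0}]

print(gearys_c([1.0, 2.0, 3.0, 4.0], weights))   # below 1: positive autocorrelation
print(gearys_c([1.0, 4.0, 1.0, 4.0], weights))   # above 1: negative autocorrelation
```

Values of c close to 1 correspond to no spatial autocorrelation, so the scale runs in the opposite direction to that of Moran’s I.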
3.6.3.4 Modified Moran’s I – Empirical Bayes Index

The modified Moran’s I applied is the one proposed by Assunção and Reis (1999), and it is
implemented with the function EBest included in the package spdep (Bivand, 2014). This function
calculates an Empirical Bayes Estimate, and Moran’s I is then computed using the resulting smoothed
rate.
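The idea behind the Empirical Bayes Index can be sketched as a standardisation of each rate by a variance that depends on the number of evaluated plants, so that farms with few plants do not dominate the statistic. The code below follows the formulation of Assunção and Reis (1999) as it is commonly stated; the handling of negative variance estimates and the counts are assumptions of this sketch, which is not the code used in the thesis.

```python
import math

def eb_standardised_rates(cases, population):
    """Standardise raw rates by their estimated mean and size-dependent variance."""
    m = len(cases)
    rates = [n / x for n, x in zip(cases, population)]
    b = sum(cases) / sum(population)                      # overall rate
    s2 = sum(x * (p - b) ** 2 for p, x in zip(rates, population)) / sum(population)
    a = s2 - b / (sum(population) / m)                    # between-farm variance estimate
    z = []
    for p, x in zip(rates, population):
        v = a + b / x                                     # variance shrinks as x grows
        if v < 0:
            v = b / x                                     # assumption: fall back when negative
        z.append((p - b) / math.sqrt(v))
    return z

# Hypothetical counts per farm: diseased plants and evaluated plants.
diseased = [0, 2, 15, 1, 30]
evaluated = [50, 40, 60, 45, 55]
z = eb_standardised_rates(diseased, evaluated)
print([round(value, 2) for value in z])
```

Moran’s I is then calculated on the standardised values instead of on the raw incidence rates.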
3.7 GAMLSS applied to Fusarium Wilt Disease Incidence

One of the special interests of this work is to explore and model the relationship between Fusarium
Wilt incidence and variables such as altitude, shade percentage, slope, soil pH, plant density and a
Soil Quality Index. The Soil Quality Index was calculated according to the methodology of Altieri and
Nicholls (2002) and comprises a list of measurements which together give an estimate of
the quality of the soil.
As explained in the literature review, GLMs are the usual approach to analysing data that does not
exhibit a normal distribution (Plant, 2012, p. 301), as is the case of the disease incidence rate.
Garrett et al. (2004) also suggest using a GLM instead of applying a transformation to the
data. On the other hand, Kongchouy, Choonpradub and Kuning (2010) indicate that applying a
logarithmic transformation to the disease incidence rate is enough to achieve satisfactory results.
Bivand et al. (2008, p. 274) also apply a logarithmic transformation to disease incidence rates to
try to obtain a nearly normal distribution. In this context, a logarithmic transformation consists of
calculating the logarithm of each of the original values of the variable of interest and using it
instead of the original value. Recalling from basic mathematics, “a logarithm function is defined
with respect to a base” (Nau, 2014). However, since the data used in the present work contains values
of zero for the farms without the disease, and the logarithm of zero is undefined, the logarithmic
transformation does not seem to be a feasible solution, even though some other transformation could
be applied. Moreover, with methods such as Generalised Linear Models available to handle this kind
of data without transforming it, there is no strong reason to take the transformation approach.
Zuur et al. (2009, p. 19) state that the simplest statistical model should be used, but that it
should be used in the correct form. Following this approach, a GLM was applied to the disease
incidence rate to find a model that could explain the relationship between the disease incidence and
the proposed set of explanatory variables. The implementation was done using the R Language (R Core
Team, 2014).
As presented in the literature review, a GLM consists of three steps (Zuur et al., 2009, p. 210):
a) Assume a distribution for the response variable
b) Specify the systematic part (the explanatory variables)
c) Specify the link function
In the case of the data analysed in the present work, the response variable is proportional data,
for which a binomial distribution is assumed, according to Zuur et al. (2009, p. 202) and Garrett et
al. (2004). The systematic part corresponds to the explanatory variables, which in the present work
are a selection of the variables presented in Table 3. This selection retains the variables that are
statistically significant in the model, judged mostly by their p-values. GLMs are fitted to the data
by maximum likelihood, and this is where the link function comes in, the logit link being the most
used for proportional data (Garrett et al., 2004).
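The three steps can be written compactly for the binomial case with a logit link. The notation below is an assumption introduced for illustration and is not taken from the thesis: y_i is the number of diseased plants out of n_i sampled on farm i, \pi_i is the expected incidence, and x_{i1}, ..., x_{ip} are the explanatory variables.

```latex
y_i \sim \mathrm{Binomial}(n_i, \pi_i), \qquad
\operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1 - \pi_i}
  = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
```

The first expression is the distributional assumption, the right-hand side of the second is the systematic part, and the logit is the link that maps the incidence from the interval (0, 1) onto the whole real line.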
With Poisson and binomial distributions the observed variability is often greater than the
predicted variability. This is known as overdispersion and is very common in plant disease data
(Garrett et al., 2004). When overdispersion is found, one approach is to fit the model to the data
with a maximum quasi-likelihood method (Garrett et al., 2004). In this case the distribution is still
binomial, but overdispersion is allowed for and taken into account (Zuur et al., 2009, p. 226).
Special attention should be paid to testing for overdispersion before starting to select the
explanatory variables for the model (Zuur et al., 2009, p. 223). In the present work overdispersion
was found in the model, and it was handled using a GAMLSS with a discrete distribution called the
beta-binomial distribution. Two main reasons support the selected approach: 1) the GAMLSS provides an
AIC calculation, which is very useful for selecting the most significant explanatory variables and is
not provided by the quasi-binomial method; 2) the beta-binomial distribution is widely recognized as
the most adequate solution for overdispersed proportional data, such as plant disease incidence
(Garrett et al., 2004). The implementation was also done in the R Language (R Core Team, 2014) with
the gamlss package (Rigby & Stasinopoulos, 2005).
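Why the beta-binomial distribution accommodates overdispersion can be seen in a short Python sketch. It uses the generic (alpha, beta) parameterisation, which is an assumption made here for simplicity and is not necessarily the parameterisation used by the gamlss package; all numeric values are hypothetical.

```python
import math

def beta_binomial_pmf(k, n, alpha, beta):
    """P(Y = k) for n trials whose success probability follows Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_num = (math.lgamma(k + alpha) + math.lgamma(n - k + beta)
               - math.lgamma(n + alpha + beta))
    log_den = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp(log_choose + log_num - log_den)

n, alpha, beta = 20, 2.0, 6.0                  # hypothetical values
p = alpha / (alpha + beta)                     # mean proportion
pmf = [beta_binomial_pmf(k, n, alpha, beta) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
binomial_variance = n * p * (1 - p)            # variance of a plain binomial
```

Both distributions share the same mean, but the beta-binomial variance is larger than the binomial one, which is the extra variability that a plain binomial GLM cannot represent.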
The next step is to select the explanatory variables that are important to include in the model.
According to Zuur et al. (2009, p. 221) two options are available for this: selection using the AIC
(Akaike Information Criterion) or the hypothesis testing method. The AIC is a measure of how well the
model fits (Dalgaard, 2008, p. 232). An extensive explanation can be found in Akaike (1998), but in
general terms the AIC is based on the Maximum Likelihood Estimator to select the most appropriate
model (Pan, 2001). For a GLM it can be calculated in the R Language (R Core Team, 2014) using the
function step. A model with a lower AIC value is considered better (Zuur et al., 2009, p. 542).
However, since a GAMLSS was used, the function available for this calculation is stepGAIC from the
gamlss package (Rigby & Stasinopoulos, 2005).
Validation of the Model
Zuur et al. (2009, p. 23) suggest validating a linear regression model using graphs as follows:
1) A graph of the model residuals vs fitted values to check for homogeneity
2) A Q-Q plot or histogram of the residuals to verify normality
3) Plots of the residuals vs each explanatory variable to verify independence
Teetor and Loukides (2011, p. 295) provide a simple explanation of how to interpret this kind of
graph.
Since the GAMLSS was used instead of a classical GLM, the graphs provided by the package were
used to validate the model. The function plot of the R Language (R Core Team, 2014) applied to a
GAMLSS model provides the following graphs for model validation (Stasinopoulos, Rigby &
Akantziliotou, 2008, p. 121):
- Model residuals against the fitted values
- Model residuals against an index or specified x-variable
- Kernel density estimate of the residuals
- QQ-normal plot of the residuals
In general, a valid and well-fitted model is expected to have residuals that are normally
distributed and show no patterns.
The code utilised in the R Language (R Core Team, 2014) to implement the GLM and GAMLSS is
included in Annex 2.
4 Results and Analysis
4.1 Map of the Spatial Distribution of Disease Incidence in the Study Area
Figure 6 shows the resulting map of the distribution of Fusarium Wilt incidence in the district of
San Luis de Shuaro, Peru. From these points, Thiessen polygons were constructed and coloured with
the same colour scheme and classification used for disease incidence. The resulting map with the
Thiessen polygons is shown in Figure 7.
Figure 6 - Map of the Spatial Distribution of Fusarium wilt incidence in the study area (David Brown, ArcGIS for
Desktop 10.2)
Figure 7 – Thiessen Polygons constructed as from the points of each sample farm (David Brown, ArcGIS for Desktop
10.2)
4.2 Results of Exploratory Data Analysis
4.2.1 Histogram
Figure 8 shows the histogram of the disease incidence, calculated using the R Language (R Core
Team, 2014). Unlike the normal distribution shown in Figure 9, the histogram shows that the data
follow a positively skewed distribution.
Figure 8 – Histogram of disease incidence rate
Figure 9 – Histogram of a normal distribution from simulated data
4.2.2 Normal Q-Q Plot
The Quantile-Quantile plot of normally distributed data should follow a straight line, as shown in
Figure 11. The shape shown in Figure 10 departs from that line, which indicates a non-normal
distribution of the data.
Figure 10 – Normal Q-Q Plot of the disease incidence
Figure 11 – Normal Q-Q Plot for a normal distribution for simulated data
4.2.3 Skewness and Kurtosis
As explained before, for a normal distribution the Skewness should be 0 and the Kurtosis should
be 3.
In the present work Skewness and Kurtosis were calculated using the R Language (R Core Team,
2014) with the function stat.desc from the package pastecs (Grosjean & Ibanez, 2014).
For the disease incidence, the Skewness was 1.9327 and the Kurtosis was 3.1458, so the Skewness
is the more problematic of the two.
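As a reminder of what these two numbers measure, a minimal Python sketch of the moment-based definitions follows. It is only an illustration with hypothetical data: stat.desc may apply different estimators or corrections, so values from this sketch are not expected to match the figures above exactly.

```python
def skewness_kurtosis(values):
    """Moment-based skewness and kurtosis (a normal distribution gives 0 and 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# Hypothetical incidence-like sample: many low values and a long right tail.
sample = [0.0, 0.0, 0.0, 0.05, 0.1, 0.1, 0.2, 0.8]
skew, kurt = skewness_kurtosis(sample)
```

A positive skewness, as in this sample, corresponds to the long right tail seen in a positively skewed histogram.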
4.2.4 Neighbours list and Spatial Weights
Probably the most convenient way to show how much the spatial structure differs depending on the
method selected to define the neighbours is graphically. The following figures show the different
methods used to construct neighbour relationships. Delaunay triangulation, Sphere of Indifference and
distance based are shown in Figure 12, while the k nearest neighbour relationships with k = 1, k = 5
and k = 10 are shown in Figure 13.
Figure 12 – a) Delaunay triangulation, b) Sphere of Indifference, c) Distance based
Figure 13 – a) k nearest neighbour with k = 1, b) k nearest neighbour with k= 5, c) k nearest neighbour with k=10
4.2.5 Results of Moran’s I Test
As can be seen in Table 4, none of the Moran’s I calculations using the different neighbour lists
has a statistically significant p-value at the 95 % confidence level. At this point it is necessary
to recall why the p-value is relevant. A very detailed explanation can be found in Urdan (2010,
p. 61). The following are definitions from Urdan (2010, p. 77):
- p-value : “The probability of obtaining a statistic of a given size from a sample of a given
size by chance, or due to random error”
- confidence interval: An interval calculated using sample statistics to contain the
population
Combined, these concepts provide a way to determine whether the calculated value is statistically
significant. In the present case, using a 95 % confidence level, only values with a p-value less than
0.05 are statistically significant.
It is also worth mentioning that Moran’s I was computed with a “Two Sided” alternative hypothesis.
By default, the alternative hypothesis is that the statistic is greater than its value under zero
spatial autocorrelation, which presumes that any spatial autocorrelation will be positive. However,
since there are no grounds to presume whether any spatial autocorrelation will be positive or
negative, the alternative hypothesis was set to “Two Sided”.
Table 4 – Results of Moran’s I test computations using different neighbours list methods
Neighbour Method            I         E(I)      var(I)   St. deviate   p-value
Delaunay Triangulation     -0.0467   -0.0133    0.0042   -0.52         0.6053
Sphere of Indifference     -0.025    -0.013     0.01     -0.12         0.9069
Nearest Neighbour k = 1     0.0046   -0.0133    0.02      0.13         0.8993
Nearest Neighbour k = 5    -0.107    -0.013     0.004    -1.5          0.1385
Distance bands             -0.0759   -0.0133    0.0045   -0.93         0.3509
To understand the results better, it is necessary to explain the values contained in Table 4. The
I value is the computed Moran’s I, which is positive for positive spatial autocorrelation and
negative for negative spatial autocorrelation. The E(I) value is the expected value of I under the
null hypothesis of no spatial autocorrelation, given by the following expression (Griffith, 2009):
1
n 1
Where n is the number of areal units, in this case the number of farms.
The var (I) represents the variance of the statistic and the St. deviate is the standard deviate (Bivand
et al., 2008, p. 260).
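How the columns of Table 4 relate to each other can be checked with a short Python sketch. It assumes that the standard deviate is the statistic minus its expected value, divided by the square root of the variance, and that the two-sided p-value comes from the normal approximation; the inputs are the rounded values of the Delaunay row, so small differences from the table are expected.

```python
import math

def standard_deviate(statistic, expected, variance):
    """Distance of the statistic from its expected value in standard deviations."""
    return (statistic - expected) / math.sqrt(variance)

def two_sided_p_value(z):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

n = 76                                    # number of sampled farms
expected_i = -1.0 / (n - 1)               # E(I) under no spatial autocorrelation
z = standard_deviate(-0.0467, expected_i, 0.0042)   # Delaunay row of Table 4
p = two_sided_p_value(z)
```

The results agree with the tabulated E(I), standard deviate and p-value up to rounding of the inputs.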
4.2.6 Results of Geary’s c Test

In the case of the Geary’s c test, the only computation with a significant p-value is the one
obtained using a neighbour list built with the k nearest neighbour method with k = 10. In Table 5
the c values are the computed Geary’s c. The remaining values are as described for the Moran’s I
calculation, with the difference that the expected value for zero spatial autocorrelation is 1.
Values between 1 and 2 represent negative spatial autocorrelation and values between 0 and 1
indicate positive spatial autocorrelation. In the present case, although the p-value of the
calculation for the k nearest neighbour method is statistically significant, the c value is barely
greater than one, so the null hypothesis of zero spatial autocorrelation is retained.
Table 5 – Results of Geary’s c test computations using different neighbours list methods
Neighbour Method            c        E(c)   var(c)   St. deviate   p-value
Delaunay Triangulation      1.0428   1      0.0054   -0.58         0.561
Sphere of Indifference      1.021    1      0.012    -0.19         0.8496
Nearest Neighbour k = 1     1.074    1      0.033    -0.41         0.6841
Distance bands              1.072    1      0.0059   -0.94         0.3493
4.2.7 Results of Global Moran’s I using Empirical Bayes Estimates
Table 6 – Results of Moran’s I with Empirical Bayes Estimates computations using different neighbours list methods
Neighbour Method            I         E(I)      var(I)   St. deviate   p-value
Delaunay Triangulation     -0.0461   -0.0133    0.0042   -0.51         0.6113
Sphere of Indifference     -0.025    -0.013     0.01     -0.11         0.9115
Nearest Neighbour k = 1     0.0079   -0.0133    0.02      0.15         0.8808
Distance bands             -0.0768   -0.0133    0.0045   -0.95         0.3445
In the case of Moran’s I computed using the Empirical Bayes Estimate, none of the calculations
shows spatial autocorrelation, as can be observed in Table 6, and consequently the null hypothesis
cannot be rejected.
Taken together, the three methods implemented to test for spatial autocorrelation in the disease
incidence give no evidence of spatial autocorrelation in this case.
Differences between Moran’s I and Geary’s c could be attributed to the effect of the distribution
of the data, which according to Cliff and Ord (as cited in Plant, 2012, p. 106) affects the c
calculation more than I.
4.3 Results of the GAMLSS
To construct a model that explains the Fusarium Wilt incidence through the selected variables, the
first step is to include all these variables as explanatory variables in the model and then apply
the Akaike Information Criterion to select a better model. The first model contains the following
explanatory variables:
- Area (Farm Size)
- Altitude
- pH
- Planting Density
- Farm Type
- Variety
- Slope
- Shade Percentage
- Soil Quality Index
As explained before, if the modelled data present overdispersion it is preferable to specify a
beta-binomial distribution. To test whether the data analysed here present overdispersion, a GLM
with a binomial distribution was specified. Moreover, if overdispersion is not present, a GLM with a
binomial distribution can be used and the use of the GAMLSS becomes optional.
The GLM was specified with the function glm, which is part of the R Language (R Core Team, 2014),
including all the proposed explanatory variables and specifying a binomial distribution. Results are
presented in Table 7.
Table 7 – Results of the GLM model with a Binomial distribution and all the proposed explanatory variables
                                          Estimate      Std. Error    t value        Pr(>|t|)
(Intercept)                                7.60064318   0.60639424    12.53416123    0.0000000000
pH                                        -1.37212957   0.15488183    -8.85920321    0.0000000000
Altitude                                   0.00094969   0.00014293     6.64426691    0.0000000000
Slope                                     -0.00135563   0.00266970    -0.50778478    0.6116042814
Farm Size (area)                           0.03063996   0.04297005     0.71305393    0.4758123862
Plant density                             -0.00131031   0.00035593    -3.68136155    0.0002319918
Soil Quality Index                        -0.91801074   0.08157778   -11.25319525    0.0000000000
Factor(farm_type)2                        -0.12178332   0.10985683    -1.10856391    0.2676183558
Factor(farm_type)3                         0.12928188   0.44745031     0.28893014    0.7726348387
Factor (Variety - Seda)                   -0.22388688   0.24582094    -0.91077219    0.3624154171
Factor (Variety - Mix)                     0.43529689   0.45731161     0.95186057    0.3411676975
Shade percentage                           0.01467834   0.01320917     1.11122363    0.2664721004
Factor(farm_type)2: Factor(Variety) Mix    0.52844930   0.46682362     1.13201063    0.2576299656
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1693.44 on 75 degrees of freedom
Residual deviance: 258.64 on 63 degrees of freedom
AIC: 483.6
Number of Fisher Scoring iterations: 5
To assess the proposed model for overdispersion, the two key values to analyse are the residual
deviance and the residual degrees of freedom. Overdispersion is estimated by dividing the residual
deviance by the degrees of freedom (Zuur et al., 2009, p. 224), in this case 258.64/63 = 4.105, which
is higher than the value of 1 expected for the binomial family, as indicated in the model summary.
The resulting dispersion parameter therefore indicates that overdispersion is present in the model.
To account for overdispersion in the model a GAMLSS with a beta-binomial distribution was used.
A beta-binomial distribution is a combination of the beta and binomial distributions (Hilbe, 2013)
and is used when binomial data present overdispersion (Guimarães, 2005).
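The overdispersion check above amounts to a single division, sketched here in Python with the values reported in Table 7. The function name is hypothetical, and reading any ratio above 1 as a sign of overdispersion is a rule of thumb rather than a formal test.

```python
def dispersion_ratio(residual_deviance, residual_df):
    """Residual deviance per residual degree of freedom (about 1 if no overdispersion)."""
    return residual_deviance / residual_df

ratio = dispersion_ratio(258.64, 63)     # values from Table 7
overdispersed = ratio > 1.0
```

A ratio this far above 1 is what motivates moving from the binomial GLM to the beta-binomial GAMLSS.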
Table 8 shows the results of the base model with all the proposed variables, applying a GAMLSS
with a beta-binomial distribution.
Table 8 – Results for the base model with all the proposed variables
                                          Estimate      Std. Error    t value       Pr(>|t|)
Intercept                                  6.29559794   1.14470686     5.49974684   0.00000074
pH                                        -1.26226658   0.26622565    -4.74134102   0.00001256
Altitude                                   0.00110670   0.00030041     3.68400796   0.00047906
Factor (Variety - Seda)                   -0.08451387   0.37110387    -0.22773643   0.82058864
Factor (Variety - Mix)                     0.27996758   0.63164774     0.44323372   0.65911506
Planting Density                          -0.00182239   0.00076547    -2.38074932   0.02031365
Slope                                     -0.00489983   0.00563126    -0.87011249   0.38754222
Farm Size                                  0.07216351   0.08153999     0.88500759   0.37951792
Soil Quality Index                        -0.73183938   0.13730923    -5.32986312   0.00000141
Factor(farm_type)2                        -0.25502120   0.23044149    -1.10666355   0.27264724
Factor(farm_type)3                        -0.02887809   0.94505482    -0.03055705   0.97571939
Shade percentage                           0.00834388   0.02681166     0.31120343   0.75667341
Factor(farm_type)2: Factor(Variety) Mix    0.44620648   0.68007146     0.65611704   0.51413809
Mu link function: logit
Sigma link function: log
Sigma Coefficients:
(Intercept)                               -4.7          0.2277       -20.64         1.169e-32
-------------------------------------------------------------------
No. of observations in the fit: 76
Degrees of Freedom for the fit: 14
Residual Deg. of Freedom: 62
at cycle: 13
Global Deviance: 362.1153
AIC: 390.1153
SBC: 422.7456
Summary of the Randomised Quantile Residuals
mean = -0.0124098
variance = 1.063567
coef. of skewness = -0.1870921
coef. of kurtosis = 2.620784
Filliben correlation coefficient = 0.9950251
Applying the function stepGAIC to select a model using the AIC value, the selected variables were:
- pH
- Altitude
- Planting Density
- Soil Quality Index
Table 9 shows the summary of results of the fitted model with the selected variables. As can be
observed, the model improved from an AIC of 390.12 to an AIC of 382.08 after selecting the most
significant variables.
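The AIC and SBC values reported for the two models can be reproduced from the global deviance and the degrees of freedom used by each fit. The Python sketch below assumes the usual generalised AIC definition, deviance plus a penalty times the degrees of freedom, with a penalty of 2 for the AIC and log(n) for the SBC; the function name is hypothetical.

```python
import math

def gaic(global_deviance, df, penalty=2.0):
    """Generalised Akaike information criterion: deviance + penalty * df."""
    return global_deviance + penalty * df

n_obs = 76                                         # farms in the fit

# Base model with all variables (Table 8) and reduced model (Table 9).
aic_full = gaic(362.1153, 14)
aic_reduced = gaic(370.0847, 6)
sbc_full = gaic(362.1153, 14, penalty=math.log(n_obs))
sbc_reduced = gaic(370.0847, 6, penalty=math.log(n_obs))
```

The reduced model has a slightly higher deviance but uses eight fewer degrees of freedom, which is why both criteria prefer it.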
Table 9 – Results of the adjusted model after applying the variable selection using stepGAIC
                      Estimate      Std. Error    t value       Pr(>|t|)
Intercept              5.50208907   1.04975051     5.24133025   0.00000157
pH                    -1.32829757   0.25205493    -5.26987349   0.00000140
Altitude               0.00099656   0.00030522     3.26511767   0.00168565
Planting Density      -0.00202482   0.00068184    -2.96964748   0.00406429
Soil Quality Index    -0.51989099   0.11633084    -4.46907255   0.00002906
Mu link function: logit
Sigma link function: log
Sigma Coefficients:
(Intercept)           -4.374        0.2158       -20.26         3.779e-32
-------------------------------------------------------------------
No. of observations in the fit: 76
Degrees of Freedom for the fit: 6
Residual Deg. of Freedom: 70
at cycle: 12
Global Deviance: 370.0847
AIC: 382.0847
SBC: 396.0691
Summary of the Randomised Quantile Residuals
mean = -0.02736283
variance = 0.9909983
coef. of skewness = -0.02519111
coef. of kurtosis = 2.923242
Filliben correlation coefficient = 0.9979815
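To show how the coefficients of Table 9 translate into an expected incidence, the following Python sketch applies the inverse of the logit link to the linear predictor. The covariate values are hypothetical and the units of each variable are not stated in this section, so the output should be read only as an illustration of the direction of each effect.

```python
import math

# Mu coefficients of the adjusted model (Table 9).
COEFFICIENTS = {
    "intercept": 5.50208907,
    "ph": -1.32829757,
    "altitude": 0.00099656,
    "planting_density": -0.00202482,
    "soil_quality_index": -0.51989099,
}

def predicted_incidence(ph, altitude, planting_density, soil_quality_index):
    """Expected incidence: inverse logit of the linear predictor."""
    eta = (COEFFICIENTS["intercept"]
           + COEFFICIENTS["ph"] * ph
           + COEFFICIENTS["altitude"] * altitude
           + COEFFICIENTS["planting_density"] * planting_density
           + COEFFICIENTS["soil_quality_index"] * soil_quality_index)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical farms that differ only in soil pH.
acid_soil = predicted_incidence(5.0, 900.0, 1000.0, 6.0)
neutral_soil = predicted_incidence(7.0, 900.0, 1000.0, 6.0)
```

Because the pH coefficient is negative, the more acid soil receives the higher expected incidence when all other variables are held fixed.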
To validate the model, the graphs of the residuals were constructed and are shown in Figure 14.
Figure 14 – Plots of the residuals for model validation
In the upper left the randomised residuals are plotted against the fitted values. In the upper right
the randomised residuals are plotted against an index, which corresponds to the observation number,
in the present case the 76 farms. These two plots should not show any pattern for a well-fitted model
(Zuur et al., 2009, p. 27).
The Density Estimate and the Normal Q-Q Plot help to evaluate the normality of the residuals,
which is required for a well-fitted model (Stasinopoulos et al., 2008, p. 122). In the present case
they appear to be normally distributed. Finally, four values from the randomised residuals should be
checked to confirm that the model was well fitted: the mean, the variance, the coefficient of
Skewness and the coefficient of Kurtosis, all of which are shown in Table 9. A well-fitted model
should have a mean near zero, a variance near one, a coefficient of Skewness near zero and a
coefficient of Kurtosis near 3. In the present case the values presented in Table 9 are close to the
values expected for a well-fitted model.
4.4 Analysis of the Results
4.4.1 Spatial Distribution of Fusarium Wilt in the study area

For the spatial distribution of Fusarium Wilt in the study area, point objects representing each
farm were used, as explained in the methodology chapter. Disease incidence was classified into 5
classes, each of which was assigned a colour. This symbology and colour arrangement gives a graphical
representation of how the different levels of disease incidence are spread across the study area. As
a very basic interpolation, Thiessen polygons were used to represent the influence zone of each farm
with Fusarium Wilt presence in the study area. However, this should be interpreted only as a first
approximation to zones of influence in the study area and not as a prediction map.
4.4.2 Spatial Autocorrelation Tests

After computing Moran’s I with different neighbours lists, there is no strong evidence of spatial
autocorrelation in the Fusarium Wilt incidence in the study area. One fact should be highlighted from
these results: the design of the spatial structure of the studied process through the neighbours list
is highly relevant, because it affects the detection of spatial autocorrelation, as stated by Bivand
et al. (2008, p. 239), O’Sullivan and Unwin (2010, p. 201), Haining (2003, p. 79) and Plant (2012,
p. 80).
Geary’s c was also used to test for spatial autocorrelation in the Fusarium Wilt incidence in the
study area. Although the p-values differ, the null hypothesis of zero spatial autocorrelation could
not be rejected, as in the case of Moran’s I.
For Moran’s I using Empirical Bayes Estimates to smooth the disease incidence rates, the results
were essentially the same as for the standard Moran’s I.
4.4.3 GAMLSS
The proposed model using GAMLSS to explain the Fusarium Wilt incidence has the following
explanatory variables:
- Soil pH
- Altitude
- Planting Density
- Soil Quality Index
It is the result of applying the AIC to select the variables to be included in the model, and the
model validation was conducted using graphs provided by the same statistical tool.
Soil pH is already recognized to be correlated with Fusarium Wilt disease. According to Alvarez,
García, Robles and Díaz (1981) there is evidence to expect a higher disease presence in soils with a
pH lower than 7. Román Jerí (2012, p. 90) also reports a relationship between the disease incidence
and soil pH in his results. The Soil Quality Index, calculated according to the methodology proposed
by Altieri and Nicholls (2002), is also correlated with the disease incidence. Better soil conditions
were also reported by Domínguez-Hernández, Negrín and Rodríguez (2008) as a condition associated with
lower levels of Fusarium Wilt presence, although Alabouvette (as cited in Domínguez-Hernández et al.,
2008, p. 405) states that there is no evidence that soil properties play any role in suppressiveness.
In the case of altitude, although no specific studies were found in the literature about the effect
of altitude on Fusarium Wilt incidence, it is known that altitude acts indirectly on banana growth
through a decrease in temperature, making it difficult to produce bananas above 1000 meters of
altitude (Arvanitoyannis & Mavromatis, 2009). These unfavourable conditions for banana growth could
also influence plant health and how well the plant can defend itself against Fusarium Wilt (Ploetz,
Jones, Sebaisgari & Tushemereirwe, 1994).
With regard to plant density, no specific work was found in the literature dealing with the effect
of plant density on Fusarium Wilt incidence. Works such as the one conducted by Athani, Revanappa
and Dharmatti (2009) studied the effect of plant density on plant height and yield, but did not
account for plant health or other variables of interest for the present work. However, as in the
case of altitude, unfavourable conditions could make a weak plant more likely to acquire the disease,
and with a higher plant density the competition between plants is also higher. One possible
hypothesis is that under higher competition (high plant density) without adequate fertilisation, the
risk of having weaker plants increases, and those weak plants could be more prone to infection by
Fusarium Wilt or even other diseases. However, this is only an outline of a hypothesis and should be
taken as a possible subject for further work and not as fact.
4 Conclusions
1) The spatial distribution of Fusarium Wilt in the study area was successfully represented with
the aid of the software ArcGIS for Desktop (ESRI, 2013). Thiessen polygons are a useful method for
basic interpolation, but they have the clear limitation of being just a geometric construction, so
inferences drawn from them should be made with caution. Other forms of interpolation, such as areal
interpolation, should be explored in future work when the real boundaries of the farms are
available. The areal interpolation tool available in ArcGIS for Desktop (ESRI, 2013) could be a
starting point, but it has the limitation that it can assume Gaussian, Binomial or Poisson
distributions but not the Beta-Binomial distribution, which is needed for binomial data with
overdispersion. As overdispersion is a common characteristic of disease incidence data, a
geostatistical tool that easily allows researchers to produce interpolations for this kind of data
would be a very valuable contribution of further work.
2) Spatial autocorrelation was not found in the Fusarium Wilt incidence in the study area, which
suggests that the distribution pattern of the farms with presence of Fusarium Wilt is random. Based
on these results, the null hypothesis H0 cannot be rejected. Special attention should be paid to the
fact that these results come from data representing farms located in a very diverse region with a
variety of elements involved. Even though spatial autocorrelation was not found at this scale, other
studies should analyse spatial autocorrelation at farm level, which implies collecting the location
of each sampled plant. The results obtained in the present work could support the design of a
sampling strategy when the spatial component is included in an epidemiology study.
3) A GAMLSS model with a beta-binomial distribution was successfully applied to explain the
Fusarium Wilt incidence of banana from a set of explanatory variables. The beta-binomial
distribution was found to be the most appropriate to model binomial data with overdispersion,
confirming what was found in the literature review. One notable result of the present work is the
set of guidelines produced to model plant disease incidence, since all the script code implementing
the methodology in the R Language (R Core Team, 2014) is provided so that it can be easily
reproduced. Desirable further work using the present work as a starting point could be the
development of a software package that provides an easy-to-use tool for plant pathologists, or at
least a detailed guide on how to model disease incidence data with the available tools.
4) The relationships of soil pH and soil quality conditions with Fusarium Wilt incidence were
confirmed, as they coincide with results found in previous works. These results could lead to more
specific work analysing the influence of pH and soil conditions on the disease incidence of Fusarium
Wilt. Further work is also suggested to explore in depth the relationship of altitude and plant
density with Fusarium Wilt incidence. Although the present work is based on part of the raw data
kindly provided by the work of Román Jerí (2012), the approach taken was completely blind with
respect to that previous work in terms of the methodology used to analyse the relationship between
disease incidence and the proposed explanatory variables. As future work, a detailed review of the
methodologies applied by the two works is suggested, to outline the reasons behind the different
results found.
5 References
Akaike, H. (1998). Information Theory and an Extension of the Maximum Likelihood Principle. In
E. Parzen, K. Tanabe, & G. Kitagawa (Eds.), Selected Papers of Hirotugu Akaike (pp. 199–213). New
York, NY: Springer New York.
Altieri, M. A., & Nicholls, C. I. (2002). Sistema agroecologico rápido de evaluación de calidad
de suelo y salud de cultivos en el agroecosistema de café. Retrieved August 28, 2014, from
University of California, Berkeley, Agroecology in Action website:
http://www.agroeco.org/doc/SistAgroEvalSuelo2.htm
Alvarez, C. E., García, V., Robles, J., & Díaz, A. (1981). Influence des caractéristiques du sol
sur l’incidence de la Maladie de Panama. Fruits, 36(2), 71–81.
Alves, M. de C., & Pozza, E. A. (2010). Indicator kriging modeling epidemiology of common
bean anthracnose. Applied Geomatics, 2(2), 65–72. doi:10.1007/s12518-010-0021-1
Anselin, L. (2003). Data and Spatial Weights in spdep Notes and Illustrations. Urbana-
Champaign: University of Illinois. Retrieved September 9, 2014, from
https://geodacenter.asu.edu/system/files/dataweights.pdf
Arias, P., Dankers, C., Liu, P., & Pilkauskas, P. (2003). The world banana economy, 1985-
2002. Rome: Food and Agriculture Organization of the United Nations.
Arvanitoyannis, I. S., & Mavromatis, A. (2009). Banana Cultivars, Cultivation Practices, and
Physicochemical Properties. Critical Reviews in Food Science and Nutrition, 49(2), 113–135.
Assunção, R. M., & Reis, E. A. (1999). A new proposal to adjust Moran’s I for population
density. Statistics in Medicine, 18(16), 2147–2162.
Athani, S. I., Revanappa, & Dharmatti, P. R. (2009). Effect of plant density on growth and yield
in banana. 22, 1, 143–146.
Bachmaier, M., & Backes, M. (2008). Variogram or semivariogram? Understanding the variances in a
variogram. Precision Agriculture, 9(3), 173–175.
Bivand, R. (2014). spdep: Spatial dependence: weighting schemes, statistics and models. R
package. (Version 0.5-71). Retrieved August 14, 2014, from http://CRAN.R-
project.org/package=spdep
Bivand, R., & Lewin-Koh, N. (2014). maptools: Tools for reading and handling spatial objects
(Version 0.8-29). R. Retrieved August 14, 2014, from http://CRAN.R-
project.org/package=maptools
Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2008). Applied spatial data analysis with R.
New York; London: Springer.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., &
White, J.-S. S. (2009). Generalized linear mixed models: a practical guide for ecology and
evolution. Trends in Ecology & Evolution, 24(3), 127–135.
Brayford, D. (1992). Fusarium oxysporum f. sp. cubense. IMI Descriptions of Fungi and
Bacteria, 112, 115.
Crawley, M. J. (2007). The R book. Chichester, England; Hoboken, N.J.: Wiley.
Dalgaard, P. (2008). Introductory Statistics with R. New York, NY: Springer New York.
Del Ponte, E. M., Shah, D. A., & Bergstrom, G. C. (2003). Spatial Patterns of Fusarium Head
Blight in New York Wheat Fields Suggest Role of Airborne Inoculum. Plant Health Progress.
doi:10.1094/PHP-2003-0418-01-RS
Domínguez‐Hernández, J., Negrín, M. A., & Rodríguez, C. M. (2008). Soil Potassium Indices
and Clay‐Sized Particles affecting Banana‐Wilt Expression Caused by Soil Fungus in Banana
Plantation Development on Transported Volcanic Soils. Communications in Soil Science and Plant
Analysis, 39(3-4), 397–412.
ESRI. (2013). ArcGIS for Desktop (Version 10.2). Redlands, California: Environmental
Systems Research Institute.
Fortin, M.-J., Dale, M. R. T., & Hoef, J. ver. (2002). Spatial analysis in ecology. In
Encyclopedia of Environmetrics (Vol. 4, pp. 2051–2058). Chichester, UK: John Wiley & Sons, Ltd.
Frison, E., & Sharrock, S. (1998). The economic, social and nutritional importance of banana in
the world. In Bananas and Food Security/Les productions bananières: un enjeu économique majeur
pour la sécurité alimentaire (pp. 21–35). Douala, Cameroon: INIBAP.
Garrett, K. A., Madden, L. V., Hughes, G., & Pfender, W. F. (2004). New Applications of
Statistical Tools in Plant Pathology. Phytopathology, 94(9), 999–1003.
Griffith, D. A. (2009). Spatial Autocorrelation. Retrieved August 19, 2014, from Elsevier Store
website: http://booksite.elsevier.com/brochures/hugy/SampleContent/Spatial-Autocorrelation.pdf
Grosjean, P., & Ibañez, F. (2014). pastecs: Package for Analysis of Space-Time Ecological
Series. R package (Version 1.3-18). Retrieved August 5, 2014, from
http://CRAN.R-project.org/package=pastecs
Guimarães, P. (2005). A simple approach to fit the beta-binomial model. Stata Journal, 5(3),
385–394.
Guzmán-Plazola, R. A., Gómez-Pauza, R., García-Espinosa, R., & Gavi-Reyes, F. (2004).
Distribución Espacial de la Pudrición Radical del Frijol (Phaseolus vulgaris L.) por Fusarium solani
(Mart.) Sacc. f. sp. phaseoli (Burk.) Snyd. y Hans. en la Vega de Metztitlán, Hidalgo, México.
Revista Mexicana de Fitopatología
Haining, R. P. (2003). Spatial data analysis: theory and practice. Cambridge, UK; New York:
Cambridge University Press.
Hilbe, J. M. (2013). Beta Binomial Regression. The SelectedWorks of Joseph M Hilbe.
Hughes, G., & Madden, L. V. (1993). Using the Beta-Binomial Distribution to Describe
Aggregated Patterns of Disease Incidence. Phytopathology, 83, 759–763.
Hughes, G., Madden, L. V., & Munkvold, G. P. (1996). Cluster Sampling for Disease Incidence
Data. American Phytopathological Society, 86(2), 132–137.
Hughes, G., Munkvold, G. P., & Samita, S. (1998). Application of the logistic-normal-binomial
distribution to the analysis of Eutypa dieback disease incidence. International Journal of Pest
Management, 44(1), 35–42.
Jackson, M. C., Huang, L., Xie, Q., & Tiwari, R. C. (2010). A modified version of Moran’s I.
International Journal of Health Geographics, 9(1), 33. doi:10.1186/1476-072X-9-33
Jarvis, A., Reuter, H. I., Nelson, A., & Guevara, E. (2008). Hole-filled seamless SRTM data V4.
International Centre for Tropical Agriculture (CIAT). Retrieved October 5, 2014, from
http://srtm.csi.cgiar.org
Kongchouy, N., Choonpradub, C., & Kuning, M. (2010). Methods for Modeling Incidence
Rates with Application to Pneumonia Among Children in Surat Thani Province, Thailand, 1(37),
29–38.
Krivoruchko, K. (2011). Spatial statistical data analysis for GIS users. Redlands, Calif.: Esri
Press.
Krivoruchko, K., Gribov, A., & Krause, E. (2011). Multivariate Areal Interpolation for
Continuous and Count Data. Procedia Environmental Sciences, 3, 14–19.
doi:10.1016/j.proenv.2011.02.004
Lichtemberg, P. S. F., Pocasangre, L. E., Staver, C., Dold, C., & Sikora, R. A. (2010). Fusarium
Wilt (Fusarium oxysporum f. sp. cubense) in Gros Michel (AAA) bananas, the incidence at
smallholder level in Nicaragua. In Conference on International Research on Food Security, Natural
Resource and Rural Development. Zurich.
Madden, L. V., Hughes, G., & van den Bosch, F. (2007). The Study of Plant Disease Epidemics.
St. Paul: APS Press.
Madden, L. V., & Hughes, G. (1994). BBD-Computer Software for Fitting the Beta-Binomial
Distribution to Disease Incidence Data. Plant Disease, 78(5), 536–540.
Morton, A. (2014). UTM Grid Zones of the World. Retrieved August 9, 2014, from
http://www.dmap.co.uk/utmworld.htm
Musoli, C. P., Pinard, F., Charrier, A., Kangire, A., ten Hoopen, G. M., Kabole, C., Owang, J.,
Bieysse, D., & Cilas, C. (2008). Spatial and temporal analysis of coffee wilt disease caused by
Fusarium xylarioides in Coffea canephora. European Journal of Plant Pathology, 122(4), 451–460.
doi:10.1007/s10658-008-9310-5
Nau, R. F. (2014). The logarithm transformation. Retrieved September 17, 2014, from
http://people.duke.edu/~rnau/411log.htm
Nelson, M. R., Felix-Gastelum, R., Orum, T. V., Stowell, L. J., & Myers, D. E. (1994).
Geographic Information Systems and Geostatistics in the Design and Validation of Regional Plant
Virus Management Programs. Phytopathology, 84(9), 898–905.
Nelson, M. R., Orum, T. V., Jaime-Garcia, R., & Nadeem, A. (1999). Applications of
Geographic Information Systems and Geostatistics in Plant Disease Epidemiology and
Management. Plant Disease, 83(4), 308–319. doi:10.1094/PDIS.1999.83.4.308
NIST/SEMATECH. (2014). Measures of Skewness and Kurtosis. Retrieved August 18, 2014,
from http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Oerke, E.-C., Meier, A., Dehne, H.-W., Sulyok, M., Krska, R., & Steiner, U. (2010). Spatial
variability of fusarium head blight pathogens and associated mycotoxins in wheat crops.
Plant Pathology, 59(4), 671–682. doi:10.1111/j.1365-3059.2010.02286.x
O’Sullivan, D., & Unwin, D. (2010). Geographic Information Analysis (2nd ed.). John Wiley &
Sons, Inc.
Pan, W. (2001). Akaike’s Information Criterion in Generalized Estimating Equations.
Biometrics, 57(1), 120–125. doi:10.1111/j.0006-341X.2001.00120.x
Paulitz, T. C., Zhang, H., & Cook, R. J. (2003). Spatial distribution of Rhizoctonia oryzae and
rhizoctonia root rot in direct-seeded cereals. Canadian Journal of Plant Pathology, 25(3), 295–303.
doi:10.1080/07060660309507082
Pérez-Vicente, L., Dita, M. A., & Parte, E. M. la. (2014). Technical Manual Prevention and
Diagnostic of Fusarium Wilt (Panama disease) of banana caused by Fusarium oxysporum f. sp.
cubense Tropical Race 4 (TR4). FAO.
Pfeiffer, D. U. (1996). Issues related to handling of spatial data. In Proceedings of the
epidemiology and state veterinary programmes (pp. 83–105). Christchurch.
Plant, R. E. (2012). Spatial data analysis in ecology and agriculture using R. Boca Raton: CRC
Press.
Ploetz, R. C. (2006). Fusarium Wilt of Banana Is Caused by Several Pathogens Referred to as
Fusarium oxysporum f. sp. cubense. Phytopathology, 96(6), 653–656.
doi:10.1094/PHYTO-96-0653
Ploetz, R. C., Jones, D. R., Sebaisgari, K., & Tushemereirwe, W. K. (1994). Panama disease on
East African highland bananas. Fruits (Paris), 49(4), 253–260.
R Core Team. (2014). R: A language and environment for statistical computing. Vienna,
Austria: R Foundation for Statistical Computing. Retrieved June 5, 2014, from
http://www.R-project.org/
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale
and shape (with discussion). Applied Statistics, 54, 507–554.
Román Jerí, C. H. (2012). Consideraciones epidemiológicas para el manejo de la Marchitez por
Fusarium (Fusarium oxysporum f. sp. cubense) del banano en la región central del Perú. CATIE,
Turrialba, Costa Rica.
Selvaraja, S., Balasundra, S. K., Vadamalai, G., & Husni, M. H. A. (2012). Spatial Variability
of Orange Spotting Disease in Oil Palm. Journal of Biological Sciences, 12(4), 232–238.
doi:10.3923/jbs.2012.232.238
Souris, M., & Bichaud, L. (2011). Statistical methods for bivariate spatial analysis in marked
points. Examples in spatial epidemiology. Spatial and Spatio-Temporal Epidemiology, 2(4), 227–
234. doi:10.1016/j.sste.2011.06.001
Stasinopoulos, D. M., & Rigby, R. A. (2007). Generalized Additive Models for Location Scale
and Shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1–46.
Stasinopoulos, M., Rigby, B., & Akantziliotou, C. (2008). Instructions on how to use the gamlss
package in R (Second Edition).
Taliei, F., Safaie, N., & Aghajani, M. A. (2013). Spatial Distribution of Macrophomina
phaseolina and Soybean Charcoal Rot Incidence Using Geographic Information System (A Case
Study in Northern Iran), 15, 1523–1536.
Teetor, P., & Loukides, M. K. (2011). R cookbook. Sebastopol, CA; Beijing: O’Reilly.
Tsai, P.-J. (2012). Application of Moran’s Test with an Empirical Bayesian Rate to Leading
Health Care Problems in Taiwan in a 7-Year Period (2002–2008). Global Journal of Health Science,
4(5). doi:10.5539/gjhs.v4n5p63
Tobler, W. R. (1970). A Computer Movie Simulating Urban Growth in the Detroit Region.
Economic Geography, 46(2), 234–240.
Urdan, T. C. (2010). Statistics in plain English (Third Edition). New York: Routledge.
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects
models and extensions in ecology with R. New York, NY: Springer-Verlag New York.
6 Annexes
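Annex 1 – R Code utilised to test spatial autocorrelation
#Loading packages and dataset
library(sp)       #coordinates()
library(maptools) #readShapePoints()
library(spdep)    #neighbour lists, spatial weights and autocorrelation tests
farms.p <- readShapePoints("SHP/farms.shp")
#Coordinate matrix of the sampled farms, used by the neighbour functions below
coords <- coordinates(farms.p)
farms.coords <- coords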
##########----Neighbours lists using different methods
#-----------------Graph based Neighbours
#Delaunay Triangulation
tri.list <- tri2nb(coords)
plot.nb(tri.list, coords)
nb_list.t <- nb2listw(tri.list, style = "W")
#Sphere of Influence
soi.1 <- soi.graph(tri.list, coords)
soi_nb <- graph2nb(soi.1)
plot.nb(soi_nb, coords)
nb_list.s <- nb2listw(soi_nb, style = "W")
#----------------Distance based Neighbours
#Nearest Neighbours
#Default k value - k = 1
k_near.1 <- knn2nb(knearneigh(farms.p))
plot.nb(k_near.1, coords)
#With k = 5
k_near.5 <- knn2nb(knearneigh(farms.p, k = 5))
#With k = 10
k_near.10 <- knn2nb(knearneigh(farms.p, k = 10))
#Neighbours Spatial Weights Lists (row-standardised, style = "W")
nb_list.k1 <- nb2listw(k_near.1, style = "W")
nb_list.k5 <- nb2listw(k_near.5, style = "W")
nb_list.k10 <- nb2listw(k_near.10, style = "W")
#Distance Bands
#Neighbours list based on distance: the upper bound is the largest
#first-nearest-neighbour distance, so every farm has at least one neighbour
k_dist <- nbdists(k_near.1, farms.coords)
k_dist_vec <- unlist(k_dist)
max_dist <- max(k_dist_vec)
dist_nei <- dnearneigh(farms.p, d1 = 0, d2 = max_dist)
plot(dist_nei, farms.coords)
nb_list.d <- nb2listw(dist_nei, style = "W")
#Moran's I test with the different neighbour lists
moran.test(farms.p$inc_raw, listw = nb_list.t, alternative = "two.sided")

moran.test(farms.p$inc_raw, listw = nb_list.s, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.k1, alternative = "two.sided")

moran.test(farms.p$inc_raw, listw = nb_list.d, alternative = "two.sided")
#Geary's c test with the different neighbour lists
geary.test(farms.p$inc_raw, listw = nb_list.t, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.s, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.k1, alternative = "two.sided")

geary.test(farms.p$inc_raw, listw = nb_list.d, alternative = "two.sided")
#Moran's I using Empirical Bayes Estimates
#EBest(n, x): n = diseased plants (cases), x = plants evaluated (population at risk)
ebi.1 <- EBest(farms.p$diseased_p, farms.p$evaluated_, family = "binomial")
moran.test(ebi.1$estmm, listw = nb_list.t, alternative = "two.sided")

moran.test(ebi.1$estmm, listw = nb_list.s, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.k1, alternative = "two.sided")

moran.test(ebi.1$estmm, listw = nb_list.d, alternative = "two.sided")
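The statistics returned by moran.test() and geary.test() can be reproduced by hand, which may help when reading the output. The following is a minimal base R sketch, not part of the original analysis: the incidence values, the chain of five farms and the helper names morans_i() and gearys_c() are hypothetical and only illustrate the formulas with row-standardised weights (the equivalent of style = "W").

```r
# Hypothetical toy data: 5 farms on a line, each neighbouring the adjacent farms
x <- c(0.10, 0.20, 0.15, 0.60, 0.70)  # made-up incidence values
n <- length(x)

# Binary adjacency matrix for the chain 1-2-3-4-5
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}
W <- W / rowSums(W)  # row standardisation: each row sums to 1

# Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Geary's c: ((n - 1) / (2 * S0)) * sum_ij w_ij (x_i - x_j)^2 / sum_i z_i^2
gearys_c <- function(x, W) {
  z <- x - mean(x)
  d2 <- outer(x, x, "-")^2
  ((length(x) - 1) / (2 * sum(W))) * sum(W * d2) / sum(z^2)
}

I_obs <- morans_i(x, W)  # observed Moran's I
c_obs <- gearys_c(x, W)  # observed Geary's c
I_exp <- -1 / (n - 1)    # expected Moran's I under no spatial autocorrelation
```

In this toy example the low values sit next to low values and the high values next to high values, so I_obs is about 0.54 (above I_exp = -0.25) and c_obs is about 0.30 (below 1), both pointing to positive spatial autocorrelation.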
Annex 2 – R Code utilised for the GAMLSS

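#Miscellaneous code for variable construction
#Two-column response: diseased plants and plants evaluated.
#Note: glm() and gamlss() read a two-column response as (events, non-events);
#if evaluated_ holds the total number of plants evaluated, the second column
#should be farms.p$evaluated_ - farms.p$diseased_p.
dis_inc <- cbind(farms.p$diseased_p, farms.p$evaluated_)
variety.f <- factor(farms.p$Variety, levels = c(1, 2, 3),
                    labels = c("Isla", "Seda", "Mix"))
#Base model GLM with all the explanatory variables
glm.0 <- glm(formula = dis_inc ~ pH + altitude + slope + area + planting_d +
             Average_So + factor(farm_type) * variety.f + shade_perc,
             data = farms.p, family = binomial)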
#Base GAMLSS model (beta-binomial, family = BB) with all the explanatory variables
library(gamlss) #Load the gamlss package
gamlss.1 <- gamlss(formula = dis_inc ~ pH + altitude + variety.f + planting_d +
                   slope + area + Average_So + variety.f * factor(farm_type) +
                   shade_perc, family = BB, data = farms.p)
summary(gamlss.1)
plot(gamlss.1)
stepGAIC(gamlss.1)
#Fitted model after selection using stepGAIC
gamlss.2 <- gamlss(formula = dis_inc ~ pH + altitude + planting_d + Average_So,
                   family = BB, data = farms.p)
summary(gamlss.2)
GAIC(gamlss.1, gamlss.2)
plot(gamlss.2)
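The model comparison above relies on the Generalised Akaike Information Criterion, which with the default penalty k = 2 equals the AIC: minus twice the log-likelihood plus k times the number of estimated parameters. The following is a minimal base R sketch of that calculation, not part of the original analysis. It uses a plain binomial glm() instead of the beta-binomial GAMLSS, and the data frame toy, its values and the helper gaic() are hypothetical. The response is written as cbind(events, non-events).

```r
# Hypothetical toy data: 6 farms with made-up counts and soil pH
toy <- data.frame(
  diseased  = c(2, 5, 1, 8, 4, 9),        # diseased plants
  evaluated = c(30, 30, 25, 40, 30, 35),  # plants evaluated
  pH        = c(6.5, 5.8, 6.9, 5.2, 6.0, 5.0)
)

# Two-column response: (events, non-events) = (diseased, healthy)
fit_full <- glm(cbind(diseased, evaluated - diseased) ~ pH,
                family = binomial, data = toy)
fit_null <- glm(cbind(diseased, evaluated - diseased) ~ 1,
                family = binomial, data = toy)

# Generalised AIC: -2 * log-likelihood + k * number of estimated parameters
gaic <- function(fit, k = 2) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + k * attr(ll, "df")
}

gaic_full <- gaic(fit_full)  # same value as AIC(fit_full)
gaic_null <- gaic(fit_null)  # same value as AIC(fit_null)
```

The model with the lower value is preferred; a larger k penalises additional parameters more heavily.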