
Master Thesis

submitted within the UNIGIS MSc programme


at Z_GIS
University of Salzburg

Spatial Data Analysis of Fusarium Wilt Incidence in the District of San Luis de Shuaro, Peru

by

David Brown Fuentes


1123727

A thesis submitted in partial fulfilment of the requirements of


the degree of
Master of Science (Geographical Information Science & Systems) – MSc (GISc)

Advisor:
Carlos Mena, PhD

Turrialba, August 2014


Dedicated to my wife and my parents.

Abstract
A Spatial Data Analysis was conducted on the disease incidence of Fusarium Wilt of banana in the district of San Luis de Shuaro, Peru. The dataset was consolidated from raw data and comprises 76 records, each one representing a farm sampled for Fusarium Wilt of banana. The spatial distribution was represented through a point map in which each point is a sampled farm. Thiessen Polygons were used as a basic interpolation method to obtain a general representation of the disease distribution in the study area. The Spatial Data Analysis includes spatial autocorrelation tests, which were applied through Moran’s I, Geary’s c and a modified version of Moran’s I using an Empirical Bayes Estimate. The relationship between the disease incidence and possible explanatory variables (altitude, soil pH, plant density, farm type, farm size, banana variety, shade percentage, slope and the Soil Quality Index) was also analysed through a GAMLSS model with a beta-binomial distribution. The results show no evidence of spatial autocorrelation in the disease incidence. The regression model shows that the most significant explanatory variables of the disease incidence are soil pH, altitude, plant density and the Soil Quality Index.

Acknowledgements
I want to recognize the support of Bioversity International for allowing me to join the Master’s degree programme at UNIGIS, and especially Miguel Dita, Charles Staver, Stephan Weise and Dietmar Stoian for supporting the development of my career in GIS at Bioversity International.

I want to recognize the valuable help of Karl Atzmanstorfer as my thesis advisor and of Gunda Cespedes during the revision process of my thesis, both at the University of Salzburg.

I thank Jacob van Etten for his comments and for lending me a very useful book, and Philippe Tixier for all his comments and suggestions.

A special acknowledgement goes to Carlos Román Jerí for the hard work he did in San Luis de Shuaro, Peru, collecting data for his own research, which also allowed me to conduct this work.

Finally, and most importantly, I want to acknowledge the support and valuable suggestions of my wife Karol Paola during the development of my thesis.

Table of contents
Abstract
Acknowledgements
Table of contents
List of Figures
List of Tables
1. Introduction
1.1 Motivation
1.2 Problem Description
1.3 Objectives
1.4 Hypothesis
1.5 Scope
2. Literature review
2.1 Fusarium Wilt
2.2 Disease Incidence
2.3 Spatial Data Analysis
2.3.1 Areal Analysis
2.3.2 Geostatistical Analysis
2.3.3 Point Pattern Analysis
2.3.4 Spatial Autocorrelation
2.4 Exploratory Spatial Data Analysis
2.5 Measures of Spatial Autocorrelation
2.6 Moran’s I using Empirical Bayes Estimates
2.7 Semivariogram
2.8 Thiessen Polygons
2.9 Spatial Data Analysis Applied to Plant Disease Incidence
2.10 Linear Regression Model and some considerations about it
2.11 Generalised Linear Models
3 Methodology
3.1 Case study area: San Luis de Shuaro
3.2 Workflow diagram
3.3 Data preparation
3.5 Mapping the Spatial Distribution of Disease Incidence
3.5.1 Points Map
3.5.2 Thiessen Polygons
3.6 Data Exploration
3.6.1 Histogram
3.6.2 Normal Q-Q Plot
3.6.3 Spatial autocorrelation tests
3.6.3.1 Neighbour List and Spatial Weights
3.6.3.2 Moran’s I
3.6.3.3 Geary’s C
3.6.3.4 Modified Moran’s I – Empirical Bayes Index
3.7 GAMLSS applied to Fusarium Wilt Disease Incidence
4 Results and Analysis
4.1 Map of the Spatial Distribution of Disease Incidence in the Study Area
4.2 Results of Exploratory Data Analysis
4.2.1 Histogram
4.2.2 Normal Q-Q Plot
4.2.3 Skewness and Kurtosis
4.2.4 Neighbours list and Spatial Weights
4.2.5 Results of Moran’s I Test
4.2.6 Results of Geary’s c Test
4.2.7 Results of Global Moran’s I using Empirical Bayes Estimates
4.3 Results of the GAMLSS
4.4 Analysis of the Results
4.4.1 Spatial Distribution of Fusarium Wilt in the study area
4.4.2 Spatial Autocorrelation Tests
4.4.3 GAMLSS
4 Conclusions
5 References
6 Annexes
Annex 1 – R Code utilised for test spatial autocorrelation
Annex 2 – R Code utilised for the GAMLSS

List of Figures
Figure 1 – Location of San Luis de Shuaro District (David Brown, ArcGIS for Desktop 10.2)
Figure 2 – Elevation of the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 3 – Location of banana farms in the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 4 – Workflow diagram (David Brown, Microsoft Word 2010)
Figure 5 – Symbols and classification utilised to map the distribution of the measured disease incidence (David Brown, ArcGIS for Desktop 10.2)
Figure 6 – Map of the Spatial Distribution of Fusarium wilt incidence in the study area (David Brown, ArcGIS for Desktop 10.2)
Figure 7 – Thiessen Polygons constructed as from the points of each sample farm (David Brown, ArcGIS for Desktop 10.2)
Figure 8 – Histogram of disease incidence rate
Figure 9 – Histogram of a normal distribution from simulated data
Figure 10 – Normal Q-Q Plot of the disease incidence
Figure 11 – Normal Q-Q Plot for a normal distribution for simulated data
Figure 12 – a) Delaunay triangulation, b) Sphere of Indifference, c) Distance based
Figure 13 – a) k nearest neighbour with k = 1, b) k nearest neighbour with k = 5, c) k nearest neighbour with k = 10
Figure 14 – Plots of the residuals for model validation

List of Tables
Table 1 – Categories of Spatial Data Analysis according to the spatial data type
Table 2 – Main Characteristics and Differences between Moran’s I and Geary’s c, based on Fortin et al. (2002)
Table 3 – Variables included in the dataset consolidated from the raw data provided by Román Jerí (2012)
Table 4 – Results of Moran’s I test computations using different neighbours list methods
Table 5 – Results of Geary’s c test computations using different neighbours list methods
Table 6 – Results of Moran’s I with Empirical Bayes Estimates computations using different neighbours list methods
Table 7 – Results of the GLM model with a Binomial distribution and all the proposed explanatory variables
Table 8 – Results for the base model with all the proposed variables
Table 9 – Results of the adjusted model after applying the variable selection using stepGAIC

1. Introduction

1.1 Motivation

Bananas and plantains are very important crops in developing countries, both as a staple food and as a commodity (Arias, Dankers, Liu & Pilkauskas, 2003). They are cultivated in more than 100 countries around the world, in both tropical and subtropical regions (Frison and Sharrock, 1998). Often referred to simply as bananas, they are grown in two typical scenarios: production for the export market, and small-scale production for local markets and as a staple. The former is characterized by high input applications and just a few banana varieties (Arias, Dankers, Liu & Pilkauskas, 2003).

Like any other crop, bananas and plantains are affected by diseases which can reduce or totally inhibit production. One of the most important is Fusarium wilt of banana, a soilborne fungal disease caused by the fungus Fusarium oxysporum f. sp. cubense, which affects different banana cultivars (Brayford, 1992). In production for the export market, Fusarium Wilt ceased to be a problem with the change from the susceptible variety Gros Michel to the resistant Cavendish. However, many small farmers still grow susceptible varieties despite the risk that Fusarium Wilt represents to their production, mainly because most of the susceptible varieties are highly appreciated by consumers in local markets. With the appearance and spread of a new race of the pathogen, known as Tropical Race 4, which affects the Cavendish variety, the disease is regaining importance in production for the export market as well. The study of Fusarium Wilt is therefore now very important both to the banana export industry and to small farmers in developing countries. Analysing the spatial data of Fusarium wilt of banana is relevant for a better understanding of the factors that affect the incidence of the disease, which could support the design of disease assessment and management methodologies.

1.2 Problem Description

The district of San Luis de Shuaro in Peru is a zone where small farmers grow different banana varieties as their main economic activity (Román Jerí, 2012), and many of these cultivated varieties are susceptible to Fusarium wilt. Since most infected plants fail to produce a banana bunch, each diseased plant represents an economic loss to the farmer. Although it has some distinctive characteristics, the region of San Luis de Shuaro presents conditions similar to those of other regions in Latin America and the Caribbean, which in general are characterized by the diversity of their production systems, ranging from monocrop to agroforestry systems with mixed crops. The findings of the present work could provide valuable insights for understanding this complex disease not only in the case study region but also in others with similar conditions and characteristics.

1.3 Objectives

General Objective: To conduct spatial data analysis of Fusarium wilt incidence in the district of

San Luis de Shuaro, Peru.

Specific Objectives

• To map the spatial distribution of the disease incidence rate of Fusarium Wilt of banana in the case study area

• To conduct Exploratory Data Analysis including three different tests for Spatial Autocorrelation: Moran’s I, Geary’s c and Moran’s I using Empirical Bayes Estimates

• To model the relationship between Fusarium Wilt incidence and a set of explanatory variables using a GAMLSS (Generalised Additive Models for Location Scale and Shape)

• To analyse the relationship of Fusarium Wilt incidence with the variables selected by the implemented GAMLSS

Research Questions

1) How could the spatial distribution of the disease incidence be represented using GIS software and cartographic techniques to provide a general description of the region in terms of incidence levels?

2) Are the sampled farms in the study area spatially autocorrelated with respect to disease incidence?

3) Which is the most reliable combination of model and probability distribution for analysing the relationship between Fusarium Wilt incidence and a set of explanatory variables?

4) Which factors influence Fusarium Wilt incidence in the study area?

1.4 Hypothesis

H0: There is no spatial autocorrelation of Fusarium Wilt incidence among the sampled farms in the area of San Luis de Shuaro District.

H1: There is spatial autocorrelation of Fusarium Wilt incidence among the sampled farms in the area of San Luis de Shuaro District.

1.5 Scope

The present work analyses the spatial distribution of Fusarium wilt incidence using spatial statistics, to determine whether neighbouring farms are more likely to exhibit similar levels of incidence, which could lead to a better understanding of the disease dynamics. It also analyses the relationship between a set of proposed variables and the disease incidence. The study comprises information from 76 farms in the region of San Luis de Shuaro, Peru. More specific and technical details can be found in Chapter 3. The mapping of the spatial distribution is a graphical representation of the disease incidence using cartographic techniques, with the aim of providing a general description of how the different levels of disease incidence are distributed across the study region. Geographical location data were available only at farm level, so the spatial analysis is limited to the spatial relationship between sampled farms; the relationship between plants within each sampled farm is out of the scope of this work. In addition to the spatial analysis per se, the results on spatial autocorrelation between farms with respect to disease incidence are a key input for analysing the relationship between the set of proposed variables and the disease incidence through a Generalised Linear Model. In this respect, the present work outlines how further analyses with similar inputs and goals could be conducted, especially by suggesting a reliable combination of regression model and probability distribution. Finally, the study determines a set of explanatory variables that influence the disease incidence, based on the statistical results obtained after implementing the Generalised Linear Model. This provides valuable input for future studies focused on the most influential factors, contributing to the development of better methodologies and strategies both for the assessment of a suspected area and for the management of a confirmed infected area.

2. Literature review

This section condenses the concepts behind this work, from the description of the analysed disease to the concepts of Spatial Data Analysis applied to analyse it, drawing on both scientific papers and theoretical books. A list of suggested literature is also provided to help readers find other works that apply Spatial Data Analysis to plant disease incidence.

2.1 Fusarium Wilt


“Panama disease of bananas is historically one of the most infamous plant diseases, destroying the

banana production industries in areas of Central America where the highly susceptible banana

cultivar Gros Michel predominated from ca. 1900–1955 ” (Brayford, 1992, p. 1).

Also known as Panama disease, Fusarium Wilt is a soilborne fungal disease which infects the plant through the roots, spreading to the xylem and causing vascular browning in the pseudostem (Brayford, 1992). Typical symptoms include vascular discolouration and yellowing of the leaves (Pérez-Vicente, Dita & Martinez de la Parte, 2014). The pathogen can survive up to 30 years in the soil in the absence of banana (Ploetz, 2006). The disease can be dispersed through infected plant material, water and soil (Pérez-Vicente et al., 2014).

As indicated by Pérez-Vicente et al. (2014), the likelihood that a susceptible banana plant infected with Fusarium wilt recovers is very low, and if it does recover its growth will be deficient. According to Ploetz (2006), the options available for managing this disease are scarce, genetic resistance being the most effective.

Its importance in the global banana market was reduced in 1950, when the shift from the susceptible Gros Michel variety to the resistant Cavendish took place in Latin America (Pérez-Vicente et al., 2014). However, susceptible varieties are still cultivated on a small scale, especially by smallholders, mixed with other crops like coffee, cacao and trees in agroforestry systems (Pérez-Vicente et al., 2014). This situation has also affected Fusarium Wilt research, which according to Lichtemberg, Pocasangre, Staver, Dold and Sikora (2010) comprises two eras: the Gros Michel era, when efforts were focused on possible origins and on the epidemics in the American tropics, and the Cavendish era, when efforts were focused on pathogen diversity rather than on the disease.

2.2 Disease Incidence


Madden, Hughes and van den Bosch (2007) define disease incidence as the proportion of plants (or plant units) diseased, or the number of diseased plants (or plant units) out of the total assessed. The same authors indicate that disease incidence can be measured at different scales depending on the plant unit used. In this work, the plant unit is an individual plant. Within this context, “disease incidence provides an estimate of the probability of infection” (Hughes, Munkvold & Samita, 1998), and according to Madden et al. (2007) it is the most common type of spatial plant disease record encountered in the phytopathological literature.
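
As a small illustration of this definition, the following Python sketch computes incidence as a proportion per farm. The function name and the farm counts are hypothetical and are not taken from the thesis dataset.

```python
def disease_incidence(diseased, assessed):
    """Proportion of assessed plants that are diseased (a value in [0, 1])."""
    if assessed <= 0:
        raise ValueError("at least one plant must be assessed")
    if not 0 <= diseased <= assessed:
        raise ValueError("diseased count must lie between 0 and the total assessed")
    return diseased / assessed


# Hypothetical (diseased, assessed) plant counts for three farms; not thesis data.
farm_counts = [(12, 400), (0, 250), (45, 300)]
incidence = [disease_incidence(d, n) for d, n in farm_counts]
print(incidence)
```

Because each farm has a different number of assessed plants, two farms with the same incidence do not carry the same amount of information, which is the reason rate-based methods are considered later in this review.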

2.3 Spatial Data Analysis


Spatial Data Analysis is the area of Spatial Analysis in which statistical techniques are developed and applied to analyse spatial data (Haining, 2003). According to Bailey and Gatrell (as cited in Pfeiffer, 1996, p.83), methods used in spatial data analysis can be categorized as:

a) Methods for visualizing data

b) Methods for exploratory data analysis

c) Methods for development of statistical models

Following the classification of Cressie (as cited in Plant, 2012, p. 5), there are three categories of spatial data: geostatistical, areal and point pattern. As presented by Krivoruchko (2011, p. 22), these categories correspond to continuous, aggregated and discrete data respectively. Table 1 summarizes each type of data and the type of spatial data analysis that corresponds to it.

Table 1 – Categories of Spatial Data Analysis according to the spatial data type

| Type of data | Spatial Data Analysis |
|---|---|
| Discrete | Point Pattern Analysis |
| Aggregated or Areal | Lattice or Areal |
| Continuous | Geostatistical |

2.3.1 Areal Analysis


According to Plant (2012, p. 5), “… areal data consist of data that are defined only at a set of locations, which may be points or polygons”. The main objective of analysing areal data is to detect and explain spatial patterns, sometimes including their relationship with covariates (Pfeiffer, 1996).

2.3.2 Geostatistical Analysis


Geostatistical data are spatially continuous. The main objectives of analysing them are to describe the spatial variation of an attribute variable (Pfeiffer, 1996) and to interpolate the value of the measured attribute at points where it was not measured (Plant, 2012, p.5).

2.3.3 Point Pattern Analysis


As shown in Table 1, point pattern analysis deals with discrete data and analyses the pattern of the recorded locations. Typically this pattern is analysed through what is called analysis of clustering (Pfeiffer, 1996).

2.3.4 Spatial Autocorrelation


“Everything is related to everything else, but near things are more related than distant things” –

Tobler’s First Law of Geography. (Tobler, 1970)

Tobler’s first law of geography is a recurrent citation in the spatial analysis literature and so far remains the best way to define spatial autocorrelation.

Griffith (2009) defines spatial autocorrelation as “…the correlation among values of a single

variable strictly attributable to their relatively close locational positions on a two-dimensional (2-D)

surface, introducing a deviation from the independent observations assumption of classical

statistics”.

Miron (as cited in Plant, 2012, p.423) presents three different sources of spatial autocorrelation in regression models:

a) Interaction

b) Reaction

c) Model misspecification

2.4 Exploratory Spatial Data Analysis


Haining and Cressie (as cited in Haining, 2003, p.182) define ESDA as a set of techniques to:

a) Explore spatial data

b) Summarize spatial properties of data

c) Detect spatial patterns in data

d) Formulate hypotheses related to the geography of the data

Exploratory spatial data analysis includes both visual and numerical methods. Visual methods can include histograms, Q-Q plots, boxplots and scatterplots (Haining, 2003, p.189). Numerical methods include spatial autocorrelation tests like Moran’s I, used to explore whether the data are clustered or dispersed (Haining, 2003, p.226).

2.5 Measures of Spatial Autocorrelation


Moran’s I and Geary’s c are both indices used to test the null hypothesis of zero spatial autocorrelation (Plant, 2012, p. 104). Their main features and differences are presented in Table 2, based on Fortin, Dale and ver Hoef (2002).

Table 2 – Main Characteristics and Differences between Moran’s I and Geary’s c, based on Fortin et al. (2002)

| Characteristic | Moran’s I | Geary’s c |
|---|---|---|
| How it is computed | Degree of correlation between values of a variable as a function of spatial lags. | Difference among … |
| Zero Spatial Autocorrelation | The expected value for zero autocorrelation is nearly zero, although more formally $-\frac{1}{n-1}$, with $n$ as the number of areal units. | The expected value for zero autocorrelation is 1. |
| Positive Spatial Autocorrelation | Values near 1 | Values near 0 |
| Negative Spatial Autocorrelation | Values near -1 | Values near 2 |
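
To make the contrast between the two indices concrete, the following Python sketch computes both from their standard textbook formulas on a toy example. It is only an illustration: the thesis itself ran these tests in R (Annex 1), and the function names, the four-site chain of neighbours and the values used here are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    ss = sum(d * d for d in dev)
    if ss == 0:
        raise ValueError("all values are identical; the index is undefined")
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / ss


def gearys_c(values, weights):
    """Global Geary's c for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    if ss == 0:
        raise ValueError("all values are identical; the index is undefined")
    s0 = sum(sum(row) for row in weights)
    sq_diff = sum(weights[i][j] * (values[i] - values[j]) ** 2
                  for i in range(n) for j in range(n))
    return ((n - 1) / (2 * s0)) * sq_diff / ss


# Hypothetical example: four sites in a row, each neighbouring the adjacent ones.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
trend = [1.0, 2.0, 3.0, 4.0]        # similar values sit next to each other
alternating = [1.0, 4.0, 1.0, 4.0]  # dissimilar values sit next to each other

print(morans_i(trend, chain), gearys_c(trend, chain))
print(morans_i(alternating, chain), gearys_c(alternating, chain))
```

With the smooth trend, Moran’s I is positive and Geary’s c falls below 1; with the alternating pattern, Moran’s I is negative and Geary’s c rises above 1, which matches the directions summarised in Table 2.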

2.6 Moran’s I using Empirical Bayes Estimates


The traditional calculation of Moran's I for disease cases does not account for population
heterogeneity, so that, its application to disease rates or proportions may result in indication of
spatial correlation that is completely due to the spatial proximity of population sizes, but not
due to the similarity among the disease rates. (Jackson, Huang, Xie & Tiwari, 2010)

Assunção and Reis (1999) propose an Empirical Bayes modification for the calculation of Moran’s I when it is applied to rates calculated from populations of different sizes, which is the case for the data analysed in the present work.
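
The idea can be sketched in Python as follows. This is a minimal illustration of the rate standardisation as commonly described for the Assunção and Reis (1999) index, not the code used in the thesis: the function name, the fallback used when the estimated variance is not positive, and the example counts are assumptions. Here each "population" is taken to be the number of plants assessed on a farm.

```python
import math


def eb_standardised_rates(cases, populations):
    """Standardise rates so that small and large populations become comparable.

    Each rate p_i = cases_i / populations_i is centred on the overall rate b
    and divided by the square root of an estimated variance a + b / n_i.
    """
    m = len(cases)
    total_pop = sum(populations)
    b = sum(cases) / total_pop  # overall rate
    if b == 0:
        raise ValueError("no cases at all; the standardisation is undefined")
    rates = [x / n for x, n in zip(cases, populations)]
    s2 = sum(n * (p - b) ** 2 for n, p in zip(populations, rates)) / total_pop
    a = s2 - b / (total_pop / m)  # total_pop / m is the mean population size
    z = []
    for p, n in zip(rates, populations):
        v = a + b / n
        if v <= 0:
            v = b / n  # assumed fallback when the variance estimate is not positive
        z.append((p - b) / math.sqrt(v))
    return z


# Hypothetical diseased and assessed plant counts for four farms; not thesis data.
diseased = [5, 10, 0, 20]
assessed = [100, 200, 50, 400]
z_scores = eb_standardised_rates(diseased, assessed)
print(z_scores)
```

The standardised values would then replace the raw rates in the usual Moran’s I formula. In the example, three farms share the same raw rate of 0.05, but the farm with 400 assessed plants receives the largest standardised value because its rate is estimated more precisely.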

2.7 Semivariogram
From the geostatistical point of view, spatial autocorrelation is tested through the semivariogram (Nelson, Orum, Jaime-Garcia and Nadem, 1999). The semivariogram describes the variability between two measurements as a function of the separation between their locations, which is called the spatial lag $h$, using the following function (Plant, 2012, p. 117):

$$\gamma(h) = \frac{1}{2}\,\mathrm{var}\left[Y(x + h) - Y(x)\right]$$

The semivariogram is commonly represented as a plot of the semivariance as a function of distance (Bivand, Pebesma & Gomez-Rubio, 2008). Rather than using the function above, the semivariogram is commonly estimated through the experimental semivariogram, which is given by the following function (Plant, 2012, p. 118):

$$\hat{\gamma}(h) = \frac{1}{2m(h)} \sum_{i=1}^{m(h)} \left[Y(x_i + h) - Y(x_i)\right]^2$$

where $m(h)$ is the number of pairs of locations separated by the lag $h$.

For a clearer understanding of the terms variogram and semivariogram and their differences, see the clarification provided by Bachmaier and Backes (2008).
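
The experimental semivariogram can be illustrated with a short Python sketch. It groups all pairs of points into distance bins and averages half the squared differences within each bin. The binning rule, the function name and the four sample points are assumptions made for this illustration only.

```python
import math


def experimental_semivariogram(coords, values, lag_width, n_lags):
    """Return (semivariances, pair_counts) for n_lags distance bins.

    Bin k collects the pairs whose separation d satisfies
    k * lag_width < d <= (k + 1) * lag_width (coincident points go to bin 0).
    A bin without pairs gets a semivariance of None.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            d = math.dist(coords[i], coords[j])
            k = max(0, math.ceil(d / lag_width) - 1)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    gamma = [s / (2 * c) if c else None for s, c in zip(sums, counts)]
    return gamma, counts


# Hypothetical measurements at four equally spaced points on a line.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 3.0, 4.0]
gamma, counts = experimental_semivariogram(coords, values, lag_width=1.0, n_lags=3)
print(gamma, counts)
```

In this example the semivariance grows with the lag, which is the pattern expected when nearby measurements are more alike than distant ones.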

2.8 Thiessen Polygons


Also known as Proximity Polygons or Voronoi Maps (O’Sullivan & Unwin, 2010, p. 50), Thiessen Polygons are considered by Plant (2012, p.163) a useful interpolation method when locations are too sparse and irregular. Thiessen Polygons are areas calculated geometrically from points, taking into account only the distance to each point. More formally, according to O’Sullivan and Unwin (2010, p. 50), the polygon “of any entity is that region of the space which is closer to the entity than it is to any other”. Although they are useful, as pointed out by Plant, their limitations should be taken into account, and exaggerated conclusions should not be drawn from the resulting polygons alone.
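
Used as an interpolator, this definition amounts to giving every location the value of its nearest sampled point. The Python sketch below shows that rule without constructing the polygon geometry, which in the thesis was done in GIS software. The function name, coordinates and incidence values are hypothetical.

```python
import math


def thiessen_value(query, sites, values):
    """Value assigned to `query`: that of the nearest site (first site wins ties)."""
    nearest = min(range(len(sites)), key=lambda i: math.dist(query, sites[i]))
    return values[nearest]


# Hypothetical farm locations and incidence values; not thesis data.
sites = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
site_incidence = [0.05, 0.20, 0.00]

print(thiessen_value((1.0, 1.0), sites, site_incidence))
print(thiessen_value((9.0, 1.0), sites, site_incidence))
```

The resulting surface is piecewise constant and changes abruptly at polygon boundaries, which is one of the limitations to keep in mind when interpreting such maps.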

2.9 Spatial Data Analysis Applied to Plant Disease Incidence


Bivand et al. (2008, p. 311) indicates that “displaying the spatial variation of the incidence of a

disease can help us to detect areas where the disease is particularly prevalent, which may lead to the

detection of previously unknown risk factors”. Although this statement was done from the point of

view human health, in some way it is also valid for plant disease epidemiology. From the point of

view of the study of a plant disease, to conduct a spatial autocorrelation analysis both the spatial

location and the disease status of the sampling units must be known, (Madden et al., 2007).

A considerably amount of studies were found in the literature about spatial data analysis applied to

plant disease including the Nelson et al. (1999) which presents some applications of GIS and

Geostatistics to plant disease epidemiology. The work of Selvaraja, Balassundram, Vadamalai and

Husni (2012) applies geostatistics to analyse the spatial variability of the Orange Spotting Disease

in oil palm. Talei, Safaie and Aghajani (2013) studied the spatial distribution of Soybean Charcoal

Rot incidence using geostatistics, more specifically interpolation with ordinary kriging. Alves and Pozza (2010), on the other hand, propose the use of indicator kriging to study the spatial variability of common bean anthracnose. Del Ponte, Shah and Bergstrom (2003) analysed the spatial patterns of Fusarium head blight using the index of dispersion through a beta-binomial distribution. Although that index does not involve the spatial location of the sampled data, according to this literature review it is a common methodology in plant pathology for estimating aggregation or dispersion patterns; relevant works include Hughes and Madden (1993), Madden and Hughes (1994) and Hughes, Madden and Munkvold (1996). Nelson, Felix-Gastelum, Orum, Stowell and Myers (1994) applied geostatistics to design and validate the regional plant virus management programs of the Del Fuerte Valley, located in Sinaloa, Mexico. Guzmán-Plazola, Gómez-Pauza, García-Espinosa and Gavi-Reyes (2004) applied geostatistics to interpolate the spatial distribution of Fusarium solani f. sp. phaseoli, which is the cause of root rot in common bean. Musoli et al. (2008) conducted a spatial and temporal analysis of coffee wilt disease, which is caused by Fusarium xylarioides, also using a geostatistical approach. Oerke, Meier, Dehne, Sulyok, Krska and Steiner (2010) analysed the spatial variability of Fusarium head blight pathogens in wheat crops using the Spatial Analysis by Distance IndicES (SADIE) and Lloyd’s index of patchiness.

No studies applying Spatial Data Analysis to the incidence of Fusarium Wilt of banana were found while searching for previous work with similar objectives. Although this does not prove that none exist, it suggests that studies of this kind are scarce. Lichtemberg et al. (2010) analysed Fusarium Wilt incidence at smallholder level in Nicaragua using classical statistical methods such as Pearson correlation and ANOVA. Plotting of the farms and comparison of two different zones

were also conducted in that work. Román Jerí (2012) conducted a study in which geographical

characteristics were considered for targeting the farms to be analysed and thereafter spatial

distribution of Fusarium Wilt incidence was plotted for graphical representation. Although the study

of Román Jerí (2012) does not include a strong component of spatial data analysis, the collected

data is the base of the present work.

As Schabenger and Pierce (as cited in Madden et al., 2007, p. 15) indicate, “…disease incidence is a count with a natural denominator, which could be converted into proportions”. This is important to take into account when selecting the appropriate type of statistical analysis (Madden et al., 2007). For example, Madden and Hughes (1994) indicate that distributions like the Poisson and the negative binomial are generally inappropriate for analysing disease incidence (rate) data, and propose the use of the beta-binomial distribution instead. Krivoruchko (2011) points out that typical indices used to measure spatial autocorrelation, like Moran’s I and Geary’s c, are commonly applied to rates even though these indices assume that the mean and variance of the data are constant, conditions that are difficult to find in rate data like disease incidence. Paulitz, Zhang and Cook (2003) apply what they call a Spatial Generalised Mixed Model to account for spatial autocorrelation in a spatial point pattern analysis and also to interpolate disease incidence rates. However, this methodology should be approached with caution due to the high level of complexity of a GLMM (Generalised Linear Mixed Model), which is a challenge even for statisticians (Bolker et al., 2009).

2.10 Linear Regression Model and some considerations about it


Linear Regression Models are frequently used in statistical analysis (Kongchouy, Choonpradub & Kuning, 2010). They help researchers to explore the relationship between variables and to explain the strength of a set of independent variables in predicting a dependent variable (Urdan, 2010, p. 145). Zuur, Ieno, Walker, Saveliev and Smith (2009, p. 17) call linear regression “the mother of all models”. However, like any other model it has limitations that should be taken into account to apply it

correctly. The following equation, reproduced from Zuur et al. (2009, p. 17), shows the linear regression model.

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$
where
$$\varepsilon_i \sim N(0, \sigma^2)$$

Following the explanation of Zuur et al. (2009, p. 17), $Y_i$ is the response or dependent variable and $X_i$ is the explanatory or independent variable. The information that is not explained by the model is captured by the residuals, represented in the equation by $\varepsilon_i$, while $\alpha$ and $\beta$ represent the population intercept and slope respectively; both are unknown parameters.
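To make the roles of the intercept, slope and residuals concrete, the following minimal Python sketch estimates them by ordinary least squares for a single explanatory variable. The function name and the data are hypothetical and serve only as an illustration.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares estimates of the intercept and slope, plus the residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                 # estimated slope
    alpha = mean_y - beta * mean_x   # estimated intercept
    # Residuals: the part of each observation not explained by the fitted line.
    residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    return alpha, beta, residuals


# Hypothetical explanatory and response values.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.0]
alpha, beta, residuals = fit_simple_linear_regression(x, y)
```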

According to Zuur et al. (2009, p. 19), five assumptions should be considered to apply linear regression correctly:

1. Normality: Linear Regression assumes that the data has a normal distribution. In this sense, normality means that when a plot of the frequency of the cases (on the y axis) against the score of the variable of interest (on the x axis) is constructed, it will exhibit a bell-shaped curve (Urdan, 2010, p. 10).

2. Homogeneity: The homogeneity assumption means that the spread of the data should be the same at each X value. When this condition is not met it is called heteroskedasticity (Bivand et al., 2008, p. 274) or heterogeneity (Zuur et al., 2009, p. 20).

3. Fixed X: This assumption means that the explanatory variables are deterministic and not random variables (Zuur et al., 2009, p. 21).

4. Independence: According to Zuur et al. (2009, p. 21), independence means that the Y value at $X_i$ is not influenced by other $X_i$, and its violation is the most serious problem. Zuur et al. also point out that there are two ways to violate the independence assumption: by applying an improper model, or through a dependence structure due to the nature of

the data. A dependence structure could be due to temporal or spatial dependence, the latter being of special interest in the present work.

5. Correct Model Specification: This assumes that the explanatory variables have been selected correctly.

Authors differ on how to proceed when one of these assumptions is violated. For example, when the normality assumption is violated, some apply a transformation to the data, such as the logarithmic transformation, to try to obtain the desired normal distribution. At the other extreme are those who prefer to switch to another model without applying any transformation to the data. Other models include Generalised Linear Models, Generalised Additive Models, Generalised Least Squares, etc. Each of these tackles violations of the linear regression assumptions in a different way, depending on which of the assumptions is violated.

The approach selected for the present work follows the suggestion stated by Zuur et al. (2009, p. 19):

“Always apply the simplest statistical technique on your data, but ensure it is applied correctly”.

2.11 Generalised Linear Models


When the analysed data doesn’t fulfil the requirements to use the Linear Regression Model, the

GLM (Generalised Linear Models) come to the scene as the most convenient solution (Plant, 2012,

p. 301). A GLM basically consists of three distinctive parts (Crawley, 2007, p. 512):

a) The error structure

b) The linear predictor

c) The link function

The error structure refers to the type of distribution of the error in the analysed data, which can also be seen as the distribution of the response variable. Instead of applying a transformation when the analysed data has a non-normal distribution, a GLM allows different types of distribution to be specified, such as binomial, Poisson, etc. (Crawley, 2007, p. 512). The linear predictor is also called

the systematic part (Zuur et al., 2009, p. 210) and in general terms is the set of explanatory variables expressed as a function. Finally, the link function is the part which relates the systematic part to the mean of the response variable. The implementation of a GLM consists of three steps (Zuur et al., 2009, p. 210), which coincide with the three parts presented by Crawley (2007, p. 512):

a) An assumption of the distribution of the response variable

b) Specify the systematic part (The explanatory variables)

c) Specify the link function
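The three parts can be illustrated with a small Python sketch, assuming a binomial response with a logit link (a common choice for proportions, used here purely as an example). The coefficients and covariate value are hypothetical and are not estimates from this work.

```python
import math


def linear_predictor(intercept, coefficients, covariates):
    """Systematic part: a linear combination of the explanatory variables."""
    return intercept + sum(b * x for b, x in zip(coefficients, covariates))


def logit(mu):
    """Link function: maps a mean proportion in (0, 1) to the linear scale."""
    return math.log(mu / (1.0 - mu))


def inverse_logit(eta):
    """Inverse link: maps the linear predictor back to a mean proportion."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical intercept, slope and covariate value.
eta = linear_predictor(-1.0, [0.5], [2.0])
# Mean of the binomial response, that is, the expected incidence as a proportion.
mu = inverse_logit(eta)
```

The inverse link guarantees that the fitted mean stays between 0 and 1 whatever the value of the linear predictor, which a plain linear regression on proportions cannot ensure.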

However, an undesirable but typical characteristic of disease incidence data is overdispersion, which occurs when the observed variability is greater than that predicted by the assumed distribution (Garret, Madden, Hughes & Pfender, 2004). If overdispersion is not taken into account, it will totally invalidate the statistical inference obtained from the model (Guimarães, 2005). Approaches to account for overdispersion include the use of a maximum quasi-likelihood method instead of maximum likelihood, and the use of discrete distributions like the negative binomial and the beta-binomial (Garret et al., 2004).
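One simple way to look for overdispersion in counts of diseased plants is the Pearson chi-square statistic divided by its degrees of freedom, sketched below in Python under the assumption of a binomial model with a single incidence common to all farms. This is an illustrative check, not necessarily the test applied in this work, and the counts are hypothetical.

```python
def pearson_dispersion(diseased, assessed):
    """Pearson chi-square over degrees of freedom for a common-incidence binomial model."""
    pooled = sum(diseased) / sum(assessed)  # pooled incidence as a proportion
    if not 0.0 < pooled < 1.0:
        raise ValueError("pooled incidence must lie strictly between 0 and 1")
    chi_square = sum(
        (y - n * pooled) ** 2 / (n * pooled * (1.0 - pooled))
        for y, n in zip(diseased, assessed)
    )
    # One parameter (the pooled incidence) is estimated from the data.
    return chi_square / (len(diseased) - 1)


# Hypothetical counts: farms are either disease-free or fully diseased.
diseased = [0, 10, 0, 10]
assessed = [10, 10, 10, 10]
dispersion = pearson_dispersion(diseased, assessed)
```

Values close to 1 are compatible with binomial variability, whereas values well above 1, as in this extreme made-up example, suggest overdispersion.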

2.12 GAMLSS (Generalised Additive Models for Location Scale and Shape)

According to Stasinopoulos and Rigby (2007), GAMLSS are semi-parametric regression-type models, mainly because they require a parametric distribution assumption for the response variable and can use non-parametric smoothing functions to model the parameters of the distribution. Just as a GLM is a solution for cases that cannot be solved with Linear Regression, a GAMLSS is the proposed solution for cases that cannot be solved with a GLM. Most of these cases arise when the response variable does not follow an exponential family distribution (Stasinopoulos & Rigby, 2007). More details on how it was used in the present work are presented in the next chapter.

Concluding Remarks on Literature Review

The fact that no previous works analysing Fusarium Wilt incidence from the Spatial Data Analysis perspective were found gives special relevance to this work, which also outlines the considerations to take into account in future studies.

Detailed information about the implementation of the concepts presented here to the analysis of

Fusarium Wilt incidence from the point of view of Spatial Data Analysis is provided in the next

chapter.

3 Methodology

The proposed methodology is essentially a combination of methods and techniques applied by different authors, as presented in the literature review; specific details are explained in the present chapter.

The first part consists of a description of the study area, including its elevation, major roads and rivers. In the second section the data preparation process is presented, focusing on how the data is treated in the present work. The third section consists of a graphical representation of the disease incidence rate for each farm in the study area. Since no areal boundaries were available for the sampled farms, each of them was represented by a point. The fourth section deals with the exploration of the data, using both graphical and numerical methods, the latter including tests for spatial autocorrelation using different methods like Moran’s I and Geary’s c. The Empirical Bayes Estimate proposed by Assunção and Reis (1999) to improve the Moran’s I calculation was also utilised to test spatial autocorrelation, mainly due to the heterogeneity in the number of plants sampled on each farm.

Finally, an explanation is given of why GAMLSS was selected and how it was used to model the relationship between Fusarium Wilt incidence and variables like soil pH, Altitude, Farm Size, Slope, Farm Type, Banana Variety, Plant Density and Soil Quality Index.

3.1 Case study area: San Luis de Shuaro
According to the Instituto Nacional de Estadística e Informática (INEI, as cited in Román Jerí, 2012, p. 25), the district of San Luis de Shuaro is located in Peru, in the Chanchamayo province, department of Junín. It is 187 km from Lima, the capital of Peru, and agriculture is its main economic activity. Figure 1 shows the location of the San Luis de Shuaro district within Peru.

Figure 1 – Location of San Luis de Shuaro District (David Brown, ArcGIS for Desktop 10.2)

Although the study was focused on the San Luis de Shuaro district, it includes farms from outside the official boundaries due to different criteria applied by Román Jerí (2012). According to the hole-filled seamless SRTM data (Jarvis, Reuter, Nelson & Guevara, 2008), altitude ranges from 597 to 2021 metres above sea level in the study area, as shown in Figure 2.

To obtain the elevation, the study area was delimited by drawing a rectangle that includes all the analysed spatial points, each one representing a farm. Then, an extraction from the SRTM elevation data was performed using the rectangle as a mask. This elevation model was utilised only as a descriptive resource for the study area. The elevation attribute for each analysed farm was taken from the data collected with the handheld GPS.

Figure 2 – Elevation of the study area (David Brown, ArcGIS for Desktop 10.2)

As can be observed in Figure 3, the San Luis de Shuaro District is divided in two by a river, which also separates the 76 farms into two main groups. The district is also divided in two by two major roads, which cross it near this river.

Figure 3 –Location of banana farms in the study area (David Brown, ArcGIS for Desktop 10.2)

3.2 Workflow diagram
The workflow diagram presented in Figure 4 shows the necessary processes and results obtained

during the development of the present work.

[Figure 4: workflow diagram. Boxes shown: Literature review; Data preparation; Mapping the Spatial Distribution of Fusarium Wilt; Thiessen Polygons; Exploratory Spatial Data Analysis; Test Spatial Autocorrelation; Generalised Linear Model; Test Overdispersion; Generalised Additive Model for Location Scale and Shape; Model Validation]

Figure 4 – Workflow diagram (David Brown, Microsoft Word 2010)

3.3 Data preparation
Data collected by Román Jerí (2012) was stored in several spreadsheets containing the location of the assessed farms and is part of the Bioversity International dataset collection. Location was registered in UTM format with a handheld GPS (Garmin eTrex Vista HCx) at the point in each assessed farm where the first symptomatic plant was found (Román Jerí, 2012, p. 32).

A main dataset was consolidated in a shapefile so that it could be easily manipulated in ArcGIS and in the R Language and Environment for Statistical Computing (R Core Team, 2014) through the package maptools (Bivand & Lewin-Koh, 2014). The shapefile was generated from previous files in XLSX format (Microsoft Office) containing the spatial location in UTM coordinates, along with the variables listed in Table 2. Although contained in the raw data provided by Román Jerí (2012), disease incidence was verified and recalculated from the available counts of total diseased plants and total assessed plants for each farm. After this revision of all the records corresponding to the 76 farms, nine records differed from the original incidence rate, with four of them corresponding to zero disease incidence in the recalculated rate.

The following projection was used, as it was found to be the most appropriate according to the UTM Grid Zones of the World compiled by Morton (2014).

Projected Coordinate System: WGS_1984_UTM_Zone_18S

Projection: Transverse Mercator

Geographic Coordinate System: GCS_WGS_1984

Datum: D_WGS_1984

Prime Meridian: Greenwich

Angular Unit: Degree
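As a cross-check on the choice of zone, the standard UTM rule (6-degree longitude bands counted eastwards from 180° W) can be sketched in Python. The coordinates below are only an approximate, assumed position for the study area and were not taken from the dataset; the sketch also ignores the special zones around Norway and Svalbard, which are irrelevant here.

```python
import math


def utm_zone(longitude_deg):
    """UTM zone number for a longitude in the range [-180, 180)."""
    return int(math.floor((longitude_deg + 180.0) / 6.0)) + 1


def utm_epsg_wgs84(longitude_deg, latitude_deg):
    """EPSG code of the WGS 84 / UTM zone for a position (326xx north, 327xx south)."""
    base = 32600 if latitude_deg >= 0 else 32700
    return base + utm_zone(longitude_deg)


# Approximate, assumed position near the study area (degrees).
assumed_longitude = -75.3
assumed_latitude = -10.9
zone = utm_zone(assumed_longitude)
epsg_code = utm_epsg_wgs84(assumed_longitude, assumed_latitude)
```

For this assumed position the rule gives zone 18 in the southern hemisphere, in agreement with the WGS_1984_UTM_Zone_18S system listed above.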

Table 3 – Variables included in the dataset consolidated from the raw data provided by Román Jerí (2012)

Variable            Description
Altitude            Altitude measured by the GPS unit
Slope               Inclination measured with clinometer
Farm Type           Monocrop, Mixed crops, Agroforestry System, Backyard
Planting density    Plants per hectare
Shade percentage    Shade percentage measured with a spherical densitometer
pH                  pH measured with soil tester KCB-300
Soil Quality Index  Modified from Altieri and Nicholls (2002)
Farm Size           Farm area in hectares
Variety             “Isla”, “Seda”, “Mixed”

Since the intention of the present work is not to repeat the study conducted by Román Jerí (2012) but to analyse the data from the Spatial Data Analysis point of view, only a portion of the information available in the original dataset consolidated by Román Jerí (2012) was utilised, and some modifications were made. For example, the Soil Quality Index, which is a modified version of the one proposed by Altieri and Nicholls (2002), was not calculated by Román Jerí, although the individual variables that compose the index were utilised in his analysis. The main difference between the index calculated here and the one proposed by Altieri and Nicholls (2002) is that two variables (Humidity retention and Water infiltration) were not included in the present work, mainly because they were not available. Although the raw data was provided by Román Jerí (2012), the reader should be aware of the different objectives of each work and interpret the results accordingly.

3.4 Proposed Approach to analyse spatial data

As stated by Krivoruchko (2011, p. 22), using methods intended for geostatistical data to analyse areal or point pattern data will produce erroneous results. However, according to Plant (2012, p. 5), on some occasions data can be treated as geostatistical for one analysis and as areal for another. According to Fortin et al. (2002), the selection of spatial statistical methods can be based on the research objectives or on the type of measured data.

The data used in the present analysis represents the incidence measured on each farm in the form:

$$DI = \frac{TD}{TE} \times 100$$

DI = Disease Incidence

TD = Total of Diseased Plants

TE = Total of Evaluated Plants
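The calculation, and the kind of consistency check that can be run against a recorded incidence value, can be sketched in Python as follows. The records, field names and tolerance are hypothetical and do not reproduce the actual dataset.

```python
def disease_incidence(total_diseased, total_evaluated):
    """Disease incidence (%) from counts of diseased and evaluated plants."""
    if total_evaluated <= 0:
        raise ValueError("at least one plant must have been evaluated")
    return total_diseased / total_evaluated * 100.0


# Hypothetical farm records with the incidence originally written down.
records = [
    {"farm": "A", "diseased": 3, "evaluated": 60, "recorded_di": 5.0},
    {"farm": "B", "diseased": 0, "evaluated": 40, "recorded_di": 2.5},
    {"farm": "C", "diseased": 12, "evaluated": 48, "recorded_di": 25.0},
]

# Farms whose recorded incidence differs from the value recalculated from counts.
mismatches = [
    r["farm"]
    for r in records
    if abs(disease_incidence(r["diseased"], r["evaluated"]) - r["recorded_di"]) > 0.01
]
```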

The location is represented by a point, which corresponds to the location of the first symptomatic plant found in the farm (Román Jerí, 2012, p. 32). Although some considerations should be taken into account for future sampling designs, conditions and constraints of this kind are very common in collected datasets, and dealing with them is one of the additional motivations for this work.

This arrangement of spatial location and incidence rate for each farm has the particularity that one

point is used to represent an area, raising the methodological question of: Which spatial data

analysis method should be used?

As the disease incidence rate represents an aggregation of values for each farm, intuition indicates that the data should be treated as areas; however, polygons for each farm were not available. When boundaries are not available for the areas being analysed, a point can be used to define their location and methods for point objects can then be applied (Haining, 2003, p. 81). The possibility of using a point to represent each analysed area is also supported by Bivand et al. (2008, p. 244). Therefore, the approach taken is to conduct spatial data analysis using methods for areal data.

3.5 Mapping the Spatial Distribution of Disease Incidence

3.5.1 Points Map


The map of the distribution of Fusarium Wilt incidence of banana in the study area was produced using the location of the 76 farms surveyed by Román Jerí (2012). A six-class classification and a colour palette from blue to red (corresponding to Low and High respectively) were utilised to represent the measured disease incidence distribution in the study area. This classification is shown in Figure 5.

Figure 5 – Symbols and classification utilised to map the distribution of the measured disease incidence (David Brown,
ArcGIS for Desktop 10.2)

3.5.2 Thiessen Polygons
As explained in the literature review, Thiessen Polygons are geometrical constructions derived from points. Taking into account that Thiessen Polygons do not produce a prediction map but a representation of the proximity area of each point object, a graphical areal representation was produced with this technique from the points which represent each sampled farm, applying the same colours and classification for disease incidence as in the point representation.

Although it is possible to use Areal Interpolation as implemented in the Geostatistical Analyst of ArcGIS (Krivoruchko, Gribov & Krause, 2011) with the Thiessen Polygons as input layer, the production of a prediction map was avoided in the present work, both to prevent misinterpretation of the technique and methodology and because it could be considered an abuse of this useful tool.

3.6 Data Exploration
Exploration of the data will be basically divided in two steps:

a) Visual inspection of the data distribution

b) Test for spatial autocorrelation

Before exploring the data, it is worth explaining why it is so important to check whether the data exhibits a normal distribution and whether spatial autocorrelation is present.

As explained in the literature review, Linear Regression assumes the data has a normal distribution, which takes a bell-shaped form when plotted. The main reason to explore the data is to check whether its distribution corresponds to a normal distribution or to another kind of distribution, like Poisson, Binomial, etc. As the data analysed in the present work is a proportion, a non-normal distribution is to be expected. If this is the case, ordinary linear regression should be avoided and other models like the GLM or GAM are possible solutions.

Testing for spatial autocorrelation is an important part of the exploratory data analysis for two main reasons:

1) If positive spatial autocorrelation is present, nearby farms will have similar scores for disease incidence and the chances of finding a clustered pattern are high. Negative spatial autocorrelation means that nearby farms have dissimilar scores for disease incidence and the pattern of distribution is dispersed. Finally, if the null hypothesis of zero spatial autocorrelation cannot be rejected, the data is consistent with a process behind the disease incidence that occurs in a random pattern. If positive or negative spatial autocorrelation is present in the variable of interest, in this case the disease incidence, there is a reason for it, and explaining that reason helps to understand the underlying process.

2) If present, spatial autocorrelation violates the independence assumption required by the

Linear Regression Model and even other models like GLM, and some considerations

should be taken to account for the spatial autocorrelation into the proposed model.

3.6.1 Histogram

One of the easiest ways to see the shape of the data distribution is the histogram, which is defined by Dalgaard (2008, p. 71) as: “…, a count of how many observations fall within specified divisions (“bins”) of the x-axis”. Along with the histogram, there are two indicators that help to determine the shape of a distribution: skew and kurtosis (Urdan, 2010, p. 31). The skew indicates whether the distribution is positively or negatively skewed, which means that the distribution has an elongated tail at the higher end in the first case, or at the lower end in the second case (Urdan, 2010, p. 31). Kurtosis, on the other hand, tells whether the distribution is flatter than a normal distribution, in which case it is called platykurtic, or has a higher peak than a normal distribution, in which case it is called leptokurtic (Urdan, 2010, p. 31). As a rule of thumb, the skewness should ideally be 0 and the kurtosis should be 3 for normally distributed data (NIST/SEMATECH, 2014).
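Both indicators can be computed from the central moments of the data, as in the following Python sketch. It uses the simple moment-based definitions (dividing by the number of observations), under which a normal distribution has skewness 0 and kurtosis 3; statistical packages may apply small-sample corrections, so their figures can differ slightly. The sample values are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and (non-excess) kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# A symmetric hypothetical sample and one with a long upper tail.
symmetric_skew, symmetric_kurt = skewness_and_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
tailed_skew, tailed_kurt = skewness_and_kurtosis([0.0, 0.0, 0.0, 0.0, 10.0])
```

A sample dominated by zeros with a few large values, a pattern that can occur with disease incidence, yields a clearly positive skewness.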

3.6.2 Normal Q-Q Plot

Another way to visually check the data for normality is the Quantile-Quantile Plot (Q-Q Plot). To understand the Q-Q plot, the Empirical Cumulative Distribution Function (c.d.f.) should first be defined. With x being the analysed variable, the c.d.f. is defined by Dalgaard (2008, p. 73) as “the fraction of data smaller than or equal to x”. The Q-Q plot shows the “kth smallest observation against the expected value of the kth smallest observation out of n in a standard” normal distribution (Dalgaard, 2008, p. 73).

In practice, a straight line should be expected in the Q-Q plot for normally distributed data (Dalgaard, 2008, p. 73).
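The coordinates of a normal Q-Q plot can be sketched in Python with the standard library alone. The sketch approximates the expected value of the kth smallest observation by the standard normal quantile at the plotting position (k − 0.5)/n; this is one common convention and an assumption of the example, and plotting software may use a slightly different position. The sample values are hypothetical.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pairs (theoretical quantile, observed value) for a normal Q-Q plot."""
    n = len(values)
    ordered = sorted(values)  # kth smallest observations
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical sample; each pair would be one point of the plot.
points = normal_qq_points([3.0, 1.0, 2.0])
```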

3.6.3 Spatial autocorrelation tests
Spatial autocorrelation was tested using Moran’s I and Geary’s c. An additional Moran’s I test was also applied using the methodology of Assunção and Reis (1999). All the R Language (R Core Team, 2014) code utilised to compute the spatial autocorrelation tests is included in Annex 1.

According to Griffith (2009), when spatial autocorrelation is detected it can fall into one of the following categories:

1) Strong Positive Spatial Autocorrelation: Present in data like the remote sensing data and

not very common in the majority of the cases

2) Moderate Positive Spatial Autocorrelation: The most common type of spatial

autocorrelation

3) Moderate Negative Spatial Autocorrelation: Not so common to find and typically

associated with geographic competition.

Moran’s I is one of the most commonly used statistics to test the null hypothesis of zero spatial autocorrelation (Plant, 2012, p. 104) and was selected to compute the spatial autocorrelation of Fusarium Wilt incidence as areal data. Moran’s I and all the necessary calculations were computed using the R Language (R Core Team, 2014) with functions included in the spdep package (Bivand, 2014). Geary’s c is another statistic which also tests the null hypothesis of zero spatial autocorrelation. Griffith and Lane (as cited in Plant, 2012, p. 106) conclude that Moran’s I is generally preferred over Geary’s c, but that computing Geary’s c to corroborate the Moran’s I results is desirable. That is the main reason for also calculating Geary’s c in the present work.

The Empirical Bayes Estimate, proposed by Assunção and Reis (1999) as a way to improve Moran’s I, was also included mainly because authors like Jackson et al. (2010), Assunção and Reis (1999) and Tsai (2012) state that the Moran’s I test does not work very well for rates calculated from populations of different sizes, as is the case for the disease incidence treated in the present work.

3.6.3.1 Neighbour List and Spatial Weights
Before calculating any of the statistics used to test spatial autocorrelation, two preliminary steps are required. First, the relation between the spatial objects should be defined using a neighbour criterion (Bivand et al., 2008, p. 239). After defining which objects will be related as neighbours, a spatial weight should be assigned to each link (Bivand et al., 2008, p. 251). The method should be selected according to the type of spatial object, polygons or points, used to model the spatial relationship. For points, as is the case of the analysed data, two common methods are available to construct the neighbour list: k nearest neighbours and distance-based neighbours. However, there are more options for creating a neighbour list, Delaunay triangulation being one of them. Different methods for defining neighbours are treated in more detail by Haining (2003, p. 80) and Bivand et al. (2008, p. 240).

The k nearest neighbour method selects the k nearest neighbours of each point (Plant, 2012, p. 90), k being a parameter to be provided. For example, if a k value of 2 is provided, the method will produce a neighbour list in which each point has 2 neighbours. The distance-based method selects the neighbours of each point using a distance threshold defined by two parameters, a minimum and a maximum bound (Bivand et al., 2008, p. 247).

The spatial weights are assigned to each neighbour link according to different styles, the row-standardised style being recommended if not much is known about the analysed spatial process (Bivand et al., 2008, p. 251).

In the R Language (R Core Team, 2014) a k nearest neighbour list can be constructed with the function knn2nb along with the function knearneigh, both included in the package spdep (Bivand, 2014). For the distance-based method, the function dnearneigh can be used to construct the neighbour list. The spatial weights list is constructed with the function nb2listw, also provided by the package spdep (Bivand, 2014).
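The two preliminary steps can be illustrated with a language-neutral sketch in Python: a k nearest neighbour list followed by row-standardised weights. It only mirrors the idea behind the spdep functions named above and does not reproduce their interfaces; the function names and coordinates are hypothetical.

```python
import math


def k_nearest_neighbours(points, k):
    """For each point, the indices of its k nearest other points."""
    neighbours = []
    for i, point in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )
        neighbours.append(others[:k])
    return neighbours


def row_standardised_weights(neighbours):
    """Give each neighbour link a weight so that every row sums to 1."""
    return [{j: 1.0 / len(row) for j in row} for row in neighbours]


# Hypothetical farm coordinates along a line (projected units).
farms = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
neighbour_list = k_nearest_neighbours(farms, k=2)
weights = row_standardised_weights(neighbour_list)
```

Note that k nearest neighbour relations are not necessarily symmetric: in this example the last farm has the third farm as a neighbour, but the reverse does not hold.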

Once the necessary preliminary steps to calculate the spatial autocorrelation statistics have been clarified, a new question arises: which threshold size should be defined for the distance-based method, or which k should be used for the k nearest neighbour method? More than that, which method should be used?

According to O’Sullivan and Unwin (2010, p. 205), when the analysed process is not well understood, defining the spatial structure and assigning the weights will be a difficult process. Haining (2003, p. 81) suggests that if additional information about the analysed process is available, it should be utilised to define linkages, rather than defining them by geometrical or spatial criteria alone.

There are no magic recipes for selecting the neighbouring method, and in different theoretical books knowledge about the studied process is presented as the key input to resolve this problem.

Although applied to a different case and with a specific implementation, the work of Souris and Bichaud (2011) suggests that the k nearest neighbour method could be appropriate for epidemiological studies. Therefore, the k nearest neighbour method is selected for the present work, although to support this decision a set of comparisons is conducted against three other methods: Delaunay triangulation, Sphere of Influence and distance-based.

Delaunay triangulation and Sphere of Influence are graph-based methods; their main difference is that the former defines the neighbours by triangulation, while the latter uses circles with a radius equal to the distance from each point to its nearest neighbour (Bivand et al., 2008, p. 245).

For the distance-based method, which requires a distance threshold to be defined, the approach proposed by Anselin (2003) was followed. Basically, the lower bound is set to 0 and the upper bound is set to the maximum distance needed to ensure that each point has at least one neighbour. This is achieved by extracting the maximum distance that results from applying the k nearest neighbour method with k = 1.
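The threshold rule can be sketched in Python as follows: the upper bound is the largest of the nearest-neighbour distances, which guarantees that no point is left without neighbours. The sketch is a plain illustration of the rule, not the spdep implementation, and it treats both bounds as inclusive, which is an assumption; the coordinates are hypothetical.

```python
import math


def nearest_neighbour_distances(points):
    """Distance from each point to its closest other point (k = 1)."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def distance_band_neighbours(points, lower, upper):
    """Neighbours of each point whose distance lies within [lower, upper]."""
    return [
        [
            j
            for j, q in enumerate(points)
            if j != i and lower <= math.dist(p, q) <= upper
        ]
        for i, p in enumerate(points)
    ]


# Hypothetical farm coordinates along a line (projected units).
farms = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
upper_bound = max(nearest_neighbour_distances(farms))
band_neighbours = distance_band_neighbours(farms, 0.0, upper_bound)
```

Unlike the k nearest neighbour method, the number of neighbours per point varies here: isolated points receive few neighbours and points in dense areas receive many.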

3.6.3.2 Moran’s I
As mentioned in the previous section, the Moran’s I test was computed using the R Language (R Core Team, 2014) with the function moran.test. Basically, the function needs the list of neighbours constructed with one of the methods explained before, together with its spatial weights, and a vector with the values of the variable to check for spatial autocorrelation.
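For reference, the statistic itself (without the significance test) can be sketched in Python from its usual definition, I = (n / S0) · Σi Σj wij (xi − x̄)(xj − x̄) / Σi (xi − x̄)², where S0 is the sum of all weights. The weights structure and the values are hypothetical, and the sketch is not a substitute for moran.test.

```python
def morans_i(values, weights):
    """Moran's I for values and weights given as one {neighbour: weight} dict per point."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    s0 = sum(w for row in weights for w in row.values())  # sum of all weights
    cross_products = sum(
        w * deviations[i] * deviations[j]
        for i, row in enumerate(weights)
        for j, w in row.items()
    )
    return (n / s0) * cross_products / sum(d * d for d in deviations)


# Four hypothetical points on a line; adjacent points are neighbours,
# with row-standardised weights.
line_weights = [{1: 1.0}, {0: 0.5, 2: 0.5}, {1: 0.5, 3: 0.5}, {2: 1.0}]
trend_i = morans_i([1.0, 2.0, 3.0, 4.0], line_weights)        # similar neighbours
alternating_i = morans_i([1.0, 4.0, 1.0, 4.0], line_weights)  # dissimilar neighbours
```

A smooth trend gives a positive value and an alternating pattern a negative one; under the null hypothesis the expected value is −1/(n − 1), slightly below zero.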

3.6.3.3 Geary’s C
As with Moran’s I, a neighbour list is also needed for Geary’s c. The same neighbour lists constructed for the Moran’s I calculation are utilised here. Geary’s c was computed using the function geary.test included in the package spdep (Bivand, 2014).
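Geary’s c can likewise be sketched in Python from its usual definition, c = ((n − 1) / (2 S0)) · Σi Σj wij (xi − xj)² / Σi (xi − x̄)². Values below 1 indicate positive spatial autocorrelation and values above 1 negative spatial autocorrelation. The weights and values are hypothetical, and the sketch does not include the significance test provided by geary.test.

```python
def gearys_c(values, weights):
    """Geary's c for values and weights given as one {neighbour: weight} dict per point."""
    n = len(values)
    mean = sum(values) / n
    s0 = sum(w for row in weights for w in row.values())  # sum of all weights
    squared_differences = sum(
        w * (values[i] - values[j]) ** 2
        for i, row in enumerate(weights)
        for j, w in row.items()
    )
    total_variation = sum((v - mean) ** 2 for v in values)
    return (n - 1) / (2.0 * s0) * squared_differences / total_variation


# Four hypothetical points on a line; adjacent points are neighbours,
# with row-standardised weights.
line_weights = [{1: 1.0}, {0: 0.5, 2: 0.5}, {1: 0.5, 3: 0.5}, {2: 1.0}]
trend_c = gearys_c([1.0, 2.0, 3.0, 4.0], line_weights)
alternating_c = gearys_c([1.0, 4.0, 1.0, 4.0], line_weights)
```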

3.6.3.4 Modified Moran’s I – Empirical Bayes Index


The modified Moran’s I applied is the one proposed by Assunção and Reis (1999), implemented with the function EBest included in the package spdep (Bivand, 2014). This function calculates an Empirical Bayes estimate, and Moran’s I is then computed using the resulting smoothed rates.
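The idea of Empirical Bayes smoothing can be illustrated with a global method-of-moments estimator, which shrinks each raw rate towards the overall rate, more strongly where fewer plants were sampled. The Python sketch below is an assumption about the kind of smoothing involved, not the thesis code; the counts and the name `eb_smooth` are hypothetical.

```python
def eb_smooth(cases, population):
    """Global method-of-moments Empirical Bayes smoothing of raw rates."""
    m = len(cases)
    raw = [c / p for c, p in zip(cases, population)]
    total_pop = sum(population)
    overall = sum(cases) / total_pop  # overall rate (prior mean)
    s2 = sum(p * (r - overall) ** 2 for p, r in zip(population, raw)) / total_pop
    # Prior variance estimate, truncated at zero.
    a = max(s2 - overall / (total_pop / m), 0.0)
    # Assumes the overall rate is positive, so the denominator is never zero.
    return [overall + a * (r - overall) / (a + overall / p)
            for r, p in zip(raw, population)]

# Hypothetical data: diseased plants and sampled plants on four farms.
diseased = [0, 5, 20, 2]
sampled = [10, 50, 40, 100]
smoothed = eb_smooth(diseased, sampled)
```

In this toy example the farm with 0 diseased plants out of only 10 sampled is pulled noticeably towards the overall rate, while the farm with 100 sampled plants barely moves.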

3.7 GAMLSS applied to Fusarium Wilt Disease Incidence


One of the special interests of this work is to explore and model the relationship between Fusarium Wilt incidence and variables such as altitude, shade percentage, slope, soil pH, plant density and a Soil Quality Index. The Soil Quality Index was calculated according to the methodology of Altieri and Nicholls (2002) and combines a list of measurements which together give an estimate of the quality of the soil.

As discussed in the literature review, GLMs are the usual approach to analyse data that do not follow a normal distribution (Plant, 2012, p. 301), as is the case for the disease incidence rate. Garrett et al. (2004) also suggest using a GLM instead of applying a transformation to the data. On the other hand, Kongchouy, Choonpradub and Kuning (2010) indicate that a logarithmic transformation of the disease incidence rate is enough to achieve satisfactory results. Bivand et al. (2008, p. 274) also apply a logarithmic transformation to disease incidence rates to try to obtain a nearly normal distribution. In this context a logarithmic transformation consists of calculating the logarithm of each original value of the variable of interest and using it instead of the original value. Recalling from basic mathematics, “a logarithm function is defined with respect to a base” (Nau, 2014). However, since the data used in the present work contain values of zero for the farms without the disease, and the logarithm of zero is undefined, the logarithmic transformation does not seem to be a feasible solution, even though some other transformation could be applied. Moreover, with methods such as Generalised Linear Models available to handle this kind of data without transforming it, there is no strong reason to take the transformation approach.

Zuur et al. (2009, p. 19) state that the simplest statistical model should be used, but that it should be used correctly. Following this approach, a GLM was applied to the disease incidence rate to find a model that could explain the relationship between the disease incidence and the proposed set of explanatory variables. The implementation was done using the R Language (R Core Team, 2014).

As presented in the literature review, a GLM consists of three steps (Zuur et al., 2009, p. 210):

a) Assume a distribution for the response variable

b) Specify the systematic part (the explanatory variables)

c) Specify the link function

In the case of the data analysed in the present work, the response variable is proportional data, which corresponds to a binomial distribution according to Zuur et al. (2009, p. 202) and Garrett et al. (2004). The systematic part corresponds to the explanatory variables, which in the present work are a selection of the variables presented in Table 3. This selection retains the variables that are statistically significant in the model, mostly based on their p-values. GLMs are fitted to the data by maximum likelihood, and the link function relates the mean of the response to the explanatory variables; the logit link is the most used for proportional data (Garrett et al., 2004).
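The three components listed above can be written compactly for proportional data. The notation below is an assumption introduced for illustration (it does not appear in the source): y_i is the number of diseased plants out of n_i sampled plants on farm i, \pi_i is the disease probability and x_{ij} are the explanatory variables.

```latex
y_i \sim \mathrm{Binomial}(n_i, \pi_i), \qquad
\operatorname{logit}(\pi_i) = \ln\!\left(\frac{\pi_i}{1 - \pi_i}\right)
  = \eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}, \qquad
\pi_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}
```

The last expression is the inverse of the logit link, which guarantees that fitted probabilities stay between 0 and 1 whatever the value of the linear predictor.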

With Poisson and binomial distributions it is common to find that the observed variability is greater than the predicted variability. This is known as overdispersion and is very common in plant disease data (Garrett et al., 2004). When overdispersion is found, one approach is to fit the model to the data using a maximum quasi-likelihood method (Garrett et al., 2004). In this case the distribution is still a binomial distribution, but overdispersion is taken into account (Zuur et al., 2009, p. 226). Special attention should be paid to testing for overdispersion before starting to select the explanatory variables for the model (Zuur et al., 2009, p. 223). In the present work overdispersion was found in the model, and it was addressed using a GAMLSS with a discrete distribution called the beta-binomial distribution. Two main reasons support the selected approach: 1) the GAMLSS provides an AIC calculation, which is very useful for the selection of the most significant explanatory variables and is not provided by the quasi-binomial method; 2) the beta-binomial distribution is widely recognized as the most adequate solution for overdispersed proportional data, such as plant disease incidence (Garrett et al., 2004). The implementation was also done with the R Language (R Core Team, 2014) and the gamlss package (Rigby & Stasinopoulos, 2005).
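The way a beta-binomial distribution accommodates overdispersion can be checked numerically. The Python sketch below uses the standard (alpha, beta) parameterisation; the mapping alpha = mu / sigma and beta = (1 - mu) / sigma from a mean mu and a dispersion parameter sigma is an assumption made here for illustration, and the values of n, mu and sigma are hypothetical.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(y, n, alpha, beta):
    """P(Y = y) for a beta-binomial with n trials and shape parameters alpha, beta."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return math.exp(log_choose + log_beta(y + alpha, n - y + beta) - log_beta(alpha, beta))

def mean_and_variance(pmf_values):
    """Mean and variance of a distribution on 0..n given as a list of probabilities."""
    mean = sum(y * p for y, p in enumerate(pmf_values))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf_values))
    return mean, var

# Hypothetical farm: 30 sampled plants, mean incidence 0.2, dispersion 0.25.
n, mu, sigma = 30, 0.2, 0.25
alpha, beta = mu / sigma, (1 - mu) / sigma
pmf = [beta_binomial_pmf(y, n, alpha, beta) for y in range(n + 1)]
bb_mean, bb_var = mean_and_variance(pmf)
binomial_var = n * mu * (1 - mu)  # variance of a plain binomial with the same mean
```

Both distributions share the same mean (6 diseased plants), but the beta-binomial variance is several times the binomial variance, which is the extra variability that a plain binomial GLM cannot represent.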

The next step is to select the explanatory variables that are important to include in the model. According to Zuur et al. (2009, p. 221) two options are available for this: selection using the AIC (Akaike Information Criterion) or hypothesis testing. The AIC is a measure of how well the model fits (Dalgaard, 2008, p. 232). An extensive explanation can be found in Akaike (1998), but in general terms the AIC is based on the maximised likelihood, penalised by the number of parameters, and is used to select the most appropriate model (Pan, 2001). For a GLM it can be calculated in the R Language (R Core Team, 2014) using the function step. A model with a lower AIC value is preferred (Zuur et al., 2009, p. 542). However, since a GAMLSS was used, the function available for this calculation is stepGAIC from the gamlss package (Rigby & Stasinopoulos, 2005).
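For reference, the criterion used in this kind of selection can be written as a generalised AIC. The symbols are assumptions introduced for illustration: \hat{D} is the global deviance of the fitted model, df is the number of degrees of freedom used by the fit, n is the number of observations and k is the penalty per parameter.

```latex
\mathrm{GAIC}(k) = \hat{D} + k \cdot df, \qquad
\hat{D} = -2 \ln \hat{L}, \qquad
\mathrm{AIC} = \mathrm{GAIC}(2), \qquad
\mathrm{SBC} = \mathrm{GAIC}(\ln n)
```

A variable is kept only if the reduction in deviance it brings outweighs the penalty for the extra parameter, which is why a simpler model can reach a lower AIC despite a slightly higher deviance.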

Validation of the Model

Zuur et al. (2009, p. 23) suggest validating a linear regression model using graphs as follows:

1) A graph of the model residuals vs fitted values to check for homogeneity

2) A Q-Q plot or histogram of the residuals to verify normality

3) A plot of the residuals vs each explanatory variable to verify independence

Teetor and Loukides (2011, p. 295) provide a simple explanation of how to interpret these kinds of graphs.

Since the GAMLSS was used instead of a classical GLM, the graphs provided by the package were used to validate the model. The function plot of the R Language (R Core Team, 2014) applied to a GAMLSS model provides the following graphs for model validation (Stasinopoulos, Rigby & Akantziliotou, 2008, p. 121):

- Model residuals against the fitted values

- Model residuals against an index or specified x-variable

- Kernel density estimate of the residuals

- QQ-normal plot of the residuals

In general, a valid and well fitted model is expected to have residuals that are normally distributed and show no patterns.
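The numerical side of this check, comparing the moments of the residuals with those of a standard normal distribution, can be sketched as follows. This Python example is illustrative only: the residuals are simulated rather than taken from the fitted model, the function name `residual_summary` is hypothetical, and population (not sample) moments are used.

```python
import random

def residual_summary(residuals):
    """Mean, variance, skewness and kurtosis of the residuals (population moments)."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
    }

# Simulated stand-in for the residuals of a well fitted model.
rng = random.Random(42)
simulated = [rng.gauss(0.0, 1.0) for _ in range(20000)]
summary = residual_summary(simulated)
```

For residuals that behave like a standard normal sample, the four values should be close to 0, 1, 0 and 3 respectively.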

The code utilised in the R Language (R Core Team, 2014) to implement the GLM and GAMLSS is

included in Annex 2.

4 Results and Analysis

4.1 Map of the Spatial Distribution of Disease Incidence in the Study Area
Figure 6 shows the resulting map of the distribution of Fusarium Wilt incidence in the district of San Luis de Shuaro, Peru. From these points Thiessen Polygons were constructed and coloured with the same colour scheme and the same classification for disease incidence. The resulting map with the Thiessen Polygons is shown in Figure 7.

Figure 6 - Map of the Spatial Distribution of Fusarium wilt incidence in the study area (David Brown, ArcGIS for
Desktop 10.2)

Figure 7 – Thiessen Polygons constructed from the points of each sampled farm (David Brown, ArcGIS for Desktop 10.2)

4.2 Results of Exploratory Data Analysis
4.2.1 Histogram
Figure 8 shows the histogram of the disease incidence, calculated using the R Language (R Core Team, 2014). Unlike the normal distribution shown in Figure 9, the histogram shows that the data follow a positively skewed distribution.

Figure 8 – Histogram of disease incidence rate

Figure 9 – Histogram of a normal distribution from simulated data

4.2.2 Normal Q-Q Plot
The Quantile-Quantile plot for a normal distribution should follow a straight line, as shown in Figure 11. The shape shown in Figure 10 is different, which indicates a non-normal distribution of the data.

Figure 10 – Normal Q-Q Plot of the disease incidence

Figure 11 – Normal Q-Q Plot for a normal distribution for simulated data

4.2.3 Skewness and Kurtosis
As explained before, the values of Skewness and Kurtosis for a normal distribution should be 0 for the former and 3 for the latter.

In the present work Skewness and Kurtosis were calculated using the R Language (R Core Team, 2014) with the function stat.desc from the package pastecs (Grosjean & Ibanez, 2014).

For the disease incidence, the Skewness was 1.9327 and the Kurtosis was 3.1458; of the two, the Skewness is the more problematic.

4.2.4 Neighbours list and Spatial Weights

Probably the most convenient way to show how much the spatial structure differs depending on the method selected to define the neighbours is graphically. The following figures show the different methods used to construct neighbour relationships. Delaunay triangulation, Sphere of Influence and distance based are shown in Figure 12, while the k nearest neighbour relationships with k = 1, k = 5 and k = 10 are shown in Figure 13.

Figure 12 – a) Delaunay triangulation, b) Sphere of Influence, c) Distance based

Figure 13 – a) k nearest neighbour with k = 1, b) k nearest neighbour with k = 5, c) k nearest neighbour with k = 10

4.2.5 Results of Moran’s I Test
As can be seen in Table 4, none of the Moran’s I calculations using the different neighbour lists has a significant p-value at the 95 % confidence level. At this point it is necessary to recall why the p-value is relevant. A very detailed explanation can be found in Urdan (2010, p. 61). The following are definitions from Urdan (2010, p. 77):

- p-value: “The probability of obtaining a statistic of a given size from a sample of a given size by chance, or due to random error”

- confidence interval: an interval calculated using sample statistics to contain the population parameter

Combined, these concepts provide a way to determine whether the calculated value is statistically significant. In the present case, using a 95 % confidence level, only values with a p-value below 0.05 are statistically significant.

It is also worth mentioning that Moran’s I was computed with a “Two Sided” alternative hypothesis. By default the alternative hypothesis is that the statistic is greater than its value under zero spatial autocorrelation, which presumes that any spatial autocorrelation will be positive. However, since there is no reason to presume whether any spatial autocorrelation would be positive or negative, the alternative hypothesis was set to “Two Sided”.

Table 4 – Results of Moran’s I test computations using different neighbour list methods

Neighbour Method            I         E(I)      var(I)   St. deviate   p-value
Delaunay Triangulation     -0.0467   -0.0133    0.0042   -0.52         0.6053
Sphere of Influence        -0.025    -0.013     0.01     -0.12         0.9069
Nearest Neighbour k = 1     0.0046   -0.0133    0.02      0.13         0.8993
Nearest Neighbour k = 5    -0.107    -0.013     0.004    -1.5          0.1385
Nearest Neighbour k = 10   -0.0732   -0.0133    0.0018   -1.4          0.1626
Distance bands             -0.0759   -0.0133    0.0045   -0.93         0.3509

To understand the results better, it is necessary to explain the values contained in Table 4. The I value is the computed Moran’s I, which is positive for positive spatial autocorrelation and negative for negative spatial autocorrelation. The E(I) value is the expected value under the null hypothesis of no spatial autocorrelation and is given by the following function (Griffith, 2009):

$$E(I) = -\frac{1}{n - 1}$$

where n is the number of areal units, in this case the number of farms. With n = 76 farms, E(I) = -1/75 ≈ -0.0133, which is the value shown in Table 4.

The var(I) value is the variance of the statistic and the St. deviate is the standard deviate (Bivand et al., 2008, p. 260).
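The link between the columns of Table 4 can be sketched numerically: the standard deviate is the statistic minus its expected value, divided by the square root of its variance, and the p-value follows from the normal distribution. The Python sketch below uses the rounded values of the Delaunay row, so it only approximately reproduces the table; the function names are hypothetical and this is not the spdep code.

```python
import math

def standard_deviate(statistic, expected, variance):
    """Standardised statistic: (observed - expected) / standard error."""
    return (statistic - expected) / math.sqrt(variance)

def normal_p_value(z, alternative="two.sided"):
    """p-value of a standard deviate under a normal approximation."""
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z)
    if alternative == "greater":
        return upper_tail
    if alternative == "less":
        return 1.0 - upper_tail
    return 2.0 * min(upper_tail, 1.0 - upper_tail)

n = 76                       # number of farms
expected_i = -1.0 / (n - 1)  # expected Moran's I under no autocorrelation

# Delaunay triangulation row of Table 4 (rounded values).
z = standard_deviate(-0.0467, expected_i, 0.0042)
p_two_sided = normal_p_value(z)
p_greater = normal_p_value(z, "greater")
```

With a negative observed statistic, the one-sided “greater” alternative gives a larger p-value than the two-sided one, which illustrates why the choice of alternative hypothesis matters when the sign of the autocorrelation is not known in advance.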

4.2.6 Results of Geary’s c Test


In the case of the Geary’s c test, the only computation with a significant p-value is the one using a neighbour list built with the k nearest neighbour method with k = 10. In Table 5 the c values are the computed Geary’s c. The remaining values are as described for the Moran’s I calculation, except that the expected value under zero spatial autocorrelation is 1. Values between 1 and 2 represent negative spatial autocorrelation and values between 0 and 1 indicate positive spatial autocorrelation. In the present case, although the p-value for the k nearest neighbour method with k = 10 is statistically significant, the c value (1.1495) is barely greater than one and the null hypothesis of zero spatial autocorrelation is retained.

Table 5 – Results of Geary’s c test computations using different neighbour list methods

Neighbour Method            c        E(c)   var(c)   St. deviate   p-value
Delaunay Triangulation      1.0428   1      0.0054   -0.58         0.561
Sphere of Influence         1.021    1      0.012    -0.19         0.8496
Nearest Neighbour k = 1     1.074    1      0.033    -0.41         0.6841
Nearest Neighbour k = 5     1.133    1      0.0073   -1.6          0.1205
Nearest Neighbour k = 10    1.1495   1      0.0049   -2.1          0.03354
Distance bands              1.072    1      0.0059   -0.94         0.3493

4.2.7 Results of Global Moran’s I using Empirical Bayes Estimates

Table 6 – Results of Moran’s I with Empirical Bayes Estimates computations using different neighbour list methods

Neighbour Method            I         E(I)      var(I)   St. deviate   p-value
Delaunay Triangulation     -0.0461   -0.0133    0.0042   -0.51         0.6113
Sphere of Influence        -0.025    -0.013     0.01     -0.11         0.9115
Nearest Neighbour k = 1     0.0079   -0.0133    0.02      0.15         0.8808
Nearest Neighbour k = 5    -0.106    -0.013     0.004    -1.5          0.1426
Nearest Neighbour k = 10   -0.0731   -0.0133    0.0018   -1.4          0.1633
Distance bands             -0.0768   -0.0133    0.0045   -0.95         0.3445

In the case of Moran’s I computed with the Empirical Bayes Estimate, none of the calculations shows spatial autocorrelation, as can be observed in Table 6, and consequently the null hypothesis is not rejected.

Taking together the three methods used to test for spatial autocorrelation in the disease incidence, there is no evidence of spatial autocorrelation in this case.

Differences between the Moran’s I and Geary’s c results could be attributed to the effect of the distribution of the data, which according to Cliff and Ord (as cited in Plant, 2012, p. 106) affects the calculation of c more than that of I.

4.3 Results of the GAMLSS
To construct a model that explains the Fusarium Wilt incidence through the selected variables, the first step is to include all of them as explanatory variables in the model and then apply the Akaike Information Criterion to select a better model. The first model contains the following explanatory variables:

- Area (Farm Size)

- Altitude

- pH

- Planting Density

- Farm Type

- Variety

- Slope

- Shade Percentage

- Soil Quality Index

As explained before, if the modelled data present overdispersion it is preferable to specify a beta-binomial distribution. To test whether the data analysed here present overdispersion, a GLM with a binomial distribution was specified. If overdispersion were not present, a GLM with a binomial distribution could be used and the use of the GAMLSS would be optional.

The GLM was specified with the function glm, which is part of the R Language (R Core Team, 2014), with all the proposed explanatory variables and a binomial distribution. Results are presented in Table 7.

Table 7 – Results of the GLM model with a Binomial distribution and all the proposed explanatory variables

                                           Estimate       Std. Error    t value         Pr(>|t|)
(Intercept)                                 7.60064318    0.60639424    12.53416123     0.0000000000
pH                                         -1.37212957    0.15488183    -8.85920321     0.0000000000
Altitude                                    0.00094969    0.00014293     6.64426691     0.0000000000
Slope                                      -0.00135563    0.00266970    -0.50778478     0.6116042814
Farm Size (area)                            0.03063996    0.04297005     0.71305393     0.4758123862
Plant density                              -0.00131031    0.00035593    -3.68136155     0.0002319918
Soil Quality Index                         -0.91801074    0.08157778   -11.25319525     0.0000000000
Factor(farm_type)2                         -0.12178332    0.10985683    -1.10856391     0.2676183558
factor(farm_type)3                          0.12928188    0.44745031     0.28893014     0.7726348387
Factor (Variety - Seda)                    -0.22388688    0.24582094    -0.91077219     0.3624154171
Factor (Variety - Mix)                      0.43529689    0.45731161     0.95186057     0.3411676975
Shade percentage                            0.01467834    0.01320917     1.11122363     0.2664721004
Factor(farm_type)2: Factor(Variety) Mix     0.52844930    0.46682362     1.13201063     0.2576299656

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1693.44 on 75 degrees of freedom
Residual deviance: 258.64 on 63 degrees of freedom
AIC: 483.6

Number of Fisher Scoring iterations: 5

To assess the proposed model for overdispersion, the two key values to analyse are the residual deviance and the residual degrees of freedom. Overdispersion is estimated by dividing the residual deviance by the degrees of freedom (Zuur et al., 2009, p. 224). In this case 258.64/63 = 4.105, which is higher than the value of 1 assumed for the binomial family, as indicated in the model summary. The resulting dispersion parameter therefore indicates that overdispersion is present in the model.
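The check described above is a single division, shown here as a short Python sketch using the values reported in Table 7. The function name `dispersion_ratio` is hypothetical; the count of 13 coefficients is simply the number of rows in Table 7.

```python
def dispersion_ratio(residual_deviance, residual_df):
    """Residual deviance divided by residual degrees of freedom."""
    return residual_deviance / residual_df

n_obs = 76    # sampled farms
n_coef = 13   # estimated coefficients in Table 7, including the intercept
residual_df = n_obs - n_coef
ratio = dispersion_ratio(258.64, residual_df)
```

A ratio close to 1 would be consistent with the binomial assumption, whereas the value of roughly 4.1 obtained here points to overdispersion.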

To account for overdispersion in the model, a GAMLSS with a beta-binomial distribution was used. A beta-binomial distribution is a combination of the beta and binomial distributions (Hilbe, 2013) and is used when binomial data present overdispersion (Guimarães, 2005).

Table 8 shows the results of the base model with all the proposed variables, fitted as a GAMLSS with a beta-binomial distribution.

Table 8 – Results for the base model with all the proposed variables

                                           Estimate       Std. Error    t value        Pr(>|t|)
Intercept                                   6.29559794    1.14470686     5.49974684    0.00000074
pH                                         -1.26226658    0.26622565    -4.74134102    0.00001256
Altitude                                    0.00110670    0.00030041     3.68400796    0.00047906
Factor (Variety - Seda)                    -0.08451387    0.37110387    -0.22773643    0.82058864
Factor (Variety - Mix)                      0.27996758    0.63164774     0.44323372    0.65911506
Planting Density                           -0.00182239    0.00076547    -2.38074932    0.02031365
Slope                                      -0.00489983    0.00563126    -0.87011249    0.38754222
Farm Size                                   0.07216351    0.08153999     0.88500759    0.37951792
Soil Quality Index                         -0.73183938    0.13730923    -5.32986312    0.00000141
Factor(farm_type)2                         -0.25502120    0.23044149    -1.10666355    0.27264724
factor(farm_type)3                         -0.02887809    0.94505482    -0.03055705    0.97571939
Shade percentage                            0.00834388    0.02681166     0.31120343    0.75667341
Factor(farm_type)2: Factor(Variety) Mix     0.44620648    0.68007146     0.65611704    0.51413809

Mu link function: logit
Sigma link function: log
Sigma Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   -4.7       0.2277       -20.64    1.169e-32

-------------------------------------------------------------------
No. of observations in the fit: 76
Degrees of Freedom for the fit: 14
Residual Deg. of Freedom: 62
at cycle: 13

Global Deviance: 362.1153
AIC: 390.1153
SBC: 422.7456

Summary of the Randomised Quantile Residuals
mean = -0.0124098
variance = 1.063567
coef. of skewness = -0.1870921
coef. of kurtosis = 2.620784
Filliben correlation coefficient = 0.9950251

After applying the function stepGAIC to select a model using the AIC value, the selected variables were:

- pH

- Altitude

- Planting Density

- Soil Quality Index

Table 9 shows the summary of the fitted model with the selected variables. As can be observed, the model improved from an AIC of 390.12 to an AIC of 382.08 after selecting the most significant variables.

Table 9 – Results of the adjusted model after applying the variable selection using stepGAIC

                      Estimate       Std. Error    t value        Pr(>|t|)
Intercept              5.50208907    1.04975051     5.24133025    0.00000157
pH                    -1.32829757    0.25205493    -5.26987349    0.00000140
Altitude               0.00099656    0.00030522     3.26511767    0.00168565
Planting Density      -0.00202482    0.00068184    -2.96964748    0.00406429
Soil Quality Index    -0.51989099    0.11633084    -4.46907255    0.00002906

Mu link function: logit
Sigma link function: log
Sigma Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   -4.374     0.2158       -20.26    3.779e-32

-------------------------------------------------------------------
No. of observations in the fit: 76
Degrees of Freedom for the fit: 6
Residual Deg. of Freedom: 70
at cycle: 12

Global Deviance: 370.0847
AIC: 382.0847
SBC: 396.0691

Summary of the Randomised Quantile Residuals
mean = -0.02736283
variance = 0.9909983
coef. of skewness = -0.02519111
coef. of kurtosis = 2.923242
Filliben correlation coefficient = 0.9979815
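The AIC and SBC values reported for the full and the selected model can be reproduced from the global deviance and the degrees of freedom of each fit. The Python sketch below assumes the usual definitions (deviance plus 2 per degree of freedom for the AIC, deviance plus ln(n) per degree of freedom for the SBC); the function name `gaic` is hypothetical and the numbers are those of Tables 8 and 9.

```python
import math

def gaic(global_deviance, df, k=2.0):
    """Generalised AIC: deviance plus a penalty of k per degree of freedom."""
    return global_deviance + k * df

n_obs = 76  # number of farms in the fit

# Full model (Table 8) and selected model (Table 9).
aic_full = gaic(362.1153, 14)
aic_selected = gaic(370.0847, 6)
sbc_full = gaic(362.1153, 14, k=math.log(n_obs))
sbc_selected = gaic(370.0847, 6, k=math.log(n_obs))
```

The selected model has a higher deviance than the full model, but it uses 8 fewer degrees of freedom, so both criteria favour it.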

To validate the model, plots of the residuals were constructed and are shown in Figure 14.

Figure 14 – Plots of the residuals for model validation

In the upper left panel the randomised residuals are plotted against the fitted values. In the upper right panel the randomised residuals are plotted against an index, which corresponds to the observation number, in the present case the 76 farms. Neither of these two plots should show any pattern for a well fitted model (Zuur et al., 2009, p. 27).

The density estimate and the normal Q-Q plot help to evaluate the normality of the residuals, which is required for a well fitted model (Stasinopoulos et al., 2008, p. 122). In the present case the residuals appear to be normally distributed. Finally, four values of the randomised residuals should be checked to confirm that the model is well fitted: the mean, the variance, the coefficient of skewness and the coefficient of kurtosis, all of which are shown in Table 9. A well fitted model should have a mean near zero, a variance near one, a coefficient of skewness near zero and a coefficient of kurtosis near 3. In the present case the values in Table 9 (mean = -0.027, variance = 0.991, skewness = -0.025, kurtosis = 2.923) are close to the values expected for a well fitted model.

4.4 Analysis of the Results

4.4.1 Spatial Distribution of Fusarium Wilt in the study area


For the spatial distribution of Fusarium Wilt in the study area, point objects representing each farm were used, as explained in the methodology chapter. Disease incidence was classified into 5 classes, each with an assigned colour. This symbology and colour arrangement gives a graphical representation of how the different levels of disease incidence are spread across the study area. As a very basic interpolation, Thiessen Polygons were used to represent the zone of influence of each farm with Fusarium Wilt presence in the study area. However, this should be interpreted only as a first approximation of the zones of influence in the study area and not as a prediction map.
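The reason Thiessen Polygons are only a geometric construction can be seen from how they assign values: any location simply inherits the value of its nearest sampled farm. The Python sketch below illustrates that rule; the coordinates, incidence values and function names are hypothetical, and it is not the ArcGIS procedure used to produce Figure 7.

```python
import math

def nearest_farm(location, farms):
    """Index of the sampled farm closest to the given location."""
    return min(range(len(farms)), key=lambda i: math.dist(location, farms[i]))

def thiessen_value(location, farms, incidence):
    """Value assigned by Thiessen Polygons: the incidence of the nearest farm."""
    return incidence[nearest_farm(location, farms)]

# Hypothetical farm coordinates and disease incidence rates.
farms = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
incidence = [0.0, 0.25, 0.10]

value = thiessen_value((3.0, 0.5), farms, incidence)  # falls in the polygon of farm 1
```

No information from the other farms or from the explanatory variables enters this assignment, which is why the resulting map should not be read as a prediction surface.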

4.4.2 Spatial Autocorrelation Tests


After computing Moran’s I with different neighbour lists, there is no strong evidence of spatial autocorrelation in the Fusarium Wilt incidence in the study area. One fact should be highlighted from these results: the design of the spatial structure of the studied process through the neighbour list is highly relevant, because it affects the detection of spatial autocorrelation, as stated by Bivand et al. (2008, p. 239), O’Sullivan and Unwin (2010, p. 201), Haining (2003, p. 79) and Plant (2012, p. 80).

Geary’s c was also used to test for spatial autocorrelation in the Fusarium Wilt incidence in the study area. Although the p-values differ, the null hypothesis of zero spatial autocorrelation was also retained, as in the case of Moran’s I.

For Moran’s I using Empirical Bayes Estimates to smooth the disease incidence rates, the results were essentially the same as for the standard Moran’s I.

4.4.3 GAMLSS
The proposed GAMLSS model to explain the Fusarium Wilt incidence has the following explanatory variables:

- Soil pH

- Altitude

- Soil Quality Index

- Planting Density

This model was the result of applying the AIC to select the variables to be included, and the model validation was conducted using the graphs provided by the same statistical tool.
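Because the model uses a logit link, the direction and size of each effect can be read as an odds ratio: exponentiating a coefficient gives the multiplicative change in the odds of disease for a one-unit increase in that variable, with the other variables held fixed. The Python sketch below applies this to the estimates of Table 9; the function name `odds_ratio` and the choice of a 100-unit step for altitude and planting density are assumptions made for readability.

```python
import math

def odds_ratio(coefficient, step=1.0):
    """Multiplicative change in the odds for an increase of `step` units."""
    return math.exp(coefficient * step)

# Estimates of the selected model (Table 9).
coefficients = {
    "pH": -1.32829757,
    "Altitude": 0.00099656,
    "Planting Density": -0.00202482,
    "Soil Quality Index": -0.51989099,
}

ratios = {
    "pH": odds_ratio(coefficients["pH"]),
    "Altitude (per 100 units)": odds_ratio(coefficients["Altitude"], 100),
    "Planting Density (per 100 units)": odds_ratio(coefficients["Planting Density"], 100),
    "Soil Quality Index": odds_ratio(coefficients["Soil Quality Index"]),
}
```

Ratios below 1 (pH, planting density and Soil Quality Index) correspond to lower odds of disease as the variable increases, while the ratio above 1 for altitude corresponds to higher odds.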

Soil pH is already recognized to be correlated with Fusarium Wilt disease. According to Alvarez, García, Robles and Díaz (1981) there is evidence to expect a higher disease presence in soils with a pH lower than 7. Román Jerí (2012, p. 90) also reports a relationship between the disease incidence and soil pH in his results. The Soil Quality Index, calculated according to the methodology proposed by Altieri and Nicholls (2002), also shows a correlation with the disease incidence. Better soil conditions were also reported by Domínguez-Hernández, Negrín and Rodríguez (2008) as a condition associated with lower levels of Fusarium Wilt presence, although Alabouvette (as cited in Domínguez-Hernández et al., 2008, p. 405) states that there is no evidence that soil properties play any role in suppressiveness.

In the case of altitude, although no specific studies on the effect of altitude on Fusarium Wilt incidence were found in the literature, it is known that altitude acts indirectly on banana growth through a decrease in temperature, which makes it difficult to produce bananas above 1000 metres of altitude (Arvanitoyannis & Mavromatis, 2009). These unfavourable conditions for banana growth could also influence plant health and the ability of the plant to defend itself against Fusarium Wilt (Ploetz, Jones, Sebaisgari & Tushemereirwe, 1994).

With regard to plant density, no specific work dealing with the effect of plant density on Fusarium Wilt incidence was found in the literature. Works such as the one conducted by Athani, Revanappa and Dharmatti (2009) studied the effect of plant density on plant height and yield, but did not account for plant health or other variables of interest for the present work. However, as in the case of altitude, unfavourable conditions could make a weak plant more likely to acquire the disease, and with a higher plant density the competition between plants is also higher. One possible hypothesis is that under high competition (high plant density) without adequate fertilisation, the risk of having weaker plants could increase, and those weak plants could be more prone to Fusarium Wilt or even other diseases. However, this is only an outline of a hypothesis and should be taken as a possible subject for further work and not as fact.

4 Conclusions

1) The spatial distribution of Fusarium Wilt in the study area was successfully represented with the aid of the software ArcGIS for Desktop (ESRI, 2013). Thiessen polygons are a useful method for basic interpolation, but they have the clear limitation of being just a geometric construction, so inference from them should be made with caution. Other forms of interpolation, such as areal interpolation, are suggested for exploration in future work when the real boundaries of the farms are available. The areal interpolation tool available in ArcGIS for Desktop (ESRI, 2013) could be a starting point, but it has the limitation that it can assume a Gaussian, Binomial or Poisson distribution but not the Beta-Binomial distribution, which is needed for binomial data with overdispersion. As overdispersion is a common characteristic of disease incidence data, a geostatistical tool that easily allows researchers to produce interpolations for this kind of data would be a very valuable contribution of further work.

2) Spatial autocorrelation was not found in the Fusarium Wilt incidence in the study area, which mainly indicates that the distribution pattern of the farms with presence of Fusarium Wilt is random. Based on these results, the null hypothesis H0 cannot be rejected. Special attention should be paid to the fact that these results come from data representing farms located in a very diverse region with a variety of elements involved. Even though spatial autocorrelation was not found at this scale, other studies should be carried out to analyse spatial autocorrelation at farm level, which implies collecting the location of each sampled plant. The results obtained in the present work could support the design of a sampling strategy when the spatial component is to be included in an epidemiological study.

3) A GAMLSS model with a beta-binomial distribution was successfully applied to explain the Fusarium Wilt incidence of banana from a set of explanatory variables. The beta-binomial distribution was found to be the most appropriate to model binomial data with overdispersion, confirming what was found in the literature review. One remarkable result of the present work is the set of guidelines produced to model plant disease incidence, since all the script code of the methodology implemented in the R Language (R Core Team, 2014) is provided so that it can be easily reproduced. Desirable further work using the present work as a starting point could be the development of a software package that provides an easy-to-use tool for plant pathologists, or at least a detailed guide on how to model disease incidence data with the available tools.

4) The relationship of soil pH and soil quality conditions with Fusarium Wilt incidence was confirmed, as it coincides with results found in previous works. These results could lead to more specific work to analyse the influence of pH and soil conditions on the disease incidence of Fusarium Wilt. Further work is also suggested to explore in depth the relationship of altitude and plant density with Fusarium Wilt incidence. Although the present work is based on part of the raw data kindly provided by Román Jerí (2012), the approach taken was completely blind with respect to that previous work in terms of the methodology used to analyse the relationship between disease incidence and the proposed explanatory variables. As future work, a detailed review of the methodologies applied by the two works is suggested to outline the reasons behind the different results found.

5 References

Akaike, H. (1998). Information Theory and an Extension of the Maximum Likelihood
Principle. In E. Parzen, K. Tanabe, & G. Kitagawa (Eds.), Selected Papers of Hirotugu Akaike (pp.
199–213). New York, NY: Springer New York.

Altieri, M. A., & Nicholls, C. I. (2002). Sistema agroecologico rápido de evaluación de calidad
de suelo y salud de cultivos en el agroecosistema de café. Retrieved August 28, 2014, from
University of California, Berkeley, Agroecology in Action website:
http://www.agroeco.org/doc/SistAgroEvalSuelo2.htm

Alvarez, C. E., García, V., Robles, J., & Díaz, A. (1981). Influence des caractéristiques du sol
sur l’incidence de la Maladie de Panama. Fruits, 36(2), 71–81.

Alves, M. de C., & Pozza, E. A. (2010). Indicator kriging modeling epidemiology of common
bean anthracnose. Applied Geomatics, 2(2), 65–72. doi:10.1007/s12518-010-0021-1

Anselin, L. (2003). Data and Spatial Weights in spdep Notes and Illustrations.
Urbana-Champaign: University of Illinois. Retrieved September 9, 2014, from
https://geodacenter.asu.edu/system/files/dataweights.pdf

Arias, P., Dankers, C., Liu, P., & Pilkauskas, P. (2003). The world banana economy, 1985-
2002. Rome: Food and Agriculture Organization of the United Nations.

Arvanitoyannis, I. S., & Mavromatis, A. (2009). Banana Cultivars, Cultivation Practices, and
Physicochemical Properties. Critical Reviews in Food Science and Nutrition, 49(2), 113–135.

Assunção, R. M., & Reis, E. A. (1999). A new proposal to adjust Moran’s I for population
density. Statistics in Medicine, 18(16), 2147–2162.

Athani, S. I., Revanappa, & Dharmatti, P. R. (2009). Effect of plant density on growth and yield
in banana. 22, 1, 143–146.

Bachmaier, M., & Backes, M. (2008). Variogram or semivariogram? Understanding the
variances in a variogram. Precision Agriculture, 9(3), 173–175.

Bivand, R. (2014). spdep: Spatial dependence: weighting schemes, statistics and models. R
package (Version 0.5-71). Retrieved August 14, 2014, from
http://CRAN.R-project.org/package=spdep

Bivand, R., & Lewin-Koh, N. (2014). maptools: Tools for reading and handling spatial objects.
R package (Version 0.8-29). Retrieved August 14, 2014, from
http://CRAN.R-project.org/package=maptools

Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2008). Applied spatial data analysis with R.
New York; London: Springer.

Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., &
White, J.-S. S. (2009). Generalized linear mixed models: a practical guide for ecology and
evolution. Trends in Ecology & Evolution, 24(3), 127–135.

Brayford, D. (1992). Fusarium oxysporum f. sp. cubense. IMI Descriptions of Fungi and
Bacteria, 112, 115.

Crawley, M. J. (2007). The R book. Chichester, England; Hoboken, N.J.: Wiley.

Dalgaard, P. (2008). Introductory Statistics with R. New York, NY: Springer New York.

Del Ponte, E. M., Shah, D. A., & Bergstrom, G. C. (2003). Spatial Patterns of Fusarium Head
Blight in New York Wheat Fields Suggest Role of Airborne Inoculum. Plant Health Progress.
doi:10.1094/PHP-2003-0418-01-RS

Domínguez‐Hernández, J., Negrín, M. A., & Rodríguez, C. M. (2008). Soil Potassium Indices
and Clay‐Sized Particles affecting Banana‐Wilt Expression Caused by Soil Fungus in Banana
Plantation Development on Transported Volcanic Soils. Communications in Soil Science and Plant
Analysis, 39(3-4), 397–412.

ESRI. (2013). ArcGIS for Desktop (Version 10.2). Redlands, California: Environmental
Systems Research Institute.

Fortin, M.-J., Dale, M. R. T., & Hoef, J. ver. (2002). Spatial analysis in ecology. In
Encyclopedia of Environmetrics (Vol. 4, pp. 2051–2058). Chichester, UK: John Wiley & Sons, Ltd.

Frison, E., & Sharrock, S. (1998). The economic, social and nutritional importance of banana in
the world. In Bananas and Food Security/Les productions bananières: un enjeu économique majeur
pour la sécurité alimentaire (pp. 21–35). Douala, Cameroon: INIBAP.

Garrett, K. A., Madden, L. V., Hughes, G., & Pfender, W. F. (2004). New Applications of
Statistical Tools in Plant Pathology. Phytopathology, 94(9), 999–1003.

Griffith, D. A. (2009). Spatial Autocorrelation. Retrieved August 19, 2014, from Elsevier Store
website: http://booksite.elsevier.com/brochures/hugy/SampleContent/Spatial-Autocorrelation.pdf

Grosjean, P., & Ibañez, F. (2014). pastecs: Package for Analysis of Space-Time Ecological
Series. R package (Version 1.3-18). Retrieved August 5, 2014, from
http://CRAN.R-project.org/package=pastecs

Guimarães, P. (2005). A simple approach to fit the beta-binomial model. Stata Journal, 5(3),
385–394.

Guzmán-Plazola, R. A., Gómez-Pauza, R., García-Espinosa, R., & Gavi-Reyes, F. (2004).
Distribución Espacial de la Pudrición Radical del Frijol (Phaseolus vulgaris L.) por Fusarium solani
(Mart.) Sacc. f. sp. phaseoli (Burk.) Snyd. y Hans. en la Vega de Metztitlán, Hidalgo, México.
Revista Mexicana de Fitopatología

Haining, R. P. (2003). Spatial data analysis: theory and practice. Cambridge, UK ; New York:
Cambridge University Press.

Hilbe, J. M. (2013). Beta Binomial Regression. The SelectedWorks of Joseph M Hilbe.

Hughes, G., & Madden, L. V. (1993). Using the Beta-Binomial Distribution to Describe
Aggregated Patterns of Disease Incidence. Phytopathology, 83, 759–763.

Hughes, G., Madden, L. V., & Munkvold, G. P. (1996). Cluster Sampling for Disease Incidence
Data. Phytopathology, 86(2), 132–137.

Hughes, G., Munkvold, G. P., & Samita, S. (1998). Application of the logistic-normal-binomial
distribution to the analysis of Eutypa dieback disease incidence. International Journal of Pest
Management, 44(1), 35–42.

Jackson, M. C., Huang, L., Xie, Q., & Tiwari, R. C. (2010). A modified version of Moran’s I.
International Journal of Health Geographics, 9(1), 33. doi:10.1186/1476-072X-9-33

Jarvis, A., Reuter, H. I., Nelson, A., & Guevara, E. (2008). Hole-filled seamless SRTM data V4.
International Centre for Tropical Agriculture (CIAT). Retrieved October 5, 2014, from
http://srtm.csi.cgiar.org

Kongchouy, N., Choonpradub, C., & Kuning, M. (2010). Methods for Modeling Incidence
Rates with Application to Pneumonia Among Children in Surat Thani Province, Thailand, 1(37),
29–38.

Krivoruchko, K. (2011). Spatial statistical data analysis for GIS users. Redlands, Calif.: Esri
Press.

Krivoruchko, K., Gribov, A., & Krause, E. (2011). Multivariate Areal Interpolation for
Continuous and Count Data. Procedia Environmental Sciences, 3, 14–19.
doi:10.1016/j.proenv.2011.02.004

Lichtemberg, P. S. F., Pocasangre, L. E., Staver, C., Dold, C., & Sikora, R. A. (2010). Fusarium
Wilt (Fusarium oxysporum f. sp. cubense) in Gros Michel (AAA) bananas, the incidence at
smallholder level in Nicaragua. In Conference on International Research on Food Security, Natural
Resource and Rural Development. Zurich.

Madden, L. V., Hughes, G., & van den Bosch, F. (2007). The Study of Plant Disease Epidemics.
St. Paul, MN: APS Press.

Madden, L. V., & Hughes, G. (1994). BBD-Computer Software for Fitting the Beta-Binomial
Distribution to Disease Incidence Data. Plant Disease, 78(5), 536–540.

Morton, A. (2014). UTM Grid Zones of the World. Retrieved August 9, 2014, from
http://www.dmap.co.uk/utmworld.htm

Musoli, C. P., Pinard, F., Charrier, A., Kangire, A., ten Hoopen, G. M., Kabole, C., Owang, J.,
Bieysse, D., & Cilas, C. (2008). Spatial and temporal analysis of coffee wilt disease caused by
Fusarium xylarioides in Coffea canephora. European Journal of Plant Pathology, 122(4), 451–460.
doi:10.1007/s10658-008-9310-5

Nau, R. F. (2014). The logarithm transformation. Retrieved September 17, 2014, from
http://people.duke.edu/~rnau/411log.htm

Nelson, M. R., Felix-Gastelum, R., Orum, T. V., Stowell, L. J., & Myers, D. E. (1994).
Geographic Information Systems and Geostatistics in the Design and Validation of Regional Plant
Virus Management Programs. Phytopathology, 84(9), 898–905.

Nelson, M. R., Orum, T. V., Jaime-Garcia, R., & Nadeem, A. (1999). Applications of
Geographic Information Systems and Geostatistics in Plant Disease Epidemiology and
Management. Plant Disease, 83(4), 308–319. doi:10.1094/PDIS.1999.83.4.308

NIST/SEMATECH. (2014). Measures of Skewness and Kurtosis. Retrieved August 18, 2014,
from http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Oerke, E.-C., Meier, A., Dehne, H.-W., Sulyok, M., Krska, R., & Steiner, U. (2010). Spatial
variability of fusarium head blight pathogens and associated mycotoxins in wheat crops.
Plant Pathology, 59(4), 671–682. doi:10.1111/j.1365-3059.2010.02286.x

O’Sullivan, D., & Unwin, D. (2010). Geographic Information Analysis (2nd ed.). John Wiley &
Sons, Inc.

Pan, W. (2001). Akaike’s Information Criterion in Generalized Estimating Equations.
Biometrics, 57(1), 120–125. doi:10.1111/j.0006-341X.2001.00120.x

Paulitz, T. C., Zhang, H., & Cook, R. J. (2003). Spatial distribution of Rhizoctonia oryzae and
rhizoctonia root rot in direct-seeded cereals. Canadian Journal of Plant Pathology, 25(3), 295–303.
doi:10.1080/07060660309507082

Pérez-Vicente, L., Dita, M. A., & Parte, E. M. la. (2014). Technical Manual Prevention and
Diagnostic of Fusarium Wilt (Panama disease) of banana caused by Fusarium oxysporum f. sp.
cubense Tropical Race 4 (TR4). FAO.

Pfeiffer, D. U. (1996). Issues related to handling of spatial data. In Proceedings of the
epidemiology and state veterinary programmes (pp. 83–105). Christchurch.

Plant, R. E. (2012). Spatial data analysis in ecology and agriculture using R. Boca Raton: CRC
Press.

Ploetz, R. C. (2006). Fusarium Wilt of Banana Is Caused by Several Pathogens Referred to as
Fusarium oxysporum f. sp. cubense. Phytopathology, 96(6), 653–656.
doi:10.1094/PHYTO-96-0653

Ploetz, R. C., Jones, D. R., Sebaisgari, K., & Tushemereirwe, W. K. (1994). Panama disease on
East African highland bananas. Fruits (Paris), 49(4), 253–260.

R Core Team. (2014). R: A language and environment for statistical computing. Vienna,
Austria: R Foundation for Statistical Computing. Retrieved June 5, 2014, from
http://www.R-project.org/

Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale
and shape (with discussion). Applied Statistics, 54, 507–554.

Román Jerí, C. H. (2012). Consideraciones epidemiológicas para el manejo de la Marchitez por
Fusarium (Fusarium oxysporum f. sp. cubense) del banano en la región central del Perú. CATIE,
Turrialba, Costa Rica.

Selvaraja, S., Balasundra, S. K., Vadamalai, G., & Husni, M. H. A. (2012). Spatial Variability
of Orange Spotting Disease in Oil Palm. Journal of Biological Sciences, 12(4), 232–238.
doi:10.3923/jbs.2012.232.238

Souris, M., & Bichaud, L. (2011). Statistical methods for bivariate spatial analysis in marked
points. Examples in spatial epidemiology. Spatial and Spatio-Temporal Epidemiology, 2(4), 227–
234. doi:10.1016/j.sste.2011.06.001

Stasinopoulos, D. M., & Rigby, R. A. (2007). Generalized Additive Models for Location Scale
and Shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1–46.

Stasinopoulos, M., Rigby, B., & Akantziliotou, C. (2008). Instructions on how to use the gamlss
package in R (Second Edition).

Taliei, F., Safaie, N., & Aghajani, M. A. (2013). Spatial Distribution of Macrophomina
phaseolina and Soybean Charcoal Rot Incidence Using Geographic Information System (A Case
Study in Northern Iran), 15, 1523–1536.

Teetor, P., & Loukides, M. K. (2011). R cookbook. Sebastopol, CA; Beijing: O’Reilly.

Tsai, P.-J. (2012). Application of Moran’s Test with an Empirical Bayesian Rate to Leading
Health Care Problems in Taiwan in a 7-Year Period (2002–2008). Global Journal of Health Science,
4(5). doi:10.5539/gjhs.v4n5p63

Tobler, W. R. (1970). A Computer Movie Simulating Urban Growth in the Detroit Region.
Economic Geography, 46(2), 234–240.

Urdan, T. C. (2010). Statistics in plain English (Third Edition). New York: Routledge.

Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A., & Smith, G. M. (2009). Mixed effects
models and extensions in ecology with R. New York, NY: Springer-Verlag New York.

6 Annexes

Annex 1 – R code utilised to test spatial autocorrelation

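#Loading packages
library(maptools) #readShapePoints()
library(spdep) #neighbour lists, spatial weights and autocorrelation tests

#Loading dataset
farms.p <- readShapePoints("SHP/farms.shp")

#Coordinate matrix of the sampled farms, used to build and plot the neighbour lists
coords <- coordinates(farms.p)
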
##########----Neighbours list using different methods

#-----------------Graph based Neighbours

#Delaunay Triangulation
tri.list <- tri2nb(coords)
plot.nb(tri.list, coords)
nb_list.t <- nb2listw(tri.list, style = "W")

#Sphere of Influence
soi.1 <- soi.graph(tri.list, coords)
soi_nb <- graph2nb(soi.1)
plot.nb(soi_nb, coords)
nb_list.s <-nb2listw(soi_nb, style = "W")

#----------------Distance based Neighbours

#Nearest Neighbours
#Default k value - k = 1
k_near.1 <- knn2nb(knearneigh(farms.p))
plot.nb(k_near.1, coords)

#With k = 5
k_near.5 <- knn2nb(knearneigh(farms.p, k = 5))
plot.nb(k_near.5, coords)

#With k = 10
k_near.10 <- knn2nb(knearneigh(farms.p, k = 10))
plot.nb(k_near.10, coords)

#Neighbours Spatial Weights Lists

nb_list.k1 <- nb2listw(k_near.1, style = "W")
nb_list.k5 <- nb2listw(k_near.5, style = "W")
nb_list.k10 <- nb2listw(k_near.10, style = "W")

#Distance Bands

#Neighbours list based on distance: the upper bound is the largest
#first-nearest-neighbour distance, so every farm has at least one neighbour
k_dist <- nbdists(k_near.1, coordinates(farms.p))
k_dist_vec <- unlist(k_dist)
max_dist <- max(k_dist_vec)

dist_nei <- dnearneigh(farms.p, d1 = 0, d2 = max_dist)
plot(dist_nei, coordinates(farms.p))

nb_list.d <- nb2listw(dist_nei, style = "W")

#Moran's I Test with different neighbour lists

moran.test(farms.p$inc_raw, listw = nb_list.t, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.s, alternative = "two.sided")

moran.test(farms.p$inc_raw, listw = nb_list.k1, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.k5, alternative = "two.sided")
moran.test(farms.p$inc_raw, listw = nb_list.k10, alternative = "two.sided")

moran.test(farms.p$inc_raw, listw = nb_list.d, alternative = "two.sided")

#Geary's c Test with different neighbour lists

geary.test(farms.p$inc_raw, listw = nb_list.t, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.s, alternative = "two.sided")

geary.test(farms.p$inc_raw, listw = nb_list.k1, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.k5, alternative = "two.sided")
geary.test(farms.p$inc_raw, listw = nb_list.k10, alternative = "two.sided")

geary.test(farms.p$inc_raw, listw = nb_list.d, alternative = "two.sided")

#Moran's I using Empirical Bayes Estimates
#EBest() takes the number of diseased plants and the number of plants evaluated per farm

ebi.1 <- EBest(farms.p$diseased_p, farms.p$evaluated_, family = "binomial")

moran.test(ebi.1$estmm, listw = nb_list.t, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.s, alternative = "two.sided")

moran.test(ebi.1$estmm, listw = nb_list.k1, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.k5, alternative = "two.sided")
moran.test(ebi.1$estmm, listw = nb_list.k10, alternative = "two.sided")

moran.test(ebi.1$estmm, listw = nb_list.d, alternative = "two.sided")
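
As a rough illustration of what the Moran's I statistic in the tests above measures, the following sketch computes it by hand in base R with row-standardised k-nearest-neighbour weights. The coordinates, the incidence values, the choice of k = 2 and the helper names knn_weights() and moran_i() are all assumptions made for this example only; none of them come from the thesis data or from spdep.

```r
# Toy data (hypothetical, not the thesis data): 6 points on a line
xy  <- cbind(x = c(0, 1, 2, 3, 4, 5), y = rep(0, 6))
inc <- c(0.10, 0.12, 0.15, 0.40, 0.45, 0.50)  # made-up incidence values

# Row-standardised k-nearest-neighbour weights matrix (each row sums to 1)
knn_weights <- function(xy, k) {
  d <- as.matrix(dist(xy))
  diag(d) <- Inf                      # a point is not its own neighbour
  w <- matrix(0, nrow(d), ncol(d))
  for (i in seq_len(nrow(d))) {
    nn <- order(d[i, ])[seq_len(k)]   # indices of the k closest points
    w[i, nn] <- 1 / k
  }
  w
}

# Moran's I = (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2),
# where z are deviations from the mean and S0 is the sum of all weights
moran_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

w     <- knn_weights(xy, k = 2)
I_obs <- moran_i(inc, w)
I_exp <- -1 / (length(inc) - 1)       # expected value under no spatial autocorrelation
```

In this invented example low values sit next to low values and high next to high, so I_obs lies well above I_exp. A value of I_obs close to I_exp would instead be consistent with no spatial autocorrelation.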

Annex 2 – R code utilised for the GAMLSS model


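#Miscellaneous code for variable construction

#Two-column response for the binomial and beta-binomial models.
#NOTE: glm() and gamlss() read a two-column response as cbind(successes, failures).
#If evaluated_ holds the total number of plants evaluated, the second column
#should be evaluated_ - diseased_p.
dis_inc <- cbind(farms.p$diseased_p, farms.p$evaluated_)

variety.f <- factor(farms.p$Variety, levels = c(1, 2, 3),
                    labels = c("Isla", "Seda", "Mix"))

#Base model GLM with all the explanatory variables

glm.0 <- glm(formula = dis_inc ~ pH + altitude + slope + area + planting_d +
             Average_So + factor(farm_type) * variety.f + shade_perc,
             data = farms.p, family = binomial)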

#Base GAMLSS model (beta-binomial) with all the explanatory variables
library(gamlss) #Load the gamlss package

gamlss.1 <- gamlss(formula = dis_inc ~ pH + altitude + variety.f + planting_d +
                   slope + area + Average_So + variety.f * factor(farm_type) +
                   shade_perc, family = BB, data = farms.p)

summary(gamlss.1)

plot(gamlss.1)

stepGAIC(gamlss.1)

#Fitted model after selection using stepGAIC

gamlss.2 <- gamlss(formula = dis_inc ~ pH + altitude + planting_d + Average_So,
                   family = BB, data = farms.p)

summary(gamlss.2)

GAIC(gamlss.1, gamlss.2)

plot(gamlss.2)
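
The beta-binomial family (BB) used in the models above allows more variance than a plain binomial with the same mean, which is why it suits aggregated disease incidence counts. The base R sketch below illustrates this with a hand-written probability mass function. The helper name dbetabinom(), the values n = 20, mu = 0.3 and sigma = 0.2, and the parameterisation alpha = mu / sigma, beta = (1 - mu) / sigma are assumptions for illustration; the parameterisation is meant to mirror the gamlss BB family but should be checked against the gamlss documentation.

```r
# Beta-binomial probability mass function, written with base R only.
# Assumed parameterisation (verify against gamlss before relying on it):
#   alpha = mu / sigma, beta = (1 - mu) / sigma
dbetabinom <- function(y, n, mu, sigma) {
  a <- mu / sigma
  b <- (1 - mu) / sigma
  exp(lchoose(n, y) + lbeta(y + a, n - y + b) - lbeta(a, b))
}

n <- 20                         # hypothetical number of plants evaluated on one farm
y <- 0:n                        # possible numbers of diseased plants
p <- dbetabinom(y, n, mu = 0.3, sigma = 0.2)

bb_mean <- sum(y * p)                  # mean of the beta-binomial
bb_var  <- sum((y - bb_mean)^2 * p)    # variance of the beta-binomial
bin_var <- n * 0.3 * (1 - 0.3)         # binomial variance with the same mean
```

Comparing bb_var with bin_var shows the extra-binomial variation that the sigma parameter introduces. As sigma approaches zero, the beta-binomial variance approaches the binomial one.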
