Professional Documents
Culture Documents
Species Data Issues of Acquisition and Design
Species Data Issues of Acquisition and Design
Species Data Issues of Acquisition and Design
Issues of
Acquisition and Design
1
www.gbif.org
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
111
gaps in the collections and these gaps might cause difficulties later on
when analysing the data.
The rapid growth of web databases such as GBIF has made access to
data much simpler (e.g. Table 7.1). Yet this does not necessarily mean
that the data can be used without restriction. Extractions from large,
community databases such as GBIF need to be treated with caution for
the following reasons: (i) uncertainty in species identification, (ii) low or
unknown accuracy of sample location, (iii) lack of design, (iv) incomplete
or uneven spatial coverage of the true distribution of a species, or (v) spa-
tial autocorrelation in sample locations.
The issues related to species identification cannot be easily resolved,
and are not addressed in this book. The second issue related to uncer-
tainty in sampling location is covered in Chapter 8. The lack of design is
a third, serious issue for all analyses that attempt to derive a probabilistic
habitat suitability estimate from large datasets. If lack of design is an issue,
one might consider resampling existing databases in order to increase the
level of design (Broennimann and Guisan, 2008; Veloz, 2009; Anderson
and Raza, 2010; Hijmans, 2012; Syfert et al., 2013; Mateo et al., 2015).
Such resampling can only improve, but not fully remove the design bias
inherent to such large datasets (see Section 7.4 for some suggestions).The
fourth issue relating to the lack of coverage cannot be easily overcome,
and its effects are treated in Section 8.2. The fifth issue is spatial autocor-
relation, which is an inherent property of spatially structured, ecological
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
112 · Data Acquisition, Sampling Design, and Spatial Scales
data. This is partly dealt with in the following section. Finally, the sixth
issue is sample size. When data is derived from large databases, one only
has limited control over the sample size of a species. The available data
can be considered as “presence-only” or the samples of presences of spe-
cies other than the target species can be considered as “pseudo-absences”
and thus allow the production of presence and absence data. However,
it is likely that these derived datasets contain considerable problems in
terms of prevalence (Phillips et al., 2009) and design, and may require
further subsampling or design efforts with regards to “pseudo-absence”
selection (see Barbet-Massin et al., 2012; Mateo et al., 2015).
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
113
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
114 · Data Acquisition, Sampling Design, and Spatial Scales
globally whether there is any spatial autocorrelation in the residuals. We
can do this using a Moran’s I test.This involves deriving a distance matrix
from all observations, and then testing the distance effect against the
residuals. Here, we use the ape library and the stepwise-optimized GLM
model of V. vulpes from Section 6.2.10. As coordinates, we take the x-
and y-coordinates from the pts.cal object, which we used to derive
the calibration dataset. The order and dimension of the two objects are
still the same. We generate an inverse-distance matrix with the diagonals
set to 0 for the Moran’s I test on the GLM model residuals (i.e. vulpes.
step$residuals).
> library(ape)
> xy <-pts.cal[,1:2]
> dists <-as.matrix(dist(xy))
> dists.inv <-1/
dists
> diag(dists.inv) <- 0
> Moran.I(vulpes.step$residuals, dists.inv)
$observed
[1]0.01506913
$expected
[1] -0.0001695203
$sd
[1]0.0003468404
$p.value
[1] 0
We learn from this example that the p-value for testing for spatial auto-
correlation is highly significant (p < 0.05). We therefore find spatial
autocorrelation in the residuals. Next, we will want to plot the spatial
correlation structure against distances between our observations. Samples
separated by a short distance should have greater similarity (and thus
correlation) than samples separated by a larger distance. We evaluate this
distance dependence using a Mantel correlogram in the ncf package.
This package makes it easy to plot a spatial (Mantel) correlogram. This
is done by first extracting the residuals from the GLM object, and then
randomly selecting 500 points from the residuals and from the x- and
y-coordinates (xy object from the previous example). This information
is needed in the correlog() command in the ncf package. We store
the result of this command in the spat.cor object. We can either plot
this object directly by typing “plot(spat.cor)”, or we can extract the
necessary information, and make a neater plot (see Figure 7.1)
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
115
Figure 7.1 Spatial correlation (a) and spatial patterns (b) of model residuals for the
Vulpes vulpes stepwise-optimized GLM model. The correlogram reveals a low cor-
relation at a short distance. (A black and white version of this figure will appear in some
formats. For the color version, please refer to the plate section.)
> library(ncf)
> rsd<-vulpes.step$residuals
> rnd<-sample(1:length(rsd),500,replace=T)
> spat.cor<-correlog(xy[rnd,1],xy[rnd,2],rsd[rnd],increment=2,
resamp=10)
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
116 · Data Acquisition, Sampling Design, and Spatial Scales
incorporate the effects of spatial autocorrelation in models (e.g. Carl and
Kühn, 2007; Dormann et al., 2007; Kühn et al., 2009). Broad spatial trends
such as those visible in the residuals in Figure 7.1 can be removed using
spatial eigenvector maps (SEVM) or similar techniques built on residuals,
in order to avoid strong interactions with the other predictors (Maggini
et al., 2006; Kühn et al., 2009). This is not currently standard practice in
habitat suitability modeling, but we agree that neglecting such effects can
have an adverse impact on the modeling and the conclusions drawn from
these analyses. Regardless of whether the spatial trends are removed or
not, we will always be faced with a lack of certainty regarding the impor-
tance and trends in analysed variables, due to the correlative nature of
regression-type analyses. Careful selection of a smaller number of impor-
tant variables helps to avoid these problems. Running sensitivity analyses
to estimate the relative effect size of spatial autocorrelation in models,
compared to other factors (see subsequent sections) is also helpful. In
some cases spatial autocorrelation has a negligible effect on HSM predic-
tions (Thibaud et al., 2014), but see also Guillera-Arroita et al. (2014) for
some issues related to the model setup in the aforementioned publication.
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
117
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
118 · Data Acquisition, Sampling Design, and Spatial Scales
Once everything is ready, we run the loop and generate the prevalence test.
>for (i in 1:length(yb1)){
paok<-
rbind(vulvul.p[1:yb1[i],],vulvul.a)
rownames(paok)<-1:dim(paok)[1]
# Full Model
paok1f<-glm(Vulpes.vulpes~bio3+I(bio3^2)+bio11+I(bio11^2)+bio12+
I(bio12^2),family=“binomial”,data=paok)
paok0<-predict(paok1f,paok,type=“response”)
pr.qual[i,1]<-ecospat.adj.D2.glm(paok1f)
tmp1 <-data.frame(1:length(paok0),paok[,1],paok0)
names(tmp1) <-c(“ID”,”Observed”,”Predicted”)
pr.qual[i,3]<-auc(tmp1)$AUC
pr.qual[i,6]<-ecospat.max.kappa(paok0,paok[,1])[[2]
][1,2]
pr.qual[i,9]<-ecospat.max.tss(paok0,paok[,1])[[2]
][1,2]
# Stepwise optimized model
paok1s<-step(paok1f,direction=“both”,trace=F)
paok0<-
predict(paok1s,paok,type=“response”)
pr.qual[i,2]<-ecospat.adj.D2.glm(paok1s)
tmp1 <-data.frame(1:length(paok0),paok[,1],paok0)
names(tmp1) <-c(“ID”,”Observed”,”Predicted”)
pr.qual[i,4]<-auc(tmp1)$AUC
pr.qual[i,7]<-ecospat.max.kappa(paok0,paok[,1])[[2]][1,2]
pr.qual[i,10]<-ecospat.max.tss(paok0,paok[,1])[[2]
][1,2]
# Xval procedure
paok1x<-ecospat.cv.glm(paok1s)
tmp1 <-data.frame(1:length(paok0),paok[,1],paok1x$predictions)
names(tmp1) <-c(“ID”,”Observed”,”Predicted”)
pr.qual[i,5]<-auc(tmp1)$AUC
pr.qual[i,8]<-ecospat.max.kappa(paok1x$predictions,paok[,1])
[[2]][1,2]
pr.qual[i,11]<-ecospat.max.tss(paok1x$predictions,paok[,1])[[2]
]
[1,2]
}
Finally, we plot all model quality results from the prevalence test
(Figure 7.2). Note that the sample size plot is also presented in the same
figure, but not given as code example here.
> plot(pr.qual$Prev,pr.qual$Kappa.f,ty=“l”,lwd=5,col=“#00FF00B4”,
ylim=c(0,1.0),xlim=c(0,.5),xlab=“Sample size”,
ylab=“Model Quality”,main=“Prevalence Effects”)
> points(pr.qual$Prev,pr.qual$Kappa.s,ty=“l”,lwd=5,col=“#00CD00B4”,
lty=3)
> points(pr.qual$Prev,pr.qual$Kappa.x,ty=“l”,lwd=5,col=“#008B00B4”,
lty=2)
> points(pr.qual$Prev,pr.qual$TSS.f,ty=“l”,lwd=5,col=“#ADD8E6B4”)
> points(pr.qual$Prev,pr.qual$TSS.s,ty=“l”,lwd=5,col=“#9FB6CDB4”,
lty=3)
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
119
Figure 7.2 (a) Sample size and (b) prevalence effects on model accuracy in the global
Vulpes vulpes dataset. Low prevalence has a strong effect on cross-validated Kappa and
TSS accuracies. (A black and white version of this figure will appear in some formats. For the
color version, please refer to the plate section.)
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
120 · Data Acquisition, Sampling Design, and Spatial Scales
> points(pr.qual$Prev,pr.qual$TSS.x,ty=“l”,lwd=5,col=“#0000FFB4”,
lty=2)
> points(pr.qual$Prev,pr.qual$AUC.f,ty=“l”,lwd=5,col=“#EE2C2CB4”)
> points(pr.qual$Prev,pr.qual$AUC.s,ty=“l”,lwd=5,col=“#CD2626B4”,
lty=3)
> points(pr.qual$Prev,pr.qual$AUC.x,ty=“l”,lwd=5,col=“#8B1A1AB4”,
lty=2)
> legend(.355,0.45,c(“Kappa full”,”Kappa step”,”Kappa xval”,
“TSS full”, “TSS step”, “TSS xval”, “AUC full”, “AUC step”,
“AUC xval”), lty=c(1,3,2,1,3,2,1,3,2),lwd=c(8),col=c(“#00FF00B4”,
”#00CD00B4”, “#008B00B4”,”#ADD8E6B4”,”#9FB6CDB4”,”#0000FFB4”,
”#EE2C2CB4”, “#CD2626B4”,”#8B1A1AB4”))
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
121
Before classifying the bio3 and bio12 rasters, we crop the global bioclim
rasters from Section 6.1.3 to the extent of the conterminous lower 48
states of the United States. For this we load a shapefile of the states of the
United States and extract the lower 48 states:
> usa <-shapefile(“vector/
usa/
USA_
states.shp”)
> usa_
contin <-usa[usa$STATE_NAME != “Alaska” & usa$STATE_NAME
!= “Hawaii”, ]
With the second command, we select all states that are neither Alaska
nor Hawaii, thus representing the conterminous United States. Next we
need to generate an empty raster, to which we then rasterize the state
polygons using the DRAWSEQ field:
> empty_
raster <-raster(bio3)
> usa_
raster <-rasterize(usa_
contin,empty_
raster,field=“DRAWSEQ”)
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
122 · Data Acquisition, Sampling Design, and Spatial Scales
Next, we mask and crop the global bio3 and bio12 rasters to the United
States extent:
> bio3.us <-crop(mask(bio3, usa_
raster), extent(usa_
contin))
> bio12.us <-crop(mask(bio12, usa_
raster), extent(usa_
contin))
We now have generated two reclassified raster layers, one with class
numbers running from 1 to 9 and the other with numbers of 10, 20,
… 90. Then we summed the two layers so the new values are now
unique, meaning that we can trace back their origin. For example, a
number of 56 means that it originates from class 5 (code 50) of bio12
and class 6 of bio3. No other combination of classes would generate
this code.
Next we plot the histogram of class frequencies, and we plot the
stratification map with the rainbow color scheme (Figure 7.3). This
includes checking the range of possible values for plotting the histo-
gram (cspan). In order to better visualize the classes, we also allocate
random colors to the map, and we click on five locations with the
mouse to read values from the map (shown after you click five times
at any location on the map). The randomly allocated colors are stored
in a variable called yb, a name that has no ecological or R-specific
meaning:
> cspan<-maxValue(B3B12.comb)-minValue(B3B12.comb)
> yb<-rainbow(100)[round(runif(cspan,.5,100.5))]
> par(mfcol=c(1,2))
> hist(B3B12.comb,breaks=100,col=heat.colors(cspan),
main=“Histogram values”)
> plot(B3B12.comb,col=yb,main=“Stratified map”, asp=1)
> click(B3B12.comb,n=5,type=“p”,xy=T)
> par(mfcol=c(1,1))
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
123
Figure 7.3 (a) Histogram of the frequencies of the classes, and (b) stratification map
with the rainbow color scheme used to identify the strata from the environmental
stratification of the study region. (A black and white version of this figure will appear in
some formats. For the color version, please refer to the plate section.)
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
124 · Data Acquisition, Sampling Design, and Spatial Scales
for the rgdal package. In order to distinguish it from our original ras-
ter format, we will call this paok from now on (an artificial name with
no ecological meaning). This is different from the raster format in the
raster package:
> paok <-as(B3B12.comb, “SpatialPixelsDataFrame”)
It is clear to see that these designs are arranged in a sequence from fully
random to fully regular.The four graphs are plotted as follows (Figure 7.4):
> par(mfcol=c(2,2))
> plot(B3B12.comb,main=“random”,col=rev(terrain.colors(25)))
> points(s.rand, pch=3, cex=.5)
> plot(B3B12.comb,main=“stratified”,col=rev(terrain.colors(25)))
> points(s.strt, pch=3, cex=.5)
> plot(B3B12.comb,main=“nonaligned”,col=rev(terrain.
colors(25)))
> points(s.nona, pch=3, cex=.5)
> plot(B3B12.comb,main=“regular”,col=rev(terrain.colors(25)))
> points(s.regl, pch=3, cex=.5)
> par(mfcol=c(1,1))
We have now developed four different designs, and each of these data-
sets can be used to sample the environmental layer stacks as done previ-
ously. Once this is done, we can check how the different designs affect
the retrieved distribution layers sampled with the respective design. We
can compare this distribution with the known truth (when plotting the
distribution of the raster layers).
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
125
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
126 · Data Acquisition, Sampling Design, and Spatial Scales
approach still has a lower risk of failing to detect rare strata. With both
strategies, even the rarest strata will still be chosen with a few points,
(proportional) or with an equal number of points as other strata (equal),
unless they are too small (see Figure 7.6). The equal number variant
assigns the same number of points, which –depending on the number of
available classes –can result in slightly different numbers than those ori-
ginally given. The proportional variant assigns the numbers according to
the number of pixels available in the stratification grid. If the proportion
of a stratum results in less than one sample being allocated to this class,
then this class is not sampled.
We apply the equal, and then the proportional, allocation of sample
points according to our random (environmentally) stratified design using
the ecospat.recstrat_regl() and the ecospat.recstrat_
prop() functions. These functions take the stratification grid and the
total number of points to allocate per strata as arguments.
Finally, we plot 150 points from the two designs over the whole study
area (Figure 7.5). We can see that the proportional design reveals a very
similar distribution as is available in the study area, while the equal design
(even numbers per stratum) generates a more uniform distribution of
strata.
> envstrat_equ<- ecospat.recstrat_regl(B3B12.comb,150)
> envstrat_prp<- ecospat.recstrat_prop(B3B12.comb, 150)
> par(mfcol=c(1,2))
> plot(B3B12.comb,main=“Proportional Sampling”, col=rev(terrain.
colors(25)))
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
127
Figure 7.6 Illustration of the class distribution between the two sampling designs.
The x-axis labels indicate the identifier of the strata from the environmental strati-
fication of the study region.
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
128 · Data Acquisition, Sampling Design, and Spatial Scales
7.4.4 Sampling Designs Along Linear Features
We have so far developed six different spatially and environmentally ori-
ented sampling designs, which all cover the whole study area. Some eco-
logical questions will require sampling along linear features, and we will
therefore briefly mention this approach as well. For example, you may
need to sample along rivers or transects from the sea to inland, or you
might ask how a species has invaded along roads. All cases require specific-
ally sampling along linear elements, not across large landscapes as a whole.
The three designs we demonstrate here are available in the rgdal pack-
age, and resemble the first four spatial designs discussed.The first is a “ran-
dom” design, where points are allocated randomly at distance D from the
origin of the linear path.The “regular” design dissects the whole length
of the linear element into even distance classes, and allocates points accord-
ingly. Finally, the “stratified” sampling is again a spatiallystratified sam-
pling approach, which dissects the line into blocks of equal distance, and
samples randomly within these. The “non-aligned” sampling does not
make any sense here, and is therefore not available.
As an example, we load the DEM of the globe, upscale the spatial
resolution to 10 km, and crop it to the extent and the mask of the con-
terminous lower 48 states used in the examples in Section 7.4.1.
> dem_globe <- raster(“raster/topo/GTOPO30.tif”)
> dem_usa <- crop(dem_globe, usa_contin)
> dem_usa_
10km <-aggregate(dem_ usa, 10, fun=mean)
> empty_raster <-raster(dem_ usa_10km)
> usa_raster <- rasterize(usa_contin, empty_raster,
field=“DRAWSEQ”)
We then mask the DEM and generate a contour of the 1000 m altitude
band throughout the United States, as illustrated in Figure 7.7. This con-
tour line is now ready to use to design different line sampling strategies
by allocating sample points along this elevation contour.
> dem_
usa_
masked <-mask(dem_usa_
10km, usa_raster)
> iso_
1000m<-rasterToContour(dem_
usa_masked, nlevels=1,
levels=c(1000))
> plot(dem_usa_
masked,col=dem.c(100),
main=“Elevation Contours at 1000m”)
> lines(iso_1000m, lwd=1.3, col=2)
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
129
Figure 7.7 Elevation contour lines (1000 m) plotted in the study area. (A black and
white version of this figure will appear in some formats. For the color version, please refer to
the plate section.)
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
130 · Data Acquisition, Sampling Design, and Spatial Scales
Figure 7.8 Three linear sampling designs applied along elevation contours in the
study region: (a) random, (b) stratified, and (c) regular.
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
131
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
132 · Data Acquisition, Sampling Design, and Spatial Scales
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
133
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012
134 · Data Acquisition, Sampling Design, and Spatial Scales
are no strong arguments in favor of a different, more taxon-specific
approach (Barbet-Massin et al., 2012). Constraining the sampling to
areas where the species has not been observed to prevent conflict-
ing presence and absence observations is likely to increase initial bias
(Wisz and Guisan, 2009) and may result in over-predictions of the
species ranges (Hanberry et al., 2012).
Downloaded from https://www.cambridge.org/core. University of New England, on 18 Nov 2017 at 17:08:20, subject to the Cambridge Core
terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/9781139028271.012