
701-1613-00 Advanced Landscape Ecology [HS 12]

Lecture Notes: 24.11. / 31.11.2012

Species Distribution Modeling (SDM) with GLM, GAM and CART


Dependent vs. independent variables: a conceptual ecological view
Here, we introduce and present modern statistical approaches that allow us to simulate the distribution or
abundance and density of species, populations or habitats in space. The methods we present are based on
regression techniques, and they all have in common that they explain and predict the presence or abundance of
species as a function of one or many site factors (elevation, slope, aspect, solar radiation, precipitation, soil type,
etc.).
In general, we aim at predicting a biotic variable (e.g. presence) as a function of explanatory variables. We therefore call the biotic variable the dependent variable and the predictors the independent variables. Yet, several terminologies exist in the scientific literature (see list):
Dependent: response variable, dependent variable
Independent: explanatory variable, independent variable

This dichotomy reflects the logic of regression analyses, where a response variable is considered dependent on explanatory (or independent) variables. The independent variables are considered uninfluenced by the dependent variable, meaning that there is no immediate feedback. This dichotomy also reflects the biological logic of the regression modeling approach. We attempt to explain, e.g., the presence of a species from biotic and abiotic site factors. The presence of a species is thus considered a physiological or mechanistic consequence of these site factors, or in other words a causal function of the explanatory variables based on the niche requirements of the species. The regression itself does NOT distinguish between correlative and causal relationships. As soon as a variable is significant in a regression, it can be used as a statistical predictor, even if the biological explanation is irrelevant or wrong. Thus, it largely depends on the experimental design and context whether we can talk about causal or correlative relationships.
A more recently developed classification of ecological models follows a similar logic. We again distinguish between mechanistic and empirical models, but a third group of models is added, called mathematical or analytical models (see figure).

From: Guisan & Zimmermann, 2000


This classification was originally developed by Levins (1966) and later refined by Sharpe (1990) and Guisan &
Zimmermann (2000). The basic hypothesis is that only two out of three desirable model characteristics namely
reality, precision and generality can be simultaneously optimized when a model is developed and refined. As a
consequence of this trade-off, the third characteristic is sacrificed. Although partly challenged, this model
classification is generally accepted today. No model can simultaneously be highly accurate (P), be based on physiological processes and mechanisms (R), and be applied generally (e.g. globally) under a large range of different conditions and instances (G).

F. Kienast, J. Bolliger, N.E. Zimmermann
Statistical regression models are generally seen as empirical models since they do not express mechanistic or physiological processes. Rather, the model explains the observed biological pattern (the dependent variable, e.g. the presence of a species) in a statistical manner, usually from empirical data. Based on ecological and biological concepts we may still deduce finer differences. The model classification says that empirical models optimize precision and reality. From experiments and extensive observations we know that plants require thermal energy, light, water and nutrients for their growth. If one of these factors falls below a minimum, no plant growth is possible. Similar constraints hold for animals, and for both groups of living organisms the rationale for these statements lies in the niche concept: each species has its own evolutionarily developed physiological limitations, and a species can only occur where 1) the fundamental niche requirements are met (fundamental niche), and 2) it can co-exist under competitive interactions with all other species at the same site (realized niche). We can now use this experimental knowledge. The better a variable used in a regression model has been found to physiologically (causally) explain the presence of a species, the more realism is added to the model. On the other hand, we optimize the precision of a model if we only care about predictive power, not biological relevance. Often, the variables that yield the most accurate models are not very relevant biologically. Thus, in empirical modeling we can still put more weight on either R or P.
Mike Austin and co-workers (Austin, 1980, 1985; Austin et al., 1984; Austin and Smith, 1989) have categorized all possible predictor variables, regarding their biological causality for plants and animals, into three classes:
1. Resource variables: used as resources for growth (light, soil water, nutrients, ...)
2. Direct variables: have a direct impact on growth, but not as a resource (temperature, air humidity, pH, ...)
3. Indirect variables: no direct impact, often correlated with 1 & 2 (elevation, slope, aspect, ...)
Later on, these types of variables were also classified as proximal (1) and distal (3), with gradients in between, so that (2) is rather proximal, but slope rather distal.
When developing predictive, empirical models, and especially when selecting predictor variables, we should thus always consider the aim of a study. Do we require the model to be mostly precise or rather biologically plausible? A model used to derive immediate conservation measures may require a different model type than a risk assessment under climate change projections!

From: Guisan & Zimmermann, 2000



Introduction to generalized linear models (GLM)


After a more theoretical introduction to predictive habitat distribution modeling, we now want to learn how we can achieve this using modern quantitative statistical methods. A classical approach for such an analysis is (multiple) linear regression (multiple linear regression by least squares, ordinary least squares regression = OLS, or general linear models = LM). Yet, this approach is rarely useful in landscape, vegetation and species modeling, since several of the basic requirements of this classical regression approach are violated. As dependent variable, we often use the presence, cover or abundance of a species or habitat within a plot or within a landscape (patch). Such dependent variables cannot be used in LM modeling. Presence can only be expressed as 1 (present) or 0 (absent); no intermediate values, and no values above 1 or below 0, are possible. Similar constraints exist for cover, except that here intermediate values between 0 and 100% are possible. Finally, counts of species or habitats (abundance) can take large numbers, but only positive values are possible. In all cases there is a lower and often an upper limit to the possible data range. This causes problems because in LM the residuals are required to be normally distributed along the whole data range (mean of 0 and constant variance along the data range of the dependent variable). This is usually not achievable with data such as presented here. Competition and spatial constraints in a landscape usually generate skewed distributions for abundance and cover.
A series of comparably new statistical methods has been developed that solve some of these problems (or at least
avoid some of the limitations elegantly). The first technique we introduce here is called Generalized Linear
Models (GLM), which represents an extension (generalization) of the classical linear regression method. In many
cases, transformations are used in order to reach or approximate Normal distribution for the dependent variable.
While this is rather a trick in LMs in order to make the dependent variable fit the statistical requirements, it is a
basic element of GLMs. In both cases, LM and GLM, it may additionally be advantageous to transform the independent variables. In a GLM, the dependent variable is transformed automatically based on a so-called link function, which describes mathematically how the dependent variable is transformed relative to its mean value [μ] and how it is linked to the linear predictor [η(x), e.g. Xβ or f(x)]: the set of predictor variables used to calibrate the regression model. The link function (or model family) is selected specifically depending on the characteristics of the dependent variable and its statistical distribution. GLM models are not optimized statistically using least squares, but rather using the maximum likelihood approach. The results can be read like classical LM model results in the form:
Y = a + b1*X1 + b2*X1² + b3*X2 + b4*X2² + e
where the values a, b1, b2, etc. represent the fitted regression parameters, Y is the dependent variable, and X1, X2 are independent variables (e.g. precipitation). These regression parameters can then be tested for significance in a next step.
GLM models are based on so-called parametric functions (which means the shape of the response curve can be described by a parametric function), and they can be fitted for a range of distributions (Binomial, Poisson, Gaussian, Bernoulli, Gamma, etc.). This parametric shape fitting is in contrast to non-parametric types of regression, where the same distributions can be used, but the models do not result in parametric parameter estimates; rather, they yield a) local smoothing functions along gradients of predictor variables (e.g. GAM) or b) non-parametric classification trees (CT).
It is important to notice that in GLMs (and GAMs) the model is fitted to the transformed dependent variable, and the parameters therefore describe the dependent variable in the transformed data space. In order to bring the results back to the original data space, we have to back-transform the model output based on the link function. Therefore, we apply an inverse of the link transformation (the so-called inverse link). Here are a few important link and inverse-link functions (with the logit link being one of the most often used):
Model family   Link name   Link function              Inverse-link function
Gaussian       Identity    η(x) = μ                   μ = η(x)
Binomial       Logit       η(x) = log(μ / (1 - μ))    μ = exp(η(x)) / (1 + exp(η(x)))
Poisson        Log         η(x) = log(μ)              μ = exp(η(x))
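These link and inverse-link pairs can be written out explicitly. Below is a minimal sketch in plain Python (rather than the R/Splus used in these notes); the function names are hypothetical and serve only to make the back-transformation concrete:

```python
import math

# Logit link: maps a probability mu in (0, 1) onto the whole real axis
def logit(mu):
    return math.log(mu / (1.0 - mu))

# Inverse logit: maps the linear predictor eta back to a probability
def inv_logit(eta):
    return math.exp(eta) / (1.0 + math.exp(eta))

# Log link (Poisson family) and its inverse
def log_link(mu):
    return math.log(mu)

def inv_log(eta):
    return math.exp(eta)
```

Applying a link and then its inverse recovers the original value, e.g. inv_logit(logit(0.3)) returns 0.3 again; this round trip is exactly what we exploit when back-transforming GLM predictions.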

Usually, we do not have to care about the link function when using GLMs in a statistical package. However, we
have to invert the model output ourselves when we are to test the model output against test data. Here is an
example of how this may work:
Example GLM
Assume we have studied the presence and absence of a target species along a precipitation gradient based on a
random stratified sampling starting from a coast with high precipitation towards rather dry conditions in the
interior. We denote the presence of our species as (pres=1) and the absence as (pres=0) and we note
precipitation values each time. We may find the following distribution in our data set:


Data example:
Pres  Precip     Pres  Precip     Pres  Precip
0     1.0        0     4.1        1     6.3
0     1.4        1     4.4        1     6.4
0     1.5        1     4.7        1     6.8
0     1.7        1     4.8        0     6.9
0     2.0        1     4.9        1     7.0
0     2.1        1     5.0        0     7.4
0     2.8        1     5.1        0     7.9
1     3.0        1     5.3        1/0   8.5 (outlier)
0     3.2        1     5.4        1/0   8.6 (outlier)
0     3.3        1     5.5        0     8.7
1     3.5        1     6.0        0     8.8
1     3.8        1     6.1        0     9.0
1     4.0        0     6.1

Figure: Presence and absence of a target species along a precipitation gradient (no outliers).
It seems obvious that we cannot calculate a simple linear regression in order to predict and simulate the presence of our target species along the precipitation gradient. Neither a linear nor a polynomial model fit can explain the distribution of our presence/absence points along the observed gradient. It is important to notice that below precip ≈ 3 and above precip ≈ 7 the species is absent (pres=0). This represents a binomial type of distribution; thus logistic regression solves our problem best. The results for our artificial data set are as follows (using the statistical package R):
RESULT: LOGIT-REGRESSION WITHOUT OUTLIERS
glm(formula = pres == 1 ~ prec + I(prec^2), family = binomial)
Coefficients:
(Intercept)      prec          I(prec^2)
-19.44730691     8.633968185   -0.8382754773
Degrees of Freedom: 38 Total; 35 Residual
Residual Deviance: 22.45022175 on 35 degrees of freedom
Null Deviance:     52.67918572 on 37 degrees of freedom
(Note that in an R formula the quadratic term must be protected as I(prec^2), consistent with the coefficient label.)

In order to simulate the presence of our target species as a function of precipitation (i.e. as a prediction), we can now apply the regression parameters in a first step, and in a second step we back-transform the model output using the inverse link:
Y = -19.45 + 8.634*prec - 0.838275*prec²        (regression function)
Pres = exp(Y) / (1 + exp(Y))                    (inverse link)


As a result we now see a response function that we take as probability of pres=1 (see figure below):

Figure: Observed presence and absence along a hypothetical precipitation gradient. The line graph represents the
regression model result (i.e. the simulated probability of occurrence).
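The two-step prediction (regression function in logit space, then inverse link) can be sketched in plain Python, using the rounded coefficients given above; the function names are hypothetical:

```python
import math

# Regression function in link (logit) space, using the rounded
# coefficients of the GLM fitted without outliers
def eta(prec):
    return -19.45 + 8.634 * prec - 0.838275 * prec**2

# Inverse link: back-transform the linear predictor to a
# probability of presence
def prob_presence(prec):
    y = eta(prec)
    return math.exp(y) / (1.0 + math.exp(y))
```

For example, prob_presence(5.0) is about 0.94, so the species is predicted present at intermediate precipitation, while prob_presence(1.0) is practically zero, matching the fitted curve in the figure.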

If we are now to decide whether our target species occurs or not, then we can simply classify the probability of
occurrence. We would then e.g. decide that the species may occur where Pres>50%, with predicted absences
otherwise. There are, of course, other possibilities to classify into presences and absences, which we do not
discuss here.

GLM, GAM and CT: statistical tools for predictive habitat distribution modeling
After having introduced GLMs as an alternative to OLS (or LM) modeling, we now present generalized linear
models (GLM), generalized additive models (GAM) and classification trees (CT) in more detail and we discuss
some characteristics of these models.
GLM:
The basic characteristics of GLMs have been discussed in the previous chapter. They are not repeated here.
When should we now use GLM models and when should we use GAM models (see below)? It is important to understand that GLMs are bound to a limited but clear set of parametric shapes. With a logit link, linear terms in the predictor variables translate into sigmoidal response functions, while quadratic terms translate into a unimodal response (see figure above). When a predictor is entered as a fourth-order polynomial, the response can take a bimodal shape. The shapes are generally symmetric. Thus, if we are confident that the true response shape behind the data has one of these parametric forms, then we can easily use GLMs. Otherwise, we may prefer GAM or CT. However, we will see that the latter two are much more sensitive to data gaps, and that with data from limited sampling we may prefer the comparably robust GLM models.
How can we now judge the quality of our model? Basically, we should always test a model against independent test data that were not used in the model building. However, this is not always possible, and often we may want a quick check of how strong the calibration actually is. There is a simple measure that allows us to check the quality of the model from the output information. In LM models, the R2 value expresses the calibration strength of the model: it tells us what proportion of the overall data variance is explained by the model function. In our example we have 19 observed presences among 38 points. The simplest model is to predict the response as the mean of the dependent variable (= 0.5 in our case, saying that our best guess is that the species can be found in 50% of all plots). This simplest of all possible models is termed the null model, and the variation in the data set around the mean of the dependent variable is termed the null model variation (variance in LM and deviance in GLM). Our model is now (hopefully) more accurate, so that we make better guesses on whether we find a species or not. The classical R2 of a model expresses what fraction of the null model variation is explained after fitting the regression model. A value of 100% would say that we have explained the whole variation, with no unexplained variation left (sum of the squared residuals = 0). Usually we obtain values between 0 and 100%, and often we are already quite happy if we get values >30% in landscape ecological studies (due to the usually large stochastic component in our data).
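The null deviance reported by R can even be verified by hand: with 19 presences among 38 plots, the null model predicts p = 0.5 everywhere, and each observation then contributes -2·log(0.5) to the deviance. A small sketch in plain Python (the function name is hypothetical):

```python
import math

# Binomial deviance of predicting a constant probability p for
# observations y (1 = presence, 0 = absence)
def binomial_deviance(y, p):
    ll = sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)
    return -2.0 * ll

# Null model for our example: 19 presences among 38 plots, p = 0.5
y = [1] * 19 + [0] * 19
null_dev = binomial_deviance(y, 0.5)
```

Here null_dev evaluates to about 52.679, matching the null deviance reported in the R output above (52.67918572).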


Thus, starting from the null variance (LM) or null deviance (GLM/GAM), which circumscribes the total deviation from the overall mean of the dependent variable, we can calculate the variance or deviance remaining after fitting the model; this output tells us how much variation remains unexplained after fitting the model. In our example we obtain deviances, since we have calibrated a GLM. The deviance-based measure of fit is abbreviated D2; it is often also called pseudo-R2, which highlights the similarity with R2. In our case, D2 is:
D2 = (Null Deviance - Residual Deviance) / Null Deviance = 1 - (Residual Deviance / Null Deviance)
or, in our example:
D2 = (52.68 - 22.45) / 52.68 = 0.57
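The D2 formula translates directly into code; a minimal sketch in plain Python (the function name is hypothetical):

```python
# Proportion of the null deviance explained by the fitted GLM
# (the pseudo-R2, or D2)
def d_squared(null_deviance, residual_deviance):
    return 1.0 - residual_deviance / null_deviance

# Deviances from the GLM fitted without outliers
d2 = d_squared(52.68, 22.45)
```

Here d2 evaluates to about 0.57, as computed above.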

This value expresses the fact that 57% of the total variation of the null model has been explained by our simple precipitation model, and only 43% remain unexplained. This is a comparably high model fit. How can we now further improve this model? And if we had found a better model, what could be the reasons for still not reaching a perfect model?
To highlight possible reasons, we first want to learn more about the influence of extreme values, or outliers. In the data example listed above we have switched two absence points into presences at high precipitation values (listed as outliers). Outliers are data points whose values in the dependent variable are very unexpected, usually resulting in high deviations from a fitted model (i.e. high residuals). In the literature example shown below from
Jongman et al. (1987), a single outlier was added and the same model was fitted again. The outlier says that presence was observed under conditions where usually only absences were recorded. The resulting regression response shows marked differences. The maximum fitted probability (pmax) is reduced, and the realized niche shape of the model is much broader (see the standard deviation t). Also, the optimum (mean) of the response moves towards the outlier (to the right).
Figure: GLM regression using a data set without (A) and with (B) outliers (from Jongman et al., 1987).

We realize that outliers need to be checked carefully for accuracy. It is possible that they represent the truth, meaning that we do not yet understand the presence of the species based on our explanation (model). They may point us to additional predictor variables that we have not yet incorporated into our model. Yet, they may also represent data errors, which can have a huge effect on models and thus on the inference we draw from our data. Checking for errors in the data is thus an extremely important step in the scientific workflow.


We now return to our earlier data example and perform the same calculations again after adding our two outliers.
The figure below illustrates this gradient and the possible distribution of data points along this precipitation
gradient. The gray dots represent the switched outliers. They were white (=absence) in the first example, and
now they are presences (black dots) in the example below:

Figure: Precipitation gradient from low (red) to high (blue) rainfall. Observations of species presences are black and
absences are white. The grey dots represent absences except where outliers were used (here they were converted to
presences). X represents a precipitation value of ca. 7.8.

Obviously, the target species prefers intermediate values of rainfall (or humidity). The two outliers are now included as presences (blue zone), and if we calibrate the same model as in the introductory example, the model output is as follows:
RESULT: LOGIT-REGRESSION INCLUDING OUTLIERS
glm(formula = pres == 1 ~ prec + I(prec^2), family = binomial)
Coefficients:
(Intercept)      prec          I(prec^2)
-9.620438512     4.090510577   -0.3616529729
Degrees of Freedom: 38 Total; 35 Residual
Residual Deviance: 33.57542446 on 35 degrees of freedom
Null Deviance:     52.25735206 on 37 degrees of freedom

Below (on next page) the two model examples are illustrated in two different probability maps. Additionally, a
difference map is shown in order to highlight the spatial effects of the two outliers on model performance.


Figure: Probability maps of the model predictions without and with outliers present, and a difference map highlighting the spatial effects of the two outliers.


GAM:
If we have irregularities in the data, as we have seen in the example with outliers, and if we have carefully checked that these irregularities are not errors but true data behavior, then we might want to calibrate and apply more flexible simulation models. GLMs may not be able to capture the observed effects, since they force the response to follow a fixed (parametric) shape. Obviously, our species behaves rather irregularly towards high precipitation. Fitting a more flexible model is possible with so-called Generalized Additive Models (GAM). Here, we do not use parametric (predefined) shapes, but rather let the data determine the shape by applying local smoothing functions along the gradients of the predictor variables. As smoothers we usually apply loess (locally weighted estimators) or spline functions. Within a predefined window (we decide the size of the window), the smoother is applied in order to estimate the local ratio between presences and absences found within the window. With small window sizes the curve becomes very flexible; with a global window (all data used in one window) the model is very similar to a GLM. We can apply the same model families as for GLMs, which makes it easy to compare models between the two regression model types.
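The window idea can be illustrated with a deliberately crude local smoother: a plain moving-window average of the 0/1 response. This is a strong simplification of loess and splines (which additionally weight observations by distance and fit local polynomials); the data and function name below are hypothetical:

```python
# Crude local smoother: for each target value x0, average the 0/1
# response of all observations within a fixed window half-width.
def window_smooth(xs, ys, x0, half_width):
    inside = [y for x, y in zip(xs, ys) if abs(x - x0) <= half_width]
    return sum(inside) / len(inside) if inside else None

# Toy data: absent at low and very high precipitation, present in between
prec = [1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 5.0, 5.5, 8.5, 9.0]
pres = [0,   0,   0,   1,   1,   1,   1,   1,   0,   0]

window_smooth(prec, pres, 4.5, 1.0)   # -> 1.0 (all neighbours are presences)
window_smooth(prec, pres, 1.5, 1.0)   # -> 0.0 (all neighbours are absences)
```

Shrinking half_width makes the fitted curve more flexible, while a half_width spanning the whole gradient collapses the estimate towards the overall mean, mirroring the window-size trade-off described above.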
Below, we illustrate the two most commonly used smoothers in a GAM with window sizes of approx. 1/3 of the
data range applied to the data set with outliers present.

Figure: All models are calibrated with outliers present. The dotted regular line represents the GLM; the red hatched line is the SPLINE fit, the black hatched line is the LOESS fit.
It is obvious how both smoothed functions react to the outliers. Both models are able to fit higher maximum probabilities, and towards the outliers the shape of the response clearly reacts. The reason they do not react even more strongly lies in the fact that we used a large window size. The model results for the SPLINE fit are as follows:
RESULT: GAM-REGRESSION WITH LOGIT LINK
gam(formula = pres == 1 ~ s(prec), family = binomial, data = ex1b)
Degrees of Freedom: 38 Total; 35 Residual
Residual Deviance: 30.18914259 on 35 degrees of freedom
Null Deviance:     52.25735206 on 37 degrees of freedom
D2 = 1 - 30.189/52.257 = 0.42

The model is better than the GLM version. At the same null deviance (both fitted with outliers), the residual deviance is reduced to 30.189 (31.06 with the LOESS version and the selected window size). In Splus and R, two very powerful statistics packages (with R having the advantage of being free of charge under the GNU public license), we can now predict with this model to a new data set. This is very straightforward thanks to the predict.gam() function; without it, it would be quite difficult to obtain predictions from the model calibrations, since a GAM does not return a parametric function as output (a predict method is available for virtually all statistical regression models in R).
No higher polynomials are necessary in GAM. The smoother and the moving window size determine the flexibility of the response shape (while in GLMs, higher polynomials are necessary to obtain a similar effect). It is even possible to mix GLM and GAM terms in Splus and R. Adding a smoother to a variable (term) applies the GAM fit; if a variable is entered without a smoothing function, a regular GLM fit is applied to that variable.
Comparison between GLM & GAM
GAM models are very flexible, with easy ways of adjusting the smoothing functions. The moving window size can be adjusted flexibly, and we can switch between different smoothers. On the other hand, no interaction terms (between multiple predictor variables) can be added (hence the name additive: variables can only be added, not interrelated). The advantage of GAM is that it represents the true response, as seen in the data, more flexibly and thus often allows calibrating more accurate models.
In GLM models, the shape of the response is drawn less flexibly. Yet, it has been shown in several case studies that this does not necessarily result in less accurate models. The question is often how independent the data are that are used to test the models. If the calibration data are also used for testing, then GAM models are always more accurate; yet using the same data set usually results in model overfitting. If independent data are used, then sometimes GLMs are better, because they are more robust. But if the data are very carefully sampled and the test data set is of similar quality, then it is not easy to forecast which method will be better.
Nowadays many newer methods exist that are partly even better than GLM and GAM (see Elith et al. 2006, Ecography). However, in a case study using Swiss data (Guisan, Zimmermann et al. 2007, Ecological Monographs), we saw that oftentimes the best models are not really significantly better than GAM and GLM. We therefore suggest using several approaches and finding the best solution with your own data.
CT and CART:
Classification and regression trees (CART) are an alternative to the two model types presented so far. Here, an iterative optimization algorithm seeks an optimal dichotomous decision key for explaining a dependent variable from a set of independent predictors. The method is closely related to regression, but as with GAM we do not get regression parameters; rather, we obtain a tree. Using the same hypothetical data set with outliers as above, we obtain the following classification tree:

This tree is the visualization of a textual tree output that includes a number of statistical measures. We read the tree as follows: if we know the condition of a site (e.g. precip = 4.5), then we enter the tree at the top and drop the value through the tree until it reaches a terminal node. At the top we find that precip is larger than 2.9, thus we go to the right, and so on. We finally arrive at the terminal decision that the value is smaller than 6.05, and going to the left there means that we expect to find the species (Presence = TRUE).
In addition to the tree, we get the dichotomous key with details on the characteristics at each node. Terminal nodes are indicated with an asterisk (*).

node), split, n, deviance, yval, (yprob)


* denotes terminal node
1) root 38 52.257350 TRUE ( 0.44736840 0.5526316 )
2) prec<2.9 7 0.000000 FALSE ( 1.00000000 0.0000000 ) *
3) prec>2.9 31 38.985560 TRUE ( 0.32258060 0.6774194 )
6) prec<8.65 28 31.490770 TRUE ( 0.25000000 0.7500000 )
12) prec<3.4 3 3.819085 FALSE ( 0.66666670 0.3333333 )
24) prec<3.1 1 0.000000 TRUE ( 0.00000000 1.0000000 ) *
25) prec>3.1 2 0.000000 FALSE ( 1.00000000 0.0000000 ) *
13) prec>3.4 25 25.020120 TRUE ( 0.20000000 0.8000000 )
26) prec<6.85 19 12.786840 TRUE ( 0.10526320 0.8947368 )
52) prec<4.25 4 4.498681 TRUE ( 0.25000000 0.7500000 )
104) prec<4.05 3 0.000000 TRUE ( 0.00000000 1.0000000 ) *
105) prec>4.05 1 0.000000 FALSE ( 1.00000000 0.0000000 ) *
53) prec>4.25 15 7.347901 TRUE ( 0.06666667 0.9333333 )
106) prec<6.05 10 0.000000 TRUE ( 0.00000000 1.0000000 ) *
107) prec>6.05 5 5.004024 TRUE ( 0.20000000 0.8000000 )
214) prec<6.2 2 2.772589 FALSE ( 0.50000000 0.5000000 ) *
215) prec>6.2 3 0.000000 TRUE ( 0.00000000 1.0000000 ) *
27) prec>6.85 6 8.317766 FALSE ( 0.50000000 0.5000000 )
54) prec<8.2 4 4.498681 FALSE ( 0.75000000 0.2500000 )
108) prec<7.2 2 2.772589 FALSE ( 0.50000000 0.5000000 )
216) prec<6.95 1 0.000000 FALSE ( 1.00000000 0.0000000 ) *
217) prec>6.95 1 0.000000 TRUE ( 0.00000000 1.0000000 ) *
109) prec>7.2 2 0.000000 FALSE ( 1.00000000 0.0000000 ) *
55) prec>8.2 2 0.000000 TRUE ( 0.00000000 1.0000000 ) *
7) prec>8.65 3 0.000000 FALSE ( 1.00000000 0.0000000 ) *

A CART model attempts to isolate regions of high purity along the precipitation gradient, containing observations of mostly only presence (TRUE) or absence (FALSE). For each node, the dichotomous key tells us how well this was achieved (deviance) and how many observations remain at the node (n). Yval gives the decision according to the balance of TRUEs and FALSEs, and the yprob values indicate with what probability we can expect the values TRUE and FALSE. CART models can also be calibrated for responses with more than two levels (not just TRUE and FALSE). Thus, it is a wonderful and quick method to explain the presence of a range of habitat types from predictor variables, and it is often used in vegetation mapping projects. In our model, there were no presences available below a precipitation value of 2.9; the model is thus perfect at this node. We can now ask for a similar overview of the calibration strength as we got for GLM and GAM:
Classification tree:
tree(formula = factor(pres == 1) ~ prec, data = ex1b, minsize = 1)
Number of terminal nodes: 13
Residual mean deviance: 0.1109035 = 2.772589 / 25
Misclassification error rate: 0.02631579 = 1 / 38

The model seems to be close to perfect. Only one out of 38 cases is wrongly predicted. Does this now mean that CART is more powerful than GLM and GAM? If we want a perfect model, we almost seem to be there, right? Yet, model fit can only be judged from independent data. CART models are extremely sensitive to overfitting. We can virtually search for additional predictors until each data point is split off and explained in an enormous tree. But what do we gain then? Models only make sense if we can use simple explanations for the variability observed in the field. Therefore our goal must be to simplify these overfitted models. This can be done by cross-validation. Here, we split the whole data set into k classes (k-fold cross-validation), and we re-fit the tree k times. Each time we leave out one kth of the data and check how well the model behaves at each node (by dropping the left-out kth of the data through the model and checking for errors). Once all points have been used iteratively k-1 times for fitting and once for testing, we summarize. In general we find that in such a cross-validation the last nodes are actually less accurate than if we had stopped at higher nodes. The cross-validation then allows us to visualize this graphically and to prune off the finest nodes that make the model less accurate on independent data.
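The k-fold splitting scheme described above can be sketched as follows (plain Python; the helper is hypothetical and, for brevity, assigns folds deterministically rather than at random):

```python
# Split n observation indices into k folds; each fold serves once as
# the test set while the remaining k-1 folds are used for fitting.
def kfold_splits(n, k):
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]   # deterministic assignment
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for j in indices if j not in test]
        splits.append((train, test))
    return splits

# 10-fold cross-validation for our 38 observations
splits = kfold_splits(38, 10)
```

Every observation ends up in exactly one test fold and in nine of the ten training sets, which is precisely the "used k-1 times for fitting and once for testing" scheme described in the text.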

F. Kienast, J. Bolliger, N.E. Zimmermann

11


In our example we applied a 10-fold cross-validation, and it seems that 5 nodes are optimal (figure): at higher numbers of nodes, the deviance of the model increases again. We can now prune off the other nodes. The pruned model is slightly worse (misclassification of 3 out of 38), but it is much more parsimonious, and it has been tested partly independently. As with GLMs and GAMs, we can easily add additional variables.
Classification tree:
snip.tree(tree = ex1b.tre1, nodes = c(54, 26))
Number of terminal nodes: 7
Residual mean deviance: 0.5575974 = 17.28552 / 31
Misclassification error rate: 0.07894737 = 3 / 38

The pruned tree now has the following characteristics:


node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 38 52.257350 TRUE ( 0.4473684 0.5526316 )
   2) prec<2.9 7 0.000000 FALSE ( 1.0000000 0.0000000 ) *
   3) prec>2.9 31 38.985560 TRUE ( 0.3225806 0.6774194 )
     6) prec<8.65 28 31.490770 TRUE ( 0.2500000 0.7500000 )
      12) prec<3.4 3 3.819085 FALSE ( 0.6666667 0.3333333 )
        24) prec<3.1 1 0.000000 TRUE ( 0.0000000 1.0000000 ) *
        25) prec>3.1 2 0.000000 FALSE ( 1.0000000 0.0000000 ) *
      13) prec>3.4 25 25.020120 TRUE ( 0.2000000 0.8000000 )
        26) prec<6.85 19 12.786840 TRUE ( 0.1052632 0.8947368 ) *
        27) prec>6.85 6 8.317766 FALSE ( 0.5000000 0.5000000 )
          54) prec<8.2 4 4.498681 FALSE ( 0.7500000 0.2500000 ) *
          55) prec>8.2 2 0.000000 TRUE ( 0.0000000 1.0000000 ) *
     7) prec>8.65 3 0.000000 FALSE ( 1.0000000 0.0000000 ) *

CART models are very flexible. Like GAMs, they do not require the dependent variable to follow a particular statistical distribution. Yet CART is often slightly, sometimes substantially, less accurate than GLMs and GAMs: it knows only black and white, no shades of gray, and larger trees are often not easy to interpret ecologically. Still, it gives a quick overview of which predictors seem to make sense, and CART is therefore often used for data screening.


Testing predictive habitat distribution models


So far we have used two measures to evaluate models, namely D2 and the misclassification error rate. Both measures were obtained directly from the model calibration outputs. This is not an appropriate way of testing models; rather, it allowed us to evaluate the calibration strength. Testing or evaluating model accuracy ideally requires independent test data. Where no independent test data set is available, two alternatives can be used: a) a split-sample test (e.g. using 2/3 of the data for model fitting and 1/3 for testing), or b) k-fold cross-validation (as we saw in the CART example). Using values of k > 10 is not a very strong test, though; usually we use values between 3 and 10.
Confusion Matrix:
The confusion matrix is an often-used means to test predictions against observations. Many different measures based on the confusion matrix are available; here we present a few. They all have advantages and disadvantages. In its simplest form, the matrix looks as follows:
                          Observed Data
                          Presence              Absence
  Simulated   Presence    a [1]                 b [2]
  Data                    correct               incorrect
                          (true positive)       (false positive)
              Absence     c [3]                 d [4]
                          incorrect             correct
                          (false negative)      (true negative)

Box: A general confusion matrix with simulated vs. observed presences (P) and absences (A).

Based on the four fields (a, b, c and d) and the sum of these fields (N) we can now formulate a number of test measures in order to judge model accuracy. The measure Correct classification rate is often used, and the same holds for Kappa. Here both are applied to a 2 x 2 confusion matrix, but both measures can in principle also be calculated on an n x n matrix (e.g. when we generate models of habitat types that are mutually exclusive); the Kappa formula then needs to be adjusted.

Table: Test measures based on a 2 x 2 confusion matrix; N is the sum a+b+c+d

  Test Measure                                     Formula        Abbreviation
  Correct classification rate                      (a+d)/N        CCR
  Misclassification rate                           (b+c)/N        MCR
  False positive rate (error type I; commission)   b/(b+d)        fpos
  False negative rate (error type II; omission)    c/(a+c)        fneg
  Total misclassification                          fneg+fpos      TMC
  Odds ratio                                       (a*d)/(c*b)    OR
  Kappa statistic                                  see below      K

  K = [(a+d) - (((a+c)*(a+b) + (b+d)*(c+d))/N)] / [N - (((a+c)*(a+b) + (b+d)*(c+d))/N)]
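These test measures can be computed directly from the four fields of the matrix. A minimal sketch in Python (the helper function and the counts a, b, c, d below are invented for illustration):

```python
def confusion_measures(a, b, c, d):
    """Test measures from a 2 x 2 confusion matrix:
    a = true positives, b = false positives,
    c = false negatives, d = true negatives."""
    n = a + b + c + d
    fpos = b / (b + d)                     # error type I (commission)
    fneg = c / (a + c)                     # error type II (omission)
    chance = ((a + c) * (a + b) + (b + d) * (c + d)) / n
    return {
        "CCR": (a + d) / n,
        "MCR": (b + c) / n,
        "fpos": fpos,
        "fneg": fneg,
        "TMC": fneg + fpos,
        "OR": (a * d) / (c * b),
        "K": ((a + d) - chance) / (n - chance),
    }

m = confusion_measures(a=15, b=5, c=4, d=16)  # invented counts
print(m["CCR"], m["K"])
```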

Kappa takes values from 1 (perfect model) to 0 (random agreement between model and reality). It measures to what degree the model is better than a chance prediction based on prevalence. If Kappa (K) takes negative values, the predictions are systematically wrong.
When evaluating GLM or GAM models we require an additional step. While CART models give a binary
response by default, GLM and GAM models produce probabilities for a species or habitat to occur. We thus
need a classification step to compare the model output with the original field data. We can do so by applying a

fixed threshold (say 0.5). Often, though, other thresholds yield significantly better results. A rule of thumb says that the best threshold is usually around the prevalence of the calibration data. Prevalence is the proportion of presences among all observations (which is ca. 0.5 in our hypothetical example, and slightly higher in the example with outliers), i.e. the sampling mean if we code presence as 1 and absence as 0. This means, however, that we best optimize the threshold by sequentially cutting the probabilities at thresholds ranging from 0 to 1 and calculating the test statistics each time. Once we have found the best split, we can apply it to the probability map, together with the test statistics associated with this split.
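This sequential threshold search can be sketched as follows (a hypothetical helper with invented data; here the correct classification rate is optimized, but any of the test measures could be used instead):

```python
def scan_thresholds(probs, obs, step=0.05):
    """Classify probabilities at each cut level from 0 to 1 and
    report the threshold with the highest correct classification
    rate (CCR)."""
    best_t, best_ccr = None, -1.0
    t = 0.0
    while t <= 1.0:
        pred = [1 if p >= t else 0 for p in probs]
        ccr = sum(p == o for p, o in zip(pred, obs)) / len(obs)
        if ccr > best_ccr:
            best_t, best_ccr = t, ccr
        t = round(t + step, 10)  # avoid floating-point drift
    return best_t, best_ccr

# invented example: presences tend to have higher probabilities
probs = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
obs   = [0,   0,   1,   0,   1,   1]
print(scan_thresholds(probs, obs))
```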
In general, best results are obtained where type I and type II errors are balanced. Yet there are cases where one of the two errors is of higher interest. If we are to design a reserve for a species X, then false negatives are very costly compared to false positives, since we want to make sure that all known presence sites are included in the reserve in order to maximize the conservation effect. In other words, if we include wrong presences (false positives, which are actually absences), this is not problematic; it becomes problematic, though, if we restrict the reserve too much by excluding sites that are wrong absences (false negatives, actually presences). Including all possible sites would bring fneg to a value of 0, but this would not be efficient either. Usually, we would then set a threshold for the fneg error we would maximally tolerate when designing a reserve (e.g. 5%).
AUC: a threshold-independent measure
Above, we illustrated why it makes sense to evaluate all possible cut-level thresholds when classifying probabilities into presences and absences. An additional measure, AUC (area under the ROC curve, where ROC stands for receiver operating characteristic), performs the same evaluation of fneg and fpos at every possible cut threshold, translates fneg into the true positive rate (1 - fneg), and plots the two values against each other for each threshold. A model that is basically no better than a random prediction results in a curve that closely follows the 1:1 line; the area under this curve is then ca. 0.5. A perfect model has an area under the ROC curve of 1.0. Values below 0.5 are obtained for systematically wrong predictions.
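AUC can equivalently be computed from the pairwise rank definition: the probability that a randomly drawn presence receives a higher predicted value than a randomly drawn absence, with ties counting one half. A minimal sketch with invented data:

```python
def auc(probs, obs):
    """Area under the ROC curve via the pairwise rank definition:
    the fraction of presence/absence pairs in which the presence
    receives the higher predicted value (ties count one half)."""
    pos = [p for p, o in zip(probs, obs) if o == 1]
    neg = [p for p, o in zip(probs, obs) if o == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # → 0.75
```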


Exercises to: Predictive Habitat Distribution Modelling


Questions A
A.1) Is precipitation a resource, direct or indirect variable?
A.2) Why do we find the treeline in Switzerland at ca. 1800 m in the Northern Alps and at ca. 2350 m in Zermatt?
A.3) Define a modeling goal that may require: (i) resource or direct variables, and (ii) indirect variables.
Answers:

Questions B
B.1) How can a model be further improved after having calibrated a predictive variable?
B.2) Why is it difficult to find a perfect model?
Answers:

Questions C
C.1) What is the D2 for the new model? Is it better or worse than the previous example?
C.2) What probability can be expected for the point X in the map?
C.3) What are the effects of extreme values?
The following figure illustrates the two model outputs graphically.

Answers:


Questions D
D.1) What are the values of CCR (correct classification rate), fpos, fneg, and TMC given cut levels of 0.3, 0.5
or 0.7? Assume that the data set below comes from an independently sampled test data set:

P/A   Prob
0     0.00
0     0.01
0     0.01
0     0.02
0     0.05
0     0.07
0     0.27
1     0.35
0     0.44
0     0.48
1     0.57
1     0.67
1     0.72
0     0.74
1     0.80
1     0.83
1     0.84
1     0.85
1     0.86
1     0.86
1     0.87
1     0.87
1     0.87
1     0.87
1     0.87
0     0.87
1     0.86
1     0.85
1     0.81
0     0.80
1     0.78
0     0.70
0     0.53
1     0.27
1     0.23
0     0.20
0     0.16
0     0.11

Answers:
For each cut level, fill in the 2 x 2 confusion matrix (simulated vs. observed presences and absences, fields a, b, c, d) and compute the test measures:

threshold = _____   a: ___  b: ___  c: ___  d: ___   CCR: _______  fpos: _______  fneg: _______  TMC: _____
threshold = _____   a: ___  b: ___  c: ___  d: ___   CCR: _______  fpos: _______  fneg: _______  TMC: _____
threshold = _____   a: ___  b: ___  c: ___  d: ___   CCR: _______  fpos: _______  fneg: _______  TMC: _____
