Professional Documents
Culture Documents
Li 2013
Li 2013
12000
1 REVIEW 1
2 2
3 3
4 4
5 5
6 Applying various algorithms for species distribution modelling 6
7 7
8 8
9 9
10 10
11 Xinhai LI and Yuan WANG 11
12 Key Laboratory of the Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing, China 12
13 13
14 14
15 15
16 16
17 Abstract 17
18 18
Species distribution models have been used extensively in many fields, including climate change biology, land-
19 19
scape ecology and conservation biology. In the past 3 decades, a number of new models have been proposed,
20 20
yet researchers still find it difficult to select appropriate models for data and objectives. In this review, we aim
21 21
to provide insight into the prevailing species distribution models for newcomers in the field of modelling. We
22 22
compared 11 popular models, including regression models (the generalized linear model, the generalized addi-
23 23
tive model, the multivariate adaptive regression splines model and hierarchical modelling), classification mod-
24 24
els (mixture discriminant analysis, the generalized boosting model, and classification and regression tree analy-
25 25
sis) and complex models (artificial neural network, random forest, genetic algorithm for rule set production and
26 26
maximum entropy approaches). Our objectives are: (i) to compare the strengths and weaknesses of the models,
27 27
their characteristics and identify suitable situations for their use (in terms of data type and species–environment
28 28
relationships) and (ii) to provide guidelines for model application, including 3 steps: model selection, model
29 29
formulation and parameter estimation.
30 30
31 Key words: algorithms, machine learning, model formulation, model selection, species distribution models 31
32 32
33 33
34 34
35 35
36 INTRODUCTION on the basis of data (mathematical representations of 36
37 their known distribution in environmental spaces) (Aus- 37
38 Species distribution models (SDMs) are also known tin 2007). They rely on statistical correlations between 38
39 as habitat models, ecological niche models, bioclimat- existing species distributions and environmental vari- 39
40 ic envelopes and resource selection functions (Elith & ables. Levins (1966) points out 3 goals of ecological 40
41 Graham 2009). These models use computer algorithms models: reality, generality and precision. Typically, only 41
42 to predict the distribution of species in geographic space 2 out of the 3 desirable model goals can be attained si- 42
43 multaneously, while the third goal has to be sacrificed. 43
44 This trade-off leads to a distinction of 3 different groups 44
45 of models (Korzukhin et al. 1996). SDMs are static em- 45
46 pirical models (i.e. phenomenological and statistical) 46
Correspondence: Xinhai Li, Key Laboratory of the Zoological
47 rather than mechanistic models (i.e. physiological, fun- 47
48 Systematics and Evolution, Institute of Zoology, Chinese damental and process-based) or general analytical mod- 48
49 Academy of Sciences, 1 Beichen West Road, Beijing 100101, els (i.e. theoretical and mathematical). They relate ob- 49
50 China. served presence or abundance of a species to values of 50
51 Email: lixh@ioz.ac.cn environmental variables at particular sites. In contrast, 51
124 © 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS
Review of species distribution models
mechanistic (or process-based) models assess the bio- mation about how much species might move in future 1
physiological aspects of a species to determine the con- climate conditions, and which climate variables might 2
ditions in which the species can ideally persist, usually impact the future ranges the most. 3
based on observations made in controlled field or labo- Many SDMs are easy to use, but researchers of- 4
ratory studies (Guisan & Zimmermann 2000). ten find it difficult to select the most appropriate mod- 5
The relationship between species and environment is el (or models) for their cases, and there has been sig- 6
complex (Pearson et al. 2006a; Morin & Thuiller 2009; nificant variability in model performance (Araujo et al. 7
Wiens et al. 2009) and overcoming the uncertainty of 2005). Some review papers provide insightful discus- 8
model applications is a great challenge (Patt et al. 2005; 9
sions covering broad aspects of model development and
Pearson et al. 2006a; Dormann et al. 2008; Fuller et al. 10
applications (e.g. Guisan & Zimmermann 2000; Guisan
2008; Cressie et al. 2009; Morin & Thuiller 2009; Con- 11
& Thuiller 2005; Meynard & Quinn 2007). However,
roy et al. 2011). In the past 3 decades, scientists have 12
knowledge is still required regarding which methods are
developed numerous models to estimate the relationship 13
best suited to the available data and the intended appli-
between species and associated environmental variables. 14
cations, because the advice enabling informed choice of
However, sometimes, different models provide diverse 15
methods is currently scattered throughout the published
predictions (Pearson et al. 2006a; Randin et al. 2006). 16
literature and is incomplete (Elith & Graham 2009).
17
Consequently, hybrid or ensemble model frameworks In this paper, we review up-to-date studies of SDM 18
were suggested to make reliable and robust predictions applications, and aim to provide intuitive insight into 19
of the potential distribution of species (e.g. del Barrio the prevailing SDMs for newcomers in the field of eco- 20
et al. 2006; Araujo & New 2007; McRae et al. 2008; logical modelling. We compare 11 popular models, in- 21
Coetzee et al. 2009; Morin & Thuiller 2009; Thuiller cluding classic regression models, advanced classifica- 22
et al. 2009b; Conroy et al. 2011). For example, BIO- tion tree methods, complex machine learning techniques 23
MOD (Thuiller et al. 2009b), a package of software R (R (e.g. neural network, random forest, maximum entropy 24
Development Core Team 2011), can automatically com- [Maxent] and genetic algorithm for rule set production 25
pare 9 SDMs and is able to suggest the best model (from [GARP] approaches) and hierarchical models. Our ob- 26
the 9 models) for any specific species–environment sit- jectives are: (i) to compare the strengths and weaknesses 27
uation. Several studies compare different models to ex- of the models, their characteristics and identify suitable 28
plore SDM characteristics (type of algorithms) and their situations for their use (in regards to data type and spe- 29
suitability (regarding interpretation ability and predic- cies–environment relationships) and (ii) to provide guide- 30
tion power) for the available data and the intended ap- lines for model application, including 3 steps: model se- 31
plication (e.g. Moisen & Frescino 2002; Brotons et al. lection, model formulation and parameter estimation. 32
2004; Segurado & Araujo 2004; Thuiller et al. 2004; 33
Araujo & Guisan 2006; Elith et al. 2006; Pearson et al.
2006b; Tsoar et al. 2007; Elith & Graham 2009; Aertsen MODELS 34
35
et al. 2010, 2011; Kampichler et al. 2010). There are a variety of SDMs available to predict spe- 36
Species distribution models are powerful tools and cies distributions based on the association of species oc- 37
are applied in many aspects of ecology, yet their outputs currences and environment variables. In this paper, we 38
are based on a number of assumptions. For example, in list 11 popular models (Table 1), describe their charac- 39
studies on biological consequences of climate change, teristics and highlight their strengths and shortcomings. 40
SDMs have been used extensively to estimate the cur- Generalized linear models 41
rent distributions and future range shifts of species (e.g. 42
Araujo & New 2007; Thuiller et al. 2008; Li et al. 2010) Generalized linear models (GLMs) are a general- 43
on the basis of the following assumptions: (i) the current ization of general linear models (McCullagh & Nelder 44
occurrences of the species are in their favorite habitats; 1989). GLMs were introduced in the 1970s (Nelder & 45
(ii) the species distributions are determined by the ex- Wedderburn 1972) and formulated in the late 1980s 46
planatory variables (e.g. temperature and precipitation); (McCullagh & Nelder 1989). The common types of 47
and (iii) the association between the species distribution GLMs are linear regression, logistic regression and 48
and the explanatory variables does not change in the fu- Poisson regression. Logistic regression is suitable for 49
ture (no adaptation). SDMs can provide valuable infor- present and absent species occurrence data and Poisson 50
51
© 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS 125
X. Li and Y. Wang
Popularity is the number of references obtained by querying the database of the Web of Knowledge (Thomson Reuters) using the model names as key words for database
2 it link function for logistic regression and the logarithm
3 link function for Poisson regression enable the depen-
4 dant variable (indicating species presence or abundance)
Nelder & Wedderburn 1972
Breiman 2001a
Friedman 1991
Hopfield 1982
9
Wikle 2003
species and environment variables. All model parame-
Reference
10
subject, while publication year is restrained within 2000–2011. ‡The p/a indicates presence and absence data; p indicates presence only data.
ters of GLMs can be clearly interpreted with ecological
11
meanings. Applying GLMs requires careful calibration:
12
users must check the significance of each explanatory
13
variable and remove the non-significant variables (model
14
p/a or abundance
p/a or abundance
15
Species data‡
19
20
tial for better fits to data than GLMs (Hastie & Tibshira-
21 ni 1986). GAMs use data-defined smoothing functions
Popularity†
12203
66155
23
6848
6820
1421
278
569
226
143
646
Low
42 1991).
Mixture discriminant analysis
43
Generalized boosting models
44
Artificial neural networks
45
Hierarchical modeling
126 © 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS
Review of species distribution models
can be thought of a piecewise regression. In MARS, the method for developing a model in a forward stage-wise 1
spline knots are determined automatically. In addition, fashion, at each step adding small modifications in parts 2
first-order interactions between variables can also be of the model space to fit the data better (Friedman et al. 3
specified. MARS are particularly powerful when there 2000). GBMs can be used for regression as well as clas- 4
are large numbers of explanatory variables and low- sification problems, with continuous and/or categorical 5
order interaction effects (Thuiller et al. 2009a). predictors. 6
7
Mixture discriminant analysis Artificial neural networks 8
Mixture discriminant analysis (MDA) is an exten- The concept of artificial neurons was first proposed 9
sion of linear discriminant analysis (Venables & Ripley in 1943 (McCulloch & Pitts 1943). An artificial neural 10
2002). Discriminant analysis is used to predict the cat- network (ANN) is a complex model system, involving 11
egories of the observations (e.g. presence or absence of a network of simple processing elements (artificial neu- 12
a species), based on the combination of the explanatory rons) that can exhibit complex global behavior (e.g. se- 13
variables. Linear discriminant analysis is closely related lect a site as habitat based on numerous environmental 14
to general linear model, which also attempts to express variables), determined by the links between the neurons 15
1 dependent variable as a linear combination of explan- and associated functions (Hopfield 1982). The key fea- 16
atory variables. Linear discriminant analysis was devel- ture of an ANN is that it contains a hidden layer. Each 17
oped in the 1930s (Fisher 1936). Advanced discriminant neuron in the hidden layer receives information from 18
analysis has been developed for nonlinear, nonpara- 19
each input, sums the inputs, adds a constant (the bias),
metric situations (Hastie et al. 1994; Hastie & Tibshira- 20
then transforms the result using a fixed function. ANNs
ni 1996). MDA is a method for classification based on 21
can operate like multiple regressions when the outputs
mixture models. It assumes that the distribution of the 22
are continuous variables, or like classifications when the
class of each environmental variable follows a normal 23
outputs are categorical. The accuracy of ANNs is main-
distribution. The mixture of normals is used to obtain a 24
ly controlled by the amount of weight decay of the links
density estimation for each class (e.g. species presence 25
and the number of hidden neurons. All model parame-
or absence). MDA can control the within-class spread of 26
ters can be optimized by new observations. As a whole,
the subclass centers relative to the between-class spread, 27
ANNs are nonlinear models but with so many parame-
by forming multiple normal distributions within 1 class. 28
ters they are extremely flexible; flexible enough to ap-
29
Generalized boosting models proximate any smooth function (Thuiller et al. 2009b).
30
ANNs usually require longer time for computers to pro-
Boosting methods are designed to fit many simple 31
cess than other parallel models.
models whose predictions are then combined to give 32
more robust estimates of the relationship between spe- Classification and regression tree 33
cies distribution and a set of environmental variables, 34
Classification and regression tree (CART) analy- 35
whereas GLMs seek to fit the single model that best ex-
sis was designed in the 1980s (Breiman et al. 1984). It 36
plains the relationship between species and environment
uses recursive partitioning to split the data into increas- 37
(Friedman et al. 2000; Friedman 2001). Generalized
ingly smaller, homogenous subsets until a termination 38
boosting models (GBM) have 2 algorithms: the boost-
ing algorithm iteratively uses the regression-tree algo- is reached (Venables & Ripley 2002). In a CART, each 39
rithm to construct a combination of trees. GLMs and group of records (species presence or absence data), 40
GAMs can be used in tree building. Boosting is used to represented by a ‘node’ in a decision tree, can only be 41
split into 2 groups. The heterogeneity of a node can be 42
overcome the inaccuracies inherent in a single tree mod-
interpreted as a deviance of a Gaussian model (regres- 43
el. It can compute a sequence of simple classification
sion tree) or of a multinomial model (classification tree). 44
trees, where each successive tree is built for the predic-
45
tion residuals of the preceding tree. GBMs can even- The best tree is a trade-off between a high decrease of
46
tually produce a good fit of the predicted values to the deviance and the smallest number of nodes. CART is
47
observed values, even if the specific nature of the re- able to uncover complex interactions between predic-
48
lationships between the predictor variables and the de- tors that may be difficult or impossible to uncover using 49
pendent variable is complex (e.g. nonlinear, interacted traditional multivariate techniques (Breiman et al. 1984; 50
or noisy with outliers). Boosting can be understood as a De’ath & Fabricius 2000). 51
© 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS 127
X. Li and Y. Wang
128 © 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS
Review of species distribution models
© 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS 129
X. Li and Y. Wang
1 and GARP have recursive parameter optimization fea- tributions for many species and regions. Maxent has
2 tures and, thus, are referred to as machine learning tech- good penalty functions (i.e. regularization) in parame-
3 niques. ter estimation to prevent overfitting. Regularization has
4 The complex models can extract hidden features from most impact when sample sizes are small (Phillips et al.
5 the input data. They are suitable when datasets with very 2004), enabling reliable predictions in such cases. In
6 complicated correlation structures are to be analyzed or most situations, Maxent outperforms GARP in terms of
7 when datasets contain many highly intercorrelated vari- prediction accuracy (Phillips et al. 2006).
8 ables. These complex models can reproduce minor de-
9
Cross comparisons
tails of the training data, which can result in an overfit-
10 ting problem, causing biased model prediction in other The predictive accuracy of different SDMs, includ-
11 regions or other time periods (Reibnegger et al. 1991). ing regression, classification and complex models, have
12 been compared in some studies (e.g. Breiman 2001b;
To detect overfitting problems, approaches such as
13 Elith et al. 2006; Meynard & Quinn 2007; Olden et al.
resubstituition (the data used to calibrate models are
14 2008; Phillips & Dudik 2008; Coetzee et al. 2009;
also used to validate them), bootstrap, data-splitting and
15 Elith & Graham 2009; Marmion et al. 2009; Morin &
independent validation can be used (Araujo et al. 2005).
16 Thuiller 2009; Thuiller et al. 2009b; Aertsen et al. 2010;
To reduce overfitting problems, approaches such as con-
17 Kampichler et al. 2010). For different cases (different
trolling the model complexity by using only the most
18 species with different geographical and environmental
predictive top ranked variables (e.g. Liu et al. 2011), de-
19 distributions), model performance usually varies (Segu-
veloping new algorithms to improve the generalization
20 rado & Araujo 2004), and no single model is best for all
capability (e.g. Bramer 2002) or controlling the data dis-
21 cases. In some situations, simpler models work best, es-
tribution skewness (Sun et al. 2006) are applicable.
22 pecially when the study area in small (e.g. at a mountain
23 Artificial neural networks are widely used in statisti-
cal estimations, classification optimization and control slope with monotonous elevation gradient). Aertsen et al
24 (2010) modeled the distribution of 3 tree species using
25 theory (Olden et al. 2008). CART is a much more effi-
cient machine learning tool than ANN. Random forest is 5 SDMs, and found that GAM was the best and ANN
26
a more advanced model and consists of many decision was the worst, compared with GLM, CART and GBM.
27
trees; CART has only 1 tree. Breiman (2001b) compares However, in general, the complex models have over-
28
the misclassification error of random forest and CART all better performance than simpler models (e.g. Prasad
29
using 10 different datasets by leaving out a random 10% et al. 2006; Meynard & Quinn 2007; Olden et al. 2008;
30
of the data for checking. Random forest was found to be Phillips & Dudik 2008), indicating that model com-
31
better for all datasets. Breiman indicated that the mod- plexity contributes to predictive accuracy (Tsoar et al.
32
el is not overfitting using a complex model (a forest) in- 2007).
33
34 stead of a rather simple model (a tree) (Breiman 2001b). All SDMs we discuss can use both numeric and cat-
35 Maxent and GARP are presence-only approaches that egorical environmental variables. GLM and GAM are
36 are suitable for most cases in which the true absence suitable for numeric variables because of their regres-
37 data is lacking. Theoretically, they fit models from pres- sion nature. MARS tends to be better than CART for
38 ence and background data, compared with other pres- numeric variables because hinges are more appropriate
39 ence-only models that either generate pseudo-absence for numeric variables than the piecewise constant seg-
40 data by sampling points outside the boundary of pres- mentation used by recursive partitioning. MARS usually
41 ence data, or strictly apply presence-only modelling (e.g. do not give as good a fit as GBM, but can be built much
42 BIOCLIM) (Phillips et al. 2006; Li et al. 2011). Maxent more quickly and are more interpretable (i.e. effect of
43 and GARP project the realized niche into a geographi- each predictor is clear). Complex models can deal well
44 cal space without giving weight to observed absence infor- with both numeric and categorical environmental vari-
45 mation, which could result in a poorer fit to the current ob- ables.
46 served distribution than the presence–absence approaches, For noisy, nonlinear and high-dimensional data, clas-
47 such as random forest and GBM (Pearson et al. 2006b). sification tree-based machine learning methods are of-
48 However, based on an extensive comparison of 16 SDMs ten suggested. For predicting the distribution of a bird
49 over 226 species from 6 regions of the world, Elith et al. species, random forest is best, CART is second best and
50 (2006) point out that the presence-only approaches, such ANN is the worst (Kampichler et al. 2010). Random
51 as Maxent and GARP, are effective for modelling dis-
130 © 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS
Review of species distribution models
forest is preferable for data including categorical pre- habitat heterogeneity, types of environmental variables 1
dictive variables with different numbers of levels, yet it and species–environment relationships. Here, we sug- 2
is biased in favor of those attributes with more levels. gest new processes that are important for model devel- 3
Thus, the variable importance scores from random forest opment: model selection, model formulation and param- 4
are not reliable for this type of data (Deng et al. 2011). eter estimation. 5
Predictive accuracy of models was also tested by artifi- 6
Model selection 7
cially adding spatial deviances to the occurrences, and
the quality of the predictions made with GBM dimin- Researchers need to select models on the basis of 8
ished significantly when degraded data were used, while their research targets (data types: e.g. noisy, nonlin- 9
there was no evidence of decline in performance of re- ear and high-dimensional, lots of categorical variables) 10
gression based techniques (GLM, GAM and MARS) and objectives (e.g. checking variable weight, explain- 11
and Maxent (Graham et al. 2008). Geographical attri- ing species–environmental relationships, predicting dis- 12
butes (prevalence, latitudinal range and clumping [spa- tribution and estimating future range shifts). Although 13
tial autocorrelation]) would also influence predictive ac- complex models have better prediction accuracy, simple 14
curacy (Marmion et al. 2009). GAM, GLM and MDA models have their own unique advantages. GLMs are 15
are highly influenced by the 3 attributes, whereas ran- transparent; that is, all coefficients of explanatory vari- 16
dom forest, ANN and GBM are only moderately influ- ables (including their quadratic terms and interaction 17
enced, and MARS and CART are only slightly affected terms) can be clearly shown and well interpreted. On the 18
(Marmion et al. 2009). contrary, other models (i.e. GAM, MARS, MDA, GBM, 19
CART and complex models) have too many parameters, 20
Based on a simulated species that provides known
so that meaningful ecological interpretation is not pos- 21
species–environment relationships and spatial distri-
sible. The complex models, such as random forest, are 22
bution, Elith and Graham (2009) evaluate differences 23
also called ‘black boxes’: they are easy to use yet hard
among methods in relation to the truth. The results sug- 24
to explain (Breiman 2001b).
gest that BRT models the overall shape of the response 25
most accurately, followed in order by Maxent, random Researchers should be aware of the strength of each
model. GAM is good for multimodal continuous vari- 26
forest, GLM and GARP. Although the general form of 27
the fitted function was correct for BRT, it was overfitted ables; MARS is suitable for situations with high first-or-
der interaction effects; GBM and CART are preferred 28
to the sample; random forest showed even more overfit- 29
ting (Elith & Graham 2009). This is different from what for many categorical variables with outlier observations;
ANN performs well for very complicated species–envi- 30
Breiman (2001b) concludes: that random forest has no 31
overfitting problem. Maxent is able to fit complex func- ronmental relationships. Random forest is ideal for cas-
es with many explanatory variables and high interac- 32
tions between response and predictor variables, and can 33
include interaction terms, but to a more limited extent tion effects. Maxent and GARP are easy to use, because
available software can import GIS layers of environ- 34
than BRT (Elith et al. 2006). 35
mental variables directly, with presence only species oc-
currence data. Hierarchical modelling is preferable for 36
MODEL APPLICATIONS situations when different regressions (for a few indepen- 37
dent processes [e.g. species occurrence and detection]) 38
When applying models, some key steps, such as ver- 39
ification, calibration, validation (evaluation), credibili- can be combined together.
40
ty and qualification, have been well addressed in other re- Model formulation 41
views (e.g. Rykiel 1996; Guisan & Zimmermann 2000; 42
Guisan & Thuiller 2005). These steps are very important Based on Guisan and Zimmermann’s definition, sta-
43
for good modelling practice. Most model systems now tistical model formulation means: choosing a suited al-
44
provide relevant tools for these steps. However, the great- gorithm for predicting the species distributions, defining
45
est challenge for newcomers is determining how to select a particular type of response variable and estimating the
46
the most relevant model for their data types and model- model coefficients, and selecting an optimal statistical
47
ling objectives, and how to use them appropriately. approach with regard to the modelling context (Guisan
48
At present, researchers have many options for select- & Zimmermann 2000). Model formulation is also re-
49
ing from a variety of models to deal with different sit- ferred to as ‘verification’ by some authors (e.g. Rykiel
50
uations relating to, for example, various spatial extent, 1996).
51
© 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS 131
X. Li and Y. Wang
132 © 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS
Review of species distribution models
distribution range) should be included in the study ar- Breiman L (2001a). Random forests. Machine Learning 1
eas. The parameter estimation process is usually easy 45, 5–32. 2
because most models can do this well. The model eval- Breiman L (2001b). Statistical modelling: the two cul- 3
uation process is also important, yet current models pro- tures. Statistical Science 16, 199–215. 4
vide enough statistics to check the model performance. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984). 5
Using many SDMs for a certain dataset and research ob- Classification and Regression Trees. Chapman and 6
jective can significantly decrease the risk of obtaining Hall, New York. 7
biased results. 8
Brotons L, Thuiller W, Araujo MB, Hirzel AH (2004).
Presence-absence versus presence-only modelling 9
ACKNOWLEDGMENTS methods for predicting bird habitat suitability. Ecog- 10
raphy 27, 437–48. 11
This study was supported by the Strategic Priori-
Coetzee BWT, Robertson MP, Erasmus BFN, van Rens- 12
ty Research Program of the Chinese Academy of Sci-
burg BJ, Thuiller W (2009). Ensemble models predict 13
ences (XDA05080701) and the Public Welfare Project
important bird areas in southern Africa will become 14
(201209027) of the Ministry of Environmental Protec-
tion of China. less effective for conserving endemic birds under cli- 15
mate change. Global Ecology and Biogeography 18, 16
701–10. 17
REFERENCES 18
Conroy MJ, Runge MC, Nichols JD, Stodola KW, Coo-
Aertsen W, Kint V, Van Orshoven J, Özkan K, Muys B per RJ (2011). Conservation in the face of climate 19
(2010). Comparison and ranking of different model- change: the roles of alternative models, monitoring 20
ling techniques for prediction of site index in Med- and adaptation in confronting and reducing uncer- 21
iterranean mountain forests. Ecological Modelling tainty. Biological Conservation 144, 1204–13. 22
221, 1119–30. Costa J, Peterson AT, Beard CB (2002). Ecologic niche 23
Aertsen W, Kint V, Van Orshoven J, Muys B (2011). modelling and differentiation of populations of Tri- 24
Evaluation of modelling techniques for forest site atoma brasiliensis neiva, 1911, the most important 25
productivity prediction in contrasting ecoregions Chagas’ disease vector in northeastern Brazil (hemip- 26
using stochastic multicriteria acceptability analy- tera, reduviidae, triatominae). American Journal of 27
sis (SMAA). Environmental Modelling Software 26, Tropical Medicine and Hygiene 67, 516–20. 28
929–37. Cressie N, Calder CA, Clark JS, Hoef JMV, Wikle CK 29
Araujo MB, Guisan A (2006). Five (or so) challenges (2009). Accounting for uncertainty in ecological anal- 30
for species distribution modelling. Journal of Bioge- ysis: the strengths and limitations of hierarchical sta- 31
ography 33, 1677–88. tistical modelling. Ecological Applications 19, 553– 32
Araujo MB, New M (2007). Ensemble forecasting of 70. 33
species distributions. Trends in Ecology and Evolu- De’ath G, Fabricius KE (2000). Classification and re- 34
tion 22, 42–7. gression trees: a powerful yet simple technique for 35
Araujo MB, Pearson RG, Thuiller W, Erhard M (2005). ecological data analysis. Ecology 81, 3178–92. 36
Validation of species-climate impact models under del Barrio G, Harrison PA, Berry PM et al. (2006). Inte- 37
climate change. Global Change Biology 11, 1504–13. grating multiple modelling approaches to predict the 38
Austin M (2007). Species distribution models and eco- potential impacts of climate change on species’ distri- 39
logical theory: a critical assessment and some possi- butions in contrasting regions: comparison and impli- 40
ble new approaches. Ecological Modelling 200, 1–19. cations for policy. Environmental Science & Policy 9, 41
Berliner LM (1996). Hierarchical Bayesian time-se- 129–47. 42
ries models. In: Hanson KM, Silver RN, eds. Hier- Deng H, Runger G, Tuv E (2011). Bias of importance 43
archical Bayesian Time Series Models. Fundamen- measures for multi-valued attributes and solutions. 44
tal Theories of Physics, vol. 79. Proceedings of the Proceedings of the 21st International Conference 45
15th International Workshop on Maximum Entropy on Artificial Neural Networks (ICANN); 14–17 Jun 46
and Bayesian Methods; 31 Jul–4 Aug 1995, Santa Fe, 2011, Espoo, Finland. Springer, New York. 47
NM, USA, pp. 15–22. Dormann CF, Schweiger O, Arens P et al. (2008). Pre- 48
Bramer M (2002). Using J-pruning to reduce overfitting diction uncertainty of environmental change effects 49
in classification trees. Knowledge-Based Systems 15, on temperate European biodiversity. Ecology Letters 50
301–8. 11, 235–44. 51
© 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS 133
X. Li and Y. Wang
1 Elith J, Graham CH (2009). Do they? How do they? species under six climate scenarios. Forest Ecology
2 Why do they differ? On finding reasons for differing and Management 254, 390–406.
3 performances of species distribution models. Ecogra- Kampichler C, Wieland R, Calmé S, Weissenberger H,
4 phy 32, 66–77. Arriaga-Weiss S (2010). Classification in conserva-
5 Elith J, Graham CH, Anderson RP et al. (2006). Novel tion biology: a comparison of five machine-learning
6 methods improve prediction of species’ distributions methods. Ecological Informatics 5, 441–50.
7 from occurrence data. Ecography 29, 129–51. Korzukhin MD, TerMikaelian MT, Wagner RG (1996).
8 Fisher RA (1936). The use of multiple measurements in Process versus empirical models: which approach for
9 taxonomic problems. Annals of Eugenics 7, 179–88. forest ecosystem management? Canadian Journal of
10 Friedman J, Hastie T, Tibshirani R (2000). Additive lo- Forest Research 26, 879–87.
11 gistic regression: a statistical view of boosting. The Levins R (1966). The strategy of model building in pop-
12 Annals of Statistics 28, 337–407. ulation ecology. American Scientist 54, 421–31.
13 Li RQ, Tian HD, Li XH (2010). Climate change induced
Friedman JH (1991). Multivariate adaptive regression
14 range shifts of Galliformes in China. Integrative Zo-
splines. The Annals of Statistics 19, 1–67.
15 ology 5, 154–63.
Friedman JH (2001). Greedy function approximation: a
16
gradient boosting machine. The Annals of Statistics Li WK, Guo QH, Elkan C (2011). Can we model the
17 probability of presence of species without absence
29, 1189–232.
18 data? Ecography 34, 1096–105.
19
Fuller T, Morton DP, Sarkar S (2008). Incorporating un-
certainty about species’ potential distributions under Liu J, Jolly RA, Smith AT et al. (2011). Predictive Pow-
20
climate change into the selection of conservation ar- er Estimation Algorithm (PPEA)a new algorithm to
21 reduce overfitting for genomic biomarker discovery.
eas with a case study from the Arctic Coastal Plain of
22 PLOS ONE 6, e24233.
Alaska. Biological Conservation 141, 1547–59.
23
Graham CH, Elith J, Hijmans RJ et al. (2008). The in- Marmion M, Luoto M, Heikkinen RK, Thuiller W
24
fluence of spatial errors in species occurrence data (2009). The performance of state-of-the-art model-
25
used in distribution models. Journal of Applied Ecol- ling techniques depends on geographical distribution
26 of species. Ecological Modelling 220, 3512–20.
ogy 45, 239–47.
27
Guisan A, Thuiller W (2005). Predicting species dis- McCullagh P, Nelder JA (1989). Generalized Linear
28
tribution: offering more than simple habitat models. Models. Chapman and Hall, London.
29
Ecology Letters 8, 993–1009. McCulloch W, Pitts W (1943). A logical calculus of
30
Guisan A, Zimmermann NE (2000). Predictive habitat ideas immanent in nervous activity. Bulletin of Math-
31
distribution models in ecology. Ecological Modelling ematical Biophysics 5, 115–33.
32
33 135, 147–86. McRae BH, Schumaker NH, McKane RB, Busing RT,
34 Guisan A, Edwards TC, Hastie T (2002). Generalized Solomon AM, Burdick CA (2008). A multi-model frame-
linear and generalized additive models in studies of work for simulating wildlife population response to
35
species distributions: setting the scene. Ecological land-use and climate change. Ecological Modelling
36
Modelling 157, 89–100. 219, 77–91.
37
38 Hastie T, Tibshirani R (1986). Generalized additive Meynard CN, Quinn JF (2007). Predicting species dis-
models. Statistical Science 1, 297–318. tributions: a critical comparison of the most common
39
statistical models using artificial species. Journal of
40 Hastie T, Tibshirani R (1996). Discriminant analysis by
Biogeography 34, 1455–69.
41 Gaussian mixtures. Journal of the Royal Statistical
42 Society: Series B (Methodological) 58, 155–76. Moisen GG, Frescino TS (2002). Comparing five mod-
43
elling techniques for predicting forest characteristics.
Hastie T, Tibshirani R, Buja A (1994). Flexible discrim-
Ecological Modelling 157, 209–25.
44 inant-analysis by optimal scoring. Journal of the
45 American Statistical Association 89, 1255–70. Morin X, Thuiller W (2009). Comparing niche- and
46 process-based models to reduce prediction uncertain-
Hopfield JJ (1982). Neural networks and physical sys-
47
ty in species range shifts under climate change. Ecol-
tems with emergent collective computational abili-
ogy 90, 1301–13.
48 ties. PNAS 79, 2554–8.
49 Nelder J, Wedderburn R (1972). Generalized linear
Iverson LR, Prasad AM, Matthews SN, Peters M (2008). models. Journal of the Royal Statistical Society Se-
50 Estimating potential habitat for 134 eastern US tree
51 ries A 135, 370–84.
134 © 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS
Review of species distribution models
Olden JD, Lawler JJ, Poff NL (2008). Machine learning Stockwell D, Peters D (1999). The GARP modelling 1
methods without tears: a primer for ecologists. Quar- system: problems and solutions to automated spatial 2
terly Review of Biology 83, 171–93. prediction. International Journal of Geographical In- 3
Patt A, Klein RJT, de la Vega-Leinert A (2005). Tak- formation Science 13, 143–58. 4
ing the uncertainty in climate-change vulnerability Stockwell DRB, Peterson AT (2002). Effects of sample 5
assessment seriously. Comptes Rendus Geosciences size on accuracy of species distribution models. Eco- 6
337, 411–24. logical Modelling 148, 1–13. 7
Pearson RG, Thuiller W, Araujo MB et al. (2006a). 8
Sun Y, Todorovic S, Li J (2006). Reducing the overfit-
Model-based uncertainty in species range prediction. ting of AdaBoost by controlling its data distribution 9
Journal of Biogeography 33, 1704–11. skewness. International Journal of Pattern Recogni- 10
Pearson RG, Thuiller W, Araujo MB et al. (2006b). tion and Artificial Intelligence 20, 1093–116. 11
Model-based uncertainty in species range prediction. 12
Thuiller W, Brotons L, Araujo MB, Lavorel S (2004).
Journal of Biogeography 33, 1704–11. 13
Effects of restricting environmental range of data to
14
Phillips SJ, Dudik M (2008). Modelling of species dis- project current and future species distributions. Ecog-
tributions with Maxent: new extensions and a com- 15
raphy 27, 165–72.
prehensive evaluation. Ecography 31, 161–75. 16
Thuiller W, Albert C, Araujo MB et al. (2008). Predict- 17
Phillips SJ, Dudik M, Schapire RE (2004). A maximum ing global change impacts on plant species’ distribu- 18
entropy approach to species distribution modelling. tions: future challenges. Perspectives in Plant Ecolo- 19
Proceedings of the 21st International Conference on gy, Evolution and Systematics 9, 137–52.
Machine Learning; 4–8 Jul 2004, Banff, Canada. 20
Thuiller W, Lafourcade B, Araujo M (2009a). Mod 21
Phillips SJ, Anderson RP, Schapire RE (2006). Maxi- Operating Manual for BIOMOD. Laboratoire 22
mum entropy modelling of species geographic distri- d’Écologie Alpine, Université Joseph Fourier, Greno- 23
butions. Ecological Modelling 190, 231–59. ble, France. 24
Prasad AM, Iverson LR, Liaw A (2006). Newer classi- Thuiller W, Lafourcade B, Engler R, Araujo MB 25
fication and regression tree techniques: bagging and (2009b). BIOMOD–a platform for ensemble fore- 26
random forests for ecological prediction. Ecosystems
casting of species distributions. Ecography 32, 369– 27
9, 181–99.
73. 28
R Development Core Team (2011). R: A Language and 29
Tsoar A, Allouche O, Steinitz O, Rotem D, Kadmon R
Environment for Statistical Computing. R Foundation 30
(2007). A comparative evaluation of presence-only
for Statistical Computing, Vienna.
methods for modelling species distribution. Diversity 31
Randin CF, Dirnboeck T, Dullinger S, Zimmermann and Distributions 13, 397–405. 32
NE, Zappa M, Guisan A (2006). Are niche-based spe- 33
cies distribution models transferable in space? Jour- Venables WN, Ripley BD (2002). Modern Applied Sta-
tistics with S, 4th edn. Springer, Berlin. 34
nal of Biogeography 33, 1689–703. 35
Reibnegger G, Weiss G, Wernerfelmayer G, Judmaier G, Wiens JA, Stralberg D, Jongsomjit D, Howell CA, Sny-
36
Wachter H (1991). Neural networks as a tool for uti- der MA (2009). Niches, models and climate change:
37
lizing laboratory information–comparison with linear assessing the assumptions and uncertainties. PNAS
38
discriminant-analysis and with classification and re- 106, 19729–36.
39
gression trees. PNAS 88, 11426–30. Wikle CK (2003). Hierarchical Bayesian models for 40
Royle J, Dorazio R (2008). Hierarchical Modelling and predicting the spread of ecological processes. Ecolo- 41
Inference in Ecology: The Analysis of Data from Pop- gy 84, 1382–94.
42
ulations, Metapopulations and Communities. Elsevi- Wikle CK, Berliner LM, Cressie N (1998). Hierarchi- 43
er, Amsterdam. cal Bayesian space-time models. Environmental and 44
Rykiel EJ (1996). Testing ecological models: the mean- Ecological Statistics 5, 117–54. 45
ing of validation. Ecological Modelling 90, 229–44. Yee TW, Mitchell ND (1991). Generalized additive- 46
Segurado P, Araujo MB (2004). An evaluation of meth- models in plant ecology. Journal of Vegetation Sci- 47
ods for modelling species distributions. Journal of ence 2, 587–602. 48
Biogeography 31, 1555–68. 49
50
51
© 2012 Wiley Publishing Asia Pty Ltd, ISZS and IOZ/CAS 135