Inference From Presence-Only Data The Ongoing Controversy: Trevor Hastie and Will Fithian

Ecography 36: 864–867, 2013
doi: 10.1111/j.1600-0587.2013.00321.x
© 2013 The Authors. Ecography © 2013 Nordic Society Oikos
Subject Editor: Miguel Araújo. Accepted 22 February 2013
Inference from presence-only data; the ongoing controversy
Trevor Hastie and Will Fithian

T. Hastie (hastie@stanford.edu) and W. Fithian, Statistics Dept, Stanford Univ., CA 94305, USA.
Presence-only data abounds in ecology, often accompanied by a background sample. Although many interesting aspects
of the species’ distribution can be learned from such data, one cannot learn the overall species occurrence probability, or
prevalence, without making unjustified simplifying assumptions. In this forum article we question the approach of Royle
et al. (2012) that claims to be able to do this.
Modelling of species distributions is most convincing 2012) is attractive. Other popular approaches are: 1)
when presence–absence data is sampled in a systematic way. MAXENT (Phillips and Dudik 2008) which models the fea-
For example, researchers survey a collection of equal sized ture density for the presence data. 2) Naive logistic regres-
quadrats and record the presence or absence of a particular sion, which treats the background data as absence data, and
species of plant. They also record other features of the fits logistic regression models. 3) Manly’s exponential model
quadrat, such as annual precipitation, soil salinity, altitude, (Manly et al. 2010), a precursor to the IPP model in this
and so on. These features are then used in a statistical model context. Fithian and Hastie (unpubl.) survey these methods,
such as logistic regression to build a model for the prob and shows that they are all equivalent to the IPP model, in
ability of species occurrence. Using this fitted model, species particular for an exhaustive sample of background data.
occurrence probability is predicted and can be projected Similar conclusions appear in Aarts et al. (2012) as well as
onto a map of the region, if the features are available at Warton and Shepherd (2010). The conclusion to be drawn
each geographical unit. See Guisan and Zimmerman (2000) from these comparisons is that while absolute occurrence
for a review of such methods. rates are typically elusive, relative rates are more accessible
Often the only species data available are the geographical and available from these type of data.
coordinates of sites where the species was observed – This forum article will step away from these sampling
so-called presence-only data – as recorded by observers. considerations, and address a simpler question. We can think
Also available is a large collection of background data, con- of sampling units as geographical sites x ∈ c; each x repre-
sisting of geographical coordinates and associated geographi- sents a cell or unit of area, and c represents the domain of
cal features such as those available from GIS data. In many interest, or entire collection of such cells. At each site the
cases this background data is available at every geographical vector z  z(x) records the values of som geographical attri-
unit of area in a map of the region, and hence also at the butes or features. Let the binary variable y denote presence
presence sites. Apart from those locations where the species (1) or absence (0) sites. We denote the marginal density of z
were observed, no species information is available for the by p(z), and the (conditional) density of z at presence sites
background data. by p1(z), and absence sites by p0(z). If the overall presence
For animal species, sampling time as well as area is rele- occurrence probability is y(y  1) and hence absence
vant, since the species may wander around. Other compli- y(y  0)  1 2 y(y  1), then basic probability theory tells
cating factors exist, such as the species being present but are us two things:
not observed, and sampling bias (e.g. proximity to roads). 1) the conditional occurrence probability at a site, given
For the purposes of this article, we keep the discussion we observe feature z, is given by
simple and avoid these other sampling issues.
The question is what can one learn from such presence– ψ( y  1) π1( z )
ψ( y  1z )  (1)
background data? There are many approaches to this prob- π( z )
lem, which has become an active area of research. Our
current favorite approach is to model the species occurrence 2) The marginal feature density p(z) is a mixture of the two
rate (as in number of times the focal species is seen per unit class-conditional densities:
area, per unit time). To this end the inhomogeneous Poisson
point process (IPP) (Wharton and Shepard 2010, Aarts et al. p(z) 5 p1(z) y(y 5 1) 1 p0(z)(1 2 y(y 5 1)) (2)
864
Presence–background data consists of a random sample Figure 1 (left panel, red) shows a plot of a very simple
of values of z from p1(z), as well as a separate sample from model for occurrence probability
p(z) (possibly the entire background distribution), which
directly inform us about the densities p1 and p. However, y(y  1x; b) (3)
even if both of these distributions were fully known, we can
see from Eq. (1) that this would not be enough information This corresponds to a logistic regression model linear in x,
to estimate y(y  1z). We are missing the overall occurrence logit[y(y 1x; b)]  b0 1 xb1 (4)
probability y(y  1), or at least some data that allow us to
estimate this. In this case b0  21 and b1  1. We assume here that the
The reader might think that Eq. (2) offers some hope, but marginal distribution p(x) is uniform on [22.5, 2.5],
it does not. We know or have data on p(z) and p1(z) – this which makes the overall prevalence y(y  1)  ∫y(y  1x; b)
leaves a lot of flexibility in choosing somewhat arbitrary p(x)dx ≈ 0.33 in this case, and the values of y(y  1x; b)
values for y(y  1) and p0(z) to make Eq. (2) work out – range between 0.03 and 0.83. Royle et al. (2012) use a linear
unless, that is, we impose strong parametric restrictions on logistic model similar to Eq. (4) for modeling occurrence
some of the ingredients. But then we are manufacturing probability.
information via these assumptions when none exists in With such a parametric assumption, one can perform
the data. We will see more of this in the next section. Ward inference on the data. As they point out, using Eq. (1)
et al. (2009) discuss this problem and the lack of identifi- we can write
ability of y(y  1) from such data. They warned of the folly
in relying on arbitrary parametric assumptions to squeeze ψ( y  1) π1( x )  ψ( y  1x ) π( x ) (5)
out estimates of y(y  1). Phillips et al. (2009) raise similar
 y( y  1, x) (6)
issues. Most recently Phillips and Elith (2013) address the
same issue, and reinforce some of the points we make here. If p(x) is uniform, as it typically is in the geographic domain,
and since Σx ∈c y(y  1, x)  y(y  1), we can write
ψ( yi  1xi )
The parametric approach of Royle et al. π1( xi )  (7)
Σ x∈χ ψ( y  1x )
Royle et al. (2012) discuss methods for estimating species
occurrence probabilities from presence-only data – the same This is a model for the density of the observed data xi
problem we outline above. They cite Ward et al. (2009), yet at the presence sites, and it is expressed in terms of the
proceed to impose parametric assumptions on y(y  1z) to parameters of our logistic regression model if we replace
enable estimation of y(y  1) – exactly what we warned y(yx) in Eq. (7) with y(yx; b). On the basis of this
against. Here we will strengthen our argument in the context Royle et al. (2012) do maximum-likelihood estimation for
of their model, and using their notation. We will also sim- b see Eq. (9) below. Note that the presence observations xi
plify the discussion further, as they did, and focus attention appear in the numerator; the background data are used to
on geographic features x rather than environmental features compute the sums in the denominator.
z  z(x); in the appendix we show that this transition is This sounds like statistical alchemy: why don’t we need to
benign. know y(y  1) anymore? Note that b0 is playing a similar
Occurrence probability Logit of occurrence probability

1.0 ψ(y = 1|x; β)
ψ∗(y = 1|x; β) 1
0.8
0
logit[Pr(y = 1|x)]
Pr(y = 1|x)
0.6 −1
0.4 −2
0.2 −3
0.0 −4
−2 −1 0 1 2 −2 −1 0 1 2
χ χ
Figure 1. Left panel: two different models for occurrence probabilities. The blue curve is half the red curve, and hence has exactly half
the prevalence (marginal occurrence probability) of the red curve. The implied likelihood (Eq. 9) of the presence-data xi is identical for these
two models. Right plot: the logits of the two models on the left. The broken blue curve is the best linear approximation to the solid blue
curve – the approximation that would be imposed by a linear logistic regression model. Since the solid logit curves (red and blue)
are indistinguishable with respect to the likelihood (Eq. 9), distinguishing the dotted blue line from the red is no easier than distinguishing
it from the solid blue. Determining whether or not this slight curvature is present is the entire basis upon which the Royle et al. (2012)
procedure would estimate the prevalence at either 34% (solid red) or 17% (dotted blue).
865
role as y(y  1) was before (y(y  1) multiplied all occur- with two parameters, but with one having prevalence half
rence probabilities, whereas eb0 multiplies the odds). the other. We could change the 1/2 in (Eq. 8) to any 0 
Therefore, it should be surprising that suddenly we can esti- C  1 and the same statement would be true (with ‘half ’
mate it from the data. The reason b0  21 vs b0  1 are changed to ‘fraction C ’). We note that this lack of identifi-
distinguishable from each other in this model is that if ability with propotional models that we exploit was pointed
we hold b1 fixed and change b0, it increases all of the prob- out by Lele and Keim (2006, p. 3023, top left), who origi-
abilities y(y  1x) in Eq. (7), increasing the numerator and nally proposed the approach used by Royle et al. (2012).
the denominator. But because it changes some a little more Now the second model is not a linear logistic model, so it
than others, it subtly changes the density p1(xi) – ‘subtly’ would not be up for comparison in the Royle et al. (2012)
being the operative word. The problem is that other things framework. The right hand panel shows the logit transforms
can subtly change p1 too – such as the linear logistic model of each of these two models. Indeed, the second model is
(Eq. 4) being subtly misspecified, as we see next. not a linear logistic model, but it is almost one. The dotted
The blue curve in this left hand plot of Fig. 1 corresponds blue curve shows the best linear approximation to this logit
to a different model for the occurrence probability, in the population. Since the solid red line and solid blue
curve are indistinguishable from each other with respect to
1 the likelihood Eq. (9), distinguishing the dotted blue
ψ ∗( y  1x ; β)   ψ( y  1x ; β) (8)
2 line from the red is no easier than distinguishing it from
the solid blue. Determining whether or not this slight
For this model (Eq. 8) the overall prevalence y*(y  1) 
curvature is present is the entire basis upon which the
1/2  y(y  1), i.e. 0.17 or exactly half of the prevalence Royle et al. (2012) procedure would estimate the preva-
for model (Eq. 3). Although y*(y  1x) does not corre- lence at either 34% (solid red) or 17%. One would need an
spond to a linear logistic model, it is nearly linear on awfully large amount of data to be able to detect a difference
the logit scale (see the solid blue curve in the right plot of between the two blue lines, even with presence–absence
Fig. 1), and is still a simple parametric model; but more data.
on that to come. We now present a simulation to reinforce the points
The critical point of this example is that the joint likeli- we have made. We simulate data from model (Eq. 8)
hood of the presence data (i.e. Eq. 4 in Royle et al. (2012)) (nearly linear logistic), and fit a linear logistic model using
the likelihood (Eq. 9). In detail, we generate a large sample
n ψ( yi  1x i ; β) of values of x via the uniform distribution p, generate
£( β) ∏ (9) 0/1 ‘presence/absence’ data using the probabilities (Eq. 8),
i 1 Σ x∈χ ψ( y  1x ; β) and then take a random subset of 1000 values of x from

those that came up as ‘present’. This sample of 1000 is
n ψ ∗ ( yi  1x i ; β)
∏ (10) fed into (Eq. 9), which is then maximized with respect to
i 1 Σ x∈χ ψ ∗ ( y  1x ; β) the two linear logistic parameters. Rather than show the

parameter estimates b̂ 0 and b̂ 1 that result, we instead com-
is identical for these two models. In other words, the pute the implied estimated prevalence value ŷ(y  1) 
likelihood would have nothing to say about whether model ∫y(y  1x; b̂)p(x)dx. This was repeated B  1000 times.
(Eq. 3) or model (Eq. 8) was preferred – two models, both The middle histogram in Fig. 2 shows the results. The
C= 0.3 C= 0.5 C= 0.7

200 200 200
150 150 150

Frequency
Frequency
Frequency
100 100 100
50 50 50
0 0 0
0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4
ψ(y = 1) ψ(y = 1) ψ(y = 1)
Figure 2. Results of three separate simulation runs. The center histogram shows the estimated value ŷ(y  1) when a linear logistic
regression model is fit to data generated from model (Eq. 8). The histogram summarizes B  1000 different runs, each consisting of
1000 presence samples. The true value of y(y  1) is given by the vertical red line. The two flanking histograms change the 1/2 in
(Eq. 8) to C  3/10 (left) and C  7/10 (right), with again the red line showing the true value of y(y  1). In all cases, irrespective of
the true value of y(y  1), the histograms indicate that the value being estimated is centered around y(y  1) ≈ 0.33, the value from
Eq. (3).
866
histogram is peaked around 0.33 (the value from Eq. (3), not shown here, they are too fragile and arbitrary, and will not be
the true value 0.17). The flanking histograms repeat these robust in practical settings.
simulations using 3/10 and 7/10 rather than the 1/2 in
(Eq. 8). In all cases the estimated ŷ(y  1)bear no relation-
Acknowledgements – TH was partially supported by grant
ship to the true values (red lines).
DMS-1007719 from the National Science Foundation, and
The take-home message here is that: a) two perfectly grant RO1-EB001988-15 from the National Inst. of Health. WF
good and parsimonious probability models are indistin- was supported by VIGRE grant DMS-0502385 from the National
guishable with respect to the likelihood (Eq. 9) for the pres- Science Foundation. The authors thank Jane Elith for helpful sug-
ence data, despite the fact that one has half the prevalence gestions on an earlier draft.
of the other; i.e. prevalence is not identifiable in this
extended family. b) By insisting on a particular parametric
form, e.g. linear logistic, we are on extremely flimsy ground, References
as the subtle distinction in this example shows. c) When
presence-only data arise via models that are nearly linear Aarts, G. et al. 2012. Comparative interpretation of count,
on the logistic scale, maximum likelihood using Eq. (9) presence–absence and point methods for species distribution
models. – Methods Ecol. Evol. 3: 177–187.
and a linear logistic regression model can be incapable of
Guisan, A. and Zimmerman, N. 2000. Predictive habitat dis
estimating the correct parameter values, and in particular tribution models in ecology. – Ecol. Model. 135: 147–186.
the correct implied prevalence. This is not trickery with data Lele, S. and Keim, J. 2006. Weighted distributions and estimation
simulations or abstract ideas; it cuts to the core of how of resource selection probability Functions. – Ecology 87:
Royle et al.’s model estimates probabilities. They say you can 3021–3028.
estimate probabilities from presence-only data by using Manly, B. et al. 2010. Resource selection by animals: statistical
their model, which relies on a linear logistic framework. design and analysis for field studies. – Kluwer.
The problem is that in the real world, functional forms Phillips, S. and Dudik, M. 2008. Modeling of species distribution
with maxent: new extensions and a comprehensive evaluation.
are almost never linear; linearity is just a useful approxi
– Ecography 31: 161–175.
mation. We have shown here that data distributed just Phillips, S. and Elith, J. 2013. On estimating probability of
slightly differently to that allowed in their framework will presence from use-availability or presence-background data.
lead to incorrect estimates of prevalence, and therefore – Ecology in press.
incorrect estimates of probability of presence. Phillips, S. et al. 2009. Sample selection bias and presence-only
Stepping back a bit, we remake our earlier point. It distribution models: implications for background and pseudo-
should be clear that a sample of n1 sites, along with a absence data. – Ecol. Appl. 19: 181–197.
sample of n0 unclassified samples (i.e. a mix of presence and Royle, J. et al. 2012. Likelihood analysis of species occurrence
probability from presence-only data for modelling species
absence), tells you nothing about the overall probability of
distributions. – Methods Ecol. Evol. 3: 545–554.
occurrence, absent strong parametric assumptions about Ward, G. et al. 2009. Presence-only data and the em algorithm.
the form of the underlying densities. In other words, there is – Biometrics 65: 554–563.
no information on prevalence in the data itself; it all comes Wharton, D. and Shepard, L. 2010. Poisson point process models
from the model assumptions. Using such assumptions as the solve the ‘pseudo-absence problem’ for presence-only data
basis for estimating overall prevalence is not a good idea; as in ecology. – Ann. Appl. Stat. 4: 1383–1402.
Appendix
Geographic vs environmental features
Usually we parameterize our conditional occurrence model

∑ ψ ( y 1z) π( z ) ∑ ψ ( y 1z( x )) π( x ) (13)
z ∈Z x ∈χ
in terms of environmental features zi  z(xi) rather than
the geographic features xi themselves. For ease of exposition So then we have
here, we treat z as discrete rather than continuous.
Then along the lines of Eq. (7) we would have ψ ( y i  1z ( xi ))
π1 ( zi )   π( zi ) (14)
ψ( yi  1zi ) π( zi ) ∑ x ∈χ ψ ( y  1z ( x ))π( x )

π1( zi )  (11)
Σ z ∈Z ψ( y  1z ) π( z ) So if p(x) is uniform, and we parameterize y(yi  1z(xi); b),

then the log-likelihood contribution from presence site i
Here p(z) is the marginal environmental feature distribution, for b is
and is not uniform. However,
 ψ( y  1z ( x ); β) 
π( z )  ∑ π( x ) (12) log  i i
  Ci
 ∑ x ∈χ ψ ( y  1z ( x ); β) 
(15)
x∈χ ; z ( x )  z

and so the denominator in Eq. (11) can be written where Ci can be discarded, since it does not depend on b.
867

Inference From Presence-Only Data The Ongoing Controversy: Trevor Hastie and Will Fithian

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Inference From Presence-Only Data The Ongoing Controversy: Trevor Hastie and Will Fithian

Uploaded by

Copyright:

Available Formats

Ecography 36: 864–867, 2013