Fine-Resolution Population Mapping Using OpenStreetMap
Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered
office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
To cite this article: Mohamed Bakillah, Steve Liang, Amin Mobasheri, Jamal Jokar Arsanjani
& Alexander Zipf (2014) Fine-resolution population mapping using OpenStreetMap points-of-
interest, International Journal of Geographical Information Science, 28:9, 1940-1963, DOI:
10.1080/13658816.2014.909045
Downloaded by [141.212.109.170] at 08:37 16 December 2014
International Journal of Geographical Information Science, 2014
Vol. 28, No. 9, 1940–1963, http://dx.doi.org/10.1080/13658816.2014.909045
1. Introduction
Numerous activities, such as business development, demographic studies, disaster
prevention, and urban planning, require population data at a fine resolution, including at
building level. However, to protect privacy, population statistics provided by national
censuses are aggregated and therefore not available at building level (Ural et al. 2011,
Langford 2013, Sridharan and Qiu 2013). Estimates at the building level must be
computed by disaggregating census population data with appropriate techniques.
Population estimates can be obtained through areal interpolation. Areal interpolation is
the process of transferring data from a first spatial unit system (called source units, usually
the census units) to another spatial unit system (called target units, e.g., cells of a raster
grid) (Sridharan and Qiu 2013). Areal interpolation can be conducted using ancillary data
to guide the redistribution of population, based on the observation that population
distribution is often correlated with other spatial variables (such as land use/land cover,
LULC). Still, the quality and the appropriateness of the ancillary data used influence the
accuracy of the estimation. Common types of ancillary data used for areal interpolation
include LULC data derived from satellite imagery, official cadastral or topographic
databases, official vector street network data (considering that people commonly reside
along residential streets), or Light Detection and Ranging (LiDAR) data.
Methods employed to estimate population at building level commonly use LiDAR
data to estimate building volume or height (Lu et al. 2010, Sridharan and Qiu 2013).
Building surface can also be used to assess population distribution; however, the lack of
information on the volume of buildings may generate underestimation or overestimation
of population (Harvey 2002), unless the height of buildings is relatively homogeneous (Lu
et al. 2010). Moreover, LiDAR data is costly, so alternative methods that use affordable
data are needed.
This paper proposes a method for areal interpolation at building level that uses
volunteered geographic information (VGI).
To the best of our knowledge, this is the first attempt to use VGI for this task. VGI is
produced by volunteers through Web 2.0 applications rather than by traditional data
producers (Goodchild 2007). VGI is criticized due to potential data quality issues (Bishr
and Mantelas 2008, Flanagin and Metzger 2008, Jackson et al. 2013), but it can also
extend existing data sets with information that pertains to a level of detail that goes beyond
the capacity of ‘official’ producers (Tulloch 2008). VGI applications make it possible to collect
local knowledge, which usually cannot be gathered using traditional data collection
processes (Goodchild 2007). Consequently, numerous researchers in the field of
GIScience consider VGI as a valuable data source that can complement official and
commercial data (Goodchild 2007, De Longueville et al. 2010, Haklay 2010, Zielstra
and Zipf 2010, Mooney and Corcoran 2011). For instance, Song and Sun (2010) have
reported that the usage of VGI in urban management has increased. Coleman et al. (2010)
report that some state governments in Australia and Germany, as well as the US
Geological Survey, have employed VGI in their mapping programs. Commercial data
providers such as NAVTEQ and TomTom have used VGI to identify updates required for
their databases (Coleman 2010). One of the best-known examples of a VGI project is
OpenStreetMap (OSM), a collaborative mapping project that enables volunteers to
produce and share maps (Neis et al. 2011). OSM has already been used for various purposes,
including the automatic derivation of three-dimensional CityGML models (Goetz and Zipf
2012), road-based travel recommendation systems (Sun et al. in press), and mapping land-
use patterns (Jokar Arsanjani et al. 2013). In OSM, users can describe features – such as
roads, water bodies, and points-of-interest (POIs) – using ‘tags’ (Goetz et al. 2012). A
POI is a feature on the map that is given specific point coordinates (e.g., schools,
restaurants, churches) and whose location may be useful to know depending on the
context of the user (especially in tourism and for routing services). Because some types of
place can be correlated with a higher density of population (as argued in Zhang and Qiu
(2011)), POIs can be useful to refine population estimates provided that a correlation can
be established.
The proposed method is an alternative solution to estimate population at the building
level that relies solely on open data and VGI. The principle of the approach is to combine
three existing methods: multi-class dasymetric mapping (conducted with Urban Atlas
LULC data) is used to obtain population estimates for relatively large zones; then,
interpolation by surface volume integration (using OSM POIs to determine points of
high population density) is employed to uncover variations of population density within
these zones; finally, binary dasymetric mapping (conducted with OSM building layer
data) is used to redistribute population within buildings.
The originality of the proposed approach lies in investigating the use of VGI to obtain
population estimates. The hypothesis behind this method is that OSM data will be of
sufficient quality to support this task. Of note is that Strunck (2010), who has conducted
experiments on the growth of different OSM POI categories in Germany, observed that
OSM had about twice as many POIs as TomTom MultiNet data, a commercial data
source. However, one can still expect that VGI quality issues may affect the accuracy
of population estimates. Concerning POIs, specific concerns include spatial and thematic
accuracy. In terms of spatial accuracy, VGI contributors are not formally required to
generate data with an established level of spatial precision, and may have an inaccurate or
incomplete perception of the geographic phenomenon they describe (Bakillah et al.
2013a). Jackson et al. (2013) note that OSM contributors are subjected to no constraints
concerning the placement of a POI, and may rely on various devices to do so (GPS,
smartphone, etc.); therefore, one can expect significant variations in positioning. In terms
of thematic accuracy, Scheider et al. (2011) observed that the terms used by contributors
2. Related work
Areal weighting is the simplest type of areal interpolation method and it requires no
ancillary data (Markoff and Shapiro 1973, Goodchild and Lam 1980, Lam 1983). It is also
referred to as area weighted interpolation (Lwin 2010) or overlay operation based on
geometric properties (Wu et al. 2005). In this method, population estimates in target zones
are computed based only on the proportion of each source zone that overlaps with the
target zone (Goodchild and Lam 1980). It is based on the assumption that population is
uniformly distributed within the source zone. However, census source zones rarely have
homogeneous population distributions, so population estimates based on areal
weighting are likely to differ from reality. Nevertheless, in the
absence of ancillary data, the areal weighting method may be useful. An improved version
of areal weighting, called target-density weighting (TDW), was proposed by Schroeder
(2007). TDW is a method employed to interpolate population data from one set of census
units to another (data from more than one census is required). TDW is based on the
assumption that if people lived in a given region when a first census was conducted, these
people are more likely to live there too (in a proportional manner) when a subsequent
census is conducted. TDW was extended into the cascading density weighting (CDW)
method to produce complete population time series; CDW repeats TDW backwards
through a series of censuses (Schroeder 2010). Another example of an areal interpolation
method that does not require ancillary data is Tobler’s pycnophylactic interpolation
(Tobler 1979). This approach creates a smooth population density distribution for any
location (x, y) by minimizing the sum of squares of partial derivatives. The density
distribution respects Tobler’s pycnophylactic property, i.e., the sum of people reported
for each source zone is preserved during the interpolation process.
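As a minimal sketch of the areal weighting described at the start of this section, the following Python function redistributes source-zone counts to target zones in proportion to overlapping area. The function name, the dictionary layout, and the toy figures are hypothetical and chosen for illustration only; overlap areas would normally come from a GIS overlay.

```python
def areal_weighting(source_pop, source_area, overlap_area):
    """Estimate target-zone populations from source-zone counts.

    source_pop:   {source_id: population}
    source_area:  {source_id: total area of the source zone}
    overlap_area: {(source_id, target_id): area of their intersection}
    """
    target_pop = {}
    for (s, t), area in overlap_area.items():
        # Uniform-density assumption: the share of population equals
        # the share of the source zone's area that falls in the target.
        share = area / source_area[s]
        target_pop[t] = target_pop.get(t, 0.0) + source_pop[s] * share
    return target_pop


# Hypothetical toy data: two census units split across two grid cells.
source_pop = {"A": 1000, "B": 500}
source_area = {"A": 4.0, "B": 2.0}
overlap_area = {("A", "c1"): 3.0, ("A", "c2"): 1.0, ("B", "c2"): 2.0}
estimates = areal_weighting(source_pop, source_area, overlap_area)
```

Because every source zone is fully covered by target cells in this toy example, the total population is preserved, which is the pycnophylactic property in its simplest form.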
Dasymetric mapping (also referred to as dasymetric modeling) is a technique where
additional data (called ancillary data) is employed to guide the redistribution of population
data at a finer level of resolution (Semenov-Tian-Shansky 1928, Wright 1936). Binary
dasymetric mapping (also referred to as filtered areal weighting) is the simplest type of
dasymetric mapping method (Eicher and Brewer 2001). It is called binary because source
zones are subdivided into populated and unpopulated areas. The areal weighting method
is then applied within the populated areas of source zones. Common sources used to
identify populated areas include:
● LULC data derived from satellite imagery (Mennis 2003, Reibel and Agrawal
2007, Cromley et al. 2012);
● county address points and parcels (Tapp 2010);
● vector street network data (Xie 1995, Hawley and Moellering 2005, Zandbergen
and Ignizio 2010);
● vector or raster maps provided by official mapping agencies (Langford 2013),
including topographic databases (Wu et al. 2008, Lwin and Murayama 2010);
● Google Map images (Yang et al. 2012).
The main drawback of binary dasymetric mapping is that it relies on the assumption that
the population is uniformly distributed within the populated areas (Maantay et al. 2007).
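Binary dasymetric mapping can be sketched as areal weighting restricted to populated areas. The example below is illustrative only: the tuple layout of `pieces` and the toy values are assumptions, and the populated/unpopulated flag would in practice come from one of the ancillary sources listed above.

```python
def binary_dasymetric(source_pop, pieces):
    """Distribute source populations over populated pieces only.

    source_pop: {source_id: population}
    pieces:     list of (source_id, target_id, area, populated) tuples,
                one per intersection of a source zone and a target zone
    """
    # Total populated area per source zone.
    populated_area = {}
    for s, _t, area, populated in pieces:
        if populated:
            populated_area[s] = populated_area.get(s, 0.0) + area

    # Areal weighting within the populated part; unpopulated pieces get 0.
    target_pop = {}
    for s, t, area, populated in pieces:
        share = area / populated_area[s] if populated else 0.0
        target_pop[t] = target_pop.get(t, 0.0) + source_pop[s] * share
    return target_pop


# Hypothetical toy data: one source zone, one of three pieces is water.
toy_pieces = [("A", "t1", 2.0, True), ("A", "t2", 1.0, True), ("A", "t3", 5.0, False)]
binary_estimates = binary_dasymetric({"A": 900}, toy_pieces)
```

The large unpopulated piece `t3` receives no population even though it has the largest area, which is the difference from plain areal weighting.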
The multi-class dasymetric mapping method partly addresses this problem. The
territory is partitioned into several residential classes (e.g., high density, medium density,
low density) associated with different population densities (Eicher and Brewer 2001). To
estimate the population density associated with each residential class, empirical sampling
of population densities (Mennis 2003) and multivariate linear regression estimated with
ordinary least squares (Langford 2006) were proposed. Data sources used for multi-class
dasymetric mapping include:
Multi-class dasymetric mapping may theoretically be more accurate than binary dasy-
metric mapping; however, it is similarly based on the assumption that population is evenly
distributed within an area covered by the same residential class. Lo (2008) has
demonstrated that the relation between population density and the type of LULC category
varies (referred to as the ‘problem of related variables’). This finding suggests that linear
regression alone is not appropriate to estimate population distribution, which was also
emphasized by other authors (Griffith and Can 1996). Indeed, Wu and Murray (2005)
explain that methods based on regression analysis assume that population density is
spatially independent. Still, the difficulty of identifying a meaningful statistical relation
between population density and other variables remains a challenge (Mennis 2009). To
address this issue, methods that integrate spatial autocorrelation have been proposed. For
example, Wu and Murray (2005) propose a co-kriging method for estimating population
density using census data and impervious surface proportion as a secondary variable to
refine population estimates. Other authors suggest area-to-point kriging to disaggregate
residual population density values obtained with regression (Kyriakidis 2004, Liu et al.
2008). Griffith (2013) pointed out that kriging and co-kriging involve variable transfor-
mation, which may introduce errors in final population estimates. He proposes a geo-
graphic areal unit imputation Poisson model specification that uses remotely sensed
images as well as spatial and temporal autocorrelation to circumvent this issue. The
Poisson model has also been used by Leyk et al. (2013b) to represent relations between
variables. Another approach to address the issue of related variables is maximum entropy
dasymetric modeling, where the maximum entropy principle is used to model the statis-
tical relation between population density and other variables, such as attributes of house-
holds (Leyk et al. 2013a). The expectation–maximization (EM) algorithm, which was first
data (in this case, provided by the UK government), but mentioned that OSM, while an
interesting alternative, has not been used. Similarly, the evaluation of
publicly available data for areal interpolation by Lin et al. (2013) focuses on
remote-sensing LULC data produced by official sources. Aubrecht et al. (2011) have explored the potential
of VGI for modeling the spatiotemporal characteristics of urban population, but not to
produce population estimates.
for each census block (a subunit of a district) is also available. The Hamburg city region
mainly comprises medium to dense urban fabric north of the River Elbe, with agriculture
and forest areas at the periphery. The center of Hamburg is made of a mixture of urban
fabric of varying density, green urban areas, sport and leisure activity areas, and an
important portion occupied by port facilities.
Ancillary data used in this study include LULC data from Urban Atlas, the European
LULC data set for large urban zones that have more than 100,000 inhabitants
(http://www.eea.europa.eu/data-and-maps/data/urban-atlas) (Figures 1 and 2).
Urban Atlas data is also from 2011. The LULC categories used by Urban Atlas do not
identify residential areas. Instead, areas that are more likely to contain the majority of
residential buildings are in the continuous and discontinuous urban fabric categories,
which also include commercial and industrial types of building. Meanwhile, other cate-
gories may also contain residential buildings. Therefore, it was decided to consider all
categories of LULC as habitable areas, except for the water category.
To redistribute population to individual buildings, data on areas and location of
buildings is needed. For this purpose, the OSM building layer was used to extract building
footprints and coordinates. Because of the collaborative nature of OSM, the data on
buildings has been created at different dates. While, in OSM, the type of building can
be identified with attributes, contributors do not always provide a value for this attribute,
so one cannot rely on this type of data to identify building use. POIs, which are used to
identify high-density areas, were also extracted from OSM (and therefore created at
different periods of time). OSM POIs describe geo-located features (point, line, or
polygon). The type of place is specified through tags, which are made of a (key, value)
pair, where the ‘key’ is a category of features and the ‘value’ is a subcategory of features.
An example of a tag describing a POI is (amenity, postbox). In OSM, contributors are
encouraged to use the vocabulary suggested in the OSM Feature Wiki
(http://wiki.openstreetmap.org/wiki/Map_Features) when adding POIs. POIs created by a
contributor can be edited by other contributors.
(3) Population/cell
(4) Population/building
used. The main contribution of the method is in the next step, where population/area
covered by a given LULC category is disaggregated to population/cell. Here, the cells are
basic spatial units created by generating a grid over the territory. To obtain population/cell,
a new interpolation by surface volume integration method is deployed; it estimates the
population density per cell based on the spatial distribution of relevant POIs. The POIs
retrieved from OSM are classified by category using an unpruned ID3 decision tree, a
machine learning classifier. The next steps are based on the hypothesis that some categories of POIs are
associated with a higher density of population. These POI categories are selected based on
the correlation between the density of a given category of POI and the density of
population. Then, a quadtree procedure is applied to localize POIs and infer the presence
of control points. Population in cells is computed with a decaying function centered on
control points. Finally, the OSM building layer data is used to compute population per
building based on building footprint area.
decision tree algorithm is used. A quadtree procedure is used to deal with high volumes of
POIs and control points are identified from the spatial distribution of these POIs. The
details of the procedure to derive control points from POIs are described below.
by the centroid of the polygon. The initial data set obtained for the city of Hamburg
contains 31,593 POIs. Only the subset of POIs that belong to a category that makes up a
significant percentage of the whole set of POIs (0.01%) was retained. Otherwise, some
‘categories’ of POIs would be instantiated by only a few objects, too scarce to be used
for identifying control points. The result is a set of 16,964
POIs. Table 1 shows the number of POIs categorized according to the value of the ‘key’.
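The category filter described above can be sketched as follows. This is an illustration under stated assumptions: a category is taken to be a (key, value) tag pair, the function name is hypothetical, and the toy example uses a much higher threshold than the 0.01% reported in the text so that the effect is visible on ten POIs.

```python
from collections import Counter


def filter_rare_categories(poi_tags, min_share=0.0001):
    """Keep only POIs whose category reaches a minimum share of all POIs.

    poi_tags:  list of (key, value) tags, one per POI
    min_share: minimum fraction of the whole set (0.0001 corresponds to 0.01%)
    """
    counts = Counter(poi_tags)
    total = len(poi_tags)
    return [tag for tag in poi_tags if counts[tag] / total >= min_share]


# Hypothetical toy data: ten POIs in three categories.
toy_tags = (
    [("amenity", "school")] * 6
    + [("amenity", "restaurant")] * 3
    + [("tourism", "viewpoint")]
)
# A 20% threshold is used here only to make the filtering visible.
kept = filter_rare_categories(toy_tags, min_share=0.2)
```

With the toy threshold, the single `("tourism", "viewpoint")` POI is dropped because its category accounts for only 10% of the set.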
training data set was elaborated using the matcher and manually validating a random
subset of the matches.
In this equation, n is the number of census blocks, $PD_i$ is the population density in census
block i, $\overline{PD}$ is the average population density, $nb\_POI_i$ is the number of POIs (of a given
category) in census block i, and $\overline{nb\_POI}$ is the average number of POIs of the given
category among all census blocks. Table 2 shows Spearman’s correlation coefficient
between the occurrences of some categories of POI and population density.
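To illustrate how such a coefficient can be computed per POI category, the sketch below implements Spearman's rank correlation as the Pearson correlation of ranks, using only the standard library. The function names and the five-block toy data are hypothetical; they are not values from the study.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data for five census blocks.
pop_density = [1200, 3400, 800, 5600, 2100]   # inhabitants per km^2
nb_poi = [1, 3, 0, 9, 2]                      # POIs of one category per block
rho = spearman(pop_density, nb_poi)
```

In the toy data the two variables have the same ordering, so the coefficient is at its maximum; real census blocks would give intermediate values.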
These results are used to select the types of POI that are considered in the determina-
tion of control points. The results of population disaggregation using different categories
of HDI POI for determining control points are reported in the experimentation section.
Table 2 (column headers): Key, value | Correlation coefficient
four nodes, represents the recursive subdivision of space into nested quadrants (Samet
1984). The subdivision does not have to be even. Rather, it adapts to the
density of the spatial objects we intend to capture: a quadrant is subdivided into four new
quadrants (and four new nodes are created in the tree structure) as long as it contains more
than one spatial object. If no spatial object is found in a quadrant, this quadrant is no
longer divided (Figure 4).
A variant of the quadtree that was used here is the bucket quadtree. Instead of
continuing the subdivision until a quadrant contains only one spatial object, the subdivi-
sion continues until a quadrant contains not more than a predefined number of objects.
This avoids leaf nodes that are excessively small quadrants, which in this context would
generate too many control points and create an artificially uniform population
distribution.
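A bucket quadtree of the kind described above can be sketched recursively. This is an illustrative implementation, not the one used in the study: the function name, the half-open quadrant convention, and the `max_depth` guard (which stops recursion when many points share the same coordinates) are assumptions.

```python
def bucket_quadtree(points, bounds, capacity, max_depth=12):
    """Return the non-empty leaf quadrants as (bounds, points) pairs.

    points:   list of (x, y), assumed to lie inside bounds
    bounds:   (xmin, ymin, xmax, ymax), half-open on the max side
    capacity: bucket size, i.e. the largest number of points kept in a leaf
    """
    if not points:
        return []                       # empty quadrants are not divided
    if len(points) <= capacity or max_depth == 0:
        return [(bounds, points)]       # small enough: this is a leaf
    xmin, ymin, xmax, ymax = bounds
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    quadrants = [
        (xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
        (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax),
    ]
    leaves = []
    for qb in quadrants:
        inside = [(x, y) for x, y in points
                  if qb[0] <= x < qb[2] and qb[1] <= y < qb[3]]
        leaves += bucket_quadtree(inside, qb, capacity, max_depth - 1)
    return leaves


# Hypothetical toy data: five POIs in a 10 x 10 area, bucket size 2.
toy_pois = [(1, 1), (1.1, 1.2), (2, 2), (9, 9), (6, 1)]
leaves = bucket_quadtree(toy_pois, (0, 0, 10, 10), capacity=2)
```

Dense areas are subdivided further than sparse ones, and a larger `capacity` yields fewer, larger leaf quadrants and therefore fewer candidate control points.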
In this approach, the spatial objects we intend to capture are HDI POIs. The bucket
quadtree procedure is applied. The predefined number of objects per quadrant ranged
between 1 and 10, depending on the category of POI that was used. Then, to determine
the location of control points, a generalization procedure is applied to identify a single
point from all HDI POIs located in the quadrant. The type of generalization operation is a
midpoint aggregation (Bereuter and Weibel 2010). In the midpoint aggregation operation,
spatially and semantically close points (HDI POIs from the same category) are aggregated
into a single point, which is the mean center point (midpoint) of the HDI POIs. To
illustrate this, in Figure 5, midpoints are identified (circled in blue) in each quadrant,
considering train stations as HDI POIs. To find the midpoint MP of a series of points
<POI1, POI2, …, POIj, …, POIn> within a quadrant, the following distance is minimized:
$$ \mathrm{DISTANCE\_TO\_MID\_POINT} = \sum_{j=1}^{n} \mathrm{dist}(MP, POI_j) \tag{2} $$
The midpoints constitute the control points that are used to compute population density in
cells.
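The midpoint aggregation can be sketched as a mean center computation. One caveat applies: the mean center is the exact minimizer of the sum of squared Euclidean distances, whereas the point minimizing the sum of plain Euclidean distances is the geometric median. Since the text describes the midpoint as the mean center, the sketch below assumes the squared-distance reading; the function names and coordinates are hypothetical.

```python
def mean_center(points):
    """Mean center of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def sum_squared_distances(candidate, points):
    """Sum of squared Euclidean distances from candidate to every point."""
    cx, cy = candidate
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)


# Hypothetical toy data: three stations located in one quadrant.
stations = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
control_point = mean_center(stations)
```

Any other candidate location, for example `(1.2, 1.0)`, gives a larger `sum_squared_distances` than the mean center for the same set of stations.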
$$ W(C_u) = \left(1 - \frac{\mathrm{dist}(C_u, \mathrm{nearest\_control\_point})}{\mathrm{dist\_max}}\right)^{q} \tag{4} $$
The formula is illustrated in Figure 6 considering that educational institutions were used
as control points.
The population in cell $C_u$ depends on the radial distance from the nearest control point. The
formula depends on several parameters. First, $constant_{ud}$ represents
the maximal density in areas covered by u in d. It can be estimated from the density of
people per census block, as given in census data. Then, dist($C_u$, nearest_control_point) is
the distance between the centroid of the cell $C_u$ and the nearest control point, and dist_max is
the maximal value of dist($C_u$, nearest_control_point). Finally, q is the population density decaying
factor. One can see from the above formula that the density in a cell is maximal
($density_d(C_u) = constant_{ud}$) when this cell includes a control point, because
dist($C_u$, nearest_control_point) is minimal and $W(C_u)$ is very close to 1. As we move away from a control
point, the population decreases at a rate fixed by q. Because q greatly
influences the population estimate, several values of q are tested in the
experimentation section.
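The effect of q can be illustrated numerically. The sketch assumes that Equation (4) has the form W = (1 - dist/dist_max)^q, which is the reading consistent with the statement that a larger q produces a faster decay; the function name and distances are hypothetical.

```python
def decay_weight(dist, dist_max, q):
    """Weight of a cell at distance dist from its nearest control point.

    Assumed form: W = (1 - dist / dist_max) ** q, with 0 <= dist <= dist_max.
    """
    return (1.0 - dist / dist_max) ** q


# Hypothetical example: a cell halfway between a control point and dist_max.
w_slow = decay_weight(500.0, 1000.0, q=0.3)   # slower decay
w_mid = decay_weight(500.0, 1000.0, q=0.5)
w_fast = decay_weight(500.0, 1000.0, q=0.7)   # faster decay
```

At the control point the weight is 1 and at `dist_max` it is 0 for any positive q; in between, the weight at a given distance is lower for larger q.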
Figure 6. Population in cell C depends on the distance from the nearest control point (in this
figure, educational institutions are considered as control points).
Still, another problem must be resolved. To respect Tobler’s pycnophylactic property, the
sum of the population in all cells covered by u must equal TOTAL_POPudA. This is
why the computation of population in cells with the above formula is repeated, fitting
the constantud so that the sum of population in all cells covered by u is as close as possible
to TOTAL_POPudA.
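The fitting step can be sketched under a simplifying assumption that is not stated explicitly in this section: if the density in a cell is the constant multiplied by the cell's weight W, and the cell population is density times cell area, then the constant that preserves the zone total has a closed form and no iteration is needed. The function names and toy values below are hypothetical.

```python
def fit_constant(weights, cell_areas, total_pop):
    """Constant c such that sum(c * W_i * area_i) equals total_pop.

    Assumes density_i = c * W_i and population_i = density_i * area_i.
    """
    denom = sum(w * a for w, a in zip(weights, cell_areas))
    if denom == 0:
        raise ValueError("all weights are zero; nothing to distribute")
    return total_pop / denom


def cell_populations(weights, cell_areas, total_pop):
    """Population per cell after rescaling to the zone total."""
    c = fit_constant(weights, cell_areas, total_pop)
    return [c * w * a for w, a in zip(weights, cell_areas)]


# Hypothetical toy data: four cells of 0.25 km^2 sharing 1000 inhabitants.
toy_weights = [1.0, 0.8, 0.5, 0.2]
toy_areas = [0.25, 0.25, 0.25, 0.25]
toy_cells = cell_populations(toy_weights, toy_areas, total_pop=1000)
```

Under these assumptions the cell populations sum to the zone total, so the pycnophylactic property holds by construction.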
$$ \mathrm{POP}(b) = A_b \, \frac{\mathrm{POP}(C)}{\sum_{i=1}^{k} A_i} \tag{5} $$
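Equation (5) can be read as POP(b) = A_b · POP(C) / (A_1 + … + A_k). Assuming that A_b is the footprint area of building b and that the sum runs over the k buildings in cell C (the symbols are not defined in this excerpt), a minimal sketch is:

```python
def population_per_building(cell_pop, footprint_areas):
    """Split a cell's population across its buildings by footprint area.

    cell_pop:        population estimated for the cell, POP(C)
    footprint_areas: footprint area A_i of each building in the cell
    """
    total_area = sum(footprint_areas)
    return [cell_pop * a / total_area for a in footprint_areas]


# Hypothetical toy data: 120 inhabitants, three buildings (areas in m^2).
building_pops = population_per_building(120, [100.0, 200.0, 300.0])
```

Each building receives a share proportional to its footprint, and the shares sum to the cell population.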
5. Experimentation
The experimentation has two objectives: (1) investigate the impact of two factors on the
reliability of the estimates – the choice of POI to determine control points and the
decaying factor’s value; and (2) compare the reliability of the population estimates with
other state-of-the-art methods.
It is not possible to experimentally verify whether the population estimates at the
building level are accurate, since no official data is available at that level, due to privacy
concerns. However, it is possible to verify the reliability of the method at the census block
level, which is the finest level of detail available. Since blocks are generally composed of a
small number of buildings, the reliability assessments are still worth consideration.
Therefore, in each experiment, estimates computed at the building level were aggregated
to obtain the estimated population for each census block. The estimated population is then
compared with the census population.
5.1. Results
Population estimates that were obtained using different categories of HDI POIs were
compared. This experiment tests the hypothesis formulated in Section 4: there exist some
categories of POI that are associated with areas where population density is higher. The
categories of POI ‘amenity: educational institution’ (comprising schools, universities, and
kindergartens) and ‘public transport: station’ were selected as HDI POIs, since they are
the most strongly correlated with population density (see Section 4). Three values for the
decaying factor were also tested: q = 0.3, q = 0.5, and q = 0.7. The results were assessed
with a linear regression (Figures 7 and 8).
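The regression-based assessment can be sketched with ordinary least squares. Two points are assumptions rather than facts from the text: that estimated population is regressed on census population, and that the reported standard error is the standard error of the estimate. The function name and data are hypothetical.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared, standard_error_of_estimate).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    r_squared = 1.0 - ss_res / ss_tot
    std_err = (ss_res / (n - 2)) ** 0.5   # n - 2 degrees of freedom
    return slope, intercept, r_squared, std_err


# Hypothetical toy data for four census blocks.
census = [100, 200, 300, 400]
estimated = [110, 190, 310, 390]
slope, intercept, r_squared, std_err = simple_linear_regression(census, estimated)
```

A slope close to 1 with a high coefficient of determination indicates that the estimates track the census counts without systematic over- or underestimation.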
For the first category of POI, better results are achieved in terms of standard error
(369.2), R² (0.9274), and regression coefficient (1.007) with a slower decrease of the
population as we move away from control points (q = 0.3). Still, standard error and R² are
Figure 7. Simple linear regressions for HDI POI = ‘amenity: educational institution’: (a) q = 0.3,
(b) q = 0.5, (c) q = 0.7.
Figure 8. Simple linear regressions for HDI POI = ‘public transport: station’: (a) q = 0.3, (b)
q = 0.5, (c) q = 0.7.
somewhat high and show some lack of accuracy. Then, as q increases (the decay is faster
as we move away from control points), the reliability of the estimate also decreases
significantly and the regression coefficient moves away from 1.
For the second category of POI, better results are achieved in terms of standard error
(286.6), R² (0.9530), and regression coefficient (0.9849) with a value of q = 0.5, which
contrasts with the previous case. This demonstrates that the value of q should be
calibrated against the type of POI used to identify control points. It is possible that
population decreases more quickly around some types of place (i.e., the population is
more concentrated around these points) than others. However, in both cases, as q
increases, the slope of the regression line falls below 1, showing a slight
underestimation of the population. The worst results are obtained with q = 0.7. Overall, since the best
results are obtained with the second category of POI, it seems that in this experiment, the
category ‘public transport: station’ was more appropriate to identify control points.
In the second part of the experiment, while reporting further on experiments with
different categories of POI, the proposed method is compared with the following repre-
sentative state-of-the-art approaches:
(1) Areal weighting: population is distributed to the block level based on area
proportion only.
(2) Binary dasymetric mapping: population is distributed to the block level based
on area proportion and distinction between habitable and non-habitable areas.
LULC data from Urban Atlas was employed to determine habitable/non-habitable
areas. Urban fabric categories were considered as habitable, while other categories
were considered as non-habitable.
(3) Multi-class dasymetric mapping: the method described in Section 4.1 was
applied. Population was distributed evenly within areas covered by the same
LULC category.
(4) Interpolation by surface volume integration: population estimates are com-
puted using Equation (4) with q = 0.3 and control points obtained from ‘amenity:
educational institution’ (POIs that generated the most reliable results).
(5) Proposed method with binary instead of multi-class dasymetric mapping,
with q = 0.3 and control points obtained from ‘amenity: educational institution’.
(6) Interpolation modeling related variables with co-kriging: population density is
estimated for the 500 m2 grid using a co-kriging method, where the primary
variable is population density and the secondary variable is POI distribution
(tested with ‘amenity: educational institution’). Population estimates were
rescaled to satisfy Tobler’s pycnophylactic property.
The root mean square error (RMSE) and the mean absolute error (MAE), which are well-
known accuracy measures in the field of population density estimation (Liu et al. 2008,
Langford 2013), were measured (Table 3):
$$ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left(P_i^{estimated} - P_i^{observed}\right)^2}{n}} \tag{6} $$

$$ \mathrm{MAE} = \frac{1}{P} \sum_{i=1}^{n} \left| P_i^{estimated} - P_i^{observed} \right| \tag{7} $$
In these formulas, n is the number of census blocks, $P_i^{estimated}$ is the estimated population
in block i, $P_i^{observed}$ is the official census population in block i, and P is the total
population in Hamburg.
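The two measures can be sketched as follows, reading Equation (6) as the square root of the mean squared block error and Equation (7) as the sum of absolute block errors divided by the total population P. The sketch assumes that P equals the sum of the observed block populations; the function names and toy data are hypothetical.

```python
def rmse(estimated, observed):
    """Root mean square error over census blocks."""
    n = len(observed)
    return (sum((e - o) ** 2 for e, o in zip(estimated, observed)) / n) ** 0.5


def mae(estimated, observed):
    """Sum of absolute block errors, normalized by the total population P.

    P is assumed here to be the sum of the observed block populations.
    """
    total_pop = sum(observed)
    return sum(abs(e - o) for e, o in zip(estimated, observed)) / total_pop


# Hypothetical toy data for four census blocks.
observed_pop = [100, 200, 300, 400]
estimated_pop = [110, 190, 310, 390]
toy_rmse = rmse(estimated_pop, observed_pop)
toy_mae = mae(estimated_pop, observed_pop)
```

Because the MAE defined this way is normalized by the total population, it is a dimensionless share of misallocated people, while the RMSE is expressed in inhabitants per block.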
As expected, areal weighting performs poorly but is interesting to illustrate differ-
ences. There is no great difference between binary and multi-class dasymetric mapping,
showing that indeed few people live in non-urban fabric areas. However, binary dasy-
metric mapping slightly outperforms interpolation by surface volume integration, which
was also reported (although with a more pronounced difference) in Langford (2013). The
corroboration of these results suggests that modeling population distribution with a
decaying function may be oversimplifying, which contrasts with results obtained in
Zhang and Qiu (2011). It also questions the reliability of POIs as a unique source of
ancillary data (while in the proposed approach, it is coupled with LULC data). The
comparison of interpolation by surface volume integration with the proposed method
(with the same category of POIs being used to ensure a fair comparison) also suggests that
the multi-class dasymetric mapping method based on LULC contributes to increasing the
reliability of the estimates. When comparing the results obtained with the proposed
method using binary or multi-class dasymetric mapping, it can be observed that the multi-class
approach does improve the estimates, but only slightly. Since binary dasymetric mapping is
easier to implement than multi-class dasymetric mapping, the former might therefore be
worth considering. Also, the purity of the samples used for estimating population densities
associated with each LULC category (85%) impacts the reliability of estimates obtained
with multi-class dasymetric mapping (although misclassification can also occur with
binary dasymetric mapping). Finally, better results are obtained with co-kriging
interpolation than with the proposed method with interpolation by surface volume
integration. Here, POIs were used as a secondary variable, but preliminary estimates obtained for
each category of LULC with multi-class dasymetric mapping (Section 4.1) were also used
to rescale the results of co-kriging. The results could be interpreted as another indication
that the interpolation by surface volume integration is too simplistic. Still, the difference
for both RMSE and MAE is not large. This may be worth considering, since the
co-kriging model is more complex to implement.
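The difference between areal weighting, binary dasymetric mapping, and multi-class dasymetric mapping can be sketched as follows. This is a generic illustration of the three techniques, not the implementation used in the experiments; the class names and relative densities are hypothetical.

```python
def multiclass_dasymetric(zone_population, cell_areas, cell_classes, class_density):
    """Redistribute a zone's population over its cells, weighting each cell
    by its area times the relative density of its LULC class."""
    weights = [a * class_density[c] for a, c in zip(cell_areas, cell_classes)]
    total = sum(weights)
    if total == 0:
        raise ValueError("zone has no cell with a non-zero density")
    return [zone_population * w / total for w in weights]

def binary_dasymetric(zone_population, cell_areas, cell_classes, populated_classes):
    """Binary case: a class is either populated (density 1) or empty (density 0)."""
    density = {c: (1.0 if c in populated_classes else 0.0) for c in set(cell_classes)}
    return multiclass_dasymetric(zone_population, cell_areas, cell_classes, density)

def areal_weighting(zone_population, cell_areas):
    """Baseline: population is spread uniformly, proportional to area only."""
    total = sum(cell_areas)
    return [zone_population * a / total for a in cell_areas]

# Hypothetical zone of 1000 people split into three cells.
areas = [1.0, 1.0, 2.0]
classes = ["urban", "forest", "urban"]

print(areal_weighting(1000, areas))
print(binary_dasymetric(1000, areas, classes, {"urban"}))
print(multiclass_dasymetric(1000, areas, classes, {"urban": 1.0, "forest": 0.05}))
```

When non-urban classes receive a density close to zero, the multi-class result is almost identical to the binary one, which is consistent with the small difference reported between the two.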
The results show that POIs are useful to identify control points and estimate variations
within a given category of LULC. Table 3 lists some categories of POI that performed
better compared to other POIs. It shows that for some categories of POI, the results are
not necessarily better than simple multi-class dasymetric mapping (e.g., hospitals and
doctors). It is difficult nevertheless to draw conclusions, since it could be due to the fact
that this type of place is not correlated with population density, or because data on this
type of place is incomplete in OSM. Results are also likely to differ between cities, as
spatial urban patterns vary.
5.2. Discussion
The comparison of results suggests possible areas for improvement. In addition, the
margin of error could be greater at the level of buildings, although this cannot be verified.
Although the volume of buildings could be used to generate estimates at this level, it would
still not be an absolute reference against which the proposed method could be compared.
An important cause of error is possibly the decaying function behind the interpolation
by surface volume integration. Although results show it helps to better reflect population
density variations within an area with uniform LULC classification, this model simplifies
the population density distribution. The second issue with the model is that the decaying
factor q controls the extent of local influence. The experiments show that the choice of the
value of q should depend on the type of POI selected to determine control points, with
suitable values varying widely, between 0.3 and 0.7. To verify whether the population
decay is indeed correlated with the category of POI, further experiments must be
conducted where estimates obtained with the same categories of POI but different decay
values and in different regions would be compared. Such experiments would also have to
take into account the differences between the urban fabric patterns across regions.
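To illustrate how a decay factor controls the extent of local influence, the sketch below allocates a fixed population to cells according to their distance from a single control point. The exponential form exp(-q * d) is an assumption made for illustration only and is not necessarily the decay function of the proposed method; the distances and population are hypothetical.

```python
import math

def decay_weights(distances, q):
    """Relative density weight of each cell, assuming an exponential decay
    exp(-q * d) with distance d from the control point (illustrative form)."""
    return [math.exp(-q * d) for d in distances]

def allocate(population, distances, q):
    """Distribute a population over cells in proportion to their decay weights,
    so that the total is preserved."""
    weights = decay_weights(distances, q)
    total = sum(weights)
    return [population * w / total for w in weights]

# Three cells at increasing distance from one control point (hypothetical units).
distances = [0.0, 1.0, 2.0]

for q in (0.3, 0.7):
    shares = allocate(100, distances, q)
    print(q, [round(s, 1) for s in shares])
```

Under this assumed form, a larger q concentrates more of the population in the cells closest to the control point, so the same POIs yield noticeably different estimates depending on the calibrated value.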
The choice of POI also affects the estimates. In addition, the completeness of the POI
data set is an issue, as well as semantic accuracy. Since POI descriptions are provided by
volunteers, inaccurate descriptions are likely to occur. Until further research is done to
develop tools to frame the volunteer production of POIs (such as semantic annotation for
POI, see Scioscia et al. 2013) and until such tools come into common use, appropriate
filtering of POI data is required. An alternative could be to use POIs from official or
commercial data sets.
Errors in LULC classification are likely to impact population estimates (Maantay et al.
2007, Liu et al. 2008, Lu et al. 2010, Langford 2013), a source of error that can hardly be
avoided. In addition, in different phases of the method, approximations were made, such
as considering cells mostly covered by a category of LULC as entirely covered by this
category, or considering buildings as entirely part of a cell while they were only partly
included in this cell. Also, data on types of building could not be used (e.g., to discard
industrial or commercial buildings) since building attributes are not always available in
OSM. An additional parameter that affects the results is grid resolution. Further experi-
ments are needed to assess its impact on the reliability of estimates. A coarser grid would
affect dasymetric mapping, because it would negatively impact the purity of population
density samples. A finer grid may produce more accurate results by allowing finer
distinctions. However, a finer grid may also have a negative effect with more buildings
being split between two or more cells.
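The effect of grid resolution on the purity of cells can be sketched as follows. Each cell is assigned the LULC category that covers most of it, and the purity is the share of the cell actually covered by that category. The category names and areas are hypothetical, and the functions are an illustration rather than the procedure used in the experiments.

```python
def dominant_class(cover):
    """cover maps an LULC category to the area it covers within one cell.
    Returns the dominant category and the purity (its share of the cell)."""
    total = sum(cover.values())
    label = max(cover, key=cover.get)
    return label, cover[label] / total

def merge_cells(cells):
    """Combine several fine cells into one coarser cell by summing areas."""
    merged = {}
    for cover in cells:
        for label, area in cover.items():
            merged[label] = merged.get(label, 0.0) + area
    return merged

# Two neighbouring fine cells (hypothetical areas in square metres).
fine_cells = [
    {"residential": 90.0, "green": 10.0},
    {"industrial": 80.0, "residential": 20.0},
]

for cell in fine_cells:
    print(dominant_class(cell))

# The same area seen as a single coarser cell.
print(dominant_class(merge_cells(fine_cells)))
```

In this example the two fine cells are fairly pure, while the merged cell is labelled residential although a large part of it is industrial, which is the loss of purity a coarser grid would introduce.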
population distribution. A case study with the city of Hamburg, Germany, has been
conducted. Population estimates obtained using different categories of POI were com-
pared. The method was also compared with state-of-the-art methods. The results show that
although there are areas for improvement, POIs can be considered as interesting ancillary
data in the absence of three-dimensional data on buildings. Of note still is that VGI data,
including OSM, is not yet widely available or well-developed in all regions of the world.
Similarly, LULC data from Urban Atlas is available only in Europe and for urban areas
with a population of more than 100,000 people. Although other LULC data sets are
available for other countries, such data is not consistently available, e.g., in developing
countries, most likely limiting the approach presented in this paper to urban regions in
developed countries.
Further research is also required to deal with the VGI quality issues. Although the aim
of the proposed approach is to improve accessibility by relying on free data, it cannot fully
achieve this goal until VGI becomes more common. Still, this research opens an inter-
esting avenue for other studies to be conducted in the future using VGI.
The possible causes of error that were identified include the incompleteness or spatial
and thematic inaccuracy of data on POIs in OSM. The inappropriateness of some POIs to
estimate population distribution (where some categories of POI may not be correlated
with population density) may vary according to the city, since different cities are char-
acterized by different urban patterns. In this regard, future work will be conducted to
further investigate patterns of POIs that reflect spatial population distribution. A more
comprehensive investigation of the issue also requires comparing the results with those obtained
using commercially produced or official POIs and linking them to various indicators of POI quality.
Another area of concern is the impact of the decaying function. Calibration is needed
whenever the method is applied to a different region or with different POIs.
Additional ancillary data, such as distance from roads, may help determine how popula-
tion density is decaying away from a POI. Additionally, further research on correlation
between types of POI, structure of urban fabric, and population decay may be conducted
to improve the approach. The surface volume integration may also be inappropriate,
because it measures the density in a cell according to its distance from the nearest control
point. Taking into account the distribution of control points surrounding a cell rather than
a single control point may produce better results. Also, better estimates were obtained
with co-kriging interpolation, where the problem of related variables is taken into account.
A hybrid approach is worth consideration.
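The difference between weighting a cell by its nearest control point only and by all surrounding control points can be sketched as below. The exponential decay and the summation over control points are assumptions made for illustration; they are one possible way to account for the distribution of control points, not a method evaluated in this paper. Coordinates are hypothetical.

```python
import math

def nearest_point_weight(cell, control_points, q):
    """Weight based only on the distance to the nearest control point."""
    d = min(math.dist(cell, p) for p in control_points)
    return math.exp(-q * d)

def all_points_weight(cell, control_points, q):
    """Weight that accumulates the (assumed exponential) influence of
    every control point around the cell."""
    return sum(math.exp(-q * math.dist(cell, p)) for p in control_points)

# One isolated control point and a cluster of three (hypothetical coordinates).
control_points = [(1.0, 0.0), (5.0, 0.0), (5.0, 1.0), (6.0, 0.0)]
cell_a = (0.0, 0.0)  # next to the isolated point
cell_b = (4.0, 0.0)  # next to the cluster

q = 0.5
print(nearest_point_weight(cell_a, control_points, q),
      nearest_point_weight(cell_b, control_points, q))
print(all_points_weight(cell_a, control_points, q),
      all_points_weight(cell_b, control_points, q))
```

Both cells lie at the same distance from their nearest control point and therefore receive the same nearest-point weight, whereas the accumulated weight is higher for the cell next to the cluster.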
Acknowledgement
We wish to thank the anonymous reviewers for their valuable contribution.
Funding
This work was conducted for the GRIPS Project and supported by the German Federal Ministry of
Education and Research (BMBF).
References
Aubrecht, C., Ungar, J., and Freire, S., 2011. Exploring the potential of volunteered geographic
information for modeling spatio-temporal characteristics of urban population. In: Proceedings of
7VCT, 11–13 October, Lisbon.
Bakillah, M., et al., 2013a. Semantic interoperability of sensor data with volunteered geographic
information: a unified model. ISPRS International Journal of Geo-Information, 2 (3), 766–796.
doi:10.3390/ijgi2030766
Bakillah, M., et al., 2013b. A dynamic and context-aware semantic mediation service for discover-
ing and fusion of heterogeneous sensor data. Journal of Spatial Information Science, 6,
155–185.
Bereuter, P. and Weibel, R., 2010. Generalisation of point data for mobile devices: a problem-
oriented approach. In: Proceedings of the 13th workshop on progress in generalisation and
multiple representation, September 12–13, Zurich. Bern: International Cartographic Association
commission, 1–8.
Bishop, C.M., 2006. Pattern recognition and machine learning (information science and statistics),
vol. 1. Secaucus, NJ: Springer-Verlag New York.
Bishr, M. and Mantelas, L., 2008. A trust and reputation model for filtering and classifying
knowledge about urban growth. GeoJournal, 72 (3–4), 229–237. doi:10.1007/s10708-008-
9182-4
Coleman, D.J., 2010. Volunteered geographic information in spatial data infrastructure: an early
look at opportunities and constraints. In: Spatially enabling society: research, emerging trends
and critical assessment. Leuven University Press.
Coleman, D.J., Sabone, B., and Nkhwanana, N., 2010. Volunteering geographic information to
authoritative databases: linking contributor motivations to program effectiveness. Geomatica, 64
(1), 383–396.
Cromley, R.G., Hanink, D.M., and Bentley, G.C., 2012. A quantile regression approach to areal
interpolation. Annals of the Association of American Geographers, 102 (4), 763–777.
doi:10.1080/00045608.2011.627054
De Longueville, B., et al., 2010. Digital Earth’s nervous system for crisis events: real-time sensor
web enablement of volunteered geographic information. International Journal of Digital Earth,
3 (3), 242–259. doi:10.1080/17538947.2010.484869
Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society B, 39 (1), 1–38.
Eicher, C.L. and Brewer, C.A., 2001. Dasymetric mapping and areal interpolation: implementation
and evaluation. Cartography and Geographic Information Science, 28 (2), 125–138.
doi:10.1559/152304001782173727
Flanagin, A. and Metzger, M., 2008. The credibility of volunteered geographic information.
GeoJournal, 72 (3–4), 137–148. doi:10.1007/s10708-008-9188-y
Flowerdew, R. and Green, M., 1994. Areal interpolation and types of data. In: S. Fotheringham and
P. Rogerson, eds. Spatial analysis and GIS. London: Taylor and Francis, 121–146.
Goetz, M., Lauer, J., and Auer, M., 2012. An algorithm-based methodology for the creation of a
regularly updated global online map derived from volunteered geographic information. In:
C.-P. Rückemann and B. Resch, eds. Proceedings of the fourth international conference on
advanced geographic information systems, applications, and services, January 30–February 4,
Valencia, 50–58.
Goetz, M. and Zipf, A., 2012. Towards defining a framework for the automatic derivation of 3D
CityGML models from volunteered geographic information. International Journal of 3D
Information Modeling, 1 (2), 1–16. doi:10.4018/ij3dim.2012040101
Goodchild, M.F., 2007. Citizens as sensors: the world of volunteered geography. GeoJournal, 69
(4), 211–221. doi:10.1007/s10708-007-9111-y
Goodchild, M.F. and Lam, N.S.-N., 1980. Areal interpolation: a variant of the traditional spatial
problem. Geo-Processing, 1, 297–312.
Griffith, D.A., 2013. Estimating missing data values for georeferenced Poisson counts.
Geographical Analysis, 45 (3), 259–284. doi:10.1111/gean.12015
Griffith, D.A. and Can, A., 1996. Spatial statistical/econometric version of simple urban population
density models. In: S.L. Arlinghaus and D.A. Griffith, eds. Practical handbook of spatial
statistics. Boca Raton, FL: CRC Press.
Haklay, M., 2010. How good is volunteered geographical information? A comparative study of
OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and
Design, 37 (4), 682–703. doi:10.1068/b35097
Harvey, J.T., 2002. Estimating census district populations from satellite imagery: some approaches
and limitations. International Journal of Remote Sensing, 23 (10), 2071–2095. doi:10.1080/
01431160110075901
Hawley, K. and Moellering, H., 2005. A comparative analysis of areal interpolation methods.
Cartography and Geographic Information Science, 32 (4), 411–423. doi:10.1559/
152304005775194818
Jackson, S.P., et al., 2013. Assessing completeness and spatial error of features in volunteered
geographic information. ISPRS International Journal of Geo-Information, 2 (2), 507–530.
doi:10.3390/ijgi2020507
Jokar Arsanjani, J., et al., 2013. Toward mapping land-use patterns from volunteered geographic
information. International Journal of Geographical Information Science. doi:10.1080/
13658816.2013.800871
Kyriakidis, P., 2004. A geostatistical framework for area-to-point spatial interpolation. Geographical
Analysis, 36 (3), 259–289. doi:10.1111/j.1538-4632.2004.tb01135.x
Lam, N.S., 1983. Spatial interpolation methods: a review. Cartography and Geographic Information
Science, 10 (2), 129–150. doi:10.1559/152304083783914958
Langford, M., 2006. Obtaining population estimates in non-census reporting zones: an evaluation of
the 3-class dasymetric method. Computers, Environment and Urban Systems, 30 (2), 161–180.
doi:10.1016/j.compenvurbsys.2004.07.001
Langford, M., 2013. An evaluation of small area population estimation techniques using open access
ancillary data. Geographical Analysis, 45 (3), 324–344. doi:10.1111/gean.12012
Leyk, S., Nagle, N.N., and Buttenfield, B.P., 2013a. Maximum entropy dasymetric modeling for
demographic small area estimation. Geographical Analysis, 45 (3), 285–306. doi:10.1111/
gean.12011
Leyk, S., et al., 2013b. Establishing relationships between parcel data and land cover for demo-
graphic small area estimation. Cartography and Geographic Information Science, 40 (4),
305–315. doi:10.1080/15230406.2013.782682
Lin, J., et al., 2013. Evaluating the use of publicly available remotely sensed land cover data for
areal interpolation. GIScience & Remote Sensing, 50 (2), 212–230.
Liu, X.H., Kyriakidis, P.C., and Goodchild, M.F., 2008. Population‐density estimation using regres-
sion and area‐to‐point residual kriging. International Journal of Geographical Information
Science, 22 (4), 431–447. doi:10.1080/13658810701492225
Lo, C.P., 2008. Population estimation using geographically weighted regression. GIScience &
Remote Sensing, 45 (2), 131–148. doi:10.2747/1548-1603.45.2.131
Lu, Z., et al., 2010. Population estimation based on multi-sensor data fusion. International Journal
of Remote Sensing, 31 (21), 5587–5604. doi:10.1080/01431161.2010.496801
Lwin, K.K., 2010. Online micro-spatial analysis based on GIS estimated building population: a
case of Tsukuba City. Thesis (PhD). Graduate School of Life and Environmental Sciences,
University of Tsukuba.
Lwin, K.K. and Murayama, Y., 2010. Development of GIS tool for dasymetric mapping.
International Journal of Geoinformatics, 6 (1), 8–11.
Maantay, J.A., Maroko, A.R., and Herrmann, C., 2007. Mapping population distribution in the
urban environment: the cadastral-based expert dasymetric system (CEDS). Cartography and
Geographic Information Science, 34 (2), 77–102. doi:10.1559/152304007781002190
Markoff, J. and Shapiro, G., 1973. The linkage of data describing overlapping geographical units.
Historical Methods Newsletter, 7 (1), 34–46. doi:10.1080/00182494.1973.10112670
Mennis, J., 2003. Generating surface models of population using dasymetric mapping. The
Professional Geographer, 55 (1), 31–42.
Mennis, J., 2009. Dasymetric mapping for estimating population in small areas. Geography
Compass, 3 (2), 727–745. doi:10.1111/j.1749-8198.2009.00220.x
Mooney, P. and Corcoran, P., 2011. Can volunteered geographic information be a participant in
eEnvironment and SDI? In: J. Hrebicek, G. Schimak and R. Denzer, eds. Environmental
Software Systems. Frameworks of eEnvironment, 9th IFIP WG 5.11 International Symposium,
ISESS 2011, IFIP Advances in Information and Communication Technology, vol. 359, 27–29
June, Brno. Berlin: Springer, 115–122.
Nagle, N.N., et al., 2014. Dasymetric modeling and uncertainty. Annals of the Association of
American Geographers, 104 (1), 80–95. doi:10.1080/00045608.2013.843439
Neis, P., Zielstra, D., and Zipf, A., 2011. The street network evolution of crowdsourced maps:
OpenStreetMap in Germany 2007–2011. Future Internet, 4 (1), 1–21. doi:10.3390/fi4010001
Reibel, M. and Agrawal, A., 2007. Areal interpolation of population counts using pre-classified land
cover data. Population Research and Policy Review, 26 (5), 619–633. doi:10.1007/s11113-007-
9050-9
Rodrigues, F., et al., 2012. Automatic classification of points-of-interest for land use analysis. In:
Tobler, W., 1979. Smooth pycnophylactic interpolation for geographical regions. Journal of the
American Statistical Association, 74 (367), 519–530. doi:10.1080/01621459.1979.10481647
Tulloch, D., 2008. Is VGI participation? From vernal pools to video games. GeoJournal, 72 (3–4),
161–171. doi:10.1007/s10708-008-9185-1
Ural, S., Hussain, E., and Shan, J., 2011. Building population mapping with aerial imagery and GIS
data. International Journal of Applied Earth Observation and Geoinformation, 13 (6), 841–852.
doi:10.1016/j.jag.2011.06.004
Wright, J.K., 1936. A method of mapping densities of population: with Cape Cod as an example. The
Geographical Review, 26 (1), 103–110. doi:10.2307/209467
Wu, C. and Murray, A.T., 2005. A cokriging method for estimating population density in urban
areas. Computers, Environment and Urban Systems, 29 (5), 558–579. doi:10.1016/j.
compenvurbsys.2005.01.006
Wu, S., Wang, L., and Qiu, X., 2008. Incorporating GIS building data and census housing statistics
for subblock-level population estimation. The Professional Geographer, 60 (1), 121–135.
doi:10.1080/00330120701724251
Wu, S.-S., Qiu, X., and Wang, L., 2005. Population estimation methods in GIS and remote sensing:
a review. GIScience and Remote Sensing, 42 (1), 80–96. doi:10.2747/1548-1603.42.1.80
Xie, Y., 1995. The overlaid network algorithms for areal interpolation problem. Computers,
Environment and Urban Systems, 19 (4), 287–306. doi:10.1016/0198-9715(95)00028-3
Yang, X., et al., 2012. Preliminary mapping of high-resolution rural population distribution based on
imagery from Google Earth: a case study in the Lake Tai basin, eastern China. Applied
Geography, 32 (2), 221–227. doi:10.1016/j.apgeog.2011.05.008
Zandbergen, P. and Ignizio, D., 2010. Comparison of dasymetric mapping techniques for small-area
population estimates. Cartography and Geographic Information Science, 37 (3), 199–214.
doi:10.1559/152304010792194985
Zhang, C. and Qiu, F., 2011. A point-based intelligent approach to areal interpolation. The
Professional Geographer, 63 (2), 262–276. doi:10.1080/00330124.2010.547792
Zielstra, D. and Zipf, A., 2010. A comparative study of proprietary geodata and volunteered
geographic information for Germany. In: The 13th AGILE International Conference on
Geographic Information Science (AGILE 2010), 11–14 May, Guimarães.