Fine-Resolution Population Mapping Using OpenStreetMap



International Journal of Geographical Information Science, 2014
Vol. 28, No. 9, 1940–1963,

Mohamed Bakillah (a, b)*, Steve Liang (b), Amin Mobasheri (a), Jamal Jokar Arsanjani (a)
and Alexander Zipf (a)
(a) GIScience Research Group, Institute of Geography, University of Heidelberg, Heidelberg,
Germany; (b) Department of Geomatics Engineering, University of Calgary, Calgary, Canada
(Received 31 October 2013; final version received 18 March 2014)
Data on population at building level is required for various purposes. However, to
protect privacy, government population data is aggregated. Population estimates at
finer scales can be obtained through areal interpolation, a process where data from a
first spatial unit system is transferred to another system. Areal interpolation can be
conducted with ancillary data that guide the redistribution of population. For popula-
tion estimation at the building level, common ancillary data include three-dimensional
data on buildings, obtained through costly processes such as LiDAR. Meanwhile,
volunteered geographic information (VGI) is emerging as a new category of data
and is already used for purposes related to urban management. The objective of this
paper is to present an alternative approach for building level areal interpolation that
uses VGI as ancillary data. The proposed method integrates existing interpolation
techniques, i.e., multi-class dasymetric mapping and interpolation by surface volume
integration; data on building footprints and points-of-interest (POIs) extracted from
OpenStreetMap (OSM) are used to refine population estimates at building level. A case
study was conducted for the city of Hamburg and the results were compared using
different types of POIs. The results suggest that VGI can be used to accurately estimate
population distribution, but that further research is needed to understand how POIs can
reveal population distribution patterns.
Keywords: areal interpolation; OpenStreetMap; points-of-interest; population estimation;
volunteered geographic information

1. Introduction
Numerous activities, such as business development, demographic studies, disaster pre-
vention, and urban planning, require data on population at a fine resolution, including at
building level. However, to protect privacy, population statistics provided by national
census are aggregated and therefore not available at building level (Ural et al. 2011,
Langford 2013, Sridharan and Qiu 2013). Estimates at the building level must be
computed by disaggregating census population data with appropriate techniques.
Population estimates can be obtained through areal interpolation. Areal interpolation is
the process of transferring data from a first spatial unit system (called source units, usually
the census units) to another spatial unit system (called target units, e.g., cells of a raster
grid) (Sridharan and Qiu 2013). Areal interpolation can be conducted using ancillary data
to guide the redistribution of population, based on the observation that population
distribution is often correlated with other spatial variables (such as land use/land cover,

*Corresponding author. Email:

© 2014 Taylor & Francis

LULC). Still, the quality and the appropriateness of the ancillary data used influence the
accuracy of the estimation. Common types of ancillary data used for areal interpolation
include LULC data derived from satellite imagery, official cadastral or topographic
databases, official vector street network data (considering that people commonly reside
along residential streets), or Light Detection and Ranging (LiDAR) data.
Methods employed to estimate population at building level commonly use LiDAR
data to estimate building volume or height (Lu et al. 2010, Sridharan and Qiu 2013).
Building surface can also be used to assess population distribution; however, the lack of
information on the volume of buildings may generate underestimation or overestimation
of population (Harvey 2002), unless the height of buildings is relatively homogeneous (Lu
et al. 2010). However, LiDAR data is costly. Alternative methods that use affordable data
are needed.
This paper proposes a method for areal interpolation at building level that uses VGI.
To the best of our knowledge, this is the first attempt to use VGI for this task. VGI is
produced by volunteers through Web 2.0 applications rather than by traditional data
producers (Goodchild 2007). VGI is criticized due to potential data quality issues (Bishr
and Mantelas 2008, Flanagin and Metzger 2008, Jackson et al. 2013), but it can also
extend existing data sets with information that pertains to a level of detail that goes beyond
the capacity of ‘official’ producers (Tulloch 2008). VGI applications allow collecting
local knowledge, which usually cannot be gathered using traditional data collection
processes (Goodchild 2007). Consequently, numerous researchers in the field of
GIScience consider VGI as a valuable data source that can complement official and
commercial data (Goodchild 2007, De Longueville et al. 2010, Haklay 2010, Zielstra
and Zipf 2010, Mooney and Corcoran 2011). For instance, Song and Sun (2010) have
reported that the usage of VGI in urban management has increased. Coleman et al. (2010)
report that some state governments in Australia and Germany, as well as the US
Geological Survey, have employed VGI in their mapping programs. Commercial data
providers such as NAVTEQ and TomTom have used VGI to identify updates required for
their databases (Coleman 2010). One of the best-known examples of a VGI project is
OpenStreetMap (OSM), a collaborative mapping project that enables volunteers to pro-
duce and share maps (Neis et al. 2011). OSM has already been used for various purposes,
including the automatic derivation of three-dimensional CityGML models (Goetz and Zipf
2012), road-based travel recommendation systems (Sun et al. in press), and mapping land-
use patterns (Jokar Arsanjani et al. 2013). In OSM, users can describe features – such as
roads, water bodies, and points-of-interest (POIs) – using ‘tags’ (Goetz et al. 2012). A
POI is a feature on the map that is given specific point coordinates (e.g., schools,
restaurants, churches, etc.) and whose location may be useful to know depending on the
context of the user (especially in tourism and for routing services). Because some types of
place can be correlated with a higher density of population (as argued in Zhang and Qiu
(2011)), POIs can be useful to refine population estimates provided that a correlation can
be established.
The proposed method is an alternative solution to estimate population at the building
level that relies solely on open data and VGI. The principle of the approach is to combine
three existing methods: multi-class dasymetric mapping (conducted with Urban Atlas
LULC data) is used to obtain population estimates for relatively large zones; then,
interpolation by surface volume integration (using OSM POIs to determine points of
high population density) is employed to uncover variations of population density within
these zones; finally, binary dasymetric mapping (conducted with OSM building layer
data) is used to redistribute population within buildings.

The originality of the proposed approach lies in investigating the use of VGI to obtain
population estimates. The hypothesis behind this method is that OSM data will be of
sufficient quality to support this task. Of note is that Strunck (2010), who has conducted
experiments on the growth of different OSM POI categories in Germany, observed that
OSM had about twice as many POIs as TomTom MultiNet data, a commercial data
source. However, one can still expect that VGI quality issues may affect the accuracy
of population estimates. Concerning POIs, specific concerns include spatial and thematic
accuracy. In terms of spatial accuracy, VGI contributors are not formally required to
generate data with an established level of spatial precision, and may have an inaccurate or
incomplete perception of the geographic phenomenon they describe (Bakillah et al.
2013a). Jackson et al. (2013) note that OSM contributors are subjected to no constraints
concerning the placement of a POI, and may rely on various devices to do so (GPS,
smartphone, etc.); therefore, one can expect significant variations in positioning. In terms
of thematic accuracy, Scheider et al. (2011) observed that the terms used by contributors
to describe geographic features can be ambiguous. Still, in OSM, contributors are
encouraged to use terms organized into a folksonomy, which contributes to the reduction
of naming heterogeneities. Despite such issues, successful usages of VGI POIs have been
reported, such as Rodrigues et al. (2013) who have relied on Yahoo POIs to estimate
disaggregated employment size.
The reliability of the results obtained with the proposed approach was investigated
with a case study in Hamburg, Germany, using different types of POIs. The proposed
method was also compared with other typical areal interpolation approaches. The results
show that while the choice and quality of POIs have an impact on estimates, using VGI to
perform areal interpolation is an interesting avenue.
This paper is organized as follows: related work is provided in the next section.
Section 3 presents the data used for the case study. The approach is explained in Section
4. Section 5 presents the experiments and a discussion of results. Section 6 concludes the
paper and outlines future work.

2. Related work
Areal weighting is the simplest type of areal interpolation method and it requires no
ancillary data (Markoff and Shapiro 1973, Goodchild and Lam 1980, Lam 1983). It is also
referred to as area weighted interpolation (Lwin 2010) or overlay operation based on
geometric properties (Wu et al. 2005). In this method, population estimates in target zones
are computed based only on the proportion of each source zone that overlaps with the
target zone (Goodchild and Lam 1980). It is based on the assumption that population is
uniformly distributed within the source zone. However, source zones in census rarely have
homogeneous population distributions. Therefore, population estimates based on areal
weighting are likely to differ from the reality because of this assumption. However, in the
absence of ancillary data, the areal weighting method may be useful. An improved version
of areal weighting, called target-density weighting (TDW), was proposed by Schroeder
(2007). TDW is a method employed to interpolate population data from one set of census
units to another (data from more than one census is required). TDW is based on the
assumption that if people lived in a given region when a first census was conducted, these
people are more likely to live there too (in a proportional manner) when a subsequent
census is conducted. TDW was extended into the cascading density weighting (CDW)
method to produce complete population time series; CDW repeats the TDW backwards
through a series of censuses (Schroeder 2010). Another example of an areal interpolation
method that does not require ancillary data is Tobler’s pycnophylactic interpolation
(Tobler 1979). This approach creates a smooth population density distribution for any
location (x, y) by minimizing the sum of squares of partial derivatives. The density
distribution respects Tobler’s pycnophylactic property, i.e., an important property where
the sum of people reported for each source zone is preserved during the interpolation.
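As an illustration of the areal weighting described at the start of this section, the following minimal sketch redistributes source-zone populations to target zones in proportion to overlap area. The function name, the dictionary layout, and the zone identifiers are hypothetical and chosen only for the example; the overlap areas would normally come from a GIS overlay operation.

```python
def areal_weighting(source_pop, source_area, overlap_area):
    """Estimate target-zone populations by areal weighting.

    source_pop:   {source_id: population}
    source_area:  {source_id: total area of the source zone}
    overlap_area: {(source_id, target_id): area of the intersection}
    Assumes population is uniformly distributed within each source zone.
    """
    target_pop = {}
    for (source_id, target_id), area in overlap_area.items():
        # Share of the source zone that falls inside this target zone.
        share = area / source_area[source_id]
        target_pop[target_id] = (
            target_pop.get(target_id, 0.0) + source_pop[source_id] * share
        )
    return target_pop


# Hypothetical example: two census zones, two target zones.
source_pop = {"A": 1000, "B": 500}
source_area = {"A": 10.0, "B": 5.0}
overlap_area = {("A", "T1"): 4.0, ("A", "T2"): 6.0, ("B", "T2"): 5.0}
estimates = areal_weighting(source_pop, source_area, overlap_area)
```

Because the overlaps of each source zone sum to its full area in this example, the total population is preserved, which is the pycnophylactic property mentioned above.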
Dasymetric mapping (also referred to as dasymetric modeling) is a technique where
additional data (called ancillary data) is employed to guide the redistribution of population
data at a finer level of resolution (Semenov-Tian-Shansky 1928, Wright 1936). Binary
dasymetric mapping (also referred to as filtered areal weighting) is the simplest type of
dasymetric mapping method (Eicher and Brewer 2001). It is called binary because source
zones are subdivided into populated and unpopulated areas. The areal weighting method
is then applied within the populated areas of source zones. Common sources used to
identify populated areas include:
● LULC data derived from satellite imagery (Mennis 2003, Reibel and Agrawal
2007, Cromley et al. 2012);
● county address points and parcels (Tapp 2010);
● vector street network data (Xie 1995, Hawley and Moellering 2005, Zandbergen
and Ignizio 2010);
● vector or raster maps provided by official mapping agencies (Langford 2013),
including topographic databases (Wu et al. 2008, Lwin and Murayama 2010);
● Google Map images (Yang et al. 2012).

The main drawback of binary dasymetric mapping is that it relies on the assumption that
the population is uniformly distributed within the populated areas (Maantay et al. 2007).
The multi-class dasymetric mapping method partly addresses this problem. The
territory is partitioned into several residential classes (e.g., high density, medium density,
low density) associated with different population densities (Eicher and Brewer 2001). To
estimate the population density associated with each residential class, empirical sampling
of population densities (Mennis 2003) and multivariate linear regression estimated with
ordinary least squares (Langford 2006) were proposed. Data sources used for multi-class
dasymetric mapping include:

● Cadastral databases (Maantay et al. 2007);

● Topographic databases, land-use zoning, and transportation network (Su et al.
● Remote-sensing data (Liu et al. 2008).

Multi-class dasymetric mapping may theoretically be more accurate than binary dasy-
metric mapping; however, it is similarly based on the assumption that population is evenly
distributed within an area covered by the same residential class. Lo (2008) has demonstrated
that the relation between population density and the type of LULC category
varies (referred to as the ‘problem of related variables’). This finding suggests that linear
regression alone is not appropriate to estimate population distribution, which was also
emphasized by other authors (Griffith and Can 1996). Indeed, Wu and Murray (2005)
explain that methods based on regression analysis assume that population density is
spatially independent. Still, identifying a meaningful statistical relation
between population density and other variables remains a challenge (Mennis 2009). To
address this issue, methods that integrate spatial autocorrelation have been proposed. For
example, Wu and Murray (2005) propose a co-kriging method for estimating population
density using census data and impervious surface proportion as secondary variable to
refine population estimates. Other authors suggest area-to-point kriging to disaggregate
residual population density values obtained with regression (Kyriakidis 2004, Liu et al.
2008). Griffith (2013) pointed out that kriging and co-kriging involve variable transfor-
mation, which may introduce errors in final population estimates. He proposes a geo-
graphic areal unit imputation Poisson model specification that uses remotely sensed
images as well as spatial and temporal autocorrelation to circumvent this issue. The
Poisson model has also been used by Leyk et al. (2013b) to represent relations between
variables. Another approach to address the issue of related variables is maximum entropy
dasymetric modeling, where the maximum entropy principle is used to model the statis-
tical relation between population density and other variables, such as attributes of house-
holds (Leyk et al. 2013a). The expectation–maximization (EM) algorithm, which was first
introduced by Dempster et al. (1977), iteratively applies a feedback loop to refine
population density estimates; it is also used to perform areal interpolation with related
variables (Flowerdew and Green 1994). More recently, it was coupled with geographically
weighted regression (GWR) to allow the densities of different categories of control zones
(e.g., commercial, residential) to have non-constant ratios among the different source units
(Schroeder and Van Riper 2013).
Interpolation by surface volume integration is another type of technique that relies on
a set of points that are considered as places where population density is expected to be
high, compared to the surroundings. These points are referred to as ‘control points’. Zhang
and Qiu (2011) used schools as control points. It is assumed that the population decreases
according to a decaying function of the radial distance from control points. Then,
estimates are smoothed at intersections between two cells. The method proposed in this
paper investigates the use of this technique with VGI.
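To make the idea of a decaying function around control points concrete, here is a minimal sketch. The exponential kernel, the `scale` parameter, and the use of the nearest control point are assumptions made purely for illustration; they are not the function used by Zhang and Qiu (2011) or in this paper. The weights are normalized so that the zone total is preserved.

```python
import math


def distribute_by_decay(zone_population, cell_centers, control_points, scale=500.0):
    """Split a zone's population among cells using a distance-decay weight.

    Assumption: weight = exp(-d / scale), where d is the distance from the
    cell center to its nearest control point (hypothetical kernel).
    """
    weights = []
    for cx, cy in cell_centers:
        # Radial distance to the nearest control point.
        d = min(math.hypot(cx - px, cy - py) for px, py in control_points)
        weights.append(math.exp(-d / scale))
    total = sum(weights)
    # Normalizing by the total weight keeps the zone population unchanged.
    return [zone_population * w / total for w in weights]


# Hypothetical example: three cells at increasing distance from one control point.
cells = [(0.0, 0.0), (500.0, 0.0), (1000.0, 0.0)]
controls = [(0.0, 0.0)]
cell_pop = distribute_by_decay(900.0, cells, controls)
```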
Also worth mentioning, while not specifically addressed in this paper, is the more
recent research on uncertainty in dasymetric mapping. Approaches that measure the
uncertainty of population estimates include those of Schroeder (2007), Kyriakidis
(2004), Wu and Murray (2005), and Liu et al. (2008). Nagle et al. (2014) explain that
these approaches do quantify the uncertainty of population estimates, but do not incorpo-
rate uncertainty associated with ancillary data. To address this issue, they propose the
penalized maximum entropy dasymetric model (P–MEDM), which is an extension of the
Maximum Entropy (ME) technique that explicitly models the uncertainty of input data
and propagates it through the population estimation process.
The above-mentioned methods provide estimates for target units that include several
buildings, but not at building level. Some approaches have been proposed to estimate
population using three-dimensional data on buildings. Lu et al. (2010) have used regres-
sion analysis to model the relation between population and the area or volume of the
buildings obtained with LiDAR data.
Sridharan and Qiu (2013) have developed a similar LiDAR-based approach. Ural
et al. (2011) have studied population estimation at the building level using aerial images,
digital terrain, and surface models to determine building footprints and heights. Then, city
zoning maps were used to classify buildings as residential and non-residential. These
approaches all rely on costly data to evaluate buildings’ volume. The method proposed in
this paper circumvents this obstacle by using free data (VGI). To the best of our knowl-
edge, VGI has not been used yet for population estimation. Langford (2013) recently
claimed to have conducted the first experimentation of dasymetric mapping using open
data (in this case, provided by UK government), but mentioned that while being an
interesting alternative, OSM has not been used. Similarly, Lin et al.’s evaluation of
publicly available data for areal interpolation focuses on remote-sensing LULC data
produced by official sources (2013). Aubrecht et al. (2011) have explored the potential
of VGI for modeling the spatiotemporal characteristics of urban population, but not to
produce population estimates.

3. Study area and data

The study area is the city of Hamburg, Germany. Hamburg is the second largest city in
Germany. According to the 2011 census, this city had a total population of 1,746,813 and
a total area of 75,525 ha at that time, with an average density of 23 people/ha. The census
data is provided for 7 districts composed of 106 census blocks with varying sizes, ranging
from about 71 ha to 14,758 ha, occupied by approximately 3–412,000 people. Population
for each census block (subunit of a district) is also available. The Hamburg city region
mainly comprises medium to dense urban fabric north of the River Elbe, with agriculture
and forest areas at the periphery. The center of Hamburg is made of a mixture of urban
fabric of varying density, green urban areas, sport and leisure activity areas, and an
important portion occupied by port facilities.
Ancillary data used in this study include LULC data from Urban Atlas, the European
LULC data for large urban zones that have more than 100,000 inhabitants (http://www. (Figures 1 and 2).

Figure 1. Classified LULC of Hamburg urban zone (from Urban Atlas).

Figure 2. Categories of LULC (from Urban Atlas).

Urban Atlas data is also from 2011. The LULC categories used by Urban Atlas do not
identify residential areas. Instead, areas that are more likely to contain the majority of
residential buildings are in the continuous and discontinuous urban fabric categories,
which also include commercial and industrial types of building. Meanwhile, other cate-
gories may also contain residential buildings. Therefore, it was decided to consider all
categories of LULC as habitable areas, except for the water category.
To redistribute population to individual buildings, data on areas and location of
buildings is needed. For this purpose, the OSM building layer was used to extract building
footprints and coordinates. Because of the collaborative nature of OSM, the data on
buildings has been created at different dates. While, in OSM, the type of building can
be identified with attributes, contributors do not always provide a value for this attribute,
so one cannot rely on this type of data to identify building use. POIs, which are used to
identify high-density areas, were also extracted from OSM (and therefore created at
different periods of time). OSM POIs describe geo-located features (point, line, or
polygon). The type of place is specified through tags, which are made of a (key, value)
pair, where the ‘key’ is a category of features and the ‘value’ is a subcategory of features.
An example of a tag describing a POI is (amenity, postbox). In OSM, contributors are
encouraged to use the vocabulary suggested in OSM Feature Wiki ( when adding POIs. POIs created by a contributor can
be edited by other contributors.

4. Proposed method: areal interpolation at building level using VGI POIs

The flowchart of the method is illustrated in Figure 3. Ancillary data is indicated on the
right side, while the steps of the method appear on the left side. The principle of the method
is that the population is gradually disaggregated from larger to smaller spatial units:

(1) Population/district (as given by census data)
(2) Population/area covered by a given LULC category
(3) Population/cell
(4) Population/building

Figure 3. Flowchart of the method.

To disaggregate the population/district to population/area covered by a given LULC
category, multi-class dasymetric mapping based on LULC data from Urban Atlas was
used. The main contribution of the method is in the next step, where population/area
covered by a given LULC category is disaggregated to population/cell. Here, the cells are
basic spatial units created by generating a grid over the territory. To obtain population/cell,
a new interpolation by surface volume integration method is deployed; it estimates the
population density per cell based on the spatial distribution of relevant POIs. The POIs
retrieved from OSM are classified by category using the ID3 unpruned machine learning
decision tree. The next steps are based on the hypothesis that some categories of POIs are
associated with a higher density of population. These POI categories are selected based on
the correlation between the density of a given category of POI and the density of
population. Then, a quadtree procedure is applied to localize POIs and infer the presence
of control points. Population in cells is computed with a decaying function centered on
control points. Finally, the OSM building layer data is used to compute population per
building based on their surface.
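The final step, in which the population of a cell is redistributed to buildings according to their surface, can be sketched as follows. This is a minimal illustration under the assumption that each building is assigned to exactly one cell and that footprint area is the only weight; the function and identifiers are hypothetical.

```python
def allocate_to_buildings(cell_population, footprint_areas):
    """Distribute a cell's population to buildings in proportion to footprint area.

    footprint_areas: {building_id: footprint area in square meters}
    Space outside buildings receives no population.
    """
    total_area = sum(footprint_areas.values())
    if total_area == 0:
        # No built-up surface in the cell: nothing can be allocated.
        return {building_id: 0.0 for building_id in footprint_areas}
    return {
        building_id: cell_population * area / total_area
        for building_id, area in footprint_areas.items()
    }


# Hypothetical example: 120 people in a cell containing three buildings.
building_pop = allocate_to_buildings(120.0, {"b1": 200.0, "b2": 100.0, "b3": 300.0})
```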

4.1. Multi-class dasymetric mapping with Urban Atlas LULC data

Let u be a given LULC category and d a district of the city. Consider that the territory is
partitioned into a grid of cells. The objective of this phase is to estimate, for each district
d, the average number of people that should be allotted to a cell covered by a given LULC
category u. This amount is denoted POP_ud. The value of POP_ud is specific to each
district, because a given category of LULC does not necessarily have
the same population density in one district as in another. For example, the areas classified as
‘discontinuous medium density urban fabric’ in the north of Hamburg have a lower
population density than the same type of area in the center of Hamburg. The procedure
to compute POP_ud using multi-class dasymetric mapping is described in Mennis (2003).
The grid resolution (500 m²) was chosen so that the majority of cells are covered by
only one category of LULC. Following Mennis (2003), the average population density
associated with each category of LULC was derived by overlapping LULC data with
census data and sampling the census blocks that were entirely or almost entirely
covered (at least 85%) by a single LULC category. The estimates generated by the multi-class
dasymetric mapping assume that the population distribution is uniform within the same
LULC category in a district. In the following, extra steps are taken to further refine the
population estimates.
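The allocation described in this subsection can be sketched as follows. This is a simplified reading of multi-class dasymetric mapping, not necessarily the exact procedure of Mennis (2003): the sampled class densities act as relative weights, and the district total is rescaled so that it is preserved. All names, areas, and density values are hypothetical.

```python
def multiclass_dasymetric(district_pop, class_area, class_density):
    """Allocate a district's population to LULC classes.

    class_area:    {lulc_class: area of the class within the district (ha)}
    class_density: {lulc_class: sampled representative density (people/ha)}
    Returns {lulc_class: population allocated to that class in the district}.
    """
    # Unscaled expectation: sampled density times area of the class.
    raw = {u: class_density[u] * area for u, area in class_area.items()}
    total = sum(raw.values())
    # Rescale so that the classes sum to the census count of the district.
    return {u: district_pop * value / total for u, value in raw.items()}


def population_per_cell(class_pop, class_area, cell_area):
    """Average population of a cell fully covered by each class (POP_ud)."""
    return {
        u: class_pop[u] * cell_area / class_area[u]
        for u in class_pop
        if class_area[u] > 0
    }


# Hypothetical district: 10,000 people, three LULC classes, 1 ha cells.
areas = {"continuous": 100.0, "discontinuous": 300.0, "water": 50.0}
densities = {"continuous": 60.0, "discontinuous": 20.0, "water": 0.0}
class_pop = multiclass_dasymetric(10000.0, areas, densities)
pop_ud = population_per_cell(class_pop, areas, cell_area=1.0)
```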

4.2. Using OSM POIs for interpolation by surface volume integration

The objective of this phase is to localize control points. Similar to other types of ancillary
data such as LULC, control points are places whose spatial distribution is related to that of
the population (Zhang and Qiu 2011). For example, there are some types of place, such as
schools, supermarkets, churches, etc., where population density is usually high compared
to the surroundings. This assumption can be tested by measuring the correlation between
the density of a category of place and the density of population. Once control points are
determined, it is assumed that the population decreases according to a given decaying
function of the radial distance from control points. In this paper, a new method to identify
control points is introduced. This method is based on POIs and assumes that some
categories of POIs are correlated with population density. First, POIs must be classified
to deal with naming heterogeneities. For this procedure, the ID3 machine learning
decision tree algorithm is used. A quadtree procedure is used to deal with high volumes of
POIs and control points are identified from the spatial distribution of these POIs. The
details of the procedure to derive control points from POIs are described below.

4.2.1. Retrieving POIs

First, OSM POIs in the city of Hamburg are gathered through OSM’s extended API. In
OSM, contributors are encouraged to use the vocabulary suggested in OSM Feature Wiki
( when creating POIs, but they do not
have to. The resulting naming heterogeneity complicates the retrieval of POIs that fall
within a given category.
POIs can be associated with nodes, ways, or polygons (combinations of ways). Since
POIs are used to identify control points, only those that can be reduced to points are used
(excluding POIs associated with roads). Polygon features (e.g., amenities) were replaced
by the centroid of the polygon. The initial data set obtained for the city of Hamburg
contains 31,593 POIs. Only the POIs that belong to a category making up a
significant percentage of the whole set of POIs (0.01%) were retained. Otherwise, some
‘categories’ of POIs would be instantiated by only a few objects, which would be too
scarce to be used to identify control points. The result is a set of 16,964
POIs. Table 1 shows the number of POIs categorized according to the value of the ‘key’.
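The two preprocessing operations described above, reducing polygon features to their centroid and discarding categories that are too rare, can be sketched as follows. The data layout (a list of `(category, (x, y))` pairs), the function names, and the threshold used in the example are assumptions for illustration; the centroid uses the standard shoelace formula for a simple, non-degenerate polygon in planar coordinates.

```python
from collections import Counter


def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    twice_area = 0.0
    cx = cy = 0.0
    for i in range(len(vertices)):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % len(vertices)]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Shoelace formula: area = twice_area / 2, centroid = sum / (6 * area).
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def filter_rare_categories(pois, min_share):
    """Keep only POIs whose category makes up at least min_share of all POIs."""
    counts = Counter(category for category, _ in pois)
    total = len(pois)
    kept = {c for c, n in counts.items() if n / total >= min_share}
    return [poi for poi in pois if poi[0] in kept]


# Hypothetical example: one dominant category and two rare ones.
pois = (
    [("amenity=bench", (0.0, 0.0))] * 98
    + [("amenity=studio", (1.0, 1.0))]
    + [("tourism=museum", polygon_centroid([(0, 0), (1, 0), (1, 1), (0, 1)]))]
)
retained = filter_rare_categories(pois, min_share=0.02)
```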

4.2.2. Classifying POIs

Owing to the above-mentioned naming heterogeneity, POIs need to be classified accord-
ing to predefined categories. POIs that were not specified using the OSM Map Features
vocabulary were classified using the ID3 implementation of the unpruned decision tree
(see Bishop 2006). The ID3 unpruned decision tree is a machine learning algorithm that
relies on a set of training data to classify objects according to predefined categories. The
ID3 unpruned decision tree was reported to perform well at classifying POIs in Rodrigues
et al. (2012). For the decision tree, the categorization of OSM Map Features was used,
since a large number of POIs already use this categorization. To determine whether
a POI belongs to a given category, a lexical and syntactic matcher that compares the
meaning of strings, which is described in Bakillah et al. (2013b), was employed. The
training data set was built by running the matcher and manually validating a random
subset of the matches.

Table 1. Main categories of POIs retrieved for Hamburg.

Key; value                                                             Quantity
Public transport; station (includes subway entrances and bus stops)    4705
Highway; traffic signals                                               3882
Historic; memorial                                                     3821
Amenity; restaurant, fast-food, & bars                                 2683
Highway; crossing                                                      2177
Amenity; bench                                                         1910
Amenity; parking                                                       1363
Amenity; telephone                                                     1273
Amenity; Postbox                                                       1266
Highway; Turning-circle                                                1169
Amenity; recycling                                                     806
Power; switch                                                          383
Amenity; school, university, kindergarten                              338
Amenity; fuel                                                          195
Highway; Motorway_link                                                 100
Man-made; elevator                                                     67
Amenity; Hospital, doctors                                             58
Tourism; museum                                                        41
Highway; speed_camera                                                  38
Amenity; studio                                                        30
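ID3 builds its tree by repeatedly choosing the attribute with the highest information gain. The sketch below shows only that split criterion on a toy training set; the attributes (`key`, `name_has_school`) and the category labels are hypothetical and do not reproduce the features or training data used in the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute that ID3 would select for the next split."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical training examples: attribute values per POI and its category.
rows = [
    {"key": "amenity", "name_has_school": True},
    {"key": "amenity", "name_has_school": True},
    {"key": "amenity", "name_has_school": False},
    {"key": "shop", "name_has_school": False},
]
labels = ["education", "education", "food", "food"]
chosen = best_attribute(rows, labels, ["key", "name_has_school"])
```

In a full ID3 implementation, the same selection would be applied recursively to each subset produced by the chosen split, without pruning.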

4.2.3. Selecting ‘high-density indicator’ POIs

POIs that are expected to be correlated with high population density are called ‘high-
density indicator’ (HDI) POIs. To identify HDI POIs, correlation between the occurrences
of a category of POI (e.g., amenity = bus station) and the population density in a census
block was investigated with Spearman’s correlation coefficient, a coefficient that measures
correlation whether linear or nonlinear. The equation for Spearman’s correlation coeffi-
cient is as follows:
$$
\rho = \frac{\sum_{i=1}^{n} \left(PD_i - \overline{PD}\right)\left(nb\_POI_i - \overline{nb\_POI}\right)}{\sqrt{\sum_{i=1}^{n} \left(PD_i - \overline{PD}\right)^2 \sum_{i=1}^{n} \left(nb\_POI_i - \overline{nb\_POI}\right)^2}} \qquad (1)
$$

In this equation, $n$ is the number of census blocks, $PD_i$ is the population density in census
block $i$, $\overline{PD}$ is the average population density, $nb\_POI_i$ is the number of POIs (of a given
category) in census block $i$, and $\overline{nb\_POI}$ is the average number of POIs of the given
category among all census blocks. Table 2 shows Spearman’s correlation coefficient
between the occurrences of some categories of POI and population density.
These results are used to select the types of POI that are considered in the determina-
tion of control points. The results of population disaggregation using different categories
of HDI POI for determining control points are reported in the experimentation section.
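The correlation test used to select HDI POIs can be sketched as follows. Spearman's coefficient is conventionally obtained by applying the product-moment formula of Equation (1) to the ranks of the observations rather than to the raw values, which is what this sketch does (ties receive their average rank). The input values are hypothetical and are not the Hamburg data.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: product-moment correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in rx) * sum((b - mean_y) ** 2 for b in ry)
    )
    return numerator / denominator


# Hypothetical census blocks: population density and POI counts per block.
population_density = [12.0, 30.0, 45.0, 80.0, 150.0]
poi_counts = [1, 4, 9, 16, 25]
rho = spearman(population_density, poi_counts)
```

Because the coefficient depends only on ranks, a monotonic but nonlinear relation such as the one in this example still yields a coefficient of 1.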

4.2.4. Determining control points with quadtree

Table 2. Correlation between selected categories of POIs and population density.

Key; value                                                             Correlation coefficient
Public transport; station (includes subway entrances and bus stops)    0.89
Highway; traffic signals                                               0.62
Historic; memorial                                                     0.55
Amenity; restaurant, fast-food & bars                                  0.28
Highway; crossing                                                      0.70
Amenity; bench                                                         0.68
Amenity; parking                                                       0.29
Amenity; telephone                                                     0.25
Amenity; Postbox                                                       0.80
Highway; Turning-circle                                                0.20
Amenity; recycling                                                     0.74
Power; switch                                                          0.12
Amenity; school, university, kindergarten                              0.82
Amenity; fuel                                                          0.31
Highway; Motorway_link                                                 0.08
Man-made; elevator                                                     0.12
Amenity; Hospital, doctors                                             0.59
Tourism; museum                                                        0.20
Highway; speed_camera                                                  −0.20
Amenity; studio                                                        −0.10

The quadtree is used to locate HDI POIs. This procedure is employed to reduce a large
volume of POIs to a representative subset that mirrors their density distribution. If a
quadtree procedure is not available, other techniques such as clustering algorithms could
be used instead. The quadtree is a spatial tree data structure in which each internal node
has four children and which represents the recursive subdivision of space into nested
quadrants (Samet 1984). The subdivision does not have to be even. Rather, it adapts to the
density of the spatial objects we intend to capture: a quadrant is subdivided into four new
quadrants (and four new nodes are created in the tree structure) as long as it contains more
than one spatial object. A quadrant that contains a single spatial object, or none, is no
longer divided (Figure 4).

Figure 4. Simplified graphical representation of a quadtree.
A variant of the quadtree that was used here is the bucket quadtree. Instead of
continuing the subdivision until a quadrant contains only one spatial object, the subdivision
continues until a quadrant contains no more than a predefined number of objects.
This avoids leaf nodes whose quadrants are too small. In this context, such quadrants would
generate too many control points, which would create an artificially uniform population
distribution.
In this approach, the spatial objects we intend to capture are HDI POIs. The bucket
quadtree procedure is applied. The predefined maximum number of objects per quadrant ranged
between 1 and 10, depending on the category of POI that was used. Then, to determine
the location of control points, a generalization procedure is applied to identify a single
point from all HDI POIs located in the quadrant. The generalization operation is a
midpoint aggregation (Bereuter and Weibel 2010). In the midpoint aggregation operation,
spatially and semantically close points (HDI POIs from the same category) are aggregated
into a single point, which is the mean center point (midpoint) of the HDI POIs. To
illustrate this, in Figure 5, midpoints are identified (circled in blue) in each quadrant,
considering train stations as HDI POIs. To find the midpoint MP of a series of points
<POI1, POI2, …, POIj, …, POIn> within a quadrant, the following distance is minimized:
Figure 5. Identification of control points from HDI POIs.


The midpoints constitute the control points that are used to compute population density in
grid cells.
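
The bucket quadtree and the midpoint aggregation can be sketched together as a short recursive function. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the `capacity` and `max_depth` parameters, and the sample coordinates are hypothetical, and `max_depth` is added only to stop the recursion when several POIs share the same coordinates.

```python
def control_points(points, bounds, capacity, max_depth=12):
    """Return one mean-centre control point per non-empty bucket-quadtree leaf.

    points   : list of (x, y) tuples for one POI category
    bounds   : (xmin, ymin, xmax, ymax) of the current quadrant
    capacity : largest number of POIs a leaf quadrant may hold
    """
    if not points:
        return []  # empty quadrant: not subdivided, yields no control point
    if len(points) <= capacity or max_depth == 0:
        n = len(points)
        # Midpoint aggregation: the mean centre of the POIs in this leaf.
        return [(sum(x for x, _ in points) / n, sum(y for _, y in points) / n)]
    xmin, ymin, xmax, ymax = bounds
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    # Keys are (is_east, is_north) flags for the four child quadrants.
    quadrants = {
        (False, False): (xmin, ymin, xmid, ymid),
        (True, False): (xmid, ymin, xmax, ymid),
        (False, True): (xmin, ymid, xmid, ymax),
        (True, True): (xmid, ymid, xmax, ymax),
    }
    buckets = {key: [] for key in quadrants}
    for x, y in points:
        buckets[(x >= xmid, y >= ymid)].append((x, y))
    result = []
    for key, sub_bounds in quadrants.items():
        result.extend(control_points(buckets[key], sub_bounds, capacity, max_depth - 1))
    return result

# Hypothetical POIs of one category inside a 10 x 10 study area.
pois = [(1, 1), (2, 2), (9, 9), (8, 8), (9, 1)]
midpoints = control_points(pois, (0, 0, 10, 10), capacity=2)
```

With `capacity=2`, the five sample POIs are reduced to three control points, one per occupied quadrant. A larger capacity produces fewer control points, each summarizing a wider area.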

4.2.5. Computing population density in grid cells

Phase 1 computed the average number of people that should be allotted to a cell covered
by a LULC category u in district d, POPud. In this step, control points are used to estimate
the variations of population density within an area covered by the same LULC category.
The procedure is as follows:
First, the districts <d1, d2, … dM> that overlap with the area of interest A are retrieved.
Then, the polygons P1, P2, … PN formed by the intersection of A and districts d1, d2, … dM
are computed. Then, for each polygon Pi and for each LULC category u, the cells that are
mainly covered by u in Pi are retrieved (the number of such cells is denoted with
NB_CELLudA). The total population that is contained in these cells (denoted with
TOTAL_POPudA) is given by the following:

\[
TOTAL\_POP_{udA} = NB\_CELL_{udA} \times POP_{ud} \tag{3}
\]

In this formula, POPud is the value obtained in phase 1. For example, if
TOTAL_POPforest_HamburgNord_A = 120, it means that in the polygon formed by the
intersection of the Hamburg Nord district with the area of interest, an estimated
120 people live in areas classified as forests. Then, as explained above, this number of
people must be redistributed (not necessarily evenly) within areas covered by u according
to the following formula. This formula (adapted from Zhang and Qiu (2011)) gives the
estimated population density in a cell Cu mostly covered by u:

\[
\mathit{Density}_d(C_u) = \mathit{constant}_{ud} \cdot W(C_u), \qquad
W(C_u) = \left(1 - \frac{\mathit{dist}(C_u, \mathit{nearest\_control\_point})}{\mathit{dist\_max}}\right)^{q} \tag{4}
\]

The population in cell C depends on the radial distance from the nearest control point. The
formula depends on several parameters. First, constantud is a constant that represents
the maximal density in areas covered by u in d. It can be estimated from the density of
people per census block, as given in census data. Then, dist(Cu, nearest_control_point) is
the distance between the centroid of the cell Cu and the nearest control point, and dist_max
is the maximal value of dist(Cu, nearest_control_point). Finally, q is the population density
decaying factor. One can see from the above formula that the density in a cell is maximal
(densityd(Cu) = constantud) when this cell includes a control point, because
dist(Cu, nearest_control_point) is minimal and W(Cu) is very close to 1. As we move away
from a control point, the population density decreases. The rate of decay is fixed by q.
Therefore, q greatly influences the population estimate. This is why several values of q
are tested in the experimentation section.
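
The effect of q can be made concrete with a small sketch. It assumes that the weight takes the form W = (1 − dist/dist_max)^q, which matches the behaviour described above (W close to 1 at a control point, faster decay for larger q). The function names and the coordinates are hypothetical.

```python
from math import hypot

def decay_weight(cell_centroid, control_pts, dist_max, q):
    """W(Cu) = (1 - d / dist_max) ** q, d being the distance to the nearest control point."""
    cx, cy = cell_centroid
    d = min(hypot(cx - px, cy - py) for px, py in control_pts)
    return (1.0 - d / dist_max) ** q

def cell_density(cell_centroid, control_pts, dist_max, q, constant_ud):
    """Density of a cell: the maximal density constant_ud scaled by the decay weight."""
    return constant_ud * decay_weight(cell_centroid, control_pts, dist_max, q)

# Hypothetical layout: one control point, one cell 500 m away, dist_max = 1000 m.
controls = [(0.0, 0.0)]
cell = (300.0, 400.0)
w_slow = decay_weight(cell, controls, dist_max=1000.0, q=0.3)
w_fast = decay_weight(cell, controls, dist_max=1000.0, q=0.7)
```

For the same cell, the weight obtained with q = 0.3 is larger than the one obtained with q = 0.7, so a smaller q spreads population further from the control point.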

Figure 6. Population in cell C depends on the distance from the nearest control point (in this
figure, educational institutions are considered as control points).

Still, another problem must be resolved. If, after computing the population in cells, we
sum up the population in all cells covered by u, this sum must be
equal to TOTAL_POPudA in order to respect Tobler’s pycnophylactic property. This is
why the computation of population in cells with the above formula is repeated, adjusting
constantud until the sum of population in all cells covered by u is as close as possible
to TOTAL_POPudA.
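
The fitting step can be sketched as follows. Because the cell value is proportional to the constant, a single rescaling is enough in this simplified reading; the text describes an iterative fit, so this closed form is an assumption, as is treating all cells as equal in area so that the constant can be expressed directly in people per cell. The function name and the sample weights are hypothetical.

```python
def fit_constant(weights, total_pop):
    """Constant such that sum(constant * w for w in weights) equals total_pop."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all weights are zero; population cannot be distributed")
    return total_pop / total_weight

# Hypothetical decay weights W(Cu) for four cells covered by the same LULC category.
weights = [1.0, 0.8, 0.5, 0.2]
total_pop_udA = 120.0

constant = fit_constant(weights, total_pop_udA)
cell_pop = [constant * w for w in weights]
```

The cell populations then sum to TOTAL_POPudA, which is the pycnophylactic constraint for this LULC category within the polygon.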

4.2.6. Computing population in buildings

The population of each cell is assigned to buildings inside the cell. The number of people
in a building b is assigned proportionally to the building’s surface, as follows:

\[
POP(b) = POP(C) \cdot \frac{A_b}{\sum_{i=1}^{k} A_i} \tag{5}
\]


In this formula, Ab is the surface of building b, Ai is the surface of building i
contained in cell C (the sum includes b itself), k is the number of buildings contained in
cell C, and POP(C) is the population in cell C. If a building’s surface overlaps with more
than one cell, this building is assigned to the cell that contains the largest proportion of
the building’s surface.
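
A minimal sketch of this allocation step is given below. The function names, identifiers, and sample areas are hypothetical; footprint areas are assumed to be precomputed from the OSM building polygons.

```python
def assign_building_to_cell(overlap_areas):
    """Pick the cell that holds the largest share of a building's footprint.

    overlap_areas maps a cell id to the footprint area lying inside that cell.
    """
    return max(overlap_areas, key=overlap_areas.get)

def building_populations(cell_pop, footprint_areas):
    """Split a cell's population among its buildings in proportion to footprint area."""
    total_area = sum(footprint_areas.values())
    return {b: cell_pop * area / total_area for b, area in footprint_areas.items()}

# Hypothetical cell with 60 people and three buildings (areas in square metres).
areas = {"b1": 100.0, "b2": 300.0, "b3": 200.0}
pop_by_building = building_populations(60.0, areas)

# Hypothetical building straddling two cells.
home_cell = assign_building_to_cell({"c1": 40.0, "c2": 110.0})
```

The building populations sum to the cell population, so the totals obtained in the previous step are preserved at building level.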

5. Experimentation
The experimentation has two objectives: (1) investigate the impact of two factors on the
reliability of the estimates – the choice of POI to determine control points and the
decaying factor’s value; and (2) compare the reliability of the population estimates with
other state-of-the-art methods.
It is not possible to experimentally verify whether the population estimates at the
building level are accurate, since no official data is available at that level, due to privacy
concerns. However, it is possible to verify the reliability of the method at the census block
level, which is the finest level of detail available. Since blocks are generally composed of a
small number of buildings, the reliability assessments are still worth consideration.
Therefore, in each experiment, estimates computed at the building level were aggregated
to obtain the estimated population for each census block. The estimated population is then
compared with the census population.

5.1. Results
Population estimates that were obtained using different categories of HDI POIs were
compared. This experiment tests the hypothesis formulated in Section 4: there exist some
categories of POI that are associated with areas where population density is higher. The
categories of POI ‘amenity: educational institution’ (comprising schools, universities, and
kindergartens) and ‘public transport: station’ were selected as HDI POIs, since they are
the most strongly correlated with population density (see Section 4). Three values for the
decaying factor were also tested: q = 0.3, q = 0.5, and q = 0.7. The results were assessed
with a linear regression (Figures 7 and 8).
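
The regression statistics used in this assessment can be computed as sketched below. The sketch assumes that the estimated block population is regressed on the census population and that "standard error" means the standard error of the estimate; neither detail is stated explicitly in the text. The function name and the sample values are hypothetical.

```python
from math import sqrt

def simple_regression(observed, estimated):
    """Least-squares fit estimated = intercept + slope * observed.

    Returns (slope, intercept, r2, std_error), where std_error is the
    standard error of the estimate with n - 2 degrees of freedom.
    """
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(estimated) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    syy = sum((y - mean_y) ** 2 for y in estimated)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, estimated))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(observed, estimated))
    r2 = 1.0 - ss_res / syy
    std_error = sqrt(ss_res / (n - 2))
    return slope, intercept, r2, std_error

# Hypothetical census blocks: census population and aggregated building estimates.
census = [100.0, 200.0, 300.0, 400.0]
estimate = [95.0, 185.0, 275.0, 365.0]
slope, intercept, r2, std_error = simple_regression(census, estimate)
```

A slope below 1, as in this made-up sample, indicates that the estimates grow more slowly than the census counts, that is, a tendency to underestimate populous blocks.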
For the first category of POI, the best results in terms of standard error
(369.2), R2 (0.9274), and regression coefficient (1.007) are achieved with a slower decrease
of the population as we move away from control points (q = 0.3). Still, the standard error
remains somewhat high, which shows some lack of accuracy. Then, as q increases (the decay is
faster as we move away from control points), the reliability of the estimate decreases
significantly and the regression coefficient moves away from 1.

Figure 7. Simple linear regressions for HDI POI = ‘amenity: educational institution’: (a) q = 0.3,
(b) q = 0.5, (c) q = 0.7.

Figure 8. Simple linear regressions for HDI POI = ‘public transport: station’: (a) q = 0.3, (b)
q = 0.5, (c) q = 0.7.

For the second category of POI, better results are achieved in terms of standard error
(286.6), R2 (0.9530), and regression coefficient (0.9849) with a value of q = 0.5, which
contrasts with the previous case. This demonstrates that the value of q should be
calibrated against the type of POI used to identify control points. It is possible that
population decreases more quickly around some types of place (i.e., the population is
more concentrated around these points) than others. However, in both cases, as q
increases, the slope of the regression line diminishes below 1, showing slight under-
estimation of the population. Worse results are obtained with q = 0.7. Overall, since best
results are obtained with the second category of POI, it seems that in this experiment, the
category ‘public transport: station’ was more appropriate to identify control points.
In the second part of the experiment, while reporting further on experiments with
different categories of POI, the proposed method is compared with the following
representative state-of-the-art approaches:

(1) Areal weighting: population is distributed to the block level based on area
proportion only.
(2) Binary dasymetric mapping: population is distributed to the block level based
on area proportion and distinction between habitable and non-habitable areas.
LULC data from Urban Atlas was employed to determine habitable/non-habitable
areas. Urban fabric categories were considered as habitable, while other categories
were considered as non-habitable.
(3) Multi-class dasymetric mapping: the method described in Section 4.1 was
applied. Population was distributed evenly within areas covered by the same
LULC category.
(4) Interpolation by surface volume integration: population estimates are com-
puted using Equation (4) with q = 0.3 and control points obtained from ‘amenity:
educational institution’ (POIs that generated the most reliable results).
(5) Proposed method with binary instead of multi-class dasymetric mapping,
with q = 0.3 and control points obtained from ‘amenity: educational institution’.
(6) Interpolation modeling related variables with co-kriging: population density is
estimated for the 500 m2 grid using a co-kriging method, where the primary
variable is population density and the secondary variable is POI distribution
(tested with ‘amenity: educational institution’). Population estimates were
rescaled to satisfy Tobler’s pycnophylactic property.
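
The difference between the first two baselines in the list above can be illustrated with a short sketch. Both reduce to a proportional split and differ only in the weights: total block area for areal weighting, habitable (urban fabric) area for binary dasymetric mapping. The function name and all figures are hypothetical.

```python
def proportional_split(source_pop, weights):
    """Distribute source_pop over target zones in proportion to the given weights."""
    total = sum(weights.values())
    return {zone: source_pop * w / total for zone, w in weights.items()}

# Hypothetical district of 1,000 people divided into three census blocks (areas in ha).
block_area = {"block_1": 4.0, "block_2": 4.0, "block_3": 2.0}
habitable_area = {"block_1": 3.0, "block_2": 0.0, "block_3": 1.0}  # urban fabric only

areal_weighting = proportional_split(1000, block_area)
binary_dasymetric = proportional_split(1000, habitable_area)
```

In this example, areal weighting places 400 people in block_2 even though it contains no habitable land, whereas binary dasymetric mapping assigns it none.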

The root mean square error (RMSE) and the mean absolute error (MAE), which are well-
known accuracy measures in the field of population density estimation (Liu et al. 2008,
Langford 2013), were measured (Table 3):

\[
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i^{estimated} - P_i^{observed}\right)^2} \tag{6}
\]

\[
MAE = \frac{1}{P}\sum_{i=1}^{n}\left|P_i^{estimated} - P_i^{observed}\right| \tag{7}
\]

In these formulas, n is the number of census blocks, Piestimated is the estimated population
in block i, Piobserved is the official census population in block i, and P is the total
population in Hamburg.

Table 3. Comparative reliability assessment results.

Method                                                          RMSE    MAE (%)
Areal weighting                                                 175.6   49.8
Binary dasymetric mapping                                       124.3   19.4
Multi-class dasymetric mapping                                  119.6   19.1
Interpolation by surface volume integration                     137.3   26.5
Proposed method with binary dasymetric mapping                  111.2   11.9
Interpolation with co-kriging                                   100.2    8.5
Proposed method, ‘amenity: educational institution’, q = 0.3    104.1    9.2
Proposed method, ‘public transport: station’, q = 0.5           103.7    8.9
Proposed method, ‘amenity; Post-box’, q = 0.3                   106.5   10.1
Proposed method, ‘amenity; recycling’, q = 0.5                  109.4   11.0
Proposed method, ‘highway; crossing’, q = 0.5                   112.2   11.9
Proposed method, ‘amenity; bench’, q = 0.6                      114.2   14.3
Proposed method, ‘highway; traffic signals’, q = 0.5            118.6   17.3
Proposed method, ‘amenity; hospital, doctors’, q = 0.7          119.7   19.8
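
Both measures are straightforward to compute once building estimates have been aggregated to census blocks. The sketch below assumes the standard definition of RMSE and an MAE normalized by the total population and expressed as a percentage; the total population P is taken as the sum of the observed block populations, which holds only if the blocks cover the whole city. Function names and sample values are hypothetical.

```python
from math import sqrt

def rmse(estimated, observed):
    """Root mean square error over census blocks."""
    n = len(observed)
    return sqrt(sum((e - o) ** 2 for e, o in zip(estimated, observed)) / n)

def mae_percent(estimated, observed):
    """Sum of absolute block errors, as a percentage of the total population."""
    total_population = sum(observed)
    abs_error = sum(abs(e - o) for e, o in zip(estimated, observed))
    return 100.0 * abs_error / total_population

# Hypothetical estimated and census populations for three blocks.
est = [110.0, 190.0, 300.0]
obs = [100.0, 200.0, 300.0]
block_rmse = rmse(est, obs)
block_mae = mae_percent(est, obs)
```

RMSE is expressed in people per block and penalizes large block errors more heavily, while the normalized MAE gives the share of the total population that is misallocated.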
As expected, areal weighting performs poorly, but it provides a useful baseline for
comparison. There is no great difference between binary and multi-class dasymetric mapping,
which indicates that few people indeed live outside urban fabric areas. However, binary
dasymetric mapping slightly outperforms interpolation by surface volume integration, which
was also reported (although with a more pronounced difference) in Langford (2013). The
corroboration of these results suggests that modeling population distribution with a
decaying function may be oversimplifying, which contrasts with results obtained in
Zhang and Qiu (2011). It also questions the reliability of POIs as a unique source of
ancillary data (while in the proposed approach, it is coupled with LULC data). The
comparison of interpolation by surface volume integration with the proposed method
(with the same category of POIs being used to ensure a fair comparison) also suggests that
the multi-class dasymetric mapping method based on LULC contributes to increasing the
reliability of the estimates. When comparing the results obtained with the proposed
method using binary or multi-class dasymetric mapping, it can be observed that the
multi-class approach does improve the estimates, but only by a small margin. Since binary
dasymetric mapping is easier to implement than multi-class dasymetric mapping, it might
therefore be worth considering. Also, the purity of the samples used for estimating
population densities
associated with each LULC category (85%) impacts the reliability of estimates obtained
with multi-class dasymetric mapping (although misclassification can also occur with
binary dasymetric mapping). Finally, better results are obtained with interpolation with
co-kriging than with the proposed method with interpolation by surface volume integration.
Here, POIs were used as the secondary variable, but preliminary estimates obtained for
each category of LULC with multi-class dasymetric mapping (Section 4.1) were also used
to rescale the results of co-kriging. The results could be interpreted as another indication
that the interpolation by surface volume integration is too simplistic. Still, the difference
for both RMSE and MAE is not large. This may be worth considering, since the
co-kriging model is more complex to implement.
The results show that POIs are useful to identify control points and estimate variations
within a given category of LULC. Table 3 lists some categories of POI that performed
better than others. It shows that for some categories of POI, the results are
not necessarily better than simple multi-class dasymetric mapping (e.g., hospitals and
doctors). It is nevertheless difficult to draw conclusions, since this could be because
this type of place is not correlated with population density, or because data on this
type of place is incomplete in OSM. Results are also likely to differ between cities, as
spatial urban patterns vary.

5.2. Discussion

margin of error could be greater at the level of buildings, although this cannot be verified.
Although the volume of buildings could be used to generate estimates at this level, it would
still not be an absolute reference against which the proposed method could be compared.
An important cause of error is possibly the decaying function behind the interpolation
by surface volume integration. Although results show it helps to better reflect population
density variations within an area with uniform LULC classification, this model simplifies
the population density distribution. The second issue with the model is that the decaying
factor q controls the extent of local influence. The experiments show that the choice of the
value of q should depend on the type of POI selected to determine control points, with
the best-performing values ranging between 0.3 and 0.7. To verify whether the population
decay is indeed correlated with the category of POI, further experiments must be
conducted where estimates obtained with the same categories of POI but different decay
values and in different regions would be compared. Such experiments would also have to
take into account the differences between the urban fabric patterns across regions.
The choice of POI also affects the estimates. In addition, the completeness of the POI
data set is an issue, as well as semantic accuracy. Since POI descriptions are provided by
volunteers, inaccurate descriptions are likely to occur. Until further research is done to
develop tools to frame the volunteer production of POIs (such as semantic annotation for
POI, see Scioscia et al. 2013) and until such tools become common use, appropriate
filtering of POI data is required. An alternative could be POIs from official or commercial
data sets.
Errors in LULC classification are likely to impact population estimates (Maantay et al.
2007, Liu et al. 2008, Lu et al. 2010, Langford 2013), a source of error that can hardly be
avoided. In addition, in different phases of the method, approximations were made, such
as considering cells mostly covered by a category of LULC as entirely covered by this
category, or considering buildings as entirely part of a cell while they were only partly
included in this cell. Also, data on types of building could not be used (e.g., to discard
industrial or commercial buildings) since building attributes are not always available in
OSM. An additional parameter that affects the results is grid resolution. Further experi-
ments are needed to assess its impact on the reliability of estimates. A coarser grid would
affect dasymetric mapping, because it would negatively impact the purity of population
density samples. A finer grid may produce more accurate results by allowing finer
distinctions. However, a finer grid may also have a negative effect with more buildings
being split between two or more cells.

6. Conclusion and future work

Areal interpolation techniques using LULC data are commonly used to estimate population
distribution, but numerous experiments, including those reported here, demonstrate
that they cannot be used to conduct accurate estimation of population at a fine scale. A
common solution to this problem is to use three-dimensional data on buildings’ volume
derived from LiDAR data. However, such data is not always available, due to its
significant cost. Meanwhile, VGI is used for an increasing variety of tasks related to
urban management, emergency response, routing services, etc. It was therefore worth
investigating whether VGI could also prove useful to support population estimation.
This paper presents a method for areal interpolation at building level that uses VGI as
ancillary data. After a first phase that uses multi-class dasymetric mapping to assess
average population density within homogeneous LULC category, OSM data on buildings
and POIs are used to refine the population estimation at building level. An original
method was elaborated to classify and determine POIs that can be used to estimate
population distribution. A case study with the city of Hamburg, Germany, has been
conducted. Population estimates obtained using different categories of POI were com-
pared. The method was also compared with state-of-the-art methods. The results show that
although there are areas for improvement, POIs can be considered as interesting ancillary
data in the absence of three-dimensional data on buildings. Of note still is that VGI data,
including OSM, is not yet widely available or well-developed in all regions of the world.
Similarly, LULC data from Urban Atlas is available only in Europe and for urban areas
with a population of more than 100,000 people. Although other LULC data sets are
available for other countries, such data is not consistently available, e.g., in developing
countries, most likely limiting the approach presented in this paper to urban regions in
developed countries.
Further research is also required to deal with the VGI quality issues. Although the aim
of the proposed approach is to improve accessibility by relying on free data, it cannot fully
achieve this goal until VGI becomes more common. Still, this research opens an inter-
esting avenue for other studies to be conducted in the future using VGI.
The possible causes of error that were identified include the incompleteness or spatial
and thematic inaccuracy of data on POIs in OSM. The inappropriateness of some POIs to
estimate population distribution (where some categories of POI may not be correlated
with population density) may vary according to the city, since different cities are char-
acterized by different urban patterns. In this regard, future work will be conducted to
further investigate patterns of POIs that reflect spatial population distribution. A more
comprehensive investigation of the issue also requires comparison of the results using
commercially produced or official POIs and linking it to various indicators of POI quality.
Another area of concern is the impact of the decaying function. Calibration is needed
whenever the method needs to be applied to a different region or using different POIs.
Additional ancillary data, such as distance from roads, may help determine how popula-
tion density is decaying away from a POI. Additionally, further research on correlation
between types of POI, structure of urban fabric, and population decay may be conducted
to improve the approach. The surface volume integration may also be inappropriate,
because it measures the density in a cell according to its distance from the nearest control
point. Taking into account the distribution of control points surrounding a cell rather than
a single control point may produce better results. Also, better estimates were obtained
with co-kriging interpolation, where the problem of related variables is taken into account.
A hybrid approach is worth consideration.

We wish to thank the anonymous reviewers for their valuable contribution.

This work was conducted for the GRIPS Project and supported by the German Federal Ministry of
Education and Research (BMBF).









