Cellular Census - Explorations in Urban Data Collection
www.computer.org/pervasive
Vol. 6, No. 3
July–September 2007
© 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or
for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be
Cellular Census:
Explorations in Urban
Data Collection
Analysis of cell phone use can provide an important new way of looking
at the city as a holistic, dynamic system.
Jonathan Reades
University College London

Francesco Calabrese, Andres Sevtsuk, and Carlo Ratti
Massachusetts Institute of Technology

Much of our understanding of urban systems comes from traditional data collection methods such as surveys by person or phone. These approaches can provide detailed information about urban behaviors, but they're hard to update and might limit results to "snapshots in time."

In the past few years, some innovative approaches have sought to use mobile devices to collect spatiotemporal data (see the sidebar, "Urban Analysis Using Mobile-Device Data"). But little research has been done to develop and analyze the much larger samples of existing data generated daily by mobile networks.

The most common explanation for this is that the challenge of data-sharing with the telecommunications industry has hampered data access. However, in early 2006, a collaboration between Telecom Italia, which serves 40 percent of the Roman market, and MIT's SENSEable City Laboratory (http://senseable.mit.edu) allowed unprecedented access to aggregate mobile phone data from Rome. Here, we explore how researchers might be able to use data for an entire metropolitan region to analyze urban dynamics.

The Real Time Rome platform

The TI and MIT collaboration, developed under the Real Time Rome label, was shown at the 2006 Venice Biennale. The installation incorporated both real-time and historical visualizations of mobile phone usage levels in central Rome during autumn 2006. The system architecture, including data collection, transfer, and processing, has been detailed elsewhere.1

TI supplied several different types of data, first and foremost of which was the Erlang, a measure of network bandwidth usage typically collected at the antenna level. Additionally, TI used its innovative Lochness platform to supply aggregate location and trajectory data on callers using the system for more than three minutes at a time. Two transportation companies, Atac-Rome (a public bus company) and Samarcanda (a private taxi company), also provided supplemental GPS data to MIT for further processing. However, here we focus on the Erlang data collected over four months in late 2006 and covering a region of 47 km², considering how it can help us better understand urban dynamics.

An Erlang is one person-hour of phone use, so 1 Erlang could represent one person talking for an hour, two people talking for a half hour each, 30 people speaking for two minutes each, and so on. Consequently, Erlang data is both aggregate and anonymous, and deducing individual identities from the data collected and stored in the system is impossible. Additionally, because Erlang data is a standard measure used by most network operators, it's an accessible source for the analysis of typical GSM (Global System for Mobile Communication) networks. You can collect Erlang data without installing new applications or upgrading the base station controllers, both of which incur costs and operational risks for the networks.

Although Erlang data can't be linked to an individual subscriber and doesn't offer the locational
30 PERVASIVE computing Published by the IEEE Computer Society ■ 1536-1268/07/$25.00 © 2007 IEEE
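As a quick check on the Erlang definition given above, the person-hour arithmetic can be written out in a few lines of Python. This is only an illustrative sketch: the function name `person_hours` and the (callers, minutes) input format are assumptions made here, not part of the data that TI supplied.

```python
def person_hours(calls):
    """Sum phone use over (number_of_callers, minutes_each) pairs, in person-hours.

    Hypothetical helper for illustration; real Erlang figures arrive
    pre-aggregated from the network operator.
    """
    total_minutes = sum(callers * minutes for callers, minutes in calls)
    return total_minutes / 60.0


# The three examples from the text each amount to the same traffic.
print(person_hours([(1, 60)]))   # one person talking for an hour
print(person_hours([(2, 30)]))   # two people talking for half an hour each
print(person_hours([(30, 2)]))   # thirty people speaking for two minutes each
```

Because very different calling patterns collapse to the same number, the measure says nothing about who was calling, which is the sense in which the data is aggregate and anonymous.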
Urban Analysis Using Mobile-Device Data

with Nokia phones carrying specially designed logging software.1,2

Rein Ahas and Ülar Mark tracked the mobile phones of 300 users for a "social positioning method" analysis.3 By combining spatiotemporal data from phones with demographic and attitudinal data from surveys, they created a map of social spaces in Estonia.

In the UK, the Cityware research group has taken a more readily scalable approach. They supplement the pedestrian flow data typically gathered as part of a space syntax analysis with data on Bluetooth devices passing through pedestrian survey "gates."4

However, approaches such as these can suffer from important limitations: they rely on the deployment of ad hoc infrastructure or re-

REFERENCES

1. N. Eagle and A. Pentland, "Eigenbehaviors: Identifying Structure in Routine," 2006; http://vismod.media.mit.edu//tech-reports/TR-601.pdf.
2. N. Eagle and A. Pentland, "Reality Mining: Sensing Complex Social Systems," Personal and Ubiquitous Computing, vol. 10, no. 4, 2006, pp. 255–268.
3. R. Ahas and Ü. Mark, "Location Based Services—New Challenges for Planning and Public Administration?" Futures, vol. 37, no. 6, 2005, pp. 547–561.
4. E. O'Neill et al., "Instrumenting the City: Developing Methods for Observing and Understanding the Digital Cityscape," UbiComp 2006: Ubiquitous Computing, LNCS 4206, Springer, 2006, pp. 315–332.
First visualizations and hypothesis

Figure 1 shows one of the simplest visualizations of Erlang data: a 3D plot of telecommunications activity during Madonna's controversial 6 August 2006 performance, when more than 70,000 people converged on the Stadio Olimpico for a concert condemned by the Pope. Generic Erlang maps such as this, which was presented at the Biennale, are graphically appealing and intuitively easy to grasp. However, they're actually quite difficult to interpret rigorously, and they provide little insight into local-area dynamics without additional processing.

Figure 1. A 3D plot of telecommunications activity during a Madonna concert in Rome.

We hypothesize that by employing various statistical techniques, we can use differences in Erlang data over time to derive clues to the types of activity in the immediate area of the mast. (A mast can carry multiple antennas, oriented in different directions or serving different frequencies.) This analysis is conceptually related to the idea of a chronotype (see the "Chronotypes and Space-Time Typologies" sidebar), except that we're characterizing spaces by their mobile-bandwidth use over time. By analyzing the bandwidth "signature" of each antenna, we try to envision how it might correlate with urban activities in the geographical vicinity.

Because Erlang data is an antenna-level measure, we needed an algorithm to spread the point data values across the area served, accounting for distance decay in signal coverage and multiple antennas on a single mast. Carlo Ratti and his colleagues took a center-of-gravity approach,2 but to interpolate values for the entire metropolitan region, an alternative algorithm3 was used to divide Rome into "pixels" measuring 1,600 m². We used an exponential distribution function to derive an Erlang point value based on a composite signal from the surrounding masts.

We use this mathematical notation:

• $Loc$ is the set of 1,600 m² pixels.
• $T_{96}$ is the set of times when we made observations each day of the week. Because we took measurements every 15 minutes, one day comprises 96 observations.
• $Day$ is the set {Weekday, Friday, Saturday, Sunday} (we discuss this in more detail later).
• $erlang(\delta, \lambda, \tau)$ defines the Erlang value at location $\lambda \in Loc$, at time $\tau \in T_{96}$, and $\delta \in Day$.
• $\operatorname{mean}_{i \in I}\{a_i\}$ indicates the mean of the values $a_i$, $i \in I$.

(To preserve confidentiality, TI used a scaling factor to adjust the Erlang values transmitted to the SENSEable City Laboratory. This means that while the relative difference between any two observations is scaled consistently, the actual Erlang value at that point in time is unknown. So, it's helpful to focus on the relationships between points over time and space rather than the specific value at any one point in time.)

Using prior knowledge of the city, we arbitrarily selected eight locations that we expected to have markedly different signatures. Following an initial visualization exercise, we selected six for analysis:

• Termini, Rome's main passenger rail station and busiest subway station;
• Trastevere, a mixed-use area popular with Romans and tourists for its bars and restaurants;
• the Piazza Bologna, a residential area east of the city center;
• the area in front of the Pantheon (one of Rome's premier tourist attractions), which also contains many bars and restaurants;
• the Stadio Olimpico, a sports and major concert venue northwest of central Rome; and
• Tiburtina, a smaller rail and subway interchange.

Figure 2 shows the pixels for these six locations.

To minimize the impact of special events on the data set, we calculated an average Erlang value for each pixel at each 15-minute interval, using a 90-day period. So, for example, the data point for 9 a.m. Monday is an average of every 9 a.m. Monday value between 1 September and 30 November 2006. We excluded civic holidays from the calculation on the basis that they would introduce unnecessary noise.

Erlang data by day of the week

Beginning with a minimal level of processing, figure 3 shows how Erlang data changes over time at each of the six selected pixels. As the graphs indicate, Monday through Friday are broadly similar, except for a more rapid decrease in activity on Friday afternoon, suggesting a transition to the weekend. Even more strikingly, Saturday and Sunday values often drop below 50 percent of the typical weekday load, but the drop's magnitude varies dramatically from site to site. This finding indicates that weekday and weekend data should be treated separately in our analysis.

Intriguingly, areas more closely identi-
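The two processing steps described earlier in this section, spreading antenna-level readings onto pixels with distance decay and averaging each weekly 15-minute slot over the study period, might be sketched as follows. The article does not give the interpolation formula, so the normalized exponential weighting and the `decay_m` length scale below are assumptions, as are the function names and the holiday date used in the demonstration.

```python
import math
from collections import defaultdict
from datetime import date, datetime


def pixel_value(pixel_xy, masts, decay_m=500.0):
    """Blend nearby mast readings into one pixel value.

    masts: list of ((x, y), erlang) with coordinates in meters. Weights fall
    off exponentially with distance; decay_m is an assumed length scale.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (mx, my), erlang in masts:
        dist = math.hypot(pixel_xy[0] - mx, pixel_xy[1] - my)
        weight = math.exp(-dist / decay_m)
        weighted_sum += weight * erlang
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


def weekly_slot_means(observations, holidays=frozenset()):
    """Average readings per (weekday, 15-minute slot), skipping holidays.

    observations: iterable of (datetime, erlang). Slot 36 is 9:00 a.m.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for stamp, erlang in observations:
        if stamp.date() in holidays:
            continue
        slot = stamp.hour * 4 + stamp.minute // 15   # 0..95 within the day
        bucket = totals[(stamp.weekday(), slot)]     # Monday is weekday 0
        bucket[0] += erlang
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Two Monday 9 a.m. readings; a third falls on a made-up holiday and is dropped.
readings = [
    (datetime(2006, 9, 4, 9, 0), 2.0),
    (datetime(2006, 9, 11, 9, 0), 4.0),
    (datetime(2006, 9, 18, 9, 0), 100.0),
]
means = weekly_slot_means(readings, holidays={date(2006, 9, 18)})
print(means[(0, 36)])
```

Averaging by weekly slot is what keeps a one-off event, such as a stadium concert, from dominating a pixel's signature.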
consistent with the idea that although weekday telecommunications activity at each site exhibits a more dynamic temporal pattern, weekend activity exhibits more spatial dispersal. From an urban-planning standpoint, this strongly suggests large commuter flows into the central business district during the week and more residentially oriented activity on weekends. Of course, planners are well aware of this spatial relationship, but spatial and temporal visualization of these features at this scale hasn't been possible before.

One caveat: the levels of activity between 3 and 6 a.m. throughout the week mean that any analysis using that period would be rooted in extremely low Erlang values. So, such a comparison might erroneously indicate excessive shifts in activity from site to site. Nonetheless, from this initial analysis, it seems that through normalized signatures we can reconstruct some of the functioning of the city using the invisible fingerprints of mobile phone infrastructure.

[Figure: normalized Erlang (vertical axis) against time of day (horizontal axis), panels (a) and (b), for Termini, Trastevere, Pantheon, Piazza Bologna, Tiburtina, and Stadio Olimpico.]

Cluster analysis

So far, we've focused largely on individual pixels, and we've identified some interesting features at a fairly detailed spatial level. Our preliminary analysis indicates that residential areas, commuter hubs, nighttime hot spots, and even special-event venues demonstrate features consistent with our contextual, anecdotal knowledge of Rome. However, validating our hypotheses requires a more rigorously quantitative study. The ultimate goal is to take the derived signatures, group them by degree of similarity, and map them to urban spatiotemporal structures.

As a proof of concept, we created a simplified vector (required for computational manageability) to feed pixel data for each of Rome's 262,144 pixels to a clustering algorithm. An examination of our six selected pixels suggested that six times in the daily cycle of Erlang activity are particularly significant: 1 a.m., 7 a.m., 11 a.m., 2 p.m., 5 p.m., and 9 p.m. Each of these points lies toward the middle of a period of rapid change or significant variation between sites: the early morning rise in activity, late morning peak period, early afternoon lull, afternoon peak, and evening drop. The six normalized Erlang values thus make up the coordinates of a vector that describes, in a limited way, each pixel's signature.

We could use many clustering techniques to create segmentations based on the affinity between vectors. We chose a K-Means approach, such that every observation in a cluster is as much like other members of that cluster, and as different from members of any other cluster, as possible. With six coordinates from each day, and separate sets of coordinates for Monday through Thursday (one set of averaged observations), Friday, Saturday, and Sunday, the K-Means algorithm used a 24-dimensional space.

We employed two clustering steps. First, for each pixel, we calculate

$$feature(loc) = \{\, erlang(\delta, loc, j) \,\}, \quad \delta \in Day,$$

with $j$ = 1 a.m., 7 a.m., 11 a.m., 2 p.m., 5 p.m., and 9 p.m.

Second, the K-Means clustering algorithm partitions the pixels into mutually exclusive clusters. Each cluster is characterized by its centroid, and the algorithm aims to minimize the error function:

$$E = \sum_{k=1}^{K} \sum_{loc_j \in Cluster_k} distance(loc_j, centroid_k)$$

where $Cluster_k$ is the set of objects related to the cluster $k$, and $centroid_k$ is the mean of all the points in $Cluster_k$. We calculated the distance between pixels using the squared Euclidean distance:

$$distance(loc_1, loc_2) = \left( \sum_{i=1}^{24} \left( feature(loc_1)_i - feature(loc_2)_i \right)^2 \right)^{1/2}$$
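The two clustering steps described above can be illustrated with a small, dependency-free sketch: build a 24-value feature vector per pixel (six sample times for each of four day types), then partition the vectors with a plain K-Means loop. This is not the authors' implementation. The synthetic "commuter" and "nightlife" signatures, the helper names, and the farthest-point seeding rule are all assumptions made for the example, and the distance function follows the formula as printed, including the outer square root.

```python
KEY_HOURS = [1, 7, 11, 14, 17, 21]                  # the six sample times named in the text
DAY_TYPES = ["Weekday", "Friday", "Saturday", "Sunday"]


def build_feature(erlang_by_day):
    """Flatten {day_type: {hour: normalized Erlang}} into a 24-value vector."""
    return [erlang_by_day[day][hour] for day in DAY_TYPES for hour in KEY_HOURS]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def initial_centroids(features, k):
    """Deterministic farthest-point seeding (an assumption; the text names no seeding rule)."""
    centroids = [list(features[0])]
    while len(centroids) < k:
        farthest = max(features, key=lambda f: min(distance(f, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def k_means(features, k, max_iterations=100):
    """Alternate assignment and centroid update until the labels stop changing."""
    centroids = initial_centroids(features, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda c: distance(f, centroids[c])) for f in features]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [f for f, label in zip(features, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels, centroids


def make_pixel(peaks, offset):
    """Synthetic signature: 2.0 at the given (day, hour) peaks, 0.5 elsewhere, plus an offset."""
    return {d: {h: (2.0 if (d, h) in peaks else 0.5) + offset for h in KEY_HOURS} for d in DAY_TYPES}


commuter = {("Weekday", 7), ("Weekday", 17), ("Friday", 7), ("Friday", 17)}
nightlife = {("Friday", 21), ("Saturday", 21), ("Saturday", 1), ("Sunday", 1)}
pixels = [make_pixel(commuter, o) for o in (0.0, 0.1, 0.2)]
pixels += [make_pixel(nightlife, o) for o in (0.0, 0.1, 0.2)]
features = [build_feature(p) for p in pixels]
labels, centroids = k_means(features, k=2)
print(labels)
```

On this toy input the three commuter-like pixels and the three nightlife-like pixels end up in separate clusters. At the scale of 262,144 pixels, an optimized library implementation would be the practical choice.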
As a result of the clustering process, we can group all pixels in the city into any arbitrary number of groups based on the affinity of their composite Erlang signature. In our tests, we found a mix of clusters that suggest a complex set of relationships between signatures. Given the sheer complexity of cities, this is hardly surprising. However, the existence of several small clusters with much stronger levels of affiliation or differentiation indicates that the overall data set includes some quite distinct signatures. These signatures will likely map to distinct types of urban activity.

For this initial research, we worked with eight clusters as a compromise between simplicity and specificity. Doing this gave us a fair cophenetic correlation value of 0.7704. Cophenetic correlation is one way to gauge the clusters' fit to the original data set (values approaching 1.0 suggest a good fit) by comparing pairwise linkages between observations.
Figure 5. Erlang data for Rome normalized over space and time. Intensities range from low (blue) to high (red).
Figure 6. Analysis of eight clusters of Erlang data: (a) clusters 1–4; (b) a satellite view of Rome, for comparison; (c) clusters 5–8.
Projecting these clusters onto a map of Rome (see figure 6) naturally indicates that they're closely linked to the normalized Erlang signatures. The edges of Rome's urban core are clearly visible, as are the hot spots of urban activity straddling the Tiber River. The map suggests an overall structure to the city, with a correspondence between levels of telecommunications activity and types of human activity. At this point, however, we can't verifiably connect cellular signatures to specific types of human activity.

We then adjusted the metric to favor the two most distinctive types of use seen in the normalized graphs: early morning use suggestive of commuting behavior and late evening use suggestive of nighttime leisure activities. For these clusters we obtained cophenetic correlations of 0.7630 and 0.8508, indicating that the clustering approach has substantial promise.

The red nighttime-leisure cluster in figure 7a shows two discrete spatial groupings that map anecdotally to known areas of evening activity: Trastevere and the area ranging to the west and south of the Piazza Navona, and the vicinity of the Piazza Spagna. The red commuter clusters in figure 7b quite astonishingly map to the most important points of entry to the city by car and train: Termini station, Tiburtina, the end of the Corso d'Italia, the Porta Maggiore, and the Porta San Giovanni.

Discussion

Our preliminary findings suggest that signature analysis can provide an important new way of looking at the city as a holistic, dynamic system. In particular, the mobile phone network lets us develop a real-time representation of those dynamics at the city and city-region scale. This approach can complement traditional collection techniques, which are often outdated by the time they're available to policy makers and the general public. Of course, because our hypotheses so far are based on anecdotal evidence, our findings will require additional validation, which we outline below.

What's most promising about this early research is the extent to which our findings seem to parallel those of other European researchers4,5 as well as more conceptual research into telecommunications' impacts on urban behaviors.6,7 In particular, we can characterize areas on the basis of flows and dynamics rather than on the basis of comparatively static physical or demographic features.

Moreover, we've recently received data from Pagine Gialle (the Italian Yellow Pages) with which we intend to validate our initial findings by linking the signatures to spatial data on business types and densities. In so doing, we can build on the processing requirements we discussed earlier in this article:

1. Antenna and pixel values must be normalized over both space and time to provide a measure of relative telecommunications intensity.
2. The substantial differences between weekdays and weekends require treating them separately in a classification algorithm.
3. The key time periods intimated in this initial analysis appear to be 12 to 2 a.m., 5 to 8 a.m., 10 a.m. to 12 p.m., 2 to 6 p.m., and 8 to 11 p.m. However, as our initial cluster analysis makes clear, these aren't the only factors.

We expect several other analytical approaches to yield insights into network usage patterns. One of the most promising approaches is Eigenbehavior analysis.8 Because we can easily map the signature to a vector representation of the sort already used in the cluster analy-
262,144 pixels over a three-month period, our algorithm spreads Erlang data through all 360 degrees, producing a possible skew in the overall distribution. Finally, not all masts handle both the 900- and 1,800-MHz bands used in Europe. So, some network activity might gravitate toward more physically remote base stations with the hardware to process calls in a particular band. We don't have data that would let us compensate for these possible biases. So, without adopting an entirely different approach to data collection (one that the network operator would have been reluctant to support at this development stage), localizing phones more accurately is impossible.

Although the data to which we currently have access has clear, substantial limitations, we believe our approach represents an appropriate trade-off between locational specificity and implementational feasibility. Fortunately, analysis at the city and city-regional scale doesn't depend on the high level of accuracy that

It would be exciting to compare the signatures collected from Rome with similar data from other major European cities such as London, Paris, or Frankfurt. For instance, it's reasonable to expect that cities with more distinct spatial patterns of human activity might display correspondingly more distinct patterns of network use and more readily classifiable signatures. Unfortunately, at this time commercial considerations appear to preclude using data from other network operators.

This issue highlights the extent to which research using cellular networks must take nonscientific factors into account. First, a policy framework at the national or European level that encourages networks to share nonidentifiable data with planning and policy researchers would be immensely helpful. Clearly, there are important considerations from the standpoint of commercial confidentiality, personal privacy, and possibly even national security. However, in the absence of clear regulatory guidance, fur-

polling the phones in a cell to obtain a list of IMEI (International Mobile Equipment Identity) numbers at the mast level, could provide unmatched detail on travel origins and destinations, and on population densities. By scrambling handset identifiers with changing encryption schemes, reporting only partial trajectories, and never reporting on cells or paths containing fewer than an agreed minimum number of users, you would be able to perform this kind of research without compromising personal privacy. This data would also assist enormously in understanding how individual and group behavior changes over time and space. This would not only shed further light on the rhythms of urban life but also address the fact that you can't derive metrics on activity and population densities from Erlang data alone.

The challenge is that as the data becomes more useful, it also becomes more sensitive to both operators and end users. An all-or-nothing approach to privacy has hampered this discussion.
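Two of the safeguards mentioned above, scrambling handset identifiers with a changing scheme and suppressing cells below an agreed minimum number of users, could look roughly like the following. This is a sketch under stated assumptions rather than a description of any operator's system: a keyed hash (HMAC-SHA256 with a key that changes each reporting period) stands in for the "changing encryption schemes" in the text, and the function names, identifiers, key, and threshold of five users are all hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


def scramble(handset_id, period_key):
    """Keyed one-way hash of a handset identifier.

    Using a new period_key for each reporting period means the same handset
    cannot be linked across periods by anyone who lacks the keys.
    """
    digest = hmac.new(period_key, handset_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def reportable_cell_counts(sightings, period_key, min_users=5):
    """Count distinct scrambled handsets per cell, dropping cells below min_users.

    sightings: iterable of (handset_id, cell_id) pairs.
    """
    users_by_cell = defaultdict(set)
    for handset_id, cell_id in sightings:
        users_by_cell[cell_id].add(scramble(handset_id, period_key))
    return {cell: len(users) for cell, users in users_by_cell.items() if len(users) >= min_users}


# Five handsets seen in cell-A, only two in cell-B: cell-B is suppressed.
sightings = [(f"handset-{n}", "cell-A") for n in range(5)]
sightings += [("handset-0", "cell-B"), ("handset-1", "cell-B")]
print(reportable_cell_counts(sightings, period_key=b"key-for-this-week"))
```

The suppression threshold trades detail for privacy: raising `min_users` hides more sparsely used cells, which matters most at night and at the urban fringe.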
The Authors

Jonathan Reades is an MPhil and a PhD candidate at the Bartlett School of Planning at University College London. His research interests are the application of mobile phone data to topics in urban planning such as business clustering and communication, and the spatiotemporal structures of European cities. He previously spent eight years in data analytics, helping telecom firms use their data for targeted marketing. His first degree was in comparative literature at Princeton University; he's a student member of both the Royal Town Planning Institute and the Town and Country Planning Association. Contact him at the Planning Dept., 4th fl., The Bartlett School, Wates House, 22 Gordon St., London WC1H 0QB, UK; j.reades@ucl.ac.uk.

REFERENCES

1. F. Calabrese and C. Ratti, "Real Time Rome," Networks and Communication Studies, vol. 20, nos. 3 & 4, 2006, pp. 247–258.