Project Report Sormunen

Inferring co-occurrences from WiFi logs
Silja Sormunen, Student number: 724645

April 9, 2019
1. Introduction
Wireless networks offer an inexpensive alternative for human movement monitoring. Spa-
tiotemporal data can be collected passively without user involvement, as WiFi access points
automatically capture wireless signals transmitted by electronic devices as long as the de-
vice’s WiFi function is turned on. This data can be used for tracking movement patterns of
users, and, when the data grows large enough, instances of spatiotemporal co-occurrences
and their characteristics can be used to infer the underlying social network structure. In-
ferring social ties from spatiotemporal data is based on the intuitive idea that individuals,
who are more frequently observed together than chance would predict, are likely to know
each other in real life. However, as the number of coincidental co-occurrences can be sig-
nificant, the main challenge of this line of research lies in finding efficient methods for
separating real encounters from coincidental co-occurrences.
The present dataset consists of fourteen days’ WiFi logs recorded at Aalto University’s
campuses in Otaniemi, Toolo, Mikkeli and Pori. The aim of this project was to analyse some
basic features of the data and to infer social relationships of users based on spatiotemporal
co-occurrences. Due to time constraints, the focus stayed mainly on the first task.
After a literature review in section two, basic features of the data are explored and
connections between buildings are analysed in section three. Typical challenges of wireless
network data related to location inaccuracy and temporal sparsity are evident even in this
dataset. In order to reduce the effect of these limitations, observations of each device
were grouped into stays and unnaturally short stays were filtered out. After excluding
stationary devices, co-occurrences and common movements of each pair of devices were
extracted, and diversity values reflecting the distribution of co-occurrences to different
locations were calculated for the co-occurrence vector of each pair.
2. Related literature
2.1. Extracting social ties from WiFi data
Location inaccuracy and temporal sparsity of the observations are two common challenges
associated with WiFi data [4, 9]. The density of access points limits the maximal
accuracy of location estimates [14]. In addition, wireless signals are easily disturbed by
both stationary and moving obstacles, which can cause unexpected fluctuation in detected
signal strength (RSSI) values [9]. The smartphone model may also affect the detected signal strength,
which further complicates the localization process [12].
Methods for collecting information about devices vary. Detection of devices can rely
on probe requests that devices transmit automatically in search of nearby networks, on
control frames such as null frames, which devices emit while connected to a network [12],
or on directly gathering information at regular intervals about all devices connected to
a specific network [11, 13]. Some degree of temporal irregularity is associated with all
of these approaches. Different devices emit probe requests at different rates [28]. The
probe frequency for one device is typically not constant; probe requests and null frames
are sent in bursts [12], and the contact frequency is affected by the power state of the
device [28]. Somewhat surprisingly, devices have been observed to continue sending probe
requests even while connected to a WiFi network [29]. Interference from other devices and
other environmental disturbances can lead to probe requests not being captured by access
points, which further contributes to the irregularity in signalling frequency [4]. Probing
frequency can be artificially increased, which, however, has the inconvenient drawback of
faster energy consumption of devices [28]. Even the observations gathered by logging all
connections of the network at regular intervals can exhibit some irregularity, as not all
present devices are necessarily detected every time the connections are checked [13].
Despite these challenges, WiFi signals have been successfully used in social network
mining. The difficult task of localizing devices can be avoided by using names of access
points as approximations for devices’ locations. In [29], 19 university campus buildings
equipped with a WiFi probe detector near the entrance were first categorized into one of
four categories (canteen, teaching building, laboratory building or dormitory), after which
observations of each unique device were grouped into trajectories consisting of semantic stay
points. All nearly simultaneous entries and departures inside a three-minute time window
contributed to the longest common subsequence of a pair of users. Similarity values based
on the length of longest common subsequence were used as weights of social ties. Features
of encounters were used to infer the level of intimacy of each pair of users; co-occurrences
during weekends and in the evenings were found to indicate a more intimate relationship
than co-occurrences during daytime on weekdays. The type of location was also taken into
account. Still, as the simultaneous stays were not necessarily located in the same building
but only in the same type of building, the longest common subsequence did not necessarily
indicate a common movement pattern but only a temporally similar routine. Results were
validated by investigating social relationships of selected users, and community detection
was performed based on the obtained similarity measures.
Groups were also detected based on nearly simultaneous entrance and exit times in
[1]. Relationships of individuals were not analysed further, as the focus of the article lay
mainly on analysing utilization of a university staff lounge area. People who regularly
entered or exited the lounge inside a three-minute time window were assumed to belong
to the same social group, and they were found to utilize the lounge more regularly and for
longer time periods than people who usually entered and exited the lounge alone. In [23],
unsupervised machine learning techniques were used to identify groups of supporters of
two teams at a football match. The groups were identified based on the fraction of shared
locations between devices; the more shared locations there were, the higher the probability
of the users belonging to the same group of supporters was estimated to be.
Bilogrevic et al. [2] compared the accuracy of relationship detection based on access
point logs to that based on information collected directly by mobile devices. In
the first case, 37 access points were installed to detect communication between participants
through peer-to-peer technologies, and spatial proximity was inferred from these
observations by first estimating devices’ latitude and longitude coordinates with a trilat-
eration process. In the latter case proximity was inferred directly from RSSI values of
neighbouring devices detected by each individual device. Information about underlying
relationships was gathered, and a machine-learning algorithm was trained to classify rela-
tionships into friends, classmates or others based on features of co-occurrences. Features
taken into account were duration, inter-encounter time, average encounter RSSI value,
number of encounters, and, in the case of access point logs, also the type of encounter
location (pathways, public spaces or classrooms). The classifier based on access point logs
achieved accuracy comparable to that of the device-based classifier, especially when measures
based on community detection were included in the model.
Instead of first localizing devices, Hong et al. [12] estimated co-occurrences directly
from detected signal strength values. Signal strength information of probe requests and
null data frames was detected by several monitors and merged to obtain a normalized RSSI
fingerprint. Co-occurrences were detected based on RSSI fingerprint similarity and used to
calculate for each pair of users a relationship index, which took into account the number of
co-occurrences, their lengths as well as their associated fingerprint similarity values. The
implemented system reached a high level of accuracy, but also required some installation
work; only two indoor spaces were monitored and both rooms were equipped with several
monitors. Co-occurrences were also detected based on the similarity of RSSI values by
Vanderhulst et al. [28], although this time the focus was solely on detecting spontaneous
and short-lived encounters instead of discovering long-term friendships. The system was
more invasive than that of Hong et al., as a probe-generating system was implemented
to increase the probe frequency of devices to obtain a high temporal granularity, and the
rates at which devices transmitted probe requests were synchronized. RSSI values were
also used in [3], where a system based on correlation of variation in two devices’ RSSI
values was developed for detecting common movements of devices.
Many smartphones store a list of known network names (SSIDs) and send specific probe
requests to these networks. This information can be used to infer social ties between
individuals [7, 27]. Barbera et al. [27] constructed the users’ social network based on SSID names
using the Adamic-Adar similarity metric, which allows for diminishing the contribution of
popular past WiFi networks to link weights. When a single monitor was set to capture
wireless signals at the entrance of a university building, the obtained Adamic-Adar
similarity values were found to correlate positively with the probability of co-occurrence of
two users. Overall, the privacy threats caused by the broadcast SSID names were evident:
languages of SSID names were used to infer the nationality of users, while information
about the vendor (evident from the unmodified identifier of the device) was used to infer users’
socio-economic status.
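The down-weighting of popular networks can be illustrated with a minimal sketch of the Adamic-Adar similarity. It assumes that each device is represented simply by the set of SSIDs it has probed for; the function name, the data layout and the toy data are hypothetical and not taken from [27]. Each shared SSID contributes the inverse logarithm of the number of devices that know it, so a network known by almost everyone adds very little to the link weight.

```python
import math
from collections import Counter

def adamic_adar(ssids_by_device, a, b):
    """Adamic-Adar similarity of two distinct devices a and b over shared SSIDs."""
    # Popularity of an SSID = number of devices whose list contains it.
    popularity = Counter(
        ssid for ssids in ssids_by_device.values() for ssid in set(ssids)
    )
    shared = set(ssids_by_device[a]) & set(ssids_by_device[b])
    # An SSID shared by two distinct devices has popularity >= 2, so log(...) > 0.
    return sum(1.0 / math.log(popularity[ssid]) for ssid in shared)

# Hypothetical toy data: four devices and the SSIDs they probe for.
ssids = {
    "dev1": {"eduroam", "HomeNet-42"},
    "dev2": {"eduroam", "HomeNet-42"},
    "dev3": {"eduroam", "CafeWiFi"},
    "dev4": {"eduroam"},
}
score_12 = adamic_adar(ssids, "dev1", "dev2")  # share a rare home network
score_13 = adamic_adar(ssids, "dev1", "dev3")  # share only the popular network
```

In this toy example the pair sharing the rare home network obtains a clearly higher score than the pair sharing only the network known by every device.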
In [15], the data was somewhat sparser, as the data consisted of events of users logging
in to a base station in order to access WiFi. Social relationships were inferred using tem-
poral and spatial precision and uniqueness of co-occurrences. All co-occurrences were first
grouped into events, one event including all observations inside a 5-minute time window.
Each event obtained a weight based on its spatial and temporal precision and uniqueness.
An event was defined to be temporally unique if few other events were taking place at the
same time, and spatially unique if few other events took place at the same location.
High temporal precision, in turn, implied a short time period between the occurrences of the
first and the last person related to a particular event, while spatial precision was related
to the level of granularity. Event weight was obtained by multiplying these four measures,
and link weight for each pair of individuals was equal to the sum of the event weights of
all events the two individuals had co-occurred in. The principle of homophily was used
for testing the success of the approach; users connected with high-weighted links
showed greater demographic similarity than a pair of users picked at random.
Besides monitoring signals of near-by devices, physical proximity can be inferred by
directly comparing the lists of WiFi access points seen by two users, if one has access to
information gathered by devices. Using this approach, Sapiezynski et al. [25] compared
the role of different features of co-occurrences in predicting ties of social networks. The
social networks were constructed based on participants’ interactions through phone calls,
SMS or Facebook. The most important feature in predicting ties of phone call and SMS
networks was the total time spent together outside working hours weighted by the number
of people present. Individuals who frequently called each other also met more frequently
and at more varying times than others.
2.2. Extracting social ties from other kinds of spatiotemporal data
When developing methods for uncovering users’ social relationships from spatiotemporal
data, data from location sharing social networks has proven popular. Applications en-
abling sharing of user’s location offer a readily available ground truth in the form of users’
contacts in the application. This type of data is temporally much more sparse than WiFi-
or GPS-data, and detection of co-occurrences is often based on simply observing whether
users co-occur inside discrete, predetermined time windows. These kinds of
approaches differ mainly in the ways they aim to separate real encounters from coincidental
co-occurrences. Features related to each individual’s movement pattern as well as features
characterising locations or the pattern of users’ co-occurrences have been proposed.¹
Popularity of co-occurrence location can be used to infer the significance of the co-
occurrence; co-occurrences in less popular places are in general stronger indicators of an
underlying association between users than co-occurrences in more crowded locations. Lo-
cation entropy measures the diversity of unique visitors of a location [6], and it has been
successfully used to emphasize co-occurrences at locations specific to only a few users [21].
This can be especially helpful if the data is relatively sparse [21], but the applicability is
reduced when the monitored area is small and covers only places with a relatively similar
user profile, as is the case in the present dataset. In addition to considering popularity of
the co-occurrence location, Wang et al. [30] included in their model the probability of an
individual visiting a certain location; if a user visits the location frequently, co-occurrences
are more likely to be coincidences.
Besides locations, entropies can be calculated for co-occurrence vectors of two indi-
viduals. Variety of co-occurrence locations for a pair of users has been used to separate
coincidental co-occurrences from more meaningful ones, as coincidental co-occurrences tend
to concentrate on only a few places [21]. In [21], both of these measures were applied in
relationship prediction; diversity values of encounters were calculated based on co-occurrence
entropies, and location entropies were used for calculating weighted frequencies. Diversity
values were based on Rényi entropy, which allows for emphasizing co-occurrences in
locations that the user pair visits only rarely by changing the parameter q, the order of
diversity; the idea here was that coincidental co-occurrences tend to be associated with
high local frequencies. The optimal combination of diversity and weighted frequency for
predicting true friendships was found empirically.
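The role of the parameter q can be illustrated with a small sketch. It assumes that diversity is taken as the exponential of the Rényi entropy of a pair's co-occurrence vector (the number of co-occurrences at each location); the exact formulation in [21] may differ in detail, and the function names are illustrative only. For q = 1 the Rényi entropy reduces to the Shannon entropy, which is handled as a special case.

```python
import math

def renyi_entropy(counts, q):
    """Rényi entropy (natural log) of order q for a vector of co-occurrence counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (1.0 - q)

def diversity(counts, q):
    """Diversity of order q: the effective number of co-occurrence locations."""
    return math.exp(renyi_entropy(counts, q))

# Hypothetical pair: 18 co-occurrences at one location, 1 at each of two others.
concentrated = [18, 1, 1]
low_q = diversity(concentrated, 0.1)   # rare locations weigh more
high_q = diversity(concentrated, 2.0)  # the dominant location weighs more
```

With a low q the two rarely visited locations raise the diversity towards the plain count of locations (three), whereas with a high q the value stays close to one, which is how a lower order of diversity emphasizes co-occurrences at locations the pair visits only rarely.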
The probability of a true encounter is further affected by temporal features of co-occurrences,
such as time of the day [24]. The probability of friendship is higher if meetings are distributed
over a longer time period [30] or if there are encounters outside working hours [19]. The authors
of [20] proposed a framework where regularity of encounters as well as diversity of
co-occurrence locations contributed to weights of social ties, inferred with the help of machine
learning techniques. Desai et al. [8], in contrast, found that temporal diversity correlated better
with self-reported strength of relationship than location diversity or the average number
of encounters per day. Temporal diversity reflects how co-occurrences are spread across
different time intervals of the day, and it was calculated using Rényi entropy, similarly to
how location diversity was calculated in [21]. The more weight was given to encounters
at times atypical to the user pair (i.e. the lower the parameter q of Rényi entropy was),
the better temporal diversity predicted closeness of relationship. The data consisted of
GPS traces of 46 university students living on campus. Latitude and longitude coordinates
were geohashed to specific location names before analysis.
¹ A more detailed description of several of the articles described in this section can be found in a review
article [26], which I found at a late stage of this work.
In [5], a probabilistic model was developed to infer social ties from spatiotemporally
coinciding photos uploaded to a photo-sharing site. Features taken into account were the
spatial and temporal distance between uploads as well as the number of places the users co-
occurred in. In [17], co-occurrence detection was based on a student card scanning system,
which recorded individuals’ movements with great temporal accuracy. The number of
co-occurrences of each pair of students was assumed to follow a Poisson distribution, and
link strengths were inferred by testing the probability of the number of co-occurrences
being non-random. A somewhat different approach was taken in [31], where social ties
were predicted based on distances between users’ frequent movement areas emphasizing
less popular areas.
Approaches based on similarity of trajectories have been common when inferring friend-
ships and more generally user similarity from GPS-data. As these similarity measures
typically rely on distances between users’ frequent movement areas or on similar transfer
times between relevant stay points and not on actual simultaneity of stays, they do not
necessarily indicate an existing friendship, but instead serve as a good basis for a friend
recommendation system based on shared interests and habits. This category includes,
for example, the work of Li et al. [16], in which similarity measures were calculated based
on the number and length of similar location sequences with similar travel times. Similar
sequences were weighted depending on their length and the level of spatial granularity; the
optimal weights were set empirically.
2.3. Movement analysis at campus areas using WiFi logs
WiFi logs have been used to identify recurrent movement patterns at public buildings and
other areas, such as hospitals [22] and university campuses [11, 18]. In this last section
of the literature review, I will briefly present two papers in which WiFi logs are used in
movement analysis at campus areas and which I found helpful when filtering the data as
well as when analysing connections between buildings.
In [11], observations gathered of devices connected to the eduroam network were used
to uncover typical movement patterns on campus. All connections in the network were
automatically logged at intervals of five minutes and grouped in the database into sessions,
each session representing an uninterrupted connection associated with a particular access
point. Consecutive sessions at the same location were grouped together if they were separated
by less than an hour, and sessions with only one observation were removed, as they were
reasoned to be likely to represent people passing by. As mobile devices were assumed to
connect more frequently to new access points, mobile devices were identified by defining
the ratio of short sessions (consisting of only one observation) out of the number of all
sessions per device.
Meneses et al. [18] focused on analysing frequent movement patterns between different
locations at university campuses at different times of day. Again, each session in the
eduroam network represented an uninterrupted connection to one access point. The spatial
resolution was in the range of a few tens of meters. After suspiciously fast changes between
nearby access points were filtered out, durations of movements between places followed a
power law distribution. Analysis of place connectivity (the number of movements between
buildings) revealed that some locations acted as hubs with connections to many places,
while most of the locations were less well connected.
3. Features of the data
3.1. Description of the dataset
The present dataset consists of two weeks’ WiFi logs from Aalto University’s four cam-
puses at Otaniemi, Toolo, Mikkeli and Pori. The analysed data covers 14 days from 29th
October to 11th November. One data point represents one observation of a device’s WiFi
signal detected by the university’s WiFi access points (APs). Overall there were
51 250 459 observations, with latitude and longitude information missing from 33 653 179
observations. Each observation contains several measures; in this work the relevant features
are anonymized Media Access Control (MAC) address, timestamp, location name, lati-
tude and longitude coordinates, confidence factor, service set identifier (SSID) as well as
a timestamp telling when the device was first detected. Confidence factor represents half
of the side of the square (measured in feet) in the centre of which the device is located
with 95 % probability. The level of uncertainty, i.e. the size of the square, is related to the
received signal strength indicator (RSSI) value.
A MAC address acts as a unique identifier for one device. The number of unique MAC
addresses was 501 964, out of which 244 589 were observed only once during the
entire time period. iPhones, some of which regularly change their MAC address, might
contribute to the large number of unique MAC addresses; still, the number is surprisingly
high. These one-time observations were not restricted to any particular location, and
their daily pattern resembled the overall pattern of the observations in the sense that they
peaked around lunch time, which should not have been the case had these observations
been generated by passing cars.
The location names are built hierarchically, each one consisting of a campus name,
building name, floor number and, in the case of some buildings (Vare and Dipoli), also
a room number. Altogether there are 164 unique location names, which correspond to
39 different buildings and two outside areas (Rakentajanaukio and Vare piha-alue). 31 of
these buildings are located in Otaniemi, four in Toolo, two in Mikkeli and two in Pori.
While latitude and longitude coordinates are defined using a triangulation process, the
method used for defining location names is not entirely clear at this point. Out of
4 547 696 location name changes where latitude and longitude coordinates are known at both
ends of the movement, the coordinates stay exactly the same in over half (2 983 230) of
these movements, which implies that the methods used in these estimations indeed differ.
Furthermore, most location names were associated with both missing and non-missing
latitude and longitude coordinates.
The SSID name was visible in about half (26 694 874) of the observations. The APs
broadcast nine different SSID names, out of which the three most popular ones ("aalto open",
"aalto" and "eduroam") cover over 99 % of the observations with a visible SSID.
3.2. Pre-processing and filtering

The aim of pre-processing was twofold: to filter out stationary devices and to identify stay
points for devices spending a reasonable proportion of time at the campus area. Obser-
vations were grouped into stays in order to reduce the effect of irregular inter-observation
times and to enable identification of inaccurate stays. Visualizing observations of individ-
ual devices revealed that many devices seemed to change location suspiciously fast and
suspiciously often. This phenomenon was especially evident between different floors of the
same building (example in figure 1). Consequently, even though lower granularity level
increases the number of coincidental co-occurrences, I decided to perform the subsequent
co-occurrence detection at building level in the hope that more noise would be removed
than added. As discussed in the literature review, good results have been obtained in
previous research using building level information, even when the number of monitored
buildings has been significantly lower than in the present dataset (for example [29]). Even
at building level, there were clearly many erroneous location names. One device, for
example, seemed to travel within a 50-second period from the Undergraduate centre to the
second floor of Dipoli, back to the second floor of the Undergraduate centre, then out to
the yard in front of Vare, back to the second floor of the Undergraduate centre, again
to the yard in front of Vare and finally back to the second floor of Dipoli. One possible
explanation for this kind of phenomenon might be that even though the WiFi network is
not designed to cover all outdoor areas of the campuses, a device might occasionally be
able to connect to a network even when not located inside a building [22]. As there are no
location names for outdoor areas (except for Vare piha-alue and Rakentajanaukio), this
can be expected to create short stays with inaccurate location names reflecting movements
between buildings.
The data was filtered in several successive steps. First, all devices with fewer than 10
observations or a moving radius of less than 20 m were removed. This step was not able to
capture all stationary devices, as the radius of movements was left undefined for devices
with missing coordinates, and on the other hand the coordinates might fluctuate even in
the case of stationary devices. This initial filtering reduced the number of unique MAC
addresses to 95 615.
Figure 1: Examples of observations of individual devices on Monday 29/10
Next, all observations were grouped into stays. If consecutive observations at the same
location were separated by less than 30 minutes, it was assumed that the device had not
left the building in the time between, and the observations were merged into one stay. The
merging threshold was originally set to 15 minutes, but after examining the distribution
of lengths of stays as well as individual trajectories after merging, I decided to raise it to
30 minutes. Still, many of the stays after this initial merging remained unnaturally short,
with the median length of stay at 1,9 minutes. For some devices, the total number of stays was
in the range of 10^4, which equals over 700 stays per day on average. It was assumed that
very short stays lasting under one minute most likely represent people passing by or just
some general noise in the localization process. Consequently, all stays lasting less than one
minute were filtered out, after which the stays at the same location separated by less than
30 minutes were again merged. In addition, devices with only one stay were filtered out.
The stays were filtered based on their duration and not on the durations between the end
of the previous and the start of the next stay because, as already discussed, movements
between buildings are not necessarily reflected accurately enough. After these steps, the
median duration of stays had risen to 20 minutes. The median is still smaller than one
would expect it to be. It could have been wise to continue removing very short stays,
but as I started to feel that I was filtering away too large a part of the data, the filtering of
individual stays was not continued further.
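The merging and filtering of stays described above can be sketched as follows. This is a minimal illustration rather than the code used in the project: it assumes that the observations of one device are available as a time-sorted list of (timestamp in seconds, building name) pairs, and all function and variable names are hypothetical. Only the two thresholds, 30 minutes and one minute, come from the text.

```python
MERGE_GAP_S = 30 * 60  # merging threshold from the text (30 minutes)
MIN_STAY_S = 60        # stays shorter than one minute are dropped

def build_stays(observations):
    """Group time-sorted (timestamp_s, building) observations of one device
    into stays, returned as [building, start_s, end_s] lists."""
    stays = []
    for t, building in observations:
        last = stays[-1] if stays else None
        if last and last[0] == building and t - last[2] < MERGE_GAP_S:
            last[2] = t  # same building, short gap: extend the current stay
        else:
            stays.append([building, t, t])
    return stays

def merge_adjacent(stays):
    """Merge consecutive stays in the same building separated by a short gap."""
    merged = []
    for building, start, end in stays:
        last = merged[-1] if merged else None
        if last and last[0] == building and start - last[2] < MERGE_GAP_S:
            last[2] = max(last[2], end)
        else:
            merged.append([building, start, end])
    return merged

def clean_stays(observations):
    """Build stays, drop those under one minute, then merge again."""
    stays = build_stays(observations)
    stays = [s for s in stays if s[2] - s[1] >= MIN_STAY_S]
    return merge_adjacent(stays)

# Hypothetical device: a 20-second jump to building B interrupts a stay in A.
obs = [(0, "A"), (600, "A"), (620, "B"), (640, "A"), (1200, "A"),
       (5000, "C"), (5300, "C")]
result = clean_stays(obs)
```

In the toy example the brief jump to building B is removed as a too-short stay, after which the two stays in building A on either side of it are merged into one.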
After the stay points were formed, devices were filtered based on the proportion of time
they spent at Aalto. Following the idea in [12], a presence ratio was calculated for each
device by dividing the sum of duration of stays by the time span of two weeks. Devices with
a presence ratio over 0,5 or under 0,02 were excluded from subsequent analysis. The lower
bound was chosen in order to filter out devices with too little information for a meaningful
analysis, while the upper bound aimed at identifying stationary devices. The upper limit
was set to 0,5 as the number of individuals spending over half of their time at Aalto was
assumed to be very low, especially since the presence ratio is likely to underestimate the
time spent on campus due to possible periodic inactivity of devices.
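As a minimal sketch of this step, the presence ratio and the two bounds can be written as below. The representation of stays as (building, start, end) tuples with times in seconds and the function names are assumptions made for illustration; the bounds 0,02 and 0,5 and the two-week time span are taken from the text.

```python
TWO_WEEKS_S = 14 * 24 * 3600  # observation period in seconds

def presence_ratio(stays, span_s=TWO_WEEKS_S):
    """Total duration of stays divided by the length of the observation period."""
    return sum(end - start for _, start, end in stays) / span_s

def keep_device(stays, lower=0.02, upper=0.5):
    """Keep devices whose presence ratio is neither under the lower bound
    nor over the upper bound."""
    return lower <= presence_ratio(stays) <= upper

HOUR = 3600
# Hypothetical devices: ten 8-hour stays, one 6-hour stay, one 200-hour stay.
regular = [("A", d * 24 * HOUR, d * 24 * HOUR + 8 * HOUR) for d in range(10)]
rare = [("A", 0, 6 * HOUR)]
stationary = [("A", 0, 200 * HOUR)]
```

For scale, the lower bound of 0,02 corresponds to roughly 6,7 hours of recorded presence over the two weeks, and the upper bound of 0,5 to 168 hours.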
Lastly, all devices with stays in only one building or with fewer than 10 stays were
filtered out. Examining the distribution of lengths of stays revealed that there were still
some suspiciously long individual stays left in the data. These were assumed to belong to
stationary devices, and, consequently, devices with stays lasting over 12 hours were filtered
out. The final number of devices was 16 171.
3.3. Characteristics of the data

In the following figures the distributions are plotted for both unfiltered and filtered data.
In general, if plotting the measure requires information about individual observations,
filtered data (plotted in orange) refers to data from which all devices not used in co-
occurrence analysis are excluded, but in which all observations of each included device
are left unfiltered. This is because at the stage where the very short stays are filtered
out, individual observations have already been merged into longer stays. While it would
have been possible to retrace which observations belong to which stay, this did not seem
worthwhile. In figures 4 and 5, the filtered data (plotted in red) refers to data from which
even the too-fast stays have been filtered.
The number of observations per hour followed a regular daily pattern
peaking around lunchtime (figure 2). The number of observations diminished somewhat
on Fridays and even more during weekends. The number of observations per device is close to
a power law in the unfiltered data, while the distribution for the filtered devices resembles
more closely a lognormal distribution (figure 2c).
Probability distributions of confidence factors and distances between successive obser-
vations with defined coordinates are shown in figure 3. In the latter figure, cases where
coordinates do not change are excluded (over half in both the unfiltered and filtered data).
The median confidence factor for the unfiltered data was 136 and for the filtered data 120,
which correspond to an 83 x 83 m and a 73 x 73 m square, respectively. The distribution of
confidence factors resembles an exponential distribution, while the distribution for distances
is less regular. As should be the case, the filtering procedure leaves the distribution of
confidence factors and distances relatively unaffected, while the number of observations
and locations per device (figure 4) as well as device-wise entropies (figure 5) are on average
higher for the filtered devices.
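The square sizes quoted above follow directly from the definition of the confidence factor as half of the side of the square, measured in feet. A short check, with an illustrative function name, is:

```python
FEET_TO_M = 0.3048  # one foot in meters

def square_side_m(confidence_factor_ft):
    """Side of the 95 % square in meters; the confidence factor is half the side in feet."""
    return 2 * confidence_factor_ft * FEET_TO_M

side_unfiltered = square_side_m(136)  # median of the unfiltered data
side_filtered = square_side_m(120)    # median of the filtered data
```

The values are approximately 82,9 m and 73,2 m, which round to the 83 m and 73 m sides given in the text.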
Figure 2: (a) Number of observations per hour during the observed two weeks, (b) Number
of observations on Monday 29/10, (c) Number of observations per device
Entropies were calculated for both devices and buildings (figure 5). The chosen base of
logarithm was two, meaning that entropy was measured in bits. Entropies calculated for
the distribution of a device’s observations over different locations are relatively low, and naturally
even more so once the location names are reduced to building level. This indicates that
observations of one device tend to concentrate heavily on a few places. Entropies of locations
reflect the user profile of the building; popular places with many visitors have higher
entropies, while low entropy implies that the location is more specific to fewer individuals
[21]. Out of the buildings at the Otaniemi and Toolo campus areas, the Undergraduate centre
had the highest entropy (12,4) while Gentti had the lowest entropy (4,8). It should be
noted that the effect of filtering is somewhat unequal for different buildings due to the
chosen filtering criteria; for example the relative number of excluded devices is larger for
devices staying mainly in Pori than for devices staying mainly at Otaniemi, as all devices
with stays in only one building were excluded, and there is only one monitored building in
Pori.
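The entropy calculation can be sketched as below. This is an illustration under assumptions rather than the project's code: for a device, the labels are assumed to be the location names of its observations, and for a location, the identifiers of the devices observed there. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy_bits(labels):
    """Shannon entropy (base 2, in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical device: observations spread evenly over four locations ...
even_device = ["A", "B", "C", "D"]
# ... and a device seen almost only at one location.
concentrated_device = ["A"] * 18 + ["B", "C"]

h_even = shannon_entropy_bits(even_device)
h_concentrated = shannon_entropy_bits(concentrated_device)
```

Since the entropy of a distribution over n outcomes is at most log2(n), an entropy of 12,4 bits for a building would, if calculated over devices in this way, require at least about 5 400 distinct devices.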
Figure 3: Confidence factors of all observations (a) and distances between successive
observations (b)
Figure 4: Number of locations per device at high granularity (a) and at building level (b).
Number of devices per location at high granularity level (c) and at building level (d).
Figure 5: Entropies of devices at high granularity level (a) and at building level (b).
Devices with only one observation are excluded. The distributions in (b) are not directly
comparable, as entropies are calculated based on individual observations for the unfiltered
data, while for the filtered data stays are first divided in periods of five minutes, and
entropies are calculated based on the number of these periods at different locations.
Figure 6: Number of observations per location
Some devices contact the network more frequently than others, and the observations
typically come in bursts. The probability distribution of inter-observation times is irregular
(figure 7), and a significant proportion of inter-observation times last less than one second.
The small bump near 10^5 seconds reflects the daily rhythm of the campus, corresponding to
people returning to the university after the night. The distributions of inter-observation
times for cases where the location name changes and for cases where it stays the same look
markedly similar, further confirming that movements between buildings are not recorded
temporally accurately. For the filtered devices, the median inter-observation time was under
half an hour for almost all devices.
Figure 7: Times between successive observations (a). In the figure on the left the location
name changes between observations; on the right the location name stays the same. The
probability distribution of medians of inter-observation times per device is shown in (b).
Every observation included a timestamp at which the session had begun. The total
number of uninterrupted sessions in the unfiltered data was 2 593 181, and the distribution
of the number of stays per device followed a power law, with values ranging from 1 to 2358.
Durations of uninterrupted sessions varied, and location names and coordinates could change
within one session. The rate of observations within a session did not stay constant. Somewhat
surprisingly, some sessions overlapped; a new session with a new starting time was
logged even though observations of the same device’s other session with an earlier starting
time were logged both before and after this new event. Varying locations or different WiFi
networks (different SSID names) did not seem to explain this phenomenon.
In order to examine the reliability of the localization method used for defining latitude and
longitude coordinates, durations of movements lasting under one hour were plotted against
distances between the start and end points of the movement. As is evident in figure 8, there
are almost no movements where the travelled distance is less than two meters and
the duration is less than two seconds, presumably due to constraints embedded
in the localization system. There are some suspiciously fast movements (assuming that
people are not driving), such as jumps of several hundred meters in less than ten seconds.
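A check of this kind could be sketched as below. The sketch assumes that each observation is available as a (Unix time in seconds, latitude, longitude) tuple, uses the haversine formula for distances, and flags movements faster than a hypothetical walking-speed threshold of 3 m/s; the function names, the threshold, and the sample track are illustrative assumptions, not part of the original analysis.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def suspicious_moves(observations, max_speed_mps=3.0):
    """Return (distance_m, duration_s) for consecutive pairs faster than max_speed_mps.

    observations: list of (unix_time_s, lat, lon) for one device, sorted by time.
    Pairs with unchanged coordinates or zero duration are skipped.
    """
    flagged = []
    for (t1, la1, lo1), (t2, la2, lo2) in zip(observations, observations[1:]):
        dist, dur = haversine_m(la1, lo1, la2, lo2), t2 - t1
        if dist > 0 and dur > 0 and dist / dur > max_speed_mps:
            flagged.append((dist, dur))
    return flagged

# Hypothetical track of one device (coordinates are made up for illustration).
track = [
    (0, 60.1840, 24.8300),
    (600, 60.1850, 24.8300),  # roughly 111 m in 10 minutes
    (605, 60.1880, 24.8300),  # roughly 334 m in 5 seconds
]
flagged = suspicious_moves(track)
```

Only the second movement is flagged here; in practice the threshold would have to be chosen with the localization error in mind.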
Figure 8: Correlation of distances between successive observations and durations of these
movements in the unfiltered data. Cases where coordinates stay the same are excluded.
Brightness of color indicates the number of movements in that cell.
3.4. Connections between buildings
Connections between buildings were investigated using the data from which the very
short stays had also been filtered out. No clear hubs emerge if one simply looks at the number
of buildings each building is connected to by at least one movement. However, when the
number of movements for each of these connections is counted, the resulting probability
distribution is close to a power law (figure 9), indicating that there are a couple of very
crowded connections, while most connections remain relatively unpopular. The most popular
connections at different times of day and week are shown in figure 10. As expected, during
weekends movements at Otaniemi campus seem to be more random, while on weekdays, especially
during lunchtime and in the afternoon, traffic is more heavily concentrated between the
main buildings. In Toolo, traffic between different buildings remains relatively active even
on weekends, which might imply that some of the observations result from passers-by.
Figure 9: Number of movements per non-directional connection between buildings.
Symmetry of connections from and to each building at Otaniemi campus was investigated
by calculating Gini indices for the distribution of movements from and to other
locations. The Gini index is zero in the case of a uniform distribution and approaches one as
the distribution concentrates more heavily on one value. All buildings had relatively similar
Gini indices for their incoming and outgoing movements, which implies a certain degree of
symmetry of traffic. Indices were in the range of 0.6–0.8 for all buildings except for Gentti,
for which the indices were just below 0.5. Symmetry of connections was further analysed
with extended Jaccard indices (introduced in [10]), which allow for comparing the similarity of
ranked sets. The top 5 in- and out-connections were relatively similar for most buildings,
but exactly the same for only two buildings (Otakaari 5 and Meritekniikka 1). Otahalli and
Gentti had the most varying in- and out-connections. In general, all buildings were well
connected to buildings close to them, which might be a sign of noise in the localization
process.
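To make the Gini index concrete, the following sketch computes it from a list of movement counts using the standard rank-based formula. The function name and the example counts are hypothetical; whether buildings with zero movements were included in the distribution is not stated in this report, so the example simply uses whatever counts it is given.

```python
def gini(counts):
    """Gini index of non-negative counts: 0 for a uniform distribution,
    (n - 1) / n when all mass sits on a single one of n values."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive count")
    # Rank-weighted sum over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Hypothetical numbers of movements from one building to four other buildings.
out_counts = [120, 30, 5, 5]
out_gini = gini(out_counts)
```

Comparing `gini` of the incoming and outgoing counts of the same building gives the kind of symmetry check described above; similar values suggest that traffic is spread over neighbours in a similar way in both directions.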
(a) Weekdays 6 - 11 (b) Weekdays 11 - 14
(c) Weekdays 14 - 18 (d) Weekdays 18 - 24
(e) Weekends 0 - 24
Figure 10: Popular connections at different times of day on weekdays and during the
weekend. Edge weights are calculated by dividing the number of movements on a certain
(directional) connection between two buildings by the total number of movements during
the chosen time window. Only connections with over one percent of the total traffic are
shown; darkness of the edge implies the relative popularity of the path. Movements lasting
over one hour are excluded. During weekends the movement patterns remain very similar
throughout the day, which is why all movements are merged in (e).
4. Co-occurrences and common movements
4.1. Methods
After determining stay points and their starting and ending times, co-occurrences for each
pair of devices were extracted by comparing time windows of the devices’ stays. Two
devices were defined as co-occurring if their stays at the same location shared a non-zero
time period. In addition, the number and lengths of maximal similar location sequences
were detected. A similar location sequence was defined as a sequence where two devices visit
the same locations in identical order, with stays at each location overlapping for a non-zero
time period; this sequence is maximal if it is not a subsequence of any other such similar
location sequence. In addition, it was required that both the respective ending times of
the stays at the first location and the starting times of the subsequent stays at the next
location differed by less than 5 minutes, and that the time period between the consecutive
stays at different locations was less than one hour. This last constraint was relaxed for
common movements between different cities, of which, however, there were none in the
data.
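The co-occurrence definition above (stays at the same location sharing a non-zero time period) can be sketched as follows. This is an illustrative, quadratic-time version and not the implementation used in the project; the tuple layout of a stay, the function name, and the sample data are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def co_occurrences(stays):
    """Return {(dev_a, dev_b): [(location, overlap_start, overlap_end), ...]}.

    stays: iterable of (device, location, start, end) with start < end.
    Two stays co-occur if they are at the same location and share a
    time period of non-zero length.
    """
    by_location = defaultdict(list)
    for device, location, start, end in stays:
        by_location[location].append((device, start, end))

    result = defaultdict(list)
    for location, visits in by_location.items():
        for (d1, s1, e1), (d2, s2, e2) in combinations(visits, 2):
            if d1 == d2:
                continue
            start, end = max(s1, s2), min(e1, e2)
            if start < end:  # strictly positive overlap
                pair = tuple(sorted((d1, d2)))
                result[pair].append((location, start, end))
    return dict(result)

# Hypothetical stays, times in minutes since midnight.
stays = [
    ("a", "Library", 0, 30),
    ("b", "Library", 20, 50),
    ("c", "Library", 30, 40),  # starts exactly when "a" leaves: no overlap with "a"
    ("a", "Cafe", 35, 60),
    ("b", "Cafe", 52, 70),
]
pairs = co_occurrences(stays)
```

Comparing all pairs of stays per location is only practical for small inputs; for a dataset of this size a sort-and-sweep over stay start times per location would likely be needed.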
In order to consider not only the number but also the duration of co-occurrences, the
length of each co-occurrence in minutes was divided by five and rounded up to the next
integer. For example, if a co-occurrence lasted for 12 minutes, this was counted as three
distinct co-occurrences. Following the example in [21], diversity values were calculated for
the co-occurrence vector of each pair of devices. Diversity values were calculated using the
Shannon entropy of the co-occurrence vector as shown in the equation below:
\[
D_{ij} = \exp\left( - \sum_{l,\; c_{ij,l} \neq 0} \frac{c_{ij,l}}{\sum_{l} c_{ij,l}} \log_2 \frac{c_{ij,l}}{\sum_{l} c_{ij,l}} \right) \qquad (1)
\]
where c_{ij,l} denotes the number of co-occurrences of devices i and j in location l. It
should be noted that in [21], Rényi entropy was ultimately favored over Shannon entropy
in the final model, but as the number of co-occurrences in this work is related to the durations
of co-occurrences, decreasing the parameter q (the order of diversity of Rényi entropy) as
Pham et al. did would actually penalize longer encounters.
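The duration weighting and the diversity value of equation (1) can be illustrated with a short sketch. It follows the equation as printed, that is, the exponential of a Shannon entropy computed with a base-2 logarithm; the function names and the five-minute default parameter name are assumptions made for the example.

```python
import math

def weighted_count(duration_min, period_min=5):
    """Number of co-occurrences credited to one overlap of the given length."""
    return math.ceil(duration_min / period_min)

def diversity(counts):
    """Diversity of a co-occurrence vector as written in equation (1):
    exp of the Shannon entropy computed with a base-2 logarithm."""
    positive = [c for c in counts if c > 0]  # locations with c = 0 are skipped
    total = sum(positive)
    if total == 0:
        raise ValueError("pair has no co-occurrences")
    entropy_bits = -sum((c / total) * math.log2(c / total) for c in positive)
    return math.exp(entropy_bits)

# A 12-minute co-occurrence counts as three co-occurrences.
three = weighted_count(12)

# Hypothetical co-occurrence vectors: one location only vs. two locations equally.
single_location = diversity([7])
two_locations = diversity([3, 0, 3])
```

A pair that only ever meets in one location gets the minimum value of one, which is why most pairs in the results have very low diversity. Note that with a natural logarithm inside the sum the two-location example would give exactly 2 (the effective number of locations), whereas the base-2 form shown in the equation gives e; the sketch keeps the base-2 form to stay faithful to the text.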
4.2. Results
Out of all 130 742 535 possible pairs of devices, 30% had at least one co-occurrence during
the observed time period, and only 5% had co-occurrences in more than one location.
Consequently, the majority of pairs with co-occurrences had very low diversity values associated
with their co-occurrence vector (figure 11b). The median number of co-occurrences for pairs
with at least one co-occurrence was 10 (figure 11a).
Figure 11: (a) Number of co-occurrences. Pairs with no co-occurrences are excluded. (b)
Diversity values of co-occurrence vectors.
Overall, only 0.13% of all possible pairs shared at least one common movement,
and 0.008% had at least one common movement of length two or more. The longest
detected common movement was of length 12. The distribution of the number of common
movements for pairs with at least one common movement is close to a power law (figure
12a). In order to enable better comparison between pairs of devices, the number of common
movements was divided by the average number of common movements of the user pair,
similarly to how the number of nearly simultaneous entries and exits was normalized
in [29]. The probability distribution of these normalized values resembles a lognormal
distribution (figure 12b).
Pairs with longer common movements tended to have slightly higher diversity values
associated with their co-occurrence vector. This might result from an underlying true rela-
tionship between users; however, it should be noted that diversity values can be expected
to be somewhat higher for pairs with at least one common movement, simply because a
pair with common movements necessarily co-occurs in at least two different locations.
In order to examine how the common movements were distributed across pairs of de-
vices, all pairs with at least two common movements (one movement of at least length
two or at least two separate movements) were read into a network. There were in total
13 345 links connecting 6766 devices; the interquartile range for degree was [1,5], and the
maximum degree was 16.
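The degree statistics of such a network can be obtained from the list of linked pairs alone. The sketch below is a minimal illustration using only the standard library; the link list is hypothetical, and the "inclusive" quartile convention is an assumption, since the report does not say how the interquartile range was computed.

```python
from collections import Counter
from statistics import quantiles

def degrees(links):
    """Degree of every device in an undirected network given as (a, b) links."""
    deg = Counter()
    # Sort each pair so that (a, b) and (b, a) count as the same link.
    for a, b in set(tuple(sorted(link)) for link in links):
        deg[a] += 1
        deg[b] += 1
    return deg

# Hypothetical pairs with at least two common movements.
links = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("e", "f"), ("b", "a")]
deg = degrees(links)
q1, _, q3 = quantiles(sorted(deg.values()), n=4, method="inclusive")
max_degree = max(deg.values())
```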
Figure 12: Number of common movements (a). Normalized values are shown in (b). Pairs
with no common movements are excluded.
5. Discussion
The method used for defining location names turned out to be less reliable than expected,
which, in addition to the shortness of the analysed time period and general uncertainty related
to filtering, restricts the reliability of the obtained results. While the results of the movement
analysis were mainly in accordance with expectations, the low number of common movements and
the shortness of stays even after filtering imply that there is still a great deal of noise left in
the data. Some common movements might not be recognized as such if devices are in an
inactive mode when entering or leaving a building; the limit of five minutes might have
been too strict considering this periodic inactivity of devices. Despite filtering out all very
short stays, some common movements might have been cut short if one of the devices
connected to a nearby network for over a minute while the other one did not.
Overall there seems to be a relatively low level of agreement between the methods used
for defining location names and coordinates. Before continuing to work with this dataset, it
would be good to understand how exactly these methods differ, and whether the confidence
factors reflect the accuracy of both estimations. In general, using the latitude and longitude
coordinates to extract stay places would offer higher granularity and increase the reliability of
the results, especially if the confidence factors were taken into account (see footnote 2).
Even the reliability of the present results could have been improved by simply leaving out
all observations with confidence factors from the higher end of the distribution.
Besides questions related to the localization processes, some questions worth exploring
before continuing the analysis of this dataset include the following. It would be good to find
out what exactly determines the frequency of observations in the present dataset; are the
differences entirely device-dependent, or are the connections also monitored from the side
of the system? I originally understood that the observations were probe requests, but after
noticing that each observation also included the “first detected at” timestamp, I became
unsure of this. It would be good to understand better how the recording of sessions functions;
could this information be used to infer more reliably when the devices exit the campus
area?
Regarding possible similarity measures based on common movements, it could be a
good idea to give more weight to longer common movements, similarly to [16] (where the
weights for longer location sequences with similar travel times, however, were set empirically).
As there seem to be clear differences in how popular the connections are, one could
emphasize common movements on less popular connections and calculate diversity values
for the distribution of each pair’s movements on different paths. In addition to location-related
characteristics, temporal features of co-occurrences and common movements could
be incorporated in the model when calculating weights of social ties. If the number of
movements between cities stays low even when the analysed time period is longer, it could
be reasonable to leave out devices staying mainly in Mikkeli or Pori. Alternatively, different
levels of granularity could be taken into account by emphasizing common movements
between different cities more than smaller-scale movements inside one campus area.
Footnote 2: When identifying stops and moves from latitude and longitude coordinates, it may
be useful to look at [4], where different algorithms commonly used for identifying stops and
moves from GPS data are tested with WiFi data.
References
[1] Naeim Abedi, Ashish Bhaskar, and Edward Chung. “Tracking spatio-temporal
movement of human in terms of space utilization using Media-Access-Control address
data”. In: Applied Geography 51 (July 2014), pp. 72–81. doi: 10.1016/j.apgeog.2014.04.001.
[2] Igor Bilogrevic, Kevin Huguenin, Murtuza Jadliwala, Florent Lopez, Jean-Pierre
Hubaux, Philip Ginzboorg, and Valtteri Niemi. “Inferring Social Ties in Academic
Networks Using Short-Range Wireless Communications”. In: Nov. 2013, pp. 179–188.
doi: 10.1145/2517840.2517842.
[3] Gayathri Chandrasekaran, Mesut Ali Ergin, Marco Gruteser, Richard Martin, Jie
Yang, and Yingying Chen. “DECODE: Exploiting shadow fading to DEtect CO-
Moving wireless devices”. In: IEEE Trans. Mob. Comput. 8 (Dec. 2009), pp. 1663–
1675. doi: 10.1109/TMC.2009.131.
[4] Cristian Chilipirea, Mitra Baratchi, Ciprian Dobre, and Maarten van Steen. “Identi-
fying Stops and Moves in WiFi Tracking Data”. In: Sensors 18 (Nov. 2018), p. 4039.
doi: 10.3390/s18114039.
[5] David Crandall, Lars Backstrom, Dan Cosley, Siddharth Suri, Daniel Huttenlocher,
and Jon Kleinberg. “Inferring social ties from geographic coincidences”. In: Proceed-
ings of the National Academy of Sciences of the United States of America 107 (Dec.
2010), pp. 22436–41. doi: 10.1073/pnas.1006155107.
[6] Justin Cranshaw, Eran Toch, Jason Hong, Aniket Kittur, and Norman Sadeh. “Bridg-
ing the gap between physical location and online social networks”. In: UbiComp’10
- Proceedings of the 2010 ACM Conference on Ubiquitous Computing (Sept. 2010),
pp. 119–128. doi: 10.1145/1864349.1864380.
[7] Mathieu Cunche, Mohamed Ali Kaafar, and Roksana Boreli. “I know who you will
meet this evening! Linking wireless devices using Wi-Fi probe requests”. In: June
2012, pp. 1–9. isbn: 978-1-4673-1238-7. doi: 10.1109/WoWMoM.2012.6263700.
[8] Deshana Desai, Harsh Nisar, and Rishabh Bhardawaj. “Role of Temporal Diversity
in Inferring Social Ties Based on Spatio-Temporal Data”. In: Mar. 2017, pp. 1–8.
doi: 10.1145/3041823.3041836.
[9] J E. van Engelen, J J. van Lier, F W. Takes, and Heike Trautmann. “Accurate
WiFi-Based Indoor Positioning with Continuous Location Sampling: European Conference,
ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Proceedings,
Part III”. In: Jan. 2019, pp. 524–540. isbn: 978-3-030-10996-7. doi: 10.1007/978-3-030-10997-4_32.
[10] Floriana Gargiulo, Auguste Caen, Renaud Lambiotte, and Timoteo Carletti. “The
classical origin of modern mathematics”. In: EPJ Data Science (Aug. 2016). doi:
10.1140/epjds/s13688-016-0088-y.
[11] Simon Griffioen, Marijn Vermeer, Balázs Dukai, Stefan van der Spek, and Edward
Verbree. “Exploring indoor movement patterns through eduroam connected
wireless devices”. In: Agile (May 2017).
[12] Hande Hong, Chengwen Luo, and Mun Choon Chan. “SocialProbe: Understanding
Social Interaction Through Passive WiFi Monitoring”. In: (Nov. 2016), pp. 94–103.
doi: 10.1145/2994374.2994387.
[13] Eftychia Kalogianni, Rusne Sileryte, Marco Lam, Kaixuan Zhou, Martijn van der
Ham, Edward Verbree, and Stefan van der Spek. “Passive WiFi Monitoring of the
Rhythm of the campus”. In: (June 2015).
[14] Mikkel Kjaergaard and Petteri Nurmi. “Challenges for social sensing using WiFi
signals”. In: (June 2012). doi: 10.1145/2307863.2307869.
[15] Hady Lauw, Ee-Peng Lim, Hweehwa Pang, and Teck-Tim Tan. “STEvent: Spatio-
Temporal Event Model for Social Network Discovery”. In: ACM Transactions on
Information Systems 28 (June 2010). doi: 10.1145/1777432.1777438.
[16] Quannan Li, Yu Zheng, Xing Xie, Yukun Chen, Wenyu Liu, and Wei-Ying Ma.
“Mining User Similarity Based on Location History”. In: GIS ’08 (2008), 34:1–34:10.
doi: 10.1145/1463434.1463477. url: http://doi.acm.org/10.1145/1463434.
1463477.
[17] Tao Liu, Lintao Yang, Shouyin Liu, and Shuangkui Ge. “Inferring and analysis of
social networks using RFID check-in data in China”. In: PLOS ONE 12 (June 2017),
e0178492. doi: 10.1371/journal.pone.0178492.
[18] Filipe Meneses and Adriano Moreira. “Large scale movement analysis from WiFi
based location data”. In: 2012 International Conference on Indoor Positioning and
Indoor Navigation, IPIN 2012 - Conference Proceedings (Nov. 2012), pp. 1–9. doi:
10.1109/IPIN.2012.6418885.
[19] Nathan N. Eagle, Alex Pentland, and David Lazer. “Inferring Friendship Network
Structure by Using Mobile Phone Data”. In: Proceedings of the National Academy
of Sciences of the United States of America 106 (Sept. 2009), pp. 15274–8. doi:
10.1073/pnas.0900282106.
[20] Gunarto Njoo, Min-Chia Kao, Kuo-Wei Hsu, and Wen-Chih Peng. “Exploring Check-
in Data to Infer Social Ties in Location Based Social Networks”. In: Lecture Notes
in Computer Science (including subseries Lecture Notes in Artificial Intelligence and
Lecture Notes in Bioinformatics) (Apr. 2017), pp. 460–471. doi: 10.1007/978-3-
319-57454-7_36.
[21] Huy Pham, Cyrus Shahabi, and Yan Liu. “EBM - An entropy-based model to infer
social strength from spatiotemporal data”. In: Proceedings of the ACM SIGMOD
International Conference on Management of Data (June 2013), pp. 265–276. doi:
10.1145/2463676.2465301.
[22] Thor Prentow, Antonio Ruiz-Ruiz, Henrik Blunck, Allan Stisen, and Mikkel Kjærgaard.
“Spatio-temporal facility utilization analysis from exhaustive WiFi monitoring”.
In: Pervasive and Mobile Computing 16 (Dec. 2014). doi: 10.1016/j.pmcj.2014.12.006.
[23] Clement Roux, James Little, and John McAuley. “Towards approaches and tech-
niques for analysing WiFi location data”. In: (Nov. 2017). doi: 10.13140/RG.2.2.
23587.76323.
[24] Piotr Sapiezynski, Arkadiusz Stopczynski, David Kofoed Wind, Jure Leskovec, and
Sune Lehmann. “Inferring Person-to-person Proximity Using WiFi Signals”. In: Proc.
ACM Interact. Mob. Wearable Ubiquitous Technol. 1.2 (June 2017), 24:1–24:20. issn:
2474-9567. doi: 10.1145/3090089. url: http://doi.acm.org/10.1145/3090089.
[25] Piotr Sapiezynski, Arkadiusz Stopczynski, David Wind, Jure Leskovec, and Sune
Lehmann. “Offline Behaviors of Online Friends”. In: (Nov. 2018).
[26] Cyrus Shahabi and Huy Pham. “Inferring Real-World Relationships from Spatiotem-
poral Data”. In: IEEE Data Eng. Bull. 38 (2015), pp. 14–26.
[27] Marco V. Barbera, Alessandro Epasto, Alessandro Mei, Vasile C. Perta, and Julinda
Stefa. “Signals from the crowd: Uncovering social relationships through smartphone
probes”. In: Proceedings of the ACM SIGCOMM Internet Measurement Conference,
IMC (Oct. 2013), pp. 265–276. doi: 10.1145/2504730.2504742.
[28] Geert Vanderhulst, Afra Mashhadi, Marzieh Dashti, and Fahim Kawsar. “Detecting
human encounters from WiFi radio signals”. In: (Nov. 2015), pp. 97–108. doi: 10.
1145/2836041.2836050.
[29] Fengzi Wang, Xinning Zhu, and Jiansong Miao. “Semantic Trajectories Based Social
Relationships Discovery Using WiFi Monitors”. In: 9784 (July 2016), pp. 433–442.
doi: 10.1007/978-3-319-42553-5_37.
[30] Hongjian Wang, Zhenhui Li, and W.-C Lee. “PGT: Measuring Mobility Relationship
Using Personal, Global and Temporal Factors”. In: Proceedings - IEEE International
Conference on Data Mining, ICDM 2015 (Jan. 2015), pp. 570–579. doi: 10.1109/
ICDM.2014.111.
[31] Yang Zhang and Jun Pang. “Distance and Friendship: A Distance-Based Model for
Link Prediction in Social Networks”. In: Sept. 2015, pp. 55–66. isbn: 978-3-319-
25254-4. doi: 10.1007/978-3-319-25255-1_5.