
A MAJOR PROJECT REPORT

on

USER’S GEOLOCATION FROM TWEETS


Submitted in Partial Fulfilment for the Award of the
Degree of

Bachelor of Technology
in

Information Technology
By

Prem Kumar
Dileep Bhagat
Brajesh Kumar

Under the supervision of

Dr. Jyoti Prakash Singh

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


NATIONAL INSTITUTE OF TECHNOLOGY PATNA
CERTIFICATE
The undersigned certify that Prem Kumar (Roll No. 1407026), Dileep Bhagat (Roll No.
1407034) and Brajesh Kumar (Roll No. 1407031) have carried out the project entitled “User’s
Geo-location from Tweets” as their 8th-semester major project under my supervision.

…………………………………………….……… ...……………………………………...............

Dr. Prabhat Kumar Dr. Jyoti Prakash Singh


Head of Department Assistant Professor (Supervisor)
Department of Computer Science and Engineering Department of Computer Science and Engineering
National Institute of Technology Patna National Institute of Technology Patna
DECLARATION
We hereby declare that this project work entitled “User’s Geo-location from Tweets” has
been carried out by us in the Department of Computer Science & Engineering of National
Institute of Technology Patna under the guidance of Dr. Jyoti Prakash Singh. No part of
this work has been submitted for the award of any degree or diploma to any other
institute.

NAME ROLL NO SIGNATURE


PREM KUMAR 1407026
DILEEP BHAGAT 1407034
BRAJESH KUMAR 1407031

Date: __/__/____

ACKNOWLEDGEMENT
We would like to take this opportunity to thank all our sources of inspiration during the
course of this project.

First and foremost, we are grateful to Dr. Jyoti Prakash Singh, who gave us the opportunity
to work on user geo-location from tweets, for his continuous support during the project,
and for his patience, motivation and immense knowledge. He helped us come up with the
project topic and guided us over almost a semester of development. During the most
difficult time of writing this report, he gave us moral support and the freedom to move on.

We also take this opportunity to express our gratitude to all the people who were directly
or indirectly involved in the execution of this work, without whom this project would not
have been a success.

We are thankful to our Ph.D. senior Mr. Abhinav Kumar for the valuable guidance, support
and cooperation he extended to us. We would also like to thank our project team members
for their kind cooperation, help and never-ending support.

We are also thankful to NIT Patna for providing the technical skills and facilities that
proved very useful for our project.

ABSTRACT

Twitter User Geo-location

Prediction of user location on social media is a challenging task due to the limited
availability of resources, and an important area of research because of its many real-life
applications, such as location-based recommendation and social-unrest forecasting. Most
proposed models follow one of two approaches: content-based or network-based. A
content-based approach takes user data into account, whereas a network-based approach
considers the interactions among users. In this project we introduce a new model that can
learn from different views of the same data (multi-view features) and thereby inherits the
advantages of both approaches. The multi-view features are TF-IDF and DOC2VEC (which
capture textual information), NODE2VEC (which captures interactions among users), and
TIMESTAMP (which captures time-related user behaviour). Treating this as a regression
problem, we concatenate these features to train our model and then predict the exact
geo-coordinates of the user. Finally, we analyse the performance of the model based on the
distance between the actual and the predicted geo-coordinates, and show that it performs
well on our benchmark dataset.

CONTENTS

CERTIFICATE

DECLARATION

ACKNOWLEDGEMENT

ABSTRACT

Chapter 1 – INTRODUCTION

Chapter 2 – RELATED WORK

Chapter 3 – METHODOLOGY

        3.1 MODEL ARCHITECTURE
                3.1.1 ARCHITECTURE
                3.1.2 TRAINING MODEL
                3.1.3 TESTING MODEL
        3.2 DATASET
        3.3 DATA PRE-PROCESSING & NORMALIZATION
        3.4 MULTI-VIEW FEATURES
                3.4.1 TF-IDF
                3.4.2 CONTEXT FEATURE
                3.4.3 NODE2VEC FEATURE
                3.4.4 TIMESTAMP FEATURE
        3.5 HAVERSINE FORMULA
        3.6 KERAS LIBRARY

Chapter 4 – RESULTS

        4.1 OUTPUT TABLES OF FEATURES
        4.2 DISTANCE ERROR TABLE
        4.3 PERFORMANCE

Chapter 5 – CONCLUSION

Chapter 6 – REFERENCES
Chapter 1-INTRODUCTION

Over the last few years social networks have gained huge popularity; currently 2.62 billion
people worldwide are active on social networks, and some of the most popular social
networking sites are Facebook, Twitter and Instagram. In this project we focus on Twitter,
which alone has 300 million users. On Twitter, users post messages known as tweets,
previously limited to 140 characters but extended to 280 characters in 2017, which can be
seen by followers or by the public depending on the user's permission settings. A tweet can
be re-tweeted, that is, reposted by a user who sees it, and in this way information on
Twitter spreads quickly. Twitter can even be considered a human-powered sensing
network, with a lot of useful information, yet in an unstructured form. For this reason,
mining and extracting useful information from the huge amount of Twitter data is a
challenging task.

In this project we extract an important piece of information, the user's location, from the
massive amount of Twitter data. Location information of users enables many applications,
such as location-based recommendation, social-unrest forecasting, and quick rescue
operations at the time of a disaster or accident. User location information can also help
the government understand the trends and patterns of a region in order to take necessary
action, if required, as soon as possible.

Although Twitter provides a geo-tagging feature by which a user can tag location
information while posting a tweet, it does not provide reliable information: most tweets
are not geo-tagged, and when a user does tag a geo-location it is often meaningless, e.g. a
non-existent place, "all", "everywhere", or a small town. This is why user geo-location is an
important area of research, with many approaches already proposed by researchers.
Broadly, location prediction falls into two categories: tweet location prediction and user
location prediction. In tweet location prediction, we predict the location of a single tweet;
in user location prediction, we predict the location of a user on the basis of the data
generated by that user. Location extraction for a single tweet is extremely difficult due to
the limited available evidence and is hard to apply in real life. On the other hand, user-level
location extraction is more common and many methods for it have been described in the
literature. We therefore focus on user-level location extraction in this project.

In this project we treat the user location extraction problem as a regression problem in
which we predict the exact geo-coordinates of a user on the basis of the data generated by
that user. First, we train our model on training data labelled with the geo-coordinates of
each user; after each epoch the network weights are adjusted, and we keep the best model
in order to predict the location information of the test users.

In Twitter user location extraction there are basically two types of approaches: network-
based and content-based. In a content-based approach, we extract information from the
textual content of the user's tweets to infer the user's location. In a network-based
approach, we consider the interactions between users for geo-location. Both are good in
terms of accuracy, and both have their own advantages and disadvantages.

In this project we developed a model that considers both approaches and inherits the
advantages of both the content-based and the network-based approach. We explore recent
developments in neural networks (i.e. deep learning) and multi-view learning. Multi-view
learning encompasses methods that learn from multiple representations of the data, and it
has recently shown great promise. In Twitter data, the multiple views are the different
types of information available, i.e. text, metadata, interactions with other users, and
features extracted from the Twitter data.

We created a generic multi-view neural network that can learn from multiple
representations of the problem. It combines the different views of the problem and learns
a model that gives good accuracy compared to any single representation. We consider four
specific types of features to train the model: TF-IDF and DOC2VEC (which capture textual
information), NODE2VEC (which captures user network interactions), and TIMESTAMP
(which captures time-related user behaviour). Using these features, we train our multi-view
model and then predict the exact geo-coordinates of a user. Finally, we analyse the
performance of our model in terms of the distance between the actual and the predicted
geo-coordinates.

Chapter 2-RELATED WORK

Generally, there are two approaches to extracting user location information: content-based
and network-based. In content-based approaches, user-generated content is used to
extract the location information; natural language processing (NLP) is applied first so that
the textual data can be used as features. In network-based approaches, the user's
interactions with other users in nearby locations are considered. Recently, several methods
have addressed the Twitter user geo-location problem using deep learning. For example,
Liu and Inkpen train stacked de-noising auto-encoders for predicting regions, states, and
geographical coordinates.

Research has shown that there is a correlation between the likelihood of friendship
between two social network users and the geographical distance between them. One can
therefore use this correlation to predict the location of a user from the location
information of his or her friends; network-based approaches essentially exploit this
correlation. The network-based approach has several advantages over its content-based
counterpart, including language independence. It also does not require training, which is a
very resource-intensive and time-consuming process on big datasets.

However, the inherent weakness of this approach is that it cannot propagate labels
(locations) to users who are not connected to the graph; as a result, isolated users remain
unlabelled. To address the problem of isolated users in the network-based approach,
unified text-and-network methods have been proposed that leverage both the
discriminative power of textual information and the representativeness of the users' graph.
In particular, the textual information is used to predict labels for disconnected users before
running label-propagation algorithms. Additionally, the novelty of these works lies in
building a dense undirected graph based on the mentioning of users, which yields a
significant improvement in location prediction. Models combining text, metadata and
user-network features have subsequently been introduced, but they rely on user profile
information, including the user's stated location, time zone and UTC offset. These types of
information should be considered unavailable in the Twitter user geo-location context,
which is why the benchmark dataset considered in this report does not provide Twitter
profile information.

Chapter 3-METHODOLOGY

In this project, we wish to predict the location of a user using information about the user's
tweets and tweeting behaviour obtained from the dataset. Using this information, we
predict the area (in terms of geo-coordinates) where the user most probably resides. Our
method addresses this as a regression problem.

We propose a multi-view neural network model that learns from multiple views of the data
to predict the user's location. The model works on multiple features; its benefit is the
ability to take both content-based and network-based features, which capture the user's
tweet content, the user's network structure and time information. It is worth mentioning
that, except for the time feature, all features are extracted from the tweets' content.
Integrating all the features into this model results in a powerful tool for predicting the
user's geo-location. This section presents our model and the different types of employed
features in detail.

3.1 MODEL ARCHITECTURE

3.1.1 Architecture

Our model architecture is shown in Fig. 3.1. The model leverages different features
extracted from the tweets' content and metadata; each feature corresponds to one view of
the model. In Fig. 3.1, four features are fed into four individual branches, and each branch
has one hidden layer, allowing higher-order features to be learned.

Since we have multiple views of the same data, there are many ways to combine them; one
approach is simple vector concatenation. We argue, however, that this does not fully utilize
the power of the multiple features, and our architecture is much more effective than simple
vector concatenation of the inputs. In our architecture, each feature is input to one branch
of the neural network, and each branch is connected to one hidden layer. To learn a
non-linear transformation in each branch, we apply the ReLU activation function after each
hidden layer; ReLU is also efficient for back-propagation. The outputs of these branches are
concatenated to form a combined hidden layer, and at the end we employ a linear
activation function to produce the outputs.

Our objective is to minimize the distance between the actual and the predicted
geo-coordinates, which is computed with the haversine formula (Section 3.5).

Fig. 3.1 Architecture for multi-view neural network
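
The architecture above can be sketched in a few lines with the Keras functional API. This is
a minimal illustration rather than our exact configuration: the hidden-layer width (64) is an
assumed placeholder, while the view dimensions follow the feature sizes described in
Section 3.4 (three 100-dimensional views and one 24-dimensional view).

from keras.layers import Input, Dense, concatenate
from keras.models import Model

view_dims = {"tfidf": 100, "doc2vec": 100, "node2vec": 100, "timestamp": 24}

inputs, branches = [], []
for name, dim in view_dims.items():
    x_in = Input(shape=(dim,), name=name)        # one input branch per view
    hidden = Dense(64, activation="relu")(x_in)  # one hidden layer with ReLU
    inputs.append(x_in)
    branches.append(hidden)

merged = concatenate(branches)                   # combined hidden layer
output = Dense(2, activation="linear")(merged)   # (latitude, longitude)

model = Model(inputs=inputs, outputs=output)
model.summary()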

3.1.2 Training model

To train our model, we first divide our dataset into two parts: a training set and a test set.
During training, the concatenated output is mapped to the geo-coordinates of the training
users. We train the model using the Adam optimization algorithm, which optimizes our
objective to get the best possible result. Adam combines the adaptive gradient algorithm
and root-mean-square propagation, which makes it effective in deep learning because it
achieves good results fast.

The metric we use is the distance error between the actual and the predicted
geo-coordinates. Along with it, the mean squared error is calculated for each epoch. During
training, the distance error of the model is continuously monitored.
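
Continuing the architecture sketch above, training might look as follows. The array names,
the placeholder data and the hyper-parameters (epochs, batch size, validation split) are all
illustrative assumptions; here plain mean squared error is the loss, and the haversine
distance error (Section 3.5) is computed separately on the predictions.

import numpy as np

# Placeholder data standing in for the real features (assumption).
n = 256
X_tfidf, X_doc2vec, X_node2vec = (np.random.rand(n, 100) for _ in range(3))
X_time = np.random.rand(n, 24)
y_train = np.random.uniform(-90, 90, size=(n, 2))   # (lat, long) labels

model.compile(optimizer="adam", loss="mse")         # Adam optimizer, mse loss
history = model.fit(
    [X_tfidf, X_doc2vec, X_node2vec, X_time],       # one array per view/branch
    y_train,
    validation_split=0.1,
    epochs=5,
    batch_size=32,
)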

3.1.3 Testing model

To predict the location of a user from the test set, we use the trained model to obtain the
geo-coordinates of the test user. Since a linear activation function maps the concatenated
output to geo-coordinates during training, at test time we obtain the predicted
geo-coordinates directly, and our metric computes the distance error between the actual
and the predicted geo-coordinates. The accuracy of the model is measured by this distance
error metric.

3.2 Dataset

In this project, we employ a dataset containing tweets coming from different regions of the
world; a sample is shown in Fig. 3.2.

Our dataset consists of more than 650,000 tweets coming from 9,995 unique users. Tweets
were filtered carefully before being put into the dataset to ensure that only relevant tweets
were kept. Every record in our dataset contains seven fields, namely:

[USER, TIME, LAT, LONG, FOLLOWER, FOLLOWING, TEXT]

In this dataset, the geo-coordinates of a user's first tweet are used as the user's primary
location. The location of a user is indicated by a pair of real numbers, latitude and
longitude, given in the LAT and LONG columns. The TIME column stores the time as the
number of seconds elapsed since Jan 01, 1970 00:00 UTC; this value has to be converted
into local time before processing.

The FOLLOWER and FOLLOWING columns give the number of users who follow that user
and the number of users whom that user follows, respectively. Finally, the TEXT column
contains the content of the tweet posted by the user; it can have any length up to 280
characters and supports UTF-8 encoding.

Fig. 3.2 Dataset
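
As a small illustration of the record layout and of the epoch-to-local-time conversion
described above, consider the following sketch; all the field values are invented examples.

from datetime import datetime

# One invented record with the seven fields described above.
record = {
    "USER": "example_user",
    "TIME": 1428349200,        # seconds since Jan 01, 1970 00:00 UTC
    "LAT": 25.5941,
    "LONG": 85.1376,
    "FOLLOWER": 120,
    "FOLLOWING": 80,
    "TEXT": "Example tweet text",
}

local_time = datetime.fromtimestamp(record["TIME"])  # epoch seconds -> local time
print(local_time, local_time.hour)  # the hour feeds the timestamp feature (3.4.4)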

3.3 Data pre-processing & normalization

Before computing the node2vec, doc2vec and tf-idf features, a simple pre-processing phase
is required. First, we tokenize the tweets and remove stop-words using nltk, a dedicated
library for natural language processing. Then we replace URLs and punctuation with special
tokens, which reduces the size of the vocabulary without affecting the semantics of the
tweets. nltk is used again for stemming in the last stage of pre-processing. For the doc2vec
feature, we collect each user's tokenized tweets into a single document.

Normalization is a common way to pre-process data before applying any machine learning
algorithm. There are many ways to normalize data, such as max-min normalization and l2
normalization; data can also be standardized by subtracting the mean and dividing by the
standard deviation. Data can be normalized into the [-1, 1] or [0, 1] range; in our case we
normalize all features into the [-1, 1] range.
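
For instance, max-min scaling into [-1, 1] can be done column-wise with scikit-learn; the toy
matrix below is an invented example.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2.0, 30.0],
              [4.0, 10.0],
              [6.0, 20.0]])                     # toy feature matrix

scaler = MinMaxScaler(feature_range=(-1, 1))    # max-min scaling to [-1, 1]
X_scaled = scaler.fit_transform(X)              # every column now in [-1, 1]
print(X_scaled)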

3.4 Multi-view features

Figure 3.1 shows the different features that are passed as input to our neural network
model. In this project, we realize our model by leveraging features from textual information
(term frequency-inverse document frequency, doc2vec, and the user interaction network)
and from metadata (timestamp). These features are extracted from the records present in
the dataset. The rest of this section describes the features and how they are computed.

3.4.1 Term frequency-inverse document frequency (tf-idf)

TF-IDF stands for term frequency-inverse document frequency; the tf-idf weight is often
used in information retrieval and text mining. It is a statistical measure that evaluates how
important a word is to a document in a collection or corpus. The importance increases
proportionally to the number of times a word appears in the document, but is offset by the
frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often
used by search engines as a central tool in scoring and ranking a document's relevance to a
user query.

How to compute

Typically, the tf-idf weight is composed of two terms. The first is the normalized term
frequency (TF): the number of times a word appears in a document, divided by the total
number of words in that document. The second is the inverse document frequency (IDF):
the logarithm of the number of documents in the corpus divided by the number of
documents in which the specific term appears.

 TF: Term frequency, which measures how frequently a term occurs in a document.
Since every document is different in length, a term may appear many more times in a
long document than in a short one. Thus, the term frequency is often divided by the
document length (i.e. the total number of terms in the document) as a way of
normalization:

TF (t) = (Number of times term t appears in a document) / (Total number of terms in the
document)

 IDF: Inverse document frequency, which measures how important a term is. When
computing TF, all terms are considered equally important. However, certain terms, such
as "is", "of", and "that", may appear many times but have little importance. Thus we
need to weigh down the frequent terms while scaling up the rare ones, by computing
the following:

IDF (t) = log (Total number of documents / Number of documents with term t in it)

Then, the TF-IDF is defined by:

TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)

where D is the corpus (the collection of all documents), d is one specific document, and t is
a term in document d.

The output is normalized so that vectors of constant length are maintained; in our case we
normalize with the l2 norm to make each vector unit length. The output vectors have values
in the [0, 1] range and consist of 100 dimensions. In fact, there are many variants of the
tf-idf definition, and the choice of one form depends on the specific situation.
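
One common way to obtain such vectors is scikit-learn's TfidfVectorizer, sketched below
with invented per-user documents; note that scikit-learn's exact tf-idf variant (with
smoothed idf) differs slightly from the textbook formula above, which is one of the variants
just mentioned.

from sklearn.feature_extraction.text import TfidfVectorizer

user_docs = [
    "concert tonight downtown music",          # toy per-user documents built
    "traffic jam highway morning commute",     # from each user's tweets
    "music festival weekend downtown",
]

# 100-dimensional, l2-normalized tf-idf vectors, as described above.
vectorizer = TfidfVectorizer(max_features=100, norm="l2")
X_tfidf = vectorizer.fit_transform(user_docs)
print(X_tfidf.shape)          # (n_users, vocabulary size up to 100)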

3.4.2 Context feature (doc2vec)

The context feature is a mapping from a variable-length block of text (e.g. a sentence,
paragraph, or entire document) to a fixed-length continuous-valued vector. It provides a
numerical representation capturing the context of the document, and is an extension of
the widely used word2vec model.

Many machine learning algorithms require the input to be represented as a fixed-length
feature vector. For text, one of the most common fixed-length representations is
bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses:
they lose the ordering of the words and they ignore the semantics of the words. For
example, "powerful", "strong" and "Paris" are treated as equally distant.

The intuition of doc2vec is that a certain context is more likely to produce some sets of
words than other contexts. Doc2vec trains an embedding capable of expressing the relation
between the context and the corresponding words.
To achieve this goal, it employs a simple neural network architecture consisting of one
hidden layer without an activation function. A text window samples some nearby words in a
document; some of these words are used as inputs to the network and some as outputs.
Moreover, an additional input for the document is added to the network bringing the
document’s context. The training process is totally unsupervised. After training, the fixed
representation of the document input will capture the context of the whole document. Two
architectures were proposed to learn a document’s representation, namely, Distributed Bag
of Words (PV-DBOW) and Distributed Memory (PV-DM) versions of Paragraph Vector.
Although PV-DBOW is a simpler architecture, it has been claimed that PV-DBOW performs
robustly if trained on large datasets. Therefore, we select PV-DBOW model to extract the
context feature.

In this project, we train PV-DBOW models using the tweets from the training set. We then
extract the context feature vectors for both the training and the test set. Our
implementation is based on the gensim library.
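
A hedged gensim sketch (gensim 4.x API) is given below; dm=0 selects the PV-DBOW
architecture, while the toy documents, window size and epoch count are invented for
illustration.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

user_docs = {
    "user_a": ["concert", "tonight", "downtown", "music"],   # toy tokenized
    "user_b": ["traffic", "jam", "highway", "commute"],      # documents
}
tagged = [TaggedDocument(words=toks, tags=[user])
          for user, toks in user_docs.items()]

# dm=0 -> PV-DBOW; 100-dimensional context vectors, one per user.
model = Doc2Vec(tagged, dm=0, vector_size=100, window=5, min_count=1, epochs=20)

vec_a = model.dv["user_a"]                           # vector for a training user
vec_new = model.infer_vector(["music", "festival"])  # vector for unseen text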

3.4.3 Node2vec feature

Node2vec an algorithmic framework for learning continuous feature representations for


nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of
features that maximizes the likelihood of preserving network neighbourhoods of nodes. We
define a flexible notion of a node’s network neighbourhood and design a biased random
walk procedure, which efficiently explores diverse neighbourhoods. Our algorithm
generalizes prior work which is based on rigid notions of network neighbourhoods, and we
argue that the added flexibility in exploring neighbourhoods is the key to learning richer
representations.

Let V be the set of nodes of a graph. Node2vec learns a mapping function f : V -> R^d that
captures the connectivity patterns observed in the graph. Here, d is a parameter specifying
the dimensionality of the feature representation, and f can be viewed as a matrix of size
|V| x d. For every source node v, a set of neighbourhood nodes Ns(v) ⊂ V is generated
through a neighbourhood sampling strategy S.

Node2vec employs a sampling method referred to as a biased random walk, which samples
nodes belonging to the neighbourhood of the current node v according to discrete
transition probabilities between v and the next node w. These probabilities depend on the
distance between the previous node u and the next node w. Denote by duw the distance,
in number of edges, from node u to node w: if the next node coincides with the previous
node, then duw = 0; if the next node has a direct connection to the previous node, then
duw = 1; and if the next node is not connected to the previous node, then duw = 2.

The random-walk sampling runs over the nodes to obtain a list of walks. The node
embeddings are then learned from the set of walks using stochastic gradient descent.

In the context of Twitter user geo-location, each node corresponds to a user, while an edge
is a connection between two users. These connections can be defined by several criteria
depending on the availability of data; for example, two users may be considered connected
when actions such as following, mentioning or re-tweeting are detected. In this project, the
content of the tweet messages is used to build the graph connections. We construct an
undirected user graph from mention connections. First, we create a unique set V containing
all the users of interest. If a user directly mentions another user and both of them belong
to V, we create an edge reflecting this interaction. To avoid sparsity of the connections, if
two users of interest mention a third user who does not belong to V, we also create an edge
between these two users. This is shown in Figure 3.3.

Fig. 3.3 Twitter User Graph

A shortcoming of this method is that it can only produce an embedding for a node that has
at least one connection to another node; nodes without an edge cannot be represented.
Therefore, for an isolated node we use an all-zero vector as its embedding. Moreover,
whenever a new node joins the graph, the algorithm needs to run again to learn feature
vectors for all the nodes of the graph, making our method inherently transductive.
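
The sketch below illustrates this pipeline under simplifying assumptions: the mention
graph is built with networkx, the walks are uniform random walks (the p = q = 1 special
case of node2vec's biased walks), and gensim's skip-gram Word2Vec learns the
embeddings. The mention pairs and walk parameters are invented.

import random
import networkx as nx
from gensim.models import Word2Vec

# Toy undirected mention graph over the users of interest V.
mentions = [("user_a", "user_b"), ("user_b", "user_c"), ("user_a", "user_c")]
G = nx.Graph()
G.add_edges_from(mentions)

def random_walks(graph, num_walks=10, walk_length=20):
    """Uniform random walks (node2vec with p = q = 1)."""
    walks = []
    for _ in range(num_walks):
        for node in graph.nodes():
            walk = [node]
            while len(walk) < walk_length:
                nbrs = list(graph.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append(walk)
    return walks

# Skip-gram (sg=1) over the walks yields 100-d node embeddings.
model = Word2Vec(random_walks(G), vector_size=100, window=5,
                 min_count=1, sg=1, epochs=5)
emb_a = model.wv["user_a"]   # isolated users instead get an all-zero vector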

3.4.4 Timestamp feature

In our benchmark dataset the posting time of every tweet is available as a UTC
(Coordinated Universal Time) value. This allows us to leverage another view of the data. It
has been shown that there is a correlation between time and place in a Twitter data
stream: in particular, people are less likely to tweet late at night than at any other time,
which implies a drift with longitude. The timestamp can therefore be an indication of the
time zone.

Computation of timestamp feature:

We obtain the timestamp feature for a given user as follows (a short sketch follows the
steps):

1. First, we extract the timestamps from all the tweets of that user and convert them to
the standard format to extract the hour value.
2. Then, a 24-dimensional vector is created corresponding to the 24 hours in a day; the
i-th element of this vector equals the number of messages posted by the user at the
i-th hour.
3. This vector is l2-normalized to a unit vector before being fed to our neural network
model.
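
A minimal sketch of these three steps, using invented epoch-second timestamps for one
user:

import numpy as np
from datetime import datetime

tweet_times = [1428349200, 1428352800, 1428423600]   # one user's tweets

hist = np.zeros(24)
for ts in tweet_times:
    hist[datetime.fromtimestamp(ts).hour] += 1       # messages posted at hour i

norm = np.linalg.norm(hist)                          # l2 normalization
timestamp_feature = hist / norm if norm > 0 else hist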

3.5 Haversine formula

The haversine formula determines the great-circle distance between two points on
a sphere given their longitudes and latitudes. Important in navigation, it is a special case of
a more general formula in spherical trigonometry, the law of haversines, which relates the
sides and angles of spherical triangles. The distance is

d = 2r * arcsin( sqrt( sin^2((φ2 - φ1)/2) + cos(φ1) * cos(φ2) * sin^2((λ2 - λ1)/2) ) )

where r is the radius of the Earth, φ1, φ2 are the latitudes of points 1 and 2 in radians, and
λ1, λ2 are the longitudes of points 1 and 2 in radians.
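
The formula translates directly into Python; coordinates are in degrees and the result in
kilometres, taking r = 6371 km as the mean Earth radius.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    lam1, lam2 = radians(lon1), radians(lon2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin((lam2 - lam1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

print(haversine_km(25.5941, 85.1376, 28.6139, 77.2090))  # Patna -> New Delhi, ~855 km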

3.6 Keras library

Keras is a minimalist Python library for deep learning that can run on top of Theano or
TensorFlow. It was developed to make implementing deep learning models as fast and easy
as possible for research and development. It runs on Python 2.7 or 3.5 and can seamlessly
execute on GPUs and CPUs given the underlying frameworks.

Keras was developed and is maintained with four guiding principles:

 Modularity: A model can be understood as a sequence or a graph alone. All the
concerns of a deep learning model are discrete components that can be combined in
arbitrary ways.
 Minimalism: The library provides just enough to achieve an outcome, with no frills,
maximizing readability.
 Extensibility: New components are intentionally easy to add and use within the
framework, so that researchers can trial and explore new ideas.
 Python: No separate model files with custom file formats; everything is native
Python.

Building deep learning models in Keras:

We can summarize the construction of deep learning models in Keras as follows (a toy
example is given after the list):

 Define your model. Create a sequence and add layers.
 Compile your model. Specify loss functions and optimizers.
 Fit your model. Execute the model using data.
 Make predictions. Use the model to generate predictions on new data.
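
The four steps map onto a tiny toy example; the data and layer sizes are placeholders, not
our project model:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(100, 8)                       # placeholder inputs
y = np.random.rand(100, 2)                       # placeholder targets

model = Sequential()                             # 1. define the model
model.add(Dense(16, activation="relu", input_shape=(8,)))
model.add(Dense(2, activation="linear"))

model.compile(optimizer="adam", loss="mse")      # 2. compile
model.fit(X, y, epochs=5, batch_size=16, verbose=0)  # 3. fit
preds = model.predict(X[:3])                     # 4. make predictions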

Chapter 4-RESULTS

4.1 Output tables of features

We have estimated all the feature vectors and normalized the results into the [-1, 1] range.
Since every feature has many columns and the number of rows is very large, the tables
below show only a few columns and rows, with values rounded to six decimal places, to
give an idea of what our features look like. The TF-IDF table is shown below:

USER               0          1          2         .....   97         98         99
0197666            0.477979   0.477979   0.394097  .....   0.000000   0.000000   0.000000
100MiNuSNoTHiNG    0.302114   0.212604   0.194424  .....   0.075944   0.075944   0.075944
100barproject      0.212978   0.212978   0.184060  .....   0.062862   0.062660   0.060823
1057thebeat        0.240724   0.240724   0.211703  .....   0.030374   0.028894   0.027690
10SoccerChick10    0.406052   0.368118   0.223256  .....   0.024301   0.024276   0.012915

Table 4.1 TF-IDF

After computing the doc2vec feature, we normalize it into the [-1, 1] range. It is a
100-dimensional vector per user. The table below shows doc2vec values for some of the
users.

USER                0           1           2          .....    97          98          99
0197666             0.000000    0.001745    0.003490   .....    0.169279    0.171024    0.172769
100MiNuSNoTHiNG    -0.117170    0.086782   -0.091585   .....    0.130522    0.079063    0.007812
100barproject       0.017681   -0.085947    0.100659   .....   -0.038811    0.070943   -0.082001
1057thebeat        -0.084312    0.054181   -0.065649   .....   -0.244854    0.128289    0.111699
10SoccerChick10     0.166131   -0.015791   -0.079068   .....   -0.191751   -0.044079    0.033215

Table 4.2 DOC2VEC

Similarly, node2vec is normalized into the [-1, 1] range; it is also a 100-dimensional vector
per user. The table below shows node2vec values for some users:

USER                0           1           2          ....    97          98          99
0197666             0.001947   -0.002710    0.002748   ....    0.004812    0.000517    0.001295
100MiNuSNoTHiNG     0.000522    0.004691    0.003776   ....   -0.002086    0.003073    0.003277
100barproject       0.002053   -0.002257    0.002168   ....    0.004182    0.000866    0.003512
1057thebeat         0.001204    0.003044    0.001509   ....   -0.001567   -0.000621    0.004248
10SoccerChick10     0.001968    0.004572   -0.002282   ....    0.001954    0.000096    0.003321

TABLE 4.3 NODE2VEC

The timestamp feature is a 24-dimensional vector; it is shown in tabular form below:

USER                0           1           2          ....    21          22          23
0197666             0.000000    0.000000    0.000000   ....    0.000000    0.000000    0.000000
100MiNuSNoTHiNG     0.468204    0.078034    0.104045   ....    0.156068    0.104045    0.312136
100barproject       0.234978    0.156652    0.078326   ....    0.078326    0.078326    0.000000
1057thebeat         0.000000    0.000000    0.000000   ....    0.000000    0.000000    0.000000
10SoccerChick10     0.249136    0.249136    0.083045   ....    0.000000    0.166091    0.083045

Table 4.4 Timestamp values

4.2 Distance error table

USER                 ERROR (in km)

0197666              54.407396
08ocho08             60.815740
100MiNuSNoTHiNG      52.921865
100barproject        77.174437
1057thebeat          49.156434
10SoccerChick10      59.852977
10WingsAndFries      39.631063
12tmcglynn           28.345668
13granttaaron13      60.948382

Table 4.5 ERROR TABLE

4.3 Performance

After estimating the distance error for each user, we report the best results here. The
minimum distance error we obtained is 8.72 km and the maximum is 274.18 km, while the
overall average error is 62.50 km.

Chapter 5-CONCLUSION

Noisy and sparsely labelled data make the prediction of Twitter user locations a challenging
task. While plenty of approaches have been proposed, no method has obtained very high
accuracy. Following the multi-view learning paradigm, we have shown the effectiveness of
combining knowledge from user-generated content and from network-based relationships.
In particular, we proposed a multi-view neural network architecture that uses text
information (words and paragraph semantics) and network topology as well as time
information. Overall, the proposed model gives good accuracy.

The performance of our model depends heavily on the user-graph feature (node2vec). The
node2vec algorithm used in this project is transductive, meaning that the graph must be
built over all users.

Chapter 6-REFERENCES

[1] Do, T. H., Nguyen, D. M., Tsiligianni, E., Cornelis, B., & Deligiannis, N. (2017). Multiview
Deep Learning for Predicting Twitter Users' Location. arXiv preprint arXiv:1712.08091.

[2] Niklas Donges, How to build a Neural Network with Keras,
https://towardsdatascience.com/how-to-build-a-neural-network-with-keras-e8faa33d0ae4,
accessed on 26/04/2018

[3] Radim Řehůřek, Doc2Vec Tutorial,
https://rare-technologies.com/doc2vec-tutorial/, accessed on 10/04/2018

[4] Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for
networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
discovery and data mining (pp. 855-864). ACM.

[5] Jason Brownlee, Develop Your First Neural Network in Python With Keras Step-By-Step,
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/, accessed
on 25/04/2018

[6] Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and
documents. In International Conference on Machine Learning (pp. 1188-1196).

[7] Christian S. Perone, Machine Learning :: Text feature extraction (tf-idf),
http://blog.christianperone.com/2011/09/machine-learning-text-feature-extraction-tf-idf-part-i/,
accessed on 16/03/2018

[8] NetworkX tutorial,
https://networkx.github.io/documentation/networkx-1.10/tutorial/tutorial.html, accessed
on 28/03/2018

[9] Sentiment Analysis Using Doc2Vec,
http://linanqiu.github.io/2015/10/07/word2vec-sentiment/, accessed on 10/04/2018

[10] Exploring Activation Functions for Neural Networks,
https://towardsdatascience.com/exploring-activation-functions-for-neural-networks-73498da59b02,
accessed on 25/04/2018

[11] node2vec GitHub repository, https://github.com/aditya-grover/node2vec, accessed on
19/04/2018

[12] Python for Data Science and Machine Learning Bootcamp,
https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp/learn/v4/,
accessed on 14/02/2018

