8th Semester Project Report
on
Bachelor of Technology
in
Information Technology
By
Prem Kumar
Dileep Bhagat
Brajesh Kumar
Date: __/__/____
ACKNOWLEDGEMENT
We would like to take this opportunity to thank all our sources of inspiration during the
course of this project.
First and foremost, we are grateful to Dr. Jyoti Prakash Singh, who gave us the
opportunity to work on predicting a user's geo-location from tweets, and for his
continuous support, patience, motivation and immense knowledge throughout the project.
He helped us come up with the project topic and guided us over almost a semester of
development. During the most difficult period of writing this report, he gave us moral
support and the freedom to move on.
We also take this opportunity to express our gratitude to all the people who were directly
or indirectly involved in the execution of this work, without whom this project would not
have been a success.
We are also thankful to our PhD senior Mr. Abhinav Kumar for his valuable guidance,
support and cooperation. We would further like to thank our project team members for
their kind cooperation, help and never-ending support.
We are also thankful to NIT Patna for providing the technical skills and facilities that
proved very useful for our project.
ABSTRACT
Prediction of user location on social media is a challenging task due to the limited
availability of resources, and an important area of research because of its many real-life
applications, such as location-based recommendation and social-unrest forecasting. Most
proposed models follow one of two approaches: content-based or network-based. A
content-based approach takes into account user-generated data, whereas a network-based
approach takes into account interactions among users. In this project we introduce a new
model that can learn from different views of the same data (multi-view features) and
thereby inherits the advantages of both approaches. The multi-view features are TF-IDF
and doc2vec (which capture textual information), node2vec (which captures interaction
among users), and a timestamp feature (which captures time-related user behaviour).
Treating this as a regression problem, we concatenate these features to train our model
and then predict the exact geo-coordinates of a user. Finally, we analyse the performance
of the model in terms of the distance between the actual and predicted geo-coordinates.
We show that our model performs well on our benchmark dataset.
CONTENTS
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
Chapter 1 - INTRODUCTION
Chapter 2 - RELATED WORK
Chapter 3 - METHODOLOGY
    3.1.1 ARCHITECTURE
    3.1.2 TRAINING MODEL
    3.1.3 TESTING MODEL
    3.2 DATASET
    3.4.1 TF-IDF
    3.4.2 CONTEXT FEATURE (DOC2VEC)
    3.4.3 NODE2VEC FEATURE
    3.4.4 TIMESTAMP FEATURE
    3.6 KERAS LIBRARY
Chapter 4 - RESULTS
    4.2 DISTANCE ERROR TABLE
    4.3 PERFORMANCE
Chapter 5 - CONCLUSION
Chapter 6 - REFERENCES
Chapter 1-INTRODUCTION
Over the past few years social networks have gained enormous popularity: currently about
2.62 billion people worldwide are active on social networks, and some of the most popular
sites are Facebook, Twitter and Instagram. In this project we focus on Twitter, which
alone has around 300 million users. On Twitter, users post messages known as tweets,
previously limited to 140 characters but extended to 280 characters in 2017, which can be
seen by followers or by the public depending on the user's permission settings. A tweet
can be re-tweeted, i.e. reposted by a user who saw it, and in this way information on
Twitter spreads quickly. Twitter can even be considered a human-powered sensing
network carrying a lot of useful information, yet in an unstructured form. For this reason,
mining and extracting useful information from the huge amount of Twitter data is a
challenging task.
In this project we extract one important piece of information, the user's location, from
this massive amount of Twitter data. Location information enables many applications such
as location-based recommendation, social-unrest forecasting and quick rescue operations
at the time of a disaster or accident. It can also help governments understand the trends
and patterns of a region in order to take necessary action as soon as possible.
Although Twitter provides a geo-tagging feature by which a user can attach location
information while posting a tweet, it does not yield reliable information: most tweets are
not geo-tagged, and when a user does tag a location it is often arbitrary, e.g. a
non-existent place, "all", "everywhere", or a small town. This is why user geo-location is
an important area of research, with many approaches already proposed. Broadly, location
prediction has two categories: tweet location prediction and user location prediction. In
tweet location prediction we have to predict the location of a single tweet, whereas in
user location prediction we have to predict the location of a user on the basis of all the
data generated by that user. Predicting the location of a single tweet is extremely
difficult due to the limited availability of resources, and is hard to apply in real life.
Locating a user, on the other hand, is more common and many methods have been
described in the literature. We therefore focus on user-level location prediction in this
project.
In this project we treat user location extraction as a regression problem in which we have
to predict the exact geo-coordinates of a user on the basis of the data generated by that
user. First, we train our model on training data labelled with the geo-coordinates of each
user; after each epoch the network weights are adjusted, and we keep the best model in
order to predict the location of the test users.
For Twitter user geo-location there are basically two types of approach: network-based
and content-based. In the content-based approach, information is extracted from the
textual content of the user's tweets to infer the user's location. In the network-based
approach, interactions between users are considered. Both perform well in terms of
accuracy, and each has its own advantages and disadvantages.
In this project we developed a model that combines both approaches and inherits the
advantages of each. We build on recent developments in neural networks (i.e. deep
learning) and on multi-view learning, a paradigm encompassing methods that learn from
multiple representations of the same data, which has shown great promise recently. In
Twitter data the multiple views are the different types of information available, i.e. text,
metadata, interactions with other users, and features extracted from the data.
We created a generic multi-view neural network that can learn from multiple
representations of a problem. It combines the different views and achieves better
accuracy than any single representation. We consider four specific features to train the
model: TF-IDF and doc2vec (which capture textual information), node2vec (which captures
user network interaction), and a timestamp feature (which captures time-related user
behaviour). Using these features we train our multi-view model and then predict the
exact geo-coordinates of a user. Finally, we analyse the performance of the model in
terms of the distance between the actual and the predicted geo-coordinates.
Chapter 2-RELATED WORK
Generally, there are two approaches to extracting user location information: content-based
and network-based. Content-based approaches use user-generated content; natural
language processing (NLP) is first applied so that the textual data can be used as
features. Network-based approaches consider the user's interactions with other users in
nearby locations. Recently, several methods have addressed the Twitter user geo-location
problem using deep learning. For example, Liu and Inkpen train stacked de-noising
auto-encoders for predicting regions, states and geographical co-ordinates.
Research has shown a correlation between the likelihood of friendship between two
social-network users and the geographical distance between them. One can therefore
predict a user's location from the location information of his or her friends, and this
correlation is exactly what network-based approaches exploit. The network-based approach
has several advantages over its content-based counterpart, including language
independence. It also does not require training, which is a very resource-intensive and
time-consuming process on big datasets.
However, the inherent weakness of this approach is that it cannot propagate labels
(locations) to users who are not connected to the graph; as a result, isolated users
remain unlabelled. To address this problem, unified text-and-network methods have been
proposed that leverage both the discriminative power of textual information and the
representativeness of the user graph. In particular, textual information is used to predict
labels for disconnected users before running label-propagation algorithms. A further
novelty of these works lies in building a dense undirected graph based on user mentions,
which significantly improves location prediction. Subsequent models combining text,
metadata and user-network features have also been introduced, but they rely on user
profile information such as the user's stated location, time zone and UTC offset. Such
information should be considered unavailable in the Twitter user geo-location context,
which is why the benchmark datasets considered in this work do not provide Twitter
profile information.
Chapter 3-METHODOLOGY
In this project we wish to predict the location of a user using the tweet content and
tweeting behaviour obtained from the dataset. Using this information we predict the area
(in terms of geo-coordinates) where the user most probably resides. Our method
addresses this as a regression problem.
We propose a multi-view neural network model that learns from multiple views of the
data to predict user location. The benefit of this model is its ability to take both
content-based and network-based features, covering the user's tweet content, the user's
network structure and time information. It is worth mentioning that, except for the time
feature, all features are extracted from tweet content. Integrating all the features into this
model results in a powerful tool for predicting a user's geo-location. This section presents
our model and the different types of features employed.
3.1.1 Architecture
Our model architecture is shown in fig. 3.1. The model leverages different features
extracted from the tweets' content and metadata, and each feature corresponds to one
view of the model. In fig. 3.1 the four features are fed into four individual branches, each
with a single hidden layer that allows higher-order features to be learned.
Since we have multiple views of the same data, there are many ways to combine them;
one approach is simple vector concatenation of the raw features. We argue, however, that
plain concatenation does not fully exploit the features, and that our architecture is more
effective. In our model, each feature is input to one branch of the neural network, and
each branch is connected to one hidden layer. To learn a non-linear transformation in
each branch, we apply the ReLU activation function after each hidden layer; ReLU is also
efficient for back-propagation. The outputs of these branches are concatenated to form a
combined hidden layer, and at the end a linear activation function produces the outputs.
Our objective is to minimize the distance between the actual and predicted
geo-coordinates, computed with the haversine formula.
Fig. 3.1 Architecture for multi-view neural network
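As a sketch, the branch-and-concatenate architecture described above can be written with the Keras functional API roughly as follows. The hidden-layer width and the per-view input dimensions below are assumptions for illustration; the report does not state them.

```python
from keras.layers import Input, Dense, concatenate
from keras.models import Model

def build_model(view_dim=100, hidden=128):
    # one input branch per view: tf-idf, doc2vec, node2vec, timestamp
    views = [Input(shape=(view_dim,), name=name)
             for name in ("tfidf", "doc2vec", "node2vec", "timestamp")]
    # one hidden layer per branch, with ReLU for a non-linear transformation
    branches = [Dense(hidden, activation="relu")(v) for v in views]
    merged = concatenate(branches)                   # combined hidden layer
    coords = Dense(2, activation="linear")(merged)   # (latitude, longitude)
    model = Model(inputs=views, outputs=coords)
    model.compile(optimizer="adam", loss="mse")
    return model
```

The four branches are trained jointly, so the combined hidden layer can learn cross-view interactions that simple concatenation of the raw feature vectors would miss.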
3.1.2 Training model
To train our model we first divide the dataset into two parts: a training set and a test
set. During training we map the concatenated output to the geo-coordinates of the
training users. We train the model with the Adam optimization algorithm, which minimises
our objective. Adam combines the adaptive gradient algorithm (AdaGrad) and root-mean-
square propagation (RMSProp), and is effective in deep learning because it achieves good
results quickly.
The metric we use is the distance error between the actual and predicted
geo-coordinates; the mean squared error is also calculated for each epoch. During
training, the distance-error metric of the model is continuously monitored.
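A single Adam update can be sketched in plain Python to show how it combines a momentum-style first-moment accumulator with RMSProp-style second-moment scaling. This is an illustration with Adam's usual default hyper-parameters, not the report's actual training code.

```python
def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # w: parameter, g: gradient, m/v: moment estimates, t: step count (from 1)
    m = b1 * m + (1 - b1) * g        # first moment (running mean of gradients)
    v = b2 * v + (1 - b2) * g * g    # second moment (running mean of squared gradients)
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero-initialised moments
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v
```

On the first step the bias-corrected update moves the parameter by roughly the learning rate in the gradient direction, which is part of why Adam converges quickly early in training.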
3.1.3 Testing model
To predict the location of a user from the test set, we use the trained model to obtain
the geo-coordinates of each test user. Since a linear activation function maps the
concatenated output to geo-coordinates during training, at test time we obtain the
predicted geo-coordinates directly, and our metric computes the distance error between
the actual and predicted coordinates. The accuracy of the model is measured by this
distance-error metric.
3.2 Dataset
In this project we employ a dataset containing tweets coming from different regions of
the world, shown in fig. 3.2. It consists of more than 650,000 tweets from 9,995 unique
users. Tweets were filtered carefully before being put into the dataset, to make sure that
only relevant tweets were kept. Every record contains seven fields: user name, latitude,
longitude, time, number of followers, number of followings, and tweet text.
In this dataset, the geo-coordinates of a user's first tweet are used as the user's primary
location. The location is a pair of real numbers, latitude and longitude, given by the lat
and long columns. The time column gives the time in seconds elapsed since Jan 01, 1970
00:00 UTC (the Unix epoch); this value has to be converted into local time before
processing.
The follower and following columns give the number of users who follow that user and
the number of users that user follows, respectively. Finally, the text column holds the
content of the user's tweet; it can have any length up to 280 characters and supports
UTF-8 encoding.
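For example, the epoch-seconds value in the time column can be converted to local time with Python's standard datetime module. The target time zone below (IST) is an assumption for illustration.

```python
from datetime import datetime, timezone, timedelta

IST = timezone(timedelta(hours=5, minutes=30))  # assumed local time zone

def to_local(epoch_seconds):
    # interpret the time column (seconds since Jan 01, 1970 00:00 UTC) as local time
    return datetime.fromtimestamp(epoch_seconds, tz=IST)
```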
Fig. 3.2 Dataset
Before computing the node2vec, doc2vec and tf-idf features, a simple pre-processing
phase is required. First, we tokenize the tweets and remove stop-words using nltk, a
dedicated library for natural language processing. Then we replace URLs and punctuation
by special tokens, which reduces the vocabulary size without affecting the semantics of
the tweets. Finally, nltk is used again for stemming. For the doc2vec feature, we gather
the whole tokenized text of each user into a single document.
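The shape of this pipeline (tokenization, stop-word removal, URL and punctuation replacement) can be illustrated without nltk itself; the stop-word set and the `<url>` token below are toy stand-ins for nltk's full list and the actual special characters used, and the stemming step is omitted.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of"}  # tiny stand-in for nltk's list
URL_RE = re.compile(r"https?://\S+")
PUNCT_RE = re.compile(r"[^\w\s<>]")

def preprocess(tweet):
    tweet = URL_RE.sub("<url>", tweet.lower())  # replace URLs by a special token
    tweet = PUNCT_RE.sub("", tweet)             # strip punctuation
    return [t for t in tweet.split() if t not in STOPWORDS]
```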
Normalization is a common way to pre-process data before applying machine learning
algorithms. There are many ways to normalize data, such as max-min normalization and
l2 normalization; data can also be standardized by removing the mean and dividing by
the standard deviation. Data can be normalized into the [-1, 1] or [0, 1] range; in our
case we normalize all features into the [-1, 1] range.
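For instance, max-min normalization into the [-1, 1] range can be written as:

```python
def minmax_normalize(values, lo=-1.0, hi=1.0):
    # linearly rescale so that min(values) maps to lo and max(values) maps to hi
    mn, mx = min(values), max(values)
    return [lo + (hi - lo) * (v - mn) / (mx - mn) for v in values]
```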
Figure 3.1 shows the different features that are passed as input to our neural network
model. We realize our model by leveraging features from textual information (term
frequency-inverse document frequency, doc2vec and the user interaction network) and
metadata (timestamp). These features are extracted from the records in the dataset. The
rest of this section describes the features and how they are computed.
3.4.1 TF-IDF
TF-IDF stands for term frequency-inverse document frequency, and the tf-idf weight is a
weight often used in information retrieval and text mining. This weight is a statistical
measure used to evaluate how important a word is to a document in a collection or corpus.
The importance increases proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus. Variations of the tf-idf
weighting scheme are often used by search engines as a central tool in scoring and ranking a
document's relevance given a user query.
How to compute
Typically, the tf-idf weight is composed by two terms: the first computes the normalized
Term Frequency (TF), aka. the number of times a word appears in a document, divided by
the total number of words in that document; the second term is the Inverse Document
Frequency (IDF), computed as the logarithm of the number of the documents in the corpus
divided by the number of documents where the specific term appears.
TF (t) = (Number of times term t appears in a document) / (Total number of terms in the
document)
IDF: Inverse Document Frequency, which measures how important a term is. While
computing TF, all terms are considered equally important. However, it is known that
certain terms, such as "is", "of", and "that", may appear many times but have little
importance. Thus we need to weigh down the frequent terms while scaling up the rare
ones, by computing the following:
IDF (t) = log (Total number of documents / Number of documents with term t in it)
The output vectors are normalized so that they have constant length; in our case we
normalize with the l2 norm so that each vector has unit length. The output values lie in
the [0, 1] range and each vector has 100 dimensions. In fact, there are many variants of
the tf-idf definition, and the choice depends on the specific situation.
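The two factors above translate directly into code. This is a minimal sketch over already-tokenized documents; the l2 normalization of the resulting vectors is omitted.

```python
import math

def tf(term, doc):
    # term frequency: occurrences of term divided by document length
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency over a list of tokenized documents
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)
```

Note that a term appearing in every document gets idf = log(1) = 0, so ubiquitous words are weighted down exactly as described above.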
3.4.2 Context feature (doc2vec)
The context feature is a mapping from a variable-length block of text (e.g. a sentence,
paragraph, or entire document) to a fixed-length continuous-valued vector. It provides a
numerical representation capturing the context of the document, and is an extension of
the widely used word2vec model.
The intuition of doc2vec is that a certain context is more likely to produce some sets of
words than other contexts. Doc2vec trains an embedding capable of expressing the relation
between the context and the corresponding words.
To achieve this goal, it employs a simple neural network architecture consisting of one
hidden layer without an activation function. A text window samples some nearby words in a
document; some of these words are used as inputs to the network and some as outputs.
Moreover, an additional input for the document is added to the network bringing the
document’s context. The training process is totally unsupervised. After training, the fixed
representation of the document input will capture the context of the whole document. Two
architectures were proposed to learn a document’s representation, namely, Distributed Bag
of Words (PV-DBOW) and Distributed Memory (PV-DM) versions of Paragraph Vector.
Although PV-DBOW is a simpler architecture, it has been claimed that PV-DBOW performs
robustly if trained on large datasets. Therefore, we select PV-DBOW model to extract the
context feature.
In this project we train PV-DBOW models using the tweets from the training set, and
later extract context feature vectors for both the training and test sets. Our
implementation is based on the gensim library.
3.4.3 Node2vec feature
Let V be the set of nodes of a graph. Node2vec learns a mapping function f : V -> R^d
that captures the connectivity patterns observed in the graph. Here d is a parameter
specifying the dimensionality of the feature representation, and f is a matrix of size
|V| x d. For every source node v, a set of neighbourhood nodes Ns(v) ⊆ V is generated
through a neighbourhood sampling strategy S.
Node2vec employs a sampling method referred to as a biased random walk, which
samples nodes belonging to the neighbourhood of a node v according to discrete
transition probabilities between the current node v and the next node w. These
probabilities depend on the distance between the previous node u and the next node w.
Denote by d_uw the distance, in number of edges, from node u to node w: if the next
node coincides with the previous node, then d_uw = 0; if the next node has a direct
connection to the previous node, then d_uw = 1; and if the next node is not connected to
the previous node, then d_uw = 2.
The random walk sampling runs on the nodes to obtain a list of walks. The node
embeddings are then learned from the set of walks using the stochastic gradient descent
procedure.
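The distance-based transition weights can be expressed compactly. The parameters p and q below are node2vec's return and in-out parameters, which control how strongly the walk revisits or moves away from the previous node; they are not discussed in the text above.

```python
def transition_weight(d_uw, p=1.0, q=1.0):
    # unnormalised search bias alpha for stepping from v to w, given previous node u
    if d_uw == 0:        # w is the previous node itself: return with weight 1/p
        return 1.0 / p
    if d_uw == 1:        # w is directly connected to the previous node
        return 1.0
    return 1.0 / q       # d_uw == 2: w moves away from the previous node
```

These weights, multiplied by the edge weights and normalised, give the discrete transition probabilities of the biased walk.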
In the context of Twitter user geo-location, each node corresponds to a user, while an
edge is a connection between two users. These connections can be defined by several
criteria depending on the availability of data; for example, two users may be considered
connected when actions such as following, mentioning or re-tweeting are detected. In this
project, the content of tweet messages is used to build the graph connections. We
construct an undirected user graph from mention connections. First, we create a unique
set V containing all the users of interest. If a user directly mentions another user and
both belong to V, we create an edge reflecting this interaction. To avoid sparsity of the
connections, if two users of interest mention a third user who does not belong to V, we
also create an edge between these two users. This is shown in figure 3.3.
A shortcoming of this method is that it can only produce an embedding for a node that
has at least one connection to another node; nodes without an edge cannot be
represented, so for an isolated node we use an all-zero vector as its embedding.
Moreover, whenever a new node joins the graph, the algorithm needs to run again to
learn feature vectors for all the nodes of the graph, making our method inherently
transductive.
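The two edge rules (a direct mention within V, and a shared mention of a user outside V) can be sketched in plain Python; the function and variable names here are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def build_mention_graph(users, mentions):
    """users: set V of users of interest; mentions: dict user -> set of mentioned users.
    Returns the set of undirected edges as frozensets."""
    edges = set()
    outside = defaultdict(set)  # out-of-V user -> users of interest who mention them
    for u, mentioned in mentions.items():
        if u not in users:
            continue
        for m in mentioned:
            if m in users:
                edges.add(frozenset((u, m)))  # rule 1: direct mention inside V
            else:
                outside[m].add(u)
    for fans in outside.values():             # rule 2: shared outside mention
        for a, b in combinations(sorted(fans), 2):
            edges.add(frozenset((a, b)))
    return edges
```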
3.4.4 Timestamp feature
In our benchmark dataset we have seen the posting time of all tweets is available in UTC
value (Coordinated Universal Time). This allows us to leverage another view of the data. It
was shown that there exists a correlation between time and place in a Twitter stream of
data. In fact, it is less likely that people tweet late at night than at any other time, which
implies a drift in longitude. Therefore, the timestamp could be an indication for a time zone.
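One plausible way to turn the UTC timestamp into such a feature is a cyclic encoding of the hour of day; the report does not specify the exact transformation used, so this is only an illustration.

```python
import math

def time_feature(epoch_seconds):
    # UTC hour of day, encoded cyclically so 23:59 and 00:00 end up close together
    hour = (epoch_seconds % 86400) / 3600.0
    angle = 2 * math.pi * hour / 24.0
    return (math.sin(angle), math.cos(angle))
```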
3.5 Haversine formula
The haversine formula determines the great-circle distance between two points on a
sphere given their longitudes and latitudes. Important in navigation, it is a special case of
a more general formula in spherical trigonometry, the law of haversines, which relates the
sides and angles of spherical triangles. For two points (lat1, lon1) and (lat2, lon2) on a
sphere of radius r, with all angles in radians, the distance is

d = 2r * arcsin( sqrt( sin^2((lat2 - lat1)/2) + cos(lat1) * cos(lat2) * sin^2((lon2 - lon1)/2) ) )
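The formula translates directly into Python (Earth radius in km; inputs in degrees):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # great-circle distance in km between two (lat, lon) points given in degrees
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))
```

This is the distance-error metric used both as the training objective and for evaluating the predicted geo-coordinates.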
3.6 Keras library
Keras is a minimalist Python library for deep learning that can run on top of Theano or
TensorFlow. It was developed to make implementing deep learning models as fast and
easy as possible for research and development. It runs on Python 2.7 or 3.5 and can
seamlessly execute on GPUs and CPUs using the underlying frameworks.
Chapter 4-RESULTS
We estimated all the feature vectors and normalized the results into the [-1, 1] range.
Since every feature has many columns and the number of rows is very large, the tables
below show only a few columns and rows, simply to give an idea of what our features
look like. The TF-IDF table is shown below:
USER             0          1          2          ...  97         98         99
0197666          0.477979   0.477979   0.394097   ...  0.000000   0.000000   0.000000
100MiNuSNoTHiNG  0.302114   0.212604   0.194424   ...  0.075944   0.075944   0.075944
100barproject    0.212978   0.212978   0.184060   ...  0.062862   0.062660   0.060823
1057thebeat      0.240724   0.240724   0.211703   ...  0.030374   0.028894   0.027690
10SoccerChick10  0.406052   0.368118   0.223256   ...  0.024301   0.024276   0.012915
(values rounded to six decimal places)
After computing the doc2vec feature, we normalize it into the [-1, 1] range; it is a
100-dimensional vector per user. The table below shows doc2vec values for some users:
USER             0           1           2           ...  97          98          99
0197666          0.000000    0.001745    0.003490    ...  0.169279    0.171024    0.172769
100MiNuSNoTHiNG  -0.117170   0.086782    -0.091585   ...  0.130522    0.079063    0.007812
100barproject    0.017681    -0.085947   0.100659    ...  -0.038811   0.070943    -0.082001
1057thebeat      -0.084312   0.054181    -0.065649   ...  -0.244854   0.128289    0.111699
10SoccerChick10  0.166131    -0.015791   -0.079068   ...  -0.191751   -0.044079   0.033215
(values rounded to six decimal places)
Similarly, node2vec is normalized into the [-1, 1] range and is also a 100-dimensional
vector per user. The table below shows node2vec values for some users:
USER             0          1          2          ...  97         98         99
0197666          0.000000   0.000000   0.000000   ...  0.000000   0.000000   0.000000
100MiNuSNoTHiNG  0.468204   0.078034   0.104045   ...  0.156068   0.104045   0.312136
(values rounded to six decimal places)
4.2 Distance error table
4.3 Performance
After estimating the distance error for each user, we report here the best results. The
minimum distance error obtained is 8.72 km, the maximum is 274.18 km, and the average
error is 62.50 km.
Chapter 5-CONCLUSION
Noisy and sparsely labelled data make the prediction of Twitter user locations a
challenging task. While plenty of approaches have been proposed, no method has
obtained very high accuracy. Following the multi-view learning paradigm, we have shown
the effectiveness of combining knowledge from user-generated content and from
network-based relationships. In particular, we proposed a multi-view neural network
architecture that uses text information (words and paragraph semantics), network
topology, and time information. Overall, the proposed model gives good accuracy.
The performance of our model depends heavily on the user-graph feature (node2vec).
The node2vec algorithm used here is transductive, meaning the graph has to be built
over all users.
Chapter 6-REFERENCES
[1] Do, T. H., Nguyen, D. M., Tsiligianni, E., Cornelis, B., & Deligiannis, N. (2017). Multiview
Deep Learning for Predicting Twitter Users' Location. arXiv preprint arXiv:1712.08091.
[4] Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for
networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
discovery and data mining (pp. 855-864). ACM.
[5] Jason Brownlee, Develop Your First Neural Network in Python With Keras Step-By-Step,
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ accessed
on 25/04/2018
[6] Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and
documents. In International Conference on Machine Learning (pp. 1188-1196).
[8] https://networkx.github.io/documentation/networkx-1.10/tutorial/tutorial.html
accessed on 28/03/2018
[12] Python-Data-Science-and-Machine-Learning-Bootcamp,
https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp/learn/v4/
accessed on 14/02/2018