Rathore2018 - Epidemic Model-Based Visibility Estimation in Online Social Networks
Abstract—The emergence of various Online Social Network (OSN) services has revolutionized the way people express themselves among their social connections and to the world. Twitter is one of the most popular OSNs; it allows its users to share ideas with their followers and the public in the form of tweets. Predicting the visibility of a tweet is an interesting problem, and such a prediction might be useful in estimating the privacy risk caused by the tweet. In this paper, we propose a technique inspired by epidemic models to predict the visibility of a tweet. Our model exploits user interest and relationship strength to make this prediction. The evaluation results show that one can predict the total number of likes and re-tweets of a tweet with an accuracy of approximately 89%.

Keywords—Online Social Network; Privacy; Visibility Prediction; Information Diffusion; Forwarding Probability; Keyword Extraction.

I. INTRODUCTION

Online Social Networks (OSNs) are web-based services that let users create an articulated virtual social network with others according to their interests. In recent years, the number of Internet users has increased tremendously worldwide. This growth has boosted various social network sites such as Facebook¹, Twitter² and many more. These platforms allow users to share ideas, events, actions, activities and feelings with their contacts or even with the public. These messages may be in a variety of formats such as text, audio, video, picture and animation. Twitter is one of the popular micro-blogging services; it allows its users to share messages with a maximum length of 280 characters, called tweets.

Most OSN users spend a significant amount of time on such sites on a regular basis. They share a variety of information on these sites in the form of profiles and posts but do not have any idea about their audience. All of these objects might reveal highly sensitive and personal information about users. For example, a user's OSN profile typically includes her/his gender, sexual orientation, email, education, profession and so on. Further, users voluntarily publish a variety of information using the different data/activity sharing services offered by OSNs. This huge amount of personal information attracts malicious users, who might misuse it to launch various kinds of attacks [1], [2], [3], [4].

Therefore, OSNs like Twitter urgently require a mechanism that allows users to know who can access their tweets and who cannot. Presently, Twitter allows its users to make all of their tweets either public or private. If users choose the public setting (the default), their tweets become accessible even to non-Twitter users, whereas tweets of users with the private setting are available to followers only. Such coarse settings do not offer enough privacy to users. Hence, methods to measure and restrict the visibility of a tweet need to be developed.

We firmly believe that an estimate of the visibility of a tweet might help in controlling potential [...]

¹ www.facebook.com
² www.twitter.com
³ In this paper, we use visibility and publicity as synonyms.

Authorized licensed use limited to: University of Exeter. Downloaded on June 17,2020 at 19:50:53 UTC from IEEE Xplore. Restrictions apply.

The research about visibility prediction of user content on OSN platforms is in its early stage. [...]

1) The followers of a user often forward content whose topic matches their interests. We believe that content matching a user's interests has a higher probability of being forwarded by that user. Hence, we infer the interest of a user from the keywords of the tweets s/he has previously forwarded. These keywords allow us to develop a measure of the user's interest, which in turn provides the probability of forwarding a tweet further.
2) Moreover, the strength of the relationship between the source user (the one who generates the tweet) and the forwarding user (the one who forwards/shares it further) also affects the forwarding decision.
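As a rough illustration of how these two observations combine, the following Python sketch multiplies an interest probability by a relationship strength, in the spirit of equations (4) and (5) given later in the paper. Using the trust ratio of equation (5) as the relationship strength is an assumption of this sketch, and the function names and tweet ids are hypothetical.

```python
def trust(tweets_of_u, engaged_by_v):
    """Trust of v in u in the spirit of equation (5): m / |Z_u|.

    tweets_of_u  -- ids of u's tweets in the time window (the set Z_u)
    engaged_by_v -- ids of tweets that v liked or forwarded
    """
    z_u = set(tweets_of_u)
    if not z_u:
        # Assumption of this sketch: no tweets in the window means no trust evidence.
        return 0.0
    m = len(z_u & set(engaged_by_v))  # only u's own tweets count towards m
    return m / len(z_u)


def forwarding_probability(interest, relationship_strength):
    """Equation (4): product of two quantities that both lie in [0, 1]."""
    for value in (interest, relationship_strength):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must lie in [0, 1]")
    return interest * relationship_strength


# Hypothetical data: u posted 8 tweets and v liked or forwarded 2 of them.
z_u = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
r_uv = trust(z_u, ["t2", "t5", "x9"])      # "x9" is not one of u's tweets
alpha = forwarding_probability(0.5, r_uv)  # interest probability 0.5 is made up
print(r_uv, alpha)                         # 0.25 0.125
```

A follower with no interest in the topic, or with no recorded trust in the source, gets a forwarding probability of zero under this product form, which matches the intuition behind both observations.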
[...] interest in at least one topic $t \in T$ with non-zero probability and only forwards messages belonging to the topics of her/his interest.

As soon as a user $u \in V$ publishes a tweet $tw$ on topic $t$, it becomes available to all the followers of $u$. As a result, the followers of $u$ become susceptible to forwarding $tw$ further. Any of these followers may forward $tw$ depending on her/his interest and on the influence of $u$ on her/him. For simplicity, we assume that a tweet is forwarded by any user at most once. Once a follower forwards $tw$, s/he increases the visibility of $tw$. Let $\alpha_t$ be the average forwarding probability, over all susceptible users, with respect to the topic $t \in T$ associated with $tw$. The users who have forwarded $tw$ are termed influenced users for the topic $t$ and are denoted by the set $I_t$. The influenced users make $tw$ available to their respective followers. As a result, some of their followers also become susceptible and forward the message further with probability $\alpha_t$, while others remain uninterested in the message with probability $\beta = 1 - \alpha_t$. We refer to the latter as Deactivated nodes or Neutral nodes with respect to the topic $t$ and denote the set of such users by $D_t$.

Let $M$ be the total number of users in the OSN, and let each user have $k$ followers at one-hop distance on average. Then there will be $k + k^2$ followers up to 2-hop distance, $k + k^2 + k^3$ up to 3-hop distance, and so on. Furthermore, $k$, $k + \alpha_t k^2$ and $k + \alpha_t k^2 + \alpha_t^2 k^3$ are the numbers of susceptible users up to hop distance 1, 2 and 3, respectively. Hence, the total number of susceptible users for tweet $tw$ up to hop length $h$ from source user $u$ is given by the following equation:

$$
\begin{aligned}
S_t^h(u) &= k + \alpha_t k^2 + \alpha_t^2 k^3 + \alpha_t^3 k^4 + \dots + \alpha_t^{h-1} k^h \\
         &= \sum_{i=1}^{h} \alpha_t^{i-1} k^i \\
         &= k\left[1 + \alpha_t k + (\alpha_t k)^2 + \dots + (\alpha_t k)^{h-1}\right] \\
         &= k\left[\frac{1 - (\alpha_t k)^h}{1 - \alpha_t k}\right], \qquad \alpha_t k \neq 1
\end{aligned}
$$

The closed form is the usual geometric sum and holds whenever $\alpha_t k \neq 1$ (for $\alpha_t k = 1$ the sum is simply $kh$). For $\alpha_t k > 1$ it is convenient to write it with a positive numerator and denominator:

$$
S_t^h(u) = k\left[\frac{(\alpha_t k)^h - 1}{\alpha_t k - 1}\right] \tag{1}
$$

Similarly, the number of users up to hop length $h$ who have shared $tw$ further (i.e. the influenced users) is given by:

$$
I_t^h(u) = \alpha_t S_t^h(u) = \sum_{i=1}^{h} \alpha_t^{i} k^{i} \tag{2}
$$

Let $M_h^u$ be the total number of users up to hop length $h$ from $u$. Then the number of users who remain reluctant to forward the message up to hop length $h$ is given by

$$
D_t^h(u) = M_h^u - I_t^h(u) \tag{3}
$$

Moreover, $I_t + D_t = M$, where $I_t$ and $D_t$ are the total numbers of influenced (infected) and deactivated users, respectively, for the topic $t$.

2) Forwarding Probability: Let $tw$ be a tweet with topic $t \in T$. The forwarding probability of $tw$ for user $u$ is proportional to the level of interest that $u$ has in $t$ and to the relationship strength of $u$ with the tweet's forwarder/owner. Let $\tau_t^u \in [0, 1]$ be the probability of interest that $u$ has in the topic of $tw$ shared by her/his friend $v$, and let $r_{uv} \in [0, 1]$ be the relationship strength of link $(u, v)$. Then the forwarding probability of the tweet $tw$ for user $u$ is given by

$$
\alpha_t^u = \tau_t^u \cdot r_{uv} \tag{4}
$$

Trust: The forwarding probability of a tweet $tw$ of $u$ by her/his follower $f$ is also proportional to the level of trust $f$ has in $u$. Measuring trust between two OSN users is not a trivial task, and some schemes to measure trust have been proposed in the literature [21]. Restricting ourselves to privacy only, we give a simple formula to measure the trust a user $v$ has in a user $u$ for a time window $\Delta$ as follows. Let $Z_u$ be the set of tweets of $u$, out of which $m$ tweets are liked or forwarded by user $v$. Then the trust that $v$ has in $u$ is given by the following formula:

$$
\eta_{v,u}^{\Delta} = \frac{m}{|Z_u|} \tag{5}
$$

User Interest: Users have different preferences concerning the topic associated with a tweet. They like/forward tweets that match their preferences and ignore tweets that do not match any of their interests. To find the forwarding probability of a user, we estimate the probability of interest for a
predetermined set of topics using a supervised learning technique over the set of tweets the user has shared in the past.

We trained a Naive Bayes (NB) classifier, a Multinomial Naive Bayes (MNB) classifier and a Linear SVC to detect topics from the set of tweets of Twitter users. We chose the NB classifier because it is one of the popular classifiers used in text processing [22]. As features, we supplied the frequency distribution of the keywords that we extracted, using Algorithm 2, from the set of tweets of the target user. We used Algorithm 1 to prepare this frequency distribution. We set the frequency threshold to 6, since this value gave the maximum classification accuracy in our experiments.

Algorithm 1: Generates the frequency distribution of keywords from a given multi-set of keywords.
    Data: K: a multi-set of keywords
    Data: µ: frequency threshold
    Result: F: a set of tuples (k, f), where k is a keyword and f is its frequency of occurrence
        F = {}                              ▷ an empty dictionary
        for each k ∈ K do                   ▷ count occurrences of k
            if k ∈ F then
                F[k] = F[k] + 1
            else
                F[k] = 1
        for each k ∈ F do                   ▷ drop keywords below the threshold
            if F[k] < µ then
                remove(F[k])
        return F

Feature Extraction: To find a user's interest in the chosen topics, we exploited keywords extracted from the set of tweets of a user u. Algorithm 2 extracts keywords from a tweet tw by employing some of the Natural Language Processing functions available in the NLTK library [23]. The algorithm takes a tweet tw and a set of stop words as arguments and returns a list of keywords that occur in tw. We used a list of stop words fitted to our needs to remove any stop word occurring in tw.

Algorithm 2: Extracts keywords from a given tweet.
    Data: tw: a tweet
    Data: st_words: a list of stop words
    Result: keywords: a finite multi-set of keywords
        text = remove_emoji(tw)             ▷ removes emojis and similar symbols
        text = clean_text(text)             ▷ removes all URLs, images, etc.
        noun_phrases = np_extractor(text)
        words = []
        for each np in noun_phrases do
            words.extend(split(np))         ▷ split each noun phrase into words
        for each w in words do
            if w ∈ st_words then
                remove w from words
        keywords = []
        for each w in words do
            w = spell_correct(w)
            w = lemmatize(w)
            keywords.append(w)
        return keywords

IV. IMPLEMENTATION & EVALUATION

To evaluate the proposed model, we sampled a subgraph of Twitter using BFS sampling. We chose BFS sampling because it is one of the popular methods to obtain a plausible sample graph of an OSN [24].

A. DataSet

The dataset used in our experiments consists of 100,176 Twitter users with their follower information and recent tweets. Figure 3 shows the out-degree distribution of our sampled Twitter graph, which indicates that around 99,000 of the users have fewer than 100 followers (out-degree). Further, we took a Twitter account as the source for the BFS sampling algorithm. As most of the popular accounts on Twitter have a large number [...]
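For readers unfamiliar with BFS sampling, a minimal sketch follows. It assumes the follower lists are already available as an in-memory dictionary (a stand-in for the crawled Twitter data); the function names, the node budget and the toy graph are hypothetical and are not taken from the authors' implementation.

```python
from collections import deque


def bfs_sample(followers, seed, max_nodes):
    """Return up to max_nodes user ids reachable from seed, in BFS order.

    followers maps a user id to an iterable of that user's follower ids.
    """
    visited = {seed}
    order = [seed]
    queue = deque([seed])
    while queue and len(order) < max_nodes:
        user = queue.popleft()
        for follower in followers.get(user, ()):
            if follower not in visited:
                visited.add(follower)
                order.append(follower)
                queue.append(follower)
                if len(order) >= max_nodes:
                    break  # node budget exhausted
    return order


def induced_edges(followers, nodes):
    """Keep only the (user, follower) links whose endpoints were both sampled."""
    keep = set(nodes)
    return [(u, f) for u in nodes for f in followers.get(u, ()) if f in keep]


# Toy follower graph (hypothetical): "a" is the seed account.
toy = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": ["a"]}
sample = bfs_sample(toy, "a", max_nodes=4)
edges = induced_edges(toy, sample)
print(sample)  # ['a', 'b', 'c', 'd']
print(edges)   # [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')]
```

Because BFS explores all followers of a user before moving one hop further, the sample stays dense around the seed, which is convenient for a model that reasons about hop distance from a source user.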
Fig. 3: Out-degree Distribution in the graph.

TABLE I: Classification Metrics

Metrics               Topic                    NB         MN-NB      Linear SVC
Accuracy Statistics   Average Accuracy (%)     57.256     95.692     76.077
                      Variance in Accuracy     220.170    17.059     188.020
                      Std Dev in Accuracy      14.838     4.130      13.712
Precision             Film & Music             0.428      0.470      0.417
                      Politics & Governance    0.820      0.864      0.838
                      Science & Technology     0.806      0.938      0.852
                      Sports                   0.843      0.925      0.891
                      Tourism                  0.696      0.782      0.778
Recall                Film & Music             0.891      0.944      0.944
                      Politics & Governance    0.684      0.981      0.988
                      Science & Technology     0.426      0.981      0.995
                      Sports                   0.637      0.963      0.963
                      Tourism                  0.488      0.980      0.980
F-Score               Film & Music             0.512      0.560      0.524
                      Politics & Governance    0.691      0.899      0.886
                      Science & Technology     0.528      0.948      0.907
                      Sports                   0.657      0.917      0.899
                      Tourism                  0.483      0.840      0.838
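The classifiers compared in Table I take keyword-frequency features as input. The sketch below is a simplified, standard-library stand-in for Algorithms 1 and 2: it replaces the NLTK noun-phrase extraction, spell correction and lemmatization steps with plain regular-expression tokenization, so it only approximates the pipeline described in the paper. The stop-word list, the tweets and the threshold are made up for illustration.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z]+")


def extract_keywords(tweet, stop_words):
    """Rough stand-in for Algorithm 2: lowercase, drop URLs, tokenize, filter."""
    text = URL_RE.sub(" ", tweet.lower())
    # Keeping only alphabetic runs also discards emojis, digits and punctuation.
    return [w for w in TOKEN_RE.findall(text) if w not in stop_words]


def frequency_distribution(keywords, threshold):
    """Algorithm 1: count keywords and keep those seen at least threshold times."""
    counts = Counter(keywords)
    return {k: f for k, f in counts.items() if f >= threshold}


stop_words = {"the", "a", "is", "in", "to", "and"}
tweets = [
    "The match is in Delhi https://example.com/x",
    "A great match and a great team",
    "Team to win the match",
]
keywords = [w for tw in tweets for w in extract_keywords(tw, stop_words)]
features = frequency_distribution(keywords, threshold=2)
print(features)  # {'match': 3, 'great': 2, 'team': 2}
```

With the threshold of 6 reported in the paper, only keywords that a user repeats often would survive, which keeps the feature vector short and focused on recurring interests.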
[...] we made predictions for different users for hop counts ranging from 2 to 5 and recorded the results. We divided users into two categories: non-celebrity and celebrity. We refer to users having a follower count of less than 300 as non-celebrity users and to the rest as celebrity users. We made predictions for 100 randomly selected tweets belonging to randomly selected users of both categories. We then compared the predicted values with the actual number of likes (favorite count) plus the re-tweet count that each tweet received; we refer to this sum as the publicity/visibility value. Next, we calculated the prediction error (in %) by subtracting the predicted value from the actual publicity value and averaged all of the results. Since our model also made some over-predictions, signed errors can cancel each other out in such an average; hence we calculated the root mean square error for these predictions and plotted it against the hop count. Figures 4 and 5 show these results.

Fig. 5: Error rate vs Hop-Count in tweet publicity prediction for Celebrity users

These results confirm that the followers' interest in the topic of a tweet, trust, and local topological parameters like out-degree impact the visibility of a tweet. Therefore, if we can hide a tweet from the followers who have a higher interest in the topic of the tweet, a high trust in the forwarding user, or both, then the visibility of the tweet might be controlled. Similarly, we can keep the visibility of a tweet low by hiding it from a follower with a high out-degree or a higher interest in the topic of the tweet.
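The evaluation procedure described above can be approximated in a few lines of Python. The sketch assumes that the predicted visibility is the influenced-user count of equation (2) and that the percentage error is taken relative to the actual value; the paper does not spell out either detail, and all parameters and counts below are made up for illustration.

```python
import math


def influenced(alpha, k, h):
    """Equation (2): sum over i = 1..h of alpha^i * k^i."""
    return sum((alpha ** i) * (k ** i) for i in range(1, h + 1))


def percent_errors(predicted, actual):
    """Signed error in percent of the actual value (assumed normalisation)."""
    return [100.0 * (a - p) / a for p, a in zip(predicted, actual)]


def root_mean_square(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical tweets: (forwarding probability, average out-degree, actual likes + re-tweets).
tweets = [(0.25, 8, 10), (0.25, 4, 4), (0.75, 4, 30)]
actual = [a for _, _, a in tweets]

mean_by_hop, rms_by_hop = {}, {}
for h in range(2, 6):  # hop counts 2..5, as in the evaluation
    predicted = [influenced(alpha, k, h) for alpha, k, _ in tweets]
    errors = percent_errors(predicted, actual)
    mean_by_hop[h] = sum(errors) / len(errors)
    rms_by_hop[h] = root_mean_square(errors)

for h in sorted(rms_by_hop):
    print(h, round(mean_by_hop[h], 2), round(rms_by_hop[h], 2))
```

At hop count 3 in this toy data the signed errors are -40%, 25% and -30%, so their plain average (-15%) understates the typical miss, while the root mean square (about 32%) does not let over- and under-predictions cancel.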
REFERENCES

[2] C. Zhang et al., “Privacy and security for online social networks: Challenges and opportunities,” Netwrk. Mag. of Global Internetwkg., vol. 24, no. 4, pp. 13–18, Jul. 2010.
[3] C. D. Marsan, “15 worst internet privacy scandals of all time,” Jan. 2012. [Online]. Available: http://www.networkworld.com/article/2185187/security/15-worst-internet-privacy-scandals-of-all-time.html
[4] L. C. Williams, “The 9 biggest privacy and security breaches that rocked 2013,” Dec. 2013. [Online]. Available: https://thinkprogress.org/the-9-biggest-privacy-and-security-breaches-that-rocked-2013-416a61e194450
[5] D. Gruhl et al., “Information diffusion through blogspace,” in Proceedings of the 13th International Conference on World Wide Web, ser. WWW ’04. ACM, 2004, pp. 491–501.
[6] R. Zafarani et al., Social Media Mining: An Introduction. New York, NY, USA: Cambridge University Press, 2014.
[7] H. Mao et al., “Loose tweets: An analysis of privacy leaks on twitter,” in Proceedings of the 10th Annual ACM Workshop on Privacy in the Electronic Society, ser. WPES ’11. ACM, 2011, pp. 1–12.
[8] E. D. Cristofaro et al., “Hummingbird: Privacy at the time of twitter,” in 2012 IEEE Symposium on Security and Privacy, May 2012, pp. 285–299.
[9] T. Hogg et al., “Stochastic models predict user behavior in social media,” CoRR, vol. abs/1308.2705, 2013.
[10] L. Zhu and K. Lerman, “A visibility-based model for link prediction in social media,” in Proceedings of the ASE/IEEE Conference on Social Computing, 2014.
[11] D. Kempe et al., “Maximizing the spread of influence through a social network,” in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’03. ACM, 2003, pp. 137–146.
[12] N. Du et al., “Scalable influence estimation in continuous-time diffusion networks,” in Proceedings of the 26th International Conference on Neural Information Processing Systems, ser. NIPS ’13, 2013, pp. 3147–3155.
[13] A. Goyal et al., “Learning influence probabilities in social networks,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining, ser. WSDM ’10. ACM, 2010, pp. 241–250.
[14] J. Yang and J. Leskovec, “Modeling information diffusion in implicit networks,” in Proceedings of the 2010 IEEE International Conference on Data Mining, ser. ICDM ’10. IEEE Computer Society, 2010, pp. 599–608.
[15] A. Guille et al., “Information diffusion in online social networks: A survey,” SIGMOD Rec., vol. 42, no. 2, pp. 17–28, Jul. 2013.
[16] A. Kupavskii et al., “Prediction of retweet cascade size over time,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ser. CIKM ’12. ACM, 2012, pp. 2335–2338.
[17] M. M. Anwar et al., “Predicting the spread of a new tweet in twitter,” in Databases Theory and Applications, M. A. Sharaf, M. A. Cheema, and J. Qi, Eds. Cham: Springer International Publishing, 2015, pp. 104–116.
[18] M. Jenders et al., “Analyzing and predicting viral tweets,” in Proceedings of the 22nd International Conference on World Wide Web, ser. WWW ’13 Companion. New York, NY, USA: ACM, 2013, pp. 657–664.
[19] E. F. Can et al., “Predicting retweet count using visual cues,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, ser. CIKM ’13. ACM, 2013, pp. 1481–1484.
[20] N. C. Rathore et al., Predicting User Visibility in Online Social Networks Using Local Connectivity Properties. Springer International Publishing, 2015, pp. 419–430.
[21] W. Sherchan et al., “A survey of trust in social networks,” ACM Comput. Surv., vol. 45, no. 4, pp. 47:1–47:33, Aug. 2013.
[22] H. Mao et al., “Loose tweets: An analysis of privacy leaks on twitter,” in Proceedings of the 10th Annual ACM Workshop on Privacy in the Electronic Society, ser. WPES ’11. ACM, 2011, pp. 1–12.
[23] “Natural language processing toolkit,” June 2018. [Online]. Available: http://www.nltk.org/
[24] M. Kurant et al., “Towards unbiased bfs sampling,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 9, pp. 1799–1809, Oct. 2011.
[25] “Python,” July 2018. [Online]. Available: https://www.python.org/
[26] “Tweepy,” July 2017. [Online]. Available: http://tweepy.readthedocs.io/en/v3.5.0/
[27] “Twitter developer documentation,” July 2017. [Online]. Available: https://dev.twitter.com/rest/public