
TWITTER RECOMMENDATION SYSTEM

Roshan Sumbaly
This assignment requires us to build a Twitter user recommendation system. The input to the system is
the user’s profile, i.e. either his social graph or his semantic information. Exploiting these two
options, we need to generate a list of the top 10 recommendations for each of 10 of our friends (a total of 100
recommendations). The code uses both the “TwitterSearch” functions and the “python-twitter”
package. The “python-twitter” package was also extended with additional features, such as
retrieving the names of all followers of any user, a feature missing in the base package.

Social Graph
The first way to recommend users is on the basis of the social graph. This approach uses the
already existing links between people to find possible new links. Different weights are given to
these new links, and the ones with the highest weight are recommended to the target user (the user for
whom we are recommending). Two things are crucial in such algorithms:
(a) which links are taken into consideration, and (b) how weights are assigned.

The first algorithm in this section is based on “Similarity”, i.e. pick up new links from the people who follow
you. So if A, B, and C follow you and also follow D, then D would be the recommended new link. This
algorithm generally fails on Twitter, because millions of people follow celebrities
or news channels while the celebrity follows only a few people. Thus, if the celebrity is the target user, he
will be recommended news channels which may not be of interest to him. [In our context, celebrities are
people who are followed by many but follow only a closed network. Similarly, non-celebrities are people
who follow many celebrities but are followed by comparatively few people.]
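
Purely as an illustration (not part of the submitted code), a minimal Python sketch of the “Similarity” idea; get_followers(user) and get_friends(user) are hypothetical helpers assumed to return sets of screen names:

from collections import Counter

# Sketch of "Similarity": if several of your followers also follow D,
# D becomes a candidate; candidates are ranked by how many of your
# followers follow them. get_followers/get_friends are hypothetical.
def similarity_recommend(target, get_followers, get_friends, top_n=10):
    counts = Counter()
    my_friends = get_friends(target)
    for follower in get_followers(target):        # A, B, C follow the target ...
        for candidate in get_friends(follower):   # ... and also follow D
            if candidate != target and candidate not in my_friends:
                counts[candidate] += 1
    return [user for user, _ in counts.most_common(top_n)]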

[Figure] The nodes represent unique users, while an arrow from node A to node B indicates that A is following B. The red node is the target user, i.e. the one for whom we are recommending. Using the “Similarity” algorithm, the yellow node is the recommended user.

The second algorithm is called “Transitive attention.” This is based on the (second-level) links formed by your
friends, i.e. the people you follow. This algorithm addresses the shortcoming of the first one, although
a celebrity target user may then get recommendations only from his small close network (unless
his close network contains non-celebrities).

This model may face problems like “loss of interest overlap”, i.e. the overlap of interests with the target user decreases as you go further into the social graph.
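
As another hypothetical sketch, under the same assumed get_friends helper as above, “Transitive attention” gathers candidates two hops out through the people the target user follows:

from collections import Counter

# Sketch of "Transitive attention": candidates are the people followed by
# the people the target user follows (second-level links). get_friends is
# the same hypothetical helper as above.
def transitive_recommend(target, get_friends, top_n=10):
    counts = Counter()
    my_friends = get_friends(target)
    for friend in my_friends:                     # people the target follows ...
        for candidate in get_friends(friend):     # ... and whom they follow
            if candidate != target and candidate not in my_friends:
                counts[candidate] += 1
    return [user for user, _ in counts.most_common(top_n)]
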
The third algorithm, which was implemented, is called “Collaborative Filtering.” It is based on finding new
links through the people who follow the same people you do (i.e. with whom you share friends). This works well
for both kinds of target user, celebrity and non-celebrity.

[Figure] The orange node shows a “Follower of friends” (FOF). The nodes recommended to the target user (red) are the yellow nodes.

Our first goal is to determine new links. We first find all the people who follow the same people as we do, i.e.
the followers of friends (FOFs, the orange node). We then take each of these FOFs and store the other
new people (yellow nodes) they also follow. This is our dataset of interest, since we first found like-minded
people (the FOFs) and then saw whom they were following. Analysis of this dataset shows that people
in this category are either friends (whom we haven’t added yet) or people and groups with similar
interests.

We next need to give each new yellow node a score for ranking. We do this like PageRank, wherein the
weight of each new yellow node is the aggregation of the scores assigned by the FOFs it is connected to.
The score of each FOF is the Jaccard similarity J = |A ∩ B| / |A ∪ B|, where set A contains the
friends of the target user and set B the friends of the FOF in question. A high overlap means many friends are
common, so a higher score should be assigned to the FOF and, indirectly, to his new yellow nodes. The
formula yields 1.0 for a perfect match and 0.0 for no match at all; since a yellow node can receive
scores from several FOFs, its final score can exceed 1.0. Finally, the scores for all yellow nodes are added up and ranked.
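
As a worked example: if the target user’s friends are A = {p, q, r, s} and an FOF follows B = {q, r, t}, then J = |{q, r}| / |{p, q, r, s, t}| = 2/5 = 0.4, and the yellow node t receives 0.4 from this FOF. If a second FOF with score 0.8 also follows t, t’s aggregate score becomes 1.2.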

Semantic Information
The next set of algorithms aims to exploit semantic information such as keywords, hashtags, location, etc.
Each of these can help in finding a different set of users. Keywords and hashtags help in finding people
with similar interests; hashtags are generally used to voice an opinion or raise a question about
something. Location and time zone are a totally different set of parameters, which can help in
recommending people in the same geographic area or time zone.

Finding users who use the same keywords would be a good parameter for our recommendation
engine. The problem in this case is the language: due to the 140-character limit, most
users tend to use abbreviated or slang words, so the “language models” for sites like Twitter or
Facebook Walls are totally different from the language models available in research. The models found
in the literature are mostly well formed and generated from news corpora and the like.

Another way to recommend, which was also implemented (“SemanticFiltering1.py“), is to use hashtags:
users who use the same hashtags are likely to have similar interests. On implementing this and inspecting the output
with myself as the target user, I realized that the basic flaw in this method is that it recommends
people who may have used the tag only once in their entire tweet history. So even though they share an
interest of mine, they aren’t appealing enough to be followed. (For example, I got a
recommendation for a user with only one tweet, which happened to carry the hashtag of my interest.) A better
approach would have been to rank these users by (number of occurrences of the hashtag) / (total number of tweets), but since
this is computationally expensive the idea was dropped.
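
For illustration only, the dropped ranking might have looked like the following sketch; get_timeline(user) is a hypothetical helper assumed to return the full list of a user’s tweet texts (fetching it for every candidate is exactly what makes the idea expensive):

# Sketch of the dropped idea: a user's affinity for a hashtag is the
# fraction of their tweets that contain it. get_timeline is hypothetical.
def hashtag_affinity(user, hashtag, get_timeline):
    tweets = get_timeline(user)
    if not tweets:
        return 0.0
    uses = sum(1 for text in tweets if hashtag in text)
    return float(uses) / len(tweets)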

A totally different approach was adopted by restricting the domain to recommending only those people
to whom our friends have replied (“SemanticFiltering2.py”). My hypothesis is that a reply is an exchange
of opinion or trust and hence can indicate people of similar interests. Initial experimentation showed
that the recommendations were very accurate, surfacing users of similar interests. But by reading only the last
20 tweets from each of my friends, I was already being recommended around 168 users. To cut this down,
some form of ranking or scoring was required, for which I used the time and location information.

The time information is used to see which users have been active, so as to remove inactive ones (users
like Oprah who create accounts just for publicity and then hardly use them; it is also very common on Twitter
for people to open an account to experiment, spam for a day or two, and then never visit it again). I ranked
users by giving them a score of 1 / (1 + days difference), i.e. a higher score to a user with a recent tweet
than to one who hasn’t tweeted for a while.
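
For example, a user whose latest tweet was posted today scores 1 / (1 + 0) = 1.0, while one who last tweeted nine days ago scores 1 / (1 + 9) = 0.1.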

To incorporate another level of personalization, the top results were sorted on the basis of geographic
location, i.e. users from the same area as the target user were given higher priority.
This was coded at a rudimentary level by matching every word in the user’s location field.
For example, if the target user has the location “Stanford USA”, another user with the same words
(“USA Stanford”, “Stanford USA”) will get a higher ranking than one with only “USA”, who in turn
will rank higher than someone with the location “India” (no word overlap).
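
Treating location fields as word sets and scoring the overlap with the same Jaccard coefficient as before, “Stanford USA” vs. “USA Stanford” gives an overlap of {Stanford, USA}, i.e. a score of 2/2 = 1.0; “Stanford USA” vs. “USA” gives 1/2 = 0.5; and “Stanford USA” vs. “India” gives 0/3 = 0.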

Results
Results show that collaborative filtering helps in discovering “old friends”. The semantics-based
algorithm, on the other hand, surfaces a mixture of people who may be “old friends” as well as some new people
with similar interests. For example, for the user “itsmemad”, collaborative filtering gives
results that include 6 of her college friends. For “kapeeshsaraf”, the semantics-based algorithm
returns a mixture of friends as well as some unfamiliar people in the Bay Area whom he is willing to
follow.

This brings us to the next observation: whom a user ultimately follows is a function of the user's mood.
One group of people prefers following only their friends (even though they might post very
infrequently), while another group prefers following people who post frequently even though
they don’t know them.

No clear pattern could be derived from the ranks. Since I knew the 10 target users (and
their close networks) personally, while ranking the results I gave a higher rank to recommended users
who belonged to our friend network. This reflects my own bias, since I would generally follow only my
friends if they were recommended. But looking at the ranks given by the 10 users, it becomes apparent that this
bias may or may not be present in other people.

For example, “psychosurd” gives a low rank to two users from his own college (who were hence
assigned high ranks by me) because of personal grudges. Another example of
personal taste is “vishwanathrulz”: even though he comes from a tech background, he
prefers following a “comical” character like “Peegeekay” over, say, “cloudideas.” This is
the opposite of the way I ranked them, since I prefer following technology-related users like “hadoop”,
“googleresearch”, ”yahoo”, etc., over a funny character.

The last observation was that most users complained about being recommended users who
have very few posts, especially in the collaborative-filtering group. This is because no
ranking was done on the basis of the number of tweets. This could be incorporated into the ranking, which
currently uses only the Jaccard coefficient.
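
As a purely hypothetical sketch (neither the combination nor the log damping below is part of the implemented system), the tweet count could be folded into the existing score like this:

import math

# Hypothetical: damp the Jaccard-based score of candidates with very few
# tweets; statuses_count is the candidate's total number of tweets.
def combined_score(jaccard_score, statuses_count):
    return jaccard_score * math.log(1 + statuses_count)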

Following is the list of 10 users for whom I ran my algorithms.

Collaborative: itsmemad, fizzle_sizzle, Adityarao, shakti, eeshan

Semantic: psychosurd, vishwanathrulz, kapeeshsaraf, anup99, angadsinghgill


Code
CollaborativeFiltering.py
from TwitterSearch import *
import pprint
import twitter

p = pprint.PrettyPrinter(indent=4)

# Target user for whom we generate recommendations
DEFAULT_USER = "abhishek_kr7"

def jaccard_coefficient(l1, l2):
    # Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|
    return float(len(l1.intersection(l2))) / float(len(l1.union(l2)))

a = TwitterSearch()
api = twitter.Api(username='lifeinhex', password='$$$$$$$$$')

friends = api.GetFriends(DEFAULT_USER)
friends_id = a.get_friends(DEFAULT_USER)
friends_names = [f.GetScreenName() for f in friends]
print friends_names

# For each friend, look at up to 10 of their followers (the FOFs) and
# store the set of people every FOF follows.
followers = {}
for friend in friends_names:
    followers_of_friends = api.GetFollowers(friend)
    i = 0
    for follower in followers_of_friends:
        if i == 10:
            break
        i += 1
        temp = follower.screen_name
        if temp is None or temp in followers:
            continue
        try:
            followers[temp] = a.get_friends(temp)
            print friend, temp
        except ValueError:
            print "Error"

# Score each FOF by Jaccard similarity with the target user's friend set,
# then add that score to every user the FOF follows.
similarity_score = {}
scores_users = {}
for follower in followers:
    temp = similarity_score[follower] = jaccard_coefficient(friends_id,
                                                            followers[follower])
    for friend_id in followers[follower]:
        if friend_id in scores_users:
            scores_users[friend_id] = scores_users[friend_id] + temp
        else:
            scores_users[friend_id] = temp

# Keep the 50 highest-scoring candidates.
top = sorted(scores_users.items(), key=lambda item: item[1], reverse=True)[:50]
top_scores = [score for (friend_id, score) in top]
top_scores_id = [friend_id for (friend_id, score) in top]

print "TOP SCORES = "
print top_scores
print top_scores_id

print "Top recommendations = "
for id in top_scores_id:
    if id not in friends_id:
        print get_information(id)['screen_name']

SemanticFiltering1.py

from TwitterSearch import *
import pprint
import twitter
import re
from datetime import datetime

p = pprint.PrettyPrinter(indent=4)

# Target user for whom we generate recommendations
DEFAULT_USER = "abhishek_kr7"

def jaccard_coefficient(l1, l2):
    # Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|
    return float(len(l1.intersection(l2))) / float(len(l1.union(l2)))

a = TwitterSearch()
api = twitter.Api(username='lifeinhex', password='$$$$$$$$$$$')

statuses = api.GetUserTimeline(DEFAULT_USER)

# Extract hashtags from the target user's timeline, search for other users
# tweeting the same hashtags, and score each by the recency of their tweet.
users = {}
for status in statuses:
    temp_hashtags_list = re.findall("#[A-Za-z]*", status.text)
    for hashtag in temp_hashtags_list:
        for i in range(3):  # retry the search up to 3 times
            try:
                tweets = a.search(hashtag, count=50)
                print tweets
                for tweet in tweets:
                    temp_user = tweet['from_user']
                    if temp_user not in users:
                        days = (datetime.now() - datetime.strptime(
                            tweet['created_at'],
                            "%a, %d %b %Y %H:%M:%S +0000")).days
                        users[temp_user] = 1.0 / float(days + 2)
                break
            except:
                print "Trying again ", i

# Sort users by recency score, highest first, and keep the top 100.
results_users = [(count, user_name) for user_name, count in users.items()]
results_users.sort()
results_users.reverse()

modified_users = [user_name for count, user_name in results_users[:100]]

print modified_users
if DEFAULT_USER in modified_users:
    modified_users.remove(DEFAULT_USER)

# Split the target user's location field into a set of words
# (split on commas and whitespace so every word is matched).
my_location = set()
temp_loc = get_information(DEFAULT_USER)['location']
if temp_loc is not None:
    my_location = set(re.split(r"[,\s]+", temp_loc))
print my_location

# Re-rank candidates by word overlap between location fields.
final_scores = {}
for mod_user in modified_users:
    temp_loc = get_information(mod_user)['location']
    if temp_loc is None:
        continue
    temp_location = set(re.split(r"[,\s]+", temp_loc))
    if len(temp_location) != 0:
        print mod_user, temp_location
        final_scores[mod_user] = jaccard_coefficient(my_location, temp_location)

final_users = [(count, user_name) for user_name, count in final_scores.items()]
final_users.sort()
final_users.reverse()

print final_users
SemanticFiltering2.py

from TwitterSearch import *
import pprint
import twitter
import re
from datetime import datetime

p = pprint.PrettyPrinter(indent=4)

# Target user for whom we generate recommendations
DEFAULT_USER = "abhishek_kr7"

def jaccard_coefficient(l1, l2):
    # Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|
    return float(len(l1.intersection(l2))) / float(len(l1.union(l2)))

a = TwitterSearch()
api = twitter.Api(username='lifeinhex', password='$$$$$$$$$$$$')

friends = api.GetFriends(DEFAULT_USER)
friend_names = [friend.screen_name for friend in friends]

# Collect every user our friends have replied to (@-mentions in their tweets).
users = []
for friend in friends:
    try:
        print friend
    except:
        print "Error"
        break
    statuses = api.GetUserTimeline(friend.screen_name)
    for status in statuses:
        temp_users = re.findall("@[A-Za-z]*", status.text)
        for temp_user in temp_users:
            if len(temp_user[1:]) != 0:
                users.append(temp_user[1:])  # strip the leading '@'

# Score each candidate by the recency of the 'created_at' timestamp
# returned by get_information (newer = higher score); skip existing friends.
user_scores = {}
for user in set(users):
    try:
        if user not in friend_names:
            days = (datetime.now() - datetime.strptime(
                get_information(user)['created_at'],
                "%a %b %d %H:%M:%S +0000 %Y")).days
            user_scores[user] = 1.0 / float(days + 2)
            print user
    except:
        print "Error"

# Sort by score, highest first.
results_users = [(count, user_name) for user_name, count in user_scores.items()]
results_users.sort()
results_users.reverse()

print results_users

# Keep at most 50 candidates who are not already friends.
modified_users = []
for count, user_name in results_users:
    if user_name in friend_names:
        continue
    modified_users.append(user_name)
    if len(modified_users) == 50:
        break

print "Modified users = ", modified_users

if DEFAULT_USER in modified_users:
    modified_users.remove(DEFAULT_USER)

# Split the target user's location field into a set of words
# (split on commas and whitespace so every word is matched).
my_location = set()
temp_loc = get_information(DEFAULT_USER)['location']
if temp_loc is not None:
    my_location = set(re.split(r"[,\s]+", temp_loc))
    print my_location

# Re-rank candidates by word overlap between location fields.
final_scores = {}
for mod_user in modified_users:
    temp_loc = get_information(mod_user)['location']
    if temp_loc is None:
        continue
    temp_location = set(re.split(r"[,\s]+", temp_loc))
    if len(temp_location) != 0:
        print mod_user, temp_location
        final_scores[mod_user] = jaccard_coefficient(my_location, temp_location)

final_users = [(count, user_name) for user_name, count in final_scores.items()]
final_users.sort()
final_users.reverse()

print "After location = ", final_users
