


An introduction to Twitter Data Analysis in Python
Vivek Wisdom, Rajat Gupta
Artigence Inc
{Vivek, Rajat}@artigence.in

ABSTRACT

In this introductory paper, we explain the process of storing, preparing and analyzing Twitter streaming data, and then we examine the methods and tools available in the Python programming language to visualize the analyzed data. We believe that using social networks and microblogs for efficient analysis of massive real-time data about products, services, and global events is crucial for better decisions. Twitter's popularity as an information source has led to the development of applications and research in various domains. Humanitarian Assistance and Disaster Relief is one domain where information from Twitter is used to provide situational awareness in a crisis situation. Managerial decisions, stock prediction [4], and improving traffic prediction [1] are some examples of use cases that have become favorites among researchers.

Keywords—Twitter; Data Analysis; Python

1. INTRODUCTION

Social networks and microblogging sites have become an unparalleled source of unstructured data. This data is enormous in quantity and also in terms of the useful information it can provide if we process it effectively. This is due to the nature of microblogs, on which people post real-time messages about their opinions on a variety of topics, discuss current issues, complain, and express their sentiment about products they use in daily life. In fact, many companies have started analyzing such massive amounts of important data to get a sense of the general sentiment for their products and/or services. Many times, proactive companies study user reactions and reply to users on social microblogs. This process provides an on-the-spot solution to improve the user experience, but at large scale it is a very painful and time-consuming task. So the challenge here is to build solutions which analyze the sources of data coming from various microblogging and social networks in order to make important long-term product design and service implementation decisions.

In this paper, we take one such social network, Twitter, to analyze and visualize various important metrics related to an event, product or service.

2. COLLECTING DATA

To be able to access Twitter data programmatically, we need to create and register an app on the Twitter developers website for authentication; thereafter we can access data by using the Twitter API.

2.1. Registering App

To register the Twitter app, we need to create a new app at https://apps.twitter.com/. On registering the app we receive a consumer_key and a consumer_secret. Next, from the configuration page of the app, we get an access_token and an access_token_secret, which will be used to access Twitter on behalf of our application. We must keep these authentication tokens private, as they can be misused. Best practice is to create a separate config file and keep these tokens there, as sketched below.
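One possible layout is a small module that the main script imports, so the tokens never appear in code that is shared or published. This is a minimal sketch; the file name config.py and the variable names are our own choice, not mandated by Twitter.

# config.py -- keep this file out of version control (e.g. list it in .gitignore)
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

The main script can then simply use import config and refer to config.consumer_key, and so on.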
2.2. Accessing Data

Twitter provides REST APIs to connect with its service. We will use a Python library called Tweepy to access the Twitter REST APIs. It provides wrapper methods to easily access the Twitter REST API.

To install Tweepy we can use the command below:

pip install tweepy

In order to authorize our app to access Twitter on our behalf, we need to use the OAuth interface. The code below uses the Tweepy OAuthHandler method and our configuration tokens to get access to Twitter. [5]

import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

# Authenticate with OAuth and build the API entry point
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

(Python code to get twitter access)
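As a quick sanity check (our addition, not a step shown in the paper), Tweepy can confirm that the tokens are valid before any data collection starts; in the Tweepy versions current at the time of writing, verify_credentials() returns the authenticated user on success:

user = api.verify_credentials()
if user:
    # The tokens are valid and we know which account we act on behalf of
    print('Authenticated as @' + user.screen_name)
else:
    print('Authentication failed, check the tokens')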
The api variable is now our entry point for most of the operations we can perform with Twitter.

2.3. Storing Data

Now we will access all the tweet data from a personal profile and store it in a JSON file to use in our analysis steps. The Tweepy library provides a simple cursor interface to iterate through all the tweets and store them in a JSON file.

We will use the code below to store our user timeline tweet data in a my_tweets.json file. [5]

import json

def process_or_store(tweet):
    # Append each tweet as one JSON document per line
    with open('data/my_tweets.json', 'a') as mt:
        mt.write(json.dumps(tweet) + '\n')

for tweet in tweepy.Cursor(api.user_timeline).items():
    process_or_store(tweet._json)

(Python code to read user timeline)
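The analysis snippets in the following sections assume the stored file is read back one tweet at a time. A minimal sketch of that reading loop (our addition, since the paper does not show it explicitly) could look like this:

import json

# Each line of my_tweets.json holds one tweet as a JSON document
with open('data/my_tweets.json', 'r') as mt:
    for line in mt:
        tweet = json.loads(line)
        # ... run the per-tweet analysis of Section 4 here ...
        print(tweet['text'])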
3. PREPARING DATA

Before we begin to analyze the Twitter data, it is important to understand the structure of the tweet as well as to pre-process the data to remove non-useful terms called stop-words. Preprocessing is a very important step in data analysis. Preprocessing, in simple terms, means taking in the data and preparing it for optimal output considering our requirements.

So, to preprocess the tweet data we need to understand the structure of a single tweet and analyze its different parts.

3.1. The Anatomy of the Tweet

Tweets are short messages, restricted to 140 characters in length. Due to the nature of this microblogging service (quick and short messages), people use acronyms, make spelling mistakes, and use emoticons and other characters that express special meanings. Following is a brief terminology associated with tweets. Emoticons: these are facial expressions pictorially represented using punctuation and letters; they express the user's mood. Target: users of Twitter use the "@" symbol to refer to other users on the microblog. Referring to other users in this manner automatically alerts them. Hashtags: users usually use hashtags to mark topics. This is primarily done to increase the visibility of their tweets.

A single tweet contains a lot of information related to the user, the text of the tweet, the creation date of the tweet, the location of the tweet and many more fields. We will use some of these fields to complete the analysis.

The key fields of a single tweet are the following:

text: text of the tweet,
lang: acronym of the tweet language like 'en',
created_date: date of creation of the tweet,
favorite_count: number of favorites of the tweet,
retweet_count: retweets of the tweet,
place, geolocation: location information, if available,
user: the full profile of the user,
entities: list of entities like URLs, @mentions, #hashtags

We can imagine how these data already allow for some interesting analysis: we can check who is most favorited/retweeted, who is discussing with whom, what the most popular hashtags are, and so on. But the most important field we are looking for is the content of the tweet, which is represented by the field text. The next step is to tokenize the tweet text fields.

3.2. Tokenizing the Tweet

Tokenization is one of the most basic, yet most important, steps in text analysis. The purpose of tokenization is to split a stream of text into smaller units called tokens, usually words or phrases. We will use the Python NLTK library to tokenize the tweets. Even the NLTK library needs some preprocessing steps to correctly tokenize @mentions and #hashtags. We use regular expressions to provide exceptions for mentions and hashtags, as in the sketch below.

Tokenization prepares the text for the next step, which is removing stop-words like 'the', 'or', 'to', 'and' etc.
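A minimal sketch of such a regex-based preprocess() function is given below; the exact patterns are our own illustration, since the paper only states that regular expressions are used. Mentions, hashtags and URLs are matched as whole tokens before the plain-word fallback:

import re

# Match @mentions, #hashtags and URLs as single tokens, then plain words
tokens_re = re.compile(r"""
    (?:@[\w_]+)             # @-mentions
    | (?:\#+[\w_]+)         # hashtags
    | (?:https?://\S+)      # URLs
    | (?:[\w_]+)            # other words
""", re.VERBOSE)

def preprocess(text):
    # Return the list of tokens found in one tweet's text
    return tokens_re.findall(text)

For example, preprocess('Good morning @world #python http://example.com') yields ['Good', 'morning', '@world', '#python', 'http://example.com'], keeping mentions and hashtags intact instead of splitting them at the punctuation.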
3.3. Removing Stop-Words

Stop-word removal is one important step that should be considered during the pre-processing stages. Stop-words are the most popular and common words of any language. While their use in the language is crucial, they don't usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. Some libraries provide default stop-words for different languages. The NLTK library provides default stop-words for the English language.

An array of custom stop-words included in this analysis is as below:

['The', 'what', 'What', 'You', 'Your', 'A', 'new', 'https', 'Hi', 'We', 'My', 'Now', 'please', 'get']

As we can see in the above array of stop-words, they don't add any information in a data analysis where we are finding term frequencies. These words are heavily used in the English language, and without removing them we would see such terms among the most frequently occurring ones.

Given the nature of our data and our tokenization, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.
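The later snippets filter tokens against a variable named stop. A minimal sketch of how that list can be assembled from NLTK's defaults, punctuation, the Twitter-specific terms and the custom words above (our reconstruction, since the paper does not show this step) is:

import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the NLTK stop-word corpus

custom = ['The', 'what', 'What', 'You', 'Your', 'A', 'new', 'https', 'Hi',
          'We', 'My', 'Now', 'please', 'get']

# Default English stop-words + punctuation + Twitter-specific terms + custom list
stop = stopwords.words('english') + list(string.punctuation) + ['RT', 'via'] + custom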
4. ANALYZING THE DATA

After the text data is preprocessed and tokenized, we can proceed with different analysis objectives.

4.1. Term Frequencies

Counting the frequencies of terms is one of the simplest steps in Twitter data analysis. By this, we can analyze what a particular user frequently tweets about. One use case of term frequencies is that advertisement companies can provide targeted ads based on the user's term frequencies. It is then more likely that the user clicks on or visits the promoted website.

The code below is an example of collecting all the terms whose frequencies we want to count:

terms_only = [term for term in preprocess(tweet['text']) if term not in stop and not term.startswith(('@', '#'))]

In the above code snippet we are listing out all the terms in the preprocessed tweet text, provided the term is not in the stop-words array stop and it doesn't start with @ or #. We can then use the Python collection Counter() to count the occurrences of the terms and list them alongside their counts, as sketched at the end of this subsection.

Here are the 20 most used terms on my personal user timeline. [5]

[('India', 67), ('world', 38), ('work', 30), ('make', 30), ('©', 27), ('Photography', 27), ('future', 26), ('Nice', 24), ('visit', 24), ('going', 24), ('time', 24), ('This', 24), ('twitter', 23), ('great', 22), ('Do', 22), ('win', 22), ('final', 22), ("India's", 21), ('PM', 20), ('next', 20)]

If we look at the above result, the tokenizer we have used is also not perfect, as the © copyright character is still present in the tokens. But with a number of iterations, or by using a powerful stop-words library, we can develop a tokenizer that is close to perfect.
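As a minimal sketch of the counting step mentioned above (collections.Counter is the standard-library class behind it, and the snippet reuses the preprocess() and stop sketches from Section 3), the per-tweet term lists can be accumulated and the top 20 printed like this:

import json
from collections import Counter

count_all = Counter()
with open('data/my_tweets.json', 'r') as mt:
    for line in mt:
        tweet = json.loads(line)
        terms_only = [term for term in preprocess(tweet['text'])
                      if term not in stop and not term.startswith(('@', '#'))]
        count_all.update(terms_only)  # accumulate term frequencies

print(count_all.most_common(20))  # the 20 most frequent terms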
4.2. Bigram Terms

Bigram terms are those terms which frequently occur together. The NLTK library provides a method to find bigrams in tokenized data. The bigrams() method from NLTK takes a list of tokens and produces a list of tuples of adjacent tokens.

Bigrams provide more meaning than a single term; they can be crucial in finding useful information. The code below can generate bigrams from a list of terms:

from nltk import bigrams

terms_bigrams = bigrams(terms_only)

So, passing our tokenized data after removing stop-words and some other commonly known words, we get the following 10 most frequently used bigrams:

[(('Photography', '©'), 27), (('Vision', '2020'), 12), (('Vivek', 'Wisdom'), 12), (('India', 'Vision'), 12), (('transfer', 'handle'), 12), (("Google's", 'CEO'), 8), (('Sundar', 'Pichai'), 8), (('Alibaba', 'Singles'), 8), (('really', 'deserve'), 8), (('Sultanpur', 'really'), 8)]

This clearly shows the terms most frequently used together on my timeline.

4.3. Most Used Hashtags

Hashtags are one of the most frequently used features of Twitter. They are used to represent the most recent happenings in the world. Using hashtags effectively, we can find many useful pieces of information, such as the number of tweets with particular hashtags, which is commonly used to compare the Twitter battle between teams in most sports.

In the current dataset, i.e. my_tweets.json, we will try to find the most used #hashtags. The sample code below can collect all the hashtags used in the tweet dataset:

terms_hash = [term for term in preprocess(tweet['text']) if term.startswith('#')]

Below are the most used hashtags on my user timeline tweet data:

[('#ibmsmartcamp', 32), ('#', 12), ('#MakeItHappen', 12), ('#IndvsPak', 10), ('#ShangriLaExperience', 6), ('#EarthDay', 6), ('#Scala', 5), ('#MissingOut', 6), ('#CRUD', 5), ('#AllEnglandSSP', 8), ('#IBMSmartCamp', 8), ('#SpotTheMissing', 6), ('#entrepreneur', 6), ('#SuperGenes', 5), ('#BillionsinChange', 5)]

If we notice, there is a lone '#' present in my tweets 12 times, which has to be ignored as it doesn't convey any information. We can correct such mistakes iteratively or by employing more sophisticated tokenizers.

4.4. Most Used Mentions

Mentions in tweets are used to refer to someone who is present on Twitter, using '@username'. They are also used to reply to someone or to notify multiple people about some information. By analyzing mentions, we can find the most frequent contacts of a user and then suggest new followers to them based on the followers of those frequent contacts.

In the current dataset, i.e. my_tweets.json, we will try to find the most used @mentions. To find all the mentions in the tweet dataset, we can search for all the words which start with the '@' symbol, as in the code below:

terms_mentions = [term for term in preprocess(tweet['text']) if term.startswith('@')]

Below are the most used mentions on my user timeline tweet data:

[('@narendramodi', 58), ('@Piclogy', 27), ('@google', 24), ('@Artigence', 24), ('@Inc', 16), ('@xprize', 16), ('@businessinsider', 14), ('@pranavmistry', 12), ('@IndianDiplomacy', 12), ('@elonmusk', 12), ('@richardbranson', 10), ('@buffer', 10), ('@singularityhub', 10), ('@YouTube', 10), ('@ICC', 10)]
5. ANALYZING STREAMING DATA

Analyzing streaming data is very important, as it allows us to make real-time decisions on the basis of real-time data. Streaming data can be large in terms of the volume being generated, so analysis of streaming data requires heavy computation resources.

In this paper, we have analyzed the #DeadlineDay hashtag. #DeadlineDay refers to the last day of the transfer window, the period during the year in which a football club can transfer players from other clubs into its playing staff. Such a transfer is completed by registering the player with the new club through FIFA.

In streaming data analysis, we will apply the different analysis use cases to data which is being generated live, collected as sketched below.
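The paper does not reproduce its collection code, but with the Tweepy versions current at the time of writing a minimal sketch of streaming tweets for a hashtag into a file looks like this (the listener class name SaveListener and the file name deadlineday.json are our choices):

from tweepy import Stream
from tweepy.streaming import StreamListener

class SaveListener(StreamListener):
    # Called once for every tweet delivered by the streaming API
    def on_data(self, data):
        with open('data/deadlineday.json', 'a') as f:
            f.write(data)  # data is already a JSON-encoded tweet
        return True

    def on_error(self, status):
        print(status)
        return False  # stop streaming on errors such as rate limiting

# auth is the OAuthHandler built in Section 2.2
stream = Stream(auth, SaveListener())
stream.filter(track=['#DeadlineDay'])

The stored file has the same one-JSON-document-per-line format as my_tweets.json, so the analysis code of Section 4 applies unchanged.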
5.1. Term Occurrences

For the #DeadlineDay streaming data, we have analyzed the most frequent term occurrences, shown below. This shows that David Luiz is the most tweeted-about subject in approximately 5000 live tweets.

[('David', 578), ('Luiz', 517), ('loan', 433), ('Heathrow', 380), ('spotted', 378), ('Airport', 360), ('Roma', 264), ('Chelsea', 245), ('Arsenal', 243), ('…', 228), ('City', 214), ('Wilshere', 210), ('move', 204), ('season-long', 188), ('deal', 177)]

5.2. Bigram Terms

The bigrams in this #DeadlineDay data show that the terms David and Luiz have been used together most frequently, as they form the complete name of the footballer from Brazil.

[(('David', 'Luiz'), 561), (('Heathrow', 'Airport'), 390), (('Luiz', 'spotted'), 388), (('spotted', 'Heathrow'), 386), (('season-long', 'loan'), 220), (('Airport', 'https://t.co/fbU4drhfRB'), 151), (('Samir', 'Nasri'), 145), (('think', 'might'), 104), (('Quite', 'welcoming'), 104), (('Jack', 'Wilshere'), 104), (('party', 'someone'), 104), (('someone', 'Do'), 104), (('Do', 'think'), 104), (('welcoming', 'party'), 104)]

5.3. Most Used Hashtags

Out of 6500 live tweets, below are the most used hashtags for #DeadlineDay.

[('#DeadlineDay', 5163), ('#MUFC', 233), ('#CFC', 151), ('#AFC', 143), ('#deadlineday', 116), ('#SSNHQ', 106), ('#', 105), ('#CPFC', 94), ('#LFC', 81), ('#TIMYHelloGreece', 75), ('#GirlsInTheHouse3', 72), ('#FelizMiercoles', 62), ('#UCL', 58), ('#PSG', 48)]

5.4. Most Used Mentions

Out of 10000 live tweets, below are the most used mentions for #DeadlineDay. As we can see, '@SkySportsNewsHQ' is the most active Twitter account for this dataset.

[('@SkySportsNewsHQ', 445), ('@TransferRelated', 262), ('@TransferSources', 249), ('@bbcsport_david', 225), ('@br_uk', 175), ('@ChelseaFC', 145), ('@SkyFootball', 145), ('@SamNasri19', 142), ('@JackWilshere', 139), ('@ManCity', 136), ('@LeeClayton_', 136), ('@NCLairport', 135), ('@DeadlineDayLive', 132), ('@3FLnQe', 131), ('@CPFC', 127)]

6. VISUALIZATIONS

Visualizations help represent the analyzed data so that decisions can be made effectively. Here, for #DeadlineDay, we display a map of the users tweeting about deadline day.

(Image) Display of Tweet locations for #DeadlineDay

As we can see in the above map, the red dots show that #DeadlineDay is most heavily used in the UK and Europe.

To display the map, we have used GeoJSON to tag the different tweet locations on a map provided by the Leaflet map library, as sketched below.
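A minimal sketch of that conversion step (our reconstruction; the field names follow the Twitter tweet format, and the output file name is our choice) builds a GeoJSON FeatureCollection from the tweets that carry coordinates, which Leaflet can then render as markers:

import json

features = []
with open('data/deadlineday.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        if tweet.get('coordinates'):  # only a minority of tweets are geotagged
            features.append({
                'type': 'Feature',
                'geometry': tweet['coordinates'],  # already a GeoJSON Point [lon, lat]
                'properties': {'text': tweet['text']},
            })

# Leaflet can load this FeatureCollection, e.g. with L.geoJSON(...)
with open('deadlineday.geo.json', 'w') as out:
    json.dump({'type': 'FeatureCollection', 'features': features}, out)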
7. CONCLUSION

In this paper, we started with the very basics of Twitter data analysis. We explained Twitter app authentication using OAuth and Tweepy. Then we explained the steps to collect historical data as well as streaming data. We then preprocessed the data using tokenizers.

In the final step, we executed a number of use cases to analyze the stored data. We presented the results of analyzing the most used terms for a dataset, the most used hashtags, and the most used mentions of user accounts on Twitter, and we also presented the bigrams, i.e. pairs of terms used frequently together in our dataset.

Then we created a map visualization of users tweeting a particular hashtag, #DeadlineDay, and we found that it has been most actively tweeted from Europe.

This paper is introductory in nature and hence deals with the basics of Twitter data analysis using Python. In future work, we will try to present more advanced data analysis patterns, like sentiment analysis and decision making, with more accurate results.
REFERENCES

1. J. He, W. Shen, P. Divakaruni, L. Wynter, R. Lawrence, "Improving Traffic Prediction with Tweet Semantics", Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 1387–1393, August 3-9, 2013.
2. A. Agarwal, B. Xie, I. Vovsha, O. Rambow, R. Passonneau, "Sentiment Analysis of Twitter Data", in Proceedings of the Workshop on Language in Social Media, ACL, 2011.
3. S. Kumar, F. Morstatter, H. Liu, "Twitter Data Analytics", Springer, 2013.
4. A. Mittal, A. Goel, "Stock Prediction Using Twitter Sentiment Analysis", Stanford University, 2011.
5. D. Ediger, K. Jiang, J. Riedy, D. A. Bader, "Massive Social Network Analysis: Mining Twitter for Social Good", 39th International Conference on Parallel Processing, 2010, pp. 583-593.
6. https://github.com/vivekwisdom/TwitterAnalysisApp, code repository of the sample application.

