
Twitter Sentiment Analysis (NLP)



ABSTRACT

• Today’s world is known as the data world: a large volume of data is generated every day, and
the demand for practical, usable data keeps growing. Industries and organizations need this
valuable data to make a profit. Twitter is one of the most important social media platforms,
generating millions of tweets in a single day. This vast number is explained by Twitter’s more
than a hundred million active users, and by the fact that almost every prominent personality
around the globe, including the prime minister of a country, is active on Twitter, which makes
the analysis of Twitter data even more critical. People’s opinions matter a lot; we have
already seen how a social media platform can even affect a country’s election.
• The impact of this data is deeply rooted in the community and can affect people’s day-to-day
lives. This is where sentiment analysis comes into the picture. Sentiment analysis plays a
significant role in product analysis: with its help, an organization can predict how its
product will perform. Sentiment analysis is widely applied to voice-of-the-customer materials
such as reviews and survey responses, online and social media, and healthcare materials, for
applications that range from marketing to customer service to clinical medicine.
• In this project, we will apply various NLP techniques, such as data scraping, data cleaning
and exploration, tokenization, POS tagging, sentiment analysis, and data visualization, to our
data set. The tools used to perform these tasks include Tweepy, Pandas, NLTK, and Plotly,
among others. The primary aim is to provide a method for analyzing sentiment in noisy Twitter
data, classifying users’ perceptions, as expressed in their tweets, as positive, negative, or
neutral.
Tools and Workflow

• The tools used include Tweepy (for mining tweets), Pandas (for data
cleaning/wrangling), Tweet Preprocessor (for rapid tweet
cleaning), NLTK (for tokenization, stopword removal and POS
tagging), and Plotly, Matplotlib and Word Cloud (for visualization).

Explanation in Brief

• An NLP Data Science project to find out how people feel about 2021.
• After 2020 turned out to be a disaster, we’ve all been looking forward to 2021 with hope. I
decided to perform a Twitter Sentiment Analysis to find out if the new year is treating us well.
• With this project I wanted to get familiar with Natural Language Processing (NLP) techniques
and answer the following questions:
• What are the most common words people use to describe 2021?
• What is the number of tweets with positive, negative and neutral sentiment?
• What are the most common words used in positive, neutral and negative tweets?
• What are the most liked and retweeted posts?
Tweets Mining

• I used the Python library Tweepy to build a tweets dataset from
scratch. Tweepy works with the Twitter API, and in order to use it, you
need to create a Twitter developer account. That is how you get the
unique consumer key and access tokens that you need to access
Twitter data. Tweepy is a pretty straightforward library to use;
however, scraping tweets can be slow due to the mining limits
which Twitter imposes through their API. That is why Tweepy
provides the “wait_on_rate_limit” option, which puts the
scraper to sleep when the rate limit has been reached.
Data Cleaning/Exploration

• In this case, data cleaning was a short step. Firstly, I checked the
shape of the data and the name of the columns. I dropped a column,
which contained a second index (created during merging).
Secondly, I checked for any duplicates with the column “id” as a
subset and dropped all of them. Lastly, I checked for NaN values
and since ‘Location’ had all missing values I decided to drop that
column. After a bit of research, I found out that Twitter disabled
automatic location tagging from their users. That’s why the
location data is missing for most of the tweets.
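The cleaning steps above can be sketched on a toy DataFrame. The column names (“Unnamed: 0” for the leftover merge index, “id”, “Location”) mirror the description but are assumptions about the actual dataset.

```python
# Minimal sketch of the cleaning steps: inspect shape/columns, drop the
# stray index column, dedupe on "id", drop the all-NaN "Location" column.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],            # second index created during merging
    "id": [101, 101, 102],              # note the duplicate id
    "text": ["happy 2021!", "happy 2021!", "2021 so far..."],
    "Location": [np.nan, np.nan, np.nan],
})

print(df.shape, list(df.columns))        # inspect shape and column names
df = df.drop(columns=["Unnamed: 0"])     # drop the leftover index column
df = df.drop_duplicates(subset="id")     # dedupe with "id" as the subset
df = df.dropna(axis=1, how="all")        # "Location" is all-NaN, so it goes
print(df.shape)                          # (2, 2)
```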
Tweets Processing (Tokenization/POS tagging)

• I decided to perform sentiment analysis with VADER. VADER gives
better results when it has access to punctuation and emojis, so I kept
that information during cleaning.
• I found an interesting library called tweet-preprocessor, which
makes it easy to remove hashtags, mentions, links, etc. from a tweet.
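tweet-preprocessor does this kind of cleaning in a single `p.clean()` call; as an illustration of what that step removes, here is a minimal standard-library sketch (the regex patterns are a rough approximation, not the library's actual rules).

```python
# Rough stdlib equivalent of cleaning a tweet: strip links, mentions
# and hashtags, then collapse the leftover whitespace.
import re

def clean_tweet(text):
    text = re.sub(r"https?://\S+", "", text)    # strip links
    text = re.sub(r"@\w+", "", text)            # strip mentions
    text = re.sub(r"#\w+", "", text)            # strip hashtags
    return " ".join(text.split())               # collapse whitespace

print(clean_tweet("@friend 2021 is looking up! #hope https://t.co/abc"))
# → "2021 is looking up!"
```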
• To find out which adjectives people used to describe 2021, I
decided to run Part-of-Speech (POS) tagging. As the name
suggests, POS tagging is an NLP technique that allows you to tag each
word with the part of speech it belongs to (for example, “JJ” for
adjectives).
• In order to perform POS tagging, I used NLTK. However, before
your text is ready to be tagged, you need to perform
tokenization, another NLP step, which separates a piece of text
into smaller units called tokens.
• After tagging Parts of Speech, I used a loop to get a list of all
adjectives.
Sentiment Analysis

• VADER (Valence Aware Dictionary and sEntiment Reasoner) is a pre-
trained model that uses rule-based values tuned to sentiment in social
media. It is not 100% accurate and cannot understand sarcasm or
irony, so it is important to keep in mind that the results can be
skewed. However, VADER is intelligent enough to extract meaning
from negations (“not good”), repeated punctuation (“!!!”) and emojis.
• To perform Twitter sentiment analysis with VADER, we first need to call
“SentimentIntensityAnalyzer()”. VADER returns
“polarity_scores”, which give us numerical values for negative,
neutral, and positive word use. The compound value reflects the
overall sentiment, ranging from -1 (very negative) to +1 (very
positive).
• I created a loop that goes through the text of the tweets, gets the compound
score and translates it into ‘negative’, ‘positive’ and ‘neutral’ sentiment
labels.
• To present the proportion of tweets per sentiment I created a bar chart
with Plotly.
Data Visualisation

• The last step of my analysis was to visualise the most commonly
used words. I used Python’s built-in collections module to count
the frequency of the 100 most used adjectives.
• Secondly, I created a line plot
with Matplotlib.
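The counting and plotting steps can be sketched as below; the adjective list is a toy stand-in for the real one, and the headless backend is just so the script runs without a display.

```python
# Count adjective frequencies with collections.Counter, then plot the
# top words as a line plot with Matplotlib.
import collections
import matplotlib
matplotlib.use("Agg")           # headless backend for scripts
import matplotlib.pyplot as plt

adjectives = ["good", "new", "bad", "good", "last", "good", "new"]
top = collections.Counter(adjectives).most_common(3)   # [(word, count), ...]

words, freqs = zip(*top)
plt.plot(words, freqs)
plt.xlabel("Adjective")
plt.ylabel("Frequency")
plt.title("Most common adjectives")
plt.savefig("adjective_frequency.png")
```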
• Another cool way of
displaying the word
frequency is Word
Cloud. You can easily
style your word cloud
and create different
shapes and colour
schemes.
To display the most frequently used
words for each sentiment I used a
Sunburst Chart from Plotly.
Lastly, to display the most liked and retweeted tweets, I created a table:
Summary

With this project we learnt the following insights:

• “BEST”, “LAST”, “GOOD”, “NEW” and “BAD” are the top 5 words used to
describe 2021.
• The majority of tweets had a positive sentiment (12.55K), followed by
neutral (5,659) and negative (3,432) sentiment.
• The most common word in positive tweets was “good”, the word “last” in
negative tweets and “new” in neutral tweets.
• The most popular tweet was retweeted 2,717 times and the most liked
tweet received 22,498 likes!
References

• https://github.com/cjhutto/vaderSentiment/blob/master/README.rst
• https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk
• https://docs.tweepy.org/en/latest/
• https://developer.twitter.com/en
• https://github.com/s/preprocessor
