PYTHON – FINAL PROJECT (TWITTER ANALYSIS)

From:
Ahmed Khan (22230011)
Shehzad Muslim (22230014)
Ashaan Khan (22230007)
Syeeda Rimshah Bukhary (22230013)
Submitted To: Ussama Yaqub
CONTENTS
SECTION-I
    Purpose
    Hypothesis Development
    Data Cleaning
    Sentiment Analysis
    Data Analysis
        Impact of ‘YouTube Blockage’ on Tweet Counts of Verified and Un-Verified Users
        Impact of ‘YouTube Blockage’ on Tweet Sentiments of Verified and Un-Verified Users
    Discussion
    Conclusion
SECTION-II
    Topic Modeling
    Latent Dirichlet Allocation (LDA)
    Steps in LDA Implementation
    Preparing Data for LDA Analysis
        Step A: Cleaning and slicing the table to keep the cleaned text in a single column for LDA
        Step B: Transforming the text data into a format for training the LDA model (tokenization and stop-word removal)
        Step C: Converting the tokenized object into a corpus and dictionary
    LDA Model Training
    Analyzing LDA Model Results
        Conclusion on Topic 1
SECTION-III
    Supervised Machine Learning
    Conceptualization of Model Task
    Text Preparation (Cleaning & Preprocessing)
    Text Exploration
    N-Grams Sequence (Bigram)
    Model Selection & Training
    Result Analysis & Interpretation
REFERENCES
Appendix
SECTION-I

PURPOSE
This study employs sentiment analysis and text-mining techniques on Twitter data to investigate
changes in the tweets of both verified and unverified users during the YouTube blockage on
September 6th, 2022.

HYPOTHESIS DEVELOPMENT
In this part of the paper, we formulate the research question for our study. The event under
examination is the blocking of YouTube on September 6th, 2022. The blockage occurred because
Imran Khan was about to hold a Jalsa in Peshawar, and many news channels were about to
broadcast it live on YouTube (Figure 1). Ahead of the event, the government blocked YouTube,
and the blockage became a major topic of discussion on Twitter (Figure 2). Given the
governmental influence involved, our objective was to examine the commonalities and
disparities between prominent figures and the general public.

Hence, we propose the following research question:

RQ. Is there a discernible difference in tweet sentiment and tweet count between verified and
unverified users during the YouTube blockage period (14:00–19:00)? (Figure 3)

To answer this research question, two hypotheses should be met:

1. There should be a difference between the words used in tweets by verified and unverified
users before 14:00.
2. There should be no difference between the words used in tweets by verified and unverified
users during 14:00–19:00.

DATA CLEANING
Information extracted from Twitter underwent preprocessing for sentiment analysis, as well as
for generating word clouds and conducting word frequency analysis. During this preprocessing
phase, we eliminated URLs, sanitized tags, email addresses, and emojis from the collected text
data. Additionally, common stop words like "a," "is," and "are" were excluded from all tweets
using the standard stop-word list available in the Python word cloud library.
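A minimal sketch of this cleaning step is given below. It is an illustration rather than the exact code used in the project: the regular expressions are our own, and only the stop-word list (taken from the word cloud library, as described above) is drawn from the report.

import re
from wordcloud import STOPWORDS  # standard stop-word list of the word cloud library

def clean_tweet(text):
    """Strip URLs, tags, e-mail addresses, emojis and stop words from one tweet."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)      # URLs
    text = re.sub(r"@\w+|#", " ", text)                # @tags and hash signs
    text = re.sub(r"\S+@\S+", " ", text)               # e-mail addresses
    text = text.encode("ascii", "ignore").decode()     # emojis / non-ASCII symbols
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()   # punctuation and digits
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_tweet("YouTube blocked again! https://t.co/xyz @PTAofficial"))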

SENTIMENT ANALYSIS
In our sentiment analysis process, we employ the VADER Python library. VADER functions by
assigning four distinct scores to each tweet, namely: positive, negative, neutral, and compound.
In the scope of this study, our emphasis is solely on the compound score. The scale of this score
ranges from -1 to +1, wherein a score approximating +1 denotes profoundly positive sentiment,
whereas a score nearing -1 signifies considerably negative sentiment. Within our analysis, we
leverage this compound score to categorize tweets within our dataset as either exhibiting positive
or negative sentiment.
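A brief sketch of how the compound score can be obtained and used for labelling is shown below. The ±0.05 cut-off is the commonly used VADER convention and is an assumption on our part, not a value taken from the report.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label_sentiment(tweet, threshold=0.05):
    """Classify a tweet using VADER's compound score in [-1, +1]."""
    compound = analyzer.polarity_scores(tweet)["compound"]
    if compound >= threshold:
        return "positive", compound
    if compound <= -threshold:
        return "negative", compound
    return "neutral", compound

print(label_sentiment("YouTube is blocked and people are furious"))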

DATA ANALYSIS
In this research, we employed tweets from Pakistan and categorized them into groups according
to account verification status: Verified and Un-Verified users. These tweets, from both verified
and un-verified users, were subsequently segmented into subgroups based on time periods:
before 2:00 PM and between 2:00 PM and 7:00 PM (Figure 4).
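A minimal sketch of this segmentation is shown below; the column names ('created_at', 'verified') and the sample rows are hypothetical and only illustrate the grouping logic.

import pandas as pd

df = pd.DataFrame({
    "created_at": pd.to_datetime(["2022-09-06 12:30",
                                  "2022-09-06 15:10",
                                  "2022-09-06 16:45"]),
    "verified": [True, False, False],
})

def time_window(ts):
    """Assign each tweet to a window around the blockage (14:00-19:00)."""
    if ts.hour < 14:
        return "before 14:00"
    return "14:00-19:00" if ts.hour < 19 else "after 19:00"

df["window"] = df["created_at"].apply(time_window)
print(df.groupby(["verified", "window"]).size().unstack(fill_value=0))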

The distinction in word usage between verified and unverified users, as depicted in Figure 5,
validates our first hypothesis. Additionally, Figure 6 confirms the fulfillment of our second
hypothesis, as both user groups are actively discussing the YouTube blockage.

IMPACT OF ‘YOUTUBE BLOCKAGE’ ON TWEET COUNTS OF VERIFIED AND UN-VERIFIED USERS
There was a substantial 288% surge in tweet counts among verified users following the YouTube
blockage event. This suggests a higher engagement level among verified users. Unverified users
also experienced a notable increase in tweet counts (40%) after the YouTube blockage event,
albeit to a lesser extent compared to verified users.

IMPACT OF ‘YOUTUBE BLOCKAGE’ ON TWEET SENTIMENTS OF VERIFIED AND UN-VERIFIED USERS
Verified users predominantly adopted a neutral tone (44.7%) in their tweets during this period,
compared to unverified users, who leaned more towards negative sentiments (41.5%).
DISCUSSION
The substantial surge of 288% in tweet counts among verified users following the YouTube
blockage event can be attributed to their prominence as public figures. These verified users,
being influential individuals with a significant following, likely utilized Twitter as a platform to
express their viewpoints regarding the event. Their choice to maintain a neutral tone in their
tweets reflects the sensitivity of the issue as a government matter. By adopting a neutral stance,
these prominent figures may aim to convey information and updates while avoiding potential
controversy or bias associated with expressing overtly positive or negative sentiments.

Contrastingly, the public, comprising unverified users, displayed a more open and emotionally
charged response. With a notable increase of 40% in tweet counts, unverified users voiced their
sentiments more openly and candidly. Their tweets, leaning towards negative sentiments at
41.5%, conveyed a sense of dissatisfaction or disagreement with the government's actions,
particularly concerning Imran Khan.

While both groups – prominent verified users and the public – may be conveying similar
viewpoints regarding the government's actions, their approaches differ. Verified users opt for a
measured and neutral tone, recognizing the nuances of the issue and their role as influential
figures. Their focus on increasing their tweet activity might stem from the understanding that
quantity, rather than strictly positive or negative sentiment, contributes to broader engagement on
social media.

In contrast, unverified users exhibit a more emotionally charged response, driven by their
collective sentiments that the government's actions are unfavorable towards Imran Khan. This
group is less concerned with maintaining neutrality, which is often a characteristic of prominent
figures who must balance their public image and influence.
CONCLUSION
In conclusion, the differing tweet patterns and sentiment expressions among verified and
unverified users following the YouTube blockage event underscore the interplay between social
status, messaging strategy, and the desire to engage with an audience. Verified users, as
prominent figures, prioritize neutrality and higher tweet activity to effectively communicate
government matters, while the public expresses more open and direct sentiments about the
perceived wrongs towards Imran Khan, albeit with less emphasis on maintaining a balanced
tone.

SECTION-II

TOPIC MODELING
Topic models are a type of statistical language model used to discover hidden structure in a
collection of texts. Topic modelling uses statistical modelling under unsupervised machine
learning to identify clusters or groups of similar words within a body of text (e.g., health,
doctor, patient, hospital – a topic pertaining to HEALTH CARE).
Topic modelling covers the following activities within a single model:

- Dimensionality Reduction – instead of reviewing the entire text, topic modelling breaks the
text into words and topics with different weights.
- Unsupervised Learning – grouping similarly weighted words and topics under the same cluster,
which defines a topic. The number of topics, like the number of clusters, is a parameter that
must be specified for the model. By doing topic modeling, we build clusters of words rather
than clusters of texts. A text is thus a mixture of all the topics, each having a specific
weight.
- Tagging – assigning abstract “topics” that occur in a collection of documents and best
represent the information in them.
There are several existing algorithms that can be used to perform topic modeling. The most
common are:
- Latent Semantic Analysis (LSA/LSI),
- Probabilistic Latent Semantic Analysis (pLSA), and
- Latent Dirichlet Allocation (LDA)
For the project at hand, we have selected LDA to build the model and visualize the output results.

LATENT DIRICHLET ALLOCATION (LDA)


LDA is a probabilistic model that assumes each topic is a mixture over an underlying set of
words, and each document is a mixture over a set of topics.
Given M documents, N words, and a prior number of topics K, the model trains to output the
distribution of words for each topic and the distribution of topics for each document.
With the help of LDA, we are able to check, using probability distributions:
- Document–topic density
- Topic–word density

STEPS IN LDA IMPLEMENTATION


1. Loading data
2. Data cleaning
3. Exploratory analysis
4. Preparing data for LDA analysis
5. LDA model training
6. Analyzing LDA model results
Steps 1, 2, and 3 have already been performed at the start of data preparation and at the
exploratory level, followed by sentiment analysis and polarity scoring. Therefore, in this
section we proceed from step 4 onward.
Under the LDA analysis, we analyze Twitter trends after 2:00 PM, in line with our hypothesis
(i.e., to evaluate whether the trends of verified and unverified users from 14:00 to 19:00
converge in the same direction, as elaborated in Section I).

PREPARING DATA FOR LDA ANALYSIS

Step A: A simple preprocessing activity was performed to clean and slice the table, retaining
the cleaned text in a single column for further processing under LDA.
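A minimal sketch of Step A is shown below; since the original notebook cell is not reproduced here, the column name ('clean_text') and the sample rows are assumptions used only to illustrate the slicing.

import pandas as pd

df = pd.DataFrame({"clean_text": [
    "youtube blocked ahead of imran khan jalsa peshawar",
    "channels banned from covering the jalsa",
]})

# keep only the cleaned text as a single column for the LDA pipeline
papers = df[["clean_text"]].dropna().reset_index(drop=True)
print(papers.head())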
Output

Step B: Transforming the text data into a format that will serve as input for training the LDA
model. In this step, we tokenize the text and remove stop words.
Output

['graanacom']

Step C: Once the words are tokenized, we convert the tokenized object into a corpus and
dictionary.
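Continuing the same sketch, the dictionary and bag-of-words corpus expected by gensim's LDA implementation can be built as follows.

import gensim.corpora as corpora

id2word = corpora.Dictionary(data_words)                 # token -> integer id mapping
corpus = [id2word.doc2bow(doc) for doc in data_words]    # (token id, count) pairs per tweet
print(corpus[:1])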

Output

[(0, 1)]

LDA MODEL TRAINING


To train the LDA model, we set the number of topics to 10 (the choice of the number of topics
varies from person to person, depending on the data or business problem to be addressed).

Based on this assumption, the code below is used to run and train the model:
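Since the original code cell is not reproduced here, the following is a minimal gensim sketch of the training step, using the corpus and dictionary from the previous steps and the assumed setting of 10 topics.

import gensim

num_topics = 10   # assumed number of topics, as stated above
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

# print the five most heavily weighted words of every topic
for idx, topic in lda_model.print_topics(num_words=5):
    print(f"Topic {idx}: {topic}")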
Output

ANALYZING LDA MODEL RESULTS


After training the model, we visualize the topics for interpretability. The visualization can
also be done with a word cloud, but for better preview and interpretation we have used the
visualization package “pyLDAvis”. This package helps in better understanding (i) individual
topics and (ii) the relationships between the topics.
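A sketch of producing that interactive view with pyLDAvis is given below; the gensim_models module name applies to recent pyLDAvis releases (older versions expose it as pyLDAvis.gensim), and the trained model, corpus, and dictionary are carried over from the sketches above.

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()                          # when running in Colab/Jupyter
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis                                                 # or: pyLDAvis.save_html(vis, "lda.html")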
Output

After clicking cluster 1 (Topic 1), the following display will appear:


For (1), you can manually select each topic to view its top most frequent and/or “relevant” terms,
using different values of the λ parameter. This can help when you’re trying to assign a human
interpretable name or “meaning” to each topic.
For (2), exploring the INTERTOPIC DISTANCE PLOT can help you learn about how topics
relate to each other, including potential higher-level structure between groups of topics.
CONCLUSION ON TOPIC 1: OUR MODEL AGGREGATES ALL WORDS (JALSA, BANNED, IMRAN
KHAN, CHANNEL, PAKISTAN) PERTAINING TO IMRAN KHAN'S JALSA AT PESHAWAR AND
THE BANNING OF ITS MEDIA COVERAGE.
We can click on each topic/cluster to check the frequency of words. In this report, we show the
results of Topic 2 and Topic 9.

On clicking CLUSTER 2 (TOPIC 2), the following display will appear:

Cluster 2 is near cluster 1, with similar kinds of words but in a different context and corpus.

On clicking CLUSTER 9 (TOPIC 9), the following display will appear:


SECTION-III

SUPERVISED MACHINE LEARNING


For the purposes of supervised machine learning, we adopted logistic regression to predict the
target variable, i.e., the sentiment score, on the basis of features obtained through TF-IDF
vectorization, a technique that converts text data into numerical features that machine learning
models can work with.

In the appended section, the stepwise process flow is given:

CONCEPTUALIZATION OF MODEL TASK


Conceptualization of model task requires the following steps:

a. Text Problem Formulation – Compute sentiment scores (structured output) from text
(unstructured input).
b. Text Preparation & Wrangling – This step involves the critical cleansing and preprocessing
tasks necessary to convert streams of unstructured data into a format that is usable by
traditional methods designed for structured inputs.
c. Text Exploration – This step encompasses text visualization along with techniques such as
feature selection and engineering.
The visual explanation of the same is also given below:

TEXT PREPARATION (CLEANING & PREPROCESSING)


In this step, the recommended procedures, such as removal of stop words and punctuation,
removal of emoticons, removal of HTML tags, as well as tokenization of the text, were
conducted. Since the Python code for these steps has already been discussed above, it has not
been reproduced in this section to avoid duplication.

Once tokenization of the text was successfully completed, the sentiment scores were converted
into binary numbers, i.e., 0 and 1. It is pertinent to mention that the dataframe was restricted
to 100,000 rows given system constraints with Google Colab, since the subsequent steps could not
be completed successfully on the original dataframe.
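A minimal sketch of the labelling and truncation steps is shown below; the column name 'compound' for the VADER score, the sample rows, and the zero cut-off are assumptions used only for illustration.

import pandas as pd

df = pd.DataFrame({"clean_text": ["pakistan army zindabad", "youtube banned again"],
                   "compound": [0.64, -0.51]})

df["sentiment"] = (df["compound"] >= 0).astype(int)   # positive = 1, negative = 0
df = df.head(100_000)                                 # cap imposed by Colab constraints
print(df)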

The output file which includes the tokenized text in parallel with the sentiment score is given
below:

TEXT EXPLORATION
Text exploration was conducted as part of the exploratory analysis for the purpose of feature
selection and engineering. The techniques adopted were generating a bag of words and an n-gram
sequence. The input commands along with the output files are given below:

Bag of Words (BoW)

Input & Output Results
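As the notebook output is not reproduced here, the following sketch shows one way to compute bag-of-words counts and the top tokens with scikit-learn's CountVectorizer; the example tweets are hypothetical.

from sklearn.feature_extraction.text import CountVectorizer

tweets = ["pakistan army zindabad",
          "youtube banned in pakistan",
          "jalsa peshawar imran khan pakistan"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(tweets)

# total count of every token across the corpus, highest first
totals = bow.sum(axis=0).A1
top5 = sorted(zip(vectorizer.get_feature_names_out(), totals),
              key=lambda x: x[1], reverse=True)[:5]
print(top5)   # 'pakistan' dominates on a stand-alone basis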

From the above analysis, which also includes the top 5 tokens with the highest bag-of-words
scores, it can be concluded that bag of words sometimes fails to give results from which
meaningful inferences can be made. The token ‘Pakistan’ has the highest frequency; however, it
should be noted that ‘Pakistan’ on a stand-alone basis will not give meaningful inferences to
gauge the sentiment score, since it is more likely to be mentioned in combination with some
other keyword that can help in predicting sentiment scores.

In view of the above, the second technique, i.e., n-gram sequencing, was applied to better
understand the sentiments involved.
N-GRAMS SEQUENCE (BIGRAM)
Input & Output Results
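A corresponding bigram sketch, again with hypothetical tweets, simply changes the ngram_range of the vectorizer.

from sklearn.feature_extraction.text import CountVectorizer

tweets = ["pakistan army zindabad",
          "pakistan army defence day",
          "youtube banned in pakistan"]

bigram_vec = CountVectorizer(ngram_range=(2, 2))      # bigrams only
bigrams = bigram_vec.fit_transform(tweets)

totals = bigrams.sum(axis=0).A1
print(sorted(zip(bigram_vec.get_feature_names_out(), totals),
             key=lambda x: x[1], reverse=True)[:5])
# 'pakistan army' is far more informative than 'pakistan' alone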

To further evidence our point, the n-gram sequence can be better applied to predict sentiment
scores. It is pertinent to mention that the token ‘Pakistan’ was most frequently used in
conjunction with ‘Army’, considering that the concentration of tweets was around September 6th,
2022. Thus, it can be inferred that whenever ‘Pakistan Army’ was mentioned in a tweet, it was
most likely to translate into a positive sentiment, given the importance of Defence Day for
Pakistani nationals.

MODEL SELECTION & TRAINING


The model selected for the purpose of supervised machine learning was Logistic Regression:

Target Variable – The sentiment score, which was converted into binary labels: positive = 1 and
negative = 0.

Features – TF-IDF Vectorization is a technique to convert text data into numerical features that
machine learning models can work with. It considers both the frequency of a word in a particular
document (TF) and the rarity of the word across all documents (IDF). TF-IDF assigns higher
weights to words that are important in a specific document but not too common in the entire
dataset.

The data was split into two sets (80% train and 20% test).

Splitting of Data

TF-IDF Vectorization
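A minimal scikit-learn sketch of these two steps is given below; the variable names (texts, labels) and the sample data are placeholders for the actual tokenized tweets and binary sentiment labels used in the project.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["pakistan army zindabad", "youtube banned again", "jalsa peshawar live",
         "great speech today", "very disappointed with the ban"]
labels = [1, 0, 1, 1, 0]   # 1 = positive, 0 = negative

# 80% train / 20% test split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# fit the TF-IDF vocabulary on the training data only, then reuse it for the test set
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)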

RESULT ANALYSIS & INTERPRETATION


Training of Model

The model achieved an accuracy of 96.477% at predicting sentiment scores on the training data.

Validation on Test Data

The model achieved an accuracy of 95.646% at predicting sentiment scores on the test data.
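Continuing from the TF-IDF sketch above, the training and validation steps can be expressed as follows; the printed accuracies on this toy sample will of course differ from the figures reported for the full dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train_tfidf))
test_acc = accuracy_score(y_test, model.predict(X_test_tfidf))
print(f"train accuracy: {train_acc:.3%}")   # 96.477% reported on the full data
print(f"test accuracy:  {test_acc:.3%}")    # 95.646% reported on the full data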
REFERENCES

https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/

https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

APPENDIX
Figure 1

Figure 2
Figure 3

Figure 4
Figure 5

Figure 6
