Python Project
From:
Ahmed Khan (22230011)
Shehzad Muslim (22230014)
Ashaan Khan (22230007)
Syeeda Rimshah Bukhary (22230013)
Submitted To: Ussama Yaqub
CONTENTS
SECTION-I
Purpose
Hypothesis Development
Data Cleaning
Sentiment Analysis
Data Analysis
Impact of ‘YouTube blockage’ on Tweet Counts of Verified and Un-Verified Users
Impact of ‘YouTube blockage’ on Tweet Sentiments of Verified and Un-Verified Users
Discussion
Conclusion
SECTION-II
TOPIC MODELING
Latent Dirichlet Allocation (LDA)
Steps in LDA Implementation
Preparing Data for LDA Analysis
Step A: Cleaning and slicing the table into a single column of cleaned text for LDA
Step B: Transforming the text data into an input format for training the LDA model (tokenization and stop-word removal)
Step C: Converting the tokenized object into a corpus and dictionary
LDA Model Training
ANALYZING LDA MODEL RESULTS
Conclusion on Topic 1: the model aggregates the words (Jalsa, Banned, Imran Khan, Channel, Pakistan) pertaining to Imran Khan’s jalsa at Peshawar and the banning of its media coverage
SECTION-III
SUPERVISED MACHINE LEARNING
Conceptualization of Model Task
Text Preparation (Cleaning & Preprocessing)
Text Exploration
N-Grams Sequence (Bigram)
Model Selection & Training
Result Analysis & Interpretation
REFERENCES
Appendix
SECTION-I
PURPOSE
This study employs sentiment analysis and text mining techniques on Twitter data to investigate
the changes in the tweets of both verified and unverified users during the YouTube blockade on
September 6th, 2022.
HYPOTHESIS DEVELOPMENT
In this section we formulate the research question for our study. The event under examination is the blocking of YouTube on September 6th, 2022. The blockage occurred because Imran Khan was about to hold a Jalsa in Peshawar, and many news channels were set to stream it live on YouTube (Figure 1). The government blocked YouTube shortly before the event, and the blockage became a major topic of discussion on Twitter (Figure 2). Given the governmental involvement, our objective was to examine the commonalities and disparities between prominent figures and the general public.
RQ. Is there a discernible difference in tweet sentiment and count between verified and unverified users during the YouTube blockage period (14:00–19:00)? (Figure 3)
1. There should be a difference between the words used in tweets by verified and unverified users before 14:00.
2. There should be no difference between the words used in tweets by verified and unverified users during 14:00–19:00.
DATA CLEANING
Information extracted from Twitter underwent preprocessing for sentiment analysis, as well as
for generating word clouds and conducting word frequency analysis. During this preprocessing
phase, we eliminated URLs, sanitized tags, email addresses, and emojis from the collected text
data. Additionally, common stop words like "a," "is," and "are" were excluded from all tweets
using the standard stop-word list available in the Python word cloud library.
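As an illustration, the cleaning steps described above can be sketched as follows. The regex patterns and the abbreviated stop-word set are our own simplifications; the report used the full stop-word list shipped with the wordcloud library.

```python
import re

# Abbreviated stop-word set for illustration; the report used the full
# list from the wordcloud library (wordcloud.STOPWORDS).
STOPWORDS = {"a", "an", "is", "are", "the", "and", "to", "of", "in"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
TAG_RE = re.compile(r"[@#]\w+")  # mentions and hashtags
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_tweet(text):
    """Remove URLs, emails, tags, and emojis, then drop stop words."""
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    text = TAG_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)
```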
SENTIMENT ANALYSIS
In our sentiment analysis process, we employ the VADER Python library. VADER assigns four distinct scores to each tweet: positive, negative, neutral, and compound. In the scope of this study, our emphasis is solely on the compound score. This score ranges from -1 to +1, where a score approaching +1 denotes strongly positive sentiment and a score nearing -1 signifies strongly negative sentiment. Within our analysis, we use the compound score to categorize tweets in our dataset as exhibiting either positive or negative sentiment.
DATA ANALYSIS
In this research, we employed tweets from Pakistan and categorized them into groups according
to account verification status: Verified and Un-Verified users. These tweets, from both verified
and un-verified users, were subsequently segmented into subgroups based on time periods:
before 2:00 PM and between 2:00 PM to 7:00 PM (Figure 4).
The distinction in word usage between verified and unverified users, as depicted in Figure 5, validates our first hypothesis. Additionally, Figure 6 confirms the fulfillment of our second hypothesis, as both user groups are actively discussing the YouTube blockage.
DISCUSSION
The public, comprising unverified users, displayed a more open and emotionally charged response. With a notable 40% increase in tweet counts, unverified users voiced their sentiments candidly. Their tweets, leaning towards negative sentiment at 41.5%, conveyed dissatisfaction or disagreement with the government's actions, particularly concerning Imran Khan.
While both groups – prominent verified users and the public – may be conveying similar
viewpoints regarding the government's actions, their approaches differ. Verified users opt for a
measured and neutral tone, recognizing the nuances of the issue and their role as influential
figures. Their focus on increasing their tweet activity might stem from the understanding that
quantity, rather than strictly positive or negative sentiment, contributes to broader engagement on
social media.
In contrast, unverified users exhibit a more emotionally charged response, driven by their
collective sentiments that the government's actions are unfavorable towards Imran Khan. This
group is less concerned with maintaining neutrality, which is often a characteristic of prominent
figures who must balance their public image and influence.
CONCLUSION
In conclusion, the differing tweet patterns and sentiment expressions among verified and
unverified users following the YouTube blockage event underscore the interplay between social
status, messaging strategy, and the desire to engage with an audience. Verified users, as
prominent figures, prioritize neutrality and higher tweet activity to effectively communicate
government matters, while the public expresses more open and direct sentiments about the
perceived wrongs towards Imran Khan, albeit with less emphasis on maintaining a balanced
tone.
SECTION-II
TOPIC MODELING
Topic models are statistical language models used to uncover hidden structure in a collection of texts. Topic modelling uses statistical modeling under unsupervised machine learning to identify clusters or groups of similar words within a body of text (e.g., health, doctor, patient, hospital – a topic pertaining to HEALTH CARE).
Dimensionality Reduction – instead of reviewing the entire text, the text is broken into words and topics with different weights.
Unsupervised Learning – words of similar weight are grouped under the same cluster, defined as a topic. The number of topics, like the number of clusters, is an input parameter of the model. Topic modeling builds clusters of words rather than clusters of texts; a text is thus a mixture of all the topics, each with a specific weight.
Tagging – abstract “topics” that occur in a collection of documents and best represent the information in them.
There are several existing algorithms that can be used to perform topic modeling. The most common are:
Latent Semantic Analysis (LSA/LSI),
Probabilistic Latent Semantic Analysis (pLSA), and
Latent Dirichlet Allocation (LDA)
For the project at hand, we selected LDA to build the model and visualize the results.
Step A: A simple preprocessing pass was performed to clean and slice the table, producing a single column of cleaned text for further processing under LDA.
Output
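A minimal sketch of Step A using pandas; the column names are hypothetical, since the original table's schema was not preserved:

```python
import pandas as pd

# Hypothetical columns; the original tweet table's schema was not preserved.
df = pd.DataFrame({"id": [1, 2],
                   "text": ["YouTube blocked!", "Jalsa in Peshawar."]})

# Clean the raw text and slice the table down to a single cleaned-text column.
df["clean_text"] = df["text"].str.lower().str.replace(r"[^a-z\s]", "", regex=True)
lda_input = df[["clean_text"]].copy()
```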
Step B: Transform the text data into a format that can serve as input for training the LDA model. In this step we tokenize the text and remove stop words.
Output
['graanacom']
Step C: Once the words are tokenized, we convert the tokenized object into a corpus and dictionary.
Output
[(0, 1)]
Based on the above setup, the following code is used to run and train the model:
Output
Cluster 2 lies close to cluster 1, containing similar kinds of words but in a different context and corpus.
SECTION-III
SUPERVISED MACHINE LEARNING
CONCEPTUALIZATION OF MODEL TASK
a. Text Problem Formulation – Compute sentiment scores (structured output) from text (unstructured input).
b. Text Preparation & Wrangling – This step involves critical cleansing and preprocessing
tasks necessary to convert streams of unstructured data into a format that is usable by
traditional methods designed for structured inputs.
c. Text Exploration – This step encompasses text visualization and techniques supporting feature selection and engineering.
The visual explanation of the same is also given below:
TEXT PREPARATION (CLEANING & PREPROCESSING)
Once tokenization of the text was successfully completed, the sentiment scores were converted into binary numbers, i.e., 0 and 1. It is pertinent to mention that the dataframe was restricted to 100,000 rows given system constraints in Google Colab, since the subsequent steps could not be completed on the original data frame.
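The binary conversion and row cap can be sketched in pandas; the column names below are hypothetical:

```python
import pandas as pd

# Hypothetical frame; the report's column names were not preserved.
df = pd.DataFrame({"tokens": [["good", "day"], ["bad", "news"], ["fine"]],
                   "compound": [0.6, -0.5, 0.1]})

# Convert compound sentiment scores to binary labels: positive = 1, negative = 0.
df["label"] = (df["compound"] >= 0).astype(int)

# Cap the frame at 100,000 rows to fit within Google Colab's memory limits.
df = df.head(100_000)
```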
The output file which includes the tokenized text in parallel with the sentiment score is given
below:
TEXT EXPLORATION
Text exploration was conducted as part of the exploratory analysis for the purpose of feature selection and engineering. The techniques adopted were generating a bag of words and n-gram sequences. The input commands along with the output files are given below:
From the above analysis, which also includes the top 5 tokens with the highest bag-of-words scores, it can be concluded that a bag of words sometimes fails to yield results from which meaningful inferences can be made. The token ‘Pakistan’ has the highest frequency; however, on a stand-alone basis ‘Pakistan’ will not support meaningful inferences about sentiment, since it is more likely to be mentioned in combination with some other key word that can help in predicting sentiment scores.
In view of the above, we applied the second technique, n-gram sequencing, to better understand the sentiments involved.
N-GRAMS SEQUENCE (BIGRAM)
Input & Output Results
To further evidence our point, the n-gram sequence can be better applied to predict sentiment scores. It is pertinent to mention that the token ‘Pakistan’ was most frequently used in conjunction with ‘Army’, considering the concentration of tweets was around September 6th, 2022. Thus, it can be inferred that whenever Pakistan Army was mentioned in a tweet, it was most likely to translate into a positive sentiment, given the importance of Defence Day for Pakistani nationals.
Target Variable – The target is the sentiment score, converted into binary labels: positive = 1 and negative = 0.
Features – TF-IDF Vectorization is a technique to convert text data into numerical features that
machine learning models can work with. It considers both the frequency of a word in a particular
document (TF) and the rarity of the word across all documents (IDF). TF-IDF assigns higher
weights to words that are important in a specific document but not too common in the entire
dataset.
The data was split into two sets (80% train and 20% test).
Splitting of Data
TF-IDF Vectorization
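The split and vectorization steps above, together with a classifier, can be sketched as follows on a toy corpus. The report does not name the classifier, so logistic regression is assumed here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy corpus; the actual work used the 100,000-row tweet frame.
texts = ["great day pakistan", "terrible news", "love this jalsa", "awful event",
         "happy independence", "sad outcome", "wonderful speech", "bad decision"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # positive = 1, negative = 0

# 80% train / 20% test split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # fit on training data only
X_test_tfidf = vectorizer.transform(X_test)

clf = LogisticRegression().fit(X_train_tfidf, y_train)
train_acc = accuracy_score(y_train, clf.predict(X_train_tfidf))
test_acc = accuracy_score(y_test, clf.predict(X_test_tfidf))
```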
The model achieved an accuracy of 96.477% at predicting sentiment scores on the training data.
The model achieved an accuracy of 95.646% at predicting sentiment scores on the test data.
REFERENCES
https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/
https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
APPENDIX
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6