1. INTRODUCTION
1.1 Background
Filming an iconic TV series is not easy. Competition has never been fiercer and huge budgets are at stake, particularly now that the popularity of television shows has peaked all over the world. Hitting the jackpot is still a real possibility, but it comes at great expense. For a long time the industry depended on feature films, and series were not regarded as creatively mature or financially successful. But there is a turning point in every story. It feels silly to write a “brief history” of online, serialized content, since its history is itself so brief. In less than a decade, the online video industry has moved from a disconnected hodge-podge of animated shorts to a multi-billion industry.
Genre refers to the categorization of Web Series based on similarities in the narrative elements from which the series are constructed. The definition of series genre is continually debated by critics. By the end of the silent film period, genre had become more clearly established and subdivided, and it had also become a symbolic feature of a film. A film often belongs to more than one genre, so each assignment of a film to a genre usually requires analysis. The popularity of a series genre differs from that of a movie genre because one series can connect multiple events across more than one season.

Sentiment Analysis, also known as Opinion Mining, is a natural language processing technique generally used to detect whether the data under study carries Positive, Negative or Neutral sentiment. It focuses on the study of human behavior and perception, and it draws on natural language processing, text analysis, information extraction and computational algorithms. Sentiments are emotions, opinions, feelings, likes or dislikes, good or bad. Sentiment analysis in this paper concerns the tweets of the audience or followers who follow the lead actor and lead actress of popular Web Series of different genres. Different people express their own interests and views about the Web Series, and the Twitter handles of Web Series of different genres host a series of discussions among the audience.
In social network analysis, centrality is one of the most studied concepts. There is a lot of research on using centrality indicators to locate the most important users in a social network. The task is to find metrics that can be computed easily and that rank users according to relevance criteria that are as close to reality as possible. The topic is discussed here in the context of the Twitter network, an online social networking site with millions of users and a vast flow of messages published and spread daily through user interactions. Twitter has different types of users, but the greatest utility lies in finding the most influential ones. The purpose of this project is to collect and classify some of the existing Twitter influence measures that can help in determining the influence of a Twitter account. The Twitter accounts considered here are those of popular Netflix Web Series and of their lead actors and actresses. These measures are very diverse.
Twitter is now one of the most popular online social networking sites on the planet. It is a microblogging site that allows users to send and read short messages known as tweets, originally limited to 140 characters and to 280 characters since 2017. Tweets can contain plain text, URLs, photos, mentions of other users (preceded by the symbol "@"), and hashtags, which are words highlighted by putting the symbol "#" in front of them. A trending topic is a phrase (one or more words) that appears in a large number of tweets from a specific location at a specific time. These "places" and "moments" are defined thanks to additional metadata in each tweet, which includes geolocation, time, and broadcaster account information, among other items.
User-to-user, user-to-tweet, tweet-to-tweet, and tweet-to-user are the four forms of public relationships on Twitter. The actions that are valid for each form of relationship are listed in the table below.
Table 1: Twitter relationships between users and tweets.
         USER                                                       TWEET
USER     Follows/is followed by, Mention, Replies to, Retweets to   Posts, tweets, likes
TWEET    Posted by, retweeted by, liked by, replied by              Replies/is replied from
The Twitter API offers access to all of the data needed to measure these metrics. An
API (Application Programming Interface) is a collection of functions, protocols, and
resources that can be used to construct a program or to interact with other services.
Twitter offers three APIs to developers (two of which are available to the public), all of
which provide access to information from the social network. This data is collected
through a variety of requests and is useful for analysis, historical or real-time data
searches, and the creation of unofficial clients.
1.2 Problem Statement
Web Series are one of the most common forms of entertainment. Because of the widespread use of the Internet in recent years, vast quantities of data and interaction about series have been created and posted online. Audience perception of and interest in a series are also key factors in monitoring the progress of a Web Series in a particular genre. Producers and directors tend to schedule web series content as soon as they find an actor or actress who matches their interest, regardless of the influence and popularity that person has on social media. One useful practice in Social Network Analysis is to analyze the popularity of a series through interacting tweets, hashtags and the calculation of user metrics.
Twitter, with its online discussion model, is an excellent forum for performing such studies. With Twitter's topic structure in mind, the problem can be described as follows: by monitoring current (and previous) tweet behavior for a hashtag, we can predict what Netflix viewers will say about a Web Series just after they finish watching it. Furthermore, examining a greater number of series can reveal the audience's perspective on a Web Series genre. More precisely, we can estimate whether it has risen in popularity and, if so, by how much. In this project, we attempt to formulate and solve an instance of such a problem for the Netflix Web Series and genres that we have selected.
1.3 Objective
The main objectives of this project are:
• To perform Sentiment Analysis of the tweets of the audience or followers of the Web Series Twitter handles and of their lead actors and actresses.
• To calculate and analyze different social network metrics based on Twitter data obtained through the Twitter API.
1.4 Scopes and Limitations
The scope of this project is:
• The concept can be applied to the analysis of movies and Web Series that stream on other video streaming platforms.
• The Twitter metrics calculated in this project are readily available with the help of a Developer Account.
The main limitation of this project is that it does not consider the time frame of the tweets and uses very little streaming tweet data. It focuses on static data and on the interaction between the audience and the Twitter handles.
1.5 Report Organization
This report is divided into the following chapters:

Chapter Two contains the literature review of theoretical articles that present previous results and findings of relevant research on Sentiment Analysis using different Machine Learning classifiers and on Twitter user metrics calculation. Chapter Three describes the methodology followed to meet the objectives of the project, including the formulation of the objective function, the algorithm, the flow chart, data collection and the software used. Chapter Four presents the results of the ML classification and the user metrics calculation, including the validation of the data set used for the project. Chapter Five gives the conclusion of this project work and suggestions for future work.
2. LITERATURE REVIEW
In [1], the authors present a method for automatically collecting a corpus that can be used to train a sentiment classifier. Using TreeTagger for POS-tagging, they observed the difference in distributions between the positive, negative, and neutral sets. Based on their findings, the authors concluded that syntactic structures can be used to express emotions or state facts. They used the corpus to train a sentiment classifier, relying on some POS-tags that may be good indicators of emotional text. The classifier labeled document sentiment as positive, negative, or neutral. It was built on a multinomial Naive Bayes classifier that uses N-grams and POS-tags as features.
In [2], comparative opinion mining was carried out on YouTube comments containing comparisons of Android and iOS. In the first setting, complete comments were used for classification. In a second setting, the authors filtered comments based on semantic capital, keeping only nouns, adjectives, and verbs to reduce computation. In another experimental setting, they made the naive assumption that the neighborhoods of keywords are enough to determine the class of a comment. The Naive Bayes algorithm was used in all settings. The results in terms of various performance metrics were unsatisfactory overall, but the naive assumption about keyword neighborhood words performed admirably.
The authors of [3] used a machine learning method for polarity classification of movie review data, dividing the dataset into a training set and a test set. The entire data set was gathered from a movie review website. They then applied NLP methods for data preprocessing. ML classifiers, including Multinomial NB, Bernoulli NB, SVM, Maximum Entropy, and Decision Tree, were trained on the training set and evaluated on the test set. The experimental findings showed that Multinomial NB had a higher accuracy (88.5%) than the others.
In [4], after proper preprocessing, an efficient feature vector was generated by performing feature extraction in two steps. The first step was to extract Twitter-specific features and add them to the feature vector. Following that, these features were removed from the tweets, and feature extraction was performed as if on standard text. These features were also added to the feature vector. The classification accuracy of the feature vector was tested using Naive Bayes, SVM, Maximum Entropy, and Ensemble classifiers. For the new feature vector, all of these classifiers had virtually equal accuracy. The feature vector performed admirably in the study's electronic products domain.
In [5], the authors classified the different measures of activity, popularity, and influence found in the literature for the particular context of Twitter. As far as its authors know, this was the first systematic survey of its kind on Twitter influence assessment. Almost all popularity measures were related to follow-up relationships, while the majority of activity measures focused on reply actions. They stressed the use of retweets for influence measures. The metrics related to favorites or likes were the least used of all the kinds of metrics considered.
The study in [6] looked at Twitter users' influence using three separate measures: in-degree, retweets, and mentions. The authors found that in-degree reflects a user's popularity but has little to do with other significant aspects of influence, such as engaging the audience, as measured by retweets and mentions. Retweets were driven by the content value of a tweet, while mentions were driven by the name value of the user. Such subtle distinctions result in two distinct classes of top Twitter users: those with a high in-degree but few retweets or mentions, and those with many retweets or mentions but not a correspondingly high in-degree.
The authors of [7] explored the problem of identifying prominent Twitter users. They present an overview of Twitter's current strength and suggest some measures for quantifying user influence. Since the user is the key contributor in every social network, they built a full profile for each account to derive its influence. The effect of influence is generally noticeable when a user's followers are affected by that user's posts, even though the presence of this kind of relationship may be overlooked.
3. METHODOLOGY
3.1 Block Diagram of Overall Project
The methodology of this project can be divided into two parts: first, the Twitter Sentiment Analysis of Web Series of different genres and, second, the Twitter user influence calculation in each genre for the lead actors and actresses.
[Block diagram with the blocks: Twitter data collection; data preprocessing; feature extraction and selection; features; classifier; sentiment polarity; activity metrics; popularity metrics; calculation of metrics for each genre; analysis of genres]
Fig 3-1: Block Diagram of Project
3.2 Sentiment Analysis and Score
The sentiment analysis system, which follows a standard machine learning approach, is explained in the sections below.
3.2.1 Twitter Data Collection
The “tweetstream” Python library, a package that wraps the Twitter streaming API, is used to obtain data in the form of raw tweets. This API provides two ways to access tweets: Sample Stream and Filter Stream. Sample Stream simply delivers a small, random sample of all tweets being streamed in real time. Filter Stream delivers tweets that meet a set of criteria. It can filter tweets based on three criteria:
• Specific keyword(s) to track or search for in tweets
• Specific Twitter user(s), identified by their user IDs
• Tweets originating from specific location(s) (only for geo-tagged tweets).
A tweet obtained using this approach contains a lot of raw data, some of which may not be useful for our application. It takes the form of a Python “dictionary” data type with a variety of key-value pairs. The following is a list of several of these key-value pairs:
• User ID
• Screen name of the user
• Original Text of the tweet
• Presence of hashtags
• If the tweet is a retweet
• Language in which the twitter user has registered their account
• Geo-tag position of the tweet
• Date and time that the tweet was generated
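As a minimal sketch of how the fields listed above might be read from a tweet dictionary, the snippet below uses a hand-written sample tweet. The key names follow the Twitter v1.1 tweet JSON layout (`user`, `entities`, `retweeted_status`, `coordinates`); the sample values and the helper name `summarize_tweet` are illustrative assumptions, not data or code from this project.

```python
def summarize_tweet(tweet):
    """Pick out the fields listed above from a raw tweet dictionary."""
    user = tweet.get("user", {})
    hashtags = tweet.get("entities", {}).get("hashtags", [])
    return {
        "user_id": user.get("id"),
        "screen_name": user.get("screen_name"),
        "text": tweet.get("text", ""),
        "has_hashtags": len(hashtags) > 0,
        # Retweets carry the original tweet under "retweeted_status".
        "is_retweet": "retweeted_status" in tweet,
        "account_language": user.get("lang"),
        "geo": tweet.get("coordinates"),
        "created_at": tweet.get("created_at"),
    }


# Hypothetical sample tweet (values invented for illustration only).
sample_tweet = {
    "created_at": "Mon Mar 01 10:00:00 +0000 2021",
    "text": "Loved the new season! #SciFi",
    "user": {"id": 12345, "screen_name": "example_fan", "lang": "en"},
    "entities": {"hashtags": [{"text": "SciFi"}]},
    "coordinates": None,
}

summary = summarize_tweet(sample_tweet)
print(summary["screen_name"], summary["has_hashtags"], summary["is_retweet"])
```

Using `dict.get` with defaults keeps the helper tolerant of tweets that lack optional keys such as geo-tags.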
3.2.2 Data Preprocessing
Preprocessing text data is important because it prepares the raw text for mining, making it easier to extract information from the text and to apply machine learning algorithms to it. If this stage is skipped, the likelihood of working with noisy and inaccurate data increases. The aim of this step is to remove noise that is not relevant to determining the sentiment of tweets, such as Twitter handles, punctuation, special characters, numbers, and words that carry little meaning in the context of the text. Twitter handles are removed because they give little information about a tweet's content. Punctuation, numbers, and special characters are removed because they do not help distinguish between different types of tweets. Most very short words add no meaning either, for instance ‘pdx’, ‘his’, and ‘all’. Finally, each tweet is broken into individual words or tokens. The steps are:
i) Filtering: URL links are removed from the dataset files before further preprocessing.
ii) Removing stop words: stop words such as the articles (“a”, “an”, “the”) are removed from the datasets.
iii) Removing numbers, punctuation, and unnecessary spaces: numbers, punctuation and extra spaces that are of no use are removed.
iv) Converting to lower case: all letters in the sentences are converted to lower case.
v) Tokenization: Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing such as parsing or text mining. We segment text by splitting it on spaces and punctuation marks. However, we make sure that contractions such as “didn't” and “we'll” remain as one word.
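The preprocessing steps listed above can be sketched with Python's standard `re` module. This is only an illustration under stated assumptions: the stop-word set is a tiny placeholder (the report does not list the one it used), the function name `preprocess` is hypothetical, and stop words are filtered after tokenization for convenience.

```python
import re

# Small illustrative stop-word list (assumption; the report does not give its list).
STOP_WORDS = {"a", "an", "the"}


def preprocess(tweet):
    """Clean one tweet and return its list of tokens."""
    text = re.sub(r"https?://\S+", " ", tweet)   # i) drop URL links
    text = re.sub(r"@\w+", " ", text)            # drop Twitter handles
    text = text.lower()                          # iv) convert to lower case
    # iii) keep letters and apostrophes only (removes digits, punctuation, '#')
    text = re.sub(r"[^a-z'\s]", " ", text)
    tokens = text.split()                        # v) split on whitespace
    tokens = [t.strip("'") for t in tokens]      # trim stray quotes, keep "didn't" whole
    return [t for t in tokens if t and t not in STOP_WORDS]  # ii) remove stop words


tokens = preprocess(
    "@netflix The finale didn't disappoint!!! 10/10 https://t.co/abc #Bridgerton"
)
print(tokens)
```

Keeping the apostrophe in the allowed character class is what lets contractions survive as single tokens.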
3.2.3 Feature Extraction and Selection
The aim of this step is to use a well-known feature extraction technique called bag of words (BoW) to extract features for review classification. It is a simple feature extraction technique that uses a vector space model to represent a text document. Each dimension of the vector space represents a feature. We use unigrams as the feature set in this project. The features in the vector space are all the unigrams that occur in the review text documents, while the feature values correspond to the frequency or occurrence of each unigram in a review document. Consider the following three review text documents; for convenience, a single review sentence is shown from each tweet.
Review 1: “I loved this movie.”
Review 2: “I hated this movie.”
Review 3: “Great acting a good movie.”
Table 2: BOW Vector Space Model for Unigrams
Review SN Acting Good Great Hated Loved Movie This Class
Review 1 0 0 0 0 1 1 1 +ve
Review 2 0 0 0 1 0 1 1 -ve
Review 3 1 1 1 0 0 1 0 +ve
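The unigram vectors of Table 2 can be reproduced with a short, dependency-free sketch. The helper names `tokenize` and `bag_of_words` are hypothetical, and the words "i" and "a" are excluded as an assumption made to match the columns shown in the table.

```python
import re

# Words left out of the vocabulary so that it matches the columns of Table 2 (assumption).
STOP_WORDS = {"i", "a"}


def tokenize(text):
    """Lower-case the text and return its alphabetic unigrams, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def bag_of_words(documents):
    """Return the sorted unigram vocabulary and one count vector per document."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    vectors = [[doc.count(w) for w in vocabulary] for doc in tokenized]
    return vocabulary, vectors


reviews = [
    "I loved this movie.",
    "I hated this movie.",
    "Great acting a good movie.",
]
vocabulary, vectors = bag_of_words(reviews)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each printed vector corresponds to one row of Table 2, with the class label left out because it is the target rather than a feature.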
3.2.4 Machine Learning Classifier
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. However, it is mostly used in classification problems. In the SVM algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.
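[Figure: separating hyperplane $w^{T}x + b = 0$ with the margin lines $w^{T}x + b = 1$ and $w^{T}x + b = -1$; the points lying on the margin lines are the support vectors]
Fig 3-2: Support Vector Machines Concept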
SVC denotes the Support Vector Machine used for classification and has been successfully applied in many applications involving the division of data into two or more groups. The aim of SVC is to find a classification criterion (i.e., a decision function) that can properly separate unseen data at the testing stage while still generalizing well. For a two-class data set, this criterion may be a straight line with a maximum distance (margin) from the data of each class. In SVC-related discussions, this linear classifier is often referred to as an optimal hyperplane [8]. For a set of training data $x_i$ ($i = 1, 2, 3, \ldots, n$), this straight line, also known as a linear hyperplane, is defined as:
$$w^{T}x + b = 0 \qquad \text{(Equation 1)}$$
where $w$ denotes an n-dimensional vector and $b$ denotes a bias term. The hyperplane must have two properties: (1) the smallest possible error in data separation, and (2) the greatest possible distance from the closest data in each class [9]. Under these conditions, the data of each class can only lie on the left ($y = 1$) or on the right ($y = -1$) side of the hyperplane. To control the separability of the data, two margins can be described as follows:
$$w^{T}x + b \begin{cases} \geq 1 & \text{for } y_i = 1 \\ \leq -1 & \text{for } y_i = -1 \end{cases} \qquad \text{(Equation 2)}$$
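To make the hyperplane and margin conditions concrete, the following sketch evaluates a linear decision function on a few points. The weight vector, bias and sample points are invented for illustration, and the function names are hypothetical; a trained SVM would learn `w` and `b` from data rather than have them set by hand.

```python
def decision_value(w, b, x):
    """Return w^T x + b for weight vector w, bias b and sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Predict +1 or -1 from the side of the hyperplane w^T x + b = 0."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_margin(w, b, x, y):
    """Check the two margin conditions, written compactly as y * (w^T x + b) >= 1."""
    return y * decision_value(w, b, x) >= 1


# Hypothetical 2-D hyperplane and labeled samples (values chosen for illustration).
w, b = [1.0, 1.0], -3.0
samples = [([3.0, 2.0], 1), ([1.0, 0.5], -1), ([2.0, 1.5], 1)]
for x, y in samples:
    print(x, classify(w, b, x), satisfies_margin(w, b, x, y))
```

The third sample is classified correctly but lies inside the margin, which is the situation the margin conditions are meant to detect.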
3.3 Twitter Metrics Calculation
A metric is a simple mathematical expression that provides basic information about a social network as a numerical value. Metrics can then be combined to create a (ranking) measure, which is a formula or algorithm that provides a criterion for ranking the users of the network. There are, of course, more complex measures than a simple combination of metrics. Original tweets, replies, retweets, mentions, and graph characteristics are among the metrics proposed by Pal and Counts [10]. Following that, other researchers began to build on this system of metrics.
Table 3: Kind of Metric Used
TYPE         NAME                 F    RT   M    RP
Activity     Tweet Count Score         √
Activity     General Activity     √    √    √    √
Popularity   Follower Rank        √
Popularity   Popularity           √
3.3.1 Activity Measures
Users are considered active when their participation in the social network is constant and frequent over a period of time, regardless of the attention they receive for it. It is worth noting that there may be very active Twitter users who are invisible to any metric because they leave no traces on the network. As a result, when we say "participation," we mean actions that can be measured, such as tweets, retweets, mentions, and replies. In this context, Yin and Zhang [11] define a user's activity as the probability of the user seeing a tweet.
TweetRank [12], which simply counts the number of tweets a user has, is perhaps the most basic activity measure. The Tweet count score [13], which counts the number of original tweets plus the number of retweets, is a little more sophisticated. Following this logic, a fair activity measure could be the total number of each user's measurable actions. As a result, we define the General Activity of each user i as follows:
$$\mathrm{GeneralActivity}(i) = OT1 + RP1 + RT1 + FT1 \qquad \text{(Equation 3)}$$
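As a quick worked instance of Equation 3, the sketch below sums the four counts for one account. The function name `general_activity` is hypothetical; the counts are those reported for the account Portiaderoosi in the Genre 1 metrics table of Section 4.2.1.

```python
def general_activity(ot1, rp1, rt1, ft1):
    """Equation 3: original tweets + replies + retweets + favorited tweets."""
    return ot1 + rp1 + rt1 + ft1


# Counts taken from the Portiaderoosi row of the Genre 1 metrics table.
ga = general_activity(ot1=619, rp1=26, rt1=121, ft1=60)
print(ga)  # 826
```

The result agrees with the GA value listed for that account.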
3.3.2 Popularity Measures
We may call a user popular if he or she is recognized by a large number of other users on the network. A celebrity, for example, is a popular user who does not necessarily have an active and influential account. For example, Clint Eastwood's Twitter account (@Eastwood) has over 60,000 followers but follows no one and has no tweets. The normalized variant of the standard in-degree metric [14][15] for social networks in general is the FollowerRank [12], also known as Structural Advantage [16]:
$$\mathrm{FollowerRank}(i) = \frac{F1}{F1 + F3} \qquad \text{(Equation 4)}$$
There are different versions of this metric, such as the Twitter Follower-Followee ratio (TFF) [17], which equals F1/F3. The downside of these metrics is that F1 and F3 can fluctuate a lot, and the numbers of followers of some Twitter users are excessively high compared to the rest. To account for these discrepancies, the Popularity [18] formula was devised as follows:
$$\mathrm{Popularity}(i) = 1 - e^{-F1} \qquad \text{(Equation 5)}$$
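The two popularity measures can be sketched in a few lines. The function names are hypothetical, the exponential form of Equation 5 is read as 1 minus e to the power of minus F1 (an assumption about the garbled original), and the F1 and F3 values come from the first row of the Genre 1 metrics table in Section 4.2.1.

```python
import math


def follower_rank(f1, f3):
    """Equation 4: F1 / (F1 + F3)."""
    return f1 / (f1 + f3)


def popularity(f1):
    """Equation 5, read here as 1 - e^(-F1)."""
    return 1.0 - math.exp(-f1)


# F1 and F3 values taken from the first row of the Genre 1 metrics table.
f1, f3 = 794, 1_000_000
print(round(follower_rank(f1, f3), 6))
print(popularity(f1))
```

Because the exponential term decays very quickly, this form of the Popularity measure is numerically indistinguishable from 1 once F1 exceeds a few dozen, which is consistent with the Popularity column reported in the results tables.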
4. RESULTS AND DISCUSSIONS
4.1 Sentiment Classifier of Different Web Series
This section describes the use of tweets from the Twitter handles of popular Netflix Web Series and of their lead actors and actresses in each genre as the data set for classifying the sentiment of the audience towards them. A minimum of about 3,200 tweets from each of 9 popular Twitter accounts in each genre were extracted, preprocessed and classified to obtain the results. A total of 144,000 tweets were extracted.
4.1.1 Data set Collection and Pre-Processing
Data obtained from the Twitter API were exported to a CSV file under different headings. All the information required to compute these metrics is accessible through the Twitter API. An API is a set of functions, protocols and tools used to build an application or to facilitate communication with services. Twitter provides three kinds of API to developers (two of them public) that together provide access to information from the social network. This information is obtained through different kinds of requests, and it is useful for research, for searching historical or real-time information, and for developing unofficial clients [5]. The common fields that were extracted were “created_at”, “id”, “id_str”, “text”, “source”, “truncated”, “in_reply_to_status_id”, “in_reply_to_status_id_str”, “in_reply_to_user_id”, “retweet_count”, “favorite_count”, “favorited”, “retweeted”, “filter_level” and “lang”.
Fig 4-1: Extraction of Tweet
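One way such an export to CSV could look is sketched below with Python's standard `csv` module. The subset of column names, the helper name `tweets_to_csv` and the two sample tweets are assumptions made for illustration; the report does not show the export code it used.

```python
import csv
import io

# Subset of the extracted fields, used as CSV headings (illustrative choice).
FIELDS = ["created_at", "id_str", "text", "retweet_count", "favorite_count", "lang"]


def tweets_to_csv(tweets, fields=FIELDS):
    """Serialize a list of tweet dictionaries to CSV text with one heading row."""
    buffer = io.StringIO()
    # extrasaction="ignore" silently drops keys that are not in `fields`.
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for tweet in tweets:
        writer.writerow({name: tweet.get(name, "") for name in fields})
    return buffer.getvalue()


# Hypothetical tweets (values invented for illustration only).
sample_tweets = [
    {"created_at": "Mon Mar 01 10:00:00 +0000 2021", "id_str": "1", "text": "Great show",
     "retweet_count": 3, "favorite_count": 10, "lang": "en", "source": "web"},
    {"created_at": "Mon Mar 01 11:00:00 +0000 2021", "id_str": "2", "text": "Not for me",
     "retweet_count": 0, "favorite_count": 1, "lang": "en"},
]
csv_text = tweets_to_csv(sample_tweets)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; replacing it with a file opened with `newline=""` would produce the file on disk.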
First, preprocessing techniques are applied to the dataset to segment it into sentences, tokenize the sentences into words, and remove the stop words. Stemming is also performed on the remaining words to reduce them to their root form.
Fig 4-2: Cleaning of Tweet
4.1.2 Creating Tokens and Corpus from Text
Tokenization is the process of breaking a text down into manageable chunks, which are called tokens. We may, for example, break a large chunk of text down into words or sentences. Depending on the task at hand, we may specify our own conditions for dividing the input text into meaningful tokens.
Fig 4-3: Sample Tweet
Fig 4-4: Tweets in the form of Tokens
Furthermore, a Bag of Words (BoW) of unigrams is built in order to classify the sentiment of the tweets.
Fig 4-5: n-Gram Corpus
4.1.3 Classification using Machine Learning Classifier
The SVM model was created and trained using the Rotten Tomatoes dataset. The datasets of the different Twitter handles were aggregated into a single text file for testing.
4.1.3.1 Classification Report of Aggregate of Data Set of Genre 1 (Comedy)
Table 5: Confusion Matrix for Genre 1
Confusion Matrix    Predicted Positive   Predicted Negative   Predicted Neutral
Actual Positive     1145                 0                    9
Actual Negative     443                  4556                 58
Actual Neutral      31                   0                    3960
Table 6: Evaluation metrics for Genre 1
Trained Data Set    Accuracy   Precision   Recall   F-Measure
Rotten Tomatoes     0.946      0.9922      0.707    0.825
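For reference, the sketch below shows how such evaluation metrics can be derived from the Genre 1 confusion matrix using the standard definitions. It assumes that rows are actual classes and columns are predicted classes, and that precision, recall and F-measure refer to the Positive class; both points are assumptions, since the report does not state them. The function names are hypothetical.

```python
# Genre 1 confusion matrix: rows are actual classes, columns are predicted classes,
# both in the order Positive, Negative, Neutral (orientation is an assumption).
CONFUSION = [
    [1145, 0, 9],
    [443, 4556, 58],
    [31, 0, 3960],
]


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision(matrix, k):
    """True positives of class k divided by everything predicted as class k."""
    return matrix[k][k] / sum(row[k] for row in matrix)


def recall(matrix, k):
    """True positives of class k divided by everything that actually is class k."""
    return matrix[k][k] / sum(matrix[k])


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


p, r = precision(CONFUSION, 0), recall(CONFUSION, 0)
print(round(accuracy(CONFUSION), 3), round(p, 3), round(r, 3), round(f_measure(p, r), 3))
```

Under this orientation the values 0.707 and 0.992 come out as precision and recall respectively; transposing the matrix swaps the two, while accuracy and F-measure are unaffected.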
4.1.3.2 Classification Report of Aggregate of Data Set of Genre 2 (Drama)
Confusion Matrix    Predicted Positive   Predicted Negative   Predicted Neutral
Actual Negative     443                  11628                58
Actual Neutral      10                   0                    4440
Trained Data Set    Accuracy   Precision   Recall   F-Measure
Rotten Tomatoes     0.969      0.9952      0.7203   0.8357
4.1.3.3 Classification Report of Aggregate of Data Set of Genre 3 (SCI-FI)
Confusion Matrix    Predicted Positive   Predicted Negative   Predicted Neutral
Actual Neutral      39                   0                    4782
Trained Data Set    Accuracy   Precision   Recall   F-Measure
Rotten Tomatoes     0.9469     0.9942      0.7203   0.836
4.1.3.4 Classification Report of Aggregate of Data Set of Genre 4 (Romance)
Confusion Matrix    Predicted Positive   Predicted Negative   Predicted Neutral
Actual Neutral      44                   0                    7749
Trained Data Set    Accuracy   Precision   Recall   F-Measure
Rotten Tomatoes     0.948      0.994       0.649    0.785
4.1.3.5 Classification Report of Aggregate of Data Set of Genre 5 (Action)
Confusion Matrix    Predicted Positive   Predicted Negative   Predicted Neutral
Actual Negative     889                  10473                87
Actual Neutral      41                   0                    7379
Table 14: Evaluation Metrics for Genre 5
Trained Data Set    Accuracy   Precision   Recall   F-Measure
Rotten Tomatoes     0.953      0.994       0.730    0.842
4.2 User Metrics Calculation of Different Genre
Table 12: Metrics and Definition
Metrics          Definition
OT1              Original tweets of the user
RP1              Replies from the user
RT1              Retweets from the user
FT1              Tweets favorited by the user
GA               General Activity (OT1+RP1+RT1+FT1)
F1               Accounts the user follows
F3               Followers of the user
Follower Rank    Defined as F1/(F1+F3)
Popularity       Defined as in-links in the network
4.2.1 User Metrics Calculation of Different Genre 1
Table 13: Genre 1 Metrics Calculation
Twitter Account  OT1  RP1  RT1  FT1  GA  F1  F3  Follower Rank  Popularity
Ahwamysedaris 1126 190 1639 4246 7201 794 1.00E+06 0.000793 1
Portiaderoosi 619 26 121 60 826 236 34900 0.006717 1
Batemanjason 353 19 121 454 947 632 2.00E+06 0.000421 1
Tomellis17 1633 447 1120 302 3502 409 4615 0.081409 1
Kevinmalejandra 434 40 267 9129 9870 20 234700 8.52E-05 1
Arnetwill 1501 394 436 1203 3534 293 418500 0.0007 1
Bojackhorseman 907 2198 80 8796 11981 37 36800 0.001004 1
Arresteddev 1805 903 152 3835 6695 43 501700 8.57E-05 1
Twitter Account  OT1  RP1  RT1  FT1  GA  F1  F3  Follower Rank  Popularity
Anyataylorjoy 1288 196 351 2518 4353 82 219200 0.000374 1
Asanteblackk 47 1 1 8355 8404 248 39700 0.006208 1
Chloepirriepie 531 966 507 1 2005 433 3329 0.115098 1
Omarsy 985 130 263 2863 4241 305 3200000 9.53E-05 1
Shirineboutella 170 342 154 199 865 226 261600 0.061715 1
Thecaleelharris 118 4 35 144 301 74 28900 0.013247 1
Netflixtheqg 917 270 2013 2973 6173 64 173600 0.011036 1
Whentheyseeus 131 692 357 6782 7962 144 261600 0.002877 1
Twitter Account  OT1  RP1  RT1  FT1  GA  F1  F3  Follower Rank  Popularity
DavidKHarbour 1137 1689 278 6498 9602 794 1.00E+06 0.000793 1
GatenM123 319 336 228 3174 4057 236 34900 0.006717 1
hannahjk1 149 74 268 454 945 632 2.00E+06 0.000421 1
ImCConner 480 1352 976 11007 13815 409 4615 0.081409 1
MichaelaCoel 434 40 267 9129 9870 20 234700 8.52E-05 1
ReneeGoldsberry 1122 131 436 19041 20730 293 418500 0.0007 1
AltCarb 226 1203 81 2415 3925 37 36800 0.001004 1
blackmirror 161 745 13 1216 2135 43 501700 8.57E-05 1
Twitter Account  OT1  RP1  RT1  FT1  GA  F1  F3  Follower Rank  Popularity
Iamgreenfield 1316 452 665 1932 4365 31 520400 5.96E-05 1
Joannalgarcia 2285 578 335 1788 4986 268 146400 0.001827 1
Nicolacoughlan 1498 812 891 20176 23377 1361 261600 0.005176 1
Real_Brooke 164 121 107 371 763 114 28900 0.003929 1
Regejean 642 197 428 8353 9620 1131 173600 0.006473 1
Zooeydeschanel 2017 544 639 312 3512 651 6.00E+06 0.00011 1
Bridgerton 462 538 338 1906 3244 26 195300 0.000133 1
Sweetmagnolias 134 16 280 1558 1988 69 10400 0.006591 1
Twitter Account  OT1  RP1  RT1  FT1  GA  F1  F3  Follower Rank  Popularity
alexanderludwig 23 15 10 1627 1675 268 483500 0.000554 1
candicepatton 1580 668 949 5939 9136 402 506700 0.000793 1
dpanabaker 112 23 45 2688 2868 294 746200 0.000394 1
KatherynWinnick 2456 306 436 1253 4451 146 504900 0.000289 1
ralphmacchiol 2320 440 441 3177 6378 617 192000 0.003203 1
WilliamZabka 1828 667 674 3884 7053 262 283300 0.000924 1
CobraKaiSeries 618 1948 634 6140 9340 103 244500 0.000421 1
HistoryVikings 1152 1016 983 12472 15623 1264 245578 0.000902 1
4.3 Discussions
The machine learning model clearly separated the tweets of users who interacted with the Twitter handles of the different celebrities and with the accounts of the Web Series of different genres. The following results were obtained from the classifier.
Table 18: Data of Tweets showing Different Sentiment
Genre               Classified Positive   Classified Negative   Classified Neutral
Genre 1 (Comedy)    1145                  4556                  3960
Genre 2 (Drama)     1244                  8905                  4440
Genre 3 (Sci-Fi)    1708                  9034                  4782
Genre 4 (Romance)   1756                  6671                  7749
Genre 5 (Action)    2520                  10473                 7379
Table 19: Data of Tweets of Different Genre and their metrics
Genre  General Activity  Mean Popularity  Mean Follower
Genre 1 (Comedy) 44556 1 0.0003
Genre 2(Drama) 34304 1 0.0004
Genre 3(Sci-Fi) 65078 1 0.0004
Genre 4(Romance) 51855 1 0.0004
Genre 5(Action) 56524 1 0.0005
4.3.1 Visualizations of Different Metrics
[Chart: numbers of tweets classified Positive, Negative and Neutral for Genre 1 (Comedy), Genre 2 (Drama), Genre 3 (Sci-Fi), Genre 4 (Romance) and Genre 5 (Action)]
Fig 4-6 Tweets Classification in Different Genre
[Chart: General Activity for Genre 1 (Comedy), Genre 2 (Drama), Genre 3 (Sci-Fi), Genre 4 (Romance) and Genre 5 (Action)]
Fig 4-7 General Activity in Different Genre
[Chart: OT1, RP1, RT1 and FT1 values for Genre one]
Fig 4-8 Different metrics of General Activity in Genre one
[Chart: OT1, RP1, RT1 and FT1 values for Genre two]
Fig 4-9 Different metrics of General Activity in Genre two
[Chart: OT1, RP1, RT1 and FT1 values for Genre three]
Fig 4-10 Different metrics of General Activity in Genre three
[Chart: OT1, RP1, RT1 and FT1 values for Genre four]
Fig 4-11 Different metrics of General Activity in Genre four
[Chart: OT1, RP1, RT1 and FT1 values for Genre five]
Fig 4-12 Different metrics of General Activity in Genre five
4.4 Summarization of Results of Research
From the sentiment analysis and the analysis of the activity metrics of the celebrities of the various genres of Netflix Web Series, we obtained the following results. These results are based on about 150,000 tweets obtained through Twitter data mining from the Twitter handles of the most watched Netflix Web Series listed in the Trending Now section of the Netflix platform.
The findings of this research are summarized in the table below for convenience.
Table 20: Summarization of Research
Genre with most Positive Sentiment tweets                           Action Genre
Genre with most Negative Sentiment tweets                           Action Genre
Genre having most General Activity (including favorited tweets)    Sci-Fi Genre
Genre having least General Activity                                 Drama Genre
Genre whose celebrities have more chance to interact with people    Comedy Genre
5. CONCLUSION AND RECOMMENDATION
5.1 Conclusion
This research was conducted to understand the basic interaction of the Netflix web
series audience with their celebrities. The interaction on Twitter was classified into
three classes: Positive, Negative and Neutral. The classifier was trained on the Rotten
Tomatoes data set, which is preferred in much research because it is built from a large
number of user reviews on the Rotten Tomatoes website, a popular critics' website.
The results showed more tweets classified as Negative than as Positive or Neutral. The
Action/Adventure genre had the most tweets classified as Positive, with good recall and
precision scores, and accuracy was almost similar across most of the genres. Similarly,
the genre with the most interaction between audience and celebrities on Twitter was
Sci-Fi, with the maximum amount of General Activity, which was calculated on the
basis of Original Tweets, Replies, Mentions and Favorites. Within General Activity,
the most notable contributing parameter was Favorites, as many celebrities actively
favorited tweets.
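As a reminder of how the precision, recall and accuracy scores mentioned above are read from a three-class confusion matrix, a minimal sketch follows. The matrix values are hypothetical and are not the results of this research. The row and column order (Positive, Negative, Neutral) and the function names are also assumptions made for illustration.

```python
from typing import Dict, List

LABELS = ["positive", "negative", "neutral"]  # assumed class order


def per_class_scores(cm: List[List[int]]) -> Dict[str, Dict[str, float]]:
    """cm[i][j] = tweets whose actual class is i and predicted class is j."""
    scores: Dict[str, Dict[str, float]] = {}
    for k, label in enumerate(LABELS):
        true_pos = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum
        actual_k = sum(cm[k])                    # row sum
        scores[label] = {
            "precision": true_pos / predicted_k if predicted_k else 0.0,
            "recall": true_pos / actual_k if actual_k else 0.0,
        }
    return scores


def accuracy(cm: List[List[int]]) -> float:
    """Share of tweets on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total if total else 0.0


# Hypothetical confusion matrix, for illustration only.
cm = [
    [50, 5, 5],    # actual positive
    [10, 80, 10],  # actual negative
    [0, 15, 25],   # actual neutral
]

scores = per_class_scores(cm)
print(scores)
print("accuracy:", accuracy(cm))
```

Reporting per-class scores next to accuracy matters here because the classes are unbalanced: a classifier can reach high accuracy while still having low recall on the smallest class.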
5.2 Future Work
The work presented in this project can be further extended and tested with different
algorithms and with data sets of varying size. The project is based on the 3200 tweets
extracted from the popular Twitter handles of the celebrities who had pivotal roles
in the Netflix web series.
The project can also be extended with new attributes, such as a comparison of different
metrics, for a more appropriate analysis of the quality time spent together by the
celebrities and their audience on social media.
REFERENCES
[1] A. Pak and P. Paroubek, “Twitter for Sentiment Analysis: When Language
Resources are Not Available,” 2011 22nd International Workshop on Database and
Expert Systems Applications, 2011.
[2] A. U. Khan, M. Khan, and M. B. Khan, “Naïve Multi-label Classification of
YouTube Comments Using Comparative Opinion Mining,” Procedia Computer
Science, vol. 82, pp. 57–64, 2016.
[3] A. Rahman and M. S. Hossen, “Sentiment analysis on movie review data using
machine learning approach,” in 2019 International Conference on Bangla Speech and
Language Processing (ICBSLP), 2019, pp. 1–4.
[4] U. Kumari, A. K. Sharma, and D. Soni, “Sentiment analysis of smart phone product
review using SVM classification technique,” 2017 International Conference on
Energy, Communication, Data Analytics and Soft Computing (ICECDS), Chennai,
India, 2017, pp. 1469–1474, doi: 10.1109/ICECDS.2017.8389689.
[5] F. Riquelme and P. González-Cantergiani, “Measuring user influence on Twitter:
A survey,” Information Processing & Management, vol. 52, no. 5, pp. 949–975, 2016.
[6] M. Cha, H. Haddadi, F. Benevenuto, and K. Gummadi, “Measuring User Influence
in Twitter: The Million Follower Fallacy,” 2010.
[7] J. Sun and J. Tang, “A Survey of Models and Algorithms for Social Influence
Analysis,” Social Network Data Analytics, pp. 177–214, 2011.
[8] S. Abe, Support Vector Machines for Pattern Classification, Springer-Verlag
London Limited, 2008, 350 pp.
[9] I. Steinwart and C. Scovel, “Fast rates for support vector machines using
Gaussian kernels,” The Annals of Statistics, vol. 35, no. 2, pp. 575–607, 2007,
doi: 10.1214/009053606000001226.
[10] A. Pal and S. Counts, “Identifying topical authorities in microblogs,” Proceedings
of the fourth ACM international conference on Web search and data mining - WSDM
'11, 2011.
[11] Z. Yin and Y. Zhang, “Measuring Pair-Wise Social Influence in Microblog,” 2012
International Conference on Privacy, Security, Risk and Trust and 2012 International
Conference on Social Computing, 2012.
[12] R. Nagmoti, A. Teredesai, and M. De Cock, “Ranking Approaches for Microblog
Search,” 2010 IEEE/WIC/ACM International Conference on Web Intelligence and
Intelligent Agent Technology, 2010.
[13] T. Noro, F. Ru, F. Xiao, and T. Tokuda, “Twitter User Rank Using Keyword
Search,” in EJC, 2012, vol. 251, pp. 31–48.
[14] B. Hajian and T. White, “Modelling Influence in a Social Network: Metrics and
Evaluation,” 2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust
and 2011 IEEE Third Int'l Conference on Social Computing, 2011.
[15] X. Jin and Y. Wang, “Research on Social Network Structure and Public Opinions
Dissemination of Micro-blog Based on Complex Network Analysis,” Journal of
Networks, vol. 8, no. 7, 2013.
[16] R. Cappelletti and N. Sastry, “IARank: Ranking Users on Twitter in Near
Real-Time, Based on Their Information Amplification Potential,” 2012 International
Conference on Social Informatics, 2012.
[17] C. Bigonha, T. N. Cardoso, M. M. Moro, M. A. Gonçalves, and V. A. Almeida,
“Sentiment-based influence detection on Twitter,” Journal of the Brazilian Computer
Society, vol. 18, no. 3, pp. 169–183, 2011.
[18] A. Aleahmad, P. Karisani, M. Rahgozar, and F. Oroumchian, “OLFinder: Finding
opinion leaders in online social networks,” Journal of Information Science, vol. 42, no.
5, pp. 659–674, 2016.
[19] F. Morone and H. A. Makse, “Influence maximization in complex networks
through optimal percolation,” Nature, vol. 524, no. 7563, pp. 65–68, 2015.

