
IEEE - 43488

Automated Topic Modeling and Sentiment Analysis of Tweets on SparkR
Prema Monish
Directorate of Training and Lifelong Learning, M. S. Ramaiah University of Applied Sciences, Bengaluru, India
premamonish12@gmail.com

Santoshi Kumari
Department of CSE, M. S. Ramaiah University of Applied Sciences, Bengaluru, India
santoshik29@gmail.com

Narendra Babu C
Department of CSE, M. S. Ramaiah University of Applied Sciences, Bengaluru, India
narendrababu.c@gmail.com

Abstract—Advances in mobile and internet technology have improved communication and freedom of expression on social networks, blogs and websites. Twitter is one of the most popular social media platforms and gives people the freedom to share their views, thoughts and opinions with the world. Analyzing large-scale tweets by pooling many individuals' opinions on a particular context allows us to find various hidden topics and insights. This paper proposes an automated topic modeling technique based on LDA to identify the topics of discussion in large-scale tweets related to two famous political leaders of India. The paper implements the topic modeling method on the SparkR framework to improve speed and performance for large-scale, real-time social data processing and analysis. Finally, sentiment analysis of the tweets is carried out using a lexicon-based approach to identify people's sentiment towards these two leaders. The empirical results identify various previously unknown topics as well as people's interests, expectations and concerns on these topics. The results also show that automated topic modeling and sentiment analysis of tweets on the SparkR framework is faster than on the plain R tool.

Keywords—Twitter data analytics; LDA; SparkR; big data; sentiment analysis; topic modeling

I. INTRODUCTION

Social media analytics uses various techniques to analyze social media data for knowledge discovery and information-driven decision making in fields such as business, healthcare, education, government, politics and social network analysis. Social media such as Twitter, Facebook, WhatsApp, LinkedIn and Instagram generate large amounts of real-time data, which can be extracted for social media analytics to find new insights. The volume of information on social media is growing rapidly with the increased use of internet and mobile technologies, so analyzing such large volumes of data in real time becomes difficult. This big social media data can be analyzed using a combination of data mining, web mining, text mining and natural language processing methods on big data tools.

Topic modeling is a statistical machine learning approach used to find semantic structures in email text, blog posts, speeches, journal articles, book chapters, social media data and many other kinds of unstructured text. Topic models cannot understand the meaning and concepts of words in text documents; however, this probabilistic model helps to discover correlations among latent words and to find a topic for a group of words that occur in a set of documents. Combining those words helps to find new hidden topics from groups of correlated words. Topic modeling is accomplished using various methods and algorithms. One of the most commonly used methods is LDA.

In this paper LDA is considered for automated modeling of the content of tweets. However, it is very challenging to process unstructured Twitter data: it contains linguistic content that requires natural language processing and cleaning of 140-character tweets carrying individuals' opinions and sentiments and wide discussion on various hidden topics.

Latent Dirichlet Allocation (LDA) is a probabilistic topic modeling technique introduced by Blei, Ng and Jordan [1], which represents texts as mixtures of topics, where each topic groups words with certain probabilities. In the LDA method, the assignment to one of the K topics is checked repeatedly for each word in every document. The entire set of documents is checked multiple times, and this iterative process helps LDA to generate a final solution with consistent topics. Various methods such as Gibbs sampling, the Expectation-Maximization (EM) algorithm and Variational Bayes (VB) inference are proposed to estimate the parameters of an LDA model with K topics [2]. Gibbs sampling is a Markov chain Monte Carlo algorithm for generating samples from a joint distribution. EM uses an iterative method for estimating parameters in statistical models that depend on certain latent variables [2]. VB is an alternative to Gibbs sampling for statistical inference over complex distributions that are difficult to evaluate directly. VB requires a large amount of work compared to Gibbs sampling to derive the set of equations used to iteratively update the parameters.

Twitter is a large online social networking and microblogging site that has gained popularity over time. Statistics show that, with the growing use of social media and mobile communication, the number of active Twitter users in India increased from 26.7 million in 2017 to 30.4 million in 2018. Thus a lot of information is shared by users on Twitter. On Twitter, users post and communicate through messages called "tweets". The

9th ICCCNT 2018


July 10-12, 2018, IISC, Bengaluru
Bengaluru, India

information on Twitter, such as tweets, retweets and user profiles, is analyzed for personality traits, sentiment analysis, opinion mining, political analytics, social network analysis, event analysis and much more [2]. Analyzing this large volume of real-time Twitter data can be achieved using big data tools like Hadoop and SparkR [3]. In this work we analyze real-time Twitter data on the SparkR framework. R is an open-source programming language widely used for statistical data analysis; Spark is combined with R to improve the performance and speed of processing.

In the present paper, real-time tweets related to two famous Indian political party leaders, Narendra Modi and Rahul Gandhi, are extracted from Twitter. The volume of tweets about the two leaders can be linked to party performance and to people's opinion of each leader by performing sentiment analysis and topic modeling. The extracted real-time Twitter data is analyzed on the SparkR framework by applying LDA topic modeling to find the topics related to each leader's tweets. The analyzed data is used as feedback on public needs and aspirations for taking future political decisions.

This paper mainly addresses the following three implementations.
1. Real-time Twitter data extraction and preprocessing on the SparkR framework to speed up the process.
2. Automated topic modeling of tweets using LDA with the Gibbs method to find hidden topics in the corpus of tweets.
3. Finally, sentiment analysis of tweets to identify people's sentiment on the presented scenario.

The remainder of the paper is organized as follows. Section II gives an overview of past research on Twitter data analytics and LDA. Section III gives a detailed description of the proposed system. Section IV gives the visualization and explanation of the experimental analysis. Section V gives the conclusion and future work.

II. RELATED WORK
Several research works have been carried out in the areas of social media analysis, topic modeling, Twitter data analytics and big data analytics. Various methods of text mining, natural language processing and topic modeling are applied in fields such as software engineering, social media analytics, image classification and annotation, political science, medicine and linguistics. One of the popular latent semantic analysis methods is LDA, used for automated topic identification and modeling in large data sets.

Various models have been proposed on top of LDA [2], and different techniques are used to extract information from Twitter, such as hashtag analysis, sentiment analysis, event analysis, opinion mining, identification of influence, Twitter network topology, business value analysis and many more [4]. Social media analytics can be used to identify people's opinions and sentiments towards a political party. Analyzing political party Twitter data using text mining and lexical methods helps in understanding people's emotions towards political parties [5]. LDA is a well-known topic modeling algorithm that can be used effectively for discovering topics in social media content. This approach is widely used in various domains to automatically mine hidden topics [6].

Other researchers use LDA topic modeling to analyze the Twitter message content of challenging social events in Kenya. Normalized Mutual Information (NMI) and topic coherence analysis were used to find the LDA parameters for a model that outputs more informative and coherent topics. A change in the number of topics resulted in a significant change in the output [7].

A three-layer hierarchical Bayesian model is applied to adverse drug reaction data, which allows targeted clinical safety to be predicted in the early preclinical stage. LDA with the collapsed Gibbs sampling method is applied for dimensionality reduction and to obtain higher prediction accuracy with a short run time [8]. Securing networks from intrusions has been a critical issue. LDA-based Automatic Rule Generation (LARGen) with Gibbs sampling can be applied to analyze malicious traffic and to find attack signatures that are used as intrusion detection system rules. These rules can be used to identify cyber attacks with a high accuracy rate [9].

The opinion-lexical approach is a text mining technique for sentiment analysis that classifies tweets as positive, negative or neutral; it has been applied to two famous political figures of India, Narendra Modi and Arvind Kejriwal. The results provide feedback on public opinion, helping the leaders deal with political affairs in a better way and identify areas they need to improve [10]. To characterize malicious Android apps, an advanced LDA with a Genetic Algorithm (GA) is applied to app descriptions and sensitive data-flow information to generate topic-specific data-flow signatures. These topic-specific data-flow signatures can be used to characterize malware and malicious behavior [11].

A two-layer matrix factorization method is proposed for topic modeling to detect latent variables in legislative speeches of the European Parliament (EP) from the period 1999-2014. The results show that the members of the EP reacted to external and internal stimuli when making parliamentary speeches [12].

Topic modeling has found wide application in recent years. Various flavors of LDA have been proposed depending on the application and the performance of the algorithm, such as LDA with Gibbs sampling, LDA with GA and LARGen.

III. PROPOSED SYSTEM
The proposed system aims to develop and implement an automated topic modeling technique based on LDA to identify the topics of interest in large-scale tweets related to two famous political leaders of India. Interpreting the LDA output is difficult for humans. The LDAvis package [13] is used to visualize the results by creating a JSON object that feeds an effective visualization template. Sentiment analysis of the tweets is also carried out using the NRC lexicon approach [5] to identify people's sentiment towards the two leaders. The empirical system is implemented in the following steps.

A. Connecting R with Spark using SparkR Package
R is open-source software widely used for statistical analysis, data analytics, visualization and reporting. R


programming made its debut in 1993; it was developed by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland, New Zealand [3]. R is very popular for statistical data analysis and data science. It has a single-threaded execution model, which limits the size of the data that can be analyzed. To speed it up, there are many proposals for adding parallelism to its execution [14].

There are several ways to enable big data processing in R; packages such as RHadoop, Rhipe and SparkR are provided by the R community. SparkR performs better than the other two approaches, which use MapReduce [15]. Therefore this model uses SparkR, an R package that provides a light-weight frontend for using Apache Spark from R. SparkR provides a distributed data frame that supports data manipulation on large datasets, similar to dplyr in R. SparkR depends on the SPARK_HOME environment variable, so this variable must be set before starting the Spark session that connects the R program to Spark. The SparkR library is then loaded to speed up execution using multiple threads.

B. Data Collection
Unstructured real-time data is extracted from Twitter through the Twitter API after authenticating with it. 18000 tweets and retweets are extracted for Rahul Gandhi and Narendra Modi, dated from 2017-04-18. The extracted data is stored as a data frame for further manipulation and analysis.

C. Processing and cleaning data
In this step only the text of the tweets is extracted from the data frame to create the corpus. As discussed before, tweets are highly unstructured and contain unwanted features such as http, @, #, www, URLs, punctuation, quotation marks, "via", stop words, numbers and white space, which do not carry any opinion or sentiment. All unwanted features are removed from the extracted data by applying text mining techniques.

D. Feature selection
In this step the preprocessed and cleaned data is used for feature selection. All documents are converted to lower case to maintain uniformity. Subsequently each document is tokenized into a set of words. Using lemmatization and stemming, related and similar features are reduced to a single type.

E. Sentimental analysis
In this step an unsupervised lexicon-based approach is used for emotion classification [5]. The emotion level is calculated by associating the collected tweets with the emotion and sentiment scores generated by the NRC lexicon method. The syuzhet package is used to perform this sentiment analysis. The calculated emotions can be presented graphically, which makes the results easy to understand.

F. Create Term document matrix TF-IDF
Given the corpus and a dictionary of terms that contains all words in the corpus, a document-term matrix is created. It is a two-dimensional matrix whose rows are the documents and whose columns are the terms in the dictionary, such that each entry (i, j) represents the frequency of term j in document i. TF-IDF (term frequency - inverse document frequency) represents the importance of a word for a document relative to its importance across the overall corpus.

G. Calculating LDA
Every document is a mixture of topics, and each topic consists of a distribution over words; for example, the selected dataset of Narendra Modi tweets covers many related issues. LDA is a topic modeling technique which assumes that each document is represented by a probability distribution over latent topics. The proposed system uses Gibbs sampling and the harmonic mean to find the value of k. Gibbs sampling is relatively faster than EM [16]. Gibbs sampling marginalizes over the latent variables and is adopted to find the best assignment of related words to topics according to the maximum a posteriori log-likelihood [17]. Instead of choosing k manually, the harmonic mean of the log-likelihood is computed for each candidate number of topics, and the candidate with the maximum value gives the number of topics k [17].

A document is a sequence of N words denoted by d = (w_1, w_2, ..., w_N), where w_n is the nth word in the sequence. A corpus D is a collection of M documents denoted by D = (d_1, d_2, ..., d_M), where d_m is the mth document in the corpus. The generative process of LDA [2] is as follows:
(a) Choose a multinomial distribution φ_t for topic t (t ∈ {1, 2, ..., T}) from a Dirichlet distribution with parameter β.
(b) Choose a multinomial distribution θ_d for document d (d ∈ {d_1, d_2, ..., d_M}) from a Dirichlet distribution with parameter α.
(c) For a word w_n (n ∈ {1, ..., N_d}) in document d,
    i. Select a topic z_n from θ_d.
    ii. Select a word w_n from φ_{z_n}.

The probability of the observed data D is computed and maximized as follows:

Terminologies used:
k - number of topics,
α - parameter of the Dirichlet prior on the per-document topic distributions,
β - parameter of the Dirichlet prior on the per-topic word distribution,
θ_m - the topic distribution for document m,
φ_k - the word distribution for topic k,
Z_mn - the topic for the nth word in document m,
W_mn - the specific word.
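For reference, the joint distribution usually written for smoothed LDA can be stated with the symbols defined in Section III-G. It is the standard textbook form, given here as a sketch under the assumption that K is the number of topics, M the number of documents and N_m the length of document m; it is not claimed to be the exact expression used in the experiments.

```latex
p(W, Z, \theta, \varphi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\varphi_k \mid \beta)
    \prod_{m=1}^{M} p(\theta_m \mid \alpha)
    \prod_{n=1}^{N_m} p(Z_{mn} \mid \theta_m)\, p(W_{mn} \mid \varphi_{Z_{mn}})
```

The likelihood of the observed words W follows by integrating out θ and φ and summing over the topic assignments Z. Gibbs sampling draws samples from the resulting posterior instead of evaluating that integral directly.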

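The Gibbs sampling step of Section III-G can be illustrated with a minimal collapsed Gibbs sampler. This is a plain Python sketch rather than the R package used in the experiments; the function name `gibbs_lda`, the default values of `alpha` and `beta`, the iteration count and the toy corpus are all assumptions made for illustration.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is assigned to topic k
    topic_total = [0] * n_topics                              # words assigned to topic k overall
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional p(z = t | all other assignments), up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates of theta (per document) and phi (per topic).
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi

# Toy corpus (hypothetical): word ids 0-2 and 3-5 form two obvious groups.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 4, 3], [0, 2, 1, 2], [5, 3, 4, 5]]
theta, phi = gibbs_lda(toy_docs, n_topics=2, vocab_size=6)
```

Each row of `theta` is a document's topic mixture and each row of `phi` is a topic's word distribution, so both sum to one. On real tweets the word ids would come from the document-term matrix vocabulary.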

LDA generates latent topics over the whole corpus. As discussed above, interpreting LDA output is very challenging. The topics identified by LDA can be visualized using the LDAvis package [13], which provides a web-based interactive visualization of topics. The LDAvis plot has two basic panels. The left panel presents an overall view of the topic model and the relations between topics; each circle corresponds to a topic. The right panel is a horizontal bar plot of the individual terms that are most useful for interpreting the currently selected topic.

IV. EXPERIMENTAL RESULT ANALYSIS

A. Extracting twitter data
For the experiment, 18000 real-time tweets are extracted using the Twitter API for each leader, dated from 2017-04-18 to 2018-04-30. The extracted data is stored as a data frame and preprocessed by removing noise and unwanted data. For feature selection, text is extracted from the retweets and split into words to find insightful meaning.

Fig 1. People emotions score for Narendra Modi tweets

B. Determine TF - IDF
The document-term matrix holds the frequency of the words that occur in each document. It is calculated using the 'tm' package, and TF-IDF is then calculated from it. Sparse entries in the document-term matrix are removed. The document-term matrix had 18000 tweets and 11104 terms; after removing sparse entries it has 10172 tweets and 11104 terms.
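The cleaning, document-term matrix and TF-IDF steps described in Sections III-C, III-F and IV-B can be sketched as follows. This is an illustrative plain Python version, not the 'tm' package itself; the stop-word list, the sample tweets, the helper names and the choice of a base-2 logarithm with length-normalized term frequency are assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "on", "via", "rt"}

def clean_tweet(text):
    """Lower-case a tweet and strip URLs, mentions, '#', digits and punctuation."""
    text = text.lower()
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                      # user mentions
    text = re.sub(r"[^a-z\s]", " ", text)                  # '#', digits, punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, rows): rows[i][j] is the count of term j in document i."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        rows.append(row)
    return vocab, rows

def tf_idf(rows):
    """Weight each count by (count / document length) * log2(N / document frequency)."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in rows:
        total = sum(row)
        weighted.append([
            (row[j] / total) * math.log2(n_docs / doc_freq[j]) if total and row[j] else 0.0
            for j in range(n_terms)
        ])
    return weighted

# Made-up tweets for illustration only.
tweets = [
    "RT @user: Jobs and growth are the priority! https://example.com #economy",
    "Farmers need support, jobs need growth",
    "Growth growth growth 2018",
]
docs = [clean_tweet(t) for t in tweets]
vocab, dtm = document_term_matrix(docs)
weights = tf_idf(dtm)
```

A term such as "growth", which occurs in every sample tweet, receives a weight of zero, while a term concentrated in one tweet is weighted up. This is the behaviour that makes TF-IDF useful for separating topical words from ubiquitous ones.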

C. Sentimental analysis
The emotion level is calculated by associating the collected tweets with the emotions and sentiments generated by the NRC lexicon method. The syuzhet package is used to calculate the sentiment and emotion scores. Table 1 lists the sentiment scores for Narendra Modi and Rahul Gandhi. Figure 1 and Figure 2 give the sentiment and emotion visualization for the tweets of both leaders.

TABLE 1 SENTIMENT ANALYSIS SCORES
Sentiment scores    Narendra Modi    Rahul Gandhi
anger               442              481
anticipation        380              386
disgust             348              368
fear                474              526
joy                 331              329
sadness             415              447
surprise            220              221
trust               572              569
negative            984              1104
positive            1017             988

Fig 2. People emotions score for Rahul Gandhi tweets

D. Applying and visualization LDA
LDA with Gibbs sampling is applied to the document-term matrix using the 'topicmodels' package. The obtained LDA topics are visualized using the 'LDAvis' package. Here k = 7 topics are obtained by applying the harmonic mean to the likelihood. The time taken on SparkR to preprocess, analyze and visualize 18000 tweets is 127.33 s, whereas the same task on the same 18000 tweets without Spark takes 135 s.

In Figures 3, 4, 5 and 6 the left panel presents an overall view of the topic model and the relations between topics; each circle corresponds to a topic. The right panel is a horizontal bar plot showing the top 30 terms that are most useful for interpreting the currently selected topic. Figures 3, 4, 5 and 6 show the LDA visualization for the top eight topics discussed for the two famous political faces of India. Figures 3 and 5 depict the overall term frequency across all topics. Figures 4 and 6 represent the frequent terms of the first cluster.
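The harmonic-mean selection of k mentioned in Section IV-D can be sketched as below. The log-likelihood values are invented for illustration and are not results from the experiments; the function names and the use of a log-sum-exp shift for numerical stability are likewise assumptions about how one might implement the idea in plain Python.

```python
import math

def harmonic_mean_log_likelihood(log_likelihoods):
    """Log of the harmonic mean of likelihoods, computed stably from their logs."""
    negated = [-ll for ll in log_likelihoods]
    peak = max(negated)
    # log-sum-exp of the negated values, shifted by the peak to avoid overflow
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in negated))
    return math.log(len(log_likelihoods)) - log_sum

def choose_k(samples_by_k):
    """Return the k whose sampled log-likelihoods give the largest harmonic mean."""
    scores = {k: harmonic_mean_log_likelihood(lls) for k, lls in samples_by_k.items()}
    return max(scores, key=scores.get), scores

# Hypothetical log-likelihood draws (kept after burn-in) for three candidate k values.
samples_by_k = {
    5: [-10520.0, -10498.5, -10510.2],
    7: [-10310.4, -10295.1, -10302.8],
    9: [-10350.9, -10342.3, -10361.0],
}
best_k, scores = choose_k(samples_by_k)
```

In practice each list would hold the log-likelihoods recorded over the retained Gibbs iterations of a model fitted with that k, and the candidate range would be wider than the three values shown.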

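The lexicon-based scoring behind Table 1 (Section IV-C) amounts to counting, for each emotion or polarity, the words that a lexicon associates with it. The sketch below shows that idea in plain Python with a tiny made-up lexicon; it is not the NRC lexicon or the syuzhet package, and the entries, sample tweets and function name are hypothetical.

```python
from collections import Counter

# Tiny made-up lexicon in the spirit of a word-emotion association list;
# the real NRC lexicon is far larger and these entries are illustrative only.
TOY_LEXICON = {
    "win": {"joy", "positive", "anticipation"},
    "trust": {"trust", "positive"},
    "fear": {"fear", "negative"},
    "corrupt": {"anger", "disgust", "negative"},
    "loss": {"sadness", "negative"},
}

def emotion_counts(tweets, lexicon):
    """Sum, over all tweets, how many words fall under each emotion or polarity."""
    totals = Counter()
    for tweet in tweets:
        for word in tweet.lower().split():
            for label in lexicon.get(word, ()):
                totals[label] += 1
    return totals

sample = ["Trust the win", "corrupt deals bring fear and loss"]
totals = emotion_counts(sample, TOY_LEXICON)
```

Summing such counts over all tweets of one leader gives one column of a table like Table 1, with a row per emotion plus the two polarity rows.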

Fig 3. Top eight clusters representing the eight topics obtained for Narendra Modi tweets

Fig 4. Term frequency of the first cluster, representing the first topic identified for Narendra Modi tweets

Fig 5. Top eight clusters representing the eight topics obtained for Rahul Gandhi tweets


Fig 6. Term frequency of the first cluster, representing the first topic identified for Rahul Gandhi tweets

E. Analysis of Result
The sentiment scores in Table 1 show a more positive sentiment towards Narendra Modi compared to Rahul Gandhi. These results can be used to identify people's expectations of and trust in both political leaders. Further analysis of large-scale tweets would be helpful in predicting the next prime minister of India.

The topic modeling results in Figures 3, 4, 5 and 6 help to identify the various topics discussed in relation to different issues and news about Narendra Modi and Rahul Gandhi. Further analysis of these topics can be used to identify opinion or sentiment related to various hidden topics in different applications such as healthcare, sports, education and terrorism.

V. CONCLUSION
Advances in the internet, mobile technology and social media give people the freedom to speak and to put their opinions and sentiments in front of the world. People's discussions, opinions and sentiments, in the form of tweets about two famous political faces of India, are collected and analyzed in real time by applying topic modeling and sentiment analysis techniques. Instead of reading individual tweets, 18000 tweets are collected and analyzed using an LDA-based topic modeling method to identify the various topics of discussion related to the two leaders. Further, sentiment analysis of the collected tweets is carried out using a lexicon-based method to identify the inclination of people's sentiment towards each leader.

The overall experimental analysis is carried out on the SparkR framework to improve the performance and speed of processing. To summarize, real-time automated topic modeling and sentiment analysis is performed over a large set of related tweets on SparkR.

REFERENCES
[1] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, 2003.
[2] H. Jelodar, Y. Wang, C. Yuan, and X. Feng, “Abstract :,” 2017.
[3] Y. G. Jung, K. T. Kim, B. Lee, and H. Y. Youn, “Enhanced Naive Bayes Classifier for Real-time Sentiment Analysis with SparkR,” pp. 141–146, 2016.
[4] H. Anber, A. Salah, and A. A. A. El-aziz, “A Literature Review on Twitter Data Analysis,” vol. 8, no. 3, pp. 241–249, 2016.
[5] S. Kumari, “Real Time Analysis of Social Media Data to Understand People Emotions Towards National Parties,” 2017.
[6] V. A. Rohani, S. Shayaa, G. Babanejaddehaki, and V. Ali, “Topic Modeling for Social Media Content: A Practical Approach,” pp. 397–402, 2016.
[7] M. Sokolova, K. Huang, S. Matwin, J. Ramisch, R. Black, C. Orwa, S. Ochieng, and N. Sambuli, “Topic Modelling and Event Identification from Twitter Textual Data.”
[8] C. Xiao, P. Zhang, W. A. Chaowalitwongse, J. Hu, and F. Wang, “Adverse Drug Reaction Prediction with Symbolic Latent Dirichlet Allocation,” pp. 1590–1596, 2017.
[9] S. Lee, S. Kim, S. Lee, J. Choi, H. Yoon, D. Lee, and J. Lee, “LARGen: Automatic Signature Generation for Malwares Using Latent Dirichlet Allocation,” vol. X, no. X, 2016.
[10] “Social Media Opinion Analysis for Indian Political Diplomats,” pp. 681–686, 2017.
[11] X. Yang, D. Lo, L. Li, X. Xia, T. F. Bissyandé, and J. Klein, “Characterizing malicious Android apps by mining topic-specific data flow signatures,” vol. 90, pp. 27–39, 2017.
[12] D. Greene and J. P. Cross, “Unveiling the Political Agenda of the European Parliament Plenary: A Topical Analysis,” 2015.
[13] C. Sievert and K. E. Shirley, “LDAvis: A method for visualizing and interpreting topics,” pp. 63–70, 2014.
[14] Y. El-khamra, N. Gaffney, D. Walling, E. Wernert, W. Xu, and H. Zhang, “Performance Evaluation of R with Intel Xeon Phi Coprocessor,” 2013.
[15] R. Huang and W. Xu, “Performance Evaluation of Enabling Logistic Regression for Big Data with R,” pp. 2517–2524, 2015.
[16] M. Banko, “A Comparison Of Expectation Maximization and Gibbs Sampling Strategies for Motif Finding,” pp. 1–7, 2004.
[17] M. Ponweiser, “Latent Dirichlet Allocation in R,” no. May, 2012.
