Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB16)

Trending Topic Analysis Using Novel Sub


Topic Detection Model
Halima Banu S and S Chitrakala
Department of Computer Science and Engineering, Anna University, Chennai, Tamil Nadu, India.
(e-mail: halimabanu91@gmail.com , au.chitras@gmail.com).

Abstract—Twitter collects millions of tweets every day which Analysis of topics is truly strengthened by performing a
in turn serves as a rich information delivering platform. sentiment classification but summarization applications lack
However, some users, especially new users, often find it difficult this component and as a result produce conflicting summaries.
to understand trending topics in Twitter when confronted with Topics discussed in Twitter are very diverse and
overwhelming and unorganized tweets. Previously, there has unpredictable. Sentiment classifiers always dedicate
been attempts to provide a short snippet to summarize a topic
but, this does not scale up to user’s expectation as it does not
themselves to a specific domain or topic. Namely, a classifier
provide any analyzed summary. This work aims to develop a trained on sentiment data from one topic often performs
Trending Topic Analysis System to analyze trending topics by poorly on test data from another. Most of the recent
performing Topic based sentiment classification thereby applications for sentiment analysis deal with a single topic
summarizing public views on selected trending topics and to under interest. Another concern regarding sentiment
generate extractive sub summaries of topics over the time period classification is if carried out as a supervised approach it will
using a novel sub topic detection approach. Different from the incur heavy manual effort in annotation.Contributions of this
traditional summarization framework, conflicting summary
generation could be avoided with sentiment classification
paper focus in addressing the following issues: (1) Avoidance
enhanced by common and tweet specific feature extraction of generation of conflicting summaries (2) Topic Evolution
thereby sorting the data into separate sentiment corpus. Volume- with efficient sub topic detection model to uncover hidden sub
based followed by topic modelled approach of detecting sub topic topics.
in the corpus help detect subtopics under the trending topic more Rest of the paper is organized as follows: In Section 2 we
efficiently. A new approach called Foreground Dynamic Topic discuss the literature survey pertaining to Sentiment
Modelling (VF-DTM) is proposed which distills the noisy content Classification and Sub Topic Detection for Twitter
and extracts the foreground tweets from the corpus and then
Summarization. System architecture and details about the
build the intended model on it. Efficiency of Sub topic detection
in turn increases the quality of the extractive sub summaries proposed model in described in Section 3. Experimental
generated. Results are discussed in Section 4 followed by Conclusion and
Future works in Section 5.
Index Terms—Trending Topic Analysis, Novel Sub-Topic
Detection, Foreground Dynamic Topic Modelling, VF-DTM II. LITERATURE SURVEY
A. Sentiment Classification
I. INTRODUCTION
One of the major issues of Sentiment classification using
Over the recent years Twitter has grown from a vague machine learning approaches is the need for labeled training
invention to become a mainstream medium for dissemination data. Twitter dataset lack labeled data which often poses a
of messages and the public discussion of news and events. The huge effort in manual annotation.
rapid proliferation of Twitter posts presents a big obstacle for H.Rui et all [1] used supervised algorithms namely Naïve
efficient information acquisition. It is impossible for a user to Bayes Classifier and Support Vector Machine Classifier for
get an overview of important topics on Twitter by reading all sentiment classification to investigate if Twitter WOM affects
tweets every day. In addition, because of information movie sales and if so the approach involved in it.
redundancy and the informal writing style, it is time Contributions included measuring the impacts of WOM from
consuming to find useful information about a topic from a people with different degrees of social connectivity exhibited
huge number of tweets. The tremendous volume of tweets in a social network. But the system incurred heavy labor work
raises a needful framework called summarization as the key to in labeling the huge training data set.
facilitating the requirements of topic exploration, navigation, Furthermore twitter data are very diverse; therefore to train
and search from hundreds of thousands of tweets. Specifically, a universal classifier was near impossible.
a summary that provides representative information of topics
with no redundancy and overlapped content is desired.

978-1-4673-9745-2 ©2016 IEEE


International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB16)

Some works have been done for cross domain sentiment capture the opinion expressed in the data thus leading to
analysis for review datasets like in [7], S.J.Pan et all used conflicting summaries. Also, sub topic detection mechanism
Spectral FeatureAlignment algorithm (SFA) for classifying needs more enhancement.
sentiment polarity, or bridging the gap between the domains
and to aligndomain-specific words from different domains into III. SYSTEM ARCHITECTURE
unified clusters. System significantly outperformed previous Trending Topic Analyzer is implemented through topic
approaches to cross-domain sentiment classification but could adaptive sentiment classification and multi tweet extractive
not be applied for twitter dataset. summarization. This work aims at generating chronologically
Xiaojun Wan [2] used Co training algorithm for training ordered sub summaries which throws light in understanding
and SVM Classifier was used for sentiment analysis. Major the topic evolution of the trending topics. Topic under study is
focus was on cross lingual sentiment classification which assumed to contained several hidden sub topics which are
leveraged English corpus for Chinese sentiment classification. revealed using the proposed novel model called Foreground
But the approach could not suite well for the nature of tweets. Dynamic Topic Modelling (VF-DTM). This enables an end
S.Lui et all [3] performed semi supervised multi class user to completely analyze a trending topic to a greater level
model for topic adaptive sentiment classification. The of detail.
classifier was initially built using common features which did
not adapt well for cross domains. Semi supervised learning A. Target Data Preparatory Steps
was used to adaptively learn the build a topic adaptive Target data collection is performed by extracting the
classifier. But the system needed topic labeled data in order to trending topic on a region basis. Only those topics which
apply the respective classifier model. Obtaining labelled data involves a topic development of user focus change will be
poses a major constraint which in turn reduces the adaptability considered for summarization in this work. Based on the
of this system to any kind of data to be analyzed. selected topics tweets are extracted from the Twitter and
stored in database. Necessary preprocessing such as URL
B. Sub Topic Detection for Twitter Summarization removal, Slang Word Replacement, Non English Tweet Filter,
The popularity of micro-blogging services, such as Twitter Stemmer and Stop Word Removal is performed to prepare the
has caught increasing attention from worldwide researchers. target useable data. Along with the preprocessing phase, the
There exist some pioneering researches working on Twitter proposed system also handles other language tweet translation
summarization. which is essential to prevent discarding of public opinion
D.Wen et all [4] accomplished Summarization using a non- about the trending topic.
parametric Bayesian model applied to Hidden Markov
Models. A novel observation model was designed to allow B. Topic Based Sentiment Classification
ranking based on selected predictive characteristics of Our sentiment classification is inspired by the method
individual tweets. Major focus was to investigate the proposed by Shenghua Liu et al. in [3] for labelling the corpus
possibility of using a temporal probabilistic data model known with their corresponding sentiments. Topic based sentiment
as Hierarchical Dirichlet Process Hidden Markov Model classification is divided into two phases namely Feature
(HDP-HMM) to process a stream of tweets pertaining to a Extraction and Model Building.
single subject and cluster the tweets into groups or rankings Feature extraction involved in this system creates a
based on the value of the individual tweets .But Summaries significant impact in the performance of sentiment
generated did not project the temporal nature. classification task. Tweets have a special nature unlike normal
English sentences like @ symbol used to refer to a user,
F.Lui et all [5] proposed a Concept Based optimization emoticons expressed in the tweets etc. Incorporating such
framework for topic Summarization using Integer Linear features while performing feature extraction enhances the
Programming. Target data comprised of original and overall process. Features extracted could be classified into two
normalized tweets along with web content relevant to the broad categories namely Common Features and Tweet
topic. The focus was not on developing new summarization Specific Features. With the extracted features classifier model
systems but rather utilizing and integrating diverse text is built based on each topic without exhaustively labelling the
sources to generate more informative summaries but there training set. Support Vector Machine is used as the classifier
were lack of series of sub events identification to show topic whose appropriateness to this work will be examined in later
evolving process. sections. Common features are a combination of common
B.O’Conor et all [6] explained Twitter topics by presenting sentiment words and topic sentiment words which will be used
a simple list of messages. An exploratory search application to train the classifier.With POS tagging for tweets on a topic
for Twitter called TweetMotif grouped messages by frequent and removing the common sentiment words, proposed
significant terms and retrieved results were further narrowed framework selects the frequent adjectives (aj), verbs (v),
down through a search interface. But the system lacked nouns (n) and adverbs (av) as candidates of topic-adaptive.
temporal aspect in summaries generated and topic Figure 1 below shows the system architecture depicting the
development was not observed since sub topic detection proposed system model. Following Algorithm 1 describes the
process was not employed. D.Gao et all [8] generated steps involved in tagging the tweets.This forms the topic
sequential summaries on a topic using Stream and Semantic adaptive words. Common adaptive word set is built using
based approaches for sub topic detection. But the approach fell generic positive and negative oriented words.
short of readability of summaries generated. System failed to

978-1-4673-9745-2 ©2016 IEEE


International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB16)

end for
Fig.1 System Architecture.
for every user in addUserList do
extractSentiment(user)
Algorithm 1: Common Feature Extraction extractPostTime(user.tweet)
Input : Pre-processed tweets end for
Output : Feature Vector
for each pre-processed tweet in topic e do Based on the post time of each user , feature vector will be
words [] =POSTagger (tweet) built based on the sentiment of the user‘s tweet and its post
frequencyCount (words []) time. Emoticons present in the tweet text serve as a simple
targetTagList (v, n, av, aj) way to predict the sentiment of the tweet. Therefore using a
for each word in targetTagList emoticon dictionary as s basis, emoticon feature value will be
if (word not in CommonSentimentWordList) computed.
addTopicAdaptiveWordList(word) Once the initial features are extracted , classifier model is
end if built using a collaborative training approach where the
end for unlabelled data and the features extracted will be iteratively
end for augmented with the initial labelled set.

Tweet specific features include the @-network features, C. Tweet Summarization Framework
user sentiment features and emoticons. These features captures Topic evolution is achieved using a two-step process in the
the special nature of tweets as features. In twitter, @ symbol summarization framework as used in [8] namely Sub Topic
in a tweet is used to refer to a user whom a person would Detection and Sub Summary Candidate Selection. Tweets are
tweet to e.g. @abc means that the tweet is referring the user prone to contain noisy data due to various reasons such as
abc. Using this symbol useful feature can be extracted there by limited length of a tweet text, informal language used by the
eliminating the need to label the entire dataset.Algorithm 2 user tweeting etc.
details about the tweet specific feature extraction where child Sub Topic Detection process in the proposed framework
edges (CEdge) and parent edges (PEdges) will be created for a involves a Volume based approach followed by the novel
Topic Modelling approach called the Foreground-Dynamic
directed graph model according to the respective criteria.
Topic Modelling. Tweet stream that is collected may contain
many irrelevant data which may tamper the quality of the
Algorithm 2: Tweet Specific Feature Extraction extractive summary generated.
Input : Pre-processed tweets This raises a need to clean and extract only the relevant
Output : Feature Vector corpus. Noisy and irrelevant tweets given a trending topic is
for each tweet in topic e in training set L do handled by a bi-level process of executing Volume based
initialize (vertices-tweet) approach and; the output from the previous step i.e. the tweet
initialize (d-edges- @-relations) set collected will be fed into the novel topic modelling
end for approach.
for every vertexi in vertices do Sub topics are captured using eq. (3.10) where the word
for every vertexj in vertices do distribution evolves over the time period.
if(postTime(vertexi)<postTime(vertexj))
βt,k | βt−1,k ~ N(βt−1,k ,σ 2 I ) (1)
createCEdge(vertex i,vertexj)
else Tweet distribution is obtained using eq. (3.11) where the
createPEdge(vertex i,vertexj) tweet at time interval at time interval t depends on t-1.
end if a t | α t −1 ~ N (α t −1 , δ 2 I ) (2)
end for
end for
for each tweet in topic e in training set L do
if(user.tweetCount>2)

addUserList(user)
end if
978-1-4673-9745-2 ©2016 IEEE
International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB16)

VI. EXPERIMENTAL RESULTS


Twitter API allows to track and download the trending
topics given the WOEID (Where On Earth Identifier) of a
region. Based on the selection criteria, only those topics which
involves topic development or user focus change will be
considered and hence downloaded. Tweets from July 2015 to
September 2015 are collected. Usually a ratio of 7:3 will be
followed in hand classifying the training set and the remaining
thirty percent would be for testing. But the classification
Fig.2 Foreground Dynamic Topic Modelling
approach followed in this work is topic based and it is
adaptive in nature. Therefore, excessive manual labelling is
Algorithm 3: Foreground Dynamic Topic Modelling VF-DTM
avoided reducing the initial training set to about 10 to 20
Input : Sentiment labelled tweets detected from Volume percent. The classifier modelled is independent of the topic
based approach and K sub topics under study. Table 1 justifies the superiority of SVM
Output : Tweets modelled under each Topic classifier.
begin
populate foreground and background sets TABLE I
APPROPRIATENESS OF CLASSIFIER USED
capture sub topics with eq(1)
capture sequential structure between models as Classifier Name Accuracy
eq.(2s)
for each tweet do SVM 0.800
capture variational state space model as in eq(3) Ada-Boost 0.676
for each word do Decision Tree 0.710
compute eq.(4)
Naïve Bayes 0.713
update variational parameters
Random Forest 0.722
end for
end for
end Weka [10] implementation of classifiers namely Ada-Boost,
Decision Tree, Naïve Bayes and Random Forest are used for
Variational state space model is used approximate comparison. Now, the topic wise sentiment labelled data will
posterior inference using eq. (3) and (4). be considered for summarization task. Performance of the
^ extractive summary is measured using two popular metrics for
β t | β t ~ N ( β t ,ν t 2 I ) (3) summarization namely Coverage and Novelty. The coverage
wt ,n | β t ~ Mult (π ( β t )) (4) is defined using eq. (6):

 Count
exp( β t , W )
where, π ( β t ) = match ( N − gram)
 exp( β t , W )

S H S
w 1 1 d j ∈D N − gram∈di ,di
Coverage = .
where, at -per tweet distribution at time t, ν t -sub topic | DH |
d i ∈D H
wij
 Count ( N − gram)
distribution. At time t, z t -sub topic at time t, β t -per word d j ∈D S N − gram∈d iH

distribution at time t, wt is the specific word. Before (6)


generating the extractive summary for the topic modelling, we | j − i | +1, j ≠ i
where, wi, j = 
first need to sort the subtopics detected from F-DTM 1, j = i
according to their temporal information. Particularly, we use
Reason candidate model [9] to output the relationships where, D H and D S are human and system generated sub
between tweets and subtopics to assign each tweet to the summaries, i and j are the index of the sub summary. wi, j is
subtopic that it most likely belongs to. Most probable tweets the weight added corresponding to the matches between the
will be associated to each sub topic by the extracting the word system and human generated sub summaries. Novelty is
distribution only from the foreground set and not the defined using eq. (7):
background set. This process is applicable for mining topics
for specific set of data collection. The tweets being assigned Novelty =
1
| D | −1  (I
i >1
d i − I d i ,d i −1 ) (7)
signify that they best represent the topic under consideration.
Finally using eq. (5) , scores of the tweets are computed where, I di is the Information Summary and I d i , d i−1 is the
and the highest scored tweets will be considered under each overlapped information. This is used to calculate the average
sub topic as the extractive sub summaries. increment of information content of two adjacent sub
Score(Tweet ) =  p ( word | z t ), tweet i ∈ Z t
words∈tweeti
(5) summaries. Performance of baseline systems namely LDA,
DTM and Heuristic systems are compared based on the
Redundancy in each sub summary is handled using Maximal metrics namely Coverage and Novelty of the summary
Relevance Model as in [8]. generated and is shown in Table 2 below.
TABLE 2

978-1-4673-9745-2 ©2016 IEEE


International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB16)

COMPARISON WITH BASELINE SYSTEMS proceeding with building the required model has proved to be
Proposed an efficient choice.
Trending Topic Metric Heuristic LDA DTM
VF-DTM
Coverage 0.270 0.310 0.330 0.39
#Yakub Memon V. CONCLUSION
Novelty 0.390 0.735 0.795 0.800
#Black Day Coverage 0.264 0.305 0.330 0.380 Thus the proposed Trending Topic Analysis System has
For Indian
Democracy
Novelty 0.390 0.717 0.771 0.794 been developed that is able to analyse twitter trending topics
in a constructive manner. Tweet corpus will be sentiment
labelled with common and tweet specific feature extraction,
thus enhancing the classification task. Sentiment classified
tweets are forwarded into novel summarization framework in
which the proposed Foreground Dynamic Topic Modelling
(VF-DTM) approach enables in detecting sub topics
accurately by avoiding noisy data .Then the system extracts
the most significant tweets to generate extractive summaries
using a graph based approach using proposed salient features
of the tweets. The system can be further enhanced by
normalizing the selected significant tweets for better
understanding and can be modelled to generate abstractive
summaries of the topic which also contributes to the
readability of the summary

REFERENCES
Fig.3 Comparison with Baseline Systems
[1] Rui, Huaxia, Yizao Liu, and Andrew Whinston. “Whose and what
chatter matters? The effect of tweets on movie sales.” Decision Support
From Figure 3 it is learnt that Coverage performance of all Systems55.4 (2013): pp.863-870.
the baseline systems are noticeably less than the proposed [2] Wan, Xiaojun. “Co-training for cross-lingual sentiment classification.”
system. And for Novelty, Heuristic systems show a very poor Proceedings of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language
performance , LDA gives an acceptable performance whereas Processing of the AFNLP: Volume 1-Volume 1. Association for
DTM and the proposed system almost achieves near Computational Linguistics, 2009.pp.235-243
performance of 7% increase compared to the other two, still [3] S.Liu, X.Cheng, F.Li, and F.Li, “TASC: Topic-Adaptive Sentiment
the proposed system VF-DTM is 2% higher than DTM. Classification on Dynamic Tweets”, IEEE Transactions on Knowledge
Sub topic segmentation also contributes to the quality of and Data Engineering, Vol. 27 ,no. 6,pp. 1696 - 1709, Jun. 2011 .
the summary generated. Evaluation of sub topic segmentation [4] Wen, Dunwei, and Geoffrey Marshall. “Automatic Twitter Topic
Summarization.” Computational Science and Engineering (CSE), 2014
is performed using average cosine similarity due to the fact IEEE 17th International Conference on. IEEE, 2014, pp-207-212
that the sub topics tend to cluster the tweet stream into [5] Liu, Fei, Yang Liu, and Fuliang Weng. “Why is SXSW trending?:
segments and each such segment will focus on one sub topic. exploring multiple text sources for Twitter topic summarization.”
Cosine similarity for two successive sub topics are evaluated Proceedings of the Workshop on Languages in Social Media.
Association for Computational Linguistics, 2011,pp.66-75
to depict the segmentation performance of the proposed
[6] O'Connor, Brendan, Michel Krieger, and David Ahn. “TweetMotif:
system. Table 3 displays the average cosine similarity values Exploratory Search and Topic Summarization for Twitter.” ICWSM.
exhibited by various baseline systems for various sample 2010,pp.384-385
topics. [7] Pan, S. J., Ni, X., Sun, J. T., Yang, Q., & Chen, Z. (2010, April). “Cross-
domain sentiment classification via spectral feature alignment.”
TABLE 3
InProceedings of the 19th international conference on World wide
SUB TOPIC SEGMENTATION PERFORMANCE
web (pp. 751-760). ACM.
[8] D.Gao, W.Li, X.Cai, R.Zhang, and Y.Ouyang, “Sequential
Sub Topic Detection Cosine Similarity Summarization: A Full View of Twitter Trending Topics”, IEEE
Models Transactions on Audio, Speech, and Language Processing, Vol. 22, no.
2,pp. 293-302,Feb. 2014.
Heuristic Baseline 0.841
[9] S.Tan, Y.Li, H.Sun, Z.Guan, and X.Yan, “Interpreting the Public
LDA 0.688 Sentiment Variations on Twitter”, IEEE Transactions on Knowledge
OPAD 0.693 and Data Engineering, Vol. 26, no. 5, pp. 1158-1170, May. 2014.
[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
DTM 0.677
Witten, “The weka data mining software: An update,” ACM SIGKDD
Proposed VF-DTM 0.6585
[11] Chen, Yan, Xiaoming Zhang, Zhoujun Li, and Jun-Ping Ng. “Search
Performance of Volume based approach in sub topic engine reinforced semi-supervised classification and graph-based
segmentation proves that detecting topic merely based on summarization of microblogs.” Neurocomputing 152 (2015): 274-286.
volume will affect the segmentation performance leading to
unidentified topics in the corpus. DTM uses semantic
approach and not merely volume but the performance of
proposed system (VF-DTM) is still better than DTM due to
the reason that the model building process is not interrupted
with noisy content. Therefore distilling noise and further
978-1-4673-9745-2 ©2016 IEEE

You might also like