
Topic Modelling: A survey of topic models

Adnan Bashir¹ and Dr Sohail Asghar¹

¹COMSATS Institute of Information Technology, Islamabad, Pakistan

Abstract - In recent years there has been a significant increase in online data, and it has become very difficult to extract valuable information from such large collections. In this paper we begin by discussing the importance of topic models and why we need them, and then present some prominent topic models. The paper surveys various topic modelling methods and their applications, including social media network analysis and the use of topic modelling in software engineering and bioinformatics. We conclude with a discussion of social media analysis using topic modelling techniques.
Keywords - Topic Modelling, Latent Dirichlet Allocation, Probabilistic Topic Modelling, Probabilistic Latent Semantic Analysis, Latent Semantic Analysis, Self-Aggregation Topic Model, Pseudo-document-based Topic Model, Sparsity-enhanced PTM
1 Introduction

It is estimated that more than 40% of the world population now has access to the internet, up from less than 1% in 1995 [1]. With the emergence of social media platforms such as Facebook, Twitter and Tumblr, the amount of data on the internet has grown tremendously; recent statistics put the number of social media users worldwide at 2.34 billion [2]. According to Internet Live Stats, every second we send millions of emails, submit more than 40,000 Google search queries, and post 6,000 tweets [3]. All of this means we have a huge amount of data at hand, and extracting useful information from such a collection can be a very painful task. With the advent of the field of topic modelling, however, this task can be performed with relative ease. The survival of an organization in today's world depends heavily on data: the availability of timely and reliable information is essential for making informed decisions.

The main assumption in topic modelling is that the meanings of words in a document are logically related to each other [4]. We can therefore assume that logically related words occur together in a document more frequently than words that are unrelated or belong to different topics. One definition of topic modelling is the discovery of recurring patterns of co-related words [5].

We have divided our discussion into three sections. The first section discusses probabilistic topic models, most importantly Latent Dirichlet Allocation and its different variants. The second section discusses dynamic topic models, i.e. topic models for document collections that evolve with time. The third section focuses on topic modelling for short texts, given its importance for social media these days.

2 Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation, presented in 2003, is one of the most popular and widely used topic models [6]. It is a statistical, generative model: by generative we mean that it tries to model the process by which each document was written. It is a widely used unsupervised learning technique and is independent of domain and language.

The Latent Dirichlet Allocation model treats a document as a bag of words. It assumes that a document is a mixture of topics and that each topic is a probability distribution over words; the words of a document are represented as draws from these distributions in accordance with their appearance.

The two most important factors in Latent Dirichlet Allocation are the per-topic word distribution and the per-document topic distribution. A high value for the per-document topic distribution means that a document contains a mixture of many topics rather than words related to a single topic, while a low value means that most of the words in the document belong to the same topic(s). Similarly, a high value for the per-topic word distribution means that a topic contains a broad mixture of words, while a low value means the topic concentrates on fewer words. Unlike Dirichlet-multinomial clustering, Latent Dirichlet Allocation has three levels [6]: the topic node is sampled repeatedly within a document, associating multiple topics with the document, and the topics generated within a document are infinitely exchangeable.

Figure 1 Results for collaborative filtering on the EachMovie data [6]

Figure 1 shows the comparison of Latent Dirichlet Allocation with probabilistic latent semantic indexing [7] and a unigram mixture model, from the work in [6]. It shows that the topics provided by Latent Dirichlet Allocation give much better results and can be used to filter particular topics from a large collection of text. The use of the Latent Dirichlet Allocation model is not limited to topic modelling: it has also been applied in bioinformatics, collaborative filtering and content-based image retrieval.
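The generative story just described can be sketched in a few lines. The topics, vocabulary and hyperparameter below are invented for illustration, and the Dirichlet draw is built from Gamma samples; this is a toy sketch of LDA's document-generation process, not an inference implementation.

```python
import random

random.seed(0)

def sample_dirichlet(alphas):
    """Draw one sample from a Dirichlet via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_discrete(probs, items):
    """Draw one item according to the given probabilities."""
    return random.choices(items, weights=probs, k=1)[0]

# Two hypothetical topics, each a distribution over a tiny vocabulary.
topics = {
    "sports":  {"game": 0.5, "team": 0.4, "model": 0.1},
    "science": {"model": 0.5, "data": 0.4, "game": 0.1},
}
topic_names = list(topics)

def generate_document(n_words, alpha=0.5):
    # Per-document topic proportions: one Dirichlet draw per document.
    theta = sample_dirichlet([alpha] * len(topic_names))
    words = []
    for _ in range(n_words):
        # The topic node is re-sampled for every word,
        # so one document can mix several topics.
        z = sample_discrete(theta, topic_names)
        word_dist = topics[z]
        words.append(sample_discrete(list(word_dist.values()),
                                     list(word_dist)))
    return theta, words

theta, doc = generate_document(10)
print(doc)
```

A small alpha makes the Dirichlet draw concentrate on few topics per document, matching the discussion of high versus low per-document topic distribution values above.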

2.1.1 Analysis of Complex Medical Datasets using Latent Dirichlet Allocation

Cluster analysis of large and complex biological and medical datasets can be performed with traditional clustering techniques, but the accuracy of such multivariate analyses is not reliable. With topic modelling, one can reduce the high dimensionality of the data to a small number of latent variables [8]. The researchers applied topic modelling techniques inspired by Latent Dirichlet Allocation and performed cluster analysis on three large medical datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset. The results were a huge improvement over traditional cluster analysis techniques.

2.1.2 Context-Based Image Retrieval using Latent Dirichlet Allocation

Latent Dirichlet Allocation has also been used extensively in context-based image retrieval. The semantic gap, coupled with time and memory complexity, makes it very difficult to retrieve images based on content alone, so researchers have proposed retrieving images using the information surrounding the image; metadata and tags on social media are the best examples of such information.

However, the keywords available in metadata and other text around an image are not necessarily helpful, as they may not contain the keyword an untrained user is looking for. Another limitation is that results obtained using conventional image retrieval techniques may contain irrelevant images. Latent Dirichlet Allocation was therefore used to extract hidden semantics for context-based image retrieval [9]. This addresses both of the problems mentioned above: the difficulty of retrieval, and the removal of irrelevant images from the result set through improved retrieval accuracy.

2.1.3 Collaborative Filtering and Latent Dirichlet Allocation

Latent Dirichlet Allocation was used very effectively with collaborative filtering in [10]. The work was inspired by websites on which users can create their own libraries and recommend articles to other users. The researchers in [10] created a recommender system for scientific articles using the Latent Dirichlet Allocation topic model together with collaborative filtering.

A good recommender system for scientific articles must have at least three features. First, it must recommend older articles, as users are often interested in the foundations of a field. Second, it must integrate new articles; since a new article is not yet present in many users' libraries, traditional collaborative recommendation systems fail here. Third, and most important, it must profile each user based on the articles in his or her personal library, which can be very valuable in the scientific community: it helps researchers connect with users who share their interests, and the articles themselves can be used to describe what kind of users keep them in their libraries.

The authors in [10] combine the ideas of collaborative filtering based on latent topic factors [11], [12], [13] with content analysis based on probabilistic topic modelling [6], [14], [15]. In the first step, the recommender system uses a user's information and liked articles to recommend similar articles to other users. Latent factor models, however, cannot generalize to previously unseen articles. Here topic modelling plays an important role, as it provides a representation of the latent themes in a document and thus also supports recommendation based on content.

Figure 2 Maximum likelihood from incomplete data via the EM algorithm. Here theta denotes θ_j (topic proportions) and theta correction denotes the offset ε_j. The 10 topics are obtained by joining the top 5 topics ranked by θ_jk and another top 5 topics ranked by |ε_jk|, k = 1, ..., K. Under CTR, an article of wide interest is likely to exhibit more topics than its text exhibits. For example, this article brings in several other topics, including one on Bayesian statistics (topic 10). Note that the EM article is mainly about parameter estimation (topic 1), though it is frequently referenced by Bayesian statisticians (and scholars in other fields as well). [10]
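The content side of this hybrid can be sketched as follows. The article names and topic proportions are invented; the sketch only shows how topic proportions let the system score an article no user has rated yet, while the real collaborative topic regression model in [10] also learns latent offsets from ratings.

```python
# Each article is represented by its LDA topic proportions (made up here).
articles = {
    "lda_paper":   [0.8, 0.1, 0.1],  # mostly topic 0 ("topic models")
    "matrix_fact": [0.1, 0.8, 0.1],  # mostly topic 1 ("recommenders")
    "new_arrival": [0.7, 0.2, 0.1],  # brand-new article, topics only
}

user_library = ["lda_paper"]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def user_profile(library):
    """Average the topic proportions of the articles a user already has."""
    vecs = [articles[a] for a in library]
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def recommend(library):
    profile = user_profile(library)
    candidates = [a for a in articles if a not in library]
    # A new article with no ratings can still be scored from its content.
    return sorted(candidates, key=lambda a: dot(profile, articles[a]),
                  reverse=True)

print(recommend(user_library))  # → ['new_arrival', 'matrix_fact']
```

This is exactly the "unseen article" case where pure latent factor models fail: the new article has no rating history, but its topic representation still supports a recommendation.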
Table 2 Comparison of LDA to LSI over Mozilla [20]

2.1.4 Use of Latent Dirichlet Allocation in Software Engineering

Latent Dirichlet Allocation has also been used extensively in the software engineering domain. One approach to information retrieval over source code uses latent semantic analysis [16], but it does not deal well with terms that have multiple meanings, and its results can be difficult to interpret because they are given in a numeric spatial representation. To address these problems, Probabilistic Latent Semantic Indexing was later introduced [17], but it is prone to overfitting, as described in [18], and it has not been successful at predicting the topic distribution of new documents [6], [19]. Latent Dirichlet Allocation can overcome these problems: it is a generative topic model, it can extract hidden information, and it has the ability to generalize to unseen or new documents [6].
Another important use of Latent Dirichlet Allocation is source code retrieval for bug localization [20]. In this model a document collection is built from the source code files, LDA analysis is performed on the generated collection, and the user may then query the model for bug localization. The authors in [20] built a tool that compares an issued query with each document in the collection and estimates the similarity between them; the results are ranked by this similarity measure. The rankings are very helpful in determining which parts of the source code need modification to fix a bug. Figure 3 shows the model proposed in [20]. The approach was evaluated in several case studies, including Mozilla and Eclipse; the comparison between latent semantic indexing and the proposed model is given for the two case studies in Table 1 and Table 2 respectively.
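The ranking step can be sketched as follows, assuming each source file and the query have been reduced to topic distributions. The file names and distributions are invented, and cosine similarity stands in for whichever similarity measure the tool in [20] actually uses.

```python
import math

# Hypothetical topic distributions inferred for three source files.
files = {
    "parser.c": [0.7, 0.2, 0.1],
    "render.c": [0.1, 0.1, 0.8],
    "net.c":    [0.2, 0.6, 0.2],
}

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den

def localize(query_topics):
    """Rank files by similarity to the bug report's topic mix."""
    return sorted(files, key=lambda f: cosine(files[f], query_topics),
                  reverse=True)

# A bug report whose inferred topics lean towards topic 2 (rendering).
print(localize([0.1, 0.2, 0.7]))
```

The files at the top of the ranking are the candidates most likely to need modification to fix the reported bug.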

Another approach in software engineering is the use of genetic algorithms to configure topic models effectively for software engineering tasks [21]. This work assumes that source code has different properties from natural language text, and it proposes a solution that combines the topic model with a genetic algorithm [22]. It addresses three different software engineering tasks: labelling software artifacts, feature location, and traceability link recovery.
2.2 Author Topic Model

The author topic model extends the latent Dirichlet allocation approach to topic modelling [23]. The important associations are that each author has a multinomial distribution over topics and each topic has a multinomial distribution over words. The model can also handle documents with multiple authors: such a document is modelled over a distribution of topics that is a mixture of the distributions associated with its authors. The model was applied to a collection of 1,700 NIPS conference papers.

The author topic model has several applications. One demonstrated in [23] is to measure the similarity between authors in order to generate, from the abstract and author list of a paper, a list of authors who would be likely reviewers. Table 3 shows the author similarities underlying such an automated recommendation system; all results were averaged using the Gibbs sampler.

Table 3 Symmetric KL divergence for pairs of authors [23]

Table 4 Author Entropies [23]

The model can also be used to calculate the entropy of an author's distribution over topics, which indicates whether a particular author addresses a single topic or many topics. Table 4 shows the authors with the highest averaged entropy for a model with 400 topics.
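The two quantities behind Table 3 and Table 4 are easy to compute once per-author topic distributions are available. The distributions below are invented; in [23] they are estimated with the Gibbs sampler.

```python
import math

def sym_kl(p, q):
    """Symmetric KL divergence between two topic distributions."""
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return kl(p, q) + kl(q, p)

def entropy(p):
    """Entropy of an author's topic distribution (higher = more diverse)."""
    return -sum(x * math.log(x, 2) for x in p if x > 0)

# Hypothetical per-author topic distributions.
focused = [0.9, 0.05, 0.05]   # writes mostly about one topic
broad   = [0.34, 0.33, 0.33]  # spreads over many topics
similar = [0.85, 0.1, 0.05]   # close to the "focused" author

print(sym_kl(focused, similar) < sym_kl(focused, broad))  # True: more alike
print(entropy(broad) > entropy(focused))                  # True: more diverse
```

Low symmetric KL between two authors marks them as candidates for the "most alike" pairs, and high entropy marks an author as addressing many topics.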

Figure 3 LDA-based approach for Bug Localization [20]

Table 1 Comparison of LDA to LSI over Eclipse [20]

2.3 Author Persona Topic (APT)

Another application of the author topic model was presented in [24]: finding experts to review scientific papers. The authors propose matching the expertise of reviewers to the content of the submitted document, using a slightly modified author topic model called the Author Persona Topic (APT) Model. Each reviewer has expertise in several subjects, so the model creates a separate persona for each reviewer per subject area.

The model follows the principles of information retrieval, but instead of retrieving relevant documents based on the content of a given document, it retrieves relevant people. A model is built for each potential reviewer of a paper, giving each reviewer a distribution over words; this distribution is then used to rank reviewers against the words present in the given scientific article. Table 5 shows the comparison of the author persona topic model with other commonly used and relevant models.

Table 7 Pairs considered most alike by SNA

Table 5 Precision at relevance cutoff 3 after retrieving n reviewers [24]

2.3.1 Author-Recipient Topic Model

Another interesting extension of the author topic model is the Author-Recipient Topic (ART) Model [25], applied to social media analysis. Instead of standalone documents, this model considers the messages or emails exchanged between users. The key difference from the ordinary author topic model is that ART also accounts for the recipient of each message or document: it generates author-recipient pairs, and each pair has a multinomial distribution over words.

Table 6 Pairs considered most alike by ART [25]

The model can also help discover the author or recipient of a particular message by computing marginal distributions over topics specific to a single author or recipient. Table 6 and Table 7 compare social network analysis with the author-recipient topic model, as both consider pairs generated from McCallum's email experiments. The dataset used was the Enron Email Dataset [26].

Supervised Latent Dirichlet Allocation

Supervised latent Dirichlet allocation [27] was introduced for prediction, in contrast to the unsupervised modelling performed by standard Latent Dirichlet Allocation. The key difference from unsupervised Latent Dirichlet Allocation is the association of a response variable with each document. The response variable may be the number of users visiting a web page, the rating of a movie, or the category of the document, depending on the document collection available. Latent topics are then found by jointly modelling the documents and their responses, which helps in predicting the most likely value of the response variable for future, unseen documents.
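The prediction step can be illustrated with a toy sketch: an ordinary least-squares fit of the response on a document's topic proportion stands in for sLDA's jointly trained linear model, and predictive R² is computed on the fitted values. All numbers below are invented.

```python
# Each document: (proportion of a "positive sentiment" topic, movie rating).
docs = [(0.9, 9.0), (0.8, 8.5), (0.5, 6.0), (0.2, 3.5), (0.1, 2.0)]

xs = [d[0] for d in docs]
ys = [d[1] for d in docs]
n = len(docs)

# Ordinary least squares for y = a*x + b.
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in docs) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

preds = [a * x + b for x in xs]

# Predictive R^2: fraction of response variance explained by the topic.
ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
ss_tot = sum((y - my) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

In sLDA the topic proportions and the regression coefficients are estimated jointly rather than in two separate stages as here, but the evaluation by predictive R² is the same.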

Figure 4 Result of applying supervised LDA on the movie corpus [27]

Figure 5 Result of applying supervised LDA on the Digg corpus [27]

The authors applied the model to two diverse datasets. The first task was predicting movie ratings using supervised Latent Dirichlet Allocation; the second was predicting the number of "diggs" a web page would receive on the website www.digg.com. Figure 4 and Figure 5 show the results on the movie corpus and the Digg corpus respectively; in both figures, predictive R² assesses the quality of the predictions.

2.3.2 Correlated Topic Model

Most of the statistical topic models based on Latent Dirichlet Allocation that we have discussed so far fail to capture correlation between topics, and they have not been efficient at predicting the remaining part of a partially read document [28]. It is natural, however, that documents appearing together in a text corpus are likely to be thematically linked, and the correlated topic model is more expressive than the Latent Dirichlet Allocation model in this respect.

In [28] the authors proposed modelling the latent features of a document with a logistic normal distribution. The logistic normal can reveal correlations between topics that cannot be found using a single Dirichlet distribution. With the correlated topic model we can also give a more suitable visualization of an unstructured corpus of documents, and the model is helpful in predicting the remainder of a document even when only a small portion has been read. Figure 6 shows the comparison between the correlated topic model and the Latent Dirichlet Allocation model.

3 Time Evolution Topic Models

The Latent Dirichlet Allocation model has been applied to various applications and has been very successful in many respects. But when it comes to topic modelling of a corpus that evolves with the passage of time, it does not give the desired results. For example, if we are modelling American history and want topics related to World War I, LDA may group World War I with the Mexican-American War because it does not consider time. Several new models have been proposed by researchers to address this, and we discuss a few of the most prominent in this section.
3.1 Dynamic Topic Models

In dynamic topic models, documents are grouped together with respect to time. In this model, sequential topic models are developed for discrete data [29]: the topic models are based on a Gaussian time series over the natural parameters of the multinomial topics and logistic normal topic proportions. Dynamic topic models are normally treated as predictive models for large, unstructured document collections. The authors in [29] applied their proposed model to a subset of articles from the journal Science, a corpus of 30,000 articles drawn from the years 1881 to 1999.

Figure 7 Application of Dynamic Topic Model [29]
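The chaining of topics across time slices can be sketched as a Gaussian random walk on a topic's natural parameters, with a softmax mapping them to word probabilities at each slice. The vocabulary, starting parameters and variance below are invented.

```python
import math
import random

def softmax(betas):
    exps = [math.exp(b) for b in betas]
    total = sum(exps)
    return [e / total for e in exps]

def evolve_topic(beta0, steps, sigma=0.2, seed=1):
    """Drift a topic's natural parameters through time slices:
    beta_t ~ Normal(beta_{t-1}, sigma^2); softmax gives word probabilities."""
    rng = random.Random(seed)
    beta, slices = list(beta0), []
    for _ in range(steps):
        slices.append(softmax(beta))
        beta = [b + rng.gauss(0, sigma) for b in beta]
    return slices

# One topic over a tiny invented vocabulary ["war", "treaty", "atom", "computer"].
for t, probs in enumerate(evolve_topic([1.0, 0.5, -1.0, -1.0], steps=3)):
    print(t, [round(p, 2) for p in probs])
```

Because each slice's parameters depend on the previous slice's, word probabilities change smoothly over time instead of being fixed for the whole corpus as in plain LDA.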

3.2 Topics over Time

One of the main problems of the Latent Dirichlet Allocation model is its inability to handle change in topics over time and its failure to capture the localization of topics in time. The Topics over Time model was proposed by [30] to overcome these two problems. In this model, time is jointly modelled with word co-occurrence patterns. The model does not discretize time; instead, a continuous time distribution is associated with each topic, and topics generate both words and timestamps. This helps in discovering co-occurring words in any locality in time: the stronger a word co-occurrence pattern emerges, the narrower the time distribution fitted for it, while a wide time distribution results when the co-occurrence of words remains consistent over a large span of time.
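The per-topic continuous time distribution can be illustrated with a Beta distribution over normalized timestamps; moment matching, used below, is just one simple way to fit it, and the timestamps are invented.

```python
def fit_beta(timestamps):
    """Moment-matching fit of Beta(alpha, beta) to timestamps in [0, 1]."""
    n = len(timestamps)
    m = sum(timestamps) / n
    v = sum((t - m) ** 2 for t in timestamps) / n
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# A "bursty" topic: word occurrences clustered around t = 0.2.
burst = [0.15, 0.18, 0.2, 0.22, 0.25]
# A "persistent" topic: occurrences spread over the whole time span.
steady = [0.05, 0.25, 0.5, 0.75, 0.95]

a1, b1 = fit_beta(burst)
a2, b2 = fit_beta(steady)
# A tighter burst of co-occurrences yields larger Beta parameters,
# i.e. a narrower time distribution, as described above.
print(a1 + b1 > a2 + b2)  # True
```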

3.3 Multiscale topic tomography

In [31] the authors propose a new model called multiscale topic tomography, in which word counts are modelled with a Poisson distribution combined with Bayesian multiscale analysis. This model presents an alternative to dynamic topic modelling [29]: topic evolution is modelled through a conjugate prior on the topic parametrization. The model also lets the user zoom in and out on the time scale and focus on the evolution of topics at a particular resolution.

Figure 8 Application of TOT on the NIPS dataset [30]

Figure 10 Comparison of perplexities of the models; lower is better [31]

Figure 10 shows that the perplexity of multiscale topic tomography is much better than that of Latent Dirichlet Allocation. Perplexity indicates a topic modelling technique's ability to predict the unseen part of the data: a lower value means the model is better at predicting the unseen part of a document.
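Perplexity follows directly from the per-word likelihood a model assigns to held-out text. The log-probabilities below are invented to show how a weaker model yields a higher perplexity.

```python
import math

def perplexity(word_log_probs):
    """exp of the negative average log-likelihood per held-out word."""
    n = len(word_log_probs)
    return math.exp(-sum(word_log_probs) / n)

# Invented per-word log-probabilities from two models on 20 held-out words.
good_model = [math.log(0.05)] * 20  # assigns each word probability 0.05
weak_model = [math.log(0.01)] * 20  # assigns each word probability 0.01

print(perplexity(good_model))  # about 20  (= 1 / 0.05)
print(perplexity(weak_model))  # about 100 (= 1 / 0.01): higher, i.e. worse
```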
Figure 9 Application of LDA on the NIPS dataset [30]

The authors of [30] applied Topics over Time to 17 years of NIPS research papers, nine months of personal email, and 200 years of presidential State of the Union addresses. The results showed improved topic discovery and much more accurate time-period prediction.

4 Short Text Topic Modelling

Social media has seen tremendous growth in recent years, and with this growth a large collection of short texts has spawned online. Extracting useful information from, and performing topic modelling on, this large collection of data is of huge importance and has many significant applications. Conventional topic models such as latent Dirichlet allocation (LDA) are not well suited here: the primary reasons standard topic models fail on short texts are the sparsity of the text [32] and the lack of word co-occurrence within a single document. The authors in [32] demonstrated that the accuracy of the topics extracted by a trained topic model can be greatly influenced by document length; for better training, short messages should be aggregated into longer documents. The importance of modelling short text, however, cannot be denied. We now discuss some common topic models designed for short texts with data sparsity in view.

4.1 Biterm Topic Modelling

In the biterm topic model proposed by [33], the fundamental idea is to learn topics from rich global word co-occurrence patterns rather than from individual documents. The model is a generative model of word-pair (biterm) co-occurrence in a short context: both words of a biterm belong to the same topic, drawn from a corpus-level mixture of topics. The authors extensively tested their model on Twitter data and a question-and-answer website.
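Biterm extraction itself is simple: every unordered pair of distinct word positions in a short text yields one biterm. A minimal sketch (the model's topic inference over the aggregated biterm set is not shown):

```python
from itertools import combinations

def biterms(short_text):
    """All unordered word pairs co-occurring in one short document."""
    words = short_text.lower().split()
    return [tuple(sorted(pair)) for pair in combinations(words, 2)]

tweet = "topic models love short texts"
for b in biterms(tweet):
    print(b)

# A corpus-level model then aggregates the biterms from all documents,
# so rich global co-occurrence patterns compensate for the sparsity
# of any single short document.
```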

Figure 11 Classification performance comparison with different data proportions on the Questions collection (K=40) [33]

4.2 Short and sparse text topic modelling via self-aggregation

The model proposed by [34] aggregates short texts during topic inference. The aggregation is performed using general topic affinity, making it readily applicable to any real-world data. The authors argue that the sparsity problem can be resolved by aggregating short texts with similar topics. They further assume that each short text belongs to a pseudo-document in a long, unobserved collection of text, so assigning each text to the right pseudo-document is very important for extracting the right, meaningful topics. The authors in [34] achieve this at the topic inference stage through an organic integration of topic modelling and text self-aggregation.

4.3 Pseudo-Document Topic Modelling

In this model, too, short texts are aggregated to overcome the data sparsity of short text. The advantage of pseudo-document topic modelling is that it does not depend on auxiliary information, which can be very costly to obtain and is sometimes simply unavailable. The key idea is the use of pseudo-documents for the implicit aggregation of short texts [35]. This transforms the modelling of a large collection of short texts into the modelling of a much smaller number of pseudo-documents, which helps parameter estimation in both efficiency and accuracy. Sparsity-enhanced PTM (SPTM) additionally applies a Spike and Slab prior to the topic distributions to remove undesired correlations between pseudo-documents. The authors in [35] also showed experimentally that the self-aggregation topic model (SATM) is prone to overfitting.

Figure 12 Illustration of overfitting [35]

Conclusion

In this paper we started our discussion by describing what topic modelling is and then moved on to different topic models. Although we did not introduce any new topic model, we tried to cover the important aspects of topic modelling and its applications in various fields. Future work includes proposing a topic model for focused analysis of short texts on social media.
5 References

[1] Internet Society, Global Internet Report 2016, Internet Society, 2016.
[2] Statista, Statistics and facts about Social Networks, Statista, 2016.
[3] Internet Live Stats, Internet Live Stats, 2016.

[4] J. W. Mohr and P. Bogdanov, Introduction: Topic models: What they are and why they matter, Poetics, vol. 41, no. 6, pp. 545-569, 2014.
[5] M. R. Brett, Topic Modeling: A Basic Introduction, Journal of Digital Humanities, vol. 2, no. 1, 2012.
[6] D. M. Blei, A. Y. Ng and M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, vol. 3, 2003.
[7] T. Hofmann, Probabilistic Latent
Semantic Indexing, in Proceedings of the
22nd annual international ACM SIGIR
conference on Research and development
in information retrieval, Berkeley,
California, USA, 1999.
[8] W. Zhao, W. Zou and J. J. Chen, Topic modeling for cluster analysis of large biological and medical datasets, in Proceedings of the 11th Annual MCBIOS Conference, Stillwater, OK, USA, 2014.
[9] H. Aouadi, M. Torjmen Khemakhem and M. Ben Jemaa, An LDA Topic Model Adaptation for Context-Based Image Retrieval, in E-Commerce and Web Technologies, 2015, pp. 69-80.
[10] C. Wang and D. M. Blei, Collaborative topic modeling for recommending scientific articles, in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, San Diego, California, USA, 2011.
[11] D. Agarwal and B.-C. Chen, Regression-based latent factor models, in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, Paris, France, 2009.

[12] Y. Koren, Matrix Factorization


Techniques for Recommender Systems,
Computer, vol. 42, no. 8, pp. 30-37, 2009.
[13] R. Salakhutdinov and A. Mnih, Bayesian probabilistic matrix factorization using Markov chain Monte Carlo, in Proceedings of the 25th international conference on Machine learning, Helsinki, Finland, 2008.
[14] J. Chang, S. Gerrish, C. Wang, J. Boyd-Graber and D. M. Blei, Reading Tea Leaves: How Humans Interpret Topic Models, in Neural Information Processing Systems Conference, Vancouver, 2009.
[15] D. Agarwal and B.-C. Chen, fLDA: matrix factorization through latent dirichlet allocation, in Proceedings of the third ACM international conference on Web search and data mining, New York, New York, USA, 2010.
[16] LSA and information retrieval: Getting back to basics, in Handbook of Latent Semantic Analysis, 2007, pp. 293-321.
[17] T. Hofmann, Probabilistic latent semantic
analysis, in Proceedings of the Fifteenth
conference on Uncertainty in artificial
intelligence, Stockholm, Sweden, 1999.
[18] M. Girolami and A. Kabán, On an Equivalence between PLSI and LDA, in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, Toronto, Canada, 2003.
[19] X. Wei and W. B. Croft, LDA-based document models for ad-hoc retrieval, in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, Seattle, Washington, USA, 2006.
[20] S. K. Lukins, N. A. Kraft and L. H. Etzkorn, Source Code Retrieval for Bug Localization using Latent Dirichlet Allocation, Information and Software Technology, vol. 52, no. 9, pp. 972-990, 2010.
[21] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshyvanyk and A. De Lucia, How to Effectively Use Topic Models for Software Engineering Tasks? An Approach Based on Genetic Algorithms, in Proceedings of the 2013 International Conference on Software Engineering, San Francisco, CA, USA, 2013.
[22] J. H. Holland, Adaptation in natural and
artificial systems, Cambridge: MIT Press,
1975.
[23] M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, in Proceedings of the 20th conference on Uncertainty in artificial intelligence, Banff, Canada, 2004.
[24] D. Mimno and A. McCallum, Expertise modeling for matching papers with reviewers, in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, San Jose, California, 2007.
[25] A. McCallum, X. Wang and A. Corrada-Emmanuel, The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email, Journal of Artificial Intelligence Research, vol. 30, no. 1, pp. 249-272, 2007.
[26] J. Shetty and J. Adibi, The Enron email dataset database schema and brief statistical report, Information Sciences Institute, 2004.

[27] D. M. Blei and J. D. McAuliffe, Supervised topic models, in Advances in Neural Information Processing Systems, 2007.
[28] D. M. Blei and J. D. Lafferty, Correlated Topic Models, in Advances in Neural Information Processing Systems 18, 2005.
[29] D. M. Blei and J. D. Lafferty, Dynamic Topic Models, in International Conference on Machine Learning, Pittsburgh, 2006.
[30] X. Wang and A. McCallum, Topics over time: a non-Markov continuous-time model of topical trends, in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, Philadelphia, PA, USA, 2006.
[31] R. M. Nallapati, S. Ditmore, J. D. Lafferty et al., Multiscale topic tomography, in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, San Jose, California, USA, 2007.
[32] L. Hong and B. D. Davison, Empirical study of topic modeling in Twitter, in Proceedings of the First Workshop on Social Media Analytics, Washington D.C., District of Columbia, 2010.
[33] X. Yan, J. Guo, Y. Lan and X. Cheng, A biterm topic model for short texts, in Proceedings of the 22nd international conference on World Wide Web, 2013.
[34] X. Quan, C. Kit, Y. Ge and S. J. Pan, Short and sparse text topic modeling via self-aggregation, in Proceedings of the 24th International Conference on Artificial Intelligence, 2015.
[35] Y. Zuo, J. Wu, H. Lin, H. Xiong et al., Topic Modeling of Short Texts: A Pseudo-Document View, in KDD '16, San Francisco, CA, USA, 2016.
[36] Internet Society, Global Internet Report
2016, Internet Society, 2016.
[37] Statista, Facts on Social Networks,
Statista, 2016.
[38] Internet Stats Live, Internet Stats Live,
Internet Stats Live, 2016.
[39] M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence, pp. 487-494, 2004.
[40] T. L. Griffiths and M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, pp. 5228-5235, 2004.
[41] X. W. C.-E. Andrew McCallum, Topic and
Role Discovery in Social Networks,
Journal of Artificial Intelligence Research,
vol. 30, no. 1, pp. 249-272, 2006.
[42] A. McCallum, X. Wang and A. Corrada-Emmanuel, The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email, Journal of Artificial Intelligence Research, vol. 30, pp. 249-272, 2007.
[43] D. M. Blei and J. D. Lafferty, Correlated Topic Models, The Annals of Applied Statistics, vol. 1, no. 1, pp. 17-35, 2007.
[44] D. Ramage, D. Hall, R. Nallapati and C. D. Manning, Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora, in Conference on Empirical Methods in Natural Language Processing, Singapore, 2009.

[45] S. T. Dumais, Latent semantic analysis,


Annual Review of Information Science and
Technology, vol. 38, no. 1, pp. 188-230,
2004.
[46] G. P. ,. F. R. ,. S. S. ,. S. B. Andr Bergholz,
Improved Phishing Detection using
Model-Based Features, in The Fifth
Conference on Email and Anti-Spam CEAS,
Mountain View, California, USA, 2008.
[47] X. Wang and A. McCallum, Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends, in KDD '06 Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 424-433, 2006.
[48] R. Nallapati, W. Cohen, J. Lafferty et al., Multiscale topic tomography, in KDD '07 Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 520-529, 2007.
[49] X. Cheng, X. Yan, Y. Lan and J. Guo, BTM: Topic Modeling over Short Texts, IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 12, 2014.
[50] D. Poshyvanyk, Y.-G. Guéhéneuc, A. Marcus, G. Antoniol and V. Rajlich, Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval, IEEE Transactions on Software Engineering, vol. 33, no. 6, pp. 420-432, 2007.
