Word Embeddings Paper

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)
A Correlated Topic Model Using Word Embeddings

Guangxu Xun1 , Yaliang Li1 , Wayne Xin Zhao2,3 , Jing Gao1 , Aidong Zhang1
1
Department of Computer Science and Engineering, SUNY at Buffalo, NY, USA
2
School of Information, Renmin University of China, Beijing, China
3
Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
1
{guangxux, yaliangl, jing, azhang}@buffalo.edu, 2 batmanfly@gmail.com
Abstract and Lafferty, 2006a] replaces the Dirichlet prior with logis-
tic normal distribution which allows for covariance structure
Conventional correlated topic models are able to among the topics.
capture correlation structure among latent topics Nowadays, the rapidly developing technique in natural lan-
by replacing the Dirichlet prior with the logistic guage processing – word embeddings [Bengio et al., 2003],
normal distribution. Word embeddings have been [Mikolov and Dean, 2013] – provides us with the possibility
proven to be able to capture semantic regularities to model topics and topic correlations in the continuous se-
in language. Therefore, the semantic relatedness mantic space. Word embeddings, also known as word vectors
and correlations between words can be directly cal- and distributed representations of words, are real-valued con-
culated in the word embedding space, for exam- tinuous vectors for words, which have proven to be effective
ple, via cosine values. In this paper, we propose at capturing semantic regularities in language. Words with
a novel correlated topic model using word embed- similar semantic and syntactic properties tend to be projected
dings. The proposed model enables us to exploit into nearby area in the vector space. By replacing the orig-
the additional word-level correlation information inal discrete word types in LDA with continuous word em-
in word embeddings and directly model topic cor- beddings, Gaussian-LDA [Das et al., 2015] has shown that
relation in the continuous word embedding space. the additional semantics in word embeddings can be incorpo-
In the model, words in documents are replaced rated into topic models and further enhance the performance.
with meaningful word embeddings, topics are mod-
The main goal of correlated topic models is to model and
eled as multivariate Gaussian distributions over the
discover correlation between topics. And now we know that
word embeddings and topic correlations are learned
word embeddings are able to capture semantic regularities
among the continuous Gaussian topics. A Gibbs
in language, and the correlations between words can be di-
sampling solution with data augmentation is given
rectly measured by the Euclidean distances or cosine val-
to perform inference. We evaluate our model on
ues between the corresponding word embeddings. More-
the 20 Newsgroups dataset and the Reuters-21578
over, semantically related words are close to each other in
dataset qualitatively and quantitatively. The exper-
space and should be more likely to be grouped into the same
imental results show the effectiveness of our pro-
topic. Since Gaussian distributions depict a notion of central-
posed model.
ity in continuous space, it is a natural choice to model top-
ics as Gaussian distributions over word embeddings in space.
Therefore, the motivation of this paper is to model topics in
1 Introduction the word embedding space, exploit the known correlation in-
Conventional topic models, such as Probabilistic Latent Se- formation at word level and further improve the correlation
mantic Analysis (PLSA) [Hofmann, 1999] and Latent Dirich- discovery at topic level.
let Allocation (LDA) [Blei et al., 2003], have proven to be a In this paper, we propose the Correlated Gaussian Topic
powerful unsupervised tool for the statistical analysis of doc- Model (CGTM) to model both topics and topic correlations in
ument collections. Those methods [Zhu et al., 2012], [Zhu the word embedding space. More specifically, first we learn
et al., 2014] follow the bag-of-word assumption and model word embeddings with the help of external large unstructured
each document as an admixture of latent topics, which are text corpora to obtain additional word-level correlation in-
multinomial distributions over words. formation; Second, in the vector space of word embeddings,
A limitation of the conventional topic models is the in- we model topics and topic correlations to exploit useful addi-
ability to directly model correlations between topics, for in- tional semantics in word embeddings, wherein each topic is
stances, a document about autos is more likely to be related to represented as a Gaussian distribution over the word embed-
motorcycles than to politics. In reality, it is natural to expect dings and topic correlations are learned among those Gaus-
correlated latent topics in most text corpora. In order to ad- sian topics. Third, we develop a Gibbs sampling algorithm
dress this limitation, the Correlated Topic Model (CTM) [Blei for CGTM.
4207
To validate the efficacy of our proposed model, we evalu-

ate our model on the 20 Newsgroups dataset and the Reuters-
21578 dataset, which are well-known dataset for experiments
in text mining domain. The experimental results show that
our model can discover more reasonable topics and topic cor-
relations than the baseline models.
2 Related Works
Correlation is an inherent property in many text corpora, for
example, [Blei and Lafferty, 2006b] explores the time evolu-
tion of topics and [Mei et al., 2008] analyzes the locational Figure 1: Schematic illustration of the CGTM framework.
correlation among topics. However, due to the use of the
Dirichlet prior, traditional topic models are not able to model
the topic correlation directly. CTM [Blei and Lafferty, 2006a] called Word2Vec [Mikolov and Dean, 2013] to train word
proposes to use logistic normal distribution to model the vari- embeddings. In the learning process of Word2Vec, words
ability among topic proportions and thus learn the covariance with similar meanings gradually converge to nearby areas in
structure of topics. the vector space. In this model, words in the form of word
Word embeddings can capture the semantic meanings of embeddings are used as input to a softmax classifier and each
words via low-dimensional real-valued vectors [Mikolov and word is predicted based on its neighbourhood words within a
Dean, 2013], for example, vector operation vector(‘king’) - certain context window.
vector(‘man’) + vector(‘woman’) results in a vector which is Having learnt the word embeddings, given a word wdn ,
very close to vector(‘queen’). The concept of word embed- which denotes the nth word in dth document, we can en-
dings was first introduced into natural language processing rich that word by replacing it with the corresponding word
by Neural Probabilistic Language Model (NPLM) [Bengio et embedding. The following section describes how this enrich-
al., 2003]. Due to its effectiveness and wide variety of appli- ment is used in a generative process to model topics and topic
cation domains, word embeddings have garnered a great deal correlations.
of attention and development [Mikolov et al., 2013], [Pen-
nington et al., 2014], [Morin and Bengio, 2005], [Collobert 4 Generative Process
and Weston, 2008], [Mnih and Hinton, 2009], [Huang et al.,
Trained word embeddings give us useful additional seman-
2012].
tics, which helps us discover reasonable topics and topic cor-
Since word embeddings carry additional semantics, many
relations in the vector space. However, each document now
researchers have tried to incorporate them into topic mod-
is a sequence of continuous word embeddings instead of a se-
els to improve the performance [Das et al., 2015], [Li et al.,
quence of discrete word types. Therefore, conventional topic
2016], [Liu et al., 2015], [Li et al., 2017]. [Liu et al., 2015]
models no longer are applicable. Since the word embeddings
proposed Topical Word Embeddings (TWE) which combines
are located in space based on their semantics and syntax, in-
word embeddings and topic models in a simple and effective
spired by [Hu et al., 2012] and [Das et al., 2015], we consider
way to achieve topical embeddings for each word. [Das et al.,
them as draws from several Gaussian distributions. Hence,
2015] uses Gaussian distributions to model topics in the word
each topic is characterized as a multivariate Gaussian distri-
embedding space.
bution in the vector space. The choice of Gaussian distribu-
The aforementioned models either fail to directly model
tion is justified by the observations that Euclidean distances
correlation among topics or fail to leverage the word-level
between word embeddings are consistent with their semantic
semantics and correlations. We propose to leverage the word-
similarities.
level semantics and correlations within word embeddings to
aid us in learning the topic-level correlations. The graphical model of CGTM is shown in Figure 1. More
formally, there are K topics and each topic is represented by a
multivariate Gaussian distribution over the word embeddings
3 Learning Word Embeddings in the word vector space. Let µk and Σk denote the mean
We begin our topic discovery process with learning the word and covariance for the k th Gaussian topic. Each document
embeddings with semantic regularities. Unlike the traditional is a admixture of K Gaussian topics. ηd is a K dimensional
one-hot representations of words which encode each word as vector where each dimension represents the weight of each
a binary vector of N (the size of vocabulary) digits with only topic in document d. Then the document-specific topic distri-
one digit being 1 the others 0, the distributed representations bution θd can be computed based on ηd . µc is the mean of η
of words encode each word as a unique real-valued vector. and Σc is the covariance of η. By replacing the Dirichlet pri-
By mapping words into this vector space, word embeddings ors in conventional LDA with logistic normal priors, the topic
are able to overcome several drawbacks of the one-hot rep- correlation information is integrated into the model. µ0 , Σ0 ,
resentations such as the curse of dimensionality, the out-of- ν0 , µ, Σ and ν are hyper parameters for Gaussian topics and
vocabulary words and the lack of semantics. logistic normal priors.
In this paper, we adopt a recently developed, very effective Note that variables in bold font mean they are either vectors
and efficient distributed representations of words based model or matrices, for example, wdn . The generative process is as
4208
follows: z −dn which is the topic assignment scheme without consid-

ering the current word wdn , the topic of each word is drawn
1. Draw Σc ∼ W −1 (Ψ, ν).
iteratively as:
2. Draw µc ∼ N (µ, τ1c Σc ). p(zdn = k|z −dn , w) ∝ p(zdn = k|z −dn )p(wdn |zdn = k)
3. For each Gaussian topic k = 1, 2, · · · , K: exp(ηdk ) τk + 1
−1 ∝P · Tr (wdn |µk , Σk ),
(a) Draw topic covariance Σk ∼ W (Ψ0 , ν0 ). i exp(η i)
d τk
(b) Draw topic mean µk ∼ N (µ0 , τ1 Σk ). (3)
4. For each document d = 1, 2, · · · , D: where Tr (w|µ, Σ) is the multivariate Student’s t-distribution
(a) Draw η d ∼ N (µc , Σc ). for Gaussian sampling with (r = νk − dim + 1) being its
degrees of freedom and dim being the dimensionality of word
(b) For each word index n = 1, 2, · · · , Nd :
embeddings. (νk = ν + Nk ) and (τk = τ + Nk ) are the
i. Draw a topic zdn ∼ M ultinomial(f (η d )). parameters of topic k, where Nk denotes the total number of
ii. Draw a word wdn ∼ N (µzdn , Σzdn ). words that are assigned to topic k.
where τ and τc are constant factors; and f (η) is the logistic 5.2 Updating Gaussian Topics
transformation: Every time we re-sample topic assignment zdn , we need to
exp(ηdk ) update the two involved Gaussian topics because the current
f (ηdk ) = θdk = P i
. (1) word wdn is either leaving or joining this Gaussian topic. Fol-
i exp(ηd )
lowing [Das et al., 2015], we derive the updates for µk and
The following conjugate priors are utilized for topic parame- Σk of the posterior Gaussian distributions for topic k:
ters: a Gaussian distribution N for the mean and an inverse
Wishart distribution W −1 for the covariance. However, note τ µ0 + Nk w̄k
that there is still a non-conjugacy problem between the logis- µk = ,
τk
tic normal distribution and multinomial distribution, and we (4)
will solve this with data augmentation technique in the fol- Ψ0 + C k + τ Nk (w̄k − µ0 )(w̄k − µ0 )T /τk
Σk = ,
lowing section. νk − dim + 1
where w̄k is the sample mean of all the word embeddings as-
5 Parameter Inference signed to topic k, and C k is the scaled form of sample covari-
ance of all the word embeddings assigned to topic k. These
The observed variables are documents consisting of word em- two intermediate variables are calculated as follows:
beddings, and our goal is to infer the posterior Gaussian dis-
tribution of each topic, topic assignment of each word, and PD PNd
topic correlations. Given D documents and the correspond- d=1 n=1 δ(zdn , k)wdn
w̄k = ,
ing word embeddings w, the joint distribution of topic assign- Nk
ments z and logistic normal parameters η is: Nd
D X (5)
X
D Y Nd
Ck = δ(zdn , k)(wdn − w̄k )(wdn − w̄k )T ,
d=1 n=1
Y
p(z, {η d }D
d=1 |w) ∝ p(w|z) ( θdzdn )N (η d |µc , Σc )
d=1 n=1 where δ(zdn , k) is the Kronecker delta function that
Nd
D Y δ(zdn , k) = 1 if zdn = k, δ(zdn , k) = 0 otherwise.
Y exp(ηdzdn )
∝ p(w|z) ( PK )N (η d |µc , Σc ), 5.3 Sampling Logistic Normal Parameters
i
d=1 n=1 i exp(ηd )
(2) Given topic assignments, directly sampling logistic normal
parameters η is difficult due to non-conjugacy. To address the
where p(w|z) is the Gaussian probability of words w under non-conjugacy problem between the logistic normal distribu-
topic assignments z. Because of the choice of conjugate pri- tion and multinomial distribution, following [Holmes et al.,
ors for topic parameters, those variables can be integrated out 2006], [Polson et al., 2013] and [Chen et al., 2013], we sam-
and we can efficiently re-sample topic assignment for each ple the logistic normal parameters η based on z with auxiliary
word. However, due to the non-conjugacy between the logis- variables. For document d, the likelihood for ηdk conditioned
tic normal and multinomial distributions, regular Gibbs sam- on η −k
d is:
pling scheme doesn’t work for the logistic normal parameters.
Thus we adopt Gibbs sampling with data augmentation tech- l(ηdk |η −k
d )
nique to solve this non-conjugacy problem. k k
Y exp(η k ) zdn
N d
exp(ηdk )
1−zdn
d
5.1 Sampling Topic Assignments = P i
1− P i
i exp(ηd ) i exp(ηd ) (6)
n=1
Since the topic parameters have conjugate priors, the sam- k
pling process of topic assignments is similar to the Gibbs (exp(ρkd ))Cd
= ,
sampling scheme for LDA [Griffiths, 2002]. Given η and (1 + exp(ρkd ))Nd
4209
k k
where zdn is the topic indicator that zdn = 1 if word wdn is 5.4 Updating Topic Correlation
assigned to k topic, zdn = 0 otherwise. ρkd = ηdk − ζdk ,
th k
Given {η d }D
d=1 , the logistic normal parameters µc and Σc
ζdk = log( j6=k exp(ηdj )) and Cdk is the number of words
P
are updated as:
assigned to topic k in document d. Therefore, we obtain the
posterior distribution of ηdk proportional to multiplying the τc D
µc = µ+ η̄,
likelihood by the prior: τc + D τc + D
(12)
p(ηdk |η −k k −k k k 2 τc D
d , z, w) ∝ l(ηd |η d )N (ηd |µd , σk ). (7) Σc = Ψ + Q + (η̄ − µ)(η̄ − µ)T ,
For the prior part, it is a univariate Gaussian distribution τc + D
conditioned on the other logistic normal parameters in the where η̄ is the mean of {η d }D 1
− η̄)(η d −
d=1 , and Q = D (η d
current document η −k −k
d . Thus, given η d and µc , Σc of the η̄)T .
multivariate Gaussian distribution over η, we have:
µkd = µk − Λ−1 −k
kk Λk−k (η d − µ−k ), 6 Experiments
(8)
σk2 = Λ−1 kk , In this section, we carry out experiments on two real-
where Λ = Σ−1 c is the precision matrix. However, the non- world text collections – the 20 Newsgroups dataset1 and the
conjugacy makes it difficult to directly calculate the likeli- Reuters-21578 dataset2 to demonstrate the efficacy of our
hood l(ηdk |η −k
d ) and thus unable to directly sample ηd .
k proposed model. 20 Newsgroups contains approximately
By introducing auxiliary Polya-Gamma variable λkd [Pol- 20,000 text documents partitioned evenly across 20 different
son et al., 2013], we are able to get around the non-conjugacy newsgroups. Reuters contains about 10,000 documents, but
problem and the likelihood l(ηdk |η −k due to the imbalance of each category, only the largest 8 cat-
d ) can now be expressed
as: egories are selected in Reuters, leaving us with 7,674 docu-
ments in total. Both datasets have become popular datasets
l(ηdk |η −k
d )=
Z ∞ for experiments in many data mining tasks, such as text
1 k k λkd (ρkd )2 classification. Each document is associated with one sin-
exp(κ d ρ d ) exp(− )p(λkd |Nd , 0)dλkd ,
2N d 0 2 gle category label. For 20 Newsgroups, correlation is exhib-
(9) ited across different newsgroups (e.g. rec.sport.baseball and
k k k rec.sport.hockey), which makes this dataset a suitable choice
where κd = Cd −Nd /2 and p(λd |Nd , 0) is the Polya-Gamma
to verify the effectiveness of topic correlation discovery for
distribution PG(Nd , 0). As one can observe, Equation 9 im-
CGTM.
plies that p(ηdk |η −k d , z, w) is the marginal distribution of the We compare CGTM with three topic modeling methods:
joint distribution: LDA [Blei et al., 2003], CTM [Blei and Lafferty, 2006a] and
p(ηdk , λkd |η −k
d , z, w) ∝ Gaussian-LDA [Das et al., 2015]. CTM replaces the dirichlet
1 k k 2
λd (ρd ) prior in LDA with logistic normal distribution to capture the
k k k k k 2
exp(κ d ρ d − )p(λ d |N d , 0)N (η d |µ d , σ k ). correlation among topic proportions. Gaussian-LDA was first
2Nd 2 proposed for audio retrieval [Hu et al., 2012] and then used
(10)
to leverage word embeddings in the continuous vector space
Therefore we can sample ηdk based on the auxiliary variable [Das et al., 2015].
λkd . The sampling procedure is as follows: To learn high quality word embeddings, we combine the
• Sampling λkd : according to Equation 10 and [Pol- current dataset with Wikipedia as the knowledge source. The
son et al., 2013], we have the conditional distribution motivation of using Wikipedia as the supplemental source lies
k λk (ρ k 2
) k in the sheer range of topics and subjects that are covered and
p(λd |z, w, η) ∝ exp(− d 2 d )p(λd |Nd , 0), which re-
it allows us to enhance the semantics of word embeddings ex-
sults in a Polya-Gamma distribution PG(Nd , ρkd ). Fol- tracted from 20 Newsgroups and Reuters. In the experiment,
lowing the ideas in [Polson et al., 2013] and [Chen et we set the dimensionality of word embeddings to 100, and
al., 2013], Polya-Gamma variables can be drawn in O(1) the context window size to 12. We train word embeddings
time, and so a sample of λkd is obtained. for 100 epochs.
• Sampling ηdk : according to Equation 10, we can sample We are interested to see if the learned topics can reveal a
ηdk with posterior probability: similar mixture and correlation with the ground truth text cat-
k k 2 egories. Hence we set the number of topics K to the number
k k λd (ηd )
p(ηdk |η −k
d , z, w, λ) ∝ exp(κ η
d d − )N (η k k
|µ
d d , σ 2
k ). of categories. For uniformity, all the models are implemented
2 with Gibbs sampling and run for 100 iterations. The Gaus-
(11)
This results in a univariate Gaussian distribution sian topic hyper parameter µ0 is set to the sample mean of
N (γdk , (τdk )2 ) conditioned on the auxiliary variable λkd , all the word vectors, the initial degree of freedom ν0 to the
k k 2 −2 k k k k k 2 dimensionality of word embeddings, and Ψ0 to an identity
where γd = (τd ) (σd µd + κd + λd ζd ) and (τd ) = matrix.
−2
(σd + λkd )−1 . Thus, given the auxiliary variable λkd , ηdk
1
can be easily drawn from a univariate Gaussian distribu- www.qwone.com/ jason/20Newsgroups/
2
tion. www.daviddlewis.com/resources/testcollections/reuters21578/
4210
Figure 2: Topic Words and Correlations.
6.1 Topic Words and Correlations space. This demonstrates how the known word-level corre-
lation information can aid us in discovering the topic-level
To investigate the quality of topics and the topic cor- correlations.
relations discovered by CGTM, we visualize each topic In this subsection, we qualitatively exhibit the effectiveness
with their top words as well as the topic correlations. of discovering topics and topic correlations of CGTM. In the
To make the visualization clearer, we select only 6 cate- following subsections, we will quantitatively evaluate CGTM
gories from the 20 Newsgroups dataset whose topic words on topic coherence and topic correlation discovery.
and correlations can be easily recognized and defined.
The selected newsgroups are “rec.autos”, “rec.motorcycles”, 6.2 Topic Coherence
“rec.sport.baseball”, “rec.sport.hockey”, “talk.politics.guns”, In order to quantitatively assess the topic coherence, we
and “talk.politics.mideast”. Thus, in this experiment, we set adopt a metric called coherence score of topics proposed by
the number of topics K to 6. As Figure 2 shows, we display [Mimno et al., 2011] which is able to automatically evaluate
top 10 words for each topic discovered by CGTM and map the coherence of each discovered topic. Given a topic z and
the corresponding word embeddings into a two-dimensional its top T words V z = {v1z , v2z , ..., vTz }, the coherence score
space via Principal Component Analysis (PCA). The size of of this topic is defined as:
each word varies with its relative frequency in the correspond-
T X
t
ing topic. The different colors and shapes of words indicate X D(vtz , vlz ) + 1
they are from 6 different topics. Each circle depicts the Gaus- C(z; V z ) = log , (13)
t=2 l=1
D(vlz )
sian distribution for each topic. The detected topic correla-
tion is represented as a dashed line between topics. As one where D(vlz ) is the document frequency of word vlz and
can observe, all the newsgroups, Hockey (topic 1), Baseball D(vtz , vlz ) is the number of documents in which words vtz
(topic 2), Mideast (topic 3), Guns (topic 4), Autos (topic 5) and vlz co-occurred. The coherence score follows the intu-
and Motorcycles (topic 6), are successfully discovered with ition that words from the same topic tend to co-occur in doc-
reasonable topic words. uments. This topic coherence score has been proven to be
As the ground truth labels indicate, one can easily figure highly consistent with human coherence judgements [Mimno
that Autos is correlated with Motorcycles, Baseball is cor- et al., 2011].
related with Hockey, and Guns is correlated with Mideast. The topic coherence result on the 20 Newsgroups dataset
The dashed lines in the figure denote the automatically de- is reported in Table 1. In order to investigate the overall qual-
tected topic correlations by CGTM. With the help of word ity of all the discovered topics, the average coherence score
1
V z ).
P
embeddings and Gaussian topics, topic correlations are also is reported, which is calculated as C̄ = K z C(z;
correctly detected, as the dashed lines show. We can see that, To make this evaluation more comprehensive, the number
since word embeddings can capture the regularities in lan- of topic words T ranges from 5 to 50. For all the models,
guage such as synonyms, two topics tend to be correlated if the topic words are ordered by word counts in each topic.
their topic word embeddings overlap in the continuous vector Though for Gaussian-LDA and CGTM, topic words can also
4211
Top T words 5 10 20 50 Model Precision Recall F1

LDA -13.86 -64.11 -322.07 -2384.68 LDA 0.438 0.507 0.470
CTM -13.77 -64.49 -323.71 -2395.58 CTM 0.447 0.634 0.524
Gaussian-LDA -14.83 -66.31 -323.91 -2505.33 Gaussian-LDA 0.438 0.496 0.465
CGTM -12.37 -60.48 -317.43 -2362.75 CGTM 0.523 0.623 0.568
Table 1: Comparison of topic coherence scores. Table 2: Comparison of document-topic distribution on the 20
Newsgroups dataset.
be ordered by word probabilities under each Gaussian topic,

we still order them by word counts, since first, the Gaussian
posterior probability information has already been fully uti-
lized in the training phase and second, this coherence score is Model Precision Recall F1
more appropriate to measure frequent words in a topic. The
LDA 0.844 0.392 0.535
result shows that the topic words discovered by our model are
more coherent than the topic words discovered by the base- CTM 0.796 0.433 0.561
line models. Gaussian-LDA 0.865 0.405 0.552
6.3 Document Topics and Topic Correlation CGTM 0.870 0.431 0.576
We see that the topics discovered by CGTM qualitatively
Table 3: Comparison of document-topic distribution on the Reuters
exhibit good topic words and reasonable correlations, and dataset.
CGTM also outperforms the baseline models in terms of co-
herence score. But are the topics discovered by our model
really corresponding to the coherent news categories? If pairs in clustering result E, and ||pairE || represents the num-
yes, it would be very convenient for us to assess the qual- ber of instances in pairE . The experimental results of docu-
ity of the detected topic correlations, because the correla- ment clustering on 20 Newsgroups and Reuters are reported
tions among the ground truth newsgroups labels are well de- in Table 2 and Table 3 respectively. We can see that, with
fined. For example, 20 Newsgroups categories “rec.autos” respect to the consistency between ground truth document
and “rec.motorcycles” are clearly correlated, and Reuters cat- labels and discovered topics, CGTM outperforms the other
egories “money” and “trade” should also exhibit correlations. baselines on both datasets.
To answer this question, we compare the ground truth docu-
ment labels with the document-topic labels discovered by the 7 Conclusions
models to see if they are consistent. The label of each docu-
ment comes from the dataset and is used as the ground truth. In this paper, we have proposed a correlated topic model us-
The document-topic label of each document is assigned by ing word embeddings. Word embeddings learnt from large,
the models. More specifically, for each model, we can assign unstructured corpora, such as Wikipedia, can aid us in mod-
one single topic to document d according to: eling topics and topic correlation by bringing in additional
useful semantics. The known word-level correlation infor-
zd = argmaxz p(z|d). mation in word embeddings is passed to topic-level corre-
lation discovery task via Gaussian topics. In our case, the
So this is a clustering evaluation problem where each doc- word embeddings are trained on the combined collections of
ument is a sample. To solve the cluster matching problem, Wikipedia and the 20 newsgroups dataset. We model each
e.g., ground truth label 1 may correspond to topic 5 instead topic as a Gaussian distribution over word embeddings and
of topic 1, we adopt pairwise comparison [Menestrina et al., directly learn topic correlations in the vector space. The ex-
2010] to measure the consistency between the ground truth periments qualitatively show CGTM is able to learn meaning-
document labels and the learned topic representation of doc- ful topics and topic correlation, and quantitatively validate the
uments. The pairwise comparison is defined as: effectiveness of our model in terms of topic coherence score
||pairE ∩ pairG || and document clustering on two real-world datasets.
precision(E, G) = ,
||pairE ||
||pairE ∩ pairG ||
Acknowledgements
recall(E, G) = , This work was supported in part by the US National Science
||pairG ||
Foundation under grants NSF IIS-1218393, IIS-1514204 and
2 × precision × recall IIS 1319973. Any opinions, findings, and conclusions or rec-
F 1(E, G) = ,
precision + recall ommendations expressed in this material are those of the au-
where E and G are two clustering solutions corresponding to thor(s) and do not necessarily reflect the views of the National
the document-topic clusters and the ground truth document Science Foundation.
labels respectively in our case, and pairE denotes the set of
4212
References [Li et al., 2017] Shaohua Li, Jun Zhu, and Chunyan Miao.
[Bengio et al., 2003] Yoshua Bengio, Réjean Ducharme, Psdvec: A toolbox for incremental and scalable word em-
Pascal Vincent, and Christian Jauvin. A neural probabilis- bedding. Neurocomputing, 237:405–409, 2017.
tic language model. journal of machine learning research, [Liu et al., 2015] Yang Liu, Zhiyuan Liu, Tat-Seng Chua,
3(Feb):1137–1155, 2003. and Maosong Sun. Topical word embeddings. In AAAI,
[Blei and Lafferty, 2006a] David Blei and John Lafferty. pages 2418–2424, 2015.
Correlated topic models. Advances in neural information [Mei et al., 2008] Qiaozhu Mei, Deng Cai, Duo Zhang, and
processing systems, 18:147, 2006. ChengXiang Zhai. Topic modeling with network regular-
[Blei and Lafferty, 2006b] David M Blei and John D Laf- ization. In Proceedings of the 17th international confer-
ence on World Wide Web, pages 101–110. ACM, 2008.
ferty. Dynamic topic models. In Proceedings of the 23rd
international conference on Machine learning, pages 113– [Menestrina et al., 2010] David Menestrina, Steven Euijong
120. ACM, 2006. Whang, and Hector Garcia-Molina. Evaluating entity res-
[Blei et al., 2003] David M Blei, Andrew Y Ng, and olution results. Proceedings of the VLDB Endowment, 3(1-
2):208–219, 2010.
Michael I Jordan. Latent dirichlet allocation. Journal of
machine Learning research, 3(Jan):993–1022, 2003. [Mikolov and Dean, 2013] T Mikolov and J Dean. Dis-
tributed representations of words and phrases and their
[Chen et al., 2013] Jianfei Chen, Jun Zhu, Zi Wang, Xun
compositionality. Advances in neural information process-
Zheng, and Bo Zhang. Scalable inference for logistic- ing systems, 2013.
normal topic models. In Advances in Neural Information
Processing Systems, pages 2445–2453, 2013. [Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg
Corrado, and Jeffrey Dean. Efficient estimation of
[Collobert and Weston, 2008] Ronan Collobert and Jason
word representations in vector space. arXiv preprint
Weston. A unified architecture for natural language pro- arXiv:1301.3781, 2013.
cessing: Deep neural networks with multitask learning. In
Proceedings of the 25th international conference on Ma- [Mimno et al., 2011] David Mimno, Hanna M Wallach, Ed-
chine learning, pages 160–167. ACM, 2008. mund Talley, Miriam Leenders, and Andrew McCallum.
Optimizing semantic coherence in topic models. In Pro-
[Das et al., 2015] Rajarshi Das, Manzil Zaheer, and Chris ceedings of the Conference on Empirical Methods in Natu-
Dyer. Gaussian lda for topic models with word embed- ral Language Processing, pages 262–272. Association for
dings. In Proceedings of the 53nd Annual Meeting of the Computational Linguistics, 2011.
Association for Computational Linguistics, 2015.
[Mnih and Hinton, 2009] Andriy Mnih and Geoffrey E Hin-
[Griffiths, 2002] Tom Griffiths. Gibbs sampling in the gen- ton. A scalable hierarchical distributed language model. In
erative model of latent dirichlet allocation. 2002. Advances in neural information processing systems, pages
[Hofmann, 1999] Thomas Hofmann. Probabilistic latent se- 1081–1088, 2009.
mantic indexing. In Proceedings of the 22nd annual inter- [Morin and Bengio, 2005] Frederic Morin and Yoshua Ben-
national ACM SIGIR conference on Research and develop- gio. Hierarchical probabilistic neural network language
ment in information retrieval, pages 50–57. ACM, 1999. model. In Aistats, volume 5, pages 246–252. Citeseer,
[Holmes et al., 2006] Chris C Holmes, Leonhard Held, et al. 2005.
Bayesian auxiliary variable models for binary and multi- [Pennington et al., 2014] Jeffrey Pennington, Richard
nomial regression. Bayesian analysis, 1(1):145–168, Socher, and Christopher D Manning. Glove: Global
2006. vectors for word representation. In EMNLP, volume 14,
[Hu et al., 2012] Pengfei Hu, Wenju Liu, Wei Jiang, and pages 1532–43, 2014.
Zhanlei Yang. Latent topic model based on gaussian- [Polson et al., 2013] Nicholas G Polson, James G Scott, and
lda for audio retrieval. In Chinese Conference on Pattern Jesse Windle. Bayesian inference for logistic models using
Recognition, pages 556–563. Springer, 2012. pólya–gamma latent variables. Journal of the American
[Huang et al., 2012] Eric H Huang, Richard Socher, Christo- statistical Association, 108(504):1339–1349, 2013.
pher D Manning, and Andrew Y Ng. Improving word rep- [Zhu et al., 2012] Jun Zhu, Amr Ahmed, and Eric P. Xing.
resentations via global context and multiple word proto- Medlda: maximum margin supervised topic models. Jour-
types. In Proceedings of the 50th Annual Meeting of the nal of Machine Learning Research, 13:2237–2278, 2012.
Association for Computational Linguistics: Long Papers-
[Zhu et al., 2014] Jun Zhu, Ning Chen, Hugh Perkins, and
Volume 1, pages 873–882. Association for Computational
Bo Zhang. Gibbs max-margin topic models with data
Linguistics, 2012.
augmentation. Journal of Machine Learning Research,
[Li et al., 2016] Shaohua Li, Tat-Seng Chua, Jun Zhu, and 15(1):1073–1110, 2014.
Chunyan Miao. Generative topic embedding: A contin-
uous representation of documents. In Proceedings of The
54th Annual Meeting of the Association for Computational
Linguistics (ACL), 2016.
4213

Word Embeddings Paper

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Word Embeddings Paper

Uploaded by

Copyright:

Available Formats

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

A Correlated Topic Model Using Word Embeddings

To validate the efficacy of our proposed model, we evalu-

follows: z −dn which is the topic assignment scheme without consid-

Figure 2: Topic Words and Correlations.

Top T words 5 10 20 50 Model Precision Recall F1

be ordered by word probabilities under each Gaussian topic,

You might also like