Semantic Word Clusters Using Signed Normalized Graph Cuts
small overlap between all three thesauri tested. We evaluated our clusters by comparing them to different vector representations. In addition, we evaluated our clusters against SimLex-999. Finally, we tested our method on the sentiment analysis task. Overall, signed spectral clustering provides a clean and elegant augmentation of current methods, and may have broad application to many fields. Our main contributions are the novel method for clustering signed graphs due to Gallier (2016), the application of this method to create semantic word clusters which are agnostic to both vector space representations and thesauri, and finally, the systematic evaluation and creation of word clusters using thesauri.

1.1. Related Work

Semantic word clusters and distributional thesauri have been well studied (Lin, 1998; Curran, 2004). Recently there has been a lot of work on incorporating synonyms and antonyms into word embeddings. Most recent models either attempt to construct richer contexts, in order to find semantic similarity, or overlay thesaurus information in a supervised or semi-supervised manner. Tang et al. (2014) created sentiment-specific word embeddings (SSWE), which were trained for Twitter sentiment. Yih et al. (2012) proposed polarity inducing latent semantic analysis (PILSA) using thesauri, which was extended by Chang et al. (2013) to a multi-relational setting. The Bayesian probabilistic tensor factorization model (BPTF) was introduced in order to combine multiple sources of information (Zhang et al., 2014). Faruqui et al. (2015) used belief propagation to modify existing vector space representations. The word embeddings on Thesauri and Distributional information (WE-TD) model (Ono et al., 2015) incorporated thesauri by altering the objective function for word embedding representations. Similarly, Pham et al. (2015) introduced the multitask Lexical Contrast Model, which extended the word2vec Skip-gram method to optimize for both context and synonymy/antonymy relations. Our approach differs from the aforementioned methods in that we created word clusters using the antonym relationships as negative links. Similar to Faruqui et al. (2015), our signed clustering method uses existing vector representations to create word clusters.

To our knowledge, Gallier (2016) is the first theoretical foundation of multiclass signed normalized cuts. Hou (2005) used positive degrees of nodes in the degree matrix of a signed graph with weights (−1, 0, 1), which was advanced by Kolluri et al. (2004) and Kunegis et al. (2010) using absolute values of weights in the degree matrix. Although must-link and cannot-link soft spectral clustering (Rangapuram & Hein, 2012) shares similarities with our method, this similarity only applies to cases where cannot-link edges are present. Our method excludes a weight term for cannot-link edges, as well as the volume of cannot-link edges within the clusters. Furthermore, our optimization method differs from that of must-link / cannot-link algorithms. We developed a novel theory and algorithm that extends the clustering of Shi & Malik (2000) and Yu & Shi (2003) to the multi-class signed graph case (Gallier, 2016).

2. Signed Graph Cluster Estimation

2.1. Signed Normalized Cut

Weighted graphs for which the weight matrix is a symmetric matrix in which negative and positive entries are allowed are called signed graphs. Such graphs (with weights (−1, 0, +1)) were introduced as early as 1953 by Harary (1953) to model social relations involving disliking, indifference, and liking. The problem of clustering the nodes of a signed graph arises naturally as a generalization of the clustering problem for weighted graphs. Figure 1 shows a signed graph of word similarities with a thesaurus overlay. Gallier (2016) extends normalized cuts to signed graphs in order to incorporate antonym information into word clusters.

Definition 2.1. A weighted graph is a pair G = (V, W), where V = {v_1, ..., v_m} is a set of nodes or vertices, and W is a symmetric matrix called the weight matrix, such that w_{ij} ≥ 0 for all i, j ∈ {1, ..., m}, and w_{ii} = 0 for i = 1, ..., m. We say that a set {v_i, v_j} is an edge iff w_{ij} > 0. The corresponding (undirected) graph (V, E) with E = {{v_i, v_j} | w_{ij} > 0} is called the underlying graph of G.

Given a signed graph G = (V, W) (where W is a symmetric matrix with zero diagonal entries), the underlying graph of G is the graph with node set V and
set of (undirected) edges E = {{v_i, v_j} | w_{ij} ≠ 0}.

If (V, W) is a signed graph, where W is an m × m symmetric matrix with zero diagonal entries and with the other entries w_{ij} ∈ R arbitrary, for any node v_i ∈ V, the signed degree of v_i is defined as

    d_i = d(v_i) = \sum_{j=1}^{m} |w_{ij}|,

and the signed degree matrix D as

    D = diag(d(v_1), ..., d(v_m)).

For any subset A of the set of nodes V, let

    vol(A) = \sum_{v_i \in A} d_i = \sum_{v_i \in A} \sum_{j=1}^{m} |w_{ij}|.

For any two subsets A and B of V, define links^+(A, B), links^-(A, B), and cut(A, \bar{A}) by

    links^+(A, B) = \sum_{\substack{v_i \in A,\, v_j \in B \\ w_{ij} > 0}} w_{ij}

    links^-(A, B) = \sum_{\substack{v_i \in A,\, v_j \in B \\ w_{ij} < 0}} -w_{ij}

    cut(A, \bar{A}) = \sum_{\substack{v_i \in A,\, v_j \in \bar{A} \\ w_{ij} \neq 0}} |w_{ij}|.

Then the signed Laplacian L is defined by

    L = D - W.

Given a partition of V into K clusters (A_1, ..., A_K), we represent the jth block of this partition by a vector X^j such that

    X^j_i = a_j if v_i \in A_j, and X^j_i = 0 if v_i \notin A_j,

for some a_j ≠ 0.

Definition 2.2. The signed normalized cut sNcut(A_1, ..., A_K) of the partition (A_1, ..., A_K) is defined as

    sNcut(A_1, ..., A_K) = \sum_{j=1}^{K} \frac{cut(A_j, \bar{A}_j) + 2\, links^-(A_j, A_j)}{vol(A_j)}.

Another formulation is

    sNcut(A_1, ..., A_K) = \sum_{j=1}^{K} \frac{(X^j)^\top L X^j}{(X^j)^\top D X^j},

where X is the N × K matrix whose jth column is X^j.

Observe that minimizing sNcut(A_1, ..., A_K) amounts to minimizing the number of positive and negative edges between clusters, and also minimizing the number of negative edges within clusters. This second minimization captures the intuition that nodes connected by a negative edge should not be together (they do not "like" each other; they should be far from each other).
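As a sanity check, the two formulations of the signed normalized cut can be compared numerically. The NumPy sketch below uses an invented three-node signed graph and partition (not from the paper); both computations return the same value.

```python
import numpy as np

# Toy signed weight matrix (symmetric, zero diagonal): positive entries act
# like synonym links, negative entries like antonym links.
W = np.array([[ 0.0,  2.0, -1.0],
              [ 2.0,  0.0,  0.0],
              [-1.0,  0.0,  0.0]])

d = np.abs(W).sum(axis=1)   # signed degrees d_i = sum_j |w_ij|
D = np.diag(d)              # signed degree matrix
L = D - W                   # signed Laplacian

def sncut(W, clusters):
    """Signed normalized cut computed directly from Definition 2.2."""
    m = W.shape[0]
    total = 0.0
    for A in clusters:
        A = np.asarray(A)
        rest = np.setdiff1d(np.arange(m), A)
        cut = np.abs(W[np.ix_(A, rest)]).sum()    # |w_ij| leaving the cluster
        inside = W[np.ix_(A, A)]
        links_neg = -inside[inside < 0].sum()     # negative edges inside A
                                                  # (each counted twice by the double sum)
        vol = np.abs(W[A, :]).sum()
        total += (cut + 2 * links_neg) / vol
    return total

def sncut_rayleigh(L, D, clusters):
    """The same quantity via the Rayleigh-quotient formulation (a_j = 1)."""
    m = D.shape[0]
    total = 0.0
    for A in clusters:
        x = np.zeros(m)
        x[list(A)] = 1.0                          # indicator vector X^j
        total += (x @ L @ x) / (x @ D @ x)
    return total

clusters = [[0, 1], [2]]
print(sncut(W, clusters))
print(sncut_rayleigh(L, D, clusters))
```

Note that links^-(A_j, A_j) counts each within-cluster negative edge twice (once per ordered pair), which is exactly what makes the definitional form agree with the quadratic form (X^j)^T L X^j.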
If X is a solution to the relaxed problem, then XQ is also a solution, where Q ∈ O(K). We make the change of variable Y = D^{1/2} X, or equivalently X = D^{-1/2} Y.

1. We set Λ = I and find R ∈ O(K) that minimizes ||X − ZR||_F.

2. Given X, Z, and R, find a diagonal invertible matrix Λ that minimizes ||X − ZRΛ||_F.
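The two steps above can be made concrete. The following NumPy sketch is an illustration under assumptions: Z stands in for the continuous (relaxed) solution being matched and X for the current discrete candidate, both invented here. Step 1 is an orthogonal Procrustes problem solved with an SVD; step 2 decomposes into K independent one-dimensional least-squares problems, one per column.

```python
import numpy as np

def procrustes_rotation(X, Z):
    """Step 1: with Lambda = I, find R in O(K) minimizing ||X - Z R||_F.
    Orthogonal Procrustes: R = U V^T where Z^T X = U S V^T."""
    U, _, Vt = np.linalg.svd(Z.T @ X)
    return U @ Vt

def diagonal_rescale(X, Z, R):
    """Step 2: given X, Z, R, find diagonal Lambda minimizing ||X - Z R Lambda||_F.
    Each diagonal entry solves a 1-D least-squares fit of one column."""
    M = Z @ R
    lam = np.einsum('ij,ij->j', M, X) / np.einsum('ij,ij->j', M, M)
    return np.diag(lam)

# Invented data just to exercise the two steps.
rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 2))   # stand-in for the relaxed solution
X = rng.standard_normal((6, 2))   # stand-in for the discrete candidate

R = procrustes_rotation(X, Z)
Lam = diagonal_rescale(X, Z, R)
```

Since Λ = I is a feasible diagonal matrix, step 2 can only decrease (or keep) the Frobenius error achieved after step 1, so alternating the two steps is monotone.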
Figure 2.1. Cluster with one antonym relation. The red edge
represents the antonym relation. Blue edges represent synonymy
relations.
4. Empirical Results

In this section we begin with an intrinsic analysis of the resulting clusters. We then compare the empirical clusters with SimLex-999 as a gold standard for semantic similarity, and finally evaluate our clusters on the sentiment prediction task. Our synonym clusters are well suited for this task, as including antonyms in clusters results in incorrect predictions.

4.1. Simulated Data

In order to evaluate our signed graph clustering method, we first focused on intrinsic measures of cluster quality. Figure 3.2 demonstrates that the number of negative edges within a cluster is minimized using our method.

Figure 3.2. Graph of disconnected component and negative edge relations: a plot of the relationship between the number of disconnected components and negative edges within the clusters.

4.2. Data

4.2.1. Word Embeddings

For comparison, we used four different word embedding methods: Skip-gram vectors (word2vec) (Mikolov et al., 2013), Global vectors (GloVe) (Pennington et al., 2014), Eigenwords (Dhillon et al., 2015), and Global Context (GloCon) (Huang et al., 2012) vector representations. We used word2vec 300-dimensional embeddings which were trained using the word2vec code on several billion words of English, comprising the entirety of Gigaword and the English discussion forum data gathered as part of BOLT. A minimal tokenization was performed based on CMU's twokenize[1]. For GloVe we used pretrained 200-dimensional vector embeddings[2] trained on Wikipedia 2014 + Gigaword 5 (6B tokens). Eigenwords were trained on English Gigaword with no lowercasing or cleaning. Finally, we used 50-dimensional vector representations from Huang et al. (2012), which used the April 2010 snapshot of the Wikipedia corpus (Lin, 1998; Shaoul, 2010), with a total of about 2 million articles and 990 million tokens.

4.2.2. Thesauri

Several thesauri were used in order to test robustness, including Roget's Thesaurus, the Microsoft Word English (MS Word) thesaurus from Samsonovic et al. (2010), and WordNet 3.0 (Miller, 1995). Jarmasz & Szpakowicz (2004) and Hale (1998) have shown that Roget's thesaurus has better semantic similarity than WordNet. This is consistent with our results using the larger SimLex-999 dataset.

We chose a subset of 5108 words for the training dataset, which had high overlap between the various sources. Changes to the training dataset had minimal effects on the optimal parameters. Within the training dataset, each of the thesauri had roughly 3700 antonym pairs, and combined they had 6680. However, the number of distinct connected components varied, with Roget's Thesaurus having the fewest (629), and the MS Word Thesaurus (1162) and WordNet (2449) having the most. These ratios were consistent across the full dataset.

4.3. Cluster Evaluation

One of our main goals was to go beyond qualitative analysis into quantitative measures of synonym clusters and word similarity. In Table 1, we show the 4 most-associated words with "accept", "negative" and "unlike".

4.3.1. Cluster Similarity and Hyperparameter Optimization

For a similarity metric between any two words, we use the heat kernel of the Euclidean distance, so

    sim(w_i, w_j) = e^{-\|w_i - w_j\|^2 / \sigma}.

The thesaurus matrix entry T_{ij} has a weight of 1 if words i and j are synonyms, −1 if words i and j are antonyms, and 0 otherwise. Thus the weight matrix entries are

    W_{ij} = T_{ij}\, e^{-\|w_i - w_j\|^2 / \sigma}.

Table 2 shows results from the grid search for hyperparameter optimization. Here we show that Eigenword + MSW outperforms Eigenword + Roget, which is in contrast with the other word embeddings, where the combination with Roget performs better.

As a baseline, we created clusters using K-means, where the number of clusters K was set to 750. All K-means clusterings have a statistically significant difference in the number of antonym pairs relative to random assignment of labels. When compared with the MS Word thesaurus, the Word2Vec, Eigenword, GloCon, and GloVe word embeddings had a total of 286, 235, 235, and 220 negative edges, respectively. The results are similar with the other thesauri. This shows that there are a significant number of antonym pairs in the K-means clusters derived from the word embeddings. By optimizing the hyperparameters using normalized cuts without thesaurus information, we found a significant decrease in the number of negative edges, which was indistinguishable from random assignment and corresponded to a roughly ninety percent decrease across clusters. When analyzed using an out-of-sample thesaurus and 27081 words, the number of antonym clusters decreased to under 5 for all word embeddings, with the addition of antonym relationship information.

If we examine the number of distinct connected components within the different word clusters, we observe that when K-means was used, the number of disconnected components was statistically significantly different from random labelling. This suggests that the word embeddings capture synonym relationships. By optimizing the hyperparameters we found roughly a

[1] https://github.com/brendano/ark-tweet-nlp
[2] http://nlp.stanford.edu/projects/GloVe/
Table 2. Clustering evaluation after hyperparameter optimization, minimizing error using grid search.
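The signed weight construction of Section 4.3.1 (heat-kernel similarity masked by thesaurus signs) can be sketched as follows. The three-word vocabulary, 2-D vectors, and σ = 1 are invented for illustration; the paper uses real embeddings and thesauri.

```python
import numpy as np

# Invented toy inputs; the paper uses real embeddings (word2vec, GloVe, ...)
# and real thesauri (Roget, MS Word, WordNet).
words = ["good", "great", "bad"]
vecs = np.array([[ 1.0, 0.0],
                 [ 0.9, 0.1],
                 [-1.0, 0.2]])
sigma = 1.0

# Thesaurus matrix T: +1 for synonyms, -1 for antonyms, 0 otherwise.
T = np.array([[ 0,  1, -1],
              [ 1,  0,  0],
              [-1,  0,  0]])

# Heat kernel of Euclidean distance: sim(w_i, w_j) = exp(-||w_i - w_j||^2 / sigma)
diff = vecs[:, None, :] - vecs[None, :, :]
sim = np.exp(-(diff ** 2).sum(axis=-1) / sigma)

# Signed weights W_ij = T_ij * sim(w_i, w_j): synonyms get positive weight,
# antonyms negative, unrelated pairs zero.
W = T * sim
```

The resulting W is symmetric with zero diagonal, so it is exactly the signed weight matrix expected by the signed normalized cut of Section 2.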
4.3.3. Sentiment Analysis

We used the Socher et al. (2013) sentiment treebank with coarse-grained labels on phrases and sentences from movie review excerpts. The treebank is split into training (6920), development (872), and test (1821) datasets. We trained an l2-norm regularized logistic regression (Friedman et al., 2001) using our word clusters in order to predict the coarse-grained sentiment at the sentence level. We compared our model against existing models: Naive Bayes with bag of words (NB), sentence word embedding averages (VecAvg), retrofitted sentence word embeddings (RVecAvg) (Faruqui et al., 2015), a simple recurrent neural network (RNN), a recurrent neural tensor network (RNTN) (Socher et al., 2013), and the state-of-the-art convolutional neural network (CNN) (Kim, 2014). Table 5 shows that although our model does not outperform the state of the art, signed clustering performs better than comparable models, including the recurrent neural network, which has access to more information.

Model                                        | Accuracy
NB (Socher et al., 2013)                     | 0.818
VecAvg (W2V, GV, GC) (Faruqui et al., 2015)  | 0.812, 0.796, 0.678
RVecAvg (W2V, GV, GC) (Faruqui et al., 2015) | 0.821, 0.822, 0.689
RNN, RNTN (Socher et al., 2013)              | 0.824, 0.854
CNN (Le & Zuidema, 2015)                     | 0.881
SC W2V                                       | 0.836
SC GV                                        | 0.819
SC GC                                        | 0.572
SC EW                                        | 0.820

Table 5. Sentiment analysis accuracy for binary predictions of the signed clustering algorithm (SC) versus other models.

5. Conclusion

We showed that we can find superior synonym clusters which do not require new word embeddings, but simply overlay thesaurus information. The clusters are general and can be used with several out-of-the-box word embeddings. By accounting for antonym relationships, our algorithm greatly outperforms simple normalized cuts, even with Huang's word embeddings, which are designed to capture semantic relations. Finally, we examined our clustering method on the sentiment analysis task from the Socher et al. (2013) sentiment treebank dataset and showed improved performance versus comparable models.

This method could be applied to a broad range of NLP tasks, such as prediction of social group clustering, identification of personal versus non-personal verbs, and analysis of clusters which capture positive, negative, and objective emotional content. It could also be used to explore multi-view relationships, such as aligning synonym clusters across multiple languages. Another possibility is to use thesauri and word vector representations together with word sense disambiguation to generate synonym clusters for multiple senses of words. Finally, our signed clustering could be extended to evolutionary signed clustering.

References

Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Chang, Kai-Wei, Yih, Wen-tau, and Meek, Christopher. Multi-relational latent semantic analysis. In EMNLP, pp. 1602–1612, 2013.

Faruqui, Manaal, Dodge, Jesse, Jauhar, Sujay K, Dyer, Chris, Hovy, Eduard, and Smith, Noah A. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL 2015, Denver, CO, 2015.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.

Gallier, Jean. Spectral theory of unsigned and signed graphs. Applications to graph clustering: a survey. arXiv preprint arXiv:1601.04692, 2016.

Hale, Michael Mc. A comparison of WordNet and Roget's taxonomy for measuring semantic similarity. arXiv preprint cmp-lg/9809003, 1998.

Harary, Frank. On the notion of balance of a signed graph. The Michigan Mathematical Journal, 2(2):143–146, 1953.

Harris, Zellig S. Distributional structure. Word, 1954.

Hill, Felix, Reichart, Roi, and Korhonen, Anna. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. arXiv preprint arXiv:1408.3456, 2014.

Hou, Yao Ping. Bounds for the least Laplacian eigenvalue of a signed graph. Acta Mathematica Sinica, 21(4):955–960, 2005.

Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pp. 873–882. Association for Computational Linguistics, 2012.

Jarmasz, Mario and Szpakowicz, Stan. Roget's thesaurus and semantic similarity. Recent Advances in Natural Language Processing III: Selected Papers from RANLP, 2003:111, 2004.

Kim, Yoon. Convolutional neural networks for sentence classification. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1746–1751, 2014.

Kolluri, Ravikrishna, Shewchuk, Jonathan Richard, and O'Brien, James F. Spectral surface reconstruction from noisy point clouds. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, pp. 11–21. ACM, 2004.

Kunegis, Jérôme, Schmidt, Stephan, Lommatzsch, Andreas, Lerner, Jürgen, De Luca, Ernesto William, and Albayrak, Sahin. Spectral analysis of signed graphs for clustering, prediction and visualization. In SDM, volume 10, pp. 559–559. SIAM, 2010.

Le, Phong and Zuidema, Willem. Compositional distributional semantics with long short term memory. arXiv preprint arXiv:1503.02510, 2015.

Lin, Dekang. Automatic retrieval and clustering of similar words. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pp. 768–774. Association for Computational Linguistics, 1998.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Miller, George A. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

Mohammad, Saif M, Dorr, Bonnie J, Hirst, Graeme, and Turney, Peter D. Computing lexical contrast. Computational Linguistics, 39(3):555–590, 2013.

Ono, Masataka, Miwa, Makoto, and Sasaki, Yutaka. Word embedding-based antonym detection using thesauri and distributional information. 2015.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543, 2014.

Rangapuram, Syama Sundar and Hein, Matthias. Constrained 1-spectral clustering. International Conference on Artificial Intelligence and Statistics (AISTATS), 22:1143–1151, 2012.

Samsonovic, Alexei V, Ascoli, Giorgio A, and Krichmar, Jeffrey. Principal semantic components of language and the measurement of meaning. PLoS ONE, 5(6):e10921, 2010.

Scheible, Silke, Schulte im Walde, Sabine, and Springorum, Sylvia. Uncovering distributional differences between synonyms and antonyms in a word space model. In Proceedings of the 6th International Joint Conference on Natural Language Processing, pp. 489–497, 2013.

Shaoul, Cyrus. The Westbury Lab Wikipedia corpus. Edmonton, AB: University of Alberta, 2010.

Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Yih, Wen-tau, Zweig, Geoffrey, and Platt, John C. Polarity inducing latent semantic analysis. In EMNLP-CoNLL, pp. 1212–1222. Association for Computational Linguistics, 2012.

Yu, Stella X and Shi, Jianbo. Multiclass spectral clustering. In Ninth IEEE International Conference on Computer Vision, pp. 313–319. IEEE, 2003.

Zhang, Jingwei, Salwen, Jeremy, Glass, Michael, and Gliozzo, Alfio. Word semantic representations using Bayesian probabilistic tensor factorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1522–1531, 2014.

Zhao, Ying and Karypis, George. Criterion functions for document clustering: Experiments and analysis. Technical report, Citeseer, 2001.