
Automatic Selection of Context Configurations for

Improved (and Fast) Class-Specific Word Representations


Ivan Vulić1, Roy Schwartz2, Ari Rappoport2, Roi Reichart3, Anna Korhonen1
1 Language Technology Lab, DTAL, University of Cambridge
2 Institute of Computer Science, The Hebrew University
3 Faculty of Industrial Engineering and Management, Technion, IIT
{iv250,alk23}@cam.ac.uk {roys02,arir}@cs.huji.ac.il
roiri@ie.technion.ac.il

arXiv:1608.05528v1 [cs.CL] 19 Aug 2016

Abstract

Recent work has demonstrated that state-of-the-art word embedding models require different context types to produce high-quality representations for different word classes such as adjectives (A), verbs (V), and nouns (N). This paper is concerned with identifying contexts useful for learning A/V/N-specific representations. We introduce a simple yet effective framework for selecting class-specific context configurations that yield improved representations for each class. We propose an automatic A*-style selection algorithm that effectively searches only a fraction of the large configuration space. The results on predicting similarity scores for the A, V, and N subsets of the benchmarking SimLex-999 evaluation set indicate that our method is useful for each class: the improvements are 6% (A), 6% (V), and 5% (N) over the best previously proposed context type for each class. At the same time, the model trains on only 14% (A), 26.2% (V), and 33.6% (N) of all dependency-based contexts, resulting in much shorter training time.

1 Introduction

Dense real-valued distributed word representations, or word embeddings, have become ubiquitous in NLP, serving as invaluable features in a broad range of NLP tasks (Turian et al., 2010; Collobert et al., 2011; Chen and Manning, 2014). The omnipresent word2vec skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013) is still considered the state-of-the-art word representation model, due to its simplicity, fast training, as well as its solid and robust performance across a wide variety of semantic tasks (Baroni et al., 2014; Levy et al., 2015).

The original implementation of SGNS learns word representations from local bag-of-words contexts (BOW). However, the underlying SGNS model is equally applicable to other context types (Levy and Goldberg, 2014a).
Recent work demonstrated that reaching beyond the omnipresent BOW contexts towards contexts based on dependency parses (Bansal et al., 2014; Schwartz et al., 2016; Melamud et al., 2016) or symmetric patterns (Schwartz et al., 2015; Schwartz et al., 2016) yields significant improvements in learning representations for particular word classes such as adjectives (A) and verbs (V). Curiously enough, despite the success story with adjectives and verbs, SGNS with BOW contexts still outperforms all other context types in noun (N) representation learning.1
Further, several recent studies (Ling et al., 2015; Schwartz et al., 2016) suggested that not all contexts are created equal: for instance, Schwartz et al. (2016) show that it is possible to learn high-quality representations of verbs and adjectives using only a subset of dependency-based contexts that covers coordination structures for the SGNS training, while discarding everything else. Besides their competitive performance on verb and adjective similarity measured on SimLex-999 (Hill et al., 2015), training with such coordination-based contexts is significantly faster due to the pre-training selection step.
In this work, we propose a simple yet effective general framework for selecting context configurations that yield improved and fast representations for verbs, adjectives, and nouns. Starting from a dependency-parsed training corpus, the key idea consists of two steps. First, we detect and retain only individual context bags useful for a particular word class. Such context bags are labelled by language-universal typed dependency links, e.g., amod contexts or dobj contexts. Second, we select final class-specific context configurations by automatically searching the large configuration space using a novel A*-style heuristic algorithm. For instance, our algorithm detects that the combination of amod and conj contexts is effective for adjective similarity.

1 We refer the reader to the work of Schwartz et al. (2016) for a comprehensive overview of results on SimLex-999.
We show that SGNS requires different context
configurations to produce optimal results for each
word class. Some context bags that boost learning
representations for one word class (e.g., amod contexts for adjectives) may be uninformative or even
harmful when learning representations for another
word class (e.g., amod for verbs). By removing
such uninformative context bags, we are able both
to speed up the SGNS training (by reducing the
number of training items) and to improve representations for each class (by using more focused
contexts).
The results on the task of predicting similarity
scores for the verb (V), adjective (A), and noun (N)
portions of the benchmarking SimLex-999 evaluation set indicate that the proposed method is useful
for each class. Using a standard experimental setup
(Levy et al., 2015), we outperform the best baseline
context type for each word class: the improvements
are 6 points for adjectives, 6 for verbs, and 5 for
nouns over the best previously proposed context
type for each word class. At the same time, the
model trains on only 14% (A), 26.2% (V), and
33.6% (N) of all dependency-based contexts. We
also show that by building context configurations
we obtain improvements on the entire SimLex-999
benchmark (4 points over the best baseline). Interestingly, this context configuration is not the
optimal configuration for any word class.
We also demonstrate that the proposed method
is robust: the improvements extend to another standard training setup (Schwartz et al., 2016) which
uses more training data, as well as a different parser
and a different dependency annotation scheme.

2 Related Work

Word representations, also known as vector space models or word embeddings, date back to the early 1970s (Salton, 1971). Word representation models typically train on (word, context) pairs. Traditionally, most models use bag-of-words (BOW) contexts, which represent a word using its neighbouring words, irrespective of the syntactic or semantic relations between them (Collobert et al., 2011; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013; Pennington et al., 2014, inter alia).
Recently, several alternative context types have been proposed, motivated by the limitations of BOW contexts, most notably their focus on topical rather than functional similarity (e.g., coffee:cup vs. coffee:tea). These include dependency contexts (Grefenstette, 1994; Padó and Lapata, 2007; Levy and Goldberg, 2014a), pattern contexts (Baroni et al., 2010; Schwartz et al., 2015), and substitute vectors (Yatbaz et al., 2012; Melamud et al., 2015).
Very recently, several studies investigated the effect of different context types on word representation quality. Melamud et al. (2016) compared three context types on a set of intrinsic and extrinsic evaluation setups, and showed that the optimal type largely depends on the task at hand. Vulić and Korhonen (2016) compared a few variants of dependency and BOW contexts across different languages, reaching similar conclusions. Schwartz et al. (2016) showed that symmetric patterns are very useful contexts for verb and adjective similarity, while BOW works best for nouns, measured on the benchmarking SimLex-999 (Hill et al., 2015) evaluation set. They also indicated that coordinations, a subset of dependency contexts, are more useful for verbs and adjectives than the entire set of dependencies.
In this work we propose an algorithm for a systematic and efficient inspection of the dependency-based context configuration space, which yields superior class-specific representations, with substantial improvements for verbs, nouns, and adjectives. Previous attempts at improving word representations by specialising them for a particular relation (e.g., similarity vs. relatedness, antonyms) operate in one of the two following frameworks: (1) modifying the prior or the regularisation of the original training procedure (Yu and Dredze, 2014; Wieting et al., 2015; Liu et al., 2015; Kiela et al., 2015; Ling et al., 2015); (2) light post-processing procedures which use lexical knowledge to refine off-the-shelf pre-trained word vectors (Faruqui et al., 2015; Mrkšić et al., 2016). We suggest that the induced representations may also be specialised a priori (i.e., before training) using context selection.

3 Methodology

Representation Model Following prior work (Schwartz et al., 2016; Vulić and Korhonen, 2016), we keep the word representation model fixed while varying only the context type. For all context configurations and all other (baseline) context types, we opt for the standard and very robust choice (Levy et al., 2015) in vector space modeling: SGNS (Mikolov et al., 2013). In all experiments we use word2vecf, a reimplementation of word2vec able to learn from arbitrary (word, context) pairs.2 To simplify the presentation, we adopt the assumption that all training data for SGNS are in the form of such (word, context) pairs (Levy and Goldberg, 2014a; Levy and Goldberg, 2014c), where word is the current target word, and context is its observed context (e.g., BOW, positional, dependency-based).
By fixing the representation model and varying only the context type, one may attribute all differences in results to a sole factor: the actual context type/configuration.
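To make the (word, context) abstraction concrete, the following minimal sketch serialises extracted pairs into a one-pair-per-line text file of the kind word2vecf trains on. This is an assumption-laden illustration, not the authors' code; consult the word2vecf documentation for the exact expected input and the accompanying vocabulary files.

```python
# A minimal sketch: write (word, context) pairs, one pair per line.
# `pairs` is a hypothetical stand-in for any context extractor's output.
def write_pairs(pairs, path="pairs.txt"):
    with open(path, "w", encoding="utf-8") as f:
        for word, context in pairs:
            f.write(f"{word} {context}\n")

write_pairs([("discovers", "scientist_nsubj"),
             ("discovers", "stars_dobj"),
             ("scientist", "discovers_nsubj-1")])
```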
3.1 Terminology and Background

Dependency-Based Contexts This work starts from dependency-based contexts (DEPS) (Padó and Lapata, 2007; Utt and Padó, 2014), which were recently proven useful in learning word embeddings (Levy and Goldberg, 2014a; Melamud et al., 2016).
Given a parsed training corpus, for each target w with modifiers m1, . . . , mk and a head h, w is paired with the context elements m1_r1, . . . , mk_rk, h_rh^-1, where r is the type of the dependency relation between the head and the modifier (e.g., amod), and r^-1 denotes an inverse relation. A naive version of DEPS extracts contexts from the parsed corpus without any post-processing (more on this later). Given the example from Fig. 1, the DEPS contexts of discovers are: scientist_nsubj, stars_dobj, telescope_nmod. Compared to BOW, DEPS capture longer-range relations (e.g., telescope) and filter out accidental contexts (e.g., Australian).
Besides being oriented towards functional similarity through such more focused contexts, DEPS also provide a natural implicit grouping of related contexts. For instance, all pairs with the shared relation r and r^-1 may be taken as an r-based context bag: e.g., the pairs (scientist, Australian_amod) and (Australian, scientist_amod^-1) are labelled the amod context bag, while (discovers, stars_dobj) and (stars, discovers_dobj^-1) are labelled dobj (Fig. 1).

Figure 1: An example of extracting dependency-based contexts. Top: the example English sentence from (Levy and Goldberg, 2014a), now parsed using Universal Dependencies. Middle: the same sentence in Italian, UD-parsed. Note the similarity between the two parses, which suggests that our UD-based framework may be extended to other languages. Bottom: the intuition behind post-parsing prepositional arc collapsing. The uninformative short-range case arc between with and telescope is removed, while another pseudo-arc specifying the exact link type (i.e., prep:with) between discovers and telescope is added.
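The extraction of such pairs is mechanical once a parse is available. The sketch below illustrates naive DEPS extraction over one parsed sentence, assuming tokens are given as (id, form, head_id, relation) tuples; the token format and helper are illustrative, not the authors' code. The "-1" suffix marks the inverse relation (written r^-1 in the text); prepositional arc collapsing (case/nmod to prep:with) is omitted for brevity.

```python
# A minimal sketch of naive DEPS context extraction (Sect. 3.1).
def extract_deps_pairs(sentence):
    forms = {tid: form for tid, form, _, _ in sentence}
    pairs = []
    for tid, form, head, rel in sentence:
        if head == 0:                                   # skip the root arc
            continue
        pairs.append((forms[head], f"{form}_{rel}"))    # head sees modifier
        pairs.append((form, f"{forms[head]}_{rel}-1"))  # inverse direction
    return pairs

# "Australian scientist discovers stars with telescope" (UD, simplified)
sent = [(1, "Australian", 2, "amod"), (2, "scientist", 3, "nsubj"),
        (3, "discovers", 0, "root"), (4, "stars", 3, "dobj"),
        (5, "with", 6, "case"), (6, "telescope", 3, "nmod")]
print(extract_deps_pairs(sent))
```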
Universal Dependencies as Labels Following recent initiatives on cross-linguistically consistent annotation, we adopt the Universal Dependencies (UD) annotation scheme3 (McDonald et al., 2013; Nivre et al., 2015) to serve as labels for individual context bags.4 The scheme leans on the universal Stanford dependencies (de Marneffe et al., 2014) complemented with the universal POS tagset (Petrov et al., 2012). It is straightforward to translate previous annotation schemes to UD (de Marneffe et al., 2014).
The UD scheme is becoming a de facto standard in dependency parsing. It provides a universal and consistent inventory of categories for similar syntactic constructions across multiple languages. This property could facilitate representation learning in languages other than English, cross-lingual modeling, and the application of our methodology to other languages in future work.
Context Configurations Assume that we have obtained M distinct dependency relations r1, . . . , rM after parsing and post-processing the corpus.
2 https://bitbucket.org/yoavgo/word2vecf. For details concerning the implementation, we refer the reader to (Goldberg and Levy, 2014; Levy and Goldberg, 2014a).
3 http://universaldependencies.org/ (V1.2 used)
4 http://universaldependencies.org/u/dep/index.html

The j-th individual context bag, j = 1, . . . , M, labelled rj, is a bag (or a multiset) of (word, context) pairs where context has one of the following forms: v_rj or v_rj^-1, where v is some vocabulary word. A context configuration is then simply a set of individual context bags, e.g., R = {ri, rj, rk}, also labelled as R: ri+rj+rk. We call a configuration consisting of K individual context bags a K-set configuration (e.g., in this example, R is a 3-set configuration).
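In code, a configuration is just a set of bag labels, and its training data is the union of the pairs in those bags. The sketch below (illustrative names, not the authors' code) also makes explicit why brute-force search is costly: K bags induce 2^K - 1 non-empty configurations.

```python
from itertools import chain, combinations

# Toy bags: label -> list of (word, context) pairs.
bags = {"amod":   [("scientist", "Australian_amod")],
        "conjlr": [("boys", "girls_conj")],
        "conjll": []}

def pairs_for(config, bags):
    """Merge the (word, context) pairs of all bags in a configuration."""
    return list(chain.from_iterable(bags[label] for label in config))

def all_configs(labels):
    """All non-empty configurations over the given bag labels."""
    labels = list(labels)
    return [frozenset(c) for k in range(1, len(labels) + 1)
            for c in combinations(labels, k)]

assert len(all_configs(bags)) == 2 ** len(bags) - 1   # 7 for K = 3
print(pairs_for({"amod", "conjlr"}, bags))
```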
Intuition behind Context Configurations The
core idea of the work on building context configurations lies in the intuition that the distributional hypothesis (Harris, 1954) may be restated as a series
of statements according to a particular dependency
relation: for instance, two adjectives are similar
if they modify similar nouns, which is captured
by the amod typed dependency link. This could
also be reversed to reflect noun similarity by saying
that two nouns are similar if they are modified by
similar adjectives. In another example, two verbs
are similar if they are used as predicates of similar
nominal subjects (nsubj and nsubjpass), etc.
3.2 Individual Context Bags

Post-parsing steps are performed to obtain a comprehensive list of (language-universal) individual bags at the right level of granularity for computational feasibility (step 1), to enhance context expressiveness (steps 2-3), and to remove uninformative contexts (step 4).
(1) Closely related bags are merged into a new
broader bag: direct (dobj) and indirect objects
(iobj) are merged into obj. Further, nsubj
and nsubjpass are merged into one single bag
subj, xcomp and ccomp into comp, advcl and
advmod into adv. Note that these combinations
are performed solely for efficiency reasons (i.e.,
to reduce the search space), without affecting any
theoretical implications of the proposed algorithm
(see later Alg. 1).
(2) We perform the standard procedure of prepositional arc collapsing (Levy and Goldberg, 2014a; Vulić and Korhonen, 2016), illustrated in Fig. 1. Following this procedure, all pairs where the relation r has the form prep:X (where X is a preposition) are subsumed under an individual context bag labelled prep.
(3) Coordination-based contexts are extracted as in prior work (Schwartz et al., 2016), distinguishing between left and right contexts extracted from the conj relation.5 This context bag is called conjlr. We also utilise the variant that does not make the distinction, called conjll. If both are used together, the label is simply conj.
(4) Optionally, it is possible to filter out
all infrequent dependency relations (e.g., dep,
goeswith, foreign) and (semantically) uninformative ones (e.g., punct, det, cop, aux).6
After performing these steps, the final list of individual context bags we use in our experiments is
as follows: subj, obj, comp, nummod, appos,
nmod, acl, amod, prep, adv, compound,
conjlr, conjll.
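The bag-assignment logic behind steps (1)-(4) amounts to a small relabelling function over UD relations. The mapping below is transcribed from the text; the function itself is an illustrative sketch, not the authors' code.

```python
# Step 1: merge closely related relations into broader bags.
MERGE = {"dobj": "obj", "iobj": "obj",
         "nsubj": "subj", "nsubjpass": "subj",
         "xcomp": "comp", "ccomp": "comp",
         "advcl": "adv", "advmod": "adv"}
# Step 4: infrequent and (semantically) uninformative relations.
DISCARD = {"dep", "goeswith", "foreign", "punct", "det", "cop", "aux"}

def bag_label(rel):
    if rel in DISCARD:
        return None                 # filtered out entirely
    if rel.startswith("prep:"):
        return "prep"               # step 2: collapsed prepositional arcs
    # Step 3 (conjlr vs. conjll) additionally needs the arc direction,
    # which is omitted in this sketch.
    return MERGE.get(rel, rel)

print(bag_label("nsubjpass"), bag_label("prep:with"), bag_label("punct"))
# -> subj prep None
```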
3.3 Selection of Class-Specific Configurations

Initial Pools of Candidate Context Bags To reduce the number of possible configurations for each word class (again for efficiency reasons), it is possible to automatically pool only individual context bags that are potentially useful. For instance, since amod denotes an adjectival modifier of a noun, this bag can be safely removed from the pool of candidate bags for verbs. Using amod contexts for verb representation modeling may be a pure coincidence resulting from incorrect parses, and may eventually lead to deteriorated verb vectors. In another example, appositional modification captured by appos always denotes a relation between nouns, so it can be safely excluded as a candidate bag for adjectives and verbs. The construction of such initial class-specific pools is done automatically before the configuration search procedure starts (see line 1 in Alg. 1).
Best Configuration Search Although a brute-force exhaustive search over all possible configurations is possible in theory and for small pools (e.g., for adjectives, see Tab. 2), it becomes challenging or practically infeasible for large pools and large training data. For instance, based on the pools from Tab. 2, the search for the optimal configuration would involve trying out 2^7 - 1 = 127 different configurations for verbs (i.e., training 127 different SGNS models) and 2^10 - 1 = 1023 configurations for nouns (1023 SGNS models!). Therefore, to reduce the number of visited configurations, we have designed a simple heuristic search algorithm inspired by A* search (Hart et al., 1968).
5 Given the coordination structure boys and girls, the training pairs are (boys, girls_conj), (girls, boys_conj^-1).
6 Another choice is to retain all individual context bags and filter them later, see line 1 of Alg. 1.

Algorithm 1: Best Configuration Search

    Input: set of M individual context bags {r'1, r'2, . . . , r'M}
    1   build: pool of K <= M candidate individual context bags
        {r1, . . . , rK} according to the scores E(r'1), . . . , E(r'M),
        where E(·) is a fitness function;
    2   build: K-set configuration R^Pool = {r1, . . . , rK};
    3   initialize: (1) set of candidate configurations R = {R^Pool};
        (2) current level l = K; (3) best configuration Ro = ∅;
    4   search:
    5   repeat
    6       Rn <- ∅;
    7       Ro <- argmax_{R ∈ R ∪ {Ro}} E(R);
    8       foreach R ∈ R do
    9           foreach ri ∈ R do
    10              build new (l-1)-set context configuration
                    R_-ri = R \ {ri};
    11              if E(R_-ri) > E(R) then
    12                  Rn <- Rn ∪ {R_-ri};
    13      l <- l - 1;
    14      R <- Rn;
    15  until l == 0 or R == ∅;
    Output: best configuration Ro

Figure 2: An illustration of Alg. 1. The search space may be presented as a DAG with direct links between origin configurations (e.g., ri+rj+rk) and all its children configurations obtained by removing exactly one individual bag from the origin (e.g., ri+rj, rj+rk). After automatically constructing the initial pool (line 1), the entry point of the algorithm is the R^Pool configuration (line 2). Thicker circles denote visited configurations, while the gray circle denotes the best configuration found by the algorithm.
Alg. 1 provides a high-level overview of the algorithm, while the toy example from Fig. 2 illustrates its flow. First, starting from all possible M individual context bags, the algorithm automatically detects the subset of K candidate individual bags that are used as the initial pool (line 1 of Alg. 1). The selection is based on some fitness (goal) function E. In our setup, E(R) is Spearman's correlation with human judgement scores obtained on the development set after training the SGNS model with the configuration R. The selection is automatic, relying on a simple threshold. We use a threshold of 0.2 without any fine-tuning in all experiments with all word classes.
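A fitness function of this form can be sketched as follows, under stated assumptions: `train_sgns` is a hypothetical wrapper around word2vecf, `similarity` is cosine similarity, and `dev_set` holds (word1, word2, gold_score) triples; `pairs_for` is the earlier configuration-merging sketch.

```python
from scipy.stats import spearmanr

# A sketch of the fitness function E(R): train on the configuration's
# pairs, then score Spearman's rho against dev-set human judgements.
def fitness(config, bags, dev_set, train_sgns, similarity):
    model = train_sgns(pairs_for(config, bags))
    gold, predicted = [], []
    for w1, w2, score in dev_set:
        if w1 in model and w2 in model:   # skip out-of-vocabulary pairs
            gold.append(score)
            predicted.append(similarity(model[w1], model[w2]))
    return spearmanr(gold, predicted).correlation
```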
The algorithm then starts from the full K-set R^Pool configuration (line 3) and tests the K (K-1)-set configurations obtained by removing exactly one individual bag ri (line 10). It then retains only the (K-1)-set configurations that score higher than the higher-level origin K-set configuration from the previous step (lines 11-12, see Fig. 2). Using this principle, it continues searching only over lower-level (l-1)-set configurations that further improve performance over their l-set origin configuration. It stops if it reaches the lowest level or if it cannot improve the goal function any more (line 15). The best scoring configuration is returned (n.b., it is not guaranteed to be the global optimum).
In our experiments, with this heuristic, the search
for the optimal configuration for verbs is performed
only over 13 1-set configurations plus 26 other configurations (39 out of 133 possible configurations).7
For nouns, the advantage of the heuristic is even
more dramatic: only 104 out of 1026 possible configurations were considered during the search.8
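A compact re-implementation sketch of Alg. 1 follows; the names are illustrative, not the authors' code. `E` should be memoised, since sibling configurations at one level share children, so each configuration is trained and scored at most once.

```python
def best_configuration_search(pool, E):
    best, best_score = None, float("-inf")
    frontier = {frozenset(pool)}              # entry point: R^Pool (line 2)
    while frontier:                           # one level l per iteration
        scored = {R: E(R) for R in frontier}
        for R, s in scored.items():           # track best visited (line 7)
            if s > best_score:
                best, best_score = R, s
        next_frontier = set()
        for R, s in scored.items():
            for r in R:                       # drop exactly one bag (lines 9-10)
                child = R - {r}
                if child and E(child) > s:    # keep improving children only
                    next_frontier.add(child)  # (lines 11-12)
        frontier = next_frontier              # descend one level (lines 13-14)
    return best, best_score
```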

4 Experimental Setup

In terms of training data, pre-processing, parameter settings, and evaluation, our setup is replicated from a recent related study by Levy et al. (2015).
Training Data All the representations in our comparison are induced from the cleaned and tokenised English Polyglot Wikipedia data (Al-Rfou et al., 2013).9 The corpus contains approx. 75M sentences and 1.5G word tokens.
7 The total is 133 and not 127 as we have to include 6 additional 1-set configurations that have to be tested but are not included in the initial pool for verbs, see Tab. 1 and Tab. 2.
8 We have also experimented with a less conservative variant of the algorithm which does not stop when lower-level configurations are unable to improve E; it instead follows the path of the best-scoring lower-level configuration even if its score is lower than the score of its origin. As we do not observe any significant difference between the two variants, we opt for the faster and simpler one.
9 https://sites.google.com/site/rmyeid/projects/polyglot

The Wikipedia data were POS-tagged with universal POS (UPOS) tags (Petrov et al., 2012) using the state-of-the-art TurboTagger (Martins et al., 2013)10, trained using default settings without any further parameter fine-tuning (SVM MIRA with 20 iterations) on the TRAIN+DEV portion of the UD treebank annotated with UPOS tags. The data were then parsed with UD11 using the graph-based Mate parser v3.61 (Bohnet, 2010)12 with standard settings on the TRAIN+DEV UD treebank portion.13
Evaluation Following prior work (Schwartz et al., 2015; Schwartz et al., 2016), we experiment with the verb pair (222 pairs), adjective pair (111 pairs), and noun pair (666 pairs) portions of SimLex-999. We report Spearman's correlation between the ranks derived from the scores of the evaluated models and the human scores. Our final evaluation setup is borrowed from Levy et al. (2015): the context configurations are optimised on a development set, which is separate from the unseen test data. We perform 2-fold cross-validation; the reported scores are always the averages of the 2 runs. All scores are computed in the standard fashion, applying the cosine similarity to the vectors of the words participating in a pair.
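The scoring protocol reduces to a few lines. In the sketch below (illustrative names, assuming `vectors` maps words to numpy arrays and `pairs` holds (word1, word2, human_score) triples), each pair is scored with the cosine of its word vectors and compared against the human ratings via Spearman's correlation.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(vectors, pairs):
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            predicted.append(cosine(vectors[w1], vectors[w2]))
    return spearmanr(gold, predicted).correlation
```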
Baseline Context Types We compare context configurations against relevant baseline context types from prior work (Melamud et al., 2016; Schwartz et al., 2016; Vulić and Korhonen, 2016):
- BOW: Standard bag-of-words contexts.
- POSIT: Positional contexts (Schütze, 1993; Levy and Goldberg, 2014b), which enrich BOW with information on the sequential position of each context word. Given the example from Fig. 1, POSIT with window size 2 extracts these contexts for discovers: Australian_-2, scientist_-1, stars_+1, with_+2.
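A minimal sketch of POSIT extraction, matching the example above (illustrative code, not the authors'):

```python
# BOW contexts enriched with the relative position of each neighbour
# inside the window (here win=2).
def posit_contexts(tokens, index, win=2):
    contexts = []
    for offset in range(-win, win + 1):
        j = index + offset
        if offset != 0 and 0 <= j < len(tokens):
            contexts.append(f"{tokens[j]}_{offset:+d}")
    return contexts

sent = "Australian scientist discovers stars with telescope".split()
print(posit_contexts(sent, sent.index("discovers")))
# -> ['Australian_-2', 'scientist_-1', 'stars_+1', 'with_+2']
```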
- DEPS-All: All dependency links without any context selection, extracted from dependency-parsed data with prepositional arc collapsing.
- COORD: Coordination-based contexts, used as fast lightweight contexts for improved representations of adjectives and verbs (Schwartz et al., 2016). This is in fact the conjlr context bag, a subset of DEPS-All (see Step 3 in Sect. 3.2).
10 http://www.cs.cmu.edu/~ark/TurboParser/
11 Such a language-universal tagging+parsing procedure has been chosen deliberately. As suggested by Vulić and Korhonen (2016), the language-independent framework will facilitate extensions of this research to other languages in future work without any major language-specific adaptations.
12 https://code.google.com/archive/p/mate-tools/
13 We opt for the Mate parser due to its speed, simplicity, and state-of-the-art performance, see (Choi et al., 2015).

Context Group     Adj      Verb     Noun
conj (A+N+V)      0.415    0.281    0.401
obj (N+V)        -0.028    0.309    0.390
prep (N+V)        0.188    0.344    0.387
amod (A+N)        0.479    0.058    0.398
compound (N)     -0.124   -0.019    0.416
adv (V)           0.197    0.342    0.104
nummod (-)       -0.142   -0.065    0.029

Table 1: 2-fold cross-validation results for an illustrative selection of individual context groups over the SimLex-999 subsets associated with each word group. Values in parentheses denote the class-specific pools to which each group is finally selected based on its score.
Adjectives:  amod, conjlr, conjll
Verbs:       prep, acl, obj, comp, adv, conjlr, conjll
Nouns:       amod, prep, compound, subj, obj, appos, acl, nmod, conjlr, conjll

Table 2: Automatically selected initial pools of candidate bags for each word class (Sect. 3.3).
- SP: Contexts based on symmetric patterns (SPs)
(Davidov and Rappoport, 2006; Schwartz et al.,
2015), lexico-syntactic constructs such as X and
Y or X or Y; Y is then an SP context instance
for X, and vice versa. The SP contexts, extracted
from plain text, are very effective in modeling adjective and verb similarity, but they fall short of
BOW and related approaches in noun similarity
modeling (Schwartz et al., 2016).
SGNS Preprocessing and Parameters The SGNS preprocessing scheme was replicated from (Levy and Goldberg, 2014a; Levy et al., 2015) for all context types. After lowercasing, all words and contexts that appeared less than 100 times were filtered out. For DEPS-All, the vocabulary spans approx. 185K word types. The word2vecf SGNS for all models was trained using stochastic gradient descent and standard settings: 15 negative samples, global learning rate 0.025, subsampling rate 1e-4, 15 epochs.
Further, all representations were trained with d = 300 (very similar trends are observed with d = 100 and d = 500). For BOW and POSIT, the window size was tuned on the dev set: its value is 2 in all experiments, which is aligned with prior work (Schwartz et al., 2015). Parameters for the extraction of SPs were also tuned on the dev set.14
14 The SP extraction algorithm is available online: http://www.cs.huji.ac.il/roys02/software/dr06/dr06.html
Baselines (Verbs)
BOW (win=2)       0.336
POSIT (win=2)     0.345
COORD (conjlr)    0.283
SP                0.349
DEPS-All          0.344

Baselines (Nouns)
BOW (win=2)       0.435
POSIT (win=2)     0.437
COORD (conjlr)    0.392
SP                0.372
DEPS-All          0.441

Configurations: Verbs
POOL-ALL                      0.379
prep+acl+obj+adv+conj         0.393
prep+acl+obj+comp+conj        0.344
prep+obj+comp+adv+conj        0.391
prep+acl+adv+conj (BEST)      0.409
prep+acl+obj+adv              0.392
prep+acl+adv                  0.407
prep+acl+conj                 0.390
acl+obj+adv+conj              0.345
acl+obj+adv                   0.385

Configurations: Nouns
POOL-ALL                                    0.469
amod+subj+obj+appos+compound+nmod+conj      0.478
amod+subj+obj+appos+compound+conj           0.487
amod+subj+obj+appos+compound+conjlr         0.476
amod+subj+obj+compound+conj (BEST)          0.491
amod+subj+obj+appos+conj                    0.470
subj+obj+compound+conj                      0.479
amod+subj+compound+conj                     0.481
amod+subj+obj+compound                      0.478
amod+obj+compound+conj                      0.481

Table 3: 2-fold cross-validation results over SimLex-999 for (a) verbs and (b) nouns subsets. Only a selection of context configurations optimised for verb and noun similarity are shown. POOL-ALL denotes a configuration where all individual context bags from the verbs/nouns-oriented pools (see Table 2) are used. BEST denotes the best performing configuration found by Alg. 1. Other configurations visited by Alg. 1 that score higher than the best scoring baseline context type for each word class are in gray.
Baselines (Adjectives)
BOW (win=2)       0.489
POSIT (win=2)     0.460
COORD (conjlr)    0.407
SP                0.395
DEPS-All          0.360

Configurations: Adjectives
POOL-ALL: amod+conj (BEST)    0.546
amod+conjlr                   0.527
amod+conjll                   0.531
conj                          0.470

Table 4: 2-fold cross-validation results over the SimLex-999 (c) adjectives subset with adjective-specific configurations.

5 Results and Discussion

Not All Context Bags are Created Equal First, we test the performance of individual context bags across the SimLex-999 adjective, verb, and noun subsets. Besides providing insight into the intuition behind context selection, these findings are important for the automatic selection of class-specific pools. The results are shown in Tab. 1.
The experiment supports our intuition (see
Sect. 3.1): some context bags are definitely not
useful for some classes and may be safely removed
when performing the class-specific SGNS training.
For instance, the amod bag is indeed important for
adjective and noun similarity, and at the same time
it does not encode any useful information regarding
verb similarity; compound is, as expected, useful
only for nouns. Tab. 1 also suggests that some
context bags (e.g., nummod) do not encode any informative contextual evidence regarding similarity,
therefore they can be discarded. The initial results
with individual context bags help to reduce the
pool of candidate bags (line 1 in Alg. 1), see Tab. 2.
Note that while Tab. 1 shows cross-validation results, the actual pool is based solely on the dev set
performance.
Searching for Optimal Configurations Next, we test if we can actually improve class-specific representations by selecting specialised configurations. We traverse the configuration search space on the dev set using the proposed algorithm (Alg. 1). The results per word class are summarised in Tables 3 and 4.
Class-specific context configurations yield more accurate representations, as evident from the scores on the SimLex-999 subsets and the improvements over all baselines. The improvements with the best class-specific configurations found by Alg. 1 are approximately 6 points for adjectives, 6 for verbs, and 5 for nouns over the best baseline for each class.
The improvements are visible even with configurations that simply pool all candidate individual bags (POOL-ALL), without running Alg. 1. However, further careful context selection, i.e., traversing the configuration space using Alg. 1, leads to additional improvements for V and N (gains of 3 and 2.2 points).
It is interesting to point out that the best configuration for verbs does not include subj or obj contexts, which seem to be more important in noun representation modeling.
Context Type      A Questions   V Questions   N Questions
BOW (win=2)       31/41         14/19         16/19
POSIT (win=2)     32/41         13/19         15/19
COORD (conjlr)    26/41         11/19         8/19
SP                26/41         11/19         12/19
DEPS-All          31/41         14/19         16/19
BEST-ADJ          32/41         12/19         15/19
BEST-VERBS        24/41         15/19         16/19
BEST-NOUNS        30/41         14/19         17/19

Table 5: Results on the A/V/N TOEFL subsets. BEST-* are the best class-specific configurations as displayed in Tables 3 and 4. The reported scores are in the following form: correct_answers/overall_questions.

Context Type      Training Time   # Pairs
BOW (win=2)       179min 27s      5.974G
POSIT (win=2)     190min 12s      5.974G
COORD (conjlr)    4min 11s        129.69M
SP                1min 29s        46.37M
DEPS-All          103min 35s      3.165G
BEST-ADJ          14min 5s        447.4M
BEST-VERBS        29min 48s       828.55M
BEST-NOUNS        41min 14s       1.063G

Table 6: Training time in minutes of SGNS (d = 300) with different context types. BEST-* denotes the best scoring configuration for each class found by Alg. 1. # Pairs shows the total number of pairs used in SGNS training for each context type.

Very similar improved scores are achieved with a variety of configurations (see Tab. 3), especially in the neighbourhood of the optimal configuration found by Alg. 1. This indicates that the method is quite robust: even sub-optimal solutions result in improved class-specific representations.

Another Training Setup To demonstrate that context configurations generalise to settings with more training data, and other annotation and parser choices, in another experiment we have replicated the experimental setup of Schwartz et al. (2016). The training corpus is now the 8G words corpus generated by the word2vec script.15 A preprocessing step now merges common word pairs and triplets into expression tokens (e.g., Bilbo_Baggins). The corpus is parsed with labelled Stanford dependencies (de Marneffe and Manning, 2008) following (Levy and Goldberg, 2014a; Schwartz et al., 2016): Stanford POS Tagger (Toutanova et al., 2003) and the stack version of the MALT parser (Goldberg and Nivre, 2012). SGNS preprocessing and parameters are also replicated; the only difference is training with 500-dimensional embeddings as in prior work.
We use the same set of individual context bags as before. The straightforward translation from labelled Stanford dependencies into UD is performed using the mapping from de Marneffe et al. (2014): e.g., nn is mapped into compound, and rcmod, partmod, infmod are all mapped into the individual bag acl.

TOEFL Evaluation The effects of per-class representation specialisation are further supported by evaluation on the A, V, and N TOEFL questions (Landauer and Dumais, 1997). The results are summarised in Tab. 5. However, the limited size of this evaluation set (e.g., only 19 V and 19 N questions, and 41 A questions) prevents us from drawing any general conclusions from it.
Training: Fast and/or Accurate? Since context configurations rely on the idea of context selection, their advantage also lies in reduced SGNS training times. The configuration-based model trains on only 14% (A), 26.2% (V), and 33.6% (N) of all dependency-based contexts. Training times, along with the total number of training pairs for each context type, are displayed in Tab. 6. All SGNS models were trained using parallel word2vecf training on 10 Intel(R) Xeon(R) E5-2667 2.90GHz processors.
The results indicate that class-specific configurations are not as lightweight and fast as SP or COORD contexts (Schwartz et al., 2016). However, the results also suggest that such configurations provide a better trade-off between accuracy and speed: they reach peak performance for each word class, outscoring all baseline context types (including SP and COORD), while training is still much faster than with heavyweight full context types such as BOW, POSIT, or DEPS-All.

15 code.google.com/p/word2vec/source/browse/trunk/

Context Type      Adj      Verbs    Nouns    All
BOW (win=2)       0.604    0.307    0.501    0.464
POSIT (win=2)     0.585    0.400    0.471    0.469
COORD (conjlr)    0.629    0.413    0.428    0.430
SP                0.649    0.458    0.414    0.444
DEPS-All          0.574    0.389    0.492    0.464
BEST-ADJ          0.671    0.348    0.504    0.449
BEST-VERBS        0.392    0.455    0.478    0.448
BEST-NOUNS        0.581    0.327    0.535    0.489
BEST-ALL          0.616    0.402    0.519    0.506

Table 7: Results on the A/V/N SimLex-999 subsets, and on the entire set (All), in the training setup from Schwartz et al. (2016). d = 500. BEST-* are the best class-specific configurations borrowed from the previous training setup (see Sect. 4).

As in previous experiments, we again build class-specific configurations and evaluate on the A/V/N subsets of SimLex-999. We also evaluate another configuration found using Alg. 1, which targets overall improved performance without any finer-grained division into word classes (BEST-ALL). The results are provided in Tab. 7.
Class-specific configurations again outperform competitive baseline context types for adjectives and nouns. The BEST-VERBS configuration is slightly outscored by the SP baseline, but the difference is not statistically significant. Further, SGNS with the BEST-ALL configuration (amod+subj+obj+compound+prep+adv+conj) outperforms all baseline models on the entire benchmark. This finding also extends to the previous training setup. Interestingly, the non-specific BEST-ALL configuration falls short of the A/V/N-specific configurations for each class.

6 Conclusion

We have presented a novel framework for selecting class-specific context configurations which yield improved and fast representations for prominent word classes: adjectives, verbs, and nouns. Each word class requires a different configuration to produce optimal results on the class-specific subset of SimLex-999. We have also proposed an algorithm that is able to find a suitable class-specific configuration while making the search over the large space of possible context configurations computationally feasible.
Our approach is designed to be as general as possible, and can be employed for other context types, evaluation tasks, and languages. It is especially important to stress the language portability of our approach: it relies on language-agnostic annotations stemming from the Universal Dependencies initiative, and it also uses language-agnostic POS taggers and dependency parsers. We have also conducted preliminary experiments using exactly the same framework with verb- and adjective-specialised word representations for German and Italian. Evaluations on German and Italian SimLex-999 (Leviant and Reichart, 2015)16 display similar patterns as for English, and demonstrate that our framework can be easily applied to other languages.
In future work, we plan to test the framework with finer-grained individual context bags, and investigate beyond POS-based word classes and dependency relations; this may further refine and specialise induced vector spaces. We also plan to test the portability of the proposed approach to more languages.
16 http://technion.ac.il/ira.leviant/

References

[Al-Rfou et al.2013] Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In CoNLL, pages 183-192.
[Bansal et al.2014] Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In ACL, pages 809-815.
[Baroni et al.2010] Marco Baroni, Brian Murphy, Eduard Barbu, and Massimo Poesio. 2010. Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, pages 222-254.

[Baroni et al.2014] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL, pages 238-247.
[Bohnet2010] Bernd Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In COLING, pages 89-97.
[Chen and Manning2014] Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740-750.
[Choi et al.2015] Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It depends: Dependency parser comparison using a Web-based evaluation tool. In ACL, pages 387-396.
[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537.
[Davidov and Rappoport2006] Dmitry Davidov and Ari Rappoport. 2006. Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In ACL, pages 297-304.
[de Marneffe and Manning2008] Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1-8.
[de Marneffe et al.2014] Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In LREC, pages 4585-4592.
[Faruqui et al.2015] Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In NAACL-HLT, pages 1606-1615.
[Goldberg and Levy2014] Yoav Goldberg and Omer Levy. 2014. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. CoRR, abs/1402.3722.
[Goldberg and Nivre2012] Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In COLING, pages 959-976.
[Grefenstette1994] Gregory Grefenstette. 1994. Explorations in Automatic Thesaurus Discovery.
[Harris1954] Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146-162.

[Hart et al.1968] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100-107.
[Hill et al.2015] Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665-695.
[Kiela et al.2015] Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In EMNLP, pages 2044-2048.
[Landauer and Dumais1997] Thomas K. Landauer and Susan T. Dumais. 1997. Solutions to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240.
[Leviant and Reichart2015] Ira Leviant and Roi Reichart. 2015. Separated by an un-common language: Towards judgment language informed vector space modeling. CoRR, abs/1508.00106.
[Levy and Goldberg2014a] Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In ACL, pages 302-308.
[Levy and Goldberg2014b] Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representations. In CoNLL, pages 171-180.
[Levy and Goldberg2014c] Omer Levy and Yoav Goldberg. 2014c. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177-2185.
[Levy et al.2015] Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL, 3:211-225.
[Ling et al.2015] Wang Ling, Yulia Tsvetkov, Silvio Amir, Ramon Fermandez, Chris Dyer, Alan W Black, Isabel Trancoso, and Chu-Cheng Lin. 2015. Not all contexts are created equal: Better word representations with variable attention. In EMNLP, pages 1367-1372.
[Liu et al.2015] Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In ACL, pages 1501-1511.
[Martins et al.2013] André F. T. Martins, Miguel B. Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In ACL, pages 617-622.
[McDonald et al.2013] Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In ACL, pages 92-97.
[Melamud et al.2015] Oren Melamud, Ido Dagan, and Jacob Goldberger. 2015. Modeling word meaning in context with substitute vectors. In NAACL-HLT, pages 472-482.
[Melamud et al.2016] Oren Melamud, David McClosky, Siddharth Patwardhan, and Mohit Bansal. 2016. The role of context types and dimensionality in learning word embeddings. In NAACL-HLT.
[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111-3119.
[Mnih and Kavukcuoglu2013] Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In NIPS, pages 2265-2273.
[Mrkšić et al.2016] Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. 2016. Counter-fitting word vectors to linguistic constraints. In NAACL-HLT, pages 142-148.
[Nivre et al.2015] Joakim Nivre et al. 2015. Universal Dependencies 1.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.
[Padó and Lapata2007] Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161-199.
[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532-1543.
[Petrov et al.2012] Slav Petrov, Dipanjan Das, and Ryan T. McDonald. 2012. A universal part-of-speech tagset. In LREC, pages 2089-2096.
[Salton1971] Gerard Salton. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing.
[Schütze1993] Hinrich Schütze. 1993. Part-of-speech induction from scratch. In ACL, pages 251-258.
[Schwartz et al.2015] Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In CoNLL, pages 258-267.
[Schwartz et al.2016] Roy Schwartz, Roi Reichart, and Ari Rappoport. 2016. Symmetric patterns and coordinations: Fast and enhanced representations of verbs and adjectives. In NAACL-HLT, pages 499-505.
[Toutanova et al.2003] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL-HLT, pages 173-180.
[Turian et al.2010] Joseph P. Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL, pages 384-394.
[Utt and Padó2014] Jason Utt and Sebastian Padó. 2014. Crosslingual and multilingual construction of syntax-based vector space models. Transactions of the ACL, 2:245-258.
[Vulić and Korhonen2016] Ivan Vulić and Anna Korhonen. 2016. Is universal syntax universally useful for learning distributed word representations? In ACL, pages 518-524.
[Wieting et al.2015] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL, 3:345-358.
[Yatbaz et al.2012] Mehmet Ali Yatbaz, Enis Sert, and Deniz Yuret. 2012. Learning syntactic categories using paradigmatic representations of word context. In EMNLP, pages 940-951.
[Yu and Dredze2014] Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In ACL, pages 545-550.
