Auto Select Context Config
Abstract
Recent work has demonstrated that state-of-the-art word embedding models require different context types to produce high-quality representations for different word classes such as adjectives (A), verbs (V), and nouns (N). This paper is concerned with identifying contexts useful for learning A/V/N-specific representations. We introduce a simple yet effective framework for selecting class-specific context configurations that yield improved representations for each class. We propose an automatic A*-style selection algorithm that effectively searches only a fraction of the large configuration space. The results on predicting similarity scores for the A, V, and N subsets of the benchmarking SimLex-999 evaluation set indicate that our method is useful for each class: the improvements are 6% (A), 6% (V), and 5% (N) over the best previously proposed context type for each class. At the same time, the model trains on only 14% (A), 26.2% (V), and 33.6% (N) of all dependency-based contexts, resulting in much shorter training time.
Introduction
Related Work
Methodology
[Figure 1: An example of extracting dependency-based contexts. Top: the example English sentence from (Levy and Goldberg, 2014a), now parsed using Universal Dependencies. Middle: the same sentence in Italian ("Scienziato australiano scopre stelle con telescopio"), UD-parsed. Note the similarity between the two parses, which suggests that our UD-based framework may be extended to other languages. Bottom: the intuition behind post-parsing prepositional arc collapsing. The uninformative short-range case arc between with and telescope is removed, while another pseudo-arc specifying the exact link type (i.e., prep:with) between discovers and telescope is added.]
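The collapsing step described in the caption can be sketched on a toy edge-list parse. The function name and the `(head, label, child)` representation are ours for illustration, not the authors' code:

```python
# Sketch of post-parsing prepositional arc collapsing (illustrative, not the
# authors' implementation). A parse is a list of (head, dep_label, child) arcs.
def collapse_prepositions(edges):
    """Replace head --nmod--> noun plus noun --case--> prep with a single
    head --prep:X--> noun pseudo-arc, dropping the short-range case arc."""
    # Map each token to the preposition attached to it via a `case` arc.
    case_of = {head: child for head, label, child in edges if label == "case"}
    collapsed = []
    for head, label, child in edges:
        if label == "case":
            continue  # drop the uninformative short-range arc
        if label == "nmod" and child in case_of:
            label = f"prep:{case_of[child]}"  # e.g. nmod -> prep:with
        collapsed.append((head, label, child))
    return collapsed

# The English example from Fig. 1.
edges = [
    ("scientist", "amod", "Australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "stars"),
    ("discovers", "nmod", "telescope"),
    ("telescope", "case", "with"),
]
print(collapse_prepositions(edges))
```

Running this yields the pseudo-arc (discovers, prep:with, telescope) while the case arc disappears, matching the intuition in the caption.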
context bag, while (discovers, stars_dobj) and (stars, discovers_dobj-1) are labelled dobj (Fig. 1).
Universal Dependencies as Labels Following recent initiatives on cross-linguistically consistent annotations, we adopt the Universal Dependencies (UD) annotation scheme (McDonald et al., 2013; Nivre et al., 2015) to serve as labels for individual context bags. The scheme leans on the universal Stanford dependencies (de Marneffe et al., 2014) complemented with the universal POS tagset (Petrov et al., 2012). It is straightforward to translate previous annotation schemes to UD (de Marneffe et al., 2014).
The UD scheme is becoming a de facto standard in dependency parsing. It provides a universal and consistent inventory of categories for similar syntactic constructions across multiple languages. This property could facilitate representation learning in languages other than English, cross-lingual modeling, and the application of our methodology to other languages in future work.
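Under these conventions, grouping dependency arcs into UD-labelled context bags might look as follows. This is a sketch with illustrative names; the `-1` suffix marks inverse contexts, as in the (stars, discovers_dobj-1) example above:

```python
from collections import defaultdict

# Group dependency-based (word, context) training pairs into context bags
# labelled by their UD relation (illustrative sketch, not the authors' code).
# Each arc (head, label, child) contributes a direct and an inverse pair.
def build_context_bags(edges):
    bags = defaultdict(list)
    for head, label, child in edges:
        bags[label].append((head, f"{child}_{label}"))       # e.g. (discovers, stars_dobj)
        bags[label].append((child, f"{head}_{label}-1"))     # e.g. (stars, discovers_dobj-1)
    return dict(bags)

edges = [("discovers", "dobj", "stars"), ("scientist", "amod", "Australian")]
bags = build_context_bags(edges)
print(bags["dobj"])
```

Each labelled bag can then be included in or excluded from a context configuration independently of the others.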
Context Configurations
Assume that we
have obtained M distinct dependency relations
[Footnotes] https://bitbucket.org/yoavgo/word2vecf; for details concerning the implementation, we refer the reader to (Goldberg and Levy, 2014; Levy and Goldberg, 2014a).
Initial Pools of Candidate Context Bags To reduce the number of possible configurations for each word class (again for efficiency reasons), it is possible to automatically pool only individual context bags that are potentially useful. For instance, since amod denotes an adjectival modifier of a noun, this bag can be safely removed from the pool of candidate bags for verbs: amod contexts attached to verbs are likely the result of incorrect parses and may eventually lead to deteriorated verb vectors. In another example, appositional modification captured by appos always denotes a relation between nouns, so it can be safely excluded as a candidate bag for adjectives and verbs. The construction of such initial class-specific pools is done automatically before the configuration search procedure starts (see line 1 in Alg. 1).
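The pooling step can be sketched as a simple table-driven filter. The relation-to-class mapping below is an illustrative fragment of our own, not the paper's full specification:

```python
# Build initial class-specific pools of candidate context bags (line 1 of
# Alg. 1) by keeping only relations plausibly useful for each word class.
# The mapping is an illustrative fragment, not the complete list.
CLASS_HINTS = {
    "amod":   {"A", "N"},  # adjectival modifier: adjectives and nouns only
    "appos":  {"N"},       # apposition: a relation between nouns only
    "dobj":   {"V", "N"},
    "nsubj":  {"V", "N"},
    "advmod": {"V"},
    "nummod": set(),       # uninformative for similarity; never pooled
}

def initial_pool(word_class, relations=CLASS_HINTS):
    return {r for r, classes in relations.items() if word_class in classes}

print(sorted(initial_pool("V")))  # -> ['advmod', 'dobj', 'nsubj']
```

As described above, amod and appos never enter the verb pool, so the subsequent configuration search for verbs never has to consider them.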
Best Configuration Search Although a brute-force exhaustive search over all possible configurations is possible in theory and for small pools (e.g., for adjectives, see Tab. 2), it becomes challenging or practically infeasible for large pools and large training data. For instance, based on the pool from Tab. 2, the search for the optimal configuration would involve trying out 2^7 - 1 = 127 different configurations for verbs (i.e., training 127 different SGNS models) and 2^10 - 1 = 1023 configurations for nouns (1023 SGNS models!). Therefore, to reduce the number of visited configurations, we have designed a simple heuristic search algorithm
5 Given the coordination structure boys and girls, training pairs are (boys, girls_conj) and (girls, boys_conj-1).
6 Another choice is to retain all individual context bags and filter them later; see line 1 of Alg. 1.
[Figure 2: An illustration of the configuration search. Starting from the pool configuration r_i + r_j + r_k + r_l, each level of the search tree removes one relation; an (l-1)-set configuration such as r_i + r_j + r_k is kept and expanded only if its score does not drop, e.g., E(R^Pool_{r_i}) >= E(R^Pool), and is pruned otherwise, e.g., when E(R^Pool_{r_l}) < E(R^Pool).]

Algorithm 1: Best configuration search.

    Input : pool configuration R^Pool (an l-set of context bags),
            scoring function E(.)
    R <- {R^Pool} ;
    repeat
        R_n <- {} ;
        foreach R in R do
            foreach r_i in R do
                build new (l-1)-set context configuration R_{r_i} = R \ {r_i} ;
                if E(R_{r_i}) >= E(R) then
                    R_n <- R_n U {R_{r_i}} ;
        l <- l - 1 ;
        R <- R_n ;
    until l == 0 or R == {} ;
    Output : Best configuration R_o
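A minimal Python rendering of this search, assuming a scoring function E over configurations (dev-set performance in the paper; a toy stand-in below). In practice each call to E trains an SGNS model, so scores would be cached rather than recomputed:

```python
def best_configuration(pool, E):
    """Greedy top-down search over context configurations (a sketch of Alg. 1).
    pool: set of context bags; E: scoring function on configurations.
    Starting from the full pool, keep only (l-1)-subsets whose score
    does not drop, and track the best-scoring configuration seen."""
    start = frozenset(pool)
    best, best_score = start, E(start)
    frontier = {start}
    while frontier:
        next_frontier = set()
        for config in frontier:
            for bag in config:
                reduced = config - {bag}  # drop one context bag
                if reduced and E(reduced) >= E(config):
                    next_frontier.add(reduced)
                    if E(reduced) > best_score:
                        best, best_score = reduced, E(reduced)
        frontier = next_frontier
    return best, best_score

# Toy scoring function: nummod hurts, amod and conj help.
scores = lambda cfg: len(cfg & {"amod", "conj"}) - 2 * ("nummod" in cfg)
cfg, s = best_configuration({"amod", "conj", "nummod"}, scores)
print(sorted(cfg))  # -> ['amod', 'conj']
```

The frontier only ever contains subsets whose score did not decrease, which is what lets the algorithm visit a small fraction of the 2^M - 1 possible configurations.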
Experimental Setup
The Wikipedia data were POS-tagged with universal POS (UPOS) tags (Petrov et al., 2012) using the state-of-the-art TurboTagger (Martins et al., 2013), trained using default settings without any further parameter fine-tuning (SVM MIRA with 20 iterations) on the TRAIN+DEV portion of the UD treebank annotated with UPOS tags. The data were then parsed with UD using the graph-based Mate parser v3.61 (Bohnet, 2010) with standard settings on the TRAIN+DEV UD treebank portion.
Evaluation Following prior work (Schwartz et al., 2015; Schwartz et al., 2016), we experiment with the verb pair (222 pairs), adjective pair (111 pairs), and noun pair (666 pairs) portions of SimLex-999. We report Spearman's correlation between the ranks derived from the scores of the evaluated models and the human scores. Our final evaluation setup is borrowed from Levy et al. (2015): the context configurations are optimised on a development set, which is separate from the unseen test data. We perform 2-fold cross-validation; the reported scores are always the averages of the 2 runs. All scores are computed in the standard fashion, applying the cosine similarity to the vectors of words participating in a pair.
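This protocol (cosine similarity per pair, then Spearman's rank correlation against human ratings) can be sketched with stdlib-only helpers; the toy vectors and ratings below are invented for illustration:

```python
import math

# Score each word pair by cosine similarity of its vectors, then compute
# Spearman's rho against human similarity ratings (toy data, no ties).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    def ranks(vals):  # simple ranks; assumes no ties for brevity
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    # With no ties: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

vectors = {"car": [1.0, 0.2], "truck": [0.9, 0.3], "cat": [0.1, 1.0]}
pairs = [("car", "truck"), ("car", "cat"), ("truck", "cat")]
human = [8.5, 1.5, 2.0]  # hypothetical SimLex-style ratings
model = [cosine(vectors[a], vectors[b]) for a, b in pairs]
print(round(spearman(model, human), 3))  # -> 1.0 (ranks agree perfectly)
```

Only the ranking of the pairs matters to the metric, which is why Spearman's correlation is the standard choice for SimLex-style evaluation.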
Baseline Context Types We compare context configurations against relevant baseline context types from prior work (Melamud et al., 2016; Schwartz et al., 2016; Vulic and Korhonen, 2016):
- BOW: Standard bag-of-words contexts.
- POSIT: Positional contexts (Schütze, 1993; Levy and Goldberg, 2014b), which enrich BOW with information on the sequential position of each context word. Given the example from Fig. 1, POSIT with the window size 2 extracts these contexts for discovers: Australian_-2, scientist_-1, stars_+1, with_+2.
- DEPS-All: All dependency links without any context selection, extracted from dependency-parsed data with prepositional arc collapsing.
- COORD: Coordination-based contexts are used as fast lightweight contexts for improved representations of adjectives and verbs (Schwartz et al., 2016).
10 http://www.cs.cmu.edu/~ark/TurboParser/
11 Such a language-universal tagging+parsing procedure has been chosen deliberately. As suggested by Vulic and Korhonen (2016), the language-independent framework will facilitate extensions of this research to other languages in future work without any major language-specific adaptations.
12 https://code.google.com/archive/p/mate-tools/
13 We opt for the Mate parser due to its speed, simplicity, and state-of-the-art performance; see (Choi et al., 2015).
Context Group    Adj      Verb     Noun
conj (A+N+V)     0.415    0.281    0.401
obj (N+V)       -0.028    0.309    0.390
prep (N+V)       0.188    0.344    0.387
amod (A+N)       0.479    0.058    0.398
compound (N)    -0.124   -0.019    0.416
adv (V)          0.197    0.342    0.104
nummod (-)      -0.142   -0.065    0.029

Table 1: 2-fold cross-validation results for an illustrative selection of individual context groups over the SimLex-999 subsets associated with each word group. Values in parentheses denote the class-specific pools to which each group is finally selected based on its score.
Table 2: Class-specific pools of candidate context bags.
Adjectives: amod, conjlr, conjll
Verbs:      prep, acl, obj, comp, adv, conjlr, conjll
Nouns:      [listing not preserved in this extract]
Baselines (Verbs)
BOW (win=2)      0.336
POSIT (win=2)    0.345
COORD (conjlr)   0.283
SP               0.349
DEPS-All         0.344

Configurations: Verbs
POOL-ALL                   0.379
prep+acl+obj+adv+conj      0.393
prep+acl+obj+comp+conj     0.344
prep+obj+comp+adv+conj     0.391
prep+acl+adv+conj (BEST)   0.409
prep+acl+obj+adv           0.392
prep+acl+adv               0.407
prep+acl+conj              0.390
acl+obj+adv+conj           0.345
acl+obj+adv                0.385

Baselines (Nouns)
BOW (win=2)      0.435
POSIT (win=2)    0.437
COORD (conjlr)   0.392
SP               0.372
DEPS-All         0.441

Configurations: Nouns
POOL-ALL                                  0.469
amod+subj+obj+appos+compound+nmod+conj    0.478
amod+subj+obj+appos+compound+conj         0.487
amod+subj+obj+appos+compound+conjlr       0.476
amod+subj+obj+compound+conj (BEST)        0.491
amod+subj+obj+appos+conj                  0.470
subj+obj+compound+conj                    0.479
amod+subj+compound+conj                   0.481
amod+subj+obj+compound                    0.478
amod+obj+compound+conj                    0.481

Table 3: 2-fold cross-validation results over SimLex-999 for (a) verbs and (b) nouns subsets. Only a selection of context configurations optimised for verb and noun similarity are shown. POOL-ALL denotes a configuration where all individual context bags from the verbs/nouns-oriented pools (see Table 2) are used. BEST denotes the best performing configuration found by Alg. 1. Other configurations visited by Alg. 1 that score higher than the best scoring baseline context type for each word class are in gray.
Baselines (Adjectives)
BOW (win=2)      0.489
POSIT (win=2)    0.460
COORD (conjlr)   0.407
SP               0.395
DEPS-All         0.360

Configurations: Adjectives
POOL-ALL: amod+conj (BEST)   0.546
amod+conjlr                  0.527
amod+conjll                  0.531
conj                         0.470

Table 4: 2-fold cross-validation results over the SimLex-999 adjectives subset.
context bags (e.g., nummod) do not encode any informative contextual evidence regarding similarity and can therefore be discarded. The initial results with individual context bags help to reduce the pool of candidate bags (line 1 in Alg. 1); see Tab. 2. Note that while Tab. 1 shows cross-validation results, the actual pool is based solely on the dev set performance.
Searching for Optimal Configurations Next, we test whether we can actually improve class-specific representations by selecting specialised configurations. We traverse the configuration search space on the dev set using the proposed algorithm (Alg. 1). The results per word class are summarised in Tables 3 and 4.
Class-specific context configurations yield more accurate representations, as evident from the scores on the SimLex-999 subsets and the improvements over all baselines. The improvements with the best class-specific configurations found by Alg. 1 are approximately 6 points for adjectives, 6 for verbs, and 5 for nouns over the best baseline for each class.
The improvements are visible even with configurations that simply pool all candidate individual bags (POOL-ALL), without running Alg. 1. However, further careful context selection, i.e., traversing the configuration space using Alg. 1, leads to additional improvements for V and N (gains of 3 and 2.2 points).
It is interesting to point out that the best configuration for verbs does not include subj or obj contexts, which seem to be more important in noun
Context Type     A Questions  V Questions  N Questions
BOW (win=2)      31/41        14/19        16/19
POSIT (win=2)    32/41        13/19        15/19
COORD (conjlr)   26/41        11/19         8/19
SP               26/41        11/19        12/19
DEPS-All         31/41        14/19        16/19
BEST-ADJ         32/41        12/19        15/19
BEST-VERBS       24/41        15/19        16/19
BEST-NOUNS       30/41        14/19        17/19

Table 5: Results on the A/V/N TOEFL subsets. BEST-* are the best class-specific configurations as displayed in Tables 3 and 4. The reported scores are in the form correct_answers/overall_questions.

Context Type     Training Time  # Pairs
BOW (win=2)      179min 27s     5.974G
POSIT (win=2)    190min 12s     5.974G
COORD (conjlr)   4min 11s       129.69M
SP               1min 29s       46.37M
DEPS-All         103min 35s     3.165G
BEST-ADJ         14min 5s       447.4M
BEST-VERBS       29min 48s      828.55M
BEST-NOUNS       41min 14s      1.063G

Table 6: Training time and total number of training pairs for each context type.
representation modeling.
Very similar improved scores are achieved with
a variety of configurations (see Tab. 3), especially
in the neighbourhood of the optimal configuration
found by Alg. 1. This indicates that the method is
quite robust: even sub-optimal solutions result in
improved class-specific representations.
TOEFL Evaluation The effects of per-class representation specialisation are further supported by
evaluation on the A, V, and N TOEFL questions
(Landauer and Dumais, 1997). The results are summarised in Tab. 5. However, the limited size (e.g.,
only 19 V and N questions, 41 A questions) prevents us from drawing any general conclusions with
this evaluation set.
Training: Fast and/or Accurate? Since context
configurations rely on the idea of context selection,
their advantage also lies in reduced training times
for SGNS. The configuration-based model trains
on only 14% (A), 26.2% (V), and 33.6% (N) of all
dependency-based contexts. Training times along
with the total number of training pairs for each
context type are displayed in Tab. 6. All SGNS
models were trained using parallel word2vecf
training on 10 Intel(R) Xeon(R) E5-2667 2.90GHz
processors.
The results indicate that class-specific configurations are not as lightweight and fast as SP or
COORD contexts (Schwartz et al., 2016). However,
the results also suggest that such configurations provide a better trade-off between accuracy and speed:
they reach peak performances for each word class,
outscoring all baseline context types (including SP
and COORD), while at the same time training is
still much faster than with heavyweight full context types such as BOW, POSIT or DEPS-All.
15 code.google.com/p/word2vec/source/browse/trunk/
Context Type     Adj    Verbs  Nouns  All
BOW (win=2)      0.604  0.307  0.501  0.464
POSIT (win=2)    0.585  0.400  0.471  0.469
COORD (conjlr)   0.629  0.413  0.428  0.430
SP               0.649  0.458  0.414  0.444
DEPS-All         0.574  0.389  0.492  0.464
BEST-ADJ         0.671  0.348  0.504  0.449
BEST-VERBS       0.392  0.455  0.478  0.448
BEST-NOUNS       0.581  0.327  0.535  0.489
BEST-ALL         0.616  0.402  0.519  0.506
Conclusion
http://technion.ac.il/ira.leviant/
Distributional