Combining Language and Vision With A Multimodal Skip-Gram Model
tioned images as sources of multimodal evidence, whereas we automatically enrich a very large corpus with images to induce general-purpose multimodal word representations, that could be used as input …

3 Multimodal Skip-gram Architecture

[Figure 1: "Cartoon" of MMSKIP-GRAM-B (example sentence: "the cute little CAT sat on the mat"; objectives: maximize context prediction, maximize similarity to the visual vector of cat). Linguistic context vectors are actually associated to classes of words in a tree, not single words. SKIP-GRAM is obtained by ignoring the visual objective, MMSKIP-GRAM-A by fixing M^{u→v} to the identity matrix.]

3.1 Skip-gram Model

We start by reviewing the standard SKIP-GRAM model of Mikolov et al. (2013a), in the version we use. Given a text corpus, SKIP-GRAM aims at inducing word representations that are good at predicting the context words surrounding a target word. Mathematically, it maximizes the objective function:

\[
\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t) \qquad (1)
\]

where w_1, w_2, ..., w_T are words in the training corpus and c is the size of the window around target w_t, determining the set of context words to be predicted by the induced representation of w_t. Following Mikolov et al., we implement a subsampling option randomly discarding context words as an inverse function of their frequency, controlled by hyperparameter t. The probability p(w_{t+j} | w_t), the core part of the objective in Equation 1, is given by softmax:

\[
p(w_{t+j} \mid w_t) = \frac{e^{u'^{\top}_{w_{t+j}} u_{w_t}}}{\sum_{w'=1}^{W} e^{u'^{\top}_{w'} u_{w_t}}} \qquad (2)
\]

where u_w and u'_w are the context and target vector representations of word w respectively, and W is the size of the vocabulary. Due to the normalization term, Equation 2 requires O(|W|) time complexity. A considerable speedup, to O(log |W|), is achieved by using the hierarchical version of Equation 2 (Morin and Bengio, 2005), adopted here.
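To make the objective concrete, here is a minimal sketch (our own illustration, not the authors' code) that evaluates Equations 1 and 2 with a plain softmax on a toy corpus; an efficient implementation would instead use the hierarchical softmax mentioned above. All names and toy values are ours.

```python
import numpy as np

# Toy setup (purely illustrative): small vocabulary, window size c, embedding dimension.
vocab = ["the", "cute", "little", "cat", "sat", "on", "mat"]
word2id = {w: i for i, w in enumerate(vocab)}
W, dim, c = len(vocab), 50, 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(W, dim))        # u_w: target vectors (being learned)
U_ctx = rng.normal(scale=0.1, size=(W, dim))    # u'_w: context vectors

def log_p(ctx, tgt):
    """Equation 2: full softmax over the vocabulary, O(|W|) per prediction."""
    scores = U_ctx @ U[tgt]                     # u'_w . u_tgt for every word w
    scores -= scores.max()                      # numerical stability
    return scores[ctx] - np.log(np.exp(scores).sum())

def skipgram_objective(corpus):
    """Equation 1: average log-probability of the context words around each target."""
    T = len(corpus)
    total = 0.0
    for t, tgt in enumerate(corpus):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < T:
                total += log_p(corpus[t + j], tgt)
    return total / T

corpus = [word2id[w] for w in ["the", "cute", "little", "cat", "sat", "on", "the", "mat"]]
print(skipgram_objective(corpus))
```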
3.2 Injecting visual knowledge

We now assume that word learning takes place in a situated context, in which, for a subset of the target words, the corpus contexts are accompanied by a visual representation of the concepts they denote (just like in a conversation, where a linguistic utterance will often be produced in a visual scene including some of the word referents). The visual representation is also encoded in a vector (we describe in Section 4 below how we construct it). We thus make the skip-gram "multimodal" by adding a second, visual term to the original linguistic objective, that is, we extend Equation 1 as follows:

\[
\frac{1}{T}\sum_{t=1}^{T}\left(L_{ling}(w_t) + L_{vision}(w_t)\right) \qquad (3)
\]

where L_ling(w_t) is the text-based skip-gram objective ∑_{−c≤j≤c, j≠0} log p(w_{t+j} | w_t), whereas the L_vision(w_t) term forces word representations to take visual information into account. Note that if a word w_t is not associated to visual information, as is systematically the case, e.g., for determiners and non-imageable nouns, but also more generally for any word for which no visual data are available, L_vision(w_t) is set to 0.

We now propose two variants of the visual objective, resulting in two distinct multimodal versions of the skip-gram model.
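As a minimal sketch of how Equation 3 combines the two terms per target word (assuming per-token functions in the style of the previous sketch; all names are ours), words without an image vector simply contribute a zero visual term:

```python
def multimodal_objective(corpus, l_ling, l_vision, visual_vectors):
    """Equation 3 (sketch): average of linguistic + visual terms over the corpus.
    `visual_vectors` maps word ids to fixed image-derived vectors; words missing
    from it get L_vision = 0, exactly as described above."""
    total = 0.0
    for t, w in enumerate(corpus):
        total += l_ling(corpus, t)                      # text-based skip-gram term
        if w in visual_vectors:
            total += l_vision(w, visual_vectors[w])     # visual term, only if available
    return total / len(corpus)
```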
3.3 Multi-modal Skip-gram Model A

One way to force word embeddings to take visual representations into account is to try to directly increase the similarity (expressed, for example, by the cosine) between linguistic and visual representations, thus aligning the dimensions of the linguistic vector with those of the visual one (recall that we are inducing the first, while the second is fixed), and making the linguistic representation of a concept "move" closer to its visual representation. We maximize similarity through a max-margin framework commonly used in models connecting language and vision (Weston et al., 2010; Frome et al., 2013). More precisely, we formulate the visual objective L_vision(w_t) as:

\[
-\sum_{w' \sim P_n(w)} \max\big(0,\ \gamma - \cos(u_{w_t}, v_{w_t}) + \cos(u_{w_t}, v_{w'})\big) \qquad (4)
\]

where the minus sign turns a loss into a cost, γ is the margin, u_{w_t} is the target multimodally-enhanced word representation we aim to learn, v_{w_t} is the corresponding visual vector (fixed in advance) and v_{w'} ranges over visual representations of words (featured in our image dictionary) randomly sampled from distribution P_n(w_t). These random visual representations act as "negative" samples, encouraging u_{w_t} to be more similar to its own visual representation than to that of other words. The sampling distribution is currently set to uniform, and the number of negative samples is controlled by hyperparameter k.
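A rough sketch of Equation 4 (our own illustration, not the released code), with cosine similarity and k uniformly sampled negative visual vectors:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l_vision_A(u_wt, v_wt, visual_matrix, k=20, gamma=0.5, rng=np.random.default_rng(0)):
    """Equation 4 (sketch): max-margin term pulling the word vector u_wt towards its
    own visual vector v_wt and away from k visual vectors sampled uniformly from the
    image dictionary (rows of `visual_matrix`)."""
    obj = 0.0
    for idx in rng.integers(0, len(visual_matrix), size=k):   # negative samples
        obj -= max(0.0, gamma - cosine(u_wt, v_wt) + cosine(u_wt, visual_matrix[idx]))
    return obj
```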
3.4 Multi-modal Skip-gram Model B

The visual objective in MMSKIP-GRAM-A has the drawback of assuming a direct comparison of linguistic and visual representations, constraining them to be of equal size. MMSKIP-GRAM-B lifts this constraint by including an extra layer mediating between linguistic and visual representations (see Figure 1 for a sketch of MMSKIP-GRAM-B). Learning this layer is equivalent to estimating a cross-modal mapping matrix from linguistic onto visual representations, jointly induced with linguistic word embeddings. The extension is straightforwardly implemented by substituting, into Equation 4, the word representation u_{w_t} with z_{w_t} = M^{u→v} u_{w_t}, where M^{u→v} is the cross-modal mapping matrix to be induced. To avoid overfitting, we also add an L2 regularization term for M^{u→v} to the overall objective (Equation 3), with its relative importance controlled by hyperparameter λ.
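A matching sketch for MMSKIP-GRAM-B (again our own illustration): the word vector is first mapped into visual space through M, the same max-margin term is applied, and the L2 penalty enters the maximized objective with a negative sign:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l_vision_B(u_wt, v_wt, M, visual_matrix, k=5, gamma=0.5, lam=1e-4,
               rng=np.random.default_rng(0)):
    """MMSKIP-GRAM-B (sketch): map the word vector into visual space through the
    cross-modal matrix M (z_wt = M u_wt), apply the max-margin term of Equation 4
    to z_wt, and subtract an L2 penalty on M from the maximized objective."""
    z = M @ u_wt                                              # z_wt = M^{u->v} u_wt
    obj = 0.0
    for idx in rng.integers(0, len(visual_matrix), size=k):   # negative samples
        obj -= max(0.0, gamma - cosine(z, v_wt) + cosine(z, visual_matrix[idx]))
    return obj - lam * float(np.sum(M ** 2))                  # L2 regularization
```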
4 Experimental Setup

The parameters of all models are estimated by backpropagation of error via stochastic gradient descent. Our text corpus is a Wikipedia 2009 dump comprising approximately 800M tokens.[1] To train the multimodal models, we add visual information for 5,100 words that have an entry in ImageNet (Deng et al., 2009), occur at least 500 times in the corpus and have concreteness score ≥ 0.5 according to Turney et al. (2011). On average, about 5% of the tokens in the text corpus are associated to a visual representation. To construct the visual representation of a word, we sample 100 pictures from its ImageNet entry, and extract a 4096-dimensional vector from each picture using the Caffe toolkit (Jia et al., 2014), together with the pre-trained convolutional neural network of Krizhevsky et al. (2012). The vector corresponds to activation in the top (FC7) layer of the network. Finally, we average the vectors of the 100 pictures associated to each word, deriving 5,100 aggregated visual representations.

[1] http://wacky.sslmit.unibo.it
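The aggregation step can be sketched as follows; `extract_fc7` is a placeholder for the Caffe/CNN feature extraction described above, which we do not reproduce here:

```python
import numpy as np

def aggregate_visual_vector(image_paths, extract_fc7):
    """Average per-image CNN features into one visual vector for a word.
    `extract_fc7` stands in for the extractor described in the text (FC7
    activations, 4096 dimensions, obtained with Caffe and the network of
    Krizhevsky et al. 2012); only the averaging step is shown."""
    feats = np.stack([extract_fc7(p) for p in image_paths])   # shape: (n_images, 4096)
    return feats.mean(axis=0)                                 # aggregated visual vector
```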
Hyperparameters For both SKIP-GRAM and the MMSKIP-GRAM models, we fix hidden layer size to 300. To facilitate comparison between MMSKIP-GRAM-A and MMSKIP-GRAM-B, and since the former requires equal linguistic and visual dimensionality, we keep the first 300 dimensions of the visual vectors. For the linguistic objective, we use hierarchical softmax with a Huffman frequency-based encoding tree, setting frequency subsampling option t = 0.001 and window size c = 5, without tuning. The following hyperparameters were tuned on the text9 corpus:[2] MMSKIP-GRAM-A: k = 20, γ = 0.5; MMSKIP-GRAM-B: k = 5, γ = 0.5, λ = 0.0001.

[2] http://mattmahoney.net/dc/textdata.html
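For reference, the settings listed above can be collected in a small configuration sketch (the dictionary layout and names are ours):

```python
# Settings reported in the Hyperparameters paragraph, gathered in one place.
HYPERPARAMS = {
    "embedding_dim": 300,          # hidden layer size; also the visual dimensions kept
    "subsampling_t": 0.001,        # frequency subsampling option t
    "window_c": 5,                 # context window size
    "mmskipgram_a": {"k": 20, "gamma": 0.5},
    "mmskipgram_b": {"k": 5, "gamma": 0.5, "lambda": 0.0001},
}
```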
5 Experiments

5.1 Approximating human judgments

Benchmarks A widely adopted way to test DSMs and their multimodal extensions is to measure how well model-generated scores approximate human similarity judgments about pairs of words. We put together various benchmarks covering diverse aspects of meaning, to gain insights on the effect of perceptual information on different similarity facets. Specifically, we test on general relatedness (MEN, Bruni et al. (2014), 3K pairs), e.g., pickles are related to hamburgers, semantic (≈ taxonomic) similarity (Simlex-999, Hill et al. (2014), 1K pairs; SemSim, Silberer and Lapata (2014), 7.5K pairs), e.g., pickles are similar to onions, as well as visual similarity (VisSim, Silberer and Lapata (2014), same pairs as SemSim with different human ratings), e.g., pickles look like zucchinis.
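The evaluation itself is straightforward; a sketch (our own, assuming word vectors stored in a dict) that scores one benchmark by Spearman correlation between model cosines and the human ratings:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_benchmark(pairs, human_ratings, vectors):
    """Spearman correlation between model cosine similarities and human judgments.
    `pairs` is a list of (word1, word2); `vectors` maps word -> np.ndarray.
    Pairs with out-of-vocabulary words are skipped."""
    model_scores, gold = [], []
    for (w1, w2), rating in zip(pairs, human_ratings):
        if w1 in vectors and w2 in vectors:
            a, b = vectors[w1], vectors[w2]
            model_scores.append(float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            gold.append(rating)
    return spearmanr(model_scores, gold).correlation
```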
Alternative Multimodal Models We compare our models against several recent alternatives. We test the vectors made available by Kiela and Bottou (2014). Similarly to us, they derive textual features with the skip-gram model (from a portion of Wikipedia and the British National Corpus) and use visual representations extracted from the ESP data-set (von Ahn and Dabbish, 2004) through a convolutional neural network (Oquab et al., 2014). They concatenate textual and visual features after normalizing to unit length and centering to zero mean. We also test the vectors that performed best in the evaluation of Bruni et al. (2014), based on textual features extracted from a 3B-token corpus and SIFT-based Bag-of-Visual-Words visual features (Sivic and Zisserman, 2003) extracted from the ESP collection. Bruni and colleagues fuse a weighted concatenation of the two components through SVD. We further re-implement both methods with our own textual and visual embeddings as CONCATENATION and SVD (with target dimensionality 300, picked without tuning). Finally, we present for comparison the results on SemSim and VisSim reported by Silberer and Lapata (2014), obtained with a stacked-autoencoders architecture run on textual features extracted from Wikipedia with the Strudel algorithm (Baroni et al., 2010) and attribute-based visual features (Farhadi et al., 2009) extracted from ImageNet.

All benchmarks contain a fair amount of words for which we did not use direct visual evidence. We are interested in assessing the models both in terms of how they fuse linguistic and visual evidence when they are both available, and for their robustness in lack of full visual coverage. We thus evaluate them in two settings. The visual-coverage columns of Table 1 (those on the right) report results on the subsets for which all compared models have access to direct visual information for both words. We further report results on the full sets ("100%" columns of Table 1) for models that can propagate visual information and that, consequently, can meaningfully be tested on words without direct visual representations.

Results The state-of-the-art visual CNN FEATURES alone perform remarkably well, outperforming the purely textual model (SKIP-GRAM) in two tasks, and achieving the best absolute performance on the visual-coverage subset of Simlex-999. Regarding multimodal fusion (that is, focusing on the visual-coverage subsets), both MMSKIP-GRAM models perform very well, at the top or just below it on all tasks, with comparable results for the two variants. Their performance is also good on the full data sets, where they consistently outperform SKIP-GRAM and SVD (which is much more strongly affected by the lack of complete visual information). They are just a few points below the state-of-the-art MEN correlation (0.8), achieved by Baroni et al. (2014) with a corpus 3× larger than ours and extensive tuning. MMSKIP-GRAM-B is close to the state of the art for Simlex-999, reported by the resource creators to be at 0.41 (Hill et al., 2014). Most impressively, MMSKIP-GRAM-A reaches the performance level of the Silberer and Lapata (2014) model on their SemSim and VisSim data sets, despite the fact that the latter has full visual-data coverage and uses attribute-based image representations, requiring supervised learning of attribute classifiers, that achieve performance in the semantic tasks comparable to or higher than that of our CNN features (see Table 3 in Silberer and Lapata (2014)). Finally, while the multimodal models (unsurprisingly) bring about a large performance gain over the purely linguistic model on visual similarity, the improvement is consistently large for the other benchmarks as well, confirming that multimodality leads to better semantic models in general, which can help in capturing different types of similarity (general relatedness, strictly taxonomic, perceptual).

While we defer to further work a better understanding of the relation between multimodal grounding and different similarity relations, Table 2 provides qualitative insights on how injecting visual information changes the structure of semantic space. The top SKIP-GRAM neighbours of donuts are places where you might encounter them, whereas the multimodal models relate them to other take-away food, ranking visually-similar pizzas at the top.
Model                  MEN          Simlex-999   SemSim       VisSim
                       100%   42%   100%   29%   100%   85%   100%   85%
KIELA AND BOTTOU       -      0.74  -      0.33  -      0.60  -      0.50
BRUNI ET AL.           -      0.77  -      0.44  -      0.69  -      0.56
SILBERER AND LAPATA    -      -     -      -     0.70   -     0.64   -
CNN FEATURES           -      0.62  -      0.54  -      0.55  -      0.56
SKIP-GRAM              0.70   0.68  0.33   0.29  0.62   0.62  0.48   0.48
CONCATENATION          -      0.74  -      0.46  -      0.68  -      0.60
SVD                    0.61   0.74  0.28   0.46  0.65   0.68  0.58   0.60
MMSKIP-GRAM-A          0.75   0.74  0.37   0.50  0.72   0.72  0.63   0.63
MMSKIP-GRAM-B          0.74   0.76  0.40   0.53  0.66   0.68  0.60   0.60

Table 1: Spearman correlation between model-generated similarities and human judgments. Right columns report correlation on visual-coverage subsets (percentage of original benchmark covered by subsets on first row of respective columns). First block reports results for out-of-the-box models; second block for visual and textual representations alone; third block for our implementation of multimodal models.
Table 2: Ordered top 3 neighbours of example words in purely textual and multimodal spaces. Only donut
and owl were trained with direct visual information.
The owl example shows how multimodal models pick taxonomically closer neighbours of concrete objects, since often closely related things also look similar (Bruni et al., 2014). In particular, both multimodal models get rid of squirrels and offer other birds of prey as nearest neighbours. No direct visual evidence was used to induce the embeddings of the remaining words in the table, which are thus influenced by vision only by propagation. The subtler but systematic changes we observe in such cases suggest that this indirect propagation is not only non-damaging with respect to purely linguistic representations, but actually beneficial. For the concrete mural concept, both multimodal models rank paintings and portraits above less closely related sculptures (they are not a form of painting). For tobacco, both models rank cigarettes and cigar over coffee, and MMSKIP-GRAM-B avoids the arguably less common "crop" sense cued by corn. The last two examples show how the multimodal models turn up the embodiment level in their representation of abstract words. For depth, their neighbours suggest a concrete marine setup over the more abstract measurement sense picked by the SKIP-GRAM neighbours. For chaos, they rank a demon, that is, a concrete agent of chaos, at the top, and replace the more abstract notion of despair with equally gloomy but more imageable shadows and destruction (more on abstract words below).

5.2 Zero-shot image labeling and retrieval

The multimodal representations induced by our models should be better suited than purely text-based vectors to label or retrieve images. In particular, given that the quantitative and qualitative results collected so far suggest that the models propagate visual information across words, we apply them to image labeling and retrieval in the challenging zero-shot setup (see Section 2 above).[3]

[3] We will refer here, for conciseness' sake, to image labeling/retrieval, but, as our visual vectors are aggregated representations of images, the tasks we're modeling consist, more precisely, in labeling a set of pictures denoting the same object and retrieving the corresponding set given the name of the object.
Setup We take out as test set 25% of the 5.1K words we have visual vectors for. The multimodal models are re-trained without visual vectors for these words, using the same hyperparameters as above. For both tasks, the search for the correct word label/image is conducted on the whole set of 5.1K word/visual vectors.

In the image labeling task, given a visual vector representing an image, we map it onto word space, and label the image with the word corresponding to the nearest vector. To perform the vision-to-language mapping, we train a Ridge regression by 5-fold cross-validation on the test set (for SKIP-GRAM only, we also add the remaining 75% of word-image vector pairs used in estimating the multimodal models to the Ridge training data).[4]

[4] We use one fold to tune the Ridge λ, three to estimate the mapping matrix, and test on the last fold. To enforce strict zero-shot conditions, we exclude from the test fold labels occurring in the LSVRC2012 set that was employed to train the CNN of Krizhevsky et al. (2012), which we use to extract visual features.

In the image retrieval task, given a linguistic/multimodal vector, we map it onto visual space, and retrieve the nearest image. For SKIP-GRAM, we use Ridge regression with the same training regime as for the labeling task. For the multimodal models, since maximizing similarity to visual representations is already part of their training objective, we do not fit an extra mapping function. For MMSKIP-GRAM-A, we directly look for nearest neighbours of the learned embeddings in visual space. For MMSKIP-GRAM-B, we use the M^{u→v} mapping function induced while learning word embeddings.
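A sketch of the labeling pipeline and of the precision@k figure of merit used in Tables 3 and 4 (our own toy illustration; the Ridge `alpha` stands in for the tuned λ, and scikit-learn is assumed):

```python
import numpy as np
from sklearn.linear_model import Ridge

def precision_at_k(queries, targets, gold_indices, k):
    """Fraction of queries whose gold target is among the k nearest targets (cosine)."""
    Q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    T = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    hits = sum(int(g in np.argsort(-(T @ q))[:k]) for q, g in zip(Q, gold_indices))
    return hits / len(gold_indices)

def label_images(train_visual, train_word_vecs, test_visual, word_matrix, gold, k=1):
    """Zero-shot labeling with a text-only model (sketch): learn a vision-to-language
    mapping with Ridge regression, project test image vectors into word space, and
    check whether the correct word is among the k nearest word vectors."""
    ridge = Ridge(alpha=1.0).fit(train_visual, train_word_vecs)
    mapped = ridge.predict(test_visual)          # image features -> word space
    return precision_at_k(mapped, word_matrix, gold, k)
```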
              P@1   P@2   P@10   P@20   P@50
SKIP-GRAM     1.5   2.6   14.2   23.5   36.1
MMSKIP-GRAM-A 2.1   3.7   16.7   24.6   37.6
MMSKIP-GRAM-B 2.2   5.1   20.2   28.5   43.5

Table 3: Percentage precision@k results in the zero-shot image labeling task.

              P@1   P@2   P@10   P@20   P@50
SKIP-GRAM     1.9   3.3   11.5   18.5   30.4
MMSKIP-GRAM-A 1.9   3.2   13.9   20.2   33.6
MMSKIP-GRAM-B 1.9   3.8   13.2   22.5   38.3

Table 4: Percentage precision@k results in the zero-shot image retrieval task.

Results In image labeling (Table 3) SKIP-GRAM is outperformed by both multimodal models, confirming that these models produce vectors that are directly applicable to vision tasks thanks to visual propagation. The most interesting results however are achieved in image retrieval (Table 4), which is essentially the task the multimodal models have been implicitly optimized for, so that they could be applied to it without any specific training. The strategy of directly querying for the nearest visual vectors of the MMSKIP-GRAM-A word embeddings works remarkably well, outperforming SKIP-GRAM (which requires an ad-hoc mapping function) at the higher ranks. This suggests that the multimodal embeddings we are inducing, while general enough to achieve good performance in the semantic tasks discussed above, encode sufficient visual information for direct application to image analysis tasks. This is especially remarkable because the word vectors we are testing were not matched with visual representations at model training time, and are thus multimodal only by propagation. The best performance is achieved by MMSKIP-GRAM-B, confirming our claim that its M^{u→v} matrix acts as a multimodal mapping function.

5.3 Abstract words

We have already seen, through the depth and chaos examples of Table 2, that the indirect influence of visual information has interesting effects on the representation of abstract terms. The latter have received little attention in multimodal semantics, with Hill and Korhonen (2014) concluding that abstract nouns, in particular, do not benefit from propagated perceptual information, and their representation is even harmed when such information is forced on them (see Figure 4 of their paper). Still, embodied theories of cognition have provided considerable evidence that abstract concepts are also grounded in the senses (Barsalou, 2008; Lakoff and Johnson, 1999). Since the word representations produced by MMSKIP-GRAM-A, including those pertaining to abstract concepts, can be directly used to search for near images in visual space, we decided to verify, experimentally, if these near images (of concrete things) are relevant not only for concrete words, as
expected, but also for abstract ones, as predicted by embodied views of meaning.

More precisely, we focused on the set of 200 words that were sampled across the USF norms concreteness spectrum by Kiela et al. (2014) (2 words had to be excluded for technical reasons). This set includes not only concrete (meat) and abstract (thought) nouns, but also adjectives (boring), verbs (teach), and even grammatical terms (how). Some words in the set have relatively high concreteness ratings, but are not particularly imageable, e.g.: hot, smell, pain, sweet. For each word in the set, we extracted the nearest neighbour picture of its MMSKIP-GRAM-A representation, and matched it with a random picture. The pictures were selected from a set of 5,100, all labeled with distinct words (the picture set includes, for each of the words associated to visual information as described in Section 4, the nearest picture to its aggregated visual representation). Since it is much more common for

            global   |words|   unseen   |words|
all         48%      198       30%      127
concrete    73%      99        53%      30
abstract    23%      99        23%      97

Table 5: Subjects' preference for nearest visual neighbour of words in Kiela et al. (2014) vs. random pictures. Figure of merit is percentage proportion of significant results in favor of nearest neighbour across words. Results are reported for the whole set, as well as for words above (concrete) and below (abstract) the concreteness rating median. The unseen column reports results when words exposed to direct visual evidence during training are discarded. The words columns report set cardinality.

[Figure: example words freedom, theory, wrong / god, together, place]