Combining Language and Vision With A Multimodal Skip-Gram Model
tioned images as sources of multimodal evidence, whereas we automatically enrich a very large corpus with images to induce general-purpose multimodal word representations, that could be used as input …

3 Multimodal Skip-gram Architecture

[Figure 1: "Cartoon" of MMSKIP-GRAM-B (example sentence: "the cute little CAT sat on the mat"; objectives: maximize context prediction, maximize similarity to the visual vector of cat). Linguistic context vectors are actually associated to classes of words in a tree, not single words. SKIP-GRAM is obtained by ignoring the visual objective, MMSKIP-GRAM-A by fixing M^{u→v} to the identity matrix.]

3.1 Skip-gram Model

We start by reviewing the standard SKIP-GRAM model of Mikolov et al. (2013a), in the version we use. Given a text corpus, SKIP-GRAM aims at inducing word representations that are good at predicting the context words surrounding a target word. Mathematically, it maximizes the objective function:

\[
\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t) \qquad (1)
\]

where w_1, w_2, ..., w_T are words in the training corpus and c is the size of the window around target w_t, determining the set of context words to be predicted by the induced representation of w_t. Following Mikolov et al., we implement a subsampling option randomly discarding context words as an inverse function of their frequency, controlled by hyperparameter t. The probability p(w_{t+j} | w_t), the core part of the objective in Equation 1, is given by softmax:

\[
p(w_{t+j} \mid w_t) = \frac{e^{u'^{\top}_{w_{t+j}} u_{w_t}}}{\sum_{w'=1}^{W} e^{u'^{\top}_{w'} u_{w_t}}} \qquad (2)
\]

where u_w and u'_w are the context and target vector representations of word w respectively, and W is the size of the vocabulary. Due to the normalization term, Equation 2 requires O(|W|) time complexity. A considerable speedup, to O(log |W|), is achieved by using the hierarchical version of Equation 2 (Morin and Bengio, 2005), adopted here.
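To make the objective concrete, here is a minimal sketch (our own illustration, not the authors' code) that evaluates Equations 1 and 2 with a plain softmax on a toy corpus; an efficient implementation would instead use the hierarchical softmax mentioned above. All names and toy values are ours.

```python
import numpy as np

# Toy setup (purely illustrative): small vocabulary, window size c, embedding dimension.
vocab = ["the", "cute", "little", "cat", "sat", "on", "mat"]
word2id = {w: i for i, w in enumerate(vocab)}
W, dim, c = len(vocab), 50, 2

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(W, dim))        # u_w: target vectors (being learned)
U_ctx = rng.normal(scale=0.1, size=(W, dim))    # u'_w: context vectors

def log_p(ctx, tgt):
    """Equation 2: full softmax over the vocabulary, O(|W|) per prediction."""
    scores = U_ctx @ U[tgt]                     # u'_w . u_tgt for every word w
    scores -= scores.max()                      # numerical stability
    return scores[ctx] - np.log(np.exp(scores).sum())

def skipgram_objective(corpus):
    """Equation 1: average log-probability of the context words around each target."""
    T = len(corpus)
    total = 0.0
    for t, tgt in enumerate(corpus):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < T:
                total += log_p(corpus[t + j], tgt)
    return total / T

corpus = [word2id[w] for w in ["the", "cute", "little", "cat", "sat", "on", "the", "mat"]]
print(skipgram_objective(corpus))
```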
3.2 Injecting visual knowledge

We now assume that word learning takes place in a situated context, in which, for a subset of the target words, the corpus contexts are accompanied by a visual representation of the concepts they denote (just like in a conversation, where a linguistic utterance will often be produced in a visual scene including some of the word referents). The visual representation is also encoded in a vector (we describe in Section 4 below how we construct it). We thus make the skip-gram "multimodal" by adding a second, visual term to the original linguistic objective, that is, we extend Equation 1 as follows:

\[
\frac{1}{T}\sum_{t=1}^{T}\left(L_{ling}(w_t) + L_{vision}(w_t)\right) \qquad (3)
\]

where L_ling(w_t) is the text-based skip-gram objective ∑_{−c≤j≤c, j≠0} log p(w_{t+j} | w_t), whereas the L_vision(w_t) term forces word representations to take visual information into account. Note that if a word w_t is not associated to visual information, as is systematically the case, e.g., for determiners and non-imageable nouns, but also more generally for any word for which no visual data are available, L_vision(w_t) is set to 0.

We now propose two variants of the visual objective, resulting in two distinct multimodal versions of the skip-gram model.
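As a minimal sketch of how Equation 3 combines the two terms per target word (assuming per-token functions in the style of the previous sketch; all names are ours), words without an image vector simply contribute a zero visual term:

```python
def multimodal_objective(corpus, l_ling, l_vision, visual_vectors):
    """Equation 3 (sketch): average of linguistic + visual terms over the corpus.
    `visual_vectors` maps word ids to fixed image-derived vectors; words missing
    from it get L_vision = 0, exactly as described above."""
    total = 0.0
    for t, w in enumerate(corpus):
        total += l_ling(corpus, t)                      # text-based skip-gram term
        if w in visual_vectors:
            total += l_vision(w, visual_vectors[w])     # visual term, only if available
    return total / len(corpus)
```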
3.3 Multi-modal Skip-gram Model A

One way to force word embeddings to take visual representations into account is to try to directly increase the similarity (expressed, for example, by the cosine) between linguistic and visual representations, thus aligning the dimensions of the linguistic vector with those of the visual one (recall that we are inducing the first, while the second is fixed), and making the linguistic representation of a concept "move" closer to its visual representation. We maximize similarity through a max-margin framework commonly used in models connecting language and vision (Weston et al., 2010; Frome et al., 2013). More precisely, we formulate the visual objective L_vision(w_t) as:

\[
-\sum_{w' \sim P_n(w)} \max\big(0,\ \gamma - \cos(u_{w_t}, v_{w_t}) + \cos(u_{w_t}, v_{w'})\big) \qquad (4)
\]

where the minus sign turns a loss into a cost, γ is the margin, u_{w_t} is the target multimodally-enhanced word representation we aim to learn, v_{w_t} is the corresponding visual vector (fixed in advance) and v_{w'} ranges over visual representations of words (featured in our image dictionary) randomly sampled from distribution P_n(w_t). These random visual representations act as "negative" samples, encouraging u_{w_t} to be more similar to its own visual representation than to that of other words. The sampling distribution is currently set to uniform, and the number of negative samples is controlled by hyperparameter k.
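A rough sketch of Equation 4 (our own illustration, not the released code), with cosine similarity and k uniformly sampled negative visual vectors:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l_vision_A(u_wt, v_wt, visual_matrix, k=20, gamma=0.5, rng=np.random.default_rng(0)):
    """Equation 4 (sketch): max-margin term pulling the word vector u_wt towards its
    own visual vector v_wt and away from k visual vectors sampled uniformly from the
    image dictionary (rows of `visual_matrix`)."""
    obj = 0.0
    for idx in rng.integers(0, len(visual_matrix), size=k):   # negative samples
        obj -= max(0.0, gamma - cosine(u_wt, v_wt) + cosine(u_wt, visual_matrix[idx]))
    return obj
```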
3.4 Multi-modal Skip-gram Model B

The visual objective in MMSKIP-GRAM-A has the drawback of assuming a direct comparison of linguistic and visual representations, constraining them to be of equal size. MMSKIP-GRAM-B lifts this constraint by including an extra layer mediating between linguistic and visual representations (see Figure 1 for a sketch of MMSKIP-GRAM-B). Learning this layer is equivalent to estimating a cross-modal mapping matrix from linguistic onto visual representations, jointly induced with linguistic word embeddings. The extension is straightforwardly implemented by substituting, into Equation 4, the word representation u_{w_t} with z_{w_t} = M^{u→v} u_{w_t}, where M^{u→v} is the cross-modal mapping matrix to be induced. To avoid overfitting, we also add an L2 regularization term for M^{u→v} to the overall objective (Equation 3), with its relative importance controlled by hyperparameter λ.
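A matching sketch for MMSKIP-GRAM-B (again our own illustration): the word vector is first mapped into visual space through M, the same max-margin term is applied, and the L2 penalty enters the maximized objective with a negative sign:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l_vision_B(u_wt, v_wt, M, visual_matrix, k=5, gamma=0.5, lam=1e-4,
               rng=np.random.default_rng(0)):
    """MMSKIP-GRAM-B (sketch): map the word vector into visual space through the
    cross-modal matrix M (z_wt = M u_wt), apply the max-margin term of Equation 4
    to z_wt, and subtract an L2 penalty on M from the maximized objective."""
    z = M @ u_wt                                              # z_wt = M^{u->v} u_wt
    obj = 0.0
    for idx in rng.integers(0, len(visual_matrix), size=k):   # negative samples
        obj -= max(0.0, gamma - cosine(z, v_wt) + cosine(z, visual_matrix[idx]))
    return obj - lam * float(np.sum(M ** 2))                  # L2 regularization
```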
4 Experimental Setup

The parameters of all models are estimated by backpropagation of error via stochastic gradient descent. Our text corpus is a Wikipedia 2009 dump comprising approximately 800M tokens.[1] To train the multimodal models, we add visual information for 5,100 words that have an entry in ImageNet (Deng et al., 2009), occur at least 500 times in the corpus and have concreteness score ≥ 0.5 according to Turney et al. (2011). On average, about 5% of the tokens in the text corpus are associated to a visual representation. To construct the visual representation of a word, we sample 100 pictures from its ImageNet entry, and extract a 4096-dimensional vector from each picture using the Caffe toolkit (Jia et al., 2014), together with the pre-trained convolutional neural network of Krizhevsky et al. (2012). The vector corresponds to activation in the top (FC7) layer of the network. Finally, we average the vectors of the 100 pictures associated to each word, deriving 5,100 aggregated visual representations.

[1] http://wacky.sslmit.unibo.it
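The aggregation step can be sketched as follows; `extract_fc7` is a placeholder for the Caffe/CNN feature extraction described above, which we do not reproduce here:

```python
import numpy as np

def aggregate_visual_vector(image_paths, extract_fc7):
    """Average per-image CNN features into one visual vector for a word.
    `extract_fc7` stands in for the extractor described in the text (FC7
    activations, 4096 dimensions, obtained with Caffe and the network of
    Krizhevsky et al. 2012); only the averaging step is shown."""
    feats = np.stack([extract_fc7(p) for p in image_paths])   # shape: (n_images, 4096)
    return feats.mean(axis=0)                                 # aggregated visual vector
```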
Hyperparameters For both SKIP-GRAM and the MMSKIP-GRAM models, we fix hidden layer size to 300. To facilitate comparison between MMSKIP-GRAM-A and MMSKIP-GRAM-B, and since the former requires equal linguistic and visual dimensionality, we keep the first 300 dimensions of the visual vectors. For the linguistic objective, we use hierarchical softmax with a Huffman frequency-based encoding tree, setting frequency subsampling option t = 0.001 and window size c = 5, without tuning. The following hyperparameters were tuned on the text9 corpus:[2] MMSKIP-GRAM-A: k = 20, γ = 0.5; MMSKIP-GRAM-B: k = 5, γ = 0.5, λ = 0.0001.

[2] http://mattmahoney.net/dc/textdata.html
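For reference, the settings listed above can be collected in a small configuration sketch (the dictionary layout and names are ours):

```python
# Settings reported in the Hyperparameters paragraph, gathered in one place.
HYPERPARAMS = {
    "embedding_dim": 300,          # hidden layer size; also the visual dimensions kept
    "subsampling_t": 0.001,        # frequency subsampling option t
    "window_c": 5,                 # context window size
    "mmskipgram_a": {"k": 20, "gamma": 0.5},
    "mmskipgram_b": {"k": 5, "gamma": 0.5, "lambda": 0.0001},
}
```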
5 Experiments

5.1 Approximating human judgments

Benchmarks A widely adopted way to test DSMs and their multimodal extensions is to measure how well model-generated scores approximate human similarity judgments about pairs of words. We put together various benchmarks covering diverse aspects of meaning, to gain insights on the effect of perceptual information on different similarity facets. Specifically, we test on general relatedness (MEN, Bruni et al. (2014), 3K pairs), e.g., pickles are related to hamburgers, semantic (≈ taxonomic) similarity (Simlex-999, Hill et al. (2014), 1K pairs; SemSim, Silberer and Lapata (2014), 7.5K pairs), e.g., pickles are similar to onions, as well as visual similarity (VisSim, Silberer and Lapata (2014), same pairs as SemSim with different human ratings), e.g., pickles look like zucchinis.
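The evaluation itself is straightforward; a sketch (our own, assuming word vectors stored in a dict) that scores one benchmark by Spearman correlation between model cosines and the human ratings:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_benchmark(pairs, human_ratings, vectors):
    """Spearman correlation between model cosine similarities and human judgments.
    `pairs` is a list of (word1, word2); `vectors` maps word -> np.ndarray.
    Pairs with out-of-vocabulary words are skipped."""
    model_scores, gold = [], []
    for (w1, w2), rating in zip(pairs, human_ratings):
        if w1 in vectors and w2 in vectors:
            a, b = vectors[w1], vectors[w2]
            model_scores.append(float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            gold.append(rating)
    return spearmanr(model_scores, gold).correlation
```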
Alternative Multimodal Models We compare our models against several recent alternatives. We test the vectors made available by Kiela and Bottou (2014). Similarly to us, they derive textual features with the skip-gram model (from a portion of Wikipedia and the British National Corpus) and use visual representations extracted from the ESP data-set (von Ahn and Dabbish, 2004) through a convolutional neural network (Oquab et al., 2014). They concatenate textual and visual features after normalizing to unit length and centering to zero mean. We also test the vectors that performed best in the evaluation of Bruni et al. (2014), based on textual features extracted from a 3B-token corpus and SIFT-based Bag-of-Visual-Words visual features (Sivic and Zisserman, 2003) extracted from the ESP collection. Bruni and colleagues fuse a weighted concatenation of the two components through SVD. We further re-implement both methods with our own textual and visual embeddings as CONCATENATION and SVD (with target dimensionality 300, picked without tuning). Finally, we present for comparison the results on SemSim and VisSim reported by Silberer and Lapata (2014), obtained with a stacked-autoencoders architecture run on textual features extracted from Wikipedia with the Strudel algorithm (Baroni et al., 2010) and attribute-based visual features (Farhadi et al., 2009) extracted from ImageNet.

All benchmarks contain a fair amount of words for which we did not use direct visual evidence. We are interested in assessing the models both in terms of how they fuse linguistic and visual evidence when they are both available, and for their robustness in lack of full visual coverage. We thus evaluate them in two settings. The visual-coverage columns of Table 1 (those on the right) report results on the subsets for which all compared models have access to direct visual information for both words. We further report results on the full sets ("100%" columns of Table 1) for models that can propagate visual information and that, consequently, can meaningfully be tested on words without direct visual representations.

Results The state-of-the-art visual CNN FEATURES alone perform remarkably well, outperforming the purely textual model (SKIP-GRAM) in two tasks, and achieving the best absolute performance on the visual-coverage subset of Simlex-999. Regarding multimodal fusion (that is, focusing on the visual-coverage subsets), both MMSKIP-GRAM models perform very well, at the top or just below it on all tasks, with comparable results for the two variants. Their performance is also good on the full data sets, where they consistently outperform SKIP-GRAM and SVD (which is much more strongly affected by the lack of complete visual information). They are just a few points below the state-of-the-art MEN correlation (0.8), achieved by Baroni et al. (2014) with a corpus 3× larger than ours and extensive tuning. MMSKIP-GRAM-B is close to the state of the art for Simlex-999, reported by the resource creators to be at 0.41 (Hill et al., 2014). Most impressively, MMSKIP-GRAM-A reaches the performance level of the Silberer and Lapata (2014) model on their SemSim and VisSim data sets, despite the fact that the latter has full visual-data coverage and uses attribute-based image representations, requiring supervised learning of attribute classifiers, that achieve performance in the semantic tasks comparable to or higher than that of our CNN features (see Table 3 in Silberer and Lapata (2014)). Finally, while the multimodal models (unsurprisingly) bring about a large performance gain over the purely linguistic model on visual similarity, the improvement is consistently large for the other benchmarks as well, confirming that multimodality leads to better semantic models in general, which can help in capturing different types of similarity (general relatedness, strictly taxonomic, perceptual).

While we defer to further work a better understanding of the relation between multimodal grounding and different similarity relations, Table 2 provides qualitative insights on how injecting visual information changes the structure of semantic space. The top SKIP-GRAM neighbours of donuts are places where you might encounter them, whereas the multimodal models relate them to other take-away food, ranking visually-similar pizzas at the top.
Model                  MEN          Simlex-999   SemSim       VisSim
                       100%   42%   100%   29%   100%   85%   100%   85%
KIELA AND BOTTOU       -      0.74  -      0.33  -      0.60  -      0.50
BRUNI ET AL.           -      0.77  -      0.44  -      0.69  -      0.56
SILBERER AND LAPATA    -      -     -      -     0.70   -     0.64   -
CNN FEATURES           -      0.62  -      0.54  -      0.55  -      0.56
SKIP-GRAM              0.70   0.68  0.33   0.29  0.62   0.62  0.48   0.48
CONCATENATION          -      0.74  -      0.46  -      0.68  -      0.60
SVD                    0.61   0.74  0.28   0.46  0.65   0.68  0.58   0.60
MMSKIP-GRAM-A          0.75   0.74  0.37   0.50  0.72   0.72  0.63   0.63
MMSKIP-GRAM-B          0.74   0.76  0.40   0.53  0.66   0.68  0.60   0.60

Table 1: Spearman correlation between model-generated similarities and human judgments. Right columns report correlation on visual-coverage subsets (percentage of original benchmark covered by subsets on first row of respective columns). First block reports results for out-of-the-box models; second block for visual and textual representations alone; third block for our implementation of multimodal models.
Table 2: Ordered top 3 neighbours of example words in purely textual and multimodal spaces. Only donut
and owl were trained with direct visual information.
The owl example shows how multimodal models pick taxonomically closer neighbours of concrete objects, since often closely related things also look similar (Bruni et al., 2014). In particular, both multimodal models get rid of squirrels and offer other birds of prey as nearest neighbours. No direct visual evidence was used to induce the embeddings of the remaining words in the table, which are thus influenced by vision only by propagation. The subtler but systematic changes we observe in such cases suggest that this indirect propagation is not only non-damaging with respect to purely linguistic representations, but actually beneficial. For the concrete mural concept, both multimodal models rank paintings and portraits above less closely related sculptures (they are not a form of painting). For tobacco, both models rank cigarettes and cigar over coffee, and MMSKIP-GRAM-B avoids the arguably less common "crop" sense cued by corn. The last two examples show how the multimodal models turn up the embodiment level in their representation of abstract words. For depth, their neighbours suggest a concrete marine setup over the more abstract measurement sense picked by the SKIP-GRAM neighbours. For chaos, they rank a demon, that is, a concrete agent of chaos, at the top, and replace the more abstract notion of despair with equally gloomy but more imageable shadows and destruction (more on abstract words below).

5.2 Zero-shot image labeling and retrieval

The multimodal representations induced by our models should be better suited than purely text-based vectors to label or retrieve images. In particular, given that the quantitative and qualitative results collected so far suggest that the models propagate visual information across words, we apply them to image labeling and retrieval in the challenging zero-shot setup (see Section 2 above).[3]

[3] We will refer here, for conciseness' sake, to image labeling/retrieval, but, as our visual vectors are aggregated representations of images, the tasks we're modeling consist, more precisely, in labeling a set of pictures denoting the same object and retrieving the corresponding set given the name of the object.
Setup We take out as test set 25% of the 5.1K words we have visual vectors for. The multimodal models are re-trained without visual vectors for these words, using the same hyperparameters as above. For both tasks, the search for the correct word label/image is conducted on the whole set of 5.1K word/visual vectors.

In the image labeling task, given a visual vector representing an image, we map it onto word space, and label the image with the word corresponding to the nearest vector. To perform the vision-to-language mapping, we train a Ridge regression by 5-fold cross-validation on the test set (for SKIP-GRAM only, we also add the remaining 75% of word-image vector pairs used in estimating the multimodal models to the Ridge training data).[4]

[4] We use one fold to tune the Ridge λ, three to estimate the mapping matrix, and test on the last fold. To enforce strict zero-shot conditions, we exclude from the test fold labels occurring in the LSVRC2012 set that was employed to train the CNN of Krizhevsky et al. (2012), which we use to extract visual features.

In the image retrieval task, given a linguistic/multimodal vector, we map it onto visual space, and retrieve the nearest image. For SKIP-GRAM, we use Ridge regression with the same training regime as for the labeling task. For the multimodal models, since maximizing similarity to visual representations is already part of their training objective, we do not fit an extra mapping function. For MMSKIP-GRAM-A, we directly look for nearest neighbours of the learned embeddings in visual space. For MMSKIP-GRAM-B, we use the M^{u→v} mapping function induced while learning word embeddings.
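A sketch of the labeling pipeline and of the precision@k figure of merit used in Tables 3 and 4 (our own toy illustration; the Ridge `alpha` stands in for the tuned λ, and scikit-learn is assumed):

```python
import numpy as np
from sklearn.linear_model import Ridge

def precision_at_k(queries, targets, gold_indices, k):
    """Fraction of queries whose gold target is among the k nearest targets (cosine)."""
    Q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    T = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    hits = sum(int(g in np.argsort(-(T @ q))[:k]) for q, g in zip(Q, gold_indices))
    return hits / len(gold_indices)

def label_images(train_visual, train_word_vecs, test_visual, word_matrix, gold, k=1):
    """Zero-shot labeling with a text-only model (sketch): learn a vision-to-language
    mapping with Ridge regression, project test image vectors into word space, and
    check whether the correct word is among the k nearest word vectors."""
    ridge = Ridge(alpha=1.0).fit(train_visual, train_word_vecs)
    mapped = ridge.predict(test_visual)          # image features -> word space
    return precision_at_k(mapped, word_matrix, gold, k)
```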
              P@1   P@2   P@10   P@20   P@50
SKIP-GRAM     1.5   2.6   14.2   23.5   36.1
MMSKIP-GRAM-A 2.1   3.7   16.7   24.6   37.6
MMSKIP-GRAM-B 2.2   5.1   20.2   28.5   43.5

Table 3: Percentage precision@k results in the zero-shot image labeling task.

              P@1   P@2   P@10   P@20   P@50
SKIP-GRAM     1.9   3.3   11.5   18.5   30.4
MMSKIP-GRAM-A 1.9   3.2   13.9   20.2   33.6
MMSKIP-GRAM-B 1.9   3.8   13.2   22.5   38.3

Table 4: Percentage precision@k results in the zero-shot image retrieval task.

Results In image labeling (Table 3) SKIP-GRAM is outperformed by both multimodal models, confirming that these models produce vectors that are directly applicable to vision tasks thanks to visual propagation. The most interesting results however are achieved in image retrieval (Table 4), which is essentially the task the multimodal models have been implicitly optimized for, so that they could be applied to it without any specific training. The strategy of directly querying for the nearest visual vectors of the MMSKIP-GRAM-A word embeddings works remarkably well, outperforming SKIP-GRAM (which requires an ad-hoc mapping function) at the higher ranks. This suggests that the multimodal embeddings we are inducing, while general enough to achieve good performance in the semantic tasks discussed above, encode sufficient visual information for direct application to image analysis tasks. This is especially remarkable because the word vectors we are testing were not matched with visual representations at model training time, and are thus multimodal only by propagation. The best performance is achieved by MMSKIP-GRAM-B, confirming our claim that its M^{u→v} matrix acts as a multimodal mapping function.

5.3 Abstract words

We have already seen, through the depth and chaos examples of Table 2, that the indirect influence of visual information has interesting effects on the representation of abstract terms. The latter have received little attention in multimodal semantics, with Hill and Korhonen (2014) concluding that abstract nouns, in particular, do not benefit from propagated perceptual information, and their representation is even harmed when such information is forced on them (see Figure 4 of their paper). Still, embodied theories of cognition have provided considerable evidence that abstract concepts are also grounded in the senses (Barsalou, 2008; Lakoff and Johnson, 1999). Since the word representations produced by MMSKIP-GRAM-A, including those pertaining to abstract concepts, can be directly used to search for near images in visual space, we decided to verify, experimentally, if these near images (of concrete things) are relevant not only for concrete words, as
expected, but also for abstract ones, as predicted by embodied views of meaning.

More precisely, we focused on the set of 200 words that were sampled across the USF norms concreteness spectrum by Kiela et al. (2014) (2 words had to be excluded for technical reasons). This set includes not only concrete (meat) and abstract (thought) nouns, but also adjectives (boring), verbs (teach), and even grammatical terms (how). Some words in the set have relatively high concreteness ratings, but are not particularly imageable, e.g.: hot, smell, pain, sweet. For each word in the set, we extracted the nearest neighbour picture of its MMSKIP-GRAM-A representation, and matched it with a random picture. The pictures were selected from a set of 5,100, all labeled with distinct words (the picture set includes, for each of the words associated to visual information as described in Section 4, the nearest picture to its aggregated visual representation). Since it is much more common for

            global   |words|   unseen   |words|
all         48%      198       30%      127
concrete    73%      99        53%      30
abstract    23%      99        23%      97

Table 5: Subjects' preference for nearest visual neighbour of words in Kiela et al. (2014) vs. random pictures. Figure of merit is percentage proportion of significant results in favor of nearest neighbour across words. Results are reported for the whole set, as well as for words above (concrete) and below (abstract) the concreteness rating median. The unseen column reports results when words exposed to direct visual evidence during training are discarded. The words columns report set cardinality.

[Figure: example words freedom, theory, wrong / god, together, place]