Show and Tell: A Neural Image Caption Generator (CVPR 2015)
Presenters: Tianlu Wang, Yin Zhang
October 5th


Neural Image Caption (NIC)
Main goal: automatically describe the content of an image using properly formed English sentences.

H: A young girl asleep on the sofa cuddling a stuffed bear.
NIC: A baby is asleep next to a teddy bear.

Mathematically, the goal is to build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence|Image) of producing a target sequence of words.
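
Written out, the objective can be sketched as below. This follows the formulation in the Show and Tell paper. The symbols θ (model parameters), S (the sentence), S_t (its t-th word) and N (its length) are notation added here and do not appear on the slide.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})
```

The second equation is the chain rule: the sentence likelihood factors into one next-word prediction per position, which is what the decoder is trained to produce.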
Inspiration from Machine Translation task

The target sentence is generated by maximizing the likelihood P(T|S), where T is the sentence in the target language and S is the sentence in the source language.

Use the Encoder-Decoder structure:

• Encoder (RNN): transforms the source sentence into a rich fixed-length vector
• Decoder (RNN): takes the output of the encoder as input and generates the target sentence

Example: translating words written in the source language, “ABCD”, into words in the target language, “XYZQ”.
NIC Model Architecture
Follows the Encoder-Decoder structure:
• Encoder (deep CNN): transforms the image into a rich fixed-length vector
• Decoder (RNN): takes the output of the encoder as input and generates the target sentence
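
The encode-then-decode flow can be sketched in a few lines of Python. This is a toy illustration, not the NIC implementation: `encode`, `decoder_step`, `greedy_caption`, the `<s>`/`</s>` tokens and the hand-written probability table are all hypothetical stand-ins for the CNN and the LSTM. Greedy word selection is used for brevity; the paper also describes beam search.

```python
START, END = "<s>", "</s>"  # hypothetical start/end-of-sentence tokens


def encode(image):
    """Stand-in for the CNN: map an image of any size to a fixed-length vector."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels), min(pixels)]


def decoder_step(state, word):
    """Stand-in for one RNN step: return (new_state, next-word probabilities).

    A real decoder would compute these from learned weights and from `state`;
    this toy version looks them up in a fixed table and passes `state` through.
    """
    table = {
        START: {"a": 0.9, "baby": 0.1},
        "a": {"baby": 0.8, END: 0.2},
        "baby": {"sleeps": 0.7, END: 0.3},
        "sleeps": {END: 1.0},
    }
    return state, table[word]


def greedy_caption(image, max_len=10):
    state = encode(image)  # the image enters the model once, via the encoder
    word, caption = START, []
    for _ in range(max_len):
        state, probs = decoder_step(state, word)
        word = max(probs, key=probs.get)  # greedy: take the most likely word
        if word == END:
            break
        caption.append(word)
    return " ".join(caption)


caption = greedy_caption([[0.0, 1.0], [0.5, 0.5]])
```

The point of the sketch is the interface: whatever the image size, the decoder only ever sees the fixed-length vector returned by the encoder.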
NIC Model Architecture Cont.

Choice of CNN: the winner of the ILSVRC 2014 classification competition

Choice of RNN: LSTM RNN (Recurrent Neural Network with LSTM cells)

In the training process, the authors left the CNN unchanged and trained only the RNN part.
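
Leaving the CNN unchanged amounts to skipping its weights in the update step. A minimal sketch, assuming plain gradient descent and two hypothetical parameter groups named `cnn` and `lstm` (the numbers are made up for illustration):

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Return updated parameters; groups listed in `frozen` keep their values."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)  # frozen: copy through unchanged
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


params = {"cnn": [0.5, -0.2], "lstm": [0.1, 0.3]}
grads = {"cnn": [1.0, 1.0], "lstm": [1.0, -1.0]}
new_params = sgd_step(params, grads, frozen={"cnn"}, lr=0.1)
```

Here `new_params["cnn"]` equals the original CNN weights even though their gradients are non-zero, while the `lstm` weights move.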
RNN (Recurrent Neural Network)

• Why? Sequential tasks: speech, text, video, ...
E.g. translating a word based on the previous one
• Advantage: passes information from one step to the next (information persistence)
• How? Loops: multiple copies of the same cell (module), each passing a message to its successor
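
The loop can be sketched with a scalar recurrence, h_t = tanh(w_x * x_t + w_h * h_(t-1) + b). The weights `w_x`, `w_h`, `b` and the helper names are arbitrary values chosen for illustration, not taken from the slides.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a scalar vanilla RNN cell."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, h0=0.0):
    """Apply the same cell at every step, passing the state h to the successor."""
    h, states = h0, []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


# Only the first input is non-zero, yet later states stay non-zero:
# the information from step 1 persists through the loop.
states = run_rnn([1.0, 0.0, 0.0])
```

The same function `rnn_step` is reused at every position, which is the "multiple copies of the same cell" view of the loop.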

Want to know more?

• Why is LSTM better?
Long-term dependency problem: the translation of the last word may depend on information from the first word. When the gap between the relevant pieces of information grows, a plain RNN fails.
• Long Short Term Memory networks remember information for long periods of time.
LSTM (Long Short Term Memory)

Cell state: information flows along it
Gate: optionally lets information through
LSTM Cont. (forget gate)
Inputs: the current input x and the previous output h
Output: f (a vector; every element is between 0 and 1)

Decides what information to throw away from the cell state
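
A minimal sketch of the forget gate in its standard form, f = sigmoid(W_f · [h, x] + b_f), where [h, x] is the previous output concatenated with the current input. The weights `W_f`, `b_f` and all numbers are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(h_prev, x, W_f, b_f):
    """Return f, one value in (0, 1) per element of the cell state."""
    concat = h_prev + x  # list concatenation [h_prev, x]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]


h_prev = [0.2, -0.4]                        # previous output h
x = [1.0]                                   # current input x
W_f = [[0.5, 0.1, 2.0], [-1.0, 0.3, -3.0]]  # hypothetical weights
b_f = [0.0, 0.0]

f = forget_gate(h_prev, x, W_f, b_f)
c_prev = [1.5, -2.0]                        # old cell state
kept = [fi * ci for fi, ci in zip(f, c_prev)]  # element-wise scaling
```

An element of f close to 1 keeps the matching cell-state entry almost intact; an element close to 0 throws it away.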
LSTM Cont.
input gate: decides what new information will be stored in the cell state, i.e. which values will be updated

create new candidate values, pushed to be between -1 and 1

update the old cell state into the new cell state
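
In equation form, these three steps are usually written as below. The notation follows the "Understanding LSTM Networks" blog post: i_t is the input gate, C̃_t the candidate values, f_t the forget gate output, and ⊙ element-wise multiplication. The weight and bias names are conventional symbols, not shown on the slide.

```latex
i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
```

The tanh is what keeps the candidate values between -1 and 1, and the last line is the cell state update: forget part of the old state, then add the gated candidates.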
LSTM Cont. (output gate)

decide what parts of the cell state we’ll output
output only the parts we decided to
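
Putting the gates together, one full LSTM step can be sketched with scalars. This assumes the standard LSTM formulation; the parameter dictionary `params`, the gate names and the weight values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. `p` maps a gate name to its (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: what to throw away
    i = sigmoid(pre("i"))    # input gate: what to update
    g = math.tanh(pre("g"))  # candidate values in (-1, 1)
    c = f * c_prev + i * g   # new cell state
    o = sigmoid(pre("o"))    # output gate: which parts to output
    h = o * math.tanh(c)     # output only the parts we decided to
    return h, c


params = {name: (0.5, 0.5, 0.0) for name in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
```

The last two lines of `lstm_step` are the output gate: the cell state is squashed with tanh and then scaled by o, so the output h always lies strictly between -1 and 1.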


• Show and Tell: A Neural Image Caption Generator, Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan (CVPR 2015)
• Understanding LSTM Networks, colah’s blog
