Show and Tell: A Neural Image Caption Generator (CVPR 2015)

Presenters: Tianlu Wang, Yin Zhang
October 5th
Neural Image Caption (NIC)
Main Goal: automatically describe the content of an image using properly formed English sentences

Human: A young girl asleep on the sofa cuddling a stuffed bear.
NIC: A baby is asleep next to a teddy bear.

Mathematically, the aim is to build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence | Image) of producing a target sequence of words.
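In symbols (following the paper), the model parameters θ are trained to maximize the log-likelihood of the correct description S given the image I, and the sentence probability factors over words by the chain rule:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})
```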
Inspiration from Machine Translation task

The target sentence is generated by maximizing the likelihood P(T|S), where T is the target sentence and S is the source sentence.
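In symbols, decoding picks the target sentence with the highest conditional probability under the model:

```latex
\hat{T} = \arg\max_{T} P(T \mid S)
```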

Use the Encoder-Decoder structure:
• Encoder (RNN): transforms the source sentence into a rich fixed-length vector
• Decoder (RNN): takes the output of the encoder as input and generates the target sentence

An example of translating the source-language sequence "ABCD" into the target-language sequence "XYZQ".
NIC Model Architecture
Follows the Encoder-Decoder structure:
• Encoder (deep CNN): transforms the image into a rich fixed-length vector
• Decoder (RNN): takes the output of the encoder as input and generates the target sentence
NIC Model Architecture

• Choice of CNN: the winner of the ILSVRC 2014 classification competition
• Choice of RNN: an LSTM RNN (Recurrent Neural Network with LSTM cells)
• During training, the CNN was left unchanged; only the RNN part was trained.
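A minimal sketch of this CNN-encoder / LSTM-decoder setup, assuming PyTorch/torchvision; a small ResNet stands in here for the ILSVRC 2014 winner used in the paper, and all module names are illustrative:

```python
# Minimal NIC-style encoder-decoder sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torchvision.models as models

class NIC(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights="DEFAULT")             # pretrained CNN encoder (stand-in)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        for p in self.encoder.parameters():                  # CNN left unchanged during training
            p.requires_grad = False
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)              # image -> fixed-length vector
        img_emb = self.img_proj(feats).unsqueeze(1)          # (B, 1, E): fed to the LSTM first
        word_emb = self.embed(captions)                      # (B, T, E): previous words
        hidden, _ = self.lstm(torch.cat([img_emb, word_emb], dim=1))
        return self.out(hidden)                              # word scores at every step
```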
RNN (Recurrent Neural Network)

• Why? Sequential tasks: speech, text, video…
  E.g. translating a word based on the previous one
• Advantage: passes information from one step to the next, so information persists
• How? Loops: multiple copies of the same cell (module), each passing a message to its successor

Want to know more? http://karpathy.github.io/2015/05/21/rnn-effectiveness/
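A minimal sketch of the recurrence in plain numpy (shapes and names are illustrative): the same cell is reused at every step, and the hidden state carries information forward.

```python
import numpy as np

hidden_size, input_size = 8, 4
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01   # hidden-to-hidden weights
W_xh = np.random.randn(hidden_size, input_size) * 0.01    # input-to-hidden weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current input with the previous hidden state,
    # so information persists from one step to the next.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)
for x_t in np.random.randn(5, input_size):   # unroll over a 5-step sequence
    h = rnn_step(x_t, h)                      # same cell, message passed to its successor
```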


RNN & LSTM
• Why is LSTM better?
  The long-term dependency problem: the translation of the last word may depend on information from the first word; when the gap between the relevant information and where it is needed grows, a plain RNN fails.
• Long Short-Term Memory networks remember information for long periods of time.
LSTM (Long Short-Term Memory)

• Cell state: information flows along it
• Gates: optionally let information through
LSTM Cont. (forget gate)

• Inputs: the current input x and the previous output h
• Produces f, a vector whose elements are between 0 and 1
• Decides what information to throw away from the cell state
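In the standard notation (as in the colah blog post listed in the references), the forget gate is:

```latex
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
```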
LSTM Cont. (input gate)

• Decide what values will be updated
• Input gate: decides what new information will be stored in the cell state
• A tanh layer creates new candidate values, pushed to be between -1 and 1
• The old cell state is then updated into the new cell state
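In the same notation, the input gate, the candidate values, and the cell-state update are:

```latex
i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad
\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right), \qquad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
```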
LSTM Cont. (output gate)

• Decide what parts of the cell state we'll output
• Output the parts we decided to
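In the same notation, the output gate and the new hidden state are:

```latex
o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t)
```

Tying the three gate slides together, a minimal numpy sketch of one LSTM step (the stacked weight layout and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [h_prev, x_t] to the four stacked gate pre-activations; b is the bias.
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0*H:1*H])         # forget gate: what to throw away from the cell state
    i = sigmoid(z[1*H:2*H])         # input gate: what new information to store
    c_tilde = np.tanh(z[2*H:3*H])   # candidate values, squashed to (-1, 1)
    o = sigmoid(z[3*H:4*H])         # output gate: what parts of the cell state to expose
    c = f * c_prev + i * c_tilde    # update the old cell state into the new one
    h = o * np.tanh(c)              # output the parts we decided to
    return h, c
```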


Result

Evaluation metric: BLEU (https://en.wikipedia.org/wiki/BLEU)
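A minimal sketch of scoring one generated caption against a reference caption with BLEU, assuming NLTK is available (tokenisation here is just whitespace splitting):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a young girl asleep on the sofa cuddling a stuffed bear".split()
candidate = "a baby is asleep next to a teddy bear".split()

# Smoothing avoids zero scores when some higher-order n-grams never match.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```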
References:
• Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. https://arxiv.org/pdf/1411.4555v2.pdf
  Talk: http://techtalks.tv/talks/show-and-tell-a-neural-image-caption-generator/61592/
• Understanding LSTM Networks, colah's blog. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
