Lecture 4a - Sequence To Sequence


Course series: Deep Learning for NLP

Sequence to Sequence Models
& Practical Considerations

Lecture # 4a

Hassan Sajjad and Fahim Dalvi
Qatar Computing Research Institute, HBKU
Recap
• An RNN predicts a word based on the context
• The context vector at time t can be seen as a summary of the previous words

Inputs:  hello how are you
Outputs: how are you ?
(the final state is a summary of the sentence)
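To make the recap concrete, here is a minimal PyTorch sketch of an RNN whose final hidden state serves as the summary of the sentence; the toy vocabulary and all sizes are made-up illustrations, not part of the lecture.

```python
# Minimal sketch: the final RNN state as a summary of the sentence.
# Vocabulary, indices, and sizes below are illustrative only.
import torch
import torch.nn as nn

vocab = {"hello": 0, "how": 1, "are": 2, "you": 3, "?": 4}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)
rnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

sentence = torch.tensor([[vocab[w] for w in ["hello", "how", "are", "you"]]])  # (1, 4)
outputs, last_state = rnn(embed(sentence))   # outputs: (1, 4, 32), last_state: (1, 1, 32)

# outputs[:, t] is the context vector after reading words 0..t;
# last_state is the "summary of the sentence" from the recap.
summary = last_state.squeeze(0)              # (1, 32)
```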


Sequence to Sequence Models
Seq2Seq models take RNNs a step further: they take a sequence as input and
predict another sequence as output, for example:
• Translation: Parallel source and target sentences
– sequence of words → sequence of words
• Speech Recognition: Audio waves to transcription
– sequence of audio signals → sequence of words
• Image Captioning: Images to text sequence
– “sequence” of pixels → sequence of words
Sequence to Sequence Models
Intuitively, the seq2seq model learns to:
• Read a source sequence completely
• Predict a target sequence
Sequence to Sequence Models
Different from
● Per timestep prediction using RNN, here both
source and target are sequences
● Sequence generation using RNN, both source
and target are from different worlds
Sequence to Sequence Models
• Machine translation as an example
– Pairs of sentences in two languages
– Given a source sequence of words, predict/generate the target sequence of words
– The target sequence can only be generated after processing the entire source sequence

He does not go home        Er geht ja nicht nach Hause
I am working on it         Ich arbeite daran

Source language            Target language
Sequence to Sequence Models
• So far, we have seen an RNN with only one
language involved:

English
Sequence to Sequence Models
• Simple seq2seq models can be seen as
consisting of bilingual RNNs
– Consider two language models

English German
Sequence to Sequence Models
• Recap: a language model looks at a sequence
of words and predicts the next word

English German
Sequence to Sequence Models
• Now, consider a Bilingual RNN
– Imagine English and German sentences as a
single sequence of strings
– No explicit information about source and target
language

John is driving a car . John fährt ein Auto .

Boundary symbols mark the source and target sentence endings
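As a small illustration of this idea, the sketch below joins a sentence pair into one combined sequence using a boundary token; the token choice and function name are hypothetical.

```python
# Sketch: turn a parallel sentence pair into one combined sequence,
# using "-" as the boundary symbol shown on the slides (names are illustrative).
BOUNDARY = "-"

def make_bilingual_sequence(source_words, target_words):
    return source_words + [BOUNDARY] + target_words + [BOUNDARY]

combined = make_bilingual_sequence(
    ["John", "is", "driving", "a", "car"],
    ["John", "fährt", "ein", "Auto"],
)
# The "bilingual RNN" is trained to predict the next token of this combined
# sequence, with no explicit labels for source vs. target language.
print(combined)
```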
Sequence to Sequence Models
• Consider this combined form as a single language

Outputs:                        John fährt ein Auto -
Inputs:  John is driving a car - John fährt ein Auto
         Source                  Target

Sequence to Sequence Models
• Essentially, the first RNN is summarizing the English sentence into a vector,
and the second RNN uses this summary to generate German!
Sequence to Sequence Models

Encoder                          Decoder
                                 John fährt ein Auto -
John is driving a car -          John fährt ein Auto

• This is also called the Encoder-Decoder architecture!
• The model learns to read a source sequence, and then predict the corresponding target sequence
• We generally do not produce any outputs in the Encoder, since we are only interested in generating the second sequence
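To make the Encoder-Decoder architecture concrete, here is a minimal PyTorch sketch under illustrative assumptions: the module names, sizes, GRU cells, and teacher-forced decoder input are choices made for this example, not the lecture's exact setup.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch (illustrative names and sizes)."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)    # source embeddings
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)    # target embeddings
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)          # output layer over target vocab

    def forward(self, src, tgt_in):
        # Encoder: read the whole source; keep only its final state
        # (no outputs are produced on the encoder side).
        _, state = self.encoder(self.src_embed(src))
        # Decoder: generate target-word scores conditioned on the source summary.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in), state)
        return self.out(dec_out)                         # (batch, tgt_len, tgt_vocab)

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 6))      # two source sentences of length 6
tgt_in = torch.randint(0, 1200, (2, 5))   # shifted target input ("- John fährt ein Auto")
scores = model(src, tgt_in)               # (2, 5, 1200)
```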
Sequence to Sequence Models
Anatomy of a seq2seq model:
• Source Embeddings and Target Embeddings
• Recurrent Layers
• Output Layer

                                 John fährt ein Auto -
John is driving a car -          John fährt ein Auto
Output Layer
Recap: the total number of classes is equal to the vocabulary size of the target
language. At each decoding step, the output layer produces a score for every
target word, e.g. "katze", "hund", "John", "haus", "tür", "schule", "laptop",
"phone", "zebra", ...

John is driving a car -    →    John
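For illustration, the output layer can be sketched as a linear projection from the decoder state to one score per target word, followed by a softmax; the tiny vocabulary and sizes below are made up.

```python
import torch
import torch.nn as nn

tgt_vocab = ["katze", "hund", "John", "haus", "tür", "schule", "laptop", "phone", "zebra"]
output_layer = nn.Linear(128, len(tgt_vocab))    # one output class per target word

decoder_state = torch.randn(1, 128)              # state after reading "John is driving a car -"
scores = output_layer(decoder_state)             # raw score for each target word
probs = torch.softmax(scores, dim=-1)            # probabilities over the target vocabulary
predicted = tgt_vocab[int(probs.argmax())]       # e.g. "John" in the slide's example
```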
Sequence to Sequence Models
Anatomy of a seq2seq model
More layers usually lead to stronger models
John fährt ein Auto -

John is driving a car - John fährt ein Auto


Sequence to Sequence Models
• End-to-end training: one loss optimizes all parameters
• Better use of context: uses the source sentence and the previously predicted target words
• Distributed word representations capture semantic similarities
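A minimal sketch of the "one loss" idea, continuing the hypothetical Seq2Seq example from the earlier sketch (it reuses `model`, `src`, and `tgt_in` defined there): a single cross-entropy loss over all target positions updates embeddings, encoder, decoder, and output layer together.

```python
import torch
import torch.nn as nn

# Continues the illustrative Seq2Seq sketch above (model, src, tgt_in).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tgt_out = torch.randint(0, 1200, (2, 5))               # gold next word at each target position
scores = model(src, tgt_in)                            # (2, 5, 1200)
loss = criterion(scores.reshape(-1, 1200), tgt_out.reshape(-1))

optimizer.zero_grad()
loss.backward()                                        # one loss updates all parameters end-to-end
optimizer.step()
```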
Problems with Sequence to Sequence Models

Problems with Seq2Seq Models
• Sequence to sequence models perform poorly when translating long sentences
• Issue: the entire sentence is represented as a single fixed-size vector
• The relationship between source and target words is very abstract
• Not all source words are equally important for predicting a given target word

I am fine - Ich    →    Ich bin
Problems with Seq2Seq Models
• Sequence to sequence models perform poorly
when translating long sentences

https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Attention Mechanism
One vector represents all source words

Anna watches the Talkshow

Anna sieht sich die Talkshow an


Attention Mechanism
Source and target words inherently have relationships

Anna watches the Talkshow

Anna sieht sich die Talkshow an


Attention Mechanism
(Anna, Anna) are more relevant to each other than
(Anna, Talkshow)

Anna watches the Talkshow

Anna sieht sich die Talkshow an


Attention Mechanism
Solution: For each target word, concentrate only on
specific source words
Anna watches the Talkshow

Anna sieht sich die Talkshow an


Attention Mechanism
Formally, given the previous target word, the attention mechanism "scores" each source word

John is driving a car -    →    John

Scores over the source words:                        2    1    5    1    1

Use softmax to convert the scores into probabilities:  0.2  0.1  0.5  0.1  0.1

Build the context vector by taking a weighted sum over the source,
and use the context vector to predict the next word ("fährt")

Attention Mechanism
Similarly, get source scores for the next prediction

John is driving a car -    →    John fährt

Probabilities after softmax:                         0.1  0.0  0.1  0.6  0.2

Use the new context vector to predict the next word ("ein")

Neural Machine Translation by Luong, Cho and Manning
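A minimal sketch of the steps above, using simple dot-product scores; this is an illustrative variant of attention, not necessarily the exact scoring function used by Luong et al.

```python
import torch

# encoder_states: one vector per source word ("John is driving a car -")
# decoder_state:  the decoder's current state (after producing "John")
encoder_states = torch.randn(1, 6, 128)        # (batch, src_len, hidden), illustrative
decoder_state = torch.randn(1, 1, 128)         # (batch, 1, hidden)

# 1. Score each source word (here: dot product with the decoder state).
scores = torch.bmm(decoder_state, encoder_states.transpose(1, 2))   # (1, 1, 6)

# 2. Softmax turns the scores into probabilities, e.g. 0.2 0.1 0.5 0.1 0.1 ...
weights = torch.softmax(scores, dim=-1)

# 3. Context vector = weighted sum over the source representations.
context = torch.bmm(weights, encoder_states)   # (1, 1, 128)

# 4. The context vector (together with the decoder state) is then used to
#    predict the next target word ("fährt").
```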
Empirical Evaluation of Attention
[Figure: translation quality with attention vs. without attention]

Neural Machine Translation by Luong, Cho and Manning


Attention Mechanism - Visual
Aligning words
If we plot the source scores for each target word, we can see what each target
word is aligned to.

Note the reordering in this particular example.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation
by Jointly Learning to Align and Translate. ICLR'15
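For illustration, attention weights collected during decoding can be plotted as a heatmap like the figures in Bahdanau et al.; the weight matrix below is random stand-in data rather than output from a trained model.

```python
import torch
import matplotlib.pyplot as plt

src_words = ["John", "is", "driving", "a", "car", "-"]
tgt_words = ["John", "fährt", "ein", "Auto", "-"]

# Stand-in attention matrix: one row of source weights per target word.
attention = torch.softmax(torch.randn(len(tgt_words), len(src_words)), dim=-1)

plt.imshow(attention, cmap="gray")
plt.xticks(range(len(src_words)), src_words, rotation=45)
plt.yticks(range(len(tgt_words)), tgt_words)
plt.xlabel("source words")
plt.ylabel("target words")
plt.show()
```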
Applications of Sequence to Sequence Models
• Summarization
• Dialog-based systems (chatbots)
• Speech recognition
• Image captioning
• Machine translation
• ...
Image Captioning

https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2
Summary
• A bilingual LSTM translates from one language to another language
• The Encoder-Decoder model helps us encode a sequence into a summary vector
and use that vector to decode another sequence
• The attention mechanism learns a soft alignment between source and target words
• OpenNMT toolkit: http://opennmt.net/
