Lecture 4a - Sequence To Sequence


Course series: Deep Learning for NLP

Sequence to Sequence Models
& Practical Considerations

Lecture # 4a

Hassan Sajjad and Fahim Dalvi
Qatar Computing Research Institute, HBKU
Recap
• An RNN predicts a word based on the context
• The context vector at time t can be seen as a summary of the previous words

Inputs:  hello how are you
Outputs: how are you ?
(the final state is a summary of the sentence)
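To make the recap concrete, here is a minimal PyTorch sketch of an RNN whose final hidden state serves as the summary of the sentence; the toy vocabulary and all sizes are made-up illustrations, not part of the lecture.

```python
# Minimal sketch: the final RNN state as a summary of the sentence.
# Vocabulary, indices, and sizes below are illustrative only.
import torch
import torch.nn as nn

vocab = {"hello": 0, "how": 1, "are": 2, "you": 3, "?": 4}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)
rnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

sentence = torch.tensor([[vocab[w] for w in ["hello", "how", "are", "you"]]])  # (1, 4)
outputs, last_state = rnn(embed(sentence))   # outputs: (1, 4, 32), last_state: (1, 1, 32)

# outputs[:, t] is the context vector after reading words 0..t;
# last_state is the "summary of the sentence" from the recap.
summary = last_state.squeeze(0)              # (1, 32)
```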


Sequence to Sequence Models
Seq2Seq models take RNNs a step further: they take a sequence as input and
predict another sequence as output, for example:
• Translation: Parallel source and target sentences
– sequence of words → sequence of words
• Speech Recognition: Audio waves to transcription
– sequence of audio signals → sequence of words
• Image Captioning: Images to text sequence
– “sequence” of pixels → sequence of words
Sequence to Sequence Models
Intuitively, the seq2seq model learns to:
• Read a source sequence completely
• Predict a target sequence
Sequence to Sequence Models
Different from
● Per timestep prediction using RNN, here both
source and target are sequences
● Sequence generation using RNN, both source
and target are from different worlds
Sequence to Sequence Models
• Machine translation as an example
– Pairs of sentences in two languages
– Given a source sequence of words, predict/generate the target sequence of words
– The target sequence can only be generated after processing the entire source sequence

He does not go home        Er geht ja nicht nach Hause
I am working on it         Ich arbeite daran

Source language            Target language
Sequence to Sequence Models
• So far, we have seen an RNN with only one
language involved:

English
Sequence to Sequence Models
• Simple seq2seq models can be seen as
consisting of bilingual RNNs
– Consider two language models

English German
Sequence to Sequence Models
• Recap: a language model looks at a sequence
of words and predicts the next word

English German
Sequence to Sequence Models
• Now, consider a Bilingual RNN
– Imagine English and German sentences as a
single sequence of strings
– No explicit information about source and target
language

John is driving a car . John fährt ein Auto .

Boundary symbols mark the source and target sentence endings
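As a small illustration of this idea, the sketch below joins a sentence pair into one combined sequence using a boundary token; the token choice and function name are hypothetical.

```python
# Sketch: turn a parallel sentence pair into one combined sequence,
# using "-" as the boundary symbol shown on the slides (names are illustrative).
BOUNDARY = "-"

def make_bilingual_sequence(source_words, target_words):
    return source_words + [BOUNDARY] + target_words + [BOUNDARY]

combined = make_bilingual_sequence(
    ["John", "is", "driving", "a", "car"],
    ["John", "fährt", "ein", "Auto"],
)
# The "bilingual RNN" is trained to predict the next token of this combined
# sequence, with no explicit labels for source vs. target language.
print(combined)
```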
Sequence to Sequence Models
• Consider this combined form as a single language

Outputs:                        John fährt ein Auto -
Inputs:  John is driving a car - John fährt ein Auto
         Source                  Target

Sequence to Sequence Models
• Essentially, the first RNN is summarizing the English sentence into a vector,
and the second RNN uses this summary to generate German!
Sequence to Sequence Models

Encoder                          Decoder
                                 John fährt ein Auto -
John is driving a car -          John fährt ein Auto

• This is also called the Encoder-Decoder architecture!
• The model learns to read a source sequence, and then predict the corresponding target sequence
• We generally do not produce any outputs in the Encoder, since we are only interested in generating the second sequence
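To make the Encoder-Decoder architecture concrete, here is a minimal PyTorch sketch under illustrative assumptions: the module names, sizes, GRU cells, and teacher-forced decoder input are choices made for this example, not the lecture's exact setup.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch (illustrative names and sizes)."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)    # source embeddings
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)    # target embeddings
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)          # output layer over target vocab

    def forward(self, src, tgt_in):
        # Encoder: read the whole source; keep only its final state
        # (no outputs are produced on the encoder side).
        _, state = self.encoder(self.src_embed(src))
        # Decoder: generate target-word scores conditioned on the source summary.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in), state)
        return self.out(dec_out)                         # (batch, tgt_len, tgt_vocab)

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 6))      # two source sentences of length 6
tgt_in = torch.randint(0, 1200, (2, 5))   # shifted target input ("- John fährt ein Auto")
scores = model(src, tgt_in)               # (2, 5, 1200)
```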
Sequence to Sequence Models
Anatomy of a seq2seq model:
• Source Embeddings and Target Embeddings
• Recurrent Layers
• Output Layer

                                 John fährt ein Auto -
John is driving a car -          John fährt ein Auto
Output Layer
Recap: the total number of classes is equal to the vocabulary size of the target
language. At each decoding step, the output layer produces a score for every
target word, e.g. "katze", "hund", "John", "haus", "tür", "schule", "laptop",
"phone", "zebra", ...

John is driving a car -    →    John
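For illustration, the output layer can be sketched as a linear projection from the decoder state to one score per target word, followed by a softmax; the tiny vocabulary and sizes below are made up.

```python
import torch
import torch.nn as nn

tgt_vocab = ["katze", "hund", "John", "haus", "tür", "schule", "laptop", "phone", "zebra"]
output_layer = nn.Linear(128, len(tgt_vocab))    # one output class per target word

decoder_state = torch.randn(1, 128)              # state after reading "John is driving a car -"
scores = output_layer(decoder_state)             # raw score for each target word
probs = torch.softmax(scores, dim=-1)            # probabilities over the target vocabulary
predicted = tgt_vocab[int(probs.argmax())]       # e.g. "John" in the slide's example
```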
Sequence to Sequence Models
Anatomy of a seq2seq model
More layers usually lead to stronger models
John fährt ein Auto -

John is driving a car - John fährt ein Auto


Sequence to Sequence Models
• End-to-end training: one loss optimizes all parameters
• Better use of context: uses the source sentence and the previously predicted target words
• Distributed word representations capture semantic similarities
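A minimal sketch of the "one loss" idea, continuing the hypothetical Seq2Seq example from the earlier sketch (it reuses `model`, `src`, and `tgt_in` defined there): a single cross-entropy loss over all target positions updates embeddings, encoder, decoder, and output layer together.

```python
import torch
import torch.nn as nn

# Continues the illustrative Seq2Seq sketch above (model, src, tgt_in).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tgt_out = torch.randint(0, 1200, (2, 5))               # gold next word at each target position
scores = model(src, tgt_in)                            # (2, 5, 1200)
loss = criterion(scores.reshape(-1, 1200), tgt_out.reshape(-1))

optimizer.zero_grad()
loss.backward()                                        # one loss updates all parameters end-to-end
optimizer.step()
```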
Problems with Sequence to Sequence Models

Problems with Seq2Seq Models
• Sequence to sequence models perform poorly when translating long sentences
• Issue: the entire sentence is represented as a single fixed-size vector
• The relationship between source and target words is very abstract
• Not all source words are equally important for predicting a given target word

I am fine - Ich    →    Ich bin
Problems with Seq2Seq Models
• Sequence to sequence models perform poorly
when translating long sentences

https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Attention Mechanism
One vector represents all source words

Anna watches the Talkshow

Anna sieht sich die Talkshow an


Attention Mechanism
Source and target words inherently have relationships

Anna watches the Talkshow

Anna sieht sich die Talkshow an


Attention Mechanism
(Anna, Anna) are more relevant to each other than
(Anna, Talkshow)

Anna watches the Talkshow

Anna sieht sich die Talkshow an


Attention Mechanism
Solution: For each target word, concentrate only on
specific source words
Anna watches the Talkshow

Anna sieht sich die Talkshow an


Attention Mechanism
Formally, given the previous target word, the attention mechanism "scores" each source word

John is driving a car -    →    John

Scores over the source words:                        2    1    5    1    1

Use softmax to convert the scores into probabilities:  0.2  0.1  0.5  0.1  0.1

Build the context vector by taking a weighted sum over the source,
and use the context vector to predict the next word ("fährt")

Attention Mechanism
Similarly, get source scores for the next prediction

John is driving a car -    →    John fährt

Probabilities after softmax:                         0.1  0.0  0.1  0.6  0.2

Use the new context vector to predict the next word ("ein")

Neural Machine Translation by Luong, Cho and Manning
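A minimal sketch of the steps above, using simple dot-product scores; this is an illustrative variant of attention, not necessarily the exact scoring function used by Luong et al.

```python
import torch

# encoder_states: one vector per source word ("John is driving a car -")
# decoder_state:  the decoder's current state (after producing "John")
encoder_states = torch.randn(1, 6, 128)        # (batch, src_len, hidden), illustrative
decoder_state = torch.randn(1, 1, 128)         # (batch, 1, hidden)

# 1. Score each source word (here: dot product with the decoder state).
scores = torch.bmm(decoder_state, encoder_states.transpose(1, 2))   # (1, 1, 6)

# 2. Softmax turns the scores into probabilities, e.g. 0.2 0.1 0.5 0.1 0.1 ...
weights = torch.softmax(scores, dim=-1)

# 3. Context vector = weighted sum over the source representations.
context = torch.bmm(weights, encoder_states)   # (1, 1, 128)

# 4. The context vector (together with the decoder state) is then used to
#    predict the next target word ("fährt").
```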
Empirical Evaluation of Attention
[Figure: translation quality with attention vs. without attention]

Neural Machine Translation by Luong, Cho and Manning


Attention Mechanism - Visual
Aligning words
If we plot the source scores for each target word, we can see what each target
word is aligned to.

Note the reordering in this particular example.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation
by Jointly Learning to Align and Translate. ICLR'15
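For illustration, attention weights collected during decoding can be plotted as a heatmap like the figures in Bahdanau et al.; the weight matrix below is random stand-in data rather than output from a trained model.

```python
import torch
import matplotlib.pyplot as plt

src_words = ["John", "is", "driving", "a", "car", "-"]
tgt_words = ["John", "fährt", "ein", "Auto", "-"]

# Stand-in attention matrix: one row of source weights per target word.
attention = torch.softmax(torch.randn(len(tgt_words), len(src_words)), dim=-1)

plt.imshow(attention, cmap="gray")
plt.xticks(range(len(src_words)), src_words, rotation=45)
plt.yticks(range(len(tgt_words)), tgt_words)
plt.xlabel("source words")
plt.ylabel("target words")
plt.show()
```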
Applications of Sequence to Sequence Models
• Summarization
• Dialog-based systems (chatbots)
• Speech recognition
• Image captioning
• Machine translation
• ...
Image Captioning

https://towardsdatascience.com/image-captioning-in-deep-learning-9cd23fb4d8d2
Summary
• A bilingual LSTM translates from one language to another language
• The Encoder-Decoder model helps us encode a sequence into a summary vector
and use that vector to decode another sequence
• The attention mechanism learns a soft alignment between source and target words
• OpenNMT toolkit: http://opennmt.net/
