
RNN: Sequence to Sequence

Pawan Goyal

CSE, IIT Kharagpur

Feb 15th, 2022

What we have seen?

Input sequence to a fixed-sized vector
Input sequence to an output sequence of the same length

Any other constraint?

Mapping an input sequence to an output sequence, not necessarily of the same length
  machine translation, question answering, chatbots, summarization, ...
What is machine translation?

Machine Translation (MT) is the task of translating a sentence x from one
language (the source language) to a sentence y in another language (the
target language).
[Handwritten board sketch: an encoder-decoder translating a French source sentence into target words y_1 ... y_m; the annotation on training appears to say that, with teacher forcing, the decoder is fed the actual output from the previous step rather than its own argmax prediction.]

Sequence-to-sequence architecture

Also known as the encoder-decoder architecture

Input sequence X = (x^(1), ..., x^(n_x))
Output sequence Y = (y^(1), ..., y^(n_y))
Encoder (reader/input) RNN: emits the context C, a vector summarizing the input sequence, usually as a simple function of its final hidden state
Decoder (writer/output) RNN: is conditioned on the context C to generate the output sequence

What is the innovation?

The lengths n_x and n_y can vary from each other
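The slides stay at this architectural level; as a concrete illustration, a minimal sketch of such an encoder-decoder in PyTorch might look as follows. The framework is an assumption (the lecture does not prescribe one), and the class names, dimensions, and the choice of a GRU are likewise illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sequence and emits the context C
    (here: the final hidden state of a GRU)."""
    def __init__(self, src_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                 # src: (batch, n_x) token ids
        _, h_n = self.rnn(self.embed(src))  # h_n: (1, batch, hid_dim)
        return h_n                          # context C

class Decoder(nn.Module):
    """Generates the target sequence conditioned on the context C."""
    def __init__(self, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, tgt, context):        # tgt: (batch, n_y) token ids
        out, _ = self.rnn(self.embed(tgt), context)  # condition on C
        return self.out(out)                # logits: (batch, n_y, tgt_vocab)
```

Because the decoder sees the source only through the fixed-size context C, the input length n_x and the output length n_y are free to differ, which is exactly the innovation the slide highlights.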
Sequence to Sequence Models for Machine Translation

[Slide figure not recoverable from the extraction; the handwritten annotation mentions test time.]
Sequence to sequence is versatile

Many NLP tasks can be phrased as sequence-to-sequence

Summarization (long text → short text)
Dialogue (previous utterances → next utterance)
Neural Machine Translation (NMT)

An example of a Conditional Language Model
Language Model because the decoder is predicting the next word of the target sentence y
Conditional because the predictions are also conditioned on the source sentence x

NMT directly calculates P(y|x):

P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid y_1, x) \cdots P(y_T \mid y_1, \ldots, y_{T-1}, x)

How to train an NMT system?

Get a big parallel corpus ...
Training an NMT system

[Slide figure not recoverable from the extraction.]
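The figure is missing, but one standard way to train such a system on a parallel corpus, consistent with the earlier board note about feeding the decoder the actual output from the previous step (teacher forcing), can be sketched roughly as below. It reuses the hypothetical Encoder/Decoder classes sketched earlier and is an illustration, not the exact setup from the lecture.

```python
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src, tgt, pad_idx=0):
    """One teacher-forced training step on a parallel batch.
    src: (batch, n_x) source ids; tgt: (batch, n_y) target ids,
    assumed to start with <START> and end with <END>."""
    optimizer.zero_grad()
    context = encoder(src)                     # context C from the encoder
    # Teacher forcing: the decoder input is the gold target shifted right,
    # and the prediction target is the gold target shifted left.
    logits = decoder(tgt[:, :-1], context)     # (batch, n_y-1, V)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten the time dimension
        tgt[:, 1:].reshape(-1),                # next-word targets
        ignore_index=pad_idx)                  # do not penalize padding
    loss.backward()                            # backprop through decoder and encoder, end-to-end
    optimizer.step()
    return loss.item()
```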
Greedy decoding

One possibility is to generate (decode) the target sentence by taking the argmax of the output probability distribution on each step of the decoder (as opposed to sampling from that distribution).

This is greedy decoding

Problems with this method?
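Before looking at its problems, here is a rough sketch of that argmax loop, again reusing the hypothetical Decoder above; start_id and end_id are assumed ids of the <START> and <END> tokens.

```python
import torch

def greedy_decode(decoder, context, start_id, end_id, max_len=50):
    """Pick the single most probable word at every step and feed it back in."""
    ys = [start_id]
    hidden = context
    for _ in range(max_len):
        inp = torch.tensor([[ys[-1]]])                 # last generated token, shape (1, 1)
        out, hidden = decoder.rnn(decoder.embed(inp), hidden)
        next_id = decoder.out(out[:, -1]).argmax(dim=-1).item()  # argmax = greedy
        ys.append(next_id)
        if next_id == end_id:                          # stop at <END>
            break
    return ys
```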
Problems with greedy decoding

Greedy decoding has no way to undo decisions!

How to fix this?

Exhaustive search decoding

Ideally we want to find a (length T) translation y that maximizes

P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid y_1, x)\, P(y_3 \mid y_2, y_1, x) \cdots P(y_T \mid y_1, \ldots, y_{T-1}, x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)   (the NMT formulation)

We could try to compute all possible sequences y exhaustively
This means that on each step t of the decoder, we are tracking V^t possible partial translations, where V is the vocab size
This O(V^T) complexity is far too expensive!

[Handwritten note: keep track of the k most probable sequences till this point.]
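To make the O(V^T) cost concrete (illustrative numbers, not from the slides): with a vocabulary of V = 50,000 and a 20-word translation, exhaustive search would have to score 50,000^20 ≈ 10^94 candidate sequences. The handwritten remark already points at the practical fix: keep only the k most probable partial sequences at each step, which is beam search.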
Beam search decoding

Core Idea
On each step of the decoder, keep track of the k most probable partial translations (hypotheses)
k is the beam size (in practice around 5 to 10)

A hypothesis y_1, ..., y_t has a score, which is its log probability:

score(y_1, \ldots, y_t) = \log P_{LM}(y_1, \ldots, y_t \mid x) = \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \ldots, y_{i-1}, x)

Scores are all negative; higher is better
We search for high-scoring hypotheses, tracking the top k on each step
Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search!
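A hedged sketch of one beam-search step, again reusing the hypothetical Decoder above: it keeps the k highest-scoring partial translations, where the score is the running sum of log-probabilities from the slide's formula.

```python
import torch

def beam_search_step(decoder, hypotheses, k):
    """Expand each current hypothesis and keep the k best overall.
    A hypothesis is (token_ids, hidden_state, score), with score the
    running sum of per-step log-probabilities."""
    candidates = []
    for ys, hidden, score in hypotheses:
        inp = torch.tensor([[ys[-1]]])                     # last token of this hypothesis
        out, new_hidden = decoder.rnn(decoder.embed(inp), hidden)
        log_probs = torch.log_softmax(decoder.out(out[:, -1]), dim=-1).squeeze(0)
        # Only the top-k continuations of each hypothesis can survive pruning.
        top_lp, top_id = log_probs.topk(k)
        for lp, idx in zip(top_lp.tolist(), top_id.tolist()):
            candidates.append((ys + [idx], new_hidden, score + lp))
    # Keep the k most probable partial translations overall.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:k]
```

Starting from the single hypothesis ([<START>], C, 0.0), this step is repeated until the stopping criterion discussed below.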
Beam search decoding: example

Beam size = k = 2. Blue numbers = score(y_1, \ldots, y_t) = \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \ldots, y_{i-1}, x)

[A sequence of slides works through the example step by step, expanding the search tree from <START> and keeping the two highest-scoring partial hypotheses at each step; the final slide compares the beam-search result with the greedy decode. The tree diagrams themselves are not recoverable from the extraction.]
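The diagrams are gone, but the bookkeeping is easy to illustrate with invented numbers (these are not the values from the slides). With k = 2, suppose the two best first words after <START> are "he" with log-probability -0.7 and "I" with -0.9; those are the two hypotheses kept. Expanding each over the whole vocabulary, the best two-word candidates might be "he hit" with score -0.7 + (-0.6) = -1.3 and "I was" with -0.9 + (-0.5) = -1.4, so every other candidate is pruned. Scores are sums of log-probabilities, hence negative, and higher (closer to 0) is better, exactly as in the formula above; a continuation that greedy decoding would have committed to early can still lose out later, which is how beam search sidesteps greedy decoding's irrevocable choices.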
Beam Search Decoding: Stopping Criterion

In greedy decoding, usually we decode until the model produces an <END> token
  Example: <START> he hit me with a pie <END>

In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
  When a hypothesis produces <END>, that hypothesis is complete
  Place it aside and continue exploring other hypotheses via beam search

Usually we continue beam search until:
  We reach timestep T (some pre-defined cutoff), or
  We have at least n completed hypotheses (some pre-defined cutoff)
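Continuing the earlier sketch, this stopping criterion can be wired in as an outer loop; max_steps plays the role of the cutoff T and min_completed the role of n (both names are illustrative).

```python
def beam_search(decoder, context, start_id, end_id, k=5,
                max_steps=50, min_completed=5):
    """Run beam_search_step until the cutoff T (max_steps) is reached or
    at least n (min_completed) hypotheses have produced <END>."""
    hypotheses = [([start_id], context, 0.0)]
    completed = []
    for _ in range(max_steps):                       # pre-defined cutoff T
        hypotheses = beam_search_step(decoder, hypotheses, k)
        still_open = []
        for ys, hidden, score in hypotheses:
            if ys[-1] == end_id:                     # hypothesis is complete:
                completed.append((ys, score))        # set it aside ...
            else:
                still_open.append((ys, hidden, score))
        hypotheses = still_open                      # ... and keep exploring the rest
        if len(completed) >= min_completed or not hypotheses:
            break
    return completed or [(ys, score) for ys, _, score in hypotheses]
```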
Beam search decoding: finishing up

We have our list of completed hypotheses
How to select the top one with the highest score?

Each hypothesis y_1, ..., y_t has a score:

score(y_1, \ldots, y_t) = \log P_{LM}(y_1, \ldots, y_t \mid x) = \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \ldots, y_{i-1}, x)

Problem with this

Longer hypotheses have lower scores.
Fix: normalize the score by length, and use the normalized score to select the top one instead:

\frac{1}{t} \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \ldots, y_{i-1}, x)
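In the sketched code, this final selection is a one-liner over the completed (token_ids, score) pairs returned above.

```python
def best_hypothesis(completed):
    """Pick the completed hypothesis with the highest length-normalized score,
    i.e. (1/t) * sum of per-step log-probabilities, so that longer
    translations are not unfairly penalized."""
    return max(completed, key=lambda h: h[1] / len(h[0]))
```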
Alternative for solving NLP problems

[Handwritten closing notes, only partially recoverable: use pretrained neural representations (RNNs — GRUs, LSTMs) as the encoder and generate labels from them; greedy vs. beam decoding; BERT. Course announcement: Class Test 1 (CT-1), 4:30-5:30 PM, 22nd Feb; syllabus: whatever has been covered (see Moodle).]