
RNN: Sequence to Sequence

Pawan Goyal

CSE, IIT Kharagpur

Feb 15th, 2022

What we have seen?

Input sequence to a fixed-sized vector
Input sequence to an output sequence of the same length

Any other constraint?

Mapping an input sequence to an output sequence, not necessarily of the same length
  machine translation, question answering, chatbots, summarization, ...
What is machine translation?

Machine Translation (MT) is the task of translating a sentence x from one
language (the source language) to a sentence y in another language (the
target language).
[Handwritten board sketch: an encoder-decoder translating a French source sentence into target words y_1 ... y_m; the annotation on training appears to say that, with teacher forcing, the decoder is fed the actual output from the previous step rather than its own argmax prediction.]

Sequence-to-sequence architecture

Also known as the encoder-decoder architecture

Input sequence X = (x^(1), ..., x^(n_x))
Output sequence Y = (y^(1), ..., y^(n_y))
Encoder (reader/input) RNN: emits the context C, a vector summarizing the input sequence, usually as a simple function of its final hidden state
Decoder (writer/output) RNN: is conditioned on the context C to generate the output sequence

What is the innovation?

The lengths n_x and n_y can vary from each other
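The slides stay at this architectural level; as a concrete illustration, a minimal sketch of such an encoder-decoder in PyTorch might look as follows. The framework is an assumption (the lecture does not prescribe one), and the class names, dimensions, and the choice of a GRU are likewise illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sequence and emits the context C
    (here: the final hidden state of a GRU)."""
    def __init__(self, src_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                 # src: (batch, n_x) token ids
        _, h_n = self.rnn(self.embed(src))  # h_n: (1, batch, hid_dim)
        return h_n                          # context C

class Decoder(nn.Module):
    """Generates the target sequence conditioned on the context C."""
    def __init__(self, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, tgt, context):        # tgt: (batch, n_y) token ids
        out, _ = self.rnn(self.embed(tgt), context)  # condition on C
        return self.out(out)                # logits: (batch, n_y, tgt_vocab)
```

Because the decoder sees the source only through the fixed-size context C, the input length n_x and the output length n_y are free to differ, which is exactly the innovation the slide highlights.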
Sequence to Sequence Models for Machine Translation

[Slide figure not recoverable from the extraction; the handwritten annotation mentions test time.]
Sequence to sequence is versatile

Many NLP tasks can be phrased as sequence-to-sequence

Summarization (long text → short text)
Dialogue (previous utterances → next utterance)
Neural Machine Translation (NMT)

An example of a Conditional Language Model
Language Model because the decoder is predicting the next word of the target sentence y
Conditional because the predictions are also conditioned on the source sentence x

NMT directly calculates P(y|x):

P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid y_1, x) \cdots P(y_T \mid y_1, \ldots, y_{T-1}, x)

How to train an NMT system?

Get a big parallel corpus ...
Training an NMT system

[Slide figure not recoverable from the extraction.]
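The figure is missing, but one standard way to train such a system on a parallel corpus, consistent with the earlier board note about feeding the decoder the actual output from the previous step (teacher forcing), can be sketched roughly as below. It reuses the hypothetical Encoder/Decoder classes sketched earlier and is an illustration, not the exact setup from the lecture.

```python
import torch
import torch.nn as nn

def train_step(encoder, decoder, optimizer, src, tgt, pad_idx=0):
    """One teacher-forced training step on a parallel batch.
    src: (batch, n_x) source ids; tgt: (batch, n_y) target ids,
    assumed to start with <START> and end with <END>."""
    optimizer.zero_grad()
    context = encoder(src)                     # context C from the encoder
    # Teacher forcing: the decoder input is the gold target shifted right,
    # and the prediction target is the gold target shifted left.
    logits = decoder(tgt[:, :-1], context)     # (batch, n_y-1, V)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten the time dimension
        tgt[:, 1:].reshape(-1),                # next-word targets
        ignore_index=pad_idx)                  # do not penalize padding
    loss.backward()                            # backprop through decoder and encoder, end-to-end
    optimizer.step()
    return loss.item()
```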
Greedy decoding

One possibility is to generate (decode) the target sentence by taking the argmax of the output probability distribution on each step of the decoder (as opposed to sampling from that distribution).

This is greedy decoding

Problems with this method?
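Before looking at its problems, here is a rough sketch of that argmax loop, again reusing the hypothetical Decoder above; start_id and end_id are assumed ids of the <START> and <END> tokens.

```python
import torch

def greedy_decode(decoder, context, start_id, end_id, max_len=50):
    """Pick the single most probable word at every step and feed it back in."""
    ys = [start_id]
    hidden = context
    for _ in range(max_len):
        inp = torch.tensor([[ys[-1]]])                 # last generated token, shape (1, 1)
        out, hidden = decoder.rnn(decoder.embed(inp), hidden)
        next_id = decoder.out(out[:, -1]).argmax(dim=-1).item()  # argmax = greedy
        ys.append(next_id)
        if next_id == end_id:                          # stop at <END>
            break
    return ys
```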
Problems with greedy decoding

Greedy decoding has no way to undo decisions!

How to fix this?

Exhaustive search decoding

Ideally we want to find a (length T) translation y that maximizes

P(y \mid x) = P(y_1 \mid x)\, P(y_2 \mid y_1, x)\, P(y_3 \mid y_2, y_1, x) \cdots P(y_T \mid y_1, \ldots, y_{T-1}, x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)   (the NMT formulation)

We could try to compute all possible sequences y exhaustively
This means that on each step t of the decoder, we are tracking V^t possible partial translations, where V is the vocab size
This O(V^T) complexity is far too expensive!

[Handwritten note: keep track of the k most probable sequences till this point.]
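To make the O(V^T) cost concrete (illustrative numbers, not from the slides): with a vocabulary of V = 50,000 and a 20-word translation, exhaustive search would have to score 50,000^20 ≈ 10^94 candidate sequences. The handwritten remark already points at the practical fix: keep only the k most probable partial sequences at each step, which is beam search.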
Beam search decoding

Core Idea
On each step of the decoder, keep track of the k most probable partial translations (hypotheses)
k is the beam size (in practice around 5 to 10)

A hypothesis y_1, ..., y_t has a score, which is its log probability:

score(y_1, \ldots, y_t) = \log P_{LM}(y_1, \ldots, y_t \mid x) = \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \ldots, y_{i-1}, x)

Scores are all negative; higher is better
We search for high-scoring hypotheses, tracking the top k on each step
Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search!
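A hedged sketch of one beam-search step, again reusing the hypothetical Decoder above: it keeps the k highest-scoring partial translations, where the score is the running sum of log-probabilities from the slide's formula.

```python
import torch

def beam_search_step(decoder, hypotheses, k):
    """Expand each current hypothesis and keep the k best overall.
    A hypothesis is (token_ids, hidden_state, score), with score the
    running sum of per-step log-probabilities."""
    candidates = []
    for ys, hidden, score in hypotheses:
        inp = torch.tensor([[ys[-1]]])                     # last token of this hypothesis
        out, new_hidden = decoder.rnn(decoder.embed(inp), hidden)
        log_probs = torch.log_softmax(decoder.out(out[:, -1]), dim=-1).squeeze(0)
        # Only the top-k continuations of each hypothesis can survive pruning.
        top_lp, top_id = log_probs.topk(k)
        for lp, idx in zip(top_lp.tolist(), top_id.tolist()):
            candidates.append((ys + [idx], new_hidden, score + lp))
    # Keep the k most probable partial translations overall.
    candidates.sort(key=lambda c: c[2], reverse=True)
    return candidates[:k]
```

Starting from the single hypothesis ([<START>], C, 0.0), this step is repeated until the stopping criterion discussed below.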
Beam search decoding: example

Beam size = k = 2. Blue numbers = score(y_1, \ldots, y_t) = \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \ldots, y_{i-1}, x)

[A sequence of slides works through the example step by step, expanding the search tree from <START> and keeping the two highest-scoring partial hypotheses at each step; the final slide compares the beam-search result with the greedy decode. The tree diagrams themselves are not recoverable from the extraction.]
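The diagrams are gone, but the bookkeeping is easy to illustrate with invented numbers (these are not the values from the slides). With k = 2, suppose the two best first words after <START> are "he" with log-probability -0.7 and "I" with -0.9; those are the two hypotheses kept. Expanding each over the whole vocabulary, the best two-word candidates might be "he hit" with score -0.7 + (-0.6) = -1.3 and "I was" with -0.9 + (-0.5) = -1.4, so every other candidate is pruned. Scores are sums of log-probabilities, hence negative, and higher (closer to 0) is better, exactly as in the formula above; a continuation that greedy decoding would have committed to early can still lose out later, which is how beam search sidesteps greedy decoding's irrevocable choices.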
Beam Search Decoding: Stopping Criterion

In greedy decoding, usually we decode until the model produces an <END> token
  Example: <START> he hit me with a pie <END>

In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
  When a hypothesis produces <END>, that hypothesis is complete
  Place it aside and continue exploring other hypotheses via beam search

Usually we continue beam search until:
  We reach timestep T (some pre-defined cutoff), or
  We have at least n completed hypotheses (some pre-defined cutoff)
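Continuing the earlier sketch, this stopping criterion can be wired in as an outer loop; max_steps plays the role of the cutoff T and min_completed the role of n (both names are illustrative).

```python
def beam_search(decoder, context, start_id, end_id, k=5,
                max_steps=50, min_completed=5):
    """Run beam_search_step until the cutoff T (max_steps) is reached or
    at least n (min_completed) hypotheses have produced <END>."""
    hypotheses = [([start_id], context, 0.0)]
    completed = []
    for _ in range(max_steps):                       # pre-defined cutoff T
        hypotheses = beam_search_step(decoder, hypotheses, k)
        still_open = []
        for ys, hidden, score in hypotheses:
            if ys[-1] == end_id:                     # hypothesis is complete:
                completed.append((ys, score))        # set it aside ...
            else:
                still_open.append((ys, hidden, score))
        hypotheses = still_open                      # ... and keep exploring the rest
        if len(completed) >= min_completed or not hypotheses:
            break
    return completed or [(ys, score) for ys, _, score in hypotheses]
```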
Beam search decoding: finishing up

We have our list of completed hypotheses
How to select the top one with the highest score?

Each hypothesis y_1, ..., y_t has a score:

score(y_1, \ldots, y_t) = \log P_{LM}(y_1, \ldots, y_t \mid x) = \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \ldots, y_{i-1}, x)

Problem with this

Longer hypotheses have lower scores.
Fix: normalize the score by length, and use the normalized score to select the top one instead:

\frac{1}{t} \sum_{i=1}^{t} \log P_{LM}(y_i \mid y_1, \ldots, y_{i-1}, x)
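In the sketched code, this final selection is a one-liner over the completed (token_ids, score) pairs returned above.

```python
def best_hypothesis(completed):
    """Pick the completed hypothesis with the highest length-normalized score,
    i.e. (1/t) * sum of per-step log-probabilities, so that longer
    translations are not unfairly penalized."""
    return max(completed, key=lambda h: h[1] / len(h[0]))
```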
Alternative for solving NLP problems

[Handwritten closing notes, only partially recoverable: use pretrained neural representations (RNNs — GRUs, LSTMs) as the encoder and generate labels from them; greedy vs. beam decoding; BERT. Course announcement: Class Test 1 (CT-1), 4:30-5:30 PM, 22nd Feb; syllabus: whatever has been covered (see Moodle).]