Lecture 10: Machine Translation, Sequence-to-sequence and Attention
Abigail See
(Course instructors: Christopher Manning and Richard Socher)
Announcements
• Honor code issues: Assignment 2
• Assignment 3 released
Source: http://aiweirdness.com/post/170820844947/more-candy-hearts-by-neural-network
Welcome to the second half of the course!
• Remaining lectures are mostly geared towards projects
• However: today’s lecture will cover two core NLP Deep Learning
techniques
Overview
Today we will:
• introduce a new task: Machine Translation
• introduce a new neural architecture: sequence-to-sequence
• ... which is improved by a new neural technique: attention
1950s: Early Machine Translation
Machine Translation research
began in the early 1950s.
Source: https://youtu.be/K-HfpsHPmvw
1990s-2010s: Statistical Machine Translation
• Core idea: Learn a probabilistic model from data
• Suppose we’re translating French → English.
• We want to find the best English sentence y, given French sentence x: argmax_y P(y|x)
• Use Bayes' rule to break this into two components to be learnt separately: argmax_y P(x|y) P(y)
• Translation Model P(x|y): models how words and phrases should be translated. Learnt from parallel data.
• Language Model P(y): models how to write good English. Learnt from monolingual data.
1990s-2010s: Statistical Machine Translation
• Question: How to learn translation model ?
• First, need large amount of parallel data
(e.g. pairs of human-translated French/English sentences)
[Image: the Rosetta Stone, with parallel inscriptions in Demotic and Ancient Greek]
What is alignment?
• Alignment is the correspondence between particular words in the translated sentence pair.
• We can factor the translation model P(f | e) by identifying alignments (correspondences) between words in f and words in e.
• Note: some words have no counterpart (so-called spurious words).
[Figure: word alignment between "Le Japon secoué par deux nouveaux séismes" and "Japan shaken by two new quakes"; the article "Le" has no English counterpart]
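A standard way to write this factorization, sketched here as an assumption since the slide does not reproduce the formula, is to treat the alignment a as a latent variable and sum it out:

P(f | e) = Σ_a P(f, a | e)

The SMT translation model is then learnt by estimating these alignment probabilities from parallel data.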
Alignment is complex
• Alignment can be one-to-many (such words are called "fertile" words)
• Some words are not translated at all (zero-fertility words)
[Figure: alignment between "Le programme a été mis en application" and "And the program has been implemented"; "implemented" aligns one-to-many to "mis en application", while "And" is a zero-fertility word with no French counterpart]
Alignment is complex
• Alignment can be many-to-one
[Figure: many-to-one alignment between "Le reste appartenait aux autochtones" and "The balance was the territory of the aboriginal people"]
Alignment is complex
• Alignment can be many-to-many (phrase-level)
[Figure: phrase alignment between "Les pauvres sont démunis" and "The poor don't have any money"; the phrase "sont démunis" aligns to "don't have any money"]
1990s-2010s: Statistical Machine Translation
• We want the translation y that maximizes argmax_y P(x|y) P(y), i.e. Translation Model × Language Model
• Question: How to compute this argmax?
• Answer: use a heuristic search process to find the best translation; this process is called decoding.
Searching for the best translation (decoding)
[Figure "Translation Options": each word of the German source "er geht ja nicht nach hause" has many candidate translations (e.g. "geht": "is", "goes", "go"; "nicht": "not", "do not"); from "Chapter 6: Decoding"]
[Figure "Decoding: Find Best Path": search through a lattice of partial hypotheses built from these options, then backtrack from the highest-scoring complete hypothesis to read off the translation "he does not go home"]
1990s-2010s: Statistical Machine Translation
• SMT is a huge research field
• The best systems are extremely complex
• Hundreds of important details we haven’t mentioned here
• Systems have many separately-designed subcomponents
• Lots of feature engineering
• Need to design features to capture particular language phenomena
• Require compiling and maintaining extra resources
• Like tables of equivalent phrases
• Lots of human effort to maintain
• Repeated effort for each language pair!
2014: Neural Machine Translation
[Image: a huge wave labelled "Neural Machine Translation" about to crash over "MT research" (dramatic reenactment)]
What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine
Translation with a single neural network
Neural Machine Translation (NMT)
The sequence-to-sequence model
[Figure: the Encoder RNN reads the source sentence "les pauvres sont démunis"; the encoding of the source sentence (the encoder's final hidden state) provides the initial hidden state for the Decoder RNN, which generates the target sentence (output) "the poor don't have any money <END>" one word at a time, starting from <START> and taking the argmax of the output distribution at each step]
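To make the picture concrete, here is a minimal sequence-to-sequence sketch. The use of PyTorch, the GRU cells, the layer sizes, and the greedy decoding loop are illustrative assumptions rather than the lecture's reference implementation.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode the source sentence; the final hidden state is the
        # "encoding of the source sentence".
        _, enc = self.encoder(self.src_emb(src))
        # The encoding provides the initial hidden state for the decoder RNN,
        # which is run over the (shifted) target sentence during training.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), enc)
        return self.out(dec_out)  # scores over the target vocabulary

    @torch.no_grad()
    def greedy_decode(self, src, start_id, end_id, max_len=50):
        # Greedy decoding: take the argmax word on each step and feed it back in.
        _, hidden = self.encoder(self.src_emb(src))
        word = torch.tensor([[start_id]])
        output = []
        for _ in range(max_len):
            dec_out, hidden = self.decoder(self.tgt_emb(word), hidden)
            word = self.out(dec_out).argmax(dim=-1)
            if word.item() == end_id:
                break
            output.append(word.item())
        return output
```

In the usual training setup (an assumption here, not stated on this slide) the decoder is fed the gold target words and trained with cross-entropy loss; greedy_decode shows generation at test time.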
Neural Machine Translation (NMT)
• The sequence-to-sequence model is an example of a
Conditional Language Model.
• Language Model because the decoder is predicting the
next word of the target sentence y
• Conditional because its predictions are also conditioned on the source
sentence x
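Written out, this is the standard conditional language model factorization (the exact equation is not reproduced on the slide, so treat this as a sketch):

P(y | x) = P(y_1 | x) · P(y_2 | y_1, x) · … · P(y_T | y_1, …, y_{T-1}, x)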
Better-than-greedy decoding?
• Greedy decoding (taking the argmax, i.e. the single most probable word, on each step of the decoder) has no way to undo decisions!
• les pauvres sont démunis (the poor don’t have any money)
• → the ____
• → the poor ____
• → the poor are ____
Beam search decoding
• Ideally we want to find the y that maximizes P(y|x), but enumerating every possible y is far too expensive
• Beam search: on each step of the decoder, keep track of the k most probable partial translations (k is the beam size)
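Each partial translation (hypothesis) is scored by its accumulated log probability; the exact scoring formula is not shown on the slide, so the following is an assumption consistent with it, and the beam search sketch after the example below relies on it:

score(y_1, …, y_t) = log P(y_1, …, y_t | x) = Σ_{i=1..t} log P(y_i | y_1, …, y_{i-1}, x)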
Beam search decoding: example
Beam size = 2
[Figure: the search tree grown step by step from <START>. On each step the two most probable partial translations are kept and expanded with candidate next words (the tree contains "the"/"a", then "poor"/"people"/"person", then "are"/"don't"/"but", then "always"/"not"/"have"/"take", then "in"/"with"/"any"/"enough", then "money"/"funds"). Backtracking from the highest-scoring complete hypothesis gives "the poor don't have any money".]
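A minimal sketch of the beam search loop illustrated above. The step function next_word_logprobs is a hypothetical stand-in for one step of the decoder RNN: given the words generated so far, it returns candidate next words with their log probabilities.

```python
def beam_search(next_word_logprobs, start, end, beam_size=2, max_len=20):
    # Each hypothesis is (accumulated log probability, list of words so far).
    beam = [(0.0, [start])]
    completed = []
    for _ in range(max_len):
        candidates = []
        for logp, words in beam:
            # Expand each hypothesis with every candidate next word.
            for word, word_logp in next_word_logprobs(words):
                candidates.append((logp + word_logp, words + [word]))
        # Keep only the k most probable partial translations.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for logp, words in candidates[:beam_size]:
            if words[-1] == end:
                completed.append((logp, words))  # finished hypothesis
            else:
                beam.append((logp, words))
        if not beam:
            break
    completed.extend(beam)  # fall back to unfinished hypotheses if needed
    return max(completed, key=lambda c: c[0])[1]
```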
Advantages of NMT
Compared to SMT, NMT has many advantages:
• Better performance
• More fluent
• Better use of context
• Better use of phrase similarities
How do we evaluate Machine Translation?
BLEU (Bilingual Evaluation Understudy)
• BLEU compares the machine-written translation to one or several human-written reference translations, and computes a similarity score based on n-gram precision plus a penalty for translations that are too short
[Chart: BLEU scores of MT systems from 2013 to 2016, with neural MT rapidly overtaking statistical MT. Source: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]
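A simplified BLEU sketch, assuming a single reference translation, n-grams up to length 4, and no smoothing; the real metric handles multiple references and several other details.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped matches: each reference n-gram is credited at most
        # as many times as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * geo_mean

# e.g. bleu("the poor don't have any money".split(),
#           "the poor don't have any money".split()) == 1.0
```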
NMT: the biggest success story of NLP Deep Learning
• This is amazing!
• SMT systems, built by hundreds of engineers over many
years, outperformed by NMT systems trained by a handful of
engineers in a few months
So is Machine Translation solved?
• Nope!
• Many difficulties remain:
• Out-of-vocabulary words
• Domain mismatch between train and test data
• Maintaining context over longer text
• Low-resource language pairs
So is Machine Translation solved?
• Nope!
• Using common sense is still hard
So is Machine Translation solved?
• Nope!
• NMT picks up biases in training data
Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c
So is Machine Translation solved?
• Nope!
• Uninterpretable systems do strange things
Source: http://languagelog.ldc.upenn.edu/nll/?p=35120#more-35120
NMT research continues
NMT is the flagship task for NLP Deep Learning
ATTENTION
Sequence-to-sequence: the bottleneck problem
[Figure: the Encoder RNN compresses the source sentence "les pauvres sont démunis" into a single encoding, from which the Decoder RNN must generate the whole target sentence (output) "the poor don't have any money <END>"]
• The encoding of the source sentence needs to capture all information about the source sentence
• Information bottleneck!
Attention
• Attention provides a solution to the bottleneck problem.
• First we will show via diagram (no equations), then we will show
with equations
Sequence-to-sequence with attention
[Figure: sequence-to-sequence with attention, unrolled over decoder steps while translating "les pauvres sont démunis" into "the poor don't have any money". On each step of the decoder, we take the dot product of the current decoder hidden state with each encoder hidden state to get the attention scores for that step]
• We take softmax of the scores to get the attention distribution for this step (this is a probability distribution and sums to 1)
• We use the attention distribution to take a weighted sum of the encoder hidden states to get the attention output
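A minimal dot-product attention sketch for a single decoder step; the PyTorch framework and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(dec_state, enc_states):
    # dec_state: (hidden,) current decoder hidden state
    # enc_states: (src_len, hidden) all encoder hidden states
    scores = enc_states @ dec_state          # attention scores, one per source word
    attn_dist = F.softmax(scores, dim=0)     # attention distribution (sums to 1)
    attn_output = attn_dist @ enc_states     # weighted sum of encoder hidden states
    return attn_output, attn_dist
```

In the full model the attention output is combined with the decoder hidden state before predicting the next word; treating concatenation as that combination is an assumption here.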
Attention is great
• Attention significantly improves NMT performance
• It's very useful to allow the decoder to focus on certain parts of the source
• Attention solves the bottleneck problem
• Attention allows the decoder to look directly at the source; bypass the bottleneck
• Attention helps with the vanishing gradient problem
• Provides a shortcut to faraway states
• Attention provides some interpretability
• By inspecting the attention distribution, we can see what the decoder was focusing on
• We get alignment for free!
[Figure: the attention distribution resembles the many-to-many phrase alignment between "Les pauvres sont démunis" and "The poor don't have any money"]
Recap
• We learned the history of Machine Translation (MT)
• Sequence-to-sequence is the architecture for NMT (uses 2 RNNs)
• Attention is a way to focus on particular parts of the input; it improves sequence-to-sequence
Next time
• More types of attention