
Neural Language Model, RNNs. Pawan Goyal (IIT Kharagpur), February 8th, 2022.

Word Embedding → RNN (Recurrent NN)

- The number of parameters is constant irrespective of the size of the context window.
- There is symmetry in how the weights are processed: the same weights are applied at every position.
Forward Propagation

a(t) = b + W h(t-1) + U x(t)
h(t) = tanh(a(t))
o(t) = c + V h(t)
ŷ(t) = softmax(o(t))

This maps an input sequence to an output sequence of the same length. Training uses a cross-entropy loss over ŷ(t); how the parameters are updated is covered next.
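As a concrete illustration, here is a minimal NumPy sketch of these equations; the sizes and random parameters are assumptions for demonstration only.

```python
# Minimal sketch of one RNN forward pass; sizes and random weights are
# hypothetical, chosen only to make the slide's equations runnable.
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, V_size = 16, 8, 100          # hidden, input, vocabulary sizes

b = np.zeros(d_h); c = np.zeros(V_size)
W = rng.normal(0, 0.1, (d_h, d_h))     # hidden-to-hidden
U = rng.normal(0, 0.1, (d_h, d_x))     # input-to-hidden
V = rng.normal(0, 0.1, (V_size, d_h))  # hidden-to-output

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    a_t = b + W @ h_prev + U @ x_t     # a(t) = b + W h(t-1) + U x(t)
    h_t = np.tanh(a_t)                 # h(t) = tanh(a(t))
    o_t = c + V @ h_t                  # o(t) = c + V h(t)
    return h_t, softmax(o_t)           # y_hat(t) = softmax(o(t))

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):  # a dummy 5-step input sequence
    h, y_hat = rnn_step(h, x_t)
```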
Loss Function

Total loss is the sum of the losses over all the time steps.
So, if L(t) is the negative log-likelihood of y(t) given x(1), . . . , x(t), then

L({x(1), . . . , x(t)}, {y(1), . . . , y(t)}) = Σ_t L(t) = −Σ_t log p_model(y(t) | {x(1), . . . , x(t)})

where p_model(y(t) | {x(1), . . . , x(t)}) is given by reading the entry for y(t) from the model's output vector ŷ(t).
Backpropagation proceeds right to left through the unrolled network: backpropagation through time (BPTT).
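A short sketch of this loss in Python, assuming y_hats holds the per-step softmax outputs ŷ(t) and targets the gold word indices y(t) (both names are illustrative, not from the slides):

```python
import numpy as np

# L = sum_t L(t) = -sum_t log p_model(y(t) | x(1..t)); each probability is
# read off the gold entry of the model's output vector y_hat(t).
# y_hats and targets are assumed inputs, not defined in the slides.
def sequence_nll(y_hats, targets):
    return -sum(np.log(y_hat[y_t]) for y_hat, y_t in zip(y_hats, targets))
```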
Recurrent Neural Networks

[Figure: an unrolled RNN producing a prediction at each time step.]
Example RNN

[Figure: an example RNN language model; at each step the hidden state is used to predict a probability distribution over the next word.]
Training an RNN language model

Compute the loss at each time step, sum over the sequence, and use backpropagation (BPTT) to update the parameters U, V, W and the biases.
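A minimal training-step sketch, assuming PyTorch (the slides do not prescribe a framework; all sizes are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative sizes; the framework choice (PyTorch) is an assumption.
vocab, d_emb, d_h = 100, 32, 64
emb = nn.Embedding(vocab, d_emb)
rnn = nn.RNN(d_emb, d_h, batch_first=True)
out = nn.Linear(d_h, vocab)                  # plays the role of V and c
params = [*emb.parameters(), *rnn.parameters(), *out.parameters()]
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()              # cross-entropy over the vocabulary

x = torch.randint(0, vocab, (1, 20))         # dummy token ids
inp, tgt = x[:, :-1], x[:, 1:]               # predict each next word
h, _ = rnn(emb(inp))
loss = loss_fn(out(h).reshape(-1, vocab), tgt.reshape(-1))
loss.backward()                              # BPTT: gradients flow right to left
opt.step()
```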
Generating text with an RNN Language Model

Sample a word from ŷ(t), feed it back in as the next input, and repeat, until the model generates the end-of-sequence token <EOS>.
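A sampling-loop sketch under stated assumptions: rnn_step and d_h come from the forward-propagation sketch above, while embed, BOS_ID, and EOS_ID are hypothetical helpers for mapping tokens to input vectors.

```python
import numpy as np

# embed, BOS_ID, EOS_ID are hypothetical helpers, not defined in the slides.
def generate(max_len=50):
    rng = np.random.default_rng(0)
    h, tokens, tok = np.zeros(d_h), [], BOS_ID   # start-of-sequence token
    for _ in range(max_len):
        h, y_hat = rnn_step(h, embed(tok))       # one RNN step
        tok = rng.choice(len(y_hat), p=y_hat)    # sample the next word
        if tok == EOS_ID:                        # stop at end-of-sequence
            break
        tokens.append(tok)
    return tokens
```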
What can RNNs be used for?

- Sequence classification
- Sequence labeling
- Representation learning

Variants of RNNs have been state-of-the-art (SOTA) for many of these tasks.
Vanishing Gradient Problem with RNNs

Backpropagating L(t) to an early hidden state multiplies the gradient by the Jacobian ∂h(i)/∂h(i-1) once per intervening time step, and each Jacobian involves the recurrent matrix W. If the singular values of W are small (below 1), this repeated multiplication shrinks the gradient exponentially: it vanishes over long spans. (Large singular values cause the opposite problem, exploding gradients.)
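A small numeric illustration of this claim (not from the slides): repeatedly multiplying by a matrix whose singular values are below 1 shrinks a gradient-like vector exponentially.

```python
import numpy as np

# Illustration only: W is constructed so every singular value equals 0.9.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # orthogonal matrix
W = 0.9 * Q
g = rng.normal(size=16)
for _ in range(100):                # backprop through 100 time steps
    g = W.T @ g
print(np.linalg.norm(g))            # ~0.9**100 of the start: vanished
```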
Effect of vanishing gradient on RNN LM

When the gradient vanishes over long spans, the training signal from distant words is lost, so the RNN LM fails to learn long-distance dependencies and is dominated by nearby context.
How to fix vanishing gradient problem?

The main problem is that it is too difficult for the RNN to learn to preserve information over many timesteps.
In a vanilla RNN, the hidden state is constantly being rewritten:

h(t) = tanh(W h(t-1) + U x(t) + b)

How about better RNN units?
Using Gates for better RNN units

The gates are vectors, computed with a sigmoid.
On each timestep, each element of a gate can be open (1), closed (0), or somewhere in between.
The gates are dynamic: their value is computed based on the current context.

Two famous architectures: GRUs and LSTMs.
Gated Recurrent Units (GRU)

Update gate: u(t) = σ(W_u h(t-1) + U_u x(t) + b_u)
Reset gate: r(t) = σ(W_r h(t-1) + U_r x(t) + b_r)
Candidate state: h̃(t) = tanh(W_h (r(t) ∘ h(t-1)) + U_h x(t) + b_h)
New hidden state: h(t) = u(t) ∘ h̃(t) + (1 − u(t)) ∘ h(t-1)

The gates have the same dimension as h.
- If u(t) ≈ 0 in some dimension, the new candidate is ignored there and the old state is carried over unchanged (the input is effectively ignored, so information is preserved).
- If r(t) ≈ 0, the candidate is computed from the new input alone, ignoring the previous state.

This gives the network more flexibility and control: it learns, based on context, how much of the state to keep and how much to overwrite with new information.
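A one-step NumPy sketch of these GRU equations; the parameter dictionary p and all shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# p holds hypothetical weight matrices/biases keyed by name (an assumption).
def gru_step(h_prev, x_t, p):
    u = sigmoid(p["Wu"] @ h_prev + p["Uu"] @ x_t + p["bu"])  # update gate
    r = sigmoid(p["Wr"] @ h_prev + p["Ur"] @ x_t + p["br"])  # reset gate
    h_tilde = np.tanh(p["Wh"] @ (r * h_prev) + p["Uh"] @ x_t + p["bh"])
    return u * h_tilde + (1.0 - u) * h_prev  # u(t)=0 keeps the old state
```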
Long Short Term Memory (LSTM)

An LSTM maintains a separate cell state c(t) alongside the hidden state h(t). Input, forget, and output gates control what new information is written to the cell, what old content is kept, and what is exposed as the hidden state (used, e.g., to predict the next word).

[Figure: the LSTM cell, with the three gates acting on the cell state.]
The LSTM computes, at each time step:

Input gate: i(t) = σ(W_i h(t-1) + U_i x(t) + b_i)
Forget gate: f(t) = σ(W_f h(t-1) + U_f x(t) + b_f)
Output gate: o(t) = σ(W_o h(t-1) + U_o x(t) + b_o)
Candidate cell content: c̃(t) = tanh(W_c h(t-1) + U_c x(t) + b_c)
Cell state: c(t) = f(t) ∘ c(t-1) + i(t) ∘ c̃(t)
Hidden state: h(t) = o(t) ∘ tanh(c(t))
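A matching one-step NumPy sketch of the LSTM cell; again the parameter dictionary p and all shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# p holds hypothetical weight matrices/biases keyed by name (an assumption).
def lstm_step(h_prev, c_prev, x_t, p):
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x_t + p["bi"])  # input gate
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x_t + p["bf"])  # forget gate
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x_t + p["bo"])  # output gate
    c_tilde = np.tanh(p["Wc"] @ h_prev + p["Uc"] @ x_t + p["bc"])
    c = f * c_prev + i * c_tilde     # forget old content, write new content
    h = o * np.tanh(c)               # expose a gated view of the cell
    return h, c
```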
How does LSTM solve vanishing gradients?

The LSTM architecture makes it easier for the RNN to preserve information over many timesteps.
e.g., if the forget gate is set to remember everything on every timestep, then the info in the cell is preserved indefinitely.
By contrast, it is harder for a vanilla RNN to learn a recurrent weight matrix W that preserves info in the hidden state.
RNNs can be used for sentence classification

e.g., sentiment classification: run the RNN over the sentence and take the final hidden state as the representation of the sentence, then classify from that vector.
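A hedged PyTorch sketch of classifying from the final hidden state (the class name and all sizes are assumptions):

```python
import torch
import torch.nn as nn

# Illustrative sketch; SentenceClassifier and its sizes are assumptions.
class SentenceClassifier(nn.Module):
    def __init__(self, vocab=100, d_emb=32, d_h=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.RNN(d_emb, d_h, batch_first=True)
        self.clf = nn.Linear(d_h, n_classes)

    def forward(self, tokens):
        _, h_n = self.rnn(self.emb(tokens))  # h_n: final hidden state
        return self.clf(h_n[-1])             # the sentence representation

logits = SentenceClassifier()(torch.randint(0, 100, (1, 7)))
```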
Example: entity tagging with an RNN

Input: Rahul likes to play football
Output: NAME O O O O

Approach: each input word is passed through the RNN, and the hidden state at each position is used to classify that word into one of the entity classes (e.g., Name / Place / Organization, or Gene / Disease in biomedical text).
RNNs can be used to find sentence similarity

Siamese Architecture
Pass both sentences through RNNs; the parameters are shared across these (hence "Siamese").
Use (a function of) the difference between the final hidden states.
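A sketch of the Siamese setup, assuming PyTorch; the exp of the negative L1 distance is one common similarity choice (as in the Manhattan-LSTM), not one prescribed by the slides.

```python
import torch
import torch.nn as nn

# Illustrative sketch; SiameseRNN and its sizes are assumptions.
class SiameseRNN(nn.Module):
    def __init__(self, vocab=100, d_emb=32, d_h=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_emb)
        self.rnn = nn.RNN(d_emb, d_h, batch_first=True)  # shared parameters

    def encode(self, tokens):
        _, h_n = self.rnn(self.emb(tokens))              # final hidden state
        return h_n[-1]

    def forward(self, s1, s2):
        diff = torch.abs(self.encode(s1) - self.encode(s2))
        return torch.exp(-diff.sum(dim=-1))              # similarity in (0, 1]

score = SiameseRNN()(torch.randint(0, 100, (1, 5)),
                     torch.randint(0, 100, (1, 6)))
```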
RNNs can be used for tagging

e.g., part-of-speech tagging, named entity recognition: the hidden state at each position (h1, . . . , hn) is used to predict the tag for that word.
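A per-token tagging sketch (PyTorch usage and the tag count are assumptions):

```python
import torch
import torch.nn as nn

# Illustrative sizes; n_tags and the token ids are assumptions.
vocab, d_emb, d_h, n_tags = 100, 32, 64, 9
emb = nn.Embedding(vocab, d_emb)
rnn = nn.RNN(d_emb, d_h, batch_first=True)
tagger = nn.Linear(d_h, n_tags)

tokens = torch.randint(0, vocab, (1, 5))   # e.g., "Rahul likes to play football"
h, _ = rnn(emb(tokens))                    # one hidden state per position
tags = tagger(h).argmax(-1)                # one predicted tag per word
```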
RNNs for comparison mining in Reviews

Review sentences are processed bidirectionally with RNNs to detect comparative statements.
Bidirectional RNNs

What we have seen till now?

The state at time t only captures information from the past x(1), . . . , x(t-1), and the present input x(t).
Some models also allow information from past y values to affect the current state when the y values are available.

However, in many applications we want to output a prediction of y(t) which may depend on the whole input sequence, e.g., sentiment analysis.
Bidirectional RNNs: motivation for sentiment classification

[Figure: a sentiment example in which the interpretation of a word depends on context to its right as well as its left, which a left-to-right RNN cannot see.]
A bidirectional RNN runs two RNNs with separate parameters: a forward RNN over x(1), . . . , x(n) and a backward RNN over x(n), . . . , x(1). The representation at each position concatenates the two hidden states:

h(t) = [h→(t) ; h←(t)]

This concatenated state, which sees both left and right context, is then used for the task.
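In PyTorch this concatenation is what bidirectional=True produces (a sketch, with illustrative sizes):

```python
import torch
import torch.nn as nn

# Illustrative sizes; the framework choice (PyTorch) is an assumption.
rnn = nn.RNN(32, 64, batch_first=True, bidirectional=True)
h, _ = rnn(torch.randn(1, 10, 32))   # h[:, t] = [h_fwd(t) ; h_bwd(t)]
assert h.shape == (1, 10, 128)       # 2 * 64: both directions concatenated
```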

Bidirectional RNNs: simplified diagram

Bi-RNNs are only applicable if you have access to the entire input sequence; they are thus not applicable to Language Modeling, where only the left context is available.
Using Char-RNN for word embeddings

Run an LSTM over the characters c1, . . . , cm of each word; the resulting states are used to build embeddings for the words w1, . . . , wn of the sentence.
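A hedged sketch of a character-LSTM word encoder; the ord-based character ids, the helper name, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch; word_embedding and its sizes are assumptions.
n_chars, d_c, d_w = 128, 16, 64
char_emb = nn.Embedding(n_chars, d_c)
char_lstm = nn.LSTM(d_c, d_w, batch_first=True)

def word_embedding(word: str) -> torch.Tensor:
    ids = torch.tensor([[min(ord(ch), n_chars - 1) for ch in word]])
    _, (h_n, _) = char_lstm(char_emb(ids))   # run over the characters
    return h_n[-1, 0]                        # final state = word vector

vec = word_embedding("football")
```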
Multi-layer RNNs

RNN layers can be stacked: the sequence of hidden states from one layer serves as the input sequence to the next, giving a deeper network.
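A one-line sketch of stacking, assuming PyTorch: num_layers=2 feeds the first layer's hidden-state sequence into the second.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the framework choice (PyTorch) is an assumption.
rnn = nn.RNN(32, 64, num_layers=2, batch_first=True)
h, h_n = rnn(torch.randn(1, 10, 32))   # h: top layer's states, per step
```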
What other paradigms can RNNs be used for?

- Fixed-length input → variable-length output, e.g., image captioning (one-to-many).
- Variable-length input → fixed-length output, e.g., sentence classification (many-to-one).
- Variable-length input → variable-length output, e.g., translation and summarization (many-to-many).
Text classification maps a sentence to a label.
For translation and other sequence-to-sequence tasks, RNN-based models have largely been superseded by Transformers.