
CS60010: Deep Learning

Spring 2023

Sudeshna Sarkar

Transformer - Part 1
Sudeshna Sarkar
10 Mar 2023



Encoder-Decoder Attention

[Figure: RNN encoder-decoder with attention translating "we are eating bread" into "estamos comiendo pan [STOP]". Encoder states h1..h4 are scored against each decoder state, the scores e_{t,i} are passed through a softmax to give attention weights a_{t,i}, and the resulting context vectors c1..c4 are fed (together with the previous output y_{t-1}) to the decoder states s1..s4, starting from [START].]
Attention-only Translation Models

• Problems with recurrent networks:
  • Sequential training and inference: time grows in proportion to sentence length; hard to parallelize.
  • Long-range dependencies have to be remembered across many single time steps.
  • Tricky to learn hierarchical structures ("car", "blue car", "into the blue car", ...)
• Alternative:
  • Convolution – but it has other limitations.
Self-Attention
• Consider an input (or intermediate) sequence.
• Construct the next-level representation, which can choose "where to look" by assigning a weight to each input position.
• Create a weighted sum.

[Figure: Level k+1 is computed from Level k via a softmax over lower-level locations, conditioned on context at lower and higher locations.]
Self-attention
• Each word is a query to form attention over all tokens.
• This generates a context-dependent representation of each token: a weighted sum of all tokens.
• The attention weights dynamically mix how much is taken from each token.
Comparing CNNs, RNNs, and Self-Attention

Self-Attention "Transformers"
• Constant path length between any two positions.
• Variable receptive field (up to the whole input sequence).
• Supports hierarchical information flow by stacking self-attention layers.
• Trivial to parallelize.
• Attention weighting controls information propagation.
• If we have attention, do we even need recurrent connections?
• Can we transform our RNN into a purely attention-based model?

Vaswani et al., "Attention Is All You Need", arXiv 2017
https://arxiv.org/abs/1706.03762
Transformer Model (Vaswani et al. 2017)

• Transformers map sequences of input vectors $x_1, x_2, \dots, x_n$ to sequences of output vectors $y_1, y_2, \dots, y_m$.
• Made up of stacks of Transformer blocks.
  • Blocks combine linear layers, feedforward networks, and self-attention layers.
• Self-attention allows a network to directly extract and use information from arbitrarily large contexts.
Dot-Product Attention: Queries, Keys, and Values

• For each element in the sequence $S$ (length $m$) define a key, a value, and a query.
• Score each key against the query with a scaled dot product:

$$\alpha(q, k_i) = \frac{q^\top k_i}{\sqrt{d}}$$

• Normalize the attention weights to sum to 1 and be non-negative:

$$\tilde{\alpha}(q, k_i) = \mathrm{softmax}_i\,\alpha(q, k_i) = \frac{\exp\!\left(q^\top k_i / \sqrt{d}\right)}{\sum_j \exp\!\left(q^\top k_j / \sqrt{d}\right)}$$

• Output the weighted sum of the values:

$$\mathrm{Attention}(q, S) \stackrel{\text{def}}{=} \sum_{i=1}^{m} \tilde{\alpha}(q, k_i)\, v_i$$
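To make these formulas concrete, here is a minimal NumPy sketch of dot-product attention for a single query against a sequence of keys and values (the shapes and variable names are illustrative, not from the slides):

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Scaled dot-product attention for a single query q.

    q: (d,) query vector
    K: (m, d) one key per sequence element
    V: (m, d_v) one value per sequence element
    Returns the attention output (d_v,) and the weights (m,).
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)               # alpha(q, k_i) = q^T k_i / sqrt(d)
    weights = np.exp(scores - scores.max())   # subtract max for numerical stability
    weights = weights / weights.sum()         # softmax: non-negative, sums to 1
    return weights @ V, weights               # Attention(q, S) = sum_i alpha~_i v_i

# Tiny usage example with random vectors
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 8))
out, w = dot_product_attention(q, K, V)
print(out.shape, round(w.sum(), 6))           # (8,) 1.0
```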
Is Attention All We Need?

• Attention can in principle do everything that recurrence can, and more!
• But this has a few issues we must overcome:
  1. The encoder has no temporal dependencies.

[Figure: attention-only translation of "A cute puppy" into "Un chiot mignon", decoded starting from <START>.]
From Self-Attention to Transformers
1. Positional encoding: addresses the lack of sequence-order information.
2. Multi-headed attention: allows querying multiple positions at each layer.
3. Adding nonlinearities.
4. Masked decoding: how do we prevent attention lookups into the future?


Self-Attention Explained

• Each output $b^i$ is obtained based on the whole input sequence.
• $b^1, b^2, b^3, b^4$ can be computed in parallel.

[Figure: a self-attention layer maps inputs $a^1, a^2, a^3, a^4$ to outputs $b^1, b^2, b^3, b^4$.]
• Query (to match others): $q^i = W^q a^i$
• Key (to be matched): $k^i = W^k a^i$
• Value (information to be extracted): $v^i = W^v a^i$

Each input token $x^i$ is first embedded as $a^i = W x^i$; every $a^i$ then produces a triple $(q^i, k^i, v^i)$.
Scaled Dot-Product Attention

Here $d$ is the dimension of $q$ and $k$. The first query is scored against every key with a scaled dot product:

$$\alpha_{1,i} = q^1 \cdot k^i / \sqrt{d}$$

[Figure: $q^1$ is dotted with $k^1, \dots, k^4$ to produce the scores $\alpha_{1,1}, \dots, \alpha_{1,4}$.]
Considering the whole sequence, the softmax-normalized weights give the first output:

$$b^1 = \sum_i \hat{\alpha}_{1,i}\, v^i$$

[Figure: $b^1$ is the weighted sum of the values $v^1, \dots, v^4$ with weights $\hat{\alpha}_{1,1}, \dots, \hat{\alpha}_{1,4}$.]
Likewise for the second position:

$$b^2 = \sum_i \hat{\alpha}_{2,i}\, v^i$$

[Figure: $b^2$ is the weighted sum of the same values $v^1, \dots, v^4$ with weights $\hat{\alpha}_{2,1}, \dots, \hat{\alpha}_{2,4}$.]
$b^1, b^2, b^3, b^4$ can be computed in parallel.

[Figure: a single self-attention layer maps $a^1, \dots, a^4$ (the embeddings of $x^1, \dots, x^4$) to $b^1, \dots, b^4$.]
In matrix form, with $I = [a^1\ a^2\ a^3\ a^4]$ stacking the embeddings as columns:

$$Q = [q^1\ q^2\ q^3\ q^4] = W^q I \qquad (q^i = W^q a^i)$$
$$K = [k^1\ k^2\ k^3\ k^4] = W^k I \qquad (k^i = W^k a^i)$$
$$V = [v^1\ v^2\ v^3\ v^4] = W^v I \qquad (v^i = W^v a^i)$$
Self-attention

Each score is a dot product of a key with the query (ignoring $\sqrt{d}$ for simplicity), so all scores for $q^1$ can be computed in one matrix-vector product:

$$\begin{bmatrix} \alpha_{1,1} \\ \alpha_{1,2} \\ \alpha_{1,3} \\ \alpha_{1,4} \end{bmatrix}
= \begin{bmatrix} (k^1)^\top \\ (k^2)^\top \\ (k^3)^\top \\ (k^4)^\top \end{bmatrix} q^1,
\qquad \alpha_{1,i} = (k^i)^\top q^1 .$$
Self-attention

Stacking all queries gives the full score matrix $A$ and its column-wise softmax $\hat{A}$:

$$A = K^\top Q, \qquad \hat{A} = \mathrm{softmax}(A) \ \text{(column-wise)},$$

where column $j$ of $A$ holds the scores of query $q^j$ against all keys, and $b^2 = \sum_i \hat{\alpha}_{2,i}\, v^i$ as before.
Self-attention

All outputs are then computed at once as a single matrix product:

$$O = [b^1\ b^2\ b^3\ b^4] = V \hat{A},$$

i.e. $b^j = \sum_i \hat{\alpha}_{j,i}\, v^i$ for every position $j$.
Self-attention (summary)

A self-attention layer maps the input matrix $I$ (embeddings as columns) to the output matrix $O$:

$$Q = W^q I, \qquad K = W^k I, \qquad V = W^v I,$$
$$A = K^\top Q, \qquad \hat{A} = \mathrm{softmax}(A), \qquad O = V \hat{A}.$$
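A minimal NumPy sketch of this matrix form, following the slides' convention that tokens are stored as the columns of $I$; the $\sqrt{d}$ scaling (ignored on the earlier slides for simplicity) is included here, and all shapes are illustrative assumptions:

```python
import numpy as np

def self_attention(I, Wq, Wk, Wv):
    """Matrix-form self-attention: columns of I are the token embeddings a^i.

    I:  (d_model, n) input embeddings, one column per token
    Wq, Wk, Wv: (d, d_model) projection matrices
    Returns O of shape (d, n), one output column b^i per token.
    """
    Q, K, V = Wq @ I, Wk @ I, Wv @ I            # Q = W^q I, K = W^k I, V = W^v I
    A = K.T @ Q / np.sqrt(Q.shape[0])           # A = K^T Q (scaled)
    A_hat = np.exp(A - A.max(axis=0, keepdims=True))
    A_hat /= A_hat.sum(axis=0, keepdims=True)   # column-wise softmax -> A_hat
    return V @ A_hat                            # O = V A_hat

rng = np.random.default_rng(0)
I = rng.normal(size=(8, 5))                     # d_model = 8, n = 5 tokens
Wq, Wk, Wv = (rng.normal(size=(4, 8)) for _ in range(3))
print(self_attention(I, Wq, Wk, Wv).shape)      # (4, 5)
```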
Self-Attention (stacked)

[Figure: stacked self-attention "layers" — keep repeating until we've processed the input enough; at the end, decode it into an answer.]
Multiple Attention Heads

• Multiple attention heads can learn to attend in different ways.
• Requires additional parameters to compute different attention values and transform vectors.

Notation: $k$ is the level (layer) number, $L$ is the number of heads, $X = \mathbf{x}_1, \dots, \mathbf{x}_n$, and $\mathbf{x}_i^1 = \mathbf{x}_i$:

$$\tilde{\alpha}_{i,j}^{k,l} = \mathbf{x}_i^{k-1}\, \mathbf{W}^{k,l}\, \mathbf{x}_j^{k-1}$$
$$\alpha_i^{k,l} = \mathrm{softmax}\!\left(\tilde{\alpha}_{i,1}^{k,l}, \dots, \tilde{\alpha}_{i,n}^{k,l}\right)$$
$$\mathbf{x}_i^{k,l} = \sum_{j=1}^{n} \alpha_{i,j}^{k,l}\, \mathbf{x}_j^{k-1}$$
$$\mathbf{x}_i^{k} = V^{k}\left[\mathbf{x}_i^{k,1}; \dots; \mathbf{x}_i^{k,L}\right]$$
Multi-head Self-attention (2 heads as example)

• Each head applies its own projections to the per-token query, key, and value:

$$q^{i,1} = W^{q,1} q^i, \qquad q^{i,2} = W^{q,2} q^i \qquad (\text{and similarly for } k^{i,l} \text{ and } v^{i,l}),$$

where $q^i = W^q a^i$ as before.
• Each head $l$ performs self-attention with its own $(q^{i,l}, k^{i,l}, v^{i,l})$ triples, producing outputs $b^{i,1}$ and $b^{i,2}$ at every position $i$.
• The head outputs are concatenated and linearly combined:

$$b^i = W^O \begin{bmatrix} b^{i,1} \\ b^{i,2} \end{bmatrix}.$$
Positional Encoding

• Self-attention has no notion of token order, so each position $i$ gets a unique positional vector $e^i$ (not learned from data).
• Two equivalent views: append a one-hot position vector $p^i$ (1 in the $i$-th dimension) to each input $x^i$, or simply add $e^i = W^P p^i$ to the embedding $a^i = W^I x^i$:

$$W \begin{bmatrix} x^i \\ p^i \end{bmatrix} = \begin{bmatrix} W^I & W^P \end{bmatrix} \begin{bmatrix} x^i \\ p^i \end{bmatrix} = W^I x^i + W^P p^i = a^i + e^i .$$
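The fixed (not learned) positional vectors $e^i$ used in Vaswani et al. are the sinusoidal encoding; here is a short NumPy sketch of that choice (shapes and names are illustrative, and it assumes an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed positional vectors e^i (not learned), as in Vaswani et al. 2017.

    Returns an array of shape (max_len, d_model); row i is e^i.
    Assumes d_model is even.
    """
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2k
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2k / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions: cos
    return pe

E = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(E.shape)   # (50, 16); add E[i] to the embedding a^i of the token at position i
```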
Positional encoding: learned

• Another idea: just learn a positional encoding.
• These are not different for every input sequence: the same learned values are used for every sequence, but they differ across time steps.
• How many values do we need to learn? dimensionality × max sequence length.
• + More flexible (and perhaps more optimal) than the sin/cos encoding.
• – A bit more complex: we need to pick a max sequence length (and can't generalize beyond it).
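For contrast, a minimal sketch of the learned alternative: the positional encoding is just a dimensionality × max-sequence-length parameter matrix, trained with the rest of the model (the initialization scale and names below are illustrative, and the training itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 50, 16

# A learned positional encoding is simply a (max_len, d_model) parameter matrix:
# max_len * d_model values, initialized randomly and updated by gradient descent.
# The same rows are reused for every sequence, but each time step gets its own row.
P = rng.normal(scale=0.02, size=(max_len, d_model))

tokens = rng.normal(size=(10, d_model))   # embeddings of a length-10 sequence (rows = positions)
inputs = tokens + P[:10]                  # add position i's learned vector to token i
print(inputs.shape)                       # (10, 16)
```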
Multi-head attention

• Since we are relying entirely on attention now, we might want to incorporate information from more than one time step.
• Because of the softmax, a single attention distribution will be dominated by one value.
• It is hard to specify that you want two different things (e.g., the subject and the object in a sentence).
Multi-head attention

• Idea: have multiple keys, queries, and values for every time step!
• Around 8 heads seems to work pretty well for big models.
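A minimal NumPy sketch of multi-head self-attention in the column-vector convention used above; the per-head projections and the output matrix $W^O$ follow the two-head slides earlier, but the specific head count and shapes are illustrative assumptions:

```python
import numpy as np

def softmax_cols(A):
    """Column-wise softmax: each column is non-negative and sums to 1."""
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def multi_head_self_attention(I, Wq, Wk, Wv, Wo):
    """Multi-head self-attention; columns of I are tokens.

    I:          (d_model, n)
    Wq, Wk, Wv: lists with one (d_head, d_model) projection per head
    Wo:         (d_model, n_heads * d_head) output projection W^O
    """
    heads = []
    for Wq_l, Wk_l, Wv_l in zip(Wq, Wk, Wv):                 # one attention pass per head
        Q, K, V = Wq_l @ I, Wk_l @ I, Wv_l @ I
        A_hat = softmax_cols(K.T @ Q / np.sqrt(Q.shape[0]))
        heads.append(V @ A_hat)                              # head output b^{i,l} for all positions i
    return Wo @ np.concatenate(heads, axis=0)                # b^i = W^O [b^{i,1}; ...; b^{i,L}]

rng = np.random.default_rng(0)
d_model, d_head, n, L = 16, 8, 5, 2
I = rng.normal(size=(d_model, n))
Wq = [rng.normal(size=(d_head, d_model)) for _ in range(L)]
Wk = [rng.normal(size=(d_head, d_model)) for _ in range(L)]
Wv = [rng.normal(size=(d_head, d_model)) for _ in range(L)]
Wo = rng.normal(size=(d_model, L * d_head))
print(multi_head_self_attention(I, Wq, Wk, Wv, Wo).shape)    # (16, 5)
```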
Alternating self-attention & nonlinearity

• To add nonlinearity, apply a small neural net at every position after every self-attention layer.
• This is sometimes referred to as a "position-wise feedforward network".
• We'll describe some specific commonly used choices shortly.

[Figure: self-attention "layers" alternate with position-wise feedforward networks.]
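A minimal sketch of such a position-wise feedforward network: the same two-layer net is applied independently at every position. The 2-layer ReLU form and the wider hidden size follow the usual choice in Vaswani et al.; all names and sizes are illustrative:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same 2-layer MLP independently at every position (column) of X.

    X:  (d_model, n) tokens as columns
    W1: (d_ff, d_model), b1: (d_ff, 1)
    W2: (d_model, d_ff), b2: (d_model, 1)
    """
    hidden = np.maximum(0.0, W1 @ X + b1)   # ReLU gives the nonlinearity
    return W2 @ hidden + b2                 # project back to d_model per position

rng = np.random.default_rng(0)
d_model, d_ff, n = 16, 64, 5
X = rng.normal(size=(d_model, n))
W1, b1 = rng.normal(size=(d_ff, d_model)), np.zeros((d_ff, 1))
W2, b2 = rng.normal(size=(d_model, d_ff)), np.zeros((d_model, 1))
print(position_wise_ffn(X, W1, b1, W2, b2).shape)   # (16, 5) -- same shape as the input
```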
Self-attention can see the future!

A crude self-attention "language model" (in reality, we would have many alternating self-attention layers and position-wise feedforward networks, not just one):

• Big problem: self-attention at step 1 can look at the values at steps 2 & 3, which are based on the inputs at steps 2 & 3.
• At test time (when decoding), the inputs at steps 2 & 3 will be based on the output at step 1...
• ...which requires knowing the inputs at steps 2 & 3.
Masked attention

In the crude self-attention "language model" above, at test time (when decoding) the inputs at steps 2 & 3 will be based on the output at step 1, which in turn requires knowing the inputs at steps 2 & 3.

• We must allow self-attention into the past...
• ...but not into the future.
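A minimal NumPy sketch of masked (causal) self-attention in the column convention used above: scores from future positions are set to −∞ before the softmax, so they receive zero weight. Shapes and names are illustrative:

```python
import numpy as np

def masked_self_attention(I, Wq, Wk, Wv):
    """Self-attention with a causal mask: the query at position j may attend
    only to keys at positions i <= j (the past), never the future.
    Columns of I are tokens in order."""
    Q, K, V = Wq @ I, Wk @ I, Wv @ I
    d, n = Q.shape
    A = K.T @ Q / np.sqrt(d)                              # A[i, j] = score of key i for query j
    future = np.tril(np.ones((n, n), dtype=bool), k=-1)   # True where key index i > query index j
    A[future] = -np.inf                                   # blocked entries get zero weight
    A_hat = np.exp(A - A.max(axis=0, keepdims=True))
    A_hat /= A_hat.sum(axis=0, keepdims=True)             # column-wise softmax
    return V @ A_hat

rng = np.random.default_rng(0)
I = rng.normal(size=(8, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 8)) for _ in range(3))
out = masked_self_attention(I, Wq, Wk, Wv)
print(out.shape)   # (4, 4); the first output column depends only on the first input token
```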
Transformer summary
• These models are generally called "Transformers" because they transform one sequence into another at each layer.
• Alternate self-attention "layers" with nonlinear position-wise feedforward networks (to get nonlinear transformations).
• Use positional encoding (on the input or input embedding) to make the model aware of the relative positions of tokens.
• Use multi-head attention.
• Use masked attention if you want to use the model for decoding.
The "classic" transformer

As compared to a sequence-to-sequence RNN model:

[Figure: the encoder is a position-wise encoder followed by self-attention "layers" and position-wise nonlinear networks, repeated N times; the decoder is a position-wise encoder followed by masked self-attention, cross attention (we'll discuss how this bit works soon), and position-wise nonlinear networks, repeated N times, ending in a position-wise softmax.]
The Final Linear and Softmax Layer

• A final linear layer followed by a softmax outputs the next word.
• Suppose our model knows 10,000 unique English words (its "output vocabulary") learned from its training dataset; the linear layer produces one score per vocabulary word.
• The softmax layer then turns those scores into probabilities (all positive, adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
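A toy sketch of this final step (the vocabulary, weight names, and greedy choice below are illustrative, not the slides' exact setup):

```python
import numpy as np

def predict_next_word(h, W_vocab, b_vocab, vocab):
    """Final linear layer + softmax over the output vocabulary.

    h:       (d_model,) decoder output at the current time step
    W_vocab: (vocab_size, d_model) final linear layer, b_vocab: (vocab_size,)
    vocab:   list of vocab_size word strings
    """
    scores = W_vocab @ h + b_vocab              # one score (logit) per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                        # softmax: all positive, sums to 1.0
    return vocab[int(np.argmax(probs))], probs  # pick the highest-probability word

rng = np.random.default_rng(0)
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy 6-word "output vocabulary"
d_model = 8
W_vocab, b_vocab = rng.normal(size=(len(vocab), d_model)), np.zeros(len(vocab))
word, probs = predict_next_word(rng.normal(size=d_model), W_vocab, b_vocab, vocab)
print(word, round(probs.sum(), 6))
```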
Combining encoder and decoder: "Cross-attention"

• In cross attention, the decoder attends over keys and values produced from the encoder output.
• This is much more like the standard attention from the previous lecture.
• In reality, cross-attention is also multi-headed!

[Figure: the encoder stack (self-attention "layer" + position-wise nonlinear network, repeated N times) supplies the values for the cross-attention layer in each decoder block (masked self-attention → cross attention → position-wise nonlinear network, repeated N times).]
The Residuals

• Each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
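A minimal sketch of this "residual + layer norm" pattern (the scalar gain and bias in layer_norm are a simplification; in practice they are learned per-feature vectors):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each position (column) across its feature dimension."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def sublayer_with_residual(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization:
    LayerNorm(x + Sublayer(x)) -- used around both self-attention and the FFN."""
    return layer_norm(x + sublayer(x))

# Toy usage: wrap an arbitrary position-wise function
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 5))                          # (d_model, n) tokens as columns
out = sublayer_with_residual(x, np.tanh)
print(out.shape, np.allclose(out.mean(axis=0), 0))    # (16, 5) True -- each column has zero mean
```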

Layer normalization

• Main idea: batch normalization is very helpful, but it is hard to use with sequence models.
  • Sequences have different lengths, which makes normalizing across the batch hard.
  • Sequences can be very long, so we sometimes have small batches.
• Simple solution: "layer normalization" – like batch norm, but not across the batch.

[Figure: batch norm normalizes each feature dimension across the batch; layer norm normalizes across the feature dimensions of a single example.]
The Transformer Architecture
Composed of an encoder and a decoder.
• Encoder:
  • a stack of multiple identical blocks
  • each block has two sublayers:
    1. a multi-head self-attention pooling (queries, keys, and values all come from the outputs of the previous encoder layer)
    2. a positionwise feed-forward network
  • A residual connection is employed around both sublayers, followed by layer normalization.
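Putting the earlier sketches together, here is one encoder block; it assumes the self_attention, position_wise_ffn, and sublayer_with_residual functions sketched above are in scope, uses a single attention head for brevity, and treats all shapes as illustrative:

```python
import numpy as np

# Assumes the earlier sketches are in scope:
#   self_attention(I, Wq, Wk, Wv), position_wise_ffn(X, W1, b1, W2, b2),
#   sublayer_with_residual(x, sublayer)

def encoder_block(X, p):
    """One encoder block: two sublayers, each with a residual connection
    followed by layer normalization. X is (d_model, n), tokens as columns."""
    # Sublayer 1: self-attention (single head here for brevity); queries, keys,
    # and values all come from the previous layer's output X.
    X = sublayer_with_residual(X, lambda Z: self_attention(Z, p["Wq"], p["Wk"], p["Wv"]))
    # Sublayer 2: position-wise feed-forward network.
    X = sublayer_with_residual(
        X, lambda Z: position_wise_ffn(Z, p["W1"], p["b1"], p["W2"], p["b2"]))
    return X

rng = np.random.default_rng(0)
d_model, d_ff, n = 16, 64, 5
p = {"Wq": rng.normal(size=(d_model, d_model)),
     "Wk": rng.normal(size=(d_model, d_model)),
     "Wv": rng.normal(size=(d_model, d_model)),   # keep d_model outputs so the residual adds up
     "W1": rng.normal(size=(d_ff, d_model)), "b1": np.zeros((d_ff, 1)),
     "W2": rng.normal(size=(d_model, d_ff)), "b2": np.zeros((d_model, 1))}
X = rng.normal(size=(d_model, n))
for _ in range(6):              # the paper stacks 6 identical blocks
    X = encoder_block(X, p)     # (a real model uses different weights in each block)
print(X.shape)                  # (16, 5)
```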
The Transformer Architecture
• Decoder:
  • a stack of multiple identical blocks
  • each block has three sublayers:
    1. a multi-head self-attention pooling – each position in the decoder is only allowed to attend to positions in the decoder up to and including that position
    2. encoder-decoder attention: queries come from the outputs of the previous decoder layer, and the keys and values come from the Transformer encoder outputs
    3. a positionwise feed-forward network
  • A residual connection is employed around each sublayer, followed by layer normalization.
Transformer block details

• Layer Norm (https://arxiv.org/abs/1607.06450): normalizes each example across its layer's features to μ = 0, σ = 1.
• Batch Norm (https://www.youtube.com/watch?v=BZh1ltr5Rkg): normalizes each feature across the batch to μ = 0, σ = 1.

[Figure: each sublayer output b is added to its input a and layer-normalized to give b'. The decoder's masked self-attention attends on the generated sequence, while the encoder-decoder attention attends on the input sequence.]
Putting it all together: The Transformer

• The decoder decodes one position at a time, using masked attention.
• 6 layers, each with d = 512.
• Encoder layer: multi-head self-attention (which concatenates the attention from all heads), a residual connection with layer norm (LN), a 2-layer neural net at each position, and another residual connection with LN.
• Decoder layer: essentially the same as an encoder layer, only with masked self-attention, plus multi-head cross attention (again with a residual connection and LN).

Vaswani et al. Attention Is All You Need. 2017.


Why transformers?
Downsides:
- Attention computations are technically O(n²).
- Somewhat more complex to implement (positional encodings, etc.).

Benefits:
+ Much better long-range connections.
+ Much easier to parallelize.
+ In practice, can be made much deeper (more layers) than an RNN.

The benefits seem to vastly outweigh the downsides, and transformers work much better than RNNs (and LSTMs) in many cases.

Arguably one of the most important sequence-modeling improvements of the past decade.
