Transformer – Part 1
Sudeshna Sarkar
Spring 2023
10 Mar 2023
[Figure: sequence-to-sequence RNN with attention – encoder inputs $x_1 \dots x_4$ and hidden states $h_1 \dots h_4$, decoder states $s_0 \dots s_4$, context vectors $c_1 \dots c_4$, outputs $y_0 \dots y_3$]
Attention-only Translation Models
• Alternative to recurrence: convolution – but it has other limitations.
Self-Attention
• Consider an input (or intermediate) sequence.
• Construct the next-level representation, which can choose "where to look" by assigning a weight to each input position.
• Create a weighted sum using those weights.
[Figure: self-attention connecting every position at level $k$ to each position at level $k+1$]
Self-Attention “Transformers”
• Constant path length between any two positions.
• Variable receptive field (up to the whole input sequence).
• Supports hierarchical information flow by stacking self-attention layers.
• Trivial to parallelize.
• Attention weighting controls information propagation.
• If we have attention, do we even need recurrent connections?
• Can we transform our RNN into a purely attention-based model?
Transformer Model (Vaswani et al., 2017)
Transformer Model: Dot-Product Attention
Queries, Keys, and Values
• For each element in the sequence $S$ (length $m$), define a key, a value, and a query.
• Attention score: $\alpha(q, k_i) = \dfrac{q^\top k_i}{\sqrt{d}}$
• Normalize the attention weights so they are non-negative and sum to 1:
  $\tilde{\alpha}(q, k_i) = \mathrm{softmax}_i\,\alpha(q, k_i) = \dfrac{\exp(q^\top k_i / \sqrt{d})}{\sum_{j=1}^{m} \exp(q^\top k_j / \sqrt{d})}$
• Output: $\mathrm{Attention}(q, S) \overset{\mathrm{def}}{=} \sum_{i=1}^{m} \tilde{\alpha}(q, k_i)\, v_i$
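As a concrete illustration, here is a minimal NumPy sketch of this computation for a single query (the shapes, variable names, and random test data are illustrative assumptions, not from the slides): compute the scaled scores against every key, softmax-normalize them, and return the weighted sum of the values.

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Scaled dot-product attention for a single query.

    q: query vector of shape (d,)
    K: key matrix of shape (m, d)    -- one key per sequence element
    V: value matrix of shape (m, d_v) -- one value per sequence element
    Returns the attended output of shape (d_v,).
    """
    d = q.shape[0]
    scores = K @ q / np.sqrt(d)                # alpha(q, k_i), shape (m,)
    weights = np.exp(scores - scores.max())    # softmax (shifted for numerical stability)
    weights = weights / weights.sum()          # non-negative, sums to 1
    return weights @ V                         # sum_i alpha~(q, k_i) v_i

# Example: a length-5 sequence with 8-dimensional keys and values
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out = dot_product_attention(q, K, V)           # shape (8,)
```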
Is Attention all we need?
Not quite. To turn self-attention into a full model we also need (each covered below):
1. Positional encoding
2. Multi-head attention
3. Adding nonlinearities
4. Masked attention for decoding
Self-Attention
• $q$: query (to match others): $q^i = W^q a^i$
• $k$: key (to be matched): $k^i = W^k a^i$
• $v$: information to be extracted: $v^i = W^v a^i$
• Scaled dot-product attention score between query 1 and position $i$: $\alpha_{1,i} = q^1 \cdot k^i / \sqrt{d}$
[Figure: inputs $x^1 \dots x^4$, each producing $(q^i, k^i, v^i)$; $q^1$ is matched against every key, giving scores $\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \alpha_{1,4}$]
• Considering the whole sequence, the output at position 1 is the weighted sum $b^1 = \sum_i \tilde{\alpha}_{1,i}\, v^i$
[Figure: $b^1$ computed from the $(q, k, v)$ vectors of inputs $x^1 \dots x^4$]
• Similarly for position 2: $b^2 = \sum_i \tilde{\alpha}_{2,i}\, v^i$
[Figure: $b^2$ computed from the $(q, k, v)$ vectors of inputs $x^1 \dots x^4$]
Self-Attention Layer
• $b^1, b^2, b^3, b^4$ can be computed in parallel.
[Figure: the self-attention layer maps $x^1 \dots x^4$ to $b^1 \dots b^4$]
• In matrix form, stack the per-position vectors as columns:
  $Q = [q^1\; q^2\; q^3\; q^4] = W^q I$
  $K = [k^1\; k^2\; k^3\; k^4] = W^k I$
  $V = [v^1\; v^2\; v^3\; v^4] = W^v I$
  where $I = [a^1\; a^2\; a^3\; a^4]$ is the matrix whose columns are the input vectors (and $q^i = W^q a^i$, $k^i = W^k a^i$, $v^i = W^v a^i$ as before).
[Figure: inputs $x^1 \dots x^4$ and their $(q^i, k^i, v^i)$ vectors]
Self-attention
• All pairwise attention scores at once: $A = K^\top Q$; applying the softmax column-wise gives the normalized weights $\hat{A}$.
• All outputs at once: $O = V \hat{A}$, so $b^1, \dots, b^4$ are the columns of $O$.
[Figure: inputs $x^1 \dots x^4$ with their $(q, k, v)$ vectors feeding the matrix computation of $b^1$]
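The matrix form above can be sketched in a few lines of NumPy (the weight matrices and sizes below are illustrative assumptions): the columns of I are the input vectors, and the whole layer reduces to matrix products plus a column-wise softmax.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(I, Wq, Wk, Wv):
    """I: (d, n) matrix whose columns are the input vectors a^1 ... a^n."""
    Q, K, V = Wq @ I, Wk @ I, Wv @ I               # project inputs to queries, keys, values
    A = K.T @ Q / np.sqrt(Q.shape[0])              # A[i, j] = k^i . q^j (with the 1/sqrt(d) scaling from earlier)
    A_hat = softmax(A, axis=0)                     # normalize over the key index i, per query column
    return V @ A_hat                               # O: columns are b^1 ... b^n

d, n = 8, 4
rng = np.random.default_rng(0)
I = rng.normal(size=(d, n))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
O = self_attention(I, Wq, Wk, Wv)                  # shape (d, n)
```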
Self-Attention
• Stack self-attention "layers": keep repeating until we've processed the input enough; at the end, decode it into an answer.
[Figure: several self-attention "layers" stacked on top of each other]
Multiple Attention Heads
• Multiple attention heads can learn to attend in different ways.
• Requires additional parameters to compute different attention values and transform vectors.
• Notation: $k$ is the level (layer) number, $L$ is the number of heads, $X = (\mathbf{x}_1, \dots, \mathbf{x}_n)$ is the input, and $\mathbf{x}_i^1 = \mathbf{x}_i$.
  $\tilde{\alpha}_{i,j}^{k,l} = (\mathbf{x}_i^{k-1})^\top \mathbf{W}^{k,l}\, \mathbf{x}_j^{k-1}$ (attention weight for head $l$)
  $\mathbf{x}_i^{k,l} = \sum_{j=1}^{n} \tilde{\alpha}_{i,j}^{k,l}\, \mathbf{x}_j^{k-1}$
Multi-head Self-attention (2 heads as example)
• Start from $q^i = W^q a^i$ (and similarly $k^i$, $v^i$); each head then gets its own projection, e.g. for the queries:
  $q^{i,1} = W^{q,1} q^i$,  $q^{i,2} = W^{q,2} q^i$  (likewise for keys and values)
• Head 1 attends using only the $(q^{\cdot,1}, k^{\cdot,1}, v^{\cdot,1})$ vectors and produces $b^{i,1}$; head 2 uses the $(q^{\cdot,2}, k^{\cdot,2}, v^{\cdot,2})$ vectors and produces $b^{i,2}$.
• The head outputs are concatenated and projected back down: $b^i = W^O \begin{bmatrix} b^{i,1} \\ b^{i,2} \end{bmatrix}$
[Figure: two positions $i$ and $j$ with their per-head queries, keys, and values]
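A hedged NumPy sketch of the two-head case (the head dimension, random weights, and the project-per-head layout are assumptions for illustration): each head runs ordinary scaled dot-product self-attention on its own projected queries, keys, and values, and $W^O$ mixes the concatenated head outputs back to the model dimension.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention in matrix form (columns are positions)."""
    A_hat = softmax(K.T @ Q / np.sqrt(Q.shape[0]), axis=0)
    return V @ A_hat

def multi_head_self_attention(I, heads, Wo):
    """I: (d, n) inputs; heads: list of (Wq, Wk, Wv) per head; Wo: output projection."""
    outputs = []
    for Wq, Wk, Wv in heads:                       # each head attends independently
        outputs.append(attention(Wq @ I, Wk @ I, Wv @ I))
    B = np.concatenate(outputs, axis=0)            # stack b^{i,1}, b^{i,2} for each position
    return Wo @ B                                  # b^i = W^O [b^{i,1}; b^{i,2}]

d, n, d_head = 8, 4, 4
rng = np.random.default_rng(0)
I = rng.normal(size=(d, n))
heads = [tuple(rng.normal(size=(d_head, d)) for _ in range(3)) for _ in range(2)]
Wo = rng.normal(size=(d, 2 * d_head))
B = multi_head_self_attention(I, heads, Wo)        # shape (d, n)
```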
Positional Encoding
• Each position has a unique positional vector $e^i$ (hand-crafted, not learned from data), which is added to the embedding: the self-attention layer sees $e^i + a^i$ instead of $a^i$.
• Equivalently, append to each $x^i$ a one-hot position vector $p^i$ (1 in the $i$-th dimension, 0 elsewhere) before projecting:
  $W \begin{bmatrix} x^i \\ p^i \end{bmatrix} = \begin{bmatrix} W^I & W^P \end{bmatrix} \begin{bmatrix} x^i \\ p^i \end{bmatrix} = W^I x^i + W^P p^i = a^i + e^i$
[Figure: $x^i$ and $p^i$ projected by $W^I$ and $W^P$, summed, then used to produce $q^i, k^i, v^i$]
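The slides only specify that $e^i$ is fixed rather than learned; one standard concrete choice (assumed here, following Vaswani et al. 2017) is the sinusoidal encoding, sketched below.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Fixed (not learned) positional vectors e^i, one row per position.
    Assumes an even d_model: even dims get sine, odd dims get cosine."""
    positions = np.arange(n_positions)[:, None]                 # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]                    # even feature indices
    angles = positions / np.power(10000.0, dims / d_model)      # (n, d_model/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# e^i is simply added to the input embedding a^i, e.g.:
# a = embeddings + sinusoidal_positional_encoding(len(tokens), d_model)
```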
Positional encoding: learned
• Another idea: just learn a positional encoding, i.e. treat the positional vectors as trainable parameters.
• Since we are relying entirely on attention now, we might want to incorporate more than one time step.
• Idea: have multiple keys, queries, and values for every time step (multi-head attention, as above).
Transformer summary
• These are generally called "Transformers" because they transform one sequence into another at each layer.
• Alternate self-attention "layers" with nonlinear position-wise feedforward networks (to get nonlinear transformations).
• Use positional encoding (on the input or input embedding) to make the model aware of the relative positions of tokens.
• Use multi-head attention.
• Use masked attention if you want to use the model for decoding.
[Figure: stacked self-attention "layers"]
The "classic" transformer
(as compared to a sequence-to-sequence RNN model)
[Figure: encoder – a self-attention "layer" followed by a position-wise nonlinear network, repeated N times; decoder – masked self-attention, cross attention (how this bit works is discussed below), and a position-wise nonlinear network, repeated N times, topped by a position-wise softmax]
• Let's assume that our model knows 10,000 unique English words (our model's "output vocabulary") that it has learned from its training dataset.
• The final linear layer projects the decoder output into a vector of 10,000 scores, one per vocabulary word. The softmax layer then turns those scores into probabilities (all positive, all adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
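A tiny sketch of that output step (the toy vocabulary and logit values are made up purely for illustration): softmax over the scores from the final linear layer, then pick the highest-probability word.

```python
import numpy as np

vocab = ["i", "am", "a", "student", "<eos>"]       # toy stand-in for the 10,000-word vocabulary
logits = np.array([2.1, 0.3, -1.0, 4.5, 0.0])      # scores from the final linear layer

probs = np.exp(logits - logits.max())
probs /= probs.sum()                               # all positive, sums to 1.0
print(vocab[int(np.argmax(probs))])                # greedy choice: "student"
```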
Combining encoder and decoder
• "Cross-attention": the queries come from the decoder (the output of its masked self-attention), while the keys and values come from the encoder outputs.
• In reality, cross-attention is also multi-headed!
[Figure: decoder block – masked self-attention, then cross attention over the encoder outputs, repeated N times, producing the output]
The Residuals
• Each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
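In code, that wiring is one line per sub-layer (the function names here are illustrative; a layer_norm sketch follows on the next slide):

```python
# Residual connection around a sub-layer, followed by layer normalization.
# `sublayer` is e.g. self-attention or the position-wise FFNN.
def residual_block(x, sublayer, layer_norm):
    return layer_norm(x + sublayer(x))
```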
Layer normalization
• Main idea: batch normalization is very helpful, but it is hard to use with sequence models. Sequences have different lengths, which makes normalizing across the batch hard, and sequences can be very long, so we sometimes have small batches.
• Simple solution: "layer normalization" – like batch norm, but not across the batch: each example is normalized across its own features.
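A minimal NumPy sketch of layer normalization (the epsilon and the optional gain/bias parameters are standard details assumed here, not stated on the slide): each example is normalized to zero mean and unit variance across its own features.

```python
import numpy as np

def layer_norm(x, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize each example (last axis) to mean 0 and variance 1, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias
```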
The Transformer Architecture
Composed of an encoder and a decoder.
• Encoder:
  • a stack of multiple identical blocks
  • each block has two sublayers:
    1. a multi-head self-attention pooling (queries, keys, and values all come from the outputs of the previous encoder layer)
    2. a position-wise feed-forward network
  • a residual connection is employed around each sublayer, followed by layer normalization
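Putting the encoder pieces together, here is a hedged sketch of one encoder block (the column-wise layer norm, the ReLU feed-forward sizes, and passing the self-attention sublayer in as a callable, e.g. the multi-head sketch above, are assumptions, not the reference implementation):

```python
import numpy as np

def norm_cols(X, eps=1e-5):
    """Layer-normalize each column (each position's feature vector)."""
    mu = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def position_wise_ffn(X, W1, b1, W2, b2):
    """The same two-layer ReLU network applied independently to each position (column)."""
    return W2 @ np.maximum(0.0, W1 @ X + b1) + b2

def encoder_block(X, self_attn, ffn_params):
    """One encoder block: self-attention sublayer, then position-wise FFN sublayer,
    each wrapped in a residual connection followed by layer normalization."""
    X = norm_cols(X + self_attn(X))                                  # sublayer 1 + residual + LN
    W1, b1, W2, b2 = ffn_params
    return norm_cols(X + position_wise_ffn(X, W1, b1, W2, b2))       # sublayer 2 + residual + LN
```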
The Transformer Architecture
Decoder:
• a stack of multiple identical blocks
• each block has three sublayers:
  1. a multi-head self-attention pooling – each position in the decoder is only allowed to attend to positions in the decoder up to and including that position (masked self-attention)
  2. encoder-decoder attention: the queries come from the outputs of the previous decoder layer, and the keys and values come from the Transformer encoder outputs
  3. a position-wise feed-forward network
• a residual connection is employed around each sublayer, followed by layer normalization
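A small sketch of the masking used in the decoder's self-attention (the set-to-minus-infinity-before-softmax trick is the usual implementation, assumed here): scores for positions after the current one are blocked so that, after the softmax, their attention weights are exactly zero.

```python
import numpy as np

def masked_softmax_cols(A):
    """Column-wise softmax where query position j (column j) may only attend
    to key positions i <= j (rows at or above the diagonal)."""
    n = A.shape[0]
    allowed = np.triu(np.ones((n, n), dtype=bool))     # True where i <= j
    A = np.where(allowed, A, -np.inf)                  # block future positions
    A = A - A.max(axis=0, keepdims=True)
    e = np.exp(A)
    return e / e.sum(axis=0, keepdims=True)

A = np.random.default_rng(0).normal(size=(4, 4))
print(np.round(masked_softmax_cols(A), 2))             # each column sums to 1; zeros below the diagonal
```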
Layer Norm: https://arxiv.org/abs/1607.06450
Batch Norm: https://www.youtube.com/watch?v=BZh1ltr5Rkg
[Figure: in the Transformer block, $b' = \mathrm{LayerNorm}(b)$ is computed before attending on the input sequence $a$. Batch norm normalizes each feature to $\mu = 0$, $\sigma = 1$ across the batch; layer norm normalizes each example to $\mu = 0$, $\sigma = 1$ across its own features.]
The Transformer
Benefits:
+ Much better long-range connections
+ Much easier to parallelize
+ In practice, can be made much deeper (more layers) than an RNN