
CS60010: Deep Learning

Spring 2023

Sudeshna Sarkar

Transformer - Part 1
Sudeshna Sarkar
10 Mar 2023



Encoder-Decoder Attention

[Figure: RNN encoder-decoder with attention translating "we are eating bread" into "estamos comiendo pan [STOP]". Encoder states h1..h4 are scored against each decoder state, the scores e_{t,i} are passed through a softmax to give attention weights a_{t,i}, and the resulting context vectors c1..c4 are fed (together with the previous output y_{t-1}) to the decoder states s1..s4, starting from [START].]
Attention-only Translation Models

• Problems with recurrent networks:
  • Sequential training and inference: time grows in proportion to sentence length; hard to parallelize.
  • Long-range dependencies have to be remembered across many single time steps.
  • Tricky to learn hierarchical structures ("car", "blue car", "into the blue car", ...)
• Alternative:
  • Convolution – but it has other limitations.
Self-Attention
• Consider an input (or intermediate) sequence.
• Construct the next-level representation, which can choose "where to look" by assigning a weight to each input position.
• Create a weighted sum.

[Figure: Level k+1 is computed from Level k via a softmax over lower-level locations, conditioned on context at lower and higher locations.]
Self-attention
• Each word is a query to form attention over all tokens.
• This generates a context-dependent representation of each token: a weighted sum of all tokens.
• The attention weights dynamically mix how much is taken from each token.
Comparing CNNs, RNNs, and Self-Attention

Self-Attention "Transformers"
• Constant path length between any two positions.
• Variable receptive field (up to the whole input sequence).
• Supports hierarchical information flow by stacking self-attention layers.
• Trivial to parallelize.
• Attention weighting controls information propagation.
• If we have attention, do we even need recurrent connections?
• Can we transform our RNN into a purely attention-based model?

Vaswani et al., "Attention Is All You Need", arXiv 2017
https://arxiv.org/abs/1706.03762
Transformer Model (Vaswani et al. 2017)

• Transformers map sequences of input vectors $x_1, x_2, \dots, x_n$ to sequences of output vectors $y_1, y_2, \dots, y_m$.
• Made up of stacks of Transformer blocks.
  • Blocks combine linear layers, feedforward networks, and self-attention layers.
• Self-attention allows a network to directly extract and use information from arbitrarily large contexts.
Dot-Product Attention: Queries, Keys, and Values

• For each element in the sequence $S$ (length $m$) define a key, a value, and a query.
• Score each key against the query with a scaled dot product:

$$\alpha(q, k_i) = \frac{q^\top k_i}{\sqrt{d}}$$

• Normalize the attention weights to sum to 1 and be non-negative:

$$\tilde{\alpha}(q, k_i) = \mathrm{softmax}_i\,\alpha(q, k_i) = \frac{\exp\!\left(q^\top k_i / \sqrt{d}\right)}{\sum_j \exp\!\left(q^\top k_j / \sqrt{d}\right)}$$

• Output the weighted sum of the values:

$$\mathrm{Attention}(q, S) \stackrel{\text{def}}{=} \sum_{i=1}^{m} \tilde{\alpha}(q, k_i)\, v_i$$
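To make these formulas concrete, here is a minimal NumPy sketch of dot-product attention for a single query against a sequence of keys and values (the shapes and variable names are illustrative, not from the slides):

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Scaled dot-product attention for a single query q.

    q: (d,) query vector
    K: (m, d) one key per sequence element
    V: (m, d_v) one value per sequence element
    Returns the attention output (d_v,) and the weights (m,).
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)               # alpha(q, k_i) = q^T k_i / sqrt(d)
    weights = np.exp(scores - scores.max())   # subtract max for numerical stability
    weights = weights / weights.sum()         # softmax: non-negative, sums to 1
    return weights @ V, weights               # Attention(q, S) = sum_i alpha~_i v_i

# Tiny usage example with random vectors
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 8))
out, w = dot_product_attention(q, K, V)
print(out.shape, round(w.sum(), 6))           # (8,) 1.0
```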
Is Attention All We Need?

• Attention can in principle do everything that recurrence can, and more!
• But this has a few issues we must overcome:
  1. The encoder has no temporal dependencies.

[Figure: attention-only translation of "A cute puppy" into "Un chiot mignon", decoded starting from <START>.]
From Self-Attention to Transformers
1. Positional encoding: addresses the lack of sequence-order information.
2. Multi-headed attention: allows querying multiple positions at each layer.
3. Adding nonlinearities.
4. Masked decoding: how do we prevent attention lookups into the future?


Self-Attention Explained

• Each output $b^i$ is obtained based on the whole input sequence.
• $b^1, b^2, b^3, b^4$ can be computed in parallel.

[Figure: a self-attention layer maps inputs $a^1, a^2, a^3, a^4$ to outputs $b^1, b^2, b^3, b^4$.]
• Query (to match others): $q^i = W^q a^i$
• Key (to be matched): $k^i = W^k a^i$
• Value (information to be extracted): $v^i = W^v a^i$

Each input token $x^i$ is first embedded as $a^i = W x^i$; every $a^i$ then produces a triple $(q^i, k^i, v^i)$.
Scaled Dot-Product Attention

Here $d$ is the dimension of $q$ and $k$. The first query is scored against every key with a scaled dot product:

$$\alpha_{1,i} = q^1 \cdot k^i / \sqrt{d}$$

[Figure: $q^1$ is dotted with $k^1, \dots, k^4$ to produce the scores $\alpha_{1,1}, \dots, \alpha_{1,4}$.]
Considering the whole sequence, the softmax-normalized weights give the first output:

$$b^1 = \sum_i \hat{\alpha}_{1,i}\, v^i$$

[Figure: $b^1$ is the weighted sum of the values $v^1, \dots, v^4$ with weights $\hat{\alpha}_{1,1}, \dots, \hat{\alpha}_{1,4}$.]
Likewise for the second position:

$$b^2 = \sum_i \hat{\alpha}_{2,i}\, v^i$$

[Figure: $b^2$ is the weighted sum of the same values $v^1, \dots, v^4$ with weights $\hat{\alpha}_{2,1}, \dots, \hat{\alpha}_{2,4}$.]
$b^1, b^2, b^3, b^4$ can be computed in parallel.

[Figure: a single self-attention layer maps $a^1, \dots, a^4$ (the embeddings of $x^1, \dots, x^4$) to $b^1, \dots, b^4$.]
In matrix form, with $I = [a^1\ a^2\ a^3\ a^4]$ stacking the embeddings as columns:

$$Q = [q^1\ q^2\ q^3\ q^4] = W^q I \qquad (q^i = W^q a^i)$$
$$K = [k^1\ k^2\ k^3\ k^4] = W^k I \qquad (k^i = W^k a^i)$$
$$V = [v^1\ v^2\ v^3\ v^4] = W^v I \qquad (v^i = W^v a^i)$$
Self-attention

Each score is a dot product of a key with the query (ignoring $\sqrt{d}$ for simplicity), so all scores for $q^1$ can be computed in one matrix-vector product:

$$\begin{bmatrix} \alpha_{1,1} \\ \alpha_{1,2} \\ \alpha_{1,3} \\ \alpha_{1,4} \end{bmatrix}
= \begin{bmatrix} (k^1)^\top \\ (k^2)^\top \\ (k^3)^\top \\ (k^4)^\top \end{bmatrix} q^1,
\qquad \alpha_{1,i} = (k^i)^\top q^1 .$$
Self-attention

Stacking all queries gives the full score matrix $A$ and its column-wise softmax $\hat{A}$:

$$A = K^\top Q, \qquad \hat{A} = \mathrm{softmax}(A) \ \text{(column-wise)},$$

where column $j$ of $A$ holds the scores of query $q^j$ against all keys, and $b^2 = \sum_i \hat{\alpha}_{2,i}\, v^i$ as before.
Self-attention

All outputs are then computed at once as a single matrix product:

$$O = [b^1\ b^2\ b^3\ b^4] = V \hat{A},$$

i.e. $b^j = \sum_i \hat{\alpha}_{j,i}\, v^i$ for every position $j$.
Self-attention (summary)

A self-attention layer maps the input matrix $I$ (embeddings as columns) to the output matrix $O$:

$$Q = W^q I, \qquad K = W^k I, \qquad V = W^v I,$$
$$A = K^\top Q, \qquad \hat{A} = \mathrm{softmax}(A), \qquad O = V \hat{A}.$$
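A minimal NumPy sketch of this matrix form, following the slides' convention that tokens are stored as the columns of $I$; the $\sqrt{d}$ scaling (ignored on the earlier slides for simplicity) is included here, and all shapes are illustrative assumptions:

```python
import numpy as np

def self_attention(I, Wq, Wk, Wv):
    """Matrix-form self-attention: columns of I are the token embeddings a^i.

    I:  (d_model, n) input embeddings, one column per token
    Wq, Wk, Wv: (d, d_model) projection matrices
    Returns O of shape (d, n), one output column b^i per token.
    """
    Q, K, V = Wq @ I, Wk @ I, Wv @ I            # Q = W^q I, K = W^k I, V = W^v I
    A = K.T @ Q / np.sqrt(Q.shape[0])           # A = K^T Q (scaled)
    A_hat = np.exp(A - A.max(axis=0, keepdims=True))
    A_hat /= A_hat.sum(axis=0, keepdims=True)   # column-wise softmax -> A_hat
    return V @ A_hat                            # O = V A_hat

rng = np.random.default_rng(0)
I = rng.normal(size=(8, 5))                     # d_model = 8, n = 5 tokens
Wq, Wk, Wv = (rng.normal(size=(4, 8)) for _ in range(3))
print(self_attention(I, Wq, Wk, Wv).shape)      # (4, 5)
```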
Self-Attention (stacked)

[Figure: stacked self-attention "layers" — keep repeating until we've processed the input enough; at the end, decode it into an answer.]
Multiple Attention Heads

• Multiple attention heads can learn to attend in different ways.
• Requires additional parameters to compute different attention values and transform vectors.

Notation: $k$ is the level (layer) number, $L$ is the number of heads, $X = \mathbf{x}_1, \dots, \mathbf{x}_n$, and $\mathbf{x}_i^1 = \mathbf{x}_i$:

$$\tilde{\alpha}_{i,j}^{k,l} = \mathbf{x}_i^{k-1}\, \mathbf{W}^{k,l}\, \mathbf{x}_j^{k-1}$$
$$\alpha_i^{k,l} = \mathrm{softmax}\!\left(\tilde{\alpha}_{i,1}^{k,l}, \dots, \tilde{\alpha}_{i,n}^{k,l}\right)$$
$$\mathbf{x}_i^{k,l} = \sum_{j=1}^{n} \alpha_{i,j}^{k,l}\, \mathbf{x}_j^{k-1}$$
$$\mathbf{x}_i^{k} = V^{k}\left[\mathbf{x}_i^{k,1}; \dots; \mathbf{x}_i^{k,L}\right]$$
Multi-head Self-attention (2 heads as example)

• Each head applies its own projections to the per-token query, key, and value:

$$q^{i,1} = W^{q,1} q^i, \qquad q^{i,2} = W^{q,2} q^i \qquad (\text{and similarly for } k^{i,l} \text{ and } v^{i,l}),$$

where $q^i = W^q a^i$ as before.
• Each head $l$ performs self-attention with its own $(q^{i,l}, k^{i,l}, v^{i,l})$ triples, producing outputs $b^{i,1}$ and $b^{i,2}$ at every position $i$.
• The head outputs are concatenated and linearly combined:

$$b^i = W^O \begin{bmatrix} b^{i,1} \\ b^{i,2} \end{bmatrix}.$$
Positional Encoding

• Self-attention has no notion of token order, so each position $i$ gets a unique positional vector $e^i$ (not learned from data).
• Two equivalent views: append a one-hot position vector $p^i$ (1 in the $i$-th dimension) to each input $x^i$, or simply add $e^i = W^P p^i$ to the embedding $a^i = W^I x^i$:

$$W \begin{bmatrix} x^i \\ p^i \end{bmatrix} = \begin{bmatrix} W^I & W^P \end{bmatrix} \begin{bmatrix} x^i \\ p^i \end{bmatrix} = W^I x^i + W^P p^i = a^i + e^i .$$
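The fixed (not learned) positional vectors $e^i$ used in Vaswani et al. are the sinusoidal encoding; here is a short NumPy sketch of that choice (shapes and names are illustrative, and it assumes an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed positional vectors e^i (not learned), as in Vaswani et al. 2017.

    Returns an array of shape (max_len, d_model); row i is e^i.
    Assumes d_model is even.
    """
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2k
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2k / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions: cos
    return pe

E = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(E.shape)   # (50, 16); add E[i] to the embedding a^i of the token at position i
```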
Positional encoding: learned

• Another idea: just learn a positional encoding.
• These are not different for every input sequence: the same learned values are used for every sequence, but they differ across time steps.
• How many values do we need to learn? dimensionality × max sequence length.
• + More flexible (and perhaps more optimal) than the sin/cos encoding.
• – A bit more complex: we need to pick a max sequence length (and can't generalize beyond it).
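For contrast, a minimal sketch of the learned alternative: the positional encoding is just a dimensionality × max-sequence-length parameter matrix, trained with the rest of the model (the initialization scale and names below are illustrative, and the training itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 50, 16

# A learned positional encoding is simply a (max_len, d_model) parameter matrix:
# max_len * d_model values, initialized randomly and updated by gradient descent.
# The same rows are reused for every sequence, but each time step gets its own row.
P = rng.normal(scale=0.02, size=(max_len, d_model))

tokens = rng.normal(size=(10, d_model))   # embeddings of a length-10 sequence (rows = positions)
inputs = tokens + P[:10]                  # add position i's learned vector to token i
print(inputs.shape)                       # (10, 16)
```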
Multi-head attention

• Since we are relying entirely on attention now, we might want to incorporate information from more than one time step.
• Because of the softmax, a single attention distribution will be dominated by one value.
• It is hard to specify that you want two different things (e.g., the subject and the object in a sentence).
Multi-head attention

• Idea: have multiple keys, queries, and values for every time step!
• Around 8 heads seems to work pretty well for big models.
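A minimal NumPy sketch of multi-head self-attention in the column-vector convention used above; the per-head projections and the output matrix $W^O$ follow the two-head slides earlier, but the specific head count and shapes are illustrative assumptions:

```python
import numpy as np

def softmax_cols(A):
    """Column-wise softmax: each column is non-negative and sums to 1."""
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def multi_head_self_attention(I, Wq, Wk, Wv, Wo):
    """Multi-head self-attention; columns of I are tokens.

    I:          (d_model, n)
    Wq, Wk, Wv: lists with one (d_head, d_model) projection per head
    Wo:         (d_model, n_heads * d_head) output projection W^O
    """
    heads = []
    for Wq_l, Wk_l, Wv_l in zip(Wq, Wk, Wv):                 # one attention pass per head
        Q, K, V = Wq_l @ I, Wk_l @ I, Wv_l @ I
        A_hat = softmax_cols(K.T @ Q / np.sqrt(Q.shape[0]))
        heads.append(V @ A_hat)                              # head output b^{i,l} for all positions i
    return Wo @ np.concatenate(heads, axis=0)                # b^i = W^O [b^{i,1}; ...; b^{i,L}]

rng = np.random.default_rng(0)
d_model, d_head, n, L = 16, 8, 5, 2
I = rng.normal(size=(d_model, n))
Wq = [rng.normal(size=(d_head, d_model)) for _ in range(L)]
Wk = [rng.normal(size=(d_head, d_model)) for _ in range(L)]
Wv = [rng.normal(size=(d_head, d_model)) for _ in range(L)]
Wo = rng.normal(size=(d_model, L * d_head))
print(multi_head_self_attention(I, Wq, Wk, Wv, Wo).shape)    # (16, 5)
```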
Alternating self-attention & nonlinearity

• To add nonlinearity, apply a small neural net at every position after every self-attention layer.
• This is sometimes referred to as a "position-wise feedforward network".
• We'll describe some specific commonly used choices shortly.

[Figure: self-attention "layers" alternate with position-wise feedforward networks.]
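A minimal sketch of such a position-wise feedforward network: the same two-layer net is applied independently at every position. The 2-layer ReLU form and the wider hidden size follow the usual choice in Vaswani et al.; all names and sizes are illustrative:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same 2-layer MLP independently at every position (column) of X.

    X:  (d_model, n) tokens as columns
    W1: (d_ff, d_model), b1: (d_ff, 1)
    W2: (d_model, d_ff), b2: (d_model, 1)
    """
    hidden = np.maximum(0.0, W1 @ X + b1)   # ReLU gives the nonlinearity
    return W2 @ hidden + b2                 # project back to d_model per position

rng = np.random.default_rng(0)
d_model, d_ff, n = 16, 64, 5
X = rng.normal(size=(d_model, n))
W1, b1 = rng.normal(size=(d_ff, d_model)), np.zeros((d_ff, 1))
W2, b2 = rng.normal(size=(d_model, d_ff)), np.zeros((d_model, 1))
print(position_wise_ffn(X, W1, b1, W2, b2).shape)   # (16, 5) -- same shape as the input
```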
Self-attention can see the future!

A crude self-attention "language model" (in reality, we would have many alternating self-attention layers and position-wise feedforward networks, not just one):

• Big problem: self-attention at step 1 can look at the values at steps 2 & 3, which are based on the inputs at steps 2 & 3.
• At test time (when decoding), the inputs at steps 2 & 3 will be based on the output at step 1...
• ...which requires knowing the inputs at steps 2 & 3.
Masked attention

In the crude self-attention "language model" above, at test time (when decoding) the inputs at steps 2 & 3 will be based on the output at step 1, which in turn requires knowing the inputs at steps 2 & 3.

• We must allow self-attention into the past...
• ...but not into the future.
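A minimal NumPy sketch of masked (causal) self-attention in the column convention used above: scores from future positions are set to −∞ before the softmax, so they receive zero weight. Shapes and names are illustrative:

```python
import numpy as np

def masked_self_attention(I, Wq, Wk, Wv):
    """Self-attention with a causal mask: the query at position j may attend
    only to keys at positions i <= j (the past), never the future.
    Columns of I are tokens in order."""
    Q, K, V = Wq @ I, Wk @ I, Wv @ I
    d, n = Q.shape
    A = K.T @ Q / np.sqrt(d)                              # A[i, j] = score of key i for query j
    future = np.tril(np.ones((n, n), dtype=bool), k=-1)   # True where key index i > query index j
    A[future] = -np.inf                                   # blocked entries get zero weight
    A_hat = np.exp(A - A.max(axis=0, keepdims=True))
    A_hat /= A_hat.sum(axis=0, keepdims=True)             # column-wise softmax
    return V @ A_hat

rng = np.random.default_rng(0)
I = rng.normal(size=(8, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 8)) for _ in range(3))
out = masked_self_attention(I, Wq, Wk, Wv)
print(out.shape)   # (4, 4); the first output column depends only on the first input token
```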
Transformer summary
• These models are generally called "Transformers" because they transform one sequence into another at each layer.
• Alternate self-attention "layers" with nonlinear position-wise feedforward networks (to get nonlinear transformations).
• Use positional encoding (on the input or input embedding) to make the model aware of the relative positions of tokens.
• Use multi-head attention.
• Use masked attention if you want to use the model for decoding.
The "classic" transformer

As compared to a sequence-to-sequence RNN model:

[Figure: the encoder is a position-wise encoder followed by self-attention "layers" and position-wise nonlinear networks, repeated N times; the decoder is a position-wise encoder followed by masked self-attention, cross attention (we'll discuss how this bit works soon), and position-wise nonlinear networks, repeated N times, ending in a position-wise softmax.]
The Final Linear and Softmax Layer

• A final linear layer followed by a softmax outputs the next word.
• Suppose our model knows 10,000 unique English words (its "output vocabulary") learned from its training dataset; the linear layer produces one score per vocabulary word.
• The softmax layer then turns those scores into probabilities (all positive, adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
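A toy sketch of this final step (the vocabulary, weight names, and greedy choice below are illustrative, not the slides' exact setup):

```python
import numpy as np

def predict_next_word(h, W_vocab, b_vocab, vocab):
    """Final linear layer + softmax over the output vocabulary.

    h:       (d_model,) decoder output at the current time step
    W_vocab: (vocab_size, d_model) final linear layer, b_vocab: (vocab_size,)
    vocab:   list of vocab_size word strings
    """
    scores = W_vocab @ h + b_vocab              # one score (logit) per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                        # softmax: all positive, sums to 1.0
    return vocab[int(np.argmax(probs))], probs  # pick the highest-probability word

rng = np.random.default_rng(0)
vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy 6-word "output vocabulary"
d_model = 8
W_vocab, b_vocab = rng.normal(size=(len(vocab), d_model)), np.zeros(len(vocab))
word, probs = predict_next_word(rng.normal(size=d_model), W_vocab, b_vocab, vocab)
print(word, round(probs.sum(), 6))
```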
Combining encoder and decoder: "Cross-attention"

• In cross attention, the decoder attends over keys and values produced from the encoder output.
• This is much more like the standard attention from the previous lecture.
• In reality, cross-attention is also multi-headed!

[Figure: the encoder stack (self-attention "layer" + position-wise nonlinear network, repeated N times) supplies the values for the cross-attention layer in each decoder block (masked self-attention → cross attention → position-wise nonlinear network, repeated N times).]
The Residuals

• Each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
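A minimal sketch of this "residual + layer norm" pattern (the scalar gain and bias in layer_norm are a simplification; in practice they are learned per-feature vectors):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each position (column) across its feature dimension."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def sublayer_with_residual(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization:
    LayerNorm(x + Sublayer(x)) -- used around both self-attention and the FFN."""
    return layer_norm(x + sublayer(x))

# Toy usage: wrap an arbitrary position-wise function
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 5))                          # (d_model, n) tokens as columns
out = sublayer_with_residual(x, np.tanh)
print(out.shape, np.allclose(out.mean(axis=0), 0))    # (16, 5) True -- each column has zero mean
```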

Layer normalization

• Main idea: batch normalization is very helpful, but it is hard to use with sequence models.
  • Sequences have different lengths, which makes normalizing across the batch hard.
  • Sequences can be very long, so we sometimes have small batches.
• Simple solution: "layer normalization" – like batch norm, but not across the batch.

[Figure: batch norm normalizes each feature dimension across the batch; layer norm normalizes across the feature dimensions of a single example.]
The Transformer Architecture
Composed of an encoder and a decoder.
• Encoder:
  • a stack of multiple identical blocks
  • each block has two sublayers:
    1. a multi-head self-attention pooling (queries, keys, and values all come from the outputs of the previous encoder layer)
    2. a positionwise feed-forward network
  • A residual connection is employed around both sublayers, followed by layer normalization.
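Putting the earlier sketches together, here is one encoder block; it assumes the self_attention, position_wise_ffn, and sublayer_with_residual functions sketched above are in scope, uses a single attention head for brevity, and treats all shapes as illustrative:

```python
import numpy as np

# Assumes the earlier sketches are in scope:
#   self_attention(I, Wq, Wk, Wv), position_wise_ffn(X, W1, b1, W2, b2),
#   sublayer_with_residual(x, sublayer)

def encoder_block(X, p):
    """One encoder block: two sublayers, each with a residual connection
    followed by layer normalization. X is (d_model, n), tokens as columns."""
    # Sublayer 1: self-attention (single head here for brevity); queries, keys,
    # and values all come from the previous layer's output X.
    X = sublayer_with_residual(X, lambda Z: self_attention(Z, p["Wq"], p["Wk"], p["Wv"]))
    # Sublayer 2: position-wise feed-forward network.
    X = sublayer_with_residual(
        X, lambda Z: position_wise_ffn(Z, p["W1"], p["b1"], p["W2"], p["b2"]))
    return X

rng = np.random.default_rng(0)
d_model, d_ff, n = 16, 64, 5
p = {"Wq": rng.normal(size=(d_model, d_model)),
     "Wk": rng.normal(size=(d_model, d_model)),
     "Wv": rng.normal(size=(d_model, d_model)),   # keep d_model outputs so the residual adds up
     "W1": rng.normal(size=(d_ff, d_model)), "b1": np.zeros((d_ff, 1)),
     "W2": rng.normal(size=(d_model, d_ff)), "b2": np.zeros((d_model, 1))}
X = rng.normal(size=(d_model, n))
for _ in range(6):              # the paper stacks 6 identical blocks
    X = encoder_block(X, p)     # (a real model uses different weights in each block)
print(X.shape)                  # (16, 5)
```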
The Transformer Architecture
• Decoder:
  • a stack of multiple identical blocks
  • each block has three sublayers:
    1. a multi-head self-attention pooling – each position in the decoder is only allowed to attend to positions in the decoder up to and including that position
    2. encoder-decoder attention: queries come from the outputs of the previous decoder layer, and the keys and values come from the Transformer encoder outputs
    3. a positionwise feed-forward network
  • A residual connection is employed around each sublayer, followed by layer normalization.
Transformer block details

• Layer Norm (https://arxiv.org/abs/1607.06450): normalizes each example across its layer's features to μ = 0, σ = 1.
• Batch Norm (https://www.youtube.com/watch?v=BZh1ltr5Rkg): normalizes each feature across the batch to μ = 0, σ = 1.

[Figure: each sublayer output b is added to its input a and layer-normalized to give b'. The decoder's masked self-attention attends on the generated sequence, while the encoder-decoder attention attends on the input sequence.]
Putting it all together: The Transformer

• The decoder decodes one position at a time, using masked attention.
• 6 layers, each with d = 512.
• Encoder layer: multi-head self-attention (which concatenates the attention from all heads), a residual connection with layer norm (LN), a 2-layer neural net at each position, and another residual connection with LN.
• Decoder layer: essentially the same as an encoder layer, only with masked self-attention, plus multi-head cross attention (again with a residual connection and LN).

Vaswani et al. Attention Is All You Need. 2017.


Why transformers?
Downsides:
- Attention computations are technically O(n²).
- Somewhat more complex to implement (positional encodings, etc.).

Benefits:
+ Much better long-range connections.
+ Much easier to parallelize.
+ In practice, can be made much deeper (more layers) than an RNN.

The benefits seem to vastly outweigh the downsides, and transformers work much better than RNNs (and LSTMs) in many cases.

Arguably one of the most important sequence-modeling improvements of the past decade.
