Diagrams V2s
Umar Jamil
Downloaded from: https://github.com/hkproj/transformer-from-scratch-notes
Video: https://www.youtube.com/watch?v=bCz4OMemCcA
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0):
https://creativecommons.org/licenses/by-nc/4.0/legalcode
[Figure: a recurrent network unrolled over N time steps — each RNN cell receives one token X1, X2, X3, …, XN together with the hidden state of the previous step (starting from the initial state <0>), so the tokens must be processed sequentially.]

Original sentence (tokens): YOUR CAT IS A LOVELY CAT
We define d_model = 512, which represents the size of the embedding vector of each word.

[Figure: each token of "YOUR CAT IS A LOVELY CAT" is mapped to an embedding (vector of size 512), to which the position embedding (also a vector of size 512, only computed once and reused for every sentence) is added element-wise. All values shown are random.]
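The embedding + position-embedding sum can be sketched in a few lines of numpy (the token ids, vocabulary size, and random matrices below are toy assumptions; in the real model the embedding matrix is learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
vocab_size = 100                      # toy vocabulary size (assumption)
tokens = [3, 17, 42, 7, 99, 42]      # "YOUR CAT IS A LOVELY CAT" as toy ids

embedding = rng.normal(size=(vocab_size, d_model))       # learned in practice
pos_embedding = rng.normal(size=(len(tokens), d_model))  # sinusoidal in the paper

x = embedding[tokens] + pos_embedding  # (seq, d_model), the encoder input
```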
PE(pos, 2i)     = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i + 1) = cos( pos / 10000^(2i / d_model) )

[Figure: the same positional encodings PE(0, 0), PE(1, 0), PE(2, 0), …, PE(0, 1), PE(1, 1), PE(2, 1), … are reused for every sentence, e.g. Sentence 2: I LOVE YOU.]
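The two formulas above can be computed directly — a minimal numpy sketch (the matrix is built once and reused for every sentence):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # (seq, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even indices: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)              # odd indices:  PE(pos, 2i + 1)
    return pe

pe = positional_encoding(6, 512)             # e.g. PE(0, 0) = sin(0) = 0
```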
Attention(Q, K, V) = softmax( Q·K^T / sqrt(d_k) ) · V

[Figure: Q (seq, d_model) is multiplied by K^T; the result, divided by sqrt(d_k) and passed through a row-wise softmax, is a (6, 6) matrix of attention weights in which each row (e.g. the rows for IS and LOVELY) sums to 1.]
* All values are random.
* For simplicity I considered only one head, which makes d_model = d_k.
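As a minimal numpy sketch of the formula above (single head, random values, so d_k = d_model as in the note):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq, seq)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 512))    # one head, so d_k = d_model
out, w = attention(X, X, X)      # self-attention: Q = K = V = X
```

Each row of `w` is a probability distribution over the 6 tokens, which is why every row of the (6, 6) matrix sums to 1.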
[Figure — Multi-Head Attention:
The input X (seq, d_model) is multiplied by W_Q, W_K, W_V (each d_model, d_model) to give Q', K', V' (seq, d_model).
Q', K', V' are split along the embedding dimension into h heads Q1…Q4, K1…K4, V1…V4.
Each HEAD_i computes Attention(Q_i, K_i, V_i) = softmax( Q_i·K_i^T / sqrt(d_k) ) · V_i, producing a (seq, d_v) matrix; the (6, 6) softmax matrix has one row per token (YOUR, CAT, IS, A, LOVELY, CAT).
HEAD_1…HEAD_4 are concatenated into H (seq, h·d_v) and multiplied by W_O (h·d_v, d_model) to give MH-A (seq, d_model).
seq = sequence length.]
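The whole multi-head pipeline above can be sketched in numpy (random weights stand in for the learned W_Q, W_K, W_V, W_O; h = 8 heads is an assumption matching the paper's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, h = 6, 512, 8
d_k = d_v = d_model // h                 # 64 per head

X = rng.normal(size=(seq, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_o = rng.normal(size=(h * d_v, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Qp, Kp, Vp = X @ W_q, X @ W_k, X @ W_v   # Q', K', V': (seq, d_model)
heads = []
for i in range(h):                       # split along the embedding dimension
    s = slice(i * d_k, (i + 1) * d_k)
    Q, K, V = Qp[:, s], Kp[:, s], Vp[:, s]
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)  # HEAD_i: (seq, d_v)

H = np.concatenate(heads, axis=-1)       # (seq, h * d_v)
mh_a = H @ W_o                           # MH-A: (seq, d_model)
```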
[Figure — Keys and Values analogy: the query "love" is compared against the keys (ROMANTIC, SCIFI) and retrieves the corresponding values (TITANIC, INCEPTION), like a lookup in a key-value store.]
[Figure — Layer Normalization: for each item in the batch, a mean (μ1, μ2, μ3) and a variance (σ1², σ2², σ3²) are computed independently over its features, and the features are normalized with them.]
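A minimal numpy sketch of layer normalization (the learned scale/shift parameters gamma and beta are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # one mean/variance pair per item (row), unlike batch norm,
    # which normalizes across the batch dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)   # gamma/beta omitted

x = np.random.default_rng(0).normal(size=(3, 512))
y = layer_norm(x)   # each row now has mean ~0 and std ~1
```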
The encoder outputs, for each word, a vector that captures not only its meaning (the embedding) and its position, but also its interaction with the other words by means of multi-head attention.

The Linear layer maps the decoder output from (seq, d_model) to (seq, vocab_size).

[Figure: Encoder Input (seq, d_model) → Encoder → Encoder Output (seq, d_model); Decoder Input (seq, d_model) → Decoder → Decoder Output (seq, d_model) → Linear; target sentence: "Ti amo molto".]
Inference — Time Step = 1
We select from the vocabulary the token at the position with the maximum softmax value.
[Figure: Encoder input: <SOS> I love you very much <EOS>; Decoder input: <SOS>. Encoder Output and Decoder Output are (seq, d_model); after the Linear layer, the Softmax output is (seq, vocab_size).]
* Both sequences will have the same length thanks to padding.
Inference — Time Step = 2
Since the decoder input now contains two tokens, we select the softmax output corresponding to the second token.
[Figure: Encoder input: <SOS> I love you very much <EOS>; Decoder input: <SOS> ti — the previously output word is appended to the decoder input. Decoder Output is (seq, d_model); after the Linear layer, the Softmax output is (seq, vocab_size).]
Inference — Time Step = 3
Since the decoder input now contains three tokens, we select the softmax output corresponding to the third token.
[Figure: Encoder input: <SOS> I love you very much <EOS>; Decoder input: <SOS> ti amo — the previously output word is appended to the decoder input. Decoder Output is (seq, d_model); after the Linear layer, the Softmax output is (seq, vocab_size).]
Inference — Time Step = 4
Since the decoder input now contains four tokens, we select the softmax output corresponding to the fourth token.
[Figure: Encoder input: <SOS> I love you very much <EOS>; Decoder input: <SOS> ti amo molto — the previously output word is appended to the decoder input. Decoder Output is (seq, d_model); after the Linear layer, the Softmax output is (seq, vocab_size).]
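The four time steps above are one loop: encode once, then repeatedly decode, pick the token with the maximum value at the last position, and append it. A sketch of that greedy loop, where `encode` and `decode` are hypothetical stand-ins for the encoder and decoder stacks (not functions from the notes):

```python
def greedy_translate(encode, decode, src_tokens, sos, eos, max_len=50):
    """Greedy decoding: always take the argmax token at the newest position."""
    enc = encode(src_tokens)                 # run the encoder only once
    out = [sos]                              # decoder input starts with <SOS>
    for _ in range(max_len):
        logits = decode(enc, out)            # (len(out), vocab_size)
        next_tok = int(logits[-1].argmax())  # position of the maximum value
        if next_tok == eos:                  # stop when <EOS> is produced
            break
        out.append(next_tok)                 # append and repeat
    return out[1:]                           # drop the <SOS> marker
```

In practice beam search is often used instead of this pure argmax selection, but the append-and-rerun structure is the same.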