

Transformer Model (2/2):

From Shallow to Deep

Shusen Wang
Multi-Head Attention
Single-Head Self-Attention
• Self-attention layer: 𝐂 = Attn(𝐗, 𝐗).
• This is called “single-head self-attention”.

[Figure: inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Self-Attention Layer (parameters: 𝐖_Q, 𝐖_K, 𝐖_V) → outputs 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:m]
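Below is a minimal NumPy sketch of a single-head self-attention layer 𝐂 = Attn(𝐗, 𝐗), with one input vector per column as in the slides. The scaled-softmax formula, the toy sizes, and the helper names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax(a, axis=0):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_self_attention(X, W_Q, W_K, W_V):
    Q = W_Q @ X                                   # queries, d x m
    K = W_K @ X                                   # keys,    d x m
    V = W_V @ X                                   # values,  d x m
    # column j of A holds the attention weights used to compute c_:j
    A = softmax(K.T @ Q / np.sqrt(K.shape[0]), axis=0)   # m x m
    return V @ A                                  # C = Attn(X, X), d x m

# toy example with made-up sizes
d0, d, m = 8, 4, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(d0, m))
W_Q, W_K, W_V = (rng.normal(size=(d, d0)) for _ in range(3))
C = single_head_self_attention(X, W_Q, W_K, W_V)
print(C.shape)   # (4, 5), i.e. d x m
```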
Multi-Head Self-Attention
• Use 𝑙 single-head self-attentions, which do not share parameters.
• Each single-head self-attention has 3 parameter matrices (𝐖_Q, 𝐖_K, 𝐖_V), so a multi-head self-attention has 3𝑙 parameter matrices in total.
• Concatenate the outputs of the 𝑙 single-head self-attentions.
• If each single-head output is a 𝑑×𝑚 matrix, the multi-head output has shape (𝑙𝑑)×𝑚 (a code sketch follows the figure below).

[Figure: inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Multi-Head Self-Attention Layer → outputs 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:m]
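A small sketch of the multi-head self-attention just described: 𝑙 single heads with separate parameters, their 𝑑×𝑚 outputs stacked into an (𝑙𝑑)×𝑚 matrix. It reuses single_head_self_attention, rng, X, d, and d0 from the previous sketch.

```python
def multi_head_self_attention(X, params):
    # params: list of l (W_Q, W_K, W_V) triples, one per head (no parameter sharing)
    heads = [single_head_self_attention(X, W_Q, W_K, W_V)
             for (W_Q, W_K, W_V) in params]
    return np.concatenate(heads, axis=0)          # (l*d) x m

l = 3
params = [tuple(rng.normal(size=(d, d0)) for _ in range(3)) for _ in range(l)]
C_multi = multi_head_self_attention(X, params)
print(C_multi.shape)   # (12, 5), i.e. (l*d) x m
```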
Multi-Head Attention
• Use 𝑙 single-head attentions, which do not share parameters.
• Concatenate the single-head attentions’ outputs (a code sketch follows the figure below).

[Figure: encoder inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m and decoder inputs 𝐱′1, 𝐱′2, 𝐱′3, ⋯, 𝐱′t → Multi-Head Attention Layer → outputs 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:t]
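The (cross-)attention version works the same way, except that keys and values come from the 𝑚 columns of 𝐗 while queries come from the 𝑡 columns of 𝐗′, so each head’s output has 𝑡 columns. A sketch continuing the ones above; the scaled softmax is again an assumption.

```python
def single_head_attention(X, X_prime, W_Q, W_K, W_V):
    Q = W_Q @ X_prime                             # queries from X', d x t
    K = W_K @ X                                   # keys from X,     d x m
    V = W_V @ X                                   # values from X,   d x m
    A = softmax(K.T @ Q / np.sqrt(K.shape[0]), axis=0)   # m x t
    return V @ A                                  # d x t

def multi_head_attention(X, X_prime, params):
    heads = [single_head_attention(X, X_prime, W_Q, W_K, W_V)
             for (W_Q, W_K, W_V) in params]
    return np.concatenate(heads, axis=0)          # (l*d) x t

t = 4
X_prime = rng.normal(size=(d0, t))
print(multi_head_attention(X, X_prime, params).shape)   # (12, 4), i.e. (l*d) x t
```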


Stacked Self-Attention Layers
Self-Attention Layer

[Figure: inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Multi-Head Self-Attention Layer → outputs 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:m]
Self-Attention Layer + Dense Layer

• Apply a dense layer to each output of the multi-head self-attention layer: 𝐮:i = ReLU(𝐖_U 𝐜:i), for i = 1, ⋯, m.
• The same parameter matrix 𝐖_U is used at every position (a code sketch follows the figure below).

[Figure: 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Multi-Head Self-Attention Layer → 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:m → Dense (one per position) → 𝐮:1, 𝐮:2, 𝐮:3, ⋯, 𝐮:m]
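Because the same weight matrix 𝐖_U is applied to every column, the position-wise dense layer is a single matrix multiply followed by ReLU. A sketch continuing the earlier ones; the output size chosen here is an assumption.

```python
def dense_relu(C, W):
    # applies u_:i = ReLU(W c_:i) to every column i at once
    return np.maximum(0.0, W @ C)

W_U = rng.normal(size=(d0, l * d))    # maps each (l*d)-dim column back to d0 dims
U = dense_relu(C_multi, W_U)
print(U.shape)                        # (8, 5): one output vector u_:i per position
```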
Stacked Self-Attention Layers

[Figure: a multi-head self-attention layer followed by a position-wise dense layer maps 𝐱1, ⋯, 𝐱m to 𝐮:1, ⋯, 𝐮:m; such layers can be stacked, each layer’s outputs becoming the next layer’s inputs: 𝐱1, ⋯, 𝐱m → Multi-Head Self-Attention Layer → Dense → Multi-Head Self-Attention Layer → Dense → 𝐮:1, ⋯, 𝐮:m]
Transformer’s Encoder

[Figure: one encoder block = Self-Attention Layer followed by Dense Layer; its input and output are both 512×𝑚 matrices. The encoder stacks 6 such blocks (Block 1, Block 2, ⋯, Block 6) on the input 𝐗 ∈ ℝ^(512×𝑚).]
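Putting the two pieces together gives one encoder block; stacking 6 of them gives the encoder. The sketch below keeps the 512×𝑚 shape by choosing 8 heads of size 64 (so 𝑙𝑑 = 512); the head sizes and the small random initialization are assumptions, and the residual connections, layer normalization, and positional encodings used by the real Transformer are omitted. It reuses multi_head_self_attention, dense_relu, and rng from the earlier sketches.

```python
def encoder_block(X, attn_params, W_U):
    C = multi_head_self_attention(X, attn_params)    # (l*d) x m = 512 x m
    return dense_relu(C, W_U)                        # 512 x m again

def make_block_params(d_model=512, n_heads=8, d_head=64):
    attn = [tuple(0.02 * rng.normal(size=(d_head, d_model)) for _ in range(3))
            for _ in range(n_heads)]
    W = 0.02 * rng.normal(size=(d_model, n_heads * d_head))
    return attn, W

def encoder(X, blocks):
    for attn_params, W_U in blocks:                  # Block 1, ..., Block 6
        X = encoder_block(X, attn_params, W_U)
    return X                                         # U, a 512 x m matrix

X512 = rng.normal(size=(512, 10))                    # m = 10 input vectors
enc_blocks = [make_block_params() for _ in range(6)]
U512 = encoder(X512, enc_blocks)
print(U512.shape)                                    # (512, 10)
```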


Stacked Attention Layers
Stacked Attentions

• Transformer is a Seq2Seq model (encoder + decoder).


• Encoder’s inputs are vectors 𝐱1, 𝐱2, ⋯, 𝐱m.
• Decoder’s inputs are vectors 𝐱′1, 𝐱′2, ⋯, 𝐱′t.

[Figure: encoder’s inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m; decoder’s inputs 𝐱′1, 𝐱′2, 𝐱′3, ⋯, 𝐱′t]
Stacked Attentions

• Transformer’s encoder contains 6 stacked blocks.


• 1 block ≈ 1 multi-head self-attention layer + 1 dense layer.

[Figure: encoder’s inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Encoder (6 stacked blocks) → 𝐮:1, 𝐮:2, 𝐮:3, ⋯, 𝐮:m; decoder’s inputs: 𝐱′1, 𝐱′2, 𝐱′3, ⋯, 𝐱′t]
Stacked Attentions

• A multi-head self-attention layer maps the decoder’s inputs 𝐱′1, ⋯, 𝐱′t to 𝐜:1, ⋯, 𝐜:t.
• A multi-head attention layer then combines the encoder’s outputs 𝐮:1, ⋯, 𝐮:m with 𝐜:1, ⋯, 𝐜:t and outputs 𝐳:1, ⋯, 𝐳:t.
• A dense layer is applied at each position: 𝐬:i = ReLU(𝐖_S 𝐳:i), for i = 1, ⋯, t.

[Figure: 𝐱1, ⋯, 𝐱m → Encoder (6 stacked blocks) → 𝐮:1, ⋯, 𝐮:m; 𝐱′1, ⋯, 𝐱′t → Multi-Head Self-Attention → 𝐜:1, ⋯, 𝐜:t; (𝐮’s, 𝐜’s) → Multi-Head Attention Layer → 𝐳:1, ⋯, 𝐳:t → Dense → 𝐬:1, 𝐬:2, 𝐬:3, ⋯, 𝐬:t]

Transformer’s Decoder: One Block

[Figure: one decoder block = Self-Attention Layer → Attention Layer → Dense Layer; its inputs are a 512×𝑚 matrix (from the encoder) and a 512×𝑡 matrix (from the decoder side), and its output is a 512×𝑡 matrix.]

Put Everything Together
Transformer

[Figure: the encoder network is a stack of 6 blocks (Block 1, ⋯, Block 6) mapping 𝐗 ∈ ℝ^(512×𝑚) to 𝐔 ∈ ℝ^(512×𝑚); the decoder network is a stack of 6 blocks that takes 𝐗′ ∈ ℝ^(512×𝑡) together with the encoder output 𝐔 and produces a 512×𝑡 matrix; each decoder block receives 𝐔 as one of its inputs.]
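A short sketch of the whole model, continuing the blocks above: the encoder is a stack of 6 encoder blocks and the decoder a stack of 6 decoder blocks, every one of which attends to the same encoder output 𝐔.

```python
def transformer(X, Xp, enc_blocks, dec_blocks):
    U = encoder(X, enc_blocks)                         # 512 x m
    S = Xp                                             # 512 x t
    for self_p, cross_p, W_S in dec_blocks:            # 6 decoder blocks
        S = decoder_block(U, S, self_p, cross_p, W_S)  # every block sees U
    return S            # 512 x t; in practice fed to an output (softmax) layer

dec_blocks = []
for _ in range(6):
    self_p, _ = make_block_params()
    cross_p, W_S = make_block_params()
    dec_blocks.append((self_p, cross_p, W_S))

print(transformer(X512, Xp, enc_blocks, dec_blocks).shape)   # (512, 7)
```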
Comparison with RNN Seq2Seq Model

[Figure: the RNN Seq2Seq model for comparison — an encoder RNN (parameter matrix 𝐀) maps 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m to hidden states 𝐡1, 𝐡2, 𝐡3, ⋯, 𝐡m, and a decoder RNN (parameter matrix 𝐁) maps 𝐱′1, 𝐱′2, 𝐱′3, ⋯, 𝐱′t to states 𝐬1, 𝐬2, 𝐬3, ⋯, 𝐬t.]
Applied to Machine Translation
Example: English to German

• The 𝑚 English words are encoded as vectors 𝐱1, 𝐱2, ⋯, 𝐱m, forming 𝐗 ∈ ℝ^(512×𝑚); the encoder (6 blocks) maps 𝐗 to 𝐔 ∈ ℝ^(512×𝑚).
• The decoder’s first input 𝐱′1 is the start sign. The decoder (6 blocks) outputs 𝐲1, and the 1st German word is obtained from 𝐲1 by random sampling.
• The generated German word becomes the next decoder input; the decoder then outputs 𝐲2, from which the 2nd German word is sampled, and so on.
• Generation stops when the sampled output is the stop sign. (A code sketch of this decoding loop follows below.)
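A sketch of the generation loop just described: encode the English sentence once, then repeatedly run the decoder, sample the next German word from the predicted distribution, append it to the decoder inputs, and stop at the stop sign. The model, embed, and output_softmax objects, the token ids, the use of the last position’s output, and max_len are hypothetical placeholders for illustration; only the loop structure comes from the slides.

```python
import numpy as np

def translate(english_ids, model, embed, output_softmax,
              start_id, stop_id, max_len=50, rng=np.random.default_rng()):
    U = model.encode(embed(english_ids))          # 512 x m, computed once
    generated = [start_id]                        # decoder inputs: start sign first
    for _ in range(max_len):
        S = model.decode(U, embed(generated))     # 512 x t decoder output
        probs = output_softmax(S[:, -1])          # distribution over German words (last position)
        next_id = rng.choice(len(probs), p=probs) # random sampling
        if next_id == stop_id:                    # stop sign ends generation
            break
        generated.append(next_id)                 # feed the sampled word back in
    return generated[1:]                          # the generated German sentence
```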
Summary
From Single-Head to Multi-Head

• Single-head self-attentions can be combined to form a multi-head self-attention.
• Single-head attentions can be combined to form a multi-head attention.

[Figure: 𝐗 (𝑑′×𝑚) is fed to several single-head self-attentions; their outputs 𝐂^(1), 𝐂^(2), 𝐂^(3) (each 𝑑×𝑚) are concatenated to form the multi-head output.]
Encoder Network of Transformer

• 1 encoder block ≈ multi-head self-attention + dense.


• Input shape: 512×𝑚.
• Output shape: 512×𝑚.

• Encoder network is a stack of 6 such blocks.


Decoder Network of Transformer

• 1 decoder block ≈ multi-head self-attention + multi-head attention + dense.
• Input shapes: 512×𝑚 and 512×𝑡.
• Output shape: 512×𝑡.

• Decoder network is a stack of 6 such blocks.


Transformer Model

• Transformer is a Seq2Seq model; it has an encoder and a decoder.

• The Transformer model is not an RNN.
• Transformer is based on attention and self-attention.
• Transformer outperforms state-of-the-art RNN models, e.g., on machine translation benchmarks.
Thank you!
