10 Transformer (2)
Shusen Wang
Multi-Head Attention

Single-Head Self-Attention
• Self-attention layer: C = Attn(X, X).
• This is called “single-head self-attention”.
[Figure: a single-head self-attention layer applied to the inputs x_1, x_2, x_3, …, x_m]
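The layer C = Attn(X, X) can be sketched as follows. This is not from the slides; it assumes the standard scaled-dot-product form, and the names `W_q`, `W_k`, `W_v`, `single_head_self_attn` are illustrative:

```python
import numpy as np

def softmax(z, axis=0):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_self_attn(X, W_q, W_k, W_v):
    """Single-head self-attention: C = Attn(X, X).

    X: d0 x m input (one column per position). Returns C: d x m.
    """
    Q = W_q @ X                      # d x m queries
    K = W_k @ X                      # d x m keys
    V = W_v @ X                      # d x m values
    d = Q.shape[0]
    scores = K.T @ Q / np.sqrt(d)    # m x m: column j scores query j against all keys
    A = softmax(scores, axis=0)      # attention weights; each column sums to 1
    return V @ A                     # d x m output

# Shape check with random parameters:
d0, d, m = 8, 4, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(d0, m))
W_q, W_k, W_v = (rng.normal(size=(d, d0)) for _ in range(3))
C = single_head_self_attn(X, W_q, W_k, W_v)
print(C.shape)  # (4, 5)
```

The three parameter matrices W_Q, W_K, W_V are exactly the ones counted on the next slides.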
Multi-Head Self-Attention
• Use l single-head self-attentions (which do not share parameters).
• Each single-head self-attention has 3 parameter matrices: W_Q, W_K, W_V.
• In total, there are 3l parameter matrices.
[Figure: l single-head self-attentions applied in parallel to the inputs x_1, x_2, x_3, …, x_m]
Multi-Head Self-Attention
• Use l single-head self-attentions (which do not share parameters).
• Concatenate the outputs of the single-head self-attentions.
• Suppose each single-head self-attention's output is a d×m matrix.
• Then the multi-head output has shape (l·d)×m.
[Figure: outputs of the l heads concatenated over the inputs x_1, x_2, x_3, …, x_m]
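The concatenation step can be sketched as below. This assumes each head is some function mapping the d_0×m input to a d×m matrix (each with its own W_Q, W_K, W_V); the name `multi_head_self_attn` is illustrative, and random projections stand in for real attention heads in the shape check:

```python
import numpy as np

def multi_head_self_attn(X, heads):
    """Concatenate the outputs of l single-head self-attentions.

    heads: list of l callables, each mapping X (d0 x m) to a d x m matrix
           (no shared parameters between heads).
    Returns an (l*d) x m matrix.
    """
    outputs = [head(X) for head in heads]      # l matrices, each d x m
    return np.concatenate(outputs, axis=0)     # stack along rows: (l*d) x m

# Shape check with dummy heads (random linear maps standing in for attention):
d0, d, m, l = 8, 4, 5, 3
rng = np.random.default_rng(0)
heads = [lambda X, W=rng.normal(size=(d, d0)): W @ X for _ in range(l)]
X = rng.normal(size=(d0, m))
out = multi_head_self_attn(X, heads)
print(out.shape)  # (12, 5), i.e. (l*d) x m
```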
Multi-Head Attention
• The same construction works for (cross-)attention: use l single-head attentions (which do not share parameters).
• Concatenate the single-head attentions' outputs.
[Figure: l single-head attentions applied in parallel to the inputs x_1, x_2, x_3, …, x_m]
Self-Attention Layer + Dense Layer
• Apply a dense layer to each output vector of the (multi-head) self-attention layer.
• The same dense layer (shared parameters) is applied at every position.
[Figure: dense layers applied on top of the self-attention outputs; inputs x_1, x_2, x_3, …, x_m]
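Because the dense layer's parameters are shared across positions, applying it to all m columns is one matrix product. A minimal sketch, assuming a ReLU activation and illustrative names `W`, `b`:

```python
import numpy as np

def dense_per_position(C, W, b):
    """Apply the same dense layer to every column of C.

    C: d x m self-attention output (one column per position).
    W: d_out x d shared weight matrix; b: length-d_out bias.
    One matrix product covers all m positions at once.
    """
    return np.maximum(0.0, W @ C + b[:, None])   # ReLU(W @ c_i + b) per column

d, d_out, m = 4, 6, 5
rng = np.random.default_rng(1)
C = rng.normal(size=(d, m))
W, b = rng.normal(size=(d_out, d)), rng.normal(size=(d_out,))
out = dense_per_position(C, W, b)
print(out.shape)  # (6, 5)
```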
Stacked Self-Attention Layers
• Self-attention layers (each followed by dense layers) can be stacked to build a deep network.
[Figure: stacked self-attention layers over the inputs x_1, x_2, x_3, …, x_m]
Transformer’s Encoder
• One encoder block = a (multi-head) self-attention layer + a dense layer.
• A block's input is a 512×m matrix, and its output is also a 512×m matrix, so blocks can be stacked.
• The encoder stacks 6 such blocks: the input X ∈ ℝ^{512×m} passes through Block 1, …, Block 6.
[Figure: one block (self-attention layer followed by a dense layer) between 512×m matrices; the encoder as a stack of Blocks 1–6]
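A simplified sketch of one encoder block and the 6-block stack. This keeps only what the slides describe (self-attention layer + dense layer, 512×m in and out); it omits multi-head splitting, residual connections, and layer normalization, and it reuses one parameter set across blocks purely for the shape check (real blocks have their own parameters):

```python
import numpy as np

def softmax_cols(Z):
    # Column-wise softmax, numerically stabilized.
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def encoder_block(X, params):
    """One simplified encoder block: self-attention layer + dense layer.

    X: 512 x m in, 512 x m out (shape-preserving, so blocks stack).
    """
    Wq, Wk, Wv, Wd, bd = params
    Q, K, V = Wq @ X, Wk @ X, Wv @ X
    A = softmax_cols(K.T @ Q / np.sqrt(Q.shape[0]))  # m x m attention weights
    C = V @ A                                        # 512 x m attention output
    return np.maximum(0.0, Wd @ C + bd[:, None])     # dense layer, 512 x m

m, dim = 5, 512
rng = np.random.default_rng(2)
params = (*(rng.normal(size=(dim, dim)) * 0.02 for _ in range(4)),
          np.zeros(dim))
H = rng.normal(size=(dim, m))   # X in R^{512 x m}
for _ in range(6):              # Block 1 ... Block 6, all 512 x m -> 512 x m
    H = encoder_block(H, params)
print(H.shape)  # (512, 5)
```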
Transformer’s Decoder: One Block
• One block of the decoder consists of a self-attention layer, an attention layer, and dense layers.
• It takes two inputs: the encoder's output (a 512×m matrix) and a 512×t matrix; its output is a 512×t matrix.
[Figure: one decoder block — self-attention layer, attention layer, and dense layers]

Encoder Network
• The encoder network is Block 1, …, Block 6 stacked; it maps the input X ∈ ℝ^{512×m} to the output U ∈ ℝ^{512×m}.
Transformer
• The Transformer is an encoder network plus a decoder network.
• Encoder: Block 1, …, Block 6 map X ∈ ℝ^{512×m} to U ∈ ℝ^{512×m}.
• Decoder: Block 1, Block 2, …, Block 6 take the decoder input X′ ∈ ℝ^{512×t}, together with the encoder output U, and produce a 512×t matrix.
[Figure: encoder stack and decoder stack side by side; U is fed to the decoder blocks]
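The shape bookkeeping above can be checked with a tiny sketch. The functions below are placeholders standing in for the real blocks (their names and internals are illustrative, not the actual layers); only the 512×m / 512×t shapes match the slides:

```python
import numpy as np

dim, m, t = 512, 7, 4
rng = np.random.default_rng(3)

def encoder(X):
    # Stands in for encoder Blocks 1-6: 512 x m -> 512 x m.
    return X  # shape-preserving placeholder

def decoder_block(Xp, U):
    # Stands in for one decoder block: self-attention on Xp (512 x t),
    # attention over the encoder output U (512 x m), then dense layers.
    # A fixed uniform mixing matrix plays the role of the attention weights.
    A = np.full((m, t), 1.0 / m)   # m x t: each output attends uniformly
    return Xp + U @ A              # 512 x t

X = rng.normal(size=(dim, m))      # encoder input X
Xp = rng.normal(size=(dim, t))     # decoder input X'
U = encoder(X)                     # U in R^{512 x m}
out = Xp
for _ in range(6):                 # decoder Blocks 1-6
    out = decoder_block(out, U)
print(U.shape, out.shape)  # (512, 7) (512, 4)
```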
Comparison with RNN Seq2Seq Model
[Figure: RNN Seq2Seq — an encoder RNN (parameter matrix A) produces hidden states h_1, h_2, h_3, …, h_m from the inputs x_1, x_2, x_3, …, x_m; a decoder RNN (parameter matrix B) produces states s_1, s_2, s_3, …, s_t from the inputs x′_1, x′_2, x′_3, …, x′_t]
Applied to Machine Translation

Example: English to German
• Embed the 1st, 2nd, …, mth English words as x_1, x_2, …, x_m; together they form the encoder input X ∈ ℝ^{512×m}.
• The encoder maps X to U ∈ ℝ^{512×m}, with columns u_1, u_2, …, u_m.
[Figure: the encoder mapping the English word embeddings x_1, x_2, …, x_m to u_1, u_2, …, u_m]
Example: English to German
• Feed the start sign as the decoder's first input x′_1; from the decoder's output distribution, obtain the first German word y_1 by random sampling.
• Append y_1 as x′_2, run the decoder again, and sample y_2.
• Repeat — feeding x′_1, x′_2, x′_3, … — to sample y_3, and so on up to y_t, one German word at a time.
[Figure: step-by-step generation — at every step, the encoder output U ∈ ℝ^{512×m} and the decoder input X′ are used, and the next word is drawn by random sampling]
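The sampling loop described above can be sketched as follows. Everything here except the loop structure is a hypothetical stand-in: the toy vocabulary, `embed`, and `decoder_step` (which returns a random distribution in place of the real decoder's output); the slides only specify that the next word is drawn by random sampling from the decoder's distribution:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["<start>", "<stop>", "ich", "bin", "gut"]   # toy German vocabulary

def embed(token, dim=512):
    # Hypothetical embedding: a fixed random vector per token.
    r = np.random.default_rng(abs(hash(token)) % (2**32))
    return r.normal(size=dim)

def decoder_step(U, Xp):
    # Stand-in for the decoder: returns a probability distribution
    # over the vocabulary for the next word (random here, not learned).
    logits = rng.normal(size=len(vocab))
    e = np.exp(logits - logits.max())
    return e / e.sum()

U = rng.normal(size=(512, 6))            # encoder output for the English sentence
tokens = ["<start>"]                     # x'_1 is the start sign
for _ in range(10):                      # cap the output length
    Xp = np.stack([embed(tok) for tok in tokens], axis=1)  # X' in R^{512 x t}
    p = decoder_step(U, Xp)              # distribution over the next word
    nxt = str(rng.choice(vocab, p=p))    # random sampling, as in the slides
    if nxt == "<stop>":
        break
    tokens.append(nxt)                   # y_1, y_2, ... become x'_2, x'_3, ...
print(tokens[1:])                        # the sampled German words
```

Note that X′ grows by one column per step: each sampled word y_i is fed back as the next decoder input x′_{i+1}.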
From Single-Head to Multi-Head
• Single-head self-attentions can be combined to form a multi-head self-attention.
• The input X (d_0×m) is fed to each single-head self-attention; their outputs C^(1), C^(2), C^(3) (each d×m) are concatenated into one (3d)×m matrix.
[Figure: three single-head self-attentions applied in parallel to X, with their outputs concatenated]