

Transformer Model (2/2):

From Shallow to Deep

Shusen Wang
Multi-Head Attention
Single-Head Self-Attention
• Self-attention layer: 𝐂 = Attn(𝐗, 𝐗).
• This is called “single-head self-attention”.

[Figure: inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Self-Attention Layer (parameters: 𝐖_Q, 𝐖_K, 𝐖_V) → outputs 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:m]
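Below is a minimal NumPy sketch of a single-head self-attention layer 𝐂 = Attn(𝐗, 𝐗), with one input vector per column as in the slides. The scaled-softmax formula, the toy sizes, and the helper names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax(a, axis=0):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_self_attention(X, W_Q, W_K, W_V):
    Q = W_Q @ X                                   # queries, d x m
    K = W_K @ X                                   # keys,    d x m
    V = W_V @ X                                   # values,  d x m
    # column j of A holds the attention weights used to compute c_:j
    A = softmax(K.T @ Q / np.sqrt(K.shape[0]), axis=0)   # m x m
    return V @ A                                  # C = Attn(X, X), d x m

# toy example with made-up sizes
d0, d, m = 8, 4, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(d0, m))
W_Q, W_K, W_V = (rng.normal(size=(d, d0)) for _ in range(3))
C = single_head_self_attention(X, W_Q, W_K, W_V)
print(C.shape)   # (4, 5), i.e. d x m
```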
Multi-Head Self-Attention
• Use 𝑙 single-head self-attentions, which do not share parameters.
• Each single-head self-attention has 3 parameter matrices (𝐖_Q, 𝐖_K, 𝐖_V), so a multi-head self-attention has 3𝑙 parameter matrices in total.
• Concatenate the outputs of the 𝑙 single-head self-attentions.
• If each single-head output is a 𝑑×𝑚 matrix, the multi-head output has shape (𝑙𝑑)×𝑚 (a code sketch follows the figure below).

[Figure: inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Multi-Head Self-Attention Layer → outputs 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:m]
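A small sketch of the multi-head self-attention just described: 𝑙 single heads with separate parameters, their 𝑑×𝑚 outputs stacked into an (𝑙𝑑)×𝑚 matrix. It reuses single_head_self_attention, rng, X, d, and d0 from the previous sketch.

```python
def multi_head_self_attention(X, params):
    # params: list of l (W_Q, W_K, W_V) triples, one per head (no parameter sharing)
    heads = [single_head_self_attention(X, W_Q, W_K, W_V)
             for (W_Q, W_K, W_V) in params]
    return np.concatenate(heads, axis=0)          # (l*d) x m

l = 3
params = [tuple(rng.normal(size=(d, d0)) for _ in range(3)) for _ in range(l)]
C_multi = multi_head_self_attention(X, params)
print(C_multi.shape)   # (12, 5), i.e. (l*d) x m
```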
Multi-Head Attention
• Use 𝑙 single-head attentions, which do not share parameters.
• Concatenate the single-head attentions’ outputs (a code sketch follows the figure below).

[Figure: encoder inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m and decoder inputs 𝐱′1, 𝐱′2, 𝐱′3, ⋯, 𝐱′t → Multi-Head Attention Layer → outputs 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:t]
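The (cross-)attention version works the same way, except that keys and values come from the 𝑚 columns of 𝐗 while queries come from the 𝑡 columns of 𝐗′, so each head’s output has 𝑡 columns. A sketch continuing the ones above; the scaled softmax is again an assumption.

```python
def single_head_attention(X, X_prime, W_Q, W_K, W_V):
    Q = W_Q @ X_prime                             # queries from X', d x t
    K = W_K @ X                                   # keys from X,     d x m
    V = W_V @ X                                   # values from X,   d x m
    A = softmax(K.T @ Q / np.sqrt(K.shape[0]), axis=0)   # m x t
    return V @ A                                  # d x t

def multi_head_attention(X, X_prime, params):
    heads = [single_head_attention(X, X_prime, W_Q, W_K, W_V)
             for (W_Q, W_K, W_V) in params]
    return np.concatenate(heads, axis=0)          # (l*d) x t

t = 4
X_prime = rng.normal(size=(d0, t))
print(multi_head_attention(X, X_prime, params).shape)   # (12, 4), i.e. (l*d) x t
```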


Stacked Self-Attention Layers
Self-Attention Layer

[Figure: inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Multi-Head Self-Attention Layer → outputs 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:m]
Self-Attention Layer + Dense Layer

• Apply a dense layer to each output of the multi-head self-attention layer: 𝐮:i = ReLU(𝐖_U 𝐜:i), for i = 1, ⋯, m.
• The same parameter matrix 𝐖_U is used at every position (a code sketch follows the figure below).

[Figure: 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Multi-Head Self-Attention Layer → 𝐜:1, 𝐜:2, 𝐜:3, ⋯, 𝐜:m → Dense (one per position) → 𝐮:1, 𝐮:2, 𝐮:3, ⋯, 𝐮:m]
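Because the same weight matrix 𝐖_U is applied to every column, the position-wise dense layer is a single matrix multiply followed by ReLU. A sketch continuing the earlier ones; the output size chosen here is an assumption.

```python
def dense_relu(C, W):
    # applies u_:i = ReLU(W c_:i) to every column i at once
    return np.maximum(0.0, W @ C)

W_U = rng.normal(size=(d0, l * d))    # maps each (l*d)-dim column back to d0 dims
U = dense_relu(C_multi, W_U)
print(U.shape)                        # (8, 5): one output vector u_:i per position
```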
Stacked Self-Attention Layers

[Figure: a multi-head self-attention layer followed by a position-wise dense layer maps 𝐱1, ⋯, 𝐱m to 𝐮:1, ⋯, 𝐮:m; such layers can be stacked, each layer’s outputs becoming the next layer’s inputs: 𝐱1, ⋯, 𝐱m → Multi-Head Self-Attention Layer → Dense → Multi-Head Self-Attention Layer → Dense → 𝐮:1, ⋯, 𝐮:m]
Transformer’s Encoder

[Figure: one encoder block = Self-Attention Layer followed by Dense Layer; its input and output are both 512×𝑚 matrices. The encoder stacks 6 such blocks (Block 1, Block 2, ⋯, Block 6) on the input 𝐗 ∈ ℝ^(512×𝑚).]
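Putting the two pieces together gives one encoder block; stacking 6 of them gives the encoder. The sketch below keeps the 512×𝑚 shape by choosing 8 heads of size 64 (so 𝑙𝑑 = 512); the head sizes and the small random initialization are assumptions, and the residual connections, layer normalization, and positional encodings used by the real Transformer are omitted. It reuses multi_head_self_attention, dense_relu, and rng from the earlier sketches.

```python
def encoder_block(X, attn_params, W_U):
    C = multi_head_self_attention(X, attn_params)    # (l*d) x m = 512 x m
    return dense_relu(C, W_U)                        # 512 x m again

def make_block_params(d_model=512, n_heads=8, d_head=64):
    attn = [tuple(0.02 * rng.normal(size=(d_head, d_model)) for _ in range(3))
            for _ in range(n_heads)]
    W = 0.02 * rng.normal(size=(d_model, n_heads * d_head))
    return attn, W

def encoder(X, blocks):
    for attn_params, W_U in blocks:                  # Block 1, ..., Block 6
        X = encoder_block(X, attn_params, W_U)
    return X                                         # U, a 512 x m matrix

X512 = rng.normal(size=(512, 10))                    # m = 10 input vectors
enc_blocks = [make_block_params() for _ in range(6)]
U512 = encoder(X512, enc_blocks)
print(U512.shape)                                    # (512, 10)
```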


Stacked Attention Layers
Stacked Attentions

• Transformer is a Seq2Seq model (encoder + decoder).


• Encoder’s inputs are vectors 𝐱1, 𝐱2, ⋯, 𝐱m.
• Decoder’s inputs are vectors 𝐱′1, 𝐱′2, ⋯, 𝐱′t.

[Figure: encoder’s inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m; decoder’s inputs 𝐱′1, 𝐱′2, 𝐱′3, ⋯, 𝐱′t]
Stacked Attentions

• Transformer’s encoder contains 6 stacked blocks.


• 1 block ≈ 1 multi-head self-attention layer + 1 dense layer.

[Figure: encoder’s inputs 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m → Encoder (6 stacked blocks) → 𝐮:1, 𝐮:2, 𝐮:3, ⋯, 𝐮:m; decoder’s inputs: 𝐱′1, 𝐱′2, 𝐱′3, ⋯, 𝐱′t]
Stacked Attentions

• A multi-head self-attention layer maps the decoder’s inputs 𝐱′1, ⋯, 𝐱′t to 𝐜:1, ⋯, 𝐜:t.
• A multi-head attention layer then combines the encoder’s outputs 𝐮:1, ⋯, 𝐮:m with 𝐜:1, ⋯, 𝐜:t and outputs 𝐳:1, ⋯, 𝐳:t.
• A dense layer is applied at each position: 𝐬:i = ReLU(𝐖_S 𝐳:i), for i = 1, ⋯, t.

[Figure: 𝐱1, ⋯, 𝐱m → Encoder (6 stacked blocks) → 𝐮:1, ⋯, 𝐮:m; 𝐱′1, ⋯, 𝐱′t → Multi-Head Self-Attention → 𝐜:1, ⋯, 𝐜:t; (𝐮’s, 𝐜’s) → Multi-Head Attention Layer → 𝐳:1, ⋯, 𝐳:t → Dense → 𝐬:1, 𝐬:2, 𝐬:3, ⋯, 𝐬:t]

Transformer’s Decoder: One Block

[Figure: one decoder block = Self-Attention Layer → Attention Layer → Dense Layer; its inputs are a 512×𝑚 matrix (from the encoder) and a 512×𝑡 matrix (from the decoder side), and its output is a 512×𝑡 matrix.]

Put Everything Together
Transformer

[Figure: the encoder network is a stack of 6 blocks (Block 1, ⋯, Block 6) mapping 𝐗 ∈ ℝ^(512×𝑚) to 𝐔 ∈ ℝ^(512×𝑚); the decoder network is a stack of 6 blocks that takes 𝐗′ ∈ ℝ^(512×𝑡) together with the encoder output 𝐔 and produces a 512×𝑡 matrix; each decoder block receives 𝐔 as one of its inputs.]
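A short sketch of the whole model, continuing the blocks above: the encoder is a stack of 6 encoder blocks and the decoder a stack of 6 decoder blocks, every one of which attends to the same encoder output 𝐔.

```python
def transformer(X, Xp, enc_blocks, dec_blocks):
    U = encoder(X, enc_blocks)                         # 512 x m
    S = Xp                                             # 512 x t
    for self_p, cross_p, W_S in dec_blocks:            # 6 decoder blocks
        S = decoder_block(U, S, self_p, cross_p, W_S)  # every block sees U
    return S            # 512 x t; in practice fed to an output (softmax) layer

dec_blocks = []
for _ in range(6):
    self_p, _ = make_block_params()
    cross_p, W_S = make_block_params()
    dec_blocks.append((self_p, cross_p, W_S))

print(transformer(X512, Xp, enc_blocks, dec_blocks).shape)   # (512, 7)
```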
Comparison with RNN Seq2Seq Model

[Figure: the RNN Seq2Seq model for comparison — an encoder RNN (parameter matrix 𝐀) maps 𝐱1, 𝐱2, 𝐱3, ⋯, 𝐱m to hidden states 𝐡1, 𝐡2, 𝐡3, ⋯, 𝐡m, and a decoder RNN (parameter matrix 𝐁) maps 𝐱′1, 𝐱′2, 𝐱′3, ⋯, 𝐱′t to states 𝐬1, 𝐬2, 𝐬3, ⋯, 𝐬t.]
Applied to Machine Translation
Example: English to German

• The 𝑚 English words are encoded as vectors 𝐱1, 𝐱2, ⋯, 𝐱m, forming 𝐗 ∈ ℝ^(512×𝑚); the encoder (6 blocks) maps 𝐗 to 𝐔 ∈ ℝ^(512×𝑚).
• The decoder’s first input 𝐱′1 is the start sign. The decoder (6 blocks) outputs 𝐲1, and the 1st German word is obtained from 𝐲1 by random sampling.
• The generated German word becomes the next decoder input; the decoder then outputs 𝐲2, from which the 2nd German word is sampled, and so on.
• Generation stops when the sampled output is the stop sign. (A code sketch of this decoding loop follows below.)
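A sketch of the generation loop just described: encode the English sentence once, then repeatedly run the decoder, sample the next German word from the predicted distribution, append it to the decoder inputs, and stop at the stop sign. The model, embed, and output_softmax objects, the token ids, the use of the last position’s output, and max_len are hypothetical placeholders for illustration; only the loop structure comes from the slides.

```python
import numpy as np

def translate(english_ids, model, embed, output_softmax,
              start_id, stop_id, max_len=50, rng=np.random.default_rng()):
    U = model.encode(embed(english_ids))          # 512 x m, computed once
    generated = [start_id]                        # decoder inputs: start sign first
    for _ in range(max_len):
        S = model.decode(U, embed(generated))     # 512 x t decoder output
        probs = output_softmax(S[:, -1])          # distribution over German words (last position)
        next_id = rng.choice(len(probs), p=probs) # random sampling
        if next_id == stop_id:                    # stop sign ends generation
            break
        generated.append(next_id)                 # feed the sampled word back in
    return generated[1:]                          # the generated German sentence
```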
Summary
From Single-Head to Multi-Head

• Single-head self-attentions can be combined to form a multi-head self-attention.
• Single-head attentions can be combined to form a multi-head attention.

[Figure: 𝐗 (𝑑′×𝑚) is fed to several single-head self-attentions; their outputs 𝐂^(1), 𝐂^(2), 𝐂^(3) (each 𝑑×𝑚) are concatenated to form the multi-head output.]
Encoder Network of Transformer

• 1 encoder block ≈ multi-head self-attention + dense.


• Input shape: 512×𝑚.
• Output shape: 512×𝑚.

• Encoder network is a stack of 6 such blocks.


Decoder Network of Transformer

• 1 decoder block ≈ multi-head self-attention + multi-head attention + dense.
• Input shapes: 512×𝑚 and 512×𝑡.
• Output shape: 512×𝑡.

• Decoder network is a stack of 6 such blocks.


Transformer Model

• Transformer is a Seq2Seq model; it has an encoder and a decoder.

• The Transformer model is not an RNN.
• Transformer is based on attention and self-attention.
• Transformer outperforms state-of-the-art RNN models, e.g., on machine translation benchmarks.
Thank you!
