DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
Tasks In NLP:
Context Windows:
User 1: What’s for dinner?
Chatbot: Who’s cooking, you or me?
User 1: Hey now chatbot.
Chatbot: I hope it’s not hay, that’s what horses eat.
Transformer Complexity
Transformer Issues
● Attention on a sequence of length L takes L² time and memory (see the sketch below):
  L = 100     → L² = 10K   (0.001 s at 10M ops/s)
  L = 1,000   → L² = 1M    (0.1 s at 10M ops/s)
  L = 10,000  → L² = 100M  (10 s at 10M ops/s)
  L = 100,000 → L² = 10B   (1,000 s at 10M ops/s)
● N layers take N times as much memory: GPT-3 has 96 layers, and new models will have more.
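To make the quadratic scaling concrete, here is a small Python check that reproduces the estimates above (the 10M ops/s throughput is the figure assumed on the slide):

```python
# Quadratic cost of attention: reproduce the slide's timing estimates.
OPS_PER_SEC = 10_000_000  # 10M ops/s, the throughput assumed on the slide

for L in [100, 1_000, 10_000, 100_000]:
    ops = L ** 2                    # attention builds an L x L score matrix
    seconds = ops / OPS_PER_SEC
    print(f"L = {L:>7,} -> L^2 = {ops:>15,} ops, ~{seconds:,.3f} s")
```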
Attention Complexity
● Attention: softmax(QKᵀ)V
● QKᵀ is an [L, L] matrix
● Save compute by restricting attention to an area of interest when L is large
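A minimal NumPy sketch of this computation (single head, no masking; the 1/√d scaling is omitted to match the slide’s formula), showing where the [L, L] matrix appears:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q Kᵀ) V -- the scores matrix is [L, L]."""
    scores = Q @ K.T                                 # [L, L]: the bottleneck
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # [L, d]

L, d = 1_000, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
out = attention(Q, K, V)   # the intermediate [L, L] matrix holds 1M entries
```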
Memory with N Layers
● Activations need to be stored for backprop
Image © “Transformer: A Novel Neural Network Architecture for Language Understanding”
Nearest Neighbors
Course:
Natural Language Processing with Classification and Vector Spaces
Lessons:
● KNN
● Multiple Planes
Nearest Neighbors
Compute the nearest neighbor to q among vectors {k1, …, kn}
● Attention computes d(q, ki) for i from 1 to n, which can be slow
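As a brute-force NumPy sketch (the slide leaves the distance d unspecified; a dot-product score, as in attention, is used here):

```python
import numpy as np

def nearest_neighbor(q, keys):
    """Index of the key closest to q, scoring all n keys -- O(n*d) work."""
    scores = keys @ q              # one score per key; higher = closer
    return int(np.argmax(scores))

keys = np.random.randn(1_000, 64)  # {k1, ..., kn}
q = np.random.randn(64)            # the query
print(nearest_neighbor(q, keys))
```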
LSH Attention:
● Hash Q and K so that only vectors falling in the same bucket attend to each other
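One concrete way to do this is the random-projection (angular) LSH used in the Reformer paper: h(x) = argmax([xR; −xR]) for a random matrix R, so that nearby vectors land in the same bucket with high probability. A sketch:

```python
import numpy as np

def lsh_hash(x, R):
    """Angular LSH: h(x) = argmax([xR; -xR]).

    x: [n, d] vectors; R: [d, n_buckets // 2] random projection.
    Returns one bucket id in [0, n_buckets) per vector.
    """
    xR = x @ R
    return np.argmax(np.concatenate([xR, -xR], axis=-1), axis=-1)

d, n_buckets = 64, 16
R = np.random.randn(d, n_buckets // 2)
Q = np.random.randn(1_000, d)
buckets = lsh_hash(Q, R)  # attention then only compares same-bucket Q and K
```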
Diagram: memory with N layers at L = 1 million tokens. The input takes ~2 GB; each of the 12 attention layers and 12 feed-forward layers stores ~2 GB of activations for backprop, for ~50 GB in total.
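A back-of-the-envelope check of the ~2 GB figure (d_model = 512 and float32 activations are assumptions; they are what make the slide’s numbers work out):

```python
L = 1_000_000        # sequence length in tokens
d_model = 512        # assumed embedding width
bytes_per_val = 4    # float32

per_layer = L * d_model * bytes_per_val        # activations for one layer
total = per_layer * (12 + 12 + 1)              # 12 attention + 12 FF + input
print(f"per layer: {per_layer / 1e9:.1f} GB")  # ~2.0 GB
print(f"total:     {total / 1e9:.1f} GB")      # ~51 GB, the slide's ~50 GB
```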
Reversible Residual Layers
Residual Blocks in Transformer
Diagram: a standard residual stack (Attention and Feed Forward blocks, each added back to its input). To avoid storing activations for backprop, each block needs the ability to be reversed.
Reversible Residual Blocks
Diagram: reversible residual blocks split the input into two streams, x1 and x2, which pass through the Attention and FeedFwd blocks with residual additions; running the same blocks with subtraction recovers the inputs from the outputs.
Reversible layers equations
Standard Transformer:
ya = x + Attention(x)
yb = ya + FeedFwd(ya)

Reversible:
y1 = x1 + Attention(x2)
y2 = x2 + FeedFwd(y1)
Reversible layers equations

Forward:
y1 = x1 + Attention(x2)
y2 = x2 + FeedFwd(y1)

Reverse (recompute the inputs from the outputs):
x2 = y2 - FeedFwd(y1)
x1 = y1 - Attention(x2)

Diagram: the same block drawn twice, once with additions (forward pass) and once with subtractions (reverse pass).
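A minimal NumPy sketch of one reversible block, with toy stand-ins for Attention and FeedFwd (any deterministic functions work), confirming that the reverse equations recover the inputs exactly:

```python
import numpy as np

d = 64
W1, W2 = np.random.randn(d, d), np.random.randn(d, d)
attention = lambda x: np.tanh(x @ W1)  # toy stand-in for Attention(x)
feedfwd = lambda x: np.tanh(x @ W2)    # toy stand-in for FeedFwd(x)

def forward(x1, x2):
    y1 = x1 + attention(x2)
    y2 = x2 + feedfwd(y1)
    return y1, y2

def reverse(y1, y2):
    x2 = y2 - feedfwd(y1)              # undo the second residual
    x1 = y1 - attention(x2)            # then undo the first
    return x1, x2

x1, x2 = np.random.randn(5, d), np.random.randn(5, d)
r1, r2 = reverse(*forward(x1, x2))
assert np.allclose(x1, r1) and np.allclose(x2, r2)  # inputs recovered
```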
Reversible Transformer: BLEU Scores
Table: BLEU scores for the standard Transformer [Vaswani et al., 2017] vs. the Reversible Transformer (data © “Reformer: The Efficient Transformer”).
Reformer
Reformer
The Reversible Transformer: handles a context of L = 1 million tokens on a single GPU (16 GB).
Reformer
● LSH Attention
● Reversible Layers
Image © “Attention Is All You Need”
Reformer
Image © “Reformer: The Efficient Transformer”
Chatbot
● Reformer
● MultiWOZ dataset
● Trax
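A minimal sketch of how the chatbot’s model might be instantiated in Trax, assuming trax.models.reformer.ReformerLM (argument names and defaults vary across Trax versions, so the hyperparameters here are illustrative, not the course’s exact configuration):

```python
import trax

# Reformer language model for the dialogue task; values are illustrative.
model = trax.models.reformer.ReformerLM(
    vocab_size=33_000,  # subword vocabulary size (assumed)
    d_model=512,        # width of the residual stream
    d_ff=2048,          # feed-forward hidden size
    n_layers=6,         # reversible layers: memory stays ~constant in depth
    n_heads=8,
    max_len=4096,       # context window in tokens
    mode='train',
)
```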