GPT
Lecturer: Ngoc Ba
VietAI Teaching Team
Founder @ ProtonX
Transformer
https://arxiv.org/abs/1706.03762
Application: the encoder layers form the basis of BERT.
BERT vs GPT
Contents
2019 GPT-2
Language Models are Unsupervised Multitask Learners
Experiments with large datasets and data-preparation techniques for training a single model on multiple different tasks.
2020 GPT-3
Language Models are Few-Shot Learners
Continuing the breakthrough by using zero-shot/one-shot/few-shot learning instead of fine-tuning the model.
4/2022 InstructGPT
Training language models to follow instructions with human feedback
Incorporating human-in-the-loop feedback and reinforcement learning into model training to avoid generating harmful or poisoned information.
[Timeline: 2019 → 4/2022 → 11/2022]
GPT-1
Improving Language Understanding by Generative Pre-Training (117M parameters; Radford et al., 2018)
Unsupervised pre-training with the Transformer decoder; the cross-attention to the encoder is removed.
[Diagram: decoder block — Output Embedding → Masked Multi-Headed Attention → Multi-Headed Attention → Linear → Softmax → Output Probabilities — trained against the cost function.]
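To make the "masked multi-headed attention" concrete, here is a minimal sketch of causal self-attention, the piece the GPT-1 decoder keeps once cross-attention is removed. Shapes and names are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model). Each position may attend only to itself and
    earlier positions, so the model can be trained as a language model."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)
    mask = np.triu(np.ones(scores.shape, dtype=bool), 1)
    scores = np.where(mask, -1e9, scores)              # hide future positions
    return softmax(scores) @ v

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(masked_self_attention(x, Wq, Wk, Wv).shape)      # (5, 8)
```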
GPT-1 vs GPT-2
GPT-1: 117M parameters, 4.6GB of data
GPT-2: 1.5B parameters, 40GB of data (WebText)
Final prompt (GPT-2 translation format): english sentence = french sentence
Inference result: fromage
GPT-2 vs GPT-3
GPT-2: 1.5B parameters, 40GB of data
GPT-3: 175B parameters, over 600GB of data
Dataset
In-context learning
Traditional: unsupervised pre-trained model + fine-tuning.
In-context learning: no fine-tuning, no gradient updates; zero-shot, one-shot, or few-shot learning (see the sketch after the link below).
https://arxiv.org/pdf/2005.14165.pdf
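A minimal sketch of how a few-shot prompt might be assembled. The English→French pairs follow the format used in the GPT-3 paper; the helper name and layout are assumptions:

```python
# Zero-shot: no example pairs; one-shot: one pair; few-shot: several pairs.
# No fine-tuning and no gradient updates happen: the "learning" is entirely
# in the context the model conditions on.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]

def few_shot_prompt(pairs, query):
    lines = ["Translate English to French:"]
    lines += [f"{en} => {fr}" for en, fr in pairs]
    lines.append(f"{query} =>")
    return "\n".join(lines)

print(few_shot_prompt(examples, "peppermint"))
```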
Data quality
https://arxiv.org/pdf/2005.14165.pdf
Reality
Distributed training: copy the parameters to each device, then average across devices after each step (a sketch follows below).
https://arxiv.org/pdf/2001.08361.pdf
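A toy sketch of the copy-and-average idea under the assumption of synchronous data parallelism; the function names and the toy regression task are illustrative:

```python
import numpy as np

def data_parallel_step(params, shards, grad_fn, lr=0.1):
    # Every replica works from the same parameter copy,
    # computes gradients on its own data shard,
    # and the averaged gradient gives an identical update everywhere.
    grads = [grad_fn(params, shard) for shard in shards]  # one per replica
    avg_grad = np.mean(grads, axis=0)                     # all-reduce average
    return params - lr * avg_grad

# Toy usage: fit y = 2x with squared loss, gradient d/dw (w*x - y)^2.
grad_fn = lambda w, s: np.mean(2 * (w * s[:, 0] - s[:, 1]) * s[:, 0])
data = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
w = np.zeros(1)
for _ in range(50):
    w = data_parallel_step(w, np.split(data, 2), grad_fn)
print(w)  # ≈ [2.]
```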
Reality
GPT-2 Medium
345M parameters
https://bluestudio.ai/smart-hr
Inference Architecture
Internet → Load balancing (distribute the traffic) → Instance Group (auto scaling)
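A toy round-robin sketch of "distribute the traffic"; instance names are made up, and a real load balancer (and auto scaling) lives in the infrastructure, not in application code:

```python
import itertools

class LoadBalancer:
    def __init__(self, instances):
        self._next = itertools.cycle(instances)  # rotate through the group

    def route(self, request):
        instance = next(self._next)
        return f"{instance} handles {request!r}"

lb = LoadBalancer(["instance-1", "instance-2", "instance-3"])
for i in range(5):
    print(lb.route(f"prompt #{i}"))
# Auto scaling would grow or shrink the instance list based on traffic.
```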
Human-in-the-loop
The admin can review responses and regenerate new ones to improve content quality.
Instruction fine-tuning
https://arxiv.org/pdf/2210.11416.pdf
GPT-3.5 - InstructGPT
https://arxiv.org/pdf/2203.02155.pdf
GPT-3.5 - InstructGPT
Supervised fine-tuning (SFT) → Reward modeling (RM) → Reinforcement learning (RL)
https://arxiv.org/pdf/2203.02155.pdf
Supervised fine-tuning (SFT)
A prompt (e.g., "Write an email…") paired with a human-written demonstration is used to fine-tune GPT-3 (SFT); a sketch follows below.
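A minimal, self-contained sketch of the SFT objective. The tiny embedding+linear stand-in model and the random token IDs are assumptions, not GPT-3:

```python
import torch
import torch.nn.functional as F

vocab, d = 100, 16
model = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One (prompt + human demonstration) sequence as token IDs, made up here.
tokens = torch.randint(0, vocab, (1, 12))

# Next-token cross-entropy: the same objective as pre-training, but computed
# on curated human demonstrations instead of raw web text.
logits = model(tokens[:, :-1])                      # predict each next token
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
print(float(loss))
```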
Reward modeling (RM)
For a prompt ("Write an email…"), GPT-3 produces several outputs (Generated 1–4).
A human ranks them by scoring: Generated 4 > Generated 2 > Generated 1 > Generated 3, i.e. scores 4, 3, 2, 1 (see the sketch below).
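A small sketch of how one human ranking can be expanded into pairwise training comparisons for the reward model; the output names follow the slide:

```python
from itertools import combinations

# Human ranking from the slide, best first.
ranking = ["Generated 4", "Generated 2", "Generated 1", "Generated 3"]

# Every ordered pair (better, worse) becomes one training comparison.
pairs = list(combinations(ranking, 2))
for better, worse in pairs:
    print(f"{better} preferred over {worse}")
# A ranking of K outputs yields K*(K-1)/2 comparisons: here, 6.
```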
Reward modeling (RM)
Copy GPT-3 (175B) and reduce the weights to obtain a smaller reward (preference) model (6B); fine-tune it on the human rankings of the generated outputs for each prompt ("Write an email…").
Problem: the decisions made by humans are often subject to noise and miscalibration, so the reward model is trained on pairwise comparisons of outputs (e.g., Generated 1 vs Generated 3) in the style of the Bradley-Terry model.
https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model
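A sketch of the pairwise, Bradley-Terry-style loss used for reward modeling, as in the InstructGPT paper: minimize -log σ(r(x, y_better) - r(x, y_worse)), so only the difference of the two scalar rewards matters, which absorbs per-annotator score miscalibration:

```python
import numpy as np

def pairwise_loss(r_better, r_worse):
    """-log sigmoid(r_better - r_worse); minimized when the reward model
    scores the human-preferred output higher."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_better - r_worse))))

print(pairwise_loss(2.0, 0.5))  # small loss: preference already respected
print(pairwise_loss(0.5, 2.0))  # large loss: model disagrees with the human
```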
Training Process
A prompt, e.g. x: "Ngoc is", is fed to the agent: a copy of the supervised fine-tuned (SFT) model that becomes the tuned language model (the RL policy). It generates completions such as y: "a teacher of the VietAI NLP class" or y: "handsome."
Objective function: the score from the reward (preference) model, plus a penalty that restrains the policy from diverging excessively from the pretrained model: the Kullback-Leibler (KL) divergence. A sketch of this objective follows below.
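A hedged sketch of the per-example objective: the reward minus a KL penalty toward the SFT model. The `beta` value and the probability arrays are illustrative assumptions:

```python
import numpy as np

def rlhf_objective(reward, logp_policy, logp_sft, beta=0.02):
    """reward: scalar from the reward model for (x, y).
    logp_*: per-token log-probabilities of y under each model."""
    # Single-sample estimate of KL(policy || SFT) for the generated y.
    kl_penalty = np.sum(logp_policy - logp_sft)
    return reward - beta * kl_penalty

logp_policy = np.log(np.array([0.5, 0.4, 0.3]))
logp_sft = np.log(np.array([0.4, 0.4, 0.2]))
print(rlhf_objective(reward=1.3, logp_policy=logp_policy, logp_sft=logp_sft))
```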
What we learned today
Reinforcement Learning
● https://web.stanford.edu/class/cs234/
● https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf
Thank you!