
UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Natural Language Processing - INT3406E 20

Large Language Model

Nguyen Van Vinh - UET


Outline

● Introduction to LM
● Large Language Models and applications

Language Modeling (Mô hình ngôn ngữ)?

● What is the probability of “Tôi trình bày ChatGPT tại Trường ĐH Công Nghệ”?
● What is the probability of “Công Nghệ học Đại trình bày ChatGPT tại Tôi”?
● What comes next in “Tôi trình bày ChatGPT tại Trường ĐH Công nghệ, địa điểm …”, i.e.
P(… | Tôi trình bày ChatGPT tại Trường ĐH Công nghệ, địa điểm)?
● A model that computes either of these, for a word sequence W = w_1 w_2 w_3 … w_n:
P(W) or P(w_n | w_1, w_2, …, w_{n-1})
is called a language model (a toy example follows below).
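A minimal Python sketch of what such a model computes, using a toy bigram approximation P(w_n | w_{n-1}); the two-sentence corpus below is made up purely for illustration:

```python
from collections import Counter

# Tiny made-up corpus, lower-cased and whitespace-tokenized for simplicity
corpus = [
    "tôi trình bày chatgpt tại trường đh công nghệ".split(),
    "tôi trình bày bài giảng tại trường đh công nghệ".split(),
]

unigram, bigram = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent              # <s> marks the start of a sentence
    unigram.update(tokens)
    bigram.update(zip(tokens, tokens[1:]))

def p_next(word, prev):
    """P(w_n | w_{n-1}) estimated from bigram counts (no smoothing)."""
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

def p_sentence(sent):
    """P(W) via the chain rule, with the bigram approximation of the history."""
    tokens = ["<s>"] + sent
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p_next(word, prev)
    return prob

print(p_sentence("tôi trình bày chatgpt tại trường đh công nghệ".split()))  # plausible order: nonzero
print(p_sentence("công nghệ học đại trình bày chatgpt tại tôi".split()))    # scrambled order: ~0
```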

Large Language Model

Large Language Model (Hundreds of Billions of Tokens)

Large Language Models - yottaFlops of Compute

Source: https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf
Why LLMs?

● Double Descent

Why LLMs?

● Scaling Laws for Neural Language Models

○ Performance depends strongly on scale! We keep getting better performance as we
scale up model size, data, and compute (a rough form of the law is sketched below).
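A rough sketch of the power-law form from Kaplan et al. (2020), “Scaling Laws for Neural Language Models”; the exponent values below are approximate and quoted from memory:

```latex
% Test loss falls as a power law in model size N and dataset size D
% (a similar law holds for training compute C).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
% with fitted exponents roughly \alpha_N \approx 0.076 and \alpha_D \approx 0.095.
```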

Why LLMs?

● Generalization
○ We can now use a single model to solve many NLP tasks

Why LLMs? Emergence in few-shot prompting
Emergent Abilities
• Some abilities of LMs are not present in smaller models but are present in larger models
Emergent Capability - In-Context Learning

What is pre-training / fine-tuning?

● “Pre-train” a model on a large dataset for task X, then “fine-tune” it on a dataset for task Y
● Key idea: X is somewhat related to Y, so a model that can do X will have some good
neural representations for Y as well (transfer learning)
● ImageNet pre-training is huge in computer vision: learning generic visual features for
recognizing objects

Can we find some task X that can be useful for a wide range of downstream tasks Y?
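A minimal sketch of the fine-tuning half of this recipe, assuming the Hugging Face transformers and datasets libraries; the checkpoint, dataset, and hyperparameters are illustrative choices, not the setup of any particular paper:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# Task X (already done for us): masked-LM pre-training of bert-base-uncased.
# Task Y (ours): sentiment classification on IMDB, reusing BERT's representations.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-imdb", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),  # small demo subset
)
trainer.train()  # fine-tuning only; the expensive pre-training happened upstream
```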

Pretraining + Prompting Paradigm

Prompt Engineering (2020 → now)
● Prompts involve instructions and context passed to a language model to achieve a
desired task

Prompt engineering is the practice of developing and optimizing prompts to efficiently
use language models (LMs) for a variety of applications

Prompt Engineering Techniques
● Many advanced prompting techniques have been designed to improve performance on
complex tasks (an example follows this list):
○ Few-shot prompts
○ Chain-of-thought (CoT) prompting
○ Self-Consistency
○ Knowledge Generation Prompting
○ ReAct
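As a concrete illustration of the first two items, here is a hypothetical few-shot chain-of-thought prompt assembled as a plain Python string; the worked examples are invented, and send_to_llm stands in for whatever model API is used:

```python
# Few-shot + chain-of-thought: the demonstrations show the format and reasoning style
# we want the model to imitate on the final, unanswered question.
prompt = """Q: A shop has 12 apples and sells 5. How many apples are left?
A: Let's think step by step. 12 - 5 = 7. The answer is 7.

Q: A classroom has 3 rows of 8 chairs. How many chairs are there in total?
A: Let's think step by step. 3 x 8 = 24. The answer is 24.

Q: Tuan has 15 books, gives away 6, then buys 4 more. How many books does he have?
A: Let's think step by step."""

# answer = send_to_llm(prompt)  # hypothetical call; the model continues the reasoning pattern
```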

Temperature and Top-p Sampling in LLMs

● Temperature and top-p sampling are two essential parameters that can be tweaked to
control the output of LLMs (see the sketch below)
● Temperature (0-2): this parameter determines the creativity and diversity of the text
generated by an LLM. A higher temperature value (e.g., 1.5) leads to more diverse and
creative text, while a lower value (e.g., 0.5) results in more focused and deterministic text.
● Top-p sampling (0-1): this parameter balances diversity against high-probability words by
sampling only from the smallest set of most probable tokens whose cumulative probability
mass is at least the threshold p.
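A minimal sketch of how the two parameters interact at decoding time, written with NumPy over a toy logit vector (the vocabulary and logits are made up):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample one token id using temperature scaling followed by nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()

    # Temperature: divide logits before the softmax; <1 sharpens, >1 flattens the distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of highest-probability tokens whose cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]

    # Renormalize over the nucleus and sample one token id.
    return rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum())

# Toy 5-token vocabulary with made-up logits
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_next_token(logits, temperature=0.7, top_p=0.9))
```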

Three major forms of pre-training (LLMs)

BERT: Bidirectional Encoder Representations from
Transformers

Source: (Devlin et al., 2019): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Masked Language Modeling (MLM)

● Q: Why can’t we do language modeling with bidirectional models?

● Solution: Mask out k% of the input words (k = 15% in BERT), and then predict the masked words (see the example below)
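A minimal sketch of masked-word prediction with a pre-trained BERT checkpoint, assuming the Hugging Face transformers library (the example sentence is arbitrary):

```python
from transformers import pipeline

# BERT was pre-trained to fill in [MASK] tokens; the pipeline exposes exactly that objective.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for candidate in unmasker("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```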

Next Sentence Prediction (NSP)

BERT pre-training

RoBERTa
● BERT is still under-trained
● Removed the next-sentence-prediction pre-training objective: it adds more noise than
benefit!
● Trained longer with 10x data & bigger batch sizes
● Pre-trained on 1,024 V100 GPUs for one day in 2019

(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach
Text-to-text models: the best of both worlds (Bard)?
● Encoder-only models (e.g., BERT) enjoy the benefits of bidirectionality, but they can’t be
used to generate text
● Decoder-only models (e.g., GPT-3, Llama 2) can do generation, but they are left-to-right
LMs
● Text-to-text models combine the best of both worlds (see the T5 example below)!

T5 = Text-to-Text Transfer Transformer

(Raffel et al., 2020): Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
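A hedged usage sketch of the text-to-text interface with the public t5-small checkpoint via transformers; the task prefix and generation settings are illustrative:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is "text in -> text out"; the prefix tells T5 which task to perform.
inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```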
How to use these pre-trained models?

From GPT to GPT-2 to GPT-3

Quiz

● Context size?
● The larger the context size, the more difficult it is?

GPT-3: language models are few-shot learners

● GPT-2 → GPT-3: 1.5B → 175B (# of parameters), ~14B → 300B (# of tokens)

GPT-3’s in-context learning

[2020] GPT-3 to [2022] ChatGPT

What’s new?
● Training on code
● Supervised instruction tuning
● RLHF = Reinforcement learning from human feedback
Source: Fu, 2022, “How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources”
How was ChatGPT developed?

Evaluation of LLMs

Latest LLMs

● Claude 2.1 (Anthropic)
○ 200K Context Window
○ 2x Decrease in Hallucination Rates
● GPT-4 Turbo (OpenAI)
○ 128K Context Window

Vietnamese
● PhoGPT (VinAI)
● FPT.AI
● VNG (Zalo)
● …

ChatGPT application for reading comprehension (ChatPdf)

● Fine-tune the ChatGPT model with training data in a specific domain
● Use LLM improvement techniques based on Retrieval-Augmented Generation (RAG);
a minimal sketch follows below
● Use effective prompting to obtain the expected output
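A minimal sketch of the RAG flow behind a ChatPdf-style tool; the chunking, the toy word-overlap similarity, and the final generate() call are placeholders for a real embedding model and LLM:

```python
def similarity(question, chunk):
    """Toy relevance score via word overlap; a real system would use dense embeddings."""
    q, c = set(question.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def retrieve(question, chunks, k=3):
    """Keep the k PDF chunks most relevant to the question."""
    return sorted(chunks, key=lambda chunk: similarity(question, chunk), reverse=True)[:k]

def build_prompt(question, chunks):
    """Ground the LLM in retrieved passages instead of relying only on its parametric memory."""
    context = "\n\n".join(retrieve(question, chunks))
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# chunks would come from splitting the uploaded PDF into passages;
# answer = generate(build_prompt(question, chunks))  # generate() = hypothetical LLM call
print(build_prompt("What is the deadline?", ["The deadline is May 5.", "Submit by email."]))
```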

Risks of Large Language Models

● LLMs make mistakes
○ (falsehoods, hallucinations)
● LLMs can be misused
○ (misinformation, spam)
● LLMs can cause harms
○ (toxicity, biases, stereotypes)
● LLMs can be attacked
○ (adversarial examples, poisoning, prompt injection)
● LLMs are costly to train and deploy

Summary

● Introduction to LLMs
● Large Language Models (types)

UET
Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

Thank you
Email me
vinhnv@vnu.edu.vn
