LARGE LANGUAGE MODELS (LLM)
Dr Premjith B
Assistant Professor (Sr. Gr.)
Amrita School of Artificial Intelligence, Coimbatore
Amrita Vishwa Vidyapeetham
TEAM
Amrit Subramanian, Second year, B.Tech CSE(AI), Amrita Vishwa Vidyapeetham
MY Saran Dharshan, Second year, B.Tech CSE(AI), Amrita Vishwa Vidyapeetham
“To think this all began with letting autocomplete finish our sentences.”
Source: Slide Show: New Yorker Cartoons April 24 & May 1, 2023 | The New Yorker
TRANSFORMER
• No recurrence
• No convolutions
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser,
and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30
(2017).
LANGUAGE MODELS
• Natural language generation (NLG)
• Natural language understanding (NLU)
Image source: The emergence of Large Language Models (LLMs) - The Low Down - Momentum Works
https://nlpnewsletter.substack.com/p/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-and-culture-in-nlp-845878
TYPICAL LIFE OF AN LM
WHY LLMS?
Source: cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec01.pdf
Source: Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. "Exploring the limits of transfer learning with a unified
text-to-text transformer." The Journal of Machine Learning Research 21, no. 1 (2020): 5485-5551.
● Unsupervised pre-training
○ Web pages
● Supervised fine-tuning
○ Benchmarks
Two tasks
Pre-training
• Train the network on massive data
• The data can be (mostly) unlabeled
• Expensive
Fine-tuning
• The model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data
• Fine-tuning is task-specific
Image courtesy: How does in-context learning work? A framework for understanding the differences from traditional supervised learning | SAIL Blog (stanford.edu)
Pretraining
• Encoder
• Captures bidirectional contextual information; can be conditioned on future tokens
• Example: BERT
• Decoder
• Language models; cannot be conditioned on future tokens
• Generate text
• Examples: GPT-3, GPT-3.5, GPT-4
• Encoder-Decoder
• Sequence-to-sequence mapping
• Examples: the original Transformer, T5
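To make the distinction concrete, here is a toy illustration (the sentences are assumed examples, not from the original slides):

# Encoder-style (BERT) pre-training: masked language modeling; the model may
# condition on context on both sides of the blank, including future tokens.
masked_lm_input = "the cat [MASK] on the mat"   # target for [MASK]: "sat"

# Decoder-style (GPT) pre-training: causal language modeling; the model
# predicts the next token from the left context only.
causal_lm_input = "the cat sat on the"          # target: "mat"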
PROMPTING
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing
Surveys, 55(9), 1-35.
$\Pr(Y \mid P; X)$: the probability of the output $Y$ given the prompt $P$ and the input $X$
Challenge: Finding the most appropriate prompt that allows an LM to solve the task at hand
Source: PaLM , DALL-E 2 , Chinchilla 🐭, Chain-of-thought prompting ⛓💭✍️, Values and Culture in NLP 🏛 (substack.com)
Examples of input, template, and answer for different tasks
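As a concrete illustration (a sentiment-analysis prompt in the style of Liu et al.; the template and answer mapping are illustrative assumptions):

# Cloze-style prompting: the input x is wrapped in a template with an answer
# slot [Z]; a verbalizer maps filled answers z back to task labels y.
x = "I love this movie."
template = "{x} Overall, it was a [Z] movie."
prompt = template.format(x=x)  # "I love this movie. Overall, it was a [Z] movie."
verbalizer = {"fantastic": "positive", "boring": "negative"}  # z -> y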
PROMPT ENGINEERING
Prefix tuning
• Having a proper context can steer the LM without changing its parameters
• If we want the LM to generate a word (e.g., "Learning"), we can prepend its common collocations as context (e.g., "Machine"), and the LM will assign a much higher probability to the desired word
• Prepends a sequence of task-specific vectors to the input while keeping the parameters of the LM frozen
P_idx denotes the sequence of prefix indices, and |P_idx| denotes the length of the prefix
Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.
$$h_i = \begin{cases} P_\theta[i,:] & \text{if } i \in P_{idx} \\ \mathrm{LM}_\phi(z_i, h_{<i}) & \text{otherwise} \end{cases}$$
The language model parameters are fixed and the prefix parameters θ are the only
trainable parameters.
Parametrization of P_θ: P_θ[i,:] = MLP_θ(P'_θ[i,:]), where P'_θ is a smaller matrix with the same row dimension (the prefix length) but a different column dimension. After the fine-tuning process, only P_θ is stored in memory.
soft_prompt = torch.nn.Parameter(torch.rand(num_tokens, embed_dim))

def transformer_block_for_prefix_tuning(x):
    prefix = FFN(soft_prompt)         # reparametrize the trainable prefix
    x = concat([prefix, x], dim=seq)  # prepend the prefix to the input sequence
    return transformer_block(x)
Source: Understanding Parameter-Efficient LLM Finetuning: Prompt Tuning And Prefix Tuning (sebastianraschka.com)
PROMPT TUNING
• An approach to add extra information for the model to condition on during its generation of the text
• Prompt tuning removes the restriction that the prompt P be parameterized by the model's parameters θ; instead, the prompt has its own dedicated parameters, θ_P, that can be updated
• Prefix tuning learns soft prompts at all layers of the model, while prompt tuning only modifies the input
LLM | Premjith B
47
LLM | Premjith B
Source: GitHub - arazd/ProgressivePrompts: Progressive Prompts: Continual Learning for Language Models
ANSWER ENGINEERING
• Aims to search for an answer space Z and a map to the original output
Y that results in an effective predictive model
• Two dimensions
• Answer shape
• Answer design
ANSWER SHAPE
The shape of an answer characterizes its granularity; the selection of the shape of acceptable answers depends on the task to perform.
• Tokens: One of the tokens in the pre-trained LM's vocabulary, or a subset of the vocabulary. Token or text-span answer spaces are widely used in classification tasks, relation extraction, or named entity recognition.
• Span: A short multi-token span. These are usually used together with cloze prompts.
• Sentence: A sentence or document. These are commonly used with prefix prompts. Longer phrasal or sentential answers are often used in language generation tasks.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A
systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1-35.
ANSWER DESIGN
• Manual Design
• The space of potential answers, Z, and its mapping to the output Y, are
designed manually
Hambardzumyan, K., Khachatrian, H., & May, J. (2021). Warp: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121.
MULTI-PROMPT LEARNING
Instead of a single prompt, multiple prompts can be used
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing
Surveys, 55(9), 1-35.
CHALLENGES IN PROMPTING
INSTRUCTION FINE-TUNING
Collect examples of (instruction, output) pairs across many tasks and fine-tune an LM on them
Source: Chung, Hyung Won, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li et al. "Scaling instruction-finetuned language models." arXiv preprint arXiv:2210.11416 (2022).
Wei, Jason, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. "Finetuned language models are zero-shot learners." arXiv
preprint arXiv:2109.01652 (2021).
Instruction Tuning
• An LLM can be directly fine-tuned in a fully supervised manner on the collected datasets
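As a minimal sketch of what such supervised examples look like (the serialization format and the data here are illustrative assumptions, not a specific paper's template):

examples = [
    {"instruction": "Translate to French: Good morning.", "output": "Bonjour."},
    {"instruction": "Classify the review 'Great film!' as positive or negative.",
     "output": "positive"},
]

def to_training_text(ex):
    # The LM is trained on the concatenated text; the loss is typically
    # computed only on the response tokens.
    return f"Instruction: {ex['instruction']}\nResponse: {ex['output']}"

for ex in examples:
    print(to_training_text(ex))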
• Mixing few-shot settings: Training with mixed zero-shot and few-shot prompts
significantly improves performance in both settings.
• Task diversity: Large models benefit from continuously increasing the number of
tasks.
• Data augmentation: Augmenting the data, such as by inverting inputs and outputs
(e.g., turning a question-answering task into a question-generation task), is
beneficial.
• Mixing weights: Appropriately tuning the mixing weights is important when using
a combination of instruction-tuning datasets.
Source: Instruction Tuning Vol. 1 - by Sebastian Ruder (substack.com)
Zhang, Shengyu, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li et al. "Instruction Tuning for Large Language Models: A Survey." arXiv preprint
arXiv:2308.10792 (2023).
FINE-TUNING
• PEFT – classifications
• Does the method introduce new parameters to the model?
• Does it fine-tune a small subset of existing parameters?
• Does the method aim to minimize memory footprint or only storage
efficiency?
• Additive methods
• Selective methods
• Reparametrization-based methods
• Hybrid methods
Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
ADDITIVE METHODS
Adapters
• Attach a small fully connected layer at every layer of the transformer
• Inserts small modules (adapters) between transformer layers
• Adapter layer performs a down projection to project the input hidden layer
information to a lower-dimensional space, followed by a non-linear activation
function and an up projection
• A residual connection to generate the final form
$$h \leftarrow h + f(h W_{down}) W_{up}, \qquad W_{down} \in \mathbb{R}^{d \times r}, \quad W_{up} \in \mathbb{R}^{r \times d}$$
def transformer_block_with_adapter(x):
    residual = x
    x = SelfAttention(x)
    x = FFN(x)  # adapter
    x = LN(x + residual)
    residual = x
    x = FFN(x)  # transformer FFN
    x = FFN(x)  # adapter
    x = LN(x + residual)
    return x
$$h \leftarrow (1 - \lambda(x))\,h + \lambda(x)\,f(x W_1) W_2, \qquad P_k, P_v \in \mathbb{R}^{l \times d}, \qquad f = \mathrm{Softmax}(\cdot)$$
(Prefix tuning rewritten in the unified view of He et al., with $W_1 = W_q P_k^\top$ and $W_2 = P_v$.)
• Sparse Adapter
• Pruned adapters
• Reduces the model size of neural networks by pruning redundant parameters and training the remaining ones
Source: He, Shwai, Liang Ding, Daize Dong, Miao Zhang, and Dacheng Tao. "Sparseadapter: An easy approach for improving the parameter-efficiency of adapters." arXiv preprint arXiv:2210.04284 (2022).
Pruning methods (illustrated in the sketch below)
• Random: $z \sim \mathrm{Uniform}(0, 1)$
• Magnitude: $z = |w|$
• Erdős–Rényi (ER): $\text{sparsity} \propto 1 - \dfrac{n_{in} + n_{out}}{n_{in} \cdot n_{out}}$
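A small sketch of how these scores could drive a mask for one adapter weight matrix (a reading of the formulas above, not the paper's exact procedure):

import torch

w = torch.randn(64, 16)                  # an adapter weight matrix (assumed shape)
z_random = torch.rand_like(w)            # Random: z ~ Uniform(0, 1)
z_magnitude = w.abs()                    # Magnitude: z = |w|
n_in, n_out = w.shape
sparsity = 1 - (n_in + n_out) / (n_in * n_out)   # Erdos-Renyi layer sparsity

# Keep the highest-scoring weights up to the target density.
k = int(w.numel() * (1 - sparsity))
mask = torch.zeros(w.numel())
mask[z_magnitude.view(-1).topk(k).indices] = 1.0
mask = mask.view_as(w)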
• Compacter
Source: Karimi Mahabadi, Rabeeh, James Henderson, and Sebastian Ruder. "Compacter: Efficient low-rank hypercomplex adapter layers." Advances in Neural Information Processing Systems 34 (2021): 1022-1035.
• AdapterHub
• An easy-to-use and extensible adapter training and sharing framework for transformer-based models
Soft prompts
• Some of the model's input embeddings are fine-tuned via gradient descent
• Soft prompts can be trained for the input layer only or for all layers
• Soft prompts can be pre-trained, or prompts from other tasks can be reused, to reduce the computation required for fine-tuning a soft prompt for a new task

def soft_prompted_model(input_ids):
    x = Embed(input_ids)
    x = concat([soft_prompt, x], dim=seq)
    return model(x)
SELECTIVE METHODS
• Cross-attention fine-tuning
• Originally designed for machine translation
• The model's parameters decompose into θ_src (source embeddings), θ_tgt (target embeddings), θ_enc (encoder), θ_dec (decoder), and θ_xattn (cross-attention), and fine-tuning is restricted to the cross-attention parameters θ_xattn
• BitFit
• Bias-terms Fine-tuning (BitFit)
• Freeze most of the transformer-encoder parameters and train only the bias terms and the task-specific classification layer (see the sketch below)
• Fine-tunes only a small portion of the model's parameters
• The bias terms amount to less than 0.1% of the total number of parameters
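A minimal sketch of BitFit-style selection in PyTorch (`model` is an assumed placeholder; real setups also keep the task-specific classification head trainable):

import torch

# Freeze everything except the bias terms.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)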
REPARAMETRIZATION-BASED METHODS
• Intrinsic SAID
• Structure-Aware Intrinsic Dimension (SAID)
• An objective function’s intrinsic dimensionality describes the minimum
dimension needed to solve the optimization problem it defines to some
precision level
• Intrinsic dimensionality of a pre-trained LLM (or LM): The number of
free parameters required to closely approximate the optimization
problem to be solved during fine-tuning of a model for a downstream
task.
• Intrinsic dimension is the lowest dimensional subspace in which one can
optimize the original function to within a certain level of approximation
error
$\theta^D = [\theta_0, \theta_1, \ldots, \theta_m]$: the set of $D$ parameters that parameterize some model
SAID
$$\theta_i^D = \theta_{0,i}^D + \lambda_i P(\theta^{d-m})_i$$
where $P$ projects from the low-dimensional subspace to the full parameter space and $\lambda_i$ are per-layer scaling factors ($m$ of the $d$ dimensions are used for these scalings).
Problem statement
Approach
$$\max_{\Theta} \sum_{(x, y)} \sum_{t=1}^{|y|} \log P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})$$
people.cs.umass.edu/~miyyer/cs685/slides/multilingual.pdf
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
$$B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)$$
Applied to the attention weight matrices $W_q, W_k, W_v \in \mathbb{R}^{d_{model} \times d_{model}}$.
Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. "Lora: Low-rank adaptation of large language models."
arXiv preprint arXiv:2106.09685 (2021).
def lora_linear(x):
    h = x @ W            # regular linear
    h += x @ W_A @ W_B   # low-rank update
    return scale * h
Source: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA (huggingface.co)
Sources: The bfloat16 numerical format | Cloud TPU | Google Cloud; https://nhigham.com/2018/12/03/half-precision-arithmetic-fp16-versus-bfloat16/
• Three components
• 4-bit NormalFloat quantization
• Double quantization
• Paged optimizers
$$q_i = \frac{1}{2}\left( Q_X\!\left(\frac{i}{2^k + 1}\right) + Q_X\!\left(\frac{i + 1}{2^k + 1}\right) \right), \qquad i = 1, \ldots, 2^k$$
Quantile quantization: a lossy minimum-entropy encoding with k bits has the property that, for any input data, the quantized outputs take the value of each of the 2^k different bit representations equally often.
• Ensures an equal number of values per quantization bin from the input tensor
• Q_X is the quantile function: it takes a quantile level as input and returns the corresponding quantile value of the standard normal distribution
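A minimal sketch of the $q_i$ formula over a standard normal (the actual NF4 levels in QLoRA also handle the infinite tail quantiles and reserve an exact zero; this is only the naive version):

import torch

k = 4
normal = torch.distributions.Normal(0.0, 1.0)
i = torch.arange(1, 2 ** k)   # skip the last bin, whose upper quantile is infinite
q = 0.5 * (normal.icdf(i / (2 ** k + 1)) + normal.icdf((i + 1) / (2 ** k + 1)))
q = q / q.abs().max()         # normalize the levels into [-1, 1]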
Double quantization
• The process of quantizing the quantization constants for additional memory savings
Paged optimizers
• Utilize the NVIDIA unified memory feature, which performs automatic page-to-page transfers between the CPU and GPU, functioning much like regular memory paging between CPU RAM and the disk
1. 4-bit integers represent 16 levels evenly spaced in the [−1, 1] range. The levels are −1.0, −0.8667, −0.7333, −0.6, −0.4667, −0.3333, −0.2, −0.0667, 0.0667, 0.2, 0.3333, 0.4667, 0.6, 0.7333, 0.8667, 1.0.
2. Suppose a weight in the big FP32 model is 0.23456.
3. The closest value among the 16 levels is 0.2.
4. Quantize the weight to 0.2.
5. In the 4-bit representation, store the value 10 (0.2 is the 10th value in the 16 levels).
6. To use this 4-bit weight in computation, dequantize it back to FP32 using the stored index (10th value = 0.2).
7. The dequantization error is 0.23456 − 0.2 = 0.03456 (about 1/4 of the quantization step size, 0.1333).
Source: Understanding LoRA and QLoRA — The Powerhouses of Efficient Finetuning in Large Language Models | by Murali Manohar | Aug, 2023 | Medium
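The worked example can be verified in a few lines (a sketch of the evenly spaced 4-bit grid described above):

# 16 evenly spaced levels in [-1, 1]; step size = 2/15, about 0.1333.
levels = [-1 + 2 * j / 15 for j in range(16)]
w = 0.23456
idx = min(range(16), key=lambda j: abs(levels[j] - w))  # nearest level
print(idx, levels[idx], w - levels[idx])  # 9 (zero-based; the 10th level), 0.2, 0.03456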
QLoRA
$$\mathrm{doubleDequant}(c_1^{\mathrm{FP32}}, c_2^{k\text{-bit}}, \mathbf{W}^{\mathrm{NF4}}) = \mathrm{dequant}\big(\mathrm{dequant}(c_1^{\mathrm{FP32}}, c_2^{k\text{-bit}}),\, \mathbf{W}^{\mathrm{NF4}}\big) = \mathbf{W}^{\mathrm{BF16}}$$
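A rough sketch of the idea behind double quantization (the block sizes 64 and 256 follow the QLoRA paper; the actual 8-bit quantization of the constants is elided):

import torch

W = torch.randn(4096, 4096)
blocks = W.reshape(-1, 64)            # first-level blocks of 64 weights each
c2 = blocks.abs().max(dim=1).values   # one absmax constant per block
# Second level: group the constants and keep only one FP32 absmax c1 per
# group of 256, so the per-block constants can be stored in low precision.
c1 = c2.reshape(-1, 256).abs().max(dim=1).values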
HYBRID METHODS
• MAM (Mix-and-Match) Adapter: incorporates both adapters and prompt tuning

def transformer_block_mam(x):
    x = concat([x, soft_prompt], dim=seq)
    residual = x
    x = SelfAttention(x)
    x = LN(x + residual)
    x_a = FFN(x)       # parallel adapter
    x_a = scale * x_a
    x = LN(x + x_a)
    return x
UniPELT
• Gated combination of LoRA, prefix-tuning, and adapters
• LoRA reparametrization is used for the attention matrices, prefix-tuning is applied to the keys and values of each layer, and adapters are added after the feed-forward layer of the transformer block
LLAMA
Intuition
Pre-training data
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière et al. "Llama: Open and efficient foundation language models." arXiv
preprint arXiv:2302.13971 (2023).
Architecture of LLaMA
• Pre-normalization
• Normalize the input of each transformer sub-layer instead of normalizing the output
• Uses the RMSNorm normalization function (a code sketch follows this list):
$$\bar{a}_i = \frac{a_i}{\mathrm{RMS}(a)}\, g_i, \qquad \mathrm{RMS}(a) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2}$$
• This approach was also used in GPT-3
• SwiGLU activation function
$$\mathrm{SwiGLU}(x, W, V) = \mathrm{Swish}_1(xW) \otimes xV, \qquad \mathrm{Swish}_\beta(x) = x\,\sigma(\beta x)$$
• Rotary Embeddings
• Used rotary embeddings instead of absolute positional embeddings
• Optimizer: AdamW
• β₁ = 0.9; β₂ = 0.95
• Weight decay = 0.1
• Gradient clipping = 1.0
• Warmup steps = 2000
• Context length = 2048 tokens
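A minimal PyTorch sketch of RMSNorm matching the formula above (the eps term is an assumed numerical-stability detail):

import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.g = torch.nn.Parameter(torch.ones(dim))  # per-dimension gain g_i

    def forward(self, a):
        # Scale by the root mean square; unlike LayerNorm, no mean subtraction.
        rms = torch.sqrt(a.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (a / rms) * self.g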
LLAMA 2
• 7B to 70B parameters
• Two models: LLaMA 2 and LLaMA 2-Chat
• Trained on a new mix of publicly available data
• Did not include data from Meta's products or services
• Increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention
• Grouped-query attention divides query heads into G groups, each of which shares a single key head and value head (see the sketch below)
• Trained on 2 trillion tokens
• LLaMA 2-Chat is optimized for dialogue use cases
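A rough sketch of grouped-query attention with assumed sizes (H query heads sharing G key/value heads):

import torch

H, G, T, d = 8, 2, 10, 64          # query heads, KV groups, seq length, head dim
q = torch.randn(1, H, T, d)
k = torch.randn(1, G, T, d)
v = torch.randn(1, G, T, d)

# Each key/value head serves H // G query heads.
k = k.repeat_interleave(H // G, dim=1)
v = v.repeat_interleave(H // G, dim=1)
attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
out = attn @ v                     # shape (1, H, T, d)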
Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint
arXiv:2307.09288 (2023).
Training details
• Standard transformer architecture
• Pre-normalization using RMSNorm
• SwiGLU activation function
• Rotary positional embedding
• Context length = 4096 tokens
• Optimizer: AdamW (see the sketch after this list)
• β₁ = 0.9; β₂ = 0.95
• Weight decay = 0.1
• Gradient clipping = 1.0
• Warmup steps = 2000
• eps = 10⁻⁵
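For concreteness, the listed settings map onto a hypothetical PyTorch setup like this (`model` and `peak_lr` are assumed placeholders):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1)
# Gradient clipping at 1.0, applied each training step:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)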
Image source:GitHub - OpenGVLab/LLaMA-Adapter: Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
REFERENCES
1. https://www.newyorker.com/cartoons/issue-cartoons/cartoons-from-the-april-24-and-may-1-
2023-issue
2. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural
information processing systems 30 (2017).
3. https://thelowdown.momentum.asia/the-emergence-of-large-language-models-llms/
4. cs.princeton.edu/courses/archive/fall22/cos597G
5. Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. "Exploring the limits of transfer learning with a unified
text-to-text transformer." The Journal of Machine Learning Research 21, no. 1 (2020): 5485-
5551
6. https://github.com/FourthBrain/Building-with-Instruction-Tuned-LLMs-A-Step-by-Step-
Guide#wave-welcome-to-the-support-repository-for-the-deeplearningai-event-building-with-
instruction-tuned-llms-a-step-by-step-guide
7. https://en.wikipedia.org/wiki/Generative_pre-trained_transformer
8. http://ai.stanford.edu/blog/understanding-incontext/
9. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and
predict: A systematic survey of prompting methods in natural language processing. ACM
Computing Surveys, 55(9), 1-35.
10. 263-5354-00L Large Language Models
11. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D.
(2020). Language models are few-shot learners. Advances in neural information processing
systems, 33, 1877-1901.
12. Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for
generation. arXiv preprint arXiv:2101.00190.
13. Hambardzumyan, K., Khachatrian, H., & May, J. (2021). Warp: Word-level adversarial
reprogramming. arXiv preprint arXiv:2101.00121.
14. Natural Language Processing with Deep Learning CS224N
15. Chung, Hyung Won, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li et
al. "Scaling instruction-finetuned language models." arXiv preprint arXiv:2210.11416 (2022).
16. Zhang, Shengyu, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li et
al. "Instruction Tuning for Large Language Models: A Survey." arXiv preprint
arXiv:2308.10792 (2023).
17. https://anoopsarkar.github.io/advanced-nlp-class/assets/slides/peft.pdf
18. Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling down to scale up: A guide to
parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
19. Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe,
Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. "Parameter-efficient transfer learning
for NLP." In International Conference on Machine Learning, pp. 2790-2799. PMLR, 2019.
20. Pfeiffer, Jonas, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder,
Kyunghyun Cho, and Iryna Gurevych. "Adapterhub: A framework for adapting transformers."
arXiv preprint arXiv:2007.07779 (2020).
21. He, Junxian, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig.
"Towards a unified view of parameter-efficient transfer learning." arXiv preprint
arXiv:2110.04366 (2021).
22. He, Shwai, Liang Ding, Daize Dong, Miao Zhang, and Dacheng Tao. "Sparseadapter: An easy
approach for improving the parameter-efficiency of adapters." arXiv preprint
arXiv:2210.04284 (2022).
23. Karimi Mahabadi, Rabeeh, James Henderson, and Sebastian Ruder. "Compacter: Efficient low-
rank hypercomplex adapter layers." Advances in Neural Information Processing Systems 34
(2021): 1022-1035.
24. Edalati, Ali, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J. Clark, and Mehdi
Rezagholizadeh. "Krona: Parameter efficient tuning with kronecker adapter." arXiv preprint
arXiv:2212.10650 (2022).
25. Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language
models are unsupervised multitask learners." OpenAI blog 1, no. 8 (2019): 9.
26. Mozhdeh Gheini, Xiang Ren, and Jonathan May. 2021. Cross-Attention is All You Need: Adapting
Pretrained Transformers for Machine Translation. In Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, pages 1754–1765, Online and Punta Cana, Dominican
Republic. Association for Computational Linguistics.
27. Zaken, E. B., Ravfogel, S., & Goldberg, Y. (2021). Bitfit: Simple parameter-efficient fine-tuning for
transformer-based masked language-models. arXiv preprint arXiv:2106.10199.
28. Alan Ansell, Edoardo Ponti, Anna Korhonen, and Ivan Vulić. 2022. Composable Sparse Fine-Tuning for
Cross-Lingual Transfer. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 1778–1796, Dublin, Ireland. Association for
Computational Linguistics.
29. Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. 2021. Intrinsic Dimensionality Explains the
Effectiveness of Language Model Fine-Tuning. In Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), pages 7319–7328, Online. Association for
Computational Linguistics.
30. Aghajanyan, Armen, Luke Zettlemoyer, and Sonal Gupta. "Intrinsic dimensionality explains the
effectiveness of language model fine-tuning." arXiv preprint arXiv:2012.13255 (2020).
31. Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu
Wang, and Weizhu Chen. "Lora: Low-rank adaptation of large language models." arXiv
preprint arXiv:2106.09685 (2021).
32. https://people.cs.umass.edu/~miyyer/cs685/slides/multilingual.pdf
33. Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. "Qlora: Efficient
finetuning of quantized llms." arXiv preprint arXiv:2305.14314 (2023).
34. https://huggingface.co/blog/4bit-transformers-bitsandbytes
35. https://cloud.google.com/tpu/docs/bfloat16
36. Dettmers, Tim, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. "8-bit optimizers via block-
wise quantization." arXiv preprint arXiv:2110.02861 (2021).
37. https://andlukyane.com/blog/paper-review-qlora
38. Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving language
understanding by generative pre-training." (2018).
39. Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière et al. "Llama: Open and efficient foundation language
models." arXiv preprint arXiv:2302.13971 (2023).
40. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv
preprint arXiv:2307.09288 (2023).
41. Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas et al. "Training compute-optimal large language models."
arXiv preprint arXiv:2203.15556 (2022).
42. https://vinija.ai/models/LLaMA/
43. Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. "Roformer:
Enhanced transformer with rotary position embedding." arXiv preprint arXiv:2104.09864
(2021).
44. https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md
45. Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit
Sanghai. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints."
arXiv preprint arXiv:2305.13245 (2023).
46. https://ai.meta.com/llama/?
utm_source=alexandrabarr.beehiiv.com&utm_medium=referral&utm_campaign=llama-2-explained-
training-performance-and-results
47. https://github.com/OpenGVLab/LLaMA-Adapter
48. Daniel Jurafsky and James H Martin. 2021. Speech and language processing: An introduction to natural
language processing, computational linguistics, and speech recognition.
49. https://magazine.sebastianraschka.com/p/understanding-parameter-efficient#:~:text=Prefix%20Versus
%20Prompt%20Tuning&text=Prefix%20tuning%20modifies%20more%20layers,in%20fewer
%20parameters%20being%20updated.
50. Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022. P-Tuning:
Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68,
Dublin, Ireland. Association for Computational Linguistics.
51. Wei, Jason, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
Andrew M. Dai, and Quoc V. Le. "Finetuned language models are zero-shot learners." arXiv preprint
arXiv:2109.01652 (2021).
52. https://nlpnewsletter.substack.com/p/instruction-tuning-vol-1?utm_source=post-email-
title&publication_id=1178062&post_id=136684903&utm_campaign=email-post-
title&isFreemail=true&r=ktq1z&utm_medium=email
53. https://nlpnewsletter.substack.com/p/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-
and-culture-in-nlp-845878