LARGE LANGUAGE MODELS (LLM)
Dr Premjith B
Assistant Professor (Sr. Gr.)
Amrita School of Artificial Intelligence, Coimbatore
Amrita Vishwa Vidyapeetham
TEAM
Amrit Subramanian, Second year, B.Tech CSE(AI), Amrita Vishwa Vidyapeetham
MY Saran Dharshan, Second year, B.Tech CSE(AI), Amrita Vishwa Vidyapeetham
“To think this all began with letting autocomplete finish our sentences.”
Source: Slide Show: New Yorker Cartoons April 24 & May 1, 2023 | The New Yorker
TRANSFORMER
• No recurrence
• No convolutions
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser,
and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30
(2017).
LANGUAGE MODELS
• Natural language generation (NLG)
• Natural language understanding (NLU)
Image source: The emergence of Large Language Models (LLMs) - The Low Down - Momentum Works
https://nlpnewsletter.substack.com/p/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-and-culture-in-nlp-845878
TYPICAL LIFE OF AN LM
WHY LLMS?
Source: cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec01.pdf
Source: Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. "Exploring the limits of transfer learning with a unified
text-to-text transformer." The Journal of Machine Learning Research 21, no. 1 (2020): 5485-5551.
● Unsupervised pre-training
○ Web pages
● Supervised fine-tuning
○ Benchmarks
Two tasks
Pre-training
• Train the network on massive data
• The data can be (mostly) unlabeled
• Expensive
Fine-tuning
• The model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data
• Fine-tuning is task-specific
Image courtesy: How does in-context learning work? A framework for understanding the differences from traditional supervised learning | SAIL Blog (stanford.edu)
Pretraining
• Encoder
• Captures bidirectional contextual information; can be conditioned on future tokens
• Example: BERT
• Decoder
• Language models; cannot be conditioned on future tokens
• Generate text
• Examples: GPT-3, GPT-3.5, GPT-4
• Encoder-Decoder
• Sequence-to-sequence mapping
• Examples: the original Transformer, T5
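To make the distinction concrete, here is a toy illustration (the sentences are assumed examples, not from the original slides):

# Encoder-style (BERT) pre-training: masked language modeling; the model may
# condition on context on both sides of the blank, including future tokens.
masked_lm_input = "the cat [MASK] on the mat"   # target for [MASK]: "sat"

# Decoder-style (GPT) pre-training: causal language modeling; the model
# predicts the next token from the left context only.
causal_lm_input = "the cat sat on the"          # target: "mat"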
PROMPTING
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing
Surveys, 55(9), 1-35.
$\Pr(Y \mid P; X)$: the probability of the output $Y$ given the prompt $P$ and the input $X$
Challenge: Finding the most appropriate prompt that allows an LM to solve the task at hand
Source: PaLM , DALL-E 2 , Chinchilla 🐭, Chain-of-thought prompting ⛓💭✍️, Values and Culture in NLP 🏛 (substack.com)
Examples of input, template, and answer for different tasks
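As a concrete illustration (a sentiment-analysis prompt in the style of Liu et al.; the template and answer mapping are illustrative assumptions):

# Cloze-style prompting: the input x is wrapped in a template with an answer
# slot [Z]; a verbalizer maps filled answers z back to task labels y.
x = "I love this movie."
template = "{x} Overall, it was a [Z] movie."
prompt = template.format(x=x)  # "I love this movie. Overall, it was a [Z] movie."
verbalizer = {"fantastic": "positive", "boring": "negative"}  # z -> y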
PROMPT ENGINEERING
Prefix tuning
• Having a proper context can steer the LM without changing its parameters
• If we want the LM to generate a word (e.g., "Learning"), we can prepend its common collocations as context (e.g., "Machine"), and the LM will assign a much higher probability to the desired word
• Prepends a sequence of task-specific vectors to the input while keeping the parameters of the LM frozen
P_idx denotes the sequence of prefix indices, and |P_idx| denotes the length of the prefix
Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190.
$$h_i = \begin{cases} P_\theta[i,:] & \text{if } i \in P_{idx} \\ \mathrm{LM}_\phi(z_i, h_{<i}) & \text{otherwise} \end{cases}$$
The language model parameters are fixed and the prefix parameters θ are the only
trainable parameters.
Parametrization of P_θ: P_θ[i,:] = MLP_θ(P'_θ[i,:]), where P'_θ is a smaller matrix with the same row dimension (the prefix length) but a different column dimension. After the fine-tuning process, only P_θ is stored in memory.
soft_prompt = torch.nn.Parameter(torch.rand(num_tokens, embed_dim))

def transformer_block_for_prefix_tuning(x):
    prefix = FFN(soft_prompt)         # reparametrize the trainable prefix
    x = concat([prefix, x], dim=seq)  # prepend the prefix to the input sequence
    return transformer_block(x)
Source: Understanding Parameter-Efficient LLM Finetuning: Prompt Tuning And Prefix Tuning (sebastianraschka.com)
PROMPT TUNING
• An approach to add extra information for the model to condition on during its generation of the text
• Prompt tuning removes the restriction that the prompt P be parameterized by the model's parameters θ; instead, the prompt has its own dedicated parameters, θ_P, that can be updated
• Prefix tuning learns soft prompts at all layers of the model, while prompt tuning only modifies the input
LLM | Premjith B
47
LLM | Premjith B
Source: GitHub - arazd/ProgressivePrompts: Progressive Prompts: Continual Learning for Language Models
ANSWER ENGINEERING
• Aims to search for an answer space Z and a map to the original output
Y that results in an effective predictive model
• Two dimensions
• Answer shape
• Answer design
ANSWER SHAPE
The shape of an answer characterizes its granularity; the selection of the shape of acceptable answers depends on the task to perform.
• Tokens: One of the tokens in the pre-trained LM's vocabulary, or a subset of the vocabulary. Token or text-span answer spaces are widely used in classification tasks, relation extraction, or named entity recognition.
• Span: A short multi-token span. These are usually used together with cloze prompts.
• Sentence: A sentence or document. These are commonly used with prefix prompts. Longer phrasal or sentential answers are often used in language generation tasks.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A
systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1-35.
ANSWER DESIGN
• Manual Design
• The space of potential answers, Z, and its mapping to the output Y, are
designed manually
Hambardzumyan, K., Khachatrian, H., & May, J. (2021). Warp: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121.
MULTI-PROMPT LEARNING
Instead of a single prompt, multiple prompts can be used
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing
Surveys, 55(9), 1-35.
CHALLENGES IN PROMPTING
INSTRUCTION FINE-TUNING
Collect examples of (instruction, output) pairs across many tasks and fine-tune an LM on them
Source: Chung, Hyung Won, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li et al. "Scaling instruction-finetuned language models." arXiv preprint arXiv:2210.11416 (2022).
Wei, Jason, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. "Finetuned language models are zero-shot learners." arXiv
preprint arXiv:2109.01652 (2021).
Instruction Tuning
• An LLM can be directly fine-tuned in a fully supervised manner on the collected datasets
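As a minimal sketch of what such supervised examples look like (the serialization format and the data here are illustrative assumptions, not a specific paper's template):

examples = [
    {"instruction": "Translate to French: Good morning.", "output": "Bonjour."},
    {"instruction": "Classify the review 'Great film!' as positive or negative.",
     "output": "positive"},
]

def to_training_text(ex):
    # The LM is trained on the concatenated text; the loss is typically
    # computed only on the response tokens.
    return f"Instruction: {ex['instruction']}\nResponse: {ex['output']}"

for ex in examples:
    print(to_training_text(ex))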
• Mixing few-shot settings: Training with mixed zero-shot and few-shot prompts
significantly improves performance in both settings.
• Task diversity: Large models benefit from continuously increasing the number of
tasks.
• Data augmentation: Augmenting the data, such as by inverting inputs and outputs
(e.g., turning a question-answering task into a question-generation task), is
beneficial.
• Mixing weights: Appropriately tuning the mixing weights is important when using
a combination of instruction-tuning datasets.
Source: Instruction Tuning Vol. 1 - by Sebastian Ruder (substack.com)
Zhang, Shengyu, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li et al. "Instruction Tuning for Large Language Models: A Survey." arXiv preprint
arXiv:2308.10792 (2023).
FINE-TUNING
• PEFT – classifications
• Does the method introduce new parameters to the model?
• Does it fine-tune a small subset of existing parameters?
• Does the method aim to minimize memory footprint or only storage
efficiency?
• Additive methods
• Selective methods
• Reparametrization-based methods
• Hybrid methods
Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
ADDITIVE METHODS
Adapters
• Attach a small fully connected layer at every layer of the transformer
• Inserts small modules (adapters) between transformer layers
• Adapter layer performs a down projection to project the input hidden layer
information to a lower-dimensional space, followed by a non-linear activation
function and an up projection
• A residual connection to generate the final form
$$h \leftarrow h + f(h W_{down}) W_{up}, \qquad W_{down} \in \mathbb{R}^{d \times r}, \quad W_{up} \in \mathbb{R}^{r \times d}$$
def transformer_block_with_adapter(x):
    residual = x
    x = SelfAttention(x)
    x = FFN(x)  # adapter
    x = LN(x + residual)
    residual = x
    x = FFN(x)  # transformer FFN
    x = FFN(x)  # adapter
    x = LN(x + residual)
    return x
$$h \leftarrow (1 - \lambda(x))\,h + \lambda(x)\,f(x W_1) W_2, \qquad P_k, P_v \in \mathbb{R}^{l \times d}, \qquad f = \mathrm{Softmax}(\cdot)$$
(Prefix tuning rewritten in the unified view of He et al., with $W_1 = W_q P_k^\top$ and $W_2 = P_v$.)
• Sparse Adapter
• Pruned adapters
• Reduces the model size of neural networks by pruning redundant parameters and training the remaining ones
Source: He, Shwai, Liang Ding, Daize Dong, Miao Zhang, and Dacheng Tao. "Sparseadapter: An easy approach for improving the parameter-efficiency of adapters." arXiv preprint arXiv:2210.04284 (2022).
Pruning methods (illustrated in the sketch below)
• Random: $z \sim \mathrm{Uniform}(0, 1)$
• Magnitude: $z = |w|$
• Erdős–Rényi (ER): $\text{sparsity} \propto 1 - \dfrac{n_{in} + n_{out}}{n_{in} \cdot n_{out}}$
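A small sketch of how these scores could drive a mask for one adapter weight matrix (a reading of the formulas above, not the paper's exact procedure):

import torch

w = torch.randn(64, 16)                  # an adapter weight matrix (assumed shape)
z_random = torch.rand_like(w)            # Random: z ~ Uniform(0, 1)
z_magnitude = w.abs()                    # Magnitude: z = |w|
n_in, n_out = w.shape
sparsity = 1 - (n_in + n_out) / (n_in * n_out)   # Erdos-Renyi layer sparsity

# Keep the highest-scoring weights up to the target density.
k = int(w.numel() * (1 - sparsity))
mask = torch.zeros(w.numel())
mask[z_magnitude.view(-1).topk(k).indices] = 1.0
mask = mask.view_as(w)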
• Compacter
Source: Karimi Mahabadi, Rabeeh, James Henderson, and Sebastian Ruder. "Compacter: Efficient low-rank hypercomplex adapter layers." Advances in Neural Information Processing Systems 34 (2021): 1022-1035.
• AdapterHub
• An easy-to-use and extensible adapter training and sharing framework for transformer-based models
Soft prompts
• Some of the model's input embeddings are fine-tuned via gradient descent
• Soft prompts can be trained for the input layer only or for all layers
• Soft prompts can be pre-trained, or prompts from other tasks can be reused, to reduce the computation required for fine-tuning a soft prompt for a new task

def soft_prompted_model(input_ids):
    x = Embed(input_ids)
    x = concat([soft_prompt, x], dim=seq)
    return model(x)
SELECTIVE METHODS
• Cross-attention fine-tuning
• Originally designed for machine translation
• The model's parameters decompose into θ_src (source embeddings), θ_tgt (target embeddings), θ_enc (encoder), θ_dec (decoder), and θ_xattn (cross-attention), and fine-tuning is restricted to the cross-attention parameters θ_xattn
• BitFit
• Bias-terms Fine-tuning (BitFit)
• Freeze most of the transformer-encoder parameters and train only the bias terms and the task-specific classification layer (see the sketch below)
• Fine-tunes only a small portion of the model's parameters
• The bias terms amount to less than 0.1% of the total number of parameters
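A minimal sketch of BitFit-style selection in PyTorch (`model` is an assumed placeholder; real setups also keep the task-specific classification head trainable):

import torch

# Freeze everything except the bias terms.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)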
REPARAMETRIZATION-BASED METHODS
• Intrinsic SAID
• Structure-Aware Intrinsic Dimension (SAID)
• An objective function’s intrinsic dimensionality describes the minimum
dimension needed to solve the optimization problem it defines to some
precision level
• Intrinsic dimensionality of a pre-trained LLM (or LM): The number of
free parameters required to closely approximate the optimization
problem to be solved during fine-tuning of a model for a downstream
task.
• Intrinsic dimension is the lowest dimensional subspace in which one can
optimize the original function to within a certain level of approximation
error
$\theta^D = [\theta_0, \theta_1, \ldots, \theta_m]$: the set of $D$ parameters that parameterize some model
SAID
$$\theta_i^D = \theta_{0,i}^D + \lambda_i P(\theta^{d-m})_i$$
where $P$ projects from the low-dimensional subspace to the full parameter space and $\lambda_i$ are per-layer scaling factors ($m$ of the $d$ dimensions are used for these scalings).
Problem statement
Approach
$$\max_{\Theta} \sum_{(x, y)} \sum_{t=1}^{|y|} \log P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t})$$
people.cs.umass.edu/~miyyer/cs685/slides/multilingual.pdf
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
$$B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)$$
Applied to the attention weight matrices $W_q, W_k, W_v \in \mathbb{R}^{d_{model} \times d_{model}}$.
Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. "Lora: Low-rank adaptation of large language models."
arXiv preprint arXiv:2106.09685 (2021).
def lora_linear(x):
    h = x @ W            # regular linear
    h += x @ W_A @ W_B   # low-rank update
    return scale * h
Source: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA (huggingface.co)
Sources: The bfloat16 numerical format | Cloud TPU | Google Cloud; https://nhigham.com/2018/12/03/half-precision-arithmetic-fp16-versus-bfloat16/
• Three components
• 4-bit NormalFloat quantization
• Double quantization
• Paged optimizers
$$q_i = \frac{1}{2}\left( Q_X\!\left(\frac{i}{2^k + 1}\right) + Q_X\!\left(\frac{i + 1}{2^k + 1}\right) \right), \qquad i = 1, \ldots, 2^k$$
Quantile quantization: a lossy minimum-entropy encoding with k bits has the property that, for any input data, the quantized outputs take the value of each of the 2^k different bit representations equally often.
• Ensures an equal number of values per quantization bin from the input tensor
• Q_X is the quantile function: it takes a quantile level as input and returns the corresponding quantile value of the standard normal distribution
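A minimal sketch of the $q_i$ formula over a standard normal (the actual NF4 levels in QLoRA also handle the infinite tail quantiles and reserve an exact zero; this is only the naive version):

import torch

k = 4
normal = torch.distributions.Normal(0.0, 1.0)
i = torch.arange(1, 2 ** k)   # skip the last bin, whose upper quantile is infinite
q = 0.5 * (normal.icdf(i / (2 ** k + 1)) + normal.icdf((i + 1) / (2 ** k + 1)))
q = q / q.abs().max()         # normalize the levels into [-1, 1]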
Double quantization
• The process of quantizing the quantization constants for additional memory savings
Paged optimizers
• Utilize the NVIDIA unified memory feature, which performs automatic page-to-page transfers between the CPU and GPU, functioning much like regular memory paging between CPU RAM and the disk
1. 4-bit integers represent 16 levels evenly spaced in the [−1, 1] range. The levels are −1.0, −0.8667, −0.7333, −0.6, −0.4667, −0.3333, −0.2, −0.0667, 0.0667, 0.2, 0.3333, 0.4667, 0.6, 0.7333, 0.8667, 1.0.
2. Suppose a weight in the big FP32 model is 0.23456.
3. The closest value among the 16 levels is 0.2.
4. Quantize the weight to 0.2.
5. In the 4-bit representation, store the value 10 (0.2 is the 10th value in the 16 levels).
6. To use this 4-bit weight in computation, dequantize it back to FP32 using the stored index (10th value = 0.2).
7. The dequantization error is 0.23456 − 0.2 = 0.03456 (about 1/4 of the quantization step size, 0.1333).
Source: Understanding LoRA and QLoRA — The Powerhouses of Efficient Finetuning in Large Language Models | by Murali Manohar | Aug, 2023 | Medium
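The worked example can be verified in a few lines (a sketch of the evenly spaced 4-bit grid described above):

# 16 evenly spaced levels in [-1, 1]; step size = 2/15, about 0.1333.
levels = [-1 + 2 * j / 15 for j in range(16)]
w = 0.23456
idx = min(range(16), key=lambda j: abs(levels[j] - w))  # nearest level
print(idx, levels[idx], w - levels[idx])  # 9 (zero-based; the 10th level), 0.2, 0.03456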
QLoRA
$$\mathrm{doubleDequant}(c_1^{\mathrm{FP32}}, c_2^{k\text{-bit}}, \mathbf{W}^{\mathrm{NF4}}) = \mathrm{dequant}\big(\mathrm{dequant}(c_1^{\mathrm{FP32}}, c_2^{k\text{-bit}}),\, \mathbf{W}^{\mathrm{NF4}}\big) = \mathbf{W}^{\mathrm{BF16}}$$
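A rough sketch of the idea behind double quantization (the block sizes 64 and 256 follow the QLoRA paper; the actual 8-bit quantization of the constants is elided):

import torch

W = torch.randn(4096, 4096)
blocks = W.reshape(-1, 64)            # first-level blocks of 64 weights each
c2 = blocks.abs().max(dim=1).values   # one absmax constant per block
# Second level: group the constants and keep only one FP32 absmax c1 per
# group of 256, so the per-block constants can be stored in low precision.
c1 = c2.reshape(-1, 256).abs().max(dim=1).values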
HYBRID METHODS
• MAM (Mix-and-Match) Adapter: incorporates both adapters and prompt tuning

def transformer_block_mam(x):
    x = concat([x, soft_prompt], dim=seq)
    residual = x
    x = SelfAttention(x)
    x = LN(x + residual)
    x_a = FFN(x)       # parallel adapter
    x_a = scale * x_a
    x = LN(x + x_a)
    return x
UniPELT
• Gated combination of LoRA, prefix-tuning, and adapters
• LoRA reparametrization is used for the attention matrices, prefix-tuning is applied to the keys and values of each layer, and adapters are added after the feed-forward layer of the transformer block
LLAMA
Intuition
Pre-training data
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière et al. "Llama: Open and efficient foundation language models." arXiv
preprint arXiv:2302.13971 (2023).
Architecture of LLaMA
• Pre-normalization
• Normalize the input of each transformer sub-layer instead of normalizing the output
• Uses the RMSNorm normalization function (a code sketch follows this list):
$$\bar{a}_i = \frac{a_i}{\mathrm{RMS}(a)}\, g_i, \qquad \mathrm{RMS}(a) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} a_i^2}$$
• This approach was also used in GPT-3
• SwiGLU activation function
$$\mathrm{SwiGLU}(x, W, V) = \mathrm{Swish}_1(xW) \otimes xV, \qquad \mathrm{Swish}_\beta(x) = x\,\sigma(\beta x)$$
• Rotary Embeddings
• Used rotary embeddings instead of absolute positional embeddings
• Optimizer: AdamW
• β₁ = 0.9; β₂ = 0.95
• Weight decay = 0.1
• Gradient clipping = 1.0
• Warmup steps = 2000
• Context length = 2048 tokens
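A minimal PyTorch sketch of RMSNorm matching the formula above (the eps term is an assumed numerical-stability detail):

import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.g = torch.nn.Parameter(torch.ones(dim))  # per-dimension gain g_i

    def forward(self, a):
        # Scale by the root mean square; unlike LayerNorm, no mean subtraction.
        rms = torch.sqrt(a.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (a / rms) * self.g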
LLAMA 2
• 7B to 70B parameters
• Two models: LLaMA 2 and LLaMA 2-Chat
• Trained on a new mix of publicly available data
• Did not include data from Meta's products or services
• Increased the size of the pretraining corpus by 40%, doubled the context length of the model, and adopted grouped-query attention
• Grouped-query attention divides query heads into G groups, each of which shares a single key head and value head (see the sketch below)
• Trained on 2 trillion tokens
• LLaMA 2-Chat is optimized for dialogue use cases
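A rough sketch of grouped-query attention with assumed sizes (H query heads sharing G key/value heads):

import torch

H, G, T, d = 8, 2, 10, 64          # query heads, KV groups, seq length, head dim
q = torch.randn(1, H, T, d)
k = torch.randn(1, G, T, d)
v = torch.randn(1, G, T, d)

# Each key/value head serves H // G query heads.
k = k.repeat_interleave(H // G, dim=1)
v = v.repeat_interleave(H // G, dim=1)
attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
out = attn @ v                     # shape (1, H, T, d)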
Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint
arXiv:2307.09288 (2023).
Training details
• Standard transformer architecture
• Pre-normalization using RMSNorm
• SwiGLU activation function
• Rotary positional embedding
• Context length = 4096 tokens
• Optimizer: AdamW (see the sketch after this list)
• β₁ = 0.9; β₂ = 0.95
• Weight decay = 0.1
• Gradient clipping = 1.0
• Warmup steps = 2000
• eps = 10⁻⁵
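For concreteness, the listed settings map onto a hypothetical PyTorch setup like this (`model` and `peak_lr` are assumed placeholders):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1)
# Gradient clipping at 1.0, applied each training step:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)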
Image source:GitHub - OpenGVLab/LLaMA-Adapter: Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
REFERENCES
1. https://www.newyorker.com/cartoons/issue-cartoons/cartoons-from-the-april-24-and-may-1-
2023-issue
2. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural
information processing systems 30 (2017).
3. https://thelowdown.momentum.asia/the-emergence-of-large-language-models-llms/
4. cs.princeton.edu/courses/archive/fall22/cos597G
5. Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
Yanqi Zhou, Wei Li, and Peter J. Liu. "Exploring the limits of transfer learning with a unified
text-to-text transformer." The Journal of Machine Learning Research 21, no. 1 (2020): 5485-
5551
6. https://github.com/FourthBrain/Building-with-Instruction-Tuned-LLMs-A-Step-by-Step-
Guide#wave-welcome-to-the-support-repository-for-the-deeplearningai-event-building-with-
instruction-tuned-llms-a-step-by-step-guide
7. https://en.wikipedia.org/wiki/Generative_pre-trained_transformer
8. http://ai.stanford.edu/blog/understanding-incontext/
9. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and
predict: A systematic survey of prompting methods in natural language processing. ACM
Computing Surveys, 55(9), 1-35.
10. 263-5354-00L Large Language Models
11. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D.
(2020). Language models are few-shot learners. Advances in neural information processing
systems, 33, 1877-1901.
12. Li, X. L., & Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for
generation. arXiv preprint arXiv:2101.00190.
13. Hambardzumyan, K., Khachatrian, H., & May, J. (2021). Warp: Word-level adversarial
reprogramming. arXiv preprint arXiv:2101.00121.
14. Natural Language Processing with Deep Learning CS224N
15. Chung, Hyung Won, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li et
al. "Scaling instruction-finetuned language models." arXiv preprint arXiv:2210.11416 (2022).
16. Zhang, Shengyu, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li et
al. "Instruction Tuning for Large Language Models: A Survey." arXiv preprint
arXiv:2308.10792 (2023).
17. https://anoopsarkar.github.io/advanced-nlp-class/assets/slides/peft.pdf
18. Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling down to scale up: A guide to
parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
19. Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe,
Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. "Parameter-efficient transfer learning
for NLP." In International Conference on Machine Learning, pp. 2790-2799. PMLR, 2019.
20. Pfeiffer, Jonas, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder,
Kyunghyun Cho, and Iryna Gurevych. "Adapterhub: A framework for adapting transformers."
arXiv preprint arXiv:2007.07779 (2020).
21. He, Junxian, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig.
"Towards a unified view of parameter-efficient transfer learning." arXiv preprint
arXiv:2110.04366 (2021).
22. He, Shwai, Liang Ding, Daize Dong, Miao Zhang, and Dacheng Tao. "Sparseadapter: An easy
approach for improving the parameter-efficiency of adapters." arXiv preprint
arXiv:2210.04284 (2022).
23. Karimi Mahabadi, Rabeeh, James Henderson, and Sebastian Ruder. "Compacter: Efficient low-
rank hypercomplex adapter layers." Advances in Neural Information Processing Systems 34
(2021): 1022-1035.
24. Edalati, Ali, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J. Clark, and Mehdi
Rezagholizadeh. "Krona: Parameter efficient tuning with kronecker adapter." arXiv preprint
arXiv:2212.10650 (2022).
25. Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language
models are unsupervised multitask learners." OpenAI blog 1, no. 8 (2019): 9.
26. Mozhdeh Gheini, Xiang Ren, and Jonathan May. 2021. Cross-Attention is All You Need: Adapting
Pretrained Transformers for Machine Translation. In Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, pages 1754–1765, Online and Punta Cana, Dominican
Republic. Association for Computational Linguistics.
27. Zaken, E. B., Ravfogel, S., & Goldberg, Y. (2021). Bitfit: Simple parameter-efficient fine-tuning for
transformer-based masked language-models. arXiv preprint arXiv:2106.10199.
28. Alan Ansell, Edoardo Ponti, Anna Korhonen, and Ivan Vulić. 2022. Composable Sparse Fine-Tuning for
Cross-Lingual Transfer. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 1778–1796, Dublin, Ireland. Association for
Computational Linguistics.
29. Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. 2021. Intrinsic Dimensionality Explains the
Effectiveness of Language Model Fine-Tuning. In Proceedings of the 59th Annual Meeting of the
Association for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), pages 7319–7328, Online. Association for
Computational Linguistics.
30. Aghajanyan, Armen, Luke Zettlemoyer, and Sonal Gupta. "Intrinsic dimensionality explains the
effectiveness of language model fine-tuning." arXiv preprint arXiv:2012.13255 (2020).
31. Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu
Wang, and Weizhu Chen. "Lora: Low-rank adaptation of large language models." arXiv
preprint arXiv:2106.09685 (2021).
32. https://people.cs.umass.edu/~miyyer/cs685/slides/multilingual.pdf
33. Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. "Qlora: Efficient
finetuning of quantized llms." arXiv preprint arXiv:2305.14314 (2023).
34. https://huggingface.co/blog/4bit-transformers-bitsandbytes
35. https://cloud.google.com/tpu/docs/bfloat16
36. Dettmers, Tim, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. "8-bit optimizers via block-
wise quantization." arXiv preprint arXiv:2110.02861 (2021).
37. https://andlukyane.com/blog/paper-review-qlora
38. Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving language
understanding by generative pre-training." (2018).
39. Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière et al. "Llama: Open and efficient foundation language
models." arXiv preprint arXiv:2302.13971 (2023).
40. Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv
preprint arXiv:2307.09288 (2023).
41. Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas et al. "Training compute-optimal large language models."
arXiv preprint arXiv:2203.15556 (2022).
42. https://vinija.ai/models/LLaMA/
43. Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. "Roformer:
Enhanced transformer with rotary position embedding." arXiv preprint arXiv:2104.09864
(2021).
44. https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md
45. Ainslie, Joshua, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit
Sanghai. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints."
arXiv preprint arXiv:2305.13245 (2023).
46. https://ai.meta.com/llama/?
utm_source=alexandrabarr.beehiiv.com&utm_medium=referral&utm_campaign=llama-2-explained-
training-performance-and-results
47. https://github.com/OpenGVLab/LLaMA-Adapter
48. Daniel Jurafsky and James H Martin. 2021. Speech and language processing: An introduction to natural
language processing, computational linguistics, and speech recognition.
49. https://magazine.sebastianraschka.com/p/understanding-parameter-efficient#:~:text=Prefix%20Versus
%20Prompt%20Tuning&text=Prefix%20tuning%20modifies%20more%20layers,in%20fewer
%20parameters%20being%20updated.
50. Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022. P-Tuning:
Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68,
Dublin, Ireland. Association for Computational Linguistics.
51. Wei, Jason, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
Andrew M. Dai, and Quoc V. Le. "Finetuned language models are zero-shot learners." arXiv preprint
arXiv:2109.01652 (2021).
52. https://nlpnewsletter.substack.com/p/instruction-tuning-vol-1?utm_source=post-email-
title&publication_id=1178062&post_id=136684903&utm_campaign=email-post-
title&isFreemail=true&r=ktq1z&utm_medium=email
53. https://nlpnewsletter.substack.com/p/palm-dall-e-2-chinchilla-chain-of-thought-prompting-values-
and-culture-in-nlp-845878