
Challenges and Applications of Large Language Models

Jean Kaddour α,†,∗, Joshua Harris β,∗, Maximilian Mozes α, Herbie Bradley γ,δ,ϵ, Roberta Raileanu ζ, and Robert McHardy η,∗

α University College London, β UK Health Security Agency, γ EleutherAI, δ University of Cambridge, ϵ Stability AI, ζ Meta AI Research, η InstaDeep

arXiv:2307.10169v1 [cs.CL] 19 Jul 2023

Abstract

Large Language Models (LLMs) went from non-existent to ubiquitous in the machine learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify the remaining challenges and already fruitful application areas. In this paper, we aim to establish a systematic set of open problems and application successes so that ML researchers can comprehend the field's current state more quickly and become productive.

[Figure 1 omitted: diagram grouping the challenges into Design, Behavior, and Science. Labels shown include: Unfathomable Datasets, Tokenizer-Reliance, High Pre-Training Costs, Fine-Tuning Overhead, High Inference Latency, Limited Context Length, Prompt Brittleness, Hallucinations, Misaligned Behavior, Outdated Knowledge, Tasks Not Solvable By Scale, Detecting Generated Texts, Brittle Evaluations, Evaluations Based on Static Human-Written Ground Truth, Lacking Experimental Designs, Lack of Reproducibility.]
Figure 1: Overview of LLM Challenges. Designing LLMs relates to decisions taken before deployment. Behavioral challenges occur during deployment. Science challenges hinder academic progress.

Contents

1 Introduction . . . 1
2 Challenges . . . 2
2.1 Unfathomable Datasets . . . 2
2.2 Tokenizer-Reliance . . . 4
2.3 High Pre-Training Costs . . . 6
2.4 Fine-Tuning Overhead . . . 10
2.5 High Inference Latency . . . 11
2.6 Limited Context Length . . . 14
2.7 Prompt Brittleness . . . 17
2.8 Hallucinations . . . 19
2.9 Misaligned Behavior . . . 22
2.10 Outdated Knowledge . . . 27
2.11 Brittle Evaluations . . . 27
2.12 Evaluations Based on Static, Human-Written Ground Truth . . . 28
2.13 Indistinguishability between Generated and Human-Written Text . . . 29
2.14 Tasks Not Solvable By Scale . . . 30
2.15 Lacking Experimental Designs . . . 31
2.16 Lack of Reproducibility . . . 33
3 Applications . . . 34
3.1 Chatbots . . . 34
3.2 Computational Biology . . . 36
3.3 Computer Programming . . . 37
3.4 Creative Work . . . 39
3.5 Knowledge Work . . . 40
3.6 Law . . . 42
3.7 Medicine . . . 43
3.8 Reasoning . . . 44
3.9 Robotics and Embodied Agents . . . 45
3.10 Social Sciences & Psychology . . . 46
3.11 Synthetic Data Generation . . . 48
4 Related Work . . . 49
5 Conclusion . . . 49

∗ Equal contribution. {jean.kaddour,robert.mchardy}.20@ucl.ac.uk, joshua.harris@ukhsa.gov.uk

1 Introduction

Given the quickly growing plethora of LLM research papers, we aim to address two questions: (1) Challenges: What problems remain unresolved? and (2) Applications: Where are LLMs currently being applied, and how are the challenges constraining them? For (1), we group the challenges in Fig. 1 into three broader categories: "Design", "Behavior", and "Science". To provide answers for (2), we explore the fields of chatbots, computational biology, computer programming, creative work, knowledge work, law, medicine, reasoning, robotics, and the social sciences.
This paper is an opinionated review and assumes familiarity with LLMs and how they work (we refer to more introductory works in Sec. 4). Further, we focus on models trained on text data. We target a technical researcher audience and do not discuss political, philosophical, or moral perspectives on LLMs.

2 Challenges

Challenge
This box highlights a challenge.

2.1 Unfathomable Datasets

Scaling the amount of pre-training data has been one of the major drivers to equip LLMs with general-purpose capabilities [256]. The size of pre-training datasets quickly outgrew the number of documents most human teams could manually quality-check. Instead, most data collection procedures rely on heuristics regarding data sources and filtering.

In this section, we explore the adverse consequences of these heuristics and the reality that many model practitioners possess only a nebulous understanding of the data on which their model has been trained. We refer to this issue as follows.

Unfathomable Datasets
The size of modern pre-training datasets renders it impractical for any individual to read or conduct quality assessments on the encompassed documents thoroughly.

Near-Duplicates can arise in different forms and have been reported to degrade model performance [294, 200, 250]. Near-duplicates are harder to find than exact duplicates, whose removal is a standard step in most data collection pipelines, e.g., using the MinHash algorithm [57]. Lee et al. [294] propose the NearDup method and find that over 1% of tokens emitted unprompted from a model are part of a memorized sequence of the C4 dataset, e.g., it contains a 61-word sequence repeated 61,036 times in the training split. By deduplicating it, they reduce the rate of emitted memorizations by 10x. Abbas et al. [6] introduce SemDeDup, a technique designed to identify semantic duplicates that, although perceptually distinct, convey predominantly similar information, such as sentences with analogous structures with certain words replaced by synonyms. After applying their method to C4, they find that it improves over NearDup. Similarly, Kaddour [250] finds near-duplicates in the Pile [165] by clustering document embeddings and identifying clusters gathering duplicates.

Benchmark Data Contamination occurs when the training dataset contains data from or similar to the evaluation test set. This can lead to inflated performance metrics, as the model can memorize the test data and simply regurgitate it back during testing.

Finding and removing all training and test data overlaps is difficult in practice. For example, the GPT-3 authors Brown et al. [59] found a code bug after training, resulting in only partially removing all detected overlaps from the training data. They could not afford to retrain the model, so they used it with the remaining overlaps and "cleaned" variants of the considered benchmarks, with all potentially leaked examples removed. They define overlapping examples as examples that share at least 13 consecutive words with any other example in the pre-training set. If an example is shorter than 13 words, they consider it overlapping if it shares all of its words with another example.

Similarly, Dodge et al. [125] search for test data in the web-crawled C4 corpus but measure exact matches, normalized for capitalization and punctuation. They find various input-and-label contaminations of text generation and knowledge completion tasks, and input-only contaminations of the GLUE benchmark. They argue that there are two ways test data can end up in a snapshot of Common Crawl (the original dump source of C4): either a given test set is built from web text or it is uploaded after creation. Sainz et al. [472] ask ChatGPT to generate academic benchmark instances, finding that it has memorized multiple ones, including some test splits. Jacovi et al. [237] propose three strategies to mitigate contamination, including encryption and training exclusion controls.
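As a concrete illustration of the 13-gram overlap rule described above, the following is a minimal sketch of such a contamination check (not the actual pipeline used by Brown et al. [59]; the whitespace tokenization and the handling of short examples are simplified, and all function names are illustrative).

```python
def ngrams(tokens, n=13):
    """Return the set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_train_index(train_examples, n=13):
    """Collect all n-grams seen anywhere in the pre-training data."""
    index = set()
    for example in train_examples:
        tokens = example.lower().split()
        if len(tokens) < n:
            index.add(tuple(tokens))  # short examples are indexed whole
        else:
            index.update(ngrams(tokens, n))
    return index

def is_contaminated(test_example, train_index, n=13):
    """Flag a test example if it shares any 13-gram with the training data."""
    tokens = test_example.lower().split()
    if len(tokens) < n:
        return tuple(tokens) in train_index
    return any(g in train_index for g in ngrams(tokens, n))
```

In practice, such checks are run over web-scale corpora with hashed n-gram indices rather than in-memory Python sets; the sketch only conveys the overlap definition.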

Personally Identifiable Information (PII), such as phone numbers and email addresses, has been found within pre-training corpora, resulting in privacy leaks during prompting. Carlini et al. [65, 67], Lukas et al. [344] extract PII data by prompting GPT-2; Kulkarni [283] reports how an engineer obtained secret API keys by prompting GitHub Copilot. Henderson et al. [195] discuss the availability of PII in law data across different jurisdictions and filter it based on the legal norm in the respective jurisdiction. El-Mhamdi et al. [137] contend that because strong model performance typically requires memorization of the training data [146, 58], the (undetected) existence of PII in the training data will likely result in models from which it can be extracted.

Pre-Training Domain Mixtures Several studies have argued for diversity in the pre-training corpus [165, 341, 291]. Many popular corpora follow this by concatenating datasets from different sources, as illustrated in Table 1. However, it remains underexplored what amount of data from different sources is necessary for strong downstream performance. Suboptimal mixtures can cause low transferability to downstream tasks [593, 580] and reliance on spurious correlations [253, 618, 347]. Xie et al. [622] find domain mixture proportions by training a small proxy model using group-distributionally robust optimization [471]; surprisingly, they find that the final model trained using their found domain weights yields improved perplexity across all domains, even when it down-weights a domain. Given a target downstream task, Yao et al. [641], Xie et al. [624] select subsets most useful for pre-training. Longpre et al. [341] measure the effects of domain compositions and find that the inclusion of heterogeneous data sources is broadly beneficial and likely more important than data quality (as measured by the document quality classifier employed by PaLM [86] and GLaM [130]) or size, which also motivates smaller yet more diverse pre-training datasets [250].

Date | Name | Size | Tokens* | Sources | Public
2014 | BookCorpus [684, 36] | 5 GB | 11 B | Novels | Yes
2019 | OSCAR [399] | 6.3 T | ? | Webpages in 166 languages | Yes
2019 | WebText [440] | 40 GB | ? | Webpages | No
12.2020 | CC-100 [100] | 2.5 TB | 292 B | Webpages in 100 languages | Yes
12.2020 | The Pile [165, 41] | 825 GB | 300 B | Science, Webpages, GitHub Code, Law, etc. | Yes
2020 | C4 [443] | 745 GB | 156 B | Webpages | Yes
10.2020 | mC4 [631] | ? | 6.3 T | Webpages in 101 languages | Yes
2021 | MassiveText [441] | 10.5 TB | 2.34 T | Webpages, Books, News, and Code | No
12.2021 | GLaM [130] | ? | 1.6 T | Webpages, Wikipedia, Conversations, Forums, Books, News | No
01.2022 | Infiniset [551] | ? | 2.81 T | Forum dialogs, C4 data, Code, Wikipedia, Webpages | No
06.2022 | ROOTS [289] | 1.61 TB | 2.34 T | Webpages in 46 languages and GitHub Code in 13 languages | Yes
11.2022 | The Stack [271] | 6 TB | 235 B | GitHub Code in 30 languages | Yes
04.2023 | LLaMA [556] / RedPajama [98] | 2.7 TB | 1.2 T | Webpages, GitHub Code, Science, Wikipedia, Books | Yes
06.2023 | RefinedWeb [415] | 2.8 TB | 600 B | Webpages | Yes

Table 1: Overview of Selected Pre-Training Datasets. Over the years, pre-training datasets have become more unfathomable: they grew rapidly in size and diversity, and not all datasets are publicly available (we do not include datasets that have very little or no information available about them). Unless stated otherwise, the natural language is in English. *We report the number of tokens as provided by the respective paper, based on its proposed tokenization scheme.

Fine-Tuning Task Mixtures have to be determined for fine-tuning a pre-trained model on many different tasks, usually with comparatively few examples per task. This technique, which we call multitask-prompted fine-tuned LMs (MTLMs), has demonstrated significant generalization improvements with very little additional training compute. For example, instruction fine-tuning via task instructions prepended to each set of input-output pairs is a very popular scheme, which we will later discuss in more detail in Sec. 2.9. Wang et al. [589] propose Super-NaturalInstructions, a fine-tuning dataset with 1,616 diverse tasks and expert-written instructions. Muennighoff et al. [377] extend MTLM to the multilingual setting, showing that fine-tuning on multilingual tasks with English prompts improves results on tasks in all languages.

However, similar to the previous paragraph, how to balance the task datasets well remains unclear.

As the tasks can vary in size considerably, Raffel et al. [443] mix each task in proportion to the number of examples in its 'train' split (up to some max_num_examples), as sketched below. Jang et al. [239] report that MTLMs can underperform expert LLMs fine-tuned on only a single task because of (i) negative task transfer, where learning multiple tasks at once hinders the learning of some specific tasks, and (ii) catastrophic forgetting of previous tasks when learning new tasks. Iyer et al. [235] study varying task (set) proportions, finding several trade-offs and concluding that the right values for these parameters depend on the downstream end-goals. Longpre et al. [340] balance different sets of task sources by omitting them, one at a time, and ranking their contributions on the MMLU benchmark [197]; further, they mix the input prompt templates of zero- and few-shot prompting, finding that this improves the performance in both settings. Another trend is to imitate closed-source models like ChatGPT by collecting a dataset of API outputs (against OpenAI's terms and conditions) and fine-tuning an open-source LM with it [540]. However, Gudibande et al. [180] point out that such imitation models are only good at mimicking the proprietary model's style but not its content, a distinction that has been discussed extensively in the causality literature [253]. They conclude that substantial capability gaps between fine-tuned open-source and closed-source models remain, motivating future work for better imitation data.
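As a toy illustration of the examples-proportional mixing with a cap (max_num_examples) referenced at the start of this paragraph, one might compute per-task sampling rates as follows; the task names and sizes are invented for illustration.

```python
# Examples-proportional mixing with a cap, as a toy sketch.
task_sizes = {"nli": 400_000, "summarization": 200_000, "qa": 50_000}
max_num_examples = 100_000  # cap so that very large tasks do not dominate

capped = {t: min(n, max_num_examples) for t, n in task_sizes.items()}
total = sum(capped.values())
mixing_rates = {t: n / total for t, n in capped.items()}
print(mixing_rates)  # {'nli': 0.4, 'summarization': 0.4, 'qa': 0.2}
```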

2.2 Tokenizer-Reliance

Tokenization is the process of breaking a sequence of words or characters into smaller units called tokens, such that they can be fed into the model. One common tokenization approach is subword tokenization, where we split words into smaller units, called subwords or WordPieces [490]. The goal is to handle rare and out-of-vocabulary words in a model's vocabulary effectively while maintaining a limited number of tokens per sequence in the interest of computational complexity. Subword tokenizers are usually trained unsupervised to build a vocabulary and, optionally, merge rules to encode the training data efficiently.

However, the necessity of tokenization comes with multiple drawbacks [257], some of which we discuss below. For example, Ahia et al. [13], Petrov et al. [426] show that the number of tokens necessary to convey the same information varies significantly across languages, making the pricing policy of API language models, which charge users based on the number of processed or generated tokens, potentially unfair. They find that users of many supported languages are overcharged while receiving subpar results, with this group predominantly residing in areas where these APIs are already less affordable.

Further, discrepancies between the data that a tokenizer and a model have been trained on can lead to glitch tokens [465], which can subsequently cause unexpected model behavior, as their corresponding embeddings are essentially untrained. This coupling between the tokenizer and the pre-training corpus creates the burden of a new tokenizer training run each time the pre-training corpus is modified.

Next, tokenization schemes that work well in a multilingual setting, particularly with non-space-separated languages such as Chinese or Japanese, remain challenging [157, 91].

Existing subword tokenization schemes are predominantly greedy algorithms trying to encode language as efficiently as possible regarding the number of tokens used. Naturally, these methods favor subwords comprising larger parts of the training data and, therefore, subwords that are shared across many languages. This favors languages with shared scripts like Latin and Cyrillic, resulting in suboptimal tokenization of low-resource languages [92, 676].

Tokenizer-Reliance
Tokenizers introduce several challenges, e.g., computational overhead, language dependence, handling of novel words, fixed vocabulary size, information loss, and low human interpretability.

Subword-Level Inputs are the dominant paradigm, providing a good trade-off between vocabulary size and sequence length. Byte-Pair Encoding [490, 577] (BPE) starts with the set of symbols (characters or bytes) that comprise the training data. The tokenizer is then trained to learn rules to merge the most frequent pair of two consecutive tokens—defined by the existing vocabulary—into a new vocabulary item. Byte-level BPE (BBPE) [577] is an extension of BPE with byte-level subwords, particularly suited for multilingual tasks, where it enables vocabulary sharing between languages. A trained BPE tokenizer applies the previously learned rules to tokenize inputs. WordPiece [485, 617] is a closed-source tokenization algorithm used, e.g., in BERT [120]. Like BPE, WordPiece starts with a small initial vocabulary, which is iteratively extended by learning merge rules and creating new vocabulary items. Rather than selecting the most frequent pair of consecutive tokens, WordPiece uses a scoring function that normalizes the frequency of the pair by the frequencies of the individual tokens to prioritize common pairs with rare individual tokens. Unigram Tokenization [281] iteratively trims a large base vocabulary to a given target size. To this end, at each step of the tokenizer training, a unigram language model is used to compute a loss over the training data conditional on a certain vocabulary item being removed. A proportion of the subwords with the lowest losses is removed to form the base vocabulary for the next iteration. Unigram tokenization is probabilistic, i.e., during inference, all possible tokenizations of a given sequence are scored using the unigram language model, and the most likely one is selected. SentencePiece [282] is a commonly used open-source library implementing several tokenization algorithms, such as (B)BPE and Unigram tokenization. SentencePiece also implements non-subword tokenization approaches like word- and character-level tokenization.

[Figure 2 omitted: two panels. (1) Tokenizer Training Costs: tokenizer training over English, Chinese, and Python training sequences produces a vocabulary. (2) Architecture depends on Vocabulary: the embedding matrix E ∈ R^{|V|×D} and the softmax output matrix W ∈ R^{D_model×|V|} surround the Transformer blocks.]
Figure 2: Exemplary Drawbacks of relying on Tokenization. (1) The tokenizer training step involves non-trivial computations, e.g., multiple passes over the entire pre-training dataset, and introduces a dependency on it, which can become especially problematic in multilingual settings. (2) The embedding layer E and output layer W of LLMs involve the vocabulary size, e.g., making up ≈ 66% of the model's parameter count in T5 models [629].

Byte-Level Inputs An alternative to subword tokenization is to use byte-level inputs. Byte-level inputs can either be used in combination with subword tokenizers [577] or used to define a limited vocabulary that can encode all possible sequences. For example, Xue et al. [630] train a non-subword mT5 model using UTF-8 bytes rather than subword tokens as inputs, showing promising performance on multilingual data. While this enables subword-free LLMs, UTF-8 encodes Latin languages with fewer bytes than, e.g., Chinese, Japanese, or Korean (see https://www.unicode.org/versions/Unicode15.0.0/). Tay et al. [546] propose the Charformer, a tokenization-free model which learns a soft subword tokenization in latent space (Gradient-Based Subword Tokenization) given byte-level inputs. Charformer performs comparably to subword-based models while incurring less computational overhead than other byte- or subword-level models. Choe et al. [83] train a small-scale, 0.8B language model based on raw byte-level inputs and show that it performs comparably. On a smaller scale, Clark et al. [94] show that their tokenization- and vocabulary-free encoder Canine outperforms a comparable tokenization-based model. Yu et al. [652] address the computational cost that byte-level tokenization incurs by segmenting input sequences into local patches, which can be processed in parallel. Similarly, Horton et al. [212] propose to operate directly on file bytes. In a parallel line of work, Rust et al. [467] render text as images and train an encoder model to predict the raw pixels of the images.
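To make the BPE merge-rule training described above concrete, here is a toy sketch of the training loop (count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, repeat); it is a didactic simplification, not a production tokenizer.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """words: dict mapping a word (as a tuple of symbols) to its corpus frequency."""
    merges = []
    vocab = dict(words)
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Toy corpus: "low" x5, "lower" x2, "newest" x6.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
print(train_bpe(corpus, num_merges=3))
```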

2.3 High Pre-Training Costs

The vast majority of the training costs go toward the pre-training process. Training a single LLM can require hundreds of thousands of compute hours, which in turn cost millions of dollars and consume energy amounts equivalent to that used by several typical US families annually [412, 86, 44]. Recently proposed scaling laws [256] posit that model performance scales as a power law with model size, dataset size, and the amount of compute used for training, which is fairly unsustainable and can be classified as Red AI [487], where state-of-the-art results are essentially "bought" by spending massive computational resources. For example, depending on the exact law coefficients, reducing the error from 3% to 2% can require an order of magnitude more data or compute [518].

Unsustainable Loss Power-Law [256]
Performance increases through larger compute budgets but at a decreasing rate if the model or dataset size is fixed, reflecting a power law with diminishing returns.

In the following, we look at two lines of work aiming at resolving such issues.

Compute-Optimal Training Recipes [201, 256] In Sec. 2.1, we discussed how the availability of LLM pre-training data has become abundant through the quickly-spread practice of including web-crawled text. Further, thanks to the introduction of Transformer models [563] and suitable hardware [210], we have scaled models to unprecedented sizes. Assuming that we have not yet reached the limits of data [45, 568, 415] nor model sizes [256, 206, 398], the main bottleneck is currently the amount of compute available [1]. Given a particular budget, how large should the pre-training corpus and model be to maximize training efficiency?

As mentioned at the beginning of this section, one recent proposal is to learn empirical "scaling laws" [201, 256], which describe the relationship between LLM performance and the compute budget, model size, and dataset size. These laws can provide the right scaling recipe for compute-optimal training, ideally even when extrapolating to larger compute budgets. For example, OpenAI [398] report that they were able to accurately predict the model performance of the full-size GPT-4 model based on the performance of a series of smaller models using at most 10,000x less compute than the full model.

The exact power law coefficients are still heavily debated. Kaplan et al. [256] put forward that the model size should be scaled more aggressively than the dataset size to use a given compute budget optimally. Contrary to this, Hoffmann et al. [206] find that many LLMs are undertrained and argue that the number of parameters and data should be scaled equally. However, power laws sometimes come in the form of bounds, which can span an order of magnitude difference in the amount of data to be used given a concrete compute budget [665]. Further, the pre-training loss does not always correlate well with downstream performance [252, 332, 251].

The viewpoint of Touvron et al. [556], Vries [571], Touvron et al. [557] is that when selecting a model size, the computation resources for later usage (inference) should be considered, not just the one-time training costs. They suggest that it might be beneficial to train a smaller model more intensively upfront to offset larger inference costs in the future. Hence, they train models of various sizes on more tokens than are typically used to achieve the best performance possible, given the model size.

One remaining hurdle of performance prediction is inverse scaling, which we discuss in Sec. 2.14. Since scaling laws were typically constructed in the context of pre-training and thereby decoupled from downstream tasks, it remains an open question how to predict inverse scaling properties. Tay et al. [544] find that scaling laws can differ in upstream and downstream setups; aside from only the model size, model shape matters for downstream fine-tuning.

Pre-Training Objectives Various pre-training objectives (PTO) are suitable for performing self-supervised training of LLMs. The exact choice of PTO heavily influences the model's data efficiency during pre-training, which in turn can reduce the number of iterations required. A PTO typically is a function of the (i) architecture, (ii) input/target construction (e.g., target span length, low/high corruption, see Fig. 4), and (iii) masking strategy (Fig. 3). While (i) and (ii) can be disentangled and should not be conflated conceptually [545], in practice, there exist popular combinations that achieve good performances.

[Figure 3 omitted: three attention-mask diagrams (Masked LM, Language Modeling, Prefix LM) indicating which inputs x1, ..., x5 each output y1, ..., y5 can attend to.]
Figure 3: Masking Strategies. Each row denotes to which inputs xi (columns) a particular output yi (row) can attend (uni- or bi-directional).

Attending to all tokens, as shown in Fig. 3 (left), is the most data-efficient strategy since it uses context from before and after the token to be predicted. However, for that reason, it is unsuitable for text generation [120], since it considers future context for prediction. We typically employ it in natural language understanding (NLU) tasks [120], where it has shown strong results. The next token prediction objective is most suitable for natural language generation (NLG) but is also the least data-efficient, since it only attends to the past context (Fig. 3 (middle)). More recent advances in pre-training objectives aim to find a middle ground that increases data efficiency by providing stronger and more diverse training signals, e.g., the Prefix LM, which partly attends to past tokens, as illustrated in Fig. 3 (right) and discussed below.

The following discusses the trade-offs between some of the recently proposed objectives. Fig. 4 visually depicts the different pre-training objectives. Notation-wise, we denote a sequence of N tokens x as x = x_1, ..., x_N.

We start with the most basic and still widely-used Language Modeling [59] (or next token prediction, NTP) objective. Here, we learn parameters θ by maximizing the likelihood of the next token given the previous tokens,

L(x) = \sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta).    (1)

Masked Language Modeling (MLM; or Cloze) [549, 120] hides a set proportion of tokens in the sequence by replacing them with a special [MASK] token. The literature employs the MLM objective for non-autoregressive, i.e., non-generative, bidirectional context models, where the model uses tokens before and after the target token for predictions, leveraging a more holistic understanding of its context than the NTP objective. Furthermore, we can use each input sentence to predict multiple masked tokens in a single pass, while the NTP objective typically learns from predicting one token at a time.

Let x_MASK denote the set of indices of the masked tokens and x_¬MASK the unmasked tokens. The objective of MLM is then to maximize the likelihood given the parameters θ,

L(x_{\mathrm{MASK}} \mid x_{\neg\mathrm{MASK}}) = \frac{1}{|x_{\mathrm{MASK}}|} \sum_{i \in x_{\mathrm{MASK}}} \log P(x_{\mathrm{MASK}_i} \mid x_{\neg\mathrm{MASK}}; \theta).    (2)

Patel et al. [410] show that such models produce representations more suitable for transfer learning; however, they come with difficulties in performing in-context learning (Sec. 2.7).

To further improve the training efficiency of the MLM objective, Bajaj et al. [33] propose to replace input tokens with ones generated by an auxiliary language model (ALM), resulting in a Model generated dEnoising TRaining Objective (METRO). Their approach consists of roughly three components: (i) train an ALM using the MLM objective, (ii) given some inputs with masked positions, predict the tokens (with the ALM), and (iii) train the main model to correct these tokens inserted in the masked positions, i.e., 1) predict whether the ALM has replaced a token and, if so, 2) predict the original token. They train the auxiliary and main model jointly.

Prefix Language Modeling [443] generalizes language modeling by allowing prefix tokens with a bidirectional receptive field to be added to the input (without a prefix, it is equivalent to standard LM). Note that this is still different from the bidirectional context as in MLM, where we always condition on all the tokens before and after the masked ones (see Fig. 3, left). For computing the hidden states of the prefix, the prefix LM attends to tokens before and after (see Fig. 3, right).

Span Corruption [303, 443, 132] or span denoising refers to a group of denoising objectives that generalize MLM to denoise contiguous sequences of tokens within a given text, called spans. The denoising objectives typically replace the sampled spans with a single unique masking token and train the model to fill it in.
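To make the three masking strategies of Fig. 3 concrete, here is a small sketch that builds the corresponding attention masks (entry [i, j] = 1 means position j is visible when processing position i); it is illustrative only.

```python
import numpy as np

def attention_masks(seq_len, prefix_len):
    # Fully-visible (MLM-style encoders): every position attends to every position.
    bidirectional = np.ones((seq_len, seq_len), dtype=int)
    # Causal (next-token prediction): position i attends only to positions <= i.
    causal = np.tril(np.ones((seq_len, seq_len), dtype=int))
    # Prefix LM: the first `prefix_len` positions are mutually visible,
    # the remaining positions attend causally.
    prefix_lm = np.tril(np.ones((seq_len, seq_len), dtype=int))
    prefix_lm[:, :prefix_len] = 1
    return bidirectional, causal, prefix_lm

bi, causal, prefix = attention_masks(seq_len=5, prefix_len=2)
print(prefix)  # rows = query positions, columns = visible key positions
```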

Raffel et al. [443] show that this can speed up training because span corruption produces shorter sequences on average compared to corrupting individual tokens in an i.i.d. manner.

[Figure 4 omitted: example passages illustrating how inputs and targets are constructed under Span Corruption (R-Denoising), Prefix Language Modeling (S-Denoising), Long Span Corruption (one form of X-Denoising), Fill In The Middle, and Meet In The Middle.]
Figure 4: Self-Supervised Data Construction by Pre-Training Objectives, adopted from Tay et al. [545]. We indicate masked tokens with gray rectangles, which become the targets. For brevity, we omit special tokens.

Mixture of Denoisers [545] (MoD) refers to injecting objective diversity by mixing multiple denoising objectives. Tay et al. [545] categorize three denoising objectives: {R,S,X}-Denoiser. The regular denoising corresponds to the previously introduced span denoising. Specific denoising comprises splitting a given sequence into a prefix acting as the context and a suffix acting as the target. In extreme denoising, we corrupt large parts of the input by either (a) increasing the proportion of masked tokens per span or (b) increasing the span length, forcing the model to generate long sequences with limited context (which we illustrate in Fig. 4). The MoD objective has subsequently been shown to improve model performance when used for continued training of pre-trained LLMs [443, 86] for relatively few steps [547].

Fill In the Middle Bavarian et al. [38] propose to augment the next token prediction objective by shuffling tokens within a document such that we fill in the middle (FIM) based on prefix and suffix. They demonstrate that models pre-trained on a mixture of FIM-transformed and left-to-right data acquire both left-to-right and FIM capabilities.
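A minimal sketch of the FIM document transformation described above (following the prefix-suffix-middle rearrangement idea of Bavarian et al. [38]; the sentinel strings here are placeholders, not the exact tokens used in their work):

```python
import random

def fim_transform(tokens, rng=None, pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Randomly split a document into (prefix, middle, suffix) and rearrange it
    so that a left-to-right model learns to generate the middle after seeing
    prefix and suffix. Sentinel strings are illustrative placeholders."""
    if rng is None:
        rng = random.Random(0)
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    # PSM ordering: the middle becomes the continuation target.
    return [pre] + prefix + [suf] + suffix + [mid] + middle

doc = "some proponents of AI consciousness subscribe to functionalism".split()
print(" ".join(fim_transform(doc)))
```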

Meet in the Middle Nguyen et al. [382] extend the FIM objective by enabling bidirectional context to construct a denser, more data-efficient supervision signal while maintaining the autoregressive nature of the underlying model: they train two decoders—a forward language model \overrightarrow{p}(x_i \mid x_{<i}; \theta) and a backward language model \overleftarrow{p}(x_i \mid x_{>i}; \theta)—with shared parameters θ. Additionally, they add an agreement regularizer to the loss, encouraging the forward and backward model to agree: for a dataset S of sequences, the full pre-training loss is

\sum_{x \in S} \sum_{i=1}^{|x|} \Big( \underbrace{-\log \overrightarrow{p}(x_i \mid x_{<i}; \theta)}_{\text{NLL for forward model}} \; \underbrace{-\log \overleftarrow{p}(x_i \mid x_{>i}; \theta)}_{\text{NLL for backward model}} \; + \underbrace{\beta\, D^{TV}_{i,x}(\overrightarrow{p} \,\|\, \overleftarrow{p})}_{\text{agreement regularizer}} \Big),    (3)

where D^{TV}_{i,x}(\overrightarrow{p} \,\|\, \overleftarrow{p}) is the total variation distance between the two models on the i-th token. Once pre-training has been completed, we can use only the forward model \overrightarrow{p}.

Parallelism Strategies The sheer size of LLMs makes it hard to train or even do inference with them on only one accelerator (GPU, TPU, etc.). A common solution is model parallelism, which can be viewed as a divide-and-conquer strategy: we slice up various parts of the model (dividing the problem into sub-problems), distribute them across multiple devices, with each device computing a portion of the overall computation (solving each sub-problem independently), and combine all results to produce the final output (forward/backward pass).

Implementing model parallelism synchronously creates a problem: running data batches through multiple workers with sequential dependency (each layer depends on results from the previous layer) leads to significant waiting times and under-utilization of computation resources.

Another strategy is pipeline parallelism, which combines model parallelism with data parallelism, meaning that we not only distribute parts of the model across different devices but parts of the data too, i.e., each worker splits its mini-batch further into micro-batches, with gradients being accumulated across all micro-batches before the weight update. Huang et al. [226] instantiate such an approach called GPipe, which divides each mini-batch into smaller micro-batches distributed across different accelerators simultaneously; gradients are applied synchronously at the end. Compared to naive model parallelism, this decreases waiting times and increases the utilization of computational resources.

These issues have motivated asynchronous parallelization schemes. Recht et al. [453] present Hogwild!, which greedily applies gradients to the local weights on each accelerator as soon as they arrive, offering better resource utilization than pipeline parallelism but suffering from training instabilities due to stale gradients, which are based on outdated model weights.

Gomez et al. [172] propose N-Wise interlocking backpropagation, which is a generalization of end-to-end and local training. While end-to-end (global) training performs a forward pass through all layers, computes a loss and gradients, and backpropagates through all layers, local training performs forward passes through all layers individually and immediately computes a local loss and gradient update, offering higher resource utilization at the cost of (empirically) worse task performance. N-Wise interlocking backpropagation strikes a compromise by performing a forward pass through N layers before computing a loss and updating the parameters of the associated layers, enabling better layer communication than local training and higher computational efficiency than end-to-end training.

Chowdhery et al. [86] leverage a combination of model parallelism and fully sharded data parallelism (FSDP) [628, 674]—a technique where each device only holds a subset of the model parameters, gradients, and optimizer states, and parameters necessary for local computations are communicated on demand—to enable highly parallel, high-throughput training across thousands of chips within a single TPU pod. PaLM further employs data parallelism to achieve scaling at pod level, leveraging the Pathways [37] system to distribute data.

In a parallel line of work, Lepikhin et al. [298] propose GShard, a model parallelism method that extends the XLA [468] compiler, enabling automatic sharding of models.

Miscellaneous Rae et al. [441] stack the layers of a 4.5B parameter model to jump-start and accelerate the training of a 9B model, which led to a 40% reduction in compute; an idea that has previously been used for training smaller-scale LMs [173]. Brown et al. [59] progressively increase the batch size from a small to the full value over training when training GPT-3; a trick that has previously been used for training image models [514].
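The micro-batch gradient accumulation at the heart of pipeline-parallel schemes such as GPipe, described above, boils down to the following pattern; this single-device PyTorch sketch is for illustration only, and model, loss_fn, and the batch layout are placeholders.

```python
import torch

def train_step(model, loss_fn, optimizer, mini_batch, num_micro_batches):
    """Split a mini-batch of token ids [B, T] into micro-batches, accumulate
    gradients across them, and apply a single synchronous weight update."""
    optimizer.zero_grad()
    for micro in torch.chunk(mini_batch, num_micro_batches, dim=0):
        inputs, targets = micro[:, :-1], micro[:, 1:]
        loss = loss_fn(model(inputs), targets)
        (loss / num_micro_batches).backward()  # accumulate scaled gradients
    optimizer.step()  # one weight update per mini-batch, as in GPipe
```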

Sanyal et al. [476] apply latest weight averaging [249] to LLMs between 1 and 12B parameters; for a 6.9B parameter model, they reach savings of up to 4,200 GPU hours. For smaller-scale models, there exist various pre-training speedup algorithms [663, 685], but they have not been scaled up yet and have been shown to offer only limited gains when compared with budget-adjusted baselines [251].

2.4 Fine-Tuning Overhead

A potential drawback of pre-training LLMs on massive and diverse sets of textual data is that the resulting models might struggle to explicitly capture the distributional properties of task-specific datasets. To address this, fine-tuning refers to adapting the pre-trained model parameters on comparatively smaller datasets that are specific to an individual domain or task. LLM fine-tuning is highly effective at adapting LLMs for downstream tasks [215, 120, 440].

Technically speaking, fine-tuning can be achieved by further training a model on a smaller dataset. Depending on the model architecture, this is done by either (i) directly fine-tuning pre-trained models using a standard language modeling objective or (ii) adding individual learnable layers to the output representations of a pre-trained language model, which are designed to create compatibility between the model's output representations and the output formats of individual downstream tasks (e.g., for text classification or sequence labeling). See Devlin et al. [120] (Figure 1) for an illustration.

However, LLMs with billions of parameters have large memory requirements to store (i) the model parameters, (ii) the model activations, and (iii) the gradients and corresponding statistics. Limited device memory (e.g., of a GPU or TPU) therefore necessitates access to large clusters with many devices to fine-tune a full LLM, limiting access to a few institutions with large compute resources.

Large Memory Requirements
Fine-tuning entire LLMs requires the same amount of memory as pre-training, rendering it infeasible for many practitioners.

Moreover, while full model fine-tuning is effective at adapting LLMs to perform well on specific downstream tasks, individual copies of fine-tuned LLMs need to be stored and loaded for individual tasks, which is computationally inefficient [213, 311] and requires practitioners to keep individual fine-tuned LLMs in memory for every task. We illustrate this overhead in Figure 5.

[Figure 5 omitted: (a) vanilla fine-tuning produces a separate fine-tuned LLM for each task (sentiment analysis, question answering, hate speech); (b) PEFT keeps a single PEFT-adaptable base LLM and learns small task-specific PEFT weights.]
Figure 5: Fine-tuning an LLM for a specific downstream task. (a) illustrates vanilla fine-tuning, which requires updating the entire model, resulting in a new model for each task. In (b), PEFT instead learns a small subset of model parameters for each task with a fixed base LLM. The same base model can be re-used during inference for different tasks.

Overhead of Storing and Loading Fine-Tuned LLMs [213, 311]
When adapting an LLM via full-model fine-tuning, an individual copy of the model must be stored (consuming data storage) and loaded (expending memory allocation, etc.) for each task.

Parameter-efficient fine-tuning An alternative method to adapt an LLM to a specific dataset/domain is via parameter-efficient fine-tuning (PEFT). PEFT refers to a class of methods that adapt LLMs by updating only a small subset of model parameters. Adapters [213] are one of the earliest works on PEFT. This method incorporates additional, learnable layers into a Transformer architecture that are updated during fine-tuning whilst keeping the remainder of the network unchanged. Experimental results on 26 text classification tasks (incl. the GLUE benchmark [575]) reveal that models trained via Adapters are competitive with full fine-tuning while updating only 3% of the model's parameters. Ben Zaken et al. [40] instead propose only to update the model's bias terms for fine-tuning, which make up less than 1% of the model's parameters. Experimental results show competitive performance across tasks of the GLUE benchmark. We are aware of three general frameworks for incorporating adapters into language model fine-tuning, namely AdapterHub [428], LLM-Adapters [219], and HuggingFace's PEFT library [356].

PEFT methods introduced for larger models include prefix-tuning [311] and prompt-tuning [299], which both operate by prepending a set of learnable token embeddings to an input. These token embeddings (also referred to as soft prompts [299]) are learned during the fine-tuning stage, whereas the remainder of the model parameters remains fixed. Most notably, such soft prompts contain thousands rather than millions of parameters and are much more efficient to store. Notably, one still has to backpropagate through the network while fine-tuning the tokens. Alternatives for models with only black-box API access have been proposed too [528, 122].

It has been shown that prompt-tuning can learn generalizable representations with very small amounts of training data, achieving competitive performances when trained on less than 100 examples for safety classification [376] or five examples for multilingual question answering [11]. In addition to that, recent work investigates the potential of using soft prompts for pre-training and transfer learning across different tasks [179, 572].

Liu et al. [331] introduce (IA)3, which scales activations in individual Transformer layers with learnable vectors. The authors demonstrate its effectiveness by showing that models trained using (IA)3 outperform full model fine-tuning on various datasets whilst updating only 0.01% of the model's parameters.

Malladi et al. [355] propose a memory-efficient zeroth-order (MeZO) optimizer, which only requires the same memory footprint as during inference (instead of storing gradients or optimizer states). Further, it can optimize non-differentiable objectives like accuracy or F1 scores, which conventional gradient-based tuning methods cannot.

Hu et al. [218] propose Low-Rank Adaptation (LoRA), which formulates parameter updates of weight matrices at individual Transformer layers as an additive low-rank decomposition. Such a reparameterization avoids the need to compute dense matrix multiplications. Dettmers et al. [118] extend LoRA to quantized LLMs, drastically reducing memory usage, allowing them to fine-tune a 65B model on a single 48GB GPU. The authors mention that regular training of the same model requires more than 780 GB of GPU memory.
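A minimal sketch of the low-rank reparameterization described above (in the spirit of Hu et al. [218]; the initialization and scaling conventions are simplified, and this is not the authors' reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense weight W plus a trainable low-rank update B @ A."""
    def __init__(self, linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = linear
        self.base.weight.requires_grad_(False)  # freeze the pre-trained weight
        if linear.bias is not None:
            linear.bias.requires_grad_(False)
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))  # only A and B receive gradients
```

Only the small matrices A and B are stored per task, which is what makes swapping task-specific weights on top of a shared base model cheap.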
Compute Requirements However, despite substantial improvements in the memory complexity needed to fine-tune LLMs for specific tasks, a remaining challenge is the time complexity. Fine-tuning an LLM, even with PEFT methods, still requires full gradient computation. The computational infrastructure needed to adapt LLMs prohibits potential applications like personalization on smaller devices.

Full Matrix Multiplications
Parameter-efficient fine-tuning of LLMs still requires computing full forward/backward passes throughout the whole network.

2.5 High Inference Latency

According to Pope et al. [431], Weng [605], two reasons why LLMs exhibit high inference latencies are: (1) low parallelizability, since the inference procedure proceeds one token at a time, and (2) large memory footprints, due to the model size and the transient states needed during decoding (e.g., attention key and value tensors). Further, the authors also discuss the quadratic scaling of the attention mechanism in Transformers, which we discuss separately in Sec. 2.6.

High Inference Latency [431, 605]
LLM inference latencies remain high because of low parallelizability and large memory footprints.

In the following section, we review techniques used to address these challenges by, e.g., reducing the memory footprint (size and/or bandwidth) or accelerating specific computational operations. Note that some of these techniques may also be applicable during the training process, but we discuss them here since they are not only designed for training, like the approaches discussed in Sec. 2.3.
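To give a feel for the second factor above (the transient key/value tensors kept during decoding), a back-of-the-envelope estimate of the KV-cache footprint is shown below; the model dimensions are hypothetical examples, not measurements from the cited works.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, batch_size, bytes_per_value=2):
    """Keys and values: 2 tensors per layer of shape [batch, heads, seq, d_head]."""
    return 2 * n_layers * batch_size * n_heads * seq_len * d_head * bytes_per_value

# Example: a hypothetical 7B-scale decoder (32 layers, 32 heads of size 128),
# batch size 8, 4,096-token context, fp16 values.
gib = kv_cache_bytes(32, 32, 128, 4096, 8, 2) / 2**30
print(f"{gib:.1f} GiB of KV cache")  # 16.0 GiB in this illustrative setting
```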

Efficient Attention Roughly two lines of work aim to accelerate attention mechanism computations by (i) lower-level, hardware-aware modifications or (ii) higher-level sub-quadratic approximations of the attention mechanism.

For the former, multi-query attention [493] aims to reduce memory bandwidth bottlenecks when sequentially generating sequences of tokens using Transformer decoder layers by keeping only one attention head for the key and value tensors. Similarly, Dao et al. [107], Pagliardini et al. [404] reduce memory bandwidth by proposing an alternative computation method for multi-head self-attention, called FlashAttention, which minimizes the number of I/O operations to speed up the computation on modern GPUs. As an optimized attention implementation, FlashAttention leverages operator fusion to reduce the memory bandwidth bottleneck. Pagliardini et al. [404] build on top of FlashAttention and incorporate attention sparsity patterns, encompassing key/query dropping and hashing-based attention. Pope et al. [432] implement different sharding techniques to efficiently spread the feedforward and attention computations across devices while optimizing for inter-device communication costs, enabling context lengths of up to 43,000 tokens using multi-query attention.

With regard to the second stream of work, a common theme to improve the computational or memory complexity of the attention mechanism is to sparsify the attention matrix or to introduce (linear) approximations [543]. However, the scalability of some efficient attention approximations has been questioned. For example, Tay et al. [542], Hua et al. [220] find that the Performer attention approximation [85] severely underperforms the vanilla self-attention mechanism, especially when scaled up to large models.

Quantization is a post-training technique that reduces the memory footprint and/or increases the model's throughput by reducing the computational precision of weights and activations. nuQmm [407] and ZeroQuant [643] use a non-uniform quantization method to quantize weights and apply custom CUDA kernels for computational benefits. LLM.int8() [117] is a degradation-free quantization scheme enabling efficient inference of multi-billion parameter LLMs by utilizing Int8 quantization and falling back to higher precision for certain outlier features, without the need for re-training. Similarly, GLM-130B [658] uses a degradation-free 8-bit quantization scheme, storing weights in 8-bit and performing matrix multiplications in 16-bit precision. Frantar et al. [153] propose an efficient, one-shot quantization technique to compress LLM weights down to 3 to 4 bits per weight, enabling 175B parameter models to be run on a single GPU. Dettmers et al. [119] further improve upon this by combining higher-precision representations for outlier weights with grouped quantization.

Pruning is a complementary post-training technique to quantization, removing parts of the weights of a given model (without degrading its performance). An important distinction is whether the pruning follows a structured pattern or is unstructured. Structured sparse models substitute dense sections of a model with an assembly of significantly smaller yet still dense components. Unstructured sparse models contain weights of value zero, which do not influence the network's behavior and can therefore be omitted in theory. However, in practice, it is more challenging to translate theoretical into practical computation savings on current hardware [161, 112, 336].

On the structured side, early work on pruning language models mainly aims at comparatively small MLM-type models [592, 143, 243]. Ma et al. [349] propose LLM-Pruner, which aims at pruning LLMs in a task-agnostic manner while preserving their zero-shot capabilities. To this end, LLM-Pruner adopts a three-stage pruning procedure where 1) interdependent structures within the model are identified and grouped, 2) the contribution to the overall performance is estimated for each group, and low-performing groups are pruned, and 3) performance is recovered via a parameter-efficient fine-tuning procedure using LoRA [218].

On the unstructured side, SparseGPT [152] is an unstructured pruning approach specifically developed to be fast enough to run on LLMs with hundreds of billions of parameters within a few hours, being able to prune up to 60% of the parameters while maintaining roughly the same model performance. Sun et al. [527] propose Wanda (Pruning by Weights and activations), which applies magnitude pruning based on the product of each weight's magnitude and the norm of the corresponding input activations, matching SparseGPT in performance while requiring only a single forward pass to prune the network. Both SparseGPT and Wanda can be extended to perform semi-structured pruning, enabling n:m sparsity [228, 680] and achieving the corresponding speed-ups on recent GPUs [369].
form semi-structured pruning, enabling n:m spar- that the activation maps of default Transformer
sity [228, 680] and achieving the corresponding models often emerge to be very sparse implicitly;
speed-ups on recent GPUs [369]. the larger the model, the sparser measured by the
percentage of nonzero entries. Similarly, Zhang
Mixture-of-Experts architectures typically con- et al. [670] find that post-training MoEfication, i.e.,
sist of a set of experts (modules), each with unique converting monolithic models to equivalent MoE
weights, and a router (or gating) network, which models, can speed up inference by 2x.
determines which expert module processes an in-
put. MoE models decrease inference time by not Cascading refers to the idea of employing
using all experts at once but only activating a sub- differently-sized models for different queries [75].
set of them. Further, they can reduce communica- In spirit, this idea is similar to Mixture-of-Experts
tion across devices in model-distributed settings by models, but instead of learning a routing module,
placing each expert on a separate accelerator; only we employ a cascade of multiple, differently-sized
the accelerators hosting the router and the relevant monolithic models (these can be even black-box
expert model must communicate. Shazeer et al. API models) and learn a scoring function that de-
[495] propose one of the first MoE layers embed- cides which model(s) receive which query. Chen
ded within a language model, which they refer to et al. [75] demonstrate that this strategy dominates
as sparsely-gated MoEs (SG-MoEs). They denote the Pareto frontier between accuracy and cost.
by G(x) and Ei (x) the gating network output and Decoding Strategies can greatly impact the com-
the i-th expert network output for a given input putational cost of performing inference. For ex-
x, respectively.
Pn We can then write the output as ample, beam search trades off compute for higher-
y = i=1 G(x) i Ei (x). Wherever G(x)i = 0, quality results. Another example of a computa-
we do not need to compute Ei (x), thereby saving tionally expensive decoding scheme is sample-and-
Lepikhin et al. [298] scale up an SG-MoE model to 600B parameters by proposing GShard, a model parallelism method that extends the XLA [468] compiler. While SG-MoE selects the top-k experts with k > 1, the Switch Transformer (ST) [145] architecture uses k = 1 experts, which reduces routing computation and communication across experts (which may be located on different accelerators). ST empirically outperformed a strongly tuned T5 model with up to 7x pre-training speedups. Lewis et al. [302] notice that the learned routers can result in unbalanced assignments across experts. To ensure balanced routing, they formulate a linear assignment problem that maximizes token-expert affinities while equally distributing the number of tokens across experts. Yu et al. [653] propose sMLP, an MoE using only MLP blocks, which (i) they scale up to 10B parameters, (ii) results in a 2x improvement in pre-training speed, and (iii) outperforms sparse Transformer counterparts. However, MoE models still suffer from unique issues like expert collapse (all experts learning the same mapping), likely caused by underconstrained routing functions [80]. For example, Roller et al. [459] demonstrate that learned expert assignments do not always outperform random ones.

Interestingly, instead of designing an architecture for sparsity explicitly, Li et al. [314] observe that the activation maps of default Transformer models often emerge to be very sparse implicitly; the larger the model, the sparser, as measured by the percentage of nonzero entries. Similarly, Zhang et al. [670] find that post-training MoEfication, i.e., converting monolithic models to equivalent MoE models, can speed up inference by 2x.
Cascading refers to the idea of employing differently-sized models for different queries [75]. In spirit, this idea is similar to Mixture-of-Experts models, but instead of learning a routing module, we employ a cascade of multiple, differently-sized monolithic models (these can even be black-box API models) and learn a scoring function that decides which model(s) receive which query. Chen et al. [75] demonstrate that this strategy dominates the Pareto frontier between accuracy and cost.
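A minimal sketch of such a cascade, assuming two black-box generation functions and a learned quality scorer (the scorer, the model handles, and the 0.8 threshold are hypothetical placeholders):

```python
from typing import Callable

def cascade_generate(query: str,
                     small_llm: Callable[[str], str],
                     large_llm: Callable[[str], str],
                     scorer: Callable[[str, str], float],
                     threshold: float = 0.8) -> str:
    """Route a query through a cost-ordered cascade of models.

    The cheap model answers first; the (learned) scorer estimates answer
    quality, and only low-scoring queries are escalated to the larger model.
    """
    answer = small_llm(query)
    if scorer(query, answer) >= threshold:
        return answer          # cheap path: most queries stop here
    return large_llm(query)    # expensive path: only when needed
```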
Decoding Strategies can greatly impact the computational cost of performing inference. For example, beam search trades off compute for higher-quality results. Another example of a computationally expensive decoding scheme is sample-and-rank [8], where N independent sequences of tokens y^1, . . . , y^N are obtained using random sampling, and the highest-probability sequence is used as the final output.
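Sample-and-rank can be sketched in a few lines; the model object and its sample() / sequence_log_prob() helpers below are hypothetical stand-ins, not a specific library API:

```python
def sample_and_rank(model, prompt: str, n: int = 16) -> str:
    """Draw n independent random samples and keep the most likely one."""
    candidates = [model.sample(prompt) for _ in range(n)]
    # Rank candidates by their sequence log-probability under the model.
    return max(candidates, key=lambda y: model.sequence_log_prob(prompt, y))
```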
Latency-oriented strategies such as speculative sampling [522, 300, 74] first autoregressively generate a draft of length K using a smaller (draft) model; then, the larger (target) model scores the draft, followed by a modified rejection sampling scheme to accept a subset of the tokens from left to right. Similar ideas have been proposed in various contexts, such as for blockwise parallel generation [522], grammatical error correction [529], and with a larger LLM refining generations produced by a smaller model [265]. Del Corro et al. [114] observe that tokens towards the end of a sequence are easier to predict due to more contextual information, motivating a new decoding strategy that skips earlier layers in the network for such tokens.
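The sketch below illustrates one speculative-sampling verification step under the modified rejection scheme described above; the tensor layout (per-position distributions from the draft and target models) is an illustrative assumption, not the interface of any of the cited implementations.

```python
import torch

def speculative_step(target_probs, draft_probs, draft_tokens):
    """One speculative-sampling verification step (sketch).

    draft_tokens: (K,) tokens proposed by the small draft model.
    draft_probs:  (K, V) draft-model distributions at each drafted position.
    target_probs: (K + 1, V) target-model distributions for the same positions
                  (one extra position to sample a fresh token if all K are kept).
    """
    accepted = []
    for t, tok in enumerate(draft_tokens.tolist()):
        p, q = target_probs[t, tok], draft_probs[t, tok]
        if torch.rand(1) < torch.clamp(p / q, max=1.0):    # accept w.p. min(1, p/q)
            accepted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q), renormalized,
            # and stop accepting further draft tokens.
            residual = torch.clamp(target_probs[t] - draft_probs[t], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return accepted
    # All K drafted tokens accepted: sample one extra token from the target model.
    accepted.append(int(torch.multinomial(target_probs[-1], 1)))
    return accepted
```

In expectation, several draft tokens are accepted per target-model forward pass, which is where the latency savings come from.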
2.5.1 Software

Various frameworks, such as DeepSpeed [450] and Megatron-LM [501], have been designed to enable the efficient training of multi-billion to trillion parameter language models and to account for the unique challenges arising when training such models. This is necessitated by the fact that most LLMs do not fit into a single device's (GPU, TPU) memory, and scaling across GPUs and
compute nodes needs to account for communication and synchronization costs. FlexGen [497] provides further speed-ups by aggregating memory and compute resources from the GPU, CPU, and disk and utilizing techniques such as 4-bit quantization, enabling inference with 175B parameter models on a single GPU.
The frameworks typically combine existing parallelism strategies to compensate for their respective drawbacks and to scale model training across multiple sets of compute nodes, within compute nodes, and across multiple GPUs per node. For example, Smith et al. [515] use tensor slicing within a node, pipeline parallelism across nodes, and data parallelism to train multiple model replicas over sets of nodes. Additional features include memory optimizations [445, 454, 446], communication-efficient [536, 307, 343] and fused optimizers2, and support for MoE training [444].
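As a rough illustration of how these strategies compose, the parallelism degrees multiply to the total device count; the cluster size and the chosen degrees below are hypothetical and not tied to any particular framework's configuration format.

```python
# Hypothetical cluster: 8 nodes x 8 GPUs = 64 devices.
gpus_per_node, num_nodes = 8, 8
world_size = gpus_per_node * num_nodes

tensor_parallel = 8        # tensor (intra-layer) slicing within a node
pipeline_parallel = 4      # pipeline stages spread across nodes
data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # model replicas

assert tensor_parallel * pipeline_parallel * data_parallel == world_size
print(f"{data_parallel} replicas, each sharded over "
      f"{tensor_parallel * pipeline_parallel} GPUs")   # -> 2 replicas over 32 GPUs
```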
Specialized implementations such as Tutel [230] and MegaBlocks [160] offer efficient sparse MoE training, while Alpa [677] enables automatic data and model parallelism for LLMs written in Jax. The FasterTransformer3 library includes highly optimized Transformer encoder and decoder implementations for TensorFlow, PyTorch, and Triton.
Kwon et al. [285] introduce vLLM, an open-source library for efficient inference and LLM serving. vLLM employs PagedAttention, which partitions each sequence's KV cache into fixed-size blocks. When performing attention computations, blocks are fetched from non-contiguous memory. This enables memory sharing, reducing memory consumption and transfers in decoding strategies such as beam search, ultimately improving throughput.
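Conceptually, the bookkeeping behind this idea can be sketched as a block table that maps a sequence's logical KV-cache positions to fixed-size physical blocks; this is an illustrative sketch with a hypothetical allocator object, not vLLM's actual data structures or API.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative)

class BlockTable:
    """Maps a sequence's logical KV-cache blocks to physical block ids."""

    def __init__(self, allocator):
        self.allocator = allocator          # hands out free physical block ids
        self.physical_blocks: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full, so
        # memory is reserved block-by-block instead of for a maximum length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.physical_blocks.append(self.allocator.allocate())
        self.num_tokens += 1

    def lookup(self, token_pos: int) -> tuple[int, int]:
        # Physical (block id, offset) for a logical token position; the attention
        # computation gathers KV entries from these possibly non-contiguous blocks.
        return self.physical_blocks[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE
```

Because beam-search candidates or parallel samples that share a prefix can point to the same physical blocks, memory is shared rather than duplicated.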
The Petals [54] library4 allows users to collaboratively fine-tune and run LLMs by distributing subsets of model parameters to individual machines.
All of these libraries address the enormous computational costs associated with training and running LLMs, either by offering more efficient implementations, lowering memory requirements, or using distributed or decentralized computing strategies.

2 https://github.com/nvidia/apex
3 https://github.com/NVIDIA/FasterTransformer
4 https://github.com/bigscience-workshop/petals
2.6 Limited Context Length

Addressing everyday NLP tasks often necessitates an understanding of a broader context. For example, if the task at hand is discerning the sentiment in a passage from a novel or a segment of an academic paper, it is not sufficient to merely analyze a few words or sentences in isolation. The entirety of the input (or context), which might encompass the whole section or even the complete document, must be considered. Similarly, in a meeting transcript, the interpretation of a particular comment could pivot between sarcasm and seriousness, depending on the prior discussion in the meeting.

Li et al. [308] evaluate several LLMs in long-context settings and find that while commercial closed-API models often fulfill their promise, many open-source models, despite claiming to perform well with longer contexts, exhibit severe performance degradation. They point out that there is a difference between being architecturally able to deal with long inputs and actually performing well. Having an architecture that can process long inputs does not guarantee that the LLM will perform as well on those as on shorter inputs. Similarly, Liu et al. [333] find that changing the location of relevant information in the input can degrade model performance. Interestingly, they find that decoder-only LLMs like GPT-3.5 can deal well with such information at the beginning or end of the input context but cannot access information in the middle of it well, resulting in a U-shaped performance curve.

Limited Context Length

Limited context lengths are a barrier to handling long inputs well, which is needed to facilitate applications like novel or textbook writing or summarizing.

To this end, we discuss three lines of work permitting longer context lengths. First, we look at efficient attention mechanisms, which help mitigate the effect of long inputs on the computational requirements of Transformer models. Next, we examine positional embedding schemes in the light of generalization to longer sequence lengths than those used during training. Lastly, we review Transformer alternatives which neither require attention nor positional embeddings.
Efficient Attention Mechanisms One way of addressing the limited context of LLMs is by designing more efficient attention mechanisms that can process longer inputs. Ma et al. [350] introduce Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity, allowing it to process much longer inputs. Similarly, Shen et al. [496] and Li et al. [310] present alternative attention mechanisms equivalent to the dot-product attention but which require substantially less memory and compute resources. Guo et al. [183] propose an attention mechanism called Transient Global, which is an extension of local attention where each token can attend to nearby tokens and a set of global tokens. It enables handling sequences with up to 12,000 tokens. Similarly, CoLT5 [15] enables context lengths of up to 64,000 tokens by splitting the computations into a light branch with local attention and fewer attention heads, and a heavy branch with full attention. CoLT5 applies the light branch to every token and the heavy branch to a subset of tokens that are selected by a learnable routing function.
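To illustrate the local-plus-global pattern used by such mechanisms, the sketch below builds a boolean attention mask in which each token attends to a fixed window of neighbors and to a small set of designated global tokens; the window size and the choice of the first g tokens as global are illustrative assumptions, not the exact scheme of Transient Global or CoLT5.

```python
import torch

def local_global_mask(seq_len: int, window: int = 64, n_global: int = 8) -> torch.Tensor:
    """Boolean mask (seq_len, seq_len): True where attention is allowed."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    local = (i - j).abs() <= window          # each token attends to nearby tokens
    global_cols = j < n_global               # every token attends to the global tokens
    global_rows = i < n_global               # global tokens attend to every token
    return local | global_cols | global_rows
```

Since each row has only on the order of window + n_global allowed entries, the attention cost grows roughly linearly with sequence length rather than quadratically.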
After investigating the effect of the dot-product self-attention mechanism, Tay et al. [541] propose the Synthesizer, a new architecture that learns synthetic attention weights without token-token interactions, showing that it consistently outperforms Transformers on various language-based tasks. Britz et al. [56] offer an alternative attention mechanism based on a fixed-size memory representation that is more efficient, yielding inference speedups of 20% without significantly hurting performance. Hua et al. [220] combine a single-head attention mechanism with a linear attention approximation to achieve speed-ups between 4.9x and 12.1x for auto-regressive language modeling while obtaining similar perplexities as a standard Transformer model. Ding et al. [124] propose dilated attention, which splits a sequence into equally long segments and processes each of these in parallel using a sparsified attention mechanism. Dilated attention offers linear computational complexity in the sequence length and, applied hierarchically, enables inputs of up to 1B tokens.
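As a rough sketch of the segment-and-sparsify idea behind dilated attention (the segment length and dilation rate below are illustrative assumptions; the actual method mixes several such configurations):

```python
def dilated_attention_groups(seq_len: int, segment_len: int = 2048, dilation: int = 4):
    """Return index groups; attention is computed only among positions in a group.

    Each segment keeps every `dilation`-th position, so a segment of length w
    costs roughly (w / dilation)^2 instead of w^2, and the total cost over all
    segments stays linear in the sequence length.
    """
    groups = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        groups.append(list(range(start, end, dilation)))
    return groups
```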
Length Generalization As the required compute of Transformer-based LLMs grows quadratically with the sequence length, it is a desired property to build LLMs that can be trained on short sequences and generalize well to significantly longer sequences during inference.

The fundamental building block of the Transformer architecture is the self-attention mechanism. It is permutation-invariant; therefore, the output is independent of the input sequence order. Positional information is commonly injected to make the model respect a token's position in the sequence, i.e., capture the semantics of where a token occurs rather than just whether it occurs. The longer the input is, the more important the positional embedding becomes, since the model needs to effectively use information from different parts of the input that may cover a wide range of distances from the current token.
Without positional embeddings, a Transformer models the relations between any two tokens with equal probability. Hence, positional embeddings introduce an LSTM-like inductive bias that (typically) tokens closer to each other in the sequence are more relevant to each other. Depending on the positional embedding scheme chosen, this can be learned or effectively hard-coded. However, it remains unclear what the most effective positional embedding scheme for long inputs is. Further, positional embeddings introduce a dependency on sequence positions, which makes it difficult for models to generalize to unseen sequence lengths. This is an undesirable artifact, as language semantics do not inherently depend on the length of an utterance.

While positional encoding schemes such as relative positional encodings or, more recently, ALiBi have made progress in building more generalizable ways of injecting positional information into Transformers, the challenge of generalizing to sequences much longer than those seen during training remains largely unsolved. Surprisingly, Haviv et al. [192] find that causal LLMs without positional encodings are competitive compared to models with positional encodings and attribute this success to the causal attention mask leaking positional information into the model.
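To make the linear-bias idea behind ALiBi concrete, the sketch below adds a distance-proportional penalty to causal attention scores; treating the slope as a single per-head hyperparameter is an illustrative simplification (the original work spaces the slopes geometrically across heads).

```python
import torch

def alibi_biased_scores(scores: torch.Tensor, slope: float = 0.0625) -> torch.Tensor:
    """scores: (seq_len, seq_len) raw attention logits for one causal head.

    Adds -slope * (i - j) when query position i attends to key position j <= i,
    so more distant tokens are penalized linearly; no positions are learned,
    which is what helps extrapolation to longer sequences.
    """
    seq_len = scores.shape[-1]
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    bias = -slope * (i - j).clamp(min=0).float()   # zero on and above the diagonal
    return scores + bias
```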
In the following, we first summarize some standard positional embedding techniques and then move to more advanced schemes designed to improve length generalization. We start with Absolute Positional Embeddings [563], which inject positional information via sinusoidal embeddings based on the absolute position i of a token x_i within its sequence x_1, . . . , x_N into the model input. Given an input sequence X = [x_1, . . . , x_N], we