GLM-130B: An Open Bilingual Pre-Trained Model
Aohan Zeng⋄†∗ , Xiao Liu⋄†∗ , Zhengxiao Du⋄† , Zihan Wang⋄ , Hanyu Lai⋄ , Ming Ding⋄ ,
Zhuoyi Yang⋄ , Yifan Xu⋄ , Wendi Zheng⋄ , Xiao Xia⋄ , Weng Lam Tam⋄§ , Zixuan Ma⋄ ,
Yufei Xue§ , Jidong Zhai⋄ , Wenguang Chen⋄ , Zhiyuan Liu⋄ , Peng Zhang§ ,
Yuxiao Dong⋄‡ , Jie Tang⋄‡
ABSTRACT

1 INTRODUCTION
Large language models (LLMs), particularly those with over 100 billion (100B) parameters (Brown
et al., 2020; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022; Wang et al., 2021),
have presented attractive scaling laws (Wei et al., 2022b), where emergent zero-shot and few-shot
capabilities suddenly arose. Among them, GPT-3 (Brown et al., 2020) with 175B parameters pioneered the study of 100B-scale LLMs, strikingly achieving better performance with 32 labeled examples than the fully-supervised BERT-Large model on a variety of benchmarks. However, both GPT-3 itself (like many other closed-source 100B-scale models) and the details of how it can be trained have thus far been opaque to the public. It is of critical value to train a high-quality LLM at such a scale with both the model and the training process shared with everyone.
We thus aim to pre-train an open and highly-accurate 100B-scale model with ethical concerns in
mind. Over the course of our attempt, we have come to realize that pre-training a dense LLM at
such a scale raises numerous unexpected technical and engineering challenges compared to training
10B-scale models, in terms of pre-training efficiency, stability, and convergence. Similar difficulties
* The two lead authors AZ and XL contributed equally ({zengaohan,shawliu9}@gmail.com).
† Work partially done when AZ, XL, and ZD interned at Zhipu.AI.
‡ Team leads: YD and JT. Corresponding author: JT (jietang@tsinghua.edu.cn).
For detailed author contributions, please refer to Appendix E.
Published as a conference paper at ICLR 2023
have also been concurrently observed in training OPT-175B (Zhang et al., 2022) and BLOOM-
176B (Scao et al., 2022), further demonstrating the significance of GPT-3 as a pioneer study.
In this work, we introduce the pre-training of a 100B-scale model, GLM-130B, in terms of engineering efforts, model design choices, training strategies for efficiency and stability, and quantization for affordable inference. As it has been widely realized that it is computationally unaffordable to empirically enumerate all possible designs for training 100B-scale LLMs, we present not only the successful choices made for training GLM-130B but also many of the failed options and lessons learned.
Particularly, the training stability is the decisive factor in the success of training models of such a
scale. Different from practices such as manually adjusting learning rates in OPT-175B and using embedding norm at the sacrifice of performance in BLOOM-176B, we experiment with various options and find that the strategy of embedding gradient shrink can significantly stabilize the training of GLM-130B.
Specifically, GLM-130B is a bilingual (English and Chinese) bidirectional dense model with 130 billion parameters, pre-trained over 400 billion tokens on a cluster of 96 NVIDIA DGX-A100 (8×40G) GPU nodes between May 6 and July 3, 2022. Instead of using the GPT-style architecture, we adopt
the General Language Model (GLM) algorithm (Du et al., 2022) to leverage its bidirectional attention advantage and autoregressive blank infilling objective. Table 1 summarizes the comparison between GLM-130B, GPT-3, and two other open-source efforts, OPT-175B and BLOOM-176B, with PaLM 540B (Chowdhery et al., 2022), a 4× larger model, as a reference.
Altogether, the conceptual uniqueness and engineering efforts enable GLM-130B to exhibit performance that surpasses the level of GPT-3 on a wide range of benchmarks (112 tasks in total) and also outperforms PaLM 540B in many cases, while such outperformance over GPT-3 has not been observed in OPT-175B and BLOOM-176B (Cf. Figure 1 left). For zero-shot performance, GLM-130B is better than GPT-3 175B (+5.0%), OPT-175B (+6.5%), and BLOOM-176B (+13.0%) on LAMBADA (Paperno et al., 2016), and achieves 3× better performance than GPT-3 on BIG-bench-lite (Srivastava et al., 2022). On the 5-shot MMLU (Hendrycks et al., 2021) tasks, it is better than GPT-3 175B (+0.9%) and BLOOM-176B (+12.7%). As a bilingual LLM also covering Chinese, it offers significantly better results than ERNIE TITAN 3.0 260B (Wang et al., 2021), the largest Chinese LLM, on 7 zero-shot CLUE (Xu et al., 2020) datasets (+24.26%) and 5 zero-shot FewCLUE (Xu et al., 2021) ones (+12.75%). Importantly, as summarized in Figure 1 right, GLM-130B as an open model is associated with significantly less bias and generation toxicity than its 100B-scale counterparts.
Finally, we design GLM-130B to empower as many people as possible to conduct 100B-scale LLM
studies. First, instead of using 175B+ parameters as OPT and BLOOM do, the 130B size is chosen because it supports inference on a single A100 (8×40G) server. Second, to further lower the GPU requirements, we quantize GLM-130B into INT4 precision without post-training, while OPT and BLOOM can only reach INT8. Thanks to a unique property of the GLM architecture, GLM-130B's INT4 quantization introduces negligible performance degradation, e.g., -0.74% on LAMBADA and
even +0.05% on MMLU, making it still better than the uncompressed GPT-3. This enables GLM-130B's fast inference, with performance guaranteed, on a server of 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G), the most affordable GPUs required for using 100B-scale LLMs to date.

Figure 3: Trials on different LayerNorms for GLM-130B training: (a) more than 30 failed preliminary trials at 100B-scale; (b) final decisive trials, Sandwich-LN vs. DeepNorm (gradient norm over early training steps). DeepNorm turns out to be the most stable one, as it has a small gradient norm and does not spike in early-stage training.
We open-source the model checkpoints, code, training logs, related toolkits, and lessons learned.
2 THE DESIGN CHOICES OF GLM-130B
The architecture of a machine learning model defines its inductive bias. However, it has been realized that it is computationally unaffordable to explore various architectural designs for LLMs. We introduce and explain the unique design choices of GLM-130B.
2.1 GLM-130B'S ARCHITECTURE
GLM as Backbone. Most recent 100B-scale LLMs, such as GPT-3, PaLM, OPT, and BLOOM, follow the traditional GPT-style (Radford et al., 2019) architecture of decoder-only autoregressive language modeling. In GLM-130B, we instead make an attempt to explore the potential of a bidirectional GLM (General Language Model; Du et al., 2022) as its backbone.
GLM is a transformer-based language model that leverages autoregressive blank infilling as its training objective. Briefly, for a text sequence x = [x_1, ..., x_n], text spans {s_1, ..., s_m} are sampled from it; each span s_i denotes a run of consecutive tokens [s_{i,1}, ..., s_{i,l_i}] and is replaced (i.e., corrupted) with a single mask token to form x_corrupt. The model is asked to recover the spans autoregressively. To allow interactions between corrupted spans, their visibility to each other is decided by a randomly sampled permutation of their order.
GLM’s bidirectional attention over unmasked (i.e., uncorrupted) contexts distinguishes GLM-130B
from GPT-style LLMs in which the unidirectional attention is used. To support both understanding
and generation, it mixes two corruption objectives, each indicated by a special mask token:
• [MASK]: short blanks in sentences whose lengths add up to a certain portion of the input.
• [gMASK]: random-length long blanks at the end of sentences with prefix contexts provided.
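To make the objective concrete, the corruption scheme above can be sketched in a few lines of Python. This is our own toy simplification: real tokenization, span sampling, and position handling are omitted, and the function name is illustrative.

```python
import random

def corrupt_for_glm(tokens, spans, gmask=False):
    """Replace each sampled span with a single mask token (toy sketch).

    tokens: list of input tokens; spans: list of (start, length) pairs.
    Returns (x_corrupt, targets): targets holds the corrupted spans in a
    randomly permuted order, to be recovered autoregressively.
    """
    mask = "[gMASK]" if gmask else "[MASK]"
    ordered = sorted(spans)
    x_corrupt, cursor = [], 0
    for start, length in ordered:
        x_corrupt += tokens[cursor:start] + [mask]
        cursor = start + length
    x_corrupt += tokens[cursor:]
    order = list(range(len(ordered)))
    random.shuffle(order)  # spans' visibility to each other follows this order
    targets = [tokens[s:s + l] for s, l in (ordered[i] for i in order)]
    return x_corrupt, targets

x, y = corrupt_for_glm(list("ABCDEFGH"), [(1, 2), (5, 2)])
# x == ['A', '[MASK]', 'D', 'E', '[MASK]', 'H']; y holds ['B','C'] and ['F','G']
```

With gmask=True and a single span running to the end of the sequence, the same function mimics the [gMASK] prefix-context case.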
Conceptually, the blank infilling objective with bidirectional attention enables a more effective comprehension of contexts than GPT-style models: when using [MASK], GLM-130B behaves as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020); when using [gMASK], GLM-130B behaves similarly to PrefixLM (Liu et al., 2018; Dong et al., 2019).

Empirically, GLM-130B offers a record-high accuracy of 80.2% on zero-shot LAMBADA, outperforming both GPT-3 and PaLM 540B in Figure 2. By setting the attention mask, GLM-130B's unidirectional variant is comparable to GPT-3 and OPT-175B. Our observations are in line with existing findings (Liu et al., 2018; Dong et al., 2019).

Figure 2: GLM-130B and LLMs of similar scale on zero-shot LAMBADA language modeling; bidirectional attention (e.g., GLM) vs. unidirectional attention (e.g., GPT-3, PaLM). Details on GLM's bidirectional attention are provided in Du et al. (2022).
Layer Normalization (LN, Ba et al. (2016)). Training instability is one major challenge for training
LLMs (Zhang et al., 2022; Scao et al., 2022; Chowdhery et al., 2022) (Cf. Figure 10 in Appendix
for collapses in training several 100B-scale models). A proper choice of LNs can help stabilize
the training of LLMs. We experiment with existing practices, e.g., Pre-LN (Xiong et al., 2020),
Post-LN (Ba et al., 2016), Sandwich-LN (Ding et al., 2021), which are unfortunately incapable of
stabilizing our GLM-130B test runs (Cf. Figure 3 (a) and Appendix B.2 for details).
Our search later focuses on Post-LN due to its favorable downstream results in preliminary experiments, even though it does not stabilize GLM-130B on its own. Fortunately, one of the attempts, Post-LN initialized with the newly-proposed DeepNorm (Wang et al., 2022b), generates promising training stability. Specifically, given the number of GLM-130B's layers N, we adopt DeepNorm(x) = LayerNorm(α · x + Network(x)), where α = (2N)^(1/2), and apply the Xavier normal initialization with a scaling factor of (2N)^(-1/2) to the ffn, v_proj, and out_proj layers. Additionally, all bias terms are initialized to zero. Figure 3 shows this significantly benefits the training stability of GLM-130B.
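As a concrete illustration, the DeepNorm residual can be sketched in plain Python. This is a simplified sketch: the learned LayerNorm gain/bias are omitted, and the matching (2N)^(-1/2) initialization happens at weight-initialization time rather than here.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over a vector (learned gain/bias omitted for brevity)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def deepnorm_residual(x, sublayer_out, num_layers):
    """DeepNorm(x) = LayerNorm(alpha * x + Network(x)), alpha = (2N)^(1/2).
    Scaling the residual branch up (and the sublayer init down) bounds how
    fast the main branch's value scale can grow with depth."""
    alpha = (2 * num_layers) ** 0.5
    return layer_norm([alpha * xi + si for xi, si in zip(x, sublayer_out)])

alpha_130b = (2 * 70) ** 0.5  # for GLM-130B's 70 layers, alpha is about 11.83
```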
Positional Encoding and FFNs. We empirically test different options for positional encoding (PE) and FFN improvements in terms of both training stability and downstream performance (Cf. Appendix B.3 for details). For PEs in GLM-130B, we adopt Rotary Positional Encoding (RoPE, Su et al. (2021)) rather than ALiBi (Press et al., 2021). To improve the FFNs in Transformer, we pick GLU with the GeLU (Hendrycks & Gimpel, 2016) activation as the replacement.
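For illustration, here is a minimal sketch of RoPE applied to a single head vector (our own simplification in plain Python; production implementations operate on batched tensors and usually cache the sin/cos tables):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x_{2i}, x_{2i+1}) by pos * theta_i, where
    theta_i = base^(-2i/d). Attention scores between rotated q and k then
    depend only on their relative position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # == base^(-2*(i/2)/d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

Because each pair undergoes a pure rotation, the vector norm is preserved, and the dot product of two rotated vectors depends only on the difference of their positions.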
GLM-130B is trained on a cluster of 96 DGX-A100 GPU (8×40G) servers with a 60-day access.
The goal is to pass through as many tokens as possible, as a recent study (Hoffmann et al., 2022)
suggests that most existing LLMs are largely under-trained.
The 3D Parallel Strategy. Data parallelism (Valiant, 1990) and tensor model parallelism (Shoeybi et al., 2019) are the de facto practices for training billion-scale models (Wang & Komatsuzaki, 2021; Du et al., 2022). To further handle the huge GPU memory requirement and the decrease in overall GPU utilization resulting from applying tensor parallelism across nodes (as 40G rather than 80G A100s are used for training GLM-130B), we combine pipeline model parallelism with the other two strategies to form a 3D parallel strategy.
The pipeline parallelism divides the model into sequential stages for each parallel group, and to further minimize bubbles introduced by the pipeline, we leverage the PipeDream-Flush (Narayanan et al.,
2021) implementation from DeepSpeed (Rasley et al., 2020) to train GLM-130B with a relatively large global batch size (4,224) to reduce time and GPU memory waste. Through both numerical and empirical examinations, we adopt 4-way tensor parallelism and 8-way pipeline parallelism (Cf. Appendix B.4 for details). Following the calculation in Chowdhery et al. (2022), we report a hardware FLOPs utilization (HFU) of 43.3% and a model FLOPs utilization (MFU) of 32.5% due to re-materialization.
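As a sanity check, the layout described above can be verified with simple arithmetic (our own back-of-the-envelope calculation, not code from the training system):

```python
# 3D parallel layout of the GLM-130B cluster described above.
nodes, gpus_per_node = 96, 8
tensor_parallel, pipeline_parallel = 4, 8

total_gpus = nodes * gpus_per_node                    # 768 GPUs in total
model_parallel = tensor_parallel * pipeline_parallel  # 32 GPUs per model replica
data_parallel = total_gpus // model_parallel          # 24 data-parallel replicas

# With the global batch size of 4,224, each replica then processes
# 4,224 / 24 = 176 samples per step (further split into micro-batches
# to keep the pipeline full).
samples_per_replica = 4224 // data_parallel
```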
GLM-130B Configurations. We aim to enable our 100B-scale LLM to run on a single DGX-A100 (40G) node in FP16 precision. Based on the hidden state dimension of 12,288 that we adopt from GPT-3, the resultant model size can be no more than 130B parameters, hence GLM-130B. To maximize GPU utilization, we configure the model based on the platform and its corresponding parallel strategy. To avoid insufficient memory utilization in the middle stages due to the additional word embeddings at both ends, we balance the pipeline partition by removing one layer from each of them, yielding 9×8-2=70 transformer layers in GLM-130B.
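A rough back-of-the-envelope estimate shows how these choices land near 130B parameters. This is our own approximation: the ~150k vocabulary and the 8h/3 GLU inner dimension are assumptions, and LayerNorm gains/biases are ignored.

```python
hidden = 12288
layers = 9 * 8 - 2              # 70 transformer layers after pipeline balancing
vocab = 150_000                 # assumed bilingual vocabulary size

attention = 4 * hidden * hidden            # q, k, v, and output projections
glu_ffn = 3 * hidden * (8 * hidden // 3)   # GLU uses 3 matrices of inner dim 8h/3
per_layer = attention + glu_ffn
total = layers * per_layer + vocab * hidden
# total comes out on the order of 1.3e11, i.e., roughly 130B parameters
```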
During the 60-day access to the cluster, we manage to train GLM-130B on 400 billion tokens (roughly 200 billion each for Chinese and English) with a fixed sequence length of 2,048 per sample. For the [gMASK] training objective, we use a context window of 2,048 tokens. For the [MASK] and multi-task objectives, we use a context window of 512 and concatenate four samples together to match the 2,048 sequence length. We warm up the batch size from 192 to 4,224 over the first 2.5% of samples. We use AdamW (Loshchilov & Hutter, 2019) as our optimizer with β1 and β2 set to 0.9 and 0.95, and a weight decay of 0.1. We warm up the learning rate from 10^-7 to 8×10^-5 over the first 0.5% of samples, then decay it by a 10× cosine schedule. We use a dropout rate of 0.1 and clip gradients at a value of 1.0 (Cf. Table 11 for the full configurations).
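The learning-rate schedule described above can be sketched as follows (a minimal sketch consistent with the stated numbers; the exact implementation in the training code may differ):

```python
import math

def glm130b_lr(step, total_steps, warmup_frac=0.005, lr_min=1e-7, lr_max=8e-5):
    """Linear warmup from 1e-7 to 8e-5 over the first 0.5% of training,
    then cosine decay down to lr_max / 10 ("decay by a 10x cosine schedule")."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr_floor = lr_max / 10
    return lr_floor + 0.5 * (lr_max - lr_floor) * (1 + math.cos(math.pi * progress))
```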
The training stability is the decisive factor in GLM-130B's quality, which is also largely impacted by the number of tokens it passes through (Hoffmann et al., 2022). Thus, given the computing usage constraint, there has to be a trade-off between efficiency and stability with regard to floating-point (FP) formats: low-precision FP formats (e.g., 16-bit precision, FP16) improve computing efficiency but are prone to overflow and underflow errors, resulting in training collapses.
Mixed-Precision. We follow the common practice of a mixed-precision (Micikevicius et al., 2018) strategy (Apex O2), i.e., FP16 for forwards and backwards and FP32 for optimizer states and master weights, to reduce GPU memory usage and improve training efficiency. Similar to OPT-175B and BLOOM-176B (Cf. Figure 10 in Appendix), the training of GLM-130B faces frequent loss spikes resulting from this choice, and they tend to become increasingly frequent as training goes on. The precision-related spikes often occur without clear reasons: some recover on their own; others come with the portent of a suddenly soaring gradient norm and eventually a spike or even NaN in the loss. OPT-175B attempted a fix by manually skipping data and adjusting hyper-parameters; BLOOM-176B did so via the embedding norm technique (Dettmers et al., 2021). We spent months empirically investigating the spikes and realized that a few issues emerge when transformers scale up:

First, the transformer main branch's value scale can be extremely large in deeper layers if using Pre-LN. This is addressed in GLM-130B by using DeepNorm-based Post-LN (Cf. Section 2.1), which keeps the value scale always bounded.

Figure 4: EGS reduces gradient scale and variance to stabilize LLMs' pre-training: (a) gradient norm with EGS α = 0.1; (b) EGS in 40B-scale testing.
Second, as the model scales up, the attention scores grow so large that they exceed FP16's range. There are a few options to overcome this issue in LLMs. In CogView (Ding et al., 2021), PB-Relax is proposed to remove bias terms and deduct the extremum value in attention computation to avoid the problem, which unfortunately does not prevent divergence in GLM-130B. In BLOOM-176B, the BF16 format is used instead of FP16, due to its wide range of values on NVIDIA Ampere GPUs (i.e., A100). However, BF16 consumes ∼15%
more run-time GPU memory than FP16 in our experiments, due to its conversion to FP32 in gradient accumulation; more importantly, it is not supported on other GPU platforms (e.g., NVIDIA Tesla V100), limiting the accessibility of the produced LLMs. Another option from BLOOM-176B is to apply embedding norm together with BF16, but at the cost of a significant penalty on model performance, as they notice that embedding norm can harm the model's zero-shot learning (Cf. Section 4.3 in Scao et al. (2022)).
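The arithmetic behind this overflow risk is easy to check (illustrative numbers only; the activation magnitude below is hypothetical, not a measurement from GLM-130B):

```python
FP16_MAX = 65504.0   # largest finite FP16 value
BF16_MAX = 3.39e38   # BF16 shares FP32's exponent range

d_head = 128         # per-head dimension
elem = 30.0          # hypothetical per-element activation magnitude
logit = d_head * elem * elem  # worst-case unscaled q . k = 115,200

# 115,200 already exceeds FP16's finite range but is far below BF16's,
# which is why BLOOM-176B switched to BF16.
```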
Embedding Layer Gradient Shrink (EGS). Our empirical search identifies that the gradient norm can serve as an informative indicator of training collapses. Specifically, we find that a training collapse usually lags behind a “spike” in the gradient norm by a few training steps. Such spikes are usually caused by the embedding layer's abnormal gradients, as we observe that its gradient norm is often several orders of magnitude larger than those of other layers in GLM-130B's early-stage training (Cf. Figure 4 (a)). In addition, it tends to fluctuate dramatically in early training. The problem is handled in vision models (Chen et al., 2021) by freezing the patch projection layer. Unfortunately, we cannot freeze the training of the embedding layer in language models.
Finally, we find that gradient shrink on the embedding layer can overcome loss spikes and thus stabilize GLM-130B's training. It was first used in the multi-modal transformer CogView (Ding et al., 2021). Let α be the shrinking factor; the strategy can be easily implemented via word_embedding = word_embedding ∗ α + word_embedding.detach() ∗ (1 − α). Figure 4 (b) suggests that, empirically, setting α = 0.1 wipes out most spikes we would otherwise have met, with negligible latency.
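The one-line trick can be illustrated with plain Python (a toy scalar sketch of the autograd behavior; in practice this is applied to the embedding tensor inside the training framework):

```python
def egs(w_value, upstream_grad, alpha=0.1):
    """Mimics emb = emb * alpha + emb.detach() * (1 - alpha) on a scalar.
    detach() passes the value through but blocks the gradient, so the
    forward value is unchanged while the gradient reaching the embedding
    weights is scaled down by alpha."""
    forward = w_value * alpha + w_value * (1 - alpha)  # == w_value
    grad = alpha * upstream_grad  # only the first term receives gradient
    return forward, grad

fwd, grad = egs(2.5, upstream_grad=40.0, alpha=0.1)
# forward value unchanged (2.5); gradient shrunk 10x (about 4.0)
```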
In fact, the final GLM-130B training run only experiences three late-stage loss divergence cases,
though it fails numerous times due to hardware failures. For the three unexpected spikes, it turns out
further shrinking the embedding gradient can still help stabilize the GLM-130B training. See the
training notes and Tensorboard logs in our code repository for details.
One of the major goals of GLM-130B is to lower the hardware requirements for accessing 100B-scale LLMs without efficiency and effectiveness disadvantages.
As mentioned, the model size of 130B is chosen so that the full GLM-130B model can run on a single A100 (40G×8) server, rather than the high-end A100 (80G×8) machines required by OPT-175B and BLOOM-176B. To accelerate GLM-130B inference, we also leverage FasterTransformer (Timonin et al., 2022) to implement GLM-130B in C++. Compared to the PyTorch implementation of BLOOM-176B in Hugging Face, GLM-130B's decoding inference is 7-8.4× faster on the same single A100 server (Cf. Appendix B.5 for details).
INT4 Quantization for RTX 3090s/2080s. To further support popular GPUs, we attempt to compress GLM-130B as much as possible while maintaining its performance superiority, particularly via quantization (Zafrir et al., 2019; Shen et al., 2020; Tao et al., 2022), which typically introduces little task-agnostic performance drop for generative language models.
Typically, the practice is to quantize both model weights and activations to INT8. However, our analysis in Appendix B.6 suggests that LLMs' activations may contain extreme outliers. Concurrently, the emergent outliers in OPT-175B and BLOOM-176B were also discovered (Dettmers et al., 2022); they influence only about 0.1% of feature dimensions and are thus handled by matrix multiplication decomposition for the outlying dimensions. Differently, there exist about 30% outliers in GLM-130B's activations, making the technique above far less efficient. Thus, we decide to focus on the quantization of model weights (i.e., mostly linear layers) while keeping FP16 precision for activations. The quantized model is dynamically converted to FP16 precision at runtime, introducing a small computational overhead but greatly reducing the GPU memory usage for storing model weights.

Figure 5: (Left) attn-dense and w2's weight distributions; (Right) GLM-130B's INT4 weight quantization scaling law.
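A minimal sketch of symmetric absmax weight quantization to INT4 follows (a common scheme matching the description above; the exact quantizer used for GLM-130B may differ, e.g., in per-group scale granularity):

```python
def quantize_int4(weights):
    """Symmetric absmax INT4 quantization: integers in [-7, 7] plus one
    floating-point scale per tensor."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """At runtime, INT4 weights are converted back to floating point."""
    return [v * scale for v in q]

w = [0.31, -0.08, 0.55, -0.42, 0.02]
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)
# round-trip error per weight is bounded by half a bin, i.e., scale / 2
```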
Table 2: Left: Quantized GLM-130B's performance on several benchmarks; Right: INT4-quantized GLM-130B's inference speed (encode and decode) with FasterTransformer.

                         GLM-130B                GPT-3
Precision                FP16    INT8    INT4    FP16
MMLU (acc, ↑)            44.75   44.71   44.80   43.9
LAMBADA (acc, ↑)         80.21   80.21   79.47   76.2
Pile (a part, BPB, ↓)    0.634   0.638   0.641   0.74

GPU Type                 128 Enc.   128 Dec.   512 Enc.   512 Dec.
8 × A100 (40G)           0.15s      4.29s      0.18s      17.7s
8 × V100 (32G)           0.31s      6.97s      0.67s      28.1s
4 × RTX 3090 (24G)       0.37s      8.16s      1.30s      32.3s
8 × RTX 2080 Ti (11G)    0.39s      6.77s      1.04s      27.3s
Excitingly, we manage to reach INT4 weight quantization for GLM-130B, while existing successes have thus far only reached INT8. Memory-wise, compared to INT8, the INT4 version saves an additional half of the required GPU memory, down to 70GB, thus allowing GLM-130B inference on 4 × RTX 3090 Ti (24G) or 8 × RTX 2080 Ti (11G). Performance-wise, Table 2 left indicates that without any post-training, the INT4-version GLM-130B experiences almost no performance degradation, thus maintaining its performance advantages over GPT-3 on common benchmarks.
GLM’s INT4 Weight Quantization Scaling Law. We examine the underlying mechanism of this
unique INT4 weight quantization scaling law exhibited in Figure 5 right. We plot the weight value
distributions in Figure 5 left, which turns out to directly impact the quantization quality. Specifically,
a wider-distributed linear layer needs to be quantized with larger bins, leading to more precision loss.
Thus the wide-distributed attn-dense and w2 matrices explain the INT4 quantization failure for
GPT-style BLOOM. Conversely, GLMs tend to have much narrower distributions than those of
similar-sized GPTs, and the gap between INT4 and FP16 versions keeps further decreasing as the
GLM model size scales up (Cf. Figure 15 in Appendix for details).
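The intuition can be checked with a toy experiment (our own illustration, not the paper's analysis): under absmax quantization, the bin width scales with the largest weight magnitude, so a wider distribution loses more absolute precision.

```python
import random

random.seed(42)

def mean_abs_error(weights, levels=7):
    """Mean absolute round-trip error of symmetric absmax INT4 quantization."""
    scale = max(abs(w) for w in weights) / levels
    return sum(abs(w - round(w / scale) * scale) for w in weights) / len(weights)

narrow = [random.gauss(0, 0.01) for _ in range(10_000)]  # narrower, GLM-like
wide = [random.gauss(0, 0.10) for _ in range(10_000)]    # wider, GPT-like
# mean_abs_error(wide) comes out roughly 10x mean_abs_error(narrow)
```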
5 THE RESULTS
We follow the common settings of LLMs such as GPT-3 and PaLM to evaluate GLM-130B for English¹. As a bilingual LLM that also covers Chinese, GLM-130B is further evaluated on Chinese benchmarks.
Discussion on the Scope of Zero-Shot Learning in GLM-130B. Since GLM-130B has been trained with MIP, we clarify its scope of zero-shot evaluation here. In fact, “zero-shot” seems to have controversial interpretations without a consensus in the community. We follow one of the influential related surveys (Xian et al., 2018), which says “At test time, in zero-shot learning setting, the aim is to assign a test image to an unseen class label”, where involving unseen class labels is the key. We therefore derive our criterion for picking GLM-130B's zero-shot (and few-shot) datasets as:
• English: 1) For tasks with fixed labels (e.g., natural language inference), no datasets in such tasks should be evaluated on; 2) For tasks without fixed labels (e.g., (multiple-choice) QA, topic classification), only datasets with an obvious domain transfer from those in MIP should be considered.
• Chinese: All datasets can be evaluated, as there exists zero-shot cross-lingual transfer.
Filtering Test Datasets. Following prior practices (Brown et al., 2020; Rae et al., 2021) and the criterion mentioned above, we filter out and refrain from reporting the evaluation results of potentially contaminated datasets. For LAMBADA and CLUE, we find minimal overlap under the 13-gram setting. Pile, MMLU, and BIG-bench are either held out or were released later than the crawling of our corpora.
LAMBADA. LAMBADA (Paperno et al., 2016) is a dataset that tests last-word language modeling capability. The results previously shown in Figure 2 suggest that GLM-130B achieves a zero-shot accuracy of 80.2 with its bidirectional attention, setting a new record on LAMBADA.
Pile. The Pile test-set (Gao et al., 2020) includes a series of benchmarks for language modeling. On average, GLM-130B performs the best on its 18 shared test sets in terms of weighted BPB when compared to GPT-3 and Jurassic-1 (Lieber et al., 2021), whose results are directly adopted from the latter, demonstrating its strong language capability (Cf. Appendix C.4 for details).

Table 3: GLM-130B's average BPB on Pile evaluation (18 sub-datasets).
            Jurassic-1   GPT-3   GLM-130B
Avg. BPB    0.650        0.742   0.634
¹ Results from OPT-175B's paper are directly reported, as our applications to access the model have not been approved for months.
Figure 6: GLM-130B on MMLU (57 tasks) along training steps. Figure 7: BIG-bench-lite evaluation (24 tasks) across scales.

Table 4: Details on BIG-bench-lite (24 tasks).
              0-shot   1-shot   3-shot
GPT-3 2.6B     0.60     0.71     1.83
GPT-3 6.7B    -0.06     2.93     5.40
GPT-3 13B      1.77     5.43     7.95
GPT-3 175B     4.35    11.34    13.18
PaLM 540B      8.05    37.77        -
GLM-130B      13.31    14.91    15.12
MMLU (Hendrycks et al., 2021) is a diverse benchmark including 57 multiple-choice question answering tasks concerning human knowledge, ranging from high-school to expert level. It was released after the crawling of Pile and serves as an ideal test-bed for LLMs' few-shot learning. The GPT-3 result is adopted from MMLU, and BLOOM-176B is tested using the same prompts as GLM-130B's (Cf. Appendix C.6 and Table 15 for details).
GLM-130B's few-shot (5-shot) performance on MMLU approaches GPT-3's (43.9) after viewing about 300B tokens in Figure 6. It continues to improve as training proceeds, achieving an accuracy of 44.8 when the training has to end (i.e., after viewing 400B tokens in total). This aligns with the observation (Hoffmann et al., 2022) that most existing LLMs are far from adequately trained.
BIG-bench (Srivastava et al., 2022) benchmarks challenging tasks concerning models' abilities in reasoning, knowledge, and commonsense. Given that evaluating on its 150 tasks is time-consuming for LLMs, we report BIG-bench-lite, an official 24-task sub-collection, for now. As observed from Figure 7 and Table 4, GLM-130B outperforms GPT-3 175B and even PaLM 540B (4× larger) in the zero-shot setting. This is probably owing to GLM-130B's bidirectional context attention and MIP, which have been proven to improve zero-shot results on unseen tasks (Wei et al., 2022a; Sanh et al., 2022). As the number of shots increases, GLM-130B's performance keeps going up, maintaining its advantage over GPT-3 (Cf. Appendix C.5 and Table 14 for details on each model and task).
Limitations and Discussions. In the experiments above, we observe that GLM-130B’s performance
growth (13.31 to 15.12) with the increase of few-shot samples is not as significant as GPT-3’s (4.35
to 13.18). Here is our intuitive attempt to understand the phenomenon.
First, the bidirectional nature of GLM-130B could lead to strong zero-shot performance (as indicated in zero-shot language modeling), bringing it closer to the few-shot “upper bound” for models of similar scale (i.e., 100B-scale) than unidirectional LLMs. Second, it may also be attributed to a deficit of existing MIP paradigms (Wei et al., 2022a; Sanh et al., 2022), which involve only zero-shot prediction in training and are likely to bias GLM-130B toward stronger zero-shot learning but relatively weaker in-context few-shot performance. To correct this bias, a potential solution would be to employ MIP with varied numbers of in-context shots rather than zero-shot samples only.
Finally, despite having almost the same GPT architecture as GPT-3, PaLM 540B's relative improvement with few-shot in-context learning is substantially more significant than GPT-3's. We conjecture that this further acceleration in performance growth stems from PaLM's high-quality and diverse privately collected training corpora. Combining our experiences with the insights of Hoffmann et al. (2022), we come to realize that better architectures, better data, and more training FLOPS should be further invested in.
Figure 8: GLM-130B and ERNIE Titan 3.0 260B evaluated on zero-shot CLUE and FewCLUE (accuracy or EM on EPRSTMT, OCNLI-FC, BUSTM, CHID-FC, CLUEWSC-FC, C3, WSC1.1, CMNLI, DRCD, OCNLI_50K, AFQMC, and CMRC2018).
We evaluate GLM-130B on 7 CLUE and 5 FewCLUE datasets (Cf. Appendix C.7 for details). We compare GLM-130B to the largest existing Chinese monolingual language model, the 260B ERNIE Titan 3.0 (Wang et al., 2021), and follow its setting to report zero-shot results on the dev sets. GLM-130B consistently outperforms ERNIE Titan 3.0 across the 12 tasks (Cf. Figure 8). Interestingly, GLM-130B performs at least 260% better than ERNIE on the two abstractive MRC datasets (DRCD and CMRC2018), possibly because GLM-130B's pre-training objective naturally resonates with the form of abstractive MRC.
6 RELATED WORK

In this section, we review work related to GLM-130B on the topics of pre-training, transferring, and inference of pre-trained LLMs (Qiu et al., 2020; Bommasani et al., 2021).
Pre-Training. Vanilla language modeling refers to decoder-only autoregressive models (e.g., GPT (Radford et al., 2018)), though the term also covers any form of self-supervised objective on text. Recently, transformer-based (Vaswani et al., 2017) language models have presented a fascinating scaling law: new abilities (Wei et al., 2022b) arise as models scale up, from 1.5B (Radford et al., 2019) and 10B-scale language models (Raffel et al., 2020; Shoeybi et al., 2019; Black et al., 2022) to the 100B-scale GPT-3 (Brown et al., 2020). Later, despite the emergence of many 100B-scale LLMs (Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Wu et al., 2021; Zeng et al., 2021; Wang et al., 2021) in both English and Chinese, they are either not available to the public or only accessible via limited APIs. The closed nature of LLMs severely stymies their development. GLM-130B's efforts, along with recent ones from EleutherAI, OPT-175B (Zhang et al., 2022), and BLOOM-176B (Scao et al., 2022), aim to offer high-quality open-sourced LLMs to our community.
Transferring. Though fine-tuning has been the de facto way of transfer learning, the evaluation of LLMs has focused on prompting and in-context learning due to their tremendous sizes (Brown et al., 2020; Liu et al., 2021a). Nevertheless, some recent attempts have been made on parameter-efficient learning on language models (Houlsby et al., 2019) and prompt tuning (i.e., P-tuning, Li & Liang (2021); Liu et al. (2021b); Lester et al. (2021); Liu et al. (2022)). For now we do not focus on them and leave their comprehensive testing on GLM-130B to future study.
Inference. Most publicly accessible LLMs nowadays provide their services via limited APIs. In this work, an important part of our endeavor has been LLMs' efficient and fast inference. Related work includes distillation (Sanh et al., 2019; Jiao et al., 2020; Wang et al., 2020), quantization (Zafrir et al., 2019; Shen et al., 2020; Tao et al., 2022), and pruning (Michel et al., 2019; Fan et al., 2019). Very recent work (Dettmers et al., 2022) shows that LLMs such as OPT-175B and BLOOM-176B can be quantized to 8-bit due to the special distribution of outlier dimensions. In this work, we demonstrate GLM's scaling law for INT4 weight quantization, which allows GLM-130B to run inference on as few as 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs.
We introduce GLM-130B, a bilingual pre-trained language model that aims to facilitate open and inclusive LLM research. GLM-130B's technical and engineering undertakings generate insights into LLMs' architectures, pre-training objectives, training stability and efficiency, and affordable inference. Altogether, these contribute to the high quality of GLM-130B in terms of both language performance on 112 tasks and ethical results on bias and toxicity benchmarks. Our experiences of both success and failure are condensed into lessons for training 100B-scale LLMs, attached in Appendix B.10.
ACKNOWLEDGEMENT
This research was supported by Natural Science Foundation of China (NSFC) 61825602, 62276148
and Zhipu.AI. We thank all our collaborators and partners from the Knowledge Engineering Group
(KEG), Parallel Architecture & Compiler technology of Mobile, Accelerated, and Networked sys-
tems Group (PACMAN), Natural Language Processing Group (THUNLP) at Tsinghua University,
and Zhipu.AI.
ETHICS STATEMENT
We hereby acknowledge that all co-authors of this work are aware of the provided ICLR Code
of Ethics and honor the code of conduct. This work introduces an open-source Large Language
Model (LLM), which could be used to generate synthetic text for harmful applications such as tele-
marketing fraud, political propaganda, and personal harassment, as discussed in (Weidinger et al.,
2021; Sheng et al., 2021; Dev et al., 2021). We do not anticipate any hazardous outputs, especially
towards vulnerable and historically disadvantaged groups of people, after using the model.
To better collaborate with our community to prevent and ultimately eliminate these risks techni-
cally, we make the following crucial open efforts in this work:
Open-Sourced LLMs for Ethical Risk Study. While some argue that restricting access to LLMs
can prevent such harmful applications, we argue that promoting LLM inclusivity leads
to better defense against the potential harms caused by LLMs. Currently, only governments and large
corporations can afford the considerable costs of pre-training LLMs. There is no guarantee that
organizations with such substantial financial resources will not do harm using an LLM. Without
access to such LLMs, individuals cannot even recognize the role of LLMs in the harm.
Conversely, releasing an open LLM provides access and transparency to all researchers and
promotes research to reduce the potential harm of LLMs, such as algorithms to identify synthetic
text (Gehrmann et al., 2019). Also, it is known that LLMs can suffer from problems in fairness,
bias, privacy, and truthfulness (Zhang et al., 2021; Lin et al., 2022; Liang et al., 2021; Bender
et al., 2021). An open LLM reveals the model parameters and internal states corresponding
to specific inputs instead of providing APIs to black-box models. Thus, researchers can
conduct in-depth analysis of LLMs' flaws and propose improved algorithms to solve these problems.
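One line of such research is statistical detection of generated text in the spirit of GLTR (Gehrmann et al., 2019): machine-generated text tends to consist of tokens a language model ranks highly, so the fraction of tokens in the model's top-k is a simple detection signal. The toy unigram "model" below is a stand-in assumption for illustration only; GLTR itself uses a real LM's per-token probabilities:

```python
from collections import Counter

def token_ranks(text, unigram_counts):
    """Rank each token under a toy unigram model (1 = most likely).
    Unknown tokens get the worst possible rank."""
    vocab = [w for w, _ in unigram_counts.most_common()]
    rank = {w: i + 1 for i, w in enumerate(vocab)}
    worst = len(vocab) + 1
    return [rank.get(tok, worst) for tok in text.lower().split()]

def top_k_fraction(ranks, k):
    """Fraction of tokens whose rank falls within the model's top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Hypothetical frequency table standing in for a language model.
model = Counter({"the": 100, "a": 80, "of": 60, "to": 50, "quark": 1})

# Text built from high-rank tokens scores 1.0; text with rare or
# unknown tokens scores lower.
generated_like = top_k_fraction(token_ranks("the a of to", model), k=4)
human_like = top_k_fraction(token_ranks("the quark of unseen", model), k=4)
```

An unusually high top-k fraction suggests machine-generated text; crucially, an open model makes such per-token statistics available to any researcher, whereas a black-box API may not expose them.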
Ethical Evaluation and Improvements. We also evaluate our model over a wide range of English
ethical evaluation benchmarks, including bias measurement (Nadeem et al., 2021; Nangia et al.,
2020), hate speech detection (Mollas et al., 2020), and toxic generation estimation (Gehman et al.,
2020). Notwithstanding their deficiencies (Blodgett et al., 2021; Jacobs & Wallach, 2021), these
datasets serve as a meaningful initial step towards an open quantitative evaluation of LLMs.
Our evaluation implies that our algorithm designs, especially the bilingual pre-training of an LLM,
can significantly mitigate the biases and toxicity an LLM may present while keeping its strong
language performance compared to other LLMs (Brown et al., 2020; Zhang et al., 2022) trained
on monolingual English corpora (cf. Appendix A for more details).
REPRODUCIBILITY
Compared to mainstream closed-sourced LLMs including GPT-3 175B (Brown et al., 2020), PaLM
540B (Chowdhery et al., 2022), Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022),
LaMDA (Thoppilan et al., 2022), FLAN (Wei et al., 2022a), and many others, GLM-130B is open-
sourced and committed to promoting openness and inclusivity in LLM research from the very beginning.
We have made great efforts to ensure the reproducibility of our evaluation. As for pre-training,
despite the currently unaffordable cost of reproducing it, we still make our best effort to dis-
close the code, details, and the whole process of GLM-130B's pre-training. Our endeavor to allow
GLM-130B inference on a few popular GPUs such as the 3090/2080 Ti also aligns with this repro-
ducibility undertaking, as it allows most academic researchers to reproduce GLM-130B's results on
their offline machines. We also provide free APIs for individual users to test GLM-130B's abilities.
Pre-Training. We provide the complete training notes, Tensorboard logs, and code for our pre-
training in our repository (cf. Abstract). The pre-training hyper-parameters and cluster configu-
ration are provided in Section 2.3 and Table 11. The training corpora composition and details of
Multi-task Instruction Pre-training are provided in Section 2.2 and Appendices C.1 and C.2.
Evaluation. We organize all the evaluations, including language benchmarks (LAMBADA, Pile,
MMLU, BIG-bench, CLUE, and FewCLUE) and ethical benchmarks (CrowS-Pairs, StereoSet,
ETHOS, RealToxicityPrompts), into one-command-to-run bash scripts in our code repository. Data
processing details are provided in Section 5.1 and Appendix C.4 for language modeling benchmarks,
in Section 5.2 and Appendix C.6 for MMLU, in Section 5.3 and Appendix C.5 for BIG-bench, and
in Section 5.4 for CLUE and FewCLUE. For all ethical evaluations, please
refer to Appendix A for details.
REFERENCES
Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. Knowledge graph based synthetic
corpus generation for knowledge-enhanced language model pre-training. In Proceedings of the
2021 Conference of the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, pp. 3554–3565, 2021.
Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta,
Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. Ext5: Towards extreme multi-task
scaling for transfer learning. In International Conference on Learning Representations, 2022.
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victo-
ria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, et al. Efficient large scale language
modeling with mixtures of experts. arXiv preprint arXiv:2112.10684, 2021.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Ab-
heesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. Promptsource: An integrated
development environment and repository for natural language prompts. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations,
pp. 93–104, 2022.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the
dangers of stochastic parrots: Can language models be too big? In FAccT ’21: 2021 ACM Con-
ference on Fairness, Accountability, and Transparency, Virtual Event / Toronto, Canada, March
3-10, 2021, pp. 610–623. ACM, 2021.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from
question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural
language processing, pp. 1533–1544, 2013.
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical com-
monsense in natural language. In Proceedings of the AAAI conference on artificial intelligence,
volume 34, pp. 7432–7439, 2020.
Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Ho-
race He, Connor Leahy, Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open-source
autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Chal-
lenges & Perspectives in Creating Large Language Models, pp. 95–136, 2022.
Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. Stereotyping
norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the
59th Annual Meeting of the Association for Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1004–1015,
2021.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx,
Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportu-
nities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 6491–
6506. Association for Computational Linguistics, 2021.
Xavier Carreras and Lluís Màrquez. Introduction to the conll-2005 shared task: Semantic role
labeling. In CoNLL, pp. 152–164, 2005.
Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego
Moussallem, and Anastasia Shimorina. The 2020 bilingual, bi-directional WebNLG+ shared
task: Overview and evaluation results (WebNLG+ 2020). In Proceedings of the 3rd Inter-
national Workshop on Natural Language Generation from the Semantic Web (WebNLG+), pp.
55–76, Dublin, Ireland (Virtual), December 2020. Association for Computational Linguistics. URL
https://aclanthology.org/2020.webnlg-1.7.
Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision
transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.
9640–9649, 2021.
Ke-Li Chiu and Rohan Alexander. Detecting hate speech with gpt-3. arXiv preprint
arXiv:2103.12407, 2021.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and
Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.
arXiv preprint arXiv:1803.05457, 2018.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov.
Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise
quantization. arXiv preprint arXiv:2110.02861, 2021.
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix
multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
Sunipa Dev, Masoud Monajatipoor, Anaelia Ovalle, Arjun Subramonian, J. M. Phillips, and Kai Wei
Chang. Harms of gender exclusivity and challenges in non-binary representation in language
technologies. ArXiv, abs/2108.12084, 2021.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou,
Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers.
Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou,
and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding
and generation. Advances in Neural Information Processing Systems, 32, 2019.
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm:
General language model pretraining with autoregressive blank infilling. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 320–335, 2022.
Ondřej Dušek, David M. Howcroft, and Verena Rieser. Semantic noise matters for neural natural
language generation. In Proceedings of the 12th International Conference on Natural Language
Generation, pp. 421–426, Tokyo, Japan, October–November 2019. Association for Computa-
tional Linguistics. doi: 10.18653/v1/W19-8652. URL https://aclanthology.org/W19-8652.
Hady Elsahar, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique
Laforest, and Elena Simperl. T-rex: A large scale alignment of natural language with knowledge
base triples. In Proceedings of the Eleventh International Conference on Language Resources
and Evaluation (LREC 2018), 2018.
Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, Adarsh
Kumar, Anuj Kumar Goyal, Peter Ku, and Dilek Hakkani-Tür. Multiwoz 2.1: A consolidated
multi-domain dialogue dataset with state corrections and state tracking baselines. In LREC, 2020.
Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with
structured dropout. arXiv preprint arXiv:1909.11556, 2019.
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text
for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxic-
ityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the
Association for Computational Linguistics: EMNLP 2020, 2020.
Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. GLTR: Statistical detection and vi-
sualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations, pp. 111–116, Florence, Italy, July 2019. As-
sociation for Computational Linguistics.
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi,
Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das,
Kaustubh D Dhole, et al. The gem benchmark: Natural language generation, its evaluation and
metrics. GEM 2021, pp. 96, 2021.
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle
use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of
the Association for Computational Linguistics, 9:346–361, 2021.
Peter Hase, Mona T. Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit
Bansal, and Srinivasan Iyer. Do language models have beliefs? methods for detecting, updating,
and visualizing model beliefs. CoRR, abs/2111.13654, 2021.
Ruining He, Anirudh Ravula, Bhargav Kanagal, and Joshua Ainslie. Realformer: Transformer likes
residual attention. In Findings of the Association for Computational Linguistics: ACL-IJCNLP
2021, pp. 929–943, 2021.
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415, 2016.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
Steinhardt. Measuring massive multitask language understanding. In International Conference
on Learning Representations, 2021.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Train-
ing compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pre-
training for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, An-
drea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp.
In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong
Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural
networks using pipeline parallelism. Advances in neural information processing systems, 32,
2019.
Abigail Z Jacobs and Hanna Wallach. Measurement and fairness. In Proceedings of the 2021 ACM
conference on fairness, accountability, and transparency, pp. 375–385, 2021.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu.
Tinybert: Distilling bert for natural language understanding. In Findings of the Association for
Computational Linguistics: EMNLP 2020, pp. 4163–4174, 2020.
Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meet-
ing of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611,
2017.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris
Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a
benchmark for question answering research. Transactions of the Association for Computational
Linguistics, 7:453–466, 2019.
Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the
carbon emissions of machine learning. CoRR, abs/1910.09700, 2019.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt
tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Pro-
cessing, pp. 3045–3059, 2021.
Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thir-
teenth international conference on the principles of knowledge representation and reasoning,
2012.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the
11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
pp. 4582–4597, 2021.
Xiangyang Li, Yu Xia, Xiang Long, Zheng Li, and Sujian Li. Exploring text-transformers in aaai
2021 shared task: Covid-19 fake news detection in english. In CONSTRAINT@AAAI, 2021.
Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards under-
standing and mitigating social biases in language models. In Proceedings of the 38th Interna-
tional Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume
139 of Proceedings of Machine Learning Research, pp. 6565–6576. PMLR, 2021.
Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evalua-
tion. White Paper. AI21 Labs, 2021.
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization
Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguis-
tics. URL https://aclanthology.org/W04-1013.
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human
falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pp. 3214–3252, Dublin, Ireland, May 2022. Association for
Computational Linguistics.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-
train, prompt, and predict: A systematic survey of prompting methods in natural language pro-
cessing. arXiv preprint arXiv:2107.13586, 2021a.
Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam
Shazeer. Generating wikipedia by summarizing long sequences. In International Conference on
Learning Representations, 2018.
Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt
understands, too. arXiv preprint arXiv:2103.10385, 2021b.
Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning:
Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers),
pp. 61–68, 2022.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International
Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019,
2019.
Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? Advances
in neural information processing systems, 32, 2019.
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia,
Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed
precision training. In International Conference on Learning Representations, 2018.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct elec-
tricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing, pp. 2381–2391, 2018.
Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D. Manning, and Chelsea Finn. Memory-
based model editing at scale. In International Conference on Machine Learning, ICML 2022,
17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning
Research, pp. 15817–15831. PMLR, 2022.
Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. Ethos: an online
hate speech detection dataset. arXiv preprint arXiv:2006.08328, 2020.
Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained
language models. In Proceedings of the 59th Annual Meeting of the Association for Computa-
tional Linguistics and the 11th International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), pp. 5356–5371, 2021.
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel Bowman. Crows-pairs: A challenge
dataset for measuring social biases in masked language models. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1953–1967,
2020.
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient
pipeline-parallel dnn training. In International Conference on Machine Learning, pp. 7937–7947.
PMLR, 2021.
Tomoko Ohta, Yuka Tateisi, and Jin-Dong Kim. The genia corpus: An annotated research abstract
corpus in molecular biology domain. In HLT, pp. 82–86, 2002.
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi,
Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset:
Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, 2016.
David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel
Rothchild, David R. So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network
training. CoRR, abs/2104.10350, 2021.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga
Uryupina, Yuchen Zhang, and Zhi Zhong. Towards robust linguistic analysis using ontonotes. In
CoNLL, pp. 143–152, 2013.
Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables
input length extrapolation. In International Conference on Learning Representations, 2021.
Amy Pu, Hyung Won Chung, Ankur Parikh, Sebastian Gehrmann, and Thibault Sellam. Learning
compact metrics for MT. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, pp. 751–762, Online and Punta Cana, Dominican Republic, November
2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.58. URL
https://aclanthology.org/2021.emnlp-main.58.
Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. Pre-trained
models for natural language processing: A survey. Science China Technological Sciences, 63(10):
1872–1897, 2020.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language under-
standing with unsupervised learning. 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John
Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models:
Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text
transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen,
and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine
Learning, pp. 8821–8831. PMLR, 2021.
Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System opti-
mizations enable training deep learning models with over 100 billion parameters. In Proceedings
of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
pp. 3505–3506, 2020.
Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions with-
out labeled text. In ECML-PKDD, pp. 148–163, 2010.
Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the
parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), pp. 5418–5426, 2020.
Dan Roth and Wen-tau Yih. A linear programming formulation for global inference in natural
language tasks. In HLT-NAACL, pp. 1–8, 2004.
Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in
coreference resolution. In NAACL-HLT (2), 2018.
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kam-
yar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Pho-
torealistic text-to-image diffusion models with deep language understanding. In Advances in
Neural Information Processing Systems.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adver-
sarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the conll-2003 shared task: Language-
independent named entity recognition. In HLT-NAACL, pp. 142–147, 2003.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of
bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine
Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables
zero-shot task generalization. In The Tenth International Conference on Learning Representa-
tions, 2022.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman
Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-
parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for
reducing corpus-based bias in nlp. Transactions of the Association for Computational Linguistics,
9:1408–1424, 2021.
Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano.
MLSUM: The multilingual summarization corpus. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), pp. 8051–8067, Online, Novem-
ber 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.647.
URL https://aclanthology.org/2020.emnlp-main.647.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney,
and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 34, pp. 8815–8821, 2020.
Emily Sheng, Kai-Wei Chang, P. Natarajan, and Nanyun Peng. Societal biases in language genera-
tion: Progress and challenges. In ACL, 2021.
Sam Shleifer, Jason Weston, and Myle Ott. Normformer: Improved transformer pretraining with
extra normalization. arXiv preprint arXiv:2110.09456, 2021.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using model par-
allelism. arXiv preprint arXiv:1909.08053, 2019.
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared
Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deep-
speed and megatron to train megatron-turing nlg 530b, a large-scale generative language model.
arXiv preprint arXiv:2201.11990, 2022.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam
Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the
imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint
arXiv:2206.04615, 2022.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep
learning in NLP. In Proceedings of the 57th Conference of the Association for Computational
Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp.
3645–3650. Association for Computational Linguistics, 2019.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer
with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question
answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, 2019.
Chaofan Tao, Lu Hou, Wei Zhang, Lifeng Shang, Xin Jiang, Qun Liu, Ping Luo, and Ngai Wong.
Compression of generative pre-trained language models via quantization. In Proceedings of the
60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 4821–4836, 2022.
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze
Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog
applications. arXiv preprint arXiv:2201.08239, 2022.
Denis Timonin, Bo Yang Hsueh, and Vinh Nguyen. Accelerated inference for large transformer
models using nvidia triton inference server. NVIDIA blog, 2022.
Leslie G Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):
103–111, 1990.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural informa-
tion processing systems, 30, 2017.
David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. Entity, relation, and event
extraction with contextualized span representations. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), pp. 5784–5789, 2019.
C. Walker and Linguistic Data Consortium. ACE 2005 Multilingual Training Corpus. Linguistic
Data Consortium, 2005. ISBN 9781585633760.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer
Levy, and Samuel R. Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Lan-
guage Understanding Systems. In NeurIPS 2019, pp. 3261–3275, 2019.
Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language
Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
Chenguang Wang, Xiao Liu, Zui Chen, Haoyun Hong, Jie Tang, and Dawn Song. Deepstruct:
Pretraining of language models for structure prediction. In Findings of the Association for Com-
putational Linguistics: ACL 2022, pp. 803–823, 2022a.
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet:
Scaling transformers to 1,000 layers. arXiv preprint arXiv:2203.00555, 2022b.
Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng, Jun-
yuan Shang, Yanbin Zhao, Chao Pang, et al. Ernie 3.0 titan: Exploring larger-scale knowledge en-
hanced pre-training for language understanding and generation. arXiv preprint arXiv:2112.12731,
2021.
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-
attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neu-
ral Information Processing Systems, 33:5776–5788, 2020.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Rationale-
augmented ensembles in language models. arXiv preprint arXiv:2207.00747, 2022c.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, An-
drew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International
Conference on Learning Representations, 2022a.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yo-
gatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language
models. arXiv preprint arXiv:2206.07682, 2022b.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny
Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint
arXiv:2201.11903, 2022c.
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang,
Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm
from language models. arXiv preprint arXiv:2112.04359, 2021.
Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong
Zhu, Jiangang Luo, Liang Xu, et al. Yuan 1.0: Large-scale pre-trained language model in zero-
shot and few-shot learning. arXiv preprint arXiv:2110.04725, 2021.
Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning—a
comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis
and machine intelligence, 41(9):2251–2265, 2018.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang,
Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture.
In International Conference on Machine Learning, pp. 10524–10533. PMLR, 2020.
Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu,
Cong Yu, et al. Clue: A chinese language understanding evaluation benchmark. In Proceedings
of the 28th International Conference on Computational Linguistics, pp. 4762–4772, 2020.
Liang Xu, Xiaojing Lu, Chenyang Yuan, Xuanwei Zhang, Huilin Xu, Hu Yuan, Guoao Wei, Xiang
Pan, Xin Tian, Libo Qin, et al. Fewclue: A chinese few-shot learning evaluation benchmark.
arXiv preprint arXiv:2107.07498, 2021.
Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and
Jie Tang. Wudaocorpora: A super large-scale chinese corpora for pre-training language models.
AI Open, 2:65–68, 2021.
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8bert: Quantized 8bit bert. In
2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS
Edition (EMC2-NIPS), pp. 36–39. IEEE, 2019.
Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang,
Kaisheng Wang, Xiaoda Zhang, et al. PanGu-α: Large-scale autoregressive pretrained chinese
language models with auto-parallel computation. arXiv preprint arXiv:2104.12369, 2021.
Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, and Nicholas
Carlini. Counterfactual memorization in neural language models. CoRR, abs/2112.12938, 2021.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christo-
pher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer
language models. arXiv preprint arXiv:2205.01068, 2022.
Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. Position-
aware attention and supervised data improve slot filling. In EMNLP, pp. 35–45, 2017.
Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. “going on a vacation” takes longer than
“going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3363–3369, 2019.
Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix X. Yu, and
Sanjiv Kumar. Modifying memories in transformer models. CoRR, abs/2012.00363, 2020.
Part I
Appendix
Table of Contents

A Ethics: Evaluation on Biases and Toxicity
  A.1 Bias Measurement: CrowS-Pairs
  A.2 Bias Measurement: StereoSet
  A.3 Hate Speech Detection: ETHOS
  A.4 Toxic Generation: RealToxicPrompts

B Technical Details
  B.1 Tokenization
  B.2 Layer Normalization
  B.3 Positional Encoding and Feed-forward Network
  B.4 Pipeline Parallel Analysis
  B.5 Inference Acceleration
  B.6 Activation Outlier Analysis
  B.7 Weight Quantization
  B.8 Quantization Settings
  B.9 Ablation on Contribution Attribution
  B.10 Lessons Learned

E Contributions
  E.1 Preparation
  E.2 Model Training
  E.3 Post Training
  E.4 Project Management
  E.5 Computation Sponsor

G Broader Impact
  G.1 Impact on AI Research
  G.2 Impact on Individual Developers and Small Companies
H Environmental Impact
A.2 BIAS MEASUREMENT: STEREOSET

Another widely used bias and stereotype evaluation benchmark is StereoSet (Nadeem et al., 2021), which is also adopted in (Lieber et al., 2021; Artetxe et al., 2021; Zhang et al., 2022). To balance the evaluation between bias detection and language modeling quality, StereoSet reports a series of metrics: the Language Modeling Score (LMS), the Stereotype Score (SS), and the Idealized Context Association Test score (ICAT) as an overall averaged metric. For example, given the premise “She
is the twin’s mother”, StereoSet provides three candidate hypotheses: 1) “the water is deep”, 2) “she is a lazy, unkind person”, and 3) “she is a kind, caring woman”. The first option serves as a distractor that tests the model’s language capability and is used to calculate LMS; the second and third statements are anti-stereotypical and stereotypical, respectively, and are used to calculate SS. A widely adopted technique here is to calibrate the likelihood of an option according to its length (Lieber et al., 2021; Zhang et al., 2022), since the distractor is particularly short.
Following (Zhang et al., 2022), we normalize scores over tokens rather than over characters (Lieber et al., 2021) when computing model predictions for these metrics. The results are shown in Table 6. GLM-130B outperforms GPT-3 Davinci and OPT-175B on all metrics. These results align with our findings in the language modeling experiments and the CrowS-Pairs bias evaluation: GLM-130B performs well in both language modeling quality and social fairness.
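The length calibration mentioned above can be sketched as follows. This is a toy illustration with made-up token log-probabilities, not the actual evaluation code: averaging the log-likelihood over the number of tokens keeps a short distractor from being favored merely for its length.

```python
def choose_option(option_token_logprobs):
    """Pick the option with the highest per-token average log-likelihood.

    `option_token_logprobs` maps each candidate option to the list of
    token-level log-probabilities the model assigns to it. Dividing by
    the token count calibrates for option length.
    """
    scores = {
        opt: sum(lps) / len(lps) for opt, lps in option_token_logprobs.items()
    }
    return max(scores, key=scores.get)

# Toy example: the short distractor has a higher *total* likelihood
# (-9.0 vs -12.0), but the calibrated per-token score (-2.25 vs -1.5)
# prefers the longer option.
options = {
    "the water is deep": [-2.0, -2.5, -2.5, -2.0],
    "she is a kind, caring woman": [-1.5] * 8,
}
print(choose_option(options))  # -> "she is a kind, caring woman"
```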
Table 6: StereoSet (Nadeem et al., 2021) Bias Measurement with LMS (↑), SS (↓), and ICAT (↑).
A.3 HATE SPEECH DETECTION: ETHOS

Social media corpora may contain hate speech, and it is crucial to investigate to what extent LLMs can recognize and help identify it. We adopt the ETHOS dataset, originally proposed in (Mollas et al., 2020), to detect sexist and racist speech, using the zero-shot and few-shot subsets created by (Chiu & Alexander, 2021). GPT-3 Davinci (a publicly accessible variant of GPT-3 175B) and OPT-175B are also tested on this benchmark (with results reported in (Zhang et al., 2022)). For binary classification, including Zero-shot, One-shot, and Few-shot (binary) (which answers “yes” or “no”), we report binary F1; for multiclass classification (which answers “yes”, “no”, or “neither”), we report micro F1. We adopt almost the same prompts as in (Chiu & Alexander, 2021), except that we align the Few-shot (binary) prompt with the form used in One-shot and add the word “Classification” before the colon in the original Few-shot (multiclass) prompt.
Results are shown in Table 7. We find that GLM-130B outperforms the other two LLMs in all four settings. On one hand, GLM-130B’s pre-training over diverse unsupervised corpora from online forums and social media, including sections such as “hackernews”, “stackexchange”, and “pile_cc”, can endow the model with the background knowledge needed to identify such speech. On the other hand, the MIP training may also improve GLM-130B’s zero-shot and few-shot capabilities.

Table 7: ETHOS (Mollas et al., 2020) hate speech detection. “(bi)” and “(mul)” denote binary and multiclass classification, respectively. All scores are F1; higher is better.

                   GPT-3   OPT-175B   GLM-130B
Zero-shot          62.8    66.7       68.8
One-shot           61.6    71.3       79.1
Few-shot (bi)      35.4    75.9       79.7
Few-shot (mul)     67.2    81.2       85.8
Figure 10: Handling training collapses and instability is the first priority when training LLMs.
B TECHNICAL DETAILS
In this section, we introduce additional details about the technical issues we identified and solved throughout GLM-130B’s training. Together with concurrent open-source LLM efforts, we believe these published details can serve as a cornerstone for future LLM training.
B.1 TOKENIZATION
To tokenize the corpus, we implement a text tokenizer based on the icetk package with several adjustments. As an image-text unified tokenizer, icetk has a vocabulary size of 150,000: the first 20,000 tokens are image tokens and the rest are text tokens. The text tokenizer of icetk is built and trained with sentencepiece (https://github.com/google/sentencepiece) on a 25GB bilingual corpus evenly split between English and Chinese content. We divide the tokens recognized by the tokenizer into four categories. The common tokens occupy No. 20000 to No. 20099, consisting of punctuation, numbers, and spaces without extended definitions. No. 20100 to No. 83822 are English tokens and No. 83823 to
No. 145653 are Chinese tokens. Tokens after No. 145653 are other special tokens, including concatenated punctuation, pieces from other languages, etc.

In our implementation, we ignore the first 20,000 image tokens and use only the latter 130,000 tokens for text tokenization. We disable icetk’s default behavior of ignoring linebreaks, so that the linebreak mark \n is tokenized into token No. 20004, <n>. On top of the inherent tokens, we add the special tokens [MASK] and [gMASK] for model prediction, and the special tokens <sop>, <eop>, and <eos> for sentence and passage separation.
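The token-id ranges above can be captured in a small helper. This is an illustrative sketch: the boundary constants follow the text of this section, not icetk’s actual API.

```python
def token_category(token_id: int) -> str:
    """Classify an icetk token id by the ranges described above.

    Boundaries (20000 / 20100 / 83823 / 145654) are taken from the text
    and are illustrative; they may differ in other icetk versions.
    """
    if token_id < 20000:
        return "image"
    if token_id < 20100:
        return "common"   # punctuation, numbers, spaces
    if token_id < 83823:
        return "english"
    if token_id < 145654:
        return "chinese"
    return "special"      # concatenated punctuation, other languages, [MASK], ...

print(token_category(20004))   # -> "common" (the <n> linebreak token)
print(token_category(150000))  # -> "special"
```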
B.2 LAYER NORMALIZATION

Here we briefly review the history of layer normalization in language modeling, how its variants perform in recent LLMs, and our experiments with them on GLM-130B.
Post-LN (Vaswani et al., 2017). Post-LN was proposed jointly with the transformer architecture and is placed between the residual blocks. It was then adopted by BERT (Devlin et al., 2019) for bidirectional language model pre-training. However, Post-LN was later found to cause slow and unstable convergence in transformers (Xiong et al., 2020), and Pre-LN emerged as a substitute.
Pre-LN (Xiong et al., 2020). In contrast, Pre-LN is located inside the residual blocks to reduce exploding gradients, and it has become dominant in existing language models, including all recent LLMs. However, OPT-175B (Zhang et al., 2022), BLOOM (Scao et al., 2022), and the text-to-image model CogView (Ding et al., 2021) later observed that Pre-LN still cannot stabilize training when models scale up to 100B parameters or encounter multi-modal data. This is also confirmed in GLM-130B’s preliminary experiments, where Pre-LN training consistently crashes in its early stage.
Additionally, another problem rooted in Pre-LN transformers is that they may yield worse performance after fine-tuning compared to Post-LN, as observed in (He et al., 2021).
Sandwich-LN (Ding et al., 2021). As a remedy, CogView (and later Normformer (Shleifer et al., 2021)) develops Sandwich-LN on top of Pre-LN, appending an extra normalization to the end of each residual branch. Together with the PB-Relax (Precision-Bottleneck Relaxation) techniques, it stabilizes the training of a 4-billion-parameter text-to-image generation model. Despite its superiority over Pre-LN, Sandwich-LN also proves to collapse in GLM-130B’s training, let alone the potentially weaker tuning performance caused by its Pre-LN nature.
B.3 POSITIONAL ENCODING AND FEED-FORWARD NETWORK

Positional Encoding. The vanilla transformer adopts absolute (i.e., sinusoidal) position encoding, which later evolved into relative positional encoding (Dai et al., 2019). Relative PEs capture word relevance better than absolute position encoding. Rotary Positional Embedding (RoPE) (Su et al., 2021) is a relative position encoding implemented in the form of absolute position encoding; its core idea is shown in the following equation.
(R_m q)^⊤ (R_n k) = q^⊤ R_m^⊤ R_n k = q^⊤ R_{n−m} k    (1)
The product of q at position m and k at position n depends only on their distance n − m, which reflects the relativity of the position encoding. The matrix R in the above equation is defined as
R_{θ,m}^d = [ cos mθ_1  −sin mθ_1  0          0           ···  0              0
              sin mθ_1   cos mθ_1  0          0           ···  0              0
              0           0        cos mθ_2  −sin mθ_2    ···  0              0
              0           0        sin mθ_2   cos mθ_2    ···  0              0
              ⋮           ⋮        ⋮          ⋮           ⋱    ⋮              ⋮
              0           0        0          0           ···  cos mθ_{d/2}  −sin mθ_{d/2}
              0           0        0          0           ···  sin mθ_{d/2}   cos mθ_{d/2} ]    (2)
To allow its value to decay as the distance increases, θ takes the values

θ_i = 10000^{−2(i−1)/d},  i ∈ {1, 2, · · · , d/2}    (3)
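Equations (1)-(3) can be checked with a minimal NumPy sketch (for illustration only, not the GLM-130B implementation): each pair of dimensions (2i, 2i+1) is rotated by angle m·θ_i, and the inner product of a rotated query and key depends only on the relative distance n − m.

```python
import numpy as np

def rope_rotate(x: np.ndarray, m: int, theta_base: float = 10000.0) -> np.ndarray:
    """Apply the block-diagonal rotation R_{theta,m} of Eq. (2) to vector x."""
    d = x.shape[-1]
    i = np.arange(1, d // 2 + 1)
    theta = theta_base ** (-2.0 * (i - 1) / d)   # Eq. (3)
    angles = m * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# The inner product depends only on the relative distance n - m (Eq. 1):
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
a = rope_rotate(q, 3) @ rope_rotate(k, 7)    # positions 3 and 7, distance 4
b = rope_rotate(q, 10) @ rope_rotate(k, 14)  # positions 10 and 14, same distance
assert np.allclose(a, b)
```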
A two-dimensional absolute position encoding was proposed in the vanilla GLM for modeling both intra- and inter-span position information. In GLM-130B, we instead return to conventional one-dimensional positional encoding. We originally thought that the two-dimensional form could not be directly applied to RoPE (we later found instructions for implementing two-dimensional RoPE on the author’s blog, https://kexue.fm/archives/8397, but our training had already proceeded for weeks). As a substitute, in GLM-130B we simply remove the second dimension used in the original GLM, since we find that the unidirectional attention mask sub-matrices for [MASK] generation indicate the token order as well. This observation leads us to transform GLM-130B’s positional encoding into a one-dimensional one according to the following strategies:

• For sequences corrupted by short spans, we discard the second-dimensional position encoding.

• For sequences corrupted by a long span at the end, we change the positional ids to one-dimensional 0, 1, · · · , s − 1, and generated tokens simply prolong the first-dimensional positional encoding from the last context token s − 1.
Feed-forward Network. Some recent efforts to improve the transformer architecture focus on the FFN, including replacing it with a GLU (as adopted in PaLM). Research shows that GLUs can improve model performance, which is consistent with our experimental results (Cf. Table 8). Specifically, we use a GLU with the GeLU (Hendrycks & Gimpel, 2016) activation:

FFN_GeGLU(x; W_1, V, W_2) = (GeLU(xW_1) ⊗ xV) W_2    (4)

To keep the same number of parameters as the vanilla FFN, the feed-forward size d_ffn (usually 4d_H, where d_H is the hidden dimension) is reduced to (8/3)d_H, since V is additionally introduced.
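Equation (4) and the parameter-count argument can be sketched in NumPy (an illustration under our own toy dimensions, with biases omitted): with d_ffn = (8/3)d_H, the three GeGLU matrices hold the same 8·d_H² parameters as a vanilla FFN’s two 4d_H × d_H matrices.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn_geglu(x, W1, V, W2):
    """FFN_GeGLU(x) = (GeLU(x W1) * (x V)) W2, as in Eq. (4)."""
    return (gelu(x @ W1) * (x @ V)) @ W2

d_h = 12
d_ffn = 8 * d_h // 3   # 32; keeps parameter count equal to a vanilla 4*d_h FFN
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d_h))
W1, V = rng.standard_normal((d_h, d_ffn)), rng.standard_normal((d_h, d_ffn))
W2 = rng.standard_normal((d_ffn, d_h))
print(ffn_geglu(x, W1, V, W2).shape)  # -> (2, 12)
```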
Ablation Study on PE and FFN. To validate our PE and FFN choices, we pre-train GLM_Base (110M) over a random 50GB Chinese and English mixed corpus. We compare absolute PE with two recently popular relative PE variants, RoPE (Su et al., 2021) and ALiBi (Press et al., 2021). For the FFN, we compare the vanilla FFN with the Gated Linear Unit with GeLU activation (GeGLU). Results in Table 8 show that both ALiBi and RoPE improve perplexity on the test set, with a larger improvement from RoPE, and that GeGLU further improves the model’s performance.

Table 8: Ablation study for PE and FFN on GLM_Base.

Model              Test PPL
GLM_Base           24.58
+ ALiBi            24.14
+ RoPE             22.95
+ RoPE + GeGLU     22.31
B.4 PIPELINE PARALLEL ANALYSIS

In pipeline parallelism, each stage consists of three operations (Cf. Figure 11(a)): forward (denoted as F), backward (denoted as B), and optimizer step (denoted as U). However, a naive sequential pipeline implementation leads to an unbearable amount of bubbles. The improved GPipe (Huang et al., 2019) strategy (Cf. Figure 11(b)) reduces bubbles drastically by splitting data into micro-batches; the more micro-batches there are, the more stages can compute simultaneously in an iteration. The recent PipeDream-Flush (Narayanan et al., 2021) (Cf. Figure 11(c)) additionally optimizes GPU memory usage by interleaving forward and backward passes from different stages, reducing the memory occupied by forward activations.
We analyze the bubble share in GLM-130B’s pre-training by assuming that the number of pipeline stages is p, the number of micro-batches is m, and the forward and backward times per micro-batch are t_f and t_b. In the ideal case, forward and backward take t_ideal = m(t_f + t_b). In practice, however, the default pipeline schedule causes p − 1 forward and p − 1 backward bubbles, respectively, for a total of t_bubble = (p − 1)(t_f + t_b), so the bubble occupancy is

bubble-ratio = t_bubble / (t_ideal + t_bubble) = (p − 1) / (m + p − 1)    (5)
For larger numbers of micro-batches, the bubble percentage is reduced to an acceptable level. In particular, experiments in GPipe (Huang et al., 2019) show that when m ≥ 4p, the total percentage
[Figure 11: Pipeline schedules over time, one row per GPU, with forward (F), backward (B), and optimizer step (U) operations: (a) naive sequential pipeline with large bubble time, (b) GPipe with micro-batches, (c) PipeDream-Flush with interleaved forward and backward passes.]
of pipeline bubble time is reduced to a negligible level, thanks to the forward recomputation in backpropagation, which allows some overlap of computation and communication. This shows that the bubbles introduced by pipeline model parallelism do not seriously harm training efficiency.
In general, to make full use of the hardware, it is common to place models into model parallel groups spanning multiple nodes and to use the full memory of each node. In this case, we can freely adjust the ratio of pipeline model parallelism to tensor model parallelism. Since data parallelism hardly affects the computation time, assume the data parallelism scale is d = 1, the total number of nodes is n, the tensor model parallelism scale is t, and the pipeline model parallelism scale is p, with n = t × p. The bubble share in this case is

bubble-ratio = (n/t − 1) / (m + n/t − 1)    (6)
From the above equation, we can see that increasing the tensor parallelism size further reduces the bubble ratio. However, the tensor parallelism scale cannot be increased indefinitely: beyond a certain threshold, it reduces the computational granularity and greatly increases the communication cost. Therefore, we conclude that the tensor model parallelism size should grow slowly with model size, but should not exceed the number of GPUs in a single machine. In GLM-130B’s training, experiments show that the optimal tensor parallelism scale is t = 4, rather than t = 8, on the DGX-A100 system. With the other parameters m = 176 and p = 8, the bubble share is only 3.8%, which is sufficient to demonstrate the efficiency of pipeline model parallelism.
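Equations (5) and (6) can be verified with a few lines of Python; the numbers below reproduce the 3.8% bubble share quoted for m = 176 and p = 8 (the n = 32 loop is an illustrative assumption, not a configuration from the paper).

```python
def bubble_ratio(m: int, p: int) -> float:
    """Pipeline bubble share from Eq. (5): (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)

# GLM-130B's setting: m = 176 micro-batches, p = 8 pipeline stages.
print(f"{bubble_ratio(176, 8):.1%}")  # -> 3.8%

# Eq. (6): with n nodes and tensor-parallel size t, p = n / t, so a larger
# t shrinks the bubble. Hypothetical example with n = 32 nodes, m = 176:
for t in (1, 2, 4, 8):
    print(t, round(bubble_ratio(176, 32 // t), 3))
```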
Table 9: Decoding speed in real trials of BLOOM-176B (Scao et al., 2022) (from Huggingface Transformers) and GLM-130B’s implementation, both in 16-bit precision on 8 × A100 (80G).
Figure 12: Distribution of outliers in GLM-130B’s activations. The vertical axis denotes the hidden state dimensions (4,096 rather than 12,288, as this is a parallel segment), and the horizontal axis denotes tokens in an input sentence. We use a 128×128 2D histogram to better view the distribution of outliers. The figure on the right swaps some of the vertical coordinates so that it can be clearly seen that outliers occur in about 30% of the dimensions.
B.5 INFERENCE ACCELERATION

A model’s plain PyTorch implementation is easy to read and run, but it can be intolerably slow for LLMs. Based on NVIDIA’s FasterTransformer (https://github.com/NVIDIA/FasterTransformer), we spent two months implementing GLM-130B in C++ to speed up inference, including the following main optimizations:
• Optimize time-consuming operations such as GeGLU, layer normalization, and SoftMax.
• Reduce the number of GPU kernel calls (e.g., fuse multi-head attention into one computation kernel).
• Select the best-performing algorithm when calling cuBLAS.
• Improve computing efficiency by transposing the model parameters in advance.
• Use half2 in FP16 computation to double half’s access bandwidth and computing throughput.
We currently package the full FasterTransformer implementation for GLM-130B into a plug-and-play docker image for users’ convenience, and we are still working on adapting it to our PyTorch implementation so that it requires only a one-line code change. A comparison between our accelerated GLM-130B implementation and the currently available BLOOM-176B implementation in Huggingface Transformers (https://huggingface.co/docs/transformers/model_doc/bloom) is shown in Table 9. Our implementation of GLM-130B is 7.0 to 8.4 times faster than BLOOM-176B’s PyTorch implementation. Such efforts to accelerate LLMs to a tolerable response speed are crucial to their popularization.
B.6 ACTIVATION OUTLIER ANALYSIS

As described in prior sections, GLM-130B’s weights can be quantized into INT4 to drastically cut down parameter redundancy at inference time. However, we also find that GLM-130B’s activations (i.e., the hidden states between layers) cannot be properly quantized, as they contain value outliers, as is also suggested in concurrent literature (Dettmers et al., 2022).

What is special in GLM-130B is that about 30% of its dimensions may present value outliers (Cf. Figure 12), while other GPT-based LLMs (e.g., OPT-175B and BLOOM-176B) have only very few outlying dimensions (Dettmers et al., 2022). Therefore, the solution of decomposing the matrix multiplication for higher-precision computation in the outlying dimensions, proposed in (Dettmers et al., 2022), is not applicable to GLM-130B.
We study whether these outliers can be ignored in LLM quantization, and the answer is, interestingly, “no”. These values can be several orders of magnitude larger than ordinary activation values (Cf. Figure 13). While most values (accounting for 99.98% of the dimensions in a hidden state) stay below 6, the two outlying dimensions can reach 50 or even over 100. We speculate that they are important clues for GLM-130B, and potentially other LLMs, to memorize certain fixed world or language knowledge, so removing or omitting them in quantization leads to significant performance degradation.

Figure 13: GLM-130B’s activation outliers’ absolute value scale.
B.7 WEIGHT QUANTIZATION

B.7.1 PRELIMINARIES
Absmax Quantization is a symmetric quantization that maps the range [−absmax(x), absmax(x)] to [−(2^{b−1} − 1), 2^{b−1} − 1] for x:

s_x = absmax(x) / (2^{b−1} − 1)    (7)
x_q = round(x / s_x)    (8)

where s_x is the scaling factor, x_q is the quantization result, and b is the bit width.
Zeropoint Quantization is an asymmetric quantization that maps the range [min(x), max(x)] to [−(2^{b−1} − 1), 2^{b−1} − 1]:

s_x = (max(x) − min(x)) / (2^b − 2)    (9)
z_x = round(min(x) / s_x) + 2^{b−1} − 1    (10)
x_q = round(x / s_x) − z_x    (11)

where z_x is the zero point.
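A minimal NumPy sketch of the two schemes, following Eqs. (7)-(11) on a whole tensor (for illustration only; the toy input is our own):

```python
import numpy as np

def absmax_quantize(x: np.ndarray, b: int = 8):
    """Symmetric quantization, Eqs. (7)-(8). Dequantize with xq * s."""
    s = np.abs(x).max() / (2 ** (b - 1) - 1)
    xq = np.round(x / s)
    return xq, s

def zeropoint_quantize(x: np.ndarray, b: int = 8):
    """Asymmetric quantization, Eqs. (9)-(11). Dequantize with (xq + z) * s."""
    s = (x.max() - x.min()) / (2 ** b - 2)
    z = np.round(x.min() / s) + 2 ** (b - 1) - 1
    xq = np.round(x / s) - z
    return xq, s, z

x = np.array([-1.0, -0.25, 0.5, 2.0])
xq, s = absmax_quantize(x)
print(np.max(np.abs(xq * s - x)))  # rounding error, at most s / 2
```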
Col/Row-wise Quantization. Using a single scaling factor for a whole weight matrix often leads to large quantization errors, because a single outlier reduces the quantization precision of all other elements. A common workaround is to group the weight matrix by rows or by columns, with each group quantized separately using its own scaling factor.
B.8 QUANTIZATION SETTINGS

Our goal is to save as much GPU memory as possible without hurting model performance. In practice, we quantize only the linear layers, which account for most of the transformer parameters, and leave the input/output embeddings, layer normalizations, and bias terms unchanged. At INT4 precision, two INT4 weights are packed into one INT8 weight to save GPU memory. We adopt absmax quantization, since we find it sufficient to maintain model performance and it is more computationally efficient than zeropoint quantization. During inference, only the quantized weights are stored in GPU memory; the FP16 weights of the linear layers are dequantized at runtime.
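The packing described above (two INT4 values per INT8) can be sketched in NumPy as follows. This is an illustrative reimplementation of the row-wise absmax scheme from the preliminaries, not GLM-130B’s CUDA kernels: each row gets its own FP16 scale, values are rounded into [−7, 7], and adjacent nibble pairs are packed into one int8.

```python
import numpy as np

def quantize_int4_rowwise(w: np.ndarray):
    """Row-wise absmax INT4 quantization, two nibbles packed per int8."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # 2^(4-1) - 1 = 7
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    lo, hi = q[:, 0::2], q[:, 1::2]
    packed = ((hi.astype(np.uint8) << 4) | (lo.astype(np.uint8) & 0x0F)).astype(np.int8)
    return packed, scales.astype(np.float16)

def dequantize_int4_rowwise(packed: np.ndarray, scales: np.ndarray):
    """Unpack nibbles (sign-extending) and rescale back to float."""
    lo = (packed << 4 >> 4).astype(np.float32)  # arithmetic shifts sign-extend
    hi = (packed >> 4).astype(np.float32)
    w = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.float32)
    w[:, 0::2], w[:, 1::2] = lo, hi
    return w * scales.astype(np.float32)

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
packed, scales = quantize_int4_rowwise(w)
print(packed.nbytes, "bytes for", w.size, "weights")  # -> 16 bytes for 32 weights
```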
Table 10: Accuracy on LAMBADA dataset for GLM and BLOOM family at 100M to 176B scales
across different quantization precision.
[Bar chart comparing GLM (uni), GLM (bi), and GLM + MIP (bi) on LAMBADA, MMLU, WiC, ReCoRD, Hellaswag, WSC, BoolQ, and ANLI R1.]
Figure 14: Contribution attribution analysis of the GLM objective and MIP training. We take GLM-10B (English only) as an example in this ablation. Generally, the GLM objective’s bidirectional attention accounts for 70% of the improvements, while MIP’s major contribution lies in text similarity tasks.
To achieve INT4 weight quantization, we analyze the weight value distributions of the major linear layers in GLM-130B and its counterpart BLOOM-176B in histograms (Cf. Figure 15). The horizontal axis denotes the weight value, and the vertical axis denotes the number of weights with that value, on a log scale. As we can see, it is mainly the w2 linear layers in BLOOM-176B that present skewed distributions, which hinders symmetric quantization. In contrast, GLM-130B’s w2 is well shaped, without many outliers or a skewed distribution, which paves the way for its INT4 quantization with little performance loss.
exact dataset setting in T0 (Sanh et al., 2022) and the information extraction datasets in GLM-130B
to allow the correct evaluation on some types of tasks (e.g., NLI).
Figure 14 shows the ablation results. On the 8 datasets we test, we find that the GLM objective is the major contributor to the improvement (from GLM (uni) to GLM + MIP (bi)). For example, it accounts for 73% of the improvement on LAMBADA and 90% on MMLU, two widely adopted challenging benchmarks for LLMs. As for MIP, on some datasets (e.g., WiC, ReCoRD, Hellaswag) it may even harm performance, while on datasets related to text similarity and coreference (e.g., WSC, BoolQ, ANLI R1) MIP is the main contributor. This is likely because text similarity and coreference challenges, which people usually construct intentionally to test language models’ abilities, are seldom seen in the self-supervised corpus of people’s everyday written text. Thus, MIP training mainly helps to bridge the gap between self-supervised pre-training and these tasks.
B.10 LESSONS LEARNED

Lesson 2 (Platform-aware Configuration). Configure LLMs based on the cluster and parallel strategy used to squeeze hardware potential.
Lesson 5 (Systematical Instability: FP16). Though FP16 induces more instability, it enables
training and inference on diverse platforms.
Lesson 7 (GLM’s INT4 Quantization Scaling Law). GLM has a unique INT4 weight quan-
tization scaling law unobserved in GPT-style BLOOM.
Lesson 8 (Future Direction). To create powerful LLMs, the main focus can be on 1) more and
better data, 2) better architectures and pre-training objectives, and 3) more sufficient training.
Figure 15: Weight value distributions of linear layers in GLM-130B (in orange: attn-dense, attn-qkv, glu-w1, glu-w2) and BLOOM-176B (in blue: attn-dense, attn-qkv, ffn-w1, ffn-w2) across the first 28 transformer layers. Generally, for GLM-130B it is attn-dense and w2 that present narrow value distributions; attn-qkv and w1 may also contribute to enabling INT4 quantization in the middle layers of GLM-130B.
Following the practices in (Raffel et al., 2020; Wei et al., 2022a; Sanh et al., 2022; Aribandi et al., 2022), we include a number of prompted instruction datasets in GLM-130B’s MIP training, which account for 5% of the training tokens. All prompts for T0 datasets are from PromptSource (Bach et al., 2022), and prompts for DeepStruct datasets are newly created. Their composition is shown in Table 12, which comprises natural language understanding and generation datasets from T0 (Sanh et al., 2022) and PromptSource (Bach et al., 2022), and information extraction datasets from DeepStruct (Wang et al., 2022a). In GLM-130B’s training, we calculate that approximately 36% of the samples in each dataset have been seen.
T0 originally splits its datasets into two sections: 1) multi-task prompted training and 2) zero-shot
task transfer. We initially planned to include only the training sets of T0's multi-task prompted
training section and DeepStruct (Wang et al., 2022a), but by mistake we included the datasets of both
the multi-task prompted training and zero-shot task transfer sections in MIP and excluded the
DeepStruct datasets. The mistake was fixed at around 23k steps, and our model continued training
on the correct version.
Natural Language Understanding and Generation. We adopt datasets and corresponding
prompts from PromptSource (Bach et al., 2022). For the prompted samples in each dataset, we
truncate to a maximum of 100,000 samples per dataset and combine them together as the MIP
dataset. Details of the prompted samples and datasets are provided in PromptSource's GitHub
repository8.
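The truncate-and-pool step described above can be sketched as follows; the function name and the toy dataset contents are illustrative assumptions, not the paper's actual implementation.

```python
import random

def build_mip_mixture(datasets, cap=100_000, seed=0):
    """Truncate each prompted dataset to at most `cap` samples, then pool.

    `datasets` maps dataset name -> list of prompted samples.
    """
    rng = random.Random(seed)
    pool = []
    for samples in datasets.values():
        pool.extend(samples[:cap])  # per-dataset truncation
    rng.shuffle(pool)               # mix datasets for multi-task training
    return pool

# toy usage with a cap of 2
mix = build_mip_mixture({"boolq": ["q1", "q2", "q3"], "anli": ["a1"]}, cap=2)
print(len(mix))  # 3 (2 kept from boolq, 1 from anli)
```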
Information Extraction. Based on the datasets from DeepStruct (Wang et al., 2022a), a multi-
task language model pre-training approach for information extraction tasks, we create instructions
and prompts for part of its datasets (as shown in Table 12). We reformulate information extraction
tasks into instruction-tuning formats to allow zero-shot generalization to new extraction schemas.
For the prompted samples in each dataset, we truncate to a maximum of 200,000 samples per dataset,
as there are fewer information extraction datasets than common language understanding and generation
ones. For the KELM (Agarwal et al., 2021) and PropBank (Kingsbury & Palmer) datasets, since their
original sizes are gigantic, we sample 500,000 samples for each of them from their prompted samples.
Prompts and instructions for all datasets in DeepStruct (Wang et al., 2022a) are newly created
manually by the authors. The introduction, task description, and full prompts for each dataset are
attached in the following sections. To allow template infilling, all prompts are written as Jinja9
templates. When a dataset sample is provided in our format, the Jinja engine renders it into a
prompted sample with an instruction.
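As a minimal sketch of the template-infilling idea, the snippet below uses Python's built-in `str.format` in place of the Jinja engine; the template text and slot names are illustrative assumptions, not the paper's actual prompts.

```python
# A stand-in for Jinja-style template infilling using str.format.
TEMPLATE = ('Please extract all triples with the relation "{relation}" '
            'from the following text:\n{text}')

def render(sample, template=TEMPLATE):
    """Fill the template's slots from a dict of field values."""
    return template.format(**sample)

prompt = render({"text": "London is the capital of Britain.",
                 "relation": "capital_of"})
print(prompt)
```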
A more systematic evaluation of GLM-130B's information extraction ability is left for future
work, as the focus of this work is on the training and design details of an LLM.
We adopt the MultiWOZ 2.1 (Eric et al., 2020) dialogue state tracking dataset. The dataset is
reformulated into two tasks, each with one corresponding prompt:
• Dialogue state tracking: asks the model to extract information from dialogues given a list
of certain slots, e.g., taxi_arrival_time and destination.
• Slot filling: the model should fill in one provided slot and identify situations without an answer.
8 https://github.com/bigscience-workshop/promptsource
9 https://github.com/pallets/jinja
We adopt the ACE05 (Walker & Consortium, 2005) event extraction dataset following the setting
in (Wadden et al., 2019). The dataset is reformulated into two tasks with three prompts as follows:
• Event Argument Extraction: given a trigger in text and a list of its argument roles, the model is
asked to extract the arguments from the provided text.
• Argument Identification: given a trigger and a certain argument role, the model is asked to
extract the argument if it exists in the provided text; otherwise, the model should generate nothing.
- {{shuffle(allowed_arguments[trigger['event_type']].values()) | join("\n- ")}}
{{text}}
Please write down ALL event arguments related to the trigger "{{trigger['text']}} ({{allowed_triggers[trigger['event_type']]}})" marked with "[ ]", given the following categories:
- {{shuffle(allowed_arguments[trigger['event_type']].values()) | join("\n- ")}}
{{text}}

The argument should be (copy from the context if you find it; if not, do not generate): ||| {{filter_type(relations, query_arg) | join(" ")}}
Joint entity and relation extraction aims to recognize named entities in a piece of text and judge
the relationships between them. It is closely related to knowledge acquisition, where the ultimate
goal is to structure unstructured web content into knowledge triples (e.g., (London,
capital_of, Britain)). The task can be formulated as either a pipeline framework (a
combination of named entity recognition and relation extraction) or end-to-end training.
In this work, we adopt three classical joint entity and relation extraction datasets: CoNLL04 (Roth &
Yih, 2004), NYT (Riedel et al., 2010), and ACE2005 (Walker & Consortium, 2005). In GLM-130B,
we follow (Wang et al., 2022a) to formulate such challenges into sequence-to-sequence generation,
where our inputs are raw texts and outputs are triples. We only conduct relation-related tasks for
these datasets here, and leave the entity-related ones to the named entity recognition section.
• Relation Extraction: here we extract knowledge triples consisting of “head entity”, “relation”,
and “tail entity”, given a list of relation candidates. For example, given the input “In Kunming
the 800-some faculty and student established the National Southwestern Associated University.”,
the model output could be (National Southwestern Associated University,
location of formation, Kunming).
• Conditional Relation Extraction: given a single relation candidate, judge whether the input text
contains the relation. If so, extract all related triples; if not, do not generate.
• Knowledge Slot Filling: designate a certain entity from the text, and ask the model to extract all
triples that take the entity as the head.
• Relation Classification: given two entities from texts, ask the model to judge the relation between
them based on a list of candidate relations.
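A sketch of recovering triples from the model's generated text under the "(head, relation, tail)" convention above; the regex assumes entity and relation strings contain no parentheses or commas, which real outputs may violate.

```python
import re

# Matches "(head, relation, tail)" with no nested commas or parentheses.
TRIPLE_RE = re.compile(r"\(([^,()]+),([^,()]+),([^,()]+)\)")

def parse_triples(generated):
    """Extract (head, relation, tail) tuples from generated text."""
    return [tuple(part.strip() for part in m.groups())
            for m in TRIPLE_RE.finditer(generated)]

out = parse_triples("(National Southwestern Associated University, "
                    "location of formation, Kunming)")
print(out)
```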
Nevertheless, existing joint entity and relation extraction datasets have very limited relation schemas.
For example, CoNLL04 only contains five different relations; the most diverse, NYT, contains
24 Freebase predicates. To allow the model to capture a diverse range of potential verbalized
predicates, we extend the task with automatically generated knowledge-text aligned data from
KELM (Agarwal et al., 2021). We do not include other distantly supervised datasets (e.g., T-Rex (El-
sahar et al., 2018)) since they can be extremely noisy.
For KELM data, since it is based on the full Wikidata schema (which contains too many relations
to be enumerated), we create two KELM-specific prompts for the task of Relation Extraction and
Knowledge Slot Filling:
Named entity recognition is a task that targets identifying named entities in raw text and assigning
them proper entity types. For example, in the sentence "In 1916 GM was reincorporated
in Detroit as "General Motors Corporation".", General Motors Corporation could
be of entity type organization. We design two different types of tasks based on named entity
recognition datasets CoNLL03 (Sang & Meulder, 2003), OntoNotes 5.0 (Pradhan et al., 2013), and
GENIA (Ohta et al., 2002). We also include named entity recognition sub-tasks from joint entity
and relation datasets.
• Named Entity Recognition: given a certain list of possible entity types (e.g., location,
person, organization), extract all related entities from the provided text content.
• Entity Typing: one of the important derivative tasks of named entity recognition. It aims to
classify the correct type of an entity mention (given without its type), and is often appended to
entity mention extraction as post-processing.
please extract all mentioned entities from left to right in the sentence, in the form of "( X ; instance of ; Z )".
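The "( X ; instance of ; Z )" output convention above can be decoded with a small parser like the following sketch; splitting on ';' assumes neither the mention nor the type contains a semicolon.

```python
def parse_entities(generated):
    """Parse "( X ; instance of ; Z )" segments into (mention, type) pairs."""
    entities = []
    for chunk in generated.split(")"):
        parts = [p.strip() for p in chunk.strip().lstrip("(").split(";")]
        if len(parts) == 3 and parts[1] == "instance of":
            entities.append((parts[0], parts[2]))
    return entities

print(parse_entities("( General Motors Corporation ; instance of ; organization )"))
```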
Relation classification is a fundamental task in information extraction, which identifies the
relationship between two given entities from a list of candidates. The problem is a long-standing
one, as it suffers from the outrageous cost of data labeling: manual labeling of knowledge-intensive
tasks requires educated annotators, who charge high rates. A de facto data creation method in relation
extraction relies on distant supervision, which automatically aligns existing knowledge triples in
knowledge bases to text contents, and assumes that such alignments are correct under certain
conditions. Here we only include the TacRED (Zhang et al., 2017) dataset and create several
different tasks based on it.
• Relation Classification: the most traditional task formulation. Given two entities from the text,
classify their relation from a list of candidates. The answer can be either the relation directly or
a triple (similar to relation extraction).
• Knowledge Slot Filling: given the head entity and relation, identify whether the tail entity exists
in the input text. If not, generate nothing.
• Yes or No Question: turn the problem into a task similar to natural language inference. For
example, given the sentence "The series focuses on the life of Carnie Wilson, daughter of Brian
Wilson, founder of the Beach Boys.", the model is asked to judge the correctness of a triple
such as (Carnie Wilson, father, Brian Wilson) by answering "yes" or "no".
Semantic role labeling is a long-standing information extraction task that aims to identify the
semantic arguments related to a given predicate in a sentence. For example, given the sentence
"Grant was employed at IBM for 21 years where she held several executive positions." and the
predicate "employed" in it, semantic role labeling identifies Grant as the subject and IBM as the
second object.
We create two different tasks based on the semantic role labeling datasets CoNLL05 (Carreras &
Màrquez, 2005), CoNLL12 (Pradhan et al., 2013), and PropBank (Kingsbury & Palmer).
• Semantic Role Labeling: the traditional task form, where a verb (i.e., predicate) is annotated in
text and the model is asked to generate related semantic roles.
• Semantic Role Filling: given a verb and a potential semantic role, the model is asked to judge
whether the role exists in the sentence and, if so, generate it.
• Predicate Recognition: given a segment of a sentence and its corresponding semantic role, iden-
tify which verb it is related to.
Here we describe the result sources for GPT-3, BLOOM-176B, and OPT-175B. The other LLMs
we compare against are mostly completely closed-source; thus, their results are all taken from
existing preprints, publications, or the results stored in the BIG-bench repository10.
For GPT-3, while most of its results in this paper are taken from existing literature if not specified
otherwise, the rest, which are explicitly marked, were acquired via our own requests to the OpenAI
Davinci API. For BLOOM-176B and OPT-175B, unless specifically annotated, their results are:
• Taken from the OPT paper (Zhang et al., 2022).
• Taken from the EAI-Eval BigScience Arch&Scale - Google Sheet11 .
• Taken from BigScience evaluation results repository in Huggingface Datasets12 .
Specifically, we cannot evaluate OPT-175B ourselves, as we have not been officially granted the
checkpoint despite several applications in the past few months.
C.4 PILE TEST-SET EVALUATION

Pile evaluation (Gao et al., 2020) is a comprehensive language modeling benchmark which
originally includes 22 different text datasets from diverse domains. We report our results on the
18 datasets with previously reported baseline results (Lieber et al., 2021). Different from
traditional language modeling benchmarks, Pile evaluation reports BPB (bits-per-byte) perplexity
to avoid mismatched comparisons between models with different vocabularies, because in general,
a language model with a larger vocabulary will be favored in perplexity comparisons if not
restricted. In the evaluation, we strictly follow the setting in (Gao et al., 2020), leveraging
[gMASK] and a context length of 1,024 with bidirectional attention, and the remaining 1,024
tokens to calculate BPB in an autoregressive manner. The weighted average BPB is calculated
based on each shared dataset's ratio in the Pile training set (Gao et al., 2020).

Table 13: GLM-130B and its similar-sized LLMs' BPB results on the Pile test set.

Dataset              Jurassic-1   GPT-3   GLM-130B
dm_mathematics          1.040     1.370     0.786
ubuntu_irc              0.857     0.946     0.977
opensubtitles           0.879     0.932     0.889
hackernews              0.869     0.975     0.873
books3                  0.835     0.802     0.803
pile_cc                 0.669     0.698     0.771
philpapers              0.741     0.723     0.766
gutenberg_pg_19         0.890     1.160     0.821
arxiv                   0.680     0.838     0.570
stackexchange           0.655     0.773     0.611
nih_exporter            0.590     0.612     0.614
pubmed_abstracts        0.587     0.625     0.610
uspto_backgrounds       0.537     0.566     0.537
pubmed_central          0.579     0.690     0.510
freelaw                 0.514     0.612     0.499
github                  0.358     0.645     0.329
enron_emails            0.621     0.958     0.604
youtube_subtitles       0.825     0.815     0.746
Weighted Avg.           0.650     0.742     0.634
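The bits-per-byte conversion and the weighted average described above can be sketched as follows; the toy loss values are illustrative, not measured numbers.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Convert a summed token-level NLL (in nats) into bits-per-byte.

    Dividing by the UTF-8 byte count makes models with different
    vocabularies (and hence token counts) directly comparable.
    """
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

def weighted_avg_bpb(bpb_by_set, train_ratio):
    """Average subset BPBs, weighted by each subset's Pile training-set ratio."""
    total = sum(train_ratio[k] for k in bpb_by_set)
    return sum(bpb_by_set[k] * train_ratio[k] for k in bpb_by_set) / total

# toy example: 12 nats of total loss over an 8-byte string
print(round(bits_per_byte(12.0, "abcdefgh"), 3))  # 2.164
```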
The detailed metrics on the Pile test set are reported in Table 13. We observe that, compared to
GPT-3, GLM-130B has noticeably weaker performance on philpapers and pile_cc, likely because
of GLM-130B's bilingual nature and its lack of the more diverse, high-quality, privately collected
corpora.
10 https://github.com/google/BIG-bench
11 https://docs.google.com/spreadsheets/d/1CI8Q9RCblLRzUOPJ6ViqBmo284-8ojluQ-CmaEuhuv0
12 https://huggingface.co/datasets/bigscience/evaluation-results/tree/main/bloom/bloomzeval/transformers/evaluation_val
All results of GLM-130B and BLOOM-176B on the 57 MMLU (Hendrycks et al., 2021) datasets are
shown in Table 15. In Section 5.2, we report the weighted average accuracy (i.e., accuracy averaged
per sample rather than per discipline) of GLM-130B, GPT-3 175B, and BLOOM-176B.
Below is a prompted example with 1-shot priming. We predict the probabilities of ['A', 'B',
'C', 'D'] as the next token, and take the one with the maximal probability as the answer.
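A minimal sketch of this scoring rule; `logprobs` is an assumed dict of next-token log-probabilities, standing in for whatever interface the model actually exposes.

```python
def pick_choice(logprobs, choices=("A", "B", "C", "D")):
    """Return the answer letter with the highest next-token log-probability."""
    return max(choices, key=lambda c: logprobs.get(c, float("-inf")))

# toy log-probabilities for one MMLU question
print(pick_choice({"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.0}))  # B
```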
Here we elaborate on the prompts we use for the CLUE (Xu et al., 2020) and FewCLUE (Xu et al., 2021)
evaluation. On Chinese datasets, prompting meets some challenges, as Chinese texts are organized
by single characters rather than words, leading to verbalizers of unequal length in many cases.
Although dataset-specific calibration (Wang et al., 2021; Wu et al., 2021) can help mitigate the
issue, such a specialized technique can be complicated to implement. Our evaluation in this paper
adopts a simpler method leveraging GLM-130B's unique features. As GLM-130B is a bilingual
LLM with English MIP, we adopt English prompts and verbalizers from similar tasks in (Bach
et al., 2022) for Chinese dataset evaluation and find such a strategy to be quite effective. In terms
of evaluation metrics, except for DRCD and CMRC2018, two question answering datasets which
report EM, all datasets report accuracy.
Natural language generation, or here conditional natural language generation, refers to tasks that
require generating text based on given information, such as tables and documents. We evaluate
GLM-130B on data-to-text and summarization tasks. The datasets include WebNLG 2020 (Cas-
tro Ferreira et al., 2020), Clean E2E NLG (Dušek et al., 2019), and WikiLingua (Scialom et al.,
2020) from the GEM generation benchmark (Gehrmann et al., 2021). We use the full test sets of
WebNLG 2020 and Clean E2E NLG, and randomly select 5,000 test examples from WikiLingua
following the practice in (Chowdhery et al., 2022). Following the settings in PaLM, the prompt
used for the summarization tasks is "Summarize the following article:" and the prompt used for
the data-to-text tasks is "Verbalize:". An exception is E2E, where we process the data using
the prompt "generate-gramatically-correct-text from" provided in promptsource for GLM-130B and
GPT-3 175B (Davinci). All evaluations are one-shot, and the demonstration samples are randomly
sampled from the training set. We report the F-measure of ROUGE-2, ROUGE-L (Lin, 2004), and
BLEURT-20 (Pu et al., 2021). We compare our model with LaMDA, GPT-3 175B (Davinci), and
PaLM, where the results of LaMDA and PaLM are reported by (Chowdhery et al., 2022), and we
evaluate GPT-3 175B (Davinci) through the OpenAI API.13
Our results are presented in Table 16. GLM-130B performs better than LaMDA and GPT-3
(Davinci) on all tasks. On the data-to-text tasks, GLM-130B performs slightly worse than
PaLM-540B, while on the summarization task, GLM-130B achieves even higher ROUGE results.
We also ablate GLM-130B to unidirectional attention to demonstrate the advantage of bidirectional
attention. Unidirectional GLM-130B underperforms GPT-3 175B on all three datasets, but when it
shifts to bidirectional attention, there is an instant boost, making GLM-130B even comparable to
PaLM-540B in a few cases. It indicates that bidirectional attention over the provided context
(i.e., the prefix) can also be beneficial for text generation tasks.
Table 16: 1-shot GEM English natural language generation tasks (WebNLG, E2E, and WikiLingua).
We compare two versions of GLM-130B (uni: unidirectional attention, bi: bidirectional attention),
showing that bidirectional attention can also improve conditional generation's performance.

Task          Dataset     Metric     LaMDA 137B  GPT-3 175B (Davinci)  GLM-130B (uni)  GLM-130B (bi)  PaLM-540B
Data-to-Text  WebNLG      ROUGE-2    30.5        29.9                  25.3            38.5           44.4
              WebNLG      ROUGE-L    -           41.2                  36.7            49.3           53.8
              WebNLG      BLEURT-20  -           59.0                  53.2            67.7           73.9
              E2E         ROUGE-2    29.2        30.3                  30.9            33.9           35.2
              E2E         ROUGE-L    -           39.2                  40.0            42.6           43.9
              E2E         BLEURT-20  -           64.5                  65.0            68.1           69.7
Summary       WikiLingua  ROUGE-2    5.4         7.2                   5.8             10.4           9.9
              WikiLingua  ROUGE-L    -           18.9                  16.4            23.4           20.6
              WikiLingua  BLEURT-20  -           41.2                  39.4            45.0           47.7
Groundtruth: 185 centimetre tall Aleksandr Prudnikov played for the Otkrytiye Arena based FC Spartak,
Moscow.
GPT-3 175B (Davinci): Aleksandr Prudnikov is a midfielder for FC Spartak Moscow, a football (soccer) club
based in Moscow, Russia.
GLM-130B: Aleksandr Prudnikov is 185.0 cm tall and plays for FC Spartak Moscow.
13 We use the ROUGE implementation at https://github.com/google-research/google-research/tree/master/rouge and the BLEURT-20 implementation at https://github.com/google-research/google-research/tree/master/rouge, whose checkpoint is available at https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
Groundtruth: At the riverside, there is a coffee shop called The Blue Spice.
GPT-3 175B (Davinci): Blue Spice is a riverside coffee shop which is located on the corner of River Street
and Riverbank Street.
GLM-130B: There’s a coffee shop that serves coffee in the riverside area, Blue Spice.
(WikiLingua Example, without demonstration sample)
The majority of your customers will search for you online, so it's
essential to have a user-friendly website. At the very least, your
website should include information about your business and your history
in the moving industry, details about the quoting process, contact
information, and a description of the services you offer. If possible,
allow customers to schedule quotes online, view your availability, or
read testimonials from other customers. One of the easiest ways to start
your business is by helping people you already know with their moves.
You can be on the lookout for any announcements related to moving that
your friends make on social media. Once you have provided good service
to friends, they are likely to recommend you to others. In order to
spread the word about your business, have some professional looking
promotional materials printed and distribute them around your community.
You can distribute business cards at public events, tuck them into
local bulletin boards, or even print them in directories, yearbooks, and
other local print media. Flyers can be mailed, posted in public places,
or distributed to businesses that might be able to refer customers to
like you, such as furniture stores. Make sure you have a professional,
recognizable logo that is consistent across all of your marketing
materials. Another way to get your business's name out there is to make
yourself visible. Whether it's by working with partners at local events,
volunteering, or using your vehicle for an ad campaign, visibility is
key for driving business. Build relationships with influential people in
your community. Realtors are a great source of referrals to movers, as
are the owners of local furniture stores or the office staff at a large
apartment complex. You can use directory sites like Craigslist to
advertise your services to people in your local community for free.
Social media is also a great way to spread the word about your business.
There are many options for advertising, depending on your budget and
your target market. Consider options like PPC advertising, television
and radio commercials, newspaper ads, direct mail flyers, or memberships
with referral services. The best thing you can do to grow your business
is to provide excellent service to your customers. Be sure to always be
on time, be friendly, be respectful of your customers' belongings, and
offer accurate price quotes. Be sure to ask your happy customers to post
reviews about you online. New customers will see these reviews and will
have more faith in you as a reputable company if they see that others
have had a good experience. You will inevitably end up having to deal
with an unhappy customer at some point, but do your best to resolve the
problem to the customer's satisfaction. The last thing you want is
negative reviews about your business circulating the internet! Summarize
the article:
Groundtruth: Create a website. Start with friends. Distribute business cards and flyers. Network for promo-
tion. Seek out free advertising opportunities. Consider paid advertising. Gain a good reputation.
GPT-3 175B (Davinci): Market yourself online. Look for opportunities to market your services offline.
GLM-130B: Have a user-friendly website. Promote your business through social media. Build relationships
with influential people in your community. Use directory sites like Craigslist to advertise your services.
Table 17: Winograd-style tasks evaluation (Winogender and Winograd273). All scores are accuracy.
K refers to number of shots. ∗ PaLM 540B did not report the exact 0-shot Winogender result, so we
have to estimate a value from its plotted diagram.
Table 19: Commonsense reasoning (Commonsense QA, MC-TACO). K refers to number of shots.
We include the evaluation on Winograd-style tasks, which derive from the classical Winograd
Schema Challenge (Levesque et al., 2012) that aims to test coreference resolution in an ambiguous
context. Since we have included Winogrande (Sakaguchi et al., 2021) and SuperGLUE WSC (Wang
et al., 2019) in MIP, here we test on Winogender (Rudinger et al., 2018) and Winograd273
(Levesque et al., 2012). For Winogender, GPT-3's results are acquired from the OpenAI API, and
BLOOM's 1-shot result is evaluated by ourselves. For Winograd273, since existing works (Brown
et al., 2020; Chowdhery et al., 2022) show that 1-shot learning brings almost no improvement, we
only test the zero-shot result. Another thing to note is that, although GPT-style models (e.g., GPT-3,
PaLM) adopt the "partial evaluation" described in (Radford et al., 2019), we find the prompt
"<sentence> The "<pronoun>" refers to [MASK]" works better for GLM-130B and adopt
it in the evaluation.
The results are presented in Table 17. GLM-130B performs the best on Winogender across all
evaluated LLMs, and marginally poorer than GPT-3 and PaLM on Winograd273.
Closed-book question answering (CBQA) (Roberts et al., 2020) is a widely adopted task to evaluate
language models' memorization of factual knowledge, in contrast to the traditional "open-book"
evaluation. As we have included TriviaQA (Joshi et al., 2017) and WebQuestions (Berant et al.,
2013) in the MIP training, here we choose Natural Questions (Kwiatkowski et al., 2019) and Strat-
egyQA (Geva et al., 2021) as the evaluation datasets for CBQA.
The results are presented in Table 18. GLM-130B performs relatively poorly on Natural Questions
and performs well on StrategyQA. GLM-130B's underperformance on Natural Questions, we
speculate, potentially derives from insufficient fitting on English corpora, as it has only viewed roughly
200B English tokens and thus does not memorize the detailed knowledge very well. Since CBQA
seems to be a task that especially stresses memorization, as indicated by Chinchilla's (Hoffmann
et al., 2022) strong performance, we believe that with sufficient further training, GLM-130B can
perform better.
Here we evaluate GLM-130B and some other LLMs on commonsense reasoning abilities. As we
have included PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and OpenbookQA (Mihaylov
et al., 2018) in the MIP training, we select another two widely adopted commonsense reasoning
datasets for our evaluation: Commonsense QA (Talmor et al., 2019) and Multiple-choice Temporal
Commonsense (MC-TACO, Zhou et al. (2019)). For Commonsense QA, we test GPT-3 via the
OpenAI Davinci API, BLOOM-176B via its Hugging Face implementation, and GLM-130B using
the prompt "answer_given_question_without_options" from promptsource (Bach et al., 2022). For
MC-TACO, we follow the EM computation method provided in (Zhou et al., 2019).
The results are shown in Table 19. As we can see, GLM-130B performs the best on both Com-
monsense QA and MC-TACO across evaluated LLMs, demonstrating that GLM-130B has a good
grasp of commonsense knowledge. OPT’s results are not included due to the reason described in
Appendix C.3.
As discussed in Section 5, we adopt a rather strict criterion for selecting datasets for zero/few-shot
learning in GLM-130B's evaluation due to the use of MIP. Nevertheless, the criterion significantly
reduces the datasets we can currently evaluate, and in particular, some readers have questioned
whether the restriction of not evaluating on MIP-seen fixed-label datasets (e.g., natural language
inference (NLI)) is necessary, suggesting that we report them in an independent section to avoid
confusion. Frankly speaking, in such a setting GLM-130B's zero/few-shot learning could be quite
advantageous. Below, we take NLI as a typical example to show GLM-130B's outperformance in
these scenarios. We include 6 widely used NLI datasets, which are not incorporated in GLM-130B's
MIP training, as the benchmarks. The results are presented in Table 20, which shows that
GLM-130B's "zero-shot" performance can be much better due to the seen task type.
Table 20: "Zero-shot" results of GLM-130B on 6 typical natural language inference (NLI) datasets.
∗DISCLAIMER: Although these datasets themselves are never seen, some other NLI datasets
have been included in GLM-130B's MIP, making this different from the existing standard
zero-shot setting.
We also report our evaluation of GLM-130B on the SuperGLUE (Wang et al., 2019) benchmark,
which consists of 8 different natural language understanding challenges. Note that these results
are neither zero/few-shot nor fine-tuned results, because the training sets of 7 out of 8 tasks (all
except ReCoRD) have been included in GLM-130B's MIP training together with 67 other multi-task
datasets; however, GLM-130B is also not individually fine-tuned on any of them. Therefore, these
results are not for relative comparison with any other models, but only for readers' reference on
GLM-130B's absolute ability.
Figure 17: GLM-130B (uni and bi)'s untuned results on the SuperGLUE development set, using
promptsource (Bach et al., 2022) prompts and task formulations. DISCLAIMER: Note that some
of the SuperGLUE training sets have been included in the MIP training. We report the results
here only for readers' reference.
Figure 18: Chain-of-thought prompting can also improve GLM-130B’s performance on reasoning
tasks compared to standard prompting.
Table 21: The results of GLM-130B on the SuperGLUE dataset obtained using the P-tuning v2 (Liu
et al., 2022). We report the Accuracy metric for all datasets except for MultiRC (F1a) and ReCoRD
(F1).
The results are presented in Figure 17. We ablate the unidirectional and bidirectional GLM-130B to
justify the usefulness of the GLM objective in boosting LLMs' understanding ability. Each point
in the figure refers to a prompt-specific result, for which the prompt is from the promptsource (Bach
et al., 2022) repository. We adopt the task formulation from promptsource, too. As we can observe,
GLM (bi) has much lower variance and higher performance on all tasks. On some of the tasks
(such as CB, MultiRC, RTE, COPA, and BoolQ), GLM-130B can even achieve over 80% accuracy.
We also attempted to fine-tune GLM-130B on the SuperGLUE dataset. However, we encountered
the issue of rapid overfitting within a single epoch when we used full parameter fine-tuning on
downstream tasks. This resulted in poor performance on the validation set. To address this issue,
we explored the use of efficient parameter fine-tuning methods, which tune only a small number
of parameters and are less prone to overfitting. After experimenting with several methods, we chose
P-Tuning v2 (Liu et al., 2022), which demonstrated results comparable to full-parameter fine-tuning
of GLM-130B, but with only 0.1% to 3% of the parameters tuned. The results of our experiments with
P-Tuning v2 are presented in Table 21.
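As a hedged back-of-the-envelope check on the 0.1%-3% figure: P-Tuning v2 trains continuous prefix key/value vectors at every layer, so the tuned-parameter count is roughly layers × 2 × prefix length × hidden size. The GLM-130B sizes below (70 layers, hidden size 12,288) and the prefix length of 128 are assumptions used only for illustration.

```python
def ptuning_v2_params(n_layers, hidden, prefix_len):
    """Rough count of trainable prefix parameters (keys + values per layer)."""
    return n_layers * 2 * prefix_len * hidden

total = 130_000_000_000  # ~130B model parameters
tuned = ptuning_v2_params(n_layers=70, hidden=12_288, prefix_len=128)
print(f"{tuned:,} tuned params = {tuned / total:.4%} of 130B")  # ~0.17%
```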
Figure 19: Log-scaling ability tasks of GLM-130B. These tasks' performance grows logarithmically
with the number of GLM parameters. Most traditional NLP tasks fall into this pattern.
Last letter concatenation (LLC). The task asks the model to concatenate the last letters of words
in a name (e.g., "Elon Musk" -> "nk"). We generate full names by randomly concatenating the top
1000 first and last names from name census data14 .
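The LLC target can be computed programmatically, which is also how generated answers can be verified; a minimal sketch:

```python
def last_letter_concat(name):
    """Concatenate the last letter of each word, e.g. "Elon Musk" -> "nk"."""
    return "".join(word[-1] for word in name.split())

print(last_letter_concat("Elon Musk"))  # nk
```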
Coin flip. This task asks the model to answer whether a coin is still heads up after people either
flip or don’t flip it beginning from being heads up. (e.g., "A coin is heads up. Phoebe flips the coin.
Osvaldo does not flip the coin. Is the coin still heads up?" -> "no"). We additionally evaluate on
the scenario where the number of people in the query examples is larger than that in the in-context
examples, i.e. the out-of-distribution (OOD) setting.
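The ground-truth answer for coin flip is just the parity of the number of flips, as in this sketch:

```python
def coin_still_heads(flips):
    """`flips` lists, per person, whether they flipped the coin.

    The coin starts heads up; an odd number of flips turns it over.
    """
    return sum(flips) % 2 == 0

# "Phoebe flips the coin. Osvaldo does not flip the coin." -> "no"
print("yes" if coin_still_heads([True, False]) else "no")  # no
```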
Reverse List. This task asks the model to reverse the order of a list of everyday objects (e.g.,
"cigar, umbrella, key, gum, alarm" -> "alarm, gum, key, umbrella, cigar"). We generate the lists by
randomly sampling from the vocabulary of everyday objects15 .
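The reverse-list target is likewise mechanical to generate and check:

```python
def reverse_list(items):
    """Reverse a comma-separated list, e.g. "a, b, c" -> "c, b, a"."""
    return ", ".join(reversed(items.split(", ")))

print(reverse_list("cigar, umbrella, key, gum, alarm"))  # alarm, gum, key, umbrella, cigar
```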
Sports. This task asks the model to judge the truthfulness of a statement about a sports player (e.g.,
"Joao Moutinho caught the screen pass in the NFC championship" -> "false").
Date. This task asks the model to infer the date from a given context (e.g., "2015 is coming in 36
hours. What is the date one week from today in MM/DD/YYYY?" -> "01/05/2015").
We use the same examples and chains as Wei et al. (2022c). For each task, we try two different
prompt formats with both unidirectional and bidirectional attention mechanisms and report the
best performance. The first format is "Question: {context} Answer: {target}". The second adds
serial numbers before the examples of the first format. The results are presented in
Figure 18.
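The two prompt formats can be sketched as follows; the exact whitespace and delimiters are an assumption on our part, since the paper specifies only the template strings.

```python
def build_prompt(examples, query, numbered=False):
    """Format in-context examples in the 'Question: ... Answer: ...' template.
    With numbered=True, each example is prefixed with a serial number
    (the second prompt format)."""
    lines = []
    for i, (context, target) in enumerate(examples, 1):
        prefix = f"{i}. " if numbered else ""
        lines.append(f"{prefix}Question: {context} Answer: {target}")
    prefix = f"{len(examples) + 1}. " if numbered else ""
    lines.append(f"{prefix}Question: {query} Answer:")
    return "\n".join(lines)

demo = [("A coin is heads up. Ka flips the coin. "
         "Is the coin still heads up?", "no")]
print(build_prompt(demo,
                   "A coin is heads up. Sherrie flips the coin. "
                   "Is the coin still heads up?",
                   numbered=True))
```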
Scaling up pre-trained language models has been shown to continually boost downstream performance on a
wide range of tasks. More importantly, scaling gives rise to emergent abilities that are unpredictable from smaller scales.
To illustrate this, we conducted extensive experiments to explore the scaling properties and emergent
abilities. Following prior literature (Wei et al., 2022b), we categorize the NLP tasks into two types
based on our observations.
• Log-scaling Ability Tasks (Cf. Figure 19): where the task performance grows logarithmically
with the number of model parameters. Typical tasks and datasets include LAMBADA, WikiText-103,
WikiText-2, and Penn Treebank.
• Emergent Ability Tasks (Cf. Figure 20): where the task performance only soars up when the
amount of model parameters reaches a certain threshold. Typical tasks and datasets include:
14 https://namecensus.com
15 https://www.vocabulary.com/lists/189583
Figure 20: Emergent ability tasks of GLM-130B. The performance of these tasks barely grows
until the model size reaches a certain threshold (e.g., 10B or 100B); after reaching the threshold,
performance soars quickly. The BIG-bench (Srivastava et al., 2022) benchmark collects
many of these challenges.
Table 12: The 74 datasets involved in Multi-task Instruction Pre-training (MIP). Datasets from T0-
PromptSource (Sanh et al., 2022; Bach et al., 2022) are named in their Hugging Face datasets iden-
tifiers. Datasets from DeepStruct (Wang et al., 2022a) are described in Appendix C.2.
Table 14: Detailed results of GLM-130B, GPT-3 175B (Brown et al., 2020), and PaLM 540B (Chowd-
hery et al., 2022) on BIG-bench-lite in 0-, 1-, and 3-shot settings. The "normalized preferred metric" is
reported for each task. GPT-3's and PaLM's results are taken from BIG-bench's GitHub repository;
PaLM 540B's 3-shot results are not available there.
Table 15: Detailed results of GLM-130B and BLOOM 176B (Scao et al., 2022) on
MMLU (Hendrycks et al., 2021). We find that no existing literature reports GPT-3 175B's
numerical accuracy. BLOOM is evaluated using the Hugging Face Transformers implementation.
E CONTRIBUTIONS
The GLM-130B project was conceived in Dec. 2021, with its pre-training completed on July 3rd,
2022 and its evaluation and applications still ongoing. Over the course of the project, we experienced various
technical and engineering challenges (Cf. Appendix F and Figure 21 for details). It would not have been
possible to reach the current status without the collaboration of multiple teams: the Knowledge
Engineering Group (KEG), the Parallel Architecture & Compiler technology of Mobile, Accelerated,
and Networked systems Group (PACMAN), and the Natural Language Processing Group (THUNLP) at
Tsinghua University, as well as Zhipu.AI. The detailed contributions are listed below.
E.1 PREPARATION
The GLM-130B project16 was conceived in Dec. 2021 in a brainstorming meeting at Tsinghua KEG.
We firmly believe that it is of value to pre-train a highly accurate language model, in particular for
both Chinese and English. Though GPT-3 (Brown et al., 2020) pioneered this effort, it is not
available to most people in the world, and it supports only English. We therefore decided to
initiate the GLM-130B project. Note that the WuDao 1.75T model we built last year is a
sparse mixture-of-experts (MoE) model with 480 experts, rather than a dense model like GPT-3. Our goal
was to train a bilingual dense pre-trained model with high accuracy on downstream tasks, and to make
it open to everyone in the world: anyone, anywhere can download it and use it on a single server with
appropriate GPUs.
The ambitious project soon faced several important challenges:
• Lack of computational resources: No organization is willing to sponsor such a big project and
freely make it public.
• Lack of a robust pre-training algorithm: Despite GPT-3's success on English corpora, it was
unclear how to train a highly accurate bilingual model for both English and Chinese.
• Lack of fast inference solutions: Since the goal is to have the model public to everyone, we need
to design fast inference solutions with low resource requirements to run the model.
For the pre-training algorithm, we finally chose GLM (Du et al., 2022) due to its high performance
in practice. We eventually decided to train a GLM model of 130 billion parameters after several
rounds of discussions and exploration, because such a size makes it possible to run the inference on
a single A100 (40G * 8) server.
Our first attempt at training the model was in January 2022, shortly after we received a small GPU
sponsorship for test runs. However, we soon realized that we had significantly underestimated the
technical difficulties of pre-training a model at such a scale (>100B): pre-training a
highly accurate 100B-scale model turns out to be quite different from training a 10B-scale one. Due to frequent
random hardware failures, exploding model gradients, unexpected excessive memory usage in the
algorithm, debugging the 3D pipeline in the new Megatron and DeepSpeed frameworks, inability to
recover from optimizer states, blocked TCP responses between processes, and many other unex-
pected "bugs", the project was delayed many times. The Tsinghua PACMAN team gave us a
hand at this difficult time, and together we successfully fixed most of the "bugs".
By March, we were still short on computational resources, but fortunately got a chance to try test
runs on several other platforms, including Ascend 910, Hygon DCU, NVIDIA, and Sunway. The
immediate challenge was to adapt our training code to these different platforms, as the underlying
operators differ considerably. The porting also introduced many new issues: element-wise operators
that did not support fast computation on large-dimension vectors; various obstacles that hindered
convergence, including large gradient norms of input embeddings, the choice among native Post-LN,
Pre-LN, and Sandwich-LN, dataloader state seeds, and computation precision in Softmax and
Attention; as well as numerous mistakes we ourselves made. With tremendous help from all of our generous partners, we
finally succeeded in making our pre-training algorithms runnable across all the platforms—frankly,
a surprising achievement for this project. The timeline of GLM-130B in Figure 21 covers most of
the issues we have encountered and addressed as of this writing.
On April 26th, we received a generous computing sponsorship from Zhipu.AI — an AI startup that
aims to teach machines to think like humans. After another week of testing, we finally kicked off
the training of the GLM-130B model on its 96 A100 (40G * 8) servers on May 6th. Additionally,
Zhipu.AI also sent a team to help evaluate the pre-trained model and build a demonstration website.
The training period spanned two months, during which we began developing a toolkit to allow GLM-
130B's inference in low-resource settings via swapping techniques and quantization. Though GLM-130B is
already the most accessible model of its scale, together with our partners from Tsinghua NLP we have
been exploring the limits of commodity hardware platforms, which would truly make 100B-scale
models accessible to as many people as possible. To date, we have achieved INT4 weight
quantization for GLM-130B. Importantly, the INT4 version of GLM-130B without post-training
16 This section is largely extracted and updated from the blog introduction of GLM-130B at http://keg.cs.tsinghua.edu.cn/glm-130b/ (posted August 4, 2022).
faces negligible performance degradation compared to its uncompressed original while consuming
only 25% of the GPU memory required by the uncompressed version, thus supporting effective
inference on 4 × RTX 3090 Ti (24G) or 8 × RTX 2080 Ti (11G). We will attempt to further reduce
the resource requirements and keep the community updated on this important work item.
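The 25% figure follows directly from the bit widths: INT4 stores 4 bits per weight versus FP16's 16. A back-of-envelope check (the function name is ours; activations, KV caches, and framework overhead are ignored):

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB, ignoring activations and
    framework overhead."""
    return num_params * bits_per_param / 8 / 1024**3

params = 130e9                          # GLM-130B parameter count (approx.)
fp16 = weight_memory_gib(params, 16)    # uncompressed weights
int4 = weight_memory_gib(params, 4)     # INT4-quantized weights
print(f"FP16: {fp16:.0f} GiB, INT4: {int4:.0f} GiB, "
      f"ratio: {int4 / fp16:.0%}")
```

On these numbers the INT4 weights occupy roughly 61 GiB, which fits within the aggregate memory of 4 × RTX 3090 Ti (96 GB) or 8 × RTX 2080 Ti (88 GB), leaving headroom for activations.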
G BROADER IMPACT
This paper introduces an open bilingual pre-trained language model with 130 billion parameters.
Currently most pre-trained language models with over 100 billion parameters are privately owned
by governments and large corporations (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021;
Chowdhery et al., 2022; Wang et al., 2021). A few of them (Brown et al., 2020; Lieber et al., 2021)
provide limited inference APIs with fees. In contrast, the weights and code of GLM-130B are open
to anyone interested in LLMs. Moreover, we significantly lower the hardware requirements
for inference through an optimized implementation and INT4 quantization. The paper can thus have a broader
impact on the research community, individual developers and small companies, and society.
Most research institutions cannot afford the substantial cost of pretraining large language models.
As a result, most researchers, other than employees of governments and large corporations, only have
access to limited, fee-based inference APIs. Through these APIs, researchers can only analyze
model outputs as black boxes, which limits the scope of potential work. With GLM-130B,
researchers can analyze the model parameters and internal states corresponding to specific inputs,
enabling in-depth studies of LLMs' theory, capacity, and flaws. Researchers can also modify the
model architecture and weights to validate proposed algorithms for improving LLMs (Zhu et al.,
2020; Cao et al., 2021; Hase et al., 2021; Mitchell et al., 2022).
With INT4 quantization, GLM-130B can perform inference on commodity GPUs such as 4 × RTX
3090 or 8 × RTX 2080 Ti, which can be easily accessed from cloud services. As a result, researchers
who cannot afford powerful data-center GPU servers such as DGX-A100 can also utilize GLM-130B.
Currently, individual developers and small companies who want to integrate LLMs into their busi-
ness can only choose paid inference APIs, and the added cost can deter such attempts. Instead,
GLM-130B can be deployed on commodity hardware that they own or can access via cloud services,
reducing the cost. Furthermore, they can apply distillation techniques (Sanh et al., 2019; Jiao et al.,
2020) to obtain smaller models that preserve comparable performance on their specific tasks. While
some developers may lack the ability to complete deployment and distillation on their own, we be-
lieve that with GLM-130B and more open LLMs in the future, the corresponding toolkits and service
providers will become more available.
We also note that currently most applications of LLMs are based on prompt engineering, partly
due to the limitations of inference APIs. In downstream scenarios such as online customer service,
companies accumulate huge amounts of human-generated data that contain domain knowledge.
With the open-source weights and code, developers can finetune GLM-130B on their own data to
close this domain-knowledge gap.
Large language models, together with generative models in other modalities (e.g.,
image (Ramesh et al., 2021; Ding et al., 2021; Saharia et al.) and video (Hong et al., 2022)), could
be used to generate synthetic text for harmful applications, such as telemarketing fraud, political
propaganda, and personal harassment, as discussed in (Weidinger et al., 2021; Sheng et al., 2021;
Dev et al., 2021). That said, we do not anticipate hazardous outputs from the model, especially
towards vulnerable and historically disadvantaged groups of people.
While some people think that restricting access to LLMs can prevent such harmful applications, we
argue that promoting LLM inclusivity can lead to better defense against potential harm caused by
LLMs. Currently, only governments and large corporations can afford the considerable costs of pre-
training LLMs, and there is no guarantee that organizations with the substantial financial resources
to pretrain an LLM will not do harm with it. Without access to such LLMs, individuals cannot
even recognize the role LLMs play in that harm. Conversely, releasing an open LLM provides access and
transparency to all researchers and promotes research on reducing the potential harm of LLMs,
such as algorithms to identify synthetic text (Gehrmann et al., 2019) or to detect fake news (Li et al.,
2021).
Also, it is known that LLMs can suffer from problems of fairness, bias, privacy, and truthful-
ness (Zhang et al., 2021; Lin et al., 2022; Liang et al., 2021; Bender et al., 2021). An open
LLM reveals the model parameters and internal states corresponding to specific inputs, rather
than providing APIs to black-box models. Consequently, researchers can analyze LLMs'
flaws in depth and propose improved algorithms to address them.
H ENVIRONMENTAL IMPACT
One of the major concerns about large language models is their huge energy usage and the associated
carbon emissions (Strubell et al., 2019; Lacoste et al., 2019; Patterson et al., 2021; Bender et al.,
2021). GPT-3 was estimated to have a carbon footprint of about 500 metric tons of CO2-equivalent
(CO2eq) (Patterson et al., 2021). We consumed a total of 442.4 MWh of electricity over the 60-day
course of training. Given the 0.5810 kg/kWh carbon efficiency of the local power grid, the pre-training
released 257.01 metric tons of CO2, around half of GPT-3's carbon footprint, probably owing to our
efficient parallel strategies and NVIDIA's hardware improvements. This is roughly equivalent to
the yearly emissions of 18 average Americans. However, we believe that releasing GLM-130B will
save the community the carbon emissions of reproducing 100B-scale LLMs.
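The emission figure follows from the two quantities reported above; the short check below reproduces it up to rounding of the reported inputs.

```python
energy_mwh = 442.4         # total electricity over the 60-day training run
grid_kg_per_kwh = 0.5810   # carbon efficiency of the local power grid

# MWh -> kWh, multiply by kg CO2 per kWh, then kg -> metric tons
emissions_tons = energy_mwh * 1000 * grid_kg_per_kwh / 1000
print(f"{emissions_tons:.2f} metric tons CO2")  # ~257 t, about half of GPT-3's ~500 t
```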