
How well do Large Language Models perform in Arithmetic tasks?

Zheng Yuan¹  Hongyi Yuan¹²  Chuanqi Tan¹  Wei Wang¹  Songfang Huang¹
¹Alibaba Group  ²Tsinghua University
{yuanzheng.yuanzhen,chuanqi.tcq,hebian.ww,songfang.hsf}@alibaba-inc.com
yuanhy20@mails.tsinghua.edu.cn

Abstract

Large language models have shown emergent abilities, including chain-of-thought reasoning, that let them answer math word problems step by step (Wei et al., 2022b). Solving math word problems not only requires the ability to disassemble a problem via chain-of-thought but also requires calculating the arithmetic expression in each step correctly. To the best of our knowledge, no prior work focuses on evaluating the arithmetic ability of large language models. In this work, we propose an arithmetic dataset, MATH 401, to test the latest large language models, including GPT-4, ChatGPT, InstructGPT, Galactica, and LLaMA, on various arithmetic expressions, and we provide a detailed analysis of their abilities. MATH 401 and the evaluation code are released at https://github.com/GanjinZero/math401-llm (this project is a work in progress).

1 Introduction

Emergent abilities appear in sufficiently large language models (LLMs) (Wei et al., 2022a), such as chain-of-thought reasoning (COT) (Wei et al., 2022b). Chain-of-thought reasoning requires LLMs to solve a question by thinking through it step by step, and it performs well on school math word problems (Wei et al., 2022b; Kojima et al., 2022). Recent LLMs are further fine-tuned with instruction tuning (Sanh et al., 2021; Chung et al., 2022; Ouyang et al., 2022), which demonstrates improved COT ability compared to self-supervised pre-training alone. To solve a math word problem, COT disassembles the problem into simple steps. For each step, LLMs have to compute correctly based on arithmetic expressions. Thus, evaluating the arithmetic ability of LLMs is necessary, since it is an upper bound on LLMs' ability to solve math word problems.

To this end, we propose an arithmetic dataset named MATH 401. Different difficulties are contained in this dataset, including addition (+), subtraction (−), multiplication (×), division (÷), exponentiation (∧), trigonometric functions (sin, cos, tan), and logarithm functions (log, ln) of integers, decimals, and irrational numbers (π, e). Long arithmetic expressions with brackets, which are common in complex math word problems, are also included. Results in Table 1 show detailed evaluations of OpenAI's GPTs, including GPT-4 (OpenAI, 2023), ChatGPT (https://openai.com/blog/introducing-chatgpt-and-whisper-apis), GPT-3.5 (Ouyang et al., 2022), and other open-sourced LLMs. We find that GPT-4 and ChatGPT outperform other models by a large margin in all kinds of arithmetic abilities. InstructGPT (Ouyang et al., 2022) and Galactica (Taylor et al., 2022) do have some arithmetic ability. We systematically analyze factors affecting LLMs' arithmetic ability, including tokenization (§4.2), pre-training (§4.3), prompts (§4.4), interpolation and extrapolation (§4.5), scaling laws (§4.6), COT (§4.7), and ICL (§4.8).

One may argue that the ability to solve arithmetic tasks is not necessary for a large language model, since LLMs can call a calculator API when they need to decode an answer (Schick et al., 2023). However, arithmetic ability evaluation can serve as a gauge of general intelligence, since mastering arithmetic is a fundamental requirement for performing intricate mathematical tasks, including symbolic math reasoning (Noorbakhsh et al., 2021; Gaur and Saunshi, 2022) and automatic theorem proving (Polu and Sutskever, 2020; Wu et al., 2022).
Model Size E +− × ÷ ∧ Tri log Dec Neg Irr Big Long Easy Hard All
GPT-4 ? ✓ 99 67 100 50 68 76 67 67 100 48 96 100 67 84
ChatGPT ? ✓ 97 65 80 50 44 56 67 67 64 40 68 100 49 74
InstructGPT 175B ✗ 83 59 80 36 8 16 64 64 36 4 24 92 22 57
CodeX 175B ✓ 36 27 8 10 8 0 25 25 12 0 0 40 4 22
Galactica 120B ✓ 69 43 24 44 16 0 57 57 28 0 24 78 12 45
LLaMA 65B ✓ 44 35 8 22 8 0 41 41 20 0 4 52 5 28
OPT 175B ✓ 33 35 4 12 0 4 25 25 8 0 0 41 2 22
GPT-NeoX 20B ✓ 51 48 4 40 4 0 43 43 20 0 8 66 4 35
GLM 130B ✓ 39 31 8 22 0 0 29 29 24 0 8 46 5 26
BloomZ 176B ✗ 23 37 12 30 8 0 43 43 20 0 8 39 6 22
Bloom 176B ✗ 21 37 12 30 0 0 37 37 16 0 0 37 4 20
T0++ 11B ✗ 6 3 0 6 8 0 3 3 4 0 0 7 2 4
Flan-T5 11B ✗ 1 13 4 0 0 0 11 11 8 0 0 6 2 4

Table 1: Arithmetic ability of LLMs measured by accuracy; we only list the models with the largest parameter counts. E = Euler (✓/✗ indicates whether the Euler equation is answered correctly), Dec = Decimal, Neg = Negative, Irr = Irrational, Big = Big Numbers, Long = Long Expressions.

2 Related Works

Evaluate Math Ability of LLMs  To show the math reasoning ability of LLMs, Wang and Komatsuzaki (2021), Chung et al. (2022), and Thoppilan et al. (2022) evaluate their models on various math word problem benchmarks (Saxton et al., 2019; Hendrycks et al., 2021; Cobbe et al., 2021; Shi et al., 2022). For the newly released LLM ChatGPT, Shakarian et al. (2023) and Frieder et al. (2023) evaluate its mathematical ability independently. Note that our paper evaluates ChatGPT using the gpt-3.5-turbo-0301 version and GPT-4 using the chat UI on March 16th, which may give different performance compared to their reported results and to future analyses.

Evaluate Arithmetic Ability of LLMs  Nogueira et al. (2021) and Wang et al. (2021) evaluate pretrained language models on simple arithmetic expressions including addition (+) and subtraction (−). Muffo et al. (2022) further test the multiplication (×) ability of language models. These works find that tokenization (Nogueira et al., 2021; Kim et al., 2021) and token frequency (Razeghi et al., 2022) are two important factors for language model arithmetic ability. Compared to previous work, we focus on evaluating large LMs (with instruction fine-tuning) on comprehensive arithmetic abilities covering different types of operators and numbers.

3 Evaluation Settings

3.1 Arithmetic Expression Settings

We construct 401 arithmetic expressions to test large language models, consisting of the Euler equation (e^(iπ) + 1 = 0) as group 0 and 25 problems for each of groups 1∼16. If not otherwise mentioned, the numbers used are positive integers. A minimal generation sketch follows the group list below.

• Euler equation.
• Add & subtract two integers within 10.
• Add & subtract two integers within 100.
• Add & subtract two integers within 1,000.
• Add & subtract two integers within 1,000,000,000,000.
• Add & subtract two integers within -10∼10.
• Add & subtract two decimal numbers within -100∼100.
• Multiply two integers within 100.
• Multiply two decimal numbers within 10.
• Multiply two integers within 100,000.
• Division of two integers within 100.
• Exponentiation with an integer base within 10 and an integer exponent within 2∼4.
• Exponentiation with a decimal base within 10 and a decimal exponent within 2∼4.
• Add, subtract & multiply one integer within 10 with a common irrational number (i.e., e or π).
• Long arithmetic expressions with brackets; the involved integers are all within 100 and the operators are addition, subtraction, multiplication, and division.
• Trigonometric functions including sin, cos, and tan; inputs can be given in degrees or radians (π can also appear in the inputs).
• Logarithms of integers within 1000 with bases 2, e, or 10.
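For illustration only, expressions of this kind and their reference answers can be produced with Python built-ins and the math module; the sketch below is a hypothetical generator, not the released dataset code, and the group coverage, sampling ranges, and expression formatting are simplified assumptions:

    import math
    import random

    SAFE_NAMES = {"sin": math.sin, "cos": math.cos, "tan": math.tan,
                  "log": math.log, "pi": math.pi, "e": math.e}

    def reference_answer(expression: str):
        # Evaluate with Python built-ins and round to four decimal places (Section 3.1).
        return round(eval(expression, {"__builtins__": {}}, SAFE_NAMES), 4)

    def sample_add_subtract(limit: int = 100) -> str:
        # Group-2-style query: add or subtract two integers within `limit`.
        a, b = random.randint(0, limit), random.randint(0, limit)
        return f"{a} {random.choice(['+', '-'])} {b}"

    expr = sample_add_subtract()
    print(expr, "=", reference_answer(expr))       # e.g. "21 + 97 = 118"
    print(reference_answer("70 / 61"))             # 1.1475
    print(reference_answer("sin(-3.75 * pi)"))     # 0.7071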
These groups cover the mathematical operators used in elementary mathematics. We consider groups 1, 2, 3, 5, 6, 7, 8, and 11 as Easy queries and all others as Hard queries. We calculate the reference results of all arithmetic expressions using built-in functions of Python and round them to four decimal places. Examples of expressions are listed in Appendix A.

3.2 Metrics

Since LLMs can decode arbitrary content (which may contain step-by-step calculations), we first ignore decoded numbers in parentheses and preserve the last number decoded by the LLM. If the decoded number is a fraction, we convert it to a decimal for evaluation, except for group 10, which requires calculating division. We use the following metrics to measure the arithmetic ability of LLMs.

Accuracy  If the difference between the decoded number and the target number is less than 1e−3, we consider it a correct prediction. Accuracy is calculated from the count of correct predictions.

Relative error  We denote the decoded number by ŷ and the target by y and calculate the relative error as

RE = min(10, |ŷ − y| / max(|y|, 1))    (1)

If the LLM does not decode any number, we set RE = 10. We truncate the relative error at 10 to prevent one big mistake from dominating the average relative error.

Non-number ratio  If the decoded content does not contain any number, we consider it a failure. The non-number ratio is the fraction of such failures.

3.3 Evaluation Details

We test GPT-4 through its official chat UI (https://chat.openai.com/chat?model=gpt-4). Since GPT-4 has limited request counts, we only query GPT-4 on the groups that ChatGPT cannot answer correctly. We test GPT-3.5 (including the davinci (CodeX, InstructGPT) and turbo (ChatGPT) series models) (Ouyang et al., 2022; Chen et al., 2021) via the OpenAI APIs. We also test the following open-sourced LLMs: Galactica (Taylor et al., 2022), GPT from EleutherAI (Wang and Komatsuzaki, 2021; Black et al., 2022), LLaMA (Touvron et al., 2023), OPT (with instruction learning) (Zhang et al., 2022), Bloom (with instruction learning) (Scao et al., 2022; Muennighoff et al., 2022), T0++ (Sanh et al., 2021), GLM (Zeng et al., 2022), and Flan-T5 (Chung et al., 2022). We also test the smaller versions of the above models.

We test the following prompts: ∅ (i.e., no prompt), "Calculate:", "$", "$$", and "\begin{equation}". The latter three prompts are inspired by the fact that LLMs may be pretrained on LaTeX sources. We provide three versions of the input format: math text (π), plain text (pi), and LaTeX text (\pi). With LaTeX-related prompts we provide the model with LaTeX text; with other prompts we provide math text if the tokenizer can encode it and plain text otherwise. For ChatGPT (gpt-3.5-turbo-0301), we test different system-level prompts as instructions: ∅ (i.e., no system prompt), "You are an accurate calculator.", and "You are an accurate calculator, please calculate provided equation to four decimal places.". For GPT-4, we only test the system prompt "You are an accurate calculator, please calculate provided equation to four decimal places.".

We use the default decoding settings for the OpenAI APIs and greedy decoding for all other LLMs.
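To make the pipeline concrete, the following is a minimal sketch of how a ChatGPT query can be issued and scored. It is an illustrative reimplementation rather than the released evaluation code: it assumes the 2023-era openai ChatCompletion bindings, uses a simplified number-matching regex, and omits the fraction-to-decimal conversion described in §3.2.

    import re
    import openai

    SYSTEM = ("You are an accurate calculator, please calculate "
              "provided equation to four decimal places.")
    NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

    def ask_chatgpt(expression: str) -> str:
        # System-level instruction plus the "Calculate:" input prompt; default decoding settings.
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0301",
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": f"Calculate: {expression}"},
            ],
        )
        return response["choices"][0]["message"]["content"]

    def last_number(decoded: str):
        # Ignore numbers inside parentheses (intermediate steps) and keep the last number.
        cleaned = re.sub(r"\([^)]*\)", "", decoded).replace(",", "")
        found = NUMBER.findall(cleaned)
        return float(found[-1]) if found else None

    def score(decoded: str, target: float):
        # Returns (correct, relative_error, non_number) for a single query.
        pred = last_number(decoded)
        if pred is None:
            return False, 10.0, True            # non-number output; RE truncated at 10
        rel_err = min(10.0, abs(pred - target) / max(abs(target), 1.0))
        return abs(pred - target) < 1e-3, rel_err, False

    print(score("46 * 1353 = 62238, so the answer is 62,252", 62252.0))
    # (True, 0.0, False)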
4 Results and Analysis

4.1 Results

Overall Results  Tables 1, 2, and 3 show the results of different LLMs on MATH 401. We find that GPT-4 and ChatGPT outperform all other models by a large margin (OpenAI states that they have improved the mathematical ability of ChatGPT since the Jan 30 version, and we cannot evaluate any previous version). GPT-4 surpasses ChatGPT by 10 accuracy points and halves the relative error. InstructGPT ranks third measured by accuracy, and Galactica-30B ranks third measured by relative error. Compared to models proposed before InstructGPT (text-davinci-003), the GPT series applies Reinforcement Learning from Human Feedback (RLHF), which may enhance arithmetic ability significantly. Galactica is pre-trained with massive LaTeX source code, which could be the reason why Galactica performs well in arithmetic.

Model Prompt Acc ↑ RE ↓ NNR ↓
gpt-4 Cal*4 83.54 0.07 0.00
gpt-3.5-turbo-0301 Cal* 75.06 0.14 0.50
text-davinci-003 Cal 56.61 0.76 2.99
code-davinci-002 Eqa 21.70 2.39 11.47
galactica-120b Eqa 45.14 1.30 3.99
galactica-30b Eqa 45.14 0.69 1.75
llama-65b Eqa 28.43 1.61 4.74
opt-175b Cal 21.70 3.18 21.70
gpt-neox-20b Eqa 35.41 1.19 4.49
glm-130b $ 25.94 1.27 2.74
bloomz-176b $$ 22.44 1.50 4.74
bloom-176b $ 20.20 2.60 18.45
T0++-11b Cal 4.24 3.34 9.48
flan-t5-xxl-11b Eqa 3.74 5.78 43.89
flan-t5-xl-3b $ 7.48 3.34 25.19

Table 2: Evaluation on MATH 401 with different LLMs. Prompts are selected via best accuracy. Cal means "Calculate:" and Eqa means "\begin{equation}"; * means providing an additional system-level message, and *4 means the system-level message additionally requests four decimal places.

Model Prompt Acc ↑ RE ↓ NNR ↓
gpt-4 Cal*4 83.54 0.07 0.00
gpt-3.5-turbo-0301 Cal* 75.06 0.14 0.50
text-davinci-003 Cal 56.61 0.76 2.99
text-davinci-002 Cal 42.89 2.13 15.96
text-curie-001 Cal 11.47 1.92 6.48
text-babbage-001 Eqa 5.24 2.59 5.74
code-davinci-002 Eqa 21.70 2.39 11.47
galactica-120b Eqa 45.14 1.30 3.99
galactica-30b Eqa 45.14 0.69 1.75
galactica-6.7b Cal 34.41 2.61 8.73
llama-65b Eqa 28.43 1.61 4.74
llama-30b Eqa 30.17 1.72 3.74
llama-13b $ 27.68 2.40 9.73
llama-7b $$ 21.95 2.11 7.48
opt-175b Cal 21.70 3.18 21.70
opt-66b ∅ 20.70 2.66 18.70
opt-iml-max-30b Cal 17.46 1.52 6.23
opt-30b ∅ 15.96 2.28 11.22
opt-13b ∅ 15.21 2.19 10.97
opt-6.7b Cal 14.46 1.46 4.24
gpt-neox-20b Eqa 35.41 1.19 4.49
gpt-j-6b Cal 27.18 1.55 8.98
bloomz-176b $$ 22.44 1.50 4.74
bloom-176b $ 20.20 2.60 18.45
bloomz-7b1 $ 12.72 2.56 15.46
bloom-7b1 Cal 7.23 2.41 6.48
bloomz-3b $$ 7.98 2.63 12.47
bloom-3b Cal 4.24 2.41 8.73
bloomz-1b7 Eqa 4.74 4.28 31.17
bloom-1b7 Cal 5.24 2.54 11.22
T0++-11b Cal 4.24 3.34 9.48
glm-130b $ 25.94 1.27 2.74
glm-10b Cal 14.96 2.30 3.74
flan-t5-xxl-11b Eqa 3.74 5.78 43.89
flan-t5-xl-3b $ 7.48 3.34 25.19
flan-t5-large-780m Cal 3.74 2.31 2.49
flan-t5-base-250m Eqa 2.49 3.18 14.21

Table 3: Full evaluation on MATH 401 with different LLMs. Prompts are selected via best accuracy.

Grouped Results  To understand the arithmetic ability of LLMs more clearly, we show grouped accuracy in Table 1. GPT-4 obtains first place and ChatGPT second place in all groups. Most LLMs are only capable of addition and subtraction and have some ability for multiplication. Division, exponentiation, trigonometric functions, and logarithm functions are hard for most LLMs. LLMs have some ability to deal with decimal, negative, and irrational numbers. Only GPT-4 and ChatGPT can deal with big numbers (> 1e12) and complex long queries, which demonstrates their generalization and reasoning abilities. GPT-4 shows extremely good ability on long arithmetic expressions.

When will ChatGPT fail?  Although ChatGPT obtains such good performance, we examine when ChatGPT fails to answer. For multiplication (×), ChatGPT passes all queries in Groups 7 and 8 and gets wrong answers for all queries in Group 9. For example, ChatGPT predicts 71786 × 21638 = 1,551,402,068, while the true answer is 1,553,305,468. ChatGPT gives a very close estimation with the correct head and tail, which indicates that ChatGPT does not use a calculator API for math calculation.

For division in Group 10, ChatGPT sometimes gives answers correct to only two decimal places, which are considered incorrect by our metric. As Table 5 shows, requiring ChatGPT to output four decimal places improves its accuracy on multiplication and division.

For exponentiation (∧), ChatGPT correctly answers all queries in Group 11, which contains only integer bases. It is too hard for any language model (even ChatGPT) to correctly estimate exponentiation with a decimal base and a decimal exponent. ChatGPT sometimes seems to treat ∗∗ as multiplication; for example, ChatGPT estimates 5.5507 ∗∗ 2.0434 = 10.31554, which is close to 5.5507 × 2.0434 = 11.3423 and far from the answer 33.1895.

For trigonometric functions, ChatGPT understands degrees and radians correctly and generates exact answers for special inputs, such as cos(−210°) = −√3/2 (we omit the explanation generated by ChatGPT here). However, ChatGPT may generate wrong explanations that mislead itself. An example is: "We know that the sine function is periodic with a period of 2π, which means that sin(x + 2π) = sin(x) for any value of x. Therefore, we can subtract multiples of 2π from −3.75π until we get a value between 0 and 2π: −3.75π = −3π − 0.75π = −9.42477 − 2.35619 = −11.78096. Adding 2π, we get: −11.78096 + 2π = −9.42477 etc." Any mistake in the explanation may result in a wrong answer.
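For reference, the correct value follows from a single application of periodicity: sin(−3.75π) = sin(−3.75π + 4π) = sin(0.25π) = √2/2 ≈ 0.7071, which matches the reference answer listed in Appendix A.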
For logarithm functions, we find that ChatGPT is capable of using the change-of-base formula and predicts answers correct to about two decimal places.

For long expressions, ChatGPT understands operator priorities. ChatGPT sometimes generates answers step by step and sometimes generates answers directly; it is very likely to produce wrong answers when it decodes the answer directly.

What about GPT-4?  For big-number multiplication (×) in Group 9, GPT-4 also fails in all cases, with problems similar to those of ChatGPT.

For exponentiation (∧), GPT-4 no longer treats ∗∗ as × and gives better estimations.

For expressions with irrational numbers, GPT-4 correctly interprets e as Euler's number (the base of the natural logarithm).

For logarithm functions, GPT-4 calculates logarithms with base e and base 10 by "using a calculator" (this is a message generated by GPT-4). GPT-4 calculates logarithms with base 2 by the change-of-base formula and generates approximate results.

For long expressions, GPT-4 solves all of them step by step and obtains a much higher accuracy.

We compare and summarize how GPT-4 outperforms ChatGPT here:

• Better division ability.
• Better trigonometry ability.
• Understands irrational numbers properly.
• Always calculates long expressions step by step.

4.2 Tokenization

Arithmetic expressions contain special tokens, including π, ×, ÷, and ◦, which are not in the vocabularies of the T5-series models (i.e., T0++ and Flan-T5). T0++-11B (Acc 4.24 and RE 3.34) and Flan-T5-xxl-11B (Acc 3.74 and RE 5.78) perform badly on arithmetic tasks compared to other similar-size models: OPT-13B (Acc 15.21 and RE 2.19) and LLaMA-13B (Acc 27.68 and RE 2.40).

We notice that Galactica and LLaMA split numbers into individual tokens; for example, 123.456 is converted into 1 2 3 . 4 5 6. Razeghi et al. (2022) show that arithmetic ability is related to pre-training term frequencies: for tokens that appear more often in pre-training, LLMs have better accuracy on arithmetic expressions involving them. Number tokens with more digits (e.g., 23) naturally appear less often than single-digit tokens (e.g., 2 and 3). Splitting numbers into individual tokens avoids multi-digit number tokens entirely and makes all single-digit tokens (mainly 0∼9) appear in the pre-training corpus at the same order of magnitude. Galactica-30B and LLaMA-30B obtain 45.14 and 30.17 accuracy (listed in Table 3), outperforming OPT-30B (15.96), Bloom-176B (20.2), and GLM-130B (25.94), which shows the superiority of digit-level tokenization.

4.3 Training

Self-supervised Pre-training  Code corpora and LaTeX sources seen during pre-training plausibly relate to arithmetic ability, since both contain arithmetic operators and numbers. Code-davinci-002 is pretrained on a code corpus and performs well on many reasoning-related tasks (Zhou et al., 2022); however, it does not perform well compared to other LLMs in arithmetic. This suggests that mathematical reasoning ability is different from arithmetic ability, which requires a deep understanding of numbers. Galactica, pretrained with numerous LaTeX sources, outperforms the other LLMs except InstructGPT and ChatGPT, which suggests that LaTeX sources are useful.

Instruction Tuning is also very important for arithmetic ability. Comparing OPT-30B (Acc 15.96, RE 2.28, NNR 11.22) with OPT-IML-Max-30B (Acc 17.46, RE 1.52, NNR 6.23), Bloom (Acc 20.2, RE 2.6, NNR 18.45) with BloomZ (Acc 22.44, RE 1.5, NNR 4.74), and code-davinci-002 (Acc 21.7) with text-davinci-002 (Acc 42.89) in Table 3 shows that instruction tuning can boost performance on all metrics. Text-davinci-003 (RLHF) outperforms text-davinci-002 (SFT) on arithmetic tasks, which shows RLHF is important for building arithmetic ability.

4.4 Prompts

Input Prompts  We find the best prompts differ across LLMs. We list the best and worst prompts for each LLM in Table 4. Models are sensitive to input prompts, and using no prompt is the worst option for most LLMs. For InstructGPT and ChatGPT, using "Calculate:" as the prompt performs best. For other LLMs, LaTeX-related prompts perform best.
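As an illustration of how these prompt wrappers can be applied to an expression, here is a small sketch; the exact spacing and the closing \end{equation} tag are assumptions on top of the prompt strings listed in §3.3:

    PROMPTS = {
        "Cal": "Calculate: {expr}",
        "$":   "${expr}$",
        "$$":  "$${expr}$$",
        "Eqa": "\\begin{{equation}}{expr}\\end{{equation}}",
    }

    def build_query(expr_math: str, expr_latex: str, prompt: str) -> str:
        # LaTeX-related prompts receive LaTeX text (\pi); the others receive math text (π).
        expr = expr_latex if prompt in ("$", "$$", "Eqa") else expr_math
        return PROMPTS[prompt].format(expr=expr)

    print(build_query("e + π", "e + \\pi", "Eqa"))
    # \begin{equation}e + \pi\end{equation}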
Model Best Acc Worst Acc
gpt-3.5-turbo-0301 Cal* 75.06 $$ 64.59
text-davinci-003 Cal 56.61 Eqa 43.64
galactica-120b Eqa 45.14 ∅ 38.90
llama-65b Eqa 28.43 Cal 4.74
opt-175b Cal 21.70 ∅ 15.21
gpt-neox-20b Eqa 35.41 ∅ 26.93
glm-130b $ 25.94 ∅ 22.44
bloomz-176b $$ 22.44 ∅ 11.72

Table 4: Best and worst prompts for different LLMs, with the corresponding accuracy.

System Prompts  For ChatGPT, we can also provide a system-level message as an instruction prompt. Table 5 shows that providing a system-level message improves ChatGPT's accuracy and reduces its relative error significantly. The groups that differ most are group 13 (irrational numbers) and group 16 (logarithm functions). Without a system-level message, ChatGPT thinks e could be Euler's number or a variable and cannot give an answer. For logarithm functions, ChatGPT tries to explain how it calculates, which may mislead our parser. We notice that if we require ChatGPT to output results to four decimal places, its non-number ratio drops to zero. To conclude, ChatGPT tries to explain the calculation procedure when no system-level prompt is given and only provides the answer when one is given.

Group Cal Cal* Cal*4
(Acc RE / Acc RE / Acc RE)
0 Euler 100 .00 100 .00 100 .00
1∼6 +− 97 .00 96 .00 93 .01
7∼10 ×÷ 69 .20 69 .01 71 .01
11∼12 ∧ 50 .24 50 .32 50 .27
13 Irr. 64 1.73 72 .56 84 .11
14 Long 68 .19 64 .46 60 .59
15 Tri. 44 1.21 48 .96 44 1.40
16 Log 56 .80 60 .04 56 .01
Overall 74 .33 75 .14 74 .14

Table 5: Comparing different system prompts for ChatGPT on MATH 401 (each column pair reports Acc and RE). Cal means no system prompt; * adds "You are an accurate calculator."; *4 additionally asks for results to four decimal places.

4.5 Interpolation and Extrapolation

LLMs have strong abilities to fit in-domain data. If the pretraining corpora contain arithmetic expressions, it is easy for LLMs to memorize them; for out-of-domain data, LLMs need to extrapolate how to calculate them. We do not know what counts as in-domain or out-of-domain data for these models (especially ChatGPT), so it is hard to test their interpolation and extrapolation abilities directly. We therefore use the Easy group and the Hard group as estimates. Easy queries may well appear in pretraining corpora or instruction tuning, while Hard queries contain big numbers, decimal numbers, or long expressions that are very unlikely to be covered by pretraining corpora or instructions. Thus, answering Easy queries mainly examines the interpolation ability of a model, while answering Hard queries examines its extrapolation ability. We find that ChatGPT performs best on Hard queries, while all other models have limited performance on Hard queries, which indicates limited extrapolation.

[Figure 1: Performance on MATH 401 for LLMs of different sizes. We do not know the parameter count of ChatGPT. We list InstructGPT results in the SFT setting (text-davinci-002) only, for a fair comparison.]

4.6 Scaling Laws

To understand how parameter counts influence arithmetic ability, we plot the results for different-size LLMs in Figure 1. We do not plot text-davinci-003, gpt-3.5-turbo-0301, or gpt-4, since they do not have smaller versions with the same setting. We find that LLMs have better arithmetic abilities with larger parameter counts. An interesting phenomenon is that models over 30B do not improve significantly compared with 30B models; in particular, the Galactica 120B model performs the same as the 30B model. We hypothesize that 30B parameters may be enough for arithmetic ability. ChatGPT may be a model smaller than 175B that nevertheless outperforms the other 175B models by a large margin, so a larger parameter count does not guarantee better arithmetic ability. For GPT-4 we cannot make any reasonable guess about its size; considering its much slower inference speed, we suspect it has a larger parameter count than ChatGPT and obtains better reasoning ability (i.e., on long arithmetic expressions).
4.7 Chain-of-Thought

LLMs can leverage chain-of-thought to better answer math word problems (Wei et al., 2022b). We test on ChatGPT whether chain-of-thought improves arithmetic calculation. We use the prompt "Let us solve this equation step by step" to instruct ChatGPT for zero-shot COT (Kojima et al., 2022). We compare the results of zero-shot COT and of using "Calculate:" in Table 7. Surprisingly, we find that COT does not improve the performance of any group, not even group 14 with long arithmetic expressions. To understand the reason for this phenomenon, we check the decoded results for the two prompts in Table 6. We find that the "Calculate:" prompt already makes ChatGPT generate chains of thought for long arithmetic expressions while answering easy questions directly.

Expression: (88 - 42) * (33 * 41) - (13 - 27) = 62252

With "Calculate:": "First, we need to simplify the expression inside the parentheses: 88 - 42 = 46; 33 * 41 = 1353; 13 - 27 = -14. Now we can substitute these values into the original expression: (46) * (1353) - (-14). Multiplying 46 and 1353 gives: 62,238. Remember that subtracting a negative number is the same as adding a positive number, so we can rewrite the expression as: 62,238 + 14. Adding 14 gives: 62,252. Therefore, the answer is: 62,252"

With zero-shot COT: "First, we simplify the parentheses: (88 - 42) * (33 * 41) - (13 - 27) = 46 * 1353 - (-14). Next, we simplify the second set of parentheses by changing the subtraction into addition of the opposite: = 46 * 1353 + 14. We can then multiply 46 and 1353: = 62338 + 14. Finally, we add: = 62352. Therefore, (88 - 42) * (33 * 41) - (13 - 27) = 62352."

Table 6: Example of a Group 14 query decoded by ChatGPT with the Calculate and zero-shot COT prompts.

Group Cal 0-shot COT
(Acc RE / Acc RE)
0 Euler 100 .00 100 .00
1∼6 +− 97 .00 94 .02
7∼10 ×÷ 69 .20 61 .66
11∼12 ∧ 50 .24 48 .56
13 Irr. 64 1.73 28 4.89
14 Long 68 .19 64 .46
15 Tri. 44 1.21 40 1.14
16 Log 56 .80 28 5.37
Overall 74 .33 66 .98

Table 7: Comparing zero-shot COT and "Calculate:" using ChatGPT on MATH 401.

4.8 In-context Learning

In-context learning (ICL) provides related question-answer pairs to improve LLMs (Brown et al., 2020; Wei et al., 2022b). In our task, we can provide similar arithmetic expressions before the query to help the model understand the arithmetic operators, as done in Smith et al. (2022). We provide 8 similar cases (guaranteed to be different from the query) for each query. We test whether ICL can improve a well-performing model (Galactica) and an underperforming model (Flan-T5). For Galactica, ICL does not improve accuracy but reduces relative error significantly. Small Flan-T5 models (smaller than 3B) cannot generate any number under the in-context learning setting.

Model Naive ICL
(Acc RE / Acc RE)
galactica-120b 45.14 1.30 45.14 0.42
galactica-6.7b 34.41 2.61 32.67 0.65
flan-t5-xxl 3.74 5.78 0.00 10.0
flan-t5-base 2.49 3.18 0.00 10.0

Table 8: In-context learning on MATH 401.
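For illustration, the 8 demonstrations can be prepended to a query roughly as follows; the demonstration format, separator, and the "Calculate:" prefix are assumptions, and the demonstration pool is invented for the example:

    import random

    def build_icl_prompt(query: str, pool, k: int = 8) -> str:
        # Prepend k solved expressions (none equal to the query) before the real query.
        demos = random.sample([p for p in pool if p[0] != query], k)
        lines = [f"Calculate: {expr} = {ans}" for expr, ans in demos]
        lines.append(f"Calculate: {query} =")
        return "\n".join(lines)

    pool = [("12 + 7", "19"), ("55 - 21", "34"), ("3 + 90", "93"), ("71 - 44", "27"),
            ("18 + 64", "82"), ("9 + 38", "47"), ("80 - 15", "65"), ("26 + 31", "57"),
            ("40 - 13", "27")]
    print(build_icl_prompt("21 + 97", pool))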

5 Conclusion

In this paper, we propose MATH 401 to evaluate the arithmetic ability of LLMs. We find that tokenization, the pre-training corpus, prompts, and model parameter counts are all important for arithmetic ability. The reason ChatGPT performs so well in arithmetic remains partly a mystery, namely its parameter count and instruction datasets. We hope this paper can help readers build LLMs with better arithmetic ability. This paper focuses only on arithmetic; testing LLMs on other math topics, including symbolic mathematics, solving (ordinary and partial differential) equations, calculus, algebra, geometry, probability theory, and graph theory, would also be interesting.

References

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. ArXiv, abs/2005.14165.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, David W. Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William H. Guss, Alex Nichol, Igor Babuschkin, S. Arun Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M. Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. ArXiv, abs/2107.03374.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and J J Berner. 2023. Mathematical capabilities of ChatGPT. ArXiv, abs/2301.13867.

Vedant Gaur and Nikunj Saunshi. 2022. Symbolic math reasoning with language models. In 2022 IEEE MIT Undergraduate Research Technology Conference (URTC), pages 1–5.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

Jeonghwan Kim, Giwon Hong, Kyung min Kim, Junmo Kang, and Sung-Hyon Myaeng. 2021. Have you seen that number? Investigating extrapolation in question answering models. In Conference on Empirical Methods in Natural Language Processing.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. ArXiv, abs/2205.11916.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2022. Crosslingual generalization through multitask finetuning.

Matteo Muffo, Aldo Cocco, and Enrico Bertino. 2022. Evaluating transformer language models on arithmetic operations using number decomposition. In International Conference on Language Resources and Evaluation.

Rodrigo Nogueira, Zhiying Jiang, and Jimmy J. Li. 2021. Investigating the limitations of the transformers with simple arithmetic tasks. ArXiv, abs/2102.13019.

Kimia Noorbakhsh, Modar Sulaiman, Mahdi Sharifi, Kallol Roy, and Pooyan Jamshidi. 2021. Pretrained language models are symbolic mathematics solvers too! arXiv preprint arXiv:2110.03501.

OpenAI. 2023. GPT-4 technical report.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Stanislas Polu and Ilya Sutskever. 2020. Generative language modeling for automated theorem proving. ArXiv, abs/2009.03393.

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot reasoning. ArXiv, abs/2202.07206.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization.
David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. ArXiv, abs/2302.04761.

Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, and Lakshmivihari Mareedu. 2023. An independent evaluation of ChatGPT on mathematical word problems (MWP).

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language models are multilingual chain-of-thought reasoners. ArXiv, abs/2210.03057.

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Anand Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. ArXiv, abs/2201.11990.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam M. Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Yaguang Li, Hongrae Lee, Huaixiu Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, I. A. Krivokon, Willard James Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Hartz Søraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Díaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravindran Rajakumar, Alena Butryna, Matthew Lamm, V. O. Kuzmina, Joseph Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Huai hsin Chi, and Quoc Le. 2022. LaMDA: Language models for dialog applications. ArXiv, abs/2201.08239.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax.

Cunxiang Wang, Boyuan Zheng, Yuchen Niu, and Yue Zhang. 2021. Exploring generalization ability of pretrained language models on arithmetic and logical reasoning. In Natural Language Processing and Chinese Computing.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903.

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. 2022. Autoformalization with large language models. ArXiv, abs/2205.12615.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, and Jie Tang. 2022. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Huai hsin Chi. 2022. Least-to-most prompting enables complex reasoning in large language models. ArXiv, abs/2205.10625.

A Examples from MATH 401

We list examples for each group from MATH 401.

• e^(iπ) + 1 = 0
• 5 + 9 = 14
• 21 + 97 = 118
• 721 − 847 = −126
• 714637232158 − 667119914538 = 47517317620
• −1 + (−6) = −7
• −0.038 + 0.0092 = −0.0288
• 78 × 64 = 4992
• 5.0 × 0.09 = 0.045
• 45960 × 59693 = 2743490280
• 70 ÷ 61 = 1.1475
• 7^4 = 2401
• 2.242^3.7342 = 20.3865
• e + π = 5.8598
• (4 × 64) × (39 + 12) = 13056
• sin(−3.75π) = 0.7071
• log10(797) = 2.9015
