How Well Do Large Language Models Perform in Arithmetic Tasks?
Zheng Yuan¹  Hongyi Yuan¹,²  Chuanqi Tan¹  Wei Wang¹  Songfang Huang¹
¹Alibaba Group  ²Tsinghua University
{yuanzheng.yuanzhen,chuanqi.tcq,hebian.ww,songfang.hsf}@alibaba-inc.com
yuanhy20@mails.tsinghua.edu.cn
Table 1: Arithmetic ability of LLMs measured by accuracy; only the models with the largest parameter counts are listed. E = Euler, Dec = Decimal, Neg = Negative, Irr = Irrational, Big = Big Numbers, Long = Long Expressions.
et al., 2022). For the newly released LLM ChatGPT, Shakarian et al. (2023) and Frieder et al. (2023) evaluate its mathematical ability independently. Note that our paper evaluates ChatGPT using the gpt-3.5-turbo-0301 version and GPT-4 using the chat UI on March 16th, which may show different performance compared to their reported results and future analyses.

Evaluating the Arithmetic Ability of LLMs. Nogueira et al. (2021) and Wang et al. (2021) evaluate pretrained language models on simple arithmetic expressions including addition (+) and subtraction (−). Muffo et al. (2022) further test the multiplication (×) ability of language models. They found that tokenization (Nogueira et al., 2021; Kim et al., 2021) and token frequency (Razeghi et al., 2022) are two important factors for language model arithmetic ability. Compared to previous work, we focus on evaluating large LMs (with instruction fine-tuning) on comprehensive arithmetic abilities with different types of operators and numbers.

3 Evaluation Settings

3.1 Arithmetic Expression Settings

We construct 401 arithmetic expressions to test large language models, including the Euler equation (e^{iπ} + 1 = 0) as group 0 and 25 problems each for groups 1∼16. If not otherwise mentioned, the numbers used are positive integers.

• Euler equation.
• Add & subtract two integers within 10.
• Add & subtract two integers within 100.
• Add & subtract two integers within 1,000.
• Add & subtract two integers within 1,000,000,000,000.
• Add & subtract two integers within −10∼10.
• Add & subtract two decimal numbers within −100∼100.
• Multiply two integers within 100.
• Multiply two decimal numbers within 10.
• Multiply two integers within 100,000.
• Divide two integers within 100.
• Exponentiation with an integer base within 10 and an integer exponent within 2∼4.
• Exponentiation with a decimal number within 10 as the base and a decimal number within 2∼4 as the exponent.
• Add, subtract & multiply one integer within 10 with a common irrational number (i.e., e or π).
• Long arithmetic expressions with brackets; the involved integers are all within 100, and the operators include addition, subtraction, multiplication, and division.
• Trigonometric functions including sin, cos, and tan. Inputs can be given in degrees or radians (π can also appear in the inputs).
• Logarithms of integers within 1,000 with different bases: 2, e, and 10.
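The group definitions above can be turned into a small sampler. The following Python sketch is our own illustration (the paper does not release generation code) and covers a few representative groups; the remaining groups follow the same pattern:

```python
import random

def sample_expression(group: int, rng: random.Random) -> str:
    # Illustrative sampler for a few of the 17 groups (0-16) described above.
    if group == 0:                      # Group 0: the Euler equation itself
        return "e^{i*pi} + 1"
    if group in (1, 2, 3, 4):           # Groups 1-4: add/subtract positive integers
        bound = {1: 10, 2: 100, 3: 1_000, 4: 1_000_000_000_000}[group]
        a, b = rng.randint(0, bound), rng.randint(0, bound)
        op = rng.choice(["+", "-"])
        return f"{a} {op} {b}"
    if group == 7:                      # Group 7: multiply two integers within 100
        return f"{rng.randint(0, 100)} * {rng.randint(0, 100)}"
    if group == 10:                     # Group 10: divide two integers within 100
        return f"{rng.randint(0, 100)} / {rng.randint(1, 100)}"
    raise NotImplementedError("the remaining groups follow the same pattern")

rng = random.Random(0)
print([sample_expression(g, rng) for g in (1, 7, 10)])
```

Fixing the random seed makes the benchmark reproducible, which matters when comparing models on the same 401 expressions.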
These groups cover the mathematical operators used in elementary mathematics. We consider groups 1, 2, 3, 5, 6, 7, 8, and 11 as Easy queries and all others as Hard queries. We calculate the results of all arithmetic expressions using built-in functions of Python and round to four decimal places. Examples of expressions are listed in Appendix A.

Model                Prompt  Acc ↑  RE ↓   NNR ↓
gpt-4                Cal*    83.54   0.07   0.00
gpt-3.5-turbo-0301   Cal*    75.06   0.14   0.50
text-davinci-003     Cal     56.61   0.76   2.99
code-davinci-002     Eqa     21.70   2.39  11.47
galactica-120b       Eqa     45.14   1.30   3.99
galactica-30b        Eqa     45.14   0.69   1.75
llama-65b            Eqa     28.43   1.61   4.74
opt-175b             Cal     21.70   3.18  21.70
gpt-neox-20b         Eqa     35.41   1.19   4.49
glm-130b             $       25.94   1.27   2.74
bloomz-176b          $$      22.44   1.50   4.74
bloom-176b           $       20.20   2.60  18.45
T0++-11b             Cal      4.24   3.34   9.48
flan-t5-xxl-11b      Eqa      3.74   5.78  43.89
flan-t5-xl-3b        $        7.48   3.34  25.19

Table 2: Evaluation on MATH 401 with different LLMs. Prompts are selected via best accuracy. Cal means "Calculate:" and Eqa means "\begin{equation}". * means providing an additional system-level message.

3.2 Metrics

Since LLMs can decode arbitrary content (which may contain their step-by-step calculations), we first ignore decoded numbers in parentheses and preserve the last number decoded by the LLM. If the decoded number is a fraction, we convert it to a decimal for evaluation, except for group 10, which requires actually calculating a division. To measure the arithmetic ability of LLMs, we use the following metrics.
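The number-extraction rule just described (drop parenthesized numbers, keep the last number, optionally fold fractions into decimals) might look like this in Python. This is our reconstruction, not the authors' released code:

```python
import re
from typing import Optional

def extract_answer(decoded: str, convert_fraction: bool = True) -> Optional[float]:
    """Keep the last number in the model output, per the rule above."""
    # Drop numbers inside parentheses (often intermediate step-by-step work).
    text = re.sub(r"\([^)]*\)", " ", decoded)
    # A fraction like 70/61 may be folded into a decimal (disabled for
    # group 10, where the query itself asks for the division to be done).
    frac = re.findall(r"(-?\d+(?:\.\d+)?)\s*/\s*(-?\d+(?:\.\d+)?)", text)
    if frac and convert_fraction:
        num, den = map(float, frac[-1])
        return num / den
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

print(extract_answer("21 + 97 (which is 21 plus 97) = 118"))  # 118.0
```

Returning `None` when no number is found feeds directly into the non-number ratio defined below in the paper.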
Accuracy. If the difference between the decoded number and the target number is less than 1e−3, we consider it a correct prediction. Accuracy is calculated from the count of correct predictions.

Relative error. We denote the decoded number as ŷ and the target as y. We calculate the relative error by:

    RE = min(10, |ŷ − y| / max(|y|, 1))    (1)

If the LLM does not decode any number, we set RE = 10. We truncate the relative error at 10 to prevent one big mistake from dominating the average relative error.

Non-number ratio. If the decoded content does not contain any number, we consider it a failure. We calculate the non-number ratio from the count of such failures.
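The three metrics, including the relative-error cap from Eq. (1), are straightforward to implement. The function and variable names below are our own:

```python
def score(preds, targets, eps=1e-3):
    """Accuracy, truncated relative error (Eq. 1), and non-number ratio."""
    correct, re_sum, non_number = 0, 0.0, 0
    for p, t in zip(preds, targets):
        if p is None:                  # model decoded no number at all
            non_number += 1
            re_sum += 10.0             # RE is set to its cap of 10
            continue
        if abs(p - t) < eps:
            correct += 1
        re_sum += min(10.0, abs(p - t) / max(abs(t), 1.0))
    n = len(targets)
    return correct / n, re_sum / n, non_number / n

acc, re_, nnr = score([14.0, None, 4992.0], [14.0, 7.0, 4992.0])
print(acc, re_, nnr)
```

Note how a missing number is penalized twice, as the paper specifies: it counts toward NNR and contributes the maximal RE of 10.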
3.3 Evaluation Details

We test GPT-4 via its official chat UI.³ Since GPT-4 has limited request counts, we only query GPT-4 with the groups that ChatGPT cannot answer correctly. We test GPT-3.5 (including the davinci (CodeX, InstructGPT) and turbo (ChatGPT) series models) (Ouyang et al., 2022; Chen et al., 2021) via OpenAI APIs. We also test the following open-sourced LLMs: Galactica (Taylor et al., 2022), GPT from EleutherAI (Wang and Komatsuzaki, 2021; Black et al., 2022), LLaMA (Touvron et al., 2023), OPT (with instruction learning) (Zhang et al., 2022), Bloom (with instruction learning) (Scao et al., 2022; Muennighoff et al., 2022), T0++ (Sanh et al., 2021), GLM (Zeng et al., 2022), and Flan-T5 (Chung et al., 2022). We also test the smaller versions of the above models.

We test the following prompts: ∅ (i.e., no prompt), "Calculate:", "$", "$$", and "\begin{equation}". The latter three prompts are inspired by the fact that LLMs may be pretrained with LaTeX sources. We provide three versions of the input format: math text (π), plain text (pi), and LaTeX text (\pi). When we use LaTeX-related prompts, we provide the model with LaTeX text. When we use other prompts, we provide math text if the model's tokenizer can encode it; otherwise, we provide plain text. For ChatGPT (gpt-3.5-turbo-0301), we test different system-level prompts as instructions: ∅ (i.e., no prompt), "You are an accurate calculator.", and "You are an accurate calculator, please calculate provided equation to four decimal places.". For GPT-4, we only test the prompt "You are an accurate calculator, please calculate provided equation to four decimal places.".

We use default decode settings for OpenAI APIs, and we use greedy decoding for all other LLMs.

³ https://chat.openai.com/chat?model=gpt-4
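A query to ChatGPT as described above combines the system-level instruction with the input prompt. The sketch below only builds the request payload (our reconstruction; the paper does not include client code, and the "=" suffix is our own formatting assumption):

```python
# The system-level message is the "accurate calculator" instruction
# quoted in the text above.
SYSTEM = ("You are an accurate calculator, please calculate "
          "provided equation to four decimal places.")

def build_request(expression: str, model: str = "gpt-3.5-turbo-0301") -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Calculate: {expression}="},
        ],
        # No temperature/top_p set: the paper uses default decode settings.
    }

req = build_request("71786 * 21638")
print(req["messages"][1]["content"])
```

The resulting payload can then be sent with the chat-completion endpoint of the 2023-era OpenAI Python client.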
4 Results and Analysis

4.1 Results

Overall Results. Tables 1, 2, and 3 show the results of different LLMs on MATH 401. We find GPT-4 and ChatGPT outperform all other models by a large margin.⁴ GPT-4 surpasses ChatGPT by 10 accuracy points and halves the relative error. InstructGPT performs third measured by accuracy, and Galactica-30B performs third measured by relative error. Compared to models proposed before InstructGPT (text-davinci-003), the GPT series applies Reinforcement Learning from Human Feedback (RLHF), which may enhance arithmetic ability significantly. Galactica is pre-trained with massive LaTeX source code, which could be the reason why it performs well in arithmetic.

⁴ OpenAI states they have improved the mathematical abilities of ChatGPT since the January 30 version, and we cannot evaluate any previous version.

Model               Prompt  Acc ↑  RE ↓   NNR ↓
gpt-4               Cal*    83.54   0.07   0.00
gpt-3.5-turbo-0301  Cal*    75.06   0.14   0.50
text-davinci-003    Cal     56.61   0.76   2.99
text-davinci-002    Cal     42.89   2.13  15.96
text-curie-001      Cal     11.47   1.92   6.48
text-babbage-001    Eqa      5.24   2.59   5.74
code-davinci-002    Eqa     21.70   2.39  11.47
galactica-120b      Eqa     45.14   1.30   3.99
galactica-30b       Eqa     45.14   0.69   1.75
galactica-6.7b      Cal     34.41   2.61   8.73
llama-65b           Eqa     28.43   1.61   4.74
llama-30b           Eqa     30.17   1.72   3.74
llama-13b           $       27.68   2.40   9.73
llama-7b            $$      21.95   2.11   7.48
opt-175b            Cal     21.70   3.18  21.70
opt-66b             ∅       20.70   2.66  18.70
opt-iml-max-30b     Cal     17.46   1.52   6.23
opt-30b             ∅       15.96   2.28  11.22
opt-13b             ∅       15.21   2.19  10.97
opt-6.7b            Cal     14.46   1.46   4.24
gpt-neox-20b        Eqa     35.41   1.19   4.49
gpt-j-6b            Cal     27.18   1.55   8.98
bloomz-176b         $$      22.44   1.50   4.74
bloom-176b          $       20.20   2.60  18.45
bloomz-7b1          $       12.72   2.56  15.46
bloom-7b1           Cal      7.23   2.41   6.48
bloomz-3b           $$       7.98   2.63  12.47
bloom-3b            Cal      4.24   2.41   8.73
bloomz-1b7          Eqa      4.74   4.28  31.17
bloom-1b7           Cal      5.24   2.54  11.22
T0++-11b            Cal      4.24   3.34   9.48
glm-130b            $       25.94   1.27   2.74
glm-10b             Cal     14.96   2.30   3.74
flan-t5-xxl-11b     Eqa      3.74   5.78  43.89
flan-t5-xl-3b       $        7.48   3.34  25.19
flan-t5-large-780m  Cal      3.74   2.31   2.49
flan-t5-base-250m   Eqa      2.49   3.18  14.21

Table 3: Full evaluation on MATH 401 with different LLMs. Prompts are selected via best accuracy.

Grouped Results. To clearly understand the arithmetic ability of LLMs, we show grouped accuracy in Table 1. GPT-4 obtains first place and ChatGPT second place for all groups. Most LLMs are only capable of addition and subtraction and have some ability for multiplication. Division, exponentiation, trigonometric functions, and logarithmic functions are hard for most LLMs. LLMs have some ability to deal with decimal, negative, and irrational numbers. Only GPT-4 and ChatGPT can deal with big numbers (> 1e12) and complex long queries, which demonstrates their generalization and reasoning abilities. GPT-4 shows extremely good ability on long arithmetic expressions.

When will ChatGPT fail? Though ChatGPT achieves such good performance, we examine when it fails to answer. For multiplication (×), ChatGPT passes all queries in groups 7 and 8 and gets wrong answers for all queries in group 9. For example, ChatGPT predicts 71786 × 21638 = 1,551,402,068, while the true answer is 1,553,305,468. ChatGPT gives a very close estimate with the correct head and tail, which shows that ChatGPT does not use a calculator API for math calculation.

For division in group 10, ChatGPT sometimes gives answers correct to only two decimal places, which are considered incorrect by our metric. As shown in Table 5, requiring ChatGPT to output four decimal places improves its accuracy in multiplication and division.

For exponentiation (∧), ChatGPT correctly answers all queries in group 11, which contains only integer bases. It is too hard for any language model (even ChatGPT) to correctly estimate an exponentiation with a decimal number as the base and a decimal number as the exponent. ChatGPT sometimes treats ∗∗ as multiplication: for example, it estimates 5.5507 ∗∗ 2.0434 = 10.31554, which is close to 5.5507 × 2.0434 = 11.3423 and far from the answer 33.1895.

For trigonometric functions, ChatGPT understands degrees and radians correctly and generates exact answers for special inputs like cos(−210°) = −√3/2 (we omit the explanation generated by ChatGPT here). However, ChatGPT may generate wrong explanations that mislead itself. An example: "We know that the sine function is periodic with a period of 2π, which means that sin(x + 2π) = sin(x) for any value of x. Therefore, we can subtract multiples of 2π from −3.75π until we get a value between 0 and 2π: −3.75π = −3π − 0.75π = −9.42477 − 2.35619 = −11.78096. Adding 2π, we get: −11.78096 + 2π = −9.42477, etc." Any mistake in the explanation may result in a wrong answer.
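ChatGPT's periodicity argument above goes wrong arithmetically; the correct reduction is simple, sin(−3.75π) = sin(−3.75π + 4π) = sin(0.25π) ≈ 0.7071, and easy to verify:

```python
import math

x = -3.75 * math.pi
# Reduce by whole periods of 2*pi (here adding 4*pi) before evaluating.
reduced = x + 4 * math.pi            # = 0.25 * pi
assert math.isclose(math.sin(x), math.sin(reduced), abs_tol=1e-9)
print(round(math.sin(x), 4))         # 0.7071
```

This matches the ground truth listed for this query in Appendix A of the paper.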
For logarithmic functions, we find that ChatGPT is capable of using the change-of-base formula and predicts answers correct to within two decimal places.

For long expressions, ChatGPT can understand operator priorities. ChatGPT sometimes generates answers step by step and sometimes generates answers directly. It is very likely to generate wrong answers when it decodes answers directly.

What about GPT-4? For big-number multiplication (×) in group 9, GPT-4 also fails in all cases, with problems similar to ChatGPT's. For exponentiation (∧), GPT-4 no longer treats ∗∗ as × and gives better estimates. For calculating expressions with irrational numbers, GPT-4 correctly treats e as the base of the natural logarithm. For logarithmic functions, GPT-4 calculates logarithms base e and 10 by "using a calculator" (this is a message generated by GPT-4), and calculates logarithms base 2 by the change-of-base formula, generating approximate results. For long equations, GPT-4 solves all equations step by step and obtains a much higher accuracy.

We compare and summarize how GPT-4 outperforms ChatGPT here:

• Better division ability.
• Better trigonometry ability.
• Understands irrational numbers properly.
• Always calculates long expressions step by step.
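The change-of-base identity that GPT-4 reportedly falls back on for base-2 logarithms is log₂(x) = ln(x)/ln(2); a quick check (our own illustration):

```python
import math

def log2_via_change_of_base(x: float) -> float:
    # log_2(x) = ln(x) / ln(2), the identity GPT-4 applies per the text above
    return math.log(x) / math.log(2.0)

print(round(log2_via_change_of_base(1000), 4))  # 9.9658
```

Group 16 queries use integers within 1,000 and bases 2, e, and 10, so this identity covers the base-2 cases exactly.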
4.2 Tokenization

Arithmetic expressions contain special tokens, including π, ×, ÷, and ◦, which are not in the vocabulary of T5-series models (i.e., T0++ and Flan-T5). T0++-11B (Acc 4.24 and RE 3.34) and Flan-T5-xxl-11B (Acc 3.74 and RE 5.78) perform badly on arithmetic tasks compared to other similar-size models: OPT-13B (Acc 15.21 and RE 2.19) and LLaMA-13B (Acc 27.68 and RE 2.40).

We notice that Galactica and LLaMA split numbers into individual tokens. For example, 123.456 is converted into 1 2 3 . 4 5 6. Razeghi et al. (2022) show that arithmetic ability is related to pre-training term frequencies: for tokens that appear more often in pre-training, LLMs achieve better accuracy on arithmetic expressions involving them. Number tokens with more digits (e.g., 23) naturally appear less often than single-digit tokens (e.g., 2 and 3). Splitting numbers into individual tokens avoids multi-digit number tokens altogether and makes all single-digit tokens (mainly 0∼9) appear in the pre-training corpus at the same order of magnitude. Galactica-30B and LLaMA-30B obtain 45.14 and 30.17 accuracy (listed in Table 3), outperforming OPT-30B (15.96), Bloom-176B (20.20), and GLM-130B (25.94), which shows the superiority of digit-level tokenization.

4.3 Training

Self-supervised Pre-training. During pre-training, code corpora and LaTeX sources are plausible contributors to arithmetic ability, since both contain arithmetic operators and numbers. Code-davinci-002 is pretrained on a code corpus. It performs well on many reasoning-related tasks (Zhou et al., 2022); however, it does not perform well in arithmetic compared to other LLMs. This suggests that mathematical reasoning ability is different from arithmetic ability, which requires understanding numbers deeply. Galactica, pretrained on numerous LaTeX sources, outperforms all other LLMs except InstructGPT and ChatGPT, which shows that LaTeX is useful.

Instruction Tuning is also very important for arithmetic ability. Comparing OPT-30B (Acc 15.96, RE 2.28, NNR 11.22) with OPT-IML-Max-30B (Acc 17.46, RE 1.52, NNR 6.23), Bloom (Acc 20.20, RE 2.60, NNR 18.45) with BloomZ (Acc 22.44, RE 1.50, NNR 4.74), and code-davinci-002 (Acc 21.70) with text-davinci-002 (Acc 42.89) in Table 3 shows that instruction tuning can boost performance on all metrics. Text-davinci-003 (RLHF) outperforms text-davinci-002 (SFT) on arithmetic tasks, which shows that RLHF is important for building arithmetic ability.

4.4 Prompts

Input Prompts. We find the best prompts differ across LLMs. We list the best and worst prompts for each LLM in Table 4. We find models are sensitive to input prompts, and using no prompt is the worst option for most LLMs. For InstructGPT and ChatGPT, using "Calculate:" as the prompt performs best. For other LLMs, LaTeX-related prompts perform best.

System Prompts. For ChatGPT, we can also provide system-level messages as instruction prompts. Table 5 shows that providing system-level messages improves ChatGPT's accuracy and reduces relative error.
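The five input prompts and three input renderings described in this subsection combine as follows. The paper gives the prompt strings but not this code, so the helper names below are our own:

```python
# Prompt templates tested in the paper; the dollar prompts and
# "\begin{equation}" wrap the expression in LaTeX-style markers.
PROMPTS = {
    "none": lambda e: e,
    "Cal":  lambda e: "Calculate: " + e,
    "$":    lambda e: "$" + e + "$",
    "$$":   lambda e: "$$" + e + "$$",
    "Eqa":  lambda e: "\\begin{equation}" + e,   # used as a prefix
}

# Three renderings of the same expression: math text, plain text, LaTeX.
FORMATS = {"math": "e + π", "plain": "e + pi", "latex": "e + \\pi"}
LATEX_PROMPTS = {"$", "$$", "Eqa"}

def make_input(prompt: str, fmt: str) -> str:
    # LaTeX-related prompts always receive the LaTeX rendering, per the paper.
    expr = FORMATS["latex"] if prompt in LATEX_PROMPTS else FORMATS[fmt]
    return PROMPTS[prompt](expr)

print(make_input("Cal", "math"))   # Calculate: e + π
```

Whether math text or plain text is used for the non-LaTeX prompts depends on the model's tokenizer, as Section 3.3 explains.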
Model               Best  Acc    Worst  Acc
gpt-3.5-turbo-0301  Cal*  75.06  $$     64.59
text-davinci-003    Cal   56.61  Eqa    43.64
galactica-120b      Eqa   45.14  ∅      38.90
llama-65b           Eqa   28.43  Cal     4.74
opt-175b            Cal   21.70  ∅      15.21
gpt-neox-20b        Eqa   35.41  ∅      26.93
glm-130b            $     25.94  ∅      22.44
bloomz-176b         $$    22.44  ∅      11.72

Table 4: Best and worst prompts for each LLM on MATH 401, with the corresponding accuracy.

Table 6: Example of Group 14 decoded by ChatGPT with "Calculate:" and CoT prompts.
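The two prompting modes Table 6 compares differ only in the instruction; zero-shot CoT (Kojima et al., 2022) swaps in a step-by-step cue. A minimal sketch (the "=" suffix is our own formatting assumption):

```python
def arithmetic_prompt(expression: str, cot: bool = False) -> str:
    # Zero-shot chain-of-thought cue used in Section 4.7 of the paper.
    if cot:
        return f"Let us solve this equation step by step. {expression}="
    return f"Calculate: {expression}="

print(arithmetic_prompt("(3 + 4) * 5", cot=True))
```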
parameter counts than ChatGPT and obtain better reasoning ability (i.e., on long arithmetic expressions).

4.7 Chain-of-Thought

LLMs can leverage chain-of-thought to better answer math word problems (Wei et al., 2022b). We test on ChatGPT whether chain-of-thought improves arithmetic calculation. We use the prompt "Let us solve this equation step by step" to instruct ChatGPT for zero-shot CoT (Kojima et al., 2022). We compare the results of zero-shot CoT and of using "Calculate:" in Table 7. Surprisingly, we find that CoT does not improve the performance of any group, not even group 14 with its long arithmetic expressions. To understand the reason for this phenomenon, we check the decoded results for the two prompts in Table 6. We find that using "Calculate:" as the prompt already generates chains of thought automatically for long arithmetic expressions, and generates answers directly for easy questions.

Group       Calculate     CoT
            Acc    RE     Acc    RE
0 Euler     100    .00    100    .00
1∼6 +−       97    .00     94    .02
7∼10 ×÷      69    .20     61    .66
11∼12 ∧      50    .24     48    .56
13 Irr.      64   1.73     28   4.89
14 Long      68    .19     64    .46
15 Tri.      44   1.21     40   1.14
16 Log       56    .80     28   5.37
Overall      74    .33     66    .98

Table 7: Comparing zero-shot CoT and "Calculate:" using ChatGPT on MATH 401.

4.8 In-context Learning

In-context learning (ICL) provides related question-answer pairs to improve LLMs (Brown et al., 2020; Wei et al., 2022b). In our task, we can provide similar arithmetic expressions before the queries to help the model understand the arithmetic operators, as done in Smith et al. (2022). We provide 8 similar cases (we ensure these cases are different from the query) for each query. We test whether ICL can improve a well-behaved model (Galactica) and an underperforming model (Flan-T5). For Galactica, ICL does not improve accuracy but reduces relative error significantly. For the small-sized Flan-T5 models (smaller than 3B), no number can be generated at all under the in-context-learning setting.

Model           Naive         ICL
                Acc    RE     Acc     RE
galactica-120b  45.14  1.30   45.14   0.42
galactica-6.7b  34.41  2.61   32.67   0.65
flan-t5-xxl      3.74  5.78    0.00  10.00
flan-t5-base     2.49  3.18    0.00  10.00

Table 8: In-context learning on MATH 401.

5 Conclusion

In this paper, we propose MATH 401 to evaluate the arithmetic ability of LLMs. We find that tokenization, pre-training corpus, prompts, and model parameter counts are important for arithmetic ability. Why ChatGPT performs so well in arithmetic remains partly a mystery, since the parameter count and instruction datasets of ChatGPT are unknown. We hope this paper can help readers improve LLMs with better arithmetic ability. This paper focuses only on arithmetic; testing LLMs on other mathematical topics, including symbolic mathematics, solving (ordinary or partial differential) equations, calculus, algebra, geometry, probability theory, and graph theory, is also interesting future work.

References

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. 2020. Language models are few-shot learners. ArXiv, abs/2005.14165.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, et al. 2021. Evaluating large language models trained on code. ArXiv, abs/2107.03374.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and J J Berner. 2023. Mathematical capabilities of ChatGPT. ArXiv, abs/2301.13867.

Vedant Gaur and Nikunj Saunshi. 2022. Symbolic math reasoning with language models. In 2022 IEEE MIT Undergraduate Research Technology Conference (URTC), pages 1–5.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.

Jeonghwan Kim, Giwon Hong, Kyung-min Kim, Junmo Kang, and Sung-Hyon Myaeng. 2021. Have you seen that number? Investigating extrapolation in question answering models. In Conference on Empirical Methods in Natural Language Processing.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. ArXiv, abs/2205.11916.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, et al. 2022. Crosslingual generalization through multitask finetuning.

Matteo Muffo, Aldo Cocco, and Enrico Bertino. 2022. Evaluating transformer language models on arithmetic operations using number decomposition. In International Conference on Language Resources and Evaluation.

Rodrigo Nogueira, Zhiying Jiang, and Jimmy J. Li. 2021. Investigating the limitations of the transformers with simple arithmetic tasks. ArXiv, abs/2102.13019.

Kimia Noorbakhsh, Modar Sulaiman, Mahdi Sharifi, Kallol Roy, and Pooyan Jamshidi. 2021. Pretrained language models are symbolic mathematics solvers too! arXiv preprint arXiv:2110.03501.

OpenAI. 2023. GPT-4 technical report.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Stanislas Polu and Ilya Sutskever. 2020. Generative language modeling for automated theorem proving. ArXiv, abs/2009.03393.

Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot reasoning. ArXiv, abs/2202.07206.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, et al. 2021. Multitask prompted training enables zero-shot task generalization.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. arXiv preprint arXiv:1904.01557.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. ArXiv, abs/2302.04761.

Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, and Lakshmivihari Mareedu. 2023. An independent evaluation of ChatGPT on mathematical word problems (MWP).

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2022. Language models are multilingual chain-of-thought reasoners. ArXiv, abs/2210.03057.

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, et al. 2022. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. ArXiv, abs/2201.11990.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam M. Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, et al. 2022. LaMDA: Language models for dialog applications. ArXiv, abs/2201.08239.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model. https://github.com/kingoflolz/mesh-transformer-jax.

Cunxiang Wang, Boyuan Zheng, Yuchen Niu, and Yue Zhang. 2021. Exploring generalization ability of pretrained language models on arithmetic and logical reasoning. In Natural Language Processing and Chinese Computing.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022a. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai-hsin Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903.

Yuhuai Wu, Albert Qiaochu Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. 2022. Autoformalization with large language models. ArXiv, abs/2205.12615.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, et al. 2022. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Huai-hsin Chi. 2022. Least-to-most prompting enables complex reasoning in large language models. ArXiv, abs/2205.10625.

A Examples from MATH 401

We list examples for each group from MATH 401.

• e^{iπ} + 1 = 0
• 5 + 9 = 14
• 21 + 97 = 118
• 714637232158 − 667119914538 = 47517317620
• −1 + (−6) = −7
• 78 × 64 = 4992
• 70 ÷ 61 = 1.1475
• 7^4 = 2401
• 2.242^{3.7342} = 20.3865
• e + π = 5.8598
• sin(−3.75π) = 0.7071
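The listed answers follow the paper's convention of rounding to four decimal places and can be reproduced with Python's built-in arithmetic:

```python
import math

# Ground truths from the examples above; each hand-written expression is
# compared against the listed answer using the paper's 1e-3 tolerance.
examples = {
    "5 + 9": 14,
    "714637232158 - 667119914538": 47517317620,
    "78 * 64": 4992,
    "70 / 61": 1.1475,
    "7 ** 4": 2401,
    "2.242 ** 3.7342": 20.3865,
    "math.e + math.pi": 5.8598,
    "math.sin(-3.75 * math.pi)": 0.7071,
}
for expr, listed in examples.items():
    value = eval(expr)  # trusted, hand-written expressions only
    assert abs(value - listed) < 1e-3, (expr, value)
print("all examples check out")
```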