
Machine assisted proof


Terence Tao
February 10, 2024

∗ The author is a professor of mathematics at the University of California, Los Angeles. His email address is tao@math.ucla.edu.

Mathematicians have relied upon computers (human, mechanical, or electronic) and machines to assist them in their research for centuries (or even millennia, if one considers early calculating tools such as the abacus). For instance, ever since the early logarithm tables of Napier and others, mathematicians have known the value of constructing large data sets of mathematical objects to perform computations and to make conjectures. Legendre and Gauss used extensive tables of prime numbers compiled by human computers to conjecture what is now known as the prime number theorem; a century and a half later, Birch and Swinnerton-Dyer similarly used early electronic computers to generate enough data on elliptic curves over finite fields to propose their own celebrated conjecture on these objects. And many readers have no doubt taken advantage of one of the broadest mathematical data sets of all, the Online Encyclopedia of Integer Sequences, which has generated numerous conjectures and unexpected connections between different areas of mathematics, as well as serving as a valuable mathematical search engine for researchers looking for literature on a mathematical object which they do not know the name of, but which they can associate with a sequence of integers. In the twenty-first century, such large databases also serve as crucial training data for machine learning algorithms, which promise to automate, or at least greatly facilitate, the process of generating conjectures and connections in mathematics.

Besides data generation, another venerable use of computers has been in scientific computation, which is heavily used nowadays to numerically solve differential equations and dynamical systems, or to compute the statistics of large matrices or linear operators. An early example of such computation arose in the 1920s, when Hendrik Lorentz assembled a team of human computers to model the fluid flow around the Afsluitdijk - a major dam then under construction in the Netherlands; among other things, this calculation was notable for pioneering the now-standard device of floating point arithmetic. But modern computer algebra systems (e.g., Magma, SAGEMath, Mathematica, Maple, etc.), as well as more general-purpose programming languages, can go well beyond traditional “number-crunching”; they are now routinely used to perform symbolic computations in algebra, analysis, geometry, number theory, and many other branches of mathematics. Some forms of scientific computation are famously unreliable due to round-off errors and instabilities, but one can often replace these methods with more rigorous substitutes (for instance, replacing floating point arithmetic with interval arithmetic), possibly at the expense of increased runtime or memory usage.

A relative of computer algebra systems is the family of satisfiability (SAT) solvers and satisfiability modulo theories (SMT) solvers, which can perform complex logical deductions of conclusions from certain restricted sets of hypotheses, and generate proof certificates for each such deduction. Of course, satisfiability is an NP-complete problem, so these solvers do not scale past a certain point. Here is a typical example of a result proved using a SAT solver:

Theorem 0.1 (Boolean Pythagorean triples theorem [HKM16]). The set {1, ..., 7824} can be partitioned into two classes, neither of which contains a Pythagorean triple (a, b, c) with a^2 + b^2 = c^2; however, this is not possible for {1, ..., 7825}.

The proof required 4 CPU-years of computation and generated a 200 terabyte propositional proof, which was later compressed to 68 gigabytes.

Computers are of course also used routinely by mathematicians for mundane tasks such as writing papers and communicating with collaborators. But in recent decades, several promising new ways to use computers to assist in mathematical research have emerged:

• Machine learning algorithms can be used to discover new mathematical relationships, or generate potential examples or counterexamples for mathematical problems.

• Formal proof assistants can be used to verify proofs (as well as the output of large language models), allow truly large-scale mathematical collaborations, and help build data sets to train the aforementioned machine learning algorithms.

• Large language models such as ChatGPT can (potentially) be used to make other tools easier and faster to use; they can also suggest proof strategies or related work, and even generate (simple) proofs directly.

Each of these tools has already found niche applications in different areas of mathematics, but what I find particularly intriguing is the possibility of combining these tools together, with one tool counteracting the weaknesses of another. For instance, formal proof assistants and computer algebra packages could filter out the now notorious tendency of large language models to “hallucinate” plausible-looking nonsense, while conversely these models could help automate the more tedious aspects of proof formalization, as well as provide a natural language interface to run complex symbolic or machine learning algorithms. Many of these combinations are still only at the proof-of-concept stage of development, and it will take time for the technology to mature into a truly useful and reliable tool for mathematicians; but the early experiments do seem to be encouraging, and we should expect some surprising demonstrations of new mathematical research modalities in the near future; not the science-fiction conception of a superintelligent AI that can solve complex mathematical problems autonomously, but a valuable assistant that can suggest new ideas, filter out errors, and perform routine case checking, numerical experiments, and literature review tasks, allowing the human mathematicians in the project to focus on the exploration of high level concepts.

1 Proof assistants

The mere fact that a computation was performed using a computer does not, of course, automatically guarantee it is correct. The computation could incur numerical errors, such as those caused by replacing continuous variables or equations with discrete approximations. Bugs can be inadvertently introduced into the code, or the input data may itself contain inaccuracies. Even the compiler that the computer uses to run the code could be flawed. Finally, even if the code executes perfectly, the expression that is correctly computed by the code may not be the expression that one actually wanted for the mathematical argument.

Early computer assisted proofs experienced many of these issues. For instance, the original proof of the four-color theorem [AH89] by Appel and Haken in 1976 revolved around a list of 1834 graphs that needed to obey two properties, called “reducibility” and “unavoidability”. Reducibility could be checked by feeding each graph one at a time into a custom-written piece of software; but unavoidability required a tedious calculation comprising hundreds of pages of microfiche – verified by hand through the heroic efforts of Haken’s daughter Dorothea Blostein, which ended up containing multiple (fixable) errors. In 1994, Robertson, Sanders, Seymour, and Thomas [RSST96] attempted to make the computational component of the Appel–Haken proof fully verifiable by computer, but ended up instead producing a simpler argument (involving just 633 graphs, and an easier procedure to verify unavoidability) that could be verified much more efficiently by computer code written in
any number of standard programming languages.

Proof assistants take this formalization one step further, being a special type of computer language that is designed not to perform purely computational tasks, but to verify the correctness of the conclusion of a logical or mathematical argument. Roughly speaking, each step in a mathematical proof would correspond to some number of lines of code in this language, and the overall code would only compile if the proof was valid. Modern proof assistants, such as Coq, Isabelle, or Lean, intentionally try to mimic the language and structure of mathematical writing, although they are often substantially fussier in many respects. As a simple example, in order to interpret a mathematical expression such as a^b, a formal proof assistant may require one to specify precisely the “type” of the underlying variables a, b (e.g., natural numbers, real numbers, complex numbers), in order to determine which exponentiation operation is being used (which is particularly important for expressions such as 0^0, which have slightly different interpretations under different notions of exponentiation). Much effort has been placed into developing automated tools and extensive libraries of mathematical results to manage these low-level aspects of a formal proof, but in practice the “obvious” parts of a mathematical argument can often take longer to formalize than the “important” parts of the argument. To give just one example: given three sets A1, A2, A3, a mathematician might work with the Cartesian products (A1 × A2) × A3, A1 × (A2 × A3), and ∏_{i∈{1,2,3}} Ai interchangeably, since they are “obviously” the “same” object; but in most formalizations of mathematics, these products are not actually identical, and a formal version of the argument may need to invest some portion of the proof in establishing suitable equivalences between such spaces, and ensuring that statements involving one version of this product continue to hold for the other.

For this and other reasons, the task of converting a proof written by a human mathematician - even a very careful one - to a formal proof that compiles in a formal proof assistant is quite time consuming, although the process has gradually become more efficient over time. The aforementioned four-color theorem was formalized in Coq by Werner and Gonthier in 2005 [Gon08]. The infamous Kepler conjecture on the densest packing of R^3 by unit balls was proven by Hales and Ferguson in 1998 [Hal05] in a very complicated (and computer-assisted) proof. In 2003, Hales launched the Flyspeck project to formally verify the proof, estimating that it would take twenty years to do so, although it ended up that, through a collaboration between Hales and 21 other contributors, this was achieved in “only” eleven years [HAB+17]. More recently, Scholze in 2019 launched the “liquid tensor experiment” [Com22] to formally verify a fundamental theorem of himself and Clausen on the vanishing of a certain Ext group of a “liquid vector space” in the theory of condensed mathematics. The human-written proof was “only” ten pages long, albeit with an enormous amount of prerequisite material in condensed mathematics; nevertheless, the formalization in Lean took about eighteen months in a large collaborative effort. I myself led a formalization effort [Tao23] on the recent proof by Gowers, Green, Manners and myself of a conjecture in additive combinatorics; the human-written proof was 33 pages long, but largely self-contained, and a group of about 20 collaborators was able to formalize it in three weeks. Some fields of mathematics are more challenging to formalize than others; Kevin Buzzard has recently announced a project to formalize the proof of Fermat’s Last Theorem, which he estimates will take at least five years.

Given all the effort required, what would the value of proof formalization efforts be to mathematics? Most obviously, it provides an extremely high level of confidence that a given result is correct, which is particularly valuable for results that are controversial or notorious for attracting false proofs, or for particularly lengthy proofs in fields where willing referees to verify such proofs line-by-line are in short supply. (Theoretically there could still be a hidden bug in the proof assistant compiler – which is deliberately kept as small as possible to reduce this possibility – or the definitions used in the formal statement of the result may differ in subtle but important ways from the human-readable statement, but such a scenario is unlikely, especially if the formal proof tracks the human-written proof closely.) The formalization process typically uncovers minor issues in the human proof,
and can sometimes reveal simplifications or strengthenings of the argument, for instance by revealing that a seemingly important hypothesis in a lemma was in fact unnecessary, or that a low-powered but more general tool can be used in place of an advanced but specialized one. A formalization project in a modern language such as Lean will typically contribute many basic mathematical results generated through the course of the project to a common mathematical library, which makes it easier for future formalization projects to proceed.

But formal proof assistants can also enable new modalities of mathematical education and collaboration. Several experimental projects are underway to take a formal proof and convert it into more human-understandable forms, such as an interactive text in which individual steps in the argument can be expanded into more detail or collapsed into a high-level summary; this could be a particularly suitable format for future mathematical textbooks. A traditional mathematics collaboration rarely involves more than five or so co-authors, in part due to the need for every co-author to trust and verify the work of every other; but formalization projects routinely involve scores of people who may have had no prior interaction, precisely because the formal proof assistant allows for individual subtasks in the project to be precisely defined and verified independently of the other subtasks. It is conceivable that these proof assistants could also allow a similar division of labor for the generation of new mathematical results, allowing for highly parallelized and crowdsourced collaborations at a far larger scale than previous online collaborations (such as the “Polymath” projects [Gow10]), which were limited by the need to have human moderation of the discussion. In time, the large collaborations that are already established practice in other sciences, or in software engineering projects, may also become commonplace in research mathematics; some contributors may play the role of “project managers”, focusing for instance on establishing precise “blueprints” that break the project down into smaller pieces, while others could specialize in individual components of the project, without necessarily having all the expertise needed to understand the project as a whole.

Before this can happen, however, the formalization process needs to become more efficient. The “de Bruijn factor” (the ratio between the difficulty of writing a correct formal proof and a correct informal proof) is still well above one (I estimate ∼ 20), but dropping. I believe there is no fundamental obstacle to dropping this ratio below one, especially with increased integration with AI, SMT solvers, and other tools; this would be transformative to our field.

2 Machine learning

Machine learning refers to a broad array of techniques for training a computer to perform a complex task - such as predicting an output corresponding to a given input drawn from a very broad class, or discerning correlations and other relationships in a data set. Many popular models for machine learning use some form of a neural network to encode how the computer will perform the task. These networks are functions of many variables formed by composing together a large number of simpler operations (both linear and nonlinear); typically one assigns some sort of reward function (or loss function) to such a network, for instance by empirically measuring its performance against a training data set, and then performs a computationally intensive optimization to find choices of parameters for this network that make the reward function as large as possible (or the loss function as small as possible). These models have countless practical applications, for instance in image and speech recognition, recommendation systems, or fraud detection. However, they usually do not come with strong guarantees of accuracy, particularly when applied to inputs that are significantly different from the training data set, or when the training data set is noisy or incomplete. Furthermore, the models are often opaque, in the sense that it is difficult to extract from the model a human-understandable explanation of why the model made a particular prediction, or to understand the model’s behavior in general. As such, these tools would appear at first glance to be unsuited for research mathematics, where one desires both rigorous proof and intuitive understanding of the arguments.
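As a concrete illustration of the optimization paradigm just described, the following is a minimal toy sketch (my own illustrative example, not drawn from any of the works cited here) of training a one-hidden-layer network by gradient descent on a mean squared loss; the architecture, target function, step size, and iteration count are all arbitrary illustrative assumptions.

```python
import numpy as np

# A neural network is a function of its parameters; "training" is numerical
# minimization of a loss measured against sample data. Here we fit
# f(x) = W2 @ tanh(W1 x + b1) + b2 to the target y = x^2 on [-1, 1].
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 64).reshape(-1, 1)  # training inputs
y = x ** 2                                      # target outputs

W1 = rng.normal(0.0, 1.0, (1, 16))  # input -> 16 hidden units
b1 = np.zeros(16)
W2 = rng.normal(0.0, 1.0, (16, 1))  # hidden units -> output
b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)        # hidden activations (nonlinear step)
    return h @ W2 + b2, h           # network output (linear step)

def loss(pred):
    return float(np.mean((pred - y) ** 2))  # mean squared error

initial_loss = loss(forward(x)[0])
lr = 0.02
for _ in range(5000):
    pred, h = forward(x)
    # Backpropagation: chain-rule gradients of the mean squared error.
    g = 2.0 * (pred - y) / len(x)   # d(loss)/d(pred)
    gW2 = h.T @ g
    gb2 = g.sum(axis=0)
    gh = (g @ W2.T) * (1.0 - h ** 2)  # gradient through the tanh units
    gW1 = x.T @ gh
    gb1 = gh.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2    # gradient descent updates
    W1 -= lr * gW1; b1 -= lr * gb1

final_loss = loss(forward(x)[0])
print(round(initial_loss, 4), round(final_loss, 4))
```

The loss decreases substantially over the iterations, but nothing in this procedure certifies the trained model on inputs outside the sample set, which is precisely the lack of guarantees described above.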
Nevertheless, there have been recent promising use cases of a suitably chosen machine learning tool to produce, or at least suggest, new rigorous mathematics, particularly when combined with other, more reliable techniques that can validate the output of these tools. For instance, a fundamental problem in the mathematical theory of fluid equations such as the Euler or Navier–Stokes equations is to be able to rigorously demonstrate blowup of solutions u in finite time from smooth initial data. The most notorious instance of this concerns the incompressible Navier–Stokes equations in three dimensions, the resolution of which is one of the (unsolved) Millennium Prize problems; this still remains out of reach, but recent progress has been made on other fluid equations, such as the Boussinesq equations in two dimensions (a simplified model for the incompressible Euler equations in three dimensions). One route to establishing such a singularity lies in constructing a self-similar blowup solution u, which is described by a lower-dimensional function U that solves a simpler PDE. A closed-form solution for this PDE does not seem to be available; but if one can produce a sufficiently high quality approximate solution Ũ to this PDE (which approximately obeys certain boundary conditions), it can be possible to then rigorously demonstrate an exact solution U by an application of perturbation theory (such as arguments based around fixed point theorems). Traditionally, one would use numerical PDE methods to try to produce these approximate solutions Ũ, for instance by discretizing the PDE into a difference equation, but it can be computationally expensive to use such methods to obtain solutions with the desired level of accuracy. An alternate approach was proposed in 2019 by Wang, Lai, Gómez-Serrano, and Buckmaster [WLGSB23], who used a Physics Informed Neural Network (PINN) trained to generate functions Ũ that minimized a suitable loss function measuring the extent to which the desired PDE and boundary conditions are being approximately obeyed. As these functions Ũ are generated through a neural network rather than a discretized version of the equation, they can be faster to generate, and potentially less susceptible to numerical instabilities. As it turned out, a contemporaneous work by Chen and Hou [CH22] was able to establish finite time blowup for this equation using more traditional numerical methods; however, the machine learning paradigm shows great potential as a complementary approach to these sorts of PDE problems. For instance, one could envisage a hybrid approach in which a human mathematician first proposes a blowup ansatz, which a neural network then tries to find a rough approximate solution for, and then more traditional numerical methods are used to refine that solution to one that is accurate enough for the rigorous stability analysis to be applied.

Another example of the use of machine learning in mathematics is in the field of knot theory. Knots have an extremely diverse set of topological invariants: the signature of a knot is an integer associated to the homology of a surface bounding the knot (a Seifert surface); the Jones polynomial of a knot can be described using the representation theory of braids; most knots (excluding torus knots) have a canonical hyperbolic geometry on the complement that can be used to describe a number of hyperbolic invariants, such as the hyperbolic volume; and so forth. A priori, it is not obvious how these invariants coming from very different areas of mathematics are related to each other. However, in 2021, Davies, Juhász, Lackenby, and Tomasev [DJLT21] investigated this issue through machine learning. By training a neural network on an existing database of nearly two million knots (together with a million randomly generated additional knots), they were able to make this neural network model predict the signature of a knot from about two dozen hyperbolic invariants with high accuracy. However, the prediction function generated was quite opaque and did not initially reveal much insight about what the precise relationship between signature and hyperbolic invariants was. Nevertheless, it was possible to proceed further with a simple tool known as saliency analysis, which roughly speaking measured the influence each individual hyperbolic invariant had on the prediction function. This analysis revealed that of the two dozen hyperbolic invariants used, only three of them (the longitudinal translation, and the real and imaginary parts of the meridional translation) had a significant effect on the prediction function. By visually inspecting scatterplots of the signature against these three invariants, the
authors were able to conjecture a more comprehensible relationship between these quantities. Further numerics disproved their initial conjecture, but suggested a modified version of the conjecture which they were able to prove rigorously. This interplay between machine-generated conjectures and human verification (and modification) using theory is a promising paradigm that seems applicable to many other fields of mathematics.

Many of the applications of machine learning require a large amount of training data, ideally represented in some standardized format (e.g., a vector of numbers) so that existing machine learning algorithms can be applied to it with relative ease. The precise representation of the data can be of critical importance; a correlation between different components of the data may be easily discoverable by machine learning algorithms in one data representation, but nearly impossible to find in another. While some fields of mathematics are beginning to compile large databases of useful objects (e.g., knots, graphs, or elliptic curves), there are still many important classes of more vaguely defined mathematical concepts that have not been systematically placed into a form usable for machine learning. For instance, to return to the example of PDE, there are thousands of different partial differential equations studied in the literature, but often with large variability in notation and algebraic arrangement of terms, and there is nothing resembling a standard database of commonly studied PDEs together with their basic properties (e.g., whether they are elliptic, parabolic, or hyperbolic; what is known about the existence and uniqueness of solutions; conservation laws; etc.). Such a database could potentially be used to make conjectural predictions about the behavior of one PDE based on results on other PDEs, or to suggest possible analogies or reductions from one PDE to another; but the lack of any canonical normal form to represent such equations (or at least a “fingerprint” to identify them [BT13]) makes it difficult at present to even build such a database, let alone feed it into a neural network. It is conceivable in the future, though, that advances in both proof formalization and artificial intelligence may make it more feasible to generate and utilize such databases (which could contain both “real-world” and “synthetic” sets of data).

3 Large language models

Large language models (LLMs) are a relatively recent type of machine learning model that is suited for training on extremely broad and large data sets of natural language texts. A popular large language model is the GPT (generative pre-trained transformer), which as the name suggests is built around the transformer model - a variant of a neural network designed to predict the next word (or “token”) in a string of words, in which some long-term “attention” to words early in the string is retained in order to simulate the context of the sentence. By iterating this model, one can then produce a lengthy text response to a given text prompt. When trained on a small amount of data, the outputs of such models are unimpressive - not much more sophisticated than trying to iterate the “autocomplete” text input feature on a smartphone, for instance - but after extensive training on extremely large and diverse data sets, the outputs of these models can be surprisingly coherent and even creative, and can generate text that is difficult to distinguish from human writing at first glance, although on closer inspection the output is often nonsensical and not connected to any ground truth, a phenomenon known as “hallucination”.

One can of course attempt to apply such general-purpose LLMs to try to attack mathematical problems directly. Occasionally, the results can be quite impressive; for instance Bubeck et al. document a case in which the powerful large language model GPT-4 was able to provide a complete and correct proof of a problem from the 2022 International Mathematical Olympiad, which was not in the training data set for this model. Conversely, the model is not well suited for performing precise calculations, or even basic arithmetic; in one instance [BCE+23], when asked to compute the expression 7 ∗ 4 + 8 ∗ 8, GPT-4 promptly came up with the incorrect answer of 120, then proceeded to justify the calculation with a step-by-step procedure that returned the correct answer of 92. When questioned about the discrepancy, GPT-4 could only offer that its initial guess
was a “typo”. These issues can be somewhat compensated for by using “plugins” for GPT-4, in which it is trained to send specific types of queries, such as mathematical calculations, to an external tool (such as Wolfram Alpha) rather than guessing the answer through its internal model, although the integration between the tools is not seamless at present. In a somewhat similar vein, a recent proof-of-concept [RPBN+23] has shown that large language models can be used to find examples in various problems in combinatorics and computer science that outperform previous human-generated examples, by asking those models to generate a program to create such examples rather than to try building the example directly, and then executing that program in another language to reliably verify the quality of the output, which is then sent back to the original model to prompt it to improve its guesses. There has also been recent progress in using large language models to enhance existing symbolic proof engines to attack narrow classes of mathematics problems, such as Olympiad geometry problems [TWL+24].

In my own experiments with GPT-4, I have found the most productive use cases to be generating basic computer code in various languages (Python, SAGE, LaTeX, Lean, regex, etc.), or cleaning up messy and unorganized sets of data (e.g., arranging a pile of references scraped from the internet into a coherent LaTeX bibliography, after providing GPT-4 with a few examples of the desired format for the bibliography items to get it started). In such cases, it often produces satisfactory or nearly-satisfactory output on the first attempt, with only a small amount of revision needed to obtain the type of output I was seeking. I have also had some limited success in getting GPT-4 to suggest relevant literature or techniques for an actual math problem. In one test case, I asked it how one would calculate the rate of exponential decay in the tail probability of a sum of independent random variables, in order to assess whether it knew about the relevant theorem in this regard (Cramér’s theorem) without providing it with key words such as “large deviation theory”. As it turned out, GPT-4 did not exactly locate this theorem and instead produced a string of mathematical nonsense, but curiously it did manage to reference the logarithmic moment generating function, which is a key notion in the statement of Cramér’s theorem, even if it did not seem to “know” exactly how this function was relevant to the problem. In another experiment, I asked GPT-4 for suggestions on how to prove a combinatorial identity I was working on. It gave a number of suggestions that I had already considered (asymptotic analysis, induction, numerics) as well as some generic advice (simplify the expressions, look for similar problems, understand the problem), but also suggested a technique (generating functions) that I had simply overlooked, and which ended up solving the problem fairly readily. On the other hand, such a list of advice would probably have been of little use to a novice mathematician, who would not have had sufficient experience to independently gauge the usefulness of each of the proposed suggestions. Nevertheless, I see a role for these tools in drawing out a user’s latent knowledge of a problem, simply by being a good listener and proposing reasonably relevant ideas that the user is expert enough to evaluate.

GitHub Copilot is another GPT model that is integrated into several popular code editors. Being trained on large data sets of code in different languages, it is designed to make autocomplete suggestions for code that is partially written, utilising contextual clues such as informal descriptions of the task to be performed elsewhere in the code. I have found it works surprisingly well for writing mathematical LaTeX, as well as for formalizing in Lean; indeed, it assisted in writing this very article by suggesting several sentences as I was writing, many of which I retained or lightly edited for the final version. While the quality of its suggestions is highly variable, it can sometimes display an uncanny level of simulated understanding of the intent of the text. For instance, when writing another expository LaTeX note on how to estimate integrals, I described how the integral was to be broken up into three parts, and then gave the details of how to estimate the first part; Copilot then promptly suggested how to estimate the second part by a similar method, changing the variables around in what turned out to be a completely correct fashion. The frequency of these experiences has led to a small but noticeable speedup in my writing of both LaTeX and Lean, and I expect these tools to become even more
The frequency of these experiences has led to a small but noticeable speedup in my writing of both LaTeX and Lean, and I expect these tools to become even more useful in the future, as it becomes possible to “fine-tune” these models on one’s personal writing style and preferences.

4 Can these tools be combined?

The various technologies discussed above have very diverse strengths and weaknesses, and none of them at their present level of development are suitable as general-purpose tools for mathematicians, on par with ubiquitous platforms such as LaTeX or the arXiv. However, there are promising recent experiments in creating more satisfactory tools by combining two or more separate technologies. For instance, one plausible way to combat the hallucinatory nature of large language models when generating proofs is to require the model to format its output in the language of a formal proof assistant, with any errors generated by the assistant sent back to the model as feedback. This combined system seems suitable for generating short proofs of simple statements [YSG+23]; as such tasks are often the limiting factor in efficiently formalizing proofs, this type of paradigm could greatly accelerate the speed of such formalization, particularly if these models are fine-tuned on formal proofs specifically, as opposed to general text, and are integrated with more traditional automated theorem proving methods, such as the deployment of SMT solvers.

With their ability to take natural language inputs, large language models are also potentially a user-friendly interface that allows mathematicians without particular software expertise to use advanced tools. As mentioned before, I and many others already routinely use such models to generate simple code in various languages (including symbolic algebra packages), or to create intricate diagrams and images; it seems reasonable to expect that in the near future one could also communicate through such models to design and operate something as complex as a machine learning model using only high-level, conversational instructions. More ambitiously, one could hope eventually to be able to generate (first drafts of) entire research papers, complete with formal verification, by explaining a result in natural language to an AI, who would try to formalize each step of the result and query the author whenever clarification is required.

The human-intensive nature of formal proof verification in its current form means that it is not feasible at present for a significant fraction of current research papers to be fully formalized in real time. However, it is plausible that many of the tools already used to verify specific computation-intensive components of a research paper, such as a numerical integration or PDE solver, a symbolic algebra computation, or a result established using an SMT solver, could be modified to produce formal proof certificates. Furthermore, the class of calculations that could be formalized in this fashion could be greatly expanded from where things stand in current practice. To give just one example, in the field of PDE it is common to devote pages of calculation to estimating some integral expression involving one or more unknown functions (such as solutions to a PDE), using bounds on such functions in various function space norms (such as Sobolev space norms), together with standard inequalities (e.g., the Hölder inequality and the Sobolev inequalities), as well as various identities such as integration by parts or differentiation under the integral sign. Such calculations, while routine, can contain typos (such as sign errors) of various degrees of severity, and can be tedious to referee carefully, with the calculations themselves providing little insight beyond verifying that the final estimate holds. It is conceivable that tools could be developed to establish such estimates in an automated or semi-automated fashion, and the current lengthy and unenlightening proofs of such estimates could be replaced by a link to a formal proof certificate. More ambitiously, one might be able to ask a future AI tool to produce the best estimate it can, given some set of initial hypotheses and methods, without first performing a pen-and-paper calculation to guess what that estimate would be. At present, the state space of possible estimates is too complex to be automatically explored in such a fashion; but I see no reason why this sort of automation will not become achievable as technology advances. When this becomes the case, mathematical explorations will become possible at scales that are not currently feasible.
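As an illustration of the sort of routine estimate described above (a generic example of my own, not drawn from any particular paper), a typical step chains the Hölder inequality with a Sobolev embedding: for suitable functions $f, g, h$ on $\mathbb{R}^3$,

```latex
\Bigl| \int_{\mathbb{R}^3} f \, g \, h \, dx \Bigr|
\leq \| f \|_{L^2} \, \| g \|_{L^3} \, \| h \|_{L^6}
\lesssim \| f \|_{L^2} \, \| g \|_{L^3} \, \| \nabla h \|_{L^2},
```

using Hölder with exponents $\frac{1}{2} + \frac{1}{3} + \frac{1}{6} = 1$ followed by the embedding $\dot{H}^1(\mathbb{R}^3) \hookrightarrow L^6(\mathbb{R}^3)$. A formal proof certificate for an estimate of this shape would discharge exactly such chains of standard inequalities.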
Continuing the example of PDE, papers in this field typically study one or two equations at a time; but in the future one may be able to study hundreds of equations at once, perhaps working out an argument in full for just one equation and letting AI tools then adapt the arguments to large families of related equations, querying the author as necessary whenever the extension of the arguments is non-routine. Some hints of this type of large-scale mathematical exploration are beginning to emerge in other areas of mathematics, such as the automated exploration of conjectures in graph theory [Wag21].

It is not clear at present which of these experiments will end up being the most successful in bringing advanced computer assistance to the typical working mathematician. Some proofs of concept are not currently scalable, particularly those that are reliant on extremely computationally intensive (and often proprietary) AI models, or require a large amount of expert human input and supervision. However, I am encouraged by the diverse efforts to explore the space of possibilities, and believe that there will be many further examples of novel ways to perform machine-assisted mathematics in the very near future.

5 Further reading

The subject of machine assisted proof is quite diffuse, distributed across various areas of mathematics, computer science, and even engineering; while each individual subfield has plenty of activity, it is only recently that efforts have been made to build a more unified community bringing together all of the topics listed here. As such, there are currently few places where one can find holistic surveys of these rapidly developing modalities of mathematics. One starting point is the proceedings [Kor23] of a June 2023 National Academies workshop on “AI to Assist Mathematical Reasoning” (which the author co-organized). Many of the examples discussed here were also drawn from a February 2023 IPAM workshop on “Machine assisted proof” (which the author also co-organized), whose talks may be found online.

References

[AH89] Kenneth Appel and Wolfgang Haken, Every planar map is four colorable, Contemporary Mathematics, vol. 98, American Mathematical Society, Providence, RI, 1989. With the collaboration of J. Koch. MR1025335
[BCE+23] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang, Sparks of artificial general intelligence: Early experiments with GPT-4, 2023.
[BT13] Sara C. Billey and Bridget E. Tenner, Fingerprint databases for theorems, Notices Amer. Math. Soc. 60 (2013), no. 8, 1034–1039. MR3113227
[CH22] Jiajie Chen and Thomas Y. Hou, Stable nearly self-similar blowup of the 2D Boussinesq and 3D Euler equations with smooth data, parts I and II, 2022.
[Com22] Johan Commelin, Liquid tensor experiment, Mitt. Dtsch. Math.-Ver. 30 (2022), no. 3, 166–170. MR4469845
[DJLT21] Alex Davies, András Juhász, Marc Lackenby, and Nenad Tomasev, The signature and cusp geometry of hyperbolic knots, 2021.
[GGMT23] W. T. Gowers, Ben Green, Freddie Manners, and Terence Tao, On a conjecture of Marton, 2023.
[Gon08] Georges Gonthier, Formal proof—the four-color theorem, Notices Amer. Math. Soc. 55 (2008), no. 11, 1382–1393. MR2463991
[Gow10] W. T. Gowers, Polymath and the density Hales-Jewett theorem, An irregular mind, 2010, pp. 659–687. MR2815619
[HAB+17] Thomas Hales, Mark Adams, Gertrud Bauer, Tat Dat Dang, John Harrison, Le Truong Hoang, Cezary Kaliszyk, Victor Magron, Sean McLaughlin, Tat Thang Nguyen, Quang Truong Nguyen, Tobias Nipkow, Steven Obua, Joseph Pleso, Jason Rute, Alexey Solovyev, Thi Hoai An Ta, Nam Trung Tran, Thi Diep Trieu, Josef Urban, Ky Vu, and Roland Zumkeller, A formal proof of the Kepler conjecture, Forum Math. Pi 5 (2017), e2, 29. MR3659768
[Hal05] Thomas C. Hales, A proof of the Kepler conjecture, Ann. of Math. (2) 162 (2005), no. 3, 1065–1185. MR2179728
[HKM16] Marijn J. H. Heule, Oliver Kullmann, and Victor W. Marek, Solving and verifying the Boolean Pythagorean triples problem via cube-and-conquer, Theory and applications of satisfiability testing—SAT 2016, 2016, pp. 228–245. MR3534782
[Kor23] Samantha Koretsky (ed.), Artificial intelligence to assist mathematical reasoning: Proceedings of a workshop, National Academies Press, 2023.
[RPBN+23] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi, Mathematical discoveries from program search with large language models, Nature 625 (December 2023), no. 7995, 468–475.
[RSST96] Neil Robertson, Daniel P. Sanders, Paul Seymour, and Robin Thomas, A new proof of the four-colour theorem, Electron. Res. Announc. Amer. Math. Soc. 2 (1996), no. 1, 17–25. MR1405965
[Tao23] Terence Tao, Formalizing the proof of PFR in Lean4 using Blueprint: a short tour, 2023. https://terrytao.wordpress.com/2023/11/18.
[TWL+24] Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong, Solving olympiad geometry without human demonstrations, Nature 625 (January 2024), no. 7995, 476–482.
[Wag21] Adam Zsolt Wagner, Constructions in combinatorics via neural networks, 2021.
[WLGSB23] Y. Wang, C.-Y. Lai, J. Gómez-Serrano, and T. Buckmaster, Asymptotic self-similar blow-up profile for three-dimensional axisymmetric Euler equations using neural networks, Phys. Rev. Lett. 130 (2023), no. 24, Paper No. 244002, 6. MR4608987
[YSG+23] Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar, LeanDojo: Theorem proving with retrieval-augmented language models, 2023.