Representing Numbers in NLP: a Survey and a Vision

Avijit Thawani and Jay Pujara and Pedro Szekely and Filip Ilievski
University of Southern California
Information Sciences Institute
{thawani,jpujara,pszekely,ilievski}@isi.edu

Abstract

NLP systems rarely give special consideration to numbers found in text. This starkly contrasts with the consensus in neuroscience that, in the brain, numbers are represented differently from words. We arrange recent NLP work on numeracy into a comprehensive taxonomy of tasks and methods. We break down the subjective notion of numeracy into 7 subtasks, arranged along two dimensions: granularity (exact vs approximate) and units (abstract vs grounded). We analyze the myriad representational choices made by 18 previously published number encoders and decoders. We synthesize best practices for representing numbers in text and articulate a vision for holistic numeracy in NLP, comprised of design trade-offs and a unified evaluation.

1 Introduction

Numbers are an integral part of text. To understand a simple sentence like I woke up at 11, we need not just literacy but also numeracy. We must decode the string 11 to the quantity 11 and infer 11 to denote a time of the day, probably 11 a.m. We need commonsense to reason that 11 a.m. is quite late in the morning. This interpretation of 11 is strongly contextual, as I earn $11 per month evokes different units and value expectations. Note how the semantics remains the same for both sentences if 11 was replaced by 10, i.e., the context is tolerant to some variability.

Numbers are everywhere. Reasoning with quantities and counts is crucial to understanding the world. Evolutionary learning has given numerical cognition skills to several animals, including human beings (Dehaene, 2011). Our ancient ancestors furthered numeracy by developing multiple number systems, similar to but independent from the evolution of languages. Numeracy is an essential skill for language understanding, since numbers are often interspersed in text: the 6 million pages in English Wikipedia have over 150 million numbers.

Numbers are neglected. In NLP, however, numbers are either filtered out explicitly during preprocessing (Graff et al., 2003), or treated the same as words, often collapsing them into an UNK token. Subword tokenization approaches like BPE (Sennrich et al., 2016) and WordPiece (Wu et al., 2016) instead retain numbers, but split them into arbitrary tokens: for example, 1234 might be split into two tokens as 12-34 or 123-4 or 1-234.

Recent work has shown that these are suboptimal number representations (Wallace et al., 2019; Zhang et al., 2020). On the DROP Question Answering benchmark, BERT performs five times worse when the answer is a number instead of a span of text (Dua et al., 2019). Relatively simple strategies, like switching from subword to char-level tokenization (Geva et al., 2020) or from decimal to scientific notation (Zhang et al., 2020), already boost performance. Such results warrant a deeper study into the best number representations.

Numbers are important. Given the ubiquity of numbers and their fundamental differences with words, enabling NLP systems to represent them effectively is beneficial for domains like scientific articles (Spithourakis and Riedel, 2018) and financial documents (Chen et al., 2019; Jiang et al., 2020). Number understanding is also useful to detect sarcasm (Dubey et al., 2019) and to model dialogues involving price negotiations (Chawla et al., 2020).

Recent NLP progress towards numeracy has been sporadic but encouraging. In this paper, we survey prior work and highlight the kind of numeracy targeted (e.g., arithmetic, measurement, numeration) as well as the kind of representation used (e.g., value embeddings, DigitRNNs). We provide the first NLP-centric taxonomy of numeracy tasks (Section 2) and of number representations (Section 3) for the reader to succinctly comprehend the challenge posed by numeracy. We synthesize key takeaways (Section 5) and propose a unifying vision for future research (Section 6).
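To make the arbitrary subword splits described above concrete, here is a minimal sketch assuming the HuggingFace transformers library and the bert-base-uncased checkpoint (external dependencies, not part of the paper):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for number in ["11", "1234", "31415926"]:
    # Frequent numbers may survive as single WordPiece tokens, while rarer ones
    # are split into arbitrary pieces; the exact split depends on the vocabulary.
    print(number, tokenizer.tokenize(number))
```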
| Granularity | Benchmarking or Probing Tasks: Abstract | Benchmarking or Probing Tasks: Grounded | Downstream Applications |
|---|---|---|---|
| Exact | Simple Arithmetic (2+3=5) | AWP (2 balls + 3 balls = 5 balls), Exact Facts (birds have two legs) | Question Answering, Science Problems |
| Approx | Numeration ('2' = 2.0), Magnitude ('2' < '5') | Measurement (dogs weigh 50 lbs), Numerical Language Modeling | Sarcasm Detection, Numeral Categorization |

Table 1: Seven numeracy tasks, arranged along the axes of (rows) granularity - exact vs approximate, and (columns) units - abstract vs grounded. We also list downstream applications requiring a similar granularity of numeracy.

2 Tasks

There are several different aspects of numeracy. The DROP dataset alone offers a wide variety of numeric reasoning questions such as retrieval-based (How many yards did Brady run?), count-based (How many goals were scored? given a comprehension describing multiple goals), and simple arithmetic (How many years after event 1 did event 2 occur? given dates of both events). Besides downstream applications, there have also been probing experiments to evaluate whether NLP models can decode numbers from strings (e.g., 19 to 19.0), or estimate quantities (e.g., how tall are lions?).

Such a diverse range of abilities are usually all referred to collectively as numeracy, which gives rise to confusion. We limit this abuse of terminology and provide a neat taxonomy for arranging the different tasks proposed under numeracy.

2.1 Our Taxonomy of Tasks

Drawing from work in cognitive science (Feigenson et al., 2004), we propose the following two dimensions to organize tasks within numeracy:

1. Granularity: whether the encoding of the number is (1) exact, e.g., birds have two legs, or (2) approximate, e.g., Jon is about 180 cms tall.

2. Units: whether the numbers are (1) abstract, e.g., 2+3=5, or (2) grounded, e.g., 2 apples + 3 apples = 5 apples. While abstract mathematical tasks are easy to probe and create artificial datasets for, numbers grounded in units are challenging, since they need to be understood in the context of words.

2.2 Survey of Existing Tasks

We now describe 7 standard numeracy tasks, arranged according to our taxonomy in Table 1. We defer discussion on downstream tasks (rightmost column in the table) as well as miscellaneous numeracy-related tasks (such as counting) to Appendix A.

Simple Arithmetic is the task of addition, subtraction, etc. over numbers alone. It is convenient to create synthetic datasets involving such math operations for both masked (Geva et al., 2020) and causal language models (GPT-3; Brown et al., 2020).

Numeration or Decoding refers to the task of mapping a string form to its numeric value, e.g., 19 to 19.0. Within NLP, this task is set up as a linear regressor probe over a (frozen) representation of the string. Numeration has been probed for in static word embeddings (Naik et al., 2019), contextualized language models (Wallace et al., 2019), and multilingual number words, e.g., nineteen or dix-neuf (Johnson et al., 2020).

Magnitude Comparison is the ability to tell which of two (or more) numbers is larger. For language models, this has been probed in an argmax setup (choose the largest of five numbers) as well as a binary classification task, e.g., given 23 and 32, pick the label 1 to indicate that 32 > 23 (Naik et al., 2019; Wallace et al., 2019).
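The numeration and magnitude probes above are usually small supervised models over frozen vectors. Below is a minimal sketch of a numeration probe, assuming scikit-learn is available; embed() is a hypothetical stand-in for whichever frozen encoder is being probed, so this illustrates the setup rather than reproducing any published probe.

```python
import numpy as np
from sklearn.linear_model import Ridge

def embed(token: str) -> np.ndarray:
    # Stand-in for a frozen encoder (pretrained word vectors or a contextualized
    # model). Here: a random but per-token-fixed vector, purely so the sketch runs.
    rng = np.random.default_rng(abs(hash(token)) % (2 ** 32))
    return rng.normal(size=64)

# Numeration probe: regress the numeric value from the frozen embedding.
train_numbers = list(range(0, 500))
X = np.stack([embed(str(n)) for n in train_numbers])
y = np.array(train_numbers, dtype=float)
probe = Ridge(alpha=1.0).fit(X, y)

for t in ["19", "73", "480"]:
    print(t, "->", round(float(probe.predict(embed(t)[None, :])[0]), 1))
```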
Arithmetic Word Problems (AWP) are the grounded version of simple arithmetic that we find in school textbooks, e.g., Mary had two cookies. She gave one away. How many does she have left? There exist several NLP datasets on math word problems (Amini et al., 2019; Saxton et al., 2019; Roy and Roth, 2015; Hendrycks et al., 2021).

Exact Facts in the context of numeracy involves commonsense knowledge such as dice have 6 faces or birds have two legs. An approximate sense of quantity would be of little help here, since assertions like dice have 5 faces or birds have three legs are factually incorrect. Two recent datasets for numeric commonsense facts are Numbergame (Mishra et al., 2020) and NumerSense (Lin et al., 2020).

Measurement Estimation is a task in psychology in which subjects are asked to approximately guess measures of objects along certain dimensions, e.g., the number of seeds in a watermelon or the weight of a telephone (Bullard et al., 2004). VerbPhysics (Forbes and Choi, 2017) is a benchmark of binary comparisons between physical attributes of various objects, e.g., ball <_size tiger. DoQ (Elazar et al., 2019) is a web-extracted dataset of Distributions over Quantities, which can be used as a benchmark for language models' measurement estimation abilities (Zhang et al., 2020). Lastly, MC-TACO (Zhou et al., 2020) is a collection of temporal-specific measurement estimates, e.g., going for a vacation spans a few days/weeks.

Numerical Language Modeling in its literal sense is not a task but a setup, analogous to masked/causal language modeling for words. Other tasks could be modeled as numeric language modeling, e.g., arithmetic (2+3=[MASK]) and measurement estimation (lions weigh [MASK] pounds). In practice, numerical language modeling refers to the task of making numeric predictions for completing unlabelled, naturally occurring text.

Word predictions in language modeling are typically evaluated with classification metrics such as Accuracy at K or perplexity. Numeric predictions, on the other hand, are evaluated with regression metrics such as mean absolute error, root mean squared error, or their log-scaled and percentage variants.

Downstream Applications for numeracy abound. Dubey et al. (2019) detect sarcasm in tweets based on numbers. Chen et al. (2020) identify claims in financial documents using alternative number representations and the auxiliary task of numeral understanding or categorization (Chen et al., 2018). Similarly, simple arithmetic and math word problems serve as auxiliary tasks for GenBERT (Geva et al., 2020) towards improving its score on the DROP QA benchmark.
3 Methods

Analogous to our taxonomy of subtasks in the previous section, here we attempt to arrange the wide variety of alternative number representations proposed in recent literature. We limit our analysis to methods of encoding (numbers → embeddings) and/or decoding (embeddings → numbers) numbers. We do not discuss, for example, methods that use symbolic reasoning (Andor et al., 2019) or modify activation functions to enhance numeracy (Trask et al., 2018).

A typical example of the base architecture could be BERT (Devlin et al., 2019), the workhorse of modern NLP. We assume that there exists an independent parallel process of mapping words into embeddings, such as subword tokenization followed by lookup embeddings in BERT.

3.1 Our Taxonomy

We look at two kinds of representations: string-based and real-based. Real-based representations perform some computation involving the numerical value of the number. The string-based representations instead see numbers in their surface forms; they must assign arbitrary token IDs and look up their embeddings to feed into the architecture.

3.1.1 String Based

By default, language models treat numbers as strings, the same as words. However, within string representations, one could tweak simple changes:

Notation: The number 80 could be written in Hindu-Arabic numerals (80), Roman numerals (LXXX), scientific notation (8e1), English words (eighty), or with base 20 as in French (quatre-vingts). Nogueira et al. (2021) exclusively study the effect of many such notation choices in language models, on the task of simple arithmetic.

Tokenization: Word-level tokenizations are ineffective for numbers, since they are likely to map most numbers to an UNK token, except for a few commonly occurring ones (e.g., 1, 2, 5, 10, 100). Other possibilities are subword tokenizations like BPE and WordPiece, as well as character (or digit) level tokenizations.

Pooling: The pooling dimension of variation springs up after analyzing the effect of tokenization. With subword and character level tokenizations, a single number may now correspond to multiple tokens, e.g., 100 segmented into 10-0 or 1-0-0. Prior work (Spithourakis and Riedel, 2018) has argued for using RNNs or CNNs to instead pool the embeddings of these tokens into a single embedding before feeding it to the language model. The default way that language models see numbers is the same as words, hence no pooling is applied.
3.1.2 Real Based

Real-based number encoders can be expressed as f : R → R^d, whereas decoders can be expressed as g : R^d → R. Real-based methods proposed in the literature can vary on account of direction (whether they encode, decode, or both), scale (linear vs log), and discretization (binning vs continuous valued).

Direction: Some proposed methods are encoder-only, e.g., DICE (Sundararaman et al., 2020), while some can be decoder-only, e.g., those requiring sampling from a parameterized distribution (Berg-Kirkpatrick and Spokoyny, 2020).

Scale: Inspired by cognitive science literature (Dehaene, 2011), several methods have attempted to model numbers in the log (instead of linear) scale, i.e., to perform mathematical operations on the logarithm of the number to be represented. The first operation in a log-scaled f is log(·) and the last operation in a log-scaled g is exp(·). We discuss more scales in the following subsection, such as the stabilized log scale (Jiang et al., 2020) and the learned scale/flow (Berg-Kirkpatrick and Spokoyny, 2020).

Discretization: Training continuous value functions for a large range of numbers turns out to be practically infeasible (Wallace et al., 2019). Some real-based methods first bin numbers before learning embeddings for each bin. These bins could be on the linear scale (0-10, 10-20, 20-30, ...) or the log scale (0.01-0.1, 0.1-1, 1-10, ...), and the lookup embeddings can be learnt by the regular cross entropy (Chen et al., 2020) or dense cross entropy (Zhang et al., 2020).
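As a concrete illustration of the binning option above, here is a small numpy sketch (our own assumption, not code from the cited papers): numbers are mapped to log-scale bins, whose indices can then index lookup embeddings or serve as classification targets.

```python
import numpy as np

# Log-scale bin edges: 0.01-0.1, 0.1-1, 1-10, ..., up to 10^6.
edges = 10.0 ** np.arange(-2, 7)

def to_bin(x: float) -> int:
    """Map a positive number to the index of its log-scale bin."""
    return int(np.digitize(x, edges))  # 0 = below 0.01, len(edges) = above 1e6

for x in [0.05, 3.14, 42.0, 1234.0]:
    print(x, "-> bin", to_bin(x))

# Each bin index could either look up a learned embedding (encoder view)
# or act as the target class of a softmax decoder (MCC-style, see Section 3.2.2).
```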
3.2 Survey of Existing Methods

Having established dimensions of variance of number representations, we describe some key string-based and real-based methods used in prior work. Table 2 depicts these methods as individual rows, with the first three columns showing their position in our taxonomy (§3.1). The last seven columns correspond to the seven tasks (§2.2), with each cell denoting a representative work that introduces it.

3.2.1 String-based methods

Word Vectors & Contextualized Embeddings: Word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), and BERT (Devlin et al., 2019) have been probed as baselines against several contending methods.

GenBERT: Geva et al. (2020) presented GenBERT, a question answering model with pretrained BERT serving as both its encoder and decoder. GenBERT tokenizes numbers at the digit level, and is finetuned on auxiliary tasks of arithmetic word problems and simple arithmetic.

NumBERT: Zhang et al. (2020) pretrain BERT from scratch over a modified dataset such that all numbers have been converted into scientific notation (i.e., 314.1 is expressed as 3141[EXP]2). NumBERT hence follows a scientific notation, subword tokenization, and no pooling.[1]

[1] Pooling as described in §3.1.1.
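For concreteness, the following sketch rewrites decimals into a mantissa-plus-exponent surface form of the kind described above. The exact convention and tokenization used to build NumBERT's pretraining corpus may differ; the mantissa/exponent format here is inferred only from the single example 314.1 → 3141[EXP]2.

```python
def to_scientific_tokens(number: str) -> str:
    """Rewrite '314.1' as '3141[EXP]2' (assumed convention: mantissa digits with an
    implied decimal point after the first digit, plus a base-10 exponent)."""
    mantissa, exponent = f"{float(number):e}".split("e")   # '3.141000', '+02'
    digits = mantissa.replace(".", "").rstrip("0") or "0"  # '3141'
    return f"{digits}[EXP]{int(exponent)}"

for s in ["314.1", "2", "0.05"]:
    print(s, "->", to_scientific_tokens(s))
```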
DigitRNN, DigitCNN: Spithourakis and Riedel (2018) and Wallace et al. (2019) experimented with pooling of digit embeddings into a single embedding representing the full number. Both used RNNs as well as CNNs for pooling.

DigitRNN-sci & Exponent (Embedding): Berg-Kirkpatrick and Spokoyny (2020) used a scientific notation variant of DigitRNNs (which we refer to as DigitRNN-sci in Table 2), as well as a simpler alternative: exponent embedding. The latter merely learns a lookup embedding for the exponent, completely ignoring the mantissa.
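A minimal sketch of digit-level pooling in the spirit of DigitRNN, using PyTorch (an assumed dependency); the published models differ in details such as the RNN cell, handling of signs and decimal points, and the training objective.

```python
import torch
import torch.nn as nn

class DigitRNNPooler(nn.Module):
    """Pool a number's digit embeddings into one vector with a GRU."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.digit_embedding = nn.Embedding(10, dim)  # one row per digit 0-9
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, number: str) -> torch.Tensor:
        # Only plain digit strings are handled in this sketch.
        ids = torch.tensor([[int(d) for d in number]])    # (1, n_digits)
        digit_vectors = self.digit_embedding(ids)         # (1, n_digits, dim)
        _, last_hidden = self.rnn(digit_vectors)          # (1, 1, dim)
        return last_hidden.squeeze(0).squeeze(0)          # single pooled vector

pooler = DigitRNNPooler()
print(pooler("1234").shape)  # torch.Size([64])
```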
3.2.2 Real-based methods

DICE: Deterministic Independent-of-Corpus Embeddings (Sundararaman et al., 2020) is an attempt to handcraft a number encoder[2] f so as to preserve the relative magnitude between two numerals and their embeddings. Given two scalars i and j, and their embeddings f(i) and f(j), the cosine distance between f(i) and f(j) is intended to monotonically increase/decrease with the Euclidean distance between i and j. DICE is offered not only as a deterministic encoding but also as an auxiliary loss function for softly training number embeddings alongside, say, SQuAD (Rajpurkar et al., 2016).

[2] Number encoder-decoder as defined in §3.1.2.
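A simplified, two-dimensional sketch of the kind of deterministic encoding DICE aims for (numbers placed on an arc so that cosine distance grows monotonically with numeric distance). The published DICE construction is higher-dimensional and differs in details; this only conveys the intuition.

```python
import numpy as np

def dice_like_encode(x: float, lo: float = 0.0, hi: float = 100.0) -> np.ndarray:
    """Map x in [lo, hi] onto the unit half-circle."""
    theta = np.pi * (x - lo) / (hi - lo)
    return np.array([np.cos(theta), np.sin(theta)])

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Cosine distance grows with the gap between the encoded numbers.
print(cosine_distance(dice_like_encode(10), dice_like_encode(11)))
print(cosine_distance(dice_like_encode(10), dice_like_encode(50)))
print(cosine_distance(dice_like_encode(10), dice_like_encode(90)))
```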
Value Embedding: The most intuitive parameterized encoder for real numbers is one that feeds the scalar magnitude of the number through a shallow neural network. The converse of value embedding is to learn a shallow neural network mapping g : R^d → R. This decoder is simply the probe used for the decoding/numeration task.

The idea of projecting number magnitudes into an NLP model that otherwise inputs only lookup embeddings may appear flawed. But Vaswani et al. (2017) have (rather successfully) encoded positional information into transformers using both learned embeddings (similar to Value) and fixed ones (similar to DICE).

Log Value: Wallace et al. (2019) also experiment with a log-scaled value encoder in addition to the one on a linear scale. Zhang et al. (2020) experiment with a log value decoder for measurement estimation, which they call the RGR (regress) method. Log scaling has a neuroscientific inspiration, since observations of human (and animal) understanding of numbers are better modelled by a log-scale representation (Dehaene, 2011).
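A minimal sketch of a value encoder and its log-scaled variant, again in PyTorch (assumed); the published experiments vary in network depth, scaling details, and training setup.

```python
import torch
import torch.nn as nn

class ValueEncoder(nn.Module):
    """Feed the scalar magnitude (optionally log-scaled) through a shallow MLP."""

    def __init__(self, dim: int = 64, log_scale: bool = False):
        super().__init__()
        self.log_scale = log_scale
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.log_scale:
            x = torch.log1p(torch.abs(x))  # first operation of a log-scaled encoder
        return self.mlp(x.unsqueeze(-1))

value = ValueEncoder()                     # linear-scale Value embedding
log_value = ValueEncoder(log_scale=True)   # Log Value variant
numbers = torch.tensor([2.0, 50.0, 1e6])
print(value(numbers).shape, log_value(numbers).shape)  # (3, 64) each
```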
Task abbreviations in Table 2: Arith = Simple Arithmetic, Facts = Exact Facts, AWP = Arithmetic Word Problems (Exact granularity); Num = Numeration, Mag = Magnitude Comparison, Meas = Measurement Estimation, LMN = Numerical Language Modeling (Approximate granularity).

String-based methods:

| Method | Notation | Tokenization | Pooling | Arith | Facts | AWP | Num | Mag | Meas | LMN |
|---|---|---|---|---|---|---|---|---|---|---|
| Word Vectors | Decimal | Word | NA | W+19 | | | W+19 | W+19 | G+19 | J+20 |
| Contextualized | Decimal | Subword | No | W+19 | L+20 | G+20 | W+19 | W+19 | Z+20 | SR18 |
| GenBERT | Decimal | Char | No | G+20 | | G+20 | | | | |
| NumBERT | Scientific | Char | No | | | | | | Z+20 | Z+20 |
| DigitRNN/CNN | Decimal | Char | Yes | W+19 | | | W+19 | W+19 | | SR18 |
| DigitRNN-sci | Scientific | Char | RNN | | | | | | | BS20 |
| Exponent | Scientific | Word | NA | | | | | | | BS20 |

Real-based methods:

| Method | Scale | Direction | Binning | Arith | Facts | AWP | Num | Mag | Meas | LMN |
|---|---|---|---|---|---|---|---|---|---|---|
| DICE | Linear | Enc-only | No | S+20 | | | S+20 | S+20 | | |
| Value | Linear | Both | No | W+19 | | | W+19 | W+19 | | |
| Log Value | Log | Both | No | W+19 | | | W+19 | W+19 | Z+20 | |
| MCC | Log | Dec-only | Yes | | | | | | Z+20 | |
| Log Laplace | Log | Dec-only | No | | | | | | | BS20 |
| Flow Laplace | Learn | Dec-only | No | | | | | | | BS20 |
| DExp | Log | Dec-only | No | | | | | | | BS20 |
| GMM | Linear | Dec-only | Both** | | | | | | | SR18 |
| GMM-proto | Linear | Enc-only* | No | | | | J+20 | J+20 | J+20 | J+20 |
| SOM-proto | Log | Enc-only* | No | | | | J+20 | J+20 | J+20 | J+20 |

Table 2: An overview of numeracy in NLP: Each row is a method (§3.2), arranged as per our taxonomy (§3.1), split by string and real, further branching into three dimensions each. The last seven columns correspond to the seven subtasks of numeracy (§2.2), split by Exact and Approximate granularity (§2.1). The cells point to representative (not exhaustive) works that have experimented with a given method (row) on a given task (column). Notes: Prototype* is encoder-only but reuses embeddings for the decoder (Jiang et al., 2020). GMM** has been discretized (Spithourakis and Riedel, 2018) as well as continuous valued (Berg-Kirkpatrick and Spokoyny, 2020).

Log Laplace: In contrast to the point estimate output of the RGR decoder, models can also be used to parameterize a distribution over numbers. Such a formulation is helpful when estimating approximate quantities. Vectors representing some context can be used to parameterize, say, the mean and variance of a Gaussian or Laplace distribution. Berg-Kirkpatrick and Spokoyny (2020) instead transform the space being modeled by parameterizing the location parameter of a Log-Laplace distribution L(X, 1), where X is the context representation of unmasked tokens, in a masked (numerical) language modelling setup. When inferring or decoding a number, they sample a point z ∼ L(X, 1) and exponentiate it, such that the output is exp(z).

Flow Laplace: The expressivity of number decoders can be expanded or contracted by merely parameterizing a different distribution. Berg-Kirkpatrick and Spokoyny (2020) propose a more expressive decoder where, instead of the log scale, the model learns its own density mapping. After sampling z ∼ L(X, 1), the output is transformed to exp((z − a)/b)^c, where a, b, and c are also parameters emitted by the same model.
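A minimal sketch of the Log-Laplace sampling step just described, with numpy (assumed); here the context representation is reduced to a single location scalar, whereas the actual model predicts it from the unmasked tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def decode_log_laplace(location: float, n_samples: int = 5) -> np.ndarray:
    # Sample z ~ Laplace(location, 1) in log space, then exponentiate,
    # so the decoded numbers are positive and heavy-tailed.
    z = rng.laplace(loc=location, scale=1.0, size=n_samples)
    return np.exp(z)

# A location of log(150) makes values near 150 the most likely outputs.
print(decode_log_laplace(np.log(150.0)))
```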
MCC, or multi-class classification, is another number decoder which outputs a distribution, but a discrete one: over log-scaled bins of numbers, e.g., 1-10, 10-100, and so on (Zhang et al., 2020). Previously described decoders either output a point estimate or a unimodal distribution, thus failing to hedge their predictions for a multimodal ground truth. Given a masked number prediction problem We went to the restaurant at [MASK] p.m., MCC is better equipped to estimate two peaks: one around lunch time (say, 1-2 p.m.) and another around dinner (say, 7-9 p.m.).

Discrete Latent Exponent (DExp) is another potentially multimodal distribution (Berg-Kirkpatrick and Spokoyny, 2020), where the model parameterizes a multinomial distribution for the exponent (similar to MCC) and uses it to sample an exponent e, which then acts as a latent variable for emitting the mean µ of a Gaussian (standard deviation fixed at 0.05). This Gaussian is finally used to sample the output number z ∼ N(µ, 0.05).

GMM: Another attempt to circumvent the unimodal Gaussians or point estimates is to learn a Gaussian mixture model. Spithourakis and Riedel (2018) learn a mixture of K Gaussians by pretraining their means (µ_i) and variances (σ_i²) over the training corpus with Expectation Maximization algorithms, while the mixing weights π_i are derived from the model. Next, to sample a single number from the GMM probability mass function q(u) = Σ_{i=1}^{K} π_i N(u; µ_i, σ_i), the authors first sample the precision (number of decimal places) from yet another Gaussian and use that to discretize the probability mass function into equal sized bins, over which the probabilities are summed. If the sampled precision is, say, 2, then the probability of emitting the number 3.14 is given by ∫_{3.135}^{3.145} q(u) du. This likelihood estimate is used to train a causal language model.

Berg-Kirkpatrick and Spokoyny (2020)'s GMM implementation is slightly different: it alters the last inference step by sampling directly from the mixture of Gaussians, as they did with Log Laplace, Flow Laplace, and DExp.
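The binned likelihood described above can be sketched as follows with scipy (assumed); the mixture parameters are hard-coded here, whereas Spithourakis and Riedel (2018) pretrain them with EM and derive the mixture weights from the model.

```python
import numpy as np
from scipy.stats import norm

# Toy mixture of K = 2 Gaussians: means, standard deviations, mixture weights.
mu, sigma, pi = np.array([3.0, 10.0]), np.array([0.5, 2.0]), np.array([0.7, 0.3])

def bin_probability(value: float, precision: int) -> float:
    # Probability mass of the bin of width 10**-precision centred on `value`,
    # e.g. value=3.14, precision=2 integrates q(u) from 3.135 to 3.145.
    half = 0.5 * 10.0 ** (-precision)
    lo, hi = value - half, value + half
    return float(np.sum(pi * (norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma))))

print(bin_probability(3.14, precision=2))
```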
GMM-prototype by Jiang et al. (2020) similarly pretrains (with EM/hard-EM) the means and the variances, but also the mixture weights π_i, of a GMM over the training corpus. They then learn K prototype embeddings e_i corresponding to the K Gaussians. When encoding a new numeral n, its (input) embedding is calculated as E(n) = Σ_{i=1}^{K} w_i · e_i, where the weights are induced from the GMM:

w_i = P(Z = i | U = n) = π_i N(n; µ_i, σ_i) / Σ_{j=1}^{K} π_j N(n; µ_j, σ_j)

Thus the difference between GMM and GMM-prototype is that, after fixing the means and standard deviations of the Gaussian mixtures, in GMM the model learns to predict the mixture weights π_i for each individual number prediction, whereas in GMM-prototype, the π_i are frozen and the model learns prototype embeddings e_i. Note that prototype embeddings are encoder-only. To decode numbers, the authors implement weight-sharing across input and output embeddings, similar to how word vectors are trained (Mikolov et al., 2013), i.e., finding out which of the numerals in the corpus has the closest embedding.

SOM-prototype: GMM-prototype, in effect, merely uses the mixture of Gaussians to infer prototypes and to get the weights w_i. Jiang et al. (2020) tried another variant by identifying prototype numerals with Self-Organizing Maps (Kohonen, 1990) and by defining the weights as w_i = |g(x_i) − g(n)|^{-1}, where x_i is the i-th prototype, n is the number to be encoded, and g is a log-based squashing function.
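A sketch of the prototype encoding above, with numpy and scipy (assumed); the actual method pretrains the mixture with (hard-)EM over a corpus and learns the prototype embeddings e_i end to end, rather than using the fixed toy values below.

```python
import numpy as np
from scipy.stats import norm

K, dim = 3, 8
mu = np.array([1.0, 100.0, 10000.0])     # pretrained means
sigma = np.array([1.0, 30.0, 3000.0])    # pretrained standard deviations
pi = np.array([0.5, 0.3, 0.2])           # frozen mixture weights
prototypes = np.random.default_rng(0).normal(size=(K, dim))  # learnable e_i

def encode(n: float) -> np.ndarray:
    # w_i = P(Z = i | U = n): the posterior responsibility of each Gaussian.
    densities = pi * norm.pdf(n, mu, sigma)
    w = densities / densities.sum()
    # E(n) = sum_i w_i * e_i
    return w @ prototypes

print(encode(120.0).round(3))
```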
4 Results

Having organized the landscape of numeracy tasks and methods, we now present some key results for each numeracy task in NLP from previously published experiments over a subset of the described number representations:

Abstract Probes: Word embeddings vastly outperform random embedding baselines on abstract probes such as numeration, magnitude comparison, and sorting (Wallace et al., 2019; Naik et al., 2019). DICE, Value and Log Value embeddings excel at these probes, which makes intuitive sense given that they explicitly encode the numbers' magnitude, although Value embeddings do not easily extrapolate to larger numbers, possibly due to instability in training. The best number encoders with respect to these probes were found to be DigitCNNs, and character-tokenized models, e.g., ELMo, in general outperform subword ones, e.g., BERT (Wallace et al., 2019).

Arithmetic: GPT-3 (Brown et al., 2020) performs extremely well at zero-shot simple arithmetic, as long as the number of digits in the operands is low. The tokenization scheme could be the cause for limited extrapolation, since language models get better at arithmetic when numbers are tokenized at the digit/character level (Nogueira et al., 2021; Wallace et al., 2019). For arithmetic word problems, state-of-the-art solvers rely on predicting an equation, which is then filled in with specific numeric values from the question (Patel et al., 2021), altogether bypassing the need for encoding numbers into embeddings.

Masked Language Modelling: Zhang et al. (2020) show that BERT pretrained over datasets where numbers are in scientific notation (NumBERT) converges to the same loss as BERT on the masked language modelling objective, and scores nearly the same on GLUE language understanding benchmarks. For (causal) numeric language modelling, Spithourakis and Riedel (2018) show that Gaussian Mixture Models are the best decoders. For (masked) numeric language modelling, Berg-Kirkpatrick and Spokoyny (2020) show that modelling the mantissa in scientific notation may be overkill, since exponent embeddings alone outperform DigitRNN-sci over financial news and scientific articles.

Measurement Estimation: Zhang et al. (2020) train a regression probe to predict measurements of objects over the CLS embeddings of BERT/NumBERT. Given a template-lexicalized sentence such as "the dog is heavy," the model must predict the weight of a typical dog, against ground truth from the Distributions over Quantities dataset (Elazar et al., 2019). They find that NumBERT is a better text encoder than BERT for measurement estimation, the only difference between them being the notation used by the respective pretraining corpora. They also experiment with two number decoders: MCC (multi-class classification) and RGR (regression / Log Value embedding). MCC performs better when trying to predict Distributions over Quantities, perhaps due to the ground truth resembling the predicted Gaussians, but not on VerbPhysics, where the ground truth is less noisy. Lastly, even static word embeddings like GloVe have been shown to contain enough knowledge of measurement estimates to contrast two objects, e.g., classifying whether a car is bigger/heavier/faster than a ball (Goel et al., 2019).

Exact Facts: BERT and RoBERTa capture limited numerical commonsense, evident over NumerSense (Lin et al., 2020) sentences such as 'a tricycle has [MASK] wheels,' with the answer choices limited to the numbers 0-10. Results can be further improved by finetuning over a Wikipedia-extracted dataset of numeric information. Mishra et al. (2020) find commonsense question answering to be one of the hardest among their Numbergame challenge, using the NumNetv2 model (Ran et al., 2019), which is commonly used for DROP question answering. Both of these experiments evaluate on exact match metrics, hence it remains to be seen if representing approximate magnitudes yields benefit in modelling numeric facts.
5 Recommendations

Based on the above results, we now synthesize certain key insights into a set of directed takeaways to guide practitioners' design of number representations for their task:

Rule of thumb for string-based methods? Scientific notation is superior to decimal notation (Zhang et al., 2020), since models can learn to attend mostly to the exponent embedding rather than the mantissa (Berg-Kirkpatrick and Spokoyny, 2020). Character-level tokenization outperforms subword-level (Nogueira et al., 2021; Wallace et al., 2019; Geva et al., 2020). Pooled representations (DigitRNN, DigitCNN) lack a controlled study with unpooled ones (NumBERT, GenBERT), which makes it hard to proclaim a winner among the two.

Rule of thumb for real-based methods? Log scale is preferred over linear scale (Zhang et al., 2020; Jiang et al., 2020; Wallace et al., 2019; Berg-Kirkpatrick and Spokoyny, 2020), which makes intuitive sense but lacks as rigorous a study as has been undertaken in the cognitive science community (Feigenson et al., 2004). Regarding discretization, Zhang et al. (2020) show that binning (dense cross entropy loss) works better than continuous value prediction (MAE loss) on datasets where ground truth distributions are available. Lastly, modeling continuous predictions is notoriously hard for large ranges (Wallace et al., 2019), but Spithourakis and Riedel (2018) offer a way of binning such distributions by picking a precision level.

Encoding vs decoding numbers? In our simplified discussions above, we avoid differentiating between methods for encoding and decoding numbers. Value Embedding, for instance, can be used to encode numbers (projecting scalars onto a vector space) as well as to decode numbers (collapsing a vector into a scalar). On the other hand, manually-designed encoders like DICE are not easily reversible into decoding methods. Even with reversible methods, the encoders and decoders must usually be independently parameterized, unlike the input and output word embeddings which often share weights (Press and Wolf, 2016). Prototype embeddings by Jiang et al. (2020) are an exception, which share input/output embeddings for a fixed vocabulary of numbers.

Can we mix-and-match multiple methods? Given the wide range of number representations, an obvious next step is to try an ensemble of embeddings. Berg-Kirkpatrick and Spokoyny (2020) show that for encoding numbers, exponent embeddings added to DigitRNN (scientific notation) embeddings barely outperform the exponent embeddings alone. Similar experiments with a mix of real and string methods are yet to be seen.

Which methods for which tasks? Based on our taxonomy of tasks in Table 1, abstract tasks are good early probes for the grounded ones, e.g., finetuning GenBERT (Geva et al., 2020) on simple arithmetic helps it do well on downstream question answering, and the high scores of DICE (Sundararaman et al., 2020) on numeration and magnitude comparison are an indicator of similar boosts on (numeric) language modelling. With respect to granularity, real-based methods work well for approximate tasks such as measurement estimation and language modeling (Zhang et al., 2020; Berg-Kirkpatrick and Spokoyny, 2020) but not for exact tasks like arithmetic word problems or commonsense. DigitRNNs are broad-purpose number encoders, whereas distribution modeling methods like DExp are effective at decoding numbers.
6 Vision for Unified Numeracy in NLP

Numeracy is a core system of human intelligence (Kinzler and Spelke, 2007). Teaching numeracy to students works best when taught holistically, while less effective teachers deal with areas of mathematics discretely (Askew and Askew, 1997). While the NLP community has genuinely strived to improve language models' numeric skills, not all aspects of numeracy have been sufficiently targeted. It is evident from the sparsity in Table 2 that the community is far from achieving, or attempting, a holistic solution to numeracy. In this section, we outline our vision for such a unified solution, as three prerequisites necessary to consider for numerical NLU:

Evaluation. The first step towards a holistic solution to numeracy requires a benchmark covering its different subtasks. Aggregated leaderboards in NLP like GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) have incentivized research on natural language understanding, with scores categorized into semantic, syntactic, logical, and background knowledge.

An analogous leaderboard could be constructed to evaluate models on numeric reasoning tasks, again categorized according to the skills evaluated, e.g., exact vs approximate granularity, or abstract vs grounded numeracy. Numbergame (Mishra et al., 2020) is one such aggregation focusing on exact numeracy benchmarks, as evaluated by F1 and exact match scores in a reading comprehension setup. Both Numbergame and our own list of tasks (Section 2.2) are preliminary attempts at teasing apart the different aspects of numeracy. We encourage researchers to extend and refine such taxonomies.

A suite of numeracy tasks, matched with evaluations of their respective numerical skills, can enable testing model generalization from one skill to another. Some progress has already been made in this transfer learning setup, e.g., GenBERT (Geva et al., 2020), finetuned on a synthetic dataset of arithmetic problems, is found to score higher on DROP QA. Similarly, DICE (Sundararaman et al., 2020), optimized for numeration, improves scores on the Numeracy600K order-of-magnitude prediction task. Going forward, we need several such studies, ideally for each pair of tasks, to see whether some numeracy skills help models generalize to others.

Design Principles. Number representations vary based on design trade-offs between inductive biases and data-driven variance. The default BERT setup, with subword tokenization and lookup embeddings, occupies the variance end of the spectrum, allowing freedom in representing numbers. Value embeddings and DICE encodings, on the other hand, are closer to the bias end of the spectrum, since the inductive bias of continuity on the number line constrains the learning space. It is important to identify where on the bias-variance scale any representation stands, for a fair comparison.

Following parallel work in cognitive science, the community could explore whether exact and approximate numeracy require two specialized modules (Feigenson et al., 2004) or could be handled with a single representation (Cordes et al., 2001). Model designers must also make a choice on coverage: whether to target a broad or a narrow range of numbers to be represented. Multi-class classification (Zhang et al., 2020) over a fixed number of bins restricts the range of numbers expressed, as do DICE embeddings (Sundararaman et al., 2020). Value embeddings are continuous and theoretically unrestricted, but must practically be capped for bug-free training. On the other hand, string-based representations could always fall back to subword/char-level token embeddings to represent not only floats but also irrational (√2) and complex (1 + 2ι) numbers. Roy et al. (2015) introduced the Quantity-Value Representation format to allow closed and open ranges alongside scalar point numbers.

Broader Impact. Numbers are ubiquitous in natural language and are easily identified, at least in numeral forms. But they are by no means the only class of ordered concepts required for natural language understanding. Successful number representations can inspire work on incorporating more continuous domains into natural language processing systems. For instance, gradable adjectives like good, great, amazing, etc. are arguably on some cardinal scale, which can be mapped using value embeddings or Gaussian mixture models (Sharp et al., 2018; de Marneffe et al., 2010). Days of the week (Mon-Sun) and months of a year (Jan-Dec) form periodic patterns which can be modeled with sinusoidal functions (Martinez et al., 2020).
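As a toy illustration of the periodic patterns just mentioned (our own sketch, not the method of Martinez et al., 2020), days of the week can be embedded on a circle with sinusoidal features so that Sunday and Monday end up adjacent:

```python
import math

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def day_features(day: str) -> tuple[float, float]:
    # Map the day index onto the unit circle; the encoding is periodic, so the
    # distance between Sun and Mon equals the distance between Mon and Tue.
    angle = 2 * math.pi * DAYS.index(day) / len(DAYS)
    return (math.sin(angle), math.cos(angle))

for d in DAYS:
    print(d, tuple(round(v, 2) for v in day_features(d)))
```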
Lastly, numeracy is essential for natural language understanding. Consider the sentence: "Programmers earn $200,000 versus $100,000 for researchers." An intelligent agent with numeracy skills would identify that $100k is half of $200k, that $100k possibly denotes an annual salary, and infer that higher salaries lead to higher standards of living. In short, it was able to learn something about the two concepts programmers and researchers by crossing the continuous semantic space of numbers! The agent could now make use of this knowledge in a number-free situation, e.g., the mask in "He could not afford a car for several years after earning a CS degree because he took a job as a [MASK]" might better be filled with the word researcher than with programmer. A key goal of imparting numeracy to NLP models is to help them understand more about the world, using numbers.

7 Conclusion

This paper summarizes and contextualizes recent work on numeracy in NLP. We propose the first taxonomy of tasks and methods concerning text-centric numeric reasoning. We highlight key takeaways from the several experiments in the literature, along with caveats and scope for confirming some of the observed trends. We present a case for the lack of a holistic solution to numeracy in NLP, and put forward a set of aspects to consider when working towards one. We draw the following two major conclusions from our study: (1) the default subword segmentation with lookup embeddings used to represent words is clearly suboptimal for numbers, and (2) there are several unanswered research questions on the level of specificity, coverage, and inductive bias needed to holistically solve numeracy.

8 Acknowledgements

This work was funded by the Defense Advanced Research Projects Agency with award N660011924033. We are thankful for the countless suggestions we accumulated during preliminary presentations at MLSS 2020, WeCNLP 2020, and GSS 2020, as well as over email correspondences with Biplav Srivastava, Antoine Bosselut, and Harsh Agarwal. We would like to thank the anonymous NAACL 2021 reviewers (particularly #3) for pointing out blind spots in our submission, which we have tried our best to rectify.

Ethical Considerations

This work revolves around the Hindu-Arabic numeral system and English number words, which are not the only number systems still in use today. We encourage follow-up work to take these systems into consideration, on the lines of Johnson et al. (2020) and Nefedov (2020).
References

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.
Daniel Andor, Luheng He, Kenton Lee, and Emily Pitler. 2019. Giving BERT a calculator: Finding operations and arguments with reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5947–5952, Hong Kong, China. Association for Computational Linguistics.
Mike Askew and Mike Askew. 1997. Effective teachers of numeracy. King's College London, London.
Taylor Berg-Kirkpatrick and Daniel Spokoyny. 2020. An empirical investigation of contextualized number prediction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4754–4764, Online. Association for Computational Linguistics.
Satwik Bhattamishra, Arkil Patel, and Navin Goyal. 2020. On the computational power of transformers and its implications in sequence modeling. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 455–475, Online. Association for Computational Linguistics.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
Sarah E Bullard, Deborah Fein, Mary Kay Gleeson, Nita Tischer, Robert L Mapou, and Edith Kaplan. 2004. The Biber cognitive estimation test. Archives of clinical neuropsychology, 19(6):835–846.
Kushal Chawla, Gale Lucas, Jonathan Gratch, and Jonathan May. 2020. BERT in negotiations: Early prediction of buyer-seller negotiation outcomes. arXiv preprint arXiv:2004.02363.
Chung-Chi Chen, Hen-Hsen Huang, and Hsin-Hsi Chen. 2020. NumClaim: Investor's fine-grained claim detection. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM '20, pages 1973–1976, New York, NY, USA. Association for Computing Machinery.
Chung-Chi Chen, Hen-Hsen Huang, Yow-Ting Shiue, and Hsin-Hsi Chen. 2018. Numeral understanding in financial tweets for fine-grained crowd-based forecasting. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI), pages 136–143. IEEE.
Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura, and Hsin-Hsi Chen. 2019. Numeracy-600K: Learning numeracy for detecting exaggerated information in market comments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6307–6313, Florence, Italy. Association for Computational Linguistics.
Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, et al. 2019. From 'F' to 'A' on the NY Regents science exams: An overview of the Aristo project. arXiv preprint arXiv:1909.01958.
Sara Cordes, Rochel Gelman, Charles R Gallistel, and John Whalen. 2001. Variability signatures distinguish verbal from nonverbal counting for both large and small numbers. Psychonomic bulletin & review, 8(4):698–707.
Marie-Catherine de Marneffe, Christopher D. Manning, and Christopher Potts. 2010. "Was it good? It was provocative." Learning the meaning of scalar adjectives. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 167–176, Uppsala, Sweden. Association for Computational Linguistics.
Stanislas Dehaene. 2011. The number sense: How the mind creates mathematics. OUP USA.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.
Abhijeet Dubey, Lakshya Kumar, Arpan Somani, Aditya Joshi, and Pushpak Bhattacharyya. 2019. "When numbers matter!": Detecting sarcasm in numerical portions of text. In Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 72–80, Minneapolis, USA. Association for Computational Linguistics.
Yanai Elazar and Yoav Goldberg. 2019. Where's my head? Definition, data set, and models for numeric fused-head identification and resolution. Transactions of the Association for Computational Linguistics, 7:519–535.
Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran, Tania Bedrax-Weiss, and Dan Roth. 2019. How large are lions? Inducing distributions over quantitative attributes. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3973–3983, Florence, Italy. Association for Computational Linguistics.
Lisa Feigenson, Stanislas Dehaene, and Elizabeth Spelke. 2004. Core systems of number. Trends in cognitive sciences, 8(7):307–314.
Maxwell Forbes and Yejin Choi. 2017. Verb physics: Relative physical knowledge of actions and objects. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 266–276, Vancouver, Canada. Association for Computational Linguistics.
Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting numerical reasoning skills into language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 946–958, Online. Association for Computational Linguistics.
Pranav Goel, Shi Feng, and Jordan Boyd-Graber. 2019. How pre-trained word representations capture commonsense physical comparisons. In Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing, pages 130–135, Hong Kong, China. Association for Computational Linguistics.
David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
Chengyue Jiang, Zhonglin Nian, Kaihao Guo, Shanbo Chu, Yinggong Zhao, Libin Shen, and Kewei Tu. 2020. Learning numeral embedding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2586–2599, Online. Association for Computational Linguistics.
Devin Johnson, Denise Mak, Andrew Barker, and Lexi Loessberg-Zahl. 2020. Probing for multilingual numerical understanding in transformer-based language models. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 184–192, Online. Association for Computational Linguistics.
Katherine D Kinzler and Elizabeth S Spelke. 2007. Core systems in human cognition. Progress in brain research, 164:257–264.
Teuvo Kohonen. 1990. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480.
Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, and Xiang Ren. 2020. Birds have four legs?! NumerSense: Probing numerical commonsense knowledge of pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6862–6868, Online. Association for Computational Linguistics.
Richard Diehl Martinez, Scott Novotney, Ivan Bulyko, Ariya Rastrow, and Andreas Stolcke. 2020. Contextual datetime language model adaptation for speech recognition. West Coast NLP Summit.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, and Chitta Baral. 2020. Towards question format independent numerical reasoning: A set of prerequisite tasks.
Aakanksha Naik, Abhilasha Ravichander, Carolyn Rose, and Eduard Hovy. 2019. Exploring numeracy in word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3374–3380, Florence, Italy. Association for Computational Linguistics.
Mikhail Nefedov. 2020. Dataset for evaluation of mathematical reasoning abilities in Russian. In Conference on Artificial Intelligence and Natural Language, pages 135–144. Springer.
Rodrigo Nogueira, Zhiying Jiang, and Jimmy Li. 2021. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems?
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Marten Postma, Filip Ilievski, and Piek Vossen. 2018. SemEval-2018 task 5: Counting events and participants in the long tail. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 70–80, New Orleans, Louisiana. Association for Computational Linguistics.
Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Qiu Ran, Yankai Lin, Peng Li, Jie Zhou, and Zhiyuan Liu. 2019. NumNet: Machine reading comprehension with numerical reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2474–2484, Hong Kong, China. Association for Computational Linguistics.
Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, and Eduard Hovy. 2019. EQUATE: A benchmark evaluation framework for quantitative reasoning in natural language inference. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 349–361, Hong Kong, China. Association for Computational Linguistics.
Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal. Association for Computational Linguistics.
Subhro Roy, Tim Vieira, and Dan Roth. 2015. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics, 3:1–13.
David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. 2019. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Rebecca Sharp, Mithun Paul, Ajay Nagesh, Dane Bell, and Mihai Surdeanu. 2018. Grounding gradable adjectives through crowdsourcing. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Georgios P. Spithourakis and Sebastian Riedel. 2018. Numeracy for language models: Evaluating and improving their ability to predict numbers. CoRR, abs/1805.08154.
Dhanasekar Sundararaman, Shijing Si, Vivek Subramanian, Guoyin Wang, Devamanyu Hazarika, and Lawrence Carin. 2020. Methods for numeracy-preserving word embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4742–4753, Online. Association for Computational Linguistics.
Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. 2019. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence. Association for Computational Linguistics.
Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, and Ashish Sabharwal. 2019. QuaRel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7063–7071.
Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. Advances in Neural Information Processing Systems, 33.
Alberto Testolin, Serena Dolfi, Mathijs Rochus, and Marco Zorzi. 2020. Visual sense of number vs. sense of magnitude in humans and machines. Scientific reports, 10(1):1–13.
Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. 2018. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, pages 8035–8044.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. 2019. Do NLP models know numbers? Probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5307–5315, Hong Kong, China. Association for Computational Linguistics.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pages 3266–3280.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
Sandra Williams and Richard Power. 2010. A fact-aligned corpus of numerical expressions. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.
Karen Wynn. 1990. Children's understanding of counting. Cognition, 36(2):155–193.
Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan Roth. 2020. Do language embeddings capture scales? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4889–4896, Online. Association for Computational Linguistics.
Ben Zhou, Qiang Ning, Daniel Khashabi, and Dan Roth. 2020. Temporal common sense acquisition with minimal supervision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7579–7589, Online. Association for Computational Linguistics.
A Other Numeracy Tasks

Here, we describe certain related tasks that fall outside our taxonomy:

(Numeric) Paraphrasing is what we call the task of identifying one-to-one correspondences between different surface forms of the same number. Twelve is the same as '12', also referred to as a dozen. This task cuts across all the tasks we discussed, since the same number, expressed in several different ways, should nevertheless be identified by an NLP model before any subsequent reasoning. Similar to how WordNet (Miller, 1995) provides a huge list of synonyms, numeric paraphrases can be obtained by libraries[3] which convert numerals to words, words to numerals, etc. One could also envision this as a learning task given a large enough corpus, such as the NumGen dataset (Williams and Power, 2010) containing 2000 fact-aligned numeric expressions over 110 articles.

Quantity Entailment tasks (Ravichander et al., 2019; Roy et al., 2015) are analogous to Natural Language Inference, which requires understanding of not only equivalence (as in paraphrasing) but also deeper relations like entailment and contradiction, e.g., the premise 'he was 16 yrs old' entails the hypothesis 'he was a teenager'. On similar lines, Mishra et al. (2020) modified the existing QuaRel dataset (Tafjord et al., 2019) to force models to perform quantity entailment, e.g., dog1 is light, dog2 is heavy is replaced with dog1 weighs 70 lbs, dog2 weighs 90 lbs.

Numeral Understanding is the task of categorizing numbers into percentages, prices, dates, times, quantities, etc. and their respective subcategories (Chen et al., 2018).

Fused-Head Resolution for numbers is essential to ground them when the context is implicit. For example, the sentence "I woke up at 11" has 'a.m.' or 'o'clock' as the fused head to be resolved (Elazar and Goldberg, 2019).

Counting is the task of keeping track of discrete instances of some object. When kids count a set of objects, they quickly learn to keep a track, say on their fingers, but struggle with realizing the Cardinal Principle, i.e., that the last counter value denotes the number of entities being considered (Wynn, 1990). Similarly, LSTMs (Suzgun et al., 2019) and transformers (Bhattamishra et al., 2020) have been shown to possess counting skills, but in order to answer counting questions, they must also learn to map the counts to number words or numerals. Counting tasks have been proposed in computer vision (Testolin et al., 2020) as well as in NLP (Postma et al., 2018; Talmor et al., 2020).

Domain-specific tasks require background knowledge in addition to mathematical skills. Numbergame (Mishra et al., 2020) includes questions on Physics (find the distance travelled in 2 hrs by a train moving at 50 mph) and Chemistry (find the mass percentage of H in C6H6). Project Aristo (Clark et al., 2019) solves elementary and high school science problems, which often involve numeric reasoning.

[3] Example: https://pypi.org/project/num2words/
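A small illustration of numeric paraphrasing using the num2words package referenced in footnote 3 (an external dependency, installed separately); the paraphrase sets here are only examples, not a published resource.

```python
from num2words import num2words

surface_forms = {}
for n in [12, 19, 1234]:
    surface_forms[str(n)] = {
        "words": num2words(n),                  # 'twelve', 'nineteen', ...
        "ordinal": num2words(n, to="ordinal"),  # 'twelfth', 'nineteenth', ...
    }
print(surface_forms)
```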
