
Word Acquisition in Neural Language Models

Tyler A. Chang (1,2), Benjamin K. Bergen (1)

(1) Department of Cognitive Science
(2) Halıcıoğlu Data Science Institute
University of California, San Diego
{tachang, bkbergen}@ucsd.edu

Abstract

We investigate how neural language models acquire individual words during training, extracting learning curves and ages of acquisition for over 600 words on the MacArthur-Bates Communicative Development Inventory (Fenson et al., 2007). Drawing on studies of word acquisition in children, we evaluate multiple predictors for words' ages of acquisition in LSTMs, BERT, and GPT-2. We find that the effects of concreteness, word length, and lexical class are pointedly different in children and language models, reinforcing the importance of interaction and sensorimotor experience in child language acquisition. Language models rely far more on word frequency than children, but like children, they exhibit slower learning of words in longer utterances. Interestingly, models follow consistent patterns during training for both unidirectional and bidirectional models, and for both LSTM and Transformer architectures. Models predict based on unigram token frequencies early in training, before transitioning loosely to bigram probabilities, eventually converging on more nuanced predictions. These results shed light on the role of distributional learning mechanisms in children, while also providing insights for more human-like language acquisition in language models.

Figure 1: Learning curves for the word "walk" in a BERT language model and human children. Blue horizontal lines indicate age of acquisition cutoffs. The blue curve represents the fitted sigmoid function based on the language model surprisals during training (black). Child data obtained from Frank et al. (2017).

1 Introduction

Language modeling, predicting words from context, has grown increasingly popular as a pre-training task in NLP in recent years; neural language models such as BERT (Devlin et al., 2019), ELMo (Peters et al., 2018), and GPT (Brown et al., 2020) have produced state-of-the-art performance on a wide range of NLP tasks. There is now a substantial amount of work assessing the linguistic information encoded by language models (Rogers et al., 2020); in particular, behavioral approaches from psycholinguistics and cognitive science have been applied to study language model predictions (Futrell et al., 2019; Ettinger, 2020). From a cognitive perspective, language models are of theoretical interest as distributional models of language, agents that learn exclusively from statistics over language (Boleda, 2020; Lenci, 2018).

However, previous psycholinguistic studies of language models have nearly always focused on fully-trained models, precluding comparisons to the wealth of literature on human language acquisition. There are limited exceptions. Rumelhart and McClelland (1986) famously studied past tense verb form learning in phoneme-level neural networks during training, a study which was replicated in more modern character-level recurrent neural networks (Kirov and Cotterell, 2018). However, these studies focused only on sub-word features. There remains a lack of research on language acquisition in contemporary language models, which encode higher level features such as syntax and semantics.

As an initial step towards bridging the gap between language acquisition and language modeling, we present an empirical study of word acquisition during training in contemporary language models, including LSTMs, GPT-2, and BERT.
We consider how variables such as word frequency, concreteness, and lexical class contribute to words' ages of acquisition in language models. Each of our selected variables has effects on words' ages of acquisition in children; our language model results allow us to identify the extent to which each effect in children can or cannot be attributed in principle to distributional learning mechanisms.

Finally, to better understand how computational models acquire language, we identify consistent patterns in language model training across architectures. Our results suggest that language models may acquire traditional distributional statistics such as unigram and bigram probabilities in a systematic way. Understanding how language models acquire language can lead to better architectures and task designs for future models, while also providing insights into distributional learning mechanisms in people.

2 Related work

Our work draws on methodologies from word acquisition studies in children and psycholinguistic evaluations of language models. In this section, we briefly outline both lines of research.

2.1 Child word acquisition

Child development researchers have previously studied word acquisition in children, identifying variables that help predict words' ages of acquisition in children. In Wordbank, Frank et al. (2017) compiled reports from parents reporting when their child produced each word on the MacArthur-Bates Communicative Development Inventory (CDI; Fenson et al., 2007). For each word w, Braginsky et al. (2016) fitted a logistic curve predicting the proportion of children that produce w at different ages; they defined a word's age of acquisition as the age at which 50% of children produce w. Variables such as word frequency, word length, lexical class, and concreteness were found to influence words' ages of acquisition in children across languages. Recently, it was shown that fully-trained LSTM language model surprisals are also predictive of words' ages of acquisition in children (Portelance et al., 2020). However, no studies have evaluated ages of acquisition in language models themselves.

2.2 Evaluating language models

Recently, there has been substantial research evaluating language models using psycholinguistic approaches, reflecting a broader goal of interpreting language models (BERTology; Rogers et al., 2020). For instance, Ettinger (2020) used the output token probabilities from BERT in carefully constructed sentences, finding that BERT learns commonsense and semantic relations to some degree, although it struggles with negation. Gulordava et al. (2018) found that LSTM language models recognize long distance syntactic dependencies; however, they still struggle with more complicated constructions (Marvin and Linzen, 2018).

These psycholinguistic methodologies do not rely on specific language model architectures or fine-tuning on a probe task. Notably, because these approaches rely only on output token probabilities from a given language model, they are well-suited to evaluations early in training, when fine-tuning on downstream tasks is unfruitful.

That said, previous language model evaluation studies have focused on fully-trained models, progressing largely independently from human language acquisition literature. Our work seeks to bridge this gap.

3 Method

We trained unidirectional and bidirectional language models with LSTM and Transformer architectures. We quantified each language model's age of acquisition for each word in the CDI (Fenson et al., 2007). Similar to word acquisition studies in children, we identified predictors for words' ages of acquisition in language models.[1]

[1] Code and data are available at https://github.com/tylerachang/word-acquisition-language-models.

3.1 Language models

Datasets and training. Language models were trained on a combined corpus containing the BookCorpus (Zhu et al., 2015) and WikiText-103 datasets (Merity et al., 2017). Following Devlin et al. (2019), each input sequence was a sentence pair; the training dataset consisted of 25.6M sentence pairs. The remaining sentences (5.8M pairs) were used for evaluation and to generate word learning curves. Sentences were tokenized using the unigram language model tokenizer implemented in SentencePiece (Kudo and Richardson, 2018). Models were trained for 1M steps, with batch size 128 and learning rate 0.0001. As a metric for overall language model performance, we report evaluation perplexity scores in Table 1. We include evaluation loss curves, full training details, and hyperparameters in Appendix A.1.
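To make the tokenization setup concrete, the short Python sketch below trains a unigram language model tokenizer with the SentencePiece library (Kudo and Richardson, 2018). It is an illustrative sketch rather than the authors' exact configuration: the corpus path and model prefix are hypothetical placeholders, and only the vocabulary size (30004, from Appendix A.1) and the unigram model type are taken from the paper.

import sentencepiece as spm

# Train a unigram language model tokenizer on a plain-text corpus
# (one sentence per line). The input path and model prefix are placeholders.
spm.SentencePieceTrainer.train(
    input="train_sentences.txt",
    model_prefix="lm_tokenizer",
    model_type="unigram",
    vocab_size=30004,  # vocabulary size reported in Appendix A.1
)

# Load the trained tokenizer and segment a sample sentence into subword tokens.
sp = spm.SentencePieceProcessor(model_file="lm_tokenizer.model")
print(sp.encode("The child walked to the park.", out_type=str))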
          # Parameters   Perplexity
LSTM      37M            54.8
GPT-2     108M           30.2
BiLSTM    51M            9.0
BERT      109M           7.2

Table 1: Parameter counts and evaluation perplexities for the trained language models. For reference, the pre-trained BERT base model from Huggingface reached a perplexity of 9.4 on our evaluation set. Additional perplexity comparisons with comparable models are included in Appendix A.1.

Transformers. The two Transformer models followed the designs of GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2019), allowing us to evaluate both a unidirectional and bidirectional Transformer language model. GPT-2 was trained with the causal language modeling objective, where each token representation is used to predict the next token; the masked self-attention mechanism allows tokens to attend only to previous tokens in the input sequence. In contrast, BERT used the masked language modeling objective, where masked tokens are predicted from surrounding tokens in both directions.

Our BERT model used the base size model from Devlin et al. (2019). Our GPT-2 model used the similar-sized model from Radford et al. (2019), equal in size to the original GPT model. Parameter counts are listed in Table 1. Transformer models were trained using the Huggingface Transformers library (Wolf et al., 2020).

LSTMs. We also trained both a unidirectional and bidirectional LSTM language model, each with three stacked LSTM layers. Similar to GPT-2, the unidirectional LSTM predicted the token at time t from the hidden state at time t−1. The bidirectional LSTM (BiLSTM) predicted the token at time t from the sum of the hidden states at times t−1 and t+1 (Aina et al., 2019).

3.2 Learning curves and ages of acquisition

We sought to quantify each language model's ability to predict individual words over the course of training. We considered all words in the CDI that were considered one token by the language models (611 out of 651 words).

For each such token w, we identified up to 512 occurrences of w in the held-out portion of the language modeling dataset.[2] To evaluate a language model at training step s, we fed each sentence pair into the model, attempting to predict the masked token w. We computed the surprisal, −log2(P(w)), averaged over all occurrences of w, to quantify the quality of the models' predictions for word w at step s (Levy, 2008; Goodkind and Bicknell, 2018).

[2] We only selected sentence pairs with at least eight tokens of context, unidirectionally or bidirectionally depending on model architecture. Thus, the unidirectional and bidirectional samples differed slightly. Most tokens (92.3%) had the maximum of 512 samples both unidirectionally and bidirectionally, and all tokens had at least 100 samples in both cases.
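As an illustration of this surprisal computation, the sketch below retrieves −log2 P(w) for a target word at a masked position using a masked language model from the Huggingface Transformers library. It is a simplified sketch, not the authors' evaluation code: it uses the publicly released bert-base-uncased checkpoint rather than the models trained in this paper, and it scores one hand-written sentence rather than averaging over sampled sentence pairs.

import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_word_surprisal(sentence_with_mask, target_word):
    """Surprisal in bits, -log2 P(target_word), at the [MASK] position."""
    inputs = tokenizer(sentence_with_mask, return_tensors="pt")
    mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_position]
    log_probs = torch.log_softmax(logits, dim=-1)
    target_id = tokenizer.convert_tokens_to_ids(target_word)
    return -log_probs[target_id].item() / math.log(2)

# Averaging this quantity over sampled occurrences of a word at a given
# training step yields one point on that word's learning curve.
print(masked_word_surprisal("The children like to [MASK] outside.", "walk"))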
We computed this average surprisal for each target word at approximately 200 different steps during language model training, sampling more heavily from earlier training steps, prior to model convergence. The selected steps are listed in Appendix A.1. By plotting surprisals over the course of training, we obtained a learning curve for each word, generally moving from high surprisal to low surprisal. The surprisal axis in our plots is reversed to reflect increased understanding over the course of training, consistent with plots showing increased proportions of children producing a given word over time (Frank et al., 2017).

For each learning curve (4 language model architectures × 611 words), we fitted a sigmoid function to model the smoothed acquisition of word w. Sample learning curves are shown in Figure 1 and Figure 2.

Figure 2: Learning curves for the word "eat" for all four language model architectures. Blue horizontal lines indicate age of acquisition cutoffs, and blue curves represent fitted sigmoid functions.

Age of acquisition. To extract age of acquisition from a learning curve, we established a cutoff surprisal where we considered a given word "learned." In child word acquisition studies, an analogous cutoff is established when 50% of children produce a word (Braginsky et al., 2016).

Following this precedent, we determined our cutoff to be 50% between a baseline surprisal (predicting words based on random chance) and the minimum surprisal attained by the model for word w. We selected the random chance baseline to best reflect a language model's ability to predict a word with no access to any training data, similar to an infant's language-specific knowledge prior to any linguistic exposure. We selected minimum surprisal as our other bound to reflect how well a particular word can eventually be learned by a particular language model, analogous to an adult's understanding of a given word.

For each learning curve, we found the intersection between the fitted sigmoid and the cutoff surprisal value. We defined age of acquisition for a language model as the corresponding training step, on a log10 scale. Sample cutoffs and ages of acquisition are shown in Figure 1 and Figure 2.
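The sketch below illustrates this procedure on invented data points: fit a sigmoid to a word's surprisal-by-step learning curve, set the cutoff halfway between the random-chance surprisal (a uniform distribution over the 30004-token vocabulary) and the word's minimum surprisal, and read off the log10 training step where the fitted curve crosses the cutoff. The sigmoid parameterization and optimizer settings are our own simplifying assumptions, not the authors' exact fitting code.

import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, lower, upper, midpoint, slope):
    # Decreasing sigmoid when slope < 0: surprisal falls from `upper` to `lower`.
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (x - midpoint)))

# Toy learning curve: (log10 step, mean surprisal) pairs for one word.
log_steps = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0])
surprisals = np.array([14.9, 14.5, 13.0, 10.0, 7.0, 5.5, 5.0, 4.9, 4.8])

params, _ = curve_fit(sigmoid, log_steps, surprisals,
                      p0=[surprisals.min(), surprisals.max(), 4.0, -1.0],
                      maxfev=10000)

random_chance = np.log2(30004)  # uniform surprisal over the vocabulary
cutoff = surprisals.min() + 0.5 * (random_chance - surprisals.min())

# Age of acquisition: the log10 step where the fitted sigmoid crosses the cutoff.
grid = np.linspace(log_steps.min(), log_steps.max(), 10000)
aoa = grid[np.argmin(np.abs(sigmoid(grid, *params) - cutoff))]
print(f"Age of acquisition: {aoa:.2f} (log10 training steps)")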
8 "eat" "eat" "eat"
"eat"
8

Mean surprisal
7.5 6
10
10
10.0 9
12
12
14 12.5 12
14

16 15.0
16 15
2 3 4 5 6 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6
LSTM steps (log10) GPT−2 steps (log10) BiLSTM steps (log10) BERT steps (log10)

Figure 2: Learning curves for the word “eat” for all four language model architectures. Blue horizontal lines
indicate age of acquisition cutoffs, and blue curves represent fitted sigmoid functions.

best reflect a language model’s ability to predict • Concreteness: we used human-generated


a word with no access to any training data, similar concreteness norms from Brysbaert et al.
to an infant’s language-specific knowledge prior (2014), rated on a five-point scale. We im-
to any linguistic exposure. We selected minimum puted missing values (3% of words) using the
surprisal as our other bound to reflect how well mean concreteness score.
a particular word can eventually be learned by a • Lexical class: we used the lexical classes an-
particular language model, analogous to an adult’s notated in Wordbank. Possible lexical classes
understanding of a given word. were Noun, Verb, Adjective, Function Word,
For each learning curve, we found the intersec- and Other.
tion between the fitted sigmoid and the cutoff sur-
prisal value. We defined age of acquisition for a We ran linear regressions with linear terms for
language model as the corresponding training step, each predictor. To determine statistical signifi-
on a log10 scale. Sample cutoffs and ages of ac- cance for each predictor, we ran likelihood ratio
quisition are shown in Figure 1 and Figure 2. tests, comparing the overall regression (including
the target predictor) with a regression including all
3.3 Predictors for age of acquisition predictors except the target. To determine the di-
As potential predictors for words’ ages of acqui- rection of effect for each continuous predictor, we
sition in language models, we selected variables used the sign of the coefficient in the overall re-
that are predictive of age of acquisition in children gression.
(Braginsky et al., 2016). When predicting ages As a potential concern for interpreting regres-
of acquisition in language models, we computed sion coefficient signs, we assessed collinearities
word frequencies and utterance lengths over the between predictors by computing the variance in-
language model training corpus. Our five selected flation factor (VIF) for each predictor. No VIF
predictors were: exceeded 5.0,4 although we did observe mod-
erate correlations between log-frequency and n-
• Log-frequency: the natural log of the word’s chars (r = −0.49), and between log-frequency
per-1000 token frequency. and concreteness (r = −0.64). These correla-
tions are consistent with those identified for child-
• MLU: we computed the mean length of ut- directed speech in Braginsky et al. (2016). To
terance as the mean length of sequences con- ease collinearity concerns, we considered single-
taining a given word.3 MLU has been used predictor regressions for each predictor, using ad-
as a metric for the complexity of syntactic justed predictor values after accounting for log-
contexts in which a word appears (Roy et al., frequency (residuals after regressing the predictor
2015). over log-frequency). In all cases, the coefficient
• n-chars: as in Braginsky et al. (2016), we sign in the adjusted single predictor regression was
used the number of characters in a word as consistent with the sign of the coefficient in the
a coarse proxy for the length of a word. overall regression.
When lexical class (the sole categorical predic-
3
We also considered a unidirectional MLU metric (count- tor) reached significance based on the likelihood
ing only previous tokens) for the unidirectional models, find-
4
ing that it produced similar results. Common VIF cutoff values are 5.0 and 10.0.
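A sketch of this regression analysis on a toy data frame is shown below, using statsmodels: a likelihood ratio test comparing the full regression to a regression that drops one target predictor (here MLU), plus variance inflation factors for the predictors. The column names and simulated data are illustrative assumptions, not the study's data.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy data standing in for per-word predictors and ages of acquisition.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "log_frequency": rng.normal(size=n),
    "mlu": rng.normal(size=n),
    "n_chars": rng.integers(2, 10, size=n).astype(float),
    "concreteness": rng.uniform(1, 5, size=n),
})
df["aoa"] = 4.0 - 0.5 * df["log_frequency"] + 0.1 * df["mlu"] + rng.normal(scale=0.2, size=n)

# Likelihood ratio test for one target predictor (MLU): full model vs. model without it.
full = smf.ols("aoa ~ log_frequency + mlu + n_chars + concreteness", data=df).fit()
reduced = smf.ols("aoa ~ log_frequency + n_chars + concreteness", data=df).fit()
lr_stat = 2 * (full.llf - reduced.llf)
p_value = stats.chi2.sf(lr_stat, df=full.df_model - reduced.df_model)
print(f"MLU: LR = {lr_stat:.2f}, p = {p_value:.4g}, coefficient sign = {np.sign(full.params['mlu']):+.0f}")

# Variance inflation factors for the continuous predictors.
X = sm.add_constant(df[["log_frequency", "mlu", "n_chars", "concreteness"]])
for i, name in enumerate(X.columns[1:], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X.values, i):.2f}")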
When lexical class (the sole categorical predictor) reached significance based on the likelihood ratio test, we ran a one-way analysis of covariance (ANCOVA) with log-frequency as a covariate. The ANCOVA ran a standard ANOVA on the age of acquisition residuals after regressing over log-frequency. We used Tukey's honestly significant difference (HSD) test to assess pairwise differences between lexical classes.

3.4 Age of acquisition in children

For comparison, we used the same variables to predict words' ages of acquisition in children, as in Braginsky et al. (2016). We obtained smoothed ages of acquisition for children from the Wordbank dataset (Frank et al., 2017). When predicting ages of acquisition in children, we computed word frequencies and utterance lengths over the North American English CHILDES corpus of child-directed speech (MacWhinney, 2000).

Notably, CHILDES contained much shorter sentences on average than the language model training corpus (mean sentence length 4.50 tokens compared to 15.14 tokens). CDI word log-frequencies were only moderately correlated between the two corpora (r = 0.78). This aligns with previous work finding that child-directed speech contains on average fewer words per utterance, smaller vocabularies, and simpler syntactic structures than adult-directed speech (Soderstrom, 2007). These differences were likely compounded by differences between spoken language in the CHILDES corpus and written language in the language model corpus. We computed word frequencies and MLUs separately over the two corpora to ensure that our predictors accurately reflected the learning environments of children and the language models.
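As a minimal illustration of how these corpus statistics can be computed, the sketch below derives each word's log-frequency (natural log of the per-1000 token frequency) and MLU (mean length of the utterances containing the word) from a toy list of tokenized utterances; the utterances themselves are invented.

import math
from collections import Counter, defaultdict

# Toy corpus: a few tokenized utterances standing in for CHILDES or the LM training corpus.
utterances = [
    ["you", "want", "more", "milk"],
    ["look", "at", "the", "big", "dog"],
    ["the", "dog", "is", "outside"],
]

token_counts = Counter()
utterance_lengths = defaultdict(list)
for utt in utterances:
    token_counts.update(utt)
    for word in set(utt):  # each containing utterance contributes once per word
        utterance_lengths[word].append(len(utt))

total_tokens = sum(token_counts.values())
for word in sorted(token_counts):
    log_freq = math.log(1000.0 * token_counts[word] / total_tokens)  # natural log, per-1000 tokens
    mlu = sum(utterance_lengths[word]) / len(utterance_lengths[word])
    print(f"{word}: log-frequency = {log_freq:.2f}, MLU = {mlu:.2f}")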
We also note that the language model training corpus was much larger overall than the CHILDES corpus. CHILDES contained 7.5M tokens, while the language model corpus contained 852.1M tokens. Children are estimated to hear approximately 13K words per day (Gilkerson et al., 2017), for a total of roughly 19.0M words during their first four years of life. Because contemporary language models require much more data than children hear, the models do not necessarily reflect how children would learn if restricted solely to linguistic input. Instead, the models serve as examples of relatively successful distributional learners, establishing how one might expect word acquisition to progress according to effective distributional mechanisms.

4 Results

Significant predictors of age of acquisition are shown in Table 2, comparing children and each of the four language model architectures.

                 LSTM      GPT-2     BiLSTM    BERT      Children
Log-frequency    *** (−)   *** (−)   *** (−)   *** (−)   *** (−)
MLU                        ** (+)    *** (+)   *** (+)   *** (+)
n-chars          *** (−)   *** (−)   *** (−)   *** (−)   ** (+)
Concreteness                                             *** (−)
Lexical class    ***       ***                           ***
R2               0.93      0.92      0.95      0.94      0.43

Table 2: Significant predictors for a word's age of acquisition are marked by asterisks (p < 0.05 *; p < 0.01 **; p < 0.001 ***). Signs of coefficients are notated in parentheses. The R2 denotes the adjusted R2 in a regression using all five predictors.

Log-frequency. In children and all four language models, more frequent words were learned earlier (a negative effect on age of acquisition). As shown in Figure 3, this effect was much more pronounced in language models (adjusted R2 = 0.91 to 0.94) than in children (adjusted R2 = 0.01).[5]

[5] Because function words are frequent but acquired later by children, a quadratic model of log-frequency on age of acquisition in children provided a slightly better fit (R2 = 0.03) if not accounting for lexical class. A quadratic model of log-frequency also provided a slightly better fit for unidirectional language models (R2 = 0.93 to 0.94), particularly for high-frequency words; in language models, this could be due either to a floor effect on age of acquisition for high-frequency words or to slower learning of function words. Regardless, significant effects of other predictors remained the same when using a quadratic model for log-frequency.

Figure 3: Effects of log-frequency on words' ages of acquisition (AoA) in the BiLSTM and children. The BiLSTM was the language model architecture with the largest effect of log-frequency (adjusted R2 = 0.94).
The sizeable difference in log-frequency predictivity emphasizes the fact that language models learn exclusively from distributional statistics over words, while children have access to additional social and sensorimotor cues.

MLU. Except in unidirectional LSTMs, MLU had a positive effect on a word's age of acquisition in language models. Interestingly, we might have expected the opposite effect (particularly in Transformers) if additional context (longer utterances) facilitated word learning. Instead, our results are consistent with effects of MLU in children; words in longer utterances are learned later, even after accounting for other variables. The lack of effect in unidirectional LSTMs could simply be due to LSTMs being the least sensitive to contextual information of the models under consideration. The positive effect of MLU in other models suggests that complex syntactic contexts may be more difficult to learn through distributional learning alone, which might partly explain why children learn words in longer utterances more slowly.

n-chars. There was a negative effect of n-chars on age of acquisition in all four language models; longer words were learned earlier. This contrasts with children, who acquire shorter words earlier. This result is particularly interesting because the language models we used have no information about word length. We hypothesize that the effect of n-chars in language models may be driven by polysemy, which is not accounted for in our regressions. Shorter words tend to be more polysemous (a greater diversity of meanings; Casas et al., 2019), which could lead to slower learning in language models. In children, this effect may be overpowered by the fact that shorter words are easier to parse and produce.

Concreteness. Although children overall learn more concrete words earlier, the language models showed no significant effects of concreteness on age of acquisition. This entails that the effects in children cannot be explained by correlations between concrete words and easier distributional learning contexts. Again, this highlights the importance of sensorimotor experience and conceptual development in explaining the course of child language acquisition.

Lexical class. The bidirectional language models showed no significant effects of lexical class on age of acquisition. In other words, the differences between lexical classes were sufficiently accounted for by the other predictors for BERT and the BiLSTM. However, in the unidirectional language models (GPT-2 and the LSTM), nouns and function words were acquired later than adjectives and verbs.[6] This contrasts with children learning English, who on average acquired nouns earlier than adjectives and verbs, acquiring function words last.[7]

[6] Significant pairwise comparisons between lexical classes are listed in Appendix A.2.

[7] There is ongoing debate around the existence of a universal "noun bias" in early word acquisition. For instance, Korean and Mandarin-speaking children have been found to acquire verbs earlier than nouns, although this effect appears sensitive to context and the measure of vocabulary acquisition (Choi and Gopnik, 1995; Tardif et al., 1999).

Thus, children's early acquisition of nouns cannot be explained by distributional properties of English nouns, which are acquired later by unidirectional language models. This result is compatible with the hypothesis that nouns are acquired earlier because they often map to real world objects; function words might be acquired later because their meanings are less grounded in sensorimotor experience. It has also been argued that children might have an innate bias to learn objects earlier than relations and traits (Markman, 1994). Lastly, it is possible that the increased salience of sentence-final positions (which are more likely to contain nouns in English and related languages) facilitates early acquisition of nouns in children (Caselli et al., 1995). Consistent with these hypotheses, our results suggest that English verbs and adjectives may be easier to learn from a purely distributional perspective, but children acquire nouns earlier based on sensorimotor, social, or cognitive factors.
4.1 First and last learned words

As a qualitative analysis, we compared the first and last words acquired by the language models and children, as shown in Table 3. In line with our previous results, the first and last words learned by the language models were largely determined by word frequencies. The first words acquired by the models were all in the top 3% of frequent words, and the last acquired words were all in the bottom 15%. Driven by this effect, many of the first words learned by the language models were function words or pronouns. In contrast, many of the first words produced by children were single-word expressions, such as greetings, exclamations, and sounds. Children acquired several highly frequent words late, such as "if," which is in the 90th frequency percentile of the CHILDES corpus. Of course, direct comparisons between the first and last words acquired by the children and language models are confounded by differing datasets and learning environments, as detailed in Section 3.4.

        Language models                            Children
First   a, and, for, he, her, his, I, it, my,      baby, ball, bye, daddy, dog, hi, mommy,
        of, on, she, that, the, to, was, with,     moo, no, shoe, uh, woof, yum
        you
Last    bee, bib, choo, cracker, crayon,           above, basement, beside, country,
        giraffe, glue, kitty, moose, pancake,      downtown, each, hate, if, poor, walker,
        popsicle, quack, rooster, slipper,         which, would, yourself
        tuna, yum, zebra

Table 3: First and last words acquired by the language models and children. For language models, we identified words that were in the top or bottom 5% of ages of acquisition for all models. For children, we identified words in the top or bottom 2% of ages of acquisition.
4.2 Age of acquisition vs. minimum surprisal

Next, we assessed whether a word's age of acquisition in a language model could be predicted from how well that word was learned in the fully-trained model. To do this, we considered the minimum surprisal attained by each language model for each word. We found a significant effect of minimum surprisal on age of acquisition in all four language models, even after accounting for all five other predictors (using likelihood ratio tests; p < 0.001). In part, this is likely because the acquisition cutoff for each word's fitted sigmoid was dependent on the word's minimum surprisal.

It could then be tempting to treat minimum surprisal as a substitute for age of acquisition in language models; this approach would require only publicly-available fully-trained language models. Indeed, the correlation between minimum surprisal and age of acquisition was substantial (Pearson's r = 0.88 to 0.92). However, this correlation was driven largely by effects of log-frequency, which had a large negative effect on both metrics. When adjusting minimum surprisal and age of acquisition for log-frequency (using residuals after linear regressions), the correlation decreased dramatically (Pearson's r = 0.22 to 0.46). While minimum surprisal accounts for a significant amount of variance in words' ages of acquisition, the two metrics are not interchangeable.
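The adjustment described above can be sketched as follows: residualize both age of acquisition and minimum surprisal on log-frequency with simple linear regressions, then correlate the residuals. The arrays here are simulated for illustration; only the analysis pattern follows the paper.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
log_freq = rng.normal(size=300)
aoa = 4.0 - 0.8 * log_freq + rng.normal(scale=0.3, size=300)
min_surprisal = 8.0 - 1.5 * log_freq + rng.normal(scale=0.5, size=300)

def residualize(y, x):
    """Residuals of y after a simple linear regression on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_r, _ = stats.pearsonr(aoa, min_surprisal)
adj_r, _ = stats.pearsonr(residualize(aoa, log_freq), residualize(min_surprisal, log_freq))
print(f"raw r = {raw_r:.2f}, frequency-adjusted r = {adj_r:.2f}")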
4.3 Alternative age of acquisition definitions

Finally, we considered alternative operationalizations of words' ages of acquisition in language models. For instance, instead of defining an acquisition cutoff at 50% between random chance and the minimum surprisal for each word, we could consider the midpoint of each fitted sigmoid curve. This method would be equivalent to defining upper and lower surprisal baselines at the upper and lower asymptotes of the fitted sigmoid, relying on the assumption that these asymptotes roughly approximate surprisal values before and after training. However, this assumption fails in cases where the majority of a word's learning curve is modeled by only a sub-portion of the fitted sigmoid. For example, for the word "for" in Figure 4, the high surprisal asymptote is at 156753.5, compared to a random chance surprisal of 14.9 and a minimum surprisal of 4.4. Using the midpoint age of acquisition in this case would result in an age of acquisition of −9.6 steps (log10).

We also considered alternative cutoff proportions (replacing 50%) in our original age of acquisition definition. We considered cutoffs at each possible increment of 10%. The signs of nearly all significant coefficients in the overall regressions (see Table 2) remained the same for all language models regardless of cutoff proportion.[8]

[8] The only exception was a non-significant positive coefficient for n-chars in BERT with a 90% acquisition cutoff.

5 Language model learning curves

The previous sections identified factors that predict words' ages of acquisition in language models. We now proceed with a qualitative analysis of the learning curves themselves. We found that language models learn traditional distributional statistics in a systematic way.
Figure 4: LSTM learning curves for the words "for," "eat," "drop," and "lollipop." Blue horizontal lines indicate age of acquisition cutoffs, and blue curves represent fitted sigmoid functions. Green dashed lines indicate the surprisal if predicting solely based on unigram probabilities (raw token frequencies). Early in training, language model surprisals tended to shift towards the unigram frequency-based surprisals.

5.1 Unigram probabilities

First, we observed a common pattern in word learning curves across model architectures. As expected, each curve began at the surprisal value corresponding to random chance predictions. Then, as shown in Figure 4, many curves shifted towards the surprisal value corresponding to raw unigram probabilities (i.e. based on raw token frequencies). This pattern was particularly pronounced in LSTM-based language models, although it appeared in all architectures. Interestingly, the shift occurred even if the unigram surprisal was higher (or "worse") than random-chance surprisal, as demonstrated by the word "lollipop" in Figure 4. Thus, we posited that language models pass through an early stage of training where they approximate unigram probabilities.

To test this hypothesis, we aggregated each model's predictions for randomly masked tokens in the evaluation dataset (16K sequences), including tokens not on the CDI. For each saved training step, we computed the average Kullback-Leibler (KL) divergence between the model predictions and the unigram frequency distribution. For comparison, we also computed the KL divergence with a uniform distribution (random chance) and with the one-hot true token distribution. We note that the KL divergence with the one-hot true token distribution is equivalent to the cross-entropy loss function using log base two.[9]

[9] All KL divergences were computed using log base two. KL divergences were computed as KL(y_ref, ŷ), where ŷ was the model's predicted probability distribution and y_ref was the reference distribution.

As shown in Figure 5, we plotted the KL divergences between each reference distribution and the model predictions over the course of training. As expected, all four language models converged towards the true token distribution (minimizing the loss function) throughout training, diverging from the uniform distribution. Divergence from the uniform distribution could also reflect that the models became more confident in their predictions during training, leading to lower entropy predictions.

As hypothesized, we also found that all four language models exhibited an early stage of training in which their predictions approached the unigram distribution, before diverging to reflect other information. This suggests that the models overfitted to raw token frequencies early in training, an effect which was particularly pronounced in the LSTM-based models. Importantly, because the models eventually diverged from the unigram distribution, the initial unigram phase cannot be explained solely by mutual information between the true token distribution and unigram frequencies.

Figure 5: KL divergences between reference distributions and model predictions over the course of training. The KL divergence with the one-hot true token distribution is equivalent to the base two cross-entropy loss. Early in training, the models temporarily overfitted to unigram then bigram probabilities.
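The sketch below shows the kind of KL divergence computation described above, in log base two, between a reference distribution (unigram frequencies or the uniform distribution) and a model's predicted distribution over the vocabulary at one masked position. The toy probability vectors stand in for real model predictions; in the paper these divergences were averaged over many masked tokens at each checkpoint.

import numpy as np

def kl_divergence_bits(reference, predicted, eps=1e-12):
    """KL(reference || predicted) using log base two."""
    reference = reference / reference.sum()
    predicted = predicted / predicted.sum()
    return float(np.sum(reference * np.log2((reference + eps) / (predicted + eps))))

vocab_size = 30004
rng = np.random.default_rng(2)

# Toy reference distributions: unigram token frequencies and the uniform distribution.
unigram = rng.gamma(0.1, size=vocab_size)
unigram = unigram / unigram.sum()
uniform = np.full(vocab_size, 1.0 / vocab_size)

# Toy model prediction at one masked position (softmax output over the vocabulary).
model_probs = rng.dirichlet(np.ones(vocab_size))

print("KL(unigram, model):", kl_divergence_bits(unigram, model_probs))
print("KL(uniform, model):", kl_divergence_bits(uniform, model_probs))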
5.2 Bigram probabilities

We then ran a similar analysis using bigram probabilities, where each token probability was dependent only on the previous token. A bigram distribution P_b was computed for each masked token in the evaluation dataset, based on bigram counts in the training corpus. As dictated by the bigram model definition, we defined P_b(w_i) = P(w_i | w_{i−1}) for unidirectional models, and P_b(w_i) = P_b(w_i | w_{i−1}, w_{i+1}) ∝ P(w_i | w_{i−1}) P(w_{i+1} | w_i) for bidirectional models. We computed the average KL divergence between the bigram probability distributions and the language model predictions.
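A small sketch of these bigram reference distributions is given below, using add-alpha smoothed bigram counts from a toy corpus. For the bidirectional case, the product P(w_i | w_{i−1}) P(w_{i+1} | w_i) is renormalized over the vocabulary; the smoothing choice and the toy corpus are our own assumptions for illustration.

import numpy as np
from collections import Counter

# Toy training corpus for bigram counts.
corpus = ["the dog ran", "the dog barked", "a dog ran", "the cat ran"]
bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

vocab = sorted(unigram_counts)
context_counts = Counter()
for (w1, _), count in bigram_counts.items():
    context_counts[w1] += count

def p_next(prev, w, alpha=0.1):
    """Add-alpha smoothed bigram probability P(w | prev)."""
    return (bigram_counts[(prev, w)] + alpha) / (context_counts[prev] + alpha * len(vocab))

def bidirectional_bigram_distribution(prev_token, next_token):
    """P_b(w) proportional to P(w | prev_token) * P(next_token | w), renormalized."""
    scores = np.array([p_next(prev_token, w) * p_next(w, next_token) for w in vocab])
    return scores / scores.sum()

# Bigram reference distribution over the masked position in "the [MASK] ran".
for word, prob in zip(vocab, bidirectional_bigram_distribution("the", "ran")):
    print(f"P_b({word}) = {prob:.3f}")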

As shown in Figure 5, during the unigram learning phase, the bigram KL divergence decreased for all language models. This is likely caused by mutual information between the unigram and bigram distributions; as the models approached the unigram distribution, their divergences with the bigram distributions roughly approximated the average KL divergence between the bigram and unigram distributions themselves (average KL = 3.86 between unidirectional bigrams and unigrams; average KL = 5.88 between bidirectional bigrams and unigrams). In other words, the models' initial decreases in bigram KL divergences can be explained predominantly by unigram frequency learning.

However, when the models began to diverge from the unigram distribution, they continued to approach the bigram distributions. Each model then hit a local minimum in average bigram KL divergence before diverging from the bigram distributions. This suggests that the models overfitted to bigram probabilities after the unigram learning phase. Thus, it appears that early in training, language models make predictions based on unigram frequencies, then bigram probabilities, eventually learning to make more nuanced predictions.

Of course, this result may not be surprising for LSTM-based language models. Because tokens are fed into LSTMs sequentially, it is intuitive that they would make use of bigram probabilities. Our results confirm this intuition, and they further show that Transformer language models follow a similar pattern. Because BERT and GPT-2 only encode token position information through learned absolute position embeddings before the first self-attention layer, they have no architectural reason to overfit to bigram probabilities based on adjacent tokens.[10] Instead, unigram and bigram learning may be a natural consequence of the language modeling task, or even distributional learning more generally.

[10] Absolute position embeddings in the Transformers were randomly initialized at the beginning of training.

6 Discussion

We found that language models are highly sensitive to basic statistics such as frequency and bigram probabilities during training. Their acquisition of words is also sensitive to features such as sentence length and (for unidirectional models) lexical class. Importantly, the language models exhibited notable differences with children in the effects of lexical class, word lengths, and concreteness, highlighting the importance of social, cognitive, and sensorimotor experience in child language development.

6.1 Distributional learning, language modeling, and NLU

In this section, we address the broader relationship between distributional language acquisition and contemporary language models.
Distributional learning in people. There is ongoing work assessing distributional mechanisms in human language learning (Aslin and Newport, 2014). For instance, adults can learn syntactic categories using distributional information alone (Reeder et al., 2017). Adults also show effects of distributional probabilities in reading times (Goodkind and Bicknell, 2018) and neural responses (Frank et al., 2015). In early language acquisition, there is evidence that children are sensitive to transition (bigram) probabilities between phonemes and between words (Romberg and Saffran, 2010), but it remains an open question to what extent distributional mechanisms can explain effects of other factors (e.g. utterance lengths and lexical classes) known to influence naturalistic language learning.

To shed light on this question, we considered neural language models as distributional language learners. If analogous distributional learning mechanisms were involved in children and language models, then we would observe similar word acquisition patterns in children and the models. Our results demonstrate that a purely distributional learner would be far more reliant on frequency than children are. Furthermore, while the effects of utterance length on words' ages of acquisition in children can potentially be explained by distributional mechanisms, the effects of word length, concreteness, and lexical class cannot.

Distributional models. Studying language acquisition in distributional models also has implications for core NLP research. Pre-trained language models trained only on text data have become central to state-of-the-art NLP systems. Language models even outperform humans on some tasks (He et al., 2021), making it difficult to pinpoint why they perform poorly in other areas. In this work, we isolated ways that language models differ from children in how they acquire words, emphasizing the importance of sensorimotor experience and cognitive development for human-like language acquisition. Future work could investigate the acquisition of syntactic structures or semantic information in language models.

Non-distributional learning. We showed that distributional language models acquire words in very different ways from children. Notably, children's linguistic experience is grounded in sensorimotor and cognitive experience. Children as young as ten months old learn word-object pairings, mapping novel words onto perceptually salient objects (Pruden et al., 2006). By the age of two, they are able to integrate social cues such as eye gaze, pointing, and joint attention (Çetinçelik et al., 2021). Neural network models of one-word child utterances exhibit vocabulary acquisition trajectories similar to children when only using features from conceptual categories and relations (Nyamapfene and Ahmad, 2007). Our work shows that these grounded and interactive features impact child word acquisition in ways that cannot be explained solely by intra-linguistic signals.

That said, there is a growing body of work grounding language models using multimodal information and world knowledge. Language models trained on visual and linguistic inputs have achieved state-of-the-art performance on visual question answering tasks (Antol et al., 2015; Lu et al., 2019; Zellers et al., 2021b), and models equipped with physical dynamics modules are more accurate than standard language models at modeling world dynamics (Zellers et al., 2021a). There has also been work building models directly for non-distributional tasks; reinforcement learning can be used for navigation and multi-agent communication tasks involving language (Chevalier-Boisvert et al., 2019; Lazaridou et al., 2017; Zhu et al., 2020). These models highlight the grounded, interactive, and communicative nature of language. Indeed, these non-distributional properties may be essential to more human-like natural language understanding (Bender and Koller, 2020; Emerson, 2020). Based on our results for word acquisition in language models, it is possible that these multimodal and non-distributional models could also exhibit more human-like language acquisition.

7 Conclusion

In this work, we identified factors that predict words' ages of acquisition in contemporary language models. We found contrasting effects of lexical class, word length, and concreteness in children and language models, and we observed much larger effects of frequency in the models than in children. Furthermore, we identified ways that language models acquire unigram and bigram statistics early in training. This work paves the way for future research integrating language acquisition and natural language understanding.
Acknowledgements

We would like to thank the anonymous reviewers for their helpful suggestions, and the Language and Cognition Lab (Sean Trott, James Michaelov, and Cameron Jones) for valuable discussion. We are also grateful to Zhuowen Tu and the Machine Learning, Perception, and Cognition Lab for computing resources. Tyler Chang is partially supported by the UCSD HDSI graduate fellowship.

References

Laura Aina, Kristina Gulordava, and Gemma Boleda. 2019. Putting words in context: LSTM language models and lexical ambiguity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3342–3348, Florence, Italy. Association for Computational Linguistics.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In International Conference on Computer Vision.
Richard Aslin and Elissa Newport. 2014. Distributional language learning: Mechanisms and models of category formation. Language Learning, 64:86–105.
Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.
Gemma Boleda. 2020. Distributional semantics and linguistic theory. Annual Review of Linguistics, 6(1):213–234.
Mika Braginsky, Daniel Yurovsky, Virginia Marchman, and Michael Frank. 2016. From uh-oh to tomorrow: Predicting age of acquisition for early words across languages. In Proceedings of the Annual Meeting of the Cognitive Science Society.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Conference on Neural Information Processing Systems.
Marc Brysbaert, Amy Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46.
Bernardino Casas, Antoni Hernández-Fernández, Neus Català, Ramon Ferrer-i-Cancho, and Jaume Baixeries. 2019. Polysemy and brevity versus frequency in language. Computer Speech & Language, 58:19–50.
Maria Cristina Caselli, Elizabeth Bates, Paola Casadio, Judi Fenson, Larry Fenson, Lisa Sanderl, and Judy Weir. 1995. A cross-linguistic study of early lexical development. Cognitive Development, 10(2):159–199.
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. 2019. BabyAI: A platform to study the sample efficiency of grounded language learning. In International Conference on Learning Representations.
Soonja Choi and Alison Gopnik. 1995. Early acquisition of verbs in Korean: A cross-linguistic study. Journal of Child Language, 22(3):497–529.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Guy Emerson. 2020. What are the goals of distributional semantics? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7436–7453, Online. Association for Computational Linguistics.
Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.
Larry Fenson, Virginia Marchman, Donna Thal, Phillip Dale, Steven Reznick, and Elizabeth Bates. 2007. MacArthur-Bates communicative development inventories. Paul H. Brookes Publishing Company, Baltimore, MD.
Michael Frank, Mika Braginsky, Daniel Yurovsky, and Virginia Marchman. 2017. Wordbank: An open repository for developmental vocabulary data. Journal of Child Language, 44(3):677–694.
Stefan Frank, Leun Otten, Giulia Galli, and Gabriella Vigliocco. 2015. The ERP response to the amount of information conveyed by words in sentences. Brain and Language, 140:1–11.
Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42, Minneapolis, Minnesota. Association for Computational Linguistics.
Jill Gilkerson, Jeffrey Richards, Steven Warren, Judith Montgomery, Charles Greenwood, D. Kimbrough Oller, John Hansen, and Terrance Paul. 2017. Mapping the early language environment using all-day recordings and automated analysis. American Journal of Speech-Language Pathology, 26(2):248–265.
Adam Goodkind and Klinton Bicknell. 2018. Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018), pages 10–18, Salt Lake City, Utah. Association for Computational Linguistics.
Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.
Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.
Christo Kirov and Ryan Cotterell. 2018. Recurrent neural networks in linguistic theory: Revisiting Pinker and Prince (1988) and the past tense debate. Transactions of the Association for Computational Linguistics, 6:651–665.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In International Conference on Learning Representations.
Alessandro Lenci. 2018. Distributional models of word meaning. Annual Review of Linguistics, 4(1):151–171.
Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106(3):1126–1177.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Conference on Neural Information Processing Systems.
Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk. Lawrence Erlbaum Associates, Mahwah, NJ.
Ellen Markman. 1994. Constraints on word meaning in early language acquisition. Lingua, 92:199–227.
Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proceedings of the Fifth International Conference on Learning Representations.
Abel Nyamapfene and Khurshid Ahmad. 2007. A multimodal model of child language acquisition at the one-word stage. In International Joint Conference on Neural Networks, pages 783–788.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Eva Portelance, Judith Degen, and Michael Frank. 2020. Predicting age of acquisition in early word learning using recurrent neural networks. In Proceedings of CogSci 2020.
Shannon M. Pruden, Kathy Hirsh-Pasek, Roberta Michnick Golinkoff, and Elizabeth Hennon. 2006. The birth of words: Ten-month-olds learn words through perceptual salience. Child Development, 77(2):266–280.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Technical Report.
Patricia Reeder, Elissa Newport, and Richard Aslin. 2017. Distributional learning of subcategories in an artificial grammar: Category generalization and subcategory restrictions. Journal of Memory and Language, 97:17–29.
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.
Alexa Romberg and Jenny Saffran. 2010. Statistical learning and language acquisition. Wiley Interdisciplinary Reviews in Cognitive Science, 1(6):906–914.
Brandon Roy, Michael Frank, Philip DeCamp, Matthew Miller, and Deb Roy. 2015. Predicting the birth of a spoken word. Proceedings of the National Academy of Sciences, 112(41):12663–12668.
David Rumelhart and James McClelland. 1986. On learning the past tenses of English verbs. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 2.
Melanie Soderstrom. 2007. Beyond babytalk: Re-evaluating the nature and content of speech input to preverbal infants. Developmental Review, 27(4):501–532.
Twila Tardif, Susan Gelman, and Fan Xu. 1999. Putting the "noun bias" in context: A comparison of English and Mandarin. Child Development, 70(3):620–635.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Rowan Zellers, Ari Holtzman, Matthew Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, and Yejin Choi. 2021a. PIGLeT: Language grounding through neuro-symbolic interaction in a 3D world. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2040–2050, Online. Association for Computational Linguistics.
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. 2021b. MERLOT: Multimodal neural script knowledge models. arXiv preprint arXiv:2106.02636v2.
Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, and Fei Sha. 2020. BabyWalk: Going farther in vision-and-language navigation by taking baby steps. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2539–2556, Online. Association for Computational Linguistics.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision, pages 19–27.
Melis Çetinçelik, Caroline Rowland, and Tineke Snijders. 2021. Do the eyes have it? A systematic review on the role of eye gaze in infant language development. Frontiers in Psychology, 11.
A Appendix

A.1 Language model training details

Language model training hyperparameters are listed in Table 4. Input and output token embeddings were tied in all models. Each model was trained using four Titan Xp GPUs. The LSTM, BiLSTM, BERT, and GPT-2 models took four, five, seven, and eleven days to train respectively.

Hyperparameter        Value
Hidden size           768
Embedding size        768
Vocab size            30004
Max sequence length   128
Batch size            128
Train steps           1M
Learning rate decay   Linear
Warmup steps          10000
Learning rate         1e-4
Adam ε                1e-6
Adam β1               0.9
Adam β2               0.999
Dropout               0.1

Transformer hyperparameter   Value
Transformer layers           12
Intermediate hidden size     3072
Attention heads              12
Attention head size          64
Attention dropout            0.1
BERT mask proportion         0.15

LSTM hyperparameter   Value
LSTM layers           3
Context size          768

Table 4: Language model training hyperparameters.

To verify language model convergence, we plotted evaluation loss curves, as in Figure 6. To ensure that our language models reached performance levels comparable to contemporary language models, in Table 6 we report perplexity comparisons between our trained models and models with the same architectures in previous work. For BERT, we evaluated the perplexity of Huggingface's pre-trained BERT base uncased model on our evaluation dataset (Wolf et al., 2020). For the remaining models, we used the evaluation perplexities reported in the original papers: Gulordava et al. (2018) for the LSTM,[11] Radford et al. (2019) for GPT-2 (using the comparably-sized model evaluated on the WikiText-103 dataset), and Aina et al. (2019) for the BiLSTM. Because these last three models were cased, we could not evaluate them directly on our uncased evaluation set. Due to differing vocabularies, hyperparameters, and datasets, our perplexity comparisons are not definitive; however, they show that our models perform similarly to contemporary language models.

[11] The large parameter count for the LSTM in Gulordava et al. (2018) is primarily due to its large vocabulary without a decreased embedding size.
Adj < Function ∗∗∗ Adj < Function ∗∗∗ Nouns < Adj ∗∗∗
Adj < Nouns ∗∗ Adj < Other ∗ Nouns < Verbs ∗∗∗
Adj < Other ∗∗ Verbs < Function ∗∗∗ Nouns < Function ∗∗∗
∗∗∗
Verbs < Function Verbs < Nouns ∗ Function > Adj ∗∗∗
Verbs < Nouns ∗∗ Verbs < Other ∗∗ Function > Verbs ∗∗∗
Verbs < Other ∗ Nouns < Function ∗∗∗ Function > Other ∗∗∗
Other < Adj ∗∗
Other < Verbs ∗∗

Table 5: Significant pairwise differences between lexical classes when predicting words’ ages of acquisition in
language models and children (adjusted p < 0.05∗ ; p < 0.01∗∗ ; p < 0.001∗∗∗ ). A higher value indicates that
a lexical class is acquired later on average. The five possible lexical classes were Noun, Verb, Adjective (Adj),
Function Word (Function), and Other.

10 described in the text, when lexical class reached


Evaluation Loss

8
Model significance based on the likelihood ratio test (ac-
BERT
counting for log-frequency, MLU, n-chars, and
6 BiLSTM
GPT−2
concreteness), we ran a one-way analysis of co-
4 LSTM variance (ANCOVA) with log-frequency as a co-
2 variate. There was a significant effect of lexical
0 250K 500K 750K 1M class in children and the unidirectional language
Steps
models (the LSTM and GPT-2; p < 0.001).
Pairwise differences between lexical classes
Figure 6: Evaluation loss during training for all four were assessed using Tukey’s honestly significant
language models. Note that perplexity is equal to
difference (HSD) test. Significant pairwise differ-
exp(loss).
ences are listed in Table 5.
Ours Previous work
# Params Perplexity # Params Perplexity
LSTM 37M 54.8 72M a 52.1
GPT-2 108M 30.2 117M b 37.5
BiLSTM 51M 9.0 42M c 18.1
BERT 109M 7.2 110M d 9.4

Table 6: Rough perplexity comparisons between our


trained language models and models with the same ar-
chitectures in previous work (a Gulordava et al., 2018;
b
Radford et al., 2019; c Aina et al., 2019; d Wolf et al.,
2020).

We evaluated checkpoints at the following steps:

• Every 100 steps during the first 1000 steps.


• Every 500 steps during the first 10,000 steps.
• Every 1000 steps during the first 100,000
steps.
• Every 10,000 steps for the remainder of train-
ing (ending at 1M steps).
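For reference, the short sketch below reproduces this schedule; under the assumption that the intervals above start at their first multiple (step 100, 500, 1000, and 10,000 respectively), taking the union of the four ranges yields exactly the 208 checkpoint steps mentioned above.

def checkpoint_steps(max_step=1_000_000):
    steps = set()
    steps.update(range(100, 1_001, 100))                # every 100 steps during the first 1000 steps
    steps.update(range(500, 10_001, 500))               # every 500 steps during the first 10,000 steps
    steps.update(range(1_000, 100_001, 1_000))          # every 1000 steps during the first 100,000 steps
    steps.update(range(10_000, max_step + 1, 10_000))   # every 10,000 steps for the remainder of training
    return sorted(steps)

steps = checkpoint_steps()
print(len(steps))            # 208 checkpoints
print(steps[:5], steps[-3:])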

A.2 Lexical class comparisons

We assessed the effect of lexical class on age of acquisition in children and each language model. As described in the text, when lexical class reached significance based on the likelihood ratio test (accounting for log-frequency, MLU, n-chars, and concreteness), we ran a one-way analysis of covariance (ANCOVA) with log-frequency as a covariate. There was a significant effect of lexical class in children and the unidirectional language models (the LSTM and GPT-2; p < 0.001).

Pairwise differences between lexical classes were assessed using Tukey's honestly significant difference (HSD) test. Significant pairwise differences are listed in Table 5.

LSTM                      GPT-2                     Children
Adj < Function ***        Adj < Function ***        Nouns < Adj ***
Adj < Nouns **            Adj < Other *             Nouns < Verbs ***
Adj < Other **            Verbs < Function ***      Nouns < Function ***
Verbs < Function ***      Verbs < Nouns *           Function > Adj ***
Verbs < Nouns **          Verbs < Other **          Function > Verbs ***
Verbs < Other *           Nouns < Function ***      Function > Other ***
                                                    Other < Adj **
                                                    Other < Verbs **

Table 5: Significant pairwise differences between lexical classes when predicting words' ages of acquisition in language models and children (adjusted p < 0.05 *; p < 0.01 **; p < 0.001 ***). A higher value indicates that a lexical class is acquired later on average. The five possible lexical classes were Noun, Verb, Adjective (Adj), Function Word (Function), and Other.
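The sketch below illustrates this analysis pattern on simulated data: residualize age of acquisition on the log-frequency covariate, then compare the residuals across lexical classes with Tukey's HSD test via statsmodels. The simulated effect sizes are arbitrary and are not the study's results.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated per-word data: lexical class, log-frequency, and age of acquisition.
rng = np.random.default_rng(3)
n = 500
classes = rng.choice(["Noun", "Verb", "Adjective", "Function", "Other"], size=n)
log_freq = rng.normal(size=n)
class_shift = pd.Series(classes).map(
    {"Noun": 0.3, "Verb": -0.1, "Adjective": -0.2, "Function": 0.4, "Other": 0.0}).to_numpy()
aoa = 4.0 - 0.6 * log_freq + class_shift + rng.normal(scale=0.3, size=n)

# Residualize age of acquisition on the covariate (log-frequency), as in the ANCOVA.
residuals = sm.OLS(aoa, sm.add_constant(log_freq)).fit().resid

# Pairwise comparisons between lexical classes on the residuals.
print(pairwise_tukeyhsd(residuals, classes))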
