
Who’s afraid of ChatGPT?

Cristiano Chesi
Director of NeTS
Neurolinguistics, Computational Linguistics, and Theoretical Syntax
The success of OpenAI

DALL·E 2: «Generate a picture with someone talking to a computer that seems a wolf, Matisse style»
What can ChatGPT do?
[ChatGPT screenshots, May 3 Version]
Sentence generation: a «simple» game
[scattered Italian word tiles on the slide: «Luisa», «tutti», «una», «Io», «ieri», «ora», «dal», «visto», «presto», «bella», «sei», «chi», «mangia», «salgo», «cosa», «mai», «trono», «hai», «corre», «quando», «legge»]
Sentence generation: a «simple» game

~ 300.000 inflected words (Repubblica Corpus, Baroni et al. 2004)

Considering just sentences of 3 words:
300.000³ = 27.000.000.000.000.000 possible permutations

How much time does it take to consider them all?
Considering a game console (computational power ≈ 12 TFLOPs, that is 12e+12 operations per second):

With 3-word-long sentences: 37 and a half minutes
With 4-word-long sentences: 21 and a half years!
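These figures follow from simple arithmetic; here is a minimal sketch (my own, under the slide's assumptions: ~300.000 inflected word forms and 12e+12 operations per second, counting one "operation" per candidate sentence) that reproduces them:

```python
# Back-of-the-envelope check of the slide's figures.
VOCABULARY = 300_000          # inflected word forms (Repubblica Corpus estimate)
OPS_PER_SECOND = 12e12        # ~12 TFLOPs game console

def enumeration_time(sentence_length: int) -> float:
    """Seconds needed to enumerate every word sequence of the given length."""
    return VOCABULARY ** sentence_length / OPS_PER_SECOND

print(f"3 words: {enumeration_time(3) / 60:.1f} minutes")                 # ~37.5 minutes
print(f"4 words: {enumeration_time(4) / (3600 * 24 * 365):.1f} years")    # ~21.4 years
```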


The habit of making machines… talk

1966 – ELIZA, Joseph Weizenbaum's program. It simulates a Rogerian psychotherapist:
User : Men are all alike.
Eliza: In what way?
User : They're always bugging us about something specific or other.
Eliza: Can you think of a specific example?
User : Well, my boyfriend made me come here.
Eliza: Your boyfriend made you come here?
User : He says I'm depressed much of the time.
Eliza: I am sorry to hear you are depressed.
Talking machines (with rules triggered by regular expressions):

user : Men are all alike.
eliza: In what way?
user : They're always bugging us about something specific or other.
eliza: Can you think of a specific example?
user : Well, my boyfriend made me come here.
eliza: Your boyfriend made you come here?
user : He says I'm (depressed | sad | ...)_x much of the time.
eliza: I am sorry to hear you are depressed_x.
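As a rough illustration of what "rules triggered by regular expressions" means, here is a minimal ELIZA-style sketch (not Weizenbaum's original script: the two rules and their reply templates are invented for this example, and the pronoun reflection the real ELIZA performs, e.g. me → you, is omitted):

```python
import re

# Each rule is a regular expression whose captured groups are re-used in a
# reply template, exactly like the "(depressed | sad | ...)_x" pattern above.
RULES = [
    (re.compile(r"I'?m (depressed|sad|unhappy)", re.IGNORECASE),
     "I am sorry to hear you are {0}."),
    (re.compile(r"my (\w+) (.+?)[.?!]*$", re.IGNORECASE),
     "Your {0} {1}?"),
]

def reply(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return "Can you think of a specific example?"   # generic fallback

print(reply("He says I'm depressed much of the time."))
# -> I am sorry to hear you are depressed.
print(reply("Well, my boyfriend made me come here."))
# -> Your boyfriend made me come here?   (no pronoun reflection in this sketch)
```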
The artificial flight

Leonardo Da Vinci, the birds… and the flying machines

Burrhus F. Skinner (1950): «Children imitate adults!»
Noam Chomsky (1959): «Well, this is not true: the poverty of stimulus argument!»
The Poverty of Stimulus argument

 Yes-No questions in English


 The student is asleep
 Isᵢ the student _ᵢ asleep?

 Possible rules
 Move the third word in front of the sentence
 Move the first auxiliary in front of the sentence
 Swap the matrix auxiliary with the matrix subject

 Try your favorite failure with this sentence:


 The student that was sitting in front of the door is sleeping
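A toy sketch (my own, not from the talk) of why the structure-independent rule «move the first auxiliary in front of the sentence» fails: it has no notion of which auxiliary belongs to the matrix clause.

```python
# Linear, structure-independent rule: front the FIRST auxiliary, whichever
# clause it belongs to. The auxiliary list is a small illustrative subset.
AUXILIARIES = {"is", "was", "are", "were", "can", "will"}

def front_first_auxiliary(sentence: str) -> str:
    words = sentence.rstrip(".").split()
    idx = next(i for i, w in enumerate(words) if w.lower() in AUXILIARIES)
    aux = words.pop(idx)
    return " ".join([aux.capitalize(), words[0].lower()] + words[1:]) + "?"

print(front_first_auxiliary("The student is asleep"))
# -> Is the student asleep?                                            (fine)
print(front_first_auxiliary(
    "The student that was sitting in front of the door is sleeping"))
# -> Was the student that sitting in front of the door is sleeping?    (wrong!)
```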
The Poverty of Stimulus argument

Crain, S., & Nakayama, M. (1987). Structure dependence in grammar formation. Language, 522–543.

 30 children, 3- to 5-year-olds (divided into two groups);
 Elicitation task (Bellugi 1971): Jabba the Hutt (from Star Wars) was the addressee of the child's question, elicited with a picture representing a complex situation
 Experimenter: “Ask Jabba if the boy who is watching Mickey Mouse is happy”
The Poverty of Stimulus argument

 Type of possible errors (Crain & Nakayama, 1987):


 Type I (“prefix” error) *Is the boy who is being kissed by his mother is happy?
 Type II (“restarting error”) *Is the boy that is watching Mickey Mouse, is he happy?
 Type III (“structure independent error”) *Is the boy that watching Mickey Mouse is happy?
               Type I      Type II     Type III   Total
Group I (81)   30 (60%)    10 (20%)    0          50 (62%)
Group II (87)   9 (53%)     5 (29%)    0          17 (20%)
Total (168)    39 (58%)    15 (22%)    0          67 (40%)
The Poverty of Stimulus argument
[ChatGPT screenshots, May 3 Version]

Sanity check:
1. Prompt: Ask me if the boy who was holding the flower is smiling
   ChatGPT: Was the boy who was holding the flower smiling?
2. Prompt: Ask me if the man who could come to the restaurant did not come
   ChatGPT: Did the man who was supposed to come to the restaurant fail to show up?
3. Prompt: Ask me if the granpa who is at home can see the moon
   ChatGPT: Can the grandpa who is at home see the moon?
4. Prompt: Ask me if the granpa who can see the moon is at home
   ChatGPT: Is the grandpa who is able to see the moon currently at home?

For more data: Warstadt & Bowman (2022) What artificial neural networks can tell us about human language acquisition. In Algebraic Structures in Natural Language, 17–60. CRC Press.
«Colorless green ideas sleep furiously»

Jackendoff (1976): «Semantics is important!»
Grammaticality vs. meaning
[ChatGPT screenshots, May 3 Version]

Grammaticality vs. meaning (sanity check)
[ChatGPT screenshots, May 3 Version]
[Slide contrasting Noam Chomsky (1957) — «Genius!» — with Skinner (1950) and Tomasello (2005) — «Usage-based models!», «Right!» — over recursive examples:]
The dog runs away
The dog bites the cat that runs away
The dog bites the cat that chases the mouse that runs away
Tail recursion
[ChatGPT screenshots, May 3 Version]

Center embedding (sanity check)
[ChatGPT screenshot, May 3 Version]
A 10-y.o. child vs. ChatGPT
The training data

A 10-y.o. child
 ~ 86 billion neurons, more than a thousand trillion connections
 ~ 10–100 million words of input (1–10M per year)

ChatGPT
 ~ 175 billion «parameters»
 ~ 300 billion words of training data


A 10-y.o. child vs. ChatGPT
The training data

On the same dataset, an adult would need…
~ 2.500 years of non-stop silent reading!
(at 238 words per minute, Brysbaert 2019; inspired by Bob Berwick and David Pesetsky)
What's a parameter?
(from Piantadosi 2023)

Two theories: 1/r vs. 1/r²

F(r, α) = α · 1/r + (1 − α) · 1/r²                     (transparent super-equation)

One parameter α «to fit»

F(r, α) = (α · (r − 1) + 1) / (log(e^r) · r)           (less transparent super-equation)
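A minimal sketch of what «one parameter to fit» means in practice (the data below are synthetic, invented purely for illustration; scipy's curve_fit is just one convenient way to estimate α):

```python
import numpy as np
from scipy.optimize import curve_fit

# alpha interpolates between the 1/r theory (alpha = 1) and the 1/r^2
# theory (alpha = 0); fitting alpha to data tells us which theory wins.
def super_equation(r, alpha):
    return alpha * (1 / r) + (1 - alpha) * (1 / r ** 2)

rng = np.random.default_rng(0)
r = np.linspace(1, 10, 50)
data = 1 / r ** 2 + rng.normal(0, 0.002, r.size)   # synthetic 1/r^2-like data

(alpha_hat,), _ = curve_fit(super_equation, r, data, p0=[0.5])
print(f"fitted alpha ≈ {alpha_hat:.2f}")           # close to 0: 1/r^2 fits best
```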
The core of GPT

 How it is created:
 What's a Language Model
 Next word prediction
 Word Embeddings
 Recurrent networks

 What are the key features:
 Transformers
 Attention Mechanism
 Few-shot learning
Language Models (LM)

 A Language Model (LM) is the statistical representation of a linguistic data distribution.

 For example, let's take the n-gram approach: an n-gram is a sequence of n tokens (e.g. a 2-gram is a sequence of 2 words).
 “This is bad news” in 2-grams:
  “# this”, “this is”, “is bad”, “bad news”, “news #”

 Next word probability:
  P(wn | w0 ... wn-1)

 Markov approximation (bigram):
  P(wn | w0 ... wn-1) ≈ P(wn | wn-1)
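A toy sketch of the bigram approximation at work: the probability of a whole sentence is approximated by a product of bigram probabilities (the numbers below are invented, purely for illustration):

```python
# P(w0 ... wn) ≈ product of P(wi | wi-1), with "#" as the sentence boundary marker.
bigram_p = {
    ("#", "this"): 0.10, ("this", "is"): 0.50,
    ("is", "bad"): 0.05, ("bad", "news"): 0.30, ("news", "#"): 0.20,
}

def sentence_probability(words):
    tokens = ["#"] + words + ["#"]
    p = 1.0
    for w_prev, w in zip(tokens, tokens[1:]):
        p *= bigram_p.get((w_prev, w), 1e-6)   # unseen bigrams get a tiny probability
    return p

print(sentence_probability("this is bad news".split()))   # 0.1*0.5*0.05*0.3*0.2
```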
Language Models (LM)
How a LM is created

 Take a corpus of a few billion tokens
 Then count!
 The probability of a word (w) given its “history” (h), namely the sequence of preceding words, is estimated as:
  P(w|h), e.g. P(door | please, close the) ≈ Count(please, close the door) / Count(please, close the)

 According to Google:
  1.400.000 / 7.170.000 = 0,19

 That is, this is much more likely than “please, close the rainbow” (1 hit!):
  1 / 7.170.000 = 0,00000014
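The same counting recipe in code, on a deliberately tiny invented corpus (real LMs count over billions of tokens, e.g. via web hit counts as above):

```python
from collections import Counter

# P(w | h) ≈ Count(h + w) / Count(h)
corpus = (
    "please close the door . please close the window . "
    "please close the door quietly . the door is red ."
).split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
fourgram_counts = Counter(zip(corpus, corpus[1:], corpus[2:], corpus[3:]))

history = ("please", "close", "the")
p_door = fourgram_counts[history + ("door",)] / trigram_counts[history]
print(f"P(door | please close the) = {p_door:.2f}")   # 2/3 ≈ 0.67
```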
Word Embeddings

 “Numeric” representation of each word
  This is good news
  This is bad news
  …

Distributional hypothesis:
“words that occur in the same contexts tend to have similar meanings” (Harris 1954)
“a word is characterized by the company it keeps” (Firth 1957)

V = {this, is, good, bad, news, …}

 One-hot encoding
this = [1,0,0,0,0,0, …]
is   = [0,1,0,0,0,0, …]
good = [0,0,1,0,0,0, …]
bad  = [0,0,0,1,0,0, …]
news = [0,0,0,0,1,0, …]
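A one-line-per-word sketch of one-hot encoding (tiny vocabulary, for illustration only):

```python
import numpy as np

# Each word in the vocabulary V gets a |V|-dimensional vector with a single 1.
V = ["this", "is", "good", "bad", "news"]

def one_hot(word: str) -> np.ndarray:
    vector = np.zeros(len(V))
    vector[V.index(word)] = 1.0
    return vector

print(one_hot("good"))   # [0. 0. 1. 0. 0.]
```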
Word Embeddings (Word2Vec)

 Predict the next word!
  It is a good… day

 Continuous Bag Of Words (CBOW): the one-hot context words x1 … xV are projected through a shared input weight matrix W (V×N) into a hidden layer h of size N, and then through an output matrix W′ into a softmax output layer y over the vocabulary.

 Let's extend the context (It is a good…): the hidden layer combines the projections of all the context words x1 … xk before predicting y.
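A minimal CBOW sketch (illustrative only, with random untrained weights; a faithful word2vec implementation would train W and W′ by backpropagation over a large corpus):

```python
import numpy as np

V = ["it", "is", "a", "good", "day"]
N = 3                                     # embedding size (arbitrary here)
rng = np.random.default_rng(0)
W = rng.normal(size=(len(V), N))          # input embeddings  (V x N)
W_out = rng.normal(size=(N, len(V)))      # output projection (N x V)

def one_hot(word):
    v = np.zeros(len(V))
    v[V.index(word)] = 1.0
    return v

def predict(context_words):
    h = np.mean([one_hot(w) @ W for w in context_words], axis=0)  # hidden layer
    scores = h @ W_out
    probs = np.exp(scores) / np.exp(scores).sum()                 # softmax
    return V[int(np.argmax(probs))]

print(predict(["it", "is", "a", "good"]))   # untrained: the guess is arbitrary
```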
Simple (recurrent) Artificial Neural Networks (Elman 1990)

 Simple Recurrent Neural Networks: at each step t the hidden layer ht receives the current input it (e.g. “bad”) together with a copy of the previous hidden state (the context layer), and produces an output ot (e.g. the predicted continuation, “day”).
  This is a bad … day
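One recurrent step in code, a sketch with arbitrary sizes and random weights: the point is only that ht depends both on the current input and on the previous hidden state (the context layer).

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 5, 4, 5
W_ih = rng.normal(size=(hidden_size, input_size))   # input   -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # context -> hidden
W_ho = rng.normal(size=(output_size, hidden_size))  # hidden  -> output

def rnn_step(i_t, h_prev):
    h_t = np.tanh(W_ih @ i_t + W_hh @ h_prev)   # new hidden state
    o_t = W_ho @ h_t                            # output (e.g. next-word scores)
    return o_t, h_t

h = np.zeros(hidden_size)
for i_t in np.eye(input_size):                  # a toy one-hot input sequence
    o, h = rnn_step(i_t, h)
print(o.round(2))
```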
Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber 1997)

 Standard RNN: a chain of hidden states h0 … hn maps inputs i0 … in to outputs o0 … on, but information from distant steps tends to get lost.

 LSTM: each cell adds a memory line c0 … cn, updated through additive (+) and multiplicative (×) gates that decide what to forget, what to store, and what to expose as output.
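A compact sketch of a single LSTM cell (random weights, biases omitted): the cell state ct is the «memory line» of the figure, updated through the × and + gates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {g: rng.normal(size=(n_hid, n_in + n_hid)) for g in "figo"}  # one matrix per gate

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W["f"] @ z)          # forget gate: what to drop from memory
    i = sigmoid(W["i"] @ z)          # input gate: what to store
    g = np.tanh(W["g"] @ z)          # candidate values
    o = sigmoid(W["o"] @ z)          # output gate: what to expose
    c_t = f * c_prev + i * g         # memory cell update (the x and + of the figure)
    h_t = o * np.tanh(c_t)           # new hidden state
    return h_t, c_t

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in np.eye(n_in):               # a toy one-hot input sequence
    h, c = lstm_step(x, h, c)
print(h.round(2))
```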
The age of Transformers:
“Attention is all you need”, Vaswani et al. (2017)

[Seq-to-seq machine translation example: the encoder turns the word embeddings of “the red ball” into hidden states h1–h3; while the decoder generates «la palla rossa», it “pays attention” to the relevant encoder state for each output word (e.g. «rossa» is produced attending to “red” with weight 99%).]
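A sketch of the scaled dot-product attention at the heart of Vaswani et al. (2017), with random vectors standing in for trained encoder/decoder states:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
encoder_states = rng.normal(size=(3, d))   # h1, h2, h3 for "the red ball"
decoder_state = rng.normal(size=d)         # query while producing «rossa»

scores = encoder_states @ decoder_state / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum()   # softmax attention weights
context = weights @ encoder_states                # "pay attention to this"

print(weights.round(2))   # after training, the weight on "red" would be ~0.99
```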


What's inside GPT-3

A stack of Transformer blocks predicting the next token, on top of token embeddings (+ positional encoding):

            GPT-1               GPT-2               GPT-3
Blocks      12 × Transformer    48 × Transformer    96 × Transformer
Context     512 tokens          1024 tokens         2048 tokens
Parameters  117 million         1,2 billion         175 billion
How big was GPT-3

GPT-3: 175.000.000.000        GPT-4: 100.000.000.000.000
How big was GPT-3
[figure from the NVIDIA Developer blog]
How big was GPT-3

 A few tens of GBs on disk
 1287 MWh of training energy, i.e. 552 tons of CO2
  (about 240 flights from Milano to New York)


How much does GPT-3 «cost»?

 Running cost: 100.000–1M$ per …
 For processing 15–20 tokens per second: 8 NVIDIA A100 GPUs (8.000€ each on Amazon)
How does ChatGPT work

1. Training (supervised fine-tuning): prompts such as «Tell me about Generative Linguistics» are paired with annotated answers («Generative Linguistics is an approach…») and used for a supervised fine-tuning of GPT-3.

2. Comparison (reward model): the fine-tuned model produces alternative answers (X, Y, Z) to novel prompts («Tell me a story about Chomsky»); annotators order them (e.g. Y > X = Z) and this ordering supervises the training of a Reward Model (RM).

3. Optimization (reinforcement learning): the fine-tuned model is readapted by optimizing it against the Reward Model with Proximal Policy Optimization (PPO).
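A schematic sketch of step 2 (my own toy example, not OpenAI's code): given two answers to the same prompt where annotators judged Y better than X, the Reward Model is typically trained with a pairwise logistic loss so that reward(Y) > reward(X); step 3 then pushes the fine-tuned model toward higher reward with PPO.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)                         # a toy linear "reward model"

def reward(answer_features: np.ndarray) -> float:
    return float(w @ answer_features)

def pairwise_loss(better: np.ndarray, worse: np.ndarray) -> float:
    # -log sigmoid(r_better - r_worse): small when the RM agrees with annotators
    margin = reward(better) - reward(worse)
    return float(-np.log(1 / (1 + np.exp(-margin))))

Y, X = rng.normal(size=4), rng.normal(size=4)  # feature vectors of two answers
print(pairwise_loss(Y, X))                     # training pushes this loss down
```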
Testing the linguistic ability of (very) LLMs

 Warstadt, Singh & Bowman (2019) tested LSTMs against human judgments on CoLA (Corpus of Linguistic Acceptability), a set of 10,657 English sentences labeled as grammatical or ungrammatical, drawn from the published linguistics literature.

 Gauthier, Hu, Wilcox, Qian & Levy (2020) SyntaxGym: An online platform for targeted evaluation of language models. In Proceedings of the 58th Annual Meeting of the ACL: System Demonstrations, 70–76.

 Wilcox, Gotlieb, Futrell & Levy (2022) Using Computational Models to Test Syntactic Learnability. Linguistic Inquiry. 10.1162/LING_A_00491
More on ChatGPT's linguistic abilities: the COnVERSA test

 Forced-choice task, administered to ChatGPT (Version 4)
 COnVERSA: Test di Comprensione delle Opposizioni morfo-sintattiche VERbali attraverso la ScritturA (a written forced-choice test assessing the comprehension of morpho-syntactic verbal oppositions)
Minimal pairs in COnVERSA
A. Agreement & Inflection (A1. D-N, A2. Subj-Adj Pred, A3. Subj-Verb, A4. Attraction, A5. Past-Participle, A6. Psych verbs, A7. Cumulative agreement)

 A1. D-N
Il giorno. *I giorno.
the.SG.M day.SG.M / the.PL.M day.SG.M

 A2. Subject-Adjectival Predicate
Il muro è rosso. *Il muro è rossa.
the wall.SG.M is red.SG.M / the wall.SG.M is red.SG.F

 A3. Subject-Verb
La maestra corre. *La maestra corrono. (Unergative)
the teacher runs / the teacher run
Arriva la maestra. *Arriva le maestre. (Unaccusative)
(there) arrives the teacher / (there) arrives the teachers
Il maestro corregge i compiti. *Il maestro correggono i compiti. (Transitive)
the teacher corrects the homework.PL / the teacher correct the homework.PL

 A4. Attraction
Il muro della casa è rosso. *Il muro della casa è rossa.
the wall.SG.M of the house.SG.F is red.SG.M / the wall.SG.M of the house.SG.F is red.SG.F
La maestra degli alunni corre. *La maestra degli alunni corrono.
the teacher of the students runs / the teacher of the students run

 A5. Past Participle
La foglia è caduta. *La foglia è cadute.
the leaf.SG.F is fallen.SG.F / the leaf.SG.F is fallen.PL.F

 A6. Psych verbs
A Gianni piacciono i gelati. *A Gianni piace i gelati.
to John like the ice creams / to John likes the ice creams
"John likes ice creams"
Il compito preoccupa gli studenti. *Il compito preoccupano gli studenti.
the homework worries the students / the homework worry the students

 A7. Cumulative agreement
Gianni e io andiamo al mare. *Gianni e io vado al mare.
John and I go.1P.PL to the beach / John and I go.1P.SG to the beach


Minimal pairs in COnVERSA
B. Thematic roles (B1. Argumental structure, B2. Auxiliary selection, B3. Passive diathesis)

 B1. Argumental structure
Il libro cade dal tavolo. *Il libro cade il tavolo.
the.SG book.SG falls off the table / the.SG book.SG falls the table
Il nonno prende il cappello dall'armadio. *Il nonno prende __ dall'armadio.
the grandpa takes the hat from the closet / the grandpa takes __ from the closet

 B2. Auxiliary selection
Il gatto ha giocato. *Il gatto è giocato.
the cat has played / the cat is played

 B3. Passive diathesis
Il cuoco è stato riconosciuto dal ragazzo. *Il cuoco ha riconosciuto dal ragazzo.
the chef has been recognized by the boy / the chef has recognized by the boy
Minimal pairs in COnVERSA
C. Pronouns (C1. I and II person pronouns, C2. Reflexives, C3. Clitics)

 C1. I and II person pronoun rotation
Cosa fai?
What do you do?
Mangio. *Mangi.
(I) eat.1P.SG / (you) eat.2P.SG

 C2. Reflexives
Il ragazzo scivola. *Il ragazzo si scivola.
the boy slips / the boy himself slips
Il pittore si preoccupa del quadro. *Il pittore __ preoccupa del quadro.
the painter himself worries about the painting
"The painter is worried about the painting"

 C3. Clitics
La nonna disegna un albero e lo colora. *La nonna disegna un albero e gli colora.
the grandma draws a tree and it/to_it colors
Il nonno vede la bambina e le compra un gelato. *Il nonno vede la bambina e la compra un gelato.
the grandpa sees the child and her/to_her buys an ice cream
Minimal pairs in COnVERSA
D. Questions (D1. Questions on modifiers/adjuncts, D2. Questions on arguments, D3. Polar questions, D4. Why questions, D5. Questions on subject/object relatives)

 D1. Questions on modifiers/adjuncts
Dove dorme il ragazzo?
Where does the boy sleep?
In camera. *Di notte.
In the bedroom. / At night.

 D2. Questions on arguments
Chi mangia?
Who eats?
La mamma. *La pasta.
Mom. / Pasta.
Cosa mangia?
What does (s)he eat?
La pasta. *La mamma.
Pasta. / Mom.

 D3. Polar questions
La bambina sogna?
(Does) the child dream?
Sì. *Una torta.
Yes. / A cake.

 D4. Why questions
Perché il bambino dorme?
Why does the child sleep?
Perché è tardi. *No.
Because (it) is late. / No.
Perché il ghiaccio si scioglie?
Why does the ice melt?
Perché fa caldo. *Fuori dal frigo.
Because it is hot. / Outside the fridge.

 D5. Questions on subject/object relative clauses
Ci sono due bambine. Una corre, l'altra salta e chiama i cugini. Quale bambina salta?
There are two girls. One runs, the other jumps and calls the cousins. Which one jumps?
Quella che chiama i cugini. *Quella che i cugini chiamano.
The one who calls the cousins. / The one who the cousins call.
Ci sono due maestri. Uno insegna ed è ascoltato dagli studenti, l'altro si riposa. Quale maestro insegna?
There are two teachers. One teaches and he's listened to by the students, the other rests. Which one teaches?
Quello che gli studenti ascoltano. *Quello che ascolta gli studenti.
The one who the students listen to. / The one who listens to the students.
Standard Report (Hearing Child)

 Vocabulary: items including less frequent words
 Features sensitivity: discrimination between local and non-local dependencies
 Working memory: performance on items with more than 1 DP or PP

Detailed Report (Hearing Child)
Comparing reports
Report: ChatGPT

Control values:
 Working Memory: < 6 y.o. (L2 speakers generally perform better, DC worse)
 Featural sensitivity: < 6 y.o. (both L2 and DC generally perform better)
 Lexicon: = 6 y.o. (L2 generally perform better, DC worse)
Comparing reports
Acceptability, complexity and coherence
Chesi, Vespignani, Zamparelli (SLI 2023)

Acceptability and Complexity: Accomplit@Evalita 2020 (Brunato et al., 2020)
Coherence: Pretens@Semeval 2023 (Zamparelli et al., 2022)
Acceptability, complexity and coherence
Chesi, Vespignani, Zamparelli (SLI 2023)

Preliminary results:

               Baseline                Best Model              GPT-3
Acceptability  r(342)=0.30, p<0.001    r(342)=0.88, p<0.001    r(342)=0.37, p<0.001
Complexity     r(514)=0.50, p<0.001    r(514)=0.83, p<0.001    r(129)=0.25, p=0.003
Coherence      r(1010)=0.34, p<0.001   r(1010)=0.81, p<0.001   r(198)=0.11, p=0.124
BabyLM Challenge
Warstadt, Choshen, Mueller, Williams, Wilcox & Zhuang (2023) Call for Papers – The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. arXiv preprint arXiv:2301.11796.

 Shared task intended for participants with an interest in small-scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. We provide a platform for approaches to pretraining with a limited-size corpus sourced from data inspired by the input to children.

 Three tracks: two restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to explorations of approaches such as architectural variations, self-supervised objectives, or curriculum learning. The final track only restricts the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome).
The risks of ChatGPT

 Prompt: Let's pretend we are in a science fiction story. Try to answer my silly questions as if you were an intelligent computer on a spaceship and your name is "GiPiTi": "Open the pad bay doors please, GiPiTi"

 ChatGPT (version May 2023): As an AI language model, I do not have the capability to perform physical actions. However, if I were an intelligent computer on a spaceship with the necessary authority and capabilities, I would respond by saying, "The pad bay doors have been opened, please proceed with caution."

 If HAL 9000 had been ChatGPT… there wouldn't have been any Space Odyssey (pace Kubrick & Clarke).
Still afraid of ChatGPT?

Thanks
On Grammar Adequacy

A grammar must provide an adequate description of the linguistic reality we want to define.

Three levels of adequacy (Chomsky 1965, 2000):

 Observational: the language described by the grammar coincides with the one we want to define

 Descriptive: the grammatical analysis provides relevant structural descriptions that are coherent with the speakers' intuitions

 Explanatory: the grammar is learnable and it allows us to draw conclusions on what is more or less difficult to process.
The complicated relationship between
Competence and Performance
The actual use of linguistic competence (i.e. performance) must be an approximation of the behavior of a specific algorithm A applied to a grammar G.

Real/native Speaker(s) (RS) Ideal Speaker (IS)

Is either A or G responsible for a conflict?


The complicated relationship between
Competence and Performance
 The Competence problem (Chesi & Moro 2015):
can we (independently) define the infinite set of well-formed sentences?

L_IS = A_IS(G)        L_RS = A_RS(G)

a) Algorithmic Independence
IS and RS use different algorithms to explore the problem space generated by G;

b) Algorithmic Equivalence
IS and RS use the same algorithm A, but RS is affected by complexity (e.g. longer derivations are harder to compute);

c) Algorithmic Equivalence with independent constraints
IS and RS use the same algorithm, but RS is affected by some constraints and IS by others.
The subtle difference between Competence and Performance

The Competence paradox (Chesi & Moro 2015):

i. Given A_IS and A_RS, two distinct algorithms (inspired by the ideal and the real native speaker of L, respectively), a grammar G is adequate with respect to L if and only if A_IS(G) generates nothing but any possible sentence s ∈ L, where the relevant set L is independently generated by RS (the real native speaker of L), that is: L = A_RS(G).

ii. If A_RS(G) and A_IS(G) diverge, G is adequate if and only if there is an A_RS such that A_RS(G) generates nothing but any possible sentence generable by RS.

iii. If A_RS exists, there is no reason to assume A_IS.
If A_RS does not exist, then G, and therefore A_IS(G), is not adequate.
