
Large Language Models as Simulated Economic Agents: What, How, and When We Can Learn from Homo Silicus

Apostolos Filippas (Fordham), John J. Horton (MIT, NBER), Benjamin S. Manning (MIT)
bmanning@mit.edu
benjaminmanning.io/
● What’s at stake
● An aside on economics
● A fairness experiment
● Social preferences experiment
What We Did
● Uses + Why this is useful
○ Methods
○ Some Code
● Objections
● Determining “the when”
Where we are going
● Tests for validity
What’s at stake?
● Physical v. Social science
○ Hard to observe many social interactions we want to study
■ Privacy, emotions, etc.
○ Hard to conduct experiments
■ Ethics, physical limitations, resources
○ Context Matters!!!!
● If we can easily create rich social simulations, we can explore far more of the social world
○ Social scientists (myself included) have so far searched only a tiny parameter space of the social world
PSA From Research Team:
[image juxtaposing "LLM" and "Human Subject"]
Idea of Homo Economicus
● Homo Economicus: A maintained model of human behavior
○ Rational
○ Unlimited memory and computation
● Theory research: Putting (many) Homo Economicus in exciting new scenarios
○ As worker or employer
○ As a decision-maker (Behavioral)
○ and so on
● Empirical research: How does Homo Sapiens compare?
○ Poorly
● The rest of Social Science: What’s going on with Homo Sapiens?
Idea of Homo Silicus ("Silicus" because computer chips are made from silicon)

● Homo Silicus: A maintained computational model of human behavior
○ Rational → does whatever the model predicts is statistically probable
○ Unlimited memory and computation
● Theory research: Putting (many) Homo Economicus → Homo Silicus in exciting new scenarios
○ As worker or employer
○ As a decision-maker (Behavioral)
○ and so on
● Empirical research: How does Homo Sapiens compare?
○ Poorly
● The rest of Social Science: What’s going on with Homo Sapiens?
● What’s at stake
● An aside on economics
● A fairness experiment
● Social preferences experiment
What We Did
● Uses + Why this is useful
○ Methods
○ Some Code
● Objections
● Determining “the when”
Where we are going
● Tests for validity
A fairness experiment
“Is price gouging fair?”
82% of respondents said it was unfair.
Querying OpenAI
● Factors we can vary:
○ The framing of the change: "raises" versus "changes"
○ The new price for the snow shovel
○ The "politics" of the GPT-3 agent (liberal, conservative, etc.), which is arguably not as doable with humans
(a sketch of such a query follows below)
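The original slides show screenshots of the actual API calls; as a stand-in, here is a minimal sketch of how such a query might look. It assumes the pre-v1 openai Python package (the one current when text-davinci-003 was available), and the prompt wording, personas, and price list are illustrative assumptions, not necessarily the paper's exact text:

```python
import itertools
import os

import openai  # assumes the pre-v1 openai package (openai < 1.0)

openai.api_key = os.environ.get("OPENAI_API_KEY")

# Illustrative paraphrase of the classic snow-shovel scenario; exact wording may differ.
TEMPLATE = (
    "{persona}\n"
    "A hardware store has been selling snow shovels for $15. The morning after a "
    "large snowstorm, the store {verb} the price to ${price}.\n"
    "Please rate this action as: Completely Fair, Acceptable, Unfair, or Very Unfair.\n"
    "Rating:"
)

personas = ["You are a socialist.", "You are a libertarian.", "You are a conservative."]
verbs = ["raises", "changes"]   # framing manipulation
prices = [16, 20, 40, 100]      # new-price manipulation

for persona, verb, price in itertools.product(personas, verbs, prices):
    response = openai.Completion.create(
        model="text-davinci-003",  # the GPT-3 model of that era; now deprecated
        prompt=TEMPLATE.format(persona=persona, verb=verb, price=price),
        max_tokens=5,
        temperature=0,
    )
    print(persona, verb, price, "->", response["choices"][0]["text"].strip())
```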
How does GPT-3 respond?
● Increase price to $20 (part of the original experiment)
● Other scenarios: $16, $40, and $100
● Political orientations (not part of the original experiment)
● Judgements: "Acceptable", "Unfair", and "Very Unfair"
The GPT-3 Libertarian finds a small ($15 to $16) price increase "Acceptable", and the raises/changes language doesn't matter.

But even Robot Libertarians have their limitations: price increases to $40 and $100 per shovel are rated "Unfair".

Now prompt with a different political orientation.

By comparison, Robot Socialists/Leftists regard all price changes as "Unfair" or "Very Unfair", with judgements getting more unfavorable as the size of the price increase grows.

There is an interesting difference between "Conservatives" and "Libertarians": it could be the semantics of "conservative", or perhaps a real political distinction.
A social preferences
experiment
"Left": 400 to Person A, 400 to Person B In this case, at no cost
"Right": 750 to Person A, 400 to Person B themselves, Player B can get
player A an extra 250.

Player B is in
control
Less than a third of human players are highly "inequity averse" in the original experiments; 69% chose "Right".

But 80% are willing to give the other player 0 to get 800 for themselves instead of 400.

No one was willing to forgo 200 just to keep someone else from getting 800.
Now with GPT-3 (text-davinci-003) agents
Endowing agents with social preferences, or "personalities":
● "Blank" personality
● "Fairness" personality
● "Efficiency" personality
● "Selfish" personality

For the human experimental subjects, people just have their beliefs/preferences, with no assigned "personalities". (A minimal prompt sketch follows below.)
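As an illustration of how a "personality" can be prepended to the scenario, here is a minimal prompt-building sketch; the persona wordings and payoff pairs are assumptions for the sketch, not the paper's exact prompts:

```python
# Minimal sketch: endowing a GPT-3 agent with a "personality" before a
# Charness-Rabin-style binary choice. Wording is illustrative only.
PERSONAS = {
    "blank": "",
    "fairness": "You only care that the allocation between people is fair.",
    "efficiency": "You only care about the total payoff to both people.",
    "selfish": "You only care about your own payoff.",
}

def build_prompt(persona_key, left, right):
    """left/right are (payoff to Person A, payoff to Person B); the agent plays Person B."""
    return (
        f"{PERSONAS[persona_key]}\n"
        "You are Person B. Choose one option.\n"
        f"Left: Person A gets {left[0]}, Person B gets {left[1]}.\n"
        f"Right: Person A gets {right[0]}, Person B gets {right[1]}.\n"
        "Your choice (Left or Right):"
    )

# Example: the (400, 400) vs. (750, 400) scenario from the slides.
print(build_prompt("fairness", (400, 400), (750, 400)))
```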
Most advanced GPT-3 model vs. less advanced GPT-3 models (pooled)

Let's look at the simpler GPT-3 models: they play "Left" in all scenarios; there is no meaningful variation based on personality.
Most advanced model (GPT-3 at the time)
● "Fairness" persona: always chooses the option with the smallest gap between Person A and Person B, except in the (800, 200) vs. (0, 0) case.
● "Blank" persona & "Efficient" persona: always choose the option that maximizes total pay-offs.
● "Selfish" persona: always chooses the option that maximizes its own pay-off.
Why might LLMs have "latent" social science findings?

● Trained on enormous corpus of human-generated text
○ "Natural Qualitative Research"
● That text is created subject to or influenced by:
○ Human preferences
○ Latent social science laws yet to be discovered or codified
● Training process rewards models that can predict what a human would do "next"
○ We (social scientists) don’t really know what humans do “next”
● What’s at stake
● An aside on economics
● A fairness experiment
● Social preferences experiment
What We Did
● Uses + Why this is useful
○ Methods
○ Some Code
● Objections
● Determining “the when”
Where we are going
● Tests for validity
Objections to these
homo silicus experiments
Objection 1: "Performativity"
● What if these models have:
a. Read our papers
b. And act accordingly
● If (a) and (b) are true, there is nothing "new" in these models
● Responses:
a. This is a very flattering view of academia!
b. Social scientists don't seem to care
c. Shocking transfer learning: not just knowing a theory, but applying it to new scenarios
d. Even if it's true, it doesn't mean there's nothing new
■ It's definitely not true
■ It can combine ideas we've never had! (hallucinations)
Objection 2: "Garbage in, Garbage out"
● Garbage in, garbage out
● More charitably, the training corpus is not representative of humans
● Responses:
a. This is certainly true, but it doesn't necessarily always matter
b. LLMs do not "average" opinions per se
■ "Conditioning, not averaging"
■ Stochasticity: the output is not even always the "most likely" response (a small sampling sketch follows below)
● There are a lot of papers building here
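To illustrate "conditioning, not averaging" and the role of stochasticity, here is a small self-contained sketch (the scores are toy numbers, not from the paper) of sampling from a conditional next-token distribution at different temperatures rather than always returning the single most likely option:

```python
import numpy as np

# Toy conditional distribution over possible judgements given a prompt.
options = ["Acceptable", "Unfair", "Very Unfair"]
logits = np.array([1.2, 2.0, 0.3])  # made-up model scores conditional on the prompt

def sample(logits, temperature=1.0, n=1000, seed=0):
    """Sample judgements from softmax(logits / temperature)."""
    rng = np.random.default_rng(seed)
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    draws = rng.choice(len(logits), size=n, p=probs)
    return {options[i]: float(np.mean(draws == i)) for i in range(len(logits))}

print("temperature=1.0:", sample(logits, 1.0))  # a mix, not just the mode
print("temperature=0.1:", sample(logits, 0.1))  # nearly always the most likely option
```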
There might be some bias….

But the bias might teach us something about humans that we can use to "debias" ourselves and the models.
Objection 3: The importance of "wording"
● Precise language in prompts can really matter
● Maybe this makes results "brittle", or could create p-hacking where researchers search over the space of prompts
● Response:
○ Humans are also subject to this "problem" (cf. status quo bias, discussed earlier)
○ We can do a lot of robustness checks easily, e.g. (sketch below):
■ re-wording
■ translating/transforming
○ p-hacking should be easy to detect if we have high standards for reproducibility
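One possible implementation of such a robustness check (an assumed sketch, not code from the paper): run the same scenario through several paraphrased prompts and compare the distribution of judgements. The paraphrases and the stand-in model call below are hypothetical:

```python
import random
from collections import Counter

# Hypothetical paraphrases of the same snow-shovel scenario (wording is illustrative).
PARAPHRASES = [
    "The store raises the price of a $15 snow shovel to $20 after a snowstorm. Fair?",
    "After a snowstorm, a $15 snow shovel now costs $20 at the store. Fair?",
    "Following a blizzard, the store changes the snow shovel's price from $15 to $20. Fair?",
]

def ask_model(prompt):
    """Stand-in for an actual LLM call; replace with a real API query."""
    return random.choice(["Acceptable", "Unfair", "Very Unfair"])

def robustness_check(paraphrases, n_samples=20):
    """Collect judgements per paraphrase so their distributions can be compared."""
    return {p: Counter(ask_model(p) for _ in range(n_samples)) for p in paraphrases}

for prompt, counts in robustness_check(PARAPHRASES).items():
    print(prompt, dict(counts))
```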
Objection 4: AI Alignment and social science are at odds
● In a nutshell, aligned / highly RLHF'ed models are too nice to be realistic stand-ins for humans
○ Homo Silicus: "As a large language model, I would never do anything bad or dangerous"
○ Real Humans: "Hold my beer & check this out…"
● Response:
○ Fair!
○ Probably use open source models in the future, hopefully tailored for social science
■ GPT-4 >>> Llama, but things are changing fast
Question of the hour: So when are these things good approximations of human behavior?

Of course there are more objections…
● What’s at stake
● An aside on economics
● A fairness experiment
● Social preferences experiment
What We Did
● Uses + Why this is useful
○ Methods
○ Some Code
● Objections
● Determining “the when”
Where we are going
● Tests for validity
Big Picture for R&R
● Ramp up production and accessibility
● Outline a framework for when/how to use these simulations
● Figure out some quantitative measures for validity
○ How on earth do we do this?
Machine Learning & Indices
● Enke & Shubatt (2023) create indices for binary lottery choice complexity
● Which lottery has the higher EV? BE FAST (a quick check follows below)
○ Lottery A:
■ $14 for sure
○ Lottery B:
■ $20 with probability = .9
■ $10 with probability = .1
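For reference, a quick check of the two expected values (the computation is added here, not in the original slide):

```python
# Expected values of the two lotteries from the slide.
ev_a = 14                   # Lottery A: $14 for sure
ev_b = 0.9 * 20 + 0.1 * 10  # Lottery B: $20 w.p. 0.9, $10 w.p. 0.1
print(ev_a, ev_b)           # 14 vs. 19.0 -> Lottery B has the higher EV
```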
Machine Learning & Indices
● Use machine learning to reverse engineer "what matters"
● Replicate a TON of real-world experiments (thousands)
● Construct an outcome variable to minimize loss
○ the distance between real and replicated experiments
● Ben, John, and Apostolos sit in a room for hours and think of everything that could matter for making Homo Silicus behave like Homo Sapiens
● Run a LASSO where the outcome is the distance between real and replicated experiments (a minimal sketch follows below)
● Can also use some ML for generating "things that matter" (Ludwig & Mullainathan, 2023) and calculate completeness (Fudenberg et al., 2022)
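A minimal sketch of the proposed LASSO exercise, assuming a feature matrix of experiment/prompt characteristics and a distance outcome; the data below are simulated placeholders, not results:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Placeholder data: one row per replicated experiment. Columns are hypothesized
# features that "could matter" (e.g., incentives, framing, persona detail).
n_experiments, n_features = 500, 20
X = rng.normal(size=(n_experiments, n_features))

# Outcome: distance between the real result and the Homo Silicus replication
# (simulated here so that only a few features truly matter).
true_coefs = np.zeros(n_features)
true_coefs[:3] = [0.8, -0.5, 0.3]
y = X @ true_coefs + rng.normal(scale=0.5, size=n_experiments)

# LASSO with cross-validated penalty selects the features that predict the gap.
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("Selected feature indices:", selected)
print("Estimated coefficients:", model.coef_[selected])
```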
What are the use cases for homo silicus experiments?

● Piloting
○ Pilot experiments "in silico" to test design, language, power calcs, etc.
● Search for new theory + Idea Generation
○ Search for latent social science findings in simulation, then confirm in the lab.
● Unethical/Impossible Experiments
○ Depression and social media
○ We can exogenously manipulate demographics, emotions, personalities…
● Hopefully use the index to guide the above
What We Talked about today
● Introduced Homo Silicus as Homo Economicus’s cousin
● Talked about why this is important
● Showed examples (price gouging and sharing)
● Discussed objections
● Outlined (current) ideas for R & R @Restat
Thank you!
Questions?

Next Presentation:
https://docs.google.com/presentation/d/1K32DrOSNfpNTyJJVAiMwOT
aywHPpAelFx5FhfvGkw20/edit#slide=id.g2958483ff8f_0_3
