Noy Zhang 2023 ChatGPT
[…] high-quality work. They received a base payment […]

[…] available in Table 1. The attrition rate was 6% in the control group and 11% in the treatment group. Balance tests indicate that across 13 pretreatment characteristics, the treatment and control groups exhibited a small but significant difference for only two characteristics: employment status and being a human resources professional. Our partly within-person design, which controls for performance on the pretreatment task, should eliminate any influence of selective attrition on our results. In the supplementary materials, we also report Lee bounds (24) on our main results and versions of our results controlling for employment status and occupation, which confirmed that our results are highly robust to selective attrition.

Results

Take-up of ChatGPT

[…] adjusting our estimates upward for imperfect compliance.

Productivity

We first show results for our two productivity measures: time taken and evaluator grades (Fig. 1). The experimental intervention shifted both outcomes substantially. In the treatment group, time taken on the posttreatment task dropped by 11 min (0.75 SDs) relative to the control group, who took an average of 27 min (P < 0.001). Average evaluator grades in the treatment group increased by 0.45 SDs (P < 0.001), with similar increases for overall grades and specific grades for writing quality, content quality, and originality.

These effects are not limited to specific pockets of the time or grade distributions. As shown in Fig. 1, C and D, the entire time dis- […]

[…] with strong incentives to produce high-quality output (under the convex scheme) demonstrates that the time-saving effects of ChatGPT are not specific to linear payment regimes and apply robustly across incentive structures.

In our third incentive arm involving 20% of participants, we required participants to spend exactly 15 min on each task, thereby holding effort fixed across the treatment and control groups and allowing us to interpret any difference in grades as a pure effect of ChatGPT access on productive capacity. In this arm, the treatment increased grades by a similar albeit not statistically significant 0.33 SDs (26).

In an additional intervention, after completing the second task, 30% of the treatment group were shown their first-task human-created output and given the opportunity to […]
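The standardized and raw effect sizes reported above can be cross-checked with simple arithmetic. The sketch below is our own illustration (the function and variable names are not from the paper); it backs out the pretreatment SD of time taken implied by the reported 11-min (0.75 SD) effect:

```python
# Back out the outcome SD implied by a raw effect and its SD-scaled counterpart.
# The numbers come from the reported productivity results; the code is illustrative.
def implied_sd(raw_effect, sd_effect):
    """Outcome SD implied by a raw effect and the same effect in SD units."""
    return raw_effect / sd_effect

time_effect_min = 11.0  # treatment effect on time taken, in minutes
time_effect_sd = 0.75   # the same effect expressed in pretreatment SDs

sd_minutes = implied_sd(time_effect_min, time_effect_sd)
print(round(sd_minutes, 1))  # ≈ 14.7 min pretreatment SD of time taken
```

Against a control-group mean of 27 min, an 11-min reduction is a roughly 40% time saving.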
[Figure 1: (A) mean minutes spent (roughly 15 to 25) and (B) mean grade (roughly 3.6 to 4.4), pretreatment versus posttreatment, treated versus control; (C) distribution of minutes spent on the second task (bins 0-5 through >45, percent of respondents); (D) distribution of second-task grades, 1 through 7 (percent of essay-grades).]
Fig. 1. Treatment effects on productivity. Effects shown are restricted to the linear and convex incentive groups. (A and B) Means (and 95% confidence intervals for those means) of self-reported time taken and average grades in the first and second task separately in the treatment and control groups. The results look very similar for the objective measure of time active (see the supplementary materials). Also shown are treatment effect coefficients and 95% confidence intervals, rescaled to be in terms of pretreatment SDs of the outcome variable. The coefficients are estimated from regressions of the within-participant change in outcome from before to after treatment on a treatment dummy, occupation*task order fixed effects, and incentive arm fixed effects. In (A), this is at the participant level and SEs are heteroskedasticity robust. In (B), this is at the participant-evaluator level, the regression also includes grader fixed effects, and SEs are clustered at the participant level. (C and D) Raw graphs of the outcome distribution in the treatment versus control group on the second task; (C) is at the participant level and (D) is at the participant-evaluator level.
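Written out explicitly, the estimating equation described in the caption takes roughly the following form (the symbols are ours, introduced for illustration; the original gives only the verbal specification):

```latex
\Delta y_i \;=\; \beta\, T_i \;+\; \gamma_{o(i)\times k(i)} \;+\; \delta_{a(i)} \;+\; \varepsilon_i
```

where \(\Delta y_i\) is participant \(i\)'s within-participant change in the outcome from the first to the second task, \(T_i\) is the treatment dummy, \(\gamma_{o(i)\times k(i)}\) are occupation-by-task-order fixed effects, \(\delta_{a(i)}\) are incentive-arm fixed effects, and \(\beta\) is the treatment effect, rescaled by the pretreatment SD of the outcome for reporting. At the participant-evaluator level, a grader fixed effect is added and SEs are clustered by participant.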
This decrease in inequality was driven by the fact that participants who scored lower on the first task benefited more from ChatGPT access. As Fig. 2A shows, the gap between treatment and control is much larger at the left end of the x axis.

Human-machine interactions

What kinds of human-machine interactions underlie the productivity results documented above? Did workers paste the task prompt into ChatGPT and immediately submit its output, minimizing their time spent and increasing their grades because ChatGPT's writing abilities exceeded theirs? Or did they treat ChatGPT as a helpful but imperfect tool, for example, using it to create a rough draft and then spending time editing and improving the draft or using it to brainstorm or edit?

Our evidence supports the first possibility. Almost everyone submitted lightly edited or unedited ChatGPT output, and we observed small time expenditures on editing and no resulting improvement in respondents' grades. In the treatment group, 33% of participants reported submitting ChatGPT's initial output without editing it, and 53% reported editing before submitting. However, those who reported editing were active on the task for only 3.3 min on average after we first observed them pasting in a large quantity of text (presumably from ChatGPT), with most active for 0 to 2 min (27). Qualitative examination suggests that most of this editing was superficial, such as changing a placeholder or rearranging a sentence. Evaluator grades also suggest that this editing was ineffectual. There was no correlation between how long a participant was active after pasting in the ChatGPT text and the grade they ultimately received, and treated respondents who used ChatGPT did not receive higher average grades than the raw ChatGPT output that we gave to evaluators to grade (see the supplementary materials).

It is not obvious whether these dynamics should be interpreted as evidence that ChatGPT will displace human workers or evidence that it will augment them. Although ChatGPT directly substituted for participants' effort with little need for human input, it also enabled […]
[Figure 2, panel (A) "Grade Inequality Decreases": average task 2 grade (y axis, 1 to 7) plotted against task 1 grade (x axis, 1 to 7) for treatment versus control; control slope: 0.414; change in slope: −0.272; 95% CI on change: [−0.405, −0.14].]

Fig. 2. Effects on grades and time across the initial grade distribution. Participant-evaluator observations are binned together according to the task 1 grade given to this participant by this evaluator. (A and B) Within each bin, the panels plot the average task 2 grade (A) or task 2 time taken (B) for the observations in the bin separately by treatment versus control. Also shown are the control-group slope, control-treatment difference in slopes, and the […]

[…] past week (P < 0.01). The persistence of this gap suggests that the dissemination of ChatGPT into real professional activity is still in very early stages, with usage held back by a lack of knowledge about or experience with the technology.

In the 2-week follow-up, ChatGPT users gave the technology an average usefulness score of 3.66 out of 5.00, somewhat lower than in our main experiment, likely because of the greater length and complexity of real-world tasks. The participants reported using it for a broad range of tasks such as generating recommendation letters for employees, responding to customer service requests, brainstorming, rough-drafting emails, and editing.

Nonusers were divided into three roughly equal-sized groups, reporting that: (i) ChatGPT was not useful in their job, (ii) they did not […]
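A quick way to read the slope statistics in Fig. 2A: adding the change in slope to the control-group slope gives the treatment-group slope, showing how much more weakly second-task grades depend on initial (first-task) grades under treatment. A minimal sketch of this arithmetic (the variable names are ours; the slope values come from the figure):

```python
# Treatment-group slope implied by the control slope and the change in slope
# annotated in Fig. 2A. Purely illustrative arithmetic.
control_slope = 0.414  # slope of task 2 grade on task 1 grade, control group
slope_change = -0.272  # treatment-control difference in slopes

treated_slope = control_slope + slope_change
print(round(treated_slope, 3))  # 0.142: grades depend far less on initial skill
```

The roughly two-thirds flatter slope in the treatment group is the figure's visual statement that lower-scoring participants gained the most.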
[Figure 3, panel (A) "Job Satisfaction Increases" (treatment effect: 0.47 SDs; 95% CI: [0.29, 0.65]) and panel (B) "Self-Efficacy Marginally Increases" (treatment effect: 0.19 SDs; 95% CI: [−0.01, 0.39]), each plotted pretreatment versus posttreatment for treated and control; panel (C) "Effects on Three Beliefs About Automation" (p-value: 0.005).]

Fig. 3. Job satisfaction, self-efficacy, and beliefs about automation. (A and B) Job satisfaction and self-efficacy (originally elicited on scales of 1 to 10, normalized to […]) before and after treatment in the treatment and control groups. Dots are means and error bars are 95% confidence intervals for means. Also shown are the coefficients on the treatment effect of a regression specified as in Fig. 1A. (C) Cross-sectional comparison of beliefs about automation in the treatment and control group, all on a scale of 1 to 10. The first question was "How worried are you about workers in your occupation being […]
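The confidence intervals annotated in Fig. 3 are consistent with the usual normal-approximation construction, effect ± 1.96 × SE. As a back-of-the-envelope check (our own sketch, not the authors' code), the implied standard errors can be recovered from the interval endpoints and used to reproduce them:

```python
# Recover the standard error implied by a symmetric 95% CI, then verify that
# effect +/- 1.96*SE reproduces the endpoints reported in Fig. 3.
def implied_se(lo, hi, z=1.96):
    """Standard error implied by a symmetric 95% confidence interval."""
    return (hi - lo) / (2 * z)

def ci(effect, se, z=1.96):
    """Normal-approximation 95% CI, rounded to two decimals."""
    return (round(effect - z * se, 2), round(effect + z * se, 2))

se_satisfaction = implied_se(0.29, 0.65)  # job satisfaction effect: 0.47 SDs
se_efficacy = implied_se(-0.01, 0.39)     # self-efficacy effect: 0.19 SDs

print(ci(0.47, se_satisfaction))  # (0.29, 0.65)
print(ci(0.19, se_efficacy))      # (-0.01, 0.39)
```

Both implied SEs are close to 0.09-0.10 SDs, which is why the smaller self-efficacy effect (0.19 SDs) is only marginally significant while the job-satisfaction effect (0.47 SDs) is clearly so.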
[…] nature, access or supply context-specific knowledge and is not a reliable source of precise factual information.

The tasks could also be described through short, self-contained prompts, making use of ChatGPT easy, whereas many real-world tasks involve vaguer objectives and instructions, requiring workers to exercise initiative in determining what to do. Finally, participants in our tasks faced direct incentives in the form of bonus payments scaling with output quality, which encouraged them to maximize generic output quality and minimize time spent. White-collar workers are typically incentivized through longer-run promotion and firing incentives, which might instead encourage conspicuous exertion of effort or the development of a consistent personal style, both of which make ChatGPT less useful.

The tasks and incentive schemes were chosen to meet the constraints of the experimental design. We required short tasks that could be explicitly described for and performed by a range of anonymous workers online, and we needed to incentivize serious effort. Our judgment was that building factual accuracy requirements into the tasks would either result in tasks that felt artificial and unnatural (e.g., requiring participants to Google and report one or two specific facts) or overwhelm our budget for evaluators (e.g., giving participants an open-ended research task and exhaustively fact checking their assertions).

The aforementioned factors limit but do not eliminate the generalizability of our results. In real-world tasks, the need to fact-check ChatGPT's output will reduce its time-saving benefits, but the speed and writing quality increases observed in our experiment are sufficiently large that we suspect that ChatGPT will still often be useful. Moreover, newer versions of ChatGPT are more consistently factually accurate, and some versions can access the internet to fact check themselves. We speculate that in more open-ended, real-world tasks, workers may find iterative rounds of prompting and discussion with ChatGPT useful even if they cannot immediately prompt out a final product. In these contexts, ChatGPT and human workers may be more strongly complementary than in our experiment. The importance of context-specific knowledge will also limit ChatGPT's utility, but there are plausible work-arounds. ChatGPT can be instructed to incorporate lists of context-specific factors, and organizations may be able to build customized ChatGPT-like models. Our follow-up surveys show that many workers do find ChatGPT useful in their real jobs.

Overall, we speculate that, relative to our experimental findings, the direct productivity effects of ChatGPT in the real economy will be somewhat lower and the technology will be
more strongly complementary to human workers. To what extent either of these is true remains an open question.

Implications

Our experiment captured only direct, immediate effects of ChatGPT on worker productivity. We could not examine the complex labor market dynamics that will arise as firms and workers adapt to ChatGPT. Several factors will mediate how the direct productivity impacts of ChatGPT affect wages and employment in exposed occupations. The first is the degree to which demand for goods produced by ChatGPT could expand as ChatGPT-fueled productivity increases make those goods much cheaper. For example, demand for programming services could plausibly expand massively if the price of those services fell. Aggregate programming […]

[…] hiring systems based partly on conspicuous exertion of effort. Large language models may be used to monitor or evaluate workers and avoid paying higher wages (31). Organizational and societal norms around the acceptability of using tools such as ChatGPT may take time to cohere and may significantly affect adoption of the technology (32-35). Overall, the arrival of ChatGPT ushers in an era of vast uncertainty about the economic and labor market effects of AI technologies (36-38). Our experiment takes the first step toward answering the many questions that have arisen.

REFERENCES AND NOTES

1. D. H. Autor, J. Econ. Perspect. 29, 3-30 (2015).
2. D. H. Autor, D. Dorn, Am. Econ. Rev. 103, 1553-1597 (2013).
3. T. Eloundou, S. Manning, P. Mishkin, D. Rock, arXiv:2303.10130 [econ.GN] (2023).
[…]
26. Because of the small sample size in this group, the confidence interval includes zero and the treatment and control group differ in terms of average pretreatment grades (see the supplementary materials).
27. See the supplementary materials for a full description of this analysis.
28. The effects on beliefs about automation may also have dissipated because rising global awareness of the technology led to an equalization of familiarity between the treatment and control groups.
29. Y. Liu, A. Mittal, D. Yang, A. Bruckman, in CHI '22: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, 29 April 2022 - 5 May 2022, S. Barbosa, C. Lampe, C. Appert, D. A. Shamma, S. Drucker, J. Williamson, K. Yatani, Eds. (Association for Computing Machinery, 2022); https://dl.acm.org/doi/10.1145/3491102.3517731.
30. J. Qadir, TechRxiv [Preprint] (2022); https://doi.org/10.36227/techrxiv.21789434.v1.
31. D. Acemoglu, A. F. Newman, Eur. Econ. Rev. 46, 1733-1756 (2002).
32. J. Ayling, A. Chapman, AI Ethics 2, 405-429 (2022).
33. J. Hohenstein et al., arXiv:2102.05756 [cs.HC] (2021).
34. M. Suh, E. Youngblom, M. Terry, C. J. Cai, in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 8-13 May 2021, Y. Kitamura, A. Quigley, K. Isbister, T. Igarashi, P. Bjorn, S. Drucker, Eds. (Association […]