
research

DOI:10.1145/3633453

Case study asks Copilot users about its impact on their productivity, and seeks to find their perceptions mirrored in user data.
BY ALBERT ZIEGLER, EIRINI KALLIAMVAKOU, X. ALICE LI,
ANDREW RICE, DEVON RIFKIN, SHAWN SIMISTER,
GANESH SITTAMPALAM, AND EDWARD AFTANDILIAN

Measuring GitHub Copilot's Impact on Productivity

key insights

˽ AI pair-programming tools such as GitHub Copilot have a big impact on developer productivity. This holds for developers of all skill levels, with junior developers seeing the largest gains.

˽ The reported benefits of receiving AI suggestions while coding span the full range of typically investigated aspects of productivity, such as task time, product quality, cognitive load, enjoyment, and learning.

˽ Perceived productivity gains are reflected in objective measurements of developer activity.

˽ While suggestion correctness is important, the driving factor for these improvements appears to be not correctness as such, but whether the suggestions are useful as a starting point for further development.

CODE-COMPLETION SYSTEMS OFFERING suggestions to a developer in their integrated development environment (IDE) have become the most frequently used kind of programmer assistance.1 When generating whole snippets of code, they typically use a large language model (LLM) to predict what the user might type next (the completion) from the context of what they are working on at the moment (the prompt).2 This system allows for completions at any position in the code, often spanning multiple lines at once.
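To make the prompt-and-completion mechanics concrete, the following is a minimal sketch of this kind of completion loop, using an off-the-shelf causal language model from the Hugging Face transformers library. It is an illustration only: the model choice (gpt2 as a stand-in for a code-specific model) and the decoding settings are our assumptions, not GitHub Copilot's actual pipeline.

```python
# Minimal sketch of LLM-based code completion (illustrative only; not
# GitHub Copilot's pipeline). Assumes the Hugging Face `transformers`
# library; gpt2 stands in for a code-specific model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "def mean(xs):\n    "  # the context around the cursor
result = generator(
    prompt,
    max_new_tokens=24,    # cap the length of the completion
    do_sample=False,      # greedy decoding, for reproducibility
)

# The completion is the generated text with the prompt stripped off.
completion = result[0]["generated_text"][len(prompt):]
print(completion)
```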



Potential benefits of generating large sections of code automatically are huge, but evaluating these systems is challenging. Offline evaluation, where the system is shown a partial snippet of code and then asked to complete it, is difficult not least because for longer completions there are many acceptable alternatives and no straightforward mechanism for labeling them automatically.5 An additional step taken by some researchers3,21,29 is to use online evaluation and track the frequency of real users accepting suggestions, assuming that the more contributions a system makes to the developer's code, the higher its benefit. The validity of this assumption is not obvious when considering issues such as whether two short completions are more valuable than one long one, or whether reviewing suggestions can be detrimental to programming flow.

Code completion in IDEs using language models was first proposed in Hindle et al.,9 and today neural synthesis tools such as GitHub Copilot, CodeWhisperer, and TabNine suggest code snippets within an IDE with the explicitly stated intention to increase a user's productivity. Developer productivity has many aspects, and a recent study has shown that tools like these are helpful in ways that are only partially reflected by measures such as completion times for standardized tasks.23,a Alternatively, we can leverage the developers themselves as expert assessors of their own productivity. This meshes well with current thinking in software engineering research suggesting measuring productivity on multiple dimensions and using self-reported data.6 Thus, we focus on studying perceived productivity.

Here, we investigate whether usage measurements of developer interactions with GitHub Copilot can predict perceived productivity as reported by developers. We analyze 2,631 survey responses from developers using GitHub Copilot and match their responses to measurements collected from the IDE. We consider acceptance counts and more detailed measures of contribution, such as the amount of code contributed by GitHub Copilot and persistence of accepted completions in the code. We find that acceptance rate of shown suggestions is a better predictor of perceived productivity than the alternative measures. We also find that acceptance rate varies significantly over our developer population as well as over time, and present a deeper dive into some of these variations.

Our results support the principle that acceptance rate can be used for coarse-grained monitoring of the performance of a neural code synthesis system. This ratio of shown suggestions being accepted correlates better with perceived productivity than more detailed measures of contribution. However, other approaches remain necessary for fine-grained investigation due to the many human factors involved.

a. Nevertheless, such completion times are greatly reduced in many settings, often by more than half.16
Figure 1. GitHub Copilot's code completion funnel. [Bar chart of the average number of events per survey user active hour: completion opportunity (170), completion shown (50), completion accepted (24), and completions unchanged or mostly unchanged after 30 seconds, 2 minutes, 5 minutes, and 10 minutes (roughly 3 to 7 events per hour).]

Figure 2. Demographic composition of survey respondents. [Charts of responses to: "Think of the language you have used the most with Copilot. How proficient are you in that language?" (Beginner, Intermediate, Advanced); "Which best describes your programming experience?" (Student/Learning, 0–2, 3–5, 6–10, 11–15, or 16+ years of professional experience); "Which of the following best describes what you do?" (Student, Professional, Hobbyist, Consultant/Freelancer, Researcher, Other); and "What programming languages do you usually use? Choose up to three from the list." (Python, JavaScript, TypeScript, Java, Ruby, Go, C#, Rust, HTML, Other).]

Background
Offline evaluation of code completion can have shortcomings even in tractable circumstances where completions can be labeled for correctness. For example, a study of 15,000 completions by 66 developers in Visual Studio found significant differences between synthetic benchmarks used for model evaluation and real-world usage.7 The evaluation of context-aware API completion for Visual Studio IntelliCode considered Recall@5—the proportion of completions for which the correct method call was in the top five suggestions. This metric fell from 90% in offline evaluation to 70% when used online.21

Due to the diversity of potential solutions to a multi-line completion task, researchers have used software testing to evaluate the behavior of completions. Competitive programming sites have been used as a source of such data,8,11 as well as handwritten programming problems.5 Yet, it is unclear how well performance on programming competition data generalizes to interactive development in an IDE.
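To make the Recall@5 metric discussed above concrete, here is a minimal sketch of how it can be computed from logged suggestion lists. The event tuples are hypothetical examples, not the IntelliCode evaluation data.

```python
# Sketch: Recall@k for method-call completion. Each event pairs the
# call the developer actually wrote with the ranked suggestion list.
# The example events are hypothetical, not real telemetry.
def recall_at_k(events, k=5):
    """Fraction of events whose true completion is in the top-k suggestions."""
    hits = sum(1 for truth, suggestions in events if truth in suggestions[:k])
    return hits / len(events)

events = [
    ("append", ["append", "extend", "insert", "add", "pop"]),   # hit
    ("extend", ["insert", "add", "pop", "remove", "clear"]),    # miss
]
print(recall_at_k(events, k=5))  # 0.5
```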


In this work, we define acceptance rate as the fraction of completions shown to the developer that are subsequently accepted for inclusion in the source file. The IntelliCode Compose system uses the term click-through rate (CTR) for this and reports a value of 10% in online trials.20 An alternative measure is that of daily completions accepted per user (DCPU), for which a value of around 20 has been reported.3,29 To calculate acceptance rate one must, of course, normalize DCPU by the time spent coding each day. For context, in our study, GitHub Copilot has an acceptance rate of 27% and a mean DCPU in excess of 312 (see Figure 1).b These differences are presumably due to differences in the kinds of completion offered, or perhaps to user-interface choices. We discuss later how developer objectives, choice of programming language, and even time of day seem to affect our data. Such discrepancies highlight the difficulty in using acceptance rate to understand the value of a system.

There is some evidence that acceptance rate (and indeed correctness) might not tell the whole story. One survey of developers considered the use of AI to support translation between programming languages and found indications that developers tolerated, and in some cases valued, erroneous suggestions from the model.26

Measuring developer productivity through activity counts over time (a typical definition of productivity borrowed from economics) disregards the complexity of software development, as such counts account for only a subset of developer outputs. A more holistic picture is formed by measuring perceived productivity through self-reported data across various dimensions6 and supplementing it with automatically measured data.4 We used the SPACE framework6 to design a survey that captures self-reported productivity and paired the self-reported data with usage telemetry.

To the best of our knowledge, this is the first study of code suggestion tools establishing a clear link between usage measurements and developer productivity or happiness. A previous study comparing GitHub Copilot against IntelliCode with 25 participants found no significant correlation between task completion times and survey responses.22

Data and Methodology
Usage measurements. GitHub Copilot provides code completions using OpenAI language models. It runs within the IDE and at appropriate points sends a completion request to a cloud-hosted instance of the neural model. GitHub Copilot can generate completions at arbitrary points in code rather than, for example, only being triggered when a developer types a period for invoking a method on an object. A variety of rules determine when to request a completion, when to abandon requests if the developer has moved on before the model is ready with a completion, and how much of the response from the model to surface as a completion.

As stated in our terms of usage,b the GitHub Copilot IDE extension records the events shown in Table 1 for all users. We make usage measurements for each developer by counting those events.

Our measures of persistence go further than existing work, which stops at acceptance. The intuition here is that a completion which is accepted into the source file but then subsequently turns out to be incorrect can be considered to have wasted developer time both in reviewing it and then having to go back and delete it. We also record mostly unchanged completions: a large completion requiring a few edits might still be a positive contribution. It is not clear how long after acceptance one should confirm persistence, so we consider a range of options.

The events pertaining to completions form a funnel, which we show quantitatively in Figure 1. We include a summary of all data in Appendix A.c (All appendices for this article can be found online at https://dl.acm.org/doi/10.1145/3633453.)

We normalize these measures against each other and write X_per_Y to indicate we have normalized metric X by metric Y. For example: accepted_per_hour is calculated as the total number of accepted events divided by the total number of (active) hour events.

Table 2 defines the core set of metrics we feel have a natural interpretation in this context. We note there are alternatives, and we incorporate these in our discussion where relevant.

b. See https://bit.ly/3S7oqZV
c. Appendices can be found in the arXiv version: https://arxiv.org/pdf/2205.06537.pdf
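As an illustration of the X_per_Y convention, the following sketch derives some of the Table 2 measures from raw event counts. The field names are our assumptions, not the actual telemetry schema; the example numbers are the per-hour funnel averages from Figure 1.

```python
# Sketch: normalizing raw usage-event counts into X_per_Y measures.
# Field names are illustrative, not GitHub Copilot's telemetry schema.
from dataclasses import dataclass

@dataclass
class UsageCounts:
    opportunity: int
    shown: int
    accepted: int
    active_hours: int

def per(x: float, y: float) -> float:
    """Normalize metric x by metric y, guarding against division by zero."""
    return x / y if y else 0.0

dev = UsageCounts(opportunity=170, shown=50, accepted=24, active_hours=1)

acceptance_rate = per(dev.accepted, dev.shown)         # accepted_per_shown
shown_rate      = per(dev.shown, dev.opportunity)      # shown_per_opportunity
acceptance_freq = per(dev.accepted, dev.active_hours)  # accepted_per_active_hour
print(acceptance_rate, shown_rate, acceptance_freq)    # 0.48 0.29... 24.0
```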
Table 1. Developer usage events collected by GitHub Copilot.

Opportunity: A heuristic-based determination by the IDE and the plug-in that a completion might be appropriate at this point in the code (for example, the cursor is not in the middle of a word).
Shown: Completion shown to the developer.
Accepted: Completion accepted by the developer for inclusion in the source file.
Accepted char: The number of characters in an accepted completion.
Mostly unchanged X: Completion persisting in source code with limited modifications (Levenshtein distance less than 33%) after X seconds, where we consider durations of 30, 120, 300, and 600 seconds.
Unchanged X: Completion persisting in source code unmodified after X seconds.
(Active) hour: An hour during which the developer was using their IDE with the plug-in active.
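A minimal sketch of the "mostly unchanged" classification from Table 1, assuming the Levenshtein distance is normalized by the length of the accepted completion; the helper names and the normalization choice are ours, not necessarily the plug-in's exact implementation.

```python
# Sketch: classifying an accepted completion as "mostly unchanged"
# (normalized Levenshtein distance below 33%) at some time after
# acceptance. Function names and normalization are illustrative.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mostly_unchanged(accepted: str, current: str, threshold: float = 0.33) -> bool:
    """True if the completion persists with limited modifications."""
    if not accepted:
        return current == accepted
    return levenshtein(accepted, current) / len(accepted) < threshold

print(mostly_unchanged("for i in range(10):", "for i in range(12):"))  # True
```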
Table 2. The core set of measurements considered in this article.

Shown rate: Ratio of completion opportunities that resulted in a completion being shown to the user.
Acceptance rate: Ratio of shown completions accepted by the user.
Persistence rate: Ratio of accepted completions unchanged after 30, 120, 300, and 600 seconds.
Fuzzy persistence rate: Ratio of accepted completions mostly unchanged after 30, 120, 300, and 600 seconds.
Efficiency: Ratio of completion opportunities that resulted in a completion accepted and unchanged after 30, 120, 300, and 600 seconds.
Contribution speed: Number of characters in accepted completions per distinct, active hour.
Acceptance frequency: Number of accepted completions per distinct, active hour.
Persistence frequency: Number of unchanged completions per distinct, active hour.
Total volume: Total number of completions shown to the user.
Loquaciousness: Number of shown completions per distinct, active hour.
Eagerness: Number of shown completions per opportunity.

Productivity survey. To understand users' experience with GitHub Copilot, we emailed a link to an online survey to 17,420 users. These were participants of the unpaid technical preview using GitHub Copilot with their everyday programming tasks. The only selection criterion was having previously opted in to receive communications. A vast majority of survey users (more than 80%) filled out the survey within the first two days, on or before February 12, 2022. We therefore focus on data from the four-week period leading up to this point ("the study period"). We received a total of 2,047 responses we could match to usage data from the study period, the earliest on Feb. 10, 2022 and the latest on Mar. 6, 2022.

The survey contained multiple-choice questions regarding demographic information (see Figure 2) and Likert-style questions about different aspects of productivity, which were randomized in their order of appearance to the user. Figure 2 shows the demographic composition of our respondents. We note the significant proportion of professional programmers who responded.

The SPACE framework6 defines five dimensions of productivity: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. We use four of these (S, P, C, E), since self-reporting on (A) is generally considered inferior to direct measurement. We included 11 statements covering these four dimensions in addition to a single statement: "I am more productive when using GitHub Copilot." For each self-reported productivity measure, we encoded its five ordinal response values to numeric labels (1 = Strongly Disagree, ..., 5 = Strongly Agree). We include the full list of questions and their coding to the SPACE framework in Appendix C. For more information on the SPACE framework and how the empirical software engineering community has been discussing developer productivity, please see the sidebar "Developer Productivity and the SPACE Framework."

Early in our analysis, we found that the usage metrics we describe in the Usage Measurements section corresponded similarly to each of the measured dimensions of productivity, and in turn these dimensions were highly correlated to each other (Figure 3). We therefore added an aggregate productivity score, calculated as the mean of all 12 individual measures (excluding skipped questions), as sketched in the code below. This serves as a rough proxy for the much more complex concept of productivity, facilitating recognition of overall trends, which may be less discernible on individual variables due to higher statistical variation. The full dataset of these aggregate productivity scores together with the usage measurements considered in this article is available at https://bit.ly/47HVjAM.

Given that it has been impossible to produce a unified definition or metric(s) for developer productivity, there have been attempts to synthesize the factors that impact productivity to describe it holistically, include various relevant factors, and treat developer productivity as a composite measure.17,19,24 In addition, organizations often use their own multidimensional frameworks to operationalize productivity, which reflect their engineering goals—for example, Google uses the QUANTS framework, with five components of productivity.27 In this article, we use the SPACE framework,6 which builds on a synthesis of extensive and diverse literature by expert researchers and practitioners in the area of developer productivity.
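A minimal sketch of the survey encoding and aggregation described above; the question keys and response strings are illustrative assumptions, not the exact survey instrument.

```python
# Sketch: encoding Likert responses (1 = Strongly Disagree ... 5 =
# Strongly Agree) and averaging the answered items into an aggregate
# productivity score. Question keys are illustrative.
LIKERT = {
    "Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
    "Agree": 4, "Strongly Agree": 5,
}

def aggregate_productivity(responses):
    """Mean of the encoded measures, excluding skipped questions."""
    scores = [LIKERT[v] for v in responses.values() if v is not None]
    return sum(scores) / len(scores)

responses = {
    "tasks_faster": "Agree",
    "stay_in_flow": "Strongly Agree",
    "learn_from": None,  # skipped question, excluded from the mean
}
print(aggregate_productivity(responses))  # 4.5
```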


SPACE is an acronym of the five dimensions of productivity:

˲ S (Satisfaction and well-being): This dimension is meant to reflect developers' fulfillment with the work they do and the tools they use, as well as how healthy and happy they are with the work they do. This dimension reflects some of the easy-to-overlook trade-offs involved when looking exclusively at velocity acceleration—for example, when we target faster turnaround of code reviews without considering workload impact or burnout for developers.

˲ P (Performance): This dimension aims to quantify outcomes rather than output. Example metrics that capture performance relate to quality and reliability, as well as further-removed metrics such as customer adoption or satisfaction.

˲ A (Activity): This is the count of outputs—for example, the number of pull requests closed by a developer. As a result, this dimension is best quantified via system data. Given the variety of developers' activities as part of their work, it is important that the activity dimension accounts for more than coding activity—for instance, writing documentation, creating design specs, and so on.

˲ C (Communication and collaboration): This dimension aims to capture that modern software development happens in teams and is, therefore, impacted by the discoverability of documentation or the speed of answering questions, or the onboarding time and processing of new team members.

˲ E (Efficiency and flow): This dimension reflects the ability to complete work or make progress with little interruption or delay. It is important to note that delays and interruptions can be caused either by systems or humans, and it is best to monitor both self-reported and observed measurements—for example, use self-reports of the ability to do uninterrupted work, as well as measure wait time in engineering systems.

Figure 3. Correlation between metrics. Metrics are ordered by similarity based on distance in the correlation matrix, except for manually fixing the aggregate productivity and acceptance rate at the end for visibility. [Heatmap of pairwise Spearman correlations (scale 0.00–1.00) among the usage measurements (for example, accepted_per_shown, unchanged_X_per_accepted, mostly_unchanged_X_per_accepted, shown_per_active_hour) and the survey measures (for example, tasks_faster, stay_in_flow, less_time_searching, aggregate_productivity).]
What Drives Perceived Productivity?
To examine the relationship between objective measurements of user behavior and self-reported perceptions of productivity, we used our set of core usage measurements (Table 2). We then calculated Pearson's R correlation coefficient and the corresponding p-value of the F-statistic between each pair of usage measurement and perceived productivity metric. We also computed a PLS regression from all usage measurements jointly.

We summarize these results in Figure 3, showing the correlation coefficients between all measures and survey questions. The full table of all results is included in Appendix B, available online.

We find that acceptance rate (accepted_per_shown) most positively predicts users' perception of productivity, although, given the confounding and human factors, there is still notable unexplained variance.

Of all usage measurements, acceptance rate correlates best with aggregate productivity (ρ = 0.24, P < 0.0001). This measurement is also the best performing for at least one survey question in each of the SPACE dimensions. This correlation is high confidence but leaves considerable unexplained variance. Later, we explore improvements from combining multiple usage measurements together.

Looking at the more detailed metrics around persistence, we see that correlation is generally better over shorter time periods than over longer periods. This is intuitive in the sense that shorter periods move the measure closer to acceptance rate. We also expect that at some point after accepting the completion it becomes simply part of the code, so any changes (or not) after that point will not be attributed to GitHub Copilot. All persistence measures were less well correlated than acceptance rate.

To assess the different metrics in a single model, we ran a regression using projection on latent structures (PLS). The choice of PLS, which captures the common variation of these variables as it is linearly connected to the aggregate productivity,28 is due to the high collinearity of the single metrics. The first component, to which every metric under consideration contributes positively, explains 43.2% of the variance. The second component captures the acceptance rate/change rate dichotomy; it explains a further 13.1%. Both draw most strongly from acceptance rate.

This strongly points to acceptance rate being the most immediate indicator of perceived productivity, although it is beneficial to combine it with others to get a fuller picture. A minimal code sketch of this analysis appears after the sidebar below.

Developer Productivity and the SPACE Framework
Developer productivity has been a controversial topic in software engineering research over the years. We point readers to excellent presentations of the existing discourse in the community in Meyer et al.12 and Murphy-Hill et al.;15 however, we summarize the key points of the discussion below:

˲ Inspired by economics definitions of productivity as output per unit of input, some research has defined developer productivity in the same terms—for example, numbers of lines of code per day, function points per sprint, and so on. However, such measures are not connected to goals (for instance, it is not the goal of a developer to write the most lines of code), they may motivate developers to game the system, they do not account for the quality of the output, and they are in tension with other metrics (for example, a higher number of commits or PRs will create a higher need for code reviews).

˲ Observational studies of developers reveal that developers spent more than half their working day on activities other than coding.13 Given this, the view of developer productivity as inputs and outputs, or using metrics that strictly focus on coding, ignores the reality of the work developers do.

˲ In addition, developers' perspective on what affects their productivity12 and what metrics might reflect it14 differs from the inputs/outputs view. When asked when they are productive and how they measure productivity, developers do not cite lines of code or function points per sprint, but rather completing tasks, being free of interruptions, usefulness of their work, success of the feature they worked on, and more.

˲ To sum up, after many studies and many definitions, measurements, and approaches to productivity, the empirical software engineering research community has concluded that developer productivity is a multidimensional topic that cannot be summarized by a single metric.10 Both objective and subjective approaches to measurement have been tried, leading to the conclusion that they both have advantages and disadvantages.
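The following sketch illustrates the kind of analysis described in "What Drives Perceived Productivity?": per-metric Pearson correlations and a two-component PLS regression. It uses scipy and scikit-learn on random stand-in data; the metric names and preprocessing are assumptions, not the study's actual pipeline.

```python
# Sketch: correlating each usage metric with aggregate productivity and
# fitting a PLS regression on all metrics jointly. Random stand-in
# data, not the study's dataset.
import numpy as np
from scipy.stats import pearsonr
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
metrics = ["acceptance_rate", "persistence_rate", "contribution_speed"]
n_devs = 500
X = rng.random((n_devs, len(metrics)))   # usage measurements per developer
y = 1 + 4 * rng.random(n_devs)           # aggregate productivity (1-5 scale)

for name, col in zip(metrics, X.T):
    r, p = pearsonr(col, y)
    print(f"{name}: r={r:.3f}, p={p:.4f}")

# Two PLS components, mirroring the first/second latent structures.
pls = PLSRegression(n_components=2).fit(X, y)
print("metric loadings on the latent structures:\n", pls.x_loadings_)
```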


Figure 4. Different metrics clustering in latent structures predicting perceived productivity. We color the following groups: flawless suggestions (counting the number of unchanged suggestions), persistence rate (ratio of accepted suggestions that are unchanged), and fuzzy persistence rate (accepted suggestions that are mostly unchanged). [Scatter plot of each metric's projection on the first and second latent structures; other metrics shown include acceptance rate, acceptance frequency, amount contribution (char), shown overall, and shown rate.]

Experience
To understand how different types of developers interact with Copilot, our survey asked respondents to self-report their level of experience in two ways:

˲ "Think of the language you have used the most with Copilot. How proficient are you in that language?" with options 'Beginner', 'Intermediate', and 'Advanced'.

˲ "Which best describes your programming experience?" with options starting with "Student" and ranging from "0–2 years" to "16+ years" of professional experience.

We compute correlations with productivity metrics for both experience variables and include these two variables as covariates in a multivariate regression analysis. We find that both are negatively correlated with our aggregate productivity measure (proficiency: ρ = −0.095, P = 0.0001; years of experience: ρ = −0.161, P < 0.0001). However, in multivariate regressions predicting productivity from usage metrics while controlling for demographics, proficiency had a non-significant positive effect (coeff = 0.021, P = 0.213), while years of experience had a non-significant negative effect (coeff = −0.032, P = 0.122).
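A minimal sketch of such a multivariate regression using the statsmodels formula API; the data is a random stand-in and the covariate encoding is our assumption, not the study's actual model.

```python
# Sketch: predicting aggregate productivity from acceptance rate while
# controlling for experience covariates. Random stand-in data, not the
# study's dataset; column names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2047
df = pd.DataFrame({
    "aggregate_productivity": 1 + 4 * rng.random(n),
    "acceptance_rate": 0.5 * rng.random(n),
    "proficiency": rng.integers(1, 4, n),        # 1=Beginner ... 3=Advanced
    "years_experience": rng.integers(0, 6, n),   # ordinal experience bucket
})

model = smf.ols(
    "aggregate_productivity ~ acceptance_rate + proficiency + years_experience",
    data=df,
).fit()
print(model.params)    # coefficients for each covariate
print(model.pvalues)   # corresponding p-values
```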
Looking further at individual measures of productivity (Table 3), we find that both language proficiency and years of experience negatively predict developers agreeing that Copilot helps them write better code. However, proficiency positively predicts developers agreeing that Copilot helps them stay in the flow, focus on more satisfying work, spend less effort on repetitive tasks, and perform repetitive tasks faster. Years of experience negatively predicts developers feeling less frustrated in coding sessions and performing repetitive tasks faster while using Copilot, but positively predicts developers making progress faster when working in an unfamiliar language.

Table 3. Effects of experience on facets of productivity where the result of a linear regression was a statistically significant covariate.

Proficiency, Better code: −0.061*
Proficiency, Stay in flow: 0.069*
Proficiency, Focus satisfying: 0.067*
Proficiency, Less effort repetitive: 0.072**
Proficiency, Repetitive faster: 0.055***
Years, Better code: −0.087*
Years, Less frustrated: −0.103**
Years, Repetitive faster: −0.054*
Years, Unfamiliar progress: 0.081*
(*: p < 0.05; **: p < 0.01; ***: p < 0.001.)

Table 4. Correlations of acceptance rate with aggregate productivity broken down by subgroup.

none: 0.135* (n = 344)
≤ 2 y: 0.178** (n = 451)
3–5 y: 0.255*** (n = 358)
6–10 y: 0.265*** (n = 251)
11–15 y: 0.171* (n = 162)
≥ 16 y: 0.153* (n = 214)
JavaScript: 0.227*** (n = 1184)
TypeScript: 0.165*** (n = 654)
Python: 0.172*** (n = 716)
other: 0.178*** (n = 1829)

Figure 5. Linear regressions between acceptance rate and aggregate productivity by subgroup defined through years of professional experience or programming language use. Dashed lines denote averages. The x-axis is clipped at (0, 0.5), and 95% of respondents fall into that range. [Two panels plotting aggregate productivity (roughly 3.50–4.25) against acceptance rate (0.0–0.5): one regression line per experience group (none, ≤2 y, 3–5 y, 6–10 y, 11–15 y, ≥16 y) and one per language (JavaScript, TypeScript, Python, other).]
These findings suggest that experienced developers who are already highly skilled are less likely to write better code with Copilot, but Copilot can assist their productivity in other ways, particularly when engaging with new areas and automating routine work.

Junior developers not only report higher productivity gains; they also tend to accept more suggestions. However, the connection observed in the section "What Drives Perceived Productivity?" is not solely due to differing experience levels. In fact, the connection persists in every single experience group, as shown in Figure 5.

Variation over Time
Its connection to perceived productivity motivates a closer look at the acceptance rate and what factors influence it. Acceptance rate typically increases across the board when the model or underlying prompt-crafting techniques are improved. But even if these conditions are held constant (the study period did not see changes to either), more fine-grained temporal patterns emerge.

For coherence of the cultural implications of time of day and weekdays, all data in this section was restricted to users from the U.S. (whether in the survey or not). We used the same time frame as for the investigation in the previous section. In the absence of more fine-grained geolocation, we used the same time zone to interpret timestamps and for day boundaries (PST), recognizing this will introduce some level of noise due to the inhomogeneity of U.S. time zones.

Nevertheless, we observe strong regular patterns in overall acceptance rate (Figure 6). These lead us to distinguish three different time regimes, all of which are statistically significantly distinct at p < 0.001% (using bootstrap resampling; a minimal sketch follows the list):

˲ The weekend: Saturdays and Sundays, where the average acceptance rate is comparatively high at 23.5%.

˲ Typical non-working hours during the week: evenings after 4:00 pm PST until mornings at 7:00 am PST, where the average acceptance rate is also rather high at 23%.

˲ Typical working hours during the week, from 7:00 am PST to 4:00 pm PST, where the average acceptance rate is much lower at 21.2%.
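To make the preceding statistical claim concrete, here is a minimal sketch of one possible bootstrap comparison of two regimes' average acceptance rates. The data is a random stand-in and the resampling scheme is a generic illustration, not the study's exact procedure.

```python
# Sketch: bootstrap test of whether average acceptance rate differs
# between two time regimes. Random stand-in data and a generic
# resampling scheme, not the study's actual measurements or method.
import numpy as np

rng = np.random.default_rng(2)
weekend = rng.binomial(1, 0.235, 5_000)    # 1 = suggestion accepted
working = rng.binomial(1, 0.212, 20_000)

diffs = []
for _ in range(10_000):
    # Resample each regime with replacement and record the difference.
    w = rng.choice(weekend, size=len(weekend), replace=True)
    k = rng.choice(working, size=len(working), replace=True)
    diffs.append(w.mean() - k.mean())

diffs = np.array(diffs)
# One-sided p-value: fraction of resampled differences at or below zero.
p = np.mean(diffs <= 0)
print(f"mean diff={diffs.mean():.4f}, p={p:.5f}")
```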


Conclusions
When we set out to connect the productivity benefit of GitHub Copilot to usage measurements from developer activity, we collected measurements about acceptance of completions in line with prior work, but also developed persistence metrics, which arguably capture sustained and direct impact on the resulting code. We were surprised to find acceptance rate (the number of acceptances normalized by the number of shown completions) to be better correlated with reported productivity than our measures of persistence.

In hindsight, this makes sense. Coding is not typing, and GitHub Copilot's central value lies not in being the way users enter most of their code. Instead, it lies in helping users to make the best progress toward their goals. A suggestion that serves as a useful template to tinker with may be as good or better than a perfectly correct (but obvious) line of code that only saves the user a few keystrokes.

This suggests that a narrow focus on the correctness of suggestions would not tell the whole story for these kinds of tooling. Instead, one could view code suggestions inside an IDE to be more akin to a conversation. While chatbots such as ChatGPT are already used for programming tasks, they are explicitly structured as conversations. Here, we hypothesize that interactions with Copilot, which is not a chatbot, share many characteristics with natural-language conversations.

We see anecdotal evidence of this in comments posted about GitHub Copilot online (see Appendix E for examples), in which users talk about sequences of interactions. A conversation turn in this context consists of the prompt in the completion request and the reply as the completion itself. The developer's response to the completion arises from the subsequent changes incorporated in the next prompt to the model. There are clear parallels to factors such as specificity and repetition that have been identified to affect human judgements of conversation quality.18 Researchers have already investigated the benefits of natural-language feedback to guide program synthesis,2 so the conversational framing of coding completions is not a radical proposal. But neither is it one we have seen followed yet.

Figure 6. Average acceptance rate during the week. Each point represents the average for a one-hour period, whereas the shaded ribbon shows the min-max variation during the observed four-week period. [Line chart of daily and weekly patterns in acceptance rate in the U.S. (all users between 2022-01-15 and 2022-02-12), ranging roughly from 20% to 26% by weekday and time (PST), with off-hours, weekend, and working-hours regimes marked.]

References
1. Amann, S., Proksch, S., Nadi, S., and Mezini, M. A study of Visual Studio usage in practice. In IEEE 23rd Intern. Conf. on Software Analysis, Evolution, and Reengineering 1. IEEE Computer Society, (March 2016), 124–134; 10.1109/SANER.2016.39
2. Austin, J. et al. Program synthesis with large language models. CoRR abs/2108.07732 (2021); https://arxiv.org/abs/2108.07732
3. Ari Aye, G., Kim, S., and Li, H. Learning autocompletion from real-world datasets. In Proceedings of the 43rd IEEE/ACM Intern. Conf. on Software Engineering: Software Engineering in Practice, (May 2021), 131–139; 10.1109/ICSE-SEIP52600.2021.00022
4. Beller, M., Orgovan, V., Buja, S., and Zimmermann, T. Mind the gap: On the relationship between automatically measured and self-reported productivity. IEEE Software 38, 5 (2020), 24–31.
5. Chen, M. et al. Evaluating large language models trained on code. CoRR abs/2107.03374 (2021); https://arxiv.org/abs/2107.03374
6. Forsgren, N. et al. The SPACE of developer productivity: There's more to it than you think. Queue 19, 1 (2021), 20–48.
7. Hellendoorn, V.J., Proksch, S., Gall, H.C., and Bacchelli, A. When code completion fails: A case study on real-world completions. In Proceedings of the 41st Intern. Conf. on Software Engineering, J.M. Atlee, T. Bultan, and J. Whittle (eds). IEEE/ACM, (May 2019), 960–970; 10.1109/ICSE.2019.00101
8. Hendrycks, D. et al. Measuring coding challenge competence with APPS. CoRR abs/2105.09938 (2021); https://arxiv.org/abs/2105.09938
9. Hindle, A. et al. On the naturalness of software. In 34th Intern. Conf. on Software Engineering, M. Glinz, G.C. Murphy, and M. Pezzè (eds). IEEE Computer Society, (June 2012), 837–847; 10.1109/ICSE.2012.6227135
10. Jaspan, C. and Sadowski, C. No single metric captures productivity. Rethinking Productivity in Software Engineering, (2019), 13–20.
11. Kulal, S. et al. SPoC: Search-based pseudocode to code. In Proceedings of Advances in Neural Information Processing Systems 32, H.M. Wallach et al (eds), (Dec. 2019), 11883–11894; https://bit.ly/3H7YLtF
12. Meyer, A.N., Barr, E.T., Bird, C., and Zimmermann, T. Today was a good day: The daily life of software developers. IEEE Transactions on Software Engineering 47, 5 (2019), 863–880.
13. Meyer, A.N. et al. The work life of developers: Activities, switches and perceived productivity. IEEE Transactions on Software Engineering 43, 12 (2017), 1178–1193.
14. Meyer, A.N., Fritz, T., Murphy, G.C., and Zimmermann, T. Software developers' perceptions of productivity. In Proceedings of the 22nd ACM SIGSOFT Intern. Symp. on Foundations of Software Engineering (2014), 19–29.
15. Murphy-Hill, E. et al. What predicts software developers' productivity? IEEE Transactions on Software Engineering 47, 3 (2019), 582–594.
16. Peng, S., Kalliamvakou, E., Cihon, P., and Demirer, M. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv:2302.06590 [cs.SE] (2023).
17. Ramírez, Y.W. and Nembhard, D.A. Measuring knowledge worker productivity: A taxonomy. J. of Intellectual Capital 5, 4 (2004), 602–628.
18. See, A., Roller, S., Kiela, D., and Weston, J. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conf. of the North American Chapter of the Assoc. for Computational Linguistics: Human Language Technologies 1, J. Burstein, C. Doran, and T. Solorio (eds). Assoc. for Computational Linguistics, (June 2019), 1702–1723; 10.18653/v1/n19-1170
19. Storey, M. et al. Towards a theory of software developer job satisfaction and perceived productivity. IEEE Transactions on Software Engineering 47, 10 (2019), 2125–2142.
20. Svyatkovskiy, A., Deng, S.K., Fu, S., and Sundaresan, N. IntelliCode Compose: Code generation using transformer. In Proceedings of the 28th ACM Joint European Software Eng. Conf. and Symp. on the Foundations of Software Eng., P. Devanbu, M.B. Cohen, and T. Zimmermann (eds). ACM, (Nov. 2020), 1433–1443; 10.1145/3368089.3417058
21. Svyatkovskiy, A. et al. Fast and memory-efficient neural code completion. In Proceedings of the 18th IEEE/ACM Intern. Conf. on Mining Software Repositories, (May 2021), 329–340; 10.1109/MSR52588.2021.00045
22. Vaithilingam, P., Zhang, T., and Glassman, E. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Proceedings of the 2022 Conf. on Human Factors in Computing Systems.
23. Vaithilingam, P., Zhang, T., and Glassman, E.L. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Proceedings of the CHI Conf. on Human Factors in Computing Systems, Association for Computing Machinery, Article 332 (2022), 7; 10.1145/3491101.3519665
24. Wagner, S. and Ruhe, M. A systematic review of productivity factors in software development. arXiv preprint arXiv:1801.06475 (2018).
25. Wang, D. et al. From human-human collaboration to human-AI collaboration: Designing AI systems that can work together with people. In Proceedings of the 2020 CHI Conf. on Human Factors in Computing Systems (2020), 1–6.
26. Weisz, J.D. et al. Perfection not required? Human-AI partnerships in code translation. In Proceedings of the 26th Intern. Conf. on Intelligent User Interfaces, T. Hammond et al (eds). ACM, (April 2021), 402–412; 10.1145/3397481.3450656
27. Winters, T., Manshreck, T., and Wright, H. Software Engineering at Google: Lessons Learned from Programming Over Time. O'Reilly Media (2020).
28. Wold, S., Sjöström, M., and Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58, 2 (2001), 109–130; 10.1016/S0169-7439(01)00155-1
29. Zhou, W., Kim, S., Murali, V., and Ari Aye, G. Improving code autocompletion with transfer learning. CoRR abs/2105.05991 (2021); https://arxiv.org/abs/2105.05991

Albert Ziegler (wunderalbert@github.com) is a principal researcher at GitHub, Inc., San Francisco, CA, USA.
Eirini Kalliamvakou is a staff researcher at GitHub, Inc., San Francisco, CA, USA.
X. Alice Li is a staff researcher for Machine Learning at GitHub, San Francisco, CA, USA.
Andrew Rice is a principal researcher at GitHub, Inc., San Francisco, CA, USA.
Devon Rifkin is a principal research engineer at GitHub, Inc., San Francisco, CA, USA.
Shawn Simister is a staff software engineer at GitHub, Inc., San Francisco, CA, USA.
Ganesh Sittampalam is a principal software engineer at GitHub, Inc., San Francisco, CA, USA.
Edward Aftandilian is a principal researcher at GitHub, Inc., San Francisco, CA, USA.

This work is licensed under a Creative Commons Attribution 4.0 license: http://creativecommons.org/licenses/by/4.0/

Watch the authors discuss this work in the exclusive Communications video: https://cacm.acm.org/videos/measuring-github-copilot
