CodeCompose - A Large-Scale Industrial Deployment of AI-assisted Code Authoring
Figure 2 (final step shown): the constructed input is a sequence of tokens: metadata IDs, code-before IDs, <mask> ID, code-after IDs, <mask> ID, target IDs.

• CM implements the masking after the text has been tokenized into token IDs, which limits the model during training to only seeing mask spans with edges at common tokenizer tokens. LCM lifts the masking step to the language level and avoids this, similar to the fill-in-the-middle (FIM) task [11]. Also, LCM only masks at certain trigger characters – that is, characters where the model will be queried during inference, such as (, ., =, etc.
• We prefix certain metadata to the input in LCM, such as the programming language, the full path to the file, and the kernel name for notebooks.
• Through model-level ablations (not in the scope of this paper), we found an optimal 70-30 split of the model's input length between code before and code after the cursor.
• Specialized for our use case, LCM has only one mask in any input.

A step-by-step overview of constructing an input in LCM is shown in Figure 2, along with an example code snippet. Once an input is constructed, during training, we maximize the log probability of the language-masked input:

log P([Metadata; Before; <mask>; After; <mask>; Target])

where Metadata, Before, and After are the tokens in the metadata, the code before, and the code after the cursor, respectively; Target is the code that was masked; and <mask> is a special token. During inference, we sample tokens in an auto-regressive manner from the distribution:

P(· | [Metadata; Before; <mask>; After; <mask>])

As we are suggesting single lines of code, we stop the generation early once a newline token has been generated. Due to the real-time nature of our application and the inline suggestion user experience (UX), we only return one sequence of generated tokens.

3.2 Training data

For training on our internal data, we collected first-party data from Meta's code repositories and notebooks, applying several filters:

• Rather than crawling the entire repository, we used code that is modified through diffs checked in by developers, as a way of staying close to our end application (i.e., writing code in the IDE). This way we avoid training on code that may have been added a long time ago but is never modified.
• To keep the training data fresh, we only looked at diffs going back up to 2 years, and only kept the latest versions of files to avoid bugs that may have been patched.
• For each major target language, we worked with the respective language services team at Meta to exclude code with old or obsolete patterns, code that is not in production, etc.

After these filters, our first-party training data included on the order of tens of millions of files, amounting to a few billion lines of code across 10+ languages.

3.3 Training details

We took InCoder-1.3B, the public model with 1.3 billion parameters, and fine-tuned it further on the above first-party data with our LCM objective. A 6.7 billion parameter model called InCoder-6.7B is also available, but we found that for suggesting single lines of code it
Table 1: Accuracy metrics across programming languages for the public model and the fine-tuned model
added too much latency for incremental gain in accuracy. For fine-tuning the 1.3B model, we used a batch size of 20 per device (2.5k effective) and a learning rate of 5e-4. Training for 4 epochs with sharded data parallelism took 4 days on a cluster of 128 A100 GPUs. We then deployed the model on a cluster of 150 A100 GPUs.

3.4 Effect of fine-tuning on prediction accuracy

We conduct offline evaluations to measure the lift that fine-tuning with first-party data provides on model accuracy. The offline evaluation measures two metrics: Exact Match (EM) and BLEU score [24].

We collect 20K source code files randomly sampled from Meta's source code repositories. We collect files that belong to four major programming languages used at Meta: Hack, C++, Python, and Flow (JavaScript). To simulate code being written in the IDE, we randomly mask, in each file, a part of the code snippet that terminates with an end-of-line character. The prediction task is to predict the masked code snippet when the code before and the code after are passed to the model as context. In addition to this context, we pass metadata such as the file name, the full path, and the language.

As shown in Table 1, fine-tuning the CodeCompose model on first-party source code with the CM objective allowed the model to learn internal code styles, formatting, libraries, etc., which led to a significant improvement in the EM and BLEU metrics. The improvements for languages like Hack, which are heavily customized for use at Meta with minimal public presence, are evident, with a 23.5% improvement in exact match and 24.97% in BLEU score. In contrast, the improvements for C++ are minimal because of the resemblance between the C++ used at Meta and the public version. Moreover, when fine-tuned with the customized LCM objective that is closer to the final application, the model performance improves further, even for C++.

This experiment provides strong evidence that an internally fine-tuned model outperforms an off-the-shelf model trained on external data only (when tested on an organization-specific benchmark for code completion), and is more suitable to serve the needs of the developer community at the given enterprise.

4 SYSTEM DESIGN

CodeCompose employs a typical client-server architecture in which the server is an inference tier that runs the model and the client is an editor that surfaces code suggestions. We encode the bulk of the client-side logic in a Language Server Protocol (LSP) conformant language server that is reused across multiple editor integrations.

4.1 Server

At Meta, we have a tier of machines equipped with A100 GPUs, each with sufficient memory to host the fine-tuned CodeCompose model. Clients can make requests to this tier via Thrift. The caller specifies the code before and after the cursor, as well as the file path, language, and metadata to use to process the request. The request handler is designed to be simple, in order to get the most out of the GPUs. It simply processes the request data into a string, tokenizes it, and invokes the model's auto-regressive generation function that runs on the GPU. Since we are optimizing for latency rather than throughput, each request is processed as soon as it arrives, and so there is no inference batching. Due to latency considerations, when the input length is less than 1500 tokens, we can afford to perform a beam search with a beam width of 3 and return the most likely suggestion. Otherwise, we perform greedy generation.

4.2 LSP

To mediate requests between the client and server, we implemented a language server in Rust that we reuse across our various editor integrations. While most language servers are designed to support a wide array of traditional IDE functionality (autocomplete, jump-to-definition, etc.), the CodeCompose language server supports only one meaningful request type: textDocument/inlineCompletions. The LSP also implements core request logic such as debouncing and caching. Debouncing waits for a 20ms pause before sending a request, to avoid too many requests being sent as the user is typing in the editor. Caching saves the response from the server with the context as the key, so that duplicate requests – which happen frequently as developers type and erase code – can be avoided. Moreover, requests that hit the LSP cache are served significantly faster than those that go to the service.

4.3 Clients

Clients are extensions that run locally on developers' machines and are responsible for ultimately displaying inline suggestions in the editor. For editors such as VS Code that support LSP natively, we require relatively little glue code to create the CodeCompose extension. For editors such as Android Studio that do not have native LSP support, we built a small adapter to proxy requests to our LSP.

Furthermore, this architecture makes it straightforward to integrate CodeCompose with our in-house developer tools, such as the web-based Bento notebook [2] surface created internally at Meta. Because the notebook surface also supports LSP, it was easy to provide our data scientists and ML engineers with the same AI-powered code suggestions as developers working in VS Code. This ability to plug CodeCompose into any code editing surface internally is an advantage of owning the entire stack.
Figure 3: System Architecture for CodeCompose
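Putting Sections 3.1 and 4.1 together, the life of a request through this architecture can be sketched as follows. The prompt layout with <mask> tokens, the early stop at a newline, and the single returned suggestion follow the text; `build_prompt`, `generate_line`, and the toy `model_step` callable are hypothetical names standing in for the real tokenizer and the GPU-hosted model.

```python
MASK = "<mask>"

def build_prompt(metadata, before, after):
    """Assemble an LCM-style input: metadata, code before the cursor,
    a mask, code after the cursor, and a final mask that the model
    continues from (layout as described in Section 3.1)."""
    return f"{metadata}{before}{MASK}{after}{MASK}"

def generate_line(model_step, prompt, max_tokens=64):
    """Auto-regressive generation that stops early at the first
    newline, since CodeCompose suggests single lines of code.
    `model_step` stands in for one decoding step of the model."""
    out = []
    for _ in range(max_tokens):
        token = model_step(prompt + "".join(out))
        if token == "\n":
            break
        out.append(token)
    return "".join(out)  # exactly one suggestion is returned
```

In the real service the string is tokenized before generation, and inputs under 1500 tokens additionally get a beam search of width 3; this sketch only shows the greedy, stop-at-newline path.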
Note that Table 2 might indicate that Python has more users than languages like Hack or JavaScript, but that is simply an artifact of our incremental rollout strategy, which opened CodeCompose first to Python, as mentioned in Section 2.1.

5.3 Qualitative feedback

In addition to collecting metrics such as acceptance rate and the percentage of code typed using CodeCompose, we collect qualitative feedback from CodeCompose users through a forum. We created a Workplace group [9], where CodeCompose users can post any feedback they may have. Everyone in the company can see the feedback, react to it, and have a follow-up discussion. Though it is a public group, we do not expect CodeCompose users to suffer from response bias [14], given that Meta is a large organization with an open culture. Also, the developers who provided feedback sit far away (organizationally) from the developers of CodeCompose. Therefore, developers tend to share their feedback about tools and services openly, without hesitation.

We received 70 responses over a 15-day time period, on which we performed qualitative data analysis [10]. We performed coding [32] manually, with two members of the team annotating the responses received from CodeCompose users independently. Coding is the process of labeling and organizing qualitative data to identify different themes and the relationships between them.

The labeling process involved three steps: 1. the team got together and came up with five categories (labels) to represent the qualitative responses (bootstrapping); we identified some categories that are intuitive and refined them as we sifted through the verbatim responses to arrive at a final set of categories; 2. two members of the team performed labeling independently; 3. the team got together again to reconcile their understanding and made modifications to the labels as necessary.

We list the categories (labels) and their descriptions in Table 3. We have two primary categories of user feedback: CSAT (favorable) and DSAT (unfavorable). CSAT is further expanded into five sub-categories: Accelerate coding, Discovery, Documentation, Suggestion accuracy, and Generally favorable. The intent behind the categorization is to understand what makes developers think CodeCompose is useful and the use cases for which they use it.

Table 3 lists the distribution of qualitative feedback responses. 91.5% of the CodeCompose users gave a favorable response, while 8.5% of them gave an unfavorable response. Many respondents (15.7%) appreciated the fact that CodeCompose suggestions are accurate and that CodeCompose was able to auto-complete the line[s] they were going to type.

In addition to the accuracy of suggestions, 23% of the users said CodeCompose was helpful in discovering new APIs and automating tedious tasks such as boilerplate code. Therefore, CodeCompose is not just helping with typing code faster (as demonstrated in Table 2) but also saving the time and effort people spend searching for APIs and documentation.

CodeCompose is also found to help people accelerate their coding: 20% of the users said they found CodeCompose to be accelerating their coding activity. In addition to helping write code faster, CodeCompose seems to be helping developers produce more in-code documentation and API documentation. This is an interesting side effect and demonstrates the model's ability to generate text as an output (from code) in addition to code generation.

5.4 Representative quotes

To offer an impression, we list some quotes (positive and negative) that we received from CodeCompose users. The quotes are paraphrased for presentation purposes, and are generalized to remove references to internal artifacts.

A developer who got access to CodeCompose shared their first impressions:

"Today I needed to write a quick tool for [task]. I opened VS Code and started writing [code]. Suddenly this thing called CodeCompose popped up and wrote all the code for me! This was a super delightful experience, it really made my day, so thank you!"

In this case, the developer was not informed about the existence of CodeCompose; they happened to be picked by our randomized rollout algorithm. This quote summarizes the delightful coding experience that developers at Meta had when pair-programming with CodeCompose.

Several remarks are clear indicators of the usefulness of employing CodeCompose in code authoring workflows.
Table 3: Distribution of qualitative feedback responses
"I have been a big fan of [CodeCompose] ever since I got access. I do not think I have been in any coding session since then where I did not use it."

"The suggestions [from CodeCompose] are quite relevant and do not show up too often – not necessarily “rarely”, but not obnoxiously often either. It is a nice balance."

"I was blown away that after writing the following [code], CodeCompose successfully predicted the correct completion. This is amazing because I could not remember the name of the global variable or where it was in the code, but CodeCompose knew what it was."

Many developers highlighted the fact that the predictions are relevant. We also received feedback about how nicely CodeCompose navigates the precision-versus-recall problem by not showing suggestions too often.

There were also several pieces of feedback that highlighted CodeCompose's ability to help developers discover new APIs or ramp them up quickly on unfamiliar APIs.

"I wanted to post about a great experience I had in using CodeCompose in the [internal] codebase. I have been getting back into coding and was rusty on the different arguments that exist in the [library] operators. Some operators can have up to [hundreds of] arguments so this is a challenging problem. CodeCompose has helped me several times in a day by giving me the right argument that I cared about, whereas my previous workflow would have been to [search] for the operator and spend several minutes on each."

"CodeCompose has been a game changer for working with APIs that I have not used before. Even if it is not exactly what I need, it often shows me approximately what I should pass or how I should call it. This is much faster than looking for other call sites to figure out how to use this [API]."

We also observed interesting side effects, such as increased in-code documentation when developers use CodeCompose to make code changes. As CodeCompose provides accurate descriptions of code changes and a template to start with, developers tend to accept the suggestions, make changes (as necessary), and push the documentation changes (along with the source code changes) in their diffs. This is reflected in some of the anecdotes listed below.

"I find CodeCompose particularly useful when writing docstrings. Without CodeCompose, I would not even imagine that I can write [so many] lines of docstring for my actual code."

"I really like how CodeCompose highlights the value of naming and the quality of documentation, which used to be mainly for long-term benefit, but now good naming and [documentation] gives you an immediate effectiveness boost."

In addition to the positive remarks, a few developers requested to disable CodeCompose, as they found it to be noisy and intrusive. We list a few such anecdotes below.

"CodeCompose is good, however, it seems to reduce my coding speed. The reason is that the traditional auto-complete was typically suggesting [real] functions, but CodeCompose is suggesting hallucinations. These are not always correct and I would like a UX design where I could switch between traditional auto-complete and CodeCompose."

"I have realized that [CodeCompose] tends to struggle while working with function signatures. The suggestions it offers do not match the actual signature, although they make sense."
"The [code editing] experience is sometimes good and sometimes bad. Auto-complete struggles to work together with CodeCompose – they seem to be competing with each other."

This highlights the importance of investing in UX research to deliver the best experience for AI-assisted code authoring while coexisting with traditional auto-complete systems. The feedback also speaks to the problems faced by LLMs with respect to hallucinations and grounding [17, 19]. We are actively exploring this area to make CodeCompose a productive experience for all developers.

5.5 Factors Affecting CodeCompose Favorability

After analyzing all the qualitative feedback, we identified the following main factors that make a developer inclined towards finding a system like CodeCompose useful.

Favorable scenarios for CodeCompose: The scenarios in which CodeCompose was able to add the biggest value include (but are not limited to) auto-completing lines of code, API discovery, boilerplate coding, suggesting standard libraries, and generating in-code documentation.

When it comes to developers, the ones who benefited the most are those who author code that involves writing boilerplate code and idioms, those who follow the typical coding patterns employed at Meta, and those who employ standard first-party (or third-party) libraries to accomplish their coding tasks. Additionally, CodeCompose is well received by developers who tend to work on building pipelines and common infrastructure tasks that involve writing heavily templatized code. This can be corroborated with the distribution of responses listed in Table 3 and the representative quotes listed in Section 5.4.

Unfavorable scenarios for CodeCompose: After analyzing the qualitative feedback, we found that CodeCompose is not helpful in scenarios where developers use specialized APIs and libraries. CodeCompose seems to struggle to suggest the correct code when developers employ atypical coding patterns. Some developers found the traditional, semantic auto-complete functionality to be more useful in those cases. Developers also complained about the possibility of CodeCompose hallucinating [17, 21] when recommending uncommon API names, URLs, etc.

Developers also found the coexistence of CodeCompose with traditional auto-complete to be vital. Sometimes, when these two systems compete to show suggestions, it creates a disruptive experience for developers. Also, commonly used keyboard shortcuts such as “Tab” are overloaded to accept both CodeCompose suggestions and traditional auto-complete suggestions. Solving these UX problems is vital to facilitating a smoother integration with existing semantic engines, which we discussed in Section 2.2.

6 THREATS TO VALIDITY

In this section, we list certain threats to the validity of our evaluation in Section 5.

6.1 Generalizability

CodeCompose is an entirely internal tool that has only been deployed at Meta. Therefore, there is a threat in drawing any general conclusions from the quantitative metrics or qualitative feedback presented. It is possible that the results might not hold elsewhere, and for this reason, we cannot assume a priori that other AI-based coding assistants built outside Meta will perform similarly. Similarly, the value-add of fine-tuning the LLM on Meta's internal code (Table 1) may also not translate externally. We already saw that the lift offered by fine-tuning varied between languages.

6.2 Productivity Impact

The scope of this paper is only to present our results from building and deploying CodeCompose at scale, and its usage. It is important to note that the metrics used in this paper need not correlate with the productivity of developers who use CodeCompose. We do not make any claims about the impact of CodeCompose on the productivity of developers at Meta. For example, simply because a developer accepts CodeCompose suggestions often, or is favorable to it, does not necessarily make them productive – that has to be measured separately, with statistical significance.

6.3 Internal Validity

While we have taken significant measures to reduce bias in the results presented (e.g., by waiting several weeks after the randomized rollout), we cannot guarantee that there is absolutely no bias. The time period chosen for measuring the metrics in Table 2 could have had external events beyond our control that influenced developers' coding behaviors during that time. Some languages, such as TypeScript, have only a small number of developers, who might have had an outsized influence on the metrics for those languages.

Qualitative feedback, in general, comes with a degree of subjectivity. While Meta encourages a culture of open feedback, developers might not have been comfortable sharing negative assessments with colleagues. We facilitated easy opt-out, but less than 1% opted out, indicating that a majority of developers found CodeCompose to be a net positive experience.

The categorization of the qualitative feedback into different categories was done manually. We tried to reduce subjectivity in this process by having two authors on the team categorize the responses and then reconcile the different labels. Nevertheless, a different set of people might have produced a different categorization in Table 3.

7 RELATED WORK

In software engineering research, while there is a large body of work on code completion [13, 20, 26, 27, 33], there has been limited deployment of these techniques in large industrial environments [3–5, 7]. With the advent of generative AI, advances in the code assistance area now come in weeks rather than months. Several of these results are circulated as blog posts rather than traditional papers. We use a mix of these results to set the context for historical work at Google, Amazon, Microsoft, and GitHub.

The closest in spirit to our work is the work at Google [5]. Google [5], deploying hybrid semantic ML code completion to 10k+ Google employees (over three months across eight programming languages), observed a 6% reduction in coding iteration time (time between builds and tests) for developers exposed to single-line ML completion, compared to a control group. These results demonstrate the impact on developer productivity; at the time of the article, 3% of new
code (measured in characters) was generated from accepting ML completion suggestions.

Pythia [29], deployed with IntelliCode [4] by Microsoft, is built using DL models trained on code contexts extracted from abstract syntax trees. Its best matching code completions are served on the order of 100 ms. The offline evaluation of Pythia on 2700 Python open source software GitHub repositories showed a top-5 accuracy of 92%, surpassing the baseline models by 20% averaged over classes, for both intra- and cross-project settings [29]. IntelliCode Compose [28] is a general-purpose multilingual code completion tool that is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code [28]. It is built on a state-of-the-art generative transformer model trained on 1.2 billion lines of source code in the Python, C#, JavaScript, and TypeScript programming languages [28]. Its best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for the Python programming language [28].

There have been several empirical evaluations of GitHub's Copilot [3] in actual use for automatic code completion. Nguyen et al. [22] used 33 LeetCode questions to create queries for Copilot in four different programming languages. They found that Copilot's Java suggestions have the highest correctness score (57%) while JavaScript has the lowest (27%) [22]. Overall, Copilot's suggestions had low complexity, with no notable differences between the programming languages. They also observed some potential Copilot shortcomings, such as generating code that can be further simplified and code that relies on undefined helper methods [22]. Not all usage has been positive. A user study [30] with 24 participants to understand how programmers use and perceive Copilot found that while Copilot did not necessarily improve task completion time or success rate, most participants preferred to use Copilot in daily programming tasks, since Copilot often provided a useful starting point and saved the effort of searching online [30]. However, participants did face difficulties in understanding, editing, and debugging code snippets generated by Copilot, which significantly hindered their task-solving effectiveness [30]. An empirical study [25] presented a controlled experiment with GitHub Copilot in which recruited software developers were asked to implement an HTTP server in JavaScript as quickly as possible. The treatment group, with access to Copilot, completed the task 55.8% faster than the control group.

Amazon's CodeWhisperer [7] is a fully functional code completion tool integrated into the IDE. Its analysis [14] found that language models can generate code with correct syntax and pass unit tests in programming languages they are not intentionally trained on. Beyond the companies discussed above (Google, Microsoft, GitHub, and Amazon), several other generative AI companies offer commercial code completion tools. The goal here is to provide a sample of the related work rather than undertaking a systematic literature survey of the state of the art. The above works summarize the related work in this area at this point in time, in this rapidly changing field.

8 CONCLUSION

In this paper, we introduced an AI-based coding assistant system named CodeCompose, discussed how we scaled it to 16k developers across 9 programming languages at Meta, and presented metrics and feedback to understand the impact of the system on code authoring.

We first looked at some unique challenges in building such coding assistants in large organizations. We shared our experience in building trust, designing the user experience, and measuring the right set of metrics to evaluate the system.

We presented details about the underlying InCoder-based LLM that powers CodeCompose. We introduced a custom training objective, Language Causal Masking, that suits our application of suggesting individual lines of code. In doing so, we conducted an offline evaluation that showed a 23-25% improvement brought about by fine-tuning the LLM on Meta's internal code.

We then presented the system architecture of CodeCompose. We built CodeCompose using three primary components: the server, the Language Server Protocol (LSP) server, and the client. When a developer types code, the service receives a request to perform inference and generates a sequence of tokens to auto-complete the statement(s). The LSP server and the client help orchestrate the inference calls and implement the logic to display the suggestion.

We presented quantitative metrics that show the scale and impact of CodeCompose. In a 15-day time period, 4.5 million suggestions were shown to developers, 22% of which were accepted. This resulted in 8% of the code typed by users of CodeCompose coming from suggestions offered by CodeCompose.

Finally, we presented qualitative feedback from developers. We found that an overwhelming 91.5% of the feedback shared by developers was favorable towards CodeCompose, indicating that it adds significant value to their coding experience. Expanding on the feedback, we found CodeCompose to have desirable side effects, such as helping developers discover unfamiliar APIs and write documentation.

In the future, we plan to enable more features, such as block completions; more interaction modalities, such as a conversational modality within the IDE; and functionality to explain source code and provide code walk-throughs. We are exploring opportunities to leverage semantic information to perform pre-processing and post-processing to improve suggestion accuracy by reducing hallucinations. Furthermore, we see opportunities to expand this technology beyond code authoring, to help developers across the Software Development Life Cycle (SDLC).

9 ACKNOWLEDGEMENTS

We want to thank the following people for their help, support, and consultation while envisioning, building, and deploying CodeCompose at Meta: Gabriel Synnaeve, Baptiste Rozière, Chris Lengerich, Omer Dunay, Marshall Roch, Karim Nakad, Peter Monaco, Kelly Hirano, Killian Murphy, Kristian Kristensen, Wes Dyer, Adam Tait, Arun Ganesan, Andy Chiu, and everyone on the CodeCompose team.

REFERENCES
[1] Github 2022. How GitHub Copilot helps improve developer productivity. Github. https://github.blog/2022-07-14-research-how-github-copilot-helps-improve-developer-productivity/
[2] Meta Accessed 2021. ELI5: Bento - Interactive Notebook that Empowers Development Collaboration and Best Practices. Meta. https://developers.facebook.com/blog/post/2021/09/20/eli5-bento-interactive-notebook-empowers-development-collaboration-best-practices/
[3] Github Accessed 2021. Github Copilot. Github. https://github.com/features/copilot
[4] Microsoft Accessed 2021. Microsoft Intellicode. Microsoft. https://visualstudio.microsoft.com/services/intellicode
[5] Google Accessed 2021. ML Enhanced Code Completion. Google. https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html
[6] Accessed 2022. ML-Enhanced Code Completion Improves Developer Productivity. https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html
[7] Amazon Accessed 2023. Amazon CodeWhisperer. Amazon. https://aws.amazon.com/codewhisperer
[8] Accessed 2023. Flow Javascript. https://engineering.fb.com/2014/11/18/web/flow-a-new-static-type-checker-for-javascript
[9] Accessed 2023. Workplace Group. https://www.workplace.com/features/groups
[10] Carl Auerbach and Louise B Silverstein. 2003. Qualitative data: An introduction to coding and analysis. Vol. 21. NYU Press.
[11] Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. 2022. Efficient Training of Language Models to Fill in the Middle. arXiv:2207.14255 [cs.CL]
[12] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[13] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from examples to improve code completion systems. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering (ESEC/FSE '09).
[14] Nicola Dell, Vidya Vaidyanathan, Indrani Medhi, Edward Cutrell, and William Thies. 2012. "Yours is Better!": Participant Response Bias in HCI. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Austin, Texas) (CHI '12).
[19] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55, 12, Article 248 (mar 2023), 38 pages. https://doi.org/10.1145/3571730
[20] Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2021. Code Prediction by Feeding Trees to Transformers. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). 150–162. https://doi.org/10.1109/ICSE43902.2021.00026
[21] Zihao Li. 2023. The Dark Side of ChatGPT: Legal and Ethical Challenges from Stochastic Parrots and Hallucination. arXiv:2304.14347 [cs.CY]
[22] Nhan Nguyen and Sarah Nadi. 2022. An empirical evaluation of GitHub copilot's code suggestions. In Proceedings of the 19th International Conference on Mining Software Repositories (MSR '22).
[23] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv:2203.13474 [cs.LG]
[24] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (Philadelphia, Pennsylvania) (ACL '02). Association for Computational Linguistics, USA, 311–318. https://doi.org/10.3115/1073083.1073135
[25] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv. https://arxiv.org/abs/2302.06590
[26] Sebastian Proksch, Johannes Lerch, and Mira Mezini. 2015. Intelligent Code Completion with Bayesian Networks. ACM Transactions on Software Engineering and Methodology (TOSEM) 25, 1, Article 3 (12 2015).
[27] R. Robles and M. Lanza. 2008. How Program History Can Improve Code Completion. In 2008 23rd IEEE/ACM International Conference on Automated Software Engineering. 317–326. https://doi.org/10.1109/ASE.2008.42
[28] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode compose: code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020).
USA) (CHI ’12). Association for Computing Machinery, New York, NY, USA, [29] Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. 2019. Pythia:
1321–1330. https://doi.org/10.1145/2207676.2208589 AI-assisted Code Completion System. In Proceedings of the 25th ACM SIGKDD
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: International Conference on Knowledge Discovery & Data Mining (KDD ’19).
Pre-training of Deep Bidirectional Transformers for Language Understanding. [30] Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. 2022. Expectation vs.
arXiv:1810.04805 [cs.CL] Experience: Evaluating the Usability of Code Generation Tools Powered by Large
[16] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Language Models. In Extended Abstracts of the 2022 CHI Conference on Human
Ruiqi Zhong, Wen tau Yih, Luke Zettlemoyer, and Mike Lewis. 2023. InCoder: A Factors in Computing Systems (CHI EA ’22).
Generative Model for Code Infilling and Synthesis. arXiv:2204.05999 [cs.SE] [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
[17] Nuno M. Guerreiro, Duarte Alves, Jonas Waldendorf, Barry Haddow, Alexandra Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You
Birch, Pierre Colombo, and André F. T. Martins. 2023. Hallucinations in Large Need. arXiv:1706.03762 [cs.CL]
Multilingual Translation Models. arXiv:2303.16104 [cs.CL] [32] Michael Williams and Tami Moser. 2019. The art of coding and thematic explo-
[18] Kajta Hofmann, Bhaskar Mitra, Filip Radlinski, and Milad Shokouhi. 2014. An ration in qualitative research. International Management Review 15, 1 (2019), 45–
Eye-Tracking Study of User Interactions with Query Auto Completion (CIKM 55.
’14). Association for Computing Machinery, New York, NY, USA, 549–558. https: [33] Wen Zhou, Seohyun Kim, Vijayaraghavan Murali, and Gareth Ari Aye. 2022.
//doi.org/10.1145/2661829.2661922 Improving Code Autocompletion with Transfer Learning. In 2022 IEEE/ACM 44th
International Conference on Software Engineering: Software Engineering in Practice
(ICSE-SEIP). 161–162. https://doi.org/10.1145/3510457.3513061