CodeCompose - A Large-Scale Industrial Deployment of AI-assisted Code Authoring

CodeCompose: A Large-Scale Industrial Deployment of
AI-assisted Code Authoring

Vijayaraghavan Murali Chandra Maddila Imad Ahmad
vijaymurali@meta.com cmaddila@meta.com imadahmad@meta.com
Meta Platforms Inc. Meta Platforms Inc. Meta Platforms Inc.
USA USA USA
Michael Bolin Daniel Cheng Negar Ghorbani

mbolin@meta.com danielcheng@meta.com negargh@meta.com
arXiv:2305.12050v1 [cs.SE] 20 May 2023
Meta Platforms Inc. Meta Platforms Inc. Meta Platforms Inc.

USA USA USA
Renuka Fernandez Nachiappan Nagappan

rdfernandez@meta.com nnachi@meta.com
Meta Platforms Inc. Meta Platforms Inc.
UK USA
ABSTRACT source code or is confined to a small set of developers. For example,
The rise of large language models (LLMs) has unlocked various ap- a question like “How can I upload a table to Hive in Hack?” can be
plications of this technology in software development. In particular, answered using that knowledge.
generative LLMs have been shown to effectively power AI-based The state of the source code and the developers associated with
code authoring tools that can suggest entire statements or blocks of the source code keeps evolving constantly - internal libraries and
code during code authoring. In this paper we present CodeCompose, tools get added and deprecated, while developers move across teams
an AI-assisted code authoring tool developed and deployed at Meta and change roles. At Meta’s scale, keeping up with the knowledge
internally. CodeCompose is based on the InCoder LLM that merges required to accomplish coding tasks is challenging. Additionally,
generative capabilities with bi-directionality. We have scaled up the dynamic environment at a large software company like Meta
CodeCompose to serve tens of thousands of developers at Meta, poses interesting challenges with respect to knowledge discovery
across 10+ programming languages and several coding surfaces. and achieving developer efficiency.
We discuss unique challenges in terms of user experience and Recently, large language models (LLMs)[12, 15, 31] have exhib-
metrics that arise when deploying such tools in large-scale indus- ited the ability to assimilate vast amounts of knowledge after being
trial settings. We present our experience in making design deci- trained on the data in various corpora. This characteristic of LLMs
sions about the model and system architecture for CodeCompose has made them highly impactful in assisting developers in authoring
that addresses these challenges. Finally, we present metrics from code[3, 5, 7, 23]. Specifically, when they are trained to predict the
our large-scale deployment of CodeCompose that shows its im- next token in a code sequence, LLMs can become powerful coding
pact on Meta’s internal code authoring experience over a 15-day assistants that can suggest entire statements or blocks of code dur-
time window, where 4.5 million suggestions were made by Code- ing code authoring. Such a coding assistant trained on an organiza-
Compose. Quantitative metrics reveal that (i) CodeCompose has an tion’s code repository can surface internal knowledge during code
acceptance rate of 22% across several languages, and (ii) 8% of the authoring when developers are most likely to seek that information.
code typed by users of CodeCompose is through accepting code At Meta, we built an AI-assisted code authoring system named
suggestions from CodeCompose. Qualitative feedback indicates an CodeCompose to explore the application of LLM technology for
overwhelming 91.5% positive reception for CodeCompose. In addi- code authoring. Over the past year, we (i) conducted R&D on the
tion to assisting with code authoring, CodeCompose is also intro- LLM (model) architecture, training data, and training objective, (ii)
ducing other positive side effects such as encouraging developers built an end-to-end coding assistant that offers code suggestions
to generate more in-code documentation, helping them with the in various code authoring surfaces across the company, and (iii)
discovery of new APIs, etc. gathered quantitative metrics and qualitative feedback to measure
the impact of the CodeCompose system. CodeCompose has the
1 INTRODUCTION following characteristics:
Meta is a large software company with a complex code base. Tens
of thousands of developers work on billions of lines of code (LOCs) • Multi-lingual: CodeCompose is based on the InCoder LLM [16]
in a monolithic repository that hosts source code written in 20+ and has been trained on 10+ programming languages at Meta.
programming languages. One of the prominent activities in the As such, it inherits the property of LLMs to be multi-lingual.
Software Development Life Cycle (SDLC) at Meta is code author- • Customized for the organization: CodeCompose is fine-tuned
ing. A significant amount of contextual knowledge on the inter- on Meta’s internal code repository, and is thus able to handle
nal software development processes, libraries, is embedded in the Meta-specific languages such as Hack and Flow. CodeCompose
1
of developers who were early adopters of CodeCompose, and we
discuss our learnings for each challenge.
(a) 2.1 Trust

Most of today’s code generation LLMs consider code as a sequence
of tokens, similar to natural language. That is one of the reasons for
the observed acceleration in the model-building process – dealing
with the complexities of parsing source code or developing semantic
understanding is not required anymore. However, this has some
(b) side effects: most of the models cannot guarantee that the generated
code compiles, is syntactically correct, or executes. For example,
the model may make a “suggestion” by generating an API call that
does not exist. Making developers trust and accept suggestions in
this setting is a huge challenge. Other factors can also impact trust
such as the source corpus used for training and its validity, biases in
training data, security issues, vulnerabilities, constant data drift, etc.
The traditional precision vs recall problem in machine learning
also plays into trust. If the model is optimized for recall and gener-
ates a lot of incorrect suggestions, developers will lose faith in the
system as it becomes noisy. If the model is optimized for precision
(c) and generates only a few accurate suggestions, developers might
eventually stop caring about the system as it becomes sparse. Also,
Figure 1: CodeCompose (a) offers inline code suggestions in it will be hard to justify the return on investment if the system is
VSCode in a grey text when the user is typing code (Tab to not generating enough suggestions. Studies[1] show that develop-
accept), (b) changes its suggestion to adapt to a natural lan- ers are fine with reworking a suggestion as long as the model pro-
guage comment, (c) suggests code or documentation based vides a useful starting point or structure with relevant libraries and
on code below the current position. function calls. However, requiring a lot of rework might contribute
to the reduction of trust and confidence in the model.
Learnings for CodeCompose. Building trust was an important
is also deeply integrated with Meta’s version of VSCode, as seen part of our productization, as it is directly affected by model ac-
in Figure 1(a), and other editors. curacy. First, we worked with language partner teams at Meta to
• Natural language proficiency: CodeCompose can understand identify obsolete patterns (e.g., “PHP-isms” in Hack) and not-in-
inline comments in natural language and generate code that production code (e.g., experiments) in the codebase and filter it out
adheres to the comment, as shown in Figure 1(b). It can also from our training data. In a similar vein, we only train on code that
fluently generate comments, messages, and documentation. is being actively modified, that is, containing commits in the last
• Bi-directional: CodeCompose can look beyond the cursor to sug- two years, as opposed to all code in the repository.
gest code at the current position. In Figure 1(c) it can be seen Secondly, from developer feedback, we found out that contextual
suggesting the docstring for a function from code that conven- information adds significant value to the suggestion accuracy, such
tionally appears after the docstring. as the code after the cursor, the file being edited, or the kernel being
In this paper we present our experience in building CodeCom- used (in a notebook). We modified our model’s training objective
pose, discussions on the unique challenges in a large industrial set- to take into account this contextual information (for more details
ting and how they influenced design decisions about CodeCom- see Section 3).
pose, and finally, results from an extensive large-scale deployment Finally, we adopted a rollout strategy to incrementally build trust
including developer feedback on the impact of CodeCompose on in the product. We rolled out CodeCompose to the company in
how code is authored at Meta. waves of languages: (i) only Python, (ii) Hack, Flow (Javascript)
The paper is organized as follows. Section 2 discusses industrial and C++, (iii) others, and within each wave, we rolled it out to in-
challenges for coding assistants, Section 3 presents details about the crements of 25% of the developer population. By doing this steadily,
CodeCompose model, Section 4 presents the system architecture, at every step we were able to gather developer feedback (such as
Section 5 presents results from our large-scale deployment at Meta, the above) and iterate on the product before rolling out further.
Section 6 discusses some threats to validity, Section 7 discusses
related work, and Section 8 concludes the paper. 2.2 User Experience
Due to the generality of LLMs, code generation can be done at mul-
2 INDUSTRIAL CHALLENGES tiple levels of granularity: token completion, statement completion,
In this section we discuss unique challenges and considerations for completing multiple lines, or generating entire blocks. Depending
deploying a coding assistant at a large-scale industrial organization on factors such as suggestion confidence, user context, and task con-
such as Meta. These are gathered from feedback from hundreds text, we will need to generate suggestions at different granularities.
2
For example, when developers are in their coding “flow”, suggesting Learnings for CodeCompose. At Meta, and there are standard
a large block of code and moving their existing code down – and definitions of internal metrics to measure the scale of deployment
moving it back up when they type over the suggestion – results in of tools. We track usage metrics such as the number of active users,
an extremely jarring experience. This is because developers would number of suggestions offered, acceptance rate, and the percent-
constantly need to switch between typing code and reviewing the age of code typed by developers, that comes from accepting Code-
suggested block of code, resulting in an unpleasant “code review Compose’s suggestions (for more details see Section 5). We also use
on-the-fly” situation. Therefore, when to suggest, what to suggest, offline metrics such as exact match to measure the quality of the
and how much to suggest is an essential part of the user experience. model and run experiments.
Similarly, performance and latency are also an important consid- One particular challenge for us was balancing the rollout of
eration. Coding assistants operate in a real-time environment with CodeCompose with measuring its impact without bias. We wanted
strict requirements on latency. Developers do not want to wait for to roll it out to developers who are excited by this technology and
several seconds for the system to generate a single suggestion as would provide early feedback to help us iterate quickly. However,
they are typing it – the system has to match the developer’s typing we wanted to avoid implicitly introducing any bias in the population
speed. Studies[18] have shown that sometimes users adjust their of developers who may be, for example, more likely to accept sug-
typing behaviors (and speed) to match autocomplete suggestions gestions than normal. To address this, we first rolled it out to a few
in IDEs. However, slowing oneself down should be compensated hundred early adopters and gathered feedback that improved the
by passing the right suggestions and making it a “net positive” ex- product. Then, we incrementally expanded the rollout to randomly
perience. include more developers at Meta, outweighing any possible bias
from early adopters. For all our deployment metrics in Section 5, we
discarded any data that was gathered before the randomized rollout.
Learnings for CodeCompose. After several iterations, we found
out through developer feedback that offering single-line sugges-
tions – that is, completing the current line of code where the cursor 3 MODEL DETAILS
is – strikes a good balance between suggesting an adequate amount In this section we provide details about the underlying LLM archi-
of code and avoiding the jarring effect. Furthermore, even within tecture for CodeCompose, and discuss considerations that affected
the current line we do not show suggestions if there is any code to our decision.
the right of the cursor, except certain tokens such as ), }, ], etc.
In terms of latency, developers were fine with suggestions ap-
pearing within 300ms - 500ms, and certainly not beyond 1s. To 3.1 Model Architecture and Training Objective
bring down the end-to-end latency below this threshold, we em- LLMs largely fall into two main categories. The BERT [15] branch
ployed a suite of optimizations such as caching and debouncing (for of models are trained with the Masked Language Model (MLM)
more details see Section 4). With this experience, developers who objective that masks out certain parts of the input and trains the
are looking for multi-line suggestions simply press Tab as many model to predict them back. These models have a notion of bidi-
times as needed. This allows them to edit the currently suggested rectionality, where predicting a token can take into account both
line before going to the next line, which makes the subsequent sug- preceding and following tokens. However, MLM is not easily suited
gestions more accurate. They can also press a manual keyboard for generative tasks. The GPT [12] branch of models are trained
shortcut to request a multi-line suggestion, but we found that de- with the Causal Language Model (CLM) objective that provides
velopers rarely use this workflow. the model a sequence of tokens and trains it to predict the next to-
ken [12]. CLM makes the model more suited for auto-regressive
generation tasks, however it only takes into account the preceding
2.3 Metrics tokens at a given point.
Evaluating the usefulness of AI-generated suggestions is a major In industrial settings, developers often perform the task of editing
challenge. This includes (but not limited to), complexities involved where there is code before and after their cursor position. The
with respect to the granularity of suggestions, developers’ flexibility code after the cursor contains highly relevant signals about the
to rework the generated code, defining true positives, etc. A strict code being written. Thus, for a code generation system we would
metric like exact match, BLEU score or percentage of accepted ideally want both – a generative model with bidirectionality. For this
suggestions as-is is going to underrepresent the actual benefits purpose, we used the InCoder LLM [16] as our pre-trained model.
offered by these solutions. For example, while acceptance rate is a InCoder is a generative model trained with the Causal Masking
good metric to measure product usage, high acceptance rate does (CM) objective, where a sequence of tokens are masked out from
not necessarily mean developers are being more productive when a given input, appended to the end of the input, and the model
using the tool – that has to be measured explicitly. is trained to generate the whole sentence left-to-right. Moreover,
Furthermore, evaluation goes beyond just measuring product the tokenizer used by InCoder tokenizes multiple words into a
impact – it helps us understand where the model is inaccurate, single token, and as such allows efficient encoding of common code
or where developer pain points are in the user experience, so as patterns such as import numpy as np. For more details we refer
to address those issues. However, complexities such as privacy the reader to the InCoder [16] paper.
compliance and proprietary data make it harder to track all the To suit our end application better, we make a few modifications
telemetry that is required to evaluate or monitor the product. to the CM objective, and propose Language Causal Masking (LCM):
3
Tokenize code at trigger
characters: (, ., =, etc.
Sequence of tokens
metadata Select a random subsequence

File: /home/bubblesort.py Language: python under a length limit
def bubbleSort(array): 3 sequences of tokens

# loop through each element of array
for i in range(len(array)):
code before Create metadata, code before, code after,
# keep track of swapping and target sentences
swapped = False
# loop to compare array elements 4 sentences
for j in range(0, len(array) - i - 1):
if array[j] > array[j + 1]: Encode each sentence separately with
# swap if needed target CodeCompose’s tokenizer
temp = array[j]
array[j] = array[j+1] 4 sequences of token IDs
array[j+1] = temp
swapped = True code after Truncate code before and
# array is already sorted
code after using a 70-30 split
if not swapped:
4 sequences of truncated token IDs
break
Concatenate the encoded sequences and insert
special token IDs for mask, pad if needed
metadata IDs code before IDs <mask> ID code after IDs <mask> ID target IDs
Figure 2: Steps to construct an input to the model in LCM with an example.
• CM implements the masking after the text has been tok- As we are suggesting single lines of code, we stop the generation
enized into token IDs, which limits the model during train- early once a newline token has been generated. Due to the real-time
ing to only seeing mask spans with edges at common tok- nature of our application and the inline suggestion user experience
enizer tokens. LCM lifts the masking step to the language (UX), we only return one sequence of generated tokens.
level and avoids this, similar to the fill-in-the-middle (FIM)
task [11]. Also, LCM only masks at certain trigger charac- 3.2 Training data
ters – that is, characters where the model will be queried
during inference such as (, ., =, etc. For training on our internal data, we collected first party data from
• We prefix certain metadata to the input in LCM, such as the Meta’s code repositories and notebooks, applying several filters:
programming language, full path to the file, and the kernel • Rather than crawling the entire repository, we used code
name for notebooks. that is modified through diffs checked in by developers as
• Through model level ablations (not in the scope of this paper), a way of staying close to our end application (i.e., writing
we found an optimal 70-30 split of the model’s input length code in the IDE). This way we avoid training on code that
between code before and code after the cursor. may have been added a long time ago but is never modified.
• Specialized for our use case, LCM has only one mask in any • To keep the training data fresh, we only looked at diffs going
input. back up to 2 years, and only kept the latest versions of files
A step-by-step overview of constructing an input in LCM is to avoid bugs that may have been patched.
shown in Figure 2, along with an example code snippet. Once an • For each major target language, we worked with the respec-
input is constructed, during training, we maximize the log proba- tive language services team at Meta to exclude code with
bility of the language-masked input: old/obsolete patterns, not-in-production, etc.
After these filters, our first party training data included in the
log P ( [Metadata; Before; <mask>; After; <mask>; Target]) order of tens of millions of files amounting to a few billion lines of
code across 10+ languages.
where Metadata, Before and After are the tokens in the metadata,
code before, and code after the cursor, respectively, Target is the
code that was masked, and <mask> is a special token. During in- 3.3 Training details
ference, we sample tokens in an auto-regressive manner from the We took InCoder-1.3B, the public model with 1.3 billion parameters,
distribution: and fine-tuned it further on the above first party data with our LCM
objective. A 6.7 billion parameter model called InCoder-6.7B is also
P (· | [Metadata; Before; <mask>; After; <mask>]) available, but we found that for suggesting single lines of code it
4
Table 1: Accuracy metrics across programming languages for the public model and the fine-tuned model
Exact Match BLEU score

Models
Python Hack Flow C++ Python Hack Flow C++
Public 23.1 14.2 18.62 23.42 37.88 27.63 33.83 39.85
Fine tuned (CM) 35.53 37.77 30.79 24.18 50.03 52.60 45.83 40.29
Fine tuned (LCM) 48.84 57.73 52.23 40.01 61.98 72.96 68.62 56.23
added too much latency for incremental gain in accuracy. For fine- Clients can make requests to this tier via Thrift. The caller specifies
tuning the 1.3B model, we used a batch size of 20 per device (2.5k the code before and after the cursor, as well as the file path, language,
effective), and a learning rate of 5e-4. Training for 4 epochs with and metadata to use to process the request. The request handler is
sharded data parallelism took 4 days on a cluster of 128 A100 GPUs. designed to be simple to utilize the most out of the GPUs. It simply
We then deployed the model on a cluster of 150 A100 GPUs. processes the request data into a string, tokenizes it, and invokes the
model’s auto-regressive generation function that runs on the GPU.
3.4 Effect of fine-tuning on prediction accuracy Since we are optimizing for latency rather than throughput, each
We conduct offline evaluations to measure the lift of fine-tuning request is processed as soon as it arrives and so there is no inference
with first-party data on the model accuracy. The offline evaluation batching. Due to latency considerations, when the input length is
measures two metrics: Exact Match (EM), BLEU score [24]. less than 1500 tokens, we can afford to perform a beam search with
We collect 20K source code files randomly sampled from Meta’s a beam width of 3 and return the most likely suggestion. Otherwise,
source code repositories. We collect files that belong to four major we perform greedy generation.
programming languages used in Meta: Hack, C++, Python, and
Flow (JavaScript). To simulate the code being written in the IDE,
we randomly mask a part of the code snippet that terminates with 4.2 LSP
an end-of-the-line character, in each file. The prediction task is to To mediate requests between the client and server, we implemented
predict the masked code snippets when code before and code after a language server in Rust that we reuse across our various editor in-
are passed to the model as the context. In addition to this context, tegrations. While most language servers are designed to support a
we pass metadata such as file name, full path, and the language. wide array of traditional IDE functionality (autocomplete, jump-to-
As shown in Table 1, fine tuning the CodeCompose model on definition, etc.), the CodeCompose language server supports only
first-party source code with the CM objective allowed the model to one meaningful request type: textDocument/inlineCompletions.
learn internal specific code styles, formatting, libraries, etc., which The LSP also implements core request logic such as debouncing
led to a significant improvement of the EM and BLEU metrics. and caching. Debouncing waits for a 20ms pause before sending a
The improvements for languages like Hack, which are heavily cus- request to avoid too many requests from being sent as the user is
tomized for use at Meta with minimal public presence, are evident typing in the editor. Caching saves the response from the server
with 23.5% improvement in exact match and 24.97% in BLEU score. with the context as the key, so that duplicate requests – which hap-
Whereas, the improvements for C++ are minimal because of the re- pen frequently as developers type and erase code – can be avoided.
semblance between C++ used at Meta and the public version. More- Moreover, requests that hit the LSP cache are served significantly
over, when fine-tuned with the customized LCM objective that is faster than the service.
closer to the final application, the model performance improves fur-
ther, even for C++.
This experiment provides a strong evidence that an internally 4.3 Clients
fine tuned model outperforms an off-the-shelf model that is trained Clients are extensions that run locally on developers’ machines
on external data only (when tested on an organization-specific and are responsible for ultimately displaying inline suggestions in
benchmarks for code completion), and is more suitable to serve the the editor. For editors such as VS Code that support LSP natively,
needs of the developer community at the given enterprise. we require relatively little glue code to create the CodeCompose
extension. For editors such as Android Studio that do not have
4 SYSTEM DESIGN native LSP support, we built a small adapter to proxy requests to
CodeCompose employs a typical client-server architecture in which our LSP.
the server is an inference tier that runs the model and a client is an Furthermore, this architecture makes it straightforward to in-
editor that surfaces code suggestions. We encode the bulk of the tegrate CodeCompose with our in-house developer tools, such as
client-side logic in a Language Server Protocol (LSP) conformant the web-based Bento notebook [2] surface created internally at
language server that is reused across multiple editor integrations. Meta. Because the notebook surface also supports LSP, it was easy
to provide our data scientists and ML engineers with the same AI-
4.1 Server powered code suggestions as developers working in VS Code. This
At Meta, we have a tier of machines equipped with A100 GPUs, each ability to plug CodeCompose into any code editing surface inter-
with sufficient memory to host the fine-tuned CodeCompose model. nally is an advantage of owning the entire stack.
5
Figure 3: System Architecture for CodeCompose
4.4 Telemetry potential biases from early adopters. We instrumented telemetry to

The LSP is responsible for recording all of the client-side telemetry, track various events in the IDE such as displaying a suggestion in-
which ensures that metrics are recorded consistently across the line, accepting or rejecting a suggestion, and the length of accepted
various surfaces where CodeCompose is used. suggestions.
Because the LSP receives textDocument/didChange events, it For our qualitative evaluation (Section 5.3), we collected data by
can calculate most of the metrics of interest without further cooper- mining user feedback posts in our support group. We manually an-
ation from the client. Nevertheless, the LSP does expect a client to notated posts as CSAT (Customer Satisfaction) or DSAT (Customer
send a custom cc/received notification upon receipt of the com- Dissatisfaction). The feedback group does not allow anonymous
pletions so that we can record the full round-trip time from the end posting. However, culturally the Meta is known to promote open
user’s perspective. Limiting the demands on the client facilitates feedback and has a vocal developer community, which reduces the
integrating CodeCompose into more of our internal tools. effect of response bias [14] if not eliminate it.
5 EVALUATION: OBSERVATIONS AND

PERCEPTIONS ABOUT USEFULNESS OF 5.2 Observations from deploying
CODECOMPOSE CodeCompose for multiple languages
In this section, we provide details about the metrics that we track CodeCompose made 4.5 million suggestions during the aforemen-
to understand the usage, reception, and impact of CodeCompose tioned time period across 9 programming languages. 16k distinct
on developer workflows. developers have seen at least one CodeCompose suggestion. As
shown in Table 2, we observe a suggestion level acceptance rate of
5.1 Data collection and methodology 22% across 9 programming languages for suggestions that were dis-
We obtained data from a large-scale deployment of CodeCompose played for at least 750 milliseconds. Imposing a lower bound on the
at Meta. For our observational study (Section 5.2), we collect the de- display time of the suggestions when calculating the acceptance
ployment data from 15th April to 30th April 2023. This data is col- rate helps in making sure developers who were exposed to a sug-
lected after enabling CodeCompose for 9 programming languages gestion had a chance to see and comprehend it [6].
for at least two weeks for each programming language, and mak- In addition, we log the length of each accepted suggestion and
ing sure any anomalies concerning telemetry instrumentation are the number of characters developers type when authoring code
identified and fixed. (excluding large copy-pastes, refactors, etc.). This allows us to cal-
We also made sure that the data collection period is several culate, at a character level, the percentage of the code typed by de-
weeks after we enabled CodeCompose to developers based on a velopers that came through accepting CodeCompose suggestions,
randomized rollout strategy for each language so that we avoid any which we computed to be 8%.
6
Table 2: CodeCompose acceptance rate across programming languages
Language # Suggestions shown Acceptance rate Percentage of code # Users

typed using CodeCom-
pose
Python 1.87mn 22 8 10.7k
Hack 1.25mn 22.5 10 5.5k
C++ 608.1k 20 10 2.5k
Flow (Javascript) [8] 583.2k 18.2 7 2.5k
Rust 74.2k 17.2 9 212
Objective C++ 56.5k 18 6 429
Objective C 34.4k 18.1 6 299
C 23.5k 21.3 12 201
Typescript 8.9k 19 10 76
All 4.5mn 22 8 16k
Note that Table 2 might indicate that Python has more users than Table 3 lists the distribution of qualitative feedback responses.
languages like Hack or Javascript, but that is simply an artifact of 91.5% of the CodeCompose users gave a favorable response while
our incremental rollout strategy that opened CodeCompose first 8.5% of them gave an unfavorable response. Many respondents
to Python, as mentioned in Section 2.1. (15.7%) appreciated the fact that CodeCompose suggestions are
accurate and CodeCompose was able to auto complete the line[s]
5.3 Qualitative feedback they were going to type.
In addition to accuracy of suggestions, 23% of the users said
In addition to collecting metrics such as acceptance rate and per- CodeCompose was helpful in discovering new APIs, automate
centage of code typed using CodeCompose, we collect qualitative tedious tasks such as boilerplate code. Therefore, CodeCompose is
feedback from the CodeCompose users through a forum. We cre- not just helping with typing code faster (as demonstrated in Table
ated a Workplace group [9], where CodeCompose users can post 2) but also saving time and effort people spend in searching for
any feedback they may have. Everyone in the company can see the APIs and documentation.
feedback, react to it, and have a follow-up discussion. Though it is CodeCompose is found to be helping people accelerate their cod-
a public group, we do not expect the CodeCompose users to suffer ing activity. 20% of the users said they found CodeCompose to be
from response bias [14] due to the fact that Meta is a large organiza- accelerating their coding activity. In addition to helping write code
tion with open culture. Also, the developers that provided feedback faster, CodeCompose seems to be helping developers produce more
sit far away (organizationally) from the developers of CodeCom- in-code documentation and API documentation. This is an inter-
pose. Therefore, the developers tend share their feedback about esting side-effect and demonstrates the model’s ability to generate
tools and services openly without hesitation. text as an output (from code) in addition to code generation.
We received 70 responses over a 15-day time period, on which we
performed qualitative data analysis [10]. We performed coding [32] 5.4 Representative quotes
manually with two members of the team annotating the responses
To offer an impression, we list some quotes (positive and negative)
received from the CodeCompose users independently. Coding is
that we received from the CodeCompose users. The quotes are
the process of labeling and organizing qualitative data to identify
paraphrased for presentation purposes, and are generalized to re-
different themes and the relationships between them.
move references to internal artifacts.
The labeling process involved three steps: 1. the team got to-
A developer who got access to CodeCompose shared their first
gether and came up with five categories (labels) to represent the
impressions:
qualitative responses (bootstrapping). First, we identified some cat-
egories that are intuitive and refined them as we sift through the "Today I needed to write a quick tool for [task]. I opened
verbatim to arrive at a final set of categories 2. two members of the VS Code and started writing [code]. Suddenly this thing
team performed labeling independently 3. the team got together called CodeCompose popped up and wrote all the code
again to reconcile their understanding and made modifications to for me! This was a super delightful experience, it really
the labels as necessary. made my day, so thank you!"
We list the categories or labels and their descriptions in Table 3. In this case, the developer was not informed about the existence
We have two primary categories of user feedback: CSAT (favorable) of CodeCompose. They happened to be picked by our randomized
and DSAT (unfavorable). CSAT is further expanded into five sub- roll out algorithm. This quote summarizes the delightful coding
categories: Accelerate coding, Discovery, Documentation, Sugges- experience that the developers at Meta experienced when they pair-
tion accuracy, and generally favorable. The intent behind the cate- program with CodeCompose.
gorization is to understand what makes developers think CodeCom- Several remarks are clear indicators of the usefulness of employ-
pose is useful and the use cases for which they use CodeCompose. ing CodeCompose in code authoring workflows.
7
Table 3: Distribution of qualitative feedback responses
User feedback Category Description # of responses

Accelerate coding CodeCompose helped speed up 14 (20%)
my coding process
CSAT Discovery CodeCompose helped me dis- 16 (22.8%)
cover APIs, write boilerplate
code faster
Documentation CodeCompose helped me with 5 (7.1%)
generating in-code documenta-
tion, docstrings, etc.
Suggestion accuracy CodeCompose suggestions are 11 (15.7%)
accurate and not noisy
Generally favorable CodeCompose is generally 18 (25.7%)
great, and I will keep using it
for my coding tasks
DSAT Unfavorable CodeCompose is not useful 6 (8.5%)
"I have been a big fan of [CodeCompose ] ever since I faster than looking for other call sites to figure out how
got access. I do not think I have been in any coding ses- to use this [API]."
sion since then where I did not use it."
We also observed interesting side effects such as increased in-
"The suggestions [from CodeCompose ] are quite rel- code documentation when developers use CodeCompose to make
evant and do not show up too often – not necessarily code changes. As CodeCompose provides accurate descriptions
“rarely”, but not obnoxiously often either. It is a nice bal- of code changes and a template to start with, developers tend to
ance." accept the suggestions, make changes (as necessary), and push the
documentation changes (along with the source code changes) in
"I was blown away that after writing the following their diffs. This is reflected in some of the anecdotes listed below.
[code], CodeCompose successfully predicted the cor-
"I find CodeCompose particularly useful when writing
rect completion. This is amazing because I could not re-
docstrings. Without CodeCompose, I would not even
member the name of the global variable or where it was
imagine that I can write [so many] lines of docstring
in the code, but CodeCompose knew what it was."
for my actual code."
Many developers highlighted the fact that the predictions are
"I really like how CodeCompose highlights the value
relevant. Also, we received feedback about how nicely CodeCom-
of naming and the quality of documentation, which
pose navigates the precision versus recall problem by not showing
used to be mainly for long-term benefit, but now good
suggestions too often.
naming and [documentation] gives you an immediate
There were also several pieces of feedback that highlighted Code-
effectiveness boost."
Compose’s ability to help developers discover new APIs or ramp
them up quickly on unfamiliar APIs. In addition to the positive remarks, few developers requested
to disable CodeCompose as they found CodeCompose to be noisy
"I wanted to post about a great experience I had in using and intrusive. We list a few anecdotes below.
CodeCompose in the [internal] codebase. I have been
getting back into coding and was rusty on the different "CodeCompose is good, however, it seems to reduce my
arguments that exist in the [library] operators. Some coding speed. The reason is that the traditional auto-
operators can have up to [hundreds of] arguments so complete was typically suggesting [real] functions, but
this is a challenging problem. CodeCompose has helped CodeCompose is suggesting hallucinations. These are
me several times in a day by giving me the right argu- not always correct and I would like a UX design where
ment that I cared about, whereas my previous workflow I could switch between traditional auto-complete and
would have been to [search] for the operator and spend CodeCompose."
several minutes on each."
"I have realized that [CodeCompose ] tends to strug-
"CodeCompose has been a game changer for working gle while working with function signatures. The sug-
with APIs that I have not used before. Even if it is not gestions it offers do not match the actual signature, al-
exactly what I need, it often shows me approximately though they make sense."
what I should pass or how I should call it. This is much
8
"The [code editing] experience is sometimes good and conclusions from the quantitative metrics or qualitative feedback
sometimes bad. Auto-complete struggles to work to- presented. It is possible that the results might not hold elsewhere,
gether with CodeCompose – they seem to be competing and for this reason, we cannot assume a priori that other AI-based
with each other." coding assistants built outside Meta will perform similarly. Simi-
This highlights the importance of investing in UX research to larly, the value-add of fine-tuning the LLM on Meta’s internal code
deliver the best experience for AI-assisted code authoring while co- (Table 1) may also not translate externally. We already saw that the
existing with traditional auto-complete systems. The feedback also lift offered by fine-tuning varied between languages.
talks about the problems faced by LLMs with respect to hallucina-
tions and grounding [17, 19]. We are actively exploring this area to 6.2 Productivity Impact
make CodeCompose a productive experience for all the developers. The scope of this paper is to only present our results from build-
ing and deploying CodeCompose at scale and its usage. It is impor-
5.5 Factors Affecting CodeCompose tant to note that the metrics used in this paper need not correlate
favorability with the productivity of developers who use CodeCompose. We
After analyzing all the qualitative feedback, we identified the fol- do not make any claims about the impact of CodeCompose on the
lowing main factors that make a developer inclined towards find- productivity of developers at Meta. For example, simply because a
ing a system like CodeCompose useful. developer accepts CodeCompose suggestions often or is favorable
Favorable scenarios for CodeCompose: The scenarios for which to it does not necessarily make them productive – that has to be
CodeCompose was able to add the biggest value includes (but not measured separately with statistical significance.
limited to), auto completing lines of code, API discovery, boilerplate
coding, suggesting standard libraries, generating in-code documen- 6.3 Internal Validity
tation, etc. While we have taken significant measures to reduce bias in the re-
When it comes to developers, the developers who benefited the sults presented (e.g., by waiting several weeks after randomized
most are the ones work on authoring code that involves writing rollout), we cannot guarantee that there is absolutely no bias. The
boilerplate code and idioms, the ones who follow the typical coding time period chosen for measuring the metrics in Table 2 could have
patterns employed at Meta, and the ones that employ standard first- had external events beyond our control that influenced developers’
party (or third-party) libraries to accomplish their coding tasks. coding behaviors during that time. Some languages such as Type-
Additionally, CodeCompose is well received by the developers who script only have a small number of developers, who might have
tend to work on building pipelines and common infrastructure had an outsized influence on the metrics for those languages.
tasks that involves writing heavily templatized code. This can be Qualitative feedback, in general, comes with a degree of subjec-
corroborated with the distribution of responses listed in Table 3 tivity. While Meta encourages a culture of open feedback, develop-
and the representative quotes listed in Section 5.4. ers might not have been comfortable sharing negative assessments
Unfavorable scenarios for CodeCompose: After analyzing the with colleagues. We facilitated easy opt out, but less than 1% had
qualitative feedback, we found that CodeCompose is not helpful opted out, indicating that a majority of developers found Code-
in scenarios where developers use specialized APIs and libraries. Compose to be a net positive experience.
CodeCompose seems to be struggling with suggesting the correct The categorization of the qualitative feedback into different cat-
code when the developers employ atypical coding patterns. egories was done manually. We tried to reduce subjectivity in this
Some developers found the traditional, semantic, auto-complete process by making two authors in the team categorize them and
functionality to be more useful in those cases. Developers also then reconcile the different labels. Nevertheless, a different set of
complained about the possibility of CodeCompose to hallucinate people might have produced a different categorization in Table 3.
[17, 21] when recommending the uncommon API names, URLs, etc.
Developers also found the coexistence of CodeCompose with 7 RELATED WORK
traditional auto-complete to be vital. Sometimes, when these two
In software engineering research, while there is a large body of
systems compete for showing suggestions, it creates a disruptive
work on code completion [13, 20, 26, 27, 33], there has been limited
experience to developers. Also, commonly used keyboard shortcuts
deployment of these in large industrial environments [3–5, 7]. With
such as “Tab” are overloaded to accept CodeCompose suggestions
the advent of Generative AI, advances in the code assistance area
and traditional auto-complete suggestion. Solving these UX prob-
are now in weeks rather than months. Several of these results are
lems is vital to facilitating a smoother integration with existing se-
circulated as blog posts and not traditional papers. We use a mix
mantic engines, which we discussed in Section 2.2.
of these results to set the context for historical work at Google,
Amazon, Microsoft, and GitHub.
6 THREATS TO VALIDITY The closest in spirit to our work is the work at Google [5]. At
In this section, we list certain threats to the validity of our evaluation Google [5] deploying hybrid semantic ML code completion to 10k+
in Section 5. Google employees (over three months across eight programming
languages) compared to a control group observed a 6% reduction in
6.1 Generalizability coding iteration time (time between builds and tests) when exposed
CodeCompose is an entirely internal tool that has only been de- to single-line ML completion. These results demonstrate the impact
ployed at Meta. Therefore, there is a threat of drawing any general on developer productivity and at the time of the article 3% of new
9
code (measured in characters) was generated from accepting ML across 9 programming languages at Meta, and presented metrics and
completion suggestions. feedback to understand the impact of the system on code authoring.
Pythia [29] deployed with Intellicode [4] by Microsoft is built We first looked at some unique challenges in building such cod-
using DL models trained on code contexts extracted from abstract ing assistants in large organizations. We shared our experience in
syntax trees. Its best matching code completions are in the order of building trust, designing the user experience, and measuring the
100 ms. The offline evaluation of Pythia obtained on 2700 Python right set of metrics to evaluate the system.
open source software GitHub repositories showed a top-5 accuracy We presented details about the underlying InCoder-based LLM
of 92%, surpassing the baseline models by 20% averaged over classes, that powers CodeCompose. We introduced a custom training ob-
for both intra and cross-project settings.[29] IntelliCode Compose jective, Language Causal Masking, that suits our application of sug-
[28] – is a general-purpose multilingual code completion tool that gesting individual lines of code. In doing so, we conducted an of-
is capable of predicting sequences of code tokens of arbitrary types, fline evaluation that showed a 23-25% improvement brought about
generating up to entire lines of syntactically correct code [28]. It by fine-tuning the LLM on the Meta’s internal code.
is built on a state-of-the-art generative transformer model trained We then presented the system architecture of CodeCompose. We
on 1.2 billion lines of source code in Python, C#, JavaScript, and built CodeCompose using three primary components: the server,
TypeScript programming languages [28]. Its best model yields an Language Service Protocol (LSP), and the client. When a developer
average edit similarity of 86.7% and a perplexity of 1.82 for Python types code, the service receives a request to perform inference and
programming language [28]. generates a sequence of tokens to auto-complete the statement(s).
There have been several empirical evaluations of GitHub’s Copi- The LSP and the client help orchestrate the inference calls and
lot [3] in actual use for automatic code completion. Nguyen et al. implement the logic to display the suggestion.
[22] used 33 LeetCode questions to create queries for Copilot in We presented quantitative metrics that show the scale and im-
four different programming languages. They found that Copilot’s pact of CodeCompose. In a 15-day time period, 4.5 million sugges-
Java suggestions have the highest correctness score (57%) while tions were shown to developers, 22% of which were accepted. This
JavaScript is the lowest (27%) [22]. Overall, Copilot’s suggestions resulted in 8% of code typed by users of CodeCompose coming
had low complexity with no notable differences between the pro- from suggestions offered by CodeCompose.
gramming languages. They also observed some potential Copilot Finally, we presented qualitative feedback from developers. We
shortcomings: such as generating code that can be further simplified found that an overwhelming 91.5% of the feedback shared by de-
and code that relies on undefined helper methods [22]. Not all usage velopers was favorable towards CodeCompose, indicating that it
has been positive. A user study [30] with 24 participants to under- adds significant value to their coding experience. Expanding into
stand how programmers use and perceive Copilot found that while the feedback, we found CodeCompose having desirable side effects,
Copilot did not necessarily improve the task completion time or such as helping developers discover unfamiliar APIs and write doc-
success rate, most participants preferred to use Copilot in daily pro- umentation.
gramming tasks, since Copilot often provided a useful starting point In the future, we plan to enable more features such as block
and saved the effort of searching online [30]. However, participants completions, more interaction modalities such as a conversational
did face difficulties in understanding, editing, and debugging code modality within the IDE, functionality to explain source code and
snippets generated by Copilot, which significantly hindered their provide code walk-throughs, etc. We are exploring opportunities
task-solving effectiveness [30]. An empirical study [25] presented a to leverage semantic information to perform pre-processing and
controlled experiment with GitHub Copilot showing recruited soft- post-processing to improve the suggestion accuracy by reducing
ware developers who were asked to implement an HTTP server in hallucinations. Furthermore, we see opportunities to expand this
JavaScript as quickly as possible. The treatment group, with access technology, beyond code authoring, to help developers across the
to Copilot, completed the task 55.8% faster than the control group. Software Development Life Cycle (SDLC).
Amazon’s CodeWhisperer [7] is a fully functional code comple-
tion tool integrated into the IDE. Its analysis [14] found that lan- 9 ACKNOWLEDGEMENTS
guage models can generate code with correct syntax and pass unit
We want to thank the following people for their help, support, and
tests in programming languages they are not intentionally trained
consultation while envisioning, building, and deploying CodeCom-
on. Like the companies discussed above (Google, Microsoft, GitHub,
pose at Meta: Gabriel Synnaeve, Baptiste Rozière, Chris Lengerich,
and Amazon), there are several commercial versions of code com-
Omer Dunay, Marshall Roch, Karim Nakad, Peter Monaco, Kelly
pletion tools by other generative AI companies. The goal here is to
Hirano, Killian Murphy, Kristian Kristensen, Wes Dyer, Adam Tait,
provide a sample of the related work rather than undertaking a sys-
Arun Ganesan, Andy Chiu, and everyone on the CodeCompose
tematic literature survey of the state of the art. The above works
team.
summarize the related work in this area at this point in time in this
rapidly changing field.
REFERENCES
[1] Github 2022. How GitHub Copilot helps improve developer productivity. Github.
https://github.blog/2022-07-14-research-how-github-copilot-helps-improve-
developer-productivity/
8 CONCLUSION [2] Meta Accessed 2021. ELI5: Bento - Interactive Notebook that Empowers Development
Collaboration and Best Practices. Meta. https://developers.facebook.com/blog
In this paper, we introduced an AI-based coding assistant system /post/2021/09/20/eli5-bento-interactive-notebook-empowers-development-
named CodeCompose, discussed how we scaled it to 16k developers collaboration-best-practices/
10
[3] Github Accessed 2021. Github Copilot. Github. https://github.com/features/copi [19] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii,
lot Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in
[4] Microsoft Accessed 2021. Microsoft Intellicode. Microsoft. https://visualstudio.m Natural Language Generation. ACM Comput. Surv. 55, 12, Article 248 (mar 2023),
icrosoft.com/services/intellicode 38 pages. https://doi.org/10.1145/3571730
[5] Google Accessed 2021. ML Enhanced Code Completion. Google. https://ai.googl [20] Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2021. Code Pre-
eblog.com/2022/07/ml-enhanced-code-completion-improves.html diction by Feeding Trees to Transformers. In 2021 IEEE/ACM 43rd International
[6] Accessed 2022. ML-Enhanced Code Completion Improves Developer Productivity. Conference on Software Engineering (ICSE). 150–162. https://doi.org/10.1109/IC
https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.h SE43902.2021.00026
tml [21] Zihao Li. 2023. The Dark Side of ChatGPT: Legal and Ethical Challenges from
[7] Amazon Accessed 2023. Amazon CodeWhisperer. Amazon. https://aws.amazon Stochastic Parrots and Hallucination. arXiv:2304.14347 [cs.CY]
.com/codewhisperer [22] Nhan Nguyen and Sarah Nadi. 2022. An empirical evaluation of GitHub copilot’s
[8] Accessed 2023. Flow Javascript. https://engineering.fb.com/2014/11/18/web/flo code suggestions. In Proceedings of the 19th International Conference on Mining
w-a-new-static-type-checker-for-javascript Software Repositories (MSR ’22).
[9] Accessed 2023. Workplace Group. https://www.workplace.com/features/groups [23] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou,
[10] Carl Auerbach and Louise B Silverstein. 2003. Qualitative data: An introduction Silvio Savarese, and Caiming Xiong. 2023. CodeGen: An Open Large Language
to coding and analysis. Vol. 21. NYU press. Model for Code with Multi-Turn Program Synthesis. arXiv:2203.13474 [cs.LG]
[11] Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine [24] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU:
McLeavey, Jerry Tworek, and Mark Chen. 2022. Efficient Training of Language A Method for Automatic Evaluation of Machine Translation. In Proceedings of
Models to Fill in the Middle. arXiv:2207.14255 [cs.CL] the 40th Annual Meeting on Association for Computational Linguistics (Philadel-
[12] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, phia, Pennsylvania) (ACL ’02). Association for Computational Linguistics, USA,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda 311–318. https://doi.org/10.3115/1073083.1073135
Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, [25] Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Impact of AI on Developer Productivity: Evidence from GitHub Copilot. In Arxiv.
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin https://arxiv.org/abs/2302.06590
Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya [26] Sebastian Proksch, Johannes Lerch, and Mira Mezini. 2015. Intelligent Code
Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. Completion with Bayesian Networks. ACM Transactions on Software Engineering
arXiv:2005.14165 [cs.CL] and Methodology (TOSEM) 25, 1, Article 3 (12 2015).
[13] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from examples [27] R. Robles and M. Lanza. 2008. How Program History Can Improve Code Com-
to improve code completion systems. In Proceedings of the 7th joint meeting of pletion. In 2008 23rd IEEE/ACM International Conference on Automated Software
the European software engineering conference and the ACM SIGSOFT symposium Engineering. 317–326. https://doi.org/10.1109/ASE.2008.42
on The foundations of software engineering (ESEC/FSE ’09). [28] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. Ac-
[14] Nicola Dell, Vidya Vaidyanathan, Indrani Medhi, Edward Cutrell, and William cessed 2020. IntelliCode compose: code generation using transformer. In Proceed-
Thies. 2012. "Yours is Better!": Participant Response Bias in HCI. In Proceedings ings of the 28th ACM Joint Meeting on European Software Engineering Conference
of the SIGCHI Conference on Human Factors in Computing Systems (Austin, Texas, and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020).
USA) (CHI ’12). Association for Computing Machinery, New York, NY, USA, [29] Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. 2019. Pythia:
1321–1330. https://doi.org/10.1145/2207676.2208589 AI-assisted Code Completion System. In Proceedings of the 25th ACM SIGKDD
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: International Conference on Knowledge Discovery & Data Mining (KDD ’19).
Pre-training of Deep Bidirectional Transformers for Language Understanding. [30] Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. 2022. Expectation vs.
arXiv:1810.04805 [cs.CL] Experience: Evaluating the Usability of Code Generation Tools Powered by Large
[16] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Language Models. In Extended Abstracts of the 2022 CHI Conference on Human
Ruiqi Zhong, Wen tau Yih, Luke Zettlemoyer, and Mike Lewis. 2023. InCoder: A Factors in Computing Systems (CHI EA ’22).
Generative Model for Code Infilling and Synthesis. arXiv:2204.05999 [cs.SE] [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
[17] Nuno M. Guerreiro, Duarte Alves, Jonas Waldendorf, Barry Haddow, Alexandra Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You
Birch, Pierre Colombo, and André F. T. Martins. 2023. Hallucinations in Large Need. arXiv:1706.03762 [cs.CL]
Multilingual Translation Models. arXiv:2303.16104 [cs.CL] [32] Michael Williams and Tami Moser. 2019. The art of coding and thematic explo-
[18] Kajta Hofmann, Bhaskar Mitra, Filip Radlinski, and Milad Shokouhi. 2014. An ration in qualitative research. International Management Review 15, 1 (2019), 45–
Eye-Tracking Study of User Interactions with Query Auto Completion (CIKM 55.
’14). Association for Computing Machinery, New York, NY, USA, 549–558. https: [33] Wen Zhou, Seohyun Kim, Vijayaraghavan Murali, and Gareth Ari Aye. 2022.
//doi.org/10.1145/2661829.2661922 Improving Code Autocompletion with Transfer Learning. In 2022 IEEE/ACM 44th
International Conference on Software Engineering: Software Engineering in Practice
(ICSE-SEIP). 161–162. https://doi.org/10.1145/3510457.3513061
11

CodeCompose - A Large-Scale Industrial Deployment of AI-assisted Code Authoring

Uploaded by

Copyright:

Available Formats

You might also like

CodeCompose - A Large-Scale Industrial Deployment of AI-assisted Code Authoring

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CodeCompose - A Large-Scale Industrial Deployment of AI-assisted Code Authoring

Uploaded by

Copyright:

Available Formats

CodeCompose: A Large-Scale Industrial Deployment of

AI-assisted Code Authoring

Michael Bolin Daniel Cheng Negar Ghorbani

Meta Platforms Inc. Meta Platforms Inc. Meta Platforms Inc.

Renuka Fernandez Nachiappan Nagappan

(a) 2.1 Trust

metadata Select a random subsequence

def bubbleSort(array): 3 sequences of tokens

Figure 2: Steps to construct an input to the model in LCM with an example.

Exact Match BLEU score

4.4 Telemetry potential biases from early adopters. We instrumented telemetry to

5 EVALUATION: OBSERVATIONS AND

Language # Suggestions shown Acceptance rate Percentage of code # Users

User feedback Category Description # of responses

You might also like