Figure 2: An illustration of the replaced token detection objective. Both the NL and code generators are language models, which generate plausible tokens for masked positions based on the surrounding context. The NL-Code discriminator is the targeted pre-trained model, which is trained by detecting plausible alternative tokens sampled from the NL and PL generators. The NL-Code discriminator is used for producing general-purpose representations in the fine-tuning step. Both the NL and code generators are thrown out in the fine-tuning step.
2019; Sun et al., 2019). We apply masked language modeling on bimodal data of NL-PL pairs. The second objective is replaced token detection (RTD), which further uses a large amount of unimodal data, such as codes without paired natural language texts. Detailed hyper-parameters for model pre-training are given in Appendix B.1.

Objective #1: Masked Language Modeling (MLM) Given a datapoint of an NL-PL pair (x = {w, c}) as input, where w is a sequence of NL words and c is a sequence of PL tokens, we first select a random set of positions to mask out for both NL and PL (i.e. m^w and m^c, respectively), and then replace the selected positions with a special [MASK] token. Following Devlin et al. (2018), 15% of the tokens from x are masked out.
m_i^w \sim \mathrm{unif}\{1, |w|\} \quad \text{for } i = 1 \text{ to } |w| \qquad (1)
m_i^c \sim \mathrm{unif}\{1, |c|\} \quad \text{for } i = 1 \text{ to } |c| \qquad (2)
w^{masked} = \mathrm{REPLACE}(w, m^w, [MASK]) \qquad (3)
c^{masked} = \mathrm{REPLACE}(c, m^c, [MASK]) \qquad (4)
x = w + c \qquad (5)

The MLM objective is to predict the original tokens which are masked out, formulated as follows, where p^{D_1} is the discriminator which predicts a token from a large vocabulary:

L_{MLM}(\theta) = \sum_{i \in m^w \cup m^c} -\log p^{D_1}(x_i \mid w^{masked}, c^{masked}) \qquad (6)
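As a concrete illustration of Equations (1)-(5), the following sketch (our own illustrative code, not the authors' implementation; the token lists and the 15% mask rate applied per segment are assumptions) masks random positions in an NL-PL pair.

import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, rng=random):
    # Eq. (1)/(2): sample a random set of positions to mask out.
    k = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), k))
    # Eq. (3)/(4): REPLACE the selected positions with [MASK].
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions

nl = "return the maximum value of the list".split()
pl = "def get_max ( xs ) : return max ( xs )".split()
w_masked, m_w = mask_tokens(nl)
c_masked, m_c = mask_tokens(pl)
x = w_masked + c_masked   # Eq. (5): the concatenated model input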
Objective #2: Replaced Token Detection (RTD) In the MLM objective, only bimodal data (i.e. datapoints of NL-PL pairs) is used for training. Here we present the objective of replaced token detection. The RTD objective (Clark et al., 2020) was originally developed for efficiently learning a pre-trained model for natural language. We adapt it to our scenario, with the advantage of using both bimodal and unimodal data for training. Specifically, there are two data generators here, an NL generator p^{G_w} and a PL generator p^{G_c}, both for generating plausible alternatives for the set of randomly masked positions:

\hat{w}_i \sim p^{G_w}(w_i \mid w^{masked}) \quad \text{for } i \in m^w \qquad (7)
\hat{c}_i \sim p^{G_c}(c_i \mid c^{masked}) \quad \text{for } i \in m^c \qquad (8)
w^{corrupt} = \mathrm{REPLACE}(w, m^w, \hat{w}) \qquad (9)
c^{corrupt} = \mathrm{REPLACE}(c, m^c, \hat{c}) \qquad (10)
x^{corrupt} = w^{corrupt} + c^{corrupt} \qquad (11)

The discriminator is trained to determine whether a word is the original one or not, which is a binary classification problem. It is worth noting that the RTD objective is applied to every position in the input, and it differs from a GAN (generative adversarial network) in that if a generator happens to produce the correct token, the label of that token is "real" instead of "fake" (Clark et al., 2020). The loss function of RTD with regard to the discriminator parameterized by \theta is given below, where \delta(i) is an indicator function and p^{D_2} is the discriminator that predicts the probability of the i-th word being original:

L_{RTD}(\theta) = \sum_{i=1}^{|w|+|c|} \Big( \delta(i) \log p^{D_2}(x^{corrupt}, i) + \big(1 - \delta(i)\big)\big(1 - \log p^{D_2}(x^{corrupt}, i)\big) \Big) \qquad (12)

\delta(i) = \begin{cases} 1, & \text{if } x_i^{corrupt} = x_i \\ 0, & \text{otherwise} \end{cases} \qquad (13)
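The corruption and labeling steps of Equations (9)-(13) can be sketched as follows (illustrative code only; sample_replacement is a stand-in for the generators described next, and the toy inputs are our own).

def corrupt(tokens, masked_positions, sample_replacement):
    # Eq. (9)/(10): fill each masked position with a generator sample.
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = sample_replacement(tokens, i)
    # RTD labels: a position counts as "real" (1) whenever the token equals the
    # original one, even if it was sampled by the generator (Clark et al., 2020).
    labels = [int(corrupted[i] == tokens[i]) for i in range(len(tokens))]
    return corrupted, labels

# Toy generator that always proposes the same replacement token.
w_corrupt, w_labels = corrupt(
    ["return", "the", "maximum", "value"],
    {2},
    lambda toks, i: "minimum",
)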
There are many different ways to implement the generators. In this work, we implement two efficient n-gram language models (Jurafsky, 2000) with bidirectional contexts, one for NL and one for PL, and learn them from the corresponding unimodal datapoints, respectively. The approach is easily generalized to learn bimodal generators, or to use more complicated generators such as Transformer-based neural architectures learned in a joint manner; we leave these to future work. The PL training data is the unimodal codes as shown in Table 1, and the NL training data comes from the documentations in the bimodal data. One could easily extend these two training datasets to a larger amount. The final loss function is given below:

\min_\theta \; L_{MLM}(\theta) + L_{RTD}(\theta) \qquad (14)
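As an illustration of what such a generator might look like, here is a minimal bidirectional bigram-style sketch under our own simplifying assumptions (a real implementation would use higher-order n-grams and smoothing).

from collections import Counter, defaultdict

class BidirectionalBigramGenerator:
    """Scores a candidate for a masked position by how often it follows the
    left neighbour and precedes the right neighbour in the training corpus."""

    def __init__(self, sequences):
        self.after = defaultdict(Counter)    # after[left][token]
        self.before = defaultdict(Counter)   # before[right][token]
        for seq in sequences:
            for left, tok in zip(seq, seq[1:]):
                self.after[left][tok] += 1
            for tok, right in zip(seq, seq[1:]):
                self.before[right][tok] += 1

    def propose(self, left, right):
        # Combine left-to-right and right-to-left evidence for the masked slot.
        scores = Counter()
        scores.update(self.after[left])
        scores.update(self.before[right])
        return scores.most_common(1)[0][0] if scores else "[UNK]"

gen = BidirectionalBigramGenerator([["return", "max", "(", "xs", ")"],
                                    ["return", "min", "(", "xs", ")"]])
print(gen.propose("return", "("))   # a plausible filler between "return" and "("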
We then evaluate CodeBERT in a setting where the parameters of CodeBERT are fixed (§4.2). Finally, we evaluate CodeBERT on a generation problem, i.e. code documentation generation (§4.3), and further evaluate on a programming language which is never seen in the training phase (§4.4).

4.1 Natural Language Code Search
Given a natural language as the input, the objective of code search is to find the most semantically related code from a collection of codes. We conduct experiments on the CodeSearchNet corpus (Husain et al., 2019). We follow the official evaluation metric and calculate the Mean Reciprocal Rank (MRR) for each pair of test data (c, w) over a fixed set of 999 distractor codes. We further calculate the macro-average MRR over all languages as an overall evaluation metric. It is helpful to note that this metric differs from the AVG metric in the original paper, where the answer is retrieved from candidates from all six languages. We fine-tune a language-specific model for each programming language. We train each model with a binary classification loss function, where a softmax layer is connected to the representation of [CLS]. Both the training and validation datasets are created in a way that positive and negative samples are balanced. Negative samples consist of a balanced number of instances with randomly replaced NL (i.e. (c, ŵ)) and PL (i.e. (ĉ, w)). Detailed hyper-parameters for model fine-tuning are given in Appendix B.2.
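As a reference point for the evaluation protocol just described, the following is a small sketch of MRR over a gold code and its 999 distractors (the scores are assumed to come from the fine-tuned classifier; the function names are our own).

def mean_reciprocal_rank(examples):
    """examples: iterable of (gold_score, distractor_scores) pairs,
    one per (c, w) test pair; distractor_scores holds the 999 distractors."""
    total = 0.0
    count = 0
    for gold_score, distractor_scores in examples:
        rank = 1 + sum(score > gold_score for score in distractor_scores)
        total += 1.0 / rank
        count += 1
    return total / count

def macro_average(mrr_by_language):
    """Macro-average MRR: average the per-language MRR values."""
    return sum(mrr_by_language.values()) / len(mrr_by_language)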
Table 2: Results on natural language code retrieval. Baselines include four joint embeddings of NL and PL (first group), RoBERTa, and RoBERTa continuously trained with masked language modeling on codes only (second group). PT stands for pre-training. We train CodeBERT (third group) with different settings, including different initializations (from scratch (INIT=S) or initialized with the parameters of RoBERTa (INIT=R)) and different learning objectives (MLM, RTD, or the combination of both).
CodeBERT performs better than RoBERTa and the model pre-trained with code only. CodeBERT (MLM) learned from scratch performs better than RoBERTa. Unsurprisingly, initializing CodeBERT with RoBERTa improves the performance.[6]

[6] We further give a learning curve of different pre-trained models in the fine-tuning process in Appendix C.

4.2 NL-PL Probing
In the previous subsection, we show the empirical effectiveness of CodeBERT in a setting where the parameters of CodeBERT are fine-tuned on downstream tasks. In this subsection, we further investigate what type of knowledge is learned in CodeBERT without modifying the parameters.

Task Formulation and Data Construction Following the probing experiments in NLP (Petroni et al., 2019; Talmor et al., 2019), we study NL-PL probing here. Since there is no existing work towards this goal, we formulate the problem of NL-PL probing and create the dataset ourselves. Given an NL-PL pair (c, w), the goal of NL-PL probing is to test the model's ability to correctly predict/recover the masked token of interest (either a code token ci or a word token wj) among distractors. There are two major types of distractors: one is the whole target vocabulary used for the masked language modeling objective (Petroni et al., 2019), and the other has fewer candidates which are filtered or curated based on experts' understanding of the ability to be tested (Talmor et al., 2019). We follow the second direction and formulate NL-PL probing as a multi-choice question answering task, where the question is cloze-style in which a certain token is replaced by [MASK] and the distractor candidate answers are curated based on our expertise. Specifically, we evaluate on the NL side and the PL side, respectively. To ease the effort of data collection, we collect data automatically from NL-PL pairs in both the validation and testing sets of CodeSearchNet, both of which are unseen in the pre-training phase. To evaluate on the NL side, we select NL-PL pairs whose NL documentations include one of six keywords (max, maximize, min, minimize, less, greater), and group them into four candidates by merging the first two keywords and the middle two keywords. The task is to ask pre-trained models to select the correct one instead of the three other distractors. That is to say, the input in this setting includes the complete code and a masked NL documentation, and the goal is to select the correct answer from the four candidates. For the PL side, we select codes containing the keywords max and min, and formulate the task as a two-choice answer selection problem. Here, the input includes the complete NL documentation and a masked PL code, and the goal is to select the correct answer from the two candidates. Since code completion is an important scenario, we would also like to test the model's ability to predict the correct token merely based on preceding PL contexts. Therefore, we add an additional setting for the PL side, where the input includes the complete NL documentation and the preceding PL code. Data statistics are given in the top two rows of Table 3.
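The NL-side data construction just described can be sketched as follows (our own illustrative code; the keyword groups and candidate set come from the description above, while tokenization and the [MASK] convention are assumptions).

KEYWORD_TO_GROUP = {
    "max": "max/maximize", "maximize": "max/maximize",
    "min": "min/minimize", "minimize": "min/minimize",
    "less": "less", "greater": "greater",
}
CANDIDATES = ["max/maximize", "min/minimize", "less", "greater"]

def make_nl_probing_instance(doc_tokens, code_tokens):
    """Turn an NL-PL pair into a four-choice cloze question on the NL side."""
    for i, token in enumerate(doc_tokens):
        group = KEYWORD_TO_GROUP.get(token.lower())
        if group is not None:
            masked_doc = doc_tokens[:i] + ["[MASK]"] + doc_tokens[i + 1:]
            return {
                "input": (masked_doc, code_tokens),  # masked doc plus complete code
                "choices": CANDIDATES,
                "answer": group,
            }
    return None  # the documentation contains none of the six keywords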
Table 3: Statistics of the data for NL-PL probing and the performance of different pre-trained models on RUBY, JAVASCRIPT, GO, PYTHON, JAVA, PHP, and ALL. Accuracies (%) are reported. Best results in each group are in bold.
Model Comparisons Results are given in Table 3. We report accuracy, namely the number of correctly predicted instances over the number of all instances, for each programming language. Since the data in different programming languages are extremely unbalanced, we report the accumulated metric in the same way. We use CodeBERT (MLM) here because its output layer naturally fits probing. Results show that CodeBERT performs better than the baselines on almost all languages on both NL and PL probing. The numbers with only preceding contexts are lower than those with bidirectional contexts, which suggests that code completion is challenging. We leave it as future work.

We further give a case study on PL-NL probing. We mask an NL token and a PL token separately, and report the predicted probabilities of RoBERTa and CodeBERT. Figure 3 illustrates the example of a Python code.[7] We can see that RoBERTa fails in both cases, whereas CodeBERT makes the correct prediction in both the NL and PL settings.

Figure 3: Case study on the Python language. Masked tokens in NL (in blue) and PL (in yellow) are separately applied. Predicted probabilities of RoBERTa and CodeBERT are given.

"Transforms a vector np.arange(-N, M, dx) to np.arange( min (|vec|), max(N,M), dx)]"   <- masked NL token: min

def vec_to_halfvec(vec):
    d = vec[1:] - vec[:-1]
    if ((d/d.mean()).std() > 1e-14) or (d.mean() < 0):
        raise ValueError('vec must be np.arange() in increasing order')
    dx = d.mean()
    lowest = np.abs(vec).min()    # <- masked PL token: min
    highest = np.abs(vec).max()
    return np.arange(lowest, highest + 0.1*dx, dx).astype(vec.dtype)

                        max       min       less     greater
NL  RoBERTa             96.24%    3.73%     0.02%    0.01%
    CodeBERT (MLM)      39.38%    60.60%    0.02%    0.0003%
PL  RoBERTa             95.85%    4.15%     -        -
    CodeBERT (MLM)      0.001%    99.999%   -        -

[7] The example comes from https://github.com/peri-source/peri/blob/61beed5deaaf978ab31ed716e8470d86ba639867/peri/comp/psfcalc.py#L994-L1002
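To make the probing protocol concrete, here is a sketch of how candidates can be compared with a masked language model's output distribution (the mlm_probabilities helper is hypothetical; in practice it would wrap CodeBERT (MLM), whose parameters stay fixed).

def choose_answer(mlm_probabilities, masked_input, candidate_tokens):
    """Return the candidate that receives the highest probability at the [MASK]
    position. Merged candidates such as max/maximize can be scored by taking
    the best of their member tokens."""
    distribution = mlm_probabilities(masked_input)
    return max(candidate_tokens, key=lambda tok: distribution.get(tok, 0.0))

def probing_accuracy(mlm_probabilities, instances):
    """instances: list of (masked_input, candidate_tokens, answer) triples."""
    correct = sum(
        choose_answer(mlm_probabilities, masked_input, candidates) == answer
        for masked_input, candidates, answer in instances
    )
    return correct / len(instances)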
4.3 Code Documentation Generation
Although the pre-training objective of CodeBERT does not include generation-based objectives (Lewis et al., 2019), we would like to investigate to what extent CodeBERT performs on generation tasks. Specifically, we study code-to-NL generation, and report results for the documentation generation task on the CodeSearchNet Corpus in six programming languages. Since the generated documentations are short and higher-order n-grams may not overlap, we remedy this problem by using a smoothed BLEU score (Lin and Och, 2004).

Model Comparisons We compare our model with several baselines, including an RNN-based model with an attention mechanism (Sutskever et al., 2014), the Transformer (Vaswani et al., 2017), RoBERTa, and the model pre-trained on code only. To demonstrate the effectiveness of CodeBERT on code-to-NL generation tasks, we adopt various pre-trained models as encoders and keep the hyper-parameters consistent. Detailed hyper-parameters are given in Appendix B.3.
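For reference, a smoothed sentence-level BLEU of the kind used above can be computed with NLTK; this sketch is a stand-in under our own assumptions and is not necessarily the exact evaluation script used here.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def smoothed_bleu4(reference_doc, generated_doc):
    """Sentence-level BLEU-4 with smoothing, so short documentations whose
    higher-order n-grams do not overlap still receive a graded score."""
    smoothing = SmoothingFunction().method2   # add-1 smoothing for higher-order n-grams
    return sentence_bleu(
        [reference_doc.split()],
        generated_doc.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=smoothing,
    )

print(smoothed_bleu4("returns the maximum value", "return the maximum value"))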
Table 4 (columns: MODEL, RUBY, JAVASCRIPT, GO, PYTHON, JAVA, PHP, OVERALL): per-language results for the code-to-documentation generation task.
Table 4 shows the results with different models for the code-to-documentation generation task. As we can see, models pre-trained on programming language outperform RoBERTa, which illustrates that pre-training models on programming language could improve code-to-NL generation. Besides, the results in Table 4 show that CodeBERT pre-trained with the RTD and MLM objectives brings a gain of 1.3 BLEU score over RoBERTa overall and achieves state-of-the-art performance.[8]

[8] We further give some output examples in Appendix E.

4.4 Generalization to Programming Languages NOT in Pre-training
We would like to evaluate CodeBERT on a programming language which is never seen in the pre-training step. To this end, we study the task of generating a natural language summary of a C# code snippet. We conduct experiments on the dataset of CodeNN (Iyer et al., 2016),[9] which consists of 66,015 pairs of questions and answers automatically collected from StackOverflow. This dataset is challenging since its scale is orders of magnitude smaller than the CodeSearchNet Corpus. We evaluate models using the smoothed BLEU-4 score and use the same evaluation scripts as Iyer et al. (2016).

[9] https://github.com/sriniiyer/codenn

Table 5: Code-to-NL generation on C# language.
MODEL                                   BLEU
MOSES (Koehn et al., 2007)              11.57
IR                                      13.66
SUM-NN (Rush et al., 2015)              19.31
2-layer BiLSTM                          19.78
Transformer (Vaswani et al., 2017)      19.68
TreeLSTM (Tai et al., 2015)             20.11
CodeNN (Iyer et al., 2016)              20.53
code2seq (Alon et al., 2019)            23.04
RoBERTa                                 19.81
Pre-train w/ code only                  20.65
CodeBERT (RTD)                          22.14
CodeBERT (MLM)                          22.32
CodeBERT (MLM+RTD)                      22.36

Model Comparisons Table 5 shows that our model with the MLM and RTD pre-training objectives achieves a 22.36 BLEU score and improves by 2.55 points over RoBERTa, which illustrates that CodeBERT could generalize better to another programming language which is never seen in the pre-training step. However, our model achieves slightly lower results than code2seq (Alon et al., 2019). The main reason could be that code2seq makes use of compositional paths in its abstract syntax tree (AST) while CodeBERT only takes the original code as input. We have trained a version of CodeBERT by traversing the tree structure of the AST following a certain order, but applying that model does not bring improvements on generation tasks. This shows a potential direction to improve CodeBERT by incorporating the AST.

5 Conclusion
In this paper, we present CodeBERT, which to the best of our knowledge is the first large bimodal pre-trained model for natural language and programming language. We train CodeBERT on both bimodal and unimodal data, and show that fine-tuning CodeBERT achieves state-of-the-art performance on downstream tasks including natural language code search and code-to-documentation generation. To further investigate the knowledge embodied in pre-trained models, we formulate the task of NL-PL probing and create a dataset for probing. We regard the probing task as a cloze-style answer selection problem, and curate distractors for both the NL and PL parts. Results show that, with model parameters fixed, CodeBERT performs better than RoBERTa and a continuously trained model using codes only.

There are many potential directions for further research in this field. First, one could learn better generators with bimodal evidence or more complicated neural architectures to improve the replaced token detection objective. Second, the loss functions of CodeBERT mainly target NL-PL understanding tasks. Although CodeBERT achieves strong BLEU scores on code-to-documentation generation, CodeBERT itself could be further improved by generation-related learning objectives.
How to successfully incorporate the AST into the pre-training step is also an attractive direction. Third, we plan to apply CodeBERT to more NL-PL related tasks, and extend it to more programming languages. Flexible and powerful domain/language adaptation methods are necessary to generalize well.

Acknowledgments
Xiaocheng Feng is the corresponding author of this work. We thank the anonymous reviewers for their insightful comments. Zhangyin Feng, Xiaocheng Feng, Bing Qin and Ting Liu are supported by the National Key R&D Program of China via grant 2018YFB1005103 and the National Natural Science Foundation of China (NSFC) via grants 61632011 and 61772156.
References

Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating sequences from structured representations of code. In International Conference on Learning Representations.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pages 933-944. IEEE.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2073-2083.

Dan Jurafsky. 2000. Speech & Language Processing. Pearson Education India.

Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2019. Pre-trained contextual embedding of source code. arXiv preprint arXiv:2001.00059.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: A method for evaluating automatic evaluation metrics for machine translation. In Proceedings of the 20th International Conference on Computational Linguistics, page 501. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13-23.

Bhaskar Mitra, Nick Craswell, et al. 2018. An introduction to neural information retrieval. Foundations and Trends in Information Retrieval, 13(1):1-126.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. VideoBERT: A joint model for video and language representation learning. arXiv preprint arXiv:1904.01766.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2019. oLMpics - on what language model pre-training captures. arXiv preprint arXiv:1912.13283.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
A Data Statistics
Data statistics of the training/validation/testing splits for the six programming languages are given in Table 6.

Table 6: Data statistics about the CodeSearchNet Corpus for natural language code search.
CODESEARCH    TRAINING    DEV      TESTING
GO            635,635     28,483   14,291
JAVA          908,886     30,655   26,909
JAVASCRIPT    247,773     16,505   6,483
PHP           1,047,406   52,029   28,391
PYTHON        824,342     46,213   22,176
RUBY          97,580      4,417    2,279

B Train Details
B.1 Pre-training
We train CodeBERT on one NVIDIA DGX-2 machine using FP16. It combines 16 interconnected NVIDIA Tesla V100s with 32GB memory. We use the following set of hyper-parameters to train models: the batch size is 2,048 and the learning rate is 5e-4. We use Adam to update the parameters and set the number of warmup steps to 10K. We set the max length to 512 and the max number of training steps to 100K. Training 1,000 batches of data costs 600 minutes with the MLM objective and 120 minutes with the RTD objective.
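For reference, the pre-training configuration above can be summarized as a plain dictionary (the key names are our own and this is not a released config file).

PRETRAIN_CONFIG = {
    "hardware": "1 x NVIDIA DGX-2 (16 x Tesla V100 32GB), FP16",
    "batch_size": 2048,
    "learning_rate": 5e-4,
    "optimizer": "Adam",
    "warmup_steps": 10_000,
    "max_sequence_length": 512,
    "max_training_steps": 100_000,
}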
B.2 CodeSearch
In the fine-tuning step, we set the learning rate to 1e-5, the batch size to 64, the max sequence length to 200 and the max number of fine-tuning epochs to 8. As in pre-training, we use Adam to update the parameters. We choose the model that performs best on the development set, and use it to evaluate on the test set.

B.3 Code Summarization on Six Programming Languages
We use a Transformer with 6 layers, 768-dimensional hidden states and 12 attention heads as our decoder in all settings. We set the max lengths of the input and the inference as 256 and 64, respectively. We use the Adam optimizer to update model parameters. The learning rate and the batch size are 5e-5 and 64, respectively. We tune hyper-parameters and perform early stopping on the development set.

B.4 Code Summarization on C#
Since state-of-the-art methods use an RNN as their decoder, we choose a 2-layer GRU with an attention mechanism as our decoder for comparison. We fine-tune models using a grid search over the following set of hyper-parameters: the batch size is in {32, 64} and the learning rate is in {2e-5, 5e-5}. We report the numbers when models achieve the best performance on the development set.
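A minimal PyTorch sketch of the Transformer decoder configuration described in B.3 (6 layers, 768-dimensional hidden states, 12 heads); wiring it to a CodeBERT encoder is omitted and the variable names are our own.

import torch
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=768, nhead=12)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# Toy shapes: (target_length, batch, hidden) and (source_length, batch, hidden).
target_embeddings = torch.zeros(64, 2, 768)   # max inference length 64
encoder_states = torch.zeros(256, 2, 768)     # max input length 256
output = decoder(tgt=target_embeddings, memory=encoder_states)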
C Learning Curve of CodeSearch
From Figure 4, we can see that CodeBERT performs better at the early stage, which reflects that CodeBERT provides good initialization for learning downstream tasks.
Figure 4: Learning curve of different pre-trained models in the fine-tuning step. We show results on Python and Java (dev accuracy against the number of epochs, for RoBERTa, CodeBERT, and the model pre-trained with code only).

D Late Fusion
In Section 4.1, we show that CodeBERT performs well in the setting where natural languages and codes have early interactions. Here, we investigate whether CodeBERT is good at working as a unified encoder. We apply CodeBERT to natural language code search in a late fusion setting, where CodeBERT first encodes NL and PL separately, and then calculates the similarity by a dot-product. In this way, code search is equivalent to finding the nearest codes in the shared vector space. This scenario also facilitates the use of CodeBERT in an online system, where the representations of codes are calculated in advance. At runtime, a system only needs to compute the representation of the NL and the vector-based dot-products. We fine-tune CodeBERT with the following objective, which maximizes the dot-product of the ground truth while minimizing the dot-products of distractors:

-\frac{1}{N} \sum_i \log \frac{\exp\big(\mathrm{Enc}(c_i)^\top \mathrm{Enc}(w_i)\big)}{\sum_j \exp\big(\mathrm{Enc}(c_j)^\top \mathrm{Enc}(w_i)\big)} \qquad (15)

Table 7: Results on natural language code search by late fusion.
MODEL                     RUBY     GO
RoBERTa                   0.0043   0.0030
Pre-train w/ code only    0.1648   0.4179
CodeBERT                  0.6870   0.8372

Late fusion performs better than RoBERTa and the model pre-trained with code only. And late fusion performs comparably with the standard way. What's more, late fusion is more efficient, and this setting could be used in an online system.
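A sketch of this late-fusion setup and the objective in Equation (15), assuming the NL and PL vectors come from CodeBERT run separately on each side and that in-batch codes serve as the distractors (illustrative code, not the authors' implementation).

import torch
import torch.nn.functional as F

def late_fusion_loss(nl_vectors, pl_vectors):
    """nl_vectors, pl_vectors: (N, d) tensors whose i-th rows form a ground-truth pair.
    Eq. (15): maximize the dot-product of each pair against in-batch distractors."""
    logits = nl_vectors @ pl_vectors.t()        # (N, N) matrix of dot-products
    targets = torch.arange(nl_vectors.size(0))
    return F.cross_entropy(logits, targets)     # -1/N * sum of log-softmax terms

def search(nl_vector, code_vectors):
    """Online retrieval: code representations are pre-computed, so a query only
    needs one NL encoding plus dot-products; returns code indices, best first."""
    scores = code_vectors @ nl_vector
    return torch.argsort(scores, descending=True)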
E Case Study
To qualitatively analyze the effectiveness of CodeBERT, we give some cases for the code search and code documentation generation tasks. Considering the limited space, we only give the top-2 results of the query for the Python programming language. As shown in Figure 5, the search results are very relevant to the query. Figure 6 and Figure 7 show the outputs of different models for the code documentation generation task. As we can see, CodeBERT performs better than all baselines.