
Summarize and Generate to Back-translate:

Unsupervised Translation of Programming Languages

Anonymous EACL submission

Abstract

Back-translation is widely known for its effectiveness in neural machine translation when little to no parallel data is available. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. The target-to-source model generates noisy sources, while the source-to-target model is trained to reconstruct the targets, and vice versa. Recent developments of multilingual pre-trained sequence-to-sequence models for programming languages have been very effective for a broad spectrum of downstream software engineering tasks. Hence, it is compelling to train them to build programming language translation systems via back-translation. However, these models cannot be further trained via back-translation since they learn to output sequences in the same language as the inputs during pre-training. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate natural language (NL) summaries given code snippets. In code generation, the model learns to do the opposite. Therefore, target-to-source generation in back-translation can be viewed as target-to-NL-to-source generation. We show that our proposed approach performs competitively with state-of-the-art methods. We have made the code publicly available.1

Input in Java

    public static boolean generator(PsiBuilder b, int l) {
        boolean r = false;
        if (!recursion_guard(b, l)) return r;
        r = generator_0(b, l + 1);
        if (!r) return generator_1(b, l + 1);
    }

(1) Java to Python Generation

    public static boolean generator(PsiBuilder b, int l) {
        boolean r = false;
        if (!recursion_guard(b, l)) return r;
        if (!generator_0(b, l)) return r;
        r = generator_0(b, l + 1);
        if (!r) return generator_1(b, l + 1);
    }

(2) Java Code to NL Summary

    guard is used to determine if a generator is already defined.

(3) NL Summary to Python Code

    def is_generator(self, name):
        if name in self._generators:
            return True
        if name in self._generators[name]:
            return True
        return False

Figure 1: Although PLBART is asked to generate in Python given input in Java (1), it generates in Java (due to its pre-training objective). In contrast, PLBART fine-tuned on code summarization and generation generates "noisy" translations, as in (2) and (3).

1 Introduction

Choice of programming language (PL) in software development depends on the requirements of the software and the available features of a particular PL. In modern API-driven software development, the choice of language often depends on the availability of libraries and APIs. The advent of newer and richer programming languages often requires legacy software to be translated into modernized PLs. In theory, the "Turing completeness" of modern programming languages allows rule-based translation of programs from one PL to another. However, rule-based translation may require an extensive number of handwritten transformation rules and can end up producing very unreadable source code. In addition, such translation could entail translating the entire library, even if a library implementing similar functionality is available in the target PL.

Aligning libraries and APIs across different PLs is a non-trivial task. Recent progress in Neural Machine Translation (NMT) (Bahdanau et al., 2015; Vaswani et al., 2017) leveraging pre-trained models (Feng et al., 2020a; Guo et al., 2021; Roziere et al., 2021; Ding et al., 2022; Ahmad et al., 2021a; Wang et al., 2021) could be a possible way to learn the alignment between PLs and translate source code across languages.

1 https://github.com/anonym-nlp/sg2bt

(a) PLBART (b) PLBART + S&G

Figure 2: T-SNE plot of function embeddings of Java and Python functions. Figure 2a shows the embeddings generated by the PLBART model. Figure 2b shows the embeddings generated when PLBART is fine-tuned to jointly summarize code to NL and generate code from NL (PLBART + S&G). While PLBART clusters programs from each individual PL, the same programs in different PLs are brought closer to each other by PLBART + S&G.

A significant challenge in supervised learning for NMT is the need for large-scale parallel corpora. For instance, if we are planning to train a translator for Java to Python translation, we need a considerable number of programs exhibiting the same semantic behavior in both languages. The availability of such parallel datasets is a vital challenge in programming language translation (Chen et al., 2018). Back-Translation (BT) (Edunov et al., 2018; Lachaux et al., 2020) is a clever way to learn alignments across different languages. While BT demonstrates success in NMT, it requires either (i) a small (perhaps noisy) parallel dataset or (ii) a model with some capacity for cross-lingual generation to kickstart the BT-based learning process.

In this research, we investigate the suitability of multilingual Pre-trained Sequence-to-Sequence Models (PSMs), e.g., PLBART (Ahmad et al., 2021a), for unsupervised programming language translation via BT. In particular, we assume a use case scenario where no parallel data is available. Without much of a surprise, we empirically found that, while these PSMs are good at generating code in each language, they exhibit little to no knowledge about cross-lingual generation, since such PSMs are typically trained to reconstruct code sequences from noisy inputs. For example, when we provide the input code in Figure 1 to PLBART and ask it to generate Python code without any training, it generates a slight variation of the input Java code, showing its lack of knowledge about cross-lingual generation.

To endow such PSMs with knowledge about cross-lingual generation, we propose the usage of a third language (i.e., English). Since a large quantity of monolingual code corpora comes with documentation, which supposedly describes what the source code is doing, we train a Summarize-and-Generate (S&G) model that can generate pseudo-parallel code sequences. Figure 1 shows PLBART's behavior when it is further trained via S&G. First, given the Java code, it generates an NL summary (Figure 1-2), and subsequently generates Python code (Figure 1-3). We empirically show that, even if such an S&G model generates noisy parallel sequences, it allows us to employ PSMs in BT-based training to learn programming language translation.

In summary, we present a Summarize-and-Generate (S&G) based approach to enable unsupervised program translation training of PLBART via Back-Translation (BT). Experimental results show that our proposed approach makes PLBART trainable via BT and performs competitively with state-of-the-art program translation models.2

2 We have made our code publicly available at https://github.com/hidden/hidden.

2 Motivation

Recent years saw several Pre-trained Sequence-to-Sequence models (PSMs) (Ahmad et al., 2021a; Wang et al., 2021). These models are pre-trained on hundreds of gigabytes of source code. Thus, we are motivated to investigate their adoption in learning program translation via back-translation in this work. To understand the feasibility, we investigate the program representations generated by the PSM. As a case study, we chose PLBART (Ahmad et al., 2021a) and evaluated its multilingual embeddings as suggested in Artetxe and Schwenk (2019).

We find the parallel Java function for each of the 948 Python functions using the parallel dataset proposed in Lachaux et al. (2020). We find the nearest neighbor using cosine similarity between function embeddings and calculate the error rate. Unsurprisingly, PLBART performs poorly in function retrieval, with an 87.5% error rate.

In comparison, we fine-tune PLBART jointly on code summarization and generation in Java and Python. Repeating the function retrieval experiment, we find that fine-tuned PLBART's error rate drops to 23.7%. To visually illustrate the embeddings produced by PLBART and its fine-tuned variant, we provide a T-SNE plot of 8 sample functions' embeddings in Figure 2. We see that functions belonging to the same language are clustered together, while the same functions in two different languages are far apart from each other (see Figure 2a).

In contrast, the fine-tuned PLBART breaks up the intra-language clusters and brings functions in different languages close to each other in the embedding space (see Figure 2b). These results motivate us to initialize the translation models for back-translation with PLBART fine-tuned on code summarization and generation, as it has learned some alignment across programming languages.
3 Approach

Sequence-to-sequence models, such as PLBART (Ahmad et al., 2021a), CodeT5 (Wang et al., 2021), and SPT-Code (Niu et al., 2022), map source code sequences into a shared multilingual space by pre-training on multiple programming languages jointly using unlabeled data (e.g., source code from Github). The pre-training objective of these models is either denoising autoencoding (DAE) or fill-in-the-blank, where the models reconstruct the original code snippet or predict the missing code tokens given a corrupted code snippet. Although pre-trained jointly on many languages, these models only learn to generate in the same language as the input. As a result, these models are not trainable via back-translation (BT) to learn programming language translation in an unsupervised fashion. As an alternative, we propose translating to and from natural language to perform back-translation between two programming languages. We refer to translating to and from natural language as code summarization and code generation, respectively. Our proposal is motivated by the availability of bimodal data, i.e., source code and its summaries, which is used to train code summarization and generation models.

3.1 Code Summarization and Generation

Source code documentation (e.g., docstrings or comments) written by software developers is available along with source code on a large scale. Such documentation has been the key source to form code summarization datasets (Wan et al., 2018; Hu et al., 2018; LeClair and McMillan, 2019; Husain et al., 2019), and to study natural language (NL) to code generation (Parvez et al., 2021). It is conceivable that we can use a code summarization and generation model to translate programming languages. Such a model would first generate an NL summary from an input code in the source language and then generate code in the target language from the previously generated NL summary. As we show in the evaluation, such an approach does not work well in practice (see Table 2); however, code summarization and generation models are viable proxies to generate noisy translations. This enables us to train PLBART, which begins by generating noisy translations and further learns to improve in a self-supervised fashion when trained via back-translation. Formally, we jointly train PLBART in a supervised setting to learn code summarization (S) and generation (G):

    S = TRAIN_{Code→Summary}(P_{c,s})
    G = TRAIN_{Summary→Code}(P_{c,s})          (1)

where P_{c,s} is estimated using the code-to-text benchmark from CodeXGlue (Lu et al., 2021). We follow Tang et al. (2021) to perform multilingual fine-tuning of PLBART (in Java and Python) to learn S and G.
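As a concrete illustration of how the joint S and G objectives in Eq. (1) can share one model, the sketch below builds bidirectional training examples from code-summary pairs by tagging each example with its target language, in the spirit of multilingual fine-tuning (Tang et al., 2021). The field names and language tags are illustrative assumptions, not the exact format used by PLBART.

    # Hypothetical bimodal records: (language, code, summary) triples taken from a
    # code-to-text corpus such as CodeXGLUE; the record layout is an assumption.
    bimodal = [
        ("java",   "static int add(int a, int b) { return a + b; }", "add two integers"),
        ("python", "def add(a, b):\n    return a + b",               "add two integers"),
    ]

    def build_sg_examples(records):
        """Turn each code-summary pair into two seq2seq examples:
        one for summarization (S) and one for generation (G)."""
        examples = []
        for lang, code, summary in records:
            # S: code -> NL summary (target language tag: English)
            examples.append({"source": code, "target": summary, "target_lang": "en"})
            # G: NL summary -> code (target language tag: the programming language)
            examples.append({"source": summary, "target": code, "target_lang": lang})
        return examples

    for ex in build_sg_examples(bimodal):
        print(ex["target_lang"], "<-", ex["source"][:40])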
3.2 Back-translation

Back-translation (BT) is one of the most popular approaches to unsupervised machine translation (Artetxe et al., 2018b; Lample et al., 2018a,b). In this approach, we leverage monolingual data in an unsupervised fashion. BT jointly trains a source-to-target model coupled with a backward target-to-source model. The target-to-source model translates target sequences into the source language, producing noisy sources corresponding to the ground truth target sequences. The source-to-target model is then trained to generate the targets from the noisy sources, and vice versa. The two models are trained in parallel until convergence.

[Figure 3 illustration: a Python function paired with its NL summary ("Compute the area of the circle inscribed in a hexagon with side length a"), used to Train S and Train G; and a Java function (nextPowerOf2) paired with its Python counterpart, produced by applying (S, G), f_k, or b_k and used to Train f_k and Train b_k.]
Figure 3: Overview of our proposed back-translation framework to train PLBART. In the first m steps (out of N total training steps), we use a multilingual code summarization and generation model (S, G) to perform back-translation. In the remaining N − m steps, PLBART is trained via the standard back-translation method.

This training procedure is widely known as online back-translation and is the focus of this work.

Back-translation uses a target-to-source model to generate noisy sources and trains a source-to-target model to reconstruct the targets. Specifically, in each step k (a mini-batch update), back-translation performs the following:

    P_k^{(f)} = {(x, f_{k−1}(x)) | x ∈ D_source}
    b_k = TRAIN_{target→source}(P_k^{(f)})
    P_k^{(b)} = {(b_k(y), y) | y ∈ D_target}          (2)
    f_k = TRAIN_{source→target}(P_k^{(b)})

Here, D_source and D_target represent unlabeled data in the source and target languages, and TRAIN indicates standard sequence-to-sequence training.

Generally, training via back-translation starts from a forward (f_0) and a backward (b_0) model that is trained using parallel data (small gold-standard or large-scale but noisy). Then an extensive collection of unlabeled data is used to train the translation models. In this work, we assume there is no parallel data available across programming languages. We initialize the forward and backward models with the pre-trained language model, PLBART. As mentioned before, PLBART cannot generate code in a language different from the input (not even noisy code); see, for example, Figure 1-1. Therefore, we propose jointly fine-tuning PLBART on code summarization and generation on multiple programming languages in a supervised setting, and then using the resulting model to initialize the forward and backward models (f_0, b_0) for back-translation.

3.3 Summarize–Generate to Back-translate

The recent advancements of pre-trained sequence-to-sequence models on programming languages enable us to use them to initialize the source-to-target (f) and target-to-source (b) models for back-translation. Presumably, such pre-trained models should facilitate the learning process during training. Yet, their pre-training objective, i.e., reconstruction of the original input from a noisy source, limits their ability to generate code snippets across languages (as shown in Figure 1). For example, PLBART as f(·) and b(·) would reconstruct the input, resulting in f_{k−1}(x) ≈ x and b_k(y) ≈ y. As a result, the models will learn to merely copy the input sequences rather than translate them.

To this end, we propose to make use of available parallel data between programming and natural languages to fine-tune PLBART and then use its parameters to initialize the source-to-target (f) and target-to-source (b) models for back-translation. Consequently, we revise the back-translation training method outlined in Eq. (2) to follow a two-step generation process: code-to-summary generation in natural language followed by summary-to-code generation in the source language. Formally, the first m steps (k ≤ m) of back-translation are performed as:

    P_k^{(f)} = {(x, G(S(x))) | x ∈ D_source}
    P_k^{(b)} = {(G(S(y)), y) | y ∈ D_target}          (3)

We find that the noisy parallel sequences3 generated via summarization and generation commence the learning process.

3 The output sequences are still noisy since the code summarization and generation models are not highly accurate, although trained in a supervised fashion.

Algorithm 1 Training Procedure
Input: Monolingual (unlabeled) data D_source and D_target; number of initial steps m; number of total steps I; code summarizer S(·,·); code generator G(·,·); parameters θ to initialize the forward and backward translation models f(·,·) and b(·,·).
Output: Final model parameters θ.
1: for k = 0, ..., I do
2:     y ← (y_s ∼ D_source) ∪ (y_t ∼ D_target)
3:     if k ≤ m then
4:         x_nl ∼ S(·|y)            ▷ code-to-summary
5:         x̂ ∼ G(·|x_nl)            ▷ summary-to-code
6:     else
7:         x̂ ← (x_s ∼ b(·|y_t)) ∪ (x_t ∼ f(·|y_s))
8:     Update θ by maximizing the log-likelihood of f(x̂_s, y_t) and b(x̂_t, y_s)

                          Java       Python
Github - unimodal data
Nb of functions           7.2 M      8.3 M
Nb of tokens              752 M      665 M
CodeNet - unimodal data
Nb of functions           0.42 M     0.15 M
Nb of tokens              47.3 M     17.0 M
CodeXGlue - bimodal data
Nb of functions           164,923    251,818
Nb of tokens              21.2 M     44.3 M

Table 1: Statistics of the data used to train PLBART at different stages in this work. Bimodal data refers to parallel function-summary pairs, while unimodal data refers to monolingual (and unparallel) functions.

The overall idea of our proposed framework is illustrated in Figure 3, and Algorithm 1 describes the training procedure. Note that we find it sufficient to apply our proposed summarization-and-generation based back-translation only for the first m steps; as the source-to-target and target-to-source models gradually learn to translate, the standard back-translation training is reinstated.

286 4 Experiment Setup to this pretraining paradigm, the identifiers (class, 316
function, and variable names) in code snippets are 317
287 4.1 Models and Baselines obfuscated, and a model is trained to recover the 318

288 Our model Our proposed approach can be ap- original names. DOBF shares the exact same neural 319

289 plied to pre-trained sequence-to-sequence mod- architecture as TransCoder. We report the evalua- 320

290 els, e.g., PLBART (Ahmad et al., 2021a) and tion performances of TransCoder and DOBF from 321

291 CodeT5 (Wang et al., 2021). In this work, we the official code release by Lachaux et al. (2020).7 322

292 chose PLBART4 to perform experiments and show


4.2 Evaluation Dataset and Metrics 323
293 the effectiveness of our proposed framework.
Evaluation Dataset Lachaux et al. (2020) pro- 324
294 Baseline Models posed an evaluation dataset composed of parallel 325

295 j2py is a framework that translates Java source functions in Java, Python, and C++ languages. The 326

296 code to Python.5 It follows handwritten rules man- dataset consists of 464 Java to Python and 482 327

297 ually built using expert knowledge. Python to Java test examples, where each example 328
is accompanied by 10 unit test cases. 329
298 Summarize-and-Generate (S&G) performs
299 code-to-code translation via two steps, code-to- Evaluation Metrics 330
300 summary and summary-to-code generation. We BLEU measures n-gram overlap between a gen- 331
301 evaluate the S&G model (as in Eq. (1)) that is used erated translation and a collection of reference 332
302 to perform code summarization and generation in translations (Papineni et al., 2002). 333
303 our proposed framework to train PLBART via BT.
6
We compare TransCoder and PLBART in terms of model
4
Since its pretraining implementation is publicly available architecture and training setup in the Appendix D.
at https://github.com/wasiahmad/PLBART. 7
https://github.com/facebookresearch/CodeGen/
5
https://github.com/natural/java2python blob/main/docs/transcoder.md#results).

Exact Match (EM) represents the percentage of generated translations that exactly match the collection of reference translations.

CodeBLEU measures grammatical and logical correctness in addition to n-gram overlap between generated and reference translations (Ren et al., 2020). CodeBLEU is defined as a weighted sum of n-gram match, weighted n-gram match,8 syntax match (based on the AST), and data-flow match.
Computational Accuracy (CA), proposed by Lachaux et al. (2020), assesses functional correctness; a translated code is considered correct if it passes a set of unit tests. It evaluates whether a generated function produces the same outputs as the reference when given the same set of inputs. This metric overcomes the shortcomings of match-based metrics (e.g., BLEU, CodeBLEU) by accounting for program-execution behavior (Lachaux et al., 2020; Chen et al., 2021).
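As an illustration of how such an execution-based check can be scored, the sketch below runs a translated Python function and its reference on the same unit-test inputs and counts a translation as correct only if every output matches. It is a simplified stand-in (no sandboxing, time-outs, or Java execution) rather than the evaluation harness of Lachaux et al. (2020).

    def computational_accuracy(translations, references, test_inputs):
        """translations/references: lists of Python source strings, each defining f(...);
        test_inputs: one list of argument tuples per function pair."""
        def load(src):
            scope = {}
            exec(src, scope)           # caution: for illustration only, not sandboxed
            return scope["f"]
        correct = 0
        for hyp_src, ref_src, inputs in zip(translations, references, test_inputs):
            try:
                hyp, ref = load(hyp_src), load(ref_src)
                if all(hyp(*args) == ref(*args) for args in inputs):
                    correct += 1
            except Exception:
                pass                   # runtime or compilation errors count as incorrect
        return correct / len(references)

    ref = "def f(n):\n    return n * n"
    hyp = "def f(n):\n    return n + n"       # an incorrect translation
    print(computational_accuracy([hyp], [ref], [[(0,), (2,), (3,)]]))  # prints 0.0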
thors’ provided code.14 Subsequently, we further 397
353 4.3 Training Datasets and Preprocessing
train PLBART via back-translation as described 398
354 Code Summarization and Generation Lu et al. in Algorithm 1. We set I = 10, 000 and tuned 399
355 (2021) curated a code summarization dataset con- m = 200.15 We train PLBART using 8 Nvidia 400
356 sisting of code and summary pairs based on the GeForce RTX 2080 Ti GPUs, and the effective 401
357 CodeSearchNet dataset (Husain et al., 2019). We batch size is maintained at 1024 instances at both 402
358 use this dataset in Java and Python program- training stages. We optimize PLBART with the 403
359 ming languages to train the code-to-summary and Adam optimizer (Kingma and Ba, 2015), a learn- 404
360 summary-to-code generation models. ing rate of 10e-4, and use a polynomial learning 405
rate decay scheduling. The best models are selected 406
361 Back-translation (BT) For BT training (as dis-
based on the validation BLEU scores. We imple- 407
362 cussed in § 3.3), we use the GitHub public dataset
ment our approach in Fairseq (Ott et al., 2019) and 408
363 available on Google BigQuery (Hoffa, 2016).9 We
use float16 operations to speed up training. 409
364 first deduplicate10 the GitHub dataset at the pro-
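For the Python side, the standard-library tokenizer mentioned above can be used directly; a minimal example (our own illustration, not the paper's preprocessing script) is shown below. Tokenizing Java with tree_sitter follows the same idea but goes through its parser API.

    import io
    import tokenize

    def tokenize_python(source: str) -> list[str]:
        """Return the token strings of a Python function, dropping structural markers."""
        tokens = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type in (tokenize.ENCODING, tokenize.ENDMARKER, tokenize.NL):
                continue
            tokens.append(tok.string)
        return tokens

    src = "def next_power_of_2(n):\n    count = 0\n    while n != 0:\n        n >>= 1\n        count += 1\n    return 1 << count\n"
    print(tokenize_python(src))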
4.4 Implementation Details

We jointly train PLBART on code summarization and generation in Java and Python using the authors' provided code.14 Subsequently, we further train PLBART via back-translation as described in Algorithm 1. We set I = 10,000 and tuned m = 200.15 We train PLBART using 8 Nvidia GeForce RTX 2080 Ti GPUs, and the effective batch size is maintained at 1024 instances at both training stages. We optimize PLBART with the Adam optimizer (Kingma and Ba, 2015) and a learning rate of 10e-4, and use polynomial learning rate decay scheduling. The best models are selected based on the validation BLEU scores. We implement our approach in Fairseq (Ott et al., 2019) and use float16 operations to speed up training.

Decoding During inference, we use beam search decoding (Koehn, 2004) to generate multiple translations using PLBART. We chose greedy search (beam size 1) as the default decoding scheme for validation and evaluation. However, following Lachaux et al. (2020), we report two sets of results for the computational accuracy (CA) metric: CA@n B=n, the percentage of functions with at least one correct translation in the beam (of size n), and CA@1 B=n, the percentage of functions where the hypothesis in the beam with the highest log-probability is a correct translation.

8 Different weights are assigned to n-grams such that keywords (e.g., for, while) have higher weights.
9 https://console.cloud.google.com/marketplace/product/github/github-repos
10 We used a hash-based data deduplication method.
11 https://github.com/tree-sitter
12 https://docs.python.org/3/library/tokenize.html
13 Standalone functions can be used without instantiating a class. In Java, this corresponds to static methods, and in Python, it corresponds to functions outside classes.
14 https://github.com/wasiahmad/PLBART/tree/main/multilingual
15 We tuned m in the range [100, 1000] with 100 steps.

                                  Java → Python                    Python → Java
Models                       BLEU   EM    CodeBLEU   CA       BLEU   EM    CodeBLEU   CA
j2py*                          -     -       -       38.3       -     -       -        -
TransCoder*                  68.1   3.7      -       46.9     64.6   0.8      -       32.6
TransCoder w/ DOBF*            -     -       -       49.2       -     -       -       40.4
S&G (1)                       7.6   0.0     15.8      0.2     12.4   0.0     16.3      0.2
PLBART (this work)
  trained via BT             31.2   0.0     36.6      0.0     31.7   0.0     32.1      0.0
  trained via BT (via S&G)   64.2   2.8     63.4     40.4     64.1   2.1     65.9     31.9

Table 2: Evaluation results of the baseline models and our proposed framework using greedy decoding. * indicates the updated scores reported in the official code repository of Lachaux et al. (2020). Note that the TransCoder and PLBART models have 312M and 140M parameters, respectively.

                   TransCoder   PLBART
Java → Python
CA@1 B=1              46.9       40.4
CA@1 B=10             48.8       41.8
CA@5 B=5              60.0       47.7
CA@10 B=10            64.4       50.3
Python → Java
CA@1 B=1              32.6       31.9
CA@1 B=10             36.0       34.5
CA@5 B=5              44.3       45.1
CA@10 B=10            51.1       50.0

Table 3: Computational accuracy (CA@m) with beam search decoding and comparison between TransCoder and PLBART. TransCoder's performances are reported from Lachaux et al. (2020). The value B indicates the beam size. CA@m B=n means that we use beam decoding to generate n translations and select the top m translations based on their log-probability scores.
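To make the CA@m B=n notation concrete, the following is a small sketch (our own illustration) that scores beam outputs already sorted by log-probability, given a per-candidate correctness check such as the unit-test harness sketched earlier:

    def ca_at_m(beam_outputs, is_correct, m):
        """beam_outputs: one list of candidate translations per test function,
        sorted by descending log-probability; is_correct(function_id, candidate) -> bool.
        Returns the fraction of functions with a correct candidate among the top m."""
        hits = 0
        for fid, candidates in enumerate(beam_outputs):
            if any(is_correct(fid, cand) for cand in candidates[:m]):
                hits += 1
        return hits / len(beam_outputs)

    # CA@1 B=10 keeps only the highest-scoring candidate of a beam of 10,
    # while CA@10 B=10 accepts a function if any of the 10 candidates passes.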
5 Results and Analysis

5.1 Main Result

Table 2 shows the performance of our proposed approach and the baseline models on both Java to Python and Python to Java translation. We begin by comparing PLBART directly used in back-translation (BT) training with our proposed approach (the last block in Table 2). Since PLBART does not know how to generate across languages, when the model is trained via BT, it only learns to copy the input sources. As a result, PLBART scores 0% EM and 0% CA, while achieving 30+ BLEU and CodeBLEU scores. In contrast, following our proposed approach of summarizing and generating to back-translate, PLBART trained via BT (via S&G) achieves 40.4% and 31.9% CA scores. This performance is competitive with the state-of-the-art translation system, TransCoder.16 We further compare them using beam search decoding in Table 3.

Overall, the experimental results confirm our conjecture that pre-trained sequence-to-sequence models cannot be effectively used in BT training directly; however, training via S&G empowers them to generate across languages and to be further trained via BT to learn programming language translation.

16 Note that, while comparing PLBART with TransCoder on translation performance, their differences (shown in Table 9) should be taken into consideration.

5.2 Analysis

Summarize and generate to create parallel data Our proposed approach generates parallel code sequences on the fly (online) for training. An alternative to our approach is to use a code summarization and generation model to create parallel code sequences offline and warm-start PLBART for back-translation-based training. We compare these two approaches in Table 4, and the results show that both perform comparably. However, it is essential to note that the online setting gives us flexibility, as we can tune the number of initial steps (m in Algorithm 1). In contrast, the offline setting requires generating a sufficiently large number of parallel code sequences for effective training.

Impact of in-domain training data The evaluation dataset comprises solutions to programming problems involving data structure and algorithm concepts. While GitHub offers large-scale unlabeled data, most of its code belongs to software projects that use APIs and advanced functionalities. Therefore, we utilize an alternative dataset, CodeNet, collected from two online judge websites. We refer to this dataset as in-domain since its nature aligns with the evaluation dataset (data-structure- and algorithm-focused problems aggregated from GeeksforGeeks).

                            Java to Python                    Python to Java
Approach               BLEU   EM    CodeBLEU   CA        BLEU   EM    CodeBLEU   CA
Warm-start w/ PD       60.5   2.8     61.1     41.9      62.6   2.4     65.9     32.0
Proposed approach      64.2   2.8     63.4     40.4      64.1   2.1     65.9     31.9

Table 4: Comparison between PLBART warm-started using parallel data (PD) and our approach to summarize and generate to back-translate on the fly during the initial steps of back-translation training.

                            Java to Python                    Python to Java
Data Source            BLEU   EM    CodeBLEU   CA        BLEU   EM    CodeBLEU   CA
Github                 64.2   2.8     63.4     40.4      64.1   2.1     65.9     31.9
CodeNet                65.6   3.1     64.7     50.9      65.1   2.5     68.5     46.5

Table 5: PLBART evaluation results when our proposed framework uses data from Github (available via BigQuery (Hoffa, 2016)) and competitive programming sites (available via CodeNet (Puri et al., 2021)).

We compare in-domain data usage with GitHub data for BT-based training. The results in Table 5 show that the use of in-domain data significantly boosts the performance in both translation directions. A detailed error analysis reveals that such a performance boost is due to a reduction in TypeError. We speculate that in-domain data have similarities in data type usage that help the model. Due to the page limit, we present more findings of the error analysis and qualitative examples in the Appendix.

6 Related Work

Programming Language Translation Translating programs or source code across different programming languages (PLs) requires a profound understanding of the PLs. Having strictly defined syntax and semantics, PLs are suitable for phrase-based statistical machine translation (Nguyen et al., 2013; Karaivanov et al., 2014; Aggarwal et al., 2015). Chen et al. (2018) introduced a tree-to-tree machine translation model to translate programs and to learn the syntactic alignment between the source and target PLs. Recently proposed pre-trained programming language models have shown promising results in translating programs across PLs (Feng et al., 2020b; Guo et al., 2021; Ahmad et al., 2021a,b). However, these approaches require a set of parallel programs to train the encoder-decoder model.

The recently proposed TransCoder (Lachaux et al., 2020) shows initial success in unsupervised program translation, eliminating the requirement of bimodal data. It achieves this by jointly training a model using XLM (Conneau and Lample, 2019), denoising autoencoding (DAE) (Vincent et al., 2008), and back-translation (BT) (Lample et al., 2018a). This work empirically investigates the suitability of adopting BT to train existing pre-trained encoder-decoder models and proposes an alternative via summarization and generation.

Unsupervised Machine Translation via Back-translation Gathering sufficiently large parallel corpora has been a major challenge for Machine Translation (MT) (Guzmán et al., 2019). Several research efforts have been invested in learning MT using monolingual data (Artetxe et al., 2018a,b; Lachaux et al., 2020) to solve this problem. For example, Gulcehre et al. (2015) proposed integrating a language model into the decoder. He et al. (2016) proposed Neural MT (NMT) as a bidirectional and dual learning task. More recent advancements in unsupervised MT leverage back-translation (BT) (Sennrich et al., 2016; Lample et al., 2018a,b). In back-translation, the target-to-source model generates noisy sources given target sequences, and the source-to-target model is then trained to reconstruct the targets, and vice versa. While BT has been widely adopted for unsupervised NMT, it has also been used in other applications (Zhu et al., 2017; Hoffman et al., 2018; Shen et al., 2017; Yang et al., 2018; Zhang et al., 2019).

7 Conclusion

In this research, we show that pre-trained sequence-to-sequence models (e.g., PLBART) are not suitable for direct adaptation via back-translation to learn to translate. To address the issue, we propose to use code summarization and generation as an alternative way of performing back-translation. We show that our proposed approach turns PLBART into a translation model that performs competitively with existing unsupervised translation models.
Limitations

One of the risks of using our developed translation model is that it was trained on the GitHub dataset, which may contain information that uniquely identifies an individual or offensive content. Since we are developing the translation model for research purposes only, we believe our usage of the GitHub data does not violate its licensing terms and conditions. While we do not present it as a justification, the PLBART model was pre-trained on GitHub data that may include sensitive information. As our intention is to develop a programming language translation model, it is unlikely to generate sensitive information unless such information is provided as input.

References

Karan Aggarwal, Mohammad Salameh, and Abram Hindle. 2015. Using machine translation for converting Python 2 to Python 3 code. Technical report, PeerJ PrePrints.

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021a. Unified pre-training for program understanding and generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2668, Online. Association for Computational Linguistics.

Wasi Uddin Ahmad, Md Golam Rahman Tushar, Saikat Chakraborty, and Kai-Wei Chang. 2021b. AVATAR: A parallel corpus for Java-Python program translation. arXiv preprint arXiv:2108.11590.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. Unsupervised statistical machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3632–3642, Brussels, Belgium. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018b. Unsupervised neural machine translation. In International Conference on Learning Representations.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Xinyun Chen, Chang Liu, and Dawn Song. 2018. Tree-to-tree neural networks for program translation. In Advances in Neural Information Processing Systems 31, pages 2547–2557. Curran Associates, Inc.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, and Saikat Chakraborty. 2022. Towards learning (dis)-similarity of source code from program contrasts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6300–6312, Dublin, Ireland. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020a. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020b. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.

Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang, et al. 2021. GraphCodeBERT: Pre-training code representations with data flow. In International Conference on Learning Representations.

Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6098–6111, Hong Kong, China. Association for Computational Linguistics.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Felipe Hoffa. 2016. GitHub on BigQuery: Analyze all the open source code.

Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. 2018. CyCADA: Cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1989–1998. PMLR.

Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018. Summarizing source code with transferred API knowledge. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 2269–2275. International Joint Conferences on Artificial Intelligence Organization.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.

Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. 2014. Phrase-based statistical translation of programming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, pages 173–184.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 115–124, Washington, USA. Springer.

Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, and Guillaume Lample. 2020. Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems, volume 33, pages 20601–20611. Curran Associates, Inc.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049, Brussels, Belgium. Association for Computational Linguistics.

Alexander LeClair and Collin McMillan. 2019. Recommendations for datasets for source code summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3931–3937, Minneapolis, Minnesota. Association for Computational Linguistics.

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664.

Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N. Nguyen. 2013. Lexical statistical machine translation for language migration. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pages 651–654.

Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, and Bin Luo. 2022. SPT-Code: Sequence-to-sequence pre-training for learning the representation of source code. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). Association for Computing Machinery.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval augmented code generation and summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2719–2734, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladmir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. 2021. Project CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1.

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297.

Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, and Guillaume Lample. 2021. DOBF: A deobfuscation pre-training objective for programming languages. In Advances in Neural Information Processing Systems.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems 30.

Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, and Angela Fan. 2021. Multilingual translation from denoising pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3450–3466, Online. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103.

Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S. Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, pages 397–407, New York, NY, USA. Association for Computing Machinery.

Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8696–8708, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discriminators. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 7298–7309.

Zhirui Zhang, Shuo Ren, Shujie Liu, Jianyong Wang, Peng Chen, Mu Li, Ming Zhou, and Enhong Chen. 2019. Style transfer as unsupervised machine translation. In Thirty-Third AAAI Conference on Artificial Intelligence.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232.
Supplementary Material: Appendices

              TransCoder   PLBART
Java → Python
#Tests            464        464
Error             149        146
Failure            93        123
Success           218        188
EM                 17         24
Timeout             4          7
Python → Java
#Tests            482        482
Error             201        212
Failure           118        108
Success           157        154
EM                  6          2
Timeout             6          8

Table 6: Detailed results of computational accuracy using greedy decoding for Java ↔ Python translation.

A Analysis of Computational Accuracy

Table 6 shows the breakdown of computational accuracies for Java-to-Python and Python-to-Java translation for TransCoder and our proposed approach using PLBART. We execute the generated function and match the output w.r.t. the expected output. TransCoder results in 149 error cases, 93 failure cases, and 218 success cases in Java-to-Python translation, with 17 solutions matching the ground truth. In contrast, PLBART results in 146 error cases, 123 failure cases, and 188 success cases. Out of these 188 successes in PLBART, 24 solutions exactly match the target solution.

For Python-to-Java translation, TransCoder results in 201 errors, 118 failures, and 157 successes, out of which 6 are an exact match. On the other hand, in the case of PLBART, there are 212 error cases, 108 failure cases, and 154 success cases, out of which 2 exactly match the target solution.

Error Category               TransCoder   PLBART
#Errors (Java → Python)          149         146
  Compilation                      -           -
  Runtime                        149         146
    TypeError                     47          61
    IndexError                    18          20
    NameError                     17          16
    ValueError                    11          15
    UnboundLocalError             13          11
    Others                        17          14
    SyntaxError                   26           9
#Errors (Python → Java)          201         212
  Compilation                    151         180
    TypeError                     89         108
    CantFindSymbol                23          30
    SyntaxError                   14          25
    BadOperand                    15          12
    Others                        10           5
  Runtime                         50          27
    IndexOutOfBoundsE.            40          15
    NumberFormatE.                 5           6
    NullPointerE.                  2           3
    Others                         3           3

Table 7: Category of errors made by the TransCoder and PLBART translation models. The error categories are sorted based on PLBART's error count in the respective category. In the Python → Java runtime error categories, "E." stands for "Exception".

B Error Analysis

We further analyze the error cases for TransCoder and our proposed approach using PLBART. Since Python is an interpreted language, syntactic and semantic errors are caught at runtime. Thus, we categorize all errors for Java-to-Python translation as runtime errors. Table 7 shows the errors in both Java-to-Python and Python-to-Java translation. While PLBART is susceptible to TypeError, TransCoder is disproportionately susceptible to SyntaxError. In the case of Python-to-Java translation, PLBART exhibits more compilation errors, but TransCoder exhibits more runtime errors. The most common type of compilation error in both TransCoder and PLBART is TypeError. The most common runtime error in Python-to-Java translation is IndexOutOfBoundsException for both models, where TransCoder exhibits more than twice the number of such errors as PLBART.

Finally, we identified the top five error categories (which account for 123 errors out of 146) exhibited by PLBART in Java-to-Python translation and analyzed the error messages. In most cases, TypeError and ValueError are due to a mismatch in the underlying data types of variables. Table 8 shows the detailed statistics of different error types, sub-types, and their frequencies.

Error Category                                                              Count
Type Error                                                                     61
  list indices must be integers or slices, not A                               18
  A object does not support item assignment                                    13
  A object cannot be interpreted as an integer                                  8
  unsupported/bad operand type(s)                                              10
  int object is not iterable/callable/subscriptable                             6
  Others                                                                        6
Index Error                                                                    20
  B index out of range                                                         19
  Others                                                                        1
Name Error                                                                     16
  name C is not defined                                                        16
Value Error                                                                    15
  not enough values to unpack                                                   7
  too many values to unpack                                                     3
  the truth value of an array with more than one element is ambiguous           3
  Others                                                                        2
Unbound Local Error                                                            11
  local variable D referenced before assignment                                11

Table 8: Analyzing the five most frequent error cases (123 out of 146) encountered in PLBART generated Java to Python translation. A and B indicate {bool, int, tuple, str, range} and {string, list}, respectively. C and D indicate identifier (class, function, variable) names.

C Qualitative Examples

Figure 4 shows an example of Java-to-Python translation by PLBART. The translated code is both syntactically and semantically correct, i.e., our compiler could successfully parse and build the translated code. It passed 2 test cases out of 10 when executed. The translated code is slightly different from the input Java code. In particular, line 13 in the input Java code is a loop that iterates backward (in decreasing order). However, line 12 in the generated Python code iterates forward (in increasing order). If the generated Python code used range(c - 1, 0, -1) instead of range(c - 1), it would pass all the test cases. We attribute such behavior to the fact that range(*) is a much more frequent pattern than range(*, 0, -1) in Python code.
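The behavioral difference between the two calls is easy to see in isolation; the example below (our own illustration) shows why the translated loop visits the columns in the wrong order:

    c = 5
    print(list(range(c - 1)))         # forward:  [0, 1, 2, 3]
    print(list(range(c - 1, 0, -1)))  # backward: [4, 3, 2, 1]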
D TransCoder vs. PLBART

As we consider TransCoder the primary baseline for our proposed approach using PLBART, for the sake of fairness, we compare them in terms of model structure and training setup. Table 9 presents the comparison. TransCoder and PLBART both use the Transformer (Vaswani et al., 2017) architecture, but TransCoder is twice as large a model as PLBART. Both models have gone through a two-stage training process. In Stage 1, TransCoder is pre-trained via MLM using 920B tokens, while PLBART is pre-trained via DAE using 87B tokens. In Stage 2, TransCoder leverages 625M tokens and is jointly trained via DAE and BT. In comparison, PLBART is trained via BT using 430M tokens.

Why does TransCoder not suffer from the same-language generation issue? In Stage 1 pre-training, TransCoder only trains the Transformer encoder and then initializes a decoder with the encoder's parameters, while the cross-attention sub-layers are randomly initialized. We speculate that such random initialization leaves TransCoder unbiased towards generating in the same language as the input. Moreover, PLBART uses a language ID token as a prefix to generate in the target language. We noticed that PLBART's decoder disregards the prefix token if not fine-tuned to generate in the target language. On the other hand, TransCoder uses language embeddings with each token in the input. Intuitively, this does not allow TransCoder's decoder to ignore the language information.

Input function in Java

 1  static int sortedCount(int mat[][],
 2                         int r, int c) {
 3    int result = 0;
 4    for (int i = 0; i < r; i++) {
 5      int j;
 6      for (j = 0; j < c - 1; j++)
 7        if (mat[i][j + 1] <= mat[i][j])
 8          break;
 9      if (j == c - 1) result++;
10    }
11    for (int i = 0; i < r; i++) {
12      int j;
13      for (j = c - 1; j > 0; j--)
14        if (mat[i][j - 1] <= mat[i][j])
15          break;
16      if (c > 1 && j == 0) result++;
17    }
18    return result;
19  }

Translated function in Python

 1  def sorted_count(mat, r, c):
 2      result = 0
 3      for i in range(r):
 4          j = 0
 5          for j in range(c - 1):
 6              if mat[i][j + 1] <= mat[i][j]:
 7                  break
 8          if j == c - 1:
 9              result += 1
10      for i in range(r):
11          j = c - 1
12          for j in range(c - 1):
13              if mat[i][j - 1] <= mat[i][j]:
14                  break
15          if c > 1 and j == 0:
16              result += 1
17      return result

Figure 4: An example of Java to Python translation by PLBART that passes 2 out of 10 unit test cases. Line no. 13 (marked in green) in the Java function is incorrectly translated into Python (line no. 12, marked in red). Replacing the range function parameter "(c - 1)" with "(c - 1, 0, -1)" would make the translated function pass all the test cases.

                            TransCoder                 PLBART
#layers (encoder)                    6                      6
#layers (decoder)                    6                      6
#heads                               8                     12
Model dim                         1024                    768
Vocab size                      64,000                 50,000
Total parameters                 312 M                  140 M
Stage 1: Pre-training
  Objective                        MLM                    DAE
  Total tokens                   920 B                   87 B
  Token types                      BPE          Sentencepiece
  Languages          Java, Python, C++  Java, Python, English
Stage 2: Training
  Objective                     DAE+BT                     BT
  Total tokens                   625 M                  430 M
  Token types                      BPE          Sentencepiece
  Languages          Java, Python, C++           Java, Python

Table 9: TransCoder vs. PLBART.

For example, with position index "0" and language ID "Python", TransCoder is more likely to generate the "def" token and less likely to generate "static" or "int", since they do not appear in the Python language. In essence, unlike PLBART, TransCoder does not suffer from the issue of sequence-to-sequence models being unable to generate across languages.
