Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2842–2851
August 1–6, 2021. ©2021 Association for Computational Linguistics
(a) Source Code (b) AST
Figure 3: Overview of our proposed model.
add sibling edges to ASTs as shown in Figure 1c (red dotted lines).

In the case of Python, the Python AST parser we used does not create a node for the function name, unlike the Java AST parser. Since the function name is a very important keyword for generating summaries, we add the function name to the FunctionDef node as a child in the Python AST, as shown in Figure 2. Then, we modify Python ASTs by adding sibling edges.

3 Proposed Model

We propose a model that uses graph convolution layers and Transformer layers to summarize source code. To encode both the structural and the sequential information of code, we combine the two kinds of layers.

Figure 3 shows the overview of our model. A given snippet is represented as a modified AST (mAST). The initial representation of the nodes in the mAST is generated by the embedding layer. Since the embedding layer generates representations considering only the nodes themselves, we use graph convolution to capture the structural information. We apply graph convolution to each node in the mAST. Then we have node representations that reflect the structural features as well as the node features.

The graph-convolved mASTs are converted into sequences by pre-order traversal, and the sequences are given to the Transformer encoder. Since structurally neighboring nodes have similar representations, the Transformer encoder can effectively capture not only the sequential features but also structural features such as statements or blocks of source code. After the Transformer encoder generates the representations by reflecting the sequential and structural information, the Transformer decoder generates summaries.

3.1 Graph Convolution Network

The graph convolutional network (Kipf and Welling, 2017) is a graph neural network that represents each node based on the features of its neighborhood in graph data. In this paper, graph convolution layers are used to capture the structural information of mASTs. Since the mAST extracted from a given code C is a graph, we denote it as G(C) = {V, E}, where V is a set of nodes and E is a set of edges. Initially, the nodes in V are one-hot encoded tokens, which are then mapped into representation vectors X by the embedding layer.

Given the representation X of the nodes and the edges E, new representations of the nodes are calculated by the graph convolution layers as follows:

H^(0) = X,  X ∈ R^(n×d)

H^(l+1) = σ(A H^(l) W^(l)),  W^(l) ∈ R^(d×d)

where A is the adjacency matrix, W^(l) is the graph convolution weight matrix of the l-th layer, σ is the activation function, n is the total number of nodes in an mAST, and d is the embedding dimension. The node features produced by the graph convolution layers are denoted as H. In the experiments, the dimension of the weight matrix in a graph convolution is d = 512.

3.2 Transformer Encoder-Decoder

After the graph convolution layers, the mAST is converted into a sequence by pre-order traversal. The pre-order traversal is applied to the original AST, not the mAST, because mASTs are transformed
into graphs by adding sibling edges. The mAST is used to obtain the structural representation of nodes by considering nodes and their neighbors. Since the original AST contains the original structure of the source code, we use it to obtain a sequence.

The Transformer encoder and decoder follow the graph convolution layers. The sequence of mAST nodes is fed into the Transformer encoder. The Transformer architecture is good at capturing long-term dependencies in a sequence. Since we use graph convolutions, which generate similar representations for structurally neighboring nodes, the Transformer encoder can easily capture dependencies between nodes in the same code block, between similar code blocks, and between code blocks at the same level. As a result, the Transformer encoder can generate new representation vectors that well reflect the sequential and structural information.

Next, the Transformer decoder generates the summary tokens from the vectors generated by the Transformer encoder. In the experiments, the dimension of nodes and summary tokens is d_model = 512. The Transformer encoder and decoder are each composed of a stack of N = 6 layers.

4 Experiment

We perform various experiments to show the superiority of our model for source code summarization.

4.1 Setup

Datasets We evaluate our model on a Java dataset (Hu et al., 2018b) and a Python dataset (Wan et al., 2018). The statistics of the experimental datasets are shown in Table 1. We used the Java parser of Alon et al. (2019) and the Python parser of Wan et al. (2018) to extract the abstract syntax trees of the code.

Dataset                          Java      Python
Train                            69,708    55,538
Valid                            8,714     18,505
Test                             8,714     18,502
Unique non-leaf nodes in ASTs    106       54
Unique leaf nodes in ASTs        57,372    101,229
Unique tokens in summaries       46,895    56,189
Avg. nodes in AST                131.72    104.11
Avg. tokens in summary           17.73     9.48

Table 1: Statistics of the Java and Python datasets.

Metrics We adopt three performance metrics: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin, 2004).

Baselines We compare our model with baseline models based on sequential information (Iyer et al., 2016; Hu et al., 2018a,b; Wei et al., 2019; Ahmad et al., 2020) and based on structural information (Eriguchi et al., 2016; Wan et al., 2018). We refer to the baseline results reported by Ahmad et al. (2020).

Hyper-parameters We set the maximum sequence length to 200, and the vocabulary sizes for code and summary to 50,000 and 30,000, respectively. We train our proposed model with the Adam optimizer (Kingma and Ba, 2015). The mini-batch size and dropout rate are 32 and 0.2. We set the maximum number of training epochs to 200 and use early stopping. We use beam search at inference time, with a beam size of 4.

4.2 Quantitative Result

Overall Result Table 2 shows the overall performance of the models. We present three proposed models: AST-Only, AST+GCN, and mAST+GCN.

AST-Only is the proposed model without graph convolution layers. The model converts code snippets into ASTs and does not include graph convolutions; the sequenced AST nodes are given to the Transformer encoder and decoder. We present this model to verify how effective ASTs are for source code summarization. It performs better than all baselines except the TransRel model of Ahmad et al. (2020). This result shows that the AST, which carries more structural information about the source code, is better than the plain code for source code summarization.

AST+GCN is the proposed model with graph convolution layers but without the AST modification (adding sibling edges). Code snippets are converted into ASTs, and node representations are generated by graph convolutions. The sequenced graph-convolved AST nodes are input to the Transformer encoder and decoder. This model verifies how useful the graph convolutions are. It shows better performance than the baselines.

mAST+GCN is the proposed model with ASTs modified by adding sibling edges and with graph convolution layers. It outperforms all baseline models. The performance improves over the strongest baseline by 0.91 and 0.3 BLEU, 0.74 and 0.35 METEOR, and 0.06 and 0.08 ROUGE-L on the Java and Python datasets, respectively.
Methods                              Java                     Python
                                     BLEU   METEOR  ROUGE-L   BLEU   METEOR  ROUGE-L
CODE-NN (Iyer et al., 2016)          27.60  12.61   41.10     17.36  09.29   37.81
Tree2Seq (Eriguchi et al., 2016)     37.88  22.55   51.50     20.07  08.96   35.64
RL+Hybrid2Seq (Wan et al., 2018)     38.22  22.75   51.91     19.28  09.75   39.34
DeepCom (Hu et al., 2018a)           39.75  23.06   52.67     20.78  09.98   37.35
API+CODE (Hu et al., 2018b)          41.31  23.73   52.25     15.36  08.57   33.65
Dual Model (Wei et al., 2019)        42.39  25.77   53.61     21.80  11.14   39.45
TransRel (Ahmad et al., 2020)        44.58  26.43   54.76     32.52  19.77   46.73
Proposed Model: AST-Only             44.76  26.75   53.93     31.59  19.16   45.48
Proposed Model: AST+GCN              45.30  27.26   54.45     32.41  19.77   46.35
Proposed Model: mAST+GCN             45.49  27.17   54.82     32.82  20.12   46.81

Table 2: Overall performance of the models on the Java and Python datasets.
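The ROUGE-L column above scores the longest-common-subsequence overlap between a generated summary and the reference. A self-contained sketch of an LCS-based F-measure over whitespace tokens follows; it is a simplification of the official scorer, which supports multiple references and recall-weighted F-scores.

```python
def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """LCS-based F1 over whitespace tokens (simplified ROUGE-L)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l("the cat on the mat", "the cat sat on the mat")
# LCS = 5 tokens, precision = 1.0, recall = 5/6, F1 = 10/11.
```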
(a) mAST+GCN (b) AST+GCN (c) Baseline
Figure 4: Attention maps of mAST+GCN, AST+GCN, and the baseline model for the code in Figure 5. (a), (b), and (c) are the attention maps of the first Transformer encoder layer, and (d), (e), and (f) are the attention maps of the last Transformer encoder layer. The red box in (a) marks blocks in the snippet; the structural features are clearly captured there. Our model effectively captures large and hierarchical structural features.
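The block structure in these attention maps stems from the graph convolution of Section 3.1, H^(l+1) = σ(A H^(l) W^(l)), which gives structurally neighboring nodes similar representations. A dependency-free toy sketch of one such layer follows; the 3-node graph, identity weights, and ReLU activation are illustrative assumptions, not the paper's configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def gcn_layer(adj, h, w):
    """One propagation step: mix neighbour features via adj,
    transform linearly by w, then apply ReLU."""
    return relu(matmul(matmul(adj, h), w))

# 3-node path graph 0-1-2, adjacency including self-loops.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
# One-hot initial features (d = 3 here instead of d = 512).
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
# Identity weights, so the output directly shows the neighbourhood mixing.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

H1 = gcn_layer(A, H, W)
# Node 1 now aggregates the features of nodes 0, 1, and 2.
```

After one layer, each node's row mixes in its neighbours' features, which is why adjacent AST nodes end up with similar representations.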
Figure 4 also shows the attention maps of the last layer of mAST+GCN, AST+GCN, and the baseline.

Rectangles on the diagonal, as marked by the red box in Figure 4a, represent blocks in the snippets. Since nodes or tokens in a block tend to have high similarities, we can see rectangles along the diagonal. In the attention maps of the first layers, the rectangles are faint because the structural information has not been processed much yet, but in the later layers we can see many distinct rectangles, which means that the structural features are clearly captured.

If we compare the attention maps of our models and the baseline, there are many small rectangles in the baseline. In contrast, in our models we can see a few large rectangles that contain small rectangles: the attention maps of our models show a hierarchical structure.

The fact that the baseline produces many small rectangles implies that the baseline can capture only small structural features. We can also note that these small structural features are smaller than statements, considering that the example snippets have only 4 lines. The baseline hardly captures large structural features.

On the other hand, our model effectively captures large and hierarchical structural features. We can easily identify rectangles that match statements or blocks in the attention maps of our proposed models.

The attention maps of mAST+GCN and AST+GCN are very similar. However, we can see differences between their first-layer and last-layer attention maps. If we compare the first-layer attention maps, the blocks of mAST+GCN are more distinct, which implies that modifying ASTs by adding sibling edges helps to model structural information. If we compare the last-layer attention maps, the hierarchical structures of rectangles are clearer in the map of mAST+GCN, which also indicates that the modification is effective for capturing structural features.

Human Evaluation We performed a human evaluation (Kryscinski et al., 2019) on the Java dataset to assess how good the summaries generated by our model are. We randomly choose 100 snippets and ask 4 people with knowledge of Java to evaluate the summaries. They are CS graduate students with many years of experience in Java. We ask them to evaluate the following 3 aspects:

• Fluency (quality of the summary)
• Relevance (selection of the important content from source code)
• Coverage (selection of the whole content of source code)

          Fluency   Relevance   Coverage
Wins      146       145         144
Losses    130       135         140
Ties      124       120         116

Table 5: Human evaluation of the appropriateness of the generated summaries on the Java dataset. We ask annotators to select the more appropriate summary from two candidates generated by different models. Our proposed model outperforms the baseline.

We show pairs of summaries from our model and the baseline (Ahmad et al., 2020) to the annotators, and ask them to select one of win, tie, and loss for each of the three aspects. Our model shows superiority in all aspects, as shown in Table 5. The scores for fluency and relevance are higher than the baseline's, which means that our model generates more appropriate summaries using more natural expressions.

Figure 6 shows some example summaries for the qualitative comparison. We choose 6 Java snippet examples, taken from the snippets on which all the annotators make the same decision in each aspect. The three snippets on the left are ones for which the annotators choose win (our model is better), and the ones on the right are losses.

5 Related Work

As deep learning techniques have developed, research on source code summarization has been conducted based on sequence-to-sequence models. Iyer et al. (2016) proposed the first model that performed the source code summarization task. Allamanis et al. (2016) summarized source code using a convolutional attention network. Hu et al. (2018a) proposed an RNN-based sequence-to-sequence model using the pre-order traversal sequence of the abstract syntax tree. Hu et al. (2018b) summarized source code with knowledge of imported APIs using two encoders (a source code encoder and an API encoder). Ahmad et al. (2020) proposed a Transformer model with relative positions for summarizing source code. Wei et al. (2019) proposed a dual model that learns the code and summary sequences simultaneously. Wan et al. (2018) adopted deep reinforcement learning for source code summarization.
Figure 6: Examples for the qualitative comparison. The left examples are chosen from the ones that have 4 wins
(all the annotators agree that our model is better), and the right examples are chosen from the opposite cases.
Acknowledgments

This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00421, Artificial Intelligence Graduate School Program (Sungkyunkwan University)). This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the High-Potential Individuals Global Training Program (2019-0-01579) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

References

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2020. A transformer-based approach for source code summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4998–5007, Online. Association for Computational Linguistics.

Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2091–2100. JMLR.org.

Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2019. code2seq: Generating sequences from structured representations of code. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833, Berlin, Germany. Association for Computational Linguistics.

Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured neural summarization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Jacob Harer, Chris Reale, and Peter Chin. 2019. Tree-transformer: A transformer-based method for correction of tree-structured data. arXiv preprint arXiv:1908.00449.

Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018a. Deep code comment generation. In 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), pages 200–210. IEEE.

Xing Hu, Ge Li, Xin Xia, David Lo, Shuai Lu, and Zhi Jin. 2018b. Summarizing source code with transferred API knowledge. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 2269–2275. ijcai.org.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2073–2083.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. In Proceedings of the 28th International Conference on Program Comprehension, pages 184–195.

Yuding Liang and Kenny Qili Zhu. 2018. Automatic generation of text descriptive comments for code blocks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5229–5236. AAAI Press.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.

Yusuke Shido, Yasuaki Kobayashi, Akihiro Yamamoto, Atsushi Miyamoto, and Tadayuki Matsumura. 2019. Automatic source code summarization with extended Tree-LSTM. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.