Institute for Systems and Technologies of Information, Control and Communication

ENASE 2023 - Paper #96


Paper Title: Programming language identification in stack overflow post
snippets with Regex based Tf-Idf vectorization over ANN
Reviewer #1

General Assessment (Please assign scores using the following criteria (1=weakest; 6=strongest))

Relevance (Paper fits one or more of the topic areas?): 4


Originality (Newness of the ideas expressed): 4
Technical Quality (Theoretical soundness/methodology): 4
Significance (Is the problem worth the given attention?): 5
Presentation (Structure/Length/English): 4
Overall Rating (Weighted value of above items): 4

Improvement Suggestions (for authors to consider in the camera-ready version. Additional detail in
"Observations")

Abstract and Introduction are adequate? Yes


Needs more experimental results? No
Needs comparative evaluation? No
Improve critical discussion? (validation): Yes
Figures are adequate? (in number and quality): Yes
Conclusions/Future Work are convincing? Yes
References are up-to-date and appropriate? Yes
Paper formatting needs adjustment? No
Improve English? Yes

Detailed comments to authors, including aspects that must be improved in the camera-ready version of the
paper:

In this manuscript the authors propose an approach for automatic programming language identification starting
from small snippets of source code extracted from Stack Overflow. The approach uses regular expressions on
three levels to extract tokens from the snippets and then TF-IDF to encode the input data. Feature
selection is then performed via a chi-squared test to keep only the tokens that are highly dependent on the
output. Last, the authors use three classification models, namely Random Forest, XGBoost and artificial
neural networks (ANN). The obtained results indicate that the proposed approach is promising, comparable
to and even surpassing other recent approaches.
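
For concreteness, the first two steps summarized above (regex tokenization followed by TF-IDF weighting) can be
sketched as below. This is only an illustrative sketch, not the authors' implementation: the token pattern
TOKEN_RE, the helper names and the two sample snippets are assumptions made for the example, and the paper's
own three-level regular expressions are not reproduced.

```python
import math
import re
from collections import Counter

# Hypothetical token pattern (an assumption, not taken from the paper):
# an identifier/keyword, or a single punctuation symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\w\s]")


def tokenize(snippet):
    """Split a code snippet into regex-matched tokens."""
    return TOKEN_RE.findall(snippet)


def tfidf(docs):
    """Return one {token: weight} dict per snippet, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of snippets that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty snippets
        vectors.append({
            tok: (count / total) * math.log(n_docs / df[tok])
            for tok, count in counts.items()
        })
    return vectors


# Two made-up snippets, one Python-like and one JavaScript-like.
snippets = [
    "def add(a, b): return a + b",
    "function add(a, b) { return a + b; }",
]
vectors = tfidf(snippets)
```

With this unsmoothed weighting, a token that occurs in every snippet (such as "return" here) gets weight zero,
while a token specific to one snippet (such as "def") gets a positive weight, which is the property the
downstream feature selection relies on.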

The problem in question is important and the obtained results are promising. Still, there are some items that
in my opinion would improve the quality of the manuscript, and some that require more detail:

- Do the three regular expressions provided in Table 1 represent the entire list of patterns used, or do the
authors have a set of regular expressions for each level?

- How did the authors decide upon the limit number for the features (5000)?

- How many relevant features are selected after performing the chi-squared test?

e-mail: reviews@insticc.org

- Why does the ANN contain 100,000 nodes in the input layer? Do the authors provide a different input to the
ANN (considering that the number of features should be less than 5000)?

- Table 6: results are in "percentage", not "percentile".

- For the comparative analysis illustrated in Table 7: it is not clear whether the other two approaches were
tested on code snippets only. It should probably be mentioned that a Bi-LSTM was used for the word2vec
encoding.

- Figure 5 shows the performance for a Bi-LSTM using word2vec. The image should include the training and
validation performance for the TF-IDF based ANN as well, for a visual comparison.

- There are various places where citations are needed (e.g. for Sklearn, for the definition of TF-IDF, and
for Hugging Face Inc.).

- I suggest a proofreading of the manuscript, as there are several phrases that need improvement, as well as
some grammar errors.
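
The questions above about the feature limit, the chi-squared selection and the size of the ANN input layer are
linked: the width of the input layer would normally equal the number of tokens that survive selection. A
minimal sketch of that relationship follows. It assumes binary token-presence features, the classic
contingency-table chi-squared statistic, and a hypothetical threshold of 3.84 (the 5% critical value for one
degree of freedom); the function names and toy data are made up, and none of these choices are taken from the
paper.

```python
from collections import Counter


def chi2_score(presence, labels):
    """Chi-squared statistic for a binary token-presence feature vs. class labels."""
    n = len(labels)
    class_totals = Counter(labels)
    present_total = sum(presence)
    observed = Counter(zip(presence, labels))
    score = 0.0
    for cls, cls_n in class_totals.items():
        # Two rows of the contingency table: token present (1) and absent (0).
        for flag, row_n in ((1, present_total), (0, n - present_total)):
            expected = row_n * cls_n / n
            if expected > 0:
                score += (observed[(flag, cls)] - expected) ** 2 / expected
    return score


def select_features(docs_tokens, labels, max_features, threshold):
    """Keep at most max_features tokens whose score reaches the threshold."""
    vocab = sorted({tok for toks in docs_tokens for tok in toks})
    scored = []
    for tok in vocab:
        presence = [1 if tok in toks else 0 for toks in docs_tokens]
        scored.append((chi2_score(presence, labels), tok))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tok for score, tok in scored[:max_features] if score >= threshold]


# Toy data (assumption): token sets for four snippets and their language labels.
docs_tokens = [
    {"def", "return"},
    {"def", "import"},
    {"function", "return"},
    {"function", "var"},
]
labels = ["python", "python", "javascript", "javascript"]

selected = select_features(docs_tokens, labels, max_features=5000, threshold=3.84)
input_layer_width = len(selected)  # one input node per selected feature
```

In this toy case only "def" and "function" pass the threshold, so the input layer would have two nodes.
Reporting the equivalent count for the real dataset would answer the question of how many features remain
after selection and make the stated input-layer size easier to check.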


Reviewer #2

General Assessment (Please assign scores using the following criteria (1=weakest; 6=strongest))

Relevance (Paper fits one or more of the topic areas?): 4


Originality (Newness of the ideas expressed): 3
Technical Quality (Theoretical soundness/methodology): 4
Significance (Is the problem worth the given attention?): 4
Presentation (Structure/Length/English): 4
Overall Rating (Weighted value of above items): 4

Improvement Suggestions (for authors to consider in the camera-ready version. Additional detail in
"Observations")

Abstract and Introduction are adequate? Yes


Needs more experimental results? No
Needs comparative evaluation? No
Improve critical discussion? (validation): No
Figures are adequate? (in number and quality): Yes
Conclusions/Future Work are convincing? Yes
References are up-to-date and appropriate? Yes
Paper formatting needs adjustment? No
Improve English? No

Detailed comments to authors, including aspects that must be improved in the camera-ready version of the
paper:

The authors propose an approach that classifies code snippets by programming language. The approach uses an
ANN classifier whose input is a feature-reduced vector generated with a regex-based TF-IDF vectorization.
The problem addressed by the authors is relevant to applications that support users' queries seeking input
on their code snippets. Searching through existing repositories for a new query is a challenging task. The
problem has been explored earlier through several other techniques and methodologies. The authors discuss
those existing approaches and the need for a better solution. They lay out their proposed approach very
clearly, along with a comparative study. I appreciate that the results are also discussed in detail in the
paper. The paper is very well organized! Overall, I would recommend accepting the paper.


Reviewer #3

General Assessment (Please assign scores using the following criteria (1=weakest; 6=strongest))

Relevance (Paper fits one or more of the topic areas?): 6


Originality (Newness of the ideas expressed): 5
Technical Quality (Theoretical soundness/methodology): 4
Significance (Is the problem worth the given attention?): 5
Presentation (Structure/Length/English): 4
Overall Rating (Weighted value of above items): 4

Improvement Suggestions (for authors to consider in the camera-ready version. Additional detail in
"Observations")

Abstract and Introduction are adequate? Too long


Needs more experimental results? No
Needs comparative evaluation? No
Improve critical discussion? (validation): No
Figures are adequate? (in number and quality): Yes
Conclusions/Future Work are convincing? Yes
References are up-to-date and appropriate? Yes
Paper formatting needs adjustment? Yes
Improve English? No

Detailed comments to authors, including aspects that must be improved in the camera-ready version of the
paper:

The authors use code snippets as input to a multi-step ANN-based classification process. In arriving at
ANN as the optimal classifier, they compared its output with that of other classifiers. Overall the paper is
well written.

The authors clearly show that the proposed model is better. But if the technique of Alreshedy et al. gives
better results by taking the title and description in addition to the code snippets, would it not be better
to use their model? What is the time complexity in both cases? I see that Table 6 shows the difference at the
individual program level. Maybe the authors could also investigate what happens if other components, such as
the title and description, are included in the input to their proposed model. Do the accuracy and the other
metrics then improve beyond those of Alreshedy et al.?
