
Name: Tejas Kedare

Reg. no.: 20BCS134


Course: Natural Language Processing (NLP)

LITERATURE REVIEW

Research Paper Title: Predicting Stack Overflow question quality through the use of deep learning techniques.

Abstract: This paper proposes a Deep Learning-based Natural Language Processing approach to classify Stack Overflow questions into "High Quality", "Low Quality Edited", and "Low Quality Closed". A neural network model was developed using a distilled version of the BERT encoder and achieved a training accuracy of 96.6% and a test accuracy of 93.5%. The proposed model can predict the quality of Stack Overflow questions and aid in automated moderation. Future research could focus on improving the model's accuracy and applying the approach to other platforms.

Introduction: This paper presents a research project aimed at developing a method for
predicting the quality of a question posted on Stack Overflow, a popular
online community for computer scientists and software developers. The
goal is to determine whether a question is likely to be deleted or closed
based on its quality, as assessed according to the community guidelines
of Stack Overflow. The research utilizes Natural Language Processing and
Deep Learning techniques, with a focus on text classification. The dataset
used in the project is drawn from Kaggle and contains questions posted
on Stack Overflow from 2016 to 2020.

The text classification process involves a number of steps, including data preprocessing, exploratory data analysis, feature extraction, model development, and performance evaluation. The neural network chosen for this project is a bidirectional Transformer classifier, which is well suited for processing sequential data such as the words and sentences in natural language text. The model is built on a DistilBERT encoder with a single additional layer. Its performance is evaluated and the results analyzed to determine its effectiveness in predicting the quality of questions posted on Stack Overflow.

Materials and Methods:

1) Dataset Description: The study used the 2020 "60k Stack Overflow Questions with Quality Rating" dataset from Kaggle, which contains 60,000 questions posted on Stack Overflow from 2016 to 2020. The questions were in HTML format and included features such as title, body, tags, creation date, and type. The questions were already labeled as "High Quality" (HQ), "Low Quality Edited" (LQ-Edit), and "Low Quality Closed" (LQ-Close), which served as the three categories used for classification. The study defined each category as follows: HQ - posts that need no edits; LQ-Edit - low quality posts with a negative score that remain open after edits; LQ-Close - low quality posts that were closed without an edit.
2) Data Preprocessing: The data preprocessing stage focused on preparing the text in the body of the 60,000 Stack Overflow questions from 2016 to 2020. Rows with null values were removed, the categorical labels were converted into integers, and the text was cleaned and normalized: it was converted to lowercase; digits, URLs, HTML tags, and stop words were removed; and the text was tokenized. The resulting preprocessed data was then used to train the classifier, improving accuracy and speeding up computation. A minimal sketch of this step appears after this list.
3) Exploratory Data Analysis: The researchers analyzed the distribution of questions in the two Stack Overflow datasets; each contained roughly 33% of every question type, yielding a balanced dataset. They also examined the distribution of question types over the years and found that question quality has declined over time. The average length of questions decreased after data preprocessing, which made training the model more efficient. The researchers also performed a semantic analysis by creating word clouds of the most frequent words in each type of question (see the sketch after this list). The results showed that high-quality questions used more specific terminology while low-quality questions used more general words, implying that the composition of the entire question should be considered during model development.
4) Model Development: BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language representation model in NLP, published by Google in 2018. It is based on the Transformer, a deep learning architecture in which every output element is connected to every input element and the weightings between them are calculated dynamically. BERT is pretrained with a masked language model (MLM) objective, which involves predicting a missing word in a sentence, and a next sentence prediction (NSP) objective. BERT was trained on large text corpora such as Wikipedia and the BooksCorpus, contributing to its deep knowledge of language. The BERT encoder uses attention mechanisms and processes a series of vector-embedded tokens to generate outputs for the MLM and NSP tasks. In this work, the researchers used a distilled version of BERT called DistilBERT, which is 40% smaller and 60% faster while retaining 97% of BERT's language understanding abilities. They fine-tuned the DistilBERT model by adding a dense output layer and training for 40 epochs, with the goal of determining whether neural networks can help predict the quality of Stack Overflow questions. A sketch of this setup appears after this list.
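
The following is a minimal sketch of the preprocessing described in step 2, not the authors' exact code. The file name train.csv, the column names "Body" and "Y", and the label strings are assumptions about the Kaggle dataset's layout; the stop-word list is borrowed from scikit-learn for simplicity.

```python
# Sketch of the cleaning pipeline from step 2 (assumed file/column names).
import re

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_text(text: str) -> str:
    text = text.lower()                          # lowercase
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"\d+", " ", text)             # remove digits
    tokens = re.findall(r"[a-z]+", text)         # simple tokenization
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

df = pd.read_csv("train.csv")                    # assumed file name
df = df.dropna(subset=["Body", "Y"])             # drop rows with null values
df["clean_body"] = df["Body"].apply(clean_text)

# Convert the categorical labels to integers (assumed label strings).
label_map = {"HQ": 0, "LQ_EDIT": 1, "LQ_CLOSE": 2}
df["label"] = df["Y"].map(label_map)
```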
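The word-cloud analysis from step 3 could be reproduced along these lines; the wordcloud package is an assumption, as the paper does not name its tooling, and df is the preprocessed frame from the sketch above.

```python
# Word clouds of the most frequent words per question type (step 3).
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for label in ["HQ", "LQ_EDIT", "LQ_CLOSE"]:      # assumed label strings
    text = " ".join(df.loc[df["Y"] == label, "clean_body"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Most frequent words in {label} questions")
plt.show()
```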
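Below is a sketch of the fine-tuning setup from step 4: a DistilBERT encoder with a single dense softmax layer for the three classes, written with TensorFlow and the Hugging Face transformers library. The sequence length and learning rate are assumptions; the paper specifies only the added layer and the 40 epochs.

```python
# DistilBERT encoder plus one dense output layer (step 4).
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertModel

MAX_LEN = 128  # assumed maximum sequence length

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encoder = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Use the [CLS] token's hidden state as the sequence representation.
hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
cls_state = hidden[:, 0, :]
probs = tf.keras.layers.Dense(3, activation="softmax")(cls_state)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=probs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed LR
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```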

Results and Discussion

A deep learning model was developed to predict the quality of Stack Overflow questions. The model used the DistilBERT transformer and a single dense layer with 3 neurons. The training dataset was split into training and validation sets, with a validation split of 0.2. The model was trained on a Google Colab cloud Tensor Processing Unit (TPU) with a batch size of 256 and was run for 40 epochs, with a patience of 5 for the EarlyStopping callback. The final training accuracy was 96.6%, with 92.6% on the validation data. The model was then used to make predictions on unseen test data, achieving high precision and an accuracy of 93.5%. A confusion matrix and a classification report were also generated, showing that the model accurately predicted the quality of Stack Overflow questions. These results are comparable to previous studies in the literature and demonstrate the effectiveness of using the BERT transformer family for this task. A sketch of this training and evaluation protocol appears below.
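
The following sketches the protocol reported above: a 0.2 validation split, batch size 256, up to 40 epochs with early stopping (patience 5), followed by a confusion matrix and classification report on the held-out test set. X_train, y_train, X_test, and y_test are placeholders for the tokenized inputs and integer labels, and model is the network from the earlier sketch.

```python
# Training with early stopping and evaluation (as reported above).
import numpy as np
import tensorflow as tf
from sklearn.metrics import classification_report, confusion_matrix

early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)

history = model.fit(
    X_train, y_train,           # placeholder tokenized inputs and labels
    validation_split=0.2,       # 20% of the training data held out for validation
    batch_size=256,
    epochs=40,
    callbacks=[early_stop],
)

# Evaluate on the unseen test set.
y_pred = np.argmax(model.predict(X_test), axis=1)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["HQ", "LQ_EDIT", "LQ_CLOSE"]))
```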

Limitations

A potential limitation of this research is that the question-quality labels in the Stack Overflow dataset were themselves produced by another machine learning model, which could introduce inaccuracies and bias. Additionally, the dataset lacks information about answered, unanswered, and deleted questions, leading to potential imbalance in the dataset and limiting its practical utility for Stack Overflow authors.

Conclusion

This paper proposes a Deep Learning approach for predicting question quality on Stack Overflow using the DistilBERT model. The experimental results were promising, with an accuracy of 93.5% on the test set. The model has the potential to improve question quality and benefit users, moderators, and other programmers. Further research can expand the boundaries of Natural Language Processing and text classification.
