Literature Review by Tejas
LITERATURE REVIEW
Research Paper Title: Predicting Stack Overflow question quality through the use of deep
learning techniques.
Introduction: This paper presents a research project aimed at developing a method for
predicting the quality of a question posted on Stack Overflow, a popular
online community for computer scientists and software developers. The
goal is to determine whether a question is likely to be deleted or closed
based on its quality, as assessed according to the community guidelines
of Stack Overflow. The research utilizes Natural Language Processing and
Deep Learning techniques, with a focus on text classification. The dataset
used in the project is drawn from Kaggle and contains questions posted
on Stack Overflow from 2016 to 2020.
1) Dataset Description: The study used the 2020 dataset of "60k Stack Overflow
Questions with Quality Rating" from Kaggle Inc. The dataset provided 60,000
questions posted on Stack Overflow from 2016 to 2020. The questions were in
HTML format and contained features such as title, body, tags, creation date,
and type. The types of questions were already labeled as "High Quality," "Low
Quality Edited," and "Low Quality Closed," which served as the three categories
used for classification. The study defined each category as follows: HQ, posts that
require no edits; LQ-Edit, low-quality posts with a negative score that remain open
after edits; and LQ-Close, low-quality posts that are deleted without an edit.
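Before training, the three categories above are typically encoded as integer class ids. A minimal sketch in plain Python; the label strings follow the Kaggle dataset's conventions, but the sample rows themselves are invented for illustration:

```python
# Hypothetical sample rows mirroring the label column of the Kaggle
# "60k Stack Overflow Questions with Quality Rating" dataset.
rows = [
    {"Title": "How do I reverse a list in Python?", "Y": "HQ"},
    {"Title": "code not working pls help",          "Y": "LQ_CLOSE"},
    {"Title": "Why does my Java loop print twice?", "Y": "LQ_EDIT"},
]

# Map the three quality categories to integer class ids for the classifier.
LABEL_MAP = {"HQ": 0, "LQ_EDIT": 1, "LQ_CLOSE": 2}
labels = [LABEL_MAP[r["Y"]] for r in rows]
print(labels)  # [0, 2, 1]
```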
2) Data Preprocessing: The data preprocessing stage of the project focused on
preparing the text data in the body of 60,000 Stack Overflow questions from
2016 to 2020. Null values were checked for and removed, categorical data was
converted into integers, and the text data was cleaned and normalized. This
involved converting the text into lowercase, removing digits, URLs, HTML tags,
and stop words, and tokenizing the text. The resulting preprocessed data was then
used to train the classifier, yielding better accuracy and faster computation.
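The cleaning steps described above (lowercasing, stripping HTML tags, URLs, and digits, removing stop words, and tokenizing) can be sketched with the standard `re` module. The stop-word list here is a small illustrative subset, not the full set the authors likely used:

```python
import re

# Illustrative stop-word subset; the paper likely used a fuller list
# (e.g. NLTK's English stop words).
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "and", "of", "my", "at", "on"}

def clean_text(html_body: str) -> list[str]:
    """Normalize a question body as described above and return its tokens."""
    text = html_body.lower()                    # convert to lowercase
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"\d+", " ", text)            # remove digits
    tokens = re.findall(r"[a-z]+", text)        # tokenize on letter runs
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("<p>My code at https://example.com fails on line 42</p>"))
# ['code', 'fails', 'line']
```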
3) Exploratory Data Analysis: The researchers analyzed the class distribution in the
two Stack Overflow datasets; each contained roughly 33% of each question type,
yielding a balanced dataset. They also looked at the
distribution of question types over the years and found that the quality of
questions has declined. The average length of questions decreased after data
preprocessing, which improved the efficiency of training the model. The
researchers also performed a semantic analysis by creating word clouds of
the most frequent words in each type of question. The results showed that
high-quality questions had more specific terminology and low-quality
questions had more general words. This implied that the composition of the
entire question should be considered in model development.
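The two analyses above, the class-balance check and the per-class word frequencies that a word cloud visualizes, can be sketched with `collections.Counter`. The toy corpus below is invented for illustration; the real analysis would run over all 60,000 preprocessed question bodies:

```python
from collections import Counter

# Toy corpus standing in for the preprocessed question bodies.
questions = [
    ("HQ",       "segfault pointer arithmetic gdb backtrace"),
    ("HQ",       "asyncio event loop coroutine await"),
    ("LQ_EDIT",  "please help code not working"),
    ("LQ_CLOSE", "urgent help need answer fast"),
]

# Class balance: the Kaggle dataset holds roughly 33% of each class.
balance = Counter(label for label, _ in questions)
print(balance)

# Most frequent words per class -- the same counts a word cloud renders.
for label in ("HQ", "LQ_EDIT", "LQ_CLOSE"):
    words = Counter(w for l, text in questions if l == label
                    for w in text.split())
    print(label, words.most_common(3))
```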
4) Model Development: BERT (Bidirectional Encoder Representations from
Transformers) is a state-of-the-art language representation model in NLP,
published by Google in 2018. It is based on the Transformer, a deep learning
architecture in which every output element is connected to every input element
and the weightings between them are calculated dynamically. BERT is pretrained
with two objectives: masked language modeling (MLM), in which the model predicts
words that have been masked out of a sentence, and next sentence prediction
(NSP). It was pretrained on large text corpora, namely English Wikipedia and the
BookCorpus, which contributes to its broad language knowledge. The BERT encoder
uses attention mechanisms and processes a sequence of vector-embedded tokens to
generate outputs for the MLM and NSP tasks. In
this work, the researchers used a distilled version of BERT called DistilBERT,
which is 40% smaller and 60% faster while retaining 97% of BERT's language
processing abilities. They fine-tuned the DistilBERT model by adding a dense
output layer and using 40 epochs, with the goal of determining if neural
networks can help predict the quality of Stack Overflow questions.
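The dense output layer the authors add on top of DistilBERT maps the transformer's 768-dimensional representation of a question to the three quality classes. A minimal NumPy sketch of just that classification head; the embedding and weights below are random stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for DistilBERT's 768-dimensional encoding of one question;
# in the paper this comes from the fine-tuned transformer.
cls_embedding = rng.standard_normal(768)

# The dense output layer with 3 neurons (HQ, LQ_EDIT, LQ_CLOSE).
W = rng.standard_normal((768, 3)) * 0.02  # weights learned in fine-tuning
b = np.zeros(3)

logits = cls_embedding @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax over the three classes

print(probs, probs.sum())
```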
A deep learning model was developed to predict the quality of Stack Overflow
questions. The model used the DistilBERT transformer followed by a single dense layer
with 3 neurons. The training dataset was split into training and validation sets,
with 20% of the data held out for validation. The model was trained on a Google Colab
cloud Tensor Processing Unit (TPU) with a batch size of 256 and run for up to 40
epochs, with a patience of 5 for the early-stopping callback. The final accuracy was
96.6% on the test set and 92.6% on the validation data. The model was then used to make
predictions on unseen data, resulting in a high precision and accuracy score of 93.5%. A
confusion matrix and a classification report were also generated, showing that the
model accurately predicted the quality of Stack Overflow questions. These results are
comparable to previous studies in the literature and demonstrate the effectiveness of
using the BERT transformer for this task.
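A confusion matrix and accuracy of the kind reported above can be computed from predicted and true labels alone. This sketch uses a handful of invented predictions, not the paper's actual results:

```python
from collections import Counter

# Hypothetical predictions on a few held-out questions; the paper reports
# ~93.5% precision/accuracy on its real unseen data.
y_true = ["HQ", "HQ", "LQ_EDIT", "LQ_CLOSE", "LQ_CLOSE", "HQ"]
y_pred = ["HQ", "HQ", "LQ_EDIT", "LQ_CLOSE", "HQ",       "HQ"]

classes = ["HQ", "LQ_EDIT", "LQ_CLOSE"]
pairs = Counter(zip(y_true, y_pred))

# Confusion matrix: rows = true class, columns = predicted class.
matrix = [[pairs[(t, p)] for p in classes] for t in classes]
for cls, row in zip(classes, matrix):
    print(f"{cls:9s} {row}")

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(pairs[(c, c)] for c in classes) / len(y_true)
print("accuracy:", round(accuracy, 3))
```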
Limitations
A potential limitation of this research is that the question-quality labels in the
Stack Overflow dataset were themselves produced by another machine learning model,
which could introduce inaccuracies and bias. Additionally, the dataset lacks
information about answered, unanswered, and deleted questions, leading to potential
imbalance in the dataset and limiting its practical utility for Stack Overflow authors.
Conclusion
This paper proposes a Deep Learning approach for predicting question quality on Stack
Overflow using the DistilBERT model. The experimental results were promising, with an
accuracy of 93.5% on unseen data. The model has the potential to improve
question quality and benefit users, moderators, and other programmers. Further
research can expand the boundaries of the field of Natural Language Processing and
Text Classification.