
Simon Weber

CSC297 Final Project Proposal

Predicting the Popularity of Open-Source Code

Problem Statement
With the rise of sites like GitHub and BitBucket, the social community surrounding open-
source code has flourished. The goal of this project is to understand what makes some
code more socially popular than other code, and to predict popularity given the code itself.

Using code from GitHub, popularity will be measured in GitHub stars. These are similar
to the concept of a like on Facebook or a +1 on Google Plus. Another option is the number
of forks per project, which broadly represents the number of times other people have
wanted to contribute to the project. However, experience shows that many forks end with
no contributions, so stars were deemed preferable. If time allows, the two will be
compared as popularity metrics.

GitHub uses Git, so individual projects will be referred to as repositories (or repos). This
study will limit itself to repos with a main language of Python (as identified by GitHub)
due to the author’s involvement in the Python open-source community, and subsequent
domain knowledge. In addition, since per-star prediction is too granular to be useful,
star counts will likely be quantized into classes of 10 or 20 percentile points each.
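
A minimal sketch of this quantization, assuming pandas as the binning tool and entirely
hypothetical star counts:

    import pandas as pd

    # Hypothetical star counts for a small sample of repos.
    stars = pd.Series([0, 0, 1, 3, 9, 12, 45, 230, 1800])

    # Quantize into decile classes (labels 0-9); duplicates="drop" handles the
    # repeated low-star values that collapse percentile edges together.
    classes = pd.qcut(stars, q=10, labels=False, duplicates="drop")
    print(classes.tolist())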

Previous work seems to have focused on predicting code quality [1] [2]. More relevant is
a Kaggle competition which required participants to classify code to certain open-source
projects [3]. The author also did basic work in classifying code to the language it was
written in [4].

Foreseen difficulties include the requirement to understand the classifier's reasoning, as
well as the feature-rich (but likely noisy or redundant) nature of code.

Data Acquisition
First, many Python repos will be collected, and their star counts recorded. GitHub's search
feature can be (ab)used to find repos with a certain number of stars, though GitHub itself
may be able to assist directly and avoid this. Exploratory research shows about 270,000
total Python repos, with star counts in approximately a half-normal distribution [8]. Repos
range from a few kilobytes to around 5 megabytes in size, so collecting them all is not
reasonable; a strategy will have to be determined for sampling the distribution.
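
As a sketch of the search-based approach (assuming the current GitHub search API and the
requests library; exact endpoints and rate limits would need confirming):

    import requests

    # Hypothetical helper: fetch one page of Python repos within a star range.
    # Search results are capped per query, so many narrow star ranges would be
    # needed to sample the full distribution.
    def search_python_repos(min_stars, max_stars, page=1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={
                "q": "language:python stars:%d..%d" % (min_stars, max_stars),
                "per_page": 100,
                "page": page,
            },
        )
        resp.raise_for_status()
        return resp.json()["items"]

    for repo in search_python_repos(10, 20)[:5]:
        print(repo["full_name"], repo["stargazers_count"])
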
After repositories are collected, features must be extracted. Project metadata (number of
commits, bug-tracker information, etc.) would likely predict very well, but risks reversing a
causal relationship. Instead, features of the code itself will be extracted.

Feature Choices
Features are motivated by the author’s domain knowledge. A few broad categories are
under consideration:
● Style: how clean the code is - includes style guide (PEP8 [5]) adherence,
commenting, and work on the project readme.
● Language Utilization: what features of Python the code uses - includes relative
presence of specific AST nodes (e.g. lambdas, generators, context managers) [9],
advanced language features like metaclasses, and imported stdlib modules (see the
sketch after this list).
● Tooling: what software engineering practices are employed - includes presence
of tests, continuous integration and packaging files.
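
A minimal sketch of the Language Utilization features using the stdlib ast module [9]; the
particular node types counted here are only examples:

    import ast
    from collections import Counter

    def ast_node_features(source):
        """Relative presence of a few AST node types in one piece of source."""
        counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
        total = float(sum(counts.values())) or 1.0
        interesting = ["Lambda", "GeneratorExp", "With", "Import", "ImportFrom"]
        return {name: counts[name] / total for name in interesting}

    example = "import os\nsquares = (x * x for x in range(10))\n"
    print(ast_node_features(example))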

Algorithm Choices
Considering the need to understand the resulting classifier and the limited computing
resources of the author, random forest seems a natural choice. Aside from being easily
interpreted, random forest is quick to train and evaluate, reasonably tolerant of irrelevant,
redundant, or noisy features, and has many open-source implementations. However,
incremental training is more difficult and its predictive performance is not best-in-class [6].
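
A minimal sketch with scikit-learn (an assumed implementation choice) and synthetic
placeholder data, showing how per-feature importances support inspecting the classifier's
reasoning:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(500, 8)              # 500 repos, 8 extracted features
    y = rng.randint(0, 10, size=500)  # decile popularity classes

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    print(clf.feature_importances_)   # relative importance of each feature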

Naive Bayes and rule-learning approaches are also transparent, but tend to perform
worse and are less tolerant of unclean data [6].

Performance Evaluation
Surveys have shown k-fold cross-validation (at k >= 10) to be simple and robust [7].
Because of data limitations, a smaller k may be used. After using this to create a
reasonable classifier, new repos can be pulled down from GitHub to evaluate real-world
performance.
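
A minimal evaluation sketch, again with scikit-learn and placeholder data (the module path
assumes a recent scikit-learn release):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(0)
    X = rng.rand(500, 8)
    y = rng.randint(0, 10, size=500)

    scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=10)
    print(scores.mean(), scores.std())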

References
1. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6080797&contentType=Conference+Publications
2. http://ieeexplore.ieee.org/xpl/articleDetails.jsp;jsessionid=710HP6jPn5r2gQjHSLf7rs3vpMRS9Gj1GpxZHwHqcWMg3lyJ3YzW!-864544984?arnumber=6080814&contentType=Conference+Publications
3. http://www.kaggle.com/c/emc-data-science
4. http://arxiv.org/abs/1106.4064
5. http://www.python.org/dev/peps/pep-0008/
6. http://pdf.aminer.org/001/202/088/evaluating_learning_algorithms_composed_by_a_constructive_meta_learning_scheme.pdf
7. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9391
8. https://docs.google.com/spreadsheet/ccc?key=0ArbW86SpnfA8dE9RUVhDZUo1TXJxcW0zZ3JpVDBFSHc
9. http://docs.python.org/2/library/ast.html
