
2015 3rd International Conference on Information and Communication Technology (ICoICT)

Big Data Analytics on Large-Scale Socio-technical Software Engineering Archives
Shahabedin Bayati, David Parsons, Teo Susnjak, Marzieh Heidary
School of Engineering and Advanced Technology
Massey University,
Auckland, New Zealand
S.bayati@massey.ac.nz

Abstract—Given the fast-growing nature of software engineering data in online software repositories and open source communities, it would be helpful to analyse these assets to discover valuable information about the software engineering development process and other related data. Big Data Analytics (BDA) techniques and frameworks can be applied to these data resources to achieve high-performance and relevant data collection and analysis. Software engineering is a socio-technical process which needs development team collaboration and technical knowledge to develop a high-quality application. GitHub, as an online social coding foundation, contains valuable information about software engineers' communications and project life cycles. In this paper, unsupervised data mining techniques are applied to the data collected by general Big Data approaches to analyse GitHub projects, source code and interactions. Source code and projects are clustered using features and metrics derived from historical data in repositories, object-oriented programming metrics and the influence of developers on source code.

Index Terms—Big Data; GitHub Mining; Clustering; Mining Software Repositories (MSR); Empirical Software Engineering

I. INTRODUCTION

Software engineering data is available in open source software repositories through the Web. As the number of online repositories has increased and development teams have become distributed across the world, this accumulated data has become a great resource for data analysis and mining about software engineering. Online software archives involve social communication repositories, version control, source code, commit logs, bug tracking and other development artifacts. Web 2.0 and social network features also align with these repositories and support social collaboration among team members and software engineers in large distributed environments. As the size of archives has rapidly increased and the data stored in these repositories has become more complex, the benefit of applying distributed computing techniques and Big Data approaches has become more obvious.

During the last decade data scientists working on software engineering archives presented a new field in knowledge discovery and management known as Mining Software Repositories (MSR), which is focused on discovering valuable and actionable information and knowledge about software engineering development processes. Although MSR is quite a new field in data science and software engineering, valuable research and applications have already been presented and this community keeps growing. In MSR research, data mining and machine learning techniques such as classification, association rules, text mining, sentiment analysis, social network analysis and clustering are widely used to find information about software ecosystems or development through software repositories [1].

Various online open source repositories are now available through the Web. GitHub, SourceForge, Google Code and BitBucket are some of these resources, which may contain various types of repositories. In this research we focus on GitHub as a social coding environment which enables software engineers to do their coding through version control and also to communicate via social networking tools within GitHub. GitHub has a variety of events, repositories and APIs to help software engineers in the development lifecycle. This stored socio-technical data is a great resource for data mining purposes. GitHub uses Git and Subversion as version control technologies. It has gained more than 7 million users since its establishment. At the time of writing it is one of the 130 most popular Websites as ranked by Alexa and it has about 17 million project repositories.

Huge amounts of fast-growing data in GitHub repositories are stored in a variety of formats and complex structures, which illustrates that GitHub is a Big Data resource and has the main characteristics of Big Data. Big Data involves the 3 Vs (Volume, Variety and Velocity) as the most important features of a data set that may be classified as a Big Data resource [2]. Volume means that the data should be big in size, usually more than a terabyte (TB). Velocity means that the data is generated at a fast rate and should also be processed fast. Variety means that a Big Data resource should have a variety of data structures and formats, both unstructured and structured. GitHub as the main software engineering resource meets these criteria and is suitable for Big Data Analytics in the area of the software development lifecycle (SDLC).

In this paper a new framework is proposed that collects GitHub data through its REST API in a distributed manner and uses Map/Reduce and a NoSQL DB, as general Big Data approaches, for data collection and pre-processing. In the data

978-1-4799-7752-9/15/$31.00 ©2015 IEEE 65



processing phase, unsupervised machine learning techniques are used for clustering GitHub project source code based on the metrics and features calculated in the pre-processing phase. The result of this clustering is analysed and will be presented to software engineers, project managers and researchers. This paper is structured as follows. In the next section a brief description of the basic concepts used for this research is provided. In section three, related work is reviewed. Then in two subsequent sections the main framework and research methodology are illustrated. Then the preliminary results of the proposed framework are demonstrated. Finally in the last section the conclusions are summarized.

II. BASIC RESEARCH CONCEPTS

To provide the background to this research, this section presents a brief overview of basic research concepts related to this paper. Mining software repositories and Map/Reduce are introduced in this section.

A. Mining Software Repositories (MSR)

Software repositories contain various data about software project histories which can be mined to support software engineering processes. MSR is the specific field in data science which focuses on analyzing the rich data in software repositories to discover interesting and actionable information about projects, software ecosystems and systems [3]. This extracted information can be used by software practitioners and researchers in developing their understanding of software engineering, in decision making, in system maintenance and in the empirical validation of their ideas.

Recommending a software developer for a reported bug, predicting defects and effort in software projects, discovering API usage patterns in applications, analyzing developers' communication topics, extracting function and class change patterns [4] and detecting code clones are some of the topics introduced in MSR [5]. Software intelligence, software analytics, software development analytics and empirical software engineering are other terms for mining software repositories (MSR) which are discussed in various related research with similar definitions. Three types of repositories have been used in MSR research, which are listed with repository examples in Table 1 [6].

TABLE 1. DIFFERENT TYPES OF SOFTWARE REPOSITORIES
Repository Type | Repository Instances
Cooperation | Source Code Repositories (CVS), Artifacts Repositories
Coordination | Issue Tracking, Reputation Systems, Project Management
Communication | Forums, Q&A Websites, Wikis, Commit Log, User Reports, Code Comments, Mailing List and Social Media

B. Map/Reduce Process

Map/Reduce is a distributed processing technique which was presented by Google to address large-scale datasets [7]. Map/Reduce is based on a distributed divide-and-conquer algorithm which has its origins in the map and reduce primitives of Lisp and other functional programming languages. It has two main phases, "Map" and "Reduce", and also a "Shuffling" phase in between. In most cases it works on a <Key, Value> data format. In the "Map" phase the key-value pairs are distributed for processing across multiple processors, which generate intermediate key-value pairs. In the "Reduce" phase all the key-value pairs with the same key are aggregated by a reducer, which after some computation generates a single key-value pair. Together with the results of the other reducers this forms the final result. Figure 1 summarizes the Map/Reduce process. A very common example of Map/Reduce is word count in large textual documents, which can be used for indexing. In this paper a new sample based on software engineering repositories is discussed which is closer to the topic of this research. Using Map/Reduce on any data imposes some restrictions: the data should be presented as key-value pairs and there must not be any dependencies among pairs.

Fig 1. Map/Reduce

A sample model for software evolution analysis, based on changes in lines of code (LoC) per source file in each committed version, is presented here using the Map/Reduce method [7]. Assume that pairs of the form <Project name + version, Source code file>, such as <Prj_A 1.0, a.java> and <Prj_A 1.0, b.java>, are the input data to the mapping phase. The LoC of each file is computed and a new pair <Project name + version, LoC count>, such as <Prj_A 1.0, 150> and <Prj_A 1.0, 250>, is presented to the shuffle mechanism. In this phase the pairs with the same key are sent to the same reducer, which sums all the values per key to calculate <Project name + version, Total LoC> as the result, such as <Prj_A 1.0, 400>. By analyzing these results across versions, the process of software evolution becomes more visible in terms of LoC metrics.
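The LoC sample above can be sketched in a few lines of single-process Python. This is only an illustration of the three phases, not the distributed implementation used in the paper: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `total_loc`) and the file contents are hypothetical, and the contents are invented so that the counts match the 150 and 250 lines used in the text.

```python
from collections import defaultdict

# Hypothetical input records: (project name + version, file name, file contents).
# The contents are made up so that the files have 150 and 250 lines.
SOURCES = [
    ("Prj_A 1.0", "a.java", "int x;\n" * 150),
    ("Prj_A 1.0", "b.java", "int y;\n" * 250),
]


def map_phase(records):
    """Emit one <project + version, LoC> pair per source file."""
    for key, _file_name, contents in records:
        yield key, len(contents.splitlines())


def shuffle_phase(pairs):
    """Group intermediate values by key so each key reaches a single reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the LoC values of each key into <project + version, total LoC>."""
    return {key: sum(values) for key, values in grouped.items()}


def total_loc(records):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(records)))


print(total_loc(SOURCES))  # {'Prj_A 1.0': 400}
```

Because each mapper call looks at exactly one record and each reducer call at exactly one key, the pairs carry no dependencies on each other, which is the restriction the text places on data processed with Map/Reduce.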


III. RELATED WORK

In this section a brief review of previous work related to this project is presented, which covers research on Big Data in MSR and GitHub Mining.

A. Big Data Research in MSR

Most of the research in the field of MSR is focused on development tasks. Other software development life cycle phases are not analyzed as thoroughly as the development phase [5]. In other cases, most of the research focuses on only one project and one type of repository, and limited data mining techniques are applied to discover valuable information in a specific context [8]. By introducing general Big Data Analytics approaches and tools into the field of MSR, a new approach for Internet-scale software repository analysis is presented. Analyzing developers' interactions, documents, source code and artifacts through on-line repositories like GitHub is now possible with BDA techniques. However, some customizations of traditional MSR mining algorithms are inevitable. Some specific and high-cost approaches for large-scale data mining on software repositories have already been applied, but they are not applicable to other research. By using Map/Reduce methods and other general frameworks and tools from the Big Data domain this work can be done more easily and cost-effectively [9].

B. GitHub Mining

As mentioned in previous sections, GitHub is a great resource for Big Data Analytics in the software engineering area. The role of developers' profiles in software engineering related social media like GitHub and Stack Overflow is investigated in [10]. This shows how a developer's profile can affect the assessment of their reputation and skills by other developers, and how recruiters look at developers' profiles. In other MSR research [11] the way developers carry out refactoring is discussed by analyzing data gathered from GitHub and from Stack Overflow posts and answers. In GitHub they analyzed comments on commit messages, issues and pull requests. They found that refactoring is related mostly to C# and Java based projects and that only a small portion of comments is related to refactoring (less than 1%). In research by Thung et al. [12], developer and project relationships in GitHub were analyzed by using social network analysis approaches and data mining techniques. They created two separate graphs which show developer-to-developer and project-to-project relationships. Their aim was to find which developers are more active and influential and what the relationship is between developers and projects.

GitHub's role in improving group awareness in software engineering is discussed by Lanubile, Calefato and Ebert. This awareness can help projects in distributed environments to be more successful and avoid failure. GitHub can support awareness as a distributed social coding environment [13]. Although GitHub is a great resource for data mining, some problems are faced by data scientists who work on GitHub. Some of the projects on GitHub are not real software development projects or are not active, and many of the projects do not use all the events and features of GitHub. Data scientists should be aware of these issues when analyzing GitHub [14]. A further data source for data analytics on GitHub has been collected in a relational model through torrent protocols [15]. In other research GitHub users' social activities were observed. The authors analyzed people with a high number of followers to find the role of transparency in the GitHub environment [16].

IV. FRAMEWORK

To present our approach for Big Data Analytics, a framework is proposed which describes the phases of a data mining process on GitHub, which is the goal of this research. Figure 2 schematically explains these processes. In the first step a distributed application is needed to extract the required data from GitHub by using its pre-defined REST APIs. The GitHub API returns its data in JSON format through service calls, and it has some restrictions on usage which force us to use a distributed application, shown in Figure 2 as GitHub API Extractors. The retrieved data is stored in a NoSQL document storage DB and a pre-processing phase is run on this data using the Map/Reduce technique. The NoSQL DB stores the extracted history of projects, developers and source code. The calculated metrics and features of object-oriented projects are gathered in a relational DBMS, and the required data is extracted through SQL queries in the form of CSV so that any clustering algorithm can be applied with different tools. As the size of the data stored in the NoSQL DB is large, Map/Reduce is applied for the metric calculation process. The main contributions of this framework to MSR research are (1) using general and cost-efficient Big Data approaches such as NoSQL, distributed processing and Map/Reduce in MSR, and (2) presenting a general framework for data analysis on large-scale open access socio-technical archives like GitHub. The presented framework is extendable to most general data mining techniques.

V. METHODOLOGY

As the proposed framework describes, a service-oriented application is being developed to extract GitHub data and store it as JSON documents in MongoDB. The URLs of object-oriented GitHub repositories are used as seeds for the distributed nodes of the application. MongoDB itself supports lightweight Map/Reduce which can be queried across the Hadoop Distributed File System (HDFS). In addition a metric calculator program is developed in Java which calculates object-oriented interactions and historical information about the source code and projects. Some of the applied metrics are LoC, number of methods, number of statements in methods, depth block complexity, number of developers who worked on a source file, number of changed lines, average of changed lines per commit, project size, programming language and average of co-changed files. Since it is important for Map/Reduce programming, the


data should not be dependent, and all of the mentioned metrics in key-value format are independent of one another. Finally the results of the queries and programs are stored in MySQL and the extracted CSV files are fed to WEKA for K-Means clustering. The results of clustering can be analysed for relevance to end-users in a future evaluation. Currently, a proof of concept has been developed to demonstrate the principle of the framework, and some initial results have been derived.

Fig 2. Proposed Framework

VI. PROOF OF CONCEPT

In this section some preliminary results from the proposed framework are introduced. The application is developed based on the described methodology and the results of the data mining process support the concepts mentioned in this paper. We applied K-Means clustering to source code samples in GitHub to cluster them based on how difficult they are to work with for developers with different levels of experience and knowledge. Source code from C#, Java and C++ projects was analyzed, and the historical information and object-oriented metrics were used as clustering features. Figure 3 shows the result of this clustering in box-plots from different perspectives: lines of code (LoC), percentage of comment lines and number of changes. We assume that these three metrics can affect the ease of modification.

These plots show that source code which belongs to cluster_2 and cluster_3 is relatively difficult to work with because it has changed a lot, its size is big and there have been many changes by different developers. Cluster_1 is a good cluster because it has a limited number of developers, enough comments and a low rate of change. We should note that a file which changed at a low rate in the past will have a low probability of changing in the future.

Fig 3. The result of clustering on GitHub source code

VII. CONCLUSION AND FUTURE WORK

This paper proposed the use of general Big Data Analytics approaches and techniques like Map/Reduce and NoSQL on Internet-scale software engineering archives. GitHub, as an online social coding environment, is selected for this research. GitHub, along with its version control and issue tracking repositories, provides facilities for social networking among software engineers. The GitHub API is used to extract data from GitHub and a general framework is proposed for data analysis on GitHub based on Big Data approaches. From the source code and projects extracted from GitHub some metrics and features are calculated for a mining process. We have presented some preliminary results that act as a proof of concept for the framework. These results have applied a set of metrics that may be useful in a recommender system for newcomers to open source software development by identifying which source code components may be more easily modified.
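The clustering step used in the proof of concept can be illustrated with a small, self-contained sketch. The paper feeds CSV files to WEKA; the code below instead uses a plain Python version of Lloyd's K-Means as a stand-in, so it should not be read as the authors' implementation. The three features follow the perspectives named for Figure 3 (LoC, percentage of comment lines, number of changes), but the sample values, the min-max scaling step, the choice of k = 2 and the function names are all assumptions made for illustration.

```python
import math


def min_max_scale(rows):
    """Rescale every feature column to [0, 1] so that no metric dominates."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def k_means(points, k, iterations=100):
    """Lloyd's algorithm; the first k points serve as the initial centroids."""
    centroids = list(points[:k])
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


# Hypothetical source files: (LoC, % comment lines, number of changes).
FILES = [
    (120, 25.0, 3),
    (2400, 4.0, 85),
    (150, 22.0, 5),
    (2100, 6.0, 70),
    (90, 30.0, 2),
    (2600, 3.0, 95),
]

labels = k_means(min_max_scale(FILES), k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

With these invented values, the small, well-commented, rarely changed files fall into one cluster and the large, frequently changed files into the other, which mirrors the kind of "easier to modify" versus "harder to modify" grouping discussed for Figure 3.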


For future work, the results of this framework can be used in other software analytical systems, and visualization techniques can be applied to the extracted information. Applying general MSR approaches to the results of this framework and comparing them with the results of specific-sized MSR research is another direction for future work.

REFERENCES

[1] A. Guzzi, A. Bacchelli, M. Lanza, M. Pinzger, and A. v. Deursen, "Communication in open source software development mailing lists," in Proceedings of the 10th Working Conference on Mining Software Repositories, 2013, pp. 277-286.
[2] J. J. Berman, Principles of big data: preparing, sharing, and analyzing complex information: Newnes, 2013.
[3] M. Halkidi, D. Spinellis, G. Tsatsaronis, and M. Vazirgiannis, "Data mining in software engineering," Intelligent Data Analysis, vol. 15, pp. 413-441, 2011.
[4] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl, "Mining version histories to guide software changes," Software Engineering, IEEE Transactions on, vol. 31, pp. 429-445, 2005.
[5] M. W. Godfrey, A. E. Hassan, J. Herbsleb, G. C. Murphy, M. Robillard, P. Devanbu, et al., "Future of mining software archives: A roundtable," Software, IEEE, vol. 26, pp. 67-70, 2009.
[6] A. E. Hassan and T. Xie, "Software intelligence: the future of mining software engineering data," in Proceedings of the FSE/SDP workshop on Future of software engineering research, 2010, pp. 161-166.
[7] W. Shang, B. Adams, and A. E. Hassan, "An experience report on scaling tools for mining software repositories using mapreduce," in Proceedings of the IEEE/ACM international conference on Automated software engineering, 2010, pp. 275-284.
[8] S. Livieri, Y. Higo, M. Matsushita, and K. Inoue, "Very-large scale code clone analysis and visualization of open source programs using distributed CCFinder: D-CCFinder," in Proceedings of the 29th international conference on Software Engineering, 2007, pp. 106-115.
[9] W. Shang, Z. M. Jiang, B. Adams, and A. E. Hassan, "MapReduce as a general framework to support research in Mining Software Repositories (MSR)," in Mining Software Repositories, 2009. MSR'09. 6th IEEE International Working Conference on, 2009, pp. 21-30.
[10] L. Singer, F. Figueira Filho, B. Cleary, C. Treude, M.-A. Storey, and K. Schneider, "Mutual assessment in the social programmer ecosystem: an empirical investigation of developer profile aggregators," in Proceedings of the 2013 conference on Computer supported cooperative work, 2013, pp. 103-116.
[11] G. Destefanis and M. Ortu, "Position Paper: Are Refactoring Techniques Used by Developers? A Preliminary Empirical," presented at the International Workshop on Refactoring & Testing (RefTest), 2014.
[12] F. Thung, T. F. Bissyandé, D. Lo, and L. Jiang, "Network structure of social coding in GitHub," in Software Maintenance and Reengineering (CSMR), 2013 17th European Conference on, 2013, pp. 323-326.
[13] F. Lanubile, F. Calefato, and C. Ebert, "Group Awareness in Global Software Engineering," IEEE software, vol. 30, pp. 18-23, 2013.
[14] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, "The promises and perils of mining GitHub," in Proceedings of the 11th Working Conference on Mining Software Repositories, 2014, pp. 92-101.
[15] G. Gousios, "The GHTorrent dataset and tool suite," in Proceedings of the 10th Working Conference on Mining Software Repositories, 2013, pp. 233-236.
[16] L. Dabbish, C. Stuart, J. Tsay, and J. Herbsleb, "Social coding in GitHub: transparency and collaboration in an open software repository," in Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work, 2012, pp. 1277-1286.
