Fish 2017


A Code Inspection Tool by Mining Recurring Changes in Evolving Software


Alex Fish, Thuy Linh Nguyen, Myoungkyu Song
Department of Computer Science
University of Nebraska, Omaha, NE, USA
{acfish,ltnguyen,myoungkyu}@unomaha.edu

Abstract: Mining software repositories has frequently been investigated in recent research. Software modifications in repositories are often recurring changes: similar but different changes across multiple locations. It is not easy for developers to find all the relevant locations to maintain such changes, including bug fixes, new feature additions, and refactorings. Performing recurring changes is tedious and error-prone, resulting in inconsistent and missing updates. To address this problem, we present CloneMap, a clone-aware code inspection tool that helps developers ensure the correctness of recurring changes to multiple locations in evolving software. CloneMap allows developers to specify the old and new versions of a program. It then applies a clone detection technique to (1) mine repositories to extract differences of recurring changes, (2) visualize the clone evolution, and (3) help developers focus their attention on potential anomalies, such as inconsistent and missing updates.

I. INTRODUCTION

Software evolves with continuous changes. Developers add new features, fix bugs, and conduct refactorings as daily tasks for development and maintenance. Recent work reports that understanding code changes is the core of advanced development and maintenance activities, such as program investigation of unexpected behaviors after large-scale changes [7], and duplicated code management in evolving software [12].

According to recent studies, many developers edit existing code in numerous classes repetitively, in similar but different ways [7], [8], [11]. For example, Nguyen et al. mined repositories and found that 17% to 45% of bug fixes are recurring edits that comprise similar changes to numerous methods [11]. Kim et al. find that, on average, 75% of structural changes to mature software systems are recurring changes. They find that these changes are similar but different, and that their change contexts consist of similar structures. As another class of recurring changes in software repositories, changes to APIs, such as bug fixes, refactorings, or implementations of new requirements, require client programs to make similar changes to correctly use the updated libraries [3].

Developers often need to determine why a program does not behave as intended while inspecting changes in the modified software. Understanding recurring changes and finding all the relevant locations is currently a tedious and error-prone process. Some recurring changes comprise slight lexical modifications; others require complicated changes scattered across different methods and files. Although these edits often consist of similar syntactic changes to the code, such as adding statements or modifying existing expressions, it is not easy for developers to find all the relevant program elements, because investigating such similar change contexts typically requires a manual process, resulting in inconsistent and missing updates.

There is limited support for code inspection of recurring changes in existing mining tools. In many cases, these tools produce lower-level changes, although developers have applied high-level operations such as refactorings or crosscutting modifications. The search and replace features in text editors or Integrated Development Environments (IDEs), such as Eclipse, are widely used but do not support non-contiguous changes. Recent work finds and recommends edits for new locations based on changes to similar code fragments, or on data and control dependences between changes, by using examples [9], [10], [11]. Still, developers inspect individual changes by manually comparing differences between changed code fragments, rather than capturing structural changes that share common characteristics.

To address the numerous issues related to duplicated code, clone detection techniques have been proposed using text-based [13], token-based [6], tree-based [4], and graph-based [2] approaches. These tools detect clones by matching code fragments with gaps, and have been shown to scale to large systems. Despite high performance in detecting code clones [14], understanding recurring changes remains challenging in evolving software. Investigating recurring changes with different variable, method, and type names in a new context is tedious and error-prone. Developers are burdened with detecting inconsistent and missing updates.

To overcome these limitations and to complement existing techniques, we describe the design and implementation of a clone-aware code inspection tool called CloneMap. CloneMap generates the differences between old and new versions of a program by applying a mining technique for version archives. It then clusters similar code fragments into groups by using a clone detection technique. Given a set of clone groups, CloneMap applies visualization to show clone evolution by demonstrating edit operations on Abstract Syntax Tree (AST) nodes. It highlights potential anomalies, such as inconsistent and missing updates, to help developers ensure the correctness of recurring, similar changes distributed across a large number of locations.

© 2017 IEEE. 978-1-5386-1389-4/17/$31.00. SoftwareMining 2017, Urbana-Champaign, IL, USA
II. RELATED WORK

Clone Analysis. Other work has been done in the area of clone detection and analysis, with varying styles of implementation and degrees of success. These varying styles produce similar results, but require tradeoffs in one or a combination of speed, memory requirements, and portability.

Token-based, or lexical, analysis is the most widely applied style of clone detection. In token-based analysis, the source code is divided into many small, distinct chunks, referred to as tokens. Each token represents a fragment of source code and can have sub-tokens within it. For example, a for-loop would be its own token, but would contain sub-tokens for the operations within it, such as variable declarations or method calls. The analyzers work inside-out: they find the smallest tokens and use them to build larger tokens until the file can be expressed as a sequence of larger-scale tokens.

There are two major token-based clone analyzers, CCFinder [6] and iClones, the latter of which we used in our raw data collection. To perform the analysis, the programs parse the code as described, then perform various transformations that clean and change the code. The analyzers then compare the transformed tokens from different parts of the code base against each other and assign clone pairs or groups if multiple tokens are deemed similar.

Text-based analyzers look at lines of source code as strings of text, rather than as distinct code elements. Code fragments are compared against each other to find patterns between them, and if a threshold of similarity is met, they are considered clones of each other. The analyzers modify the code fragments throughout the process, adding or removing minor elements that would not change how the code runs but would change the structure of the code, such as white space or parentheses, which lets them find clones they could not find at first.

Clone Visualization. Other work has been done in the area of clone visualization. CloneDetective provides an Eclipse workbench with various tools to assist developers in finding and maintaining clones [5]. The extent of its visualization aspect is somewhat limited, as it is more of a tool for finding clones within the code base. CloneDetective provides several styles of graphs depicting the throughput of clones and how they evolve within the code. CloneDetective specializes in finding clones and exporting the data to a human-readable format.

CloneTracker is another tool for clone visualization [1]. CloneTracker is similar to CloneDetective. Its main purpose is clone detection, with some minor elements of visualization to ease the process of software maintenance. CloneTracker's visualization aspect provides support by giving warnings when the code currently being edited is part of a clone group. It also provides the developer with a clone documentation window, giving the developer the data and interaction needed to see clone groups and find the clones within them.

III. MINING RECURRING CHANGES

CloneMap has been implemented in the context of the Eclipse IDE, a widely used, extensible, open-source development environment. Our tool is designed as a combination of Eclipse views and a plug-in that together constitute a static program analysis technique. Conceptually, the plug-in can be divided into three functional parts. One part is responsible for deriving a set of AST node representations from two consecutive versions in a repository. Another part parses clone groups reported by Deckard, a clone detector [4].¹ The third part manages the change views that allow the user to visualize recurring change information, namely clone evolution and change anomaly detection. Figure 1 shows CloneMap's workflow.

[Fig. 1: The system workflow.]

Purpose. The purpose of CloneMap is to detect missing or incomplete updates to fragments of source code that are part of clone groups, and to provide an interactive visual aid that helps developers handle such scenarios. When an issue occurs within one element of a clone group, the source code fragment must be modified to fix the issue. Then, all other elements of the group must be updated along with the first, as they are highly likely to incur a similar issue.

This is a very time-consuming process and leads to increased effort and attention spent on maintaining the software. It can also lead to a further issue: not every clone in the group is updated as it should be, and such a clone may be lost to the clone analyzer in the next revision of the project. CloneMap provides an interactive graph to ease and speed up this process. Developers can look at either the graphs or the generated data to find clones to keep an eye on, as well as clones which have not been properly updated along with their sister clones.

Representing AST Nodes. CloneMap converts source code to Abstract Syntax Trees (ASTs) with ASTParser.² It represents each program version using a set of AST nodes that describe program elements (e.g., method, class, and package), their containment relationships, and their structural dependencies. CloneMap uses the Zest Graphical Framework to visualize the code base's structure. This framework provides a medium both for displaying code in the form of a tree-style graph and for letting the user interact with the differing levels of granularity at which the code can be viewed.

¹ The default settings are used: 30 for minT (the minimal number of tokens required for clones), 2 for stride (the size of the sliding window), and 0.95 for Similarity.
² We leverage the ASTParser provided by the Eclipse JDT toolkit, http://eclipse.org/jdt.
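
The node representation described under "Representing AST Nodes" can be pictured with a small sketch. The Python fragment below is only an illustration under stated assumptions: CloneMap itself derives its nodes from the Eclipse JDT ASTParser, whereas the `ElementNode` class, its fields, and the example package, class, and method names are hypothetical and chosen for brevity. It models program elements and their containment relationships, and recovers the package, class, and method names that enclose an element.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ElementNode:
    """Hypothetical stand-in for one program element (package, class, or method)."""
    kind: str                      # "package", "class", or "method"
    name: str
    parent: Optional["ElementNode"] = None
    children: List["ElementNode"] = field(default_factory=list)

    def add(self, kind: str, name: str) -> "ElementNode":
        """Create a child element and record the containment relationship."""
        child = ElementNode(kind, name, parent=self)
        self.children.append(child)
        return child

    def path(self) -> List[str]:
        """Names from the outermost container down to this element."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


# Build a tiny containment tree: package -> class -> methods.
pkg = ElementNode("package", "org.example.log")
cls = pkg.add("class", "Logger")
info = cls.add("method", "info")
warn = cls.add("method", "warn")

print(info.path())  # ['org.example.log', 'Logger', 'info']
```

A path of this kind (package, class, method) is stable across versions in a way that character offsets are not, which is why it is a plausible key for relating clones between revisions.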

Managing and Tracking Clone Outputs. Tracking how clones evolve over time requires the identification of clones across several versions of a program. Within a single version, clones are typically identified by offset numbers (i.e., the starting position from the beginning of the source file). However, this approach does not work when applied to several versions: offset numbers may change when other content is removed or added.

The first step, once raw clone data is collected, is to preprocess the data. This step establishes shared clone groups between revisions, that is, groups which represent the same clones (the same code fragments) in, for example, v1.0 and v1.1 of a program. If all instances of clones in a clone group in v1.0 are deleted or changed beyond similarity, the group would not occur in v1.1. But if the group is not changed at all, or its clones have been similarly updated, the clone group would occur in the newer revision, possibly with a different group number. This step assigns a more meaningful group number tying the related groups together.

The datasets of two revisions, an older revision and a newer revision, are compared. Each clone in the older revision's dataset is analyzed. For each entry in the older revision's dataset, the preprocessing step finds the clone's corresponding source code information. Once found, the new revision's dataset is iterated through to find a clone with matching source code information, such that both clones refer to the same method and to identical or near-identical source code fragments within that method. Once matched, the groups these clones reside in are considered matching groups and are put into a separate data structure that records the clone groups' correlation.

This process continues until every clone group in the old revision's dataset has either found a correlating group in the new dataset or been determined to be left out from one revision to the next, and therefore to have no correlating group. This information is recorded and output as a slightly updated, more relevant dataset, in which similar groups have the same clone group number. This dataset, output from the preprocessing phase, is what is used in the main analysis.

Mining Recurring Changes. Once datasets relating two revisions' clone groups together have been generated, the main analysis of the data begins. The analysis uses the older revision's data as a baseline for checking which clones have been properly updated. The user could compare the first revision's clone data against the newest revision, or one revision against the next, or any combination of revisions that gives them the information they need to find un-updated clones and maintain their program more efficiently.

First, the datasets are read into memory for faster read access later on. The datasets are read into a HashMap, where the key is the clone group number and the value is a list of the clones within that group. The individual clone entries in the table are represented by CloneData objects, which simply hold the entries' information. This step retains all information from the external datasets, but provides faster read access and an easier way to keep groups separate from each other.

Now that all the preparatory steps are finished, the bulk of the analysis begins. First, nodes on the graph which correlate to clone data³ are marked.⁴ Clones are analyzed per group. The analysis takes a clone in the group currently being analyzed and finds the information of the method in which the code fragment represented by the clone lies. Then, each clone in the matching group in the new revision's dataset is iterated through, looking for matching method information, such as the method name and the names of its class and package. If the clone from the old revision's dataset is found in the new revision's dataset, the clone is considered successfully updated. If it is not found, the nodes on the graph are marked to notify the user that the relevant code fragments require further attention.

Once the main analysis has finished, a final dataset is generated. This dataset describes in more detail the clone relationships between the old revision and the new revision. Once these two revisions have been analyzed, CloneMap can use this generated dataset to mark the graphs for updated and non-updated clone data much faster in future uses.

The user should keep in mind that this dataset only accurately describes the clone relationships between the two revisions analyzed. It can be used for other revisions if their raw clone datasets are similar enough to the dataset analyzed, but the user should expect some inaccuracies. It is recommended to perform the main clone analysis for each pair of revisions to be scanned for non-updated clones.

³ If a node on the graph and a clone in the dataset represent the same code fragments, the node and clone entry are correlated.
⁴ When nodes on the graph are marked, they are changed to an appropriate color: gray to signify clone data correlation, and red if the node correlates with clone data which has not been updated appropriately.

IV. CURRENT STATUS AND RESULTS

CloneMap was tested across many revisions of multiple open-source projects, including Apache Log4j, JRuby, and Apache Tomcat. Tables I, II, and III display the results generated. For the projects tested, the total number of groups per revision ranged from 100 to 500, with each group's total clones ranging from 2 to 15. The total number of groups varies strongly with the number of files in the project and the average size of each file. There were on average 3 to 4 clones in each group. Clone groups which represent smaller code fragments tend to have more clones in the group. A plausible explanation is that smaller fragments are easier to find uses for, while larger fragments have more niche uses.

TABLE I: CLONE EVOLUTION BETWEEN REVISIONS OF APACHE LOG4J.

Revision Pair    Updated    Not Updated
2.0-2.1          146        4
2.1-2.2          152        4
2.2-2.3          155        0
2.3-2.4          153        5
2.4-2.5          176        5
2.5-2.6          189        11
2.6-2.7          238        8
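
The preprocessing and main analysis steps of Section III can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CloneMap's implementation: the tool is an Eclipse plug-in that stores `CloneData` objects in a `HashMap`, while the Python field names, the function names `load_groups`, `align_groups`, and `find_not_updated`, and the sample data are hypothetical. The sketch also simplifies matching to exact equality of package, class, and method names, whereas the text allows near-identical fragments within the same method.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CloneData:
    """One clone entry; the field names are assumptions for this sketch."""
    group: int
    package: str
    class_name: str
    method: str

    def location(self) -> Tuple[str, str, str]:
        # Offsets shift between versions, so match on the enclosing method instead.
        return (self.package, self.class_name, self.method)


def load_groups(entries: List[CloneData]) -> Dict[int, List[CloneData]]:
    """Index clones by group number (the dict plays the role of the HashMap)."""
    groups: Dict[int, List[CloneData]] = defaultdict(list)
    for entry in entries:
        groups[entry.group].append(entry)
    return dict(groups)


def align_groups(old, new) -> Dict[int, int]:
    """Preprocessing: map each old group number to a new group sharing a method."""
    new_group_of = {}
    for number, clones in new.items():
        for clone in clones:
            new_group_of.setdefault(clone.location(), number)
    mapping = {}
    for number, clones in old.items():
        for clone in clones:
            if clone.location() in new_group_of:
                mapping[number] = new_group_of[clone.location()]
                break
    return mapping


def find_not_updated(old, new, mapping) -> List[CloneData]:
    """Main analysis: old clones with no counterpart in the matching new group."""
    missing = []
    for number, clones in old.items():
        matched = new.get(mapping.get(number), [])
        present = {clone.location() for clone in matched}
        missing.extend(c for c in clones if c.location() not in present)
    return missing


old = load_groups([
    CloneData(1, "org.example", "Reader", "open"),
    CloneData(1, "org.example", "Writer", "open"),
    CloneData(2, "org.example", "Cache", "evict"),
    CloneData(2, "org.example", "Pool", "evict"),
])
new = load_groups([
    CloneData(7, "org.example", "Reader", "open"),  # same group, renumbered
    CloneData(7, "org.example", "Writer", "open"),
    CloneData(3, "org.example", "Cache", "evict"),  # Pool.evict left the group
    CloneData(3, "org.example", "Other", "evict"),
])

mapping = align_groups(old, new)
not_updated = find_not_updated(old, new, mapping)
print(mapping)                               # {1: 7, 2: 3}
print([c.location() for c in not_updated])   # [('org.example', 'Pool', 'evict')]
```

In this sketch a renamed method or class would be reported as not updated, because its location key no longer matches anything in the new revision.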

The longest step in the process of clone analysis is the preprocessing step, although it needs to be done only once per revision pair. This step exists to correct inaccuracies in the clone detection. Some groups may be similar to each other but not caught by the clone detector. The preprocessing step combines these similar groups together.

The next step, the main analysis, takes some time, but significantly less than the preprocessing step. This step must also be run once per revision pair. It finds clones which are left out of the new version, or which have not been properly updated and are no longer in a clone group. Some clones may be updated or changed to the point that they are unrecognizable from the previous version. Such cases include renaming a method or class, or significantly changing the size or contents of the code fragment the clone represents. This is a possible point of inaccuracy in the analysis, as it would deem the previous version's clones not updated because it cannot reconcile these clones in the old and new revisions. This case makes up an insignificant or near-insignificant number of the total clone updates in the projects tested.

The main analysis provides an output file which can be reused to display the analyzed clone data. The reused data covers one revision pair at a time. This step, the post-process step, uses the already-analyzed data to display to the user the information on updated or missing clones on the graph. Reusing this data takes little time in further maintenance of the software.

TABLE II: CLONE EVOLUTION OF REVISIONS IN JRUBY.

Revision Pair    Updated    Not Updated
1.0-1.1          223        1
1.1-1.2          192        0
1.2-1.3          352        0
1.3-1.4          336        0
1.4-1.5          298        0
1.5-1.6          366        1
1.6-1.7          386        0
1.7.0-1.7.3      463        0
1.7.3-1.7.6      473        0
1.7.6-1.7.9      489        0

TABLE III: CLONE EVOLUTION OF REVISIONS IN TOMCAT.

Revision Pair    Updated    Not Updated
7.0.0-7.0.10     357        2
7.0.10-7.0.20    353        0
7.0.20-7.0.30    389        0
7.0.30-7.0.40    397        0
7.0.40-7.0.50    388        0
7.0.50-7.0.60    432        1
7.0.60-7.0.70    464        0
7.0.70-8.0.0     459        0

V. CONCLUSION

Software cloning, the act of reusing fragments of code for identical or very similar use, can speed up early development of software. This boost in development speed is offset by the expense and time put into maintenance later in the life cycle of the software. If an issue arises in a cloned code fragment, a time-consuming process follows in which the developer must track down clones of the code fragment that was changed and perform a similar update.

We developed CloneMap to aid developers in the processes of updating and maintaining source code. CloneMap analyzes clone data and provides an interactive visual aid for developers to keep an eye on clones in the code base, as well as to find clones which have yet to be updated. CloneMap provides a tree-style graph of the source code, with nodes marked to indicate whether they have been properly updated or not.

REFERENCES

[1] E. Duala-Ekoko and M. P. Robillard. Clonetracker: tool support for code clone management. In Proceedings of the 30th international conference on Software engineering, pages 843–846. ACM, 2008.
[2] M. Gabel, L. Jiang, and Z. Su. Scalable detection of semantic clones. In Proc. ICSE, pages 321–330, 2008.
[3] J. Henkel and A. Diwan. Catchup!: Capturing and replaying refactorings to support API evolution. In ICSE '05, pages 274–283, 2005.
[4] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In Proc. of ICSE, pages 96–105. IEEE, 2007.
[5] E. Juergens, F. Deissenboeck, and B. Hummel. Clonedetective-a workbench for clone detection research. In Proceedings of the 31st International Conference on Software Engineering, pages 603–606. IEEE Computer Society, 2009.
[6] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, 28:654–670, 2002.
[7] M. Kim and D. Notkin. Discovering and representing systematic code changes. In ICSE '09: Proceedings of the 2009 IEEE 31st International Conference on Software Engineering, pages 309–319. IEEE Computer Society, 2009.
[8] M. Kim, V. Sazawal, D. Notkin, and G. Murphy. An empirical study of code clone genealogies. In Proc. of ESEC/FSE, pages 187–196, 2005.
[9] N. Meng, M. Kim, and K. McKinley. LASE: Locating and applying systematic edits by learning from examples. In ICSE '13: Proceedings of 35th IEEE/ACM International Conference on Software Engineering, 10 pages. IEEE Society, 2013.
[10] H. A. Nguyen, T. T. Nguyen, G. Wilson, Jr., A. T. Nguyen, M. Kim, and T. N. Nguyen. A graph-based approach to api usage adaptation. In Proceedings of the ACM international conference on Object oriented programming systems languages and applications, OOPSLA '10, pages 302–321, New York, NY, USA, 2010. ACM.
[11] T. T. Nguyen, H. A. Nguyen, N. H. Pham, J. Al-Kofahi, and T. N. Nguyen. Recurring bug fixes in object-oriented programs. In ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, pages 315–324, New York, NY, USA, 2010. ACM.
[12] T. T. Nguyen, H. A. Nguyen, N. H. Pham, J. M. Al-Kofahi, and T. N. Nguyen. Clone-aware configuration management. In Proc of ASE, 2009.
[13] C. K. Roy and J. R. Cordy. Nicad: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In Program Comprehension, 2008. ICPC 2008. The 16th IEEE International Conference on, pages 172–181. IEEE, 2008.
[14] C. K. Roy, J. R. Cordy, and R. Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 74(7):470–495, 2009.
