
2017 International Symposium on Computer Science and Software Engineering Conference (CSSE)

Semantic-based Software clustering using hill climbing


Masoud Kargar
Faculty of Computer and Information Technology Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran
Email: Kargar@qiau.ac.ir

Ayaz Isazadeh
Department of Computer Science, University of Tabriz, Tabriz, Iran
Email: Isazadeh@Tabrizu.ac.ir

Habib Izadkhah
Department of Computer Science, University of Tabriz, Tabriz, Iran
Email: Izadkhah@Tabrizu.ac.ir

Abstract: Clustering techniques are used to extract software architecture during reverse engineering. The first step in software clustering is to extract the Call Dependency Graph (CDG) from the source code. A CDG records the method invocations between software artifacts. This graph is tightly coupled to the programming language, so each existing toolset for constructing a CDG works only on a particular programming language. Consequently, existing CDG extraction toolsets cannot be applied to large-scale software systems that are written in several programming languages, e.g., Mozilla Firefox. To overcome this problem, we propose a new dependency graph, called the semantic dependency graph (SDG), which is independent of the programming language. The graph is generated by combining lexical analysis and latent semantic analysis (LSA). Two versions of the hill climbing algorithm are used to compare clustering performance on the SDG and the CDG. The experimental results show that the SDG can replace the CDG. Because the SDG is independent of programming languages, it helps software engineers cluster large-scale software systems and extract their architecture in order to understand and maintain existing systems.

Keywords: software clustering, semantic dependency graph, call dependency graph, Hill Climbing.

I. INTRODUCTION
Essential software systems are large-scale and intricate; as a consequence, their structure is hard to comprehend. One reason for this complexity is that the source code consists of many files that are related to each other. Moreover, even after a software engineer has understood a system's structure, this understanding is hard to retain, because the structure tends to change during maintenance.

Various clustering algorithms have been proposed to describe a system as a set of subsystems. In these algorithms, clusters play the role of subsystems, which are collections of source code resources that exhibit similar features, properties, or behaviors. As a system grows over time, analyzing its source code manually becomes very difficult and even impossible [1]. Several tools expose the relationships between entities in the software's code [2]. These relationships involve entities such as classes, functions, procedures, and variables, and relations such as function calls and inheritance between classes. The most common representation of the relationships between the components of a source code is the Call Dependency Graph, or CDG for short. Figure 1 shows an example of graph-based clustering [3].

Fig. 1. An overview of software clustering

Identifying clusters is the basis for understanding software systems, and a quality function such as Turbo MQ is used for this purpose. The value of the quality function depends on the internal composition of the clusters and on the number of clusters. A CDG can be created by various tools, including Acacia, Columbus, Chava, NDepend, Understand, Bauhaus, DPMld, and others [4]. These tools are limited in the programming languages they support and in the volume of source code they can handle.

There are also various clustering algorithms with different local and global search behaviors [3]. A greedy algorithm can become trapped in a local optimum, which makes it a good setting for testing the replacement of the CDG with the SDG. The results of this replacement are therefore tested with a greedy algorithm. A new SDG of the source code is proposed, together with a hill climbing clustering algorithm based on that graph.

In this paper, we propose a language-independent method that creates a new graph, called the SDG, without considering keywords, data structures, variables, or programming language syntax. The graph is built from standard, general words with high frequency. The SDG is intended for large-scale software systems written in multiple programming languages and for software systems for which no CDG extraction tool exists.

The rest of the paper is organized as follows. Section II explains the background: the call dependency graph, the greedy hill-climbing algorithms, and Turbo MQ. Section III addresses related work. Section IV describes the SDG and the steps needed to create it. Section V presents the experimental results of the hill-climbing algorithms on the SDG and the CDG. Finally, conclusions and future work are presented.

II. BACKGROUND
This section reviews three topics: the Steepest Ascent Hill Climbing (SAHC) and the Next Ascent Hill

978-1-5386-1302-3/17/$31.00 ©2017 IEEE


Authorized licensed use limited to: Istinye Universitesi. Downloaded on February 26,2023 at 01:11:00 UTC from IEEE Xplore. Restrictions apply.

Climbing (NAHC) algorithms, then the call dependency graph and the module dependency graph (MDG), and finally the evaluation measures and fitness functions MQ and Turbo MQ.

Heuristic clustering techniques, such as hill climbing, have been used as a substitute for traditional clustering methods, such as hierarchical clustering, to tackle software module clustering. Hill climbing is a greedy algorithm that can cope better with this problem [4].

The Steepest Ascent Hill Climbing (SAHC) algorithm is based on traditional hill climbing techniques. Its goal is to progressively create a new partition from the current partition P of the MDG such that the MQ of the new partition is larger than the MQ of the current one. Each iteration attempts to improve MQ by finding a maximal neighboring partition (MNP) of the current partition. The MNP is determined by examining all neighboring partitions (NPs) of P and selecting the one with the largest MQ. The SAHC algorithm converges when it is unable to find an MNP of the current partition with a larger MQ [1, 5].

The NAHC algorithm is another hill climbing algorithm that is similar to, but often faster than, its SAHC counterpart. It differs from SAHC in how it searches the neighboring partitions of the MDG. In NAHC, the improvement step is based on finding a better neighboring partition (BNP) of the current partition. A BNP of the current partition P is found by enumerating the NPs of P in random order until an NP with a larger MQ is found. The NAHC algorithm converges when no BNP of P with a larger MQ can be found [1].

The hill-climbing algorithm uses a threshold $\eta$ ($0\% \le \eta \le 100\%$) that determines the minimum share of neighbors that must be examined during each iteration of the hill-climbing process. Experience has shown that examining many neighbors during each iteration (i.e., using a large threshold such as $\eta \ge 75\%$) often produces a better result, but the time to converge is generally longer than when the first discovered neighbor with a higher MQ ($\eta = 0\%$) is used as the basis for the next iteration. Before the adjustable threshold $\eta$ was introduced, only the two extremes were considered: use the first neighboring partition found with a better MQ as the basis for the next iteration ($\eta = 0\%$), or examine all neighboring partitions and pick the one with the largest MQ as the basis for the next iteration ($\eta = 100\%$). The first algorithm is called Next Ascent Hill Climbing, and the second is called Steepest Ascent Hill Climbing. Thus, by adjusting the neighboring threshold $\eta$, the user can configure the generic hill-climbing algorithm to behave like NAHC, SAHC, or any algorithm in between [1].

The primary step in the software modularization process is to extract a Call Dependency Graph (CDG) from the source code. In a CDG, resources (e.g., classes, modules, variables) are the graph nodes, which depend on each other in intricate ways (e.g., procedure calls, inheritance relationships, variable references) [6]. This graph captures the structure of the software system, and a clustering algorithm partitions it into clusters so that each cluster represents a subsystem.

Evaluating the modules is critical during clustering. Usually, algorithms use a global fitness measure called Modularization Quality (MQ) [7]. MQ is designed to give better fitness scores to clusterings with higher overall intra-cluster cohesion and lower inter-cluster coupling. In summary, the clustering algorithm starts with a random partition and repeatedly moves to a better neighboring partition until no neighboring partition has a larger MQ. Anquetil et al. developed a similarity measure based on precision and recall [8]. Tzerpos and Holt authored a paper on a distance measure called MoJo [9, 10].

The objective functions used in search-based software clustering algorithms are Basic MQ and Turbo MQ [2]. If $A_i$ is the internal connection level of module $i$ and $E_{ij}$ is the connection level between two modules $i$ and $j$, then for a program graph divided into $k$ modules, Basic MQ is defined as follows:

$$\mathrm{BasicMQ} = \frac{1}{k}\sum_{i=1}^{k} A_i - \frac{1}{\frac{k(k-1)}{2}}\sum_{i,j=1}^{k} E_{ij} \qquad (1)$$

If $\mu_i$ and $\varepsilon_{i,j}$ respectively represent the internal edges of module $i$ and the edges between two modules $i$ and $j$, then the Turbo MQ value is computed as follows:

$$\mathrm{TurboMQ} = \sum_{i=1}^{k} CF_i \qquad (2)$$

$$CF_i = \frac{2\mu_i}{2\mu_i + \sum_{j=1}^{k}\left(\varepsilon_{i,j} + \varepsilon_{j,i}\right)} \qquad (3)$$

III. RELATED WORK
There are various algorithms for clustering software systems. Some of these algorithms rely on the CDG and are therefore restricted by it. The major problem for these algorithms is their dependence on the CDG, which in turn depends on the programming language. Other software clustering algorithms use the semantics of the source code.

The extracted semantics depend on the source code, directly or indirectly. For example, the semantic-based algorithms in [11-13] depend on Java keywords, and [14] depends on the keywords of COBOL. The semantic-based algorithms in [11] and [15] are restricted to object-oriented systems, and [16] is restricted to procedural languages. In semantic-based clustering methods, the choice of words from the source code is the primary reason for the differences between the proposed approaches and their limitations, but all of them depend on the syntax of programming languages.

In [14], repeated words are counted within the source code, and a limited number of these words are used as the basis for latent semantic analysis (LSA) and latent semantic indexing (LSI). In [17], parts of the source code such as variables, function names, comments, and class names are selected. In [18] and [13], the keywords of the programming language are chosen as the terms in semantic analysis. In [19], the precise words of the programming language are selected.
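As an illustration of the Turbo MQ measure defined in Eqs. (2) and (3) of Section II, the following is a minimal Python sketch, not the implementation used in the paper. The function name `turbo_mq`, the edge-list input format, and the convention that a cluster with no incident edges contributes a cluster factor of 0 are assumptions made here for illustration.

```python
from collections import defaultdict


def turbo_mq(edges, cluster_of):
    """Turbo MQ of a partition (hypothetical helper, see Eqs. (2)-(3)).

    edges      -- iterable of (source, target, weight) tuples
    cluster_of -- dict mapping each node to its cluster id
    """
    mu = defaultdict(float)   # weight of edges inside each cluster
    eps = defaultdict(float)  # weight of edges leaving or entering each cluster
    for src, dst, weight in edges:
        c_src, c_dst = cluster_of[src], cluster_of[dst]
        if c_src == c_dst:
            mu[c_src] += weight
        else:
            eps[c_src] += weight  # outgoing edge, epsilon_{i,j}
            eps[c_dst] += weight  # incoming edge, epsilon_{j,i}

    total = 0.0
    for cluster in set(cluster_of.values()):
        denominator = 2 * mu[cluster] + eps[cluster]
        # Assumption: a cluster without any edges contributes CF = 0.
        if denominator > 0:
            total += 2 * mu[cluster] / denominator
    return total
```

For example, `turbo_mq([("a", "b", 1), ("b", "c", 1)], {"a": 0, "b": 0, "c": 1})` sums one cluster factor per cluster; a partition that places strongly connected files together and weakly connected files apart scores higher.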


IV. SOFTWARE CLUSTERING BASED ON SEMANTIC HILL CLIMBING
In this method, we build an SDG between the files of a software system. The SDG then replaces the call dependency graph (CDG), and finally the hill climbing algorithm is applied. The SDG between software files is built by the following steps:

1: word5000 is a standard data set consisting of 5,000 common words [20]. To reduce the effect of conjunctions, verbs, and adjectives, and to speed up the calculation of similarities between files, we remove them from the word5000 data set and keep only the nouns, of which there are 2543.

2: Lexical analysis is performed on all software files. For this purpose, delimiters are used to separate the parts of a file from one another. All parts are then converted to lowercase to eliminate case sensitivity.

3: A matrix X is created whose rows correspond to the nouns and whose columns correspond to the files. The matrix size is therefore 2543 * number of files. The element in row i and column j of X is the number of occurrences of noun i in file j.

4: From the X matrix, the SVD algorithm yields three matrices U, V, and S [21], [22]. Based on this step, the dimensions of the X matrix can be reduced. This reduction increases the processing speed at the cost of a slight decrease in accuracy.

5: Using LSA (Latent Semantic Analysis) and the matrices of the previous step, two vectors are obtained for each pair of files [23]. The cosine similarity is calculated for each pair of vectors, and this value is an element of the semantic matrix between files. The semantic matrix is a square matrix whose number of rows and columns equals the number of files.

6: Each of the Steepest Ascent Hill Climbing and Next Ascent Hill Climbing clustering algorithms is run once with the call dependency matrix and once with the semantic matrix. The implementation of the algorithms uses Turbo MQ. Figure 2 shows the steps of the proposed hill climbing algorithm.

Fig. 2. The proposed Semantic Hill climbing.

The creation of the SDG and the use of Steepest / Next Ascent Hill Climbing on the SDG are expressed below as pseudocode. The following pseudocode shows how to build the SDG:

Algorithm 1: Create Semantic Dependency Graph (SDG)
begin
  1 Create the LSA (Latent Semantic Analysis) data
    1.1 Calculate matrix X (frequency of contents per file)
    1.2 Prepare the SVD matrices (i.e., U, S and V)
  for each pair of files (i, j) do
    2 Calculate the vector pair for the file pair, based on the matrices S and X
    3 Calculate the cosine similarity between the two vectors (i, j)
    4 Store the similarity value at indices (i, j) and (j, i) of the similarity matrix
  end for
end

The following pseudocode shows Steepest Ascent Hill Climbing based on the SDG:

Algorithm 2: Steepest Ascent Hill Climbing based on Semantic dependency graph (SAHCS)
Input: A Semantic Graph N*N
Output: A string of length N
begin
  BestTMQ = -INF;
  Choice.value = Create_initilize_random_value();
  Choice.CalculateTMQ();
  while Choice.TMQ > BestTMQ or Iteration < threshold
    Result_FN_CN = SelectBestCandidateNeighborhood(Semantic Graph);
    Choice.UpdateValue(FN, CN);
    Choice.CalculateTMQ();
  end while
  Choice.ShowFilePerCluster();
end

Algorithm 2 shows how SAHCS clusters using the SDG. In this algorithm, Choice is the string selected at each step, and Turbo MQ (TMQ) is the quality of the selected string. At each step, the algorithm considers all neighbors in order to choose Choice.

The following pseudocode shows Next Ascent Hill Climbing based on the SDG:

Algorithm 3: Next Ascent Hill Climbing based on Semantic dependency graph (NAHCS)
Input: A Semantic Graph N*N
Output: A string of length N
begin
  BestTMQ = -INF;
  Choice.value = Create_initilize_random_value();
  Choice.CalculateTMQ();
  while Choice.TMQ > BestTMQ or Iteration < threshold
    Result_FN_CN = SelectFirstOptimumCandidateNeighborhood(Semantic Graph);
    Choice.UpdateValue(FN, CN);
    Choice.CalculateTMQ();
  end while
  Choice.ShowFilePerCluster();
end

Algorithm 3 shows how NAHCS clusters using the SDG. The algorithm works like the SAHCS algorithm, except that the first neighbor with a larger TMQ is chosen as the Choice string.

V. RESULT ASSESSMENT
In this research, we used the open-source code of Mozilla. It contains about 41900 files and 55 folders. The first ten folders were selected, excluding the chrome folder, which forms only a single cluster. Table 1 shows the folder names and the number of clusters per folder.
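Before turning to the results, the SDG construction of Section IV (steps 2 to 5, Algorithm 1) can be illustrated by a minimal Python sketch using NumPy. This is a sketch under stated assumptions, not the authors' implementation: the function name `build_sdg`, the `rank` parameter for the reduced dimension, the regular-expression tokenizer, and the use of the rows of V scaled by the singular values as file vectors are all choices made here for illustration. The noun list is passed in as `vocabulary` rather than loaded from word5000.

```python
import re

import numpy as np


def build_sdg(files, vocabulary, rank=2):
    """Return (file names, cosine-similarity matrix) for a set of files.

    files      -- dict mapping file name to file content (any language)
    vocabulary -- list of lowercase nouns (stand-in for the word5000 nouns)
    rank       -- number of singular values kept (assumed parameter)
    """
    names = sorted(files)
    row_of = {word: i for i, word in enumerate(vocabulary)}

    # Step 3: X[i, j] = occurrences of noun i in file j.
    X = np.zeros((len(vocabulary), len(names)))
    for j, name in enumerate(names):
        # Step 2: split on delimiters (here: any non-letter) and lowercase.
        for token in re.split(r"[^A-Za-z]+", files[name].lower()):
            if token in row_of:
                X[row_of[token], j] += 1

    # Step 4: SVD of X, truncated to the leading `rank` singular values.
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    r = min(rank, len(s))
    file_vectors = vt[:r, :].T * s[:r]  # one row per file

    # Step 5: cosine similarity between every pair of file vectors.
    norms = np.linalg.norm(file_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # files without any known noun stay all-zero
    unit = file_vectors / norms
    return names, unit @ unit.T
```

The returned matrix is symmetric, so entry (i, j) equals entry (j, i), as in step 4 of Algorithm 1. Because only the tokenizer touches the file content, the same code applies to files written in different programming languages.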


Table 1. The names of the selected folders and the numbers of the clusters from the Mozilla (Expert view)

Mozilla Source Code
Folder name    File Number    Cluster Number
Accessible     179            8
Browser        45             4
Build          21             2
Content        881            13
Db             97             4
Dom            163            5
Extensions     179            13
Gfx            342            7
Intl           573            7
Ipc            391            4

The Steepest Ascent Hill Climbing and Next Ascent Hill Climbing clustering algorithms were implemented with the SDG and with the call dependency graph. The following results show the difference between the two graphs, based on various criteria.

Table 2. The Precision/Recall metrics results for the algorithms with CDG

Call Dependency graph
Folder name    SAHCD (Precision / Recall)    NAHCD (Precision / Recall)
Accessible     0.1314 / 0.2042               0.1308 / 0.2075
Browser        0.3680 / 0.2911               0.3377 / 0.2671
Build          0.4936 / 0.7549               0.5833 / 0.6946
Content        0.0776 / 0.1298               0.0794 / 0.1338
Db             0.9158 / 0.3917               0.9186 / 0.3838
Dom            0.1906 / 0.2858               0.2107 / 0.3182
Extensions     0.0822 / 0.1479               0.0739 / 0.1401
Gfx            0.1443 / 0.3097               0.1402 / 0.2971
Intl           0.1427 / 0.5885               0.1414 / 0.5900
Ipc            0.2481 / 0.6781               0.2480 / 0.6804

Table 3. The Precision/Recall metrics results for the algorithms with SDG

Semantic Dependency Graph
Folder name    SAHCS (Precision / Recall)    NAHCS (Precision / Recall)
Accessible     0.1053 / 0.2198               0.0998 / 0.2206
Browser        0.3946 / 0.2714               0.3290 / 0.2603
Build          0.5064 / 0.7054               0.4744 / 0.7400
Content        0.0778 / 0.1323               0.0788 / 0.1328
Db             0.9269 / 0.5013               0.9106 / 0.4852
Dom            0.1987 / 0.3015               0.1704 / 0.3020
Extensions     0.0743 / 0.1654               0.0804 / 0.1559
Gfx            0.1525 / 0.3111               0.1517 / 0.3228
Intl           0.1653 / 0.5895               0.1662 / 0.5917
Ipc            0.3344 / 0.6840               0.3318 / 0.6821

Tables 2 and 3 compare the precision and recall metrics of the Steepest Ascent Hill Climbing and Next Ascent Hill Climbing algorithms on the two graphs, the Call Dependency Graph and the SDG. Similarly, Table 4 examines the MoJo metric, and Table 5 compares the FM metric.

Table 4. The MoJo metric results for two algorithms with CDG and SDG

MoJo metrics
               Call Dependency        Semantic
Folder name    SAHCD     NAHCD        SAHCS     NAHCS
Accessible     118       120          116       117
Browser        22        22           21        22
Build          4         4            4         4
Content        690       687          682       683
Db             7         7            6         6
Dom            102       97           99        96
Extensions     129       133          126       123
Gfx            197       198          188       187
Intl           143       143          142       142
Ipc            75        75           74        74

Comparing the results of Tables 2, 3, 4, and 5 shows that the clustering metrics (MoJo, precision/recall, and FM) obtained with the SDG are comparable to, and in many cases better than, those obtained with the Call Dependency Graph. Thus, the SDG, which is independent of the programming language, provides useful information for clustering. Figures 3 and 4 show the third cluster of the Accessible folder, so that the internal state of the cluster can be seen at a glance. The inner structure of the cluster is drawn for each of the two graphs.

Table 5. The FM metric results for two algorithms with CDG and SDG

FM metrics
               Call Dependency        Semantic
Folder name    SAHCD     NAHCD        SAHCS     NAHCS
Accessible     0.1599    0.160        0.1424    0.1375
Browser        0.2408    0.2210       0.2068    0.2153
Build          0.5969    0.6946       0.5896    0.5781
Content        0.0972    0.0996       0.0980    0.0989
Db             0.2491    0.2426       0.3435    0.3307
Dom            0.2287    0.2535       0.2396    0.2179
Extensions     0.1056    0.0967       0.1025    0.1061
Gfx            0.1968    0.1905       0.2047    0.2064
Intl           0.2297    0.2281       0.2582    0.2595
Ipc            0.3633    0.3635       0.4492    0.4464

The number of connections in the SDG is high compared to the CDG. If a software clustering algorithm is prone to getting trapped in local optima, the large number of links in the SDG can help prevent or delay this problem.
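The two hill-climbing variants compared in this section differ only in the neighbor threshold η described in Section II. The following minimal Python sketch shows a generic hill climber in which `eta=0.0` behaves like Next Ascent and `eta=1.0` like Steepest Ascent. It is an illustration under assumptions, not the implementation evaluated here: the function name `hill_climb`, the fixed number of clusters `k`, the definition of a neighbor as "move one node to another cluster", and the `fitness` callback (for example a Turbo MQ function) are all hypothetical.

```python
import random


def hill_climb(nodes, k, fitness, eta=0.0, seed=0, max_iters=1000):
    """Generic hill climbing over partitions of `nodes` into `k` clusters.

    fitness -- callable taking a dict node -> cluster id, higher is better
    eta     -- share of neighbors (0.0 to 1.0) that must be examined
               before an improving neighbor may be accepted
    """
    rng = random.Random(seed)
    partition = {node: rng.randrange(k) for node in nodes}  # random start
    best = fitness(partition)

    for _ in range(max_iters):
        # Neighbors: move a single node into a different cluster.
        moves = [(n, c) for n in nodes for c in range(k) if c != partition[n]]
        rng.shuffle(moves)
        must_examine = int(eta * len(moves))

        chosen, chosen_score = None, best
        for examined, (node, cluster) in enumerate(moves, start=1):
            previous = partition[node]
            partition[node] = cluster       # try the move
            score = fitness(partition)
            partition[node] = previous      # undo it
            if score > chosen_score:
                chosen, chosen_score = (node, cluster), score
            if chosen is not None and examined >= must_examine:
                break

        if chosen is None:                  # no better neighbor: converged
            return partition, best
        partition[chosen[0]] = chosen[1]
        best = chosen_score
    return partition, best
```

Calling the function once with a fitness computed on the CDG and once with a fitness computed on the SDG mirrors the experimental setup in spirit; intermediate values of `eta` trade run time against the quality of the neighbor that is accepted.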


Fig. 3. The third cluster related to the SAHCD algorithm for Accessible folder

Fig. 4. The third cluster related to the SAHCS algorithm for Accessible folder

CONCLUSION
The experimental values show that, when the Call Dependency Graph is replaced by the semantic graph in the greedy algorithm, the following conclusions can be drawn:

1. The construction of the Call Dependency Graph depends on the programming language, and possibly not all files can participate in the graph, whereas the SDG is created independently of the programming language and requires only lexical analysis and semantic similarity. In addition, all files can be involved.

2. When a dependency graph is not available, it can be replaced with the SDG. No problems occur in the clustering performance, and the replacement even improves the results by an acceptable margin.

3. Comments in the open-source files are used as added value. Through them, the programmer, as an expert, contributes to the similarity between files.

FUTURE WORK
Replacing the CDG with the SDG in other search-based clustering algorithms can help to generalize this claim.

REFERENCES
[1] Mitchell, B.S., A heuristic search approach to solving the software clustering problem. 2002, Drexel University.
[2] Mitchell, B.S. and S. Mancoridis, On the automatic modularization of software systems using the bunch tool. IEEE Transactions on Software Engineering, 2006. 32(3): p. 193-208.
[3] Isazadeh, A., H. Izadkhah, and I. Elgedawy, Source Code Modularization - Theory and Techniques. 2017: Springer.
[4] Izadkhah, H., I. Elgedawy, and A. Isazadeh, E-CDGM: An Evolutionary Call-Dependency Graph Modularization Approach for Software Systems. Cybernetics and Information Technologies, 2016. 16(3): p. 70-90.
[5] Celebi, M.E. and K. Aydin, Unsupervised Learning Algorithms. 2016: Springer.
[6] Dietrich, J., et al. Cluster analysis of Java dependency graphs. in Proceedings of the 4th ACM symposium on Software visualization. 2008. ACM.
[7] Naseem, R., O. Maqbool, and S. Muhammad. Improved similarity measures for software clustering. in Software Maintenance and Reengineering (CSMR), 2011 15th European Conference on. 2011. IEEE.
[8] Anquetil, N. and T.C. Lethbridge. Experiments with clustering as a software remodularization method. in Reverse Engineering, 1999. Proceedings. Sixth Working Conference on. 1999. IEEE.
[9] Wen, Z. and V. Tzerpos. An effectiveness measure for software clustering algorithms. in Program Comprehension, 2004. Proceedings. 12th IEEE International Workshop on. 2004. IEEE.
[10] Tzerpos, V. and R.C. Holt. MoJo: A distance metric for software clusterings. in Reverse Engineering, 1999. Proceedings. Sixth Working Conference on. 1999. IEEE.
[11] Kim, H. and D.-H. Bae, Object-oriented concept analysis for software modularisation. IET software, 2008. 2(2): p. 134-148.
[12] Misra, J., Java source-code clustering: unifying syntactic and semantic features. arXiv preprint arXiv:1208.6408, 2012.
[13] Kuhn, A., Semantic clustering: Making use of linguistic information to reveal concepts in the source code. Master's thesis, University of Bern, 2006.
[14] Van Deursen, A., and T. Kuipers. Identifying objects using cluster and concept analysis. in Software Engineering, 1999. Proceedings of the 1999 International Conference on. 1999. IEEE.


[15] Poshyvanyk, D., et al., Using information retrieval based coupling measures for impact analysis. Empirical software engineering, 2009. 14(1): p. 5-32.
[16] Antoniol, G., et al. A method to re-organize legacy systems via concept analysis. in Program Comprehension, 2001. IWPC 2001. Proceedings. 9th International Workshop on. 2001. IEEE.
[17] Corazza, A., et al., Weighing lexical information for software clustering in the context of architecture recovery. Empirical Software Engineering, 2016. 21(1): p. 72-103.
[18] Maletic, J.I. and N. Valluri. Automatic software clustering via latent semantic analysis. in Automated Software Engineering, 1999. 14th IEEE International Conference On. 1999. IEEE.
[19] Maletic, J.I. and A. Marcus. Using latent semantic analysis to identify similarities in the source code to support program understanding. in Tools with Artificial Intelligence, 2000. ICTAI 2000. Proceedings. 12th IEEE International Conference on. 2000. IEEE.
[20] Davies, M. and D. Gardner, A frequency dictionary of contemporary American English: Word sketches, collocates, and thematic lists. 2013: Routledge.
[21] Lange, K., Singular value decomposition. Numerical Analysis for Statisticians, 2010: p. 129-142.
[22] Banerjee, S. and A. Roy, Linear algebra and matrix analysis for statistics. 2014: CRC Press.
[23] Evangelopoulos, N., X. Zhang, and V.R. Prybutok, Latent semantic analysis: five methodological recommendations. European Journal of Information Systems, 2012. 21(1): p. 70-86.
