
2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2010)

Object-Oriented Software Architecture Recovery Using a New Hybrid Clustering Algorithm

Qifeng Zhang, Dehong Qiu, Qubo Tian, Lei Sun
School of Software Engineering
Huazhong University of Science and Technology
Wuhan, China

978-1-4244-5934-6/10/$26.00 ©2010 IEEE 2546

Authorized licensed use limited to: Istinye Universitesi. Downloaded on February 26,2023 at 01:10:32 UTC from IEEE Xplore. Restrictions apply.

Abstract—To recover high-level software architecture from existing systems, we define the Weighted Directed Class Graph (WDCG) to represent object-oriented software in this paper. The WDCG reflects not only static information about the lowest-level composition of the software but also dynamic information about the running software. A new hybrid clustering algorithm based on hierarchical clustering and partition clustering is proposed for recovering high-level software architecture from the WDCG. Four metrics are introduced to measure the effect of the new clustering algorithm for software architecture recovery. Experimental results show that our algorithm performs best in terms of software clustering quality, authoritativeness and extremity of cluster distribution.

Keywords-software architecture; clustering; WDCG

I. INTRODUCTION

Software architecture acts as a shared mental model of a system expressed at a high level of abstraction [1], and it plays an important role in at least six aspects of software development: understanding, reuse, construction, evolution, analysis and management [2]. However, the original software architecture tends to deviate from the actual system because of software maintenance and software evolution [13]. Besides, much open source software, and some other software, lacks its original documentation. Without a high-level software abstraction, software engineers spend much time on program understanding, because thousands of lines of source code are hard to comprehend. Moreover, without the software architecture, a software maintenance engineer is often forced to make modifications to the source code without a thorough understanding of its organization [4]. It is therefore important to recover software architecture from existing systems.

Many approaches and techniques have been proposed in the literature to support software architecture recovery [3]. In the fields of semiautomatic and automatic software architecture recovery, clustering is commonly used [4-9]. Clustering analysis is not a new field, and it has been applied in many disciplines to discover similarities between artifacts. Recently, clustering analysis has also been applied in software engineering to discover patterns within data. The software clustering problem consists of finding a good-quality clustering of software modules based on the relationships among the modules. These relationships typically take the form of dependencies between modules. Unfortunately, previous studies on clustering algorithms for software architecture recovery have two shortcomings. First, most studies used a static dependency graph [4, 5, 6, 8] as the clustering data set, which does not represent the dynamic information of the running software. Second, the clustering algorithms used for recovering software architecture [8] were not sufficiently tailored to the nature of the data set and the features of the software.

In this paper, we extracted the Weighted Directed Class Graph (WDCG) from existing systems and used the coupling between classes as the weights of its edges. The WDCG reflects not only static information about the software but also dynamic information about the running software. According to the nature of the WDCG and the features of object-oriented software, we proposed a new hybrid algorithm based on hierarchical clustering and partition clustering for software architecture recovery. Experimental results showed that our approach was effective.

The paper is organized as follows. Section II presents related research, the problem description and the definition of the Weighted Directed Class Graph. Section III describes our hybrid clustering algorithm. Section IV presents the experimental results and analysis. Finally, we give the conclusions.

II. RELATED RESEARCH AND PROBLEM DESCRIPTION

Mancoridis et al. [4] extracted the file dependency graph from the source code and used a clustering algorithm based on a genetic algorithm to partition the graph in a way that derived the high-level subsystem structure from the component-level relationships. Mahdavi et al. [5] extracted weighted and non-weighted file dependency graphs from the source code and used a multiple hill climbing approach to implement software architecture recovery. Saeed et al. [6] extracted the function dependency graph with the Rigi tool and presented a new clustering algorithm called the ‘combined’ algorithm to implement software architecture recovery. Chiricota et al. [7] proposed a clustering algorithm based on graph theory to support component identification. Dietrich et al. [8] used the class dependency graph to represent programs and proposed using the Girvan-Newman clustering algorithm to compute the modular structure of programs. Pourhaji Kazem et al. [9] presented a new genetic algorithm for clustering the Weighted Module Dependency Graph. Bittencourt et al. [10] suggested that the k-means clustering algorithm performed best in terms of
authoritativeness, based on an empirical study that evaluated four clustering algorithms for software architecture recovery.

Unfortunately, all the studies above used a static dependency graph to represent software, which does not reflect the dynamic information of the running software. Besides, many of the clustering algorithms in the above research were not sufficiently tailored to the nature of the data set and the features of software engineering. For example, Dietrich et al. [8] proposed using the Girvan-Newman clustering algorithm to compute the modular structure of programs, but this algorithm performed poorly in software architecture recovery according to the empirical study in [10].

The class is the basic entity in object-oriented software, so it is appropriate to use the class as the basic entity for software clustering. We extract the Weighted Directed Class Graph (WDCG) from the Java byte code of existing systems to represent software, which reflects the static structure of the software. In addition, because we use the coupling between classes as the weights of edges, the WDCG can also reflect the dynamic information of the running software. The dynamic information includes function calls and parameter passing.

According to the coupling framework proposed by Eder et al. [11], we divide coupling between classes into three types: Inheritance Coupling (IC), Method Coupling (MC) and Data Coupling (DC). Related definitions are as follows.

Definition 1. Inheritance Coupling (IC) indicates that class c1 inherits from class c2.

Definition 2. Method Coupling (MC) indicates that class c1 calls methods of c2. The value of MC is the number of methods called by c1.

Definition 3. Data Coupling (DC) indicates that class c1 uses public data of c2. The value of DC is the number of data items used by c1.

Definition 4. Coupling Between Classes (CBC) is the sum of IC, MC and DC:

$$CBC(c_1, c_2) = IC(c_1, c_2) + MC(c_1, c_2) + DC(c_1, c_2). \tag{1}$$

Definition 5. If modules M1 and M2 consist of n1 and n2 classes respectively, the coupling between the two modules is defined as

$$Coup(M_1, M_2) = \frac{\sum CBC(c_i, c_j)}{2 n_1 n_2}, \tag{2}$$

where classes ci and cj belong to different modules.

Definition 6. If module M consists of n classes, the cohesion of module M is defined as

$$Cohe(M) = \frac{\sum CBC(c_i, c_j)}{\frac{1}{2} n (n - 1)}, \tag{3}$$

where classes ci and cj belong to module M.

Definition 7. Weighted Directed Class Graph (WDCG) = <V, E, W>. V is the set of vertices; each vertex represents an outer class or an interface of the existing software. E is the set of directed edges. W is the set of edge weights. Fig. 1 shows the Weighted Directed Class Graph of a part of Log4j 1.2.5 extracted with the structure101 [12] tool.

Figure 1. Weighted Directed Class Graph
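
Definitions 4 to 6 can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the WDCG is stored as a dictionary from a directed edge (source, target) to its CBC weight, the sums in Eq. (2) and Eq. (3) are taken over the directed edges in both directions, and the cohesion of a single-class module is returned as 0 because Eq. (3) is undefined for n = 1. The function names and the three-class example are hypothetical and do not come from the paper.

```python
def cbc(ic, mc, dc):
    """Coupling Between Classes, Eq. (1): sum of the three coupling values."""
    return ic + mc + dc


def coup(wdcg, m1, m2):
    """Coupling between two modules, Eq. (2).

    Assumption: the sum runs over directed edges in both directions
    between the two modules.
    """
    total = sum(w for (src, dst), w in wdcg.items()
                if (src in m1 and dst in m2) or (src in m2 and dst in m1))
    return total / (2 * len(m1) * len(m2))


def cohe(wdcg, module):
    """Cohesion of one module, Eq. (3)."""
    n = len(module)
    if n < 2:
        return 0.0  # assumption: Eq. (3) is undefined for n = 1
    total = sum(w for (src, dst), w in wdcg.items()
                if src in module and dst in module and src != dst)
    return total / (0.5 * n * (n - 1))


# Hypothetical three-class example (class names and counts are made up).
wdcg = {
    ("A", "B"): cbc(ic=1, mc=2, dc=0),  # A inherits from B and calls 2 methods
    ("B", "A"): cbc(ic=0, mc=1, dc=0),
    ("A", "C"): cbc(ic=0, mc=1, dc=1),
}
m1, m2 = {"A", "B"}, {"C"}
print(coup(wdcg, m1, m2))  # 2 / (2 * 2 * 1) = 0.5
print(cohe(wdcg, m1))      # (3 + 1) / (0.5 * 2 * 1) = 4.0
```

Keeping the graph as a plain edge-to-weight dictionary keeps the sketch independent of any graph library.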

III. HYBRID CLUSTERING ALGORITHM


According to the principle of high cohesion and low coupling in the design of object-oriented software, the coupling between classes within one module is usually higher than the coupling between classes in different modules [14]. For example, in Fig. 2, software S consists of three modules: M1, M2 and M3. The edges inside M1 and M2 have weights larger than the weight of edge (c3, c7) between them, and the edges inside M2 and M3 have weights larger than the weight of edge (c10, c8). The heavier edges between classes are usually inside one module, such as edges (c3, c2), (c8, c6), (c8, c7) and (c11, c10).

Figure 2. Module graph of software S
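
The observation that heavy edges usually lie inside one module suggests growing clusters from the heaviest edges first. The Python sketch below is one possible reading of the two-phase hybrid procedure that this section presents in Fig. 3, not the authors' implementation. Several points are assumptions made only for illustration: the name `hybrid_cluster`, the dictionary representation of the WDCG, adding the weights of both edge directions into one pairwise coupling, stopping the first phase as soon as K kernels exist, and measuring the coupling between a vertex and a cluster with Definition 5 where the vertex is treated as a module of size 1.

```python
from collections import defaultdict


def hybrid_cluster(wdcg, k):
    """Group the vertices of a weighted directed graph into at most k clusters.

    wdcg maps a directed edge (source, target) to its weight (CBC value).
    """
    # Assumption: coupling is symmetric, so both directions are added up.
    pair_weight = defaultdict(float)
    vertices = set()
    for (u, v), w in wdcg.items():
        vertices.update((u, v))
        if u != v:
            pair_weight[tuple(sorted((u, v)))] += w

    cluster_of = {}  # vertex -> cluster id
    members = {}     # cluster id -> set of vertices
    next_id = 0

    # Phase 1 (hierarchical): visit edges from heaviest to lightest
    # until k kernels exist (our reading of the loop condition i < K).
    for (u, v), _ in sorted(pair_weight.items(), key=lambda item: -item[1]):
        if len(members) >= k:
            break
        cu, cv = cluster_of.get(u), cluster_of.get(v)
        if cu is None and cv is None:        # step 2.1: form a new kernel
            members[next_id] = {u, v}
            cluster_of[u] = cluster_of[v] = next_id
            next_id += 1
        elif cu is not None and cv is None:  # step 2.2: v joins u's cluster
            members[cu].add(v)
            cluster_of[v] = cu
        elif cu is None:                     # step 2.3: u joins v's cluster
            members[cv].add(u)
            cluster_of[u] = cv
        elif cu != cv:                       # step 2.4: merge two clusters
            for x in members.pop(cv):
                members[cu].add(x)
                cluster_of[x] = cu

    def coupling(vertex, cid):
        # Definition 5 with the vertex treated as a one-class module.
        total = sum(pair_weight.get(tuple(sorted((vertex, m))), 0.0)
                    for m in members[cid])
        return total / (2 * len(members[cid]))

    # Phase 2 (partition): clusters take turns; the vertex closest to the
    # current cluster goes to whichever cluster it is most coupled with.
    unassigned = sorted(vertices - set(cluster_of))
    turn = 0
    while unassigned and members:
        ids = sorted(members)
        target = ids[turn % len(ids)]
        v = max(unassigned, key=lambda x: coupling(x, target))
        best = max(ids, key=lambda cid: coupling(v, cid))
        members[best].add(v)
        cluster_of[v] = best
        unassigned.remove(v)
        turn += 1
    return list(members.values())


# Hypothetical six-class graph with two obvious groups.
toy = {("a", "b"): 5, ("b", "c"): 4, ("d", "e"): 5, ("e", "f"): 4, ("c", "d"): 1}
clusters = hybrid_cluster(toy, 2)
print(sorted(sorted(c) for c in clusters))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

In the toy graph the two heaviest edges seed the kernels {a, b} and {d, e}, and the remaining classes c and f are then attached to the kernel they are most strongly coupled with. A graph with no edges yields no kernels, so the sketch returns an empty list in that case.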

In the Weighted Directed Class Graph (WDCG), a heavier edge indicates that the coupling between its two vertices is higher, so the two vertices should usually be partitioned into the same module. According to this principle we proposed a new hybrid clustering algorithm based on hierarchical clustering and partition clustering for software architecture recovery. The first step of our hybrid algorithm uses hierarchical clustering to find the kernels of the clusters; the second step partitions the other vertices into the kernels. Fig. 3 shows the description of our hybrid clustering algorithm in detail.

Hybrid Clustering Algorithm

Input: WDCG and K (the number of clusters).
Output: a partition P of WDCG.
Steps:
1. Sort the edges by weight in descending order.
2. DO WHILE (i < K)
   // i is an integer variable, initial value is 0
   // n is an integer variable, initial value is 0
   // j is an integer variable, initial value is 0; vj1 and vj2 are the vertices of edge ej
   // Flag[vm]==1 indicates vertex vm has not been partitioned into clusters
   // Flag[vm]==0 indicates vertex vm has been partitioned into clusters
   2.1 IF (Flag[vj1]==1 && Flag[vj2]==1)
       THEN vertices vj1 and vj2 form a new cluster
   2.2 IF (Flag[vj1]==0 && Flag[vj2]==1)
       THEN partition vertex vj2 into the cluster including vertex vj1
   2.3 IF (Flag[vj1]==1 && Flag[vj2]==0)
       THEN partition vertex vj1 into the cluster including vertex vj2
   2.4 IF ((Flag[vj1]==0 && Flag[vj2]==0) && (vj1 and vj2 belong to different clusters))
       THEN merge the two clusters including vertices vj1 and vj2
3. END DO
4. DO WHILE (n < N) // N is the number of vertices
   // i is an integer variable, initial value is 0
   // vi is a vertex belonging to cluster i%K, m is the number of cluster i%K
   4.1 find the vertex v that maximizes the coupling between vertex v and cluster i%K
   4.2 compute the coupling between vertex v and the other clusters
   4.3 partition vertex v into the cluster that maximizes the coupling between vertex v and the cluster
5. END DO

Figure 3. Hybrid clustering algorithm

According to our algorithm, when the number of clusters is chosen as three, Fig. 4 shows the kernels of software S.

Figure 4. Kernels of software S

After the kernels of the clusters are formed, the other vertices are partitioned into the kernels according to the coupling between the vertices and the kernels. Fig. 5 shows a partition of software S.

Figure 5. A partition of software S

IV. EXPERIMENT AND ANALYSIS

A. Experimental design

Our experimental procedure followed three steps: WDCG extraction, software clustering and comparison.

1) WDCG extraction: We used Log4j [15] versions 1.04 and 1.1.3, Jedit [16] versions 2.3 and 2.4, and Junit [17] versions 4.1 and 4.3.1 as test software. They are all popular open source software; Jedit and Junit were also used for software architecture recovery experiments in [10]. We used the structure101 tool to extract the Weighted Directed Class Graph. Table I shows the information about the three software systems in detail.

TABLE I. SOFTWARE UNDER EXPERIMENTATION

Software  Version  ID  KLOC  Number of classes  Edges of WDCG
Log4j     1.04     1   6.3   92                 294
Log4j     1.1.3    2   7.2   101                335
Jedit     2.3      3   26    373                1083
Jedit     2.4      4   27    381                1178
Junit     4.1      5   3.3   96                 232
Junit     4.3.1    6   3.5   99                 240

2) Software clustering: We applied our hybrid clustering algorithm (HPCA), a hierarchical clustering algorithm [18] and an MST clustering algorithm [19] to the three software systems. Our hybrid clustering algorithm was described in Section III. The other two algorithms are briefly described below:

● Hierarchical clustering algorithm (HCA): Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types: the agglomerative method and the divisive method. The agglomerative method is a “bottom up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The divisive method is a “top down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In our experiments, we used the agglomerative method.
● MST clustering algorithm (MSTCA): MST clustering is based on the idea of the minimum spanning tree
(MST) and is motivated by the way human perception works. The idea of the algorithm is the following: determine the minimum spanning tree of graph G and then remove the edges that are “unusually” large compared with their neighboring edges. These edges are called inconsistent, and it is expected that they connect vertices from different clusters.

3) Comparison: We introduced four metrics: software clustering quality, authoritativeness, extremity of cluster distribution and stability [20], and compared the effect of the three algorithms for software architecture recovery on these four metrics.

B. Experimental results and analysis

1) Software clustering quality: We defined Software Clustering Quality (SCQ), with reference to Modularization Quality (MQ) [4], to measure the cohesion and coupling of modules.

Definition 8. If software S consists of K modules, then

$$SCQ = \frac{\sum Cohe(M_i)}{K} - \frac{\sum Coup(M_i, M_j)}{\frac{1}{2} K (K - 1)}, \tag{4}$$

where K > 1.

We calculated the SCQ value of the three algorithms for Log4j 1.1.3. Fig. 6 shows the relation between the number of clusters and the SCQ value.

Figure 6. Relation between the number of clusters and SCQ (x-axis: number of clusters; y-axis: SCQ value; series: HPCA, HCA, MSTCA)

For a given algorithm, we can determine the number of clusters according to the SCQ value.

2) Authoritativeness: Authoritativeness measures how closely a software clustering result resembles a logical view created by an expert. In this paper, we used MoJo [21] to measure the similarity between partitions. The related definitions are as follows.

Definition 9. Given two partitions A and B, a Move operation moves a resource from one cluster to another (this includes moving a resource into a previously nonexistent cluster, thus creating a new cluster of cardinality 1). A Join operation joins two clusters into one, thus reducing the number of clusters by 1. mno(A, B) is the minimum number of Move and Join operations needed to transform partition A into partition B. The value of MoJo(A, B) is defined as:

$$MoJo(A, B) = \min(mno(A, B), mno(B, A)). \tag{5}$$

Definition 10. Given that partition B is the authoritative partition, the similarity quality between partition A and partition B is defined as:

$$SimQua(A, B) = \left(1 - \frac{MoJo(A, B)}{n}\right) \times 100\%, \tag{6}$$

where n is the number of entities to be clustered.

In our experiment, the authoritative decompositions were produced by software engineers who are familiar with the three software systems in Table I. Fig. 7 shows the similarity quality between the authoritative decomposition and the clustering results for the six versions.

Figure 7. SimQua for each algorithm (x-axis: version ID; y-axis: SimQua, 0% to 100%; series: HPCA, HCA, MSTCA)

Authoritativeness is the most important of the four metrics introduced in this paper, because it best reflects the effect of a clustering algorithm for software architecture recovery. The larger the value of similarity quality, the more authoritative the algorithm. From Fig. 7, we can see that our hybrid algorithm is the most authoritative of the three algorithms.

3) Extremity of cluster distribution: Neither huge clusters nor singletons are usual in architectural components. Huge clusters would reduce the cohesion of the software, and singletons would increase its coupling. Wu et al. [20] proposed a measure called non-extreme distribution (NED), which is defined as:

$$NED = \frac{\sum_{i=1,\ n_i \text{ not extreme}}^{K} n_i}{n}, \tag{7}$$

where K is the number of clusters in the partition, ni is the size of cluster i and n is the total number of entities to be clustered.

In our experiment, we chose 3 and 30 as the lower and upper limits of non-extreme clusters. Fig. 8 shows the NED value of the three clustering algorithms for the six versions.

According to Fig. 8, our algorithm has the best NED value, because we tailored the algorithm to the extremity of cluster distribution. In contrast, hierarchical clustering partitions vertices according to the weights of edges and does not consider the extremity of cluster distribution. MST clustering deletes leaves of the MST to form clusters and easily
forms singletons and huge clusters, so the NED value of MST clustering is the smallest.

Figure 8. NED value for each algorithm (x-axis: version ID; y-axis: NED value, 0 to 1; series: HPCA, HCA, MSTCA)

4) Stability: A good algorithm should be stable enough to produce similar clusters when small changes happen, but should still produce different clusters when architectural changes happen. To consider the similarity of two consecutive versions of the same software, we first obtained the partitions Pi and Pi+1 of the two consecutive versions, and then obtained partitions Pi’ and Pi+1’ by deleting from Pi and Pi+1 the classes that differ between the two versions. We measured algorithm stability by comparing the similarity quality between Pi’ and Pi+1’. Table II shows the similarity quality for two consecutive versions of the same software for the three algorithms.

TABLE II. RELATIVE STABILITY FOR EACH ALGORITHM

                         Similarity quality
Software  Versions       HPCA     HCA      MSTCA
Log4j     1.04, 1.1.3    90.67%   92.50%   100%
Jedit     2.3, 2.4       85.12%   99.14%   100%
Junit     4.1, 4.3.1     98.46%   100%     100%

V. CONCLUSION

In this paper, we defined the Weighted Directed Class Graph (WDCG) to represent existing systems, which reflects not only static information about the software but also dynamic information about the running software. According to the nature of the WDCG and the features of object-oriented software, we proposed a new hybrid algorithm based on hierarchical clustering and partition clustering. We compared the effect of three algorithms for software architecture recovery on four metrics. Experimental results show that our algorithm performed best in terms of software clustering quality, authoritativeness and extremity of cluster distribution. It was effective for software architecture recovery.

ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China Grant 60873031, and Research Foundation of Huazhong University of Science and Technology (0125921001).

REFERENCES

[1] R. Holt, “Software architecture as a shared mental model,” Proc. Workshop on Software Architecture (ASERC 01), 2001.
[2] D. Garlan, “Software architecture: a roadmap,” Proc. The Future of Software Engineering (ICSE 00), ACM Press, 2000, pp. 91-101.
[3] S. Ducasse and D. Pollet, “Software Architecture Reconstruction: A Process-Oriented Taxonomy,” IEEE Transactions on Software Engineering, vol. 35, no. 4, pp. 573-591, 2009.
[4] S. Mancoridis, B. S. Mitchell, C. Rorres, Y. Chen, and E. Gansner, “Using automatic clustering to produce high-level system organizations of source code,” Proc. Workshop on Program Comprehension, IEEE Computer Society Press, 1998, pp. 45-52.
[5] K. Mahdavi, M. Harman, and R. M. Hierons, “A multiple hill climbing approach to software module clustering,” Proc. Software Maintenance (ICSM 03), IEEE Computer Society Press, 2003, pp. 315-324.
[6] M. Saeed, O. Maqbool, H.A. Babri, S.Z. Hassan and S.M. Sarwar, “Software Clustering Techniques and the Use of Combined Algorithm,” Proc. Software Maintenance and Reengineering (CSMR 03), IEEE Computer Society Press, 2003, pp. 301-306.
[7] Y. Chiricota, F. Jourdan and G. Melancon, “Software components capture using graph clustering,” Proc. Program Comprehension (IWPC 03), IEEE Computer Society Press, 2003, pp. 217-226.
[8] J. Dietrich, V. Yakovlev, C. McCartin, G. Jenson and M. Duchrow, “Cluster Analysis of Java Dependency Graphs,” Proc. Software Visualization (SOFTVIS 08), ACM Press, 2008, pp. 91-94.
[9] A.A. Pourhaji Kazem and S. Lotfi, “An Evolutionary Approach for Partitioning Weighted Module Dependency Graphs,” Proc. Innovations in Information Technology, IEEE Computer Society Press, 2007, pp. 252-256.
[10] R.A. Bittencourt and D.D. Serey Guerrero, “Comparison of Graph Clustering Algorithms for Recovering Software Architecture Module Views,” Proc. Software Maintenance and Reengineering (CSMR 09), IEEE Computer Society Press, 2009, pp. 251-254.
[11] J. Eder, G. Kappel, and M. Schrefl, “Coupling and Cohesion in Object-Oriented Systems,” Technical Report, Univ. of Klagenfurt, 1994.
[12] http://www.headwaysoftware.com/products/structure101/index.php.
[13] C. Riva, “View-Based Software Architecture Reconstruction,” PhD thesis, Technical Univ. of Vienna, 2004.
[14] J.K. Lee, S.J. Jung, S.D. Kim, W.H. Jang and D.H. Ham, “Component identification method with coupling and cohesion,” Proc. Software Engineering Conference (APSEC 01), IEEE Computer Society Press, 2001, pp. 79-86.
[15] http://logging.apache.org/log4j/1.2/index.html.
[16] http://www.jedit.org/.
[17] http://www.junit.org/.
[18] O. Maqbool and H.A. Babri, “Hierarchical Clustering for Software Architecture Recovery,” IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 759-780, 2007.
[19] O. Grygorash, Y. Zhou and Z. Jorgensen, “Minimum Spanning Tree Based Clustering Algorithms,” Proc. Tools with Artificial Intelligence (ICTAI 06), IEEE Computer Society Press, 2006, pp. 73-81.
[20] J. Wu, A.E. Hassan, and R.C. Holt, “Comparison of Clustering Algorithms in the Context of Software Evolution,” Proc. Software Maintenance (ICSM 05), IEEE Computer Society Press, 2005, pp. 525-535.
[21] V. Tzerpos and R.C. Holt, “MoJo: A distance metric for software clusterings,” Proc. Reverse Engineering (WCRE 99), IEEE Computer Society Press, 1999, pp. 187-193.
