Professional Documents
Culture Documents
A Weighted Ranking Algorithm For Facet-Based Component Retrieval System
A Weighted Ranking Algorithm For Facet-Based Component Retrieval System
RETRIEVAL SYSTEM
Yong Yang, Weishi Zhang, Xiuguo Zhang, Jinyu Shi
Department of Computer Science and Technology, Dalian Maritime University, Dalian 116026
{dmuyy, teesiv, zhangxg, sjy1967}@dlmu.edu.cn
505-049 274
2.2 Facet-based Retrieval Method tree is represented by a three-tuple: T= (V, E, root (T)), V
A component classification is a set of {facet, facet term} represents a limited set of vectors, root (T) represents the
pairs, also called descriptors [3]. Reusable software root of the tree, E represents the set of edges.
components (RSC) are classified by assigning appropriate
facet terms for all applicable facets. The objective in
classifying the RSC is to make it possible for all users 3. Weighted Ranking Algorithm for Facet-
who might use the RSC to retrieve it as a result of their based Component Retrieval System
requests. Faceted classification scheme is an effective
way for classifying the components and widely adopted There’s no a general formula to calculate the matching
by component library systems. degree due to the different feature of each retrieval
Correspondingly, there are several retrieving method. Facet-based retrieval method has been widely
algorithms for faceted classification scheme. Some adopted by existing component library systems, such as
systems use the traditional database query techniques in REBOOT, Proteus, Asset Library, and JBCL [5]. It has
facet-based retrieval. Wang YF proposed a tree matching been proved to be an effective method to the retrieval of
algorithm in his PH.D dissertation [4]. This algorithm component library system. And therefore, it makes great
maps the component facets into a facet tree and maps the sense to propose a component ranking algorithm for facet-
query conditions into a query tree. The matching based retrieval system.
algorithm deals with the match of the facet tree and query
tree and calculates the matching cost. This algorithm 3.1 ER-Diagram of Software Component Library
bases on the tree matching theories, such as tree
embedding, tree inclusion, and tree containment. These The extraction and identification of the influential factors
three tree matching methods are becoming more and more which are used to calculate the matching degree is the
elastic in order to improve the retrieving recall while first step to establish a mathematical model. To Analyze
maintaining the precision to a certain extent. Matching the ER-Diagram of software component library is an
cost of the tree matching will be calculated to measure the effective way to extract the factors. An ER-diagram of
approximate degree between the facet trees of the facet-based component library was given below:
components and the query tree. The data structure of a
Relate
n m
1 n n m
Producer Provide Component Reuse Consumer
n 1 1
n
Describe Feedback Feedback
m
1 n 1
Facet Include Term Describe Summary
275
we should take other factors into account besides the facet Summary of component information differs from
for ranking the retrieval results. According to the analysis Attribute-Valued pairs. It provides a comprehensive
of ER-Diagram above, we can extract some other factors description of a component in context.
which are able to influence the ranking of component Definition 6: Summary (S);
retrieval results while using the complex query. User feedback includes all the comments and feedback
Attributes of component, such as component name, can to a specific component:
be used to match the keywords in the query conditions. Definition 7: User Feedback (U1, U2, ……, Ui, ……,
The matching degree of Attribute-Valued pairs should be Un3); (n3∈N, n3≥1)
an influential factor for ranking. User feedback must be analyzed and evaluated to a
Summary of component can also be used to match the relative number. We use E to represent the Evaluation
query conditions. Query conditions usually consist of number of user feedback.
several keywords. The density, prominence, and position Definition 8: E = Evaluate (User Feedback).
of keywords within the component summary will Definition 9: Visited times of a component: Visited
influence the ranking of components. The keyword times (V);
density is just the number of occurrences of the keywords Definition 10: Downloaded times of a component:
within the component summary divided by the total Downloaded times (D);
number of words. Keyword prominence is related to the We have listed out all the influential factors above,
location of keywords in the summary. For example, which constitute a six-tuple:
keywords placed at the beginning of the summary maybe Factors (A, F, S, U, V, D);
carry more weight than those towards the end of it. Their influential weights differ from each other
User feedback of a component is very useful for other according to their feature and importance:
users who want to use it to evaluate the quality and other Definition 11: Weights (WA, WF, WS, WU, WV, WD);
features of the component. They can acquire much more (0≤WA, WF, WS, W U, W V, WD≤1,
objective description and useful information about the WA+WF+WS+WU+WV+WD = 1)
component besides the component attributes and WA represents the weight of Attributes; WF represents
summary. the weight of Facets; WS, WU, WV, and WD also represent
How many times the component information has been the weight of corresponding factor discussed above.
visited and how many times the component has been There are several functions for calculating the matching
downloaded for reusing should also be taken into account degree of some factors. The core calculating formula of
as the factors to calculate the matching degree for ranking. each function relies on its corresponding matching
They reflect the popularity and reusability of the algorithm.
component from another aspect. Functions Formulas
Summary FS (Keywords, no
3.3 Mathematical Model
Summary) ∑ match( Ki, S )
i =1
Retrieval results consist of a collection of components
matching the query conditions: Facets FF (Keywords, no n2
Definition 1: Components (C1, C2, ……, Ci, ……, Cn); Facets) ∑∑ match( Ki, Fj )
i =1 j =1
(n∈N, n≥1)
Attributes FA (Keywords, no n1
It makes no sense to discuss the circumstance of empty
retrieval results, since that we are going to discuss the
Attributes) ∑∑ match( Ki, Aj )
i =1 j =1
ranking of component lists.
Accordingly, each component has a rank value:
Definition 2: Ranks (R1, R2, ……, Ri, ……, Rn); (n∈N, Match function of component summary uses the
n≥1) content-based similarity measurement algorithm of the
The query condition consists of a collection of search engine techniques. A Best-First algorithm was
keywords: proposed by Cho [6]. This algorithm uses a vector space
Definition 3: Keywords (K1, K 2, ……, Ki, ……, K n0); model to calculate the similarity between the keywords
(n0∈N, n0≥1) and the content. Its formula is given as following:
Each component is described by a set of Attribute-
sim(q, p) =
∑ WW k∈q| p
kq kp
Valued pairs:
∑ W ∑ W
2 2
Definition 4: Attributes (A1, A2, ……, Ai, ……, An1);
k∈ p kp k∈q kq
(n1∈N, n1≥1)
Besides with Attribute-Valued pairs, components are The variable q represents the collection of keywords, p
also classified and represented by a set of facets and their represents the content, and Wkp represents the importance
terms: of k to a specific topic. In our mathematical model,
Definition 5: Facets (F1, F2, ……, Fi, ……, Fn2); variable q represents the query condition, and the variable
(n2∈N, n2≥1) p represents the summary of the component information.
276
Facet-based retrieving method usually adopts the facet We specify the weights with experiential values at the
tree matching. And therefore, its match function very beginning, and then use the data mining technology
calculates the matching degree between the facet tree of to analyze the user logs to dynamically and iteratively
component and the query tree. A formula to calculate the adjust those values.
matching cost of tree containment matching was given by
Xu [7]: 3.4 Timing of Ranks Calculation
Q=(V,E,root(Q)), D=(W,F,root(D)) are two unordered
label tree, TCostM(Q, D) represents the tree containment It’s time for us to determine when to deal with the
matching cost from tree Q to tree D. calculation of ranks, since that we have designed how to
TCostM(Q, D) = min{γ ( f ) | f : Q ⇒ D} deal with it. It’s just the matter of the timing of rank
calculation. There are two probable times for calculation:
calculating after the retrieving process has been finished;
γ(f )= ∑ γ (label (v) → label ( f (v))) + calculating during the process of retrieving. Both of the
v∈domain ( f ) probable times have their own advantages and
∑ γ (label (v) → λ ) +
v∈V − domain ( f )
∑ γ (λ → label (v))
w∈spectrum ( f ) − Range ( f )
disadvantages.
If we calculate the rank after the retrieving process has
been finished, we can deal with the retrieving and ranking
If f is a tree containment matching from tree Q to tree separately. It will be much easier for us to design and
D, and γ (f) = TCostM(Q, D), then f is the tree maintain the system, since the retrieving and ranking
containment matching which obtains the minimum process are independent. However, it costs much more
matching cost from tree Q to tree D. This definition could time and space to calculate the rank. It needs a lot of
be also applied to the containment matching between tree memory space to store a large number of retrieval results
and forest or between forest and forest. temporally before they are ranked, and costs much more
As to the match function of component attributes, we time to manage the transmission of data between storage
just use the traditional database query methods to deal devices and CPU. On the contrary, if we calculate the
with it. rank during the process of retrieving, we have to combine
E, V, and D are three ranking factors without any the retrieving with the ranking process completely or
relation to query keywords. Even though they are partly. Undoubtedly, it will be hard for us to implement
numbers, we could not use them directly for ranking. and maintain the system, but it can greatly improve the
Functions should be provided to transform them. time and the space performances.
According to the discussion above, we can choose a
proper time to calculate the ranks. Which solution we
Functions
should choose depends on the requirements of the system.
Evaluation of Feedback FE (E) The solution which calculates the ranks during the
Visited Times FV (V) retrieving process should be adopted if the time and the
space performances are rigorously required.
Downloaded Times FD (D)
277
Fig. 2. Class Diagram of Ranking Module The experimental results obviously shows that the
efficiency of group 2 is greatly higher than group 1. By
We calculate the ranks during the retrieving process to applying the weighted ranking algorithm into the
improve the time and space performances, and therefore, retrieving system of DLCL, users needn’t turn too many
we have to combine the ranking module partly with the pages to view and compare the component information or
retrieving module. In order to lower the coupled degree to adjust the query condition to improve the query
between these two modules, a callback mechanism was precision. Only to view the first page of retrieval results
adopted. We define an interface Rank, which consists of will be enough most of the time. And therefore, it greatly
only one method to calculate the ranks. saves the time and retrieval costs.
RankComponentImpl is a class implementing the
interface Rank to calculate the ranks of those components
retrieved by users. The method of its concrete object can 6. Related Works
be executed by the searcher during the retrieving process.
Class Component encapsulates those methods which The idea of Component Rank comes from computing fair
provide the component information. impact factors of published papers [8]. Google is a web
ComponentMatchingDegree is a class providing those search engine. Its method can be considered as an HTML
methods for calculating the matching degree between the extension of the method proposed for counting impact of
query keywords and component. Each influential factor publications, called influence weight in [8]. Google
has its own strategy for calculation. There’s also a class computes the ranks (called PageRanks) for HTML
Weight, which provides the methods to get the influential documents in the Internet [9, 10]. In reference [11], the
weight of each factor. authors present the Component Rank model for ranking
software components, and show a system for computing
Component Rank. In this model, a collection of software
5. Experiment and its Results components is represented as a weighted directed graph
whose nodes correspond to the components and edges
In order to verify the efficiency of the component correspond to the usage relations. Similar components are
retrieving system which adoptes our weighted ranking clustered into one node so that effect of simply duplicated
algorithm, we design an experiment and carry out this nodes is removed. The nodes in the graph are ranked by
experiment in our component library system, named their weights which are defined as the elements of the
DLCL. There are more than 1000 components in this eigenvector of an adjacent matrix for the directed graph.
system. Retrieving system of DLCL splits the retrieval A major distinction of Component Rank model in [11]
results into several pages if there are too many from PageRank and the influence weight in [9, 10] is that
components retrieved, and lists out 10 components per Component Rank model explores similarity between
page. components before the weight computation.
The experiment separates the users into two groups, In this paper, we also propose a weighted ranking
group 1 and group 2. Each group consists of 10 persons. algorithm for component retrieval system. This weighted
All the users know about the knowledge of component ranking algorithm uses different calculating strategies
reuse to a certain extent. Both groups use the facet-based according to the feature of facet-based retrieval methods.
component retrieving method to retrieve the components. While in [11], the authors employed only statical use
Retrieval results of group 1 are listed out without any relations.
ranking, however, those of group 2 are ranked by our
weighted ranking algorithm.
There are several aspects to measure the efficiency of 7. Conclusion
each group: how many pages they turned; how many
times they had to adjust the query condition; and the most In this paper, a mathematical model of weighted ranking
important, how many time was elapsed during the whole algorithm is proposed and the timing of ranks calculation
retrieving process. The experimental results are given in is discussed. We have applied this ranking algorithm into
the following table: our component library system, named DLCL. The
experiment we carried out shows that this algorithm
Group 1 Group 2 greatly improves the efficiency of component retrieving
Average Turned Pages 2.7 1.4 system, saving the time and retrieval costs for component
(pages) reusing.
Average Adjusted Times 2.3 1.1
(times)
Average Time elapsed 26.6 9.5 Acknowledgement
(minutes)
278
This research is partially supported by the National High
Technology Development 863 Program under Grant No.
2004AA116010.
References
[1] Frakes WB, Pole TP, An empirical study of
representation methods for reusable software components,
IEEE Transactions on Software Engineering, 1994, l20(8),
pp617-630
[2] H. Mili, R. Rada, W. Wang, K. Strickland, C.
Boldyreff, L. Olsen, J. Witt, J. Heger, W. Scherr, and P.
Elzer, Practitioner and SoftClass: A Comparative Study of
Two Software Reuse Research Projects, J. Systems and
Software, 1994, 27(5)
[3] NEC Software Engineering Laboratory, NATO
Standard for Management of a Reusable Software
Component Library, NATO Communications and
Information Systems Agency, 1991
[4] Wang YF. Research on retrieving reusable
components classified in faceted scheme [Ph.D. Thesis].
Shanghai: Fudan University, 2002.
[5] Chang JC, et al. Representation and Retrieval of
Reusable Software Components [J].Computer Science,
1999, 26(5):41-48.
[6] Cho J, Garcia-Molina H, Page L. Efficient Crawling
Through URL Ordering [J]. Computer Networks, 1998,
30(1~7):161-172
[7] Xu RZ, et al. Research on Matching Algorithm for
XML-Based Software Component Query, Journal of
Software, 2003, 14(7):1195-1202.
[8] G. Pinski and F. Narin. “Citation Influence for Journal
Aggregates of Scientific Publications: Theory, with
Application to the Literature of Physics”. Information
Processing and Management, 12(5):297.312, 1976.
[9] L. Page, S. Brin, R. Motwani, and T. Winograd. “The
PageRank Citation Ranking: Bringing Order to the Web”.
Technical Report of Stanford Digital Library
Technologies Project, 1998. “http://www-
db.stanford.edu/.backrub/ pageranksub.ps”.
[10] J. Kleinberg. “Authoritative Sources in a
Hyperlinked Environment”. Journal of the ACM,
46(5):604.632, 1999.
[11] Katsuro Inoue, Reishi Yokomori, Hikaru Fujiwara,
Tetsuo Yamamoto, Makoto Matsushita, Shinji Kusumoto:
Component Rank: Relative Significance Rank for
Software Component Search. ICSE 2003: 14-24
279