
Information Systems 31 (2006) 247–265


www.elsevier.com/locate/infosys

Modeling user interests by conceptual clustering

Daniela Godoy*,1, Analía Amandi1

ISISTAN Research Institute, UNICEN University, Campus Universitario, Tandil (7000), Buenos Aires, Argentina

Received 18 January 2005; received in revised form 18 January 2005; accepted 20 February 2005

Abstract

As more information becomes available on the Web, there has been a growing interest in effective personalization techniques. Personal agents providing assistance based on the content of Web documents and the user interests emerged as a viable alternative to this problem. Provided that these agents rely on having knowledge about users contained in user profiles, i.e., models of user preferences and interests gathered by observation of user behavior, the capacity of acquiring and modeling user interest categories has become a critical component in personal agent design. User profiles have to summarize categories corresponding to diverse user information interests at different levels of abstraction in order to allow agents to decide on the relevance of new pieces of information. In accomplishing this goal, document clustering offers the advantage that a priori knowledge of categories is not needed; the categorization is therefore completely unsupervised. In this paper we present a document clustering algorithm, named WebDCC (Web Document Conceptual Clustering), that carries out incremental, unsupervised concept learning over Web documents in order to acquire user profiles. Unlike most user profiling approaches, this algorithm offers comprehensible clustering solutions that can be easily interpreted and explored by both users and other agents. By extracting semantics from Web pages, this algorithm also produces intermediate results that can be finally integrated in a machine-understandable format such as an ontology. Empirical results of using this algorithm in the context of an intelligent Web search agent proved it can reach high levels of accuracy in suggesting Web pages.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Personal information agents; Conceptual clustering; Ontology learning

1. Introduction

Agents personalizing information-related tasks based on the content of documents and the user interests offer users an alternative to cope with the vast amount of information available on the Web. Due to this fact, personal information agents have been intensively used in the last years to accomplish several tasks, such as searching, filtering and summarizing information on behalf of a user or community of users [1]. A critical component for these agents is their capacity to acquire and model user interest categories into user profiles.

*Corresponding author. Tel./fax: +54 2293 440362.
E-mail addresses: dgodoy@exa.unicen.edu.ar (D. Godoy), amandi@exa.unicen.edu.ar (A. Amandi).
1 Also at CONICET.

0306-4379/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.is.2005.02.008

As profiles allow agents to categorize documents based on the features they exhibit, even when it is not possible for them to assess the full meaning of documents, agent effectiveness depends on profile completeness and accuracy.

A personal agent has not only to summarize categories corresponding to diverse user information interests into a user profile (e.g., interests related to their hobbies, like sports or movies, or to their work, like politics), but also to describe user experiences in each of these categories at different levels of abstraction (e.g., a user interest in skiing may respond to a more general interest in winter sports, or even in just sports). A hierarchical view of user interests enhances the semantics of user profiles as it is much closer to the human conception of a set of interests. By assessing comprehensible user profiles, agents become able to communicate representations of user interests to either users, for explanation or exploration purposes, or other agents, to establish a cooperative behavior in an agent community. Learning meaningful user profiles involves the extraction of semantics from a group of Web pages a user reads in a personal fashion.

Extracting semantics from resources found on the syntactic Web is an essential step in ontology learning approaches for the Semantic Web [2]. The Semantic Web aims at extending the current Web with machine-processable semantics, allowing various intelligent services to understand the information and to have better access to it. Central to the Semantic Web is the concept of ontology, which is a conceptualization of a domain that enables agents to reason about Web content and carry out intelligent tasks on behalf of users. In this context, ontology learning refers to the semi-automatic extraction of semantics from the Web in order to create an ontology [3]. Integrated in an ontology learning process, a user profiling approach discovering semantics from Web pages can contribute to build ontologies for the Semantic Web.

In this paper we propose a document clustering algorithm belonging to the conceptual clustering paradigm, named WebDCC (Web Document Conceptual Clustering), that carries out incremental, unsupervised concept learning over Web documents in order to acquire comprehensible user profiles. Being a conceptual clustering approach, this algorithm not only performs clustering but also characterization, i.e., the formation of intensional concept descriptions for each extensionally defined cluster. Algorithms in this paradigm have been recognized as a powerful machine learning method to find potentially interesting concepts and relations in ontology learning for the Semantic Web [4-6]. It is worth noticing that, although the knowledge acquired during user profiling can foster ontology learning, the Semantic Web is expected to be only of partial benefit to the problem of formulating explicit user profiles. A domain ontology can only capture concepts which are shared by a community of users, but these concepts are not specific enough to reflect the interests of individual users.

This paper is organized as follows. Section 2 presents the WebDCC algorithm and its principal components. An overview of the datasets used for experiments is given in Section 3. Section 4 compares WebDCC results with those obtained with a traditional hierarchical clustering algorithm. An experience of using WebDCC for user profiling in a Web search agent is described in Section 5. Section 6 places this work in the context of related ones. Finally, concluding remarks and future lines of research are described in Section 7.

2. WebDCC algorithm overview

Algorithms belonging to the conceptual clustering paradigm, first introduced by [7], include not only clustering, but also characterization. More formally, conceptual clustering is defined as the task of, given a sequential presentation of instances and their associated descriptions, finding clusterings that group these instances into concepts or categories, a summary description of each concept and a hierarchical organization of them [8]. In this section, we briefly describe each of these elements in the context of the WebDCC algorithm.

Instances in our algorithm correspond to vector representations of Web pages according to the vector space model [9]. In this model, each document is identified by a feature vector in a
space in which each dimension corresponds to a distinct term associated with a numerical value or weight indicating its importance. The resulting representation of a Web page is, therefore, equivalent to a t-dimensional vector:

  d_i = ⟨(t_1, w_1), (t_2, w_2), ..., (t_t, w_t)⟩,

where w_j represents the weight of the term t_j in the instance or document d_i. Before obtaining a document representation, non-informative words such as prepositions, conjunctions, pronouns and very common verbs are removed by using a standard stop-word list. After stop-word removal, a stemming algorithm is applied to the remaining words in order to reduce the variant forms of a word to a common one. In the experiments reported in this paper, terms were stemmed using the Porter stemming algorithm [10].

Instances obtained by extracting feature vectors from documents are incrementally given to the WebDCC algorithm, which is concerned with building hierarchies of concepts starting from them. A conceptual hierarchy is a classification tree where internal nodes represent concepts and leaf nodes represent clusters of instances. The hierarchy root corresponds to the most general concept, which summarizes all instances the algorithm has seen, while inner concepts become increasingly specific as they are placed lower in the hierarchy, covering only subsets of instances by themselves. In turn, terminal concepts are those with no further child concepts.

In other words, a hierarchy consists of an arbitrary number of concepts, denoted by C = {c_1, c_2, ..., c_n}, which are gradually discovered by the algorithm as new instances become available. In order to automatically assign instances to concepts, the algorithm associates to each of them a description given by a set of terms, c_i = ⟨(t_1, w_1), ..., (t_m, w_m)⟩, weighted according to their importance in the concept summarization. Fig. 1 shows an example of a hierarchical clustering solution which is possible to achieve with the WebDCC algorithm. Labels were added to concepts in the figure for illustrative purposes only. However, a concept description generated by the clustering algorithm consists only of a set of terms.

At this point, we need to make a distinction between concepts and categories as they will be used in this work. A category is considered to be any set of instances, while a concept is the internal representation of a category. Consider the algorithm has discovered the concept sports within a set of instances. In this case, the concept itself will be a description, such as c_sports = ⟨(sports, 0.9), (team, 0.9), (score, 0.8)⟩, which summarizes the category sports, composed of all the instances the algorithm has seen belonging to this concept. Leaves in the hierarchy correspond to clusters of instances belonging to all ancestor concepts.

Intuitively, clusters correspond to groups of instances whose members are more similar to one another than to the members of other clusters, so that clusters group highly similar instances observed by the algorithm. In general terms, a set of n_i instances or documents belonging to a concept c_i, denoted by D_i = {d_1, d_2, ..., d_ni}, is organized into a collection of k clusters, S_i = {s_1i, s_2i, ..., s_ki}, containing elements of D_i such that s_li ∩ s_pi = ∅ for all l ≠ p.

[Fig. 1 depicts a user interest hierarchy: a root concept (Everything) with child concepts Politics (politics 0.9, government 0.9, elections 0.8, campaign 0.6) and Sports (sports 0.9, team 0.9, score 0.8); Sports has child concepts Football (football 0.9, cup 0.9, FIFA 0.8) and Basketball (basketball 1.0, NBA 0.9); the leaves are clusters of instances (experiences).]

Fig. 1. A concept hierarchy illustrating knowledge representation in WebDCC.
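To make this representation concrete, the following is a minimal Python sketch of the structures just described: a concept carries a weighted-term description and child concepts, and leaves hold clusters of instances. The class and field names are our own illustration; the paper specifies only the abstract structure.

from dataclasses import dataclass, field

@dataclass
class Cluster:
    instances: list = field(default_factory=list)    # document vectors: dicts of term -> weight

@dataclass
class Concept:
    description: dict = field(default_factory=dict)  # term -> weight, e.g. {"sports": 0.9}
    children: list = field(default_factory=list)     # child Concept nodes
    clusters: list = field(default_factory=list)     # leaf Cluster nodes directly below this concept

# The hierarchy of Fig. 1: a root with an empty description (Everything),
# two inner concepts (Politics, Sports) and two terminal concepts below Sports.
root = Concept()
politics = Concept({"politics": 0.9, "government": 0.9, "elections": 0.8, "campaign": 0.6})
sports = Concept({"sports": 0.9, "team": 0.9, "score": 0.8})
sports.children = [Concept({"football": 0.9, "cup": 0.9, "fifa": 0.8}),
                   Concept({"basketball": 1.0, "nba": 0.9})]
root.children = [politics, sports]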



WebDCC integrates classification and learning by sorting each instance through the concept hierarchy and simultaneously updating it. Upon encountering a new instance, the algorithm incorporates it below the root of the existing hierarchy and then recursively compares the instance to each child concept as it descends the tree. Below a given concept, the algorithm considers incorporating the instance into a cluster belonging to this concept as well as creating a new singleton cluster. Finally, the possibility of a new concept creation is evaluated in order to expand the conceptual hierarchy in depth.

WebDCC, which is defined in Algorithm 1, takes as input a hierarchy H and a new instance d_new, and returns an updated hierarchy H' resulting from the addition of d_new to H. Learning takes place as instances are sorted through the hierarchy of incrementally discovered concepts. The first time the algorithm is run, a concept c_root with no associated description is used to initialize the hierarchy, allowing every instance to be classified into the root category. This initial hierarchy with the single concept c_root is gradually expanded as new instances become available and new concepts are gathered starting from them.

Having a hierarchy containing at least the c_root concept, the algorithm sorts d_new down by recursively assigning the instance to the best child category at each level. Classification of instances into categories is in charge of the Classify function, which receives an instance and a concept and returns a value representing the evidence for the fact that the instance should be classified under the category the concept represents. In accomplishing this goal, concept descriptions act as classifiers, which are trained by the algorithm to recognize constitutive features of categories and thus discriminate documents belonging to these categories. A value returned from this function higher than a given classification threshold τ for the instance d_new and the concept c is interpreted as a decision of classifying d_new under c, while a value lower than τ is interpreted as a decision of not classifying d_new under c. After classifying d_new into a category c, the category classifier is updated to include the new instance. In the example of Fig. 1, a top-level classifier decides the classification of documents into the root category, the second-level classifiers decide the classification into the sports and politics categories, and so forth.

Denoting the concept where d_new was last classified as c_parent, the algorithm adds d_new to the best child concept of c_parent (the one with the highest Classify(d_new, c)) while there is a child concept c of c_parent exceeding the threshold. Therefore, the classification process stops when the instance reaches a terminal concept or when it cannot be classified into any child to justify sorting downward further. In any case, c_parent contains the last category the instance was able to be classified into and below which it needs to be clustered. An instance about basketball in the previous example is expected to exceed the classification threshold for concepts along the complete path from the root to the basketball category. However, an instance about tennis will probably be classified into sports, but not into the basketball and football categories.

Once the instance reaches a given concept in the hierarchy, terminal or non-terminal, it is added either to the cluster below this concept where the most similar instances are found, or to a new cluster if there are no similar enough instances, following a k-nearest neighbor (k-NN) approach. All instances below the concept c_parent are in the same category and, consequently, they are expected to share the terms describing this concept and all its ancestors. For example, if c_parent is the concept representing the basketball category in Fig. 1 and the algorithm has to decide which is the best cluster below this concept to place a new instance in, the terms sports, team, score, basketball and NBA are useless, since they are equally likely to appear in all clusters. For this reason, terms included in the description of a concept and all its ancestors are not considered at the moment of clustering instances below it. In Algorithm 1, this set of terms is denoted by T and is given to the k-NN function to prevent them from being used in instance comparisons taking place inside this function.

Besides the set of terms T, the k-NN function receives as input the new instance d_new, the concept c_parent where this instance was classified, and the
hierarchy H itself, and returns the cluster below c_parent where d_new needs to be added, denoted by s_i,parent. If d_new is not similar enough to any other instance in the existent clusters below c_parent, according to an established similarity threshold δ, a new cluster is created in order to contain this single instance and is returned by the k-NN function. Hence, instances sufficiently dissimilar to all clusters in a given level of the hierarchy create disjoint categories, whereas further inclusion of instances in these clusters gives rise to new concepts in the downward extensions of the hierarchy.

The insertion of an instance into a given cluster causes a hierarchy revision and, possibly, the definition of a new concept to summarize a given category of instances. An Evaluation function drives the entire process of conceptual hierarchy formation. A low evaluation value for a cluster indicates that either the cluster does not provide enough information to define a concept summarizing its instances (e.g., the cluster consists of only a few instances), or there does not exist a set of terms which precisely describes the instances in the cluster. Contrary to this, if the result of the evaluation function exceeds a given evaluation threshold φ, a new concept is defined with an associated description.

As mentioned, conceptual clustering relates simultaneously to two problems: hierarchical clustering (i.e., finding a hierarchy of useful subsets of instances by dividing up training instances into meaningful categories) and characterization (i.e., finding an intensional definition or description for each of these categories). Therefore, while the discovery of categories in the previous steps of the algorithm aims at finding subsets of similar instances, conceptual clustering extends this task by trying to find intensional descriptions of these categories. In WebDCC the description of the new concept c_new is generated by the ConceptFormation function based on the set of instances contained in the cluster s_i,parent. This function embodies a feature selection method which extracts the set of features that best characterizes the instances in a given cluster. In this case, a filter approach to feature selection is taken, in which a score is individually computed for each feature and features are ordered according to their assigned scores. A feature selection threshold σ is defined such that the weight required for a feature to be selected needs to be higher than this threshold.

Finally, the concept c_new with its associated description is added as a child of concept c_parent in the hierarchy H, and the instances originally placed in the cluster which leads to the new concept, as well as those in sibling clusters, are reorganized below the new concept to form a new sub-hierarchy. In order to create this new partition of instances, the WebDCC algorithm is recursively run for each instance being reorganized and the sub-hierarchy having c_new as root; those instances that can be classified into the new category are removed from their original position in the hierarchy to be placed below c_new.

In the following subsections each of the main parts of WebDCC is explained in detail. Section 2.1 describes the transformation of Web pages into instances in the WebDCC algorithm. The process of hierarchical classification of instances into a given hierarchy of concepts is presented in Section 2.2. The k-NN approach for clustering instances at the leaf levels of the hierarchy is described in Section 2.3. Section 2.4 explains the hierarchical concept formation process.

2.1. Web page representation

Instances to be considered by the WebDCC algorithm are represented using a bag-of-words approach for document representation. According to this approach documents are encoded as feature vectors, their elements being the weights indicating the importance of terms. In order to weight terms, we first compared the performance of boolean and numerical weighting schemes in the Web domain. For this purpose, vectors representing Web pages were constructed for the Syskill&Webert and the BankSearch collections (see description in Section 3). In the boolean scheme, a term weight is 1 if the term is present in the document and 0 if it is not, while for achieving numerical weights we used the term frequency normalized by a factor representing the Euclidean vector length as
follows:

  P_i = tf_i / sqrt( Σ_{n=1}^{t} tf_n² ).   (1)

Local weighting functions like this one reflect the importance of a term within a particular document, but not within the entire set of documents. Global functions like tf·idf (term frequency-inverse document frequency) are not applicable in this context, since in an incremental clustering approach such as WebDCC the entire document collection is a priori unknown. As documents are incorporated into the hierarchy, their terms are locally weighted, so that no dictionary has to be built for this purpose.

Algorithm 1. WebDCC algorithm.

Input: An instance d_new to be added to a hierarchy H
Output: A hierarchy H' containing d_new

 1: WebDCC(d_new, H)
 2: if H = ∅ then
 3:     Initialize H with an empty concept c_root
 4: end if
    /* Classification of the new instance into the current hierarchy of concepts */
 5: c_parent ← c_root
 6: while ∃c : child(c, c_parent) = true ∧ Classify(d_new, c) ≥ τ do
 7:     Let c_parent be the concept c with the highest Classify(d_new, c)
 8:     Update the description of c to include d_new
 9: end while
    /* Selection of the best possible cluster in c_parent to contain the new instance */
10: Let T be the set of terms {t : t ∈ c_parent ∨ t ∈ ancestor(c_parent)}
11: s_i,parent ← k-NN(d_new, c_parent, H, T)
12: Add d_new to s_i,parent in H
    /* A new concept definition is evaluated */
13: if Evaluation(s_i,parent) ≥ φ then
14:     c_new ← ConceptFormation(s_i,parent)
15:     Add c_new to H as a child of c_parent
    /* Instances in clusters below c_parent are considered to be reorganized below c_new */
16:     for all d, j : d ∈ s_j,parent do
17:         if Classify(d, c_new) ≥ τ then
18:             Remove d from H
19:             Let H_new be the sub-hierarchy with c_new as root
20:             Replace H_new by WebDCC(d, H_new) in H
21:         end if
22:     end for
23: end if
24: Return H

The performance of both weighting schemes was evaluated using k-NN, which is applied at the lowest levels of the hierarchy in the WebDCC algorithm. As clustering solutions resulting from the k-NN algorithm are known to be affected by the order in which instances are presented, all values in this experiment were obtained as the average of ten algorithm runs with different sequences of instances. The same orders of instances were then repeated for every value of the similarity threshold. Fig. 2 shows the variations in clustering quality along possible values of the similarity threshold for the boolean and numerical weighting schemes, measured by F-Measure [11]. According to these experimental results, hereafter we consider instances to be Web page representations resulting from the application of the numerical weighting scheme previously described as well as the Porter stemming algorithm to reduce morphological variants of terms in Web pages.

[Fig. 2 shows two panels, (a) and (b), one per collection, plotting F-Measure against the similarity threshold (0.0-1.0) for the numerical and boolean weighting schemes.]

Fig. 2. F-Measure scores for boolean and numerical weighting schemes.

2.2. Hierarchical classification approach

In a first stage of the WebDCC algorithm, instances have to be assigned to categories in the current hierarchy of concepts. At any time, this hierarchy could consist of only a single concept representing the root category or several concepts in one or more hierarchical levels. In other words, a text categorization problem is stated where a set of documents D = {d_1, d_2, ..., d_n} has to be assigned to a set of prescribed categories C = {c_1, c_2, ..., c_m}. Particularly, categories are disposed in a hierarchy, so that the task of categorizing text is known as hierarchical text categorization as opposed to flat text categorization.

Exploiting the hierarchical relationships between categories, the global classification problem can be divided into a set of smaller classification problems corresponding to the splits in the hierarchy. For each node in the hierarchical structure, a separate classifier is induced to distinguish documents that should be assigned to the category from other documents. In consequence, at each decision point in the hierarchy a classifier is concerned with a binary classification problem where a document is either relevant or not to a given category.

Based on this hierarchical organization of classifiers, the classification task takes place in a top-down manner. Initially, instances belonging to the root category (i.e., all documents) are categorized by the first-level classifiers. Having selected a category at a certain level in the hierarchy, only its children are considered as prospective categories at the next level. This procedure continues until either all instances have reached some leaf node in the concept hierarchy or they cannot be further classified.

The motivation behind the construction of a hierarchy of classifiers is twofold. First, existing experiences have shown that both precision and recall of flat classifiers decrease as the number of categories increases [12,13]. However, by dividing the classification problem into several problems
involving a smaller number of categories, each classifier should be able to classify instances more accurately. Second, as each classifier deals with a simpler classification task, it has only to focus on a reduced set of features distinguishing instances belonging to the category the classifier represents. As an example, there seems to be a fairly small set of terms, such as sports, team and score, whose presence in an instance clearly determines whether the instance belongs to the sports category instead of the politics category for the hierarchy in Fig. 1. As these terms are unlikely to be useful for the next classifiers to determine the category at the next hierarchical level, each of these classifiers should be based on its own set of features, such as football, cup and FIFA, on the one hand, and basketball and NBA, on the other hand. A reduction in the number of features not only decreases the induction time, but it can also increase the accuracy of the resulting model. Even though each classifier uses only a small set of features, the overall set of features is used at different stages of the classification process.

A classifier is a function F_i : d_j → [0, 1] that, given an instance d_j, returns a number between 0 and 1 representing the evidence for the fact that d_j should be classified under the category represented by c_i. This function also has a threshold τ such that F_i(d_j) ≥ τ is interpreted as a decision to classify d_j under c_i, while F_i(d_j) < τ is interpreted as a decision not to classify d_j under c_i.

Although there are several types of classifiers potentially applicable to this problem (including decision trees, Naive Bayes, etc.), linear classifiers exhibit interesting properties for use in the context of the WebDCC algorithm. Linear classifiers are a family of text classification learning algorithms which examine training instances a finite number of times in order to construct a prototype instance which is later compared with the instances to be classified. A prototype, which serves also as concept description, consists of a weighted vector:

  c_i = ⟨(t_1, w_1), ..., (t_m, w_m)⟩,

where w_j is the weight associated to the term t_j in the category c_i.

Linear classifiers are both efficient, since classification is linear in the number of terms, instances and categories, and easy to interpret, since it is assumed that terms with higher weights are better descriptors for a category than those with lower weights. Furthermore, the representation of linear classifiers is similar to the representation of documents, allowing them to take advantage of standard information retrieval techniques.

WebDCC aims at obtaining a hierarchical set of linear classifiers, each of which is based on a set of relevant features. This goal is achieved by combining a feature selection algorithm to choose the appropriate terms at each node in the tree and a supervised learning algorithm to construct a classifier for that node, as is explained in Section 2.4.

In order to determine whether an instance d_new should be assigned to a given category c_i in the hierarchy of linear classifiers, the dot product between the document vector and the feature-weighted vector constituting the classifier is computed as follows:

  Classify(d_new, c_i) = d_new · c_i = Σ_{j=1}^{m} w_j(d_new) · w_j(c_i).   (2)

The value resulting from the application of this function represents the confidence in the fact that an instance belongs to a given category. Hence, if it is higher than a given classification threshold τ, the instance is assigned to the category; otherwise, the document is considered as not belonging to the category. For the classification of new instances, the Classify function is applied to every concept at each level of the hierarchy, obtaining a confidence value for the instance belonging to each of them. For example, given the categories at the first level of the hierarchy C = {c_sports, c_politics}, the Classify function applied to each of them might return the following result: {(c_sports, 0.97), (c_politics, 0.14)}. In this case, it can be inferred that the document d_new is likely to belong to the sports category, but it can hardly belong to the politics category.

After the application of this function to the categories in a given hierarchy level, three situations are possible: (1) the document exceeds the threshold for only one category, in which case it is classified in that category and the classification continues at the next hierarchical level; (2) the document exceeds the threshold in more than one category, in which case the document is classified into the best category and continues at the next hierarchical level; (3) the document does not exceed the threshold for any category, in which case the classification stops and the instance is clustered below this category.

Naturally, the value of the threshold τ plays an important role in the classification process. A high value of τ would prevent instances from being classified into the correct category, whereas a small value of τ would cause instances to be classified into wrong categories. In order to tune this parameter we carried out a number of experiments with the Syskill&Webert and BankSearch collections. For these experiments, Web pages in both collections were divided into a training set (60%) and a test set (40%). The training set for the first collection resulted in 199 instances distributed in the four categories present in the collection. Based on this training set the algorithm generated the hierarchy shown in Fig. 3. The training set for the second collection consisted of 360 instances, on which the algorithm was based to generate the hierarchy shown in Fig. 4.

[Fig. 3 shows a ROOT concept with four child concepts whose descriptions comprise the stemmed terms excerb 1.0; goat 1.0; sheep 1.0; med 1.0, mon 1.0, medicin 0.8; mus 1.0, review 1.0, song 1.0, stere 1.0.]

Fig. 3. Hierarchy for the Syskill&Webert collection.

[Fig. 4 shows a two-level hierarchy below ROOT: concepts described by team 0.076 and code 0.128, with child concepts described by race 0.198; england 0.191, cup 0.149, game 0.094, footbal 0.090; java 0.324; and visual 0.117, basic 0.117.]

Fig. 4. Hierarchy for the BankSearch collection.

For each detected concept the figures summarize the terms constituting the concept descriptions along with the term weights. As can be observed, the algorithm effectively identified the four concepts involved in the first set of pages as well as the two hierarchical levels in the second one. Based on the two hierarchies, we performed experiments with the test set conformed by the remaining 133 instances in one case and 240 in the other, in order to determine a suitable value for the classification threshold τ. In these experiments, we measured the effectiveness of the classification process using precision and recall measures, i.e., the ratio of correct decisions to the total number of decisions and the ratio of decisions to the total number of documents, respectively. Those documents left in clusters below non-terminal categories were counted as not belonging to any category, consequently reducing recall but not precision.

Fig. 5(a) and (b) plot the variation of precision and recall along different values of the classification threshold τ for both collections. In both figures, there is a trend for precision to increase as the classification threshold grows, so that the probability of misclassification is reduced. However, the growth of the classification threshold also causes a strong decrease in recall, leading to a higher amount of pages being placed outside the category they belong to (in child clusters of the root category in the Syskill&Webert collection, and in child clusters of the root category as well as of Sports and Programming Languages in the BankSearch collection). In the figures, the relationship between precision and recall is also summarized by F-Measure. According to the exposed results, we set the value of the classification threshold τ to 0.7, where a compromise between recall and precision is attained.

2.3. Non-hierarchical clustering approach

At the bottom levels of the hierarchy the algorithm behaves as a simple k-NN algorithm. The fundamental assumption behind this approach is that instances which are close together
in the feature space according to an appropriate similarity measure belong to the same class. Thus, once an instance has reached a given concept in the hierarchy, either because this concept is a terminal one or because the instance cannot be further classified down, it is placed in the cluster the majority of the k nearest instances belong to. In order to determine which these instances are, a new instance d_new must be compared with every instance already seen by the algorithm. Finally, if k = 1 the instance is assigned to the cluster of the closest instance, or to the cluster the majority of the closest instances belong to if k > 1. An instance not close enough to any other instance in the current clusters, given a predefined similarity threshold δ, leads to a new singleton cluster.

[Fig. 5 shows two panels, (a) and (b), one per collection, plotting precision, recall and F-Measure against the classification threshold (0.0-1.0).]

Fig. 5. Classification threshold setting.

k-NN is based on a distance measure among instances which, in the case of documents, makes it possible to determine the degree of resemblance between the vectors representing them. In this regard, the dominant similarity measure in information retrieval and text classification is the cosine similarity measure, which evaluates the cosine of the angle conformed by two vectors in the space [14]. This cosine similarity can be calculated as the normalized dot product:

  sim(d_i, d_j) = Σ_{k=1}^{r} w_ik · w_jk / ( sqrt(Σ_{k=1}^{r} w_ik²) · sqrt(Σ_{k=1}^{r} w_jk²) ),   (3)

where d_i and d_j are the respective instances, w_ik and w_jk the weights of the word k in each instance, and r the number of different words in both instances. In order to handle other structures besides text, such as RDF-based meta-data descriptions, more specific similarity measures need to be defined to compare other instance attributes and their values.

As k-NN determines the best cluster for a new instance using the k nearest instances, both the similarity measure and, particularly, the similarity threshold have a critical effect on the k-NN performance. We carried out a number of experiments with the mentioned collections in order to determine the range of values of the similarity threshold δ in which the WebDCC algorithm reaches better quality solutions. For these experiments, we analyzed two aspects of the resulting clustering solutions that the similarity threshold δ impacts on: the total number of clusters and the cluster homogeneity. A value of δ near 0 will produce very few clusters (a single cluster when δ = 0) grouping even highly dissimilar instances, which is almost as informative as the original collection of documents; while a value of δ near 1 will produce a big number of clusters (the same number of clusters as instances when δ = 1), which is highly informative but practically useless.
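The following Python sketch, continuing the earlier ones, shows Eq. (3) and the cluster-selection rule: terms in the set T (the descriptions of c_parent and its ancestors) are masked out of the comparison, and an instance whose nearest neighbors all fall below δ starts a singleton cluster. The function names are our own, and the defaults (k = 1, δ = 0.2) merely anticipate the similarity-threshold tuning of this subsection.

import math

def sim(di, dj, masked=frozenset()):
    # Eq. (3): cosine similarity, ignoring the terms in `masked` (the set T).
    a = {t: w for t, w in di.items() if t not in masked}
    b = {t: w for t, w in dj.items() if t not in masked}
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_cluster(d_new, clusters, masked, k=1, delta=0.2):
    # Return the cluster that should receive d_new, or a fresh singleton cluster.
    scored = sorted(((sim(d_new, d, masked), c) for c in clusters for d in c.instances),
                    key=lambda pair: pair[0], reverse=True)
    neighbors = [c for s, c in scored[:k] if s >= delta]
    if not neighbors:
        new = Cluster()           # nothing within delta: the instance starts its own cluster
        clusters.append(new)
        return new
    return max(neighbors, key=neighbors.count)  # majority cluster among the nearest instances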

In order to take these two aspects into consideration in the similarity threshold tuning, we analyzed the quality of the clustering solutions of the k-NN algorithm under variations of δ based on the number of resultant clusters and a measure of cluster homogeneity such as Entropy [15]. Labeled training data were used for the entropy calculation but not during clustering, expecting the clusters generated by the algorithm to be as homogeneous as possible with respect to the labeling of their containing instances. However, since the best entropy is achieved when each cluster contains exactly one single instance, a suitable value of δ is always a trade-off between entropy and number of clusters.

Fig. 6(a) and (b) plot the values achieved with these measures along values of the similarity threshold for the datasets being analyzed. Again, the results summarize ten algorithm runs with different sequences of instances. Entropy and number of clusters are summarized in both figures into a single value by means of the information-theoretic external cluster-validity measure proposed by [16], simply validity measure hereafter. This measure refines entropy to penalize both highly heterogeneous clusters and a large number of clusters. As can be observed in both figures, the lines for normalized entropy and number of clusters (a proportion over the maximum possible number of clusters) cross each other near the similarity threshold value of δ = 0.2, where coincidently the minimum value of the validity measure is achieved.

[Fig. 6 shows two panels, (a) and (b), one per collection, plotting entropy, number of clusters and the validity measure against the similarity threshold (0.0-1.0).]

Fig. 6. Similarity threshold setting.

2.4. Hierarchical concept formation approach

Hierarchical concept formation involves the gradual creation of concepts summarizing instances within clusters and is mainly driven by the Evaluation function. Every time a cluster update takes place, caused by the insertion of a new instance into the hierarchy, this function determines whether a new concept can be defined to summarize a given set of instances. Because of the incremental nature of the algorithm, concept formation is defined over the instances that are part of a cluster, without taking into consideration those assigned to other clusters in the hierarchy. Thus, the algorithm does not unnecessarily retard the formation of concepts until evidence about several categories is gathered. In the algorithm's intended destination, which is modeling user interests, this is particularly advantageous, as interests do not develop similarly (e.g., if the user reads more about sports than about politics, the first concept should be created before and independently of the second one).
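The text does not spell out the entropy measure itself; the sketch below implements the standard clustering entropy (the size-weighted average of the per-cluster entropy of class labels), which is our assumption about the measure cited from [15].

import math
from collections import Counter

def clustering_entropy(clusters_labels):
    # Size-weighted average, over clusters, of the Shannon entropy of the class
    # labels inside each cluster; 0.0 means perfectly homogeneous clusters.
    total = sum(len(labels) for labels in clusters_labels)
    result = 0.0
    for labels in clusters_labels:
        counts = Counter(labels)
        h = -sum((n / len(labels)) * math.log2(n / len(labels)) for n in counts.values())
        result += (len(labels) / total) * h
    return result

print(clustering_entropy([["sports"] * 5, ["politics"] * 4]))  # 0.0: two pure clusters
print(clustering_entropy([["sports", "politics"] * 3]))        # 1.0: one fully mixed cluster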

Given the algorithm's goal, which is partitioning the whole set of instances into a collection of categories of increasing specificity, a suitable evaluation measure should recognize the opportunity to create a new concept each time the existence of instances from more than one category is presumed inside a cluster. In other words, the lack of cluster cohesiveness can be considered a good indicator of the fact that, even though instances belong to the same cluster because of their common features, they correspond to distinctive aspects of the same general category. Furthermore, it can be assumed that, by subtracting the set of features shared by most instances in the cluster to describe a general category, a new partitioning of instances could be obtained and the hierarchy could gain an additional level of specificity.

A method to compute cluster cohesiveness consists in using the average pairwise similarity of the instances in the cluster as follows:

  Evaluation(s_r) = (1/n_r²) Σ_{d_i, d_j ∈ s_r} sim(d_i, d_j),   (4)

where n_r is the size of the cluster s_r. Mathematically, the average pairwise similarity between all instances in the cluster, including self-similarity, is equivalent to the length of the centroid vector, since all documents are scaled to be unit length. Therefore, the cohesiveness value can be efficiently calculated by using the centroid vector of clusters.

If the cohesiveness value is higher than a threshold φ, a new concept is created. Otherwise, no update in the hierarchy takes place. We have empirically determined a value of φ = 0.25 by comparing hierarchies of concepts generated with different values of φ for the Syskill&Webert and BankSearch collections against the target hierarchies, i.e., the hierarchy containing the four topics for the former collection and the hierarchy containing two topics in the first level and four topics in the second one in the latter collection.

In case the creation of a new concept is established, a feature selection method is applied over the instances in the cluster to be summarized. This method takes as input a set of features and outputs a subset of these features which are relevant to the target concept. Two main feature selection approaches are used in machine learning [17]: the filter and the wrapper approach. Feature selection in the WebDCC algorithm is performed by a filter approach based on an ordering scheme. In this approach, a weight is individually computed for each feature and features are ordered according to their assigned weights. A feature selection threshold, σ, is defined in the [0, 1] range such that the weight required for a feature to be selected needs to be higher than σ. A simple and effective approach to weigh terms is the document frequency, denoted by DF(t_k), which is the number of instances in which the term t_k occurs. The document frequency is computed for every unique term appearing in the cluster and those terms whose document frequency is less than σ are removed from the feature space. This method assumes that rare terms are either non-informative for category prediction or not influential in global performance.

After feature selection, a supervised learning algorithm is applied to learn a classifier for the new category. In particular, we used in this work an instantiation of the Rocchio algorithm [18], the same used in [19], with parameters fixed to α = 1 and β = 0, where a prototype for a category c_i ∈ C is defined as follows:

  p(c_i) = (1/n_ci) Σ_{d ∈ D_ci} d.   (5)

Hence, a classifier for a category c_i is the plain average or centroid of all instances belonging to this category. For the classification of new instances, the closeness of an instance to the prototype vector is computed by using the dot product as described in Eq. (2).

As was stated by [20], an accurate classification can be based on only a few features, those with better discriminant power for a given category. According to this observation, we propose an aggressive selection of terms by setting the selection threshold σ to a very high value. We analyzed the effects of varying this parameter in terms of the quality of the achieved clustering solutions. Fig. 7 plots the F-Measure scores reached for different values of the selection threshold σ for the Syskill&Webert and BankSearch collections. In both cases the highest values are reached for a value of σ equal to 0.8.

[Fig. 7 plots F-Measure against the feature selection threshold (0.0-1.0) for the Syskill&Webert and BankSearch collections.]

Fig. 7. Selection threshold setting.

3. Dataset description

In this paper, we carried out a number of experiments with the Syskill&Webert Web Page Ratings collection [21] as well as the BankSearch Web page collection [22].

Web pages in the Syskill&Webert collection belong to four different categories: Bands (61 pages), Goats (70 pages), Sheep (65 pages) and BioMedical (136 pages). From the original dataset we removed empty pages and those corresponding to not-found pages, leading to a total of 332 Web pages. This dataset also contains a single user's rating of the interest in each page on a three-point scale: hot or very interesting (108 pages), medium or quite interesting (9 pages) and cold or not interesting (215 pages). User ratings provide a valuable means to evaluate the performance of personal information agents in Web page user interest prediction.

A second dataset, the BankSearch Web page collection, was used to study the hierarchical clustering solutions provided by the WebDCC algorithm. From this collection we used a subset of 600 Web pages divided into two broad categories, Sports (300 pages) and Programming Languages (300 pages). Within these categories pages belong to two subcategories: Motor Sports (150 pages) and Soccer (150 pages) within Sports; and Visual Basic (150 pages) and Java (150 pages) within Programming Languages.

4. Evaluation of clustering results

In order to evaluate WebDCC performance we compared its results with those obtained from traditional hierarchical agglomerative clustering (HAC) [23]. HAC involves building a tree structure by using a bottom-up method. An agglomerative algorithm begins with each instance in a distinct cluster and successively merges the two most similar clusters at each step to create a new internal node in the tree. This process continues until a single, all-inclusive root node is reached. The result of a hierarchical clustering algorithm is a binary tree, or dendrogram, that depicts the merging process and the intermediate clusters.

Agglomerative clustering depends on the definition of a similarity measure between clusters to determine the pair of clusters to be merged at each step. Numerous approaches have been developed for computing this similarity, the most common alternatives being the single-link, complete-link and group-average schemes. The single-link scheme measures the similarity of two clusters by the maximum similarity between the documents from each cluster, the complete-link scheme by the minimum similarity, and the group-average scheme measures the similarity of two clusters as the average of the pairwise similarities of documents from both clusters.

In order to experimentally contrast the performance of agglomerative clustering and the WebDCC algorithm in obtaining hierarchical clustering solutions, we compared the entropy values achieved with the previous datasets. As the resulting dendrogram of an agglomerative clustering can be analyzed at different levels of granularity, we summarized the results obtained by stopping the agglomeration process when k clusters were left. The results of HAC with different numbers of clusters using the single-link, complete-link, and group-average schemes are shown in Tables 1 and 2 for the Syskill&Webert and BankSearch collections, respectively.
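For reference, an HAC baseline of this kind can be reproduced with off-the-shelf tooling; a sketch using SciPy follows. Cutting the dendrogram at k clusters mirrors the paper's procedure of stopping the agglomeration when k clusters are left; the cosine metric and the data layout are our assumptions, not details given by the authors.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hac_labels(X, k, scheme="average"):
    # Agglomerative clustering over row-vector documents X, cut at k clusters.
    # scheme: 'single', 'complete' or 'average' (single-link, complete-link, group-average).
    Z = linkage(X, method=scheme, metric="cosine")
    return fcluster(Z, t=k, criterion="maxclust")

# Toy usage: four documents in a three-term space, cut into two clusters.
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9],
              [0.0, 0.2, 0.8]])
print(hac_labels(X, k=2))  # e.g. [1 1 2 2]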

Table 1
Evaluation of WebDCC and HAC in the Syskill&Webert collection

Algorithm   # Clusters   Single-link   Complete-link   Group-average
HAC         2            0.953         0.788           0.953
            4            0.949         0.607           0.948
            5            0.946         0.603           0.944
            7            0.939         0.484           0.556
            10           0.933         0.439           0.342
            15           0.918         0.373           0.277
            20           0.902*        0.174*          0.143*
            Average      0.934         0.495           0.594
WebDCC      5            0.198         0.198           0.198

Table 2
Evaluation of WebDCC and HAC in the BankSearch collection

Algorithm   # Clusters   Single-link   Complete-link   Group-average
HAC         2            0.994         0.883           0.994
            4            0.986         0.698           0.986
            5            0.983         0.649           0.587
            7            0.969         0.560           0.575
            10           0.961         0.467           0.389
            15           0.943         0.336           0.370
            20           0.920*        0.263*          0.328*
            Average      0.965         0.550           0.604
WebDCC      7            0.384         0.384           0.384

By running the WebDCC algorithm over these datasets, we obtained two hierarchies with a high number of clusters in their leaves and, consequently, low entropy values. In order to compare WebDCC results with those obtained from HAC, we calculated entropy considering five clusters for the Syskill&Webert collection and seven for the BankSearch collection. In the first case, four clusters correspond to the four terminal categories (instances in each of these categories are considered a single cluster) and the last cluster groups those instances left in the root category. In the second case, four clusters correspond to the terminal categories and the remaining three clusters correspond to instances left in the root category and instances left as children of the two categories in the first hierarchical level (but outside terminal categories).

In spite of the non-incremental nature of HAC, its results are only comparable with those of WebDCC when a high number of clusters is considered. In both tables, the best result of agglomerative clustering for every cluster-similarity scheme is marked with an asterisk. Agglomerative clustering reached lower entropy values than WebDCC only with 20 clusters and the complete-link or group-average schemes. It is worth noticing that WebDCC not only obtained better solutions with fewer clusters, but also generated descriptive hierarchies for each collection in an incremental fashion. Conversely, dendrograms from agglomerative clustering are difficult to explore, since no descriptive information about clusters is provided and different partitions of instances can be achieved depending on the level of the tree.

5. An experience with WebDCC: the PersonalSearcher agent

The WebDCC algorithm was applied in the development of an agent that assists users in finding interesting documents on the Web, named PersonalSearcher [24]. This agent carries out a parallel search in the most popular Web search engines and filters the resultant list of Web documents according to the user profiles it builds based on the observation of user browsing behavior on the Web.

For each reading in the standard browser the agent observes a set of indicators in order to estimate the user interest in a given Web page. This mechanism, known as implicit feedback, allows the agent to obtain Web pages relevant to the user without distracting him from his normal behavior. These indicators are the time consumed in reading in relation to its length, the amount of scrolling in a page and whether it was added to the list of bookmarks. Web pages considered interesting by the user through this means are used as input to the WebDCC algorithm, whereas the algorithm output corresponds to the hierarchy of concepts constituting the user profile.

Users interact with their PersonalSearcher expressing their information needs by keywords as usual. The agent posts these queries to the most popular search engines (Google, Yahoo, etc.), obtaining a set of documents that cover a wide portion of the Web. The relevance degree of each document in relation to the user profile is computed by the agent to determine the convenience of suggesting the document to the user for future reading. Only documents that exceed a user-given relevance threshold with regard to some category in the user profile are sent back to the user as a result of his query. PersonalSearcher allows the user to customize the desired level of assistance at any moment by adjusting the threshold from the agent graphical user interface (GUI).

In order to illustrate PersonalSearcher behavior, let us suppose a user who possesses a number of reading experiences in the Visual Basic and Java programming languages along with some readings about soccer and motor sports. Fig. 8 shows a profile for such a user, which has been built using Web pages from the BankSearch collection. If this user performs a Web search using the keywords programming languages, the list of resulting pages may include pages about several aspects of programming, ranging from theoretical studies to diverse languages, as in the search results shown in Fig. 9. However, knowing the user interests, PersonalSearcher suggests pages mostly related to Visual Basic and Java, as can be observed in Fig. 10. Those pages belonging to other languages such as FoxPro or C++ were discarded by the agent as they were considered uninteresting to the user.

Fig. 8. Example of a user profile in PersonalSearcher.

Fig. 9. Example of search results in PersonalSearcher.

Fig. 10. Example of suggestions in PersonalSearcher.

The effectiveness of PersonalSearcher in recommending Web pages during searching was evaluated based on the standard precision and recall measures from information retrieval [14]. From the user point of view, precision measures the proportion of relevant documents in those recommended by the agent, while recall is the proportion of relevant documents in the search results that were in fact recommended to the user. In order to calculate precision and recall values it is necessary to know the relevance of each document in the search results.
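A sketch of the filtering step just described, reusing classify and to_instance from the earlier sketches: a result page is suggested only if its confidence with respect to some category in the profile hierarchy exceeds the user-given relevance threshold. The traversal and names are our illustration of the behavior, not the agent's actual code, and the default of 0.6 anticipates the tuning reported next.

def relevant(instance, profile_root, threshold=0.6):
    # True if some category in the user profile scores above the relevance threshold.
    stack = list(profile_root.children)
    while stack:
        concept = stack.pop()
        if classify(instance, concept) >= threshold:
            return True
        stack.extend(concept.children)
    return False

def filter_results(result_pages, profile_root, threshold=0.6):
    # Keep only the search-result pages (raw texts) that match the user profile.
    return [p for p in result_pages if relevant(to_instance(p), profile_root, threshold)]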

We used for this experiment the ratings given by a single user to pages in the Syskill&Webert collection that, as we have mentioned, are on a three-point scale: hot (very interesting), medium (quite interesting) and cold (not interesting at all).

A total of 82 pages (70%) from the 117 marked as hot and medium were used to build three different profiles, the first one with 50% of these pages (41 pages), the second with 70% (57 pages) and the third with 100% (82 pages). Based on each of these profiles we compared the agent effectiveness in recommending Web pages, considering as the result of a Web search the remaining pages, i.e., the search result consists of all the cold pages in the collection and the remaining 35 hot and medium pages not previously used in profile construction.

Fig. 11 shows the average of precision and recall reached by the agent along different values of the relevance threshold. This figure also shows the variations in the precision/recall value for the three different user profile sizes (amount of experiences/pages). From these results, we conclude that the agent improves its performance as the user profile grows and the agent has more knowledge about the user interests. The highest accuracy was always obtained with the relevance threshold set at 0.6, regardless of the user profile size. According to these results, PersonalSearcher is configured to use this value by default, although the user can still manually change this relevance threshold from the GUI.

[Fig. 11 plots accuracy against the relevance threshold (0.0-1.0) for profiles built with 50%, 70% and 100% of the pages.]

Fig. 11. Accuracy of Web page prediction in PersonalSearcher.

PersonalSearcher agents assisting several users can take advantage of the existing knowledge in the community they are immersed in through cooperation. A quantitative comparison of user profiles, based on the Triple Matching-Model (MD3) [25], was implemented in PersonalSearcher in order to enable cooperation. Initial experimental results in profile comparison proved that similar topics in two user profiles can be identified by applying the previous model [26]. A cooperative approach fosters knowledge sharing and, consequently, would potentially enrich the results of individual agents by giving access to other agents' experience.

6. Related works

Hierarchical conceptual clustering plays an important role in ontology learning for the Semantic Web. Manual creation of ontologies by domain experts presents the well-known problem of the acquisition bottleneck; however, fully automating the creation of ontologies is unfeasible. Ontology learning aims at integrating numerous disciplines, particularly machine learning, to facilitate ontology construction. As an example, Text-To-Onto [5] is a framework which combines knowledge acquisition with machine learning to extract semantics from resources found on the Web (e.g., free text, semi-structured text and schema definitions). In this framework, ontology extraction gleans a major part of the target ontology from Web documents through learning. Particularly, hierarchical conceptual clustering is recognized as a powerful method in the ontology extraction phase, whose results are refined and validated in later ontology learning steps [4]. In the same line of research, a conceptual clustering algorithm such as COBWEB [27] was applied to ontology discovery in on-line communities [6]. COBWEB results are translated to RDF schemas for the Semantic Web and used by a song recommendation application. In contrast to the referred works, whose goal is to extract semantic knowledge from unannotated Web pages, approaches for clustering objects described
by ontology-based meta-data are proposed in [28,29]. A study of the impact the Semantic Web would have on the learning algorithms used for user profiling and personalization was performed in [30]. Results from this study demonstrated that the available Semantic Web markup cannot be expected to outperform conventional machine learning applied to plain text in regard to the accuracy of the learned model. Nevertheless, meaningful and potentially reusable knowledge was discovered by applying Inductive Logic Programming techniques [31] over Web pages annotated with semantic meta-data.
From the user profiling perspective, the shortcomings of existing clustering algorithms for building user profiles are mostly related to the clustering solutions they are able to supply, which do not resemble user interests, or to the way they build such solutions, which is generally not incremental. In this regard, hierarchical clustering techniques build tree structures by using either a bottom-up, agglomerative method or a top-down, divisive method. The former begins with each instance in a distinct cluster and successively merges the most similar clusters at each step to create a new internal node in the tree; this process continues until a single, all-inclusive root node is reached. The latter, instead, starts with a single cluster containing all instances and splits the resulting clusters until only clusters of individual instances remain. Merging of clusters or individual instances in agglomerative clustering, as well as splitting in divisive clustering, is usually binary, resulting in a binary tree or dendrogram that contains clustering information on many different levels. However, exploring such clustering solutions requires too much insight from the user, since these algorithms make no attempt to characterize the clusters they produce.
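As a reference point, the agglomerative variant can be sketched in a few lines; the following is a generic single-linkage implementation over a toy distance function, not the specific HAC configuration used in our experiments.

import itertools

def agglomerative(dist, n):
    """Bottom-up hierarchical clustering: start with n singleton clusters
    and repeatedly merge the closest pair, recording one internal (binary)
    node per merge -- i.e., the dendrogram."""
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Single linkage: the distance between two clusters is the
        # distance between their closest members.
        a, b = min(itertools.combinations(clusters, 2),
                   key=lambda pair: min(dist(x, y)
                                        for x in clusters[pair[0]]
                                        for y in clusters[pair[1]]))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id))
        next_id += 1
    return merges

# Toy usage: four points on a line; 0-1 and 2-3 merge first,
# then the two resulting clusters merge into the root.
points = [0.0, 0.1, 1.0, 1.1]
print(agglomerative(lambda i, j: abs(points[i] - points[j]), len(points)))

The divisive variant inverts this loop, starting from one all-inclusive cluster and recursively splitting it. In both cases the output is a bare merge (or split) history rather than characterized concepts, which is precisely the limitation discussed above.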
Conversely, algorithms in the conceptual clustering paradigm create concepts that define or exemplify the properties of a set of instances. COBWEB, for example, is an incremental conceptual clustering algorithm which constructs a tree from a sequence of observations. Each discovered concept in COBWEB records the probability of each attribute and value, and is updated every time an instance is added; a category utility measure is used to determine whether child concepts should be created. However, the probabilistic concepts of COBWEB limit the application of this algorithm in document clustering, where the space of features and values is highly dimensional.
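For reference, the category utility of a partition \(\{C_1, \ldots, C_K\}\) over nominal attributes \(A_i\) with values \(V_{ij}\) is commonly written, following [27], as

\[ CU = \frac{1}{K} \sum_{k=1}^{K} P(C_k) \left[ \sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2 - \sum_{i} \sum_{j} P(A_i = V_{ij})^2 \right], \]

that is, the average increase in the expected number of attribute values that can be correctly guessed once the cluster of an instance is known. In document clustering, where every term is an attribute, estimating and storing all the conditional probabilities is what makes this representation costly.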
7. Conclusions

In user profiling, document clustering offers the advantage that a priori knowledge of user interest categories is not needed, so that the categorization is completely unsupervised. WebDCC is a conceptual clustering algorithm designed to support user profiling in intelligent information agents. The advantages of this algorithm are twofold. First, as an incremental approach it allows agents to interact with users over time to acquire and adapt user interest categories. As a result, agents are able to deal with unpredictable and changing subject areas that cannot be anticipated by agent developers. Moreover, this algorithm enables an agent not only to discover a set of categories of interest, but also to focus on the aspects of these categories the user is really interested in. Consequently, a user profile contains categories at different levels of specificity according to the user's experiences on each of them. Second, unlike most user profiling approaches, WebDCC offers comprehensible clustering solutions that can be easily interpreted and explored by either users or other agents. The WebDCC capability for extracting semantics from Web documents enables it to be integrated in the ontology extraction phase that supports the ontology learning process for the Semantic Web.

In this paper, we also compared the clustering solutions provided by WebDCC with those of agglomerative hierarchical clustering (HAC). Experimental results demonstrated that, in spite of the incremental nature of WebDCC, its performance is similar to or even better than that of HAC when a reduced number of clusters is considered. Moreover, our algorithm offers the added value of a conceptual hierarchy to be used by personal agents. In spite of the potential of this algorithm for the proposed task, there are some issues that require a thorough analysis. First, as an incremental approach, WebDCC is affected by the order in which instances are presented. It remains to evaluate whether a strategy to reduce the impact of instance ordering would improve the quality of clustering solutions. Second, although preliminary experiments have suggested promising results, profile comparison in agent communities performing collaborative filtering needs further analysis.

The application of WebDCC in a real-world agent such as PersonalSearcher has proved the algorithm's capacity for handling the task at hand. In this experience, the algorithm generated accurate profiles which allow agents to provide effective personalization, to adapt profiles in order to improve agent efficiency over time, and to present comprehensible profiles to users. Concept hierarchies produced by WebDCC also facilitate profile comparison in order to assist a community of users in a collaborative way by identifying the concepts users have in common.

Acknowledgements

Research reported in this paper has been partially supported by Fundación Antorchas.

References

[1] D. Mladenic, Text-learning and related intelligent agents: a survey, IEEE Intell. Syst. 14 (4) (1999) 44–54.
[2] T. Berners-Lee, J. Hendler, O. Lassila, The Semantic Web, Sci. Am. (2001).
[3] B. Berendt, A. Hotho, G. Stumme, Towards Semantic Web mining, in: Proceedings of the First International Semantic Web Conference on The Semantic Web, Springer, 2002, pp. 264–278.
[4] D. Faure, C. Nedellec, A corpus-based conceptual clustering method for verb frames and ontology acquisition, in: P. Velardi (Ed.), Proceedings of Adapting Lexical and Corpus Resources to Sublanguages and Applications, Workshop of the 1st International Conference on Language Resources and Evaluation (LREC), Granada, Spain, 1998, pp. 1–8.
[5] A. Maedche, S. Staab, Ontology learning for the Semantic Web, IEEE Intell. Syst. (2001) 72–79.
[6] P. Clerkin, P. Cunningham, C. Hayes, Ontology discovery for the Semantic Web using hierarchical clustering, in: Semantic Web Mining Workshop, 2001.
[7] R. Michalski, R. Stepp, Learning from observation: conceptual clustering, in: R.S. Michalski, J.G. Carbonell, T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, 1983, pp. 331–363.
[8] K. Thompson, P. Langley, Concept formation in structured domains, in: D. Fisher, M. Pazzani, P. Langley (Eds.), Concept Formation: Knowledge and Experience in Unsupervised Learning, Morgan Kaufmann, 1991.
[9] G. Salton, C. Yang, A. Wong, A vector space model for automatic indexing, Communications of the ACM (1975) 613–620.
[10] M. Porter, An algorithm for suffix stripping, Program 14 (3) (1980) 130–137.
[11] C.J. Van Rijsbergen, Information Retrieval, Department of Computer Science, University of Glasgow, London, UK, 1979.
[12] C. Apte, F. Damerau, S. Weiss, Automated learning of decision rules for text categorization, Inform. Syst. 12 (3) (1994) 233–251.
[13] S. D'Alessio, A. Kershenbaum, K. Murray, R. Schiaffino, Category levels in hierarchical text categorization, in: Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing, Granada, Spain, 1998.
[14] G. Salton, M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[15] C. Shannon, A mathematical theory of communication, Bell System Technical Journal 27 (1948) 398–403.
[16] B. Dom, An information-theoretic external cluster-validity measure, in: Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence, Morgan Kaufmann, Alberta, CA, 2002, pp. 137–145.
[17] G. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in: Proceedings of the 11th International Conference on Machine Learning, Morgan Kaufmann, 1994.
[18] J. Rocchio, Relevance feedback in information retrieval, in: The SMART Retrieval System, Prentice-Hall, 1971, pp. 313–323.
[19] S. Dumais, J. Platt, D. Heckerman, M. Sahami, Inductive learning algorithms and representations for text categorization, in: Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM'98), 1998, pp. 148–155.
[20] D. Koller, M. Sahami, Hierarchically classifying documents using very few words, in: Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann, 1997, pp. 170–178.
[21] M. Pazzani, J. Muramatsu, D. Billsus, Syskill&Webert: identifying interesting Web sites, in: Proceedings of the 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference, Portland, US, 1996, pp. 54–61.
[22] M. Sinka, D. Corne, A large benchmark dataset for Web document clustering, Frontiers in Artificial Intelligence and Applications, Soft Computing Systems: Design, Management and Applications 87 (2002) 881–890.
[23] W.B. Frakes, R. Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice-Hall, 1992.
[24] D. Godoy, A. Amandi, PersonalSearcher: an intelligent agent for searching Web pages, in: Proceedings of the International Joint Conference, 7th Ibero-American Conference on AI, 15th Brazilian Symposium on AI (IBERAMIA-SBIA 2000), Lecture Notes in Computer Science, vol. 1952, Springer, 2000, pp. 43–52.
[25] M.A. Rodríguez, Assessing semantic similarity among spatial entity classes, Ph.D. Thesis, University of Maine, 2000.
[26] G. Gimenez Lugo, A. Amandi, J.S. Sichman, D. Godoy, Enriching information agents' knowledge by ontology comparison: a case study, in: Proceedings of the 8th Ibero-American Conference on Artificial Intelligence (IBERAMIA 2002), Lecture Notes in Computer Science, vol. 2527, Springer, 2002, pp. 546–555.
[27] D.H. Fisher, Knowledge acquisition via incremental conceptual clustering, Mach. Learning 2 (1987) 139–172.
[28] A. Maedche, V. Zacharias, Clustering ontology-based metadata in the Semantic Web, in: T. Elomaa, H. Mannila, H. Toivonen (Eds.), Principles of Data Mining and Knowledge Discovery, 6th European Conference, PKDD 2002, Lecture Notes in Computer Science, vol. 2431, Springer, Helsinki, Finland, 2002, pp. 348–360.
[29] A. Delteil, C. Faron-Zucker, R. Dieng, Learning ontologies from RDF annotations, in: A. Maedche, S. Staab, C. Nedellec, E.H. Hovy (Eds.), Workshop on Ontology Learning, Proceedings of the Second Workshop on Ontology Learning OL'2001, Seattle, US, 2001.
[30] G.A. Grimnes, Learning knowledge rich user models from the Semantic Web, in: P. Brusilovsky, A.T. Corbett, F. de Rosis (Eds.), Proceedings of the 9th International Conference on User Modeling, Lecture Notes in Computer Science, vol. 2702, Springer, 2003, pp. 414–416.
[31] S. Muggleton, Inverse entailment and Progol, New Generation Comput., Special Issue on Inductive Logic Programming 13 (3–4) (1995) 245–286.
