
Web Document Clustering Using Fuzzy Equivalence Relations


Anjali B. Raut, G. R. Bamnote
Journal of Emerging Trends in Computing and Information Sciences,
Volume 2, Special Issue, ISSN 2079-8407
Contents
The paper is structured as follows
I. INTRODUCTION
II. RELATED WORK
III. PROPOSED WORK
IV. EXAMPLE
V. RESULTS AND SNAPSHOTS
VI. CONCLUSION AND FUTURE RESEARCH

Abstract
Conventional clustering
classifies the given data objects into exclusive subsets (clusters).
We can clearly discriminate whether an object belongs to a cluster or not.
Problem:
This crisp partitioning does not adequately represent many real situations.

Fuzzy clustering
constructs clusters with uncertain boundaries and allows one
object to belong to overlapping clusters with some membership degree.
It considers not only whether an object belongs to a cluster, but also
to what degree it belongs.


Introduction
WWW
It has become a major source of information.
The amount of information on the web is growing tremendously.
Search engines are used to search the web quickly.

Web Mining
is the use of Data Mining techniques to automatically discover and
extract information from the web.
Clustering is one technique for improving the efficiency
of the information finding process.
Clustering
is a Data Mining tool for grouping objects into clusters such that
objects from the same cluster are similar and objects from
different clusters are dissimilar.
Introduction
Web Mining
has fuzzy characteristics, so
fuzzy clustering is sometimes better suited to Web Mining than
conventional clustering.
Because a document might be relevant to multiple queries, it
should appear in each corresponding response set; otherwise
users would not be aware of it.
Fuzzy equivalence clustering
is a fuzzy clustering method based on fuzzy equivalence relations.


Web Mining
Data Mining
is the process of extracting or mining knowledge from data.
Web Mining is the use of Data Mining techniques to automatically discover
and extract information from web documents and services.

Web Mining
Process-centric view: defines Web Mining as a sequence of different processes.
Data-centric view: defines Web Mining in terms of the type of data used in the mining process.
Web Mining
Web Mining Process
[Figure, stages shown: Resource Discovery, Information Pre-processing, Information Extraction, Generalization]

Web Mining Taxonomy
The web has different facets that yield different approaches to the
mining process:
1. Web pages consist of text.
2. Web pages are linked via hyperlinks.
3. User activity can be monitored via Web server logs.
Web Mining
The three facets lead to three categories of Web Mining:
Web Content Mining (WCM): the process of extracting useful information from the contents of web documents.
Web Structure Mining (WSM): the process of discovering structure information from the web.
Web Usage Mining (WUM): the application of data mining techniques to discover interesting usage patterns from web usage data.
Clustering
Users can encounter the following problems when interacting
with the web:
Low precision: Today's search tools suffer from low precision because
many search results are irrelevant (difficulty finding the relevant
information).
Low recall: Not all of the information available on the web can be
indexed (difficulty finding un-indexed information that is relevant).
Clustering is one of the Data Mining techniques for improving
the efficiency of the information finding process.
Clustering algorithms can be broadly categorized into
partitional techniques (e.g. K-Means) and hierarchical techniques
(e.g. AHC algorithms).
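The low precision and low recall problems on this slide can be made concrete with the standard information-retrieval definitions. The sketch below is illustrative only: the function name `precision_recall` and the document IDs are hypothetical and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four results returned, three documents actually relevant.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(p, r)
```

Here document "e" stands for a relevant page the search tool never returned, for instance because it was not indexed, which is what lowers recall.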

Clustering
Many traditional clustering algorithms assign each document to a single
cluster, which makes it difficult for the user to retrieve information.
On this basis, clustering algorithms can be divided into
Hard clustering algorithms: traditional clustering algorithms, where each
object belongs to exactly one cluster.
Soft clustering algorithms: each object can belong to multiple clusters.

Conventional clustering algorithms have difficulty handling
collections of natural data, which are often vague and uncertain.
Fuzzy sets make it possible to model imprecise and qualitative knowledge
and to handle uncertainty at various stages.
A fuzzy clustering method was therefore proposed to construct clusters with
uncertain boundaries, allowing one object to belong to
multiple clusters with some membership degree.
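To illustrate the difference between hard and soft assignment, here is a small sketch. The membership degrees, document names, cluster names, and the 0.5 threshold are all made up for illustration; the slides do not give concrete values.

```python
# Hypothetical membership degrees of three documents in two clusters.
fuzzy_membership = {
    "doc1": {"sports": 0.9, "politics": 0.1},
    "doc2": {"sports": 0.5, "politics": 0.6},
    "doc3": {"sports": 0.0, "politics": 1.0},
}


def harden(membership):
    """Hard view: each object is placed in exactly one (its best) cluster."""
    return {doc: max(degrees, key=degrees.get)
            for doc, degrees in membership.items()}


def soft_clusters(membership, threshold):
    """Soft view: an object joins every cluster whose degree reaches threshold."""
    return {doc: sorted(c for c, d in degrees.items() if d >= threshold)
            for doc, degrees in membership.items()}


print(harden(fuzzy_membership))
print(soft_clusters(fuzzy_membership, 0.5))
```

In the hard view `doc2` is only reachable through one cluster, whereas the soft view keeps it in both, which matches the motivation given above.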

Proposed Work

A clustering method based on fuzzy equivalence relations is
proposed for web document clustering.
Steps:
1. The crawler downloads the documents and the keywords
contained in them and stores them in a database.
2. The indexer extracts all words from the entire set of documents,
eliminates stop words from each document, and stores the remaining
words in the indexed database.
3. A list of common words, called keywords, is generated.
4. Each keyword is assigned a Keyword ID.
5. The proposed fuzzy clustering method based on fuzzy equivalence
relations is applied to the indexed database.
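Steps 2 to 4 above can be sketched roughly as follows. This is a toy illustration under several assumptions not stated in the slides: the stop-word list is a tiny hand-picked set, "common words" is interpreted as words occurring in at least two documents, and each document is represented by its keyword counts. The name `index_documents` is hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list (a real indexer would use a larger one).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def index_documents(docs):
    """Return (keyword_ids, vectors) for a dict of {doc_id: text}."""
    # Step 2: extract words and eliminate stop words.
    tokenized = {}
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        tokenized[doc_id] = [w for w in words if w not in STOP_WORDS]
    # Step 3 (assumption): keywords are words found in two or more documents.
    doc_freq = Counter(w for words in tokenized.values() for w in set(words))
    keywords = sorted(w for w, n in doc_freq.items() if n >= 2)
    # Step 4: assign each keyword an ID, starting at 1.
    keyword_ids = {w: i for i, w in enumerate(keywords, start=1)}
    # Keyword-count vector per document, ordered by keyword ID.
    vectors = {d: [words.count(k) for k in keywords]
               for d, words in tokenized.items()}
    return keyword_ids, vectors


docs = {
    "d1": "Fuzzy clustering of the web",
    "d2": "Web mining and fuzzy sets",
    "d3": "Mining the web",
}
keyword_ids, vectors = index_documents(docs)
print(keyword_ids)
print(vectors)
```

The resulting vectors are the kind of numeric data points on which a distance-based relation can then be computed.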


Proposed Work
A meaningful fuzzy equivalence relation usually cannot be identified directly from the data, so:
First, determine a fuzzy compatibility relation (reflexive and symmetric)
in terms of an appropriate distance function.
Then define a meaningful fuzzy equivalence relation as the
transitive closure of this fuzzy compatibility relation.
Let a fuzzy compatibility relation R on X be defined in terms of an
appropriate distance function of the Minkowski class by the
formula

$$R(x_i, x_k) = 1 - \delta \left( \sum_{j=1}^{p} |x_{ij} - x_{kj}|^{q} \right)^{1/q}$$

for all pairs (x_i, x_k) ∈ X, where q ∈ R^+ and δ is a
constant that ensures that R(x_i, x_k) ∈ [0, 1]; δ is the inverse
value of the largest distance in X.
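A minimal sketch of the compatibility relation defined above, assuming each data point is a tuple of p numbers and taking δ as the inverse of the largest pairwise distance. The function name `compatibility_relation` and the three sample points are illustrative, not part of the original material.

```python
def compatibility_relation(points, q=2.0):
    """R(x_i, x_k) = 1 - delta * Minkowski_q distance between x_i and x_k."""
    n = len(points)

    def dist(a, b):
        return sum(abs(u - v) ** q for u, v in zip(a, b)) ** (1.0 / q)

    d = [[dist(points[i], points[k]) for k in range(n)] for i in range(n)]
    largest = max(max(row) for row in d)
    # delta is the inverse of the largest distance, so R stays within [0, 1].
    delta = 1.0 / largest if largest > 0 else 0.0
    return [[1.0 - delta * d[i][k] for k in range(n)] for i in range(n)]


# Illustrative points: pairwise Euclidean distances are 5, 10 and 5.
R = compatibility_relation([(0, 0), (3, 4), (6, 8)], q=2)
for row in R:
    print([round(v, 2) for v in row])
```

The matrix is reflexive (ones on the diagonal) and symmetric by construction, but nothing guarantees transitivity.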
Proposed Work
In general, R defined by the equation is a fuzzy compatibility
relation, but not necessarily a fuzzy equivalence relation.
Hence, there is a need to determine the transitive closure of R.
Given a relation R(X,X), its transitive closure R_T(X,X) can be
determined by a simple algorithm that consists of the following
three steps:
1. R' = R ∪ (R ∘ R)
2. If R' ≠ R, make R = R' and go to step 1.
3. Stop: R' = R_T.
After applying this algorithm, a hierarchical cluster tree
is generated. Each cluster contains similar documents,
which helps to find related documents efficiently in terms of
time and relevancy.
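A minimal sketch of one common way to compute the transitive closure: repeatedly replace R by R ∪ (R ∘ R) until nothing changes. It assumes max-min composition for ∘ and the elementwise maximum for ∪, which are the usual choices for fuzzy relations but are assumed here rather than stated on the slide. The function names and the 3x3 sample relation are illustrative.

```python
def max_min_compose(a, b):
    """Max-min composition of two square fuzzy relations."""
    n = len(a)
    return [[max(min(a[i][j], b[j][k]) for j in range(n)) for k in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Iterate R <- R union (R o R) until a fixed point is reached."""
    n = len(r)
    while True:
        comp = max_min_compose(r, r)
        r_next = [[max(r[i][k], comp[i][k]) for k in range(n)] for i in range(n)]
        if r_next == r:
            return r
        r = r_next


# Illustrative compatibility relation (reflexive, symmetric, not transitive).
R = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
print(transitive_closure(R))
```

In this sample, the degree between the first and third element rises from 0.2 to 0.5 because they are linked through the second element with strength min(0.8, 0.5). The loop terminates because entries only ever increase and can only take values already present in R.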
Example
To illustrate the method based on fuzzy equivalence relations,
consider the following web document data X:

k      1  2  3  4  5
x_k1   0  1  2  3  4
x_k2   0  1  3  1  0

For q = 2, which corresponds to the Euclidean distance,
the largest Euclidean distance between any pair of the given data
points is 4, so δ = 1/4 = 0.25.
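As a worked check of the arithmetic, two membership grades can be computed by hand from the data and δ = 0.25, using the Minkowski formula with q = 2. The pairs chosen here are just examples; values are rounded to two decimals.

```latex
\begin{align*}
R(x_1, x_2) &= 1 - 0.25\,\sqrt{(0-1)^2 + (0-1)^2}
             = 1 - 0.25\,\sqrt{2} \approx 0.65 \\
R(x_1, x_5) &= 1 - 0.25\,\sqrt{(0-4)^2 + (0-0)^2}
             = 1 - 0.25 \cdot 4 = 0
\end{align*}
```

The pair (x_1, x_5) realizes the largest distance in X, which is why its membership grade is exactly 0.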
Example
Calculate the membership grades of R from the equation (represented as
a matrix), and then the transitive closure R_T:

$$R = \begin{pmatrix}
1 & 0.65 & 0.1 & 0.21 & 0 \\
0.65 & 1 & 0.44 & 0.5 & 0.21 \\
0.1 & 0.44 & 1 & 0.44 & 0.1 \\
0.21 & 0.5 & 0.44 & 1 & 0.65 \\
0 & 0.21 & 0.1 & 0.65 & 1
\end{pmatrix}$$

$$R_T = \begin{pmatrix}
1 & 0.65 & 0.44 & 0.5 & 0.5 \\
0.65 & 1 & 0.44 & 0.5 & 0.5 \\
0.44 & 0.44 & 1 & 0.44 & 0.44 \\
0.5 & 0.5 & 0.44 & 1 & 0.65 \\
0.5 & 0.5 & 0.44 & 0.65 & 1
\end{pmatrix}$$

This relation includes four distinct partitions of its α-cuts:
α ∈ [0, 0.44]: {{x_1, x_2, x_3, x_4, x_5}}
α ∈ (0.44, 0.5]: {{x_1, x_2, x_4, x_5}, {x_3}}
α ∈ (0.5, 0.65]: {{x_1, x_2}, {x_3}, {x_4, x_5}}
α ∈ (0.65, 1]: {{x_1}, {x_2}, {x_3}, {x_4}, {x_5}}
Repeat the analysis for q = 1, which represents the Hamming distance.
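The partitions for q = 2 can be reproduced from the transitive closure with a few lines of code. This is a sketch under the assumption that the input is already a fuzzy equivalence relation, so that every α-cut is a crisp equivalence relation and the block of x_i is simply every x_k with R_T(x_i, x_k) ≥ α. The function name `alpha_cut_partition` is illustrative; the matrix values are the rounded R_T entries from this example.

```python
# Transitive closure R_T from the example (rounded to two decimals).
RT = [
    [1, 0.65, 0.44, 0.5, 0.5],
    [0.65, 1, 0.44, 0.5, 0.5],
    [0.44, 0.44, 1, 0.44, 0.44],
    [0.5, 0.5, 0.44, 1, 0.65],
    [0.5, 0.5, 0.44, 0.65, 1],
]


def alpha_cut_partition(rt, alpha):
    """Partition indices 1..n into the equivalence classes of the alpha-cut."""
    n = len(rt)
    blocks, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        # All elements related to x_i with degree at least alpha.
        block = [k for k in range(n) if rt[i][k] >= alpha]
        assigned.update(block)
        blocks.append([k + 1 for k in block])  # report 1-based indices
    return blocks


for alpha in (0.44, 0.5, 0.65, 1.0):
    print(alpha, alpha_cut_partition(RT, alpha))
```

Reading the partitions from the lowest to the highest α gives successively finer clusterings, which is the hierarchy that a dendrogram depicts.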

Results and Snapshots

The result agrees with our visual perception of geometric
clusters in the data.
The dendrogram is a graphical representation of the results
of hierarchical cluster analysis.
a tree-like plot where each step of hierarchical clustering

Conclusions and future research
Web data has fuzzy characteristics, so fuzzy clustering is
sometimes better suited to Web Mining than
conventional clustering.

The proposed technique for document retrieval on the web,
based on a fuzzy logic approach, improves the relevancy factor.

This technique keeps related documents in the same
cluster, so that searching for documents becomes more
efficient in terms of time complexity.
