Web Page Enrichment Using A Rough Set Based Method

Bacharaju Vishnu Swathi
Abstract—When documents are matched to a given query, the terms in the query are usually matched against the words in the documents to calculate similarity. It is better, however, if each document is represented in an enriched manner, with not only the actual words occurring in the document but also the synonyms of its important words. This can improve the recall of the system. With its ability to deal with vagueness and fuzziness, the tolerance rough set model seems to be a promising tool for modelling relations between terms and documents. In many information retrieval problems, especially in text classification, determining the term-term and term-document relations is essential. In this work, the application of TRSM to web page classification was evaluated to determine its effectiveness as a way to enrich a web page.

Keywords-web page, enrichment, classification, rough sets, text mining
IJFRCSCE | December 2017, Available @ http://www.ijfrcsce.org
International Journal on Future Revolution in Computer Science & Communication Engineering, ISSN: 2454-4248, Volume 3, Issue 12, pp. 281–285
and transitive (xRy ∧ yRz ⇒ xRz for any x, y, z ∈ U), the transitive property does not always hold in certain application domains, particularly in natural language processing and information retrieval. This remark can be illustrated by considering words from Roget's thesaurus, where each word is associated with a class of other words that have similar meanings. Figure 1 shows the associated classes of three words: root, cause, and basis. It is clear that these classes are not disjoint (equivalence classes) but overlapping, and that the meaning of the words is not transitive.

Overlapping classes can be generated by tolerance relations, which require only the reflexive and symmetric properties. A general approximation model using tolerance relations was introduced in [4], in which the generalized spaces, containing overlapping classes of objects in the universe (tolerance classes), are called tolerance spaces. In [4], a tolerance space is formally defined as a quadruple R = (U, I, ν, P), where U is a universe of objects, I : U → 2^U is an uncertainty function, ν : 2^U × 2^U → [0, 1] is a vague inclusion, and P : I(U) → {0, 1} is a structurality function.

Figure 1. Overlapping classes of words

Assume that an object x is perceived by the information Inf(x) about it. The uncertainty function I : U → 2^U determines I(x) as a tolerance class of all objects that are considered to have information similar to x. This uncertainty function can be any function satisfying the conditions x ∈ I(x) and y ∈ I(x) iff x ∈ I(y) for any x, y ∈ U. Such a function corresponds to a relation Ʈ ⊆ U × U understood as x Ʈ y iff y ∈ I(x); Ʈ is a tolerance relation because it satisfies the properties of reflexivity and symmetry. The vague inclusion ν : 2^U × 2^U → [0, 1] measures the degree of inclusion of sets; in particular, it relates to the question of whether the tolerance class I(x) of an object x ∈ U is included in a set X. The only requirement on ν is monotonicity with respect to its second argument, that is, ν(X, Y) ≤ ν(X, Z) for any X, Y, Z ⊆ U with Y ⊆ Z. Finally, the structurality function is introduced by analogy with mathematical morphology [5]. In the construction of the lower and upper approximations, only tolerance sets that are structural elements are considered. The structurality function P : I(U) → {0, 1} classifies I(x) for each x ∈ U into two classes: structural subsets (P(I(x)) = 1) and non-structural subsets (P(I(x)) = 0).

The lower approximation L(R, X) and the upper approximation U(R, X) in R of any X ⊆ U are defined as

    L_R(X) = {x ∈ U | P(I(x)) = 1 ∧ ν(I(x), X) = 1}
    U_R(X) = {x ∈ U | P(I(x)) = 1 ∧ ν(I(x), X) > 0}

The basic problem in using tolerance spaces in any application is the suitable determination of I, ν and P.

A. Determination of Tolerance Spaces

Let P = {p1, p2, ..., pN} be a set of documents and T = {t1, t2, ..., tM} the set of index terms for P. With the adoption of the VSM, each page pi is represented by a weight vector [wi1, wi2, ..., wiM], where wij denotes the weight of term tj in page pi. In TRSM, the tolerance space is defined over the universe of all index terms, U = T = {t1, t2, ..., tM}.

The idea is to capture conceptually related index terms into classes. There are several ways to identify conceptually related index terms, for example, human experts, a thesaurus, or term co-occurrence. In this work, term co-occurrence is employed to determine the tolerance relation and the tolerance classes. The co-occurrence of index terms is chosen for the following reasons: (i) it gives a meaningful interpretation of the dependency and semantic relation of index terms [6]; and (ii) it is relatively simple and computationally efficient.

B. Tolerance Class of a Term

Let f_P(ti, tj) denote the number of web pages in P in which both terms ti and tj occur. The uncertainty function I with regard to a threshold θ is defined as

    I(ti) = {tj | f_P(ti, tj) ≥ θ} ∪ {ti}

Clearly, the above function satisfies the conditions of being reflexive, ti ∈ I(ti), and symmetric, tj ∈ I(ti) ⟺ ti ∈ I(tj), for any ti, tj ∈ T. Thus, the tolerance relation Ʈ ⊆ T × T can be defined by means of the function I: ti Ʈ tj ⟺ tj ∈ I(ti), where I(ti) is the tolerance class of index term ti. In the context of information retrieval, a tolerance class represents a concept characterized by the terms it contains. By varying the threshold θ (e.g., relative to the size of the web page collection), one can control the degree of relatedness of the words in the tolerance classes (in other words, the preciseness of the concept represented by a tolerance class). To measure the degree of inclusion of one set in another, the vague inclusion function is defined as

    ν(X, Y) = |X ∩ Y| / |X|

It is clear that this function is monotonous with respect to the second argument. The membership function μ for ti ∈ T and X ⊆ T is then defined as μ(ti, X) = ν(I(ti), X).

Table 1. Term co-occurrence matrix

        t1   t2   t3   t4   t5   t6
t1      -1    3    0    1    0    0
t2       3   -1    0    2    0    1
t3       0    0   -1    2    0    0
t4       1    2    2   -1    2    1
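The tolerance-class construction above is easy to sketch in code. The following is an illustrative sketch, not code from the paper: the `cooccurrence`, `tolerance_class` and `vague_inclusion` names are my own, and the co-occurrence counts are supplied directly as a dictionary matching the nonzero entries of Table 1.

```python
from itertools import combinations

def cooccurrence(pages):
    """f_P(ti, tj): the number of pages in which both terms occur."""
    f = {}
    for terms in pages.values():
        for a, b in combinations(sorted(set(terms)), 2):
            f[(a, b)] = f.get((a, b), 0) + 1
    return f

def tolerance_class(term, vocab, f, theta):
    """I(t) = {u | f_P(t, u) >= theta} united with {t}."""
    cls = {term}
    for u in vocab:
        if u != term and f.get(tuple(sorted((term, u))), 0) >= theta:
            cls.add(u)
    return cls

def vague_inclusion(x, y):
    """v(X, Y) = |X intersect Y| / |X|."""
    return len(x & y) / len(x) if x else 0.0

# Nonzero co-occurrence counts of Table 1 (upper triangle only).
f = {("t1", "t2"): 3, ("t1", "t4"): 1, ("t2", "t4"): 2,
     ("t2", "t6"): 1, ("t3", "t4"): 2, ("t4", "t5"): 2, ("t4", "t6"): 1}
vocab = {"t1", "t2", "t3", "t4", "t5", "t6"}

print(sorted(tolerance_class("t4", vocab, f, theta=2)))
# ['t2', 't3', 't4', 't5']
```

With θ = 2 the class of t4 collects every term co-occurring with it on at least two pages, plus t4 itself; raising θ shrinks the class and sharpens the concept it represents.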
Similarly, the vague inclusion ν(I(ti), p1) of all terms in page p1 is calculated, and all terms whose vague inclusion is greater than zero are added to the upper approximation of the page p1. The upper approximation for page p1 is U_R(p1) = {t1, t2}.

In Table 2, web page p1 contains the term t1, which is "architect", and its upper approximation contains the terms t1 and t2, that is, "architect" and "build", which are conceptually related. That is, the upper approximation of a web page contains the terms appearing in the page together with the conceptually related terms of those terms. The upper approximation of the page p3 contains the terms t3, t4, t8, t9 and t11, which are biographi, inform, Ceram, potteri and air respectively. These terms are not semantically related, yet they appear in the upper approximation. This is due to an inappropriate value chosen for the threshold θ, so care must be taken when choosing the threshold value. In the example, if θ = 13, then the upper approximation of page p3 contains the terms t8 and t9, which are semantically related.

Figure 2. Web page representation before applying TRSM
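The upper approximation of p1 can be reproduced in a few lines. This sketch makes two assumptions not stated in the paper for this example: the tolerance classes are those induced by Table 1 with θ = 2 (a value I chose because it reproduces U_R(p1) = {t1, t2}), and the structurality function takes P(I(t)) = 1 for every term.

```python
# Tolerance classes induced by Table 1 with theta = 2 (assumed value;
# the paper does not state which theta its example uses).
I = {"t1": {"t1", "t2"},
     "t2": {"t1", "t2", "t4"},
     "t3": {"t3", "t4"},
     "t4": {"t2", "t3", "t4", "t5"},
     "t5": {"t4", "t5"},
     "t6": {"t6"}}

def upper_approximation(page_terms, I):
    """U_R(p) = {t | v(I(t), p) > 0}: every term whose tolerance class
    shares at least one term with the page (P(I(t)) = 1 assumed)."""
    page = set(page_terms)
    return {t for t, cls in I.items() if cls & page}

print(sorted(upper_approximation({"t1"}, I)))
# ['t1', 't2']
```

Because ν(I(t), p) = |I(t) ∩ p| / |I(t)| is positive exactly when the intersection is nonempty, the membership test reduces to a set-overlap check.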
Table 2. Terms and upper approximations of the web pages

Page   Terms          Upper approximation
p1     t1             t1, t2
p2     t1             t1, t2
p3     t3, t4         t3, t4, t8, t9, t11
p4     t1, t2, t4     t1, t2, t4, t8, t9, t11
p5     t11            t4, t8, t9, t11
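Producing a table like Table 2 amounts to replacing each page's term set with its upper approximation. The sketch below assumes the same hypothetical tolerance classes as before (Table 1 with θ = 2) and a plain set representation; the weights the paper's VSM representation would assign to the added terms are not reproduced here.

```python
# Hypothetical tolerance classes (Table 1, theta = 2 assumed).
I = {"t1": {"t1", "t2"}, "t2": {"t1", "t2", "t4"}, "t3": {"t3", "t4"},
     "t4": {"t2", "t3", "t4", "t5"}, "t5": {"t4", "t5"}, "t6": {"t6"}}

def enrich(pages, I):
    """Map each page to its upper approximation: its own terms plus
    every term whose tolerance class overlaps the page."""
    return {name: {t for t, cls in I.items() if cls & set(terms)}
            for name, terms in pages.items()}

pages = {"p1": {"t1"}, "p2": {"t3", "t4"}}
for name, terms in sorted(enrich(pages, I).items()):
    print(name, sorted(terms))
# p1 ['t1', 't2']
# p2 ['t2', 't3', 't4', 't5']
```

Since t ∈ I(t) for every term, the enriched set always contains the page's original terms, so enrichment only ever adds terms.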
[Figure: classification accuracy (%) for Non-TRSM and TRSM web page representations]
V. CONCLUSION
In general, the Vector Space Model (VSM) is used for web page representation. This model does not capture conceptually related words. In this work, the web page representation is enriched by adding conceptually related terms to the web page. The Tolerance Rough Set Model, an extension of traditional rough set theory, was used to enrich the web page representation. Since enriching the original web pages further increases the dimensionality, TRSM is applied after applying feature selection (FS) to the web pages. The TRSM-based web page representation increased the classification accuracy obtained after applying feature selection by 2-3%.
REFERENCES
[1] S. Kawasaki, N. B. Nguyen and T. B. Ho. Hierarchical document clustering based on tolerance rough set model. In Principles of Data Mining and Knowledge Discovery, 4th European Conference, PKDD 2000.
[2] T. B. Ho and N. B. Nguyen. Nonhierarchical document clustering based on a tolerance rough set model. International Journal of Intelligent Systems 17, 2 (2002), 199-212.
[3] T. Tsukada, M. Washio and H. Matoda. Automatic web-page classification by using machine learning methods. Proceedings of the First Asia-Pacific Conference on Web Intelligence (WI 2001), LNAI 2198:303–313, 2001.
[4] T. Wakaki, H. Itakura and M. Tamura. Rough set-aided feature selection for automatic web-page classification. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 2004.
[5] G. Salton, A. Wong and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, Vol. 18, No. 11, pp. 613–620, 1975.