Web Content Outlier Mining


4/26/12

DETECTION AND REMOVAL OF REDUNDANT WEB CONTENT THROUGH RECTANGULAR AND SIGNED APPROACH

Presented by
Pramod.N, Reg no. 091718, V Sem M.C.A, A.I.M.I.T


Topics to be Covered

Abstract
Introduction
Related Works
Need for the Algorithm
Rectangular and signed approach
Algorithm Review
Conclusion
References

Abstract

Today, the Internet marks the era of the information revolution, and people rely on search engines to get relevant information without replicas. As duplicated web pages increase, indexing space and time complexity also increase. Finding and removing these pages is significant for search engines and similar systems. Web content outlier mining plays a vital role in covering all these aspects. Existing algorithms for web content outlier mining focus on applying weightage to structured documents. In this research work, by contrast, a mathematical approach based on signed and rectangular representation is developed to detect and remove redundancy between unstructured web documents as well. This method optimizes the indexing of web documents and improves the quality of search engines.

Introduction

Due to the voluminous amount of information available on the web, most people prefer to perform their tasks over the internet. Many web documents have redundant and irrelevant content. Web content outlier mining concentrates on finding outliers such as noise, irrelevant pages, and redundant pages among web documents. Generally, outliers are data that obviously deviate from the rest, disobey the general mode or behavior of the data, and disaccord with other existing data.

Web Mining

In general, web mining tasks can be classified into three major categories:
I. Web structure mining
II. Web usage mining
III. Web content mining

Web structure mining tries to discover useful knowledge from the structure of hyperlinks. Web usage mining refers to the discovery of user access patterns from web usage logs. Web content mining aims to extract/mine useful information from web pages based on their contents.


Related works

G. Poonkuzhali suggested a set-theoretical approach for detecting and eliminating redundant links in web documents. Giuseppe Antonio Di Lucca proposed an algorithm based on clone detection and similarity metrics to detect duplicate pages in web sites and applications implemented with HTML, which works only for structured web documents.


Contd

Yunhe Weng came up with the idea of an improved COPS (copy detection) scheme, which aims to protect the intellectual property of the document owner by detecting overlap among documents. This method performs similarity computation only for pages that are relevant to the suspicious pages.


Contd

Zhongming Han developed a novel multilayer framework for detecting duplicated web pages through two similar-text-paragraph detection algorithms based on edit distance and the bootstrap method. This method achieves high performance in detecting duplicates efficiently, simply by tag statistics and text comparison, yet it cannot find duplicates among multiple web pages.


Contd

All the above works on web content mining lack simplicity of concept and computation. These issues motivate a novel approach, based on mathematics through signed and rectangular representation, to detect and remove redundant web documents with less time and space complexity. Apart from the above benefits, there is a need for an algorithm that works well for both unstructured and structured data.


Need for the Algorithm

Existing web mining algorithms do not consider documents having varying contents within the same category, called web content outliers. Most of the time we get different web documents with the same contents. This algorithm focuses on the detection and removal of noise and redundancy, which constitutes outlier mining.


Design of Proposed System


Rectangular and signed approach

In this framework, web documents are extracted from search engines based on the query given by the user. The obtained web documents are then preprocessed: stop words are removed, the remaining words are stemmed, and all non-text data such as hyperlinks, sound, and images are discarded. Then the number of documents extracted from the web is counted.
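The preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list and the suffix-stripping stemmer (`crude_stem`) are simplified stand-ins for a real stop-word list and stemmer.

```python
import re

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def crude_stem(word):
    """Naive suffix stripping as a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Strip markup, drop stop words, and stem the remaining tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags (links, images, etc.)
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("<a href='x'>The engines are indexing pages</a>")` strips the tag and stop words and stems the rest.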


Contd

Next, an n x m matrix representation is generated for all the extracted documents based on a 4-tuple, namely:
I. Number of pages
II. Paragraphs
III. Lines
IV. Word occurrences

Then all the elements of the 4-tuples taken from the n x m matrices of the first two documents are compared, and the outcome is stored using the signed approach.


Contd

Finally, redundancy computation is done based on the results of the similarity computation. Every element of Di is addressed by a 4-tuple. For example, the 4-tuple (3,2,5,8) refers to the 8th word in the 5th line of the 2nd paragraph on the 3rd page. The n x m matrix representation enables easy retrieval and searching of web content.
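One way to realize this 4-tuple addressing is a dictionary keyed by (page, paragraph, line, word) positions. This sketch assumes each document has already been split into nested lists of pages, paragraphs, lines, and words; the function name `build_tuple_index` is illustrative, not from the paper.

```python
def build_tuple_index(pages):
    """Map (page, paragraph, line, word) positions to words.

    `pages` is a list of pages; each page is a list of paragraphs;
    each paragraph is a list of lines; each line is a list of words.
    Indices are 1-based, matching the 4-tuple convention in the slides.
    """
    index = {}
    for p, page in enumerate(pages, start=1):
        for q, paragraph in enumerate(page, start=1):
            for r, line in enumerate(paragraph, start=1):
                for s, word in enumerate(line, start=1):
                    index[(p, q, r, s)] = word
    return index
```

With this layout, the word at tuple (3,2,5,8) is simply `index[(3, 2, 5, 8)]`, which is what makes retrieval and comparison straightforward.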


Redundancy Computation (RC) Algorithm


Input: User query q
Method: Rectangular representation and signed approach
Output: Web documents without redundancy

Step 1: Extract input web documents Di based on the user query, where 1 ≤ i ≤ N.
Step 2: Preprocess all the extracted documents.
Step 3: Calculate the maximum number of pages p, paragraphs q, lines r, and words s in any of the extracted web documents.
Step 4: Generate an n x m matrix for all extracted web documents with 4-tuples (k, l, m, n), where 1 ≤ k ≤ p, 1 ≤ l ≤ q, 1 ≤ m ≤ r, and 1 ≤ n ≤ s.

Contd

Step 5: Initialize i = 1.
Step 7: Assign j = i + 1.
Step 8: Initialize PC = 0 and NC = 0 (PC = positive count, NC = negative count).
Step 9: Take the first element, the 4-tuple (k, l, m, n), from Di and Dj and perform string comparison.
Step 10: If they are similar, update PC = PC + 1; else NC = NC + 1.
Step 11: Repeat steps 9 and 10 for all elements of the 4-tuples up to (p, q, r, s) taken from Di and Dj.


Contd

Step 12: If PC > NC, then Di and Dj are redundant; remove Dj from the set of documents. Else Di and Dj are not redundant.

Step 13: Increment j and repeat steps 8 to 12 until j exceeds N.

Step 14: At the termination of step 13, redundancy with the first document is eliminated.
Step 15: Increment i and repeat steps 7 to 13 while i < N.
Step 16: Finally, the mined web content without redundancy is obtained.
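The steps above can be condensed into a short sketch. This is one illustrative reading of the algorithm, under two assumptions not fixed by the slides: each preprocessed document is a flat list of words (a traversal of its 4-tuples in order), and a strict majority of matching positions (PC > NC) marks redundancy.

```python
def is_redundant(doc_i, doc_j):
    """Signed comparison: count matching positions (PC) vs. mismatches (NC)."""
    pc = nc = 0  # positive and negative counts (Steps 8-11)
    length = max(len(doc_i), len(doc_j))
    for k in range(length):
        wi = doc_i[k] if k < len(doc_i) else None
        wj = doc_j[k] if k < len(doc_j) else None
        if wi is not None and wi == wj:
            pc += 1
        else:
            nc += 1
    return pc > nc  # majority of positions agree => redundant (Step 12)

def remove_redundant(docs):
    """Keep each document only if no previously kept document makes it redundant."""
    kept = []
    for doc in docs:  # outer/inner loops of Steps 5-15
        if not any(is_redundant(k, doc) for k in kept):
            kept.append(doc)
    return kept
```

For example, with documents `["a","b","c"]`, `["a","b","d"]`, and `["x","y","z"]`, the second is dropped as redundant with the first (two matches vs. one mismatch), while the third survives.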


Algorithm Review


Conclusion

Experimental results show that memory space, search time, and run time are reduced by using the rectangular representation and signed approach. As the efficiency of web content mining increases, the quality of search engines also increases. This method is very simple to implement, and the algorithm works well for both unstructured and structured data.


References

http://www.ijest.info/docs/IJEST10-02-09-1
http://www.waset.org/journals/waset/v56/
http://www.wseas.us/e-library/conference
http://www.libsearch.com/view/1323898


Thank You

