Web Content Outlier Mining


4/26/12

DETECTION AND REMOVAL OF REDUNDANT WEB CONTENT THROUGH RECTANGULAR AND SIGNED APPROACH

Presented by
Pramod.N, Reg no. 091718, V Sem M.C.A, A.I.M.I.T


Topics to be Covered

Abstract
Introduction
Related Works
Need for the Algorithm
Rectangular and signed approach
Algorithm Review
Conclusion
References

Abstract

Today, the Internet marks the era of the information revolution, and people rely on search engines to get relevant information without replicas. As duplicated web pages increase, indexing space and time complexity also increase. Finding and removing these pages is significant for search engines and similar systems. Web content outlier mining plays a vital role in covering all these aspects. Existing algorithms for web content outlier mining focus on applying weightage to structured documents. In this research work, by contrast, a mathematical approach based on signed and rectangular representation is developed to detect and remove redundancy between unstructured web documents as well. This method optimizes the indexing of web documents and improves the quality of search engines.

Introduction

Due to the voluminous amount of information available on the web, most people prefer to perform their tasks over the internet. Many web documents have redundant and irrelevant content. Web content outlier mining concentrates on finding outliers such as noise, irrelevant pages, and redundant pages among web documents. Generally, outliers are data that obviously deviate from the rest, disobey the general mode or behavior of the data, and disaccord with other existing data.

Web Mining

In general, web mining tasks can be classified into three major categories:
I. Web structure mining
II. Web usage mining
III. Web content mining

Web structure mining tries to discover useful knowledge from the structure of hyperlinks. Web usage mining refers to the discovery of user access patterns from web usage logs. Web content mining aims to extract/mine useful information from web pages based on their contents.


Related works

G. Poonkuzhali suggested a set-theoretical approach for detecting and eliminating redundant links in web documents. Giuseppe Antonio Di Lucca proposed an algorithm based on clone detection and similarity metrics to detect duplicate pages in web sites and applications implemented with HTML, which works only for structured web documents.


Contd

Yunhe Weng came up with the idea of an improved COPS (copy detection) scheme, which aims to protect the intellectual property of the document owner by detecting overlap among documents. This method performs similarity computation only for pages that are relevant to the suspicious pages.


Contd

Zhongming Han developed a novel multilayer framework for detecting duplicated web pages through two similar-text-paragraph detection algorithms based on edit distance and the bootstrap method. This method achieves high performance in detecting duplicates efficiently, simply by tag statistics and text comparison, yet it cannot find duplicates among multiple web pages.


Contd

All the above works on web content mining lack simplicity of concept and computation. These issues motivate a novel approach, based on mathematics through signed and rectangular representation, to detect and remove redundant web documents with less time and space complexity. Apart from the above benefits, there is a need for an algorithm that works well for both unstructured and structured data.


Need for the Algorithm

Existing web mining algorithms do not consider documents having varying contents within the same category, called web content outliers. Most of the time we get different web documents with the same contents. This algorithm focuses on the detection and removal of noise and redundancy, which constitutes outlier mining.


Design of Proposed System


Rectangular and signed approach

In this framework, web documents are extracted from search engines based on the query given by the user. The obtained web documents are then preprocessed: stop words are removed, the remaining words are stemmed, and all non-text data such as hyperlinks, sound, and images are discarded. Then the number of documents extracted from the web is counted.
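The preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list and the suffix-stripping stemmer (`crude_stem`) are simplified stand-ins for a real stop-word list and stemmer.

```python
import re

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def crude_stem(word):
    """Naive suffix stripping as a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Strip markup, drop stop words, and stem the remaining tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML tags (links, images, etc.)
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("<a href='x'>The engines are indexing pages</a>")` strips the tag and stop words and stems the rest.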


Contd

Next, an n x m matrix representation is generated for all the extracted documents based on a 4-tuple, namely:
I. Number of pages
II. Paragraphs
III. Lines
IV. Word occurrences

Then all the elements of the 4-tuples taken from the n x m matrices of the first two documents are compared, and the outcome is stored using the signed approach.


Contd

Finally, redundancy computation is done based on the results of the similarity computation. Every element of Di is addressed by a 4-tuple. For example, the 4-tuple (3,2,5,8) refers to the 8th word in the 5th line of the 2nd paragraph on the 3rd page. The n x m matrix representation enables easy retrieval and searching of web content.
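One way to realize this 4-tuple addressing is a dictionary keyed by (page, paragraph, line, word) positions. This sketch assumes each document has already been split into nested lists of pages, paragraphs, lines, and words; the function name `build_tuple_index` is illustrative, not from the paper.

```python
def build_tuple_index(pages):
    """Map (page, paragraph, line, word) positions to words.

    `pages` is a list of pages; each page is a list of paragraphs;
    each paragraph is a list of lines; each line is a list of words.
    Indices are 1-based, matching the 4-tuple convention in the slides.
    """
    index = {}
    for p, page in enumerate(pages, start=1):
        for q, paragraph in enumerate(page, start=1):
            for r, line in enumerate(paragraph, start=1):
                for s, word in enumerate(line, start=1):
                    index[(p, q, r, s)] = word
    return index
```

With this layout, the word at tuple (3,2,5,8) is simply `index[(3, 2, 5, 8)]`, which is what makes retrieval and comparison straightforward.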


Redundancy Computation (RC) Algorithm


Input: User query q
Method: Rectangular representation and signed approach
Output: Web documents without redundancy

Step 1: Extract input web documents Di based on the user query, where 1 ≤ i ≤ N.
Step 2: Preprocess all the extracted documents.
Step 3: Calculate the maximum number of pages p, paragraphs q, lines r, and words s in any of the extracted web documents.
Step 4: Generate an n x m matrix for all extracted web documents with 4-tuples (k, l, m, n), where 1 ≤ k ≤ p, 1 ≤ l ≤ q, 1 ≤ m ≤ r, and 1 ≤ n ≤ s.

Contd

Step 5: Initialize i = 1.
Step 7: Assign j = i + 1.
Step 8: Initialize PC = 0 and NC = 0 (PC = positive count, NC = negative count).
Step 9: Take the first element, the 4-tuple (k, l, m, n), from Di and Dj and perform string comparison.
Step 10: If they are similar, update PC = PC + 1; else NC = NC + 1.
Step 11: Repeat steps 9 and 10 for all elements of the 4-tuples up to (p, q, r, s) taken from Di and Dj.


Contd

Step 12: If PC > NC, then Di and Dj are redundant; remove Dj from the set of documents. Else Di and Dj are not redundant.

Step 13: Increment j and repeat steps 8 to 12 until j exceeds N.

Step 14: At the termination of step 13, redundancy with the first document is eliminated.
Step 15: Increment i and repeat steps 7 to 13 while i < N.
Step 16: Finally, the mined web content without redundancy is obtained.
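The steps above can be condensed into a short sketch. This is one illustrative reading of the algorithm, under two assumptions not fixed by the slides: each preprocessed document is a flat list of words (a traversal of its 4-tuples in order), and a strict majority of matching positions (PC > NC) marks redundancy.

```python
def is_redundant(doc_i, doc_j):
    """Signed comparison: count matching positions (PC) vs. mismatches (NC)."""
    pc = nc = 0  # positive and negative counts (Steps 8-11)
    length = max(len(doc_i), len(doc_j))
    for k in range(length):
        wi = doc_i[k] if k < len(doc_i) else None
        wj = doc_j[k] if k < len(doc_j) else None
        if wi is not None and wi == wj:
            pc += 1
        else:
            nc += 1
    return pc > nc  # majority of positions agree => redundant (Step 12)

def remove_redundant(docs):
    """Keep each document only if no previously kept document makes it redundant."""
    kept = []
    for doc in docs:  # outer/inner loops of Steps 5-15
        if not any(is_redundant(k, doc) for k in kept):
            kept.append(doc)
    return kept
```

For example, with documents `["a","b","c"]`, `["a","b","d"]`, and `["x","y","z"]`, the second is dropped as redundant with the first (two matches vs. one mismatch), while the third survives.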


Algorithm Review


Conclusion

Experimental results show that memory space, search time, and run time are reduced by using the rectangular representation and signed approach. As the efficiency of web content mining increases, the quality of search engines also increases. This method is very simple to implement, and the algorithm works well for both unstructured and structured data.


References

http://www.ijest.info/docs/IJEST10-02-09-1
http://www.waset.org/journals/waset/v56/
http://www.wseas.us/e-library/conference
http://www.libsearch.com/view/1323898


Thank You

