Web Mining: by Saumil Shah Roll No: 46 Mca 4 Sem
Saumil Shah
Roll No : 46
MCA 4th sem
WEB MINING
Agenda
World Wide Web – a brief
history
Introduction to Data Mining
Data Mining Process &
Techniques
Web Mining
Data Mining Vs Web Mining
Classification of Web Mining
Benefits & Application Areas of
Web Mining
Web Mining Software
Summary
8/12/10
Data Mining vs. Web Mining
Traditional data mining
data is structured and relational
well-defined tables, columns, rows, keys, and
constraints.
Web data
Semi-structured (HTML documents) and
unstructured (free text)
Problems when interacting with the Web
Web Mining
Web Mining - Definition
» “Web mining refers to the overall process of discovering
potentially useful and previously unknown information or
knowledge from the Web data.”
Web Mining - Subtasks
Resource finding
Retrieving intended documents
Information selection/pre-processing
Select and pre-process specific information from selected
documents
Generalization
Discover general patterns at individual web sites as well as
across multiple web sites
Analysis
Validation and/or interpretation of mined patterns
Web Mining Contd..
Web Mining is not IR:
Information retrieval (IR) is the automatic retrieval of all
relevant documents while at the same time retrieving as few
of the non-relevant documents as possible
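The two goals in this definition of IR are usually measured as recall (how many of the relevant documents were retrieved) and precision (how few of the retrieved documents are non-relevant). A minimal sketch, assuming hypothetical document ids and a known set of relevant documents:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 documents are relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.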
Web Usage Mining
Web Content Mining
Web Content Mining extracts or mines useful information or
knowledge from web page contents.
In this mining, patterns are extracted from online sources
such as
HTML files
Text documents
Images
E-books or email messages
Audio or Video
The concept of WCM is far wider than searching for any specific
term or only keyword extraction or some simple statistics of words
and phrases in documents.
A tool that performs WCM can summarize a web page so that you
need not read the complete document and save your time and energy.
Web Content Mining
Contd..
The two basic approaches or models to implement WCM are
Local Knowledge base Model:
The abstract characterizations of several web pages
are stored locally (i.e., references to several web sites relating
to the categories are stored in a database, and based on the
selected category the search is performed within the
web site).
Agent Based Model:
This approach applies Artificial Intelligence
systems known as Web Agents that can perform a search on
behalf of a particular user for discovering and organizing
documents on the web. Some web agents can apply individual
user profiles for searching information from the web and
organize and interpret the discovered information.
Preprocessing Content
Content Preparation:
Extract text from HTML.
Perform Stemming.
Remove Stop Words.
Calculate Collection Wide Word Frequencies (DF).
Calculate per Document Term Frequencies (TF).
Vector Creation:
Common Information Retrieval Technique.
Each document (HTML page) is represented by a sparse
vector of term weights.
Typically, additional weight is given to terms appearing as
keywords or in titles.
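The preparation and vector creation steps above can be sketched as follows. This is a minimal illustration, not a production pipeline: the stop word list is a tiny sample, `crude_stem` is a hypothetical stand-in for a real stemmer (such as Porter's), the TF x log(N/DF) weighting is one common choice among several, and the extra weight for title terms is left out.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}  # tiny sample list


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def crude_stem(word):
    """Very rough suffix stripping (assumption: a real stemmer would be used instead)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(html):
    """Extract text from HTML, lowercase it, drop stop words, and stem."""
    parser = TextExtractor()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(pages):
    """Return one sparse vector (dict of term -> weight) per page."""
    docs = [Counter(tokenize(p)) for p in pages]      # per-document term frequencies (TF)
    df = Counter(term for d in docs for term in d)    # collection-wide document frequencies (DF)
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in d.items()} for d in docs]


pages = [
    "<html><title>Web mining</title><p>Mining the web logs</p></html>",
    "<p>Web pages and links</p>",
]
vectors = tfidf_vectors(pages)
```

A term that occurs in every page (here the stem "web") gets weight 0, which is the usual effect of the DF factor: it carries no information for telling the pages apart.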
Common Mining Techniques
The more basic and popular data mining techniques include:
Classification- can be applied to server logs, using decision trees
or a Naive Bayes classifier, to discover the profiles of users
belonging to a particular category.
Clustering- can be used to group users exhibiting similar
browsing patterns.
Associations- can be used to relate pages that are most often
referenced together in a single server session.
The other significant ideas are:
Topic Identification, tracking and drift analysis
Concept hierarchy creation
Relevance of content.
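As an illustration of the association idea above, the sketch below counts how often two pages are referenced in the same server session. The session data and page paths are hypothetical, and only pairwise support is computed; a full association rule miner (e.g. Apriori) would go further.

```python
from collections import Counter
from itertools import combinations


def page_pair_support(sessions):
    """Return {(page_a, page_b): fraction of sessions that contain both pages}."""
    counts = Counter()
    for pages in sessions:
        # Each unordered pair is counted at most once per session.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return {pair: c / len(sessions) for pair, c in counts.items()}


# Hypothetical sessions reconstructed from a server log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
support = page_pair_support(sessions)
```

Pairs with high support, such as `("/home", "/products")` in this toy data, are candidates for pages that are "most often referenced together".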
Web Structure Mining
Web Structure Mining discovers useful knowledge from
hyper links, which represent the structure of the web.
Web structure mining can be divided into two kinds:
Extract patterns from hyperlinks in the web. A hyperlink is
a structural component that connects the web page to a
different location.
Mining the document structure. It is using the tree-like
structure to analyze and describe the HTML or XML tags
within the web page.
The process of using the graph theory to analyze the node
and connection structure of a web site.
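To make the hyperlink view concrete, the following sketch extracts the links from each page and stores them as a graph (pages as nodes, hyperlinks as connections). It uses only the Python standard library; the page URLs and HTML snippets are hypothetical, and a real crawler would also need fetching, error handling, and URL normalization.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute targets of all <a href="..."> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))


def build_link_graph(pages):
    """pages maps URL -> HTML source; returns URL -> list of outgoing link targets."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = parser.links
    return graph


# Hypothetical two-page site.
pages = {
    "http://example.com/a": '<p><a href="/b">B</a> and <a href="c">C</a></p>',
    "http://example.com/b": '<a href="http://example.com/a">back to A</a>',
}
graph = build_link_graph(pages)
```

The resulting dictionary is the node-and-connection structure that graph-based methods operate on.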
Web Structure Mining
Contd..
Web Structure is a useful source for extracting information
such as
Web Page Classification
Classifying web pages according to various topics
Quality of Web Page
The authority of a page on a topic
Ranking of web pages
Which pages to crawl
Deciding which web pages to add to the collection of web
pages
Finding Related Pages
Given one relevant page, find all related pages
Web Structure Mining
Contd..
Hyperlink-Induced Topic Search (HITS) is a common
algorithm for knowledge discovery in the Web. The
concept of HITS is
Web Structure Mining
Identification of
Authorities: authoritative, high-quality web pages on broad
topics
Hubs: web pages that link to a collection of authorities
A good authority is pointed to by many good hubs
A good hub points to many good authorities
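The mutual reinforcement between hubs and authorities can be sketched as an iterative computation. This is a simplified illustration on a hypothetical four-page graph; the fixed iteration count and the Euclidean normalization are assumptions of the sketch, and the query-specific subgraph construction that HITS normally starts from is omitted.

```python
import math


def hits(graph, iterations=50):
    """graph maps page -> list of pages it links to; returns (hub, authority) scores."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A good authority is pointed to by many good hubs.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # A good hub points to many good authorities.
        hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
```

In this toy graph `a1` ends up with the highest authority score because both hubs point to it, and `h1` is the better hub because it points to both authorities.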
Applications Contd..
Amazon:
A host of Web mining techniques, e.g. associations between
pages visited, click-path analysis, etc., are used to improve the
customer’s experience during a ’store visit’. Knowledge gained
from Web mining is the key intelligence behind Amazon’s
features such as ’instant recommendations’, ’purchase circles’,
’wish-lists’, etc.
Applications Contd..
Google
Earlier search engines concentrated on the Web content to
return the relevant pages to a query. Google was the first to
introduce the importance of the link structure in mining the
information from the web. PageRank, which measures the
importance of a page, is the underlying technology in all
Google search products.
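The idea of ranking pages by link structure can be illustrated with the textbook power-iteration form of PageRank. This is a sketch of the published basic formulation, not Google's production system; the damping factor of 0.85, the iteration count, and the three-page graph are assumptions made for the example.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """graph maps page -> list of pages it links to; returns page -> rank (ranks sum to 1)."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            targets = graph.get(p, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for t in nodes:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical graph: a and b both link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = pagerank(graph)
```

Page `c` receives links from two pages and ranks highest; `a` ranks above `b` because it is linked from the highly ranked `c`, while nothing links to `b`.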
Benefits of Web Mining
Match your available resources to visitor interests
Web Mining Software
Web Miner:
Sinope Summarizer:
Teleport Pro:
Click Tracks
Summary
Major Limitations of Web Mining research:
Difficult to collect Web Usage data across different Web
Sites.
Lack of suitable test collections that can be reused by
researchers