Web Mining: by Saumil Shah Roll No: 46 Mca 4 Sem

By
Saumil Shah
Roll No : 46
MCA 4th sem
WEB MINING
Agenda
World Wide Web – a brief history
Introduction to Data Mining
Data Mining Process & Techniques
Web Mining
Data Mining Vs Web Mining
Classification of Web Mining
Benefits & Application Areas of Web Mining
Web Mining Software
Summary
8/12/10
Data Mining vs. Web Mining
Traditional data mining
data is structured and relational
well-defined tables, columns, rows, keys, and constraints.
Web data
Semi-structured (HTML documents) and unstructured (free text)
readily available data
rich in features and patterns
Problems when interacting with the Web
» Finding relevant information
» Creating new knowledge out of the information available on the Web
» Personalization of the information
» Learning about consumers or individual users
Web Mining
Web Mining - Definition
» “Web mining refers to the overall process of discovering
potentially useful and previously unknown information or
knowledge from the Web data.”
» The web mining process is similar to the data mining process; the difference is usually in the data collection.
» In data mining, the data is often already collected and
stored in a data warehouse.
» In web mining, data collection can be a substantial task,
especially for web structure and content mining, which
involves crawling a large number of target web pages.
Web Mining - Subtasks
» Resource finding
   Retrieving intended documents
» Information selection/pre-processing
   Select and pre-process specific information from selected documents
» Generalization
   Discover general patterns at individual web sites as well as across multiple web sites
» Analysis
   Validation and/or interpretation of mined patterns
Web Mining Contd..
Web Mining is not IR:
» Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
Web Mining is not IE:
» Information extraction (IE) aims to extract the relevant facts from given documents
» IE systems for the general Web are not feasible
» Most focus on specific Web sites or content
Web Usage Mining
Web Usage Mining refers to the discovery of user access patterns from the web usage logs, which record every click made by each user.
The usage data records the user’s behavior when the user browses or makes transactions on the web site in order to better understand and serve the needs of users or Web-based applications.
It is an activity that involves the automatic discovery of patterns from one or more Web servers.
Web Usage Mining Contd..
Organizations often generate and collect large volumes of data; most of this information is usually generated automatically by Web servers and collected in server logs.
Analyzing such data can help these organizations to determine:
the value of particular customers
cross marketing strategies across products
the effectiveness of promotional campaigns, etc.
Typical Sources of Data
automatically generated data stored in server access logs, proxy server logs, referrer logs, browser logs, bookmark data, mouse clicks and scrolls, and client-side cookies
user profiles
meta data: page attributes, content attributes, usage data
» The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine such information as:
the number of accesses to the server
the times or time intervals of visits
the domain names and the URLs of users of the Web server.
» Two main categories:
Learning a user profile (personalized)
   Web users would be interested in techniques that learn their needs and preferences automatically
Learning user navigation patterns (impersonalized)
   Information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site
» Web servers, Web proxies, and client applications can quite easily capture Web usage data.
Web server log:
Every visit to the pages, which files have been requested and when, the IP address of the request, the error code, the number of bytes sent to the user, and the type of browser used.
» By analyzing the Web usage data, web mining systems can discover useful knowledge about a system’s usage characteristics and the users’ interests, which has various applications:
Personalization and Collaboration in Web-based systems
Marketing
Web site design and evaluation
Decision support
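The server log fields listed above (IP address, requested file, time, status code, bytes sent, browser) can be pulled out of a raw log line with a regular expression. The sketch below assumes an Apache-style "combined" log layout; the slides do not name a log format, and the sample line, LOG_PATTERN, and parse_log_line are hypothetical names made up for illustration.

```python
import re

# Assumed layout: Apache-style "combined" access log (not stated in the slides).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no body was sent.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# Hypothetical sample line, invented for this example.
SAMPLE = ('192.0.2.10 - - [12/Aug/2010:10:15:32 +0000] '
          '"GET /products/index.html HTTP/1.1" 200 5120 '
          '"http://example.com/home.html" "Mozilla/5.0"')

record = parse_log_line(SAMPLE)
print(record["ip"], record["path"], record["status"], record["bytes"])
```

Once each line is a dictionary, counting accesses per page or per IP address is a matter of grouping on the relevant key.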
Web Usage Mining Contd..
Web Log Mining is the technique of retrieving visitor information from web server log files and analyzing it.
The major types of log files are
Access Log: maintains a list of all the web pages that the visitors have requested.
Agent Log: contains information about the browser that was used to explore the various web pages.
Web Content Mining
Web Content Mining extracts or mines useful information or knowledge from web page contents.
In this mining, patterns are extracted from online sources such as
HTML files
Text documents
Images
E-books or email messages
Audio or Video
The concept of WCM is far wider than searching for any specific term or only keyword extraction or some simple statistics of words and phrases in documents.
A tool that performs WCM can summarize a web page so that you need not read the complete document, saving your time and energy.
Web Content Mining Contd..
The two basic approaches or models to implement WCM are
Local Knowledge Base Model:
The abstract characterizations of several web pages are stored locally (i.e., references to several web sites relating to the categories are stored in a database and, based on the selected category, the search is performed within the web site).
Agent Based Model:
This approach applies Artificial Intelligence systems known as Web Agents that can perform a search on behalf of a particular user for discovering and organizing documents in the web. Some web agents can apply individual user profiles when searching for information on the web, and organize and interpret the discovered information.
Preprocessing Content
Content Preparation:
Extract text from HTML.
Perform Stemming.
Remove Stop Words.
Calculate Collection Wide Word Frequencies (DF).
Calculate per Document Term Frequencies (TF).
Vector Creation:
Common Information Retrieval Technique.
Each document (HTML page) is represented by a sparse
vector of term weights.
Typically, additional weight is given to terms appearing as
keywords or in titles.
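The preparation and vector-creation steps above can be sketched as follows. This is a minimal illustration, not the deck's own method: it assumes the text has already been extracted from HTML, skips stemming, uses a tiny made-up stop-word list, and weights terms with TF * log(N / DF), one common TF-IDF variant chosen here because the slides do not specify a weighting formula. The names tokenize, build_vectors, and the sample documents are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_vectors(documents):
    """Return one sparse {term: weight} dict per document (TF * log(N / DF))."""
    token_lists = [tokenize(doc) for doc in documents]
    # DF: number of documents in which each term appears at least once.
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)  # TF: raw count of each term in this document
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


# Hypothetical documents standing in for text extracted from HTML pages.
documents = [
    "Web mining discovers patterns in web data.",
    "Data mining uses structured data.",
    "Web usage logs record clicks.",
]
vectors = build_vectors(documents)
print(sorted(vectors[0].items()))
```

Each vector stores only the terms that occur in its document, which is what makes the representation sparse. Extra weight for title or keyword terms could be added by multiplying the corresponding entries.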
Common Mining Techniques
The more basic and popular data mining techniques include:
Classification- Classification on server logs using decision trees or a Naive Bayes classifier to discover the profiles of users belonging to a particular category.
Clustering- can be used to group users exhibiting similar
browsing patterns.
Associations- can be used to relate pages that are most often
referenced together in a single server session.
The other significant ideas are:
Topic Identification, tracking and drift analysis
Concept hierarchy creation
Relevance of content.
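As an illustration of the association technique listed above (relating pages that are referenced together in one server session), the sketch below counts how often each pair of pages co-occurs and reports it as a fraction of all sessions, usually called support. The session data and the name page_pair_support are hypothetical, and a full association-rule miner would go further than pair counting.

```python
from collections import Counter
from itertools import combinations


def page_pair_support(sessions):
    """Return {(page_a, page_b): fraction of sessions containing both pages}."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each page once per session; sort so (a, b) and (b, a) coincide.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    total = len(sessions)
    return {pair: count / total for pair, count in pair_counts.items()}


# Hypothetical sessions: each list holds the pages requested in one visit.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
support = page_pair_support(sessions)
for pair, value in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, value)
```

Pairs with high support are candidates for cross-linking or for placing related content on the same page.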
Web Structure Mining
Web Structure Mining discovers useful knowledge from hyperlinks, which represent the structure of the web.
» Web structure mining can be divided into two kinds:
Extracting patterns from hyperlinks in the web. A hyperlink is a structural component that connects the web page to a different location.
Mining the document structure. This uses the tree-like structure to analyze and describe the HTML or XML tags within the web page.
» It is the process of using graph theory to analyze the node and connection structure of a web site.
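The first kind of structure mining starts from the hyperlinks found in each page. A minimal sketch of that extraction step, using Python's standard html.parser module, is shown here; the class name LinkExtractor, the helper extract_links, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the out-links of a page, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical page used only for illustration.
PAGE = (
    '<html><body><a href="/a.html">A</a>'
    '<p><a href="http://example.com/b">B</a></p>'
    '<a name="x">anchor without a link</a></body></html>'
)
links = extract_links(PAGE)
print(links)
```

Running this over every crawled page yields the edges of the web graph, with pages as nodes and hyperlinks as connections.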
Contd..
Web Structure is a useful source for extracting information such as
Web Page Classification
   Classifying web pages according to various topics
Quality of Web Page
   The authority of a page on a topic
   Ranking of web pages
Which pages to crawl
   Deciding which web pages to add to the collection of web pages
Finding Related Pages
   Given one relevant page, find all related pages
Contd..
The Hyperlink Induced Topic Search (HITS) is a common algorithm for knowledge discovery in the Web. The concept of HITS is the identification of
Authorities: authoritative, high-quality web pages on broad topics
Hubs: web pages that link to a collection of authorities
A good authority is pointed to by many good hubs
A good hub points to many good authorities
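The mutual reinforcement between hubs and authorities can be sketched as an iterative update: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The code below is a simplified sketch under assumptions not in the slides: a fixed number of iterations, Euclidean normalization, and a small hypothetical link graph.

```python
import math


def normalize(scores):
    """Scale scores so that the sum of squares is 1 (left as-is if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(graph, iterations=20):
    """graph maps each page to the pages it links to.

    Returns (authority, hub) dictionaries of scores.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up hub scores of all pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub update: add up the new authority scores of all linked pages.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical link graph: three hub pages pointing at two content pages.
graph = {"hub1": ["a", "b"], "hub2": ["a", "b"], "hub3": ["a"]}
auth, hub = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))
```

In this toy graph, page "a" ends up with the highest authority score because all three hubs point to it, and "hub1" and "hub2" score higher as hubs than "hub3" because they point to both authorities.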
Web structure mining has been largely influenced by research in
Social network analysis
Citation analysis (bibliometrics).
in-links: the hyperlinks pointing to a page
out-links: the hyperlinks found in a page.
Usually, the larger the number of in-links, the better a page is.
Application Areas of Web Mining
E-commerce
Search Engines
Personalization
Website Design
Web mining applications
Amazon.com
Google
DoubleClick
AOL
eBay
MyYahoo
CiteSeer
I-MODE
v-TAG Web Mining Server
Applications Contd..
Amazon:
A host of Web mining techniques, e.g. associations between
pages visited, click-path analysis, etc., are used to improve the
customer’s experience during a ‘store visit’. Knowledge gained
from Web mining is the key intelligence behind Amazon’s
features such as ‘instant recommendations’, ‘purchase circles’,
‘wish-lists’, etc.
Applications Contd..
Google
» Earlier search engines concentrated on the Web content to return the pages relevant to a query. Google was the first to introduce the importance of the link structure in mining the information from the web. PageRank, which measures the importance of a page, is the underlying technology in all Google search products.
» The PageRank technology, which makes use of the structural information of the Web graph, is the key to returning quality results relevant to a query.
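To make the idea of ranking pages by link structure concrete, here is a sketch of the textbook PageRank iteration. It is not a description of Google's production system: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the sample graph are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the pages it links to. Returns {page: rank}."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank along each out-link.
        for source, targets in graph.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page site used only for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(graph)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page "c" comes out on top because three pages link to it, while "d" ranks lowest because nothing links to it. The ranks always sum to 1, so they can be read as the share of attention each page receives.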
Benefits of Web Mining
Match your available resources to visitor interests
Increase the value of each visitor
Improve the visitor's experience at the website
Perform targeted resource management
Collect information in new ways
Test the relevance of content and web site architecture
Web Mining Software
Web Miner
Sinope Summarizer
Teleport Pro
ClickTracks
Summary
Major Limitations of Web Mining research:
Difficult to collect Web Usage data across different Web Sites.
Lack of suitable test collections that can be reused by researchers
Future research directions:
Multimedia data mining: A picture is worth a thousand words.
Multilingual knowledge extraction: Web page translations
The Hidden Web: Forms, Dynamically generated web pages.
Semantic Web
Wireless Web: WML and HDML.