
WEB MINING

BY:
ANITHA K
17EUEE017
WEB MINING :

 Web mining is the application of data mining techniques to
discover patterns from the World Wide Web. As the name
suggests, this is information gathered by mining the web.
 The goal of Web structure mining is to generate a structural
summary of a Web site and its Web pages. Technically,
Web content mining mainly focuses on the structure within a
document (intra-document), while Web structure mining tries to
discover the link structure of the hyperlinks at the
inter-document level.
TYPES OF WEB MINING :

 Web mining can be divided into three different types, according to
the data being mined:
1. Web usage mining (HTTP logs, application server logs, etc.)
2. Web content mining (text, images, records, etc.)
3. Web structure mining (hyperlinks, tags, etc.)
WEB MINING TAXONOMY:
WHAT IS WEB STRUCTURE MINING?
 Web structure mining uses graph theory to analyze the node
and connection structure of a web site.
 According to the type of web structural data, web structure
mining can be divided into two kinds:
1. Extracting patterns from hyperlinks in the web: a hyperlink is a
structural component that connects a web page to a different
location.
2. Mining the document structure: analyzing the tree-like
structure of a page to describe its HTML or XML tag usage.
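Both kinds of structural data can be pulled from a single page. The following is a minimal sketch, assuming Python's standard `html.parser` module; the `StructureMiner` class and the sample page are hypothetical and serve only as illustration. It collects hyperlink targets (kind 1) and counts tag usage (kind 2).

```python
from collections import Counter
from html.parser import HTMLParser


class StructureMiner(HTMLParser):
    """Collects hyperlink targets and tag usage counts from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []              # href targets, in document order
        self.tag_counts = Counter()  # how often each tag is opened

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <h1>Home</h1>
  <p>See <a href="/about.html">About</a> and
     <a href="https://example.org/">an external site</a>.</p>
  <p>No link here.</p>
</body></html>
"""

miner = StructureMiner()
miner.feed(SAMPLE_PAGE)
print(miner.links)       # hyperlink structure of the page
print(miner.tag_counts)  # document (tag) structure of the page
```

The extracted links are the raw material for a web graph, while the tag counts give a coarse summary of the document structure.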
WEB STRUCTURE MINING :
WEB STRUCTURE MINING TERMINOLOGY :
 Web graph: A directed graph representing the web.
 Node: A web page in the graph.
 Edge: A hyperlink.
 In degree: The number of links pointing to a particular node.
 Out degree: The number of links originating from a particular
node.
 Directed Path: A sequence of links, starting from p, that can
be followed to reach q.
 Shortest Path: Of all the paths between nodes p and q, the one
with the shortest length, i.e. the fewest links on it.
 Diameter: The maximum, over all pairs of nodes p and q in the
Web graph, of the length of the shortest path between p and q.
 Link: Each hyperlink on the Web is a directed edge of the
Web graph.
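These terms can be made concrete on a small example. The sketch below assumes a hypothetical four-page web graph stored as an adjacency dictionary; the function names are illustrative, not taken from the slides. Shortest paths are found by breadth-first search, and the diameter is computed over reachable pairs only, which is an assumption made here because some pages cannot reach others.

```python
from collections import deque

# Hypothetical web graph: each page maps to the pages it links to.
WEB_GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def out_degree(graph, node):
    """Number of links originating from node."""
    return len(graph.get(node, []))


def in_degree(graph, node):
    """Number of pages that link to node."""
    return sum(node in targets for targets in graph.values())


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of links on the shortest
    directed path from source to target, or None if target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def diameter(graph):
    """Longest shortest path over all ordered pairs that are connected.
    Unreachable pairs are skipped (an assumption of this sketch)."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    lengths = [
        shortest_path_length(graph, p, q)
        for p in nodes for q in nodes if p != q
    ]
    return max(d for d in lengths if d is not None)


print(in_degree(WEB_GRAPH, "C"), out_degree(WEB_GRAPH, "A"))
print(shortest_path_length(WEB_GRAPH, "D", "B"))
print(diameter(WEB_GRAPH))
```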
HYPERLINK ANALYSIS :

 Research at the hyperlink level is also called hyperlink
analysis.
 MOTIVATION TO STUDY HYPERLINK STRUCTURE :
 Hyperlinks serve two main purposes :
1. Pure navigation
2. Pointing to pages with authority on the same topic as the page
containing the link
 This structure can be used to retrieve useful information from the web.
INTERESTING WEB STRUCTURES
HYPERLINK ANALYSIS TECHNIQUES :

[Diagram: Knowledge Models, Analysis Scope & Properties, and Measures & Algorithms as the building blocks of Applications]
 Knowledge Models: The underlying representations that
form the basis for carrying out application-specific tasks.
 Analysis Scope and Properties: The scope of analysis specifies whether
the task is relevant to a single node, a set of nodes, or the entire
graph. The properties are the characteristics of a single node, a
set of nodes, or the entire web.
 Measures and Algorithms: The measures are the standards for the
properties, such as quality, relevance, or distance between nodes.
Algorithms are designed for efficient computation of these measures.
These three areas form the fundamental building blocks for
applications based on hyperlink analysis.
GOOGLE’S PAGE RANK :

 PageRank (PR) is an algorithm used by Google Search to
rank web pages in its search engine results. PageRank
was named after Larry Page, one of the founders of
Google. PageRank is a way of measuring the importance
of website pages.
 Currently, PageRank is not the only algorithm used by
Google to order search results, but it was the first algorithm
used by the company, and it is the best known.
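The slides do not give the computation, so the following is only a sketch of the commonly described iterative form of PageRank, not Google's production method. The damping factor of 0.85, the iteration count, the even redistribution of rank from pages without outgoing links, and the sample graph are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively estimate PageRank scores for an adjacency-dict graph.

    damping and iterations are assumed illustrative defaults.
    """
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web graph: each page maps to the pages it links to.
SAMPLE_GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

scores = pagerank(SAMPLE_GRAPH)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this sample graph, page C is linked to by three pages and ends up with the highest score, while page D, which no page links to, keeps only the baseline share.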
HITS ALGORITHM :
 Hyperlink-Induced Topic Search (HITS; also known as hubs and
authorities) is a link analysis algorithm that rates Web pages, developed by
Jon Kleinberg.
 The idea behind hubs and authorities stemmed from a particular insight
into the creation of web pages when the Internet was originally forming: that
is, certain web pages, known as hubs, served as large directories that were
not actually authoritative in the information that they held, but were used as
compilations of a broad catalog of information that led users directly to other
authoritative pages.
 A good hub is a page that points to many other pages, and a
good authority is a page that is linked to by many different hubs.
HUBS AND AUTHORITIES :

 Hubs and authorities are 'fans'
and 'centers' in a bipartite core
of a web graph.
 A good hub page is one that
points to many good authority
pages.
 A good authority page is one that is
pointed to by many good hub pages.
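The mutual reinforcement between hubs and authorities can be sketched as a simple iteration. This is a minimal illustration under stated assumptions, not a full HITS implementation: it skips the query-specific subgraph construction, and the iteration count, the Euclidean normalization, and the sample graph are choices made here.

```python
import math


def hits(graph, iterations=20):
    """Return (hub, authority) score dicts for an adjacency-dict graph."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking to each page.
        auth = {node: 0.0 for node in nodes}
        for page, targets in graph.items():
            for target in targets:
                auth[target] += hub[page]

        # Hub update: sum the authority scores of the pages each page links to.
        hub = {node: sum(auth[t] for t in graph.get(node, [])) for node in nodes}

        # Normalize both score vectors so the values stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {node: v / auth_norm for node, v in auth.items()}
        hub = {node: v / hub_norm for node, v in hub.items()}

    return hub, auth


# Hypothetical graph: three directory-like pages pointing to two content pages.
SAMPLE_GRAPH = {"H1": ["X", "Y"], "H2": ["X", "Y"], "H3": ["X"]}

hub_scores, auth_scores = hits(SAMPLE_GRAPH)
print(max(auth_scores, key=auth_scores.get))  # page with the top authority score
print(hub_scores)
```

Here X is pointed to by all three hub pages and receives the highest authority score, while H1 and H2, which point to both authorities, score higher as hubs than H3.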
INFORMATION SCENT :

 Information scent refers to the extent to which users
can predict what they will find if they pursue a certain
path through a website.
 The term is part of information foraging theory, which
explains how users interact with systems using the
analogy of animals hunting for food.
THANK YOU 🌟
