Web Mining


WEB MINING

PRESENTERS:
• Eshwari
• Kunal
• Parth
• Pranita
Introduction

◦ Web mining is the use of data mining techniques to discover and extract information from web documents and services.
◦ It aims to find and extract relevant information that is hidden in web-related data.
◦ Web mining is a subset of data mining.
◦ Data is collected from servers, clients and databases.
◦ Web mining helps discover patterns and insights from the World Wide Web, including useful information found in hyperlinks.
WWW – Global Information Service Center

The WWW is huge and widely distributed, covering:
◦ News
◦ Advertisements
◦ E-commerce
◦ Government
◦ Social media
◦ Consumer information
◦ Access & usage information
Difference - Data Mining vs Web Mining

1. Data mining: the process of discovering patterns and relationships in large datasets.
   Web mining: the process of extracting information and knowledge from web data.

2. Data mining: involves using statistical and computational techniques to identify meaningful insights from the data.
   Web mining: involves techniques used to analyze user behavior, trends in online content, and patterns in web-based transactions.

3. Data mining: typically uses structured data.
   Web mining: uses both structured and unstructured data.
WEB MINING TECHNIQUES
Types of Web Mining

Web mining can generally be divided into 3 categories, based on the data to be mined: web content mining, web structure mining and web usage mining.
WEB CONTENT MINING
What is web content mining?

◦ Every webpage can contain text, graphics, audio, video, forms, applications, and other kinds of content.
◦ Web content mining is the extraction of relevant and useful information from this content, including user-generated content, on a website or any other online platform.
Goal of web content mining

◦ Aims to discover patterns, trends and insights in the large volume of data obtained through web content mining.
◦ Used to inform and improve business decisions, enhance search results, personalize content and improve the overall user experience.
◦ Helps identify upcoming trends on social media and put them to beneficial use.
Tools and Technologies used

◦ Various technologies are used in content mining, depending on the specific task and the data being analysed.
◦ Commonly used tools and technologies include web crawlers, natural language processing, machine learning, text mining, data visualization tools and cloud computing platforms.
◦ Tools such as Scrapy, Selenium, ProWebScraper, RStudio, Tableau, Oracle Data Mining (ODM) and Octoparse, and algorithms such as HITS and PageRank, are also used for web content mining.
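
As a rough illustration of the crawling and text mining steps listed above, the sketch below pulls the visible text and the hyperlinks out of one HTML page and counts word frequencies. It uses only the Python standard library rather than any of the tools named above; `ContentExtractor`, `mine_page` and `SAMPLE_HTML` are hypothetical names and data introduced for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects visible text and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is not part of a script or stylesheet.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def mine_page(html):
    """Return (word_counts, links) for an HTML string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    return Counter(words), parser.links


# Hypothetical page, used only for illustration.
SAMPLE_HTML = """
<html><head><style>p {color: red}</style></head>
<body><h1>Web Mining</h1>
<p>Web mining extracts knowledge from web data.</p>
<a href="https://example.com/usage">usage mining</a>
</body></html>
"""

word_counts, links = mine_page(SAMPLE_HTML)
print(word_counts.most_common(2))
print(links)
```

A real crawler would also fetch pages over the network, follow the collected links and respect robots.txt; those steps are left out here to keep the sketch self-contained.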
HITS algorithm
◦ Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates web pages as being hubs or authorities.
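
A minimal sketch of how hub and authority scores can be computed by repeated updates: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The function name `hits`, the fixed iteration count and the four-page `LINKS` graph are assumptions made for illustration.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: add up the hub scores of pages linking to each page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: add up the new authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[dst] for dst in links.get(p, [])) for p in pages}

        # Normalise both score vectors to unit length so they stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return hub, auth


# Hypothetical link graph, used only for illustration.
LINKS = {
    "portal": ["news", "shop"],
    "blog": ["news"],
    "news": [],
    "shop": [],
}

hub, auth = hits(LINKS)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this toy graph "news" receives the most links from good hubs, so it ends up as the top authority, while "portal" links to the most authorities and ends up as the top hub.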
WEB CONTENT MINING FLOW CHART
WEB STRUCTURE MINING
What is Web Structure Mining?

◦ Web structure mining is the process of analysing and extracting information from the link structure of the World Wide Web.
◦ The link structure of the Web consists of the set of hyperlinks between web pages, which can be represented as a directed graph, with web pages as nodes and hyperlinks as edges.
◦ It involves several techniques for analysing and extracting information from this graph structure.
◦ One common technique is link analysis, which examines the relationships between pages and their links, and can be used to identify important pages, such as authoritative sources or hubs.
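
To make the directed-graph view concrete, the sketch below stores a small web graph as a dictionary of out-links and ranks the pages with PageRank, used here as one example of link analysis. The damping factor of 0.85, the iteration count and the pages in `WEB_GRAPH` are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return a PageRank score for every page in a directed link graph.

    links maps each page (node) to the list of pages it links to (edges).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank along each out-link.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank

    return rank


# Hypothetical site structure: nodes are pages, edges are hyperlinks.
WEB_GRAPH = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "blog": ["home"],
}

ranks = pagerank(WEB_GRAPH)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:10s} {score:.3f}")
```

Here "home" is linked from every other page and ranks highest, while "blog" has no incoming links and ranks lowest.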
What is Web Structure Mining? - continued

◦ The structure of a typical web graph consists of web pages as nodes and hyperlinks as edges connecting two related pages.
◦ Web structure mining techniques include link analysis, graph analysis, clustering, and classification.
◦ These techniques can be used to identify important websites or web pages, discover hidden communities or clusters of web pages, and analyse the evolution of the web over time.
TYPES OF LINKS
Link-based Classification: The task is to predict the category of a web page, based on the words that occur on the page, links between pages, anchor text, HTML tags and other possible attributes found on the web page.

Link-based Cluster Analysis: The data is segmented into groups, where similar objects are grouped together and dissimilar objects are placed in different groups.

Link Type: There is a wide range of tasks concerning the prediction of the existence of links, such as predicting the type of link between two entities, or predicting the purpose of a link.

Link Strength: Links can be associated with weights.

Link Cardinality: The main task here is to predict the number of links between objects.
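
As one possible illustration of link-based cluster analysis, the sketch below groups pages whose sets of outgoing links overlap strongly. The Jaccard similarity measure, the 0.5 threshold, the single-link grouping and the page names in `OUT_LINKS` are all assumptions made for this example; the slides do not prescribe a particular method.

```python
def jaccard(a, b):
    """Share of links that two pages have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_links(out_links, threshold=0.5):
    """Group pages whose out-link sets have Jaccard similarity >= threshold."""
    pages = sorted(out_links)
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent pointers to the representative of p's group.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the groups of every sufficiently similar pair of pages.
    for i, p in enumerate(pages):
        for q in pages[i + 1:]:
            if jaccard(out_links[p], out_links[q]) >= threshold:
                parent[find(p)] = find(q)

    clusters = {}
    for p in pages:
        clusters.setdefault(find(p), []).append(p)
    return sorted(clusters.values())


# Hypothetical pages and the pages they link to.
OUT_LINKS = {
    "recipe1": ["flour", "sugar", "oven"],
    "recipe2": ["flour", "sugar", "butter"],
    "match1": ["team", "score"],
    "match2": ["team", "score", "league"],
}

clusters = cluster_by_links(OUT_LINKS)
print(clusters)
```

Pages that point to similar targets end up in the same group, so the two recipe pages form one cluster and the two match reports another.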
Applications

◦ Marketing
◦ E-commerce
◦ Information Retrieval
WEB USAGE MINING
What is Web Usage Mining?

[Figure: Web Mining branches into Web Structure Mining, Web Content Mining and Web Usage Mining]

◦ Extracting useful information and discovering user 'navigation patterns' from data generated through web usage.
◦ Predicting user behavior while the user interacts with the web.
Usage Mining Process
◦ Data collection: server level and client level.
◦ Analyzing data: identify users, clicks, location and duration.
◦ Data mining: navigation patterns and sequential patterns.
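
The data collection and analysis steps can be sketched on server-level data as follows. The example assumes the server writes its access log in Common Log Format, treats each IP address as one user, and starts a new session after 30 minutes of inactivity; these choices, the log lines in `SAMPLE_LOG` and the function names are illustrative assumptions, not something the slides specify.

```python
import re
from datetime import datetime, timedelta

# Assumed input: Common Log Format lines from a web server access log.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity limit


def parse_line(line):
    """Return (ip, timestamp, path) for one log line, or None if malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match["time"], TIME_FORMAT)
    return match["ip"], timestamp, match["path"]


def build_sessions(lines):
    """Return {ip: [session, ...]}; a session is a list of (timestamp, path)."""
    clicks = [c for c in map(parse_line, lines) if c is not None]
    clicks.sort(key=lambda c: (c[0], c[1]))
    sessions = {}
    for ip, timestamp, path in clicks:
        user_sessions = sessions.setdefault(ip, [])
        last_click = user_sessions[-1][-1][0] if user_sessions else None
        if last_click is not None and timestamp - last_click <= SESSION_TIMEOUT:
            user_sessions[-1].append((timestamp, path))
        else:
            user_sessions.append([(timestamp, path)])
    return sessions


# Hypothetical log lines, used only for illustration.
SAMPLE_LOG = [
    '192.0.2.1 - - [20/Jun/2005:10:15:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.7 - - [20/Jun/2005:10:16:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.1 - - [20/Jun/2005:10:17:30 +0000] "GET /products.html HTTP/1.1" 200 2310',
    '192.0.2.1 - - [20/Jun/2005:11:30:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
]

sessions = build_sessions(SAMPLE_LOG)
for ip, user_sessions in sessions.items():
    for session in user_sessions:
        duration = session[-1][0] - session[0][0]
        pages = [path for _, path in session]
        print(ip, pages, duration)
```

The resulting sessions (who clicked what, in which order, for how long) are the input for the navigation and sequential pattern mining step.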
Data Mining Techniques – Navigation Patterns

[Figure: Web page hierarchy of a web site, with pages labelled B, C, D and E]
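
One simple way to look for navigation patterns is to count how often users move from one page to the next within a session. The sketch below reuses the page labels B, C, D and E from the hierarchy figure; the sessions in `SESSIONS` and the function names are made up for illustration.

```python
from collections import Counter


def transition_counts(sessions):
    """Count page-to-page moves over all sessions."""
    counts = Counter()
    for pages in sessions:
        # zip pairs every page with the page visited right after it.
        counts.update(zip(pages, pages[1:]))
    return counts


def next_page_probabilities(counts, page):
    """Estimate how likely each next page is, given the current page."""
    total = sum(n for (src, _), n in counts.items() if src == page)
    return {dst: n / total for (src, dst), n in counts.items() if src == page}


# Hypothetical sessions: each list is the ordered pages one user visited.
SESSIONS = [
    ["B", "C", "D"],
    ["B", "C"],
    ["B", "E"],
    ["C", "D"],
]

counts = transition_counts(SESSIONS)
print(counts)
print(next_page_probabilities(counts, "B"))
```

Frequent transitions such as B to C suggest paths that many users follow, which can guide link placement or page restructuring.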


Data Mining Techniques – Sequential Patterns Example

Customer    Transaction Time      Purchased Items
John        6/21/05 5:30 pm       Beer
John        6/22/05 10:20 pm      Brandy
Frank       6/20/05 10:15 am      Juice, Coke
Frank       6/20/05 11:50 am      Beer
Frank       6/20/05 12:50 pm      Wine, Cider
Mary        6/20/05 2:30 pm       Beer
Mary        6/21/05 6:17 pm       Wine, Cider
Mary        6/22/05 5:05 pm       Brandy
Data Mining Techniques – Sequential Patterns Example - continued

Customer Sequence
Customer    Customer Sequences
John        (Beer) (Brandy)
Frank       (Juice, Coke) (Beer) (Wine, Cider)
Mary        (Beer) (Wine, Cider) (Brandy)

Mining Result
Sequential Patterns       Supporting Customers
(Beer) (Brandy)           John, Mary
(Beer) (Wine, Cider)      Frank, Mary
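
The mining result can be checked with a short sketch: a customer supports a pattern when the pattern's itemsets appear, in order, within that customer's sequence of purchases. The data comes from the customer sequences in this example; the function names `contains` and `supporters` are introduced here for illustration.

```python
def contains(sequence, pattern):
    """True if the pattern's itemsets occur in order within the sequence.

    Each pattern itemset must be a subset of a different transaction,
    and the matching transactions must keep the pattern's order.
    """
    pos = 0
    for itemset in pattern:
        # Skip ahead to the next transaction that includes this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


# Customer sequences from the example, ordered by transaction time.
CUSTOMER_SEQUENCES = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}


def supporters(pattern, sequences=CUSTOMER_SEQUENCES):
    """Return the sorted names of customers whose sequence contains the pattern."""
    return sorted(name for name, seq in sequences.items() if contains(seq, pattern))


print(supporters([{"Beer"}, {"Brandy"}]))
print(supporters([{"Beer"}, {"Wine", "Cider"}]))
```

A full sequential pattern miner would also generate the candidate patterns and keep only those above a minimum support; this sketch covers only the support check.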
Association
◦ Example: Supermarket

Transaction ID    Items Purchased
1                 butter, milk
2                 bread, milk, beer, egg
3                 diaper
…                 ………

An association rule can be: "If a customer buys milk, in 50% of cases he/she also buys beer." This happens in 33% of all transactions.

◦ 50%: confidence
◦ 33%: support
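
The two figures can be reproduced with a few lines of code. Support is the share of all transactions that contain both milk and beer, and confidence is the share of milk transactions that also contain beer. The sketch assumes that the three transactions listed in the table are the whole dataset and ignores the elided rows; the function names are introduced for illustration.

```python
# The three transactions shown in the table (assumed to be the full dataset).
TRANSACTIONS = [
    {"butter", "milk"},
    {"bread", "milk", "beer", "egg"},
    {"diaper"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"milk", "beer"}, TRANSACTIONS)
rule_confidence = confidence({"milk"}, {"beer"}, TRANSACTIONS)
print(f"support: {rule_support:.0%}")
print(f"confidence: {rule_confidence:.0%}")
```

Milk appears in two of the three transactions and milk together with beer in one, which gives the 33% support and 50% confidence quoted in the rule.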
Discovery of meaningful patterns from data generated by client-server transactions can be used to:

◦ Restructure a website
◦ Extract user access patterns to target specific ads
◦ Predict user behavior based on previously learned rules and users' profiles
◦ Present dynamic information to users based on their interests and profiles
Conclusion

◦ As web usage and the number of information sources on the World Wide Web grow continuously, web mining offers a good opportunity to extract hidden knowledge from the web.
◦ As a weakness, some (though not all) researchers have replaced web mining with text mining.
◦ However, web mining deals with a large amount of multimedia information, whereas text mining handles only textual data.
