IR - ch7 - Web Search Engine

CS444: Information Retrieval
and Web Search

Fall 2021
CHAPTER 7:
WEB SEARCH ENGINE AND LINK ANALYSIS
CS444 INFORMATION RETRIEVAL & WEB SEARCH ENGINE BY ZAINAB AHMED MOHAMMED 2
The Bow-Tie Structure of the Web [Broder et al.]
IN: nodes that can reach the giant SCC but cannot be
reached from it.
OUT: nodes that can be reached from the giant SCC but
cannot reach it.
Tendrils: The “tendrils” of the bow-tie consist of
(a) the nodes reachable from IN that cannot reach the giant
SCC
(b) the nodes that can reach OUT but cannot be reached from the
giant SCC.
It’s possible for a tendril node to satisfy both (a) and (b), in
which case it’s part of a “tube” that travels from IN to OUT
without touching the giant SCC.
Disconnected: Finally, there are nodes that would not have a
path to the giant SCC even if the directions of the edges were ignored.
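The bow-tie regions defined above can be computed with two graph searches from any node of the giant SCC: one along the edges and one against them. The following is a minimal sketch, not a reference implementation. The toy graph, the node names and the function names are all hypothetical, and the caller is assumed to already know one node (`core_node`) that lies in the giant SCC.

```python
from collections import deque


def reachable(graph, sources):
    """Return every node reachable from `sources` by following edges in `graph`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Return a copy of `graph` with every edge direction flipped."""
    rev = {node: [] for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def bow_tie(graph, core_node):
    """Classify nodes relative to the SCC that contains `core_node`."""
    rev = reverse(graph)
    nodes = set(graph) | set(rev)
    forward = reachable(graph, [core_node])   # nodes the core can reach
    backward = reachable(rev, [core_node])    # nodes that can reach the core
    scc = forward & backward
    in_set = backward - scc
    out_set = forward - scc
    rest = nodes - scc - in_set - out_set
    from_in = reachable(graph, in_set) & rest   # case (a): reachable from IN
    to_out = reachable(rev, out_set) & rest     # case (b): can reach OUT
    # Ignore edge directions to find nodes with no path at all to the core.
    undirected = {n: list(graph.get(n, ())) + rev.get(n, []) for n in nodes}
    disconnected = nodes - reachable(undirected, [core_node])
    return {
        "SCC": scc,
        "IN": in_set,
        "OUT": out_set,
        "tendrils": from_in | to_out,
        "tubes": from_in & to_out,
        "disconnected": disconnected,
    }


# Hypothetical toy web: a-b-c form the core, i is IN, o is OUT.
graph = {
    "a": ["b"], "b": ["c"], "c": ["a", "o"],
    "i": ["a", "t1", "u"],
    "o": [], "t1": [], "t2": ["o"], "u": ["o"],
    "x": ["y"], "y": [],
}
parts = bow_tie(graph, "a")
for name, members in parts.items():
    print(name, sorted(members))
```

In this toy graph, node u satisfies both (a) and (b), so it is reported both as a tendril and as a tube.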
Web Mining:
• Discovering and extracting information from the Web
• Web mining can generally be divided into three categories:
  ◦ Web content mining: page content (text, images, ...)
  ◦ Web structure (link) mining: hyperlinks between pages
  ◦ Web usage mining: user behavior and logs
• Categories can be combined
Web Content Mining
Web content mining is the process of extracting useful
information from the content of Web pages.
Web page content consists of many kinds of data, such as
text, images, audio, video and hyperlinks.
It includes unstructured data such as free text,
semi-structured data such as HTML or XML documents,
and structured data such as the data found in tables.
There are two approaches to Web content mining: the
agent-based approach and the database approach.
Web Structure (Link) Mining
Web structure mining, sometimes referred to as Web link mining, tries to
generate a structural summary of Web sites and Web pages.
This kind of Web mining models the linked Web pages as a
Web graph, where the nodes represent the Web pages and the edges
represent the hyperlinks between them.
Web structure mining makes it easier to classify Web pages in
terms of their relevance and importance.
Web structure mining also detects similarities
and relationships between different Web sites and Web pages.
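As a small illustration of the Web graph described above, the sketch below stores hyperlinks as directed edges and derives each page's in-links and out-links. The page names and the link list are hypothetical; in practice the edges would come from a crawler.

```python
from collections import defaultdict

# Hypothetical hyperlinks as (source page, target page) pairs.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]

out_links = defaultdict(set)  # page -> pages it links to
in_links = defaultdict(set)   # page -> pages that link to it
for src, dst in links:
    out_links[src].add(dst)
    in_links[dst].add(src)

# Every page that appears as a source or a target is a node of the graph.
pages = sorted(set(out_links) | set(in_links))
in_degree = {page: len(in_links[page]) for page in pages}
out_degree = {page: len(out_links[page]) for page in pages}

for page in pages:
    print(page, "in-links:", in_degree[page], "out-links:", out_degree[page])
```

Keeping both directions makes it cheap to ask which pages point to a given page, which is the question link-based ranking depends on.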
Web Usage Mining
Web usage mining applies data analysis techniques to extract
the usage patterns generated by surfers on the Web, and uses them
to enhance Web services such as the users’ search experience.
Web usage mining mines the data derived from users’ interactions
with the Web.
Usage data can be found, for example, in IP addresses,
the date and time of user access, user profiles,
user queries, mouse clicks and scrolls, cookies, transactions,
and any other data that comes from user interactions.
Web Mining: Categories
Web mining categories are used alone or in combination with each other.
They aim to enhance search engines so that search results are more relevant
and important to the users’ queries.
Many different page ranking algorithms have emerged that use
these kinds of mining processes.
Most of these algorithms treat structure mining as the most important
category, while others try to combine Web structure mining with
Web content mining and Web usage mining.
Next, we will discuss in some detail the Web mining page ranking algorithms
that emerged to enhance search engines.
Web Mining: Categories
Web Mining
• Web Content Mining
  ◦ Database Approach
  ◦ Agent-Based Approach
• Web Link Mining
• Web Usage Mining
  ◦ General Access Pattern Tracking
  ◦ Customized Usage Tracking
Web Ranking
Rank is a value given to a web page to assign a priority to the placement of the
page on the results page of a search engine.
Ranking is challenging because of:
1. Synonymy and polysemy (many words with one meaning, one word with many meanings)
2. The same ranking of search results cannot be right for everyone
3. The dynamic and constantly changing nature of the web
4. The problem of abundance
   1. Filtering the few that are most important
   2. Which few should the search engine recommend?

Ranking Algorithm classification
Structure Mining algorithms
• PageRank
• HITS
• SALSA
Content mining algorithms
• PCR
Usage-Based mining algorithms
• An improved Usage-based Ranking

Web Page Quality Measurement
Web page quality is an abstract measure of
how authoritative a page is on its subject matter.
◦ Page Importance (graph structure)
  Popularity (inlinks), informativeness (outlinks)
◦ Page Relevance (page content)
  Word or phrase location, word frequency, anchor text
The Structure Mining
This type of mining can be performed
at the (intra-page) document level, or
at the (inter-page) hyperlink level.
Parts of a document:
Title: concise summary of the document
Paragraph 1: likely to be an abstract of the document
Paragraph 2
…..
Images: visual description of the document
Anchor texts: references to other documents
They might contribute differently to a document’s relevance!
Inter-document structure
Documents are no longer independent
Source: https://wiki.digitalmethods.net/Dmi/WikipediaAnalysis
What do the links mean?
Anchor text:
◦ How others describe the page
◦ E.g., “big blue” is a nickname of IBM, but never found on IBM’s official web site
◦ A good source for query expansion, or directly put into index
Linkage relation:
◦ Endorsement from others: utility of the page
[Figure: pages linking to www.ibm.com, with anchor texts such as "Armonk, NY-based computer giant IBM announced today", "Joe’s computer hardware links: Sun, HP, IBM", "Big Blue today announced record profits for the quarter" and "More information about IBM products can be found here"]
"PageRank-hi-res". Licensed under Creative Commons Attribution-Share Alike 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:PageRank-hi-res.png#mediaviewer/File:PageRank-hi-res.png
ANALOGY (Citation Network)
Authors cite others’ work because
◦ They appreciate the intellectual value in that paper
◦ There is a certain relationship between the papers
A citation is a conferral of authority
Bibliometrics
◦ A citation is a vote for the usefulness of that paper
◦ Citation count indicates the quality of the paper
◦ E.g., # of in-links
Citation in the web environment
Adding a hyperlink costs almost nothing
◦ Taken advantage of by web spammers
◦ Large volume of machine-generated pages to artificially increase “in-links” of the target page
◦ Fake or invisible links
We should not only consider the count of in-links, but the quality of
each in-link
◦ PageRank
◦ HITS
CS 6501: INFORMATION RETRIEVAL 17

Link analysis (Structure analysis)
Describes the characteristics of the network structure
Reflects the utility of a web document in a general sense
Used to aid search by exploiting the web graph
It was one of the primary reasons for Google’s initial
success.

Link analysis algorithms
PageRank
HITS
SALSA
PageRank [Page & Brin, 1998]
PageRank was presented and published by Sergey Brin and Larry Page at the Seventh
International World Wide Web Conference (WWW7) in April 1998.
PageRank is a link analysis algorithm which assigns numerical scores to web
pages using the importance of web pages as a measurement.
PageRank is a “vote”, by all the other pages on the Web, about how important
a page is.
“A web page is important if it is pointed to by other important pages.”
A page can have a high PageRank
◦ If there are many pages that point to it
◦ Or if some of the pages that point to it have a high PageRank.
The PageRank value of each page can be regarded as its status in the network.
From the perspective of status, we use the following to derive the PageRank algorithm.
A hyperlink from a page pointing to another page is an implicit conveyance of authority to the
target page. Thus, the more in-links that a page “i” receives, the more status the page “i”
has.
Pages that point to page “i” also have their own status scores. A page with a higher status
score pointing to “i” is more important than a page with a lower status score pointing to “i”.
In other words, a page is important if it is pointed to by other important pages.
In-links of page i: These are the hyperlinks that point to page “i”
from other pages. Usually, hyperlinks from the same site are not
considered.
Out-links of page i: These are the hyperlinks that point out to other
pages from page “i”. Usually, links to pages of the same site are not
considered.
Simple PageRank Algorithm [Page & Brin, 1998]
The PageRank of a web page is therefore calculated as the sum of the PageRanks of all pages linking to
it (its in-links), each divided by the number of out-links on the linking page.
For a page P with three in-links, from P1, P2 and P3:
$$PR(P) = \frac{PR(P_1)}{OutDeg(P_1)} + \frac{PR(P_2)}{OutDeg(P_2)} + \frac{PR(P_3)}{OutDeg(P_3)}$$
In general:
$$PR(P) = \sum_{j=1}^{n} \frac{PR(P_j)}{OutDeg(P_j)}$$
Where:
• PR(P) is the PageRank of page P,
• OutDeg(Pj) is the number of outbound links on page Pj,
• n is the number of in-links of page P.
The PageRank of page P is recursively defined by the PageRank of those pages which link to page P.
EXAMPLE:
We regard a small web consisting of four pages A, B, C and D, where page A links to pages
B, C and D, page B links to page C, page C links to page A, and page D links to page C.
PR(A) = ( PR(C)/1)
PR(B) = ( PR(A)/3)
PR(C) = ( PR(A)/3 + PR(B)/1 + PR(D)/1 )
PR(D) = ( PR(A)/3 )
At the start, we do not know the PageRank values of the pages, so we assume equal values (0.33).
We get the following PageRank values for the single pages:
First Iteration:
PR(A) = ( .33/1) = .33
PR(B) = ( .33/3) = .11
PR(C) = ( .33/3 + .33/1 + .33/1 ) = .77
PR(D) = ( .33/3 )= .11
Second Iteration:
PR(A) = ( .77/1) = .77
PR(B) = ( .33/3)=.11
PR(C) = ( .33/3 + .11/1 + .11/1 )=.33
PR(D) = ( .33/3 )=.11
.............
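The two iterations above can be reproduced with a short sketch of the simple (undamped) update. The function name `simple_pagerank_step` and the dictionary layout are assumptions made for illustration. The code also assumes that every page has at least one out-link, which holds for this four-page example.

```python
def simple_pagerank_step(out_links, ranks):
    """Apply one simple PageRank update to every page at once."""
    new_ranks = {page: 0.0 for page in out_links}
    for page, targets in out_links.items():
        share = ranks[page] / len(targets)  # PR(Pj) / OutDeg(Pj)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# The four-page example: A -> B, C, D; B -> C; C -> A; D -> C.
out_links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = {page: 0.33 for page in out_links}  # equal starting values

first = simple_pagerank_step(out_links, ranks)
second = simple_pagerank_step(out_links, first)
print({page: round(value, 2) for page, value in first.items()})
print({page: round(value, 2) for page, value in second.items()})
```

Each step uses only the values from the previous iteration, which matches how the second iteration above is computed from the results of the first.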
Modified PageRank Algorithm [Page & Brin, 1998]
Random Surfer
A "random surfer" is given a web page at random and keeps clicking on links, until the surfer
eventually gets bored and starts on another random page.
Dangling Page or Dangling Link
A dangling page has no out-links. This can happen when the page is a kind of file that has no
embedded links, such as a PostScript or PDF file, or when the page requires special access that
the surfer is not eligible for.
The solution (Damping factor)
d is a damping factor which can be set between 0 and 1. It models the probability that the surfer
keeps clicking on links rather than jumping to a random page, and is usually set to 0.85 (determined by experiment).
$$PR(P) = (1 - d) + d \sum_{j=1}^{n} \frac{PR(P_j)}{OutDeg(P_j)}$$
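The damped formula can be applied repeatedly until the values settle. The sketch below is one possible way to do that, not the authors' implementation. The function name `pagerank`, the starting value of 1.0 per page and the fixed iteration count are assumptions, and pages without out-links (dangling pages) are not handled.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(P) = (1 - d) + d * sum(PR(Pj) / OutDeg(Pj)) over all pages."""
    ranks = {page: 1.0 for page in out_links}  # assumed starting value
    for _ in range(iterations):
        new_ranks = {page: 1.0 - d for page in out_links}
        for page, targets in out_links.items():
            for target in targets:
                # Each page passes an equal share of its rank along every out-link.
                new_ranks[target] += d * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical usage on a four-page web: A -> B, C, D; B -> C; C -> A; D -> C.
out_links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(out_links, d=0.85)
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

With this form of the formula, a page that has no in-links at all still receives the baseline value 1 - d.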
Modified PageRank Algorithm [Page & Brin, 1998]
We regard a small web consisting of four pages A, B, C and D, where page A links to pages
B, C and D, page B links to page C, page C links to page A, and page D links to page C.
To keep the calculation simple, we set d to 0.5.

Assume the initial PageRank = 0.33 for all pages
PR(A) = 0.5 + 0.5 ( PR(C)/1 )
PR(B) = 0.5 + 0.5 ( PR(A)/3 )
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B)/1 + PR(D)/1 )
PR(D) = 0.5 + 0.5 ( PR(A)/3 )
Solving these equations, we get the following PageRank values for the single pages:
PR(A) = 6/5 = 1.2 , PR(B) = 7/10 = 0.7
PR(C) = 7/5 = 1.4 , PR(D) = 7/10 = 0.7
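Because the four equations of this example are linear, they can also be solved directly by substitution. The following worked derivation is a sketch that uses only those four equations with d = 0.5. It relies on the observation that B and D have the same single in-link (from A) and therefore the same value.

```latex
\begin{aligned}
PR(B) &= PR(D) = 0.5 + \tfrac{1}{6}\,PR(A) \\
PR(C) &= 0.5 + 0.5\left(\tfrac{1}{3}\,PR(A) + PR(B) + PR(D)\right) = 1 + \tfrac{1}{3}\,PR(A) \\
PR(A) &= 0.5 + 0.5\,PR(C) = 1 + \tfrac{1}{6}\,PR(A)
  \;\Rightarrow\; PR(A) = \tfrac{6}{5} = 1.2 \\
PR(C) &= 1 + \tfrac{1}{3}\cdot 1.2 = 1.4, \qquad
PR(B) = PR(D) = 0.5 + \tfrac{1}{6}\cdot 1.2 = 0.7
\end{aligned}
```

The four values sum to 4, the number of pages, which is a quick consistency check for this form of the formula when no page is dangling.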
