IR - ch7 - Web Search Engine


CS444: Information Retrieval and Web Search

Fall 2021

CHAPTER 7:
WEB SEARCH ENGINE AND LINK ANALYSIS
CS444 INFORMATION RETRIVAL & WEB SEARCH ENGIN BY ZAINAB AHMED MOHAMMED 2
The Bow-Tie Structure of the Web [Broder et al.]
IN: nodes that can reach the giant SCC (strongly connected component) but cannot be reached from it.
OUT: nodes that can be reached from the giant SCC but cannot reach it.
Tendrils: the “tendrils” of the bow-tie consist of
(a) the nodes reachable from IN that cannot reach the giant SCC, and
(b) the nodes that can reach OUT but cannot be reached from the giant SCC.
It is possible for a tendril node to satisfy both (a) and (b), in which case it is part of a “tube” that travels from IN to OUT without touching the giant SCC.
Disconnected: finally, there are nodes that would not have a path to the giant SCC even if the directions of the edges were ignored.
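The bow-tie regions can be computed with a few reachability searches. The following is a minimal sketch, assuming the graph is a dictionary that maps each page to the list of pages it links to, and that a `seed` page known to lie in the giant SCC is given. The function names and the small example graph are hypothetical and only serve as illustration.

```python
from collections import deque


def reachable(graph, starts):
    """Breadth-first search: every node reachable from any node in `starts`."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def reverse(graph):
    """Return the graph with every edge direction flipped."""
    reversed_graph = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph.setdefault(target, []).append(source)
    return reversed_graph


def bow_tie(graph, seed):
    """Classify nodes relative to the SCC containing `seed` (assumed giant)."""
    backward_graph = reverse(graph)
    nodes = set(backward_graph)                      # every page in the graph
    can_be_reached = reachable(graph, [seed])        # reachable from the SCC
    can_reach = reachable(backward_graph, [seed])    # able to reach the SCC
    scc = can_be_reached & can_reach
    in_part = can_reach - scc
    out_part = can_be_reached - scc
    rest = nodes - can_be_reached - can_reach
    from_in = reachable(graph, in_part)              # condition (a)
    to_out = reachable(backward_graph, out_part)     # condition (b)
    # Ignore edge directions to find nodes with no path at all to the SCC.
    undirected = {n: list(graph.get(n, ())) + backward_graph[n] for n in nodes}
    return {
        "scc": scc,
        "in": in_part,
        "out": out_part,
        "tendrils": rest & (from_in | to_out),
        "tubes": rest & from_in & to_out,
        "disconnected": nodes - reachable(undirected, [seed]),
    }


# Hypothetical toy web: a, b, c form the SCC.
example = {
    "a": ["b"], "b": ["c"], "c": ["a", "o"],
    "i": ["a", "t1", "u"], "t1": [], "t2": ["o"], "u": ["o"], "o": [],
    "x": ["y"], "y": [],
}
for region, members in bow_tie(example, "a").items():
    print(region, sorted(members))
```

Choosing the seed by hand keeps the sketch short; on real data the giant SCC would first have to be found with a strongly-connected-components algorithm.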

Web Mining:
• Discovering and extracting information from the Web
• Web mining can generally be divided into three categories:
  ◦ Web content mining: page content (text, images, ...)
  ◦ Web structure (link) mining: hyperlinks between pages
  ◦ Web usage mining: user behavior and logs
• Categories can be combined

Web Content Mining
Web content mining is the process of extracting useful information from the content of Web pages.
Web page content consists of many kinds of data, such as text, images, audio, video and hyperlinks.
It includes unstructured data such as free text, semi-structured data such as HTML or XML documents, and structured data such as the data found in tables.
There are two approaches to Web content mining: the agent-based approach and the database approach.
Web Structure (Link) Mining
Web structure mining, sometimes referred to as Web link mining, tries to generate a model that summarizes the structure of Web sites and Web pages.
This kind of Web mining represents the linked Web pages as a Web graph, where the nodes represent the Web pages and the edges represent the hyperlinks between them.
Web structure mining helps classify Web pages in terms of their relevance and importance.
It also detects similarities and relationships between different Web sites and Web pages.
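A Web graph of this kind can be held in a simple adjacency structure. The sketch below assumes the hyperlinks are already available as (source page, target page) pairs. The function name and the page names are hypothetical.

```python
def build_web_graph(links):
    """Build an adjacency list: page -> sorted list of pages it links to."""
    graph = {}
    for source, target in links:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # a page with no out-links is still a node
    return {page: sorted(targets) for page, targets in graph.items()}


# Hypothetical hyperlinks between three pages.
links = [("home", "about"), ("home", "news"), ("news", "about")]
print(build_web_graph(links))
```

Storing out-links per page keeps the structure close to how a crawler discovers links; in-links can be derived from it by reversing the edges.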

Web Usage Mining
Web usage mining applies data analysis techniques to extract the usage patterns generated by surfers on the Web, and uses them to enhance Web services such as users’ search experience.
It mines the data derived from users’ interactions with the Web.
Usage data can be found, for example, in IP addresses, the time and date of user access, user profiles, user queries, mouse clicks and scrolls, cookies, transactions, and any other data that comes from user interactions.

Web Mining: Categories
Web mining categories are used alone or in combination with each other.
They aim to enhance search engines so that search results are more relevant and important to users’ queries.
Many different page ranking algorithms have emerged that use these kinds of mining processes.
Most of these algorithms treat structure mining as the most important one, while others try to combine Web structure mining with Web content mining and Web usage mining.
Next, we will discuss in some detail the Web mining page ranking algorithms that emerged to enhance search engines.

Web Mining: Categories
Web Mining
• Web Content Mining
  ◦ Database Approach
  ◦ Agent-based Approach
• Web Link Mining
• Web Usage Mining
  ◦ General Access Pattern
  ◦ Customized Usage Tracking

Web Ranking
Rank is a value given to a web page to assign a priority to the placement of the page on the results page of a search engine.

Challenges:

1. Synonymy and polysemy (many words with one meaning, one word with many meanings)

2. The same ranking of search results cannot be right for everyone

3. The dynamic and constantly changing nature of the web

4. The problem of abundance

   a. Filtering the few that are most important

   b. Which few should the search engine recommend?


Ranking Algorithm classification
Structure Mining algorithms

• PageRank
• HITS
• SALSA

Content mining algorithms

• PCR

Usage-Based mining algorithms

• An improved Usage-based Ranking


Web Page Quality Measurement
Web page quality is an abstract measure of how authoritative a page is on its subject matter.

◦ Page Importance (graph structure)
  Popularity (inlinks), informativeness (outlinks)

◦ Page Relevance (page content)
  Word or phrase location, word frequency, anchor text

The Structure Mining
This type of mining can be performed:
◦ at the (intra-page) document level, or
◦ at the (inter-page) hyperlink level

Parts of a document (they might contribute differently to a document’s relevance!):
◦ Title: concise summary of the document
◦ Paragraph 1: likely to be an abstract of the document
◦ Paragraph 2
◦ …..
◦ Images: visual description of the document
◦ Anchor texts: references to other documents

Inter-document structure
Documents are no longer independent

Source: https://wiki.digitalmethods.net/Dmi/WikipediaAnalysis
What do the links mean?
Anchor text:
◦ How others describe the page
◦ E.g., “big blue” is a nickname of IBM, but never found on IBM’s official web site
◦ A good source for query expansion, or directly put into index

Linkage relation:
◦ Endorsement from others: utility of the page

[Figure: pages pointing to www.ibm.com, with link contexts “Armonk, NY-based computer giant IBM announced today”, “Joe’s computer hardware links: Sun, HP, IBM”, “Big Blue today announced record profits for the quarter” and “More information about IBM products can be found here”]

"PageRank-hi-res". Licensed under Creative Commons Attribution-Share Alike 2.5 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:PageRank-hi-res.png#mediaviewer/File:PageRank-hi-res.png
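Putting anchor text "directly into the index" can be sketched as follows. This is a minimal illustration, assuming each link is available as a (source page, anchor text, target page) triple. The function name, the source page names and the exact anchor strings are hypothetical.

```python
import re
from collections import defaultdict


def build_anchor_index(links):
    """Map each lowercase anchor-text term to the target pages it describes."""
    index = defaultdict(set)
    for _source, anchor_text, target in links:
        for term in re.findall(r"[a-z0-9]+", anchor_text.lower()):
            index[term].add(target)
    return index


# Hypothetical links: the anchor text describes the *target*, not the source.
links = [
    ("news.example", "Big Blue", "www.ibm.com"),
    ("shop.example", "IBM", "www.ibm.com"),
]
index = build_anchor_index(links)
print(index["blue"])  # pages described as "blue" by others
```

With such an index, a query for "big blue" can retrieve www.ibm.com even though the phrase does not occur on the page itself.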
ANALOGY (Citation Network)
Authors cite others’ work because
◦ They appreciate the intellectual value in that paper
◦ There is a certain relationship between the papers
A citation is a conferral of authority.
Bibliometrics
◦ A citation is a vote for the usefulness of that paper
◦ Citation count indicates the quality of the paper
◦ E.g., # of in-links
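Counting citations amounts to counting in-links in the citation graph. A minimal sketch, assuming citations are given as (citing paper, cited paper) pairs; the function name and paper names are hypothetical:

```python
from collections import Counter


def in_link_counts(links):
    """Count how many in-links (citations) each target receives."""
    return Counter(target for _source, target in links)


# Hypothetical citation pairs: (citing paper, cited paper).
citations = [("p1", "p3"), ("p2", "p3"), ("p3", "p4"), ("p1", "p4"), ("p2", "p4")]
for paper, count in in_link_counts(citations).most_common():
    print(paper, count)
```

Every citation counts equally here, regardless of who makes it.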

Citation in the web environment
Adding a hyperlink costs almost nothing
◦ Taken advantage of by web spammers
◦ Large volumes of machine-generated pages artificially increase the “in-links” of a target page
◦ Fake or invisible links

We should consider not only the count of in-links, but also the quality of each in-link
◦ PageRank
◦ HITS

CS 6501: INFORMATION RETRIEVAL 17


Link analysis (Structure analysis)
Describes the characteristics of the network structure.
Reflects the utility of a web document in a general sense.
Used to aid search by exploiting the web graph.
It was one of the primary reasons for Google’s initial success.



Link analysis algorithms
PageRank
HITS
SALSA

PageRank [Page & Brin, 1998]

PageRank was presented and published by Sergey Brin and Larry Page at the Seventh International World Wide Web Conference (WWW7) in April 1998.

PageRank [Page & Brin, 1998]
PageRank is a link analysis algorithm which assigns numerical scores to web
pages using the importance of web pages as a measurement.
PageRank is a “vote”, by all the other pages on the Web, about how important
a page is.
“A web page is important if it is pointed to by other important pages.”
A page can have a high PageRank
◦ If there are many pages that point to it
◦ Or if there are some pages that point to it and have a high PageRank.

[Figure: example web graph with pages labeled by their PageRank values, such as 100, 53, 50, 9 and 3]

PageRank [Page & Brin, 1998]
The PageRank value of each page can be regarded as its status in the network.

From the perspective of status, the PageRank algorithm is derived from the following observations.
A hyperlink from a page pointing to another page is an implicit conveyance of authority to the target page. Thus, the more in-links that a page i receives, the more status page i has.

Pages that point to page i also have their own status scores. A page with a higher status score pointing to i is more important than a page with a lower status score pointing to i.
In other words, a page is important if it is pointed to by other important pages.

PageRank [Page & Brin, 1998]
In-links of page i: these are the hyperlinks that point to page i from other pages. Usually, hyperlinks from the same site are not considered.
Out-links of page i: these are the hyperlinks that point out to other pages from page i. Usually, links to pages of the same site are not considered.

Simple PageRank Algorithm [Page & Brin, 1998]

The PageRank of a web page is therefore calculated as the sum, over all pages linking to it (its in-links), of each linking page’s PageRank divided by that page’s number of out-links.

For a page P with three in-linking pages P_1, P_2 and P_3:

$$PR(P) = \frac{PR(P_1)}{OutDeg(P_1)} + \frac{PR(P_2)}{OutDeg(P_2)} + \frac{PR(P_3)}{OutDeg(P_3)}$$

In general:

$$PR(P) = \sum_{j=1}^{n} \frac{PR(P_j)}{OutDeg(P_j)}$$

Where:
• PR(P) is the PageRank of page P,
• OutDeg(P_j) is the number of outbound links on page P_j,
• n is the number of in-links of page P.

The PageRank of page P is recursively defined by the PageRank of those pages which link to page P.
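The sum in the formula can be written as a one-line function. This is only a sketch of a single evaluation of the formula for one page; the function name, the dictionary layout and the numeric values are assumptions chosen for illustration.

```python
def simple_pagerank_of(page, in_links, out_degree, rank):
    """PR(page) = sum over in-linking pages Pj of PR(Pj) / OutDeg(Pj)."""
    return sum(rank[pj] / out_degree[pj] for pj in in_links[page])


# Hypothetical values: P is linked from P1, P2 and P3.
in_links = {"P": ["P1", "P2", "P3"]}
out_degree = {"P1": 3, "P2": 1, "P3": 2}
rank = {"P1": 0.3, "P2": 0.2, "P3": 0.4}
print(simple_pagerank_of("P", in_links, out_degree, rank))
```

Because the ranks of P1, P2 and P3 depend in turn on their own in-links, the formula has to be evaluated repeatedly for all pages until the values settle.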

Simple PageRank Algorithm [Page & Brin, 1998]

EXAMPLE:
We regard a small web consisting of four pages A, B, C and D, whereby page A links to pages B, C and D, page B links to page C, page C links to page A and page D links to page C.
The simple algorithm has no damping factor d, so each page’s PageRank is just the sum of the shares it receives from its in-links:

PR(A) = ( PR(C)/1)
PR(B) = ( PR(A)/3)
PR(C) = ( PR(A)/3 + PR(B)/1 + PR(D)/1 )
PR(D) = ( PR(A)/3 )

Simple PageRank Algorithm [Page & Brin, 1998]
At the start, we do not know the PageRank values of the pages, so we assume equal values (.33).

We get the following PageRank values for the individual pages:

First Iteration:
PR(A) = ( .33/1) = .33
PR(B) = ( .33/3) = .11
PR(C) = ( .33/3 + .33/1 + .33/1 ) = .77
PR(D) = ( .33/3 )= .11

Second Iteration:
PR(A) = ( .77/1) = .77
PR(B) = ( .33/3)=.11
PR(C) = ( .33/3 + .11/1 + .11/1 )=.33
PR(D) = ( .33/3 )=.11
.............
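The two iterations above can be reproduced with a short loop. The sketch below assumes the four-page web is stored as a dictionary from each page to its out-links, and that every page is updated from the values of the previous iteration. The function name is hypothetical.

```python
def simple_pagerank(graph, initial, iterations):
    """Iterate PR(P) = sum of PR(Pj)/OutDeg(Pj); return the ranks after each round."""
    ranks = {page: initial for page in graph}
    history = []
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in graph}
        for source, targets in graph.items():
            if not targets:
                continue  # a page without out-links passes on nothing
            share = ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
        history.append(ranks)
    return history


# The four-page web from the example: A -> B, C, D; B -> C; C -> A; D -> C.
graph = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
for number, ranks in enumerate(simple_pagerank(graph, 0.33, 2), start=1):
    print(number, {page: round(value, 2) for page, value in ranks.items()})
```

Rounded to two decimals, the printed values match the first and second iteration listed above.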

Modified PageRank Algorithm [Page & Brin, 1998]

Random Surfer
A "random surfer" is given a web page at random and keeps clicking on links, but eventually gets bored and starts again on another random page.
Dangling Page or Dangling Link
A dangling page has no out-links. This can happen when the page is a kind of file that has no embedded links, such as a PostScript or PDF file, or when the page requires special access that the surfer is not eligible for.
The solution (Damping factor)
d is a damping factor which can be set between 0 and 1. It is the probability that the surfer keeps clicking on links rather than jumping to a random page, and is usually set to 0.85 (by experiment).

$$PR(P) = (1 - d) + d \sum_{j=1}^{n} \frac{PR(P_j)}{OutDeg(P_j)}$$
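The damped formula can be iterated in the same way as the simple one. The following is a minimal sketch, assuming the graph is a dictionary from each page to its out-links. The function name, the default number of iterations and the choice to let dangling pages pass on no rank are assumptions of this sketch, not part of the slides.

```python
def pagerank(graph, d=0.85, initial=1.0, iterations=50):
    """Iterate PR(P) = (1 - d) + d * sum of PR(Pj)/OutDeg(Pj) for all pages."""
    ranks = {page: initial for page in graph}
    for _ in range(iterations):
        incoming = {page: 0.0 for page in graph}
        for source, targets in graph.items():
            if not targets:
                continue  # assumption: a dangling page distributes no rank
            share = ranks[source] / len(targets)
            for target in targets:
                incoming[target] += share
        # Every page gets the base amount (1 - d) plus the damped incoming shares.
        ranks = {page: (1 - d) + d * incoming[page] for page in graph}
    return ranks


# The four-page web used earlier: A -> B, C, D; B -> C; C -> A; D -> C.
graph = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph, d=0.85))
```

Thanks to the (1 - d) term, every page keeps a non-zero score even if it has no in-links, and the repeated updates settle on stable values.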

Modified PageRank Algorithm [Page & Brin, 1998]
We regard a small web consisting of four pages A, B, C and D, whereby page A links to pages B, C and D, page B links to page C, page C links to page A and page D links to page C.

To keep the calculation simple, we set d to 0.5.

Assume the initial PageRank = .33 for all pages
PR(A) = 0.5 + 0.5 ( PR(C)/1 )
PR(B) = 0.5 + 0.5 ( PR(A)/3 )
PR(C) = 0.5 + 0.5 ( PR(A)/3 + PR(B)/1 + PR(D)/1 )
PR(D) = 0.5 + 0.5 ( PR(A)/3 )
Solving these equations (or iterating until the values stop changing), we get the following PageRank values for the individual pages:
PR(A) = 6/5 = 1.2 , PR(B) = 7/10 = 0.7
PR(C) = 7/5 = 1.4 , PR(D) = 7/10 = 0.7
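The four equations form a small linear system that can be solved by substitution. The following is a sketch of the algebra for d = 0.5; the result can be checked by substituting the values back into the four equations.

```latex
\begin{aligned}
PR(B) &= PR(D) = \tfrac{1}{2} + \tfrac{1}{6}\,PR(A) \\
PR(C) &= \tfrac{1}{2} + \tfrac{1}{6}\,PR(A) + \tfrac{1}{2}\,PR(B) + \tfrac{1}{2}\,PR(D)
       = 1 + \tfrac{1}{3}\,PR(A) \\
PR(A) &= \tfrac{1}{2} + \tfrac{1}{2}\,PR(C) = 1 + \tfrac{1}{6}\,PR(A)
       \;\Rightarrow\; PR(A) = \tfrac{6}{5} = 1.2 \\
PR(C) &= 1 + \tfrac{1}{3}\cdot\tfrac{6}{5} = \tfrac{7}{5} = 1.4 \\
PR(B) &= PR(D) = \tfrac{1}{2} + \tfrac{1}{6}\cdot\tfrac{6}{5} = \tfrac{7}{10} = 0.7
\end{aligned}
```

The four values sum to 4, the number of pages, which is a quick sanity check for this form of the formula when no page is dangling.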

