IR - ch6 - Web Crawler


CS444: Information Retrieval and Web Search
Fall 2021

CHAPTER 6: WEB STRUCTURE AND CRAWLER (SPIDER)
Abstraction of search engine architecture
[Figure: abstraction of search engine architecture. A crawler builds the indexed corpus; a doc analyzer produces doc representations that the indexer stores in the index; the user's query becomes a query representation; the ranker matches it against the index to return results; evaluation and feedback close the loop.]
Web Structure Introduction
 The World Wide Web is a global information system distributed across numerous
Web sites around the world.
 Web servers can host millions of pages, which makes the total number of
web pages extremely difficult to track.
 The Web resembles thousands of interconnected, intertwined cells organized
in a complex structure.
 Each Web site contains a number of Web pages.
A Web page consists of the following three parts:
the body of the page,
the hypertext markup language (HTML) that describes it, and
the hyperlinks between Web pages.
The Structure of the Web
IR systems focus on the information provided by:
◦ The text of the Web page.
◦ The hyperlinks that connect different Web pages.

The Web can be seen as a directed graph:

◦ The nodes are the Web pages.
◦ The edges between nodes are the hyperlinks between Web pages.
◦ This directed graph structure is called the Web Graph.

What is a Graph?
A graph (undirected) is a data structure that consists of a set of
vertices V and a set of edges E that connect (some of) them; it is
denoted G(V, E).
[Figure: an example undirected graph with five labelled nodes and six edges]
V = {1, 2, 3, 4, 5}
E = {(1,2), (1,3), (2,3), (2,4), (3,5), (4,5)}
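A minimal sketch of how this example graph could be represented in Python, using the V and E sets above and an adjacency list built from them:

# Example (undirected) graph from the slide, as plain Python sets
V = {1, 2, 3, 4, 5}
E = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5)}

# Build an adjacency list: each vertex maps to the set of its neighbours
adj = {v: set() for v in V}
for u, w in E:
    adj[u].add(w)
    adj[w].add(u)   # undirected: record the edge in both directions

print(adj[2])       # {1, 3, 4}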

What is a Graph?
A Directed Graph (digraph) is a graph where each edge has a
direction.
[Figure: an example directed graph with five labelled nodes and six directed edges]
V = {1, 2, 3, 4, 5}
E = {(1,2), (1,3), (3,2), (4,2), (3,5), (5,4)}
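The same idea works for a directed graph, and it is essentially how the Web graph is stored: each node keeps only its outgoing links. A minimal sketch for the example above:

# Example directed graph from the slide: only out-edges are stored per node
V = {1, 2, 3, 4, 5}
E = {(1, 2), (1, 3), (3, 2), (4, 2), (3, 5), (5, 4)}

out_links = {v: set() for v in V}
for u, w in E:
    out_links[u].add(w)   # directed: only u -> w, not w -> u

print(out_links[3])       # {2, 5}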

Graph Connectivity
Connected graph:
A graph is said to be connected if there is at least one path from every vertex to every other
vertex in the graph.
Strongly connected graph:
A directed graph is said to be strongly connected if there is a directed path from every vertex to
every other vertex in the graph.
[Figure: a connected graph (left) and a strongly connected graph (right), each with vertices 1 through 5]
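A small sketch, assuming the out_links adjacency dict shown earlier, of how strong connectivity can be checked: the graph is strongly connected exactly when every vertex can reach every other vertex.

from collections import deque

def reachable(adj, start):
    # Standard BFS: return the set of vertices reachable from start
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def is_strongly_connected(adj):
    # Every vertex must reach every other vertex
    return all(reachable(adj, v) == set(adj) for v in adj)

print(is_strongly_connected(out_links))   # False for the example digraph above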

Web Crawler (Spider, Robot, Bot, aggregator)
An automatic program that systematically browses the web for the
purpose of Web content indexing and updating.
An automatic program that exploits the graph structure of the web,
fetches pages, and moves from one page to another following links.

How does CRAWLER work? The basic Algorithm

Given a set of initial URLs (Uniform Resource Locators),
the crawler downloads all the corresponding web pages,
extracts the outgoing hyperlinks from those pages,
and recursively downloads those pages in turn.

How does CRAWLER work? The Algorithm
Initialize queue (Q) with the initial set of known URLs.
Until Q is empty or the page or time limit is exhausted:
  Pop URL, L, from the front of Q.
  If L is not an HTML page (.gif, .jpeg, .pdf, .ppt, …), continue loop (get next URL).
  If L has already been visited, continue loop (get next URL).
  Download page, P, for L.
  If P cannot be downloaded (e.g. 404 error, robot excluded), continue loop (get next URL).
  Index P (e.g. add to the inverted index or store a cached copy).
  Parse P to obtain a list of new links, N.
  Append N to the end of Q.
How does CRAWLER work? In Pseudocode
# Python-style pseudocode; the helper functions (is_visited, is_legal,
# check_robots_txt, fetch, list_of_anchors, set_visited, insert_to_index)
# are assumed to be defined elsewhere.
def crawler(entry_point):
    url_list = [entry_point]                  # frontier: which page to visit next?
    while len(url_list) > 0:
        url = url_list.pop()
        # Is it visited already? Is the URL legal? Is access granted by robots.txt?
        if is_visited(url) or not is_legal(url) or not check_robots_txt(url):
            continue
        html = fetch(url)                     # download the page
        for anchor in html.list_of_anchors(): # extract outgoing links
            url_list.append(anchor)
        set_visited(url)                      # remember we have been here
        insert_to_index(html)                 # index the page content

Crawler’s Challenges
Scale: crawling the Web requires a large number of computers and high-bandwidth
networks.
Content selection trade-offs: the crawler should ensure coverage of quality content and maintain a
balance between coverage and freshness.
Follow rules: the crawler should not repeatedly download a web page just to maintain
freshness; a time interval must be maintained between consecutive requests to the same page
(see the delay sketch after this list). Some websites do not want the crawler to download
particular portions of the site.
Dodge misleads: some web pages have near-duplicate content or are just mirrors of well-known
pages. Some web pages appear to be constantly changing, but mostly this is due to
ads or comments by readers.
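A minimal sketch of enforcing a time interval between consecutive requests to the same host; the one-second delay is an illustrative choice, not a prescribed value:

import time
from urllib.parse import urlparse

DELAY = 1.0                 # assumed minimum seconds between hits to one host
last_hit = {}               # host -> time of the previous request

def polite_wait(url):
    host = urlparse(url).netloc
    wait = DELAY - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)    # respect the per-host interval
    last_hit[host] = time.time()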

Crawler Architecture
URL Frontier: contains the list of URLs that are yet to be fetched.
Fetch module: actually fetches the web page.
DNS resolution module: determines the address of the server from which a web page
has to be fetched.
Parsing module: extracts text and links from the fetched web page.
Content Seen?: tests whether a web page with the same content has already been
seen at another URL; this requires a way to compute a fingerprint of a web page
(see the sketch after this list).
URL Filter: decides whether an extracted URL should be excluded from the frontier
(e.g. by robots.txt).
Duplicate elimination module: eliminates URLs that are already present in the URL Frontier.
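One simple way to realise the "Content Seen?" test is to hash each page's normalised text and remember the hashes; a sketch using Python's hashlib (real systems typically use more robust fingerprints such as shingles or simhash to catch near-duplicates):

import hashlib

seen_fingerprints = set()

def content_seen(page_text):
    # Collapse whitespace, then hash the content
    fp = hashlib.sha1(" ".join(page_text.split()).encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True          # same content already seen at some other URL
    seen_fingerprints.add(fp)
    return False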

Visiting Strategies (Breadth-first crawling)
Breadth-first explores uniformly outward from the root page but requires memory of all nodes
on the previous level (exponential in depth). It is the standard crawling method.
Implemented with a QUEUE (FIFO); a sketch follows below.
Finds pages along the shortest path from the seed pages.
If we start with "good" pages, this keeps us close to other good pages.
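A sketch of a FIFO frontier using collections.deque; links are taken from the front, so pages are visited level by level. The fetch and extract_links callables are assumed stand-ins for the download and parse steps:

from collections import deque

def bfs_crawl(seeds, fetch, extract_links, limit=100):
    # fetch(url) -> page content, extract_links(page) -> iterable of URLs
    frontier = deque(seeds)          # FIFO queue
    visited = set()
    while frontier and len(visited) < limit:
        url = frontier.popleft()     # take from the FRONT: breadth-first
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        frontier.extend(extract_links(page))   # new links go to the back
    return visited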

Visiting Strategies (Depth-first crawling)
Depth-first requires memory of only depth times branching factor (linear in depth) but can get
"lost" pursuing a single thread of links.
Implemented with a STACK (LIFO); a sketch follows below.
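Only the frontier discipline changes for depth-first crawling; using a list as a stack, the most recently discovered link is expanded first (same assumed fetch and extract_links helpers as in the breadth-first sketch):

def dfs_crawl(seeds, fetch, extract_links, limit=100):
    frontier = list(seeds)           # used as a LIFO stack
    visited = set()
    while frontier and len(visited) < limit:
        url = frontier.pop()         # take from the END: depth-first
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        frontier.extend(extract_links(page))
    return visited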

Depth-first VS. Breadth-first crawling
 depth-first goes off into one branch until it reaches a leaf node
◦ not good if the goal node is on another branch
◦ neither complete nor optimal
◦ uses much less space than breadth-first
◦ far fewer visited nodes to keep track of (smaller fringe)
 breadth-first is more careful by checking all alternatives
◦ complete and optimal
◦ very memory-intensive

Crawling Policies
Selection Policy: states which pages to download (relevant pages).
Re-visit Policy: states when to check for changes to the pages (the Web is dynamic).
Politeness Policy: states how to avoid overloading Web sites (robots
exclusion protocol).
Parallelization Policy: states how to coordinate distributed Web crawlers
(running multiple processes).

Robots Exclusion Protocol
A convention for preventing Web crawlers from accessing all or parts of a website.
Web sites and pages can specify that robots should not crawl/index certain areas.
Two components:
Robots Exclusion Protocol (robots.txt): Site-wide specification of excluded directories.
◦ Site administrator puts a “robots.txt” file at the root of the host’s web directory.
◦ http://www.ebay.com/robots.txt
◦ http://www.cnn.com/robots.txt
◦ http://clgiles.ist.psu.edu/robots.txt

Robots META Tag: Individual document tag to exclude indexing or following links.
The robots.txt file is a list of excluded directories for a given robot (user-agent).
◦ Exclude all robots from the entire site:
◦ User-agent: *
◦ Disallow: /
Robot Exclusion Protocol Examples
Exclude specific directories:
◦ User-agent: *
◦ Disallow: /tmp/
◦ Disallow: /cgi-bin/
◦ Disallow: /users/paranoid/

Exclude a specific robot:


◦ User-agent: GoogleBot
◦ Disallow: /

Allow a specific robot:


◦ User-agent: GoogleBot
◦ Disallow:
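Python's standard library can read and apply these rules; a minimal sketch using urllib.robotparser (example.com is a placeholder site):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()                                   # fetch and parse the rules

# Ask whether a given user-agent may fetch a given URL
print(rp.can_fetch("GoogleBot", "https://www.example.com/"))
print(rp.can_fetch("*", "https://www.example.com/tmp/page.html"))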

Focused Crawling
A focused crawler is a variation of the basic crawler that selectively
collects web pages satisfying certain properties.
For example,
◦ if we need to crawl web pages only from '.in' domain or
◦ only of a particular language like Hindi or
◦ pertaining to a specific topic like Tourism, we need to employ a focused
crawler.
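A minimal sketch of the selection test such a crawler might apply before adding a URL to the frontier, here keeping only pages from the '.in' domain (the URLs are illustrative):

from urllib.parse import urlparse

def in_scope(url):
    # Focused crawl restricted to the '.in' top-level domain
    host = urlparse(url).netloc
    return host.endswith(".in")

print(in_scope("https://tourism.gov.in/places"))   # True
print(in_scope("https://www.example.com/india"))   # False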

Distributed Crawling
A distributed computing technique whereby search engines
employ many computers to index the Internet via web
crawling.
The idea is to spread the required computation and bandwidth across many
computers and networks.
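One common way to split the work is to assign each host to a single crawler node, e.g. by hashing the hostname; a sketch in which the node count of 4 is an arbitrary assumption:

import hashlib
from urllib.parse import urlparse

NUM_NODES = 4    # assumed number of crawler machines

def assign_node(url):
    # Hash the host so all URLs of one site go to the same crawler node,
    # which also makes per-host politeness easy to enforce
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES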

