IR - ch6 - Web Crawler


CS444: Information Retrieval and Web Search
Fall 2021

CHAPTER 6: WEB STRUCTURE AND CRAWLER (SPIDER)
Abstraction of search engine architecture
[Figure: abstraction of search engine architecture. A crawler builds the indexed corpus; a doc analyzer produces doc representations that the indexer stores in the index; the user's query becomes a query representation; the ranker matches it against the index to return results; evaluation and feedback close the loop.]
Web Structure Introduction
 The World Wide Web is a global information system distributed across numerous
Web sites around the world.
 Web servers can host millions of pages, which makes the total number of
web pages extremely difficult to track.
 The Web resembles thousands of interconnected, intertwined cells organized
in a complex structure.
 Each Web site contains a number of Web pages.
A Web page consists of the following three parts:
the body of the page,
the hypertext markup language (HTML) that describes it, and
the hyperlinks between Web pages.
The Structure of the Web
IR systems focus on the information provided by:
◦ The text of the Web page.
◦ The hyperlinks that connect different Web pages.

The Web can be seen as a directed graph:

◦ The nodes are the Web pages.
◦ The edges between nodes are the hyperlinks between Web pages.
◦ This directed graph structure is called the Web Graph.

What is a Graph?
A graph (undirected) is a data structure that consists of a set of
vertices V and a set of edges E that connect (some of) them; it is
denoted G(V, E).
[Figure: an example undirected graph with five labelled nodes and six edges]
V = {1, 2, 3, 4, 5}
E = {(1,2), (1,3), (2,3), (2,4), (3,5), (4,5)}
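A minimal sketch of how this example graph could be represented in Python, using the V and E sets above and an adjacency list built from them:

# Example (undirected) graph from the slide, as plain Python sets
V = {1, 2, 3, 4, 5}
E = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5)}

# Build an adjacency list: each vertex maps to the set of its neighbours
adj = {v: set() for v in V}
for u, w in E:
    adj[u].add(w)
    adj[w].add(u)   # undirected: record the edge in both directions

print(adj[2])       # {1, 3, 4}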

What is a Graph?
A Directed Graph (digraph) is a graph where each edge has a
direction.
[Figure: an example directed graph with five labelled nodes and six directed edges]
V = {1, 2, 3, 4, 5}
E = {(1,2), (1,3), (3,2), (4,2), (3,5), (5,4)}
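The same idea works for a directed graph, and it is essentially how the Web graph is stored: each node keeps only its outgoing links. A minimal sketch for the example above:

# Example directed graph from the slide: only out-edges are stored per node
V = {1, 2, 3, 4, 5}
E = {(1, 2), (1, 3), (3, 2), (4, 2), (3, 5), (5, 4)}

out_links = {v: set() for v in V}
for u, w in E:
    out_links[u].add(w)   # directed: only u -> w, not w -> u

print(out_links[3])       # {2, 5}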

Graph Connectivity
Connected graph:
A graph is said to be connected if there is at least one path from every vertex to every other
vertex in the graph.
Strongly connected graph:
A directed graph is said to be strongly connected if there is a directed path from every vertex to
every other vertex in the graph.
[Figure: a connected graph (left) and a strongly connected graph (right), each with vertices 1 through 5]
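A small sketch, assuming the out_links adjacency dict shown earlier, of how strong connectivity can be checked: the graph is strongly connected exactly when every vertex can reach every other vertex.

from collections import deque

def reachable(adj, start):
    # Standard BFS: return the set of vertices reachable from start
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def is_strongly_connected(adj):
    # Every vertex must reach every other vertex
    return all(reachable(adj, v) == set(adj) for v in adj)

print(is_strongly_connected(out_links))   # False for the example digraph above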

Web Crawler (Spider, Robot, Bot, aggregator)
An automatic program that systematically browses the web for the
purpose of Web content indexing and updating.
An automatic program that exploits the graph structure of the web,
fetches pages, and moves from one page to another following links.

How does CRAWLER work? The basic Algorithm

Given a set of initial URLs (Uniform Resource Locators),
the crawler downloads all the corresponding web pages,
extracts the outgoing hyperlinks from those pages,
and recursively downloads those pages in turn.

How does CRAWLER work? The Algorithm
Initialize queue (Q) with the initial set of known URLs.
Until Q is empty or the page or time limit is exhausted:
  Pop URL, L, from the front of Q.
  If L is not an HTML page (.gif, .jpeg, .pdf, .ppt, …), continue loop (get next URL).
  If L has already been visited, continue loop (get next URL).
  Download page, P, for L.
  If P cannot be downloaded (e.g. 404 error, robot excluded), continue loop (get next URL).
  Index P (e.g. add to the inverted index or store a cached copy).
  Parse P to obtain a list of new links, N.
  Append N to the end of Q.
How does CRAWLER work? In Pseudocode
# Python-style pseudocode; the helper functions (is_visited, is_legal,
# check_robots_txt, fetch, list_of_anchors, set_visited, insert_to_index)
# are assumed to be defined elsewhere.
def crawler(entry_point):
    url_list = [entry_point]                  # frontier: which page to visit next?
    while len(url_list) > 0:
        url = url_list.pop()
        # Is it visited already? Is the URL legal? Is access granted by robots.txt?
        if is_visited(url) or not is_legal(url) or not check_robots_txt(url):
            continue
        html = fetch(url)                     # download the page
        for anchor in html.list_of_anchors(): # extract outgoing links
            url_list.append(anchor)
        set_visited(url)                      # remember we have been here
        insert_to_index(html)                 # index the page content

Crawler’s Challenges
Scale: crawling the Web requires a large number of computers and high-bandwidth
networks.
Content selection trade-offs: the crawler should ensure coverage of quality content and maintain a
balance between coverage and freshness.
Follow rules: the crawler should not repeatedly download a web page just to maintain
freshness; a time interval must be maintained between consecutive requests to the same page
(see the delay sketch after this list). Some websites do not want the crawler to download
particular portions of the site.
Dodge misleads: some web pages have near-duplicate content or are just mirrors of well-known
pages. Some web pages appear to be constantly changing, but mostly this is due to
ads or comments by readers.
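A minimal sketch of enforcing a time interval between consecutive requests to the same host; the one-second delay is an illustrative choice, not a prescribed value:

import time
from urllib.parse import urlparse

DELAY = 1.0                 # assumed minimum seconds between hits to one host
last_hit = {}               # host -> time of the previous request

def polite_wait(url):
    host = urlparse(url).netloc
    wait = DELAY - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)    # respect the per-host interval
    last_hit[host] = time.time()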

Crawler Architecture
URL Frontier: contains the list of URLs that are yet to be fetched.
Fetch module: actually fetches the web page.
DNS resolution module: determines the address of the server from which a web page
has to be fetched.
Parsing module: extracts text and links from the fetched web page.
Content Seen?: tests whether a web page with the same content has already been
seen at another URL; this requires a way to compute a fingerprint of a web page
(see the sketch after this list).
URL Filter: decides whether an extracted URL should be excluded from the frontier
(e.g. by robots.txt).
Duplicate elimination module: eliminates URLs that are already present in the URL Frontier.
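One simple way to realise the "Content Seen?" test is to hash each page's normalised text and remember the hashes; a sketch using Python's hashlib (real systems typically use more robust fingerprints such as shingles or simhash to catch near-duplicates):

import hashlib

seen_fingerprints = set()

def content_seen(page_text):
    # Collapse whitespace, then hash the content
    fp = hashlib.sha1(" ".join(page_text.split()).encode("utf-8")).hexdigest()
    if fp in seen_fingerprints:
        return True          # same content already seen at some other URL
    seen_fingerprints.add(fp)
    return False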

Visiting Strategies (Breadth-first crawling)
Breadth-first explores uniformly outward from the root page but requires memory of all nodes
on the previous level (exponential in depth). It is the standard crawling method.
Implemented with a QUEUE (FIFO); a sketch follows below.
Finds pages along the shortest path from the seed pages.
If we start with "good" pages, this keeps us close to other good pages.
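A sketch of a FIFO frontier using collections.deque; links are taken from the front, so pages are visited level by level. The fetch and extract_links callables are assumed stand-ins for the download and parse steps:

from collections import deque

def bfs_crawl(seeds, fetch, extract_links, limit=100):
    # fetch(url) -> page content, extract_links(page) -> iterable of URLs
    frontier = deque(seeds)          # FIFO queue
    visited = set()
    while frontier and len(visited) < limit:
        url = frontier.popleft()     # take from the FRONT: breadth-first
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        frontier.extend(extract_links(page))   # new links go to the back
    return visited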

Visiting Strategies (Depth-first crawling)
Depth-first requires memory of only depth times branching factor (linear in depth) but can get
"lost" pursuing a single thread of links.
Implemented with a STACK (LIFO); a sketch follows below.
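Only the frontier discipline changes for depth-first crawling; using a list as a stack, the most recently discovered link is expanded first (same assumed fetch and extract_links helpers as in the breadth-first sketch):

def dfs_crawl(seeds, fetch, extract_links, limit=100):
    frontier = list(seeds)           # used as a LIFO stack
    visited = set()
    while frontier and len(visited) < limit:
        url = frontier.pop()         # take from the END: depth-first
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        frontier.extend(extract_links(page))
    return visited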

Depth-first VS. Breadth-first crawling
 depth-first goes off into one branch until it reaches a leaf node
◦ not good if the goal node is on another branch
◦ neither complete nor optimal
◦ uses much less space than breadth-first
◦ far fewer visited nodes to keep track of (smaller fringe)
 breadth-first is more careful by checking all alternatives
◦ complete and optimal
◦ very memory-intensive

Crawling Policies
Selection Policy: states which pages to download (relevant pages).
Re-visit Policy: states when to check for changes to the pages (the Web is dynamic).
Politeness Policy: states how to avoid overloading Web sites (robots
exclusion protocol).
Parallelization Policy: states how to coordinate distributed Web crawlers
(running multiple processes).

Robots Exclusion Protocol
A convention for preventing Web crawlers from accessing all or parts of a website.
Web sites and pages can specify that robots should not crawl/index certain areas.
Two components:
Robots Exclusion Protocol (robots.txt): Site-wide specification of excluded directories.
◦ Site administrator puts a “robots.txt” file at the root of the host’s web directory.
◦ http://www.ebay.com/robots.txt
◦ http://www.cnn.com/robots.txt
◦ http://clgiles.ist.psu.edu/robots.txt

Robots META Tag: Individual document tag to exclude indexing or following links.
The robots.txt file is a list of excluded directories for a given robot (user-agent).
◦ Exclude all robots from the entire site:
◦ User-agent: *
◦ Disallow: /
Robot Exclusion Protocol Examples
Exclude specific directories:
◦ User-agent: *
◦ Disallow: /tmp/
◦ Disallow: /cgi-bin/
◦ Disallow: /users/paranoid/

Exclude a specific robot:


◦ User-agent: GoogleBot
◦ Disallow: /

Allow a specific robot:


◦ User-agent: GoogleBot
◦ Disallow:
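Python's standard library can read and apply these rules; a minimal sketch using urllib.robotparser (example.com is a placeholder site):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()                                   # fetch and parse the rules

# Ask whether a given user-agent may fetch a given URL
print(rp.can_fetch("GoogleBot", "https://www.example.com/"))
print(rp.can_fetch("*", "https://www.example.com/tmp/page.html"))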

Focused Crawling
A focused crawler is a variation of the basic crawler that selectively
collects web pages satisfying certain properties.
For example,
◦ if we need to crawl web pages only from '.in' domain or
◦ only of a particular language like Hindi or
◦ pertaining to a specific topic like Tourism, we need to employ a focused
crawler.
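A minimal sketch of the selection test such a crawler might apply before adding a URL to the frontier, here keeping only pages from the '.in' domain (the URLs are illustrative):

from urllib.parse import urlparse

def in_scope(url):
    # Focused crawl restricted to the '.in' top-level domain
    host = urlparse(url).netloc
    return host.endswith(".in")

print(in_scope("https://tourism.gov.in/places"))   # True
print(in_scope("https://www.example.com/india"))   # False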

Distributed Crawling
A distributed computing technique whereby search engines
employ many computers to index the Internet via web
crawling.
The idea is to spread the required computation and bandwidth across many
computers and networks.
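One common way to split the work is to assign each host to a single crawler node, e.g. by hashing the hostname; a sketch in which the node count of 4 is an arbitrary assumption:

import hashlib
from urllib.parse import urlparse

NUM_NODES = 4    # assumed number of crawler machines

def assign_node(url):
    # Hash the host so all URLs of one site go to the same crawler node,
    # which also makes per-host politeness easy to enforce
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES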

