Lecture 39
WebCrawler
Introduction
What is WebCrawler?
• The WebCrawler performs the following actions after “retrieving” a document:
  - It flags the document as being “retrieved”.
  - It looks for any outbound hyperlinks in the document.
  - It indexes the content of the document.
• All of these actions require some information to be stored in a database (see the sketch below).
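
A minimal Python sketch of this per-document pipeline, assuming a hypothetical db object standing in for the database interface (mark_retrieved, store_links, and index_document are invented names, not WebCrawler's actual code):

import re

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def process_retrieved(url, html, db):
    db.mark_retrieved(url)            # flag the document as "retrieved"
    links = LINK_RE.findall(html)     # look for outbound hyperlinks
    db.store_links(url, links)        # record them in the database
    db.index_document(url, html)      # index the document's content
    return links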
Components in WebCrawler
[Architecture diagram showing the components of WebCrawler: Agents, libWWW, Internet, Search Engine, Query Server, Database]
About the Components
- Query Server
  Responsible for handling the query-processing service.
- libWWW
  This is the CERN WWW library, used by agents to access several different kinds of content over different protocols.
Search Engine
• Basic requirement:
  - WebCrawler discovers new documents:
    Start with a known set of documents.
    Examine the hyperlinks in them that point to other documents.
    Repeat the whole process recursively.
  - An alternate way of looking at the problem (see the sketch after this list):
    The web is a huge directed graph, with documents as vertices and hyperlinks as edges.
    We need to explore this graph using a suitable graph-traversal algorithm.
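
As a rough illustration of the traversal view, here is a minimal breadth-first sketch in Python; fetch() and extract_links() are assumed helper functions, not WebCrawler's actual API:

from collections import deque

def discover(seed_urls, fetch, extract_links, limit=1000):
    seen = set(seed_urls)             # the known set of documents
    frontier = deque(seed_urls)
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        for link in extract_links(fetch(url)):   # outgoing edges
            if link not in seen:
                seen.add(link)
                frontier.append(link)            # repeat recursively
    return seen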
• Some points about the search engine:
  - The search engine is responsible for deciding:
    which documents to visit, and
    which types of documents to visit.
  - There is no point in retrieving files that the WebCrawler cannot index (images, audio clips, PostScript files, etc.); even if they are retrieved, they are ignored during indexing. A simple pre-retrieval filter is sketched below.
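
A minimal sketch of such a filter; the extension list is illustrative, not WebCrawler's actual rule set:

SKIP_EXTENSIONS = ('.gif', '.jpg', '.au', '.wav', '.ps')

def worth_retrieving(url):
    # Skip URLs whose content the indexer cannot use anyway.
    return not url.lower().endswith(SKIP_EXTENSIONS)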
WebCrawler: Indexing Mode
• Basic motivation:
  - Try to build an index of as much of the web as possible.
  - A heuristic is needed: which documents should be selected when the space for storing indices is limited? A reasonable approach is to ensure that the indexed documents come from as many different servers as possible.
  - WebCrawler uses a modified breadth-first approach to ensure that every server has at least one document that has been indexed (see the sketch below).
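
One plausible reading of this server-fair ordering, sketched in Python under the assumption that fairness is achieved by taking one pending URL per server per round (the details are assumptions, not WebCrawler's actual code):

from collections import defaultdict, deque
from urllib.parse import urlparse

def server_fair_order(urls):
    per_server = defaultdict(deque)
    for url in urls:
        per_server[urlparse(url).netloc].append(url)  # group URLs by server
    order = []
    while per_server:
        for server in list(per_server):   # visit one URL per server per round
            order.append(per_server[server].popleft())
            if not per_server[server]:
                del per_server[server]
    return order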
WebCrawler: Real-time Search
• Basic motivation:
  - Given a user’s query, try to find the documents that most closely match it.
  - WebCrawler uses a different search algorithm here.
  - Intuitive reasoning: if we follow the links from a document that is similar to what the user is looking for, they will most likely lead to other relevant documents.
• Steps followed (see the sketch after this list):
  - Run the user’s query against the WebCrawler index to get an initial list of “similar” documents.
  - Select the most relevant documents from the list and follow any unexplored links from them.
  - As new documents are retrieved, they are added to the index and the query is rerun.
  - The results are sorted by relevance; the highly relevant documents become candidates for further exploration.
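
A minimal sketch of this loop, assuming hypothetical index.search()/index.add(), fetch(), and extract_links() interfaces, with illustrative cutoffs:

def realtime_search(query, index, fetch, extract_links, rounds=3, top_k=5):
    explored = set()
    for _ in range(rounds):
        for url, _score in index.search(query)[:top_k]:  # most relevant docs
            for link in extract_links(fetch(url)):
                if link not in explored:                 # unexplored links only
                    explored.add(link)
                    index.add(link, fetch(link))         # new docs join the index
    return index.search(query)                           # final list, by relevance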
About Agents
About the Query Server
• How many agents?
  - During searching, the waiting time for accessing servers and network resources is the main bottleneck.
  - Agents are therefore typically run as multiple concurrent processes or threads (10 to 15, say), as sketched below.
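
A minimal sketch of such an agent pool using Python threads; the worker count and the fetch() helper are illustrative assumptions, not WebCrawler's actual design:

from concurrent.futures import ThreadPoolExecutor

def run_agents(urls, fetch, num_agents=12):
    # Each thread acts as one agent; a dozen in flight overlaps the
    # time spent waiting on servers and the network.
    with ThreadPoolExecutor(max_workers=num_agents) as pool:
        return list(pool.map(fetch, urls))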
SOLUTIONS TO QUIZ QUESTIONS ON LECTURE 38
QUIZ QUESTIONS ON LECTURE 39