
Data Storage and Retrieval

Spring 2021

Assignment Two

Student Name: Dahlia Gamal Saleh

Registration No: 17235005

Prof. Dr. Mohamed Shahine



Exercise 2.3.

A more-like-this query occurs when the user can click on a particular document in
the result list and tell the search engine to find documents that are similar to this
one. Describe which low-level components are used to answer this type of query
and the sequence in which they are used.
Search engines match queries against an index that they create. The index consists of the words in each document, plus pointers to their locations within the documents; this structure is called an inverted file. For a more-like-this query, the clicked document itself is treated as the query text, so it passes through the same low-level components as any other document or query before being matched against this index. A search engine has four essential modules, the first of which is the document processor:

Document Processor

The document processor prepares, processes, and inputs the documents, pages, or sites that users search against. It performs some or all of the following steps (a sketch follows the list):

 Preprocessing
 Identifying elements to index
 Deleting stop words
 Stemming terms
 Extracting index entries
 Assigning term weights
 Creating the index
 Filtering documents
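
A minimal sketch of these steps in Python appears below. It is an illustration only, not the textbook's implementation: the stop-word list is a toy, the suffix-stripping stemmer is deliberately crude (a real system would use a Porter-style stemmer), and raw term frequency stands in for a proper weighting scheme such as tf-idf.

    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # toy list

    def stem(term):
        # Crude suffix stripping; real systems use e.g. the Porter stemmer.
        for suffix in ("ing", "ed", "es", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[:-len(suffix)]
        return term

    def build_inverted_index(docs):
        # Maps each stemmed term to {doc_id: [positions]} -- an inverted file.
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            tokens = re.findall(r"[a-z0-9]+", text.lower())  # preprocessing
            for pos, tok in enumerate(tokens):
                if tok in STOP_WORDS:                        # stop-word deletion
                    continue
                index[stem(tok)][doc_id].append(pos)         # stemming + index entry
        return index

    def term_weights(index, doc_id):
        # Raw term-frequency weights; a real engine would use tf-idf or BM25.
        return {t: len(post[doc_id]) for t, post in index.items() if doc_id in post}

    docs = {1: "Crawling the web", 2: "Web crawlers crawl web pages"}
    index = build_inverted_index(docs)
    print(dict(index["crawl"]))    # {1: [0], 2: [2]} -- doc id -> positions
    print(term_weights(index, 2))  # raw counts per stemmed term in doc 2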


Exercise 3.4.
a) What is the advantage of using HEAD requests instead of GET requests
during crawling?
b) When would a crawler use a GET request instead of a HEAD request?

The HEAD method is identical to GET except that the server MUST NOT return
a message-body in the response.

a) A crawler can use HEAD requests when trying to determine whether a web page
has changed since the last time it was crawled. Since a HEAD response carries
only headers and no message body, such requests reduce the amount of data
transferred compared to always using GET requests, thereby increasing the
efficiency of the crawler.
b) A crawler would use a GET request the first time it crawls a page, or when a
HEAD request has indicated that the page has been updated since it was last
crawled (see the sketch below).
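
The following sketch shows this HEAD-then-GET pattern using Python's standard urllib. The URL is a placeholder, and comparing Last-Modified values is only one freshness signal; a real crawler would also consider ETags and conditional GET requests.

    import urllib.request

    def fetch_if_changed(url, last_seen):
        # last_seen is the Last-Modified value stored from the previous
        # crawl of this page, or None if the page has never been crawled.
        head = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(head) as resp:   # headers only, no body
            modified = resp.headers.get("Last-Modified")

        # First crawl, or the server reports a newer version: fall back to GET.
        if last_seen is None or modified != last_seen:
            with urllib.request.urlopen(url) as resp:
                return resp.read(), modified
        return None, last_seen   # unchanged; the body was never transferred

    body, stamp = fetch_if_changed("https://example.com/", None)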


Exercise 3.5. Name the three types of sites mentioned in the chapter that
compose the deep web.

 Private sites are intentionally private. They may have no incoming links, or may require you to log in with a valid account before using the rest of the site. These sites generally want to block access from crawlers.

 Form results are sites that can be reached only after entering some data into a form. For example, websites selling airline tickets typically ask for trip information on the site's entry page. You are shown flight information only after submitting this trip information. Even though you might want to use a search engine to find flight timetables, most crawlers will not be able to get through this form to get to the timetable information.

 Scripted pages are pages that use JavaScript, Flash, or another client-side language in the web page. If a link is not in the raw HTML source of the web page, but is instead generated by JavaScript code running on the browser, the crawler will need to execute the JavaScript on the page in order to find the link (the sketch after this list shows how such links stay invisible to a plain HTML parser). Although this is technically possible, executing JavaScript can slow down the crawler significantly and adds complexity to the system.
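
A small sketch of the scripted-pages problem, using Python's standard-library HTML parser: a crawler that only parses raw HTML sees the static link but never the one the script would generate at render time. The page content here is invented for illustration.

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        # Collects href targets from <a> tags found in the raw HTML source.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href")

    page = """
    <a href="/visible.html">plain link</a>
    <script>
      // This anchor exists only after a browser executes the script:
      document.write('<a href="/hidden.html">generated link</a>');
    </script>
    """

    parser = LinkExtractor()
    parser.feed(page)
    print(parser.links)   # ['/visible.html'] -- the scripted link is missed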


Exercise 3.9. Suppose that, in an effort to crawl web pages faster, you set up
two crawling machines with different starting seed URLs. Is this an effective
strategy for distributed crawling? Why or why not?

No, this is not an effective strategy for distributed crawling, because the two
machines are unable to share information with each other. Even though the two
machines are seeded with different URLs, their frontiers would quickly overlap,
leading to a great deal of duplicated effort: the two machines would eventually
crawl many of the same URLs.
This strategy could be made more effective by sharing the URL request queue
between the two machines, thereby eliminating the duplicated effort; for
example, URLs can be partitioned between the machines by host, as in the
sketch below.
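
A minimal sketch of one such scheme, hash-partitioning the URL space by host so that exactly one machine owns each site. The route function and the send_to_peer hook are hypothetical names invented for illustration.

    import hashlib
    from urllib.parse import urlparse

    NUM_CRAWLERS = 2  # assumed cluster size

    def assigned_crawler(url):
        # Hash the host name so every URL on a site maps to the same machine.
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

    def route(url, my_id, my_queue, send_to_peer):
        # Newly discovered URLs are crawled locally or forwarded to their
        # owner, so no page is fetched twice across the cluster.
        owner = assigned_crawler(url)
        if owner == my_id:
            my_queue.append(url)       # this machine owns the host
        else:
            send_to_peer(owner, url)   # hand off to the owning machine

    # Both URLs share a host, so both map to the same crawler:
    print(assigned_crawler("https://example.com/page1"))
    print(assigned_crawler("https://example.com/page2"))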
