
Data Storage and Retrieval

Spring 2021

Assignment Two

Student Name: Dahlia Gamal Saleh

Registration No: 17235005

Prof. Dr. Mohamed Shahine



Exercise 2.3.

A more-like-this query occurs when the user can click on a particular document in
the result list and tell the search engine to find documents that are similar to this
one. Describe which low-level components are used to answer this type of query
and the sequence in which they are used.
Search engines match queries against an index that they create. The index consists of the words in each document, plus pointers to their locations within the documents; this structure is called an inverted file. For a more-like-this query, the clicked document itself is treated as the query text, so it passes through the same low-level components as any other document or query before being matched against this index. A search engine has four essential modules, the first of which is the document processor:

Document Processor

The document processor prepares, processes, and inputs the documents, pages, or sites that users search against. It performs some or all of the following steps (a sketch follows the list):

 Preprocessing
 Identifying elements to index
 Deleting stop words
 Stemming terms
 Extracting index entries
 Assigning term weights
 Creating the index
 Filtering documents
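
A minimal sketch of these steps in Python appears below. It is an illustration only, not the textbook's implementation: the stop-word list is a toy, the suffix-stripping stemmer is deliberately crude (a real system would use a Porter-style stemmer), and raw term frequency stands in for a proper weighting scheme such as tf-idf.

    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # toy list

    def stem(term):
        # Crude suffix stripping; real systems use e.g. the Porter stemmer.
        for suffix in ("ing", "ed", "es", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[:-len(suffix)]
        return term

    def build_inverted_index(docs):
        # Maps each stemmed term to {doc_id: [positions]} -- an inverted file.
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            tokens = re.findall(r"[a-z0-9]+", text.lower())  # preprocessing
            for pos, tok in enumerate(tokens):
                if tok in STOP_WORDS:                        # stop-word deletion
                    continue
                index[stem(tok)][doc_id].append(pos)         # stemming + index entry
        return index

    def term_weights(index, doc_id):
        # Raw term-frequency weights; a real engine would use tf-idf or BM25.
        return {t: len(post[doc_id]) for t, post in index.items() if doc_id in post}

    docs = {1: "Crawling the web", 2: "Web crawlers crawl web pages"}
    index = build_inverted_index(docs)
    print(dict(index["crawl"]))    # {1: [0], 2: [2]} -- doc id -> positions
    print(term_weights(index, 2))  # raw counts per stemmed term in doc 2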


Exercise 3.4.
a) What is the advantage of using HEAD requests instead of GET requests
during crawling?
b) When would a crawler use a GET request instead of a HEAD request?

The HEAD method is identical to GET except that the server MUST NOT return
a message-body in the response.

a) A crawler can use HEAD requests when trying to determine whether a web page
has changed since the last time it was crawled. Since a HEAD response carries
only headers and no message body, such requests reduce the amount of data
transferred compared to always using GET requests, thereby increasing the
efficiency of the crawler.
b) A crawler would use a GET request the first time it crawls a page, or when a
HEAD request has indicated that the page has been updated since it was last
crawled (see the sketch below).
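
The following sketch shows this HEAD-then-GET pattern using Python's standard urllib. The URL is a placeholder, and comparing Last-Modified values is only one freshness signal; a real crawler would also consider ETags and conditional GET requests.

    import urllib.request

    def fetch_if_changed(url, last_seen):
        # last_seen is the Last-Modified value stored from the previous
        # crawl of this page, or None if the page has never been crawled.
        head = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(head) as resp:   # headers only, no body
            modified = resp.headers.get("Last-Modified")

        # First crawl, or the server reports a newer version: fall back to GET.
        if last_seen is None or modified != last_seen:
            with urllib.request.urlopen(url) as resp:
                return resp.read(), modified
        return None, last_seen   # unchanged; the body was never transferred

    body, stamp = fetch_if_changed("https://example.com/", None)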


Exercise 3.5. Name the three types of sites mentioned in the chapter that
compose the deep web.

 Private sites are intentionally private. They may have no incoming links, or may require you to log in with a valid account before using the rest of the site. These sites generally want to block access from crawlers.

 Form results are sites that can be reached only after entering some data into a form. For example, websites selling airline tickets typically ask for trip information on the site's entry page. You are shown flight information only after submitting this trip information. Even though you might want to use a search engine to find flight timetables, most crawlers will not be able to get through this form to get to the timetable information.

 Scripted pages are pages that use JavaScript, Flash, or another client-side language in the web page. If a link is not in the raw HTML source of the web page, but is instead generated by JavaScript code running on the browser, the crawler will need to execute the JavaScript on the page in order to find the link (the sketch after this list shows how such links stay invisible to a plain HTML parser). Although this is technically possible, executing JavaScript can slow down the crawler significantly and adds complexity to the system.
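
A small sketch of the scripted-pages problem, using Python's standard-library HTML parser: a crawler that only parses raw HTML sees the static link but never the one the script would generate at render time. The page content here is invented for illustration.

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        # Collects href targets from <a> tags found in the raw HTML source.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href")

    page = """
    <a href="/visible.html">plain link</a>
    <script>
      // This anchor exists only after a browser executes the script:
      document.write('<a href="/hidden.html">generated link</a>');
    </script>
    """

    parser = LinkExtractor()
    parser.feed(page)
    print(parser.links)   # ['/visible.html'] -- the scripted link is missed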


Exercise 3.9. Suppose that, in an effort to crawl web pages faster, you set up
two crawling machines with different starting seed URLs. Is this an effective
strategy for distributed crawling? Why or why not?

No, this is not an effective strategy for distributed crawling, because the two
machines are unable to share information with each other. Even though the two
machines are seeded with different URLs, their frontiers would quickly overlap,
leading to a great deal of duplicated effort: the two machines would eventually
crawl many of the same URLs.
This strategy could be made more effective by sharing the URL request queue
between the two machines, thereby eliminating the duplicated effort; for
example, URLs can be partitioned between the machines by host, as in the
sketch below.
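
A minimal sketch of one such scheme, hash-partitioning the URL space by host so that exactly one machine owns each site. The route function and the send_to_peer hook are hypothetical names invented for illustration.

    import hashlib
    from urllib.parse import urlparse

    NUM_CRAWLERS = 2  # assumed cluster size

    def assigned_crawler(url):
        # Hash the host name so every URL on a site maps to the same machine.
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

    def route(url, my_id, my_queue, send_to_peer):
        # Newly discovered URLs are crawled locally or forwarded to their
        # owner, so no page is fetched twice across the cluster.
        owner = assigned_crawler(url)
        if owner == my_id:
            my_queue.append(url)       # this machine owns the host
        else:
            send_to_peer(owner, url)   # hand off to the owning machine

    # Both URLs share a host, so both map to the same crawler:
    print(assigned_crawler("https://example.com/page1"))
    print(assigned_crawler("https://example.com/page2"))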
