
Search Engines

Presented by: Rasik Mevada, Vishal Dabhi, Vimal Nair, Ravi Mathai

INTRODUCTION

HISTORY

TYPES OF SEARCH ENGINES

HOW SEARCH ENGINES WORK?

CONCLUSION

Introduction
A search engine is a software program that searches for sites based on the words you designate as search terms. Search engines look through their own databases of information in order to find what you are looking for. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.

Introduction Contd
A web search engine is designed to search for information on the World Wide Web. The search results are generally presented in a list of results, often referred to as search engine results pages (SERPs). The results may consist of web pages, images, and other types of files. Search engine is the popular term for an Information Retrieval (IR) system.

History
In 1990, the first tool for searching on the Internet was Archie. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names. However, Archie did not index the contents of these sites, since the amount of data was so limited it could be readily searched manually.

In 1991, the rise of Gopher led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers.

In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called 'Wandex'. The web's second search engine, Aliweb, appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format. One of the first "full text" crawler-based search engines was WebCrawler, which came out in 1994.

Search Engine Types

1. Crawler-Based Search Engines

2. Human-Powered Directories

3. Hybrid Search Engines (Mixed Results)

1. Crawler-Based Search Engines

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or, especially in the FOAF community, Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.

Crawler-based search engines have three major components.

1) The crawler
Also called the spider, it visits a web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis, such as every month or every fifteen days, to look for changes.

2) The index
Everything the spider finds goes into the second part of the search engine, the index. The index contains a copy of every web page that the spider finds. If a web page changes, the index is updated with the new information.

3) The search engine software
This is the software program that accepts the user-entered query, interprets it, and sifts through the millions of pages recorded in the index to find matches, ranks them in order of what it believes is most relevant, and presents them to the user in a customizable manner.
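Below is a minimal Python sketch of how the second and third components might fit together: build_index plays the role of the index and search plays the role of the search engine software. The sample pages and the simple term-frequency ranking are illustrative assumptions, not the method of any real engine.

# Minimal sketch of the "index" and "search engine software" components.
# The sample pages and the term-frequency ranking are illustrative
# assumptions, not the ranking method of any real search engine.
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of pages it appears in (an inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, pages, query):
    """Return pages containing every query term, ranked by term frequency."""
    terms = query.lower().split()
    if not terms:
        return []
    # Pages that contain all query terms.
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    # Rank by how often the terms occur in each page (a crude relevance score).
    def score(url):
        words = pages[url].lower().split()
        return sum(words.count(t) for t in terms)
    return sorted(hits, key=score, reverse=True)

pages = {
    "http://example.com/a": "web crawlers index web pages for fast searches",
    "http://example.com/b": "a web directory is edited by human editors",
}
index = build_index(pages)
print(search(index, pages, "web pages"))   # ['http://example.com/a']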

Crawler-Based Search Engines Contd.


A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
The large volume implies that the crawler can only download a limited number of Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that, by the time the crawler revisits them, pages might have already been updated or even deleted.
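Below is a minimal Python sketch of the seed/frontier loop and the prioritization it implies. The fetch_page and extract_links functions are hypothetical placeholders for real HTTP fetching and HTML link parsing, and the priority used here (link depth) is just one possible choice.

# Sketch of the seed/frontier loop: seed URLs are visited, discovered links
# are added to the crawl frontier, and a priority value decides what to
# download next. fetch_page and extract_links are hypothetical placeholders
# for real HTTP fetching and HTML link parsing.
import heapq

def fetch_page(url):
    return "<html>...</html>"          # placeholder for an HTTP request

def extract_links(html):
    return []                          # placeholder for an HTML link parser

def crawl(seeds, max_pages=100):
    frontier = [(0, url) for url in seeds]   # (priority, url); lower runs sooner
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        depth, url = heapq.heappop(frontier)
        if url in visited:
            continue                    # already downloaded this URL
        visited.add(url)
        html = fetch_page(url)
        for link in extract_links(html):
            if link not in visited:
                heapq.heappush(frontier, (depth + 1, link))
    return visited

print(crawl(["http://example.com/"]))   # {'http://example.com/'}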

Crawlers are also used for automating maintenance tasks on a Web site, such as checking links or validating HTML code; a link-checking sketch follows below.


Crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).
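As a sketch of the link-checking use mentioned above, the following uses only the Python standard library to report links that fail to load. The URL list is an illustrative assumption; a fuller tool would crawl the site itself to discover the links to check.

# Sketch of a maintenance crawler task: check a list of links and report
# the ones that fail to load. The URL list is an illustrative assumption.
from urllib.error import URLError
from urllib.request import urlopen

def check_links(urls, timeout=5):
    broken = []
    for url in urls:
        try:
            # urlopen raises for HTTP errors (4xx/5xx) and network failures.
            with urlopen(url, timeout=timeout):
                pass                    # the link loaded successfully
        except (URLError, OSError) as exc:
            broken.append((url, str(exc)))
    return broken

print(check_links(["https://example.com/", "https://example.com/no-such-page"]))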

Architecture of a Web Crawler

Behaviour of a Web Crawler


The behaviour of a Web crawler is the outcome of a combination of policies: a selection policy that states which pages to download, a re-visit policy that states when to check for changes to the pages, a politeness policy that states how to avoid overloading Web sites, and a parallelization policy that states how to coordinate distributed Web crawlers.
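The following small Python sketch shows how these four policies might be captured as crawler configuration. The field names and default values are illustrative assumptions, not the settings of any particular crawler.

# Sketch of the four crawler policies expressed as configuration.
# The field names and default values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CrawlerPolicy:
    # Selection policy: which pages to download.
    allowed_domains: tuple = ("example.com",)
    max_depth: int = 3
    # Re-visit policy: when to check pages for changes.
    revisit_after_days: int = 15
    # Politeness policy: how to avoid overloading Web sites.
    delay_between_requests_s: float = 1.0
    respect_robots_txt: bool = True
    # Parallelization policy: how to coordinate distributed crawlers.
    worker_count: int = 8

print(CrawlerPolicy())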

2. Human-Powered Directories


A human-edited directory is created and maintained by editors who add links based on the policies particular to that directory. Some directories may prevent search engines from rating a displayed link by using redirects, nofollow attributes, or other techniques. Many human-edited directories, including the Open Directory Project, Salehoo, and the World Wide Web Virtual Library, are edited by volunteers, who are often experts in particular categories. These directories are sometimes criticized for long delays in approving submissions, or for rigid organizational structures and disputes among volunteer editors.

In response to these criticisms, some volunteer-edited directories have adopted wiki technology to allow broader community participation in editing the directory (at the risk of introducing lower-quality, less objective entries).

Another direction taken by some web directories is the paid-for-inclusion model.


This method enables the directory to offer timely inclusion for submissions and, as a result of the paid model, generally fewer listings.

These options typically have an additional fee associated with them, but offer significant help and visibility to the listed sites.

Today, submission of websites to web directories is considered a common SEO (search engine optimization) technique to get back-links for the submitted web site.

One distinctive feature of 'directory submission' is that it cannot be fully automated like search engine submissions.


Manual directory submission is a tedious and time-consuming job and is often outsourced by webmasters.

3. Hybrid Search Engines (Mixed Results)
In the web's early days, a search engine either presented crawler-based results or human-powered listings. Today, it is extremely common for both types of results to be presented. Usually, a hybrid search engine will favour one type of listing over another.
For example, MSN Search is more likely to present human-powered listings from LookSmart. However, it also presents crawler-based results (as provided by Inktomi), especially for more obscure queries.
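The following minimal Python sketch shows one way such blending might work: favour directory listings when enough exist for a query, and fall back to crawler results for more obscure queries. The rule and the sample results are illustrative assumptions, not how MSN Search actually combined its sources.

# Sketch of hybrid result blending: favour human-powered directory listings
# when enough exist for a query, and fall back to crawler-based results for
# more obscure queries. The rule and sample data are illustrative assumptions.
def blend_results(directory_hits, crawler_hits, min_directory_hits=3):
    if len(directory_hits) >= min_directory_hits:
        # Enough directory listings: show them first, crawler results fill in.
        return directory_hits + [r for r in crawler_hits if r not in directory_hits]
    # Obscure query: too few directory entries, rely on crawler results.
    return crawler_hits

print(blend_results(["dir:python", "dir:snakes", "dir:monty"], ["web:python.org"]))
print(blend_results([], ["web:rare-topic.example"]))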
