
Search Engines

Presented by: Rasik Mevada, Vishal Dabhi, Vimal Nair, Ravi Mathai

INTRODUCTION

HISTORY

TYPES OF SEARCH ENGINES

HOW SEARCH ENGINES WORK?

CONCLUSION

Introduction
A search engine is a software program that searches for sites based on the words you designate as search terms. Search engines look through their own databases of information in order to find what you are looking for. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.

Introduction Contd
A web search engine is designed to search for information on the World Wide Web. The search results are generally presented in a list of results, often referred to as search engine results pages (SERPs). The results may consist of web pages, images, and other types of files. Search engine is the popular term for an Information Retrieval (IR) system.

History
In 1990, the first tool for searching on the Internet was Archie. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names. However, Archie did not index the contents of these sites, since the amount of data was so limited it could be readily searched manually.

In 1991, the rise of Gopher led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers.

In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called 'Wandex'. The web's second search engine, Aliweb, appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format. One of the first "full text" crawler-based search engines was WebCrawler, which came out in 1994.

Search Engine Types

1. Crawler-Based Search Engines

2. Human-Powered Directories

3. Hybrid Search Engines (Mixed Results)

1. Crawler-Based Search Engines

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or, especially in the FOAF community, Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.

Crawler-based search engines have three major components.

1) The crawler
Also called the spider, it visits a web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis, such as every month or every fifteen days, to look for changes.

2) The index
Everything the spider finds goes into the second part of the search engine, the index. The index contains a copy of every web page that the spider finds. If a web page changes, the index is updated with the new information.

3) The search engine software
This is the software program that accepts the user-entered query, interprets it, and sifts through the millions of pages recorded in the index to find matches, ranks them in order of what it believes is most relevant, and presents them to the user in a customizable manner.
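Below is a minimal Python sketch of how the second and third components might fit together: build_index plays the role of the index and search plays the role of the search engine software. The sample pages and the simple term-frequency ranking are illustrative assumptions, not the method of any real engine.

# Minimal sketch of the "index" and "search engine software" components.
# The sample pages and the term-frequency ranking are illustrative
# assumptions, not the ranking method of any real search engine.
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of pages it appears in (an inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, pages, query):
    """Return pages containing every query term, ranked by term frequency."""
    terms = query.lower().split()
    if not terms:
        return []
    # Pages that contain all query terms.
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    # Rank by how often the terms occur in each page (a crude relevance score).
    def score(url):
        words = pages[url].lower().split()
        return sum(words.count(t) for t in terms)
    return sorted(hits, key=score, reverse=True)

pages = {
    "http://example.com/a": "web crawlers index web pages for fast searches",
    "http://example.com/b": "a web directory is edited by human editors",
}
index = build_index(pages)
print(search(index, pages, "web pages"))   # ['http://example.com/a']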

Crawler-Based Search Engines Contd.


A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
The large volume implies that the crawler can only download a limited number of Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that, by the time the crawler revisits them, pages might have already been updated or even deleted.
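Below is a minimal Python sketch of the seed/frontier loop and the prioritization it implies. The fetch_page and extract_links functions are hypothetical placeholders for real HTTP fetching and HTML link parsing, and the priority used here (link depth) is just one possible choice.

# Sketch of the seed/frontier loop: seed URLs are visited, discovered links
# are added to the crawl frontier, and a priority value decides what to
# download next. fetch_page and extract_links are hypothetical placeholders
# for real HTTP fetching and HTML link parsing.
import heapq

def fetch_page(url):
    return "<html>...</html>"          # placeholder for an HTTP request

def extract_links(html):
    return []                          # placeholder for an HTML link parser

def crawl(seeds, max_pages=100):
    frontier = [(0, url) for url in seeds]   # (priority, url); lower runs sooner
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        depth, url = heapq.heappop(frontier)
        if url in visited:
            continue                    # already downloaded this URL
        visited.add(url)
        html = fetch_page(url)
        for link in extract_links(html):
            if link not in visited:
                heapq.heappush(frontier, (depth + 1, link))
    return visited

print(crawl(["http://example.com/"]))   # {'http://example.com/'}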

Crawlers are also used for automating maintenance tasks on a Web site, such as checking links or validating HTML code; a link-checking sketch follows below.


Crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).
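As a sketch of the link-checking use mentioned above, the following uses only the Python standard library to report links that fail to load. The URL list is an illustrative assumption; a fuller tool would crawl the site itself to discover the links to check.

# Sketch of a maintenance crawler task: check a list of links and report
# the ones that fail to load. The URL list is an illustrative assumption.
from urllib.error import URLError
from urllib.request import urlopen

def check_links(urls, timeout=5):
    broken = []
    for url in urls:
        try:
            # urlopen raises for HTTP errors (4xx/5xx) and network failures.
            with urlopen(url, timeout=timeout):
                pass                    # the link loaded successfully
        except (URLError, OSError) as exc:
            broken.append((url, str(exc)))
    return broken

print(check_links(["https://example.com/", "https://example.com/no-such-page"]))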

Architecture of a Web Crawler

Behaviour of a Web Crawler


The behaviour of a Web crawler is the outcome of a combination of policies: a selection policy that states which pages to download, a re-visit policy that states when to check for changes to the pages, a politeness policy that states how to avoid overloading Web sites, and a parallelization policy that states how to coordinate distributed Web crawlers.
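The following small Python sketch shows how these four policies might be captured as crawler configuration. The field names and default values are illustrative assumptions, not the settings of any particular crawler.

# Sketch of the four crawler policies expressed as configuration.
# The field names and default values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CrawlerPolicy:
    # Selection policy: which pages to download.
    allowed_domains: tuple = ("example.com",)
    max_depth: int = 3
    # Re-visit policy: when to check pages for changes.
    revisit_after_days: int = 15
    # Politeness policy: how to avoid overloading Web sites.
    delay_between_requests_s: float = 1.0
    respect_robots_txt: bool = True
    # Parallelization policy: how to coordinate distributed crawlers.
    worker_count: int = 8

print(CrawlerPolicy())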

2. Human-Powered Directories


A human-edited directory is created and maintained by editors who add links based on the policies particular to that directory. Some directories may prevent search engines from rating a displayed link by using redirects, nofollow attributes, or other techniques. Many human-edited directories, including the Open Directory Project, Salehoo, and the World Wide Web Virtual Library, are edited by volunteers, who are often experts in particular categories. These directories are sometimes criticized for long delays in approving submissions, or for rigid organizational structures and disputes among volunteer editors.

In response to these criticisms, some volunteer-edited directories have adopted wiki technology to allow broader community participation in editing the directory (at the risk of introducing lower-quality, less objective entries).

Another direction taken by some web directories is the paid-for-inclusion model.


This method enables the directory to offer timely inclusion for submissions and, as a result of the paid model, generally fewer listings.

These options typically have an additional fee associated with them, but offer significant help and visibility to the listed sites.

Today, submission of websites to web directories is considered a common SEO (search engine optimization) technique to get back-links for the submitted web site.

One distinctive feature of 'directory submission' is that it cannot be fully automated like search engine submissions.


Manual directory submission is a tedious and time-consuming job and is often outsourced by webmasters.

3. Hybrid Search Engines (Mixed Results)
In the web's early days, a search engine either presented crawler-based results or human-powered listings. Today, it is extremely common for both types of results to be presented. Usually, a hybrid search engine will favour one type of listing over another.
For example, MSN Search is more likely to present human-powered listings from LookSmart. However, it also presents crawler-based results (as provided by Inktomi), especially for more obscure queries.
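The following minimal Python sketch shows one way such blending might work: favour directory listings when enough exist for a query, and fall back to crawler results for more obscure queries. The rule and the sample results are illustrative assumptions, not how MSN Search actually combined its sources.

# Sketch of hybrid result blending: favour human-powered directory listings
# when enough exist for a query, and fall back to crawler-based results for
# more obscure queries. The rule and sample data are illustrative assumptions.
def blend_results(directory_hits, crawler_hits, min_directory_hits=3):
    if len(directory_hits) >= min_directory_hits:
        # Enough directory listings: show them first, crawler results fill in.
        return directory_hits + [r for r in crawler_hits if r not in directory_hits]
    # Obscure query: too few directory entries, rely on crawler results.
    return crawler_hits

print(blend_results(["dir:python", "dir:snakes", "dir:monty"], ["web:python.org"]))
print(blend_results([], ["web:rare-topic.example"]))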
