
Anatomy of a Search Engine

Submitted by:
Pradipta Kumar Rout
0805227040
MCA 4th Sem

CVRCA

11/04/21 12:44 PM

Topics to cover
• Introduction

• History of search engines

• Working of a search engine

• Google Architecture

INTRODUCTION
• A search engine is a software program that searches for
sites based on the words that you designate as search
terms.

• Search engines look through their own databases of
information in order to find what you are looking for.

• “Search engine” is the popular term for an
Information Retrieval (IR) system.

History of Web Search Engines

• 1993 : W3Catalog (University of Geneva)
• 1994 : World Wide Web Worm (University of Colorado)
• 1995 : AltaVista
• 1996 : Yahoo
• 1998 : Google
• 2004 : MSN Search (now Bing)

Working of a search engine

1. Web crawling
2. Indexing
3. Searching

Web crawling
1. What a crawler is and what crawling means.
2. How it works
   • The crawler starts with heavily used servers and
     very popular pages.
   • For each page it records the words within the page and
     where the words were found.
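
The two points under "How it works" can be sketched with a toy crawler. This is only an illustration: it assumes a small in-memory dictionary, PAGES, in place of the real web (the URLs and page texts are made up), starts from a seed list standing in for the very popular pages, and records each word together with the position where it was found.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
PAGES = {
    "a.example": ("search engines crawl pages", ["b.example", "c.example"]),
    "b.example": ("crawlers follow links", ["a.example"]),
    "c.example": ("pages contain words", []),
}

def crawl(seeds):
    """Breadth-first crawl from the seed URLs.

    Returns {url: [(word, position), ...]} for every page reached.
    """
    seen = set(seeds)
    queue = deque(seeds)
    found = {}
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        # Record the words within the page and where they were found.
        found[url] = [(word, pos) for pos, word in enumerate(text.split())]
        for link in links:
            # Follow only links to known pages that were not visited yet.
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return found

result = crawl(["a.example"])
```

Starting from the single seed a.example, the crawl reaches all three pages by following links, and result holds one (word, position) list per page.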

Indexing

1. What indexing is.

2. How it is done.
   • Weights
   • Hashing
   • docID
   • wordID
   • The hash table contains the hashed number along
     with a pointer to the actual data.
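
To make the hashing step concrete, here is a minimal sketch, assuming Python's built-in dict as the hash table. The names lexicon, postings and index_document are illustrative and do not come from any real engine. Each word is given a wordID, and the table maps that wordID to a list of (docID, position) entries, which plays the role of the pointer to the actual data. Weights are left out to keep the sketch short.

```python
lexicon = {}    # word -> wordID (assigned in order of first appearance)
postings = {}   # wordID -> list of (docID, position); a dict is a hash table

def index_document(doc_id, text):
    """Add every word of one document to the index."""
    for position, word in enumerate(text.lower().split()):
        # Reuse the existing wordID or assign the next free one.
        word_id = lexicon.setdefault(word, len(lexicon))
        postings.setdefault(word_id, []).append((doc_id, position))

index_document(0, "search engines index pages")
index_document(1, "Pages link to pages")

pages_id = lexicon["pages"]
```

After both calls, postings[pages_id] lists every place the word "pages" occurs, across both documents.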

Searching
1. How it works.
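
The slide gives no detail on this step, so the following is only a minimal sketch of one common approach, assuming an inverted index that maps each word to the set of docIDs containing it. The index contents and the function name search are made up for illustration. A query is answered by intersecting the sets for its words.

```python
# Hypothetical inverted index: word -> set of docIDs containing the word.
INDEX = {
    "search": {0, 2},
    "engine": {0, 1, 2},
    "google": {2},
}

def search(query):
    """Return the sorted docIDs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # Copy the first set so the index itself is never modified.
    result = set(INDEX.get(words[0], set()))
    for word in words[1:]:
        result &= INDEX.get(word, set())
    return sorted(result)
```

For example, search("search engine") returns the documents that contain both words, and a query with an unknown word returns an empty list.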

Working of a search Engine



Google Architecture
1. URL server
2. Crawler
3. Store Server
4. Repository
5. Indexer
6. Barrels
7. Anchors
8. URL Resolver
9. Links
10. Doc Index
11. Page Rank
12. Sorter
13. Lexicon

• URL server : Sends lists of URLs to be fetched to the crawlers.

• Store server : The web pages that are fetched are sent to the store server,
which compresses them and stores them in a repository.

• Indexer : Reads the repository, uncompresses the documents, and parses
them. Each document is converted into hits, which record the word, its position
in the document, an approximation of font size, and capitalization. The indexer
also writes an anchors file, which stores important information about each link.

• Barrels : Store the hits produced by the indexer (the forward index).

• URL resolver : Reads the anchors file and converts relative URLs
into absolute URLs.

• Sorter : Takes the barrels, which are sorted by docID, and re-sorts them
by wordID to generate the inverted index.

• Repository : Contains the full HTML of every web page in
compressed form. To find the docID of a URL, the URL's checksum is computed
and a binary search is performed on the checksums file.
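
The sorter's job can be illustrated with a small sketch. It assumes a forward index held as a plain dictionary from docID to the wordIDs that occur in the document (the IDs are made up) and regroups the same entries by wordID, which gives an inverted index.

```python
from collections import defaultdict

# Hypothetical forward index (a "barrel"): docID -> wordIDs in the document.
forward_index = {
    1: [10, 11],
    2: [11, 12],
    3: [10, 12],
}

def invert(forward):
    """Regroup (docID, wordID) pairs by wordID to form an inverted index."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id in forward[doc_id]:
            inverted[word_id].append(doc_id)
    # Return a plain dict ordered by wordID.
    return dict(sorted(inverted.items()))

inverted_index = invert(forward_index)
```

Looking up a wordID in inverted_index now gives the documents that contain it, which is the direction a query needs.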

Repository
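
The component list mentions finding a docID by computing a URL's checksum and binary-searching a checksums file. A minimal sketch of that lookup follows. It assumes CRC-32 (zlib.crc32) as the checksum and an in-memory sorted list in place of the file; the URLs, the docIDs and the function names are made up.

```python
import bisect
import zlib

def checksum(url):
    """CRC-32 of the URL; an assumed stand-in for the real checksum."""
    return zlib.crc32(url.encode("utf-8"))

# Hypothetical URLs with made-up docIDs.
urls = {"http://a.example/": 7, "http://b.example/": 3, "http://c.example/": 9}

# The "checksums file": (checksum, docID) pairs kept sorted by checksum.
checksums_file = sorted((checksum(u), doc_id) for u, doc_id in urls.items())
keys = [c for c, _ in checksums_file]

def doc_id_for(url):
    """Binary-search the sorted checksums; return the docID or None."""
    target = checksum(url)
    i = bisect.bisect_left(keys, target)
    if i < len(keys) and keys[i] == target:
        return checksums_file[i][1]
    return None
```

Because the list is sorted by checksum, each lookup needs only a logarithmic number of comparisons instead of a scan over every URL.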

Indexer

Page Rank
1. Every page starts with a page rank of 1/(number of pages) = 1/4 = 0.25.
2. Each page gets its page rank updated based on incoming links.

[Figure: four pages A, B, C and D, each with an initial page rank of 0.25; three links, each worth 0.25, point to A]

In this case the page rank of A, PR(A), is:
PR(A) = 0.25 + 0.25 + 0.25 = 0.75
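
The arithmetic on this slide can be checked with a few lines of Python. This sketch assumes, as the figures suggest, that B, C and D each link to A and that every link passes on the linking page's full rank. The dictionary layout and the names are illustrative only.

```python
pages = ["A", "B", "C", "D"]

# Assumed link structure: B, C and D each link to A.
incoming = {"A": ["B", "C", "D"], "B": [], "C": [], "D": []}

# Step 1: every page starts with 1 / (number of pages).
rank = {page: 1 / len(pages) for page in pages}

# Step 2 (simplified): A's new rank is the sum of the ranks of the
# pages linking to it.
pr_a = sum(rank[source] for source in incoming["A"])
print(pr_a)
```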

Page Rank
• Links are weighted based on the number of outgoing links.

• The page rank of a page is divided by the number of outgoing links it has,
so the page shares its rank among the pages it links to.

• E.g., D’s links are worth 0.25/3 because D has 3 outgoing links.

[Figure: pages A, B, C and D, each with page rank 0.25; links are labeled 0.25/2, 0.25/1 and 0.25/3]

Now:
PR(A) = 0.25/2 + 0.25/1 + 0.25/3 ≈ 0.458
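
The weighted update can be sketched the same way. The graph below is an assumption chosen to match the 0.25/2, 0.25/1 and 0.25/3 terms: B links to A and C, C links to A, and D links to A, B and C. The sketch performs a single update step and leaves out the damping factor used in the full PageRank algorithm.

```python
# Assumed graph: page -> pages it links to.
outgoing = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

# Every page starts with 1 / (number of pages) = 0.25.
rank = {page: 1 / len(outgoing) for page in outgoing}

def updated_rank(target):
    """Sum rank / out-degree over every page that links to target."""
    return sum(
        rank[source] / len(links)
        for source, links in outgoing.items()
        if target in links
    )

pr_a = updated_rank("A")
```

Here pr_a is 0.25/2 + 0.25/1 + 0.25/3, roughly 0.458, which is lower than the 0.75 obtained when every link passes on the full rank.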
