
INDEXING AND CACHING IN SEARCH ENGINES

11MX56

THE PROBLEM
Users want results the moment they submit a query. What does the engine have to handle?
o A staggering volume of over 4 billion web pages.
o Over a million queries per minute.
o The response time must be as close to immediate as possible.
o The engine must consume as few resources as possible.

SOME SEARCH STATISTICS TO NOTE


63.7% of queries are unique. Approximately 34%, or only about one third, of the search queries submitted are repeats.

58% of users view only the first page of the search results (averaged across the popular search engines Google, Yahoo and Ask).

No more than 12% of users browse through more than 3 pages.

THE SOLUTION
With a cache: QUERY -> CACHE HIT? If yes, the RESULT is returned straight from the cache. If no: QUERY SERVER -> Web -> RESULT.

Without a cache: QUERY -> QUERY SERVER -> Web -> RESULT.
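The two flows above can be sketched in a few lines of Python. This is only an illustration: `fetch_from_query_server` is a hypothetical stand-in for the expensive path through the query server and the web index, and the plain dictionary used as a cache has no size limit.

```python
def fetch_from_query_server(query):
    # Hypothetical stand-in for the slow path: query server -> web index.
    return f"results for {query!r}"

cache = {}          # query -> result page (unbounded, for illustration only)
server_calls = []   # records which queries actually reached the query server

def search(query):
    if query in cache:
        # Cache hit: answer directly without touching the query server.
        return cache[query]
    # Cache miss: take the slow path, then remember the result.
    server_calls.append(query)
    result = fetch_from_query_server(query)
    cache[query] = result
    return result

first = search("panther")    # miss, goes to the query server
second = search("panther")   # hit, served from the cache
```

After both calls, `server_calls` holds a single entry, because the repeated query never reached the query server.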

WHAT DO WE CACHE/INDEX?
36% of all queries have been retrieved before. The stats show that many users are looking for the same thing when using a search engine.

(Chart: search queries split into Same and Different.)

VARIANTS

1. Direct Cache
2. Inverted Index/List
3. Two-Level
4. N-Level

VARIANTS
Direct Cache
Stores the top few results of queries that are searched frequently or recently.
Can be fetched even before the query is processed and tokenized.

Inverted Index
Stores links to the pages containing the tokens of all frequently or recently searched queries.
Can only be fetched after the query is processed and tokenized.

POLICIES
LRU (Least Recently Used)
Allocate a queue that can accommodate a certain number of result pages. When the queue is full and a new page needs to be cached, the least RECENTLY used page is removed from the cache.
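
A minimal sketch of the LRU policy described above, assuming result pages are keyed by their query string. The class name `LRUResultCache` and its methods are made up for this illustration; `collections.OrderedDict` plays the role of the queue.

```python
from collections import OrderedDict

class LRUResultCache:
    """Keeps at most `capacity` result pages, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()  # oldest use first, newest use last

    def get(self, query):
        if query not in self._pages:
            return None
        self._pages.move_to_end(query)  # mark as most recently used
        return self._pages[query]

    def put(self, query, result_page):
        if query in self._pages:
            self._pages.move_to_end(query)
        self._pages[query] = result_page
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # drop the least recently used page

lru = LRUResultCache(capacity=2)
lru.put("panther", "results for panther")
lru.put("what is it", "results for what is it")
lru.get("panther")                        # "panther" is now the most recent
lru.put("index", "results for index")     # evicts "what is it"
```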

LFU (Least Frequently Used)


Allocate a rank-based list that can accommodate a certain number of result pages. When the list is full and a new page needs to be cached, the least FREQUENTLY used page is removed from the cache.
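
A comparable sketch for LFU, again keyed by query string. The names are hypothetical, the capacity is assumed to be at least 1, and the eviction scans all counters, which is fine for illustration but too slow for a large cache. Ties are broken in favour of evicting the entry that was inserted first.

```python
class LFUResultCache:
    """Keeps at most `capacity` result pages, evicting the least frequently used."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed to be >= 1
        self._pages = {}           # query -> result page
        self._hits = {}            # query -> number of times it was used

    def get(self, query):
        if query not in self._pages:
            return None
        self._hits[query] += 1
        return self._pages[query]

    def put(self, query, result_page):
        if query not in self._pages and len(self._pages) >= self.capacity:
            # Linear scan for the lowest use count (simple, not efficient).
            victim = min(self._hits, key=self._hits.get)
            del self._pages[victim]
            del self._hits[victim]
        self._pages[query] = result_page
        self._hits[query] = self._hits.get(query, 0) + 1

lfu = LFUResultCache(capacity=2)
lfu.put("panther", "results for panther")
lfu.put("what is it", "results for what is it")
lfu.get("panther")
lfu.get("panther")                       # "panther" is now used far more often
lfu.put("index", "results for index")    # evicts "what is it"
```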

AN ADVANCED POLICY
Probability Driven Cache
o Users search in sessions, so the next query will probably be related to the previous one.
o This is reportedly in use by Google, as suggested by the related searches shown at the bottom of the result page.

INDEXING
These are steps, not just types:
1. Forward Index
2. Inverted Index

FORWARD INDEX
Pages
Page 1: This is what it is.
Page 2: What is it?
Page 3: It is a panther.

Forward Index
1. This, is, what, it, is
2. What, is, it
3. It, is, a, panther

INVERTED INDEX
Forward Index
1. This, is, what, it, is
2. What, is, it
3. It, is, a, panther

Inverted Index
This - 1
Is - 1, 2, 3
What - 1, 2
It - 1, 2, 3
A - 3
Panther - 3

A search such as "what is it?" gives pages 1 and 2 as the best results, because they are the only pages that contain all three terms. The terms occur in the same order on only one page, page 2, so it is ranked on top.
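
The forward-to-inverted step and the example query can be sketched as follows. The function names and the ranking rule (pages containing the exact phrase come first) are assumptions made for this illustration, not a description of how any real engine ranks results.

```python
from collections import defaultdict
import re

pages = {
    1: "This is what it is.",
    2: "What is it ?",
    3: "It is a panther.",
}

def tokenize(text):
    # Lowercase the text and keep alphabetic words only.
    return re.findall(r"[a-z]+", text.lower())

def build_forward_index(pages):
    # Forward index: page -> list of tokens, in order of appearance.
    return {page_id: tokenize(text) for page_id, text in pages.items()}

def build_inverted_index(forward_index):
    # Inverted index: token -> set of pages that contain it.
    inverted = defaultdict(set)
    for page_id, tokens in forward_index.items():
        for token in tokens:
            inverted[token].add(page_id)
    return inverted

def search(query, forward_index, inverted_index):
    terms = tokenize(query)
    if not terms:
        return []
    # Candidate pages must contain every query term.
    candidates = set.intersection(*(inverted_index.get(t, set()) for t in terms))

    def has_phrase(tokens):
        n = len(terms)
        return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))

    # Pages with the terms in the same order as the query are ranked first.
    return sorted(candidates, key=lambda p: (not has_phrase(forward_index[p]), p))

forward_index = build_forward_index(pages)
inverted_index = build_inverted_index(forward_index)
results = search("what is it ?", forward_index, inverted_index)
print(results)  # [2, 1]
```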

TROUBLES ENCOUNTERED
o The indexed documents correspond to an older version of the web pages.
o The documents matched for a cached query correspond to an older version of the index.
A periodic refresh has to be done to tackle these problems.
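
One simple way to approximate a periodic refresh is to give each cached entry a maximum age and fetch it again once it is older than that. The slides do not say how the refresh is done, so this age-based scheme, the class name `RefreshingCache`, and the injected `clock` are all assumptions for illustration.

```python
import time

class RefreshingCache:
    """Serves cached results until they are older than `max_age_seconds`."""

    def __init__(self, fetch, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch            # function that computes a fresh result
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the example is testable
        self._entries = {}             # query -> (result, time it was stored)

    def get(self, query):
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[1] < self._max_age:
            return entry[0]            # still fresh enough, possibly outdated
        result = self._fetch(query)    # missing or too old: refresh
        self._entries[query] = (result, now)
        return result

fake_now = [0.0]                       # fake clock, advanced by hand below
index_versions = {"panther": "v1"}     # stands in for the changing index
refreshing = RefreshingCache(lambda q: index_versions[q],
                             max_age_seconds=60,
                             clock=lambda: fake_now[0])

a = refreshing.get("panther")          # fetched from the index: "v1"
index_versions["panther"] = "v2"       # the index changes
b = refreshing.get("panther")          # cached copy is outdated: still "v1"
fake_now[0] = 61.0                     # the entry is now older than 60 seconds
c = refreshing.get("panther")          # refreshed: "v2"
```

The middle call shows the trouble described above: until the refresh happens, the cached result reflects an older version of the index.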

IMPACT
(Chart: Direct Cache compared with Inverted Index.)

THANK YOU
References
Performance of Inverted List Caching, CIS Department, Brooklyn University, NY, USA

A Refreshing Perspective of search engine caching, Yahoo! Research, Barcelona, Spain


Some help from Wiki as usual
