Indexing and Caching in
11MX56
THE PROBLEM
Users want results the moment they query.
What does the engine have to handle?
o A staggering index of over 4 billion web pages.
o Over a million queries per minute.
o The response time must be as close to immediate as possible.
o It must consume as few resources as possible.
58% of users view only the 1st page of the search results
(average across the popular search engines Google, Yahoo and Ask).
THE SOLUTION
With a cache: QUERY -> CACHE HIT? If yes, return the RESULT directly. If no, go to the QUERY SERVER -> Web -> RESULT.
Without a cache: QUERY -> QUERY SERVER -> Web -> RESULT.
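The two paths above can be sketched as a look-aside cache in front of the query server. This is only a minimal illustration: `make_cached_search`, `slow_query_server` and the normalisation of the query key are assumptions made for the example, not something the slides specify.

```python
from typing import Callable, Dict, List, Tuple


def make_cached_search(
    query_server: Callable[[str], List[str]]
) -> Tuple[Callable[[str], List[str]], Dict[str, int]]:
    """Wrap a (hypothetical) query-server function with a result cache."""
    cache: Dict[str, List[str]] = {}
    stats = {"hits": 0, "misses": 0}

    def search(query: str) -> List[str]:
        # Assumed normalisation: lower-case and collapse whitespace.
        key = " ".join(query.lower().split())
        if key in cache:
            # CACHE HIT: answer without touching the query server.
            stats["hits"] += 1
            return cache[key]
        # CACHE MISS: ask the query server, then remember the answer.
        stats["misses"] += 1
        results = query_server(key)
        cache[key] = results
        return results

    return search, stats


def slow_query_server(query: str) -> List[str]:
    # Stand-in for the real back end that searches the web index.
    return [f"result {i} for '{query}'" for i in range(1, 4)]


search, stats = make_cached_search(slow_query_server)
first = search("what is it")
second = search("What  is it")  # same query after normalisation
```

The second call is served from the cache, so the query server is contacted only once for the two requests.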
WHAT DO WE CACHE/INDEX?
The stats show that 36% of all search queries have been retrieved before.
[Chart: search queries, same vs. different]
VARIANTS
Direct Cache
Stores the top few results of queries that are searched frequently/recently.
Can be fetched even before the query is processed and tokenized.
Inverted Index
Stores links to the pages containing the tokens in all frequently/recently searched queries.
Can only be fetched after the query is processed and tokenized.
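The difference between the two variants can be sketched as follows. The toy data, the `tokenize` helper and both lookup functions are hypothetical names chosen for this example; the point is only that the direct cache is keyed by the raw query, while the inverted-index cache is keyed by tokens and therefore needs the query to be tokenized first.

```python
from typing import Dict, List, Optional, Set

# Direct cache: raw query string -> top results (toy data, assumed).
direct_cache: Dict[str, List[int]] = {"what is it": [2, 1]}

# Inverted-index cache: token -> ids of pages containing it (toy data).
posting_cache: Dict[str, Set[int]] = {
    "what": {1, 2},
    "is": {1, 2, 3},
    "it": {1, 2, 3},
}


def tokenize(query: str) -> List[str]:
    # Very simple tokenizer: split on spaces, strip punctuation, lower-case.
    stripped = (word.strip("?.,!").lower() for word in query.split())
    return [token for token in stripped if token]


def lookup_direct(raw_query: str) -> Optional[List[int]]:
    # No tokenization needed: the raw query itself is the key.
    return direct_cache.get(raw_query)


def lookup_postings(raw_query: str) -> Optional[Set[int]]:
    # The query must be tokenized first; every token needs a cached list.
    tokens = tokenize(raw_query)
    if not tokens or any(token not in posting_cache for token in tokens):
        return None
    pages = set(posting_cache[tokens[0]])
    for token in tokens[1:]:
        pages &= posting_cache[token]
    return pages
```

In this sketch the direct cache only helps for a query it has seen verbatim, whereas the cached token lists can also answer a differently worded query such as "Is it what?" as long as all of its tokens are cached.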
POLICIES
LRU (Least Recently Used)
Allocate a queue that can accommodate a certain number of result pages. When the queue is full and a new page needs to be cached, the least RECENTLY used page is removed from the cache.
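The LRU policy described above can be sketched with Python's `collections.OrderedDict`, which keeps entries in the order they were last touched. The class name `LRUResultCache` and the sample queries are assumptions made for the illustration.

```python
from collections import OrderedDict
from typing import List, Optional


class LRUResultCache:
    """Keeps at most `capacity` result pages, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._pages: "OrderedDict[str, List[str]]" = OrderedDict()

    def get(self, query: str) -> Optional[List[str]]:
        if query not in self._pages:
            return None
        self._pages.move_to_end(query)  # mark as most recently used
        return self._pages[query]

    def put(self, query: str, results: List[str]) -> None:
        if query in self._pages:
            self._pages.move_to_end(query)
        self._pages[query] = results
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # evict the least recently used


cache = LRUResultCache(capacity=2)
cache.put("panther", ["page 3"])
cache.put("what is it", ["page 2", "page 1"])
cache.get("panther")                    # "panther" is now the most recent
cache.put("this is", ["page 1"])        # cache is full: "what is it" is evicted
```

Because "panther" was read just before the third insertion, the entry for "what is it" is the least recently used one and is the one removed.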
AN ADVANCED POLICY
Probability Driven Cache
o Users search in sessions, so the next query will probably be related to the previous one.
o This is reportedly in use by Google, as suggested by the related searches shown at the bottom of its result pages.
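The session idea can be illustrated with a very small follow-up model that counts which query tends to come after which, so that a cache could prefetch the most probable next query. This is a hedged sketch only: `FollowUpModel`, its methods and the sample sessions are hypothetical, and it is not claimed to be the actual probability driven cache algorithm or what any search engine runs.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


class FollowUpModel:
    """Counts, per query, which queries followed it within a session."""

    def __init__(self) -> None:
        self._next: Dict[str, Counter] = defaultdict(Counter)

    def observe_session(self, queries: List[str]) -> None:
        # Record every (previous query, next query) pair in the session.
        for previous, following in zip(queries, queries[1:]):
            self._next[previous][following] += 1

    def probability(self, query: str, candidate: str) -> float:
        followers = self._next.get(query)
        if not followers:
            return 0.0
        return followers[candidate] / sum(followers.values())

    def most_likely_next(self, query: str) -> Optional[str]:
        followers = self._next.get(query)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


model = FollowUpModel()
model.observe_session(["panther", "panther habitat", "panther diet"])
model.observe_session(["panther", "panther habitat"])
model.observe_session(["panther", "black panther movie"])

prefetch_candidate = model.most_likely_next("panther")
```

With these made-up sessions, "panther habitat" followed "panther" in two of three sessions, so it would be the candidate to prefetch into the cache.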
INDEXING
These are steps, not just types!
1. Forward Index
2. Inverted Index
FORWARD INDEX
Pages
Page 1: This is what it is
Page 2: What is it?
Page 3: It is a panther.
Forward Index
1. This, is, what, it, is
2. What, is, it
3. It, is, a, panther
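Building the forward index for these three pages can be sketched as below. The text of page 1 is inferred from its token list, and the `tokenize` and `build_forward_index` helpers are illustrative names, not part of the slides.

```python
import re
from typing import Dict, List

# The three example pages (page 1 text inferred from its token list).
pages: Dict[int, str] = {
    1: "This is what it is",
    2: "What is it?",
    3: "It is a panther.",
}


def tokenize(text: str) -> List[str]:
    # Lower-case the text and keep runs of letters/digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_forward_index(pages: Dict[int, str]) -> Dict[int, List[str]]:
    # Forward index: page id -> ordered list of tokens on that page.
    return {page_id: tokenize(text) for page_id, text in pages.items()}


forward_index = build_forward_index(pages)
```

The forward index keeps the tokens in page order, which is what later allows checking whether query terms appear in the same sequence.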
INVERTED INDEX
Forward Index
1. This, is, what, it, is
2. What, is, it
3. It, is, a, panther
Inverted Index
This - 1
Is - 1, 2, 3
What - 1, 2
It - 1, 2, 3
A - 3
Panther - 3
A search for "what is it?" gives pages 1 and 2 as the best results, since only they contain all three terms. The terms occur in the same order in only one page, page 2, so it is ranked on top.
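The step from forward index to inverted index, and the ranking of the example query, can be sketched as follows. The function names and the ranking rule (pages containing the terms in the same order first, then by page number) are assumptions chosen to reproduce the slide's example.

```python
from typing import Dict, List, Set

forward_index: Dict[int, List[str]] = {
    1: ["this", "is", "what", "it", "is"],
    2: ["what", "is", "it"],
    3: ["it", "is", "a", "panther"],
}


def build_inverted_index(forward: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    # Inverted index: token -> set of page ids that contain it.
    inverted: Dict[str, Set[int]] = {}
    for page_id, tokens in forward.items():
        for token in tokens:
            inverted.setdefault(token, set()).add(page_id)
    return inverted


def contains_in_order(tokens: List[str], query: List[str]) -> bool:
    # True if the query tokens appear consecutively, in order, on the page.
    n = len(query)
    return any(tokens[i:i + n] == query for i in range(len(tokens) - n + 1))


def search(query: List[str], forward: Dict[int, List[str]],
           inverted: Dict[str, Set[int]]) -> List[int]:
    if not query:
        return []
    # Candidate pages must contain every query token.
    candidates = set(inverted.get(query[0], set()))
    for token in query[1:]:
        candidates &= inverted.get(token, set())
    # Pages with the terms in the same order come first, then by page id.
    return sorted(
        candidates,
        key=lambda page_id: (not contains_in_order(forward[page_id], query), page_id),
    )


inverted_index = build_inverted_index(forward_index)
ranking = search(["what", "is", "it"], forward_index, inverted_index)
```

Here the inverted index narrows the candidates to pages 1 and 2, and the forward index is then consulted to check term order, which puts page 2 first.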
TROUBLES ENCOUNTERED
o The indexed documents correspond to an older version of the web pages.
o The documents matched for a cached query correspond to an older version of the index.
o A periodic refresh has to be done to tackle the above troubles!
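One simple way to approximate a periodic refresh is to give every cached entry a time-to-live, so that stale results are dropped and fetched again. The slides do not say how the refresh is done, so the `ExpiringCache` class, its 60-second lifetime and the injectable clock below are all assumptions made for the sketch.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class ExpiringCache:
    """Result cache whose entries are discarded after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[List[str], float]] = {}

    def put(self, query: str, results: List[str]) -> None:
        # Remember the results together with the time they were stored.
        self._entries[query] = (results, self._clock())

    def get(self, query: str) -> Optional[List[str]]:
        entry = self._entries.get(query)
        if entry is None:
            return None
        results, stored_at = entry
        if self._clock() - stored_at >= self.ttl_seconds:
            # Entry is stale: drop it so the caller refetches fresh results.
            del self._entries[query]
            return None
        return results


# Demonstration with a fake clock instead of waiting in real time.
fake_now = [0.0]
cache = ExpiringCache(ttl_seconds=60.0, clock=lambda: fake_now[0])
cache.put("what is it", ["page 2", "page 1"])
fresh = cache.get("what is it")   # still within the time-to-live
fake_now[0] = 61.0
stale = cache.get("what is it")   # expired, so the entry is dropped
```

A shorter time-to-live keeps results closer to the current web pages at the cost of more cache misses, so the value is a trade-off rather than a fixed rule.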
IMPACT
[Chart: Direct Cache vs. Inverted Index]
THANK YOU
References
Performance of Inverted List Caching, CIS Department, Brooklyn University, NY, USA