Lucene at Yelp
Sudarshan Gaikaiwari
Bio
1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp
Outline
1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits
Federation Motivation
Problem
Index is too large to fit in memory on a single machine
Geographical sharding
Federation
1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable
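Point 3 is the subtle one: IDF depends on corpus statistics, and each shard sees only its own slice of the corpus. One common way to keep scores comparable, sketched here as an assumption and not as a description of how Lucy does it, is to sum document counts and document frequencies across shards and compute IDF from the global totals. The function names and sample numbers are hypothetical; the formula is the classic Lucene-style IDF, 1 + ln(N / (df + 1)).

```python
import math
from collections import Counter

def global_stats(shard_stats):
    """Sum per-shard (num_docs, doc_freqs) pairs into corpus-wide totals."""
    total_docs = 0
    total_df = Counter()
    for num_docs, doc_freqs in shard_stats:
        total_docs += num_docs
        total_df.update(doc_freqs)  # adds counts term by term
    return total_docs, total_df

def idf(num_docs, doc_freq):
    """Classic Lucene-style IDF: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics for two shards of different sizes.
shards = [
    (1000, {"pizza": 99, "sushi": 10}),
    (3000, {"pizza": 100, "sushi": 300}),
]
n, df = global_stats(shards)

# Per-shard IDF differs between shards; the global IDF is a single value.
local_idfs = [idf(num_docs, freqs["pizza"]) for num_docs, freqs in shards]
global_idf = idf(n, df["pizza"])
```

With local statistics the same term gets a different weight on each shard, so scores cannot be merged directly; with the global value every shard weights "pizza" identically.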
Virtual Nodes
Advantages
1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
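A minimal sketch of the virtual node idea, assuming a fixed number of vbuckets and a CRC32 hash of the business id (both are illustrative choices, not details from the talk). The business id always maps to the same vbucket; only the small vbucket-to-shard table changes when data is moved. `NUM_VBUCKETS`, `vbucket_for`, and `ShardMap` are hypothetical names.

```python
import zlib

NUM_VBUCKETS = 1024  # hypothetical; fixed for the lifetime of the index

def vbucket_for(business_id: str) -> int:
    """Map a business id to a virtual bucket with a stable hash."""
    return zlib.crc32(business_id.encode("utf-8")) % NUM_VBUCKETS

class ShardMap:
    """Mutable vbucket -> shard table; the only state a rebalance touches."""

    def __init__(self, num_shards: int):
        # Start with vbuckets spread round-robin over the shards.
        self.assignment = [vb % num_shards for vb in range(NUM_VBUCKETS)]

    def shard_for(self, business_id: str) -> int:
        return self.assignment[vbucket_for(business_id)]

    def move(self, vbucket: int, new_shard: int) -> None:
        """Reassign one vbucket, e.g. to relieve a hot shard."""
        self.assignment[vbucket] = new_shard

shard_map = ShardMap(num_shards=4)
vb = vbucket_for("some-business-id")
before = shard_map.shard_for("some-business-id")
shard_map.move(vb, 4)  # shard 4 stands in for a newly added machine
after = shard_map.shard_for("some-business-id")
```

Splitting a hot shard then amounts to moving some of its vbuckets to another machine; no business id needs to be rehashed.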
Lucy Indexing
Lucy Searching
Lucy Server
Executing queries
1. Gather the top results for a query
2. Collect attribute statistics for attributes like places and categories
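The two steps above can be sketched for a federated setup as a merge of per-shard responses. This is an illustration under assumptions, not Lucy's actual code: each shard is assumed to return its own top hits as `(score, business_id)` pairs plus a dictionary of attribute counts, and `merge_shard_responses` is a hypothetical name.

```python
import heapq
from collections import Counter

def merge_shard_responses(responses, k):
    """Combine per-shard (hits, attribute_counts) pairs into one response.

    hits is a list of (score, business_id); attribute_counts maps an
    attribute value (a category, a place) to a number of matching businesses.
    """
    all_hits = (hit for hits, _ in responses for hit in hits)
    # Highest score first; assumes scores are comparable across shards.
    top_hits = heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
    totals = Counter()
    for _, counts in responses:
        totals.update(counts)
    return top_hits, totals

# Hypothetical responses from two shards.
responses = [
    ([(9.1, "biz-a"), (4.0, "biz-b")], {"pizza": 12, "italian": 5}),
    ([(7.5, "biz-c"), (6.2, "biz-d")], {"pizza": 3, "bars": 8}),
]
top_hits, totals = merge_shard_responses(responses, k=3)
```

Because the index is sharded on business id, each business lives on exactly one shard, so summing the per-shard attribute counts does not double count.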
Lucene
1. Efficiently executes queries over the index
2. Provides a word score: how relevant the business is to the words in the query
3. Upgrading Lucene to 2.9/3.1 is WIP
Federation
Binomial Distribution
Probability that r of the top k hits are in a particular shard
Mean
Variance
Formula
Std Deviation
Formula
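For reference, the standard binomial results are given below, assuming each of the top $k$ hits falls independently into any of $n$ equally likely shards, so that $p = 1/n$. The symbols $n$, $p$, and $X$ (the number of top-$k$ hits in one particular shard) are introduced here for illustration and are not taken from the slides.

```latex
\begin{align*}
P(X = r) &= \binom{k}{r}\, p^{r} (1-p)^{k-r}, \qquad p = \frac{1}{n} \\
\mu &= kp = \frac{k}{n} \\
\sigma^{2} &= kp(1-p) \\
\sigma &= \sqrt{kp(1-p)}
\end{align*}
```

Requesting $\mu$ plus a few multiples of $\sigma$ hits from each shard, instead of all $k$, is what keeps the chance of missing a true top-$k$ hit small.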
Simulation
Recovered table values (row and column labels missing): 0.017; 32, 0.0001407; 44, 0.00000
Simulation Graph
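A simulation of this kind can be sketched as a small Monte Carlo experiment. The setup is an assumption: each of the top k hits lands on a uniformly random shard, independently of the others, and the shard count of 5 is a hypothetical choice, not a number from the talk. `miss_probability` is an illustrative name.

```python
import random

def miss_probability(k, num_shards, per_shard_limit, trials=2000, seed=0):
    """Estimate P(one shard holds more than per_shard_limit of the top k hits).

    Assumes each top hit lands on a uniformly random shard, independently.
    """
    rng = random.Random(seed)
    p = 1.0 / num_shards
    misses = 0
    for _ in range(trials):
        # Number of the top k hits that fall into the shard being watched.
        in_shard = sum(1 for _ in range(k) if rng.random() < p)
        if in_shard > per_shard_limit:
            misses += 1
    return misses / trials

# Hypothetical setup: top 100 hits spread over 5 shards.
for limit in (20, 28, 32, 44):
    print(limit, miss_probability(100, 5, limit))
```

The estimate falls quickly as the per-shard limit moves past the mean of k / num_shards, which is the effect the graph illustrates.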
Results
1. ~50% savings over 100 hits (44 hits requested from each shard)
2. 77% savings over 1000 hits (228 hits requested from each shard)
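The arithmetic behind these figures can be checked directly. The sketch below computes the per-shard saving as 1 - limit / k, which gives 56% for 44 of 100 and 77.2% for 228 of 1000, and the exact binomial tail probability that a shard holds more top hits than were requested from it. The shard count of 5 is a hypothetical value chosen for illustration, and both function names are made up here.

```python
from math import comb

def tail_probability(k, num_shards, per_shard_limit):
    """Exact P(X > per_shard_limit) for X ~ Binomial(k, 1/num_shards)."""
    p = 1.0 / num_shards
    return sum(
        comb(k, r) * p**r * (1 - p) ** (k - r)
        for r in range(per_shard_limit + 1, k + 1)
    )

def per_shard_savings(k, per_shard_limit):
    """Fraction of the k hits that a shard no longer has to return."""
    return 1 - per_shard_limit / k

print(per_shard_savings(100, 44))
print(per_shard_savings(1000, 228))
print(tail_probability(100, 5, 44))  # 5 shards is a hypothetical choice
```

A tail probability near zero means the merged result is, in practice, the same top k that would be obtained by requesting all k hits from every shard.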
Future work
1. In-memory index
2. Move towards real-time search
Thank You
smg@yelp.com