Solr Performance: Key Innovations

Yonik Seeley, Lucid Imagination yonik@lucidimagination.com, May 26 2011
Solr 3.1 Highlights

- Numeric range facets (similar to date faceting).
- New spatial search, including spatial filtering, boosting and sorting capabilities.
- Example Velocity driven search UI at http://localhost:8983/solr/browse
- A new, faster termvector-based highlighter.
- Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full lucene syntax support.
- Distributed search support for the Spell check and Terms components.
Solr 3.1 Highlights (continued)

- Suggester, a fast trie-based autocomplete component.
- Sort results by any function query.
- JSON document indexing.
- CSV response format.
- Apache UIMA integration for metadata extraction.
- Tons of optimizations, bugfixes, and new analysis capabilities via Apache Lucene 3.1.
What's not in 3.1?

- Result Grouping (AKA Field Collapsing)
- Pivot Faceting
- SolrCloud
- Pseudo-fields
- Pseudo-join
- Relevancy function queries
- Per-segment faceting
- *Tons* of new Lucene performance/efficiency goodness
Recent Lucene Performance

TieredMergePolicy: the new default
- Much better for incremental indexing / NRT
- Ignores segment order when selecting the best merge
- Takes deletes into account
- Does not over-merge (no cascading merges)

Finite State Transducer (FST) based terms index

DocumentsWriterPerThread (DWPT)
- Flushing a new segment is now concurrent with indexing
- Use multiple indexing threads/connections
- When max memory is hit, the biggest DWPT is concurrently flushed
[Diagram: indexing threads feed the IndexWriter, which holds several in-memory DWPTs; each DWPT flushes its own segment to disk (_1_0.tiv _1_0.prx _1_0.frq, _2_0.tiv _2_0.prx _2_0.frq, _3_0.tiv _3_0.prx _3_0.frq).]
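The flush rule on this slide ("when max mem is hit, biggest DWPT is concurrently flushed") can be pictured with a small model. This is only a sketch of the selection rule, not Lucene's code: the function name `pick_dwpt_to_flush`, the `max_bytes` budget, and the buffer names are all invented for illustration.

```python
def pick_dwpt_to_flush(dwpt_bytes, max_bytes):
    """Return the name of the DWPT to flush, or None if under budget.

    dwpt_bytes maps a (hypothetical) DWPT name to its in-memory size.
    """
    if sum(dwpt_bytes.values()) <= max_bytes:
        return None
    # Only the largest buffer is flushed; the others keep accepting documents.
    return max(dwpt_bytes, key=dwpt_bytes.get)


# Three hypothetical in-memory buffers, sizes in arbitrary units.
buffers = {"dwpt1": 40, "dwpt2": 75, "dwpt3": 10}
print(pick_dwpt_to_flush(buffers, max_bytes=100))
```

Because only one buffer is written out at a time, the remaining indexing threads are not blocked by the flush, which is the point the slide makes.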
Solr Cloud
http://.../solr/collection1?distrib=true

[Diagram: the request fans out as load-balanced sub-requests to shard1 (replica1, replica2, replica3) and shard2 (replica1, replica2, replica3); the nodes coordinate through a ZooKeeper quorum of ZK nodes.]

ZooKeeper layout:
/collections
  /collection1  configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
/livenodes
  server1:8983/solr
  server2:8983/solr
  server2:8983/solr
/configs
  /myconf
    solrconfig.xml
    schema.xml
Solr Cloud: Getting Started

http://wiki.apache.org/solr/SolrCloud

java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
- -DzkRun: run an internal ZK server
- -Dbootstrap_confdir and -Dcollection.configName: upload ./solr/conf to ZK and call it myconf

http://localhost:8983/solr/collection1/admin/zookeeper.jsp
Distributed Requests
- Explicitly specify node addresses to load-balance across
    shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  - Equivalent nodes are listed together, separated by |
  - Different phases of the same distributed request use the same node
- Specify logical shard ids to search across
    shards=NY_shard,NJ_shard
- Query across all shards in the collection
    http://localhost:8983/solr/collection1/select?distrib=true
- SolrJ Java client that load-balances across all nodes in cluster
    public CloudSolrServer(String zkHost)
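The explicit shards syntax above (equivalent nodes joined with |, shards joined with ,) is easy to get wrong by hand. A minimal sketch of building it from a nested list follows; the helper name `build_shards_param` is an assumption, not part of Solr or SolrJ.

```python
def build_shards_param(shards):
    """Join the replicas of one shard with '|' and the shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


# Each inner list holds equivalent nodes (replicas) of one shard.
shards = [
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
]
shards_param = build_shards_param(shards)
print("shards=" + shards_param)
```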
Extended Dismax Parser

- Superset of dismax
    &defType=edismax&q=foo&qf=body
- Designed to directly handle user queries w/o exceptions
  - Fixes edge cases where dismax could still throw exceptions
- Full lucene syntax support (OR, AND, NOT, -)
  - Tries lucene syntax first
  - Smart escaping is done if there are syntax errors
- Optionally supports treating "and" / "or" as AND/OR in lucene syntax
- Fielded queries (e.g. myfield:foo) even in degraded mode
  - uf parameter controls what field names may be directly specified in q
Extended Dismax Parser (continued)

- boost parameter for multiplicative boost-by-function
- Pure negative query clauses
  - Example: solr OR (-solr)
- Enhanced term proximity boosting
  - pf2=myfield results in term bigrams in sloppy phrase queries
  - myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
- Enhanced stopword handling
  - Stopwords are omitted in the main query, but added in the optional proximity boosting part
  - Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  - Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
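The pf2 behaviour described above amounts to turning the query terms into adjacent word pairs. The following is a minimal sketch of that bigram step only, under the assumption of plain whitespace tokenization; it is not the edismax parser's code, and `word_bigrams` is an invented name.

```python
def word_bigrams(query):
    """Return each adjacent pair of whitespace-separated terms as a phrase."""
    terms = query.split()
    return [" ".join(pair) for pair in zip(terms, terms[1:])]


for query in ("aa bb cc", "solr is awesome"):
    # Each bigram would become one optional sloppy phrase clause on the pf2 field.
    print(query, "->", word_bigrams(query))
```

Note that the stopword "is" survives in the bigrams, which matches the slide's point that stopwords are kept in the proximity-boosting part.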
Faceting Performance Improvements

- For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement
- Optimized facet.method=fc for multi-valued fields and large facet.limit: up to 3x faster
- Optimized deep facet paging: up to 10x faster with really large facet.offsets
- Less memory consumed by field cache entries
- Per-segment faceting with facet.method=fcs
  - Only faster when re-opening index frequently (many times a second)
  - Only works for single-valued fields
Pivot Faceting
- Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting
- Syntax: facet.pivot=field1,field2,field3,...

facet.pivot=cat,inStock

cat                  #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics      14      10                      4
cat:memory           3       3                       0
cat:connector        2       0                       2
cat:graphics card    2       0                       2
cat:hard drive       2       2                       0
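To make the cat,inStock table concrete, here is a small sketch of the counting that a two-level pivot implies: a total per outer value, then a count per (outer, inner) pair. It illustrates the semantics only, not Solr's implementation; `pivot_counts` and the three sample documents are made up for the example.

```python
from collections import Counter


def pivot_counts(docs, outer, inner):
    """Count docs per outer value and per (outer value, inner value) pair."""
    totals = Counter()
    cells = Counter()
    for doc in docs:
        # The outer field is treated as multi-valued, like cat in the slides.
        for outer_value in doc.get(outer, []):
            totals[outer_value] += 1
            cells[(outer_value, doc[inner])] += 1
    return totals, cells


# Hypothetical documents, not the example index from the talk.
docs = [
    {"cat": ["electronics", "memory"], "inStock": True},
    {"cat": ["electronics", "connector"], "inStock": False},
    {"cat": ["electronics"], "inStock": True},
]
totals, cells = pivot_counts(docs, "cat", "inStock")
for value, total in totals.items():
    print(value, total, cells[(value, True)], cells[(value, False)])
```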
Pivot Faceting (continued)
http://...&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
      "field":"cat",
      "value":"electronics",
      "count":14,                  (14 docs w/ cat==electronics)
      "pivot":[{
        "field":"popularity",
        "value":"6",
        "count":5},                (5 docs w/ cat==electronics && popularity==6)
       {
        "field":"popularity",
        "value":"7",
        "count":4},
       {
        "field":"popularity",
        "value":"1",
        "count":2}]},
     {
      "field":"cat",
      "value":"memory",
      "count":3,
      "pivot":[...]},
     [...]
Range Faceting
Like Date faceting, but more generic

http://...&facet=true &facet.range=price &facet.range.start=0 &facet.range.end=500 &facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
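The relationship between start, end, gap and the bucket keys in the response can be sketched as follows. This is an illustration under the assumption that each key is the lower bound of its bucket, as the response above suggests; `range_bucket_starts` is an invented helper, and the query string is built with the standard library only.

```python
from urllib.parse import urlencode


def range_bucket_starts(start, end, gap):
    """Return the lower bound of every bucket between start and end."""
    starts = []
    value = start
    while value < end:
        starts.append(value)
        value += gap
    return starts


# The same parameters as the request on the slide.
query_string = urlencode({
    "facet": "true",
    "facet.range": "price",
    "facet.range.start": 0,
    "facet.range.end": 500,
    "facet.range.gap": 50,
})
bucket_starts = range_bucket_starts(0.0, 500.0, 50.0)
print(query_string)
print(bucket_starts)
```

With start=0, end=500 and gap=50 this yields ten lower bounds, matching the ten keys under "counts".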
Spatial Search
Step 1: Index some locations!
    <field name="name">The Alpine Shop</field>
    <field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
    &pt=44.0153371,-73.16734 &d=1 &sfield=store
Step 3: Profit!
- Spatial Filter: &fq={!geofilt}
- Bounding Box: &fq={!bbox}
- Distance Function: &sort=geodist() asc
- Returning the distance: &fl=geodist() (Pseudo-fields!)
Note: You can now sort by any arbitrary function query!
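As a sanity check on the example values, the great-circle distance between pt and the indexed store can be computed by hand. The sketch below uses the haversine formula and assumes d is in kilometres and an Earth radius of 6371 km; Solr's own constants and implementation may differ slightly, so treat the number as approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; Solr's constant may differ


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)   # value of the store field
pt = (44.0153371, -73.16734)      # value of the pt parameter
distance = haversine_km(*pt, *store)
print(round(distance, 3), "km; within d=1:", distance <= 1.0)
```

The store lies well inside the 1 km radius, so a geofilt with these parameters would be expected to keep it.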
Pseudo-Fields
Returns other info along with document stored fields
- Function queries
    fl=name,location,geodist(),add(myfield,10)
- Fieldname globs
    fl=id,attr_*
- Aliasing
    fl=id,location:loc,_dist_:geodist()
- Multiple fl (field list) values
    &fl=id,attr_*&fl=geodist()&fl=termfreq(text,solr)
- Future: inlined highlighting, explain, sort-values, group-value
Result Grouping / Field Collapsing

- Goal: limit the number of results per "category"
  - "category" is normally defined by unique values in a field
- Uses
  - Web search: collapse by web site
  - Email threads: collapse by thread id
  - Ecommerce/retail: show the top 5 items for each store category (music, movies, etc)

[Screenshots: Field Collapsing by Site; Result Grouping by Category; Field Collapse on Product Type]
Group by Field
http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            {
Group by Query
http://...&group=true&group.query=price:[0 TO 99.99] &group.query=price:[100 TO *]&group.limit=5

"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        {
          "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        {
          "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
Grouping Params

parameter                        default        meaning
group.field=<field>                             Like facet.field: group by unique field values
group.query=<query>                             Like facet.query: top docs that also match
group.function=<function query>                 Group by unique values produced by the function query
group.limit=<n>                  1              How many docs per group
group.sort=<sort spec>           Same as sort   How to sort documents within a group
rows=<n>                         10             How many groups to return
sort=<sort spec>                                How to sort the groups relative to each other (based on top doc)
group.format=<format>            grouped        grouped/simple: if simple, a single flat list is used and rows units are docs
group.main=true/false            false          If true, the first field grouping command is used as main result set
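How group.field, group.limit and rows interact can be seen in a toy model. The sketch below groups an already-sorted list of documents in memory; it only mirrors the response shape shown earlier and is not how Solr computes groups. The function name and the third document id are hypothetical.

```python
def group_by_field(docs, field, group_limit=1, rows=10):
    """Group sorted docs by a field value, keeping group_limit docs per group."""
    groups = {}
    for doc in docs:  # docs are assumed to arrive in overall sort order
        groups.setdefault(doc[field], []).append(doc)
    result = []
    # rows limits the number of groups; group_limit the docs inside each one.
    for value, members in list(groups.items())[:rows]:
        result.append({
            "groupValue": value,
            "numFound": len(members),
            "docs": members[:group_limit],
        })
    return result


docs = [
    {"id": "IW-02", "manu_exact": "Belkin"},
    {"id": "F8V7067-APL-KIT", "manu_exact": "Belkin"},
    {"id": "hypothetical-apple-doc", "manu_exact": "Apple Computer Inc."},
]
grouped = group_by_field(docs, "manu_exact")
for group in grouped:
    print(group["groupValue"], group["numFound"], [d["id"] for d in group["docs"]])
```

With the default group_limit of 1, the Belkin group reports numFound 2 but returns a single document, the same pattern as in the Group by Field response.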
Pseudo-Join

Blogs:
    id: blog1   name: Solr 'n Stuff   owner: Yonik Seeley   started: 2007-10-26
    id: blog2   name: lifehacker      owner: Gawker Media   started: 2005-1-31

Posts:
    id: post1   blog_id: blog1   author: Yonik Seeley     title: Solr relevancy function queries   body: Lucene's default ranking [...]
    id: post2   blog_id: blog1   author: Yonik Seeley     title: Solr result grouping              body: Result Grouping, also called [...]
    id: post3   blog_id: blog2   author: Whitson Gordon   title: How to Install Netflix on Almost Any Android Device

Restrict to blogs mentioning netflix:
    fq={!join from=blog_id to=id}body:netflix
- Finds all documents matching netflix
- Maps to different docs by following blog_id to id
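The two steps of the join (find the matching docs, then follow the from field to the to field) can be mimicked over in-memory dictionaries. This is a sketch of the semantics for single-valued fields only, not Solr's implementation; `join_filter` is an invented name, and the match is done on the title because the slide shows no body text for post3.

```python
def join_filter(docs, from_field, to_field, matches):
    """Return docs whose to_field equals the from_field of any matching doc."""
    # Step 1: collect the from_field values of every doc accepted by matches.
    keys = {doc[from_field] for doc in docs if from_field in doc and matches(doc)}
    # Step 2: keep the docs whose to_field is one of those values.
    return [doc for doc in docs if doc.get(to_field) in keys]


docs = [
    {"id": "blog1", "name": "Solr 'n Stuff"},
    {"id": "blog2", "name": "lifehacker"},
    {"id": "post1", "blog_id": "blog1", "title": "Solr relevancy function queries"},
    {"id": "post2", "blog_id": "blog1", "title": "Solr result grouping"},
    {"id": "post3", "blog_id": "blog2",
     "title": "How to Install Netflix on Almost Any Android Device"},
]
blogs = join_filter(docs, "blog_id", "id",
                    lambda doc: "netflix" in doc.get("title", "").lower())
print([doc["id"] for doc in blogs])
```

Swapping the two field names gives the opposite direction (from blogs to their posts), which is what the "started after 2010" example uses.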
Pseudo-Join Examples
Only show posts from blogs started after 2010
q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
If any post in a blog mentions obama, then search all posts in that blog for bomb (self-join)
q=bomb&fq={!join from=blog_id to=blog_id}obama
If any blog post mentions obama, then search all websites with the same blog owner for bomb
q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama
Cross-Core Join
Single Solr Server with two cores:

collection1
    id: doc1   security: managers              title: doc for managers only   body:
    id: doc1   security: managers, employees   title: doc for everyone        body:

sec1
    id: mary   security_groups: managers, employees
    id: john   security_groups: employees

http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john

Pseudo-Join vs Grouping

Pseudo-Join
- O(n_terms_in_join_fields)
- Single or multi-valued fields
- Filters only (no info currently passed from the "from" docs to the "to" docs)
- Chainable (one join can be the input to another)
- Affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs)

Result Grouping / Field Collapsing
- O(n_docs_in_result)
- Single-valued fields only
- Can order docs within a group, and groups by top doc within that group, using normal sort criteria
- Not currently chainable: can only group one field deep
- Grouping does not currently affect the set of documents matching the query, so faceting is unaffected
Auto-Suggest
- Many people previously used terms component
  - Can be slow for a large corpus
- New auto-suggest builds off SpellCheck component
  - TST implementation: compact memory based trie
  - FST implementation: slower to build, but smaller & faster lookup
  - Based on a field in the main index, or on a dictionary file

http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}
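The idea behind a trie-based suggester, walking down the prefix and then enumerating the words below it, can be sketched in a few lines. This is a plain dictionary-of-children trie written for illustration; it is neither the TST nor the FST implementation mentioned on the slide, and every word except "ultrasharp" is a made-up sample.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False


class PrefixSuggester:
    """Toy prefix lookup over a fixed word list."""

    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def suggest(self, prefix, limit=5):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Depth-first walk below the prefix, in alphabetical order.
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


suggester = PrefixSuggester(["ultrasharp", "ultra", "usb", "ipod"])
print(suggester.suggest("ult"))
```

Lookup cost depends on the prefix length and the number of results rather than on the size of the corpus, which is why this family of structures suits autocomplete.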
Index with JSON

$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
[
  {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
]'
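The same update can be prepared from a script. The sketch below builds the HTTP request with the Python standard library, using the URL and Content-type header from the curl command; it deliberately stops short of sending it so that it runs without a Solr server. `build_update_request` is an invented helper name.

```python
import json
from urllib.request import Request


def build_update_request(docs, url="http://localhost:8983/solr/update/json"):
    """Build (but do not send) a JSON update request for a list of docs."""
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body, headers={"Content-type": "application/json"})


doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "inStock": True,
    "price": 12.50,
}
request = build_update_request([doc])
# urllib.request.urlopen(request) would send it to a running Solr instance.
print(request.get_method(), request.full_url)
```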
Query Results in CSV

http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

- Can handle multi-valued fields (see cat field in example)
- Completely compatible with the CSV update handler (can round-trip)
- Results are streamed: good for dumping entire parts of the index
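On the client side, the CSV response above can be read with any CSV parser. A minimal sketch with the Python standard library follows; it assumes that multi-valued fields arrive as one quoted cell with comma-separated values, as the cat column in the example does, and it pastes the response text in directly rather than fetching it.

```python
import csv
import io

# Response body copied from the example above.
response_text = """\
name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
"""

rows = list(csv.DictReader(io.StringIO(response_text)))
for row in rows:
    # Split the quoted multi-valued cell back into a list of values.
    row["cat"] = row["cat"].split(",")
    print(row["name"], row["price"], row["cat"])
```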
http://localhost:8983/solr/browse
Q&A
