Apache Solr Beyond The Box

Apache Solr

Beyond The Box

Chris Hostetter
2008-11-05

http://people.apache.org/~hossman/apachecon2008us/
http://lucene.apache.org/solr/
Why Are We Here?

Plugins!
● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions

What, How, Where,
Who, When, Why?

What Is Solr (To Users)
● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
  ○ Customizable Request Handlers And Response Writers
  ○ Data Schema With Dynamic Fields And Unique Keys
  ○ Analyzers Created At Runtime From Tokenizers And TokenFilters
What Is Solr (To Developers)
● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic and Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
  ○ Index Consistency
  ○ Data Replication
  ○ Cache Management
How It Started
When/Why To Write A Plugin

“X can be done more efficiently closer to the data.”

OR

“To force X for all clients.”
Solr Internals
In A Nutshell

50,000' View
[Diagram]
● HTTP: SolrDispatchFilter
● Java: EmbeddedSolrServer
● CoreContainer
  ○ SolrCore (multiple)
    ● SolrRequestHandler
    ● SolrQuery(Request/Response)
    ● QueryResponseWriter
MVC-ish
● SolrRequestHandler ... A Controller
  ○ handleRequest( SolrQueryRequest, SolrQueryResponse )
● SolrQueryRequest ... An Event (++)
  ○ Input Parameters
  ○ List of ContentStreams
  ○ Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse ... Model
  ○ Tree of "Simple" Objects and DocLists
● ResponseWriter ... View
  ○ write(Writer, SolrQueryRequest, SolrQueryResponse)
Hello World
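// Package names as of Solr 1.3
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class HelloWorld extends RequestHandlerBase {
  public void handleRequestBody(SolrQueryRequest req,
                                SolrQueryResponse rsp) {
    // Both values come straight from the request parameters
    String name = req.getParams().get("name");
    Integer age = req.getParams().getInt("age");  // null if absent
    // Anything added to the response is rendered by the response writer
    rsp.add("greeting", "Hello " + name);
    rsp.add("yourage", age);
  }
  // Descriptive metadata shown in the admin interface
  public String getVersion() { return "$Revision:$"; }
  public String getSourceId() { return "$Id:$"; }
  public String getSource() { return "$URL:$"; }
  public String getDescription() { return "Says Hello"; }
}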
Hello World Output
http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
      </lst>
      <str name="greeting">Hello Hoss</str>
      <int name="yourage">32</int>
    </response>
http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json
    { "responseHeader":{ "status":0, "QTime":1},
      "greeting":"Hello Hoss",
      "yourage":32
    }
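
Both outputs above come from the same response object; only the wt parameter, and with it the response writer, changes. The following is a minimal plain-Java sketch of that model/view split. It is not Solr's writer code: the class and method names are hypothetical, only String and Integer values are handled, the responseHeader is left out, and nothing is escaped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for response writers: one model, two views. */
final class TinyResponseWriters {

  /** Renders the model with str/int elements, as in the XML output (no escaping). */
  static String toXml(Map<String, Object> model) {
    StringBuilder sb = new StringBuilder("<response>");
    for (Map.Entry<String, Object> e : model.entrySet()) {
      String tag = (e.getValue() instanceof Integer) ? "int" : "str";
      sb.append('<').append(tag).append(" name=\"").append(e.getKey()).append("\">")
        .append(e.getValue())
        .append("</").append(tag).append('>');
    }
    return sb.append("</response>").toString();
  }

  /** Renders the same model as a flat JSON object (no escaping). */
  static String toJson(Map<String, Object> model) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, Object> e : model.entrySet()) {
      if (!first) sb.append(',');
      first = false;
      sb.append('"').append(e.getKey()).append("\":");
      if (e.getValue() instanceof Integer) {
        sb.append(e.getValue());
      } else {
        sb.append('"').append(e.getValue()).append('"');
      }
    }
    return sb.append('}').toString();
  }

  /** Builds the same two entries the HelloWorld handler adds to its response. */
  static Map<String, Object> helloModel(String name, int age) {
    Map<String, Object> model = new LinkedHashMap<>();
    model.put("greeting", "Hello " + name);
    model.put("yourage", age);
    return model;
  }
}
```

The handler never decides how its data is rendered, which is why a single handler can serve XML and JSON clients alike.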
Types Of Plugins
● SolrRequestHandler
  ○ SearchComponent
  ○ QParserPlugin
  ○ ValueSourceParser
● SolrHighlighter
  ○ SolrFragmenter
  ○ SolrFormatter
● UpdateRequestProcessorFactory
● QueryResponseWriter
● UpdateHandler
● Similarity(Factory)
● Analyzer
  ○ TokenizerFactory
  ○ TokenFilterFactory
● FieldType
● SolrCache
  ○ CacheRegenerator
● SolrEventListener

Italics: Only One Per SolrCore


Color:
or Likelihood Of Needing To Write Your Own
Real World Examples

Tibetan And Himalayan
Digital Library Tools

Tsheg Analysis Factories
   public class TshegBarTokenizerFactory
                extends BaseTokenizerFactory {
     public TokenStream create(Reader input) {
       return new TshegBarTokenizer(input);
     }
   }

   public class EdgeTshegTrimmerFactory
                extends BaseTokenFilterFactory {
     public TokenStream create(TokenStream input) {
       return new EdgeTshegTrimmer(input);
     }
   }
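
The factories above are thin wrappers; the analysis itself lives in TshegBarTokenizer and EdgeTshegTrimmer, which the slides do not show. As a rough illustration only, the sketch below assumes the tokenizer splits text on the Tibetan tsheg mark (U+0F0B) and the trimmer strips tsheg marks from the edges of a token. It is plain Java rather than Lucene's TokenStream API, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of tsheg-based splitting and trimming (assumed behavior). */
final class TshegSketch {
  /** TIBETAN MARK INTERSYLLABIC TSHEG. */
  static final char TSHEG = '\u0F0B';

  /** Removes tsheg marks from the start and end of a token. */
  static String trimEdgeTsheg(String token) {
    int start = 0;
    int end = token.length();
    while (start < end && token.charAt(start) == TSHEG) start++;
    while (end > start && token.charAt(end - 1) == TSHEG) end--;
    return token.substring(start, end);
  }

  /** Splits text into the non-empty runs of characters between tsheg marks. */
  static List<String> splitOnTsheg(String text) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == TSHEG) {
        if (current.length() > 0) {
          tokens.add(current.toString());
          current.setLength(0);
        }
      } else {
        current.append(c);
      }
    }
    if (current.length() > 0) tokens.add(current.toString());
    return tokens;
  }
}
```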
DFLL

DFLL: Faceted Browsing
DFLL Category Metadata
● Category ID and Label: 3126 == “Tablet PCs”
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
  ○ Facet ID and Label: 500016 == “OS Provided”
  ○ Facet Display Info: Count vs. Alphabetical, etc...
  ○ Ordered List of Constraints
    ● Constraint ID and Label: 111536 == “Apple OS X”
    ● Constraint Query: os:("OSX10.1" "OSX10.2" ...)
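DfllHandler Pseudo-Code
// Look up the metadata document for the requested category
Document catMetaDoc = searcher.getFirstMatch(catDocId)
// Parsed metadata is cached; work on a copy so per-request
// counts are not written into the cached instance
Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
m = m.clone()
// One search returns both the page of products (DocList)
// and the full set of matches (DocSet)
DocListAndSet results =
              searcher.getDocListAndSet(m.catQuery, ...)
response.add("products", results.docList)
// Each constraint count is the size of its intersection
// with the full set of matches
foreach (Facet f : m) {
  foreach (Constraint c : f) {
    c.setCount(searcher.numDocs(c.query,
                                results.docSet))
  }
}
response.add("metadata", m.asSimpleObjects())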
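
The counting loop in the pseudo-code comes down to set intersection: numDocs(c.query, results.docSet) is the number of documents that match both the constraint query and the category query. A minimal sketch of that idea follows, with java.util.BitSet standing in for Solr's DocSet. The class, the method names, and the document IDs in the usage note are hypothetical.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: facet constraint counts as set intersections. */
final class FacetCountSketch {

  /** Number of doc IDs present in both sets; neither input is modified. */
  static int intersectionSize(BitSet a, BitSet b) {
    BitSet copy = (BitSet) a.clone();
    copy.and(b);
    return copy.cardinality();
  }

  /** Counts, per constraint, how many of the result docs also match that constraint. */
  static Map<String, Integer> countConstraints(BitSet results,
                                               Map<String, BitSet> constraints) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Map.Entry<String, BitSet> e : constraints.entrySet()) {
      counts.put(e.getKey(), intersectionSize(results, e.getValue()));
    }
    return counts;
  }

  /** Convenience builder for a set of doc IDs. */
  static BitSet docs(int... ids) {
    BitSet set = new BitSet();
    for (int id : ids) set.set(id);
    return set;
  }
}
```

For example, if the category query matches docs {1, 2, 3, 5, 8} and a constraint matches docs {2, 3, 9}, the constraint's count is 2.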
Conceptual Picture
[Diagram]
● Inputs
  ○ tablet_form:[* TO *]
  ○ os:("OSX10.1" "OSX10.2" ...)
  ○ memory:[1GB TO *]
  ○ price asc
● getDocListAndSet(Query,Query[],Sort,offset,n)
  ○ DocList: Section of ordered results
  ○ DocSet: Unordered set of all results
● numDocs()
  ○ proc_manu:Intel = 594
  ○ proc_manu:AMD = 382
  ○ price:[0 TO 500] = 247
  ○ price:[500 TO 1000] = 689
  ○ manu:Dell = 104
  ○ manu:HP = 92
  ○ manu:Lenovo = 75
● Query Response
DFLL Response
<result name="products" numFound="394" start="0">...</result>
<lst name="metadata">
 ...
 <lst name="500016">
   <int name="rankDir">0</int><int name="datatype">1</int>
   <int name="rating">88</int><str name="name">OS provided</str>
   <lst name="values">
     <lst name="111536">
       <int name="valueId">111536</int>
       <str name="label">Apple Mac OS X</str>
       <str name="rating">50</str>
       <int name="count">1</int>
     </lst>
     ...
   </lst>
DfllCacheRegenerator
SolrCore “Auto-warms” all SolrCaches when new
versions of the index are opened for searching
(after a commit).

 public interface CacheRegenerator {
   public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                 SolrCache newCache, 
                                 SolrCache oldCache, 
                                 Object oldKey, 
                                 Object oldVal) 
          throws IOException;
}
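
To make the auto-warming contract concrete, here is a plain-Java sketch of the loop that would drive a regenerator: walk the old cache, let the regenerator recompute each entry for the new cache, and stop when it returns false. Plain Maps stand in for SolrCache, and the interface, class, and maxItems parameter are hypothetical. Solr's own warming logic is not reproduced here.

```java
import java.util.Map;

/** Hypothetical sketch of an auto-warming loop over plain maps. */
final class AutowarmSketch {

  /** Analogue of regenerateItem: return true to keep warming, false to stop. */
  interface Regenerator<K, V> {
    boolean regenerateItem(Map<K, V> newCache, K oldKey, V oldVal);
  }

  /**
   * Offers up to maxItems entries of oldCache to the regenerator, in the
   * iteration order of oldCache, and returns how many entries were offered.
   */
  static <K, V> int warm(Map<K, V> oldCache, Map<K, V> newCache,
                         Regenerator<K, V> regenerator, int maxItems) {
    int offered = 0;
    for (Map.Entry<K, V> e : oldCache.entrySet()) {
      if (offered >= maxItems) break;
      boolean keepGoing = regenerator.regenerateItem(newCache, e.getKey(), e.getValue());
      offered++;
      if (!keepGoing) break;
    }
    return offered;
  }
}
```

A regenerator typically ignores the old value and recomputes it against the new searcher, since the old value may be stale after the commit.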

DataImportHandler

DataImportHandler
Builds and incrementally updates indexes based on
configured SQL or XPath queries.
<entity name="item" pk="ID" query="select * from ITEM"
   deltaQuery="select ID ... where 
               ITEMDATE > '${dataimporter.last_index_time}'">
 <field column="NAME" name="name" />
 ...
 <entity name="f" pk="ITEMID" 
    query="select DESC from FEATURE where ITEMID='${item.ID}'"
    deltaQuery="select ITEMID from FEATURE where 
                UPDATEDATE > '${dataimporter.last_index_time}'"
    parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
  <field name="features" column="DESC" />
  ...
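
The ${...} placeholders in the queries above are filled in before each query runs, either from the current row of a parent entity (${item.ID}) or from importer state (${dataimporter.last_index_time}). The sketch below shows that substitution step in plain Java. It is not DataImportHandler's resolver: the class and method names are hypothetical, and variables are looked up in a flat map by their full dotted name.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of ${name} placeholder substitution in a query template. */
final class DihTemplateSketch {
  private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

  /** Replaces each ${name} with vars.get(name); unknown names are left as-is. */
  static String resolve(String template, Map<String, String> vars) {
    Matcher m = VARIABLE.matcher(template);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String value = vars.get(m.group(1));
      String replacement = (value != null) ? value : m.group();
      // quoteReplacement keeps '$' and '\' in the value from being interpreted
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```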
DataImportHandler Plugins
● DataSource
  ○ FileDataSource
  ○ HttpDataSource
  ○ JdbcDataSource
● EntityProcessor
  ○ FileListEntityProcessor
  ○ SqlEntityProcessor
    ● CachedSqlEntityProcessor
  ○ XPathEntityProcessor
● Transformer
  ○ DateFormatTransformer
  ○ NumberFormatTransformer
  ○ RegexTransformer
  ○ ScriptTransformer
  ○ TemplateTransformer
LocalSolr

LocalSolr
LocalUpdateProcessorFactory
● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields

 <updateRequestProcessorChain name="standard" default="true">
   <processor class="....LocalUpdateProcessorFactory">
      <str name="latField">lat</str>
      <str name="lngField">lng</str>
      <int name="startTier">9</int>
      <int name="endTier">17</int>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>
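
As a rough picture of what one grid value per tier could look like, the sketch below divides the world into 2^tier by 2^tier boxes and computes a box ID for a point at every tier from startTier to endTier. This is a deliberately simplified grid chosen for illustration: LocalSolr's actual projection, ID scheme, and field names are not shown in the slides and may differ. The class, the methods, and the tier_ field prefix are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one grid box ID per tier for a lat/lng point. */
final class TierGridSketch {

  /** Box ID on a simple 2^tier x 2^tier grid over the whole lat/lng range. */
  static long boxId(double lat, double lng, int tier) {
    long boxesPerSide = 1L << tier;
    long x = (long) Math.floor((lng + 180.0) / 360.0 * boxesPerSide);
    long y = (long) Math.floor((lat + 90.0) / 180.0 * boxesPerSide);
    // Clamp so that lng = 180 and lat = 90 fall into the last box
    x = Math.min(boxesPerSide - 1, x);
    y = Math.min(boxesPerSide - 1, y);
    return y * boxesPerSide + x;
  }

  /** One field per tier, from coarse (startTier) to fine (endTier). */
  static Map<String, Long> tierFields(double lat, double lng,
                                      int startTier, int endTier) {
    Map<String, Long> fields = new LinkedHashMap<>();
    for (int tier = startTier; tier <= endTier; tier++) {
      fields.put("tier_" + tier, boxId(lat, lng, tier));
    }
    return fields;
  }
}
```

Indexing the box at several tiers lets a query pick the tier whose box size best fits the search radius and then match on a handful of box IDs instead of computing a distance for every document.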
LocalSolr Cartesian Tiers
LocalSolrQueryComponent
● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and
DistanceSortSource
● Can use a custom SolrCache for distances for
commonly used points

  <searchComponent name="geoquery"
                   class="....LocalSolrQueryComponent" />
  <requestHandler name="geo" class="solr.SearchHandler">
     <arr name="components">
       <str>geoquery</str>
       ...
     </arr>
  </requestHandler>
GuardianComponent

GuardianComponent Goal
● When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than Query
● Increase Precision At The Expense Of Recall
  

    q = Dance Party
  

  Dance Party (1995)
  Dance Party (2005) (V)
  Dance Party, USA (2006)
  Workout Party... Let's Dance! (2004) (V)
  Shrek in the Swamp Karaoke Dance Party (2001) (V)
Implementation
● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
  ○ Pick MAX_LEN Based On Number Of Query Clauses
  ○ Re-analyze Stored "title" Field
  ○ Eliminate Any Results That Have More Than MAX_LEN Tokens In "title"
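
The post-processing step can be sketched in plain Java as follows. Two things here are assumptions, since the slides give neither: MAX_LEN is taken to be the number of query clauses plus two, and "re-analysis" is approximated by counting runs of letters and digits instead of running the field's real analyzer. The class and method names are hypothetical, and a list of title strings stands in for the DocList.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of dropping results whose title is much longer than the query. */
final class GuardianFilterSketch {

  /** Assumed rule: allow two tokens of slack beyond the number of query clauses. */
  static int maxLen(int queryClauses) {
    return queryClauses + 2;
  }

  /** Crude stand-in for re-analysis: counts runs of letters and digits. */
  static int countTokens(String title) {
    int count = 0;
    boolean inToken = false;
    for (int i = 0; i < title.length(); i++) {
      boolean wordChar = Character.isLetterOrDigit(title.charAt(i));
      if (wordChar && !inToken) count++;
      inToken = wordChar;
    }
    return count;
  }

  /** Keeps only titles with at most maxLen(queryClauses) tokens, preserving order. */
  static List<String> filter(List<String> titles, int queryClauses) {
    int limit = maxLen(queryClauses);
    List<String> kept = new ArrayList<>();
    for (String title : titles) {
      if (countTokens(title) <= limit) kept.add(title);
    }
    return kept;
  }
}
```

Under these assumptions, the two-clause query "Dance Party" has a limit of four tokens, so of the five example titles the three shortest are kept and the two longest are dropped.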
Alternate Approach
● <copyField source="title" dest="titleLen"/>
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
  ○ Subclass Your Favorite QParser
  ○ Pick MAX_LEN Based On Number Of Query Clauses From Super
  ○ Add +titleLen:[* TO MAX_LEN] Clause To Query
Testing Your Plugins

AbstractSolrTestCase
public class YourTest extends AbstractSolrTestCase {
  ...
  public void testSomeStuff() throws Exception {
    // Index two documents, then commit so they become searchable
    assertU(adoc("id", "7",    "description", "Travel Guide",
                 "title", "Paris in 10 Days"));
    assertU(adoc("id", "42",   "description", "Cool Book",
                 "title", "Hitch Hiker's Guide to the Galaxy"));
    assertU(commit());
    // "guide" matches doc 42 in the boosted title field and doc 7 only
    // in description, so doc 42 should rank first
    assertQ("multi qf", req("q",  "guide",
                            "qt", "dismax",
                            "qf", "title^2 description^1")
            ,"//*[@numFound='2']"
            ,"//result/doc[1]/int[@name='id'][.='42']"
            ,"//result/doc[2]/int[@name='id'][.='7']"
            );
  }
}
Questions?
http://lucene.apache.org/solr/
