Apache Solr Beyond The Box
Chris Hostetter
2008-11-05
http://people.apache.org/~hossman/apachecon2008us/
http://lucene.apache.org/solr/
Why Are We Here?
Plugins!
● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions
What, How, Where,
Who, When, Why?
What Is Solr (To Users)
● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
    Customizable Request Handlers And Response Writers
    Data Schema With Dynamic Fields And Unique Keys
    Analyzers Created At Runtime From Tokenizers And TokenFilters
What Is Solr (To Developers)
● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic and Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
    Index Consistency
    Data Replication
    Cache Management
How It Started
When/Why To Write A Plugin
OR
“To force X for all clients.”
Solr Internals
In A Nutshell
50,000' View
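[Diagram: requests enter over HTTP (SolrDispatchFilter) or directly from Java (EmbeddedSolrServer). A CoreContainer holds multiple SolrCores. Within a SolrCore, a SolrRequestHandler works on the SolrQuery(Request/Response) pair and a QueryResponseWriter renders the response.]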
MVC-ish
● SolrRequestHandler ... A Controller
    handleRequest(SolrQueryRequest, SolrQueryResponse)
● SolrQueryRequest ... An Event (++)
    Input Parameters
    List of ContentStreams
    Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse ... Model
    Tree of "Simple" Objects and DocLists
● ResponseWriter ... View
    write(Writer, SolrQueryRequest, SolrQueryResponse)
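The division of labour above can be sketched without any Solr dependency. The types below are simplified stand-ins invented for illustration (Handler, View, dispatch, and the Map-based request and response are assumptions, not Solr's real classes); they only mirror the shape of handleRequest and write.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Solr's classes.
class MvcSketch {
    /** Controller: reads the request and fills in the response model. */
    interface Handler {
        void handleRequest(Map<String, String> req, Map<String, Object> rsp);
    }

    /** View: serializes the response model to a Writer. */
    interface View {
        void write(Writer out, Map<String, String> req, Map<String, Object> rsp) throws IOException;
    }

    /** Dispatch one request: run the controller first, then the view. */
    static String dispatch(Handler handler, View view, Map<String, String> req) {
        Map<String, Object> rsp = new LinkedHashMap<>(); // model: ordered simple objects
        handler.handleRequest(req, rsp);
        StringWriter out = new StringWriter();
        try {
            view.write(out, req, rsp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Handler hello = (req, rsp) -> rsp.put("greeting", "Hello " + req.get("name"));
        View plain = (out, req, rsp) -> {
            for (Map.Entry<String, Object> e : rsp.entrySet()) {
                out.write(e.getKey() + "=" + e.getValue() + "\n");
            }
        };
        System.out.print(dispatch(hello, plain, Map.of("name", "Hoss")));
    }
}
```

Because the handler only populates the model, the same handler can be paired with any view, which is the property the wt parameter relies on.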
Hello World
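// Package names as of Solr 1.3.
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class HelloWorld extends RequestHandlerBase {
  public void handleRequestBody(SolrQueryRequest req,
                                SolrQueryResponse rsp) {
    // Read request parameters; getInt returns null if "age" is absent.
    String name = req.getParams().get("name");
    Integer age = req.getParams().getInt("age");
    // Anything added to the response is serialized by the ResponseWriter.
    rsp.add("greeting", "Hello " + name);
    rsp.add("yourage", age);
  }
  // Metadata shown in the admin interface.
  public String getVersion() { return "$Revision:$"; }
  public String getSource() { return "$Id:$"; }
  public String getSourceId() { return "$URL:$"; }
  public String getDescription() { return "Says Hello"; }
}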
Hello World Output
http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <str name="greeting">Hello Hoss</str>
  <int name="yourage">32</int>
</response>
http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json
{ "responseHeader":{ "status":0, "QTime":1},
  "greeting":"Hello Hoss",
  "yourage":32
}
Types Of Plugins
● SolrRequestHandler
    SearchComponent
    QParserPlugin
    ValueSourceParser
● SolrHighlighter
    SolrFragmenter
    SolrFormatter
● UpdateRequestProcessorFactory
● QueryResponseWriter
● UpdateHandler
● Similarity(Factory)
● Analyzer
    TokenizerFactory
    TokenFilterFactory
● FieldType
● SolrCache
    CacheRegenerator
● SolrEventListener
Tibetan And Himalayan
Digital Library Tools
Tsheg Analysis Factories
// Factory Solr uses to create the custom tokenizer for a field type.
public class TshegBarTokenizerFactory
    extends BaseTokenizerFactory {
  public TokenStream create(Reader input) {
    return new TshegBarTokenizer(input);
  }
}

// Factory Solr uses to wrap a token stream in the custom filter.
public class EdgeTshegTrimmerFactory
    extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new EdgeTshegTrimmer(input);
  }
}
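The tokenizer and filter classes themselves are not shown in the slides. As a rough illustration of what such analysis might do, the sketch below assumes that a "tsheg bar" is the text between Tibetan tsheg marks (U+0F0B) and that trimming means stripping tshegs from the edges of a token. The class name and both behaviors are assumptions, and the code works on plain Java strings rather than Lucene's TokenStream API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain-Java approximation, not the actual
// Tibetan and Himalayan Digital Library implementation.
class TshegSketch {
    static final char TSHEG = '\u0F0B'; // TIBETAN MARK INTERSYLLABIC TSHEG

    /** Split text on the tsheg mark, dropping empty pieces. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == TSHEG) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                }
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Remove leading and trailing tsheg marks from a single token. */
    static String trimEdges(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && token.charAt(start) == TSHEG) start++;
        while (end > start && token.charAt(end - 1) == TSHEG) end--;
        return token.substring(start, end);
    }
}
```

In Solr the two steps would be chained in a field type's analyzer definition; here they are separate static methods so each can be tried in isolation.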
DFLL
DFLL: Faceted Browsing
DFLL Category Metadata
● Category ID and Label: 3126 == “Tablet PCs”
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
    Facet ID and Label: 500016 == “OS Provided”
    Facet Display Info: Count vs. Alphabetical, etc...
    Ordered List of Constraints
        ● Constraint ID and Label: 111536 == “Apple OS X”
        ● Constraint Query: os:("OSX10.1" "OSX10.2" ...)
DfllHandler Pseudo-Code
Document catMetaDoc = searcher.getFirstMatch(catDocId)
Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
m = m.clone()   // work on a copy so the cached metadata is not modified
DocListAndSet results =
    searcher.getDocListAndSet(m.catQuery, ...)
response.add("products", results.docList)
foreach (Facet f : m) {
    foreach (Constraint c : f) {
        c.setCount(searcher.numDocs(c.query,
                                    results.docSet))
    }
}
response.add("metadata", m.asSimpleObjects())
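The counting loop relies on intersecting each constraint's matching documents with the result set. A minimal sketch of that idea follows, using java.util.BitSet as a stand-in for Solr's DocSet. The class name, the document IDs, and the facet data are made up for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in sketch: a BitSet plays the role of a DocSet (one bit per document ID).
class FacetCountSketch {
    /** Count documents that match both the constraint and the base result set. */
    static int count(BitSet constraintDocs, BitSet resultDocs) {
        BitSet both = (BitSet) constraintDocs.clone(); // clone so the input is untouched
        both.and(resultDocs);
        return both.cardinality();
    }

    /** Convenience builder for a set of document IDs. */
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet results = docs(1, 2, 3, 5, 8);            // documents in the category
        Map<String, BitSet> constraints = new LinkedHashMap<>();
        constraints.put("manu:Dell", docs(1, 2, 9));      // hypothetical matches
        constraints.put("manu:HP", docs(5));
        constraints.forEach((label, set) ->
            System.out.println(label + " = " + count(set, results)));
    }
}
```

Each count is independent of the others, so constraint sets can be cached and reused across categories.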
Conceptual Picture
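[Diagram: the category query tablet_form:[* TO *], the filter queries os:("OSX10.1" "OSX10.2" ...) and memory:[1GB TO *], and the sort "price asc" feed getDocListAndSet(Query,Query[],Sort,offset,n), which yields a DocList and a DocSet. numDocs() against the DocSet produces the constraint counts (proc_manu:Intel = 594, proc_manu:AMD = 382, manu:Dell = 104, manu:HP = 92, manu:Lenovo = 75). Both feed the Query Response.]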
DFLL Response
<result name="products" numFound="394" start="0">...</result>
<lst name="metadata">
  ...
  <lst name="500016">
    <int name="rankDir">0</int><int name="datatype">1</int>
    <int name="rating">88</int><str name="name">OS provided</str>
    <lst name="values">
      <lst name="111536">
        <int name="valueId">111536</int>
        <str name="label">Apple Mac OS X</str>
        <str name="rating">50</str>
        <int name="count">1</int>
      </lst>
      ...
    </lst>
DfllCacheRegenerator
SolrCore “Auto-warms” all SolrCaches when new
versions of the index are opened for searching
(after a commit).
public interface CacheRegenerator {
  public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                SolrCache newCache,
                                SolrCache oldCache,
                                Object oldKey,
                                Object oldVal)
    throws IOException;
}
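Auto-warming amounts to replaying the old cache's keys against the new searcher. The sketch below illustrates that loop with plain maps. The Regenerator interface, the warm method, and the use of a Function as the "new searcher" are simplified assumptions, not Solr's actual types.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-ins for illustration; not Solr's SolrCache or SolrIndexSearcher.
class WarmingSketch {
    /** Mirrors the shape of regenerateItem: return false to stop warming early. */
    interface Regenerator<K, V> {
        boolean regenerateItem(Function<K, V> newSearcher, Map<K, V> newCache,
                               Map<K, V> oldCache, K oldKey, V oldVal);
    }

    /** Replay up to maxItems entries of the old cache into a fresh cache. */
    static <K, V> Map<K, V> warm(Map<K, V> oldCache, Function<K, V> newSearcher,
                                 Regenerator<K, V> regen, int maxItems) {
        Map<K, V> newCache = new LinkedHashMap<>();
        int done = 0;
        for (Map.Entry<K, V> e : oldCache.entrySet()) {
            if (done++ >= maxItems) {
                break; // warming limit reached
            }
            boolean keepGoing = regen.regenerateItem(
                newSearcher, newCache, oldCache, e.getKey(), e.getValue());
            if (!keepGoing) {
                break; // the regenerator asked to stop
            }
        }
        return newCache;
    }
}
```

A typical regenerator ignores the old value and recomputes it, for example `(s, nc, oc, k, v) -> { nc.put(k, s.apply(k)); return true; }`, since the old value may be stale against the new index.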
DataImportHandler
DataImportHandler
Builds and incrementally updates indexes based on
configured SQL or XPath queries.
<entity name="item" pk="ID" query="select * from ITEM"
        deltaQuery="select ID ... where
                    ITEMDATE > '${dataimporter.last_index_time}'">
  <field column="NAME" name="name" />
  ...
  <entity name="f" pk="ITEMID"
          query="select DESC from FEATURE where ITEMID='${item.ID}'"
          deltaQuery="select ITEMID from FEATURE where
                      UPDATEDATE > '${dataimporter.last_index_time}'"
          parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
    <field name="features" column="DESC" />
    ...
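The ${...} placeholders in these queries are filled in at import time, for example with the parent row's ID or the last index time. How DataImportHandler resolves them is not shown in the slides; the sketch below is only a generic illustration of placeholder substitution, with a hypothetical class name and a flat map standing in for the variable scope.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Generic ${name} substitution for illustration; not DataImportHandler's resolver.
class TemplateSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replace each ${name} with its value from vars; unknown names become "". */
    static String resolve(String template, Map<String, String> vars) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = vars.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in values from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sql = "select ID from ITEM where ID=${f.ITEMID}";
        System.out.println(resolve(sql, Map.of("f.ITEMID", "7")));
    }
}
```

Plain string substitution into SQL is shown only to make the mechanism visible; values are not escaped here.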
DataImportHandler Plugins
● DataSource
    FileDataSource
    HttpDataSource
    JdbcDataSource
● EntityProcessor
    FileListEntityProcessor
    SqlEntityProcessor
        ● CachedSqlEntityProcessor
    XPathEntityProcessor
● Transformer
    DateFormatTransformer
    NumberFormatTransformer
    RegexTransformer
    ScriptTransformer
    TemplateTransformer
LocalSolr
LocalSolr
LocalUpdateProcessorFactory
● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields
<updateRequestProcessorChain name="standard" default="true">
  <processor class="....LocalUpdateProcessorFactory">
    <str name="latField">lat</str>
    <str name="lngField">lng</str>
    <int name="startTier">9</int>
    <int name="endTier">17</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
LocalSolr Cartesian Tiers
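The slides do not show how tier boxes are computed. Purely to illustrate the idea of indexing one grid box per tier, the sketch below assumes a simplified flat grid in which tier t divides each axis into 2^t equal cells. The class name, the boxId formula, and the sample coordinates are assumptions; LocalSolr's actual projection and box numbering may differ.

```java
// Simplified multi-resolution grid for illustration; not LocalSolr's algorithm.
class TierSketch {
    /** Hypothetical box ID: row-major index of the cell containing (lat, lng). */
    static long boxId(double lat, double lng, int tier) {
        long cells = 1L << tier; // assumed: 2^tier cells per axis
        long x = Math.min(cells - 1, (long) ((lng + 180.0) / 360.0 * cells));
        long y = Math.min(cells - 1, (long) ((lat + 90.0) / 180.0 * cells));
        return y * cells + x;
    }

    public static void main(String[] args) {
        double lat = 37.77, lng = -122.42; // sample point
        // One value per tier, matching startTier=9 and endTier=17 in the config.
        for (int tier = 9; tier <= 17; tier++) {
            System.out.println("tier " + tier + " -> box " + boxId(lat, lng, tier));
        }
    }
}
```

Higher tiers give smaller boxes, so a radius search can pick the coarsest tier whose boxes still cover the search area and match on a handful of box IDs.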
LocalSolrQueryComponent
● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and
DistanceSortSource
● Can use a custom SolrCache for distances for
commonly used points
<searchComponent name="geoquery"
                 class="....LocalSolrQueryComponent" />
<requestHandler name="geo" class="solr.SearchHandler">
  <arr name="components">
    <str>geoquery</str>
    ...
  </arr>
</requestHandler>
GuardianComponent
GuardianComponent Goal
● When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than The Query
● Increase Precision At The Expense Of Recall
q = Dance Party
Dance Party (1995)
Dance Party (2005) (V)
Dance Party, USA (2006)
Workout Party... Let's Dance! (2004) (V)
Shrek in the Swamp Karaoke Dance Party (2001) (V)
Implementation
● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
    Pick MAX_LEN Based On Number Of Query Clauses
    Re-analyze Stored “title” Field
    Eliminate Any Results With More Than MAX_LEN Tokens In “title”
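The post-processing step can be sketched in plain Java. The rule for choosing MAX_LEN is not given in the slides, so the maxLen method below (twice the number of query clauses) is an arbitrary assumption, and whitespace splitting stands in for re-analyzing the stored field with the real analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; not the actual GuardianComponent.
class GuardianSketch {
    /** Hypothetical rule: allow up to twice as many title tokens as query clauses. */
    static int maxLen(int queryClauses) {
        return queryClauses * 2;
    }

    /** Keep only titles whose token count does not exceed maxLen. */
    static List<String> filter(List<String> titles, int maxLen) {
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            int tokens = title.trim().split("\\s+").length;
            if (tokens <= maxLen) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
            "Dance Party (1995)",
            "Dance Party (2005) (V)",
            "Dance Party, USA (2006)",
            "Workout Party... Let's Dance! (2004) (V)",
            "Shrek in the Swamp Karaoke Dance Party (2001) (V)");
        // "Dance Party" has two clauses, so maxLen is 4 under the assumed rule.
        filter(titles, maxLen(2)).forEach(System.out::println);
    }
}
```

Under these assumptions the two longest titles are dropped, trading some recall for precision.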
Alternate Approach
● <copyField source="title" dest="titleLen"/>
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
    Subclass Your Favorite QParser
    Pick MAX_LEN Based On Number Of Query Clauses From Super
    Add +titleLen:[* TO MAX_LEN] Clause To Query
Testing Your Plugins
AbstractSolrTestCase
public class YourTest extends AbstractSolrTestCase {
  ...
  public void testSomeStuff() throws Exception {
    // Index two documents, then commit so they become searchable.
    assertU(adoc("id", "7", "description", "Travel Guide",
                 "title", "Paris in 10 Days"));
    assertU(adoc("id", "42", "description", "Cool Book",
                 "title", "Hitch Hiker's Guide to the Galaxy"));
    assertU(commit());
    // "guide" matches both docs; the boosted title match should rank first.
    assertQ("multi qf", req("q", "guide",
                            "qt", "dismax",
                            "qf", "title^2 description^1")
            ,"//*[@numFound='2']"
            ,"//result/doc[1]/int[@name='id'][.='42']"
            ,"//result/doc[2]/int[@name='id'][.='7']"
            );
  }
}
Questions?
http://lucene.apache.org/solr/