Programmer's Guide to Open Source Search:
What's New in Apache Lucene 3.0

A Lucid Imagination Technical White Paper
© 2010 by Lucid Imagination, Inc. under the terms of Creative Commons license, as detailed at
http://www.lucidimagination.com/Copyrights-and-Disclaimers/. Version 1.02, published 6 June 2010. Solr,
Lucene, Apachecon and their logos are trademarks of the Apache Software Foundation.
Since its introduction nearly 10 years ago, Apache Lucene has become a competitive player
for developing extensible, high-performance full-text search solutions. The experience
accumulated over time by the community of Lucene committers and contributors and the
innovations they have engineered have delivered significant ongoing advances in Lucene’s
capabilities.
This white paper describes the new features and improvements in the latest versions,
Apache Lucene 2.9 and 3.0. It is intended mainly for programmers familiar with the broad
base of Lucene’s capabilities, though those new to Lucene should also find it a useful
exploration of the newest features. Key topics such as how to upgrade from 2.9 to 3.0, as
well as considerations for migrating from Lucene to Solr, are also addressed.
In the simplest terms, Lucene is now faster and more flexible than before. Historic weak
points have been improved to open the way for innovative new features like near-real-time
search, flexible indexing, and high-performance numerical range queries. Many new
features have been added, new APIs introduced, and critical bugs have been fixed—all with
the same goal: improving Lucene’s state-of-the-art search capabilities.
This white paper aims to address key issues for you if you have an Apache Lucene-based
application, and need to upgrade existing code to work well with these latest versions, so
that you may take advantage of the various improvements and prepare for future releases
and application maintainability. If you do not have a Lucene application, the paper should
also give you a good overview of the innovations in this release.
Unlike the previous 2.4.1 release (March 2009), Lucene 2.9 and 3.0 go well beyond a simple
bug-fix release. They introduce multiple performance improvements, new features, better
runtime behavior, API changes, and bug fixes at a variety of levels. Importantly, 2.9
deprecates a number of legacy interfaces, while 3.0 is essentially a reimplementation of
2.9 without those deprecated interfaces.
The 2.9 release improves Lucene in several key aspects, which make it an even more
compelling alternative to other solutions. Most notably:

- Improvements to near-real-time search capabilities make documents searchable
  almost instantaneously.
- A new, straightforward API for handling numeric ranges both simplifies
  development and virtually wipes out performance overhead.
- The analysis API has been replaced for more streamlined, flexible text handling.
While the majority of programmers are already running on Java 1.5 or 1.6
platforms (1.6 is the recommended JVM), Java 1.4 reached its end of service life in October
2008. With the new major Lucene 3.0 release, all APIs marked as deprecated have
now been removed, enforcing their replacement.
Some important notes on compatibility: because previous minor releases also contained
performance improvements and bug fixes, programmers have been accustomed to
upgrading to a new Lucene version just by replacing the JAR file in their classpath. And, in
those past cases, Lucene-based apps could be upgraded flawlessly without recompiling the
software components accessing or extending Apache Lucene. However, this may not be so
with Lucene 2.9/3.0.
The generated terms are indexed just like any other string values passed to Lucene. Under
the hood, Lucene associates distinct terms with all documents containing the term, so that
all documents containing a numeric value with the same prefix are “grouped” together,
meaning the number of terms that need to be searched is reduced tremendously. This
stands in contrast to the less efficient encoding scheme of previous releases,
where each unique numeric value was indexed as a distinct term, so that the cost of a
range query grew with the number of distinct values in the index.
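The grouping idea behind trie encoding can be sketched in plain Java. This is a simplified illustration, not Lucene's actual on-disk encoding; the class and method names are invented for the sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of trie (prefix) encoding -- NOT Lucene's actual wire
// format, just the grouping idea: each value produces a short, fixed list
// of progressively coarser prefix terms.
public class TriePrefixSketch {

    // Splits a 32-bit int into 32/precisionStep prefix terms.
    static List<String> prefixTerms(int value, int precisionStep) {
        // Flip the sign bit so negative values order before positive ones,
        // as Lucene's sortable numeric encoding does.
        long sortable = (value & 0xFFFFFFFFL) ^ 0x80000000L;
        List<String> terms = new ArrayList<String>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            // Drop `shift` low-order bits: the remaining prefix is shared by
            // all values in the same bucket of size 2^shift.
            terms.add(shift + ":" + Long.toString(sortable >>> shift, 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // a 32-bit int at precision step 4 yields exactly 8 terms
        System.out.println(prefixTerms(1299, 4));
    }
}
```

With this scheme, an aligned range such as [1296..1311] matches a single coarser prefix term instead of 16 distinct values, which is why the number of terms a range query must visit stays small.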
You can also use the native encoding of numeric values beyond range searches. Numeric
fields can be loaded in the internal FieldCache, where they are used for sorting. Zero-
padding of numeric primitives (see code example above) is no longer needed as the trie-
encoding guarantees the correct ordering without requiring execution overhead or extra
coding.
The code listing below instead uses the new NumericField to index a numeric Java
primitive with a precision step of 4. Like NumericField, querying numeric ranges
also provides a type-safe API: NumericRangeQuery instances are created using one
of the static factory methods provided for the corresponding Java primitive. The
example below shows a numeric range query using an int primitive with the same
precision step used in the indexing example. If different precision values are used
at index and search time, numeric queries can yield unexpected results.
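As a sketch of both sides of the API (the field name and values here are purely illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

// Indexing: a NumericField with precisionStep = 4
Document doc = new Document();
NumericField price = new NumericField("price", 4, Field.Store.YES, true);
price.setIntValue(1299);
doc.add(price);

// Searching: the factory method must use the same precisionStep
NumericRangeQuery query =
    NumericRangeQuery.newIntRange("price", 4, 1000, 2000, true, true);
```

Note that the same precision step (4) appears on both sides; as mentioned above, mismatched values at index and search time lead to unexpected results.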
Improvements resulting from the new Lucene numeric capabilities are equally significant in
versatility and performance. Lucene can now cover almost every use case related to
numeric values. From range searches and sorting on float or double values to fast date
searches (dates converted to timestamps), most queries execute in less than 100
milliseconds. By comparison, the old approach using padded full-precision values could
take 30 seconds or more depending on the underlying index.
What the above example does not demonstrate is the full power of the new token API.
There, we replaced one or more characters in the token and discarded the original. Yet
in many use cases, the original token should be preserved in addition to the modified one.
With the old API, handling such a common use case required a fair bit of work and logic.
In contrast, the new attribute-based approach allows the state of attributes to be
captured and restored, which makes such use cases almost trivial. The example below shows
a version of the previous example improved for Lucene 2.9/3.0, in which the original term
attribute is restored once the stream is advanced.
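A minimal sketch of that pattern against the 2.9 attribute API might look like the following; the filter name and the particular character substitution are illustrative:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Emits each token twice: first the original, then a modified copy
// at the same position.
public final class KeepOriginalFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posAtt =
      addAttribute(PositionIncrementAttribute.class);
  private State savedState;

  public KeepOriginalFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (savedState != null) {
      // Restore the original token captured on the previous call...
      restoreState(savedState);
      savedState = null;
      // ...and emit a modified copy stacked at the same position.
      posAtt.setPositionIncrement(0);
      termAtt.setTermBuffer(termAtt.term().replace('ä', 'a')); // example change
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    savedState = captureState(); // keep the original for the next call
    return true;
  }
}
```

Setting the position increment to 0 stacks the variant on the original's position, so phrase queries continue to match either form.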
The separation of attributes makes it possible to add arbitrary properties to the analysis
chain without using a customized Token class. Attributes are then accessible in a
type-safe way to all subsequent TokenStream instances, and can eventually be used by the
consumer. This way, you get a generic means of adding various kinds of custom information,
such as part-of-speech tags, payloads, or average document length, to the token stream.
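As an illustration, a part-of-speech attribute could be declared like this; all names here are hypothetical, following Lucene's Attribute/AttributeImpl pairing convention:

```java
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// The attribute contract: what producers set and consumers read.
public interface PartOfSpeechAttribute extends Attribute {
  void setPartOfSpeech(String pos);
  String getPartOfSpeech();
}

// The implementation Lucene instantiates; a single instance is shared
// by every stage of the analysis chain.
public final class PartOfSpeechAttributeImpl extends AttributeImpl
    implements PartOfSpeechAttribute {
  private String pos;

  public void setPartOfSpeech(String pos) { this.pos = pos; }
  public String getPartOfSpeech() { return pos; }

  @Override
  public void clear() { pos = null; }

  @Override
  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
  }

  @Override
  public boolean equals(Object other) {
    if (other == this) return true;
    if (!(other instanceof PartOfSpeechAttributeImpl)) return false;
    PartOfSpeechAttributeImpl o = (PartOfSpeechAttributeImpl) other;
    return pos == null ? o.pos == null : pos.equals(o.pos);
  }

  @Override
  public int hashCode() { return pos == null ? 0 : pos.hashCode(); }
}
```

A TokenStream or filter obtains the shared instance with addAttribute(PartOfSpeechAttribute.class); downstream filters and the final consumer read the very same object.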
Unfortunately, Lucene 2.9 and 3.0 don't yet provide functionality to persist a custom
Attribute implementation to the underlying index. This improvement, part of what is often
referred to as "flexible indexing," is under active development and is proposed for one of
the upcoming Lucene releases.
Beyond the generalizability of this API, one of its most significant improvements is its
effective reuse of Attribute instances across multiple iterations of analysis. Attribute
implementations are created during TokenStream instantiation and are reused each time
the stream advances to a successive increment. Even if a stream is reused for another
field or document, the same Attribute instances are recycled rather than recreated.
Per-Segment Search
Since the early days of Apache Lucene, documents have been stored at the lowest level in a
segment: a small but entirely independent index. At the highest abstraction level, Lucene
combines segments into one large index and executes searches across all visible segments.
As more and more documents are added to an index, Lucene buffers your documents in
RAM and flushes them to disk periodically. Depending on a variety of factors, Lucene either
incrementally adds documents to an existing segment, or creates entirely new segments. To
reduce the negative impact of an increasing number of segments on search performance,
Lucene tries to combine/merge multiple segments into larger ones. For optimal search
performance, Lucene can optimize an index, which essentially merges all existing segments
into a single segment.
Prior to Lucene 2.9, search logic resided at the highest abstraction level, accessing a single
IndexReader no matter how many segments the index was composed of. Similarly, the
FieldCache was associated with the top-level IndexReader, and then had to be
invalidated each time an index was reopened. With Lucene 2.9, the search logic and the
FieldCache have moved to a per-segment level. While this has introduced a little more
internal complexity, the benefit of the tradeoff is a new per-segment index behavior that
yields a rich variety of performance improvements for unoptimized indexes.
The majority of Lucene users won't touch the changes related to
per-segment search during their day-to-day business unless they are
working on low-level code implementing Filter or custom
Collector classes. Both classes directly expose the per-segment
model, for instance through Collector#setNextReader(), which is called once
for each segment during search. The Filter API, by contrast, doesn't
immediately reveal its relation to per-segment search and has caused
lots of confusion in the past.
Filter#getDocIdSet(IndexReader) and its deprecated
relative Filter#bits(IndexReader) are also called once per
segment instead of once per index. The document IDs set by the
Filter must be relative to the current segment rather than absolute.
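A minimal custom Collector, sketched against the 2.9 API, shows how the segment-relative IDs are handled; the class name is illustrative:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Collects absolute document IDs across all segments of the index.
public class AllDocIdsCollector extends Collector {
  private final List<Integer> docIds = new ArrayList<Integer>();
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) {
    // scores are not needed for this collector
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase)
      throws IOException {
    // Called once per segment; docBase is the segment's offset
    // within the top-level index.
    this.docBase = docBase;
  }

  @Override
  public void collect(int doc) {
    // doc is relative to the current segment, so re-base it.
    docIds.add(docBase + doc);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // no ordering requirement allows faster scorers
  }

  public List<Integer> getDocIds() { return docIds; }
}
```

Forgetting to add docBase in collect() is exactly the kind of bug the per-segment model can introduce in code ported from earlier Lucene versions.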
MultiTermQuery-Related Improvements

In Lucene 2.4, many standard queries, such as FuzzyQuery,
WildcardQuery, and PrefixQuery, were refactored and
subclassed under MultiTermQuery. Lucene 2.9 adds some
improvements under the hood, resulting in much better performance
for those queries.
In Lucene 2.9/3.0, multi-term queries now use a constant score internally, based on the
assumption that most programmers don't care about the interim scores of the queries
resulting from the term expansion that takes place during query rewriting.
Payloads
The Payloads feature, though originally added in a previous version of Lucene, remains
pretty new to most programmers. A payload is essentially a byte array that is associated
with a particular term in the index. Payloads can be associated with a single term during
text analysis and subsequently committed directly to the index. On the search side, these
byte arrays are accessible to influence the scoring for a particular term, or even to filter
entire documents.
See www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads for more information.
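On the analysis side, attaching a payload might look like the following sketch against the 2.9 attribute API; the filter name and the single "boost level" byte are illustrative:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Attaches a one-byte payload (here: a hypothetical boost level)
// to every token that passes through the filter.
public final class BoostPayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt =
      addAttribute(PayloadAttribute.class);
  private final byte level;

  public BoostPayloadFilter(TokenStream input, byte level) {
    super(input);
    this.level = level;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // The byte array is stored in the index alongside the term's position.
    payloadAtt.setPayload(new Payload(new byte[] { level }));
    return true;
  }
}
```

At search time, payload-aware queries can then read these bytes back through the Similarity's scorePayload method to influence ranking.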
To provide a smooth transition from the existing core parser to the new API, this contrib
package also contains an implementation fully compliant with the standard query syntax.
This not only helps the switch to the new query parser but also serves as an example of
how to use and extend the API. That said, the standard implementation is based on the new
query parser API and therefore can't simply replace the core parser as is. If you have been
replacing Lucene's query parser with your own, you can use QueryParserWrapper instead,
which preserves the old query parser interface but calls the new parser framework. One
final caveat: QueryParserWrapper is marked as deprecated, as the new query parser
will be moved to the core in an upcoming release and will eventually replace the old API.
Should you move to 2.9 or 3.0? Either way, bear in mind that moving to 3.0 requires a
migration to 2.9 first; it is a prerequisite. Only once that 2.9 transition is
complete will you be ready to work through the deprecation warnings and move on.
Because 3.0 is a deprecation release, all code marked as deprecated in Lucene 2.9 has been
removed.
removed. Some parts of the API might be modified in order to make use of Java Generics,