Professional Documents
Culture Documents
Elasticsearch and Apache Lucene
Elasticsearch and Apache Lucene
The fundamental concepts required to understand the theory behind Apache Lucene are
indexes, documents, inverted indexes, scoring, and tokenisation. I will briefly describe these
concepts below.
An index is a collection of documents sharing conceptual and logical similarities. You can think
of an index as a folder with multiple related documents. Given that Elasticsearch is a distributed
system and clusters can be added on demand, there is virtually no limit to the number of
documents an Elasticsearch server can store.
A document is a record containing information related to the index. Documents do not have a
specific scheme and every document pushed into the index is tagged with a unique identifier. A
document contains a set of fields and is generally stored in JSON format. Fields are the minimal
unit of storage in the Lucene ecosystem. A simple document is illustrated in the following
figure:
The following table shows an example of an inverted index substructures build based on a set
of documents:
Scoring is the process that compares the user input against the stored documents and assigns
relevance values for each result. Similar to how the Google search engine works, the outcome
of the scoring process is a sorted array containing all the search matches ordered by score.
relevance,
user satisfaction, and
algorithms.
The first aspect is related to the relevance definition itself. Even though different users perform
the same search, the relevance results could be different for each of them; therefore, the
relevance should be aligned to the user expectations.
The second aspect concerns knowledge about a user’s satisfaction with the relevance score
provided in the search. In fact, giving feedback to the algorithm on the user’s perception is a
very fascinating field related to reinforcement learning techniques, which by itself would
warrant a separate article.
The third aspect encompasses the algorithms which calculate the scores. There are two main
algorithms used for scoring: Term Frequency-Inverse Document Frequency (TD-IDF) and Best
Match 25 (BM25). Both algorithms are rooted in the concept of tokenisation.
Tokenisation is a fundamental concept of the Natural Language Processing (NLP) field, which
is also being applied to search engines. Tokenisation consists of splitting the text string or
documents into tokens, or smaller chunks. Although tokenisation libraries provide a set of
methods for splitting text, users can implement their own rules using regular expressions.
Common tokenisation methods include breaking up the text by words or by punctuation marks.
Although tokenisation is commonly applied to text, in certain situations, special techniques allow
for the tokenisation of other kinds of data, such as numbers or geographical coordinates.
Consequently, each specific use case should be taken into consideration during the
configuration of Elasticsearch tokenisation capabilities.
The previously reviewed concepts work together to establish relevance scores from semi-
structured data. When a document is saved in Elasticsearch, two processes happen in parallel:
in the first, the raw document is stored in the index; in the second, each document passes
through an analysis phase before being stored in the inverted index. This is described in the
following figure:
The analysis layer is a configurable step which, by default, includes three actions: character
filter, tokeniser, and token filters. These actions facilitate the search algorithms. Additionally,
each document pushed into the search engine goes through a metadata enrichment process,
which can be as simple as the counting of terms or as complex as the calculation of a sentiment
trend.
Naturally, when the query API is used for retrieving documents, the search terms pass through
the same actions used when documents were stored. That way, search terms can be matched
against the inverted index, and the search server can calculate the scores accordingly and
return the results sorted by relevance. This process is depicted in the next figure:
The following example will clarify the process described above. Let us say we have the next
document from a review:
{
“description”: “Barcelona has beautifullllll beaches.”
}
After applying the analysis step, the information in the inverted index looks something like this:
[“barcelona”, “beautiful”, “beach”]
When a user inputs the search string “Barcelona’s beaches” and the analysis layer is applied,
the search string is transformed into something like this:
[“barcelona”, “beach”]
The resulting terms can now be matched against the inverted index, and the result can be
obtained by direct comparison.
For Endava’s Innovation Lab, an internal competition where Endavans develop innovative tech
solutions, we proposed a product using Elasticsearch as data storage. The goal was to provide
the recruiting team with a list of the most relevant candidates for a specific opening.
To achieve this, the system extracts professional profiles from various sources and then creates
a cherry-picked selection of candidates for the recruiting team, based on NLP techniques
leveraged by Elasticsearch’s built-in algorithms.
All the features explained before fed the idea of using Elasticsearch as principal storage.
However, there were two design decisions remaining for the final architectural approach: the
principal storage mechanism and the deployment method (cloud or on-premises).
Ultimately, Elasticsearch was selected as the principal storage mechanism, due to the medium-
scale storage requirement for the entire solution (estimated at just 36 Gigabytes) and the non-
transactional nature of the business requirements.
We deployed this solution in the AWS cloud because it provides a fully maintained Elasticsearch
service, and it can be easily integrated with other services within the AWS ecosystem.
The following diagram reflects the final design, including the forementioned decisions:
CONCLUSIONS
The purpose of this article was to review how the main components of Elasticsearch and
Apache Lucene interact to provide an effortless way to resolve relevance problems with search
results.
Apache Lucene is the heart of Elasticsearch and provides an interface which helps with
abstracting the complexity and algorithms behind the scenes.
For most business requirements, a default configuration of Elasticsearch will be sufficient.
However, some cases may require improvements in how documents are scored. This kind of
score tuning is possible, but such implementations require a deep understanding of the
relevance algorithms used in Apache Lucene.