
Core Elasticsearch Developer

An Elastic Training Course



training.elastic.co
5.5.1
Course: Core Elasticsearch Developer

Version 5.5.1

© 2015-2017 Elasticsearch BV. All rights reserved. Decompiling, copying, publishing and/or distribution without written consent of Elasticsearch BV is
strictly prohibited.



Lab Prep Checklist
• Download the PDF and zip file from the links provided in the email
• Minimum requirements:
‒ 20% or more free disk space
‒ Install Java
‒ version 1.8.0_20 minimum; 1.8.0_73 or newer recommended
‒ 64-bit JDK preferred
‒ set the JAVA_HOME environment variable
• Unzip the downloaded lab zip file on your hard drive
‒ this creates a directory called CoreDeveloper
‒ do not unzip the lab file into a folder with spaces in the name

Agenda and Introductions
Course Agenda
1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships

Introductions
• Name
• Company
• What do you do?
• What are you using Elasticsearch for?
• What do you hope to get out of this training?

Logistics
• Facilities
• Emergency Exits
• Restrooms
• Breaks/Lunch

Chapter 1
Introduction to Elasticsearch

Topics covered:
• The Story of Elasticsearch
• The Components of Elasticsearch
• Documents
• Indexes
• Indexing Data

• Searching Data
• The Bulk API

The Story of Elasticsearch
Once upon a time…
• As any good story begins, “Once upon a time…”
‒ More precisely: in 1999, Doug Cutting created an open-source
project called Lucene
• Lucene is:
‒ a search engine library entirely written in Java
‒ a top-level Apache project, as of 2005

‒ great for full-text search
• But, Lucene is also:
‒ a library (you have to incorporate it into your application)
‒ challenging to use
‒ not originally designed for scaling

“Search is something that any application should have”
Shay Banon, Creator of Elasticsearch

The Birth of Elasticsearch
• In 2004, Shay Banon developed a product called Compass
‒ Built on top of Lucene, Shay’s goal was to have search
integrated into Java applications as simply as possible
• The need for scalability became a top priority
• In 2010, Shay completely rewrote Compass with two main
objectives:

1. distributed from the ground up in its design
2. easily used by any other programming language
• He called it Elasticsearch
‒ …and we all lived happily ever after!
• Today Elasticsearch is the most popular enterprise search
engine

1. Distributed search:
• Elasticsearch is distributed and scales horizontally:
‒ A node is an instance of Elasticsearch
‒ A cluster is a collection of Elasticsearch nodes
‒ Your cluster can grow as your needs grow

[Diagram: a 1 node cluster, a 10 node cluster, and a 100+ node cluster, labeled with Master Node(s), Ingest Node(s), Data Nodes, Data Nodes – Hot, and Data Nodes – Warm]

2. Easily used by other languages:
• Elasticsearch provides REST APIs for communicating with
a cluster over HTTP
‒ Allows client applications to be written in any language

[Diagram: a client app sends an HTTP request to the REST APIs and receives an HTTP response]

Client apps can be written in Java, .NET, PHP, Groovy, Ruby,
Python, Perl, or whatever you want!

Elasticsearch History
• 0.4: first version was released in February, 2010
• 1.0: released in February, 2014
• 2.0: released in October, 2015
• 5.0: released in October, 2016

• Why the big jump in version number?
‒ The Elastic Stack…

The Elastic Stack
• Elasticsearch is a component of the Elastic Stack

[Diagram: the Elastic Stack: Kibana, Elasticsearch, Beats, and Logstash, alongside X-Pack (Security, Alerting, Monitoring, Reporting, Graph, Machine Learning) and Elastic Cloud]

The Components of Elasticsearch
• The main components of Elasticsearch are:
‒ node: an instance of Elasticsearch
‒ cluster: one or more nodes that work together to serve the same
data and API requests
‒ document: a piece of data that you want to search
‒ index: a collection of documents that have somewhat similar
characteristics

• Let’s take a look at documents and indexes (indices)…

Documents
• A document can be any text or numeric data you want to
search and/or analyze
• For example:
‒ an entry in a log file,
‒ a comment made by a customer on your website,
‒ the weather details from a weather station at a specific moment

• Specifically, a document is a top-level object that is
serialized into JSON and stored in Elasticsearch
• Every document has a unique ID
‒ which either you provide
‒ or Elasticsearch generates for you

Documents must be JSON objects
• Suppose you have some stock prices in a CSV format
‒ The data needs to be converted to JSON
‒ Each row of data represents a single document

stocks.csv (each row has stock price data):
20100526,PPG,62.83,62.83,61.46,61.86,32203

Convert the CSV to JSON (JSON consists of fields and values):
{
  "volume" : 32203,
  "high" : 62.83,
  "stock_symbol" : "PPG",
  "low" : 61.46,
  "close" : 61.86,
  "trade_date" : "2010-05-26T06:00:00.000Z",
  "open" : 62.83
}

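The CSV-to-JSON conversion described above can be sketched in a few lines of Python. This is only an illustration: the column order in COLUMNS is an assumption inferred from the sample row and the JSON document, and the date is emitted as a plain ISO date rather than the full timestamp shown on the slide.

```python
import csv
import io
import json
from datetime import datetime

# Assumed column order for stocks.csv, inferred from the sample row and
# the JSON document on the slide (it is not stated in the course material).
COLUMNS = ["trade_date", "stock_symbol", "open", "high", "low", "close", "volume"]


def row_to_document(row):
    """Turn one CSV row (a list of strings) into a JSON-ready dict."""
    doc = dict(zip(COLUMNS, row))
    # 20100526 -> 2010-05-26 (date only; the slide's time of day is not reproduced)
    doc["trade_date"] = datetime.strptime(doc["trade_date"], "%Y%m%d").date().isoformat()
    for field in ("open", "high", "low", "close"):
        doc[field] = float(doc[field])
    doc["volume"] = int(doc["volume"])
    return doc


def csv_to_documents(text):
    """Yield one document per non-empty CSV row."""
    for row in csv.reader(io.StringIO(text)):
        if row:
            yield row_to_document(row)


sample = "20100526,PPG,62.83,62.83,61.46,61.86,32203\n"
documents = list(csv_to_documents(sample))
print(json.dumps(documents[0], indent=2))
```

Each resulting dict is one document, ready to be serialized and sent in the body of an indexing request.
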
Indexes
Definition of an Index
• An index in Elasticsearch is a logical way of grouping data:
‒ An index contains mappings that define the documents’ field
names and data types of the index
‒ an index is a logical namespace that maps to where its contents
are stored in the cluster
• There are two different concepts in this definition:
‒ an index has some type of data schema mechanism

‒ an index has some type of mechanism to distribute data across a
cluster

You can have lots of indexes:
• Indexes are fairly lightweight in Elasticsearch, so it is not
unusual to create lots of them
• For example, you might define a new index every day for
searching logs
‒ Using multiple indices allows you to organize your data logically:
‒ logs-2017-01-01
‒ logs-2017-01-02
‒ logs-2017-01-03
‒ and so on

[Diagram: three separate indices: /logs-2017-01-01, /logs-2017-01-02, /logs-2017-01-03]

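A client that writes to one index per day needs to derive the index name from the date. A minimal sketch, assuming the logs-YYYY-MM-DD naming shown above (the helper name daily_index_names is made up for this example):

```python
from datetime import date, timedelta


def daily_index_names(prefix, start, days):
    """Return one index name per day, e.g. logs-2017-01-01."""
    return [
        f"{prefix}-{(start + timedelta(days=offset)).isoformat()}"
        for offset in range(days)
    ]


names = daily_index_names("logs", date(2017, 1, 1), 3)
print(names)  # ['logs-2017-01-01', 'logs-2017-01-02', 'logs-2017-01-03']
```
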
When we say “index”, we mean…
• We use the word “index” in multiple ways:
‒ Noun: a document is put into an index in Elasticsearch
‒ Verb: to index a document is to put the document into an index
in Elasticsearch

documents are indexed into an index
data ingestion involves indexing documents

[Diagram: stock documents being indexed into /my_index1, /my_index2, and /my_index3; the slide shows only the start of three documents]
{ "volume": 66605, "high": 28.86, "stock_symbol": "ALL", "low": 28.37, "close": 28.82, "trade_date": "2009-12-18T07:00:00.000Z", ...
{ "volume": 46965, "high": 31.56, "stock_symbol": "ALL", "low": 30.68, "close": 30.91, "trade_date": "2010-01-15T07:00:00.000Z", ...
{ "volume": 32203, "high": 62.83, "stock_symbol": "PPG", "low": 61.46, "close": 61.86, "trade_date": "2010-05-26T06:00:00.000Z", ...

Indexing Data
Let’s index some simple JSON documents:
“username” and “comment” are fields; the strings that follow them are the values

{
  "username" : "harrison",
  "comment" : "My favorite movie is Star Wars!"
}

{
  "username" : "maria",
  "comment" : "The North Star is right above my house."
}

{
  "username" : "eniko",
  "comment" : "My favorite movie star in Hollywood is Harrison Ford."
}

Define an Index
• Clients communicate with a cluster using Elasticsearch’s
REST APIs
‒ There is a collection of Document APIs for working with indexes
and documents
• An index is defined using the Create Index API, which can be
accomplished with a simple PUT command:
‒ “200 OK” response if successful


curl -XPUT 'http://localhost:9200/my_index' -H "Content-Type: application/json"

‒ -XPUT is the PUT command
‒ my_index is the name of the new index

Index a Document
• The Index API is used to index a document
• Use a PUT again, but this time send the document in the
body of the PUT request:
‒ notice you specify a document type and a unique ID
‒ “201 Created” response if successful

curl -XPUT 'http://localhost:9200/my_index/doc/1' -i -H "Content-Type: application/json" -d '
{
  "username":"harrison",
  "comment":"My favorite movie is Star Wars!"
}'

‒ in the URL, my_index is the index, doc is the document type, and 1 is the document id
‒ “doc” can actually be any name we choose for our document type

Dynamic Index Creation
• Elasticsearch will create the index for you during the
indexing of a document, if the index is not already created:
‒ we could have simply executed the command on the previous
slide (and skipped that first “PUT my_index” command)


curl -XPUT 'http://localhost:9200/my_index/doc/1' -H "Content-Type: application/json" -d '
{
  "username":"harrison",
  "comment":"My favorite movie is Star Wars!"
}'

‒ my_index is created if it does not exist
‒ the doc type is created if it does not exist
‒ If the id already exists, the document is reindexed

Index Without Specifying an ID
• You can leave off the id and let Elasticsearch generate one
for you:
‒ But notice that only works with POST, not PUT


curl -XPOST 'http://localhost:9200/my_index/doc/' -i -H "Content-Type: application/json" -d '
{
  "username" : "maria",
  "comment" : "The North Star is right above my house."
}'

‒ POST instead of PUT
‒ No id specified

Index Without Specifying an ID
• If the POST is successful, the auto-generated ID is returned
in the response:

HTTP/1.1 201 Created
Location: /my_index/doc/AVnWIzyrUxRN673zt1An
content-type: application/json; charset=UTF-8
content-length: 220

{
  "_index" : "my_index",
  "_type" : "doc",
  "_id" : "AVnWIzyrUxRN673zt1An",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

‒ 201 response if successful

Retrieving a Document
• Use GET to retrieve an indexed document
‒ Returns 200 if the document is found
‒ Returns a 404 error if the document is not found

curl -XGET 'http://localhost:9200/my_index/doc/1' -H "Content-Type: application/json"

{
  "_index": "my_index",
  "_type": "doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "username": "harrison",
    "comment": "My favorite movie is Star Wars!"
  }
}

The _source Endpoint
• You can get just the source of a document using the
_source endpoint

curl -XGET 'http://localhost:9200/my_index/doc/1/_source' -H "Content-Type: application/json"

{
  "username": "harrison",
  "comment": "My favorite movie is Star Wars!"
}

Searching Data
“Hello, Search!”
• Let’s run our first search
‒ sort of a “Hello, World!” for Elasticsearch
• The simplest search is a match_all, which simply matches
all documents in the index (or indices) being searched:

curl -XGET 'http://localhost:9200/my_index/_search' -H "Content-Type: application/json" -d '
{
  "query": {
    "match_all": {}
  }
}'

‒ Use GET or POST
‒ Invoke the _search endpoint
‒ Add a “query” clause
‒ “match_all” matches all documents

…and the search results are also JSON
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "doc",
        "_id": "AVlBSnbcmM0CN4-9FrtP",
        "_score": 1,
        "_source": {
          "username": "maria",
          "comment": "The North Star is right above my house."
        }
      },
      ...

‒ took = the number of milliseconds it took to process the query
‒ total = the number of documents found
‒ hits = array containing the documents that “hit” the search criteria
‒ _id = the document’s unique ID

A Simpler “Hello, World”
• If you want a search that matches all documents, you can
simply send a request to the _search endpoint
‒ and omit the “query” body
‒ same results as running a match_all
‒ by default, Elasticsearch returns 10 results


curl -XGET 'http://localhost:9200/my_index/_search' -H "Content-Type: application/json"

CRUD Operations
• We will discuss these in more detail later in the course:

Index:
PUT my_index/doc/4
{
  "username" : "kimchy",
  "comment" : "I like search!"
}

Create:
PUT my_index/doc/4/_create
{
  "username" : "kimchy",
  "comment" : "I like search!"
}

Read:
GET my_index/doc/4

Update:
POST my_index/doc/4/_update
{
  "doc" : {
    "comment" : "I love search!!"
  }
}

Delete:
DELETE my_index/doc/4

Index vs. Create
• The _create operation can explicitly be used when indexing
a document
‒ If the id already exists, the operation fails


PUT my_index/doc/5/_create
{
  "username" : "ricardo",
  "comment" : "I love search too"
}

‒ Fails if id=5 is already indexed

PUT my_index/doc/6
{
  "username" : "maria",
  "comment" : "It's a great day for search"
}

‒ Indexes a new document if id=6 is not already indexed.
Otherwise, the current doc is deleted and this new one is indexed.

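The difference between the two operations can be made concrete with a toy in-memory model. This is a sketch for intuition only: ToyIndex is a made-up class, not part of any Elasticsearch client, and it models nothing but the "replace if present" versus "fail if present" behavior described above.

```python
class ToyIndex:
    """In-memory stand-in for one index; NOT how Elasticsearch stores data."""

    def __init__(self):
        self._docs = {}  # id -> (version, source)

    def index(self, doc_id, source):
        """Like PUT index/type/id: create the document or replace the existing one."""
        version = self._docs.get(doc_id, (0, None))[0] + 1
        self._docs[doc_id] = (version, source)
        status = 201 if version == 1 else 200
        return {"status": status, "_id": doc_id, "_version": version}

    def create(self, doc_id, source):
        """Like PUT index/type/id/_create: refuse to touch an existing id."""
        if doc_id in self._docs:
            return {"status": 409, "_id": doc_id}  # conflict: id already indexed
        return self.index(doc_id, source)


store = ToyIndex()
first = store.create("5", {"username": "ricardo", "comment": "I love search too"})
second = store.create("5", {"username": "ricardo", "comment": "again"})
third = store.index("5", {"username": "ricardo", "comment": "replaced"})
print(first["status"], second["status"], third["status"], third["_version"])
```

The first create succeeds, the second is rejected because id 5 already exists, and the plain index call replaces the document and bumps its version.
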
The Multi Get API
• The Multi Get API allows you to GET multiple documents in
a single request
‒ Avoid multiple round trips
‒ Use the _mget endpoint

curl -XGET 'http://localhost:9200/_mget' -H "Content-Type: application/json" -d '
{
  "docs" : [
    {
      "_index" : "INDEX_NAME",
      "_type": "TYPE",
      "_id" : "ID"
    },
    {
      "_index" : "INDEX_NAME",
      "_type": "TYPE",
      "_id" : "ID"
    }
  ]
}'

‒ “docs” is an array of documents to GET

Example of _mget

Request:
curl -XGET 'http://localhost:9200/_mget' -H "Content-Type: application/json" -d '
{
  "docs" : [
    {
      "_index" : "products",
      "_type" : "product",
      "_id" : 101
    },
    {
      "_index" : "my_index",
      "_type" : "doc",
      "_id" : 4
    }
  ]
}'

Response:
{
  "docs": [
    {
      "_index": "products",
      "_type": "product",
      "_id": "101",
      "_version": 1,
      "found": true,
      "_source": {
        "brandName": "Kraft",
        "customerRating": 5,
        "price": 4.29,
        "grp_id": "101",
        "quantitySold": 844255,
        "upc12": "021000023233",
        "productName": "Kraft Velveeta Shells & Cheese 2% Milk"
      }
    },
    {
      "_index": "my_index",
      "_type": "doc",
      "_id": "4",
      "_version": 1,
      "found": true,
      "_source": {
        "username" : "kimchy",
        "comment" : "I like search!"
      }
    }]}

The Bulk API
Cheaper in Bulk
• The Bulk API makes it possible to perform many write
operations in a single API call, which greatly increases
indexing speed
‒ useful if you need to index a data stream such as log events,
which can be queued up and indexed in batches of hundreds or
thousands
• Four actions: create, index, update and delete
• The response is a large JSON structure with the individual
results of each action that was performed
‒ The failure of a single action does not affect the remaining
actions

Syntax of _bulk
• Has a unique syntax based on lines of commands:
‒ Each command appears on a single line


curl -XPOST 'http://localhost:9200/_bulk' -H "Content-Type: application/x-ndjson" -i -d '
{"create" : {...}}
{DOCUMENT}
{"index" : {...}}
{DOCUMENT}
{"delete" : {...}}
'

‒ The request goes to the _bulk endpoint
‒ Notice the different content-type for _bulk
‒ The line following “create” or “index” is the document that gets indexed
‒ “delete” does not have a following line

Example of _bulk

curl -XPOST 'http://localhost:9200/_bulk' -H "Content-Type: application/x-ndjson" -i -d '


{"create" : {"_index" : "customers", "_type":"customer_type", "_id":102}}
{"firstname" : "Maureen","lastname" : "OHara","address" : "12 Elm St","city":"Brooklyn"}
{"index" : {"_index" : "customers", "_type":"customer_type", "_id":103}}
{"firstname" : "Lucy","lastname" : "Liu","address" : "39 Pine Tree Ln","city":"Albany"}
{"delete" : {"_index" : "customers", "_type":"customer_type", "_id":104}}
'

The final \n is actually necessary

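When the bulk body is generated by a program rather than typed by hand, the line-per-command layout and the trailing newline are easy to get right with a small helper. The sketch below only builds the request body as a string; build_bulk_body is a made-up name, and sending the body to the _bulk endpoint is left out.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, document) tuples into an NDJSON string.

    `document` is ignored for "delete" actions, which have no following line.
    """
    lines = []
    for action, metadata, document in actions:
        lines.append(json.dumps({action: metadata}))
        if action != "delete":
            lines.append(json.dumps(document))
    # Every line, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body([
    ("create", {"_index": "customers", "_type": "customer_type", "_id": 102},
     {"firstname": "Maureen", "lastname": "OHara", "address": "12 Elm St", "city": "Brooklyn"}),
    ("index", {"_index": "customers", "_type": "customer_type", "_id": 103},
     {"firstname": "Lucy", "lastname": "Liu", "address": "39 Pine Tree Ln", "city": "Albany"}),
    ("delete", {"_index": "customers", "_type": "customer_type", "_id": 104}, None),
])
print(body, end="")
```

Because json.dumps never emits a raw newline, each command and each document stays on a single line, which is what the _bulk syntax requires.
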
Error Messages
• If things go wrong in your requests:

Issue                           Reason                                        Action
Unable to connect               networking issue or cluster down              retry until connection is available (round robin across nodes)
5xx error                       internal server error in Elasticsearch        retry until successful (using round robin)
4xx error                       invalid JSON or incorrect request structure   fix bad request before retrying
429 error                       Elasticsearch is too busy                     retry (ideally with linear or exponential backoff)
connection unexpectedly closed  node died or network issue                    retry (with risk of creating a duplicate document)

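The retry guidance in the table can be sketched as a small wrapper around whatever function actually sends the request. Everything here is hypothetical scaffolding: TransportError and send_with_retry are made-up names, no real client library is used, and round robin across nodes is not modeled. For simplicity the sketch applies exponential backoff to every retryable failure, not only to 429.

```python
import random
import time


class TransportError(Exception):
    """Hypothetical error type carrying an HTTP status (None = no connection)."""

    def __init__(self, status=None):
        super().__init__(f"transport error, status={status}")
        self.status = status


def is_retryable(status):
    """Retry connection failures, 429 and 5xx; never other 4xx errors."""
    return status is None or status == 429 or status >= 500


def send_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, backing off exponentially between tries."""
    for attempt in range(max_attempts):
        try:
            return send()
        except TransportError as err:
            last_attempt = attempt == max_attempts - 1
            if not is_retryable(err.status) or last_attempt:
                raise
            # Exponential backoff with a little jitter: ~0.5s, ~1s, ~2s, ...
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


# Simulated request: too busy, then a server error, then success.
responses = iter([TransportError(429), TransportError(503), "ok"])


def fake_send():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


delays = []
result = send_with_retry(fake_send, sleep=delays.append)
print(result, len(delays))
```

Passing the sleep function as a parameter keeps the example fast to run; a real caller would leave the default time.sleep in place. As the table notes, retrying an indexing request after a dropped connection can create a duplicate document unless the request specifies its own ID.
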
Chapter Review
Summary
• Elasticsearch is a search and analytics engine built on top
of Apache Lucene
• Elasticsearch is distributed and scales horizontally
• Clients communicate with a cluster using Elasticsearch’s
REST APIs
• Starting with version 5.0, all Elastic Stack products are
aligned, tested, and released together
• A document can be any JSON-formatted data you want to
search and/or analyze
• An index in Elasticsearch is a logical way of grouping data
• Adding documents to Elasticsearch is called indexing

Quiz
1. True or False: Elasticsearch uses Apache Lucene behind
the scenes to index and search data.
2. What are two different uses of the term “index”?
3. True or False: Every document indexed in Elasticsearch
belongs to an index.
4. True or False: Using the Bulk API is more efficient than
sending multiple, separate requests.

5. What happens if you attempt to index a new document into
an index that does not exist?
6. Can you PUT a document into an index without specifying
an ID?

Lab 1
An Introduction to Elasticsearch

Chapter 2
The Search API

Topics covered:
• Introduction to the Search API
• URI Searches
• Request Body Searches
• The match Query
• The match_phrase Query

• The range Query
• The bool Query
• Source Filtering
• Relevance
• Boosting Relevance

Introduction to the Search API
The Search API
• The Search API allows you to execute a search query and
get back search hits that match the query
‒ You used the Search API in Lab 1 when you ran a simple
_search query
• Elasticsearch provides a full Query DSL (Domain Specific
Language) to define queries
• Two types of searches:

‒ URI searches
‒ Request body search

URI Searches
• You can submit a search request using a URI and providing
request parameters
‒ Use a “q” to specify the query string
‒ Uses a special syntax called “query string syntax”


curl -XGET 'http://localhost:9200/my_index/_search?q=comment:hollywood'

‒ “q” is for query string
‒ comment:hollywood searches the “comment” field for “hollywood”

Why URI Searches?
• We will not discuss URI searches in detail
‒ The documentation has a good discussion on the query string
syntax and its many options
• URI searches are limited
‒ Not all search options are exposed
• But, it is good to know that URI searches exist

‒ They can be handy for a quick curl query
• We will focus on request body searches…

Request Body Searches
• To take full advantage of the Search API, write your queries
as request body searches using the Query DSL


GET index_to_search/_search
{
  "query": {
    define your query here...
  }
}

‒ index_to_search is the name of the index to be searched
‒ _search routes to the Search API
‒ the query element is where you define your query

Specifying Search Indices
• A search can be performed over multiple indices:

Syntax                      Indices searched
/_search                    everything
/index-1/_search            only index-1
/index-1/type-1/_search     only type-1 docs in index-1
/index-1,index-2/_search    all of index-1 and index-2
/index-*/_search            all indices that start with index-

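To make the table concrete, here is a toy resolver for the index part of the path. It is an illustration only: resolve_indices is a made-up helper, it uses Python's fnmatch module rather than Elasticsearch's own resolution logic, and it covers just the forms listed above (nothing, a single name, comma-separated names, and * wildcards), not type filtering.

```python
from fnmatch import fnmatchcase


def resolve_indices(expression, existing):
    """Return the names in `existing` selected by an index expression.

    An empty expression means "everything". Note that fnmatch also
    understands ? and [...] patterns, which the table does not cover.
    """
    if not expression:
        return sorted(existing)
    patterns = expression.split(",")
    return sorted(
        name for name in existing
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    )


existing = ["index-1", "index-2", "index-10", "logs-2017-01-01"]
print(resolve_indices("", existing))
print(resolve_indices("index-1", existing))
print(resolve_indices("index-1,index-2", existing))
print(resolve_indices("index-*", existing))
```
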
The Search API
• The Search API allows you to execute a search query and
get back search hits that match the query
• We will start by looking at the matching queries:
‒ match
‒ match_phrase
• Then we will discuss two other commonly-used queries:

‒ range
‒ bool

The match Query
• The match query is a simple but powerful search for fields
that are text, numerical values, or dates
• This is your “go to” query
‒ The main use case for the match query is for full-text search
• The syntax looks like:
GET my_index/_search
{
  "query": {
    "match": {
      "FIELD": "TEXT"
    }
  }
}

‒ FIELD is the field you want to search
‒ TEXT is the text you want to search for

Example of match
• Let’s search the “comment” fields of the documents in
my_index

Find all documents that have “favorite” or “star” in the comment.

GET my_index/_search
{
  "query": {
    "match": {
      "comment": "favorite star"
    }
  }
}

Query Response
• By default, the documents are returned sorted by _score
"hits": {
  "total": 3,
  "max_score": 0.6219729,
  "hits": [
    {
      "_index": "my_index",
      "_type": "doc",
      "_id": "1",
      "_score": 0.6219729,
      "_source": {
        "username": "harrison",
        "comment": "My favorite movie is Star Wars!"
      }
    },
    {
      "_index": "my_index",
      "_type": "doc",
      "_id": "3",
      "_score": 0.5306678,
      "_source": {
        "username": "eniko",
        "comment": "My favorite movie star in Hollywood is Harrison Ford."
      }
    },
    {
    ...

‒ Three documents were a match
‒ Each hit has a _score
‒ Results are sorted by _score (by default)

Changing match to “and”
• Add the operator property and specify “and” or “or”
‒ You have to add a “query” property as well
Find all documents that have “favorite” and “star” in the comment.
Notice the slightly different syntax:

GET my_index/_search
{
  "query": {
    "match": {
      "comment": {
        "query": "favorite star",
        "operator": "and"
      }
    }
  }
}

Only 2 hits this time:

"hits": {
  "total": 2,
  "max_score": 0.6219729,
  "hits": [
    {
      "_index": "my_index",
      "_type": "doc",
      "_id": "1",
      "_score": 0.6219729,
      "_source": {
        "username": "harrison",
        "comment": "My favorite movie is Star Wars!"
      }
    },
    {
      "_index": "my_index",
      "_type": "doc",
      "_id": "3",
      ...

The minimum_should_match Property
• If a match query searches on multiple terms, “or” might be
too loose and “and” might be too strict
• You can specify a minimum_should_match value
‒ minimum_should_match can be used to trim the long tail of
practically irrelevant results
Find comments that contain at least 2 terms out of “my favorite movie”
GET my_index/_search
{
  "query": {
    "match": {
      "comment": {
        "query": "my favorite movie",
        "minimum_should_match": 2
      }
    }
  }
}
minimum_should_match can be an integer or a percentage
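The difference between “or”, “and” and minimum_should_match can be sketched outside Elasticsearch. The Python below is only an illustration under stated assumptions: it tokenizes by lowercasing and splitting on non-word characters, ignores scoring entirely, and the id "2" for the “North Star” comment is hypothetical. The function name match and its parameters are made up for this sketch and are not an Elasticsearch client API.

```python
import re

# Sample comments from the slides; the id "2" is an assumption for illustration.
DOCS = {
    "1": "My favorite movie is Star Wars!",
    "2": "The North Star is right above my house.",
    "3": "My favorite movie star in Hollywood is Harrison Ford.",
}


def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in for real analysis)."""
    return re.findall(r"\w+", text.lower())


def match(docs, query, operator="or", minimum_should_match=None):
    """Return the ids of documents that satisfy the chosen match semantics."""
    terms = tokenize(query)
    if minimum_should_match is not None:
        required = minimum_should_match      # explicit number of terms
    elif operator == "and":
        required = len(terms)                # every term must be present
    else:
        required = 1                         # "or": any single term is enough
    hits = []
    for doc_id, text in docs.items():
        tokens = set(tokenize(text))
        matched = sum(1 for term in terms if term in tokens)
        if matched >= required:
            hits.append(doc_id)
    return hits


print(match(DOCS, "favorite star"))                              # "or"
print(match(DOCS, "favorite star", operator="and"))              # "and"
print(match(DOCS, "my favorite movie", minimum_should_match=2))  # at least 2 terms
```

The hit counts line up with the slides: three hits for “or” and two for “and”. Real results are also ranked by _score, which this sketch leaves out.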
The match_phrase Query
• The match_phrase query is for searching text when you
want to find terms that are near each other
‒ All the terms in the phrase must be in the document
‒ The position of the terms must be in the same relative order
• A match_phrase query can be quite expensive! Use it
wisely.
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "FIELD": "PHRASE"
    }
  }
}
‒ FIELD: the field you want to search
‒ PHRASE: the phrase you are searching for
Example of match_phrase
• Let’s look at an example:

I am looking for “movie star” in the comments

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "comment": "movie star"
    }
  }
}

Two things must happen for “movie star” to cause a hit:
1. “movie” and “star” must appear in the “comment” field
2. The position of “star” must be 1 greater than “movie”
match_phrase vs. match
• Notice the difference between match_phrase and match
when searching for “movie star”:
“match_phrase” from previous slide only has 1 hit:
"hits": {
  "total": 1,
  "max_score": 0.53066784,
  "hits": [
    {
      "_index": "my_index",
      "_type": "doc",
      "_id": "3",
      "_score": 0.53066784,
      "_source": {
        "username": "eniko",
        "comment": "My favorite movie star in Hollywood is Harrison Ford."
      }
    }
  ]

“match” with “and” produces 2 hits:
GET my_index/_search
{
  "query": {
    "match": {
      "comment": {
        "query": "movie star",
        "operator": "and"
      }
    }
  }
}

"hits": {
  "total": 2,
  "max_score": 0.6219729,
  "hits": [
    ...
The slop Parameter
• If match_phrase is too strict, you can introduce some
flexibility into the phrase (called slop)
‒ The slop parameter tells how far apart terms are allowed to be
while still considering the document a match

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "comment": {
        "query": "favorite star",
        "slop": 1
      }
    }
  }
}

Two things must happen for “favorite star” to find a hit:
1. “favorite” and “star” must appear in the “comment” field
2. The distance between “star” and “favorite” must be less than 2
Example of slop
• The search for “favorite star” found a hit
‒ The comment contains “favorite movie star”
• Without slop set to at least 1, there would be no hits
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "comment": {
        "query": "favorite star",
        "slop": 1
      }
    }
  }
}

"hits": {
  "total": 1,
  "max_score": 0.33159825,
  "hits": [
    {
      "_index": "my_index",
      "_type": "doc",
      "_id": "3",
      "_score": 0.33159825,
      "_source": {
        "username": "eniko",
        "comment": "My favorite movie star in Hollywood is Harrison Ford."
      }
    }
  ]
}
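The position rule behind match_phrase and slop can be sketched in a few lines of Python. This is a simplified approximation, not Lucene's actual phrase scorer: it assumes whitespace-and-punctuation tokenization, ignores scoring, and treats slop as the allowed spread between term positions once each position is shifted back by the term's offset in the phrase. The function name phrase_match is hypothetical.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in for real analysis)."""
    return re.findall(r"\w+", text.lower())


def phrase_match(text, phrase, slop=0):
    """Return True if all phrase terms occur close enough together in the text."""
    tokens = tokenize(text)
    terms = tokenize(phrase)
    # Positions at which each phrase term occurs in the text.
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    if any(not plist for plist in positions):
        return False  # rule 1: every term must be present
    for combo in product(*positions):
        # Shift each position back by the term's offset in the phrase;
        # an exact phrase yields identical shifted values (spread 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True  # rule 2: terms are within the allowed distance
    return False


doc1 = "My favorite movie is Star Wars!"
doc3 = "My favorite movie star in Hollywood is Harrison Ford."

print(phrase_match(doc3, "movie star"))             # exact phrase
print(phrase_match(doc3, "favorite star"))          # one word in between, slop 0
print(phrase_match(doc3, "favorite star", slop=1))  # same query with slop 1
print(phrase_match(doc1, "favorite star", slop=1))  # two words in between
```

With slop 0 the phrase “favorite star” misses document 3 because “movie” sits between the two terms; slop 1 tolerates that one extra position.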
The range Query
• The range query is for finding documents with fields that fall
in a given range
‒ Typically used for numeric and date fields, but can be used for
text as well
• The syntax for range looks like:

GET index/_search
{
  "query": {
    "range": {
      "FIELD": {
        "gte": LOWER,
        "lte": UPPER
      }
    }
  }
}
You can use “gt” or “lt” as well
Example of range
• The documents in the stocks index have a field named
“open” that represents the opening price for that day

Find all stocks with an opening price between $50 and $60
GET stocks/_search
{
  "query": {
    "range": {
      "open": {
        "gt": 50.00,
        "lt": 60.00
      }
    }
  }
}

"hits": {
  "total": 12829,
  "max_score": 1,
  "hits": [
    ...
  ]
}
12,829 stocks in that range
The range Query on Dates
• Range queries on dates can use different formats:
‒ Specify the dates as a string
‒ Use a special syntax called “date math”

I am looking for all stock prices from August, 2009.
GET stocks/_search
{
  "query": {
    "range": {
      "trade_date": {
        "gte": "2009-08",
        "lt": "2009-09"
      }
    }
  }
}
Date Math
• Elasticsearch has a user-friendly way to express date
ranges by using date math

I want all stock prices from the last month
GET stocks/_search
{
  "query": {
    "range": {
      "trade_date": {
        "gte": "now-1M"
      }
    }
  }
}

I want all stock prices from the last 24 hours
GET stocks/_search
{
  "query": {
    "range": {
      "trade_date": {
        "gte": "now-1d"
      }
    }
  }
}
Date Math Expressions
• Here are the supported time units for date math and a few
examples

Unit      Meaning
y         years
M         months
w         weeks
d         days
h or H    hours
m         minutes
s         seconds

If now = 2016-12-08T12:56:23:
Expression         Result
now-1h             2016-12-08T11:56:23
now+1d             2016-12-09T12:56:23
now+1h+30m         2016-12-08T14:26:23
2017-01-15||+1M    Jan 15, 2017, plus 1 month
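To make the arithmetic behind these expressions concrete, here is a small Python sketch that resolves a subset of date math against a fixed “now”. It is an illustration only: the function name resolve is hypothetical, it handles only expressions anchored at now, and it supports just the fixed-length units (w, d, h/H, m, s). Calendar units such as M and y, anchors such as 2017-01-15||, and rounding are left out.

```python
import re
from datetime import datetime, timedelta

# Fixed-length units only; calendar units (y, M) need month arithmetic and are omitted.
UNITS = {"w": "weeks", "d": "days", "h": "hours", "H": "hours",
         "m": "minutes", "s": "seconds"}


def resolve(expression, now):
    """Resolve expressions such as 'now-1h' or 'now+1h+30m' against a given 'now'."""
    if not expression.startswith("now"):
        raise ValueError("this sketch only handles expressions anchored at 'now'")
    rest = expression[3:]
    steps = re.findall(r"([+-])(\d+)([wdhHms])", rest)
    # Reject anything the pattern above did not fully consume.
    if "".join(sign + amount + unit for sign, amount, unit in steps) != rest:
        raise ValueError("unsupported expression: " + expression)
    result = now
    for sign, amount, unit in steps:
        delta = timedelta(**{UNITS[unit]: int(amount)})
        result = result + delta if sign == "+" else result - delta
    return result


now = datetime(2016, 12, 8, 12, 56, 23)
for expr in ("now-1h", "now+1d", "now+1h+30m"):
    print(expr, resolve(expr, now).isoformat())
```

Run against the “now” from the slide, the three expressions resolve to the same timestamps listed in the table.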
The bool Query
• A bool query is a combination of one or more of the
following boolean clauses:
‒ must: The query must appear in matching documents
and will contribute to the _score
‒ must_not: The query must not appear in matching
documents
‒ should: The query should optionally appear in matching
documents, and matches contribute to the _score
‒ filter: The query must appear in matching documents,
but it will have no effect on the _score
Syntax for bool
• Each of the following clauses is possible (but optional) in a
bool query
‒ And they can appear in any order
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {}
      ],
      "must_not": [
        {}
      ],
      "should": [
        {}
      ],
      "filter": [
        {}
      ]
    }
  }
}
‒ Notice the JSON array syntax. You can specify multiple queries in each
clause if desired.
‒ We will discuss filters in Chapter 7
The must clause

I am looking for comments from “harrison” that contain “star”
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "comment": "star"
          }
        },
        {
          "match": {
            "username": "harrison"
          }
        }
      ]
    }
  }
}

Result:
"My favorite movie is Star Wars!"
Multiple bool Clauses

Find any comments that mention “star” but not “wars”
GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "comment": "star"
        }
      },
      "must_not": {
        "match": {
          "comment": "wars"
        }
      }
    }
  }
}

Results:
"My favorite movie star in Hollywood is Harrison Ford."
"The North Star is right above my house."
The should Clause
• A term that “should” match increases the _score
‒ the scores from each matching must or should clause are added
together to provide the final _score for each document

The comments I am searching for must have “star” and should have “wars”
GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "comment": "star"
        }
      },
      "should": {
        "match": {
          "comment": "wars"
        }
      }
    }
  }
}
minimum_should_match with bool
• You can use minimum_should_match in a bool query

GET my_index/_search
{
  "query": {
    "bool": {
      "must":
        {"match": {"comment": "star"}},
      "should": [
        {"match": {"comment": "favorite"}},
        {"match": {"comment": "hollywood"}},
        {"match": {"comment": "famous"}},
        {"match": {"username": "harrison"}}
      ],
      "minimum_should_match": "50%"
    }
  }
}
Results from minimum_should_match

"hits": [
  {
    "_index": "my_index",
    "_type": "doc",
    "_id": "1",
    "_score": 1.6028022,
    "_source": {
      "username": "harrison",
      "comment": "My favorite movie is Star Wars!"
    }
  },
  {
    "_index": "my_index",
    "_type": "doc",
    "_id": "3",
    "_score": 1.3930775,
    "_source": {
      "username": "eniko",
      "comment": "My favorite movie star in Hollywood is Harrison Ford."
    }
  }
]
‒ _id 1: “harrison” + “favorite” >= 50%
‒ _id 3: “hollywood” + “favorite” >= 50%
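The way a percentage turns into a required number of should clauses can be sketched in Python. This is an illustration under assumptions, not Elasticsearch code: the helper names are hypothetical, clauses are reduced to (field, term) pairs checked by simple term containment, scoring is ignored, and the id and username of the “North Star” document are invented. The one behavior relied on is that a percentage is rounded down to a whole number of clauses.

```python
import re

DOCS = [
    {"_id": "1", "username": "harrison",
     "comment": "My favorite movie is Star Wars!"},
    # The id and username of this document are assumptions for illustration.
    {"_id": "2", "username": "another_user",
     "comment": "The North Star is right above my house."},
    {"_id": "3", "username": "eniko",
     "comment": "My favorite movie star in Hollywood is Harrison Ford."},
]


def term_in(doc, field, term):
    """True if the term appears among the lowercased words of the field."""
    return term in re.findall(r"\w+", doc[field].lower())


def required_should(minimum_should_match, clause_count):
    """Convert an integer or a percentage string into a clause count."""
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        # Percentages are rounded down to a whole number of clauses.
        return clause_count * int(minimum_should_match[:-1]) // 100
    return int(minimum_should_match)


def bool_query(docs, must=(), should=(), minimum_should_match=0):
    """Return ids of docs matching every must clause and enough should clauses."""
    required = required_should(minimum_should_match, len(should))
    hits = []
    for doc in docs:
        if not all(term_in(doc, field, term) for field, term in must):
            continue
        matched = sum(1 for field, term in should if term_in(doc, field, term))
        if matched >= required:
            hits.append(doc["_id"])
    return hits


hits = bool_query(
    DOCS,
    must=[("comment", "star")],
    should=[("comment", "favorite"), ("comment", "hollywood"),
            ("comment", "famous"), ("username", "harrison")],
    minimum_should_match="50%",
)
print(hits)
```

With four should clauses, "50%" requires two of them, which is why documents 1 and 3 are hits while a comment that only satisfies the must clause is not.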
Searching tip for phrases
• You can take advantage of the boost that a should clause
provides by combining match and match_phrase when
searching for a phrase:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "comment": "movie star"
        }
      },
      "should": {
        "match_phrase": {
          "comment": "movie star"
        }
      }
    }
  }
}
Hits will include “movie” or “star”, but comments that
include the phrase “movie star” will score higher
Only “should” Query
• In a bool query with no must or filter clauses, one or more
should clauses must match a document:
‒ or you can set minimum_should_match

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"comment": "north"}},
        {"match": {"comment": "south"}},
        {"match": {"comment": "east"}},
        {"match": {"comment": "west"}}
      ]
    }
  }
}
At least one clause must match for a hit
A “should not” Query
• The following example implements “should not” logic
(notice how you can nest bool queries):
I am looking for comments that contain “star” but should not contain “wars”.
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "comment": "star"
          }
        }
      ],
      "should": [
        {
          "bool": {
            "must_not": [
              {
                "match": {
                  "comment": "wars"
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Results:
"_score": 1.1174096
"The North Star is right above my house."
"_score": 1.1174096 / 1.1682332
"My favorite movie star in Hollywood is Harrison Ford."
"_score": 0.13761075
"My favorite movie is Star Wars!"
Source Filtering
Requesting Specific Fields
• You can request specific fields using the _source
parameter
‒ separate multiple fields by commas
GET stocks/_search?_source=open,close
{
  "query": {
    "range": {
      "open": {
        "gte": 50.00,
        "lte": 60.00
      }
    }
  }
}

"hits": [
  {
    "_index": "stocks",
    "_type": "stock_type",
    "_id": "AVmAGxB992cAvoluWZvm",
    "_score": 1,
    "_source": {
      "close": 57.1,
      "open": 57.58
    }
  },
  ...
_source in a Query
• You can also specify the source fields within the query
body using _source
‒ this query has the same output as the previous slide:

GET stocks/_search
{
  "_source": ["open", "close"],
  "query": {
    "range": {
      "open": {
        "gte": 50.00,
        "lte": 60.00
      }
    }
  }
}
Only return the “open” and “close” fields
The includes and excludes Patterns
• For complete control, you can specify which fields to
include and exclude within _source:
‒ notice also that you can use wildcards in a _source field

GET stocks/_search
{
  "_source": {
    "includes": ["volume", "*date*"],
    "excludes": ["trade_date_text"]
  },
  "query": {
    "range": {
      "open": {
        "gte": 50.00,
        "lte": 60.00
      }
    }
  }
}

{
  "_index": "stocks",
  "_type": "stock_type",
  "_id": "AVmAGxB992cAvoluWZvn",
  "_score": 1,
  "_source": {
    "volume": 48749,
    "trade_date": "2010-06-10T06:00:00.000Z"
  }
},
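How includes and excludes combine can be sketched with Python's fnmatch module. This is a simplified model under assumptions: the function name filter_source is hypothetical, only top-level field names are considered (no nested paths), and the sample document mixes values from the slides with an invented trade_date_text value.

```python
from fnmatch import fnmatchcase


def filter_source(source, includes=None, excludes=None):
    """Keep fields matching any include pattern and no exclude pattern."""
    kept = {}
    for field, value in source.items():
        # With no include patterns, every field is a candidate.
        included = not includes or any(fnmatchcase(field, p) for p in includes)
        excluded = any(fnmatchcase(field, p) for p in (excludes or []))
        if included and not excluded:
            kept[field] = value
    return kept


# Illustrative document; the trade_date_text value is made up for this sketch.
stock = {
    "open": 57.58,
    "close": 57.1,
    "volume": 48749,
    "trade_date": "2010-06-10T06:00:00.000Z",
    "trade_date_text": "June 10, 2010",
}

print(filter_source(stock, includes=["volume", "*date*"],
                    excludes=["trade_date_text"]))
```

The wildcard *date* would pull in both date fields, and the exclude pattern then removes trade_date_text, leaving volume and trade_date as in the slide's response.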
Relevance
What is Relevance?
• Relevance refers to the scoring of a document based on
how closely it matches the query
‒ The _score is computed for each document that is a hit
• There are three main factors of a document’s score:
‒ TF (term frequency): The more a term appears in a field, the
more important it is
‒ IDF (inverse document frequency): The more documents that
contain the term, the less important the term is
‒ Field length: shorter fields are more likely to be relevant than
longer fields
The BM25 Algorithm
• There is some fairly complex mathematics happening
behind each search
‒ An effective search is only as good as the underlying algorithm
used to compute the _score
• BM25 (as in “Best Matching”) is the 25th iteration of
tweaking the formula

BM25 assigns less weight to the frequency of a term after a certain
number of occurrences
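The saturation effect can be seen in a small Python sketch of the commonly published BM25 term formula. Treat it as an illustration rather than the exact computation Elasticsearch performs: the function name is hypothetical, the parameter values k1 = 1.2 and b = 0.75 are the widely used defaults and are assumed here, and the corpus numbers are made up.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one term in one field using the standard BM25 formula."""
    # IDF: terms that appear in fewer documents weigh more.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average fields are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # TF with saturation: each extra occurrence adds less than the previous one.
    return idf * tf * (k1 + 1) / (tf + norm)


# Hypothetical corpus: 1000 documents, the term appears in 10 of them.
for tf in range(1, 6):
    score = bm25_term_score(tf, doc_len=10, avg_doc_len=10, doc_count=1000, doc_freq=10)
    print(tf, round(score, 3))
```

The printed scores keep rising with term frequency, but by smaller and smaller steps, which is the behavior the slide describes. The same function also shows the other two factors: a lower doc_freq or a shorter doc_len gives a higher score.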
Influencing Score Calculation
• There are several ways to control the scoring of a search
request:
‒ adding boost to a query clause
‒ choosing another similarity algorithm
‒ the constant_score query
‒ the function_score query

GET messages/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "inline": """
                _score * 0.05 * doc['details.plus_ones'].value + 1
              """
            }
          }
        }
      ]
    }
  }
}
We discuss function_score and other ways to control scoring in our
Advanced Elasticsearch Developer course
Boosting Relevance
• A search for “elastic” will likely return a variety of different
documents
‒ From Elastic (the company) to stretchable fabric, rubber bands,
kinetic energy, and even the name of a jazz album
• One way to control the relevance of a search is by
boosting query clauses
Query-time Boosting
• Some types of searches allow a “boost” parameter to
increase (or decrease) the relative weight of a clause
• If boost > 1
‒ the relative weight of the clause is increased
• If 0 < boost < 1
‒ the relative weight of the clause is decreased

• If boost < 0
‒ the clause will contribute negatively to the score, effectively
penalizing documents that match the clause
Query-time Boosting
• Add the “boost” property to a query clause to manipulate
the score of matched documents

I am searching for “star” and “house”, but with an emphasis on “house”
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "comment": "star"
          }
        },
        {
          "match": {
            "comment": {
              "query": "house",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}
Documents that match “house” will get a boost of factor 2
Chapter Review
Summary
• The match query is a simple but powerful search for fields
that are text, numerical values, or dates
• The slop parameter tells how far apart terms are allowed to
be while still considering the document a match
• The range query is for finding documents with fields that fall
in a given range
• A bool query is a combination of one or more of the
following boolean clauses: must, must_not, should and
filter
• Relevance refers to the scoring of a document based on
how closely it matches the query
• You can request specific fields using the _source
parameter
Quiz
1. Explain the difference between match and match_phrase.
2. If you want to add some flexibility into match_phrase, you
can configure a ___________ property.
3. How would you write a date range that matches the last 12
hours?
4. Suppose you have a “should” clause in a “bool” with 5
queries, and you set minimum_should_match to 70%.
How many of the “should” clauses need to match for a hit?
5. What gets returned in the following request?

GET my_index/_search?_source=username
Lab 2
The Search API
Chapter 3
Text Analysis

1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships
Topics covered:
• What is Analysis?
• Building an Inverted Index
• Analyzers
• Custom Analyzers
• Character Filters
• Tokenizers
• Token Filters
• Defining Analyzers
• How to Choose an Analyzer
What is Analysis?
Exact Values vs. Full Text
• In general, data in Elasticsearch can be categorized into
two groups:
‒ Exact values: includes numeric values, dates, and certain
strings
‒ Full text: unstructured text data that we want to search

{
  "brandName": "Ahold",
  "customerRating": 2,
  "price": 14.15,
  "grp_id": "52903",
  "quantitySold": 805358,
  "upc12": "688267137853",
  "productName": "Ahold Companion Dog Food Kibbles & Munchy Morsels"
}
‒ Exact values: fields such as customerRating, price and quantitySold
‒ Full text: fields such as productName

• Why does it matter if a value is exact or full text?
Full Text is Analyzed
• Analysis is the process of converting full text into terms for
the inverted index
• Analysis is performed by an analyzer
‒ either a built-in analyzer
‒ or a custom analyzer you configure

"Ahold Companion Dog Food Kibbles & Munchy Morsels"

Searches:
‒ dog food
‒ Ahold food for dogs
‒ kibble
It seems like all of these searches should be a hit for this product
Analysis Makes Full Text Searchable
• Full text gets analyzed during ingestion
• Similarly, the text in your queries gets analyzed using the
same analyzer

Indexed text: "Ahold Companion Dog Food Kibbles & Munchy Morsels"
Terms: ahold, companion, dog, food, kibble, munchy, morsel

Query: Ahold food for dogs
Terms: ahold, food, dog
The query gets analyzed too
Building an Inverted Index
What is an Inverted Index?
• You are familiar with the index in the back of a book
‒ Useful terms from the book are sorted, and page numbers tell you
where to find that term
• Lucene creates a similar inverted index with your documents

(Figure: a book index. Page 275 contains information about “Aston Martin company”)
Building an Inverted Index
• Let’s look at how terms are analyzed and added to an
inverted index:
‒ text is broken into tokens
‒ indexed by a unique document id

Documents:
1 The quick brown fox jumps over the lazy dog
2 Fast jumping spiders

token      postings
Fast       2
The        1
brown      1
dog        1
fox        1
jumping    2
jumps      1
lazy       1
over       1
quick      1
spiders    2
the        1
Lowercase
• Searching for “Fast” and “fast” should yield the same results
‒ common to lowercase all tokens
Documents:
1 The quick brown fox jumps over the lazy dog
2 Fast jumping spiders

token      postings
brown      1
dog        1
fast       2
fox        1
jumping    2
jumps      1
lazy       1
over       1
quick      1
spiders    2
the        1
“the” was already indexed for doc 1
Stopwords
• Searching for “the” rarely makes sense
‒ typically we do not bother to index “the”
‒ or other stopwords

Documents:
1 The quick brown fox jumps over the lazy dog
2 Fast jumping spiders

token      postings
brown      1
dog        1
fast       2
fox        1
jumping    2
jumps      1
lazy       1
over       1
quick      1
spiders    2
Stemming
• “jumps” and “jumping” share the same stem: “jump”
‒ want “jump” to match both
‒ so only index the stem

Documents:
1 The quick brown fox jumps over the lazy dog
2 Fast jumping spiders

token      postings
brown      1
dog        1
fast       2
fox        1
jump       1,2
lazy       1
over       1
quick      1
spider     2
Synonyms
• The terms “quick” and “fast” have similar meaning
‒ Should a search for “quick” return doc 2?
‒ Should a search for “fast” return doc 1?

Documents:
1 The quick brown fox jumps over the lazy dog
2 Fast jumping spiders

token      postings
brown      1
dog        1
fast       1,2
fox        1
jump       1,2
lazy       1
over       1
quick      1,2
spider     2
Analyze the Search Query Too
• If we are going to analyze our text before indexing, we
better also analyze our search queries in the same manner

query        analyzed    postings
a Dog        dog         1
jumped       jump        1,2
the quick    quick       2,1
spiders      spider      2

token      postings
brown      1
dog        1
fast       1,2
fox        1
jump       1,2
lazy       1
over       1
quick      1,2
spider     2
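The steps in this section (lowercasing, stopwords, stemming, synonyms, and analyzing the query the same way) can be combined into one small Python sketch. Everything here is a toy stand-in for what Lucene does: the stop list, the synonym map and the suffix-stripping stem function are hypothetical and only tuned to the two example documents.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the"}                           # tiny illustrative stop list
SYNONYMS = {"quick": ["fast"], "fast": ["quick"]}  # hypothetical synonym map

DOCS = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Fast jumping spiders",
}


def stem(token):
    """Naive suffix stripping; a real stemmer is far more careful."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("ed") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, tokenize, drop stopwords, stem, and expand synonyms."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOPWORDS:
            continue
        term = stem(token)
        terms.append(term)
        terms.extend(SYNONYMS.get(term, []))
    return terms


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def search(index, query):
    """Analyze the query like the documents, then look up each term."""
    hits = set()
    for term in analyze(query):
        hits.update(index.get(term, []))
    return sorted(hits)


INDEX = build_index(DOCS)
for term, postings in INDEX.items():
    print(term, postings)
print(search(INDEX, "jumped"))
```

Because the query goes through the same analyze function as the documents, “jumped” is reduced to “jump” and finds both documents, even though neither contains the word “jumped”.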
Analyzers
Anatomy of an Analyzer
• An analyzer consists of three parts:
1. Character Filters
2. Tokenizer
3. Token Filters

“My <em>favorite</em> movie is Star Wars!”
→ char filters → tokenizer → token filters →
my, favorit, movi, star, war
The Analyze API
• Use the _analyze API to test what an analyzer will do to
your text:
GET _analyze
{
  "analyzer": "simple",
  "text": "My favorite movie is Star Wars!"
}
‒ GET _analyze: use the _analyze API
‒ "analyzer" and "text": test the “simple” analyzer on the given “text”

"My favorite movie is Star Wars!" → “simple” analyzer → my, favorite, movie, is, star, wars
Pre-built Analyzers
• standard
• simple
• whitespace
• stop
• keyword
• pattern
• language

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
The simple Analyzer
• Notice the simple analyzer does two things:
‒ Breaks text into terms whenever it encounters a character which
is not a letter
‒ All terms are lower-cased

Lower Case Tokenizer
The simple analyzer has no character or token filters
The stop Analyzer
• Notice the stop analyzer removes stop words
‒ similar to the simple analyzer,
‒ but adds support for removing stop words, which in English are:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
of, on, or, such, that, the, their, then, there, these, they, this,
to, was, will, with

Lower Case Tokenizer → Stop Token Filter
The stop analyzer includes the Stop Token Filter
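The behavior of the simple and stop analyzers is easy to imitate in Python, which can help when reasoning about what ends up in the index. The sketch below is an approximation only: the function names are hypothetical, “letter” is approximated with a Unicode-aware regular expression, and the stop list is the English list from this slide.

```python
import re

# English stop words as listed on the slide.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def simple_analyzer(text):
    """Break text at every non-letter character and lowercase each term."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop_analyzer(text):
    """Same as the simple analyzer, then drop English stop words."""
    return [term for term in simple_analyzer(text) if term not in ENGLISH_STOP_WORDS]


text = "My favorite movie is Star Wars!"
print(simple_analyzer(text))
print(stop_analyzer(text))
```

The only difference between the two outputs for this sentence is the term “is”, which the stop analyzer removes.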
The standard Analyzer
• The default analyzer in Elasticsearch
‒ Optionally, you can configure the stop words as well

Standard Tokenizer → Standard Token Filter → Lower Case Token Filter → Stop Token Filter
The Stop Token Filter is disabled by default
The language Analyzer
• A set of analyzers aimed at analyzing specific language text
‒ 30+ languages supported

GET _analyze
{
  "analyzer": "english",
  "text": "My favorite movie is Star Wars!"
}

GET _analyze
{
  "analyzer": "french",
  "text": "Mon film préféré est Star Wars!"
}
The Analyze API
• The Analyze API can be used to test individual components
‒ The following command tests a char filter and tokenizer

Test the “keyword” tokenizer with the “html_strip”


character filter on the given “text”

GET _analyze
{
"tokenizer": "keyword",

n
io
is
"char_filter": [ "html_strip" ],

ec
D
&
"text": "<strong>Welcome</strong> to the <em>jungle</em>"
s
es
in
}
us
-B
71
20
p-
Se
9-

"<strong>Welcome</strong>
-0
d
ar

to the <em>jungle</em>"
er
G
n
ie

_analyze "Welcome to the jungle"


st
ba
Se

Copyright Elasticsearch BV 2015-2017 Copying, publishing and/or 132


distributing without written permission is strictly prohibited
Character Filters
1. Character Filters
• Preprocess characters before they are passed to the
tokenizer
‒ You can configure multiple character filters
• Three built-in character filters:
‒ html_strip
‒ mapping

‒ pattern_replace
The html_strip Character Filter
• Strips out any HTML
• Replaces HTML entities with their decoded value
GET _analyze
{
"tokenizer": "keyword",
"char_filter": [ "html_strip" ],
"text": "Tom&apos;s <em>favorite</em> movie is Star Wars!"
}

“Tom&apos;s <em>favorite</em> movie is Star Wars!” → html_strip → “Tom's favorite movie is Star Wars!”
The mapping Character Filter
• Map characters to other characters before tokenizing
GET /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ "C++ => cpp", "c++ => cpp" ]
    }
  ],
  "tokenizer": "standard",
  "text": "I like to code in C++ and Java."
}

“C++”, “c++” → mapping → “cpp”
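The effect of the mapping character filter can be approximated outside Elasticsearch. The following is a minimal Python sketch, assuming plain substring replacement that tries longer keys first; `apply_mappings` is a hypothetical helper for illustration, not an Elasticsearch API.

```python
def apply_mappings(text, mappings):
    """Replace every mapped key in text with its value (hypothetical helper)."""
    # Parse rules written in the "key => value" form used by the filter config.
    rules = {}
    for rule in mappings:
        key, value = rule.split("=>")
        rules[key.strip()] = value.strip()

    # Try longer keys first so overlapping keys behave predictably.
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


result = apply_mappings("I like to code in C++ and Java.",
                        ["C++ => cpp", "c++ => cpp"])
print(result)  # I like to code in cpp and Java.
```

Because the replacement happens before tokenization, the tokenizer sees "cpp" and never gets a chance to discard the "+" characters.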
The pattern_replace Character Filter
• Use a regular expression to manipulate characters in a
string before processing
GET /_analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "tokenizer": "standard",
  "text": "123-45-6789"
}

‒ “pattern” looks for dashes in a string of digits
‒ “replacement” replaces the dash with an underscore

123-45-6789 → pattern_replace → 123_45_6789
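The same regular expression can be tried outside Elasticsearch. This is a small Python sketch using the standard `re` module; note that Python writes the group reference as `\1` where the filter configuration uses `$1`. The function name `pattern_replace` is only illustrative.

```python
import re


def pattern_replace(text, pattern, replacement):
    """Apply a regex replacement to the whole input string."""
    return re.sub(pattern, replacement, text)


# One or more digits, a dash, and a lookahead requiring another digit.
# The lookahead does not consume the digit, so the next match can start there.
print(pattern_replace("123-45-6789", r"(\d+)-(?=\d)", r"\1_"))  # 123_45_6789
```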
Tokenizers
2. Tokenizer
• The second phase of an analyzer
‒ Divides a stream of text into tokens
• Some of the built-in tokenizers include:
‒ whitespace: divides text on any whitespace
‒ standard: divides text into terms on word boundaries
‒ path_hierarchy: for hierarchical values like filesystem paths

‒ uax_url_email: recognizes URLs/email addresses as tokens
‒ keyword: a no-op tokenizer
‒ pattern: splits text based on a given regular expression
• You can also write your own tokenizer in Java
Token Filters
3. Token Filters
• Token filters are the third phase of an analyzer:
‒ They can add tokens (e.g. the synonym filter)
‒ They can remove tokens (e.g. stop words)
‒ They can change tokens (e.g. stemming to root form)
• Examples include:
‒ lowercase: make all text lower case

‒ snowball: stems words using a Snowball-generated stemmer
‒ stop: removes stop words
‒ nGram and edgeNGram: break tokens into n-grams (substrings of each token)
‒ and many more…
Order Matters in Token Filters

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase", "stop" ],
  "text": "To Be Or Not To Be"
}

Result: [] (no tokens: after lowercasing, every word is a stop word)

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stop", "lowercase" ],
  "text": "To Be Or Not To Be"
}

Result: to, be, or, not, to, be (the stop filter runs first and does not match the capitalized words)
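The ordering effect is easy to reproduce with plain functions. The sketch below is a simplified Python model, assuming a case-sensitive stop filter and the default English stop word list shown earlier; the function names are illustrative and are not Elasticsearch APIs.

```python
# Default English stop words, as listed for the stop analyzer.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def stop(tokens):
    # Case-sensitive comparison: "To" is not the same token as "to".
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, filters):
    tokens = text.split()          # whitespace tokenizer
    for token_filter in filters:   # filters run in the order given
        tokens = token_filter(tokens)
    return tokens


print(analyze("To Be Or Not To Be", [lowercase, stop]))  # []
print(analyze("To Be Or Not To Be", [stop, lowercase]))
# ['to', 'be', 'or', 'not', 'to', 'be']
```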
ICU Analysis
ICU Analysis Plugin
• The ICU Analysis plugin adds extended Unicode support
for text analysis in Elasticsearch
‒ including better analysis of Asian languages
• The plugin consists of 1 character filter:
‒ Normalization
• 1 tokenizer:

‒ ICU Tokenizer
• and 4 token filters:
‒ Normalization
‒ Folding
‒ Collation
‒ Transform
Example of ICU Tokenizer
• You can use the components of the ICU plugin just like any
other character filter, tokenizer, or token filter
• For example, the following Chinese translates to “Star Wars
is my favorite movie”
‒ The “standard” tokenizer incorrectly splits the sentence into
single characters

POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": ["星球大战是我最喜欢的电影"]
}

Tokens: 星球大战, 是, 我, 最, 喜欢, 的, 电影
Defining Analyzers
Configuring a Custom Analyzer
• Analyzers are defined in the “analysis” part of an index’s
settings

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer" : {
          "type" : "custom",
          "char_filter" : ["html_strip"],
          "tokenizer" : "standard",
          "filter" : ["lowercase", "snowball"]
        }
      }
    }
  }
}

‒ “my_custom_analyzer”: whatever name you want to use for your analyzer
‒ Set “type” to “custom”
‒ Configure the three components of your custom analyzer
Defining Custom Filters
• If your character or token filters need to define properties,
then you have to define them differently:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "cpp_filter" : {
          "type" : "mapping",
          "mappings" : ["C++ => cpp", "c++ => cpp"]
        }
      },
      "analyzer": {
        "my_custom_analyzer" : {
          "char_filter": ["cpp_filter"],
          "tokenizer" : "standard"
        }
      }
    }
  }
}

‒ Define a custom char_filter and name it whatever you want
‒ Use your custom filter later in the “analyzer” section
More Advanced Example
• You can use your custom filters along with the built-in filters
{
  "settings": {
    "analysis": {
      "char_filter": {
        "cpp_filter" : {
          "type" : "mapping",
          "mappings" : ["C++ => cpp", "c++ => cpp"]
        }
      },
      "filter": {
        "my_stopwords" : {
          "type" : "stop",
          "stopwords" : ["my", "is"]
        }
      },
      "analyzer": {
        "my_custom_analyzer" : {
          "char_filter": ["html_strip", "cpp_filter"],
          "tokenizer" : "standard",
          "filter" : ["lowercase", "my_stopwords", "snowball"]
        }
      }
    }
  }
}
Testing a Custom Analyzer
• To test an analyzer that is already defined for an index, use
_analyze at the index level:

GET test_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text" : "C++ is my favorite language."
}

Response:

{
  "tokens": [
    {
      "token": "cpp",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "favorit",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "languag",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
How do we decide
which analyzer to use?
Step 1: Exact value or full text?
• The first question you need to ask is: “Which text fields are
exact values and which fields are full text?”
‒ Exact values do not need to be analyzed

{
  "brandName": "Uncle Ben's",
  "customerRating": 3,
  "price": 8.57,
  "grp_id": "26333",
  "quantitySold": 514342,
  "upc12": "054800085453",
  "productName": "Uncle Ben's Rice Pilaf Ready Rice"
}
Step 2: What will our searches look like?
• The second question is: “What will a search look like and
when should that match a document?”
‒ Look at specific examples in your dataset
‒ Does “Uncle Ben’s” = “uncle bens” = “uncl ben” = “benjamin”?

original text “Uncle Ben’s” → char filters → tokenizer → token filters → ? ← search query “uncl ben”
Step 3: Test several options
• There are still a lot of questions to ask
‒ Where will the text be split?
‒ Should we use stemming?
‒ Should we build n-grams?
‒ What are some critical synonyms?
• The most important step: test your analyzer thoroughly!

‒ Try several options
‒ Test a lot of searches!
Chapter Review
Summary
• Exact values include numeric values, dates, and certain
strings
• Full text is unstructured text data that we want to search
• Analysis is the process of converting full text into terms for
the inverted index
• analyzer = character filters → tokenizer → token filters
• Use the _analyze API to test an analyzer (or the
components individually)
• The ICU Analysis plugin adds extended Unicode support
for text analysis in Elasticsearch
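The three analyzer phases summarized above can be modeled as a chain of plain functions. This is a rough Python sketch under simplifying assumptions: the `html_strip` stand-in uses a naive regular expression, the tokenizer only splits on whitespace, and none of these helper names are Elasticsearch APIs.

```python
import html
import re


def html_strip(text):
    # Naive stand-in for the html_strip character filter:
    # drop tags, then decode HTML entities.
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def whitespace_tokenizer(text):
    return text.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def analyze(text, char_filters, tokenizer, token_filters):
    # Phase 1: character filters rewrite the raw string.
    for char_filter in char_filters:
        text = char_filter(text)
    # Phase 2: exactly one tokenizer splits the string into tokens.
    tokens = tokenizer(text)
    # Phase 3: token filters add, remove, or change tokens, in order.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<strong>Welcome</strong> to the <em>jungle</em>",
                 [html_strip], whitespace_tokenizer, [lowercase])
print(tokens)  # ['welcome', 'to', 'the', 'jungle']
```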
Quiz
1. The process of converting full text into terms for the
inverted index is called __________.
2. Why does it matter if a field is an exact value vs. full text?
3. What are the three components that make up an analyzer?
4. What does the snowball token filter do to the word
“happy”?

5. How does the following “my_logfile_analyzer” process the
text “192.168.1.201 - jing https://training.Elastic.co/”?

PUT my_log_files
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_logfile_analyzer": {
          "tokenizer": "uax_url_email"
        }
      }
    }
  }
}
Lab 3
Text Analysis
Chapter 4
Mappings
1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships
Topics covered:
• What is a Mapping?
• Dynamic Mappings
• Defining Explicit Mappings
• Adding Fields
• Dive Deeper into Mappings

• Specifying Analyzers
• Multi-Fields
• Index Templates
What is a Mapping?
What is a Mapping?
• Elasticsearch will happily index any document without
knowing its details (number of fields, their data types, etc.)
‒ However, behind-the-scenes Elasticsearch assigns data types to
your fields in a mapping
• A mapping is a schema definition that contains:
‒ Names of fields
‒ Data types of fields

‒ How the field should be indexed and stored by Lucene
• Mappings map your complex JSON documents into the
simple flat documents that Lucene expects
• Mappings belong to index types…
Remember Types?
• Documents are indexed into an index
• Each document also has a type
‒ An index can have multiple types (although that is not
encouraged)
‒ Every type has exactly one mapping

Every document has a type:

{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "2",
  "_score": 1,
  "_source": {
    "username": "maria",
    "comment": "The North Star is right above my house."
  }
}
Every Type has One Mapping
• Defined in the “mappings” section of the index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        define your fields and types here
      }
    }
  }
}

‒ “my_type” is the name of your type
Example of a Mapping

GET my_index/_mappings

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "comment": {
            "type": "text"
          },
          "username": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

‒ “comment” is a “text”
‒ “username” is a “keyword”
Defining a Mapping
• Mappings are defined in the “mappings” section of an index
‒ Define mappings at index creation, or
‒ Update a mapping of an existing index using the Put API

You can define the mapping in the PUT command:

PUT comments
{
  "mappings": {
    define mapping here
  }
}
Elasticsearch Data Types for Fields
• Simple Types, including:
‒ text: for full text (analyzed) strings (“string” in older versions
of Elasticsearch)
‒ keyword: for exact value strings
‒ date: strings formatted as dates, or numeric dates
‒ integer types: like byte, short, integer, long
‒ floating-point numbers: float, double, half_float, scaled_float

‒ boolean
‒ ip: for IPv4 or IPv6 addresses
• Hierarchical Types: like object and nested
• Specialized Types: geo_point, geo_shape and percolator

This list is not every data type available
Why have we not needed
to define any mappings yet?
Dynamic Mappings
• You are not required to manually define mappings
• When you index a document into a type, Elasticsearch
dynamically updates the mapping based on the
document

PUT comments/comment_type/1
{
  "user" : "Ricardo",
  "comment" : "Best movie I've ever seen! :)",
  "comment_time" : "2016-11-06T15:31:18-07:00",
  "rating" : 5
}

Dynamically mapped types: “user” → text, “comment” → text, “comment_time” → date, “rating” → long
Elasticsearch “guesses” data types:

Sometimes it conveniently guesses what you want…

PUT comments/comment_type/2
{
  "num_of_views" : 430
}

"num_of_views": {
  "type": "long"
},
Elasticsearch “guesses” data types:

Dates in the default format are also recognized

PUT comments/comment_type/3
{
  "comment_time" : "2016-11-06"
}

"comment_time": {
  "type": "date"
}
Elasticsearch “guesses” data types:

Sometimes it may choose a type you do not want…

PUT comments/comment_type/4
{
  "average_length" : "24.8"
}

"average_length": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

Notice the multi-field mapping for “average_length”
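The guesses on the last few slides follow from the JSON type of each value. The Python sketch below is a deliberately simplified model of that behavior, not the actual Elasticsearch implementation: `guess_type` is a hypothetical helper, and the date check is an assumed regular expression that only covers ISO-style dates like those in the examples.

```python
import re

# Assumed approximation of the default date detection:
# yyyy-MM-dd with an optional time and offset.
ISO_DATE = re.compile(
    r"^\d{4}-\d{2}-\d{2}"
    r"(T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?)?$"
)


def guess_type(value):
    """Return the field type a dynamic mapping would likely pick."""
    if isinstance(value, bool):      # check bool first: bool is an int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        # A quoted number such as "24.8" is still a JSON string,
        # so it is not treated as numeric here.
        return "date" if ISO_DATE.match(value) else "text"
    if isinstance(value, dict):
        return "object"
    return "unknown"


doc = {
    "num_of_views": 430,
    "comment_time": "2016-11-06",
    "average_length": "24.8",
}
for field, value in doc.items():
    print(field, "->", guess_type(value))
```

The "24.8" case is the one to watch: the value arrives as a string, so it ends up as text unless the mapping is defined explicitly.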
Defining
Explicit Mappings
Defining Explicit Mappings
• If you need to define an explicit mapping, we typically follow
these steps:
1. Index a sample document that contains the fields you want
defined in the mapping (using a temporary index)
2. Get the dynamic mapping that was created automatically by
Elasticsearch in Step 1
3. Edit the mapping definition

4. Create your index, using your explicit mappings
1. Index a Sample Document
• Start by indexing a document into a temporary index…

…but use a type name that you want to keep

PUT temp_comments/comment_type/1
{
  "user" : "Ricardo",
  "comment" : "Best movie I've ever seen! :)",
  "comment_time" : "2016-11-06T15:31:18-07:00",
  "rating" : 5
}
2. Get the Dynamic Mapping
• Get the mapping, then copy-and-paste it into the Console:

GET temp_comments/_mapping

{
  "temp_comments": {
    "mappings": {
      "comment_type": {
        "properties": {
          "comment": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "comment_time": {
            "type": "date"
          },
          "rating": {
            "type": "long"
          ...
3. Edit the Mapping
• In our example, we will make a couple of changes:
‒ By changing “user” to a “keyword”, it will only match exact
searches
Let’s change “rating” to a “byte”, and change “user” to a “keyword”.

Before:

"rating": {
  "type": "long"
},
"user": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

After:

"rating": {
  "type": "byte"
},
"user": {
  "type": "keyword"
}
4. Define a New Index with the Mapping
“comments” is a new index with our explicit mapping:

PUT comments
{
  "mappings": {
    "comment_type": {
      "properties": {
        "comment": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "comment_time": {
          "type": "date"
        },
        "rating": {
          "type": "byte"
        },
        "user": {
          "type": "keyword"
        }
      }
    }
  }
}
Add a document to the new index…
• Test the new mapping by adding a document
‒ Notice since “user” is a “keyword” now, only exact searches
return a hit:
GET comments/_search
{
  "query": {
    "match": {
      "user": "ricardo"
    }
  }
}

“ricardo” does not match “Ricardo”

GET comments/_search
{
  "query": {
    "match": {
      "user": "Ricardo"
    }
  }
}

“Ricardo” is a match!
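Why the lowercase query misses can be shown with a toy model of the two field types. This is an illustrative Python sketch, assuming that a keyword field indexes the whole value as a single unmodified term while a text field lowercases and splits it; the helper names are hypothetical.

```python
import re


def keyword_terms(value):
    # keyword: the entire value becomes one term, unchanged.
    return [value]


def text_terms(value):
    # text (simplified): split on non-word characters and lowercase.
    return [t.lower() for t in re.findall(r"\w+", value)]


def matches(query, indexed_value, to_terms):
    # The query string goes through the same term logic as the field.
    indexed = set(to_terms(indexed_value))
    return any(term in indexed for term in to_terms(query))


print(matches("ricardo", "Ricardo", keyword_terms))  # False
print(matches("Ricardo", "Ricardo", keyword_terms))  # True
print(matches("ricardo", "Ricardo", text_terms))     # True
```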
Can you change a
mapping?
Can you change a mapping?
• No - not without reindexing your documents
‒ You can add fields and modify a few properties, but…
‒ …all other schema changes require a reindexing
• Why not?
‒ If you switch a field’s data type, all existing documents with that
field already indexed would become unsearchable on that field

• Why can I add fields without reindexing?
‒ Adding a field has no effect on the existing indexed fields or
documents
‒ The index can freely grow, but indexed fields cannot change
data types
Controlling the dynamic feature…
• You can control the effect of new fields added to a mapping
using the “dynamic” property (three options):
           doc indexed?   fields indexed?   mapping updated?
“true”     ✔              ✔                 ✔
“false”    ✔              𝙓                 𝙓
“strict”   𝙓

PUT comments
{
  "mappings": {
    "comment_type": {
      "dynamic" : "false",
      "properties": {
        ...
      }
    }
  }
}

With “dynamic” set to “false”: index any documents with unmapped fields, but ignore the fields and do not update the mapping
Adding Fields to a Mapping
• There are different ways to add a field:
‒ If the “dynamic” setting is “true”, you can add a field by simply
indexing a document that contains the new field
‒ Alternately, you can use the PUT API:

We want to modify the mapping for “comment_type”:

PUT comments/_mapping/comment_type
{
  "properties": {
    "comment_location" : {
      "type" : "geo_point"
    }
  }
}

This adds a new field named “comment_location”
Dive Deeper
into Mappings
Array Fields
• In Elasticsearch, there is no dedicated array type
‒ But, any field can contain multiple values (as long as they are the
same type)

PUT comments/comment_type/2
{
  "user" : "Maria",
  "comment" : "Good morning everyone",
  "comment_time" : "2017-05-15T07:14:39",
  "rating" : 4
}

“rating” is a single value in this document

POST comments/comment_type/2/_update
{
  "doc": {
    "rating" : [4,3,5]
  }
}

“rating” is an array now
Date Formats
• Use the format setting to specify the format of your date
fields:
‒ format uses either a date string like “MM/dd/YYYY”, or
‒ one of the many built-in date formats
• The default format value is
strict_date_optional_time||epoch_millis

PUT comments/comment_type/5
{
  "comment_time" : "2016-11-06T15:31:18-07:00"
}

"comment_time": {
  "type": "date"
},
Specifying a Date Format

{
  "mappings": {
    "comment_type": {
      "properties": {
        "last_viewed" : {
          "type": "date",
          "format": "dd/MM/YYYY"
        },
        "comment_time" : {
          "type": "date",
          "format" : "basic_date||epoch_millis"
        }
      }
    }
  }
}

‒ Use || for multiple formats
‒ View the documentation for a complete list of built-in date formats
Specifying a Date Format
• Dates must now match the format:
PUT comments/comment_type/4
{
  "last_viewed" : "17/12/2016",
  "comment_time" : 1481071407407
}

{
  "_index": "comments",
  "_type": "comment_type",
  "_id": "4",
  "_score": 1,
  "_source": {
    "last_viewed": "17/12/2016",
    "comment_time": 1481071407407
  }
}

PUT comments/comment_type/5
{
  "last_viewed" : "01/31/2017"
}

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [last_viewed]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [last_viewed]",
    "caused_by": {
      "type": "illegal_field_value_exception",
      "reason": "Cannot parse \"01/31/2017\": Value 31 for monthOfYear must be in the range [1,12]"
    }
  },
  "status": 400
}

Set “ignore_malformed” to “true” to ignore the error
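The failure above is ordinary strict date parsing: the value is read as day/month/year, and 31 is not a valid month. The Python sketch below reproduces the idea with `datetime.strptime`, assuming `%d/%m/%Y` as the equivalent of the “dd/MM/YYYY” pattern. The `ignore_malformed` argument only imitates the mapping option of the same name, and `parse_last_viewed` is a hypothetical helper.

```python
from datetime import datetime


def parse_last_viewed(value, ignore_malformed=False):
    """Parse a day/month/year string; optionally skip values that do not fit."""
    try:
        return datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        if ignore_malformed:
            # Imitates ignore_malformed: drop the bad field value
            # instead of rejecting the whole document.
            return None
        raise


print(parse_last_viewed("17/12/2016"))                         # 2016-12-17 00:00:00
print(parse_last_viewed("01/31/2017", ignore_malformed=True))  # None

try:
    parse_last_viewed("01/31/2017")
except ValueError as err:
    print("rejected:", err)
```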
Meta-Fields
• Each document has metadata fields (_index, _type, _id)
• Some meta-fields we have not seen yet include:
‒ _all: a special catch-all field which concatenates the values of all
of the other fields into one big string
‒ _meta: for storing application-specific metadata

Disabling _all
• The _all field is being removed in future versions of
Elasticsearch
‒ it cannot be used for new indices in Elasticsearch 6
‒ you can disable it to save disk space if your application is not
using _all

Disable _all for comment_type:

PUT comments
{
  "mappings": {
    "comment_type": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "comment": {
          "type": "text"
        },
        ...
The _meta Field
• Using the “_meta” field, you can put any custom metadata
you want in your mapping
‒ Any “_meta” data is associated with the type (not your individual
documents)

PUT comments/_mapping/comment_type
{
  "_meta" : {
    "comment_type_version" : "2.1"
  }
}

Store any application-specific JSON here…
Common Field Parameters
• “index”: should the field be indexed (defaults to true)
• “null_value”: provide a default value if the incoming field is
null. Default is “null”, which means skip the field

"properties": {
  "my_password" : {
    "type": "keyword",
    "index": false
  },
  "my_rating" : {
    "type": "integer",
    "null_value": 1
  }
}

‒ “my_password” will not be indexed
‒ The value of 1 will be indexed for “my_rating” in documents where it is null
Specifying Analyzers
Specifying a Custom Analyzer
PUT comments
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 7
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_edgengram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "comment_type": {
      "properties": {
        "comment": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

‒ Define “my_analyzer” in the “analysis” section
‒ Configure the “comment” field to use “my_analyzer”
Specifying a Custom Analyzer
• Let’s test my_analyzer:

POST comments/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I like Elasticsearch"
}

Tokens: lik, like, ela, elas, elast, elasti, elastic
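The token list follows directly from the min_gram and max_gram settings. Here is a short Python sketch of the same computation, assuming a simplified word-splitting tokenizer in place of the standard tokenizer; `edge_ngrams` and `my_analyzer` are illustrative names only.

```python
import re


def edge_ngrams(token, min_gram, max_gram):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    longest = min(max_gram, len(token))
    # Tokens shorter than min_gram produce no output at all.
    return [token[:n] for n in range(min_gram, longest + 1)]


def my_analyzer(text, min_gram=3, max_gram=7):
    tokens = [t.lower() for t in re.findall(r"\w+", text)]  # tokenize + lowercase
    grams = []
    for token in tokens:
        grams.extend(edge_ngrams(token, min_gram, max_gram))
    return grams


print(my_analyzer("I like Elasticsearch"))
# ['lik', 'like', 'ela', 'elas', 'elast', 'elasti', 'elastic']
```

The single-letter token "i" is shorter than min_gram, so it contributes nothing, and "elasticsearch" is cut off at seven characters.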
Specifying a Search Analyzer
• The edgeNGram filter in the previous example is useful for
indexed text fields
‒ but using edgeNGram on a query string rarely makes sense
• Use the search_analyzer property to specify a different
analyzer for query strings:
PUT comments
{
  ...
  "mappings": {
    "comment_type": {
      "properties": {
        "comment": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

Search terms will use the “standard” analyzer
Multi-Fields
Multi-Fields
• It is often useful to index the same field in different ways
‒ For example, it allows your full text to be analyzed in multiple
ways
• You can specify an “analyzer” property for each field

"comment": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    },
    "english_comment" : {
      "type" : "text",
      "analyzer": "english"
    },
    "whitespace_comment" : {
      "type" : "text",
      "analyzer": "whitespace"
    }
  }
}

This “comment” field will be indexed four different ways
Accessing a Multi-Field
• Use the dot operator in a query to refer to a multi-field:

PUT comments/comment_type/1
{
  "comment" : "Hello, world!"
}

“Hello, world!” gets indexed four different ways

GET comments/_search
{
  "query": {
    "match": {
      "comment.english_comment": "hello"
    }
  }
}

Use the dot operator to choose which multi-field to query
Index Templates
Index Templates
• Index templates allow you to define mappings and settings
that will automatically be applied to newly-created indices:
‒ You can have multiple index templates
‒ They get “merged” and applied to the created index
‒ You can provide an “order” value to control the merging process
• When a new index is created:
1. The default settings and mappings are applied
2. Settings and mappings from the lowest “order” template are
applied
3. Settings and mappings from the higher “order” templates are
applied, overriding previous settings along the way
4. The settings and mappings from the PUT command of the index
are applied last and override all previous templates
Defining a Template
• Use the _template endpoint to add, view and delete
templates
PUT _template/my_template_1
{
  "template" : "*",
  "order" : 1,
  "mappings": {
    "my_type" : {
      "_all": {
        "enabled": false
      }
    }
  }
}
‒ “my_template_1” is the name of the template
‒ "template" : "*" applies this template to any new index
Let’s define a second template:

PUT _template/my_template_2
{
  "template" : "my_*",
  "order" : 5,
  "mappings": {
    "my_type" : {
      "properties": {
        "version" : {
          "type" : "integer"
        }
      }
    }
  }
}
‒ "template" : "my_*" applies this template to any new index that
starts with “my_”
Now define a new index:
• Notice our index named “my_test” matches both templates:

PUT my_test

GET my_test/_mappings
{
  "my_test": {
    "mappings": {
      "my_type": {
        "_all": {
          "enabled": false
        },
        "properties": {
          "version": {
            "type": "integer"
          }
        }
      }
    }
  }
}
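The merge order described on the previous slides can be sketched in Python. This is only an illustration of the idea, not how Elasticsearch implements it: the helper names `deep_merge` and `resolve_mappings` are made up for this example, and `fnmatchcase` is used as a stand-in for template pattern matching (it accepts more pattern syntax than the `*` shown in the slides).

```python
from fnmatch import fnmatchcase


def deep_merge(base, override):
    """Return a new dict where `override` wins over `base`; nested dicts are merged."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_mappings(index_name, templates, request_mappings=None):
    """Apply matching templates from lowest to highest "order", then the PUT body."""
    matching = [t for t in templates if fnmatchcase(index_name, t["template"])]
    result = {}
    for template in sorted(matching, key=lambda t: t["order"]):
        result = deep_merge(result, template.get("mappings", {}))
    return deep_merge(result, request_mappings or {})


# The two templates from the slides, deliberately listed out of order.
templates = [
    {"template": "my_*", "order": 5,
     "mappings": {"my_type": {"properties": {"version": {"type": "integer"}}}}},
    {"template": "*", "order": 1,
     "mappings": {"my_type": {"_all": {"enabled": False}}}},
]

print(resolve_mappings("my_test", templates))
```

For “my_test” both templates match, so the result carries the “_all” setting and the “version” property, which mirrors the mapping returned by GET my_test/_mappings. An index whose name does not start with “my_” would only receive the “_all” setting.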
Viewing and Deleting Templates
• Use a GET to view your defined templates:
GET _template

• You can also DELETE a template:
DELETE _template/my_template_2
Mapping Best Practices
• Define your own mappings!
‒ The dynamically-created mappings may have data types or
settings that are wrong for your documents
• Use index templates!
• Define a default template for “*” that contains global
settings for all your new indices
• Templates are a great place to define your custom
analyzers
• Disable “_all” if you do not need it
• Try to avoid using multiple types in a single index
Chapter Review
Summary
• Every type has a mapping that defines a document’s fields and
their data types
• Mappings are created dynamically, or you can define your
own
• A common way to define an explicit mapping is to modify a
dynamic mapping using a well-designed sample document
• Define your mappings well! Only a few changes can be made to
an existing mapping, such as adding fields.
• Use the analyzer property to specify a custom analyzer for a
field, and search_analyzer to specify a different analyzer for
query strings
• Index templates allow you to define mappings and settings
that will automatically be applied to newly-created indices
Quiz
1. What data type would “value” : 300 get mapped to
dynamically?
2. If “dynamic” is set to “strict”, what happens when an
unmapped field is indexed?
3. True or False: You can use mappings to specify text
analyzers at the field level.
4. True or False: You can change a field’s data type from
“integer” to “long” because those two types are compatible.
5. In the following mapping, which analyzer would get used on
“my_field” during indexing? At search time?
"my_field": {
  "type": "text",
  "analyzer": "my_custom_analyzer",
  "search_analyzer": "english"
}
Lab 4
Mappings
Chapter 5
More Search Features
1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships
Topics covered:
• Filters
• Index Aliases
• The term Query
• More Term-level Queries
• The multi_match Query
• Fuzziness
• Highlighting
• Synonyms
We have a new dataset…
Dataset with nutritional information:
• 500 food items in an index named “nutrition”:
‒ Notice “servings” and “details” have nested fields (common in JSON)
{
  "servings": {
    "serving_size_unit": "chips",
    "servings_per_container": 32,
    "serving_size_qty": 10
  },
  "ingredients": [
    "whole yellow corn",
    "vegetable oil",
    "corn",
    "soybean",
    "canola and/or sunflower oil",
    "salt"
  ],
  "brand_name": "Tostitos",
  "item_name": "Tortilla Chips, Thick & Hearty Rounds",
  "details": {
    "total_fat": 6,
    "calories_from_fat": 50,
    "calories": 140,
    "saturated_fat": 1
  },
  "item_description": "Thick & Hearty Rounds"
}
Filters
Query vs. Filter
• We have seen query contexts, but there is another kind of
context known as a filter context:
‒ Query context: results in a _score
‒ Filter context: results in a “yes” or “no”
• Query context asks “How well does this document match this
query?” and answers “Here’s the _score”
• Filter context asks “Does this document match this query?”
and answers “Yes” (or “No”)
The filter Clause
• The filter clause is an optional clause in a bool query:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "comment": "star"
        }
      },
      "filter": {
        "match": {
          "comment": "favorite"
        }
      }
    }
  }
}
‒ The _score will be the result of the “must” match query only
‒ Hits must have “favorite” in the “comment” field
‒ Example hits:
"My favorite movie star in Hollywood is Harrison Ford."
"My favorite movie is Star Wars!"
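To make the difference between the two contexts concrete, here is a toy Python simulation of the bool query above. It is a sketch under loud assumptions: `tokens`, `toy_score`, and `bool_search` are hypothetical helpers, the "score" is a plain term count rather than Elasticsearch's real relevance scoring, and the third document is invented for the example.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and keep only letters."""
    return ["".join(ch for ch in word if ch.isalpha()) for word in text.lower().split()]


def toy_score(text, term):
    """Hypothetical relevance: how often the term occurs (not the real scoring)."""
    return tokens(text).count(term)


def bool_search(docs, must_term, filter_term):
    hits = []
    for doc in docs:
        if filter_term not in tokens(doc):   # filter context: yes or no only
            continue
        score = toy_score(doc, must_term)    # query context: produces the score
        if score > 0:
            hits.append((doc, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


docs = [
    "My favorite movie star in Hollywood is Harrison Ford.",
    "My favorite movie is Star Wars!",
    "The star of the show arrived late.",   # hypothetical extra document
]
hits = bool_search(docs, must_term="star", filter_term="favorite")
for doc, score in hits:
    print(score, doc)
```

The third document contains “star” but fails the filter, so it is dropped without ever being scored. The two remaining hits get their score from the “must” term alone; the filter term adds nothing to it.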
Index Aliases
Index Aliases
• The Index Aliases API allows you to define an alias for an
index
• The syntax looks like:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "INDEX_NAME",
        "alias": "ALIAS_NAME"
      }
    }
  ]
}
Example of Index Alias

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my_index",
        "alias": "my_alias"
      }
    }
  ]
}

GET my_alias/_search
{
  "query": {
    "match_all": {}
  }
}
‒ Returns hits from “my_index”
Filtered Aliases
• An alias can include a filter
‒ useful for creating different views of the same index:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "customers",
        "alias": "active_customers",
        "filter": {
          "match": {
            "customer_status": "active"
          }
        }
      }
    }
  ]
}
‒ Any query on the active_customers alias will only include
“active” customers
The term Query
Term-level Queries
• Data fits into two groups:
‒ Exact values: includes numeric values, dates, and certain
strings
‒ Full text: unstructured text data that we want to search
• When you submit a string in a match query, the mapping is
checked to see if the string should be analyzed
‒ This makes sense: the query string is analyzed the same way as
the indexed text, so the two can match
• Term-level queries are queries that do not go through an
analysis phase
‒ for searching numerics, dates, and string values that are exact
‒ typically used in filter contexts
The term Query
• In a term query, the search terms are not analyzed
‒ A hit has to match the search term exactly
“Find all products from Market Pantry.”
GET nutrition/_search
{
  "query": {
    "term": {
      "brand_name.keyword" : "Market Pantry"
    }
  }
}
‒ The nutrition index has 4 hits

“Find all products from Market.”
GET nutrition/_search
{
  "query": {
    "term": {
      "brand_name.keyword" : "Market"
    }
  }
}
‒ This search returns 0 hits
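Why does “Market” return 0 hits on the keyword field? The following Python sketch mimics the behavior. It is a simplification with made-up helpers: `analyze` only lowercases and splits on whitespace (a real analyzer does more), and `term_query` and `match_query` are hypothetical stand-ins for the queries of the same name.

```python
def analyze(text):
    """Rough stand-in for a text analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def term_query(indexed_terms, search_term):
    """Term-level: the search term is compared as-is with the indexed terms."""
    return search_term in indexed_terms


def match_query(indexed_terms, search_text):
    """Full text: the search text is analyzed first; any resulting token may match."""
    return any(token in indexed_terms for token in analyze(search_text))


brand = "Market Pantry"
keyword_terms = [brand]         # keyword field: the whole value is a single term
text_terms = analyze(brand)     # text field: ["market", "pantry"]

print(term_query(keyword_terms, "Market Pantry"))   # exact value matches
print(term_query(keyword_terms, "Market"))          # partial value does not
print(match_query(text_terms, "Market"))            # analyzed text matches
print(term_query(text_terms, "Market"))             # unanalyzed "Market" vs "market"
```

The last line shows a common surprise: a term query against an analyzed text field can miss simply because the indexed terms were lowercased and the search term was not.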
Using term in a Filter
• You can use term at the query level (see previous slide)
‒ But, term queries are typically used in filter clauses, as we often
filter on exact matches
“I am looking for products from Market Basket with syrup in the
ingredients”
GET nutrition/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "ingredients": "syrup"
          }
        }
      ],
      "filter": {
        "term": {
          "brand_name.keyword": "Market Basket"
        }
      }
    }
  }
}
‒ 74 hits without the filter…
‒ …only 1 hit with the filter
Match vs. Term and Filter vs. Query
• When do we use a match query vs. a term query?
‒ And when do we use a filter instead of a query?
• It depends on your use case:

match query
• You need the search string analyzed
• You want scoring
term query
• You do not want the search string analyzed
• You want scoring
match in a filter
• You need the search string analyzed
• There is no scoring in a filter
term in a filter
• You do not want the search string analyzed
• There is no scoring in a filter
Using term out of a Filter
• Suppose you want to boost certain brand names, but not
limit results to only those brand names:
GET nutrition/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "item_name": "cheese"
          }
        }
      ],
      "should": [
        {
          "term": {
            "brand_name.keyword": {
              "value": "Giant"
            }
          }
        }
      ]
    }
  }
}
‒ The field must match “Giant” exactly, and it will contribute to
the _score
‒ Hits:
"brand_name": "Giant",
"item_name": "Cheddar Cheese Twice Baked Potatoes"
"brand_name": "Giant",
"item_name": "Cottage Cheese, Small Curd, 1% Milkfat, Lowfat, No Salt Added"
"brand_name": "365 Everyday Value",
"item_name": "Feta Cheese"
"brand_name": "Kraft",
"item_name": "Cheese, Italian Three Cheese"
…and so on
More Term-level Queries
The terms Query
• Use terms to find documents that match any of the given
terms:
“I am looking for syrup from Market Basket, Nabisco, or Barilla.”
GET nutrition/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "ingredients": "syrup"
          }
        }
      ],
      "filter": {
        "terms": {
          "brand_name.keyword": [
            "Market Basket",
            "Nabisco",
            "Barilla"
          ]
        }
      }
    }
  }
}
‒ The terms query allows you to search multiple terms
Wildcard Query
• The wildcard query is a term-based search that contains a
wildcard:
‒ * = anything
‒ ? = any single character
• wildcard can be expensive, especially if you have leading
wildcards
“I am looking for cookies from a brand name that starts with “Ste”.”
GET nutrition/_search
{
  "query": {
    "bool": {
      "must":
        {"match": {"item_name": "cookie"}},
      "filter": {
        "wildcard": {
          "brand_name.keyword": {
            "value": "Ste*"
          }
        }
      }
    }
  }
}
‒ Hit: “Premium Dark Chocolate, with Mint Cookie Crunch” from
“Stewart’s” brand.
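The two wildcard characters can be illustrated with a small Python translation to a regular expression. This is only a sketch of the matching semantics described above; `wildcard_to_regex` and `wildcard_match` are hypothetical helpers and say nothing about how Elasticsearch evaluates the query internally.

```python
import re


def wildcard_to_regex(pattern):
    """Translate * (anything) and ? (any single character) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern, term):
    """True when the whole term matches the wildcard pattern (case-sensitive)."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


print(wildcard_match("Ste*", "Stewart's"))
print(wildcard_match("St?wart's", "Stewart's"))
print(wildcard_match("ste*", "Stewart's"))
```

Because the pattern is compared with the unanalyzed keyword value, the lowercase pattern “ste*” does not match “Stewart's” in this sketch.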
The exists Query
• The exists query returns documents that have at least one
non-null value in the specified field
‒ It is a simple way to filter out documents with null fields
“I am looking for cookies that have their ingredients listed.”
GET nutrition/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "item_name": "cookie"
          }
        }
      ],
      "filter": {
        "exists": {
          "field": "ingredients"
        }
      }
    }
  }
}
The multi-match Query
The multi_match Query
• multi_match is similar to match except you can query
multiple fields:
‒ The syntax looks like:

GET nutrition/_search
{
  "query" : {
    "multi_match": {
      "query": "",
      "fields": []
    }
  }
}
‒ Enter the query string once…
‒ …and an array of fields to search
Example of multi_match
• By default, the _score from the best field is used
‒ But you can control various aspects of how the _score is
calculated
“Search for ‘olive oil’ in either the item name or in the
ingredients”.
GET nutrition/_search
{
  "query": {
    "multi_match": {
      "query": "olive oil",
      "fields": [
        "item_name",
        "ingredients"
      ]
    }
  }
}
‒ Hits:
"Extra Virgin Olive Oil"
"Trout, Golden Smoked Fillets 3.75 Oz"
"Olive Oil, Extra Virgin, Cold Pressed, Bread & Salads"
…and so on
Per-field boosting in multi-match
• Individual fields can be boosted
‒ Use a caret (^) followed by a boost multiplier

GET nutrition/_search
{
  "query": {
    "multi_match": {
      "query": "olive oil",
      "fields": [
        "item_name",
        "ingredients^2"
      ]
    }
  }
}
‒ The “ingredients” score is 2 times more important than “item_name”
‒ Hits:
"Trout, Golden Smoked Fillets 3.75 Oz"
"Olive Oil, Extra Virgin, Cold Pressed, Bread & Salads"
"House Salad Dressing"
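A rough Python sketch of how a caret boost can change which field "wins" under the default best-field behavior. Everything here is illustrative: `parse_field` and `best_fields_score` are hypothetical helpers, and the per-field scores are invented numbers, not values Elasticsearch would return.

```python
def parse_field(spec):
    """Split "ingredients^2" into ("ingredients", 2.0); the boost defaults to 1.0."""
    name, _, boost = spec.partition("^")
    return name, float(boost) if boost else 1.0


def best_fields_score(field_scores, field_specs):
    """Take the highest boosted per-field score (the "best field")."""
    boosted = []
    for spec in field_specs:
        name, boost = parse_field(spec)
        boosted.append(field_scores.get(name, 0.0) * boost)
    return max(boosted, default=0.0)


# Invented per-field scores for one document.
scores = {"item_name": 3.0, "ingredients": 2.0}

print(best_fields_score(scores, ["item_name", "ingredients"]))     # item_name wins
print(best_fields_score(scores, ["item_name", "ingredients^2"]))   # ingredients wins
```

With the boost, a document that matches mainly in “ingredients” can outrank one that matches mainly in “item_name”, which is consistent with the reordered hits on the slide.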
Wildcards in Field Names
• multi_match allows for wildcards in field names:
GET nutrition/_search
{
  "query": {
    "multi_match": {
      "query": "rice",
      "fields": [
        "*name",
        "ingredients"
      ]
    }
  }
}
‒ "*name" searches any field that ends in “name”
Types of multi_match Queries
• best_fields (default): score is from the best field
• most_fields: score is a combination from all matching
fields
• cross_fields: searches for each term in the query string in
any of the fields in the document
GET nutrition/_search
{
  "query": {
    "multi_match": {
      "query": "rice",
      "fields": [
        "*name",
        "ingredients"
      ],
      "type": "most_fields"
    }
  }
}
‒ "type": "most_fields" changes the scoring to use “most_fields”
Fuzziness
What is Fuzziness?
• Fuzzy matching allows for query-time matching of
misspelled words
‒ It is a certainty that search users will misspell words
• Fuzzy matching treats two words that are “fuzzily” similar
as if they were the same word
‒ Fuzziness is something that can be assigned a value:
‒ It refers to the number of character modifications, known as 'edits',
needed to make two words match
‒ fuzziness = 1: “the beetles” matches “the beatles” (one edit: e → a)
‒ fuzziness = 2: “nervan” matches “nirvana” (two edits: e → i, plus an
added a)
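The number of edits between two words can be computed with the classic Levenshtein algorithm. The sketch below is a plain textbook version in Python, included only to make "edits" concrete: `edit_distance` is a name chosen for this example, and it counts insertions, deletions, and substitutions. Treating a swap of two adjacent characters as a single edit, which Elasticsearch does by default, is not modeled here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


print(edit_distance("beetles", "beatles"))
print(edit_distance("nervan", "nirvana"))
```

A query term matches an indexed term when this distance is no larger than the fuzziness value.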
Fuzzy match Query
• The match query has a “fuzziness” property
‒ Can be set to 0, 1 or 2; or can be set to AUTO

GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients" : "sugar"
    }
  }
}
‒ 115 hits

GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients" : "suger"
    }
  }
}
‒ 0 hits

GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients" : {
        "query" : "suger",
        "fuzziness": 1
      }
    }
  }
}
‒ 115 hits
Choosing the Fuzziness
• Be aware of the side effect of setting fuzziness
‒ Suppose you set it to 2
‒ a word like “to” also matches “as”, “by”, “if”, “the”, and so on
• Let’s search for items from “J & Ds” and set fuzziness to 2:
GET nutrition/_search
{
  "query": {
    "match": {
      "brand_name": {
        "query": "j & d",
        "fuzziness": 2
      }
    }
  }
}
‒ 61 hits, even though “J & Ds” only has one item in the index
The AUTO Setting
• fuzziness can be set to AUTO
‒ generates an edit distance based on the length of the term
‒ AUTO is the preferred way to use fuzziness
• Notice the improvement on our previous query for “J & Ds”:
GET nutrition/_search
{
  "query": {
    "match": {
      "brand_name": {
        "query": "j & d",
        "fuzziness": "auto"
      }
    }
  }
}
‒ Only 2 hits with “auto”:
"J & Ds"
"J. Skinner"
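A small Python sketch of the length-based rule behind AUTO. It assumes the default thresholds (terms of 1 to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two); `auto_fuzziness` is a hypothetical helper name. It also assumes that the analyzer reduces “j & d” to the two terms “j” and “d”.

```python
def auto_fuzziness(term):
    """Edit distance allowed for one term under AUTO, assuming default thresholds."""
    length = len(term)
    if length <= 2:
        return 0      # very short terms must match exactly
    if length <= 5:
        return 1
    return 2


for term in ["j", "d", "sugar", "nirvana"]:
    print(term, auto_fuzziness(term))
```

Both “j” and “d” are single characters, so AUTO allows no edits for them. That is one way to read the drop from 61 hits to 2.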
Multiple terms in the query
• If the query has multiple terms, the fuzziness value is
applied to each term
‒ For example “citric” and “sitrik” are 2 edits apart
‒ “acid” and “acet” are also 2 edits apart

GET nutrition/_search
{
  "query": {
    "match": {
      "ingredients": "citric acid"
    }
  }
}
‒ 120 hits

GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients" : {
        "query" : "sitrik acet",
        "fuzziness": 2
      }
    }
  }
}
‒ 122 hits
Highlighting
Highlighting
• A common use case for search results is to highlight the
matched terms
‒ Elasticsearch makes it easy to do this

GET my_index/_search
{
  "query" : {
    ...
  },
  "highlight": {
    "fields": {
      ...
    }
  }
}
‒ Add a “highlight” clause that lists the fields you want
highlighted
Example of Highlighting
GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients": "oil"
    }
  },
  "highlight": {
    "fields": {
      "ingredients" : {}
    }
  }
}
‒ Highlight “oil” if it appears in “ingredients”
‒ The response will contain a “highlight” section:
"highlight": {
  "ingredients": [
    "vegetable <em>oil</em>",
    "canola <em>oil</em> or soybean <em>oil</em>"
  ]
}
Highlighting multiple fields
• Note: you have to query a field if you want it highlighted in
the results (or set require_field_match: false)
Querying only “ingredients”:
GET nutrition/_search
{
  "size": 200,
  "query" : {
    "match": {
      "ingredients": "oil"
    }
  },
  "highlight": {
    "fields": {
      "ingredients" : {},
      "item_name" : {}
    }
  }
}
Response (“item_name” not here):
"highlight": {
  "ingredients": [...]
}

Querying both fields:
GET nutrition/_search
{
  "query" : {
    "multi_match": {
      "query": "oil",
      "fields": ["ingredients", "item_name"]
    }
  },
  "highlight": {
    "fields": {
      "ingredients" : {},
      "item_name" : {}
    }
  }
}
Response:
"highlight": {
  "ingredients": [...],
  "item_name": [...]
}
Changing the Tags
• Use “pre_tags” and “post_tags” to change the highlighting:
"highlight": {
  "pre_tags" : ["<elasticsearch-hit>"],
  "post_tags" : ["</elasticsearch-hit>"],
  "fields": {
    "ingredients" : {},
    "item_name" : {}
  }
}
‒ Results highlighted with your custom tags:
"highlight": {
  "ingredients": [
    "grapeseed <elasticsearch-hit>oil</elasticsearch-hit>"
  ],
  "item_name": [
    "Grapeseed <elasticsearch-hit>Oil</elasticsearch-hit>"
  ]
}
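The effect of the tags can be mimicked in a few lines of Python. This is a toy illustration only: `highlight` is a hypothetical helper that wraps whole-word, case-insensitive occurrences of a single term, whereas the real highlighter works from the analyzed query and the field's analyzer.

```python
import re


def highlight(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap whole-word, case-insensitive occurrences of term in the given tags."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, text)


print(highlight("canola oil or soybean oil", "oil"))
print(highlight("Grapeseed Oil", "oil",
                pre_tag="<elasticsearch-hit>", post_tag="</elasticsearch-hit>"))
```

The default tags reproduce the `<em>` output shown earlier, and passing custom tags reproduces the “Grapeseed Oil” example while keeping the original casing of the matched text.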
Synonyms
Working with Synonyms
• Searching for “USA” should probably match “US” and
“United States of America” (and vice-versa)
• You can use synonyms to achieve this behavior
‒ As of Elasticsearch 5.2, use the synonym graph token filter
‒ “US” passes through the synonym graph token filter and comes out as
“US”, “USA”, and “United States of America”
Defining Synonyms
• Synonyms are defined either in the mapping, or in a
separate text file
‒ You can format synonyms in two ways (which behave differently)

Comma-separated list
‒ If any of the terms is encountered, it is replaced by all the terms
us, usa, united states of america
won’t, will not

left side => right side
‒ Terms on the left are replaced with terms on the right
international business machines => ibm
queen => queen, monarch
Expand or Contract
• Depending on the definition of the synonym, terms and
phrases can be expanded or contracted (or simply
replaced)
‒ Using the synonyms from the previous slide:
“usa” gets expanded:
usa → us, usa, united states of america
“queen” gets replaced (expanded):
queen → queen, monarch
“international business machines” gets replaced (contracted):
international business machines → ibm
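The two rule formats can be modeled with a short Python sketch. It treats each rule as a lookup from a whole term or phrase to its replacements, which is a simplification: `parse_rules` and `apply_synonyms` are hypothetical helpers, and the real token filter works on token streams rather than whole strings.

```python
def parse_rules(lines):
    """Build a lookup from a term or phrase to the terms that replace it."""
    rules = {}
    for line in lines:
        if "=>" in line:
            left, right = line.split("=>")
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Comma-separated list: every term is replaced by all the terms.
            sources = targets = [t.strip() for t in line.split(",")]
        for source in sources:
            rules[source] = targets
    return rules


def apply_synonyms(rules, phrase):
    """Return the replacement terms, or the phrase itself when no rule applies."""
    return rules.get(phrase.lower(), [phrase.lower()])


rules = parse_rules([
    "us, usa, united states of america",
    "won't, will not",
    "queen => queen, monarch",
    "international business machines => ibm",
])

print(apply_synonyms(rules, "usa"))
print(apply_synonyms(rules, "queen"))
print(apply_synonyms(rules, "International Business Machines"))
print(apply_synonyms(rules, "monarch"))
```

Note the asymmetry of the arrow form: “queen” expands to “queen” and “monarch”, but “monarch” has no rule of its own and stays as it is.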
Index vs. Query Time
• When does the expansion or contraction occur?
‒ Either at index time, query time, or both

Index time: “US” is indexed 3 ways (us, usa, united states of america)
PUT test_index/doc/1
{
  "comment" : "I won't be in the US in July"
}

Query time:
GET test_index/_search
{
  "query": {
    "match": {
      "comment": "international business machines"
    }
  }
}
‒ Becomes a search for “ibm”
Index vs. Query Time
• Should I use index or query time synonyms, or both?
‒ It depends!
• The synonym token filter can be used at both index and
query time
• The synonym_graph token filter works much better on
phrases than the synonym token filter
‒ But it only works at query time, so that might answer your
question
Configuring Synonyms
PUT test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "us, usa, united states of america",
            "won't, will not",
            "queen => queen, monarch",
            "international business machines => ibm"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase","my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "comment": {
          "type": "text",
          "analyzer": "standard",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  }
}
‒ Use synonym_graph as the type of filter
‒ List the synonyms in the mapping (or point to a text file)
‒ Use “my_analyzer” on query strings at query time
A Synonyms File
• If your synonyms are defined in a text file, configure the
"synonyms_path" property to specify the file:
"filter" : {
  "my_synonyms" : {
    "type" : "synonym_graph",
    "synonyms_path" : "/path/to/file.txt"
  }
}
Test Our Synonyms
• Let’s index a couple of documents that have terms that
match our synonyms:

PUT test_index/doc/1
{
  "comment" : "I won't be in the US in July"
}
‒ Because we are using synonym_graph, no expanding or contracting
happens at index time

PUT test_index/doc/2
{
  "comment" : "IBM beats forecast by $0.23 a share"
}
Test Our Synonyms
• Running a query on “united states of america” also
queries the “comment” field for “usa” and “us”:
GET test_index/_search
{
  "query": {
    "match_phrase": {
      "comment": "united states of america"
    }
  }
}
‒ Same as searching for “us”, so doc 1 is a hit:
"hits": [
  {
    "_index": "test_index",
    "_type": "doc",
    "_id": "1",
    "_score": 8.396978,
    "_source": {
      "comment": "I won't be in the US in July"
    }
  }
]
Test Our Synonyms
• “International Business Machines” gets replaced with
“ibm” in this query:
GET test_index/_search
{
  "query": {
    "match": {
      "comment": "international business machines"
    }
  }
}
‒ Same as searching for “ibm”, so doc 2 is a hit:
"hits": [
  {
    "_index": "test_index",
    "_type": "doc",
    "_id": "2",
    "_score": 0.7081689,
    "_source": {
      "comment": "IBM beats forecast by $0.23 a share"
    }
  }
]
Let’s Index Some Documents…

PUT test_index/doc/3
{
  "comment" : "We won't meet the Queen of England in the United States of America"
}

PUT test_index/doc/4
{
  "comment" : "We won't meet the Monarch of England in the United States of America"
}
‒ The only difference is “queen” and “monarch”
…and Search for “queen”
• A search for “queen” returns both docs with the same score
‒ It might be nice to give a higher score to the doc that actually
contained the term “queen”
GET test_index/_search
{
  "query": {
    "match": {
      "comment": "queen"
    }
  }
}
‒ Both documents are hits, and both have the same score:
"hits": [
  {
    "_index": "test_index",
    "_type": "doc",
    "_id": "3",
    "_score": 0.9792457,
    "_source": {
      "comment": "We won't meet the Queen of England in the United States of America"
    }
  },
  {
    "_index": "test_index",
    "_type": "doc",
    "_id": "4",
    "_score": 0.9792457,
    "_source": {
      "comment": "We won't meet the Monarch of England in the United States of America"
    }
  }
]
Give Preference to the Original Text
• Let’s run a bool query that searches “comment” using both
“my_analyzer” (with synonyms applied) and the “standard”
analyzer (with no synonyms applied)
GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "comment": "queen"
          }
        },
        {
          "match": {
            "comment": {
              "query": "queen",
              "analyzer": "standard"
            }
          }
        }
      ]
    }
  }
}
‒ The first clause searches for “queen” or “monarch”
‒ The second clause searches only for “queen”
Give Preference to the Original Text
• Notice the document with “queen” scores higher, even though “queen” and “monarch” are synonyms:

"hits": {
  "total": 2,
  "max_score": 1.9584914,
  "hits": [
    {
      "_index": "test_index",
      "_type": "doc",
      "_id": "3",
      "_score": 1.9584914,
      "_source": {
        "comment": "We won't meet the Queen of England in the United States of America"
      }
    },
    {
      "_index": "test_index",
      "_type": "doc",
      "_id": "4",
      "_score": 0.9792457,
      "_source": {
        "comment": "We won't meet the Monarch of England in the United States of America"
      }
    }
  ]
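The scores above are consistent with a bool query adding up the scores of every should clause a document matches. The sketch below only reproduces that arithmetic; it does not call Elasticsearch. The per-clause score of the “standard” clause is an assumption inferred from the totals on the slide, and the names SYNONYM_CLAUSE, STANDARD_CLAUSE and bool_should_scores are made up for illustration.

```python
# Per-clause scores, keyed by document _id.
# The synonym clause ("my_analyzer") matches both documents, as in the single match query.
SYNONYM_CLAUSE = {"3": 0.9792457, "4": 0.9792457}
# Assumption: the "standard" clause matches only doc 3, with the same score.
STANDARD_CLAUSE = {"3": 0.9792457}


def bool_should_scores(*clauses):
    """Add up, per document, the score of every should clause that matched it."""
    totals = {}
    for clause in clauses:
        for doc_id, score in clause.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    return totals


scores = bool_should_scores(SYNONYM_CLAUSE, STANDARD_CLAUSE)
# Highest score first, as in the "hits" array.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Doc 3 matches both clauses and doc 4 only one, which is why the document containing the original term ranks first.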
Chapter Review
Summary
• Filters are cached and reused by Elasticsearch, for even
greater performance
• The Index Aliases API allows you to define an alias for an
index
• multi_match is similar to the match query and allows you to
query multiple fields
• Fuzzy matching allows for query-time matching of similar words
• Highlighting allows you to highlight search results that match the query
• Synonyms are defined either in the mapping, or in a separate text file
• The synonym_graph token filter is preferred over the synonym token filter, especially when working with phrases
Quiz
1. What is the fuzziness value between a search for “yellow submarine”
and “yelow submarene”?
2. Assuming the following synonym definition is applied at query time, what
would a search for “cat” query for?


cat, kitten, pet

3. What is the easiest way to search for the same terms across multiple
fields?
4. What is the difference between a full-text query and a term-level query?
5. Suppose we are searching StackOverflow. What is the difference between using best_fields and most_fields?

GET stackoverflow/_search
{
  "query": {
    "multi_match": {
      "query": "java",
      "fields": ["body", "subject"]
    }
  }
}
Lab 5
More Search Features
Chapter 6
The Distributed Model
1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships
Topics covered:
• Starting a Node
• Creating an Index
• Starting a Second Node
• Shards: Distribution of an Index
• Distributing Documents
• Replication
• Split Brain
Starting a Node
What is a Node?
• A node is an instance of Elasticsearch
‒ It is a Java process that runs in a JVM
• Every node has a name
‒ you can set node.name in the elasticsearch.yml file
‒ or on the command line
• Each node has a UID assigned at startup
‒ and persisted in the data folder for future restarts

bin/elasticsearch -E node.name=adam_01

[Diagram: a single node named adam_01]
Cluster
• Everything in Elasticsearch happens in the context of a
cluster
‒ Every cluster has a name (e.g. training_cluster)
‒ Default cluster name is “elasticsearch”
‒ Every node belongs to a cluster
• You can set the cluster.name that a node belongs to in
elasticsearch.yml or on the command line

bin/elasticsearch -E node.name=adam_01 -E cluster.name=training_cluster

[Diagram: training_cluster containing node adam_01]
Cluster State
• Every cluster has information referred to as the cluster
state
‒ includes the nodes in the cluster, indices, mappings, settings,
and so on
• Cluster state is stored on each node

[Diagram: training_cluster containing node adam_01, which holds clusterState0]
But not every node can edit the cluster state
• In a distributed model, if everyone could edit the cluster
state there could be inconsistencies and data loss
• We need a single source of truth...

“I am adam_01, Master of the Universe!”

[Diagram: training_cluster containing node adam_01, which holds clusterState0]
Master-eligible Nodes
• Nodes can be eligible to become the master
• When you start adam_01 it is a master-eligible node by
default
‒ which can be disabled by setting node.master: false

“OK, so maybe I am not the master, but I am eligible.”

[Diagram: training_cluster containing node adam_01, which holds clusterState0]
The Election Process
• An election process occurs to select the actual master
‒ When adam_01 starts, it elects itself to be the master

“I guess I am the only one eligible, so I win the election!”

[Diagram: training_cluster containing node adam_01, which holds clusterState0]
Data Nodes
• We also need to store data in the cluster
• A node that is able to store data is called a data node
‒ When adam_01 starts, it is a data node by default
‒ You can disable this by setting node.data: false

“I am ready for some data!”

[Diagram: training_cluster containing node adam_01 with an area for data allocation; clusterState0]
Creating an Index
Creating an Index
• Suppose you submit a PUT request to create a new index
in your cluster:

PUT my_index
{
  "settings": {
    ...
  }
}

[Diagram: a client app sends the request to adam_01 in training_cluster; clusterState0]
The Coordinating Node
• A request is not sent to the cluster - it is sent to one of the
nodes
• The node that handles your PUT request is called the
coordinating node for that request
‒ It inspects the request and determines what needs to be done
‒ It makes sure the request is executed and sends the response
back to the client

PUT my_index

[Diagram: a client app sends PUT my_index to adam_01 in training_cluster; clusterState0]
Master Nodes Create Indices
• Our PUT request for a new index can only be executed by
the master node (same for a DELETE index request)
‒ Since adam_01 is the elected master node (and the only node in
our cluster), it handles the request for this new index
• The cluster state is updated, the allocation of the index
occurs, and adam_01 sends back a response

PUT my_index
200 OK

[Diagram: a client app sends PUT my_index to adam_01, my_index is allocated, the cluster state becomes clusterState1, and the client receives 200 OK]
Starting a Second Node
Starting a Second Node
• Elasticsearch is a distributed system, which means it can
benefit from multiple machines.
‒ Let's start a new node named princess_10 on a new machine.
• By default, princess_10 is both a master-eligible node and a data node

bin/elasticsearch -E node.name=princess_10 -E cluster.name=training_cluster

[Diagram: training_cluster with nodes princess_10 and adam_01; my_index; clusterState1]
Joining the Cluster
• princess_10 has to join the cluster, but how?
‒ Nodes ping each other using unicast (on port 9300 by default)
• A seed list of hostnames should be listed in the
unicast.hosts property:
discovery.zen.ping.unicast.hosts: ["adam_01"]

“Hi adam_01. May I join training_cluster?”
“Sure, you can join training_cluster!”

[Diagram: training_cluster with nodes princess_10 and adam_01; my_index; clusterState1]
Sharing the Cluster State
• adam_01 updates the cluster state
‒ adds princess_10 to the list of nodes
• Then adam_01 sends the new cluster state to princess_10

“I am the master, and here is the cluster state”

[Diagram: training_cluster with nodes princess_10 and adam_01; my_index; both nodes hold clusterState2]
Shards: Distribution of an Index
Distributing Data
• We now have 1 cluster, 2 nodes, and 1 index
• How can Elasticsearch use both nodes?
‒ By distributing the data!

[Diagram: training_cluster with nodes princess_10 and adam_01; my_index]
"Late" Data Splitting is Expensive
• Elasticsearch "could" break an index into parts late after
creating an index, but that has some challenges:
‒ how many parts?
‒ how to divide documents?
‒ need to rebuild the entire inverted index!

[Diagram: training_cluster with my_index broken in two across princess_10 and adam_01]
“Early” Data Splitting
• Instead, Elasticsearch breaks an index into parts early, at
creation time
• An index is a virtual namespace which points to a number
of shards
‒ A shard is a worker unit that holds data and can be assigned to
nodes

An index is “split” into shards before any documents are indexed

[Diagram: training_cluster with nodes princess_10 and adam_01; my_index]
Shards are distributed across nodes
• Because every index is partitioned into shards at creation
time, it is easy to distribute them later
‒ The master decides which shards to allocate to which nodes
(published via cluster state)

clusterState3 contains the shard allocation table

When princess_10 receives clusterState2, she starts the request for copying the shards assigned to her

[Diagram: training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState3]
The Number of Primary Shards is Fixed
• The default number of primary shards for an index is 5
‒ You specify the number of shards when you create the index
‒ You can NOT change the number of primary shards, so choose
wisely! (or reindex your data)

PUT my_index
{
  "settings": {
    "number_of_shards": 6
  }
}
Distributing Documents
Distributing Documents
• If an index is partitioned into shards, what happens when
you index a document?
• We have already discussed the PUT vs. POST commands
‒ PUT: when you specify the _id
‒ POST: when you want Elasticsearch to generate the _id

PUT my_index/doc/101
{
  "firstname" : "Robert",
  "lastname" : "Smith",
  "city" : "Lancashire"
}

201 response

[Diagram: a client app sends the request to training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState3]
Routing Documents to Shards
• Documents are routed to shards using the following formula:

shard = hash(_routing) % number_of_primary_shards

‒ The document’s id is used as the _routing value by default
‒ This typically results in uniformly-sized shards (our goal!)

PUT my_index/doc/101
{
  "firstname" : "Robert",
  "lastname" : "Smith",
  "city" : "Lancashire"
}

shard = hash(“101”) % 6 = 3, so Robert Smith will route to shard 3

201 response

[Diagram: a client app sends the request to training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState3]
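The routing formula can be tried out with a small sketch. It assumes zlib.crc32 as a stand-in hash, which is not the hash Elasticsearch uses, so the shard numbers it produces will not match a real cluster (or the “shard 3” result on the slide). The function name route_to_shard is hypothetical.

```python
import zlib


def route_to_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(_routing) % number_of_primary_shards, with a stand-in hash."""
    # Assumption: crc32 is used only for illustration; Elasticsearch uses a different hash.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


ids = [str(i) for i in range(1000)]

# Count how many of 1000 document ids land on each of 6 primary shards.
counts = [0] * 6
for doc_id in ids:
    counts[route_to_shard(doc_id, 6)] += 1
print("documents per shard:", counts)

# With a different shard count, many ids map to a different shard number.
moved = sum(route_to_shard(d, 6) != route_to_shard(d, 7) for d in ids)
print("ids that would route elsewhere with 7 shards:", moved)
```

The second measurement illustrates why the number of primary shards is fixed: changing the divisor would send existing ids to shards that do not hold them.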
Reindexing a Document
• If you index an existing document, it gets reindexed
‒ Behind the scenes, the existing document is deleted and then a
new one is created
PUT my_index/doc/101
{
  "firstname" : "Robert",
  "lastname" : "Smith",
  "city" : "London"
}

1. index
2. hash
3. reroute
4. delete
5. index
6. ok
7. 200 response

[Diagram: the numbered steps flow from the client app through training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState3]
Update a Partial Document
• If you just want to update one or more fields, you can use
the _update endpoint
‒ The entire document is still deleted and reindexed
POST my_index/doc/101/_update
{
  "doc" : {
    "city": "Chelmsford"
  }
}

The new or updated fields and values need to appear in the “doc” section

1. update
2. hash
3. reroute
4. get
5. delete
6. index
7. ok
8. 200 response

[Diagram: the numbered steps flow from the client app through training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState3]
Replication
Why Replication?

If a node fails, we could potentially lose data

[Diagram: training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState3]
Shard Replication
• Elasticsearch makes copies of your shards called replicas
‒ The original shard is called the primary
• You can also change the number of replicas for an index at
any time using the _settings endpoint:

PUT my_index/_settings
{
  "number_of_replicas": 1
}

[Diagram: a client app sends the request to training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState4]
Shard Replication
• The replicas are stored on different nodes in the cluster
‒ provides high availability in case a shard or node fails
‒ also improves search performance by scaling the processing

Both nodes will start replicating shards to the other node

[Diagram: training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState5]
High Availability
• If we lose princess_10, all of the data is still available on adam_01

[Diagram: training_cluster with only adam_01 remaining; clusterState5]
Primary Promotion
• After losing a primary, Elasticsearch will automatically
promote a replica into a primary

Each of these shards is promoted to a primary, and the cluster state is updated

[Diagram: training_cluster with only adam_01; clusterState6]
Node Re-join
• When princess_10 is back online:
‒ it joins the cluster again
‒ gets the new cluster state
‒ replicates shards
[Diagram: training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState7]
Let's Delete the Existing Indices
Several ways to delete multiple indices:

DELETE my_index,my_index2

DELETE my_index*

DELETE my_index
DELETE my_index2

[Diagram: training_cluster with nodes princess_10 and adam_01; both nodes hold clusterState8]
Let's Start Over
PUT my_index
{
"settings": {
"number_of_shards": 4,
"number_of_replicas": 1
}
}

[Diagram: training_cluster with nodes princess_10 and adam_01; each node holds shards 0, 1, 2 and 3 of my_index]
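As a quick check of the settings above: every primary shard gets number_of_replicas copies, so the total number of shard copies is the number of primaries times one plus the number of replicas. The helper name total_shard_copies below is made up for illustration.

```python
def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Each primary shard exists once, plus one copy per configured replica."""
    return number_of_shards * (1 + number_of_replicas)


# Settings from the PUT my_index request above: 4 primaries, 1 replica each.
copies = total_shard_copies(number_of_shards=4, number_of_replicas=1)
print(copies)  # 4 primaries + 4 replicas
```

With two nodes, that works out to one copy of each of the four shards per node.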
Let's add the same document again:

PUT my_index/doc/101
{
  "firstname" : "Robert",
  "lastname" : "Smith",
  "city" : "Lancashire"
}

1. PUT
2. index to P2
3. replicate to R2
4. replicate OK
5. index OK
6. 201 response

[Diagram: the numbered steps flow from the client app through training_cluster with nodes princess_10 and adam_01, each holding shards 0, 1, 2 and 3]
The Response of a PUT
• Notice the response of a PUT contains a _shards section
‒ “total” = 1 primary + # of replicas
‒ if “successful” < “total”, then a replica is likely not available

HTTP/1.1 201 Created
{
  "_index" : "my_index",
  "_type" : "doc",
  "_id" : "202",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

201 = created, 200 = updated
“total”: # of shard copies the doc should be written to
“successful”: # of shard copies the doc was successfully written to
“failed” is an array of errors if something went wrong
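A client can act on the _shards section with a few lines of code. The sketch below parses the response body shown above and compares “successful” with “total”; the function name shard_write_report and its messages are hypothetical, and no HTTP call is made.

```python
import json


def shard_write_report(response: dict) -> str:
    """Summarise the _shards section of an index (PUT/POST) response."""
    shards = response["_shards"]
    total, successful = shards["total"], shards["successful"]
    if successful == total:
        return "all copies written"
    if successful >= 1:
        # The primary was written, but at least one replica copy was not.
        return f"{total - successful} of {total} copies missing"
    return "write failed"


# Response body taken from the slide above.
put_response = json.loads("""
{
  "_index": "my_index", "_type": "doc", "_id": "202", "_version": 1,
  "result": "created",
  "_shards": {"total": 2, "successful": 1, "failed": 0},
  "created": true
}
""")

report = shard_write_report(put_response)
print(report)
```

How to react to a missing copy (retry, alert, ignore) is an application decision that the response itself does not dictate.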
The Get API
• Use the Get API to retrieve a specific document from the
index using its type and id
‒ returns a 404 error if the document is not found

GET my_index/doc/101

[Diagram: a client app sends the request to training_cluster with nodes princess_10 and adam_01, each holding shards 0, 1, 2 and 3]
Let's delete the document
• Use DELETE to delete a specific document
‒ 200 response if successful
‒ 404 response if the document is not found
• Delete marks the document as deleted

The document is not actually deleted from storage until a segment merge occurs

DELETE my_index/doc/101

[Diagram: a client app sends the request to training_cluster with nodes princess_10 and adam_01, each holding shards 0, 1, 2 and 3]
The DELETE request in detail:

1. delete
2. reroute
3. delete
4. delete replica
5. delete OK
6. deleted
7. response

[Diagram: the numbered steps flow from the client app through training_cluster with nodes princess_10 and adam_01, each holding shards 0, 1, 2 and 3]
Split Brain
Split Brain
• Even though the master node is the single source of truth
when it comes to the cluster state, problems can still
happen
‒ In particular, a big concern is if the cluster gets partitioned
‒ The cluster could inadvertently elect two masters, which is
referred to as a “split brain”

“Yay - I get to be the master!”
“I am the master!”

[Diagram: training_cluster with nodes gary_21, princess_10 and adam_01; two nodes each claim to be the master]
Two masters can lead to inconsistency:

Inconsistent cluster states make it basically impossible to recover from a split brain

[Diagram: a client app and training_cluster with nodes gary_21, princess_10 and adam_01; the nodes hold diverging cluster states (clusterState4 and clusterState10)]
Avoiding the Split Brain
• A master-eligible node needs at least
minimum_master_nodes votes to win an election
‒ Setting it to a quorum helps avoid the split brain scenario
• Our simple recommendation is for production systems to
always have 3 dedicated master-eligible nodes
‒ and set minimum_master_nodes to 2
gary_21 can not become a master if minimum_master_nodes is 2

“I vote for me, which makes a quorum.”
“I vote for adam_01.”
“I vote for me”

[Diagram: training_cluster with nodes gary_21, princess_10 and adam_01 casting their votes]
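The quorum rule can be sketched in a few lines. A quorum is a strict majority of the master-eligible nodes, so at most one side of a network partition can reach it. The function names below are hypothetical, and the partition sizes model the example above (gary_21 cut off from princess_10 and adam_01).

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return master_eligible_nodes // 2 + 1


def sides_that_can_elect(partition_sizes, minimum_master_nodes):
    """Return the partition sizes that hold enough master-eligible nodes to elect a master."""
    return [size for size in partition_sizes if size >= minimum_master_nodes]


minimum_master_nodes = quorum(3)

# gary_21 is isolated (1 node); princess_10 and adam_01 still see each other (2 nodes).
safe = sides_that_can_elect([1, 2], minimum_master_nodes)
# With minimum_master_nodes left at 1, both sides could elect a master: a split brain.
unsafe = sides_that_can_elect([1, 2], 1)
print(minimum_master_nodes, safe, unsafe)
```

Only the two-node side can elect a master when the setting is a quorum, which is the behaviour the recommendation relies on.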
Chapter Review
Summary
• Elasticsearch subdivides the data of your index into multiple
pieces called shards
• Each shard has one (and only one) primary and zero or
more replicas
• A document is routed to a shard using a formula that
incorporates a hash value of the document id
• You can not change the number of primary shards of an index after it has been created
• After losing a primary, Elasticsearch will automatically promote a replica into a primary
• A master-eligible node needs at least minimum_master_nodes votes to win an election
Quiz
1. How does Elasticsearch distribute your data within an index?
2. Suppose you have a cluster with 6 data nodes, you create a new
index with “PUT hello”, then insert 10,000 documents. How
many total shards will the hello index have?
3. True or False: When you add a new node to a cluster, you have to run the _refresh command to redistribute shards to the new node?
4. If number_of_shards for an index is 4, and number_of_replicas is 2, how many total shards will exist for this index?
5. If you have three master-eligible nodes in your cluster, what should you set minimum_master_nodes to?
Lab 6
The Distributed Model
Chapter 7
Working with Search Results
1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships
Topics covered:
• Segments
• The Anatomy of a Search
• DFS Query-then-fetch
• Sorting Results
• Doc Values and Fielddata
• Sorting by Strings
• Pagination
• Scroll Searches
• Choosing a Search Type
Segments
Inverted Index in Practice
• The inverted index concept introduced in the analysis
chapter is practically represented inside a package called a
Lucene segment
‒ a self-contained immutable inverted index
A shard contains one or more segments
Segments can contain many documents

[Diagram: my_index with shard0 and shard1, each containing segments that hold several numbered documents]
Segment is a Package
• A segment contains many documents
‒ and is a package of many data structures,
‒ including an inverted index for each field that is searchable

my_field_food_name
apple     7,9
banana    7
carrot    9
dill      7,9

my_field_color
green     7,9
orange    9
red       7,9
yellow    7

[Diagram: a segment containing documents 7 and 9]
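The two tables above can be rebuilt with a toy per-field inverted index. This is only a sketch of the idea, not Lucene's on-disk format. The field values of documents 7 and 9 are an assumption reverse-engineered from the postings shown on the slide, and build_segment is a hypothetical name.

```python
from collections import defaultdict

# Assumed contents of the two documents, inferred from the postings above.
docs = {
    7: {"my_field_food_name": ["apple", "banana", "dill"],
        "my_field_color": ["green", "red", "yellow"]},
    9: {"my_field_food_name": ["apple", "carrot", "dill"],
        "my_field_color": ["green", "orange", "red"]},
}


def build_segment(documents):
    """Build one inverted index per field: field -> term -> sorted list of doc ids."""
    segment = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(documents):
        for field, terms in documents[doc_id].items():
            for term in terms:
                segment[field][term].append(doc_id)
    return segment


segment = build_segment(docs)
for field, postings in segment.items():
    print(field)
    for term in sorted(postings):
        print(f"  {term}: {postings[term]}")
```

Looking up a term is then a dictionary access on the field's index, which returns the ids of the documents that contain it.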
Why Segments?
• Lucene does not build one giant inverted index
‒ that would be too difficult to maintain and store
• Similarly, Lucene does not build an inverted index for each
document after every indexing operation
‒ that would be inefficient
• Segments provide a nice solution between having one large inverted index and too many small ones
Segments are Refreshing
• The lightweight process of writing and opening a new
segment is called a refresh
‒ by default, every shard is refreshed once every second
‒ configured at the index level with the refresh_interval property
• Documents are not searchable until a refresh occurs
‒ you can force a refresh using the _refresh endpoint, although
this is not a common use case:

POST my_index/_refresh

Manually force a refresh, which makes newly-indexed documents searchable
The refresh Parameter
• The Document APIs that create, update or delete have an
optional refresh parameter:
‒ controls when changes made by this request are made visible to
searches
• Three possible values:
‒ false: the default value, any changes to the document are not visible immediately
‒ true: refresh the affected primary and replica shards so that the changes are visible immediately
‒ wait_for: synchronous request that waits for a refresh to make the changes visible before it responds
Example of the refresh Parameter

Wait for the indexing to complete before returning a response

PUT my_index/doc/102/?refresh=wait_for
{
  "firstname" : "James",
  "lastname" : "Brown",
  "address" : "6011 Downtown Lane",
  "city" : "Detroit"
}

• wait_for does not force a refresh, but rather it waits for a refresh to happen
• true forces a refresh
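The effect of a refresh on search visibility can be illustrated with a toy model. The ToyShard class below is entirely hypothetical: it keeps newly indexed documents in a buffer and only makes them searchable once a refresh writes them into a new immutable segment. It models refresh=false and refresh=true only; wait_for would require a background refresh timer and is left out.

```python
class ToyShard:
    """Toy model of a shard: an indexing buffer plus a list of immutable segments."""

    def __init__(self):
        self.buffer = []    # indexed, but not yet searchable
        self.segments = []  # each refresh adds one immutable segment (a tuple)

    def index(self, doc, refresh=False):
        self.buffer.append(doc)
        if refresh is True:
            self.refresh()

    def refresh(self):
        # Write the buffered documents into a new segment and open it for search.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, predicate):
        # Searches only ever look at segments, never at the buffer.
        return [doc for segment in self.segments for doc in segment if predicate(doc)]


shard = ToyShard()
shard.index({"firstname": "James", "city": "Detroit"})

def in_detroit(doc):
    return doc["city"] == "Detroit"

before = shard.search(in_detroit)  # nothing visible yet
shard.refresh()                    # comparable to POST my_index/_refresh
after = shard.search(in_detroit)
print(len(before), len(after))
```

Each refresh in this model produces a new segment, which hints at why forcing a refresh on every request is not a common use case.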
The Anatomy of a Search
Searches are performed on segments
• When you submit a query to Elasticsearch, Lucene runs the
search on every segment of the index
The number of segments affects the query performance

GET my_index/_search
{
  "query" : {
  }
}

[Diagram: a coordinating node forwards the search to node 1 and node 2; each node holds shards made up of segment 1 and segment 2]
The Anatomy of a Search
• When a client makes a request, the node handling that
request is referred to as the coordinating node
‒ Any good client will “round robin” search requests to different
coordinating nodes to avoid overloading a single node

[Diagram: a client sends requests to production_cluster, which contains node 1 through node 6]
The Phases of a Search
• The coordinating node is responsible for making sure a
request is routed to the proper nodes
‒ results from multiple shards must be combined into a single
sorted list
‒ then the coordinating node returns a “page” of results
• In a search request, the coordinating node also coordinates
the two phases of the search:

1. Query phase
2. Fetch phase
1. Query Phase
1. Determine the relevant shards
2. Choose shard copy (trying to load balance)
3. Forward the request to the nodes
4. Each selected shard executes the query
5. Each shard sorts the results and/or performs any aggregations
6. The Lucene ids are sent back with the sorting value

[Diagram: the coordinating node forwards the request to one copy of each shard, spread across node 1 through node 6. The copies shown are Primary 0, Replica 1, Primary 2, and Replica 3.]
2. The Fetch Phase
• The query phase identifies which documents satisfy the
search request
• The fetch phase retrieves the documents:

1. The coordinating node identifies which docs to retrieve and requests them from
the applicable shards
2. Each shard returns the docs to the coordinating node
3. Coordinating node sends results back to client

[Diagram: the coordinating node (node 4) requests documents from Primary 0, Replica 1, and Replica 3, then returns the results to the Client.]
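The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Elasticsearch is implemented: the `SHARDS` dictionary, the scores, and the function names are all made up for the example. The query phase returns only ids and sort values, and the fetch phase loads the documents for the winning hits.

```python
import heapq

# Hypothetical in-memory "shards": doc id -> (score, stored source).
SHARDS = {
    0: {"a": (1.2, {"title": "A"}), "b": (0.4, {"title": "B"})},
    1: {"c": (2.5, {"title": "C"}), "d": (0.9, {"title": "D"})},
}


def query_phase(shard_id, size):
    """Each shard returns only (score, shard_id, doc_id) for its local top hits."""
    docs = SHARDS[shard_id]
    local = [(score, shard_id, doc_id) for doc_id, (score, _) in docs.items()]
    return heapq.nlargest(size, local)


def fetch_phase(hits):
    """The coordinator asks only the owning shards for the winning documents."""
    return [
        {"_id": doc_id, "_score": score, "_source": SHARDS[shard_id][doc_id][1]}
        for score, shard_id, doc_id in hits
    ]


def search(size):
    candidates = []
    for shard_id in SHARDS:
        candidates.extend(query_phase(shard_id, size))
    # Merge the per-shard lists into a single globally sorted page.
    top = heapq.nlargest(size, candidates)
    return fetch_phase(top)


page = search(size=2)
print([hit["_id"] for hit in page])
```

Note that every shard has to return `size` candidates even though only `size` documents in total are fetched.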
The Search Response
• The _shards section of a search response provides how
many shards were searched
‒ as well as a count of the successful/failed searched shards
‒ if successful < total you likely received only partial results
‒ your business logic and SLAs typically determine how to handle
errors and/or incomplete results

‒ If “successful” is less than “total”, you likely have incomplete results

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
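As a rough sketch of the kind of check a client application might perform, the following Python function inspects the `_shards` section of a response that has already been parsed into a dictionary. The function name and the returned keys are assumptions for illustration; what to do with a partial result is up to your business logic.

```python
def shard_summary(response):
    """Report whether a parsed search response may contain partial results."""
    shards = response["_shards"]
    return {
        # Fewer successful shards than total means some data was not searched.
        "partial": shards["successful"] < shards["total"],
        "failed": shards["failed"],
        "timed_out": response.get("timed_out", False),
    }


complete = {"took": 11, "timed_out": False,
            "_shards": {"total": 3, "successful": 3, "failed": 0}}
incomplete = {"took": 40, "timed_out": False,
              "_shards": {"total": 3, "successful": 2, "failed": 1}}

print(shard_summary(complete))
print(shard_summary(incomplete))
```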
Performance Benefit of Filters
• A query in a filter context can be cached by Elasticsearch
to improve performance
‒ referred to as the node query cache (or query cache for short)
‒ uses a bit set at the segment level

[Diagram: a segment with 3 docs (1, 2, 3) and a bit set of size 3 for our
“comment” : “favorite” filter: 0 1 1. “favorite” appears in the
“comment” of docs 2 and 3.]
DFS Query-then-fetch
Relevance and Sharding
• How is sharding related to the _score and search phases?
‒ The _score is partially calculated at the shard level
‒ For smaller or skewed datasets, this can drastically affect the
scoring

[Diagram: the documents of my_index are spread across shards on node 1, node 2, and node 3. On one shard, a document might score unrealistically high because it is the only hit on the shard. On another shard, a better document might score lower because it is on a shard with a lot of hits.]
Query-then-fetch vs. DFS Query-then-fetch
• The default search type is query-then-fetch
‒ which uses local (per-shard) BM25 scores
• If the default scoring is not working for you, then you can try
executing a DFS query-then-fetch
‒ which computes a global BM25 score
‒ more expensive, because it involves running a pre-query
• Tends to fix the issue of skewed datasets
‒ For example, time-based indices may create skewed results. On
the day a new iPhone is released, tweets containing #iphone6
will be skewed for that particular day
‒ Smaller datasets are also more prone to being skewed
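To see why shard-local statistics can skew scores, the sketch below compares the inverse document frequency (one ingredient of BM25) computed per shard with the value computed from combined statistics. The shard sizes and term counts are made-up numbers, and the formula shown is the commonly documented BM25 IDF, ln(1 + (N - n + 0.5) / (n + 0.5)); treat it as an illustration rather than the exact scoring code.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """IDF component of BM25 for a term found in doc_freq of doc_count docs."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics: the term is rare on one shard, common on the other.
shards = [
    {"docs": 1000, "docs_with_term": 1},
    {"docs": 1000, "docs_with_term": 400},
]

# query_then_fetch: every shard scores with its own statistics.
local_idf = [bm25_idf(s["docs"], s["docs_with_term"]) for s in shards]

# dfs_query_then_fetch: statistics are gathered from all shards first.
global_idf = bm25_idf(
    sum(s["docs"] for s in shards),
    sum(s["docs_with_term"] for s in shards),
)

print(local_idf, global_idf)
```

With local statistics the same term is weighted very differently on the two shards, while the global value gives every shard the same weight for it.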
Performing a DFS Query-then-fetch
• Set the search_type parameter to dfs_query_then_fetch
‒ Defaults to query_then_fetch

query_then_fetch:

GET my_index/_search
{
  "query" : {
    "match": {
      "comment": "star"
    }
  }
}

dfs_query_then_fetch:

GET my_index/_search?search_type=dfs_query_then_fetch
{
  "query" : {
    "match": {
      "comment": "star"
    }
  }
}
Sorting Results
Sorting
• Sorting the results of a query can be done three ways:
‒ Sort by a field
‒ Sort by _score
‒ Sort by _doc
• The syntax looks like:

{
  "query": {
    ...
  },
  "sort": [
    {
      "FIELD": {
        "order": "desc"
      }
    }
  ]
}

‒ "sort" is the “sort” clause
‒ its value is the array of field names to sort by
‒ "order" is “desc” or “asc”
A sort Example
• The following query returns products with “chocolate” in
the “ingredients” sorted by “total_fat” descending:

Which items with chocolate have the most fat?

GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients": "chocolate"
    }
  },
  "sort": [
    {
      "details.total_fat": {
        "order": "desc"
      }
    }
  ]
}
A sort Example
• Notice the results are not scored in our query because the
_score is not being used in sorting
‒ The item with the most total_fat is the first hit
‒ Each hit in the response contains a “sort” section
containing the values used for sorting

"hits": {
  "total": 15,
  "max_score": null,
  "hits": [
    {
      "_index": "nutrition",
      "_type": "doc",
      "_id": "AVv-kZCazWyyiGorJZ5Y",
      "_score": null,
      "_source": {
        ...
        "brand_name": "ProBar",
        "item_name": "Bar, Koka Moka",
        "details": {
          "total_fat": 18,
          ...
        },
        "item_description": "Koka Moka"
      },
      "sort": [
        18
      ]
    },
Sort on Multiple Fields
• Sort by “servings_per_container”, then by “_score”

GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients": "chocolate"
    }
  },
  "sort": [
    {
      "servings.servings_per_container": {
        "order": "desc"
      }
    },
    {
      "_score" : {
        "order" : "desc"
      }
    }
  ]
}

• Notice both values used for sorting are provided in the results

"_score": 2.6099455,
"_source": {
  "servings": {
    "serving_size_unit": "cookies",
    "servings_per_container": 13.5,
    "serving_size_qty": 3
  },
  "brand_name": "Pepperidge Farm",
  "item_name": "Cookie Collection, Holiday Homestyle",
  ...
},
"sort": [
  13.5,
  2.6099455
]
Sort by _doc
• Sorting by _doc provides the best performance and least
overhead
‒ for situations where you do not care about the sort order

GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients": "chocolate"
    }
  },
  "sort": [
    "_doc"
  ]
}

‒ Documents are sorted in the order that Lucene has them indexed
Sorting by Text
• Look at the following query. It seems very simple. What will
the results look like?
‒ I am searching for yogurt and sorting the results by “brand_name”

GET nutrition/_search
{
  "query": {
    "match" : {
      "item_name" : "yogurt"
    }
  },
  "sort" : [
    {
      "brand_name" : {
        "order" : "asc"
      }
    }
  ]
}
Why can’t we sort by text?
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set
fielddata=true on [brand_name] in order to load fielddata in memory
by uninverting the inverted index. Note that this can however use
significant memory."

• You can sort by text, but you need to understand the cost
‒ The issue is that text data types are analyzed
‒ The original value of the text has been changed for indexing

We want to sort by this data:
“Chobani”, “My Favorite”, “Del Monte”, “Essential Everyday”

But this is what our data looks like (after _analyze):
“chobani”, “del”, “essential”, “everyday”, “favorite”, “monte”
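The effect of analysis on sortable values can be imitated in Python. The `analyze` function below is only a crude stand-in (lowercase, then split on non-alphanumeric characters) and is an assumption for illustration, not the standard analyzer itself.

```python
import re


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, then split into terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


brands = ["Chobani", "My Favorite", "Del Monte", "Essential Everyday"]

# The index only knows the individual terms, not the original strings.
terms = sorted({term for brand in brands for term in analyze(brand)})
print(terms)
```

The original value "Del Monte" no longer exists as a single entry, so there is nothing meaningful to sort a multi-word brand name by.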
Doc Values and Fielddata
Introducing Doc Values and Fielddata
• An inverted index is not ideal for sorting on field values
‒ sorting (and aggregations) require us to uninvert the inverted
index
• Two options in Elasticsearch for building this uninverted
index:
‒ doc values
‒ fielddata

• doc values only work on exact value fields (so NOT on
analyzed fields)
• We need to enable fielddata on a tokenized field if we want
to sort by it
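A toy example may help show what "uninverting" means. The inverted index below maps terms to document ids; sorting needs the opposite direction, from document id to value. The data and function name are invented for illustration, and the sketch assumes one value per document.

```python
# Hypothetical inverted index for a brand field: term -> ids of matching docs.
inverted = {"chobani": [0], "dannon": [1, 3], "yoplait": [2]}


def uninvert(inverted_index, doc_count):
    """Build a doc id -> value column, which is what sorting needs."""
    column = [None] * doc_count
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            column[doc_id] = term
    return column


column = uninvert(inverted, doc_count=4)

# With the column available, sorting doc ids by value is a simple lookup.
order = sorted(range(len(column)), key=lambda doc_id: column[doc_id])
print(column, order)
```

Doc values amount to building this column at index time and storing it on disk, while fielddata builds it in memory when a search first needs it.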
Doc Values vs. Fielddata
• Which option is better? Well, it depends…

          Fielddata                       Doc Values
When      Created on the fly when         Created at index time
          needed in a _search
Where     JVM heap                        Files on disk
Pros      Faster indexing and avoids      Avoids using too much
          disk usage                      memory
Cons      Large heap usage                Extra disk usage,
                                          slightly slower indexing
Default   Elasticsearch 1.x and prior     Elasticsearch 2.x and later
Why change the defaults?
• Save some disk space and slightly reduce index time:
‒ disable doc_values in fields you will not sort or aggregate on
‒ set doc_values: false in the mappings of the field(s)
‒ Be careful: if later on you want to sort or aggregate on that field
you will need to update the mapping and re-index your
documents
• You really need to sort or aggregate on a text field:

‒ enable fielddata on that field (e.g. for a significant terms aggregation)
‒ set fielddata: true in the mappings of the field(s)
• Be careful: fielddata can consume a lot of memory!
‒ fielddata is protected by circuit breakers: if too much memory is
used, Elasticsearch will fail the request
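As a sketch of what such mappings might look like, the Python snippet below builds an index-creation body as a dictionary and prints it as JSON. The field names `internal_code` and `comment` are hypothetical, and `doc` is simply the mapping type name used in this course's examples.

```python
import json

# Hypothetical fields: one keyword field we never sort or aggregate on,
# and one text field we really do need to aggregate on.
index_body = {
    "mappings": {
        "doc": {
            "properties": {
                "internal_code": {"type": "keyword", "doc_values": False},
                "comment": {"type": "text", "fielddata": True},
            }
        }
    }
}

request_body = json.dumps(index_body, indent=2)
print(request_body)
```

The printed JSON could be used as the body of a PUT request that creates a new index.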
Sorting by Strings
So how do we sort by strings?
• One option is to use the “keyword” type:
‒ in our case, “brand_name” has a “keyword” multi-field

GET nutrition/_search
{
  "query": {
    "match" : {
      "item_name" : "yogurt"
    }
  },
  "sort" : [
    {
      "brand_name.keyword" : {
        "order" : "asc"
      }
    }
  ]
}

Results, in sort order:
“Chobani”
“Chobani Greek Yogurt”
“Crowley”
“Dannon”
…and so on

‒ Sorting is allowed here because the “keyword” multi-field of
“brand_name” has doc values
‒ “brand_name.keyword” is an OK solution, but there is one issue…
Case-sensitivity in String Sorting
• Using a keyword field avoided the fielddata issue
‒ but keyword fields are not analyzed, so upper and lowercase
strings affect our sorting:

GET nutrition/_search
{
  "query": {
    "match" : {
      "item_name" : "yogurt"
    }
  },
  "sort" : [
    {
      "brand_name.keyword" : {
        "order" : "asc"
      }
    }
  ]
}

Results, in sort order:
“Chobani”
“Chobani Greek Yogurt”
“Crowley”
“Dannon”
…
“Welch’s”
“Yoplait”
“del Monte”

‒ Unfortunately, “del Monte” appears last in our sort results
Normalizers
• Elasticsearch 5.2 introduced normalizers
‒ similar to analyzers, but they only emit one token (no tokenizing)
‒ only accept a subset of the available char filters and token filters
• Normalizers provide a nice solution for lowercasing
keyword values for sorting…

Defining a Custom Normalizer
• To use a normalizer, you have to define a custom one:

Let’s create a new index with a simple normalizer that only lowercases:

PUT nutrition2
{
  "settings": {
    "analysis": {
      "normalizer" : {
        "my_normalizer" : {
          "type" : "custom",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  ...
Assigning a Normalizer
• Let’s assign brand_name.keyword to use our simple
lowercasing normalizer:

  ...
  "mappings": {
    "doc" : {
      "properties": {
        "brand_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        },
        ...

‒ Assign our “normalizer” to the “keyword” multi-field
Sort by Strings Again…
• Sorting by brand_name.keyword on our new index results
in the expected sort order this time:

GET nutrition2/_search
{
  "size": 20,
  "query": {
    "match" : {
      "item_name" : "yogurt"
    }
  },
  "sort" : [
    {
      "brand_name.keyword" : {
        "order" : "asc"
      }
    }
  ]
}

Results, in sort order:
“Chobani”
“Chobani Greek Yogurt”
“Crowley”
“Dannon”
“del Monte”
“Welch’s”
“Yoplait”
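The difference between the two sort orders can be reproduced with plain Python strings. This is only an analogy: Python's default string comparison is case-sensitive in a similar way to an un-normalized keyword field, and `str.lower` plays the role of the lowercase filter.

```python
brands = [
    "Yoplait", "del Monte", "Chobani", "Welch's",
    "Dannon", "Chobani Greek Yogurt", "Crowley",
]

# Case-sensitive comparison: every uppercase letter sorts before lowercase.
case_sensitive = sorted(brands)

# Lowercasing the sort key, as the lowercase normalizer does for the field.
normalized = sorted(brands, key=str.lower)

print(case_sensitive)
print(normalized)
```

In the first list "del Monte" comes last; in the second it sits between "Dannon" and "Welch's".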
Pagination
Pagination
• A very common use case of searching is pagination
‒ dividing search results into “pages”, or subsets
• A search request uses the “from” and “size” parameters to
implement pagination:

GET nutrition/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "ingredients" : "onion"
    }
  },
  "sort" : [
    {
      "details.calories" : {
        "order" : "desc"
      }
    }
  ]
}

‒ Returns the first 10 hits from the given query
Retrieving Subsequent Pages
• You have some work to do in your client application to
determine which page to request
‒ But that is common in any paging use case
‒ Some clients will have an API to help simplify page requests
GET nutrition/_search
{
  "from": 10,
  "size": 10,
  "query": {
    "match": {
      "ingredients" : "onion"
    }
  },
  "sort" : [
    {
      "details.calories" : {
        "order" : "desc"
      }
    }
  ]
}

‒ Returns the next 10 hits in the result set
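The client-side work is mostly arithmetic. A minimal sketch, assuming 1-based page numbers and a hypothetical helper name `page_params`:

```python
def page_params(page, page_size=10):
    """Translate a 1-based page number into "from" and "size" values."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"from": (page - 1) * page_size, "size": page_size}


print(page_params(1))
print(page_params(2))
```

The returned dictionary can be merged into the search request body alongside the query and sort clauses.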
Be Careful with Deep Pagination
• Suppose you want hits from 990 to 1,000. That requires
1,000 documents from each shard to be considered!
‒ Deep pagination should be avoided
‒ If our index has 6 shards and we want 990-1000, then 6,000
documents will need to be considered
‒ from + size cannot be more than index.max_result_window,
which defaults to 10,000

[Diagram: each of the 6 shards on node 1, node 2, and node 3 sends 1000 docs to the coordinating node.]
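The cost described above is easy to estimate. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name is hypothetical, and the 10,000 limit is the default value of index.max_result_window mentioned on the slide.

```python
MAX_RESULT_WINDOW = 10_000  # default of index.max_result_window


def docs_considered(from_, size, shard_count):
    """Each shard must produce its own top (from + size) candidates."""
    if from_ + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds index.max_result_window")
    return (from_ + size) * shard_count


print(docs_considered(from_=990, size=10, shard_count=6))
```

The number grows with both the page depth and the number of shards, which is why deep pages get progressively more expensive.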
The search_after Parameter
• If you have a lot of pages, a better option for retrieving
subsequent pages is to use the search_after parameter:
‒ You specify the last sort values from the previous result set
‒ Run the initial search…and keep track of the “sort” block from the
last document returned
‒ Add _uid to avoid any “ties” in sorting

GET nutrition/_search
{
  "size": 10,
  "query": {
    "match": {
      "ingredients" : "onion"
    }
  },
  "sort" : [
    {"details.calories" : {"order" : "desc"}},
    {"_score": {"order": "desc"}},
    {"_uid" : {"order" : "asc"}}
  ]
}
The search_after Parameter
• Set the search_after parameter equal to the last sort
values from the previous result set:

GET nutrition/_search
{
  "size": 10,
  "query": {
    "match": {
      "ingredients" : "onion"
    }
  },
  "search_after" : [
    270,
    1.0284368,
    "doc#AVwBRWkdzWyyiGorJZ_j"
  ],
  "sort" : [
    {"details.calories" : {"order" : "desc"}},
    {"_score": {"order": "desc"}},
    {"_uid" : {"order" : "asc"}}
  ]
}

‒ The previous hit was 270 calories and _score of 1.0284368
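The mechanics of search_after can be imitated over an in-memory list. Everything in this sketch is hypothetical (the documents, the two-part sort, the `search` function); it only shows how the last hit's sort values let the next request skip everything already seen, and why a unique tiebreaker matters when two documents share the same calories value.

```python
# Hypothetical documents; two of them tie on calories.
docs = [
    {"_uid": "doc#1", "calories": 300},
    {"_uid": "doc#2", "calories": 270},
    {"_uid": "doc#3", "calories": 270},
    {"_uid": "doc#4", "calories": 100},
]


def sort_key(doc):
    # calories descending, then _uid ascending as the tiebreaker
    return (-doc["calories"], doc["_uid"])


def search(size, search_after=None):
    ordered = sorted(docs, key=sort_key)
    if search_after is not None:
        after = (-search_after[0], search_after[1])
        # Keep only documents that sort strictly after the previous last hit.
        ordered = [doc for doc in ordered if sort_key(doc) > after]
    return [
        {"_uid": doc["_uid"], "sort": [doc["calories"], doc["_uid"]]}
        for doc in ordered[:size]
    ]


page1 = search(size=2)
page2 = search(size=2, search_after=page1[-1]["sort"])
print(page1)
print(page2)
```

Without the `_uid` tiebreaker, "doc#3" would be indistinguishable from the last hit of the first page and could be skipped or repeated.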
Scroll Searches
Scroll Searches
• What if you want more than a “page” of results?
‒ Large result sets can be expensive to both the cluster and the
client
• The scroll API allows you to take a snapshot of a large
number of results from a single search request
‒ Notice the word “snapshot”
‒ When you perform a scroll search, any changes made to
documents after the scroll is initiated will not show up in your
results
• Scroll searches are a 4-step process:
1. Initiate the scroll
• To initiate a scroll search, add the scroll parameter to your
search query
‒ Specifying how long Elasticsearch should keep the search
context alive

GET nutrition/_search?scroll=30s
{
  "size": 100,
  "query": {
    "match_all": {}
  },
  "sort": [
    "_doc"
  ]
}

‒ scroll=30s initiates a scroll for this query: if we are idle for more
than 30 seconds, then the scroll is deleted
‒ "size" is the maximum number of hits to return in each batch
‒ Sorting by "_doc" because this example is not concerned with the order
2. Get the current _scroll_id
• The response will contain the first 100 hits and a _scroll_id
‒ Hang on to it! You need it for the next request

{
  "_scroll_id":
"DnF1ZXJ5VGhlbkZldGNoCgAAAAAAAAflFnZVWmtsa2VKUUdTVnJCRkFiZ3RjZ2
cAAAAAAAAH5xZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-
YWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfoFnZVWmtsa2VKUUdTVnJCR
kFiZ3RjZ2cAAAAAAAAH6RZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB--0
WdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfuFnZVWmtsa2VKUUdTVnJCRk
FiZ3RjZ2c=",
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  ...
3. Retrieve more documents
• When you are ready for more documents, you just need to
send a GET request, along with the prior _scroll_id
‒ Do not specify an index, type or query here - just invoke scroll
‒ Always use the prior _scroll_id
‒ If the scroll is still alive, the next set of hits is returned

GET _search/scroll
{
  "scroll" : "30s",
  "scroll_id" :
"DnF1ZXJ5VGhlbkZldGNoCgAAAAAAAAflFnZVWmtsa2VKUUdTVnJCRkFiZ3RjZ2cAA
AAAAAAH5xZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-
YWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfoFnZVWmtsa2VKUUdTVnJCRkFi
Z3RjZ2cAAAAAAAAH6RZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-
oWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfrFnZVWmtsa2VKUUdTVnJCRkFi
Z3RjZ2cAAAAAAAAH7BZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-0WdlVaa2
xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfuFnZVWmtsa2VKUUdTVnJCRkFiZ3RjZ2c=
"
}
4. Clearing a scroll
• If you are done using a scroll and want to manually close it,
use a DELETE request:

DELETE _search/scroll/DnF1ZXJ5VGhlbkZldGNoCgAAAAAAAAflFnZVWm
tsa2VKUUdTVnJCRkFiZ3RjZ2cAAAAAAAAH5xZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnA
AAAAAAABYWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfoFnZVWmtsa2VKUUdTV
nJCRkFiZ3RjZ2cAAAAAAAAH6RZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-
oWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfrFnZVWmtsa2VKUUdTVnJCRkFiZ
3RjZ2cAAAAAAAAH7BZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-0WdlVaa2xr
ZUpRR1NWckJGQWJndGNnZwAAAAAAAAfuFnZVWmtsa2VKUUdTVnJCRkFiZ3RjZ2c=

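The four steps form a simple loop in client code. The sketch below uses a hypothetical in-memory `FakeScrollClient` so that it runs without a cluster; its method names are invented for the example and are not a real client API. The loop itself (initiate, keep the latest `_scroll_id`, request more until a batch is empty, then clear) mirrors the steps above.

```python
import itertools


class FakeScrollClient:
    """Hypothetical in-memory stand-in for the scroll endpoints."""

    def __init__(self, docs):
        self._docs = list(docs)
        self._contexts = {}
        self._ids = itertools.count(1)

    def search(self, size):
        # Step 1: initiate the scroll; the context holds a snapshot of the data.
        scroll_id = f"scroll-{next(self._ids)}"
        self._contexts[scroll_id] = {"docs": list(self._docs), "pos": 0, "size": size}
        return self.scroll(scroll_id)

    def scroll(self, scroll_id):
        # Steps 2 and 3: return the next batch together with the scroll id.
        ctx = self._contexts[scroll_id]
        hits = ctx["docs"][ctx["pos"]:ctx["pos"] + ctx["size"]]
        ctx["pos"] += len(hits)
        return {"_scroll_id": scroll_id, "hits": hits}

    def clear_scroll(self, scroll_id):
        # Step 4: release the search context.
        self._contexts.pop(scroll_id, None)


def export_all(client, size):
    response = client.search(size)
    exported = []
    while response["hits"]:
        exported.extend(response["hits"])
        response = client.scroll(response["_scroll_id"])
    client.clear_scroll(response["_scroll_id"])
    return exported


client = FakeScrollClient(range(250))
exported = export_all(client, size=100)
print(len(exported))
```

Clearing the scroll explicitly frees the context right away instead of waiting for the keep-alive time to expire.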
Choosing a Search Type
Choosing a Search Type
• We have seen several different types of searches:
‒ regular searches: the typical “query” searches that we have
been running throughout the course
‒ scroll searches: a snapshot of a (possibly) large result set of
documents
‒ pagination searches: a small subset of a larger result set of
documents

Real-world Use Cases
• Regular search:
‒ Suppose we are pulling back search results from an auction
website like eBay and sorting by the date the auction is ending.
We want auctions that are ending the soonest at the top. We
would need to run a new query each time.
• Scroll search:
‒ Imagine you are implementing an “export your data” feature on a
site like Facebook. A snapshot of the data works best for a large
number of hits that we want to return in small batches.
• Pagination search:
‒ If you're implementing classic pagination on a website, then from
and size are likely your “go to” option
‒ Use search_after if you need deep pagination
Chapter Review
Summary
• The inverted index concept introduced in the analysis chapter is
practically represented inside a package called a Lucene
segment
• Search queries typically have two phases: query and fetch
• Relevance refers to the scoring of a document based on how
closely it matches the query
• Use the sort clause for sorting by a field, _score or _doc

• Doc values store your original fields’ values on disk
• Normalizers allow you to apply certain character filters and
token filters to keyword data types
• Use “from” and “size” parameters to implement pagination
• The scroll API allows you to take a snapshot of a large number
of results from a single search request
Quiz
1. Give two use cases for enabling dfs_query_then_fetch on a query.
2. True or False: You can sort the results of a query by a field of type
“text”.
3. True or False: You can sort the results of a query by a field of type
“keyword”.
4. How are hits sorted by default?
5. True or False: Normalizers can only be applied to fields of type
“keyword”.
6. What two parameters are used to implement paging of search
results?
7. What is the difference between a scroll search and a pagination
search?
8. True or False: You have to invoke _refresh occasionally while
indexing a lot of documents to refresh the in-memory buffer.
Lab 7
Working with Search Results
Chapter 8
Aggregations

1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships
Topics covered:
• What are Aggregations?
• Types of Aggregations
• Buckets and Metrics
• Common Metrics Aggregations
• The range Aggregation

• The date_range Aggregation
• The terms Aggregation
• Nesting Buckets
What are Aggregations?
What is an Aggregation?
• We have been focusing on search, but Elasticsearch has
another powerful capability known as aggregations
• Aggregations are a way to perform analytics on your
indexed data

What movies have “star” in the title?
‒ That is a search query…

What is the average ticket price of a movie in the U.S?
‒ That is an aggregation…
You ask different questions with aggs:
• In search, we have been asking for a subset of the
original dataset:
‒ Results are hits that match our search parameters
‒ “What are the ERRORs in our log file?”
• In aggregations, we analyze the data by asking questions about
the dataset:
‒ Results are (typically) computed values
‒ “How many 404 errors occurred each day last month?”
‒ “What is the average amount of time visitors spend on
our home page?”
‒ “What is the most visited page on our website?”
booking.com
[Screenshot: a booking.com page with callouts marking the search criteria and the aggregations.]
Kibana Visualizations
[Screenshot: Kibana visualizations.]
Aggregation Syntax
• An aggregation request is a part of the Search API
‒ With or without a “query” clause

GET my_index/_search
{
  "aggs": {
    "my_aggregation": {
      "AGG_TYPE": {
        ...
      }
    }
  }
}

‒ The “aggs” clause can be spelled out “aggregations”
‒ The name you choose (“my_aggregation”) comes back in the results
‒ There are lots of different aggregation types (“AGG_TYPE”)
Example of an Aggregation

What is the average number of calories of items with “olive oil” in the
ingredients?

GET nutrition/_search
{
  "query" : {
    "match_phrase": {
      "ingredients": "olive oil"
    }
  },
  "aggs": {
    "my_avg_calories": {
      "avg": {
        "field": "details.calories"
      }
    }
  }
}

‒ The “query” defines the scope of the aggregation
‒ The “aggs” section returns an average
Viewing the Results
• The results will have an “aggregations” section with the
results of all the “aggs” in your search

"hits": {
  "total": 8,
  "max_score": 6.0192027,
  "hits": [
    ...
  ]
},
"aggregations": {
  "my_avg_calories": {
    "value": 196.25
  }
}

‒ "my_avg_calories" is the name you gave your aggregation
‒ "value" is the computation
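Conceptually, the query narrows the set of documents and the aggregation computes a value over what is left. The Python sketch below imitates that with a handful of made-up documents; the substring test is a crude stand-in for match_phrase, and the function name is hypothetical.

```python
# Hypothetical documents, loosely modeled on the nutrition index.
docs = [
    {"ingredients": "olive oil, salt", "calories": 120},
    {"ingredients": "sugar, cocoa butter", "calories": 500},
    {"ingredients": "extra virgin olive oil", "calories": 240},
]


def avg_calories(documents, phrase):
    # The "query" part: keep only the documents that contain the phrase.
    matching = [doc for doc in documents if phrase in doc["ingredients"]]
    # The "aggs" part: compute a metric over the matching documents.
    values = [doc["calories"] for doc in matching]
    average = sum(values) / len(values) if values else None
    return {
        "hits": {"total": len(matching)},
        "aggregations": {"my_avg_calories": {"value": average}},
    }


result = avg_calories(docs, "olive oil")
print(result)
```

Only the two documents that mention olive oil contribute to the average.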
If you just want the aggregation value:
• If you are not interested in the hits and only want the values
of the aggregations, then set “size” to 0
‒ This will speed up the query since the “fetch” phase can be
skipped

GET nutrition/_search
{
  "size": 0,
  "query" : {
    "match_phrase": {
      "ingredients": "olive oil"
    }
  },
  "aggs": {
    "my_avg_calories": {
      "avg": {
        "field": "details.calories"
      }
    }
  }
}

The response:

{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_avg_calories": {
      "value": 196.25
    }
  }
}
Types of Aggregations
Four Types of Aggregations
• Bucket: aggregations that combine documents that meet a
given criterion into buckets
‒ Bucket aggregations build buckets
• Metric: mathematical calculations performed on the fields
of documents
‒ Typically, metrics are calculated on buckets of data
• Matrix: aggregation that operates on multiple fields and
produces a matrix result
‒ matrix_stats is currently the only matrix aggregation
• Pipeline: aggregate the output of other aggregations
(instead of from documents)
Se
ba
st
ie
n
G
er
ar
d
-0
9-
Se
p-
20
17
-B
us
in
es
s
&
D
ec
is
io
n
Buckets and Metrics
Buckets
• A bucket is simply a collection of documents that meet a
criterion
‒ Buckets are a key element of aggregations
• For example, suppose we want to analyze products by their
price range. We could put them into specific buckets:
[Figure: “All of your products” (documents from the “product” index) sorted into three buckets: price <= $5, price > $5 && price <= $10, and price > $10]
Buckets can be Nested
• Suppose we want to analyze products under $5 and also by
their customer rating
[Figure: the “price <= $5” bucket contains nested buckets rating=1 through rating=5; the rating=2 bucket holds all products under $5 with a rating of 2]
‒ This bucket can consist of nested buckets within it
Metrics
• Metrics compute numeric values based on your dataset
‒ Either from fields
‒ Or from values generated by custom scripts
• Most metrics are mathematical operations that output a
single value:
‒ avg, sum, min, max, cardinality
• Some metrics output multiple values:
‒ stats, percentiles, percentile_ranks
Metrics + Buckets = Aggregations
• An aggregation can be a combination of buckets and
metrics
‒ This is why aggregations are so powerful - they can be combined
in any number of combinations
‒ For example, you could sum all the values of the “quantitySold”
field, which would be one metric on one bucket
‒ You could also compute the average “quantitySold” by price
range, which would be one metric on multiple buckets:

“What is the average quantity of items sold for products under $5,
from $5-$10, and over $10?”
‒ That will require buckets and metrics

[Figure: product documents grouped into the three price buckets]
Common Metrics Aggregations
The avg Metric
• The avg metric aggregation computes the average of a
specified numeric field

“What is the average number of calories of products that have sugar in them?”

GET nutrition/_search
{
  "size" : 0,
  "query" : {
    "match": {
      "ingredients": "sugar"
    }
  },
  "aggs" : {
    "my_sugar_calorie_average" : {
      "avg": {
        "field": "details.calories"
      }
    }
  }
}

"aggregations": {
  "my_sugar_calorie_average": {
    "value": 155.44347826086957
  }
}
min and max
• These two metrics are fairly straightforward:

“What are the maximum and minimum number of calories of all products?”

GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_max_calorie_item" : {
      "max": {
        "field": "details.calories"
      }
    },
    "my_min_calorie_item" : {
      "min": {
        "field": "details.calories"
      }
    }
  }
}

"aggregations": {
  "my_min_calorie_item": {
    "value": 0
  },
  "my_max_calorie_item": {
    "value": 630
  }
}

‒ Notice you can have multiple aggregations in a single “aggs” clause
The stats Aggregation
• The stats metric aggregation gives you common statistics
of a field in a single aggregation request:
‒ count, min, max, avg and sum

“What can you tell me about the “calories” field?”

GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_calorie_stats" : {
      "stats": {
        "field": "details.calories"
      }
    }
  }
}

"aggregations": {
  "my_calorie_stats": {
    "count": 499,
    "min": 0,
    "max": 630,
    "avg": 136.7875751503006,
    "sum": 68257
  }
}
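The five values returned by stats can be reproduced by hand, which makes it clear that "avg" is simply "sum" divided by "count" (on the slide, 68257 / 499). The following is a minimal Python sketch of that computation, not Elasticsearch code; the function name and the sample calorie values are hypothetical.

```python
def stats(values):
    """Return count, min, max, avg and sum for a list of numbers."""
    count = len(values)
    total = sum(values)
    return {
        "count": count,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "avg": total / count if count else None,  # avg = sum / count
        "sum": total,
    }


# Hypothetical calorie values, not taken from the nutrition index.
calories = [0, 120, 200, 630]
result = stats(calories)
print(result)
```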
The cardinality Aggregation
• The cardinality metric aggregation is an approximation of
the number of distinct values of a field
‒ based on the HyperLogLog++ algorithm
‒ accurate up until precision_threshold (max of 40,000)

“About how many unique brand names are there?”

GET nutrition/_search
{
  "size" : 0,
  "aggs": {
    "my_brand_name_cardinality": {
      "cardinality": {
        "field": "brand_name.keyword",
        "precision_threshold": 100
      }
    }
  }
}

"aggregations": {
  "my_brand_name_cardinality": {
    "value": 430
  }
}
Common Bucket Aggregations
Common Bucket Aggregations
• Let’s look at two useful bucket aggregations:
‒ range
‒ terms
The range Aggregation
The range Aggregation
• The range bucket aggregation puts documents into buckets
based on ranges of values on a specified field
GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_calorie_range_buckets" : {
      "range": {
        "field": "details.calories",
        "ranges": [
          {
            "to": 100
          },
          {
            "from": 100,
            "to": 200
          },
          {
            "from" : 200
          }
        ]
      }
    }
  }
}

"aggregations": {
  "my_calorie_range_buckets": {
    "buckets": [
      {
        "key": "*-100.0",
        "to": 100,
        "doc_count": 178
      },
      {
        "key": "100.0-200.0",
        "from": 100,
        "to": 200,
        "doc_count": 207
      },
      {
        "key": "200.0-*",
        "from": 200,
        "doc_count": 114
      }
    ]
  }
}
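A range bucket includes its "from" value and excludes its "to" value, so a document with exactly 100 calories lands in the second bucket, not the first. The sketch below imitates that rule in plain Python; the function name and the sample values are hypothetical, and it only counts documents rather than reproducing the full response.

```python
def range_buckets(values, ranges):
    """Count values per range; "from" is inclusive and "to" is exclusive."""
    counts = []
    for r in ranges:
        lo = r.get("from", float("-inf"))  # open-ended lower bound
        hi = r.get("to", float("inf"))     # open-ended upper bound
        counts.append(sum(1 for v in values if lo <= v < hi))
    return counts


ranges = [{"to": 100}, {"from": 100, "to": 200}, {"from": 200}]
calories = [0, 99, 100, 150, 200, 630]  # hypothetical values
bucket_counts = range_buckets(calories, ranges)
print(bucket_counts)
```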
Labeling Ranges
• Add a “key” field to label a range using any string you like
‒ Your “key” field gets returned in the results

"aggs" : {
  "my_calorie_range_buckets" : {
    "range": {
      "field": "details.calories",
      "ranges": [
        {
          "key" : "low_calorie",
          "to": 100
        },
        {
          "key" : "medium_calorie",
          "from": 100,
          "to": 200
        },
        {
          "key" : "high_calorie",
          "from" : 200
        }
      ]
    }
  }
}

"aggregations": {
  "my_calorie_range_buckets": {
    "buckets": [
      {
        "key": "low_calorie",
        "to": 100,
        "doc_count": 178
      },
      {
        "key": "medium_calorie",
        "from": 100,
        "to": 200,
        "doc_count": 207
      },
      {
        "key": "high_calorie",
        "from": 200,
        "doc_count": 114
      }
    ]
  }
}
Combining Buckets and Metrics
• The previous range example created three buckets
‒ Let’s perform a metrics aggregation on each bucket:
‒ Add a nested aggregation to “my_calorie_range_buckets”
‒ The “avg” will be computed for each bucket

GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_calorie_range_buckets" : {
      "range": {
        "field": "details.calories",
        "ranges": [
          {"to": 100},
          {"from": 100,"to": 200},
          {"from" : 200}
        ]
      },
      "aggs": {
        "my_average_total_fat": {
          "avg": {
            "field": "details.total_fat"
          }
        }
      }
    }
  }
}
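One way to picture a metric nested inside a bucket aggregation: first split the documents into buckets, then run the metric once per bucket. The Python sketch below does this for an average per range; the function name, field names and documents are hypothetical stand-ins, not the nutrition data set.

```python
def avg_per_range(docs, ranges, bucket_field, metric_field):
    """Bucket docs by range on one field, then average another field per bucket."""
    results = []
    for r in ranges:
        lo = r.get("from", float("-inf"))
        hi = r.get("to", float("inf"))
        bucket = [d for d in docs if lo <= d[bucket_field] < hi]
        values = [d[metric_field] for d in bucket]
        avg = sum(values) / len(values) if values else None
        results.append({"doc_count": len(bucket), "avg": avg})
    return results


# Hypothetical documents.
docs = [
    {"calories": 50, "total_fat": 1.0},
    {"calories": 90, "total_fat": 2.0},
    {"calories": 150, "total_fat": 5.0},
    {"calories": 400, "total_fat": 12.0},
]
ranges = [{"to": 100}, {"from": 100, "to": 200}, {"from": 200}]
per_bucket = avg_per_range(docs, ranges, "calories", "total_fat")
print(per_bucket)
```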
The result of our nested aggregation:
"aggregations": {
  "my_calorie_range_buckets": {
    "buckets": [
      {
        "key": "*-100.0",
        "to": 100,
        "doc_count": 178,
        "my_average_total_fat": {
          "value": 1.2522471911702933
        }
      },
      {
        "key": "100.0-200.0",
        "from": 100,
        "to": 200,
        "doc_count": 207,
        "my_average_total_fat": {
          "value": 4.618357487922705
        }
      },
      {
        "key": "200.0-*",
        "from": 200,
        "doc_count": 114,
        "my_average_total_fat": {
          "value": 12.20438596501685
        }
      }
    ]
  }
}

‒ The first “my_average_total_fat” is the average total fat for items with less than 100 calories
The date_range Aggregation
The Date Range Aggregation
• The date_range bucket aggregation is similar to range
except it is specifically used for date fields
‒ syntax looks like:

GET my_index/_search
{
  "aggs" : {
    "my_date_range_agg" : {
      "date_range": {
        "field": "FIELD",
        "ranges": [
          {
            "from": "FROM",
            "to": "TO"
          }
        ]
      }
    }
  }
}

‒ You can use actual dates or date math expressions for “from” and “to”
Example of date_range
“Compare the week-to-week stock volume from January 1, 2010.”

GET stocks/_search
{
  "size": 0,
  "aggs": {
    "my_week_comparison": {
      "date_range": {
        "field": "trade_date",
        "ranges": [
          {
            "from": "2010-01-01||-1w",
            "to": "2010-01-01"
          },
          {
            "from": "2010-01-01",
            "to": "2010-01-01||+1w"
          }
        ]
      },
      "aggs": {
        "my_total_volume": {
          "sum": {
            "field": "volume"
          }
        }
      }
    }
  }
}

‒ Result for the first range:
"my_total_volume": {
  "value": 75871338
}
‒ Result for the second range:
"my_total_volume": {
  "value": 161270672
}
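The expressions "2010-01-01||-1w" and "2010-01-01||+1w" take an anchor date and shift it by one week. The Python sketch below shows which dates those two expressions resolve to; it only handles whole-week shifts and is not a date math parser, and the helper name is hypothetical.

```python
from datetime import date, timedelta


def shift_weeks(anchor, weeks):
    """Shift an ISO date string by a whole number of weeks."""
    d = date.fromisoformat(anchor)  # requires Python 3.7+
    return (d + timedelta(weeks=weeks)).isoformat()


week_before = shift_weeks("2010-01-01", -1)  # like "2010-01-01||-1w"
week_after = shift_weeks("2010-01-01", 1)    # like "2010-01-01||+1w"
print(week_before, week_after)
```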
The terms Aggregation
Terms Aggregation
• The terms aggregation dynamically creates a new
bucket for every unique term it encounters in the specified
field
‒ In other words, each distinct value of the field will have its own
bucket of documents

“Which product has the lowest rating but sells the best?”
‒ Start by using “terms” to put the products into distinct buckets by rating

[Figure: product documents grouped into five buckets, rating=1 through rating=5]
Simple Example of terms
• You just need to specify a “field” to bucket the terms by:

GET products/_search
{
  "size" : 0,
  "aggs" : {
    "my_customer_ratings" : {
      "terms": {
        "field": "customerRating",
        "size": 5
      }
    }
  }
}

‒ size = number of buckets (defaults to 10)
The Output of our terms Aggregation:
• Notice each bucket has a “key” that represents the
distinct value of “field”, and a “doc_count” for the
number of docs in the bucket
‒ “2” is the most common rating

"aggregations": {
  "my_customer_ratings": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": 2,
        "doc_count": 22190
      },
      {
        "key": 3,
        "doc_count": 22152
      },
      {
        "key": 1,
        "doc_count": 22100
      },
      {
        "key": 4,
        "doc_count": 22009
      },
      {
        "key": 5,
        "doc_count": 21984
      }
    ]
  }
}
Why are terms aggs not always precise?

shard0: a:5  b:4  c:3  d:4
shard1: a:4  b:5  c:3  d:1
merged: a:9  b:9  d:4  c:3

‒ Each shard returns only its own top terms, so shard0 leaves out c and shard1 leaves out d
If size=3, we get an inaccurate result
(c is 6 and should be in the top 3)
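The two-shard example can be replayed in a few lines of Python. This is a simplified simulation, not how Elasticsearch is implemented: it assumes each shard returns exactly shard_size terms and that the coordinating step just sums what it receives. The function name is hypothetical; the counts are the ones from the slide.

```python
from collections import Counter


def top_terms_across_shards(shards, size, shard_size):
    """Merge each shard's top `shard_size` terms, then keep the top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


shard0 = {"a": 5, "b": 4, "c": 3, "d": 4}
shard1 = {"a": 4, "b": 5, "c": 3, "d": 1}

# Each shard only reports 3 terms: c is dropped by shard0, d by shard1.
approx = top_terms_across_shards([shard0, shard1], size=3, shard_size=3)
# Each shard reports all 4 terms: the merged counts are exact.
exact = top_terms_across_shards([shard0, shard1], size=3, shard_size=4)
print(approx)
print(exact)
```

With shard_size=3 the merged list contains d with a count of 4 and misses c, whose true count is 6; asking each shard for one more term fixes the result in this small case.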
Understanding the Output of terms
• There are two values returned in a terms aggregation:
‒ “doc_count_error_upper_bound”: the maximum number of
missing documents that could potentially have appeared in a
bucket
‒ “sum_other_doc_count”: the number of documents that do not
appear in any of the buckets

GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_brand_names" : {
      "terms": {
        "field": "brand_name.keyword"
      }
    }
  }
}

"aggregations": {
  "my_brand_names": {
    "doc_count_error_upper_bound": 5,
    "sum_other_doc_count": 469,
    "buckets": [
      {
        "key": "Giant",
        "doc_count": 6
      },
      ...

‒ The value 6 may not be accurate
Enabling show_term_doc_count_error
• Shows you the upper-bound error for each bucket:
GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_brand_names" : {
      "terms": {
        "field": "brand_name.keyword",
        "show_term_doc_count_error": true
      }
    }
  }
}

"my_brand_names": {
  "doc_count_error_upper_bound": 5,
  "sum_other_doc_count": 469,
  "buckets": [
    {
      "key": "Giant",
      "doc_count": 6,
      "doc_count_error_upper_bound": 1
    },
    {
      "key": "Essential Everyday",
      "doc_count": 4,
      "doc_count_error_upper_bound": 3
    },
    ...

‒ Both “Giant” and “Essential Everyday” can be at most 7
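The "at most 7" figure comes from adding each bucket's own error bound to its reported count. A small Python sketch of that arithmetic, using the two buckets shown on the slide (the helper name is hypothetical):

```python
def max_possible_count(bucket):
    """Largest count a term could really have: reported count plus error bound."""
    return bucket["doc_count"] + bucket["doc_count_error_upper_bound"]


buckets = [
    {"key": "Giant", "doc_count": 6, "doc_count_error_upper_bound": 1},
    {"key": "Essential Everyday", "doc_count": 4, "doc_count_error_upper_bound": 3},
]
upper_bounds = {b["key"]: max_possible_count(b) for b in buckets}
print(upper_bounds)
```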
The shard_size Parameter
• shard_size tells Elasticsearch to gather more top terms
from each shard to improve accuracy
‒ shard_size = (size x 1.5) + 10 (by default)
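To get a feel for the default, the formula can be evaluated for a few values of size. The Python sketch below assumes the result is rounded down to a whole number, which is an assumption of this sketch and not something the slide states; the function name is hypothetical.

```python
def default_shard_size(size):
    """Default shard_size per the formula (size x 1.5) + 10, rounded down (assumed)."""
    return int(size * 1.5 + 10)


for size in (5, 10, 100):
    print(size, default_shard_size(size))
```

For the default size of 10, each shard would therefore be asked for its top 25 terms.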
Increasing the shard_size Parameter
• Let’s try increasing shard_size to something larger than its
default value:
GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_brand_names" : {
      "terms": {
        "field": "brand_name.keyword",
        "show_term_doc_count_error": true,
        "size": 5,
        "shard_size": 110
      }
    }
  }
}

"my_brand_names": {
  "doc_count_error_upper_bound": 0,
  "sum_other_doc_count": 474,
  "buckets": [
    {
      "key": "Giant",
      "doc_count": 6,
      "doc_count_error_upper_bound": 0
    },
    {
      "key": "Essential Everyday",
      "doc_count": 5,
      "doc_count_error_upper_bound": 0
    },
    ...

‒ That lowered the error upper bound to 0 for the top hits
Nesting Buckets
Nesting Buckets
• A bucket can be defined within a bucket:
GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_brand_names" : {
      "terms": {
        "field": "brand_name.keyword",
        "size" : 10,
        "shard_size" : 45
      },
      "aggs": {
        "my_calories_buckets": {
          "range": {
            "field": "details.calories",
            "ranges": [
              {
                "from": 0,
                "to": 100
              },
              {
                "from" : 100
              }
            ]
          }
        }
      }
    }
  }
}

‒ The “terms” agg puts docs with the same “brand_name” into a bucket
‒ The “range” agg puts docs from each “brand_name” into two
buckets: “calories” < 100 and “calories” >= 100
Output of the example:
"aggregations": {
  "my_brand_names": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 395,
    "buckets": [
      {
        "key": "Giant",
        "doc_count": 6,
        "my_calories_buckets": {
          "buckets": [
            {
              "key": "0.0-100.0",
              "from": 0,
              "to": 100,
              "doc_count": 2
            },
            {
              "key": "100.0-*",
              "from": 100,
              "doc_count": 4
            }
          ]
        }
      },
      ...

‒ There are six items from “Giant”, with 2 items below
100 calories and 4 items at or above 100 calories
Chapter Review
Summary
• You have two new tools in your Elasticsearch toolbox:
‒ Metrics Aggregations like min, max, avg, stats
‒ Bucket Aggregations like range and terms
• Metrics compute numeric values based on your dataset
• A bucket is a collection of documents that meet a criterion
• An aggregation can be thought of as a unit-of-work that
builds analytic information over a set of documents
• Aggregations can be nested within other aggregations
• The range aggregation puts documents into buckets based
on ranges of values on a specified field
• The terms aggregation dynamically creates buckets for
every unique term it encounters in a specified field
Quiz
1. Query or Aggregation: “What is the box office revenue of
all movies directed by Steven Spielberg?”
2. Query or Aggregation: “What are the nearby restaurants
that serve pizza?”
3. What are the four types of aggregations?
4. What aggregation would you use to put logging events into
buckets by log type (“error”, “warn”, “info”, etc.)?
5. How can you increase the accuracy of a terms
aggregation?
6. What aggregation(s) would you use to answer the question
“How many unique visitors came to our website today?”
Lab 8
Working with Aggregations
Chapter 9
More Aggregations
1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships
Topics covered:
• Aggregation Scope
• Using post_filter
• The missing Aggregation
• Histograms
• Date Histograms
• Percentiles
• Top Hits
• Significant Terms
• Sorting Buckets
Aggregation Scope
Aggregation Scope
• Aggregation scope refers to the documents that are used
to compute the aggregation
‒ Up until now, your “aggs” have been computed using the “hits”
of your queries:
GET stocks/_search
{
  "size": 0,
  "query": {
    "term": {
      "stock_symbol.keyword": {
        "value": "EBAY"
      }
    }
  },
  "aggs": {
    "ebay_avg_high": {
      "avg": {
        "field": "high"
      }
    }
  }
}

‒ The scope of this “avg” agg is the “EBAY” stocks from the query
Aggregation Scope
• There are options available to compute aggregations over a
different set of documents than the “hits”, including:
‒ global agg
‒ post_filter
The global Aggregation
• Use the global bucket aggregation to perform an
aggregation over all documents
‒ Even when a corresponding query is used
• The scope of a global aggregation is all documents
Example of global Aggregation
“I want the max total_fat of items with 120 calories,
compared to the max total_fat of all items.”

GET nutrition/_search
{
  "size": 0,
  "query": {
    "term": {
      "details.calories": 120
    }
  },
  "aggs": {
    "max_total_fat_from_120" : {
      "max": {
        "field": "details.total_fat"
      }
    },
    "all_of_my_items" : {
      "global": {},
      "aggs" : {
        "max_total_fat_from_all": {
          "max": {
            "field": "details.total_fat"
          }
        }
      }
    }
  }
}

‒ “max_total_fat_from_120” is the max total fat of items with 120 calories
‒ “max_total_fat_from_all” is the max total fat of all items
Output from the global Aggregation:

"aggregations": {
  "all_of_my_items": {
    "doc_count": 499,
    "max_total_fat_from_all": {
      "value": 36
    }
  },
  "max_total_fat_from_120": {
    "value": 14
  }
}
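The difference in scope can be pictured as running the same metric over two different document sets: the query hits and the whole index. The Python sketch below uses three hypothetical documents, chosen so the results mirror the 14 and 36 above; the helper name is also hypothetical.

```python
def max_field(docs, field):
    """Maximum of a field over a set of docs, or None if no doc has it."""
    values = [d[field] for d in docs if field in d]
    return max(values) if values else None


# Hypothetical documents standing in for the whole index.
all_items = [
    {"calories": 120, "total_fat": 14},
    {"calories": 120, "total_fat": 3},
    {"calories": 300, "total_fat": 36},
]

# Scope of a regular agg: only the hits of the query (calories == 120).
hits = [d for d in all_items if d["calories"] == 120]
max_from_120 = max_field(hits, "total_fat")

# Scope of an agg nested under "global": every document, query ignored.
max_from_all = max_field(all_items, "total_fat")
print(max_from_120, max_from_all)
```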
Using post_filter
Use Case for post_filter
• A typical search example you see online all the time:

Brand Names
‒ Philadelphia (3)
‒ Essential Everyday (2)
‒ Giant (2)
‒ Wegmans (2)
‒ 365 Everyday Value (1)

Search Results (“cheese” items from all brands)
1. Feta Cheese
2. Cheese, Italian Three Cheese
3. Turkey & Cheese
4. Macaroni and Cheese Dinner, Organic, Three Cheese
5. Shredded Mozzarella Cheese
6. Party Cheese Tray
7. Cheese & Grapes Snack
8. Potato Chips, Grilled Cheese
9. Tri-Color Cheese Tortellini
10. Burrito, Bean & Cheese

‒ Selecting a brand name filters the search results
Retrieving the Brand Names
• Step 1 is to search for “cheese” and include the brand
names of companies that sell “cheese” items
‒ we will use a terms agg:

GET nutrition/_search
{
  "query": {
    "match": {
      "item_name": "cheese"
    }
  },
  "aggs": {
    "popular_brand_names": {
      "terms": {
        "field": "brand_name.keyword",
        "size": 5
      }
    }
  }
}

‒ The “query” retrieves items with “cheese” in the name
‒ The “terms” agg retrieves the top 5 brand names that sell “cheese” items
The User Selects a Brand Name…
• We still want our terms agg to return the top 5 brand
names
‒ the scope of the terms agg is still over all items with “cheese”
‒ use a post_filter to only return the items from “Wegmans”

[Figure: the same “Brand Names” list and ten “cheese” search results as before; the user clicks on “Wegmans”]
Using post_filter

GET nutrition/_search
{
  "query": {
    "match": {
      "item_name": "cheese"
    }
  },
  "aggs": {
    "popular_brand_names": {
      "terms": {
        "field": "brand_name.keyword",
        "size": 5
      }
    }
  },
  "post_filter": {
    "match": {
      "brand_name.keyword": "Wegmans"
    }
  }
}

‒ The “query” retrieves items with “cheese” in the name
‒ The “terms” agg retrieves the top 5 brand names that sell “cheese” items
‒ The “post_filter” only returns items from “Wegmans”
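The order of operations is what matters with post_filter: the aggregation sees every hit of the query, and only afterwards are the returned hits narrowed down. Here is a minimal Python sketch of that order, using a hypothetical three-document data set and a hypothetical `search` helper; it is not how Elasticsearch executes the request.

```python
from collections import Counter


def search(docs, query, post_filter):
    """Run the query, aggregate over its hits, then filter the returned hits."""
    hits = [d for d in docs if query(d)]
    brand_counts = Counter(d["brand_name"] for d in hits)  # agg scope: all hits
    returned = [d for d in hits if post_filter(d)]         # applied after aggs
    return returned, brand_counts


# Hypothetical documents.
docs = [
    {"item_name": "Feta Cheese", "brand_name": "Giant"},
    {"item_name": "Ravioli, Three Cheese", "brand_name": "Wegmans"},
    {"item_name": "Olive Oil", "brand_name": "Wegmans"},
]

returned, brand_counts = search(
    docs,
    query=lambda d: "cheese" in d["item_name"].lower(),
    post_filter=lambda d: d["brand_name"] == "Wegmans",
)
print(returned)
print(brand_counts)
```

The brand counts still include “Giant” even though only the “Wegmans” item is returned, which is the behavior the facet list in the example relies on.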
Result from post_filter
• We get back the two “cheese” items from “Wegmans”
• along with the top 5 brand names again

Brand Names
‒ Philadelphia (3)
‒ Essential Everyday (2)
‒ Giant (2)
‒ Wegmans (2)
‒ 365 Everyday Value (1)

Search Results
1. Ravioli, Three Cheese
2. Pierogies, Spinach & Feta Cheese, Club Pack
The missing Aggregation
The missing Aggregation
• The missing bucket aggregation creates a bucket of all
documents that are missing a field value
‒ Either the field is not defined or it contains a null value
‒ Useful when you are creating buckets on a field value and you
have documents missing that particular field
• The syntax looks like:

GET my_index/_search
{
  "aggs": {
    "my_bucket": {
      "missing": {
        "field": "FIELD_NAME"
      }
    }
  }
}
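The rule for what counts as "missing" (the field is absent, or it is present with a null value) can be sketched in Python as follows. The documents and the helper name are hypothetical.

```python
def missing_bucket(docs, field):
    """Docs where the field is absent or explicitly null (None)."""
    return [d for d in docs if d.get(field) is None]


# Hypothetical documents.
docs = [
    {"brand_name": "Giant", "ingredients": "olive oil"},
    {"brand_name": "Giant"},                          # field not defined
    {"brand_name": "Unknown", "ingredients": None},   # null value
]
missing = missing_bucket(docs, "ingredients")
print(len(missing))
```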
Example of missing
“Show me the brand names with the most products
which are missing their ingredients.”

GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_missing_ingredients": {
      "missing": {
        "field": "ingredients"
      },
      "aggs": {
        "my_brands": {
          "terms": {
            "field": "brand_name.keyword"
          }
        }
      }
    }
  }
}

"my_missing_ingredients": {
  "doc_count": 250,
  "my_brands": {
    "doc_count_error_upper_bound": 2,
    "sum_other_doc_count": 222,
    "buckets": [
      {
        "key": "Giant",
        "doc_count": 5
      },
      {
        "key": "Unknown",
        "doc_count": 5
      },
      ...

‒ “Giant” has 5 items missing their ingredients
Histograms
The histogram Aggregation
• The histogram bucket aggregation builds a histogram on a
given “field” using a specified “interval”:
‒ A histogram is like a bar chart that shows how many values fit
into various intervals

GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_calories_histogram": {
      "histogram": {
        "field": "details.calories",
        "interval": 100
      }
    }
  }
}

A bucket is created for every 100 calories.

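To make the bucketing rule concrete, here is a minimal Python sketch of how a value is assigned to a histogram bucket: the key is the value rounded down to a multiple of the interval. The function names and the sample calorie values are hypothetical and are not taken from the course's nutrition data set. Unlike Elasticsearch, this sketch does not emit empty buckets between populated ones.

```python
import math
from collections import Counter


def bucket_key(value, interval):
    """Round a value down to the start of its histogram bucket."""
    return math.floor(value / interval) * interval


def histogram(values, interval, min_doc_count=0):
    """Count values per bucket, keeping buckets with at least min_doc_count values."""
    counts = Counter(bucket_key(v, interval) for v in values)
    return [
        {"key": key, "doc_count": count}
        for key, count in sorted(counts.items())
        if count >= min_doc_count
    ]


# Hypothetical calorie values, used only for illustration.
calories = [0, 45, 96, 100, 150, 195, 300, 399, 630]
buckets = histogram(calories, interval=100)
filtered = histogram(calories, interval=100, min_doc_count=3)
```

A value of 399 lands in the bucket with key 300, because each bucket covers its key (inclusive) up to the next key (exclusive).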
The output of histogram:
• It’s a histogram!
‒ (OK, we cheated and used a Kibana visualization)

[Figure: Kibana visualization of the calories histogram]

The API result:
• The number of buckets is dynamic and depends on the input values
and the interval

"aggregations": {
  "my_calories_histogram": {
    "buckets": [
      { "key": 0, "doc_count": 178 },
      { "key": 100, "doc_count": 207 },
      { "key": 200, "doc_count": 84 },
      { "key": 300, "doc_count": 24 },
      { "key": 400, "doc_count": 3 },
      { "key": 500, "doc_count": 1 },
      { "key": 600, "doc_count": 2 }
    ]
  }
}

24 items have calories between 300 (inclusive) and 400 (exclusive).

The min_doc_count Parameter
• Use min_doc_count to drop buckets that contain fewer than the
given number of documents

GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_calories_histogram": {
      "histogram": {
        "field": "details.calories",
        "interval": 100,
        "min_doc_count": 10
      }
    }
  }
}

The response now contains only the buckets with 10 or more documents:

"aggregations": {
  "my_calories_histogram": {
    "buckets": [
      { "key": 0, "doc_count": 178 },
      { "key": 100, "doc_count": 207 },
      { "key": 200, "doc_count": 84 },
      { "key": 300, "doc_count": 24 }
    ]
  }
}

Metrics on a Histogram
• Just like any bucket, you can perform metrics on a
histogram
‒ Let’s look at the stats of each bucket
GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_calories_histogram": {
      "histogram": {
        "field": "details.calories",
        "interval": 100,
        "min_doc_count": 10
      },
      "aggs": {
        "my_bucket_stats": {
          "stats": {
            "field": "details.calories"
          }
        }
      }
    }
  }
}

“my_bucket_stats” is a metric computed within each bucket.

Output of stats:
"aggregations": {
  "my_calories_histogram": {
    "buckets": [
      {
        "key": 0,
        "doc_count": 178,
        "my_bucket_stats": {
          "count": 178,
          "min": 0,
          "max": 96,
          "avg": 47.29775280898876,
          "sum": 8419
        }
      },
      {
        "key": 100,
        "doc_count": 207,
        "my_bucket_stats": {
          "count": 207,
          "min": 100,
          "max": 195,
          "avg": 137.8985507246377,
          "sum": 28545
        }
      },
      ...

There are 178 products with fewer than 100 calories.
There is at least one product with 0 calories.

Date Histograms
Date Histograms
• The date_histogram bucket aggregation creates buckets
over a date field and specified interval
‒ Histograms over date ranges are a powerful tool in data analytics
• The syntax looks like:

GET stocks/_search
{
  "aggs": {
    "my_date_histogram": {
      "date_histogram": {
        "field": "DATE_FIELD",
        "interval": "INTERVAL"
      }
    }
  }
}

‒ “field” must be a “date” field
‒ “interval” is the interval of each bucket

Intervals
• A valid value for interval is either one of the following strings:
‒ year, quarter, month, week, day, hour, minute, second
• Or a time value:
‒ a number followed by a time unit
‒ for example: “2d” is two days
• Time units are:
‒ d = days, h = hours, m = minutes, s = seconds, ms = milliseconds

Example of a Date Histogram
• Let’s run a date_histogram aggregation on the stock prices,
putting the stock prices from each week into a bucket:

GET stocks/_search
{
  "size": 0,
  "aggs": {
    "my_weekly_buckets": {
      "date_histogram": {
        "field": "trade_date",
        "interval": "1w"
      }
    }
  }
}

The stocks in weekly buckets:
Notice the buckets are 7 days apart.

"aggregations": {
  "my_weekly_buckets": {
    "buckets": [
      {
        "key_as_string": "2009-08-20T00:00:00.000Z",
        "key": 1250726400000,
        "doc_count": 1996
      },
      {
        "key_as_string": "2009-08-27T00:00:00.000Z",
        "key": 1251331200000,
        "doc_count": 2495
      },
      {
        "key_as_string": "2009-09-03T00:00:00.000Z",
        "key": 1251936000000,
        "doc_count": 1497
      },
      {
        "key_as_string": "2009-09-10T00:00:00.000Z",
        "key": 1252540800000,
        "doc_count": 2495
      },
      ...

Example of Metrics on a Date Histogram
What is the monthly average stock price of the “SHW” stock?

GET stocks/_search
{
  "query": {
    "match": {
      "stock_symbol": "SHW"
    }
  },
  "size": 0,
  "aggs": {
    "my_stock_date_histogram": {
      "date_histogram": {
        "field": "trade_date",
        "interval": "month"
      },
      "aggs": {
        "my_avg_close": {
          "avg": {
            "field": "close"
          }
        }
      }
    }
  }
}

The average close for each monthly bucket:

"2009-08-01" = 60.85857173374721
"2009-09-01" = 60.339000129699706
"2009-10-01" = 60.16849994659424
"2009-11-01" = 60.203529582304114
"2009-12-01" = 61.94954560019753
"2010-01-01" = 60.4199995743601
"2010-02-01" = 64.26526320608039
"2010-03-01" = 65.56347905034605
"2010-04-01" = 74.24333336239769
"2010-05-01" = 77.47699966430665
"2010-06-01" = 74.70545473965731
"2010-07-01" = 70.23285675048828
"2010-08-01" = 69.4557135445731

Visualization of our date_histogram

[Figure: visualization of the monthly average close from the date_histogram]

Percentiles
The Percentiles Aggregation
• The percentiles metrics aggregation calculates percentiles
over a numeric field
‒ a percentile is the value at or below which a given percentage
of the observed values fall

What is the 95th percentile of the calories field?

GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_calorie_percentiles": {
      "percentiles": {
        "field": "details.calories"
      }
    }
  }
}

The response:

"aggregations": {
  "my_calorie_percentiles": {
    "values": {
      "1.0": 0,
      "5.0": 10,
      "25.0": 70,
      "50.0": 120,
      "75.0": 189.16666666,
      "95.0": 310,
      "99.0": 400
    }
  }
}

95% of items have 310 or fewer calories.

Percentiles Example
• Quintiles are a commonly used set of percentiles
‒ 5 evenly-spaced percentages: 20% - 40% - 60% - 80% - 100%

What are the quintile values for calories?

GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_calorie_quintiles": {
      "percentiles": {
        "field": "details.calories",
        "percents": [20, 40, 60, 80, 100]
      }
    }
  }
}

The response:

"aggregations": {
  "my_calorie_quintiles": {
    "values": {
      "20.0": 60,
      "40.0": 110,
      "60.0": 140,
      "80.0": 210,
      "100.0": 630
    }
  }
}

80% of the items have 210 or fewer calories.

Percentile Ranks
• Instead of providing percents and getting back values, you
can pass in a value and get back its percentile
‒ percentile_ranks is essentially the inverse of percentiles
‒ you can pass in multiple values

What percentile is 150 calories?

GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_calorie_percentiles": {
      "percentile_ranks": {
        "field": "details.calories",
        "values": [150, 300]
      }
    }
  }
}

The response:

"aggregations": {
  "my_calorie_percentiles": {
    "values": {
      "150.0": 65.33066132264528,
      "300.0": 94.18837675350701
    }
  }
}

65.3% of items have 150 or fewer calories, and 94.2% have 300 or fewer.

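The idea behind a percentile rank can be sketched in a few lines of Python. This is an exact calculation over a small in-memory list, whereas Elasticsearch computes an approximation over the whole index, so treat it as an illustration of the concept rather than of the engine's algorithm. The function name and the sample values are hypothetical.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to threshold."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


# Hypothetical calorie values, used only for illustration.
calories = [0, 60, 110, 120, 140, 150, 210, 300, 400, 630]

rank_150 = percentile_rank(calories, 150)  # 6 of the 10 values are <= 150
rank_300 = percentile_rank(calories, 300)  # 8 of the 10 values are <= 300
```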
Top Hits
The Top Hits Aggregation
• top_hits is a metrics aggregation that allows you to find the
most relevant documents in a bucket
‒ used typically as a sub-aggregation
• The syntax looks like:

"aggs": {
  "my_top_hits": {
    "top_hits": {
      "size": SIZE,
      "sort": [],
      "from": OFFSET
    }
  }
}

‒ “size” is the number of top hits to return per bucket
‒ “from” is the offset from the first result you want to fetch

Motivation for top_hits
• Suppose we do a search for items with sugar in the
ingredients, then bucket those by the number of calories:

GET nutrition/_search
{
  "size": 0,
  "query": {
    "match": {
      "ingredients": "sugar"
    }
  },
  "aggs": {
    "my_calorie_bucket": {
      "terms": {
        "field": "details.calories"
      }
    }
  }
}

The response:

"buckets": [
  { "key": 140, "doc_count": 9 },
  { "key": 120, "doc_count": 8 },
  { "key": 270, "doc_count": 7 },
  ...

9 items with sugar have 140 calories, but what are the most relevant
items in this bucket?

Example of top_hits
GET nutrition/_search
{
  "size": 0,
  "query": {
    "match": {
      "ingredients": "sugar"
    }
  },
  "aggs": {
    "my_calorie_bucket": {
      "terms": {
        "field": "details.calories"
      },
      "aggs": {
        "my_top_hits": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}

‒ top_hits is a sub-aggregation of “my_calorie_bucket”
‒ Returns the top 5 hits in each bucket (based on each hit’s _score
for the “sugar” query)

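Conceptually, top_hits as a sub-aggregation groups the matching documents by bucket key and keeps the best-scoring few in each group. The following Python sketch shows that logic on an in-memory list. The function name, the document shape, and the scores are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def top_hits_per_bucket(hits, bucket_field, size):
    """Group hits by bucket_field and keep the `size` highest-scoring hits per group."""
    buckets = defaultdict(list)
    for hit in hits:
        buckets[hit[bucket_field]].append(hit)
    return {
        key: sorted(docs, key=lambda d: d["_score"], reverse=True)[:size]
        for key, docs in buckets.items()
    }


# Hypothetical hits, as if returned by the "sugar" query.
hits = [
    {"name": "A", "calories": 140, "_score": 1.3},
    {"name": "B", "calories": 140, "_score": 0.4},
    {"name": "C", "calories": 120, "_score": 0.9},
    {"name": "D", "calories": 140, "_score": 0.8},
]

top = top_hits_per_bucket(hits, "calories", size=2)
```

With size set to 2, the 140-calorie bucket keeps documents A and D, and B is dropped because it has the lowest score in that bucket.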
The output of top_hits:
• Notice the top 5 hits from each bucket are returned in the
“aggregations” clause of the response

"aggregations": {
  "my_calorie_bucket": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 57,
    "buckets": [
      {
        "key": 140,
        "doc_count": 9,
        "my_top_hits": {
          "hits": {
            "total": 9,
            "max_score": 1.341344,
            "hits": [
              top 5 hits
            ]
            ...

The top 5 hits in the 140-calorie bucket:
‒ "Soda"
‒ "Hot Cocoa Mix, Cappuccino Classics Suprema 1.16 Oz"
‒ "Raisin Bran"
‒ "Exotic Juice Drink, Guanabana"
‒ "Simply Made Cookies, Butter"

Significant Terms
The Significant Terms Aggregation
• The significant_terms bucket aggregation is used to find
interesting or unusual occurrences of terms
• For example, suppose a bank needs to find fraudulent
credit card transactions
‒ A group of customers complains about fraudulent charges
‒ A regular terms aggregation would reveal that a lot of those
customers charged something recently at Amazon. This is
“commonly common” - a lot of customers charge things at
Amazon
‒ Three of the complaining customers had charges at a local
grocery store. This is “commonly uncommon” - only a few
customers had this same transaction
‒ significant_terms helps you find the “uncommonly common” -
the merchants with charges from all complaining customers, but
not a lot of the other customers

Enabling Fielddata
• If we are going to perform a terms agg or significant terms
agg on a text data type, we need to enable fielddata for
that field
‒ In our example, we are going to perform both aggregations on
the ingredients field of our nutrition index:

PUT nutrition/_mapping/doc
{
  "properties" : {
    "ingredients" : {
      "type" : "text",
      "fielddata" : true
    }
  }
}

‒ The request updates the existing mapping
‒ “fielddata”: true enables fielddata for ingredients

Let’s Start with a Terms Agg
Let’s look for a relationship between total fat and ingredients.

GET nutrition/_search
{
  "size": 0,
  "aggs" : {
    "my_total_fat_histogram": {
      "histogram": {
        "field": "details.total_fat",
        "interval": 5,
        "min_doc_count": 5
      },
      "aggregations": {
        "my_top_words" : {
          "terms" : {
            "field" : "ingredients",
            "size": 5
          }
        }
      }
    }
  }
}

The Results of the Terms Agg
• These would be considered “commonly common”
ingredients:

"key": 15,
"doc_count": 31,
"my_top_words": {
  "buckets": [
    { "key": "salt", "doc_count": 15 },
    { "key": "oil", "doc_count": 14 },
    { "key": "and", "doc_count": 11 },
    { "key": "corn", "doc_count": 8 },
    { "key": "acid", "doc_count": 7 }
  ]
}

‒ There are 31 products with total_fat between 15 and 20
‒ The “commonly common” ingredients are “salt”, “oil”,
“corn”, “acid” and the stopword “and”

Using significant_terms
• Now let’s try the same agg with significant_terms, to look for
the “uncommonly common” ingredients in relationship to total fat:

GET nutrition/_search
{
  "size": 0,
  "aggs" : {
    "my_total_fat_histogram": {
      "histogram": {
        "field": "details.total_fat",
        "interval": 5,
        "min_doc_count": 5
      },
      "aggregations": {
        "my_top_words" : {
          "significant_terms" : {
            "field" : "ingredients",
            "size": 5
          }
        }
      }
    }
  }
}

The Output of significant_terms:
"key": 15,
"doc_count": 31,
"my_top_words": {
  "buckets": [
    {
      "key": "almonds",
      "doc_count": 5,
      "score": 2.4349635796045783,
      "bg_count": 5
    },
    {
      "key": "flax",
      "doc_count": 4,
      "score": 1.947970863683663,
      "bg_count": 4
    },
    {
      "key": "shortening",
      "doc_count": 4,
      "score": 1.532570239334027,
      "bg_count": 5
    },
    {
      "key": "apples",
      "doc_count": 4,
      "score": 1.532570239334027,
      "bg_count": 5
    },
    {
      "key": "cashews",
      "doc_count": 3,
      "score": 1.460978147762747,
      "bg_count": 3
    }
    ...

‒ These are the same 31 products with total_fat between 15 and 20
‒ “doc_count” is the number of documents in the bucket that contain
the term; “bg_count” is the number of documents in the whole index
that contain it
‒ The top 5 significant ingredients are “almonds”, “flax”,
“shortening”, “apples” and “cashews”

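The scores above can be reproduced with a short Python sketch of the JLH heuristic, which is the default scoring for significant_terms: the absolute change in a term's frequency between the bucket (foreground) and the index (background), multiplied by the relative change. The background size of 499 documents is an assumption; it is not stated on the slides, but it is the value that makes the numbers match. The function and constant names are hypothetical.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """JLH heuristic: (fg_freq - bg_freq) * (fg_freq / bg_freq)."""
    fg_freq = fg_count / fg_total  # how common the term is inside the bucket
    bg_freq = bg_count / bg_total  # how common the term is in the whole index
    if fg_freq <= bg_freq:
        return 0.0  # the term is no more common in the bucket than elsewhere
    return (fg_freq - bg_freq) * (fg_freq / bg_freq)


BUCKET_DOCS = 31   # products with total_fat between 15 and 20 (from the slide)
INDEX_DOCS = 499   # ASSUMPTION: background document count, inferred from the scores

almonds = jlh_score(5, BUCKET_DOCS, 5, INDEX_DOCS)
flax = jlh_score(4, BUCKET_DOCS, 4, INDEX_DOCS)
shortening = jlh_score(4, BUCKET_DOCS, 5, INDEX_DOCS)
```

Every product containing “almonds” falls in this one bucket (doc_count equals bg_count), which is why it scores highest. “salt” appears in many buckets, so its foreground and background frequencies are close and its score would be low.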
Sorting Buckets
Default Bucket Sorting
• By default, the buckets of a terms aggregation are sorted by
_count (the document count) in descending order
‒ histogram and date_histogram buckets are sorted by _key in
ascending order by default
• You can specify the sorting of terms and histogram buckets
using “order”:
‒ _count: sort by the document count; works with terms, histogram
and date_histogram
‒ _term: sort by the term alphabetically; only works with terms
‒ _key: sort by the bucket’s key; works with histogram and
date_histogram
• You can also sort by a metric value in a nested aggregation

Sort by _key Example

GET stocks/_search
{
  "aggs": {
    "my_stock_date_histogram": {
      "date_histogram": {
        "field": "trade_date",
        "interval": "month",
        "order": {
          "_key": "desc"
        }
      },
      "aggs": {
        "my_avg_close": {
          "avg": {
            "field": "close"
          }
        }
      }
    }
  }
}

The default for date_histogram is _key ascending.

Sorting by a Metric

Find the brands with the highest-calorie products.

GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_brands_bucket": {
      "terms": {
        "field": "brand_name.keyword",
        "order": {
          "my_max_calories": "desc"
        }
      },
      "aggs": {
        "my_max_calories": {
          "max": {
            "field": "details.calories"
          }
        }
      }
    }
  }
}

This output will be sorted by the results of the nested metric.

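The effect of ordering buckets by a nested metric can be sketched in Python: compute the metric per bucket first, then sort the buckets by that value instead of by document count. The function name, the flat document shape, and the sample products are hypothetical.

```python
from collections import defaultdict


def brands_by_max_calories(products, size=10):
    """Bucket products by brand and order the buckets by their max calories, descending."""
    max_by_brand = defaultdict(lambda: float("-inf"))
    counts = defaultdict(int)
    for product in products:
        brand = product["brand_name"]
        counts[brand] += 1
        max_by_brand[brand] = max(max_by_brand[brand], product["calories"])
    ordered = sorted(max_by_brand.items(), key=lambda kv: kv[1], reverse=True)
    return [
        {"key": brand, "doc_count": counts[brand], "my_max_calories": {"value": value}}
        for brand, value in ordered[:size]
    ]


# Hypothetical products, used only for illustration.
products = [
    {"brand_name": "Acme", "calories": 120},
    {"brand_name": "Acme", "calories": 90},
    {"brand_name": "Bolt", "calories": 630},
    {"brand_name": "Cask", "calories": 300},
    {"brand_name": "Acme", "calories": 150},
]

buckets = brands_by_max_calories(products)
```

“Acme” has the most products but sorts last, because the order is driven by the metric and not by doc_count.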
Chapter Review
Summary
• Use global to perform an aggregation over all documents
• The histogram bucket aggregation builds a histogram on a
given “field” using a specified “interval”
• The date_histogram bucket aggregation creates buckets
over a date field and specified interval
• The percentiles metrics aggregation calculates percentiles
over a numeric field
• top_hits is a metrics aggregation that allows you to find the
most relevant documents in a bucket
• The significant_terms bucket aggregation is used to find
interesting or unusual occurrences of terms

Quiz
1. How many ERRORs occurred per day last month? Which
aggregation would work well for this?
2. You want to recommend movies based on another movie
that someone likes. Which aggregation might work well for
finding recommendations that are out of the ordinary?
3. Which aggregations would you use to answer the question:
“What are the top 3 posts from the top 10 users at
StackOverflow?”
4. You want to compare the total volume of stocks from the
tech industry to the overall market. Which aggregation would
you use?

Lab 9
More Aggregations

Chapter 10
Handling Relationships

1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships

Topics covered:
• The Need for Data Modeling
• Denormalization
• The Need for Nested Types
• Nested Types
• Querying a Nested Type
• Sorting on a Nested Type
• The Nested Aggregation
• Parent/Child Types
• The has_child Query
• The has_parent Query

The Need for Data Modeling
It’s all about relationships…
• If you come from the SQL world, you will likely need to
change your thought process for modeling data for
Elasticsearch
‒ In SQL, you typically normalize your data
• Search requires different considerations
‒ In Elasticsearch, you typically denormalize your data!
• A flat world has its advantages
‒ Indexing and searching are fast
‒ No need to join tables or lock rows

…and sometimes relationships matter
• There are times when relationships matter
‒ We need to bridge the gap between normal and flat
• Four common techniques for managing relational data in
Elasticsearch
‒ Denormalizing: flatten your data (typically the best solution)
‒ Application-side joins: run multiple queries on normalized data
‒ Nested objects
‒ Parent/child relationships
• We will discuss nested objects and parent/child
relationships in detail in this chapter

Denormalization
Denormalization
• Denormalizing your data refers to “flattening” your data
‒ storing redundant copies of data in each document, instead of
using some type of relationship
• Denormalization provides the best performance out of
Elasticsearch
‒ no need to perform expensive joins

Example of Denormalization
• Suppose you are indexing users and tweets
‒ users in one index, and tweets in another
• When searching for tweets, suppose you want to search by
the username also

users

{
  "username" : "harrison",
  "userid" : 1,
  "city" : "Los Angeles",
  "state" : "California"
}

tweets

{
  "body" : "My favorite movie is Star Wars",
  "time" : "2017-01-24T02:32:27",
  "userid" : 1
}

Example of Denormalization
• Duplicate the desired users data in the tweets document:

users

PUT users/doc/1
{
  "username" : "harrison",
  "userid" : 1,
  "city" : "Los Angeles",
  "state" : "California"
}

tweets

PUT tweets/doc/123
{
  "body" : "My favorite movie is Star Wars",
  "time" : "2017-01-24T02:32:27",
  "userid" : 1,
  "username" : "harrison"
}

PUT tweets/doc/456
{
  "body" : "Laugh it up, fuzzball.",
  "time" : "1977-05-25T00:00:00",
  "userid" : 1,
  "username" : "harrison"
}

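In practice the duplication happens in the application, before the tweet is indexed. The following Python sketch shows one way this could look: a helper copies selected user fields into the tweet document. The helper name and the in-memory `users_by_id` lookup are hypothetical, and the step that actually sends the document to Elasticsearch is deliberately left out.

```python
def denormalize_tweet(tweet, users_by_id, fields=("username",)):
    """Return a copy of the tweet with the selected user fields copied in."""
    user = users_by_id[tweet["userid"]]
    enriched = dict(tweet)  # do not mutate the caller's document
    for field in fields:
        enriched[field] = user[field]
    return enriched


# The user and tweet documents from the slides.
users_by_id = {
    1: {
        "username": "harrison",
        "userid": 1,
        "city": "Los Angeles",
        "state": "California",
    }
}
tweet = {
    "body": "My favorite movie is Star Wars",
    "time": "2017-01-24T02:32:27",
    "userid": 1,
}

doc_to_index = denormalize_tweet(tweet, users_by_id)
```

Keeping “userid” in the tweet means the copied fields can be refreshed later by looking the user up again.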
Example of Denormalization
• Now you can search tweets and username all in a single
query:

GET tweets/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"body": "movie"}},
        {"match": {"username": "harrison"}}
      ]
    }
  }
}

The matching hit:

"_score": 0.55510485,
"_source": {
  "body": "My favorite movie is Star Wars",
  "time": "2017-01-24T02:32:27",
  "userid": 1,
  "username": "harrison"
}

Tips for Denormalizing Data
• Denormalize data to optimize for reads
• There is no mechanism to keep the denormalized data
consistent with the original document
‒ avoid denormalizing data that is likely to change frequently
• When denormalizing data that changes, you should ensure
that there is one and only one authoritative source for the data
‒ If you do denormalize data that might change, it can help to also
denormalize a field that does not change, like an _id

users

PUT users/doc/1
{
  "userid" : 1,
  ...
}

tweets

PUT tweets/doc/123
{
  "userid" : 1,
  ...
}

Our example uses “userid” to link each tweet back to its authoritative
source, the users document.

The Need for Nested Types
Inner Objects
• There is an interesting (and confusing) scenario that can
arise with JSON inner objects
• Suppose we index the following two documents into an
index named “companies”:

PUT companies/doc/1
{
  "name" : "Stark Enterprises",
  "employees" : [
    {"first_name" : "Tony", "last_name" : "Stark"},
    {"first_name" : "Virginia", "last_name" : "Potts"}
  ]
}

PUT companies/doc/2
{
  "name" : "NBC Universal",
  "employees" : [
    {"first_name" : "Tony", "last_name" : "Potts"}
  ]
}

Searching Inner Objects…
• What do you think the result is of the following query?

GET companies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"employees.first_name": "Tony"}},
        {"match": {"employees.last_name": "Potts"}}
      ]
    }
  }
}

…can have surprising results:

"hits": {
  "total": 2,
  "max_score": 0.5753642,
  "hits": [
    {
      "_index": "companies",
      "_type": "doc",
      "_id": "2",
      "_score": 0.5753642,
      "_source": {
        "name": "NBC Universal",
        "employees": [
          {
            "first_name": "Tony",
            "last_name": "Potts"
          }
        ]
      }
    },
    {
      "_index": "companies",
      "_type": "doc",
      "_id": "1",
      "_score": 0.51623213,
      "_source": {
        "name": "Stark Enterprises",
        "employees": [
          {
            "first_name": "Tony",
            "last_name": "Stark"
          },
          {
            "first_name": "Virginia",
            "last_name": "Potts"
          }
        ]
      }
    }
  ]
}

“Stark Enterprises” does not have an employee named “Tony Potts”.

Why the confusing results?
• The “first_name” and “last_name” fields belong to JSON inner
objects of the “employees” field
‒ When the JSON object got flattened in the index, we lost the
relationship between “Tony” and “Potts”
‒ All first names and last names are put into respective arrays:

{
  "name" : "Stark Enterprises",
  "employees.first_name" : [ "Tony", "Virginia" ],
  "employees.last_name" : [ "Stark", "Potts" ]
}

A query for “Tony Potts” is a hit (so is “Virginia Stark”).

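The loss of association can be reproduced outside Elasticsearch. The Python sketch below flattens an array of inner objects into one list of values per field, then evaluates the two “must” conditions against those lists. It is a simplified model that ignores text analysis and scoring, and the function names are hypothetical.

```python
def flatten(doc, array_field):
    """Flatten an array of inner objects into one list of values per dotted field name."""
    flat = {k: v for k, v in doc.items() if k != array_field}
    for obj in doc[array_field]:
        for key, value in obj.items():
            flat.setdefault(f"{array_field}.{key}", []).append(value)
    return flat


def matches_flat(flat, conditions):
    """True if every condition value appears somewhere in the matching flattened field."""
    return all(value in flat.get(field, []) for field, value in conditions.items())


stark = {
    "name": "Stark Enterprises",
    "employees": [
        {"first_name": "Tony", "last_name": "Stark"},
        {"first_name": "Virginia", "last_name": "Potts"},
    ],
}

flat = flatten(stark, "employees")
tony_potts = matches_flat(
    flat, {"employees.first_name": "Tony", "employees.last_name": "Potts"}
)
```

Here `tony_potts` is true even though no single employee is named Tony Potts, because each condition is satisfied by a different employee.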
How do we fix this scenario?
• A hit on the inner object yields a hit on the root object
• The problem is with the mapping:
‒ We need to use the “nested” data type for the JSON inner
objects (in this particular scenario)

"companies": {
  "mappings": {
    "doc": {
      "properties": {
        "employees": {
          "properties": {
            "first_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
              ...

“first_name” and “last_name” are inner objects - but they should be
“nested” for our use case.

Nested Types
Nested Data Type
• The nested type allows arrays of objects to be indexed and
queried independently of each other
‒ Use nested if you need to maintain the independence of each
object in the array
• The mapping syntax looks like:

"mappings": {
  "doc": {
    "properties": {
      "outer_object": {
        "type": "nested",
        "properties": {
          "inner_field": { "type": "TYPE" },
          ...
        }
      }
    },
    ...

Example of nested
PUT companies
{
  "mappings": {
    "doc": {
      "properties": {
        "employees": {
          "type": "nested",
          "properties": {
            "first_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "last_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
            ...

“first_name” and “last_name” are nested inside “employees”.

Nested Objects are Stored Separately
• The nested mapping stores the nested documents in a
non-flattened way that maintains the original association:

{
  "name" : "Stark Enterprises"
}
{
  "employees.first_name" : "Tony",
  "employees.last_name" : "Stark"
}
{
  "employees.first_name" : "Virginia",
  "employees.last_name" : "Potts"
}

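Because each nested object is kept as its own unit, all conditions of a query must hold within the same object. The Python sketch below models that rule and also reports which objects matched, which is the information that inner_hits exposes. It ignores text analysis and scoring, and the function names are hypothetical.

```python
def matching_offsets(doc, path, conditions):
    """Offsets of the nested objects under `path` that satisfy every condition."""
    return [
        offset
        for offset, obj in enumerate(doc[path])
        if all(obj.get(field) == value for field, value in conditions.items())
    ]


def matches_nested(doc, path, conditions):
    """True if at least one single nested object satisfies every condition."""
    return bool(matching_offsets(doc, path, conditions))


stark = {
    "name": "Stark Enterprises",
    "employees": [
        {"first_name": "Tony", "last_name": "Stark"},
        {"first_name": "Virginia", "last_name": "Potts"},
    ],
}
nbc = {
    "name": "NBC Universal",
    "employees": [{"first_name": "Tony", "last_name": "Potts"}],
}

tony_potts = {"first_name": "Tony", "last_name": "Potts"}
hits = [d["name"] for d in (stark, nbc) if matches_nested(d, "employees", tony_potts)]
tony_offsets = matching_offsets(stark, "employees", {"first_name": "Tony"})
```

Only “NBC Universal” matches “Tony Potts” under this rule, and a search for the first name “Tony” in “Stark Enterprises” points at offset 0 of the employees array.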
Querying a Nested Type
Querying a nested Type
• To query a nested data type, you have to use a “nested”
query
‒ Specify the “path” to the nested object

GET companies/_search
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "bool": {
          "must": [
            {"match": {"employees.first_name": "Tony"}},
            {"match": {"employees.last_name": "Potts"}}
          ]
        }
      }
    }
  }
}

We get 1 hit this time, as expected (“NBC Universal”).

Finding the Cause of a Hit
• Suppose we search for companies that have an employee
with the first name “Tony”
‒ both companies (“Stark Enterprises” and “NBC Universal”) are a
“hit”, but the response does not reveal specifically which nested
employee generated the hit:

GET companies/_search
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "match": {
          "employees.first_name": "Tony"
        }
      }
    }
  }
}

The inner_hits Query
• The inner_hits query reveals which specific nested object
(or objects) caused the hit
‒ inner_hits is added to your nested query:

GET companies/_search
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "match": {
          "employees.first_name": "Tony"
        }
      },
      "inner_hits" : {}
    }
  }
}

You can add “inner_hits” to a “nested” query.

The inner_hits Query
• The output from inner_hits looks like:
‒ the top-level “_source” is a company that has an employee named “Tony”
‒ the “_source” inside “inner_hits” is the employee named “Tony” that
caused the hit

{
  "_index": "companies",
  "_type": "doc",
  "_id": "1",
  "_score": 0.6931472,
  "_source": {
    "name": "Stark Enterprises",
    "employees": [
      ...
    ]
  },
  "inner_hits": {
    "employees": {
      "hits": {
        "total": 1,
        "max_score": 0.6931472,
        "hits": [
          {
            "_nested": {
              "field": "employees",
              "offset": 0
            },
            "_score": 0.6931472,
            "_source": {
              "first_name": "Tony",
              "last_name": "Stark"
            }
          }
        ]
      }
    }
  }
}
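On the client side, the nested objects that caused a hit can be read out of a response shaped like the one above. The following is a minimal sketch that works on a plain Python dict; the helper name matching_nested_objects is hypothetical, and the sample hit is abbreviated from the slide.

```python
def matching_nested_objects(hit, path):
    """Return the _source of every nested object listed under inner_hits
    for the given nested path (an empty list if inner_hits is absent)."""
    inner = hit.get("inner_hits", {}).get(path, {})
    return [h["_source"] for h in inner.get("hits", {}).get("hits", [])]


# A hit shaped like the response on this slide (employees array omitted).
hit = {
    "_index": "companies",
    "_id": "1",
    "_source": {"name": "Stark Enterprises"},
    "inner_hits": {
        "employees": {
            "hits": {
                "total": 1,
                "hits": [
                    {
                        "_nested": {"field": "employees", "offset": 0},
                        "_source": {"first_name": "Tony", "last_name": "Stark"},
                    }
                ],
            }
        }
    },
}

culprits = matching_nested_objects(hit, "employees")
print(hit["_source"]["name"], culprits)
```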
Sorting on a
Nested Type
Sorting Nested Queries
• It is possible to sort by the value of a nested field
‒ even though the value exists in a separate nested document
• Sorting in nested queries is different from sorting in non-nested
queries
‒ we have to repeat the original query in the “sort” clause
‒ otherwise the sort value is taken from all of a hit’s nested objects,
not just the ones that matched the nested query
Example of Nested Sorting
• Notice the “match” query is repeated in “sort” in a
“nested_filter” clause:

GET companies/_search
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "match": {"employees.last_name": "Potts"}
      }
    }
  },
  "sort": {
    "employees.first_name.keyword" : {
      "order" : "asc",
      "nested_path": "employees",
      "nested_filter" : {
        "match": {"employees.last_name": "Potts"}
      }
    }
  }
}
Result of the Nested Sorting:
• Notice the first names are sorted “asc”

      ...
      "employees": [
        {
          "first_name": "Tony",
          "last_name": "Potts"
        }
      ]
    },
    "sort": [
      "Tony"
    ]
  },
      ...
      "employees": [
        {
          "first_name": "Tony",
          "last_name": "Stark"
        },
        {
          "first_name": "Virginia",
          "last_name": "Potts"
        }
      ]
    },
    "sort": [
      "Virginia"
    ]
  }
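Why the query has to be repeated in “nested_filter” can be seen in a small Python model of the sort value. This is a sketch under stated assumptions: the helper sort_value is hypothetical, and it takes the smallest first name per company, which is what an ascending sort uses when no other sort mode is given.

```python
companies = [
    {"name": "Stark Enterprises", "employees": [
        {"first_name": "Tony", "last_name": "Stark"},
        {"first_name": "Virginia", "last_name": "Potts"},
    ]},
    {"name": "NBC Universal", "employees": [
        {"first_name": "Tony", "last_name": "Potts"},
    ]},
]


def sort_value(company, last_name=None):
    """Smallest first_name among the company's nested employees.
    If last_name is given, only employees with that last name are
    considered, which is the role nested_filter plays in the request."""
    names = [
        e["first_name"]
        for e in company["employees"]
        if last_name is None or e["last_name"] == last_name
    ]
    return min(names)


with_filter = {c["name"]: sort_value(c, "Potts") for c in companies}
without_filter = {c["name"]: sort_value(c) for c in companies}
ordered = [c["name"] for c in sorted(companies, key=lambda c: sort_value(c, "Potts"))]

print(with_filter)     # Stark Enterprises sorts on "Virginia"
print(without_filter)  # Stark Enterprises would sort on "Tony"
print(ordered)
```

With the filter, “Stark Enterprises” sorts on “Virginia” (its only employee named Potts), matching the sort values in the result above. Without it, Tony Stark would supply the sort value even though he did not match the query.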
The Nested
Aggregation
The nested Aggregation
• The nested bucket aggregation puts nested objects into a
bucket
‒ Useful for performing sub-aggregations
• The following simple example puts all employees into a
bucket:

GET companies/_search
{
  "size": 0,
  "aggs": {
    "my_employees_bucket": {
      "nested": {
        "path": "employees"
      }
    }
  }
}

• The response:

"aggregations": {
  "my_employees_bucket": {
    "doc_count": 3
  }
}
Example of a nested Aggregation
• Suppose we want to put all employees with the same last
name into the same bucket:

GET companies/_search
{
  "size": 0,
  "aggs": {
    "my_employees_bucket": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "last_name": {
          "terms": {
            "field" : "employees.last_name.keyword"
          }
        }
      }
    }
  }
}
The result of the nested aggregation:
• We only have two unique last names in our small dataset:

"aggregations": {
  "my_employees_bucket": {
    "doc_count": 3,
    "last_name": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Potts",
          "doc_count": 2
        },
        {
          "key": "Stark",
          "doc_count": 1
        }
      ]
    }
  }
}
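The numbers in this result can be reproduced by hand. The sketch below models the nested bucket as “all employee objects of all companies” and the terms sub-aggregation as a count per last name. It is plain Python over the chapter's sample data, not a call to Elasticsearch.

```python
from collections import Counter

companies = [
    {"name": "Stark Enterprises", "employees": [
        {"first_name": "Tony", "last_name": "Stark"},
        {"first_name": "Virginia", "last_name": "Potts"},
    ]},
    {"name": "NBC Universal", "employees": [
        {"first_name": "Tony", "last_name": "Potts"},
    ]},
]

# The nested bucket: every nested employee object, across all companies.
employees = [e for c in companies for e in c["employees"]]
doc_count = len(employees)

# The terms sub-aggregation: one bucket per last name, largest first.
buckets = Counter(e["last_name"] for e in employees).most_common()

print(doc_count)  # 3
print(buckets)    # [('Potts', 2), ('Stark', 1)]
```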
Parent/Child Types
The Need for Parent/Child Types
• Nested objects are great for many use cases
‒ But nested objects all live within the same document
‒ updating a nested object requires a complete reindexing of the
root object AND all of its other nested objects
• Using a parent/child data type, you can completely
separate two objects while maintaining their relationship
‒ the parent and children are completely separate documents
‒ the parent can be updated without reindexing the children
‒ children can be added/changed/deleted without affecting the
parent or any other children
• Configured in your mappings using the _parent setting
Parent/Child and Shards
• Behind the scenes, the parent and all its children must live
on the same shard
‒ This makes query-time joins very fast
• Remember how documents are routed using the hash
function?
‒ Hashing the document’s own id will not work for a child document:
it must get routed to the same shard as its parent
‒ The parent’s id is used as the routing value for the child
document
• Therefore, every time you refer to a child document, you
must specify its parent’s id
‒ We will see how this is implemented next…
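The routing rule can be sketched in Python. This is an illustration only: it uses zlib.crc32 as a stand-in hash (Elasticsearch uses its own hash function), the helper name shard_for is hypothetical, and 5 shards is the default number of primaries in this version.

```python
import zlib

NUM_SHARDS = 5  # default number of primary shards in Elasticsearch 5.x


def shard_for(doc_id, parent_id=None, num_shards=NUM_SHARDS):
    """Pick a shard as hash(routing) % num_shards.
    The routing value is the parent's id for a child document and the
    document's own id otherwise. crc32 is only a stand-in hash."""
    routing = parent_id if parent_id is not None else doc_id
    return zlib.crc32(routing.encode("utf-8")) % num_shards


# A parent is routed by its own id; its children are routed by the parent's id.
print(shard_for("c1"))
print(shard_for("emp1", parent_id="c1"))
print(shard_for("emp2", parent_id="c1"))
```

Because the children reuse the parent's id as their routing value, all three calls return the same shard number. Looking up “emp1” without the parent id would hash “emp1” instead and could point at a different shard.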
Defining a Parent/Child Relationship
• Let’s go through the steps of defining a parent/child
relationship
1. Define the mappings
2. Index some parent documents
3. Index some child documents
4. Query the documents

1. Define the mappings
• You have to define the parent mapping before any parent
documents can be indexed
‒ Use the _parent property to specify the parent type
‒ The following “employee” type has “company” as its parent:

PUT companies
{
  "mappings": {
    "company" : {},
    "employee" : {
      "_parent": {
        "type": "company"
      }
    }
  }
}
2. Index some parent documents
• A parent is not aware of its children, so indexing parent
documents is simply a matter of putting them into
Elasticsearch
‒ a child can actually be indexed prior to its parent, but we will just
index the parents first

PUT companies/company/c1
{
  "name" : "Stark Enterprises"
}

PUT companies/company/c2
{
  "name" : "NBC Universal"
}
3. Index some child documents
• A child document needs to specify its parent when it is
indexed
‒ Use the “parent” parameter to specify the parent id

‒ parent c1 is “Stark Enterprises”, parent c2 is “NBC Universal”

PUT companies/employee/emp1?parent=c1
{
  "first_name" : "Tony",
  "last_name" : "Stark"
}

PUT companies/employee/emp2?parent=c1
{
  "first_name" : "Virginia",
  "last_name" : "Potts"
}

PUT companies/employee/emp3?parent=c2
{
  "first_name" : "Tony",
  "last_name" : "Potts"
}
4. Query the documents
• Notice that querying the index results in 5 separate
documents:

GET companies/_search

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      ...
    ]
  }
}

• Typical parent/child queries involve has_child and
has_parent
The has_child Query
• The has_child query is a filter that accepts a query and the
child type to run against
‒ It returns parent documents that have child docs matching the
query
‒ Parent and child documents are routed to the same shard, so
joining them when searching will be in-memory and efficient
• The syntax looks like:
‒ “parent” in the URL is the parent type
‒ “TYPE” is the child type

GET my_index/parent/_search
{
  "query": {
    "has_child": {
      "type": "TYPE",
      "query": {}
    }
  }
}
Example of has_child

“I want all companies that have an employee named “Stark””

GET companies/company/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "match": {
          "last_name": "Stark"
        }
      }
    }
  }
}

• The response:

"hits": [
  {
    "_index": "companies",
    "_type": "company",
    "_id": "c1",
    "_score": 1,
    "_source": {
      "name": "Stark Enterprises"
    }
  }
]
The inner_hits Query
• Notice in the previous query that “Stark Enterprises” has
an employee with the last name “Stark”, but we do not
know which employee caused the hit
‒ Use the inner_hits query to get the relevant children from the
has_child query:

GET companies/company/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "match": {
          "last_name": "Stark"
        }
      },
      "inner_hits" : {}
    }
  }
}

• Excerpt of the response:

"_source": {
  "name": "Stark Enterprises"
},
"inner_hits": {
  "employee": {
    "hits": {
      "total": 1,
      "max_score": 0.6931472,
      "hits": [
        {
          "_source": {
            "first_name": "Tony",
            "last_name": "Stark"
          }
          ...
The has_parent Query
• The has_parent query is executed in the parent document
space
‒ Returns child documents whose associated parents have
matched
• The syntax looks like:
‒ “child” in the URL is the child type
‒ “TYPE” is the parent type

GET my_index/child/_search
{
  "query": {
    "has_parent": {
      "parent_type": "TYPE",
      "query": {}
    }
  }
}
Example of has_parent

“I want all employees that work for “NBC””

GET companies/employee/_search
{
  "query": {
    "has_parent": {
      "parent_type": "company",
      "query": {
        "match": {
          "name": "NBC"
        }
      }
    }
  }
}

• Excerpt of the response:

"_source": {
  "first_name": "Tony",
  "last_name": "Potts"
}
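The two join directions can be compared side by side in a small Python model. This is a sketch, not the Elasticsearch implementation: the functions has_child and has_parent are hypothetical helpers over in-memory dicts, the predicates stand in for the inner “match” queries, and the data is the five documents indexed earlier in this chapter.

```python
parents = {
    "c1": {"name": "Stark Enterprises"},
    "c2": {"name": "NBC Universal"},
}
children = [
    {"_id": "emp1", "parent": "c1", "first_name": "Tony", "last_name": "Stark"},
    {"_id": "emp2", "parent": "c1", "first_name": "Virginia", "last_name": "Potts"},
    {"_id": "emp3", "parent": "c2", "first_name": "Tony", "last_name": "Potts"},
]


def has_child(predicate):
    """Parents with at least one matching child, as {parent_id: [children]}.
    The list of children plays the role of inner_hits."""
    result = {}
    for child in children:
        if predicate(child):
            result.setdefault(child["parent"], []).append(child)
    return result


def has_parent(predicate):
    """Children whose parent document matches the predicate."""
    matching = {pid for pid, doc in parents.items() if predicate(doc)}
    return [c for c in children if c["parent"] in matching]


# Companies that have an employee named "Stark", plus the child that matched.
child_hits = has_child(lambda emp: emp["last_name"] == "Stark")
# Employees that work for "NBC" (a crude stand-in for a match on "name").
parent_hits = has_parent(lambda company: "nbc" in company["name"].lower().split())

print({parents[pid]["name"]: [c["first_name"] for c in kids]
       for pid, kids in child_hits.items()})
print([(c["first_name"], c["last_name"]) for c in parent_hits])
```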
Accessing a Child Document
• The Document APIs require the parent parameter for routing
purposes
‒ A child document is stored on the same shard as its parent
‒ The APIs need to know the routing id of the parent to find the
child

• Without the parent parameter, the request fails:

GET companies/employee/emp1

{
  "error": {
    "root_cause": [
      {
        "type": "routing_missing_exception",
        "reason": "routing is required for [companies]/[employee]/[emp1]",
        "index_uuid": "_na_",
        "index": "companies"
      }
    ],
    "type": "routing_missing_exception",
    ...

• With the parent parameter, the child is found:

GET companies/employee/emp1?parent=c1

{
  "_index": "companies",
  "_type": "employee",
  "_id": "emp1",
  "_version": 2,
  "_routing": "c1",
  "_parent": "c1",
  "found": true,
  "_source": {
    "first_name": "Tony",
    "last_name": "Stark"
  }
}
Updating Child Documents
• One of the key benefits of a parent/child relationship is the
ability to modify a child object independently of the parent
‒ For example, changing an employee has no effect on its parent
(or any of its siblings either)
‒ The following request changes “Tony” to “Anthony”:

POST companies/employee/emp1/_update?parent=c1
{
  "doc" : {
    "first_name" : "Anthony"
  }
}
Choosing a Technique
1. Can all related items fit in a single JSON doc?
‒ No: consider parent/child
‒ Yes: go to 2
2. Are docs updated frequently?
‒ Yes: consider parent/child
‒ No: go to 3
3. Do your docs contain arrays of objects?
‒ No: denormalize your data
‒ Yes: go to 4
4. Do your queries test more than 1 property when matching
objects in these arrays?
‒ Yes: consider nested docs
‒ No: denormalize your data
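The decision flow on this slide can be written down as a small function. This is one reading of the flowchart, not an official rule: the function name and parameter names are made up, and the last branch (queries that test only one property fall back to denormalizing) is an assumption about where the final “No” arrow leads.

```python
def choose_technique(fits_in_single_doc, updated_frequently,
                     has_object_arrays, queries_test_multiple_properties):
    """Walk the questions of the flowchart in order and return a suggestion."""
    if not fits_in_single_doc:
        return "parent/child"
    if updated_frequently:
        return "parent/child"
    if not has_object_arrays:
        return "denormalize"
    if queries_test_multiple_properties:
        return "nested"
    # Assumption: the remaining "No" branch leads back to denormalizing.
    return "denormalize"


# The companies/employees example: small docs, rarely updated, arrays of
# employee objects, and queries on first_name AND last_name together.
suggestion = choose_technique(True, False, True, True)
print(suggestion)  # nested
```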
Kibana Considerations
• Kibana currently does not support nested types or
parent/child relationships
‒ Important to consider this limitation if you are using your indices
for dashboards and visualizations in Kibana
• Kibana may implement these relationships in the future
‒ but for now, you will have to decide which is more important:
using nested or parent/child relationships, or being able to
visualize your data in Kibana
Chapter Review
Summary
• The nested type allows arrays of objects to be indexed and
queried independently of each other
• Sorting by a nested field requires you to repeat the original
query in the “sort” clause
• The nested bucket aggregation puts nested objects into a
bucket
• Updating a nested object requires a complete reindexing of
the root object AND all of its other nested objects
• Using a parent/child data type, you can completely
separate two objects while maintaining their relationship
• One of the key benefits of a parent/child relationship is the
ability to modify a child object independently of the parent
Quiz
1. True or False: Updating a nested inner object causes the
root object and all other nested objects to be reindexed.
2. True or False: Deleting a child object actually causes the
_parent object and all other siblings to be reindexed.
3. Why not just use a parent/child relationship all the time (as
opposed to nested types) when dealing with relational
objects?
4. True or False: Child objects are routed to the same shard
as their _parent object.
5. True or False: Deleting a parent object causes all child
objects to be deleted.
6. True or False: You can index a child object whose _parent
does not exist.
Lab 10
Handling Relationships

Conclusions
Development vs.
Production Mode
HTTP vs. Transport
• There are two important network communication
mechanisms in Elasticsearch to understand:
‒ HTTP: address and port to bind to for HTTP communication,
which is how the Elasticsearch REST APIs are exposed
‒ transport: used for internal communication between nodes
within the cluster
• The defaults are fine for downloading and playing with
Elasticsearch, but not useful for production systems
‒ bind to localhost by default
‒ HTTP binds to the first available port in the range 9200-9299
‒ transport binds to the first available port in the range 9300-9399
Development vs. Production Mode
• Every Elasticsearch 5.x instance is either in development
mode or production mode:
‒ development mode: if it does not bind transport to an external
interface (the default)
‒ production mode: if it does bind transport to an external
interface
• my_dev_cluster: “I am just a local cluster used for development.”
http.port: 9200
http.host: localhost
transport.tcp.port: 9300
transport.bind_host: localhost

• my_production_cluster: “I am running in a production environment.”
http.port: 9200
http.host: 192.168.1.21
transport.tcp.port: 9300
transport.bind_host: 192.168.1.21
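The rule on this slide can be sketched as a check on the transport bind address. This is a simplified model for illustration, not Elasticsearch's own logic: the function names are hypothetical, and any host that is not “localhost” or a loopback IP is treated as an external interface.

```python
import ipaddress


def is_loopback(host):
    """True for "localhost" and loopback IP literals such as 127.0.0.1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname): treat as external.
        return False


def mode(transport_bind_host):
    """Development mode unless transport is bound to an external interface."""
    return "development" if is_loopback(transport_bind_host) else "production"


print(mode("localhost"))     # development
print(mode("192.168.1.21"))  # production
```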
Bootstrap Checks
• Elasticsearch has bootstrap checks upon startup:
‒ inspect a variety of Elasticsearch and system settings
‒ compare them to values that are safe for the operation of
Elasticsearch
• Bootstrap checks behave differently depending on the
mode:
‒ development mode: any bootstrap checks that fail appear as
warnings in the Elasticsearch log
‒ production mode: any bootstrap checks that fail will cause
Elasticsearch to refuse to start
‒ https://www.elastic.co/guide/en/elasticsearch/reference/master/bootstrap-checks.html
Documentation and Help
• Discussion forums: http://discuss.elastic.co/
• Meetups: http://elasticsearch.meetup.com
• Docs: https://elastic.co/docs
• Community: https://elastic.co/community
• More resources: https://elastic.co/learn
Elastic Training Paths


Quiz Answers
Chapter 1 Quiz Answers
1. True
2. As a noun: an index is what Elasticsearch creates for your
documents. As a verb: we “index” documents into an index
3. True
4. True - much more efficient than separate requests
5. Elasticsearch will define a new index for you
6. No. If you want Elasticsearch to generate an ID, you have
to use POST
Chapter 2 Quiz Answers
1. Multiple match terms use “and” or “or”, while multiple terms
in match_phrase are “and” and position matters
2. slop
3. now-12h
4. 70% of 5 = 3.5 and it rounds down, so 3 need to match
5. 10 docs from my_index, but only the “username” field in
the _source is returned
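The arithmetic behind answer 4 can be checked with integer math. The helper name required_matches is hypothetical; it only reproduces the “percentage, rounded down” rule stated in the answer.

```python
def required_matches(percent, optional_clauses):
    """Number of clauses that must match for a percentage value,
    rounding down as described in answer 4."""
    return (percent * optional_clauses) // 100


needed = required_matches(70, 5)
print(needed)  # 3
```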
Chapter 3 Quiz Answers
1. text analysis
2. Exact values are not analyzed vs. full text is analyzed
3. character filter, tokenizer, token filter
4. “happy” is analyzed to “happi”
5. Three tokens: "192.168.1.201", "jing", and "https://training.Elastic.co/"
Chapter 4 Quiz Answers
1. “value” would be a “long”
2. The document is not indexed and an error occurs
3. True
4. False: you cannot change the data type of any field in a
mapping
5. “my_custom_analyzer” at indexing, and “english” at
search time
Chapter 5 Quiz Answers
1. 1
2. The query would search for “cat”, “kitten” or “pet” in the
queried field
3. multi_match
4. The search terms in a full-text query are analyzed.
Term-level queries are not analyzed.
5. We are searching for “java” in both the body and subject.
“most_fields” gives a higher score to documents with
“java” in both fields; “best_fields” returns the score of the
better field
Chapter 6 Quiz Answers
1. Document based, using hash(_id) % num_shards to
determine which shard a doc is routed to
2. Since no settings are defined, the number of shards and
replicas default to 5 and 1, respectively, which means 10
shards
3. False - there is nothing to do when adding a new node,
except to wait
4. 12 = 4 primary shards + 8 replicas
5. 2
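The shard counts in answers 2 and 4 follow from one formula: every primary shard has the configured number of replica copies. In the sketch below the function name is hypothetical, and the value of 2 replicas per primary for answer 4 is inferred from “4 primary shards + 8 replicas”.

```python
def total_shards(primaries, replicas_per_primary):
    """Primaries plus all of their replica copies."""
    return primaries * (1 + replicas_per_primary)


default_total = total_shards(5, 1)  # defaults: 5 primaries, 1 replica each
answer_4_total = total_shards(4, 2)  # assumed: 2 replicas per primary

print(default_total)   # 10
print(answer_4_total)  # 12
```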
Chapter 7 Quiz Answers
1. Skewed data, like what can occur with time-series data;
and small datasets that are more likely to have skew
2. True, but only if fielddata is enabled.
3. True
4. _score
5. True
6. “from” and “size”
7. scroll is a snapshot, pagination is “dynamic”
8. False. You “can” call _refresh, but it is certainly not
required or recommended
Chapter 8 Quiz Answers
1. actually both query and aggregation
2. Query
3. Metrics, bucket, matrix, pipeline
4. terms
5. Increase shard_size
6. cardinality
Chapter 9 Quiz Answers
1. Date histogram + terms/filters is probably the simplest
solution
2. Significant terms aggregation
3. Use terms aggregation for the username buckets, then a
top_hits sub-aggregation
4. You could do an aggregation of the tech industry stocks,
along with a global agg over all stocks
Chapter 10 Quiz Answers
1. True
2. False
3. There is an overhead to parent/child - they are separate
documents that must be joined. The join is fast, but nested
objects are all a single document which will always be
faster for searches (but more expensive for updates)
4. True - using the parent parameter
5. False
6. True
Thank you!
Please complete the online survey
Course: Core Elasticsearch for Developers

Version 5.5.1

© 2015-2017 Elasticsearch BV. All rights reserved. Decompiling, copying, publishing and/or distribution without written consent of Elasticsearch BV is
strictly prohibited.
