CoreDeveloper-5 5 1

Core Elasticsearch Developer
An Elastic Training Course
Sebastien Gerard-09-Sep-2017-Business & Decision
training.elastic.co
5.5.1
Course: Core Elasticsearch Developer
Version 5.5.1
© 2015-2017 Elasticsearch BV. All rights reserved. Decompiling, copying, publishing and/or distribution without written consent of Elasticsearch BV is
strictly prohibited.

Lab Prep Checklist
• Download the PDF and zip file from links provided in email
• Minimum requirements:
‒ 20% or more free disk space
‒ Install Java
‒ version 1.8.0_20 minimum/1.8.0_73 or newer recommended
‒ 64-bit JDK preferred
‒ set JAVA_HOME environment variable
• Unzip the downloaded lab zip file on your hard drive
‒ this creates a directory called CoreDeveloper
‒ do not unzip the lab file into a folder with spaces in the name
Agenda and Introductions
Course Agenda
1 Introduction to Elasticsearch
2 The Search API
3 Text Analysis
4 Mappings
5 More Search Features
6 The Distributed Model
7 Working with Search Results
8 Aggregations
9 More Aggregations
10 Handling Relationships

Introductions
• Name
• Company
• What do you do?
• What are you using Elasticsearch for?
• What do you hope to get out of this training?

Logistics
• Facilities
• Emergency Exits
• Restrooms
• Breaks/Lunch

Chapter 1
Introduction to Elasticsearch
Topics covered:
• The Story of Elasticsearch
• The Components of Elasticsearch
• Documents
• Indexes
• Indexing Data
• Searching Data
• The Bulk API

The Story of
Elasticsearch
Once upon a time…
• As any good story begins, “Once upon a time…”
‒ More precisely: in 1999, Doug Cutting created an open-source
project called Lucene
• Lucene is:
‒ a search engine library entirely written in Java
‒ a top-level Apache project, as of 2005
‒ great for full-text search
• But, Lucene is also:
‒ a library (you have to incorporate it into your application)
‒ challenging to use
‒ not originally designed for scaling

“Search is something that any application should have”
Shay Banon
Creator of Elasticsearch
The Birth of Elasticsearch
• In 2004, Shay Banon developed a product called Compass
‒ Built on top of Lucene, Shay’s goal was to have search
integrated into Java applications as simply as possible
• The need for scalability became a top priority
• In 2010, Shay completely rewrote Compass with two main
objectives:
1. distributed from the ground up in its design
2. easily used by any other programming language
• He called it Elasticsearch
‒ …and we all lived happily ever after!
• Today Elasticsearch is the most popular enterprise search engine
1. Distributed search:
• Elasticsearch is distributed and scales horizontally:
‒ A node is an instance of Elasticsearch
‒ A cluster is a collection of Elasticsearch nodes
‒ Your cluster can grow as your needs grow
[Diagram: a 1 node cluster, a 10 node cluster, and a 100+ node cluster with master nodes, ingest nodes, hot data nodes, and warm data nodes]

2. Easily used by other languages:
• Elasticsearch provides REST APIs for communicating with
a cluster over HTTP
‒ Allows client applications to be written in any language
‒ Java, .NET, PHP, Groovy, Ruby, Python, Perl, or whatever you want!
[Diagram: a client app sends an HTTP request to the REST APIs and receives an HTTP response]

Elasticsearch History
• 0.4: first version was released in February, 2010
• 1.0: released in February, 2014
• 2.0: released in October, 2015
• 5.0: released in October, 2016
• Why the big jump in version number?
‒ The Elastic Stack…

The Elastic Stack
• Elasticsearch is a component of the Elastic Stack
[Diagram: the Elastic Stack, made up of Kibana, Elasticsearch, Beats, and Logstash, together with X-Pack (Security, Alerting, Monitoring, Reporting, Graph, Machine Learning) and Elastic Cloud]

The Components of Elasticsearch
The Components of Elasticsearch
• The main components of Elasticsearch are:
‒ node: an instance of Elasticsearch
‒ cluster: one or more nodes that work together to serve the same
data and API requests
‒ document: a piece of data that you want to search
‒ index: a collection of documents that have somewhat similar
characteristics
• Let’s take a look at documents and indexes (indices)…

Documents
Documents
• A document can be any text or numeric data you want to
search and/or analyze
• For example:
‒ an entry in a log file,
‒ a comment made by a customer on your website,
‒ the weather details from a weather station at a specific moment
• Specifically, a document is a top-level object that is serialized into JSON and stored in Elasticsearch
• Every document has a unique ID
‒ which either you provide
‒ or Elasticsearch generates for you

Documents must be JSON objects
• Suppose you have some stock prices in a CSV format
‒ The data needs to be converted to JSON
‒ Each row of data represents a single document
‒ each row of stocks.csv has stock price data:
20100526,PPG,62.83,62.83,61.46,61.86,32203
‒ convert the CSV to JSON (JSON consists of fields and values):
{
"volume" : 32203,
"high" : 62.83,
"stock_symbol" : "PPG",
"low" : 61.46,
"close" : 61.86,
"trade_date" : "2010-05-26T06:00:00.000Z",
"open" : 62.83
}
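The CSV-to-JSON step above can be sketched in Python. This is only an illustration: the column order in COLUMNS is an assumption inferred from the single sample row, the helper name row_to_document is hypothetical, and the date is emitted as a plain day (the slide's timestamp also carries a time and time zone that the CSV row does not contain).

```python
import csv
import io
import json
from datetime import datetime

# Assumed column order for stocks.csv, inferred from the sample row.
COLUMNS = ["trade_date", "stock_symbol", "open", "high", "low", "close", "volume"]


def row_to_document(row):
    """Convert one CSV row (a list of strings) into a dict ready to serialize as JSON."""
    raw = dict(zip(COLUMNS, row))
    return {
        # 20100526 -> 2010-05-26 (date only; no time zone is available in the CSV)
        "trade_date": datetime.strptime(raw["trade_date"], "%Y%m%d").strftime("%Y-%m-%d"),
        "stock_symbol": raw["stock_symbol"],
        "open": float(raw["open"]),
        "high": float(raw["high"]),
        "low": float(raw["low"]),
        "close": float(raw["close"]),
        "volume": int(raw["volume"]),
    }


sample = "20100526,PPG,62.83,62.83,61.46,61.86,32203\n"
documents = [row_to_document(row) for row in csv.reader(io.StringIO(sample))]
print(json.dumps(documents[0], indent=2))
```

Each row becomes one document, so a file with N rows yields N JSON objects.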

Indexes
Definition of an Index
• An index in Elasticsearch is a logical way of grouping data:
‒ An index contains mappings that define the documents’ field
names and data types of the index
‒ an index is a logical namespace that maps to where its contents
are stored in the cluster
• There are two different concepts in this definition:
‒ an index has some type of data schema mechanism
‒ an index has some type of mechanism to distribute data across a cluster

You can have lots of indexes:
• Indexes are fairly lightweight in Elasticsearch, so it is not
unusual to create lots of them
• For example, you might define a new index every day for searching logs
‒ Using multiple indices allows you to organize your data logically:
‒ logs-2017-01-01
‒ logs-2017-01-02
‒ logs-2017-01-03
‒ and so on
[Diagram: the indices /logs-2017-01-01, /logs-2017-01-02, and /logs-2017-01-03]
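As a small illustration of the daily naming pattern above, the following Python sketch generates index names for consecutive days. The function name daily_index_names and the "logs" prefix argument are hypothetical; only the logs-YYYY-MM-DD pattern comes from the slide.

```python
from datetime import date, timedelta


def daily_index_names(prefix, start, days):
    """Return one index name per day, e.g. logs-2017-01-01, logs-2017-01-02, ..."""
    return [f"{prefix}-{(start + timedelta(days=i)).isoformat()}" for i in range(days)]


names = daily_index_names("logs", date(2017, 1, 1), 3)
print(names)
```

A client that indexes log events would typically pick the index name from the event's own date, so each day's documents land in that day's index.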

When we say “index”, we mean…
• We use the word “index” in multiple ways:
‒ Noun: a document is put into an index in Elasticsearch
‒ Verb: to index a document is to put the document into an index
in Elasticsearch
‒ documents are indexed into an index
‒ data ingestion involves indexing documents
/my_index1
{
"volume": 66605,
"high": 28.86,
"stock_symbol": "ALL",
"low": 28.37,
"close": 28.82,
"trade_date": "2009-12-18T07:00:00.000Z",
...
/my_index2
{
"volume": 46965,
"high": 31.56,
"stock_symbol": "ALL",
"low": 30.68,
"close": 30.91,
"trade_date": "2010-01-15T07:00:00.000Z",
...
/my_index3
{
"volume": 32203,
"high": 62.83,
"stock_symbol": "PPG",
"low": 61.46,
"close": 61.86,
"trade_date": "2010-05-26T06:00:00.000Z",
...

Indexing Data
Let’s index some simple JSON documents:
‒ “username” and “comment” are fields; the text after each colon is the value
{
"username" : "harrison",
"comment" : "My favorite movie is Star Wars!"
}
{
"username" : "maria",
"comment" : "The North Star is right above my house."
}
{
"username" : "eniko",
"comment" : "My favorite movie star in Hollywood is Harrison Ford."
}

Define an Index
• Clients communicate with a cluster using Elasticsearch’s
REST APIs
‒ There is a collection of Document APIs for working with indexes
and documents
• An index is defined using the Create Index API, which can be
accomplished with a simple PUT command:
‒ “200 OK” response if successful
curl -XPUT 'http://localhost:9200/my_index' -H "Content-Type: application/json"
‒ -XPUT is the PUT command
‒ my_index is the name of the new index

Index a Document
• The Index API is used to index a document
• Use a PUT again, but this time send the document in the
body of the PUT request:
‒ notice you specify a document type and a unique ID
‒ “201 Created” response if successful
curl -XPUT 'http://localhost:9200/my_index/doc/1' -i -H "Content-Type: application/json" -d '
{
"username":"harrison",
"comment":"My favorite movie is Star Wars!"
}'
‒ in the URL, my_index is the index, doc is the document type, and 1 is the document id
‒ “doc” can actually be any name we choose for our document type

Dynamic Index Creation
• Elasticsearch will create the index for you during the
indexing of a document, if the index is not already created:
‒ we could have simply executed the command on the previous
slide (and skipped that first “PUT my_index” command)
curl -XPUT 'http://localhost:9200/my_index/doc/1' -H "Content-Type: application/json" -d '
{
"username":"harrison",
"comment":"My favorite movie is Star Wars!"
}'
‒ my_index is created if it does not exist
‒ the doc type is created if it does not exist
‒ If the id already exists, the document is reindexed

Index Without Specifying an ID
• You can leave off the id and let Elasticsearch generate one
for you:
‒ But notice that only works with POST, not PUT
curl -XPOST 'http://localhost:9200/my_index/doc/' -i -H "Content-Type: application/json" -d '
{
"username" : "maria",
"comment" : "The North Star is right above my house."
}'
‒ POST instead of PUT
‒ No id specified

Index Without Specifying an ID
• If the POST is successful, the auto-generated ID is returned
in the response:
HTTP/1.1 201 Created
Location: /my_index/doc/AVnWIzyrUxRN673zt1An
content-type: application/json; charset=UTF-8
content-length: 220

{
"_index" : "my_index",
"_type" : "doc",
"_id" : "AVnWIzyrUxRN673zt1An",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"created" : true
}
‒ 201 response if successful

Retrieving a Document
• Use GET to retrieve an indexed document
‒ Returns 200 if the document is found
‒ Returns a 404 error if the document is not found
curl -XGET 'http://localhost:9200/my_index/doc/1' -H "Content-Type: application/json"
{
"_index": "my_index",
"_type": "doc",
"_id": "1",
"_version": 1,
"found": true,
"_source": {
"username": "harrison",
"comment": "My favorite movie is Star Wars!"
}
}

The _source Endpoint
• You can get just the source of a document using the
_source endpoint
curl -XGET 'http://localhost:9200/my_index/doc/1/_source' -H "Content-Type: application/json"
{
"username": "harrison",
"comment": "My favorite movie is Star Wars!"
}

Searching Data
“Hello, Search!”
• Let’s run our first search
‒ sort of a “Hello, World!” for Elasticsearch
• The simplest search is a match_all, which simply matches
all documents in the index (or indices) being searched:
curl -XGET 'http://localhost:9200/my_index/_search' -H "Content-Type: application/json" -d '
{
"query": {
"match_all": {}
}
}'
‒ Use GET or POST
‒ Invoke the _search endpoint
‒ Add a “query” clause
‒ “match_all” matches all documents

…and the search results are also JSON
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 10,
"successful": 10,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_type": "doc",
"_id": "AVlBSnbcmM0CN4-9FrtP",
"_score": 1,
"_source": {
"username": "maria",
"comment": "The North Star is right above my house."
}
},
...
‒ took = the number of milliseconds it took to process the query
‒ total = the number of documents found
‒ hits = array containing the documents that “hit” the search criteria
‒ _id = the document’s unique ID

A Simpler “Hello, World”
• If you want a search that matches all documents, you can
simply send a request to the _search endpoint
‒ and omit the “query” body
‒ same results as running a match_all
‒ by default, Elasticsearch returns 10 results
curl -XGET 'http://localhost:9200/my_index/_search' -H "Content-Type: application/json"

CRUD Operations
CRUD Operations
• We will discuss these in more detail later in the course:
Index:
PUT my_index/doc/4
{
"username" : "kimchy",
"comment" : "I like search!"
}
Create:
PUT my_index/doc/4/_create
{
"username" : "kimchy",
...
}
Read:
GET my_index/doc/4
Update:
POST my_index/doc/4/_update
{
"doc" : {
"comment" : "I love search!!"
}
}
Delete:
DELETE my_index/doc/4

Index vs. Create
• The _create operation can explicitly be used when indexing
a document
‒ If the id already exists, the operation fails
PUT my_index/doc/5/_create
{
"username" : "ricardo",
"comment" : "I love search too"
}
‒ Fails if id=5 is already indexed
PUT my_index/doc/6
{
"comment" : "It's a great day for search"
}
‒ Indexes a new document if id=6 is not already indexed. Otherwise, the current doc is deleted and this new one is indexed.

The Multi Get API
• The Multi Get API allows you to GET multiple documents in
a single request
‒ Avoid multiple round trips
‒ Use the _mget endpoint
‒ “docs” is an array of documents to GET
curl -XGET 'http://localhost:9200/_mget' -H "Content-Type: application/json" -d '
{
"docs" : [
{
"_index" : "INDEX_NAME",
"_type": "TYPE",
"_id" : "ID"
},
{
"_index" : "INDEX_NAME",
"_type": "TYPE",
"_id" : "ID"
}
]
}'

Example of _mget
Request:
curl -XGET 'http://localhost:9200/_mget' -H "Content-Type: application/json" -d '
{
"docs" : [
{
"_index" : "products",
"_type" : "product",
"_id" : 101
},
{
"_index" : "my_index",
"_type" : "doc",
"_id" : 4
}
]
}'
Response:
{
"docs": [
{
"_index": "products",
"_type": "product",
"_id": "101",
"_version": 1,
"found": true,
"_source": {
"brandName": "Kraft",
"customerRating": 5,
"price": 4.29,
"grp_id": "101",
"quantitySold": 844255,
"upc12": "021000023233",
"productName": "Kraft Velveeta Shells & Cheese 2% Milk"
}
},
{
"_type": "doc",
"_id": "4",
"_version": 1,
"found": true,
"_source": {
"username" : "kimchy",
...
}
}]}

The Bulk API
Cheaper in Bulk
• The Bulk API makes it possible to perform many write operations in a single API call, which greatly increases the indexing speed
‒ useful if you need to index a data stream such as log events, which can be queued up and indexed in batches of hundreds or thousands
• Four actions: create, index, update and delete
• The response is a large JSON structure with the individual results of each action that was performed
‒ The failure of a single action does not affect the remaining actions

Syntax of _bulk
• Has a unique syntax based on lines of commands:
‒ Each command appears on a single line
curl -XPOST 'http://localhost:9200/_bulk' -H "Content-Type: application/x-ndjson" -i -d '
{"create" : {...}}
{DOCUMENT}
{"index" : {...}}
{DOCUMENT}
{"delete" : {...}}
'
‒ _bulk is the Bulk API endpoint
‒ Notice the different content-type for _bulk
‒ The line following “create” or “index” is the document that gets indexed
‒ “delete” does not have a following line

Example of _bulk
curl -XPOST 'http://localhost:9200/_bulk' -H "Content-Type: application/x-ndjson" -i -d '

{"create" : {"_index" : "customers", "_type":"customer_type", "_id":102}}
{"firstname" : "Maureen","lastname" : "OHara","address" : "12 Elm St","city":"Brooklyn"}
{"index" : {"_index" : "customers", "_type":"customer_type", "_id":103}}
{"firstname" : "Lucy","lastname" : "Liu","address" : "39 Pine Tree Ln","city":"Albany"}
{"delete" : {"_index" : "customers", "_type":"customer_type", "_id":104}}
'
‒ The final \n is actually necessary
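The line-based _bulk body is easy to get wrong by hand, so here is a minimal Python sketch that assembles it. The helper name build_bulk_body and the tuple layout are hypothetical; the sketch only builds the newline-delimited text and does not send it to a cluster.

```python
import json


def build_bulk_body(actions):
    """Build a newline-delimited _bulk body.

    actions is a list of (action_name, metadata, document) tuples;
    document is ignored for "delete", which has no following line.
    """
    lines = []
    for name, metadata, document in actions:
        lines.append(json.dumps({name: metadata}))
        if name != "delete":
            lines.append(json.dumps(document))
    # The body must end with a final newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("create", {"_index": "customers", "_type": "customer_type", "_id": 102},
     {"firstname": "Maureen", "lastname": "OHara", "address": "12 Elm St", "city": "Brooklyn"}),
    ("index", {"_index": "customers", "_type": "customer_type", "_id": 103},
     {"firstname": "Lucy", "lastname": "Liu", "address": "39 Pine Tree Ln", "city": "Albany"}),
    ("delete", {"_index": "customers", "_type": "customer_type", "_id": 104}, None),
])
print(body, end="")
```

The resulting string is what would be posted to the _bulk endpoint with the application/x-ndjson content type.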

Error Messages
• If things go wrong in your requests:
‒ Unable to connect
Reason: networking issue or cluster down
Action: retry until connection is available (round robin across nodes)
‒ 5xx error
Reason: internal server error in Elasticsearch
Action: retry until successful (using round robin)
‒ 4xx error
Reason: invalid JSON or incorrect request structure
Action: fix bad request before retrying
‒ 429 error
Reason: Elasticsearch is too busy
Action: retry (ideally with linear or exponential backoff)
‒ connection unexpectedly closed
Reason: node died or network issue
Action: retry (with risk of creating a duplicate document)
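The status-code rows of the table above can be sketched as a small decision helper in Python. The function names, the returned labels, and the delay values (0.5 second base, 30 second cap) are all illustrative assumptions, not part of any Elasticsearch client.

```python
def action_for_status(status):
    """Map an HTTP status code to the client action suggested in the table above."""
    if status == 429:
        return "backoff"        # Elasticsearch is too busy: retry with backoff
    if 400 <= status < 500:
        return "fix_request"    # invalid JSON or request structure: do not retry as-is
    if 500 <= status < 600:
        return "retry"          # internal server error: retry, ideally on another node
    return "ok"


def backoff_delays(base_seconds, attempts, cap_seconds=30.0):
    """Exponential backoff: base, 2*base, 4*base, ... capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * (2 ** i)) for i in range(attempts)]


print(action_for_status(429), backoff_delays(0.5, 4))
```

Note that 429 is checked before the general 4xx branch, because it is the one 4xx code that should be retried.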

Chapter Review
Summary
• Elasticsearch is a search and analytics engine built on top
of Apache Lucene
• Elasticsearch is distributed and scales horizontally
• Clients communicate with a cluster using Elasticsearch’s
REST APIs
• Starting with version 5.0, all Elastic Stack products are aligned, tested, and released together
• A document can be any JSON-formatted data you want to search and/or analyze
• An index in Elasticsearch is a logical way of grouping data
• Adding documents to Elasticsearch is called indexing

Quiz
1. True or False: Elasticsearch uses Apache Lucene behind
the scenes to index and search data.
2. What are two different uses of the term “index”?
3. True or False: Every document indexed in Elasticsearch
belongs to an index.
4. True or False: Using the Bulk API is more efficient than
sending multiple, separate requests.
5. What happens if you attempt to index a new document into an index that does not exist?
6. Can you PUT a document into an index without specifying an ID?

Lab 1
An Introduction to Elasticsearch
Chapter 2
The Search API
Topics covered:
• Introduction to the Search API
• URI Searches
• Request Body Searches
• The match Query
• The match_phrase Query
• The range Query
• The bool Query
• Source Filtering
• Relevance
• Boosting Relevance

Introduction to the Search API
The Search API
• The Search API allows you to execute a search query and
get back search hits that match the query
‒ You used the Search API in Lab 1 when you ran a simple
_search query
• Elasticsearch provides a full Query DSL (Domain Specific
Language) to define queries
• Two types of searches:
‒ URI searches
‒ Request body searches

URI Searches
URI Searches
• You can submit a search request using a URI and providing
request parameters
‒ Use a “q” to specify the query string
‒ Uses a special syntax called “query string syntax”
curl -XGET 'http://localhost:9200/my_index/_search?q=comment:hollywood'
‒ “q” is for query string
‒ comment:hollywood searches the “comment” field for “hollywood”

Why URI Searches?
• We will not discuss URI searches in detail
‒ The documentation has a good discussion on the query string
syntax and its many options
• URI searches are limited
‒ Not all search options are exposed
• But, it is good to know that URI searches exist
‒ They can be handy for a quick curl query
• We will focus on request body searches…

Request Body Searches
Request Body Searches
• To take full advantage of the Search API, write your queries
as request body searches using the Query DSL
GET index_to_search/_search
{
"query": {
define your query here...
}
}
‒ index_to_search is the name of the index to be searched
‒ _search routes to the Search API
‒ the query element is where you define your query

Specifying Search Indices
• A search can be performed over multiple indices:
‒ /_search searches everything
‒ /index-1/_search searches only index-1
‒ /index-1/type-1/_search searches only type-1 docs in index-1
‒ /index-1,index-2/_search searches all of index-1 and index-2
‒ /index-*/_search searches all indices that start with index-

The Search API
• The Search API allows you to execute a search query and
get back search hits that match the query
• We will start by looking at the matching queries:
‒ match
‒ match_phrase
• Then we will discuss two other commonly-used queries:
‒ range
‒ bool

The match Query
The match Query
• The match query is a simple but powerful search for fields
that are text, numerical values, or dates
• This is your “go to” query
‒ The main use case for the match query is for full-text search
• The syntax looks like:
GET my_index/_search
{
"query": {
"match": {
"FIELD": "TEXT"
}
}
}
‒ FIELD is the field you want to search
‒ TEXT is the text you want to search for

Example of match
• Let’s search the “comment” fields of the documents in
my_index
‒ Find all documents that have “favorite” or “star” in the comment.
{
"query": {
"match": {
"comment": "favorite star"
}
}
}

Query Response
• By default, the documents are returned sorted by _score
"hits": {
"total": 3,
"max_score": 0.6219729,
"hits": [
{
"_type": "doc",
"_id": "1",
"_score": 0.6219729,
"_source": {
...
}
},
{
"_type": "doc",
"_id": "3",
"_score": 0.5306678,
"_source": {
"username": "eniko",
"comment": "My favorite movie star in Hollywood is Harrison Ford."
}
},
{
...
‒ Three documents were a match
‒ Each hit has a _score
‒ Results are sorted by _score (by default)

Changing match to “and”
• Add the operator property and specify “and” or “or”
‒ You have to add a “query” property as well
‒ Find all documents that have “favorite” and “star” in the comment.
GET my_index/_search
{
"query": {
"match": {
"comment": {
"query": "favorite star",
"operator": "and"
}
}
}
}
‒ Notice the slightly different syntax
‒ Only 2 hits this time:
"hits": {
"total": 2,
"max_score": 0.6219729,
"hits": [
{
"_index": "my_index",
"_type": "doc",
"_id": "1",
"_score": 0.6219729,
"_source": {
"username": "harrison",
"comment": "My favorite movie is Star Wars!"
}
},
{
"_type": "doc",
"_id": "3",
...
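To see why “or” returns three hits and “and” returns two, here is a toy Python model of the match operator over the three sample comments. It is not how Elasticsearch works internally: the lowercase word split only approximates text analysis, there is no scoring, and the document id "2" for maria's comment is a stand-in for the auto-generated id.

```python
import re

COMMENTS = {
    "1": "My favorite movie is Star Wars!",
    "2": "The North Star is right above my house.",  # hypothetical id for maria's comment
    "3": "My favorite movie star in Hollywood is Harrison Ford.",
}


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for analysis)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match(query, operator="or"):
    """Return ids of comments containing any ("or") or all ("and") query terms."""
    terms = tokenize(query)
    combine = all if operator == "and" else any
    return [doc_id for doc_id, text in COMMENTS.items()
            if combine(term in tokenize(text) for term in terms)]


print(match("favorite star"))          # "or" is the default
print(match("favorite star", "and"))
```

Maria's comment mentions “star” but not “favorite”, which is why it drops out when the operator is “and”.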

The minimum_should_match Property
• If a match query searches on multiple terms, the “or” option might be too loose and the “and” option too strict
• You can specify a minimum_should_match value
‒ minimum_should_match can be used to trim the long tail of practically irrelevant results
‒ Find comments that contain at least 2 terms out of “my favorite movie”
{
"query": {
"match": {
"comment": {
"query": "my favorite movie",
"minimum_should_match" : 2
}
}
}
}
‒ minimum_should_match can be an integer or a percentage
The match_phrase Query
The match_phrase Query
• The match_phrase query is for searching text when you
want to find terms that are near each other
‒ All the terms in the phrase must be in the document
‒ The position of the terms must be in the same relative order
• A match_phrase query can be quite expensive! Use them
wisely.
{
"query": {
"match_phrase": {
"FIELD": "PHRASE"
}
}
}
‒ FIELD is the field you want to search
‒ PHRASE is the phrase you are searching for
Example of match_phrase
• Let’s look at an example:
‒ I am looking for “movie star” in the comments
{
"query": {
"match_phrase": {
"comment": "movie star"
}
}
}
Two things must happen for “movie star” to cause a hit:
1. “movie” and “star” must appear in the “comment” field
2. The position of “star” must be 1 greater than “movie”

match_phrase vs. match
• Notice the difference between match_phrase and match
when searching for “movie star”:
‒ “match_phrase” from the previous slide only has 1 hit:
"hits": {
"total": 1,
"max_score": 0.53066784,
"hits": [
{
"_index": "my_index",
"_type": "doc",
"_id": "3",
"_score": 0.53066784,
"_source": {
...
"comment": "My favorite movie star in Hollywood is Harrison Ford."
}
}
]
‒ “match” with “and” produces 2 hits:
{
"query": {
"match": {
"comment": {
"query": "movie star",
"operator": "and"
}
}
}
}
"hits": {
"total": 2,
"max_score": 0.6219729,
"hits": [
...

The slop Parameter
• If match_phrase is too strict, you can introduce some
flexibility into the phrase (called slop)
‒ The slop parameter tells how far apart terms are allowed to be
while still considering the document a match
{
"query": {
"match_phrase": {
"comment": {
"query": "favorite star",
"slop" : 1
}
}
}
}
Two things must happen for “favorite star” to find a hit:
1. “favorite” and “star” must appear in the “comment” field
2. The distance between “star” and “favorite” must be less than 2
Example of slop
• The search for “favorite star” found a hit
‒ The comment contains “favorite movie star”
• Without the slop set to at least 1, there would be no hits
GET my_index/_search
{
"query": {
"match_phrase": {
"comment": {
"query": "favorite star",
"slop" : 1
}
}
}
}
"hits": {
"total": 1,
"max_score": 0.33159825,
"hits": [
{
"_type": "doc",
"_id": "3",
"_score": 0.33159825,
"_source": {
"username": "eniko",
"comment": "My favorite movie star in Hollywood is Harrison Ford."
}
}
]
}
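The position rules behind match_phrase and slop can be illustrated with a toy Python check. This is a simplified model, not the Lucene algorithm: it only considers terms that appear in the same order as the phrase and treats slop as the total number of extra positions allowed between them (real slop can also match reordered terms at a higher cost). The function name phrase_matches is hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for analysis)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(text, phrase, slop=0):
    """True if the phrase terms occur in order with at most `slop` extra positions in total."""
    tokens = tokenize(text)
    terms = tokenize(phrase)
    if not terms:
        return False

    def search(term_index, previous_position, budget):
        if term_index == len(terms):
            return True
        for position, token in enumerate(tokens):
            if token != terms[term_index]:
                continue
            if previous_position is None:
                gap = 0
            else:
                gap = position - previous_position - 1
                if gap < 0:
                    continue  # the term must come after the previous one
            if gap <= budget and search(term_index + 1, position, budget - gap):
                return True
        return False

    return search(0, None, slop)


doc1 = "My favorite movie is Star Wars!"
doc3 = "My favorite movie star in Hollywood is Harrison Ford."
print(phrase_matches(doc3, "movie star"), phrase_matches(doc3, "favorite star", slop=1))
```

In doc3 the word “movie” sits between “favorite” and “star”, so “favorite star” needs one position of slack; in doc1 two words sit between them, so a slop of 1 is not enough.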

The range Query
The range Query
• The range query is for finding documents with fields that fall
in a given range
‒ Typically used for numeric and date fields, but can be used for
text as well
• The syntax for range looks like:
GET index/_search
{
"query": {
"range": {
"FIELD": {
"gte": LOWER,
"lte": UPPER
}
}
}
}
‒ You can use “gt” or “lt” as well

Example of range
• The documents in the stocks index have a field named
“open” that represents the opening price for that day
‒ Find all stocks with an opening price between $50 and $60
GET stocks/_search
{
"query": {
"range": {
"open": {
"gt": 50.00,
"lt": 60.00
}
}
}
}
"hits": {
"total": 12829,
"max_score": 1,
"hits": [
...
]
}
‒ 12,829 stocks in that range

The range Query on Dates
• Range queries on dates can use different formats:
‒ Specify the dates as a string
‒ Use a special syntax called “date math”
‒ I am looking for all stock prices from August, 2009.
GET stocks/_search
{
"query": {
"range": {
"trade_date": {
"gte": "2009-08",
"lt": "2009-09"
}
}
}
}

Date Math
• Elasticsearch has a user-friendly way to express date
ranges by using date math
‒ I want all stock prices from the last month
GET stocks/_search
{
"query": {
"range": {
"trade_date": {
"gte": "now-1M"
}
}
}
}
‒ I want all stock prices from the last 24 hours
GET stocks/_search
{
"query": {
"range": {
"trade_date": {
"gte": "now-1d"
}
}
}
}

Date Math Expressions
• Here are the supported time units for date math and a few
examples
Time units:
‒ y: years
‒ M: months
‒ w: weeks
‒ d: days
‒ h or H: hours
‒ m: minutes
‒ s: seconds
Examples, if now = 2016-12-08T12:56:23:
‒ now-1h is 2016-12-08T11:56:23
‒ now+1d is 2016-12-09T12:56:23
‒ now+1h+30m is 2016-12-08T14:26:23
‒ 2017-01-15||+1M is Jan 15, 2017, plus 1 month
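The arithmetic in the examples above can be checked with a small Python sketch. It is a partial illustration only: the function name apply_date_math is hypothetical, it handles expressions anchored at now with the fixed-length units (w, d, h or H, m, s), and it deliberately rejects months, years, and explicit date anchors such as 2017-01-15||+1M, which need calendar-aware handling.

```python
import re
from datetime import datetime, timedelta

# Fixed-length units only; months (M) and years (y) are intentionally not handled here.
UNITS = {"w": "weeks", "d": "days", "h": "hours", "H": "hours", "m": "minutes", "s": "seconds"}


def apply_date_math(expression, now):
    """Evaluate expressions such as now-1h or now+1h+30m relative to `now`."""
    if not expression.startswith("now"):
        raise ValueError("only expressions anchored at 'now' are handled")
    rest = expression[3:]
    steps = re.findall(r"([+-])(\d+)([wdhHms])", rest)
    # Reject anything the pattern did not fully consume (for example now-1M).
    if "".join(sign + amount + unit for sign, amount, unit in steps) != rest:
        raise ValueError(f"unsupported date math: {expression!r}")
    result = now
    for sign, amount, unit in steps:
        delta = timedelta(**{UNITS[unit]: int(amount)})
        result = result + delta if sign == "+" else result - delta
    return result


now = datetime(2016, 12, 8, 12, 56, 23)
for expression in ("now-1h", "now+1d", "now+1h+30m"):
    print(expression, apply_date_math(expression, now).isoformat())
```

Running it against the slide's reference time reproduces the three now-based examples listed above.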

The bool Query
The bool Query
• A bool query is a combination of one or more of the
following boolean clauses:
‒ must: The query must appear in matching documents and will contribute to the _score
‒ must_not: The query must not appear in matching documents
‒ should: The query should optionally appear in matching documents, and matches contribute to the _score
‒ filter: The query must appear in matching documents, but it will have no effect on the _score

Syntax for bool
• Each of the following clauses is possible (but optional) in a
bool query
‒ And they can appear in any order
{
"query": {
"bool": {
"must": [
{}
],
"must_not": [
{}
],
"should": [
{}
],
"filter": [
{}
]
}
}
}
‒ Notice the JSON array syntax. You can specify multiple queries in each clause if desired.
‒ We will discuss filters in Chapter 7

The must clause
‒ I am looking for comments from “harrison” that contain “star”
GET my_index/_search
{
"query" : {
"bool" : {
"must" : [
{
"match" : {
"comment" : "star"
}
},
{
"match" : {
"username" : "harrison"
}
}
]
}
}
}
‒ Matching comment: "My favorite movie is Star Wars!"

Multiple bool Clauses
‒ Find any comments that mention “star” but not “wars”
GET my_index/_search
{
"query": {
"bool": {
"must": {
"match": {
"comment": "star"
}
},
"must_not": {
"match":{
"comment": "wars"
}
}
}
}
}
‒ Matching comments:
"My favorite movie star in Hollywood is Harrison Ford."
"The North Star is right above my house."

The should Clause
• A term that “should” match increases the _score
‒ the scores from each matching must or should clause are added together to provide the final _score for each document
The comments I am searching for must have “star” and should have “wars”
GET my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "comment": "star"
        }
      },
      "should": {
        "match": {
          "comment": "wars"
        }
      }
    }
  }
}

minimum_should_match with bool
• You can use minimum_should_match in a bool query
{
  "query": {
    "bool": {
      "must":
        {"match": {"comment": "star"}},
      "should": [
        {"match": {"comment": "favorite"}},
        {"match": {"comment": "hollywood"}},
        {"match": {"comment": "famous"}},
        {"match": {"username": "harrison"}}
      ],
      "minimum_should_match": "50%"
    }
  }
}

Results from minimum_should_match
"hits": [
  {
    "_type": "doc",
    "_id": "1",
    "_score": 1.6028022,
    "_source": {
    }
  },
  {
    "_type": "doc",
    "_id": "3",
    "_score": 1.3930775,
    "_source": {
      "comment": "My favorite movie star in Hollywood is Harrison Ford."
    }
  }
]
_id 1: “harrison” + “favorite” >= 50%
_id 3: “hollywood” + “favorite” >= 50%
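The arithmetic behind a percentage value can be sketched in a few lines of Python. The sketch assumes a positive percentage or a plain positive integer, and that the percentage result is rounded down; the function name `required_should_matches` is invented for illustration and the negative and combined forms of the parameter are not covered.

```python
def required_should_matches(num_should_clauses, minimum_should_match):
    """Return how many should clauses have to match for a document to be a hit."""
    if minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        # Integer division rounds the computed number down.
        return (num_should_clauses * percent) // 100
    return int(minimum_should_match)

# Four should clauses with "50%", as in the query above.
print(required_should_matches(4, "50%"))
# Percentages that do not divide evenly are rounded down.
print(required_should_matches(4, "60%"))
```

With four should clauses and "50%", two of them have to match, which is what both hits above satisfy.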

Searching tip for phrases
• You can take advantage of the boost that a should clause provides by combining match and match_phrase when searching for a phrase:
{
  "query": {
    "bool": {
      "must": {
        "match": {
        }
      },
      "should": {
        "match_phrase": {
        }
      }
    }
  }
}
Hits will include “movie” or “star”, but comments that include the phrase “movie star” will score higher

Only “should” Query
• In a bool query with no must or filter clauses, one or more should clauses must match a document:
‒ or you can set minimum_should_match
{
  "query": {
    "bool": {
      "should": [
        {"match": {"comment": "north"}},
        {"match": {"comment": "south"}},
        {"match": {"comment": "east"}},
        {"match": {"comment": "west"}}
      ]
    }
  }
}
At least one clause must match for a hit

A “should not” Query
• The following example implements “should not” logic (notice how you can nest bool queries):
I am looking for comments that contain “star” but should not contain “wars”.
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "comment": "star"
          }
        }
      ],
      "should": [
        {
          "bool": {
            "must_not": [
              {
                "match": {
                  "comment": "wars"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
"_score": 1.1174096              "The North Star is right above my house."
"_score": 1.1174096 1.1682332    "My favorite movie star in Hollywood is Harrison Ford."
"_score": 0.13761075             "My favorite movie is Star Wars!"

Source Filtering
Requesting Specific Fields
• You can request specific fields using the _source parameter
‒ separate multiple fields by commas
GET stocks/_search?_source=open,close
{
  "query": {
    "range": {
      "open": {
        "gte": 50.00,
        "lte": 60.00
      }
    }
  }
}
"hits": [
  {
    "_index": "stocks",
    "_type": "stock_type",
    "_id": "AVmAGxB992cAvoluWZvm",
    "_score": 1,
    "_source": {
      "close": 57.1,
      "open": 57.58
    }
  },

_source in a Query
• You can also specify the source fields within the query body using _source
‒ this query has the same output as the previous slide:
GET stocks/_search
{
  "_source": ["open", "close"],
  "query": {
    "range": {
      "open": {
        "gte": 50.00,
        "lte": 60.00
      }
    }
  }
}
Only return the “open” and “close” fields

The includes and excludes Patterns
• For complete control, you can specify which fields to include and exclude within _source:
‒ notice also that you can use wildcards in a _source field
GET stocks/_search
{
  "_source": {
    "includes": ["volume", "*date*"],
    "excludes": ["trade_date_text"]
  },
  "query": {
    "range": {
      "open": {
        "gte": 50.00,
        "lte": 60.00
      }
    }
  }
}
{
  "_index": "stocks",
  "_type": "stock_type",
  "_id": "AVmAGxB992cAvoluWZvn",
  "_score": 1,
  "_source": {
    "volume": 48749,
    "trade_date": "2010-06-10T06:00:00.000Z"
  }
},
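How the includes and excludes patterns combine can be sketched in Python with shell-style wildcards. This is a simplified model for top-level field names only, not the Elasticsearch implementation; the function `filter_source` is a made-up name, and the `open`, `close` and `trade_date_text` values in the sample document are illustrative.

```python
from fnmatch import fnmatchcase

def filter_source(source, includes=(), excludes=()):
    """Keep fields matching an include pattern (or all fields if none is given),
    then drop any field matching an exclude pattern."""
    kept = {}
    for field, value in source.items():
        included = not includes or any(fnmatchcase(field, p) for p in includes)
        excluded = any(fnmatchcase(field, p) for p in excludes)
        if included and not excluded:
            kept[field] = value
    return kept

# Sample document; open, close and trade_date_text values are illustrative.
document = {
    "open": 57.0,
    "close": 57.5,
    "volume": 48749,
    "trade_date": "2010-06-10T06:00:00.000Z",
    "trade_date_text": "June 10, 2010",
}
print(filter_source(document,
                    includes=["volume", "*date*"],
                    excludes=["trade_date_text"]))
```

The `*date*` pattern matches both `trade_date` and `trade_date_text`; the exclude list then removes the second one, which mirrors the response fragment above.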

Relevance
What is Relevance?
• Relevance refers to the scoring of a document based on how closely it matches the query
‒ The _score is computed for each document that is a hit
• There are three main factors of a document’s score:
‒ TF (term frequency): The more a term appears in a field, the more important it is
‒ IDF (inverse document frequency): The more documents that contain the term, the less important the term is
‒ Field length: shorter fields are more likely to be relevant than longer fields

The BM25 Algorithm
• There is some fairly complex mathematics happening behind each search
‒ An effective search is only as good as the underlying algorithm used to compute the _score
• BM25 (as in “Best Matching”) is the 25th iteration of tweaking the formula
BM25 assigns less weight to the frequency of a term after a certain number of occurrences
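The saturation effect can be seen in a small Python sketch of the textbook BM25 score for a single term. It assumes the common defaults k1 = 1.2 and b = 0.75 and the IDF form ln(1 + (N - n + 0.5) / (n + 0.5)); the function name and the sample numbers are illustrative, and the actual _score of a query combines more than this one formula.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one term in one field with the textbook BM25 formula."""
    # IDF: the more documents contain the term, the lower its weight.
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: longer-than-average fields are penalized.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # The tf part approaches (k1 + 1) as tf grows, so extra repeats matter less.
    return idf * tf * (k1 + 1) / (tf + length_norm)

for tf in (1, 2, 5, 20, 100):
    score = bm25_term_score(tf, doc_len=10, avg_doc_len=10,
                            num_docs=1000, docs_with_term=50)
    print(tf, round(score, 3))
```

Printing the scores for growing term frequencies shows each additional occurrence adding less than the previous one, which is the behavior described on the slide.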

Influencing Score Calculation
• There are several ways to control the scoring of a search request:
‒ adding boost to a query clause
‒ choosing another similarity algorithm
‒ the constant_score query
‒ the function_score query
GET messages/_search
{
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "inline": """
                _score * 0.05 * doc['details.plus_ones'].value + 1
              """
            }
          }
        }
      ]
    }
  }
}
We discuss function_score and other ways to control scoring in our Advanced Elasticsearch Developer course

Boosting Relevance
Boosting Relevance
• A search for “elastic” will likely return a variety of different documents
‒ From Elastic (the company) to stretchable fabric, rubber bands, kinetic energy, and even the name of a jazz album
• One way to control the relevance of a search is by boosting query clauses

Query-time Boosting
• Some types of searches allow a “boost” parameter to increase (or decrease) the relative weight of a clause
• If boost > 1
‒ the relative weight of the clause is increased
• If 0 < boost < 1
‒ the relative weight of the clause is decreased
• If boost < 0
‒ the clause will contribute negatively to the score, effectively penalizing documents that match the clause

Query-time Boosting
• Add the “boost” property to a query clause to manipulate the score of matched documents
I am searching for “star” and “house”, but with an emphasis on “house”
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "comment": "star"
          }
        },
        {
          "match": {
            "comment": {
              "query": "house",
              "boost": 2
            }
          }
        }
      ]
    }
  }
}
Documents that match “house” will get a boost of factor 2

Chapter Review
Summary
• The match query is a simple but powerful search for fields that are text, numerical values, or dates
• The slop parameter tells how far apart terms are allowed to be while still considering the document a match
• The range query is for finding documents with fields that fall in a given range
• A bool query is a combination of one or more of the following boolean clauses: must, must_not, should and filter
• Relevance refers to the scoring of a document based on how closely it matches the query
• You can request specific fields using the _source parameter

Quiz
1. Explain the difference between match and match_phrase.
2. If you want to add some flexibility into match_phrase, you can configure a ___________ property.
3. How would you write a date range that matches the last 12 hours?
4. Suppose you have a “should” clause in a “bool” with 5 queries, and you set minimum_should_match to 70%. How many of the “should” clauses need to match for a hit?
5. What gets returned in the following request?
GET my_index/_search?_source=username

Lab 2
The Search API

Chapter 3
Text Analysis

Topics covered:
• What is Analysis?
• Building an Inverted Index
• Analyzers
• Custom Analyzers
• Character Filters
• Tokenizers
• Token Filters
• Defining Analyzers
• How to Choose an Analyzer

What is Analysis?
Exact Values vs. Full Text
• In general, data in Elasticsearch can be categorized into two groups:
‒ Exact values: includes numeric values, dates, and certain strings
‒ Full text: unstructured text data that we want to search
{
  "brandName": "Ahold",
  "customerRating": 2,
  "price": 14.15,
  "grp_id": "52903",
  "quantitySold": 805358,
  "upc12": "688267137853",
  "productName": "Ahold Companion Dog Food Kibbles & Munchy Morsels"
}
Full text
Exact values
• Why does it matter if a value is exact or full text?

Full Text is Analyzed
• Analysis is the process of converting full text into terms for the inverted index
• Analysis is performed by an analyzer
‒ either a built-in analyzer
‒ or a custom analyzer you configure
"Ahold Companion Dog Food Kibbles & Munchy Morsels"
Searches:
  dog food
  Ahold food for dogs
  kibble
It seems like all of these searches should be a hit for this product

Analysis Makes Full Text Searchable
• Full text gets analyzed during ingestion
• Similarly, the text in your queries gets analyzed using the same analyzer
"Ahold Companion Dog Food Kibbles & Munchy Morsels" → ahold, companion, dog, food, kibble, munchy, morsel
Ahold food for dogs → ahold, food, dog
The query gets analyzed too

Building an Inverted Index
What is an Inverted Index?
• You are familiar with the index in the back of a book
‒ Useful terms from the book are sorted, and page numbers tell you where to find that term
• Lucene creates a similar inverted index with your documents
Page 275 contains information about “Aston Martin company”

Building an Inverted Index
• Let’s look at how terms are analyzed and added to an inverted index:
‒ text is broken into tokens
‒ indexed by a unique document id
  1  The quick brown fox jumps over the lazy dog
  2  Fast jumping spiders
  token     postings
  Fast      2
  The       1
  brown     1
  dog       1
  fox       1
  jumping   2
  jumps     1
  lazy      1
  over      1
  quick     1
  spiders   2
  the       1

Lowercase
• Searching for “Fast” and “fast” should yield the same results
‒ common to lowercase all tokens
  token     postings
  brown     1
  dog       1
  fast      2
  fox       1
  jumping   2
  jumps     1
  lazy      1
  over      1
  quick     1
  spiders   2
  the       1
“the” was already indexed for doc 1

Stopwords
• Searching for “the” rarely makes sense
‒ typically we do not bother to index “the”
‒ or other stopwords
  token     postings
  brown     1
  dog       1
  fast      2
  fox       1
  jumping   2
  jumps     1
  lazy      1
  over      1
  quick     1
  spiders   2

Stemming
• “jumps” and “jumping” share the same stem: “jump”
‒ want “jump” to match both
‒ so only index the stem
  token     postings
  brown     1
  dog       1
  fast      2
  fox       1
  jump      1,2
  lazy      1
  over      1
  quick     1
  spider    2

Synonyms
• The terms “quick” and “fast” have similar meaning
‒ Should a search for “quick” return doc 2?
‒ Should a search for “fast” return doc 1?
  token     postings
  brown     1
  dog       1
  fast      1,2
  fox       1
  jump      1,2
  lazy      1
  over      1
  quick     1,2
  spider    2

Analyze the Search Query Too
• If we are going to analyze our text before indexing, we better also analyze our search queries in the same manner
  query       analyzed   postings
  a Dog       dog        1
  jumped      jump       1,2
  the quick   quick      2,1
  spiders     spider     2
  token     postings
  brown     1
  dog       1
  fast      1,2
  fox       1
  jump      1,2
  lazy      1
  over      1
  quick     1,2
  spider    2
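The lowercase, stopword and stemming steps from the preceding slides can be put together in a toy Python inverted index. The sketch makes loud simplifications: tokens are split on whitespace, the only stopword is “the”, stemming is a hand-written lookup table rather than a real stemmer, and synonyms are left out. All names (`analyze`, `build_inverted_index`, `search`) are invented for the example.

```python
STOPWORDS = {"the"}
# Toy stem lookup for the example documents; a real stemmer is algorithmic.
STEMS = {"jumps": "jump", "jumping": "jump", "spiders": "spider"}

def analyze(text):
    """Lowercase, drop stopwords, then stem each remaining token."""
    tokens = text.lower().split()
    tokens = [token for token in tokens if token not in STOPWORDS]
    return [STEMS.get(token, token) for token in tokens]

def build_inverted_index(docs):
    """Map each term to the list of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings = index.setdefault(term, [])
            if doc_id not in postings:
                postings.append(doc_id)
    return index

def search(index, query):
    """Analyze the query the same way, then collect matching document ids."""
    hits = set()
    for term in analyze(query):
        hits.update(index.get(term, []))
    return sorted(hits)

docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "Fast jumping spiders",
}
index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
print(search(index, "Jumping"))
```

Because the query goes through the same `analyze` function as the documents, a search for "Jumping" is reduced to "jump" and finds both documents.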

Analyzers
Anatomy of an Analyzer
• An analyzer consists of three parts:
1. Character Filters
2. Tokenizer
3. Token Filters
“My favorite movie is Star Wars!” → char filters → tokenizer → token filters → my, favorit, movi, star, war

The Analyze API
• Use the _analyze API to test what an analyzer will do to your text:
Use the _analyze API. Test the “simple” analyzer on the given “text”
GET _analyze
{
  "analyzer": "simple",
  "text": "My favorite movie is Star Wars!"
}
"My favorite movie is Star Wars!" → “simple” analyzer → my, favorite, movie, is, star, wars

Pre-built Analyzers
• standard
• simple
• whitespace
• stop
• keyword
• pattern
• language
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

The simple Analyzer
• Notice the simple analyzer does two things:
‒ Breaks text into terms whenever it encounters a character which is not a letter
‒ All terms are lower-cased
Components: Lower Case Tokenizer
The simple analyzer has no character or token filters

The stop Analyzer
• Notice the stop analyzer removes stop words
‒ similar to the simple analyzer,
‒ but adds support for removing stop words, which in English are:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
Components: Lower Case Tokenizer → Stop Token Filter
The stop analyzer includes the Stop Token Filter

The standard Analyzer
• The default analyzer in Elasticsearch
‒ Optionally, you can configure the stop words as well
Components: Standard Tokenizer → Standard Token Filter → Lower Case Token Filter → Stop Token Filter (disabled by default)

The language Analyzer
• A set of analyzers aimed at analyzing specific language text
‒ 30+ languages supported
GET _analyze
{
  "analyzer": "english",
  "text": "My favorite movie is Star Wars!"
}
GET _analyze
{
  "analyzer": "french",
  "text": "Mon film préféré est Star Wars!"
}

The Analyze API
• The Analyze API can be used to test individual components
‒ The following command tests a char filter and tokenizer
Test the “keyword” tokenizer with the “html_strip” character filter on the given “text”
GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "Welcome to the jungle"
}
"Welcome to the jungle" → _analyze → "Welcome to the jungle"

Character Filters
1. Character Filters
• Preprocess characters before they are passed to the tokenizer
‒ You can configure multiple character filters
• Three built-in character filters:
‒ html_strip
‒ mapping
‒ pattern_replace

The html_strip Character Filter
• Strips out any HTML
• Replaces HTML entities with their decoded value
GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "Tom's favorite movie is Star Wars!"
}
“Tom's favorite movie is Star Wars!” → html_strip → “Tom's favorite movie is Star Wars!”

The mapping Character Filter
• Map characters to other characters before tokenizing
GET /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ "C++ => cpp", "c++ => cpp" ]
    }
  ],
  "tokenizer": "standard",
  "text": "I like to code in C++ and Java."
}
“c++”, “C++” → mapping → “cpp”

The pattern_replace Character Filter
• Use a regular expression to manipulate characters in a string before processing
GET /_analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": "$1_"
    }
  ],
  "text": "123-45-6789"
}
Looks for dashes in a string of digits
Replaces the dash with an underscore
123-45-6789 → pattern_replace → 123_45_6789
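The regular expression itself can be tried outside Elasticsearch. The following Python sketch applies the same pattern with the `re` module; note that Python writes the group reference as `\1` where the slide's Java-style replacement uses `$1`, and the function name is invented for the example.

```python
import re

def replace_digit_dashes(text):
    """Replace a dash that follows digits and is followed by a digit with '_'."""
    # (\d+) captures the digits before the dash; (?=\d) is a lookahead that
    # requires a digit after the dash without consuming it.
    return re.sub(r"(\d+)-(?=\d)", r"\1_", text)

print(replace_digit_dashes("123-45-6789"))
```

Because the lookahead does not consume the following digit, that digit is still available as the start of the next match, so every dash between digits is replaced.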

Tokenizers
2. Tokenizer
• The second phase of an analyzer
‒ Divides a stream of text into tokens
• Some of the built-in tokenizers include:
‒ whitespace: divides text on any whitespace
‒ standard: divides text into terms on word boundaries
‒ path_hierarchy: for hierarchical values like filesystem paths
‒ uax_url_email: recognizes URLs/email addresses as tokens
‒ keyword: a no-op tokenizer
‒ pattern: splits text based on a given regular expression
• You can also write your own tokenizer in Java

Token Filters
3. Token Filters
• Token filters are the third phase of an analyzer:
‒ They can add tokens (e.g. the synonym filter)
‒ They can remove tokens (e.g. stop words)
‒ They can change tokens (e.g. stemming to root form)
• Examples include:
‒ lowercase: make all text lower case
‒ snowball: stems words using a Snowball-generated stemmer
‒ stop: removes stop words
‒ nGram and edgeNGram:
‒ and many more…

Order Matters in Token Filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase", "stop" ],
  "text": "To Be Or Not To Be"
}
Result: []
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "stop", "lowercase" ],
  "text": "To Be Or Not To Be"
}
Result: to, be, or, not, to, be
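Why the two orderings differ can be reproduced with a toy Python pipeline. The sketch assumes a case-sensitive stop filter and uses only the four stop words needed for this sentence (all taken from the English list shown for the stop analyzer); the function names are made up for illustration.

```python
# Subset of the English stop words, enough for this sentence.
STOPWORDS = {"to", "be", "or", "not"}

def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

def stop_filter(tokens):
    # Case-sensitive comparison: "To" is not the same token as "to".
    return [token for token in tokens if token not in STOPWORDS]

def analyze(text, token_filters):
    """Whitespace tokenizer followed by the token filters, applied in order."""
    tokens = text.split()
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

text = "To Be Or Not To Be"
print(analyze(text, [lowercase_filter, stop_filter]))
print(analyze(text, [stop_filter, lowercase_filter]))
```

When lowercase runs first, every token matches a stop word and is removed. When stop runs first, the capitalized tokens pass through untouched and are only lowercased afterwards.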

ICU Analysis
ICU Analysis Plugin
• The ICU Analysis plugin adds extended Unicode support for text analysis in Elasticsearch
‒ including better analysis of Asian languages
• The plugin consists of 1 character filter:
‒ Normalization
• 1 tokenizer:
‒ ICU Tokenizer
• and 4 token filters:
‒ Normalization
‒ Folding
‒ Collation
‒ Transform

Example of ICU Tokenizer
• You can use the components of the ICU plugin just like any other character filter, tokenizer, or token filter
• For example, the following Chinese translates to “Star Wars is my favorite movie”
‒ The “standard” tokenizer incorrectly splits the sentence into single characters
POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "text": ["星球大战是我最喜欢的电影"]
}
Tokens: 星球大战, 是, 我, 最, 喜欢, 的, 电影

Defining Analyzers
Configuring a Custom Analyzer
• Analyzers are defined in the “analysis” part of an index’s settings
PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "snowball"]
        }
      }
    }
  }
}
Whatever name you want to use for your analyzer
Set “type” to “custom”
Configure the three components of your custom analyzer

Defining Custom Filters
• If your character or token filters need to define properties, then you have to define them differently:
{
  "settings": {
    "analysis": {
      "char_filter": {
        "cpp_filter": {
          "type": "mapping",
          "mappings": ["C++ => cpp", "c++ => cpp"]
        }
      },
      "analyzer": {
        "char_filter": ["cpp_filter"],
        "tokenizer": "standard"
      }
    }
  }
}
Define a custom char_filter and name it whatever you want
Use your custom filter later in the “analyzer” section

More Advanced Example
• You can use your custom filters along with the built-in filters
{
  "settings": {
    "analysis": {
      "char_filter": {
        "cpp_filter": {
          "type": "mapping",
          "mappings": ["C++ => cpp", "c++ => cpp"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["my", "is"]
        }
      },
      "analyzer": {
        "char_filter": ["html_strip", "cpp_filter"],
        "tokenizer": "standard",
        "filter": ["lowercase", "my_stopwords", "snowball"]
      }
    }
  }
}

Testing a Custom Analyzer
• To test an analyzer that is already defined for an index, use _analyze at the index level:
GET test_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "C++ is my favorite language."
}
{
  "tokens": [
    {
      "token": "cpp",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "favorit",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "languag",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}

How do we decide
which analyzer to use?
Step 1: Exact value or full text?
• The first question you need to ask is: “Which text fields are exact values and which fields are full text?”
‒ Exact values do not need to be analyzed
{
  "brandName": "Uncle Ben's",
  "customerRating": 3,
  "price": 8.57,
  "grp_id": "26333",
  "quantitySold": 514342,
  "upc12": "054800085453",
  "productName": "Uncle Ben's Rice Pilaf Ready Rice"
}

Step 2: What will our searches look like?
• The second question is: “What will a search look like and when should that match a document?”
‒ Look at specific examples in your dataset
‒ Does “Uncle Ben’s” = “uncle bens” = “uncl ben” = “benjamin”?
“Uncle Ben’s” (original text) → char filters → tokenizer → token filters → ? → “uncl ben” (search query)

Step 3: Test several options
• There are still a lot of questions to ask
‒ Where will the text be split?
‒ Should we use stemming?
‒ Should we build n-grams?
‒ What are some critical synonyms?
• The most important step: test your analyzer thoroughly!
‒ Try several options
‒ Test a lot of searches!

Chapter Review
Summary
• Exact values include numeric values, dates, and certain strings
• Full text is unstructured text data that we want to search
• Analysis is the process of converting full text into terms for the inverted index
• analyzer = char_filter to tokenizer to token filter
• Use the _analyze API to test an analyzer (or the components individually)
• The ICU Analysis plugin adds extended Unicode support for text analysis in Elasticsearch

Quiz
1. The process of converting full text into terms for the inverted index is called __________.
2. Why does it matter if a field is an exact value vs. full text?
3. What are the three components that make up an analyzer?
4. What does the snowball token filter do to the word “happy”?
5. How does the following “my_logfile_analyzer” process the text “192.168.1.201 - jing https://training.Elastic.co/”?
PUT my_log_files
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_logfile_analyzer": {
          "tokenizer": "uax_url_email"
        }
      }
    }
  }
}

Lab 3
Text Analysis

Chapter 4
Mappings

Topics covered:
• What is a Mapping?
• Dynamic Mappings
• Defining Explicit Mappings
• Adding Fields
• Dive Deeper into Mappings
• Specifying Analyzers
• Multi-Fields
• Index Templates

What is a Mapping?
What is a Mapping?
• Elasticsearch will happily index any document without knowing its details (number of fields, their data types, etc.)
‒ However, behind-the-scenes Elasticsearch assigns data types to your fields in a mapping
• A mapping is a schema definition that contains:
‒ Names of fields
‒ Data types of fields
‒ How the field should be indexed and stored by Lucene
• Mappings map your complex JSON documents into the simple flat documents that Lucene expects
• Mappings belong to index types…

Remember Types?
• Documents are indexed into an index
• Each document also has a type
‒ An index can have multiple types (although that is not encouraged)
‒ Every type has exactly one mapping
{
  "_index": "my_index",
  "_type": "my_type",
  "_id": "2",
  "_score": 1,
  "_source": {
    "username": "maria",
    "comment": "The North Star is right above my house."
  }
}
Every document has a type

Every Type has One Mapping
• Defined in the “mappings” section of the index
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        define your fields and types here
      }
    }
  }
}
The name of your type

Example of a Mapping
GET my_index/_mappings
{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "comment": {
            "type": "text"
          },
          "username": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
“comment” is a “text”
“username” is a “keyword”

Defining a Mapping
• Mappings are defined in the “mappings” section of an index
‒ Define mappings at index creation, or
‒ Update a mapping of an existing index using the Put API
PUT comments
{
  "mappings": {
    define mapping here
  }
}
You can define the mapping in the PUT command

Elasticsearch Data Types for Fields
• Simple Types, including:
‒ text: for full text (analyzed) strings (“string” in older versions of Elasticsearch)
‒ keyword: for exact value strings
‒ date: string formatted as dates, or numeric dates
‒ integer types: like byte, short, integer, long
‒ floating-point numbers: float, double, half_float, scaled_float
‒ boolean
‒ ip: for IPv4 or IPv6 addresses
• Hierarchical Types: like object and nested
• Specialized Types: geo_point, geo_shape and percolator
This list is not every data type available

Why have we not needed to define any mappings yet?

Dynamic Mappings
• You are not required to manually define mappings
• When you index a document into a type, Elasticsearch dynamically updates the mapping based on the document
PUT comments/comment_type/1
{
  "user" : "Ricardo",
  "comment" : "Best movie I've ever seen! :)",
  "comment_time" : "2016-11-06T15:31:18-07:00",
  "rating" : 5
}
Resulting types: user → text, comment → text, comment_time → date, rating → long

Elasticsearch “guesses” data types:
Sometimes it conveniently guesses what you want…
{
  "num_of_views" : 430
}
"num_of_views": {
  "type": "long"
},

Dates in the default format are also recognized
{
  "comment_time" : "2016-11-06"
}
"comment_time": {
  "type": "date"
}

Sometimes it may choose a type you do not want…
PUT comments/comment_type/4
{
  "average_length" : "24.8"
}
"average_length": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
Notice the multi-field mapping for “average_length”
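The guesses shown on these slides can be approximated with a short Python sketch. It is a rough model, not the actual dynamic mapping logic: the function name `guess_field_type` is invented, date detection is reduced to ISO-style strings, and the "keyword" sub-field that accompanies a dynamically mapped "text" field is not modeled.

```python
import re
from datetime import datetime

# Only ISO-style strings such as 2016-11-06 or 2016-11-06T15:31:18-07:00
# are treated as dates in this sketch.
ISO_DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}(T.+)?$")

def guess_field_type(value):
    """Guess a field type from a JSON value, roughly as on the slides."""
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        if ISO_DATE_PATTERN.match(value):
            try:
                datetime.fromisoformat(value)
                return "date"
            except ValueError:
                pass
        # Any other string, including a quoted number, is treated as text.
        return "text"
    raise TypeError("unsupported value in this sketch: %r" % (value,))

document = {
    "num_of_views": 430,
    "comment_time": "2016-11-06",
    "average_length": "24.8",
}
for field, value in document.items():
    print(field, guess_field_type(value))
```

The last field shows the pitfall from the slide: "24.8" arrives as a JSON string, so it is treated as text rather than as a number.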

Defining Explicit Mappings
Defining Explicit Mappings
• If you need to define an explicit mapping, we typically follow these steps:
1. Index a sample document that contains the fields you want defined in the mapping (using a temporary index)
2. Get the dynamic mapping that was created automatically by Elasticsearch in Step 1
3. Edit the mapping definition
4. Create your index, using your explicit mappings

1. Index a Sample Document
• Start by indexing a document into a temporary index…
…but use a type name that you want to keep
PUT temp_comments/comment_type/1
{
  "user" : "Ricardo",
  "comment" : "Best movie I've ever seen! :)",
  "comment_time" : "2016-11-06T15:31:18-07:00",
  "rating" : 5
}

2. Get the Dynamic Mapping
• Get the mapping, then copy-and-paste it into the Console:
GET temp_comments/_mapping
{
  "temp_comments": {
    "mappings": {
      "comment_type": {
        "properties": {
          "comment": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "comment_time": {
            "type": "date"
          },
          "rating": {
            "type": "long"

3. Edit the Mapping
• In our example, we will make a couple of changes:
‒ By changing “user” to a “keyword”, it will only match exact searches
Let’s change “rating” to a “byte”… and change “user” to a “keyword”
Before:
"rating": {
  "type": "long"
},
"user": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
After:
"rating": {
  "type": "byte"
},
"user": {
  "type": "keyword"
}

4. Define a New Index with the Mapping
PUT comments
{
  "mappings": {
    "comment_type": {
      "properties": {
        "comment": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "comment_time": {
          "type": "date"
        },
        "rating": {
          "type": "byte"
        },
        "user": {
          "type": "keyword"
        }
      }
    }
  }
}
“comments” is a new index with our explicit mapping

Add a document to the new index…
• Test the new mapping by adding a document
‒ Notice since “user” is a “keyword” now, only exact searches return a hit:
GET comments/_search
{
  "query": {
    "match": {
      "user": "ricardo"
    }
  }
}
does not match “Ricardo”
{
  "query": {
    "match": {
      "user": "Ricardo"
    }
  }
}
a match!

Can you change a mapping?
Can you change a mapping?
• No - not without reindexing your documents
‒ You can add fields and modify a few properties, but…
‒ …all other schema changes require reindexing
• Why not?
‒ If you switch a field’s data type, all existing documents with that
field already indexed would become unsearchable on that field
• Why can I add fields without reindexing?
‒ Adding a field has no effect on the existing indexed fields or
documents
‒ The index can freely grow, but indexed fields cannot change
data types

Controlling the dynamic feature…
• You can control the effect of new fields added to a mapping
using the “dynamic” property (three options):
            doc indexed?   fields indexed?   mapping updated?
“true”      ✔              ✔                 ✔
“false”     ✔              𝙓                 𝙓
“strict”    𝙓 (the document is rejected)
• With “false”: index any documents with unmapped fields, but
ignore the fields and do not update the mapping
PUT comments
{
  "mappings": {
    "comment_type": {
      "dynamic" : "false",
      "properties": {
        ...
      }
    }
  }
}
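The three options in the table can be modelled with a few lines of Python. This is a toy model for illustration only: it tracks field names and ignores data types. The function name `index_document` and the sample field names are hypothetical.

```python
def index_document(mapped_fields, doc, dynamic="true"):
    """Toy model of the "dynamic" setting.

    Returns (accepted, indexed_fields, mapped_fields_after).
    """
    unmapped = [name for name in doc if name not in mapped_fields]
    if dynamic == "strict" and unmapped:
        # The whole document is rejected and nothing changes.
        return False, set(), set(mapped_fields)
    updated = set(mapped_fields)
    if dynamic == "true":
        # The mapping grows to include the new fields.
        updated.update(unmapped)
    # With "false", unmapped fields are simply not indexed.
    indexed = {name for name in doc if name in updated}
    return True, indexed, updated


mapped = {"user", "comment"}
doc = {"user": "Maria", "likes": 3}
for setting in ("true", "false", "strict"):
    print(setting, index_document(mapped, doc, setting))
```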
Adding Fields to a Mapping
• There are different ways to add a field:
‒ If the “dynamic” setting is “true”, you can add a field by simply
indexing a document that contains the new field
‒ Alternatively, you can use the PUT mapping API
• Here we modify the mapping for “comment_type” and add a new field
named “comment_location”:
PUT comments/_mapping/comment_type
{
  "properties": {
    "comment_location" : {
      "type" : "geo_point"
    }
  }
}

Dive Deeper
into Mappings
Array Fields
• In Elasticsearch, there is no dedicated array type
‒ But, any field can contain multiple values (as long as they are the
same type)
{
  "user" : "Maria",
  "comment" : "Good morning everyone",
  "comment_time" : "2017-05-15T07:14:39",
  "rating" : 4
}
“rating” is a single value in this document

POST comments/comment_type/2/_update
{
  "doc": {
    "rating" : [4,3,5]
  }
}
“rating” is an array now

Date Formats
• Use the format setting to specify the format of your date
fields:
‒ format takes either a custom date pattern like “MM/dd/YYYY”, or
‒ one of the many built-in date formats
• The default format value is
strict_date_optional_time||epoch_millis
‒ so a value like this is accepted by a plain “date” field:
{
  "comment_time" : "2016-11-06T15:31:18-07:00"
}
"comment_time": {
  "type": "date"
},

Specifying a Date Format
{
  "mappings": {
    "comment_type": {
      "properties": {
        "last_viewed" : {
          "type": "date",
          "format": "dd/MM/YYYY"
        },
        "comment_time" : {
          "type": "date",
          "format" : "basic_date||epoch_millis"
        }
      }
    }
  }
}
• Use || for multiple formats
• View the documentation for a complete list of built-in date
formats

Specifying a Date Format
• Dates must now match the format:
PUT comments/comment_type/4
{
  "last_viewed" : "17/12/2016",
  "comment_time" : 1481071407407
}
This document is accepted:
{
  "_index": "comments",
  "_type": "comment_type",
  "_id": "4",
  "_score": 1,
  "_source": {
    "last_viewed": "17/12/2016",
    "comment_time": 1481071407407
  }
}

PUT comments/comment_type/5
{
  "last_viewed" : "01/31/2017"
}
This document is rejected:
{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [last_viewed]"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse [last_viewed]",
    "caused_by": {
      "type": "illegal_field_value_exception",
      "reason": "Cannot parse \"01/31/2017\": Value 31 for monthOfYear must be in the range [1,12]"
    }
  },
  "status": 400
}
• Set “ignore_malformed” to “true” to ignore the error
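The failure above is ordinary pattern matching, which can be reproduced outside Elasticsearch. The Python sketch below is an analogy only: it assumes `%d/%m/%Y` as a rough equivalent of the “dd/MM/YYYY” pattern and does not use Elasticsearch's own date parser. The helper `matches_format` is hypothetical.

```python
from datetime import datetime

# Assumed Python counterpart of the "dd/MM/YYYY" pattern (day/month/year).
DAY_MONTH_YEAR = "%d/%m/%Y"


def matches_format(value, fmt=DAY_MONTH_YEAR):
    """Return True if `value` can be parsed with `fmt`, False otherwise."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


print(matches_format("17/12/2016"))  # day 17, month 12
print(matches_format("01/31/2017"))  # would need a month 31
```

The second value is a valid date in month/day/year order, but it cannot be read as day/month/year, which is why document 5 is rejected.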

Meta-Fields
• Each document has metadata fields (_index, _type, _id)
• Some meta-fields we have not seen yet include:
‒ _all: a special catch-all field which concatenates the values of all
of the other fields into one big string
‒ _meta: for storing application-specific metadata

Disabling _all
• The _all field is being removed in future versions of
Elasticsearch
‒ it will not be available for new indices in Elasticsearch 6
‒ you can disable it to save disk space if your application is not
using _all
• Disable _all for comment_type:
PUT comments
{
  "mappings": {
    "comment_type": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "comment": {
          "type": "text"
        },
        ...

The _meta Field
• Using the “_meta” field, you can put any custom metadata
you want in your mapping
‒ Any “_meta” data is associated with the type (not your individual
documents)
PUT comments/_mapping/comment_type
{
  "_meta" : {
    "comment_type_version" : "2.1"
  }
}
Store any application-specific JSON here…

Common Field Parameters
• “index”: should the field be indexed (defaults to true)
• “null_value”: provide a default value if the incoming field is
null. Default is “null”, which means skip the field
"properties": {
  "my_password" : {
    "type": "keyword",
    "index": false
  },
  "my_rating" : {
    "type": "integer",
    "null_value": 1
  }
}
‒ “my_password” will not be indexed
‒ The value of 1 will be indexed for “my_rating” in documents where
it is null

Specifying Analyzers
Specifying a Custom Analyzer
• Define “my_analyzer” in the “analysis” section
• Configure the “comment” field to use “my_analyzer”
PUT comments
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edgengram_filter": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 7
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_edgengram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "comment_type": {
      "properties": {
        "comment": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

Specifying a Custom Analyzer
• Let’s test my_analyzer:
POST comments/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I like Elasticsearch"
}
The resulting tokens:
lik
like
ela
elas
elast
elasti
elastic
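To see why these seven tokens come out (and why “I” disappears), the analyzer can be approximated in a few lines of Python. This is a rough sketch under stated assumptions: a regular expression stands in for the standard tokenizer, and the function `my_analyzer` only imitates the lowercase and edge n-gram filters configured above.

```python
import re


def my_analyzer(text, min_gram=3, max_gram=7):
    """Approximate my_analyzer: split into words, lowercase, emit edge n-grams."""
    tokens = []
    for word in re.findall(r"\w+", text):
        word = word.lower()
        # Edge n-grams are the prefixes of length min_gram..max_gram.
        # Words shorter than min_gram produce no tokens at all.
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:size])
    return tokens


print(my_analyzer("I like Elasticsearch"))
```

“I” is shorter than min_gram, so it yields nothing, and “elasticsearch” is cut off at max_gram, which is why the longest token is “elastic”.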

Specifying a Search Analyzer
• The edgeNGram filter in the previous example is useful for
indexed text fields
‒ but using edgeNGram on a query string rarely makes sense
• Use the search_analyzer property to specify a different
analyzer for query strings:
PUT comments
{
  ...
  "mappings": {
    "comment_type": {
      "properties": {
        "comment": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
Search terms will use the “standard” analyzer

Multi-Fields
Multi-Fields
• It is often useful to index the same field in different ways
‒ For example, it allows your full text to be analyzed in multiple
ways
• You can specify an “analyzer” property for each field
• This “comment” field will be indexed four different ways:
"comment": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    },
    "english_comment" : {
      "type" : "text",
      "analyzer": "english"
    },
    "whitespace_comment" : {
      "type" : "text",
      "analyzer": "whitespace"
    }
  }
}

Accessing a Multi-Field
• Use the dot operator in a query to refer to a multi-field:
{
  "comment" : "Hello, world!"
}
“Hello, world!” gets indexed four different ways

{
  "query": {
    "match": {
      "comment.english_comment": "hello"
    }
  }
}
Use the dot operator to choose which multi-field to query

Index Templates
Index Templates
• Index templates allow you to define mappings and settings
that will automatically be applied to newly-created indices:
‒ You can have multiple index templates
‒ They get “merged” and applied to the created index
‒ You can provide an “order” value to control the merging process
• When a new index is created:
1. The default settings and mappings are applied
2. Settings and mappings from the lowest “order” template are
applied
3. Settings and mappings from the higher “order” templates are
applied, overriding previous settings along the way
4. The settings and mappings from the PUT command of the index
are applied last and override all previous templates
Defining a Template
• Use the _template endpoint to add, view and delete
templates
‒ The last part of the path is the name of the template
‒ "template" : "*" applies this template to any new index
PUT _template/my_template_1
{
  "template" : "*",
  "order" : 1,
  "mappings": {
    "my_type" : {
      "_all": {
        "enabled": false
      }
    }
  }
}

Let’s define a second template:
• Apply this template to any new index that starts with “my_”:
PUT _template/my_template_2
{
  "template" : "my_*",
  "order" : 5,
  "mappings": {
    "my_type" : {
      "properties": {
        "version" : {
          "type" : "integer"
        }
      }
    }
  }
}

Now define a new index:
• Notice our index named “my_test” matches both templates:
PUT my_test
GET my_test/_mappings

{
  "my_test": {
    "mappings": {
      "my_type": {
        "_all": {
          "enabled": false
        },
        "properties": {
          "version": {
            "type": "integer"
          }
        }
      }
    }
  }
}
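The merge order can be imitated with a small Python model. It is a sketch of the idea only, not how Elasticsearch implements templates: `deep_merge` and `resolve_mappings` are hypothetical helpers, default settings are left out, and Python's `fnmatch` is used as a stand-in for matching the “template” pattern against the index name.

```python
import fnmatch


def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`; `override` wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_mappings(index_name, templates, request_mappings=None):
    """Apply matching templates from lowest to highest "order", then the request body."""
    result = {}
    matching = [t for t in templates if fnmatch.fnmatchcase(index_name, t["template"])]
    for template in sorted(matching, key=lambda t: t["order"]):
        result = deep_merge(result, template["mappings"])
    # Mappings sent with the PUT of the index are applied last.
    return deep_merge(result, request_mappings or {})


templates = [
    {"template": "my_*", "order": 5,
     "mappings": {"my_type": {"properties": {"version": {"type": "integer"}}}}},
    {"template": "*", "order": 1,
     "mappings": {"my_type": {"_all": {"enabled": False}}}},
]

print(resolve_mappings("my_test", templates))
```

For “my_test” both templates match, so the result carries the disabled _all field from the order 1 template and the “version” property from the order 5 template.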

Viewing and Deleting Templates
• Use a GET to view your defined templates:
GET _template
• You can also DELETE a template:
DELETE _template/my_template_2

Mapping Best Practices
• Define your own mappings!
‒ The dynamically-created mappings may have the wrong data
types, or lack the settings appropriate for your documents
• Use index templates!
• Define a default template for “*” that contains global
settings for all your new indexes
• Templates are a great place to define your custom
analyzers
• Disable “_all” if you do not need it
• Try to avoid using multiple types in a single index

Chapter Review
Summary
• Every type has a mapping that defines a document’s fields and
their data types
• Mappings are created dynamically, or you can define your
own
• A common way to define an explicit mapping is to modify the
dynamic mapping generated from a well-designed sample document
• Define your mappings well! Only a few changes can be made
afterward, such as adding fields.
• Use the analyzer property to specify a custom analyzer for a
field, and search_analyzer to specify a different analyzer for
query strings
• Index templates allow you to define mappings and settings
that will automatically be applied to newly-created indices

Quiz
1. What data type would “value” : 300 get mapped to
dynamically?
2. If “dynamic” is set to “strict”, what happens when a document
with an unmapped field is indexed?
3. True or False: You can use mappings to specify text
analyzers at the field level.
4. True or False: You can change a field’s data type from
“integer” to “long” because those two types are compatible.
5. In the following mapping, which analyzer would get used on
“my_field” during indexing? At search time?
"my_field": {
  "type": "text",
  "analyzer": "my_custom_analyzer",
  "search_analyzer": "english"
}

Lab 4
Mappings
Chapter 5
More Search Features
Topics covered:
• Filters
• Index Aliases
• The term Query
• More Term-level Queries
• The multi_match Query
• Fuzziness
• Highlighting
• Synonyms

We have a new dataset…
Dataset with nutritional information:
• 500 food items in an index named “nutrition”:
{
  "servings": {
    "serving_size_unit": "chips",
    "servings_per_container": 32,
    "serving_size_qty": 10
  },
  "ingredients": [
    "whole yellow corn",
    "vegetable oil",
    "corn",
    "soybean",
    "canola and/or sunflower oil",
    "salt"
  ],
  "brand_name": "Tostitos",
  "item_name": "Tortilla Chips, Thick & Hearty Rounds",
  "details": {
    "total_fat": 6,
    "calories_from_fat": 50,
    "calories": 140,
    "saturated_fat": 1
  },
  "item_description": "Thick & Hearty Rounds"
}
• Notice “servings” and “details” have nested fields
(common in JSON)

Filters
Query vs. Filter
• We have seen the query context, but a clause can also run in a
filter context:
‒ Query context: results in a _score
‒ Filter context: results in a “yes” or “no”
• Query context: “How well does this document match this query?”
‒ “Here’s the _score”
• Filter context: “Does this document match this query?”
‒ “Yes”

The filter Clause
• The filter clause is an optional clause in a bool query:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "comment": "star"
        }
      },
      "filter": {
        "match": {
          "comment": "favorite"
        }
      }
    }
  }
}
‒ The _score will be the result of the “must” match query only
‒ Hits must have “favorite” in the “comment” field
• Example comments:
"My favorite movie star in Hollywood is Harrison Ford."
"My favorite movie is Star Wars!"

Index Aliases
Index Aliases
• The Index Aliases API allows you to define an alias for an
index
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "INDEX_NAME",
        "alias": "ALIAS_NAME"
      }
    }
  ]
}

Example of Index Alias
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my_index",
        "alias": "my_alias"
      }
    }
  ]
}

GET my_alias/_search
{
  "query": {
    "match_all": {}
  }
}
Returns hits from “my_index”

Filtered Aliases
• An alias can include a filter
‒ useful for creating different views of the same index:
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "customers",
        "alias": "active_customers",
        "filter": {
          "match": {
            "customer_status": "active"
          }
        }
      }
    }
  ]
}
Any query on the active_customers alias will only include “active”
customers

The term Query
Term-level Queries
• Data falls into two groups:
‒ Exact values: includes numeric values, dates, and certain
strings
‒ Full text: unstructured text data that we want to search
• When you submit a string in a match query, the mapping is
checked to see if the string should be analyzed
‒ This makes sense: your query string is analyzed the same way
as your indexed text, so the two can match
• Term-level queries are queries that do not go through an
analysis phase
‒ for searching numerics, dates, and string values that are exact
‒ typically used in filter contexts

The term Query
• In a term query, the search terms are not analyzed
‒ A hit has to match the search term exactly
“Find all products from Market Pantry.”
GET nutrition/_search
{
  "query": {
    "term": {
      "brand_name.keyword" : "Market Pantry"
    }
  }
}
The nutrition index has 4 hits

“Find all products from Market.”
{
  "query": {
    "term": {
      "brand_name.keyword" : "Market"
    }
  }
}
This search returns 0 hits
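The difference between the two results comes down to whether the text is analyzed. The Python sketch below is a simplified model, not Elasticsearch code: `analyze` is a crude stand-in for the standard analyzer, and `match_query` and `term_query` are hypothetical helpers that compare one stored value with one query.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: split into words and lowercase."""
    return [word.lower() for word in re.findall(r"\w+", text)]


def match_query(indexed_text, query):
    """Full-text match: both sides are analyzed, any shared token is a hit."""
    return bool(set(analyze(indexed_text)) & set(analyze(query)))


def term_query(indexed_keyword, term):
    """Term-level match on a keyword field: the whole value must be identical."""
    return indexed_keyword == term


brand = "Market Pantry"
print(term_query(brand, "Market Pantry"))  # the exact stored value
print(term_query(brand, "Market"))         # only part of the stored value
print(match_query(brand, "market"))        # analyzed on both sides
```

A keyword field stores “Market Pantry” as a single term, so “Market” alone never equals it, while an analyzed text field would have been split into “market” and “pantry”.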

Using term in a Filter
• You can use term at the query level (see previous slide)
‒ But, term queries are typically used in filter clauses, as we often
filter on exact matches
“I am looking for products from Market Basket with syrup in the
ingredients”
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "ingredients": "syrup"
          }
        }
      ],
      "filter": {
        "term": {
          "brand_name.keyword": "Market Basket"
        }
      }
    }
  }
}
74 hits without the filter… …only 1 hit with the filter

Match vs. Term and Filter vs. Query
• When do we use a match query vs. a term query?
‒ And when do we use a filter instead of a query?
• It depends on your use case:
‒ match query: you need the search string analyzed, and you want
scoring
‒ term query: you do not want the search string analyzed, and you
want scoring
‒ match in a filter: you need the search string analyzed
‒ term in a filter: you do not want the search string analyzed
• There is no scoring in a filter

Using term out of a Filter
• Suppose you want to boost certain brand names, but not
limit results to only those brand names:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "item_name": "cheese"
          }
        }
      ],
      "should": [
        {
          "term": {
            "brand_name.keyword": {
              "value": "Giant"
            }
          }
        }
      ]
    }
  }
}
‒ The field must match “Giant” exactly, and it will contribute to
the _score
• Hits:
"brand_name": "Giant",
"item_name": "Cheddar Cheese Twice Baked Potatoes"

"brand_name": "Giant",
"item_name": "Cottage Cheese, Small Curd, 1% Milkfat, Lowfat, No Salt Added"

"brand_name": "365 Everyday Value",
"item_name": "Feta Cheese"

"brand_name": "Kraft",
"item_name": "Cheese, Italian Three Cheese"
…and so on

More Term-level Queries
The terms Query
• Use terms to find documents that match any of the given
terms:
“I am looking for syrup from Market Basket, Nabisco, or Barilla.”
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "ingredients": "syrup"
          }
        }
      ],
      "filter": {
        "terms": {
          "brand_name.keyword": [
            "Market Basket",
            "Nabisco",
            "Barilla"
          ]
        }
      }
    }
  }
}
The terms query allows you to search multiple terms

Wildcard Query
• The wildcard query is a term-based search that contains a
wildcard:
‒ * = anything
‒ ? = any single character
• wildcard can be expensive, especially if you have leading
wildcards
“I am looking for cookies from a brand name that starts with ‘Ste’.”
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "item_name": "cookie"
        }
      },
      "filter": {
        "wildcard": {
          "brand_name.keyword": {
            "value": "Ste*"
          }
        }
      }
    }
  }
}
• Hit: “Premium Dark Chocolate, with Mint Cookie Crunch” from the
“Stewart’s” brand.
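The two wildcard characters behave like shell-style patterns, which Python's standard `fnmatch` module can imitate. This is an analogy for illustration only: `fnmatch` also understands `[...]` character sets, which are not part of the wildcard syntax described here, and all brand names in the list except “Stewart's” are made-up sample values.

```python
from fnmatch import fnmatchcase

# Sample keyword values; only "Stewart's" comes from the slides.
brands = ["Stewart's", "Starbucks", "Nestle", "Ste"]

# "*" matches any sequence of characters, including an empty one.
starts_with_ste = [brand for brand in brands if fnmatchcase(brand, "Ste*")]

# "?" matches exactly one character.
three_letters = [brand for brand in brands if fnmatchcase(brand, "St?")]

print(starts_with_ste)
print(three_letters)
```

The comparison is case-sensitive, just as a term-level query on a keyword field is: “ste*” would not match “Stewart's”.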

The exists Query
• The exists query returns documents that have at least one
non-null value in the specified field
‒ It is a simple way to filter out documents with null fields
“I am looking for cookies that have their ingredients listed.”
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "item_name": "cookie"
          }
        }
      ],
      "filter": {
        "exists": {
          "field": "ingredients"
        }
      }
    }
  }
}

The multi_match Query
The multi_match Query
• multi_match is similar to match except you can query
multiple fields:
‒ The syntax looks like:
GET nutrition/_search
{
  "query" : {
    "multi_match": {
      "query": "",
      "fields": []
    }
  }
}
‒ Enter the query string once…
‒ …and an array of fields to search

Example of multi_match
• By default, the _score from the best field is used
‒ But you can control various aspects of how the _score is
calculated
“Search for ‘olive oil’ in either the item name or in the ingredients”.
GET nutrition/_search
{
  "query": {
    "multi_match": {
      "query": "olive oil",
      "fields": [
        "item_name",
        "ingredients"
      ]
    }
  }
}
• Hits:
"Extra Virgin Olive Oil"
"Trout, Golden Smoked Fillets 3.75 Oz"
"Olive Oil, Extra Virgin, Cold Pressed, Bread & Salads"
…and so on

Per-field boosting in multi_match
• Individual fields can be boosted
‒ Use a caret (^) followed by a boost multiplier
{
  "query": {
    "multi_match": {
      "query": "olive oil",
      "fields": [
        "item_name",
        "ingredients^2"
      ]
    }
  }
}
‒ The “ingredients” score counts twice as much as the “item_name”
score
• Hits:
"Trout, Golden Smoked Fillets 3.75 Oz"
"Olive Oil, Extra Virgin, Cold Pressed, Bread & Salads"
"House Salad Dressing"

Wildcards in Field Names
• multi_match allows for wildcards in field names:
{
  "query": {
    "multi_match": {
      "query": "rice",
      "fields": [
        "*name",
        "ingredients"
      ]
    }
  }
}
Search any field that ends in “name”

Types of multi_match Queries
• best_fields (default): score is from the best field
• most_fields: score is a combination from all matching
fields
• cross_fields: searches for each term in the query string in
any of the fields in the document
{
  "query": {
    "multi_match": {
      "query": "rice",
      "fields": [
        "*name",
        "ingredients"
      ],
      "type": "most_fields"
    }
  }
}
Change the scoring to use “most_fields”
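The practical difference between best_fields and most_fields, and the effect of a per-field boost, can be illustrated with a toy calculation. The Python function below is a deliberately simplified assumption: it takes invented per-field scores, uses the maximum for best_fields and the sum for most_fields, and ignores the other factors of real Elasticsearch scoring. The name `combine_scores` is hypothetical.

```python
def combine_scores(field_scores, boosts=None, query_type="best_fields"):
    """Toy combination of per-field scores (not the real scoring formula)."""
    boosts = boosts or {}
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    if query_type == "best_fields":
        return max(boosted)  # the single best field decides
    if query_type == "most_fields":
        return sum(boosted)  # every matching field contributes
    raise ValueError("unsupported type: " + query_type)


# Invented scores for one document, for illustration only.
scores = {"item_name": 1.5, "ingredients": 1.0}
print(combine_scores(scores))
print(combine_scores(scores, boosts={"ingredients": 2}))
print(combine_scores(scores, query_type="most_fields"))
```

With the boost, “ingredients” overtakes “item_name” as the best field, which is the kind of reordering seen in the boosted olive oil example.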

Fuzziness
What is Fuzziness?
• Fuzzy matching allows for query-time matching of
misspelled words
‒ It is a certainty that search users will misspell words
• Fuzzy matching treats two words that are “fuzzily” similar
as if they were the same word
‒ Fuzziness is something that can be assigned a value:
‒ It refers to the number of character modifications, known as “edits”,
needed to make two words match
• fuzziness = 1: “the beetles” → “the beatles” (e → a)
• fuzziness = 2: “nervan” → “nirvana” (e → i, then add a)
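Counting edits is a classic dynamic-programming exercise. The Python function below is a plain Levenshtein distance, offered as a sketch of the concept rather than Elasticsearch's implementation. By default Elasticsearch additionally counts a swap of two adjacent characters as a single edit, which this version does not.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("beetles", "beatles"))
print(edit_distance("nervan", "nirvana"))
```

A fuzziness of N means that indexed terms within N edits of the query term are treated as matches.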

Fuzzy match Query
• The match query has a “fuzziness” property
‒ Can be set to 0, 1 or 2; or can be set to AUTO
Correct spelling: 115 hits
{
  "query" : {
    "match": {
      "ingredients" : "sugar"
    }
  }
}
Misspelled, no fuzziness: 0 hits
GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients" : "suger"
    }
  }
}
Misspelled, fuzziness of 1: 115 hits
{
  "query" : {
    "match": {
      "ingredients" : {
        "query" : "suger",
        "fuzziness": 1
      }
    }
  }
}

Choosing the Fuzziness
• Be aware of the side effect of setting fuzziness
‒ Suppose you set it to 2
‒ a word like “to” also matches “as”, “by”, “if”, “the”, and so on
• Let’s search for items from “J & Ds” and set fuzziness to 2:
{
  "query": {
    "match": {
      "brand_name": {
        "query": "j & d",
        "fuzziness": 2
      }
    }
  }
}
61 hits, even though “J & Ds” only has one item in the index

The AUTO Setting
• fuzziness can be set to AUTO
‒ generates an edit distance based on the length of the term
‒ AUTO is the preferred way to use fuzziness
• Notice the improvement on our previous query for “J & Ds”:
{
  "query": {
    "match": {
      "brand_name": {
        "query": "j & d",
        "fuzziness": "auto"
      }
    }
  }
}
Only 2 hits with “auto”:
"J & Ds"
"J. Skinner"
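The length rule behind AUTO is small enough to write down. The Python function below restates the thresholds given in the Elasticsearch documentation for this version (terms of up to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two). The function name `auto_fuzziness` is hypothetical.

```python
def auto_fuzziness(term):
    """Number of edits AUTO allows for a single term, based on its length."""
    length = len(term)
    if length <= 2:
        return 0  # must match exactly
    if length <= 5:
        return 1  # one edit allowed
    return 2      # two edits allowed


for term in ("j", "d", "suger", "sitrik"):
    print(term, auto_fuzziness(term))
```

This explains the improvement above: “j” and “d” are each a single character, so AUTO allows no edits for them, whereas a fixed fuzziness of 2 lets them match almost any short term.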

Multiple terms in the query
• If the query has multiple terms, the fuzziness value is
applied to each term
‒ For example “citric” and “sitrik” are 2 edits apart
‒ “acid” and “acet” are also 2 edits apart
Correct spelling: 120 hits
{
  "query": {
    "match": {
      "ingredients": "citric acid"
    }
  }
}
Misspelled, fuzziness of 2: 122 hits
{
  "query" : {
    "match": {
      "ingredients" : {
        "query" : "sitrik acet",
        "fuzziness": 2
      }
    }
  }
}

Highlighting
Highlighting
• A common use case for search results is to highlight the
matched terms
‒ Elasticsearch makes it easy to do this
‒ Add a “highlight” clause that lists the fields you want
highlighted
{
  "query" : {
    ...
  },
  "highlight": {
    "fields": {
      ...
    }
  }
}

Example of Highlighting
• Highlight “oil” if it appears in “ingredients”:
{
  "query" : {
    "match": {
      "ingredients": "oil"
    }
  },
  "highlight": {
    "fields": {
      "ingredients" : {}
    }
  }
}
• The response will contain a “highlight” section:
"highlight": {
  "ingredients": [
    "vegetable oil",
    "canola oil or soybean oil"
  ]
}

Highlighting multiple fields
• Note: you have to query a field if you want it highlighted in
the results (or set require_field_match: false)
Querying only “ingredients”:
GET nutrition/_search
{
  "size": 200,
  "query" : {
    "match": {
      "ingredients": "oil"
    }
  },
  "highlight": {
    "fields": {
      "ingredients" : {},
      "item_name" : {}
    }
  }
}
“item_name” is not in the response:
"highlight": {
  "ingredients": [...]
}

Querying both fields:
GET nutrition/_search
{
  "query" : {
    "multi_match": {
      "query": "oil",
      "fields": ["ingredients", "item_name"]
    }
  },
  "highlight": {
    "fields": {
      "ingredients" : {},
      "item_name" : {}
    }
  }
}
Both fields are highlighted:
"highlight": {
  "ingredients": [...],
  "item_name": [...]
}

Changing the Tags
• Use “pre_tags” and “post_tags” to change the highlighting:
"highlight": {
  "pre_tags" : ["<elasticsearch-hit>"],
  "post_tags" : ["</elasticsearch-hit>"],
  "fields": {
    "ingredients" : {},
    "item_name" : {}
  }
}
• Results are highlighted with your custom tags:
"highlight": {
  "ingredients": [
    "grapeseed <elasticsearch-hit>oil</elasticsearch-hit>"
  ],
  "item_name": [
    "Grapeseed <elasticsearch-hit>Oil</elasticsearch-hit>"
  ]
}
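What the tags do to a matched term can be imitated in a few lines of Python. This sketch only wraps whole-word, case-insensitive matches of a single term and is not how the Elasticsearch highlighters work internally. The helper name `highlight` is hypothetical, and its default tags mirror the `<em>` tags Elasticsearch uses when no custom tags are given.

```python
import re


def highlight(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap every whole-word, case-insensitive occurrence of `term` in the given tags."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    # A function is used as the replacement so the tags are inserted literally.
    return pattern.sub(lambda match: pre_tag + match.group(0) + post_tag, text)


print(highlight("vegetable oil", "oil"))
print(highlight("Grapeseed Oil", "oil", "<elasticsearch-hit>", "</elasticsearch-hit>"))
```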

Synonyms
Working with Synonyms
• Searching for “USA” should probably match “US” and
“United States of America” (and vice-versa)
• You can use synonyms to achieve this behavior
‒ As of Elasticsearch 5.2, use the synonym graph token filter
• “US” → synonym graph token filter → “US”, “USA”,
“United States of America”

Defining Synonyms
• Synonyms are defined either in the index settings (as part of the
token filter definition), or in a separate text file
‒ You can format synonyms in two ways (which behave differently)
• Comma-separated list: if any of the terms is encountered, it is
replaced by all the terms
us, usa, united states of america
won't, will not
• left side => right side: terms on the left are replaced with terms
on the right
international business machines => ibm
queen => queen, monarch

Expand or Contract
• Depending on the definition of the synonym, terms and
phrases can be expanded or contracted (or simply
replaced)
‒ Using the synonyms from the previous slide:
“usa” gets expanded:
usa → us, usa, united states of america
“queen” gets replaced (expanded):
queen → queen, monarch
“international business machines” gets replaced (contracted):
international business machines → ibm
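Both rule formats can be turned into a simple lookup table, which makes the expand and contract behaviour easy to check. The Python sketch below is a simplified model, not the synonym token filter itself: the helper `parse_rules` is hypothetical, it assumes that a comma-separated list maps every term to all of the terms, and it ignores tokenization and phrase handling.

```python
def parse_rules(lines):
    """Build a lookup from a term or phrase to the terms that replace it."""
    lookup = {}
    for line in lines:
        if "=>" in line:
            # Explicit mapping: terms on the left are replaced by the right side.
            left, right = line.split("=>")
            sources = [term.strip() for term in left.split(",")]
            targets = [term.strip() for term in right.split(",")]
        else:
            # Comma-separated list: every term maps to all of the terms.
            sources = targets = [term.strip() for term in line.split(",")]
        for source in sources:
            lookup[source] = targets
    return lookup


rules = parse_rules([
    "us, usa, united states of america",
    "won't, will not",
    "queen => queen, monarch",
    "international business machines => ibm",
])

print(rules["usa"])
print(rules["queen"])
print(rules["international business machines"])
```

Note the asymmetry of the “=>” form: “queen” is replaced by “queen” and “monarch”, but “monarch” on its own is left untouched.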

Index vs. Query Time
• When does the expansion or contraction occur?
‒ Either at index time, query time, or both
Index time: “US” is indexed 3 ways (us, usa, united states of america)
PUT test_index/doc/1
{
  "comment" : "I won't be in the US in July"
}
Query time: becomes a search for “ibm”
GET test_index/_search
{
  "query": {
    "match": {
      "comment": "international business machines"
    }
  }
}

Index vs. Query Time
• Should I use index or query time synonyms, or both?
‒ It depends!
• The synonym token filter can be used at both index and
query time
• The synonym_graph token filter works much better on
phrases than the synonym token filter
‒ But it only works at query time, so that might answer your
question

Configuring Synonyms
• Use synonym_graph as the type of filter
• List the synonyms inline (or point to a text file)
• Use “my_analyzer” on query strings at query time
PUT test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "us, usa, united states of america",
            "won't, will not",
            "queen => queen, monarch",
            "international business machines => ibm"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase","my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "comment": {
          "type": "text",
          "analyzer": "standard",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  }
}

A Synonyms File
• If your synonyms are defined in a text file, configure the
"synonyms_path" property to specify the file:
"filter" : {
  "my_synonyms" : {
    "type" : "synonym_graph",
    "synonyms_path" : "/path/to/file.txt"
  }
}

Test Our Synonyms
• Let’s index a couple of documents that have terms that
match our synonyms:
{
  "comment" : "I won't be in the US in July"
}

{
  "comment" : "IBM beats forecast by $0.23 a share"
}
• Because we are using synonym_graph, no expanding or contracting
happens at index time

Test Our Synonyms
• Running a query on “united states of america” also
queries the “comment” field for “usa” and “us”:
GET test_index/_search
{
  "query": {
    "match_phrase": {
      "comment": "united states of america"
    }
  }
}
Same as searching for “us”, so doc 1 is a hit:
"hits": [
  {
    "_index": "test_index",
    "_type": "doc",
    "_id": "1",
    "_score": 8.396978,
    "_source": {
      "comment": "I won't be in the US in July"
    }
  }
]

Test Our Synonyms
• “International Business Machines” gets replaced with
“ibm” in this query:
GET test_index/_search
{
  "query": {
    "match": {
      "comment": "international business machines"
    }
  }
}
Same as searching for “ibm”, so doc 2 is a hit:
"hits": [
  {
    "_type": "doc",
    "_id": "2",
    "_score": 0.7081689,
    "_source": {
      "comment": "IBM beats forecast by $0.23 a share"
    }
  }
]

Let’s Index Some Documents…
{
  "comment" : "We won't meet the Queen of England in the United States of America"
}

{
  "comment" : "We won't meet the Monarch of England in the United States of America"
}
The only difference is “queen” and “monarch”

…and Search for “queen”
• A search for “queen” returns both docs with the same score
‒ It might be nice to give a higher score to the doc that actually
contained the term “queen”
GET test_index/_search
{
  "query": {
    "match": {
      "comment": "queen"
    }
  }
}
Both documents are hits, and both have the same score:
"hits": [
  {
    "_type": "doc",
    "_id": "3",
    "_score": 0.9792457,
    "_source": {
      "comment": "We won't meet the Queen of England in the United States of America"
    }
  },
  {
    "_type": "doc",
    "_id": "4",
    "_score": 0.9792457,
    "_source": {
      "comment": "We won't meet the Monarch of England in the United States of America"
    }
  }
]

Give Preference to the Original Text
• Let’s run a bool query that searches “comment” using both
“my_analyzer” (with synonyms applied) and the “standard”
analyzer (with no synonyms applied)
GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "comment": "queen"
          }
        },
        {
          "match": {
            "comment": {
              "query": "queen",
              "analyzer": "standard"
            }
          }
        }
      ]
    }
  }
}
First clause: search for “queen” or “monarch”
Second clause: search only for “queen”

Give Preference to the Original Text
• Notice the document with “queen” scores higher, even though “queen” and “monarch” are synonyms:
"hits": {
  "total": 2,
  "max_score": 1.9584914,
  "hits": [
    {
      "_type": "doc",
      "_id": "3",
      "_score": 1.9584914,
      "_source": {
        "comment": "We won't meet the Queen of England in the United States of America"
      }
    },
    {
      "_type": "doc",
      "_id": "4",
      "_score": 0.9792457,
      "_source": {
        "comment": "We won't meet the Monarch of England in the United States of America"
      }
    }
  ]
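One way to read these scores, as a rough sketch: a bool query adds up the scores of the should clauses that match. The function below is a hypothetical illustration, not Elasticsearch code, and it assumes each matching clause contributes 0.9792457, which is consistent with the hits shown above.

```python
# Illustrative only: a bool query sums the scores of its matching "should" clauses.
def bool_should_score(clause_scores):
    """Sum the scores of the clauses that matched (None means no match)."""
    return sum(score for score in clause_scores if score is not None)

# Doc 3 contains "queen": assumed to match both the synonym clause
# and the standard-analyzer clause.
doc3 = bool_should_score([0.9792457, 0.9792457])

# Doc 4 contains "monarch": assumed to match only the synonym clause.
doc4 = bool_should_score([0.9792457, None])

print(round(doc3, 7), round(doc4, 7))
```

The doc that contains the original term collects a score from both clauses, which is why it ranks first.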

Chapter Review
Summary
• Filters are cached and reused by Elasticsearch, for even
greater performance
• The Index Aliases API allows you to define an alias for an
index
• multi_match is similar to the match query and allows you to
query multiple fields
• Fuzzy matching allows for query-time matching of similar words
• Highlighting allows you to highlight search results that match the query
• Synonyms are defined either in the mapping, or in a separate text file
• The synonym_graph token filter is preferred over the synonym token filter, especially when working with phrases

Quiz
1. What is the fuzziness value between a search for “yellow submarine” and “yelow submarene”?
2. Assuming the following synonym definition is applied at query time, what would a search for “cat” query for?
cat, kitten, pet
3. What is the easiest way to search for the same terms across multiple fields?
4. What is the difference between a full-text query and a term-level query?
5. Suppose we are searching StackOverflow. What is the difference between using best_fields and most_fields?
GET stackoverflow/_search
{
  "query": {
    "multi_match": {
      "query": "java",
      "fields": ["body", "subject"]
    }
  }
}

Lab 5
More Search Features
Chapter 6
The Distributed Model
Topics covered:
• Starting a Node
• Creating an Index
• Starting a Second Node
• Shards: Distribution of an Index
• Distributing Documents
• Replication
• Split Brain

Starting a Node
What is a Node?
• A node is an instance of Elasticsearch
‒ It is a Java process that runs in a JVM
• Every node has a name
‒ you can set node.name in the elasticsearch.yml file
‒ or on the command line
• Each node has a UID assigned at startup
‒ and persisted in the data folder for future restarts
bin/elasticsearch -E node.name=adam_01
[Diagram: a single node named adam_01]

Cluster
• Everything in Elasticsearch happens in the context of a
cluster
‒ Every cluster has a name (e.g. training_cluster)
‒ Default cluster name is “elasticsearch”
‒ Every node belongs to a cluster
• You can set the cluster.name that a node belongs to in
elasticsearch.yml or on the command line
bin/elasticsearch -E node.name=adam_01 -E cluster.name=training_cluster
[Diagram: training_cluster containing the node adam_01]

Cluster State
• Every cluster has information referred to as the cluster
state
‒ includes the nodes in the cluster, indices, mappings, settings,
and so on
• Cluster state is stored on each node
[Diagram: training_cluster containing the node adam_01, which stores clusterState0]

But not every node can edit the cluster state
• In a distributed model, if everyone could edit the cluster
state there could be inconsistencies and data loss
• We need a single source of truth...
[Diagram: adam_01 in training_cluster, holding clusterState0, says: “I am adam_01, Master of the Universe!”]

Master-eligible Nodes
• Nodes can be eligible to become the master
• When you start adam_01 it is a master-eligible node by
default
‒ which can be disabled by setting node.master: false
[Diagram: adam_01 in training_cluster, holding clusterState0, says: “OK, so maybe I am not the master, but I am eligible.”]

The Election Process
• An election process occurs to select the actual master
‒ When adam_01 starts, it elects itself to be the master
[Diagram: adam_01 in training_cluster, holding clusterState0, says: “I guess I am the only one eligible, so I win the election!”]

Data Nodes
• We also need to store data in the cluster
• A node that is able to store data is called a data node
‒ When adam_01 starts, it is a data node by default
‒ You can disable by setting node.data: false
[Diagram: adam_01 in training_cluster, with a data allocation area and clusterState0, says: “I am ready for some data!”]

Creating an Index
Creating an Index
• Suppose you submit a PUT request to create a new index
in your cluster:
PUT my_index
{
  "settings": {
    ...
  }
}
[Diagram: the client app sends the request to adam_01 in training_cluster; adam_01 has a data allocation area and clusterState0]

The Coordinating Node
• A request is not sent to the cluster - it is sent to one of the
nodes
• The node that handles your PUT request is called the
coordinating node for that request
‒ It inspects the request and determines what needs to be done
‒ It makes sure the request is executed and sends the response
back to the client
[Diagram: the client app sends PUT my_index to adam_01 in training_cluster; adam_01 has a data allocation area and clusterState0]

Master Nodes Create Indices
• Our PUT request for a new index can only be executed by
the master node (same for a DELETE index request)
‒ Since adam_01 is the elected master node (and the only node in
our cluster), it handles the request for this new index
• The cluster state is updated, the allocation of the index
occurs, and adam_01 sends back a response
[Diagram: the client app sends PUT my_index to adam_01; my_index is allocated on adam_01, the cluster state becomes clusterState1, and the client receives 200 OK]

Starting a Second Node
Starting a Second Node
• Elasticsearch is a distributed system, which means it can
benefit from multiple machines.
‒ Let's start a new node named princess_10 on a new machine.
• By default, princess_10 is master-eligible + data node
bin/elasticsearch -E node.name=princess_10 -E cluster.name=training_cluster
[Diagram: training_cluster with princess_10 and adam_01; my_index and clusterState1 are on adam_01]

Joining the Cluster
• princess_10 has to join the cluster, but how?
‒ Nodes ping each other using unicast (on port 9300 by default)
• A seed list of hostnames should be listed in the
unicast.hosts property:
discovery.zen.ping.unicast.hosts: ["adam_01"]
[Diagram: princess_10 asks “Hi adam_01. May I join training_cluster?” and adam_01 answers “Sure, you can join training_cluster!”; my_index and clusterState1 are on adam_01]

Sharing the Cluster State
• adam_01 updates the cluster state
‒ adds princess_10 to the list of nodes
• Then adam_01 sends the new cluster state to princess_10
[Diagram: adam_01 tells princess_10 “I am the master, and here is the cluster state”; both nodes now hold clusterState2, and my_index is on adam_01]

Shards: Distribution of an Index
Distributing Data
• We now have 1 cluster, 2 nodes, and 1 index
• How can Elasticsearch use both nodes?
‒ By distributing the data!
[Diagram: training_cluster with the nodes princess_10 and adam_01, and the single index my_index]

"Late" Data Splitting is Expensive
• Elasticsearch "could" break an index into parts late after
creating an index, but that has some challenges:
‒ how many parts?
‒ how to divide documents?
‒ need to rebuild the entire inverted index!
[Diagram: my_index being split between princess_10 and adam_01 after it was created]

“Early” Data Splitting
• Instead, Elasticsearch breaks an index into parts early, at
creation time
• An index is a virtual namespace which points to a number
of shards
‒ A shard is a worker unit that holds data and can be assigned to
nodes
An index is “split” into shards before any documents are indexed
[Diagram: my_index divided into shards, in training_cluster with princess_10 and adam_01]

Shards are distributed across nodes
• Because every index is partitioned into shards at creation
time, it is easy to distribute them later
‒ The master decides which shards to allocate to which nodes
(published via cluster state)
clusterState3 contains the shard allocation table
When princess_10 receives clusterState2, she starts the request for copying the shards assigned to her
[Diagram: training_cluster with princess_10 and adam_01 sharing the shards of my_index]

The Number of Primary Shards is Fixed
• The default number of primary shards for an index is 5
‒ You specify the number of shards when you create the index
‒ You can NOT change the number of primary shards, so choose
wisely! (or reindex your data)
PUT my_index
{
  "settings": {
    "number_of_shards": 6
  }
}

Distributing Documents
Distributing Documents
• If an index is partitioned into shards, what happens when
you index a document?
• We have already discussed the PUT vs. POST commands
‒ PUT: when you specify the _id
‒ POST: when you want Elasticsearch to generate the _id
PUT my_index/doc/101
{
  "firstname" : "Robert",
  "lastname" : "Smith",
  "city" : "Lancashire"
}
[Diagram: the client app sends the request to training_cluster (princess_10, adam_01) and receives a 201 response]

Routing Documents to Shards
• Documents are routed to shards using the following 
formula: shard = hash(_routing) % number_of_primary_shards
‒ The document’s id is used as the _routing value by default
‒ This typically results in uniformly-sized shards (our goal!)
PUT my_index/doc/101
{
  "lastname" : "Smith",
  ...
}
shard = hash(“101”) % 6 = 3, so Robert Smith will route to shard 3
[Diagram: the client app sends the request to training_cluster and receives a 201 response; princess_10 and adam_01 both hold clusterState3]
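The routing formula can be sketched in a few lines. This is a minimal illustration, assuming CRC32 as a stand-in hash (it is not the hash Elasticsearch uses, so the shard numbers will not match a real cluster); route_to_shard is a hypothetical helper name.

```python
import zlib

def route_to_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(_routing) % number_of_primary_shards, with a stand-in hash."""
    # Assumption: CRC32 replaces the real hash function for illustration only.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

ids = [str(i) for i in range(100, 112)]

# The same ids routed with 6 and with 7 primary shards.
with_6 = {doc_id: route_to_shard(doc_id, 6) for doc_id in ids}
with_7 = {doc_id: route_to_shard(doc_id, 7) for doc_id in ids}

# Any id listed here would be looked up on the wrong shard if the
# number of primaries changed after the document was indexed.
moved = [doc_id for doc_id in ids if with_6[doc_id] != with_7[doc_id]]

print(with_6)
print("ids that would route differently with 7 primaries:", moved)
```

Because the shard number depends on the number of primary shards, changing that number would send existing ids to different shards, which is why it is fixed at index creation.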

Reindexing a Document
• If you index an existing document, it gets reindexed
‒ Behind the scenes, the existing document is deleted and then a
new one is created
PUT my_index/doc/101
{
  "lastname" : "Smith",
  "city" : "London"
}
[Diagram: request flow between the client app, princess_10 and adam_01 in training_cluster; both nodes hold clusterState3]
1. index
2. hash
3. reroute
4. delete
5. index
6. ok
7. 200 response

Update a Partial Document
• If you just want to update one or more fields, you can use
the _update endpoint
‒ The entire document is still deleted and reindexed
POST my_index/doc/101/_update
{
  "doc" : {
    "city": "Chelmsford"
  }
}
The new or updated fields and values need to appear in the “doc” section
[Diagram: request flow between the client app, princess_10 and adam_01 in training_cluster; both nodes hold clusterState3]
1. update
2. hash
3. reroute
4. get
5. delete
6. index
7. ok
8. 200 response

Replication
Why Replication?
If a node fails, we could potentially lose data
[Diagram: training_cluster with princess_10 and adam_01]

Shard Replication
• Elasticsearch makes copies of your shards called replicas
‒ The original shard is called the primary
• You can also change the number of replicas for an index at
any time using the _settings endpoint:
PUT my_index/_settings
{
  "number_of_replicas": 1
}
[Diagram: the client app sends the request to training_cluster (princess_10, adam_01)]

Shard Replication
• The replicas are stored on different nodes in the cluster
‒ provides high availability in case a shard or node fails
‒ also improves search performance by scaling the processing
Both nodes will start replicating shards to the other node
[Diagram: training_cluster with princess_10 and adam_01]

High Availability
• If we lose princess_10, all of the data is still available on adam_01
[Diagram: training_cluster with only adam_01 remaining, holding clusterState5]

Primary Promotion
• After losing a primary, Elasticsearch will automatically
promote a replica into a primary
Each of these shards is promoted to a primary, and the cluster state is updated
[Diagram: adam_01 alone in training_cluster, holding clusterState6]

Node Re-join
• When princess_10 is back online:
‒ it joins the cluster again
‒ gets the new cluster state
‒ replicates shards
[Diagram: training_cluster with princess_10 and adam_01 again]

Let's Delete the Existing Indices
Several ways to delete multiple indices:
DELETE my_index,my_index2
DELETE my_index*
DELETE my_index
DELETE my_index2
[Diagram: training_cluster with princess_10 and adam_01]

Let's Start Over
PUT my_index
{
"settings": {
"number_of_shards": 4,
"number_of_replicas": 1
}
}
[Diagram: in training_cluster, princess_10 and adam_01 each hold four shard copies, numbered 0 to 3]

Let's add the same document again:
PUT my_index/doc/101
{
  "lastname" : "Smith",
  ...
}
[Diagram: request flow between the client app, princess_10 and adam_01, each holding shard copies 0 to 3]
1. PUT
2. index to P2
3. replicate to R2
4. replicate OK
5. index OK
6. 201 response

The Response of a PUT
• Notice the response of a PUT contains a _shards section
‒ “total” = 1 primary + # of replicas
‒ if successful < total, a replica is likely not available
HTTP/1.1 201 Created
{
  "_index" : "my_index",
  "_type" : "doc",
  "_id" : "202",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}
201 = created, 200 = updated
“total”: # of shard copies the doc should be written to
“successful”: # of shard copies the doc was successfully written to
“failed”: # of shard copies the write failed on (error details are reported if something went wrong)
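A client can inspect the _shards section to decide how to react. The sketch below is a hypothetical helper (write_status is not part of any client library) applied to the sample response from this slide.

```python
def write_status(response: dict) -> str:
    """Summarize the _shards section of an index response."""
    shards = response["_shards"]
    if shards["failed"] > 0:
        return "write failed on at least one shard copy"
    if shards["successful"] < shards["total"]:
        return "written, but at least one replica is likely not available"
    return "written to all shard copies"

# The sample response from the slide: 1 primary + 1 replica expected,
# only 1 copy written, no failures reported.
sample = {
    "_index": "my_index",
    "_type": "doc",
    "_id": "202",
    "_version": 1,
    "result": "created",
    "_shards": {"total": 2, "successful": 1, "failed": 0},
    "created": True,
}

status = write_status(sample)
print(status)
```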

The Get API
• Use the Get API to retrieve a specific document from the
index using its type and id
‒ returns a 404 error if the document is not found
GET my_index/doc/101
[Diagram: the client app sends the GET to training_cluster, where princess_10 and adam_01 each hold shard copies 0 to 3]

Let's delete the document
• Use DELETE to delete a specific document
‒ 200 response if successful
‒ 404 response if the document is not found
• Delete marks the document as deleted
‒ The document is not actually deleted from storage until a segment merge occurs
DELETE my_index/doc/101
[Diagram: the client app sends the DELETE to training_cluster, where princess_10 and adam_01 each hold shard copies 0 to 3]

The DELETE request in detail:
[Diagram: request flow between the client app, princess_10 and adam_01, each holding shard copies 0 to 3]
1. delete
2. reroute
3. delete
4. delete replica
5. delete OK
6. deleted
7. response

Split Brain
Split Brain
• Even though the master node is the single source of truth
when it comes to the cluster state, problems can still
happen
‒ In particular, a big concern is if the cluster gets partitioned
‒ The cluster could inadvertently elect two masters, which is
referred to as a “split brain”
[Diagram: training_cluster with gary_21, princess_10 and adam_01; two nodes claim to be the master: “Yay - I get to be the master!” and “I am the master!”]

Two masters can lead to inconsistency:
Inconsistent cluster states make it basically impossible to recover from a split brain
[Diagram: a client app talking to a partitioned training_cluster; clusterState4]

Avoiding the Split Brain
• A master-eligible node needs at least
minimum_master_nodes votes to win an election
‒ Setting it to a quorum helps avoid the split brain scenario
• Our simple recommendation is for production systems to
always have 3 dedicated master-eligible nodes
‒ and set minimum_master_nodes to 2
gary_21 cannot become a master if minimum_master_nodes is 2
[Diagram: the three nodes of training_cluster vote: “I vote for me”, “I vote for adam_01.” and “I vote for me, which makes a quorum.”]
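The quorum used for minimum_master_nodes is (number of master-eligible nodes / 2) + 1, using integer division. A minimal sketch follows; quorum and can_elect_master are hypothetical helper names used only for illustration.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return master_eligible_nodes // 2 + 1

def can_elect_master(reachable_master_eligible: int, minimum_master_nodes: int) -> bool:
    """A partition can elect a master only if it can gather enough votes."""
    return reachable_master_eligible >= minimum_master_nodes

minimum_master_nodes = quorum(3)  # three master-eligible nodes

# A partition isolates one node from the other two.
isolated_side = can_elect_master(1, minimum_master_nodes)
majority_side = can_elect_master(2, minimum_master_nodes)

print(minimum_master_nodes, isolated_side, majority_side)
```

Only one side of any partition can hold a majority, so at most one master can be elected.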

Chapter Review
Summary
• Elasticsearch subdivides the data of your index into multiple
pieces called shards
• Each shard has one (and only one) primary and zero or
more replicas
• A document is routed to a shard using a formula that
incorporates a hash value of the document id
• You cannot change the number of primary shards of an index after it has been created
• After losing a primary, Elasticsearch will automatically promote a replica into a primary
• A master-eligible node needs at least minimum_master_nodes votes to win an election

Quiz
1. How does Elasticsearch distribute your data within an index?
2. Suppose you have a cluster with 6 data nodes, you create a new
index with “PUT hello”, then insert 10,000 documents. How
many total shards will the hello index have?
3. True or False: When you add a new node to a cluster, you have
to run the _refresh command to redistribute shards to the new
node?
4. If number_of_shards for an index is 4, and number_of_replicas is 2, how many total shards will exist for this index?
5. If you have three master-eligible nodes in your cluster, what should you set minimum_master_nodes to?

Lab 6
The Distributed Model
Chapter 7
Working with Search Results
Topics covered:
• Segments
• The Anatomy of a Search
• DFS Query-then-fetch
• Sorting Results
• Doc Values and Fielddata
• Sorting by Strings
• Pagination
• Scroll Searches
• Choosing a Search Type

Segments
Inverted Index in Practice
• The inverted index concept introduced in the analysis
chapter is practically represented inside a package called a
Lucene segment
‒ a self-contained immutable inverted index
A shard contains one or more segments
Segments can contain many documents
[Diagram: my_index with shard0 and shard1; the segments of shard0 hold documents 1, 3, 7 and 9, and the segments of shard1 hold documents 2, 4, 6 and 8]

Segment is a Package
• A segment contains many documents
‒ and is a package of many data structures,
‒ including an inverted index for each field that is searchable
[Diagram: a segment containing documents 7 and 9, with one inverted index per field]
my_field_food_name
  term     docs
  apple    7,9
  banana   7
  carrot   9
  dill     7,9
my_field_color
  term     docs
  green    7,9
  orange   9
  red      7,9
  yellow   7
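The per-field tables above can be reproduced with a small sketch. The field contents of documents 7 and 9 are assumptions reverse-engineered from the tables, and whitespace splitting stands in for a real analyzer.

```python
from collections import defaultdict

# Assumed field contents, chosen so that they produce the tables above.
docs = {
    7: {"my_field_food_name": "apple banana dill", "my_field_color": "green red yellow"},
    9: {"my_field_food_name": "apple carrot dill", "my_field_color": "green orange red"},
}

def build_inverted_indexes(docs):
    """Build one inverted index (term -> sorted doc ids) per field."""
    indexes = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, text in docs[doc_id].items():
            # Whitespace tokenization stands in for a real analyzer.
            for term in text.lower().split():
                if doc_id not in indexes[field][term]:
                    indexes[field][term].append(doc_id)
    return indexes

indexes = build_inverted_indexes(docs)
for field, index in indexes.items():
    print(field)
    for term in sorted(index):
        print(f"  {term:8} {index[term]}")
```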

Why Segments?
• Lucene does not build one giant inverted index
‒ that would be too difficult to maintain and store
• Similarly, Lucene does not build an inverted index for each document after every indexing operation
‒ that would be inefficient
• Segments provide a nice solution between having one large inverted index and too many small ones

Segments are Refreshing
• The lightweight process of writing and opening a new
segment is called a refresh
‒ by default, every shard is refreshed once every second
‒ configured at the index level with refresh_interval property
• Documents are not searchable until a refresh occurs
‒ you can force a refresh using the _refresh endpoint, although
this is not a common use case:
POST my_index/_refresh
Manually force a refresh, which makes newly-indexed documents searchable

The refresh Parameter
• The Document APIs that create, update or delete have an
optional refresh parameter:
‒ controls when changes made by this request are made visible to
searches
• Three possible values:
‒ false: the default value; changes to the document are not visible immediately
‒ true: refresh the affected primary and replica shards so that the changes are visible immediately
‒ wait_for: synchronous request that waits for a refresh to make the changes visible before responding

Example of the refresh Parameter
Wait for the indexing to complete before returning a response
PUT my_index/doc/102/?refresh=wait_for
{
  "firstname" : "James",
  "lastname" : "Brown",
  "address" : "6011 Downtown Lane",
  "city" : "Detroit"
}
• wait_for does not force a refresh, but rather it waits for a refresh to happen
• true forces a refresh

The Anatomy of a Search
Searches are performed on segments
• When you submit a query to Elasticsearch, Lucene runs the
search on every segment of the index
The number of segments affects the query performance
{
  "query" : {
    ...
  }
}
[Diagram: the query goes to the coordinating node and is run on node 1 and node 2, against segment 1 and segment 2 of each shard]

The Anatomy of a Search
• When a client makes a request, the node handling that
request is referred to as the coordinating node
‒ Any good client will “round robin” search requests to different
coordinating nodes to avoid overloading a single node
[Diagram: a client sending requests to production_cluster, which has six nodes (node 1 to node 6)]

The Phases of a Search
• The coordinating node is responsible for making sure a
request is routed to the proper nodes
‒ results from multiple shards must be combined into a single
sorted list
‒ then the coordinating node returns a “page” of results
• In a search request, the coordinating node also coordinates
the two phases of the search:
1. Query phase
2. Fetch phase

1. Query Phase
1. Determine the relevant shards
2. Choose shard copy (trying to load balance)
3. Forward the request to the nodes
4. Each selected shard executes the query
5. Each shard sorts the results and/or performs any aggregations
6. The Lucene ids are sent back with the sorting value
[Diagram: the coordinating node and nodes 1 to 6; the selected shard copies are primary 0, replica 3, replica 1 and primary 2]

2. The Fetch Phase
• The query phase identifies which documents satisfy the
search request
• The fetch phase retrieves the documents:
1. The coordinating node identifies which docs to retrieve and requests them from the applicable shards
2. Each shard returns the docs to the coordinating node
3. Coordinating node sends results back to client
[Diagram: the client, the coordinating node, node 4, and the shard copies primary 0, replica 1 and replica 3]
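The two phases can be sketched as a toy in-memory simulation. Everything here is hypothetical (the shard contents, the scores, and the function names); it only illustrates that the query phase moves ids and sort values, while the fetch phase moves the documents themselves.

```python
import heapq

# Hypothetical data: each shard maps a doc id to (score, source).
shards = {
    0: {"a": (1.2, {"title": "star wars"}), "b": (0.4, {"title": "star trek"})},
    1: {"c": (2.1, {"title": "star"}), "d": (0.9, {"title": "north star"})},
}

def query_phase(shard, size):
    """Each shard returns only ids and sort values for its local top hits."""
    hits = [(score, doc_id) for doc_id, (score, _source) in shard.items()]
    return heapq.nlargest(size, hits)

def search(shards, size):
    # Query phase: gather (score, shard id, doc id) from every shard and merge.
    candidates = []
    for shard_id, shard in shards.items():
        for score, doc_id in query_phase(shard, size):
            candidates.append((score, shard_id, doc_id))
    top = heapq.nlargest(size, candidates)

    # Fetch phase: only the winning documents are retrieved from their shards.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_id][doc_id][1]}
        for score, shard_id, doc_id in top
    ]

page = search(shards, size=2)
print([hit["_id"] for hit in page])
```

Each shard must return its own top `size` hits, because the coordinating node cannot know in advance which shard holds the best matches.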

The Search Response
• The _shards section of a search response provides how
many shards were searched
‒ as well as a count of the successful/failed searched shards
‒ if successful < total you likely received only partial results
‒ your business logic and SLAs typically determine how to handle
errors and/or incomplete results
{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
If “successful” is less than “total”, you likely have incomplete results

Performance Benefit of Filters
• A query in a filter context can be cached by Elasticsearch
to improve performance
‒ referred to as the node query cache (or query cache for short)
‒ uses a bit set at the segment level
[Diagram: a segment with 3 docs and a bit set of size 3 for our “comment”: “favorite” filter]
  doc:      1 2 3
  bit set:  0 1 1
“favorite” appears in the “comment” of docs 2 and 3
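The idea of a cached bit set can be sketched as follows. The comment texts and the dictionary cache are illustrative assumptions; the real node query cache is managed by Elasticsearch and is more involved.

```python
# Hypothetical segment: doc id -> text of the "comment" field.
segment = {
    1: "great product",
    2: "my favorite snack",
    3: "favorite of the kids",
}

def build_bitset(segment, term):
    """One bit per document in the segment: 1 if the term appears, else 0."""
    return [1 if term in text.split() else 0 for _doc_id, text in sorted(segment.items())]

cache = {}

def cached_filter(segment, term):
    """Compute the bit set once, then reuse it for later requests."""
    if term not in cache:
        cache[term] = build_bitset(segment, term)
    return cache[term]

bits = cached_filter(segment, "favorite")
bits_again = cached_filter(segment, "favorite")  # served from the cache
print(bits)
```

Because a segment is immutable, a bit set computed for it stays valid for the life of that segment.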

DFS Query-then-fetch
Relevance and Sharding
• How is sharding related to the _score and search phases?
‒ The _score is partially calculated at the shard level
‒ For smaller or skewed datasets, this can drastically affect the scoring
[Diagram: hits from my_index spread across node 1, node 2 and node 3]
This document might score unrealistically high because it is the only hit on the shard
This better document might score less because it is on a shard with a lot of hits

Query-then-fetch vs. DFS Query-then-fetch
• If the default query-then-fetch scoring, which uses local BM25 scores, is not working for you, you can try executing a DFS query-then-fetch
‒ which computes a global BM25 score
‒ more expensive, because it involves running a pre-query
• Tends to fix the issue of skewed datasets
‒ For example, time-based indices may create skewed results. On the day a new iPhone is released, tweets containing #iphone6 will be skewed for that particular day
‒ Smaller datasets are also more prone to being skewed
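To see why local statistics can skew scores, compare the IDF part of BM25 computed per shard with the same value computed from combined statistics. The shard statistics below are invented for illustration; the formula is the BM25 IDF, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number of documents containing the term.

```python
import math

def bm25_idf(doc_count: int, docs_with_term: int) -> float:
    """IDF component of BM25."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))

# Hypothetical statistics: (documents on the shard, documents containing the term).
shard_stats = {"shard_0": (1000, 1), "shard_1": (1000, 400)}

# query_then_fetch: each shard scores with its own local statistics.
local_idf = {name: bm25_idf(n, df) for name, (n, df) in shard_stats.items()}

# dfs_query_then_fetch: statistics are first combined across shards.
total_docs = sum(n for n, _df in shard_stats.values())
total_df = sum(df for _n, df in shard_stats.values())
global_idf = bm25_idf(total_docs, total_df)

print(local_idf, round(global_idf, 3))
```

The term looks rare on shard_0 and common on shard_1, so the same term gets very different weights per shard; the combined statistics give every hit the same weight.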

Performing a DFS Query-then-fetch
• Set the search_type property to dfs_query_then_fetch
‒ Defaults to query_then_fetch
query_then_fetch:
GET my_index/_search
{
  "query" : {
    "match": {
      "comment": "star"
    }
  }
}
dfs_query_then_fetch:
GET my_index/_search?search_type=dfs_query_then_fetch
{
  "query" : {
    "match": {
      "comment": "star"
    }
  }
}

Sorting Results
Sorting
• Sorting the results of a query can be done three ways:
‒ Sort by a field
‒ Sort by _score
‒ Sort by _doc
{
  "query": {
    ...
  },
  "sort": [
    {
      "FIELD": {
        "order": "desc"
      }
    }
  ]
}
The “sort” clause: the array of field names to sort by
“order”: “desc” or “asc”

A sort Example
• The following query returns products with “chocolate” in
the “ingredients” sorted by “total_fat” descending:
Which items with chocolate have the most fat?
GET nutrition/_search
{
  "query" : {
    "match": {
      "ingredients": "chocolate"
    }
  },
  "sort": [
    {
      "details.total_fat": {
        "order": "desc"
      }
    }
  ]
}

A sort Example
• Notice the results are not scored in our query because the _score is not being used in sorting
"hits": {
  "total": 15,
  "max_score": null,
  "hits": [
    {
      "_index": "nutrition",
      "_type": "doc",
      "_id": "AVv-kZCazWyyiGorJZ5Y",
      "_score": null,
      "_source": {
        ...
        "brand_name": "ProBar",
        "item_name": "Bar, Koka Moka",
        "details": {
          "total_fat": 18,
          ...
        },
        "item_description": "Koka Moka"
      },
      "sort": [
        18
      ]
    },
The item with the most total_fat is the first hit
Each hit in the response contains a “sort” section containing the values used for sorting

Sort on Multiple Fields
Sort by “servings_per_container”, then by “_score”
{
  "query" : {
    "match": {
      ...
    }
  },
  "sort": [
    {
      "servings.servings_per_container": {
        "order": "desc"
      }
    },
    {
      "_score" : {
        "order" : "desc"
      }
    }
  ]
}
Notice both values used for sorting are provided in the results
      "_score": 2.6099455,
      "_source": {
        "servings": {
          "serving_size_unit": "cookies",
          "servings_per_container": 13.5,
          "serving_size_qty": 3
        },
        "brand_name": "Pepperidge Farm",
        "item_name": "Cookie Collection, Holiday Homestyle",
        ...
      },
      "sort": [
        13.5,
        2.6099455
      ]
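Sorting on multiple fields behaves like sorting on a tuple of keys: the second key only breaks ties in the first. A small sketch with made-up hits (only the first hit mirrors values from the slide):

```python
# Hypothetical hits; the first one reuses the values shown on the slide.
hits = [
    {"item_name": "A", "servings_per_container": 13.5, "_score": 2.6099455},
    {"item_name": "B", "servings_per_container": 13.5, "_score": 1.1},
    {"item_name": "C", "servings_per_container": 20.0, "_score": 0.5},
]

# Both keys are descending, so both values are negated.
ordered = sorted(hits, key=lambda h: (-h["servings_per_container"], -h["_score"]))

# Mirror the "sort" section that each hit carries in the response.
for hit in ordered:
    hit["sort"] = [hit["servings_per_container"], hit["_score"]]

print([hit["item_name"] for hit in ordered])
```

Item C comes first despite its low _score, because _score is only consulted when servings_per_container is equal.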

Sort by _doc
• Sorting by _doc provides the best performance and least
overhead
‒ for situations where you do not care about the sort order
{
  "query" : {
    "match": {
      ...
    }
  },
  "sort": [
    "_doc"
  ]
}
Documents are sorted in the order that Lucene has them indexed

Sorting by Text
• Look at the following query. It seems very simple. What will
the results look like?
I am searching for yogurt and sorting the results by “brand_name”.
{
  "query": {
    "match" : {
      "item_name" : "yogurt"
    }
  },
  "sort" : [
    {
      "brand_name" : {
        "order" : "asc"
      }
    }
  ]
}

Why can’t we sort by text?
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set
fielddata=true on [brand_name] in order to load fielddata in memory
by uninverting the inverted index. Note that this can however use
significant memory."
• You can sort by text, but you need to understand the cost
‒ The issue is that text data types are analyzed
‒ The original value of the text has been changed for indexing
We want to sort by this data:
  “Chobani”, “My Favorite”, “Del Monte”, “Essential Everyday”
But after _analyze, this is what our data looks like:
  “chobani”, “del”, “essential”, “everyday”, “favorite”, “monte”
Doc Values and Fielddata
Introducing Doc Values and Fielddata
• An inverted index is not ideal for sorting on field values
‒ sorting (and aggregations) require us to uninvert the inverted
index
• Two options in Elasticsearch for building this uninverted
index:
‒ doc values
‒ fielddata
• doc values only work on exact value fields (so NOT on analyzed fields)
• We need to enable fielddata on a tokenized field if we want to sort by it
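"Uninverting" means turning the term-to-documents mapping into a document-to-value column, which is the shape sorting needs. A minimal sketch with hypothetical data, assuming an exact-value field with one term per document:

```python
# Hypothetical inverted index of an exact-value field: term -> doc ids.
inverted = {"chobani": [1], "dannon": [3], "crowley": [2]}

def uninvert(inverted):
    """Build a doc id -> value column from a term -> doc ids index."""
    column = {}
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            column[doc_id] = term  # assumes a single value per document
    return column

doc_values = uninvert(inverted)

# Sorting hits is now a direct lookup per document.
hits = [3, 1, 2]
sorted_hits = sorted(hits, key=doc_values.__getitem__)
print(sorted_hits)
```

Doc values store this column on disk at index time, while fielddata builds it in memory at search time.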

Doc Values vs. Fielddata
• Which option is better? Well, it depends…
          Fielddata                                     Doc Values
When      Created on the fly when needed in a _search   Created at index time
Where     JVM heap                                      files on disk
Pros      faster indexing and avoids disk usage         avoid using too much memory
Cons      large heap usage                              extra disk usage, slightly slower indexing
Default   Elasticsearch 1.x and prior                   Elasticsearch 2.x and later

Why change the defaults?
• Save some disk space and slightly reduce index time:
‒ disable doc_values in fields you will not sort or aggregate on
‒ set doc_values: false in the mappings of the field(s)
‒ Be careful: if later on you want to sort or aggregate on that field
you will need to update the mapping and re-index your
documents
• You really need to sort or aggregate on a text field:
‒ enable fielddata on that field (e.g. significant terms)
‒ set fielddata: true in the mappings of the field(s)
• Be careful: fielddata can consume a lot of memory!
‒ fielddata is protected by circuit breakers: if too much memory is used, Elasticsearch will fail the request
Sorting by Strings
So how do we sort by strings?
• One option is to use the “keyword” type:
‒ in our case, “brand_name” has a “keyword” multi-field
{
  "query": {
    "match" : {
      "item_name" : "yogurt"
    }
  },
  "sort" : [
    {
      "brand_name.keyword" : {
        "order" : "asc"
      }
    }
  ]
}
Results: “Chobani”, “Chobani Greek Yogurt”, “Crowley”, “Dannon” …and so on
Sorting is allowed here because the “keyword” multi-field of “brand_name” has doc values
“brand_name.keyword” is an OK solution, but there is one issue…

Case-sensitivity in String Sorting
• Using a keyword field avoided the fielddata issue
‒ but keyword fields are not analyzed, so upper and lowercase
strings affect our sorting:
{
  "query": {
    "match" : {
      "item_name" : "yogurt"
    }
  },
  "sort" : [
    {
      "brand_name.keyword" : {
        "order" : "asc"
      }
    }
  ]
}
Results: “Chobani”, “Chobani Greek Yogurt”, “Crowley”, “Dannon”, …, “Welch’s”, “Yoplait”, “del Monte”
Unfortunately, “del Monte” appears last in our sort results
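The ordering problem is easy to reproduce outside Elasticsearch. The following Python sketch sorts a few of the brand names from the slide, first by raw string value and then by a lowercased sort key; the list and variable names are illustrative only.

```python
brands = ["Yoplait", "del Monte", "Chobani", "Welch's", "Dannon", "Crowley"]

# Raw comparison: every uppercase letter sorts before any lowercase letter.
case_sensitive = sorted(brands)

# Lowercasing the sort key gives the order a reader would expect.
case_insensitive = sorted(brands, key=str.lower)

print(case_sensitive)
print(case_insensitive)
```

In the first list “del Monte” lands at the very end; in the second it sits between “Dannon” and “Welch's”.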

Normalizers
• Elasticsearch 5.2 introduced normalizers
‒ similar to analyzers, but they only emit one token (no tokenizing)
‒ only accept a subset of the available char filters and token filters
• Normalizers provide a nice solution for lowercasing
keyword values for sorting…

Defining a Custom Normalizer
• To use a normalizer, you have to define a custom one:
Let’s create a new index
PUT nutrition2
{
  "settings": {
    "analysis": {
      "normalizer" : {
        "my_normalizer" : {
          "type" : "custom",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  ...
A simple normalizer that only lowercases

Assigning a Normalizer
• Let’s assign brand_name.keyword to use our simple
lowercasing normalizer:
  ...
  "mappings": {
    "doc" : {
      "properties": {
        "brand_name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        },
        ...
Assign our “normalizer” to the “keyword” multi-field

Sort by Strings Again…
• Sorting by brand_name.keyword on our new index results
in the expected sort order this time:
GET nutrition2/_search
{
  "size": 20,
  "query": {
    "match" : {
      "item_name" : "yogurt"
    }
  },
  "sort" : [
    {
      "brand_name.keyword" : {
        "order" : "asc"
      }
    }
  ]
}
Results: “Chobani”, “Chobani Greek Yogurt”, “Crowley”, “Dannon”, “del Monte”, …, “Welch’s”, “Yoplait”
Pagination
Pagination
• A very common use case of searching is pagination
‒ dividing search results into “pages”, or subsets
• A search request uses the “from” and “size” parameters to implement pagination:
Return the first 10 hits from the given query
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "ingredients" : "onion"
    }
  },
  "sort" : [
    {
      "details.calories" : {
        "order" : "desc"
      }
    }
  ]
}

Retrieving Subsequent Pages
• You have some work to do in your client application to
determine which page to request
‒ But that is common in any paging use case
‒ Some clients will have an API to help simplify page requests
Returns the next 10 hits in the result set
{
  "from": 10,
  "size": 10,
  "query": {
    "match": {
      "ingredients" : "onion"
    }
  },
  "sort" : [
    {
      "details.calories" : {
        "order" : "desc"
      }
    }
  ]
}
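The client-side work of choosing a page usually comes down to one multiplication. A minimal Python sketch, assuming 1-based page numbers (the helper name `page_params` is hypothetical and not part of any Elasticsearch client):

```python
def page_params(page, page_size=10):
    """Translate a 1-based page number into "from" and "size" values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}

print(page_params(1))  # first page
print(page_params(2))  # second page
```

The returned dictionary can be merged into the request body next to the “query” and “sort” clauses.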

Be Careful with Deep Pagination
• Suppose you want hits from 990 to 1,000. That requires 1,000 documents from each shard to be considered!
‒ Deep pagination should be avoided
[Diagram: the coordinating node (node 1) gathers 1000 docs from each of six shards spread across node 1, node 2 and node 3]
If our index has 6 shards and we want 990-1000, then 6,000 documents will need to be considered
from + size cannot be more than index.max_result_window, which defaults to 10,000
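The arithmetic behind the deep pagination warning can be written down directly. This Python sketch only models the counting described on the slide; `docs_considered` is a hypothetical helper, and the constant mirrors the stated default of index.max_result_window.

```python
MAX_RESULT_WINDOW = 10_000  # default of index.max_result_window, per the slide

def docs_considered(from_, size, shards):
    """Each shard must produce its own top (from + size) documents."""
    if from_ + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds index.max_result_window")
    return shards * (from_ + size)

print(docs_considered(990, 10, shards=6))
```

Only 10 hits are returned to the client, but the coordinating node still has to merge the candidates from every shard.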
The search_after Parameter
• If you have a lot of pages, a better option for retrieving
subsequent pages is to use the search_after parameter:
‒ You specify the last sort values from the previous result set:
Run the initial search…and keep track of the “sort” block from the last document returned
{
  "size": 10,
  "query": {
    "match": {
      "ingredients" : "onion"
    }
  },
  "sort" : [
    {"details.calories" : {"order" : "desc"}},
    {"_score": {"order": "desc"}},
    {"_uid" : {"order" : "asc"}}
  ]
}
Add _uid to avoid any “ties” in sorting

The search_after Parameter
• Set the search_after parameter equal to the last sort
values from the previous result set:
The previous hit was 270 calories and _score of 1.0284368
{
  "size": 10,
  "query": {
    "match": {
      "ingredients" : "onion"
    }
  },
  "search_after" : [
    270,
    1.0284368,
    "doc#AVwBRWkdzWyyiGorJZ_j"
  ],
  "sort" : [
    {"details.calories" : {"order" : "desc"}},
    {"_score": {"order": "desc"}},
    {"_uid" : {"order" : "asc"}}
  ]
}
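The idea behind search_after can be imitated on an in-memory list. The Python sketch below is a simulation under simplifying assumptions: it sorts by calories (descending) and a unique id (ascending) only, leaves out _score, and uses made-up documents; `search` is a hypothetical function, not a client call.

```python
docs = [
    {"uid": "doc#a", "calories": 270},
    {"uid": "doc#b", "calories": 300},
    {"uid": "doc#c", "calories": 270},
    {"uid": "doc#d", "calories": 120},
    {"uid": "doc#e", "calories": 300},
]

def sort_key(doc):
    # calories descending, then uid ascending as the tie-breaker
    return (-doc["calories"], doc["uid"])

def search(size, search_after=None):
    ordered = sorted(docs, key=sort_key)
    if search_after is not None:
        calories, uid = search_after
        # keep only documents that sort strictly after the previous last hit
        ordered = [d for d in ordered if sort_key(d) > (-calories, uid)]
    return ordered[:size]

first_page = search(2)
last_hit = first_page[-1]
second_page = search(2, search_after=(last_hit["calories"], last_hit["uid"]))
print([d["uid"] for d in first_page], [d["uid"] for d in second_page])
```

Because the unique id breaks ties between documents with equal calories, no hit is skipped or repeated between pages.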
Scroll Searches
Scroll Searches
• What if you want more than a “page” of results?
‒ Large result sets can be expensive to both the cluster and the
client
• The scroll API allows you to take a snapshot of a large
number of results from a single search request
‒ Notice the word “snapshot”
‒ When you perform a scroll search, any changes made to documents after the scroll is initiated will not show up in your results
• Scroll searches are a 4-step process:

1. Initiate the scroll
• To initiate a scroll search, add the scroll parameter to your
search query
‒ Specifying how long Elasticsearch should keep the search
context alive
Initiate a scroll for this query
GET nutrition/_search?scroll=30s
{
  "size": 100,
  "query": {
    "match_all": {}
  },
  "sort": [
    "_doc"
  ]
}
scroll=30s: if we are idle for more than 30 seconds, then delete the scroll
"size": maximum number of hits to return
"sort": this example is not concerned with the order

2. Get the current _scroll_id
• The response will contain the first 100 hits, and
• a _scroll_id
‒ Hang on to it! You need it for the next request
{
  "_scroll_id":
"DnF1ZXJ5VGhlbkZldGNoCgAAAAAAAAflFnZVWmtsa2VKUUdTVnJCRkFiZ3RjZ2
cAAAAAAAAH5xZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-
YWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfoFnZVWmtsa2VKUUdTVnJCR
kFiZ3RjZ2cAAAAAAAAH6RZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB--0
WdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfuFnZVWmtsa2VKUUdTVnJCRk
FiZ3RjZ2c=",
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "failed": 0
  },
  ...

3. Retrieve more documents
• When you are ready for more documents, you just need to
send a GET request, along with the prior _scroll_id
Do not specify an index, type or query here - just invoke scroll
GET _search/scroll
{
  "scroll" : "30s",
  "scroll_id" :
"DnF1ZXJ5VGhlbkZldGNoCgAAAAAAAAflFnZVWmtsa2VKUUdTVnJCRkFiZ3RjZ2cAA
AAAAAAH5xZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-
YWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfoFnZVWmtsa2VKUUdTVnJCRkFi
Z3RjZ2cAAAAAAAAH6RZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-
oWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfrFnZVWmtsa2VKUUdTVnJCRkFi
Z3RjZ2cAAAAAAAAH7BZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-0WdlVaa2
xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfuFnZVWmtsa2VKUUdTVnJCRkFiZ3RjZ2c=
"
}
Always use the prior _scroll_id
If the scroll is still alive, the next set of hits is returned

4. Clearing a scroll
• If you are done using a scroll and want to manually close it,
use a DELETE request:
DELETE _search/scroll/DnF1ZXJ5VGhlbkZldGNoCgAAAAAAAAflFnZVWm
tsa2VKUUdTVnJCRkFiZ3RjZ2cAAAAAAAAH5xZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnA
AAAAAAABYWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfoFnZVWmtsa2VKUUdTV
nJCRkFiZ3RjZ2cAAAAAAAAH6RZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-
oWdlVaa2xrZUpRR1NWckJGQWJndGNnZwAAAAAAAAfrFnZVWmtsa2VKUUdTVnJCRkFiZ
3RjZ2cAAAAAAAAH7BZ2VVprbGtlSlFHU1ZyQkZBYmd0Y2dnAAAAAAAAB-0WdlVaa2xr
ZUpRR1NWckJGQWJndGNnZwAAAAAAAAfuFnZVWmtsa2VKUUdTVnJCRkFiZ3RjZ2c=
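The “snapshot” behaviour of a scroll can be illustrated with a small in-memory simulation. This Python sketch is not the scroll API: `scroll` is a hypothetical generator that copies a list once and then hands it out in batches, which is enough to show that later changes do not appear in the results.

```python
def scroll(index, size):
    """Yield batches from a copy of `index` taken when the first batch is requested."""
    snapshot = list(index)  # stands in for the search context kept alive by the scroll
    for start in range(0, len(snapshot), size):
        yield snapshot[start:start + size]

index = [{"id": i} for i in range(5)]

batches = scroll(index, size=2)
first_batch = next(batches)      # initiates the scroll and returns the first hits
index.append({"id": 99})         # indexed after the scroll was initiated
remaining = [doc for batch in batches for doc in batch]

print(first_batch, remaining)
```

The document with id 99 never shows up in `remaining`, mirroring the point that a scroll ignores changes made after it was initiated.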
Choosing a Search Type
Choosing a Search Type
• We have seen several different types of searches:
‒ regular searches: the typical “query” searches that we have
been running throughout the course
‒ scroll searches: a snapshot of a (possibly) large result set of
documents
‒ pagination searches: a small subset of a larger result set of
documents

Real-world Use Cases
• Regular search:
‒ Suppose we are pulling back search results from an auction
website like eBay and sorting by the date the auction is ending.
We want auctions that are ending the soonest at the top. We
would need to run a new query each time.
• Scroll search:
‒ Imagine you are implementing an “export your data” feature on a site like Facebook. A snapshot of the data works best for a large number of hits that we want to return in small batches.
• Pagination search:
‒ If you're implementing classic pagination on a website, then from and size are likely your “go to” option
‒ Use search_after if you need deep pagination
Chapter Review
Summary
• The inverted index concept introduced in the analysis chapter is
practically represented inside a package called a Lucene
segment
• Search queries typically have two phases: query and fetch
• Relevance refers to the scoring of a document based on how
closely it matches the query
• Use the sort clause for sorting by a field, _score or _doc
• Doc values store your original fields’ values on disk
• Normalizers allow you to apply certain character filters and token filters to keyword data types
• Use “from” and “size” parameters to implement pagination
• The scroll API allows you to take a snapshot of a large number of results from a single search request

Quiz
1. Give two use cases for enabling dfs_query_then_fetch on a query.
2. True or False: You can sort the results of a query by a field of type
“text”.
3. True or False: You can sort the results of a query by a field of type
“keyword”.
4. How are hits sorted by default?
5. True or False: Normalizers can only be applied to fields of type “keyword”.
6. What two parameters are used to implement paging of search results?
7. What is the difference between a scroll search and a pagination search?
8. True or False: You have to invoke _refresh occasionally while indexing a lot of documents to refresh the in-memory buffer.
Lab 7
Working with Search Results
Chapter 8
Aggregations
Topics covered:
• What are Aggregations?
• Types of Aggregations
• Buckets and Metrics
• Common Metrics Aggregations
• The range Aggregation
• The date_range Aggregation
• The terms Aggregation
• Nesting Buckets

What are Aggregations?
What is an Aggregation?
• We have been focusing on search, but Elasticsearch has
another powerful capability known as aggregations
• Aggregations are a way to perform analytics on your
indexed data
“What is the average ticket price of a movie in the U.S.?” That is an aggregation…
“What movies have “star” in the title?” That is a search query…

You ask different questions with aggs:
• In search, we have been asking for a subset of the original dataset:
‒ Results are hits that match our search parameters
‒ “What are the ERRORs in our log file?”
• In aggregations, we analyze the data by asking questions about the dataset:
‒ Results are (typically) computed values
‒ “How many 404 errors occurred each day last month?”
‒ “What is the average amount of time visitors spend on our home page?”
‒ “What is the most visited page on our website?”

booking.com
[Screenshot: a booking.com results page with two labeled areas, “Search criteria” and “Aggregations”]

Kibana Visualizations

Aggregation Syntax
• An aggregation request is a part of the Search API
‒ With or without a “query” clause
The “aggs” clause can be spelled out “aggregations”
{
  "aggs": {
    "my_aggregation": {
      "AGG_TYPE": {
        ...
      }
    }
  }
}
"my_aggregation": the name you choose comes back in the results
"AGG_TYPE": lots of different aggregation types

Example of an Aggregation
“What is the average number of calories of items with “olive oil” in the ingredients?”
{
  "query" : {
    "match_phrase": {
      "ingredients": "olive oil"
    }
  },
  "aggs": {
    "my_avg_calories": {
      "avg": {
        "field": "details.calories"
      }
    }
  }
}
The “query” defines the scope of the aggregation
The “aggs” section returns an average

Viewing the Results
• The results will have an “aggregations” section with the results of all the “aggs” in your search
  "hits": {
    "total": 8,
    "max_score": 6.0192027,
    "hits": [
      ...
    ]
  },
  "aggregations": {
    "my_avg_calories": {
      "value": 196.25
    }
  }
"my_avg_calories": the name you gave your aggregation
"value": the computation

If you just want the aggregation value:
• If you are not interested in the hits and only want the values
of the aggregations, then set “size” to 0
‒ This will speed up the query since the “fetch” phase can be
skipped
Request:
GET nutrition/_search
{
  "size": 0,
  "query" : {
    "match_phrase": {
      "ingredients": "olive oil"
    }
  },
  "aggs": {
    "my_avg_calories": {
      "avg": {
        "field": "details.calories"
      }
    }
  }
}
Response:
{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 8,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "my_avg_calories": {
      "value": 196.25
    }
  }
}
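On the client side, the computed value is read from the “aggregations” section by the name given in the request. A minimal Python sketch, assuming the response has already been parsed into a dictionary shaped like the slide's output (the helper `agg_value` is hypothetical):

```python
# A parsed response shaped like the slide's output when "size" is 0.
response = {
    "hits": {"total": 8, "max_score": 0, "hits": []},
    "aggregations": {"my_avg_calories": {"value": 196.25}},
}

def agg_value(response, name):
    """Return the single value of a metric aggregation by its name."""
    return response["aggregations"][name]["value"]

avg_calories = agg_value(response, "my_avg_calories")
print(avg_calories)
```

Note that "hits" is empty because no documents were fetched, while "total" still reports how many documents matched the query.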
Types of Aggregations
Four Types of Aggregations
• Bucket: aggregations that combine documents that meet a given criterion into buckets
‒ Bucket aggregations build buckets
• Metric: mathematical calculations performed on the fields of documents
‒ Typically, metrics are calculated on buckets of data
• Matrix: aggregation that operates on multiple fields and produces a matrix result
‒ matrix_stats is currently the only matrix aggregation
• Pipeline: aggregate the output of other aggregations (instead of from documents)
Buckets and Metrics
Buckets
• A bucket is simply a collection of documents that meet a
criterion
‒ Buckets are a key element of aggregations
• For example, suppose we want to analyze products by their
price range. We could put them into specific buckets:
[Diagram: all of your products divided into three buckets: price <= $5, price > $5 && price <= $10, and price > $10]

Buckets can be Nested
• Suppose we want to analyze products under $5 and also by
their customer rating
This bucket can consist of nested buckets within it
[Diagram: the price <= $5 bucket split into nested buckets rating=1, rating=2, rating=3, rating=4 and rating=5; for example, all products under $5 with a rating of 2]

Metrics
• Metrics compute numeric values based on your dataset
‒ Either from fields
‒ Or from values generated by custom scripts
• Most metrics are mathematical operations that output a
single value:
‒ avg, sum, min, max, cardinality
• Some metrics output multiple values:
‒ stats, percentiles, percentile_ranks

Metrics + Buckets = Aggregations
• An aggregation can be a combination of buckets and
metrics
‒ This is why aggregations are so powerful - they can be combined
in any number of combinations
‒ For example, you could sum all the values of the “quantitySold”
field, which would be one metric on one bucket
‒ You could also compute the average “quantitySold” by price
range, which would be one metric on multiple buckets:
“What is the average quantity of items sold for products under $5, from $5-$10, and over $10?”
That will require buckets and metrics
[Diagram: products grouped into three price-range buckets]
Common Metrics Aggregations
The avg Metric
• The avg metric aggregation computes the average of a
specified numeric field
“What is the average number of calories of products that have sugar in them?”
GET nutrition/_search
{
  "size" : 0,
  "query" : {
    "match": {
      "ingredients": "sugar"
    }
  },
  "aggs" : {
    "my_sugar_calorie_average" : {
      "avg": {
        "field": "details.calories"
      }
    }
  }
}
Response:
"aggregations": {
  "my_sugar_calorie_average": {
    "value": 155.44347826086957
  }
}

min and max
• These two metrics are fairly straightforward:
“What is the maximum and minimum number of calories of all products?”
{
  "size" : 0,
  "aggs" : {
    "my_max_calorie_item" : {
      "max": {
        "field": "details.calories"
      }
    },
    "my_min_calorie_item" : {
      "min": {
        "field": "details.calories"
      }
    }
  }
}
Response:
"aggregations": {
  "my_min_calorie_item": {
    "value": 0
  },
  "my_max_calorie_item": {
    "value": 630
  }
}
Notice you can have multiple aggregations in a single “aggs” clause

The stats Aggregation
• The stats metric aggregation gives you common statistics
of a field in a single aggregation request:
‒ count, min, max, avg and sum
“What can you tell me about the “calories” field?”
{
  "size" : 0,
  "aggs" : {
    "my_calorie_stats" : {
      "stats": {
        "field": "details.calories"
      }
    }
  }
}
Response:
"aggregations": {
  "my_calorie_stats": {
    "count": 499,
    "min": 0,
    "max": 630,
    "avg": 136.7875751503006,
    "sum": 68257
  }
}
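The five values returned by stats are ordinary descriptive statistics. The Python sketch below computes the same five numbers for a small made-up list of calorie values; it illustrates the arithmetic only and says nothing about how Elasticsearch computes them across shards.

```python
def stats(values):
    """Compute count, min, max, avg and sum for a list of numbers."""
    values = list(values)
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

calorie_stats = stats([0, 90, 150, 630])
print(calorie_stats)
```

Requesting stats once is cheaper to write than five separate metric aggregations and returns everything under one name.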

The cardinality Aggregation
• The cardinality metric aggregation is an approximation of
the number of distinct values of a field
‒ based on the HyperLogLog++ algorithm
‒ accurate up until precision_threshold (max of 40,000)
“About how many unique brand names are there?”
{
  "size" : 0,
  "aggs": {
    "my_brand_name_cardinality": {
      "cardinality": {
        "field": "brand_name.keyword",
        "precision_threshold": 100
      }
    }
  }
}
Response:
"aggregations": {
  "my_brand_name_cardinality": {
    "value": 430
  }
}
Common Bucket Aggregations
Common Bucket Aggregations
• Let’s look at two useful bucket aggregations:
‒ range
‒ terms
The range Aggregation
The range Aggregation
• The range bucket aggregation puts documents into buckets
based on ranges of values on a specified field
GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_calorie_range_buckets" : {
      "range": {
        "field": "details.calories",
        "ranges": [
          {
            "to": 100
          },
          {
            "from": 100,
            "to": 200
          },
          {
            "from" : 200
          }
        ]
      }
    }
  }
}
Response:
"aggregations": {
  "my_calorie_range_buckets": {
    "buckets": [
      {
        "key": "*-100.0",
        "to": 100,
        "doc_count": 178
      },
      {
        "key": "100.0-200.0",
        "from": 100,
        "to": 200,
        "doc_count": 207
      },
      {
        "key": "200.0-*",
        "from": 200,
        "doc_count": 114
      }
    ]
  }
}
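The bucketing rule of the range aggregation (“from” is inclusive, “to” is exclusive) can be reproduced in plain Python. The sketch below uses made-up calorie values, and `range_buckets` is a hypothetical helper rather than anything from an Elasticsearch client.

```python
def range_buckets(values, ranges):
    """Count values per range; "from" is inclusive and "to" is exclusive."""
    buckets = []
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        count = sum(
            1 for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        buckets.append({**r, "doc_count": count})
    return buckets

calories = [0, 99, 100, 150, 199.5, 200, 630]
buckets = range_buckets(calories, [{"to": 100}, {"from": 100, "to": 200}, {"from": 200}])
print(buckets)
```

A value of exactly 100 falls into the 100-200 bucket and a value of exactly 200 into the open-ended last bucket, so the three ranges do not overlap.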
Labeling Ranges
• Add a “key” field to label a range using any string you like
‒ Your “key” field gets returned in the results
"aggs" : {
  "my_calorie_range_buckets" : {
    "range": {
      "field": "details.calories",
      "ranges": [
        {
          "key" : "low_calorie",
          "to": 100
        },
        {
          "key" : "medium_calorie",
          "from": 100,
          "to": 200
        },
        {
          "key" : "high_calorie",
          "from" : 200
        }
      ]
    }
  }
}
Response:
"aggregations": {
  "my_calorie_range_buckets": {
    "buckets": [
      {
        "key": "low_calorie",
        "to": 100,
        "doc_count": 178
      },
      {
        "key": "medium_calorie",
        "from": 100,
        "to": 200,
        "doc_count": 207
      },
      {
        "key": "high_calorie",
        "from": 200,
        "doc_count": 114
      }
    ]
  }
}

Combining Buckets and Metrics
• The previous range example created three buckets
‒ Let’s perform a metrics aggregation on each bucket:
Add a nested aggregation to “my_calorie_range_buckets”
GET nutrition/_search
{
  "size" : 0,
  "aggs" : {
    "my_calorie_range_buckets" : {
      "range": {
        "field": "details.calories",
        "ranges": [
          {"to": 100},
          {"from": 100, "to": 200},
          {"from" : 200}
        ]
      },
      "aggs": {
        "my_average_total_fat": {
          "avg": {
            "field": "details.total_fat"
          }
        }
      }
    }
  }
}
The “avg” will be computed for each bucket

The result of our nested aggregation:
"aggregations": {
  "my_calorie_range_buckets": {
    "buckets": [
      {
        "key": "*-100.0",
        "to": 100,
        "doc_count": 178,
        "my_average_total_fat": {
          "value": 1.2522471911702933
        }
      },
      {
        "key": "100.0-200.0",
        "from": 100,
        "to": 200,
        "doc_count": 207,
        "my_average_total_fat": {
          "value": 4.618357487922705
        }
      },
      {
        "key": "200.0-*",
        "from": 200,
        "doc_count": 114,
        "my_average_total_fat": {
          "value": 12.20438596501685
        }
      }
    ]
  }
}
The first bucket's value is the average total fat for items with less than 100 calories
The date_range Aggregation
The Date Range Aggregation
• The date_range bucket aggregation is similar to range
except it is specifically used for date fields
‒ syntax looks like:
{
  "aggs" : {
    "my_date_range_agg" : {
      "date_range": {
        "field": "FIELD",
        "ranges": [
          {
            "from": "FROM",
            "to": "TO"
          }
        ]
      }
    }
  }
}
"from" and "to": you can use actual dates or date math expressions

Example of date_range
“Compare the week-to-week stock volume from January 1, 2010.”
GET stocks/_search
{
  "size": 0,
  "aggs": {
    "my_week_comparison": {
      "date_range": {
        "field": "trade_date",
        "ranges": [
          {
            "from": "2010-01-01||-1w",
            "to": "2010-01-01"
          },
          {
            "from": "2010-01-01",
            "to": "2010-01-01||+1w"
          }
        ]
      },
      "aggs": {
        "my_total_volume": {
          "sum": {
            "field": "volume"
          }
        }
      }
    }
  }
}
Each of the two buckets returns its own sum:
"my_total_volume": {
  "value": 75871338
}
"my_total_volume": {
  "value": 161270672
}
The terms Aggregation
Terms Aggregation
• The terms aggregation will dynamically create a new
bucket for every unique term it encounters of the specified
field
‒ In other words, each distinct value of the field will have its own
bucket of documents
“Which product has the lowest rating but sells the best?”
Start by using “terms” to put the products into distinct buckets by rating
[Diagram: products grouped into the buckets rating=1, rating=2, rating=3, rating=4 and rating=5]

Simple Example of terms
• You just need to specify a “field” to bucket the terms by:
GET products/_search
{
  "size" : 0,
  "aggs" : {
    "my_customer_ratings" : {
      "terms": {
        "field": "customerRating",
        "size": 5
      }
    }
  }
}
size = number of buckets (defaults to 10)

The Output of our terms Aggregation:
• Notice each bucket has a “key” that represents the distinct value of “field”,
• and “doc_count” for the number of docs in the bucket
"aggregations": {
  "my_customer_ratings": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": 2,
        "doc_count": 22190
      },
      {
        "key": 3,
        "doc_count": 22152
      },
      {
        "key": 1,
        "doc_count": 22100
      },
      {
        "key": 4,
        "doc_count": 22009
      },
      {
        "key": 5,
        "doc_count": 21984
      }
    ]
  }
}
“2” is the most common rating

Why are terms aggs not always precise?
shard0: a:5 b:4 c:3 d:4
shard1: a:4 b:5 c:3 d:1
Merged from the top 3 terms of each shard: a:9 b:9 d:4 c:3
If size=3, we get an inaccurate result (c is 6 and should be in the top 3)
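The two-shard example can be replayed in Python to see where the error comes from. This sketch assumes, as the slide does, that each shard returns only its own top `size` terms; the function name `approximate_terms` is hypothetical.

```python
from collections import Counter

shards = [
    Counter({"a": 5, "b": 4, "c": 3, "d": 4}),  # shard0
    Counter({"a": 4, "b": 5, "c": 3, "d": 1}),  # shard1
]

def approximate_terms(shards, size):
    """Merge only the top `size` terms reported by each shard."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(shard.most_common(size)))
    return merged

merged = approximate_terms(shards, size=3)
exact = sum(shards, Counter())  # what a full count over all shards would give

print(merged.most_common(3))
print(exact.most_common(3))
```

shard0 never reports c and shard1 never reports d, so the merged counts for both are too low and d (4) wrongly beats c (3) for third place, even though the true totals are c:6 and d:5.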

Understanding the Output of terms
• There are two values returned in a terms aggregation:
‒ “doc_count_error_upper_bound”: the maximum number of
missing documents that could potentially have appeared in a
bucket
‒ “sum_other_doc_count”: the number of documents that do not
appear in any of the buckets
{
  "size" : 0,
  "aggs" : {
    "my_brand_names" : {
      "terms": {
        "field": "brand_name.keyword"
      }
    }
  }
}
Response:
"aggregations": {
  "my_brand_names": {
    "doc_count_error_upper_bound": 5,
    "sum_other_doc_count": 469,
    "buckets": [
      {
        "key": "Giant",
        "doc_count": 6
      },
      ...
The value 6 may not be accurate

Enabling show_term_doc_count_error
• Shows you the upper-bound error for each bucket:
{
  "size" : 0,
  "aggs" : {
    "my_brand_names" : {
      "terms": {
        "field": "brand_name.keyword",
        "show_term_doc_count_error": true
      }
    }
  }
}
Response:
"my_brand_names": {
  ...
  "sum_other_doc_count": 469,
  "buckets": [
    {
      "key": "Giant",
      "doc_count": 6,
      "doc_count_error_upper_bound": 1
    },
    {
      "key": "Essential Everyday",
      "doc_count": 4,
      ...
    },
Both “Giant” and “Essential Everyday” can be at most 7

The shard_size Parameter
• shard_size tells Elasticsearch to gather more top terms
from each shard to improve accuracy
‒ shard_size = (size x 1.5) + 10 (by default)
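The default stated above is simple to compute. The Python sketch below just evaluates the formula from the slide; truncating the result to an integer is an assumption made here, and `default_shard_size` is a hypothetical helper.

```python
def default_shard_size(size):
    """Default shard_size per the slide: (size x 1.5) + 10.

    Truncation to an integer is an assumption of this sketch.
    """
    return int(size * 1.5 + 10)

print(default_shard_size(10))
```

With the default size of 10, each shard would therefore be asked for its top 25 terms rather than just its top 10.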

Increasing the shard_size Parameter
• Let’s try increasing shard_size to something larger than its
default value:
{
  "size" : 0,
  "aggs" : {
    "my_brand_names" : {
      "terms": {
        "field": "brand_name.keyword",
        "show_term_doc_count_error": true,
        "size": 5,
        "shard_size": 110
      }
    }
  }
}
Response:
"my_brand_names": {
  "doc_count_error_upper_bound": 0,
  ...
  "buckets": [
    {
      "key": "Giant",
      "doc_count": 6,
      ...
    },
    {
      "key": "Essential Everyday",
      "doc_count": 5,
      ...
    },
That lowered the error upper bound to 0 for the top hits
Nesting Buckets
Nesting Buckets
• A bucket can be defined within a bucket:
{
  "size" : 0,
  "aggs" : {
    "my_brand_names" : {
      "terms": {
        "field": "brand_name.keyword",
        "size" : 10,
        "shard_size" : 45
      },
      "aggs": {
        "my_calories_buckets": {
          "range": {
            "field": "details.calories",
            "ranges": [
              {
                "from": 0,
                "to": 100
              },
              {
                "from" : 100
              }
            ]
          }
        }
      }
    }
  }
}
"terms": put docs with the same “brand_name” into a bucket
"range": put docs from each “brand_name” into two buckets: “calories” < 100 and “calories” >= 100

Output of the example:
"aggregations": {
  "my_brand_names": {
    "buckets": [
      {
        "key": "Giant",
        "doc_count": 6,
        "my_calories_buckets": {
          "buckets": [
            {
              "key": "0.0-100.0",
              "from": 0,
              "to": 100,
              "doc_count": 2
            },
            {
              "key": "100.0-*",
              "from": 100,
              "doc_count": 4
            }
          ]
        }
      },
There are six items from “Giant”, with 2 items below 100 calories and
4 items at or above 100 calories

Chapter Review
Summary
• You have two new tools in your Elasticsearch toolbox:
‒ Metrics Aggregations like min, max, avg, stats
‒ Bucket Aggregations like range and terms
• Metrics compute numeric values based on your dataset
• A bucket is a collection of documents that meet a criterion
• An aggregation can be thought of as a unit-of-work that
builds analytic information over a set of documents
• Aggregations can be nested within other aggregations
• The range aggregation puts documents into buckets based
on ranges of values on a specified field
• The terms aggregation dynamically creates buckets for
every unique term it encounters in a specified field
Quiz
1. Query or Aggregation: “What is the box office revenue of
all movies directed by Steven Spielberg?”
2. Query or Aggregation: “What are the nearby restaurants
that serve pizza?”
3. What are the four types of aggregations?
4. What aggregation would you use to put logging events into
buckets by log type (“error”, “warn”, “info”, etc.)?
5. How can you increase the accuracy of a terms
aggregation?
6. What aggregation(s) would you use to answer the question
“How many unique visitors came to our website today?”

Lab 8
Working with Aggregations
Chapter 9
More Aggregations
Topics covered:
• Aggregation Scope
• Using post_filter
• The missing Aggregation
• Histograms
• Date Histograms
• Percentiles
• Top Hits
• Significant Terms
• Sorting Buckets

Aggregation Scope
Aggregation Scope
• Aggregation scope refers to the documents that are used
to compute the aggregation
‒ Up until now, your “aggs” have been computed using the “hits”
of your queries:
GET stocks/_search
{
  "size": 0,
  "query": {
    "term": {
      "stock_symbol.keyword": {
        "value": "EBAY"
      }
    }
  },
  "aggs": {
    "ebay_avg_high": {
      "avg": {
        "field": "high"
      }
    }
  }
}
The scope of this “avg” agg is the “EBAY” stocks from the query

Aggregation Scope
• There are options available to compute aggregations over a
different set of documents than the “hits”, including:
‒ global agg
‒ post_filter

The global Aggregation
• Use the global bucket aggregation to perform an
aggregation over all documents
‒ Even when a corresponding query is used
• The scope of a global aggregation is all documents

Example of global Aggregation
I want the max total_fat of items with 120 calories, compared to the
max total_fat of all items.
GET nutrition/_search
{
  "size": 0,
  "query": {
    "term": {
      "details.calories": 120
    }
  },
  "aggs": {
    "max_total_fat_from_120" : {
      "max": {
        "field": "details.total_fat"
      }
    },
    "all_of_my_items" : {
      "global": {},
      "aggs" : {
        "max_total_fat_from_all": {
          "max": {
            "field": "details.total_fat"
          }
        }
      }
    }
  }
}
max_total_fat_from_120: the max total fat of items with 120 calories
max_total_fat_from_all: the max total fat of all items

Output from the global Aggregation:
"aggregations": {
  "all_of_my_items": {
    "doc_count": 499,
    "max_total_fat_from_all": {
      "value": 36
    }
  },
  "max_total_fat_from_120": {
    "value": 14
  }
}
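The difference between the two scopes can be sketched in a few lines of Python. The documents below are invented sample data and the helper max_total_fat is hypothetical; the point is only that a regular aggregation sees the query hits while a global aggregation sees every document.

```python
# Invented sample documents (not taken from the course data set)
docs = [
    {"item_name": "item A", "calories": 120, "total_fat": 4},
    {"item_name": "item B", "calories": 120, "total_fat": 14},
    {"item_name": "item C", "calories": 300, "total_fat": 36},
    {"item_name": "item D", "calories": 90, "total_fat": 0},
]


def max_total_fat(documents):
    """Stand-in for a max metric aggregation on total_fat."""
    return max(d["total_fat"] for d in documents)


# Scope of a regular aggregation: only the hits of the query
hits = [d for d in docs if d["calories"] == 120]
max_total_fat_from_120 = max_total_fat(hits)

# Scope of a global aggregation: all documents, regardless of the query
max_total_fat_from_all = max_total_fat(docs)

if __name__ == "__main__":
    print(max_total_fat_from_120, max_total_fat_from_all)
```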

Using post_filter
Use Case for post_filter
• A typical search example you see online all the time:
Brand Names
Philadelphia (3)
Essential Everyday (2)
Giant (2)
Wegmans (2)
365 Everyday Value (1)

Search Results (“cheese” items from all brands)
1. Feta Cheese
2. Cheese, Italian Three Cheese
3. Turkey & Cheese
4. Macaroni and Cheese Dinner, Organic, Three Cheese
5. Shredded Mozzarella Cheese
6. Party Cheese Tray
7. Cheese & Grapes Snack
8. Potato Chips, Grilled Cheese
9. Tri-Color Cheese Tortellini
10. Burrito, Bean & Cheese

Selecting a brand name filters the search results
Retrieving the Brand Names
• Step 1 is to search for “cheese” and include the brand
names of companies that sell “cheese” items
‒ we will use a terms agg:
{
  "query": {
    "match": {
      "item_name": "cheese"
    }
  },
  "aggs": {
    "popular_brand_names": {
      "terms": {
        "field": "brand_name.keyword",
        "size": 5
      }
    }
  }
}
query: retrieve items with “cheese” in the name
terms: retrieve the top 5 brand names that sell “cheese” items

The User Selects a Brand Name…
• We still want our terms agg to return the top 5 brand
names
‒ the scope of the terms agg is still over all items with “cheese”
‒ use a post_filter to only return the items from “Wegmans”

Philadelphia (3)
Essential Everyday (2)
Giant (2)
Wegmans (2)   <- the user clicks on “Wegmans”

1. Feta Cheese
2. Cheese, Italian Three Cheese
3. Turkey & Cheese
4. Macaroni and Cheese Dinner, Organic, Three Cheese
5. Shredded Mozzarella Cheese
6. Party Cheese Tray
7. Cheese & Grapes Snack
8. Potato Chips, Grilled Cheese
9. Tri-Color Cheese Tortellini
10. Burrito, Bean & Cheese

Using post_filter
{
  "query": {
    "match": {
      "item_name": "cheese"
    }
  },
  "aggs": {
    "popular_brand_names": {
      "terms": {
        "field": "brand_name.keyword",
        "size": 5
      }
    }
  },
  "post_filter": {
    "match": {
      "brand_name.keyword": "Wegmans"
    }
  }
}
query: retrieve items with “cheese” in the name
terms: retrieve the top 5 brand names that sell “cheese” items
post_filter: only return items from “Wegmans”

Result from post_filter
• We get back the two “cheese” items from “Wegmans”,
along with the top 5 brand names again

Philadelphia (3)
Essential Everyday (2)
Giant (2)
Wegmans (2)

1. Ravioli, Three Cheese
2. Pierogies, Spinach & Feta Cheese, Club Pack
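The order of operations is what makes post_filter useful here: the aggregation runs on the query hits, and only afterwards are the hits narrowed down. The Python sketch below models that order with invented sample documents and a hypothetical search function; the whitespace tokenization is a simplification of a match query.

```python
from collections import Counter

# Invented sample documents for illustration
docs = [
    {"item_name": "Feta Cheese", "brand_name": "Giant"},
    {"item_name": "Ravioli, Three Cheese", "brand_name": "Wegmans"},
    {"item_name": "Party Cheese Tray", "brand_name": "Giant"},
    {"item_name": "Potato Chips", "brand_name": "Wegmans"},
]


def search(documents, word, brand=None, size=5):
    # 1. The query selects the hits (naive whitespace tokenization)
    hits = [d for d in documents if word in d["item_name"].lower().split()]
    # 2. Aggregations are computed over the query hits
    brand_counts = Counter(d["brand_name"] for d in hits).most_common(size)
    # 3. post_filter narrows the returned hits, after the aggregations ran
    if brand is not None:
        hits = [d for d in hits if d["brand_name"] == brand]
    return hits, brand_counts


hits, brand_counts = search(docs, "cheese", brand="Wegmans")

if __name__ == "__main__":
    print(brand_counts)
    print([d["item_name"] for d in hits])
```

The brand counts still cover every brand that sells “cheese” items, while the returned hits contain only the selected brand.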

The missing Aggregation
The missing Aggregation
• The missing bucket aggregation creates a bucket of all
documents that are missing a field value
‒ Either the field is not defined or it contains a null value
‒ Useful when you are creating buckets on a field value and you
have documents missing that particular field
{
  "aggs": {
    "my_bucket": {
      "missing": {
        "field": "FIELD_NAME"
      }
    }
  }
}

Example of missing
“Show me the brand names with the most products which are missing
their ingredients.”
GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_missing_ingredients": {
      "missing": {
        "field": "ingredients"
      },
      "aggs": {
        "my_brands": {
          "terms": {
            "field": "brand_name.keyword"
          }
        }
      }
    }
  }
}

"my_missing_ingredients": {
  "doc_count": 250,
  "my_brands": {
    "doc_count_error_upper_bound": 2,
    "buckets": [
      {
        "key": "Giant",
        "doc_count": 5
      },
      {
        "key": "Unknown",
        "doc_count": 5
      },
      {
“Giant” has 5 items missing their ingredients

Histograms
The histogram Aggregation
• The histogram bucket aggregation builds a histogram on a
given “field” using a specified “interval”:
‒ A histogram is like a bar chart that shows how many values fit
into various intervals
{
  "size": 0,
  "aggs": {
    "my_calories_histogram": {
      "histogram": {
        "field": "details.calories",
        "interval": 100
      }
    }
  }
}
A bucket is created for every 100 calories

The output of histogram:
• It’s a histogram!
‒ (OK, we cheated and used a Kibana visualization)

The API result:
• The number of buckets is dynamic and depends on the
input values and the interval
"aggregations": {
  "my_calories_histogram": {
    "buckets": [
      {
        "key": 0,
        "doc_count": 178
      },
      {
        "key": 100,
        "doc_count": 207
      },
      {
        "key": 200,
        "doc_count": 84
      },
      {
        "key": 300,
        "doc_count": 24
      },
      {
        "key": 400,
        "doc_count": 3
      },
      {
        "key": 500,
        "doc_count": 1
      },
      {
        "key": 600,
        "doc_count": 2
      }
    ]
24 items have calories between 300 inclusive and 400 exclusive
The min_doc_count Parameter
• You can use min_doc_count to set the minimum number of
documents a bucket must contain to be returned, which
throws out smaller buckets
GET nutrition/_search
{
  "size": 0,
  "aggs": {
    "my_calories_histogram": {
      "histogram": {
        "field": "details.calories",
        "interval": 100,
        "min_doc_count": 10
      }
    }
  }
}

"aggregations": {
  "my_calories_histogram": {
    "buckets": [
      {
        "key": 0,
        "doc_count": 178
      },
      {
        "key": 100,
        "doc_count": 207
      },
      {
        "key": 200,
        "doc_count": 84
      },
      {
        "key": 300,
        "doc_count": 24
      }
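The bucketing rule can be sketched in Python as follows. This is a simplified model with a hypothetical histogram function and invented calorie values: each value is rounded down to the nearest multiple of the interval, and buckets with fewer than min_doc_count documents are dropped. Unlike a real histogram aggregation, the sketch never emits empty buckets.

```python
import math
from collections import Counter


def histogram(values, interval, min_doc_count=1):
    """Group values into fixed-width buckets keyed by the bucket's lower bound."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return [
        {"key": key, "doc_count": count}
        for key, count in sorted(counts.items())
        if count >= min_doc_count
    ]


# Invented calorie values
calories = [0, 96, 100, 195, 199, 310]

if __name__ == "__main__":
    print(histogram(calories, 100))
    print(histogram(calories, 100, min_doc_count=2))
```

A value of 100 lands in the bucket with key 100, not key 0, which matches the "inclusive lower bound, exclusive upper bound" reading of the buckets.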

Metrics on a Histogram
• Just like any bucket, you can perform metrics on a
histogram
‒ Let’s look at the stats of each bucket
{
  "size": 0,
  "aggs": {
    "my_calories_histogram": {
      "histogram": {
        "field": "details.calories",
        "interval": 100,
        "min_doc_count": 10
      },
      "aggs": {
        "my_bucket_stats": {
          "stats": {
            "field": "details.calories"
          }
        }
      }
    }
  }
}
my_bucket_stats: metric within a bucket

Output of stats:
"aggregations": {
  "my_calories_histogram": {
    "buckets": [
      {
        "key": 0,
        "doc_count": 178,
        "my_bucket_stats": {
          "count": 178,
          "min": 0,
          "max": 96,
          "avg": 47.29775280898876,
          "sum": 8419
        }
      },
      {
        "key": 100,
        "doc_count": 207,
        "my_bucket_stats": {
          "count": 207,
          "min": 100,
          "max": 195,
          "avg": 137.8985507246377,
          "sum": 28545
        }
      },
There are 178 products with less than 100 calories
There is at least one product with 0 calories

Date Histograms
Date Histograms
• The date_histogram bucket aggregation creates buckets
over a date field and specified interval
‒ Histograms over date ranges are a powerful tool in data analytics
GET stocks/_search
{
  "aggs": {
    "my_date_histogram": {
      "date_histogram": {
        "field": "DATE_FIELD",
        "interval": "INTERVAL"
      }
    }
  }
}
field: must be a “date” field
interval: the interval of each bucket

Intervals
• Valid values for interval are either one of the following
strings:
‒ year, quarter, month, week, day, hour, minute, second
• Or a time value:
‒ a number followed by a time unit
‒ for example: “2d” is two days
• Time units are:
‒ d = days, h = hours, m = minutes, s = seconds, ms = milliseconds

Example of a Date Histogram
• Let’s run a date_histogram aggregation on the stock
prices:
GET stocks/_search
{
  "size": 0,
  "aggs": {
    "my_weekly_buckets": {
      "date_histogram": {
        "field": "trade_date",
        "interval": "1w"
      }
    }
  }
}
Put the stock prices from each week into a bucket

The stocks in weekly buckets:
Notice the buckets are 7 days apart
"aggregations": {
  "my_weekly_buckets": {
    "buckets": [
      {
        "key_as_string": "2009-08-20T00:00:00.000Z",
        "key": 1250726400000,
        "doc_count": 1996
      },
      {
        "key_as_string": "2009-08-27T00:00:00.000Z",
        "key": 1251331200000,
        "doc_count": 2495
      },
      {
        "key_as_string": "2009-09-03T00:00:00.000Z",
        "key": 1251936000000,
        "doc_count": 1497
      },
      {
        "key_as_string": "2009-09-10T00:00:00.000Z",
        "key": 1252540800000,
        "doc_count": 2495
      },
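The keys in this response are exact multiples of seven days counted from the Unix epoch (1250726400000 ms is 2068 x 7 days). The Python sketch below reproduces them under that assumption, which is inferred from the keys shown here and is not a general statement about how calendar intervals are computed. The function names are hypothetical.

```python
from datetime import datetime, timezone

WEEK_MS = 7 * 24 * 60 * 60 * 1000  # seven days in milliseconds


def weekly_bucket_key(ts: datetime) -> int:
    """Round a timezone-aware timestamp down to the start of its 7-day bucket."""
    epoch_ms = int(ts.timestamp() * 1000)
    return epoch_ms - (epoch_ms % WEEK_MS)


def key_as_string(key: int) -> str:
    """Format a bucket key (epoch milliseconds) like the key_as_string field."""
    moment = datetime.fromtimestamp(key / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.000Z")


if __name__ == "__main__":
    trade = datetime(2009, 8, 24, 15, 30, tzinfo=timezone.utc)
    key = weekly_bucket_key(trade)
    print(key, key_as_string(key))
```

A trade on 2009-08-24 falls into the bucket that starts on 2009-08-20, the first bucket in the response.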

Example of Metrics on a Date Histogram
What is the monthly average stock price of the “SHW” stock?
GET stocks/_search
{
  "query": {
    "match": {
      "stock_symbol": "SHW"
    }
  },
  "size": 0,
  "aggs": {
    "my_stock_date_histogram": {
      "date_histogram": {
        "field": "trade_date",
        "interval": "month"
      },
      "aggs": {
        "my_avg_close": {
          "avg": {
            "field": "close"
          }
        }
      }
    }
  }
}

"2009-08-01" = 60.85857173374721
"2009-09-01" = 60.339000129699706
"2009-10-01" = 60.16849994659424
"2009-11-01" = 60.203529582304114
"2009-12-01" = 61.94954560019753
"2010-01-01" = 60.4199995743601
"2010-02-01" = 64.26526320608039
"2010-03-01" = 65.56347905034605
"2010-04-01" = 74.24333336239769
"2010-05-01" = 77.47699966430665
"2010-06-01" = 74.70545473965731
"2010-07-01" = 70.23285675048828
"2010-08-01" = 69.4557135445731

Visualization of our date_histogram

Percentiles
The Percentiles Aggregation
• The percentiles metrics aggregation calculates percentiles
over a numeric field
‒ a percentile is the value at or below which a given percentage
of the observed values fall
What is the 95th percentile of the calories field?
{
  "size": 0,
  "aggs": {
    "my_calorie_percentiles": {
      "percentiles": {
        "field": "details.calories"
      }
    }
  }
}

"aggregations": {
  "my_calorie_percentiles": {
    "values": {
      "1.0": 0,
      "5.0": 10,
      "25.0": 70,
      "50.0": 120,
      "75.0": 189.16666666,
      "95.0": 310,
      "99.0": 400
    }
  }
}
95% of items have 310 or fewer calories
Percentiles Example
• Quintiles are a commonly-used percentile
‒ 5 evenly-spaced percentages: 20% - 40% - 60% - 80% - 100%
What are the quintile values for calories?
{
  "size" : 0,
  "aggs": {
    "my_calorie_quintiles": {
      "percentiles": {
        "field": "details.calories",
        "percents": [
          20,
          40,
          60,
          80,
          100
        ]
      }
    }
  }
}

"aggregations": {
  "my_calorie_quintiles": {
    "values": {
      "20.0": 60,
      "40.0": 110,
      "60.0": 140,
      "80.0": 210,
      "100.0": 630
    }
  }
}
80% of the items have 210 or fewer calories

Percentile Ranks
• Instead of providing percents and getting back values, you
can pass in a value and get back its percentile
‒ percentile_ranks is essentially the inverse of percentiles
What percentile is 150 calories?
{
  "size": 0,
  "aggs": {
    "my_calorie_percentiles": {
      "percentile_ranks": {
        "field": "details.calories",
        "values": [150, 300]
      }
    }
  }
}
You can pass in multiple values

"aggregations": {
  "my_calorie_percentiles": {
    "values": {
      "150.0": 65.33066132264528,
      "300.0": 94.18837675350701
    }
  }
}
65.3% of items have 150 calories or less, and 94.2% have 300 calories
or less
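The idea behind a percentile rank can be shown with an exact computation in Python. This is only an illustration with invented calorie values and a hypothetical function name; Elasticsearch computes percentiles with an approximation algorithm, so its results on large data sets can differ slightly from an exact count.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to threshold."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


# Invented calorie values
calories = [0, 70, 120, 150, 190, 310, 400, 630]

if __name__ == "__main__":
    for value in (150, 300):
        print(value, "->", percentile_rank(calories, value))
```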

Top Hits
The Top Hits Aggregation
• top_hits is a metrics aggregation that allows you to find the
most relevant documents in a bucket
‒ used typically as a sub-aggregation
"aggs": {
  "my_top_hits": {
    "top_hits": {
      "size": SIZE,
      "sort": [],
      "from": OFFSET
    }
  }
}
size: number of top hits to return per bucket
from: the offset from the first result you want to fetch

Motivation for top_hits
• Suppose we do a search for items with sugar in the
ingredients, then bucket those by the number of calories:
GET nutrition/_search
{
  "size" : 0,
  "query": {
    "match": {
      "ingredients": "sugar"
    }
  },
  "aggs": {
    "my_calorie_bucket": {
      "terms": {
        "field": "details.calories"
      }
    }
  }
}

"buckets": [
  {
    "key": 140,
    "doc_count": 9
  },
  {
    "key": 120,
    "doc_count": 8
  },
  {
    "key": 270,
    "doc_count": 7
  },
9 items with sugar have 140 calories, but what are the most relevant
items in this bucket?

Example of top_hits
{
  "size" : 0,
  "query": {
    "match": {
      "ingredients": "sugar"
    }
  },
  "aggs": {
    "my_calorie_bucket": {
      "terms": {
        "field": "details.calories"
      },
      "aggs": {
        "my_top_hits": {
          "top_hits": {
            "size": 5
          }
        }
      }
    }
  }
}
top_hits is a sub-aggregation of the “calorie_bucket”
Returns the top 5 hits in each bucket (based on its _score of the
“sugar” query)

The output of top_hits:
• Notice the top 5 hits from each bucket are returned in the
“aggregations” clause of the response
"aggregations": {
  "buckets": [
    {
      "key": 140,
      "doc_count": 9,
      "my_top_hits": {
        "hits": {
          "total": 9,
          "max_score": 1.341344,
          "hits": [
            top 5 hits
          ]
The top 5 hits in this bucket:
‒ "Soda"
‒ "Hot Cocoa Mix, Cappuccino Classics Suprema 1.16 Oz"
‒ "Raisin Bran"
‒ "Exotic Juice Drink, Guanabana"
‒ "Simply Made Cookies, Butter"

Significant Terms
The Significant Terms Aggregation
• The significant_terms bucket aggregation is used to find
interesting or unusual occurrences of terms
• For example, suppose a bank needs to find fraudulent
credit card transactions
‒ A group of customers complains about fraudulent charges
‒ A regular terms aggregation would reveal that a lot of those
customers charged something recently at Amazon. This is
“commonly common” - a lot of customers charge things at
Amazon
‒ Three of the complaining customers had charges at a local
grocery store. This is “commonly uncommon” - only a few
customers had this same transaction
‒ significant_terms helps you find the “uncommonly common” -
the merchants with charges from all complaining customers, but
not a lot of the other customers
Enabling Fielddata
• If we are going to perform a terms agg or significant terms
agg on a text data type, we need to enable fielddata for
that field
‒ In our example, we are going to perform both aggregations on
the ingredients field of our nutrition index:
PUT nutrition/_mapping/doc
{
  "properties" : {
    "ingredients" : {
      "type" : "text",
      "fielddata" : true
    }
  }
}
Update the existing mapping to enable fielddata for ingredients

Let’s Start with a Terms Agg
Let’s look for a relationship between total fat and ingredients.
GET nutrition/_search
{
  "size":0,
  "aggs" : {
    "my_total_fat_histogram":{
      "histogram": {
        "field": "details.total_fat",
        "interval": 5,
        "min_doc_count": 5
      },
      "aggregations":{
        "my_top_words" : {
          "terms" : {
            "field" : "ingredients",
            "size":5
          }
        }
      }
    }
  }
}

The Results of the Terms Agg
• These would be considered “commonly common”
ingredients:
"key": 15,
"doc_count": 31,
"my_top_words": {
  "buckets": [
    {
      "key": "salt",
      "doc_count": 15
    },
    {
      "key": "oil",
      "doc_count": 14
    },
    {
      "key": "and",
      "doc_count": 11
    },
    {
      "key": "corn",
      "doc_count": 8
    },
    {
      "key": "acid",
      "doc_count": 7
    }
  ]}
31 products with total_fat between 15 and 20
The “commonly common” ingredients are “salt”, “oil”, “corn”, “acid”
and the stopword “and”

Using significant_terms
• Now let’s try the same agg with significant_terms:
Let’s look for the “uncommonly common” ingredients in relationship
to total fat.
{
  "size":0,
  "aggs" : {
    "my_total_fat_histogram":{
      "histogram": {
        "field": "details.total_fat",
        "interval": 5,
        "min_doc_count": 5
      },
      "aggregations":{
        "my_top_words" : {
          "significant_terms" : {
            "field" : "ingredients",
            "size":5
          }
        }
      }
    }
  }
}

The Output of significant_terms:
"key": 15,
"my_top_words": {
  "doc_count": 31,
  "buckets": [
    {
      "key": "almonds",
      "doc_count": 5,
      "score": 2.4349635796045783,
      "bg_count": 5
    },
    {
      "key": "flax",
      "doc_count": 4,
      "score": 1.947970863683663,
      "bg_count": 4
    },
    {
      "key": "shortening",
      "doc_count": 4,
      "score": 1.532570239334027,
      "bg_count": 5
    },
    {
      "key": "apples",
      "doc_count": 4,
      "score": 1.532570239334027,
      "bg_count": 5
    },
    {
      "key": "cashews",
      "doc_count": 3,
      "score": 1.460978147762747,
      "bg_count": 3
The same 31 products with total_fat between 15 and 20
The top 5 significant ingredients are “almonds”, “flax”, “shortening”,
“apples” and “cashews”
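The scores in this output can be reproduced with the JLH heuristic, which is the default scoring heuristic of significant_terms. The Python sketch below assumes a background set of 499 documents (the size of the nutrition index reported earlier in the global aggregation output) and a foreground set of the 31 documents in the bucket; the function name jlh_score is made up for illustration.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """JLH heuristic: (fg% - bg%) * (fg% / bg%).

    fg_* describe the bucket being analyzed (foreground set),
    bg_* describe the whole index (background set).
    """
    if fg_total <= 0 or bg_total <= 0 or bg_count <= 0:
        return 0.0
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    if fg <= bg:
        return 0.0  # the term is not more frequent in the foreground
    return (fg - bg) * (fg / bg)


if __name__ == "__main__":
    # (doc_count, bg_count) pairs taken from the output above
    for term, fg_count, bg_count in [("almonds", 5, 5), ("flax", 4, 4),
                                     ("shortening", 4, 5), ("cashews", 3, 3)]:
        print(term, jlh_score(fg_count, 31, bg_count, 499))
```

“almonds” appears in only 5 documents of the whole index, and all 5 are in this bucket, which is why it scores highest even though “salt” is far more frequent.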

Sorting Buckets
Default Bucket Sorting
• By default, terms buckets are sorted by _count (the document
count) in descending order; histogram and date_histogram
buckets are sorted by _key in ascending order
• You can specify the sorting of terms and the histograms
using “order”:
‒ _count: sort by document count; works with terms, histogram and
date_histogram
‒ _term: sort by the term alphabetically; only works with terms
‒ _key: sort by the bucket’s key; works with histogram and
date_histogram
• You can also sort by a metric value in a nested aggregation

Sort by _key Example
GET stocks/_search
{
  "aggs": {
    "my_stock_date_histogram": {
      "date_histogram": {
        "field": "trade_date",
        "interval": "month",
        "order": {
          "_key": "desc"
        }
      },
      "aggs": {
        "my_avg_close": {
          "avg": {
            "field": "close"
          }
        }
      }
    }
  }
}
The default for date_histogram is _key ascending

Sorting by a Metric
Find the brands with the highest-calorie products.
{
  "size" : 0,
  "aggs": {
    "my_brands_bucket": {
      "terms": {
        "field": "brand_name.keyword",
        "order": {
          "my_max_calories": "desc"
        }
      },
      "aggs": {
        "my_max_calories": {
          "max": {
            "field": "details.calories"
          }
        }
      }
    }
  }
}
This output will be sorted by the results of the nested metric

Chapter Review
Summary
• Use global to perform an aggregation over all documents
• The histogram bucket aggregation builds a histogram on a
given “field” using a specified “interval”
• The date_histogram bucket aggregation creates buckets
over a date field and specified interval
• The percentiles metrics aggregation calculates percentiles
over a numeric field
• top_hits is a metrics aggregation that allows you to find the
most relevant documents in another bucket
• The significant_terms bucket aggregation is used to find
interesting or unusual occurrences of terms

Quiz
1. How many ERRORs occurred per day last month? Which
aggregation would work well for this?
2. You want to recommend movies based on another movie
that someone likes. Which aggregation might work well for
finding recommendations that are out of the ordinary?
3. Which aggregations would you use to answer the question:
“What are the top 3 posts from the top 10 users at
StackOverflow?”
4. You want to compare the total volume of stocks from the
tech industry to the overall market. What aggregation would
you use?

Lab 9
More Aggregations
Chapter 10
Handling Relationships
Topics covered:
• The Need for Data Modeling
• Denormalization
• The Need for Nested Types
• Nested Types
• Querying a Nested Type
• Sorting on a Nested Type
• The Nested Aggregation
• Parent/Child Types
• The has_child Query
• The has_parent Query

The Need for Data Modeling
It’s all about relationships…
• If you come from the SQL world, you will likely need to
change your thought process for modeling data for
Elasticsearch
‒ In SQL, you typically normalize your data
• Search requires different considerations
‒ In Elasticsearch, you typically denormalize your data!
• A flat world has its advantages
‒ Indexing and searching is fast
‒ No need to join tables or lock rows

…and sometimes relationships matter
• There are times when relationships matter
‒ We need to bridge the gap between normal and flat
• Four common techniques for managing relational data in
Elasticsearch
‒ Denormalizing: flatten your data (typically the best solution)
‒ Application-side joins: run multiple queries on normalized data
‒ Nested objects
‒ Parent/child relationships
• We will discuss nested objects and parent/child
relationships in detail in this chapter

Denormalization
Denormalization
• Denormalizing your data refers to “flattening” your data
‒ storing redundant copies of data in each document, instead of
using some type of relationship
• Denormalization provides the best performance out of
Elasticsearch
‒ no need to perform expensive joins

Example of Denormalization
• Suppose you are indexing users and tweets
‒ users in one index, and tweets in another
• When searching for tweets, suppose you want to search by
the username also
users
{
  "username" : "harrison",
  "userid" : 1,
  "city" : "Los Angeles",
  "state" : "California"
}
tweets
{
  "body" : "My favorite movie is Star Wars",
  "time" : "2017-01-24T02:32:27",
  "userid" : 1
}

• Duplicate the desired users data in the tweets document:
users
PUT users/doc/1
{
  "username" : "harrison",
  "userid" : 1,
  "city" : "Los Angeles",
  "state" : "California"
}
tweets
PUT tweets/doc/123
{
  "body" : "My favorite movie is Star Wars",
  "time" : "2017-01-24T02:32:27",
  "userid" : 1,
  "username" : "harrison"
}
PUT tweets/doc/456
{
  "body" : "Laugh it up, fuzzball.",
  "time" : "1977-05-25T00:00:00",
  "userid" : 1,
  "username" : "harrison"
}

• Now you can search tweets and username all in a single
query:
GET tweets/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"body": "movie"}},
        {"match": {"username": "harrison"}}
      ]
    }
  }
}

"_score": 0.55510485,
"_source": {
  "body": "My favorite movie is Star Wars",
  "time": "2017-01-24T02:32:27",
  "userid": 1,
  "username": "harrison"
}
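In practice the copy usually happens in application code at index time. The Python sketch below shows one way this could look; the function denormalize_tweet and the in-memory users_by_id lookup are hypothetical and stand in for whatever user store your application has.

```python
def denormalize_tweet(tweet, users_by_id, fields=("username",)):
    """Return a copy of the tweet with selected user fields copied into it."""
    user = users_by_id[tweet["userid"]]
    enriched = dict(tweet)  # leave the original tweet untouched
    for field in fields:
        enriched[field] = user[field]
    return enriched


# The user and tweet documents from the slides
users_by_id = {
    1: {"username": "harrison", "userid": 1,
        "city": "Los Angeles", "state": "California"},
}
tweet = {
    "body": "My favorite movie is Star Wars",
    "time": "2017-01-24T02:32:27",
    "userid": 1,
}

enriched = denormalize_tweet(tweet, users_by_id)

if __name__ == "__main__":
    print(enriched)
```

Keeping userid in the tweet alongside the copied username makes it possible to find and update the copies later if the username changes.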
Tips for Denormalizing Data
• Denormalize data to optimize for reads
• There is no mechanism to keep the denormalized data
consistent with the original document
‒ avoid denormalizing data that is likely to change frequently
• When denormalizing data that changes, you should ensure
that there is 1 and only 1 authoritative source for the data
‒ If you do denormalize data that might change, it can help to also
denormalize a field that does not change, like an _id
users
PUT users/doc/1
{
  "userid" : 1,
  ...
}
tweets
PUT tweets/doc/123
{
  "userid" : 1,
  ...
}
Our example uses “userid” as an authoritative source

The Need for Nested Types
Inner Objects
• There is an interesting (and confusing) scenario that can
arise with JSON inner objects
• Suppose we index the following two documents into an
index named “companies”:
PUT companies/doc/1
{
  "name" : "Stark Enterprises",
  "employees" : [
    {"first_name" : "Tony", "last_name" : "Stark"},
    {"first_name" : "Virginia", "last_name" : "Potts"}
  ]
}
PUT companies/doc/2
{
  "name" : "NBC Universal",
  "employees" : [
    {"first_name" : "Tony", "last_name" : "Potts"}
  ]
}

Searching Inner Objects…
• What do you think the result is of the following query?
GET companies/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"employees.first_name": "Tony"}},
        {"match": {"employees.last_name": "Potts"}}
      ]
    }
  }
}

…can have surprising results:
"hits": {
  "total": 2,
  "max_score": 0.5753642,
  "hits": [
    {
      "_index": "companies",
      "_type": "doc",
      "_id": "2",
      "_score": 0.5753642,
      "_source": {
        "name": "NBC Universal",
        "employees": [
          {
            "first_name": "Tony",
            "last_name": "Potts"
          }
        ]
      }
    },
    {
      "_index": "companies",
      "_type": "doc",
      "_id": "1",
      "_score": 0.51623213,
      "_source": {
        "name": "Stark Enterprises",
        "employees": [
          {
            "first_name": "Tony",
            "last_name": "Stark"
          },
          {
            "first_name": "Virginia",
            "last_name": "Potts"
          }
        ]
      }
“Stark Enterprises” does not have an employee named “Tony Potts”

Why the confusing results?
• The “first_name” and “last_name” fields are JSON inner
objects of the “employees” field
‒ When the JSON object got flattened in the index, we lost the
relationship between “Tony” and “Potts”
All first names and last names are put into respective arrays
{
  "name" : "Stark Enterprises",
  "employees.first_name" : [ "Tony", "Virginia" ],
  "employees.last_name" : [ "Stark", "Potts" ]
}
A query for “Tony Potts” is a hit (so is “Virginia Stark”)
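The flattening can be modeled in a few lines of Python. This is a simplified sketch with hypothetical helper names (flatten, matches_all) and exact-value matching instead of analyzed text; it is meant only to show how the pairing between first and last names is lost.

```python
def flatten(doc, array_field):
    """Flatten an array of inner objects into one multi-valued field per key."""
    flat = {k: v for k, v in doc.items() if k != array_field}
    for obj in doc[array_field]:
        for key, value in obj.items():
            flat.setdefault(f"{array_field}.{key}", []).append(value)
    return flat


def matches_all(flat_doc, conditions):
    """True if every field contains its wanted value (like a bool/must of matches)."""
    return all(value in flat_doc.get(field, [])
               for field, value in conditions.items())


stark = {
    "name": "Stark Enterprises",
    "employees": [
        {"first_name": "Tony", "last_name": "Stark"},
        {"first_name": "Virginia", "last_name": "Potts"},
    ],
}

flat = flatten(stark, "employees")
tony_potts = {"employees.first_name": "Tony", "employees.last_name": "Potts"}

if __name__ == "__main__":
    print(flat)
    print(matches_all(flat, tony_potts))
```

Each condition is satisfied by a different employee, yet the flattened document as a whole still matches.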

How do we fix this scenario?
• A hit on the inner object yields a hit on the root object
• The problem is with the mapping:
‒ We need to use the “nested” data type for the JSON inner
objects (in this particular scenario)
"companies": {
  "mappings": {
    "doc": {
      "properties": {
        "employees": {
          "properties": {
            "first_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
              ...
“first_name” and “last_name” are inner objects - but they should be
“nested” for our use case

Nested Types
Nested Data Type
• The nested type allows arrays of objects to be indexed and
queried independently of each other
‒ Use nested if you need to maintain the independence of each
object in the array
• The mapping syntax looks like:
"mappings": {
  "doc": {
    "properties": {
      "outer_object": {
        "type": "nested",
        "properties": {
          "inner_field": "TYPE",
          ...
        }
      }
    },

Example of nested
PUT companies
{
  "mappings": {
    "doc": {
      "properties": {
        "employees": {
          "type": "nested",
          "properties": {
            "first_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "last_name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
            ...
“first_name” and “last_name” are nested inside “employees”

Nested Objects are Stored Separately
• The nested mapping stores the nested documents in a
non-flattened way that maintains the original association:
{
  "name" : "Stark Enterprises"
}
{
  "employees.first_name" : "Tony",
  "employees.last_name" : "Stark"
}
{
  "employees.first_name" : "Virginia",
  "employees.last_name" : "Potts"
}

Querying a Nested Type
Querying a nested Type
• To query a nested data type, you have to use a “nested”
query
{
"query": { Specify the “path” to the
"nested": { nested object
"path": "employees",
"query": {
"bool": {
"must": [
n
io
is
{"match": {"employees.first_name": "Tony"}},
ec
D
&
{"match": {"employees.last_name": "Potts"}}
s
es
]
in
us
-B
}
71
}
20
p-
Se
}
9-
-0
}
d
ar
}
er
G
n
ie
st
ba
Se
We get 1 hit this time, as

expected (“NBC Universal”)
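If you build such request bodies in application code, a small helper keeps the “path” prefix consistent across clauses. The following is a minimal sketch that only constructs the JSON body as a Python dict; the helper name nested_bool_query is hypothetical, and sending the body to Elasticsearch is left to whichever client you use.

```python
import json


def nested_bool_query(path, field_values):
    """Build a nested query whose clauses must all match in the same nested object."""
    must = [
        {"match": {f"{path}.{field}": value}}
        for field, value in field_values.items()
    ]
    return {"query": {"nested": {"path": path, "query": {"bool": {"must": must}}}}}


body = nested_bool_query("employees", {"first_name": "Tony", "last_name": "Potts"})
print(json.dumps(body, indent=2))
```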
Finding the Cause of a Hit
• Suppose we search for companies that have an employee
with the first name “Tony”
‒ both companies (“Stark Enterprises” and “NBC Universal”) are a
“hit”, but the response does not reveal specifically which nested
employee generated the hit:
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "match": {
          "employees.first_name": "Tony"
        }
      }
    }
  }
}

The inner_hits Query
• The inner_hits query reveals which specific nested object
(or objects) caused the hit
‒ inner_hits is added to your “nested” query:
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "match": {
          "employees.first_name": "Tony"
        }
      },
      "inner_hits" : {}
    }
  }
}

• The output from inner_hits looks like:
‒ the outer hit is a company that has an employee named “Tony”
‒ the entry under “inner_hits” is the employee named “Tony” that
caused the hit
{
  "_type": "doc",
  "_id": "1",
  "_score": 0.6931472,
  "_source": {
    "name": "Stark Enterprises",
    "employees": [
      ...
    ]
  },
  "inner_hits": {
    "employees": {
      "hits": {
        "total": 1,
        "max_score": 0.6931472,
        "hits": [
          {
            "_nested": {
              "field": "employees",
              "offset": 0
            },
            "_score": 0.6931472,
            "_source": {
              "first_name": "Tony",
              "last_name": "Stark"
            }
          }
        ]
      }
    }
  }
}

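On the client side, the nested objects that caused a hit can be read from the “inner_hits” section of each hit. The sketch below assumes a hit has already been parsed into a Python dict with the shape shown in the inner_hits output; the function name nested_inner_hits is hypothetical and the sample hit is trimmed to the keys that are used.

```python
def nested_inner_hits(hit, path):
    """Return (offset, source) for each nested object listed under inner_hits[path]."""
    inner = hit.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
    return [(h["_nested"]["offset"], h["_source"]) for h in inner]


# Sample hit shaped like the inner_hits output (trimmed to the relevant keys).
hit = {
    "_id": "1",
    "_source": {"name": "Stark Enterprises"},
    "inner_hits": {
        "employees": {
            "hits": {
                "total": 1,
                "hits": [
                    {
                        "_nested": {"field": "employees", "offset": 0},
                        "_source": {"last_name": "Stark"},
                    }
                ],
            }
        }
    },
}

for offset, source in nested_inner_hits(hit, "employees"):
    print(hit["_source"]["name"], offset, source)
```

The “offset” is the position of the matching object inside the “employees” array of the root document.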
Sorting on a Nested Type
Sorting Nested Queries
• It is possible to sort by the value of a nested field
‒ even though the value exists in a separate nested document
• Sorting in nested queries is different from sorting in
non-nested queries
‒ we have to repeat the original query in the “sort” clause
‒ otherwise the sort value is taken from all nested objects of each
hit, not just the ones that matched the nested query

Example of Nested Sorting
• Notice the “match” query is repeated in “sort” in a
“nested_filter” clause:
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "match": {"employees.last_name": "Potts"}
      }
    }
  },
  "sort": {
    "employees.first_name.keyword" : {
      "order" : "asc",
      "nested_path": "employees",
      "nested_filter" : {
        "match": {"employees.last_name": "Potts"}
      }
    }
  }
}
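Because the same query has to appear twice, it is easy for the two copies to drift apart. A minimal sketch of one way to avoid that is to generate both clauses from a single query object; the helper name nested_sorted_search is hypothetical, and the body uses the “nested_path” and “nested_filter” keys from the nested sorting example.

```python
import copy


def nested_sorted_search(path, query, sort_field, order="asc"):
    """Build a search body that reuses `query` as the sort's nested_filter."""
    return {
        "query": {"nested": {"path": path, "query": query}},
        "sort": {
            sort_field: {
                "order": order,
                "nested_path": path,
                # Deep copy so the two clauses cannot be mutated through each other.
                "nested_filter": copy.deepcopy(query),
            }
        },
    }


body = nested_sorted_search(
    "employees",
    {"match": {"employees.last_name": "Potts"}},
    "employees.first_name.keyword",
)
print(body["sort"])
```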

Result of the Nested Sorting:
• Notice the first names are sorted “asc”
      "employees": [
        {
          ...
        }
      ]
    },
    "sort": [
      "Tony"
    ]
  },
  ...
      "employees": [
        {
          ...
          "last_name": "Stark"
        },
        {
          "first_name": "Virginia",
          ...
        }
      ]
    },
    "sort": [
      "Virginia"
    ]
  }

The Nested Aggregation
The nested Aggregation
• The nested bucket aggregation puts nested objects into a
bucket
‒ Useful for performing sub-aggregations
• The following simple example puts all employees into a
bucket:
{
  "size": 0,
  "aggs": {
    "my_employees_bucket": {
      "nested": {
        "path": "employees"
      }
    }
  }
}
• The response:
"aggregations": {
  "my_employees_bucket": {
    "doc_count": 3
  }
}

Example of a nested Aggregation
• Suppose we want to put all employees with the same last
name into the same bucket:
{
  "size": 0,
  "aggs": {
    "my_employees_bucket": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "last_name": {
          "terms": {
            "field" : "employees.last_name.keyword"
          }
        }
      }
    }
  }
}

The result of the nested aggregation:
• We only have two unique last names in our small dataset:
"aggregations": {
  "my_employees_bucket": {
    "doc_count": 3,
    "last_name": {
      "buckets": [
        {
          "key": "Potts",
          "doc_count": 2
        },
        {
          "key": "Stark",
          "doc_count": 1
        }
      ]
    }
  }
}

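To see where the counts come from, the nested and terms aggregations can be imitated in plain Python. This is a rough sketch over an in-memory copy of the example data; the dataset is reconstructed from the employees named on the slides, and the function name nested_terms is hypothetical.

```python
from collections import Counter

# In-memory stand-in for the two company documents used on the slides.
companies = [
    {"name": "Stark Enterprises", "employees": [
        {"first_name": "Tony", "last_name": "Stark"},
        {"first_name": "Virginia", "last_name": "Potts"},
    ]},
    {"name": "NBC Universal", "employees": [
        {"first_name": "Tony", "last_name": "Potts"},
    ]},
]


def nested_terms(docs, path, field):
    """Collect every nested object under `path`, then count the values of `field`."""
    nested_objects = [obj for doc in docs for obj in doc.get(path, [])]
    counts = Counter(obj[field] for obj in nested_objects)
    # Order by descending count, then by key, so the output is deterministic.
    buckets = [
        {"key": key, "doc_count": count}
        for key, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ]
    return {"doc_count": len(nested_objects), "buckets": buckets}


print(nested_terms(companies, "employees", "last_name"))
```

The outer “doc_count” counts nested employee objects (3), not company documents (2).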
Parent/Child Types
The Need for Parent/Child Types
• Nested objects are great for many use cases
‒ But nested objects all live within the same document
‒ updating a nested object requires a complete reindexing of the
root object AND all of its other nested objects
• Using a parent/child data type, you can completely
separate two objects while maintaining their relationship
‒ the parent and children are completely separate documents
‒ the parent can be updated without reindexing the children
‒ children can be added/changed/deleted without affecting the
parent or any other children
• Configured in your mappings using the _parent setting

Parent/Child and Shards
• Behind the scenes, the parent and all its children must live
on the same shard
‒ This makes query-time joins very fast
• Remember how documents are routed using the hash
function?
‒ Hashing the child’s own id will not work for a child document - it
must get routed to the same shard as its parent
‒ The parent’s id is used as the routing value for the child
document
• Therefore, every time you refer to a child document, you
must specify its parent’s id
‒ We will see how this is implemented next…
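The routing idea can be sketched in a few lines. This is an illustration only: CRC32 is used here as a stand-in for the real hash function, so the shard numbers will not match an actual cluster, and the function name shard_for is hypothetical.

```python
import zlib


def shard_for(routing, num_shards):
    """Pick a shard from a routing value (CRC32 stands in for the real hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


NUM_SHARDS = 5
parent_id = "c1"
child_id = "emp1"

parent_shard = shard_for(parent_id, NUM_SHARDS)
# The child is routed by its parent's id, not by its own id.
child_shard = shard_for(parent_id, NUM_SHARDS)
# Routing by the child's own id may or may not land on the parent's shard.
unrouted_shard = shard_for(child_id, NUM_SHARDS)

print(parent_shard, child_shard, unrouted_shard)
```

Because the child reuses the parent's routing value, both always resolve to the same shard, whatever the hash function is.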

Defining a Parent/Child Relationship
• Let’s go through the steps of defining a parent/child
relationship
1. Define the mappings
2. Index some parent documents
3. Index some child documents
4. Query the documents

1. Define the mappings
• You have to define the parent mapping before any parent
documents can be indexed
‒ Use the _parent property to specify the parent type
‒ The following “employee” type has “company” as its parent:
PUT companies
{
  "mappings": {
    "company" : {},
    "employee" : {
      "_parent": {
        "type": "company"
      }
    }
  }
}

2. Index some parent documents
• A parent is not aware of its children, so indexing a parent
document is simply a matter of putting it into
Elasticsearch
‒ a child can actually be indexed prior to its parent, but we will just
index the parents first
PUT companies/company/c1
{
  "name" : "Stark Enterprises"
}

PUT companies/company/c2
{
  "name" : "NBC Universal"
}

3. Index some child documents
• A child document needs to specify its parent when it is
indexed
‒ Use the “parent” parameter to specify the parent id
‒ Here c1 is “Stark Enterprises” and c2 is “NBC Universal”
PUT companies/employee/emp1?parent=c1
{
  "first_name" : "Tony",
  "last_name" : "Stark"
}

PUT companies/employee/emp2?parent=c1
{
  "first_name" : "Virginia",
  "last_name" : "Potts"
}

PUT companies/employee/emp3?parent=c2
{
  "first_name" : "Tony",
  "last_name" : "Potts"
}
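When child documents are indexed from application code, the parent id has to travel with every request. A minimal sketch of building the request paths used above follows; the helper name child_doc_path is hypothetical, and only the path is built here, not the HTTP call itself.

```python
from urllib.parse import quote, urlencode


def child_doc_path(index, doc_type, doc_id, parent_id):
    """Build the request path for a child document, including its parent id."""
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    query = urlencode({"parent": parent_id})
    return f"{path}?{query}"


employees = [("emp1", "c1"), ("emp2", "c1"), ("emp3", "c2")]
for emp_id, company_id in employees:
    print("PUT", child_doc_path("companies", "employee", emp_id, company_id))
```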

4. Query the documents
• Notice that querying the index results in 5 separate
documents:
GET companies/_search
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      ...
    ]
  }
}
• Typical parent/child queries involve has_child and
has_parent

The has_child Query
The has_child Query
• The has_child query is a filter that accepts a query and the
child type to run against
‒ It returns parent documents that have child docs matching the
query
‒ Parent and child documents are routed to the same shard, so
joining them when searching will be in-memory and efficient
‒ The search runs against the parent type; “type” names the
child type
GET my_index/parent/_search
{
  "query": {
    "has_child": {
      "type": "TYPE",
      "query": {}
    }
  }
}

Example of has_child
• I want all companies that have an employee named “Stark”
GET companies/company/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "match": {
          "last_name": "Stark"
        }
      }
    }
  }
}
• The response contains:
"hits": [
  {
    "_index": "companies",
    "_type": "company",
    "_id": "c1",
    "_score": 1,
    "_source": {
      "name": "Stark Enterprises"
    }
  }
]

• Notice in the previous query that “Stark Enterprises” has
an employee with the last name “Stark”, but we do not
know which employee caused the hit
‒ Use the inner_hits query to get the relevant children from the
has_child query:
GET companies/company/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "match": {
          "last_name": "Stark"
        }
      },
      "inner_hits" : {}
    }
  }
}
• The response includes the matching child under “inner_hits”:
"_source": {
  "name": "Stark Enterprises"
},
"inner_hits": {
  "employee": {
    "hits": {
      "total": 1,
      "max_score": 0.6931472,
      "hits": [
        {
          "_source": {
            "first_name": "Tony",
            "last_name": "Stark"
          }
        }
      ]
    }
  }
}

The has_parent Query
The has_parent Query
• The has_parent query is executed in the parent document
space
‒ Returns child documents whose associated parents have
matched
‒ The search runs against the child type; “parent_type” names
the parent type
GET my_index/child/_search
{
  "query": {
    "has_parent": {
      "parent_type": "TYPE",
      "query": {}
    }
  }
}

Example of has_parent
• I want all employees that work for “NBC”
GET companies/employee/_search
{
  "query": {
    "has_parent": {
      "parent_type": "company",
      "query": {
        "match": {
          "name": "NBC"
        }
      }
    }
  }
}
• The response contains:
"_source": {
  "first_name": "Tony",
  ...
}
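The direction of the two joins is easy to mix up, so here is a small in-memory imitation of both. It is only a sketch of the semantics using the five documents indexed earlier; the functions has_child and has_parent below are hypothetical Python helpers, and the substring test is a crude stand-in for a “match” query.

```python
# In-memory stand-ins for the documents indexed on the earlier slides.
companies = {
    "c1": {"name": "Stark Enterprises"},
    "c2": {"name": "NBC Universal"},
}
employees = {
    "emp1": {"parent": "c1", "first_name": "Tony", "last_name": "Stark"},
    "emp2": {"parent": "c1", "first_name": "Virginia", "last_name": "Potts"},
    "emp3": {"parent": "c2", "first_name": "Tony", "last_name": "Potts"},
}


def has_child(child_matches):
    """Return ids of parents that have at least one child passing `child_matches`."""
    return sorted({e["parent"] for e in employees.values() if child_matches(e)})


def has_parent(parent_matches):
    """Return ids of children whose parent passes `parent_matches`."""
    return sorted(
        emp_id
        for emp_id, e in employees.items()
        if parent_matches(companies[e["parent"]])
    )


print(has_child(lambda e: e["last_name"] == "Stark"))  # ['c1']
print(has_parent(lambda c: "NBC" in c["name"]))        # ['emp3']
```

has_child filters on children and returns parents; has_parent filters on parents and returns children.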

Accessing a Child Document
• The Document APIs require the parent parameter for routing
purposes
‒ A child document is stored on the same shard as its parent
‒ The APIs need to know the routing id of the parent to find the
child
• Without the parent parameter, the request fails:
GET companies/employee/emp1
{
  "error": {
    "root_cause": [
      {
        "type": "routing_missing_exception",
        "reason": "routing is required for [companies]/[employee]/[emp1]",
        "index_uuid": "_na_",
        "index": "companies"
      }
    ],
    "type": "routing_missing_exception",
    ...
• With the parent parameter, the document is found:
GET companies/employee/emp1?parent=c1
{
  "_index": "companies",
  "_type": "employee",
  "_id": "emp1",
  "_version": 2,
  "_routing": "c1",
  "_parent": "c1",
  "found": true,
  "_source": {
    "first_name": "Tony",
    "last_name": "Stark"
  }
}

Updating Child Documents
• One of the key benefits of a parent/child relationship is the
ability to modify a child object independently of the parent
‒ For example, changing an employee has no effect on its parent
(or any of its siblings either)
‒ The following request changes “Tony” to “Anthony”:
POST companies/employee/emp1/_update?parent=c1
{
  "doc" : {
    "first_name" : "Anthony"
  }
}

Choosing a Technique
[Flowchart; recoverable decision points:]
• Can all related items fit in a single JSON doc?
‒ No: consider parent/child
• Are docs updated frequently?
‒ Yes: consider parent/child
• Do your docs contain arrays of objects?
‒ No: denormalize your data
• Do your queries test more than 1 property when matching
objects in these arrays?
‒ Yes: consider nested docs
Kibana Considerations
• Kibana currently does not support nested types or
parent/child relationships
‒ Important to consider this limitation if you are using your indices
for dashboards and visualizations in Kibana
• Kibana may implement these relationships in the future
‒ but for now, you will have to decide which is more important:
using nested or parent/child relationships, or being able to
visualize your data in Kibana

Chapter Review
Summary
• The nested type allows arrays of objects to be indexed and
queried independently of each other
• Sorting by a nested field requires you to repeat the original
query in the “sort” clause
• The nested bucket aggregation puts nested objects into a
bucket
• Updating a nested object requires a complete reindexing of
the root object AND all of its other nested objects
• Using a parent/child data type, you can completely
separate two objects while maintaining their relationship
• One of the key benefits of a parent/child relationship is the
ability to modify a child object independently of the parent

Quiz
1. True or False: Updating a nested inner object causes the
root object and all other nested objects to be reindexed.
2. True or False: Deleting a child object actually causes the
_parent object and all other siblings to be reindexed.
3. Why not just use a parent/child relationship all the time (as
opposed to nested types) when dealing with relational
objects?
4. True or False: A child object is routed to the same shard
as its _parent object.
5. True or False: Deleting a parent object causes all child
objects to be deleted.
6. True or False: You can index a child object whose _parent
does not exist.

Lab 10
Handling Relationships

Conclusions
Development vs. Production Mode
HTTP vs. Transport
• There are two important network communication
mechanisms in Elasticsearch to understand:
‒ HTTP: address and port to bind to for HTTP communication,
which is how the Elasticsearch REST APIs are exposed
‒ transport: used for internal communication between nodes
within the cluster
• The defaults are fine for downloading and playing with
Elasticsearch, but not useful for production systems
‒ both bind to localhost by default
‒ HTTP binds to the first available port in the range 9200-9299
‒ transport binds to the first available port in the range 9300-9399
Development vs. Production Mode
• Every Elasticsearch 5.x instance is either in development
mode or production mode:
‒ development mode: if it does not bind transport to an external
interface (the default)
‒ production mode: if it does bind transport to an external
interface
• my_dev_cluster: “I am just a local cluster used for development.”
http.port: 9200
http.host: localhost
transport.tcp.port: 9300
transport.bind_host: localhost
• my_production_cluster: “I am running in a production environment.”
http.port: 9200
http.host: 192.168.1.21
transport.tcp.port: 9300
transport.bind_host: 192.168.1.21
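The rule above can be expressed as a small check on the transport bind host. This is a rough sketch of the decision only, not Elasticsearch's own implementation: the function names are hypothetical, only the transport.bind_host setting from the two example configurations is considered, and unknown host names are assumed to be external.

```python
import ipaddress

LOCAL_NAMES = {"localhost", "_local_"}


def is_loopback(host):
    """True for loopback names/addresses such as localhost or 127.0.0.1."""
    if host in LOCAL_NAMES:
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Unknown host names are treated as external in this sketch.
        return False


def mode(settings):
    """Classify a node from its transport.bind_host setting."""
    host = settings.get("transport.bind_host", "localhost")
    return "development" if is_loopback(host) else "production"


print(mode({"transport.bind_host": "localhost"}))
print(mode({"transport.bind_host": "192.168.1.21"}))
```

Note that http.host plays no part in the decision; only the transport binding matters.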

Bootstrap Checks
• Elasticsearch has bootstrap checks upon startup:
‒ inspect a variety of Elasticsearch and system settings
‒ compare them to values that are safe for the operation of
Elasticsearch
• Bootstrap checks behave differently depending on the
mode:
‒ development mode: any bootstrap checks that fail appear as
warnings in the Elasticsearch log
‒ production mode: any bootstrap checks that fail will cause
Elasticsearch to refuse to start
‒ https://www.elastic.co/guide/en/elasticsearch/reference/master/bootstrap-checks.html
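The difference in behaviour between the two modes can be sketched as follows. This only illustrates the warn-versus-refuse logic described above; the function name run_bootstrap_checks and the failure messages are hypothetical and are not taken from Elasticsearch itself.

```python
import logging

logger = logging.getLogger("bootstrap-sketch")


def run_bootstrap_checks(failures, production):
    """Warn about failed checks in development mode; refuse to start in production."""
    if not failures:
        return True
    if production:
        raise RuntimeError("bootstrap checks failed: " + "; ".join(failures))
    for failure in failures:
        logger.warning("bootstrap check failed: %s", failure)
    return True


# Development mode: the failure is logged and startup continues.
print(run_bootstrap_checks(["example check"], production=False))
```

In production mode the same failure list raises an error instead of returning, which models the node refusing to start.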

Documentation and Help
• Discussion forums: http://discuss.elastic.co/
• Meetups: http://elasticsearch.meetup.com
• Docs: https://elastic.co/docs
• Community: https://elastic.co/community
• More resources: https://elastic.co/learn

Elastic Training Paths

Quiz Answers
Chapter 1 Quiz Answers
1. True
2. As a noun: an index is what Elasticsearch creates for your
documents. As a verb: we “index” documents into an index
3. True
4. True - much more efficient than separate requests
5. Elasticsearch will define a new index for you
6. No. If you want Elasticsearch to generate an ID, you have
to use POST

Chapter 2 Quiz Answers
1. Multiple match terms use “and” or “or”, while multiple terms
in match_phrase are “and” and position matters
2. slop
3. now-12h
4. 70% of 5 = 3.5 and it rounds down, so 3 need to match
5. 10 docs from my_index, but only the “username” field in
the _source is returned

Chapter 3 Quiz Answers
1. text analysis
2. Exact values are not analyzed vs. full text is analyzed
3. character filter, tokenizer, token filter
4. “happy” is analyzed to “happi”
5. Three tokens: "192.168.1.201", "jing", and
"https://training.Elastic.co/"

Chapter 4 Quiz Answers
1. “value” would be a “long”
2. The document is not indexed and an error occurs
3. True
4. False: you cannot change the data type of any field in a
mapping
5. “my_custom_analyzer” at indexing, and “english” at
search time

Chapter 5 Quiz Answers
1. 1
2. The query would search for “cat”, “kitten” or “pet” in the
queried field
3. multi_match
4. The search terms in a full-text query are analyzed. Term-level
queries are not analyzed.
5. We are searching for “java” in both the body and subject.
“most_fields” gives a higher score to documents with
“java” in both fields; “best_fields” returns the score of the
better field

Chapter 6 Quiz Answers
1. Document based, using hash(_id) % num_shards to
determine which shard a doc is routed to
2. Since no settings are defined, the number of shards and
replicas default to 5 and 1, respectively, which means 10
shards
3. False - there is nothing to do when adding a new node,
except to wait
4. 12 = 4 primary shards + 8 replicas
5. 2

Chapter 7 Quiz Answers
1. Skewed data, like what can occur with time-series data;
and small datasets that are more likely to have skew
2. True, but only if fielddata is enabled.
3. True
4. _score
5. True
6. “from” and “size”
7. scroll is a snapshot, pagination is “dynamic”
8. False. You “can” call _refresh, but it is certainly not
required or recommended

Chapter 8 Quiz Answers
1. Actually both query and aggregation
2. Query
3. Metrics, bucket, matrix, pipeline
4. terms
5. Increase shard_size
6. cardinality

Chapter 9 Quiz Answers
1. Date histogram + terms/filters is probably the simplest
solution
2. Significant terms aggregation
3. Use terms aggregation for the username buckets, then a
top_hits sub-aggregation
4. You could do an aggregation of the tech industry stocks,
along with a global agg over all stocks

Chapter 10 Quiz Answers
1. True
2. False
3. There is an overhead to parent/child - they are separate
documents that must be joined. The join is fast, but nested
objects are all a single document which will always be
faster for searches (but more expensive for updates)
4. True - using the parent parameter
5. False
6. True

Thank you!
Please complete the online survey
Course: Core Elasticsearch for Developers
Version 5.5.1
© 2015-2017 Elasticsearch BV. All rights reserved. Decompiling, copying, publishing and/or distribution without written consent of Elasticsearch BV is
strictly prohibited.

