Elastic Stack 7
elasticsearch
getting set up
page
sundog-education.com 02
elasticsearch
system requirements
enable virtualization
Virtualization must be enabled in your BIOS settings. If
you have “Hyper-V” virtualization as an option, turn it
off.
beware avast
Avast anti-virus is known to conflict with Virtualbox.
let’s do
this.
elasticsearch
basics.
logical concepts of
elasticsearch
documents
Documents are the things you're searching for. They can be more than text – any structured JSON data works. Every document has a unique ID, and a type.

indices
An index powers search into all documents within a collection of types. It contains inverted indices that let you search across everything within them at once, and mappings that define schemas for the data within.
what is an
inverted index
Document 1: "Space: The final frontier. These are the voyages…"
Document 2: "He's bad, he's number one. He's the space cowboy with the laser gun!"

Inverted index:
space: 1, 2
the: 1, 2
final: 1
frontier: 1
he: 2
bad: 2
…
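A toy version of this structure takes only a few lines of Python (purely illustrative – Lucene's real inverted index is far more sophisticated, with term frequencies, positions, and compressed postings):

```python
import re

docs = {
    1: "Space: The final frontier. These are the voyages...",
    2: "He's bad, he's number one. He's the space cowboy with the laser gun!",
}

# Build a mapping of term -> set of document IDs containing that term
inverted = {}
for doc_id, text in docs.items():
    for term in re.findall(r"[a-z]+", text.lower()):
        inverted.setdefault(term, set()).add(doc_id)
```

Looking up `inverted["space"]` now yields the matching document IDs directly, without scanning every document.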
of course it’s not
quite that simple.
using
indices
so what’s new in
elasticsearch 7?
how
elasticsearch
scales
an index is split into
shards.
Documents are hashed to a particular shard.
(e.g., a Shakespeare index distributed across shards 1, 2, 3, …)
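The routing scheme can be sketched in Python. Elasticsearch actually hashes the document's routing value (murmur3) modulo the number of primary shards; `crc32` stands in here only to keep the sketch dependency-free:

```python
import zlib

def shard_for(doc_id: str, num_primary_shards: int) -> int:
    # Elasticsearch uses murmur3 on the routing value (the document ID by
    # default); crc32 is a stand-in hash for illustration.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# The same ID always routes to the same shard -- which is also why the
# number of primary shards can't be changed after the index is created.
```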
primary and replica
shards
This index has two primary shards and two replicas.
Your application should round-robin requests amongst nodes.
The number of primary shards
cannot be changed later.
quiz time
The schema for your documents is defined by…
• The index
• The mapping
• The document itself
01
What purpose do
inverted indices serve?
02
03
• 8
• 15
• 20
Elasticsearch is built
only for full-text search
of documents.
04
• true
• false
connecting to
your cluster
elasticsearch
more setup
examining
movielens
movielens
movielens is a free dataset
of movie ratings gathered
from movielens.org.
common
mappings
more about
analyzers
character filters
remove HTML encoding, convert & to and
tokenizer
split strings on whitespace / punctuation / non-letters
token filter
lowercasing, stemming, synonyms, stopwords
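The three stages above can be mimicked in a short Python sketch. The function name, the stopword list, and the specific filters are illustrative choices, not what any built-in analyzer actually ships with:

```python
import re

STOPWORDS = {"the", "a", "an"}  # a tiny stand-in stopword list

def analyze(text: str) -> list:
    # character filter: convert & to "and"
    text = text.replace("&", " and ")
    # tokenizer: split on anything that isn't a letter
    tokens = re.findall(r"[A-Za-z]+", text)
    # token filters: lowercase, then drop stopwords
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]
```

Each stage hands its output to the next, exactly like the character filter → tokenizer → token filter pipeline described above.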
choices for
analyzers
standard
splits on word boundaries, removes punctuation,
lowercases. good choice if language is unknown
simple
splits on anything that isn’t a letter, and lowercases
whitespace
splits on whitespace but doesn’t lowercase
import
one document
insert
curl -XPUT
127.0.0.1:9200/movies/_doc/109487 -d '
{
"genre" : ["IMAX","Sci-Fi"],
"title" : "Interstellar",
"year" : 2014
}'
import
many
documents
json bulk import
updating
documents
versions
partial update api
deleting
documents
it couldn’t be
easier.
elasticsearch
dealing with
concurrency
the problem
optimistic
concurrency control
data
modeling
strategies for
relational data
normalized data:
RATING (userID, movieID, rating) → look up movieID → MOVIE (movieID, title, genres)

denormalized data:
RATING (userID, rating, title)
strategies for
relational data
Star Wars
query-line
search
“query lite”
/movies/_search?q=title:star
/movies/_search?q=+year:>2010+title:trek
it’s not always
simpler.
/movies/_search?q=+year:>2010+title:trek
/movies/_search?q=%2Byear%3A%3E2010+%2Btitle%3Atrek
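Python's standard library shows exactly where that second form comes from – URL-encoding the query string turns `+` into `%2B`, `:` into `%3A`, `>` into `%3E`, and the space into `+`:

```python
from urllib.parse import quote_plus

query = "+year:>2010 +title:trek"
encoded = quote_plus(query)

# Assemble the "query lite" URL from the encoded string
url = "/movies/_search?q=" + encoded
```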
and it can be
dangerous.
learn more.
queries and filters
use filters when you can – they are faster and cacheable.
example: boolean
query with a filter
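The example slide isn't reproduced in this extract; a request body along these lines (field names borrowed from the movies index used throughout – this is a sketch, not the slide's exact code) pairs a scored `match` clause with a cacheable `range` filter:

```python
import json

# must: scored full-text match; filter: exact, cacheable year constraint
search_body = {
    "query": {
        "bool": {
            "must": {"match": {"title": "trek"}},
            "filter": {"range": {"year": {"gte": 2010}}},
        }
    }
}

body_json = json.dumps(search_body, indent=2)
```

You'd POST `body_json` to `/movies/_search`; only the `must` clause contributes to relevance scoring.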
some types of
filters
term: filter by exact values
{"term": {"year": 2014}}
range: find numbers or dates in a given range (gt, gte, lt, lte)
{"range": {"year": {"gte": 2010}}}
some types of
queries
match_all: returns all documents and is the default. Normally used with a filter.
{"match_all": {}}
bool: Works like a bool filter, but results are scored by relevance.
syntax reminder
you can combine filters inside queries, or queries inside filters too.
phrase
search
phrase matching
slop
order matters, but you’re OK with some words being in between the terms:
the slop represents how far you’re willing to let a term move to satisfy a
phrase (in either direction!)
another example: “quick brown fox” would match “quick fox” with a slop of 1.
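That "quick brown fox" example translates to a request body like this (a sketch; the `title` field name is assumed):

```python
# match_phrase with slop=1: "quick fox" still matches "quick brown fox",
# because "fox" only has to move one position to satisfy the phrase
phrase_query = {
    "query": {
        "match_phrase": {
            "title": {"query": "quick fox", "slop": 1}
        }
    }
}
```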
proximity queries
just use a really high slop if you want to get any documents that contain the words in your
phrase, but want documents that have the words closer together scored higher.
elasticsearch
pagination
specify "from" and "size"
results 1–3: from = 0, size = 3
results 4–6: from = 3, size = 3
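The arithmetic behind those two requests is just an offset computation, which can be captured in a tiny helper (a hypothetical convenience function, not part of any client library):

```python
def page_params(page_number: int, page_size: int) -> dict:
    # "from" is the zero-based offset of the first result to return
    return {"from": page_number * page_size, "size": page_size}
```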
pagination syntax
beware
sorting
sorting your results is
usually quite simple.
unless you’re dealing
with strings.
A string field that is analyzed for full-text search can't be used to sort documents, because it exists in the inverted index as individual terms, not as the entire string.
If you need to sort on an analyzed text field,
map a keyword copy.
curl -XPUT 127.0.0.1:9200/movies/ -d '
{
  "mappings": {
    "properties" : {
      "title": {
        "type" : "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}'
Now you can sort on the keyword "raw" field.

Note that you can't add this mapping to an existing index: you'd have to delete it, set up a new mapping, and re-index it.
more with
filters
another filtered
query
elasticsearch
fuzziness
fuzzy matches
the fuzziness
parameter
AUTO fuzziness
fuzziness: AUTO
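AUTO scales the allowed Levenshtein edit distance with the length of the term – roughly: terms of 1–2 characters must match exactly, 3–5 characters allow one edit, and longer terms allow two. A sketch of that rule:

```python
def auto_fuzziness(term: str) -> int:
    # edit distance chosen by AUTO, based on term length
    if len(term) <= 2:
        return 0      # very short terms must match exactly
    if len(term) <= 5:
        return 1      # one insertion, deletion, or substitution allowed
    return 2          # longer terms tolerate two edits
```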
partial
matching
prefix queries on
strings
wildcard queries
search as
you type
query-time search-as-you-type
abusing sloppiness…
index-time with
N-grams
“star”:
unigram: [ s, t, a, r ]
bigram: [ st, ta, ar ]
trigram: [ sta, tar ]
4-gram: [ star ]
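The table above is easy to generate. Note the distinction between plain n-grams (all substrings) and edge n-grams (prefixes only), since the index below uses the `edge_ngram` filter:

```python
def ngrams(term: str, n: int) -> list:
    # all contiguous substrings of length n
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def edge_ngrams(term: str, min_gram: int = 1, max_gram: int = 20) -> list:
    # prefixes only -- this is what the edge_ngram filter emits
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]
```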
indexing n-grams
1. Create an "autocomplete" analyzer:

curl -XPUT '127.0.0.1:9200/movies?pretty' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}'
now map your field
with it
but only use n-grams on
the index side!
curl -XGET 127.0.0.1:9200/movies/_search?pretty -d '
{
"query": {
"match": {
"title": {
"query": "sta",
"analyzer": "standard"
}
}
}
}'
otherwise our query will also get split into n-grams, and we’ll get results for
everything that matches ‘s’, ‘t’, ‘a’, ‘st’, etc.
completion
suggesters
You can also upload a list of all possible completions ahead of time
using completion suggesters.
importing
data
you can import from
just about anything
logstash and beats can stream data from logs, S3, databases, and more
importing
via script / json
hack together a
script
for completeness:
import csv
import re
importing
via client api’s
a less hacky script.
free elasticsearch client libraries are available for pretty much any language.
es = elasticsearch.Elasticsearch()
es.indices.delete(index="ratings",ignore=404)
deque(helpers.parallel_bulk(es,readRatings(),index="ratings"), maxlen=0)
es.indices.refresh()
for completeness:

import csv
from collections import deque
import elasticsearch
from elasticsearch import helpers

def readMovies():
    csvfile = open('ml-latest-small/movies.csv', 'r')
    titleLookup = {}
    return titleLookup

def readRatings():
    csvfile = open('ml-latest-small/ratings.csv', 'r')
    titleLookup = readMovies()

es = elasticsearch.Elasticsearch()
es.indices.delete(index="ratings", ignore=404)
deque(helpers.parallel_bulk(es, readRatings(), index="ratings"), maxlen=0)
es.indices.refresh()
elasticsearch
my solution:

import csv
from collections import deque
import elasticsearch
from elasticsearch import helpers

def readMovies():
    csvfile = open('ml-latest-small/movies.csv', 'r')
    titleLookup = {}
    return titleLookup

def readTags():
    csvfile = open('ml-latest-small/tags.csv', 'r')
    titleLookup = readMovies()

es = elasticsearch.Elasticsearch()
es.indices.delete(index="tags", ignore=404)
deque(helpers.parallel_bulk(es, readTags(), index="tags"), maxlen=0)
es.indices.refresh()
introducing
logstash
what logstash
is for
logstash → elasticsearch, mongodb, aws, hadoop, …
it’s more than
plumbing
See https://www.elastic.co/guide/en/logstash/current/filter-plugins.html
for the huge list of filter plugins.
huge variety of input
source events
huge variety of output
“stash” destinations
typical usage
elasticsearch
installing
logstash
installing
logstash
configuring
logstash
sudo vi /etc/logstash/conf.d/logstash.conf
input {
file {
path => "/home/student/access_log"
start_position => "beginning"
}
}
filter {
grok {
match => { "message" => "%{COMBINEDAPACHELOG}" }
}
date {
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
}
}
output {
elasticsearch {
hosts => ["localhost:9200"]
}
stdout {
codec => rubydebug
}
}
running
logstash
cd /usr/share/logstash/
logstash
with mysql
install a
jdbc driver
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.15.zip
unzip mysql-connector-java-8.0.15.zip
configure
logstash
input {
jdbc {
jdbc_connection_string => "jdbc:mysql://localhost:3306/movielens"
jdbc_user => "mysql"
jdbc_password => "password"
jdbc_driver_library => "/home/student/mysql-connector-java-8.0.15/mysql-connector-java-8.0.15.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
statement => "SELECT * FROM movies"
}
}
logstash
with s3
what is
s3
integration is
easy-peasy.
input {
s3 {
bucket => "sundog-es"
access_key_id => "AKIAIS****C26Y***Q"
secret_access_key => "d*****FENOXcCuNC4iTbSLbibA*****eyn****"
}
}
logstash
with kafka
what is
kafka
• apache kafka
• open-source stream processing platform
• high throughput, low latency
• publish/subscribe
• process streams
• store streams
integration is
easy-peasy.
input {
kafka {
bootstrap_servers => "localhost:9092"
topics => ["kafka-logs"]
}
}
elasticsearch
with spark
what is
apache spark
flink is nipping at spark’s heels, and can also integrate with elasticsearch.
integration with
elasticsearch-spark
./spark-2.4.1-bin-hadoop2.7/bin/spark-shell --packages org.elasticsearch:elasticsearch-spark-20_2.11:7.0.0
import org.elasticsearch.spark.sql._
import spark.implicits._
val lines = spark.sparkContext.textFile("fakefriends.csv")
val people = lines.map(mapper).toDF()
people.saveToEs("spark-people")
elasticsearch
dealing with the
header line
my solution
import org.elasticsearch.spark.sql._
import spark.implicits._
val lines = spark.sparkContext.textFile("ml-latest-small/ratings.csv")
val header = lines.first()
val data = lines.filter(row => row != header)
val ratings= data.map(mapper).toDF()
ratings.saveToEs("ratings")
aggregations
it’s not just for search
anymore
metrics: average, min/max, percentiles, stats, etc.
buckets: histograms, ranges, distances, significant terms, etc.
pipelines: moving average, average bucket, cumulative sum, etc.
matrix: matrix stats
aggregations
are amazing
it gets better
let’s learn
by example
curl -XGET '127.0.0.1:9200/ratings/_search?size=0&pretty' -d '
{
"aggs": {
"ratings": {
"terms": {
"field": "rating"
}
}
}
}'
let’s learn
by example
count only 5-star ratings:
curl -XGET '127.0.0.1:9200/ratings/_search?size=0&pretty' -d '
{
"query": {
"match": {
"rating": 5.0
}
},
"aggs" : {
"ratings": {
"terms": {
"field" : "rating"
}
}
}
}'
let’s learn
by example
average rating for Star Wars:
curl -XGET '127.0.0.1:9200/ratings/_search?size=0&pretty' -d '
{
"query": {
"match_phrase": {
"title": "Star Wars Episode IV"
}
},
"aggs" : {
"avg_rating": {
"avg": {
"field" : "rating"
}
}
}
}'
histograms
what is a
histogram
display totals of documents, bucketed by some interval range
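The bucketing rule itself is simple: each document falls into the bucket whose key is the greatest multiple of the interval not exceeding its value. A sketch:

```python
import math

def bucket_key(value: float, interval: float) -> float:
    # greatest multiple of the interval that does not exceed the value
    return math.floor(value / interval) * interval
```

So with an interval of 10, a movie from 1987 lands in the 1980 bucket; with an interval of 1.0, a 3.5-star rating lands in the 3.0 bucket.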
display ratings by
1.0-rating intervals
curl -XGET '127.0.0.1:9200/ratings/_search?size=0&pretty' -d '
{
"aggs" : {
"whole_ratings": {
"histogram": {
"field": "rating",
"interval": 1.0
}
}
}
}'
count up movies
from each decade
curl -XGET '127.0.0.1:9200/movies/_search?size=0&pretty' -d '
{
"aggs" : {
"release": {
"histogram": {
"field": "year",
"interval": 10
}
}
}
}'
time series
dealing with time
Elasticsearch can bucket and aggregate fields that contain times and dates properly. You can aggregate by "year" or "month" and it knows about calendar rules.
break down website
hits by hour:
when does google
scrape me?
curl -XGET '127.0.0.1:9200/kafka-logs/_search?size=0&pretty' -d '
{
"query" : {
"match": {
"agent": "Googlebot"
}
},
"aggs" : {
"timestamp": {
"date_histogram": {
"field": "@timestamp",
"interval": "hour"
}
}
}
}'
elasticsearch
my solution
GET /kafka-logs/_search?size=0&pretty
{
"query" : {
"match": {
"response": "500"
}
},
"aggs" : {
"timestamp": {
"date_histogram": {
"field": "@timestamp",
"interval": "minute"
}
}
}
}
nested
aggregations
nested
aggregations
For example, what’s the average rating for each Star Wars movie?
for reference, here's the final query:

curl -XGET '127.0.0.1:9200/ratings/_search?size=0&pretty' -d '
{
"query": {
"match_phrase": {
"title": "Star Wars"
}
},
"aggs" : {
"titles": {
"terms": {
"field": "title.raw"
},
"aggs": {
"avg_rating": {
"avg": {
"field" : "rating"
}
}
}
}
}
}'
using
kibana
what is
kibana
installing kibana
playing with
kibana
let’s analyze the works
of william shakespeare…
because we can.
elasticsearch
using
filebeat
filebeat is a lightweight
shipper for logs
logs can be from apache, nginx, auditd, or mysql
this is called the
elastic stack
why use filebeat and logstash
and not just one or the other?
backpressure
logstash: "Aah! I can't keep up!"
filebeat: "Fine, I'll slow down."

filebeat maintains a read pointer into your logs, so it can slow down without losing data.
scalability, resiliency, and
security in production
beats and logstash feed data into elasticsearch data nodes, with three dedicated master nodes, and kibana on top.
x-pack
security
security
• access control
• data integrity
• audit trails
users, groups, and
roles
defining roles
POST /_security/role/users
{
  "run_as": [ "user_impersonator" ],
  "cluster": [ "movielens" ],
  "indices": [
    {
      "names": [ "movies*" ],
      "privileges": [ "read" ],
      "field_security" : {
        "grant" : [ "title", "year", "genres" ]
      }
    }
  ]
}
installing
filebeat
installing and configuring:

sudo apt-get update && sudo apt-get install filebeat
mkdir /home/student/logs
cd /home/student/logs
wget http://media.sundog-soft.com/es/access_log
cd /etc/filebeat/modules.d
sudo mv apache.yml.disabled apache.yml

Edit apache.yml and add the paths:
["/home/student/logs/access*"]
["/home/student/logs/error*"]
analyzing logs
with kibana
elasticsearch
elasticsearch
operations
choosing your
shards
an index is split into
shards.
Documents are hashed to a particular shard.
(e.g., a Shakespeare index distributed across shards 1, 2, 3, …)
primary and replica
shards
This index has two primary shards and two replicas.
Your application should round-robin requests amongst nodes.
how many shards
do i need?
really? that’s kind
of hand-wavy.
remember replica shards
can be added
adding an index
creating a new
index
PUT /new_index
{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1
  }
}
multiple indices as
a scaling strategy
multiple indices as
a scaling strategy
• with time-based data, you can have one index per time
frame
• common strategy for log data where you usually just
want current data, but don’t want to delete old data
either
• again you can use index aliases, e.g. "logs_current" and "logs_last_3_months", to point to specific indices as they rotate
alias rotation
example
POST /_aliases
{
  "actions": [
    { "add": { "alias": "logs_current", "index": "logs_2017_06" }},
    { "remove": { "alias": "logs_current", "index": "logs_2017_05" }},
    { "add": { "alias": "logs_last_3_months", "index": "logs_2017_06" }},
    { "remove": { "alias": "logs_last_3_months", "index": "logs_2017_03" }}
  ]
}
optionally….
DELETE /logs_2017_03
index lifecycle
management
index lifecycle
policies
PUT _ilm/policy/datastream_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

PUT _template/datastream_template
{
  "index_patterns": ["datastream-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.lifecycle.name": "datastream_policy",
    "index.lifecycle.rollover_alias": "datastream"
  }
}
choosing your
hardware
RAM is likely your bottleneck
heap sizing
your heap size is
wrong
the default heap size is only 1GB!
export ES_HEAP_SIZE=10g
or
ES_JAVA_OPTS="-Xms10g -Xmx10g" ./bin/elasticsearch
monitoring with
elastic stack
what is x-pack?
let’s explore monitoring.
elasticsearch
sql
new in
elasticsearch 6.3+!
Format:

POST /_xpack/sql?format=txt
{
  "query": "<your sql query>"
}

Or use the CLI.
how it works
let’s try it out
failover
in action
in this activity, we’ll…
using
snapshots
snapshots let you back
up your indices
create a repository
PUT _snapshot/backup-repo
{
"type": "fs",
"settings": {
"location": "/home/<user>/backups/backup-repo"
}
}
using snapshots
rolling
restarts
restarting your
cluster
rolling restart
procedure
cheat sheet
Disable shard allocation:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

Re-enable shard allocation:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
let’s
practice
amazon
elasticsearch
service
let’s walk through
setting this up
amazon es
+logstash
let’s do something a
little more complicated
• set up secure access to your cluster from kibana and from logstash
• need to create an IAM user and its credentials
• simultaneously allow access to the IP you’re connecting to kibana from and this user
• configure logstash with that user’s credentials for secure communication to the ES cluster
our access policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::159XXXXXXX66:user/estest",
          "arn:aws:iam::159XXXXXXX66:root"
        ]
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:159XXXXXXX66:domain/frank-test/*"
    },
    {
      "Effect": "Allow",
      "Principal": {
Substitute your own log path, elasticsearch endpoint, region, and credentials:

output {
  amazon_es {
    hosts => ["search-test-logstash-tdjkXXXXXXdtp3o3hcy.us-east-1.es.amazonaws.com"]
    region => "us-east-1"
    aws_access_key_id => 'AKIXXXXXXK7XYQQ'
    aws_secret_access_key => '7rvZyxmUudcXXXXXXXXXgTunpuSyw2HGuF'
    index => "production-logs-%{+YYYY.MM.dd}"
  }
}
elastic
cloud
what is elastic
cloud?
let’s set up a
trial cluster.
wrapping up
you made it!
• installing elasticsearch
• mapping and indexing data
• searching data
• importing data
• aggregating data
• using kibana
• using logstash, beats, and the elastic stack
• elasticsearch operations and deployment
• using hosted elasticsearch clusters
learning more
• https://www.elastic.co/learn
• elasticsearch: the definitive guide
• documentation
• live training and videos
• keep experimenting!
THANK YOU