Elasticsearch¶
Concepts¶
Features¶
- document store where every field is indexed and searchable in real-time
- built on top of Apache Lucene, a full-text search engine library
- distributed and scalable horizontally
- JSON RESTful API
Elasticsearch vs Relational databases¶
index <-> database
type <-> table
document <-> row
field <-> column
mapping <-> schema
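As a rough illustration of the mapping above (index/type names are made up), indexing a document is the equivalent of an SQL INSERT; Elasticsearch creates the index and the type mapping on the fly if they do not exist yet:
# index "blog", type "user", id 1 -- like an INSERT into table "user" of database "blog"
http PUT localhost:9200/blog/user/1 <<<'
{
  "name" : "alice",
  "age" : 30
}'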
Terminology¶
- index = logical namespace which groups together one or more shards
- shard = Lucene instance = search engine and data container
- shards are distributed across nodes for load-balancing and replicated for fault tolerance
- by default, an index contains 5 primary shards, each having 1 replica shard
- segments = Lucene splits its own index inside the shards into segments
- an index can be optimized by merging all data into a single segment
- the size of a segment should not exceed max_merged_segment (default 5GB)
- document = stored object, serialized into JSON
- required metadata = _index, _type, _id; the combination of the three uniquely defines any document
- documents (like indices) are immutable; they cannot be changed, only replaced
- type = the class of object that the document represents
- every type has its own mapping or schema definition, which defines the data structure for documents of that type, much like the columns in a database table.
- filter = asks a yes-or-no question of every document and is used for fields that contain exact values
- no relevance calculation, faster, cacheable
- query = calculates how relevant each document is and assigns it a score
- tokenization = splitting a string into words/terms/tokens
- normalization = reducing tokens to a standard form (eg lowercase, singular)
- analysis = tokenization + normalization (see the _analyze example after this list)
- inverted index = list of all unique words/terms/tokens and the list of documents they appear in
- aggregation = combination of buckets and metrics
- buckets = collection of documents which meet a criterion (binning)
- metrics = statistics calculated on the documents in a bucket
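Analysis can be inspected directly through the _analyze API; a minimal sketch using httpie (the english analyzer and the sample text are only illustrative):
http GET localhost:9200/_analyze analyzer==english text=='The QUICK Brown Foxes!'
# -> tokens: quick, brown, fox (lowercased, stemmed, stopword "the" removed)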
REST API¶
Showcase example using httpie
http GET localhost:9200/_search <<<'
{
  "query" : {
    "match" : {
      "username" : "alice"
    }
  }
}'
GET / # poll REST interface for status information
GET /_cluster/health # status of the cluster
GET /_cluster/health?level=indices # status of the cluster with details on the indices
GET /_cluster/state # view current state of the cluster
GET /_cluster/stats # view statistics of the cluster
GET /_nodes # show info about elasticsearch nodes
GET /_nodes/stats # show statistics on elasticsearch nodes
GET /_status?pretty=true # status page with a variety of information
GET /_search?pretty=true # empty search, list first page of all entries
GET /_search?q=name:foo # search for matching fields using a query string
GET /<index>/_search # list index entries (or search within)
GET /<index>/<type>/_search # list type entries (or search within)
GET /indpatt*/type1,type2/_search # list entries of given indices and types (or search within)
GET /<index>/_settings # list index settings
GET /<index>/_segments # list index segment info
GET /<index>/_mapping # list mappings
GET /<index>/_mapping/<type> # list mapping of a given type
HEAD /<index>/<type>/<id> # check if given document exists
GET /<index>/<type>/<id>/_source # retrieve the data in a given document
GET /_stats/indices # list all indices (json format)
GET /_cat # list all available cat routes
GET /_cat/indices # list all indices and their sizes (text format)
GET /_aliases # list index aliases
GET /_validate/query <data> # check if given query has a valid syntax
Other actions
PUT /<index>/<type>/<id> <data> # add/replace document with given id
POST /<index>/<type> <data> # add document with auto-generated id
DELETE /<index> # delete given index and all its data
POST /<index>/_close # put index in offline mode to save resources
POST /<index>/_optimize # optimize perf by reducing num. of segments
POST /<index>/_optimize?max_num_segments=1 # optimize perf by merging each shard into a single segment
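The default of 5 primary shards with 1 replica each (see Terminology) can be overridden at index creation time; a sketch with a hypothetical index name:
http PUT localhost:9200/myindex <<<'
{
  "settings" : {
    "number_of_shards" : 3,     # fixed once the index is created
    "number_of_replicas" : 2    # can still be changed later via /myindex/_settings
  }
}'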
Queries¶
match : full-text search¶
{
  query: {
    match: {
      username: 'alice'
    }
  }
}
sort : order by¶
{
  query: {
    ...
  },
  sort: {                     # sort: fieldname = abbrev for sort by this field in ascending order
    date: { order: 'desc' }   # sort: [ {}, {} ] = sort by multiple fields
  }
}
fields : select which fields to return¶
{
  fields: [ 'name', 'age', 'gender' ],
  query: {
    ...
  }
}
source filtering¶
{
  _source: [ '*name', 'age' ],   # fields that _source should contain for each hit
  query: {
    ...
  }
}
filtered, filter, term¶
The search API expects a query, therefore the term search must be wrapped inside a filtered query.
{
  query: {
    filtered: {
      query: { match_all: {} },   # this line can be omitted
      filter: {                   # the filter is executed before the query
        term: { user: 'bob' }     # term for single, terms for multiple values
        #terms: { user: ['alice','bob'] }
      }
    }
  }
}
bool¶
The order of filters in a bool clause is important for performance: more specific filters should be placed first to exclude as many documents as possible for the subsequent filters.
{
  bool: {
    must: [],       # AND: all clauses must match
    should: [],     # OR: at least one of the clauses must match
    must_not: [],   # NOT: none of the clauses may match
  }
}
{
  query: {
    filtered: {
      filter: {
        bool: {
          should: [
            { term: { user: 'bob' }},
            { term: { user: 'alice' }},
          ]
        }
      }
    }
  }
}
range¶
{
  query: {
    filtered: {
      filter: {
        range: {
          timestamp: {
            gt: '2014-01-01 00:00:00',   # 'now-1h' (last hour), 'now-1M' (last month)
            lt: '2015-01-01 00:00:00',
          }
        }
      }
    }
  }
}
index, from, size¶
index: 'logstash-2015.01.18',   # restrict to given index
body: {
  from: 0,                      # pagination start position
  size: 20,                     # number of hits to be returned (default 10)
  query: {
    ...
  }
}
Aggregations¶
"aggregations" : { # abbrev: aggs
"<aggregation_name>" : { # custom name
"<aggregation_type>" : { # eg sum, avg, top_hits, geo_distance
<aggregation_body> # required properties (depending on the type)
}
[,"aggregations" : { [<sub_aggregation>]+ } ]? # nest results of agg into sub agg
}
[,"<aggregation_name_2>" : { ... } ] # additional top-level aggregations
}
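Aggregations are sent through the search API next to an optional query; setting size to 0 skips the hits and returns only the aggregation results. A sketch running the genders example from below:
http GET localhost:9200/_search <<<'
{
  "size" : 0,
  "aggs" : {
    "genders" : {
      "terms" : { "field" : "gender" }
    }
  }
}'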
Example: bin different values¶
"aggs" : {
"genders" : {
"terms" : {
"field" : "gender", # -> 2 buckets: male/female and their respective count
"order" : { "_count" : "asc" } # sort by count (_term for alphabetical, _key for numeric)
}
}
}
Example: sum¶
"aggs" : {
"sum_of_all_x_fields" : {
"sum" : {
"field" : "x"
}
}
}
Example: sum of squares¶
"aggs" : {
"sum_of_squared_values_all_x_fields" : {
"sum" : {
"field" : "x",
"script" : "_value * _value" }
}
}
Example: filter and avg¶
"aggs" : {
"average_price_of_in_stock_products" : {
"filter" : { "range" : { "stock" : { "gt" : 0 } } },
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
Example: count unique/distinct values¶
"aggs" : {
"count_distinct_authors" : {
"cardinality" : { # based on the approximate HyperLogLog++ algorithm
"field" : "author",
"precision_threshold": 100 # fields with <= 100 distinct values will be very accurate
}
}
}
Example: histogram¶
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price", # histogram of price values
"interval" : 50 # with bins of width 50
}
}
}
Example: date histogram¶
"aggs" : {
"sales" : {
"date_histogram" : {
"field" : "units_sold",
"interval" : "month", # with bins of one month
"format": "MM/yyyy", # date format for the bucket, stored in key_as_string
"min_doc_count" : 0, # also return buckets with 0 documents (for completeness)
"extended_bounds": { # custom interval of bins to return
"min" : "2014-01-01"
"max" : "2015-01-01"
}
}
}
}
Example: total bandwidth per user¶
SELECT user, sum(sentbyte + rcvdbyte) AS bw FROM table GROUP BY user ORDER BY bw DESC
aggs: {
  group_by_bandwidth: {
    terms: {
      field: 'user',
      order: {
        TotalBandwidth: 'desc'
      }
    },
    aggs: {
      TotalBandwidth: {
        sum: {
          script: "doc['rcvdbyte'].value + doc['sentbyte'].value"
        }
      }
    }
  }
}
Third-party tools¶
Elasticsearch-kopf: web frontend for maintenance tasks¶
Alternative: elasticsearch-head
Elasticdump: dump and import indices¶
Dump the .kibana index and its mapping¶
elasticdump --input=http://localhost:9200/.kibana --output=/opt/data/kibana_mapping.json --type=mapping
elasticdump --input=http://localhost:9200/.kibana --output=/opt/data/kibana_data.json --type=data
Dump all indices¶
elasticdump --input=http://localhost:9200/ --output=/opt/data/dump-mapping.json --all=true --type=mapping
elasticdump --input=http://localhost:9200/ --output=/opt/data/dump.json --all=true --type=data
Import again¶
elasticdump --input=/opt/data/dump-mapping.json --output=http://localhost:9200/ --bulk=true
elasticdump --input=/opt/data/dump.json --output=http://localhost:9200/ --bulk=true
Curator: close and delete indices¶
Index selection¶
--index indexname1 # specific name
--exclude logstash-2015.03 # exclude all indices from March 2015
--all-indices # unfiltered list of all indices
--older-than 30 --time-unit days --timestring '%Y.%m.%d' # select indices older than 30 days (date parsed from the index name)
Show¶
Display list of indices matching the selection
curator show indices <index>
Open / Close / Delete¶
curator open indices <index>
curator close indices <index>
curator delete indices <index>
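Selection flags and actions combine into a single command, e.g. deleting logstash indices older than 30 days (curator 3.x flag syntax assumed; prefix and retention are illustrative):
curator delete indices --prefix logstash- --older-than 30 --time-unit days --timestring '%Y.%m.%d'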
Optimize¶
Merge all index data into a single segment:
curator optimize --max_num_segments 1 indices <index>