Elasticsearch

Concepts

Features

  • document store where every field is indexed and searchable in real-time
  • built on top of Apache Lucene, a full-text search engine library
  • distributed and scalable horizontally
  • JSON RESTful API
  • Homepage

Elasticsearch vs Relational databases

index         <-> database
type          <-> table
document      <-> row
field         <-> column
mapping       <-> schema

Terminology

  • index = logical namespace which groups together one or more shards
  • shard = Lucene instance = search engine and data container
    • shards are distributed across nodes for load-balancing and replicated for fault tolerance
    • by default, an index contains 5 primary shards, each having 1 replica shard
  • segments = Lucene splits its own index inside the shards into segments
    • an index can be optimized by merging all data into a single segment
    • the size of a segment shoult not exceed max_merged_segment (default 5GB)
  • document = stored object, serialized into JSON
    • required metadata = _index, _type, _id, the combination of the three uniquely defines any document
    • documents (like indices) are immutable; they cannot be changed, only replaced
  • type = the class of object that the document represents

    • every type has its own mapping or schema definition, which defines the data structure for documents of that type, much like the columns in a database table.
  • filter = asks a yes or no question of every document and is used for fields that contain exact values

    • no relevance calculation, faster, cacheable
  • query = calculates how relevant each document is and assigns it a score

  • tokenization = splitting a string into words/terms/tokens

  • normalization = stemming into standard form (eg lowercase singular)
  • analysis = tokenization + normalization
  • inverted index = list of all unique words/terms/tokens and the list of documents they appear in

  • aggregation = combination of buckets and metrics

  • buckets = collection of documents which meet a criterion (binning)
  • metrics = statistics calculated on the documents in a bucket

REST API

Showcase example using httpie

http GET localhost:9200/_search <<<'
{
    "query" : {
        "match" : {
            "username" : "alice"
        }
    }
}'
GET /                               # poll REST interface for status information
GET /_cluster/health                # status of the cluster
GET /_cluster/health?level=indices  # status of the cluster with details on the indices
GET /_cluster/state                 # view current state of the cluster
GET /_cluster/stats                 # view statistics of the cluster
GET /_nodes                         # show info about elasticsearch nodes
GET /_nodes/stats                   # show statistics on elasticsearch nodes
GET /_status?pretty=true            # status page with a variety of information
GET /_search?pretty=true            # empty search, list first page of all entries
GET /_search?q=name:foo             # search for matching fields using a query string
GET /<index>/_search                # list index entries (or search within)
GET /<index>/<type>/_search         # list type entries (or search within)
GET /indpatt*/type1,type2/_search   # list entries of given indices and types (or search within)
GET /<index>/_settings              # list index settings
GET /<index>/_segments              # list index segment info
GET /<index>/_mapping               # list mappings
GET /<index>/_mapping/<type>        # list mapping of a given type
HEAD /<index>/<type>/<id>           # check if given document exists
GET /<index>/<type>/<id>/_source    # retrieve the data in a given document
GET /_stats/indices                 # list all indexes (json format)
GET /_cat                           # list all available cat routes
GET /_cat/indices                   # list all indexes and their sizes (text format)
GET /_aliases                       # list index aliases
GET /_validate/query <data>         # check if given query has a valid syntax

Other actions

PUT /<index>/<type> <data>          # add data to given index/type
DELETE /<index>                     # delete given index and all its data
POST /<index>/_close                # put index in offline mode to save resources
POST /<index>/_optimize             # optimize perf by reducing num. of segments
POST /<index>/_optimize?max_num_segments=1  # optimize perf by merging each shard into a single segment

Queries

{
    query: {
        match: {
            username: 'alice'
        }
    }
}

sort : order by

{
    query: {
        ...
    },
    sort: {                     # sort: fieldname = abbrev for sort by this field in ascending order
        date: { order: 'desc' } # sort: [ {}, {} ] = sort by multiple fields
    }
}

fields : select which fields to return

{
    fields: [ 'name', 'age', 'gender' ],
    query: {
        ...
    }
}

source filtering

{
    "_source" : [ '*name', 'age' ],     # fields that _source should contain for each hit
    query: {
        ...
    }
}

filtered, filter, term

The search API expects a query, therefore the term search must be wrapped inside a filtered query.

{
    query: {
        filtered: {
            query: { match_all: {} },       # this line can be omitted
            filter: {                       # the filter is executed before the query
                term: { user: 'bob' }       # term for single, terms for multiple values
               #terms:{ user: ['alice','bob'] }
            }
        }
    }
}

bool

The order of filters in a bool clause is important for performance. More specific filters should be placed first in order to exclude as many documents as possible for the next filters.

{
    bool: {
        must:     [],           # AND: all clauses must match
        should:   [],           # OR: at least one of the clauses must match
        must_not: [],           # NOT: none of the the clauses must match
    }
}
{
    query: {
        filtered: {
            filter: {
                bool: {
                    should: [
                        { term: { user: 'bob' }},
                        { term: { user: 'alice' }},
                    ]
                }
            }
        }
    }
}

range

{
    query: {
        filtered: {
            filter: {
                range: {
                    timestamp: {
                        gt: '2014-01-01 00:00:00',  # 'now-1h' (last hour), 'now-1M' (last month)
                        lt: '2015-01-01 00:00:00',
                    }
                }
            }
        }
    }
}

infex, from, size

index: 'logstash-2015.01.18',   # restrict to given index
body: {
  from:  0,                     # pagination start position
  size: 20,                     # number of hits to be returned (default 10)
  query: {
    ...
  }
}

Aggregations

Documentation

"aggregations" : {                                          # abbrev: aggs
    "<aggregation_name>" : {                                # custom name
        "<aggregation_type>" : {                            # eg sum, avg, top_hits, geo_distance
            <aggregation_body>                              # required properties (depending on the type)
        }
        [,"aggregations" : { [<sub_aggregation>]+ } ]?      # nest results of agg into sub agg
    }
    [,"<aggregation_name_2>" : { ... } ]                    # additional top-level aggregations
}

Example: bin different values

"aggs" : {
    "genders" : {
        "terms" : {
            "field" : "gender",             # -> 2 buckets: male/female and their respective count
            "order" : { "_count" : "asc" }  # sort by count (_term for alphabetical, _key for numeric)
        }
    }
}

Example: sum

"aggs" : {
    "sum_of_all_x_fields" : {
        "sum" : {
            "field" : "x"
        }
    }
}

Example: sum of squares

"aggs" : {
    "sum_of_squared_values_all_x_fields" : {
        "sum" : {
            "field" : "x",
            "script" : "_value * _value" }
    }
}

Example: filter and avg

"aggs" : {
    "average_price_of_in_stock_products" : {
        "filter" : { "range" : { "stock" : { "gt" : 0 } } },
        "aggs" : {
            "avg_price" : { "avg" : { "field" : "price" } }
        }
    }
}

Example: count unique/distinct values

"aggs" : {
    "count_distinct_authors" : {
        "cardinality" : {               # based on the approximate HyperLogLog++ algorithm
            "field" : "author",
            "precision_threshold": 100  # fields with <= 100 distinct values will be very accurate
        }
    }
}

Example: histogram

"aggs" : {
    "prices" : {
        "histogram" : {
            "field" : "price",      # histogram of price values
            "interval" : 50         # with bins of width 50
        }
    }
}

Example: date histogram

"aggs" : {
    "sales" : {
        "date_histogram" : {
            "field" : "units_sold",
            "interval" : "month",   # with bins of one month
            "format": "MM/yyyy",    # date format for the bucket, stored in key_as_string
            "min_doc_count" : 0,    # also return buckets with 0 documents (for completeness)
            "extended_bounds": {    # custom interval of bins to return
                "min" : "2014-01-01"
                "max" : "2015-01-01"
            }
        }
    }
}

Example: total bandwidth per user

SELECT users, sum(sentbyte + rcvdbyte) AS bw FROM table GROUP BY users ORDER BY bw DESC
aggs: {
    group_by_bandwidth: {
        terms: {
            field: 'user',
            order: {
                TotalBandwidth: 'desc'
            }
        },
        aggs: {
            TotalBandwidth: {
                sum: {
                    script: "doc['rcvdbyte'].value + doc['sentbyte'].value"
                }
            }
        }
    }
}

Third-party tools

Elasticsearch-kopf: web frontend for mainenance tasks

GitHub

Alternative elasticsearch-head

Elasticdump: dump and import indices

GitHub

Dump the .kibana index and its mapping

elasticdump --input=http://localhost:9200/.kibana --output=/opt/data/kibana_mapping.json --type=mapping
elasticdump --input=http://localhost:9200/.kibana --output=/opt/data/kibana_data.json --type=data

Dump all indices

elasticdump --input=http://localhost:9200/ --output=/opt/data/dump-mapping.json --all=true --type=mapping
elasticdump --input=http://localhost:9200/ --output=/opt/data/dump.json --all=true --type=data

Import again

elasticdump --input=/opt/data/dump-mapping.json --output=http://localhost:9200/ --bulk=true
elasticdump --input=/opt/data/dump.json --output=http://localhost:9200/ --bulk=true

Curator: close and delete indices

GitHub

Index selection

--index indexname1                      # specific name
--exclude logstash-2015.03              # exclude all indices from march
--all-indices                           # unfiltered list of all indices
--older-than 30 --time-unit days --timestring '%Y.%m.%d'

Show

Display list of indices matching the selection

curator show indices <index>

Open / Close / Delete

curator open indices <index>
curator close indices <index>
curator delete indices <index>

Optimize

Merge all index data into single segment:

curator optimize --max_num_segments 1 indices <index>