Demo using ElasticSearch in Python
=====================


This is a quick demonstration of using ElasticSearch in Python.

Some material is taken from https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html. It's a great book!


## elasticsearch.yml

Some changes you need to make before launching a node:

- Change `cluster.name` to control which nodes on your network discover each other and join the same cluster
- Change `node.name` so you can easily tell which node is in trouble

Some more options:

- Lock the process memory by setting `bootstrap.mlockall` to `true`, which prevents swapping and helps performance
- Set `network.host` to `127.0.0.1` so the node only listens on the local machine, for security reasons

## pyelasticsearch

In this demo we use the `pyelasticsearch` package, a Python wrapper around the ElasticSearch RESTful API.

Install it using `pip install pyelasticsearch`.


### Set things up

Import the `ElasticSearch` class.

In [None]:
from pyelasticsearch import ElasticSearch

Configure the URL of the ElasticSearch node. The constructor takes more parameters, but the defaults are good for now.

In [2]:
es = ElasticSearch('http://localhost:9200')

We check the cluster health first:

In [11]:
es.health()

{'active_primary_shards': 0,
 'active_shards': 0,
 'cluster_name': 'elasticsearch_tai-dev',
 'initializing_shards': 0,
 'number_of_data_nodes': 1,
 'number_of_in_flight_fetch': 0,
 'number_of_nodes': 1,
 'number_of_pending_tasks': 0,
 'relocating_shards': 0,
 'status': 'green',
 'timed_out': False,
 'unassigned_shards': 0}

All we care about for now is the 'green' status, which means everything is OK.

**Fact**: `health` method is a wrapper for calling `GET /_cluster/health?pretty` directly using API.
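To make the wrapping concrete, here is a minimal sketch that calls the same endpoint without `pyelasticsearch`, using only the standard library. The helper names `cluster_health_url` and `fetch_cluster_health` are made up for this illustration, and the node address is assumed to be the one used above.

```python
import json
from urllib.request import urlopen


def cluster_health_url(base_url):
    """Build the URL of the cluster health endpoint for a node."""
    return base_url.rstrip('/') + '/_cluster/health?pretty'


def fetch_cluster_health(base_url='http://localhost:9200'):
    """Call the REST endpoint directly and decode the JSON reply.

    Assumes a node is listening at `base_url`; raises URLError otherwise.
    """
    with urlopen(cluster_health_url(base_url)) as response:
        return json.loads(response.read().decode('utf-8'))
```

With a node running, `fetch_cluster_health()['status']` should give the same status string as `es.health()['status']`.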

### CRUD: create-read-update-delete

Before we can do anything with ElasticSearch, we need to index our documents into it.

#### Index

In [12]:
es.index('library', # Index name
 'books', # Type name
 {
 'title': 'A very interesting name',
 'name': {
 'first': 'Hugh',
 'last': 'Jackman'
 },
 'publish_date': '2015-07-02',
 'price': 20,
 },
 id=1 # Doc ID
 )

{'_id': '1',
 '_index': 'library',
 '_type': 'books',
 '_version': 1,
 'created': True}

#### Read

In [13]:
es.get('library', 'books', 1)

{'_id': '1',
 '_index': 'library',
 '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},
 'price': 20,
 'publish_date': '2015-07-02',
 'title': 'A very interesting name'},
 '_type': 'books',
 '_version': 1,
 'found': True}

If the document does not exist, an error is raised:

In [35]:
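from pyelasticsearch import ElasticHttpNotFoundError

try:
    es.get('library', 'books', 123)
except ElasticHttpNotFoundError:  # raised when the document is missing
    print("This is an error!")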



This is an error!


The ID is optional. If you omit it, ElasticSearch generates one (an ugly one):

In [18]:
es.index('library', # Index name
 'books', # Type name
 {
 'title': 'Another interesting name',
 'name': {
 'first': 'Tom',
 'last': 'Cruise'
 },
 'publish_date': '2015-08-02',
 'price': 21,
 },
 )

{'_id': 'AU5N-j9DhyYCAHYcFB3R',
 '_index': 'library',
 '_type': 'books',
 '_version': 1,
 'created': True}

Get me that book:

In [19]:
es.get('library', 'books', 'AU5N-j9DhyYCAHYcFB3R')

{'_id': 'AU5N-j9DhyYCAHYcFB3R',
 '_index': 'library',
 '_source': {'name': {'first': 'Tom', 'last': 'Cruise'},
 'price': 21,
 'publish_date': '2015-08-02',
 'title': 'Another interesting name'},
 '_type': 'books',
 '_version': 1,
 'found': True}

#### Update

In [23]:
es.update('library', # Index name
 'books', # Type name
 id = 1, # Doc ID
 doc = {
 'title': 'A very interesting name 2',
 'name': {
 'first': 'Hugh',
 'last': 'Jackman'
 },
 'publish_date': '2015-07-03',
 'price': 30,
 },
 )

{'_id': '1', '_index': 'library', '_type': 'books', '_version': 2}

It worked, though the method call is kind of ugly.

In [24]:
es.get('library', 'books', 1)

{'_id': '1',
 '_index': 'library',
 '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},
 'price': 30,
 'publish_date': '2015-07-03',
 'title': 'A very interesting name 2'},
 '_type': 'books',
 '_version': 2,
 'found': True}

The method performs a partial update, so we can send only the fields that changed:

In [25]:
es.update('library', # Index name
 'books', # Type name
 id = 1, # Doc ID
 doc = {
 'price': 90,
 },
 )

{'_id': '1', '_index': 'library', '_type': 'books', '_version': 3}

In [26]:
es.get('library', 'books', 1)

{'_id': '1',
 '_index': 'library',
 '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},
 'price': 90,
 'publish_date': '2015-07-03',
 'title': 'A very interesting name 2'},
 '_type': 'books',
 '_version': 3,
 'found': True}
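The effect of a partial update can be imitated in plain Python. This is only a sketch of the merge behaviour (nested objects merged key by key, other values overwritten), not the code ElasticSearch runs; `apply_partial_update` is a name invented for this example.

```python
import copy


def apply_partial_update(source, partial):
    """Return a copy of `source` with `partial` merged into it.

    Nested dicts are merged key by key; any other value is replaced.
    """
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_partial_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


stored = {
    'title': 'A very interesting name 2',
    'name': {'first': 'Hugh', 'last': 'Jackman'},
    'publish_date': '2015-07-03',
    'price': 30,
}

# Only `price` is sent; every other field keeps its stored value.
updated = apply_partial_update(stored, {'price': 90})
print(updated['price'], updated['title'])
```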

#### Delete

In [36]:
es.delete('library', 'books', 1)

{'_id': '1',
 '_index': 'library',
 '_type': 'books',
 '_version': 4,
 'found': True}

In [38]:
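from pyelasticsearch import ElasticHttpNotFoundError

try:
    es.get('library', 'books', 1)
except ElasticHttpNotFoundError:  # the document was deleted above
    print('Not found!')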



Not found!


### Bulk indexing and Search

#### Bulk index

Input data:

In [40]:
users = [{ "email" : "john@smith.com", "name" : "John Smith", "username" : "@john" }, 
 { "email" : "mary@jones.com", "name" : "Mary Jones", "username" : "@mary" }]

tweet = [{ "date" : "2014-09-13", "name" : "Mary Jones", "tweet" : "Elasticsearch means full text search has never been so easy", "user_id" : 2 },
 { "date" : "2014-09-14", "name" : "John Smith", "tweet" : "@mary it is not just text, it does everything", "user_id" : 1 },
 { "date" : "2014-09-15", "name" : "Mary Jones", "tweet" : "However did I manage before Elasticsearch?", "user_id" : 2 },
 { "date" : "2014-09-16", "name" : "John Smith", "tweet" : "The Elasticsearch API is really easy to use", "user_id" : 1 },
 { "date" : "2014-09-17", "name" : "Mary Jones", "tweet" : "The Query DSL is really powerful and flexible", "user_id" : 2 }]

Bulk indexing:

In [48]:
es.bulk((es.index_op(user, id=i) for i, user in enumerate(users)),
 index='demo',
 doc_type='user')

{'errors': False,
 'items': [{'index': {'_id': '0',
 '_index': 'demo',
 '_type': 'user',
 '_version': 1,
 'status': 201}},
 {'index': {'_id': '1',
 '_index': 'demo',
 '_type': 'user',
 '_version': 1,
 'status': 201}}],
 'took': 886}

In [49]:
es.bulk((es.index_op(t, id=i) for i, t in enumerate(tweet)),
 index='demo',
 doc_type='tweet')

{'errors': False,
 'items': [{'index': {'_id': '0',
 '_index': 'demo',
 '_type': 'tweet',
 '_version': 1,
 'status': 201}},
 {'index': {'_id': '1',
 '_index': 'demo',
 '_type': 'tweet',
 '_version': 1,
 'status': 201}},
 {'index': {'_id': '2',
 '_index': 'demo',
 '_type': 'tweet',
 '_version': 1,
 'status': 201}},
 {'index': {'_id': '3',
 '_index': 'demo',
 '_type': 'tweet',
 '_version': 1,
 'status': 201}},
 {'index': {'_id': '4',
 '_index': 'demo',
 '_type': 'tweet',
 '_version': 1,
 'status': 201}}],
 'took': 53}
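Under the hood, a bulk request is a body of newline-delimited JSON: one action line followed by one source line per document. The sketch below builds such a body by hand to show the shape. The exact bytes `pyelasticsearch` sends may differ, and `bulk_index_body` is a hypothetical helper, not part of the library.

```python
import json


def bulk_index_body(docs):
    """Serialize documents into the newline-delimited JSON of a bulk request."""
    lines = []
    for doc_id, doc in enumerate(docs):
        lines.append(json.dumps({'index': {'_id': str(doc_id)}}))  # action line
        lines.append(json.dumps(doc))                              # source line
    return '\n'.join(lines) + '\n'  # the body must end with a newline


users = [
    {'email': 'john@smith.com', 'name': 'John Smith', 'username': '@john'},
    {'email': 'mary@jones.com', 'name': 'Mary Jones', 'username': '@mary'},
]

body = bulk_index_body(users)
print(body)
```

Sending many documents in one request like this saves a network round trip per document, which is why bulk indexing is preferred for loading data.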

#### Search

##### Search all

In [54]:
es.search({})

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '4',
 '_index': 'demo',
 '_score': 1.0,
 '_source': {'date': '2014-09-17',
 'name': 'Mary Jones',
 'tweet': 'The Query DSL is really powerful and flexible',
 'user_id': 2},
 '_type': 'tweet'},
 {'_id': '0',
 '_index': 'demo',
 '_score': 1.0,
 '_source': {'email': 'john@smith.com',
 'name': 'John Smith',
 'username': '@john'},
 '_type': 'user'},
 {'_id': '0',
 '_index': 'demo',
 '_score': 1.0,
 '_source': {'date': '2014-09-13',
 'name': 'Mary Jones',
 'tweet': 'Elasticsearch means full text search has never been so easy',
 'user_id': 2},
 '_type': 'tweet'},
 {'_id': '1',
 '_index': 'demo',
 '_score': 1.0,
 '_source': {'email': 'mary@jones.com',
 'name': 'Mary Jones',
 'username': '@mary'},
 '_type': 'user'},
 {'_id': '1',
 '_index': 'demo',
 '_score': 1.0,
 '_source': {'date': '2014-09-14',
 'name': 'John Smith',
 'tweet': '@mary it is not just text, it does everything',
 'user_id': 1},
 '_type': 'tweet'}

##### Match

Simple match

In [55]:
es.search('name:john', index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
 '_index': 'demo',
 '_score': 0.625,
 '_source': {'email': 'john@smith.com',
 'name': 'John Smith',
 'username': '@john'},
 '_type': 'user'},
 {'_id': '1',
 '_index': 'demo',
 '_score': 0.625,
 '_source': {'date': '2014-09-14',
 'name': 'John Smith',
 'tweet': '@mary it is not just text, it does everything',
 'user_id': 1},
 '_type': 'tweet'},
 {'_id': '3',
 '_index': 'demo',
 '_score': 0.19178301,
 '_source': {'date': '2014-09-16',
 'name': 'John Smith',
 'tweet': 'The Elasticsearch API is really easy to use',
 'user_id': 1},
 '_type': 'tweet'}],
 'max_score': 0.625,
 'total': 3},
 'timed_out': False,
 'took': 282}
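The documents themselves sit under `hits.hits[*]._source` in the response. A small helper makes that easier to work with. This is a sketch run against a hand-trimmed copy of the response above, and `sources` is a name chosen for this example.

```python
def sources(response):
    """Return the `_source` document of every hit in a search response."""
    return [hit['_source'] for hit in response['hits']['hits']]


# A trimmed copy of the response shown above (two of the three hits).
response = {
    'hits': {
        'hits': [
            {'_id': '0', '_type': 'user', '_score': 0.625,
             '_source': {'name': 'John Smith', 'username': '@john'}},
            {'_id': '3', '_type': 'tweet', '_score': 0.19178301,
             '_source': {'name': 'John Smith',
                         'tweet': 'The Elasticsearch API is really easy to use'}},
        ],
    },
}

for doc in sources(response):
    print(doc['name'])
```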

Here is the same search written with the Query DSL. We can avoid it for a while, but we can't escape it forever:

In [59]:
query = {'query':
 {'match': {'name': 'john'}}
 }

In [60]:
es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
 '_index': 'demo',
 '_score': 0.625,
 '_source': {'email': 'john@smith.com',
 'name': 'John Smith',
 'username': '@john'},
 '_type': 'user'},
 {'_id': '1',
 '_index': 'demo',
 '_score': 0.625,
 '_source': {'date': '2014-09-14',
 'name': 'John Smith',
 'tweet': '@mary it is not just text, it does everything',
 'user_id': 1},
 '_type': 'tweet'},
 {'_id': '3',
 '_index': 'demo',
 '_score': 0.19178301,
 '_source': {'date': '2014-09-16',
 'name': 'John Smith',
 'tweet': 'The Elasticsearch API is really easy to use',
 'user_id': 1},
 '_type': 'tweet'}],
 'max_score': 0.625,
 'total': 3},
 'timed_out': False,
 'took': 9}

How about 2 terms?

In [61]:
query = {'query':
 {'match': {'name': 'john mary'}}
 }
es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
 '_index': 'demo',
 '_score': 0.22097087,
 '_source': {'email': 'john@smith.com',
 'name': 'John Smith',
 'username': '@john'},
 '_type': 'user'},
 {'_id': '0',
 '_index': 'demo',
 '_score': 0.22097087,
 '_source': {'date': '2014-09-13',
 'name': 'Mary Jones',
 'tweet': 'Elasticsearch means full text search has never been so easy',
 'user_id': 2},
 '_type': 'tweet'},
 {'_id': '1',
 '_index': 'demo',
 '_score': 0.22097087,
 '_source': {'email': 'mary@jones.com',
 'name': 'Mary Jones',
 'username': '@mary'},
 '_type': 'user'},
 {'_id': '1',
 '_index': 'demo',
 '_score': 0.22097087,
 '_source': {'date': '2014-09-14',
 'name': 'John Smith',
 'tweet': '@mary it is not just text, it does everything',
 'user_id': 1},
 '_type': 'tweet'},
 {'_id': '4',
 '_index': 'demo',
 '_score': 0.028130025,
 '_source': {'date': '2014-09-17',
 'name': 'Mary Jones',
 'tweet': 'The Query DSL is really powerful and flexible'

And as a phrase?

In [62]:
query = {'query':
 {'match_phrase': {'name': 'john mary'}}
 }

es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [], 'max_score': None, 'total': 0},
 'timed_out': False,
 'took': 147}

Unlike `get`, `search` does not raise an error when nothing matches. It simply returns an empty list of hits, which is much less scary.
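Why does `match` on 'john mary' find both people while `match_phrase` finds nobody? A toy model in plain Python shows the difference, assuming the default `or` operator for `match`. The functions below are illustrative stand-ins, not ElasticSearch's implementation, and they ignore scoring entirely.

```python
def tokens(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def match(field_value, query):
    """True if the field shares at least one term with the query."""
    return bool(set(tokens(field_value)) & set(tokens(query)))


def match_phrase(field_value, query):
    """True if the query terms appear consecutively and in order."""
    field, phrase = tokens(field_value), tokens(query)
    size = len(phrase)
    return any(field[i:i + size] == phrase
               for i in range(len(field) - size + 1))


names = ['John Smith', 'Mary Jones']

print([name for name in names if match(name, 'john mary')])
# ['John Smith', 'Mary Jones']
print([name for name in names if match_phrase(name, 'john mary')])
# []
print([name for name in names if match_phrase(name, 'john smith')])
# ['John Smith']
```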

##### Boolean combination

We can write boolean combinations with `must`, `must_not` and `should`.

Does John Smith mention "API" in his tweet?

In [67]:
query = \
{
 "query": {
 "bool": {
 "must": [
 {
 "match_phrase": {
 "name": "john smith"
 }
 },
 {
 "match": {
 "tweet": "API"
 }
 }
 ]
 }
 }
}

es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '3',
 '_index': 'demo',
 '_score': 0.38595587,
 '_source': {'date': '2014-09-16',
 'name': 'John Smith',
 'tweet': 'The Elasticsearch API is really easy to use',
 'user_id': 1},
 '_type': 'tweet'}],
 'max_score': 0.38595587,
 'total': 1},
 'timed_out': False,
 'took': 13}

We can rank the importance of the statements in a combination using the `boost` field. We try it with 'DSL' and 'API':

In [76]:
query = \
{
 "query": {
 "bool": {
 "should": [
 {
 "match": {
 "tweet": {
 "query": "DSL",
 "boost": 5,
 } 
 }
 },
 {
 "match": {
 "tweet": "API"
 }
 }
 ]
 }
 }
}

es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '4',
 '_index': 'demo',
 '_score': 0.04016714,
 '_source': {'date': '2014-09-17',
 'name': 'Mary Jones',
 'tweet': 'The Query DSL is really powerful and flexible',
 'user_id': 2},
 '_type': 'tweet'},
 {'_id': '3',
 '_index': 'demo',
 '_score': 0.0029369325,
 '_source': {'date': '2014-09-16',
 'name': 'John Smith',
 'tweet': 'The Elasticsearch API is really easy to use',
 'user_id': 1},
 '_type': 'tweet'}],
 'max_score': 0.04016714,
 'total': 2},
 'timed_out': False,
 'took': 10}

Now change `boost`, and the order changes:

In [77]:
query = \
{
 "query": {
 "bool": {
 "should": [
 {
 "match": {
 "tweet": {
 "query": "DSL",
 "boost": 0.5,
 } 
 }
 },
 {
 "match": {
 "tweet": {
 "query": "API"
 }
 }
 }
 ]
 }
 }
}

es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '3',
 '_index': 'demo',
 '_score': 0.025078464,
 '_source': {'date': '2014-09-16',
 'name': 'John Smith',
 'tweet': 'The Elasticsearch API is really easy to use',
 'user_id': 1},
 '_type': 'tweet'},
 {'_id': '4',
 '_index': 'demo',
 '_score': 0.0072710635,
 '_source': {'date': '2014-09-17',
 'name': 'Mary Jones',
 'tweet': 'The Query DSL is really powerful and flexible',
 'user_id': 2},
 '_type': 'tweet'}],
 'max_score': 0.025078464,
 'total': 2},
 'timed_out': False,
 'took': 9}
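Since the two queries differ only in their boost values, the nested dictionary can be generated instead of typed out twice. This is a small sketch: `boosted_should` is a hypothetical helper, and it relies on the default boost being 1, so writing `'boost': 1` is equivalent to omitting it.

```python
def boosted_should(field, boosts):
    """Build a bool/should query; `boosts` maps each term to its boost."""
    clauses = [
        {'match': {field: {'query': term, 'boost': boost}}}
        for term, boost in boosts.items()
    ]
    return {'query': {'bool': {'should': clauses}}}


favor_dsl = boosted_should('tweet', {'DSL': 5, 'API': 1})
favor_api = boosted_should('tweet', {'DSL': 0.5, 'API': 1})
```

Each result can be passed to the client the same way as before, for example `es.search(favor_dsl, index='demo')`.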

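Highlight the matched terms in the result. By default, ElasticSearch wraps each match in `<em>` tags, which may not be visible in the rendered output below: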

In [80]:
query = \
{
 "query": {
 "bool": {
 "must": [
 {
 "match_phrase": {
 "name": "john smith"
 }
 },
 {
 "match": {
 "tweet": "API"
 }
 }
 ]
 }
 },
 "highlight": {
 "fields": {
 "tweet": {}
 }
 }
}

es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '3',
 '_index': 'demo',
 '_score': 0.38595587,
 '_source': {'date': '2014-09-16',
 'name': 'John Smith',
 'tweet': 'The Elasticsearch API is really easy to use',
 'user_id': 1},
 '_type': 'tweet',
 'highlight': {'tweet': ['The Elasticsearch API is really easy to use']}}],
 'max_score': 0.38595587,
 'total': 1},
 'timed_out': False,
 'took': 15}

##### Filter

Find all tweets posted after '2014-09-15':

In [83]:
query = \
{
 "query": {
 "filtered": {
 "filter": {
 "range": {
 "date": {
 "gt": '2014-09-15'
 }
 }
 }
 }
 }
}

es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '4',
 '_index': 'demo',
 '_score': 1.0,
 '_source': {'date': '2014-09-17',
 'name': 'Mary Jones',
 'tweet': 'The Query DSL is really powerful and flexible',
 'user_id': 2},
 '_type': 'tweet'},
 {'_id': '3',
 '_index': 'demo',
 '_score': 1.0,
 '_source': {'date': '2014-09-16',
 'name': 'John Smith',
 'tweet': 'The Elasticsearch API is really easy to use',
 'user_id': 1},
 '_type': 'tweet'}],
 'max_score': 1.0,
 'total': 2},
 'timed_out': False,
 'took': 8}

How about listing only John Smith's tweets after 2014-09-15?

In [85]:
query = \
{
 "query": {
 "filtered": {
 "query": {
 "match_phrase": {
 "name": "John Smith"
 }
 },
 "filter": {
 "range": {
 "date": {
 "gt": '2014-09-15'
 }
 }
 }
 }
 }
}

es.search(query, index='demo')

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '3',
 '_index': 'demo',
 '_score': 0.38356602,
 '_source': {'date': '2014-09-16',
 'name': 'John Smith',
 'tweet': 'The Elasticsearch API is really easy to use',
 'user_id': 1},
 '_type': 'tweet'}],
 'max_score': 0.38356602,
 'total': 1},
 'timed_out': False,
 'took': 12}
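One caveat for readers on newer servers: the `filtered` query was deprecated in Elasticsearch 2.0 and removed in 5.0, in favour of a `bool` query with a `filter` clause. The sketch below rewrites the query above into that form. `filtered_to_bool` is a helper invented for this example, and the result has not been run against the demo cluster.

```python
def filtered_to_bool(filtered_query):
    """Rewrite a `filtered` query as a `bool` query with a `filter` clause."""
    filtered = filtered_query['query']['filtered']
    bool_body = {'filter': filtered['filter']}
    if 'query' in filtered:  # the scoring part is optional in `filtered`
        bool_body['must'] = filtered['query']
    return {'query': {'bool': bool_body}}


old_query = {
    'query': {
        'filtered': {
            'query': {'match_phrase': {'name': 'John Smith'}},
            'filter': {'range': {'date': {'gt': '2014-09-15'}}},
        }
    }
}

new_query = filtered_to_bool(old_query)
print(new_query)
```

In both forms the filter part only includes or excludes documents, while the query part contributes to `_score`.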

#### Analysis and Analyzer

All the fancy things above work mostly because of analysis.

> Analysis = Tokenization + Token filters

> Analyzer = Character filters + Tokenizer + Token filters


Analyzers are language-specific. As of July 2015, Vietnamese is not supported, so we won't go into much detail here.
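The three stages of an analyzer can be sketched in a few lines of plain Python. This is a toy pipeline for illustration only: the HTML-stripping character filter, the regex tokenizer and the tiny stopword list are assumptions made for this example, not ElasticSearch's built-in components.

```python
import re

STOPWORDS = frozenset({'the', 'is', 'to'})  # tiny illustrative list


def strip_html(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r'<[^>]+>', '', text)


def tokenize(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r'\w+', text)


def lowercase(tokens):
    """Token filter: normalize every token to lowercase."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Token filter: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text):
    """Run character filters, then the tokenizer, then token filters."""
    for char_filter in (strip_html,):
        text = char_filter(text)
    tokens = tokenize(text)
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


print(analyze('<b>The</b> Elasticsearch API is really easy to use'))
# ['elasticsearch', 'api', 'really', 'easy', 'use']
```

The same pipeline runs on documents at index time and on the query text at search time, which is why a search for 'john' finds 'John Smith'.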

#### Mapping

A mapping is a kind of schema in ElasticSearch. It is generated automatically if we don't customize it.

In [93]:
es.get_mapping('demo', 'tweet')

{'demo': {'mappings': {'tweet': {'properties': {'date': {'format': 'dateOptionalTime',
 'type': 'date'},
 'name': {'type': 'string'},
 'tweet': {'type': 'string'},
 'user_id': {'type': 'long'}}}}}}

We can add a new field using the `put_mapping` method:

In [96]:
es.put_mapping('demo', 'tweet',
 {'tweet':
 {'properties':
 {'very_new_field': {'type': 'string'}}}})

es.get_mapping('demo', 'tweet')

{'demo': {'mappings': {'tweet': {'properties': {'date': {'format': 'dateOptionalTime',
 'type': 'date'},
 'name': {'type': 'string'},
 'tweet': {'type': 'string'},
 'user_id': {'type': 'long'},
 'very_new_field': {'type': 'string'}}}}}}

We can't change the mapping of an existing field, though:

In [98]:
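from pyelasticsearch import ElasticHttpError

try:
    es.put_mapping('demo', 'tweet',
                   {'tweet':
                       {'properties':
                           {'very_new_field': {'type': 'long'}}}})
except ElasticHttpError:  # the server rejects the conflicting type
    print("Error")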



Error


So if you need a specific mapping, specify it before indexing to make sure things go the way you want.
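One way to do that is to create the index with its mapping already in place. The sketch below assumes that `pyelasticsearch`'s `create_index` accepts a `settings` body containing `mappings`. The index name `demo_v2` and both helper functions are hypothetical, and the field types simply mirror the auto-generated mapping shown earlier.

```python
def tweet_index_settings():
    """Index settings with an explicit mapping for the `tweet` type."""
    return {
        'mappings': {
            'tweet': {
                'properties': {
                    'date': {'type': 'date'},
                    'name': {'type': 'string'},
                    'tweet': {'type': 'string'},
                    'user_id': {'type': 'long'},
                }
            }
        }
    }


def create_tweet_index(es, index='demo_v2'):
    """Create the index with its mapping before any document is indexed.

    `es` is expected to be a pyelasticsearch `ElasticSearch` client.
    """
    return es.create_index(index, settings=tweet_index_settings())
```

Calling `create_tweet_index(es)` once, before the bulk indexing step, fixes the field types up front instead of leaving them to be guessed from the first documents.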