{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Demo using ElasticSearch in Python\n", "=====================\n", "\n", "\n", "This is a quick demonstration of using ElasticSearch from Python.\n", "\n", "Some material is taken from https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html, which is a great book!\n", "\n", "\n", "## elasticsearch.yml\n", "\n", "Some changes you need to make before launching a node:\n", "\n", "- Change `cluster.name` to control which nodes on your network discover each other and join the same cluster\n", "- Change `node.name` so you can easily tell which node is in trouble\n", "\n", "Some more options:\n", "\n", "- Lock the process memory by setting `bootstrap.mlockall` to `true`, which helps performance\n", "- Set `network.host` to `127.0.0.1` for security reasons\n", "\n", "## pyelasticsearch\n", "\n", "In this demo we use the `pyelasticsearch` package, a Python wrapper around the ElasticSearch RESTful API.\n", "\n", "Install it with `pip install pyelasticsearch`.\n", "\n", "\n", "### Set things up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the `ElasticSearch` class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from pyelasticsearch import ElasticSearch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configure the URL of the ElasticSearch node. The constructor accepts more parameters, but the defaults are good enough for now."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "es = ElasticSearch('http://localhost:9200')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we check the cluster health:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'active_primary_shards': 0,\n", " 'active_shards': 0,\n", " 'cluster_name': 'elasticsearch_tai-dev',\n", " 'initializing_shards': 0,\n", " 'number_of_data_nodes': 1,\n", " 'number_of_in_flight_fetch': 0,\n", " 'number_of_nodes': 1,\n", " 'number_of_pending_tasks': 0,\n", " 'relocating_shards': 0,\n", " 'status': 'green',\n", " 'timed_out': False,\n", " 'unassigned_shards': 0}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.health()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All we care about for now is the 'green' status, which means everything is OK." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Fact**: the `health` method is a wrapper around the `GET /_cluster/health?pretty` call of the REST API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CRUD: create-read-update-delete" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we can do anything with ElasticSearch, we need to index our documents into it."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Index" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_id': '1',\n", " '_index': 'library',\n", " '_type': 'books',\n", " '_version': 1,\n", " 'created': True}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.index('library', # Index name\n", " 'books', # Type name\n", " {\n", " 'title': 'A very interesting name',\n", " 'name': {\n", " 'first': 'Hugh',\n", " 'last': 'Jackman'\n", " },\n", " 'publish_date': '2015-07-02',\n", " 'price': 20,\n", " },\n", " id=1 # Doc ID\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_id': '1',\n", " '_index': 'library',\n", " '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},\n", " 'price': 20,\n", " 'publish_date': '2015-07-02',\n", " 'title': 'A very interesting name'},\n", " '_type': 'books',\n", " '_version': 1,\n", " 'found': True}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.get('library', 'books', 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the document does not exist, `get` raises an error:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:elasticsearch:GET /library/books/123 [status:404 request:0.004s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "This is an error!\n" ] } ], "source": [ "from pyelasticsearch.exceptions import ElasticHttpNotFoundError\n", "\n", "try:\n", "    es.get('library', 'books', 123)\n", "except ElasticHttpNotFoundError:  # raised for the HTTP 404 response\n", "    print(\"This is an error!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ID is optional. If we omit it, ElasticSearch generates one (an ugly one):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": 
[ { "data": { "text/plain": [ "{'_id': 'AU5N-j9DhyYCAHYcFB3R',\n", " '_index': 'library',\n", " '_type': 'books',\n", " '_version': 1,\n", " 'created': True}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.index('library', # Index name\n", " 'books', # Type name\n", " {\n", " 'title': 'Another interesting name',\n", " 'name': {\n", " 'first': 'Tom',\n", " 'last': 'Cruise'\n", " },\n", " 'publish_date': '2015-08-02',\n", " 'price': 21,\n", " },\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now fetch that book using the generated ID:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_id': 'AU5N-j9DhyYCAHYcFB3R',\n", " '_index': 'library',\n", " '_source': {'name': {'first': 'Tom', 'last': 'Cruise'},\n", " 'price': 21,\n", " 'publish_date': '2015-08-02',\n", " 'title': 'Another interesting name'},\n", " '_type': 'books',\n", " '_version': 1,\n", " 'found': True}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.get('library', 'books', 'AU5N-j9DhyYCAHYcFB3R')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Update" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_id': '1', '_index': 'library', '_type': 'books', '_version': 2}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.update('library', # Index name\n", " 'books', # Type name\n", " id=1, # Doc ID\n", " doc={\n", " 'title': 'A very interesting name 2',\n", " 'name': {\n", " 'first': 'Hugh',\n", " 'last': 'Jackman'\n", " },\n", " 'publish_date': '2015-07-03',\n", " 'price': 30,\n", " },\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It worked, although the call is rather verbose."
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_id': '1',\n", " '_index': 'library',\n", " '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},\n", " 'price': 30,\n", " 'publish_date': '2015-07-03',\n", " 'title': 'A very interesting name 2'},\n", " '_type': 'books',\n", " '_version': 2,\n", " 'found': True}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.get('library', 'books', 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `update` method performs a partial update, so we can send only the fields that change:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_id': '1', '_index': 'library', '_type': 'books', '_version': 3}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.update('library', # Index name\n", " 'books', # Type name\n", " id=1, # Doc ID\n", " doc={\n", " 'price': 90,\n", " },\n", " )" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_id': '1',\n", " '_index': 'library',\n", " '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},\n", " 'price': 90,\n", " 'publish_date': '2015-07-03',\n", " 'title': 'A very interesting name 2'},\n", " '_type': 'books',\n", " '_version': 3,\n", " 'found': True}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.get('library', 'books', 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Delete" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_id': '1',\n", " '_index': 'library',\n", " '_type': 'books',\n", " '_version': 4,\n", " 'found': True}" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.delete('library', 
'books', 1)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:elasticsearch:GET /library/books/1 [status:404 request:0.004s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Not found!\n" ] } ], "source": [ "try:\n", " es.get('library', 'books', 1)\n", "except:\n", " print('Not found!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bulk indexing and Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bulk index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Input data:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "users = [{ \"email\" : \"john@smith.com\", \"name\" : \"John Smith\", \"username\" : \"@john\" }, \n", " { \"email\" : \"mary@jones.com\", \"name\" : \"Mary Jones\", \"username\" : \"@mary\" }]\n", "\n", "tweet = [{ \"date\" : \"2014-09-13\", \"name\" : \"Mary Jones\", \"tweet\" : \"Elasticsearch means full text search has never been so easy\", \"user_id\" : 2 },\n", " { \"date\" : \"2014-09-14\", \"name\" : \"John Smith\", \"tweet\" : \"@mary it is not just text, it does everything\", \"user_id\" : 1 },\n", " { \"date\" : \"2014-09-15\", \"name\" : \"Mary Jones\", \"tweet\" : \"However did I manage before Elasticsearch?\", \"user_id\" : 2 },\n", " { \"date\" : \"2014-09-16\", \"name\" : \"John Smith\", \"tweet\" : \"The Elasticsearch API is really easy to use\", \"user_id\" : 1 },\n", " { \"date\" : \"2014-09-17\", \"name\" : \"Mary Jones\", \"tweet\" : \"The Query DSL is really powerful and flexible\", \"user_id\" : 2 }]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bulk indexing:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'errors': False,\n", " 'items': [{'index': {'_id': '0',\n", " '_index': 
'demo',\n", " '_type': 'user',\n", " '_version': 1,\n", " 'status': 201}},\n", " {'index': {'_id': '1',\n", " '_index': 'demo',\n", " '_type': 'user',\n", " '_version': 1,\n", " 'status': 201}}],\n", " 'took': 886}" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.bulk((es.index_op(user, id=i) for i, user in enumerate(users)),\n", " index='demo',\n", " doc_type='user')" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'errors': False,\n", " 'items': [{'index': {'_id': '0',\n", " '_index': 'demo',\n", " '_type': 'tweet',\n", " '_version': 1,\n", " 'status': 201}},\n", " {'index': {'_id': '1',\n", " '_index': 'demo',\n", " '_type': 'tweet',\n", " '_version': 1,\n", " 'status': 201}},\n", " {'index': {'_id': '2',\n", " '_index': 'demo',\n", " '_type': 'tweet',\n", " '_version': 1,\n", " 'status': 201}},\n", " {'index': {'_id': '3',\n", " '_index': 'demo',\n", " '_type': 'tweet',\n", " '_version': 1,\n", " 'status': 201}},\n", " {'index': {'_id': '4',\n", " '_index': 'demo',\n", " '_type': 'tweet',\n", " '_version': 1,\n", " 'status': 201}}],\n", " 'took': 53}" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.bulk((es.index_op(t, id=i) for i, t in enumerate(tweet)),\n", " index='demo',\n", " doc_type='tweet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Search all" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '4',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'date': '2014-09-17',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'The Query DSL is really powerful and flexible',\n", " 'user_id': 2},\n", " '_type': 
'tweet'},\n", " {'_id': '0',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'email': 'john@smith.com',\n", " 'name': 'John Smith',\n", " 'username': '@john'},\n", " '_type': 'user'},\n", " {'_id': '0',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'date': '2014-09-13',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'Elasticsearch means full text search has never been so easy',\n", " 'user_id': 2},\n", " '_type': 'tweet'},\n", " {'_id': '1',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'email': 'mary@jones.com',\n", " 'name': 'Mary Jones',\n", " 'username': '@mary'},\n", " '_type': 'user'},\n", " {'_id': '1',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'date': '2014-09-14',\n", " 'name': 'John Smith',\n", " 'tweet': '@mary it is not just text, it does everything',\n", " 'user_id': 1},\n", " '_type': 'tweet'},\n", " {'_id': '2',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'date': '2014-09-15',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'However did I manage before Elasticsearch?',\n", " 'user_id': 2},\n", " '_type': 'tweet'},\n", " {'_id': '3',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet'}],\n", " 'max_score': 1.0,\n", " 'total': 7},\n", " 'timed_out': False,\n", " 'took': 6}" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.search({})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simple match" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '0',\n", " '_index': 'demo',\n", " '_score': 0.625,\n", " '_source': 
{'email': 'john@smith.com',\n", " 'name': 'John Smith',\n", " 'username': '@john'},\n", " '_type': 'user'},\n", " {'_id': '1',\n", " '_index': 'demo',\n", " '_score': 0.625,\n", " '_source': {'date': '2014-09-14',\n", " 'name': 'John Smith',\n", " 'tweet': '@mary it is not just text, it does everything',\n", " 'user_id': 1},\n", " '_type': 'tweet'},\n", " {'_id': '3',\n", " '_index': 'demo',\n", " '_score': 0.19178301,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet'}],\n", " 'max_score': 0.625,\n", " 'total': 3},\n", " 'timed_out': False,\n", " 'took': 282}" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.search('name:john', index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can avoid the query API for a while, but we can't escape it forever. The same search written as a query:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query = {'query':\n", " {'match': {'name': 'john'}}\n", " }" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '0',\n", " '_index': 'demo',\n", " '_score': 0.625,\n", " '_source': {'email': 'john@smith.com',\n", " 'name': 'John Smith',\n", " 'username': '@john'},\n", " '_type': 'user'},\n", " {'_id': '1',\n", " '_index': 'demo',\n", " '_score': 0.625,\n", " '_source': {'date': '2014-09-14',\n", " 'name': 'John Smith',\n", " 'tweet': '@mary it is not just text, it does everything',\n", " 'user_id': 1},\n", " '_type': 'tweet'},\n", " {'_id': '3',\n", " '_index': 'demo',\n", " '_score': 0.19178301,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " 
'_type': 'tweet'}],\n", " 'max_score': 0.625,\n", " 'total': 3},\n", " 'timed_out': False,\n", " 'took': 9}" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about 2 terms?" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '0',\n", " '_index': 'demo',\n", " '_score': 0.22097087,\n", " '_source': {'email': 'john@smith.com',\n", " 'name': 'John Smith',\n", " 'username': '@john'},\n", " '_type': 'user'},\n", " {'_id': '0',\n", " '_index': 'demo',\n", " '_score': 0.22097087,\n", " '_source': {'date': '2014-09-13',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'Elasticsearch means full text search has never been so easy',\n", " 'user_id': 2},\n", " '_type': 'tweet'},\n", " {'_id': '1',\n", " '_index': 'demo',\n", " '_score': 0.22097087,\n", " '_source': {'email': 'mary@jones.com',\n", " 'name': 'Mary Jones',\n", " 'username': '@mary'},\n", " '_type': 'user'},\n", " {'_id': '1',\n", " '_index': 'demo',\n", " '_score': 0.22097087,\n", " '_source': {'date': '2014-09-14',\n", " 'name': 'John Smith',\n", " 'tweet': '@mary it is not just text, it does everything',\n", " 'user_id': 1},\n", " '_type': 'tweet'},\n", " {'_id': '4',\n", " '_index': 'demo',\n", " '_score': 0.028130025,\n", " '_source': {'date': '2014-09-17',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'The Query DSL is really powerful and flexible',\n", " 'user_id': 2},\n", " '_type': 'tweet'},\n", " {'_id': '2',\n", " '_index': 'demo',\n", " '_score': 0.028130025,\n", " '_source': {'date': '2014-09-15',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'However did I manage before Elasticsearch?',\n", " 'user_id': 2},\n", " '_type': 'tweet'},\n", " {'_id': '3',\n", " '_index': 'demo',\n", " '_score': 0.028130025,\n", " 
'_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet'}],\n", " 'max_score': 0.22097087,\n", " 'total': 7},\n", " 'timed_out': False,\n", " 'took': 179}" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = {'query':\n", " {'match': {'name': 'john mary'}}\n", " }\n", "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And a phrase?" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [], 'max_score': None, 'total': 0},\n", " 'timed_out': False,\n", " 'took': 147}" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = {'query':\n", " {'match_phrase': {'name': 'john mary'}}\n", " }\n", "\n", "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike `get`, `search` does not raise an error when nothing matches. It simply returns an empty list of hits, which is much less scary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Boolean combination" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can write boolean combinations with `must`, `must_not` and `should`:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does John Smith mention \"API\" in his tweets?"
] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '3',\n", " '_index': 'demo',\n", " '_score': 0.38595587,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet'}],\n", " 'max_score': 0.38595587,\n", " 'total': 1},\n", " 'timed_out': False,\n", " 'took': 13}" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \\\n", "{\n", " \"query\": {\n", " \"bool\": {\n", " \"must\": [\n", " {\n", " \"match_phrase\": {\n", " \"name\": \"john smith\"\n", " }\n", " },\n", " {\n", " \"match\": {\n", " \"tweet\": \"API\"\n", " }\n", " }\n", " ]\n", " }\n", " }\n", "}\n", "\n", "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can weight the importance of each clause in a combination using the `boost` field:\n", "\n", "Let's try it with 'DSL' and 'API':" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '4',\n", " '_index': 'demo',\n", " '_score': 0.04016714,\n", " '_source': {'date': '2014-09-17',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'The Query DSL is really powerful and flexible',\n", " 'user_id': 2},\n", " '_type': 'tweet'},\n", " {'_id': '3',\n", " '_index': 'demo',\n", " '_score': 0.0029369325,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet'}],\n", " 'max_score': 0.04016714,\n", " 'total': 2},\n", " 'timed_out': False,\n", " 'took': 10}" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "query = \\\n", "{\n", " \"query\": {\n", " \"bool\": {\n", " \"should\": [\n", " {\n", " \"match\": {\n", " \"tweet\": {\n", " \"query\": \"DSL\",\n", " \"boost\": 5,\n", " } \n", " }\n", " },\n", " {\n", " \"match\": {\n", " \"tweet\": \"API\"\n", " }\n", " }\n", " ]\n", " }\n", " }\n", "}\n", "\n", "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now change `boost`, and the order change:" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '3',\n", " '_index': 'demo',\n", " '_score': 0.025078464,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet'},\n", " {'_id': '4',\n", " '_index': 'demo',\n", " '_score': 0.0072710635,\n", " '_source': {'date': '2014-09-17',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'The Query DSL is really powerful and flexible',\n", " 'user_id': 2},\n", " '_type': 'tweet'}],\n", " 'max_score': 0.025078464,\n", " 'total': 2},\n", " 'timed_out': False,\n", " 'took': 9}" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \\\n", "{\n", " \"query\": {\n", " \"bool\": {\n", " \"should\": [\n", " {\n", " \"match\": {\n", " \"tweet\": {\n", " \"query\": \"DSL\",\n", " \"boost\": 0.5,\n", " } \n", " }\n", " },\n", " {\n", " \"match\": {\n", " \"tweet\": {\n", " \"query\": \"API\"\n", " }\n", " }\n", " }\n", " ]\n", " }\n", " }\n", "}\n", "\n", "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Highlight the result:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': 
[{'_id': '3',\n", " '_index': 'demo',\n", " '_score': 0.38595587,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet',\n", " 'highlight': {'tweet': ['The Elasticsearch API is really easy to use']}}],\n", " 'max_score': 0.38595587,\n", " 'total': 1},\n", " 'timed_out': False,\n", " 'took': 15}" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \\\n", "{\n", " \"query\": {\n", " \"bool\": {\n", " \"must\": [\n", " {\n", " \"match_phrase\": {\n", " \"name\": \"john smith\"\n", " }\n", " },\n", " {\n", " \"match\": {\n", " \"tweet\": \"API\"\n", " }\n", " }\n", " ]\n", " }\n", " },\n", " \"highlight\": {\n", " \"fields\": {\n", " \"tweet\": {}\n", " }\n", " }\n", "}\n", "\n", "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Filter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find all tweets posted after '2014-09-15':" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '4',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'date': '2014-09-17',\n", " 'name': 'Mary Jones',\n", " 'tweet': 'The Query DSL is really powerful and flexible',\n", " 'user_id': 2},\n", " '_type': 'tweet'},\n", " {'_id': '3',\n", " '_index': 'demo',\n", " '_score': 1.0,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet'}],\n", " 'max_score': 1.0,\n", " 'total': 2},\n", " 'timed_out': False,\n", " 'took': 8}" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \\\n", "{\n", " \"query\": {\n", " \"filtered\": {\n", " 
\"filter\": {\n", " \"range\": {\n", " \"date\": {\n", " \"gt\": '2014-09-15'\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about listing only John Smith's tweets posted after 2014-09-15?" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_shards': {'failed': 0, 'successful': 5, 'total': 5},\n", " 'hits': {'hits': [{'_id': '3',\n", " '_index': 'demo',\n", " '_score': 0.38356602,\n", " '_source': {'date': '2014-09-16',\n", " 'name': 'John Smith',\n", " 'tweet': 'The Elasticsearch API is really easy to use',\n", " 'user_id': 1},\n", " '_type': 'tweet'}],\n", " 'max_score': 0.38356602,\n", " 'total': 1},\n", " 'timed_out': False,\n", " 'took': 12}" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \\\n", "{\n", " \"query\": {\n", " \"filtered\": {\n", " \"query\": {\n", " \"match_phrase\": {\n", " \"name\": \"John Smith\"\n", " }\n", " },\n", " \"filter\": {\n", " \"range\": {\n", " \"date\": {\n", " \"gt\": '2014-09-15'\n", " }\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "es.search(query, index='demo')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Analysis and Analyzer\n", "\n", "All the fancy things above work mostly because of analysis.\n", "\n", "> Analysis = Tokenization + Token filters\n", "\n", "> Analyzer = Character filters + Tokenizer + Token filters\n", "\n", "\n", "Analyzers are language-specific. As of July 2015, Vietnamese is not supported, so we won't go into more detail here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mapping\n", "\n", "A mapping is a kind of schema in ElasticSearch. It is generated automatically if we don't define it ourselves."
] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'demo': {'mappings': {'tweet': {'properties': {'date': {'format': 'dateOptionalTime',\n", " 'type': 'date'},\n", " 'name': {'type': 'string'},\n", " 'tweet': {'type': 'string'},\n", " 'user_id': {'type': 'long'}}}}}}" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.get_mapping('demo', 'tweet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can add a new field using the `put_mapping` method:" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'demo': {'mappings': {'tweet': {'properties': {'date': {'format': 'dateOptionalTime',\n", " 'type': 'date'},\n", " 'name': {'type': 'string'},\n", " 'tweet': {'type': 'string'},\n", " 'user_id': {'type': 'long'},\n", " 'very_new_field': {'type': 'string'}}}}}}" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.put_mapping('demo', 'tweet',\n", " {'tweet':\n", " {'properties':\n", " {'very_new_field': {'type': 'string'}}}})\n", "\n", "es.get_mapping('demo', 'tweet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we can't change the mapping of an existing field:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:elasticsearch:PUT /demo/tweet/_mapping [status:400 request:0.068s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Error\n" ] } ], "source": [ "from pyelasticsearch.exceptions import ElasticHttpError\n", "\n", "try:\n", "    es.put_mapping('demo', 'tweet',\n", "                   {'tweet':\n", "                    {'properties':\n", "                     {'very_new_field': {'type': 'long'}}}})\n", "except ElasticHttpError:  # raised for the HTTP 400 response\n", "    print(\"Error\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So if the mapping matters, specify it before indexing to make sure your fields are stored the way you want."
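, "\n",
"\n",
"A minimal sketch of that advice, assuming the `create_index(index, settings=None)` method of `pyelasticsearch`. The index name `demo_v2` and the helper `build_index_settings` are hypothetical and not part of the demo above. The idea is to put the mapping into the index settings and create the index before any document is indexed:\n",
"\n",
"```python\n",
"# Hypothetical index name, used only in this sketch.\n",
"INDEX_NAME = 'demo_v2'\n",
"\n",
"\n",
"def build_index_settings():\n",
"    # Explicit mapping for the tweet type, mirroring the fields used above.\n",
"    return {\n",
"        'mappings': {\n",
"            'tweet': {\n",
"                'properties': {\n",
"                    'date': {'type': 'date'},\n",
"                    'name': {'type': 'string'},\n",
"                    'tweet': {'type': 'string'},\n",
"                    'user_id': {'type': 'long'},\n",
"                }\n",
"            }\n",
"        }\n",
"    }\n",
"\n",
"\n",
"settings = build_index_settings()\n",
"\n",
"# With a running node and the es client from above, the index could then be\n",
"# created up front (commented out so the sketch runs without a server):\n",
"# es.create_index(INDEX_NAME, settings=settings)\n",
"```\n",
"\n",
"Documents indexed into that index afterwards would follow the declared field types instead of the automatically generated ones."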
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.0" } }, "nbformat": 4, "nbformat_minor": 0 }