{ "metadata": { "name": "", "signature": "sha256:bf0e558822c7bc4602bc82fc3603154d6db6c5eff809694adf693f6cb2fc91ea" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Elasticsearch experiments\n", "\n", "This is a tutorial on [ElasticSearch](http://www.elasticsearch.org/) for the [Pandas](http://pandas.pydata.org/) programmer.\n", "\n", "Installing is easy. Get [Java](https://java.com/en/download/) and [download elasticsearch](http://www.elasticsearch.org/overview/elkdownloads/). Unzip it. Then run `bin/elasticsearch.bat` or `bin/elasticsearch`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Populating a dataset\n", "\n", "Let's take a dataset from the Internet and populate it into Elasticsearch. This [Singapore property data](https://github.com/weilu/singapore-property-market/blob/master/data.csv) is archived at http://files.gramener.com/data/. Let's load that via Pandas." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "data = pd.read_csv('http://files.gramener.com/data/singapore-property-data.csv')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data has information about various construction projects, their location, the number of units and the price range at which they were sold." ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.irow(0)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "year 2007\n", "month 6\n", "Project Name 38 AMBER\n", "Street Name AMBER ROAD\n", "Developer Ho Brothers Investment Pte Ltd\n", "Property Type Non-Landed\n", "Locality RCR\n", "Total Number of Units in Project 30\n", "Cumulative Units Launched to-date 0\n", "Cumulative Units Sold to-date 0\n", "Cumulative Units Launched but Unsold 0\n", "Units Launched in the Month 0\n", "Units Sold in the Month 0\n", "Median Price($psm) # in the Month Number Sold By Price Range -\n", "Lowest Price ($psm) # in the Month -\n", "Highest Price ($psm) # in the Month -\n", "Name: 0, dtype: object" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's populate this into Elasticsearch under an index called `singapore` and a type called `property`.\n", "\n", "First, let's [delete](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-delete-index.html) everything under it." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "import requests\n", "\n", "root = 'http://localhost:9200'\n", "database = root + '/singapore/property'\n", "requests.delete(database).json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 85, "text": [ "{u'acknowledged': True}" ] } ], "prompt_number": 85 }, { "cell_type": "code", "collapsed": false, "input": [ "# Loop through the first 5 row and add the document. The ID is the index\n", "for index, row in data.head().iterrows():\n", " print requests.put(database + '/%d' % index, data=row.to_json()).json()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "{u'_type': u'property', u'_id': u'0', u'created': True, u'_version': 1, u'_index': u'singapore'}\n", "{u'_type': u'property', u'_id': u'1', u'created': True, u'_version': 1, u'_index': u'singapore'}\n", "{u'_type': u'property', u'_id': u'2', u'created': True, u'_version': 1, u'_index': u'singapore'}\n", "{u'_type': u'property', u'_id': u'3', u'created': True, u'_version': 1, u'_index': u'singapore'}" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "{u'_type': u'property', u'_id': u'4', u'created': True, u'_version': 1, u'_index': u'singapore'}\n" ] } ], "prompt_number": 86 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above step is slow for a large number of documents, and can take a few minutes. The [bulk API](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html) is faster. Without worrying about how, let's run this code:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "bulk_command = []\n", "action = json.dumps({\n", " \"index\": {\n", " \"_index\": \"singapore\",\n", " \"_type\": \"property\",\n", " \"_id\": \"%d\"\n", " }\n", "})\n", "for index, row in data.iterrows():\n", " bulk_command.append(action % index + '\\n')\n", " bulk_command.append(row.to_json() + '\\n')\n", "\n", "result = requests.post(root + '/_bulk', data=''.join(bulk_command)).json()\n", "print 'Errors:', result['errors']" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Errors: False\n" ] } ], "prompt_number": 87 }, { "cell_type": "code", "collapsed": false, "input": [ "requests.get(database + '/_count').json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 88, "text": [ "{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, u'count': 31683}" ] } ], "prompt_number": 88 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparing queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of documents" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 89, "text": [ "31683" ] } ], "prompt_number": 89 }, { "cell_type": "code", "collapsed": false, "input": [ "requests.post(database + '/_count').json()['count']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 92, "text": [ "31683" ] } ], "prompt_number": 92 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count the number of projects in each year" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.groupby('year')['Project Name'].count()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 91, "text": [ "year\n", "2007 2402\n", "2008 5072\n", "2009 5587\n", "2010 5652\n", "2011 5819\n", "2012 6121\n", "2013 1030\n", "Name: Project Name, dtype: int64" ] } ], "prompt_number": 91 }, { "cell_type": "code", "collapsed": false, "input": [ "requests.post(database + '/_search?search_type=count', data=json.dumps({\n", " \"size\": 0,\n", " \"aggs\": {\n", " \"yearwise\": {\n", " \"terms\": {\n", " \"field\": \"year\"\n", " }\n", " }\n", " }\n", "})).json()['aggregations']['yearwise']['buckets']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 108, "text": [ "[{u'doc_count': 6121, u'key': 2012},\n", " {u'doc_count': 5819, u'key': 2011},\n", " {u'doc_count': 5652, u'key': 2010},\n", " {u'doc_count': 5587, u'key': 2009},\n", " {u'doc_count': 5072, u'key': 2008},\n", " {u'doc_count': 2402, u'key': 2007},\n", " {u'doc_count': 1030, u'key': 2013}]" ] } ], "prompt_number": 108 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sum the Total units each year" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.groupby('year')['Total Number of Units in Project'].sum()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 106, "text": [ "year\n", "2007 279810\n", "2008 533265\n", "2009 591630\n", "2010 653184\n", "2011 792051\n", "2012 978774\n", "2013 177945\n", "Name: Total Number of Units in Project, dtype: int64" ] } ], "prompt_number": 106 }, { "cell_type": "code", "collapsed": false, "input": [ "requests.post(database + '/_search?search_type=count', data=json.dumps({\n", " \"size\": 0,\n", " \"aggs\": {\n", " \"yearwise\": {\n", " \"terms\": {\n", " \"field\": \"year\"\n", " },\n", " \"aggs\": {\n", " \"total_units\": {\n", " \"sum\": {\n", " \"field\": \"Total Number of Units in Project\"\n", " }\n", " }\n", " }\n", " }\n", " }\n", "})).json()['aggregations']['yearwise']['buckets']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 110, "text": [ "[{u'doc_count': 6121, u'key': 2012, u'total_units': {u'value': 978774.0}},\n", " {u'doc_count': 5819, u'key': 2011, u'total_units': {u'value': 792051.0}},\n", " {u'doc_count': 5652, u'key': 2010, u'total_units': {u'value': 653184.0}},\n", " {u'doc_count': 5587, u'key': 2009, u'total_units': {u'value': 591630.0}},\n", " {u'doc_count': 5072, u'key': 2008, u'total_units': {u'value': 533265.0}},\n", " {u'doc_count': 2402, u'key': 2007, u'total_units': {u'value': 279810.0}},\n", " {u'doc_count': 1030, u'key': 2013, u'total_units': {u'value': 177945.0}}]" ] } ], "prompt_number": 110 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Shutdown Elasticsearch" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Shutdown\n", "requests.post(root + '/_shutdown').json()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 111, "text": [ "{u'cluster_name': u'elasticsearch',\n", " u'nodes': {u'cZkboCdnSi2PqPUJD6pfiA': {u'name': u'Tenpin'}}}" ] } ], "prompt_number": 111 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }