{
 "metadata": {
  "name": "",
  "signature": "sha256:bf0e558822c7bc4602bc82fc3603154d6db6c5eff809694adf693f6cb2fc91ea"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Elasticsearch experiments\n",
      "\n",
      "This is a tutorial on [ElasticSearch](http://www.elasticsearch.org/) for the [Pandas](http://pandas.pydata.org/) programmer.\n",
      "\n",
      "Installing is easy. Get [Java](https://java.com/en/download/) and [download elasticsearch](http://www.elasticsearch.org/overview/elkdownloads/). Unzip it. Then run `bin/elasticsearch.bat` or `bin/elasticsearch`."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Populating a dataset\n",
      "\n",
      "Let's take a dataset from the Internet and populate it into Elasticsearch. This [Singapore property data](https://github.com/weilu/singapore-property-market/blob/master/data.csv) is archived at http://files.gramener.com/data/. Let's load that via Pandas."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "data = pd.read_csv('http://files.gramener.com/data/singapore-property-data.csv')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The data has information about various construction projects, their location, the number of units and the price range at which they were sold."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data.irow(0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "year                                                                                      2007\n",
        "month                                                                                        6\n",
        "Project Name                                                                          38 AMBER\n",
        "Street Name                                                                         AMBER ROAD\n",
        "Developer                                                       Ho Brothers Investment Pte Ltd\n",
        "Property Type                                                                       Non-Landed\n",
        "Locality                                                                                   RCR\n",
        "Total Number of Units in Project                                                            30\n",
        "Cumulative Units Launched to-date                                                            0\n",
        "Cumulative Units Sold to-date                                                                0\n",
        "Cumulative Units Launched but Unsold                                                         0\n",
        "Units Launched in the Month                                                                  0\n",
        "Units Sold in the Month                                                                      0\n",
        "Median Price($psm) # in the Month Number Sold By Price Range                                 -\n",
        "Lowest Price ($psm) # in the Month                                                           -\n",
        "Highest Price ($psm) # in the Month                                                          -\n",
        "Name: 0, dtype: object"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's populate this into Elasticsearch under an index called `singapore` and a type called `property`.\n",
      "\n",
      "First, let's [delete](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-delete-index.html) everything under it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "import requests\n",
      "\n",
      "root = 'http://localhost:9200'\n",
      "database = root + '/singapore/property'\n",
      "requests.delete(database).json()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 85,
       "text": [
        "{u'acknowledged': True}"
       ]
      }
     ],
     "prompt_number": 85
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Loop through the first 5 row and add the document. The ID is the index\n",
      "for index, row in data.head().iterrows():\n",
      "    print requests.put(database + '/%d' % index, data=row.to_json()).json()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{u'_type': u'property', u'_id': u'0', u'created': True, u'_version': 1, u'_index': u'singapore'}\n",
        "{u'_type': u'property', u'_id': u'1', u'created': True, u'_version': 1, u'_index': u'singapore'}\n",
        "{u'_type': u'property', u'_id': u'2', u'created': True, u'_version': 1, u'_index': u'singapore'}\n",
        "{u'_type': u'property', u'_id': u'3', u'created': True, u'_version': 1, u'_index': u'singapore'}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{u'_type': u'property', u'_id': u'4', u'created': True, u'_version': 1, u'_index': u'singapore'}\n"
       ]
      }
     ],
     "prompt_number": 86
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The above step is slow for a large number of documents, and can take a few minutes. The [bulk API](http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html) is faster. Without worrying about how, let's run this code:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bulk_command = []\n",
      "action = json.dumps({\n",
      "    \"index\": {\n",
      "        \"_index\": \"singapore\",\n",
      "        \"_type\": \"property\",\n",
      "        \"_id\": \"%d\"\n",
      "    }\n",
      "})\n",
      "for index, row in data.iterrows():\n",
      "    bulk_command.append(action % index + '\\n')\n",
      "    bulk_command.append(row.to_json() + '\\n')\n",
      "\n",
      "result = requests.post(root + '/_bulk', data=''.join(bulk_command)).json()\n",
      "print 'Errors:', result['errors']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Errors: False\n"
       ]
      }
     ],
     "prompt_number": 87
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "requests.get(database + '/_count').json()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 88,
       "text": [
        "{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5}, u'count': 31683}"
       ]
      }
     ],
     "prompt_number": 88
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Comparing queries"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Count the number of documents"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(data)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 89,
       "text": [
        "31683"
       ]
      }
     ],
     "prompt_number": 89
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "requests.post(database + '/_count').json()['count']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 92,
       "text": [
        "31683"
       ]
      }
     ],
     "prompt_number": 92
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Count the number of projects in each year"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data.groupby('year')['Project Name'].count()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 91,
       "text": [
        "year\n",
        "2007    2402\n",
        "2008    5072\n",
        "2009    5587\n",
        "2010    5652\n",
        "2011    5819\n",
        "2012    6121\n",
        "2013    1030\n",
        "Name: Project Name, dtype: int64"
       ]
      }
     ],
     "prompt_number": 91
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "requests.post(database + '/_search?search_type=count', data=json.dumps({\n",
      "    \"size\": 0,\n",
      "    \"aggs\": {\n",
      "        \"yearwise\": {\n",
      "            \"terms\": {\n",
      "                \"field\": \"year\"\n",
      "            }\n",
      "        }\n",
      "    }\n",
      "})).json()['aggregations']['yearwise']['buckets']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 108,
       "text": [
        "[{u'doc_count': 6121, u'key': 2012},\n",
        " {u'doc_count': 5819, u'key': 2011},\n",
        " {u'doc_count': 5652, u'key': 2010},\n",
        " {u'doc_count': 5587, u'key': 2009},\n",
        " {u'doc_count': 5072, u'key': 2008},\n",
        " {u'doc_count': 2402, u'key': 2007},\n",
        " {u'doc_count': 1030, u'key': 2013}]"
       ]
      }
     ],
     "prompt_number": 108
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Sum the Total units each year"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "data.groupby('year')['Total Number of Units in Project'].sum()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 106,
       "text": [
        "year\n",
        "2007    279810\n",
        "2008    533265\n",
        "2009    591630\n",
        "2010    653184\n",
        "2011    792051\n",
        "2012    978774\n",
        "2013    177945\n",
        "Name: Total Number of Units in Project, dtype: int64"
       ]
      }
     ],
     "prompt_number": 106
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "requests.post(database + '/_search?search_type=count', data=json.dumps({\n",
      "    \"size\": 0,\n",
      "    \"aggs\": {\n",
      "        \"yearwise\": {\n",
      "            \"terms\": {\n",
      "                \"field\": \"year\"\n",
      "            },\n",
      "            \"aggs\": {\n",
      "                \"total_units\": {\n",
      "                    \"sum\": {\n",
      "                        \"field\": \"Total Number of Units in Project\"\n",
      "                    }\n",
      "                }\n",
      "            }\n",
      "        }\n",
      "    }\n",
      "})).json()['aggregations']['yearwise']['buckets']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 110,
       "text": [
        "[{u'doc_count': 6121, u'key': 2012, u'total_units': {u'value': 978774.0}},\n",
        " {u'doc_count': 5819, u'key': 2011, u'total_units': {u'value': 792051.0}},\n",
        " {u'doc_count': 5652, u'key': 2010, u'total_units': {u'value': 653184.0}},\n",
        " {u'doc_count': 5587, u'key': 2009, u'total_units': {u'value': 591630.0}},\n",
        " {u'doc_count': 5072, u'key': 2008, u'total_units': {u'value': 533265.0}},\n",
        " {u'doc_count': 2402, u'key': 2007, u'total_units': {u'value': 279810.0}},\n",
        " {u'doc_count': 1030, u'key': 2013, u'total_units': {u'value': 177945.0}}]"
       ]
      }
     ],
     "prompt_number": 110
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Shutdown Elasticsearch"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Shutdown\n",
      "requests.post(root + '/_shutdown').json()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 111,
       "text": [
        "{u'cluster_name': u'elasticsearch',\n",
        " u'nodes': {u'cZkboCdnSi2PqPUJD6pfiA': {u'name': u'Tenpin'}}}"
       ]
      }
     ],
     "prompt_number": 111
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}