{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Source links\n", "\n", "- https://github.com/ihuston/plpython_examples\n", "- http://tinyurl.com/ih-plpython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pivotal contributions\n", "\n", "- PyMADLib - Python Wrapper for MADlib\n", " - MADlib is massively scalable machine learning platform based on PostgreSQL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##\u00a0PL/Python\n", "\n", "- You have a massively-parallel database.\n", "- Want to run Python, e.g. nltk, scikit-learn, directly on database.\n", "- pandas, matplotlib too, all inside database. Don't need to bring data back.\n", "- Bringing the code to the data, not vice versa." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to scale\n", "\n", "- Could have 20-100 PostgreSQL instances, and one master distributor/gatherer of query/results\n", " - This is Greenplum DB. [http://en.wikipedia.org/wiki/Greenplum](http://en.wikipedia.org/wiki/Greenplum)\n", " - Free version: [http://www.ndm.net/datawarehouse/Greenplum/download-greenplum-database](http://www.ndm.net/datawarehouse/Greenplum/download-greenplum-database)\n", "- PL/Python allows user-defined functions in Python, not SQL." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "!!AI so if this is PL/Python on the database side, could you use pandas `pandas.read_sql` to execute it then return results? I guess so.\n", "\n", "- But not really the point. Really want to do everything on DB and just return a very small result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##\u00a0How to deploy\n", "\n", "- Each database node needs a full Python stack deployed.\n", " - Maintain using your favourite DevOps technique.\n", " - Pivotal use something Fabric-like.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distribution large amounts of data\n", "\n", "Since you want to join or use data across tables, each maybe on different nodes, it's important to figure out how you split data depending on how you wantt o subsequently join it.\n", "\n", "If you split by customer name, but then need it by order ID, always hitting multiple nodes.\n", "\n", "If instead you split by customer ID, and all orders for that customer tend to live in same servers, avoid multi-node hits.\n", "\n", "So when they go to customers ask \"how do you use data? what queries do you need?\" in order to speed up joins.\n", "\n", "- This \"sharding\" is explicit. Use \"DISTRIBUTED BY (name)\" to use hash of name as shard key.\n", "- Be careful that the cardinality of the output of the hash function is large enough to use all the nodes in your cluster.\n", "- When you run future SELECTs grouped by your shard key you transparently achieve parallelisation, because data lives on different nodes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applying parallelisation to machine learning\n", "\n", "- Just run your e.g. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Questions\n", "\n", "- Can the PostgreSQL query planner take data-movement costs into account?\n", "    - No.\n", "    - !!AI I think Hadoop, via HDFS data locality, does.\n", "- What about the scalability of the master: does it scale to concurrent usage?\n", "    - Not really. But it's much better than moving the data to your laptop.\n", "- !!AI This is an interesting contrast to Hive, which uses the Hive Query Language (HQL), a SQL-like syntax, to execute MapReduce jobs on Hadoop. This talk is about staying inside the SQL world yet trying to scale.\n", "- What about debugging errors: do you get tracebacks?\n", "    - There might be a way of getting stack traces, but it's not a good story yet." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }