{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [
{ "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 2 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Where to get this talk\n", "\n", "https://github.com/enthought/pydata-ldn-2014" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Background of scientists\n", "\n", "- Bridging the gap between databases and pandas.\n", "- Scientists use SQL, or are forced to use e.g. Microsoft SQL Server.\n", "- How do we help them reach pandas?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Why should you care about databases at all? StackOverflow\n", "\n", "- 18 million questions, 23 GB of XML for posts.\n", "- Multiple tables, relationships.\n", "- Popular tags? Trending tags over time?\n", "\n", "Idea: store the data in a database, then load a post-processed subset into pandas. It'd be nice if we could do everything on a regular laptop." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "!!AI isn't this why HDF5 was invented? Or is that only suitable for numeric data?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Goals\n", "\n", "- Importing data from PostgreSQL into pandas.\n", "- Using SQLAlchemy without too much pain.\n", "- Using pandasql to make data manipulation easier.\n", "- Semi-structured data in PostgreSQL through HSTORE and SQLAlchemy.\n", "- We are **not** talking about:\n", "    - \"Big data\": this dataset is quite small.\n", "    - Scalability." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# What is a database?\n", "\n", "- The Wikipedia definition is useless: by that definition NumPy arrays, pandas DataFrames, and CSV files are all databases.\n", "- What makes MySQL / PostgreSQL / Oracle databases?\n", "    - Query languages?\n", "    - Data modelling?\n", "    - Storage?\n", "- Well, NumPy arrays are fast because data is contiguous in one big block in-memory.\n", "    - You can use memory mapping to have file-backed NumPy arrays.\n", "    - But then concurrent read/write access to this memory-mapped file gets very painful; it's easy to suffer corruption.\n", "- Concurrent read/write access to massive file-backed data that cannot fit in-memory is where databases excel.\n", "    - !!AI And HDF5 doesn't offer concurrent read/write access, and in fact is poor at random writes, only fast for appends (I think?)\n", "- So storage and data modelling are where databases are useful." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import pandasql\n", "\n", "# Useful shim, saves typing: run SQL against DataFrames in scope.\n", "# A usage sketch follows the B-trees cell below.\n", "pysqldf = lambda q: pandasql.sqldf(q, globals())\n", "\n", "# !!AI maybe use examples from Intro to Data Science course,\n", "# it's identical to this." ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Database data structures\n", "\n", "- Databases use B-trees, which make search and insert efficient; NumPy arrays can do neither efficiently.\n", "    - Take a NumPy array of a million entries and append another million: there is no choice but to allocate a new array of two million and copy (see the sketch below).\n", "- Databases know how to do disk-based I/O, and random I/O, very well. NumPy and pandas don't." ] }
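,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the append-cost claim above, assuming nothing beyond NumPy itself; the sizes are arbitrary." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "a = np.arange(1000000)\n", "b = np.arange(1000000)\n", "\n", "# np.concatenate cannot grow `a` in place: it allocates a brand-new\n", "# two-million-entry array and copies both inputs into it.\n", "c = np.concatenate([a, b])\n", "print(c.size)  # 2000000" ], "language": "python", "metadata": {}, "outputs": [] }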
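,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A usage sketch for the `pysqldf` shim defined earlier; the DataFrame and its columns are made up for illustration." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Toy DataFrame; pandasql looks it up by name via globals().\n", "tags = pd.DataFrame({\n", "    \"tag\": [\"python\", \"sql\", \"python\", \"pandas\"],\n", "    \"score\": [10, 5, 7, 3],\n", "})\n", "\n", "# pandasql runs the query with SQLite against in-scope DataFrames.\n", "print(pysqldf(\"SELECT tag, SUM(score) AS total FROM tags GROUP BY tag\"))" ], "language": "python", "metadata": {}, "outputs": [] }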
,
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Using SQLAlchemy to deal with databases\n", "\n", "- The most interesting part isn't that it's an Object Relational Mapper (ORM).\n", "- It executes queries in layers, and the ORM layer is optional.\n", "    - You can deal directly with tables and data types; a Core-level sketch follows the \"Crossing the boundary\" cell below.\n", "\n", "- !!AI the speaker gives an SQLAlchemy tutorial via IPython Notebook." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Moving onto StackOverflow data\n", "\n", "- Start with `math.stackexchange.com` Posts.xml.\n", "- Source (I think): [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange) (500 MB 7-zip file!!)\n", "- Uses `etree.iterparse` because the XML file is massive; don't load it all into memory.\n", "- Commit every 1000 inserts: don't autocommit per row, and don't shove everything into one transaction either.\n", "- A loading sketch appears after the \"Crossing the boundary\" cell below." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Crossing the boundary" ] },
{ "cell_type": "code", "collapsed": true, "input": [ "# !!AI won't run, just the gist\n", "\n", "import pandas.io.sql\n", "import psycopg2\n", "\n", "# psycopg2.connect() needs real connection parameters (dbname, user, ...).\n", "connection = psycopg2.connect()  # !!AI TODO fill in\n", "\n", "math_by_date = pandas.io.sql.read_sql(\"\"\"\\\n", "    SELECT ...\n", "    FROM ...\n", "    WHERE ...\n", "    AND ...\n", "    AND ...\n", "    GROUP BY ...\n", "\"\"\", connection)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }
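,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the \"layers\" point from the SQLAlchemy cell above, using only the Core layer: tables, types, and queries, with no ORM classes anywhere. The table, its columns, and the in-memory SQLite engine are made up for illustration." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import sqlalchemy as sa\n", "\n", "# In-memory SQLite purely for illustration; the talk targets PostgreSQL.\n", "engine = sa.create_engine(\"sqlite://\")\n", "metadata = sa.MetaData()\n", "\n", "# Core layer: tables and data types, no ORM classes.\n", "posts = sa.Table(\"posts\", metadata,\n", "    sa.Column(\"id\", sa.Integer, primary_key=True),\n", "    sa.Column(\"title\", sa.Text),\n", "    sa.Column(\"score\", sa.Integer),\n", ")\n", "metadata.create_all(engine)\n", "\n", "conn = engine.connect()\n", "conn.execute(posts.insert(), [\n", "    {\"title\": \"What is a sheaf?\", \"score\": 42},\n", "    {\"title\": \"Why does my integral diverge?\", \"score\": 7},\n", "])\n", "\n", "# The SQL expression language replaces raw SQL strings.\n", "query = sa.select([posts.c.title]).where(posts.c.score > 10)\n", "for row in conn.execute(query):\n", "    print(row.title)" ], "language": "python", "metadata": {}, "outputs": [] }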
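,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A hedged sketch of the `iterparse` loading pattern from \"Moving onto StackOverflow data\" above: stream `<row>` elements, insert, and commit in batches. The connection parameters and the `posts` column names are assumptions, not the talk's actual schema." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from lxml import etree\n", "import psycopg2\n", "\n", "conn = psycopg2.connect(dbname=\"stackoverflow\")  # hypothetical database\n", "cur = conn.cursor()\n", "\n", "BATCH = 1000\n", "for i, (event, elem) in enumerate(etree.iterparse(\"Posts.xml\", tag=\"row\")):\n", "    # Each post in the dump is a single <row> element with attributes.\n", "    cur.execute(\"INSERT INTO posts (id, creation_date) VALUES (%s, %s)\",\n", "                (elem.get(\"Id\"), elem.get(\"CreationDate\")))\n", "    # Free processed elements so memory stays flat on a huge file.\n", "    elem.clear()\n", "    while elem.getprevious() is not None:\n", "        del elem.getparent()[0]\n", "    if i % BATCH == BATCH - 1:\n", "        conn.commit()  # batched commits: not per-row, not one giant transaction\n", "conn.commit()" ], "language": "python", "metadata": {}, "outputs": [] }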
,
{ "cell_type": "code", "collapsed": false, "input": [ "# More work with pandas.io.sql" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Semi-structured data analysis\n", "\n", "- Using PostgreSQL HSTORE; a sketch follows below.\n", "- !!AI I guess Entity-Attribute-Value (EAV) is controversial in SQL circles?\n", "- Note that SQLAlchemy knows about HSTORE!\n", "- `pandas.io.sql.read_sql` handles HSTORE too, because it executes queries directly over a `psycopg2` connection.\n", "- How HSTOREs actually work: [http://stackoverflow.com/questions/13133774/how-does-hstore-internally-store-the-data](http://stackoverflow.com/questions/13133774/how-does-hstore-internally-store-the-data)\n" ] }
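,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A hedged sketch of HSTORE through SQLAlchemy, assuming a local PostgreSQL database with the hstore extension already installed (`CREATE EXTENSION hstore;`); the connection string, table, and key names are made up." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import sqlalchemy as sa\n", "from sqlalchemy.dialects.postgresql import HSTORE\n", "\n", "# Hypothetical connection string.\n", "engine = sa.create_engine(\"postgresql:///stackoverflow\")\n", "metadata = sa.MetaData()\n", "\n", "# One HSTORE column stores arbitrary string key/value pairs per row.\n", "posts_meta = sa.Table(\"posts_meta\", metadata,\n", "    sa.Column(\"id\", sa.Integer, primary_key=True),\n", "    sa.Column(\"attrs\", HSTORE),\n", ")\n", "metadata.create_all(engine)\n", "\n", "conn = engine.connect()\n", "conn.execute(posts_meta.insert(), [\n", "    {\"id\": 1, \"attrs\": {\"tag\": \"calculus\", \"closed\": \"false\"}},\n", "])\n", "\n", "# HSTORE values round-trip as plain Python dicts.\n", "row = conn.execute(sa.select([posts_meta])).fetchone()\n", "print(row.attrs[\"tag\"])" ], "language": "python", "metadata": {}, "outputs": [] }
], "metadata": {} } ] }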