{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [
{ "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 2 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Where to get this talk\n", "\n", "https://github.com/enthought/pydata-ldn-2014" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Background of scientists\n", "\n", "- Bridging the gap between databases and pandas.\n", "- Scientists use SQL, or are forced to use e.g. Microsoft SQL Server.\n", "- How do we help them reach pandas?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Why should you care about databases at all? StackOverflow\n", "\n", "- 18 million questions, 23 GB of XML for posts.\n", "- Multiple tables, relationships.\n", "- Popular tags? Trending tags over time?\n", "\n", "Idea: store the data in a database, then load a post-processed subset into pandas. It'd be nice if we could do everything on a regular laptop." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "!!AI isn't this why HDF5 was invented? Or is that only suitable for numeric data?" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Goals\n", "\n", "- Importing data from PostgreSQL into pandas.\n", "- Using SQLAlchemy without too much pain.\n", "- Using pandasql to make data manipulation easier.\n", "- Semi-structured data in PostgreSQL through HSTORE and SQLAlchemy.\n", "- We are **not** talking about:\n", "    - \"Big data\": this dataset is quite small.\n", "    - Scalability." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# What is a database?\n", "\n", "- The Wikipedia definition is useless: by that definition NumPy arrays, pandas DataFrames, and CSV files are all databases.\n", "- What makes MySQL / PostgreSQL / Oracle databases?\n", "    - Query languages?\n", "    - Data modelling?\n", "    - Storage?\n", "- Well, NumPy arrays are fast because data is contiguous in one big block in-memory.\n", "    - You can use memory mapping to have file-backed NumPy arrays.\n", "    - But then concurrent read/write access to this memory-mapped file gets very painful; it's easy to suffer corruption.\n", "- Concurrent read/write access to massive file-backed data that cannot fit in-memory is where databases excel.\n", "    - !!AI And HDF5 doesn't offer concurrent read/write access, and in fact is poor at random writes, only fast for appends (I think?)\n", "- So storage and data modelling are where databases are useful." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import pandasql\n", "\n", "# Useful shim, saves typing: run SQL against DataFrames in scope.\n", "# A usage sketch follows the B-trees cell below.\n", "pysqldf = lambda q: pandasql.sqldf(q, globals())\n", "\n", "# !!AI maybe use examples from Intro to Data Science course,\n", "# it's identical to this." ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Database data structures\n", "\n", "- Databases use B-trees, which make search and insert efficient; NumPy arrays can do neither efficiently.\n", "    - Take a NumPy array of a million entries and append another million: there is no choice but to allocate a new array of two million and copy (see the sketch below).\n", "- Databases know how to do disk-based I/O, and random I/O, very well. NumPy and pandas don't." ] }
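,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the append-cost claim above, assuming nothing beyond NumPy itself; the sizes are arbitrary." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "a = np.arange(1000000)\n", "b = np.arange(1000000)\n", "\n", "# np.concatenate cannot grow `a` in place: it allocates a brand-new\n", "# two-million-entry array and copies both inputs into it.\n", "c = np.concatenate([a, b])\n", "print(c.size)  # 2000000" ], "language": "python", "metadata": {}, "outputs": [] }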
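,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A usage sketch for the `pysqldf` shim defined earlier; the DataFrame and its columns are made up for illustration." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# Toy DataFrame; pandasql looks it up by name via globals().\n", "tags = pd.DataFrame({\n", "    \"tag\": [\"python\", \"sql\", \"python\", \"pandas\"],\n", "    \"score\": [10, 5, 7, 3],\n", "})\n", "\n", "# pandasql runs the query with SQLite against in-scope DataFrames.\n", "print(pysqldf(\"SELECT tag, SUM(score) AS total FROM tags GROUP BY tag\"))" ], "language": "python", "metadata": {}, "outputs": [] }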
,
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Using SQLAlchemy to deal with databases\n", "\n", "- The most interesting part isn't that it's an Object Relational Mapper (ORM).\n", "- It executes queries in layers, and the ORM layer is optional.\n", "    - You can deal directly with tables and data types; a Core-level sketch follows the \"Crossing the boundary\" cell below.\n", "\n", "- !!AI the speaker gives an SQLAlchemy tutorial via IPython Notebook." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Moving onto StackOverflow data\n", "\n", "- Start with `math.stackexchange.com` Posts.xml.\n", "- Source (I think): [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange) (500 MB 7-zip file!!)\n", "- Uses `etree.iterparse` because the XML file is massive; don't load it all into memory.\n", "- Commit every 1000 inserts: don't autocommit per row, and don't shove everything into one transaction either.\n", "- A loading sketch appears after the \"Crossing the boundary\" cell below." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Crossing the boundary" ] },
{ "cell_type": "code", "collapsed": true, "input": [ "# !!AI won't run, just the gist\n", "\n", "import pandas.io.sql\n", "import psycopg2\n", "\n", "# psycopg2.connect() needs real connection parameters (dbname, user, ...).\n", "connection = psycopg2.connect()  # !!AI TODO fill in\n", "\n", "math_by_date = pandas.io.sql.read_sql(\"\"\"\\\n", "    SELECT ...\n", "    FROM ...\n", "    WHERE ...\n", "    AND ...\n", "    AND ...\n", "    GROUP BY ...\n", "\"\"\", connection)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }
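,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of the \"layers\" point from the SQLAlchemy cell above, using only the Core layer: tables, types, and queries, with no ORM classes anywhere. The table, its columns, and the in-memory SQLite engine are made up for illustration." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import sqlalchemy as sa\n", "\n", "# In-memory SQLite purely for illustration; the talk targets PostgreSQL.\n", "engine = sa.create_engine(\"sqlite://\")\n", "metadata = sa.MetaData()\n", "\n", "# Core layer: tables and data types, no ORM classes.\n", "posts = sa.Table(\"posts\", metadata,\n", "    sa.Column(\"id\", sa.Integer, primary_key=True),\n", "    sa.Column(\"title\", sa.Text),\n", "    sa.Column(\"score\", sa.Integer),\n", ")\n", "metadata.create_all(engine)\n", "\n", "conn = engine.connect()\n", "conn.execute(posts.insert(), [\n", "    {\"title\": \"What is a sheaf?\", \"score\": 42},\n", "    {\"title\": \"Why does my integral diverge?\", \"score\": 7},\n", "])\n", "\n", "# The SQL expression language replaces raw SQL strings.\n", "query = sa.select([posts.c.title]).where(posts.c.score > 10)\n", "for row in conn.execute(query):\n", "    print(row.title)" ], "language": "python", "metadata": {}, "outputs": [] }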
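,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A hedged sketch of the `iterparse` loading pattern from \"Moving onto StackOverflow data\" above: stream `<row>` elements, insert, and commit in batches. The connection parameters and the `posts` column names are assumptions, not the talk's actual schema." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from lxml import etree\n", "import psycopg2\n", "\n", "conn = psycopg2.connect(dbname=\"stackoverflow\")  # hypothetical database\n", "cur = conn.cursor()\n", "\n", "BATCH = 1000\n", "for i, (event, elem) in enumerate(etree.iterparse(\"Posts.xml\", tag=\"row\")):\n", "    # Each post in the dump is a single <row> element with attributes.\n", "    cur.execute(\"INSERT INTO posts (id, creation_date) VALUES (%s, %s)\",\n", "                (elem.get(\"Id\"), elem.get(\"CreationDate\")))\n", "    # Free processed elements so memory stays flat on a huge file.\n", "    elem.clear()\n", "    while elem.getprevious() is not None:\n", "        del elem.getparent()[0]\n", "    if i % BATCH == BATCH - 1:\n", "        conn.commit()  # batched commits: not per-row, not one giant transaction\n", "conn.commit()" ], "language": "python", "metadata": {}, "outputs": [] }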
,
{ "cell_type": "code", "collapsed": false, "input": [ "# More work with pandas.io.sql" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Semi-structured data analysis\n", "\n", "- Using PostgreSQL HSTORE; a sketch follows below.\n", "- !!AI I guess Entity-Attribute-Value (EAV) is controversial in SQL circles?\n", "- Note that SQLAlchemy knows about HSTORE!\n", "- `pandas.io.sql.read_sql` handles HSTORE too, because it executes queries directly over a `psycopg2` connection.\n", "- How HSTOREs actually work: [http://stackoverflow.com/questions/13133774/how-does-hstore-internally-store-the-data](http://stackoverflow.com/questions/13133774/how-does-hstore-internally-store-the-data)\n" ] }
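,
{ "cell_type": "markdown", "metadata": {}, "source": [ "A hedged sketch of HSTORE through SQLAlchemy, assuming a local PostgreSQL database with the hstore extension already installed (`CREATE EXTENSION hstore;`); the connection string, table, and key names are made up." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import sqlalchemy as sa\n", "from sqlalchemy.dialects.postgresql import HSTORE\n", "\n", "# Hypothetical connection string.\n", "engine = sa.create_engine(\"postgresql:///stackoverflow\")\n", "metadata = sa.MetaData()\n", "\n", "# One HSTORE column stores arbitrary string key/value pairs per row.\n", "posts_meta = sa.Table(\"posts_meta\", metadata,\n", "    sa.Column(\"id\", sa.Integer, primary_key=True),\n", "    sa.Column(\"attrs\", HSTORE),\n", ")\n", "metadata.create_all(engine)\n", "\n", "conn = engine.connect()\n", "conn.execute(posts_meta.insert(), [\n", "    {\"id\": 1, \"attrs\": {\"tag\": \"calculus\", \"closed\": \"false\"}},\n", "])\n", "\n", "# HSTORE values round-trip as plain Python dicts.\n", "row = conn.execute(sa.select([posts_meta])).fetchone()\n", "print(row.attrs[\"tag\"])" ], "language": "python", "metadata": {}, "outputs": [] }
], "metadata": {} } ] }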