{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "SQL" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Preface - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This captures all the `SQL` stuff that I wanted collate and have in one place." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from pandas.io import sql\n", "import matplotlib\n", "import matplotlib.pyplot as plt \n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Basic database and DB connection - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that `SQL` is Declarative! \n", "\n", "Just, specify what you want in the `select` statement.\n", "\n", "Recall the lexical ordering for writing queries:\n", "\n", "* SELECT\n", "* FROM\n", "* WHERE\n", "* GROUP BY\n", "* HAVING\n", "* UNION\n", "* ORDER BY\n", "\n", "But, the order of execution is different:\n", "\n", "* `FROM` - Generates the data set (with join).\n", "* `WHERE` - Reduces the generated data set.\n", "* `GROUPBY` - Aggregates the reduced data set.\n", "* `HAVING` - Reduces the aggregated data set.\n", "* `SELECT` - Transforms the reduced aggregated data set.\n", "* `ORDER` - BY sorts the transformed data set.\n", "* `LIMIT` - OFFSET frames the sorted data set.\n" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Connecting - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To connect normally, I use `psql`.\n", "\n", "This is simply: \n", "\n", "`$ psql test`.\n", "\n", "You then enter the database and can very easily commit changes:\n", "\n", "`test=# DROP TABLE posts;`" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "SQLAlchemy - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`sqlalchemy` is nice if we want to talk to the database from `python` directly.\n", " \n", "Create db connection using `sqlalchemy`.\n", "\n", "The return value of `create_engine()` is an instance of `Engine`, and it represents the core interface to the database" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sqlalchemy\n", "from sqlalchemy import create_engine\n", "engine = create_engine('postgresql://lmartin@localhost:5432/test')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "WHERE - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, I have a test `postgres` database in which one column is `timestamp`.\n", "\n", "I can select dates from this column easily (below).\n", "\n", "Note that there are some critical rules about working with [`timestamp` objects](http://www.postgresql.org/docs/9.0/static/datatype-datetime.html):\n", "\n", "* The time-zero is specified relative to UTC. \n", "* For example, `-07:00` is PST.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query! \n", "u='2013-12-22 17:00:00-07:00'\n", "infs=\" 'CMV','ASPERGILLUS','GPR' \"\n", "\n", "all_tests=sql.read_sql(\"select specimen as s,tested_infection as t \"\n", " \"from Clinical_Table \"\n", " \"WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection IN (%s) \"%(u,l,infs),engine)\n", "all_tests.head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
st
0 Serum CMV
1 Serum CMV
2 Serum CMV
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ " s t\n", "0 Serum CMV\n", "1 Serum CMV\n", "2 Serum CMV" ] } ], "prompt_number": 10 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "JOIN - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inner join simply entries common to both tables." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Pull data for patient \n", "i_join=sql.read_sql(\"select stat_table.patient_id as i \"\n", " \"from stat_table \"\n", " \"INNER JOIN sample_table \"\n", " \"ON stat_table.patient_id=sample_table.patient_id \",engine)\n", "i_join.head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
i
0 I10
1 I10
2 I10
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ " i\n", "0 I10\n", "1 I10\n", "2 I10" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Semi-join filters records based upon data in another table.\n", "\n", "Note that we are including a `sub-query` in the `WHERE` clause." ] }, { "cell_type": "code", "collapsed": false, "input": [ "s_join=sql.read_sql(\"select stat_table.patient_id as i \"\n", " \"from stat_table \"\n", " \"WHERE patient_id IN (SELECT sample_table.patient_id FROM sample_table) \",engine)\n", "\n", "s_join.head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
i
0 I10
1 I10
2 I10
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ " i\n", "0 I10\n", "1 I10\n", "2 I10" ] } ], "prompt_number": 9 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "GROUPBY - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns that we define in the `GROUP BY` are the keys that we can use:\n", "\n", "If you have a `GROUP BY` clause, you may only reference:\n", "\n", "* (1) columns from that clause \n", "* (2) aggregate functions\n", "\n", "Common aggregate functions that we can apply following `GROUP BY`:\n", "\n", "http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html\n", "\n", "For example, we compute the total number of records within each group!\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd=sql.read_sql(\"select specimen random_name, count(*) AS c \"\n", " \"from Clinical_Table \"\n", " \"GROUP BY random_name \",engine)\n", "cd.head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random_namec
0 Lumen 460
1 Skin 46
2 Nasopharyngeal 1762
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " random_name c\n", "0 Lumen 460\n", "1 Skin 46\n", "2 Nasopharyngeal 1762" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can the `HAVING` clause to filter grouped categories.\n", "\n", "`ORDER` let's us order the results by a specified column and `LIMIT` simply restricts the output count:\n", "\n", "* http://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/" ] }, { "cell_type": "code", "collapsed": false, "input": [ "out=sql.read_sql(\"select specimen random_name, count(*) AS c \"\n", " \"from Clinical_Table \"\n", " \"GROUP BY specimen \" \n", " \"HAVING sum(test_result) > 1 \"\n", " \"order by 2 desc \"\n", " \"limit 3 \",engine)\n", "cd" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random_namec
0 Serum 12502
1 BAL 9969
2 Urine 5823
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ " random_name c\n", "0 Serum 12502\n", "1 BAL 9969\n", "2 Urine 5823" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We count the unique number of tested infections per specimen group." ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd=sql.read_sql(\"select specimen s, count(distinct tested_infection) AS counts \"\n", " \"from Clinical_Table \"\n", " \"GROUP BY s \",engine)\n", "print cd.shape\n", "cd.head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(21, 2)\n" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
scounts
0 BAL 75
1 Bone Marrow 3
2 CSF 53
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 250, "text": [ " s counts\n", "0 BAL 75\n", "1 Bone Marrow 3\n", "2 CSF 53" ] } ], "prompt_number": 250 }, { "cell_type": "markdown", "metadata": {}, "source": [ "A nice things to recall is that we can partition `timestamp` objects into weeks.\n", "\n", "There are many useful function we can apply to datetime objecets:\n", "\n", "* http://www.postgresql.org/docs/8.1/static/functions-datetime.html" ] }, { "cell_type": "code", "collapsed": false, "input": [ "l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query! \n", "u='2013-12-22 17:00:00-07:00'\n", "infs=\" 'CMV','ASPERGILLUS','GPR' \"\n", "t=sql.read_sql(\" select EXTRACT(WEEK FROM tx_date) as week,tested_infection,specimen,COUNT(*)\"\n", " \" from Clinical_Table \"\n", " \" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) \"\n", " \" GROUP BY EXTRACT(WEEK FROM tx_date),tested_infection,specimen\"%(u,l,infs),engine)\n", "t.head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weektested_infectionspecimencount
0 43 CMV Tissue 1
1 51 ASPERGILLUS Stool 3
2 35 ASPERGILLUS Whole Blood 18
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 265, "text": [ " week tested_infection specimen count\n", "0 43 CMV Tissue 1\n", "1 51 ASPERGILLUS Stool 3\n", "2 35 ASPERGILLUS Whole Blood 18" ] } ], "prompt_number": 265 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Window functions - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A window function:\n", "\n", "* Performs a calculation across a set of table rows that are somehow related to the current row. \n", "* Does not cause rows to become grouped into a single output row.\n", "* The rows retain their separate identities.\n", "\n", "http://postgresguide.com/tips/window.html\n", "\n", "\n", "Ok, so let's assume we want to rank the specimens sampled per week per tested infection!\n", "\n", "First, recall a basic query below." ] }, { "cell_type": "code", "collapsed": false, "input": [ "l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query! \n", "u='2013-12-22 17:00:00-07:00'\n", "infs=\" 'CMV','ASPERGILLUS','GPR' \"\n", "t=sql.read_sql(\" select tx_date d,tested_infection t,specimen s\"\n", " \" from Clinical_Table \"\n", " \" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) \"%(u,l,infs),engine)\n", "t.head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dts
0 2013-08-30 00:00:00-07:00 CMV Serum
1 2013-08-30 00:00:00-07:00 CMV Serum
2 2013-08-30 00:00:00-07:00 CMV Serum
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " d t s\n", "0 2013-08-30 00:00:00-07:00 CMV Serum\n", "1 2013-08-30 00:00:00-07:00 CMV Serum\n", "2 2013-08-30 00:00:00-07:00 CMV Serum" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `window functions` to rank over a certain grouping, as needed.\n", "\n", "https://robots.thoughtbot.com/postgres-window-functions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query! \n", "u='2013-12-22 17:00:00-07:00'\n", "infs=\" 'CMV','ASPERGILLUS','GPR' \"\n", "t=sql.read_sql(\" select tested_infection as a,specimen as s,rank() OVER (PARTITION BY tested_infection ORDER BY specimen DESC)\"\n", " \" from Clinical_Table \"\n", " \" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) \"%(u,l,infs),engine)\n", "t.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
asrank
0 ASPERGILLUS Whole Blood 1
1 ASPERGILLUS Whole Blood 1
2 ASPERGILLUS Whole Blood 1
3 ASPERGILLUS Whole Blood 1
4 ASPERGILLUS Whole Blood 1
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ " a s rank\n", "0 ASPERGILLUS Whole Blood 1\n", "1 ASPERGILLUS Whole Blood 1\n", "2 ASPERGILLUS Whole Blood 1\n", "3 ASPERGILLUS Whole Blood 1\n", "4 ASPERGILLUS Whole Blood 1" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's computed the frequency of items in the group and ranked accordingly." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "ORMs - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ORM presents a high level and abstracted pattern of usage assocaited with `sqlalchemy`.\n", "\n", "When using the ORM, the configurational process starts by describing the database tables we\u2019ll be dealing with.\n", "\n", "http://docs.sqlalchemy.org/en/rel_0_9/orm/tutorial.html" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sqlalchemy.ext.declarative import declarative_base\n", "Base = declarative_base()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a \u201cbase\u201d, we can define any number of mapped classes in terms of it:\n", "\n", "* at a minimum needs a __tablename__ attribute\n", "* at least one Column which is part of a primary key\n", "\n", "Here's a new table `users`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sqlalchemy import Column, Integer, String\n", "\n", "class User(Base):\n", " __tablename__ = 'users'\n", " id = Column(Integer, primary_key=True)\n", " name = Column(String)\n", " fullname = Column(String)\n", " password = Column(String)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "With our User class constructed via the Declarative system, we have defined information about our table, known as table metadata. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "User.__table__" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "Table('users', MetaData(bind=None), Column('id', Integer(), table=, primary_key=True, nullable=False), Column('name', String(), table=), Column('fullname', String(), table=), Column('password', String(), table=), schema=None)" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ " As our database does not actually have a users table present, we can use MetaData to issue CREATE TABLE.\n", " \n", " Now the table exists." ] }, { "cell_type": "code", "collapsed": false, "input": [ "Base.metadata.create_all(engine) " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We\u2019re now ready to start talking to the database." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sqlalchemy.orm import sessionmaker\n", "Session = sessionmaker(bind=engine)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whenever you need to have a conversation with the database, you instantiate a Session." ] }, { "cell_type": "code", "collapsed": false, "input": [ "session = Session()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whenever you need to have a conversation with the database, you instantiate a Session." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ed_user = User(name='ed', fullname='Ed Jones', password='edspassword')\n", "session.add(ed_user)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "session.add_all([\n", " User(name='wendy', fullname='Wendy Williams', password='foobar'),\n", " User(name='mary', fullname='Mary Contrary', password='xxg527'),\n", " User(name='fred', fullname='Fred Flinstone', password='blah')])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We tell the Session that we\u2019d like to issue all remaining changes to the database and commit the transaction." ] }, { "cell_type": "code", "collapsed": false, "input": [ "session.commit()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's make sure we can read it back out." ] }, { "cell_type": "code", "collapsed": false, "input": [ "t=sql.read_sql(\" select *\"\n", " \" from Users \",engine)\n", "t.head(3)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnamefullnamepassword
0 1 ed Ed Jones edspassword
1 2 wendy Wendy Williams foobar
2 3 mary Mary Contrary xxg527
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 28, "text": [ " id name fullname password\n", "0 1 ed Ed Jones edspassword\n", "1 2 wendy Wendy Williams foobar\n", "2 3 mary Mary Contrary xxg527" ] } ], "prompt_number": 28 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "More complex window functions - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define a new table." ] }, { "cell_type": "code", "collapsed": false, "input": [ "Base = declarative_base()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 92 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we can re-define existing `class` defined:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "Post.__table_args__ = {'extend_existing': True}\n", "Comments.__table_args__ = {'extend_existing': True}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 90 }, { "cell_type": "code", "collapsed": false, "input": [ "from sqlalchemy.dialects.postgresql import CHAR, TIMESTAMP\n", "class Post(Base):\n", " __tablename__ = 'posts'\n", " id = Column(Integer, primary_key=True)\n", " body = Column(String) \n", " created_at = Column(TIMESTAMP)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 93 }, { "cell_type": "code", "collapsed": false, "input": [ "class Comments(Base):\n", " __tablename__ = 'comments'\n", " id = Column(Integer, primary_key=True)\n", " post_id = Column(Integer) \n", " body = Column(String) \n", " created_at = Column(TIMESTAMP)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 94 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the tables." ] }, { "cell_type": "code", "collapsed": false, "input": [ "Base.metadata.create_all(engine) " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 95 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add data to the new tables." ] }, { "cell_type": "code", "collapsed": false, "input": [ "session = Session()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 96 }, { "cell_type": "code", "collapsed": false, "input": [ "session.add_all([\n", " Post(id=1, body='foo'),\n", " Post(id=2, body='bar'),])\n", "\n", "session.add_all([\n", " Comments(id=1, post_id=1, body='foo old'),\n", " Comments(id=2, post_id=1, body='foo new'),\n", " Comments(id=3, post_id=1, body='foo newer'),\n", " Comments(id=4, post_id=1, body='foo newest'),])\n", "\n", "session.add_all([\n", " Comments(id=5, post_id=2, body='bar old'),\n", " Comments(id=6, post_id=2, body='bar new'),\n", " Comments(id=7, post_id=2, body='bar newer'),\n", " Comments(id=8, post_id=2, body='bar newest'),])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 97 }, { "cell_type": "code", "collapsed": false, "input": [ "session.commit()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 98 }, { "cell_type": "code", "collapsed": false, "input": [ "p=sql.read_sql(\" select *\"\n", " \" from posts \",engine)\n", "p" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idbodycreated_at
0 1 foo None
1 2 bar None
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 101, "text": [ " id body created_at\n", "0 1 foo None\n", "1 2 bar None" ] } ], "prompt_number": 101 }, { "cell_type": "code", "collapsed": false, "input": [ "c=sql.read_sql(\" select *\"\n", " \" from Comments \",engine)\n", "c" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idpost_idbodycreated_at
0 1 1 foo old None
1 2 1 foo new None
2 3 1 foo newer None
3 4 1 foo newest None
4 5 2 bar old None
5 6 2 bar new None
6 7 2 bar newer None
7 8 2 bar newest None
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 102, "text": [ " id post_id body created_at\n", "0 1 1 foo old None\n", "1 2 1 foo new None\n", "2 3 1 foo newer None\n", "3 4 1 foo newest None\n", "4 5 2 bar old None\n", "5 6 2 bar new None\n", "6 7 2 bar newer None\n", "7 8 2 bar newest None" ] } ], "prompt_number": 102 }, { "cell_type": "code", "collapsed": false, "input": [ "c=sql.read_sql(\" select posts.id AS post_id, comments.id AS comment_ids, comments.body AS body \"\n", " \" from posts LEFT OUTER JOIN comments on posts.id=comments.post_id \",engine)\n", "c" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
post_idcomment_idsbody
0 1 1 foo old
1 1 2 foo new
2 1 3 foo newer
3 1 4 foo newest
4 2 5 bar old
5 2 6 bar new
6 2 7 bar newer
7 2 8 bar newest
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 106, "text": [ " post_id comment_ids body\n", "0 1 1 foo old\n", "1 1 2 foo new\n", "2 1 3 foo newer\n", "3 1 4 foo newest\n", "4 2 5 bar old\n", "5 2 6 bar new\n", "6 2 7 bar newer\n", "7 2 8 bar newest" ] } ], "prompt_number": 106 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Derived tables - " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say we want something more complex, like the fraction of tests for a given infection that are from a given specimen." ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd=sql.read_sql(\"select * from Clinical_Table \",engine2)\n", "a=cd.groupby(['tested_infection']).size()\n", "a.sort(ascending=False)\n", "b=cd[cd['tested_infection'] == 'CMV']\n", "ser=b[b['specimen'] == 'Serum']\n", "ser.shape[0]/float(b.shape[0])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(43251, 10)\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 234, "text": [ "0.835023664638269" ] } ], "prompt_number": 234 }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Group by` using `Pandas` is nice for this!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_tests=sql.read_sql(\"select * from Clinical_Table\",engine)\n", "a=all_tests[['tested_infection','specimen']]\n", "b=a.groupby('tested_infection').agg(lambda x:float(x[x.specimen=='Serum'].shape[0])/x.shape[0])\n", "b.sort('specimen',ascending=False,inplace=True)\n", "b.head(4)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
specimen
tested_infection
Hepatitis A AB 1.000000
HEPATITIS B 1.000000
EBV 0.881657
CMV 0.835024
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 235, "text": [ " specimen\n", "tested_infection \n", "Hepatitis A AB 1.000000\n", "HEPATITIS B 1.000000\n", "EBV 0.881657\n", "CMV 0.835024" ] } ], "prompt_number": 235 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But, we can also use Derived tables to get this same result!\n", "\n", "These are basically like variables.\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tmp=sql.read_sql(\" select tested_infection,COUNT(*)/CAST(total AS float) as pct \"\n", " \" from Clinical_Table c \"\n", " \" inner join \"\n", " \" ( \" \n", " \" SELECT tested_infection as inf, COUNT(*) as total \"\n", " \" from Clinical_Table \"\n", " \" GROUP BY tested_infection \" \n", " \" ) d \"\n", " \" on c.tested_infection = d.inf \"\n", " \" WHERE c.specimen = 'Serum' \"\n", " \" GROUP BY tested_infection,total \",engine)\n", "\n", "print tmp.shape\n", "tmp.sort('pct',inplace=True,ascending=False)\n", "tmp.head(4)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(57, 2)\n" ] }, { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tested_infectionpct
2 HEPATITIS B 1.000000
6 Hepatitis A AB 1.000000
17 EBV 0.881657
9 CMV 0.835024
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 236, "text": [ " tested_infection pct\n", "2 HEPATITIS B 1.000000\n", "6 Hepatitis A AB 1.000000\n", "17 EBV 0.881657\n", "9 CMV 0.835024" ] } ], "prompt_number": 236 } ], "metadata": {} } ] }