{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"SQL"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Preface - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This captures all the `SQL` stuff that I wanted collate and have in one place."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sys\n",
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"from pandas.io import sql\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt \n",
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Basic database and DB connection - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that `SQL` is Declarative! \n",
"\n",
"Just, specify what you want in the `select` statement.\n",
"\n",
"Recall the lexical ordering for writing queries:\n",
"\n",
"* SELECT\n",
"* FROM\n",
"* WHERE\n",
"* GROUP BY\n",
"* HAVING\n",
"* UNION\n",
"* ORDER BY\n",
"\n",
"But, the order of execution is different:\n",
"\n",
"* `FROM` - Generates the data set (with join).\n",
"* `WHERE` - Reduces the generated data set.\n",
"* `GROUPBY` - Aggregates the reduced data set.\n",
"* `HAVING` - Reduces the aggregated data set.\n",
"* `SELECT` - Transforms the reduced aggregated data set.\n",
"* `ORDER` - BY sorts the transformed data set.\n",
"* `LIMIT` - OFFSET frames the sorted data set.\n"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Connecting - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect normally, I use `psql`.\n",
"\n",
"This is simply: \n",
"\n",
"`$ psql test`.\n",
"\n",
"You then enter the database and can very easily commit changes:\n",
"\n",
"`test=# DROP TABLE posts;`"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"SQLAlchemy - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sqlalchemy` is nice if we want to talk to the database from `python` directly.\n",
" \n",
"Create db connection using `sqlalchemy`.\n",
"\n",
"The return value of `create_engine()` is an instance of `Engine`, and it represents the core interface to the database"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sqlalchemy\n",
"from sqlalchemy import create_engine\n",
"engine = create_engine('postgresql://lmartin@localhost:5432/test')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"WHERE - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, I have a test `postgres` database in which one column is `timestamp`.\n",
"\n",
"I can select dates from this column easily (below).\n",
"\n",
"Note that there are some critical rules about working with [`timestamp` objects](http://www.postgresql.org/docs/9.0/static/datatype-datetime.html):\n",
"\n",
"* The time-zero is specified relative to UTC. \n",
"* For example, `-07:00` is PST.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query! \n",
"u='2013-12-22 17:00:00-07:00'\n",
"infs=\" 'CMV','ASPERGILLUS','GPR' \"\n",
"\n",
"all_tests=sql.read_sql(\"select specimen as s,tested_infection as t \"\n",
" \"from Clinical_Table \"\n",
" \"WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection IN (%s) \"%(u,l,infs),engine)\n",
"all_tests.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" s | \n",
" t | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Serum | \n",
" CMV | \n",
"
\n",
" \n",
" 1 | \n",
" Serum | \n",
" CMV | \n",
"
\n",
" \n",
" 2 | \n",
" Serum | \n",
" CMV | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
" s t\n",
"0 Serum CMV\n",
"1 Serum CMV\n",
"2 Serum CMV"
]
}
],
"prompt_number": 10
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"JOIN - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inner join simply entries common to both tables."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Pull data for patient \n",
"i_join=sql.read_sql(\"select stat_table.patient_id as i \"\n",
" \"from stat_table \"\n",
" \"INNER JOIN sample_table \"\n",
" \"ON stat_table.patient_id=sample_table.patient_id \",engine)\n",
"i_join.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" i | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" I10 | \n",
"
\n",
" \n",
" 1 | \n",
" I10 | \n",
"
\n",
" \n",
" 2 | \n",
" I10 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
" i\n",
"0 I10\n",
"1 I10\n",
"2 I10"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Semi-join filters records based upon data in another table.\n",
"\n",
"Note that we are including a `sub-query` in the `WHERE` clause."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s_join=sql.read_sql(\"select stat_table.patient_id as i \"\n",
" \"from stat_table \"\n",
" \"WHERE patient_id IN (SELECT sample_table.patient_id FROM sample_table) \",engine)\n",
"\n",
"s_join.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" i | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" I10 | \n",
"
\n",
" \n",
" 1 | \n",
" I10 | \n",
"
\n",
" \n",
" 2 | \n",
" I10 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
" i\n",
"0 I10\n",
"1 I10\n",
"2 I10"
]
}
],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"GROUPBY - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The columns that we define in the `GROUP BY` are the keys that we can use:\n",
"\n",
"If you have a `GROUP BY` clause, you may only reference:\n",
"\n",
"* (1) columns from that clause \n",
"* (2) aggregate functions\n",
"\n",
"Common aggregate functions that we can apply following `GROUP BY`:\n",
"\n",
"http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html\n",
"\n",
"For example, we compute the total number of records within each group!\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd=sql.read_sql(\"select specimen random_name, count(*) AS c \"\n",
" \"from Clinical_Table \"\n",
" \"GROUP BY random_name \",engine)\n",
"cd.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" random_name | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Lumen | \n",
" 460 | \n",
"
\n",
" \n",
" 1 | \n",
" Skin | \n",
" 46 | \n",
"
\n",
" \n",
" 2 | \n",
" Nasopharyngeal | \n",
" 1762 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
" random_name c\n",
"0 Lumen 460\n",
"1 Skin 46\n",
"2 Nasopharyngeal 1762"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can the `HAVING` clause to filter grouped categories.\n",
"\n",
"`ORDER` let's us order the results by a specified column and `LIMIT` simply restricts the output count:\n",
"\n",
"* http://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"out=sql.read_sql(\"select specimen random_name, count(*) AS c \"\n",
" \"from Clinical_Table \"\n",
" \"GROUP BY specimen \" \n",
" \"HAVING sum(test_result) > 1 \"\n",
" \"order by 2 desc \"\n",
" \"limit 3 \",engine)\n",
"cd"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" random_name | \n",
" c | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Serum | \n",
" 12502 | \n",
"
\n",
" \n",
" 1 | \n",
" BAL | \n",
" 9969 | \n",
"
\n",
" \n",
" 2 | \n",
" Urine | \n",
" 5823 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
" random_name c\n",
"0 Serum 12502\n",
"1 BAL 9969\n",
"2 Urine 5823"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We count the unique number of tested infections per specimen group."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd=sql.read_sql(\"select specimen s, count(distinct tested_infection) AS counts \"\n",
" \"from Clinical_Table \"\n",
" \"GROUP BY s \",engine)\n",
"print cd.shape\n",
"cd.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(21, 2)\n"
]
},
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" s | \n",
" counts | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" BAL | \n",
" 75 | \n",
"
\n",
" \n",
" 1 | \n",
" Bone Marrow | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" CSF | \n",
" 53 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 250,
"text": [
" s counts\n",
"0 BAL 75\n",
"1 Bone Marrow 3\n",
"2 CSF 53"
]
}
],
"prompt_number": 250
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A nice things to recall is that we can partition `timestamp` objects into weeks.\n",
"\n",
"There are many useful function we can apply to datetime objecets:\n",
"\n",
"* http://www.postgresql.org/docs/8.1/static/functions-datetime.html"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query! \n",
"u='2013-12-22 17:00:00-07:00'\n",
"infs=\" 'CMV','ASPERGILLUS','GPR' \"\n",
"t=sql.read_sql(\" select EXTRACT(WEEK FROM tx_date) as week,tested_infection,specimen,COUNT(*)\"\n",
" \" from Clinical_Table \"\n",
" \" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) \"\n",
" \" GROUP BY EXTRACT(WEEK FROM tx_date),tested_infection,specimen\"%(u,l,infs),engine)\n",
"t.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" week | \n",
" tested_infection | \n",
" specimen | \n",
" count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 43 | \n",
" CMV | \n",
" Tissue | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 51 | \n",
" ASPERGILLUS | \n",
" Stool | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" 35 | \n",
" ASPERGILLUS | \n",
" Whole Blood | \n",
" 18 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 265,
"text": [
" week tested_infection specimen count\n",
"0 43 CMV Tissue 1\n",
"1 51 ASPERGILLUS Stool 3\n",
"2 35 ASPERGILLUS Whole Blood 18"
]
}
],
"prompt_number": 265
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Window functions - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A window function:\n",
"\n",
"* Performs a calculation across a set of table rows that are somehow related to the current row. \n",
"* Does not cause rows to become grouped into a single output row.\n",
"* The rows retain their separate identities.\n",
"\n",
"http://postgresguide.com/tips/window.html\n",
"\n",
"\n",
"Ok, so let's assume we want to rank the specimens sampled per week per tested infection!\n",
"\n",
"First, recall a basic query below."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query! \n",
"u='2013-12-22 17:00:00-07:00'\n",
"infs=\" 'CMV','ASPERGILLUS','GPR' \"\n",
"t=sql.read_sql(\" select tx_date d,tested_infection t,specimen s\"\n",
" \" from Clinical_Table \"\n",
" \" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) \"%(u,l,infs),engine)\n",
"t.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" d | \n",
" t | \n",
" s | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2013-08-30 00:00:00-07:00 | \n",
" CMV | \n",
" Serum | \n",
"
\n",
" \n",
" 1 | \n",
" 2013-08-30 00:00:00-07:00 | \n",
" CMV | \n",
" Serum | \n",
"
\n",
" \n",
" 2 | \n",
" 2013-08-30 00:00:00-07:00 | \n",
" CMV | \n",
" Serum | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
"text": [
" d t s\n",
"0 2013-08-30 00:00:00-07:00 CMV Serum\n",
"1 2013-08-30 00:00:00-07:00 CMV Serum\n",
"2 2013-08-30 00:00:00-07:00 CMV Serum"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use `window functions` to rank over a certain grouping, as needed.\n",
"\n",
"https://robots.thoughtbot.com/postgres-window-functions"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"l='2013-1-01 10:00:00-07:00' # Note that the timezone relative to UTC is specified in the query! \n",
"u='2013-12-22 17:00:00-07:00'\n",
"infs=\" 'CMV','ASPERGILLUS','GPR' \"\n",
"t=sql.read_sql(\" select tested_infection as a,specimen as s,rank() OVER (PARTITION BY tested_infection ORDER BY specimen DESC)\"\n",
" \" from Clinical_Table \"\n",
" \" WHERE tx_date < '%s' AND tx_date > '%s' AND tested_infection in (%s) \"%(u,l,infs),engine)\n",
"t.head(5)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" a | \n",
" s | \n",
" rank | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ASPERGILLUS | \n",
" Whole Blood | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" ASPERGILLUS | \n",
" Whole Blood | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" ASPERGILLUS | \n",
" Whole Blood | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" ASPERGILLUS | \n",
" Whole Blood | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" ASPERGILLUS | \n",
" Whole Blood | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
" a s rank\n",
"0 ASPERGILLUS Whole Blood 1\n",
"1 ASPERGILLUS Whole Blood 1\n",
"2 ASPERGILLUS Whole Blood 1\n",
"3 ASPERGILLUS Whole Blood 1\n",
"4 ASPERGILLUS Whole Blood 1"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's computed the frequency of items in the group and ranked accordingly."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"ORMs - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"ORM presents a high level and abstracted pattern of usage assocaited with `sqlalchemy`.\n",
"\n",
"When using the ORM, the configurational process starts by describing the database tables we\u2019ll be dealing with.\n",
"\n",
"http://docs.sqlalchemy.org/en/rel_0_9/orm/tutorial.html"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sqlalchemy.ext.declarative import declarative_base\n",
"Base = declarative_base()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have a \u201cbase\u201d, we can define any number of mapped classes in terms of it:\n",
"\n",
"* at a minimum needs a __tablename__ attribute\n",
"* at least one Column which is part of a primary key\n",
"\n",
"Here's a new table `users`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sqlalchemy import Column, Integer, String\n",
"\n",
"class User(Base):\n",
" __tablename__ = 'users'\n",
" id = Column(Integer, primary_key=True)\n",
" name = Column(String)\n",
" fullname = Column(String)\n",
" password = Column(String)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With our User class constructed via the Declarative system, we have defined information about our table, known as table metadata. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"User.__table__"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": [
"Table('users', MetaData(bind=None), Column('id', Integer(), table=, primary_key=True, nullable=False), Column('name', String(), table=), Column('fullname', String(), table=), Column('password', String(), table=), schema=None)"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" As our database does not actually have a users table present, we can use MetaData to issue CREATE TABLE.\n",
" \n",
" Now the table exists."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Base.metadata.create_all(engine) "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We\u2019re now ready to start talking to the database."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sqlalchemy.orm import sessionmaker\n",
"Session = sessionmaker(bind=engine)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whenever you need to have a conversation with the database, you instantiate a Session."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"session = Session()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whenever you need to have a conversation with the database, you instantiate a Session."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ed_user = User(name='ed', fullname='Ed Jones', password='edspassword')\n",
"session.add(ed_user)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 25
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"session.add_all([\n",
" User(name='wendy', fullname='Wendy Williams', password='foobar'),\n",
" User(name='mary', fullname='Mary Contrary', password='xxg527'),\n",
" User(name='fred', fullname='Fred Flinstone', password='blah')])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We tell the Session that we\u2019d like to issue all remaining changes to the database and commit the transaction."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"session.commit()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's make sure we can read it back out."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"t=sql.read_sql(\" select *\"\n",
" \" from Users \",engine)\n",
"t.head(3)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" name | \n",
" fullname | \n",
" password | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" ed | \n",
" Ed Jones | \n",
" edspassword | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" wendy | \n",
" Wendy Williams | \n",
" foobar | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" mary | \n",
" Mary Contrary | \n",
" xxg527 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
" id name fullname password\n",
"0 1 ed Ed Jones edspassword\n",
"1 2 wendy Wendy Williams foobar\n",
"2 3 mary Mary Contrary xxg527"
]
}
],
"prompt_number": 28
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"More complex window functions - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define a new table."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Base = declarative_base()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 92
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we can re-define existing `class` defined:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Post.__table_args__ = {'extend_existing': True}\n",
"Comments.__table_args__ = {'extend_existing': True}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 90
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sqlalchemy.dialects.postgresql import CHAR, TIMESTAMP\n",
"class Post(Base):\n",
" __tablename__ = 'posts'\n",
" id = Column(Integer, primary_key=True)\n",
" body = Column(String) \n",
" created_at = Column(TIMESTAMP)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 93
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class Comments(Base):\n",
" __tablename__ = 'comments'\n",
" id = Column(Integer, primary_key=True)\n",
" post_id = Column(Integer) \n",
" body = Column(String) \n",
" created_at = Column(TIMESTAMP)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 94
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the tables."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Base.metadata.create_all(engine) "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 95
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add data to the new tables."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"session = Session()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 96
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"session.add_all([\n",
" Post(id=1, body='foo'),\n",
" Post(id=2, body='bar'),])\n",
"\n",
"session.add_all([\n",
" Comments(id=1, post_id=1, body='foo old'),\n",
" Comments(id=2, post_id=1, body='foo new'),\n",
" Comments(id=3, post_id=1, body='foo newer'),\n",
" Comments(id=4, post_id=1, body='foo newest'),])\n",
"\n",
"session.add_all([\n",
" Comments(id=5, post_id=2, body='bar old'),\n",
" Comments(id=6, post_id=2, body='bar new'),\n",
" Comments(id=7, post_id=2, body='bar newer'),\n",
" Comments(id=8, post_id=2, body='bar newest'),])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 97
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"session.commit()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 98
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"p=sql.read_sql(\" select *\"\n",
" \" from posts \",engine)\n",
"p"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" body | \n",
" created_at | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" foo | \n",
" None | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" bar | \n",
" None | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 101,
"text": [
" id body created_at\n",
"0 1 foo None\n",
"1 2 bar None"
]
}
],
"prompt_number": 101
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"c=sql.read_sql(\" select *\"\n",
" \" from Comments \",engine)\n",
"c"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" post_id | \n",
" body | \n",
" created_at | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" foo old | \n",
" None | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1 | \n",
" foo new | \n",
" None | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1 | \n",
" foo newer | \n",
" None | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1 | \n",
" foo newest | \n",
" None | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 2 | \n",
" bar old | \n",
" None | \n",
"
\n",
" \n",
" 5 | \n",
" 6 | \n",
" 2 | \n",
" bar new | \n",
" None | \n",
"
\n",
" \n",
" 6 | \n",
" 7 | \n",
" 2 | \n",
" bar newer | \n",
" None | \n",
"
\n",
" \n",
" 7 | \n",
" 8 | \n",
" 2 | \n",
" bar newest | \n",
" None | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 102,
"text": [
" id post_id body created_at\n",
"0 1 1 foo old None\n",
"1 2 1 foo new None\n",
"2 3 1 foo newer None\n",
"3 4 1 foo newest None\n",
"4 5 2 bar old None\n",
"5 6 2 bar new None\n",
"6 7 2 bar newer None\n",
"7 8 2 bar newest None"
]
}
],
"prompt_number": 102
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"c=sql.read_sql(\" select posts.id AS post_id, comments.id AS comment_ids, comments.body AS body \"\n",
" \" from posts LEFT OUTER JOIN comments on posts.id=comments.post_id \",engine)\n",
"c"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" post_id | \n",
" comment_ids | \n",
" body | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" foo old | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" foo new | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" foo newer | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 4 | \n",
" foo newest | \n",
"
\n",
" \n",
" 4 | \n",
" 2 | \n",
" 5 | \n",
" bar old | \n",
"
\n",
" \n",
" 5 | \n",
" 2 | \n",
" 6 | \n",
" bar new | \n",
"
\n",
" \n",
" 6 | \n",
" 2 | \n",
" 7 | \n",
" bar newer | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" 8 | \n",
" bar newest | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 106,
"text": [
" post_id comment_ids body\n",
"0 1 1 foo old\n",
"1 1 2 foo new\n",
"2 1 3 foo newer\n",
"3 1 4 foo newest\n",
"4 2 5 bar old\n",
"5 2 6 bar new\n",
"6 2 7 bar newer\n",
"7 2 8 bar newest"
]
}
],
"prompt_number": 106
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Derived tables - "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say we want something more complex, like the fraction of tests for a given infection that are from a given specimen."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd=sql.read_sql(\"select * from Clinical_Table \",engine2)\n",
"a=cd.groupby(['tested_infection']).size()\n",
"a.sort(ascending=False)\n",
"b=cd[cd['tested_infection'] == 'CMV']\n",
"ser=b[b['specimen'] == 'Serum']\n",
"ser.shape[0]/float(b.shape[0])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(43251, 10)\n"
]
},
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 234,
"text": [
"0.835023664638269"
]
}
],
"prompt_number": 234
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Group by` using `Pandas` is nice for this!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"all_tests=sql.read_sql(\"select * from Clinical_Table\",engine)\n",
"a=all_tests[['tested_infection','specimen']]\n",
"b=a.groupby('tested_infection').agg(lambda x:float(x[x.specimen=='Serum'].shape[0])/x.shape[0])\n",
"b.sort('specimen',ascending=False,inplace=True)\n",
"b.head(4)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" specimen | \n",
"
\n",
" \n",
" tested_infection | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Hepatitis A AB | \n",
" 1.000000 | \n",
"
\n",
" \n",
" HEPATITIS B | \n",
" 1.000000 | \n",
"
\n",
" \n",
" EBV | \n",
" 0.881657 | \n",
"
\n",
" \n",
" CMV | \n",
" 0.835024 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 235,
"text": [
" specimen\n",
"tested_infection \n",
"Hepatitis A AB 1.000000\n",
"HEPATITIS B 1.000000\n",
"EBV 0.881657\n",
"CMV 0.835024"
]
}
],
"prompt_number": 235
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But, we can also use Derived tables to get this same result!\n",
"\n",
"These are basically like variables.\n",
"\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tmp=sql.read_sql(\" select tested_infection,COUNT(*)/CAST(total AS float) as pct \"\n",
" \" from Clinical_Table c \"\n",
" \" inner join \"\n",
" \" ( \" \n",
" \" SELECT tested_infection as inf, COUNT(*) as total \"\n",
" \" from Clinical_Table \"\n",
" \" GROUP BY tested_infection \" \n",
" \" ) d \"\n",
" \" on c.tested_infection = d.inf \"\n",
" \" WHERE c.specimen = 'Serum' \"\n",
" \" GROUP BY tested_infection,total \",engine)\n",
"\n",
"print tmp.shape\n",
"tmp.sort('pct',inplace=True,ascending=False)\n",
"tmp.head(4)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"(57, 2)\n"
]
},
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" tested_infection | \n",
" pct | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" HEPATITIS B | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 6 | \n",
" Hepatitis A AB | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 17 | \n",
" EBV | \n",
" 0.881657 | \n",
"
\n",
" \n",
" 9 | \n",
" CMV | \n",
" 0.835024 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 236,
"text": [
" tested_infection pct\n",
"2 HEPATITIS B 1.000000\n",
"6 Hepatitis A AB 1.000000\n",
"17 EBV 0.881657\n",
"9 CMV 0.835024"
]
}
],
"prompt_number": 236
}
],
"metadata": {}
}
]
}