{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "
\n", " \n", " Website\n", " \n", "
\n", "\n", "Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation for Machine Learning - Creating Labels\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python Setup\n", "- Back to [Table of Contents](#Table-of-Contents)\n", "\n", "Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials. Here we'll also be using [`scikit-learn`](http://scikit-learn.org) to fit modeling." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pylab inline\n", "import pandas as pd\n", "import psycopg2\n", "from sqlalchemy import create_engine\n", "import time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and set our database connection parameters\n", "db_name = \"appliedda\"\n", "hostname = \"10.10.2.10\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Labels\n", "\n", "Labels are the dependent variables, or *Y* variables, that we are trying to predict. In the machine learning framework, your labels are usually *binary*: true or false, often encoded as 1 or 0. \n", "\n", "It is important to clearly and explicitly define the rows (aka observations) of your analysis to ensure you properly combine input datasets and populate the columns (aka features).\n", "\n", "In this notebook, we define each row as an individual finishing a TANF spell. A spell could be participation in just one case or a series of multiple cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, let's use January 1, 2008, as our \"date of prediction\" to simulate predicting return to TANF **after 1 year of being off TANF**. With this definition, we can consider the workforce participation of individuals who exited TANF in Q4 of 2006 as a \"feature\" (more on that later) in our prediction." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set database connections - use psycopg2 to more easily execute queries without returning data (eg for series of CREATE queries)\n", "conn = psycopg2.connect(database=db_name, host=hostname)\n", "cursor = conn.cursor()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time = time.time()\n", "sql = '''\n", "CREATE TEMP TABLE cohort_20080101 AS\n", "SELECT recptno, start_date, end_date\n", "FROM il_dhs.ind_spells\n", "WHERE end_date >= (('2008-01-01'::date - '1 year'::interval)-'3 months'::interval) AND \n", " end_date < ('2008-01-01'::date - '1 year'::interval)\n", " AND benefit_type = 'tanf46';\n", "COMMIT;\n", "'''\n", "cursor.execute(sql)\n", "\n", "print('query complete in {:.2f} seconds'.format(time.time()-start_time))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql = '''\n", "SELECT *\n", "FROM cohort_20080101\n", "'''\n", "df = pd.read_sql(sql, conn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('there are {} TANF spells that end in our selected study period'.format(df.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outcome example: return to TANF in years 2 and 3 after 1 year of being off TANF\n", "\n", "\n", "For our prediction problem we will focus on the `ind_spells` table, which has the start and end dates of individual level spells on three different benefit programs; TANF, SNAP, and cash assistance.\n", "\n", "We defined our `cohort` above as those who exited the TANF program between 12 and 15 months prior to our prediction date. Now we will find those in our cohort who returned to TANF in the two years after our prediction date." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time = time.time()\n", "# only return the first spell in the event they returned more than\n", "# once in the following 2 years.\n", "sql = '''\n", "CREATE TEMP TABLE cohort_returned_20080101 AS\n", "SELECT DISTINCT ON (recptno) recptno, start_date, end_date\n", "FROM il_dhs.ind_spells\n", "WHERE start_date >= '2008-01-01'::date\n", " AND start_date < ('2008-01-01'::date + '2 year'::interval)\n", " AND benefit_type = 'tanf46'\n", " AND recptno IN (SELECT recptno FROM cohort_20080101)\n", "ORDER BY recptno, start_date;\n", "COMMIT;\n", "'''\n", "cursor.execute(sql)\n", "\n", "print('query complete in {:.2f} seconds'.format(time.time()-start_time))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Load the Labels into Python Pandas \n", "sql = '''\n", "SELECT *\n", "FROM cohort_returned_20080101\n", "'''\n", "df = pd.read_sql(sql, conn)\n", "print('of our study cohort, {} returned to TANF'.format(df.shape[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now create a `label` variable that is set to `0` if the individual _does not_ return to TANF in the following two years after our prediction date and `1` if the individual _does_ have another TANF spell beginning within our time horizon (2 years after the prediction date)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create label table\n", "sql = \"\"\"\n", "CREATE TEMP TABLE label_20080101 AS\n", "SELECT a.recptno, a.start_date, a.end_date, \n", " CASE WHEN b.recptno IS NULL THEN 0 ELSE 1 END as label\n", "FROM cohort_20080101 a\n", "LEFT JOIN cohort_returned_20080101 b\n", "ON a.recptno = b.recptno;\n", "commit;\n", "\"\"\"\n", "cursor.execute(sql)\n", "\n", "df = pd.read_sql(\"SELECT * FROM label_20080101\", conn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(index = df['label'], columns = 'count')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# or use .value_counts(normalize=True) to show ratio\n", "df['label'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Repeating the Label Creation Process for Prediction date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will need at least one (but preferably many) training and test sets for our machine learning analysis. We will put the above steps into a function with parameters for easier reuse." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conn = psycopg2.connect(database=db_name, host=hostname)\n", "cursor = conn.cursor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing a Function to Create Labels\n", "\n", "In the above, the SQL queries were all hard coded. In ths section, we demonstrate how to use functions with parameters for the choices we made to define our observations (rows) and label (outcome variable). The complete list of parameters is given in parentheses after the `def generate_labels` statement. Some parameters are given a default value (like `months_back=3`), others (like `preddate`) are not. 
\n", "\n", "**Paramaters of `generate_labels()` function**\n", "- `preddate`: date of prediction, note that this should be the first day of the quarter \n", "- `months_off`: months off of TANF before prediction date\n", "- `months_back`: months before \"date of prediction - months off TANF\" to define cohort\n", "- `months_ahead`: time horizon ahead of date of prediction to consider\n", "\n", "- `schema`: Your team schema, where the label table will be written. The default value is set to `myschema`, which you define in the cell above the function.\n", "- `overwrite`: Whether you want the function to overwrite tables that already exist. Before writing a table, the function will check whether this table exists, and by default will not overwrite existing tables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myschema = 'ada_tanf'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def generate_labels(preddate, months_off=12, months_back=3, months_ahead=24, schema=myschema, overwrite=False):\n", " \n", " #database connection\n", " conn = psycopg2.connect(database=db_name, host = hostname) \n", " cursor = conn.cursor()\n", " \n", " # set variables based on prediction date\n", " tbl_suffix = preddate.replace('-', '') #remove dashes\n", " \n", " # create full set of queries to create labels\n", " sql = \"\"\"\n", " -- create the our study cohort for this prediction date\n", " CREATE TEMP TABLE cohort_{tbl_suffix} AS\n", " SELECT recptno, start_date, end_date\n", " FROM il_dhs.ind_spells\n", " WHERE end_date >= (('{pred_date}'::date - '{months_off} months'::interval)-'{months_back} months'::interval) \n", " AND end_date < ('{pred_date}'::date - '{months_off} months'::interval)\n", " AND benefit_type = 'tanf46';\n", " COMMIT;\n", " \n", " -- find how many in our cohort returned to TANF\n", " CREATE TEMP TABLE cohort_returned_{tbl_suffix} AS\n", " SELECT DISTINCT ON (recptno) recptno, start_date, end_date\n", " FROM il_dhs.ind_spells\n", " WHERE start_date >= '{pred_date}'::date\n", " AND start_date < ('{pred_date}'::date + '{months_ahead} months'::interval)\n", " AND benefit_type = 'tanf46'\n", " AND recptno IN (SELECT recptno FROM cohort_{tbl_suffix})\n", " ORDER BY recptno, start_date;\n", " COMMIT;\n", " \n", " -- create the label table for this prediction date in our schema\n", " -- first DROP to handle the overwrite case\n", " DROP TABLE IF EXISTS {schema}.labels_{tbl_suffix};\n", " CREATE TABLE {schema}.labels_{tbl_suffix} AS\n", " SELECT a.recptno, a.start_date, a.end_date, \n", " CASE WHEN b.recptno IS NULL THEN 0 ELSE 1 END as label\n", " FROM cohort_{tbl_suffix} a\n", " LEFT JOIN cohort_returned_{tbl_suffix} b\n", " ON a.recptno = b.recptno;\n", " commit;\n", " \n", " -- change owner of table to schema group\n", " ALTER TABLE {schema}.labels_{tbl_suffix} OWNER TO {schema}_admin;\n", " \"\"\".format(tbl_suffix=tbl_suffix, pred_date=preddate, months_off=months_off,\n", " months_back=months_back, months_ahead=months_ahead,\n", " schema=schema)\n", " \n", " \n", " # Let's check if the table already exists:\n", " # This query will return an empty table (with no rows) if the table does not exist\n", " cursor.execute('''\n", " SELECT * FROM pg_tables \n", " WHERE tablename = 'labels_{tbl_suffix}' \n", " AND schemaname = '{schema}';\n", " '''.format(tbl_suffix=tbl_suffix, schema=schema))\n", " \n", " # Let's write table if it does not exist (or if overwrite = True)\n", " if not(cursor.rowcount) or 
overwrite:\n", " print(\"Creating table\")\n", " cursor.execute(sql)\n", " else:\n", " print(\"Table already exists\")\n", "\n", " cursor.close()\n", " \n", " # Load table into pandas dataframe\n", " sql = '''\n", " SELECT * FROM {schema}.labels_{tbl_suffix}\n", " '''.format(tbl_suffix=tbl_suffix, schema=schema)\n", " df = pd.read_sql(sql, conn) \n", " \n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test the function with a couple different paramaters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time = time.time()\n", "\n", "# Insert your team schema here:\n", "myschema = 'ada_tanf'\n", "\n", "# Set parameters:\n", "preddate = '2008-01-01' # \"date of prediction\"\n", "months_off=12 # months off of TANF before prediction date\n", "months_back=3 # months before \"date of prediction - months off TANF\" to define cohort\n", "months_ahead=24 # time horizon ahead of date of prediction to consider\n", "\n", "# create labels and return DataFrame\n", "# note: when functions have defaults only need to set parameters that change\n", "df_test1 = generate_labels(preddate)\n", "print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))\n", "pd.crosstab(index = df_test1['label'], columns = 'count')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time = time.time()\n", "# Set parameters:\n", "preddate = '2009-01-01' # \"date of prediction\"\n", "months_off=12 # months off of TANF before prediction date\n", "months_back=3 # months before \"date of prediction - months off TANF\" to define cohort\n", "months_ahead=24 # time horizon ahead of date of prediction to consider\n", "\n", "# create labels and return DataFrame\n", "# note: when functions have defaults only need to set parameters that change\n", "df_test2 = generate_labels(preddate)\n", "print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))\n", "pd.crosstab(index = df_test2['label'], columns = 'count')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_time = time.time()\n", "# Set parameters:\n", "preddate = '2010-01-01' # \"date of prediction\"\n", "months_off=12 # months off of TANF before prediction date\n", "months_back=3 # months before \"date of prediction - months off TANF\" to define cohort\n", "months_ahead=24 # time horizon ahead of date of prediction to consider\n", "\n", "# create labels and return DataFrame\n", "df_test3 = generate_labels(preddate)\n", "print('Labels generated in {:.2f} seconds'.format(time.time()-start_time))\n", "pd.crosstab(index = df_test3['label'], columns = 'count')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# here's an easy way to compare proportions\n", "df_test1['label'].value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_test2['label'].value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_test3['label'].value_counts(normalize=True)" ] } ], "metadata": { "kernelspec": { "display_name": "py3-ada", "language": "python", "name": "py3-ada" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, 
"toc": { "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }