{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img style=\"float: center;\" src=\"./images/CI_horizontal.png\" width=\"600\">\n",
    "<center>\n",
    "    <span style=\"font-size: 1.5em;\">\n",
    "        <a href='https://www.coleridgeinitiative.org'>Website</a>\n",
    "    </span>\n",
    "</center>\n",
    "\n",
    "Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Preparation for Machine Learning - Feature Creation\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python Setup\n",
    "- Back to [Table of Contents](#Table-of-Contents)\n",
    "\n",
    "Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials. Here we'll also be using [`scikit-learn`](http://scikit-learn.org) to fit modeling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pylab inline\n",
    "import pandas as pd\n",
    "import psycopg2\n",
    "from sqlalchemy import create_engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "db_name = \"appliedda\"\n",
    "hostname = \"10.10.2.10\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Insert team schema name below:\n",
    "myschema = 'ada_tanf'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Features\n",
    "\n",
    "Our features are our independent variables or predictors. Good features make machine learning systems effective. \n",
    "The better the features the easier it is the capture the structure of the data. You generate features using domain knowledge. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand rather then extensively searching for the \"right\" model and \"right\" set of parameters. \n",
    "\n",
    "Machine Learning Algorithms learn a solution to a problem from sample data. The set of features is the best representation of the sample data to learn a solution to a problem. \n",
    "\n",
    "- **Feature engineering** is the process of transforming raw data into features that better represent the underlying problem/data/structure  to the predictive models, resulting in improved model accuracy on unseen data.\" ( from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/) ).  In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.\n",
    "\n",
    "Example of feature engineering are: \n",
    "\n",
    "- **Transformations**, such a log, square, and square root.\n",
    "- **Dummy (binary) variables**, also known as *indicator variables*, often done by taking categorical variables\n",
    "(such as city) which do not have a numerical value, and adding them to models as a binary value.\n",
    "- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done \n",
    "by binning, which you can do by various approaches like equal width, deciles, Fisher-Jenks, etc. \n",
    "- **Aggregation.** Aggregate features often constitute the majority of features for a given problem. These use \n",
    "different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several\n",
    "values into one feature, aggregating over varying windows of time and space. For example, for policing or criminal justice problems, we may want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.\n",
    "\n",
    ">This notebook walks through creating the following features:\n",
    ">- `recp_age_end` (Continuous): age of the TANF recipient at \"end_date\" of the spell\n",
    ">- `recp_age_beg` (Continuous): age of the TANF recipient at \"start_date\" of the spell\n",
    ">- `job_during` (Binary): recipient has a job during the benefit spell\n",
    ">- `job_before` (Binary): recipient has a job before the benefit spell\n",
    ">- `num_cases` (Aggregation): The number of cases this spell represents \n",
    ">- `avg_case_dur` (Aggregation): Average case duration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Recipient age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step by Step Approach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conn = psycopg2.connect(database=db_name, host = hostname)\n",
    "cursor = conn.cursor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `member` table has `birth_date`, so it is quite easy to calculate the `recp_age_*` features for the `label_*` tables (as created in the [creating_labels](03_2_ML_data_preparation_creating_labels.ipynb) notebook) once we get the birth date. To do so, we first need to get the `ch_dpa_caseid` identifier from the `indcase_spells` table based on when our selected `ind_spells` ended"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sql = \"\"\"\n",
    "CREATE TEMP TABLE cohort_caseid AS\n",
    "SELECT a.*, b.ch_dpa_caseid \n",
    "FROM {schema}.labels_20080101 a\n",
    "JOIN il_dhs.indcase_spells b\n",
    "ON a.recptno = b.recptno \n",
    "    AND a.end_date = b.end_date\n",
    "WHERE b.benefit_type = 'tanf46';\n",
    "\"\"\".format(schema=myschema)\n",
    "\n",
    "cursor.execute(sql)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_sql('select * from cohort_caseid', conn)\n",
    "\n",
    "# check that our identifier ('recptno') is unique\n",
    "df.recptno.nunique(), df.shape[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SQL to calculate recipient age at the beginning and end of a spell\n",
    "sql = '''\n",
    "CREATE TEMP TABLE features_age AS\n",
    "SELECT a.recptno, \n",
    "    extract(epoch from age(a.start_date, b.birth_date))/(3600.*24*365) AS recp_age_beg,\n",
    "    extract(epoch from age(a.end_date, b.birth_date))/(3600.*24*365) AS recp_age_end\n",
    "FROM cohort_caseid a\n",
    "LEFT JOIN il_dhs.member b\n",
    "ON a.recptno = b.recptno AND a.ch_dpa_caseid = b.ch_dpa_caseid;\n",
    "'''\n",
    "cursor.execute(sql)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_sql('select * from features_age', conn)\n",
    "\n",
    "# check that our identifier ('recptno') is unique\n",
    "df.recptno.nunique(), df.shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then merge this list to our labels and view the distribution of ages at the beginning and and of TANF spells for our cohort"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sql = '''\n",
    "SELECT a.recptno, a.label, b.recp_age_beg, b.recp_age_end\n",
    "FROM {schema}.labels_20080101 a\n",
    "LEFT JOIN features_age AS b\n",
    "ON a.recptno = b.recptno;\n",
    "'''.format(schema=myschema)\n",
    "df = pd.read_sql(sql, conn)\n",
    "\n",
    "df.recptno.nunique(), df.shape[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('label')[['recp_age_beg', 'recp_age_end']].describe().T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[(df['recp_age_end']<1) | (df['recp_age_beg']<1)].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[(df['recp_age_end']<1) | (df['recp_age_beg']<1)].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to facilitate creating this feature for several years of data, we combined all the above steps into a Python function, and added a final step that writes the feature table to the database.\n",
    "\n",
    "In the step-by-step approach above, all SQL queries were entirely hard coded. As in the labels notebook, using a Python function with parameters so you can easily reuse for other values (eg prediction date). The function's parameters are:\n",
    "- `preddate`: The year at which we are doing the prediction.\n",
    "- `schema`: Your team schema, where the label table will be written. The default value is set to `myschema`, defined in the [Python Setup](#Python-Setup) section of this notebook.\n",
    "- `db_name`: Database name. This is the name of the SQL database we are using. The default value is set to `db_name`, defined in the [Python Setup](#Python-Setup) section of this notebook.\n",
    "- `hostname`: Host name. This is the host name for the SQL database we are using. The default value is set to `hostname`, defined in the [Python Setup](#Python-Setup) section of this notebook.\n",
    "- `overwrite`: Whether you want the function to overwrite tables that already exist. Before writing a table, the function will check whether this table exists, and by default will not overwrite existing tables.\n",
    "\n",
    "Note that we assume the corresponding `label_<date>` table has already been created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def spell_age_features(preddate, \n",
    "                       schema=myschema,\n",
    "                       db_name=db_name, \n",
    "                       hostname=hostname, \n",
    "                       overwrite=False):\n",
    "    # set the database connection\n",
    "    conn = psycopg2.connect(database=db_name, host = hostname) \n",
    "    cursor = conn.cursor()\n",
    "    \n",
    "    # set variables based on prediction date\n",
    "    tbl_suffix = preddate.replace('-', '') #remove dashes\n",
    "        \n",
    "    # Check if the table already exists:\n",
    "    cursor.execute('''\n",
    "    SELECT * FROM information_schema.tables \n",
    "    WHERE table_name = 'features_age_{tbl_suffix}' \n",
    "    AND table_schema = '{schema}';\n",
    "    '''.format(tbl_suffix=tbl_suffix, schema=schema))\n",
    "    \n",
    "    # Let's write table if it does not exist (or if overwrite = True)\n",
    "    if not(cursor.rowcount) or overwrite:\n",
    "        print(\"Creating table\")    \n",
    "        \n",
    "        sql = '''\n",
    "        -- get caseid's to find birth_date\n",
    "        CREATE TEMP TABLE cohort_caseid AS\n",
    "        SELECT a.*, b.ch_dpa_caseid \n",
    "        FROM {schema}.labels_{tbl_suffix} a\n",
    "        JOIN il_dhs.indcase_spells b\n",
    "        ON a.recptno = b.recptno AND a.end_date = b.end_date\n",
    "        WHERE b.benefit_type = 'tanf46';\n",
    "\n",
    "        DROP TABLE IF EXISTS {schema}.features_age_{tbl_suffix};\n",
    "        CREATE TABLE {schema}.features_age_{tbl_suffix} AS\n",
    "        SELECT a.recptno, \n",
    "            extract(epoch from age(a.start_date, b.birth_date))/(3600.*24*365) AS recp_age_beg,\n",
    "            extract(epoch from age(a.end_date, b.birth_date))/(3600.*24*365) AS recp_age_end\n",
    "        FROM cohort_caseid a\n",
    "        LEFT JOIN il_dhs.member b\n",
    "        ON a.recptno = b.recptno AND a.ch_dpa_caseid = b.ch_dpa_caseid;\n",
    "        COMMIT;\n",
    "\n",
    "        ALTER TABLE {schema}.features_age_{tbl_suffix} OWNER TO {schema}_admin;\n",
    "        COMMIT;\n",
    "        '''.format(tbl_suffix=tbl_suffix, schema=schema)  \n",
    "#         print(sql) # to debug\n",
    "        cursor.execute(sql)\n",
    "        \n",
    "    else:\n",
    "        print(\"Table already exists\")\n",
    "    \n",
    "    cursor.close()\n",
    "    \n",
    "    sql = '''\n",
    "    SELECT * FROM {schema}.features_age_{tbl_suffix};\n",
    "        '''.format(tbl_suffix=tbl_suffix, schema=schema) \n",
    "    df = pd.read_sql(sql, conn)  \n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "df_test1 = spell_age_features('2008-01-01')\n",
    "print('ages generated in {:.2f} seconds'.format(time.time()-start_time))\n",
    "df_test1[['recp_age_beg', 'recp_age_end']].describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "df_test2 = spell_age_features('2009-01-01')\n",
    "print('ages generated in {:.2f} seconds'.format(time.time()-start_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "df_test3 = spell_age_features('2010-01-01')\n",
    "print('ages generated in {:.2f} seconds'.format(time.time()-start_time))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Employment\n",
    "\n",
    "We can use the wage record data to define what it means to \"have a job\" before and during the benefit spell\n",
    "- job_before (Binary): recipient has a job before the benefit spell (up to X quarters prior)\n",
    "- job_during (Binary): recipient has a job during the benefit spell\n",
    "- job_after (Binary): recipient has a job after the benefit spell (up to the prediction date)\n",
    "\n",
    "For this exercise, we'll define that the individual has a job if they have at least 1 quarter in the time period (before/during/after the spell) with total earnings over a calculated minumum full-time employment wage (using IL's minimum hourly wage in the `il_minimum_wage_by_year` table and assuming 35 hours per week and 13 weeks per quarter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step by Step Approach"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will again start with our study cohort to subset the UI wage record data to just our population of interest.\n",
    "\n",
    "Due to how the data is constructed, specifically that the wage data does not have a date index, we will define a \"quarters_back()\" function to create the WHERE clause. \n",
    "\n",
    "> Note: refer back to the [Data Exploration](02_2_Dataset_Exploration.ipynb) for an example of using the `EXPLAIN` keyword in postgresql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def quarters_back(preddate, num_quarters=12):\n",
    "    # use Pandas datetime functions to easily access quarter info\n",
    "    predDate = pd.to_datetime(preddate)\n",
    "\n",
    "    # starting parameters\n",
    "    i = 0 # counter for loop\n",
    "    I = num_quarters # number of sequential quarters to select\n",
    "    Yr = predDate.year\n",
    "    Qt = predDate.quarter\n",
    "    \n",
    "    # start with previous quarter:\n",
    "    # if it's the last quarter\n",
    "    if Qt==1:\n",
    "        Yr -= 1 # decrement to previous year\n",
    "        Qt = 4 # reset to 4th quarter\n",
    "    else:\n",
    "        Qt -= 1 # decrement to previous quarter\n",
    "\n",
    "    # list to collect (year=Y AND quarter=Q) combinations\n",
    "    where_list = []\n",
    "\n",
    "    while i<I:\n",
    "        # set year and quarter selection for this quarter\n",
    "        where_clause = '(year = ' + str(Yr) + ' AND quarter = ' + str(Qt) +')'\n",
    "        # add this clause to the list\n",
    "        where_list.append(where_clause)\n",
    "        \n",
    "        # if it's the last quarter\n",
    "        if Qt==1:\n",
    "            Yr -= 1 # decrement to previous year\n",
    "            Qt = 4 # reset to 4th quarter\n",
    "        else:\n",
    "            Qt -= 1 # decrement to previous quarter\n",
    "        i += 1 # increment counter\n",
    "        \n",
    "    # join all of the (year = Y AND quarter = Q) separated by OR\n",
    "    return(' OR '.join(where_list))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "quarters_back('2008-01-01')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# first we need to find the SSNs of our cohort\n",
    "# similarly to how we found the caseid\n",
    "\n",
    "sql = \"\"\"\n",
    "CREATE TEMP TABLE cohort_ssn AS\n",
    "SELECT a.*, b.ssn_hash AS ssn\n",
    "FROM cohort_caseid a\n",
    "JOIN il_dhs.member b\n",
    "ON a.recptno = b.recptno AND a.ch_dpa_caseid = b.ch_dpa_caseid;\n",
    "\"\"\".format(schema=myschema)\n",
    "\n",
    "cursor.execute(sql)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will also use the `start_date` and `end_date` columns to define whether a given job occured _before_, _during_, or _after_ each spell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "# create temp table with job history for this cohort\n",
    "sql = '''\n",
    "CREATE TEMP TABLE job_history_cohort_20080101 AS\n",
    "SELECT *, CASE WHEN empl_quarter <= (start_date - '3 months'::interval) THEN 'before'\n",
    "        WHEN empl_quarter < end_date\n",
    "        AND empl_quarter > (start_date - '3 months'::interval) THEN 'during'\n",
    "        ELSE 'after' END job_spell_time\n",
    "FROM (\n",
    "    SELECT a.recptno, a.start_date, a.end_date, b.wage, \n",
    "        year, quarter, b.ein, b.seinunit, b.empr_no, \n",
    "        to_date(b.year::text||right('0'||((b.quarter-1)*3+1)::text,2)||'01', 'YYYYMMDD') \n",
    "        AS empl_quarter\n",
    "    FROM cohort_ssn a\n",
    "    JOIN il_des_kcmo.il_wage b\n",
    "    ON a.ssn = b.ssn\n",
    "    WHERE ({where_list})\n",
    ") q;\n",
    "\n",
    "COMMIT;\n",
    "'''.format(where_list=quarters_back('2008-01-01'))\n",
    "# print(sql)\n",
    "cursor.execute(sql)\n",
    "\n",
    "print('job history generated in {:.2f} seconds'.format(time.time()-start_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sql = 'SELECT * FROM job_history_cohort_20080101'\n",
    "df = pd.read_sql(sql, conn, parse_dates=['start_date', 'end_date', 'empl_quarter'])\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('job_spell_time')['wage'].describe(percentiles=[0.01,0.05,0.25,0.50,0.75,0.95,0.99])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "df.groupby(['empl_quarter', 'job_spell_time'])['wage']\\\n",
    ".describe(percentiles=[0.01,0.05,0.25,0.50,0.75,0.95,0.99])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# number of inidividuals present in this selection of wage record data\n",
    "df['recptno'].nunique() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many different ways we could summarize the job history data to include in our analysis; here we will create\n",
    "1. Total earnings before/during/after spell\n",
    "2. Avg earnings before/during/after spell\n",
    "3. Number of quarters before/during/after the spell the TANF recipient was \"fully employed\"\n",
    "\n",
    "To define \"fully employed\" for a given quarter, we'll calculate based on `minimum hourly wage in IL` \\* 35 (work hours per week) \\* 13 (weeks per quarter). We can use the `il_minimum_wage_by_year` table in the `public` schema for this calculation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "sql = '''\n",
    "-- summarize earnings for each individual by quarter\n",
    "-- note we can keep year for our next query, too\n",
    "CREATE TEMP TABLE job_earnings_summary_20080101 AS\n",
    "SELECT recptno, job_spell_time, empl_quarter, year, sum(wage) total_earnings\n",
    "FROM job_history_cohort_20080101\n",
    "GROUP BY recptno, job_spell_time, year, empl_quarter;\n",
    "\n",
    "COMMIT;\n",
    "\n",
    "-- create earnings features for each individual\n",
    "-- that we found in the wage record data\n",
    "CREATE TEMP TABLE employ_summary_20080101 AS\n",
    "SELECT recptno, \n",
    "    sum(CASE WHEN job_spell_time = 'before' THEN total_earnings ELSE 0 END) tot_earn_before,\n",
    "    sum(CASE WHEN job_spell_time = 'during' THEN total_earnings ELSE 0 END) tot_earn_during,\n",
    "    sum(CASE WHEN job_spell_time = 'after' THEN total_earnings ELSE 0 END) tot_earn_after,\n",
    "    avg(CASE WHEN job_spell_time = 'before' THEN total_earnings ELSE 0 END) avg_earn_before,\n",
    "    avg(CASE WHEN job_spell_time = 'during' THEN total_earnings ELSE 0 END) avg_earn_during,\n",
    "    avg(CASE WHEN job_spell_time = 'after' THEN total_earnings ELSE 0 END) avg_earn_after,\n",
    "    sum(CASE WHEN job_spell_time = 'before' \n",
    "            AND total_earnings > (b.minimum_wage*35*13) THEN 1\n",
    "        ELSE 0 END) qtr_full_empl_before, \n",
    "    sum(CASE WHEN job_spell_time = 'during' \n",
    "            AND total_earnings > (b.minimum_wage*35*13) THEN 1\n",
    "        ELSE 0 END) qtr_full_empl_during, \n",
    "    sum(CASE WHEN job_spell_time = 'after' \n",
    "            AND total_earnings > (b.minimum_wage*35*13) THEN 1\n",
    "        ELSE 0 END) qtr_full_empl_after\n",
    "FROM job_earnings_summary_20080101 a\n",
    "JOIN il_minimum_wage_by_year b\n",
    "ON a.year = b.year\n",
    "GROUP BY recptno;\n",
    "\n",
    "COMMIT;\n",
    "\n",
    "-- create the employment feature table\n",
    "CREATE TEMP TABLE features_employment_20080101 AS\n",
    "SELECT a.recptno, \n",
    "    CASE WHEN b.recptno IS NOT NULL THEN tot_earn_before ELSE 0 END AS tot_earn_before,\n",
    "    CASE WHEN b.recptno IS NOT NULL THEN tot_earn_during ELSE 0 END AS tot_earn_during,\n",
    "    CASE WHEN b.recptno IS NOT NULL THEN tot_earn_after ELSE 0 END AS tot_earn_after,\n",
    "    CASE WHEN b.recptno IS NOT NULL THEN avg_earn_before ELSE 0 END AS avg_earn_before,\n",
    "    CASE WHEN b.recptno IS NOT NULL THEN avg_earn_during ELSE 0 END AS avg_earn_during,\n",
    "    CASE WHEN b.recptno IS NOT NULL THEN avg_earn_after ELSE 0 END AS avg_earn_after,\n",
    "    CASE WHEN b.recptno IS NOT NULL THEN qtr_full_empl_before ELSE 0 \n",
    "        END AS qtr_full_empl_before,\n",
    "    CASE WHEN b.recptno IS NOT NULL THEN qtr_full_empl_during ELSE 0 \n",
    "        END AS qtr_full_empl_during,\n",
    "    CASE WHEN b.recptno IS NOT NULL THEN qtr_full_empl_after ELSE 0 \n",
    "        END AS qtr_full_empl_after\n",
    "FROM {schema}.labels_20080101 a\n",
    "LEFT JOIN employ_summary_20080101 b\n",
    "ON a.recptno = b.recptno;\n",
    "\n",
    "COMMIT;\n",
    "'''.format(schema=myschema)\n",
    "\n",
    "cursor.execute(sql)\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_sql('SELECT * FROM features_employment_20080101', conn)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### define function\n",
    "\n",
    "Now we'll create a function to do all the above employment steps for a given input (ie \"prediction\") date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def employment_features(preddate, \n",
    "                        qtrs_back = 12, \n",
    "                        quarter_wage_hours=35*13, \n",
    "                        schema=myschema,\n",
    "                        db_name=db_name, \n",
    "                        hostname=hostname, \n",
    "                        overwrite=False):\n",
    "    #database connection\n",
    "    conn = psycopg2.connect(database=db_name, host = hostname) \n",
    "    cursor = conn.cursor()\n",
    "    \n",
    "    # set table suffix based on prediction date\n",
    "    tbl_suffix = preddate.replace('-', '') #remove dashes\n",
    "        \n",
    "    # check if the table already exists:\n",
    "    cursor.execute('''\n",
    "    SELECT * FROM information_schema.tables \n",
    "    WHERE table_name = 'features_employment_{tbl_suffix}' \n",
    "    AND table_schema = '{schema}';\n",
    "    '''.format(tbl_suffix=tbl_suffix, schema=schema))\n",
    "    \n",
    "    # Let's write table if it does not exist (or if overwrite = True)\n",
    "    if not(cursor.rowcount) or overwrite:\n",
    "        print(\"Creating table\")\n",
    "        \n",
    "        # create \"where_list\" for quarters to pull from wage data\n",
    "        where_quarters = quarters_back(preddate, qtrs_back)\n",
    "        \n",
    "        sql = '''\n",
    "        -- handle overwrite case\n",
    "        DROP TABLE IF EXISTS {schema}.features_employment_{tbl_suffix};\n",
    "        \n",
    "        -- get caseid for cohort\n",
    "        CREATE TEMP TABLE cohort_caseid AS\n",
    "        SELECT a.*, b.ch_dpa_caseid \n",
    "        FROM {schema}.labels_{tbl_suffix} a\n",
    "        JOIN il_dhs.indcase_spells b\n",
    "        ON a.recptno = b.recptno \n",
    "            AND a.end_date = b.end_date\n",
    "        WHERE b.benefit_type = 'tanf46';\n",
    "        \n",
    "        -- get cohort SSNs\n",
    "        CREATE TEMP TABLE cohort_ssn AS\n",
    "        SELECT a.*, b.ssn_hash AS ssn\n",
    "        FROM cohort_caseid a\n",
    "        JOIN il_dhs.member b\n",
    "        ON a.recptno = b.recptno \n",
    "            AND a.ch_dpa_caseid = b.ch_dpa_caseid;\n",
    "        commit;\n",
    "        \n",
    "        -- get cohort's job history\n",
    "        CREATE TEMP TABLE job_history_cohort AS\n",
    "        SELECT *, \n",
    "            CASE WHEN empl_quarter <= (start_date - '3 months'::interval) \n",
    "                THEN 'before'\n",
    "                WHEN empl_quarter < end_date\n",
    "                    AND empl_quarter > (start_date - '3 months'::interval) \n",
    "                THEN 'during'\n",
    "                ELSE 'after' END job_spell_time\n",
    "        FROM (\n",
    "            SELECT a.recptno, a.start_date, a.end_date, b.wage, \n",
    "                year, quarter, b.ein, b.seinunit, b.empr_no, \n",
    "                to_date(b.year::text||right('0'||((b.quarter-1)*3+1)::text,2)||'01', 'YYYYMMDD') \n",
    "                AS empl_quarter\n",
    "            FROM cohort_ssn a\n",
    "            JOIN il_des_kcmo.il_wage b\n",
    "            ON a.ssn = b.ssn\n",
    "            WHERE ({where_list})\n",
    "        ) q;\n",
    "        commit;\n",
    "        \n",
    "        -- summarize earnings for each individual by quarter\n",
    "        -- note we can keep year for our next query, too\n",
    "        CREATE TEMP TABLE job_earnings_summary AS\n",
    "        SELECT recptno, job_spell_time, empl_quarter, year, sum(wage) total_earnings\n",
    "        FROM job_history_cohort\n",
    "        GROUP BY recptno, job_spell_time, year, empl_quarter;\n",
    "\n",
    "        COMMIT;\n",
    "\n",
    "        -- create earnings features for each individual\n",
    "        -- that we found in the wage record data\n",
    "        CREATE TEMP TABLE employ_summary AS\n",
    "        SELECT recptno, \n",
    "            sum(CASE WHEN job_spell_time = 'before' THEN total_earnings ELSE 0 END) tot_earn_before,\n",
    "            sum(CASE WHEN job_spell_time = 'during' THEN total_earnings ELSE 0 END) tot_earn_during,\n",
    "            sum(CASE WHEN job_spell_time = 'after' THEN total_earnings ELSE 0 END) tot_earn_after,\n",
    "            avg(CASE WHEN job_spell_time = 'before' THEN total_earnings ELSE 0 END) avg_earn_before,\n",
    "            avg(CASE WHEN job_spell_time = 'during' THEN total_earnings ELSE 0 END) avg_earn_during,\n",
    "            avg(CASE WHEN job_spell_time = 'after' THEN total_earnings ELSE 0 END) avg_earn_after,\n",
    "            sum(CASE WHEN job_spell_time = 'before' \n",
    "                    AND total_earnings > (b.minimum_wage*{qtr_hours}) THEN 1\n",
    "                ELSE 0 END) qtr_full_empl_before, \n",
    "            sum(CASE WHEN job_spell_time = 'during' \n",
    "                    AND total_earnings > (b.minimum_wage*{qtr_hours}) THEN 1\n",
    "                ELSE 0 END) qtr_full_empl_during, \n",
    "            sum(CASE WHEN job_spell_time = 'after' \n",
    "                    AND total_earnings > (b.minimum_wage*{qtr_hours}) THEN 1\n",
    "                ELSE 0 END) qtr_full_empl_after\n",
    "        FROM job_earnings_summary a\n",
    "        JOIN il_minimum_wage_by_year b\n",
    "        ON a.year = b.year\n",
    "        GROUP BY recptno;\n",
    "\n",
    "        COMMIT;\n",
    "\n",
    "        -- create the employment feature table\n",
    "        CREATE TABLE {schema}.features_employment_{tbl_suffix} AS\n",
    "        SELECT a.recptno, \n",
    "            CASE WHEN b.recptno IS NOT NULL THEN tot_earn_before ELSE 0 END AS tot_earn_before,\n",
    "            CASE WHEN b.recptno IS NOT NULL THEN tot_earn_during ELSE 0 END AS tot_earn_during,\n",
    "            CASE WHEN b.recptno IS NOT NULL THEN tot_earn_after ELSE 0 END AS tot_earn_after,\n",
    "            CASE WHEN b.recptno IS NOT NULL THEN avg_earn_before ELSE 0 END AS avg_earn_before,\n",
    "            CASE WHEN b.recptno IS NOT NULL THEN avg_earn_during ELSE 0 END AS avg_earn_during,\n",
    "            CASE WHEN b.recptno IS NOT NULL THEN avg_earn_after ELSE 0 END AS avg_earn_after,\n",
    "            CASE WHEN b.recptno IS NOT NULL THEN qtr_full_empl_before ELSE 0 \n",
    "                END AS qtr_full_empl_before,\n",
    "            CASE WHEN b.recptno IS NOT NULL THEN qtr_full_empl_during ELSE 0 \n",
    "                END AS qtr_full_empl_during,\n",
    "            CASE WHEN b.recptno IS NOT NULL THEN qtr_full_empl_after ELSE 0 \n",
    "                END AS qtr_full_empl_after\n",
    "        FROM {schema}.labels_{tbl_suffix} a\n",
    "        LEFT JOIN employ_summary b\n",
    "        ON a.recptno = b.recptno;\n",
    "\n",
    "        COMMIT;\n",
    "        \n",
    "        -- set owner to schema's admin group:\n",
    "        ALTER TABLE {schema}.features_employment_{tbl_suffix} OWNER TO {schema}_admin;\n",
    "        COMMIT;\n",
    "        '''.format(tbl_suffix=tbl_suffix, where_list=where_quarters,\n",
    "                   qtr_hours=quarter_wage_hours, schema=schema)  \n",
    "#         print(sql) # to debug\n",
    "        cursor.execute(sql)\n",
    "        \n",
    "    else:\n",
    "        print(\"Table already exists\")\n",
    "    \n",
    "    cursor.close()\n",
    "    \n",
    "    sql = '''\n",
    "    SELECT * FROM {schema}.features_employment_{tbl_suffix};\n",
    "        '''.format(tbl_suffix=tbl_suffix, schema=schema) \n",
    "    df = pd.read_sql(sql, conn)  \n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# time our function\n",
    "start_time = time.time()\n",
    "\n",
    "# calculate for 2008-01-01 prediction date with default values\n",
    "df = employment_features('2008-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# how many of our cohort had full employment for how many quarters before their TANF spell?\n",
    "df['qtr_full_empl_before'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# time our function\n",
    "start_time = time.time()\n",
    "\n",
    "# calculate for 2008-01-01 prediction date with default values\n",
    "df = employment_features('2009-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))\n",
    "df['qtr_full_empl_before'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# time our function\n",
    "start_time = time.time()\n",
    "\n",
    "# calculate for 2012-10-01 prediction date\n",
    "df = employment_features('2010-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# how many of our cohort had full employment for how many quarters during their TANF spell?\n",
    "df['qtr_full_empl_during'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Individual characteristics\n",
    "\n",
    "The IDHS data from Chapin Hall has a number of \"time-invariant\" tables which include information about the individuals on various social benefit programs (see the [IDHS data documentation](../documentation/IDHS-data_documentation.pdf) for details)\n",
    "\n",
    "Here we will collect and clean some information about the individuals in our cohort(s)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conn = psycopg2.connect(database=db_name, host = hostname)\n",
    "cursor = conn.cursor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from the documentation:\n",
    "# education levels\n",
    "member_info_edlevel = {'A':'None', 'B': 'Some Elementary',\n",
    "                       'C': 'Completed 7th grade', 'D':'Completed 8th grade', \n",
    "                       'E':'Completed 9th or 10th grade', 'F':'Completed 11th grade', \n",
    "                       'G':'Completed GED', 'H':'High School Diploma', \n",
    "                       'V':'Post Secondary vocational training', 'W':'One year of college',\n",
    "                       'X':'Two Years of College', 'Y':'Three years of college', \n",
    "                       'Z':\"College graduate (bachelor's degree)\", \n",
    "                       'P':'Post graduate college degree', '1':'None', \n",
    "                       '2':'Some Elementary Grades', '3':'All Elementary Grades', \n",
    "                       '4':'Some High School Grades', '5':'All High School Grades', \n",
    "                       '6':'Some College', '7':'All College'}\n",
    "# marital status\n",
    "member_info_martlst = {1: 'Never married', 2: 'Married - living with spouse', \n",
    "                       3: 'Married - spouse in nursing home, etc.', \n",
    "                       4: 'Married - spouse deserted', \n",
    "                       5: 'Married - legally separated', \n",
    "                       6: 'Married - other, including incarceration', \n",
    "                       7: 'Divorced', 8: 'Widowed'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step by Step Approach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get caseid\n",
    "sql = \"\"\"\n",
    "CREATE TEMP TABLE cohort_caseid AS\n",
    "SELECT a.*, b.ch_dpa_caseid \n",
    "FROM {schema}.labels_20080101 a\n",
    "JOIN il_dhs.indcase_spells b\n",
    "ON a.recptno = b.recptno \n",
    "    AND a.end_date = b.end_date\n",
    "WHERE b.benefit_type = 'tanf46';\n",
    "\"\"\".format(schema=myschema)\n",
    "\n",
    "cursor.execute(sql)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get member_info characateristics\n",
    "sql = \"\"\"\n",
    "CREATE TEMP TABLE cohort_member_info AS\n",
    "SELECT b.*, edlevel, martlst\n",
    "FROM il_dhs.member_info a\n",
    "JOIN cohort_caseid b\n",
    "ON a.recptno = b.recptno \n",
    "    AND a.ch_dpa_caseid = b.ch_dpa_caseid\n",
    "\"\"\"\n",
    "cursor.execute(sql)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_sql(\"SELECT * FROM cohort_member_info\", conn)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['edlevel'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# but the codes aren't that great, let's use the descriptions instead\n",
    "df['edlevel'].value_counts(normalize=True).rename(index=member_info_edlevel)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['martlst'].value_counts(normalize=True).rename(index=member_info_martlst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def member_info_features(preddate, \n",
    "                        schema=myschema,\n",
    "                        db_name=db_name, \n",
    "                        hostname=hostname, \n",
    "                        overwrite=False):\n",
    "    #database connection\n",
    "    conn = psycopg2.connect(database=db_name, host = hostname) \n",
    "    cursor = conn.cursor()\n",
    "    \n",
    "    # set table suffix based on prediction date\n",
    "    tbl_suffix = preddate.replace('-', '') #remove dashes\n",
    "        \n",
    "    # check if the table already exists:\n",
    "    cursor.execute('''\n",
    "    SELECT * FROM information_schema.tables \n",
    "    WHERE table_name = 'features_member_info_{tbl_suffix}' \n",
    "    AND table_schema = '{schema}';\n",
    "    '''.format(tbl_suffix=tbl_suffix, schema=schema))\n",
    "    \n",
    "    # Let's write table if it does not exist (or if overwrite = True)\n",
    "    if not(cursor.rowcount) or overwrite:\n",
    "        print(\"Creating table\")\n",
    "        \n",
    "        sql = '''\n",
    "        -- handle overwrite case\n",
    "        DROP TABLE IF EXISTS {schema}.features_member_info_{tbl_suffix};\n",
    "        \n",
    "        -- get caseid for cohort\n",
    "        CREATE TEMP TABLE cohort_caseid AS\n",
    "        SELECT a.*, b.ch_dpa_caseid \n",
    "        FROM {schema}.labels_{tbl_suffix} a\n",
    "        JOIN il_dhs.indcase_spells b\n",
    "        ON a.recptno = b.recptno \n",
    "            AND a.end_date = b.end_date\n",
    "        WHERE b.benefit_type = 'tanf46';\n",
    "        commit;\n",
    "        \n",
    "        -- get member_info variables of interest\n",
    "        CREATE TEMP TABLE cohort_member_info AS\n",
    "        SELECT b.*, edlevel, martlst\n",
    "        FROM il_dhs.member_info a\n",
    "        JOIN cohort_caseid b\n",
    "        ON a.recptno = b.recptno \n",
    "            AND a.ch_dpa_caseid = b.ch_dpa_caseid;\n",
    "        commit;\n",
    "        \n",
    "        -- classify values and attach to our cohort\n",
    "        CREATE TABLE {schema}.features_member_info_{tbl_suffix} AS\n",
    "        SELECT a.recptno,\n",
    "            CASE WHEN b.edlevel IN ('1', 'A') THEN 'ed_none'\n",
    "                WHEN b.edlevel IN ('2', '3', 'B', 'C', 'D') THEN 'ed_no_hs'\n",
    "                WHEN b.edlevel IN ('4', 'E', 'F') THEN 'ed_some_hs'\n",
    "                WHEN b.edlevel IN ('5', 'G', 'H') THEN 'ed_hs_ged'\n",
    "                WHEN b.edlevel IN ('6', 'V', 'W', 'X', 'Y') THEN 'ed_some_coll'\n",
    "                WHEN b.edlevel IN ('7', 'P', 'Z') THEN 'ed_comp_coll'\n",
    "                ELSE 'ed_unknown' \n",
    "                END AS ed_level, \n",
    "            CASE WHEN b.martlst = 1 THEN 'martlst_never'\n",
    "                WHEN b.martlst IN (2, 3) THEN 'martlst_married'\n",
    "                WHEN b.martlst IN (3, 4, 5, 6, 7, 8) THEN 'martlst_separated'\n",
    "                ELSE 'martlst_unknown' \n",
    "                END AS martl_status\n",
    "        FROM {schema}.labels_{tbl_suffix} a\n",
    "        LEFT JOIN cohort_member_info b\n",
    "        ON a.recptno = b.recptno;\n",
    "        commit;\n",
    "        \n",
    "        -- set owner to schema's admin group:\n",
    "        ALTER TABLE {schema}.features_member_info_{tbl_suffix} OWNER TO {schema}_admin;\n",
    "        COMMIT;\n",
    "        '''.format(tbl_suffix=tbl_suffix, schema=schema)  \n",
    "#         print(sql) # to debug\n",
    "        cursor.execute(sql)\n",
    "        \n",
    "    else:\n",
    "        print(\"Table already exists\")\n",
    "    \n",
    "    cursor.close()\n",
    "    \n",
    "    sql = '''\n",
    "    SELECT * FROM {schema}.features_member_info_{tbl_suffix};\n",
    "        '''.format(tbl_suffix=tbl_suffix, schema=schema) \n",
    "    df = pd.read_sql(sql, conn)  \n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "df = member_info_features('2008-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))\n",
    "\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[['ed_level', 'martl_status']].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "df = member_info_features('2009-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))\n",
    "df['ed_level'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "df = member_info_features('2010-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))\n",
    "\n",
    "df['martl_status'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Case information\n",
    "\n",
    "Social benefit programs are often administered on a case by case basis, and there is associated information with each case. The TANF data we are using in this program has gone through some processing (by Chapin Hall at UChicago), and as you have seen we are creating our observations (rows) from the `ind_spells` table. These individual spells are often groupings of multiple cases. \n",
    "\n",
    "In this section we will create features based on the underlying cases which make up each spell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "conn = psycopg2.connect(database=db_name, host = hostname)\n",
    "cursor = conn.cursor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step by step approach"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the number of cases this individual spell represents is fairly easy\n",
    "# using just the two `*_spells` tables:\n",
    "sql = \"\"\"\n",
    "CREATE TEMP TABLE cohort_case_summary AS\n",
    "SELECT a.recptno, count(b.*) case_count,\n",
    "avg(extract(epoch from age(b.end_date, b.start_date))/(3600.*24)) avg_len_days\n",
    "FROM {schema}.labels_20080101 a\n",
    "JOIN il_dhs.indcase_spells b\n",
    "ON a.recptno = b.recptno \n",
    "    AND a.end_date <= b.end_date\n",
    "    AND a.start_date >= b.start_date\n",
    "WHERE b.benefit_type = 'tanf46'\n",
    "GROUP BY a.recptno\n",
    "\"\"\".format(schema=myschema)\n",
    "\n",
    "cursor.execute(sql)\n",
    "\n",
    "df = pd.read_sql('SELECT * FROM cohort_case_summary', conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get caseid\n",
    "sql = \"\"\"\n",
    "CREATE TEMP TABLE cohort_caseid AS\n",
    "SELECT a.*, b.ch_dpa_caseid \n",
    "FROM {schema}.labels_20080101 a\n",
    "JOIN il_dhs.indcase_spells b\n",
    "ON a.recptno = b.recptno \n",
    "    AND a.end_date = b.end_date\n",
    "WHERE b.benefit_type = 'tanf46';\n",
    "\"\"\".format(schema=myschema)\n",
    "\n",
    "cursor.execute(sql)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for the most recent case, let's get\n",
    "# 1) the County or district code of the Public Assistance Office\n",
    "# 2) the homelessnes flag\n",
    "\n",
    "sql = \"\"\"\n",
    "CREATE TEMP TABLE cohort_case_assist AS\n",
    "SELECT a.recptno, b.district, b.homeless\n",
    "FROM cohort_caseid a\n",
    "JOIN il_dhs.assistance_case b\n",
    "ON a.ch_dpa_caseid = b.ch_dpa_caseid;\n",
    "\"\"\"\n",
    "cursor.execute(sql)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_sql('SELECT * FROM cohort_case_assist', conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df['district'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> From the documentation: \"district\" is defined as \"County or district code of recipient's Public Aid office. 10-115=Downstate County Codes, 200-294=Cook County district codes\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# so if we were to group our cohort by county, how many couties are represented:\n",
    "df[(10 <= df['district']) & (df['district'] <= 115)]['district'].nunique() + 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define Function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def case_info_features(preddate, \n",
    "                        schema=myschema,\n",
    "                        db_name=db_name, \n",
    "                        hostname=hostname, \n",
    "                        overwrite=False):\n",
    "    #database connection\n",
    "    conn = psycopg2.connect(database=db_name, host = hostname) \n",
    "    cursor = conn.cursor()\n",
    "    \n",
    "    # set table suffix based on prediction date\n",
    "    tbl_suffix = preddate.replace('-', '') #remove dashes\n",
    "        \n",
    "    # check if the table already exists:\n",
    "    cursor.execute('''\n",
    "    SELECT * FROM information_schema.tables \n",
    "    WHERE table_name = 'features_case_info_{tbl_suffix}' \n",
    "    AND table_schema = '{schema}';\n",
    "    '''.format(tbl_suffix=tbl_suffix, schema=schema))\n",
    "    \n",
    "    # Let's write table if it does not exist (or if overwrite = True)\n",
    "    if not(cursor.rowcount) or overwrite:\n",
    "        print(\"Creating table\")\n",
    "        \n",
    "        sql = '''\n",
    "        -- handle the overwrite case\n",
    "        DROP TABLE IF EXISTS {schema}.features_case_info_{tbl_suffix};\n",
    "        commit;\n",
    "        \n",
    "        -- produce the case summary temp table\n",
    "        CREATE TEMP TABLE cohort_case_summary AS\n",
    "        SELECT a.recptno, count(b.*) case_count,\n",
    "        avg(extract(epoch from age(b.end_date, b.start_date))/(3600.*24)) avg_len_days\n",
    "        FROM {schema}.labels_{tbl_suffix} a\n",
    "        JOIN il_dhs.indcase_spells b\n",
    "        ON a.recptno = b.recptno \n",
    "            AND a.end_date <= b.end_date\n",
    "            AND a.start_date >= b.start_date\n",
    "        WHERE b.benefit_type = 'tanf46'\n",
    "        GROUP BY a.recptno;\n",
    "        commit;\n",
    "        \n",
    "        -- get the cohort's caseid\n",
    "        CREATE TEMP TABLE cohort_caseid AS\n",
    "        SELECT a.*, b.ch_dpa_caseid \n",
    "        FROM {schema}.labels_{tbl_suffix} a\n",
    "        JOIN il_dhs.indcase_spells b\n",
    "        ON a.recptno = b.recptno \n",
    "            AND a.end_date = b.end_date\n",
    "        WHERE b.benefit_type = 'tanf46';\n",
    "        \n",
    "        -- get the district and homeless codes\n",
    "        CREATE TEMP TABLE cohort_case_assist AS\n",
    "        SELECT a.recptno, b.district, b.homeless\n",
    "        FROM cohort_caseid a\n",
    "        JOIN il_dhs.assistance_case b\n",
    "        ON a.ch_dpa_caseid = b.ch_dpa_caseid;\n",
    "        commit;\n",
    "        \n",
    "        -- create our cohort's feature table\n",
    "        CREATE TABLE {schema}.features_case_info_{tbl_suffix} AS\n",
    "        SELECT a.recptno, \n",
    "            CASE WHEN b.case_count IS NULL THEN 0 ELSE b.case_count END AS case_count,\n",
    "            CASE WHEN b.avg_len_days IS NULL THEN 0 ELSE b.avg_len_days END AS avg_len_days,\n",
    "            c.district, c.homeless\n",
    "        FROM {schema}.labels_{tbl_suffix} a\n",
    "        LEFT JOIN cohort_case_summary b\n",
    "        ON a.recptno = b.recptno\n",
    "        LEFT JOIN cohort_case_assist c\n",
    "        ON a.recptno = c.recptno;\n",
    "        commit;\n",
    "        \n",
    "        -- set owner to schema's admin group:\n",
    "        ALTER TABLE {schema}.features_case_info_{tbl_suffix} OWNER TO {schema}_admin;\n",
    "        COMMIT;\n",
    "        '''.format(tbl_suffix=tbl_suffix, schema=schema)  \n",
    "#         print(sql) # to debug\n",
    "        cursor.execute(sql)\n",
    "        \n",
    "    else:\n",
    "        print(\"Table already exists\")\n",
    "    \n",
    "    cursor.close()\n",
    "    \n",
    "    sql = '''\n",
    "    SELECT * FROM {schema}.features_case_info_{tbl_suffix};\n",
    "        '''.format(tbl_suffix=tbl_suffix, schema=schema) \n",
    "    df = pd.read_sql(sql, conn)  \n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "df = case_info_features('2008-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "df = case_info_features('2009-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "\n",
    "df = case_info_features('2010-01-01')\n",
    "\n",
    "print('features created in {:.2f} seconds'.format(time.time()-start_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Removing Outliers \n",
    "\n",
    "**It is never a good idea to drop observations without prior investigation AND a good reason to believe the data is wrong!** \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imputing Missing Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many ways of imputing missing values based on the rest of the data. Missing values can be imputed to median of the rest of the data, or you can use other characteristics (eg industry, geography, etc.).\n",
    "\n",
    "For our data, we have made an assumption about what \"missing\" means for each of our data's components (eg if the individual does not show up in the IDES data we say they do not have a job in that time period)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py3-ada",
   "language": "python",
   "name": "py3-ada"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {
    "height": "566px",
    "left": "0px",
    "right": "1255px",
    "top": "110px",
    "width": "212px"
   },
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}