{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning -- Creating Labels\n", "-----\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "In this tutorial, we'll discuss how to formulate a policy problem or a social science question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the `scikit-learn` package in Python. \n", "\n", "This tutorial is based on the \"Machine Learning\" chapter of [Big Data and Social Science](https://coleridge-initiative.github.io/big-data-and-social-science/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import sqlite3\n", "from dateutil.parser import parse" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DB = 'ncdoc.db'\n", "conn = sqlite3.connect(DB)\n", "cur = conn.cursor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Formulation\n", "---\n", " \n", "**Our Machine Learning Problem**\n", ">Of all prisoners released, we would like to predict who is likely to re-enter jail within *5* years of the day we make our prediction. For instance, say it is Jan 1, 2009 and we want to identify which \n", ">prisoners are likely to re-enter jail between now and the end of 2013. We can run our predictive model and identify who is most at risk. This is an example of a *binary classification* problem. \n", "\n", "Note that the 5-year outcome window is completely arbitrary: you could just as well use a window of 3 years, 1 year, or even 1 day. \n", "\n", "In order to predict recidivism, we will be using data from the `inmate` and `sentences` tables to create labels (outcomes, or dependent variables, or $Y$ variables) and features (predictors, or independent variables, or $X$ variables). \n", "\n", "We need to munge our data into **labels** (1_Machine_Learning_Labels.ipynb) and **features** (2_Machine_Learning_Features.ipynb) before we can train and evaluate **machine learning models** (3_Machine_Learning_Models.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating Labels (Outcomes)\n", "---\n", "\n", "First, we create a new sentence table `sentences_prep` that includes the sentence begin and end dates in date format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists sentences_prep;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string = \"\"\"\n", "create table sentences_prep as\n", "select inmate_doc_number, \n", "cast(inmate_sentence_component as integer) as sentence_component,\n", "date([sentence_begin_date_(for_max)]) as sentence_begin_date,\n", "date(actual_sentence_end_date) as sentence_end_date\n", "from sentences;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then create a table `release_dates_2000_2008`, which is based on the `sentences_prep` table. We take all of the records for `inmate_doc_number` and `sentence_end_date` between 2000 and 2008."
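 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before filtering by release date, it can help to sanity-check the date conversion. The next cell is a minimal, optional check (it assumes the `sentences_prep` table created above): SQLite's `date()` returns NULL for values it cannot parse, so we count how many begin and end dates are missing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (assumes the sentences_prep table created above):\n", "# date() returns NULL for unparseable values, so count missing dates.\n", "pd.read_sql(\"\"\"\n", "select count(*) as n_rows,\n", "       sum(case when sentence_begin_date is null then 1 else 0 end) as missing_begin,\n", "       sum(case when sentence_end_date is null then 1 else 0 end) as missing_end\n", "from sentences_prep;\n", "\"\"\", con = conn)"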
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists release_dates_2000_2008;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string = \"\"\"\n", "create temp table release_dates_2000_2008 as\n", "select inmate_doc_number, sentence_end_date\n", "from sentences_prep\n", "where sentence_end_date >= '2000-01-01' and sentence_end_date <= '2008-12-31';\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a table `last_exit_2000_2008`, which takes the *maximum* (most recent) `sentence_end_date` for every `inmate_doc_number`. This table will only have one entry per `inmate_doc_number`, so for any given `inmate_doc_number`, or individual, we know their *most recent* release date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists last_exit_2000_2008;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string = \"\"\"\n", "create temp table last_exit_2000_2008 as\n", "select inmate_doc_number, max(sentence_end_date) sentence_end_date\n", "from release_dates_2000_2008\n", "group by inmate_doc_number;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then find everyone admitted into prison between 2009 and 2013 and create the table `admit_2009_2013`. We keep only the first sentence component (`sentence_component = 1`) so that each admission is counted once." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists admit_2009_2013;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string = \"\"\"\n", "create temp table admit_2009_2013 as\n", "select inmate_doc_number, sentence_component, sentence_begin_date\n", "from sentences_prep\n", "where sentence_begin_date >= '2009-01-01' and sentence_begin_date <= '2013-12-31' and sentence_component = 1;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we do a `left join` of the `last_exit_2000_2008` (left) table and the `admit_2009_2013` (right) table on the `inmate_doc_number` field. The resulting table keeps all the entries from the left table (most recent releases between 2000 and 2008) and adds their admissions between 2009 and 2013, if any. Now we can create a label: 0 indicates *no recidivism*, 1 indicates *recidivism*, i.e. the person returned to jail between 2009 and 2013." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists recidivism_2009_2013;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string = \"\"\"\n", "create temp table recidivism_2009_2013 as\n", "select r.inmate_doc_number, r.sentence_end_date, a.sentence_begin_date,\n", "case when a.sentence_begin_date is null then 0 else 1 end recidivism\n", "from last_exit_2000_2008 r\n", "left join admit_2009_2013 a on r.inmate_doc_number = a.inmate_doc_number;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then remove any potential duplicates and create the final label table `recidivism_labels_2009_2013`. 
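", "\n", "\n", "Before we do, it is worth taking a quick look at the class balance. The next cell is a small, optional check (it assumes the `recidivism_2009_2013` table created above) that counts distinct individuals per label value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional check (assumes the recidivism_2009_2013 table created above):\n", "# how many distinct individuals fall into each label class?\n", "pd.read_sql(\"\"\"\n", "select recidivism, count(distinct inmate_doc_number) as n_individuals\n", "from recidivism_2009_2013\n", "group by recidivism;\n", "\"\"\", con = conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we create the final, deduplicated label table.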
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists recidivism_labels_2009_2013;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create table recidivism_labels_2009_2013 as\n", "select distinct inmate_doc_number, recidivism\n", "from recidivism_2009_2013;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we load the label table into `label_2009_2013` and inspect the first observations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"SELECT *\"\n", "sql_string += \"FROM recidivism_labels_2009_2013 \"\n", "sql_string += \";\"\n", "\n", "label_2009_2013 = pd.read_sql(sql_string, con = conn)\n", "label_2009_2013.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Following the machine learning pipeline, we will need a second label table for creating a test set later on. To facilitate the label generation process, we define a function called `create_labels` that automates all steps that are needed to create the label table. In essence, it runs all previous steps for a given prediction start date and prediction end date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_labels(features_end, prediction_start, prediction_end, conn):\n", " \"\"\"\n", " Generate a list of labels and return the table as a dataframe.\n", " \n", " Parameters\n", " ----------\n", " features_end\n", " prediction_start\n", " prediction_end\n", " conn: obj\n", " \n", " Returns\n", " -------\n", " df_labels: DataFrame\n", " \"\"\"\n", " end_x_year = parse(features_end, fuzzy = True).year\n", " start_y_year = parse(prediction_start, fuzzy = True).year\n", " end_y_year = parse(prediction_end, fuzzy = True).year\n", " \n", " sql_script=\"\"\"\n", "\n", "drop table if exists sentences_prep;\n", "create table sentences_prep as\n", "select inmate_doc_number, \n", "cast(inmate_sentence_component as integer) as sentence_component,\n", "date([sentence_begin_date_(for_max)]) as sentence_begin_date,\n", "date(actual_sentence_end_date) as sentence_end_date\n", "from sentences;\n", "\n", "drop table if exists release_dates_2000_{end_x_year};\n", "create temp table release_dates_2000_{end_x_year} as\n", "select inmate_doc_number, sentence_end_date\n", "from sentences_prep\n", "where sentence_end_date >= '2000-01-01' and sentence_end_date <= '{features_end}';\n", "\n", "drop table if exists last_exit_2000_{end_x_year};\n", "create temp table last_exit_2000_{end_x_year} as\n", "select inmate_doc_number, max(sentence_end_date) sentence_end_date\n", "from release_dates_2000_{end_x_year}\n", "group by inmate_doc_number;\n", "\n", "drop table if exists admit_{start_y_year}_{end_y_year};\n", "create temp table admit_{start_y_year}_{end_y_year} as\n", "select inmate_doc_number, sentence_component, sentence_begin_date\n", "from sentences_prep\n", "where sentence_begin_date >= '{prediction_start}' and sentence_begin_date <= '{prediction_end}' and sentence_component = 1;\n", "\n", "drop table if exists recidivism_{start_y_year}_{end_y_year};\n", "create temp table recidivism_{start_y_year}_{end_y_year} as\n", "select r.inmate_doc_number, r.sentence_end_date, a.sentence_begin_date,\n", "case when a.sentence_begin_date is null then 0 else 1 end recidivism\n", "from last_exit_2000_{end_x_year} r\n", "left join admit_{start_y_year}_{end_y_year} a on 
r.inmate_doc_number = a.inmate_doc_number;\n", "\n", "drop table if exists recidivism_labels_{start_y_year}_{end_y_year};\n", "create table recidivism_labels_{start_y_year}_{end_y_year} as\n", "select distinct inmate_doc_number, recidivism\n", "from recidivism_{start_y_year}_{end_y_year};\n", "\n", " \"\"\".format(features_end = features_end,\n", " prediction_start = prediction_start,\n", " prediction_end = prediction_end,\n", " end_x_year = end_x_year,\n", " start_y_year = start_y_year,\n", " end_y_year = end_y_year)\n", " \n", " cur = conn.cursor()\n", " cur.executescript(sql_script)\n", " df_label = pd.read_sql('select * from recidivism_labels_{start_y_year}_{end_y_year}'.format(\n", " start_y_year = start_y_year,\n", " end_y_year = end_y_year), conn)\n", " return df_label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `create_labels` function takes a `features_end` date, a `prediction_start` date, and a `prediction_end` date as arguments and returns the corresponding label table as a DataFrame. Our second label table covers recidivism between 2014 and 2018, based on all releases between 2000 and 2013." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "label_2014_2018 = create_labels('2013-12-31', '2014-01-01', '2018-12-31', conn)\n", "label_2014_2018.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cur.close()\n", "conn.close()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }