{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning -- Creating Features\n", "-----\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "In this tutorial, we'll discuss how to formulate a policy problem or a social science question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the `scikit-learn` package in Python. \n", "\n", "This tutorial is based on chapter \"Machine Learning\" of [Big Data and Social Science](https://coleridge-initiative.github.io/big-data-and-social-science/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import sqlite3\n", "from dateutil.parser import parse" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "DB = 'ncdoc.db'\n", "conn = sqlite3.connect(DB)\n", "cur = conn.cursor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Formulation\n", "---\n", " \n", "Our Machine Learning Problem\n", ">Of all prisoners released, we would like to predict who is likely to reenter jail within *5* years of the day we make our prediction. For instance, say it is Jan 1, 2009 and we want to identify which \n", ">prisoners are likely to re-enter jail between now and end of 2013. We can run our predictive model and identify who is most likely at risk. The is an example of a *binary classification* problem. \n", "\n", "Note the outcome window of 5 years is completely arbitrary. You could use a window of 5, 3, 1 years or 1 day. \n", "\n", "In order to predict recidivism, we will be using data from the `inmate` and `sentences` table to create labels (predictors, or independent variables, or $X$ variables) and features (dependent variables, or $Y$ variables). \n", "\n", "We need to munge our data into **labels** (1_Machine_Learning_Labels.ipynb) and **features** (2_Machine_Learning_Features.ipynb) before we can train and evaluate **machine learning models** (3_Machine_Learning_Models.ipynb).\n", "\n", "This notebook assumes that you have already worked through the `1_Machine_Learning_Labels` notebook. If that is not the case, the following function allows you to execute a Python script which includes all steps that are needed to successfully run the subsequent code in this notebook. Note that executing this script might take some time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run 'create_labels.py'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Generation\n", "---\n", "\n", "Our features for prediction recidivism (between 2009 and 2013) are the following:\n", "\n", "- `num_admits`: The number of times someone has been admitted to prison before 2009. The more times someone has been to prison the more times they are likely continue to be arrested. \n", "\n", "- `length_longest_sentence`: The length of the longest sentence of all admits before 2009. Long previous sentences might decrease the likelihood of future arrests.\n", "\n", "- `age_first_admit`: The age someone was first admitted to prison. The idea behind creating this feature is that people who are younger when they are first arrested are more likely to be arrested again. \n", "\n", "- `age`: The age at the end of our last exit time range, i.e. in 2008. People who are younger when they are released might be more likely to be arrested again. \n", "\n", "First, we create a new sentence table `sentences_prep` that includes the sentence begin and end dates in date format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists sentences_prep;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create table sentences_prep as\n", "select inmate_doc_number, \n", "cast(inmate_sentence_component as integer) as sentence_component,\n", "date([sentence_begin_date_(for_max)]) as sentence_begin_date,\n", "date(actual_sentence_end_date) as sentence_end_date\n", "from sentences;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create the feature `num_admits`, we count the number of rows (individual sentence periods) for each `inmate_doc_number` before 2009 and write this information into `feature_num_admits_2000_2008`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists feature_num_admits_2000_2008;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create table feature_num_admits_2000_2008 as\n", "select inmate_doc_number, count(*) num_admits\n", "from sentences_prep\n", "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_2009_2013)\n", "and sentence_begin_date < '2008-12-31' and sentence_component = 1\n", "group by inmate_doc_number;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For `length_longest_sentence`, we first compute the length of all sentences before 2009 and create the table `feature_length_sentence_2000_2008`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists feature_length_sentence_2000_2008;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create table feature_length_sentence_2000_2008 as\n", "select inmate_doc_number, sentence_component, cast(julianday(sentence_end_date) - julianday(sentence_begin_date) as integer) length_sentence\n", "from sentences_prep\n", "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_2009_2013)\n", "and sentence_begin_date < '2008-12-31' and sentence_component = 1\n", "and sentence_begin_date > '0001-01-01' and sentence_end_date > '0001-01-01' and sentence_end_date > sentence_begin_date;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On this basis, we find the longest sentence period (`max(length_sentence)`) for each `inmate_doc_number`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists feature_length_long_sentence_2000_2008;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create temp table feature_length_long_sentence_2000_2008 as\n", "select inmate_doc_number, max(length_sentence) length_longest_sentence\n", "from feature_length_sentence_2000_2008\n", "group by inmate_doc_number;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several steps needed to compute the age at first arrest. First, we find the first arrest (`min(sentence_begin_date)`) for each `inmate_doc_number` and create the table `docnbr_admityr`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists docnbr_admityr;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create temp table docnbr_admityr as\n", "select inmate_doc_number, min(sentence_begin_date) min_admityr\n", "from sentences_prep\n", "where sentence_begin_date > '0001-01-01'\n", "group by inmate_doc_number;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then join the `inmate` and `docnbr_admityr` tables and extract the years from `inmate_birth_date` (birth year) and `min_admityr` (year first admitted into prison)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists age_first_admit_birth_year;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create temp table age_first_admit_birth_year as\n", "select da.inmate_doc_number,\n", "cast(strftime(\"%Y\", da.min_admityr) as integer) min_admityr,\n", "cast(strftime(\"%Y\", p.inmate_birth_date) as integer) inmate_birth_date\n", "from docnbr_admityr da\n", "left join inmate p on da.inmate_doc_number = p.inmate_doc_number;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The combined table allows us to create `age_first_admit` by subtracting the birth year from the year first admitted into prison." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists feature_age_first_admit;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create table feature_age_first_admit as\n", "select inmate_doc_number, (min_admityr - inmate_birth_date) age_first_admit\n", "from age_first_admit_birth_year;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then filter the `feature_age_first_admit` table such that it only includes observations that are observed in the label table `recidivism_labels_2009_2013`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists feature_agefirstadmit;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create table feature_agefirstadmit as\n", "select inmate_doc_number, age_first_admit\n", "from feature_age_first_admit\n", "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_2009_2013);\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To compute the age in 2008, we simply subtract the `inmate_birth_date` from 2008 and store this information in `feature_age_2008`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists feature_age_2008;\" \n", "cur.execute(sql_string) \n", "\n", "sql_string =\"\"\"\n", "create table feature_age_2008 as\n", "select inmate_doc_number, (2008 - cast(strftime(\"%Y\", inmate_birth_date) as integer)) age\n", "from inmate\n", "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_2009_2013);\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we join all (final) feature tables by `inmate_doc_number` and create table `features_2000_2008`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"drop table if exists features_2000_2008;\"\n", "cur.execute(sql_string)\n", "\n", "sql_string =\"\"\"\n", "create table features_2000_2008 as\n", "select f1.inmate_doc_number, f1.num_admits, f2.length_longest_sentence, f3.age_first_admit, f4.age\n", "from feature_num_admits_2000_2008 f1\n", "left join feature_length_long_sentence_2000_2008 f2 on f1.inmate_doc_number = f2.inmate_doc_number\n", "left join feature_agefirstadmit f3 on f1.inmate_doc_number = f3.inmate_doc_number\n", "left join feature_age_2008 f4 on f1.inmate_doc_number = f4.inmate_doc_number;\n", "\"\"\"\n", "cur.execute(sql_string)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now load the feature table and compute descriptive statistics for the features we created." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sql_string = \"SELECT *\"\n", "sql_string += \"FROM features_2000_2008 \"\n", "sql_string += \";\"\n", "\n", "features_2000_2008 = pd.read_sql(sql_string, con = conn)\n", "features_2000_2008.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with the label table, we need to create a second feature table which we will use for model evaluation purposes. We again create a function, this time called `create_features`, to ease the process. This function allows to run all feature generation steps for a given feature end date, prediction start date and prediction end date with just one function call." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_features(features_end, prediction_start, prediction_end, conn):\n", " \"\"\"\n", " Generate a list of features and return the table as a dataframe.\n", " Note: There has to be a table of labels that correspond with the same time period. \n", " \n", " Parameters\n", " ----------\n", " features_end\n", " prediction_start\n", " prediction_end\n", " conn: obj\n", " \n", " Returns\n", " -------\n", " df_features: Dataframe\n", " \"\"\"\n", " end_x_year = parse(features_end, fuzzy=True).year\n", " start_y_year = parse(prediction_start, fuzzy=True).year\n", " end_y_year = parse(prediction_end, fuzzy=True).year \n", " \n", " sql_script=\"\"\"\n", "\n", "drop table if exists sentences_prep;\n", "create table sentences_prep as\n", "select inmate_doc_number, \n", "cast(inmate_sentence_component as integer) as sentence_component,\n", "date([sentence_begin_date_(for_max)]) as sentence_begin_date,\n", "date(actual_sentence_end_date) as sentence_end_date\n", "from sentences;\n", "\n", "drop table if exists feature_num_admits_2000_{end_x_year};\n", "create table feature_num_admits_2000_{end_x_year} as\n", "select inmate_doc_number, count(*) num_admits\n", "from sentences_prep\n", "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_{start_y_year}_{end_y_year})\n", "and sentence_begin_date < '{features_end}' and sentence_component = 1\n", "group by inmate_doc_number;\n", "\n", "drop table if exists feature_length_sentence_2000_{end_x_year};\n", "create table feature_length_sentence_2000_{end_x_year} as\n", "select inmate_doc_number, sentence_component, cast(julianday(sentence_end_date) - julianday(sentence_begin_date) as integer) length_sentence\n", "from sentences_prep\n", "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_{start_y_year}_{end_y_year})\n", "and sentence_begin_date < '{features_end}' and sentence_component = 1\n", "and sentence_begin_date > '0001-01-01' and sentence_end_date > '0001-01-01' and sentence_end_date > sentence_begin_date;\n", "\n", "drop table if exists feature_length_long_sentence_2000_{end_x_year};\n", "create temp table feature_length_long_sentence_2000_{end_x_year} as\n", "select inmate_doc_number, max(length_sentence) length_longest_sentence\n", "from feature_length_sentence_2000_{end_x_year}\n", "group by inmate_doc_number;\n", "\n", "drop table if exists docnbr_admityr;\n", "create temp table docnbr_admityr as\n", "select inmate_doc_number, min(sentence_begin_date) min_admityr\n", "from sentences_prep\n", "where sentence_begin_date > '0001-01-01'\n", "group by inmate_doc_number;\n", "\n", "drop table if exists age_first_admit_birth_year;\n", "create temp table age_first_admit_birth_year as\n", "select da.inmate_doc_number,\n", "cast(strftime(\"%Y\", da.min_admityr) as integer) min_admityr,\n", "cast(strftime(\"%Y\", p.inmate_birth_date) as integer) inmate_birth_date\n", "from docnbr_admityr da\n", "left join inmate p on da.inmate_doc_number = p.inmate_doc_number;\n", "\n", "drop table if exists feature_age_first_admit; \n", "create table feature_age_first_admit as\n", "select inmate_doc_number, (min_admityr - inmate_birth_date) age_first_admit\n", "from age_first_admit_birth_year;\n", "\n", "drop table if exists feature_agefirstadmit; \n", "create table feature_agefirstadmit as\n", "select inmate_doc_number, age_first_admit\n", "from feature_age_first_admit\n", "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_{start_y_year}_{end_y_year});\n", "\n", "drop table if exists feature_age_{end_x_year}; \n", "create table feature_age_{end_x_year} as\n", "select inmate_doc_number, ({end_x_year} - cast(strftime(\"%Y\", inmate_birth_date) as integer)) age\n", "from inmate\n", "where inmate_doc_number in (select inmate_doc_number from recidivism_labels_{start_y_year}_{end_y_year});\n", "\n", "drop table if exists features_2000_{end_x_year}; \n", "create table features_2000_{end_x_year} as\n", "select f1.inmate_doc_number, f1.num_admits, f2.length_longest_sentence, f3.age_first_admit, f4.age\n", "from feature_num_admits_2000_{end_x_year} f1\n", "left join feature_length_long_sentence_2000_{end_x_year} f2 on f1.inmate_doc_number = f2.inmate_doc_number\n", "left join feature_agefirstadmit f3 on f1.inmate_doc_number = f3.inmate_doc_number\n", "left join feature_age_{end_x_year} f4 on f1.inmate_doc_number = f4.inmate_doc_number;\n", "\n", " \"\"\".format(features_end = features_end,\n", " end_x_year = end_x_year,\n", " start_y_year = start_y_year,\n", " end_y_year = end_y_year)\n", " \n", " cur.executescript(sql_script)\n", " df_features = pd.read_sql('select * from features_2000_{end_x_year}'.format(end_x_year = end_x_year), conn) \n", " return df_features " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to create a feature table that matches the second label table (`recidivism_labels_2014_2018`), we create features that contain information up to the end of 2013." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features_2000_2013 = create_features('2013-12-31', '2014-01-01', '2018-12-31', conn)\n", "features_2000_2013.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cur.close()\n", "conn.close()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 4 }