{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 11: Part 2 - Food Inspection Forecasting: Data processing\n",
    "This file is an ipython notebook with [`R-magic`](https://ipython.org/ipython-doc/1/config/extensions/rmagic.html) to convert the data from Rds (the R programming language data dtorage sytem) to `csv` to be read into Python. If you ever find yourself in a bind with R code available for you... give `R-magic` a try. \n",
    "\n",
    "\n",
    "## **HUGE NOTE:  All code here is taken from the [food-inspections-evaluation]( https://github.com/Chicago/food-inspections-evaluation) repository** \n",
    "### They did a great job at cleaning the data in R so I don't want to repeat work.\n",
    "\n",
    "All code and data is available on GitHub:\n",
    "https://github.com/Chicago/food-inspections-evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The rpy2.ipython extension is already loaded. To reload it, use:\n",
      "  %reload_ext rpy2.ipython\n"
     ]
    }
   ],
   "source": [
    "import rpy2\n",
    "import pandas as pd\n",
    "%load_ext rpy2.ipython"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%R\n",
    "# change to your local clone\n",
    "data_dir = '~/food-inspections-evaluation/'\n",
    "out_dir = '~'\n",
    "\n",
    "library(\"data.table\", \"ggplot2\")\n",
    "\n",
    "setwd(data_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Food Inspection database processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%R\n",
    "food = readRDS(\"DATA/food_inspections.Rds\")\n",
    "write.csv(food, file = paste(out_dir, '/food_inspections.csv', sep = ''), row.names = FALSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Dataframe processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%R\n",
    "dat = readRDS(\"DATA/dat_model.Rds\")\n",
    "write.csv(dat, file = paste(out_dir, '/dat_model.csv', sep = ''))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Classes ‘data.table’ and 'data.frame':\t18712 obs. of  16 variables:\n",
       " $ Inspectorblue                              : num  0 1 1 1 1 0 0 0 0 0 ...\n",
       " $ Inspectorbrown                             : num  0 0 0 0 0 0 0 0 0 0 ...\n",
       " $ Inspectorgreen                             : num  1 0 0 0 0 0 0 0 0 0 ...\n",
       " $ Inspectororange                            : num  0 0 0 0 0 1 1 1 1 1 ...\n",
       " $ Inspectorpurple                            : num  0 0 0 0 0 0 0 0 0 0 ...\n",
       " $ Inspectoryellow                            : num  0 0 0 0 0 0 0 0 0 0 ...\n",
       " $ pastSerious                                : num  0 0 0 0 0 0 0 0 0 0 ...\n",
       " $ pastCritical                               : num  0 0 0 0 0 0 0 0 0 0 ...\n",
       " $ timeSinceLast                              : num  2 2 2 2 2 2 2 2 2 2 ...\n",
       " $ ageAtInspection                            : num  1 1 1 1 1 1 0 1 1 0 ...\n",
       " $ consumption_on_premises_incidental_activity: num  0 0 0 0 0 0 0 0 0 0 ...\n",
       " $ tobacco_retail_over_counter                : num  1 0 0 0 0 0 0 0 0 0 ...\n",
       " $ temperatureMax                             : num  53.5 59 59 56.2 52.7 ...\n",
       " $ heat_burglary                              : num  26.99 13.98 12.61 35.91 9.53 ...\n",
       " $ heat_sanitation                            : num  37.75 15.41 8.32 38.19 2.13 ...\n",
       " $ heat_garbage                               : num  12.8 12.9 8 26.2 3.4 ...\n",
       " - attr(*, \".internal.selfref\")=<externalptr> \n",
       "[1] 18712\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%R\n",
    "dat <- readRDS(\"DATA/dat_model.Rds\")\n",
    "\n",
    "## Only keep \"Retail Food Establishment\"\n",
    "dat <- dat[LICENSE_DESCRIPTION == \"Retail Food Establishment\"]\n",
    "## Remove License Description\n",
    "dat$LICENSE_DESCRIPTION <- NULL\n",
    "dat <- na.omit(dat)\n",
    "\n",
    "## Add criticalFound variable to dat:\n",
    "dat$criticalFound <- pmin(1, dat$criticalCount)\n",
    "\n",
    "# ## Set the key for dat\n",
    "setkey(dat, Inspection_ID)\n",
    "\n",
    "# Match time period of original results\n",
    "# dat <- dat[Inspection_Date < \"2013-09-01\" | Inspection_Date > \"2014-07-01\"]\n",
    "\n",
    "#==============================================================================\n",
    "# CREATE MODEL DATA\n",
    "#==============================================================================\n",
    "# sort(colnames(dat))\n",
    "\n",
    "xmat <- dat[ , list(Inspector = Inspector_Assigned,\n",
    "                    pastSerious = pmin(pastSerious, 1),\n",
    "                    pastCritical = pmin(pastCritical, 1),\n",
    "                    timeSinceLast,\n",
    "                    ageAtInspection = ifelse(ageAtInspection > 4, 1L, 0L),\n",
    "                    consumption_on_premises_incidental_activity,\n",
    "                    tobacco_retail_over_counter,\n",
    "                    temperatureMax,\n",
    "                    heat_burglary = pmin(heat_burglary, 70),\n",
    "                    heat_sanitation = pmin(heat_sanitation, 70),\n",
    "                    heat_garbage = pmin(heat_garbage, 50),\n",
    "                    # Facility_Type,\n",
    "                    criticalFound),\n",
    "            keyby = Inspection_ID]\n",
    "mm <- model.matrix(criticalFound ~ . -1, data=xmat[ , -1, with=F])\n",
    "mm <- as.data.table(mm)\n",
    "str(mm)\n",
    "colnames(mm)\n",
    "\n",
    "#==============================================================================\n",
    "# CREATE TEST / TRAIN PARTITIONS\n",
    "#==============================================================================\n",
    "# 2014-07-01 is an easy separator\n",
    "\n",
    "dat[Inspection_Date < \"2014-07-01\", range(Inspection_Date)]\n",
    "dat[Inspection_Date > \"2014-07-01\", range(Inspection_Date)]\n",
    "\n",
    "iiTrain <- dat[ , which(Inspection_Date < \"2014-07-01\")]\n",
    "iiTest <- dat[ , which(Inspection_Date > \"2014-07-01\")]\n",
    "\n",
    "## Check to see if any rows didn't make it through the model.matrix formula\n",
    "nrow(dat)\n",
    "nrow(xmat)\n",
    "nrow(mm)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%R\n",
    "# Output Model Matrix and Target\n",
    "write.csv(mm, file = paste(out_dir, '/model_matrix.csv', sep = ''), row.names = FALSE)\n",
    "write.csv(xmat$criticalFound, file = paste(out_dir, '/TARGET.csv', sep = ''), row.names = FALSE)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}