{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Step-by-step TDD in a data science task\n", "\n", "If you are interested in a longer introduction [click here]()\n", "\n", "I took an example dataset from Kaggle, the [House Prices dataet](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) this is a sufficiently easy and fun data. Just right to pass the imaginary test of 'tutorial on TDD for analysis'.\n", "\n", "Since it was a csv file, I started by reading the data with Pandas. The first thing I wanted to check is if there are NaN/NULL values." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MSSubClassMSZoningLotFrontageLotAreaStreetAlleyLotShapeLandContourUtilitiesLotConfig...PoolAreaPoolQCFenceMiscFeatureMiscValMoSoldYrSoldSaleTypeSaleConditionSalePrice
Id
160RL65.08450PaveNaNRegLvlAllPubInside...0NaNNaNNaN022008WDNormal208500
220RL80.09600PaveNaNRegLvlAllPubFR2...0NaNNaNNaN052007WDNormal181500
360RL68.011250PaveNaNIR1LvlAllPubInside...0NaNNaNNaN092008WDNormal223500
470RL60.09550PaveNaNIR1LvlAllPubCorner...0NaNNaNNaN022006WDAbnorml140000
560RL84.014260PaveNaNIR1LvlAllPubFR2...0NaNNaNNaN0122008WDNormal250000
\n", "

5 rows × 80 columns

\n", "
" ], "text/plain": [ " MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "Id \n", "1 60 RL 65.0 8450 Pave NaN Reg \n", "2 20 RL 80.0 9600 Pave NaN Reg \n", "3 60 RL 68.0 11250 Pave NaN IR1 \n", "4 70 RL 60.0 9550 Pave NaN IR1 \n", "5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature \\\n", "Id ... \n", "1 Lvl AllPub Inside ... 0 NaN NaN NaN \n", "2 Lvl AllPub FR2 ... 0 NaN NaN NaN \n", "3 Lvl AllPub Inside ... 0 NaN NaN NaN \n", "4 Lvl AllPub Corner ... 0 NaN NaN NaN \n", "5 Lvl AllPub FR2 ... 0 NaN NaN NaN \n", "\n", " MiscVal MoSold YrSold SaleType SaleCondition SalePrice \n", "Id \n", "1 0 2 2008 WD Normal 208500 \n", "2 0 5 2007 WD Normal 181500 \n", "3 0 9 2008 WD Normal 223500 \n", "4 0 2 2006 WD Abnorml 140000 \n", "5 0 12 2008 WD Normal 250000 \n", "\n", "[5 rows x 80 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%pylab inline\n", "import pandas as pd\n", "\n", "train = pd.read_csv('train.csv', index_col=['Id'])\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SumOfNullsDataTypes
PoolQC1453object
MiscFeature1406object
Alley1369object
Fence1179object
FireplaceQu690object
LotFrontage259float64
GarageYrBlt81float64
GarageCond81object
GarageType81object
GarageFinish81object
GarageQual81object
BsmtExposure38object
BsmtFinType238object
BsmtCond37object
BsmtQual37object
BsmtFinType137object
MasVnrArea8float64
MasVnrType8object
Electrical1object
MSSubClass0int64
Fireplaces0int64
Functional0object
KitchenQual0object
KitchenAbvGr0int64
BedroomAbvGr0int64
HalfBath0int64
FullBath0int64
BsmtHalfBath0int64
TotRmsAbvGrd0int64
GarageCars0int64
.........
HouseStyle0object
BldgType0object
Condition20object
Condition10object
LandSlope0object
2ndFlrSF0int64
LotConfig0object
Utilities0object
LandContour0object
LotShape0object
Street0object
LotArea0int64
YearBuilt0int64
YearRemodAdd0int64
RoofStyle0object
RoofMatl0object
Exterior1st0object
Exterior2nd0object
ExterQual0object
ExterCond0object
Foundation0object
BsmtFinSF10int64
BsmtFinSF20int64
BsmtUnfSF0int64
TotalBsmtSF0int64
Heating0object
HeatingQC0object
MSZoning0object
1stFlrSF0int64
SalePrice0int64
\n", "

80 rows × 2 columns

\n", "
" ], "text/plain": [ " SumOfNulls DataTypes\n", "PoolQC 1453 object\n", "MiscFeature 1406 object\n", "Alley 1369 object\n", "Fence 1179 object\n", "FireplaceQu 690 object\n", "LotFrontage 259 float64\n", "GarageYrBlt 81 float64\n", "GarageCond 81 object\n", "GarageType 81 object\n", "GarageFinish 81 object\n", "GarageQual 81 object\n", "BsmtExposure 38 object\n", "BsmtFinType2 38 object\n", "BsmtCond 37 object\n", "BsmtQual 37 object\n", "BsmtFinType1 37 object\n", "MasVnrArea 8 float64\n", "MasVnrType 8 object\n", "Electrical 1 object\n", "MSSubClass 0 int64\n", "Fireplaces 0 int64\n", "Functional 0 object\n", "KitchenQual 0 object\n", "KitchenAbvGr 0 int64\n", "BedroomAbvGr 0 int64\n", "HalfBath 0 int64\n", "FullBath 0 int64\n", "BsmtHalfBath 0 int64\n", "TotRmsAbvGrd 0 int64\n", "GarageCars 0 int64\n", "... ... ...\n", "HouseStyle 0 object\n", "BldgType 0 object\n", "Condition2 0 object\n", "Condition1 0 object\n", "LandSlope 0 object\n", "2ndFlrSF 0 int64\n", "LotConfig 0 object\n", "Utilities 0 object\n", "LandContour 0 object\n", "LotShape 0 object\n", "Street 0 object\n", "LotArea 0 int64\n", "YearBuilt 0 int64\n", "YearRemodAdd 0 int64\n", "RoofStyle 0 object\n", "RoofMatl 0 object\n", "Exterior1st 0 object\n", "Exterior2nd 0 object\n", "ExterQual 0 object\n", "ExterCond 0 object\n", "Foundation 0 object\n", "BsmtFinSF1 0 int64\n", "BsmtFinSF2 0 int64\n", "BsmtUnfSF 0 int64\n", "TotalBsmtSF 0 int64\n", "Heating 0 object\n", "HeatingQC 0 object\n", "MSZoning 0 object\n", "1stFlrSF 0 int64\n", "SalePrice 0 int64\n", "\n", "[80 rows x 2 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum_nulls = train.isnull().sum()\n", "datatypes = train.dtypes\n", "\n", "summary = pd.DataFrame(dict(SumOfNulls = sum_nulls, DataTypes = datatypes))\n", "summary.sort_values(by='SumOfNulls', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yep, there are. Usually missing cells is inherent to data. If you don't find any in your data then it is probably time to start being suspicious. Nevertheless, it is good to start the test-driven part to come up with a test that checks if there are NaN values in the data before we go on analysing it. It's easy also, since missing data will not make sense for several model types it's an insanity test. Usually, I first write the test also quick and dirty in the notebook." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "ename": "AssertionError", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0misnull\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mtest_no_nan_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtest_no_nan_values\u001b[0;34m(data)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mtest_no_nan_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0misnull\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mtest_no_nan_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAssertionError\u001b[0m: " ] } ], "source": [ "def test_no_nan_values(data):\n", " assert not data.isnull().any(axis=None)\n", " \n", "test_no_nan_values(train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good, it failed. Also note that neither I have docstrings at this moment nor this is a proper unittest/pytest case. It's a test. And it fails. That was the purpose of the first stage, so let's move on to the next. \n", "\n", "Missing values are tricky. Pandas is schemaless which makes prototyping easy but does not help in this case. E.g. missing values are filled with np.nan in otherwise string fields. Nevertheless, this is an intentional feature of pandas which makes easier to use some generic DataFrame functions. In my case I don't really want this to happen though. Let's see what we can do with NaN values:\n", "\n", " 1. Add 'Unknown' as a separate string to replace NaNs so we can use these fields later - What should be the limit of NaN values where we apply this strategy, obviously you don't want to spend too much energy on columns which have 95% missing values in the first iteration. So 90? 75? This should be tested.. So probably this is not the easiest solution to pass the test.\n", "\n", " 2. Predict the missing values. - This would take lot's of models and maybe different strategies for different fields. Again not what I call easy solution.\n", "\n", " 3. Ommit columns with missing values. - Since I want to pass the test first with the least effort (and from the higher level perspective bring results the soonest possible) I will choose this. 
\n", "\n", "I looked at the test set and because there even more NaN values were present I decided to take the union of columns with missing values in the two dataset. Also Since I want to do the same operations on the two sets I decided to quickly create a common class for them" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(None, None, None)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class HousePriceData(object):\n", " def __init__(self, filename):\n", " self.X_cols = ['MSSubClass', 'LotArea', 'Street', 'LotShape', 'LandContour',\n", " 'LotConfig', 'LandSlope', 'Condition1', 'Condition2',\n", " 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt',\n", " 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond',\n", " 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', \n", " '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath',\n", " 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',\n", " 'Fireplaces', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',\n", " 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',\n", " 'MoSold', 'YrSold', 'SaleCondition']\n", " self.y_col = 'SalePrice'\n", " data = pd.read_csv(filename, index_col=['Id'])\n", " self.X = data[self.X_cols]\n", " if self.y_col in data.columns:\n", " self.y = np.log(data[self.y_col])\n", " else:\n", " self.y = None\n", " \n", "train = HousePriceData('train.csv')\n", "test = HousePriceData('test.csv')\n", "\n", "test_no_nan_values(train.X), test_no_nan_values(train.y), test_no_nan_values(test.X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good, test passed. Now it's time to refactor. To this aim I created two python files, one for my class and one for the insanity tests. I then transformed the small test above to fit a unit testcase and rechecked if after the transformation whether it is going to fail for the raw data and does not fail for my transformed data. I left with 44 features from the 80+ which I considered enough to go further.\n", "\n", "So far I tested only insanity. It's right time to start sanity testing as well because it helps tremendously staying focused and keeping in mind the bigger picture. For sanity testing, I prefer using BDD frameworks for a while (e.g. [pytest-bdd](https://github.com/pytest-dev/pytest-bdd) or [behave](https://behave.readthedocs.io/en/latest/) ). The good thing in BDD is that it connects the testing to a human readable description of the testing steps written in Gherkin. This is useful since the tests are often actual requirements from the client and since it is easily readable for non-programmers it helps collecting external inputs to the solution. And as the developer I really want to keep my development process in sync with the client expectations.\n", "\n", "The first sanity test that I'm writing is going to test if any solution that we produce is better then the arthmetic average of the house prices. 
The Gherkin description for this scenario looks like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```gherkin\n", "Feature: Sanity of the model\n", "\n", "  Scenario: Better than the simple average\n", "    Given the trained model\n", "    When I use the arithmetic average of the outcome as a reference\n", "    And the rmse of the prediction of the arithmetic average on the validation data is the reference RMSE\n", "    And the rmse of the prediction with the trained_model on the validation data is our RMSE\n", "    Then we see that our RMSE is better than the reference RMSE\n", "  \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course there is a proper .py file attached to this nice description. Also, this polished textual description is probably not your first iteration of the actual sanity test; the first attempt may look more like this:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "could not convert string to float: 'Pave'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mstupid_model\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mstupid_model\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# this model is so stupid that despite fitting it remains dumb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0mtest_better_than_average\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstupid_model\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtest_better_than_average\u001b[0;34m(model, X, y)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mtest_better_than_average\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0mreference_score\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrmse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrepeat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0my_pred\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrmse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0my_pred\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mreference_score\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/conda/lib/python3.6/site-packages/sklearn/dummy.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 466\u001b[0m \"\"\"\n\u001b[1;32m 467\u001b[0m \u001b[0mcheck_is_fitted\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"constant_\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 468\u001b[0;31m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'csr'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'csc'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'coo'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 469\u001b[0m \u001b[0mn_samples\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 470\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[0;34m(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 400\u001b[0m \u001b[0;31m# make sure we actually converted to numeric:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 401\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdtype_numeric\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0marray\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkind\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"O\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 402\u001b[0;31m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0marray\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfloat64\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 403\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mallow_nd\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0marray\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndim\u001b[0m \u001b[0;34m>=\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 404\u001b[0m raise ValueError(\"Found array with dim %d. 
%s expected <= 2.\"\n", "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'Pave'" ] } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "from sklearn.dummy import DummyRegressor\n", "\n", "def rmse(y_true, y_pred):\n", " return np.sqrt(mean_squared_error(y_true, y_pred))\n", "\n", "def test_better_than_average(model, X, y):\n", " reference_score = rmse(y, np.repeat(np.mean(y), len(y)))\n", " y_pred = model.predict(X)\n", " our_score = rmse(y, y_pred)\n", " assert our_score < reference_score\n", " \n", "stupid_model = DummyRegressor(strategy='constant', constant=1) # so I'm using no matter what the price of the house is 1.\n", "stupid_model = stupid_model.fit(train.X, train.y) # this model is so stupid that despite fitting it remains dumb\n", "\n", "test_better_than_average(stupid_model, train.X, train.y) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hm. Expected my test fail, but not this way. Actually, this means that my data contains non numeric values. Although some models handle this, let's stick to numeric values for now. So I convert data to numeric the following way by vectorizing all text columns. But this is not so easy like calling `pd.get_dummies()` and bamm. The problem is that it could be that some categories (note, vectorizing only makes sense if we are talking about categorical variables) only present in the train or in the test data. So if you simply call `pd.get_dummies()` you get different shapes of data. \n", "\n", "Now, it's clear that I want to test that there are only numeric values in my data. Therefore I added a test for that. Now to get to that you have multiple solutions:\n", "- Use a database with schema - that's the good and proper way to do it but it also takes some time to set up the schema properly for this 89 columns, so if there is another way then no.\n", "- Use some dictionary in Python and iterate over column and set up the categorical values one by one. - Nightmare to maintain such thing. Also it means you are replicating function which has been written by several people in several ways. \n", "- Use one of the already existing tools to get to this. - I actually opted for this solution and `pip install tdda` a nice package that check the values and validates the files. Now, the str --> categorical conversion is not built-in but the created JSON file contains the list of available values. 
 ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/jovyan/work/Documents/TDD/tdd_data_analysis/data_analysis.py:16: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", "  X['has' + col] = (X[col] != 0).astype(int)\n" ] }, { "data": { "text/plain": [ "(None, None, None)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from data_analysis import HousePriceData\n", "from pandas.api.types import is_numeric_dtype\n", "\n", "def test_if_all_numeric(data):\n", "    # is_numeric_dtype checks the underlying array of both DataFrames and Series\n", "    assert is_numeric_dtype(data.values)\n", "\n", "constraints_filename = 'house_prices_constraints_mod.tdda'\n", "train = HousePriceData('train.csv', constraints=constraints_filename)\n", "test = HousePriceData('test.csv', constraints=constraints_filename)\n", "\n", "test_if_all_numeric(train.X), test_if_all_numeric(train.y), test_if_all_numeric(test.X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to fail our first sanity test for real. Let's see:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "ename": "AssertionError", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtest_better_than_average\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstupid_model\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtest_better_than_average\u001b[0;34m(model, X, y)\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0my_pred\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrmse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mreference_score\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0mstupid_model\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mDummyRegressor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstrategy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'constant'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconstant\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# the prediction is always 1, no matter what the house looks like\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAssertionError\u001b[0m: " ] } ], "source": [ "test_better_than_average(stupid_model, train.X, train.y) " ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "Good, failed. Now, let's pass the test. Tempting to get the XGBoost and task done, but not so fast. Although an XGBoost model sure passes the test, but so does many other model e.g. a Linear Regression. Not that fancy but a lot simpler which in this case means that is just enough to pass the test. Remember, when you pass a test do not only think about the code that you yourself has to write but also take into account the overall complexity of your solution. So let's see if the Linear Regression passes the test" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "model = LinearRegression()\n", "model = model.fit(train.X, train.y)\n", "\n", "test_better_than_average(model, train.X, train.y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without problems. In the refactoring step I consolidated the HousePricaData class and set up the feature file with the Gherkin program and added the necessary pytest files (I used pytest-bdd for this project). Now the next step is to have something better than Linear Regression. But, actually you don't want 'just' better, probably you want significantly better. Otherwise, how would you explain the client to but a model which predicts based on thousands of weights instand of less than a hundred? So when I'm failing my second sanity test that is going to be because I want at least less than 75% of the Linear Regression from any complex method. \n", "\n", "For these steps I also need a proper train and validation split, so I've splitted the data and saved the bigger part as train_short and the smaller part as validation" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", " from numpy.core.umath_tests import inner1d\n", "/home/jovyan/work/Documents/TDD/tdd_data_analysis/data_analysis.py:16: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " X['has' + col] = (X[col] != 0).astype(int)\n" ] }, { "ename": "AssertionError", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0mtest_better_than_linear_regression\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidation\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidation\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpercent\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m75\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtest_better_than_linear_regression\u001b[0;34m(model, X_train, y_train, X_validation, y_validation, percent)\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0my_pred\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_validation\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrmse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my_validation\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mreference_score\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mtrain\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mHousePriceData\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'train_short.csv'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconstraints\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconstraints_filename\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAssertionError\u001b[0m: " ] } ], "source": [ "from sklearn.ensemble import ExtraTreesRegressor\n", "\n", "def test_better_than_linear_regression(model, X_train, y_train, X_validation, y_validation, percent=75):\n", " lm = LinearRegression().fit(X_train, y_train)\n", " reference_score = rmse(y_validation, lm.predict(X_validation)) * percent /100\n", " y_pred = model.predict(X_validation)\n", " our_score 
= rmse(y_validation, y_pred)\n", "    assert our_score < reference_score\n", "    \n", "train = HousePriceData('train_short.csv', constraints=constraints_filename)\n", "validation = HousePriceData('validation.csv', constraints=constraints_filename)\n", "    \n", "model = ExtraTreesRegressor()\n", "model = model.fit(train.X, train.y)\n", "\n", "test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course it failed. Actually, if you look at the results you can see that the Linear Regression already explains 70% of the variance, which is pretty good. After some hyperparameter tuning I found an XGBoost model that passes this test. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from xgboost import XGBRegressor\n", "\n", "model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", "                     colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n", "                     max_depth=4, min_child_weight=1, missing=None, n_estimators=400,\n", "                     n_jobs=1, nthread=None, objective='reg:linear', random_state=0,\n", "                     reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,\n", "                     silent=True, subsample=0.7)\n", "\n", "model = model.fit(train.X, train.y)\n", "\n", "test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step was also a refactoring: I took the 'test against model' logic and organized it into shared functions to make the whole testing logic more robust.\n", "\n", "So here I was with a working model, already able to answer whether it is better than an average prediction or a simple linear regression. I also had a model very early on: by starting off without handling NULL values in sophisticated ways, I had a working model just ~2 hours after I downloaded the data, note taking included. I could have stopped at basically any time after that and still had a working model. I also had a list of steps that I planned to include. \n", "\n", "I uploaded the code to github, which includes the state after refactoring. In the repository you can also see that I went on to add two more sanity tests and implemented solutions to pass them. In the first test, the passing criterion was an RMSE that would have been enough to land a submission in the middle of the Kaggle leaderboard (~0.13 RMSE). This was easy to fail and hard to pass :). I tried combining models, searching hyperparameters and transforming the data. In the end, I had to go back to the missing value problem, add back the fields that had had only a small number of missing values, and handle the data imputation. In the next step, I tested whether I overfit the training data. Overfitting is a serious problem for a production model: just imagine, if you are overfitting on a test set that you split from the data you had, how much you would underperform on newly collected data... I stopped after this step, but I had to fight the urge to refactor, because by now my tests ran for 8 seconds, which is way too much for TDD. That step is left for the next iteration.\n", "\n", "In sum, I think it is quite clear at this point how useful it is to do TDD in data analysis. Of course, at first you may feel that writing the test is just extra time, and of course you have a model that is better than the average. 
I'm sure you are right, but sometimes all of us make mistakes, and tests help to reduce their effect. Tests also often live longer than the version of the code that you ended up delivering. Maybe later someone takes over, and you want to make sure that their approach is not only different from yours but better, or at least as good, so that it passes your insanity and sanity tests. \n", "\n", "Also, TDD helps you keep focus. Data analysis, cleaning, imputing and modeling are all huge areas. Let's say you impute the missing values with zero; then you think maybe you should have predicted those; then you think that if you predict them, maybe you can even 'generate' extra data for training; then you start to think about how you would test the effectiveness of such data generation... and you find yourself in the middle of the forest with absolutely no idea how to get back to your original goal. Many effective individuals I know follow similar practices, and, what is more important, I see that working in this framework makes teamwork so much more effective and fun. \n", "\n", "Try it and let me know how it works in your practice!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1m============================= test session starts ==============================\u001b[0m\n", "platform linux -- Python 3.6.3, pytest-4.3.0, py-1.8.0, pluggy-0.9.0 -- /opt/conda/bin/python\n", "cachedir: .pytest_cache\n", "rootdir: /home/jovyan/work/Documents/TDD/tdd_data_analysis, inifile:\n", "plugins: bdd-3.1.0\n", "collected 6 items \u001b[0m\u001b[1m\n", "\n", "tests/insanity_test.py::DataIntegrity::test_no_nan_values \u001b[32mPASSED\u001b[0m\u001b[36m [ 16%]\u001b[0m\n", "tests/insanity_test.py::DataIntegrity::test_whether_only_numeric_values \u001b[32mPASSED\u001b[0m\u001b[36m [ 33%]\u001b[0m\n", "\u001b[34mFeature: \u001b[0m\u001b[34mThe model is making sense\u001b[0m\n", "\u001b[32m Scenario: \u001b[0m\u001b[32mBetter than simple average\u001b[0m\n", "\u001b[32m Given the validation data and the trained_model\n", "\u001b[0m\u001b[32m When I claim that the simplest is to take the Average of the outcome\n", "\u001b[0m\u001b[32m And I take the train_short data\n", "\u001b[0m\u001b[32m And I get the Average of the outcome from the train_short data\n", "\u001b[0m\u001b[32m And the RMSE of the prediction of the Average of the outcome for the validation as the reference score\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with the trained_model on the validation as our RMSE\n", "\u001b[0m\u001b[32m Then we see that our RMSE is lower than the reference score\n", "\u001b[0m\u001b[32m PASSED\u001b[0m\n", "\n", "\u001b[34mFeature: \u001b[0m\u001b[34mThe model is making sense\u001b[0m\n", "\u001b[32m Scenario: \u001b[0m\u001b[32mBetter than linear_regression\u001b[0m\n", "\u001b[32m Given the validation data and the trained_model\n", "\u001b[0m\u001b[32m When I use the Linear Regression\n", "\u001b[0m\u001b[32m And I take the train_short data\n", "\u001b[0m\u001b[32m And I train the Linear Regression on the train_short data\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with the Linear Regression on the validation as reference score\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with trained_model on the validation as our RMSE\n", "\u001b[0m\u001b[32m And my target is less than 75% of the reference score\n", "\u001b[0m\u001b[32m Then we see that our RMSE is lower than the reference score\n", 
"\u001b[0m\u001b[32m PASSED\u001b[0m\n", "\n", "\u001b[34mFeature: \u001b[0m\u001b[34mThe model is making sense\u001b[0m\n", "\u001b[32m Scenario: \u001b[0m\u001b[32mWe are providing good results\u001b[0m\n", "\u001b[32m Given the validation data and the trained_model\n", "\u001b[0m\u001b[32m When the RMSE of the prediction with trained_model on the validation as our RMSE\n", "\u001b[0m\u001b[32m And my reference score is 0.13 and I expect lower value from my model\n", "\u001b[0m\u001b[32m Then I see that our RMSE is indeed lower than the reference score\n", "\u001b[0m\u001b[32m PASSED\u001b[0m\n", "\n", "\u001b[34mFeature: \u001b[0m\u001b[34mThe model is making sense\u001b[0m\n", "\u001b[32m Scenario: \u001b[0m\u001b[32mWe are not overfitting the training data\u001b[0m\n", "\u001b[32m Given the validation data and the trained_model\n", "\u001b[0m\u001b[32m When I take the train_short data\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with trained_model on the validation as our RMSE\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with trained_model on the train_short as the reference score\n", "\u001b[0m\u001b[32m And my target is max 150% of the reference score\n", "\u001b[0m\u001b[32m Then I see that our RMSE is under this reference score limit\n", "\u001b[0m\u001b[32m PASSED\u001b[0m\n", "\n", "\n", "\u001b[32m\u001b[1m=========================== 6 passed in 8.23 seconds ===========================\u001b[0m\n" ] } ], "source": [ "!pytest -vv --gherkin-terminal-reporter --gherkin-terminal-reporter-expanded -p no:warnings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "<< FIN >>\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }