{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Step-by-step TDD in a data science task\n", "\n", "If you are interested in a longer introduction [click here]()\n", "\n", "I took an example dataset from Kaggle, the [House Prices dataet](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) this is a sufficiently easy and fun data. Just right to pass the imaginary test of 'tutorial on TDD for analysis'.\n", "\n", "Since it was a csv file, I started by reading the data with Pandas. The first thing I wanted to check is if there are NaN/NULL values." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MSSubClassMSZoningLotFrontageLotAreaStreetAlleyLotShapeLandContourUtilitiesLotConfig...PoolAreaPoolQCFenceMiscFeatureMiscValMoSoldYrSoldSaleTypeSaleConditionSalePrice
Id
160RL65.08450PaveNaNRegLvlAllPubInside...0NaNNaNNaN022008WDNormal208500
220RL80.09600PaveNaNRegLvlAllPubFR2...0NaNNaNNaN052007WDNormal181500
360RL68.011250PaveNaNIR1LvlAllPubInside...0NaNNaNNaN092008WDNormal223500
470RL60.09550PaveNaNIR1LvlAllPubCorner...0NaNNaNNaN022006WDAbnorml140000
560RL84.014260PaveNaNIR1LvlAllPubFR2...0NaNNaNNaN0122008WDNormal250000
\n", "

5 rows × 80 columns

\n", "
" ], "text/plain": [ " MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "Id \n", "1 60 RL 65.0 8450 Pave NaN Reg \n", "2 20 RL 80.0 9600 Pave NaN Reg \n", "3 60 RL 68.0 11250 Pave NaN IR1 \n", "4 70 RL 60.0 9550 Pave NaN IR1 \n", "5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature \\\n", "Id ... \n", "1 Lvl AllPub Inside ... 0 NaN NaN NaN \n", "2 Lvl AllPub FR2 ... 0 NaN NaN NaN \n", "3 Lvl AllPub Inside ... 0 NaN NaN NaN \n", "4 Lvl AllPub Corner ... 0 NaN NaN NaN \n", "5 Lvl AllPub FR2 ... 0 NaN NaN NaN \n", "\n", " MiscVal MoSold YrSold SaleType SaleCondition SalePrice \n", "Id \n", "1 0 2 2008 WD Normal 208500 \n", "2 0 5 2007 WD Normal 181500 \n", "3 0 9 2008 WD Normal 223500 \n", "4 0 2 2006 WD Abnorml 140000 \n", "5 0 12 2008 WD Normal 250000 \n", "\n", "[5 rows x 80 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%pylab inline\n", "import pandas as pd\n", "\n", "train = pd.read_csv('train.csv', index_col=['Id'])\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SumOfNullsDataTypes
PoolQC1453object
MiscFeature1406object
Alley1369object
Fence1179object
FireplaceQu690object
LotFrontage259float64
GarageYrBlt81float64
GarageCond81object
GarageType81object
GarageFinish81object
GarageQual81object
BsmtExposure38object
BsmtFinType238object
BsmtCond37object
BsmtQual37object
BsmtFinType137object
MasVnrArea8float64
MasVnrType8object
Electrical1object
MSSubClass0int64
Fireplaces0int64
Functional0object
KitchenQual0object
KitchenAbvGr0int64
BedroomAbvGr0int64
HalfBath0int64
FullBath0int64
BsmtHalfBath0int64
TotRmsAbvGrd0int64
GarageCars0int64
.........
HouseStyle0object
BldgType0object
Condition20object
Condition10object
LandSlope0object
2ndFlrSF0int64
LotConfig0object
Utilities0object
LandContour0object
LotShape0object
Street0object
LotArea0int64
YearBuilt0int64
YearRemodAdd0int64
RoofStyle0object
RoofMatl0object
Exterior1st0object
Exterior2nd0object
ExterQual0object
ExterCond0object
Foundation0object
BsmtFinSF10int64
BsmtFinSF20int64
BsmtUnfSF0int64
TotalBsmtSF0int64
Heating0object
HeatingQC0object
MSZoning0object
1stFlrSF0int64
SalePrice0int64
\n", "

80 rows × 2 columns

\n", "
" ], "text/plain": [ " SumOfNulls DataTypes\n", "PoolQC 1453 object\n", "MiscFeature 1406 object\n", "Alley 1369 object\n", "Fence 1179 object\n", "FireplaceQu 690 object\n", "LotFrontage 259 float64\n", "GarageYrBlt 81 float64\n", "GarageCond 81 object\n", "GarageType 81 object\n", "GarageFinish 81 object\n", "GarageQual 81 object\n", "BsmtExposure 38 object\n", "BsmtFinType2 38 object\n", "BsmtCond 37 object\n", "BsmtQual 37 object\n", "BsmtFinType1 37 object\n", "MasVnrArea 8 float64\n", "MasVnrType 8 object\n", "Electrical 1 object\n", "MSSubClass 0 int64\n", "Fireplaces 0 int64\n", "Functional 0 object\n", "KitchenQual 0 object\n", "KitchenAbvGr 0 int64\n", "BedroomAbvGr 0 int64\n", "HalfBath 0 int64\n", "FullBath 0 int64\n", "BsmtHalfBath 0 int64\n", "TotRmsAbvGrd 0 int64\n", "GarageCars 0 int64\n", "... ... ...\n", "HouseStyle 0 object\n", "BldgType 0 object\n", "Condition2 0 object\n", "Condition1 0 object\n", "LandSlope 0 object\n", "2ndFlrSF 0 int64\n", "LotConfig 0 object\n", "Utilities 0 object\n", "LandContour 0 object\n", "LotShape 0 object\n", "Street 0 object\n", "LotArea 0 int64\n", "YearBuilt 0 int64\n", "YearRemodAdd 0 int64\n", "RoofStyle 0 object\n", "RoofMatl 0 object\n", "Exterior1st 0 object\n", "Exterior2nd 0 object\n", "ExterQual 0 object\n", "ExterCond 0 object\n", "Foundation 0 object\n", "BsmtFinSF1 0 int64\n", "BsmtFinSF2 0 int64\n", "BsmtUnfSF 0 int64\n", "TotalBsmtSF 0 int64\n", "Heating 0 object\n", "HeatingQC 0 object\n", "MSZoning 0 object\n", "1stFlrSF 0 int64\n", "SalePrice 0 int64\n", "\n", "[80 rows x 2 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum_nulls = train.isnull().sum()\n", "datatypes = train.dtypes\n", "\n", "summary = pd.DataFrame(dict(SumOfNulls = sum_nulls, DataTypes = datatypes))\n", "summary.sort_values(by='SumOfNulls', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yep, there are. Usually missing cells is inherent to data. If you don't find any in your data then it is probably time to start being suspicious. Nevertheless, it is good to start the test-driven part to come up with a test that checks if there are NaN values in the data before we go on analysing it. It's easy also, since missing data will not make sense for several model types it's an insanity test. Usually, I first write the test also quick and dirty in the notebook." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "ename": "AssertionError", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0misnull\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mtest_no_nan_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtest_no_nan_values\u001b[0;34m(data)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mtest_no_nan_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0misnull\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0many\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mtest_no_nan_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAssertionError\u001b[0m: " ] } ], "source": [ "def test_no_nan_values(data):\n", " assert not data.isnull().any(axis=None)\n", " \n", "test_no_nan_values(train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good, it failed. Also note that neither I have docstrings at this moment nor this is a proper unittest/pytest case. It's a test. And it fails. That was the purpose of the first stage, so let's move on to the next. \n", "\n", "Missing values are tricky. Pandas is schemaless which makes prototyping easy but does not help in this case. E.g. missing values are filled with np.nan in otherwise string fields. Nevertheless, this is an intentional feature of pandas which makes easier to use some generic DataFrame functions. In my case I don't really want this to happen though. Let's see what we can do with NaN values:\n", "\n", " 1. Add 'Unknown' as a separate string to replace NaNs so we can use these fields later - What should be the limit of NaN values where we apply this strategy, obviously you don't want to spend too much energy on columns which have 95% missing values in the first iteration. So 90? 75? This should be tested.. So probably this is not the easiest solution to pass the test.\n", "\n", " 2. Predict the missing values. - This would take lot's of models and maybe different strategies for different fields. Again not what I call easy solution.\n", "\n", " 3. Ommit columns with missing values. - Since I want to pass the test first with the least effort (and from the higher level perspective bring results the soonest possible) I will choose this. 
\n", "\n", "I looked at the test set and because there even more NaN values were present I decided to take the union of columns with missing values in the two dataset. Also Since I want to do the same operations on the two sets I decided to quickly create a common class for them" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(None, None, None)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class HousePriceData(object):\n", " def __init__(self, filename):\n", " self.X_cols = ['MSSubClass', 'LotArea', 'Street', 'LotShape', 'LandContour',\n", " 'LotConfig', 'LandSlope', 'Condition1', 'Condition2',\n", " 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt',\n", " 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond',\n", " 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', \n", " '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath',\n", " 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',\n", " 'Fireplaces', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',\n", " 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',\n", " 'MoSold', 'YrSold', 'SaleCondition']\n", " self.y_col = 'SalePrice'\n", " data = pd.read_csv(filename, index_col=['Id'])\n", " self.X = data[self.X_cols]\n", " if self.y_col in data.columns:\n", " self.y = np.log(data[self.y_col])\n", " else:\n", " self.y = None\n", " \n", "train = HousePriceData('train.csv')\n", "test = HousePriceData('test.csv')\n", "\n", "test_no_nan_values(train.X), test_no_nan_values(train.y), test_no_nan_values(test.X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good, test passed. Now it's time to refactor. To this aim I created two python files, one for my class and one for the insanity tests. I then transformed the small test above to fit a unit testcase and rechecked if after the transformation whether it is going to fail for the raw data and does not fail for my transformed data. I left with 44 features from the 80+ which I considered enough to go further.\n", "\n", "So far I tested only insanity. It's right time to start sanity testing as well because it helps tremendously staying focused and keeping in mind the bigger picture. For sanity testing, I prefer using BDD frameworks for a while (e.g. [pytest-bdd](https://github.com/pytest-dev/pytest-bdd) or [behave](https://behave.readthedocs.io/en/latest/) ). The good thing in BDD is that it connects the testing to a human readable description of the testing steps written in Gherkin. This is useful since the tests are often actual requirements from the client and since it is easily readable for non-programmers it helps collecting external inputs to the solution. And as the developer I really want to keep my development process in sync with the client expectations.\n", "\n", "The first sanity test that I'm writing is going to test if any solution that we produce is better then the arthmetic average of the house prices. 
The Gherkin description for this scenario looks like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```gherkin\n", "Feature: Sanity of the model\n", "\n", "  Scenario: Better than the simple average\n", "    Given the trained model\n", "    When I use the arithmetic average of the outcome as a reference\n", "    And the rmse of the prediction of the arithmetic average on the validation data is the reference RMSE\n", "    And the rmse of the prediction with the trained_model on the validation data is our RMSE\n", "    Then we see that our RMSE is better than the reference RMSE\n", "  \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course there is a proper .py file attached to this nice description. Also, this polished textual description is probably not your first iteration of the actual sanity test; the first attempt may look more like this:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "could not convert string to float: 'Pave'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mstupid_model\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mstupid_model\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# this model is so stupid that despite fitting it remains dumb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0mtest_better_than_average\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstupid_model\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtest_better_than_average\u001b[0;34m(model, X, y)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mtest_better_than_average\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0mreference_score\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrmse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrepeat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0my_pred\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrmse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0my_pred\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mreference_score\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/conda/lib/python3.6/site-packages/sklearn/dummy.py\u001b[0m in \u001b[0;36mpredict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 466\u001b[0m \"\"\"\n\u001b[1;32m 467\u001b[0m \u001b[0mcheck_is_fitted\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"constant_\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 468\u001b[0;31m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'csr'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'csc'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'coo'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 469\u001b[0m \u001b[0mn_samples\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 470\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[0;34m(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 400\u001b[0m \u001b[0;31m# make sure we actually converted to numeric:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 401\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mdtype_numeric\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0marray\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mkind\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m\"O\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 402\u001b[0;31m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0marray\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfloat64\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 403\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mallow_nd\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0marray\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mndim\u001b[0m \u001b[0;34m>=\u001b[0m \u001b[0;36m3\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 404\u001b[0m raise ValueError(\"Found array with dim %d. 
%s expected <= 2.\"\n", "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'Pave'" ] } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "from sklearn.dummy import DummyRegressor\n", "\n", "def rmse(y_true, y_pred):\n", " return np.sqrt(mean_squared_error(y_true, y_pred))\n", "\n", "def test_better_than_average(model, X, y):\n", " reference_score = rmse(y, np.repeat(np.mean(y), len(y)))\n", " y_pred = model.predict(X)\n", " our_score = rmse(y, y_pred)\n", " assert our_score < reference_score\n", " \n", "stupid_model = DummyRegressor(strategy='constant', constant=1) # so I'm using no matter what the price of the house is 1.\n", "stupid_model = stupid_model.fit(train.X, train.y) # this model is so stupid that despite fitting it remains dumb\n", "\n", "test_better_than_average(stupid_model, train.X, train.y) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hm. Expected my test fail, but not this way. Actually, this means that my data contains non numeric values. Although some models handle this, let's stick to numeric values for now. So I convert data to numeric the following way by vectorizing all text columns. But this is not so easy like calling `pd.get_dummies()` and bamm. The problem is that it could be that some categories (note, vectorizing only makes sense if we are talking about categorical variables) only present in the train or in the test data. So if you simply call `pd.get_dummies()` you get different shapes of data. \n", "\n", "Now, it's clear that I want to test that there are only numeric values in my data. Therefore I added a test for that. Now to get to that you have multiple solutions:\n", "- Use a database with schema - that's the good and proper way to do it but it also takes some time to set up the schema properly for this 89 columns, so if there is another way then no.\n", "- Use some dictionary in Python and iterate over column and set up the categorical values one by one. - Nightmare to maintain such thing. Also it means you are replicating function which has been written by several people in several ways. \n", "- Use one of the already existing tools to get to this. - I actually opted for this solution and `pip install tdda` a nice package that check the values and validates the files. Now, the str --> categorical conversion is not built-in but the created JSON file contains the list of available values. 
 ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/jovyan/work/Documents/TDD/tdd_data_analysis/data_analysis.py:16: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", "  X['has' + col] = (X[col] != 0).astype(int)\n" ] }, { "data": { "text/plain": [ "(None, None, None)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from data_analysis import HousePriceData\n", "from pandas.api.types import is_numeric_dtype\n", "\n", "def test_if_all_numeric(data):\n", "    # is_numeric_dtype checks the underlying array of both DataFrames and Series\n", "    assert is_numeric_dtype(data.values)\n", "\n", "constraints_filename = 'house_prices_constraints_mod.tdda'\n", "train = HousePriceData('train.csv', constraints=constraints_filename)\n", "test = HousePriceData('test.csv', constraints=constraints_filename)\n", "\n", "test_if_all_numeric(train.X), test_if_all_numeric(train.y), test_if_all_numeric(test.X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to fail our first sanity test for real. Let's see:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "ename": "AssertionError", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mtest_better_than_average\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstupid_model\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtest_better_than_average\u001b[0;34m(model, X, y)\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0my_pred\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrmse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mreference_score\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0mstupid_model\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mDummyRegressor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstrategy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'constant'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconstant\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# the prediction is always 1, no matter what the house looks like\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAssertionError\u001b[0m: " ] } ], "source": [ "test_better_than_average(stupid_model, train.X, train.y) " ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "Good, failed. Now, let's pass the test. Tempting to get the XGBoost and task done, but not so fast. Although an XGBoost model sure passes the test, but so does many other model e.g. a Linear Regression. Not that fancy but a lot simpler which in this case means that is just enough to pass the test. Remember, when you pass a test do not only think about the code that you yourself has to write but also take into account the overall complexity of your solution. So let's see if the Linear Regression passes the test" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "model = LinearRegression()\n", "model = model.fit(train.X, train.y)\n", "\n", "test_better_than_average(model, train.X, train.y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without problems. In the refactoring step I consolidated the HousePricaData class and set up the feature file with the Gherkin program and added the necessary pytest files (I used pytest-bdd for this project). Now the next step is to have something better than Linear Regression. But, actually you don't want 'just' better, probably you want significantly better. Otherwise, how would you explain the client to but a model which predicts based on thousands of weights instand of less than a hundred? So when I'm failing my second sanity test that is going to be because I want at least less than 75% of the Linear Regression from any complex method. \n", "\n", "For these steps I also need a proper train and validation split, so I've splitted the data and saved the bigger part as train_short and the smaller part as validation" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/opt/conda/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. 
It will be removed in a future NumPy release.\n", " from numpy.core.umath_tests import inner1d\n", "/home/jovyan/work/Documents/TDD/tdd_data_analysis/data_analysis.py:16: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " X['has' + col] = (X[col] != 0).astype(int)\n" ] }, { "ename": "AssertionError", "evalue": "", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0mtest_better_than_linear_regression\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmodel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtrain\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidation\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalidation\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpercent\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m75\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m\u001b[0m in \u001b[0;36mtest_better_than_linear_regression\u001b[0;34m(model, X_train, y_train, X_validation, y_validation, percent)\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0my_pred\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_validation\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrmse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my_validation\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0mour_score\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mreference_score\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mtrain\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mHousePriceData\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'train_short.csv'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconstraints\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mconstraints_filename\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAssertionError\u001b[0m: " ] } ], "source": [ "from sklearn.ensemble import ExtraTreesRegressor\n", "\n", "def test_better_than_linear_regression(model, X_train, y_train, X_validation, y_validation, percent=75):\n", " lm = LinearRegression().fit(X_train, y_train)\n", " reference_score = rmse(y_validation, lm.predict(X_validation)) * percent /100\n", " y_pred = model.predict(X_validation)\n", " our_score 
= rmse(y_validation, y_pred)\n", "    assert our_score < reference_score\n", "    \n", "train = HousePriceData('train_short.csv', constraints=constraints_filename)\n", "validation = HousePriceData('validation.csv', constraints=constraints_filename)\n", "    \n", "model = ExtraTreesRegressor()\n", "model = model.fit(train.X, train.y)\n", "\n", "test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course it failed. Actually, if you look at the results you can see that the Linear Regression already explains 70% of the variance, which is pretty good. After some hyperparameter tuning I found an XGBoost model that passes this test. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from xgboost import XGBRegressor\n", "\n", "model = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", "                     colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n", "                     max_depth=4, min_child_weight=1, missing=None, n_estimators=400,\n", "                     n_jobs=1, nthread=None, objective='reg:linear', random_state=0,\n", "                     reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,\n", "                     silent=True, subsample=0.7)\n", "\n", "model = model.fit(train.X, train.y)\n", "\n", "test_better_than_linear_regression(model, train.X, train.y, validation.X, validation.y, percent=75)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step was also a refactoring: I took the 'test against model' logic and organized it into shared functions to make the whole testing logic more robust.\n", "\n", "So here I was with a working model, already able to answer whether it is better than an average prediction or a simple linear regression. I also had a model very early on: by starting off without handling NULL values in sophisticated ways, I had a working model just ~2 hours after I downloaded the data, note taking included. I could have stopped at basically any time after that and still had a working model. I also had a list of steps that I planned to include. \n", "\n", "I uploaded the code to github, which includes the state after refactoring. In the repository you can also see that I went on to add two more sanity tests and implemented solutions to pass them. In the first test, the passing criterion was an RMSE that would have been enough to land a submission in the middle of the Kaggle leaderboard (~0.13 RMSE). This was easy to fail and hard to pass :). I tried combining models, searching hyperparameters and transforming the data. In the end, I had to go back to the missing value problem, add back the fields that had had only a small number of missing values, and handle the data imputation. In the next step, I tested whether I overfit the training data. Overfitting is a serious problem for a production model: just imagine, if you are overfitting on a test set that you split from the data you had, how much you would underperform on newly collected data... I stopped after this step, but I had to fight the urge to refactor, because by now my tests ran for 8 seconds, which is way too much for TDD. That step is left for the next iteration.\n", "\n", "In sum, I think it is quite clear at this point how useful it is to do TDD in data analysis. Of course, at first you may feel that writing the test is just extra time, and of course you have a model that is better than the average. 
I'm sure you are right, but sometimes all of us make mistakes, and tests help to reduce their effect. Tests also often live longer than the version of the code that you ended up delivering. Maybe later someone takes over, and you want to make sure that their approach is not only different from yours but better, or at least as good, so that it passes your insanity and sanity tests. \n", "\n", "Also, TDD helps you keep focus. Data analysis, cleaning, imputing and modeling are all huge areas. Let's say you impute the missing values with zero; then you think maybe you should have predicted those; then you think that if you predict them, maybe you can even 'generate' extra data for training; then you start to think about how you would test the effectiveness of such data generation... and you find yourself in the middle of the forest with absolutely no idea how to get back to your original goal. Many effective individuals I know follow similar practices, and, what is more important, I see that working in this framework makes teamwork so much more effective and fun. \n", "\n", "Try it and let me know how it works in your practice!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1m============================= test session starts ==============================\u001b[0m\n", "platform linux -- Python 3.6.3, pytest-4.3.0, py-1.8.0, pluggy-0.9.0 -- /opt/conda/bin/python\n", "cachedir: .pytest_cache\n", "rootdir: /home/jovyan/work/Documents/TDD/tdd_data_analysis, inifile:\n", "plugins: bdd-3.1.0\n", "collected 6 items \u001b[0m\u001b[1m\n", "\n", "tests/insanity_test.py::DataIntegrity::test_no_nan_values \u001b[32mPASSED\u001b[0m\u001b[36m [ 16%]\u001b[0m\n", "tests/insanity_test.py::DataIntegrity::test_whether_only_numeric_values \u001b[32mPASSED\u001b[0m\u001b[36m [ 33%]\u001b[0m\n", "\u001b[34mFeature: \u001b[0m\u001b[34mThe model is making sense\u001b[0m\n", "\u001b[32m Scenario: \u001b[0m\u001b[32mBetter than simple average\u001b[0m\n", "\u001b[32m Given the validation data and the trained_model\n", "\u001b[0m\u001b[32m When I claim that the simplest is to take the Average of the outcome\n", "\u001b[0m\u001b[32m And I take the train_short data\n", "\u001b[0m\u001b[32m And I get the Average of the outcome from the train_short data\n", "\u001b[0m\u001b[32m And the RMSE of the prediction of the Average of the outcome for the validation as the reference score\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with the trained_model on the validation as our RMSE\n", "\u001b[0m\u001b[32m Then we see that our RMSE is lower than the reference score\n", "\u001b[0m\u001b[32m PASSED\u001b[0m\n", "\n", "\u001b[34mFeature: \u001b[0m\u001b[34mThe model is making sense\u001b[0m\n", "\u001b[32m Scenario: \u001b[0m\u001b[32mBetter than linear_regression\u001b[0m\n", "\u001b[32m Given the validation data and the trained_model\n", "\u001b[0m\u001b[32m When I use the Linear Regression\n", "\u001b[0m\u001b[32m And I take the train_short data\n", "\u001b[0m\u001b[32m And I train the Linear Regression on the train_short data\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with the Linear Regression on the validation as reference score\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with trained_model on the validation as our RMSE\n", "\u001b[0m\u001b[32m And my target is less than 75% of the reference score\n", "\u001b[0m\u001b[32m Then we see that our RMSE is lower than the reference score\n", 
"\u001b[0m\u001b[32m PASSED\u001b[0m\n", "\n", "\u001b[34mFeature: \u001b[0m\u001b[34mThe model is making sense\u001b[0m\n", "\u001b[32m Scenario: \u001b[0m\u001b[32mWe are providing good results\u001b[0m\n", "\u001b[32m Given the validation data and the trained_model\n", "\u001b[0m\u001b[32m When the RMSE of the prediction with trained_model on the validation as our RMSE\n", "\u001b[0m\u001b[32m And my reference score is 0.13 and I expect lower value from my model\n", "\u001b[0m\u001b[32m Then I see that our RMSE is indeed lower than the reference score\n", "\u001b[0m\u001b[32m PASSED\u001b[0m\n", "\n", "\u001b[34mFeature: \u001b[0m\u001b[34mThe model is making sense\u001b[0m\n", "\u001b[32m Scenario: \u001b[0m\u001b[32mWe are not overfitting the training data\u001b[0m\n", "\u001b[32m Given the validation data and the trained_model\n", "\u001b[0m\u001b[32m When I take the train_short data\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with trained_model on the validation as our RMSE\n", "\u001b[0m\u001b[32m And the RMSE of the prediction with trained_model on the train_short as the reference score\n", "\u001b[0m\u001b[32m And my target is max 150% of the reference score\n", "\u001b[0m\u001b[32m Then I see that our RMSE is under this reference score limit\n", "\u001b[0m\u001b[32m PASSED\u001b[0m\n", "\n", "\n", "\u001b[32m\u001b[1m=========================== 6 passed in 8.23 seconds ===========================\u001b[0m\n" ] } ], "source": [ "!pytest -vv --gherkin-terminal-reporter --gherkin-terminal-reporter-expanded -p no:warnings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "<< FIN >>\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }