{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Favorita 2016 10 weeks (28-06 to 31-08 : 65 days inclusive)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "%reload_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from fastai.structured import *\n", "from fastai.column_data import *\n", "np.set_printoptions(threshold=50, edgeitems=20)\n", "import datetime\n", "\n", "PATH = 'data/favorita/'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "holidays_events.csv\n", "items.csv\n", "models\n", "oil.csv\n", "sample_submission.csv\n", "stores.csv\n", "test.csv\n", "tmp\n", "train.csv\n", "transactions.csv\n", "\n" ] } ], "source": [ "# Input data files are available in the \"../input/\" directory.\n", "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n", "\n", "from subprocess import check_output\n", "print(check_output([\"ls\", PATH]).decode(\"utf8\"))\n", "\n", "# Any results you write to the current directory are saved as output." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create datasets" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "table_names = ['train', 'stores', 'items', 'transactions', \n", " 'holidays_events', 'oil', 'test', 'sample_submission']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll be using the popular data manipulation framework pandas. Among other things, pandas allows you to manipulate tables/data frames in python as one would in a database.\n", "\n", "We're going to go ahead and load all of our csv's as dataframes into the list tables." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from IPython.display import HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `head()` to get a quick look at the contents of each table:\n", "* **train**: includes the target unit_sales by date, store_nbr, and item_nbr and a unique id to label rows\n", "\n", "* **stores**: metadata; including city, state, type, and cluster; cluster is a grouping of similar stores\n", "\n", "* **items**: metadata; including family, class, and perishable; perishable have a score weight of 1.25; otherwise, the weight is 1.0\n", "\n", "* **transactions**: count of sales transactions for each date, store_nbr combination. Only included for the training data timeframe\n", "\n", "* **holidays_events**: metadata; Pay special attention to the transferred column.\n", " * A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was actually celebrated, look for the corresponding row where type is Transfer.\n", " * For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day which is a day not normally scheduled for work (e.g., Saturday) that is meant to payback the Bridge.\n", " * Additional holidays are days added a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).\n", "\n", "* **oil**: Daily oil price. Includes values during both the train and test data timeframe. (Ecuador is an oil-dependent country and it's economical health is highly vulnerable to shocks in oil prices.)\n", "\n", "* **test**: Test data, with the date, store_nbr, item_nbr combinations that are to be predicted, along with the onpromotion information.\n", " * NOTE: The test data has a small number of items that are not contained in the training data. Part of the exercise will be to predict a new item sales based on similar products.\n", " * The public / private leaderboard split is based on time. All items in the public split are also included in the private split.\n", " \n", "* **sample_submission**: sample submission file in the correct format. It is highly recommend you zip your submission file before uploading!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbrunit_salesonpromotion
002013-01-01251036657.0NaN
112013-01-01251055741.0NaN
222013-01-01251055752.0NaN
332013-01-01251080791.0NaN
442013-01-01251087011.0NaN
\n", "
" ], "text/plain": [ " id date store_nbr item_nbr unit_sales onpromotion\n", "0 0 2013-01-01 25 103665 7.0 NaN\n", "1 1 2013-01-01 25 105574 1.0 NaN\n", "2 2 2013-01-01 25 105575 2.0 NaN\n", "3 3 2013-01-01 25 108079 1.0 NaN\n", "4 4 2013-01-01 25 108701 1.0 NaN" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(125497040, 6)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
store_nbrcitystatetypecluster
01QuitoPichinchaD13
12QuitoPichinchaD13
23QuitoPichinchaD8
34QuitoPichinchaD9
45Santo DomingoSanto Domingo de los TsachilasD4
\n", "
" ], "text/plain": [ " store_nbr city state type cluster\n", "0 1 Quito Pichincha D 13\n", "1 2 Quito Pichincha D 13\n", "2 3 Quito Pichincha D 8\n", "3 4 Quito Pichincha D 9\n", "4 5 Santo Domingo Santo Domingo de los Tsachilas D 4" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(54, 5)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
item_nbrfamilyclassperishable
096995GROCERY I10930
199197GROCERY I10670
2103501CLEANING30080
3103520GROCERY I10280
4103665BREAD/BAKERY27121
\n", "
" ], "text/plain": [ " item_nbr family class perishable\n", "0 96995 GROCERY I 1093 0\n", "1 99197 GROCERY I 1067 0\n", "2 103501 CLEANING 3008 0\n", "3 103520 GROCERY I 1028 0\n", "4 103665 BREAD/BAKERY 2712 1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(4100, 4)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datestore_nbrtransactions
02013-01-0125770
12013-01-0212111
22013-01-0222358
32013-01-0233487
42013-01-0241922
\n", "
" ], "text/plain": [ " date store_nbr transactions\n", "0 2013-01-01 25 770\n", "1 2013-01-02 1 2111\n", "2 2013-01-02 2 2358\n", "3 2013-01-02 3 3487\n", "4 2013-01-02 4 1922" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(83488, 3)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetypelocalelocale_namedescriptiontransferred
02012-03-02HolidayLocalMantaFundacion de MantaFalse
12012-04-01HolidayRegionalCotopaxiProvincializacion de CotopaxiFalse
22012-04-12HolidayLocalCuencaFundacion de CuencaFalse
32012-04-14HolidayLocalLibertadCantonizacion de LibertadFalse
42012-04-21HolidayLocalRiobambaCantonizacion de RiobambaFalse
\n", "
" ], "text/plain": [ " date type locale locale_name description \\\n", "0 2012-03-02 Holiday Local Manta Fundacion de Manta \n", "1 2012-04-01 Holiday Regional Cotopaxi Provincializacion de Cotopaxi \n", "2 2012-04-12 Holiday Local Cuenca Fundacion de Cuenca \n", "3 2012-04-14 Holiday Local Libertad Cantonizacion de Libertad \n", "4 2012-04-21 Holiday Local Riobamba Cantonizacion de Riobamba \n", "\n", " transferred \n", "0 False \n", "1 False \n", "2 False \n", "3 False \n", "4 False " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(350, 6)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datedcoilwtico
02013-01-01NaN
12013-01-0293.14
22013-01-0392.97
32013-01-0493.12
42013-01-0793.20
\n", "
" ], "text/plain": [ " date dcoilwtico\n", "0 2013-01-01 NaN\n", "1 2013-01-02 93.14\n", "2 2013-01-03 92.97\n", "3 2013-01-04 93.12\n", "4 2013-01-07 93.20" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(1218, 2)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbronpromotion
01254970402017-08-16196995False
11254970412017-08-16199197False
21254970422017-08-161103501False
31254970432017-08-161103520False
41254970442017-08-161103665False
\n", "
" ], "text/plain": [ " id date store_nbr item_nbr onpromotion\n", "0 125497040 2017-08-16 1 96995 False\n", "1 125497041 2017-08-16 1 99197 False\n", "2 125497042 2017-08-16 1 103501 False\n", "3 125497043 2017-08-16 1 103520 False\n", "4 125497044 2017-08-16 1 103665 False" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(3370464, 5)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idunit_sales
01254970400
11254970410
21254970420
31254970430
41254970440
\n", "
" ], "text/plain": [ " id unit_sales\n", "0 125497040 0\n", "1 125497041 0\n", "2 125497042 0\n", "3 125497043 0\n", "4 125497044 0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(3370464, 2)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for t in tables: display(t.head(), t.shape)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbrunit_salesonpromotion
count1.25497e+08NaN1.25497e+081.25497e+081.25497e+08NaN
mean6.27485e+07NaN27.46469727698.55487NaN
std3.62279e+07NaN16.330552053423.6052NaN
min0NaN196995-15372NaN
25%3.13743e+07NaN125223832NaN
50%6.27485e+07NaN289595004NaN
75%9.41228e+07NaN431.35438e+069NaN
max1.25497e+08NaN542.12711e+0689440NaN
counts125497040125497040125497040125497040125497040103839389
uniques12549704016845440362584742
missing0000021657651
missing_perc0%0%0%0%0%17.26%
typesnumericcategoricalnumericnumericnumericbool
\n", "
" ], "text/plain": [ " id date store_nbr item_nbr unit_sales \\\n", "count 1.25497e+08 NaN 1.25497e+08 1.25497e+08 1.25497e+08 \n", "mean 6.27485e+07 NaN 27.4646 972769 8.55487 \n", "std 3.62279e+07 NaN 16.3305 520534 23.6052 \n", "min 0 NaN 1 96995 -15372 \n", "25% 3.13743e+07 NaN 12 522383 2 \n", "50% 6.27485e+07 NaN 28 959500 4 \n", "75% 9.41228e+07 NaN 43 1.35438e+06 9 \n", "max 1.25497e+08 NaN 54 2.12711e+06 89440 \n", "counts 125497040 125497040 125497040 125497040 125497040 \n", "uniques 125497040 1684 54 4036 258474 \n", "missing 0 0 0 0 0 \n", "missing_perc 0% 0% 0% 0% 0% \n", "types numeric categorical numeric numeric numeric \n", "\n", " onpromotion \n", "count NaN \n", "mean NaN \n", "std NaN \n", "min NaN \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max NaN \n", "counts 103839389 \n", "uniques 2 \n", "missing 21657651 \n", "missing_perc 17.26% \n", "types bool " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
store_nbrcitystatetypecluster
count54NaNNaNNaN54
mean27.5NaNNaNNaN8.48148
std15.7321NaNNaNNaN4.69339
min1NaNNaNNaN1
25%14.25NaNNaNNaN4
50%27.5NaNNaNNaN8.5
75%40.75NaNNaNNaN13
max54NaNNaNNaN17
counts5454545454
uniques542216517
missing00000
missing_perc0%0%0%0%0%
typesnumericcategoricalcategoricalcategoricalnumeric
\n", "
" ], "text/plain": [ " store_nbr city state type cluster\n", "count 54 NaN NaN NaN 54\n", "mean 27.5 NaN NaN NaN 8.48148\n", "std 15.7321 NaN NaN NaN 4.69339\n", "min 1 NaN NaN NaN 1\n", "25% 14.25 NaN NaN NaN 4\n", "50% 27.5 NaN NaN NaN 8.5\n", "75% 40.75 NaN NaN NaN 13\n", "max 54 NaN NaN NaN 17\n", "counts 54 54 54 54 54\n", "uniques 54 22 16 5 17\n", "missing 0 0 0 0 0\n", "missing_perc 0% 0% 0% 0% 0%\n", "types numeric categorical categorical categorical numeric" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
item_nbrfamilyclassperishable
count4100NaN41004100
mean1.25144e+06NaN2169.650.240488
std587687NaN1484.910.427432
min96995NaN10020
25%818111NaN10680
50%1.3062e+06NaN20040
75%1.90492e+06NaN2990.50
max2.13424e+06NaN77801
counts4100410041004100
uniques4100333372
missing0000
missing_perc0%0%0%0%
typesnumericcategoricalnumericbool
\n", "
" ], "text/plain": [ " item_nbr family class perishable\n", "count 4100 NaN 4100 4100\n", "mean 1.25144e+06 NaN 2169.65 0.240488\n", "std 587687 NaN 1484.91 0.427432\n", "min 96995 NaN 1002 0\n", "25% 818111 NaN 1068 0\n", "50% 1.3062e+06 NaN 2004 0\n", "75% 1.90492e+06 NaN 2990.5 0\n", "max 2.13424e+06 NaN 7780 1\n", "counts 4100 4100 4100 4100\n", "uniques 4100 33 337 2\n", "missing 0 0 0 0\n", "missing_perc 0% 0% 0% 0%\n", "types numeric categorical numeric bool" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datestore_nbrtransactions
countNaN8348883488
meanNaN26.93921694.6
stdNaN15.6082963.287
minNaN15
25%NaN131046
50%NaN271393
75%NaN402079
maxNaN548359
counts834888348883488
uniques1682544993
missing000
missing_perc0%0%0%
typescategoricalnumericnumeric
\n", "
" ], "text/plain": [ " date store_nbr transactions\n", "count NaN 83488 83488\n", "mean NaN 26.9392 1694.6\n", "std NaN 15.6082 963.287\n", "min NaN 1 5\n", "25% NaN 13 1046\n", "50% NaN 27 1393\n", "75% NaN 40 2079\n", "max NaN 54 8359\n", "counts 83488 83488 83488\n", "uniques 1682 54 4993\n", "missing 0 0 0\n", "missing_perc 0% 0% 0%\n", "types categorical numeric numeric" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetypelocalelocale_namedescriptiontransferred
count350350350350350350
unique31263241032
top2014-06-25HolidayNationalEcuadorCarnavalFalse
freq422117417410338
counts350350350350350350
uniques31263241032
missing000000
missing_perc0%0%0%0%0%0%
typescategoricalcategoricalcategoricalcategoricalcategoricalbool
\n", "
" ], "text/plain": [ " date type locale locale_name description \\\n", "count 350 350 350 350 350 \n", "unique 312 6 3 24 103 \n", "top 2014-06-25 Holiday National Ecuador Carnaval \n", "freq 4 221 174 174 10 \n", "counts 350 350 350 350 350 \n", "uniques 312 6 3 24 103 \n", "missing 0 0 0 0 0 \n", "missing_perc 0% 0% 0% 0% 0% \n", "types categorical categorical categorical categorical categorical \n", "\n", " transferred \n", "count 350 \n", "unique 2 \n", "top False \n", "freq 338 \n", "counts 350 \n", "uniques 2 \n", "missing 0 \n", "missing_perc 0% \n", "types bool " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datedcoilwtico
countNaN1175
meanNaN67.7144
stdNaN25.6305
minNaN26.19
25%NaN46.405
50%NaN53.19
75%NaN95.66
maxNaN110.62
counts12181175
uniques1218998
missing043
missing_perc0%3.53%
typesuniquenumeric
\n", "
" ], "text/plain": [ " date dcoilwtico\n", "count NaN 1175\n", "mean NaN 67.7144\n", "std NaN 25.6305\n", "min NaN 26.19\n", "25% NaN 46.405\n", "50% NaN 53.19\n", "75% NaN 95.66\n", "max NaN 110.62\n", "counts 1218 1175\n", "uniques 1218 998\n", "missing 0 43\n", "missing_perc 0% 3.53%\n", "types unique numeric" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbronpromotion
count3.37046e+06NaN3.37046e+063.37046e+06NaN
mean1.27182e+08NaN27.51.2448e+06NaN
std972969NaN15.5858589836NaN
min1.25497e+08NaN196995NaN
25%1.2634e+08NaN14805321NaN
50%1.27182e+08NaN27.51.29466e+06NaN
75%1.28025e+08NaN411.73002e+06NaN
max1.28868e+08NaN542.13424e+06NaN
counts33704643370464337046433704643370464
uniques3370464165439012
missing00000
missing_perc0%0%0%0%0%
typesnumericcategoricalnumericnumericbool
\n", "
" ], "text/plain": [ " id date store_nbr item_nbr onpromotion\n", "count 3.37046e+06 NaN 3.37046e+06 3.37046e+06 NaN\n", "mean 1.27182e+08 NaN 27.5 1.2448e+06 NaN\n", "std 972969 NaN 15.5858 589836 NaN\n", "min 1.25497e+08 NaN 1 96995 NaN\n", "25% 1.2634e+08 NaN 14 805321 NaN\n", "50% 1.27182e+08 NaN 27.5 1.29466e+06 NaN\n", "75% 1.28025e+08 NaN 41 1.73002e+06 NaN\n", "max 1.28868e+08 NaN 54 2.13424e+06 NaN\n", "counts 3370464 3370464 3370464 3370464 3370464\n", "uniques 3370464 16 54 3901 2\n", "missing 0 0 0 0 0\n", "missing_perc 0% 0% 0% 0% 0%\n", "types numeric categorical numeric numeric bool" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idunit_sales
count3.37046e+063.37046e+06
mean1.27182e+080
std9729690
min1.25497e+080
25%1.2634e+080
50%1.27182e+080
75%1.28025e+080
max1.28868e+080
counts33704643370464
uniques33704641
missing00
missing_perc0%0%
typesnumericconstant
\n", "
" ], "text/plain": [ " id unit_sales\n", "count 3.37046e+06 3.37046e+06\n", "mean 1.27182e+08 0\n", "std 972969 0\n", "min 1.25497e+08 0\n", "25% 1.2634e+08 0\n", "50% 1.27182e+08 0\n", "75% 1.28025e+08 0\n", "max 1.28868e+08 0\n", "counts 3370464 3370464\n", "uniques 3370464 1\n", "missing 0 0\n", "missing_perc 0% 0%\n", "types numeric constant" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# The following returns summarized aggregate information to each table accross each field.\n", "for t in tables: display(DataFrameSummary(t).summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning / Feature Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a structured data problem, we necessarily have to go through all the cleaning and feature engineering, even though we're using a neural network." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "train, stores, items, transactions, holidays_events, oil, test, sample_submission = tables" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "train.unit_sales = np.clip(train.unit_sales, 0, None)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(125497040, 3370464)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(train),len(test)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 1.254970e+08\n", "mean 8.556009e+00\n", "std 2.352696e+01\n", "min 0.000000e+00\n", "25% 2.000000e+00\n", "50% 4.000000e+00\n", "75% 9.000000e+00\n", "max 8.944000e+04\n", "Name: unit_sales, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.unit_sales.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We turn state OnPromotion to booleans, to make them more convenient for modeling. We can do calculations on pandas fields using notation very similar (often identical) to numpy." ] }, { "cell_type": "raw", "metadata": {}, "source": [ "train.onpromotion = train.onpromotion!='0'\n", "test.onpromotion = test.onpromotion!='0'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Optimizing the Date format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.\n", "\n", "You should *always* consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field.\n", "\n", "**note**: Dayofweek starts at 0, Dayofyear starts at 1" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "add_datepart(train, \"date\", drop=False)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "%add_datepart" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "add_datepart(transactions, \"date\", drop=False)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "add_datepart(holidays_events, \"date\", drop=False)\n", "add_datepart(oil, \"date\", drop=False)\n", "add_datepart(test, \"date\", drop=False)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbrunit_salesonpromotionYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
002013-01-01251036657.0NaN201311111FalseTrueFalseTrueFalseTrue1356998400
112013-01-01251055741.0NaN201311111FalseTrueFalseTrueFalseTrue1356998400
222013-01-01251055752.0NaN201311111FalseTrueFalseTrueFalseTrue1356998400
332013-01-01251080791.0NaN201311111FalseTrueFalseTrueFalseTrue1356998400
442013-01-01251087011.0NaN201311111FalseTrueFalseTrueFalseTrue1356998400
\n", "
" ], "text/plain": [ " id date store_nbr item_nbr unit_sales onpromotion Year Month \\\n", "0 0 2013-01-01 25 103665 7.0 NaN 2013 1 \n", "1 1 2013-01-01 25 105574 1.0 NaN 2013 1 \n", "2 2 2013-01-01 25 105575 2.0 NaN 2013 1 \n", "3 3 2013-01-01 25 108079 1.0 NaN 2013 1 \n", "4 4 2013-01-01 25 108701 1.0 NaN 2013 1 \n", "\n", " Week Day Dayofweek Dayofyear Is_month_end Is_month_start \\\n", "0 1 1 1 1 False True \n", "1 1 1 1 1 False True \n", "2 1 1 1 1 False True \n", "3 1 1 1 1 False True \n", "4 1 1 1 1 False True \n", "\n", " Is_quarter_end Is_quarter_start Is_year_end Is_year_start Elapsed \n", "0 False True False True 1356998400 \n", "1 False True False True 1356998400 \n", "2 False True False True 1356998400 \n", "3 False True False True 1356998400 \n", "4 False True False True 1356998400 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(125497040, 19)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
store_nbrcitystatetypecluster
01QuitoPichinchaD13
12QuitoPichinchaD13
23QuitoPichinchaD8
34QuitoPichinchaD9
45Santo DomingoSanto Domingo de los TsachilasD4
\n", "
" ], "text/plain": [ " store_nbr city state type cluster\n", "0 1 Quito Pichincha D 13\n", "1 2 Quito Pichincha D 13\n", "2 3 Quito Pichincha D 8\n", "3 4 Quito Pichincha D 9\n", "4 5 Santo Domingo Santo Domingo de los Tsachilas D 4" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(54, 5)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
item_nbrfamilyclassperishable
096995GROCERY I10930
199197GROCERY I10670
2103501CLEANING30080
3103520GROCERY I10280
4103665BREAD/BAKERY27121
\n", "
" ], "text/plain": [ " item_nbr family class perishable\n", "0 96995 GROCERY I 1093 0\n", "1 99197 GROCERY I 1067 0\n", "2 103501 CLEANING 3008 0\n", "3 103520 GROCERY I 1028 0\n", "4 103665 BREAD/BAKERY 2712 1" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(4100, 4)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datestore_nbrtransactionsYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
02013-01-0125770201311111FalseTrueFalseTrueFalseTrue1356998400
12013-01-0212111201311222FalseFalseFalseFalseFalseFalse1357084800
22013-01-0222358201311222FalseFalseFalseFalseFalseFalse1357084800
32013-01-0233487201311222FalseFalseFalseFalseFalseFalse1357084800
42013-01-0241922201311222FalseFalseFalseFalseFalseFalse1357084800
\n", "
" ], "text/plain": [ " date store_nbr transactions Year Month Week Day Dayofweek \\\n", "0 2013-01-01 25 770 2013 1 1 1 1 \n", "1 2013-01-02 1 2111 2013 1 1 2 2 \n", "2 2013-01-02 2 2358 2013 1 1 2 2 \n", "3 2013-01-02 3 3487 2013 1 1 2 2 \n", "4 2013-01-02 4 1922 2013 1 1 2 2 \n", "\n", " Dayofyear Is_month_end Is_month_start Is_quarter_end Is_quarter_start \\\n", "0 1 False True False True \n", "1 2 False False False False \n", "2 2 False False False False \n", "3 2 False False False False \n", "4 2 False False False False \n", "\n", " Is_year_end Is_year_start Elapsed \n", "0 False True 1356998400 \n", "1 False False 1357084800 \n", "2 False False 1357084800 \n", "3 False False 1357084800 \n", "4 False False 1357084800 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(83488, 16)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetypelocalelocale_namedescriptiontransferredYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
02012-03-02HolidayLocalMantaFundacion de MantaFalse2012392462FalseFalseFalseFalseFalseFalse1330646400
12012-04-01HolidayRegionalCotopaxiProvincializacion de CotopaxiFalse20124131692FalseTrueFalseTrueFalseFalse1333238400
22012-04-12HolidayLocalCuencaFundacion de CuencaFalse2012415123103FalseFalseFalseFalseFalseFalse1334188800
32012-04-14HolidayLocalLibertadCantonizacion de LibertadFalse2012415145105FalseFalseFalseFalseFalseFalse1334361600
42012-04-21HolidayLocalRiobambaCantonizacion de RiobambaFalse2012416215112FalseFalseFalseFalseFalseFalse1334966400
\n", "
" ], "text/plain": [ " date type locale locale_name description \\\n", "0 2012-03-02 Holiday Local Manta Fundacion de Manta \n", "1 2012-04-01 Holiday Regional Cotopaxi Provincializacion de Cotopaxi \n", "2 2012-04-12 Holiday Local Cuenca Fundacion de Cuenca \n", "3 2012-04-14 Holiday Local Libertad Cantonizacion de Libertad \n", "4 2012-04-21 Holiday Local Riobamba Cantonizacion de Riobamba \n", "\n", " transferred Year Month Week Day Dayofweek Dayofyear Is_month_end \\\n", "0 False 2012 3 9 2 4 62 False \n", "1 False 2012 4 13 1 6 92 False \n", "2 False 2012 4 15 12 3 103 False \n", "3 False 2012 4 15 14 5 105 False \n", "4 False 2012 4 16 21 5 112 False \n", "\n", " Is_month_start Is_quarter_end Is_quarter_start Is_year_end \\\n", "0 False False False False \n", "1 True False True False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", "\n", " Is_year_start Elapsed \n", "0 False 1330646400 \n", "1 False 1333238400 \n", "2 False 1334188800 \n", "3 False 1334361600 \n", "4 False 1334966400 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(350, 19)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datedcoilwticoYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
02013-01-01NaN201311111FalseTrueFalseTrueFalseTrue1356998400
12013-01-0293.14201311222FalseFalseFalseFalseFalseFalse1357084800
22013-01-0392.97201311333FalseFalseFalseFalseFalseFalse1357171200
32013-01-0493.12201311444FalseFalseFalseFalseFalseFalse1357257600
42013-01-0793.20201312707FalseFalseFalseFalseFalseFalse1357516800
\n", "
" ], "text/plain": [ " date dcoilwtico Year Month Week Day Dayofweek Dayofyear \\\n", "0 2013-01-01 NaN 2013 1 1 1 1 1 \n", "1 2013-01-02 93.14 2013 1 1 2 2 2 \n", "2 2013-01-03 92.97 2013 1 1 3 3 3 \n", "3 2013-01-04 93.12 2013 1 1 4 4 4 \n", "4 2013-01-07 93.20 2013 1 2 7 0 7 \n", "\n", " Is_month_end Is_month_start Is_quarter_end Is_quarter_start \\\n", "0 False True False True \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", "\n", " Is_year_end Is_year_start Elapsed \n", "0 False True 1356998400 \n", "1 False False 1357084800 \n", "2 False False 1357171200 \n", "3 False False 1357257600 \n", "4 False False 1357516800 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(1218, 15)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbronpromotionYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
01254970402017-08-16196995False2017833162228FalseFalseFalseFalseFalseFalse1502841600
11254970412017-08-16199197False2017833162228FalseFalseFalseFalseFalseFalse1502841600
21254970422017-08-161103501False2017833162228FalseFalseFalseFalseFalseFalse1502841600
31254970432017-08-161103520False2017833162228FalseFalseFalseFalseFalseFalse1502841600
41254970442017-08-161103665False2017833162228FalseFalseFalseFalseFalseFalse1502841600
\n", "
" ], "text/plain": [ " id date store_nbr item_nbr onpromotion Year Month Week \\\n", "0 125497040 2017-08-16 1 96995 False 2017 8 33 \n", "1 125497041 2017-08-16 1 99197 False 2017 8 33 \n", "2 125497042 2017-08-16 1 103501 False 2017 8 33 \n", "3 125497043 2017-08-16 1 103520 False 2017 8 33 \n", "4 125497044 2017-08-16 1 103665 False 2017 8 33 \n", "\n", " Day Dayofweek Dayofyear Is_month_end Is_month_start Is_quarter_end \\\n", "0 16 2 228 False False False \n", "1 16 2 228 False False False \n", "2 16 2 228 False False False \n", "3 16 2 228 False False False \n", "4 16 2 228 False False False \n", "\n", " Is_quarter_start Is_year_end Is_year_start Elapsed \n", "0 False False False 1502841600 \n", "1 False False False 1502841600 \n", "2 False False False 1502841600 \n", "3 False False False 1502841600 \n", "4 False False False 1502841600 " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(3370464, 18)" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idunit_sales
01254970400
11254970410
21254970420
31254970430
41254970440
\n", "
" ], "text/plain": [ " id unit_sales\n", "0 125497040 0\n", "1 125497041 0\n", "2 125497042 0\n", "3 125497043 0\n", "4 125497044 0" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(3370464, 2)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for t in tables: display(t.head(), t.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reducing data to the last 10 weeks for the training set (16 days needed for Validation/Test)\n", "Using 2016 data lets us use the same period 16-days, as the test data, but as validation period. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(125497040, 19)\n" ] } ], "source": [ "# If done on all train data, results in 125m rows. So, we're taking a small sample of the 2016 10 weeks:\n", "train_mask_10w = (train['date'] >= '2016-06-28') & (train['date'] <= '2016-08-31')\n", "print(train.shape)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(6270562, 19)\n" ] } ], "source": [ "train = train[train_mask_10w]\n", "print(train.shape)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbrunit_salesonpromotionYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
83386215833862152016-06-2811035202.0False2016626281180FalseFalseFalseFalseFalseFalse1467072000
83386216833862162016-06-2811036654.0False2016626281180FalseFalseFalseFalseFalseFalse1467072000
83386217833862172016-06-2811055741.0False2016626281180FalseFalseFalseFalseFalseFalse1467072000
83386218833862182016-06-28110557524.0False2016626281180FalseFalseFalseFalseFalseFalse1467072000
83386219833862192016-06-2811056932.0False2016626281180FalseFalseFalseFalseFalseFalse1467072000
\n", "
" ], "text/plain": [ " id date store_nbr item_nbr unit_sales onpromotion \\\n", "83386215 83386215 2016-06-28 1 103520 2.0 False \n", "83386216 83386216 2016-06-28 1 103665 4.0 False \n", "83386217 83386217 2016-06-28 1 105574 1.0 False \n", "83386218 83386218 2016-06-28 1 105575 24.0 False \n", "83386219 83386219 2016-06-28 1 105693 2.0 False \n", "\n", " Year Month Week Day Dayofweek Dayofyear Is_month_end \\\n", "83386215 2016 6 26 28 1 180 False \n", "83386216 2016 6 26 28 1 180 False \n", "83386217 2016 6 26 28 1 180 False \n", "83386218 2016 6 26 28 1 180 False \n", "83386219 2016 6 26 28 1 180 False \n", "\n", " Is_month_start Is_quarter_end Is_quarter_start Is_year_end \\\n", "83386215 False False False False \n", "83386216 False False False False \n", "83386217 False False False False \n", "83386218 False False False False \n", "83386219 False False False False \n", "\n", " Is_year_start Elapsed \n", "83386215 False 1467072000 \n", "83386216 False 1467072000 \n", "83386217 False 1467072000 \n", "83386218 False 1467072000 \n", "83386219 False 1467072000 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(83488, 16)\n" ] } ], "source": [ "transactions_mask_10w = (transactions['date'] >= '2016-06-28') & (transactions['date'] <= '2016-08-31')\n", "print(transactions.shape)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3418, 16)\n" ] } ], "source": [ "transactions = transactions[transactions_mask_10w]\n", "print(transactions.shape)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datestore_nbrtransactionsYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
617152016-06-28116992016626281180FalseFalseFalseFalseFalseFalse1467072000
617162016-06-28217062016626281180FalseFalseFalseFalseFalseFalse1467072000
617172016-06-28326882016626281180FalseFalseFalseFalseFalseFalse1467072000
617182016-06-28412362016626281180FalseFalseFalseFalseFalseFalse1467072000
617192016-06-28511562016626281180FalseFalseFalseFalseFalseFalse1467072000
\n", "
" ], "text/plain": [ " date store_nbr transactions Year Month Week Day Dayofweek \\\n", "61715 2016-06-28 1 1699 2016 6 26 28 1 \n", "61716 2016-06-28 2 1706 2016 6 26 28 1 \n", "61717 2016-06-28 3 2688 2016 6 26 28 1 \n", "61718 2016-06-28 4 1236 2016 6 26 28 1 \n", "61719 2016-06-28 5 1156 2016 6 26 28 1 \n", "\n", " Dayofyear Is_month_end Is_month_start Is_quarter_end \\\n", "61715 180 False False False \n", "61716 180 False False False \n", "61717 180 False False False \n", "61718 180 False False False \n", "61719 180 False False False \n", "\n", " Is_quarter_start Is_year_end Is_year_start Elapsed \n", "61715 False False False 1467072000 \n", "61716 False False False 1467072000 \n", "61717 False False False 1467072000 \n", "61718 False False False 1467072000 \n", "61719 False False False 1467072000 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transactions.head()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(350, 19)\n" ] } ], "source": [ "holidays_events_mask_10w = (holidays_events['date'] >= '2016-06-28') & (holidays_events['date'] <= '2016-08-31')\n", "print(holidays_events.shape)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(11, 19)\n" ] } ], "source": [ "holidays_events = holidays_events[holidays_events_mask_10w]\n", "print(holidays_events.shape)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetypelocalelocale_namedescriptiontransferredYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
2612016-07-03HolidayLocalEl CarmenCantonizacion de El CarmenFalse201672636185FalseFalseFalseFalseFalseFalse1467504000
2622016-07-03HolidayLocalSanto DomingoFundacion de Santo DomingoFalse201672636185FalseFalseFalseFalseFalseFalse1467504000
2632016-07-23HolidayLocalCayambeCantonizacion de CayambeFalse2016729235205FalseFalseFalseFalseFalseFalse1469232000
2642016-07-24AdditionalLocalGuayaquilFundacion de Guayaquil-1False2016729246206FalseFalseFalseFalseFalseFalse1469318400
2652016-07-24TransferLocalGuayaquilTraslado Fundacion de GuayaquilFalse2016729246206FalseFalseFalseFalseFalseFalse1469318400
\n", "
" ], "text/plain": [ " date type locale locale_name \\\n", "261 2016-07-03 Holiday Local El Carmen \n", "262 2016-07-03 Holiday Local Santo Domingo \n", "263 2016-07-23 Holiday Local Cayambe \n", "264 2016-07-24 Additional Local Guayaquil \n", "265 2016-07-24 Transfer Local Guayaquil \n", "\n", " description transferred Year Month Week Day \\\n", "261 Cantonizacion de El Carmen False 2016 7 26 3 \n", "262 Fundacion de Santo Domingo False 2016 7 26 3 \n", "263 Cantonizacion de Cayambe False 2016 7 29 23 \n", "264 Fundacion de Guayaquil-1 False 2016 7 29 24 \n", "265 Traslado Fundacion de Guayaquil False 2016 7 29 24 \n", "\n", " Dayofweek Dayofyear Is_month_end Is_month_start Is_quarter_end \\\n", "261 6 185 False False False \n", "262 6 185 False False False \n", "263 5 205 False False False \n", "264 6 206 False False False \n", "265 6 206 False False False \n", "\n", " Is_quarter_start Is_year_end Is_year_start Elapsed \n", "261 False False False 1467504000 \n", "262 False False False 1467504000 \n", "263 False False False 1469232000 \n", "264 False False False 1469318400 \n", "265 False False False 1469318400 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "holidays_events.head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1218, 15)\n" ] } ], "source": [ "oil_mask_10w = (oil['date'] >= '2016-06-28') & (oil['date'] <= '2016-08-31')\n", "print(oil.shape)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(47, 15)\n" ] } ], "source": [ "oil = oil[oil_mask_10w]\n", "print(oil.shape)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datedcoilwticoYearMonthWeekDayDayofweekDayofyearIs_month_endIs_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsed
9102016-06-2847.932016626281180FalseFalseFalseFalseFalseFalse1467072000
9112016-06-2949.852016626292181FalseFalseFalseFalseFalseFalse1467158400
9122016-06-3048.272016626303182TrueFalseTrueFalseFalseFalse1467244800
9132016-07-0149.02201672614183FalseTrueFalseTrueFalseFalse1467331200
9142016-07-04NaN201672740186FalseFalseFalseFalseFalseFalse1467590400
\n", "
" ], "text/plain": [ " date dcoilwtico Year Month Week Day Dayofweek Dayofyear \\\n", "910 2016-06-28 47.93 2016 6 26 28 1 180 \n", "911 2016-06-29 49.85 2016 6 26 29 2 181 \n", "912 2016-06-30 48.27 2016 6 26 30 3 182 \n", "913 2016-07-01 49.02 2016 7 26 1 4 183 \n", "914 2016-07-04 NaN 2016 7 27 4 0 186 \n", "\n", " Is_month_end Is_month_start Is_quarter_end Is_quarter_start \\\n", "910 False False False False \n", "911 False False False False \n", "912 True False True False \n", "913 False True False True \n", "914 False False False False \n", "\n", " Is_year_end Is_year_start Elapsed \n", "910 False False 1467072000 \n", "911 False False 1467158400 \n", "912 False False 1467244800 \n", "913 False False 1467331200 \n", "914 False False 1467590400 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "oil.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Join the tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`join_df` is a function for joining tables on specific fields. By default, we'll be doing a left outer join of `right` on the `left` argument using the given fields for each table.\n", "\n", "Pandas does joins using the `merge` method. The `suffixes` argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a \"\\_y\" to those on the right." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def join_df(left, right, left_on, right_on=None, suffix='_y'):\n", " if right_on is None: right_on = left_on\n", " return left.merge(right, how='left', left_on=left_on, right_on=right_on, \n", " suffixes=(\"\", suffix))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can outer join all of our data into a single dataframe. Recall that in outer joins everytime a value in the joining field on the left table does not have a corresponding value on the right table, the corresponding row in the new table has Null values for all right table fields. One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.\n", "\n", "*Aside*: Why not just do an inner join?\n", "If you are assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event you are wrong or a mistake is made, an outer join followed by a null-check will catch it. (Comparing before/after # of rows for inner join is equivalent, but requires keeping track of before/after row #'s. Outer join is easier.)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined = join_df(train, stores, \"store_nbr\")\n", "len(joined[joined.type.isnull()])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbrunit_salesonpromotionYearMonthWeekDay...Is_month_startIs_quarter_endIs_quarter_startIs_year_endIs_year_startElapsedcitystatetypecluster
0833862152016-06-2811035202.0False201662628...FalseFalseFalseFalseFalse1467072000QuitoPichinchaD13
1833862162016-06-2811036654.0False201662628...FalseFalseFalseFalseFalse1467072000QuitoPichinchaD13
2833862172016-06-2811055741.0False201662628...FalseFalseFalseFalseFalse1467072000QuitoPichinchaD13
3833862182016-06-28110557524.0False201662628...FalseFalseFalseFalseFalse1467072000QuitoPichinchaD13
4833862192016-06-2811056932.0False201662628...FalseFalseFalseFalseFalse1467072000QuitoPichinchaD13
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " id date store_nbr item_nbr unit_sales onpromotion Year \\\n", "0 83386215 2016-06-28 1 103520 2.0 False 2016 \n", "1 83386216 2016-06-28 1 103665 4.0 False 2016 \n", "2 83386217 2016-06-28 1 105574 1.0 False 2016 \n", "3 83386218 2016-06-28 1 105575 24.0 False 2016 \n", "4 83386219 2016-06-28 1 105693 2.0 False 2016 \n", "\n", " Month Week Day ... Is_month_start Is_quarter_end Is_quarter_start \\\n", "0 6 26 28 ... False False False \n", "1 6 26 28 ... False False False \n", "2 6 26 28 ... False False False \n", "3 6 26 28 ... False False False \n", "4 6 26 28 ... False False False \n", "\n", " Is_year_end Is_year_start Elapsed city state type cluster \n", "0 False False 1467072000 Quito Pichincha D 13 \n", "1 False False 1467072000 Quito Pichincha D 13 \n", "2 False False 1467072000 Quito Pichincha D 13 \n", "3 False False 1467072000 Quito Pichincha D 13 \n", "4 False False 1467072000 Quito Pichincha D 13 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined.head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_test = join_df(test, stores, \"store_nbr\")\n", "len(joined_test[joined_test.type.isnull()])" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined = join_df(joined, items, \"item_nbr\")\n", "len(joined[joined.family.isnull()])" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbrunit_salesonpromotionYearMonthWeekDay...Is_year_endIs_year_startElapsedcitystatetypeclusterfamilyclassperishable
0833862152016-06-2811035202.0False201662628...FalseFalse1467072000QuitoPichinchaD13GROCERY I10280
1833862162016-06-2811036654.0False201662628...FalseFalse1467072000QuitoPichinchaD13BREAD/BAKERY27121
2833862172016-06-2811055741.0False201662628...FalseFalse1467072000QuitoPichinchaD13GROCERY I10450
3833862182016-06-28110557524.0False201662628...FalseFalse1467072000QuitoPichinchaD13GROCERY I10450
4833862192016-06-2811056932.0False201662628...FalseFalse1467072000QuitoPichinchaD13GROCERY I10340
\n", "

5 rows × 26 columns

\n", "
" ], "text/plain": [ " id date store_nbr item_nbr unit_sales onpromotion Year \\\n", "0 83386215 2016-06-28 1 103520 2.0 False 2016 \n", "1 83386216 2016-06-28 1 103665 4.0 False 2016 \n", "2 83386217 2016-06-28 1 105574 1.0 False 2016 \n", "3 83386218 2016-06-28 1 105575 24.0 False 2016 \n", "4 83386219 2016-06-28 1 105693 2.0 False 2016 \n", "\n", " Month Week Day ... Is_year_end Is_year_start Elapsed city \\\n", "0 6 26 28 ... False False 1467072000 Quito \n", "1 6 26 28 ... False False 1467072000 Quito \n", "2 6 26 28 ... False False 1467072000 Quito \n", "3 6 26 28 ... False False 1467072000 Quito \n", "4 6 26 28 ... False False 1467072000 Quito \n", "\n", " state type cluster family class perishable \n", "0 Pichincha D 13 GROCERY I 1028 0 \n", "1 Pichincha D 13 BREAD/BAKERY 2712 1 \n", "2 Pichincha D 13 GROCERY I 1045 0 \n", "3 Pichincha D 13 GROCERY I 1045 0 \n", "4 Pichincha D 13 GROCERY I 1034 0 \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined.head()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_test = join_df(joined_test, items, \"item_nbr\")\n", "len(joined_test[joined_test.family.isnull()])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined = join_df(joined, transactions, [\"date\", \"store_nbr\"] )\n", "len(joined[joined.store_nbr.isnull()])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_test = join_df(joined_test, transactions, [\"date\", \"store_nbr\"] )\n", "len(joined_test[joined_test.store_nbr.isnull()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **Note**: at this stage, we don't incorporate the Holidays (needs tuning for local vs national) or the Oil prices, this will also require external data sources on the Test set.\n", "\n", "**TBD**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# we drop the duplicate columns ending with _y\n", "for df in (joined, joined_test):\n", " for c in df.columns:\n", " if c.endswith('_y'):\n", " if c in df.columns: df.drop(c, inplace=True, axis=1)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbrunit_salesonpromotionYearMonthWeekDay...Is_year_startElapsedcitystatetypeclusterfamilyclassperishabletransactions
0833862152016-06-2811035202.0False201662628...False1467072000QuitoPichinchaD13GROCERY I102801699
1833862162016-06-2811036654.0False201662628...False1467072000QuitoPichinchaD13BREAD/BAKERY271211699
2833862172016-06-2811055741.0False201662628...False1467072000QuitoPichinchaD13GROCERY I104501699
3833862182016-06-28110557524.0False201662628...False1467072000QuitoPichinchaD13GROCERY I104501699
4833862192016-06-2811056932.0False201662628...False1467072000QuitoPichinchaD13GROCERY I103401699
\n", "

5 rows × 27 columns

\n", "
" ], "text/plain": [ " id date store_nbr item_nbr unit_sales onpromotion Year \\\n", "0 83386215 2016-06-28 1 103520 2.0 False 2016 \n", "1 83386216 2016-06-28 1 103665 4.0 False 2016 \n", "2 83386217 2016-06-28 1 105574 1.0 False 2016 \n", "3 83386218 2016-06-28 1 105575 24.0 False 2016 \n", "4 83386219 2016-06-28 1 105693 2.0 False 2016 \n", "\n", " Month Week Day ... Is_year_start Elapsed city state \\\n", "0 6 26 28 ... False 1467072000 Quito Pichincha \n", "1 6 26 28 ... False 1467072000 Quito Pichincha \n", "2 6 26 28 ... False 1467072000 Quito Pichincha \n", "3 6 26 28 ... False 1467072000 Quito Pichincha \n", "4 6 26 28 ... False 1467072000 Quito Pichincha \n", "\n", " type cluster family class perishable transactions \n", "0 D 13 GROCERY I 1028 0 1699 \n", "1 D 13 BREAD/BAKERY 2712 1 1699 \n", "2 D 13 GROCERY I 1045 0 1699 \n", "3 D 13 GROCERY I 1045 0 1699 \n", "4 D 13 GROCERY I 1034 0 1699 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined.head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idstore_nbritem_nbrunit_salesYearMonthWeekDayDayofweekDayofyearElapsedclusterclassperishabletransactions
count6.270562e+066.270562e+066.270562e+066.270562e+066270562.06.270562e+066.270562e+066.270562e+066.270562e+066.270562e+066.270562e+066.270562e+066.270562e+066.270562e+066.270562e+06
mean8.652150e+072.769834e+011.100521e+067.607019e+002016.07.432728e+003.027350e+011.645794e+013.001217e+002.119157e+021.469830e+098.640504e+001.969754e+032.598797e-011.751359e+03
std1.810155e+061.620490e+015.489163e+051.841107e+010.05.761485e-012.690027e+009.196580e+002.009433e+001.868661e+011.614523e+064.582949e+001.165504e+034.385685e-019.290509e+02
min8.338622e+071.000000e+009.919700e+040.000000e+002016.06.000000e+002.600000e+011.000000e+000.000000e+001.800000e+021.467072e+091.000000e+001.002000e+030.000000e+003.950000e+02
25%8.495386e+071.300000e+016.557440e+052.000000e+002016.07.000000e+002.800000e+018.000000e+001.000000e+001.960000e+021.468454e+095.000000e+001.058000e+030.000000e+001.101000e+03
50%8.652150e+072.800000e+011.146784e+064.000000e+002016.07.000000e+003.000000e+011.700000e+013.000000e+002.120000e+021.469837e+099.000000e+001.190000e+030.000000e+001.445000e+03
75%8.808914e+074.300000e+011.464008e+068.000000e+002016.08.000000e+003.300000e+012.500000e+015.000000e+002.280000e+021.471219e+091.300000e+012.708000e+031.000000e+002.196000e+03
max8.965678e+075.400000e+012.037487e+061.375700e+042016.08.000000e+003.500000e+013.100000e+016.000000e+002.440000e+021.472602e+091.700000e+017.780000e+031.000000e+005.165000e+03
\n", "
" ], "text/plain": [ " id store_nbr item_nbr unit_sales Year \\\n", "count 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 6270562.0 \n", "mean 8.652150e+07 2.769834e+01 1.100521e+06 7.607019e+00 2016.0 \n", "std 1.810155e+06 1.620490e+01 5.489163e+05 1.841107e+01 0.0 \n", "min 8.338622e+07 1.000000e+00 9.919700e+04 0.000000e+00 2016.0 \n", "25% 8.495386e+07 1.300000e+01 6.557440e+05 2.000000e+00 2016.0 \n", "50% 8.652150e+07 2.800000e+01 1.146784e+06 4.000000e+00 2016.0 \n", "75% 8.808914e+07 4.300000e+01 1.464008e+06 8.000000e+00 2016.0 \n", "max 8.965678e+07 5.400000e+01 2.037487e+06 1.375700e+04 2016.0 \n", "\n", " Month Week Day Dayofweek Dayofyear \\\n", "count 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 \n", "mean 7.432728e+00 3.027350e+01 1.645794e+01 3.001217e+00 2.119157e+02 \n", "std 5.761485e-01 2.690027e+00 9.196580e+00 2.009433e+00 1.868661e+01 \n", "min 6.000000e+00 2.600000e+01 1.000000e+00 0.000000e+00 1.800000e+02 \n", "25% 7.000000e+00 2.800000e+01 8.000000e+00 1.000000e+00 1.960000e+02 \n", "50% 7.000000e+00 3.000000e+01 1.700000e+01 3.000000e+00 2.120000e+02 \n", "75% 8.000000e+00 3.300000e+01 2.500000e+01 5.000000e+00 2.280000e+02 \n", "max 8.000000e+00 3.500000e+01 3.100000e+01 6.000000e+00 2.440000e+02 \n", "\n", " Elapsed cluster class perishable transactions \n", "count 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 \n", "mean 1.469830e+09 8.640504e+00 1.969754e+03 2.598797e-01 1.751359e+03 \n", "std 1.614523e+06 4.582949e+00 1.165504e+03 4.385685e-01 9.290509e+02 \n", "min 1.467072e+09 1.000000e+00 1.002000e+03 0.000000e+00 3.950000e+02 \n", "25% 1.468454e+09 5.000000e+00 1.058000e+03 0.000000e+00 1.101000e+03 \n", "50% 1.469837e+09 9.000000e+00 1.190000e+03 0.000000e+00 1.445000e+03 \n", "75% 1.471219e+09 1.300000e+01 2.708000e+03 1.000000e+00 2.196000e+03 \n", "max 1.472602e+09 1.700000e+01 7.780000e+03 1.000000e+00 5.165000e+03 " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined.describe()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
iddatestore_nbritem_nbronpromotionYearMonthWeekDayDayofweek...Is_year_startElapsedcitystatetypeclusterfamilyclassperishabletransactions
01254970402017-08-16196995False2017833162...False1502841600QuitoPichinchaD13GROCERY I10930NaN
11254970412017-08-16199197False2017833162...False1502841600QuitoPichinchaD13GROCERY I10670NaN
21254970422017-08-161103501False2017833162...False1502841600QuitoPichinchaD13CLEANING30080NaN
31254970432017-08-161103520False2017833162...False1502841600QuitoPichinchaD13GROCERY I10280NaN
41254970442017-08-161103665False2017833162...False1502841600QuitoPichinchaD13BREAD/BAKERY27121NaN
\n", "

5 rows × 26 columns

\n", "
" ], "text/plain": [ " id date store_nbr item_nbr onpromotion Year Month Week \\\n", "0 125497040 2017-08-16 1 96995 False 2017 8 33 \n", "1 125497041 2017-08-16 1 99197 False 2017 8 33 \n", "2 125497042 2017-08-16 1 103501 False 2017 8 33 \n", "3 125497043 2017-08-16 1 103520 False 2017 8 33 \n", "4 125497044 2017-08-16 1 103665 False 2017 8 33 \n", "\n", " Day Dayofweek ... Is_year_start Elapsed city state \\\n", "0 16 2 ... False 1502841600 Quito Pichincha \n", "1 16 2 ... False 1502841600 Quito Pichincha \n", "2 16 2 ... False 1502841600 Quito Pichincha \n", "3 16 2 ... False 1502841600 Quito Pichincha \n", "4 16 2 ... False 1502841600 Quito Pichincha \n", "\n", " type cluster family class perishable transactions \n", "0 D 13 GROCERY I 1093 0 NaN \n", "1 D 13 GROCERY I 1067 0 NaN \n", "2 D 13 CLEANING 3008 0 NaN \n", "3 D 13 GROCERY I 1028 0 NaN \n", "4 D 13 BREAD/BAKERY 2712 1 NaN \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll fill in missing values to avoid complications with NA's. NA (not available) is how Pandas indicates missing values; many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we are picking an arbitrary signal value that doesn't otherwise appear in the data.\n", "\n", "** Note**: as seen below, its seems there are no NANs !?!?!" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check if any NANs\n", "joined.isnull().values.any()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check if any NANs (slower, more complete)\n", "joined.isnull().sum().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Durations : TBD !" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE: code from Rossmann has 25+ cells**\n", "\n", "It is common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:\n", "* Running averages\n", "* Time until next event\n", "* Time since last event\n", "\n", "This is often difficult to do with most table manipulation frameworks, since they are designed to work with relationships across columns. As such, we've created a class to handle this type of data.\n", "\n", "We'll define a function `get_elapsed` for cumulative counting across a sorted dataframe. Given a particular field `fld` to monitor, this function will start tracking time since the last occurrence of that field. When the field is seen again, the counter is set to zero.\n", "\n", "Upon initialization, this will result in datetime na's until the field is encountered. This is reset every time a new store is seen. We'll see how to use this shortly." ] }, { "cell_type": "raw", "metadata": {}, "source": [ "def get_elapsed(fld, pre):\n", " day1 = np.timedelta64(1, 'D')\n", " last_date = np.datetime64()\n", " last_store = 0\n", " res = []\n", "\n", " for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):\n", " if s != last_store:\n", " last_date = np.datetime64()\n", " last_store = s\n", " if v: last_date = d\n", " res.append(((d-last_date).astype('timedelta64[D]') / day1).astype(int))\n", " df[pre+fld] = res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll be applying this to a subset of columns:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "columns = [\"Date\", \"Store\", \"Promo\", \"StateHoliday\", \"SchoolHoliday\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TBD" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "raw", "metadata": {}, "source": [ "%add_datepart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create features" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
id8338621583386216833862178338621883386219
date2016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:00
store_nbr11111
item_nbr103520103665105574105575105693
unit_sales241242
onpromotionFalseFalseFalseFalseFalse
Year20162016201620162016
Month66666
Week2626262626
Day2828282828
Dayofweek11111
Dayofyear180180180180180
Is_month_endFalseFalseFalseFalseFalse
Is_month_startFalseFalseFalseFalseFalse
Is_quarter_endFalseFalseFalseFalseFalse
Is_quarter_startFalseFalseFalseFalseFalse
Is_year_endFalseFalseFalseFalseFalse
Is_year_startFalseFalseFalseFalseFalse
Elapsed14670720001467072000146707200014670720001467072000
cityQuitoQuitoQuitoQuitoQuito
statePichinchaPichinchaPichinchaPichinchaPichincha
typeDDDDD
cluster1313131313
familyGROCERY IBREAD/BAKERYGROCERY IGROCERY IGROCERY I
class10282712104510451034
perishable01000
transactions16991699169916991699
\n", "
" ], "text/plain": [ " 0 1 \\\n", "id 83386215 83386216 \n", "date 2016-06-28 00:00:00 2016-06-28 00:00:00 \n", "store_nbr 1 1 \n", "item_nbr 103520 103665 \n", "unit_sales 2 4 \n", "onpromotion False False \n", "Year 2016 2016 \n", "Month 6 6 \n", "Week 26 26 \n", "Day 28 28 \n", "Dayofweek 1 1 \n", "Dayofyear 180 180 \n", "Is_month_end False False \n", "Is_month_start False False \n", "Is_quarter_end False False \n", "Is_quarter_start False False \n", "Is_year_end False False \n", "Is_year_start False False \n", "Elapsed 1467072000 1467072000 \n", "city Quito Quito \n", "state Pichincha Pichincha \n", "type D D \n", "cluster 13 13 \n", "family GROCERY I BREAD/BAKERY \n", "class 1028 2712 \n", "perishable 0 1 \n", "transactions 1699 1699 \n", "\n", " 2 3 \\\n", "id 83386217 83386218 \n", "date 2016-06-28 00:00:00 2016-06-28 00:00:00 \n", "store_nbr 1 1 \n", "item_nbr 105574 105575 \n", "unit_sales 1 24 \n", "onpromotion False False \n", "Year 2016 2016 \n", "Month 6 6 \n", "Week 26 26 \n", "Day 28 28 \n", "Dayofweek 1 1 \n", "Dayofyear 180 180 \n", "Is_month_end False False \n", "Is_month_start False False \n", "Is_quarter_end False False \n", "Is_quarter_start False False \n", "Is_year_end False False \n", "Is_year_start False False \n", "Elapsed 1467072000 1467072000 \n", "city Quito Quito \n", "state Pichincha Pichincha \n", "type D D \n", "cluster 13 13 \n", "family GROCERY I GROCERY I \n", "class 1045 1045 \n", "perishable 0 0 \n", "transactions 1699 1699 \n", "\n", " 4 \n", "id 83386219 \n", "date 2016-06-28 00:00:00 \n", "store_nbr 1 \n", "item_nbr 105693 \n", "unit_sales 2 \n", "onpromotion False \n", "Year 2016 \n", "Month 6 \n", "Week 26 \n", "Day 28 \n", "Dayofweek 1 \n", "Dayofyear 180 \n", "Is_month_end False \n", "Is_month_start False \n", "Is_quarter_end False \n", "Is_quarter_start False \n", "Is_year_end False \n", "Is_year_start False \n", "Elapsed 1467072000 \n", "city Quito \n", "state Pichincha \n", "type D \n", "cluster 13 \n", "family GROCERY I \n", "class 1034 \n", "perishable 0 \n", "transactions 1699 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look at all columns pivoted to rows\n", "joined.head().T.head(40)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# dropping \"Elasped\" as it generates an error later, due to crazy 10 digits\n", "joined.drop(['Elapsed'],axis = 1, inplace = True)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
id8338621583386216833862178338621883386219
date2016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:00
store_nbr11111
item_nbr103520103665105574105575105693
unit_sales241242
onpromotionFalseFalseFalseFalseFalse
Year20162016201620162016
Month66666
Week2626262626
Day2828282828
Dayofweek11111
Dayofyear180180180180180
Is_month_endFalseFalseFalseFalseFalse
Is_month_startFalseFalseFalseFalseFalse
Is_quarter_endFalseFalseFalseFalseFalse
Is_quarter_startFalseFalseFalseFalseFalse
Is_year_endFalseFalseFalseFalseFalse
Is_year_startFalseFalseFalseFalseFalse
cityQuitoQuitoQuitoQuitoQuito
statePichinchaPichinchaPichinchaPichinchaPichincha
typeDDDDD
cluster1313131313
familyGROCERY IBREAD/BAKERYGROCERY IGROCERY IGROCERY I
class10282712104510451034
perishable01000
transactions16991699169916991699
\n", "
" ], "text/plain": [ " 0 1 \\\n", "id 83386215 83386216 \n", "date 2016-06-28 00:00:00 2016-06-28 00:00:00 \n", "store_nbr 1 1 \n", "item_nbr 103520 103665 \n", "unit_sales 2 4 \n", "onpromotion False False \n", "Year 2016 2016 \n", "Month 6 6 \n", "Week 26 26 \n", "Day 28 28 \n", "Dayofweek 1 1 \n", "Dayofyear 180 180 \n", "Is_month_end False False \n", "Is_month_start False False \n", "Is_quarter_end False False \n", "Is_quarter_start False False \n", "Is_year_end False False \n", "Is_year_start False False \n", "city Quito Quito \n", "state Pichincha Pichincha \n", "type D D \n", "cluster 13 13 \n", "family GROCERY I BREAD/BAKERY \n", "class 1028 2712 \n", "perishable 0 1 \n", "transactions 1699 1699 \n", "\n", " 2 3 \\\n", "id 83386217 83386218 \n", "date 2016-06-28 00:00:00 2016-06-28 00:00:00 \n", "store_nbr 1 1 \n", "item_nbr 105574 105575 \n", "unit_sales 1 24 \n", "onpromotion False False \n", "Year 2016 2016 \n", "Month 6 6 \n", "Week 26 26 \n", "Day 28 28 \n", "Dayofweek 1 1 \n", "Dayofyear 180 180 \n", "Is_month_end False False \n", "Is_month_start False False \n", "Is_quarter_end False False \n", "Is_quarter_start False False \n", "Is_year_end False False \n", "Is_year_start False False \n", "city Quito Quito \n", "state Pichincha Pichincha \n", "type D D \n", "cluster 13 13 \n", "family GROCERY I GROCERY I \n", "class 1045 1045 \n", "perishable 0 0 \n", "transactions 1699 1699 \n", "\n", " 4 \n", "id 83386219 \n", "date 2016-06-28 00:00:00 \n", "store_nbr 1 \n", "item_nbr 105693 \n", "unit_sales 2 \n", "onpromotion False \n", "Year 2016 \n", "Month 6 \n", "Week 26 \n", "Day 28 \n", "Dayofweek 1 \n", "Dayofyear 180 \n", "Is_month_end False \n", "Is_month_start False \n", "Is_quarter_end False \n", "Is_quarter_start False \n", "Is_year_end False \n", "Is_year_start False \n", "city Quito \n", "state Pichincha \n", "type D \n", "cluster 13 \n", "family GROCERY I \n", "class 1034 \n", "perishable 0 \n", "transactions 1699 " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined.head().T.head(40)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've engineered all our features, we need to convert to input compatible with a neural network.\n", "\n", "This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc..." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "cat_vars = ['store_nbr', 'item_nbr', 'onpromotion', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',\n", " 'city', 'state', 'type', 'cluster', 'family', 'class', 'perishable']" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "contin_vars = ['transactions']" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6270562" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = len(joined); n" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "for v in cat_vars:\n", " joined[v] = joined[v].astype('category').cat.as_ordered()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "for v in cat_vars:\n", " joined_test[v] = joined_test[v].astype('category').cat.as_ordered()" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "for v in contin_vars:\n", " joined[v] = joined[v].astype('float32')\n", " joined_test[v] = joined_test[v].astype('float32')" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "for v in contin_vars:\n", " joined[v] = joined[v].astype('float32')\n" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "dep = 'unit_sales'\n", "joined = joined[cat_vars+contin_vars+[dep, 'date']].copy()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "joined_test[dep] = 0\n", "joined_test = joined_test[cat_vars+contin_vars+[dep, 'date', 'id']].copy()" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
store_nbr11111
item_nbr103520103665105574105575105693
onpromotionFalseFalseFalseFalseFalse
Year20162016201620162016
Month66666
Week2626262626
Day2828282828
Dayofweek11111
Dayofyear180180180180180
cityQuitoQuitoQuitoQuitoQuito
statePichinchaPichinchaPichinchaPichinchaPichincha
typeDDDDD
cluster1313131313
familyGROCERY IBREAD/BAKERYGROCERY IGROCERY IGROCERY I
class10282712104510451034
perishable01000
transactions16991699169916991699
unit_sales241242
date2016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:002016-06-28 00:00:00
\n", "
" ], "text/plain": [ " 0 1 2 \\\n", "store_nbr 1 1 1 \n", "item_nbr 103520 103665 105574 \n", "onpromotion False False False \n", "Year 2016 2016 2016 \n", "Month 6 6 6 \n", "Week 26 26 26 \n", "Day 28 28 28 \n", "Dayofweek 1 1 1 \n", "Dayofyear 180 180 180 \n", "city Quito Quito Quito \n", "state Pichincha Pichincha Pichincha \n", "type D D D \n", "cluster 13 13 13 \n", "family GROCERY I BREAD/BAKERY GROCERY I \n", "class 1028 2712 1045 \n", "perishable 0 1 0 \n", "transactions 1699 1699 1699 \n", "unit_sales 2 4 1 \n", "date 2016-06-28 00:00:00 2016-06-28 00:00:00 2016-06-28 00:00:00 \n", "\n", " 3 4 \n", "store_nbr 1 1 \n", "item_nbr 105575 105693 \n", "onpromotion False False \n", "Year 2016 2016 \n", "Month 6 6 \n", "Week 26 26 \n", "Day 28 28 \n", "Dayofweek 1 1 \n", "Dayofyear 180 180 \n", "city Quito Quito \n", "state Pichincha Pichincha \n", "type D D \n", "cluster 13 13 \n", "family GROCERY I GROCERY I \n", "class 1045 1034 \n", "perishable 0 0 \n", "transactions 1699 1699 \n", "unit_sales 24 2 \n", "date 2016-06-28 00:00:00 2016-06-28 00:00:00 " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined.head().T.head(40)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
store_nbr11111
item_nbr9699599197103501103520103665
onpromotionFalseFalseFalseFalseFalse
Year20172017201720172017
Month88888
Week3333333333
Day1616161616
Dayofweek22222
Dayofyear228228228228228
cityQuitoQuitoQuitoQuitoQuito
statePichinchaPichinchaPichinchaPichinchaPichincha
typeDDDDD
cluster1313131313
familyGROCERY IGROCERY ICLEANINGGROCERY IBREAD/BAKERY
class10931067300810282712
perishable00001
transactionsNaNNaNNaNNaNNaN
unit_sales00000
date2017-08-16 00:00:002017-08-16 00:00:002017-08-16 00:00:002017-08-16 00:00:002017-08-16 00:00:00
id125497040125497041125497042125497043125497044
\n", "
" ], "text/plain": [ " 0 1 2 \\\n", "store_nbr 1 1 1 \n", "item_nbr 96995 99197 103501 \n", "onpromotion False False False \n", "Year 2017 2017 2017 \n", "Month 8 8 8 \n", "Week 33 33 33 \n", "Day 16 16 16 \n", "Dayofweek 2 2 2 \n", "Dayofyear 228 228 228 \n", "city Quito Quito Quito \n", "state Pichincha Pichincha Pichincha \n", "type D D D \n", "cluster 13 13 13 \n", "family GROCERY I GROCERY I CLEANING \n", "class 1093 1067 3008 \n", "perishable 0 0 0 \n", "transactions NaN NaN NaN \n", "unit_sales 0 0 0 \n", "date 2017-08-16 00:00:00 2017-08-16 00:00:00 2017-08-16 00:00:00 \n", "id 125497040 125497041 125497042 \n", "\n", " 3 4 \n", "store_nbr 1 1 \n", "item_nbr 103520 103665 \n", "onpromotion False False \n", "Year 2017 2017 \n", "Month 8 8 \n", "Week 33 33 \n", "Day 16 16 \n", "Dayofweek 2 2 \n", "Dayofyear 228 228 \n", "city Quito Quito \n", "state Pichincha Pichincha \n", "type D D \n", "cluster 13 13 \n", "family GROCERY I BREAD/BAKERY \n", "class 1028 2712 \n", "perishable 0 1 \n", "transactions NaN NaN \n", "unit_sales 0 0 \n", "date 2017-08-16 00:00:00 2017-08-16 00:00:00 \n", "id 125497043 125497044 " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_test.head().T.head(40)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "# check this cell function ?\n", "apply_cats(joined_test, joined)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to run on a sample." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1254112" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idxs = get_cv_idxs(n)\n", "joined_samp = joined.iloc[idxs].set_index(\"date\")\n", "samp_size = len(joined_samp)\n", "samp_size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run on the full dataset, use this instead:" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "samp_size = n\n", "joined_samp = joined.set_index(\"Date\")" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "samp_size = len(joined_samp)\n", "samp_size" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
store_nbritem_nbronpromotionYearMonthWeekDayDayofweekDayofyearcitystatetypeclusterfamilyclassperishabletransactionsunit_sales
date
2016-07-05361464079False201672751187LibertadGuayasE10BEVERAGES119001224.01.0
2016-08-0612414620False201683165219LatacungaCotopaxiC15GROCERY I101601162.01.0
\n", "
" ], "text/plain": [ " store_nbr item_nbr onpromotion Year Month Week Day Dayofweek \\\n", "date \n", "2016-07-05 36 1464079 False 2016 7 27 5 1 \n", "2016-08-06 12 414620 False 2016 8 31 6 5 \n", "\n", " Dayofyear city state type cluster family class \\\n", "date \n", "2016-07-05 187 Libertad Guayas E 10 BEVERAGES 1190 \n", "2016-08-06 219 Latacunga Cotopaxi C 15 GROCERY I 1016 \n", "\n", " perishable transactions unit_sales \n", "date \n", "2016-07-05 0 1224.0 1.0 \n", "2016-08-06 0 1162.0 1.0 " ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_samp.head(2)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
store_nbritem_nbronpromotionYearMonthWeekDayDayofweekDayofyearcitystatetypeclusterfamilyclassperishabletransactionsunit_sales
date
2016-07-0622414620False201672762188PuyoPastazaC7GROCERY I10160712.05.0
2016-08-1646513853False2016833161229QuitoPichinchaA14CLEANING303402877.01.0
2016-08-1534890371False2016833150228GuayaquilGuayasB6GROCERY I100202227.01.0
2016-08-16171221045False2016833161229QuitoPichinchaC12GROCERY I101001530.07.0
2016-06-3091354382False2016626303182QuitoPichinchaB6GROCERY I104201824.04.0
\n", "
" ], "text/plain": [ " store_nbr item_nbr onpromotion Year Month Week Day Dayofweek \\\n", "date \n", "2016-07-06 22 414620 False 2016 7 27 6 2 \n", "2016-08-16 46 513853 False 2016 8 33 16 1 \n", "2016-08-15 34 890371 False 2016 8 33 15 0 \n", "2016-08-16 17 1221045 False 2016 8 33 16 1 \n", "2016-06-30 9 1354382 False 2016 6 26 30 3 \n", "\n", " Dayofyear city state type cluster family class \\\n", "date \n", "2016-07-06 188 Puyo Pastaza C 7 GROCERY I 1016 \n", "2016-08-16 229 Quito Pichincha A 14 CLEANING 3034 \n", "2016-08-15 228 Guayaquil Guayas B 6 GROCERY I 1002 \n", "2016-08-16 229 Quito Pichincha C 12 GROCERY I 1010 \n", "2016-06-30 182 Quito Pichincha B 6 GROCERY I 1042 \n", "\n", " perishable transactions unit_sales \n", "date \n", "2016-07-06 0 712.0 5.0 \n", "2016-08-16 0 2877.0 1.0 \n", "2016-08-15 0 2227.0 1.0 \n", "2016-08-16 0 1530.0 7.0 \n", "2016-06-30 0 1824.0 4.0 " ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_samp.tail()" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "%proc_df\n", "[x, y, nas, mapper(optional)]:\n", "\n", " x: x is the transformed version of df. x will not have the response variable\n", " and is entirely numeric.\n", "\n", " y: y is the response variable\n", "\n", " nas (handles missing values): returns a dictionary of which nas it created, and the associated median.\n", "\n", " mapper: A DataFrameMapper which stores the mean and standard deviation of the corresponding continous\n", " variables which is then used for scaling of during test-time.\n" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "df, y, nas, mapper = proc_df(joined_samp, 'unit_sales', do_scale=True)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/eric/anaconda3/envs/fastai/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] } ], "source": [ "yl = np.log(y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# df is now a entirely numeric dataframe, without the \"unit sales\" columns\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1. , 1. , 5. , 12. , 10. , 3. , 90. , 5. , 3. , 4. ,\n", " 2. , 3. , 8. , 5. , 7. , 1. , 2. , 1. , 2. , 8. ,\n", " ..., 3. , 8. , 11. , 7. , 2. , 12. , 183.517, 1. , 14. ,\n", " 1. , 3. , 6. , 5. , 5. , 7. , 5. , 1. , 1. , 7. ,\n", " 4. ])" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# y contains the \"unit sales\" now\n", "y" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "min(y)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1254112" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(y)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "13757.0" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max(y)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0. , 0. , 1.60944, 2.48491, 2.30259, 1.09861, 4.49981, 1.60944, 1.09861, 1.38629,\n", " 0.69315, 1.09861, 2.07944, 1.60944, 1.94591, 0. , 0.69315, 0. , 0.69315, 2.07944,\n", " ..., 1.09861, 2.07944, 2.3979 , 1.94591, 0.69315, 2.48491, 5.21231, 0. , 2.63906,\n", " 0. , 1.09861, 1.79176, 1.60944, 1.60944, 1.94591, 1.60944, 0. , 0. , 1.94591,\n", " 1.38629])" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "yl" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1254112" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(yl)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.isnan(y).any()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "joined_test = joined_test.set_index(\"date\")" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
store_nbritem_nbronpromotionYearMonthWeekDayDayofweekDayofyearcitystatetypeclusterfamilyclassperishabletransactionsunit_salesid
date
2017-08-161NaNFalse2017833162228QuitoPichinchaD13GROCERY I10930NaN0125497040
2017-08-16199197.0False2017833162228QuitoPichinchaD13GROCERY I10670NaN0125497041
2017-08-161103501.0False2017833162228QuitoPichinchaD13CLEANING30080NaN0125497042
2017-08-161103520.0False2017833162228QuitoPichinchaD13GROCERY I10280NaN0125497043
2017-08-161103665.0False2017833162228QuitoPichinchaD13BREAD/BAKERY27121NaN0125497044
\n", "
" ], "text/plain": [ " store_nbr item_nbr onpromotion Year Month Week Day Dayofweek \\\n", "date \n", "2017-08-16 1 NaN False 2017 8 33 16 2 \n", "2017-08-16 1 99197.0 False 2017 8 33 16 2 \n", "2017-08-16 1 103501.0 False 2017 8 33 16 2 \n", "2017-08-16 1 103520.0 False 2017 8 33 16 2 \n", "2017-08-16 1 103665.0 False 2017 8 33 16 2 \n", "\n", " Dayofyear city state type cluster family class \\\n", "date \n", "2017-08-16 228 Quito Pichincha D 13 GROCERY I 1093 \n", "2017-08-16 228 Quito Pichincha D 13 GROCERY I 1067 \n", "2017-08-16 228 Quito Pichincha D 13 CLEANING 3008 \n", "2017-08-16 228 Quito Pichincha D 13 GROCERY I 1028 \n", "2017-08-16 228 Quito Pichincha D 13 BREAD/BAKERY 2712 \n", "\n", " perishable transactions unit_sales id \n", "date \n", "2017-08-16 0 NaN 0 125497040 \n", "2017-08-16 0 NaN 0 125497041 \n", "2017-08-16 0 NaN 0 125497042 \n", "2017-08-16 0 NaN 0 125497043 \n", "2017-08-16 1 NaN 0 125497044 " ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joined_test.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# joined_test.drop(['transactions'], axis = 1, inplace = True)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "df_test, _, nas, mapper = proc_df(joined_test, 'unit_sales', do_scale=True, skip_flds=['transactions'],\n", " na_dict=nas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In time series data, cross-validation is not random. Instead, our holdout data is generally the most recent data, as it would be in real application. This issue is discussed in detail in [this post](http://www.fast.ai/2017/11/13/validation-sets/) on our web site.\n", "\n", "One approach is to take the last 25% of rows (sorted by date) as our validation set." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "#ratio of .754 is 16 days by 65 days, to be close to real test duration\n", "train_ratio = 0.754\n", "# train_ratio = 0.9\n", "train_size = int(samp_size * train_ratio); train_size\n", "val_idx = list(range(train_size, len(df)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An even better option for picking a validation set is using the exact same length of time period as the test set uses - this is implemented here:" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "val_idx = np.flatnonzero(\n", " (df.index<=datetime.datetime(2016,8,16)) & (df.index>=datetime.datetime(2016,8,31)))" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "308512" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(val_idx)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1254112" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "samp_size" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.753999642775127" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 - (len(val_idx)/ samp_size)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deep Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're ready to put together our models.\n", "\n", "Root-mean-squared percent error is the metric Kaggle used for this ROSSMANN competition." ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "#from Rossmann\n", "def inv_y(a): return np.exp(a)\n", "\n", "def exp_rmspe(y_pred, targ):\n", " targ = inv_y(targ)\n", " pct_var = (targ - inv_y(y_pred))/targ\n", " return math.sqrt((pct_var**2).mean())\n", "\n", "max_log_y = np.max(y)\n", "y_range = (0, max_log_y*1.2)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "# Favorita\n", "# Normalized Weighted Root Mean Squared Logarithmic Error (NWRMSLE)\n", "# https://www.kaggle.com/tkm2261/eval-metric-and-evaluating-last-year-sales-bench\n", "WEIGHTS = \n", "def NWRMSLE(y, pred):\n", " y = y.clip(0, y.max())\n", " pred = pred.clip(0, pred.max())\n", " score = np.nansum(WEIGHTS * ((np.log1p(pred) - np.log1p(y)) ** 2)) / WEIGHTS.sum()\n", " return np.sqrt(score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a ModelData object directly from out data frame." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=512,\n", " test_df=df_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some categorical variables have a lot more levels than others. Store, in particular, has over a thousand in the Rossmann competition.\n", "Let's see in Favorita." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('store_nbr', 54),\n", " ('item_nbr', 3660),\n", " ('onpromotion', 3),\n", " ('Year', 2),\n", " ('Month', 4),\n", " ('Week', 11),\n", " ('Day', 32),\n", " ('Dayofweek', 8),\n", " ('Dayofyear', 66),\n", " ('city', 23),\n", " ('state', 17),\n", " ('type', 6),\n", " ('cluster', 18),\n", " ('family', 33),\n", " ('class', 318),\n", " ('perishable', 3)]" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_sz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the *cardinality* of each variable (that is, its number of unique values) to decide how large to make its *embeddings*. Each level will be associated with a vector with length defined as below." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(54, 27),\n", " (3660, 50),\n", " (3, 2),\n", " (2, 1),\n", " (4, 2),\n", " (11, 6),\n", " (32, 16),\n", " (8, 4),\n", " (66, 33),\n", " (23, 12),\n", " (17, 9),\n", " (6, 3),\n", " (18, 9),\n", " (33, 17),\n", " (318, 50),\n", " (3, 2)]" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "emb_szs" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),\n", " 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)\n", "lr = 1e-3" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6b61bdc7eafb4c5489c86809e36cda72", "version_major": 2, "version_minor": 0 }, "text/html": [ "

Failed to display Jupyter Widget of type HBox.

\n", "

\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the Jupyter\n", " Widgets Documentation for setup instructions.\n", "

\n", "

\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or NBViewer),\n", " it may mean that your frontend doesn't currently support widgets.\n", "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " 0%| | 6/1847 [00:06<31:59, 1.04s/it, loss=inf] \n", " \r" ] } ], "source": [ "m.lr_find()" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZQAAAEOCAYAAACuOOGFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAAEz5JREFUeJzt3X2wXHV9x/H3B1KxVoyAASEhhhasE8enumCp2qFVIjrVqFCJbW1U2vSJanVqi7UdFZ2KT7U+1whWaqug0GIEa3hQqm0VcoNSCEpJox1SUKJBCqIw6Ld/7EGW2725m9zfvZubvF8zO3vO7/zOOd+998753N85u2dTVUiSNFP7jLsASdKewUCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDWxYNwFzKWHPexhtWzZsnGXIUnzysaNG79dVYum67dXBcqyZcuYmJgYdxmSNK8k+e9R+nnKS5LUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITYw2UJCckuT7J5iSnDVm+X5Jzu+VXJFk2afnSJHck+eO5qlmSNNzYAiXJvsB7gWcCy4EXJlk+qdspwK1VdSTwDuDNk5a/A/jn2a5VkjS9cY5QjgE2V9WWqrobOAdYOanPSuDsbvo84GlJApDkucAWYNMc1StJ2oFxBspi4MaB+a1d29A+VXUPcBtwUJKfAv4UeP0c1ClJGsE4AyVD2mrEPq8H3lFVd0y7k2RNkokkE9u2bduFMiVJo1gwxn1vBQ4fmF8C3DRFn61JFgALge3Ak4CTkrwFeCjwoyQ/qKr3TN5JVa0F1gL0er3JgSVJamScgbIBOCrJEcD/AKuAX5vUZx2wGvgicBLw2aoq4Kn3dkjyOuCOYWEiSZo7YwuUqronyanAemBf4ENVtSnJ6cBEVa0DzgI+kmQz/ZHJqnHVK0nasfT/4d879Hq9mpiYGHcZkjSvJNlYVb3p+vlJeUlSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNjDVQkpyQ5Pokm5OcNmT5fknO7ZZfkWRZ1358ko1Jrumef3mua5ck3d/YAiXJvsB7gWcCy4EXJlk+qdspwK1VdSTwDuDNXfu3gWdX1WOA1cBH5qZqSdJUxjlCOQbYXFVbqupu4Bxg5aQ+K4Gzu+nzgKclSVV9uapu6to3AQ9Mst+cVC1JGmqcgbIYuHFgfmvXNrRPVd0D3AYcNKnPicCXq+quWapTkjSCBWPcd4a01c70SfJo+qfBVky5k2QNsAZg6dKlO1+lJGkk4xyhbAUOH5hfAtw0VZ8kC4CFwPZufgnwT8BvVtV/TbWTqlpbVb2q6i1atKhh+ZKkQeMMlA3AUUmOSPIAYBWwblKfdfQvugOcBHy2qirJQ4GLgFdX1b/NWcWSpCmNLVC6ayKnAuuBrwIfr6pNSU5P8pyu21nAQUk2A68E7n1r8anAkcBfJPlK9zh4jl+CJGlAqiZftthz9Xq9mpiYGHcZkjSvJNlYVb3p+vlJeUlSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1MRIgZLk5Ukekr6zklyVZMVsFydJmj9GHaG8tKr+F1gBLAJeApwxa1VJkuadUQMl3fOzgL+tqqsH2iRJGjlQNia5mH6grE+yP/Cjme48yQlJrk+yOclpQ5bvl+TcbvkVSZYNLHt11359kmfMtBZJ0swsGLHfKcDjgS1VdWeSA+mf9tplSfYF3gscD2wFNiRZV1XXTdrvrVV1ZJJVwJuBk5MsB1YBjwYOAy5N8siq+uFMapIk7bpRRyjHAtdX1XeT/Abw58BtM9z3McDmqtpSVXcD5wArJ/VZCZzdTZ8HPC1JuvZzququqvo6sLnbniRpTEYNlPcDdyZ5HPAnwH8DfzfDfS8GbhyY39q1De1TVffQD7GDRlxXkjSHRg2Ue6qq6I8M3llV7wT2n+G+h13UrxH7jLJufwPJmiQTSSa2bdu2kyVKkkY1aqDcnuTVwIuAi7rrHz8xw31vBQ4fmF8C3DRVnyQLgIXA9hHXBaCq1lZVr6p6ixYtmmHJkqSpjBooJwN30f88yjfpn1566wz3vQE4KskRSR5A/yL7ukl91gGru+mTgM92I6V1wKruXWBHAEcBV86wHknSDIz0Lq+q+maSfwCOTvIrwJVVNaNrKFV1T5JTgfXAvsCHqmpTktOBiapaB5wFfCTJZvojk1XdupuSfBy4DrgH+APf4SVJ45X+P/zTdEpeQH9Ecjn96xdPBV5VVefNanWN9Xq9mpiYGHcZkjSvJNlYVb3p+o36OZTXAEdX1S3dxhcBl9J/K68kSSNfQ9nn3jDpfGcn1pUk7QVGHaF8Jsl64GPd/MnAp2enJEnSfDTqRflXJTkReDL9ayhrq+qfZrUySdK8MuoIhao6Hzh/FmuRJM1jOwyUJLcz/BPoAaqqHjIrVUmS5p0dBkpVzfT2KpKkvYTv1JIkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWpiLIGS5MAklyS5oXs+YIp+q7s+NyRZ3bU9KMlFSb6WZFOSM+a2eknSMOMaoZwGXFZVRwGXdfP3k+RA4LXAk4BjgNcOBM/bqupRwBOAJyd55tyULUmayrgCZSVwdjd9NvDcIX2eAVxSVdur6lbgEuCEqrqzqj4HUFV3A1cBS+agZknSDowrUA6pqpsBuueDh/RZDNw4ML+1a/uxJA8Fnk1/lCNJGqMFs7XhJJcCDx+y6DWjbmJIWw1sfwHwMeBdVbVlB3WsAdYALF26dMRdS5J21qwFSlU9faplSb6V5NCqujnJocAtQ7ptBY4bmF8CXD4wvxa4oar+epo61nZ96fV6taO+kqRdN65TXuuA1d30auCTQ/qsB1YkOaC7GL+iayPJG4GFwB/NQa2SpBGMK1DOAI5PcgNwfDdPkl6SMwGqajvwBmBD9zi9qrYnWUL/tNly4KokX0nyW+N4EZKk+6Rq7zkL1Ov1amJiYtxlSNK8kmRjVfWm6+cn5SVJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJamIsgZLkwCSXJLmhez5gin6ruz43JFk9ZPm6JNfOfsWSpOmMa4RyGnBZVR0FXNbN30+SA4HXAk8CjgFeOxg8SZ4P3DE35UqSpjOuQFkJnN1Nnw08d0ifZwCXVNX2qroVuAQ4ASDJg4FXAm+cg1olSSMYV6AcUlU3A3TPBw/psxi4cWB+a9cG8Abg7cCds1mkJGl0C2Zrw0kuBR4+ZNFrRt3EkLZK8njgyKp6RZJlI9SxBlgDsHTp0hF3LUnaWbMWKFX19KmWJflWkkOr6uYkhwK3DOm2FThuYH4JcDlwLPDEJN+gX//BSS6vquMYoqrWAmsBer1e7fwrkSSNYlynvNYB975razXwySF91gMrkhzQXYxfAayvqvdX1WFVtQx4CvCfU4WJJGnujCtQzgCOT3IDcHw3T5JekjMBqmo7/WslG7rH6V2bJGk3lKq95yxQr9eriYmJcZchSfNKko1V1Zuun5+UlyQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1kaoadw1zJsk24LvAbbuw+sOAb7etSDuwkF37Pe3OdtfXNK66Znu/rbffansz2c6urjvT49cjqmrRdJ32qkABSLK2qtbswnoTVdWbjZr0/+3q72l3tru+pnHVNdv7bb39VtubyXZ29+PX3njK61PjLkAj2RN/T7vraxpXXbO939bbb7W9mWxnd/0bAvbCEcqucoQiab5yhLL7WTvuAiRpF83J8csRiiSpCUcokqQmDBRJUhMGiiSpCQNlFyX5qSRnJ/lgkl8fdz2SNKokP53krCTntdyugTIgyYeS3JLk2kntJyS5PsnmJKd1zc8Hzquq3waeM+fFStKAnTl+VdWWqjqldQ0Gyv19GDhhsCHJvsB7gWcCy4EXJlkOLAFu7Lr9cA5rlKRhPszox69ZYaAMqKrPA9snNR8DbO4S/W7gHGAlsJV+qIA/R0ljtpPHr1nhgXB6i7lvJAL9IFkM/CNwYpL3s5vfDkHSXmvo8SvJQUn+BnhCkle32tmCVhvag2VIW1XV94CXzHUxkrQTpjp+fQf43dY7c4Qyva3A4QPzS4CbxlSLJO2MOT1+GSjT2wAcleSIJA8AVgHrxlyTJI1iTo9fBsqAJB8Dvgj8bJKtSU6pqnuAU4H1wFeBj1fVpnHWKUmT7Q7HL28OKUlqwhGKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCjabSW5Yw728ZyBrySYE0mOS/ILu7DeE5Kc2U2/OMl72le385Ism3zL9CF9FiX5zFzVpPEwULTH627hPVRVrauqM2Zhnzu6T95xwE4HCvBnwLt3qaAxq6ptwM1JnjzuWjR7DBTNC0lelWRDkv9I8vqB9guSbEyyKcmagfY7kpye5Arg2CTfSPL6JFcluSbJo7p+P/5PP8mHk7wryb8n2ZLkpK59nyTv6/ZxYZJP37tsUo2XJ/nLJP8CvDzJs5NckeTLSS5NckiSZfRvyveKJF9J8tTuv/fzu9e3YdhBN8n+wGOr6uohyx6R5LLuZ3NZkqVd+88k+VK3zdOHjfi6bx69KMnVSa5NcnLXfnT3c7g6yZVJ9u9GIl/ofoZXDRtlJdk3yVsHfle/M7D4AsBvN92TVZUPH7vlA7ije14BrKV/59R9gAuBX+yWHdg9/yRwLXBQN1/ACwa29Q3gD7vp3wfO7KZfDLynm/4w8IluH8vpf48EwEnAp7v2hwO3AicNqfdy4H0D8wdw390ofgt4ezf9OuCPB/p9FHhKN70U+OqQbf8ScP7A/GDdnwJWd9MvBS7opi8EXthN/+69P89J2z0R+ODA/ELgAcAW4Oiu7SH070z+IOCBXdtRwEQ3vQy4tpteA/x5N70fMAEc0c0vBq4Z99+Vj9l7ePt6zQcruseXu/kH0z+gfR54WZLnde2Hd+3fof8tmudP2s4/ds8b6X+F8zAXVNWPgOuSHNK1PQX4RNf+zSSf20Gt5w5MLwHOTXIo/YP016dY5+nA8uTHdxp/SJL9q+r2gT6HAtumWP/YgdfzEeAtA+3P7aY/CrxtyLrXAG9L8mbgwqr6QpLHADdX1QaAqvpf6I9mgPckeTz9n+8jh2xvBfDYgRHcQvq/k68DtwCHTfEatAcwUDQfBHhTVX3gfo3JcfQPxsdW1Z1JLgce2C3+QVVN/mrmu7rnHzL13/5dA9OZ9DyK7w1Mvxv4q6pa19X6uinW2Yf+a/j+Drb7fe57bdMZ+QZ9VfWfSZ4IPAt4U5KL6Z+aGraNVwDfAh7X1fyDIX1CfyS4fsiyB9J/HdpDeQ1F88F64KVJHgyQZHGSg+n/93trFyaPAn5+lvb/r/S/nXOfbtRy3IjrLQT+p5tePdB+O7D/wPzF9O8IC0A3Apjsq8CRU+zn3+nflhz61yj+tZv+Ev1TWgwsv58khwF3VtXf0x/B/BzwNeCwJEd3ffbv3mSwkP7I5UfAi4Bhb3ZYD/xekp/o1n1kN7KB/ohmh+8G0/xmoGi3V1UX0z9l88Uk1wDn0T8gfwZYkOQ/gDfQP4DOhvPpf1HRtcAHgCuA20ZY73XAJ5J8Afj2QPungOfde1EeeBnQ6y5iX8eQb9Krqq8BC7uL85O9DHhJ93N4EfDyrv2PgFcmuZL+KbNhNT8GuDLJV4DXAG+s/nePnwy8O8nVwCX0RxfvA1Yn+RL9cPjekO2dCVwHXNW9lfgD3Dca/CXgoiHraA/h7eulESR5cFXdkeQg4ErgyVX1zTmu4RXA7VV15oj9HwR8v6oqySr6F+hXzmqRO67n88DKqrp1XDVodnkNRRrNhUkeSv/i+hvmOkw67wd+dSf6P5H+RfQA36X/DrCxSLKI/vUkw2QP5ghFktSE11AkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWri/wAuNq8iO7eWkwAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m.sched.plot(100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),\n", " 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)\n", "lr = 1e-3" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4d59028d7c724ed78700c7cf12bdfed3", "version_major": 2, "version_minor": 0 }, "text/html": [ "

Failed to display Jupyter Widget of type HBox.

\n", "

\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the Jupyter\n", " Widgets Documentation for setup instructions.\n", "

\n", "

\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or NBViewer),\n", " it may mean that your frontend doesn't currently support widgets.\n", "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[ 0. nan nan nan] \n", "\n" ] } ], "source": [ "m.fit(lr, 1, metrics=[exp_rmspe])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "m.fit(lr, 3, metrics=[exp_rmspe])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }