{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Favorita 2016 10 weeks (28-06 to 31-08 : 65 days inclusive)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"%reload_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from fastai.structured import *\n",
"from fastai.column_data import *\n",
"np.set_printoptions(threshold=50, edgeitems=20)\n",
"import datetime\n",
"\n",
"PATH = 'data/favorita/'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"holidays_events.csv\n",
"items.csv\n",
"models\n",
"oil.csv\n",
"sample_submission.csv\n",
"stores.csv\n",
"test.csv\n",
"tmp\n",
"train.csv\n",
"transactions.csv\n",
"\n"
]
}
],
"source": [
"# Input data files are available in the \"../input/\" directory.\n",
"# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n",
"\n",
"from subprocess import check_output\n",
"print(check_output([\"ls\", PATH]).decode(\"utf8\"))\n",
"\n",
"# Any results you write to the current directory are saved as output."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create datasets"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"table_names = ['train', 'stores', 'items', 'transactions', \n",
" 'holidays_events', 'oil', 'test', 'sample_submission']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll be using the popular data manipulation framework pandas. Among other things, pandas allows you to manipulate tables/data frames in python as one would in a database.\n",
"\n",
"We're going to go ahead and load all of our csv's as dataframes into the list tables."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import HTML"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use `head()` to get a quick look at the contents of each table:\n",
"* **train**: includes the target unit_sales by date, store_nbr, and item_nbr and a unique id to label rows\n",
"\n",
"* **stores**: metadata; including city, state, type, and cluster; cluster is a grouping of similar stores\n",
"\n",
"* **items**: metadata; including family, class, and perishable; perishable have a score weight of 1.25; otherwise, the weight is 1.0\n",
"\n",
"* **transactions**: count of sales transactions for each date, store_nbr combination. Only included for the training data timeframe\n",
"\n",
"* **holidays_events**: metadata; Pay special attention to the transferred column.\n",
" * A holiday that is transferred officially falls on that calendar day, but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was actually celebrated, look for the corresponding row where type is Transfer.\n",
" * For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day which is a day not normally scheduled for work (e.g., Saturday) that is meant to payback the Bridge.\n",
" * Additional holidays are days added a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).\n",
"\n",
"* **oil**: Daily oil price. Includes values during both the train and test data timeframe. (Ecuador is an oil-dependent country and it's economical health is highly vulnerable to shocks in oil prices.)\n",
"\n",
"* **test**: Test data, with the date, store_nbr, item_nbr combinations that are to be predicted, along with the onpromotion information.\n",
" * NOTE: The test data has a small number of items that are not contained in the training data. Part of the exercise will be to predict a new item sales based on similar products.\n",
" * The public / private leaderboard split is based on time. All items in the public split are also included in the private split.\n",
" \n",
"* **sample_submission**: sample submission file in the correct format. It is highly recommend you zip your submission file before uploading!"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" unit_sales | \n",
" onpromotion | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 103665 | \n",
" 7.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 105574 | \n",
" 1.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 105575 | \n",
" 2.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 108079 | \n",
" 1.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 108701 | \n",
" 1.0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr unit_sales onpromotion\n",
"0 0 2013-01-01 25 103665 7.0 NaN\n",
"1 1 2013-01-01 25 105574 1.0 NaN\n",
"2 2 2013-01-01 25 105575 2.0 NaN\n",
"3 3 2013-01-01 25 108079 1.0 NaN\n",
"4 4 2013-01-01 25 108701 1.0 NaN"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(125497040, 6)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" store_nbr | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 8 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 9 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Santo Domingo | \n",
" Santo Domingo de los Tsachilas | \n",
" D | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" store_nbr city state type cluster\n",
"0 1 Quito Pichincha D 13\n",
"1 2 Quito Pichincha D 13\n",
"2 3 Quito Pichincha D 8\n",
"3 4 Quito Pichincha D 9\n",
"4 5 Santo Domingo Santo Domingo de los Tsachilas D 4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(54, 5)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" item_nbr | \n",
" family | \n",
" class | \n",
" perishable | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 96995 | \n",
" GROCERY I | \n",
" 1093 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 99197 | \n",
" GROCERY I | \n",
" 1067 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 103501 | \n",
" CLEANING | \n",
" 3008 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 103520 | \n",
" GROCERY I | \n",
" 1028 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 103665 | \n",
" BREAD/BAKERY | \n",
" 2712 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" item_nbr family class perishable\n",
"0 96995 GROCERY I 1093 0\n",
"1 99197 GROCERY I 1067 0\n",
"2 103501 CLEANING 3008 0\n",
"3 103520 GROCERY I 1028 0\n",
"4 103665 BREAD/BAKERY 2712 1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(4100, 4)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" store_nbr | \n",
" transactions | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 770 | \n",
"
\n",
" \n",
" 1 | \n",
" 2013-01-02 | \n",
" 1 | \n",
" 2111 | \n",
"
\n",
" \n",
" 2 | \n",
" 2013-01-02 | \n",
" 2 | \n",
" 2358 | \n",
"
\n",
" \n",
" 3 | \n",
" 2013-01-02 | \n",
" 3 | \n",
" 3487 | \n",
"
\n",
" \n",
" 4 | \n",
" 2013-01-02 | \n",
" 4 | \n",
" 1922 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date store_nbr transactions\n",
"0 2013-01-01 25 770\n",
"1 2013-01-02 1 2111\n",
"2 2013-01-02 2 2358\n",
"3 2013-01-02 3 3487\n",
"4 2013-01-02 4 1922"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(83488, 3)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" type | \n",
" locale | \n",
" locale_name | \n",
" description | \n",
" transferred | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2012-03-02 | \n",
" Holiday | \n",
" Local | \n",
" Manta | \n",
" Fundacion de Manta | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 2012-04-01 | \n",
" Holiday | \n",
" Regional | \n",
" Cotopaxi | \n",
" Provincializacion de Cotopaxi | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 2012-04-12 | \n",
" Holiday | \n",
" Local | \n",
" Cuenca | \n",
" Fundacion de Cuenca | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 2012-04-14 | \n",
" Holiday | \n",
" Local | \n",
" Libertad | \n",
" Cantonizacion de Libertad | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 2012-04-21 | \n",
" Holiday | \n",
" Local | \n",
" Riobamba | \n",
" Cantonizacion de Riobamba | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date type locale locale_name description \\\n",
"0 2012-03-02 Holiday Local Manta Fundacion de Manta \n",
"1 2012-04-01 Holiday Regional Cotopaxi Provincializacion de Cotopaxi \n",
"2 2012-04-12 Holiday Local Cuenca Fundacion de Cuenca \n",
"3 2012-04-14 Holiday Local Libertad Cantonizacion de Libertad \n",
"4 2012-04-21 Holiday Local Riobamba Cantonizacion de Riobamba \n",
"\n",
" transferred \n",
"0 False \n",
"1 False \n",
"2 False \n",
"3 False \n",
"4 False "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(350, 6)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" dcoilwtico | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2013-01-01 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 2013-01-02 | \n",
" 93.14 | \n",
"
\n",
" \n",
" 2 | \n",
" 2013-01-03 | \n",
" 92.97 | \n",
"
\n",
" \n",
" 3 | \n",
" 2013-01-04 | \n",
" 93.12 | \n",
"
\n",
" \n",
" 4 | \n",
" 2013-01-07 | \n",
" 93.20 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date dcoilwtico\n",
"0 2013-01-01 NaN\n",
"1 2013-01-02 93.14\n",
"2 2013-01-03 92.97\n",
"3 2013-01-04 93.12\n",
"4 2013-01-07 93.20"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(1218, 2)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" onpromotion | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 125497040 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 96995 | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 125497041 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 99197 | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" 125497042 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103501 | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 125497043 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103520 | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 125497044 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103665 | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr onpromotion\n",
"0 125497040 2017-08-16 1 96995 False\n",
"1 125497041 2017-08-16 1 99197 False\n",
"2 125497042 2017-08-16 1 103501 False\n",
"3 125497043 2017-08-16 1 103520 False\n",
"4 125497044 2017-08-16 1 103665 False"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(3370464, 5)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" unit_sales | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 125497040 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 125497041 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 125497042 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 125497043 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 125497044 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id unit_sales\n",
"0 125497040 0\n",
"1 125497041 0\n",
"2 125497042 0\n",
"3 125497043 0\n",
"4 125497044 0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(3370464, 2)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for t in tables: display(t.head(), t.shape)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" unit_sales | \n",
" onpromotion | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 1.25497e+08 | \n",
" NaN | \n",
" 1.25497e+08 | \n",
" 1.25497e+08 | \n",
" 1.25497e+08 | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 6.27485e+07 | \n",
" NaN | \n",
" 27.4646 | \n",
" 972769 | \n",
" 8.55487 | \n",
" NaN | \n",
"
\n",
" \n",
" std | \n",
" 3.62279e+07 | \n",
" NaN | \n",
" 16.3305 | \n",
" 520534 | \n",
" 23.6052 | \n",
" NaN | \n",
"
\n",
" \n",
" min | \n",
" 0 | \n",
" NaN | \n",
" 1 | \n",
" 96995 | \n",
" -15372 | \n",
" NaN | \n",
"
\n",
" \n",
" 25% | \n",
" 3.13743e+07 | \n",
" NaN | \n",
" 12 | \n",
" 522383 | \n",
" 2 | \n",
" NaN | \n",
"
\n",
" \n",
" 50% | \n",
" 6.27485e+07 | \n",
" NaN | \n",
" 28 | \n",
" 959500 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" 75% | \n",
" 9.41228e+07 | \n",
" NaN | \n",
" 43 | \n",
" 1.35438e+06 | \n",
" 9 | \n",
" NaN | \n",
"
\n",
" \n",
" max | \n",
" 1.25497e+08 | \n",
" NaN | \n",
" 54 | \n",
" 2.12711e+06 | \n",
" 89440 | \n",
" NaN | \n",
"
\n",
" \n",
" counts | \n",
" 125497040 | \n",
" 125497040 | \n",
" 125497040 | \n",
" 125497040 | \n",
" 125497040 | \n",
" 103839389 | \n",
"
\n",
" \n",
" uniques | \n",
" 125497040 | \n",
" 1684 | \n",
" 54 | \n",
" 4036 | \n",
" 258474 | \n",
" 2 | \n",
"
\n",
" \n",
" missing | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 21657651 | \n",
"
\n",
" \n",
" missing_perc | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 17.26% | \n",
"
\n",
" \n",
" types | \n",
" numeric | \n",
" categorical | \n",
" numeric | \n",
" numeric | \n",
" numeric | \n",
" bool | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr unit_sales \\\n",
"count 1.25497e+08 NaN 1.25497e+08 1.25497e+08 1.25497e+08 \n",
"mean 6.27485e+07 NaN 27.4646 972769 8.55487 \n",
"std 3.62279e+07 NaN 16.3305 520534 23.6052 \n",
"min 0 NaN 1 96995 -15372 \n",
"25% 3.13743e+07 NaN 12 522383 2 \n",
"50% 6.27485e+07 NaN 28 959500 4 \n",
"75% 9.41228e+07 NaN 43 1.35438e+06 9 \n",
"max 1.25497e+08 NaN 54 2.12711e+06 89440 \n",
"counts 125497040 125497040 125497040 125497040 125497040 \n",
"uniques 125497040 1684 54 4036 258474 \n",
"missing 0 0 0 0 0 \n",
"missing_perc 0% 0% 0% 0% 0% \n",
"types numeric categorical numeric numeric numeric \n",
"\n",
" onpromotion \n",
"count NaN \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN \n",
"counts 103839389 \n",
"uniques 2 \n",
"missing 21657651 \n",
"missing_perc 17.26% \n",
"types bool "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" store_nbr | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 54 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 54 | \n",
"
\n",
" \n",
" mean | \n",
" 27.5 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 8.48148 | \n",
"
\n",
" \n",
" std | \n",
" 15.7321 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 4.69339 | \n",
"
\n",
" \n",
" min | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 1 | \n",
"
\n",
" \n",
" 25% | \n",
" 14.25 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 4 | \n",
"
\n",
" \n",
" 50% | \n",
" 27.5 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 8.5 | \n",
"
\n",
" \n",
" 75% | \n",
" 40.75 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 13 | \n",
"
\n",
" \n",
" max | \n",
" 54 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 17 | \n",
"
\n",
" \n",
" counts | \n",
" 54 | \n",
" 54 | \n",
" 54 | \n",
" 54 | \n",
" 54 | \n",
"
\n",
" \n",
" uniques | \n",
" 54 | \n",
" 22 | \n",
" 16 | \n",
" 5 | \n",
" 17 | \n",
"
\n",
" \n",
" missing | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" missing_perc | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
"
\n",
" \n",
" types | \n",
" numeric | \n",
" categorical | \n",
" categorical | \n",
" categorical | \n",
" numeric | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" store_nbr city state type cluster\n",
"count 54 NaN NaN NaN 54\n",
"mean 27.5 NaN NaN NaN 8.48148\n",
"std 15.7321 NaN NaN NaN 4.69339\n",
"min 1 NaN NaN NaN 1\n",
"25% 14.25 NaN NaN NaN 4\n",
"50% 27.5 NaN NaN NaN 8.5\n",
"75% 40.75 NaN NaN NaN 13\n",
"max 54 NaN NaN NaN 17\n",
"counts 54 54 54 54 54\n",
"uniques 54 22 16 5 17\n",
"missing 0 0 0 0 0\n",
"missing_perc 0% 0% 0% 0% 0%\n",
"types numeric categorical categorical categorical numeric"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" item_nbr | \n",
" family | \n",
" class | \n",
" perishable | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 4100 | \n",
" NaN | \n",
" 4100 | \n",
" 4100 | \n",
"
\n",
" \n",
" mean | \n",
" 1.25144e+06 | \n",
" NaN | \n",
" 2169.65 | \n",
" 0.240488 | \n",
"
\n",
" \n",
" std | \n",
" 587687 | \n",
" NaN | \n",
" 1484.91 | \n",
" 0.427432 | \n",
"
\n",
" \n",
" min | \n",
" 96995 | \n",
" NaN | \n",
" 1002 | \n",
" 0 | \n",
"
\n",
" \n",
" 25% | \n",
" 818111 | \n",
" NaN | \n",
" 1068 | \n",
" 0 | \n",
"
\n",
" \n",
" 50% | \n",
" 1.3062e+06 | \n",
" NaN | \n",
" 2004 | \n",
" 0 | \n",
"
\n",
" \n",
" 75% | \n",
" 1.90492e+06 | \n",
" NaN | \n",
" 2990.5 | \n",
" 0 | \n",
"
\n",
" \n",
" max | \n",
" 2.13424e+06 | \n",
" NaN | \n",
" 7780 | \n",
" 1 | \n",
"
\n",
" \n",
" counts | \n",
" 4100 | \n",
" 4100 | \n",
" 4100 | \n",
" 4100 | \n",
"
\n",
" \n",
" uniques | \n",
" 4100 | \n",
" 33 | \n",
" 337 | \n",
" 2 | \n",
"
\n",
" \n",
" missing | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" missing_perc | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
"
\n",
" \n",
" types | \n",
" numeric | \n",
" categorical | \n",
" numeric | \n",
" bool | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" item_nbr family class perishable\n",
"count 4100 NaN 4100 4100\n",
"mean 1.25144e+06 NaN 2169.65 0.240488\n",
"std 587687 NaN 1484.91 0.427432\n",
"min 96995 NaN 1002 0\n",
"25% 818111 NaN 1068 0\n",
"50% 1.3062e+06 NaN 2004 0\n",
"75% 1.90492e+06 NaN 2990.5 0\n",
"max 2.13424e+06 NaN 7780 1\n",
"counts 4100 4100 4100 4100\n",
"uniques 4100 33 337 2\n",
"missing 0 0 0 0\n",
"missing_perc 0% 0% 0% 0%\n",
"types numeric categorical numeric bool"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" store_nbr | \n",
" transactions | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" NaN | \n",
" 83488 | \n",
" 83488 | \n",
"
\n",
" \n",
" mean | \n",
" NaN | \n",
" 26.9392 | \n",
" 1694.6 | \n",
"
\n",
" \n",
" std | \n",
" NaN | \n",
" 15.6082 | \n",
" 963.287 | \n",
"
\n",
" \n",
" min | \n",
" NaN | \n",
" 1 | \n",
" 5 | \n",
"
\n",
" \n",
" 25% | \n",
" NaN | \n",
" 13 | \n",
" 1046 | \n",
"
\n",
" \n",
" 50% | \n",
" NaN | \n",
" 27 | \n",
" 1393 | \n",
"
\n",
" \n",
" 75% | \n",
" NaN | \n",
" 40 | \n",
" 2079 | \n",
"
\n",
" \n",
" max | \n",
" NaN | \n",
" 54 | \n",
" 8359 | \n",
"
\n",
" \n",
" counts | \n",
" 83488 | \n",
" 83488 | \n",
" 83488 | \n",
"
\n",
" \n",
" uniques | \n",
" 1682 | \n",
" 54 | \n",
" 4993 | \n",
"
\n",
" \n",
" missing | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" missing_perc | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
"
\n",
" \n",
" types | \n",
" categorical | \n",
" numeric | \n",
" numeric | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date store_nbr transactions\n",
"count NaN 83488 83488\n",
"mean NaN 26.9392 1694.6\n",
"std NaN 15.6082 963.287\n",
"min NaN 1 5\n",
"25% NaN 13 1046\n",
"50% NaN 27 1393\n",
"75% NaN 40 2079\n",
"max NaN 54 8359\n",
"counts 83488 83488 83488\n",
"uniques 1682 54 4993\n",
"missing 0 0 0\n",
"missing_perc 0% 0% 0%\n",
"types categorical numeric numeric"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" type | \n",
" locale | \n",
" locale_name | \n",
" description | \n",
" transferred | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 350 | \n",
" 350 | \n",
" 350 | \n",
" 350 | \n",
" 350 | \n",
" 350 | \n",
"
\n",
" \n",
" unique | \n",
" 312 | \n",
" 6 | \n",
" 3 | \n",
" 24 | \n",
" 103 | \n",
" 2 | \n",
"
\n",
" \n",
" top | \n",
" 2014-06-25 | \n",
" Holiday | \n",
" National | \n",
" Ecuador | \n",
" Carnaval | \n",
" False | \n",
"
\n",
" \n",
" freq | \n",
" 4 | \n",
" 221 | \n",
" 174 | \n",
" 174 | \n",
" 10 | \n",
" 338 | \n",
"
\n",
" \n",
" counts | \n",
" 350 | \n",
" 350 | \n",
" 350 | \n",
" 350 | \n",
" 350 | \n",
" 350 | \n",
"
\n",
" \n",
" uniques | \n",
" 312 | \n",
" 6 | \n",
" 3 | \n",
" 24 | \n",
" 103 | \n",
" 2 | \n",
"
\n",
" \n",
" missing | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" missing_perc | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
"
\n",
" \n",
" types | \n",
" categorical | \n",
" categorical | \n",
" categorical | \n",
" categorical | \n",
" categorical | \n",
" bool | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date type locale locale_name description \\\n",
"count 350 350 350 350 350 \n",
"unique 312 6 3 24 103 \n",
"top 2014-06-25 Holiday National Ecuador Carnaval \n",
"freq 4 221 174 174 10 \n",
"counts 350 350 350 350 350 \n",
"uniques 312 6 3 24 103 \n",
"missing 0 0 0 0 0 \n",
"missing_perc 0% 0% 0% 0% 0% \n",
"types categorical categorical categorical categorical categorical \n",
"\n",
" transferred \n",
"count 350 \n",
"unique 2 \n",
"top False \n",
"freq 338 \n",
"counts 350 \n",
"uniques 2 \n",
"missing 0 \n",
"missing_perc 0% \n",
"types bool "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" dcoilwtico | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" NaN | \n",
" 1175 | \n",
"
\n",
" \n",
" mean | \n",
" NaN | \n",
" 67.7144 | \n",
"
\n",
" \n",
" std | \n",
" NaN | \n",
" 25.6305 | \n",
"
\n",
" \n",
" min | \n",
" NaN | \n",
" 26.19 | \n",
"
\n",
" \n",
" 25% | \n",
" NaN | \n",
" 46.405 | \n",
"
\n",
" \n",
" 50% | \n",
" NaN | \n",
" 53.19 | \n",
"
\n",
" \n",
" 75% | \n",
" NaN | \n",
" 95.66 | \n",
"
\n",
" \n",
" max | \n",
" NaN | \n",
" 110.62 | \n",
"
\n",
" \n",
" counts | \n",
" 1218 | \n",
" 1175 | \n",
"
\n",
" \n",
" uniques | \n",
" 1218 | \n",
" 998 | \n",
"
\n",
" \n",
" missing | \n",
" 0 | \n",
" 43 | \n",
"
\n",
" \n",
" missing_perc | \n",
" 0% | \n",
" 3.53% | \n",
"
\n",
" \n",
" types | \n",
" unique | \n",
" numeric | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date dcoilwtico\n",
"count NaN 1175\n",
"mean NaN 67.7144\n",
"std NaN 25.6305\n",
"min NaN 26.19\n",
"25% NaN 46.405\n",
"50% NaN 53.19\n",
"75% NaN 95.66\n",
"max NaN 110.62\n",
"counts 1218 1175\n",
"uniques 1218 998\n",
"missing 0 43\n",
"missing_perc 0% 3.53%\n",
"types unique numeric"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" onpromotion | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 3.37046e+06 | \n",
" NaN | \n",
" 3.37046e+06 | \n",
" 3.37046e+06 | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 1.27182e+08 | \n",
" NaN | \n",
" 27.5 | \n",
" 1.2448e+06 | \n",
" NaN | \n",
"
\n",
" \n",
" std | \n",
" 972969 | \n",
" NaN | \n",
" 15.5858 | \n",
" 589836 | \n",
" NaN | \n",
"
\n",
" \n",
" min | \n",
" 1.25497e+08 | \n",
" NaN | \n",
" 1 | \n",
" 96995 | \n",
" NaN | \n",
"
\n",
" \n",
" 25% | \n",
" 1.2634e+08 | \n",
" NaN | \n",
" 14 | \n",
" 805321 | \n",
" NaN | \n",
"
\n",
" \n",
" 50% | \n",
" 1.27182e+08 | \n",
" NaN | \n",
" 27.5 | \n",
" 1.29466e+06 | \n",
" NaN | \n",
"
\n",
" \n",
" 75% | \n",
" 1.28025e+08 | \n",
" NaN | \n",
" 41 | \n",
" 1.73002e+06 | \n",
" NaN | \n",
"
\n",
" \n",
" max | \n",
" 1.28868e+08 | \n",
" NaN | \n",
" 54 | \n",
" 2.13424e+06 | \n",
" NaN | \n",
"
\n",
" \n",
" counts | \n",
" 3370464 | \n",
" 3370464 | \n",
" 3370464 | \n",
" 3370464 | \n",
" 3370464 | \n",
"
\n",
" \n",
" uniques | \n",
" 3370464 | \n",
" 16 | \n",
" 54 | \n",
" 3901 | \n",
" 2 | \n",
"
\n",
" \n",
" missing | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" missing_perc | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
" 0% | \n",
"
\n",
" \n",
" types | \n",
" numeric | \n",
" categorical | \n",
" numeric | \n",
" numeric | \n",
" bool | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr onpromotion\n",
"count 3.37046e+06 NaN 3.37046e+06 3.37046e+06 NaN\n",
"mean 1.27182e+08 NaN 27.5 1.2448e+06 NaN\n",
"std 972969 NaN 15.5858 589836 NaN\n",
"min 1.25497e+08 NaN 1 96995 NaN\n",
"25% 1.2634e+08 NaN 14 805321 NaN\n",
"50% 1.27182e+08 NaN 27.5 1.29466e+06 NaN\n",
"75% 1.28025e+08 NaN 41 1.73002e+06 NaN\n",
"max 1.28868e+08 NaN 54 2.13424e+06 NaN\n",
"counts 3370464 3370464 3370464 3370464 3370464\n",
"uniques 3370464 16 54 3901 2\n",
"missing 0 0 0 0 0\n",
"missing_perc 0% 0% 0% 0% 0%\n",
"types numeric categorical numeric numeric bool"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" unit_sales | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 3.37046e+06 | \n",
" 3.37046e+06 | \n",
"
\n",
" \n",
" mean | \n",
" 1.27182e+08 | \n",
" 0 | \n",
"
\n",
" \n",
" std | \n",
" 972969 | \n",
" 0 | \n",
"
\n",
" \n",
" min | \n",
" 1.25497e+08 | \n",
" 0 | \n",
"
\n",
" \n",
" 25% | \n",
" 1.2634e+08 | \n",
" 0 | \n",
"
\n",
" \n",
" 50% | \n",
" 1.27182e+08 | \n",
" 0 | \n",
"
\n",
" \n",
" 75% | \n",
" 1.28025e+08 | \n",
" 0 | \n",
"
\n",
" \n",
" max | \n",
" 1.28868e+08 | \n",
" 0 | \n",
"
\n",
" \n",
" counts | \n",
" 3370464 | \n",
" 3370464 | \n",
"
\n",
" \n",
" uniques | \n",
" 3370464 | \n",
" 1 | \n",
"
\n",
" \n",
" missing | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" missing_perc | \n",
" 0% | \n",
" 0% | \n",
"
\n",
" \n",
" types | \n",
" numeric | \n",
" constant | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id unit_sales\n",
"count 3.37046e+06 3.37046e+06\n",
"mean 1.27182e+08 0\n",
"std 972969 0\n",
"min 1.25497e+08 0\n",
"25% 1.2634e+08 0\n",
"50% 1.27182e+08 0\n",
"75% 1.28025e+08 0\n",
"max 1.28868e+08 0\n",
"counts 3370464 3370464\n",
"uniques 3370464 1\n",
"missing 0 0\n",
"missing_perc 0% 0%\n",
"types numeric constant"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# The following returns summarized aggregate information to each table accross each field.\n",
"for t in tables: display(DataFrameSummary(t).summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Cleaning / Feature Engineering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a structured data problem, we necessarily have to go through all the cleaning and feature engineering, even though we're using a neural network."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"train, stores, items, transactions, holidays_events, oil, test, sample_submission = tables"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"train.unit_sales = np.clip(train.unit_sales, 0, None)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(125497040, 3370464)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(train),len(test)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 1.254970e+08\n",
"mean 8.556009e+00\n",
"std 2.352696e+01\n",
"min 0.000000e+00\n",
"25% 2.000000e+00\n",
"50% 4.000000e+00\n",
"75% 9.000000e+00\n",
"max 8.944000e+04\n",
"Name: unit_sales, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.unit_sales.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We turn state OnPromotion to booleans, to make them more convenient for modeling. We can do calculations on pandas fields using notation very similar (often identical) to numpy."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"train.onpromotion = train.onpromotion!='0'\n",
"test.onpromotion = test.onpromotion!='0'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Optimizing the Date format"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following extracts particular date fields from a complete datetime for the purpose of constructing categoricals.\n",
"\n",
"You should *always* consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. We'll add to every table with a date field.\n",
"\n",
"**note**: Dayofweek starts at 0, Dayofyear starts at 1"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"add_datepart(train, \"date\", drop=False)"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"%add_datepart"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"add_datepart(transactions, \"date\", drop=False)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"add_datepart(holidays_events, \"date\", drop=False)\n",
"add_datepart(oil, \"date\", drop=False)\n",
"add_datepart(test, \"date\", drop=False)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" unit_sales | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 103665 | \n",
" 7.0 | \n",
" NaN | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" 1356998400 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 105574 | \n",
" 1.0 | \n",
" NaN | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" 1356998400 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 105575 | \n",
" 2.0 | \n",
" NaN | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" 1356998400 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 108079 | \n",
" 1.0 | \n",
" NaN | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" 1356998400 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 108701 | \n",
" 1.0 | \n",
" NaN | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" 1356998400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr unit_sales onpromotion Year Month \\\n",
"0 0 2013-01-01 25 103665 7.0 NaN 2013 1 \n",
"1 1 2013-01-01 25 105574 1.0 NaN 2013 1 \n",
"2 2 2013-01-01 25 105575 2.0 NaN 2013 1 \n",
"3 3 2013-01-01 25 108079 1.0 NaN 2013 1 \n",
"4 4 2013-01-01 25 108701 1.0 NaN 2013 1 \n",
"\n",
" Week Day Dayofweek Dayofyear Is_month_end Is_month_start \\\n",
"0 1 1 1 1 False True \n",
"1 1 1 1 1 False True \n",
"2 1 1 1 1 False True \n",
"3 1 1 1 1 False True \n",
"4 1 1 1 1 False True \n",
"\n",
" Is_quarter_end Is_quarter_start Is_year_end Is_year_start Elapsed \n",
"0 False True False True 1356998400 \n",
"1 False True False True 1356998400 \n",
"2 False True False True 1356998400 \n",
"3 False True False True 1356998400 \n",
"4 False True False True 1356998400 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(125497040, 19)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" store_nbr | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 8 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 9 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Santo Domingo | \n",
" Santo Domingo de los Tsachilas | \n",
" D | \n",
" 4 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" store_nbr city state type cluster\n",
"0 1 Quito Pichincha D 13\n",
"1 2 Quito Pichincha D 13\n",
"2 3 Quito Pichincha D 8\n",
"3 4 Quito Pichincha D 9\n",
"4 5 Santo Domingo Santo Domingo de los Tsachilas D 4"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(54, 5)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" item_nbr | \n",
" family | \n",
" class | \n",
" perishable | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 96995 | \n",
" GROCERY I | \n",
" 1093 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 99197 | \n",
" GROCERY I | \n",
" 1067 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 103501 | \n",
" CLEANING | \n",
" 3008 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 103520 | \n",
" GROCERY I | \n",
" 1028 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 103665 | \n",
" BREAD/BAKERY | \n",
" 2712 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" item_nbr family class perishable\n",
"0 96995 GROCERY I 1093 0\n",
"1 99197 GROCERY I 1067 0\n",
"2 103501 CLEANING 3008 0\n",
"3 103520 GROCERY I 1028 0\n",
"4 103665 BREAD/BAKERY 2712 1"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(4100, 4)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" store_nbr | \n",
" transactions | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2013-01-01 | \n",
" 25 | \n",
" 770 | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" 1356998400 | \n",
"
\n",
" \n",
" 1 | \n",
" 2013-01-02 | \n",
" 1 | \n",
" 2111 | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1357084800 | \n",
"
\n",
" \n",
" 2 | \n",
" 2013-01-02 | \n",
" 2 | \n",
" 2358 | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1357084800 | \n",
"
\n",
" \n",
" 3 | \n",
" 2013-01-02 | \n",
" 3 | \n",
" 3487 | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1357084800 | \n",
"
\n",
" \n",
" 4 | \n",
" 2013-01-02 | \n",
" 4 | \n",
" 1922 | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1357084800 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date store_nbr transactions Year Month Week Day Dayofweek \\\n",
"0 2013-01-01 25 770 2013 1 1 1 1 \n",
"1 2013-01-02 1 2111 2013 1 1 2 2 \n",
"2 2013-01-02 2 2358 2013 1 1 2 2 \n",
"3 2013-01-02 3 3487 2013 1 1 2 2 \n",
"4 2013-01-02 4 1922 2013 1 1 2 2 \n",
"\n",
" Dayofyear Is_month_end Is_month_start Is_quarter_end Is_quarter_start \\\n",
"0 1 False True False True \n",
"1 2 False False False False \n",
"2 2 False False False False \n",
"3 2 False False False False \n",
"4 2 False False False False \n",
"\n",
" Is_year_end Is_year_start Elapsed \n",
"0 False True 1356998400 \n",
"1 False False 1357084800 \n",
"2 False False 1357084800 \n",
"3 False False 1357084800 \n",
"4 False False 1357084800 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(83488, 16)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" type | \n",
" locale | \n",
" locale_name | \n",
" description | \n",
" transferred | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2012-03-02 | \n",
" Holiday | \n",
" Local | \n",
" Manta | \n",
" Fundacion de Manta | \n",
" False | \n",
" 2012 | \n",
" 3 | \n",
" 9 | \n",
" 2 | \n",
" 4 | \n",
" 62 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1330646400 | \n",
"
\n",
" \n",
" 1 | \n",
" 2012-04-01 | \n",
" Holiday | \n",
" Regional | \n",
" Cotopaxi | \n",
" Provincializacion de Cotopaxi | \n",
" False | \n",
" 2012 | \n",
" 4 | \n",
" 13 | \n",
" 1 | \n",
" 6 | \n",
" 92 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" 1333238400 | \n",
"
\n",
" \n",
" 2 | \n",
" 2012-04-12 | \n",
" Holiday | \n",
" Local | \n",
" Cuenca | \n",
" Fundacion de Cuenca | \n",
" False | \n",
" 2012 | \n",
" 4 | \n",
" 15 | \n",
" 12 | \n",
" 3 | \n",
" 103 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1334188800 | \n",
"
\n",
" \n",
" 3 | \n",
" 2012-04-14 | \n",
" Holiday | \n",
" Local | \n",
" Libertad | \n",
" Cantonizacion de Libertad | \n",
" False | \n",
" 2012 | \n",
" 4 | \n",
" 15 | \n",
" 14 | \n",
" 5 | \n",
" 105 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1334361600 | \n",
"
\n",
" \n",
" 4 | \n",
" 2012-04-21 | \n",
" Holiday | \n",
" Local | \n",
" Riobamba | \n",
" Cantonizacion de Riobamba | \n",
" False | \n",
" 2012 | \n",
" 4 | \n",
" 16 | \n",
" 21 | \n",
" 5 | \n",
" 112 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1334966400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date type locale locale_name description \\\n",
"0 2012-03-02 Holiday Local Manta Fundacion de Manta \n",
"1 2012-04-01 Holiday Regional Cotopaxi Provincializacion de Cotopaxi \n",
"2 2012-04-12 Holiday Local Cuenca Fundacion de Cuenca \n",
"3 2012-04-14 Holiday Local Libertad Cantonizacion de Libertad \n",
"4 2012-04-21 Holiday Local Riobamba Cantonizacion de Riobamba \n",
"\n",
" transferred Year Month Week Day Dayofweek Dayofyear Is_month_end \\\n",
"0 False 2012 3 9 2 4 62 False \n",
"1 False 2012 4 13 1 6 92 False \n",
"2 False 2012 4 15 12 3 103 False \n",
"3 False 2012 4 15 14 5 105 False \n",
"4 False 2012 4 16 21 5 112 False \n",
"\n",
" Is_month_start Is_quarter_end Is_quarter_start Is_year_end \\\n",
"0 False False False False \n",
"1 True False True False \n",
"2 False False False False \n",
"3 False False False False \n",
"4 False False False False \n",
"\n",
" Is_year_start Elapsed \n",
"0 False 1330646400 \n",
"1 False 1333238400 \n",
"2 False 1334188800 \n",
"3 False 1334361600 \n",
"4 False 1334966400 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(350, 19)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" dcoilwtico | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2013-01-01 | \n",
" NaN | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" 1356998400 | \n",
"
\n",
" \n",
" 1 | \n",
" 2013-01-02 | \n",
" 93.14 | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1357084800 | \n",
"
\n",
" \n",
" 2 | \n",
" 2013-01-03 | \n",
" 92.97 | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 3 | \n",
" 3 | \n",
" 3 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1357171200 | \n",
"
\n",
" \n",
" 3 | \n",
" 2013-01-04 | \n",
" 93.12 | \n",
" 2013 | \n",
" 1 | \n",
" 1 | \n",
" 4 | \n",
" 4 | \n",
" 4 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1357257600 | \n",
"
\n",
" \n",
" 4 | \n",
" 2013-01-07 | \n",
" 93.20 | \n",
" 2013 | \n",
" 1 | \n",
" 2 | \n",
" 7 | \n",
" 0 | \n",
" 7 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1357516800 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date dcoilwtico Year Month Week Day Dayofweek Dayofyear \\\n",
"0 2013-01-01 NaN 2013 1 1 1 1 1 \n",
"1 2013-01-02 93.14 2013 1 1 2 2 2 \n",
"2 2013-01-03 92.97 2013 1 1 3 3 3 \n",
"3 2013-01-04 93.12 2013 1 1 4 4 4 \n",
"4 2013-01-07 93.20 2013 1 2 7 0 7 \n",
"\n",
" Is_month_end Is_month_start Is_quarter_end Is_quarter_start \\\n",
"0 False True False True \n",
"1 False False False False \n",
"2 False False False False \n",
"3 False False False False \n",
"4 False False False False \n",
"\n",
" Is_year_end Is_year_start Elapsed \n",
"0 False True 1356998400 \n",
"1 False False 1357084800 \n",
"2 False False 1357171200 \n",
"3 False False 1357257600 \n",
"4 False False 1357516800 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(1218, 15)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 125497040 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 96995 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1502841600 | \n",
"
\n",
" \n",
" 1 | \n",
" 125497041 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 99197 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1502841600 | \n",
"
\n",
" \n",
" 2 | \n",
" 125497042 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103501 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1502841600 | \n",
"
\n",
" \n",
" 3 | \n",
" 125497043 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103520 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1502841600 | \n",
"
\n",
" \n",
" 4 | \n",
" 125497044 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103665 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1502841600 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr onpromotion Year Month Week \\\n",
"0 125497040 2017-08-16 1 96995 False 2017 8 33 \n",
"1 125497041 2017-08-16 1 99197 False 2017 8 33 \n",
"2 125497042 2017-08-16 1 103501 False 2017 8 33 \n",
"3 125497043 2017-08-16 1 103520 False 2017 8 33 \n",
"4 125497044 2017-08-16 1 103665 False 2017 8 33 \n",
"\n",
" Day Dayofweek Dayofyear Is_month_end Is_month_start Is_quarter_end \\\n",
"0 16 2 228 False False False \n",
"1 16 2 228 False False False \n",
"2 16 2 228 False False False \n",
"3 16 2 228 False False False \n",
"4 16 2 228 False False False \n",
"\n",
" Is_quarter_start Is_year_end Is_year_start Elapsed \n",
"0 False False False 1502841600 \n",
"1 False False False 1502841600 \n",
"2 False False False 1502841600 \n",
"3 False False False 1502841600 \n",
"4 False False False 1502841600 "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(3370464, 18)"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" unit_sales | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 125497040 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 125497041 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 125497042 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 125497043 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 125497044 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id unit_sales\n",
"0 125497040 0\n",
"1 125497041 0\n",
"2 125497042 0\n",
"3 125497043 0\n",
"4 125497044 0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"(3370464, 2)"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for t in tables: display(t.head(), t.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Reducing data to the last 10 weeks for the training set (16 days needed for Validation/Test)\n",
"Using 2016 data lets us use the same period 16-days, as the test data, but as validation period. "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(125497040, 19)\n"
]
}
],
"source": [
"# If done on all train data, results in 125m rows. So, we're taking a small sample of the 2016 10 weeks:\n",
"train_mask_10w = (train['date'] >= '2016-06-28') & (train['date'] <= '2016-08-31')\n",
"print(train.shape)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(6270562, 19)\n"
]
}
],
"source": [
"train = train[train_mask_10w]\n",
"print(train.shape)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" unit_sales | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 83386215 | \n",
" 83386215 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 103520 | \n",
" 2.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 83386216 | \n",
" 83386216 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 103665 | \n",
" 4.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 83386217 | \n",
" 83386217 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105574 | \n",
" 1.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 83386218 | \n",
" 83386218 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105575 | \n",
" 24.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 83386219 | \n",
" 83386219 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105693 | \n",
" 2.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr unit_sales onpromotion \\\n",
"83386215 83386215 2016-06-28 1 103520 2.0 False \n",
"83386216 83386216 2016-06-28 1 103665 4.0 False \n",
"83386217 83386217 2016-06-28 1 105574 1.0 False \n",
"83386218 83386218 2016-06-28 1 105575 24.0 False \n",
"83386219 83386219 2016-06-28 1 105693 2.0 False \n",
"\n",
" Year Month Week Day Dayofweek Dayofyear Is_month_end \\\n",
"83386215 2016 6 26 28 1 180 False \n",
"83386216 2016 6 26 28 1 180 False \n",
"83386217 2016 6 26 28 1 180 False \n",
"83386218 2016 6 26 28 1 180 False \n",
"83386219 2016 6 26 28 1 180 False \n",
"\n",
" Is_month_start Is_quarter_end Is_quarter_start Is_year_end \\\n",
"83386215 False False False False \n",
"83386216 False False False False \n",
"83386217 False False False False \n",
"83386218 False False False False \n",
"83386219 False False False False \n",
"\n",
" Is_year_start Elapsed \n",
"83386215 False 1467072000 \n",
"83386216 False 1467072000 \n",
"83386217 False 1467072000 \n",
"83386218 False 1467072000 \n",
"83386219 False 1467072000 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train.head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(83488, 16)\n"
]
}
],
"source": [
"transactions_mask_10w = (transactions['date'] >= '2016-06-28') & (transactions['date'] <= '2016-08-31')\n",
"print(transactions.shape)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(3418, 16)\n"
]
}
],
"source": [
"transactions = transactions[transactions_mask_10w]\n",
"print(transactions.shape)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" store_nbr | \n",
" transactions | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 61715 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 1699 | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 61716 | \n",
" 2016-06-28 | \n",
" 2 | \n",
" 1706 | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 61717 | \n",
" 2016-06-28 | \n",
" 3 | \n",
" 2688 | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 61718 | \n",
" 2016-06-28 | \n",
" 4 | \n",
" 1236 | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 61719 | \n",
" 2016-06-28 | \n",
" 5 | \n",
" 1156 | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date store_nbr transactions Year Month Week Day Dayofweek \\\n",
"61715 2016-06-28 1 1699 2016 6 26 28 1 \n",
"61716 2016-06-28 2 1706 2016 6 26 28 1 \n",
"61717 2016-06-28 3 2688 2016 6 26 28 1 \n",
"61718 2016-06-28 4 1236 2016 6 26 28 1 \n",
"61719 2016-06-28 5 1156 2016 6 26 28 1 \n",
"\n",
" Dayofyear Is_month_end Is_month_start Is_quarter_end \\\n",
"61715 180 False False False \n",
"61716 180 False False False \n",
"61717 180 False False False \n",
"61718 180 False False False \n",
"61719 180 False False False \n",
"\n",
" Is_quarter_start Is_year_end Is_year_start Elapsed \n",
"61715 False False False 1467072000 \n",
"61716 False False False 1467072000 \n",
"61717 False False False 1467072000 \n",
"61718 False False False 1467072000 \n",
"61719 False False False 1467072000 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transactions.head()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(350, 19)\n"
]
}
],
"source": [
"holidays_events_mask_10w = (holidays_events['date'] >= '2016-06-28') & (holidays_events['date'] <= '2016-08-31')\n",
"print(holidays_events.shape)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(11, 19)\n"
]
}
],
"source": [
"holidays_events = holidays_events[holidays_events_mask_10w]\n",
"print(holidays_events.shape)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" type | \n",
" locale | \n",
" locale_name | \n",
" description | \n",
" transferred | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 261 | \n",
" 2016-07-03 | \n",
" Holiday | \n",
" Local | \n",
" El Carmen | \n",
" Cantonizacion de El Carmen | \n",
" False | \n",
" 2016 | \n",
" 7 | \n",
" 26 | \n",
" 3 | \n",
" 6 | \n",
" 185 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467504000 | \n",
"
\n",
" \n",
" 262 | \n",
" 2016-07-03 | \n",
" Holiday | \n",
" Local | \n",
" Santo Domingo | \n",
" Fundacion de Santo Domingo | \n",
" False | \n",
" 2016 | \n",
" 7 | \n",
" 26 | \n",
" 3 | \n",
" 6 | \n",
" 185 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467504000 | \n",
"
\n",
" \n",
" 263 | \n",
" 2016-07-23 | \n",
" Holiday | \n",
" Local | \n",
" Cayambe | \n",
" Cantonizacion de Cayambe | \n",
" False | \n",
" 2016 | \n",
" 7 | \n",
" 29 | \n",
" 23 | \n",
" 5 | \n",
" 205 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1469232000 | \n",
"
\n",
" \n",
" 264 | \n",
" 2016-07-24 | \n",
" Additional | \n",
" Local | \n",
" Guayaquil | \n",
" Fundacion de Guayaquil-1 | \n",
" False | \n",
" 2016 | \n",
" 7 | \n",
" 29 | \n",
" 24 | \n",
" 6 | \n",
" 206 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1469318400 | \n",
"
\n",
" \n",
" 265 | \n",
" 2016-07-24 | \n",
" Transfer | \n",
" Local | \n",
" Guayaquil | \n",
" Traslado Fundacion de Guayaquil | \n",
" False | \n",
" 2016 | \n",
" 7 | \n",
" 29 | \n",
" 24 | \n",
" 6 | \n",
" 206 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1469318400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date type locale locale_name \\\n",
"261 2016-07-03 Holiday Local El Carmen \n",
"262 2016-07-03 Holiday Local Santo Domingo \n",
"263 2016-07-23 Holiday Local Cayambe \n",
"264 2016-07-24 Additional Local Guayaquil \n",
"265 2016-07-24 Transfer Local Guayaquil \n",
"\n",
" description transferred Year Month Week Day \\\n",
"261 Cantonizacion de El Carmen False 2016 7 26 3 \n",
"262 Fundacion de Santo Domingo False 2016 7 26 3 \n",
"263 Cantonizacion de Cayambe False 2016 7 29 23 \n",
"264 Fundacion de Guayaquil-1 False 2016 7 29 24 \n",
"265 Traslado Fundacion de Guayaquil False 2016 7 29 24 \n",
"\n",
" Dayofweek Dayofyear Is_month_end Is_month_start Is_quarter_end \\\n",
"261 6 185 False False False \n",
"262 6 185 False False False \n",
"263 5 205 False False False \n",
"264 6 206 False False False \n",
"265 6 206 False False False \n",
"\n",
" Is_quarter_start Is_year_end Is_year_start Elapsed \n",
"261 False False False 1467504000 \n",
"262 False False False 1467504000 \n",
"263 False False False 1469232000 \n",
"264 False False False 1469318400 \n",
"265 False False False 1469318400 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"holidays_events.head()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1218, 15)\n"
]
}
],
"source": [
"oil_mask_10w = (oil['date'] >= '2016-06-28') & (oil['date'] <= '2016-08-31')\n",
"print(oil.shape)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(47, 15)\n"
]
}
],
"source": [
"oil = oil[oil_mask_10w]\n",
"print(oil.shape)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" dcoilwtico | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Is_month_end | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
"
\n",
" \n",
" \n",
" \n",
" 910 | \n",
" 2016-06-28 | \n",
" 47.93 | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" 1 | \n",
" 180 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
"
\n",
" \n",
" 911 | \n",
" 2016-06-29 | \n",
" 49.85 | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 29 | \n",
" 2 | \n",
" 181 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467158400 | \n",
"
\n",
" \n",
" 912 | \n",
" 2016-06-30 | \n",
" 48.27 | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 30 | \n",
" 3 | \n",
" 182 | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467244800 | \n",
"
\n",
" \n",
" 913 | \n",
" 2016-07-01 | \n",
" 49.02 | \n",
" 2016 | \n",
" 7 | \n",
" 26 | \n",
" 1 | \n",
" 4 | \n",
" 183 | \n",
" False | \n",
" True | \n",
" False | \n",
" True | \n",
" False | \n",
" False | \n",
" 1467331200 | \n",
"
\n",
" \n",
" 914 | \n",
" 2016-07-04 | \n",
" NaN | \n",
" 2016 | \n",
" 7 | \n",
" 27 | \n",
" 4 | \n",
" 0 | \n",
" 186 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467590400 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date dcoilwtico Year Month Week Day Dayofweek Dayofyear \\\n",
"910 2016-06-28 47.93 2016 6 26 28 1 180 \n",
"911 2016-06-29 49.85 2016 6 26 29 2 181 \n",
"912 2016-06-30 48.27 2016 6 26 30 3 182 \n",
"913 2016-07-01 49.02 2016 7 26 1 4 183 \n",
"914 2016-07-04 NaN 2016 7 27 4 0 186 \n",
"\n",
" Is_month_end Is_month_start Is_quarter_end Is_quarter_start \\\n",
"910 False False False False \n",
"911 False False False False \n",
"912 True False True False \n",
"913 False True False True \n",
"914 False False False False \n",
"\n",
" Is_year_end Is_year_start Elapsed \n",
"910 False False 1467072000 \n",
"911 False False 1467158400 \n",
"912 False False 1467244800 \n",
"913 False False 1467331200 \n",
"914 False False 1467590400 "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"oil.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Join the tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`join_df` is a function for joining tables on specific fields. By default, we'll be doing a left outer join of `right` on the `left` argument using the given fields for each table.\n",
"\n",
"Pandas does joins using the `merge` method. The `suffixes` argument describes the naming convention for duplicate fields. We've elected to leave the duplicate field names on the left untouched, and append a \"\\_y\" to those on the right."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"def join_df(left, right, left_on, right_on=None, suffix='_y'):\n",
" if right_on is None: right_on = left_on\n",
" return left.merge(right, how='left', left_on=left_on, right_on=right_on, \n",
" suffixes=(\"\", suffix))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can outer join all of our data into a single dataframe. Recall that in outer joins everytime a value in the joining field on the left table does not have a corresponding value on the right table, the corresponding row in the new table has Null values for all right table fields. One way to check that all records are consistent and complete is to check for Null values post-join, as we do here.\n",
"\n",
"*Aside*: Why not just do an inner join?\n",
"If you are assuming that all records are complete and match on the field you desire, an inner join will do the same thing as an outer join. However, in the event you are wrong or a mistake is made, an outer join followed by a null-check will catch it. (Comparing before/after # of rows for inner join is equivalent, but requires keeping track of before/after row #'s. Outer join is easier.)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined = join_df(train, stores, \"store_nbr\")\n",
"len(joined[joined.type.isnull()])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" unit_sales | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" ... | \n",
" Is_month_start | \n",
" Is_quarter_end | \n",
" Is_quarter_start | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 83386215 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 103520 | \n",
" 2.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
" 1 | \n",
" 83386216 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 103665 | \n",
" 4.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
" 2 | \n",
" 83386217 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105574 | \n",
" 1.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
" 3 | \n",
" 83386218 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105575 | \n",
" 24.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
" 4 | \n",
" 83386219 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105693 | \n",
" 2.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 23 columns
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr unit_sales onpromotion Year \\\n",
"0 83386215 2016-06-28 1 103520 2.0 False 2016 \n",
"1 83386216 2016-06-28 1 103665 4.0 False 2016 \n",
"2 83386217 2016-06-28 1 105574 1.0 False 2016 \n",
"3 83386218 2016-06-28 1 105575 24.0 False 2016 \n",
"4 83386219 2016-06-28 1 105693 2.0 False 2016 \n",
"\n",
" Month Week Day ... Is_month_start Is_quarter_end Is_quarter_start \\\n",
"0 6 26 28 ... False False False \n",
"1 6 26 28 ... False False False \n",
"2 6 26 28 ... False False False \n",
"3 6 26 28 ... False False False \n",
"4 6 26 28 ... False False False \n",
"\n",
" Is_year_end Is_year_start Elapsed city state type cluster \n",
"0 False False 1467072000 Quito Pichincha D 13 \n",
"1 False False 1467072000 Quito Pichincha D 13 \n",
"2 False False 1467072000 Quito Pichincha D 13 \n",
"3 False False 1467072000 Quito Pichincha D 13 \n",
"4 False False 1467072000 Quito Pichincha D 13 \n",
"\n",
"[5 rows x 23 columns]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined.head()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_test = join_df(test, stores, \"store_nbr\")\n",
"len(joined_test[joined_test.type.isnull()])"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined = join_df(joined, items, \"item_nbr\")\n",
"len(joined[joined.family.isnull()])"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" unit_sales | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" ... | \n",
" Is_year_end | \n",
" Is_year_start | \n",
" Elapsed | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
" family | \n",
" class | \n",
" perishable | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 83386215 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 103520 | \n",
" 2.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1028 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 83386216 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 103665 | \n",
" 4.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" BREAD/BAKERY | \n",
" 2712 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" 83386217 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105574 | \n",
" 1.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1045 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 83386218 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105575 | \n",
" 24.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1045 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 83386219 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105693 | \n",
" 2.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1034 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 26 columns
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr unit_sales onpromotion Year \\\n",
"0 83386215 2016-06-28 1 103520 2.0 False 2016 \n",
"1 83386216 2016-06-28 1 103665 4.0 False 2016 \n",
"2 83386217 2016-06-28 1 105574 1.0 False 2016 \n",
"3 83386218 2016-06-28 1 105575 24.0 False 2016 \n",
"4 83386219 2016-06-28 1 105693 2.0 False 2016 \n",
"\n",
" Month Week Day ... Is_year_end Is_year_start Elapsed city \\\n",
"0 6 26 28 ... False False 1467072000 Quito \n",
"1 6 26 28 ... False False 1467072000 Quito \n",
"2 6 26 28 ... False False 1467072000 Quito \n",
"3 6 26 28 ... False False 1467072000 Quito \n",
"4 6 26 28 ... False False 1467072000 Quito \n",
"\n",
" state type cluster family class perishable \n",
"0 Pichincha D 13 GROCERY I 1028 0 \n",
"1 Pichincha D 13 BREAD/BAKERY 2712 1 \n",
"2 Pichincha D 13 GROCERY I 1045 0 \n",
"3 Pichincha D 13 GROCERY I 1045 0 \n",
"4 Pichincha D 13 GROCERY I 1034 0 \n",
"\n",
"[5 rows x 26 columns]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined.head()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_test = join_df(joined_test, items, \"item_nbr\")\n",
"len(joined_test[joined_test.family.isnull()])"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined = join_df(joined, transactions, [\"date\", \"store_nbr\"] )\n",
"len(joined[joined.store_nbr.isnull()])"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_test = join_df(joined_test, transactions, [\"date\", \"store_nbr\"] )\n",
"len(joined_test[joined_test.store_nbr.isnull()])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### **Note**: at this stage, we don't incorporate the Holidays (needs tuning for local vs national) or the Oil prices, this will also require external data sources on the Test set.\n",
"\n",
"**TBD**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# we drop the duplicate columns ending with _y\n",
"for df in (joined, joined_test):\n",
" for c in df.columns:\n",
" if c.endswith('_y'):\n",
" if c in df.columns: df.drop(c, inplace=True, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" unit_sales | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" ... | \n",
" Is_year_start | \n",
" Elapsed | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
" family | \n",
" class | \n",
" perishable | \n",
" transactions | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 83386215 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 103520 | \n",
" 2.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1028 | \n",
" 0 | \n",
" 1699 | \n",
"
\n",
" \n",
" 1 | \n",
" 83386216 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 103665 | \n",
" 4.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" BREAD/BAKERY | \n",
" 2712 | \n",
" 1 | \n",
" 1699 | \n",
"
\n",
" \n",
" 2 | \n",
" 83386217 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105574 | \n",
" 1.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1045 | \n",
" 0 | \n",
" 1699 | \n",
"
\n",
" \n",
" 3 | \n",
" 83386218 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105575 | \n",
" 24.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1045 | \n",
" 0 | \n",
" 1699 | \n",
"
\n",
" \n",
" 4 | \n",
" 83386219 | \n",
" 2016-06-28 | \n",
" 1 | \n",
" 105693 | \n",
" 2.0 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 28 | \n",
" ... | \n",
" False | \n",
" 1467072000 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1034 | \n",
" 0 | \n",
" 1699 | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 27 columns
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr unit_sales onpromotion Year \\\n",
"0 83386215 2016-06-28 1 103520 2.0 False 2016 \n",
"1 83386216 2016-06-28 1 103665 4.0 False 2016 \n",
"2 83386217 2016-06-28 1 105574 1.0 False 2016 \n",
"3 83386218 2016-06-28 1 105575 24.0 False 2016 \n",
"4 83386219 2016-06-28 1 105693 2.0 False 2016 \n",
"\n",
" Month Week Day ... Is_year_start Elapsed city state \\\n",
"0 6 26 28 ... False 1467072000 Quito Pichincha \n",
"1 6 26 28 ... False 1467072000 Quito Pichincha \n",
"2 6 26 28 ... False 1467072000 Quito Pichincha \n",
"3 6 26 28 ... False 1467072000 Quito Pichincha \n",
"4 6 26 28 ... False 1467072000 Quito Pichincha \n",
"\n",
" type cluster family class perishable transactions \n",
"0 D 13 GROCERY I 1028 0 1699 \n",
"1 D 13 BREAD/BAKERY 2712 1 1699 \n",
"2 D 13 GROCERY I 1045 0 1699 \n",
"3 D 13 GROCERY I 1045 0 1699 \n",
"4 D 13 GROCERY I 1034 0 1699 \n",
"\n",
"[5 rows x 27 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined.head()"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" store_nbr | \n",
" item_nbr | \n",
" unit_sales | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" Elapsed | \n",
" cluster | \n",
" class | \n",
" perishable | \n",
" transactions | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6270562.0 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
" 6.270562e+06 | \n",
"
\n",
" \n",
" mean | \n",
" 8.652150e+07 | \n",
" 2.769834e+01 | \n",
" 1.100521e+06 | \n",
" 7.607019e+00 | \n",
" 2016.0 | \n",
" 7.432728e+00 | \n",
" 3.027350e+01 | \n",
" 1.645794e+01 | \n",
" 3.001217e+00 | \n",
" 2.119157e+02 | \n",
" 1.469830e+09 | \n",
" 8.640504e+00 | \n",
" 1.969754e+03 | \n",
" 2.598797e-01 | \n",
" 1.751359e+03 | \n",
"
\n",
" \n",
" std | \n",
" 1.810155e+06 | \n",
" 1.620490e+01 | \n",
" 5.489163e+05 | \n",
" 1.841107e+01 | \n",
" 0.0 | \n",
" 5.761485e-01 | \n",
" 2.690027e+00 | \n",
" 9.196580e+00 | \n",
" 2.009433e+00 | \n",
" 1.868661e+01 | \n",
" 1.614523e+06 | \n",
" 4.582949e+00 | \n",
" 1.165504e+03 | \n",
" 4.385685e-01 | \n",
" 9.290509e+02 | \n",
"
\n",
" \n",
" min | \n",
" 8.338622e+07 | \n",
" 1.000000e+00 | \n",
" 9.919700e+04 | \n",
" 0.000000e+00 | \n",
" 2016.0 | \n",
" 6.000000e+00 | \n",
" 2.600000e+01 | \n",
" 1.000000e+00 | \n",
" 0.000000e+00 | \n",
" 1.800000e+02 | \n",
" 1.467072e+09 | \n",
" 1.000000e+00 | \n",
" 1.002000e+03 | \n",
" 0.000000e+00 | \n",
" 3.950000e+02 | \n",
"
\n",
" \n",
" 25% | \n",
" 8.495386e+07 | \n",
" 1.300000e+01 | \n",
" 6.557440e+05 | \n",
" 2.000000e+00 | \n",
" 2016.0 | \n",
" 7.000000e+00 | \n",
" 2.800000e+01 | \n",
" 8.000000e+00 | \n",
" 1.000000e+00 | \n",
" 1.960000e+02 | \n",
" 1.468454e+09 | \n",
" 5.000000e+00 | \n",
" 1.058000e+03 | \n",
" 0.000000e+00 | \n",
" 1.101000e+03 | \n",
"
\n",
" \n",
" 50% | \n",
" 8.652150e+07 | \n",
" 2.800000e+01 | \n",
" 1.146784e+06 | \n",
" 4.000000e+00 | \n",
" 2016.0 | \n",
" 7.000000e+00 | \n",
" 3.000000e+01 | \n",
" 1.700000e+01 | \n",
" 3.000000e+00 | \n",
" 2.120000e+02 | \n",
" 1.469837e+09 | \n",
" 9.000000e+00 | \n",
" 1.190000e+03 | \n",
" 0.000000e+00 | \n",
" 1.445000e+03 | \n",
"
\n",
" \n",
" 75% | \n",
" 8.808914e+07 | \n",
" 4.300000e+01 | \n",
" 1.464008e+06 | \n",
" 8.000000e+00 | \n",
" 2016.0 | \n",
" 8.000000e+00 | \n",
" 3.300000e+01 | \n",
" 2.500000e+01 | \n",
" 5.000000e+00 | \n",
" 2.280000e+02 | \n",
" 1.471219e+09 | \n",
" 1.300000e+01 | \n",
" 2.708000e+03 | \n",
" 1.000000e+00 | \n",
" 2.196000e+03 | \n",
"
\n",
" \n",
" max | \n",
" 8.965678e+07 | \n",
" 5.400000e+01 | \n",
" 2.037487e+06 | \n",
" 1.375700e+04 | \n",
" 2016.0 | \n",
" 8.000000e+00 | \n",
" 3.500000e+01 | \n",
" 3.100000e+01 | \n",
" 6.000000e+00 | \n",
" 2.440000e+02 | \n",
" 1.472602e+09 | \n",
" 1.700000e+01 | \n",
" 7.780000e+03 | \n",
" 1.000000e+00 | \n",
" 5.165000e+03 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id store_nbr item_nbr unit_sales Year \\\n",
"count 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 6270562.0 \n",
"mean 8.652150e+07 2.769834e+01 1.100521e+06 7.607019e+00 2016.0 \n",
"std 1.810155e+06 1.620490e+01 5.489163e+05 1.841107e+01 0.0 \n",
"min 8.338622e+07 1.000000e+00 9.919700e+04 0.000000e+00 2016.0 \n",
"25% 8.495386e+07 1.300000e+01 6.557440e+05 2.000000e+00 2016.0 \n",
"50% 8.652150e+07 2.800000e+01 1.146784e+06 4.000000e+00 2016.0 \n",
"75% 8.808914e+07 4.300000e+01 1.464008e+06 8.000000e+00 2016.0 \n",
"max 8.965678e+07 5.400000e+01 2.037487e+06 1.375700e+04 2016.0 \n",
"\n",
" Month Week Day Dayofweek Dayofyear \\\n",
"count 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 \n",
"mean 7.432728e+00 3.027350e+01 1.645794e+01 3.001217e+00 2.119157e+02 \n",
"std 5.761485e-01 2.690027e+00 9.196580e+00 2.009433e+00 1.868661e+01 \n",
"min 6.000000e+00 2.600000e+01 1.000000e+00 0.000000e+00 1.800000e+02 \n",
"25% 7.000000e+00 2.800000e+01 8.000000e+00 1.000000e+00 1.960000e+02 \n",
"50% 7.000000e+00 3.000000e+01 1.700000e+01 3.000000e+00 2.120000e+02 \n",
"75% 8.000000e+00 3.300000e+01 2.500000e+01 5.000000e+00 2.280000e+02 \n",
"max 8.000000e+00 3.500000e+01 3.100000e+01 6.000000e+00 2.440000e+02 \n",
"\n",
" Elapsed cluster class perishable transactions \n",
"count 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 6.270562e+06 \n",
"mean 1.469830e+09 8.640504e+00 1.969754e+03 2.598797e-01 1.751359e+03 \n",
"std 1.614523e+06 4.582949e+00 1.165504e+03 4.385685e-01 9.290509e+02 \n",
"min 1.467072e+09 1.000000e+00 1.002000e+03 0.000000e+00 3.950000e+02 \n",
"25% 1.468454e+09 5.000000e+00 1.058000e+03 0.000000e+00 1.101000e+03 \n",
"50% 1.469837e+09 9.000000e+00 1.190000e+03 0.000000e+00 1.445000e+03 \n",
"75% 1.471219e+09 1.300000e+01 2.708000e+03 1.000000e+00 2.196000e+03 \n",
"max 1.472602e+09 1.700000e+01 7.780000e+03 1.000000e+00 5.165000e+03 "
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined.describe()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" date | \n",
" store_nbr | \n",
" item_nbr | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" ... | \n",
" Is_year_start | \n",
" Elapsed | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
" family | \n",
" class | \n",
" perishable | \n",
" transactions | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 125497040 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 96995 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" ... | \n",
" False | \n",
" 1502841600 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1093 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 125497041 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 99197 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" ... | \n",
" False | \n",
" 1502841600 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1067 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" 125497042 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103501 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" ... | \n",
" False | \n",
" 1502841600 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" CLEANING | \n",
" 3008 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 125497043 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103520 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" ... | \n",
" False | \n",
" 1502841600 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1028 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 125497044 | \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103665 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" ... | \n",
" False | \n",
" 1502841600 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" BREAD/BAKERY | \n",
" 2712 | \n",
" 1 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
5 rows × 26 columns
\n",
"
"
],
"text/plain": [
" id date store_nbr item_nbr onpromotion Year Month Week \\\n",
"0 125497040 2017-08-16 1 96995 False 2017 8 33 \n",
"1 125497041 2017-08-16 1 99197 False 2017 8 33 \n",
"2 125497042 2017-08-16 1 103501 False 2017 8 33 \n",
"3 125497043 2017-08-16 1 103520 False 2017 8 33 \n",
"4 125497044 2017-08-16 1 103665 False 2017 8 33 \n",
"\n",
" Day Dayofweek ... Is_year_start Elapsed city state \\\n",
"0 16 2 ... False 1502841600 Quito Pichincha \n",
"1 16 2 ... False 1502841600 Quito Pichincha \n",
"2 16 2 ... False 1502841600 Quito Pichincha \n",
"3 16 2 ... False 1502841600 Quito Pichincha \n",
"4 16 2 ... False 1502841600 Quito Pichincha \n",
"\n",
" type cluster family class perishable transactions \n",
"0 D 13 GROCERY I 1093 0 NaN \n",
"1 D 13 GROCERY I 1067 0 NaN \n",
"2 D 13 CLEANING 3008 0 NaN \n",
"3 D 13 GROCERY I 1028 0 NaN \n",
"4 D 13 BREAD/BAKERY 2712 1 NaN \n",
"\n",
"[5 rows x 26 columns]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll fill in missing values to avoid complications with NA's. NA (not available) is how Pandas indicates missing values; many models have problems when missing values are present, so it's always important to think about how to deal with them. In these cases, we are picking an arbitrary signal value that doesn't otherwise appear in the data.\n",
"\n",
"** Note**: as seen below, its seems there are no NANs !?!?!"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check if any NANs\n",
"joined.isnull().values.any()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check if any NANs (slower, more complete)\n",
"joined.isnull().sum().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Durations : TBD !"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**NOTE: code from Rossmann has 25+ cells**\n",
"\n",
"It is common when working with time series data to extract data that explains relationships across rows as opposed to columns, e.g.:\n",
"* Running averages\n",
"* Time until next event\n",
"* Time since last event\n",
"\n",
"This is often difficult to do with most table manipulation frameworks, since they are designed to work with relationships across columns. As such, we've created a class to handle this type of data.\n",
"\n",
"We'll define a function `get_elapsed` for cumulative counting across a sorted dataframe. Given a particular field `fld` to monitor, this function will start tracking time since the last occurrence of that field. When the field is seen again, the counter is set to zero.\n",
"\n",
"Upon initialization, this will result in datetime na's until the field is encountered. This is reset every time a new store is seen. We'll see how to use this shortly."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"def get_elapsed(fld, pre):\n",
" day1 = np.timedelta64(1, 'D')\n",
" last_date = np.datetime64()\n",
" last_store = 0\n",
" res = []\n",
"\n",
" for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):\n",
" if s != last_store:\n",
" last_date = np.datetime64()\n",
" last_store = s\n",
" if v: last_date = d\n",
" res.append(((d-last_date).astype('timedelta64[D]') / day1).astype(int))\n",
" df[pre+fld] = res"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll be applying this to a subset of columns:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"columns = [\"Date\", \"Store\", \"Promo\", \"StateHoliday\", \"SchoolHoliday\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TBD"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"%add_datepart"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create features"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" id | \n",
" 83386215 | \n",
" 83386216 | \n",
" 83386217 | \n",
" 83386218 | \n",
" 83386219 | \n",
"
\n",
" \n",
" date | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
"
\n",
" \n",
" store_nbr | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" item_nbr | \n",
" 103520 | \n",
" 103665 | \n",
" 105574 | \n",
" 105575 | \n",
" 105693 | \n",
"
\n",
" \n",
" unit_sales | \n",
" 2 | \n",
" 4 | \n",
" 1 | \n",
" 24 | \n",
" 2 | \n",
"
\n",
" \n",
" onpromotion | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Year | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
"
\n",
" \n",
" Month | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
"
\n",
" \n",
" Week | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
"
\n",
" \n",
" Day | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
"
\n",
" \n",
" Dayofweek | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Dayofyear | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
"
\n",
" \n",
" Is_month_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_month_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_quarter_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_quarter_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_year_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_year_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Elapsed | \n",
" 1467072000 | \n",
" 1467072000 | \n",
" 1467072000 | \n",
" 1467072000 | \n",
" 1467072000 | \n",
"
\n",
" \n",
" city | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
"
\n",
" \n",
" state | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
"
\n",
" \n",
" type | \n",
" D | \n",
" D | \n",
" D | \n",
" D | \n",
" D | \n",
"
\n",
" \n",
" cluster | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
"
\n",
" \n",
" family | \n",
" GROCERY I | \n",
" BREAD/BAKERY | \n",
" GROCERY I | \n",
" GROCERY I | \n",
" GROCERY I | \n",
"
\n",
" \n",
" class | \n",
" 1028 | \n",
" 2712 | \n",
" 1045 | \n",
" 1045 | \n",
" 1034 | \n",
"
\n",
" \n",
" perishable | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" transactions | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 \\\n",
"id 83386215 83386216 \n",
"date 2016-06-28 00:00:00 2016-06-28 00:00:00 \n",
"store_nbr 1 1 \n",
"item_nbr 103520 103665 \n",
"unit_sales 2 4 \n",
"onpromotion False False \n",
"Year 2016 2016 \n",
"Month 6 6 \n",
"Week 26 26 \n",
"Day 28 28 \n",
"Dayofweek 1 1 \n",
"Dayofyear 180 180 \n",
"Is_month_end False False \n",
"Is_month_start False False \n",
"Is_quarter_end False False \n",
"Is_quarter_start False False \n",
"Is_year_end False False \n",
"Is_year_start False False \n",
"Elapsed 1467072000 1467072000 \n",
"city Quito Quito \n",
"state Pichincha Pichincha \n",
"type D D \n",
"cluster 13 13 \n",
"family GROCERY I BREAD/BAKERY \n",
"class 1028 2712 \n",
"perishable 0 1 \n",
"transactions 1699 1699 \n",
"\n",
" 2 3 \\\n",
"id 83386217 83386218 \n",
"date 2016-06-28 00:00:00 2016-06-28 00:00:00 \n",
"store_nbr 1 1 \n",
"item_nbr 105574 105575 \n",
"unit_sales 1 24 \n",
"onpromotion False False \n",
"Year 2016 2016 \n",
"Month 6 6 \n",
"Week 26 26 \n",
"Day 28 28 \n",
"Dayofweek 1 1 \n",
"Dayofyear 180 180 \n",
"Is_month_end False False \n",
"Is_month_start False False \n",
"Is_quarter_end False False \n",
"Is_quarter_start False False \n",
"Is_year_end False False \n",
"Is_year_start False False \n",
"Elapsed 1467072000 1467072000 \n",
"city Quito Quito \n",
"state Pichincha Pichincha \n",
"type D D \n",
"cluster 13 13 \n",
"family GROCERY I GROCERY I \n",
"class 1045 1045 \n",
"perishable 0 0 \n",
"transactions 1699 1699 \n",
"\n",
" 4 \n",
"id 83386219 \n",
"date 2016-06-28 00:00:00 \n",
"store_nbr 1 \n",
"item_nbr 105693 \n",
"unit_sales 2 \n",
"onpromotion False \n",
"Year 2016 \n",
"Month 6 \n",
"Week 26 \n",
"Day 28 \n",
"Dayofweek 1 \n",
"Dayofyear 180 \n",
"Is_month_end False \n",
"Is_month_start False \n",
"Is_quarter_end False \n",
"Is_quarter_start False \n",
"Is_year_end False \n",
"Is_year_start False \n",
"Elapsed 1467072000 \n",
"city Quito \n",
"state Pichincha \n",
"type D \n",
"cluster 13 \n",
"family GROCERY I \n",
"class 1034 \n",
"perishable 0 \n",
"transactions 1699 "
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Look at all columns pivoted to rows\n",
"joined.head().T.head(40)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# dropping \"Elasped\" as it generates an error later, due to crazy 10 digits\n",
"joined.drop(['Elapsed'],axis = 1, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" id | \n",
" 83386215 | \n",
" 83386216 | \n",
" 83386217 | \n",
" 83386218 | \n",
" 83386219 | \n",
"
\n",
" \n",
" date | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
"
\n",
" \n",
" store_nbr | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" item_nbr | \n",
" 103520 | \n",
" 103665 | \n",
" 105574 | \n",
" 105575 | \n",
" 105693 | \n",
"
\n",
" \n",
" unit_sales | \n",
" 2 | \n",
" 4 | \n",
" 1 | \n",
" 24 | \n",
" 2 | \n",
"
\n",
" \n",
" onpromotion | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Year | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
"
\n",
" \n",
" Month | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
"
\n",
" \n",
" Week | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
"
\n",
" \n",
" Day | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
"
\n",
" \n",
" Dayofweek | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Dayofyear | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
"
\n",
" \n",
" Is_month_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_month_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_quarter_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_quarter_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_year_end | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Is_year_start | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" city | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
"
\n",
" \n",
" state | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
"
\n",
" \n",
" type | \n",
" D | \n",
" D | \n",
" D | \n",
" D | \n",
" D | \n",
"
\n",
" \n",
" cluster | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
"
\n",
" \n",
" family | \n",
" GROCERY I | \n",
" BREAD/BAKERY | \n",
" GROCERY I | \n",
" GROCERY I | \n",
" GROCERY I | \n",
"
\n",
" \n",
" class | \n",
" 1028 | \n",
" 2712 | \n",
" 1045 | \n",
" 1045 | \n",
" 1034 | \n",
"
\n",
" \n",
" perishable | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" transactions | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 \\\n",
"id 83386215 83386216 \n",
"date 2016-06-28 00:00:00 2016-06-28 00:00:00 \n",
"store_nbr 1 1 \n",
"item_nbr 103520 103665 \n",
"unit_sales 2 4 \n",
"onpromotion False False \n",
"Year 2016 2016 \n",
"Month 6 6 \n",
"Week 26 26 \n",
"Day 28 28 \n",
"Dayofweek 1 1 \n",
"Dayofyear 180 180 \n",
"Is_month_end False False \n",
"Is_month_start False False \n",
"Is_quarter_end False False \n",
"Is_quarter_start False False \n",
"Is_year_end False False \n",
"Is_year_start False False \n",
"city Quito Quito \n",
"state Pichincha Pichincha \n",
"type D D \n",
"cluster 13 13 \n",
"family GROCERY I BREAD/BAKERY \n",
"class 1028 2712 \n",
"perishable 0 1 \n",
"transactions 1699 1699 \n",
"\n",
" 2 3 \\\n",
"id 83386217 83386218 \n",
"date 2016-06-28 00:00:00 2016-06-28 00:00:00 \n",
"store_nbr 1 1 \n",
"item_nbr 105574 105575 \n",
"unit_sales 1 24 \n",
"onpromotion False False \n",
"Year 2016 2016 \n",
"Month 6 6 \n",
"Week 26 26 \n",
"Day 28 28 \n",
"Dayofweek 1 1 \n",
"Dayofyear 180 180 \n",
"Is_month_end False False \n",
"Is_month_start False False \n",
"Is_quarter_end False False \n",
"Is_quarter_start False False \n",
"Is_year_end False False \n",
"Is_year_start False False \n",
"city Quito Quito \n",
"state Pichincha Pichincha \n",
"type D D \n",
"cluster 13 13 \n",
"family GROCERY I GROCERY I \n",
"class 1045 1045 \n",
"perishable 0 0 \n",
"transactions 1699 1699 \n",
"\n",
" 4 \n",
"id 83386219 \n",
"date 2016-06-28 00:00:00 \n",
"store_nbr 1 \n",
"item_nbr 105693 \n",
"unit_sales 2 \n",
"onpromotion False \n",
"Year 2016 \n",
"Month 6 \n",
"Week 26 \n",
"Day 28 \n",
"Dayofweek 1 \n",
"Dayofyear 180 \n",
"Is_month_end False \n",
"Is_month_start False \n",
"Is_quarter_end False \n",
"Is_quarter_start False \n",
"Is_year_end False \n",
"Is_year_start False \n",
"city Quito \n",
"state Pichincha \n",
"type D \n",
"cluster 13 \n",
"family GROCERY I \n",
"class 1034 \n",
"perishable 0 \n",
"transactions 1699 "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined.head().T.head(40)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've engineered all our features, we need to convert to input compatible with a neural network.\n",
"\n",
"This includes converting categorical variables into contiguous integers or one-hot encodings, normalizing continuous features to standard normal, etc..."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"cat_vars = ['store_nbr', 'item_nbr', 'onpromotion', 'Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',\n",
" 'city', 'state', 'type', 'cluster', 'family', 'class', 'perishable']"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"contin_vars = ['transactions']"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6270562"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n = len(joined); n"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"for v in cat_vars:\n",
" joined[v] = joined[v].astype('category').cat.as_ordered()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"for v in cat_vars:\n",
" joined_test[v] = joined_test[v].astype('category').cat.as_ordered()"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"for v in contin_vars:\n",
" joined[v] = joined[v].astype('float32')\n",
" joined_test[v] = joined_test[v].astype('float32')"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"for v in contin_vars:\n",
" joined[v] = joined[v].astype('float32')\n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"dep = 'unit_sales'\n",
"joined = joined[cat_vars+contin_vars+[dep, 'date']].copy()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"joined_test[dep] = 0\n",
"joined_test = joined_test[cat_vars+contin_vars+[dep, 'date', 'id']].copy()"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" store_nbr | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" item_nbr | \n",
" 103520 | \n",
" 103665 | \n",
" 105574 | \n",
" 105575 | \n",
" 105693 | \n",
"
\n",
" \n",
" onpromotion | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Year | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
" 2016 | \n",
"
\n",
" \n",
" Month | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
" 6 | \n",
"
\n",
" \n",
" Week | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
" 26 | \n",
"
\n",
" \n",
" Day | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
" 28 | \n",
"
\n",
" \n",
" Dayofweek | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" Dayofyear | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
" 180 | \n",
"
\n",
" \n",
" city | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
"
\n",
" \n",
" state | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
"
\n",
" \n",
" type | \n",
" D | \n",
" D | \n",
" D | \n",
" D | \n",
" D | \n",
"
\n",
" \n",
" cluster | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
"
\n",
" \n",
" family | \n",
" GROCERY I | \n",
" BREAD/BAKERY | \n",
" GROCERY I | \n",
" GROCERY I | \n",
" GROCERY I | \n",
"
\n",
" \n",
" class | \n",
" 1028 | \n",
" 2712 | \n",
" 1045 | \n",
" 1045 | \n",
" 1034 | \n",
"
\n",
" \n",
" perishable | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" transactions | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
" 1699 | \n",
"
\n",
" \n",
" unit_sales | \n",
" 2 | \n",
" 4 | \n",
" 1 | \n",
" 24 | \n",
" 2 | \n",
"
\n",
" \n",
" date | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
" 2016-06-28 00:00:00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 \\\n",
"store_nbr 1 1 1 \n",
"item_nbr 103520 103665 105574 \n",
"onpromotion False False False \n",
"Year 2016 2016 2016 \n",
"Month 6 6 6 \n",
"Week 26 26 26 \n",
"Day 28 28 28 \n",
"Dayofweek 1 1 1 \n",
"Dayofyear 180 180 180 \n",
"city Quito Quito Quito \n",
"state Pichincha Pichincha Pichincha \n",
"type D D D \n",
"cluster 13 13 13 \n",
"family GROCERY I BREAD/BAKERY GROCERY I \n",
"class 1028 2712 1045 \n",
"perishable 0 1 0 \n",
"transactions 1699 1699 1699 \n",
"unit_sales 2 4 1 \n",
"date 2016-06-28 00:00:00 2016-06-28 00:00:00 2016-06-28 00:00:00 \n",
"\n",
" 3 4 \n",
"store_nbr 1 1 \n",
"item_nbr 105575 105693 \n",
"onpromotion False False \n",
"Year 2016 2016 \n",
"Month 6 6 \n",
"Week 26 26 \n",
"Day 28 28 \n",
"Dayofweek 1 1 \n",
"Dayofyear 180 180 \n",
"city Quito Quito \n",
"state Pichincha Pichincha \n",
"type D D \n",
"cluster 13 13 \n",
"family GROCERY I GROCERY I \n",
"class 1045 1034 \n",
"perishable 0 0 \n",
"transactions 1699 1699 \n",
"unit_sales 24 2 \n",
"date 2016-06-28 00:00:00 2016-06-28 00:00:00 "
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined.head().T.head(40)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 4 | \n",
"
\n",
" \n",
" \n",
" \n",
" store_nbr | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" item_nbr | \n",
" 96995 | \n",
" 99197 | \n",
" 103501 | \n",
" 103520 | \n",
" 103665 | \n",
"
\n",
" \n",
" onpromotion | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" Year | \n",
" 2017 | \n",
" 2017 | \n",
" 2017 | \n",
" 2017 | \n",
" 2017 | \n",
"
\n",
" \n",
" Month | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
" 8 | \n",
"
\n",
" \n",
" Week | \n",
" 33 | \n",
" 33 | \n",
" 33 | \n",
" 33 | \n",
" 33 | \n",
"
\n",
" \n",
" Day | \n",
" 16 | \n",
" 16 | \n",
" 16 | \n",
" 16 | \n",
" 16 | \n",
"
\n",
" \n",
" Dayofweek | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
" 2 | \n",
"
\n",
" \n",
" Dayofyear | \n",
" 228 | \n",
" 228 | \n",
" 228 | \n",
" 228 | \n",
" 228 | \n",
"
\n",
" \n",
" city | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
" Quito | \n",
"
\n",
" \n",
" state | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
" Pichincha | \n",
"
\n",
" \n",
" type | \n",
" D | \n",
" D | \n",
" D | \n",
" D | \n",
" D | \n",
"
\n",
" \n",
" cluster | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
" 13 | \n",
"
\n",
" \n",
" family | \n",
" GROCERY I | \n",
" GROCERY I | \n",
" CLEANING | \n",
" GROCERY I | \n",
" BREAD/BAKERY | \n",
"
\n",
" \n",
" class | \n",
" 1093 | \n",
" 1067 | \n",
" 3008 | \n",
" 1028 | \n",
" 2712 | \n",
"
\n",
" \n",
" perishable | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" transactions | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" unit_sales | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" date | \n",
" 2017-08-16 00:00:00 | \n",
" 2017-08-16 00:00:00 | \n",
" 2017-08-16 00:00:00 | \n",
" 2017-08-16 00:00:00 | \n",
" 2017-08-16 00:00:00 | \n",
"
\n",
" \n",
" id | \n",
" 125497040 | \n",
" 125497041 | \n",
" 125497042 | \n",
" 125497043 | \n",
" 125497044 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 2 \\\n",
"store_nbr 1 1 1 \n",
"item_nbr 96995 99197 103501 \n",
"onpromotion False False False \n",
"Year 2017 2017 2017 \n",
"Month 8 8 8 \n",
"Week 33 33 33 \n",
"Day 16 16 16 \n",
"Dayofweek 2 2 2 \n",
"Dayofyear 228 228 228 \n",
"city Quito Quito Quito \n",
"state Pichincha Pichincha Pichincha \n",
"type D D D \n",
"cluster 13 13 13 \n",
"family GROCERY I GROCERY I CLEANING \n",
"class 1093 1067 3008 \n",
"perishable 0 0 0 \n",
"transactions NaN NaN NaN \n",
"unit_sales 0 0 0 \n",
"date 2017-08-16 00:00:00 2017-08-16 00:00:00 2017-08-16 00:00:00 \n",
"id 125497040 125497041 125497042 \n",
"\n",
" 3 4 \n",
"store_nbr 1 1 \n",
"item_nbr 103520 103665 \n",
"onpromotion False False \n",
"Year 2017 2017 \n",
"Month 8 8 \n",
"Week 33 33 \n",
"Day 16 16 \n",
"Dayofweek 2 2 \n",
"Dayofyear 228 228 \n",
"city Quito Quito \n",
"state Pichincha Pichincha \n",
"type D D \n",
"cluster 13 13 \n",
"family GROCERY I BREAD/BAKERY \n",
"class 1028 2712 \n",
"perishable 0 1 \n",
"transactions NaN NaN \n",
"unit_sales 0 0 \n",
"date 2017-08-16 00:00:00 2017-08-16 00:00:00 \n",
"id 125497043 125497044 "
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_test.head().T.head(40)"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"# check this cell function ?\n",
"apply_cats(joined_test, joined)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to run on a sample."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1254112"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idxs = get_cv_idxs(n)\n",
"joined_samp = joined.iloc[idxs].set_index(\"date\")\n",
"samp_size = len(joined_samp)\n",
"samp_size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To run on the full dataset, use this instead:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"samp_size = n\n",
"joined_samp = joined.set_index(\"Date\")"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"samp_size = len(joined_samp)\n",
"samp_size"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" store_nbr | \n",
" item_nbr | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
" family | \n",
" class | \n",
" perishable | \n",
" transactions | \n",
" unit_sales | \n",
"
\n",
" \n",
" date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2016-07-05 | \n",
" 36 | \n",
" 1464079 | \n",
" False | \n",
" 2016 | \n",
" 7 | \n",
" 27 | \n",
" 5 | \n",
" 1 | \n",
" 187 | \n",
" Libertad | \n",
" Guayas | \n",
" E | \n",
" 10 | \n",
" BEVERAGES | \n",
" 1190 | \n",
" 0 | \n",
" 1224.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2016-08-06 | \n",
" 12 | \n",
" 414620 | \n",
" False | \n",
" 2016 | \n",
" 8 | \n",
" 31 | \n",
" 6 | \n",
" 5 | \n",
" 219 | \n",
" Latacunga | \n",
" Cotopaxi | \n",
" C | \n",
" 15 | \n",
" GROCERY I | \n",
" 1016 | \n",
" 0 | \n",
" 1162.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" store_nbr item_nbr onpromotion Year Month Week Day Dayofweek \\\n",
"date \n",
"2016-07-05 36 1464079 False 2016 7 27 5 1 \n",
"2016-08-06 12 414620 False 2016 8 31 6 5 \n",
"\n",
" Dayofyear city state type cluster family class \\\n",
"date \n",
"2016-07-05 187 Libertad Guayas E 10 BEVERAGES 1190 \n",
"2016-08-06 219 Latacunga Cotopaxi C 15 GROCERY I 1016 \n",
"\n",
" perishable transactions unit_sales \n",
"date \n",
"2016-07-05 0 1224.0 1.0 \n",
"2016-08-06 0 1162.0 1.0 "
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_samp.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" store_nbr | \n",
" item_nbr | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
" family | \n",
" class | \n",
" perishable | \n",
" transactions | \n",
" unit_sales | \n",
"
\n",
" \n",
" date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2016-07-06 | \n",
" 22 | \n",
" 414620 | \n",
" False | \n",
" 2016 | \n",
" 7 | \n",
" 27 | \n",
" 6 | \n",
" 2 | \n",
" 188 | \n",
" Puyo | \n",
" Pastaza | \n",
" C | \n",
" 7 | \n",
" GROCERY I | \n",
" 1016 | \n",
" 0 | \n",
" 712.0 | \n",
" 5.0 | \n",
"
\n",
" \n",
" 2016-08-16 | \n",
" 46 | \n",
" 513853 | \n",
" False | \n",
" 2016 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 1 | \n",
" 229 | \n",
" Quito | \n",
" Pichincha | \n",
" A | \n",
" 14 | \n",
" CLEANING | \n",
" 3034 | \n",
" 0 | \n",
" 2877.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2016-08-15 | \n",
" 34 | \n",
" 890371 | \n",
" False | \n",
" 2016 | \n",
" 8 | \n",
" 33 | \n",
" 15 | \n",
" 0 | \n",
" 228 | \n",
" Guayaquil | \n",
" Guayas | \n",
" B | \n",
" 6 | \n",
" GROCERY I | \n",
" 1002 | \n",
" 0 | \n",
" 2227.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2016-08-16 | \n",
" 17 | \n",
" 1221045 | \n",
" False | \n",
" 2016 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 1 | \n",
" 229 | \n",
" Quito | \n",
" Pichincha | \n",
" C | \n",
" 12 | \n",
" GROCERY I | \n",
" 1010 | \n",
" 0 | \n",
" 1530.0 | \n",
" 7.0 | \n",
"
\n",
" \n",
" 2016-06-30 | \n",
" 9 | \n",
" 1354382 | \n",
" False | \n",
" 2016 | \n",
" 6 | \n",
" 26 | \n",
" 30 | \n",
" 3 | \n",
" 182 | \n",
" Quito | \n",
" Pichincha | \n",
" B | \n",
" 6 | \n",
" GROCERY I | \n",
" 1042 | \n",
" 0 | \n",
" 1824.0 | \n",
" 4.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" store_nbr item_nbr onpromotion Year Month Week Day Dayofweek \\\n",
"date \n",
"2016-07-06 22 414620 False 2016 7 27 6 2 \n",
"2016-08-16 46 513853 False 2016 8 33 16 1 \n",
"2016-08-15 34 890371 False 2016 8 33 15 0 \n",
"2016-08-16 17 1221045 False 2016 8 33 16 1 \n",
"2016-06-30 9 1354382 False 2016 6 26 30 3 \n",
"\n",
" Dayofyear city state type cluster family class \\\n",
"date \n",
"2016-07-06 188 Puyo Pastaza C 7 GROCERY I 1016 \n",
"2016-08-16 229 Quito Pichincha A 14 CLEANING 3034 \n",
"2016-08-15 228 Guayaquil Guayas B 6 GROCERY I 1002 \n",
"2016-08-16 229 Quito Pichincha C 12 GROCERY I 1010 \n",
"2016-06-30 182 Quito Pichincha B 6 GROCERY I 1042 \n",
"\n",
" perishable transactions unit_sales \n",
"date \n",
"2016-07-06 0 712.0 5.0 \n",
"2016-08-16 0 2877.0 1.0 \n",
"2016-08-15 0 2227.0 1.0 \n",
"2016-08-16 0 1530.0 7.0 \n",
"2016-06-30 0 1824.0 4.0 "
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_samp.tail()"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"%proc_df\n",
"[x, y, nas, mapper(optional)]:\n",
"\n",
" x: x is the transformed version of df. x will not have the response variable\n",
" and is entirely numeric.\n",
"\n",
" y: y is the response variable\n",
"\n",
" nas (handles missing values): returns a dictionary of which nas it created, and the associated median.\n",
"\n",
" mapper: A DataFrameMapper which stores the mean and standard deviation of the corresponding continous\n",
" variables which is then used for scaling of during test-time.\n"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"df, y, nas, mapper = proc_df(joined_samp, 'unit_sales', do_scale=True)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/eric/anaconda3/envs/fastai/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log\n",
" \"\"\"Entry point for launching an IPython kernel.\n"
]
}
],
"source": [
"yl = np.log(y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# df is now a entirely numeric dataframe, without the \"unit sales\" columns\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1. , 1. , 5. , 12. , 10. , 3. , 90. , 5. , 3. , 4. ,\n",
" 2. , 3. , 8. , 5. , 7. , 1. , 2. , 1. , 2. , 8. ,\n",
" ..., 3. , 8. , 11. , 7. , 2. , 12. , 183.517, 1. , 14. ,\n",
" 1. , 3. , 6. , 5. , 5. , 7. , 5. , 1. , 1. , 7. ,\n",
" 4. ])"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# y contains the \"unit sales\" now\n",
"y"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"min(y)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1254112"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(y)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"13757.0"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max(y)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0. , 0. , 1.60944, 2.48491, 2.30259, 1.09861, 4.49981, 1.60944, 1.09861, 1.38629,\n",
" 0.69315, 1.09861, 2.07944, 1.60944, 1.94591, 0. , 0.69315, 0. , 0.69315, 2.07944,\n",
" ..., 1.09861, 2.07944, 2.3979 , 1.94591, 0.69315, 2.48491, 5.21231, 0. , 2.63906,\n",
" 0. , 1.09861, 1.79176, 1.60944, 1.60944, 1.94591, 1.60944, 0. , 0. , 1.94591,\n",
" 1.38629])"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"yl"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1254112"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(yl)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.isnan(y).any()"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"joined_test = joined_test.set_index(\"date\")"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" store_nbr | \n",
" item_nbr | \n",
" onpromotion | \n",
" Year | \n",
" Month | \n",
" Week | \n",
" Day | \n",
" Dayofweek | \n",
" Dayofyear | \n",
" city | \n",
" state | \n",
" type | \n",
" cluster | \n",
" family | \n",
" class | \n",
" perishable | \n",
" transactions | \n",
" unit_sales | \n",
" id | \n",
"
\n",
" \n",
" date | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2017-08-16 | \n",
" 1 | \n",
" NaN | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1093 | \n",
" 0 | \n",
" NaN | \n",
" 0 | \n",
" 125497040 | \n",
"
\n",
" \n",
" 2017-08-16 | \n",
" 1 | \n",
" 99197.0 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1067 | \n",
" 0 | \n",
" NaN | \n",
" 0 | \n",
" 125497041 | \n",
"
\n",
" \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103501.0 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" CLEANING | \n",
" 3008 | \n",
" 0 | \n",
" NaN | \n",
" 0 | \n",
" 125497042 | \n",
"
\n",
" \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103520.0 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" GROCERY I | \n",
" 1028 | \n",
" 0 | \n",
" NaN | \n",
" 0 | \n",
" 125497043 | \n",
"
\n",
" \n",
" 2017-08-16 | \n",
" 1 | \n",
" 103665.0 | \n",
" False | \n",
" 2017 | \n",
" 8 | \n",
" 33 | \n",
" 16 | \n",
" 2 | \n",
" 228 | \n",
" Quito | \n",
" Pichincha | \n",
" D | \n",
" 13 | \n",
" BREAD/BAKERY | \n",
" 2712 | \n",
" 1 | \n",
" NaN | \n",
" 0 | \n",
" 125497044 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" store_nbr item_nbr onpromotion Year Month Week Day Dayofweek \\\n",
"date \n",
"2017-08-16 1 NaN False 2017 8 33 16 2 \n",
"2017-08-16 1 99197.0 False 2017 8 33 16 2 \n",
"2017-08-16 1 103501.0 False 2017 8 33 16 2 \n",
"2017-08-16 1 103520.0 False 2017 8 33 16 2 \n",
"2017-08-16 1 103665.0 False 2017 8 33 16 2 \n",
"\n",
" Dayofyear city state type cluster family class \\\n",
"date \n",
"2017-08-16 228 Quito Pichincha D 13 GROCERY I 1093 \n",
"2017-08-16 228 Quito Pichincha D 13 GROCERY I 1067 \n",
"2017-08-16 228 Quito Pichincha D 13 CLEANING 3008 \n",
"2017-08-16 228 Quito Pichincha D 13 GROCERY I 1028 \n",
"2017-08-16 228 Quito Pichincha D 13 BREAD/BAKERY 2712 \n",
"\n",
" perishable transactions unit_sales id \n",
"date \n",
"2017-08-16 0 NaN 0 125497040 \n",
"2017-08-16 0 NaN 0 125497041 \n",
"2017-08-16 0 NaN 0 125497042 \n",
"2017-08-16 0 NaN 0 125497043 \n",
"2017-08-16 1 NaN 0 125497044 "
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_test.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# joined_test.drop(['transactions'], axis = 1, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"df_test, _, nas, mapper = proc_df(joined_test, 'unit_sales', do_scale=True, skip_flds=['transactions'],\n",
" na_dict=nas)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In time series data, cross-validation is not random. Instead, our holdout data is generally the most recent data, as it would be in real application. This issue is discussed in detail in [this post](http://www.fast.ai/2017/11/13/validation-sets/) on our web site.\n",
"\n",
"One approach is to take the last 25% of rows (sorted by date) as our validation set."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"#ratio of .754 is 16 days by 65 days, to be close to real test duration\n",
"train_ratio = 0.754\n",
"# train_ratio = 0.9\n",
"train_size = int(samp_size * train_ratio); train_size\n",
"val_idx = list(range(train_size, len(df)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An even better option for picking a validation set is using the exact same length of time period as the test set uses - this is implemented here:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"val_idx = np.flatnonzero(\n",
" (df.index<=datetime.datetime(2016,8,16)) & (df.index>=datetime.datetime(2016,8,31)))"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"308512"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(val_idx)"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1254112"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"samp_size"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.753999642775127"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1 - (len(val_idx)/ samp_size)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep Learning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're ready to put together our models.\n",
"\n",
"Root-mean-squared percent error is the metric Kaggle used for this ROSSMANN competition."
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [],
"source": [
"#from Rossmann\n",
"def inv_y(a): return np.exp(a)\n",
"\n",
"def exp_rmspe(y_pred, targ):\n",
" targ = inv_y(targ)\n",
" pct_var = (targ - inv_y(y_pred))/targ\n",
" return math.sqrt((pct_var**2).mean())\n",
"\n",
"max_log_y = np.max(y)\n",
"y_range = (0, max_log_y*1.2)"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"# Favorita\n",
"# Normalized Weighted Root Mean Squared Logarithmic Error (NWRMSLE)\n",
"# https://www.kaggle.com/tkm2261/eval-metric-and-evaluating-last-year-sales-bench\n",
"WEIGHTS = \n",
"def NWRMSLE(y, pred):\n",
" y = y.clip(0, y.max())\n",
" pred = pred.clip(0, pred.max())\n",
" score = np.nansum(WEIGHTS * ((np.log1p(pred) - np.log1p(y)) ** 2)) / WEIGHTS.sum()\n",
" return np.sqrt(score)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can create a ModelData object directly from out data frame."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=512,\n",
" test_df=df_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some categorical variables have a lot more levels than others. Store, in particular, has over a thousand in the Rossmann competition.\n",
"Let's see in Favorita."
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('store_nbr', 54),\n",
" ('item_nbr', 3660),\n",
" ('onpromotion', 3),\n",
" ('Year', 2),\n",
" ('Month', 4),\n",
" ('Week', 11),\n",
" ('Day', 32),\n",
" ('Dayofweek', 8),\n",
" ('Dayofyear', 66),\n",
" ('city', 23),\n",
" ('state', 17),\n",
" ('type', 6),\n",
" ('cluster', 18),\n",
" ('family', 33),\n",
" ('class', 318),\n",
" ('perishable', 3)]"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cat_sz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the *cardinality* of each variable (that is, its number of unique values) to decide how large to make its *embeddings*. Each level will be associated with a vector with length defined as below."
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(54, 27),\n",
" (3660, 50),\n",
" (3, 2),\n",
" (2, 1),\n",
" (4, 2),\n",
" (11, 6),\n",
" (32, 16),\n",
" (8, 4),\n",
" (66, 33),\n",
" (23, 12),\n",
" (17, 9),\n",
" (6, 3),\n",
" (18, 9),\n",
" (33, 17),\n",
" (318, 50),\n",
" (3, 2)]"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"emb_szs"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),\n",
" 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)\n",
"lr = 1e-3"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6b61bdc7eafb4c5489c86809e36cda72",
"version_major": 2,
"version_minor": 0
},
"text/html": [
"Failed to display Jupyter Widget of type HBox
.
\n",
"\n",
" If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
" that the widgets JavaScript is still loading. If this message persists, it\n",
" likely means that the widgets JavaScript library is either not installed or\n",
" not enabled. See the Jupyter\n",
" Widgets Documentation for setup instructions.\n",
"
\n",
"\n",
" If you're reading this message in another frontend (for example, a static\n",
" rendering on GitHub or NBViewer),\n",
" it may mean that your frontend doesn't currently support widgets.\n",
"
\n"
],
"text/plain": [
"HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0%| | 6/1847 [00:06<31:59, 1.04s/it, loss=inf] \n",
" \r"
]
}
],
"source": [
"m.lr_find()"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAZQAAAEOCAYAAACuOOGFAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvAOZPmwAAEz5JREFUeJzt3X2wXHV9x/H3B1KxVoyAASEhhhasE8enumCp2qFVIjrVqFCJbW1U2vSJanVqi7UdFZ2KT7U+1whWaqug0GIEa3hQqm0VcoNSCEpJox1SUKJBCqIw6Ld/7EGW2725m9zfvZubvF8zO3vO7/zOOd+998753N85u2dTVUiSNFP7jLsASdKewUCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDWxYNwFzKWHPexhtWzZsnGXIUnzysaNG79dVYum67dXBcqyZcuYmJgYdxmSNK8k+e9R+nnKS5LUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITYw2UJCckuT7J5iSnDVm+X5Jzu+VXJFk2afnSJHck+eO5qlmSNNzYAiXJvsB7gWcCy4EXJlk+qdspwK1VdSTwDuDNk5a/A/jn2a5VkjS9cY5QjgE2V9WWqrobOAdYOanPSuDsbvo84GlJApDkucAWYNMc1StJ2oFxBspi4MaB+a1d29A+VXUPcBtwUJKfAv4UeP0c1ClJGsE4AyVD2mrEPq8H3lFVd0y7k2RNkokkE9u2bduFMiVJo1gwxn1vBQ4fmF8C3DRFn61JFgALge3Ak4CTkrwFeCjwoyQ/qKr3TN5JVa0F1gL0er3JgSVJamScgbIBOCrJEcD/AKuAX5vUZx2wGvgicBLw2aoq4Kn3dkjyOuCOYWEiSZo7YwuUqronyanAemBf4ENVtSnJ6cBEVa0DzgI+kmQz/ZHJqnHVK0nasfT/4d879Hq9mpiYGHcZkjSvJNlYVb3p+vlJeUlSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNjDVQkpyQ5Pokm5OcNmT5fknO7ZZfkWRZ1358ko1Jrumef3mua5ck3d/YAiXJvsB7gWcCy4EXJlk+qdspwK1VdSTwDuDNXfu3gWdX1WOA1cBH5qZqSdJUxjlCOQbYXFVbqupu4Bxg5aQ+K4Gzu+nzgKclSVV9uapu6to3AQ9Mst+cVC1JGmqcgbIYuHFgfmvXNrRPVd0D3AYcNKnPicCXq+quWapTkjSCBWPcd4a01c70SfJo+qfBVky5k2QNsAZg6dKlO1+lJGkk4xyhbAUOH5hfAtw0VZ8kC4CFwPZufgnwT8BvVtV/TbWTqlpbVb2q6i1atKhh+ZKkQeMMlA3AUUmOSPIAYBWwblKfdfQvugOcBHy2qirJQ4GLgFdX1b/NWcWSpCmNLVC6ayKnAuuBrwIfr6pNSU5P8pyu21nAQUk2A68E7n1r8anAkcBfJPlK9zh4jl+CJGlAqiZftthz9Xq9mpiYGHcZkjSvJNlYVb3p+vlJeUlSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCiSpCYMFElSEwaKJKkJA0WS1MRIgZLk5Ukekr6zklyVZMVsFydJmj9GHaG8tKr+F1gBLAJeApwxa1VJkuadUQMl3fOzgL+tqqsH2iRJGjlQNia5mH6grE+yP/Cjme48yQlJrk+yOclpQ5bvl+TcbvkVSZYNLHt11359kmfMtBZJ0swsGLHfKcDjgS1VdWeSA+mf9tplSfYF3gscD2wFNiRZV1XXTdrvrVV1ZJJVwJuBk5MsB1YBjwYOAy5N8siq+uFMapIk7bpRRyjHAtdX1XeT/Abw58BtM9z3McDmqtpSVXcD5wArJ/VZCZzdTZ8HPC1JuvZzququqvo6sLnbniRpTEYNlPcDdyZ5HPAnwH8DfzfDfS8GbhyY39q1De1TVffQD7GDRlxXkjSHRg2Ue6qq6I8M3llV7wT2n+G+h13UrxH7jLJufwPJmiQTSSa2bdu2kyVKkkY1aqDcnuTVwIuAi7rrHz8xw31vBQ4fmF8C3DRVnyQLgIXA9hHXBaCq1lZVr6p6ixYtmmHJkqSpjBooJwN30f88yjfpn1566wz3vQE4KskRSR5A/yL7ukl91gGru+mTgM92I6V1wKruXWBHAEcBV86wHknSDIz0Lq+q+maSfwCOTvIrwJVVNaNrKFV1T5JTgfXAvsCHqmpTktOBiapaB5wFfCTJZvojk1XdupuSfBy4DrgH+APf4SVJ45X+P/zTdEpeQH9Ecjn96xdPBV5VVefNanWN9Xq9mpiYGHcZkjSvJNlYVb3p+o36OZTXAEdX1S3dxhcBl9J/K68kSSNfQ9nn3jDpfGcn1pUk7QVGHaF8Jsl64GPd/MnAp2enJEnSfDTqRflXJTkReDL9ayhrq+qfZrUySdK8MuoIhao6Hzh/FmuRJM1jOwyUJLcz/BPoAaqqHjIrVUmS5p0dBkpVzfT2KpKkvYTv1JIkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWrCQJEkNWGgSJKaMFAkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWpiLIGS5MAklyS5oXs+YIp+q7s+NyRZ3bU9KMlFSb6WZFOSM+a2eknSMOMaoZwGXFZVRwGXdfP3k+RA4LXAk4BjgNcOBM/bqupRwBOAJyd55tyULUmayrgCZSVwdjd9NvDcIX2eAVxSVdur6lbgEuCEqrqzqj4HUFV3A1cBS+agZknSDowrUA6pqpsBuueDh/RZDNw4ML+1a/uxJA8Fnk1/lCNJGqMFs7XhJJcCDx+y6DWjbmJIWw1sfwHwMeBdVbVlB3WsAdYALF26dMRdS5J21qwFSlU9faplSb6V5NCqujnJocAtQ7ptBY4bmF8CXD4wvxa4oar+epo61nZ96fV6taO+kqRdN65TXuuA1d30auCTQ/qsB1YkOaC7GL+iayPJG4GFwB/NQa2SpBGMK1DOAI5PcgNwfDdPkl6SMwGqajvwBmBD9zi9qrYnWUL/tNly4KokX0nyW+N4EZKk+6Rq7zkL1Ov1amJiYtxlSNK8kmRjVfWm6+cn5SVJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJamIsgZLkwCSXJLmhez5gin6ruz43JFk9ZPm6JNfOfsWSpOmMa4RyGnBZVR0FXNbN30+SA4HXAk8CjgFeOxg8SZ4P3DE35UqSpjOuQFkJnN1Nnw08d0ifZwCXVNX2qroVuAQ4ASDJg4FXAm+cg1olSSMYV6AcUlU3A3TPBw/psxi4cWB+a9cG8Abg7cCds1mkJGl0C2Zrw0kuBR4+ZNFrRt3EkLZK8njgyKp6RZJlI9SxBlgDsHTp0hF3LUnaWbMWKFX19KmWJflWkkOr6uYkhwK3DOm2FThuYH4JcDlwLPDEJN+gX//BSS6vquMYoqrWAmsBer1e7fwrkSSNYlynvNYB975razXwySF91gMrkhzQXYxfAayvqvdX1WFVtQx4CvCfU4WJJGnujCtQzgCOT3IDcHw3T5JekjMBqmo7/WslG7rH6V2bJGk3lKq95yxQr9eriYmJcZchSfNKko1V1Zuun5+UlyQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1YaBIkpowUCRJTRgokqQmDBRJUhMGiiSpCQNFktSEgSJJasJAkSQ1kaoadw1zJsk24LvAbbuw+sOAb7etSDuwkF37Pe3OdtfXNK66Znu/rbffansz2c6urjvT49cjqmrRdJ32qkABSLK2qtbswnoTVdWbjZr0/+3q72l3tru+pnHVNdv7bb39VtubyXZ29+PX3njK61PjLkAj2RN/T7vraxpXXbO939bbb7W9mWxnd/0bAvbCEcqucoQiab5yhLL7WTvuAiRpF83J8csRiiSpCUcokqQmDBRJUhMGiiSpCQNlFyX5qSRnJ/lgkl8fdz2SNKokP53krCTntdyugTIgyYeS3JLk2kntJyS5PsnmJKd1zc8Hzquq3waeM+fFStKAnTl+VdWWqjqldQ0Gyv19GDhhsCHJvsB7gWcCy4EXJlkOLAFu7Lr9cA5rlKRhPszox69ZYaAMqKrPA9snNR8DbO4S/W7gHGAlsJV+qIA/R0ljtpPHr1nhgXB6i7lvJAL9IFkM/CNwYpL3s5vfDkHSXmvo8SvJQUn+BnhCkle32tmCVhvag2VIW1XV94CXzHUxkrQTpjp+fQf43dY7c4Qyva3A4QPzS4CbxlSLJO2MOT1+GSjT2wAcleSIJA8AVgHrxlyTJI1iTo9fBsqAJB8Dvgj8bJKtSU6pqnuAU4H1wFeBj1fVpnHWKUmT7Q7HL28OKUlqwhGKJKkJA0WS1ISBIklqwkCRJDVhoEiSmjBQJElNGCjabSW5Yw728ZyBrySYE0mOS/ILu7DeE5Kc2U2/OMl72le385Ism3zL9CF9FiX5zFzVpPEwULTH627hPVRVrauqM2Zhnzu6T95xwE4HCvBnwLt3qaAxq6ptwM1JnjzuWjR7DBTNC0lelWRDkv9I8vqB9guSbEyyKcmagfY7kpye5Arg2CTfSPL6JFcluSbJo7p+P/5PP8mHk7wryb8n2ZLkpK59nyTv6/ZxYZJP37tsUo2XJ/nLJP8CvDzJs5NckeTLSS5NckiSZfRvyveKJF9J8tTuv/fzu9e3YdhBN8n+wGOr6uohyx6R5LLuZ3NZkqVd+88k+VK3zdOHjfi6bx69KMnVSa5NcnLXfnT3c7g6yZVJ9u9GIl/ofoZXDRtlJdk3yVsHfle/M7D4AsBvN92TVZUPH7vlA7ije14BrKV/59R9gAuBX+yWHdg9/yRwLXBQN1/ACwa29Q3gD7vp3wfO7KZfDLynm/4w8IluH8vpf48EwEnAp7v2hwO3AicNqfdy4H0D8wdw390ofgt4ezf9OuCPB/p9FHhKN70U+OqQbf8ScP7A/GDdnwJWd9MvBS7opi8EXthN/+69P89J2z0R+ODA/ELgAcAW4Oiu7SH070z+IOCBXdtRwEQ3vQy4tpteA/x5N70fMAEc0c0vBq4Z99+Vj9l7ePt6zQcruseXu/kH0z+gfR54WZLnde2Hd+3fof8tmudP2s4/ds8b6X+F8zAXVNWPgOuSHNK1PQX4RNf+zSSf20Gt5w5MLwHOTXIo/YP016dY5+nA8uTHdxp/SJL9q+r2gT6HAtumWP/YgdfzEeAtA+3P7aY/CrxtyLrXAG9L8mbgwqr6QpLHADdX1QaAqvpf6I9mgPckeTz9n+8jh2xvBfDYgRHcQvq/k68DtwCHTfEatAcwUDQfBHhTVX3gfo3JcfQPxsdW1Z1JLgce2C3+QVVN/mrmu7rnHzL13/5dA9OZ9DyK7w1Mvxv4q6pa19X6uinW2Yf+a/j+Drb7fe57bdMZ+QZ9VfWfSZ4IPAt4U5KL6Z+aGraNVwDfAh7X1fyDIX1CfyS4fsiyB9J/HdpDeQ1F88F64KVJHgyQZHGSg+n/93trFyaPAn5+lvb/r/S/nXOfbtRy3IjrLQT+p5tePdB+O7D/wPzF9O8IC0A3Apjsq8CRU+zn3+nflhz61yj+tZv+Ev1TWgwsv58khwF3VtXf0x/B/BzwNeCwJEd3ffbv3mSwkP7I5UfAi4Bhb3ZYD/xekp/o1n1kN7KB/ohmh+8G0/xmoGi3V1UX0z9l88Uk1wDn0T8gfwZYkOQ/gDfQP4DOhvPpf1HRtcAHgCuA20ZY73XAJ5J8Afj2QPungOfde1EeeBnQ6y5iX8eQb9Krqq8BC7uL85O9DHhJ93N4EfDyrv2PgFcmuZL+KbNhNT8GuDLJV4DXAG+s/nePnwy8O8nVwCX0RxfvA1Yn+RL9cPjekO2dCVwHXNW9lfgD3Dca/CXgoiHraA/h7eulESR5cFXdkeQg4ErgyVX1zTmu4RXA7VV15oj9HwR8v6oqySr6F+hXzmqRO67n88DKqrp1XDVodnkNRRrNhUkeSv/i+hvmOkw67wd+dSf6P5H+RfQA36X/DrCxSLKI/vUkw2QP5ghFktSE11AkSU0YKJKkJgwUSVITBookqQkDRZLUhIEiSWri/wAuNq8iO7eWkwAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"m.sched.plot(100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sample"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),\n",
" 0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)\n",
"lr = 1e-3"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "4d59028d7c724ed78700c7cf12bdfed3",
"version_major": 2,
"version_minor": 0
},
"text/html": [
"Failed to display Jupyter Widget of type HBox
.
\n",
"\n",
" If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
" that the widgets JavaScript is still loading. If this message persists, it\n",
" likely means that the widgets JavaScript library is either not installed or\n",
" not enabled. See the Jupyter\n",
" Widgets Documentation for setup instructions.\n",
"
\n",
"\n",
" If you're reading this message in another frontend (for example, a static\n",
" rendering on GitHub or NBViewer),\n",
" it may mean that your frontend doesn't currently support widgets.\n",
"
\n"
],
"text/plain": [
"HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0. nan nan nan] \n",
"\n"
]
}
],
"source": [
"m.fit(lr, 1, metrics=[exp_rmspe])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m.fit(lr, 3, metrics=[exp_rmspe])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}