{ "cells": [ { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from pandas import DataFrame, read_csv\n", "import pandas as pd\n", "import numpy as np\n", "import plotly.offline as py\n", "py.init_notebook_mode()\n", "import plotly.graph_objs as go\n", "from plotly import tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import and format data" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "date datetime64[ns]\n", "store category\n", "item category\n", "sales int64\n", "dtype: object" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('sales.csv')\n", "df.OrderDate = pd.to_datetime(df.OrderDate)\n", "df.columns = ['date', 'store', 'item', 'sales']\n", "df.date = pd.to_datetime(df.date)\n", "df.item = df.item.astype('category')\n", "df.store = df.store.astype('category')\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below are some summary statistics on the data. Overall the quantities for individual items at individual stores is quite small. It would be difficult to forecast daily quantities of individual items and individual stores, so we will work towards forecasting weekly item sales at individual stores." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datestoreitemsales
count350796350796350796.0350796.000000
unique30118139.0NaN
top2016-11-05 00:00:00SEGWD741795.0NaN
freq4110535322282.0NaN
first2016-02-04 00:00:00NaNNaNNaN
last2016-11-30 00:00:00NaNNaNNaN
meanNaNNaNNaN1.681211
stdNaNNaNNaN1.134632
minNaNNaNNaN1.000000
25%NaNNaNNaN1.000000
50%NaNNaNNaN1.000000
75%NaNNaNNaN2.000000
maxNaNNaNNaN112.000000
\n", "
" ], "text/plain": [ " date store item sales\n", "count 350796 350796 350796.0 350796.000000\n", "unique 301 181 39.0 NaN\n", "top 2016-11-05 00:00:00 SEGWD7 41795.0 NaN\n", "freq 4110 5353 22282.0 NaN\n", "first 2016-02-04 00:00:00 NaN NaN NaN\n", "last 2016-11-30 00:00:00 NaN NaN NaN\n", "mean NaN NaN NaN 1.681211\n", "std NaN NaN NaN 1.134632\n", "min NaN NaN NaN 1.000000\n", "25% NaN NaN NaN 1.000000\n", "50% NaN NaN NaN 1.000000\n", "75% NaN NaN NaN 2.000000\n", "max NaN NaN NaN 112.000000" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at a few cuts on the data to see if we can spot any trends. Below are plots of a handful of individual stores' sales. It looks like the answer to the question about the big jump in sales in September is the addition of a good number of stores." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Total Sales" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the total sales graph we took a look at yesterday." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_total = df.groupby(pd.Grouper(freq='W', key='date')).sum().fillna(0).unstack('date', 0)\n", "df_total.index.levels[1]\n", "\n", "len(df_total) == len(df_total.index.levels[1])\n", "\n", "trace = go.Scatter(\n", " x = df_total.index.levels[1],\n", " y = df_total\n", ")\n", "\n", "layout = go.Layout(\n", " title='Total Sales'\n", ")\n", "\n", "\n", "fig = go.Figure(data=[trace], layout=layout)\n", "py.iplot(fig, filename='total-sales')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We may want to look at this data to figure out if we need to limit that date range to be from March to November in order to avoid any partial weeks. But that shouldn't be too much of an issue because I'm going to set the training data up to make predictions of sales of items at individual stores, so as long as the drop offs on either end are the result of fewer stores being included in the data, that shouldn't have a negative impact on the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Individual Store Sales" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is the format of your plot grid:\n", "[ (1,1) x1,y1 ] [ (1,2) x2,y1 ] [ (1,3) x3,y1 ] \n", "[ (2,1) x4,y2 ] [ (2,2) x5,y2 ] [ (2,3) x6,y2 ] \n", "[ (3,1) x7,y3 ] [ (3,2) x8,y3 ] [ (3,3) x9,y3 ] \n", "[ (4,1) x10,y4 ] [ (4,2) x11,y4 ] [ (4,3) x12,y4 ] \n", "[ (5,1) x13,y5 ] [ (5,2) x14,y5 ] [ (5,3) x15,y5 ] \n", "[ (6,1) x16,y6 ] [ (6,2) x17,y6 ] [ (6,3) x18,y6 ] \n", "[ (7,1) x19,y7 ] [ (7,2) x20,y7 ] [ (7,3) x21,y7 ] \n", "[ (8,1) x22,y8 ] [ (8,2) x23,y8 ] [ (8,3) x24,y8 ] \n", "[ (9,1) x25,y9 ] [ (9,2) x26,y9 ] [ (9,3) x27,y9 ] \n", "[ (10,1) x28,y10 ] [ (10,2) x29,y10 ] [ (10,3) x30,y10 ]\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_1w = df.groupby(['store']+[pd.Grouper(freq='W', key='date')]).sum().fillna(0).unstack('date', 0)\n", "\n", "rows = 10\n", "cols = 3\n", "spidx = np.arange(rows*cols).reshape(rows,cols)\n", "\n", "fig = tools.make_subplots(rows=rows, cols=cols, shared_yaxes=True, subplot_titles=df_1w.index[:rows*cols])\n", "\n", "for i in range(rows):\n", "\n", " for j in range(cols):\n", "\n", " trace = go.Scatter(c\n", " x = df_1w.iloc[1].index.levels[1],\n", " y = df_1w.iloc[spidx[i,j]],\n", " )\n", "\n", " fig.append_trace(trace, i+1, j+1)\n", "\n", "fig['layout'].update(height=250*rows, title='Sales by Store', showlegend=False);\n", "py.iplot(fig, filename='sales-by-store')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### New Stores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So let's do a plot of the count of the number of stores with sales in each week. This confirms that there were a lot of new stores added in September." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "store_sales = df.groupby(['store']+[pd.Grouper(freq='W', key='date')]).sum().fillna(0).unstack('date')\n", "stores_with_sales = store_sales['sales'].where(store_sales.sales > 0).count()\n", "\n", "stores_with_sales.index\n", "\n", "trace = go.Bar(\n", " x = stores_with_sales.index,\n", " y = stores_with_sales\n", ")\n", "\n", "layout = go.Layout(\n", " title='No. of Stores with Sales'\n", ")\n", "\n", "\n", "fig = go.Figure(data=[trace], layout=layout)\n", "py.iplot(fig, filename='stores-with-sales')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look sales for the individual items." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is the format of your plot grid:\n", "[ (1,1) x1,y1 ] [ (1,2) x2,y1 ] [ (1,3) x3,y1 ] \n", "[ (2,1) x4,y2 ] [ (2,2) x5,y2 ] [ (2,3) x6,y2 ] \n", "[ (3,1) x7,y3 ] [ (3,2) x8,y3 ] [ (3,3) x9,y3 ] \n", "[ (4,1) x10,y4 ] [ (4,2) x11,y4 ] [ (4,3) x12,y4 ] \n", "[ (5,1) x13,y5 ] [ (5,2) x14,y5 ] [ (5,3) x15,y5 ] \n", "[ (6,1) x16,y6 ] [ (6,2) x17,y6 ] [ (6,3) x18,y6 ] \n", "[ (7,1) x19,y7 ] [ (7,2) x20,y7 ] [ (7,3) x21,y7 ] \n", "[ (8,1) x22,y8 ] [ (8,2) x23,y8 ] [ (8,3) x24,y8 ] \n", "[ (9,1) x25,y9 ] [ (9,2) x26,y9 ] [ (9,3) x27,y9 ] \n", "[ (10,1) x28,y10 ] [ (10,2) x29,y10 ] [ (10,3) x30,y10 ]\n", "[ (11,1) x31,y11 ] [ (11,2) x32,y11 ] [ (11,3) x33,y11 ]\n", "[ (12,1) x34,y12 ] [ (12,2) x35,y12 ] [ (12,3) x36,y12 ]\n", "[ (13,1) x37,y13 ] [ (13,2) x38,y13 ] [ (13,3) x39,y13 ]\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df_1w = df.groupby(['item']+[pd.Grouper(freq='W', key='date')]).sum().fillna(0).unstack('date', 0)\n", "rows = 13\n", "cols = 3\n", "\n", "fig = tools.make_subplots(rows=rows, cols=cols, shared_yaxes=True, subplot_titles=df_1w.index[:rows*cols])\n", "\n", "spidx = np.arange(rows*cols).reshape(rows,cols)\n", "\n", "for i in range(rows):\n", "\n", " for j in range(cols):\n", "\n", " trace = go.Scatter(\n", " x = df_1w.iloc[1].index.levels[1],\n", " y = df_1w.iloc[spidx[i,j]],\n", " )\n", "\n", " fig.append_trace(trace, i+1, j+1)\n", "\n", "fig['layout'].update(height=250*rows, title='Sales by Store', showlegend=False);\n", "py.iplot(fig, filename='sales-by-store')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It also looks like there were some big increases in sales of individual items. It would be interesting to do some more analysis to figure out if those items were in in the new stores that came on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add Trailing Average Sales" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for the actual model. The first thing we'll do is add in a rolling average of the prior three weeks' sales." ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salessales_avg
storeitemdate
SEGWD103417742016-02-071.0NaN
2016-02-143.02.333333
2016-02-213.02.666667
2016-02-282.02.666667
2016-03-063.03.333333
2016-03-135.04.666667
2016-03-206.05.666667
2016-03-276.05.333333
2016-04-034.05.333333
2016-04-106.05.000000
\n", "
" ], "text/plain": [ " sales sales_avg\n", "store item date \n", "SEGWD103 41774 2016-02-07 1.0 NaN\n", " 2016-02-14 3.0 2.333333\n", " 2016-02-21 3.0 2.666667\n", " 2016-02-28 2.0 2.666667\n", " 2016-03-06 3.0 3.333333\n", " 2016-03-13 5.0 4.666667\n", " 2016-03-20 6.0 5.666667\n", " 2016-03-27 6.0 5.333333\n", " 2016-04-03 4.0 5.333333\n", " 2016-04-10 6.0 5.000000" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_model = df.groupby(['store', 'item']+[pd.Grouper(freq='W', key='date')]).sum().fillna(0)\n", "rolling_sum = df_model.apply(lambda x:x.rolling(window=3).mean())\n", "rolling_sum.shift(-1)\n", "df_model['sales_avg'] = rolling_sum.shift(-1)['sales']\n", "df_model.head(10)" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_model.loc['SEGWD7'].to_csv('SEGWD7.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Remove Missing Stores and Items" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Then we need to get rid of all the rows where there were no sales because an item hadn't yet been introduced or a store didn't exist. We'll do that by adding a column for cumulative sales and then dropping all rows with zero cumulative sales. (There is still some noise in the data that we might want to take a look at for normal product introductions and eliminations.)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storeitemdatesalessales_avgcum_sales
count310596310596.0310596310596.000000310594.000000310596.000000
unique18139.044NaNNaNNaN
topSEGWD9742052.02016-07-10 00:00:00NaNNaNNaN
freq17167964.07059NaNNaNNaN
firstNaNNaN2016-02-07 00:00:00NaNNaNNaN
lastNaNNaN2016-12-04 00:00:00NaNNaNNaN
meanNaNNaNNaN1.8988071.89881136.002048
stdNaNNaNNaN3.8534543.66709988.267833
minNaNNaNNaN0.0000000.0000000.000000
25%NaNNaNNaN0.0000000.0000000.000000
50%NaNNaNNaN0.0000000.0000000.000000
75%NaNNaNNaN2.0000002.66666724.000000
maxNaNNaNNaN118.00000086.0000001977.000000
\n", "
" ], "text/plain": [ " store item date sales sales_avg \\\n", "count 310596 310596.0 310596 310596.000000 310594.000000 \n", "unique 181 39.0 44 NaN NaN \n", "top SEGWD97 42052.0 2016-07-10 00:00:00 NaN NaN \n", "freq 1716 7964.0 7059 NaN NaN \n", "first NaN NaN 2016-02-07 00:00:00 NaN NaN \n", "last NaN NaN 2016-12-04 00:00:00 NaN NaN \n", "mean NaN NaN NaN 1.898807 1.898811 \n", "std NaN NaN NaN 3.853454 3.667099 \n", "min NaN NaN NaN 0.000000 0.000000 \n", "25% NaN NaN NaN 0.000000 0.000000 \n", "50% NaN NaN NaN 0.000000 0.000000 \n", "75% NaN NaN NaN 2.000000 2.666667 \n", "max NaN NaN NaN 118.000000 86.000000 \n", "\n", " cum_sales \n", "count 310596.000000 \n", "unique NaN \n", "top NaN \n", "freq NaN \n", "first NaN \n", "last NaN \n", "mean 36.002048 \n", "std 88.267833 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 0.000000 \n", "75% 24.000000 \n", "max 1977.000000 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_model['cum_sales'] = df_model.groupby(level=[0,1]).cumsum()['sales']\n", "df_model.reset_index(inplace=True)\n", "df_model.describe(include='all')" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
storeitemdatesalessales_avgcum_sales
count144822144822.0144822144822.000000144820.000000144822.000000
unique18139.044NaNNaNNaN
topSEGWD741793.02016-11-27 00:00:00NaNNaNNaN
freq13364759.06424NaNNaNNaN
firstNaNNaN2016-02-07 00:00:00NaNNaNNaN
lastNaNNaN2016-12-04 00:00:00NaNNaNNaN
meanNaNNaNNaN4.0723234.03059477.212661
stdNaNNaNNaN4.7953404.501835116.308495
minNaNNaNNaN0.0000000.0000001.000000
25%NaNNaNNaN0.0000000.6666679.000000
50%NaNNaNNaN3.0000002.66666729.000000
75%NaNNaNNaN6.0000005.66666798.000000
maxNaNNaNNaN118.00000086.0000001977.000000
\n", "
" ], "text/plain": [ " store item date sales sales_avg \\\n", "count 144822 144822.0 144822 144822.000000 144820.000000 \n", "unique 181 39.0 44 NaN NaN \n", "top SEGWD7 41793.0 2016-11-27 00:00:00 NaN NaN \n", "freq 1336 4759.0 6424 NaN NaN \n", "first NaN NaN 2016-02-07 00:00:00 NaN NaN \n", "last NaN NaN 2016-12-04 00:00:00 NaN NaN \n", "mean NaN NaN NaN 4.072323 4.030594 \n", "std NaN NaN NaN 4.795340 4.501835 \n", "min NaN NaN NaN 0.000000 0.000000 \n", "25% NaN NaN NaN 0.000000 0.666667 \n", "50% NaN NaN NaN 3.000000 2.666667 \n", "75% NaN NaN NaN 6.000000 5.666667 \n", "max NaN NaN NaN 118.000000 86.000000 \n", "\n", " cum_sales \n", "count 144822.000000 \n", "unique NaN \n", "top NaN \n", "freq NaN \n", "first NaN \n", "last NaN \n", "mean 77.212661 \n", "std 116.308495 \n", "min 1.000000 \n", "25% 9.000000 \n", "50% 29.000000 \n", "75% 98.000000 \n", "max 1977.000000 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_model_masked = df_model[df_model.cum_sales != 0]\n", "df_model_masked.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encode Store and Item Categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like that eliminated quite a few observations, over half of them. Now we can encode the store and item variables as binary classifications and then we can use the data to train a model." ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "stores = pd.get_dummies(df_model_masked['store'])\n", "items = pd.get_dummies(df_model_masked['item'])\n", "df_final = pd.concat([df_model_masked, stores, items], axis=1).dropna(how='any')\n", "df_final.drop(['date', 'cum_sales', 'store', 'item', 41793, 'SEGWD103'], axis=1, inplace=True)\n", "# df_final.to_csv('modeldata.csv')" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salessales_avgSEGWD104SEGWD116SEGWD12SEGWD123SEGWD125SEGWD129SEGWD135SEGWD138...42043420444204542046420474204842049420504205142052
13.02.33333300000000...0000000000
23.02.66666700000000...0000000000
32.02.66666700000000...0000000000
43.03.33333300000000...0000000000
55.04.66666700000000...0000000000
\n", "

5 rows × 220 columns

\n", "
" ], "text/plain": [ " sales sales_avg SEGWD104 SEGWD116 SEGWD12 SEGWD123 SEGWD125 \\\n", "1 3.0 2.333333 0 0 0 0 0 \n", "2 3.0 2.666667 0 0 0 0 0 \n", "3 2.0 2.666667 0 0 0 0 0 \n", "4 3.0 3.333333 0 0 0 0 0 \n", "5 5.0 4.666667 0 0 0 0 0 \n", "\n", " SEGWD129 SEGWD135 SEGWD138 ... 42043 42044 42045 42046 42047 \\\n", "1 0 0 0 ... 0 0 0 0 0 \n", "2 0 0 0 ... 0 0 0 0 0 \n", "3 0 0 0 ... 0 0 0 0 0 \n", "4 0 0 0 ... 0 0 0 0 0 \n", "5 0 0 0 ... 0 0 0 0 0 \n", "\n", " 42048 42049 42050 42051 42052 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "5 0 0 0 0 0 \n", "\n", "[5 rows x 220 columns]" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_final.head()" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
salessales_avgSEGWD104SEGWD116SEGWD12SEGWD123SEGWD125SEGWD129SEGWD135SEGWD138...42043420444204542046420474204842049420504205142052
count144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000...144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000144820.000000
mean4.0723734.0305940.0033490.0033560.0090940.0091420.0033700.0091980.0091770.009142...0.0162750.0160130.0162200.0153850.0166410.0165450.0162750.0166550.0166140.016344
std4.7953554.5018350.0577740.0578330.0949280.0951780.0579510.0954630.0953560.095178...0.1265330.1255260.1263220.1230770.1279240.1275580.1265330.1279760.1278190.126797
min0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
25%0.0000000.6666670.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
50%3.0000002.6666670.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
75%6.0000005.6666670.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000000.000000
max118.00000086.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.000000...1.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.0000001.000000
\n", "

8 rows × 220 columns

\n", "
" ], "text/plain": [ " sales sales_avg SEGWD104 SEGWD116 \\\n", "count 144820.000000 144820.000000 144820.000000 144820.000000 \n", "mean 4.072373 4.030594 0.003349 0.003356 \n", "std 4.795355 4.501835 0.057774 0.057833 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.666667 0.000000 0.000000 \n", "50% 3.000000 2.666667 0.000000 0.000000 \n", "75% 6.000000 5.666667 0.000000 0.000000 \n", "max 118.000000 86.000000 1.000000 1.000000 \n", "\n", " SEGWD12 SEGWD123 SEGWD125 SEGWD129 \\\n", "count 144820.000000 144820.000000 144820.000000 144820.000000 \n", "mean 0.009094 0.009142 0.003370 0.009198 \n", "std 0.094928 0.095178 0.057951 0.095463 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 0.000000 0.000000 \n", "75% 0.000000 0.000000 0.000000 0.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 \n", "\n", " SEGWD135 SEGWD138 ... 42043 \\\n", "count 144820.000000 144820.000000 ... 144820.000000 \n", "mean 0.009177 0.009142 ... 0.016275 \n", "std 0.095356 0.095178 ... 0.126533 \n", "min 0.000000 0.000000 ... 0.000000 \n", "25% 0.000000 0.000000 ... 0.000000 \n", "50% 0.000000 0.000000 ... 0.000000 \n", "75% 0.000000 0.000000 ... 0.000000 \n", "max 1.000000 1.000000 ... 1.000000 \n", "\n", " 42044 42045 42046 42047 \\\n", "count 144820.000000 144820.000000 144820.000000 144820.000000 \n", "mean 0.016013 0.016220 0.015385 0.016641 \n", "std 0.125526 0.126322 0.123077 0.127924 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 0.000000 0.000000 \n", "75% 0.000000 0.000000 0.000000 0.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 \n", "\n", " 42048 42049 42050 42051 \\\n", "count 144820.000000 144820.000000 144820.000000 144820.000000 \n", "mean 0.016545 0.016275 0.016655 0.016614 \n", "std 0.127558 0.126533 0.127976 0.127819 \n", "min 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 0.000000 0.000000 0.000000 \n", "75% 0.000000 0.000000 0.000000 0.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 \n", "\n", " 42052 \n", "count 144820.000000 \n", "mean 0.016344 \n", "std 0.126797 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 0.000000 \n", "75% 0.000000 \n", "max 1.000000 \n", "\n", "[8 rows x 220 columns]" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_final.describe()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "(144820, 220)" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = df_final.dropna().as_matrix()\n", "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split Train and Test Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll save 30% of the data to test our model." 
] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "np.random.shuffle(data)\n", "split = int(0.70*data.shape[0])\n", "X_train = np.ones(data[:split,:].shape)\n", "X_test = np.ones(data[split:,:].shape)\n", "\n", "X_train[:,1:] = data[:split,1:]\n", "y_train = data[:split,0]\n", "\n", "X_test[:,1:] = data[split:,1:]\n", "y_test = data[split:,0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is where we actually train the model. I ran it for 200 iterations - more won't likely increase the predictive power of the model, but there are some other diagnostics we can run to see what other improvements we can make." ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,\n", " fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',\n", " loss='squared_loss', n_iter=200, penalty='l2', power_t=0.25,\n", " random_state=None, shuffle=True, verbose=0, warm_start=False)" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import linear_model\n", "clf = linear_model.SGDRegressor(n_iter=200)\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate Model" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R-squared: 0.8886\n", "Total error in sales quantity: -7424\n", "Total error as a % of actual: -4.20%\n", "Total error in sales quantity with zero min prediction: 14\n", "Total error as a % of actual with zero min prediction: -4.19%\n" ] } ], "source": [ "predict = clf.predict(X_test)\n", "predict_neg = predict < 0\n", "error = y_test - predict\n", "error_neg = predict_neg @ error\n", "np.savetxt('modelparams.csv', clf.predict(np.eye(X_test.shape[1])), delimiter=\",\")\n", "print('R-squared: {:.{p}f}'.format(clf.score(X_test, y_test), p=4))\n", "print('Total error in sales quantity: {:.{p}f}'.format(sum(error), p=0))\n", "print('Total error as a % of actual: {:.{p}f}%'.format(sum(error) / sum(y_test)*100, p=2))\n", "print('Total error in sales quantity with zero min prediction: {:.{p}f}'.format(error_neg, p=0))\n", "print('Total error as a % of actual with zero min prediction: {:.{p}f}%'.format((sum(error)+error_neg) / sum(y_test)*100, p=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An overall r-square of 85% is pretty good, I think, for really just dealing with three variables - store and item numbers and historical sales - and the overall error is very low on such a large number of test observations." ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "predicted = go.Bar(\n", " name = 'predicted',\n", " y = clf.predict(X_test)\n", ")\n", "\n", "actual = go.Bar(\n", " name = 'actual',\n", " y = y_test\n", ")\n", "\n", "layout = go.Layout(\n", " title='Actual vs. Predicted'\n", ")\n", "\n", "fig = go.Figure(data=[actual, predicted], layout=layout)\n", "py.iplot(fig, filename='actual-vs-predicted')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall this says that this is a pretty good model. The total error over 40,000 individual store-item observations is really low. The graph above shows that we are missing on the big spikes, but overall the performance is good." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Further analyze data\n", "2. Run model diagnostics\n", "3. Analyze model errors by hand\n", "4. Evaluate alternative model hyperparameters\n", "5. Complete Azure implementation" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "df_final.loc[:100].to_csv('samplemodeldata.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "np.eye(X_test.shape[1]) * clf.predict(X_test)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }