{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [01_Deterministic.ipynb](https://github.com/raybellwaves/xskillscore-tutorial/blob/master/01_Determinisitic.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" - In this notebook I show how `xskillscore` can be dropped into a typical data science task where the data is a [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).\n",
"\n",
" - I use the metric RMSE to verify forecasts of items sold.\n",
"\n",
" - I also show how you can apply weights to the verification and handle missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the necessary packages"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import xarray as xr\n",
"import pandas as pd\n",
"import numpy as np\n",
"import xskillscore as xs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say you are a data scientist working for a company that owns four stores, each of which sells three items (stock keeping units, or SKUs).\n",
"\n",
"Set up `stores` and `skus` arrays:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"stores = np.arange(4)\n",
"skus = np.arange(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You are tracking the daily performance of items sold between January 1st and January 5th, 2020.\n",
"\n",
"Set up the `dates` array:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"dates = pd.date_range(\"1/1/2020\", \"1/5/2020\", freq=\"D\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate a `pandas.DataFrame` to show the number of items that were sold during this period. The number of items sold will be a random integer between 1 and 9.\n",
"\n",
"This may be something you would obtain from querying a database:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"rows = []\n",
"for date in dates:\n",
"    for store in stores:\n",
"        for sku in skus:\n",
"            rows.append(\n",
"                {\n",
"                    \"DATE\": date,\n",
"                    \"STORE\": store,\n",
"                    \"SKU\": sku,\n",
"                    # randint(9) draws from 0-8, so adding 1 gives 1-9\n",
"                    \"QUANTITY_SOLD\": np.random.randint(9) + 1,\n",
"                }\n",
"            )\n",
"df = pd.DataFrame(rows)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the first 5 rows of the `pandas.DataFrame`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"        DATE  STORE  SKU  QUANTITY_SOLD\n",
"0 2020-01-01      0    0              9\n",
"1 2020-01-01      0    1              2\n",
"2 2020-01-01      0    2              2\n",
"3 2020-01-01      1    0              3\n",
"4 2020-01-01      1    1              1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your boss has asked you to use this data to predict the number of items sold at the store and SKU level for the next 5 days.\n",
"\n",
"The prediction itself is outside the scope of this tutorial, but we will use `xskillscore` to tell us how good our prediction may be."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, rename the target variable to ``y``:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"        DATE  STORE  SKU  y\n",
"0 2020-01-01      0    0  9\n",
"1 2020-01-01      0    1  2\n",
"2 2020-01-01      0    2  2\n",
"3 2020-01-01      1    0  3\n",
"4 2020-01-01      1    1  1"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.rename(columns={\"QUANTITY_SOLD\": \"y\"}, inplace=True)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use [pandas MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) to help handle the granularity of the forecast:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This also displays the data in a cleaner format in the notebook:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"                      y\n",
"DATE       STORE SKU   \n",
"2020-01-01 0     0    9\n",
"                 1    2\n",
"                 2    2\n",
"           1     0    3\n",
"                 1    1"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Time for your prediction! As mentioned, this is outside of the scope of this tutorial.\n",
"\n",
"In our case we are going to generate data to mimic a prediction by taking `y` and perturbing it randomly. This provides a middle ground between a prediction that overfits the data (being very similar to `y`) and the other extreme of random numbers, for which the skill will be 0.\n",
"\n",
"The perturbations will scale `y` by between -100% and 100% using a uniform distribution. For example, a value of 5 in `y` will become a value between 0 and 10 in the prediction (`yhat`)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set up the perturbation array:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"noise = np.random.uniform(-1, 1, size=len(df['y']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Name the prediction `yhat` and append it to the `pandas.DataFrame`.\n",
"\n",
"Lastly, convert it to an `int` to match the format of the target (`y`):"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"                      y  yhat\n",
"DATE       STORE SKU         \n",
"2020-01-01 0     0    9    13\n",
"                 1    2     3\n",
"                 2    2     2\n",
"           1     0    3     4\n",
"                 1    1     0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using xskillscore - RMSE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"RMSE (root-mean-square error) is the square root of the average of the squared differences between forecasts and verification data:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\\begin{align}\n",
"RMSE = \\sqrt{\\overline{(f - o)^{2}}}\n",
"\\end{align}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the error is squared, it is sensitive to outliers and is a more conservative metric than mean-absolute error.\n",
"\n",
"See https://climpred.readthedocs.io/en/stable/metrics.html#root-mean-square-error-rmse for further documentation."
]
},
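{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the formula concrete, here is a minimal plain-Python sketch (not how `xskillscore` computes it internally). The helper name `rmse_sketch` is hypothetical, and the numbers are the first five rows of `y` and `yhat` shown earlier:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def rmse_sketch(forecasts, observations):\n",
"    # Square each forecast-minus-observation difference, average, then take the root\n",
"    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, observations)]\n",
"    return math.sqrt(sum(squared_errors) / len(squared_errors))\n",
"\n",
"\n",
"y = [9, 2, 2, 3, 1]       # observed items sold\n",
"yhat = [13, 3, 2, 4, 0]   # forecast items sold\n",
"print(rmse_sketch(yhat, y))\n",
"```\n",
"\n",
"The squared errors are 16, 1, 0, 1 and 1, so the mean is 3.8 and the RMSE is its square root, roughly 1.95. The single miss of 4 contributes most of the total, which is the outlier sensitivity mentioned above."
]
},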
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### sklearn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
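"Most data scientists are familiar with using `scikit-learn` for verifying forecasts, especially if they used `scikit-learn` for the prediction.\n",
"\n",
"To obtain RMSE from `scikit-learn`, import `mean_squared_error` and specify `squared=False` (scikit-learn 1.4 and later provide `root_mean_squared_error` for this instead, and newer releases no longer accept `squared`):"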
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.932575659723036"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import mean_squared_error\n",
"mean_squared_error(df['y'], df['yhat'], squared=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While `scikit-learn` is simple, it doesn't offer the flexibility of `xskillscore`.\n",
"\n",
"Note: `xskillscore` does provide some of the same metrics as `scikit-learn`, such as the [`r2_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html), which is called `r2` in `xskillscore`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### xskillscore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To use `xskillscore` you first have to put your data into an `xarray` object.\n",
"\n",
"Because `xarray` is part of the PyData stack, it integrates well with other Python data science packages.\n",
"\n",
"`pandas` has a convenient [`to_xarray`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xarray.html) method which makes going from `pandas` to `xarray` seamless.\n",
"\n",
"Use `to_xarray` to convert the `pandas.DataFrame` to an `xarray.Dataset`: "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.Dataset>\n",
"Dimensions:  (DATE: 5, SKU: 3, STORE: 4)\n",
"Coordinates:\n",
"  * DATE     (DATE) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-01-05\n",
"  * STORE    (STORE) int64 0 1 2 3\n",
"  * SKU      (SKU) int64 0 1 2\n",
"Data variables:\n",
"    y        (DATE, STORE, SKU) int64 9 2 2 3 1 6 4 7 6 7 ... 2 5 2 6 6 8 1 1 3\n",
"    yhat     (DATE, STORE, SKU) int64 13 3 2 4 0 7 7 7 9 3 ... 0 9 1 2 3 0 0 0 0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds = df.to_xarray()\n",
"ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As seen above, `xarray` has a very nice HTML representation of `xarray.Dataset` objects.\n",
"\n",
"Click on the data symbol (the cylinder) to see the data associated with the `Coordinates` and the `Data variables`.\n",
"\n",
"You now have one variable (`ds`) which houses the data and the associated metadata. You can also use the `Attributes` for handling things like units (this is why `xarray` was developed!).\n",
"\n",
"If you would like to know more about `xarray`, check out this [overview](http://xarray.pydata.org/en/stable/quick-overview.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use `xskillscore` on this `xarray.Dataset` via `xarray`'s [Accessor method](http://xarray.pydata.org/en/stable/generated/xarray.register_dataset_accessor.html).\n",
"\n",
"`xskillscore` expects at least 3 arguments for most functions. These are `y`, the target variable; `yhat`, the predicted variable; and `dim(s)`, the dimension(s) over which to apply the verification metric.\n",
"\n",
"To replicate the `scikit-learn` metric above, apply RMSE over all the dimensions `[DATE, STORE, SKU]`. RMSE is called `rmse` in `xskillscore`:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.DataArray ()>\n",
"array(2.93257566)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rmse = ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'])\n",
"rmse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want just the data from the `xarray.DataArray`, you can call `.values` on it to obtain it as a `np.array`:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(2.93257566)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rmse.values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xskillscore` allows you to apply the metric over any combination of dimensions (think of `pandas.groupby.apply` but faster).\n",
"\n",
"For example, your boss has asked you how good your predictions are at store level.\n",
"\n",
"In this case, apply the metric over the `DATE` and `SKU` dimensions:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.DataArray (STORE: 4)>\n",
"array([2.17562252, 2.87518115, 3.21455025, 3.32665999])\n",
"Coordinates:\n",
"  * STORE    (STORE) int64 0 1 2 3"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rmse = ds.xs.rmse('y', 'yhat', ['DATE', 'SKU'])\n",
"rmse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use `xarray` a bit further to explore our results.\n",
"\n",
"Let's find out which store had the best forecast and which store had the worst forecast:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Our forecast performed well for store:\n",
"Coordinates:\n",
" * STORE (STORE) int64 0\n",
"\n",
"Our forecast struggled with store:\n",
"Coordinates:\n",
" * STORE (STORE) int64 3\n"
]
}
],
"source": [
"print('Our forecast performed well for store:')\n",
"print(rmse.where(rmse==rmse.min(), drop=True).coords)\n",
"print('')\n",
"print('Our forecast struggled with store:')\n",
"print(rmse.where(rmse==rmse.max(), drop=True).coords)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Providing weights to the verification metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can specify weights when calculating skill metrics. Here I will go through an example demonstrating why you may want to apply weights when verifying your forecast.\n",
"\n",
"Your boss has asked you to create a prediction for the next five days. You will update this prediction every day, and there is a larger focus on the performance of the model for the next day and less of a focus on the fifth day.\n",
"\n",
"In this case you can weight your metric so the performance of day 1 has a larger influence than day 5. Here we will apply a linear scaling from 1 to 0, with day 1 having a weight of 1 and day 5 having a weight of 0.\n",
"\n",
"Generate weights with the same size as the `DATE` dimension and put them into an `xarray.DataArray`:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<xarray.DataArray (DATE: 5)>\n",
"array([1.  , 0.75, 0.5 , 0.25, 0.  ])\n",
"Dimensions without coordinates: DATE\n"
]
}
],
"source": [
"dim = 'DATE'\n",
"np_weights = np.linspace(1, 0, num=len(ds[dim]))\n",
"weights = xr.DataArray(np_weights, dims=dim)\n",
"print(weights)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
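"Now simply pass the variable to the `weights` argument, i.e. add `weights=weights` to the call. The call below omits `weights`, so it returns the unweighted RMSE over `DATE` for comparison: "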
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.DataArray (STORE: 4, SKU: 3)>\n",
"array([[3.46410162, 1.        , 1.09544512],\n",
"       [2.75680975, 4.04969135, 0.89442719],\n",
"       [2.44948974, 2.86356421, 4.09878031],\n",
"       [3.31662479, 2.82842712, 3.76828874]])\n",
"Coordinates:\n",
"  * STORE    (STORE) int64 0 1 2 3\n",
"  * SKU      (SKU) int64 0 1 2"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds.xs.rmse('y', 'yhat', 'DATE')"
]
},
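{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what weighting does to the score, here is a plain-Python sketch of a weighted RMSE. It assumes the weights are normalised by their sum (the usual definition of a weighted mean; `xskillscore` may differ in detail). The helper name `weighted_rmse_sketch` and the five-day series are hypothetical and are not taken from the data above:\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def weighted_rmse_sketch(forecasts, observations, weights):\n",
"    # Assumption: weights are normalised by their sum, so only their relative size matters\n",
"    weighted_sq = sum(\n",
"        w * (f - o) ** 2 for f, o, w in zip(forecasts, observations, weights)\n",
"    )\n",
"    return math.sqrt(weighted_sq / sum(weights))\n",
"\n",
"\n",
"y = [4, 6, 5, 3, 8]                    # hypothetical observations, day 1 to day 5\n",
"yhat = [4, 5, 7, 3, 2]                 # hypothetical forecasts with a large miss on day 5\n",
"weights = [1.0, 0.75, 0.5, 0.25, 0.0]  # same linear ramp as the weights above\n",
"\n",
"unweighted = weighted_rmse_sketch(yhat, y, [1.0] * 5)\n",
"weighted = weighted_rmse_sketch(yhat, y, weights)\n",
"print(unweighted, weighted)\n",
"```\n",
"\n",
"The large miss on day 5 carries zero weight, so in this sketch the weighted score is much lower than the unweighted one."
]
},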
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Handle missing values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There may be no purchases for certain items in certain stores on certain dates. Such entries will be absent from the result of the database query.\n",
"\n",
"To mimic data like this, create the same type of data structure as before but randomly suppress rows. I have added a simple `if` statement that drops each row with a probability of 0.2 (20%):"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"                      y\n",
"DATE       STORE SKU   \n",
"2020-01-01 0     0    7\n",
"                 2    1\n",
"           1     0    2\n",
"                 1    5\n",
"                 2    2\n",
"           2     0    6\n",
"                 2    4\n",
"           3     0    6\n",
"                 2    8\n",
"2020-01-02 0     0    3"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random_number_threshold = 0.8\n",
"\n",
"rows = []\n",
"for date in dates:\n",
"    for store in stores:\n",
"        for sku in skus:\n",
"            # Keep the row with probability 0.8, i.e. drop it with probability 0.2\n",
"            if np.random.rand() < random_number_threshold:\n",
"                rows.append(\n",
"                    {\n",
"                        \"DATE\": date,\n",
"                        \"STORE\": store,\n",
"                        \"SKU\": sku,\n",
"                        \"QUANTITY_SOLD\": np.random.randint(9) + 1,\n",
"                    }\n",
"                )\n",
"df = pd.DataFrame(rows)\n",
"df.rename(columns={\"QUANTITY_SOLD\": \"y\"}, inplace=True)\n",
"df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting the `pandas.DataFrame` to an `xarray.Dataset` is very handy in this case because it fills the missing entries with `NaN` (as long as every index value is present somewhere in the `pandas.DataFrame`):"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.Dataset>\n",
"Dimensions:  (DATE: 5, SKU: 3, STORE: 4)\n",
"Coordinates:\n",
"  * DATE     (DATE) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-01-05\n",
"  * STORE    (STORE) int64 0 1 2 3\n",
"  * SKU      (SKU) int64 0 1 2\n",
"Data variables:\n",
"    y        (DATE, STORE, SKU) float64 7.0 nan 1.0 2.0 5.0 ... nan 3.0 3.0 8.0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds = df.to_xarray()\n",
"ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click on the data symbol associated with the `y` data variable to see the `NaN` values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also use this step in your workflow if you simply want to continue working with the `pandas.DataFrame`:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"                        y\n",
"DATE       SKU STORE     \n",
"2020-01-01 0   0      7.0\n",
"               1      2.0\n",
"               2      6.0\n",
"               3      6.0\n",
"           1   0      NaN\n",
"               1      5.0\n",
"               2      NaN\n",
"               3      NaN\n",
"           2   0      1.0\n",
"               1      2.0"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_with_nans = ds.to_dataframe()\n",
"df_with_nans.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: `xarray` returns the index levels in alphabetical order, but the `NaN` values are still shown."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In most cases you will not know a priori whether there will be no purchases of a particular item in a certain store on a given day. Therefore, your prediction will not contain `NaN` values, but you would hope the predicted value is low.\n",
"\n",
"Append a prediction column as was done previously:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"