{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [01_Deterministic.ipynb](https://github.com/raybellwaves/xskillscore-tutorial/blob/master/01_Determinisitic.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " - In this notebook I show how `xskillscore` can be dropped into a typical data science task where the data is a [`pandas.DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).\n", "\n", " - I use the RMSE metric to verify forecasts of items sold.\n", "\n", " - I also show how you can apply weights to the verification and handle missing values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the necessary packages:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import xarray as xr\n", "import pandas as pd\n", "import numpy as np\n", "import xskillscore as xs  # also registers the `xs` accessor used below" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say you are a data scientist who works for a company that owns four stores, each of which sells three items (Stock Keeping Units, or SKUs).\n", "\n", "Set up the `stores` and `skus` arrays:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "stores = np.arange(4)\n", "skus = np.arange(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are tracking the daily performance of items sold between Jan 1st and Jan 5th 2020.\n", "\n", "Set up the `dates` array:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "dates = pd.date_range(\"2020-01-01\", \"2020-01-05\", freq=\"D\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate a `pandas.DataFrame` to show the number of items that were sold during this period. 
The number of items sold will be a random integer between 1 and 9.\n", "\n", "This may be something you would obtain from querying a database:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "rows = []\n", "for date in dates:\n", "    for store in stores:\n", "        for sku in skus:\n", "            rows.append(\n", "                {\n", "                    \"DATE\": date,\n", "                    \"STORE\": store,\n", "                    \"SKU\": sku,\n", "                    # randint(9) draws from 0-8, so quantities range from 1 to 9\n", "                    \"QUANTITY_SOLD\": np.random.randint(9) + 1,\n", "                }\n", "            )\n", "df = pd.DataFrame(rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the first 5 rows of the `pandas.DataFrame`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "        DATE  STORE  SKU  QUANTITY_SOLD\n", "0 2020-01-01      0    0              9\n", "1 2020-01-01      0    1              2\n", "2 2020-01-01      0    2              2\n", "3 2020-01-01      1    0              3\n", "4 2020-01-01      1    1              1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your boss has asked you to use this data to predict the number of items sold at the store and SKU level for the next 5 days.\n", "\n", "The prediction itself is outside the scope of this tutorial, but we will use `xskillscore` to tell us how good our prediction is." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, rename the target variable to `y`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "        DATE  STORE  SKU  y\n", "0 2020-01-01      0    0  9\n", "1 2020-01-01      0    1  2\n", "2 2020-01-01      0    2  2\n", "3 2020-01-01      1    0  3\n", "4 2020-01-01      1    1  1" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.rename(columns={\"QUANTITY_SOLD\": \"y\"}, inplace=True)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use a [pandas MultiIndex](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) to handle the granularity of the forecast:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This also displays the data in a cleaner format in the notebook:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "                      y\n", "DATE       STORE SKU   \n", "2020-01-01 0     0    9\n", "                 1    2\n", "                 2    2\n", "           1     0    3\n", "                 1    1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time for your prediction! As mentioned, this is outside the scope of this tutorial.\n", "\n", "In our case we are going to generate data to mimic a prediction by taking `y` and perturbing it randomly. This provides a middle ground between a prediction that overfits the data (being very similar to `y`) and the other extreme of random numbers, for which the skill will be 0.\n", "\n", "The perturbations scale `y` by between -100% and 100% using a uniform distribution. For example, a value of 5 in `y` will be between 0 and 10 in the prediction (`yhat`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up the perturbation array:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "noise = np.random.uniform(-1, 1, size=len(df['y']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Name the prediction `yhat` and append it to the `pandas.DataFrame`.\n", "\n", "Lastly, convert it to an `int` to match the format of the target (`y`):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "                      y  yhat\n", "DATE       STORE SKU         \n", "2020-01-01 0     0    9    13\n", "                 1    2     3\n", "                 2    2     2\n", "           1     0    3     4\n", "                 1    1     0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using xskillscore - RMSE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RMSE (root-mean-square error) is the square root of the average of the squared differences between forecasts ($f$) and verification data ($o$):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\\begin{align}\n", "RMSE = \\sqrt{\\overline{(f - o)^{2}}}\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the error is squared, it is sensitive to outliers and is a more conservative metric than mean absolute error.\n", "\n", "See https://climpred.readthedocs.io/en/stable/metrics.html#root-mean-square-error-rmse for further documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### sklearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most data scientists are familiar with using `scikit-learn` for verifying forecasts, especially if they used `scikit-learn` for the prediction.\n", "\n", "To obtain RMSE from `scikit-learn`, import `mean_squared_error` and specify `squared=False` (in `scikit-learn` 1.4 and later, `squared` is deprecated in favour of `sklearn.metrics.root_mean_squared_error`):" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.932575659723036" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "mean_squared_error(df['y'], df['yhat'], squared=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While `scikit-learn` is simple, it does not offer the flexibility of `xskillscore`.\n", "\n", "Note: `xskillscore` also provides some of the same metrics as `scikit-learn`, such as the 
[`r2_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html), which is called `r2` in `xskillscore`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### xskillscore" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use `xskillscore` you first have to put your data into an `xarray` object.\n", "\n", "Because `xarray` is part of the PyData stack, it integrates well with other Python data science packages.\n", "\n", "`pandas` has a convenient [`to_xarray`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xarray.html) method which makes going from `pandas` to `xarray` seamless.\n", "\n", "Use `to_xarray` to convert the `pandas.DataFrame` to an `xarray.Dataset`:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.Dataset>\n", "Dimensions:  (DATE: 5, SKU: 3, STORE: 4)\n", "Coordinates:\n", "  * DATE     (DATE) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-01-05\n", "  * STORE    (STORE) int64 0 1 2 3\n", "  * SKU      (SKU) int64 0 1 2\n", "Data variables:\n", "    y        (DATE, STORE, SKU) int64 9 2 2 3 1 6 4 7 6 7 ... 2 5 2 6 6 8 1 1 3\n", "    yhat     (DATE, STORE, SKU) int64 13 3 2 4 0 7 7 7 9 3 ... 0 9 1 2 3 0 0 0 0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = df.to_xarray()\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As seen above, `xarray` has a very nice HTML representation of `xarray.Dataset` objects.\n", "\n", "Click on the data symbol (the cylinder) to see the data associated with the `Coordinates` and the `Data variables`.\n", "\n", "You now have one variable (`ds`) which houses the data and the associated metadata. You can also use the `Attributes` for handling things like units (this is why `xarray` was developed!).\n", "\n", "If you would like to know more about `xarray`, check out this [overview](http://xarray.pydata.org/en/stable/quick-overview.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `xskillscore` on this `xarray.Dataset` via `xarray`'s [Accessor method](http://xarray.pydata.org/en/stable/generated/xarray.register_dataset_accessor.html).\n", "\n", "`xskillscore` expects at least 3 arguments for most functions. These are `y`: the target variable; `yhat`: the predicted variable; and `dim(s)`: the dimension(s) over which to apply the verification metric.\n", "\n", "To replicate the `scikit-learn` metric above, apply RMSE over all the dimensions `[DATE, STORE, SKU]`. RMSE is called `rmse` in `xskillscore`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.DataArray ()>\n", "array(2.93257566)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rmse = ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'])\n", "rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want just the data from the `xarray.DataArray`, you can call `.values` on it to obtain a `np.array`:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(2.93257566)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rmse.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`xskillscore` allows you to apply the metric over any combination of dimensions (think of `pandas.groupby.apply` but faster).\n", "\n", "For example, your boss has asked you how good your predictions are at store level.\n", "\n", "In this case, apply the metric over the `DATE` and `SKU` dimensions:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.DataArray (STORE: 4)>\n", "array([2.17562252, 2.87518115, 3.21455025, 3.32665999])\n", "Coordinates:\n", "  * STORE    (STORE) int64 0 1 2 3" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rmse = ds.xs.rmse('y', 'yhat', ['DATE', 'SKU'])\n", "rmse" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `xarray` a bit further to explore our results.\n", "\n", "Let's find out which store had the best forecast and which store had the worst forecast:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Our forecast performed well for store:\n", "Coordinates:\n", "  * STORE    (STORE) int64 0\n", "\n", "Our forecast struggled with store:\n", "Coordinates:\n", "  * STORE    (STORE) int64 3\n" ] } ], "source": [ "print('Our forecast performed well for store:')\n", "print(rmse.where(rmse == rmse.min(), drop=True).coords)\n", "print('')\n", "print('Our forecast struggled with store:')\n", "print(rmse.where(rmse == rmse.max(), drop=True).coords)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Providing weights to the verification metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can specify weights when calculating skill metrics. Here I will go through an example demonstrating why you may want to apply weights when verifying your forecast.\n", "\n", "Your boss has asked you to create a prediction for the next five days. You will update this prediction every day, so there is a larger focus on the performance of the model for the subsequent day and less of a focus on the fifth day.\n", "\n", "In this case you can weight your metric so the performance of day 1 has a larger influence than day 5. Here we will apply a linear scaling from 1 to 0, with day 1 having a weight of 1 and day 5 having a weight of 0 (so day 5 does not contribute to the weighted metric at all).\n", "\n", "Generate weights with the same size as the `DATE` dimension and put them into an `xarray.DataArray`:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<xarray.DataArray (DATE: 5)>\n", "array([1.  , 0.75, 0.5 , 0.25, 0.  ])\n", "Dimensions without coordinates: DATE\n" ] } ], "source": [ "dim = 'DATE'\n", "np_weights = np.linspace(1, 0, num=len(ds[dim]))\n", "weights = xr.DataArray(np_weights, dims=dim)\n", "print(weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now simply pass the variable to the `weights` argument:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.DataArray (STORE: 4, SKU: 3)>\n", "array([[3.54964787, 1.        , 1.04880885],\n", "       [2.64575131, 3.50713558, 0.89442719],\n", "       [2.04939015, 2.82842712, 2.56904652],\n", "       [4.01248053, 3.46410162, 3.67423461]])\n", "Coordinates:\n", "  * STORE    (STORE) int64 0 1 2 3\n", "  * SKU      (SKU) int64 0 1 2" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.xs.rmse('y', 'yhat', 'DATE', weights=weights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can compare this to the result without the weights:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.DataArray (STORE: 4, SKU: 3)>\n", "array([[3.46410162, 1.        , 1.09544512],\n", "       [2.75680975, 4.04969135, 0.89442719],\n", "       [2.44948974, 2.86356421, 4.09878031],\n", "       [3.31662479, 2.82842712, 3.76828874]])\n", "Coordinates:\n", "  * STORE    (STORE) int64 0 1 2 3\n", "  * SKU      (SKU) int64 0 1 2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.xs.rmse('y', 'yhat', 'DATE')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handle missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There may be no purchases for certain items in certain stores on certain dates. These entries will be missing from the result of the database query.\n", "\n", "To mimic data like this, create the same type of data structure as before but randomly suppress rows. I have added a simple `if` statement that drops each row with a probability of 0.2 (20%):" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "                      y\n", "DATE       STORE SKU   \n", "2020-01-01 0     0    7\n", "                 2    1\n", "           1     0    2\n", "                 1    5\n", "                 2    2\n", "           2     0    6\n", "                 2    4\n", "           3     0    6\n", "                 2    8\n", "2020-01-02 0     0    3" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep each row with probability 0.8, i.e. drop it with probability 0.2\n", "random_number_threshold = 0.8\n", "\n", "rows = []\n", "for date in dates:\n", "    for store in stores:\n", "        for sku in skus:\n", "            if np.random.rand() < random_number_threshold:\n", "                rows.append(\n", "                    {\n", "                        \"DATE\": date,\n", "                        \"STORE\": store,\n", "                        \"SKU\": sku,\n", "                        \"QUANTITY_SOLD\": np.random.randint(9) + 1,\n", "                    }\n", "                )\n", "df = pd.DataFrame(rows)\n", "df.rename(columns={\"QUANTITY_SOLD\": \"y\"}, inplace=True)\n", "df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)\n", "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting the `pandas.DataFrame` to an `xarray.Dataset` is very handy in this case because it will fill the missing entries with `nans` (as long as every index value appears at least once in the `pandas.DataFrame`):" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.Dataset>\n", "Dimensions:  (DATE: 5, SKU: 3, STORE: 4)\n", "Coordinates:\n", "  * DATE     (DATE) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-01-05\n", "  * STORE    (STORE) int64 0 1 2 3\n", "  * SKU      (SKU) int64 0 1 2\n", "Data variables:\n", "    y        (DATE, STORE, SKU) float64 7.0 nan 1.0 2.0 5.0 ... nan 3.0 3.0 8.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = df.to_xarray()\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click on the data symbol associated with the `y` data variable to see the `nans`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use this step in your workflow if you simply want to continue working with the `pandas.DataFrame`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "                        y\n", "DATE       SKU STORE     \n", "2020-01-01 0   0      7.0\n", "               1      2.0\n", "               2      6.0\n", "               3      6.0\n", "           1   0      NaN\n", "               1      5.0\n", "               2      NaN\n", "               3      NaN\n", "           2   0      1.0\n", "               1      2.0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_with_nans = ds.to_dataframe()\n", "df_with_nans.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: `xarray` returns the index levels in alphabetical order (`DATE`, `SKU`, `STORE`), but the `nans` are still shown." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In most cases you will not know a priori whether there will be no purchases for a particular item in a certain store on a given day. Therefore, your prediction will not contain `nans`, but you would hope the predicted value is low.\n", "\n", "Append a prediction column as was done previously:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "                        y       yhat\n", "DATE       SKU STORE                \n", "2020-01-01 0   0      7.0  10.446753\n", "               1      2.0   3.411661\n", "               2      6.0   7.584959\n", "               3      6.0   9.522168\n", "           1   0      NaN        NaN" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_with_nans['yhat'] = df_with_nans['y'] + (df_with_nans['y'] * noise)\n", "df_with_nans.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our prediction contains `nans` because it was derived from `y`. To mimic a realistic prediction, replace these with random values between 1 and 9:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "                        y       yhat\n", "DATE       SKU STORE                \n", "2020-01-01 0   0      7.0  10.446753\n", "               1      2.0   3.411661\n", "               2      6.0   7.584959\n", "               3      6.0   9.522168\n", "           1   0      NaN   2.000000" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Work on a copy so the assignment does not act on a view of the DataFrame\n", "yhat = df_with_nans['yhat'].copy()\n", "\n", "# Fill each missing prediction with a random integer between 1 and 9\n", "missing = yhat.isna()\n", "yhat[missing] = np.random.randint(1, 10, size=missing.sum())\n", "\n", "df_with_nans['yhat'] = yhat\n", "df_with_nans.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now if we try using `scikit-learn`:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "Input contains NaN, infinity or a value too large for dtype('float64').", "output_type": "error", "traceback": [ "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')." ] } ], "source": [ "mean_squared_error(df_with_nans['y'], df_with_nans['yhat'], squared=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "you get a `ValueError` because the input contains `NaN`.\n", "\n", "In `xskillscore` you can handle this by simply specifying `skipna=True`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.DataArray ()>\n", "array(3.44001822)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds = df_with_nans.to_xarray()\n", "ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'], skipna=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handle weights and missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can specify `weights` and `skipna` together, for example to compute a weighted RMSE over `DATE` for each store and SKU while ignoring the missing entries:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "<xarray.DataArray (SKU: 3, STORE: 4)>\n", "array([[3.03824381, 1.10544282, 1.46657789, 4.96241146],\n", "       [6.44982206, 1.03585395, 2.0144671 , 0.87421379],\n", "       [1.19256817, 2.41157233, 3.45932053, 3.49296654]])\n", "Coordinates:\n", "  * SKU      (SKU) int64 0 1 2\n", "  * STORE    (STORE) int64 0 1 2 3" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ds.xs.rmse('y', 'yhat', 'DATE', weights=weights, skipna=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }