{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [03_Big_Data.ipynb](https://github.com/raybellwaves/xskillscore-tutorial/blob/master/03_Big_Data.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook I verify 12 million forecasts in a matter of seconds using the RMSE metric on a `dask.array`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import xarray as xr\n", "import pandas as pd\n", "import numpy as np\n", "import xskillscore as xs\n", "import dask.array as da\n", "from dask.distributed import Client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the [`dask.distributed.Client`](https://distributed.dask.org/en/latest/client.html) uses a [`LocalCluster`](https://docs.dask.org/en/latest/setup/single-distributed.html#localcluster):\n", "\n", "```\n", "from dask.distributed import Client, LocalCluster\n", "\n", "cluster = LocalCluster()\n", "client = Client(cluster)\n", "```\n", "\n", "However, this code can easily be adapted to scale to massive datasets using distributed computing via various deployment methods:\n", "\n", " - [Kubernetes](https://docs.dask.org/en/latest/setup/kubernetes.html)\n", " - [High Performance Computers](https://docs.dask.org/en/latest/setup/hpc.html)\n", " - [YARN](https://yarn.dask.org/en/latest/)\n", " - [AWS Fargate](https://aws.amazon.com/fargate/)\n", " \n", "or vendor products:\n", " \n", " - [SaturnCloud](https://www.saturncloud.io/s/)\n", " \n", "If anyone does run this example on a large cluster, I would be curious to know how far you can scale `nstores` and `nskus` and how long it takes to run `rmse`. You are welcome to post your results in the issue section following this [link](https://github.com/raybellwaves/xskillscore-tutorial/issues/new/choose)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up the client (i.e. connect to the scheduler):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Client\n", "Cluster: Workers: 3, Cores: 6, Memory: 16.73 GB" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "client = Client()\n", "client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to the success of your previous forecast (and verification using xskillscore!), the company you work for has expanded. They have grown to 4,000 stores, each with 3,000 products:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "That's 12,000,000 different forecasts to verify!\n" ] } ], "source": [ "nstores = 4000\n", "nskus = 3000\n", "\n", "nforecasts = nstores * nskus\n", "print(f\"That's {nforecasts:,d} different forecasts to verify!\")\n", "\n", "stores = np.arange(nstores)\n", "skus = np.arange(nskus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The period of interest covers the same dates, but in 2021:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "dates = pd.date_range(\"1/1/2021\", \"1/5/2021\", freq=\"D\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up the data as a `dask.array` with dimensions dates x stores x skus.\n", "\n", "`dask.array` provides many of the same functions as `numpy`. In this case, switch `np.` to `da.` to generate random integers from 1 to 9:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "
Array: 4.80 GB, shape (5, 4000, 30000), 160 tasks, int64\n", "Chunk: 60.00 MB, shape (5, 1000, 1500), 80 chunks, numpy.ndarray" ], "text/plain": [ "dask.array" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = da.random.randint(9, size=(len(dates), len(stores), len(skus))) + 1\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Put this into an `xarray.DataArray` and specify the coordinates and dimensions:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
xarray.DataArray 'add-dfe18897e28896b92dffd2271a18e7b8' (DATE: 5, STORE: 4000, SKU: 30000)\n", "dask.array<chunksize=(5, 1000, 1500), meta=np.ndarray>\n", "Array: 4.80 GB, shape (5, 4000, 30000), 160 tasks, int64\n", "Chunk: 60.00 MB, shape (5, 1000, 1500), 80 chunks, numpy.ndarray\n", "Coordinates:\n", "  * DATE (DATE) datetime64[ns] 2021-01-01 ... 2021-01-05\n", "  * STORE (STORE) int64 0 1 2 3 4 ... 3996 3997 3998 3999\n", "  * SKU (SKU) int64 0 1 2 3 ... 29996 29997 29998 29999
" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * DATE (DATE) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-01-05\n", " * STORE (STORE) int64 0 1 2 3 4 5 6 ... 3993 3994 3995 3996 3997 3998 3999\n", " * SKU (SKU) int64 0 1 2 3 4 5 6 ... 29994 29995 29996 29997 29998 29999" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = xr.DataArray(data, coords=[dates, stores, skus], dims=[\"DATE\", \"STORE\", \"SKU\"])\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a prediction array similar to that in [01_Deterministic.ipynb](https://github.com/raybellwaves/xskillscore-tutorial/blob/master/01_Determinisitic.ipynb):" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
xarray.DataArray (DATE: 5, STORE: 4000, SKU: 30000)\n", "dask.array<chunksize=(5, 1000, 1500), meta=np.ndarray>\n", "Array: 4.80 GB, shape (5, 4000, 30000), 400 tasks, float64\n", "Chunk: 60.00 MB, shape (5, 1000, 1500), 80 chunks, numpy.ndarray\n", "Coordinates:\n", "  * DATE (DATE) datetime64[ns] 2021-01-01 ... 2021-01-05\n", "  * STORE (STORE) int64 0 1 2 3 4 ... 3996 3997 3998 3999\n", "  * SKU (SKU) int64 0 1 2 3 ... 29996 29997 29998 29999" ], "text/plain": [ "\n", "dask.array\n", "Coordinates:\n", " * DATE (DATE) datetime64[ns] 2021-01-01 2021-01-02 ... 2021-01-05\n", " * STORE (STORE) int64 0 1 2 3 4 5 6 ... 3993 3994 3995 3996 3997 3998 3999\n", " * SKU (SKU) int64 0 1 2 3 4 5 6 ... 29994 29995 29996 29997 29998 29999" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "noise = da.random.uniform(-1, 1, size=(len(dates), len(stores), len(skus)))\n", "yhat = y + (y * noise)\n", "yhat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, calculate the RMSE at the store and SKU level.\n", "\n", "Use the [`.compute()`](https://distributed.dask.org/en/latest/manage-computation.html) method to return the values:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.52 s, sys: 964 ms, total: 2.49 s\n", "Wall time: 9.63 s\n" ] }, { "data": { "text/html": [ "
xarray.DataArray (STORE: 4000, SKU: 30000)\n", "1.531 3.147 3.151 5.253 3.666 3.23 ... 3.369 1.381 4.099 3.46 3.736\n", "Coordinates:\n", "  * STORE (STORE) int64 0 1 2 3 4 ... 3996 3997 3998 3999\n", "  * SKU (SKU) int64 0 1 2 3 ... 29996 29997 29998 29999
" ], "text/plain": [ "\n", "array([[1.53123262, 3.14700356, 3.15132902, ..., 4.94581499, 3.27822252,\n", " 4.06088894],\n", " [1.67277937, 4.08686635, 3.66413793, ..., 2.41712605, 2.57601639,\n", " 2.88718054],\n", " [2.76270445, 3.33921586, 3.06681608, ..., 3.34186527, 1.32741548,\n", " 2.08743438],\n", " ...,\n", " [2.35981265, 2.1617547 , 4.92081192, ..., 1.98393152, 3.1364395 ,\n", " 3.59346663],\n", " [1.26363135, 2.77340328, 2.7967874 , ..., 2.8342274 , 3.01885276,\n", " 0.62828305],\n", " [4.29771572, 3.92254418, 2.15708334, ..., 4.099115 , 3.45980968,\n", " 3.73594864]])\n", "Coordinates:\n", " * STORE (STORE) int64 0 1 2 3 4 5 6 ... 3993 3994 3995 3996 3997 3998 3999\n", " * SKU (SKU) int64 0 1 2 3 4 5 6 ... 29994 29995 29996 29997 29998 29999" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time xs.rmse(y, yhat, 'DATE').compute()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }