{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [03_Big_Data.ipynb](https://github.com/raybellwaves/xskillscore-tutorial/blob/master/03_Big_Data.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook I verify 12 million forecasts in a couple of seconds using the RMSE metric on a `dask.array`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import xarray as xr\n", "import pandas as pd\n", "import numpy as np\n", "import xskillscore as xs\n", "import dask.array as da\n", "from dask.distributed import Client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default the [`dask.distributed.Client`](https://distributed.dask.org/en/latest/client.html) uses a [`LocalCluster`](https://docs.dask.org/en/latest/setup/single-distributed.html#localcluster)\n", "\n", "```\n", "cluster = LocalCluster()\n", "client = Client(cluster)\n", "```\n", "\n", "However, this code can easily be adapted to scale on massive datasets using distributed computing via various methods of deployment:\n", "\n", " - [Kubernetes](https://docs.dask.org/en/latest/setup/kubernetes.html)\n", " - [High Performance Computers](https://docs.dask.org/en/latest/setup/hpc.html)\n", " - [YARN](https://yarn.dask.org/en/latest/)\n", " - [AWS Fargate](https://aws.amazon.com/fargate/)\n", " \n", "or vendor products:\n", " \n", " - [SaturnCloud](https://www.saturncloud.io/s/)\n", " \n", "If anyone does run this example on a large cluster I would be curious how big you can scale `nstores` and `nskus` and how long it takes to run `rmse`. You are welcome to post it in the issue section following this [link](https://github.com/raybellwaves/xskillscore-tutorial/issues/new/choose)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setup the client (i.e. connect to the scheduler):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"Client\n", "
| \n",
"\n",
"Cluster\n", "
| \n",
"
\n",
"
| \n",
"\n", "\n", " | \n", "
\n",
"
| \n",
"\n", "\n", " | \n", "
array(['2021-01-01T00:00:00.000000000', '2021-01-02T00:00:00.000000000',\n", " '2021-01-03T00:00:00.000000000', '2021-01-04T00:00:00.000000000',\n", " '2021-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
array([ 0, 1, 2, ..., 3997, 3998, 3999])
array([ 0, 1, 2, ..., 29997, 29998, 29999])
\n",
"
| \n",
"\n", "\n", " | \n", "
array(['2021-01-01T00:00:00.000000000', '2021-01-02T00:00:00.000000000',\n", " '2021-01-03T00:00:00.000000000', '2021-01-04T00:00:00.000000000',\n", " '2021-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
array([ 0, 1, 2, ..., 3997, 3998, 3999])
array([ 0, 1, 2, ..., 29997, 29998, 29999])
array([[1.53123262, 3.14700356, 3.15132902, ..., 4.94581499, 3.27822252,\n", " 4.06088894],\n", " [1.67277937, 4.08686635, 3.66413793, ..., 2.41712605, 2.57601639,\n", " 2.88718054],\n", " [2.76270445, 3.33921586, 3.06681608, ..., 3.34186527, 1.32741548,\n", " 2.08743438],\n", " ...,\n", " [2.35981265, 2.1617547 , 4.92081192, ..., 1.98393152, 3.1364395 ,\n", " 3.59346663],\n", " [1.26363135, 2.77340328, 2.7967874 , ..., 2.8342274 , 3.01885276,\n", " 0.62828305],\n", " [4.29771572, 3.92254418, 2.15708334, ..., 4.099115 , 3.45980968,\n", " 3.73594864]])
array([ 0, 1, 2, ..., 3997, 3998, 3999])
array([ 0, 1, 2, ..., 29997, 29998, 29999])
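The outputs above come from data shaped (`time`, `stores`, `skus`). As a minimal sketch of the same workflow, the snippet below builds two dask-backed `DataArray`s with those dims (toy sizes here; the notebook itself uses 4000 stores × 30000 skus) and computes the RMSE over `time`. It computes RMSE directly with xarray operations; `xs.rmse(obs, fct, dim="time")` from xskillscore computes the same quantity on the same inputs.

```python
import numpy as np
import pandas as pd
import xarray as xr
import dask.array as da

# Toy sizes standing in for the notebook's 5 x 4000 x 30000 data.
ntimes, nstores, nskus = 5, 10, 20
coords = {
    "time": pd.date_range("2021-01-01", periods=ntimes),
    "stores": np.arange(nstores),
    "skus": np.arange(nskus),
}
dims = ("time", "stores", "skus")

# Lazy dask-backed observation and forecast arrays.
obs = xr.DataArray(
    da.random.random((ntimes, nstores, nskus), chunks=(ntimes, 5, nskus)),
    coords=coords, dims=dims,
)
fct = xr.DataArray(
    da.random.random((ntimes, nstores, nskus), chunks=(ntimes, 5, nskus)),
    coords=coords, dims=dims,
)

# RMSE over the time dim, built lazily; .compute() triggers the work
# on the cluster. xskillscore's xs.rmse(obs, fct, dim="time") is equivalent.
rmse = np.sqrt(((obs - fct) ** 2).mean("time")).compute()
print(rmse.dims, rmse.shape)  # ('stores', 'skus') (10, 20)
```

Because everything up to `.compute()` only records the task graph, the same code scales to the full-size arrays: dask executes chunk by chunk across whatever cluster the `Client` is connected to.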