{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [03_Big_Data.ipynb](https://github.com/raybellwaves/xskillscore-tutorial/blob/master/03_Big_Data.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook I verify 12 million forecasts in a couple of seconds using the RMSE metric on a `dask.array`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import xarray as xr\n", "import pandas as pd\n", "import numpy as np\n", "import xskillscore as xs\n", "import dask.array as da\n", "from dask.distributed import Client" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the [`dask.distributed.Client`](https://distributed.dask.org/en/latest/client.html) uses a [`LocalCluster`](https://docs.dask.org/en/latest/setup/single-distributed.html#localcluster):\n", "\n", "```\n", "from dask.distributed import LocalCluster\n", "cluster = LocalCluster()\n", "client = Client(cluster)\n", "```\n", "\n", "However, this code can be adapted to scale to massive datasets through distributed computing, using one of several deployment methods:\n", "\n", " - [Kubernetes](https://docs.dask.org/en/latest/setup/kubernetes.html)\n", " - [High Performance Computers](https://docs.dask.org/en/latest/setup/hpc.html)\n", " - [YARN](https://yarn.dask.org/en/latest/)\n", " - [AWS Fargate](https://aws.amazon.com/fargate/)\n", " \n", "or vendor products:\n", " \n", " - [SaturnCloud](https://www.saturncloud.io/s/)\n", " \n", "If you run this example on a large cluster, I would be curious how far you can scale `nstores` and `nskus` and how long `rmse` takes to run. You are welcome to post your results in the issues section via this [link](https://github.com/raybellwaves/xskillscore-tutorial/issues/new/choose)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up the client (i.e. connect to the scheduler):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
[HTML output: Client and Cluster summary table; contents not recoverable]\n",
"[HTML output: further summary tables; contents not recoverable]\n",
"
[HTML output: coordinate values array(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'], dtype='datetime64[ns]'), array([ 0, 1, 2, ..., 3997, 3998, 3999]) and array([ 0, 1, 2, ..., 2997, 2998, 2999])]\n",
"[HTML output: 2-D float array, array([[3.41682887, 3.23411861, 2.69707474, ..., 2.6408434 ], ..., [0.5710644 , 1.32473384, 4.10346585, ..., 4.75186667]]), with coordinates array([ 0, 1, 2, ..., 3997, 3998, 3999]) and array([ 0, 1, 2, ..., 2997, 2998, 2999])]\n",
"