{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with large data volumes\n", "\n", "> Abstract: Some strategies for requesting and handling larger data volumes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the code could take a long time to run, so it is better to adjust it for smaller jobs if you are just testing it out." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext watermark\n", "%watermark -i -v -p viresclient,pandas,xarray,matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import glob\n", "\n", "from viresclient import SwarmRequest\n", "import datetime as dt\n", "import pandas as pd\n", "import xarray as xr" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Revert to non-html preview for xarray\n", "# to avoid extra long outputs of the .attr Sources\n", "xr.set_options(display_style=\"text\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up the request parameters\n", "\n", "We specify the nominal scalar measurements (`F`) from the MAGxLR (1Hz) data product, subsampled to 5s. You may want to additionally set some magnetic model evaluation and filter according to the values of other parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "request = SwarmRequest()\n", "request.set_collection(\"SW_OPER_MAGA_LR_1B\") # Swarm Alpha\n", "request.set_products(\n", " measurements=[\"F\",],\n", " # Choose between the full CHAOS model (will be a lot slower due to the MMA parts)\n", "# models=[\"CHAOS = 'CHAOS-6-Core' + 'CHAOS-6-Static' + 'CHAOS-6-MMA-Primary' + 'CHAOS-6-MMA-Secondary'\"],\n", " # ...or just the core part:\n", "# models=[\"CHAOS = 'CHAOS-Core'\"],\n", " sampling_step=\"PT5S\"\n", ")\n", "# Quality Flags\n", "# https://earth.esa.int/web/guest/missions/esa-eo-missions/swarm/data-handbook/level-1b-product-definitions#label-Flags_F-and-Flags_B-Values-of-MDR_MAG_LR\n", "# NB: will need to do something different for Charlie because the ASM broke so Flags_F are bad\n", "request.set_range_filter(\"Flags_F\", 0, 1)\n", "# request.set_range_filter(\"Flags_B\", 0, 1)\n", "# # Magnetic latitude filter\n", "# request.set_range_filter(\"QDLat\", 70, 90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at one day to see what the output data will look like:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = request.get_between(\n", " start_time=dt.datetime(2014,1,1),\n", " end_time=dt.datetime(2014,1,2)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.as_xarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A) Fetch one year of data directly\n", "\n", "Note that viresclient automatically splits the data into multiple sequential requests." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = request.get_between(\n", " start_time=dt.datetime(2014,1,1),\n", " end_time=dt.datetime(2015,1,1)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The downloaded data is stored within a number of temporary CDF files according to the number of requests that were made. 
] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.as_xarray()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Without loading the data into memory, you can still find the temporary files on disk: their locations can be revealed by digging into the `data.contents` list, **but this is not recommended**:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.contents[0]._file.name" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## B) Tune the size of requests\n", "\n", "We can use [pandas.date_range](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) to generate a sequence of time intervals, and then create a data file for each interval in turn." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Monthly start times\n", "dates = pd.date_range(start=\"2014-1-1\", end=\"2015-1-1\", freq=\"MS\").to_pydatetime()\n", "start_times = dates[:-1]\n", "end_times = dates[1:]\n", "list(zip(start_times, end_times))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate file names to use\n", "filenames = [f\"data_{start.strftime('%Y-%m-%d')}.cdf\" for start in start_times]\n", "# Loop through each pair of times, and write the data to a file\n", "for start, end, filename in zip(start_times, end_times, filenames):\n", "    # Use show_progress=False to disable the progress bars\n", "    data = request.get_between(start, end, show_progress=False)\n", "    data.to_file(filename, overwrite=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We could now use these files however we want in further processing routines." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm data*.cdf" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Since CDF files are not so well supported by the Python ecosystem, we might instead save the data as NetCDF files (beware: despite the confusingly similar names, these are *entirely different formats*). We can do this by loading the data into xarray with viresclient and then using [to_netcdf](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate filenames to use\n", "filenames_nc = [f\"data_{start.strftime('%Y-%m-%d')}.nc\" for start in start_times]\n", "for start, end, filename in zip(start_times, end_times, filenames_nc):\n", "    # Attempt to download data for each interval\n", "    # Sometimes there may be gaps which need to be handled appropriately\n", "    ds = None\n", "    try:\n", "        data = request.get_between(start, end, show_progress=False)\n", "        ds = data.as_xarray()\n", "    except RuntimeError:\n", "        print(f\"No data for {filename} - data not downloaded\")\n", "    # Only write a file when data was actually returned for this interval\n", "    # (otherwise a stale dataset from a previous month could be written)\n", "    if ds is not None:\n", "        ds.to_netcdf(filename)\n", "        print(f\"saved {filename}\")\n", "    else:\n", "        print(f\"No data for {filename} - file not created\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we can reap the rewards of using NetCDF instead. We can use the lazy loading functionality of xarray and dask to open all the files together without actually loading everything into memory."
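, "\n", "\n", "Before combining the files, it can be worth inspecting one of them. The cell below is a minimal sketch, assuming the `filenames_nc` list defined above (months for which no data was returned will simply have no file):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative check: open the first monthly file that was actually created\n", "# and report how many records it contains\n", "saved = [f for f in filenames_nc if os.path.exists(f)]\n", "if saved:\n", "    ds_one = xr.open_dataset(saved[0])\n", "    print(saved[0], \"->\", ds_one[\"Timestamp\"].size, \"records\")\n", "    ds_one.close()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now scan for all the saved files (excluding any that turn out to be empty) and open them together with [open_mfdataset](http://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html):"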
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scan for saved NetCDF files, but exclude those which are empty\n", "filenames = glob.glob(\"data_*.nc\")\n", "empty_files = []\n", "for filename in filenames:\n", " _ds = xr.open_dataset(filename)\n", " if _ds[\"Timestamp\"].size == 0:\n", " empty_files.append(filename)\n", "filenames = [f for f in filenames if f not in empty_files]\n", "\n", "# Open these data files lazily\n", "ds = xr.open_mfdataset(filenames, combine=\"by_coords\")\n", "ds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amongst other things, this enables us to directly slice out parts of the full dataset without worrying too much about the underlying files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds[\"F\"].sel(Timestamp=slice(\"2014-01-01\", \"2014-02-01\")).plot();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm data*.nc" ] } ], "metadata": { "execution": { "timeout": 1200 }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 4 }