{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Alignment and Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the data is made available as detailed in [Data Ingestion](./01_Data_Ingestion.ipynb), the next step is to ensure the data has been appropriately reshaped and aligned across data sources for consumption by the machine learning pipeline (or other analysis or simulation steps).\n",
    "\n",
    "We'll be aggregating data across several years, using small versions of the Landsat images used in the [Walker Lake](../topics/Walker_Lake.ipynb) example. See that topic for more work on calculating the difference between the water levels over time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "\n",
    "import holoviews as hv\n",
    "\n",
    "import cartopy.crs as ccrs\n",
    "import geoviews as gv\n",
    "\n",
    "import hvplot.xarray\n",
    "\n",
    "hv.extension('bokeh', width=80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recap: Loading data\n",
    "We'll use intake to read in the small versions of the landsat_5 and landsat_8 data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat = intake.open_catalog('../catalog.yml')\n",
    "landsat_5_da = cat.landsat_5_small.read_chunked()\n",
    "landsat_8_da = cat.landsat_8_small.read_chunked()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "landsat_8_da"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Do computations\n",
    "\n",
    "We can do regular computations with these`xarray.DataArray`s. For instance we can calculate the [NDVI](https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index) (vegetation index) for each of these image sets. Note that in landsat 5 the Red and NIR bands are stored in bands 3 and 4 respectively whereas in landsat 8 the Red and NIR are bands 4 and 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NDVI_1988 = (landsat_5_da.sel(band=4) - landsat_5_da.sel(band=3)) / (landsat_5_da.sel(band=4) + landsat_5_da.sel(band=3))\n",
    "NDVI_1988"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NDVI_2017 = (landsat_8_da.sel(band=5) - landsat_8_da.sel(band=4)) / (landsat_8_da.sel(band=5) + landsat_8_da.sel(band=4))\n",
    "NDVI_2017"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can calculate the difference between these two years by subtracting one from the other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diff = NDVI_2017 - NDVI_1988\n",
    "diff.hvplot.image(x='x', y='y', width=400, cmap='coolwarm', clim=(-1,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how pixelated that image looks. What is going on here? To figure it out, let's take a look at the shape of `diff`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diff.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a lot smaller than our NDVI for each year. What is happening is that when we compute the difference on the data we only get values where there are values for each year in the same grid cell. Since the cells are on a different resolution this only happens once every so often. What we'd rather do is interpolate to the same grid and then do our computations on that.\n",
    "\n",
    "## Combine data from overlapping grids\n",
    "\n",
    "These two sets of Landsat bands cover roughly the same area but were taken in 1988 and 2017. We have modified them to have different resolutions, different numbers of grid cells per image, and different x and y offsets. We can see that by printing the first `x` value for each year and seeing that they are not equivalent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(NDVI_1988.x[0].values)\n",
    "print(NDVI_2017.x[0].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also do a quick check of resolution by subtracting the second x value from the first x value for each year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print((NDVI_1988.x[1] - NDVI_1988.x[0]).values)\n",
    "print((NDVI_2017.x[1] - NDVI_2017.x[0]).values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select region of interest\n",
    "\n",
    "We'll define a central point in lat, lon convert that point to the *coordinate reference system (CRS)* of the data, and then use the area around the central point as the Region of Interest (ROI). \n",
    "\n",
    "The first step is getting the CRS. This information is stored in the attributes of our original landsat data. Let's take a look at it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "landsat_8_da.crs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So what we have is a proj4 string. We can convert that to a `cartopy.crs` object using the `geoviews.util.proj_to_cartopy` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "crs = gv.util.proj_to_cartopy(landsat_8_da.crs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now with that `cartopy.crs` object, we can transform out lat, lon - which are in `cartopy.crs.PlateCarree()` to our new data `crs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we just need to define the area that we are interested in around this point. In this case we'll use a 30 km box around the center point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "buffer = 1.5e4\n",
    "\n",
    "xmin = x_center - buffer\n",
    "xmax = x_center + buffer\n",
    "ymin = y_center - buffer\n",
    "ymax = y_center + buffer\n",
    "\n",
    "bounding_box = [[[xmin, ymin], [xmin, ymax], [xmax, ymax], [xmax, ymin]]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's just check that bouding box captures the lake:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gv.tile_sources.EsriImagery * gv.Polygons(bounding_box, crs=crs).options(alpha=0.4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regrid\n",
    "\n",
    "We can use this region to define a new grid onto which we will interpolate our data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "res = 200\n",
    "x = np.arange(xmin, xmax, res)\n",
    "y = np.arange(ymin, ymax, res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use linear interpolation to calculate the values for each grid cell. This operation is not yet supported for dask arrays, so we'll first load in the data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NDVI_1988.load()\n",
    "NDVI_2017.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use the new coordinates defined above to interpolate from the input coordinates to the new grid. The options are `nearest` and `linear` with `linear` being selected by default. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "NDVI_2017_regridded = NDVI_2017.interp(x=x, y=y)\n",
    "NDVI_1988_regridded = NDVI_1988.interp(x=x, y=y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining the data \n",
    "\n",
    "Now that we have our data on the same grid we can combine our two years into one `xarray` object. We will treat the years as names and create an `xarray.Dataset` - a group of named `xarray.DataArray`s that share some of the same coordinates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_regridded = xr.Dataset({'NDVI_1988': NDVI_1988_regridded, 'NDVI_2017': NDVI_2017_regridded})\n",
    "ds_regridded"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vizualizing output\n",
    "We can plot the arrays side by side:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_regridded.NDVI_1988.hvplot.image(x='x', y='y', crs=crs, width=350, clim=(-3, 1)).relabel('1988') +\\\n",
    "ds_regridded.NDVI_2017.hvplot.image(x='x', y='y', crs=crs, width=350, clim=(-3, 1)).relabel('2017')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we can calculate and plot the difference between the two years:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diff_regridded = ds_regridded['NDVI_2017'] - ds_regridded['NDVI_1988']\n",
    "diff_regridded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diff_regridded.hvplot.image(x='x', y='y', crs=crs, width=400, cmap='coolwarm', clim=(-1,1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Side-note: Resampling\n",
    "\n",
    "We can down-sample an `xarray` object by grouping the values into bins based on the desired resolution and taking the mean on each of those bins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "res_1000 = 1e3\n",
    "x_1000 = np.arange(xmin, xmax, res_1000)\n",
    "y_1000 = np.arange(ymin, ymax, res_1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use the left edge as the label for now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "diff_res_1000 = (diff_regridded\n",
    "    .groupby_bins('x', x_1000, labels=x_1000[:-1]).mean(dim='x')\n",
    "    .groupby_bins('y', y_1000, labels=y_1000[:-1]).mean(dim='y')\n",
    "    .rename(x_bins='x', y_bins='y')\n",
    ")\n",
    "diff_res_1000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise: Try to aggregate using a method other than mean."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See GeoViews docs for more information of [resampling grids](http://geoviews.org/user_guide/Resampling_Grids.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Next:\n",
    "\n",
    "Now that the data have been aligned and preprocessed appropriately, we can then use them in subsequent steps in a workflow. Next we'll learn about [Machine Learning](04_Machine_Learning.ipynb)."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}