{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Land Use Clustering\n",
    "Written by Tom Augspurger<br>\n",
    "Created: November 26, 2018<br>\n",
    "Last updated: June 30, 2020"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spectral Clustering Example\n",
    "The image loaded here is a cropped portion of a LANDSAT image of Walker Lake.\n",
    "\n",
    "In addition to `dask-ml`, we'll use `rasterio` to read the data and `matplotlib` to plot the figures.\n",
    "I'm just working on my laptop, so we could use either the threaded or distributed scheduler, but here I'll use the distributed scheduler for the diagnostics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import rasterio\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "import holoviews as hv\n",
    "from holoviews import opts\n",
    "from holoviews.operation.datashader import regrid\n",
    "import cartopy.crs as ccrs\n",
    "import dask.array as da\n",
    "#from dask_ml.cluster import SpectralClustering\n",
    "from dask.distributed import Client\n",
    "hv.extension('bokeh')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import dask_ml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dask_ml.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dask_ml.cluster import SpectralClustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "client = Client(processes=False)\n",
    "#client = Client(n_workers=8, threads_per_worker=1)\n",
    "client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import intake\n",
    "cat = intake.open_catalog('./catalog.yml')\n",
    "list(cat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "landsat_5_img = cat.landsat_5.read_chunked()\n",
    "landsat_5_img"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "crs=ccrs.epsg(32611)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())\n",
    "buffer = 1.7e4\n",
    "\n",
    "xmin = x_center - buffer\n",
    "xmax = x_center + buffer\n",
    "ymin = y_center - buffer\n",
    "ymax = y_center + buffer\n",
    "\n",
    "ROI = landsat_5_img.sel(x=slice(xmin, xmax), y=slice(ymax, ymin))\n",
    "ROI = ROI.where(ROI > ROI.nodatavals[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bands = ROI.astype(float)\n",
    "bands = (bands - bands.mean()) / bands.std()\n",
    "bands"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "opts.defaults(\n",
    "    opts.Image(invert_yaxis=True, width=250, height=250, tools=['hover'], cmap='viridis'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[:3]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "flat_input = bands.stack(z=('y', 'x'))\n",
    "flat_input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "flat_input.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll reshape the image to be how dask-ml / scikit-learn expect it: `(n_samples, n_features)` where n_features is 1 in this case. Then we'll persist that in memory. We still have a small dataset at this point. The large dataset, which dask helps us manage, is the intermediate `n_samples x n_samples` array that spectral clustering operates on. For our 2,500 x 2,500 pixel subset, that's ~50"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = flat_input.values.astype('float').T\n",
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = da.from_array(X, chunks=100_000)\n",
    "X = client.persist(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we'll fit the estimator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = SpectralClustering(n_clusters=4, random_state=0,\n",
    "                         gamma=None,\n",
    "                         kmeans_params={'init_max_iter': 5},\n",
    "                         persist_embedding=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time clf.fit(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = clf.assign_labels_.labels_.compute()\n",
    "labels.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = labels.reshape(bands[0].shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands]) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[3:]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hv.Image(labels)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}