{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spectral Clustering Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The image loaded here is a cropped portion of a LANDSAT image of [Walker Lake](Walker_Lake.ipynb).\n", "\n", "In addition to `dask-ml`, we'll use `rasterio` to read the data and `matplotlib` to plot the figures.\n", "I'm just working on my laptop, so we could use either the threaded or distributed scheduler, but here I'll use the distributed scheduler for the diagnostics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import rasterio\n", "import numpy as np\n", "import xarray as xr\n", "import holoviews as hv\n", "from holoviews import opts\n", "from holoviews.operation.datashader import regrid\n", "import cartopy.crs as ccrs\n", "import dask.array as da\n", "#from dask_ml.cluster import SpectralClustering\n", "from dask.distributed import Client\n", "hv.extension('bokeh')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask_ml" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dask_ml.__version__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask_ml.cluster import SpectralClustering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = Client(processes=False)\n", "#client = Client(n_workers=8, threads_per_worker=1)\n", "client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import intake\n", "cat = intake.open_catalog('./catalog.yml')\n", "list(cat)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "landsat_5_img = cat.landsat_5.read_chunked()\n", "landsat_5_img" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "crs=ccrs.epsg(32611)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())\n", "buffer = 1.7e4\n", "\n", "xmin = x_center - buffer\n", "xmax = x_center + buffer\n", "ymin = y_center - buffer\n", "ymax = y_center + buffer\n", "\n", "ROI = landsat_5_img.sel(x=slice(xmin, xmax), y=slice(ymax, ymin))\n", "ROI = ROI.where(ROI > ROI.nodatavals[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bands = ROI.astype(float)\n", "bands = (bands - bands.mean()) / bands.std()\n", "bands" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "opts.defaults(\n", " opts.Image(invert_yaxis=True, width=250, height=250, tools=['hover'], cmap='viridis'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[:3]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "flat_input = bands.stack(z=('y', 'x'))\n", "flat_input" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "flat_input.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll reshape the image to be how dask-ml / scikit-learn expect it: `(n_samples, n_features)` where n_features is 1 in this case. Then we'll persist that in memory. We still have a small dataset at this point. The large dataset, which dask helps us manage, is the intermediate `n_samples x n_samples` array that spectral clustering operates on. For our 2,500 x 2,500 pixel subset, that's ~50" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = flat_input.values.astype('float').T\n", "X.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = da.from_array(X, chunks=100_000)\n", "X = client.persist(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we'll fit the estimator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = SpectralClustering(n_clusters=4, random_state=0,\n", " gamma=None,\n", " kmeans_params={'init_max_iter': 5},\n", " persist_embedding=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time clf.fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = clf.assign_labels_.labels_.compute()\n", "labels.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labels = labels.reshape(bands[0].shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands]) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[3:]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hv.Image(labels)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 2 }