{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Land Use Clustering\n",
"Written by Tom Augspurger
\n",
"Created: November 26, 2018
\n",
"Last updated: June 30, 2020"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Spectral Clustering Example\n",
"The image loaded here is a cropped portion of a LANDSAT image of Walker Lake.\n",
"\n",
"In addition to `dask-ml`, we'll use `rasterio` to read the data and `matplotlib` to plot the figures.\n",
"I'm just working on my laptop, so we could use either the threaded or distributed scheduler, but here I'll use the distributed scheduler for the diagnostics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import rasterio\n",
"import numpy as np\n",
"import xarray as xr\n",
"import holoviews as hv\n",
"from holoviews import opts\n",
"from holoviews.operation.datashader import regrid\n",
"import cartopy.crs as ccrs\n",
"import dask.array as da\n",
"#from dask_ml.cluster import SpectralClustering\n",
"from dask.distributed import Client\n",
"hv.extension('bokeh')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask_ml"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dask_ml.__version__"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dask_ml.cluster import SpectralClustering"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client = Client(processes=False)\n",
"#client = Client(n_workers=8, threads_per_worker=1)\n",
"client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import intake\n",
"cat = intake.open_catalog('./catalog.yml')\n",
"list(cat)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"landsat_5_img = cat.landsat_5.read_chunked()\n",
"landsat_5_img"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"crs=ccrs.epsg(32611)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x_center, y_center = crs.transform_point(-118.7081, 38.6942, ccrs.PlateCarree())\n",
"buffer = 1.7e4\n",
"\n",
"xmin = x_center - buffer\n",
"xmax = x_center + buffer\n",
"ymin = y_center - buffer\n",
"ymax = y_center + buffer\n",
"\n",
"ROI = landsat_5_img.sel(x=slice(xmin, xmax), y=slice(ymax, ymin))\n",
"ROI = ROI.where(ROI > ROI.nodatavals[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bands = ROI.astype(float)\n",
"bands = (bands - bands.mean()) / bands.std()\n",
"bands"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"opts.defaults(\n",
" opts.Image(invert_yaxis=True, width=250, height=250, tools=['hover'], cmap='viridis'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[:3]])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flat_input = bands.stack(z=('y', 'x'))\n",
"flat_input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flat_input.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll reshape the image to be how dask-ml / scikit-learn expect it: `(n_samples, n_features)` where n_features is 1 in this case. Then we'll persist that in memory. We still have a small dataset at this point. The large dataset, which dask helps us manage, is the intermediate `n_samples x n_samples` array that spectral clustering operates on. For our 2,500 x 2,500 pixel subset, that's ~50"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = flat_input.values.astype('float').T\n",
"X.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = da.from_array(X, chunks=100_000)\n",
"X = client.persist(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we'll fit the estimator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"clf = SpectralClustering(n_clusters=4, random_state=0,\n",
" gamma=None,\n",
" kmeans_params={'init_max_iter': 5},\n",
" persist_embedding=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%time clf.fit(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = clf.assign_labels_.labels_.compute()\n",
"labels.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels = labels.reshape(bands[0].shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands]) "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hv.Layout([regrid(hv.Image(band, kdims=['x', 'y'])) for band in bands[3:]])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hv.Image(labels)"
]
}
],
"metadata": {
"language_info": {
"name": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}