{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train Models on Large Datasets\n",
    "==============================\n",
    "\n",
    "Most estimators in scikit-learn are designed to work with NumPy arrays or scipy sparse matricies.\n",
    "These data structures must fit in the RAM on a single machine.\n",
    "\n",
    "Estimators implemented in Dask-ML work well with Dask Arrays and DataFrames. This can be much larger than a single machine's RAM. They can be distributed in memory on a cluster of machines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dask.distributed import Client\n",
    "\n",
    "# Scale up: connect to your own cluster with more resources\n",
    "# see http://dask.pydata.org/en/latest/setup.html\n",
    "client = Client(processes=False, threads_per_worker=4,\n",
    "                n_workers=1, memory_limit='2GB')\n",
    "client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import dask_ml.datasets\n",
    "import dask_ml.cluster\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we'll use `dask_ml.datasets.make_blobs` to generate some random *dask* arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scale up: increase n_samples or n_features\n",
    "X, y = dask_ml.datasets.make_blobs(n_samples=1000000,\n",
    "                                   chunks=100000,\n",
    "                                   random_state=0,\n",
    "                                   centers=3)\n",
    "X = X.persist()\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use the k-means implemented in Dask-ML to cluster the points. It uses the `k-means||` (read: \"k-means parallel\") initialization algorithm, which scales better than `k-means++`. All of the computation, both during and after initialization, can be done in parallel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "km = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)\n",
    "km.fit(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll plot a sample of points, colored by the cluster each falls into."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ax.scatter(X[::1000, 0], X[::1000, 1], marker='.', c=km.labels_[::1000],\n",
    "           cmap='viridis', alpha=0.25);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For all the estimators implemented in Dask-ML, see the [API documentation](https://ml.dask.org/modules/api.html#)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}