{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Train Models on Large Datasets\n", "==============================\n", "\n", "Most estimators in scikit-learn are designed to work with NumPy arrays or SciPy sparse matrices.\n", "These data structures must fit in the RAM of a single machine.\n", "\n", "Estimators implemented in Dask-ML work well with Dask Arrays and DataFrames. These can be much larger than a single machine's RAM, because they can be distributed in memory across a cluster of machines." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dask.distributed import Client\n", "\n", "# Scale up: connect to your own cluster with more resources\n", "# see http://dask.pydata.org/en/latest/setup.html\n", "client = Client(processes=False, threads_per_worker=4,\n", " n_workers=1, memory_limit='2GB')\n", "client" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import dask_ml.datasets\n", "import dask_ml.cluster\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we'll use `dask_ml.datasets.make_blobs` to generate some random Dask arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scale up: increase n_samples or n_features\n", "X, y = dask_ml.datasets.make_blobs(n_samples=1000000,\n", " chunks=100000,\n", " random_state=0,\n", " centers=3)\n", "X = X.persist()\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the k-means implementation in Dask-ML to cluster the points. It uses the `k-means||` (read: \"k-means parallel\") initialization algorithm, which scales better than `k-means++`. All of the computation, both during and after initialization, can be done in parallel."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "km = dask_ml.cluster.KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)\n", "km.fit(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll plot a sample of points, colored by the cluster each falls into." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "ax.scatter(X[::1000, 0], X[::1000, 1], marker='.', c=km.labels_[::1000],\n", " cmap='viridis', alpha=0.25);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For all the estimators implemented in Dask-ML, see the [API documentation](https://ml.dask.org/modules/api.html#)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }