{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use Voting Classifiers\n",
    "======================\n",
    "\n",
    "A [Voting classifier](http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier) model combines multiple different models (i.e., sub-estimators) into a single model, which is (ideally) stronger than any of the individual models alone. \n",
    "\n",
    "[Dask](http://ml.dask.org/joblib.html) provides the software to train individual sub-estimators on different machines in a cluster. This enables users to train more models in parallel than would have been possible on a single machine. Note that users will only observe this benefit if they have a distributed cluster with more resources than their single machine (because sklearn already enables users to parallelize training across cores on a single machine).\n",
    "\n",
    "What follows is an example of how one would deploy a voting classifier model in dask (using a local cluster).\n",
    "\n",
    "<img src=\"http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg\" width=\"30%\" alt=\"Dask logo\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import VotingClassifier\n",
    "from sklearn.linear_model import SGDClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "import sklearn.datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a synthetic dataset (with 1000 rows and 20 columns) that we can give to the voting classifier model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y = sklearn.datasets.make_classification(n_samples=1_000, n_features=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We specify the VotingClassifier as a list of (name, sub-estimator) tuples. Fitting the VotingClassifier on the data fits each of the sub-estimators in turn. We set the ```n_jobs``` argument to be -1, which instructs sklearn to use all available cores (notice that we haven't used dask)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classifiers = [\n",
    "    ('sgd', SGDClassifier(max_iter=1000)),\n",
    "    ('logisticregression', LogisticRegression()),\n",
    "    ('svc', SVC(gamma='auto')),\n",
    "]\n",
    "clf = VotingClassifier(classifiers, n_jobs=-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We call the classifier's fit method in order to train the classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time clf.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating a Dask [client](https://distributed.readthedocs.io/en/latest/client.html) provides performance and progress metrics via the dashboard. Because ```Client``` is given no arugments, its output refers to a [local cluster](http://distributed.readthedocs.io/en/latest/local-cluster.html) (not a distributed cluster).\n",
    "\n",
    "We can view the dashboard by clicking the link after running the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import joblib\n",
    "from distributed import Client\n",
    "\n",
    "client = Client()\n",
    "client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To train the voting classifier, we call the classifier's fit method, but enclosed in joblib's ```parallel_backend``` context manager. This distributes training of sub-estimators acoss the cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time \n",
    "with joblib.parallel_backend(\"dask\"):\n",
    "    clf.fit(X, y)\n",
    "\n",
    "print(clf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note, that we see no advantage of using dask because we are using a local cluster rather than a distributed cluster and sklearn is already using all my computer's cores. If we were using a distributed cluster, dask would enable us to take advantage of the multiple machines and train sub-estimators across them."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}