{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Logging Concurrently\n",
    " \n",
    "Let's see how a few other popular ``scikit-learn`` models perform with the\n",
    "wine dataset. ``rubicon_ml``'s filesystem logging is totally thread-safe,\n",
    "so we can test a lot of model configurations at once.\n",
    "\n",
    "**Note**: ``rubicon_ml``'s in-memory logging is inherently not threadsafe as\n",
    "each thread has its own memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<rubicon_ml.client.project.Project at 0x155660d00>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "from rubicon_ml import Rubicon\n",
    "\n",
    "\n",
    "root_dir = os.environ.get(\"RUBICON_ROOT\", \"rubicon-root\")\n",
    "root_path = f\"{os.path.dirname(os.getcwd())}/{root_dir}\"\n",
    "\n",
    "rubicon = Rubicon(persistence=\"filesystem\", root_dir=root_path)\n",
    "project = rubicon.get_or_create_project(\n",
    "    \"Concurrent Experiments\",\n",
    "    description=\"training multiple models in parallel\",\n",
    ")\n",
    "\n",
    "project"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a recap of the contents of the wine dataset, check out ``wine.DESCR``\n",
    "and ``wine.data``. We'll put together a training dataset using a subset of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_wine\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "\n",
    "wine = load_wine()\n",
    "wine_feature_names = wine.feature_names\n",
    "wine_datasets = train_test_split(\n",
    "    wine[\"data\"],\n",
    "    wine[\"target\"],\n",
    "    test_size=0.25,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll use this ``run_experiment`` function to log a new **experiment** to\n",
    "the provided **project** then train, run and log a model of type ``classifier_cls`` using\n",
    "the training and testing data in ``wine_datasets``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import namedtuple\n",
    "\n",
    "\n",
    "SklearnTrainingMetadata = namedtuple(\"SklearnTrainingMetadata\", \"module_name method\")\n",
    "\n",
    "def run_experiment(project, classifier_cls, wine_datasets, feature_names, **kwargs):\n",
    "    X_train, X_test, y_train, y_test = wine_datasets\n",
    "    \n",
    "    experiment = project.log_experiment(\n",
    "        training_metadata=[\n",
    "            SklearnTrainingMetadata(\"sklearn.datasets\", \"load_wine\"),\n",
    "        ],\n",
    "        model_name=classifier_cls.__name__,\n",
    "        tags=[classifier_cls.__name__],\n",
    "    )\n",
    "    \n",
    "    for key, value in kwargs.items():\n",
    "        experiment.log_parameter(key, value)\n",
    "    \n",
    "    for name in feature_names:\n",
    "        experiment.log_feature(name)\n",
    "        \n",
    "    classifier = classifier_cls(**kwargs)\n",
    "    classifier.fit(X_train, y_train)\n",
    "    classifier.predict(X_test)\n",
    "    \n",
    "    accuracy = classifier.score(X_test, y_test)\n",
    "    \n",
    "    experiment.log_metric(\"accuracy\", accuracy)\n",
    "\n",
    "    if accuracy >= .95:\n",
    "        experiment.add_tags([\"success\"])\n",
    "    else:\n",
    "        experiment.add_tags([\"failure\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're familiar with trying to use the ``multiprocessing`` library in iPython 3, you'll\n",
    "know the two don't play along well. In order for the ``multiprocessing`` library to locate\n",
    "the ``run_experiment`` function, we'll need to import it from a separate file.\n",
    "``./logging_concurrently.py`` defines the same ``run_experiment`` function as above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from logging_concurrently import run_experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time we'll take a look at three classifiers - ``RandomForestClassifier``, ``DecisionTreeClassifier``, and\n",
    "``KNeighborsClassifier`` - to see which performs best. Each classifier will be run across four sets of parameters\n",
    "(provided as ``kwargs`` to ``run_experiment``), for a total of 12 experiments. Here, we'll build up a list of\n",
    "processes that will run each experiment in parallel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import multiprocessing\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "\n",
    "processes = []\n",
    "\n",
    "for n_estimators in [10, 20, 30, 40]:\n",
    "    processes.append(multiprocessing.Process(\n",
    "        target=run_experiment,\n",
    "        args=[project, RandomForestClassifier, wine_datasets, wine_feature_names],\n",
    "        kwargs={\"n_estimators\": n_estimators},\n",
    "    ))   \n",
    "\n",
    "for n_neighbors in [5, 10, 15, 20]:\n",
    "    processes.append(multiprocessing.Process(\n",
    "        target=run_experiment,\n",
    "        args=[project, KNeighborsClassifier, wine_datasets, wine_feature_names],\n",
    "        kwargs={\"n_neighbors\": n_neighbors},\n",
    "    ))\n",
    "\n",
    "for criterion in [\"gini\", \"entropy\"]:\n",
    "    for splitter in [\"best\", \"random\"]:\n",
    "        processes.append(multiprocessing.Process(\n",
    "            target=run_experiment,\n",
    "            args=[project, DecisionTreeClassifier, wine_datasets, wine_feature_names],\n",
    "            kwargs={\n",
    "                \"criterion\": criterion,\n",
    "                \"splitter\": splitter,\n",
    "            },\n",
    "        ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's run all our experiments in parallel!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "for process in processes:\n",
    "    process.start()\n",
    "\n",
    "for process in processes:\n",
    "    process.join()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can validate that we successfully logged all 12 experiments to our project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(project.experiments())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see which experiments we tagged as successful and what type of model they used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "experiment 3a144831 was successful using a RandomForestClassifier\n",
      "experiment 67961ac9 was successful using a RandomForestClassifier\n",
      "experiment 712564c0 was successful using a RandomForestClassifier\n",
      "experiment f143ea43 was successful using a RandomForestClassifier\n"
     ]
    }
   ],
   "source": [
    "for e in project.experiments(tags=[\"success\"]):    \n",
    "    print(f\"experiment {e.id[:8]} was successful using a {e.model_name}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also take a deeper look at any of our experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "experiment 0b6b8114-fe72-49af-813b-23e1667c1486\n",
      "training metadata: SklearnTrainingMetadata(module_name='sklearn.datasets', method='load_wine')\n",
      "tags: ['KNeighborsClassifier', 'failure']\n",
      "parameters: [('n_neighbors', 10)]\n",
      "metrics: [('accuracy', 0.7111111111111111)]\n"
     ]
    }
   ],
   "source": [
    "first_experiment = project.experiments()[0]\n",
    "\n",
    "training_metadata = SklearnTrainingMetadata(*first_experiment.training_metadata)\n",
    "tags = first_experiment.tags\n",
    "\n",
    "parameters = [(p.name, p.value) for p in first_experiment.parameters()]\n",
    "metrics = [(m.name, m.value) for m in first_experiment.metrics()]\n",
    "    \n",
    "print(\n",
    "    f\"experiment {first_experiment.id}\\n\"\n",
    "    f\"training metadata: {training_metadata}\\ntags: {tags}\\n\"\n",
    "    f\"parameters: {parameters}\\nmetrics: {metrics}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we could grab the project's data as a dataframe!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "      <th>description</th>\n",
       "      <th>model_name</th>\n",
       "      <th>commit_hash</th>\n",
       "      <th>tags</th>\n",
       "      <th>created_at</th>\n",
       "      <th>splitter</th>\n",
       "      <th>criterion</th>\n",
       "      <th>n_estimators</th>\n",
       "      <th>n_neighbors</th>\n",
       "      <th>accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>aa4faa9b-3051-454d-915d-54b5c16f6e34</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>DecisionTreeClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[DecisionTreeClassifier, failure]</td>\n",
       "      <td>2021-04-28 12:45:27.759817</td>\n",
       "      <td>best</td>\n",
       "      <td>gini</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.933333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>7aec1e7b-c7dd-43e7-b20c-a400b2141d23</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>DecisionTreeClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[DecisionTreeClassifier, failure]</td>\n",
       "      <td>2021-04-28 12:45:27.752157</td>\n",
       "      <td>best</td>\n",
       "      <td>entropy</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.933333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3a144831-b093-4dc3-85a6-71bc7a4e26a4</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>RandomForestClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[RandomForestClassifier, success]</td>\n",
       "      <td>2021-04-28 12:45:27.715676</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7503bcf0-2daa-4c66-9b45-a75863d446a5</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>DecisionTreeClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[DecisionTreeClassifier, failure]</td>\n",
       "      <td>2021-04-28 12:45:27.714952</td>\n",
       "      <td>random</td>\n",
       "      <td>gini</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.911111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>f143ea43-d780-4639-a082-5c81b850f1e0</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>RandomForestClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[RandomForestClassifier, success]</td>\n",
       "      <td>2021-04-28 12:45:27.636078</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.977778</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>67961ac9-4bd5-4916-925c-fd38844d78c6</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>RandomForestClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[RandomForestClassifier, success]</td>\n",
       "      <td>2021-04-28 12:45:27.635968</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>30.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.977778</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>8b6b9b9a-1e06-4e0a-af2e-9649d79872e3</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>KNeighborsClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[KNeighborsClassifier, failure]</td>\n",
       "      <td>2021-04-28 12:45:27.619764</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0.733333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>3c679288-1402-4ffb-bbb1-23fcfd0b1415</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>DecisionTreeClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[DecisionTreeClassifier, failure]</td>\n",
       "      <td>2021-04-28 12:45:27.615204</td>\n",
       "      <td>random</td>\n",
       "      <td>entropy</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.888889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>38aa6e7b-efcc-4ee5-9950-5449ef8681ff</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>KNeighborsClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[KNeighborsClassifier, failure]</td>\n",
       "      <td>2021-04-28 12:45:27.585590</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.0</td>\n",
       "      <td>0.711111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0b6b8114-fe72-49af-813b-23e1667c1486</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>KNeighborsClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[KNeighborsClassifier, failure]</td>\n",
       "      <td>2021-04-28 12:45:27.583647</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.711111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>a0b7e295-6047-45da-81e6-08dc8288a2eb</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>KNeighborsClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[KNeighborsClassifier, failure]</td>\n",
       "      <td>2021-04-28 12:45:27.564990</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.733333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>712564c0-e113-42ea-b59c-90a5ce0cf37b</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>RandomForestClassifier</td>\n",
       "      <td>None</td>\n",
       "      <td>[RandomForestClassifier, success]</td>\n",
       "      <td>2021-04-28 12:45:27.511807</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>40.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.955556</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      id  name description  \\\n",
       "0   aa4faa9b-3051-454d-915d-54b5c16f6e34  None        None   \n",
       "1   7aec1e7b-c7dd-43e7-b20c-a400b2141d23  None        None   \n",
       "2   3a144831-b093-4dc3-85a6-71bc7a4e26a4  None        None   \n",
       "3   7503bcf0-2daa-4c66-9b45-a75863d446a5  None        None   \n",
       "4   f143ea43-d780-4639-a082-5c81b850f1e0  None        None   \n",
       "5   67961ac9-4bd5-4916-925c-fd38844d78c6  None        None   \n",
       "6   8b6b9b9a-1e06-4e0a-af2e-9649d79872e3  None        None   \n",
       "7   3c679288-1402-4ffb-bbb1-23fcfd0b1415  None        None   \n",
       "8   38aa6e7b-efcc-4ee5-9950-5449ef8681ff  None        None   \n",
       "9   0b6b8114-fe72-49af-813b-23e1667c1486  None        None   \n",
       "10  a0b7e295-6047-45da-81e6-08dc8288a2eb  None        None   \n",
       "11  712564c0-e113-42ea-b59c-90a5ce0cf37b  None        None   \n",
       "\n",
       "                model_name commit_hash                               tags  \\\n",
       "0   DecisionTreeClassifier        None  [DecisionTreeClassifier, failure]   \n",
       "1   DecisionTreeClassifier        None  [DecisionTreeClassifier, failure]   \n",
       "2   RandomForestClassifier        None  [RandomForestClassifier, success]   \n",
       "3   DecisionTreeClassifier        None  [DecisionTreeClassifier, failure]   \n",
       "4   RandomForestClassifier        None  [RandomForestClassifier, success]   \n",
       "5   RandomForestClassifier        None  [RandomForestClassifier, success]   \n",
       "6     KNeighborsClassifier        None    [KNeighborsClassifier, failure]   \n",
       "7   DecisionTreeClassifier        None  [DecisionTreeClassifier, failure]   \n",
       "8     KNeighborsClassifier        None    [KNeighborsClassifier, failure]   \n",
       "9     KNeighborsClassifier        None    [KNeighborsClassifier, failure]   \n",
       "10    KNeighborsClassifier        None    [KNeighborsClassifier, failure]   \n",
       "11  RandomForestClassifier        None  [RandomForestClassifier, success]   \n",
       "\n",
       "                   created_at splitter criterion  n_estimators  n_neighbors  \\\n",
       "0  2021-04-28 12:45:27.759817     best      gini           NaN          NaN   \n",
       "1  2021-04-28 12:45:27.752157     best   entropy           NaN          NaN   \n",
       "2  2021-04-28 12:45:27.715676      NaN       NaN          20.0          NaN   \n",
       "3  2021-04-28 12:45:27.714952   random      gini           NaN          NaN   \n",
       "4  2021-04-28 12:45:27.636078      NaN       NaN          10.0          NaN   \n",
       "5  2021-04-28 12:45:27.635968      NaN       NaN          30.0          NaN   \n",
       "6  2021-04-28 12:45:27.619764      NaN       NaN           NaN         15.0   \n",
       "7  2021-04-28 12:45:27.615204   random   entropy           NaN          NaN   \n",
       "8  2021-04-28 12:45:27.585590      NaN       NaN           NaN         20.0   \n",
       "9  2021-04-28 12:45:27.583647      NaN       NaN           NaN         10.0   \n",
       "10 2021-04-28 12:45:27.564990      NaN       NaN           NaN          5.0   \n",
       "11 2021-04-28 12:45:27.511807      NaN       NaN          40.0          NaN   \n",
       "\n",
       "    accuracy  \n",
       "0   0.933333  \n",
       "1   0.933333  \n",
       "2   1.000000  \n",
       "3   0.911111  \n",
       "4   0.977778  \n",
       "5   0.977778  \n",
       "6   0.733333  \n",
       "7   0.888889  \n",
       "8   0.711111  \n",
       "9   0.711111  \n",
       "10  0.733333  \n",
       "11  0.955556  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ddf = rubicon.get_project_as_df(\"Concurrent Experiments\", df_type=\"dask\")\n",
    "ddf.compute()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}