{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# OpenML - Machine Learning as a community\n",
    "> A description of how OpenML fits into traditional ML practices\n",
    "\n",
    "- toc: true \n",
    "- badges: true\n",
    "- comments: true\n",
    "- categories: [OpenML]\n",
    "- image: images/fastpages_posts/openml.png\n",
    "- author: Neeratyoy Mallik"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[OpenML](https://www.openml.org/) is an online Machine Learning (ML) experiments database accessible to everyone for free. The core idea is to have a single repository of datasets and results of ML experiments on them. Although ML has gained a lot of popularity in recent years, with a plethora of tools now available, most ML experiments still happen in silos rather than as part of one shared community effort.\n",
    "In this post, we shall try to get a brief glimpse of what OpenML offers and how it can fit our current Machine Learning practices.\n",
    "\n",
    "Let us jump straight into getting our hands dirty by building a simple machine learning model. If it is simplicity we are looking for, it has to be the Iris dataset that we shall work with. In the example script below, we load the Iris dataset available with scikit-learn and use 5-fold cross-validation to evaluate a Random Forest of 10 trees. It sounds trivial enough and indeed takes fewer than 10 lines of code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import cross_val_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(150, 4) (150,)\n"
     ]
    }
   ],
   "source": [
    "# Loading Iris dataset\n",
    "X, y = datasets.load_iris(return_X_y=True)\n",
    "print(X.shape, y.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initializing a Random Forest with \n",
    "# arbitrary hyperparameters\n",
    "# max_depth kept as 2 since Iris has\n",
    "# only 4 features\n",
    "clf = RandomForestClassifier(n_estimators=10, max_depth=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean score : 0.94000\n"
     ]
    }
   ],
   "source": [
    "scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')\n",
    "print(\"Mean score : {:.5f}\".format(scores.mean()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A simple script and we achieve a mean accuracy of **94.00%**. That was easy. It is really amazing how far we have come with ML tools that make it easy to get started. As a result, we have hundreds of thousands of people working with these tools every day. That inevitably leads to the reinvention of the wheel. The tasks that individual ML practitioners perform often overlap significantly, and much of that work could be skipped by reusing what someone from the community has already done. At the end of the day, we didn't build a Random Forest model all the way from scratch. We gladly reused code written by generous folks from the community. The special attribute of our species is the ability to work as a collective, wherein our combined intellect becomes larger than the sum of its individual parts. Why not do the same for ML? I mean, can I see what other ML practitioners have done to get better scores on the Iris dataset?\n",
    "\n",
    "Answering this is one of the targets of this post. We shall subsequently explore if this can be done, with the help of [OpenML](https://www.openml.org/). However, first, we shall briefly familiarize ourselves with a few terms and see how we can split the earlier example into modular components."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OpenML Components"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<figure>\n",
    "  <img src=\"../images/fastpages-posts/openml.png\" alt=\"Image source\">\n",
    "  <figcaption></figcaption>\n",
    "</figure>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Image source: <a href=\"https://medium.com/open-machine-learning/openml-1e0d43f0ae13\">https://medium.com/open-machine-learning/openml-1e0d43f0ae13</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Dataset**: OpenML houses more than 2,000 active datasets for various regression, classification, clustering, survival analysis, stream processing tasks and more. Any user can upload a dataset. Once uploaded, the server computes certain meta-features on the dataset, such as *Number of classes*, *Number of missing values*, and *Number of features*. With respect to our earlier example, the following line is the equivalent of fetching a dataset from OpenML."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "X, y = datasets.load_iris(return_X_y=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Task**: A task is linked to a specific dataset and defines what the target/dependent variable is. It also specifies evaluation measures such as accuracy, precision, or area under the curve, and the estimation procedure to be used, such as 10-fold *cross-validation* or an n% holdout set. With respect to our earlier example, the *parameters* to the following function call capture the idea of a task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Flow**: Describes the kind of modelling to be performed. It could be a single model or a series of steps, e.g., a scikit-learn pipeline. For now, we have used a simple Random Forest model, which is the *flow* component here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "clf = RandomForestClassifier(n_estimators=10, max_depth=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Run**: Pairing a *flow* with a *task* results in a *run*. The *run* holds the predictions, which the server turns into *evaluations*. This is effectively captured by the *execution* of the line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, this may appear a little obfuscating given that we are trying to compartmentalize a simple 10-line script which works just fine. However, if we take a few seconds to go through the 4 components explained above, we can see that they turn our *training of a Random Forest* on Iris into a series of modular steps. Modules are a fundamental concept in Computer Science. They are like Lego blocks: once we have them, we can plug and play with ease. The code snippet below rewrites the earlier example using the ideas of the OpenML components described, to give a glimpse of what we can potentially gain during experimentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import cross_val_score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### DATASET component"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# To load IRIS dataset as a dataset module/component\n",
    "def dataset():\n",
    "    X, y = datasets.load_iris(return_X_y=True)\n",
    "    return X, y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TASK component"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tasks here define the number of cross-validation folds\n",
    "# and the scoring metric to be used for evaluation\n",
    "def task_1(f):\n",
    "    X, y = dataset()  # loads IRIS\n",
    "    return cross_val_score(f, X, y, cv=5, \n",
    "                           scoring='accuracy')\n",
    "\n",
    "def task_2(f):\n",
    "    X, y = dataset()  # loads IRIS\n",
    "    return cross_val_score(f, X, y, cv=15, \n",
    "                           scoring='balanced_accuracy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### FLOW component"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Flows determine the modelling technique to be applied\n",
    "# Helps define a model irrespective of dataset or tasks\n",
    "def flow_1():\n",
    "    clf = RandomForestClassifier(n_estimators=10, max_depth=2)\n",
    "    return clf\n",
    "\n",
    "def flow_2():\n",
    "    clf = SVC(gamma='auto', kernel='linear')\n",
    "    return clf "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### RUN component"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A run evaluates a task-flow pairing\n",
    "# and therefore in effect executes the modelling\n",
    "# of a dataset as per the task definition\n",
    "def run(task, flow):\n",
    "    return task(flow)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RF using task 1: 0.95333; task 2: 0.94444\n",
      "SVM using task 1: 0.98; task 2: 0.97222\n"
     ]
    }
   ],
   "source": [
    "# Results for Random Forest\n",
    "rf_task_1 = run(task_1, flow_1())\n",
    "rf_task_2 = run(task_2, flow_1())\n",
    "print(\"RF using task 1: {:<.5}; task 2: {:<.5}\".format(rf_task_1.mean(), rf_task_2.mean()))\n",
    "\n",
    "# Results for SVM\n",
    "svm_task_1 = run(task_1, flow_2())\n",
    "svm_task_2 = run(task_2, flow_2())\n",
    "print(\"SVM using task 1: {:<.5}; task 2: {:<.5}\".format(svm_task_1.mean(), svm_task_2.mean()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can, therefore, compose various tasks and flows as independent operations. Runs can then pair any such task and flow to construct an ML *workflow* and return the evaluated scores. This approach lets us define such components once and reuse them for any combination of dataset, model, and evaluation in the future. Imagine if the entire ML *community* defined such tasks and the various simple-to-complicated flows that they use in their daily practice. We could build custom working ML pipelines and even compare the performance of our techniques with others on the same *task*! OpenML aims for exactly that. In the next section of this post, we shall scratch the surface of OpenML to see if it actually delivers what it promises."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using OpenML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OpenML-Python can be installed using *pip* (`pip install openml`) or by [cloning the git repo](https://openml.github.io/openml-python/develop/contributing.html#installation) and installing the current development version. So shall we then install OpenML? ;) It will be beneficial if the code snippets are tried out as this post is read. A consolidated Jupyter notebook with all the code can be found [here](https://nbviewer.jupyter.org/github/Neeratyoy/openml-python/blob/blog/OpenML%20-%20Machine%20Learning%20as%20a%20community.ipynb).\n",
    "\n",
    "Now that we have OpenML, let us jump straight into figuring out how we can get the Iris dataset from there. We can always browse the [OpenML website](https://www.openml.org/) and search for Iris. That is the easy route. Let us get familiar with the programmatic approach and learn how to fish instead. The OpenML-Python API can be found [here](https://openml.github.io/openml-python/develop/api.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Retrieving Iris from OpenML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the example below, we will list all the datasets available on OpenML. We can choose the output format. I'll go with *dataframe* so that we obtain a pandas DataFrame and get a neat tabular representation in which to search and sort specific entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openml\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3073, 16)\n",
      "did\n",
      "name\n",
      "version\n",
      "uploader\n",
      "status\n",
      "format\n",
      "MajorityClassSize\n",
      "MaxNominalAttDistinctValues\n",
      "MinorityClassSize\n",
      "NumberOfClasses\n",
      "NumberOfFeatures\n",
      "NumberOfInstances\n",
      "NumberOfInstancesWithMissingValues\n",
      "NumberOfMissingValues\n",
      "NumberOfNumericFeatures\n",
      "NumberOfSymbolicFeatures\n"
     ]
    }
   ],
   "source": [
    "# Fetching the list of all available datasets on OpenML\n",
    "d = openml.datasets.list_datasets(output_format='dataframe')\n",
    "print(d.shape)\n",
    "\n",
    "# Listing column names or attributes that OpenML offers\n",
    "for name in d.columns:\n",
    "    print(name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   did        name  version uploader  status format  MajorityClassSize  \\\n",
      "2    2      anneal        1        1  active   ARFF              684.0   \n",
      "3    3    kr-vs-kp        1        1  active   ARFF             1669.0   \n",
      "4    4       labor        1        1  active   ARFF               37.0   \n",
      "5    5  arrhythmia        1        1  active   ARFF              245.0   \n",
      "6    6      letter        1        1  active   ARFF              813.0   \n",
      "\n",
      "   MaxNominalAttDistinctValues  MinorityClassSize  NumberOfClasses  \\\n",
      "2                          7.0                8.0              5.0   \n",
      "3                          3.0             1527.0              2.0   \n",
      "4                          3.0               20.0              2.0   \n",
      "5                         13.0                2.0             13.0   \n",
      "6                         26.0              734.0             26.0   \n",
      "\n",
      "   NumberOfFeatures  NumberOfInstances  NumberOfInstancesWithMissingValues  \\\n",
      "2              39.0              898.0                               898.0   \n",
      "3              37.0             3196.0                                 0.0   \n",
      "4              17.0               57.0                                56.0   \n",
      "5             280.0              452.0                               384.0   \n",
      "6              17.0            20000.0                                 0.0   \n",
      "\n",
      "   NumberOfMissingValues  NumberOfNumericFeatures  NumberOfSymbolicFeatures  \n",
      "2                22175.0                      6.0                      33.0  \n",
      "3                    0.0                      0.0                      37.0  \n",
      "4                  326.0                      8.0                       9.0  \n",
      "5                  408.0                    206.0                      74.0  \n",
      "6                    0.0                     16.0                       1.0  \n"
     ]
    }
   ],
   "source": [
    "print(d.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column names indicate that they contain the meta-information about each of the datasets, and at this moment, we have access to **3073** datasets as indicated by the shape of the dataframe. We shall try searching for 'iris' in the column *name* and also use the *version* column to sort the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>did</th>\n",
       "      <th>name</th>\n",
       "      <th>version</th>\n",
       "      <th>uploader</th>\n",
       "      <th>status</th>\n",
       "      <th>format</th>\n",
       "      <th>MajorityClassSize</th>\n",
       "      <th>MaxNominalAttDistinctValues</th>\n",
       "      <th>MinorityClassSize</th>\n",
       "      <th>NumberOfClasses</th>\n",
       "      <th>NumberOfFeatures</th>\n",
       "      <th>NumberOfInstances</th>\n",
       "      <th>NumberOfInstancesWithMissingValues</th>\n",
       "      <th>NumberOfMissingValues</th>\n",
       "      <th>NumberOfNumericFeatures</th>\n",
       "      <th>NumberOfSymbolicFeatures</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41950</th>\n",
       "      <td>41950</td>\n",
       "      <td>iris_test_upload</td>\n",
       "      <td>1</td>\n",
       "      <td>4030</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42261</th>\n",
       "      <td>42261</td>\n",
       "      <td>iris-example</td>\n",
       "      <td>1</td>\n",
       "      <td>348</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>50.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>451</th>\n",
       "      <td>451</td>\n",
       "      <td>irish</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>278.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>222.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>500.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>969</th>\n",
       "      <td>969</td>\n",
       "      <td>iris</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>active</td>\n",
       "      <td>ARFF</td>\n",
       "      <td>100.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>50.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>150.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         did              name  version uploader  status format  \\\n",
       "61        61              iris        1        1  active   ARFF   \n",
       "41950  41950  iris_test_upload        1     4030  active   ARFF   \n",
       "42261  42261      iris-example        1      348  active   ARFF   \n",
       "451      451             irish        1        2  active   ARFF   \n",
       "969      969              iris        3        2  active   ARFF   \n",
       "\n",
       "       MajorityClassSize  MaxNominalAttDistinctValues  MinorityClassSize  \\\n",
       "61                  50.0                          3.0               50.0   \n",
       "41950               50.0                          3.0               50.0   \n",
       "42261               50.0                          NaN               50.0   \n",
       "451                278.0                         10.0              222.0   \n",
       "969                100.0                          2.0               50.0   \n",
       "\n",
       "       NumberOfClasses  NumberOfFeatures  NumberOfInstances  \\\n",
       "61                 3.0               5.0              150.0   \n",
       "41950              3.0               5.0              150.0   \n",
       "42261              3.0               5.0              150.0   \n",
       "451                2.0               6.0              500.0   \n",
       "969                2.0               5.0              150.0   \n",
       "\n",
       "       NumberOfInstancesWithMissingValues  NumberOfMissingValues  \\\n",
       "61                                    0.0                    0.0   \n",
       "41950                                 0.0                    0.0   \n",
       "42261                                 0.0                    0.0   \n",
       "451                                  32.0                   32.0   \n",
       "969                                   0.0                    0.0   \n",
       "\n",
       "       NumberOfNumericFeatures  NumberOfSymbolicFeatures  \n",
       "61                         4.0                       1.0  \n",
       "41950                      4.0                       1.0  \n",
       "42261                      4.0                       1.0  \n",
       "451                        2.0                       4.0  \n",
       "969                        4.0                       1.0  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Filtering dataset list to have 'iris' in the 'name' column\n",
    "# then sorting the list based on the 'version'\n",
    "d[d['name'].str.contains('iris')].sort_values(by='version').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, so the iris dataset with the version as 1 has an ID of **61**. For verification, we can check the [website for dataset ID 61](https://www.openml.org/d/61). We can see that it is the original Iris dataset which is of interest to us - 3 classes of 50 instances, with 4 numeric features. However, we shall retrieve the same information, as promised, programmatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenML Dataset\n",
       "==============\n",
       "Name..........: iris\n",
       "Version.......: 1\n",
       "Format........: ARFF\n",
       "Upload Date...: 2014-04-06 23:23:39\n",
       "Licence.......: Public\n",
       "Download URL..: https://www.openml.org/data/v1/download/61/iris.arff\n",
       "OpenML URL....: https://www.openml.org/d/61\n",
       "# of features.: 5\n",
       "# of instances: 150"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris = openml.datasets.get_dataset(61)\n",
    "iris"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{0: [0 - sepallength (numeric)],\n",
       " 1: [1 - sepalwidth (numeric)],\n",
       " 2: [2 - petallength (numeric)],\n",
       " 3: [3 - petalwidth (numeric)],\n",
       " 4: [4 - class (nominal)]}"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris.features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "**Author**: R.A. Fisher  \n",
      "**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Iris) - 1936 - Donated by Michael Marshall  \n",
      "**Please cite**:   \n",
      "\n",
      "**Iris Plants Database**  \n",
      "This is perhaps the best known database to be found in the pattern recognition literature.  Fisher's paper is a classic in the field and is referenced frequently to this day.  (See Duda & Hart, for example.)  The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.  One class is     linearly separable from the other 2; the latter are NOT linearly separable from each other.\n",
      "\n",
      "Predicted attribute: class of iris plant.  \n",
      "This is an exceedingly simple domain.  \n",
      " \n",
      "### Attribute Information:\n",
      "    1. sepal length in cm\n",
      "    2. sepal width in cm\n",
      "    3. petal length in cm\n",
      "    4. petal width in cm\n",
      "    5. class: \n",
      "       -- Iris Setosa\n",
      "       -- Iris Versicolour\n",
      "       -- Iris Virginica\n"
     ]
    }
   ],
   "source": [
    "print(iris.description)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the appropriate dataset available, let us briefly go back to the terminology we discussed earlier. We have only used the *dataset* component so far. The *dataset* component is closely tied to the *task* component. To reiterate, the task describes *how* the dataset will be used."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Retrieving relevant tasks from OpenML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We shall first list all available tasks that work with the Iris dataset. However, we are treating Iris only as a supervised classification problem and hence will filter accordingly. After that, we will collect only the IDs of the tasks relevant to us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tid</th>\n",
       "      <th>ttid</th>\n",
       "      <th>did</th>\n",
       "      <th>name</th>\n",
       "      <th>task_type</th>\n",
       "      <th>status</th>\n",
       "      <th>estimation_procedure</th>\n",
       "      <th>evaluation_measures</th>\n",
       "      <th>source_data</th>\n",
       "      <th>target_feature</th>\n",
       "      <th>...</th>\n",
       "      <th>NumberOfFeatures</th>\n",
       "      <th>NumberOfInstances</th>\n",
       "      <th>NumberOfInstancesWithMissingValues</th>\n",
       "      <th>NumberOfMissingValues</th>\n",
       "      <th>NumberOfNumericFeatures</th>\n",
       "      <th>NumberOfSymbolicFeatures</th>\n",
       "      <th>number_samples</th>\n",
       "      <th>cost_matrix</th>\n",
       "      <th>quality_measure</th>\n",
       "      <th>target_value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>59</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Supervised Classification</td>\n",
       "      <td>active</td>\n",
       "      <td>10-fold Crossvalidation</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>118</th>\n",
       "      <td>118</td>\n",
       "      <td>3</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Learning Curve</td>\n",
       "      <td>active</td>\n",
       "      <td>10 times 10-fold Learning Curve</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>289</th>\n",
       "      <td>289</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Supervised Classification</td>\n",
       "      <td>active</td>\n",
       "      <td>33% Holdout set</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1758</th>\n",
       "      <td>1758</td>\n",
       "      <td>3</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Learning Curve</td>\n",
       "      <td>active</td>\n",
       "      <td>10-fold Learning Curve</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1823</th>\n",
       "      <td>1823</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Supervised Classification</td>\n",
       "      <td>active</td>\n",
       "      <td>5 times 2-fold Crossvalidation</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       tid  ttid  did  name                  task_type  status  \\\n",
       "59      59     1   61  iris  Supervised Classification  active   \n",
       "118    118     3   61  iris             Learning Curve  active   \n",
       "289    289     1   61  iris  Supervised Classification  active   \n",
       "1758  1758     3   61  iris             Learning Curve  active   \n",
       "1823  1823     1   61  iris  Supervised Classification  active   \n",
       "\n",
       "                 estimation_procedure  evaluation_measures source_data  \\\n",
       "59            10-fold Crossvalidation  predictive_accuracy          61   \n",
       "118   10 times 10-fold Learning Curve  predictive_accuracy          61   \n",
       "289                   33% Holdout set  predictive_accuracy          61   \n",
       "1758           10-fold Learning Curve  predictive_accuracy          61   \n",
       "1823   5 times 2-fold Crossvalidation  predictive_accuracy          61   \n",
       "\n",
       "     target_feature  ...  NumberOfFeatures  NumberOfInstances  \\\n",
       "59            class  ...                 5                150   \n",
       "118           class  ...                 5                150   \n",
       "289           class  ...                 5                150   \n",
       "1758          class  ...                 5                150   \n",
       "1823          class  ...                 5                150   \n",
       "\n",
       "      NumberOfInstancesWithMissingValues  NumberOfMissingValues  \\\n",
       "59                                     0                      0   \n",
       "118                                    0                      0   \n",
       "289                                    0                      0   \n",
       "1758                                   0                      0   \n",
       "1823                                   0                      0   \n",
       "\n",
       "      NumberOfNumericFeatures  NumberOfSymbolicFeatures  number_samples  \\\n",
       "59                          4                         1             NaN   \n",
       "118                         4                         1               4   \n",
       "289                         4                         1             NaN   \n",
       "1758                        4                         1               4   \n",
       "1823                        4                         1             NaN   \n",
       "\n",
       "      cost_matrix  quality_measure  target_value  \n",
       "59            NaN              NaN           NaN  \n",
       "118           NaN              NaN           NaN  \n",
       "289           NaN              NaN           NaN  \n",
       "1758          NaN              NaN           NaN  \n",
       "1823          NaN              NaN           NaN  \n",
       "\n",
       "[5 rows x 24 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = openml.tasks.list_tasks(data_id=61, output_format='dataframe')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tid</th>\n",
       "      <th>ttid</th>\n",
       "      <th>did</th>\n",
       "      <th>name</th>\n",
       "      <th>task_type</th>\n",
       "      <th>status</th>\n",
       "      <th>estimation_procedure</th>\n",
       "      <th>evaluation_measures</th>\n",
       "      <th>source_data</th>\n",
       "      <th>target_feature</th>\n",
       "      <th>...</th>\n",
       "      <th>NumberOfFeatures</th>\n",
       "      <th>NumberOfInstances</th>\n",
       "      <th>NumberOfInstancesWithMissingValues</th>\n",
       "      <th>NumberOfMissingValues</th>\n",
       "      <th>NumberOfNumericFeatures</th>\n",
       "      <th>NumberOfSymbolicFeatures</th>\n",
       "      <th>number_samples</th>\n",
       "      <th>cost_matrix</th>\n",
       "      <th>quality_measure</th>\n",
       "      <th>target_value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>59</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Supervised Classification</td>\n",
       "      <td>active</td>\n",
       "      <td>10-fold Crossvalidation</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>289</th>\n",
       "      <td>289</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Supervised Classification</td>\n",
       "      <td>active</td>\n",
       "      <td>33% Holdout set</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1823</th>\n",
       "      <td>1823</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Supervised Classification</td>\n",
       "      <td>active</td>\n",
       "      <td>5 times 2-fold Crossvalidation</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1939</th>\n",
       "      <td>1939</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Supervised Classification</td>\n",
       "      <td>active</td>\n",
       "      <td>10 times 10-fold Crossvalidation</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1992</th>\n",
       "      <td>1992</td>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>Supervised Classification</td>\n",
       "      <td>active</td>\n",
       "      <td>Leave one out</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>61</td>\n",
       "      <td>class</td>\n",
       "      <td>...</td>\n",
       "      <td>5</td>\n",
       "      <td>150</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       tid  ttid  did  name                  task_type  status  \\\n",
       "59      59     1   61  iris  Supervised Classification  active   \n",
       "289    289     1   61  iris  Supervised Classification  active   \n",
       "1823  1823     1   61  iris  Supervised Classification  active   \n",
       "1939  1939     1   61  iris  Supervised Classification  active   \n",
       "1992  1992     1   61  iris  Supervised Classification  active   \n",
       "\n",
       "                  estimation_procedure  evaluation_measures source_data  \\\n",
       "59             10-fold Crossvalidation  predictive_accuracy          61   \n",
       "289                    33% Holdout set  predictive_accuracy          61   \n",
       "1823    5 times 2-fold Crossvalidation  predictive_accuracy          61   \n",
       "1939  10 times 10-fold Crossvalidation  predictive_accuracy          61   \n",
       "1992                     Leave one out  predictive_accuracy          61   \n",
       "\n",
       "     target_feature  ...  NumberOfFeatures  NumberOfInstances  \\\n",
       "59            class  ...                 5                150   \n",
       "289           class  ...                 5                150   \n",
       "1823          class  ...                 5                150   \n",
       "1939          class  ...                 5                150   \n",
       "1992          class  ...                 5                150   \n",
       "\n",
       "      NumberOfInstancesWithMissingValues  NumberOfMissingValues  \\\n",
       "59                                     0                      0   \n",
       "289                                    0                      0   \n",
       "1823                                   0                      0   \n",
       "1939                                   0                      0   \n",
       "1992                                   0                      0   \n",
       "\n",
       "      NumberOfNumericFeatures  NumberOfSymbolicFeatures  number_samples  \\\n",
       "59                          4                         1             NaN   \n",
       "289                         4                         1             NaN   \n",
       "1823                        4                         1             NaN   \n",
       "1939                        4                         1             NaN   \n",
       "1992                        4                         1             NaN   \n",
       "\n",
       "      cost_matrix  quality_measure  target_value  \n",
       "59            NaN              NaN           NaN  \n",
       "289           NaN              NaN           NaN  \n",
       "1823          NaN              NaN           NaN  \n",
       "1939          NaN              NaN           NaN  \n",
       "1992          NaN              NaN           NaN  \n",
       "\n",
       "[5 rows x 24 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Filtering only the Supervised Classification tasks on Iris\n",
    "df.query(\"task_type=='Supervised Classification'\").head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "13\n"
     ]
    }
   ],
   "source": [
    "# Collecting all relevant task_ids\n",
    "tasks = df.query(\"task_type=='Supervised Classification'\")['tid'].to_numpy()\n",
    "print(len(tasks))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That settles the *task* component too. Notice how for one *dataset* (61), we obtain 11 task IDs which are of interest to us. This should illustrate the *one-to-many* relationship that *dataset-task components* can have. We have 2 more components to explore - *flows*, *runs*. We could list out all possible flows and filter out the ones we want, i.e., Random Forest. However, let us instead fetch all the evaluations made on the Iris dataset using the 11 tasks we collected above.\n",
    "\n",
    "We shall subsequently work with the scikit-learn based task which has been uploaded/used the most. We shall then further filter out the list of evaluations from the selected task (task_id=59 in this case), depending on if Random Forest was used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>run_id</th>\n",
       "      <th>task_id</th>\n",
       "      <th>setup_id</th>\n",
       "      <th>flow_id</th>\n",
       "      <th>flow_name</th>\n",
       "      <th>data_id</th>\n",
       "      <th>data_name</th>\n",
       "      <th>function</th>\n",
       "      <th>upload_time</th>\n",
       "      <th>uploader</th>\n",
       "      <th>uploader_name</th>\n",
       "      <th>value</th>\n",
       "      <th>values</th>\n",
       "      <th>array_data</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>81</td>\n",
       "      <td>59</td>\n",
       "      <td>12</td>\n",
       "      <td>67</td>\n",
       "      <td>weka.BayesNet_K2(1)</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2014-04-07 00:05:11</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.940000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>161</td>\n",
       "      <td>59</td>\n",
       "      <td>13</td>\n",
       "      <td>70</td>\n",
       "      <td>weka.SMO_PolyKernel(1)</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2014-04-07 00:55:32</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>234</td>\n",
       "      <td>59</td>\n",
       "      <td>1</td>\n",
       "      <td>56</td>\n",
       "      <td>weka.ZeroR(1)</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2014-04-07 01:33:24</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.333333</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>447</td>\n",
       "      <td>59</td>\n",
       "      <td>6</td>\n",
       "      <td>61</td>\n",
       "      <td>weka.REPTree(1)</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2014-04-07 06:26:27</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.926667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>473</td>\n",
       "      <td>59</td>\n",
       "      <td>18</td>\n",
       "      <td>77</td>\n",
       "      <td>weka.LogitBoost_DecisionStump(1)</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2014-04-07 06:39:27</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.946667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   run_id  task_id  setup_id  flow_id                         flow_name  \\\n",
       "0      81       59        12       67               weka.BayesNet_K2(1)   \n",
       "1     161       59        13       70            weka.SMO_PolyKernel(1)   \n",
       "2     234       59         1       56                     weka.ZeroR(1)   \n",
       "3     447       59         6       61                   weka.REPTree(1)   \n",
       "4     473       59        18       77  weka.LogitBoost_DecisionStump(1)   \n",
       "\n",
       "   data_id data_name             function          upload_time  uploader  \\\n",
       "0       61      iris  predictive_accuracy  2014-04-07 00:05:11         1   \n",
       "1       61      iris  predictive_accuracy  2014-04-07 00:55:32         1   \n",
       "2       61      iris  predictive_accuracy  2014-04-07 01:33:24         1   \n",
       "3       61      iris  predictive_accuracy  2014-04-07 06:26:27         1   \n",
       "4       61      iris  predictive_accuracy  2014-04-07 06:39:27         1   \n",
       "\n",
       "          uploader_name     value values array_data  \n",
       "0  janvanrijn@gmail.com  0.940000   None       None  \n",
       "1  janvanrijn@gmail.com  0.960000   None       None  \n",
       "2  janvanrijn@gmail.com  0.333333   None       None  \n",
       "3  janvanrijn@gmail.com  0.926667   None       None  \n",
       "4  janvanrijn@gmail.com  0.946667   None       None  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Listing all evaluations made on the 11 tasks collected above\n",
    "# with evaluation metric as 'predictive_accuracy'\n",
    "task_df = openml.evaluations.list_evaluations(function='predictive_accuracy', task=tasks, output_format='dataframe')\n",
    "task_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>run_id</th>\n",
       "      <th>task_id</th>\n",
       "      <th>setup_id</th>\n",
       "      <th>flow_id</th>\n",
       "      <th>flow_name</th>\n",
       "      <th>data_id</th>\n",
       "      <th>data_name</th>\n",
       "      <th>function</th>\n",
       "      <th>upload_time</th>\n",
       "      <th>uploader</th>\n",
       "      <th>uploader_name</th>\n",
       "      <th>value</th>\n",
       "      <th>values</th>\n",
       "      <th>array_data</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>1849043</td>\n",
       "      <td>59</td>\n",
       "      <td>29015</td>\n",
       "      <td>5500</td>\n",
       "      <td>sklearn.ensemble.forest.RandomForestClassifier...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-03-03 17:10:12</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.946667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>1853409</td>\n",
       "      <td>59</td>\n",
       "      <td>30950</td>\n",
       "      <td>5873</td>\n",
       "      <td>sklearn.pipeline.Pipeline(Imputer=openml.utils...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-03-21 22:08:01</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>6130126</td>\n",
       "      <td>59</td>\n",
       "      <td>4163633</td>\n",
       "      <td>7108</td>\n",
       "      <td>sklearn.model_selection._search.RandomizedSear...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-08-21 11:07:40</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>6130128</td>\n",
       "      <td>59</td>\n",
       "      <td>4163634</td>\n",
       "      <td>7108</td>\n",
       "      <td>sklearn.model_selection._search.RandomizedSear...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-08-21 11:08:06</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.946667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>148</th>\n",
       "      <td>6715383</td>\n",
       "      <td>59</td>\n",
       "      <td>4747289</td>\n",
       "      <td>7117</td>\n",
       "      <td>sklearn.model_selection._search.RandomizedSear...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-09-01 02:56:44</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      run_id  task_id  setup_id  flow_id  \\\n",
       "144  1849043       59     29015     5500   \n",
       "145  1853409       59     30950     5873   \n",
       "146  6130126       59   4163633     7108   \n",
       "147  6130128       59   4163634     7108   \n",
       "148  6715383       59   4747289     7117   \n",
       "\n",
       "                                             flow_name  data_id data_name  \\\n",
       "144  sklearn.ensemble.forest.RandomForestClassifier...       61      iris   \n",
       "145  sklearn.pipeline.Pipeline(Imputer=openml.utils...       61      iris   \n",
       "146  sklearn.model_selection._search.RandomizedSear...       61      iris   \n",
       "147  sklearn.model_selection._search.RandomizedSear...       61      iris   \n",
       "148  sklearn.model_selection._search.RandomizedSear...       61      iris   \n",
       "\n",
       "                function          upload_time  uploader         uploader_name  \\\n",
       "144  predictive_accuracy  2017-03-03 17:10:12         1  janvanrijn@gmail.com   \n",
       "145  predictive_accuracy  2017-03-21 22:08:01         1  janvanrijn@gmail.com   \n",
       "146  predictive_accuracy  2017-08-21 11:07:40         1  janvanrijn@gmail.com   \n",
       "147  predictive_accuracy  2017-08-21 11:08:06         1  janvanrijn@gmail.com   \n",
       "148  predictive_accuracy  2017-09-01 02:56:44         1  janvanrijn@gmail.com   \n",
       "\n",
       "        value values array_data  \n",
       "144  0.946667   None       None  \n",
       "145  0.960000   None       None  \n",
       "146  0.960000   None       None  \n",
       "147  0.946667   None       None  \n",
       "148  0.960000   None       None  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Filtering based on sklearn (scikit-learn)\n",
    "task_df = task_df[task_df['flow_name'].str.contains(\"sklearn\")]\n",
    "task_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "59       1985\n",
       "10107      25\n",
       "289         1\n",
       "Name: task_id, dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Counting frequency of the different tasks used to\n",
    "# solve Iris as a supervised classification using scikit-learn\n",
    "task_df['task_id'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenML Classification Task\n",
       "==========================\n",
       "Task Type Description: https://www.openml.org/tt/1\n",
       "Task ID..............: 59\n",
       "Task URL.............: https://www.openml.org/t/59\n",
       "Estimation Procedure.: crossvalidation\n",
       "Evaluation Measure...: predictive_accuracy\n",
       "Target Feature.......: class\n",
       "# of Classes.........: 3\n",
       "Cost Matrix..........: Available"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Retrieving the most used task\n",
    "t = openml.tasks.get_task(59)\n",
    "t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filtering for only task_id=59\n",
    "task_df = task_df.query(\"task_id==59\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>run_id</th>\n",
       "      <th>task_id</th>\n",
       "      <th>setup_id</th>\n",
       "      <th>flow_id</th>\n",
       "      <th>flow_name</th>\n",
       "      <th>data_id</th>\n",
       "      <th>data_name</th>\n",
       "      <th>function</th>\n",
       "      <th>upload_time</th>\n",
       "      <th>uploader</th>\n",
       "      <th>uploader_name</th>\n",
       "      <th>value</th>\n",
       "      <th>values</th>\n",
       "      <th>array_data</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>1849043</td>\n",
       "      <td>59</td>\n",
       "      <td>29015</td>\n",
       "      <td>5500</td>\n",
       "      <td>sklearn.ensemble.forest.RandomForestClassifier...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-03-03 17:10:12</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.946667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>1853409</td>\n",
       "      <td>59</td>\n",
       "      <td>30950</td>\n",
       "      <td>5873</td>\n",
       "      <td>sklearn.pipeline.Pipeline(Imputer=openml.utils...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-03-21 22:08:01</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>6130126</td>\n",
       "      <td>59</td>\n",
       "      <td>4163633</td>\n",
       "      <td>7108</td>\n",
       "      <td>sklearn.model_selection._search.RandomizedSear...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-08-21 11:07:40</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>6130128</td>\n",
       "      <td>59</td>\n",
       "      <td>4163634</td>\n",
       "      <td>7108</td>\n",
       "      <td>sklearn.model_selection._search.RandomizedSear...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-08-21 11:08:06</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.946667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>190</th>\n",
       "      <td>6946499</td>\n",
       "      <td>59</td>\n",
       "      <td>4978397</td>\n",
       "      <td>7109</td>\n",
       "      <td>sklearn.pipeline.Pipeline(imputation=openmlstu...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-09-02 22:06:32</td>\n",
       "      <td>1</td>\n",
       "      <td>janvanrijn@gmail.com</td>\n",
       "      <td>0.920000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      run_id  task_id  setup_id  flow_id  \\\n",
       "144  1849043       59     29015     5500   \n",
       "145  1853409       59     30950     5873   \n",
       "146  6130126       59   4163633     7108   \n",
       "147  6130128       59   4163634     7108   \n",
       "190  6946499       59   4978397     7109   \n",
       "\n",
       "                                             flow_name  data_id data_name  \\\n",
       "144  sklearn.ensemble.forest.RandomForestClassifier...       61      iris   \n",
       "145  sklearn.pipeline.Pipeline(Imputer=openml.utils...       61      iris   \n",
       "146  sklearn.model_selection._search.RandomizedSear...       61      iris   \n",
       "147  sklearn.model_selection._search.RandomizedSear...       61      iris   \n",
       "190  sklearn.pipeline.Pipeline(imputation=openmlstu...       61      iris   \n",
       "\n",
       "                function          upload_time  uploader         uploader_name  \\\n",
       "144  predictive_accuracy  2017-03-03 17:10:12         1  janvanrijn@gmail.com   \n",
       "145  predictive_accuracy  2017-03-21 22:08:01         1  janvanrijn@gmail.com   \n",
       "146  predictive_accuracy  2017-08-21 11:07:40         1  janvanrijn@gmail.com   \n",
       "147  predictive_accuracy  2017-08-21 11:08:06         1  janvanrijn@gmail.com   \n",
       "190  predictive_accuracy  2017-09-02 22:06:32         1  janvanrijn@gmail.com   \n",
       "\n",
       "        value values array_data  \n",
       "144  0.946667   None       None  \n",
       "145  0.960000   None       None  \n",
       "146  0.960000   None       None  \n",
       "147  0.946667   None       None  \n",
       "190  0.920000   None       None  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Filtering based on Random Forest\n",
    "task_rf =  task_df[task_df['flow_name'].str.contains(\"RandomForest\")]\n",
    "task_rf.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Retrieving top-performing models from OpenML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we are an ambitious bunch of ML practitioners who settle for nothing but the best, and also since most results will not be considered worth the effort if not matching or beating *state-of-the-art*, we shall aim for the best scores. We'll sort the filtered results we obtained based on the score or '*value*' and then extract the components from that run - *task* and *flow*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>run_id</th>\n",
       "      <th>task_id</th>\n",
       "      <th>setup_id</th>\n",
       "      <th>flow_id</th>\n",
       "      <th>flow_name</th>\n",
       "      <th>data_id</th>\n",
       "      <th>data_name</th>\n",
       "      <th>function</th>\n",
       "      <th>upload_time</th>\n",
       "      <th>uploader</th>\n",
       "      <th>uploader_name</th>\n",
       "      <th>value</th>\n",
       "      <th>values</th>\n",
       "      <th>array_data</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3549</th>\n",
       "      <td>523926</td>\n",
       "      <td>59</td>\n",
       "      <td>3526</td>\n",
       "      <td>2629</td>\n",
       "      <td>sklearn.ensemble.forest.RandomForestClassifier(8)</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2016-02-11 22:05:23</td>\n",
       "      <td>869</td>\n",
       "      <td>p.gijsbers@student.tue.nl</td>\n",
       "      <td>0.966667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4353</th>\n",
       "      <td>8955370</td>\n",
       "      <td>59</td>\n",
       "      <td>6890988</td>\n",
       "      <td>7257</td>\n",
       "      <td>sklearn.ensemble.forest.RandomForestClassifier...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2018-04-06 16:32:22</td>\n",
       "      <td>3964</td>\n",
       "      <td>clear.tsai@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3587</th>\n",
       "      <td>1852682</td>\n",
       "      <td>59</td>\n",
       "      <td>29263</td>\n",
       "      <td>5500</td>\n",
       "      <td>sklearn.ensemble.forest.RandomForestClassifier...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-03-15 22:55:18</td>\n",
       "      <td>1022</td>\n",
       "      <td>rso@randalolson.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4375</th>\n",
       "      <td>8886608</td>\n",
       "      <td>59</td>\n",
       "      <td>6835139</td>\n",
       "      <td>7961</td>\n",
       "      <td>sklearn.pipeline.Pipeline(Imputer=sklearn.prep...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2018-03-17 16:46:27</td>\n",
       "      <td>5032</td>\n",
       "      <td>rashmi.kamath01@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3107</th>\n",
       "      <td>1843272</td>\n",
       "      <td>59</td>\n",
       "      <td>24071</td>\n",
       "      <td>4830</td>\n",
       "      <td>sklearn.ensemble.forest.RandomForestClassifier...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2016-12-08 20:10:03</td>\n",
       "      <td>2</td>\n",
       "      <td>joaquin.vanschoren@gmail.com</td>\n",
       "      <td>0.960000</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       run_id  task_id  setup_id  flow_id  \\\n",
       "3549   523926       59      3526     2629   \n",
       "4353  8955370       59   6890988     7257   \n",
       "3587  1852682       59     29263     5500   \n",
       "4375  8886608       59   6835139     7961   \n",
       "3107  1843272       59     24071     4830   \n",
       "\n",
       "                                              flow_name  data_id data_name  \\\n",
       "3549  sklearn.ensemble.forest.RandomForestClassifier(8)       61      iris   \n",
       "4353  sklearn.ensemble.forest.RandomForestClassifier...       61      iris   \n",
       "3587  sklearn.ensemble.forest.RandomForestClassifier...       61      iris   \n",
       "4375  sklearn.pipeline.Pipeline(Imputer=sklearn.prep...       61      iris   \n",
       "3107  sklearn.ensemble.forest.RandomForestClassifier...       61      iris   \n",
       "\n",
       "                 function          upload_time  uploader  \\\n",
       "3549  predictive_accuracy  2016-02-11 22:05:23       869   \n",
       "4353  predictive_accuracy  2018-04-06 16:32:22      3964   \n",
       "3587  predictive_accuracy  2017-03-15 22:55:18      1022   \n",
       "4375  predictive_accuracy  2018-03-17 16:46:27      5032   \n",
       "3107  predictive_accuracy  2016-12-08 20:10:03         2   \n",
       "\n",
       "                     uploader_name     value values array_data  \n",
       "3549     p.gijsbers@student.tue.nl  0.966667   None       None  \n",
       "4353          clear.tsai@gmail.com  0.960000   None       None  \n",
       "3587           rso@randalolson.com  0.960000   None       None  \n",
       "4375     rashmi.kamath01@gmail.com  0.960000   None       None  \n",
       "3107  joaquin.vanschoren@gmail.com  0.960000   None       None  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "task_rf.sort_values(by='value', ascending=False).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenML Flow\n",
       "===========\n",
       "Flow ID.........: 2629 (version 8)\n",
       "Flow URL........: https://www.openml.org/f/2629\n",
       "Flow Name.......: sklearn.ensemble.forest.RandomForestClassifier\n",
       "Flow Description: Flow generated by openml_run\n",
       "Upload Date.....: 2016-02-11 21:17:08\n",
       "Dependencies....: None"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fetching the Random Forest flow with the best score\n",
    "f = openml.flows.get_flow(2629)\n",
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenML Run\n",
       "==========\n",
       "Uploader Name...: Pieter Gijsbers\n",
       "Uploader Profile: https://www.openml.org/u/869\n",
       "Metric..........: predictive_accuracy\n",
       "Result..........: 0.966667\n",
       "Run ID..........: 523926\n",
       "Run URL.........: https://www.openml.org/r/523926\n",
       "Task ID.........: 59\n",
       "Task Type.......: Supervised Classification\n",
       "Task URL........: https://www.openml.org/t/59\n",
       "Flow ID.........: 2629\n",
       "Flow Name.......: sklearn.ensemble.forest.RandomForestClassifier(8)\n",
       "Flow URL........: https://www.openml.org/f/2629\n",
       "Setup ID........: 3526\n",
       "Setup String....: None\n",
       "Dataset ID......: 61\n",
       "Dataset URL.....: https://www.openml.org/d/61"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fetching the run with the best score for\n",
    "# Random Forest on Iris\n",
    "r = openml.runs.get_run(523926)\n",
    "r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, let's take a pause and re-assess. From multiple users across the globe, who had uploaded runs to OpenML, for a Random Forest run on the Iris, the best score seen till now is **96.67%**. That is certainly better than the naive model we built at the beginning to achieve **95.33%**. We had used a basic 10-fold cross-validation to evaluate a Random Forest of 10 trees with a max depth of 2. Let us see, what the best run uses and if it differs from our approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'predictive_accuracy'"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The scoring metric used\n",
    "t.evaluation_measure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'type': 'crossvalidation',\n",
       " 'parameters': {'number_repeats': '1',\n",
       "  'number_folds': '10',\n",
       "  'percentage': '',\n",
       "  'stratified_sampling': 'true'},\n",
       " 'data_splits_url': 'https://www.openml.org/api_splits/get/59/Task_59_splits.arff'}"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The methodology used for estimations\n",
    "t.estimation_procedure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'sklearn.ensemble.forest.RandomForestClassifier'"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The model used\n",
    "f.name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "warm_start                : False     \n",
      "oob_score                 : False     \n",
      "n_jobs                    : 1         \n",
      "verbose                   : 0         \n",
      "max_leaf_nodes            : None      \n",
      "bootstrap                 : True      \n",
      "min_samples_leaf          : 1         \n",
      "n_estimators              : 10        \n",
      "min_samples_split         : 2         \n",
      "min_weight_fraction_leaf  : 0.0       \n",
      "criterion                 : gini      \n",
      "random_state              : None      \n",
      "max_features              : auto      \n",
      "max_depth                 : None      \n",
      "class_weight              : None      \n"
     ]
    }
   ],
   "source": [
    "# The model parameters\n",
    "for param in r.parameter_settings:\n",
    "    name, value = param['oml:name'], param['oml:value']\n",
    "    print(\"{:<25} : {:<10}\".format(name, value))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As evident, our initial approach is different on two fronts. We didn't explicitly use stratified sampling for our cross-validation. While the Random Forest hyperparameters are slightly different too (*max_depth=None*). That definitely sounds like a *to-do*, however, there is no reason why we should restrict ourselves to Random Forests. Remember, we are aiming *big* here. Given the [number of OpenML users](https://www.openml.org/search?type=user), there must be somebody who got a better score on Iris with some other model. Let us then retrieve that information. Programmatically, of course.\n",
    "\n",
    "In summary, we are now going to sort the performance of all scikit-learn based models on Iris dataset as per the task definition with *task_id=59*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>run_id</th>\n",
       "      <th>task_id</th>\n",
       "      <th>setup_id</th>\n",
       "      <th>flow_id</th>\n",
       "      <th>flow_name</th>\n",
       "      <th>data_id</th>\n",
       "      <th>data_name</th>\n",
       "      <th>function</th>\n",
       "      <th>upload_time</th>\n",
       "      <th>uploader</th>\n",
       "      <th>uploader_name</th>\n",
       "      <th>value</th>\n",
       "      <th>values</th>\n",
       "      <th>array_data</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3630</th>\n",
       "      <td>2039748</td>\n",
       "      <td>59</td>\n",
       "      <td>180922</td>\n",
       "      <td>6048</td>\n",
       "      <td>sklearn.pipeline.Pipeline(dualimputer=helper.d...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-04-09 01:09:01</td>\n",
       "      <td>1104</td>\n",
       "      <td>jmapvhoof@gmail.com</td>\n",
       "      <td>0.986667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3631</th>\n",
       "      <td>2039750</td>\n",
       "      <td>59</td>\n",
       "      <td>180924</td>\n",
       "      <td>6048</td>\n",
       "      <td>sklearn.pipeline.Pipeline(dualimputer=helper.d...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-04-09 01:17:39</td>\n",
       "      <td>1104</td>\n",
       "      <td>jmapvhoof@gmail.com</td>\n",
       "      <td>0.986667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3624</th>\n",
       "      <td>2012939</td>\n",
       "      <td>59</td>\n",
       "      <td>157622</td>\n",
       "      <td>6048</td>\n",
       "      <td>sklearn.pipeline.Pipeline(dualimputer=helper.d...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-04-06 23:29:28</td>\n",
       "      <td>1104</td>\n",
       "      <td>jmapvhoof@gmail.com</td>\n",
       "      <td>0.986667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3618</th>\n",
       "      <td>2012930</td>\n",
       "      <td>59</td>\n",
       "      <td>157613</td>\n",
       "      <td>6048</td>\n",
       "      <td>sklearn.pipeline.Pipeline(dualimputer=helper.d...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-04-06 23:00:24</td>\n",
       "      <td>1104</td>\n",
       "      <td>jmapvhoof@gmail.com</td>\n",
       "      <td>0.986667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3626</th>\n",
       "      <td>2012941</td>\n",
       "      <td>59</td>\n",
       "      <td>157624</td>\n",
       "      <td>6048</td>\n",
       "      <td>sklearn.pipeline.Pipeline(dualimputer=helper.d...</td>\n",
       "      <td>61</td>\n",
       "      <td>iris</td>\n",
       "      <td>predictive_accuracy</td>\n",
       "      <td>2017-04-07 01:36:00</td>\n",
       "      <td>1104</td>\n",
       "      <td>jmapvhoof@gmail.com</td>\n",
       "      <td>0.986667</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       run_id  task_id  setup_id  flow_id  \\\n",
       "3630  2039748       59    180922     6048   \n",
       "3631  2039750       59    180924     6048   \n",
       "3624  2012939       59    157622     6048   \n",
       "3618  2012930       59    157613     6048   \n",
       "3626  2012941       59    157624     6048   \n",
       "\n",
       "                                              flow_name  data_id data_name  \\\n",
       "3630  sklearn.pipeline.Pipeline(dualimputer=helper.d...       61      iris   \n",
       "3631  sklearn.pipeline.Pipeline(dualimputer=helper.d...       61      iris   \n",
       "3624  sklearn.pipeline.Pipeline(dualimputer=helper.d...       61      iris   \n",
       "3618  sklearn.pipeline.Pipeline(dualimputer=helper.d...       61      iris   \n",
       "3626  sklearn.pipeline.Pipeline(dualimputer=helper.d...       61      iris   \n",
       "\n",
       "                 function          upload_time  uploader        uploader_name  \\\n",
       "3630  predictive_accuracy  2017-04-09 01:09:01      1104  jmapvhoof@gmail.com   \n",
       "3631  predictive_accuracy  2017-04-09 01:17:39      1104  jmapvhoof@gmail.com   \n",
       "3624  predictive_accuracy  2017-04-06 23:29:28      1104  jmapvhoof@gmail.com   \n",
       "3618  predictive_accuracy  2017-04-06 23:00:24      1104  jmapvhoof@gmail.com   \n",
       "3626  predictive_accuracy  2017-04-07 01:36:00      1104  jmapvhoof@gmail.com   \n",
       "\n",
       "         value values array_data  \n",
       "3630  0.986667   None       None  \n",
       "3631  0.986667   None       None  \n",
       "3624  0.986667   None       None  \n",
       "3618  0.986667   None       None  \n",
       "3626  0.986667   None       None  "
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fetching top performances\n",
    "task_df.sort_values(by='value', ascending=False).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenML Flow\n",
       "===========\n",
       "Flow ID.........: 6048 (version 1)\n",
       "Flow URL........: https://www.openml.org/f/6048\n",
       "Flow Name.......: sklearn.pipeline.Pipeline(dualimputer=helper.dual_imputer.DualImputer,nusvc=sklearn.svm.classes.NuSVC)\n",
       "Flow Description: Automatically created scikit-learn flow.\n",
       "Upload Date.....: 2017-04-06 22:42:59\n",
       "Dependencies....: sklearn==0.18.1\n",
       "numpy>=1.6.1\n",
       "scipy>=0.9"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fetching best performing flow\n",
    "f = openml.flows.get_flow(6048)\n",
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "steps                     : [('DualImputer', <helper.dual_imputer.DualImputer object at 0x7ff618e4d908>), ('nusvc', NuSVC(cache_size=200, class_weight=None, coef0=0.0,\n",
      "   decision_function_shape=None, degree=3, gamma='auto', kernel='linear',\n",
      "   max_iter=-1, nu=0.3, probability=True, random_state=3, shrinking=True,\n",
      "   tol=3.2419092644286417e-05, verbose=False))]\n",
      "cache_size                : 200       \n",
      "class_weight              : None      \n",
      "coef0                     : 0.0       \n",
      "decision_function_shape   : None      \n",
      "degree                    : 3         \n",
      "gamma                     : auto      \n",
      "kernel                    : linear    \n",
      "max_iter                  : -1        \n",
      "nu                        : 0.3       \n",
      "probability               : True      \n",
      "random_state              : 3         \n",
      "shrinking                 : True      \n",
      "tol                       : 3.24190926443e-05\n",
      "verbose                   : False     \n"
     ]
    }
   ],
   "source": [
    "# Fetching best performing run\n",
    "r = openml.runs.get_run(2012943)\n",
    "\n",
    "# The model parameters\n",
    "for param in r.parameter_settings:\n",
    "    name, value = param['oml:name'], param['oml:value']\n",
    "    print(\"{:<25} : {:<10}\".format(name, value))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The highest score obtained among the uploaded results is **98.67%** using a [variant of SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC). However, if we check the corresponding flow description, we see that it is using an old scikit-learn version (0.18.1) and therefore may not be possible to replicate the exact results. However, in order to improve from our score of 95.33%, we should try running a *nu-SVC* on the same problem and see where we stand. Let's go for it. Via OpenML, of course."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Running best performing flow on the required task"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openml\n",
    "import numpy as np\n",
    "from sklearn.svm import NuSVC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Building the NuSVC model object with parameters found\n",
    "clf = NuSVC(cache_size=200, class_weight=None, coef0=0.0,\n",
    "   decision_function_shape=None, degree=3, gamma='auto', kernel='linear',\n",
    "   max_iter=-1, nu=0.3, probability=True, random_state=3, shrinking=True,\n",
    "   tol=3.2419092644286417e-05, verbose=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenML Classification Task\n",
       "==========================\n",
       "Task Type Description: https://www.openml.org/tt/1\n",
       "Task ID..............: 59\n",
       "Task URL.............: https://www.openml.org/t/59\n",
       "Estimation Procedure.: crossvalidation\n",
       "Evaluation Measure...: predictive_accuracy\n",
       "Target Feature.......: class\n",
       "# of Classes.........: 3\n",
       "Cost Matrix..........: Available"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Obtaining task used earlier\n",
    "t = openml.tasks.get_task(59)\n",
    "t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenML Flow\n",
       "===========\n",
       "Flow Name.......: sklearn.svm.classes.NuSVC\n",
       "Flow Description: Nu-Support Vector Classification.\n",
       "\n",
       "Similar to SVC but uses a parameter to control the number of support\n",
       "vectors.\n",
       "\n",
       "The implementation is based on libsvm.\n",
       "Dependencies....: sklearn==0.21.3\n",
       "numpy>=1.6.1\n",
       "scipy>=0.9"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Running the model on the task\n",
    "# Internally, the model will be made into \n",
    "# an OpenML flow and we can choose to retrieve it\n",
    "r, f = openml.runs.run_model_on_task(model=clf, task=t, upload_flow=False, return_flow=True)\n",
    "f"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9866666666666667\n"
     ]
    }
   ],
   "source": [
    "# To obtain the score (without uploading)\n",
    "## r.publish() can be used to upload these results\n",
    "## need to sign-in to https://www.openml.org/\n",
    "score = []\n",
    "evaluations = r.fold_evaluations['predictive_accuracy'][0]\n",
    "for key in evaluations:\n",
    "    score.append(evaluations[key])\n",
    "print(np.mean(score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lo and behold! We hit the magic number. I personally would have never tried out NuSVC and would have stuck around tweaking hyperparameters of the Random Forest. This is a new discovery of sorts for sure. I wonder though if anybody has tried XGBoost on Iris?\n",
    "\n",
    "In any case, we can now upload the results of this run to OpenML using:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OpenML Run\n",
       "==========\n",
       "Uploader Name: None\n",
       "Metric.......: None\n",
       "Run ID.......: 10464835\n",
       "Run URL......: https://www.openml.org/r/10464835\n",
       "Task ID......: 59\n",
       "Task Type....: None\n",
       "Task URL.....: https://www.openml.org/t/59\n",
       "Flow ID......: 18579\n",
       "Flow Name....: sklearn.svm.classes.NuSVC\n",
       "Flow URL.....: https://www.openml.org/f/18579\n",
       "Setup ID.....: None\n",
       "Setup String.: Python_3.6.9. Sklearn_0.21.3. NumPy_1.16.4. SciPy_1.4.1. NuSVC(cache_size=200, class_weight=None, coef0=0.0,\n",
       "      decision_function_shape=None, degree=3, gamma='auto', kernel='linear',\n",
       "      max_iter=-1, nu=0.3, probability=True, random_state=3, shrinking=True,\n",
       "      tol=3.241909264428642e-05, verbose=False)\n",
       "Dataset ID...: 61\n",
       "Dataset URL..: https://www.openml.org/d/61"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r.publish()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One would need to sign-in to https://www.openml.org/ and generate their respective *apikey*. The results would then be available for everyone to view and who knows, you can have your name against the *best-ever* performance measured on the Iris dataset!"
   ]
  },
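  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of that configuration step, assuming the key is set for the current session through `openml.config.apikey` before `r.publish()` is called. The key string shown is a placeholder, not a real key, and the snippet is not executed in this post:\n",
    "\n",
    "```python\n",
    "import openml\n",
    "\n",
    "# Placeholder value: replace with the API key generated for your own OpenML account\n",
    "openml.config.apikey = \"YOUR_API_KEY\"\n",
    "```\n",
    "\n",
    "Setting the key in code keeps the example self-contained, but it is best kept out of notebooks that are shared publicly."
   ]
  },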
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This post was in no ways intended to be a be-all-end-all guide to OpenML. The primary goal was to help form an acquaintance with the OpenML terminologies, introduce the API, establish connections with the general ML practices, and give a sneak-peek into the potential benefits of working together as a *community*. For a better understanding of OpenML, please explore the [documentation](https://openml.github.io/openml-python/develop/usage.html#usage). If one desires to continue from the examples given in this post and explore further, kindly refer to the [API](https://openml.github.io/openml-python/develop/api.html).\n",
    "\n",
    "OpenML-Python is an open-source project and contributions from everyone in the form of Issues and Pull Requests are most welcome. Contribution to the OpenML community is in fact not limited to code contribution. Every single user can make the community richer by sharing data, experiments, results, using OpenML.\n",
    "\n",
    "As ML practitioners, we may be dependent on tools for our tasks. However, as a collective, we can juice out its potential to a larger extent. Let us together, make ML more transparent, more democratic!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Special thanks to Heidi, Bilge, Sahithya, Matthias, Ashwin for the ideas, feedback, and support."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Related readings:\n",
    "* [To get started with OpenML-Python](https://openml.github.io/openml-python/develop/)\n",
    "* [OpenML-Python Github](https://github.com/openml/openml-python)\n",
    "* [The OpenML website](https://www.openml.org/)\n",
    "* [Miscellaneous reading on OpenML](https://openml.github.io/blog/)\n",
    "* [To get in touch!](https://www.openml.org/contact)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}