{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Wrapping New Individual Operators\n",
    "\n",
    "Lale comes with several library operators, so you do not need to write\n",
    "your own. But if you want to contribute new operators, this tutorial is\n",
    "for you.  First let us review some basic concepts in Lale from the\n",
    "point of view of adding new operators (estimators and\n",
    "transformers). Lale is a library for semi-automated data science, and\n",
    "is designed for the following goals:\n",
    "\n",
    "* Automation: easy search and tuning of pipelines\n",
    "* Usability: scikit-learn compatible, plus types\n",
    "* Interoperability: support for Python building blocks and beyond\n",
    "\n",
    "To enable the above properties for your operators with Lale, you need to:\n",
    "\n",
    "1. Write an operator implementation class with methods `__init__`,\n",
    "   `fit`, and `predict` or `transform`. If you have a custom estimator\n",
    "   or transformer as per scikit-learn, you can skip this step as that\n",
    "   is already a valid Lale operator.\n",
    "2. Register the operator implementation from Step 1 via `lale.operators.make_operator`. \n",
    "   This step automatically creates a JSON schema skeleton for your operator.\n",
    "3. Customize the hyperparameter JSON schema to indicate what\n",
    "   hyperparameters are expected by an operator, to specify the types,\n",
    "   default values, and recommended minimum/maximum values for\n",
    "   automatic tuning. The hyperparameter schema can also encode\n",
    "   constraints indicating dependencies between hyperparameter values\n",
    "   such as solver `abc` only supports penalty `xyz`.\n",
    "4. Optionally, customize the schemas for input and output datasets, \n",
    "   and enable transparent access to custom methods/properties of your implementation.\n",
    "5. Test and use the new operator, for instance, for training or\n",
    "   hyperparameter optimization.\n",
    "6. Consider contributing your new operator to the Lale open-source\n",
    "   project.\n",
    "\n",
    "The next sections illustrate these five steps using an example.  After\n",
    "the example-driven sections, this document concludes with a reference\n",
    "covering features from the example and beyond.  This document focuses\n",
    "on individual operators. Pipelines that compose\n",
    "multiple operators are documented elsewhere."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Create a New Operator\n",
    "\n",
    "This section can be skipped if you already have a scikit-learn\n",
    "compatible estimator or transformer class with methods `__init__`,\n",
    "`fit`, and `predict` or `transform`. Any other compatibility with\n",
    "scikit-learn such as `get_params` or `set_params` is optional, and so\n",
    "is extending from `sklearn.base.BaseEstimator`.\n",
    "\n",
    "This section illustrates how to implement this class with the help of\n",
    "an example. The running example in this document is a simple custom\n",
    "operator that just wraps the `LogisticRegression` estimator from\n",
    "scikit-learn. Of course you can write a similar class to wrap your own\n",
    "operators, which do not need to come from scikit-learn.  The following\n",
    "code defines a class `MyLRImpl`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sklearn.linear_model\n",
    "\n",
    "class _MyLRImpl:\n",
    "    def __init__(self, **hyperparams):\n",
    "        self._wrapped_model = sklearn.linear_model.LogisticRegression(\n",
    "            **hyperparams)\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        self._wrapped_model.fit(X, y)\n",
    "        return self\n",
    "\n",
    "    def predict(self, X, **kwargs):\n",
    "        return self._wrapped_model.predict(X, **kwargs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code first imports the relevant scikit-learn package. Then, it declares\n",
    "a new class for wrapping it. Currently, Lale only supports Python, but\n",
    "eventually, it will also support other programming languages. Therefore, the\n",
    "Lale approach for wrapping new operators carefully avoids depending too much\n",
    "on the Python language or any particular Python library. Hence, the\n",
    "`MyLRImpl` class does not need to inherit from anything, but it does need to\n",
    "follow certain conventions:\n",
    "\n",
    "* It has a constructor, `__init__`, whose arguments are the\n",
    "  hyperparameters.\n",
    "\n",
    "* It has a training method, `fit`, with an argument `X` containing the\n",
    "  training examples and, in the case of supervised models, an argument `y`\n",
    "  containing labels. The `fit` method creates an instance of the scikit-learn\n",
    "  `LogisticRegression` operator, trains it, and returns the wrapper object.\n",
    "\n",
    "* It has a prediction method, `predict` for an estimator or `transform` for\n",
    "  a transformer. The method has an argument `X` containing the test examples\n",
    "  and returns the labels for `predict` or the transformed data for\n",
    "  `transform`.\n",
    "\n",
    "These conventions are designed to be similar to those of scikit-learn.\n",
    "However, they avoid a code dependency upon scikit-learn.\n",
    "\n",
    "Note that in a simple example like this, the underlying\n",
    "`sklearn.linear_model.LogisticRegression` class could be used directly,\n",
    "without needing the `_MyLRImpl` wrapper.  However, creating such a wrapper\n",
    "is useful for more complicated examples.\n",
    "\n",
    "## 2. Register a New Lale Operator\n",
    "\n",
    "We can now register `_MyLRImpl` as a new Lale operator `MyLR`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lale.operators\n",
    "MyLR = lale.operators.make_operator(_MyLRImpl)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The call to `make_operator` automatically creates a skeleton JSON schema for\n",
    "the hyperparameters and the operator methods and attaches it to `MyLR`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "{\n",
       "    \"$schema\": \"http://json-schema.org/draft-04/schema#\",\n",
       "    \"description\": \"Schema for <class 'type'> auto-generated by lale.type_checking.get_default_schema().\",\n",
       "    \"type\": \"object\",\n",
       "    \"tags\": {\"pre\": [], \"op\": [\"estimator\"], \"post\": []},\n",
       "    \"properties\": {\n",
       "        \"hyperparams\": {\n",
       "            \"allOf\": [\n",
       "                {\n",
       "                    \"type\": \"object\",\n",
       "                    \"properties\": {},\n",
       "                    \"relevantToOptimizer\": [],\n",
       "                }\n",
       "            ]\n",
       "        },\n",
       "        \"input_fit\": {\n",
       "            \"type\": \"object\",\n",
       "            \"properties\": {\n",
       "                \"X\": {\"laleType\": \"Any\"},\n",
       "                \"y\": {\"laleType\": \"Any\"},\n",
       "            },\n",
       "            \"additionalProperties\": false,\n",
       "            \"required\": [\"X\", \"y\"],\n",
       "        },\n",
       "        \"input_predict\": {\n",
       "            \"type\": \"object\",\n",
       "            \"properties\": {\"X\": {\"laleType\": \"Any\"}},\n",
       "            \"additionalProperties\": false,\n",
       "            \"required\": [\"X\"],\n",
       "        },\n",
       "        \"output_predict\": {\"laleType\": \"Any\"},\n",
       "    },\n",
       "}\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from lale.pretty_print import ipython_display\n",
    "ipython_display(MyLR._schemas)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Customize Hyperparameter Schema\n",
    "\n",
    "Lale requires schemas both for error-checking and for generating search\n",
    "spaces for hyperparameter optimization.\n",
    "The schemas of a Lale operator specify the space of valid values for\n",
    "hyperparameters, for the arguments to `fit` and `predict` or `transform`,\n",
    "and for the output of `predict` or `transform`. To keep the schemas\n",
    "independent of the Python programming language, they are expressed as\n",
    "[JSON Schema](https://json-schema.org/understanding-json-schema/reference/).\n",
    "JSON Schema is currently a draft standard and is already being widely\n",
    "adopted and implemented, for instance, as part of specifying\n",
    "[Swagger APIs](https://www.openapis.org/).\n",
    "\n",
    "The schema of a Lale operator can be incrementally customized using the `customize_schema` method wich returns a copy of the operator with the customized schema. The `customize_schema` method also validates the new schema for early error reporting.\n",
    "\n",
    "Instead of manually writing the schemas -- which can be error prone -- we provide \n",
    "a dedicated API to help the authoring of operator schemas.\n",
    "\n",
    "The running example chooses hyperparameters of scikit-learn LogisticRegression that illustrate all the interesting cases. More complete and elaborate examples can be found in the Lale standard library. The following specifies each hyperparameter one at a time, omitting cross-cutting constraints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lale.schemas import Null, Enum, Int, Float, Object, Array, Not, AnyOf\n",
    "\n",
    "MyLR = MyLR.customize_schema(\n",
    "    relevantToOptimizer=['solver', 'penalty', 'C'],\n",
    "    solver=Enum(desc='Algorithm for optimization problem.',\n",
    "                values=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],\n",
    "                default='liblinear'),\n",
    "    penalty=Enum(desc='Norm used in the penalization.',\n",
    "                 values=['l1', 'l2'],\n",
    "                 default='l2'),\n",
    "    C=Float(desc='Inverse regularization strength. '\n",
    "                 'Smaller values specify stronger regularization.',\n",
    "            minimum=0.0, exclusiveMinimum=True,\n",
    "            minimumForOptimizer=0.03125, maximumForOptimizer=32768, \n",
    "            distribution='loguniform',\n",
    "            default=1.0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, `solver` and `penalty` are categorical hyperparameters and `C` is a\n",
    "continuous hyperparameter. For all three hyperparameters, the schema\n",
    "includes a description, used for interactive documentation, and a\n",
    "default value, used when no explicit value is specified. The categorical\n",
    "hyperparameters are then specified as enumerations of their legal values.\n",
    "In contrast, the continuous hyperparameter is a number, and the schema\n",
    "includes additional information such as its distribution, minimum, and\n",
    "maximum. In the example, `C` has `'minimum': 0.0`, indicating that only\n",
    "positive values are valid. Furthermore, `C` has a\n",
    "`'minimumForOptimizer': 0.03125` and `'maxmumForOptimizer': 32768`,\n",
    "guiding the optimizer to limit its search space.\n",
    "\n",
    "### Constraints\n",
    "\n",
    "Besides specifying hyperparameters one at a time, users may also want to\n",
    "specify cross-cutting constraints to further restrict the hyperparameter\n",
    "schema. This part is an advanced use case and can be skipped by novice\n",
    "users."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "MyLR = MyLR.customize_schema(\n",
    "    constraint=AnyOf([Object(solver=Not(Enum(['newton-cg', 'sag', 'lbfgs']))),\n",
    "                      Object(penalty=Enum(['l2']))]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In JSON schema, `allOf` is a logical \"and\", `anyOf` is a logical \"or\", and\n",
    "`not` is a logical negation. Thus, the `anyOf` part of the example can be\n",
    "read as\n",
    "\n",
    "```python\n",
    "assert not (solver in ['newton-cg', 'sag', 'lbfgs']) or penalty == 'l2'\n",
    "```\n",
    "\n",
    "By standard Boolean rules, this is equivalent to a logical implication:\n",
    "\n",
    "```python\n",
    "if solver in ['newton-cg', 'sag', 'lbfgs']:\n",
    "    assert penalty == 'l2'\n",
    "```\n",
    "\n",
    "The complete hyperparameters schema simply combines the ranges with the\n",
    "constraints:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "{\n",
       "    \"allOf\": [\n",
       "        {\n",
       "            \"type\": \"object\",\n",
       "            \"properties\": {\n",
       "                \"solver\": {\n",
       "                    \"default\": \"liblinear\",\n",
       "                    \"description\": \"Algorithm for optimization problem.\",\n",
       "                    \"enum\": [\n",
       "                        \"newton-cg\", \"lbfgs\", \"liblinear\", \"sag\", \"saga\",\n",
       "                    ],\n",
       "                },\n",
       "                \"penalty\": {\n",
       "                    \"default\": \"l2\",\n",
       "                    \"description\": \"Norm used in the penalization.\",\n",
       "                    \"enum\": [\"l1\", \"l2\"],\n",
       "                },\n",
       "                \"C\": {\n",
       "                    \"default\": 1.0,\n",
       "                    \"description\": \"Inverse regularization strength. Smaller values specify stronger regularization.\",\n",
       "                    \"type\": \"number\",\n",
       "                    \"minimum\": 0.0,\n",
       "                    \"exclusiveMinimum\": true,\n",
       "                    \"minimumForOptimizer\": 0.03125,\n",
       "                    \"maximumForOptimizer\": 32768,\n",
       "                    \"distribution\": \"loguniform\",\n",
       "                },\n",
       "            },\n",
       "            \"relevantToOptimizer\": [\"solver\", \"penalty\", \"C\"],\n",
       "        },\n",
       "        {\n",
       "            \"anyOf\": [\n",
       "                {\n",
       "                    \"type\": \"object\",\n",
       "                    \"properties\": {\n",
       "                        \"solver\": {\n",
       "                            \"not\": {\"enum\": [\"newton-cg\", \"sag\", \"lbfgs\"]}\n",
       "                        }\n",
       "                    },\n",
       "                },\n",
       "                {\n",
       "                    \"type\": \"object\",\n",
       "                    \"properties\": {\"penalty\": {\"enum\": [\"l2\"]}},\n",
       "                },\n",
       "            ]\n",
       "        },\n",
       "    ]\n",
       "}\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "ipython_display(MyLR.hyperparam_schema())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Customize Fit and Predict  Schemas\n",
    "\n",
    "The next step is to specify the expected input and output type of the methods `fit`, and `predict` or `transform`.\n",
    "\n",
    "The `fit` method of `MyLR`\n",
    "takes two arguments, `X` and `y`. The `X` argument is an array of arrays of\n",
    "numbers. The outer array is over samples (rows) of a dataset. The inner\n",
    "array is over features (columns) of a sample. The `y` argument is an array\n",
    "of non-negative numbers. Each element of `y` is a label for the\n",
    "corresponding sample in `X`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "MyLR = MyLR.customize_schema(\n",
    "    input_fit=Object(\n",
    "        required=['X', 'y'],\n",
    "        additionalProperties=False,\n",
    "        X=Array(items=Array(items=Float())),\n",
    "        y=Array(items=Float())))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The schema for the arguments of the `predict` method is similar, just\n",
    "omitting `y`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "MyLR = MyLR.customize_schema(\n",
    "    input_predict=Object(\n",
    "        required=['X'],\n",
    "        X=Array(items=Array(items=Float())), \n",
    "        additionalProperties=False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output schema indicates that the `predict` method returns an array of\n",
    "labels with the same schema as `y`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "MyLR = MyLR.customize_schema(output_predict=Array(items=Float()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transparent method/property forwarding\n",
    "\n",
    "Schemas can request transparent forwarding of specified methods/properties to the implementation.\n",
    "For example, to add support for calling \"intercept_\" directly on a `MyLR` object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MyLR = MyLR.customize_schema(\n",
    "    forwards=[\"intercept_\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is supported only for methods/properties whose name does not conflict with names that lale uses in its operator wrapper classes.\n",
    "Methods/properties can still always be called via MyLR.impl.method()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tags\n",
    "\n",
    "Finally, we can add tags for discovery and documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "MyLR = MyLR.customize_schema(\n",
    "    tags= {'pre': ['~categoricals'],\n",
    "           'op': ['estimator', 'classifier', 'interpretable'],\n",
    "           'post': ['probabilities']})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have a complete JSON schemas for our `MyLR` operator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "{\n",
       "    \"$schema\": \"http://json-schema.org/draft-04/schema#\",\n",
       "    \"description\": \"Schema for <class 'type'> auto-generated by lale.type_checking.get_default_schema().\",\n",
       "    \"type\": \"object\",\n",
       "    \"tags\": {\n",
       "        \"pre\": [\"~categoricals\"],\n",
       "        \"op\": [\"estimator\", \"classifier\", \"interpretable\"],\n",
       "        \"post\": [\"probabilities\"],\n",
       "    },\n",
       "    \"properties\": {\n",
       "        \"hyperparams\": {\n",
       "            \"allOf\": [\n",
       "                {\n",
       "                    \"type\": \"object\",\n",
       "                    \"properties\": {\n",
       "                        \"solver\": {\n",
       "                            \"default\": \"liblinear\",\n",
       "                            \"description\": \"Algorithm for optimization problem.\",\n",
       "                            \"enum\": [\n",
       "                                \"newton-cg\", \"lbfgs\", \"liblinear\", \"sag\",\n",
       "                                \"saga\",\n",
       "                            ],\n",
       "                        },\n",
       "                        \"penalty\": {\n",
       "                            \"default\": \"l2\",\n",
       "                            \"description\": \"Norm used in the penalization.\",\n",
       "                            \"enum\": [\"l1\", \"l2\"],\n",
       "                        },\n",
       "                        \"C\": {\n",
       "                            \"default\": 1.0,\n",
       "                            \"description\": \"Inverse regularization strength. Smaller values specify stronger regularization.\",\n",
       "                            \"type\": \"number\",\n",
       "                            \"minimum\": 0.0,\n",
       "                            \"exclusiveMinimum\": true,\n",
       "                            \"minimumForOptimizer\": 0.03125,\n",
       "                            \"maximumForOptimizer\": 32768,\n",
       "                            \"distribution\": \"loguniform\",\n",
       "                        },\n",
       "                    },\n",
       "                    \"relevantToOptimizer\": [\"solver\", \"penalty\", \"C\"],\n",
       "                },\n",
       "                {\n",
       "                    \"anyOf\": [\n",
       "                        {\n",
       "                            \"type\": \"object\",\n",
       "                            \"properties\": {\n",
       "                                \"solver\": {\n",
       "                                    \"not\": {\n",
       "                                        \"enum\": [\"newton-cg\", \"sag\", \"lbfgs\"]\n",
       "                                    }\n",
       "                                }\n",
       "                            },\n",
       "                        },\n",
       "                        {\n",
       "                            \"type\": \"object\",\n",
       "                            \"properties\": {\"penalty\": {\"enum\": [\"l2\"]}},\n",
       "                        },\n",
       "                    ]\n",
       "                },\n",
       "            ]\n",
       "        },\n",
       "        \"input_fit\": {\n",
       "            \"type\": \"object\",\n",
       "            \"required\": [\"X\", \"y\"],\n",
       "            \"additionalProperties\": false,\n",
       "            \"properties\": {\n",
       "                \"X\": {\n",
       "                    \"type\": \"array\",\n",
       "                    \"items\": {\"type\": \"array\", \"items\": {\"type\": \"number\"}},\n",
       "                },\n",
       "                \"y\": {\"type\": \"array\", \"items\": {\"type\": \"number\"}},\n",
       "            },\n",
       "        },\n",
       "        \"input_predict\": {\n",
       "            \"type\": \"object\",\n",
       "            \"required\": [\"X\"],\n",
       "            \"additionalProperties\": false,\n",
       "            \"properties\": {\n",
       "                \"X\": {\n",
       "                    \"type\": \"array\",\n",
       "                    \"items\": {\"type\": \"array\", \"items\": {\"type\": \"number\"}},\n",
       "                }\n",
       "            },\n",
       "        },\n",
       "        \"output_predict\": {\"type\": \"array\", \"items\": {\"type\": \"number\"}},\n",
       "    },\n",
       "}\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "ipython_display(MyLR._schemas)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Testing and Using the new Operator\n",
    "\n",
    "Once your operator implementation and schema definitions are ready,\n",
    "you can test it with Lale as follows. First, you will need to install\n",
    "Lale, as described in the\n",
    "[installation](../../master/docs/installation.md)) instructions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use the new Operator\n",
    "\n",
    "Before demonstrating the new `MyLR` operator, the following code loads the\n",
    "Iris dataset, which comes out-of-the-box with scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "expected [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]\n"
     ]
    }
   ],
   "source": [
    "import sklearn.datasets\n",
    "import sklearn.utils\n",
    "iris = sklearn.datasets.load_iris()\n",
    "X_all, y_all = sklearn.utils.shuffle(iris.data, iris.target, random_state=42)\n",
    "holdout_size = 30\n",
    "X_train, y_train = X_all[holdout_size:], y_all[holdout_size:]\n",
    "X_test, y_test = X_all[:holdout_size], y_all[:holdout_size]\n",
    "print('expected {}'.format(y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that the data is in place, the following code sets the hyperparameters,\n",
    "calls `fit` to train, and calls `predict` to make predictions. This code\n",
    "looks almost like what people would usually write with scikit-learn, except\n",
    "that it uses an enumeration `MyLR.solver` that is implicitly defined by Lale\n",
    "so users do not have to pass in error-prone strings for categorical\n",
    "hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "actual [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]\n"
     ]
    }
   ],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
    "\n",
    "trainable = MyLR(MyLR.enum.solver.lbfgs, C=0.1)\n",
    "trained = trainable.fit(X_train, y_train)\n",
    "predictions = trained.predict(X_test)\n",
    "print('actual {}'.format(predictions))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate interactive documentation, the following code retrieves the\n",
    "specification of the `C` hyperparameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'default': 1.0,\n",
       " 'description': 'Inverse regularization strength. Smaller values specify stronger regularization.',\n",
       " 'type': 'number',\n",
       " 'minimum': 0.0,\n",
       " 'exclusiveMinimum': True,\n",
       " 'minimumForOptimizer': 0.03125,\n",
       " 'maximumForOptimizer': 32768,\n",
       " 'distribution': 'loguniform'}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "MyLR.hyperparam_schema('C')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, operator tags are reflected via Python methods on the operator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n",
      "{'pre': ['~categoricals'], 'op': ['estimator', 'classifier', 'interpretable'], 'post': ['probabilities']}\n"
     ]
    }
   ],
   "source": [
    "print(MyLR.has_tag('interpretable'))\n",
    "print(MyLR.get_tags())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate error-checking, the following code showcases an invalid\n",
    "hyperparameter caught by JSON schema validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Invalid configuration for MyLR(solver='adam') due to invalid value solver=adam.\n",
      "Schema of argument solver: {\n",
      "    \"default\": \"liblinear\",\n",
      "    \"description\": \"Algorithm for optimization problem.\",\n",
      "    \"enum\": [\"newton-cg\", \"lbfgs\", \"liblinear\", \"sag\", \"saga\"],\n",
      "}\n",
      "Value: adam\n"
     ]
    }
   ],
   "source": [
    "import jsonschema, sys\n",
    "try:\n",
    "    MyLR(solver='adam')\n",
    "except jsonschema.ValidationError as e:\n",
    "    print(e.message, file=sys.stderr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, to illustrate hyperparameter optimization, the following code uses\n",
    "[hyperopt](http://hyperopt.github.io/hyperopt/). We will document the\n",
    "hyperparameter optimization use-case in more detail elsewhere. Here we only\n",
    "demonstrate that Lale with `MyLR` supports it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100%|██████████| 10/10 [00:00<00:00, 54.54trial/s, best loss: -1.0]\n",
      "best hyperparameter combination {'C': 18098.51542502289, 'name': '__main__.MyLR', 'penalty': 'l2', 'solver': 'saga'}\n"
     ]
    }
   ],
   "source": [
    "from lale.search.op2hp import hyperopt_search_space\n",
    "from hyperopt import STATUS_OK, Trials, fmin, tpe, space_eval\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.exceptions import ConvergenceWarning\n",
    "warnings.filterwarnings(\"ignore\", category=ConvergenceWarning)\n",
    "\n",
    "def objective(hyperparams):\n",
    "    del hyperparams['name']\n",
    "    trainable = MyLR(**hyperparams)\n",
    "    trained = trainable.fit(X_train, y_train)\n",
    "    predictions = trained.predict(X_test)\n",
    "    accuracy = accuracy_score(y_test, predictions)\n",
    "    return {'loss': -accuracy, 'status': STATUS_OK}\n",
    "\n",
    "#The following line is enabled by the hyperparameter schema.\n",
    "search_space = hyperopt_search_space(MyLR)\n",
    "\n",
    "trials = Trials()\n",
    "fmin(objective, search_space, algo=tpe.suggest, max_evals=10, trials=trials)\n",
    "best_hyperparams = space_eval(search_space, trials.argmin)\n",
    "print('best hyperparameter combination {}'.format(best_hyperparams))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This concludes the running example. To\n",
    "summarize, we have learned how to write an operator implementation class and JSON\n",
    "schemas; how to register the Lale operator; and how to use the Lale operator for\n",
    "manual as well as automated machine-learning.\n",
    "\n",
    "### Additional Wrapper Class Features\n",
    "\n",
    "Besides `X` and `y`, the `fit` method in scikit-learn sometimes has\n",
    "additional arguments. Lale also supports such additional arguments.\n",
    "\n",
    "In addition to the `__init__`, `fit`, and `predict` methods, many\n",
    "scikit-learn estimators also have a `predict_proba` method. Lale will\n",
    "support that with its own metadata schema.\n",
    "\n",
    "## 6. Consider Contributing to Lale Open-Source Project\n",
    "\n",
    "We encourage you to add your new operator to Lale. Take a look at the \n",
    "\"Developer's Certificate of Origin\", which can be found in\n",
    "[DCO1.1.txt](https://github.com/IBM/lale/blob/master/DCO1.1.txt).\n",
    "Can you agree to the terms in the DCO, as well as to the\n",
    "[license](https://github.com/IBM/lale/blob/master/LICENSE.txt)?\n",
    "If yes, check out the \"For Developers\" part at the end of the\n",
    "[Installation instructions](https://github.com/IBM/lale/blob/master/docs/installation.rst),\n",
    "and follow the steps to create a pull request.\n",
    "\n",
    "## 7. Reference\n",
    "\n",
    "This section documents features of JSON Schema that Lale uses, as well as\n",
    "extensions that Lale adds to JSON schema for information specific to the\n",
    "machine-learning domain. For a more comprehensive introduction to JSON\n",
    "Schema, refer to its\n",
    "[Reference](https://json-schema.org/understanding-json-schema/reference/).\n",
    "\n",
    "The following table lists kinds of schemas in JSON Schema:\n",
    "\n",
    "| Kind of schema | Corresponding type in Python/Lale |\n",
    "| ---------------| ---------------------------- |\n",
    "| `null`         | `NoneType`, value `None` |\n",
    "| `boolean`      | `bool`, values `True` or `False` |\n",
    "| `string`       | `str` |\n",
    "| `enum`         | See discussion below. |\n",
    "| `number`       | `float`, .e.g, `0.1` |\n",
    "| `integer`      | `int`, e.g., `42` |\n",
    "| `array`        | See discussion below. |\n",
    "| `object`       | `dict` with string keys |\n",
    "| `anyOf`, `allOf`, `not` | See discussion below. |\n",
    "\n",
    "The use of `null`, `boolean`, and `string` is fairly straightforward.  The\n",
    "following paragraphs discuss the other kinds of schemas one by one.\n",
    "\n",
    "### 7.1. enum\n",
    "\n",
    "In JSON Schema, an enum can contain assorted values including strings,\n",
    "numbers, or even `null`. Lale uses enums of strings for categorical\n",
    "hyperparameters, such as `'penalty': {'enum': ['l1', 'l2']}` in the earlier\n",
    "example. In that case, Lale also automatically declares a corresponding\n",
    "Python `enum`.\n",
    "When Lale uses enums of other types, it is usually to restrict a\n",
    "hyperparameter to a single value, such as `'enum': [None]`.\n",
    "\n",
    "### 7.2. number, integer\n",
    "\n",
    "In schemas with `type` set to `number` or `integer`, JSON schema lets users\n",
    "specify `minimum`, `maximum`,\n",
    "`exclusiveMinimum`, and `exclusiveMaximum`. Lale further extends JSON schema\n",
    "with `minimumForOptimizer`, `maximumForOptimizer`, and `distribution`.\n",
    "Possible values for the `distribution` are `'uniform'` (the default) and\n",
    "`'loguniform'`. In the case of integers, Lale quantizes the distributions\n",
    "accordingly.\n",
    "\n",
    "### 7.3. array\n",
    "\n",
    "Lale schemas for input and output data make heavy use of the JSON Schema\n",
    "`array` type. In this case, Lale schemas are intended to capture logical\n",
    "schemas, not physical representations, similarly to how relational databases\n",
    "hide physical representations behind a well-formalized abstraction layer.\n",
    "Therefore, Lale uses arrays from JSON Schema for several types in Python.\n",
    "The most obvious one is a Python `list`. Another common one is a numpy\n",
    "[ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html),\n",
    "where Lale uses nested arrays to represent each of the dimensions of a\n",
    "multi-dimensional array. Lale also has support for `pandas.DataFrame` and\n",
    "`pandas.Series`, for which it again uses JSON Schema arrays.\n",
    "\n",
    "For arrays, JSON schema lets users specify `items`, `minItems`, and\n",
    "`maxItems`. Lale further extends JSON schema with `minItemsForOptimizer` and\n",
    "`maxItemsForOptimizer`. Furthermore, Lale supports a `laleType`,\n",
    "which can be `'Any'` to locally disable a subschema check, or`'tuple'` to support cases where the Python code requires a\n",
    "tuple instead of a list.\n",
    "\n",
    "### 7.4. object\n",
    "\n",
    "For objects, JSON schema lets users specify a list `required` of properties\n",
    "that must be present, a dictionary `properties` of sub-schemas, and a flag\n",
    "`additionalProperties` to indicate whether the object can have additional\n",
    "properties beyond the keys of the `properties` dictionary. Lale further\n",
    "extends JSON schema with a `relevantToOptimizer` list of properties that\n",
    "hyperparameter optimizers should search over.\n",
    "\n",
    "For individual properties, Lale supports a `default`, which is inspired by\n",
    "and consistent with web API specification practice. It also supports a\n",
    "`forOptimizer` flag which defaults to `True` but can be set to `False` to\n",
    "hide a particular subschema from the hyperparameter optimizer. For example,\n",
    "the number of components for PCA in scikit-learn can be specified as an\n",
    "integer or a floating point number, but an optimizer should only explore one\n",
    "of these choices. Lale supports a Boolean flag `transient` that, if true,\n",
    "elides a hyperparameter during pretty-printing, visualization, or in JSON.\n",
    "\n",
    "### 7.5. allOf, anyOf, not\n",
    "\n",
    "As discussed before, in JSON schema, `allOf` is a logical \"and\", `anyOf` is\n",
    "a logical \"or\", and `not` is a logical negation. The running example from\n",
    "earlier already illustrated how to use these for implementing cross-cutting\n",
    "constraints. Another use-case that takes advantage of `anyOf` is for\n",
    "expressing union types, which arise frequently in scikit-learn. For example,\n",
    "here is the schema for `n_components` from PCA:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```python\n",
       "{\n",
       "    \"default\": null,\n",
       "    \"anyOf\": [\n",
       "        {\"description\": \"If not set, keep all components.\", \"enum\": [null]},\n",
       "        {\n",
       "            \"description\": \"Use Minka's MLE to guess the dimension.\",\n",
       "            \"enum\": [\"mle\"],\n",
       "        },\n",
       "        {\n",
       "            \"description\": \"Select the number of components such that the amount of variance that needs to be explained is greater than the specified percentage.\",\n",
       "            \"type\": \"number\",\n",
       "            \"minimum\": 0.0,\n",
       "            \"exclusiveMinimum\": true,\n",
       "            \"maximum\": 1.0,\n",
       "            \"exclusiveMaximum\": true,\n",
       "        },\n",
       "        {\n",
       "            \"description\": \"Number of components to keep.\",\n",
       "            \"forOptimizer\": false,\n",
       "            \"type\": \"integer\",\n",
       "            \"minimum\": 1,\n",
       "        },\n",
       "    ],\n",
       "}\n",
       "```"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "n_components_sch = AnyOf(\n",
    "    [Enum([None], desc=\"If not set, keep all components.\"),\n",
    "     Enum(['mle'], desc=\"Use Minka's MLE to guess the dimension.\"),\n",
    "     Float(minimum=0.0, exclusiveMinimum=True, \n",
    "           maximum=1.0, exclusiveMaximum=True, \n",
    "           desc='Select the number of components such that the amount of variance '\n",
    "                'that needs to be explained is greater than the specified percentage.'),\n",
    "     Int(minimum=1, forOptimizer=False, desc='Number of components to keep.')],\n",
    "    default=None)\n",
    "\n",
    "ipython_display(n_components_sch.schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7.6. Schema Metadata\n",
    "\n",
    "We encourage users to make their schemas more readable by also including\n",
    "common JSON schema metadata such as `$schema` and `description`.  As seen in\n",
    "the examples in this document, Lale also extends JSON schema with `tags`\n",
    "and `documentation_url`. Finally, in some cases, schema-internal\n",
    "duplication can be avoided by cross-references and linking. This is\n",
    "supported by off-the-shelf features of JSON schema without requiring\n",
    "Lale-specific extensions."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}