{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 02\n",
    "The goal is to find the best set of hyper-parameters which maximize the\n",
    "performance on a training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\n",
    "    \"https://www.openml.org/data/get_csv/1595261/adult-census.csv\")\n",
    "# Or use the local copy:\n",
    "# df = pd.read_csv('../datasets/adult-census.csv')\n",
    "\n",
    "target_name = \"class\"\n",
    "target = df[target_name].to_numpy()\n",
    "data = df.drop(columns=target_name)\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "df_train, df_test, target_train, target_test = train_test_split(\n",
    "    data, target, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TODO: create your machine learning pipeline\n",
    "\n",
    "You should:\n",
    "* preprocess the categorical columns using a `OneHotEncoder` and use a\n",
    "  `StandardScaler` to normalize the numerical data.\n",
    "* use a `LogisticRegression` as a predictive model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "Start by defining the columns and the preprocessing pipelines to be applied\n",
    "on each columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "Subsequently, create a `ColumnTransformer` to redirect the specific columns\n",
    "a preprocessing pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from sklearn.compose import ColumnTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "Finally, concatenate the preprocessing pipeline with a logistic regression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.linear_model import LogisticRegression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TODO: make your random search\n",
    "\n",
    "Use a `RandomizedSearchCV` to find the best set of hyper-parameters by tuning\n",
    "the following parameters for the `LogisticRegression` model:\n",
    "- `C` with values ranging from 0.001 to 10. You can use a reciprocal\n",
    "  distribution (i.e. `scipy.stats.reciprocal`);\n",
    "- `solver` with possible values being `\"liblinear\"` and `\"lbfgs\"`;\n",
    "- `penalty` with possible values being `\"l2\"` and `\"l1\"`;\n",
    "In addition, try several preprocessing strategies with the `OneHotEncoder`\n",
    "by always (or not) dropping the first column when encoding the categorical\n",
    "data.\n",
    "\n",
    "Notes: You can accept failure during a grid-search or a randomized-search\n",
    "by settgin `error_score` to `np.nan` for instance."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "formats": "python_scripts//py:percent,notebooks//ipynb"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}