{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 02\n", "The goal is to find the best set of hyper-parameters that maximizes the\n", "performance on a training set." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "df = pd.read_csv(\n", " \"https://www.openml.org/data/get_csv/1595261/adult-census.csv\")\n", "# Or use the local copy:\n", "# df = pd.read_csv('../datasets/adult-census.csv')\n", "\n", "target_name = \"class\"\n", "target = df[target_name].to_numpy()\n", "data = df.drop(columns=target_name)\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "df_train, df_test, target_train, target_test = train_test_split(\n", " data, target, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO: create your machine learning pipeline\n", "\n", "You should:\n", "* preprocess the categorical columns using a `OneHotEncoder` and use a\n", " `StandardScaler` to normalize the numerical data.\n", "* use a `LogisticRegression` as a predictive model." ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Start by defining the columns and the preprocessing pipelines to be applied\n", "on each column." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Subsequently, create a `ColumnTransformer` to redirect the specific columns\n", "to a preprocessing pipeline."
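] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a hedged sketch (not the required solution; selecting the columns by\n", "dtype and the `handle_unknown=\"ignore\"` choice are assumptions), the\n", "transformer could be built along these lines:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch only -- the dtype-based column selection is an assumption:\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n", "\n", "categorical_columns = df_train.select_dtypes(include=object).columns\n", "numerical_columns = df_train.select_dtypes(exclude=object).columns\n", "\n", "preprocessor = ColumnTransformer([\n", "    (\"cat\", OneHotEncoder(handle_unknown=\"ignore\"), categorical_columns),\n", "    (\"num\", StandardScaler(), numerical_columns),\n", "])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The import below is the starting point provided by the exercise."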
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "\n", "from sklearn.compose import ColumnTransformer" ] }, { "cell_type": "markdown", "metadata": { "lines_to_next_cell": 0 }, "source": [ "Finally, concatenate the preprocessing pipeline with a logistic regression." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "lines_to_next_cell": 2 }, "outputs": [], "source": [ "\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO: make your random search\n", "\n", "Use a `RandomizedSearchCV` to find the best set of hyper-parameters by tuning\n", "the following parameters for the `LogisticRegression` model:\n", "- `C` with values ranging from 0.001 to 10. You can use a reciprocal\n", " distribution (i.e. `scipy.stats.reciprocal`);\n", "- `solver` with possible values being `\"liblinear\"` and `\"lbfgs\"`;\n", "- `penalty` with possible values being `\"l2\"` and `\"l1\"`.\n", "\n", "In addition, try several preprocessing strategies with the `OneHotEncoder`\n", "by dropping (or not) the first category when encoding the categorical\n", "data.\n", "\n", "Note: you can tolerate failures during a grid search or a randomized\n", "search by setting `error_score` to `np.nan`, for instance." ] } ], "metadata": { "jupytext": { "formats": "python_scripts//py:percent,notebooks//ipynb" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 2 }