{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M1.04\n", "\n", "The goal of this exercise is to evaluate the impact of using an arbitrary\n", "integer encoding for categorical variables along with a linear classification\n", "model such as Logistic Regression.\n", "\n", "To do so, let's try to use `OrdinalEncoder` to preprocess the categorical\n", "variables. This preprocessor is assembled in a pipeline with\n", "`LogisticRegression`. The generalization performance of the pipeline can be\n", "evaluated by cross-validation and then compared to the score obtained when\n", "using `OneHotEncoder` or to some other baseline score.\n", "\n", "First, we load the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_name = \"class\"\n", "target = adult_census[target_name]\n", "data = adult_census.drop(columns=[target_name, \"education-num\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous notebook, we used `sklearn.compose.make_column_selector` to\n", "automatically select columns with a specific data type (also called `dtype`).\n", "Here, we use this selector to get only the columns containing strings (columns\n", "with `object` dtype) that correspond to categorical features in our dataset."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import make_column_selector as selector\n", "\n", "categorical_columns_selector = selector(dtype_include=object)\n", "categorical_columns = categorical_columns_selector(data)\n", "data_categorical = data[categorical_columns]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a\n", "`LogisticRegression` classifier.\n", "\n", "Because `OrdinalEncoder` can raise errors if it sees an unknown category at\n", "prediction time, you can set the `handle_unknown=\"use_encoded_value\"` and\n", "`unknown_value` parameters. You can refer to the [scikit-learn\n", "documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)\n", "for more details regarding these parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import OrdinalEncoder\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# solution\n", "model = make_pipeline(\n", "    OrdinalEncoder(handle_unknown=\"use_encoded_value\", unknown_value=-1),\n", "    LogisticRegression(max_iter=500),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your model is now defined. Evaluate it with cross-validation using\n", "`sklearn.model_selection.cross_validate`.\n", "\n", "
<div class=\"admonition note alert alert-info\">\n", "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n", "<p class=\"last\">Be aware that if an error happened during the cross-validation,\n", "<tt class=\"docutils literal\">cross_validate</tt> would raise a warning and return NaN (Not a Number) as\n", "scores. To make it raise a standard Python exception with a traceback, you can\n", "pass the <tt class=\"docutils literal\">error_score=\"raise\"</tt> argument in the call to <tt class=\"docutils literal\">cross_validate</tt>. An\n", "exception would be raised instead of a warning at the first encountered problem\n", "and <tt class=\"docutils literal\">cross_validate</tt> would stop right away instead of returning NaN values.\n", "This is particularly handy when developing complex machine learning pipelines.</p>\n", "</div>
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_validate\n", "\n", "# solution\n", "cv_results = cross_validate(model, data_categorical, target)\n", "\n", "scores = cv_results[\"test_score\"]\n", "print(\n", "    \"The mean cross-validation accuracy is: \"\n", "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "Using an arbitrary mapping from string labels to integers as done here causes\n", "the linear model to make bad assumptions about the relative ordering of\n", "categories.\n", "\n", "This prevents the model from learning anything predictive enough and the\n", "cross-validated score is even lower than the baseline we obtained by ignoring\n", "the input data and just constantly predicting the most frequent class:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "solution" ] }, "outputs": [], "source": [ "from sklearn.dummy import DummyClassifier\n", "\n", "cv_results = cross_validate(\n", "    DummyClassifier(strategy=\"most_frequent\"), data_categorical, target\n", ")\n", "scores = cv_results[\"test_score\"]\n", "print(\n", "    \"The mean cross-validation accuracy is: \"\n", "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we would like to compare the generalization performance of our previous\n", "model with a new model where instead of using an `OrdinalEncoder`, we use a\n", "`OneHotEncoder`. Repeat the model evaluation using cross-validation. Compare\n", "the scores of both models and conclude on the impact of choosing a specific\n", "encoding strategy when using a linear model."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "\n", "# solution\n", "model = make_pipeline(\n", "    OneHotEncoder(handle_unknown=\"ignore\"), LogisticRegression(max_iter=500)\n", ")\n", "cv_results = cross_validate(model, data_categorical, target)\n", "scores = cv_results[\"test_score\"]\n", "print(\n", "    \"The mean cross-validation accuracy is: \"\n", "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "With the linear classifier chosen, using an encoding that does not assume any\n", "ordering leads to a much better result.\n", "\n", "The important message here is: a linear model and an `OrdinalEncoder` should be\n", "used together only for ordinal categorical features, i.e. features that have a\n", "specific ordering. Otherwise, your model would perform poorly." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }