{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Encoding of categorical variables\n",
    "\n",
    "In this notebook, we present some typical ways of dealing with **categorical\n",
    "variables** by encoding them, namely **ordinal encoding** and **one-hot\n",
    "encoding**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first load the entire adult dataset containing both numerical and\n",
    "categorical data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
    "# drop the duplicated column `\"education-num\"` as stated in the first notebook\n",
    "adult_census = adult_census.drop(columns=\"education-num\")\n",
    "\n",
    "target_name = \"class\"\n",
    "target = adult_census[target_name]\n",
    "\n",
    "data = adult_census.drop(columns=[target_name])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Identify categorical variables\n",
    "\n",
    "As we saw in the previous section, a numerical variable is a\n",
    "quantity represented by a real or integer number. These variables can be\n",
    "naturally handled by machine learning algorithms that are typically composed\n",
    "of a sequence of arithmetic instructions such as additions and\n",
    "multiplications.\n",
    "\n",
    "In contrast, categorical variables have discrete values, typically\n",
    "represented by string labels (but not only) taken from a finite list of\n",
    "possible choices. For instance, the variable `native-country` in our dataset\n",
    "is a categorical variable because it encodes the data using a finite list of\n",
    "possible countries (along with the `?` symbol when this information is\n",
    "missing):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"native-country\"].value_counts().sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How can we easily recognize categorical columns among the dataset? Part of\n",
    "the answer lies in the columns' data type:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we look at the `\"native-country\"` column, we observe its data type is\n",
    "`object`, meaning it contains string values.\n",
    "\n",
    "## Select features based on their data type\n",
    "\n",
    "In the previous notebook, we manually defined the numerical columns. We could\n",
    "do a similar approach. Instead, we can use the scikit-learn helper function\n",
    "`make_column_selector`, which allows us to select columns based on their data\n",
    "type. We now illustrate how to use this helper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_selector as selector\n",
    "\n",
    "categorical_columns_selector = selector(dtype_include=object)\n",
    "categorical_columns = categorical_columns_selector(data)\n",
    "categorical_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we created the selector by passing the data type to include; we then\n",
    "passed the input dataset to the selector object, which returned a list of\n",
    "column names that have the requested data type. We can now filter out the\n",
    "unwanted columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_categorical = data[categorical_columns]\n",
    "data_categorical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The dataset is composed of {data_categorical.shape[1]} features\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the remainder of this section, we will present different strategies to\n",
    "encode categorical data into numerical data which can be used by a\n",
    "machine-learning algorithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Strategies to encode categories\n",
    "\n",
    "### Encoding ordinal categories\n",
    "\n",
    "The most intuitive strategy is to encode each category with a different\n",
    "number. The `OrdinalEncoder` transforms the data in such manner. We start by\n",
    "encoding a single column to understand how the encoding works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import OrdinalEncoder\n",
    "\n",
    "education_column = data_categorical[[\"education\"]]\n",
    "\n",
    "encoder = OrdinalEncoder().set_output(transform=\"pandas\")\n",
    "education_encoded = encoder.fit_transform(education_column)\n",
    "education_encoded"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that each category in `\"education\"` has been replaced by a numeric\n",
    "value. We could check the mapping between the categories and the numerical\n",
    "values by checking the fitted attribute `categories_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "encoder.categories_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can check the encoding applied on all categorical features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_encoded = encoder.fit_transform(data_categorical)\n",
    "data_encoded[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The dataset encoded contains {data_encoded.shape[1]} features\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the categories have been encoded for each feature (column)\n",
    "independently. We also note that the number of features before and after the\n",
    "encoding is the same.\n",
    "\n",
    "However, be careful when applying this encoding strategy:\n",
    "using this integer representation leads downstream predictive models\n",
    "to assume that the values are ordered (0 < 1 < 2 < 3... for instance).\n",
    "\n",
    "By default, `OrdinalEncoder` uses a lexicographical strategy to map string\n",
    "category labels to integers. This strategy is arbitrary and often\n",
    "meaningless. For instance, suppose the dataset has a categorical variable\n",
    "named `\"size\"` with categories such as \"S\", \"M\", \"L\", \"XL\". We would like the\n",
    "integer representation to respect the meaning of the sizes by mapping them to\n",
    "increasing integers such as `0, 1, 2, 3`.\n",
    "However, the lexicographical strategy used by default would map the labels\n",
    "\"S\", \"M\", \"L\", \"XL\" to 2, 1, 0, 3, by following the alphabetical order.\n",
    "\n",
    "The `OrdinalEncoder` class accepts a `categories` constructor argument to\n",
    "pass categories in the expected ordering explicitly. You can find more\n",
    "information in the\n",
    "[scikit-learn documentation](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features)\n",
    "if needed.\n",
    "\n",
    "If a categorical variable does not carry any meaningful order information\n",
    "then this encoding might be misleading to downstream statistical models and\n",
    "you might consider using one-hot encoding instead (see below).\n",
    "\n",
    "### Encoding nominal categories (without assuming any order)\n",
    "\n",
    "`OneHotEncoder` is an alternative encoder that prevents the downstream\n",
    "models to make a false assumption about the ordering of categories. For a\n",
    "given feature, it creates as many new columns as there are possible\n",
    "categories. For a given sample, the value of the column corresponding to the\n",
    "category is set to `1` while all the columns of the other categories\n",
    "are set to `0`.\n",
    "\n",
    "We can encode a single feature (e.g. `\"education\"`) to illustrate how the\n",
    "encoding works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "encoder = OneHotEncoder(sparse_output=False).set_output(transform=\"pandas\")\n",
    "education_encoded = encoder.fit_transform(education_column)\n",
    "education_encoded"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p><tt class=\"docutils literal\">sparse_output=False</tt> is used in the <tt class=\"docutils literal\">OneHotEncoder</tt> for didactic purposes,\n",
    "namely easier visualization of the data.</p>\n",
    "<p class=\"last\">Sparse matrices are efficient data structures when most of your matrix\n",
    "elements are zero. They won't be covered in detail in this course. If you\n",
    "want more details about them, you can look at\n",
    "<a class=\"reference external\" href=\"https://scipy-lectures.org/advanced/scipy_sparse/introduction.html#why-sparse-matrices\">this</a>.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that encoding a single feature gives a dataframe full of zeros\n",
    "and ones. Each category (unique value) became a column; the encoding\n",
    "returned, for each sample, a 1 to specify which category it belongs to.\n",
    "\n",
    "Let's apply this encoding on the full dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The dataset is composed of {data_categorical.shape[1]} features\")\n",
    "data_categorical"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_encoded = encoder.fit_transform(data_categorical)\n",
    "data_encoded[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The encoded dataset contains {data_encoded.shape[1]} features\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at how the `\"workclass\"` variable of the 3 first records has been encoded\n",
    "and compare this to the original string representation.\n",
    "\n",
    "The number of features after the encoding is more than 10 times larger than\n",
    "in the original data because some variables such as `occupation` and\n",
    "`native-country` have many possible categories."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Choosing an encoding strategy\n",
    "\n",
    "Choosing an encoding strategy depends on the underlying models and the type of\n",
    "categories (i.e. ordinal vs. nominal)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">In general <tt class=\"docutils literal\">OneHotEncoder</tt> is the encoding strategy used when the\n",
    "downstream models are <strong>linear models</strong> while <tt class=\"docutils literal\">OrdinalEncoder</tt> is often a\n",
    "good strategy with <strong>tree-based models</strong>.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using an `OrdinalEncoder` outputs ordinal categories. This means\n",
    "that there is an order in the resulting categories (e.g. `0 < 1 < 2`). The\n",
    "impact of violating this ordering assumption is really dependent on the\n",
    "downstream models. Linear models would be impacted by misordered categories\n",
    "while tree-based models would not.\n",
    "\n",
    "You can still use an `OrdinalEncoder` with linear models but you need to be\n",
    "sure that:\n",
    "- the original categories (before encoding) have an ordering;\n",
    "- the encoded categories follow the same ordering than the original\n",
    "  categories.\n",
    "The **next exercise** highlights the issue of misusing `OrdinalEncoder` with\n",
    "a linear model.\n",
    "\n",
    "One-hot encoding categorical variables with high cardinality can cause\n",
    "computational inefficiency in tree-based models. Because of this, it is not\n",
    "recommended to use `OneHotEncoder` in such cases even if the original\n",
    "categories do not have a given order. We will show this in the **final\n",
    "exercise** of this sequence."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate our predictive pipeline\n",
    "\n",
    "We can now integrate this encoder inside a machine learning pipeline like we\n",
    "did with numerical data: let's train a linear classifier on the encoded data\n",
    "and check the generalization performance of this machine learning pipeline using\n",
    "cross-validation.\n",
    "\n",
    "Before we create the pipeline, we have to focus on the `native-country`.\n",
    "Let's recall some statistics regarding this column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"native-country\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the `\"Holand-Netherlands\"` category is occurring rarely. This will\n",
    "be a problem during cross-validation: if the sample ends up in the test set\n",
    "during splitting then the classifier would not have seen the category during\n",
    "training and would not be able to encode it.\n",
    "\n",
    "In scikit-learn, there are some possible solutions to bypass this issue:\n",
    "\n",
    "* list all the possible categories and provide them to the encoder via the\n",
    "  keyword argument `categories` instead of letting the estimator automatically\n",
    "  determine them from the training data when calling fit;\n",
    "* set the parameter `handle_unknown=\"ignore\"`, i.e. if an unknown category is\n",
    "  encountered during transform, the resulting one-hot encoded columns for this\n",
    "  feature will be all zeros;\n",
    "* adjust the `min_frequency` parameter to collapse the rarest categories\n",
    "  observed in the training data into a single one-hot encoded feature. If you\n",
    "  enable this option, you can also set `handle_unknown=\"infrequent_if_exist\"`\n",
    "  to encode the unknown categories (categories only observed at predict time)\n",
    "  as ones in that last column.\n",
    "\n",
    "In this notebook we only explore the second option, namely\n",
    "`OneHotEncoder(handle_unknown=\"ignore\")`. Feel free to evaluate the\n",
    "alternatives on your own, for instance using a sandbox notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition tip alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    "<p class=\"last\">Be aware the <tt class=\"docutils literal\">OrdinalEncoder</tt> exposes a parameter also named <tt class=\"docutils literal\">handle_unknown</tt>.\n",
    "It can be set to <tt class=\"docutils literal\">use_encoded_value</tt>. If that option is chosen, you can define\n",
    "a fixed value that is assigned to all unknown categories during <tt class=\"docutils literal\">transform</tt>.\n",
    "For example, <tt class=\"docutils literal\"><span class=\"pre\">OrdinalEncoder(handle_unknown='use_encoded_value',</span> <span class=\"pre\">unknown_value=-1)</span></tt> would set all values encountered during <tt class=\"docutils literal\">transform</tt> to <tt class=\"docutils literal\"><span class=\"pre\">-1</span></tt>\n",
    "which are not part of the data encountered during the <tt class=\"docutils literal\">fit</tt> call. You are\n",
    "going to use these parameters in the next exercise.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now create our machine learning pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "model = make_pipeline(\n",
    "    OneHotEncoder(handle_unknown=\"ignore\"), LogisticRegression(max_iter=500)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">Here, we need to increase the maximum number of iterations to obtain a fully\n",
    "converged <tt class=\"docutils literal\">LogisticRegression</tt> and silence a <tt class=\"docutils literal\">ConvergenceWarning</tt>. Contrary\n",
    "to the numerical features, the one-hot encoded categorical features are all\n",
    "on the same scale (values are 0 or 1), so they would not benefit from\n",
    "scaling. In this case, increasing <tt class=\"docutils literal\">max_iter</tt> is the right thing to do.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can check the model's generalization performance only using the\n",
    "categorical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "cv_results = cross_validate(model, data_categorical, target)\n",
    "cv_results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = cv_results[\"test_score\"]\n",
    "print(f\"The accuracy is: {scores.mean():.3f} \u00b1 {scores.std():.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, this representation of the categorical variables is slightly\n",
    "more predictive of the revenue than the numerical variables that we used\n",
    "previously. The reason being that we have more (predictive) categorical\n",
    "features than numerical ones."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "In this notebook we have:\n",
    "* seen two common strategies for encoding categorical features: **ordinal\n",
    "  encoding** and **one-hot encoding**;\n",
    "* used a **pipeline** to use a **one-hot encoder** before fitting a logistic\n",
    "  regression."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}