{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M1.04\n", "\n", "The goal of this exercise is to evaluate the impact of using an arbitrary\n", "integer encoding for categorical variables along with a linear classification\n", "model such as Logistic Regression.\n", "\n", "To do so, let's try to use `OrdinalEncoder` to preprocess the categorical\n", "variables. This preprocessor is assembled in a pipeline with\n", "`LogisticRegression`. The generalization performance of the pipeline can be\n", "evaluated by cross-validation and then compared to the score obtained when\n", "using `OneHotEncoder` or to some other baseline score.\n", "\n", "First, we load the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_name = \"class\"\n", "target = adult_census[target_name]\n", "data = adult_census.drop(columns=[target_name, \"education-num\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous notebook, we used `sklearn.compose.make_column_selector` to\n", "automatically select columns with a specific data type (also called `dtype`).\n", "Here, we use this selector to get only the columns containing strings (those\n", "with `object` dtype) that correspond to categorical features in our dataset."
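, "\n", "As a side note, the same selection could be done directly with the pandas\n", "`select_dtypes` method (a minimal sketch; `data_categorical_alt` is a\n", "hypothetical variable name used for illustration only):\n", "\n", "```python\n", "# Keep only the object-dtype (string) columns\n", "data_categorical_alt = data.select_dtypes(include=\"object\")\n", "```"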
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import make_column_selector as selector\n", "\n", "categorical_columns_selector = selector(dtype_include=object)\n", "categorical_columns = categorical_columns_selector(data)\n", "data_categorical = data[categorical_columns]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a\n", "`LogisticRegression` classifier.\n", "\n", "Because `OrdinalEncoder` can raise errors if it sees an unknown category at\n", "prediction time, you can set the `handle_unknown=\"use_encoded_value\"` and\n", "`unknown_value` parameters. You can refer to the [scikit-learn\n", "documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)\n", "for more details regarding these parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import OrdinalEncoder\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# solution\n", "model = make_pipeline(\n", "    OrdinalEncoder(handle_unknown=\"use_encoded_value\", unknown_value=-1),\n", "    LogisticRegression(max_iter=500),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your model is now defined. Evaluate it with cross-validation using\n", "`sklearn.model_selection.cross_validate`.\n", "\n", "
**Note**: be aware that if an error happens during the cross-validation,\n", "`cross_validate` raises a warning and returns NaN (Not a Number) as scores.\n", "To make it raise a standard Python exception with a traceback instead, you\n", "can pass the `error_score=\"raise\"` argument in the call to `cross_validate`.\n", "An exception is then raised at the first encountered problem and\n", "`cross_validate` stops right away instead of returning NaN values. This is\n", "particularly handy when developing complex machine learning pipelines.\n", "
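\n", "Concretely, the cross-validation could be run as follows (a sketch of one\n", "possible solution, assuming the `model`, `data_categorical` and `target`\n", "variables defined above):\n", "\n", "```python\n", "from sklearn.model_selection import cross_validate\n", "\n", "cv_results = cross_validate(\n", "    model, data_categorical, target, error_score=\"raise\"\n", ")\n", "scores = cv_results[\"test_score\"]\n", "print(f\"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}\")\n", "```"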