{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83d\udcc3 Solution for Exercise M4.03\n", "\n", "Now, we tackle a more realistic classification problem instead of making a\n", "synthetic dataset. We start by loading the Adult Census dataset with the\n", "following snippet. For the moment we retain only the **numerical features**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n", "target = adult_census[\"class\"]\n", "data = adult_census.select_dtypes([\"integer\", \"floating\"])\n", "data = data.drop(columns=[\"education-num\"])\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We confirm that all the selected features are numerical.\n", "\n", "Compute the generalization performance in terms of accuracy of a linear model\n", "composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n", "cross-validation with `return_estimator=True` to be able to inspect the\n", "trained estimators." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import cross_validate\n", "\n", "model = make_pipeline(StandardScaler(), LogisticRegression())\n", "cv_results_lr = cross_validate(\n", " model, data, target, cv=10, return_estimator=True\n", ")\n", "test_score_lr = cv_results_lr[\"test_score\"]\n", "test_score_lr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the most important feature seen by the logistic regression?\n", "\n", "You can use a boxplot to compare the absolute values of the coefficients while\n", "also visualizing the variability induced by the cross-validation resampling." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "import matplotlib.pyplot as plt\n", "\n", "coefs = [pipeline[-1].coef_[0] for pipeline in cv_results_lr[\"estimator\"]]\n", "coefs = pd.DataFrame(coefs, columns=data.columns)\n", "\n", "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n", "_, ax = plt.subplots()\n", "_ = coefs.abs().plot.box(color=color, vert=False, ax=ax)" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "Since we scaled the features, the coefficients of the linear model can be\n", "meaningful compared directly. `\"capital-gain\"` is the most impacting feature.\n", "Just be aware not to draw conclusions on the causal effect provided the impact\n", "of a feature. Interested readers are refered to the [example on Common\n", "pitfalls in the interpretation of coefficients of linear\n", "models](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html)\n", "or the [example on Failure of Machine Learning to infer causal\n", "effects](https://scikit-learn.org/stable/auto_examples/inspection/plot_causal_interpretation.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now work with **both numerical and categorical features**. You can\n", "reload the Adult Census dataset with the following snippet:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n", "target = adult_census[\"class\"]\n", "data = adult_census.drop(columns=[\"class\", \"education-num\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a predictive model where:\n", "- The numerical data must be scaled.\n", "- The categorical data must be one-hot encoded, set `min_frequency=0.01` to\n", " group categories concerning less than 1% of the total samples.\n", "- The predictor is a `LogisticRegression`. You may need to increase the number\n", " of `max_iter`, which is 100 by default.\n", "\n", "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n", "above to evaluate this complex pipeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.compose import make_column_selector as selector\n", "from sklearn.compose import make_column_transformer\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "categorical_columns = selector(dtype_include=object)(data)\n", "numerical_columns = selector(dtype_exclude=object)(data)\n", "\n", "preprocessor = make_column_transformer(\n", " (\n", " OneHotEncoder(handle_unknown=\"ignore\", min_frequency=0.01),\n", " categorical_columns,\n", " ),\n", " (StandardScaler(), numerical_columns),\n", ")\n", "model = make_pipeline(preprocessor, LogisticRegression(max_iter=5_000))\n", "cv_results_complex_lr = cross_validate(\n", " model, data, target, cv=10, return_estimator=True, n_jobs=2\n", ")\n", "test_score_complex_lr = cv_results_complex_lr[\"test_score\"]\n", "test_score_complex_lr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By comparing the cross-validation test scores of both models fold-to-fold,\n", "count the number of times the model using both numerical and categorical\n", "features has a better test score than the model using only numerical features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "indices = np.arange(len(test_score_lr))\n", "plt.scatter(\n", " indices, test_score_lr, color=\"tab:blue\", label=\"numerical features only\"\n", ")\n", "plt.scatter(\n", " indices,\n", " test_score_complex_lr,\n", " color=\"tab:red\",\n", " label=\"all features\",\n", ")\n", "plt.ylim((0, 1))\n", "plt.xlabel(\"Cross-validation iteration\")\n", "plt.ylabel(\"Accuracy\")\n", "_ = plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n", "\n", "print(\n", " \"A model using both all features is better than a\"\n", " \" model using only numerical features for\"\n", " f\" {sum(test_score_complex_lr > test_score_lr)} CV iterations out of 10.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the following questions, you can copy and paste the following snippet to\n", "get the feature names from the column transformer here named `preprocessor`.\n", "\n", "```python\n", "preprocessor.fit(data)\n", "feature_names = (\n", " preprocessor.named_transformers_[\"onehotencoder\"].get_feature_names_out(\n", " categorical_columns\n", " )\n", ").tolist()\n", "feature_names += numerical_columns\n", "feature_names\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "preprocessor.fit(data)\n", "feature_names = (\n", " preprocessor.named_transformers_[\"onehotencoder\"].get_feature_names_out(\n", " categorical_columns\n", " )\n", ").tolist()\n", "feature_names += numerical_columns\n", "feature_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that there are as many feature names as coefficients in the last step\n", "of your predictive pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which of the following pairs of features is most impacting the predictions of\n", "the logistic regression classifier based on the absolute magnitude of its\n", "coefficients?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "coefs = [\n", " pipeline[-1].coef_[0] for pipeline in cv_results_complex_lr[\"estimator\"]\n", "]\n", "coefs = pd.DataFrame(coefs, columns=feature_names)\n", "\n", "_, ax = plt.subplots(figsize=(10, 35))\n", "_ = coefs.abs().plot.box(color=color, vert=False, ax=ax)" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "We can visually inspect the coefficients and observe that `\"capital-gain\"` and\n", "`\"education_Doctorate\"` are impacting the predictions the most." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now create a similar pipeline consisting of the same preprocessor as above,\n", "followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n", "Set `degree=2` and `interaction_only=True` to the feature engineering step.\n", "Remember not to include a \"bias\" feature to avoid introducing a redundancy\n", "with the intercept of the subsequent logistic regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "from sklearn.preprocessing import PolynomialFeatures\n", "\n", "model_with_interaction = make_pipeline(\n", " preprocessor,\n", " PolynomialFeatures(degree=2, include_bias=False, interaction_only=True),\n", " LogisticRegression(C=0.01, max_iter=5_000),\n", ")\n", "model_with_interaction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By comparing the cross-validation test scores of both models fold-to-fold,\n", "count the number of times the model using multiplicative interactions and both\n", "numerical and categorical features has a better test score than the model\n", "without interactions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "cv_results_interactions = cross_validate(\n", " model_with_interaction,\n", " data,\n", " target,\n", " cv=10,\n", " return_estimator=True,\n", " n_jobs=2,\n", ")\n", "test_score_interactions = cv_results_interactions[\"test_score\"]\n", "test_score_interactions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# solution\n", "plt.scatter(\n", " indices, test_score_lr, color=\"tab:blue\", label=\"numerical features only\"\n", ")\n", "plt.scatter(\n", " indices,\n", " test_score_complex_lr,\n", " color=\"tab:red\",\n", " label=\"all features\",\n", ")\n", "plt.scatter(\n", " indices,\n", " test_score_interactions,\n", " color=\"black\",\n", " label=\"all features and interactions\",\n", ")\n", "plt.xlabel(\"Cross-validation iteration\")\n", "plt.ylabel(\"Accuracy\")\n", "_ = plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n", "\n", "print(\n", " \"A model using all features and interactions is better than a model\"\n", " \" without interactions for\"\n", " f\" {sum(test_score_interactions > test_score_complex_lr)} CV iterations\"\n", " \" out of 10.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "solution" ] }, "source": [ "When you multiply two one-hot encoded categorical features, the resulting\n", "interaction feature is mostly 0, with a 1 only when both original features are\n", "active, acting as a logical `AND`. In this case it could mean we are creating\n", "new rules such as \"has a given education `AND` a given native country\", which\n", "we expect to be predictive. This new rules map the original feature space into\n", "a higher dimension space, where the linear model can separate the data more\n", "easily.\n", "\n", "Keep into account that multiplying all pairs of one-hot encoded features may\n", "lead to a rapid increase in the number of features, especially if the original\n", "categorical variables have many levels. This can increase the computational\n", "cost of your model and promote overfitting, as we will see in a future\n", "notebook." ] } ], "metadata": { "jupytext": { "main_language": "python" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }