{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Solution for Exercise 03\n", "\n", "The goal of this exercise is to evaluate the impact of feature preprocessing on a pipeline that uses a decision-tree-based classifier instead of logistic regression.\n", "\n", "- The first question is to empirically evaluate whether scaling numerical features is helpful or not;\n", "\n", "- The second question is to evaluate whether it is empirically better (both from a computational and a statistical perspective) to use integer-coded or one-hot encoded categories.\n", "\n", "\n", "Hint: `HistGradientBoostingClassifier` does not yet support sparse input data. You might want to use\n", "`OneHotEncoder(handle_unknown=\"ignore\", sparse=False)` to force the use of a dense representation as a workaround." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.preprocessing import OrdinalEncoder\n", "# Required while HistGradientBoostingClassifier is experimental (scikit-learn < 1.0)\n", "from sklearn.experimental import enable_hist_gradient_boosting\n", "from sklearn.ensemble import HistGradientBoostingClassifier\n", "\n", "df = pd.read_csv(\"https://www.openml.org/data/get_csv/1595261/adult-census.csv\")\n", "\n", "# Or use the local copy:\n", "# df = pd.read_csv('../datasets/adult-census.csv')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "target_name = \"class\"\n", "target = df[target_name].to_numpy()\n", "data = df.drop(columns=[target_name, \"fnlwgt\"])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "numerical_columns = [c for c in data.columns\n", " if data[c].dtype.kind in [\"i\", \"f\"]]\n", "categorical_columns = [c for c in data.columns\n", " if data[c].dtype.kind not in [\"i\", \"f\"]]\n", "\n", "categories = 
[data[column].unique()\n", " for column in data[categorical_columns]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reference pipeline (no numerical scaling and integer-coded categories)\n", "\n", "First let's time the pipeline we used in the main notebook to serve as a reference:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The different scores obtained are: \n", "[0.87224895 0.87143003 0.8745905 0.87346437 0.87796888]\n", "The accuracy is: 0.874 +- 0.002\n", "CPU times: user 12.8 s, sys: 138 ms, total: 12.9 s\n", "Wall time: 3.72 s\n" ] } ], "source": [ "%%time\n", "\n", "preprocessor = ColumnTransformer([\n", " ('categorical', OrdinalEncoder(categories=categories), categorical_columns),\n", "], remainder=\"passthrough\")\n", "\n", "\n", "model = make_pipeline(\n", " preprocessor,\n", " HistGradientBoostingClassifier()\n", ")\n", "scores = cross_val_score(model, data, target)\n", "print(f\"The different scores obtained are: \\n{scores}\")\n", "print(f\"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scaling numerical features" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The different scores obtained are: \n", "[0.87224895 0.87143003 0.8745905 0.87346437 0.87796888]\n", "The accuracy is: 0.874 +- 0.002\n", "CPU times: user 15.8 s, sys: 103 ms, total: 15.9 s\n", "Wall time: 4.95 s\n" ] } ], "source": [ "%%time\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "preprocessor = ColumnTransformer([\n", " ('numerical', StandardScaler(), numerical_columns),\n", " ('categorical', OrdinalEncoder(categories=categories), categorical_columns),\n", "])\n", "\n", "model = make_pipeline(\n", " preprocessor,\n", " HistGradientBoostingClassifier()\n", ")\n", "scores = cross_val_score(model, 
data, target)\n", "print(f\"The different scores obtained are: \\n{scores}\")\n", "print(f\"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis\n", "\n", "We can observe that both the accuracy and the training time are approximately the same as for the reference pipeline (any time difference you might observe is not significant).\n", "\n", "Scaling numerical features is indeed useless for most decision tree models in general and for `HistGradientBoostingClassifier` in particular: tree splits depend only on the ordering of the feature values, which a monotonic rescaling leaves unchanged." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## One-hot encoding of categorical variables\n", "\n", "For linear models, we have observed that integer coding of categorical\n", "variables can be very detrimental. However, for\n", "`HistGradientBoostingClassifier` models, it does not seem to be the\n", "case, as the cross-validation score of the reference pipeline with\n", "`OrdinalEncoder` is good.\n", "\n", "Let's see if we can get an even better accuracy with `OneHotEncoder`:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The different scores obtained are: \n", "[0.87286314 0.87173713 0.87407862 0.87407862 0.87807125]\n", "The accuracy is: 0.874 +- 0.002\n", "CPU times: user 42.7 s, sys: 830 ms, total: 43.5 s\n", "Wall time: 13.3 s\n" ] } ], "source": [ "%%time\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "preprocessor = ColumnTransformer([\n", " ('categorical',\n", " OneHotEncoder(handle_unknown=\"ignore\", sparse=False),\n", " categorical_columns),\n", "], remainder=\"passthrough\")\n", "\n", "\n", "model = make_pipeline(\n", " preprocessor,\n", " HistGradientBoostingClassifier()\n", ")\n", "scores = cross_val_score(model, data, target)\n", "print(f\"The different scores obtained are: \\n{scores}\")\n", "print(f\"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}\")" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "### Analysis\n", "\n", "From an accuracy point of view, the result is almost exactly the same.\n", "The reason is that `HistGradientBoostingClassifier` is expressive\n", "and robust enough to deal with the misleading ordering of integer-coded\n", "categories (which was not the case for linear models).\n", "\n", "However, from a computational point of view, the training time is\n", "significantly longer: this is caused by the fact that `OneHotEncoder`\n", "generates approximately 10 times more features than `OrdinalEncoder`.\n", "\n", "Note that the current implementation of `HistGradientBoostingClassifier`\n", "is still incomplete, and once sparse representations are handled\n", "correctly, training time might improve with such kinds of encodings.\n", "\n", "The main take-away message is that arbitrary integer coding of\n", "categories is perfectly fine for `HistGradientBoostingClassifier`\n", "and yields fast training times." ] } ], "metadata": { "jupytext": { "formats": "notebooks//ipynb,python_scripts//py:percent" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }