{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Dirty categories: learning with non normalized strings\n\nIncluding strings that represent categories often calls for much data\npreparation. In particular categories may appear with many morphological\nvariants, when they have been manually input, or assembled from diverse\nsources.\n\nIncluding such a column in a learning pipeline as a standard categorical\ncolum leads to categories with very high cardinalities and can lose\ninformation on which categories are similar.\n\nHere we look at a dataset on wages [#]_ where the column *Employee\nPosition Title* contains dirty categories.\n\n.. [#] https://catalog.data.gov/dataset/employee-salaries-2016\n\nWe investigate encodings to include such compare different categorical\nencodings for the dirty column to predict the *Current Annual Salary*,\nusing gradient boosted trees. For this purpose, we use the skrub\nlibrary ( https://skrub-data.org ).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ".. |SV| replace::\n :class:`~skrub.TableVectorizer`\n\n.. |tabular_learner| replace::\n :func:`~skrub.tabular_learner`\n\n.. |OneHotEncoder| replace::\n :class:`~sklearn.preprocessing.OneHotEncoder`\n\n.. |RandomForestRegressor| replace::\n :class:`~sklearn.ensemble.RandomForestRegressor`\n\n.. |SE| replace:: :class:`~skrub.SimilarityEncoder`\n\n.. |GapEncoder| replace:: :class:`~skrub.GapEncoder`\n\n.. |permutation importances| replace::\n :func:`~sklearn.inspection.permutation_importance`\n\n\n## The data\n\n### Data Importing and preprocessing\n\nWe first download the dataset:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from skrub.datasets import fetch_employee_salaries\nemployee_salaries = fetch_employee_salaries()\nprint(employee_salaries.description)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we load it:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\ndf = employee_salaries.X.copy()\ndf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recover the target\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y = employee_salaries.y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A simple default as a learner\n\nThe function |tabular_learner| is a simple way of creating a default\nlearner for tabular_learner data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from skrub import tabular_learner\nmodel = tabular_learner(\"regressor\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can quickly compute its cross-validation score using the\ncorresponding scikit-learn utility\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import cross_validate\nimport numpy as np\n\nresults = cross_validate(model, df, y)\nprint(f\"Prediction score: {np.mean(results['test_score'])}\")\nprint(f\"Training time: {np.mean(results['fit_time'])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below the hood, `model` is a pipeline:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "We can see that it is made of first a |SV|, and an\nHistGradientBoostingRegressor\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding the vectorizer + learner pipeline\n\nThe number one difficulty is that our input is a complex and\nheterogeneous dataframe:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The |SV| is a transformer that turns this dataframe into a\nform suited for machine learning.\n\nFeeding it output to a powerful learner,\nsuch as gradient boosted trees, gives **a machine-learning method that\ncan be readily applied to the dataframe**.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from skrub import TableVectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assembling the pipeline\n\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the |SV| with a HistGradientBoostingRegressor, which is a good\npredictor for data with heterogeneous columns\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import HistGradientBoostingRegressor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then create a pipeline chaining our encoders to a learner\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n\npipeline = make_pipeline(\n TableVectorizer(),\n HistGradientBoostingRegressor()\n)\npipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that it is almost the same model as above (can you spot the\ndifferences)\n\nLet's perform a cross-validation to see how well this model predicts\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "results = cross_validate(pipeline, df, y)\nprint(f\"Prediction score: {np.mean(results['test_score'])}\")\nprint(f\"Training time: {np.mean(results['fit_time'])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The prediction perform here is pretty much as good as above\nbut the code here is much simpler as it does not involve specifying\ncolumns manually.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyzing the features created\n\nLet us perform the same workflow, but without the `Pipeline`, so we can\nanalyze its mechanisms along the way.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tab_vec = TableVectorizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We split the data between train and test, and transform them:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\ndf_train, df_test, y_train, y_test = train_test_split(\n df, y, test_size=0.15, random_state=42\n)\n\nX_train_enc = tab_vec.fit_transform(df_train, y_train)\nX_test_enc = tab_vec.transform(df_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The encoded data, X_train_enc and X_test_enc are numerical arrays:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, 
"outputs": [], "source": [ "X_train_enc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "They have more columns than the original dataframe, but not much more:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train_enc.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Inspecting the features created\n\nThe |SV| assigns a transformer for each column. We can inspect this\nchoice:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tab_vec.transformers_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is what is being passed to transform the different columns under the hood.\nWe can notice it classified the columns \"gender\" and \"assignment_category\"\nas low cardinality string variables.\nA |OneHotEncoder| will be applied to these columns.\n\nThe vectorizer actually makes the difference between string variables\n(data type ``object`` and ``string``) and categorical variables\n(data type ``category``).\n\nNext, we can have a look at the encoded feature names.\n\nBefore encoding:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.columns.to_list()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After encoding (we only plot the first 8 feature names):\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "feature_names = tab_vec.get_feature_names_out()\nfeature_names[:8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, it created a new column for each unique value.\nThis is because we used |SE| on the column \"division\",\nwhich was classified as a high cardinality string variable.\n(default values, see |SV|'s docstring).\n\nIn total, we have reasonnable number of encoded columns.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "len(feature_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature importance in the statistical model\n\nHere we consider interpretability, plot the feature importances of a\nclassifier. We can do this because the |GapEncoder| leads to\ninterpretable features even with messy categories\n\n.. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Feature importance in the statistical model\n\nHere we consider interpretability and plot the feature importances of a\nregressor. We can do this because the |GapEncoder| leads to\ninterpretable features even with messy categories.\n\n.. topic:: Note:\n\n To minimize computation time, we use the feature importances computed by\n the |RandomForestRegressor|, but you should prefer |permutation importances|\n instead (which are less subject to biases).\n\nFirst, let's train the |RandomForestRegressor|:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\nregressor = RandomForestRegressor()\nregressor.fit(X_train_enc, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieving the feature importances:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "importances = regressor.feature_importances_\nstd = np.std(\n    [\n        tree.feature_importances_\n        for tree in regressor.estimators_\n    ],\n    axis=0\n)\nindices = np.argsort(importances)[::-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the results:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nplt.figure(figsize=(12, 9))\nplt.title(\"Feature importances\")\nn = 20\nn_indices = indices[:n]\nlabels = np.array(feature_names)[n_indices]\nplt.barh(range(n), importances[n_indices], color=\"b\", xerr=std[n_indices])\nplt.yticks(range(n), labels, size=15)\nplt.tight_layout(pad=1)\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can deduce from this data that the three factors that most strongly\ndetermine the salary are: having been hired a long time ago, being a\nmanager, and having a permanent, full-time job :).\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring different machine-learning pipelines to encode the data\n\n### The learning pipeline\n\nTo build a learning pipeline, we need to assemble encoders for each\ncolumn, and apply a supervised learning model on top.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Encoding the table\n\nThe TableVectorizer applies different transformations to the different\ncolumns to turn them into numerical values suitable for learning:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from skrub import TableVectorizer\nencoder = TableVectorizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Pipelining an encoder with a learner\n\nHere again we use a pipeline with a HistGradientBoostingRegressor:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import HistGradientBoostingRegressor\npipeline = make_pipeline(encoder, HistGradientBoostingRegressor())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pipeline can be readily applied to the dataframe for prediction:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pipeline.fit(df, y)\n\n# The categorical encoders\n# ........................\n#\n# An encoder is needed to turn a categorical column into a numerical\n# representation\nfrom sklearn.preprocessing import OneHotEncoder\n\none_hot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)" ] },
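{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, a quick sketch (using only the dataframe loaded above)\nof how many distinct values each column holds; high-cardinality string\ncolumns are the \"dirty\" ones that a plain one-hot encoding handles\npoorly:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch: number of distinct values per column; columns with hundreds of\n# unique strings are poor candidates for plain one-hot encoding\ndf.nunique()" ] },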
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Dirty-category encoding\n\nThe one-hot encoder is actually not well suited to the 'Employee\nPosition Title' column, as this column contains 400 different entries.\n\nWe will now experiment with different encoders for dirty columns:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from skrub import SimilarityEncoder, MinHashEncoder, GapEncoder\nfrom sklearn.preprocessing import TargetEncoder\n\nsimilarity = SimilarityEncoder()\ntarget = TargetEncoder()\nminhash = MinHashEncoder(n_components=100)\ngap = GapEncoder(n_components=100)\n\nencoders = {\n    'one-hot': one_hot,\n    'similarity': similarity,\n    'target': target,\n    'minhash': minhash,\n    'gap': gap,\n}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now loop over the different encoding methods,\ninstantiate a new pipeline each time, fit it,\nand store the returned cross-validation score:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "all_scores = dict()\n\nfor name, method in encoders.items():\n    encoder = TableVectorizer(high_cardinality=method)\n\n    pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())\n    scores = cross_validate(pipeline, df, y)\n    print('{} encoding'.format(name))\n    print('r2 score: mean: {:.3f}; std: {:.3f}'.format(\n        np.mean(scores['test_score']), np.std(scores['test_score'])))\n    print('time: {:.3f}\\n'.format(\n        np.mean(scores['fit_time'])))\n    all_scores[name] = scores['test_score']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the time it takes to fit also varies a lot, not only the\nprediction score.\n\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plotting the results\n\nFinally, we plot the scores on a boxplot:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import seaborn\nimport matplotlib.pyplot as plt\nplt.figure(figsize=(4, 3))\nax = seaborn.boxplot(data=pd.DataFrame(all_scores), orient='h')\nplt.ylabel('Encoding', size=20)\nplt.xlabel('Prediction score (R2)', size=20)\nplt.yticks(size=20)\nplt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The clear trend is that encoders that use the string form\nof the category (similarity, minhash, and gap) perform better than\nthose that discard it.\n\nSimilarityEncoder is the best performer, but it is less scalable on big\ndata than MinHashEncoder and GapEncoder. The most scalable encoder is\nthe MinHashEncoder. GapEncoder, on the other hand, has the benefit that\nit provides interpretable features, as shown above.\n\n|\n\n\n.. topic:: The TableVectorizer automates preprocessing\n\n As this notebook demonstrates, many preprocessing steps can be\n automated by the |SV|, and the resulting pipeline can still be\n inspected, even with non-normalized entries.\n\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 0 }