{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Dirty categories: learning with non normalized strings\n\nIncluding strings that represent categories often calls for much data\npreparation. In particular categories may appear with many morphological\nvariants, when they have been manually input, or assembled from diverse\nsources.\n\nIncluding such a column in a learning pipeline as a standard categorical\ncolum leads to categories with very high cardinalities and can loose\ninformation on which categories are similar.\n\nHere we look at a dataset on wages [#]_ where the column *Employee\nPosition Title* contains dirty categories.\n\n.. [#] https://catalog.data.gov/dataset/employee-salaries-2016\n\nWe investigate encodings to include such compare different categorical\nencodings for the dirty column to predict the *Current Annual Salary*,\nusing gradient boosted trees. For this purpose, we use the dirty-cat\nlibrary ( https://dirty-cat.github.io ).\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## The data\n\n### Data Importing and preprocessing\n\nWe first download the dataset:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from dirty_cat.datasets import fetch_employee_salaries\nemployee_salaries = fetch_employee_salaries()\nprint(employee_salaries['DESCR'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then we load it:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\ndf = employee_salaries['data'].copy()\ndf"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let's carry out some basic preprocessing:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "df['Date First Hired'] = pd.to_datetime(df['date_first_hired'])\ndf['Year First Hired'] = df['Date First Hired'].apply(lambda x: x.year)\n# drop rows with NaN in gender\ndf.dropna(subset=['gender'], inplace=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First we extract the target\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "target_column = 'Current Annual Salary'\ny = df[target_column].values.ravel()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Assembling a machine-learning pipeline that encodes the data\n\n### The learning pipeline\n\nTo build a learning pipeline, we need to assemble encoders for each\ncolumn, and apply a supvervised learning model on top.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### The categorical encoders\n\nA encoder is needed to turn a categorical column into a numerical\nrepresentation\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import OneHotEncoder\n\none_hot = OneHotEncoder(handle_unknown='ignore', sparse=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We assemble these to be applied on the relevant columns.\nThe column transformer is created by specifying a set of transformers\nalongside with the column names on which to apply them\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.compose import make_column_transformer\nencoder = make_column_transformer(\n    (one_hot, ['gender', 'department_name', 'assignment_category']),\n    ('passthrough', ['Year First Hired']),\n    # Last but not least, our dirty column\n    (one_hot, ['employee_position_title']),\n    remainder='drop',\n   )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Pipelining an encoder with a learner\n\nWe will use a HistGradientBoostingRegressor, which is a good predictor\nfor data with heterogeneous columns\n(for scikit-learn 0.24 we need to require the experimental feature)\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.experimental import enable_hist_gradient_boosting\n# now you can import normally from ensemble\nfrom sklearn.ensemble import HistGradientBoostingRegressor\n\n# We then create a pipeline chaining our encoders to a learner\nfrom sklearn.pipeline import make_pipeline\npipeline = make_pipeline(encoder, HistGradientBoostingRegressor())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The pipeline can be readily applied to the dataframe for prediction\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "pipeline.fit(df, y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Dirty-category encoding\n\nThe one-hot encoder is actually not well suited to the 'Employee\nPosition Title' column, as this columns contains 400 different entries.\n\nWe will now experiments with encoders for dirty columns\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from dirty_cat import SimilarityEncoder, TargetEncoder, MinHashEncoder,\\\n    GapEncoder\n\nsimilarity = SimilarityEncoder(similarity='ngram')\ntarget = TargetEncoder(handle_unknown='ignore')\nminhash = MinHashEncoder(n_components=100)\ngap = GapEncoder(n_components=100)\n\nencoders = {\n    'one-hot': one_hot,\n    'similarity': SimilarityEncoder(similarity='ngram'),\n    'target': target,\n    'minhash': minhash,\n    'gap': gap}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We now loop over the different encoding methods,\ninstantiate each time a new pipeline, fit it\nand store the returned cross-validation score:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import cross_val_score\nimport numpy as np\n\nall_scores = dict()\n\nfor name, method in encoders.items():\n    encoder = make_column_transformer(\n        (one_hot, ['gender', 'department_name', 'assignment_category']),\n        ('passthrough', ['Year First Hired']),\n        # Last but not least, our dirty column\n        (method, ['employee_position_title']),\n        remainder='drop',\n    )\n\n    pipeline = make_pipeline(encoder, HistGradientBoostingRegressor())\n    scores = cross_val_score(pipeline, df, y)\n    print('{} encoding'.format(name))\n    print('r2 score:  mean: {:.3f}; std: {:.3f}\\n'.format(\n        np.mean(scores), np.std(scores)))\n    all_scores[name] = scores"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Plotting the results\nFinally, we plot the scores on a boxplot:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import seaborn\nimport matplotlib.pyplot as plt\nplt.figure(figsize=(4, 3))\nax = seaborn.boxplot(data=pd.DataFrame(all_scores), orient='h')\nplt.ylabel('Encoding', size=20)\nplt.xlabel('Prediction accuracy     ', size=20)\nplt.yticks(size=20)\nplt.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The clear trend is that encoders that use the string form\nof the category (similarity, minhash, and gap) perform better than\nthose that discard it.\n\nSimilarityEncoder is the best performer, but it is less scalable on big\ndata than MinHashEncoder and GapEncoder. The most scalable encoder is\nthe MinHashEncoder. GapEncoder, on the other hand, has the benefit that\nit provides interpretable features (see `sphx_glr_auto_examples_04_feature_interpretation_gap_encoder.py`)\n\n|\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## An easier way: automatic vectorization\n\n.. |SV| replace::\n    :class:`~dirty_cat.SuperVectorizer`\n\n.. |OneHotEncoder| replace::\n    :class:`~sklearn.preprocessing.OneHotEncoder`\n\n.. |ColumnTransformer| replace::\n    :class:`~sklearn.compose.ColumnTransformer`\n\n.. |RandomForestRegressor| replace::\n    :class:`~sklearn.ensemble.RandomForestRegressor`\n\n.. |SE| replace:: :class:`~dirty_cat.SimilarityEncoder`\n\n.. |permutation importances| replace::\n    :func:`~sklearn.inspection.permutation_importance`\n\nThe code to assemble the column transformer is a bit tedious. We will\nnow explore a simpler, automated, way of encoding the data.\n\nLet's start again from the raw data:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X = employee_salaries['data'].copy()\ny = employee_salaries['target']"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We'll drop a few columns we don't want\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X.drop([\n            'Current Annual Salary',  # Too linked with target\n            'full_name',  # Not relevant to the analysis\n            '2016_gross_pay_received',  # Too linked with target\n            '2016_overtime_pay',  # Too linked with target\n            'date_first_hired'  # Redundant with \"year_first_hired\"\n        ], axis=1, inplace=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We still have a complex and heterogeneous dataframe:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The |SV| can to turn this dataframe into a form suited for\nmachine learning.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Using the SuperVectorizer in a supervised-learning pipeline\n\nAssembling the |SV| in a pipeline with a powerful learner,\nsuch as gradient boosted trees, gives **a machine-learning method that\ncan be readily applied to the dataframe**.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# The supervectorizer requires dirty_cat 0.2.0a1. If you have an older\n# version, you can install the alpha release with\n#\n#   pip install -pre dirty_cat==0.2.0a1\n#\n\nfrom dirty_cat import SuperVectorizer\n\npipeline = make_pipeline(\n    SuperVectorizer(auto_cast=True),\n    HistGradientBoostingRegressor()\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's perform a cross-validation to see how well this model predicts\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import cross_val_score\nscores = cross_val_score(pipeline, X, y, scoring='r2')\n\nimport numpy as np\nprint(f'{scores=}')\nprint(f'mean={np.mean(scores)}')\nprint(f'std={np.std(scores)}')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The prediction perform here is pretty much as good as above\nbut the code here is much simpler as it does not involve specifying\ncolumns manually.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Analyzing the features created\n\nLet us perform the same workflow, but without the `Pipeline`, so we can\nanalyze its mechanisms along the way.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sup_vec = SuperVectorizer(auto_cast=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We split the data between train and test, and transform them:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.15, random_state=42\n)\n\nX_train_enc = sup_vec.fit_transform(X_train, y_train)\nX_test_enc = sup_vec.transform(X_test)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The encoded data, X_train_enc and X_test_enc are numerical arrays:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X_train_enc"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "They have more columns than the original dataframe, but not much more:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X_train_enc.shape"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Inspecting the features created\n\nThe |SV| assigns a transformer for each column. We can inspect this\nchoice:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sup_vec.transformers_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This is what is being passed to the |ColumnTransformer| under the hood.\nIf you're familiar with how the later works, it should be very intuitive.\nWe can notice it classified the columns \"gender\" and \"assignment_category\"\nas low cardinality string variables.\nA |OneHotEncoder| will be applied to these columns.\n\nThe vectorizer actually makes the difference between string variables\n(data type ``object`` and ``string``) and categorical variables\n(data type ``category``).\n\nNext, we can have a look at the encoded feature names.\n\nBefore encoding:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X.columns.to_list()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "After encoding (we only plot the first 8 feature names):\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "feature_names = sup_vec.get_feature_names()\nfeature_names[:8]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As we can see, it created a new column for each unique value.\nThis is because we used |SE| on the column \"division\",\nwhich was classified as a high cardinality string variable.\n(default values, see |SV|'s docstring).\n\nIn total, we have reasonnable number of encoded columns.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "len(feature_names)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Feature importance in the statistical model\n\nIn this section, we will train a regressor, and plot the feature importances\n\n.. topic:: Note:\n\n   To minimize compute time, use the feature importances computed by the\n   |RandomForestRegressor|, but you should prefer |permutation importances|\n   instead (which are less subject to biases)\n\nFirst, let's train the |RandomForestRegressor|,\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.ensemble import RandomForestRegressor\nregressor = RandomForestRegressor()\nregressor.fit(X_train_enc, y_train)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Retreiving the feature importances\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "importances = regressor.feature_importances_\nstd = np.std(\n    [\n        tree.feature_importances_\n        for tree in regressor.estimators_\n    ],\n    axis=0\n)\nindices = np.argsort(importances)[::-1]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Plotting the results:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nplt.figure(figsize=(12, 9))\nplt.title(\"Feature importances\")\nn = 20\nn_indices = indices[:n]\nlabels = np.array(feature_names)[n_indices]\nplt.barh(range(n), importances[n_indices], color=\"b\", yerr=std[n_indices])\nplt.yticks(range(n), labels, size=15)\nplt.tight_layout(pad=1)\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can deduce from this data that the three factors that define the\nmost the salary are: being hired for a long time, being a manager, and\nhaving a permanent, full-time job :).\n\n\n.. topic:: The SuperVectorizer automates preprocessing\n\n  As this notebook demonstrates, many preprocessing steps can be\n  automated by the |SV|, and the resulting pipeline can still be\n  inspected, even with non-normalized entries.\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}