{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Tabular data preprocessing" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.tabular import *\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This package contains the basic class used to define transformations for preprocessing dataframes of tabular data, as well as the basic [`TabularProc`](/tabular.transform.html#TabularProc) implementations. Preprocessing includes things like\n", "- replacing non-numerical variables with categories, then with their ids,\n", "- filling missing values,\n", "- normalizing continuous variables.\n", "\n", "In all those steps we have to be careful to apply to our validation or test set the exact correspondence we decided on our training set (which id we give to each category, which value we use for missing data, and which mean/std we use to normalize). To deal with this, we use a special class called [`TabularProc`](/tabular.transform.html#TabularProc).\n", "\n", "The data used in this page is a subset of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). It contains data on individuals, used to train a model that predicts whether their salary is greater than \\$50k." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageworkclassfnlwgteducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countrysalary
049Private101320Assoc-acdm12.0Married-civ-spouseNaNWifeWhiteFemale0190240United-States>=50k
144Private236746Masters14.0DivorcedExec-managerialNot-in-familyWhiteMale10520045United-States>=50k
238Private96185HS-gradNaNDivorcedNaNUnmarriedBlackFemale0032United-States<50k
338Self-emp-inc112847Prof-school15.0Married-civ-spouseProf-specialtyHusbandAsian-Pac-IslanderMale0040United-States>=50k
442Self-emp-not-inc822977th-8thNaNMarried-civ-spouseOther-serviceWifeBlackFemale0050United-States<50k
\n", "
" ], "text/plain": [ " age workclass fnlwgt education education-num \\\n", "0 49 Private 101320 Assoc-acdm 12.0 \n", "1 44 Private 236746 Masters 14.0 \n", "2 38 Private 96185 HS-grad NaN \n", "3 38 Self-emp-inc 112847 Prof-school 15.0 \n", "4 42 Self-emp-not-inc 82297 7th-8th NaN \n", "\n", " marital-status occupation relationship race \\\n", "0 Married-civ-spouse NaN Wife White \n", "1 Divorced Exec-managerial Not-in-family White \n", "2 Divorced NaN Unmarried Black \n", "3 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander \n", "4 Married-civ-spouse Other-service Wife Black \n", "\n", " sex capital-gain capital-loss hours-per-week native-country salary \n", "0 Female 0 1902 40 United-States >=50k \n", "1 Male 10520 0 45 United-States >=50k \n", "2 Female 0 0 32 United-States <50k \n", "3 Male 0 0 40 United-States >=50k \n", "4 Female 0 0 50 United-States <50k " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.ADULT_SAMPLE)\n", "df = pd.read_csv(path/'adult.csv')\n", "train_df, valid_df = df.iloc[:800].copy(), df.iloc[800:1000].copy()\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see it contains numerical variables (like `age` or `education-num`) as well as categorical ones (like `workclass` or `relationship`). The original dataset is clean, but we removed a few values to give examples of dealing with missing values."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']\n", "cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transforms for tabular data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class TabularProc[source][test]

\n", "\n", "> TabularProc(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`)\n", "\n", "

No tests found for TabularProc. To contribute a test please refer to this guide and this discussion.

\n", "\n", "A processor for tabular dataframes. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularProc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Base class for creating transforms for dataframes with categorical variables `cat_names` and continuous variables `cont_names`. Note that any column not in one of those lists won't be touched." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

__call__[source][test]

\n", "\n", "> __call__(**`df`**:`DataFrame`, **`test`**:`bool`=***`False`***)\n", "\n", "

No tests found for __call__. To contribute a test please refer to this guide and this discussion.

\n", "\n", "Apply the correct function to `df` depending on `test`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularProc.__call__)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

apply_train[source][test]

\n", "\n", "> apply_train(**`df`**:`DataFrame`)\n", "\n", "

Tests found for apply_train:

Some other tests where apply_train is used:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

\n", "\n", "Function applied to `df` if it's the train set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularProc.apply_train)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

apply_test[source][test]

\n", "\n", "> apply_test(**`df`**:`DataFrame`)\n", "\n", "

Tests found for apply_test:

Some other tests where apply_test is used:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

\n", "\n", "Function applied to `df` if it's the test set. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularProc.apply_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
Important: `apply_train` must be implemented in a subclass; `apply_test` defaults to `apply_train`.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "jekyll_important(\"`apply_train` must be implemented in a subclass; `apply_test` defaults to `apply_train`.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following [`TabularProc`](/tabular.transform.html#TabularProc) implementations are provided in the fastai library. Note that the replacement from categories to codes as well as the normalization of continuous variables are automatically done in a [`TabularDataBunch`](/tabular.data.html#TabularDataBunch)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
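The train/test split of responsibilities can be sketched with a small stand-alone re-implementation of the pattern (pandas only; `SimpleProc` and `ClipOutliers` are hypothetical names for illustration, not part of fastai):

```python
import pandas as pd

# Minimal re-implementation of the TabularProc pattern (not fastai's class):
# `apply_train` decides the state, `apply_test` reuses it.
class SimpleProc:
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

    def __call__(self, df, test=False):
        # Apply the correct function to `df` depending on `test`
        func = self.apply_test if test else self.apply_train
        func(df)

    def apply_train(self, df): raise NotImplementedError

    # By default the test-time behaviour falls back to the train one
    def apply_test(self, df): self.apply_train(df)

# Hypothetical proc: clip continuous columns to the range seen in training
class ClipOutliers(SimpleProc):
    def apply_train(self, df):
        self.bounds = {n: (df[n].min(), df[n].max()) for n in self.cont_names}
        self.apply_test(df)

    def apply_test(self, df):
        for n, (lo, hi) in self.bounds.items():
            df[n] = df[n].clip(lo, hi)

train = pd.DataFrame({'age': [20, 30, 40]})
valid = pd.DataFrame({'age': [10, 35, 90]})
proc = ClipOutliers([], ['age'])
proc(train)             # decides bounds (20, 40) on the training set
proc(valid, test=True)  # valid ages become [20, 35, 40]
```

The point of the pattern is that the state (here `bounds`) is computed once on the training set and merely reused on validation/test data.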

class Categorify[source][test]

\n", "\n", "> Categorify(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`) :: [`TabularProc`](/tabular.transform.html#TabularProc)\n", "\n", "

Tests found for Categorify:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]

To run tests please refer to this guide.

\n", "\n", "Transform the categorical variables to that type. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Categorify)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variables in `cont_names` aren't affected." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

apply_train[source][test]

\n", "\n", "> apply_train(**`df`**:`DataFrame`)\n", "\n", "

Tests found for apply_train:

Some other tests where apply_train is used:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

\n", "\n", "Transform `self.cat_names` columns in categorical. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Categorify.apply_train)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

apply_test[source][test]

\n", "\n", "> apply_test(**`df`**:`DataFrame`)\n", "\n", "

Tests found for apply_test:

Some other tests where apply_test is used:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

\n", "\n", "Transform `self.cat_names` columns in categorical using the codes decided in `apply_train`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Categorify.apply_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tfm = Categorify(cat_names, cont_names)\n", "tfm(train_df)\n", "tfm(valid_df, test=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we haven't changed the categories by their codes, nothing visible has changed in the dataframe yet, but we can check that the variables are now categorical and view their corresponding codes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',\n", " ' Self-emp-not-inc', ' State-gov', ' Without-pay'],\n", " dtype='object')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df['workclass'].cat.categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The test set will be given the same category codes as the training set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',\n", " ' Self-emp-not-inc', ' State-gov', ' Without-pay'],\n", " dtype='object')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_df['workclass'].cat.categories" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
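What happens at test time can be mimicked with plain pandas (a sketch of the idea, not fastai's implementation): the category list decided on the training column is imposed on the test column, and unseen labels get the missing code `-1`:

```python
import pandas as pd

# Categories are decided on the training column (sorted unique values)...
train = pd.Series([' Private', ' State-gov', ' Private'], name='workclass')
cats = pd.Categorical(train).categories

# ...and reused on the test column: unseen labels get code -1 (missing)
test = pd.Categorical([' State-gov', ' Never-seen'], categories=cats)
print(test.codes.tolist())  # [1, -1]
```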

class FillMissing[source][test]

\n", "\n", "> FillMissing(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`, **`fill_strategy`**:[`FillStrategy`](/tabular.transform.html#FillStrategy)=***``***, **`add_col`**:`bool`=***`True`***, **`fill_val`**:`float`=***`0.0`***) :: [`TabularProc`](/tabular.transform.html#TabularProc)\n", "\n", "

Tests found for FillMissing:

  • pytest -sv tests/test_tabular_transform.py::test_default_fill_strategy_is_median [source]

Some other tests where FillMissing is used:

  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

\n", "\n", "Fill the missing values in continuous columns. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FillMissing)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`cat_names` variables are left untouched (their missing values will be replaced by code 0 in the [`TabularDataBunch`](/tabular.data.html#TabularDataBunch)). The chosen [`fill_strategy`](#FillStrategy) is used to replace those nans, and if `add_col` is `True`, whenever a column `c` has missing values, a column named `c_nan` is added to flag the rows where a value was missing." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
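The `add_col` behaviour can be sketched with plain pandas (the `fill_missing` helper below is illustrative, not fastai's code): the fill value is computed on the training set, and a boolean flag column marks the rows that were filled:

```python
import pandas as pd

def fill_missing(df, col, fill_val, add_col=True):
    # Hypothetical helper mirroring the behaviour described above
    if add_col:
        df[col + '_nan'] = df[col].isna()  # flags rows that were missing
    df[col] = df[col].fillna(fill_val)

train = pd.DataFrame({'education-num': [12.0, 14.0, None, 15.0, None]})
median = train['education-num'].median()  # computed on the training set: 14.0
fill_missing(train, 'education-num', median)
print(train['education-num'].tolist())    # [12.0, 14.0, 14.0, 15.0, 14.0]
```

The same `median` would then be reused to fill a validation or test dataframe.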

apply_train[source][test]

\n", "\n", "> apply_train(**`df`**:`DataFrame`)\n", "\n", "

Tests found for apply_train:

  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

Some other tests where apply_train is used:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]

To run tests please refer to this guide.

\n", "\n", "Fill missing values in `self.cont_names` according to `self.fill_strategy`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FillMissing.apply_train)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

apply_test[source][test]

\n", "\n", "> apply_test(**`df`**:`DataFrame`)\n", "\n", "

Tests found for apply_test:

  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

Some other tests where apply_test is used:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]

To run tests please refer to this guide.

\n", "\n", "Fill missing values in `self.cont_names` like in `apply_train`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FillMissing.apply_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fills the missing values in the `cont_names` columns with the ones picked during train." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agefnlwgteducation-numcapital-gaincapital-losshours-per-week
04910132012.00190240
14423674614.010520045
23896185NaN0032
33811284715.00040
44282297NaN0050
\n", "
" ], "text/plain": [ " age fnlwgt education-num capital-gain capital-loss hours-per-week\n", "0 49 101320 12.0 0 1902 40\n", "1 44 236746 14.0 10520 0 45\n", "2 38 96185 NaN 0 0 32\n", "3 38 112847 15.0 0 0 40\n", "4 42 82297 NaN 0 0 50" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df[cont_names].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agefnlwgteducation-numcapital-gaincapital-losshours-per-week
04910132012.00190240
14423674614.010520045
2389618510.00032
33811284715.00040
4428229710.00050
\n", "
" ], "text/plain": [ " age fnlwgt education-num capital-gain capital-loss hours-per-week\n", "0 49 101320 12.0 0 1902 40\n", "1 44 236746 14.0 10520 0 45\n", "2 38 96185 10.0 0 0 32\n", "3 38 112847 15.0 0 0 40\n", "4 42 82297 10.0 0 0 50" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfm = FillMissing(cat_names, cont_names)\n", "tfm(train_df)\n", "tfm(valid_df, test=True)\n", "train_df[cont_names].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values missing in the `education-num` column are replaced by 10, which is the median of the column in `train_df`. Categorical variables are not changed, since `nan` is simply used as another category." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agefnlwgteducation-numcapital-gaincapital-losshours-per-week
800459697510.00040
8014619277910.015024060
8023637645510.00038
803255005310.00045
8043716452610.00040
\n", "
" ], "text/plain": [ " age fnlwgt education-num capital-gain capital-loss hours-per-week\n", "800 45 96975 10.0 0 0 40\n", "801 46 192779 10.0 15024 0 60\n", "802 36 376455 10.0 0 0 38\n", "803 25 50053 10.0 0 0 45\n", "804 37 164526 10.0 0 0 40" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid_df[cont_names].head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

`FillStrategy`[test]

\n", "\n", "> Enum = [MEDIAN, COMMON, CONSTANT]\n", "\n", "

Tests found for FillStrategy:

Some other tests where FillStrategy is used:

  • pytest -sv tests/test_tabular_transform.py::test_default_fill_strategy_is_median [source]

To run tests please refer to this guide.

\n", "\n", "Enum flag that determines how [`FillMissing`](/tabular.transform.html#FillMissing) should handle missing/nan values\n\n- *MEDIAN*: nans are replaced by the median value of the column\n- *COMMON*: nans are replaced by the most common value of the column\n- *CONSTANT*: nans are replaced by `fill_val` " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(FillStrategy, alt_doc_string='Enum flag that determines how `FillMissing` should handle missing/nan values', arg_comments={\n", " 'MEDIAN':'nans are replaced by the median value of the column',\n", " 'COMMON': 'nans are replaced by the most common value of the column',\n", " 'CONSTANT': 'nans are replaced by `fill_val`'\n", "})" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "
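On a toy column, the three strategies amount to the following pandas operations (a sketch of the semantics, not fastai's implementation):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, None])
filled_median   = s.fillna(s.median())   # MEDIAN: median of [1, 2, 2] is 2.0
filled_common   = s.fillna(s.mode()[0])  # COMMON: most frequent value, 2.0
filled_constant = s.fillna(0.0)          # CONSTANT: a user-supplied fill_val
```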

class Normalize[source][test]

\n", "\n", "> Normalize(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`) :: [`TabularProc`](/tabular.transform.html#TabularProc)\n", "\n", "

Tests found for Normalize:

  • pytest -sv tests/test_tabular_transform.py::test_normalize [source]

To run tests please refer to this guide.

\n", "\n", "Normalize the continuous variables. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Normalize)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "norm = Normalize(cat_names, cont_names)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

apply_train[source][test]

\n", "\n", "> apply_train(**`df`**:`DataFrame`)\n", "\n", "

Tests found for apply_train:

Some other tests where apply_train is used:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

\n", "\n", "Compute the means and stds of `self.cont_names` columns to normalize them. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Normalize.apply_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agefnlwgteducation-numcapital-gaincapital-losshours-per-week
00.829039-0.8125890.981643-0.1362714.416656-0.050230
10.4439770.3555322.0784501.153121-0.2287600.361492
2-0.018098-0.856881-0.115165-0.136271-0.228760-0.708985
3-0.018098-0.7131622.626854-0.136271-0.228760-0.050230
40.289952-0.976672-0.115165-0.136271-0.2287600.773213
\n", "
" ], "text/plain": [ " age fnlwgt education-num capital-gain capital-loss \\\n", "0 0.829039 -0.812589 0.981643 -0.136271 4.416656 \n", "1 0.443977 0.355532 2.078450 1.153121 -0.228760 \n", "2 -0.018098 -0.856881 -0.115165 -0.136271 -0.228760 \n", "3 -0.018098 -0.713162 2.626854 -0.136271 -0.228760 \n", "4 0.289952 -0.976672 -0.115165 -0.136271 -0.228760 \n", "\n", " hours-per-week \n", "0 -0.050230 \n", "1 0.361492 \n", "2 -0.708985 \n", "3 -0.050230 \n", "4 0.773213 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm.apply_train(train_df)\n", "train_df[cont_names].head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

apply_test[source][test]

\n", "\n", "> apply_test(**`df`**:`DataFrame`)\n", "\n", "

Tests found for apply_test:

Some other tests where apply_test is used:

  • pytest -sv tests/test_tabular_transform.py::test_categorify [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values [source]
  • pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians [source]

To run tests please refer to this guide.

\n", "\n", "Normalize `self.cont_names` with the same statistics as in `apply_train`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(Normalize.apply_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agefnlwgteducation-numcapital-gaincapital-losshours-per-week
8000.520989-0.850066-0.115165-0.136271-0.22876-0.050230
8010.598002-0.023706-0.1151651.705157-0.228761.596657
802-0.1721231.560596-0.115165-0.136271-0.22876-0.214919
803-1.019260-1.254793-0.115165-0.136271-0.228760.361492
804-0.095110-0.267403-0.115165-0.136271-0.22876-0.050230
\n", "
" ], "text/plain": [ " age fnlwgt education-num capital-gain capital-loss \\\n", "800 0.520989 -0.850066 -0.115165 -0.136271 -0.22876 \n", "801 0.598002 -0.023706 -0.115165 1.705157 -0.22876 \n", "802 -0.172123 1.560596 -0.115165 -0.136271 -0.22876 \n", "803 -1.019260 -1.254793 -0.115165 -0.136271 -0.22876 \n", "804 -0.095110 -0.267403 -0.115165 -0.136271 -0.22876 \n", "\n", " hours-per-week \n", "800 -0.050230 \n", "801 1.596657 \n", "802 -0.214919 \n", "803 0.361492 \n", "804 -0.050230 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm.apply_test(valid_df)\n", "valid_df[cont_names].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Treating date columns" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

add_datepart[source][test]

\n", "\n", "> add_datepart(**`df`**:`DataFrame`, **`field_name`**:`str`, **`prefix`**:`str`=***`None`***, **`drop`**:`bool`=***`True`***, **`time`**:`bool`=***`False`***)\n", "\n", "

Tests found for add_datepart:

  • pytest -sv tests/test_tabular_transform.py::test_add_datepart [source]

To run tests please refer to this guide.

\n", "\n", "Helper function that adds columns relevant to a date in the column `field_name` of `df`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(add_datepart)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If `drop` is `True`, the original `field_name` column is removed from `df`. The `time` flag decides whether we also extract the time parts (hour, minute, second) or stick to the date parts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col2col1Yearcol1Monthcol1Weekcol1Daycol1Dayofweekcol1Dayofyearcol1Is_month_endcol1Is_month_startcol1Is_quarter_endcol1Is_quarter_startcol1Is_year_endcol1Is_year_startcol1Elapsed
0a2017253434FalseFalseFalseFalseFalseFalse1486080000
1b2017254535FalseFalseFalseFalseFalseFalse1486166400
2a2017255636FalseFalseFalseFalseFalseFalse1486252800
\n", "
" ], "text/plain": [ " col2 col1Year col1Month col1Week col1Day col1Dayofweek col1Dayofyear \\\n", "0 a 2017 2 5 3 4 34 \n", "1 b 2017 2 5 4 5 35 \n", "2 a 2017 2 5 5 6 36 \n", "\n", " col1Is_month_end col1Is_month_start col1Is_quarter_end \\\n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "\n", " col1Is_quarter_start col1Is_year_end col1Is_year_start col1Elapsed \n", "0 False False False 1486080000 \n", "1 False False False 1486166400 \n", "2 False False False 1486252800 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'col1': ['02/03/2017', '02/04/2017', '02/05/2017'], 'col2': ['a', 'b', 'a']})\n", "add_datepart(df, 'col1') # inplace\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "
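A few of the columns above can be reproduced with pandas' `.dt` accessor; this sketch shows the idea (fastai's `add_datepart` additionally handles the `drop`/`time` flags and columns like `Elapsed`):

```python
import pandas as pd

df = pd.DataFrame({'col1': pd.to_datetime(['02/03/2017', '02/04/2017'])})
df['col1Year'] = df['col1'].dt.year
df['col1Month'] = df['col1'].dt.month
df['col1Dayofweek'] = df['col1'].dt.dayofweek  # Monday=0 ... Sunday=6
print(df['col1Dayofweek'].tolist())            # [4, 5]: a Friday and a Saturday
```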

add_cyclic_datepart[source][test]

\n", "\n", "> add_cyclic_datepart(**`df`**:`DataFrame`, **`field_name`**:`str`, **`prefix`**:`str`=***`None`***, **`drop`**:`bool`=***`True`***, **`time`**:`bool`=***`False`***, **`add_linear`**:`bool`=***`False`***)\n", "\n", "

No tests found for add_cyclic_datepart. To contribute a test please refer to this guide and this discussion.

\n", "\n", "Helper function that adds trigonometric date/time features to a date in the column `field_name` of `df`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(add_cyclic_datepart)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col2col1weekday_coscol1weekday_sincol1day_month_coscol1day_month_sincol1month_year_coscol1month_year_sincol1day_year_coscol1day_year_sin
0a-0.900969-0.4338840.9009690.4338840.8660250.50.8429420.538005
1b-0.222521-0.9749280.7818310.6234900.8660250.50.8335560.552435
2a0.623490-0.7818310.6234900.7818310.8660250.50.8239230.566702
\n", "
" ], "text/plain": [ " col2 col1weekday_cos col1weekday_sin col1day_month_cos \\\n", "0 a -0.900969 -0.433884 0.900969 \n", "1 b -0.222521 -0.974928 0.781831 \n", "2 a 0.623490 -0.781831 0.623490 \n", "\n", " col1day_month_sin col1month_year_cos col1month_year_sin \\\n", "0 0.433884 0.866025 0.5 \n", "1 0.623490 0.866025 0.5 \n", "2 0.781831 0.866025 0.5 \n", "\n", " col1day_year_cos col1day_year_sin \n", "0 0.842942 0.538005 \n", "1 0.833556 0.552435 \n", "2 0.823923 0.566702 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'col1': ['02/03/2017', '02/04/2017', '02/05/2017'], 'col2': ['a', 'b', 'a']})\n", "df = add_cyclic_datepart(df, 'col1') # returns a dataframe\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting data into cat and cont" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

cont_cat_split[source][test]

\n", "\n", "> cont_cat_split(**`df`**, **`max_card`**=***`20`***, **`dep_var`**=***`None`***) → `Tuple`\\[`List`\\[`T`\\], `List`\\[`T`\\]\\]\n", "\n", "

Tests found for cont_cat_split:

  • pytest -sv tests/test_tabular_transform.py::test_cont_cat_split [source]

To run tests please refer to this guide.

\n", "\n", "Helper function that returns column names of cont and cat variables from given df. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(cont_cat_split)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parameters:\n", "- df: A pandas data frame.\n", "- max_card: Maximum cardinality of a numerical categorical variable.\n", "- dep_var: A dependent variable.\n", "\n", "Return:\n", "- cont_names: A list of names of continuous variables.\n", "- cat_names: A list of names of categorical variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1col2col3col4
01a0.5ab
12b1.2o
23a7.5o
\n", "
" ], "text/plain": [ " col1 col2 col3 col4\n", "0 1 a 0.5 ab\n", "1 2 b 1.2 o\n", "2 3 a 7.5 o" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a'], 'col3': [0.5, 1.2, 7.5], 'col4': ['ab', 'o', 'o']})\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(['col3'], ['col1', 'col2'])" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cont_list, cat_list = cont_cat_split(df=df, max_card=20, dep_var='col4')\n", "cont_list, cat_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Undocumented Methods - Methods moved below this line will intentionally be hidden" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Methods - Please document or move to the undocumented section" ] } ], "metadata": { "jekyll": { "keywords": "fastai", "summary": "Transforms to clean and preprocess tabular data", "title": "tabular.transform" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 2 }