{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Tabular data preprocessing" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.tabular import *\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This package contains the basic class to define a transformation for preprocessing dataframes of tabular data, as well as some basic [`TabularProc`](/tabular.transform.html#TabularProc) subclasses. Preprocessing includes things like\n", "- replacing non-numerical variables by categories, then their ids,\n", "- filling missing values,\n", "- normalizing continuous variables.\n", "\n", "In all those steps, we have to be careful to reuse on our validation or test set the correspondence we decided on with our training set (which id we give to each category, which value we put in place of missing data, or which mean/std we use to normalize). To deal with this, we use a special class called [`TabularProc`](/tabular.transform.html#TabularProc).\n", "\n", "The data used on this page is a subset of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). It gives some information about individuals, used to train a model that predicts whether their salary is greater than \\$50k." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |

### class `TabularProc`(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`)

#### `__call__`(**`df`**:`DataFrame`, **`test`**:`bool`=***`False`***)

#### `apply_train`(**`df`**:`DataFrame`)

#### `apply_test`(**`df`**:`DataFrame`)

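
The signatures above suggest a simple protocol: calling the processor on a dataframe dispatches to `apply_train` or `apply_test` depending on the `test` flag, so that state learned on the training set can be reused elsewhere. The following is a minimal sketch of that pattern, not the library's implementation; `SketchProc` and `ShiftByTrainMin` are hypothetical names, and the default `apply_test` falling back to `apply_train` is an assumption.

```python
import pandas as pd


class SketchProc:
    """Hypothetical stand-in for a fit-on-train, reuse-on-test processor."""

    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

    def __call__(self, df, test=False):
        # Dispatch on the `test` flag: learn state on training data, reuse it otherwise.
        if test:
            self.apply_test(df)
        else:
            self.apply_train(df)

    def apply_train(self, df):
        raise NotImplementedError

    def apply_test(self, df):
        # Assumed default: a stateless processor treats both sets the same way.
        self.apply_train(df)


class ShiftByTrainMin(SketchProc):
    """Toy processor: subtract each continuous column's training minimum."""

    def apply_train(self, df):
        self.mins = {name: df[name].min() for name in self.cont_names}
        self.apply_test(df)

    def apply_test(self, df):
        for name in self.cont_names:
            df[name] = df[name] - self.mins[name]


train_df = pd.DataFrame({"x": [3.0, 5.0]})
valid_df = pd.DataFrame({"x": [4.0]})

proc = ShiftByTrainMin(cat_names=[], cont_names=["x"])
proc(train_df)             # learns min = 3.0 and modifies train_df in place
proc(valid_df, test=True)  # reuses min = 3.0 instead of recomputing it
```

The point of the split is that `valid_df` is shifted by the training minimum (3.0), not by its own minimum (4.0).
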
### class `Categorify`(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`) :: [`TabularProc`](/tabular.transform.html#TabularProc)

#### `apply_train`(**`df`**:`DataFrame`)

#### `apply_test`(**`df`**:`DataFrame`)

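
As an illustration of why the train/test split matters for categorical variables, here is a minimal sketch in plain pandas (not the library's code) that fixes the category-to-id mapping on a training frame and reuses it on another frame. The toy values are made up for the example.

```python
import pandas as pd

train_df = pd.DataFrame({"workclass": ["Private", "Self-emp-inc", "Private"]})
valid_df = pd.DataFrame({"workclass": ["Self-emp-inc", "Never-worked"]})

# "Train" step: let pandas decide the categories from the training data.
train_df["workclass"] = train_df["workclass"].astype("category")
categories = train_df["workclass"].cat.categories

# "Test" step: reuse exactly those categories, so ids stay consistent.
valid_df["workclass"] = pd.Categorical(valid_df["workclass"], categories=categories)

train_codes = train_df["workclass"].cat.codes.tolist()
valid_codes = valid_df["workclass"].cat.codes.tolist()
```

With this approach, a value never seen during training ("Never-worked" here) becomes missing and gets the code -1, while "Self-emp-inc" keeps the same id in both frames.
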
### class `FillMissing`(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`, **`fill_strategy`**:[`FillStrategy`](/tabular.transform.html#FillStrategy)=***`MEDIAN`***, …) :: [`TabularProc`](/tabular.transform.html#TabularProc)

#### `apply_train`(**`df`**:`DataFrame`)

#### `apply_test`(**`df`**:`DataFrame`)

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 49 | 101320 | 12.0 | 0 | 1902 | 40 |
| 1 | 44 | 236746 | 14.0 | 10520 | 0 | 45 |
| 2 | 38 | 96185 | NaN | 0 | 0 | 32 |
| 3 | 38 | 112847 | 15.0 | 0 | 0 | 40 |
| 4 | 42 | 82297 | NaN | 0 | 0 | 50 |

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 49 | 101320 | 12.0 | 0 | 1902 | 40 |
| 1 | 44 | 236746 | 14.0 | 10520 | 0 | 45 |
| 2 | 38 | 96185 | 10.0 | 0 | 0 | 32 |
| 3 | 38 | 112847 | 15.0 | 0 | 0 | 40 |
| 4 | 42 | 82297 | 10.0 | 0 | 0 | 50 |

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 800 | 45 | 96975 | 10.0 | 0 | 0 | 40 |
| 801 | 46 | 192779 | 10.0 | 15024 | 0 | 60 |
| 802 | 36 | 376455 | 10.0 | 0 | 0 | 38 |
| 803 | 25 | 50053 | 10.0 | 0 | 0 | 45 |
| 804 | 37 | 164526 | 10.0 | 0 | 0 | 40 |

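
In the tables above, the `NaN` entries of `education-num` are replaced by 10.0. A minimal sketch of the idea in plain pandas follows: the fill value is computed once on the training frame and then reused on any other frame. The `Strategy` enum, `fit_fill_values` and `apply_fill` are hypothetical names, the toy data is made up, and the reading of COMMON as "most frequent value" and CONSTANT as "a user-supplied value" is an assumption based on the names.

```python
from enum import Enum

import pandas as pd


class Strategy(Enum):  # hypothetical mirror of MEDIAN / COMMON / CONSTANT
    MEDIAN = 1
    COMMON = 2
    CONSTANT = 3


def fit_fill_values(train, cont_names, strategy=Strategy.MEDIAN, fill_val=0.0):
    """Return {column: filler} for the columns that have gaps in the training set."""
    fillers = {}
    for name in cont_names:
        col = train[name]
        if not col.isna().any():
            continue
        if strategy is Strategy.MEDIAN:
            fillers[name] = col.median()
        elif strategy is Strategy.COMMON:
            fillers[name] = col.dropna().value_counts().idxmax()
        else:
            fillers[name] = fill_val
    return fillers


def apply_fill(df, fillers):
    """Fill gaps in place with values that were learned on the training set."""
    for name, value in fillers.items():
        df[name] = df[name].fillna(value)


train_df = pd.DataFrame({"education-num": [9.0, 9.0, None, 13.0, 14.0]})
valid_df = pd.DataFrame({"education-num": [None, 12.0]})

fillers = fit_fill_values(train_df, ["education-num"])  # training median
common = fit_fill_values(train_df, ["education-num"], Strategy.COMMON)

apply_fill(train_df, fillers)
apply_fill(valid_df, fillers)  # same filler, not the validation median
```
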
#### `FillStrategy`: `Enum` = [MEDIAN, COMMON, CONSTANT]

### class `Normalize`(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`) :: [`TabularProc`](/tabular.transform.html#TabularProc)

#### `apply_train`(**`df`**:`DataFrame`)

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 0.829039 | -0.812589 | 0.981643 | -0.136271 | 4.416656 | -0.050230 |
| 1 | 0.443977 | 0.355532 | 2.078450 | 1.153121 | -0.228760 | 0.361492 |
| 2 | -0.018098 | -0.856881 | -0.115165 | -0.136271 | -0.228760 | -0.708985 |
| 3 | -0.018098 | -0.713162 | 2.626854 | -0.136271 | -0.228760 | -0.050230 |
| 4 | 0.289952 | -0.976672 | -0.115165 | -0.136271 | -0.228760 | 0.773213 |

#### `apply_test`(**`df`**:`DataFrame`)

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 800 | 0.520989 | -0.850066 | -0.115165 | -0.136271 | -0.22876 | -0.050230 |
| 801 | 0.598002 | -0.023706 | -0.115165 | 1.705157 | -0.22876 | 1.596657 |
| 802 | -0.172123 | 1.560596 | -0.115165 | -0.136271 | -0.22876 | -0.214919 |
| 803 | -1.019260 | -1.254793 | -0.115165 | -0.136271 | -0.22876 | 0.361492 |
| 804 | -0.095110 | -0.267403 | -0.115165 | -0.136271 | -0.22876 | -0.050230 |

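
The normalization step can be sketched in plain pandas as follows: the mean and standard deviation are computed on the training frame only, then applied unchanged to the other frame. This is an illustration with made-up data, not the library's code; the small `eps` guarding against a zero standard deviation is an assumption.

```python
import pandas as pd

cont_names = ["age"]
train_df = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
valid_df = pd.DataFrame({"age": [30.0, 50.0]})

# Statistics come from the training set only.
means = train_df[cont_names].mean()
stds = train_df[cont_names].std()  # pandas default: sample std (ddof=1)
eps = 1e-7  # assumed guard against division by zero

train_df[cont_names] = (train_df[cont_names] - means) / (stds + eps)
valid_df[cont_names] = (valid_df[cont_names] - means) / (stds + eps)
```

Here the training mean is 30 and the standard deviation is 10, so the validation value 50 maps to roughly 2 rather than to whatever its own frame's statistics would give.
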
#### `add_datepart`(**`df`**:`DataFrame`, **`field_name`**:`str`, **`prefix`**:`str`=***`None`***, **`drop`**:`bool`=***`True`***, **`time`**:`bool`=***`False`***)

| | col2 | col1Year | col1Month | col1Week | col1Day | col1Dayofweek | col1Dayofyear | col1Is_month_end | col1Is_month_start | col1Is_quarter_end | col1Is_quarter_start | col1Is_year_end | col1Is_year_start | col1Elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 2017 | 2 | 5 | 3 | 4 | 34 | False | False | False | False | False | False | 1486080000 |
| 1 | b | 2017 | 2 | 5 | 4 | 5 | 35 | False | False | False | False | False | False | 1486166400 |
| 2 | a | 2017 | 2 | 5 | 5 | 6 | 36 | False | False | False | False | False | False | 1486252800 |

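
The columns in the table above can be reproduced with the pandas `.dt` accessor. The sketch below is an approximation in plain pandas, not the function's actual code; the input dates (3 to 5 February 2017) are inferred from the `col1Elapsed` values, and the helper name `expand_date` is hypothetical.

```python
import pandas as pd


def expand_date(df, field_name, drop=True):
    """Return a copy of df with date parts of `field_name` as extra columns."""
    df = df.copy()
    dates = pd.to_datetime(df[field_name])
    attrs = ["year", "month", "day", "dayofweek", "dayofyear",
             "is_month_end", "is_month_start", "is_quarter_end",
             "is_quarter_start", "is_year_end", "is_year_start"]
    for attr in attrs:
        # "dayofweek" -> "col1Dayofweek", "is_month_end" -> "col1Is_month_end"
        df[field_name + attr.capitalize()] = getattr(dates.dt, attr)
    # ISO week number, converted to a plain integer column.
    df[field_name + "Week"] = dates.dt.isocalendar().week.astype("int64")
    # Seconds since the Unix epoch.
    epoch = pd.Timestamp("1970-01-01")
    df[field_name + "Elapsed"] = (dates - epoch) // pd.Timedelta(seconds=1)
    return df.drop(columns=field_name) if drop else df


raw = pd.DataFrame({
    "col1": ["2017-02-03", "2017-02-04", "2017-02-05"],
    "col2": ["a", "b", "a"],
})
expanded = expand_date(raw, "col1")
```
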
#### `add_cyclic_datepart`(**`df`**:`DataFrame`, **`field_name`**:`str`, **`prefix`**:`str`=***`None`***, **`drop`**:`bool`=***`True`***, **`time`**:`bool`=***`False`***, **`add_linear`**:`bool`=***`False`***)

| | col2 | col1weekday_cos | col1weekday_sin | col1day_month_cos | col1day_month_sin | col1month_year_cos | col1month_year_sin | col1day_year_cos | col1day_year_sin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a | -0.900969 | -0.433884 | 0.900969 | 0.433884 | 0.866025 | 0.5 | 0.842942 | 0.538005 |
| 1 | b | -0.222521 | -0.974928 | 0.781831 | 0.623490 | 0.866025 | 0.5 | 0.833556 | 0.552435 |
| 2 | a | 0.623490 | -0.781831 | 0.623490 | 0.781831 | 0.866025 | 0.5 | 0.823923 | 0.566702 |

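
The values in the table above are consistent with mapping each date part to an angle on a circle and taking its cosine and sine, so that the end of a period sits next to its start. The sketch below uses only the standard library; the function name `cyclic_date_features` is hypothetical, the dates are assumed to be 3 to 5 February 2017, and the exact formula is inferred from the table rather than taken from the library.

```python
import math
from calendar import isleap, monthrange
from datetime import date


def cyclic_date_features(d: date) -> dict:
    """Encode weekday, day of month, month and day of year as (cos, sin) pairs."""

    def cos_sin(position, period):
        angle = 2 * math.pi * position / period
        return math.cos(angle), math.sin(angle)

    days_in_month = monthrange(d.year, d.month)[1]
    days_in_year = 366 if isleap(d.year) else 365
    day_of_year = d.timetuple().tm_yday

    parts = {
        "weekday": cos_sin(d.weekday(), 7),             # Monday = 0
        "day_month": cos_sin(d.day - 1, days_in_month),
        "month_year": cos_sin(d.month - 1, 12),
        "day_year": cos_sin(day_of_year - 1, days_in_year),
    }
    features = {}
    for name, (cos_value, sin_value) in parts.items():
        features[name + "_cos"] = cos_value
        features[name + "_sin"] = sin_value
    return features


first_row = cyclic_date_features(date(2017, 2, 3))
second_row = cyclic_date_features(date(2017, 2, 4))
```

Unlike the raw integer parts, this encoding keeps Sunday close to Monday and 31 December close to 1 January.
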
#### `cont_cat_split`(**`df`**, **`max_card`**=***`20`***, **`dep_var`**=***`None`***) → `Tuple`\[`List`\[`T`\], `List`\[`T`\]\]

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| 0 | 1 | a | 0.5 | ab |
| 1 | 2 | b | 1.2 | o |
| 2 | 3 | a | 7.5 | o |

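
One plausible heuristic behind such a split, sketched in plain pandas under stated assumptions: float columns are treated as continuous, integer columns are continuous only when they have more than `max_card` distinct values, and everything else is categorical. `split_cont_cat` is a hypothetical name, and the rule is an assumption suggested by the `max_card` parameter, not a statement of the library's exact behaviour.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def split_cont_cat(df, max_card=20, dep_var=None):
    """Return (continuous column names, categorical column names)."""
    cont_names, cat_names = [], []
    for name in df.columns:
        if name == dep_var:
            continue  # the dependent variable belongs to neither list
        col = df[name]
        high_card_int = is_integer_dtype(col) and col.nunique() > max_card
        if is_float_dtype(col) or high_card_int:
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names


df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "a"],
    "col3": [0.5, 1.2, 7.5],
    "col4": ["ab", "o", "o"],
})

cont_list, cat_list = split_cont_cat(df, max_card=20, dep_var="col4")
cont_low, cat_low = split_cont_cat(df, max_card=2, dep_var="col4")
```

With `max_card=20`, the integer column `col1` has only three distinct values and lands in the categorical list; lowering `max_card` to 2 moves it to the continuous list.
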