{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Tabular data preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.tabular import *\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This package contains the basic class to define a transformation for preprocessing dataframes of tabular data, as well as a few basic [`TabularProc`](/tabular.transform.html#TabularProc) subclasses. Preprocessing includes things like\n", "- replacing non-numerical variables by categories, then their ids,\n", "- filling missing values,\n", "- normalizing continuous variables.\n", "\n", "In all those steps we have to be careful to apply the correspondence we decide on our training set (which id we give to each category, which value we use for missing data, or which mean/std we use to normalize) to our validation or test set. To deal with this, we use a special class called [`TabularProc`](/tabular.transform.html#TabularProc).\n", "\n", "The data used on this page is a subset of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). It gives some information about individuals, used to train a model to predict whether their salary is greater than \\$50k or not." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
class TabularProc(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`)
__call__(**`df`**:`DataFrame`, **`test`**:`bool`=***`False`***)
apply_train(**`df`**:`DataFrame`)
apply_test(**`df`**:`DataFrame`)
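The signatures above suggest a simple pattern: `__call__` picks `apply_train` or `apply_test` depending on `test`, so that whatever is decided on the training set can be reused elsewhere. The following is a minimal sketch of that pattern in plain pandas, not fastai's own code. It assumes the dataframe is modified in place and that learned state is stored on the object; `SketchProc` and `ClipToTrainRange` are hypothetical names introduced only for illustration.

```python
import pandas as pd


class SketchProc:
    "Hypothetical stand-in for a TabularProc-like base class (not fastai's code)."

    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = list(cat_names), list(cont_names)

    def __call__(self, df, test=False):
        # Dispatch on `test`: learn state on the training set, reuse it elsewhere.
        func = self.apply_test if test else self.apply_train
        func(df)

    def apply_train(self, df):
        raise NotImplementedError

    def apply_test(self, df):
        # Assumed default: treat the test set exactly like the training set.
        self.apply_train(df)


class ClipToTrainRange(SketchProc):
    "Hypothetical example: clip continuous columns to the range seen in training."

    def apply_train(self, df):
        # Remember the min/max of each continuous column.
        self.ranges = {n: (df[n].min(), df[n].max()) for n in self.cont_names}

    def apply_test(self, df):
        # Reuse the stored training ranges instead of recomputing them.
        for n, (lo, hi) in self.ranges.items():
            df[n] = df[n].clip(lo, hi)


train_df = pd.DataFrame({"age": [25, 38, 49]})
valid_df = pd.DataFrame({"age": [17, 40, 90]})

proc = ClipToTrainRange(cat_names=[], cont_names=["age"])
proc(train_df)             # learns min/max from the training set
proc(valid_df, test=True)  # applies them to the validation set
print(valid_df["age"].tolist())
```

The validation values outside the training range (17 and 90) end up clipped to the training minimum and maximum, which is the kind of train-to-test correspondence the overview describes.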
class Categorify(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`) :: [`TabularProc`](/tabular.transform.html#TabularProc)

apply_train(**`df`**:`DataFrame`)
apply_test(**`df`**:`DataFrame`)
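To make the "categories, then their ids" idea from the overview concrete, here is a minimal sketch using pandas categoricals directly. It is an illustration of the idea under stated assumptions, not necessarily how `Categorify` is implemented: the toy `workclass` values and the choice of ordered categories are assumptions.

```python
import pandas as pd

train_df = pd.DataFrame({"workclass": ["Private", "Self-emp-inc", "Private"]})
valid_df = pd.DataFrame({"workclass": ["Self-emp-inc", "Never-worked"]})

# Training set: decide the categories (and hence the ids) once.
train_df["workclass"] = train_df["workclass"].astype("category").cat.as_ordered()
categories = train_df["workclass"].cat.categories

# Validation set: reuse the training categories.
# Values never seen in training become missing, which pandas encodes as -1.
valid_df["workclass"] = pd.Categorical(
    valid_df["workclass"], categories=categories, ordered=True
)

print(list(categories))
print(train_df["workclass"].cat.codes.tolist())
print(valid_df["workclass"].cat.codes.tolist())
```

Because the validation column is built from the training categories, "Self-emp-inc" gets the same id in both dataframes, while "Never-worked", which is absent from the training set, is treated as missing.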
class FillMissing(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`, **`fill_strategy`**:[`FillStrategy`](/tabular.transform.html#FillStrategy)=…)
apply_train(**`df`**:`DataFrame`)
apply_test(**`df`**:`DataFrame`)
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 49 | 101320 | 12.0 | 0 | 1902 | 40 |
| 1 | 44 | 236746 | 14.0 | 10520 | 0 | 45 |
| 2 | 38 | 96185 | NaN | 0 | 0 | 32 |
| 3 | 38 | 112847 | 15.0 | 0 | 0 | 40 |
| 4 | 42 | 82297 | NaN | 0 | 0 | 50 |
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 49 | 101320 | 12.0 | 0 | 1902 | 40 |
| 1 | 44 | 236746 | 14.0 | 10520 | 0 | 45 |
| 2 | 38 | 96185 | 10.0 | 0 | 0 | 32 |
| 3 | 38 | 112847 | 15.0 | 0 | 0 | 40 |
| 4 | 42 | 82297 | 10.0 | 0 | 0 | 50 |
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| 800 | 45 | 96975 | 10.0 | 0 | 0 | 40 |
| 801 | 46 | 192779 | 10.0 | 15024 | 0 | 60 |
| 802 | 36 | 376455 | 10.0 | 0 | 0 | 38 |
| 803 | 25 | 50053 | 10.0 | 0 | 0 | 45 |
| 804 | 37 | 164526 | 10.0 | 0 | 0 | 40 |
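In the tables above, the `NaN` entries of `education-num` are replaced by a single value (10.0). A minimal plain-pandas sketch of a median fill that produces this kind of result is given below. It is an illustration only: the toy numbers are made up, and the assumption that the fill value is the training median comes from the test names in this page rather than from the code itself.

```python
import numpy as np
import pandas as pd

cont_names = ["education-num"]
train_df = pd.DataFrame({"education-num": [12.0, 14.0, np.nan, 10.0, np.nan]})
valid_df = pd.DataFrame({"education-num": [np.nan, 9.0]})

# Training set: compute one fill value per column that has missing entries.
fill_values = {}
for name in cont_names:
    if train_df[name].isna().any():
        fill_values[name] = train_df[name].median()  # NaN is skipped by default
        train_df[name] = train_df[name].fillna(fill_values[name])

# Validation set: reuse the training fill values rather than recomputing them.
for name, value in fill_values.items():
    valid_df[name] = valid_df[name].fillna(value)

print(fill_values)
print(train_df["education-num"].tolist())
print(valid_df["education-num"].tolist())
```

The validation median is never computed here; the missing validation entry receives the value stored from the training set.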
Enum = [MEDIAN, COMMON, CONSTANT]

class Normalize(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`) :: [`TabularProc`](/tabular.transform.html#TabularProc)
apply_train(**`df`**:`DataFrame`)
apply_test(**`df`**:`DataFrame`)
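The overview mentions reusing the training mean/std when normalizing the validation or test set. A minimal sketch of that idea in plain pandas follows. The small `eps` added to the standard deviation is an assumption made here to avoid dividing by zero on constant columns, and the toy `age` values are made up.

```python
import pandas as pd

cont_names = ["age"]
train_df = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
valid_df = pd.DataFrame({"age": [30.0, 50.0]})

# Training set: compute the statistics once.
means = train_df[cont_names].mean()
stds = train_df[cont_names].std()  # pandas uses the sample std (ddof=1)
eps = 1e-7                         # assumed guard against zero std

# Both sets are normalized with the *training* statistics.
train_df[cont_names] = (train_df[cont_names] - means) / (eps + stds)
valid_df[cont_names] = (valid_df[cont_names] - means) / (eps + stds)

print(train_df["age"].round(3).tolist())
print(valid_df["age"].round(3).tolist())
```

After this step the training column has approximately zero mean and unit standard deviation, while the validation column generally does not, since it is expressed in the training set's units.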
add_datepart(**`df`**:`DataFrame`, **`field_name`**:`str`, **`prefix`**:`str`=***`None`***, **`drop`**:`bool`=***`True`***, **`time`**:`bool`=***`False`***)
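Judging from its signature, `add_datepart` expands a date column into several derived columns. The sketch below shows one way this could look in plain pandas. It is not fastai's implementation: the function name `add_datepart_sketch`, the set of generated columns, and the choice of using the field name as the default prefix are all assumptions made for illustration.

```python
import pandas as pd


def add_datepart_sketch(df, field_name, prefix=None, drop=True, time=False):
    "Hypothetical helper: add date-derived columns to `df` in place."
    field = pd.to_datetime(df[field_name])
    # Assumption: fall back to the field name when no prefix is given.
    prefix = field_name if prefix is None else prefix
    attrs = ["year", "month", "day", "dayofweek", "dayofyear"]
    if time:
        attrs += ["hour", "minute", "second"]
    for attr in attrs:
        df[prefix + attr.capitalize()] = getattr(field.dt, attr)
    if drop:
        # Remove the original date column once its parts are extracted.
        df.drop(columns=field_name, inplace=True)
    return df


df = pd.DataFrame({"sold": ["2019-12-04", "2019-11-29"]})
add_datepart_sketch(df, "sold")
print(df.columns.tolist())
print(df)
```

With `drop=True` the original `sold` column is removed, and with `time=True` the hour, minute and second would be added as well.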
cont_cat_split(**`df`**, **`max_card`**=***`20`***, **`dep_var`**=***`None`***) → `Tuple`\[`List`\[`T`\], `List`\[`T`\]\]

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| 0 | 1 | a | 0.5 | ab |
| 1 | 2 | b | 1.2 | o |
| 2 | 3 | a | 7.5 | o |
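The signature of `cont_cat_split` suggests it splits column names into a continuous list and a categorical list, using `max_card` as a cardinality threshold. The sketch below applies one plausible rule to the small dataframe above. The rule itself (float columns are continuous, integer columns are continuous only when they have more than `max_card` distinct values, everything else is categorical) is an assumption, and `cont_cat_split_sketch` is a hypothetical name.

```python
import pandas as pd


def cont_cat_split_sketch(df, max_card=20, dep_var=None):
    "Hypothetical helper: return (cont_names, cat_names) for the columns of `df`."
    cont_names, cat_names = [], []
    for label in df.columns:
        if label == dep_var:
            continue  # the dependent variable belongs to neither list
        col = df[label]
        is_float = pd.api.types.is_float_dtype(col)
        is_int = pd.api.types.is_integer_dtype(col)
        if is_float or (is_int and col.nunique() > max_card):
            cont_names.append(label)
        else:
            cat_names.append(label)
    return cont_names, cat_names


df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "a"],
    "col3": [0.5, 1.2, 7.5],
    "col4": ["ab", "o", "o"],
})

print(cont_cat_split_sketch(df, max_card=20))
print(cont_cat_split_sketch(df, max_card=2))
```

Under this rule, `col1` is treated as categorical with the default threshold because it has only three distinct integer values, and becomes continuous once `max_card` is lowered below that count.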