{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Tabular data handling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This module defines the main class to handle tabular data in the fastai library: [`TabularDataBunch`](/tabular.data.html#TabularDataBunch). As always, there is also a helper function to quickly get your data.\n", "\n", "To allow you to easily create a [`Learner`](/basic_train.html#Learner) for your data, it provides [`tabular_learner`](/tabular.data.html#tabular_learner)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.tabular import * \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class TabularDataBunch

\n", "\n", "> TabularDataBunch(**`train_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`valid_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`fix_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=***`None`***, **`test_dl`**:`Optional`\\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`dl_tfms`**:`Optional`\\[`Collection`\\[`Callable`\\]\\]=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) :: [`DataBunch`](/basic_data.html#DataBunch)\n", "\n", "
\n", "\n", "Create a [`DataBunch`](/basic_data.html#DataBunch) suitable for tabular data. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularDataBunch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The quickest way to get your data into a [`DataBunch`](/basic_data.html#DataBunch) suitable for tabular data is to organize it in a single dataframe that holds both the training and validation rows, together with the indices of the validation rows; if you have a test set, put it in a second dataframe. Here we use a subsample of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>workclass</th><th>fnlwgt</th><th>education</th><th>education-num</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>capital-gain</th><th>capital-loss</th><th>hours-per-week</th><th>native-country</th><th>salary</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>49</td><td>Private</td><td>101320</td><td>Assoc-acdm</td><td>12.0</td><td>Married-civ-spouse</td><td>NaN</td><td>Wife</td><td>White</td><td>Female</td><td>0</td><td>1902</td><td>40</td><td>United-States</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>1</th><td>44</td><td>Private</td><td>236746</td><td>Masters</td><td>14.0</td><td>Divorced</td><td>Exec-managerial</td><td>Not-in-family</td><td>White</td><td>Male</td><td>10520</td><td>0</td><td>45</td><td>United-States</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>2</th><td>38</td><td>Private</td><td>96185</td><td>HS-grad</td><td>NaN</td><td>Divorced</td><td>NaN</td><td>Unmarried</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>32</td><td>United-States</td><td>&lt;50k</td></tr>\n",
"    <tr><th>3</th><td>38</td><td>Self-emp-inc</td><td>112847</td><td>Prof-school</td><td>15.0</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Husband</td><td>Asian-Pac-Islander</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>4</th><td>42</td><td>Self-emp-not-inc</td><td>82297</td><td>7th-8th</td><td>NaN</td><td>Married-civ-spouse</td><td>Other-service</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>50</td><td>United-States</td><td>&lt;50k</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " age workclass fnlwgt education education-num \\\n", "0 49 Private 101320 Assoc-acdm 12.0 \n", "1 44 Private 236746 Masters 14.0 \n", "2 38 Private 96185 HS-grad NaN \n", "3 38 Self-emp-inc 112847 Prof-school 15.0 \n", "4 42 Self-emp-not-inc 82297 7th-8th NaN \n", "\n", " marital-status occupation relationship race \\\n", "0 Married-civ-spouse NaN Wife White \n", "1 Divorced Exec-managerial Not-in-family White \n", "2 Divorced NaN Unmarried Black \n", "3 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander \n", "4 Married-civ-spouse Other-service Wife Black \n", "\n", " sex capital-gain capital-loss hours-per-week native-country salary \n", "0 Female 0 1902 40 United-States >=50k \n", "1 Male 10520 0 45 United-States >=50k \n", "2 Female 0 0 32 United-States <50k \n", "3 Male 0 0 40 United-States >=50k \n", "4 Female 0 0 50 United-States <50k " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.ADULT_SAMPLE)\n", "df = pd.read_csv(path/'adult.csv')\n", "valid_idx = range(len(df)-2000, len(df))\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']\n", "dep_var = 'salary'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The initialization of [`TabularDataBunch`](/tabular.data.html#TabularDataBunch) is the same as [`DataBunch`](/basic_data.html#DataBunch) so you really want to use the factory method instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_df

\n", "\n", "> from_df(**`path`**, **`df`**:`DataFrame`, **`dep_var`**:`str`, **`valid_idx`**:`Collection`\\[`int`\\], **`procs`**:`Optional`\\[`Collection`\\[[`TabularProc`](/tabular.transform.html#TabularProc)\\]\\]=***`None`***, **`cat_names`**:`OptStrList`=***`None`***, **`cont_names`**:`OptStrList`=***`None`***, **`classes`**:`Collection`\\[`T_co`\\]=***`None`***, **`test_df`**=***`None`***, **`bs`**:`int`=***`64`***, **`val_bs`**:`int`=***`None`***, **`num_workers`**:`int`=***`4`***, **`dl_tfms`**:`Optional`\\[`Collection`\\[`Callable`\\]\\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) → [`DataBunch`](/basic_data.html#DataBunch)\n", "\n", "
Tests found for from_df:

Some other tests where from_df is used:

  • pytest -sv tests/test_tabular_data.py::test_from_df

\n", "\n", "Create a [`DataBunch`](/basic_data.html#DataBunch) from `df` and `valid_idx` with `dep_var`. `kwargs` are passed to [`DataBunch.create`](/basic_data.html#DataBunch.create). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularDataBunch.from_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, use `test_df` for the test set. The dependent variable is `dep_var`, while the categorical and continuous variables are in the `cat_names` and `cont_names` columns respectively. If `cont_names` is `None`, all variables that aren't dependent or categorical are assumed to be continuous. The [`TabularProc`](/tabular.transform.html#TabularProc) in `procs` are applied to the dataframes as preprocessing, then the categories are replaced by their codes+1 (leaving 0 for `nan`) and the continuous variables are normalized. \n", "\n", "Note that each [`TabularProc`](/tabular.transform.html#TabularProc) should be passed as a `Callable` (the class itself, not an instance): the actual initialization with `cat_names` and `cont_names` is done during the preprocessing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "procs = [FillMissing, Categorify, Normalize]\n", "data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can then easily create a [`Learner`](/basic_train.html#Learner) for this data with [`tabular_learner`](/tabular.data.html#tabular_learner)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

tabular_learner

\n", "\n", "> tabular_learner(**`data`**:[`DataBunch`](/basic_data.html#DataBunch), **`layers`**:`Collection`\\[`int`\\], **`emb_szs`**:`Dict`\\[`str`, `int`\\]=***`None`***, **`metrics`**=***`None`***, **`ps`**:`Collection`\\[`float`\\]=***`None`***, **`emb_drop`**:`float`=***`0.0`***, **`y_range`**:`OptRange`=***`None`***, **`use_bn`**:`bool`=***`True`***, **\\*\\*`learn_kwargs`**)\n", "\n", "
\n", "\n", "Get a [`Learner`](/basic_train.html#Learner) using `data`, with `metrics`, including a [`TabularModel`](/tabular.models.html#TabularModel) created using the remaining params. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(tabular_learner)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`emb_szs` is a `dict` mapping categorical column names to embedding sizes; you only need to pass sizes for columns where you want to override the default behaviour of the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class TabularList

\n", "\n", "> TabularList(**`items`**:`Iterator`\\[`T_co`\\], **`cat_names`**:`OptStrList`=***`None`***, **`cont_names`**:`OptStrList`=***`None`***, **`procs`**=***`None`***, **\\*\\*`kwargs`**) → `TabularList` :: [`ItemList`](/data_block.html#ItemList)\n", "\n", "
Tests found for TabularList:

Some other tests where TabularList is used:

  • pytest -sv tests/test_tabular_data.py::test_from_df

\n", "\n", "Basic [`ItemList`](/data_block.html#ItemList) for tabular data. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularList)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Basic class to create a list of inputs in `items` for tabular data. `cat_names` and `cont_names` are the names of the categorical and the continuous variables respectively. `processor` will be applied to the inputs or one will be created from the transforms in `procs`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_df

\n", "\n", "> from_df(**`df`**:`DataFrame`, **`cat_names`**:`OptStrList`=***`None`***, **`cont_names`**:`OptStrList`=***`None`***, **`procs`**=***`None`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "
Tests found for from_df:

  • pytest -sv tests/test_tabular_data.py::test_from_df

\n", "\n", "Get the list of inputs in the `col` of `path/csv_name`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularList.from_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_emb_szs

\n", "\n", "> get_emb_szs(**`sz_dict`**=***`None`***)\n", "\n", "
\n", "\n", "Return the default embedding sizes suitable for this data or takes the ones in `sz_dict`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularList.get_emb_szs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

show_xys

\n", "\n", "> show_xys(**`xs`**, **`ys`**)\n", "\n", "
\n", "\n", "Show the `xs` (inputs) and `ys` (targets). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularList.show_xys)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

show_xyzs

\n", "\n", "> show_xyzs(**`xs`**, **`ys`**, **`zs`**)\n", "\n", "
\n", "\n", "Show `xs` (inputs), `ys` (targets) and `zs` (predictions). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularList.show_xyzs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class TabularLine

\n", "\n", "> TabularLine(**`cats`**, **`conts`**, **`classes`**, **`names`**) :: [`ItemBase`](/core.html#ItemBase)\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularLine, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An object that will contain the encoded `cats`, the continuous variables `conts`, the `classes` and the `names` of the columns. This is the basic input for a dataset dealing with tabular data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class TabularProcessor

\n", "\n", "> TabularProcessor(**`ds`**:[`ItemBase`](/core.html#ItemBase)=***`None`***, **`procs`**=***`None`***) :: [`PreProcessor`](/data_block.html#PreProcessor)\n", "\n", "
\n", "\n", "Regroup the `procs` in one [`PreProcessor`](/data_block.html#PreProcessor). " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularProcessor)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a [`PreProcessor`](/data_block.html#PreProcessor) from `procs`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Undocumented Methods - Methods moved below this line will intentionally be hidden" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process_one

\n", "\n", "> process_one(**`item`**)\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularProcessor.process_one)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

new

\n", "\n", "> new(**`items`**:`Iterator`\\[`T_co`\\], **`processor`**:`Union`\\[[`PreProcessor`](/data_block.html#PreProcessor), `Collection`\\[[`PreProcessor`](/data_block.html#PreProcessor)\\]\\]=***`None`***, **\\*\\*`kwargs`**) → `ItemList`\n", "\n", "
\n", "\n", "Create a new [`ItemList`](/data_block.html#ItemList) from `items`, keeping the same attributes. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularList.new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

get

\n", "\n", "> get(**`o`**)\n", "\n", "
\n", "\n", "Subclass if you want to customize how to create item `i` from `self.items`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularList.get)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

process

\n", "\n", "> process(**`ds`**)\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularProcessor.process)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct

\n", "\n", "> reconstruct(**`t`**:`Tensor`)\n", "\n", "
\n", "\n", "Reconstruct one of the underlying item for its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularList.reconstruct)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Methods - Please document or move to the undocumented section" ] } ], "metadata": { "jekyll": { "keywords": "fastai", "summary": "Base class to deal with tabular data and get a DataBunch", "title": "tabular.data" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }