# Tabular data handling

This module defines the main class to handle tabular data in the fastai library: [`TabularDataBunch`](/tabular.data.html#TabularDataBunch). As always, there is also a helper function to quickly get your data.

To allow you to easily create a [`Learner`](/basic_train.html#Learner) for your data, it provides [`tabular_learner`](/tabular.data.html#tabular_learner).

In [None]:
from fastai.gen_doc.nbdoc import *
from fastai.tabular import * 
from fastai import *

In [None]:
show_doc(TabularDataBunch, doc_string=False)

<h2 id="TabularDataBunch"><code>class</code> <code>TabularDataBunch</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L102" class="source_link">[source]</a></h2>

> <code>TabularDataBunch</code>(`train_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `valid_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `test_dl`:`Optional`\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\]=`None`, `device`:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=`None`, `tfms`:`Optional`\[`Collection`\[`Callable`\]\]=`None`, `path`:`PathOrStr`=`'.'`, `collate_fn`:`Callable`=`'data_collate'`) :: [`DataBunch`](/basic_data.html#DataBunch)

The best way to quickly get your data in a [`DataBunch`](/basic_data.html#DataBunch) suitable for tabular data is to organize it in a dataframe and pick the indices of the rows to hold out for validation. Here we are interested in a subsample of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult), and we keep its last 2000 rows for validation.

In [None]:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
valid_idx = range(len(df)-2000, len(df))
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | >=50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | | Wife | White | Female | 0 | 1902 | 40 | United-States | 1 |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | 1 |
| 2 | 38 | Private | 96185 | HS-grad | | Divorced | | Unmarried | Black | Female | 0 | 0 | 32 | United-States | 0 |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | 1 |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | 0 |
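
To make the role of `valid_idx` concrete, here is a minimal pandas-only sketch of a split by row position. It does not use fastai: the toy dataframe and the names `toy`, `train_df` and `valid_df` are made up for illustration, and the library may perform the split differently internally.

```python
import pandas as pd

# Toy stand-in for the adult dataframe (values are invented).
toy = pd.DataFrame({"age": list(range(10)), ">=50k": [0, 1] * 5})

# Same pattern as above, but holding out the last 3 rows instead of 2000.
valid_idx = range(len(toy) - 3, len(toy))

# Rows at the validation positions go to valid_df, the rest to train_df.
valid_df = toy.iloc[list(valid_idx)]
train_df = toy.drop(toy.index[list(valid_idx)])

print(len(train_df), len(valid_df))
```

Holding out the tail of the dataframe is only a sensible validation set if the rows are not ordered in a way that correlates with the target.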


In [None]:
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
dep_var = '>=50k'

In [None]:
show_doc(TabularDataBunch.from_df, doc_string=False)

<h4 id="TabularDataBunch.from_df"><code>from_df</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L109" class="source_link">[source]</a></h4>

> <code>from_df</code>(`path`, `df`:`DataFrame`, `dep_var`:`str`, `valid_idx`:`Collection`\[`int`\], `procs`:`Optional`\[`Collection`\[[`TabularProc`](/tabular.transform.html#TabularProc)\]\]=`None`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `classes`:`Collection`=`None`, `kwargs`) → [`DataBunch`](/basic_data.html#DataBunch)

Creates a [`DataBunch`](/basic_data.html#DataBunch) in `path` from `df`, using the rows at `valid_idx` as the validation set. The dependent variable is `dep_var`, while the categorical and continuous variables are in the `cat_names` columns and `cont_names` columns respectively. If `cont_names` is None then we assume all variables that aren't dependent or categorical are continuous. The [`TabularProc`](/tabular.transform.html#TabularProc) in `procs` are applied to the dataframes as preprocessing, then the categories are replaced by their codes+1 (leaving 0 for `nan`) and the continuous variables are normalized. You can pass the `stats` to use for that step. If `log_output` is True, the dependent variable is replaced by its log.
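
The encoding described above can be sketched with plain pandas. This is an illustration under stated assumptions, not the library's implementation: the toy dataframe is invented, and the normalization here uses the sample mean and standard deviation of the whole column, which may differ from how the statistics are computed in fastai.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing category.
toy = pd.DataFrame({
    "workclass": ["Private", "Self-emp-inc", None, "Private"],
    "age": [49.0, 44.0, 38.0, 42.0],
    ">=50k": [1, 1, 0, 0],
})
cat_names = ["workclass"]
dep_var = ">=50k"

# Columns that are neither dependent nor categorical are treated as continuous.
cont_names = [c for c in toy.columns if c not in cat_names and c != dep_var]

# pandas gives missing values the code -1, so adding 1 maps them to 0
# and shifts the real categories to 1, 2, ...
codes = pd.Categorical(toy["workclass"]).codes.astype(np.int64) + 1

# Normalize the continuous column to zero mean and unit standard deviation.
mean, std = toy["age"].mean(), toy["age"].std()
age_norm = (toy["age"] - mean) / std
```

Reserving code 0 for missing values means an embedding layer can learn a dedicated vector for "unknown" rather than failing on unseen or absent categories.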

Note that the `procs` should be passed as `Callable` (the classes themselves, not instances): the actual initialization with `cat_names` and `cont_names` is done inside.
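
The pattern of passing classes and letting the callee instantiate them can be sketched as follows. `ToyProc`, `ToyFill`, `ToyNormalize` and `init_procs` are hypothetical names invented for this example; they only mimic the idea and are not part of fastai.

```python
class ToyProc:
    """Hypothetical stand-in for a tabular preprocessing step."""
    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

class ToyFill(ToyProc):
    pass

class ToyNormalize(ToyProc):
    pass

def init_procs(procs, cat_names, cont_names):
    # Each entry is a class (a Callable); it is instantiated here,
    # once the column names are known.
    return [p(cat_names, cont_names) for p in procs]

# The caller passes the classes, without parentheses.
procs = [ToyFill, ToyNormalize]
ready = init_procs(procs, ["workclass"], ["age"])
```

Deferring instantiation this way lets the caller list the steps without having to repeat the column names for each one.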

In [None]:
procs = [FillMissing, Categorify, Normalize]
data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)

You can then easily create a [`Learner`](/basic_train.html#Learner) for this data with [`tabular_learner`](/tabular.data.html#tabular_learner).

In [None]:
show_doc(tabular_learner)

<h4 id="tabular_learner"><code>tabular_learner</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L153" class="source_link">[source]</a></h4>

> <code>tabular_learner</code>(`data`:[`DataBunch`](/basic_data.html#DataBunch), `layers`:`Collection`\[`int`\], `emb_szs`:`Dict`\[`str`, `int`\]=`None`, `metrics`=`None`, `ps`:`Collection`\[`float`\]=`None`, `emb_drop`:`float`=`0.0`, `y_range`:`OptRange`=`None`, `use_bn`:`bool`=`True`, `kwargs`)

Get a [`Learner`](/basic_train.html#Learner) using `data`, with `metrics`, including a [`TabularModel`](/tabular.models.html#TabularModel) created using the remaining params.  

`emb_szs` is a `dict` mapping categorical column names to embedding sizes; you only need to pass sizes for columns where you want to override the default behaviour of the model.

In [None]:
show_doc(TabularList)

<h2 id="TabularList"><code>class</code> <code>TabularList</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L121" class="source_link">[source]</a></h2>

> <code>TabularList</code>(`items`:`Iterator`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `procs`=`None`, `kwargs`) → `TabularList` :: [`ItemList`](/data_block.html#ItemList)

Basic class to create a list of inputs in `items` for tabular data. `cat_names` and `cont_names` are the names of the categorical and the continuous variables respectively. `processor` will be applied to the inputs or one will be created from the transforms in `procs`.

In [None]:
show_doc(TabularList.from_df)

<h4 id="TabularList.from_df"><code>from_df</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L133" class="source_link">[source]</a></h4>

> <code>from_df</code>(`df`:`DataFrame`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `procs`=`None`, `kwargs`) → `ItemList`

Get the list of inputs from the rows of `df`, using `cat_names` and `cont_names` as the categorical and continuous columns.  

In [None]:
show_doc(TabularList.get_emb_szs)

<h4 id="TabularList.get_emb_szs"><code>get_emb_szs</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L146" class="source_link">[source]</a></h4>

> <code>get_emb_szs</code>(`sz_dict`)

Return the default embedding sizes suitable for this data, or take the ones in `sz_dict` for the columns it lists.  
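
The "default unless overridden" behaviour can be sketched in plain Python. Everything here is an assumption made for illustration: `toy_default_emb_sz` is a placeholder rule (half the cardinality, capped at 50) and not necessarily the rule fastai uses, `toy_emb_szs` is a hypothetical helper, and the cardinalities are invented.

```python
def toy_default_emb_sz(n_classes):
    # Placeholder rule (assumption): half the cardinality, capped at 50.
    return min(50, (n_classes + 1) // 2)

def toy_emb_szs(cardinalities, sz_dict=None):
    # Use the size from sz_dict when the column is listed there,
    # otherwise fall back to the default rule.
    sz_dict = sz_dict or {}
    return [(n, sz_dict.get(name, toy_default_emb_sz(n)))
            for name, n in cardinalities.items()]

# Invented number of classes per categorical column.
cardinalities = {"workclass": 10, "education": 17, "native-country": 43}

# Override only one column; the others keep their default size.
sizes = toy_emb_szs(cardinalities, {"native-country": 10})
print(sizes)
```

Each pair is (number of classes, embedding width), which is the shape of information an embedding layer needs for one categorical column.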

In [None]:
show_doc(TabularLine, doc_string=False)

<h2 id="TabularLine"><code>class</code> <code>TabularLine</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L22" class="source_link">[source]</a></h2>

> <code>TabularLine</code>(`cats`, `conts`, `classes`, `names`) :: [`ItemBase`](/core.html#ItemBase)

An object that will contain the encoded `cats`, the continuous variables `conts`, the `classes` and the `names` of the columns. This is the basic input for a dataset dealing with tabular data.
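
As a rough mental model, such an item can be pictured as a small record that pairs encoded values with the information needed to decode them. The sketch below is hypothetical: `ToyTabularLine` is not the library class, and it assumes that `names` lists the categorical columns first, then the continuous ones, and that index 0 of each class list is a placeholder for missing values (written `#na#` here).

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ToyTabularLine:
    cats: List[int]                 # encoded categorical values
    conts: List[float]              # continuous values
    classes: Dict[str, List[str]]   # per-column list of category labels
    names: List[str]                # column names: categorical, then continuous

    def decoded(self):
        # Split the names according to how many categorical codes we hold.
        cat_cols = self.names[:len(self.cats)]
        cont_cols = self.names[len(self.cats):]
        # Map each code back to its label, then append the continuous values.
        out = {c: self.classes[c][i] for c, i in zip(cat_cols, self.cats)}
        out.update(dict(zip(cont_cols, self.conts)))
        return out

line = ToyTabularLine(
    cats=[1, 0],
    conts=[0.5],
    classes={"workclass": ["#na#", "Private"], "sex": ["#na#", "Female", "Male"]},
    names=["workclass", "sex", "age"],
)
print(line.decoded())
```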

In [None]:
show_doc(TabularLine.show_xys)

<h4 id="TabularLine.show_xys"><code>show_xys</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L35" class="source_link">[source]</a></h4>

> <code>show_xys</code>(`xs`, `ys`)

Show the `xs` and `ys`.  

In [None]:
show_doc(TabularLine.show_xyzs)

<h4 id="TabularLine.show_xyzs"><code>show_xyzs</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L47" class="source_link">[source]</a></h4>

> <code>show_xyzs</code>(`xs`, `ys`, `zs`)

Show `xs` (inputs), `ys` (targets) and `zs` (predictions).  

In [None]:
show_doc(TabularProcessor)

<h2 id="TabularProcessor"><code>class</code> <code>TabularProcessor</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L59" class="source_link">[source]</a></h2>

> <code>TabularProcessor</code>(`ds`:[`ItemBase`](/core.html#ItemBase)=`None`, `procs`=`None`) :: [`PreProcessor`](/data_block.html#PreProcessor)

Create a [`PreProcessor`](/data_block.html#PreProcessor) from `procs`.

## Undocumented Methods - Methods moved below this line will intentionally be hidden

In [None]:
show_doc(TabularProcessor.process_one)

<h4 id="TabularProcessor.process_one"><code>process_one</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L64" class="source_link">[source]</a></h4>

> <code>process_one</code>(`item`)

In [None]:
show_doc(TabularList.new)

<h4 id="TabularList.new"><code>new</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L138" class="source_link">[source]</a></h4>

> <code>new</code>(`items`:`Iterator`, `kwargs`) → `TabularList`

In [None]:
show_doc(TabularList.get)

<h4 id="TabularList.get"><code>get</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L141" class="source_link">[source]</a></h4>

> <code>get</code>(`o`)

In [None]:
show_doc(TabularProcessor.process)

<h4 id="TabularProcessor.process"><code>process</code><a href="https://github.com/fastai/fastai/blob/master/fastai/tabular/data.py#L77" class="source_link">[source]</a></h4>

> <code>process</code>(`ds`)
