{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tabular data handling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This module defines the main class to handle tabular data in the fastai library: [`TabularDataset`](/tabular.data.html#TabularDataset). As always, there is also a helper function to quickly get your data.\n", "\n", "To allow you to easily create a [`Learner`](/basic_train.html#Learner) for your data, it provides [`get_tabular_learner`](/tabular.data.html#get_tabular_learner)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.tabular import * \n", "from fastai import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class TabularDataBunch[source]

\n", "\n", "> TabularDataBunch(`train_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `valid_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `test_dl`:`Optional`\\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\\]=`None`, `device`:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=`None`, `tfms`:`Optional`\\[`Collection`\\[`Callable`\\]\\]=`None`, `path`:`PathOrStr`=`'.'`, `collate_fn`:`Callable`=`'data_collate'`) :: [`DataBunch`](/basic_data.html#DataBunch)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularDataBunch, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best way to quickly get your data in a [`DataBunch`](/basic_data.html#DataBunch) suitable for tabular data is to organize it in two (or three) dataframes. One for training, one for validation, and if you have it, one for testing. Here we are interested in a subsample of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ageworkclassfnlwgteducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-country>=50k
049Private101320Assoc-acdm12.0Married-civ-spouseNaNWifeWhiteFemale0190240United-States1
144Private236746Masters14.0DivorcedExec-managerialNot-in-familyWhiteMale10520045United-States1
238Private96185HS-gradNaNDivorcedNaNUnmarriedBlackFemale0032United-States0
338Self-emp-inc112847Prof-school15.0Married-civ-spouseProf-specialtyHusbandAsian-Pac-IslanderMale0040United-States1
442Self-emp-not-inc822977th-8thNaNMarried-civ-spouseOther-serviceWifeBlackFemale0050United-States0
\n", "
" ], "text/plain": [ " age workclass fnlwgt education education-num \\\n", "0 49 Private 101320 Assoc-acdm 12.0 \n", "1 44 Private 236746 Masters 14.0 \n", "2 38 Private 96185 HS-grad NaN \n", "3 38 Self-emp-inc 112847 Prof-school 15.0 \n", "4 42 Self-emp-not-inc 82297 7th-8th NaN \n", "\n", " marital-status occupation relationship race \\\n", "0 Married-civ-spouse NaN Wife White \n", "1 Divorced Exec-managerial Not-in-family White \n", "2 Divorced NaN Unmarried Black \n", "3 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander \n", "4 Married-civ-spouse Other-service Wife Black \n", "\n", " sex capital-gain capital-loss hours-per-week native-country >=50k \n", "0 Female 0 1902 40 United-States 1 \n", "1 Male 10520 0 45 United-States 1 \n", "2 Female 0 0 32 United-States 0 \n", "3 Male 0 0 40 United-States 1 \n", "4 Female 0 0 50 United-States 0 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.ADULT_SAMPLE)\n", "df = pd.read_csv(path/'adult.csv')\n", "train_df, valid_df = df[:800].copy(),df[800:].copy()\n", "train_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']\n", "dep_var = '>=50k'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_df[source]

\n", "\n", "> from_df(`path`, `train_df`:`DataFrame`, `valid_df`:`DataFrame`, `dep_var`:`str`, `test_df`:`OptDataFrame`=`None`, `tfms`:`Optional`\\[`Collection`\\[[`TabularTransform`](/tabular.transform.html#TabularTransform)\\]\\]=`None`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `stats`:`OptStats`=`None`, `log_output`:`bool`=`False`, `kwargs`) → [`DataBunch`](/basic_data.html#DataBunch)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularDataBunch.from_df, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creates a [`DataBunch`](/basic_data.html#DataBunch) in `path` from `train_df`, `valid_df` and optionally `test_df`. The dependent variable is `dep_var`, while the categorical and continuous variables are in the `cat_names` columns and `cont_names` columns respectively. If `cont_names` is None then we assume all variables that aren't dependent or categorical are continuous. The [`TabularTransform`](/tabular.transform.html#TabularTransform) in `tfms` are applied to the dataframes as preprocessing, then the categories are replaced by their codes+1 (leaving 0 for `nan`) and the continuous variables are normalized. You can pass the `stats` to use for that step. If `log_output` is True, the dependant variable is replaced by its log.\n", "\n", "Note that the transforms should be passed as `Callable`: the actual initialization with `cat_names` and `cont_names` is done inside." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tfms = [FillMissing, Categorify]\n", "data = TabularDataBunch.from_df(path, train_df, valid_df, dep_var=dep_var, tfms=tfms, cat_names=cat_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " You can then easily create a [`Learner`](/basic_train.html#Learner) for this data with [`get_tabular_learner`](/tabular.data.html#get_tabular_learner)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_tabular_learner[source]

\n", "\n", "> get_tabular_learner(`data`:[`DataBunch`](/basic_data.html#DataBunch), `layers`:`Collection`\\[`int`\\], `emb_szs`:`Dict`\\[`str`, `int`\\]=`None`, `metrics`=`None`, `ps`:`Collection`\\[`float`\\]=`None`, `emb_drop`:`float`=`0.0`, `y_range`:`OptRange`=`None`, `use_bn`:`bool`=`True`, `kwargs`)\n", "\n", "Get a [`Learner`](/basic_train.html#Learner) using `data`, with `metrics`, including a [`TabularModel`](/tabular.models.html#TabularModel) created using the remaining params. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(get_tabular_learner)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`emb_szs` is a `dict` mapping categorical column names to embedding sizes; you only need to pass sizes for columns where you want to override the default behaviour of the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class TabularDataset[source]

\n", "\n", "> TabularDataset(`df`:`DataFrame`, `dep_var`:`str`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `stats`:`OptStats`=`None`, `log_output`:`bool`=`False`) :: [`DatasetBase`](/basic_data.html#DatasetBase)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularDataset, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A dataset from `DataFrame` `df` with the dependent being the `dep_var` column, while the categorical and continuous variables are in the `cat_names` columns and `cont_names` columns respectively. Categories are replaced by their codes+1 (leaving 0 for `nan`) and the continuous variables are normalized. You can pass the `stats` to use for normalization; if none, then will be calculated from your data. If the flag `log_output` is True, the dependant variable is replaced by its log." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_dataframe[source]

\n", "\n", "> from_dataframe(`df`:`DataFrame`, `dep_var`:`str`, `tfms`:`Optional`\\[`Collection`\\[[`TabularTransform`](/tabular.transform.html#TabularTransform)\\]\\]=`None`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `stats`:`OptStats`=`None`, `log_output`:`bool`=`False`) → `TabularDataset`" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularDataset.from_dataframe, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Factory method to create a [`TabularDataset`](/tabular.data.html#TabularDataset) from `df`. The only difference from the constructor is that it gets a list `tfms` of `TabularTfm` that it applied before passing the dataframe to the class initialization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Undocumented Methods - Methods moved below this line will intentionally be hidden" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Methods - Please document or move to the undocumented section" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_emb_szs[source]

\n", "\n", "> get_emb_szs(`sz_dict`)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(TabularDataset.get_emb_szs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "jekyll": { "keywords": "fastai", "summary": "Base class to deal with tabular data and get a DataBunch", "title": "tabular.data" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }