{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tabular data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *\n", "from fastai.tabular.models import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[`tabular`](/tabular.html#tabular) contains all the necessary classes to deal with tabular data, across two modules:\n", "- [`tabular.transform`](/tabular.transform.html#tabular.transform): defines the [`TabularTransform`](/tabular.transform.html#TabularTransform) class to help with preprocessing;\n", "- [`tabular.data`](/tabular.data.html#tabular.data): defines the [`TabularDataset`](/tabular.data.html#TabularDataset) that handles that data, as well as the methods to quickly get a [`TabularDataBunch`](/tabular.data.html#TabularDataBunch).\n", "\n", "To create a model, you'll need to use [`models.tabular`](/tabular.html#tabular). See below for an end-to-end example using all these modules." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing tabular data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's import everything we need for the tabular application." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from fastai import *\n", "from fastai.tabular import * " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tabular data usually comes in the form of a delimited file (such as .csv) containing variables of different kinds: text/category, numbers, and perhaps some missing values. The example we'll work with in this section is a sample of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult) which has some census information on individuals. We'll use it to train a model to predict whether salary is greater than \\$50k or not." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PosixPath('/home/ubuntu/fastai/fastai/../data/adult_sample')" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.ADULT_SAMPLE)\n", "path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | >=50k |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | 1 |\n",
"| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | 1 |\n",
"| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | 0 |\n",
"| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | 1 |\n",
"| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | 0 |\n", "