{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#hide\n", "#skip\n", "! [ -e /content ] && pip install -Uqq fastai # upgrade fastai on colab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tabular training\n", "\n", "> How to use the tabular application in fastai" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the tabular application, we will use the example of the [Adult dataset](https://archive.ics.uci.edu/ml/datasets/Adult) where we have to predict if a person is earning more or less than $50k per year using some general data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.tabular.all import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can download a sample of this dataset with the usual `untar_data` command:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(#3) [Path('/home/ml1/.fastai/data/adult_sample/models'),Path('/home/ml1/.fastai/data/adult_sample/export.pkl'),Path('/home/ml1/.fastai/data/adult_sample/adult.csv')]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.ADULT_SAMPLE)\n", "path.ls()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can have a look at how the data is structured:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>age</th><th>workclass</th><th>fnlwgt</th><th>education</th><th>education-num</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>capital-gain</th><th>capital-loss</th><th>hours-per-week</th><th>native-country</th><th>salary</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>49</td><td>Private</td><td>101320</td><td>Assoc-acdm</td><td>12.0</td><td>Married-civ-spouse</td><td>NaN</td><td>Wife</td><td>White</td><td>Female</td><td>0</td><td>1902</td><td>40</td><td>United-States</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>1</th><td>44</td><td>Private</td><td>236746</td><td>Masters</td><td>14.0</td><td>Divorced</td><td>Exec-managerial</td><td>Not-in-family</td><td>White</td><td>Male</td><td>10520</td><td>0</td><td>45</td><td>United-States</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>2</th><td>38</td><td>Private</td><td>96185</td><td>HS-grad</td><td>NaN</td><td>Divorced</td><td>NaN</td><td>Unmarried</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>32</td><td>United-States</td><td>&lt;50k</td></tr>\n",
"    <tr><th>3</th><td>38</td><td>Self-emp-inc</td><td>112847</td><td>Prof-school</td><td>15.0</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Husband</td><td>Asian-Pac-Islander</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>4</th><td>42</td><td>Self-emp-not-inc</td><td>82297</td><td>7th-8th</td><td>NaN</td><td>Married-civ-spouse</td><td>Other-service</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>50</td><td>United-States</td><td>&lt;50k</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " age workclass fnlwgt education education-num \\\n", "0 49 Private 101320 Assoc-acdm 12.0 \n", "1 44 Private 236746 Masters 14.0 \n", "2 38 Private 96185 HS-grad NaN \n", "3 38 Self-emp-inc 112847 Prof-school 15.0 \n", "4 42 Self-emp-not-inc 82297 7th-8th NaN \n", "\n", " marital-status occupation relationship race \\\n", "0 Married-civ-spouse NaN Wife White \n", "1 Divorced Exec-managerial Not-in-family White \n", "2 Divorced NaN Unmarried Black \n", "3 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander \n", "4 Married-civ-spouse Other-service Wife Black \n", "\n", " sex capital-gain capital-loss hours-per-week native-country salary \n", "0 Female 0 1902 40 United-States >=50k \n", "1 Male 10520 0 45 United-States >=50k \n", "2 Female 0 0 32 United-States <50k \n", "3 Male 0 0 40 United-States >=50k \n", "4 Female 0 0 50 United-States <50k " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(path/'adult.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the columns are continuous (like age) and we will treat them as float numbers we can feed our model directly. Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers. 
We can specify our categorical and continuous column names, as well as the name of the dependent variable in `TabularDataLoaders` factory methods:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names=\"salary\",\n", "    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],\n", "    cont_names = ['age', 'fnlwgt', 'education-num'],\n", "    procs = [Categorify, FillMissing, Normalize])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last part is the list of pre-processors we apply to our data:\n", "\n", "- `Categorify` takes every categorical variable, builds a map from integers to its unique categories, then replaces each value with the corresponding index.\n", "- `FillMissing` fills the missing values in the continuous variables with the median of the existing values (you can choose a specific value if you prefer).\n", "- `Normalize` normalizes the continuous variables (subtracts the mean and divides by the standard deviation).\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To further expose what's going on below the surface, let's rewrite this using `fastai`'s `TabularPandas` class. We will need to make one adjustment, which is defining how we want to split our data. 
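
To make the idea of a split concrete, here is a minimal standard-library sketch of a random train/validation split over row indices. The helper `random_split` and its `seed` argument are hypothetical names used only for illustration; this is not how fastai's `RandomSplitter` is implemented, just the same general idea.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    # Hypothetical helper (not fastai API): shuffle the row indices,
    # then hold out the first valid_pct of them for validation.
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]  # (train indices, validation indices)

train_idx, valid_idx = random_split(10, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

Every index ends up in exactly one of the two lists, so no row is shared between training and validation.
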
By default the factory method above used a random 80/20 split, so we will do the same:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "splits = RandomSplitter(valid_pct=0.2)(range_of(df))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "to = TabularPandas(df, procs=[Categorify, FillMissing,Normalize],\n", " cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],\n", " cont_names = ['age', 'fnlwgt', 'education-num'],\n", " y_names='salary',\n", " splits=splits)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we build our `TabularPandas` object, our data is completely preprocessed as seen below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>workclass</th><th>education</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>education-num_na</th><th>age</th><th>fnlwgt</th><th>education-num</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>15780</th><td>2</td><td>16</td><td>1</td><td>5</td><td>2</td><td>5</td><td>1</td><td>0.984037</td><td>2.210372</td><td>-0.033692</td></tr>\n",
"    <tr><th>17442</th><td>5</td><td>12</td><td>5</td><td>8</td><td>2</td><td>5</td><td>1</td><td>-1.509555</td><td>-0.319624</td><td>-0.425324</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " workclass education marital-status occupation relationship race \\\n", "15780 2 16 1 5 2 5 \n", "17442 5 12 5 8 2 5 \n", "\n", " education-num_na age fnlwgt education-num \n", "15780 1 0.984037 2.210372 -0.033692 \n", "17442 1 -1.509555 -0.319624 -0.425324 " ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "to.xs.iloc[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can build our `DataLoaders` again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dls = to.dataloaders(bs=64)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Later we will explore why using `TabularPandas` to preprocess will be valuable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `show_batch` method works like for every other application:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
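<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>workclass</th><th>education</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>education-num_na</th><th>age</th><th>fnlwgt</th><th>education-num</th><th>salary</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>State-gov</td><td>Bachelors</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Wife</td><td>White</td><td>False</td><td>41.000000</td><td>75409.001182</td><td>13.0</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>1</th><td>Private</td><td>Some-college</td><td>Never-married</td><td>Craft-repair</td><td>Not-in-family</td><td>White</td><td>False</td><td>24.000000</td><td>38455.005013</td><td>10.0</td><td>&lt;50k</td></tr>\n",
"    <tr><th>2</th><td>Private</td><td>Assoc-acdm</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Husband</td><td>White</td><td>False</td><td>48.000000</td><td>101299.003093</td><td>12.0</td><td>&lt;50k</td></tr>\n",
"    <tr><th>3</th><td>Private</td><td>HS-grad</td><td>Never-married</td><td>Other-service</td><td>Other-relative</td><td>Black</td><td>False</td><td>42.000000</td><td>227465.999281</td><td>9.0</td><td>&lt;50k</td></tr>\n",
"    <tr><th>4</th><td>State-gov</td><td>Some-college</td><td>Never-married</td><td>Prof-specialty</td><td>Not-in-family</td><td>White</td><td>False</td><td>20.999999</td><td>258489.997130</td><td>10.0</td><td>&lt;50k</td></tr>\n",
"    <tr><th>5</th><td>Local-gov</td><td>12th</td><td>Married-civ-spouse</td><td>Tech-support</td><td>Husband</td><td>White</td><td>False</td><td>39.000000</td><td>207853.000067</td><td>8.0</td><td>&lt;50k</td></tr>\n",
"    <tr><th>6</th><td>Private</td><td>Assoc-voc</td><td>Married-civ-spouse</td><td>Sales</td><td>Husband</td><td>White</td><td>False</td><td>36.000000</td><td>238414.998930</td><td>11.0</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>7</th><td>Private</td><td>HS-grad</td><td>Never-married</td><td>Craft-repair</td><td>Not-in-family</td><td>White</td><td>False</td><td>19.000000</td><td>445727.998937</td><td>9.0</td><td>&lt;50k</td></tr>\n",
"    <tr><th>8</th><td>Local-gov</td><td>Bachelors</td><td>Married-civ-spouse</td><td>#na#</td><td>Husband</td><td>White</td><td>True</td><td>59.000000</td><td>196013.000174</td><td>10.0</td><td>&gt;=50k</td></tr>\n",
"    <tr><th>9</th><td>Private</td><td>HS-grad</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Wife</td><td>Black</td><td>False</td><td>39.000000</td><td>147500.000403</td><td>9.0</td><td>&lt;50k</td></tr>\n",
"  </tbody>\n",
"</table>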
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dls.show_batch()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can define a model using the `tabular_learner` method. When we define our model, `fastai` will try to infer the loss function based on our `y_names` earlier. \n", "\n", "**Note**: Sometimes with tabular data, your `y`'s may be encoded (such as 0 and 1). In such a case you should explicitly pass `y_block = CategoryBlock` in your constructor so `fastai` won't presume you are doing regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = tabular_learner(dls, metrics=accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can train that model with the `fit_one_cycle` method (the `fine_tune` method won't be useful here since we don't have a pretrained model)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
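<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>epoch</th><th>train_loss</th><th>valid_loss</th><th>accuracy</th><th>time</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><td>0</td><td>0.369360</td><td>0.348096</td><td>0.840756</td><td>00:05</td></tr>\n",
"  </tbody>\n",
"</table>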
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.fit_one_cycle(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then have a look at some predictions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
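<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>workclass</th><th>education</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>education-num_na</th><th>age</th><th>fnlwgt</th><th>education-num</th><th>salary</th><th>salary_pred</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>5.0</td><td>12.0</td><td>3.0</td><td>8.0</td><td>1.0</td><td>5.0</td><td>1.0</td><td>0.324868</td><td>-1.138177</td><td>-0.424022</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>1</th><td>5.0</td><td>10.0</td><td>5.0</td><td>2.0</td><td>2.0</td><td>5.0</td><td>1.0</td><td>-0.482055</td><td>-1.351911</td><td>1.148438</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>2</th><td>5.0</td><td>12.0</td><td>6.0</td><td>12.0</td><td>3.0</td><td>5.0</td><td>1.0</td><td>-0.775482</td><td>0.138709</td><td>-0.424022</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>3</th><td>5.0</td><td>16.0</td><td>5.0</td><td>2.0</td><td>4.0</td><td>4.0</td><td>1.0</td><td>-1.362335</td><td>-0.227515</td><td>-0.030907</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>4</th><td>5.0</td><td>2.0</td><td>5.0</td><td>0.0</td><td>4.0</td><td>5.0</td><td>1.0</td><td>-1.509048</td><td>-0.191191</td><td>-1.210252</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>5</th><td>5.0</td><td>16.0</td><td>3.0</td><td>13.0</td><td>1.0</td><td>5.0</td><td>1.0</td><td>1.498575</td><td>-0.051096</td><td>-0.030907</td><td>1.0</td><td>1.0</td></tr>\n",
"    <tr><th>6</th><td>5.0</td><td>12.0</td><td>3.0</td><td>15.0</td><td>1.0</td><td>5.0</td><td>1.0</td><td>-0.555412</td><td>0.039167</td><td>-0.424022</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>7</th><td>5.0</td><td>1.0</td><td>5.0</td><td>6.0</td><td>4.0</td><td>5.0</td><td>1.0</td><td>-1.582405</td><td>-1.396391</td><td>-1.603367</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>8</th><td>5.0</td><td>3.0</td><td>5.0</td><td>13.0</td><td>2.0</td><td>5.0</td><td>1.0</td><td>-1.362335</td><td>0.158354</td><td>-0.817137</td><td>0.0</td><td>0.0</td></tr>\n",
"  </tbody>\n",
"</table>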
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.show_results()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or use the predict method on a row:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "row, clas, probs = learn.predict(df.iloc[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>workclass</th><th>education</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>education-num_na</th><th>age</th><th>fnlwgt</th><th>education-num</th><th>salary</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Private</td><td>Assoc-acdm</td><td>Married-civ-spouse</td><td>#na#</td><td>Wife</td><td>White</td><td>False</td><td>49.0</td><td>101319.99788</td><td>12.0</td><td>&gt;=50k</td></tr>\n",
"  </tbody>\n",
"</table>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "row.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(1), tensor([0.4995, 0.5005]))" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clas, probs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get predictions on a new dataframe, you can use the `test_dl` method of the `DataLoaders`. That dataframe does not need to have the dependent variable among its columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df = df.copy()\n", "test_df.drop(['salary'], axis=1, inplace=True)\n", "dl = learn.dls.test_dl(test_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then `Learner.get_preds` will give you the predictions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "(tensor([[0.4995, 0.5005],\n", " [0.4882, 0.5118],\n", " [0.9824, 0.0176],\n", " ...,\n", " [0.5324, 0.4676],\n", " [0.7628, 0.2372],\n", " [0.5934, 0.4066]]), None)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learn.get_preds(dl=dl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If your test data has missing values in different columns than your training data, you should address this before training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `fastai` with Other Libraries\n", "\n", "As mentioned earlier, `TabularPandas` is a powerful and easy preprocessing tool for tabular data. 
Integration with other libraries, for models such as Random Forests and XGBoost, requires only one extra step, one that the `.dataloaders` call did for us. Let's look at our `to` again. Its values are stored in a `DataFrame`-like object, from which we can extract the `cats`, `conts`, `xs` and `ys` if we want to:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>workclass</th><th>education</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>education-num_na</th><th>age</th><th>fnlwgt</th><th>education-num</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>25387</th><td>5</td><td>16</td><td>3</td><td>5</td><td>1</td><td>5</td><td>1</td><td>0.471582</td><td>-1.467756</td><td>-0.030907</td></tr>\n",
"    <tr><th>16872</th><td>1</td><td>16</td><td>5</td><td>1</td><td>4</td><td>5</td><td>1</td><td>-1.215622</td><td>-0.649792</td><td>-0.030907</td></tr>\n",
"    <tr><th>25852</th><td>5</td><td>16</td><td>3</td><td>5</td><td>1</td><td>5</td><td>1</td><td>1.865358</td><td>-0.218915</td><td>-0.030907</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " workclass education marital-status ... age fnlwgt education-num\n", "25387 5 16 3 ... 0.471582 -1.467756 -0.030907\n", "16872 1 16 5 ... -1.215622 -0.649792 -0.030907\n", "25852 5 16 3 ... 1.865358 -0.218915 -0.030907\n", "\n", "[3 rows x 10 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "to.xs[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that everything is encoded, you can send this off to XGBoost or Random Forests by extracting the train and validation sets and their values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, y_train = to.train.xs, to.train.ys.values.ravel()\n", "X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can pass these directly to the model of our choice!" ] } ], "metadata": { "jupytext": { "split_at_heading": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 1 }