{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tabular data preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [],
   "source": [
    "from fastai.gen_doc.nbdoc import *\n",
    "from fastai.tabular import *\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This package contains the basic class to define a transformation for preprocessing dataframes of tabular data, as well as basic [`TabularProc`](/tabular.transform.html#TabularProc). Preprocessing includes things like\n",
    "- replacing non-numerical variables by categories, then their ids,\n",
    "- filling missing values,\n",
    "- normalizing continuous variables.\n",
    "\n",
    "In all those steps we have to be careful to use the correspondence we decide on our training set (which id we give to each category, what is the value we put for missing data, or how the mean/std we use to normalize) on our validation or test set. To deal with this, we use a special class called [`TabularProc`](/tabular.transform.html#TabularProc).\n",
    "\n",
    "The data used in this document page is a subset of the [adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). It gives a certain amount of data on individuals to train a model to predict wether their salary is greater than \\$50k or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education</th>\n",
       "      <th>education-num</th>\n",
       "      <th>marital-status</th>\n",
       "      <th>occupation</th>\n",
       "      <th>relationship</th>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th>capital-gain</th>\n",
       "      <th>capital-loss</th>\n",
       "      <th>hours-per-week</th>\n",
       "      <th>native-country</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>49</td>\n",
       "      <td>Private</td>\n",
       "      <td>101320</td>\n",
       "      <td>Assoc-acdm</td>\n",
       "      <td>12.0</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Wife</td>\n",
       "      <td>White</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>1902</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;=50k</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>44</td>\n",
       "      <td>Private</td>\n",
       "      <td>236746</td>\n",
       "      <td>Masters</td>\n",
       "      <td>14.0</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>10520</td>\n",
       "      <td>0</td>\n",
       "      <td>45</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;=50k</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>Private</td>\n",
       "      <td>96185</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unmarried</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>32</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;50k</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>38</td>\n",
       "      <td>Self-emp-inc</td>\n",
       "      <td>112847</td>\n",
       "      <td>Prof-school</td>\n",
       "      <td>15.0</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Asian-Pac-Islander</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&gt;=50k</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>42</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>82297</td>\n",
       "      <td>7th-8th</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Other-service</td>\n",
       "      <td>Wife</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>50</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;50k</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age          workclass  fnlwgt     education  education-num  \\\n",
       "0   49            Private  101320    Assoc-acdm           12.0   \n",
       "1   44            Private  236746       Masters           14.0   \n",
       "2   38            Private   96185       HS-grad            NaN   \n",
       "3   38       Self-emp-inc  112847   Prof-school           15.0   \n",
       "4   42   Self-emp-not-inc   82297       7th-8th            NaN   \n",
       "\n",
       "        marital-status        occupation    relationship                 race  \\\n",
       "0   Married-civ-spouse               NaN            Wife                White   \n",
       "1             Divorced   Exec-managerial   Not-in-family                White   \n",
       "2             Divorced               NaN       Unmarried                Black   \n",
       "3   Married-civ-spouse    Prof-specialty         Husband   Asian-Pac-Islander   \n",
       "4   Married-civ-spouse     Other-service            Wife                Black   \n",
       "\n",
       "       sex  capital-gain  capital-loss  hours-per-week  native-country salary  \n",
       "0   Female             0          1902              40   United-States  >=50k  \n",
       "1     Male         10520             0              45   United-States  >=50k  \n",
       "2   Female             0             0              32   United-States   <50k  \n",
       "3     Male             0             0              40   United-States  >=50k  \n",
       "4   Female             0             0              50   United-States   <50k  "
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = untar_data(URLs.ADULT_SAMPLE)\n",
    "df = pd.read_csv(path/'adult.csv')\n",
    "train_df, valid_df = df.iloc[:800].copy(), df.iloc[800:1000].copy()\n",
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see it contains numerical variables (like `age` or `education-num`) as well as categorical ones (like `workclass` or `relationship`). The original dataset is clean, but we removed a few values to give examples of dealing with missing variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']\n",
    "cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transforms for tabular data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"TabularProc\" class=\"doc_header\"><code>class</code> <code>TabularProc</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L116\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#TabularProc-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h2>\n",
       "\n",
       "> <code>TabularProc</code>(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`)\n",
       "\n",
       "<div class=\"collapse\" id=\"TabularProc-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#TabularProc-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>No tests found for <code>TabularProc</code>. To contribute a test please refer to <a href=\"/dev/test.html\">this guide</a> and <a href=\"https://forums.fast.ai/t/improving-expanding-functional-tests/32929\">this discussion</a>.</p></div></div>\n",
       "\n",
       "A processor for tabular dataframes.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TabularProc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Base class for creating transforms for dataframes with categorical variables `cat_names` and continuous variables `cont_names`. Note that any column not in one of those lists won't be touched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TabularProc.__call__\" class=\"doc_header\"><code>__call__</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L121\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#TabularProc-__call__-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>__call__</code>(**`df`**:`DataFrame`, **`test`**:`bool`=***`False`***)\n",
       "\n",
       "<div class=\"collapse\" id=\"TabularProc-__call__-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#TabularProc-__call__-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>No tests found for <code>__call__</code>. To contribute a test please refer to <a href=\"/dev/test.html\">this guide</a> and <a href=\"https://forums.fast.ai/t/improving-expanding-functional-tests/32929\">this discussion</a>.</p></div></div>\n",
       "\n",
       "Apply the correct function to `df` depending on `test`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TabularProc.__call__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TabularProc.apply_train\" class=\"doc_header\"><code>apply_train</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L126\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#TabularProc-apply_train-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>apply_train</code>(**`df`**:`DataFrame`)\n",
       "\n",
       "<div class=\"collapse\" id=\"TabularProc-apply_train-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#TabularProc-apply_train-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>apply_train</code>:</p><p>Some other tests where <code>apply_train</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Function applied to `df` if it's the train set.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TabularProc.apply_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"TabularProc.apply_test\" class=\"doc_header\"><code>apply_test</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L129\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#TabularProc-apply_test-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>apply_test</code>(**`df`**:`DataFrame`)\n",
       "\n",
       "<div class=\"collapse\" id=\"TabularProc-apply_test-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#TabularProc-apply_test-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>apply_test</code>:</p><p>Some other tests where <code>apply_test</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Function applied to `df` if it's the test set.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(TabularProc.apply_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<div markdown=\"span\" class=\"alert alert-warning\" role=\"alert\"><i class=\"fa fa-warning-circle\"></i> <b>Important: </b>Those two functions must be implemented in a subclass. `apply_test` defaults to `apply_train`.</div>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "jekyll_important(\"Those two functions must be implemented in a subclass. `apply_test` defaults to `apply_train`.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following [`TabularProc`](/tabular.transform.html#TabularProc) are implemented in the fastai library. Note that the replacement from categories to codes as well as the normalization of continuous variables are automatically done in a [`TabularDataBunch`](/tabular.data.html#TabularDataBunch)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"Categorify\" class=\"doc_header\"><code>class</code> <code>Categorify</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L133\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#Categorify-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h2>\n",
       "\n",
       "> <code>Categorify</code>(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`) :: [`TabularProc`](/tabular.transform.html#TabularProc)\n",
       "\n",
       "<div class=\"collapse\" id=\"Categorify-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#Categorify-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>Categorify</code>:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Transform the categorical variables to that type.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(Categorify)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Variables in `cont_names` aren't affected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"Categorify.apply_train\" class=\"doc_header\"><code>apply_train</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L135\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#Categorify-apply_train-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>apply_train</code>(**`df`**:`DataFrame`)\n",
       "\n",
       "<div class=\"collapse\" id=\"Categorify-apply_train-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#Categorify-apply_train-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>apply_train</code>:</p><p>Some other tests where <code>apply_train</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Transform `self.cat_names` columns in categorical.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(Categorify.apply_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"Categorify.apply_test\" class=\"doc_header\"><code>apply_test</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L142\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#Categorify-apply_test-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>apply_test</code>(**`df`**:`DataFrame`)\n",
       "\n",
       "<div class=\"collapse\" id=\"Categorify-apply_test-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#Categorify-apply_test-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>apply_test</code>:</p><p>Some other tests where <code>apply_test</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Transform `self.cat_names` columns in categorical using the codes decided in `apply_train`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(Categorify.apply_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tfm = Categorify(cat_names, cont_names)\n",
    "tfm(train_df)\n",
    "tfm(valid_df, test=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we haven't changed the categories by their codes, nothing visible has changed in the dataframe yet, but we can check that the variables are now categorical and view their corresponding codes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',\n",
       "       ' Self-emp-not-inc', ' State-gov', ' Without-pay'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df['workclass'].cat.categories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The test set will be given the same category codes as the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',\n",
       "       ' Self-emp-not-inc', ' State-gov', ' Without-pay'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "valid_df['workclass'].cat.categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"FillMissing\" class=\"doc_header\"><code>class</code> <code>FillMissing</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L150\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#FillMissing-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h2>\n",
       "\n",
       "> <code>FillMissing</code>(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`, **`fill_strategy`**:[`FillStrategy`](/tabular.transform.html#FillStrategy)=***`<FillStrategy.MEDIAN: 1>`***, **`add_col`**:`bool`=***`True`***, **`fill_val`**:`float`=***`0.0`***) :: [`TabularProc`](/tabular.transform.html#TabularProc)\n",
       "\n",
       "<div class=\"collapse\" id=\"FillMissing-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#FillMissing-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>FillMissing</code>:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_default_fill_strategy_is_median</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L30\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>Some other tests where <code>FillMissing</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Fill the missing values in continuous columns.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(FillMissing)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`cat_names` variables are left untouched (their missing value will be replaced by code 0 in the [`TabularDataBunch`](/tabular.data.html#TabularDataBunch)). [`fill_strategy`](#FillStrategy) is adopted to replace those nans and if `add_col` is True, whenever a column `c` has missing values, a column named `c_nan` is added and flags the line where the value was missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"FillMissing.apply_train\" class=\"doc_header\"><code>apply_train</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L155\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#FillMissing-apply_train-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>apply_train</code>(**`df`**:`DataFrame`)\n",
       "\n",
       "<div class=\"collapse\" id=\"FillMissing-apply_train-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#FillMissing-apply_train-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>apply_train</code>:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>Some other tests where <code>apply_train</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Fill missing values in `self.cont_names` according to `self.fill_strategy`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(FillMissing.apply_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"FillMissing.apply_test\" class=\"doc_header\"><code>apply_test</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L169\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#FillMissing-apply_test-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>apply_test</code>(**`df`**:`DataFrame`)\n",
       "\n",
       "<div class=\"collapse\" id=\"FillMissing-apply_test-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#FillMissing-apply_test-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>apply_test</code>:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>Some other tests where <code>apply_test</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Fill missing values in `self.cont_names` like in `apply_train`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(FillMissing.apply_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fills the missing values in the `cont_names` columns with the ones picked during train."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education-num</th>\n",
       "      <th>capital-gain</th>\n",
       "      <th>capital-loss</th>\n",
       "      <th>hours-per-week</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>49</td>\n",
       "      <td>101320</td>\n",
       "      <td>12.0</td>\n",
       "      <td>0</td>\n",
       "      <td>1902</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>44</td>\n",
       "      <td>236746</td>\n",
       "      <td>14.0</td>\n",
       "      <td>10520</td>\n",
       "      <td>0</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>96185</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>38</td>\n",
       "      <td>112847</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>42</td>\n",
       "      <td>82297</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week\n",
       "0   49  101320           12.0             0          1902              40\n",
       "1   44  236746           14.0         10520             0              45\n",
       "2   38   96185            NaN             0             0              32\n",
       "3   38  112847           15.0             0             0              40\n",
       "4   42   82297            NaN             0             0              50"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df[cont_names].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education-num</th>\n",
       "      <th>capital-gain</th>\n",
       "      <th>capital-loss</th>\n",
       "      <th>hours-per-week</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>49</td>\n",
       "      <td>101320</td>\n",
       "      <td>12.0</td>\n",
       "      <td>0</td>\n",
       "      <td>1902</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>44</td>\n",
       "      <td>236746</td>\n",
       "      <td>14.0</td>\n",
       "      <td>10520</td>\n",
       "      <td>0</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>96185</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>38</td>\n",
       "      <td>112847</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>42</td>\n",
       "      <td>82297</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week\n",
       "0   49  101320           12.0             0          1902              40\n",
       "1   44  236746           14.0         10520             0              45\n",
       "2   38   96185           10.0             0             0              32\n",
       "3   38  112847           15.0             0             0              40\n",
       "4   42   82297           10.0             0             0              50"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfm = FillMissing(cat_names, cont_names)\n",
    "tfm(train_df)\n",
    "tfm(valid_df, test=True)\n",
    "train_df[cont_names].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Values missing in the `education-num` column are replaced by 10, which is the median of the column in `train_df`. Categorical variables are not changed, since `nan` is simply used as another category."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education-num</th>\n",
       "      <th>capital-gain</th>\n",
       "      <th>capital-loss</th>\n",
       "      <th>hours-per-week</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>800</th>\n",
       "      <td>45</td>\n",
       "      <td>96975</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>801</th>\n",
       "      <td>46</td>\n",
       "      <td>192779</td>\n",
       "      <td>10.0</td>\n",
       "      <td>15024</td>\n",
       "      <td>0</td>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>802</th>\n",
       "      <td>36</td>\n",
       "      <td>376455</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>803</th>\n",
       "      <td>25</td>\n",
       "      <td>50053</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>804</th>\n",
       "      <td>37</td>\n",
       "      <td>164526</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     age  fnlwgt  education-num  capital-gain  capital-loss  hours-per-week\n",
       "800   45   96975           10.0             0             0              40\n",
       "801   46  192779           10.0         15024             0              60\n",
       "802   36  376455           10.0             0             0              38\n",
       "803   25   50053           10.0             0             0              45\n",
       "804   37  164526           10.0             0             0              40"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "valid_df[cont_names].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"FillStrategy\" class=\"doc_header\">`FillStrategy`<a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#FillStrategy-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h2>\n",
       "\n",
       "> <code>Enum</code> = [MEDIAN, COMMON, CONSTANT]\n",
       "\n",
       "<div class=\"collapse\" id=\"FillStrategy-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#FillStrategy-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>FillStrategy</code>:</p><p>Some other tests where <code>FillStrategy</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_default_fill_strategy_is_median</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L30\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Enum flag represents determines how [`FillMissing`](/tabular.transform.html#FillMissing) should handle missing/nan values\n",
       "\n",
       "- *MEDIAN*: nans are replaced by the median value of the column\n",
       "- *COMMON*: nans are replaced by the most common value of the column\n",
       "- *CONSTANT*: nans are replaced by `fill_val` "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(FillStrategy, alt_doc_string='Enum flag represents determines how `FillMissing` should handle missing/nan values', arg_comments={\n",
    "    'MEDIAN':'nans are replaced by the median value of the column',\n",
    "    'COMMON': 'nans are replaced by the most common value of the column',\n",
    "    'CONSTANT': 'nans are replaced by `fill_val`'\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"Normalize\" class=\"doc_header\"><code>class</code> <code>Normalize</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L181\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#Normalize-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h2>\n",
       "\n",
       "> <code>Normalize</code>(**`cat_names`**:`StrList`, **`cont_names`**:`StrList`) :: [`TabularProc`](/tabular.transform.html#TabularProc)\n",
       "\n",
       "<div class=\"collapse\" id=\"Normalize-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#Normalize-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>No tests found for <code>Normalize</code>. To contribute a test please refer to <a href=\"/dev/test.html\">this guide</a> and <a href=\"https://forums.fast.ai/t/improving-expanding-functional-tests/32929\">this discussion</a>.</p></div></div>\n",
       "\n",
       "Normalize the continuous variables.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(Normalize)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"Normalize.apply_train\" class=\"doc_header\"><code>apply_train</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L183\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#Normalize-apply_train-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>apply_train</code>(**`df`**:`DataFrame`)\n",
       "\n",
       "<div class=\"collapse\" id=\"Normalize-apply_train-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#Normalize-apply_train-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>apply_train</code>:</p><p>Some other tests where <code>apply_train</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Compute the means and stds of `self.cont_names` columns to normalize them.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(Normalize.apply_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"Normalize.apply_test\" class=\"doc_header\"><code>apply_test</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L192\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#Normalize-apply_test-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>apply_test</code>(**`df`**:`DataFrame`)\n",
       "\n",
       "<div class=\"collapse\" id=\"Normalize-apply_test-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#Normalize-apply_test-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>apply_test</code>:</p><p>Some other tests where <code>apply_test</code> is used:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_categorify</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L6\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_leaves_no_na_values</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L36\" class=\"source_link\" style=\"float:right\">[source]</a></li><li><code>pytest -sv tests/test_tabular_transform.py::test_fill_missing_returns_correct_medians</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L49\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Normalize `self.cont_names` with the same statistics as in `apply_train`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(Normalize.apply_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Treating date columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"add_datepart\" class=\"doc_header\"><code>add_datepart</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L55\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#add_datepart-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>add_datepart</code>(**`df`**:`DataFrame`, **`field_name`**:`str`, **`prefix`**:`str`=***`None`***, **`drop`**:`bool`=***`True`***, **`time`**:`bool`=***`False`***)\n",
       "\n",
       "<div class=\"collapse\" id=\"add_datepart-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#add_datepart-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>No tests found for <code>add_datepart</code>. To contribute a test please refer to <a href=\"/dev/test.html\">this guide</a> and <a href=\"https://forums.fast.ai/t/improving-expanding-functional-tests/32929\">this discussion</a>.</p></div></div>\n",
       "\n",
       "Helper function that adds columns relevant to a date in the column `field_name` of `df`.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(add_datepart)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Will `drop` the column in `df` if the flag is `True`. The `time` flag decides if we go down to the time parts or stick to the date parts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Splitting data into cat and cont"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"cont_cat_split\" class=\"doc_header\"><code>cont_cat_split</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/tabular/transform.py#L106\" class=\"source_link\" style=\"float:right\">[source]</a><a class=\"source_link\" data-toggle=\"collapse\" data-target=\"#cont_cat_split-pytest\" style=\"float:right; padding-right:10px\">[test]</a></h4>\n",
       "\n",
       "> <code>cont_cat_split</code>(**`df`**, **`max_card`**=***`20`***, **`dep_var`**=***`None`***) → `Tuple`\\[`List`\\[`T`\\], `List`\\[`T`\\]\\]\n",
       "\n",
       "<div class=\"collapse\" id=\"cont_cat_split-pytest\"><div class=\"card card-body pytest_card\"><a type=\"button\" data-toggle=\"collapse\" data-target=\"#cont_cat_split-pytest\" class=\"close\" aria-label=\"Close\"><span aria-hidden=\"true\">&times;</span></a><p>Tests found for <code>cont_cat_split</code>:</p><ul><li><code>pytest -sv tests/test_tabular_transform.py::test_cont_cat_split</code> <a href=\"https://github.com/fastai/fastai/blob/master/tests/test_tabular_transform.py#L64\" class=\"source_link\" style=\"float:right\">[source]</a></li></ul><p>To run tests please refer to this <a href=\"/dev/test.html#quick-guide\">guide</a>.</p></div></div>\n",
       "\n",
       "Helper function that returns column names of cont and cat variables from given df.  "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(cont_cat_split)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parameters:\n",
    "- df: A pandas data frame.\n",
    "- max_card: Maximum cardinality of a numerical categorical variable.\n",
    "- dep_var: A dependent variable.\n",
    "\n",
    "Return:\n",
    "- cont_names: A list of names of continuous variables.\n",
    "- cat_names: A list of names of categorical variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "      <th>col2</th>\n",
       "      <th>col3</th>\n",
       "      <th>col4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>a</td>\n",
       "      <td>0.5</td>\n",
       "      <td>ab</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>b</td>\n",
       "      <td>1.2</td>\n",
       "      <td>o</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>a</td>\n",
       "      <td>7.5</td>\n",
       "      <td>o</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1 col2  col3 col4\n",
       "0     1    a   0.5   ab\n",
       "1     2    b   1.2    o\n",
       "2     3    a   7.5    o"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a'], 'col3': [0.5, 1.2, 7.5], 'col4': ['ab', 'o', 'o']})\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(['col3'], ['col1', 'col2'])"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cont_list, cat_list = cont_cat_split(df=df, max_card=20, dep_var='col4')\n",
    "cont_list, cat_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Undocumented Methods - Methods moved below this line will intentionally be hidden"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## New Methods - Please document or move to the undocumented section"
   ]
  }
 ],
 "metadata": {
  "jekyll": {
   "keywords": "fastai",
   "summary": "Transforms to clean and preprocess tabular data",
   "title": "tabular.transform"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}