{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook shows the functionality of the **`DummyEncoder` and `InteractionEncoder` classes of Appelpy** 🍏🥧 in depth, applied to an econometrics dataset.  These classes are in the `utils` module.\n",
    "\n",
    "**Notebook structure:**\n",
    "- **Data loading:** e.g. what format is needed for categorical columns and Boolean columns before using the Encoders.\n",
    "- **`DummyEncoder` functionality:** basic examples of categorical columns being encoded into dummy columns.\n",
    "- **`InteractionEncoder` functionality:** multiple scenarios are covered for interactions between different data types.\n",
    "- **Modelling:** examples of models that use interaction effects.\n",
    "\n",
    "The notebook ends with an example of a simple model pipeline using the `InteractionEncoder`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Appelpy imports:\n",
    "from appelpy.utils import DummyEncoder, InteractionEncoder\n",
    "from appelpy.linear_model import OLS\n",
    "\n",
    "# Suppress all warnings (e.g. NumPy warnings raised via Statsmodels)\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [hsbdemo DTA file](https://stats.idre.ucla.edu/stat/data/hsbdemo.dta) in this example is a dataset with 200 observations on the academic choices of students and other information about the students themselves, e.g. their academic profiles and demographic information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_raw = pd.read_stata('https://stats.idre.ucla.edu/stat/data/hsbdemo.dta')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>prog</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>45.0</td>\n",
       "      <td>female</td>\n",
       "      <td>low</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>34.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>108.0</td>\n",
       "      <td>male</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>general</td>\n",
       "      <td>34.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>15.0</td>\n",
       "      <td>male</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>39.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>44.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>67.0</td>\n",
       "      <td>male</td>\n",
       "      <td>low</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>37.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>153.0</td>\n",
       "      <td>male</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>39.0</td>\n",
       "      <td>31.0</td>\n",
       "      <td>40.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>51.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id  female     ses  schtyp      prog  read  write  math  science  socst  \\\n",
       "0   45.0  female     low  public  vocation  34.0   35.0  41.0     29.0   26.0   \n",
       "1  108.0    male  middle  public   general  34.0   33.0  41.0     36.0   36.0   \n",
       "2   15.0    male    high  public  vocation  39.0   39.0  44.0     26.0   42.0   \n",
       "3   67.0    male     low  public  vocation  37.0   37.0  42.0     33.0   32.0   \n",
       "4  153.0    male  middle  public  vocation  39.0   31.0  40.0     39.0   51.0   \n",
       "\n",
       "         honors  awards  cid  \n",
       "0  not enrolled     0.0    1  \n",
       "1  not enrolled     0.0    1  \n",
       "2  not enrolled     0.0    1  \n",
       "3  not enrolled     0.0    1  \n",
       "4  not enrolled     0.0    1  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_raw.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id         200\n",
       "female       2\n",
       "ses          3\n",
       "schtyp       2\n",
       "prog         3\n",
       "read        30\n",
       "write       29\n",
       "math        40\n",
       "science     34\n",
       "socst       22\n",
       "honors       2\n",
       "awards       7\n",
       "cid         20\n",
       "dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_raw.nunique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The categorical columns from the Stata file are already recognised by Pandas as the `category` dtype (`pd.Categorical`).\n",
    "\n",
    "**NOTE: categorical data fed to the encoders must be in the `pd.Categorical` dtype for the encoding to work!**  Columns must not be in the generic `object` dtype."
   ]
  },
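  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a categorical column arrives as plain strings (e.g. from a CSV file), it can be converted before being passed to an encoder.  The cell below is a minimal sketch on a hypothetical toy dataframe (`df_toy` and its values are made up for illustration), using the standard Pandas `pd.Categorical` constructor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data: 'ses' is read in as plain strings\n",
    "df_toy = pd.DataFrame({'ses': ['low', 'middle', 'high', 'low']})\n",
    "print(df_toy['ses'].dtype)\n",
    "\n",
    "# Convert to the category dtype, listing the categories explicitly\n",
    "df_toy['ses'] = pd.Categorical(df_toy['ses'], categories=['low', 'middle', 'high'])\n",
    "print(df_toy['ses'].dtype)"
   ]
  },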
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `DummyEncoder` also handles NaN values in categorical data (via the `nan_policy` argument).  That functionality is covered separately in another notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "id          float32\n",
       "female     category\n",
       "ses        category\n",
       "schtyp     category\n",
       "prog       category\n",
       "read        float32\n",
       "write       float32\n",
       "math        float32\n",
       "science     float32\n",
       "socst       float32\n",
       "honors     category\n",
       "awards      float32\n",
       "cid           int16\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_raw.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `female` column will be recoded here as a Boolean column with values in {0, 1}, rather than the {'male', 'female'} format originally in the dataset.\n",
    "\n",
    "**NOTE: Boolean data fed to the encoders should be restricted to values in {0, 1} in order for the encoding to work!**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recode 'female' col into 1 and 0 vals\n",
    "df_raw['female'] = np.where(df_raw['female'] == 'female', 1, 0)\n",
    "\n",
    "# Create another Bool col for use later on - col for 'read' value being higher than the mean\n",
    "df_raw['read_gt_mean'] = np.where(df_raw['read'] > df_raw['read'].mean(), 1, 0)"
   ]
  },
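  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick check, sketched here with standard Pandas calls, can confirm that the Boolean columns contain only 0 and 1 before they are passed to the encoders.  The name `bool_cols` is introduced only for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check that each Boolean column holds only the values 0 and 1\n",
    "bool_cols = ['female', 'read_gt_mean']\n",
    "print(df_raw[bool_cols].isin([0, 1]).all())"
   ]
  },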
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are some examples of the types of data in the dataset.\n",
    "\n",
    "Boolean variables:\n",
    "- `female`\n",
    "- `read_gt_mean` (derived above from `read`)\n",
    "\n",
    "Categorical variables:\n",
    "- `ses`\n",
    "- `prog`\n",
    "\n",
    "Continuous variables:\n",
    "- `read`, `write`, `math`, `science`, `socst`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data pre-processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `DummyEncoder` functionality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `dummy_encoder` object is an instance of the `DummyEncoder` class.\n",
    "\n",
    "**The encoder object must be initialized with a dataframe.**  Here it is initialized with `df_raw`, which is left unchanged: the encoded data is returned as a new dataframe.\n",
    "\n",
    "By default, the `_` separator is used to name the dummy columns, e.g. `prog_general`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The encoder also takes a dictionary, in which each categorical column name is paired with a base level (or `None` for no base level).  If a base level is specified, the dummy column for that category is dropped from the final dataframe."
   ]
  },
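  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the base-level option described above, the cell below passes `'general'` as the base level for `prog` instead of `None`.  The choice of `'general'` and the name `df_base` are illustrative assumptions; going by the description, the resulting dataframe should have no `prog_general` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: treat 'general' as the base level of 'prog'\n",
    "df_base = DummyEncoder(df_raw, {'prog': 'general'}).transform()\n",
    "\n",
    "# Inspect which dummy columns were created for 'prog'\n",
    "print([col for col in df_base.columns if col.startswith('prog_')])"
   ]
  },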
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "dummy_encoder = DummyEncoder(df_raw, {'schtyp': None,\n",
    "                                      'prog': None,\n",
    "                                      'honors': None})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the transformed dataframe with the `transform` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Encode dummies from the specified categorical variables into a new dataframe\n",
    "df = dummy_encoder.transform()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Default NaN policy: row_of_zero\n"
     ]
    }
   ],
   "source": [
    "print(f\"Default NaN policy: {dummy_encoder.nan_policy}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>schtyp_public</th>\n",
       "      <th>schtyp_private</th>\n",
       "      <th>prog_general</th>\n",
       "      <th>prog_academic</th>\n",
       "      <th>prog_vocation</th>\n",
       "      <th>honors_not enrolled</th>\n",
       "      <th>honors_enrolled</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>45.0</td>\n",
       "      <td>1</td>\n",
       "      <td>low</td>\n",
       "      <td>34.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>108.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>34.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>15.0</td>\n",
       "      <td>0</td>\n",
       "      <td>high</td>\n",
       "      <td>39.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>44.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>67.0</td>\n",
       "      <td>0</td>\n",
       "      <td>low</td>\n",
       "      <td>37.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>153.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>39.0</td>\n",
       "      <td>31.0</td>\n",
       "      <td>40.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>51.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id  female     ses  read  write  math  science  socst  awards  cid  \\\n",
       "0   45.0       1     low  34.0   35.0  41.0     29.0   26.0     0.0    1   \n",
       "1  108.0       0  middle  34.0   33.0  41.0     36.0   36.0     0.0    1   \n",
       "2   15.0       0    high  39.0   39.0  44.0     26.0   42.0     0.0    1   \n",
       "3   67.0       0     low  37.0   37.0  42.0     33.0   32.0     0.0    1   \n",
       "4  153.0       0  middle  39.0   31.0  40.0     39.0   51.0     0.0    1   \n",
       "\n",
       "   read_gt_mean  schtyp_public  schtyp_private  prog_general  prog_academic  \\\n",
       "0             0              1               0             0              0   \n",
       "1             0              1               0             1              0   \n",
       "2             0              1               0             0              0   \n",
       "3             0              1               0             0              0   \n",
       "4             0              1               0             0              0   \n",
       "\n",
       "   prog_vocation  honors_not enrolled  honors_enrolled  \n",
       "0              1                    1                0  \n",
       "1              0                    1                0  \n",
       "2              1                    1                0  \n",
       "3              1                    1                0  \n",
       "4              1                    1                0  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Three categorical variables are fed to the `DummyEncoder`.\n",
    "\n",
    "The original columns for all three are removed from the final dataframe and replaced by their dummy columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['schtyp', 'prog', 'honors']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list(dummy_encoder.categorical_col_base_levels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from appelpy.utils import get_dataframe_columns_diff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: ['prog', 'honors', 'schtyp']\n",
      "Columns added: ['prog_academic', 'honors_not enrolled', 'honors_enrolled', 'schtyp_public', 'prog_vocation', 'prog_general', 'schtyp_private']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df, df_raw)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `InteractionEncoder` functionality"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a new copy of the `df_raw` dataframe.\n",
    "\n",
    "The `int_encoder` object is an instance of the `InteractionEncoder` class.\n",
    "\n",
    "**The encoder object must be initialized with a dataframe.**\n",
    "\n",
    "The `#` separator is used to represent the interaction between two variables in the columns that are produced by the encoder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df_raw.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examples of interactions between variables will be given for these cases:\n",
    "- Two Boolean variables\n",
    "- Two continuous variables\n",
    "- Two categorical variables\n",
    "- One Boolean variable and one categorical variable\n",
    "- One Boolean variable and one continuous variable\n",
    "- One categorical variable and one continuous variable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Two Boolean variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Bool: `female`\n",
    "- Bool: `read_gt_mean`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>prog</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>female#read_gt_mean</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female     ses  schtyp      prog  read  write  math  science  \\\n",
       "195  100.0       1    high  public  academic  63.0   65.0  71.0     69.0   \n",
       "196  143.0       0  middle  public  vocation  63.0   63.0  75.0     72.0   \n",
       "197   68.0       0  middle  public  academic  73.0   67.0  71.0     63.0   \n",
       "198   57.0       1  middle  public  academic  71.0   65.0  72.0     66.0   \n",
       "199  132.0       0  middle  public  academic  73.0   62.0  73.0     69.0   \n",
       "\n",
       "     socst    honors  awards  cid  read_gt_mean  female#read_gt_mean  \n",
       "195   71.0  enrolled     5.0   20             1                    1  \n",
       "196   66.0  enrolled     4.0   20             1                    0  \n",
       "197   66.0  enrolled     7.0   20             1                    0  \n",
       "198   56.0  enrolled     5.0   20             1                    1  \n",
       "199   66.0  enrolled     3.0   20             1                    0  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "int_encoder = InteractionEncoder(df, {'female': ['read_gt_mean']})\n",
    "\n",
    "df_enc = int_encoder.transform()\n",
    "df_enc.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both main-effect columns are Boolean, so they are kept in the final dataframe.\n",
    "\n",
    "There is only one interaction effect between the two Boolean variables, so one column is added to the dataframe."
   ]
  },
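  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For two Boolean columns, the interaction column appears to be the element-wise product of the two main effects, which matches the rows displayed above.  The cell below is a small check of that reading using plain Pandas arithmetic; it is a sketch and not a description of the encoder's internals.  The name `manual_interaction` is introduced only for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Element-wise product of the two Boolean main effects\n",
    "manual_interaction = df['female'] * df['read_gt_mean']\n",
    "\n",
    "# True if the encoded column matches the manual product in every row\n",
    "print((df_enc['female#read_gt_mean'] == manual_interaction).all())"
   ]
  },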
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `get_dataframe_columns_diff` function is useful for checking how the final dataframe differs from the original dataframe after the encoding process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: []\n",
      "Columns added: ['female#read_gt_mean']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df, df_enc)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function essentially compares the column names of the two dataframes as sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: []\n",
      "Columns added: []\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df, df_raw)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: []\n",
      "Columns added: ['female#read_gt_mean']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {list(set(df.columns) - set(df_enc.columns))}\")\n",
    "print(f\"Columns added: {list(set(df_enc.columns) - set(df.columns))}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Two continuous variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Continuous: `read`\n",
    "- Continuous: `write`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tip: the transformation can be done in one line by calling `transform` directly on a newly created encoder instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>prog</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>read#write</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>4095.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>3969.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>4891.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>4615.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>4526.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female     ses  schtyp      prog  read  write  math  science  \\\n",
       "195  100.0       1    high  public  academic  63.0   65.0  71.0     69.0   \n",
       "196  143.0       0  middle  public  vocation  63.0   63.0  75.0     72.0   \n",
       "197   68.0       0  middle  public  academic  73.0   67.0  71.0     63.0   \n",
       "198   57.0       1  middle  public  academic  71.0   65.0  72.0     66.0   \n",
       "199  132.0       0  middle  public  academic  73.0   62.0  73.0     69.0   \n",
       "\n",
       "     socst    honors  awards  cid  read_gt_mean  read#write  \n",
       "195   71.0  enrolled     5.0   20             1      4095.0  \n",
       "196   66.0  enrolled     4.0   20             1      3969.0  \n",
       "197   66.0  enrolled     7.0   20             1      4891.0  \n",
       "198   56.0  enrolled     5.0   20             1      4615.0  \n",
       "199   66.0  enrolled     3.0   20             1      4526.0  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_enc = InteractionEncoder(df_raw, {'read': ['write']}).transform()\n",
    "df_enc.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both main effects are continuous, so their columns are kept in the final dataframe.\n",
    "\n",
    "There is only one interaction effect between the two continuous variables, so one column (`read#write`) is added to the dataframe."
   ]
  },
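  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `read#write` values above are consistent with an element-wise product of the two columns (e.g. 63 × 65 = 4095 for row 195).  As a minimal sketch, assuming that this product is all the encoding does for two continuous variables, the same column can be reproduced with plain pandas.  The `df_toy` dataframe is a hypothetical stand-in that copies the last five rows shown above; it is not part of Appelpy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame: `read` and `write` for the last five rows shown above\n",
    "df_toy = pd.DataFrame({'read': [63.0, 63.0, 73.0, 71.0, 73.0],\n",
    "                       'write': [65.0, 63.0, 67.0, 65.0, 62.0]},\n",
    "                      index=range(195, 200))\n",
    "\n",
    "# Assumption: the interaction of two continuous columns is their element-wise product\n",
    "df_toy['read#write'] = df_toy['read'] * df_toy['write']\n",
    "df_toy"
   ]
  },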
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: []\n",
      "Columns added: ['read#write']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Two categorical variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Categorical: `prog`\n",
    "- Categorical: `ses`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>...</th>\n",
       "      <th>ses_high</th>\n",
       "      <th>prog_general#ses_low</th>\n",
       "      <th>prog_general#ses_middle</th>\n",
       "      <th>prog_general#ses_high</th>\n",
       "      <th>prog_academic#ses_low</th>\n",
       "      <th>prog_academic#ses_middle</th>\n",
       "      <th>prog_academic#ses_high</th>\n",
       "      <th>prog_vocation#ses_low</th>\n",
       "      <th>prog_vocation#ses_middle</th>\n",
       "      <th>prog_vocation#ses_high</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>public</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female  schtyp  read  write  math  science  socst    honors  \\\n",
       "195  100.0       1  public  63.0   65.0  71.0     69.0   71.0  enrolled   \n",
       "196  143.0       0  public  63.0   63.0  75.0     72.0   66.0  enrolled   \n",
       "197   68.0       0  public  73.0   67.0  71.0     63.0   66.0  enrolled   \n",
       "198   57.0       1  public  71.0   65.0  72.0     66.0   56.0  enrolled   \n",
       "199  132.0       0  public  73.0   62.0  73.0     69.0   66.0  enrolled   \n",
       "\n",
       "     awards  ...  ses_high  prog_general#ses_low  prog_general#ses_middle  \\\n",
       "195     5.0  ...         1                     0                        0   \n",
       "196     4.0  ...         0                     0                        0   \n",
       "197     7.0  ...         0                     0                        0   \n",
       "198     5.0  ...         0                     0                        0   \n",
       "199     3.0  ...         0                     0                        0   \n",
       "\n",
       "     prog_general#ses_high  prog_academic#ses_low  prog_academic#ses_middle  \\\n",
       "195                      0                      0                         0   \n",
       "196                      0                      0                         0   \n",
       "197                      0                      0                         1   \n",
       "198                      0                      0                         1   \n",
       "199                      0                      0                         1   \n",
       "\n",
       "     prog_academic#ses_high  prog_vocation#ses_low  prog_vocation#ses_middle  \\\n",
       "195                       1                      0                         0   \n",
       "196                       0                      0                         1   \n",
       "197                       0                      0                         0   \n",
       "198                       0                      0                         0   \n",
       "199                       0                      0                         0   \n",
       "\n",
       "     prog_vocation#ses_high  \n",
       "195                       0  \n",
       "196                       0  \n",
       "197                       0  \n",
       "198                       0  \n",
       "199                       0  \n",
       "\n",
       "[5 rows x 27 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_enc = InteractionEncoder(df_raw, {'prog': ['ses']}).transform()\n",
    "df_enc.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both main effects are categorical: all the values in those columns are strings.  The **original columns** `prog` and `ses` are **removed** from the final dataframe, because the `DummyEncoder` is applied to them to produce dummy columns, which makes the original columns redundant.\n",
    "\n",
    "These are the **columns added** to the final dataframe via the encoding:\n",
    "- Dummy columns are produced for each category via the `DummyEncoder`: 3 values + 3 values = 6 dummy columns.\n",
    "- There are multiple interaction effects encoded between the two categorical variables: 3 values * 3 values = 9 interaction effects.\n",
    "\n",
    "**NOTE:** for each categorical variable, one of its categories could be used as a 'base level' in a regression model, with its dummy column left out of the regressors."
   ]
  },
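  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column counts above (6 dummy columns plus 9 interaction columns) can be checked with a sketch in plain pandas.  It assumes that the interaction between two dummy columns is their element-wise product, and it uses `pd.get_dummies` as a stand-in for the `DummyEncoder`.  The names `df_toy` and `df_sketch` are hypothetical, and the toy data copies `prog` and `ses` for the last five rows shown above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame: `prog` and `ses` for the last five rows shown above.\n",
    "# Categories are declared explicitly so that unobserved levels still get dummy columns.\n",
    "df_toy = pd.DataFrame({\n",
    "    'prog': pd.Categorical(['academic', 'vocation', 'academic', 'academic', 'academic'],\n",
    "                           categories=['general', 'academic', 'vocation']),\n",
    "    'ses': pd.Categorical(['high', 'middle', 'middle', 'middle', 'middle'],\n",
    "                          categories=['low', 'middle', 'high']),\n",
    "}, index=range(195, 200))\n",
    "\n",
    "# Stand-in for the DummyEncoder: one 0/1 column per category (3 + 3 = 6)\n",
    "df_sketch = pd.get_dummies(df_toy, columns=['prog', 'ses'], dtype=int)\n",
    "prog_cols = [col for col in df_sketch.columns if col.startswith('prog_')]\n",
    "ses_cols = [col for col in df_sketch.columns if col.startswith('ses_')]\n",
    "\n",
    "# Assumption: each interaction is the product of one pair of dummy columns (3 * 3 = 9)\n",
    "for prog_col in prog_cols:\n",
    "    for ses_col in ses_cols:\n",
    "        df_sketch[f'{prog_col}#{ses_col}'] = df_sketch[prog_col] * df_sketch[ses_col]\n",
    "\n",
    "n_dummies = len(prog_cols) + len(ses_cols)\n",
    "print(n_dummies, 'dummy columns')\n",
    "print(df_sketch.shape[1] - n_dummies, 'interaction columns')"
   ]
  },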
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: ['ses', 'prog']\n",
      "Columns added: ['prog_vocation#ses_low', 'prog_academic#ses_middle', 'ses_low', 'prog_academic', 'prog_general#ses_high', 'prog_general#ses_low', 'prog_vocation#ses_middle', 'prog_academic#ses_low', 'prog_vocation', 'ses_high', 'prog_academic#ses_high', 'prog_vocation#ses_high', 'prog_general#ses_middle', 'ses_middle', 'prog_general']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The key and the value in the dictionary passed at initialization can also be switched.  The resulting dataframe holds the same information, but the interaction columns are named in the opposite order (e.g. `ses_high#prog_academic` instead of `prog_academic#ses_high`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>...</th>\n",
       "      <th>prog_vocation</th>\n",
       "      <th>ses_low#prog_general</th>\n",
       "      <th>ses_low#prog_academic</th>\n",
       "      <th>ses_low#prog_vocation</th>\n",
       "      <th>ses_middle#prog_general</th>\n",
       "      <th>ses_middle#prog_academic</th>\n",
       "      <th>ses_middle#prog_vocation</th>\n",
       "      <th>ses_high#prog_general</th>\n",
       "      <th>ses_high#prog_academic</th>\n",
       "      <th>ses_high#prog_vocation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>public</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 27 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female  schtyp  read  write  math  science  socst    honors  \\\n",
       "195  100.0       1  public  63.0   65.0  71.0     69.0   71.0  enrolled   \n",
       "196  143.0       0  public  63.0   63.0  75.0     72.0   66.0  enrolled   \n",
       "197   68.0       0  public  73.0   67.0  71.0     63.0   66.0  enrolled   \n",
       "198   57.0       1  public  71.0   65.0  72.0     66.0   56.0  enrolled   \n",
       "199  132.0       0  public  73.0   62.0  73.0     69.0   66.0  enrolled   \n",
       "\n",
       "     awards  ...  prog_vocation  ses_low#prog_general  ses_low#prog_academic  \\\n",
       "195     5.0  ...              0                     0                      0   \n",
       "196     4.0  ...              1                     0                      0   \n",
       "197     7.0  ...              0                     0                      0   \n",
       "198     5.0  ...              0                     0                      0   \n",
       "199     3.0  ...              0                     0                      0   \n",
       "\n",
       "     ses_low#prog_vocation  ses_middle#prog_general  ses_middle#prog_academic  \\\n",
       "195                      0                        0                         0   \n",
       "196                      0                        0                         0   \n",
       "197                      0                        0                         1   \n",
       "198                      0                        0                         1   \n",
       "199                      0                        0                         1   \n",
       "\n",
       "     ses_middle#prog_vocation  ses_high#prog_general  ses_high#prog_academic  \\\n",
       "195                         0                      0                       1   \n",
       "196                         1                      0                       0   \n",
       "197                         0                      0                       0   \n",
       "198                         0                      0                       0   \n",
       "199                         0                      0                       0   \n",
       "\n",
       "     ses_high#prog_vocation  \n",
       "195                       0  \n",
       "196                       0  \n",
       "197                       0  \n",
       "198                       0  \n",
       "199                       0  \n",
       "\n",
       "[5 rows x 27 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_enc = InteractionEncoder(df_raw, {'ses': ['prog']}).transform()\n",
    "df_enc.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: ['ses', 'prog']\n",
      "Columns added: ['ses_low', 'prog_academic', 'ses_middle#prog_academic', 'prog_general', 'ses_middle#prog_general', 'ses_low#prog_general', 'ses_high#prog_academic', 'ses_low#prog_academic', 'prog_vocation', 'ses_low#prog_vocation', 'ses_high', 'ses_middle', 'ses_middle#prog_vocation', 'ses_high#prog_vocation', 'ses_high#prog_general']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One Bool and one categorical"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Categorical: `prog`\n",
    "- Bool: `female`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>prog_general</th>\n",
       "      <th>prog_academic</th>\n",
       "      <th>prog_vocation</th>\n",
       "      <th>prog_general#female</th>\n",
       "      <th>prog_academic#female</th>\n",
       "      <th>prog_vocation#female</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female     ses  schtyp  read  write  math  science  socst  \\\n",
       "195  100.0       1    high  public  63.0   65.0  71.0     69.0   71.0   \n",
       "196  143.0       0  middle  public  63.0   63.0  75.0     72.0   66.0   \n",
       "197   68.0       0  middle  public  73.0   67.0  71.0     63.0   66.0   \n",
       "198   57.0       1  middle  public  71.0   65.0  72.0     66.0   56.0   \n",
       "199  132.0       0  middle  public  73.0   62.0  73.0     69.0   66.0   \n",
       "\n",
       "       honors  awards  cid  read_gt_mean  prog_general  prog_academic  \\\n",
       "195  enrolled     5.0   20             1             0              1   \n",
       "196  enrolled     4.0   20             1             0              0   \n",
       "197  enrolled     7.0   20             1             0              1   \n",
       "198  enrolled     5.0   20             1             0              1   \n",
       "199  enrolled     3.0   20             1             0              1   \n",
       "\n",
       "     prog_vocation  prog_general#female  prog_academic#female  \\\n",
       "195              0                    0                     1   \n",
       "196              1                    0                     0   \n",
       "197              0                    0                     0   \n",
       "198              0                    0                     1   \n",
       "199              0                    0                     0   \n",
       "\n",
       "     prog_vocation#female  \n",
       "195                     0  \n",
       "196                     0  \n",
       "197                     0  \n",
       "198                     0  \n",
       "199                     0  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_enc = InteractionEncoder(df_raw, {'prog': ['female']}).transform()\n",
    "df_enc.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the main effects is a Boolean variable, so its column is kept in the final dataframe.  The other main effect is a categorical variable, so dummy columns are encoded for it and the original column is removed from the final dataframe.\n",
    "\n",
    "The columns added:\n",
    "- Dummy columns for the categorical variable: 3 values gives 3 dummy columns\n",
    "- Interaction effects between the Boolean variable and the dummy columns: 3 dummy columns * 1 Bool column = 3 interaction effects"
   ]
  },
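  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same kind of check works for the Boolean and categorical case.  This is a sketch in plain pandas, again assuming that each interaction column is the product of a dummy column and the Boolean column, with `pd.get_dummies` standing in for the `DummyEncoder`.  The names `df_toy` and `df_sketch` are hypothetical, and the toy data copies `female` and `prog` for the last five rows shown above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy frame: `female` and `prog` for the last five rows shown above\n",
    "df_toy = pd.DataFrame({\n",
    "    'female': [1, 0, 0, 1, 0],\n",
    "    'prog': pd.Categorical(['academic', 'vocation', 'academic', 'academic', 'academic'],\n",
    "                           categories=['general', 'academic', 'vocation']),\n",
    "}, index=range(195, 200))\n",
    "\n",
    "# The Boolean column is left as it is; only `prog` is turned into 3 dummy columns\n",
    "df_sketch = pd.get_dummies(df_toy, columns=['prog'], dtype=int)\n",
    "prog_cols = [col for col in df_sketch.columns if col.startswith('prog_')]\n",
    "\n",
    "# Assumption: each interaction is a dummy column multiplied by the Boolean column\n",
    "for prog_col in prog_cols:\n",
    "    df_sketch[f'{prog_col}#female'] = df_sketch[prog_col] * df_sketch['female']\n",
    "\n",
    "df_sketch"
   ]
  },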
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: ['prog']\n",
      "Columns added: ['prog_academic', 'prog_vocation#female', 'prog_academic#female', 'prog_vocation', 'prog_general#female', 'prog_general']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One Bool and one continuous"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case let's encode interactions between `female` and TWO continuous variables!\n",
    "\n",
    "- Bool: `female`\n",
    "- Continuous: `read` and `write`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>prog</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>female#read</th>\n",
       "      <th>female#write</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female     ses  schtyp      prog  read  write  math  science  \\\n",
       "195  100.0       1    high  public  academic  63.0   65.0  71.0     69.0   \n",
       "196  143.0       0  middle  public  vocation  63.0   63.0  75.0     72.0   \n",
       "197   68.0       0  middle  public  academic  73.0   67.0  71.0     63.0   \n",
       "198   57.0       1  middle  public  academic  71.0   65.0  72.0     66.0   \n",
       "199  132.0       0  middle  public  academic  73.0   62.0  73.0     69.0   \n",
       "\n",
       "     socst    honors  awards  cid  read_gt_mean  female#read  female#write  \n",
       "195   71.0  enrolled     5.0   20             1         63.0          65.0  \n",
       "196   66.0  enrolled     4.0   20             1          0.0           0.0  \n",
       "197   66.0  enrolled     7.0   20             1          0.0           0.0  \n",
       "198   56.0  enrolled     5.0   20             1         71.0          65.0  \n",
       "199   66.0  enrolled     3.0   20             1          0.0           0.0  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_enc = InteractionEncoder(df_raw, {'female': ['read', 'write']}).transform()\n",
    "df_enc.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The columns for the main effects are Boolean or continuous, so they must be kept in the final dataframe.\n",
    "\n",
    "There is only one interaction effect between a Boolean variable and a continuous variable, so one column is added to the dataframe for each of those pairings.\n",
    "\n",
    "(In this case, there were two continuous variables interacted with `female` so there are two interaction effects added to the final dataframe)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: []\n",
      "Columns added: ['female#write', 'female#read']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same interactions can also be specified the other way round, with the continuous variables as the dictionary keys and `female` as the value.  The interaction values are identical; only the column names change, to `read#female` and `write#female`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>prog</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>read#female</th>\n",
       "      <th>write#female</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>academic</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female     ses  schtyp      prog  read  write  math  science  \\\n",
       "195  100.0       1    high  public  academic  63.0   65.0  71.0     69.0   \n",
       "196  143.0       0  middle  public  vocation  63.0   63.0  75.0     72.0   \n",
       "197   68.0       0  middle  public  academic  73.0   67.0  71.0     63.0   \n",
       "198   57.0       1  middle  public  academic  71.0   65.0  72.0     66.0   \n",
       "199  132.0       0  middle  public  academic  73.0   62.0  73.0     69.0   \n",
       "\n",
       "     socst    honors  awards  cid  read_gt_mean  read#female  write#female  \n",
       "195   71.0  enrolled     5.0   20             1         63.0          65.0  \n",
       "196   66.0  enrolled     4.0   20             1          0.0           0.0  \n",
       "197   66.0  enrolled     7.0   20             1          0.0           0.0  \n",
       "198   56.0  enrolled     5.0   20             1         71.0          65.0  \n",
       "199   66.0  enrolled     3.0   20             1          0.0           0.0  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_enc = InteractionEncoder(df_raw, {'read': ['female'],\n",
    "                                      'write': ['female']}).transform()\n",
    "df_enc.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: []\n",
      "Columns added: ['read#female', 'write#female']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One categorical and one continuous"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Categorical: `prog`\n",
    "- Continuous: `socst`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>prog_general</th>\n",
       "      <th>prog_academic</th>\n",
       "      <th>prog_vocation</th>\n",
       "      <th>socst#prog_general</th>\n",
       "      <th>socst#prog_academic</th>\n",
       "      <th>socst#prog_vocation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>66.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female     ses  schtyp  read  write  math  science  socst  \\\n",
       "195  100.0       1    high  public  63.0   65.0  71.0     69.0   71.0   \n",
       "196  143.0       0  middle  public  63.0   63.0  75.0     72.0   66.0   \n",
       "197   68.0       0  middle  public  73.0   67.0  71.0     63.0   66.0   \n",
       "198   57.0       1  middle  public  71.0   65.0  72.0     66.0   56.0   \n",
       "199  132.0       0  middle  public  73.0   62.0  73.0     69.0   66.0   \n",
       "\n",
       "       honors  awards  cid  read_gt_mean  prog_general  prog_academic  \\\n",
       "195  enrolled     5.0   20             1             0              1   \n",
       "196  enrolled     4.0   20             1             0              0   \n",
       "197  enrolled     7.0   20             1             0              1   \n",
       "198  enrolled     5.0   20             1             0              1   \n",
       "199  enrolled     3.0   20             1             0              1   \n",
       "\n",
       "     prog_vocation  socst#prog_general  socst#prog_academic  \\\n",
       "195              0                 0.0                 71.0   \n",
       "196              1                 0.0                  0.0   \n",
       "197              0                 0.0                 66.0   \n",
       "198              0                 0.0                 56.0   \n",
       "199              0                 0.0                 66.0   \n",
       "\n",
       "     socst#prog_vocation  \n",
       "195                  0.0  \n",
       "196                 66.0  \n",
       "197                  0.0  \n",
       "198                  0.0  \n",
       "199                  0.0  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_enc = InteractionEncoder(df_raw, {'socst': ['prog']}).transform()\n",
    "df_enc.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the main effects is continuous, so the column for that one must be kept in the final dataframe.  The other main effect is a categorical variable, so the original column is dropped from the final dataframe after dummy columns are encoded from it.\n",
    "\n",
    "There is an interaction effect between each of the dummy variables and the continuous variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns removed: ['prog']\n",
      "Columns added: ['prog_academic', 'socst#prog_vocation', 'prog_vocation', 'socst#prog_general', 'socst#prog_academic', 'prog_general']\n"
     ]
    }
   ],
   "source": [
    "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n",
    "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>prog_general</th>\n",
       "      <th>prog_academic</th>\n",
       "      <th>prog_vocation</th>\n",
       "      <th>prog_general#socst</th>\n",
       "      <th>prog_academic#socst</th>\n",
       "      <th>prog_vocation#socst</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>195</th>\n",
       "      <td>100.0</td>\n",
       "      <td>1</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>196</th>\n",
       "      <td>143.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>63.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>4.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>66.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>197</th>\n",
       "      <td>68.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>67.0</td>\n",
       "      <td>71.0</td>\n",
       "      <td>63.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>7.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>198</th>\n",
       "      <td>57.0</td>\n",
       "      <td>1</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>71.0</td>\n",
       "      <td>65.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>5.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>56.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>132.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>73.0</td>\n",
       "      <td>62.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>69.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>enrolled</td>\n",
       "      <td>3.0</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>66.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        id  female     ses  schtyp  read  write  math  science  socst  \\\n",
       "195  100.0       1    high  public  63.0   65.0  71.0     69.0   71.0   \n",
       "196  143.0       0  middle  public  63.0   63.0  75.0     72.0   66.0   \n",
       "197   68.0       0  middle  public  73.0   67.0  71.0     63.0   66.0   \n",
       "198   57.0       1  middle  public  71.0   65.0  72.0     66.0   56.0   \n",
       "199  132.0       0  middle  public  73.0   62.0  73.0     69.0   66.0   \n",
       "\n",
       "       honors  awards  cid  read_gt_mean  prog_general  prog_academic  \\\n",
       "195  enrolled     5.0   20             1             0              1   \n",
       "196  enrolled     4.0   20             1             0              0   \n",
       "197  enrolled     7.0   20             1             0              1   \n",
       "198  enrolled     5.0   20             1             0              1   \n",
       "199  enrolled     3.0   20             1             0              1   \n",
       "\n",
       "     prog_vocation  prog_general#socst  prog_academic#socst  \\\n",
       "195              0                 0.0                 71.0   \n",
       "196              1                 0.0                  0.0   \n",
       "197              0                 0.0                 66.0   \n",
       "198              0                 0.0                 56.0   \n",
       "199              0                 0.0                 66.0   \n",
       "\n",
       "     prog_vocation#socst  \n",
       "195                  0.0  \n",
       "196                 66.0  \n",
       "197                  0.0  \n",
       "198                  0.0  \n",
       "199                  0.0  "
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "InteractionEncoder(df_raw, {'prog': ['socst']}).transform().tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do basic OLS regression models using the dataset, where interaction effects are also used as variables in modelling.\n",
    "\n",
    "The UCLA's online resources have models of interaction effects on this dataset with Stata output: \n",
    "- [Interaction between two continuous variables](https://stats.idre.ucla.edu/stata/faq/how-can-i-explain-a-continuous-by-continuous-interaction-stata-12/)\n",
    "- [Interaction between categorical variable and continuous variable](https://stats.idre.ucla.edu/stata/faq/how-can-i-understand-a-categorical-by-continuous-interaction-stata-12/) (the example is a categorical variable with two categories, `female`, which is madr Boolean in this notebook).\n",
    "\n",
    "The Stata output for each model is also provided in this notebook for comparison against the models done through Appelpy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interaction between two continuous variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create new dataframe and set up the `InteractionEncoder` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_model = df_raw.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's regress `read` on the scores for `math`, `socst` and the _interaction_ between `math` & `socst`.\n",
    "\n",
    "To get the interaction effect in the dataframe, we need to do some encoding to get the column `math#socst`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>prog</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>math#socst</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>45.0</td>\n",
       "      <td>1</td>\n",
       "      <td>low</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>34.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1066.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>108.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>general</td>\n",
       "      <td>34.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1476.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>15.0</td>\n",
       "      <td>0</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>39.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>44.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1848.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>67.0</td>\n",
       "      <td>0</td>\n",
       "      <td>low</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>37.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1344.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>153.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>39.0</td>\n",
       "      <td>31.0</td>\n",
       "      <td>40.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>51.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>2040.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id  female     ses  schtyp      prog  read  write  math  science  socst  \\\n",
       "0   45.0       1     low  public  vocation  34.0   35.0  41.0     29.0   26.0   \n",
       "1  108.0       0  middle  public   general  34.0   33.0  41.0     36.0   36.0   \n",
       "2   15.0       0    high  public  vocation  39.0   39.0  44.0     26.0   42.0   \n",
       "3   67.0       0     low  public  vocation  37.0   37.0  42.0     33.0   32.0   \n",
       "4  153.0       0  middle  public  vocation  39.0   31.0  40.0     39.0   51.0   \n",
       "\n",
       "         honors  awards  cid  read_gt_mean  math#socst  \n",
       "0  not enrolled     0.0    1             0      1066.0  \n",
       "1  not enrolled     0.0    1             0      1476.0  \n",
       "2  not enrolled     0.0    1             0      1848.0  \n",
       "3  not enrolled     0.0    1             0      1344.0  \n",
       "4  not enrolled     0.0    1             0      2040.0  "
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_model = InteractionEncoder(df_model, {'math': ['socst']}).transform()\n",
    "df_model.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_list = ['read']\n",
    "X_list = ['math', 'socst', 'math#socst']\n",
    "model = OLS(df_model, y_list, X_list).fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>          <td>read</td>       <th>  R-squared:         </th> <td>   0.546</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.539</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   78.61</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Fri, 03 Jan 2020</td> <th>  Prob (F-statistic):</th> <td>1.99e-33</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>21:39:12</td>     <th>  Log-Likelihood:    </th> <td> -669.80</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>   200</td>      <th>  AIC:               </th> <td>   1348.</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>   196</td>      <th>  BIC:               </th> <td>   1361.</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     3</td>      <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "       <td></td>         <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>const</th>      <td>   37.8427</td> <td>   14.545</td> <td>    2.602</td> <td> 0.010</td> <td>    9.158</td> <td>   66.528</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>math</th>       <td>   -0.1105</td> <td>    0.292</td> <td>   -0.379</td> <td> 0.705</td> <td>   -0.686</td> <td>    0.465</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>socst</th>      <td>   -0.2200</td> <td>    0.272</td> <td>   -0.810</td> <td> 0.419</td> <td>   -0.756</td> <td>    0.316</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>math#socst</th> <td>    0.0113</td> <td>    0.005</td> <td>    2.157</td> <td> 0.032</td> <td>    0.001</td> <td>    0.022</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 3.611</td> <th>  Durbin-Watson:     </th> <td>   1.839</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.164</td> <th>  Jarque-Bera (JB):  </th> <td>   3.555</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td> 0.325</td> <th>  Prob(JB):          </th> <td>   0.169</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 2.942</td> <th>  Cond. No.          </th> <td>8.76e+04</td>\n",
       "</tr>\n",
       "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 8.76e+04. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:                   read   R-squared:                       0.546\n",
       "Model:                            OLS   Adj. R-squared:                  0.539\n",
       "Method:                 Least Squares   F-statistic:                     78.61\n",
       "Date:                Fri, 03 Jan 2020   Prob (F-statistic):           1.99e-33\n",
       "Time:                        21:39:12   Log-Likelihood:                -669.80\n",
       "No. Observations:                 200   AIC:                             1348.\n",
       "Df Residuals:                     196   BIC:                             1361.\n",
       "Df Model:                           3                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "const         37.8427     14.545      2.602      0.010       9.158      66.528\n",
       "math          -0.1105      0.292     -0.379      0.705      -0.686       0.465\n",
       "socst         -0.2200      0.272     -0.810      0.419      -0.756       0.316\n",
       "math#socst     0.0113      0.005      2.157      0.032       0.001       0.022\n",
       "==============================================================================\n",
       "Omnibus:                        3.611   Durbin-Watson:                   1.839\n",
       "Prob(Omnibus):                  0.164   Jarque-Bera (JB):                3.555\n",
       "Skew:                           0.325   Prob(JB):                        0.169\n",
       "Kurtosis:                       2.942   Cond. No.                     8.76e+04\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "[2] The condition number is large, 8.76e+04. This might indicate that there are\n",
       "strong multicollinearity or other numerical problems.\n",
       "\"\"\""
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.results_output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
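    "The interaction between `math` and `socst`, i.e. `math#socst`, is significant at the 5% level (p = 0.032), whereas neither main effect is individually significant in this specification."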
   ]
  },
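  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how the interaction coefficient can be read, assuming the fitted equation from the summary above: with a `math#socst` term in the model, the marginal effect of `math` on `read` is the `math` coefficient plus the interaction coefficient times `socst`, so it varies with `socst`. The coefficients are copied from the summary; `marginal_effect_of_math` is a hypothetical helper and not part of Appelpy.\n",
    "\n",
    "```python\n",
    "# Coefficients copied from the OLS summary above (rounded to 4 d.p.)\n",
    "B_MATH = -0.1105\n",
    "B_MATH_SOCST = 0.0113\n",
    "\n",
    "\n",
    "def marginal_effect_of_math(socst):\n",
    "    # d(read)/d(math) = b_math + b_interaction * socst\n",
    "    return B_MATH + B_MATH_SOCST * socst\n",
    "\n",
    "\n",
    "# Evaluate the marginal effect at a few illustrative socst scores\n",
    "for socst in (30, 50, 70):\n",
    "    print(socst, round(marginal_effect_of_math(socst), 4))\n",
    "```\n",
    "\n",
    "Under these estimates the marginal effect of `math` rises with `socst`. These are point estimates only: a standard error for the combined effect would also need the covariance between the two coefficients."
   ]
  },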
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'root_mse': 6.96003820368867,\n",
       " 'r_squared': 0.5461318818125249,\n",
       " 'r_squared_adj': 0.5391849208198595,\n",
       " 'aic': 1347.6088571651621,\n",
       " 'bic': 1360.8021266313542}"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.model_selection_stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is what the model output would be from Stata:\n",
    "\n",
    "```\n",
    "      Source |       SS       df       MS              Number of obs =     200\n",
    "-------------+------------------------------           F(  3,   196) =   78.61\n",
    "       Model |  11424.7622     3  3808.25406           Prob > F      =  0.0000\n",
    "    Residual |  9494.65783   196  48.4421318           R-squared     =  0.5461\n",
    "-------------+------------------------------           Adj R-squared =  0.5392\n",
    "       Total |    20919.42   199  105.122714           Root MSE      =    6.96\n",
    "\n",
    "------------------------------------------------------------------------------\n",
    "        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]\n",
    "-------------+----------------------------------------------------------------\n",
    "        math |  -.1105123   .2916338    -0.38   0.705    -.6856552    .4646307\n",
    "       socst |  -.2200442   .2717539    -0.81   0.419    -.7559812    .3158928\n",
    "             |\n",
    "      c.math#|\n",
    "     c.socst |   .0112807   .0052294     2.16   0.032     .0009677    .0215938\n",
    "             |\n",
    "       _cons |   37.84271   14.54521     2.60   0.010     9.157506    66.52792\n",
    "------------------------------------------------------------------------------\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interaction between continuous and Bool variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>female</th>\n",
       "      <th>ses</th>\n",
       "      <th>schtyp</th>\n",
       "      <th>prog</th>\n",
       "      <th>read</th>\n",
       "      <th>write</th>\n",
       "      <th>math</th>\n",
       "      <th>science</th>\n",
       "      <th>socst</th>\n",
       "      <th>honors</th>\n",
       "      <th>awards</th>\n",
       "      <th>cid</th>\n",
       "      <th>read_gt_mean</th>\n",
       "      <th>female#socst</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>45.0</td>\n",
       "      <td>1</td>\n",
       "      <td>low</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>34.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>26.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>108.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>general</td>\n",
       "      <td>34.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>15.0</td>\n",
       "      <td>0</td>\n",
       "      <td>high</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>39.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>44.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>67.0</td>\n",
       "      <td>0</td>\n",
       "      <td>low</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>37.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>42.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>153.0</td>\n",
       "      <td>0</td>\n",
       "      <td>middle</td>\n",
       "      <td>public</td>\n",
       "      <td>vocation</td>\n",
       "      <td>39.0</td>\n",
       "      <td>31.0</td>\n",
       "      <td>40.0</td>\n",
       "      <td>39.0</td>\n",
       "      <td>51.0</td>\n",
       "      <td>not enrolled</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id  female     ses  schtyp      prog  read  write  math  science  socst  \\\n",
       "0   45.0       1     low  public  vocation  34.0   35.0  41.0     29.0   26.0   \n",
       "1  108.0       0  middle  public   general  34.0   33.0  41.0     36.0   36.0   \n",
       "2   15.0       0    high  public  vocation  39.0   39.0  44.0     26.0   42.0   \n",
       "3   67.0       0     low  public  vocation  37.0   37.0  42.0     33.0   32.0   \n",
       "4  153.0       0  middle  public  vocation  39.0   31.0  40.0     39.0   51.0   \n",
       "\n",
       "         honors  awards  cid  read_gt_mean  female#socst  \n",
       "0  not enrolled     0.0    1             0          26.0  \n",
       "1  not enrolled     0.0    1             0           0.0  \n",
       "2  not enrolled     0.0    1             0           0.0  \n",
       "3  not enrolled     0.0    1             0           0.0  \n",
       "4  not enrolled     0.0    1             0           0.0  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_model = InteractionEncoder(df_raw, {'female': ['socst']}).transform()\n",
    "df_model.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = OLS(df_model, ['write'], ['female', 'socst', 'female#socst']).fit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>          <td>write</td>      <th>  R-squared:         </th> <td>   0.430</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.421</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   49.26</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Fri, 03 Jan 2020</td> <th>  Prob (F-statistic):</th> <td>9.02e-24</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>21:39:12</td>     <th>  Log-Likelihood:    </th> <td> -676.91</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>   200</td>      <th>  AIC:               </th> <td>   1362.</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>   196</td>      <th>  BIC:               </th> <td>   1375.</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     3</td>      <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>   \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "        <td></td>          <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>const</th>        <td>   17.7619</td> <td>    3.555</td> <td>    4.996</td> <td> 0.000</td> <td>   10.751</td> <td>   24.773</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>female</th>       <td>   15.0000</td> <td>    5.098</td> <td>    2.942</td> <td> 0.004</td> <td>    4.946</td> <td>   25.054</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>socst</th>        <td>    0.6248</td> <td>    0.067</td> <td>    9.315</td> <td> 0.000</td> <td>    0.493</td> <td>    0.757</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>female#socst</th> <td>   -0.2047</td> <td>    0.095</td> <td>   -2.147</td> <td> 0.033</td> <td>   -0.393</td> <td>   -0.017</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td> 2.193</td> <th>  Durbin-Watson:     </th> <td>   1.266</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.334</td> <th>  Jarque-Bera (JB):  </th> <td>   2.004</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td>-0.152</td> <th>  Prob(JB):          </th> <td>   0.367</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 2.615</td> <th>  Cond. No.          </th> <td>    713.</td>\n",
       "</tr>\n",
       "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:                  write   R-squared:                       0.430\n",
       "Model:                            OLS   Adj. R-squared:                  0.421\n",
       "Method:                 Least Squares   F-statistic:                     49.26\n",
       "Date:                Fri, 03 Jan 2020   Prob (F-statistic):           9.02e-24\n",
       "Time:                        21:39:12   Log-Likelihood:                -676.91\n",
       "No. Observations:                 200   AIC:                             1362.\n",
       "Df Residuals:                     196   BIC:                             1375.\n",
       "Df Model:                           3                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "================================================================================\n",
       "                   coef    std err          t      P>|t|      [0.025      0.975]\n",
       "--------------------------------------------------------------------------------\n",
       "const           17.7619      3.555      4.996      0.000      10.751      24.773\n",
       "female          15.0000      5.098      2.942      0.004       4.946      25.054\n",
       "socst            0.6248      0.067      9.315      0.000       0.493       0.757\n",
       "female#socst    -0.2047      0.095     -2.147      0.033      -0.393      -0.017\n",
       "==============================================================================\n",
       "Omnibus:                        2.193   Durbin-Watson:                   1.266\n",
       "Prob(Omnibus):                  0.334   Jarque-Bera (JB):                2.004\n",
       "Skew:                          -0.152   Prob(JB):                        0.367\n",
       "Kurtosis:                       2.615   Cond. No.                         713.\n",
       "==============================================================================\n",
       "\n",
       "Warnings:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.results_output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The interaction between `female` and `socst`, i.e. `female#socst`, is significant at the 5% level (p = 0.033).\n",
    "\n",
    "The chart in the [UCLA resources](https://stats.idre.ucla.edu/stata/faq/how-can-i-understand-a-categorical-by-continuous-interaction-stata-12/) shows how the slope of `socst` differs by gender."
   ]
  },
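  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what those gender-specific slopes look like, the fitted line for each group can be recovered from the coefficients in the summary above: the `female` coefficient shifts the intercept and the `female#socst` coefficient shifts the slope. `fitted_line` is a hypothetical helper written for this illustration, not an Appelpy function.\n",
    "\n",
    "```python\n",
    "# Coefficients copied from the OLS summary above (rounded to 4 d.p.)\n",
    "CONST = 17.7619\n",
    "B_FEMALE = 15.0000\n",
    "B_SOCST = 0.6248\n",
    "B_FEMALE_SOCST = -0.2047\n",
    "\n",
    "\n",
    "def fitted_line(female):\n",
    "    # Return (intercept, slope of socst) for female = 0 or female = 1\n",
    "    intercept = CONST + B_FEMALE * female\n",
    "    slope = B_SOCST + B_FEMALE_SOCST * female\n",
    "    return intercept, slope\n",
    "\n",
    "\n",
    "for female in (0, 1):\n",
    "    intercept, slope = fitted_line(female)\n",
    "    print(f'female={female}: write = {intercept:.4f} + {slope:.4f} * socst')\n",
    "```\n",
    "\n",
    "Under these estimates the line for `female` = 1 starts from a higher intercept but has a flatter slope (about 0.42 compared with about 0.62)."
   ]
  },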
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'root_mse': 7.211611852775864,\n",
       " 'r_squared': 0.42986123794053965,\n",
       " 'r_squared_adj': 0.4211346242355479,\n",
       " 'aic': 1361.811865520546,\n",
       " 'bic': 1375.005134986738}"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.model_selection_stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is what the regression output would be from Stata:\n",
    "\n",
    "```\n",
    "      Source |       SS       df       MS              Number of obs =     200\n",
    "-------------+------------------------------           F(  3,   196) =   49.26\n",
    "       Model |  7685.43528     3  2561.81176           Prob > F      =  0.0000\n",
    "    Residual |  10193.4397   196  52.0073455           R-squared     =  0.4299\n",
    "-------------+------------------------------           Adj R-squared =  0.4211\n",
    "       Total |   17878.875   199   89.843593           Root MSE      =  7.2116\n",
    "\n",
    "------------------------------------------------------------------------------\n",
    "       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]\n",
    "-------------+----------------------------------------------------------------\n",
    "    1.female |   15.00001    5.09795     2.94   0.004     4.946132    25.05389\n",
    "       socst |   .6247968   .0670709     9.32   0.000     .4925236    .7570701\n",
    "             |\n",
    "      female#|\n",
    "     c.socst |\n",
    "          1  |  -.2047288   .0953726    -2.15   0.033    -.3928171   -.0166405\n",
    "             |\n",
    "       _cons |    17.7619   3.554993     5.00   0.000     10.75095    24.77284\n",
    "------------------------------------------------------------------------------\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model pipeline example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Appelpy classes can be chained into a model pipeline with the Pandas `pipe` method."
   ]
  },
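  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DataFrame.pipe(func, *args)` calls `func(df, *args)`, so any callable that takes a dataframe as its first argument can serve as a pipeline step. That includes a class: piping a dataframe into a class returns an instance, whose methods can then be chained. A minimal sketch with plain Pandas, where `add_interaction` is a hypothetical stand-in for an encoder and the toy rows are taken from the first rows of the dataset shown earlier:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def add_interaction(df, left, right):\n",
    "    # Hypothetical helper: return a copy with a 'left#right' product column\n",
    "    out = df.copy()\n",
    "    out[f'{left}#{right}'] = out[left] * out[right]\n",
    "    return out\n",
    "\n",
    "\n",
    "toy = pd.DataFrame({'female': [1, 0, 0], 'socst': [26.0, 36.0, 42.0]})\n",
    "\n",
    "# The piped call and the direct call are equivalent\n",
    "piped = toy.pipe(add_interaction, 'female', 'socst')\n",
    "direct = add_interaction(toy, 'female', 'socst')\n",
    "print(piped.equals(direct))\n",
    "```\n",
    "\n",
    "Returning a copy in each step keeps the input dataframe unchanged, which makes a chained pipeline safe to re-run."
   ]
  },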
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_data(raw_df):\n",
    "    return (raw_df\n",
    "            .pipe(InteractionEncoder, {'female': ['socst']})\n",
    "            .transform())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_model(df, y_list, X_list):\n",
    "    return OLS(df, y_list, X_list).fit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below reproduces the `model_selection_stats` of the previous model via a Pandas pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'root_mse': 7.211611852775864,\n",
       " 'r_squared': 0.42986123794053965,\n",
       " 'r_squared_adj': 0.4211346242355479,\n",
       " 'aic': 1361.811865520546,\n",
       " 'bic': 1375.005134986738}"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(df_raw\n",
    " .pipe(process_data)\n",
    " .pipe(fit_model, ['write'], ['female', 'socst', 'female#socst'])\n",
    " .model_selection_stats)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "384.391px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}