{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook shows the functionality of the **`DummyEncoder` and `InteractionEncoder` classes of Appelpy** 🍏🥧 in depth, applied to an econometrics dataset. These classes are in the `utils` module.\n", "\n", "**Notebook structure:**\n", "- **Data loading:** e.g. what format is needed for categorical columns and Boolean columns before using the encoders.\n", "- **`DummyEncoder` functionality:** basic examples of categorical columns being encoded into dummy columns.\n", "- **`InteractionEncoder` functionality:** multiple scenarios are covered for interactions between different data types.\n", "- **Modelling:** examples of models that use interaction effects.\n", "\n", "The notebook ends with an example of a simple model pipeline using the `InteractionEncoder`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Appelpy imports:\n", "from appelpy.utils import DummyEncoder, InteractionEncoder\n", "from appelpy.linear_model import OLS\n", "\n", "# Suppress all warnings (e.g. Numpy warnings raised via Statsmodels)\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [hsbdemo DTA file](https://stats.idre.ucla.edu/stat/data/hsbdemo.dta) in this example is a dataset with 200 observations on the academic choices of students and other information about the students themselves, e.g. their academic profiles and demographic information." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df_raw = pd.read_stata('https://stats.idre.ucla.edu/stat/data/hsbdemo.dta')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp prog read write math science socst \\\n", "0 45.0 female low public vocation 34.0 35.0 41.0 29.0 26.0 \n", "1 108.0 male middle public general 34.0 33.0 41.0 36.0 36.0 \n", "2 15.0 male high public vocation 39.0 39.0 44.0 26.0 42.0 \n", "3 67.0 male low public vocation 37.0 37.0 42.0 33.0 32.0 \n", "4 153.0 male middle public vocation 39.0 31.0 40.0 39.0 51.0 \n", "\n", " honors awards cid \n", "0 not enrolled 0.0 1 \n", "1 not enrolled 0.0 1 \n", "2 not enrolled 0.0 1 \n", "3 not enrolled 0.0 1 \n", "4 not enrolled 0.0 1 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id 200\n", "female 2\n", "ses 3\n", "schtyp 2\n", "prog 3\n", "read 30\n", "write 29\n", "math 40\n", "science 34\n", "socst 22\n", "honors 2\n", "awards 7\n", "cid 20\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The categorical columns from the Stata file are already set up to be recognised by Pandas as `pd.Categorical` dtype.\n", "\n", "**NOTE: categorical data fed to the encoders should be in the `pd.Categorical` dtype in order for the encoding to work!** They must not be in the generic `object` dtype." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course the `DummyEncoder` also handles cases where there are NaN values for categorical data (via the `nan_policy` argument)! That functionality will be covered separately in another notebook." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id float32\n", "female category\n", "ses category\n", "schtyp category\n", "prog category\n", "read float32\n", "write float32\n", "math float32\n", "science float32\n", "socst float32\n", "honors category\n", "awards float32\n", "cid int16\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `female` column will be recoded here as a Boolean column with values in {0, 1}, rather than the {'male', 'female'} format originally in the dataset.\n", "\n", "**NOTE: Boolean data fed to the encoders should be restricted to values in {0, 1} in order for the encoding to work!**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Recode 'female' col into 1 and 0 vals\n", "df_raw['female'] = np.where(df_raw['female'] == 'female', 1, 0)\n", "\n", "# Create another Bool col for use later on - col for 'read' value being higher than the mean\n", "df_raw['read_gt_mean'] = np.where(df_raw['read'] > df_raw['read'].mean(), 1, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are some examples of the types of data in the dataset.\n", "\n", "Boolean variables:\n", "- `female`\n", "\n", "Categorical variables:\n", "- `ses`\n", "- `prog`\n", "\n", "Continuous variables:\n", "- `read`, `write`, `math`, `science`, `socst`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data pre-processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `DummyEncoder` functionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a new copy of the `df_raw` dataframe.\n", "\n", "The `dummy_encoder` object is an instance of the `DummyEncoder` class.\n", "\n", "**The encoder object must be initialized with a dataframe.**\n", "\n", "By default, the `_` 
separator is used to produce the dummy columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It takes a dictionary, where each column name is paired with a base level. If a base level is specified, then the dummy column for that category is dropped from the final dataframe." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "dummy_encoder = DummyEncoder(df_raw, {'schtyp': None,\n", " 'prog': None,\n", " 'honors': None})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the transformed dataframe with the `transform` method." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# Overwrite the dataframe - encode dummies from the categorical variables specified\n", "df = dummy_encoder.transform()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default NaN policy: row_of_zero\n" ] } ], "source": [ "print(f\"Default NaN policy: {dummy_encoder.nan_policy}\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses read write math science socst awards cid \\\n", "0 45.0 1 low 34.0 35.0 41.0 29.0 26.0 0.0 1 \n", "1 108.0 0 middle 34.0 33.0 41.0 36.0 36.0 0.0 1 \n", "2 15.0 0 high 39.0 39.0 44.0 26.0 42.0 0.0 1 \n", "3 67.0 0 low 37.0 37.0 42.0 33.0 32.0 0.0 1 \n", "4 153.0 0 middle 39.0 31.0 40.0 39.0 51.0 0.0 1 \n", "\n", " read_gt_mean schtyp_public schtyp_private prog_general prog_academic \\\n", "0 0 1 0 0 0 \n", "1 0 1 0 1 0 \n", "2 0 1 0 0 0 \n", "3 0 1 0 0 0 \n", "4 0 1 0 0 0 \n", "\n", " prog_vocation honors_not enrolled honors_enrolled \n", "0 1 1 0 \n", "1 0 1 0 \n", "2 1 1 0 \n", "3 1 1 0 \n", "4 1 1 0 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three categorical variables fed to the `DummyEncoder`.\n", "\n", "The original columns for all three are removed from the final dataframe once encoding is done for their dummy variable equivalents." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['schtyp', 'prog', 'honors']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[col for col in dummy_encoder.categorical_col_base_levels.keys()]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from appelpy.utils import get_dataframe_columns_diff" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: ['prog', 'honors', 'schtyp']\n", "Columns added: ['prog_academic', 'honors_not enrolled', 'honors_enrolled', 'schtyp_public', 'prog_vocation', 'prog_general', 'schtyp_private']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df, df_raw)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `InteractionEncoder` functionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a new copy of the `df_raw` dataframe.\n", "\n", "The `int_encoder` object is an instance of the `InteractionEncoder` class.\n", "\n", "**The encoder object must be initialized with a dataframe.**\n", "\n", "The `#` separator is used to represent the interaction between two variables in the columns that are produced by the encoder." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "df = df_raw.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examples of interactions between variables will be given for these cases:\n", "- Two Boolean variables\n", "- Two continuous variables\n", "- Two categorical variables\n", "- One Boolean variable and one categorical variable\n", "- One Boolean variable and one continuous variable\n", "- One categorical variable and one continuous variable" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two Boolean variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Bool: `female`\n", "- Bool: `read_gt_mean`" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp prog read write math science \\\n", "195 100.0 1 high public academic 63.0 65.0 71.0 69.0 \n", "196 143.0 0 middle public vocation 63.0 63.0 75.0 72.0 \n", "197 68.0 0 middle public academic 73.0 67.0 71.0 63.0 \n", "198 57.0 1 middle public academic 71.0 65.0 72.0 66.0 \n", "199 132.0 0 middle public academic 73.0 62.0 73.0 69.0 \n", "\n", " socst honors awards cid read_gt_mean female#read_gt_mean \n", "195 71.0 enrolled 5.0 20 1 1 \n", "196 66.0 enrolled 4.0 20 1 0 \n", "197 66.0 enrolled 7.0 20 1 0 \n", "198 56.0 enrolled 5.0 20 1 1 \n", "199 66.0 enrolled 3.0 20 1 0 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int_encoder = InteractionEncoder(df, {'female': ['read_gt_mean']})\n", "\n", "df_enc = int_encoder.transform()\n", "df_enc.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns for the main effects are both Boolean, so they must be kept in the final dataframe.\n", "\n", "There is only one interaction effect between the two Boolean variables, so one column is added to the dataframe." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `get_dataframe_columns_diff` method is useful for checking how the final dataframe is different from the original dataframe after the encoding process." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: []\n", "Columns added: ['female#read_gt_mean']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df, df_enc)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code is essentially comparing the columns of the dataframes through sets." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: []\n", "Columns added: []\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df, df_raw)}\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: []\n", "Columns added: ['female#read_gt_mean']\n" ] } ], "source": [ "print(f\"Columns removed: {list(set(df.columns) - set(df_enc.columns))}\")\n", "print(f\"Columns added: {list(set(df_enc.columns) - set(df.columns))}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two continuous variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Continuous: `read`\n", "- Continuous: `write`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tip: do a one-line transformation by calling `transform` on an instance of the encoder class." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp prog read write math science \\\n", "195 100.0 1 high public academic 63.0 65.0 71.0 69.0 \n", "196 143.0 0 middle public vocation 63.0 63.0 75.0 72.0 \n", "197 68.0 0 middle public academic 73.0 67.0 71.0 63.0 \n", "198 57.0 1 middle public academic 71.0 65.0 72.0 66.0 \n", "199 132.0 0 middle public academic 73.0 62.0 73.0 69.0 \n", "\n", " socst honors awards cid read_gt_mean read#write \n", "195 71.0 enrolled 5.0 20 1 4095.0 \n", "196 66.0 enrolled 4.0 20 1 3969.0 \n", "197 66.0 enrolled 7.0 20 1 4891.0 \n", "198 56.0 enrolled 5.0 20 1 4615.0 \n", "199 66.0 enrolled 3.0 20 1 4526.0 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_enc = InteractionEncoder(df_raw, {'read': ['write']}).transform()\n", "df_enc.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns for the main effects are both continuous, so they must be kept in the final dataframe.\n", "\n", "There is only one interaction effect between the two Boolean variables, so one column is added to the dataframe." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: []\n", "Columns added: ['read#write']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two categorical variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Categorical: `prog`\n", "- Categorical: `ses`" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female schtyp read write math science socst honors \\\n", "195 100.0 1 public 63.0 65.0 71.0 69.0 71.0 enrolled \n", "196 143.0 0 public 63.0 63.0 75.0 72.0 66.0 enrolled \n", "197 68.0 0 public 73.0 67.0 71.0 63.0 66.0 enrolled \n", "198 57.0 1 public 71.0 65.0 72.0 66.0 56.0 enrolled \n", "199 132.0 0 public 73.0 62.0 73.0 69.0 66.0 enrolled \n", "\n", " awards ... ses_high prog_general#ses_low prog_general#ses_middle \\\n", "195 5.0 ... 1 0 0 \n", "196 4.0 ... 0 0 0 \n", "197 7.0 ... 0 0 0 \n", "198 5.0 ... 0 0 0 \n", "199 3.0 ... 0 0 0 \n", "\n", " prog_general#ses_high prog_academic#ses_low prog_academic#ses_middle \\\n", "195 0 0 0 \n", "196 0 0 0 \n", "197 0 0 1 \n", "198 0 0 1 \n", "199 0 0 1 \n", "\n", " prog_academic#ses_high prog_vocation#ses_low prog_vocation#ses_middle \\\n", "195 1 0 0 \n", "196 0 0 1 \n", "197 0 0 0 \n", "198 0 0 0 \n", "199 0 0 0 \n", "\n", " prog_vocation#ses_high \n", "195 0 \n", "196 0 \n", "197 0 \n", "198 0 \n", "199 0 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_enc = InteractionEncoder(df_raw, {'prog': ['ses']}).transform()\n", "df_enc.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns for the main effects are both categorical: the information in those columns all have string values. The **original columns** `prog` and `ses` are **removed** from the final dataframe, as the `DummyEncoder` is used on them to produce dummy columns for them in the final dataframe. 
The original columns thus become redundant.\n", "\n", "These are the **columns added** to the final dataframe via the encoding:\n", "- Dummy columns are produced for each category via the DummyEncoder: 3 values + 3 values = 6 dummy columns.\n", "- There are multiple interaction effects encoded between the two categorical variables: 3 values * 3 values = 9 interaction effects.\n", "\n", "**NOTE:** one of the categories could be used as a 'base level' in a regression model." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: ['ses', 'prog']\n", "Columns added: ['prog_vocation#ses_low', 'prog_academic#ses_middle', 'ses_low', 'prog_academic', 'prog_general#ses_high', 'prog_general#ses_low', 'prog_vocation#ses_middle', 'prog_academic#ses_low', 'prog_vocation', 'ses_high', 'prog_academic#ses_high', 'prog_vocation#ses_high', 'prog_general#ses_middle', 'ses_middle', 'prog_general']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The key-value pair in the class initialization can also be switched and produce a dataframe with the same information, but the column names for the interaction effects will be different." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female schtyp read write math science socst honors \\\n", "195 100.0 1 public 63.0 65.0 71.0 69.0 71.0 enrolled \n", "196 143.0 0 public 63.0 63.0 75.0 72.0 66.0 enrolled \n", "197 68.0 0 public 73.0 67.0 71.0 63.0 66.0 enrolled \n", "198 57.0 1 public 71.0 65.0 72.0 66.0 56.0 enrolled \n", "199 132.0 0 public 73.0 62.0 73.0 69.0 66.0 enrolled \n", "\n", " awards ... prog_vocation ses_low#prog_general ses_low#prog_academic \\\n", "195 5.0 ... 0 0 0 \n", "196 4.0 ... 1 0 0 \n", "197 7.0 ... 0 0 0 \n", "198 5.0 ... 0 0 0 \n", "199 3.0 ... 0 0 0 \n", "\n", " ses_low#prog_vocation ses_middle#prog_general ses_middle#prog_academic \\\n", "195 0 0 0 \n", "196 0 0 0 \n", "197 0 0 1 \n", "198 0 0 1 \n", "199 0 0 1 \n", "\n", " ses_middle#prog_vocation ses_high#prog_general ses_high#prog_academic \\\n", "195 0 0 1 \n", "196 1 0 0 \n", "197 0 0 0 \n", "198 0 0 0 \n", "199 0 0 0 \n", "\n", " ses_high#prog_vocation \n", "195 0 \n", "196 0 \n", "197 0 \n", "198 0 \n", "199 0 \n", "\n", "[5 rows x 27 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_enc = InteractionEncoder(df_raw, {'ses': ['prog']}).transform()\n", "df_enc.tail()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: ['ses', 'prog']\n", "Columns added: ['ses_low', 'prog_academic', 'ses_middle#prog_academic', 'prog_general', 'ses_middle#prog_general', 'ses_low#prog_general', 'ses_high#prog_academic', 'ses_low#prog_academic', 'prog_vocation', 'ses_low#prog_vocation', 'ses_high', 'ses_middle', 'ses_middle#prog_vocation', 'ses_high#prog_vocation', 'ses_high#prog_general']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One Bool and one categorical" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Categorical: `prog`\n", "- Bool: `female`" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp read write math science socst \\\n", "195 100.0 1 high public 63.0 65.0 71.0 69.0 71.0 \n", "196 143.0 0 middle public 63.0 63.0 75.0 72.0 66.0 \n", "197 68.0 0 middle public 73.0 67.0 71.0 63.0 66.0 \n", "198 57.0 1 middle public 71.0 65.0 72.0 66.0 56.0 \n", "199 132.0 0 middle public 73.0 62.0 73.0 69.0 66.0 \n", "\n", " honors awards cid read_gt_mean prog_general prog_academic \\\n", "195 enrolled 5.0 20 1 0 1 \n", "196 enrolled 4.0 20 1 0 0 \n", "197 enrolled 7.0 20 1 0 1 \n", "198 enrolled 5.0 20 1 0 1 \n", "199 enrolled 3.0 20 1 0 1 \n", "\n", " prog_vocation prog_general#female prog_academic#female \\\n", "195 0 0 1 \n", "196 1 0 0 \n", "197 0 0 0 \n", "198 0 0 1 \n", "199 0 0 0 \n", "\n", " prog_vocation#female \n", "195 0 \n", "196 0 \n", "197 0 \n", "198 0 \n", "199 0 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_enc = InteractionEncoder(df_raw, {'prog': ['female']}).transform()\n", "df_enc.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the main effect columns is for a Boolean variable, so that must be kept in the final dataframe. 
The other main effect is a categorical variable, so dummy columns are encoded for it and the original column is removed in the final dataframe.\n", "\n", "The columns added:\n", "- Dummy columns for the categorical variable: 3 values gives 3 dummy columns\n", "- Interaction effects between the Boolean variable and the dummy columns: 3 dummy columns * 1 Bool column = 3 interaction effects" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: ['prog']\n", "Columns added: ['prog_academic', 'prog_vocation#female', 'prog_academic#female', 'prog_vocation', 'prog_general#female', 'prog_general']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One Bool and one continuous" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case let's encode interactions between `female` and TWO continuous variables!\n", "\n", "- Bool: `female`\n", "- Continuous: `read` and `write`" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp prog read write math science \\\n", "195 100.0 1 high public academic 63.0 65.0 71.0 69.0 \n", "196 143.0 0 middle public vocation 63.0 63.0 75.0 72.0 \n", "197 68.0 0 middle public academic 73.0 67.0 71.0 63.0 \n", "198 57.0 1 middle public academic 71.0 65.0 72.0 66.0 \n", "199 132.0 0 middle public academic 73.0 62.0 73.0 69.0 \n", "\n", " socst honors awards cid read_gt_mean female#read female#write \n", "195 71.0 enrolled 5.0 20 1 63.0 65.0 \n", "196 66.0 enrolled 4.0 20 1 0.0 0.0 \n", "197 66.0 enrolled 7.0 20 1 0.0 0.0 \n", "198 56.0 enrolled 5.0 20 1 71.0 65.0 \n", "199 66.0 enrolled 3.0 20 1 0.0 0.0 " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_enc = InteractionEncoder(df_raw, {'female': ['read', 'write']}).transform()\n", "df_enc.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns for the main effects are Boolean or continuous, so they must be kept in the final dataframe.\n", "\n", "There is only one interaction effect between a Boolean variable and a continuous variable, so one column is added to the dataframe for each of those pairings.\n", "\n", "(In this case, there were two continuous variables interacted with `female` so there are two interaction effects added to the final dataframe)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: []\n", "Columns added: ['female#write', 'female#read']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp prog read write math science \\\n", "195 100.0 1 high public academic 63.0 65.0 71.0 69.0 \n", "196 143.0 0 middle public vocation 63.0 63.0 75.0 72.0 \n", "197 68.0 0 middle public academic 73.0 67.0 71.0 63.0 \n", "198 57.0 1 middle public academic 71.0 65.0 72.0 66.0 \n", "199 132.0 0 middle public academic 73.0 62.0 73.0 69.0 \n", "\n", " socst honors awards cid read_gt_mean read#female write#female \n", "195 71.0 enrolled 5.0 20 1 63.0 65.0 \n", "196 66.0 enrolled 4.0 20 1 0.0 0.0 \n", "197 66.0 enrolled 7.0 20 1 0.0 0.0 \n", "198 56.0 enrolled 5.0 20 1 71.0 65.0 \n", "199 66.0 enrolled 3.0 20 1 0.0 0.0 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_enc = InteractionEncoder(df_raw, {'read': ['female'],\n", " 'write': ['female']}).transform()\n", "df_enc.tail()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: []\n", "Columns added: ['read#female', 'write#female']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One categorical and one continuous" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Categorical: `prog`\n", "- Continuous: `socst`" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp read write math science socst \\\n", "195 100.0 1 high public 63.0 65.0 71.0 69.0 71.0 \n", "196 143.0 0 middle public 63.0 63.0 75.0 72.0 66.0 \n", "197 68.0 0 middle public 73.0 67.0 71.0 63.0 66.0 \n", "198 57.0 1 middle public 71.0 65.0 72.0 66.0 56.0 \n", "199 132.0 0 middle public 73.0 62.0 73.0 69.0 66.0 \n", "\n", " honors awards cid read_gt_mean prog_general prog_academic \\\n", "195 enrolled 5.0 20 1 0 1 \n", "196 enrolled 4.0 20 1 0 0 \n", "197 enrolled 7.0 20 1 0 1 \n", "198 enrolled 5.0 20 1 0 1 \n", "199 enrolled 3.0 20 1 0 1 \n", "\n", " prog_vocation socst#prog_general socst#prog_academic \\\n", "195 0 0.0 71.0 \n", "196 1 0.0 0.0 \n", "197 0 0.0 66.0 \n", "198 0 0.0 56.0 \n", "199 0 0.0 66.0 \n", "\n", " socst#prog_vocation \n", "195 0.0 \n", "196 66.0 \n", "197 0.0 \n", "198 0.0 \n", "199 0.0 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_enc = InteractionEncoder(df_raw, {'socst': ['prog']}).transform()\n", "df_enc.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the main effects is continuous, so the column for that one must be kept in the final dataframe. The other main effect is a categorical variable, so the original column is dropped from the final dataframe after dummy columns are encoded from it.\n", "\n", "There is an interaction effect between each of the dummy variables and the continuous variable." 
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columns removed: ['prog']\n", "Columns added: ['prog_academic', 'socst#prog_vocation', 'prog_vocation', 'socst#prog_general', 'socst#prog_academic', 'prog_general']\n" ] } ], "source": [ "print(f\"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}\")\n", "print(f\"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}\")" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp read write math science socst \\\n", "195 100.0 1 high public 63.0 65.0 71.0 69.0 71.0 \n", "196 143.0 0 middle public 63.0 63.0 75.0 72.0 66.0 \n", "197 68.0 0 middle public 73.0 67.0 71.0 63.0 66.0 \n", "198 57.0 1 middle public 71.0 65.0 72.0 66.0 56.0 \n", "199 132.0 0 middle public 73.0 62.0 73.0 69.0 66.0 \n", "\n", " honors awards cid read_gt_mean prog_general prog_academic \\\n", "195 enrolled 5.0 20 1 0 1 \n", "196 enrolled 4.0 20 1 0 0 \n", "197 enrolled 7.0 20 1 0 1 \n", "198 enrolled 5.0 20 1 0 1 \n", "199 enrolled 3.0 20 1 0 1 \n", "\n", " prog_vocation prog_general#socst prog_academic#socst \\\n", "195 0 0.0 71.0 \n", "196 1 0.0 0.0 \n", "197 0 0.0 66.0 \n", "198 0 0.0 56.0 \n", "199 0 0.0 66.0 \n", "\n", " prog_vocation#socst \n", "195 0.0 \n", "196 66.0 \n", "197 0.0 \n", "198 0.0 \n", "199 0.0 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "InteractionEncoder(df_raw, {'prog': ['socst']}).transform().tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do basic OLS regression models using the dataset, where interaction effects are also used as variables in modelling.\n", "\n", "The UCLA's online resources have models of interaction effects on this dataset with Stata output: \n", "- [Interaction between two continuous variables](https://stats.idre.ucla.edu/stata/faq/how-can-i-explain-a-continuous-by-continuous-interaction-stata-12/)\n", "- [Interaction between categorical variable and continuous variable](https://stats.idre.ucla.edu/stata/faq/how-can-i-understand-a-categorical-by-continuous-interaction-stata-12/) (the example is a categorical variable with two categories, `female`, which is madr Boolean in this notebook).\n", "\n", "The Stata output for each model is also provided in this notebook for comparison against the models done through Appelpy." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interaction between two continuous variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create new dataframe and set up the `InteractionEncoder` object." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "df_model = df_raw.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's regress `read` on the scores for `math`, `socst` and the _interaction_ between `math` & `socst`.\n", "\n", "To get the interaction effect in the dataframe, we need to do some encoding to get the column `math#socst`." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp prog read write math science socst \\\n", "0 45.0 1 low public vocation 34.0 35.0 41.0 29.0 26.0 \n", "1 108.0 0 middle public general 34.0 33.0 41.0 36.0 36.0 \n", "2 15.0 0 high public vocation 39.0 39.0 44.0 26.0 42.0 \n", "3 67.0 0 low public vocation 37.0 37.0 42.0 33.0 32.0 \n", "4 153.0 0 middle public vocation 39.0 31.0 40.0 39.0 51.0 \n", "\n", " honors awards cid read_gt_mean math#socst \n", "0 not enrolled 0.0 1 0 1066.0 \n", "1 not enrolled 0.0 1 0 1476.0 \n", "2 not enrolled 0.0 1 0 1848.0 \n", "3 not enrolled 0.0 1 0 1344.0 \n", "4 not enrolled 0.0 1 0 2040.0 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_model = InteractionEncoder(df_model, {'math': ['socst']}).transform()\n", "df_model.head()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "y_list = ['read']\n", "X_list = ['math', 'socst', 'math#socst']\n", "model = OLS(df_model, y_list, X_list).fit()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.76e+04. This might indicate that there are
strong multicollinearity or other numerical problems." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: read R-squared: 0.546\n", "Model: OLS Adj. R-squared: 0.539\n", "Method: Least Squares F-statistic: 78.61\n", "Date: Fri, 03 Jan 2020 Prob (F-statistic): 1.99e-33\n", "Time: 21:39:12 Log-Likelihood: -669.80\n", "No. Observations: 200 AIC: 1348.\n", "Df Residuals: 196 BIC: 1361.\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const 37.8427 14.545 2.602 0.010 9.158 66.528\n", "math -0.1105 0.292 -0.379 0.705 -0.686 0.465\n", "socst -0.2200 0.272 -0.810 0.419 -0.756 0.316\n", "math#socst 0.0113 0.005 2.157 0.032 0.001 0.022\n", "==============================================================================\n", "Omnibus: 3.611 Durbin-Watson: 1.839\n", "Prob(Omnibus): 0.164 Jarque-Bera (JB): 3.555\n", "Skew: 0.325 Prob(JB): 0.169\n", "Kurtosis: 2.942 Cond. No. 8.76e+04\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 8.76e+04. This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n", "\"\"\"" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.results_output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The interaction between `math` and `socst`, i.e. `math#socst`, is significant at the 5% level (p = 0.032)."
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'root_mse': 6.96003820368867,\n", " 'r_squared': 0.5461318818125249,\n", " 'r_squared_adj': 0.5391849208198595,\n", " 'aic': 1347.6088571651621,\n", " 'bic': 1360.8021266313542}" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.model_selection_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is what the model output would be from Stata:\n", "\n", "```\n", " Source | SS df MS Number of obs = 200\n", "-------------+------------------------------ F( 3, 196) = 78.61\n", " Model | 11424.7622 3 3808.25406 Prob > F = 0.0000\n", " Residual | 9494.65783 196 48.4421318 R-squared = 0.5461\n", "-------------+------------------------------ Adj R-squared = 0.5392\n", " Total | 20919.42 199 105.122714 Root MSE = 6.96\n", "\n", "------------------------------------------------------------------------------\n", " read | Coef. Std. Err. t P>|t| [95% Conf. Interval]\n", "-------------+----------------------------------------------------------------\n", " math | -.1105123 .2916338 -0.38 0.705 -.6856552 .4646307\n", " socst | -.2200442 .2717539 -0.81 0.419 -.7559812 .3158928\n", " |\n", " c.math#|\n", " c.socst | .0112807 .0052294 2.16 0.032 .0009677 .0215938\n", " |\n", " _cons | 37.84271 14.54521 2.60 0.010 9.157506 66.52792\n", "------------------------------------------------------------------------------\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interaction between continuous and Bool variables" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " id female ses schtyp prog read write math science socst \\\n", "0 45.0 1 low public vocation 34.0 35.0 41.0 29.0 26.0 \n", "1 108.0 0 middle public general 34.0 33.0 41.0 36.0 36.0 \n", "2 15.0 0 high public vocation 39.0 39.0 44.0 26.0 42.0 \n", "3 67.0 0 low public vocation 37.0 37.0 42.0 33.0 32.0 \n", "4 153.0 0 middle public vocation 39.0 31.0 40.0 39.0 51.0 \n", "\n", " honors awards cid read_gt_mean female#socst \n", "0 not enrolled 0.0 1 0 26.0 \n", "1 not enrolled 0.0 1 0 0.0 \n", "2 not enrolled 0.0 1 0 0.0 \n", "3 not enrolled 0.0 1 0 0.0 \n", "4 not enrolled 0.0 1 0 0.0 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_model = InteractionEncoder(df_raw, {'female': ['socst']}).transform()\n", "df_model.head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "model = OLS(df_model, ['write'], ['female', 'socst', 'female#socst']).fit()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: write R-squared: 0.430\n", "Model: OLS Adj. R-squared: 0.421\n", "Method: Least Squares F-statistic: 49.26\n", "Date: Fri, 03 Jan 2020 Prob (F-statistic): 9.02e-24\n", "Time: 21:39:12 Log-Likelihood: -676.91\n", "No. Observations: 200 AIC: 1362.\n", "Df Residuals: 196 BIC: 1375.\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "--------------------------------------------------------------------------------\n", "const 17.7619 3.555 4.996 0.000 10.751 24.773\n", "female 15.0000 5.098 2.942 0.004 4.946 25.054\n", "socst 0.6248 0.067 9.315 0.000 0.493 0.757\n", "female#socst -0.2047 0.095 -2.147 0.033 -0.393 -0.017\n", "==============================================================================\n", "Omnibus: 2.193 Durbin-Watson: 1.266\n", "Prob(Omnibus): 0.334 Jarque-Bera (JB): 2.004\n", "Skew: -0.152 Prob(JB): 0.367\n", "Kurtosis: 2.615 Cond. No. 713.\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.results_output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The interaction between `female` and `socst`, i.e. `female#socst`, is significant at the 5% level (p = 0.033): the estimated slope of `write` on `socst` is about 0.62 when `female` is 0 and about 0.42 (0.6248 - 0.2047) when `female` is 1.\n", "\n", "The chart in the [UCLA resources](https://stats.idre.ucla.edu/stata/faq/how-can-i-understand-a-categorical-by-continuous-interaction-stata-12/) shows how the slope for the effect of `socst` varies by gender."
] }, { "cell_type": "code", "execution_count": 42, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'root_mse': 7.211611852775864,\n", " 'r_squared': 0.42986123794053965,\n", " 'r_squared_adj': 0.4211346242355479,\n", " 'aic': 1361.811865520546,\n", " 'bic': 1375.005134986738}" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.model_selection_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is what the regression output would be from Stata:\n", "\n", "```\n", " Source | SS df MS Number of obs = 200\n", "-------------+------------------------------ F( 3, 196) = 49.26\n", " Model | 7685.43528 3 2561.81176 Prob > F = 0.0000\n", " Residual | 10193.4397 196 52.0073455 R-squared = 0.4299\n", "-------------+------------------------------ Adj R-squared = 0.4211\n", " Total | 17878.875 199 89.843593 Root MSE = 7.2116\n", "\n", "------------------------------------------------------------------------------\n", " write | Coef. Std. Err. t P>|t| [95% Conf. Interval]\n", "-------------+----------------------------------------------------------------\n", " 1.female | 15.00001 5.09795 2.94 0.004 4.946132 25.05389\n", " socst | .6247968 .0670709 9.32 0.000 .4925236 .7570701\n", " |\n", " female#|\n", " c.socst |\n", " 1 | -.2047288 .0953726 -2.15 0.033 -.3928171 -.0166405\n", " |\n", " _cons | 17.7619 3.554993 5.00 0.000 10.75095 24.77284\n", "------------------------------------------------------------------------------\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model pipeline example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's possible to make model pipelines with Pandas via chaining of Appelpy methods." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "def process_data(raw_df):\n", " return (raw_df\n", " .pipe(InteractionEncoder, {'female': ['socst']})\n", " .transform())" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "def fit_model(df, y_list, X_list):\n", " return OLS(df, y_list, X_list).fit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below retrieves the previous `model_selection_stats` via a Pandas pipeline." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'root_mse': 7.211611852775864,\n", " 'r_squared': 0.42986123794053965,\n", " 'r_squared_adj': 0.4211346242355479,\n", " 'aic': 1361.811865520546,\n", " 'bic': 1375.005134986738}" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df_raw\n", " .pipe(process_data)\n", " .pipe(fit_model, ['write'], ['female', 'socst', 'female#socst'])\n", " .model_selection_stats)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "384.391px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }