This notebook shows the functionality of the **`DummyEncoder` and `InteractionEncoder` classes of Appelpy** 🍏🥧 in depth, applied to an econometrics dataset.  These classes are in the `utils` module.

**Notebook structure:**
- **Data loading:** e.g. what format is needed for categorical columns and Boolean columns before using the Encoders.
- **`DummyEncoder` functionality:** basic examples of categorical columns being encoded into dummy columns.
- **`InteractionEncoder` functionality:** multiple scenarios are covered for interactions between different data types.
- **Modelling:** examples of models that use interaction effects.

The notebook ends with an example of a simple model pipeline using the `InteractionEncoder`.

In [1]:
import pandas as pd
import numpy as np

# Appelpy imports:
from appelpy.utils import DummyEncoder, InteractionEncoder
from appelpy.linear_model import OLS

# Suppress all warnings, including the NumPy warnings triggered via Statsmodels
import warnings
warnings.filterwarnings('ignore')

# Load data

The [hsbdemo DTA file](https://stats.idre.ucla.edu/stat/data/hsbdemo.dta) in this example is a dataset with 200 observations on the academic choices of students and other information about the students themselves, e.g. their academic profiles and demographic information.

In [2]:
df_raw = pd.read_stata('https://stats.idre.ucla.edu/stat/data/hsbdemo.dta')

In [3]:
df_raw.head()

Unnamed: 0,id,female,ses,schtyp,prog,read,write,math,science,socst,honors,awards,cid
0,45.0,female,low,public,vocation,34.0,35.0,41.0,29.0,26.0,not enrolled,0.0,1
1,108.0,male,middle,public,general,34.0,33.0,41.0,36.0,36.0,not enrolled,0.0,1
2,15.0,male,high,public,vocation,39.0,39.0,44.0,26.0,42.0,not enrolled,0.0,1
3,67.0,male,low,public,vocation,37.0,37.0,42.0,33.0,32.0,not enrolled,0.0,1
4,153.0,male,middle,public,vocation,39.0,31.0,40.0,39.0,51.0,not enrolled,0.0,1


In [4]:
df_raw.nunique()

id         200
female       2
ses          3
schtyp       2
prog         3
read        30
write       29
math        40
science     34
socst       22
honors       2
awards       7
cid         20
dtype: int64

The categorical columns from the Stata file are already set up to be recognised by Pandas as `pd.Categorical` dtype.

**NOTE: categorical data fed to the encoders should be in the `pd.Categorical` dtype in order for the encoding to work!**  They must not be in the generic `object` dtype.
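If a column arrives as generic `object` dtype (e.g. from a CSV file rather than a Stata file), it can be cast before encoding. The sketch below uses only Pandas on a small made-up dataframe; the explicit category order is an assumption chosen to mirror the `prog` column.

```python
import pandas as pd

# Toy data (made up): 'prog' starts out as generic object dtype
toy = pd.DataFrame({
    'prog': pd.Series(['general', 'academic', 'vocation', 'academic'],
                      dtype=object)
})
print(toy['prog'].dtype)

# Cast to the categorical dtype, stating the category order explicitly
toy['prog'] = pd.Categorical(toy['prog'],
                             categories=['general', 'academic', 'vocation'])
print(toy['prog'].dtype)
print(list(toy['prog'].cat.categories))
```

Calling `toy['prog'].astype('category')` would also work, but it orders the categories alphabetically rather than in a chosen order.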

Of course the `DummyEncoder` also handles cases where there are NaN values for categorical data (via the `nan_policy` argument)!  That functionality will be covered separately in another notebook.

In [5]:
df_raw.dtypes

id          float32
female     category
ses        category
schtyp     category
prog       category
read        float32
write       float32
math        float32
science     float32
socst       float32
honors     category
awards      float32
cid           int16
dtype: object

The `female` column will be recoded here as a Boolean column with values in {0, 1}, rather than the {'male', 'female'} format originally in the dataset.

**NOTE: Boolean data fed to the encoders should be restricted to values in {0, 1} in order for the encoding to work!**

In [6]:
# Recode 'female' col into 1 and 0 vals
df_raw['female'] = np.where(df_raw['female'] == 'female', 1, 0)

# Create another Bool col for use later on - col for 'read' value being higher than the mean
df_raw['read_gt_mean'] = np.where(df_raw['read'] > df_raw['read'].mean(), 1, 0)
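A quick sanity check that a column meets the {0, 1} requirement can be sketched as follows. The helper `is_zero_one` is hypothetical (it is not part of Appelpy) and the toy data is made up.

```python
import numpy as np
import pandas as pd


def is_zero_one(series):
    """Hypothetical helper: True if every value in the series is 0 or 1."""
    return bool(series.isin([0, 1]).all())


# Toy data (made up), recoded the same way as in the cell above
toy = pd.DataFrame({'female': ['female', 'male', 'female'],
                    'read': [34.0, 39.0, 63.0]})
toy['female'] = np.where(toy['female'] == 'female', 1, 0)
toy['read_gt_mean'] = np.where(toy['read'] > toy['read'].mean(), 1, 0)

print(is_zero_one(toy['female']))        # recoded Boolean column
print(is_zero_one(toy['read_gt_mean']))  # derived Boolean column
print(is_zero_one(toy['read']))          # continuous column
```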

These are some examples of the types of data in the dataset.

Boolean variables:
- `female`

Categorical variables:
- `ses`
- `prog`

Continuous variables:
- `read`, `write`, `math`, `science`, `socst`

# Data pre-processing

## `DummyEncoder` functionality

The encoder is applied to `df_raw` directly: `transform` returns a new dataframe and leaves `df_raw` unchanged.

The `dummy_encoder` object is an instance of the `DummyEncoder` class.

**The encoder object must be initialized with a dataframe.**

By default, the `_` separator is used to produce the dummy columns.

The encoder also takes a dictionary, where each column name is paired with a base level (or `None` for no base level).  If a base level is specified, then the dummy column for that category is dropped from the final dataframe.

In [7]:
dummy_encoder = DummyEncoder(df_raw, {'schtyp': None,
                                      'prog': None,
                                      'honors': None})

Create the transformed dataframe with the `transform` method.

In [8]:
# Overwrite the dataframe - encode dummies from the categorical variables specified
df = dummy_encoder.transform()

In [9]:
print(f"Default NaN policy: {dummy_encoder.nan_policy}")

Default NaN policy: row_of_zero


In [10]:
df.head()

Unnamed: 0,id,female,ses,read,write,math,science,socst,awards,cid,read_gt_mean,schtyp_public,schtyp_private,prog_general,prog_academic,prog_vocation,honors_not enrolled,honors_enrolled
0,45.0,1,low,34.0,35.0,41.0,29.0,26.0,0.0,1,0,1,0,0,0,1,1,0
1,108.0,0,middle,34.0,33.0,41.0,36.0,36.0,0.0,1,0,1,0,1,0,0,1,0
2,15.0,0,high,39.0,39.0,44.0,26.0,42.0,0.0,1,0,1,0,0,0,1,1,0
3,67.0,0,low,37.0,37.0,42.0,33.0,32.0,0.0,1,0,1,0,0,0,1,1,0
4,153.0,0,middle,39.0,31.0,40.0,39.0,51.0,0.0,1,0,1,0,0,0,1,1,0


There are three categorical variables fed to the `DummyEncoder`.

The original columns for all three are removed from the final dataframe once encoding is done for their dummy variable equivalents.
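To make the effect of a base level concrete, here is a rough Pandas-only sketch of the same idea. It is not how `DummyEncoder` is implemented; the toy dataframe and the choice of `general` as base level are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up) with one categorical column
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'prog': pd.Categorical(['general', 'academic', 'vocation'],
                           categories=['general', 'academic', 'vocation']),
})

# One 0/1 column per category, named <column>_<category>
dummies = pd.get_dummies(toy['prog'], prefix='prog', prefix_sep='_').astype(int)

# Assumed base level: its dummy is dropped, so it acts as the reference category
base_level = 'general'
dummies = dummies.drop(columns=f'prog_{base_level}')

# The original categorical column is replaced by its dummy columns
toy_encoded = pd.concat([toy.drop(columns='prog'), dummies], axis=1)
print(toy_encoded.columns.tolist())
```

With no base level (as in the `None` values above), the `drop` step would simply be skipped and all three dummy columns would remain.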

In [11]:
[col for col in dummy_encoder.categorical_col_base_levels.keys()]

['schtyp', 'prog', 'honors']

In [12]:
from appelpy.utils import get_dataframe_columns_diff

In [13]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df)}")
print(f"Columns added: {get_dataframe_columns_diff(df, df_raw)}")

Columns removed: ['prog', 'honors', 'schtyp']
Columns added: ['prog_academic', 'honors_not enrolled', 'honors_enrolled', 'schtyp_public', 'prog_vocation', 'prog_general', 'schtyp_private']


## `InteractionEncoder` functionality

Make a new copy of the `df_raw` dataframe.

The `int_encoder` object is an instance of the `InteractionEncoder` class.

**The encoder object must be initialized with a dataframe.**

The `#` separator is used to represent the interaction between two variables in the columns that are produced by the encoder.

In [14]:
df = df_raw.copy()

Examples of interactions between variables will be given for these cases:
- Two Boolean variables
- Two continuous variables
- Two categorical variables
- One Boolean variable and one categorical variable
- One Boolean variable and one continuous variable
- One categorical variable and one continuous variable

### Two Boolean variables

- Bool: `female`
- Bool: `read_gt_mean`

In [15]:
int_encoder = InteractionEncoder(df, {'female': ['read_gt_mean']})

df_enc = int_encoder.transform()
df_enc.tail()

Unnamed: 0,id,female,ses,schtyp,prog,read,write,math,science,socst,honors,awards,cid,read_gt_mean,female#read_gt_mean
195,100.0,1,high,public,academic,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,20,1,1
196,143.0,0,middle,public,vocation,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,20,1,0
197,68.0,0,middle,public,academic,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,20,1,0
198,57.0,1,middle,public,academic,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,20,1,1
199,132.0,0,middle,public,academic,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,20,1,0


The columns for the main effects are both Boolean, so they must be kept in the final dataframe.

There is only one interaction effect between the two Boolean variables, so one column is added to the dataframe.

The `get_dataframe_columns_diff` function is useful for checking how the final dataframe differs from the original dataframe after the encoding process.

In [16]:
print(f"Columns removed: {get_dataframe_columns_diff(df, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df)}")

Columns removed: []
Columns added: ['female#read_gt_mean']


The function essentially compares the columns of the two dataframes as sets.

In [17]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df)}")
print(f"Columns added: {get_dataframe_columns_diff(df, df_raw)}")

Columns removed: []
Columns added: []


In [18]:
print(f"Columns removed: {list(set(df.columns) - set(df_enc.columns))}")
print(f"Columns added: {list(set(df_enc.columns) - set(df.columns))}")

Columns removed: []
Columns added: ['female#read_gt_mean']


### Two continuous variables

- Continuous: `read`
- Continuous: `write`

Tip: do a one-line transformation by calling `transform` on an instance of the encoder class.

In [19]:
df_enc = InteractionEncoder(df_raw, {'read': ['write']}).transform()
df_enc.tail()

Unnamed: 0,id,female,ses,schtyp,prog,read,write,math,science,socst,honors,awards,cid,read_gt_mean,read#write
195,100.0,1,high,public,academic,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,20,1,4095.0
196,143.0,0,middle,public,vocation,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,20,1,3969.0
197,68.0,0,middle,public,academic,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,20,1,4891.0
198,57.0,1,middle,public,academic,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,20,1,4615.0
199,132.0,0,middle,public,academic,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,20,1,4526.0


The columns for the main effects are both continuous, so they must be kept in the final dataframe.

There is only one interaction effect between the two continuous variables, so one column is added to the dataframe.

In [20]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")

Columns removed: []
Columns added: ['read#write']
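The values in `read#write` are consistent with an element-wise product of the two main effects (e.g. 63 * 65 = 4095 in row 195). A minimal Pandas sketch on three of the rows shown above, assuming that is all the interaction amounts to:

```python
import pandas as pd

# 'read' and 'write' values copied from rows 195, 197 and 198 above
toy = pd.DataFrame({'read': [63.0, 73.0, 71.0],
                    'write': [65.0, 67.0, 65.0]},
                   index=[195, 197, 198])

# Interaction of two continuous variables: element-wise product
toy['read#write'] = toy['read'] * toy['write']
print(toy)
```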


### Two categorical variables

- Categorical: `prog`
- Categorical: `ses`

In [21]:
df_enc = InteractionEncoder(df_raw, {'prog': ['ses']}).transform()
df_enc.tail()

Unnamed: 0,id,female,schtyp,read,write,math,science,socst,honors,awards,...,ses_high,prog_general#ses_low,prog_general#ses_middle,prog_general#ses_high,prog_academic#ses_low,prog_academic#ses_middle,prog_academic#ses_high,prog_vocation#ses_low,prog_vocation#ses_middle,prog_vocation#ses_high
195,100.0,1,public,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,...,1,0,0,0,0,0,1,0,0,0
196,143.0,0,public,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,...,0,0,0,0,0,0,0,0,1,0
197,68.0,0,public,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,...,0,0,0,0,0,1,0,0,0,0
198,57.0,1,public,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,...,0,0,0,0,0,1,0,0,0,0
199,132.0,0,public,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,...,0,0,0,0,0,1,0,0,0,0


The main effects are both categorical: the values in those columns are all strings.  The **original columns** `prog` and `ses` are **removed** from the final dataframe, as the `DummyEncoder` is used on them to produce dummy columns for them in the final dataframe.  The original columns thus become redundant.

These are the **columns added** to the final dataframe via the encoding:
- Dummy columns are produced for each category via the DummyEncoder: 3 values + 3 values = 6 dummy columns.
- There are multiple interaction effects encoded between the two categorical variables: 3 values * 3 values = 9 interaction effects.

**NOTE:** one of the categories could be used as a 'base level' in a regression model.
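The column counts above (6 dummy columns, 9 interaction columns) can be reproduced with a Pandas-only sketch. This is an illustration on made-up rows, not Appelpy's internal code; it assumes each interaction is the product of one `prog` dummy and one `ses` dummy.

```python
from itertools import product

import pandas as pd

# Toy data (made up) with the same categories as 'prog' and 'ses'
toy = pd.DataFrame({
    'prog': pd.Categorical(['academic', 'vocation', 'general'],
                           categories=['general', 'academic', 'vocation']),
    'ses': pd.Categorical(['high', 'middle', 'low'],
                          categories=['low', 'middle', 'high']),
})

# Dummy columns for each categorical variable: 3 + 3 = 6
prog_dummies = pd.get_dummies(toy['prog'], prefix='prog').astype(int)
ses_dummies = pd.get_dummies(toy['ses'], prefix='ses').astype(int)

# One interaction column per pair of dummies: 3 * 3 = 9
interactions = pd.DataFrame({
    f'{p}#{s}': prog_dummies[p] * ses_dummies[s]
    for p, s in product(prog_dummies.columns, ses_dummies.columns)
})

print(prog_dummies.shape[1] + ses_dummies.shape[1])
print(interactions.shape[1])
print(interactions.columns.tolist())
```

Each row has exactly one interaction column equal to 1: the one matching its own `prog` and `ses` categories.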

In [22]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")

Columns removed: ['ses', 'prog']
Columns added: ['prog_vocation#ses_low', 'prog_academic#ses_middle', 'ses_low', 'prog_academic', 'prog_general#ses_high', 'prog_general#ses_low', 'prog_vocation#ses_middle', 'prog_academic#ses_low', 'prog_vocation', 'ses_high', 'prog_academic#ses_high', 'prog_vocation#ses_high', 'prog_general#ses_middle', 'ses_middle', 'prog_general']


The key and value in the dictionary can also be swapped.  The resulting dataframe holds the same information, but the column names for the interaction effects will be different.

In [23]:
df_enc = InteractionEncoder(df_raw, {'ses': ['prog']}).transform()
df_enc.tail()

Unnamed: 0,id,female,schtyp,read,write,math,science,socst,honors,awards,...,prog_vocation,ses_low#prog_general,ses_low#prog_academic,ses_low#prog_vocation,ses_middle#prog_general,ses_middle#prog_academic,ses_middle#prog_vocation,ses_high#prog_general,ses_high#prog_academic,ses_high#prog_vocation
195,100.0,1,public,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,...,0,0,0,0,0,0,0,0,1,0
196,143.0,0,public,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,...,1,0,0,0,0,0,1,0,0,0
197,68.0,0,public,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,...,0,0,0,0,0,1,0,0,0,0
198,57.0,1,public,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,...,0,0,0,0,0,1,0,0,0,0
199,132.0,0,public,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,...,0,0,0,0,0,1,0,0,0,0


In [24]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")

Columns removed: ['ses', 'prog']
Columns added: ['ses_low', 'prog_academic', 'ses_middle#prog_academic', 'prog_general', 'ses_middle#prog_general', 'ses_low#prog_general', 'ses_high#prog_academic', 'ses_low#prog_academic', 'prog_vocation', 'ses_low#prog_vocation', 'ses_high', 'ses_middle', 'ses_middle#prog_vocation', 'ses_high#prog_vocation', 'ses_high#prog_general']


### One Bool and one categorical

- Categorical: `prog`
- Bool: `female`

In [25]:
df_enc = InteractionEncoder(df_raw, {'prog': ['female']}).transform()
df_enc.tail()

Unnamed: 0,id,female,ses,schtyp,read,write,math,science,socst,honors,awards,cid,read_gt_mean,prog_general,prog_academic,prog_vocation,prog_general#female,prog_academic#female,prog_vocation#female
195,100.0,1,high,public,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,20,1,0,1,0,0,1,0
196,143.0,0,middle,public,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,20,1,0,0,1,0,0,0
197,68.0,0,middle,public,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,20,1,0,1,0,0,0,0
198,57.0,1,middle,public,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,20,1,0,1,0,0,1,0
199,132.0,0,middle,public,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,20,1,0,1,0,0,0,0


One of the main effect columns is for a Boolean variable, so that must be kept in the final dataframe.  The other main effect is a categorical variable, so dummy columns are encoded for it and the original column is removed in the final dataframe.

The columns added:
- Dummy columns for the categorical variable: 3 values gives 3 dummy columns
- Interaction effects between the Boolean variable and the dummy columns: 3 dummy columns * 1 Bool column = 3 interaction effects

In [26]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")

Columns removed: ['prog']
Columns added: ['prog_academic', 'prog_vocation#female', 'prog_academic#female', 'prog_vocation', 'prog_general#female', 'prog_general']


### One Bool and one continuous

In this case let's encode interactions between `female` and TWO continuous variables!

- Bool: `female`
- Continuous: `read` and `write`

In [27]:
df_enc = InteractionEncoder(df_raw, {'female': ['read', 'write']}).transform()
df_enc.tail()

Unnamed: 0,id,female,ses,schtyp,prog,read,write,math,science,socst,honors,awards,cid,read_gt_mean,female#read,female#write
195,100.0,1,high,public,academic,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,20,1,63.0,65.0
196,143.0,0,middle,public,vocation,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,20,1,0.0,0.0
197,68.0,0,middle,public,academic,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,20,1,0.0,0.0
198,57.0,1,middle,public,academic,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,20,1,71.0,65.0
199,132.0,0,middle,public,academic,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,20,1,0.0,0.0


The columns for the main effects are Boolean or continuous, so they must be kept in the final dataframe.

Each pairing of a Boolean variable with a continuous variable yields a single interaction effect, so one column is added to the dataframe per pairing.

(In this case, there were two continuous variables interacted with `female` so there are two interaction effects added to the final dataframe)

In [28]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")

Columns removed: []
Columns added: ['female#write', 'female#read']


In [29]:
df_enc = InteractionEncoder(df_raw, {'read': ['female'],
                                      'write': ['female']}).transform()
df_enc.tail()

Unnamed: 0,id,female,ses,schtyp,prog,read,write,math,science,socst,honors,awards,cid,read_gt_mean,read#female,write#female
195,100.0,1,high,public,academic,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,20,1,63.0,65.0
196,143.0,0,middle,public,vocation,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,20,1,0.0,0.0
197,68.0,0,middle,public,academic,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,20,1,0.0,0.0
198,57.0,1,middle,public,academic,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,20,1,71.0,65.0
199,132.0,0,middle,public,academic,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,20,1,0.0,0.0


In [30]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")

Columns removed: []
Columns added: ['read#female', 'write#female']


### One categorical and one continuous

- Categorical: `prog`
- Continuous: `socst`

In [31]:
df_enc = InteractionEncoder(df_raw, {'socst': ['prog']}).transform()
df_enc.tail()

Unnamed: 0,id,female,ses,schtyp,read,write,math,science,socst,honors,awards,cid,read_gt_mean,prog_general,prog_academic,prog_vocation,socst#prog_general,socst#prog_academic,socst#prog_vocation
195,100.0,1,high,public,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,20,1,0,1,0,0.0,71.0,0.0
196,143.0,0,middle,public,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,20,1,0,0,1,0.0,0.0,66.0
197,68.0,0,middle,public,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,20,1,0,1,0,0.0,66.0,0.0
198,57.0,1,middle,public,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,20,1,0,1,0,0.0,56.0,0.0
199,132.0,0,middle,public,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,20,1,0,1,0,0.0,66.0,0.0


One of the main effects is continuous, so the column for that one must be kept in the final dataframe.  The other main effect is a categorical variable, so the original column is dropped from the final dataframe after dummy columns are encoded from it.

There is an interaction effect between each of the dummy variables and the continuous variable.
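In the output above, each interaction column equals `socst` where the corresponding dummy is 1 and zero elsewhere. A Pandas-only sketch of that pattern, using values from rows 195, 196 and 198 and assuming the interaction is a simple product:

```python
import pandas as pd

# 'prog' and 'socst' values copied from rows 195, 196 and 198 above
toy = pd.DataFrame({
    'prog': pd.Categorical(['academic', 'vocation', 'academic'],
                           categories=['general', 'academic', 'vocation']),
    'socst': [71.0, 66.0, 56.0],
}, index=[195, 196, 198])

prog_dummies = pd.get_dummies(toy['prog'], prefix='prog').astype(int)

# Multiply every dummy column by the continuous variable, row by row
interactions = prog_dummies.mul(toy['socst'], axis=0).add_prefix('socst#')
print(interactions)
```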

In [32]:
print(f"Columns removed: {get_dataframe_columns_diff(df_raw, df_enc)}")
print(f"Columns added: {get_dataframe_columns_diff(df_enc, df_raw)}")

Columns removed: ['prog']
Columns added: ['prog_academic', 'socst#prog_vocation', 'prog_vocation', 'socst#prog_general', 'socst#prog_academic', 'prog_general']


In [33]:
InteractionEncoder(df_raw, {'prog': ['socst']}).transform().tail()

Unnamed: 0,id,female,ses,schtyp,read,write,math,science,socst,honors,awards,cid,read_gt_mean,prog_general,prog_academic,prog_vocation,prog_general#socst,prog_academic#socst,prog_vocation#socst
195,100.0,1,high,public,63.0,65.0,71.0,69.0,71.0,enrolled,5.0,20,1,0,1,0,0.0,71.0,0.0
196,143.0,0,middle,public,63.0,63.0,75.0,72.0,66.0,enrolled,4.0,20,1,0,0,1,0.0,0.0,66.0
197,68.0,0,middle,public,73.0,67.0,71.0,63.0,66.0,enrolled,7.0,20,1,0,1,0,0.0,66.0,0.0
198,57.0,1,middle,public,71.0,65.0,72.0,66.0,56.0,enrolled,5.0,20,1,0,1,0,0.0,56.0,0.0
199,132.0,0,middle,public,73.0,62.0,73.0,69.0,66.0,enrolled,3.0,20,1,0,1,0,0.0,66.0,0.0


# Model

Let's do basic OLS regression models using the dataset, where interaction effects are also used as variables in modelling.

UCLA's online resources have models of interaction effects on this dataset with Stata output: 
- [Interaction between two continuous variables](https://stats.idre.ucla.edu/stata/faq/how-can-i-explain-a-continuous-by-continuous-interaction-stata-12/)
- [Interaction between categorical variable and continuous variable](https://stats.idre.ucla.edu/stata/faq/how-can-i-understand-a-categorical-by-continuous-interaction-stata-12/) (the example is a categorical variable with two categories, `female`, which is made Boolean in this notebook).

The Stata output for each model is also provided in this notebook for comparison against the models done through Appelpy.

## Interaction between two continuous variables

Create a new dataframe and set up the `InteractionEncoder` object.

In [34]:
df_model = df_raw.copy()

Let's regress `read` on the scores for `math`, `socst` and the _interaction_ between `math` & `socst`.

To get the interaction effect in the dataframe, we need to do some encoding to get the column `math#socst`.

In [35]:
df_model = InteractionEncoder(df_model, {'math': ['socst']}).transform()
df_model.head()

Unnamed: 0,id,female,ses,schtyp,prog,read,write,math,science,socst,honors,awards,cid,read_gt_mean,math#socst
0,45.0,1,low,public,vocation,34.0,35.0,41.0,29.0,26.0,not enrolled,0.0,1,0,1066.0
1,108.0,0,middle,public,general,34.0,33.0,41.0,36.0,36.0,not enrolled,0.0,1,0,1476.0
2,15.0,0,high,public,vocation,39.0,39.0,44.0,26.0,42.0,not enrolled,0.0,1,0,1848.0
3,67.0,0,low,public,vocation,37.0,37.0,42.0,33.0,32.0,not enrolled,0.0,1,0,1344.0
4,153.0,0,middle,public,vocation,39.0,31.0,40.0,39.0,51.0,not enrolled,0.0,1,0,2040.0


In [36]:
y_list = ['read']
X_list = ['math', 'socst', 'math#socst']
model = OLS(df_model, y_list, X_list).fit()

In [37]:
model.results_output

0,1,2,3
Dep. Variable:,read,R-squared:,0.546
Model:,OLS,Adj. R-squared:,0.539
Method:,Least Squares,F-statistic:,78.61
Date:,"Fri, 03 Jan 2020",Prob (F-statistic):,1.99e-33
Time:,21:39:12,Log-Likelihood:,-669.8
No. Observations:,200,AIC:,1348.0
Df Residuals:,196,BIC:,1361.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,37.8427,14.545,2.602,0.010,9.158,66.528
math,-0.1105,0.292,-0.379,0.705,-0.686,0.465
socst,-0.2200,0.272,-0.810,0.419,-0.756,0.316
math#socst,0.0113,0.005,2.157,0.032,0.001,0.022

0,1,2,3
Omnibus:,3.611,Durbin-Watson:,1.839
Prob(Omnibus):,0.164,Jarque-Bera (JB):,3.555
Skew:,0.325,Prob(JB):,0.169
Kurtosis:,2.942,Cond. No.,87600.0


The interaction between `math` and `socst`, i.e. `math#socst`, is significant at the 5% level (p = 0.032).
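With an interaction term in the model, the slope of `math` is no longer a single number: it is the `math` coefficient plus the interaction coefficient times `socst`. The sketch below plugs in the rounded coefficients from the output above; the `socst` values of 40, 50 and 60 are arbitrary illustration points.

```python
# Rounded coefficients copied from the regression output above
b_math = -0.1105
b_interaction = 0.0113


def math_slope(socst):
    """Slope of 'math' on 'read' at a given 'socst' score."""
    return b_math + b_interaction * socst


# Illustrative (arbitrary) socst values
for socst in (40, 50, 60):
    print(socst, round(math_slope(socst), 4))
```

The slope of `math` grows as `socst` increases, which is what the positive interaction coefficient implies.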

In [38]:
model.model_selection_stats

{'root_mse': 6.96003820368867,
 'r_squared': 0.5461318818125249,
 'r_squared_adj': 0.5391849208198595,
 'aic': 1347.6088571651621,
 'bic': 1360.8021266313542}

This is what the model output would be from Stata:

```
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   78.61
       Model |  11424.7622     3  3808.25406           Prob > F      =  0.0000
    Residual |  9494.65783   196  48.4421318           R-squared     =  0.5461
-------------+------------------------------           Adj R-squared =  0.5392
       Total |    20919.42   199  105.122714           Root MSE      =    6.96

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |  -.1105123   .2916338    -0.38   0.705    -.6856552    .4646307
       socst |  -.2200442   .2717539    -0.81   0.419    -.7559812    .3158928
             |
      c.math#|
     c.socst |   .0112807   .0052294     2.16   0.032     .0009677    .0215938
             |
       _cons |   37.84271   14.54521     2.60   0.010     9.157506    66.52792
------------------------------------------------------------------------------
```
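As a cross-check, the `model_selection_stats` values from Appelpy can be recomputed from the sums of squares in the Stata table using the standard formulas for R-squared, adjusted R-squared and root MSE. The variable names below are my own.

```python
import math

# Figures copied from the Stata ANOVA table above
n_obs = 200
ss_model, dof_model = 11424.7622, 3
ss_resid, dof_resid = 9494.65783, 196
ss_total = ss_model + ss_resid

r_squared = ss_model / ss_total
r_squared_adj = 1 - (1 - r_squared) * (n_obs - 1) / dof_resid
root_mse = math.sqrt(ss_resid / dof_resid)

print(f"r_squared:     {r_squared:.4f}")
print(f"r_squared_adj: {r_squared_adj:.4f}")
print(f"root_mse:      {root_mse:.4f}")
```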

## Interaction between continuous and Bool variables

In [39]:
df_model = InteractionEncoder(df_raw, {'female': ['socst']}).transform()
df_model.head()

Unnamed: 0,id,female,ses,schtyp,prog,read,write,math,science,socst,honors,awards,cid,read_gt_mean,female#socst
0,45.0,1,low,public,vocation,34.0,35.0,41.0,29.0,26.0,not enrolled,0.0,1,0,26.0
1,108.0,0,middle,public,general,34.0,33.0,41.0,36.0,36.0,not enrolled,0.0,1,0,0.0
2,15.0,0,high,public,vocation,39.0,39.0,44.0,26.0,42.0,not enrolled,0.0,1,0,0.0
3,67.0,0,low,public,vocation,37.0,37.0,42.0,33.0,32.0,not enrolled,0.0,1,0,0.0
4,153.0,0,middle,public,vocation,39.0,31.0,40.0,39.0,51.0,not enrolled,0.0,1,0,0.0


In [40]:
model = OLS(df_model, ['write'], ['female', 'socst', 'female#socst']).fit()

In [41]:
model.results_output

0,1,2,3
Dep. Variable:,write,R-squared:,0.43
Model:,OLS,Adj. R-squared:,0.421
Method:,Least Squares,F-statistic:,49.26
Date:,"Fri, 03 Jan 2020",Prob (F-statistic):,9.02e-24
Time:,21:39:12,Log-Likelihood:,-676.91
No. Observations:,200,AIC:,1362.0
Df Residuals:,196,BIC:,1375.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,17.7619,3.555,4.996,0.000,10.751,24.773
female,15.0000,5.098,2.942,0.004,4.946,25.054
socst,0.6248,0.067,9.315,0.000,0.493,0.757
female#socst,-0.2047,0.095,-2.147,0.033,-0.393,-0.017

0,1,2,3
Omnibus:,2.193,Durbin-Watson:,1.266
Prob(Omnibus):,0.334,Jarque-Bera (JB):,2.004
Skew:,-0.152,Prob(JB):,0.367
Kurtosis:,2.615,Cond. No.,713.0


The interaction between `female` and `socst`, i.e. `female#socst`, is significant at the 5% level (p = 0.033).

In the [UCLA resources](https://stats.idre.ucla.edu/stata/faq/how-can-i-understand-a-categorical-by-continuous-interaction-stata-12/) the chart shows how the slopes for the effect of `socst` vary by gender.
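The group-specific slopes and intercepts follow directly from the coefficients: for `female == 0` the interaction term vanishes, and for `female == 1` it adds to the `socst` slope. A small sketch using the rounded coefficients from the output above:

```python
# Rounded coefficients copied from the regression output above
b_const = 17.7619
b_female = 15.0000
b_socst = 0.6248
b_interaction = -0.2047

# female == 0: the 'female' and interaction terms drop out
intercept_male, slope_male = b_const, b_socst

# female == 1: both the intercept and the slope shift
intercept_female = b_const + b_female
slope_female = b_socst + b_interaction

print(f"male:   write = {intercept_male:.4f} + {slope_male:.4f} * socst")
print(f"female: write = {intercept_female:.4f} + {slope_female:.4f} * socst")
```

The flatter slope for `female == 1` is the negative interaction coefficient at work.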

In [42]:
model.model_selection_stats

{'root_mse': 7.211611852775864,
 'r_squared': 0.42986123794053965,
 'r_squared_adj': 0.4211346242355479,
 'aic': 1361.811865520546,
 'bic': 1375.005134986738}

This is what the regression output would be from Stata:

```
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   49.26
       Model |  7685.43528     3  2561.81176           Prob > F      =  0.0000
    Residual |  10193.4397   196  52.0073455           R-squared     =  0.4299
-------------+------------------------------           Adj R-squared =  0.4211
       Total |   17878.875   199   89.843593           Root MSE      =  7.2116

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |   15.00001    5.09795     2.94   0.004     4.946132    25.05389
       socst |   .6247968   .0670709     9.32   0.000     .4925236    .7570701
             |
      female#|
     c.socst |
          1  |  -.2047288   .0953726    -2.15   0.033    -.3928171   -.0166405
             |
       _cons |    17.7619   3.554993     5.00   0.000     10.75095    24.77284
------------------------------------------------------------------------------
```

# Model pipeline example

It's possible to make model pipelines with Pandas via chaining of Appelpy methods.

In [43]:
def process_data(raw_df):
    return (raw_df
            .pipe(InteractionEncoder, {'female': ['socst']})
            .transform())

In [44]:
def fit_model(df, y_list, X_list):
    return OLS(df, y_list, X_list).fit()

The cell below reproduces the `model_selection_stats` of the previous model via a Pandas pipeline.

In [45]:
(df_raw
 .pipe(process_data)
 .pipe(fit_model, ['write'], ['female', 'socst', 'female#socst'])
 .model_selection_stats)

{'root_mse': 7.211611852775864,
 'r_squared': 0.42986123794053965,
 'r_squared_adj': 0.4211346242355479,
 'aic': 1361.811865520546,
 'bic': 1375.005134986738}
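The pipeline works because `DataFrame.pipe(func, *args)` simply calls `func(df, *args)`, and `func` may be a class. The sketch below demonstrates this with `ToyInteractionEncoder`, a hypothetical stand-in written for this example (it is not Appelpy's implementation and only handles numeric columns).

```python
import pandas as pd


class ToyInteractionEncoder:
    """Hypothetical stand-in for an encoder: multiplies pairs of numeric columns."""

    def __init__(self, df, interactions):
        self.df = df
        self.interactions = interactions

    def transform(self):
        out = self.df.copy()  # leave the input dataframe untouched
        for left, rights in self.interactions.items():
            for right in rights:
                out[f'{left}#{right}'] = out[left] * out[right]
        return out


# 'female' and 'socst' values copied from the first two rows of the dataset
toy = pd.DataFrame({'female': [1, 0], 'socst': [26.0, 36.0]})
spec = {'female': ['socst']}

# .pipe passes the dataframe as the first argument to the class constructor
piped = toy.pipe(ToyInteractionEncoder, spec).transform()
direct = ToyInteractionEncoder(toy, spec).transform()

print(piped)
print(piped.equals(direct))
```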