{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "a874f13f-0ff4-4557-a4f4-23430574fdbd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import scipy as sp\n",
    "import scipy.stats\n",
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d4e1491-3b45-4ece-adc5-556d9cad947a",
   "metadata": {},
   "source": [
    "# <font face=\"gotham\" color=\"purple\"> ANOVA </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2419679-4234-44bc-a964-69adc25228d4",
   "metadata": {},
   "source": [
    "If you have studied statistics, you certainly know the famous **Analysis of Variance** (ANOVA), you can skip this section, but if you haven't, read on.\n",
    "\n",
    "Simply speaking, the ANOVA is a technique of comparing means of multiple$(\\geq 3)$ populations, the name derives from the way how calculations are performed. \n",
    "\n",
    "For example, a common hypotheses of ANOVA are \n",
    "$$\n",
    "H_0:\\quad \\mu_1=\\mu_2=\\mu_3=\\cdots=\\mu_n\\\\\n",
    "H_1:\\quad \\text{At least two means differ}\n",
    "$$\n",
    "In order to construct $F$-statistic, we need to introduce two more statistics, the first one is **Mean Square for Treatments** (MST), $\\bar{\\bar{x}}$ is the grand mean, $\\bar{x}_i$ is the sample mean of sample $x_i$, $n_i$ is the number of observable in sample $i$\n",
    "$$\n",
    "MST=\\frac{SST}{k-1},\\qquad\\text{where } SST=\\sum_{i=1}^kn_i(\\bar{x}_i-\\bar{\\bar{x}})^2\n",
    "$$\n",
    "And the second one is **Mean Square for Error** (MSE), $s_i$ is the sample variance of sample $i$\n",
    "$$\n",
    "MSE=\\frac{SSE}{n-k},\\qquad\\text{where } SSE =(n_1-1)s_1^2+(n_2-1)s_2^2+\\cdots+(n_k-1)s_k^2\n",
    "$$\n",
    "Join them together, an $F$-statistic is constructed\n",
    "$$\n",
    "F=\\frac{MST}{MSE}\n",
    "$$\n",
    "If the $F$-statistic is larger than critical value with its corresponding degree of freedom, we reject null hypothesis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d568075-51b2-4ab9-9540-70f0cef0621a",
   "metadata": {},
   "source": [
    "# <font face=\"gotham\" color=\"purple\"> Dummy Variable </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17a5cf54-e1b4-4c65-ae80-3119d142b674",
   "metadata": {},
   "source": [
    "Here's dataset with dummy variables, which are either $1$ or $0$. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "0f75636f-6e87-40ab-ba7d-181b2a268cf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_excel('Basic_Econometrics_practice_data.xlsx', sheet_name = 'Hight_ANOVA')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "fc8abfee-6602-4779-b70c-b5600dd1494f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Height</th>\n",
       "      <th>NL_dummpy</th>\n",
       "      <th>DM_dummpy</th>\n",
       "      <th>FI_dummy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>161.783130</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>145.329934</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>174.569597</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>160.003162</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>162.242898</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>83</th>\n",
       "      <td>180.962477</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84</th>\n",
       "      <td>172.791579</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85</th>\n",
       "      <td>174.951880</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>86</th>\n",
       "      <td>176.059861</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87</th>\n",
       "      <td>182.802878</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>88 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        Height  NL_dummpy  DM_dummpy  FI_dummy\n",
       "0   161.783130          0          0         0\n",
       "1   145.329934          0          0         0\n",
       "2   174.569597          0          0         0\n",
       "3   160.003162          0          0         0\n",
       "4   162.242898          0          0         0\n",
       "..         ...        ...        ...       ...\n",
       "83  180.962477          0          0         1\n",
       "84  172.791579          0          0         1\n",
       "85  174.951880          0          0         1\n",
       "86  176.059861          0          0         1\n",
       "87  182.802878          0          0         1\n",
       "\n",
       "[88 rows x 4 columns]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8089e1ad-5a39-4e40-b42d-e6da17746ad2",
   "metadata": {},
   "source": [
    "The dataset has five columns, the first column $Height$ is a sample of $88$ male height, other columns are dummy variables indication its qualitative feature, here is the nationality.\n",
    "\n",
    "There are $4$ countries in the sample, Japan, Netherlands, Denmark and Finland, however there are only $3$ dummies in the data set, this is to avoid _perfect multicollinearity_, which is also called **dummy variable trap**, because the height data is the perfect linear combination of four dummy variables. \n",
    "\n",
    "If we use the model with only dummy variables as independent variable, we basically regressing a ANOVA model, i.e.\n",
    "$$\n",
    "Y_{i}=\\beta_{1}+\\beta_{2} D_{2 i}+\\beta_{3 i} \\mathrm{D}_{3 i}+\\beta_{3 i} \\mathrm{D}_{3 i}+u_{i}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b66e97d-9307-43cb-8d2c-d638a7e9f5d9",
   "metadata": {},
   "source": [
    "where $Y_i =$ the male height, $D_{2i}=1$ if the male is from Netherlands,  $D_{3i}=1$ if the male is from Denmark and $D_{4i}=1$ if the male is from Finland. Japan doesn't have a dummy variable, so we are using it as reference, which will be clearer later.\n",
    "\n",
    "Now we run the regression and print the result. And how do we interpret the estimated coefficients?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "38b441a8-c7f8-456d-bc56-4c1ebbb1253e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                 Height   R-squared:                       0.453\n",
      "Model:                            OLS   Adj. R-squared:                  0.434\n",
      "Method:                 Least Squares   F-statistic:                     23.20\n",
      "Date:                Wed, 25 Aug 2021   Prob (F-statistic):           4.93e-11\n",
      "Time:                        13:10:31   Log-Likelihood:                -300.08\n",
      "No. Observations:                  88   AIC:                             608.2\n",
      "Df Residuals:                      84   BIC:                             618.1\n",
      "Df Model:                           3                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "==============================================================================\n",
      "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------\n",
      "const        163.0300      1.416    115.096      0.000     160.213     165.847\n",
      "NL_dummpy     17.7116      2.453      7.219      0.000      12.833      22.590\n",
      "DM_dummpy     12.2196      2.194      5.569      0.000       7.856      16.583\n",
      "FI_dummy      12.8530      2.041      6.296      0.000       8.794      16.912\n",
      "==============================================================================\n",
      "Omnibus:                       20.323   Durbin-Watson:                   2.355\n",
      "Prob(Omnibus):                  0.000   Jarque-Bera (JB):              105.597\n",
      "Skew:                          -0.355   Prob(JB):                     1.17e-23\n",
      "Kurtosis:                       8.319   Cond. No.                         4.40\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n"
     ]
    }
   ],
   "source": [
    "X = df[[ 'NL_dummpy', 'DM_dummpy', 'FI_dummy']]\n",
    "Y = df['Height']\n",
    "\n",
    "X = sm.add_constant(X) # adding a constant\n",
    "\n",
    "model = sm.OLS(Y, X).fit() \n",
    "print_model = model.summary()\n",
    "print(print_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "145b609e-c406-499d-823b-aa846e1905ab",
   "metadata": {},
   "source": [
    "First, all the $p$-value are significantly small, so our estimation is valid. Then we examine the coefficients one by one. \n",
    "\n",
    "The estimated constant $b_1 = 163.03$ is the mean height of Japanese male. The mean of Dutch male height is $b_1+b_2 = 163.03+17.71=180.74$, the mean of Danish male height is $b_1+b_3=163.03+12.21=175.24$, the mean of Finnish male height is $b_1+b_4=163.03+12.85=175.88$. \n",
    "\n",
    "As you can see, the Japanese male height is used as a _reference_, also called _base category_, rest are added onpon it. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "184e6e17-c0d7-434b-8015-9431275d0d49",
   "metadata": {},
   "source": [
    "This regression has the same effect as ANOVA test, all dummy coefficients are significant and so is $F$-statistic, which means we reject null height null hypothesis that all countries' male height are the same."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04cbf944-4188-4676-a17a-378f5bbb19c7",
   "metadata": {},
   "source": [
    "# <font face=\"gotham\" color=\"purple\"> The ANCOVA Models </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea24e2a8-3cde-4eb7-a1e2-72938c7047aa",
   "metadata": {},
   "source": [
    "The example in the last section has only dummies in the independent variables, which is rare in practice. The more common situation is that independent variables have both quantitative and qualitative/dummy variables, and this is what we will do in this section. \n",
    "\n",
    "The model with both quantitative and qualitative variables are called **analysis of covariance** (ANCOVA) models. We have another dataset that contains the individual's parents' height. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "b7cdd23a-7751-485f-bdc4-d8e895420233",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                 Height   R-squared:                       0.679\n",
      "Model:                            OLS   Adj. R-squared:                  0.660\n",
      "Method:                 Least Squares   F-statistic:                     34.71\n",
      "Date:                Wed, 25 Aug 2021   Prob (F-statistic):           6.74e-19\n",
      "Time:                        13:24:31   Log-Likelihood:                -276.61\n",
      "No. Observations:                  88   AIC:                             565.2\n",
      "Df Residuals:                      82   BIC:                             580.1\n",
      "Df Model:                           5                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "====================================================================================\n",
      "                       coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------------\n",
      "const               27.8737     17.839      1.563      0.122      -7.614      63.361\n",
      "Height of Father     0.3305      0.097      3.417      0.001       0.138       0.523\n",
      "Height of Mother     0.5028      0.109      4.630      0.000       0.287       0.719\n",
      "NL_dummy             5.3669      2.527      2.124      0.037       0.339      10.395\n",
      "DM_dummy             2.9093      2.097      1.388      0.169      -1.262       7.080\n",
      "FI_dummy             1.0209      2.223      0.459      0.647      -3.402       5.443\n",
      "==============================================================================\n",
      "Omnibus:                        5.467   Durbin-Watson:                   2.121\n",
      "Prob(Omnibus):                  0.065   Jarque-Bera (JB):                6.176\n",
      "Skew:                          -0.293   Prob(JB):                       0.0456\n",
      "Kurtosis:                       4.158   Cond. No.                     7.09e+03\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
      "[2] The condition number is large, 7.09e+03. This might indicate that there are\n",
      "strong multicollinearity or other numerical problems.\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_excel('Basic_Econometrics_practice_data.xlsx', sheet_name = 'Hight_ANCOVA')\n",
    "\n",
    "X = df[['Height of Father', 'Height of Mother','NL_dummy', 'DM_dummy', 'FI_dummy']]\n",
    "Y = df['Height']\n",
    "\n",
    "X = sm.add_constant(X) # adding a constant\n",
    "\n",
    "model = sm.OLS(Y, X).fit() \n",
    "print_model = model.summary()\n",
    "print(print_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "416a5113-c5fd-44bb-aec9-d0d1b622056f",
   "metadata": {},
   "source": [
    "In order to interpret the results, let's type the estimated model here\n",
    "$$\n",
    "\\hat{Y}= 27.87+.33 X_{f} + .5 X_{m} + 5.36 D_{NL} + 2.90 D_{DM} + 1.02 D_{FI}\n",
    "$$\n",
    "where $X_{f}$ and $X_{m}$ are father's and mother's heights, $D$'s are dummy variables representing each country."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5a20773-15a3-41b7-887a-f394d39bd597",
   "metadata": {},
   "source": [
    "This is actually a function to predict a male's height if you input parents height, for instance if we set $D_{NL} = D_{DM}= D_{FI}=0 $, the function of height of Japanese male is\n",
    "$$\n",
    "\\hat{Y}= 27.87+.33 X_{f} + .5 X_{m}\n",
    "$$\n",
    "Or the function of Dutch male height with $D_{NL} = 1$ and $ D_{DM}= D_{FI}=0$\n",
    "$$\n",
    "\\hat{Y}= 27.87+.33 X_{f} + .5 X_{m} + 5.36\n",
    "$$\n",
    "With these results, we can define Python functions to predict male height"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "5c7826ed-2771-4aed-a9e1-228ea629fcb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def jp_height(fh, mh):\n",
    "    return 27.87 + fh*.33 + mh*.5\n",
    "def nl_height(fh, mh):\n",
    "    return 27.87 + fh*.33 + mh*.5 + 5.36"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "237b7923-a878-42f5-add9-47a4986944a7",
   "metadata": {},
   "source": [
    "A function to predict a Japanese male's height"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "3735d290-0694-4385-acd8-131c6f169ce7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "170.62"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "jp_height(175, 170)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b230023-291d-4d0e-9a09-4f88c4c2d226",
   "metadata": {},
   "source": [
    "And function to predict a Dutch male's height"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "b8b8ad2c-fd05-4264-bfd9-da5b16a54a29",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "186.78000000000003"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nl_height(185, 185)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89b9ce66-0338-4a45-89a6-a332a188e847",
   "metadata": {},
   "source": [
    "# <font face=\"gotham\" color=\"purple\"> Slope Dummy Variables </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a43b2483-9958-4594-b325-01de3f4af314",
   "metadata": {},
   "source": [
    "What we have discussed above are all **intercept dummy variables** which means the dummy variable only change the intercept term, however, there are dummies which imposed on slope variables too. \n",
    "\n",
    "Back to the height example, what if we suspect that parents' height in NL could have more marginal effect on their sons' height, i.e. the model is\n",
    "$$\n",
    "Y= \\beta_1 + \\beta_2D_{NL} + (\\beta_3+ \\beta_4D_{NL})X_{f} +  (\\beta_5+ \\beta_6D_{NL})X_{m}+u\n",
    "$$\n",
    "Rewrite for estimation purpose\n",
    "$$\n",
    "Y= \\beta_1 + \\beta_2D_{NL} + \\beta_3 X_f + \\beta_4 D_{NL}X_f + \\beta_5X_m + \\beta_6 D_{NL}X_m+u\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c35ca40b-8023-4e08-bced-eb67390186a9",
   "metadata": {},
   "source": [
    "Take a look at our data, we need to reconstruct it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "f5db15dd-ae88-475a-abda-a8e43975d142",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Height</th>\n",
       "      <th>Height of Father</th>\n",
       "      <th>Height of Mother</th>\n",
       "      <th>NL_dummy</th>\n",
       "      <th>DM_dummy</th>\n",
       "      <th>FI_dummy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>161.783130</td>\n",
       "      <td>173</td>\n",
       "      <td>152</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>145.329934</td>\n",
       "      <td>166</td>\n",
       "      <td>145</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>174.569597</td>\n",
       "      <td>177</td>\n",
       "      <td>169</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>160.003162</td>\n",
       "      <td>167</td>\n",
       "      <td>160</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>162.242898</td>\n",
       "      <td>163</td>\n",
       "      <td>152</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Height  Height of Father  Height of Mother  NL_dummy  DM_dummy  \\\n",
       "0  161.783130               173               152         0         0   \n",
       "1  145.329934               166               145         0         0   \n",
       "2  174.569597               177               169         0         0   \n",
       "3  160.003162               167               160         0         0   \n",
       "4  162.242898               163               152         0         0   \n",
       "\n",
       "   FI_dummy  \n",
       "0         0  \n",
       "1         0  \n",
       "2         0  \n",
       "3         0  \n",
       "4         0  "
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3070965c-b517-4bf2-a413-a03097d1f674",
   "metadata": {},
   "source": [
    "Drop the dummies of Denmark and Finland"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "a51d5766-12c6-437d-b89c-8574b07fcd8e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Height</th>\n",
       "      <th>Height of Father</th>\n",
       "      <th>Height of Mother</th>\n",
       "      <th>NL_dummy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>161.783130</td>\n",
       "      <td>173</td>\n",
       "      <td>152</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>145.329934</td>\n",
       "      <td>166</td>\n",
       "      <td>145</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>174.569597</td>\n",
       "      <td>177</td>\n",
       "      <td>169</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>160.003162</td>\n",
       "      <td>167</td>\n",
       "      <td>160</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>162.242898</td>\n",
       "      <td>163</td>\n",
       "      <td>152</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Height  Height of Father  Height of Mother  NL_dummy\n",
       "0  161.783130               173               152         0\n",
       "1  145.329934               166               145         0\n",
       "2  174.569597               177               169         0\n",
       "3  160.003162               167               160         0\n",
       "4  162.242898               163               152         0"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_NL = df.drop(['DM_dummy', 'FI_dummy'], axis=1); df_NL.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c0f8b59-1bca-4c62-8527-4dda7e8a282b",
   "metadata": {},
   "source": [
    "Also create the column of multiplication of $D_{NL} \\cdot X_f$ and $D_{NL}\\cdot X_m$, then construct independent variable matrix and dependent variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "fe4a4e08-3116-4fcc-8d45-18c068ffff40",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_NL['D_NL_Xf'] = df_NL['Height of Father']*df_NL['NL_dummy']\n",
    "df_NL['D_NL_Xm'] = df_NL['Height of Mother']*df_NL['NL_dummy']\n",
    "X = df_NL[['NL_dummy', 'Height of Father', 'D_NL_Xf', 'Height of Mother', 'D_NL_Xm']]\n",
    "Y = df['Height']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "9dcb4b76-d89e-4ed6-8930-7dd8804368c1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                            OLS Regression Results                            \n",
      "==============================================================================\n",
      "Dep. Variable:                 Height   R-squared:                       0.691\n",
      "Model:                            OLS   Adj. R-squared:                  0.672\n",
      "Method:                 Least Squares   F-statistic:                     36.63\n",
      "Date:                Wed, 25 Aug 2021   Prob (F-statistic):           1.53e-19\n",
      "Time:                        13:25:18   Log-Likelihood:                -275.00\n",
      "No. Observations:                  88   AIC:                             562.0\n",
      "Df Residuals:                      82   BIC:                             576.9\n",
      "Df Model:                           5                                         \n",
      "Covariance Type:            nonrobust                                         \n",
      "====================================================================================\n",
      "                       coef    std err          t      P>|t|      [0.025      0.975]\n",
      "------------------------------------------------------------------------------------\n",
      "const               11.7747     13.230      0.890      0.376     -14.544      38.093\n",
      "NL_dummy           120.9563     58.160      2.080      0.041       5.257     236.655\n",
      "Height of Father     0.3457      0.094      3.665      0.000       0.158       0.533\n",
      "D_NL_Xf             -0.0946      0.377     -0.251      0.803      -0.845       0.656\n",
      "Height of Mother     0.5903      0.098      6.000      0.000       0.395       0.786\n",
      "D_NL_Xm             -0.5746      0.328     -1.750      0.084      -1.228       0.079\n",
      "==============================================================================\n",
      "Omnibus:                        4.383   Durbin-Watson:                   2.055\n",
      "Prob(Omnibus):                  0.112   Jarque-Bera (JB):                3.950\n",
      "Skew:                          -0.334   Prob(JB):                        0.139\n",
      "Kurtosis:                       3.795   Cond. No.                     2.37e+04\n",
      "==============================================================================\n",
      "\n",
      "Notes:\n",
      "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
      "[2] The condition number is large, 2.37e+04. This might indicate that there are\n",
      "strong multicollinearity or other numerical problems.\n"
     ]
    }
   ],
   "source": [
    "X = sm.add_constant(X) # adding a constant\n",
    "\n",
    "model = sm.OLS(Y, X).fit() \n",
    "print_model = model.summary()\n",
    "print(print_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cd34eeb-802c-4e00-94f6-2e29fd7ba633",
   "metadata": {},
   "source": [
    "Here's the estimated regression model\n",
    "$$\n",
    "\\hat{Y}=  11.7747  + 120.9563D_{NL} + 0.3457 X_f  - 0.0946 D_{NL}X_f + 0.5903X_m  -0.5746 D_{NL}X_m\n",
    "$$\n",
    "If $D_{NL}=1$ then\n",
    "$$\n",
    "\\hat{Y}=  132.731+0.2511X_f   -0.01X_m\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcd77c36-0ba3-4388-b187-fbe0656a8aee",
   "metadata": {},
   "source": [
    "Again, we define a Python function to predict Dutch male height based on their parents' height"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "27d4e986-539c-4e16-b36d-fc0502d0a2fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "def nl_height_marginal(fh, mh):\n",
    "    return 132.731 + fh*.2511 + mh*0.0157"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "a01a943d-468f-48b1-a9c9-057e0bc5ab54",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "182.01049999999998"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nl_height_marginal(185, 180)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb131c99-47fa-4094-a0d6-2aa076775c5f",
   "metadata": {},
   "source": [
    "Prediction seem quite logical. \n",
    "\n",
    "However, as you can see from the results, the hypotheses test rejects our theory that Dutch parents could influence their sons' marginal height, i.e. coefficients of $D_{NL} \\cdot X_f$ and $D_{NL}\\cdot X_m$ fail to reject null hypothesis with $5\\%$ significance level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "176a9064-23cd-4109-9bbb-432a89367c32",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}