{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 11\n",
    "\n",
    "Examples and Exercises from Think Stats, 2nd Edition\n",
    "\n",
    "http://thinkstats2.com\n",
    "\n",
    "Copyright 2016 Allen B. Downey\n",
    "\n",
    "MIT License: https://opensource.org/licenses/MIT\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from os.path import basename, exists\n",
    "\n",
    "\n",
    "def download(url):\n",
    "    filename = basename(url)\n",
    "    if not exists(filename):\n",
    "        from urllib.request import urlretrieve\n",
    "\n",
    "        local, _ = urlretrieve(url, filename)\n",
    "        print(\"Downloaded \" + local)\n",
    "\n",
    "\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py\")\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import thinkstats2\n",
    "import thinkplot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multiple regression\n",
    "\n",
    "Let's load up the NSFG data again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py\")\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/first.py\")\n",
    "\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct\")\n",
    "download(\n",
    "    \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import first\n",
    "\n",
    "live, firsts, others = first.MakeFrames()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's birth weight as a function of mother's age (which we saw in the previous chapter)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>       <td>totalwgt_lb</td>   <th>  R-squared:         </th> <td>   0.005</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.005</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   43.02</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Sun, 09 Apr 2023</td> <th>  Prob (F-statistic):</th> <td>5.72e-11</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>09:51:41</td>     <th>  Log-Likelihood:    </th> <td> -15897.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>  9038</td>      <th>  AIC:               </th> <td>3.180e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>  9036</td>      <th>  BIC:               </th> <td>3.181e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     1</td>      <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "      <td></td>         <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th> <td>    6.8304</td> <td>    0.068</td> <td>  100.470</td> <td> 0.000</td> <td>    6.697</td> <td>    6.964</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>agepreg</th>   <td>    0.0175</td> <td>    0.003</td> <td>    6.559</td> <td> 0.000</td> <td>    0.012</td> <td>    0.023</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>1024.052</td> <th>  Durbin-Watson:     </th> <td>   1.618</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th>  <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>3081.833</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>           <td>-0.601</td>  <th>  Prob(JB):          </th> <td>    0.00</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>       <td> 5.596</td>  <th>  Cond. No.          </th> <td>    118.</td>\n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:            totalwgt_lb   R-squared:                       0.005\n",
       "Model:                            OLS   Adj. R-squared:                  0.005\n",
       "Method:                 Least Squares   F-statistic:                     43.02\n",
       "Date:                Sun, 09 Apr 2023   Prob (F-statistic):           5.72e-11\n",
       "Time:                        09:51:41   Log-Likelihood:                -15897.\n",
       "No. Observations:                9038   AIC:                         3.180e+04\n",
       "Df Residuals:                    9036   BIC:                         3.181e+04\n",
       "Df Model:                           1                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "==============================================================================\n",
       "                 coef    std err          t      P>|t|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "Intercept      6.8304      0.068    100.470      0.000       6.697       6.964\n",
       "agepreg        0.0175      0.003      6.559      0.000       0.012       0.023\n",
       "==============================================================================\n",
       "Omnibus:                     1024.052   Durbin-Watson:                   1.618\n",
       "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3081.833\n",
       "Skew:                          -0.601   Prob(JB):                         0.00\n",
       "Kurtosis:                       5.596   Cond. No.                         118.\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import statsmodels.formula.api as smf\n",
    "\n",
    "formula = 'totalwgt_lb ~ agepreg'\n",
    "model = smf.ols(formula, data=live)\n",
    "results = model.fit()\n",
    "results.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can extract the estimated intercept and slope from `results.params`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6.830396973311051, 0.017453851471802638)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inter = results.params['Intercept']\n",
    "slope = results.params['agepreg']\n",
    "inter, slope"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also extract the p-value of the slope estimate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.7229471073163425e-11"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "slope_pvalue = results.pvalues['agepreg']\n",
    "slope_pvalue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the coefficient of determination, $R^2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.004738115474710369"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results.rsquared"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's the difference in mean birth weight (in pounds) between first babies and others."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-0.12476118453549034"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()\n",
    "diff_weight"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And here's the difference in mean age (in years) between mothers of first babies and mothers of others."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-3.5864347661500275"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "diff_age = firsts.agepreg.mean() - others.agepreg.mean()\n",
    "diff_age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The age difference plausibly explains about half of the difference in weight."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-0.0625970997216918"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "slope * diff_age"
   ]
  },
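  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough check on the \"about half\" estimate, here is the same arithmetic written out with the rounded values printed above. The symbol $\\Delta_{\\text{age}}$ is notation introduced here for `diff_age`.\n",
    "\n",
    "$$\\text{slope} \\times \\Delta_{\\text{age}} \\approx 0.0175 \\times (-3.586) \\approx -0.063 \\text{ lbs}$$\n",
    "\n",
    "$$\\frac{-0.063}{-0.125} \\approx 0.50$$\n",
    "\n",
    "So the weight difference predicted from age alone is roughly half of the observed difference of about -0.125 lbs."
   ]
  },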
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compute the difference in birth weight again by running a regression with a single categorical explanatory variable, `isfirst`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>       <td>totalwgt_lb</td>   <th>  R-squared:         </th> <td>   0.002</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.002</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   17.74</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Sun, 09 Apr 2023</td> <th>  Prob (F-statistic):</th> <td>2.55e-05</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>09:51:41</td>     <th>  Log-Likelihood:    </th> <td> -15909.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>  9038</td>      <th>  AIC:               </th> <td>3.182e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>  9036</td>      <th>  BIC:               </th> <td>3.184e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     1</td>      <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "         <td></td>            <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>       <td>    7.3259</td> <td>    0.021</td> <td>  356.007</td> <td> 0.000</td> <td>    7.286</td> <td>    7.366</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>isfirst[T.True]</th> <td>   -0.1248</td> <td>    0.030</td> <td>   -4.212</td> <td> 0.000</td> <td>   -0.183</td> <td>   -0.067</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>988.919</td> <th>  Durbin-Watson:     </th> <td>   1.613</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>2897.107</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td>-0.589</td>  <th>  Prob(JB):          </th> <td>    0.00</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 5.511</td>  <th>  Cond. No.          </th> <td>    2.58</td>\n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:            totalwgt_lb   R-squared:                       0.002\n",
       "Model:                            OLS   Adj. R-squared:                  0.002\n",
       "Method:                 Least Squares   F-statistic:                     17.74\n",
       "Date:                Sun, 09 Apr 2023   Prob (F-statistic):           2.55e-05\n",
       "Time:                        09:51:41   Log-Likelihood:                -15909.\n",
       "No. Observations:                9038   AIC:                         3.182e+04\n",
       "Df Residuals:                    9036   BIC:                         3.184e+04\n",
       "Df Model:                           1                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "===================================================================================\n",
       "                      coef    std err          t      P>|t|      [0.025      0.975]\n",
       "-----------------------------------------------------------------------------------\n",
       "Intercept           7.3259      0.021    356.007      0.000       7.286       7.366\n",
       "isfirst[T.True]    -0.1248      0.030     -4.212      0.000      -0.183      -0.067\n",
       "==============================================================================\n",
       "Omnibus:                      988.919   Durbin-Watson:                   1.613\n",
       "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2897.107\n",
       "Skew:                          -0.589   Prob(JB):                         0.00\n",
       "Kurtosis:                       5.511   Cond. No.                         2.58\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "live['isfirst'] = live.birthord == 1\n",
    "formula = 'totalwgt_lb ~ isfirst'\n",
    "results = smf.ols(formula, data=live).fit()\n",
    "results.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can run a multiple regression that includes both `isfirst` and `agepreg`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>       <td>totalwgt_lb</td>   <th>  R-squared:         </th> <td>   0.005</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.005</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   24.02</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Sun, 09 Apr 2023</td> <th>  Prob (F-statistic):</th> <td>3.95e-11</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>09:51:41</td>     <th>  Log-Likelihood:    </th> <td> -15894.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>  9038</td>      <th>  AIC:               </th> <td>3.179e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>  9035</td>      <th>  BIC:               </th> <td>3.182e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     2</td>      <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "         <td></td>            <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>       <td>    6.9142</td> <td>    0.078</td> <td>   89.073</td> <td> 0.000</td> <td>    6.762</td> <td>    7.066</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>isfirst[T.True]</th> <td>   -0.0698</td> <td>    0.031</td> <td>   -2.236</td> <td> 0.025</td> <td>   -0.131</td> <td>   -0.009</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>agepreg</th>         <td>    0.0154</td> <td>    0.003</td> <td>    5.499</td> <td> 0.000</td> <td>    0.010</td> <td>    0.021</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>1019.945</td> <th>  Durbin-Watson:     </th> <td>   1.618</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th>  <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>3063.682</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>           <td>-0.599</td>  <th>  Prob(JB):          </th> <td>    0.00</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>       <td> 5.588</td>  <th>  Cond. No.          </th> <td>    137.</td>\n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:            totalwgt_lb   R-squared:                       0.005\n",
       "Model:                            OLS   Adj. R-squared:                  0.005\n",
       "Method:                 Least Squares   F-statistic:                     24.02\n",
       "Date:                Sun, 09 Apr 2023   Prob (F-statistic):           3.95e-11\n",
       "Time:                        09:51:41   Log-Likelihood:                -15894.\n",
       "No. Observations:                9038   AIC:                         3.179e+04\n",
       "Df Residuals:                    9035   BIC:                         3.182e+04\n",
       "Df Model:                           2                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "===================================================================================\n",
       "                      coef    std err          t      P>|t|      [0.025      0.975]\n",
       "-----------------------------------------------------------------------------------\n",
       "Intercept           6.9142      0.078     89.073      0.000       6.762       7.066\n",
       "isfirst[T.True]    -0.0698      0.031     -2.236      0.025      -0.131      -0.009\n",
       "agepreg             0.0154      0.003      5.499      0.000       0.010       0.021\n",
       "==============================================================================\n",
       "Omnibus:                     1019.945   Durbin-Watson:                   1.618\n",
       "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3063.682\n",
       "Skew:                          -0.599   Prob(JB):                         0.00\n",
       "Kurtosis:                       5.588   Cond. No.                         137.\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "formula = 'totalwgt_lb ~ isfirst + agepreg'\n",
    "results = smf.ols(formula, data=live).fit()\n",
    "results.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, when we control for mother's age, the apparent difference due to `isfirst` is cut roughly in half, from about -0.125 lbs to about -0.070 lbs.\n",
    "\n",
    "If we add age squared, we can control for a quadratic relationship between age and weight."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>       <td>totalwgt_lb</td>   <th>  R-squared:         </th> <td>   0.007</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.007</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   22.64</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Sun, 09 Apr 2023</td> <th>  Prob (F-statistic):</th> <td>1.35e-14</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>09:51:41</td>     <th>  Log-Likelihood:    </th> <td> -15884.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>  9038</td>      <th>  AIC:               </th> <td>3.178e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>  9034</td>      <th>  BIC:               </th> <td>3.181e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     3</td>      <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "         <td></td>            <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>       <td>    5.6923</td> <td>    0.286</td> <td>   19.937</td> <td> 0.000</td> <td>    5.133</td> <td>    6.252</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>isfirst[T.True]</th> <td>   -0.0504</td> <td>    0.031</td> <td>   -1.602</td> <td> 0.109</td> <td>   -0.112</td> <td>    0.011</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>agepreg</th>         <td>    0.1124</td> <td>    0.022</td> <td>    5.113</td> <td> 0.000</td> <td>    0.069</td> <td>    0.155</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>agepreg2</th>        <td>   -0.0018</td> <td>    0.000</td> <td>   -4.447</td> <td> 0.000</td> <td>   -0.003</td> <td>   -0.001</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>1007.149</td> <th>  Durbin-Watson:     </th> <td>   1.616</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th>  <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>3003.343</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>           <td>-0.594</td>  <th>  Prob(JB):          </th> <td>    0.00</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>       <td> 5.562</td>  <th>  Cond. No.          </th> <td>1.39e+04</td>\n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 1.39e+04. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:            totalwgt_lb   R-squared:                       0.007\n",
       "Model:                            OLS   Adj. R-squared:                  0.007\n",
       "Method:                 Least Squares   F-statistic:                     22.64\n",
       "Date:                Sun, 09 Apr 2023   Prob (F-statistic):           1.35e-14\n",
       "Time:                        09:51:41   Log-Likelihood:                -15884.\n",
       "No. Observations:                9038   AIC:                         3.178e+04\n",
       "Df Residuals:                    9034   BIC:                         3.181e+04\n",
       "Df Model:                           3                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "===================================================================================\n",
       "                      coef    std err          t      P>|t|      [0.025      0.975]\n",
       "-----------------------------------------------------------------------------------\n",
       "Intercept           5.6923      0.286     19.937      0.000       5.133       6.252\n",
       "isfirst[T.True]    -0.0504      0.031     -1.602      0.109      -0.112       0.011\n",
       "agepreg             0.1124      0.022      5.113      0.000       0.069       0.155\n",
       "agepreg2           -0.0018      0.000     -4.447      0.000      -0.003      -0.001\n",
       "==============================================================================\n",
       "Omnibus:                     1007.149   Durbin-Watson:                   1.616\n",
       "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3003.343\n",
       "Skew:                          -0.594   Prob(JB):                         0.00\n",
       "Kurtosis:                       5.562   Cond. No.                     1.39e+04\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "[2] The condition number is large, 1.39e+04. This might indicate that there are\n",
       "strong multicollinearity or other numerical problems.\n",
       "\"\"\""
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "live['agepreg2'] = live.agepreg**2\n",
    "formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'\n",
    "results = smf.ols(formula, data=live).fit()\n",
    "results.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we do that, the apparent effect of `isfirst` gets even smaller (about -0.050 lbs) and is no longer statistically significant at the 5% level (p = 0.109).\n",
    "\n",
    "These results suggest that the apparent difference in weight between first babies and others might be explained, at least in part, by differences in mothers' ages."
   ]
  },
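  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the last model can be written out as follows. This is only a sketch of the form of the model: the $\\beta$ symbols and the error term $\\varepsilon$ are notation introduced here, with $\\text{weight}$ standing for `totalwgt_lb` and $\\text{age}$ for `agepreg`.\n",
    "\n",
    "$$\\text{weight} = \\beta_0 + \\beta_1\\,\\text{isfirst} + \\beta_2\\,\\text{age} + \\beta_3\\,\\text{age}^2 + \\varepsilon$$\n",
    "\n",
    "In the fit above, the estimate of $\\beta_3$ is negative (about -0.0018), so the fitted weight increases more slowly with age for older mothers."
   ]
  },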
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Mining\n",
    "\n",
    "We can use `join` to combine variables from the pregnancy and respondent tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dct\")\n",
    "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dat.gz\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(8884, 3333)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import nsfg\n",
    "\n",
    "live = live[live.prglngth>30]\n",
    "resp = nsfg.ReadFemResp()\n",
    "resp.index = resp.caseid\n",
    "join = live.join(resp, on='caseid', rsuffix='_r')\n",
    "join.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can search for variables with explanatory power by regressing birth weight on `agepreg` plus each candidate variable in turn.\n",
    "\n",
    "Because we don't clean most of the variables, we are probably missing some good ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import patsy\n",
    "\n",
    "def GoMining(df):\n",
    "    \"\"\"Searches for variables that predict birth weight.\n",
    "\n",
    "    df: DataFrame of pregnancy records\n",
    "\n",
    "    returns: list of (rsquared, variable name) pairs\n",
    "    \"\"\"\n",
    "    variables = []\n",
    "    for name in df.columns:\n",
    "        try:\n",
    "            # skip columns that are constant or nearly constant\n",
    "            if df[name].var() < 1e-7:\n",
    "                continue\n",
    "\n",
    "            formula = 'totalwgt_lb ~ agepreg + ' + name\n",
    "            model = smf.ols(formula, data=df)\n",
    "            # skip models that keep fewer than half of the rows\n",
    "            # after rows with missing values are dropped\n",
    "            if model.nobs < len(df)/2:\n",
    "                continue\n",
    "\n",
    "            results = model.fit()\n",
    "        except (ValueError, TypeError, patsy.PatsyError):\n",
    "            # skip columns that raise an error (for example, non-numeric data)\n",
    "            continue\n",
    "\n",
    "        variables.append((results.rsquared, name))\n",
    "\n",
    "    return variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0.005357647323640413, 'caseid'),\n",
       " (0.005750013985077129, 'pregordr'),\n",
       " (0.006330980237390205, 'pregend1'),\n",
       " (0.016017752709788113, 'nbrnaliv'),\n",
       " (0.005543156193094756, 'cmprgend'),\n",
       " (0.005442800591640151, 'cmprgbeg'),\n",
       " (0.005327612601561116, 'gestasun_m'),\n",
       " (0.007023552638453112, 'gestasun_w'),\n",
       " (0.12340041363361076, 'wksgest'),\n",
       " (0.02714427463957958, 'mosgest'),\n",
       " (0.0053368691675173, 'bpa_bdscheck1'),\n",
       " (0.018550925293942533, 'babysex'),\n",
       " (0.9498127305978009, 'birthwgt_lb'),\n",
       " (0.013102457615706498, 'birthwgt_oz'),\n",
       " (0.005543156193094756, 'cmbabdob'),\n",
       " (0.005684952650027997, 'kidage'),\n",
       " (0.006165319836040295, 'hpagelb'),\n",
       " (0.008066317368677245, 'matchfound'),\n",
       " (0.012529022541810653, 'anynurse'),\n",
       " (0.004409820583625601, 'frsteatd_n'),\n",
       " (0.004263973471709814, 'frsteatd_p'),\n",
       " (0.004020131462736054, 'frsteatd'),\n",
       " (0.005830571770254145, 'cmlastlb'),\n",
       " (0.005356747266123785, 'cmfstprg'),\n",
       " (0.005428333650990047, 'cmlstprg'),\n",
       " (0.005731401733759189, 'cmintstr'),\n",
       " (0.005543156193094756, 'cmintfin'),\n",
       " (0.00993306080712264, 'evuseint'),\n",
       " (0.009315099704132801, 'stopduse'),\n",
       " (0.00372683328673018, 'wantbold'),\n",
       " (0.0070729951341236275, 'timingok'),\n",
       " (0.005042504093811795, 'wthpart1'),\n",
       " (0.006835771483523323, 'hpwnold'),\n",
       " (0.006349094713449799, 'timokhp'),\n",
       " (0.00262913761511141, 'cohpbeg'),\n",
       " (0.0018043469091931774, 'cohpend'),\n",
       " (0.00808960003494219, 'tellfath'),\n",
       " (0.009056250355562567, 'whentell'),\n",
       " (0.005369974278794709, 'anyusint'),\n",
       " (0.13012519488625085, 'prglngth'),\n",
       " (0.005545615084230682, 'birthord'),\n",
       " (0.005591745847583596, 'datend'),\n",
       " (0.005327282505071085, 'agepreg'),\n",
       " (0.00566538884322787, 'datecon'),\n",
       " (0.10203149928156052, 'agecon'),\n",
       " (0.010461691367377068, 'fmarout5'),\n",
       " (0.009840804911715684, 'pmarpreg'),\n",
       " (0.011354138472805753, 'rmarout6'),\n",
       " (0.0106049646842995, 'fmarcon5'),\n",
       " (0.3008240784470769, 'lbw1'),\n",
       " (0.012193688404495417, 'bfeedwks'),\n",
       " (0.007984835684252678, 'oldwantr'),\n",
       " (0.00640138668536383, 'oldwantp'),\n",
       " (0.007980832538658, 'wantresp'),\n",
       " (0.006334468987300168, 'wantpart'),\n",
       " (0.00559161600442204, 'cmbirth'),\n",
       " (0.0055903980552193255, 'ager'),\n",
       " (0.0055903980552193255, 'agescrn'),\n",
       " (0.009944942659110723, 'fmarital'),\n",
       " (0.008267774071422429, 'rmarital'),\n",
       " (0.006450913803300651, 'educat'),\n",
       " (0.0066919868225499, 'hieduc'),\n",
       " (0.016199503586253106, 'race'),\n",
       " (0.005351273101023457, 'hispanic'),\n",
       " (0.01123834930203138, 'hisprace'),\n",
       " (0.005415425347505165, 'rcurpreg'),\n",
       " (0.0060378317082542265, 'pregnum'),\n",
       " (0.00650372032144908, 'parity'),\n",
       " (0.005444228863618061, 'insuranc'),\n",
       " (0.009858545642850713, 'pubassis'),\n",
       " (0.009743158975296873, 'poverty'),\n",
       " (0.006124250620027971, 'laborfor'),\n",
       " (0.005476246226178816, 'religion'),\n",
       " (0.005908687699079596, 'metro'),\n",
       " (0.0053296353237825, 'brnout'),\n",
       " (0.005388240758326224, 'prglngth_i'),\n",
       " (0.0053720967087050875, 'datend_i'),\n",
       " (0.005666104281317086, 'agepreg_i'),\n",
       " (0.0053480888696338935, 'datecon_i'),\n",
       " (0.005612740210896416, 'agecon_i'),\n",
       " (0.005733140260446579, 'fmarout5_i'),\n",
       " (0.005422598571288018, 'pmarpreg_i'),\n",
       " (0.005498885939111409, 'rmarout6_i'),\n",
       " (0.0057702817140151685, 'fmarcon5_i'),\n",
       " (0.005355587358294223, 'learnprg_i'),\n",
       " (0.005464552651942123, 'pncarewk_i'),\n",
       " (0.005911575701061822, 'paydeliv_i'),\n",
       " (0.005327282505071085, 'lbw1_i'),\n",
       " (0.005422843440833436, 'bfeedwks_i'),\n",
       " (0.005456277033588197, 'maternlv_i'),\n",
       " (0.005397823762493981, 'oldwantr_i'),\n",
       " (0.005330102063603626, 'oldwantp_i'),\n",
       " (0.005397823762493981, 'wantresp_i'),\n",
       " (0.00538826132872694, 'wantpart_i'),\n",
       " (0.005415854205569115, 'hieduc_i'),\n",
       " (0.005327282505070974, 'hispanic_i'),\n",
       " (0.005662161985408476, 'parity_i'),\n",
       " (0.005490192077694744, 'insuranc_i'),\n",
       " (0.005588263662201554, 'pubassis_i'),\n",
       " (0.005674668721373788, 'poverty_i'),\n",
       " (0.005635393818939516, 'laborfor_i'),\n",
       " (0.005329126750794222, 'religion_i'),\n",
       " (0.007266083159805259, 'basewgt'),\n",
       " (0.006863344757269019, 'adj_mod_basewgt'),\n",
       " (0.007414601906967189, 'finalwgt'),\n",
       " (0.005996732588561704, 'secu_p'),\n",
       " (0.00540529186826777, 'sest'),\n",
       " (1.0, 'totalwgt_lb'),\n",
       " (0.005617788291713333, 'isfirst'),\n",
       " (0.007529217800612997, 'agepreg2'),\n",
       " (0.005357647323640413, 'caseid_r'),\n",
       " (0.005327683657806448, 'rscrinf'),\n",
       " (0.005394521556674303, 'rdormres'),\n",
       " (0.005643925975030384, 'rostscrn'),\n",
       " (0.005404878128178137, 'rscreenhisp'),\n",
       " (0.009651605370030958, 'rscreenrace'),\n",
       " (0.005578533477514025, 'age_a'),\n",
       " (0.0055903980552193255, 'age_r'),\n",
       " (0.00559161600442204, 'cmbirth_r'),\n",
       " (0.0055903980552193255, 'agescrn_r'),\n",
       " (0.008267774071422429, 'marstat'),\n",
       " (0.009944942659110723, 'fmarit'),\n",
       " (0.009091376003145912, 'evrmarry'),\n",
       " (0.00535044323233691, 'hisp'),\n",
       " (0.005516347262115917, 'numrace'),\n",
       " (0.0053556674242533076, 'roscnt'),\n",
       " (0.007003475396870851, 'hplocale'),\n",
       " (0.007768164334321925, 'manrel'),\n",
       " (0.005348262928552283, 'fl_rrace'),\n",
       " (0.005330593394374583, 'fl_rhisp'),\n",
       " (0.005986092508868612, 'goschol'),\n",
       " (0.0062182476084240434, 'higrade'),\n",
       " (0.005595972205724942, 'compgrd'),\n",
       " (0.0061609210437847395, 'havedip'),\n",
       " (0.006413101502788066, 'dipged'),\n",
       " (0.006061422266963268, 'cmhsgrad'),\n",
       " (0.00532855360811324, 'wthparnw'),\n",
       " (0.005280615627084373, 'onown'),\n",
       " (0.0061291662740246, 'intact'),\n",
       " (0.006038819191132916, 'parmarr'),\n",
       " (0.005635183345626071, 'momdegre'),\n",
       " (0.006872582290938678, 'momworkd'),\n",
       " (0.00533037621783472, 'momchild'),\n",
       " (0.005349294568680385, 'momfstch'),\n",
       " (0.006245111146334303, 'daddegre'),\n",
       " (0.005328519387111874, 'bothbiol'),\n",
       " (0.006182128607900683, 'intact18'),\n",
       " (0.005340451644766264, 'onown18'),\n",
       " (0.00650372032144908, 'numbabes'),\n",
       " (0.005550348810967609, 'totplacd'),\n",
       " (0.005337228144332018, 'nplaced'),\n",
       " (0.005530586313726937, 'ndied'),\n",
       " (0.005539874079434126, 'nadoptv'),\n",
       " (0.005830571770254145, 'cmlastlb_r'),\n",
       " (0.005356747266123785, 'cmfstprg_r'),\n",
       " (0.005428333650990047, 'cmlstprg_r'),\n",
       " (0.0059155825615551105, 'menarche'),\n",
       " (0.00534384518588682, 'pregnowq'),\n",
       " (0.0060378317082542265, 'numpregs'),\n",
       " (0.005415425347504943, 'currpreg'),\n",
       " (0.005780984893938856, 'giveadpt'),\n",
       " (0.0052547662523021454, 'otherkid'),\n",
       " (0.005289464417210787, 'everadpt'),\n",
       " (0.0058286851663723604, 'seekadpt'),\n",
       " (0.0065420652227655696, 'evwntano'),\n",
       " (0.007146484736500702, 'timesmar'),\n",
       " (0.006260110249016293, 'hsbverif'),\n",
       " (0.0065101987688339635, 'cmmarrhx'),\n",
       " (0.007559824082689293, 'hxagemar'),\n",
       " (0.007432269596385654, 'cmhsbdobx'),\n",
       " (0.00783511174689866, 'lvtoghx'),\n",
       " (0.006732535398132455, 'hisphx'),\n",
       " (0.00854385589625295, 'racehx1'),\n",
       " (0.009526403068447875, 'chedmarn'),\n",
       " (0.006464090481781537, 'marbefhx'),\n",
       " (0.007928273601067293, 'kidshx'),\n",
       " (0.007183543845538876, 'cmmarrch'),\n",
       " (0.006761687432294994, 'cmdobch'),\n",
       " (0.006548743638518095, 'prevhusb'),\n",
       " (0.006295608867143532, 'cmstrthp'),\n",
       " (0.007320288484473303, 'evrcohab'),\n",
       " (0.0073365000741104636, 'liveoth'),\n",
       " (0.006131749211457538, 'prevcohb'),\n",
       " (0.005640038945984638, 'cmfstsex'),\n",
       " (0.0054298693534173825, 'agefstsx'),\n",
       " (0.0010172326396578057, 'grfstsx'),\n",
       " (0.005745270584702311, 'sameman'),\n",
       " (0.0054772600593092635, 'fpage'),\n",
       " (0.0053827337115199825, 'knowfp'),\n",
       " (0.006043726456224308, 'cmlsexfp'),\n",
       " (0.0059106934677164435, 'cmfplast'),\n",
       " (0.005423734887383902, 'lifeprt'),\n",
       " (0.0053417476676171916, 'mon12prt'),\n",
       " (0.005349388865006244, 'parts12'),\n",
       " (0.006502466569061172, 'ptsb4mar'),\n",
       " (0.00627002683228417, 'p1yrage'),\n",
       " (0.004475735734043584, 'p1yhsage'),\n",
       " (0.0046862977809064565, 'p1yrf'),\n",
       " (0.007927219826495358, 'cmfsexx'),\n",
       " (0.006085450415460381, 'pcurrntx'),\n",
       " (0.006103433602601682, 'cmlsexx'),\n",
       " (0.006129604796508592, 'cmlstsxx'),\n",
       " (0.0060299492496314056, 'cmlstsx12'),\n",
       " (0.005339302768089693, 'lifeprts'),\n",
       " (0.0053282763855250215, 'cmlastsx'),\n",
       " (0.005790778139281083, 'currprtt'),\n",
       " (0.006691333575973957, 'currprts'),\n",
       " (0.004259919311988658, 'cmpart1y1'),\n",
       " (0.0053727944846478914, 'evertubs'),\n",
       " (0.005394668684848614, 'everhyst'),\n",
       " (0.0051812137216755705, 'everovrs'),\n",
       " (0.0055216206466339735, 'everothr'),\n",
       " (0.00549788837202414, 'anyfster'),\n",
       " (0.005806983909926733, 'fstrop12'),\n",
       " (0.0064869433773807605, 'anyopsmn'),\n",
       " (0.0064756285532651114, 'anymster'),\n",
       " (0.005457516875982504, 'rsurgstr'),\n",
       " (0.006286783510051408, 'psurgstr'),\n",
       " (0.0058616476673256646, 'onlytbvs'),\n",
       " (0.0038784396522586473, 'posiblpg'),\n",
       " (0.008011147208074498, 'canhaver'),\n",
       " (0.003893424250926647, 'pregnono'),\n",
       " (0.005331594693332997, 'rstrstat'),\n",
       " (0.006809171829395333, 'pstrstat'),\n",
       " (0.00644307628137919, 'pill'),\n",
       " (0.005456714711317701, 'condom'),\n",
       " (0.006878239843079559, 'vasectmy'),\n",
       " (0.005647132703301971, 'widrawal'),\n",
       " (0.0067777099610751845, 'depoprov'),\n",
       " (0.005360766267951678, 'norplant'),\n",
       " (0.006558270799857713, 'rhythm'),\n",
       " (0.006781472906158714, 'tempsafe'),\n",
       " (0.005333255591501329, 'mornpill'),\n",
       " (0.005588775806413815, 'diafragm'),\n",
       " (0.008181466230171464, 'wocondom'),\n",
       " (0.00571244189146225, 'foamalon'),\n",
       " (0.005605892825595982, 'jelcrmal'),\n",
       " (0.006669505309996104, 'cervlcap'),\n",
       " (0.005328971051790532, 'supposit'),\n",
       " (0.0067739501854262585, 'todayspg'),\n",
       " (0.005348443535178604, 'iud'),\n",
       " (0.005669693175719415, 'lunelle'),\n",
       " (0.0055746648157347645, 'patch'),\n",
       " (0.006317045357729589, 'othrmeth'),\n",
       " (0.005328912963907251, 'everused'),\n",
       " (0.0062621002452544205, 'methdiss'),\n",
       " (0.006783255187156723, 'methstop01'),\n",
       " (0.005827582700002165, 'firsmeth01'),\n",
       " (0.0059969573119960096, 'numfirsm'),\n",
       " (0.0057600954802820015, 'numfirsm1'),\n",
       " (0.006335882371607426, 'numfirsm2'),\n",
       " (0.006336231852240637, 'drugdev'),\n",
       " (0.004369823481886859, 'firstime2'),\n",
       " (0.005183357448185544, 'cmfstuse'),\n",
       " (0.006498991331905568, 'cmfirsm'),\n",
       " (0.004437510523967125, 'agefstus'),\n",
       " (0.005735952227613139, 'usefstsx'),\n",
       " (0.005546199546767272, 'intr_ec3'),\n",
       " (0.006421238927684092, 'monsx1177'),\n",
       " (0.005995975528061859, 'monsx1178'),\n",
       " (0.005871772632105254, 'monsx1179'),\n",
       " (0.005355244912861767, 'monsx1180'),\n",
       " (0.005475780215880466, 'monsx1181'),\n",
       " (0.005878264818758194, 'monsx1182'),\n",
       " (0.00593529016936678, 'monsx1183'),\n",
       " (0.0061367227367412625, 'monsx1184'),\n",
       " (0.006114461327954457, 'monsx1185'),\n",
       " (0.005914492828921203, 'monsx1186'),\n",
       " (0.00571514572173959, 'monsx1187'),\n",
       " (0.0059426297773013115, 'monsx1188'),\n",
       " (0.006403878523879025, 'monsx1189'),\n",
       " (0.006652431486344534, 'monsx1190'),\n",
       " (0.005859464888209431, 'monsx1191'),\n",
       " (0.0061836704118874986, 'monsx1192'),\n",
       " (0.005970697654319124, 'monsx1193'),\n",
       " (0.006185930348181823, 'monsx1194'),\n",
       " (0.006261722615339971, 'monsx1195'),\n",
       " (0.005883642310483994, 'monsx1196'),\n",
       " (0.006312131013808564, 'monsx1197'),\n",
       " (0.006334490158298567, 'monsx1198'),\n",
       " (0.006238157095211028, 'monsx1199'),\n",
       " (0.005958303111450847, 'monsx1200'),\n",
       " (0.005971611823831213, 'monsx1201'),\n",
       " (0.006232965940293433, 'monsx1202'),\n",
       " (0.0061567646880547056, 'monsx1203'),\n",
       " (0.00646363990894494, 'monsx1204'),\n",
       " (0.006792900648230571, 'monsx1205'),\n",
       " (0.006223101663507369, 'monsx1206'),\n",
       " (0.006017807331303304, 'monsx1207'),\n",
       " (0.006329241899985627, 'monsx1208'),\n",
       " (0.0063566695058372424, 'monsx1209'),\n",
       " (0.006029258683564187, 'monsx1210'),\n",
       " (0.005520990260007963, 'monsx1211'),\n",
       " (0.005987728957053462, 'monsx1212'),\n",
       " (0.007198033668191051, 'monsx1213'),\n",
       " (0.006899020398532851, 'monsx1214'),\n",
       " (0.006151064433626341, 'monsx1215'),\n",
       " (0.006041191941944746, 'monsx1216'),\n",
       " (0.006215321754830194, 'monsx1217'),\n",
       " (0.005713521284034573, 'monsx1218'),\n",
       " (0.005933855306221036, 'monsx1219'),\n",
       " (0.006167667783488429, 'monsx1220'),\n",
       " (0.006029948736347213, 'monsx1221'),\n",
       " (0.005840062379099065, 'monsx1222'),\n",
       " (0.005619611941114044, 'monsx1223'),\n",
       " (0.006563305715290513, 'monsx1224'),\n",
       " (0.006284349732578187, 'monsx1225'),\n",
       " (0.006323173014771255, 'monsx1226'),\n",
       " (0.005534365979908085, 'monsx1227'),\n",
       " (0.006374798787152525, 'monsx1228'),\n",
       " (0.007520765906871785, 'monsx1229'),\n",
       " (0.008083697967827597, 'monsx1230'),\n",
       " (0.007233374656298253, 'monsx1231'),\n",
       " (0.007985217080101137, 'monsx1232'),\n",
       " (0.007191033197050278, 'monsx1233'),\n",
       " (0.006261425077471072, 'cmstrtmc'),\n",
       " (0.005825958517251317, 'cmendmc'),\n",
       " (0.005603506388835111, 'methhist011'),\n",
       " (0.007284980353802539, 'cmdatbgn'),\n",
       " (0.005635409045157691, 'nummult'),\n",
       " (0.005551844008685136, 'methhist021'),\n",
       " (0.0054895117434013985, 'nummult2'),\n",
       " (0.0055628303964547765, 'methhist031'),\n",
       " (0.005546883063649921, 'nummult3'),\n",
       " (0.005487202405734637, 'methhist041'),\n",
       " (0.005502460937693798, 'nummult4'),\n",
       " (0.0055361891691549925, 'methhist051'),\n",
       " (0.005452729882170382, 'nummult5'),\n",
       " (0.005489245446082758, 'methhist061'),\n",
       " (0.005475895500679395, 'nummult6'),\n",
       " (0.0054732146908181845, 'methhist071'),\n",
       " (0.005428876448518971, 'nummult7'),\n",
       " (0.005594046854832779, 'methhist081'),\n",
       " (0.0054730049104988465, 'nummult8'),\n",
       " (0.005482617158302006, 'methhist091'),\n",
       " (0.0054272326646152, 'nummult9'),\n",
       " (0.005503393156441327, 'methhist101'),\n",
       " (0.005437143233952169, 'nummult10'),\n",
       " (0.005516430998187216, 'methhist111'),\n",
       " (0.005502305304443622, 'nummult11'),\n",
       " (0.005470302098175561, 'methhist121'),\n",
       " (0.005504255019818993, 'nummult12'),\n",
       " (0.005607458874307136, 'methhist131'),\n",
       " (0.005549721074183167, 'nummult13'),\n",
       " (0.005546717068292573, 'methhist141'),\n",
       " (0.00554268385974932, 'nummult14'),\n",
       " (0.005539692295560172, 'methhist151'),\n",
       " (0.0055324680547158556, 'nummult15'),\n",
       " (0.005536492647946201, 'methhist161'),\n",
       " (0.005536401342006614, 'nummult16'),\n",
       " (0.00551933216368472, 'methhist171'),\n",
       " (0.005493486008020909, 'nummult17'),\n",
       " (0.005507179974851173, 'methhist181'),\n",
       " (0.0054913983634302665, 'nummult18'),\n",
       " (0.005471902813541596, 'methhist191'),\n",
       " (0.005487640499653779, 'nummult19'),\n",
       " (0.005504743168494475, 'methhist201'),\n",
       " (0.005466739231583695, 'nummult20'),\n",
       " (0.005510186988941679, 'methhist211'),\n",
       " (0.005491203193220606, 'nummult21'),\n",
       " (0.005506337448605736, 'methhist221'),\n",
       " (0.005511607291997178, 'nummult22'),\n",
       " (0.0055295450491826825, 'methhist231'),\n",
       " (0.005530806017465695, 'nummult23'),\n",
       " (0.00555002676240679, 'methhist241'),\n",
       " (0.005539948289882801, 'nummult24'),\n",
       " (0.0055211355102111614, 'methhist251'),\n",
       " (0.00556828550933397, 'nummult25'),\n",
       " (0.005540129688542339, 'methhist261'),\n",
       " (0.005673797628286015, 'nummult26'),\n",
       " (0.0055319741942275735, 'methhist271'),\n",
       " (0.005784089177182872, 'nummult27'),\n",
       " (0.0055784115374736265, 'methhist281'),\n",
       " (0.005600401746815531, 'nummult28'),\n",
       " (0.005578714034812471, 'methhist291'),\n",
       " (0.0056155148617611506, 'nummult29'),\n",
       " (0.0056429313351300525, 'methhist301'),\n",
       " (0.00573173986023956, 'nummult30'),\n",
       " (0.005616953589053675, 'methhist311'),\n",
       " (0.005669936949214471, 'nummult31'),\n",
       " (0.00564355522220783, 'methhist321'),\n",
       " (0.00574989591519115, 'nummult32'),\n",
       " (0.005675590741276326, 'methhist331'),\n",
       " (0.005780466079303159, 'nummult33'),\n",
       " (0.005727245232216571, 'methhist341'),\n",
       " (0.00587256497631905, 'nummult34'),\n",
       " (0.005746152259455073, 'methhist351'),\n",
       " (0.005842619866505694, 'nummult35'),\n",
       " (0.005750154151443976, 'methhist361'),\n",
       " (0.005681701891443347, 'nummult36'),\n",
       " (0.006075297759302933, 'methhist371'),\n",
       " (0.005704951869176855, 'nummult37'),\n",
       " (0.006209219638329877, 'methhist381'),\n",
       " (0.005708874481709425, 'nummult38'),\n",
       " (0.00615001891252942, 'methhist391'),\n",
       " (0.005747940396037099, 'nummult39'),\n",
       " (0.006692950545528653, 'methhist401'),\n",
       " (0.006258857669329654, 'nummult40'),\n",
       " (0.007456274427272258, 'methhist411'),\n",
       " (0.006932216156475546, 'nummult41'),\n",
       " (0.008359250704236931, 'methhist421'),\n",
       " (0.007521964733084974, 'nummult42'),\n",
       " (0.009953102543862835, 'methhist431'),\n",
       " (0.007971602243966425, 'nummult43'),\n",
       " (0.009584160754145143, 'methhist441'),\n",
       " (0.007606603124692746, 'nummult44'),\n",
       " (0.009508144086923354, 'methhist451'),\n",
       " (0.006828747690285741, 'nummult45'),\n",
       " (0.007584417482075834, 'currmeth1'),\n",
       " (0.00627037344004322, 'lastmonmeth1'),\n",
       " (0.006727716986788201, 'uselstp'),\n",
       " (0.007019085471651421, 'lstmthp11'),\n",
       " (0.00558079811714618, 'usefstp'),\n",
       " (0.006042935277705164, 'pst4wksx'),\n",
       " (0.006741983069778024, 'pswkcond2'),\n",
       " (0.006418878609596557, 'p12mocon'),\n",
       " (0.00534642916952166, 'bthcon12'),\n",
       " (0.005332954623398445, 'medtst12'),\n",
       " (0.005341624685738289, 'bccns12'),\n",
       " (0.005429737619422337, 'stcns12'),\n",
       " (0.005470924106891761, 'eccns12'),\n",
       " (0.005348972077159231, 'prgtst12'),\n",
       " (0.0053943749433755794, 'abort12'),\n",
       " (0.005367446414242583, 'pap12'),\n",
       " (0.005672332993427065, 'pelvic12'),\n",
       " (0.0062589483576193095, 'stdtst12'),\n",
       " (0.004842342942612765, 'numbcvis'),\n",
       " (0.00439966100781275, 'papplbc2'),\n",
       " (0.004437846218583008, 'pappelec'),\n",
       " (0.005725980591540503, 'rwant'),\n",
       " (0.0059484588815127415, 'pwant'),\n",
       " (0.005332008335859562, 'hlpprg'),\n",
       " (0.005561944793632145, 'prgvisit'),\n",
       " (0.005743806819574981, 'hlpmc'),\n",
       " (0.006676017274232948, 'duchfreq'),\n",
       " (0.005813100370292146, 'pid'),\n",
       " (0.005690158662538858, 'diabetes'),\n",
       " (0.005531743131489408, 'ovacyst'),\n",
       " (0.006038196805351448, 'uf'),\n",
       " (0.005934224089218509, 'endo'),\n",
       " (0.005440537772177567, 'ovuprob'),\n",
       " (0.0061251320208634, 'limited'),\n",
       " (0.005541214057692367, 'equipmnt'),\n",
       " (0.005981350416892073, 'donbld85'),\n",
       " (0.005436612794017082, 'hivtest'),\n",
       " (0.00447797159580221, 'cmhivtst'),\n",
       " (0.004355089528519485, 'plchiv'),\n",
       " (0.004314612061914858, 'hivtst'),\n",
       " (0.00561821377110816, 'talkdoct'),\n",
       " (0.005329810046365013, 'retrovir'),\n",
       " (0.006663672692549416, 'cover12'),\n",
       " (0.006693078733038593, 'coverhow01'),\n",
       " (0.0060500431013753575, 'sameadd'),\n",
       " (0.0053296353237825, 'brnout_r'),\n",
       " (0.014003795578114597, 'paydu'),\n",
       " (0.005407180634571018, 'relraisd'),\n",
       " (0.005370356923925956, 'relcurr'),\n",
       " (0.005655636115156848, 'fundam'),\n",
       " (0.005504988604336347, 'reldlife'),\n",
       " (0.005867952875404425, 'attndnow'),\n",
       " (0.005933481080438563, 'evwrk6mo'),\n",
       " (0.0063049738571671066, 'cmbfstwk'),\n",
       " (0.005644145001417411, 'evrntwrk'),\n",
       " (0.005328150312206126, 'wrk12mos'),\n",
       " (0.007178560512177579, 'fpt12mos'),\n",
       " (0.006445359844623688, 'dolastwk1'),\n",
       " (0.004018698456840442, 'dolastwk2'),\n",
       " (0.00521674319332921, 'dolastwk3'),\n",
       " (0.0062574600046569895, 'rwrkst'),\n",
       " (0.005644484119356141, 'everwork'),\n",
       " (0.005177779293875973, 'rnumjob'),\n",
       " (0.005293304160078116, 'rftptx'),\n",
       " (0.0059898508613943635, 'rearnty'),\n",
       " (0.007212528591326373, 'splstwk1'),\n",
       " (0.008528896910858341, 'spwrkst'),\n",
       " (0.005713999272682013, 'spnumjob'),\n",
       " (0.006751568523647111, 'spftptx'),\n",
       " (0.005877030054991406, 'spearnty'),\n",
       " (0.004886896120925077, 'chcarany'),\n",
       " (0.005366116602676385, 'better'),\n",
       " (0.005356197123854156, 'staytog'),\n",
       " (0.00532884602247885, 'samesex'),\n",
       " (0.0053541143477805475, 'anyact'),\n",
       " (0.0054050041417265104, 'sxok18'),\n",
       " (0.005997965901475055, 'sxok16'),\n",
       " (0.005984938037659759, 'chreward'),\n",
       " (0.00597338721593621, 'chsuppor'),\n",
       " (0.005352012775372117, 'gayadopt'),\n",
       " (0.005785901183284814, 'okcohab'),\n",
       " (0.005330226512754721, 'warm'),\n",
       " (0.005334826170625084, 'achieve'),\n",
       " (0.006122955918854811, 'family'),\n",
       " (0.005413165880377879, 'acasilang'),\n",
       " (0.007834205022225316, 'wage'),\n",
       " (0.00632142355555132, 'selfinc'),\n",
       " (0.006237964556073061, 'socsec'),\n",
       " (0.005330070889624117, 'disabil'),\n",
       " (0.005337980306478363, 'retire'),\n",
       " (0.008398675180815607, 'ssi'),\n",
       " (0.006015715048611758, 'unemp'),\n",
       " (0.005327537441129571, 'chldsupp'),\n",
       " (0.009477120535617223, 'interest'),\n",
       " (0.007333419685710885, 'dividend'),\n",
       " (0.005327382416789428, 'othinc'),\n",
       " (0.006775124578427327, 'toincwmy'),\n",
       " (0.005352217186313735, 'totinc'),\n",
       " (0.0064238214634630975, 'pubasst'),\n",
       " (0.008558622074178679, 'foodstmp'),\n",
       " (0.006637757779645814, 'wic'),\n",
       " (0.005378525056025318, 'hlptrans'),\n",
       " (0.005920897262712499, 'hlpchldc'),\n",
       " (0.005478955696682664, 'hlpjob'),\n",
       " (0.0055903980552193255, 'ager_r'),\n",
       " (0.009944942659110723, 'fmarital_r'),\n",
       " (0.006450913803300651, 'educat_r'),\n",
       " (0.0066919868225499, 'hieduc_r'),\n",
       " (0.005351273101023457, 'hispanic_r'),\n",
       " (0.016199503586253106, 'race_r'),\n",
       " (0.01123834930203138, 'hisprace_r'),\n",
       " (0.005329986431589662, 'numkdhh'),\n",
       " (0.005394528433483536, 'numfmhh'),\n",
       " (0.006182128607901127, 'intctfam'),\n",
       " (0.005525721076047874, 'parage14'),\n",
       " (0.005473746177716121, 'educmom'),\n",
       " (0.0058533592637336485, 'agemomb1'),\n",
       " (0.005415854205569115, 'hieduc_i_r'),\n",
       " (0.005327282505070974, 'hispanic_i_r'),\n",
       " (0.005329835819694262, 'parage14_i'),\n",
       " (0.006383228212261338, 'educmom_i'),\n",
       " (0.006160171070091369, 'agemomb1_i'),\n",
       " (0.005415425347505165, 'rcurpreg_r'),\n",
       " (0.0060378317082542265, 'pregnum_r'),\n",
       " (0.0060943638898968144, 'compreg'),\n",
       " (0.005630995365189961, 'lossnum'),\n",
       " (0.005522608888729241, 'abortion'),\n",
       " (0.0056503224424951926, 'lbpregs'),\n",
       " (0.00650372032144908, 'parity_r'),\n",
       " (0.005417941356685496, 'births5'),\n",
       " (0.005764362444789728, 'outcom01'),\n",
       " (0.005931010998080022, 'outcom02'),\n",
       " (0.006122853755040514, 'outcom03'),\n",
       " (0.005419025041086045, 'datend01'),\n",
       " (0.006029071803283159, 'datend02'),\n",
       " (0.006108693948040367, 'datend03'),\n",
       " (0.007278279163181578, 'ageprg01'),\n",
       " (0.008535289365551813, 'ageprg02'),\n",
       " (0.009220965964321981, 'ageprg03'),\n",
       " (0.005389678393071362, 'datcon01'),\n",
       " (0.006145518207557932, 'datcon02'),\n",
       " (0.0062167895640961035, 'datcon03'),\n",
       " (0.007067187542220688, 'agecon01'),\n",
       " (0.008474519661758162, 'agecon02'),\n",
       " (0.00889389736970092, 'agecon03'),\n",
       " (0.011269357246806444, 'marout01'),\n",
       " (0.010141720907287155, 'marout02'),\n",
       " (0.011807801994375033, 'marout03'),\n",
       " (0.011407737138640073, 'rmarout01'),\n",
       " (0.010546913206564978, 'rmarout02'),\n",
       " (0.01343006646571343, 'rmarout03'),\n",
       " (0.009234871431322733, 'marcon01'),\n",
       " (0.01048140179553414, 'marcon02'),\n",
       " (0.011752599354395321, 'marcon03'),\n",
       " (0.011437770919637269, 'cebow'),\n",
       " (0.007442478133716124, 'cebowc'),\n",
       " (0.005331372450776084, 'datbaby1'),\n",
       " (0.006431931319703765, 'agebaby1'),\n",
       " (0.006043928783913133, 'liv1chld'),\n",
       " (0.005662161985408476, 'lossnum_i'),\n",
       " (0.005662161985408476, 'abortion_i'),\n",
       " (0.005662161985408476, 'lbpregs_i'),\n",
       " (0.005662161985408476, 'parity_i_r'),\n",
       " (0.005662161985408476, 'births5_i'),\n",
       " (0.005917896845371695, 'outcom02_i'),\n",
       " (0.005328062662569244, 'outcom03_i'),\n",
       " (0.005673431831919484, 'outcom04_i'),\n",
       " (0.005512637793773423, 'outcom05_i'),\n",
       " (0.005331053251030227, 'outcom06_i'),\n",
       " (0.005336032874002861, 'outcom07_i'),\n",
       " (0.005431343175452796, 'outcom08_i'),\n",
       " (0.005365193139261981, 'outcom09_i'),\n",
       " (0.005548660472480371, 'outcom10_i'),\n",
       " (0.005517931754874139, 'datend01_i'),\n",
       " (0.005690934538439829, 'datend02_i'),\n",
       " (0.005332715022836387, 'datend03_i'),\n",
       " (0.005328883282733954, 'datend04_i'),\n",
       " (0.005521053140913668, 'datend05_i'),\n",
       " (0.00561428828090571, 'datend06_i'),\n",
       " (0.005663078981643865, 'datend07_i'),\n",
       " (0.005858198017675953, 'datend08_i'),\n",
       " (0.005795817124290448, 'datend09_i'),\n",
       " (0.005548660472480371, 'datend10_i'),\n",
       " (0.005537193660306361, 'datend12_i'),\n",
       " (0.005537193660306361, 'datend13_i'),\n",
       " (0.005572507513472824, 'ageprg01_i'),\n",
       " (0.006081858146695152, 'ageprg02_i'),\n",
       " (0.005418070111408713, 'ageprg03_i'),\n",
       " (0.0053816621737194925, 'ageprg04_i'),\n",
       " (0.005613913002136761, 'ageprg05_i'),\n",
       " (0.005489445784257141, 'ageprg06_i'),\n",
       " (0.005564320884191343, 'ageprg07_i'),\n",
       " (0.005806143379294859, 'ageprg08_i'),\n",
       " (0.005795817124290448, 'ageprg09_i'),\n",
       " (0.005548660472480371, 'ageprg10_i'),\n",
       " (0.005537193660306361, 'ageprg12_i'),\n",
       " (0.005537193660306361, 'ageprg13_i'),\n",
       " (0.005405602200151627, 'datcon01_i'),\n",
       " (0.005690934538439829, 'datcon02_i'),\n",
       " (0.005332715022836387, 'datcon03_i'),\n",
       " (0.005328593623978639, 'datcon04_i'),\n",
       " (0.005498332824140473, 'datcon05_i'),\n",
       " (0.00561428828090571, 'datcon06_i'),\n",
       " (0.005591773203161621, 'datcon07_i'),\n",
       " (0.005858198017675953, 'datcon08_i'),\n",
       " (0.005795817124290448, 'datcon09_i'),\n",
       " (0.005548660472480371, 'datcon10_i'),\n",
       " (0.005537193660306361, 'datcon12_i'),\n",
       " (0.005537193660306361, 'datcon13_i'),\n",
       " (0.005464459762510643, 'agecon01_i'),\n",
       " (0.005953867386222278, 'agecon02_i'),\n",
       " (0.005440344671912678, 'agecon03_i'),\n",
       " (0.005462357369145465, 'agecon04_i'),\n",
       " (0.005585209645216915, 'agecon05_i'),\n",
       " (0.005565046255331274, 'agecon06_i'),\n",
       " (0.005690797025869498, 'agecon07_i'),\n",
       " (0.005858198017675953, 'agecon08_i'),\n",
       " (0.005795817124290448, 'agecon09_i'),\n",
       " (0.005548660472480371, 'agecon10_i'),\n",
       " (0.005537193660306361, 'agecon12_i'),\n",
       " (0.005537193660306361, 'agecon13_i'),\n",
       " (0.005495846986551367, 'marout01_i'),\n",
       " (0.005617451126696205, 'marout02_i'),\n",
       " (0.0057675316941988575, 'marout03_i'),\n",
       " (0.006058555904314256, 'marout04_i'),\n",
       " (0.007831854171707842, 'marout05_i'),\n",
       " (0.006525666016629406, 'marout06_i'),\n",
       " (0.007242086468897568, 'marout07_i'),\n",
       " (0.006008832051372925, 'marout08_i'),\n",
       " (0.005363835534778705, 'marout09_i'),\n",
       " (0.005434828901757727, 'marout10_i'),\n",
       " (0.005588512919647237, 'marout11_i'),\n",
       " (0.005427474035637592, 'rmarout01_i'),\n",
       " (0.005678427980718048, 'rmarout02_i'),\n",
       " (0.005498637325022315, 'rmarout03_i'),\n",
       " (0.005526063051171093, 'rmarout04_i'),\n",
       " (0.006880617443338566, 'rmarout05_i'),\n",
       " (0.006242519070738695, 'rmarout06_i'),\n",
       " (0.006539245819281447, 'rmarout07_i'),\n",
       " (0.006008832051372925, 'rmarout08_i'),\n",
       " (0.005363835534778705, 'rmarout09_i'),\n",
       " (0.005434828901757727, 'rmarout10_i'),\n",
       " (0.005588512919647237, 'rmarout11_i'),\n",
       " (0.00561503495200022, 'marcon01_i'),\n",
       " (0.005611341957289406, 'marcon02_i'),\n",
       " (0.006181472870089744, 'marcon03_i'),\n",
       " (0.005767269267942132, 'marcon04_i'),\n",
       " (0.007880198703264507, 'marcon05_i'),\n",
       " (0.00680621180187746, 'marcon06_i'),\n",
       " (0.006785317780935385, 'marcon07_i'),\n",
       " (0.005977443508035418, 'marcon08_i'),\n",
       " (0.005345000467353644, 'marcon09_i'),\n",
       " (0.005434828901757727, 'marcon10_i'),\n",
       " (0.005588512919647237, 'marcon11_i'),\n",
       " (0.005662161985408476, 'cebow_i'),\n",
       " (0.005662161985408476, 'cebowc_i'),\n",
       " (0.005794281025144787, 'datbaby1_i'),\n",
       " (0.005943478567578153, 'agebaby1_i'),\n",
       " (0.005662161985408476, 'liv1chld_i'),\n",
       " (0.008267774071422429, 'rmarital_r'),\n",
       " (0.0062973555918356405, 'fmarno'),\n",
       " (0.006812670300484491, 'mardat01'),\n",
       " (0.0065210569478660885, 'fmar1age'),\n",
       " (0.0109615635907514, 'mar1diss'),\n",
       " (0.010112030016853457, 'mar1bir1'),\n",
       " (0.009060360846987803, 'mar1con1'),\n",
       " (0.008161692294398337, 'con1mar1'),\n",
       " (0.010059124870520297, 'b1premar'),\n",
       " (0.007255508514300568, 'cohever'),\n",
       " (0.006194957970746207, 'evmarcoh'),\n",
       " (0.00271939367881302, 'cohab1'),\n",
       " (0.0066280015790392, 'cohstat'),\n",
       " (0.0031380350118120903, 'cohout'),\n",
       " (0.005420479710673054, 'coh1dur'),\n",
       " (0.005336364971175067, 'sexever'),\n",
       " (0.005944534641656896, 'vry1stag'),\n",
       " (0.00614261333826438, 'sex1age'),\n",
       " (0.005331809873838411, 'vry1stsx'),\n",
       " (0.005308860111875258, 'datesex1'),\n",
       " (0.005353980891277033, 'sexonce'),\n",
       " (0.005355525286814045, 'fsexpage'),\n",
       " (0.00673129712673215, 'sexmar'),\n",
       " (0.0065143616191194464, 'sex1for'),\n",
       " (0.005443436227038578, 'parts1yr'),\n",
       " (0.0058559384438559015, 'lsexdate'),\n",
       " (0.005966988984321908, 'lsexrage'),\n",
       " (0.0063365994925745905, 'lifprtnr'),\n",
       " (0.005707996133960003, 'fmarno_i'),\n",
       " (0.005413354626453648, 'mardat01_i'),\n",
       " (0.005347124675960324, 'mardat02_i'),\n",
       " (0.005977057012740761, 'mardis01_i'),\n",
       " (0.0054115975647813785, 'mardis02_i'),\n",
       " (0.005421911068140828, 'mardis03_i'),\n",
       " (0.005404865865728525, 'mardis04_i'),\n",
       " (0.005455649152207198, 'mardis05_i'),\n",
       " (0.00556056629225854, 'marend01_i'),\n",
       " (0.005343277856898254, 'marend02_i'),\n",
       " (0.005329489935103182, 'marend03_i'),\n",
       " (0.005344829446110255, 'marend04_i'),\n",
       " (0.005466515793662197, 'fmar1age_i'),\n",
       " (0.006229712679169608, 'agediss1_i'),\n",
       " (0.005818093852332562, 'agedd1_i'),\n",
       " (0.005952534717923896, 'mar1diss_i'),\n",
       " (0.005562511474124343, 'dd1remar_i'),\n",
       " (0.0056514514657068915, 'mar1bir1_i'),\n",
       " (0.005519846049074295, 'mar1con1_i'),\n",
       " (0.005451844769302938, 'con1mar1_i'),\n",
       " (0.005682316811192689, 'b1premar_i'),\n",
       " (0.006692214376486705, 'cohab1_i'),\n",
       " (0.005675247994851085, 'cohstat_i'),\n",
       " (0.005512329697066498, 'cohout_i'),\n",
       " (0.006334563438736507, 'coh1dur_i'),\n",
       " (0.006043991819789318, 'sexever_i'),\n",
       " (0.0053941533797434715, 'vry1stag_i'),\n",
       " (0.005426725797249454, 'sex1age_i'),\n",
       " (0.005541139244667259, 'vry1stsx_i'),\n",
       " (0.0056974344065520155, 'datesex1_i'),\n",
       " (0.005352361370883685, 'fsexpage_i'),\n",
       " (0.0054768804615887845, 'sexmar_i'),\n",
       " (0.0055561700178216045, 'sex1for_i'),\n",
       " (0.0053912178048813875, 'parts1yr_i'),\n",
       " (0.005329254109874948, 'lsexdate_i'),\n",
       " (0.005436061430343253, 'lsexrage_i'),\n",
       " (0.005582651346653922, 'lifprtnr_i'),\n",
       " (0.005600500611765757, 'strloper'),\n",
       " (0.005363580062538564, 'tubs'),\n",
       " (0.006426387530719002, 'vasect'),\n",
       " (0.0055320684576875, 'hyst'),\n",
       " (0.0053307160638501605, 'ovarect'),\n",
       " (0.005518449429605998, 'othr'),\n",
       " (0.005817702615502629, 'othrm'),\n",
       " (0.005480431084785797, 'fecund'),\n",
       " (0.006048959480912219, 'anybc36'),\n",
       " (0.006011414881141541, 'nosex36'),\n",
       " (0.006212312054061253, 'infert'),\n",
       " (0.005921888449095469, 'anybc12'),\n",
       " (0.005328912963906918, 'anymthd'),\n",
       " (0.005580860850871838, 'nosex12'),\n",
       " (0.005549719784021412, 'sexp3mo'),\n",
       " (0.005984026261863229, 'sex3mo'),\n",
       " (0.006202060748728977, 'constat1'),\n",
       " (0.005357513743306619, 'constat2'),\n",
       " (0.005370972720360245, 'constat3'),\n",
       " (0.005465488617125702, 'constat4'),\n",
       " (0.006346969014618287, 'pillr'),\n",
       " (0.005464121795870747, 'condomr'),\n",
       " (0.005455089604600505, 'sex1mthd1'),\n",
       " (0.004003403776326686, 'sex1mthd2'),\n",
       " (0.006175243601095892, 'mthuse12'),\n",
       " (0.0072929966330781415, 'meth12m1'),\n",
       " (0.006372415393107289, 'mthuse3'),\n",
       " (0.006873907333843077, 'meth3m1'),\n",
       " (0.005951079009971272, 'nump3mos'),\n",
       " (0.005765060026368007, 'fmethod1'),\n",
       " (0.006312363569584423, 'dateuse1'),\n",
       " (0.005473240225484011, 'oldwp01'),\n",
       " (0.0061220725634859585, 'oldwp02'),\n",
       " (0.006566679752625704, 'oldwp03'),\n",
       " (0.0062996315229642, 'oldwr01'),\n",
       " (0.007236400760750383, 'oldwr02'),\n",
       " (0.009306831251880698, 'oldwr03'),\n",
       " (0.006285282977210982, 'wantrp01'),\n",
       " (0.007236400760750383, 'wantrp02'),\n",
       " (0.009306831251880698, 'wantrp03'),\n",
       " (0.0054865812407531855, 'wantp01'),\n",
       " (0.0060748063719646694, 'wantp02'),\n",
       " (0.00661996064952286, 'wantp03'),\n",
       " (0.00548528475124066, 'wantp5'),\n",
       " (0.005448957689459855, 'infert_i'),\n",
       " (0.0067450133507920285, 'nosex12_i'),\n",
       " (0.005480667731327604, 'sexp3mo_i'),\n",
       " (0.0053272936863250075, 'sex3mo_i'),\n",
       " (0.005829898683175627, 'constat1_i'),\n",
       " (0.0054638333403213, 'constat2_i'),\n",
       " (0.005562510845659396, 'constat3_i'),\n",
       " (0.005562510845659396, 'constat4_i'),\n",
       " (0.00532790310884812, 'pillr_i'),\n",
       " (0.00532790310884812, 'condomr_i'),\n",
       " (0.005480538682948177, 'sex1mthd1_i'),\n",
       " (0.005618082757721465, 'sex1mthd2_i'),\n",
       " (0.005618082757721465, 'sex1mthd3_i'),\n",
       " (0.005618082757721465, 'sex1mthd4_i'),\n",
       " (0.0053273478704319865, 'mthuse12_i'),\n",
       " (0.0053352895922493815, 'meth12m1_i'),\n",
       " (0.005328505216348645, 'meth12m2_i'),\n",
       " (0.00532850671281937, 'meth12m3_i'),\n",
       " (0.00532850671281937, 'meth12m4_i'),\n",
       " (0.005352822010079472, 'mthuse3_i'),\n",
       " (0.005374562641145442, 'meth3m1_i'),\n",
       " (0.005346414776090325, 'meth3m2_i'),\n",
       " (0.005338022356888628, 'meth3m3_i'),\n",
       " (0.005338022356888628, 'meth3m4_i'),\n",
       " (0.0062734191728233135, 'nump3mos_i'),\n",
       " (0.005378877823191908, 'fmethod1_i'),\n",
       " (0.006076376776896986, 'dateuse1_i'),\n",
       " (0.005368456842452463, 'sourcem1_i'),\n",
       " (0.00534327273291324, 'sourcem2_i'),\n",
       " (0.005333841105334969, 'sourcem3_i'),\n",
       " (0.005333841105334969, 'sourcem4_i'),\n",
       " (0.005665470931053074, 'oldwp01_i'),\n",
       " (0.005532484672147064, 'oldwp02_i'),\n",
       " (0.006038794473755105, 'oldwp03_i'),\n",
       " (0.005995955760873195, 'oldwp04_i'),\n",
       " (0.005457954823496536, 'oldwp05_i'),\n",
       " (0.005401403098434621, 'oldwp06_i'),\n",
       " (0.0054589299665898094, 'oldwp07_i'),\n",
       " (0.0057501139160012205, 'oldwp08_i'),\n",
       " (0.005364106218509024, 'oldwr01_i'),\n",
       " (0.005490172335675836, 'oldwr02_i'),\n",
       " (0.005363893387949736, 'oldwr03_i'),\n",
       " (0.005640306695205988, 'oldwr04_i'),\n",
       " (0.0053813426061590786, 'oldwr05_i'),\n",
       " (0.005396951687654639, 'oldwr06_i'),\n",
       " (0.0055298718389822366, 'oldwr07_i'),\n",
       " (0.005587196021343832, 'oldwr08_i'),\n",
       " (0.005365193139261981, 'oldwr09_i'),\n",
       " (0.005364106218509024, 'wantrp01_i'),\n",
       " (0.005490172335675836, 'wantrp02_i'),\n",
       " (0.005363893387949736, 'wantrp03_i'),\n",
       " (0.005640306695205988, 'wantrp04_i'),\n",
       " (0.0053813426061590786, 'wantrp05_i'),\n",
       " (0.005396951687654639, 'wantrp06_i'),\n",
       " (0.0055298718389822366, 'wantrp07_i'),\n",
       " (0.005587196021343832, 'wantrp08_i'),\n",
       " (0.005365193139261981, 'wantrp09_i'),\n",
       " (0.0054993860861720645, 'wantp01_i'),\n",
       " (0.005741447585054793, 'wantp02_i'),\n",
       " (0.005728830977663635, 'wantp03_i'),\n",
       " (0.006172654291036195, 'wantp04_i'),\n",
       " (0.005648956835502261, 'wantp05_i'),\n",
       " (0.00538592984977615, 'wantp06_i'),\n",
       " (0.005629613043803383, 'wantp07_i'),\n",
       " (0.005779919154650259, 'wantp08_i'),\n",
       " (0.006596791609596697, 'wantp5_i'),\n",
       " (0.00535023906134624, 'fptit12_i'),\n",
       " (0.0054279461800811335, 'fptitmed_i'),\n",
       " (0.005346447844715718, 'fpregmed_i'),\n",
       " (0.0055473444255559334, 'r_stclin'),\n",
       " (0.005492834434038474, 'intent'),\n",
       " (0.00532913283383607, 'addexp'),\n",
       " (0.005959748183472446, 'intent_i'),\n",
       " (0.005862368582374766, 'addexp_i'),\n",
       " (0.005328595910998213, 'anyprghp'),\n",
       " (0.005834911540965271, 'anymschp'),\n",
       " (0.005608926827060601, 'infever'),\n",
       " (0.0058056953191234495, 'pidtreat'),\n",
       " (0.00532728544675054, 'evhivtst'),\n",
       " (0.005490881370955658, 'anyprghp_i'),\n",
       " (0.005696589912781214, 'anymschp_i'),\n",
       " (0.0055920576323432725, 'infever_i'),\n",
       " (0.005490881370955658, 'ovulate_i'),\n",
       " (0.005490881370955658, 'tubes_i'),\n",
       " (0.005490881370955658, 'infertr_i'),\n",
       " (0.005490881370955658, 'inferth_i'),\n",
       " (0.005490881370955658, 'advice_i'),\n",
       " (0.005490881370955658, 'insem_i'),\n",
       " (0.005490881370955658, 'invitro_i'),\n",
       " (0.005490881370955658, 'endomet_i'),\n",
       " (0.005490881370955658, 'fibroids_i'),\n",
       " (0.0053273436347913705, 'pidtreat_i'),\n",
       " (0.005384796955323012, 'evhivtst_i'),\n",
       " (0.005444228863618061, 'insuranc_r'),\n",
       " (0.005908687699079596, 'metro_r'),\n",
       " (0.005476246226178816, 'religion_r'),\n",
       " (0.006124250620027971, 'laborfor_r'),\n",
       " (0.005490192077694744, 'insuranc_i_r'),\n",
       " (0.005329126750794222, 'religion_i_r'),\n",
       " (0.005635393818939516, 'laborfor_i_r'),\n",
       " (0.009743158975296873, 'poverty_r'),\n",
       " (0.011870069031173158, 'totincr'),\n",
       " (0.00988503292074805, 'pubassis_r'),\n",
       " (0.005674668721373788, 'poverty_i_r'),\n",
       " (0.005674668721373788, 'totincr_i'),\n",
       " (0.005588263662201554, 'pubassis_i_r'),\n",
       " (0.007266083159805259, 'basewgt_r'),\n",
       " (0.006863344757269019, 'adj_mod_basewgt_r'),\n",
       " (0.007414601906967189, 'finalwgt_r'),\n",
       " (0.006008491880136968, 'secu_r'),\n",
       " (0.00540529186826777, 'sest_r'),\n",
       " (0.005425914889651273, 'cmintvw_r'),\n",
       " (0.005425914889651051, 'cmlstyr'),\n",
       " (0.005823670091816058, 'intvlngth')]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "variables = GoMining(join)\n",
    "variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following functions report the variables with the highest values of $R^2$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def ReadVariables():\n",
    "    \"\"\"Reads Stata dictionary files for NSFG data.\n",
    "\n",
    "    returns: DataFrame that maps variables names to descriptions\n",
    "    \"\"\"\n",
    "    vars1 = thinkstats2.ReadStataDct('2002FemPreg.dct').variables\n",
    "    vars2 = thinkstats2.ReadStataDct('2002FemResp.dct').variables\n",
    "\n",
    "    all_vars = pd.concat([vars1, vars2])\n",
    "    all_vars.index = all_vars.name\n",
    "    return all_vars\n",
    "\n",
    "def MiningReport(variables, n=30):\n",
    "    \"\"\"Prints variables with the highest R^2.\n",
    "\n",
    "    t: list of (R^2, variable name) pairs\n",
    "    n: number of pairs to print\n",
    "    \"\"\"\n",
    "    all_vars = ReadVariables()\n",
    "\n",
    "    variables.sort(reverse=True)\n",
    "    for r2, name in variables[:n]:\n",
    "        key = re.sub('_r$', '', name)\n",
    "        try:\n",
    "            desc = all_vars.loc[key].desc\n",
    "            if isinstance(desc, pd.Series):\n",
    "                desc = desc[0]\n",
    "            print(name, r2, desc)\n",
    "        except (KeyError, IndexError):\n",
    "            print(name, r2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the variables that do well are not useful for prediction because they are not known ahead of time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "totalwgt_lb 1.0\n",
      "birthwgt_lb 0.9498127305978009 BD-3 BIRTHWEIGHT IN POUNDS - 1ST BABY FROM THIS PREGNANCY\n",
      "lbw1 0.3008240784470769 LOW BIRTHWEIGHT - BABY 1\n",
      "prglngth 0.13012519488625085 DURATION OF COMPLETED PREGNANCY IN WEEKS\n",
      "wksgest 0.12340041363361076 GESTATIONAL LENGTH OF COMPLETED PREGNANCY (IN WEEKS)\n",
      "agecon 0.10203149928156052 AGE AT TIME OF CONCEPTION\n",
      "mosgest 0.02714427463957958 GESTATIONAL LENGTH OF COMPLETED PREGNANCY (IN MONTHS)\n",
      "babysex 0.018550925293942533 BD-2 SEX OF 1ST LIVEBORN BABY FROM THIS PREGNANCY\n",
      "race_r 0.016199503586253106 RACE\n",
      "race 0.016199503586253106 RACE\n",
      "nbrnaliv 0.016017752709788113 BC-2 NUMBER OF BABIES BORN ALIVE FROM THIS PREGNANCY\n",
      "paydu 0.014003795578114597 IB-10 CURRENT LIVING QUARTERS OWNED/RENTED, ETC\n",
      "rmarout03 0.01343006646571343 INFORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 3RD\n",
      "birthwgt_oz 0.013102457615706498 BD-3 BIRTHWEIGHT IN OUNCES - 1ST BABY FROM THIS PREGNANCY\n",
      "anynurse 0.012529022541810653 BH-1 WHETHER R BREASTFED THIS CHILD AT ALL - 1ST FROM THIS PREG\n",
      "bfeedwks 0.012193688404495417 DURATION OF BREASTFEEDING IN WEEKS\n",
      "totincr 0.011870069031173158 TOTAL INCOME OF R'S FAMILY\n",
      "marout03 0.011807801994375033 FORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 3RD\n",
      "marcon03 0.011752599354395321 FORMAL MARITAL STATUS WHEN PREGNANCY BEGAN - 3RD\n",
      "cebow 0.011437770919637269 NUMBER OF CHILDREN BORN OUT OF WEDLOCK\n",
      "rmarout01 0.011407737138640073 INFORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 1ST\n",
      "rmarout6 0.011354138472805753 INFORMAL MARITAL STATUS AT PREGNANCY OUTCOME - 6 CATEGORIES\n",
      "marout01 0.011269357246806444 FORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 1ST\n",
      "hisprace_r 0.01123834930203138 RACE AND HISPANIC ORIGIN\n",
      "hisprace 0.01123834930203138 RACE AND HISPANIC ORIGIN\n",
      "mar1diss 0.0109615635907514 MONTHS BTW/1ST MARRIAGE & DISSOLUTION (OR INTERVIEW)\n",
      "fmarcon5 0.0106049646842995 FORMAL MARITAL STATUS AT CONCEPTION - 5 CATEGORIES\n",
      "rmarout02 0.010546913206564978 INFORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 2ND\n",
      "marcon02 0.01048140179553414 FORMAL MARITAL STATUS WHEN PREGNANCY BEGAN - 2ND\n",
      "fmarout5 0.010461691367377068 FORMAL MARITAL STATUS AT PREGNANCY OUTCOME\n"
     ]
    }
   ],
   "source": [
    "MiningReport(variables)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combining the variables that seem to have the most explanatory power."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>OLS Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>       <td>totalwgt_lb</td>   <th>  R-squared:         </th> <td>   0.060</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th> <td>   0.059</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th> <td>   79.98</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>             <td>Sun, 09 Apr 2023</td> <th>  Prob (F-statistic):</th> <td>4.86e-113</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                 <td>09:52:11</td>     <th>  Log-Likelihood:    </th> <td> -14295.</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>No. Observations:</th>      <td>  8781</td>      <th>  AIC:               </th> <td>2.861e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Residuals:</th>          <td>  8773</td>      <th>  BIC:               </th> <td>2.866e+04</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Df Model:</th>              <td>     7</td>      <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>     <td> </td>    \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "            <td></td>              <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>            <td>    6.6303</td> <td>    0.065</td> <td>  102.223</td> <td> 0.000</td> <td>    6.503</td> <td>    6.757</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>C(race)[T.2]</th>         <td>    0.3570</td> <td>    0.032</td> <td>   11.215</td> <td> 0.000</td> <td>    0.295</td> <td>    0.419</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>C(race)[T.3]</th>         <td>    0.2665</td> <td>    0.051</td> <td>    5.175</td> <td> 0.000</td> <td>    0.166</td> <td>    0.367</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>babysex == 1[T.True]</th> <td>    0.2952</td> <td>    0.026</td> <td>   11.216</td> <td> 0.000</td> <td>    0.244</td> <td>    0.347</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>nbrnaliv > 1[T.True]</th> <td>   -1.3783</td> <td>    0.108</td> <td>  -12.771</td> <td> 0.000</td> <td>   -1.590</td> <td>   -1.167</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>paydu == 1[T.True]</th>   <td>    0.1196</td> <td>    0.031</td> <td>    3.861</td> <td> 0.000</td> <td>    0.059</td> <td>    0.180</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>agepreg</th>              <td>    0.0074</td> <td>    0.003</td> <td>    2.921</td> <td> 0.004</td> <td>    0.002</td> <td>    0.012</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>totincr</th>              <td>    0.0122</td> <td>    0.004</td> <td>    3.110</td> <td> 0.002</td> <td>    0.005</td> <td>    0.020</td>\n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "  <th>Omnibus:</th>       <td>398.813</td> <th>  Durbin-Watson:     </th> <td>   1.604</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Prob(Omnibus):</th> <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>1388.362</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Skew:</th>          <td>-0.037</td>  <th>  Prob(JB):          </th> <td>3.32e-302</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Kurtosis:</th>      <td> 4.947</td>  <th>  Cond. No.          </th> <td>    221.</td> \n",
       "</tr>\n",
       "</table><br/><br/>Notes:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                            OLS Regression Results                            \n",
       "==============================================================================\n",
       "Dep. Variable:            totalwgt_lb   R-squared:                       0.060\n",
       "Model:                            OLS   Adj. R-squared:                  0.059\n",
       "Method:                 Least Squares   F-statistic:                     79.98\n",
       "Date:                Sun, 09 Apr 2023   Prob (F-statistic):          4.86e-113\n",
       "Time:                        09:52:11   Log-Likelihood:                -14295.\n",
       "No. Observations:                8781   AIC:                         2.861e+04\n",
       "Df Residuals:                    8773   BIC:                         2.866e+04\n",
       "Df Model:                           7                                         \n",
       "Covariance Type:            nonrobust                                         \n",
       "========================================================================================\n",
       "                           coef    std err          t      P>|t|      [0.025      0.975]\n",
       "----------------------------------------------------------------------------------------\n",
       "Intercept                6.6303      0.065    102.223      0.000       6.503       6.757\n",
       "C(race)[T.2]             0.3570      0.032     11.215      0.000       0.295       0.419\n",
       "C(race)[T.3]             0.2665      0.051      5.175      0.000       0.166       0.367\n",
       "babysex == 1[T.True]     0.2952      0.026     11.216      0.000       0.244       0.347\n",
       "nbrnaliv > 1[T.True]    -1.3783      0.108    -12.771      0.000      -1.590      -1.167\n",
       "paydu == 1[T.True]       0.1196      0.031      3.861      0.000       0.059       0.180\n",
       "agepreg                  0.0074      0.003      2.921      0.004       0.002       0.012\n",
       "totincr                  0.0122      0.004      3.110      0.002       0.005       0.020\n",
       "==============================================================================\n",
       "Omnibus:                      398.813   Durbin-Watson:                   1.604\n",
       "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1388.362\n",
       "Skew:                          -0.037   Prob(JB):                    3.32e-302\n",
       "Kurtosis:                       4.947   Cond. No.                         221.\n",
       "==============================================================================\n",
       "\n",
       "Notes:\n",
       "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
       "\"\"\""
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + '\n",
    "               'nbrnaliv>1 + paydu==1 + totincr')\n",
    "results = smf.ols(formula, data=join).fit()\n",
    "results.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logistic regression\n",
    "\n",
    "Example: suppose we are trying to predict `y` using explanatory variables `x1` and `x2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = np.array([0, 1, 0, 1])\n",
    "x1 = np.array([0, 0, 0, 1])\n",
    "x2 = np.array([0, 1, 1, 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "According to the logit model the log odds for the $i$th element of $y$ is\n",
    "\n",
    "$\\log o = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 $\n",
    "\n",
    "So let's start with an arbitrary guess about the elements of $\\beta$:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "beta = [-1.5, 2.8, 1.1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plugging in the model, we get log odds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-1.5, -0.4, -0.4,  2.4])"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "log_o = beta[0] + beta[1] * x1 + beta[2] * x2\n",
    "log_o"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which we can convert to odds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.22313016,  0.67032005,  0.67032005, 11.02317638])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "o = np.exp(log_o)\n",
    "o"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then convert to probabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.18242552, 0.40131234, 0.40131234, 0.9168273 ])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p = o / (o+1)\n",
    "p"
   ]
  },
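  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a check on the steps above, the two conversions can be combined into a single expression, the logistic function. This is standard algebra rather than code from the book:\n",
    "\n",
    "$p = \\frac{o}{o + 1} = \\frac{e^{\\log o}}{e^{\\log o} + 1} = \\frac{1}{1 + e^{-\\log o}}$\n",
    "\n",
    "For the last element, $\\log o = 2.4$, so $o = e^{2.4} \\approx 11.023$ and $p \\approx 11.023 / 12.023 \\approx 0.917$, which agrees with the last entry of `p`."
   ]
  },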
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The likelihoods of the actual outcomes are $p$ where $y$ is 1 and $1-p$ where $y$ is 0. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.81757448, 0.40131234, 0.59868766, 0.9168273 ])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "likes = np.where(y, p, 1-p)\n",
    "likes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The likelihood of $y$ given $\\beta$ is the product of `likes`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.1800933529673034"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "like = np.prod(likes)\n",
    "like"
   ]
  },
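  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Written as a formula, with $p_i$ and $y_i$ denoting the $i$th elements of `p` and `y` (notation introduced here for illustration), the product computed above is\n",
    "\n",
    "$L(\\beta) = \\prod_i p_i^{y_i} (1 - p_i)^{1 - y_i}$\n",
    "\n",
    "With the values above, $L(\\beta) \\approx 0.818 \\times 0.401 \\times 0.599 \\times 0.917 \\approx 0.180$. In practice it is common to work with the log-likelihood instead, $\\log L(\\beta) = \\sum_i \\left[ y_i \\log p_i + (1 - y_i) \\log (1 - p_i) \\right]$, because a sum of logs is numerically better behaved than a product of many small probabilities."
   ]
  },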
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Logistic regression works by searching for the values in $\\beta$ that maximize `like`.\n",
    "\n",
    "Here's an example using variables in the NSFG respondent file to predict whether a baby will be a boy or a girl."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "import first\n",
    "live, firsts, others = first.MakeFrames()\n",
    "live = live[live.prglngth>30]\n",
    "live['boy'] = (live.babysex==1).astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mother's age seems to have a small effect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimization terminated successfully.\n",
      "         Current function value: 0.693015\n",
      "         Iterations 3\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>Logit Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>          <td>boy</td>       <th>  No. Observations:  </th>  <td>  8884</td>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                 <td>Logit</td>      <th>  Df Residuals:      </th>  <td>  8882</td>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>                 <td>MLE</td>       <th>  Df Model:          </th>  <td>     1</td>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>            <td>Sun, 09 Apr 2023</td> <th>  Pseudo R-squ.:     </th> <td>6.144e-06</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                <td>09:52:12</td>     <th>  Log-Likelihood:    </th> <td> -6156.7</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>converged:</th>             <td>True</td>       <th>  LL-Null:           </th> <td> -6156.8</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>     <td>nonrobust</td>    <th>  LLR p-value:       </th>  <td>0.7833</td>  \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "      <td></td>         <th>coef</th>     <th>std err</th>      <th>z</th>      <th>P>|z|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th> <td>    0.0058</td> <td>    0.098</td> <td>    0.059</td> <td> 0.953</td> <td>   -0.185</td> <td>    0.197</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>agepreg</th>   <td>    0.0010</td> <td>    0.004</td> <td>    0.275</td> <td> 0.783</td> <td>   -0.006</td> <td>    0.009</td>\n",
       "</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                           Logit Regression Results                           \n",
       "==============================================================================\n",
       "Dep. Variable:                    boy   No. Observations:                 8884\n",
       "Model:                          Logit   Df Residuals:                     8882\n",
       "Method:                           MLE   Df Model:                            1\n",
       "Date:                Sun, 09 Apr 2023   Pseudo R-squ.:               6.144e-06\n",
       "Time:                        09:52:12   Log-Likelihood:                -6156.7\n",
       "converged:                       True   LL-Null:                       -6156.8\n",
       "Covariance Type:            nonrobust   LLR p-value:                    0.7833\n",
       "==============================================================================\n",
       "                 coef    std err          z      P>|z|      [0.025      0.975]\n",
       "------------------------------------------------------------------------------\n",
       "Intercept      0.0058      0.098      0.059      0.953      -0.185       0.197\n",
       "agepreg        0.0010      0.004      0.275      0.783      -0.006       0.009\n",
       "==============================================================================\n",
       "\"\"\""
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = smf.logit('boy ~ agepreg', data=live)\n",
    "results = model.fit()\n",
    "results.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a model that uses the variables that seemed most promising. None of the estimated coefficients is statistically significant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimization terminated successfully.\n",
      "         Current function value: 0.692944\n",
      "         Iterations 3\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<table class=\"simpletable\">\n",
       "<caption>Logit Regression Results</caption>\n",
       "<tr>\n",
       "  <th>Dep. Variable:</th>          <td>boy</td>       <th>  No. Observations:  </th>  <td>  8782</td>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Model:</th>                 <td>Logit</td>      <th>  Df Residuals:      </th>  <td>  8776</td>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Method:</th>                 <td>MLE</td>       <th>  Df Model:          </th>  <td>     5</td>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Date:</th>            <td>Sun, 09 Apr 2023</td> <th>  Pseudo R-squ.:     </th> <td>0.0001440</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Time:</th>                <td>09:52:12</td>     <th>  Log-Likelihood:    </th> <td> -6085.4</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>converged:</th>             <td>True</td>       <th>  LL-Null:           </th> <td> -6086.3</td> \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Covariance Type:</th>     <td>nonrobust</td>    <th>  LLR p-value:       </th>  <td>0.8822</td>  \n",
       "</tr>\n",
       "</table>\n",
       "<table class=\"simpletable\">\n",
       "<tr>\n",
       "        <td></td>          <th>coef</th>     <th>std err</th>      <th>z</th>      <th>P>|z|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
       "</tr>\n",
       "<tr>\n",
       "  <th>Intercept</th>    <td>   -0.0301</td> <td>    0.104</td> <td>   -0.290</td> <td> 0.772</td> <td>   -0.234</td> <td>    0.173</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>C(race)[T.2]</th> <td>   -0.0224</td> <td>    0.051</td> <td>   -0.439</td> <td> 0.660</td> <td>   -0.122</td> <td>    0.077</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>C(race)[T.3]</th> <td>   -0.0005</td> <td>    0.083</td> <td>   -0.005</td> <td> 0.996</td> <td>   -0.163</td> <td>    0.162</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>agepreg</th>      <td>   -0.0027</td> <td>    0.006</td> <td>   -0.484</td> <td> 0.629</td> <td>   -0.014</td> <td>    0.008</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>hpagelb</th>      <td>    0.0047</td> <td>    0.004</td> <td>    1.112</td> <td> 0.266</td> <td>   -0.004</td> <td>    0.013</td>\n",
       "</tr>\n",
       "<tr>\n",
       "  <th>birthord</th>     <td>    0.0050</td> <td>    0.022</td> <td>    0.227</td> <td> 0.821</td> <td>   -0.038</td> <td>    0.048</td>\n",
       "</tr>\n",
       "</table>"
      ],
      "text/plain": [
       "<class 'statsmodels.iolib.summary.Summary'>\n",
       "\"\"\"\n",
       "                           Logit Regression Results                           \n",
       "==============================================================================\n",
       "Dep. Variable:                    boy   No. Observations:                 8782\n",
       "Model:                          Logit   Df Residuals:                     8776\n",
       "Method:                           MLE   Df Model:                            5\n",
       "Date:                Sun, 09 Apr 2023   Pseudo R-squ.:               0.0001440\n",
       "Time:                        09:52:12   Log-Likelihood:                -6085.4\n",
       "converged:                       True   LL-Null:                       -6086.3\n",
       "Covariance Type:            nonrobust   LLR p-value:                    0.8822\n",
       "================================================================================\n",
       "                   coef    std err          z      P>|z|      [0.025      0.975]\n",
       "--------------------------------------------------------------------------------\n",
       "Intercept       -0.0301      0.104     -0.290      0.772      -0.234       0.173\n",
       "C(race)[T.2]    -0.0224      0.051     -0.439      0.660      -0.122       0.077\n",
       "C(race)[T.3]    -0.0005      0.083     -0.005      0.996      -0.163       0.162\n",
       "agepreg         -0.0027      0.006     -0.484      0.629      -0.014       0.008\n",
       "hpagelb          0.0047      0.004      1.112      0.266      -0.004       0.013\n",
       "birthord         0.0050      0.022      0.227      0.821      -0.038       0.048\n",
       "================================================================================\n",
       "\"\"\""
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "formula = 'boy ~ agepreg + hpagelb + birthord + C(race)'\n",
    "model = smf.logit(formula, data=live)\n",
    "results = model.fit()\n",
    "results.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To evaluate the model's predictions, we first extract the endogenous (dependent) and exogenous (explanatory) variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "endog = pd.DataFrame(model.endog, columns=[model.endog_names])\n",
    "exog = pd.DataFrame(model.exog, columns=model.exog_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The baseline prediction strategy is to always guess \"boy\".  In that case, we're right almost 51% of the time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.507173764518333"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "actual = endog['boy']\n",
    "baseline = actual.mean()\n",
    "baseline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the previous model, we predict \"boy\" whenever the predicted probability is at least 0.5, then count the true positives and true negatives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3944.0, 548.0)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predict = (results.predict() >= 0.5)\n",
    "true_pos = predict * actual\n",
    "true_neg = (1 - predict) * (1 - actual)\n",
    "sum(true_pos), sum(true_neg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The accuracy is about 51.2%, slightly higher than the baseline of 50.7%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5115007970849464"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "acc = (sum(true_pos) + sum(true_neg)) / len(actual)\n",
    "acc"
   ]
  },
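  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the same calculation on a toy example, the next cell computes the baseline and the accuracy from plain Python lists. The data and the names `toy_actual` and `toy_probs` are made up for illustration; they are not NSFG results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical outcomes (1 = boy) and predicted probabilities, for illustration only\n",
    "toy_actual = [1, 0, 1, 1, 0, 1, 0, 1]\n",
    "toy_probs = [0.52, 0.49, 0.51, 0.48, 0.47, 0.53, 0.51, 0.50]\n",
    "\n",
    "# Baseline: always guess the more common outcome\n",
    "frac_boy = sum(toy_actual) / len(toy_actual)\n",
    "toy_baseline = max(frac_boy, 1 - frac_boy)\n",
    "\n",
    "# Model: guess \"boy\" when the predicted probability is at least 0.5\n",
    "toy_predict = [int(p >= 0.5) for p in toy_probs]\n",
    "toy_true_pos = sum(p * a for p, a in zip(toy_predict, toy_actual))\n",
    "toy_true_neg = sum((1 - p) * (1 - a) for p, a in zip(toy_predict, toy_actual))\n",
    "toy_acc = (toy_true_pos + toy_true_neg) / len(toy_actual)\n",
    "\n",
    "toy_baseline, toy_acc"
   ]
  },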
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make a prediction for an individual, we have to get their information into a `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.513091\n",
       "dtype: float64"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "columns = ['agepreg', 'hpagelb', 'birthord', 'race']\n",
    "new = pd.DataFrame([[35, 39, 3, 2]], columns=columns)\n",
    "y = results.predict(new)\n",
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This person has a 51% chance of having a boy (according to the model)."
   ]
  },
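  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see where that number comes from, here is a sketch that applies the logistic function to the coefficients printed in the summary above. Because those coefficients are rounded to four decimal places, the result only approximately matches `results.predict`. The variable names are chosen here for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Coefficients as rounded in the summary table above\n",
    "coefs = {\n",
    "    'Intercept': -0.0301,\n",
    "    'C(race)[T.2]': -0.0224,\n",
    "    'agepreg': -0.0027,\n",
    "    'hpagelb': 0.0047,\n",
    "    'birthord': 0.0050,\n",
    "}\n",
    "\n",
    "# The same individual as above: agepreg=35, hpagelb=39, birthord=3, race=2\n",
    "# (the race 3 term is omitted because its indicator is 0)\n",
    "log_odds = (\n",
    "    coefs['Intercept']\n",
    "    + coefs['C(race)[T.2]'] * 1\n",
    "    + coefs['agepreg'] * 35\n",
    "    + coefs['hpagelb'] * 39\n",
    "    + coefs['birthord'] * 3\n",
    ")\n",
    "\n",
    "# The logistic function converts log-odds to a probability\n",
    "p = 1 / (1 + math.exp(-log_odds))\n",
    "log_odds, p"
   ]
  },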
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**Exercise:** Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "import first\n",
    "\n",
    "live, firsts, others = first.MakeFrames()\n",
    "# Bets are placed during week 30, so keep only pregnancies that lasted longer\n",
    "live = live[live.prglngth > 30]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status. See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis\n",
    "\n",
    "Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.\n",
    "\n",
    "As an exercise, use a data mining approach to test the other variables in the pregnancy and respondent files. Can you find any factors with a substantial effect?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called `poisson`. It works the same way as `ols` and `logit`. As an exercise, let’s use it to predict how many children a woman has borne; in the NSFG dataset, this variable is called `numbabes`.\n",
    "\n",
    "Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has borne?"
   ]
  },
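  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hint about what the fitted model does: Poisson regression uses a log link, so the predicted count is the exponential of the linear combination of the explanatory variables. The sketch below uses made-up coefficients (not estimates from the NSFG data) purely to show the arithmetic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Hypothetical coefficients, for illustration only (not fitted to NSFG)\n",
    "b0, b_age = -1.0, 0.05\n",
    "age = 35\n",
    "\n",
    "# Log link: the linear predictor is the log of the expected count\n",
    "log_mean = b0 + b_age * age\n",
    "expected_count = math.exp(log_mean)\n",
    "expected_count"
   ]
  },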
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can predict the number of children for a woman who is 35 years old, black, and a college\n",
    "graduate whose annual household income exceeds $75,000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called `mnlogit`. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called `rmarital`.\n",
    "\n",
    "Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability of each marital status?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a prediction for a woman who is 25 years old, white, and a high\n",
    "school graduate whose annual household income is about $45,000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}