{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 11\n", "\n", "Examples and Exercises from Think Stats, 2nd Edition\n", "\n", "http://thinkstats2.com\n", "\n", "Copyright 2016 Allen B. Downey\n", "\n", "MIT License: https://opensource.org/licenses/MIT\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", "\n", " local, _ = urlretrieve(url, filename)\n", " print(\"Downloaded \" + local)\n", "\n", "\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import thinkstats2\n", "import thinkplot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple regression\n", "\n", "Let's load up the NSFG data again." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/first.py\")\n", "\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct\")\n", "download(\n", " \"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz\"\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import first\n", "\n", "live, firsts, others = first.MakeFrames()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's birth weight as a function of mother's age (which we saw in the previous chapter)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: totalwgt_lb R-squared: 0.005\n", "Model: OLS Adj. R-squared: 0.005\n", "Method: Least Squares F-statistic: 43.02\n", "Date: Sun, 09 Apr 2023 Prob (F-statistic): 5.72e-11\n", "Time: 09:51:41 Log-Likelihood: -15897.\n", "No. Observations: 9038 AIC: 3.180e+04\n", "Df Residuals: 9036 BIC: 3.181e+04\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 6.8304 0.068 100.470 0.000 6.697 6.964\n", "agepreg 0.0175 0.003 6.559 0.000 0.012 0.023\n", "==============================================================================\n", "Omnibus: 1024.052 Durbin-Watson: 1.618\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 3081.833\n", "Skew: -0.601 Prob(JB): 0.00\n", "Kurtosis: 5.596 Cond. No. 118.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import statsmodels.formula.api as smf\n", "\n", "formula = 'totalwgt_lb ~ agepreg'\n", "model = smf.ols(formula, data=live)\n", "results = model.fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can extract the parameters." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6.830396973311051, 0.017453851471802638)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inter = results.params['Intercept']\n", "slope = results.params['agepreg']\n", "inter, slope" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the p-value of the slope estimate." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.7229471073163425e-11" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "slope_pvalue = results.pvalues['agepreg']\n", "slope_pvalue" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the coefficient of determination." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.004738115474710369" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.rsquared" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference in birth weight between first babies and others." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.12476118453549034" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diff_weight = firsts.totalwgt_lb.mean() - others.totalwgt_lb.mean()\n", "diff_weight" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The difference in age between mothers of first babies and others." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-3.5864347661500275" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diff_age = firsts.agepreg.mean() - others.agepreg.mean()\n", "diff_age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The age difference plausibly explains about half of the difference in weight: the slope times the age difference is about -0.063 lb, compared with the observed difference of about -0.125 lb." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-0.0625970997216918" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "slope * diff_age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running a single regression with a categorical variable, `isfirst`:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: totalwgt_lb R-squared: 0.002\n", "Model: OLS Adj. R-squared: 0.002\n", "Method: Least Squares F-statistic: 17.74\n", "Date: Sun, 09 Apr 2023 Prob (F-statistic): 2.55e-05\n", "Time: 09:51:41 Log-Likelihood: -15909.\n", "No. Observations: 9038 AIC: 3.182e+04\n", "Df Residuals: 9036 BIC: 3.184e+04\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "===================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------\n", "Intercept 7.3259 0.021 356.007 0.000 7.286 7.366\n", "isfirst[T.True] -0.1248 0.030 -4.212 0.000 -0.183 -0.067\n", "==============================================================================\n", "Omnibus: 988.919 Durbin-Watson: 1.613\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 2897.107\n", "Skew: -0.589 Prob(JB): 0.00\n", "Kurtosis: 5.511 Cond. No. 
2.58\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "live['isfirst'] = live.birthord == 1\n", "formula = 'totalwgt_lb ~ isfirst'\n", "results = smf.ols(formula, data=live).fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can run a multiple regression:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: totalwgt_lb R-squared: 0.005\n", "Model: OLS Adj. R-squared: 0.005\n", "Method: Least Squares F-statistic: 24.02\n", "Date: Sun, 09 Apr 2023 Prob (F-statistic): 3.95e-11\n", "Time: 09:51:41 Log-Likelihood: -15894.\n", "No. Observations: 9038 AIC: 3.179e+04\n", "Df Residuals: 9035 BIC: 3.182e+04\n", "Df Model: 2 \n", "Covariance Type: nonrobust \n", "===================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------\n", "Intercept 6.9142 0.078 89.073 0.000 6.762 7.066\n", "isfirst[T.True] -0.0698 0.031 -2.236 0.025 -0.131 -0.009\n", "agepreg 0.0154 0.003 5.499 0.000 0.010 0.021\n", "==============================================================================\n", "Omnibus: 1019.945 Durbin-Watson: 1.618\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 3063.682\n", "Skew: -0.599 Prob(JB): 0.00\n", "Kurtosis: 5.588 Cond. No. 137.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula = 'totalwgt_lb ~ isfirst + agepreg'\n", "results = smf.ols(formula, data=live).fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, when we control for mother's age, the apparent difference due to `isfirst` is cut roughly in half (from about -0.125 lb to about -0.070 lb).\n", "\n", "If we add age squared, we can control for a quadratic relationship between age and weight." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: totalwgt_lb R-squared: 0.007\n", "Model: OLS Adj. R-squared: 0.007\n", "Method: Least Squares F-statistic: 22.64\n", "Date: Sun, 09 Apr 2023 Prob (F-statistic): 1.35e-14\n", "Time: 09:51:41 Log-Likelihood: -15884.\n", "No. Observations: 9038 AIC: 3.178e+04\n", "Df Residuals: 9034 BIC: 3.181e+04\n", "Df Model: 3 \n", "Covariance Type: nonrobust \n", "===================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------\n", "Intercept 5.6923 0.286 19.937 0.000 5.133 6.252\n", "isfirst[T.True] -0.0504 0.031 -1.602 0.109 -0.112 0.011\n", "agepreg 0.1124 0.022 5.113 0.000 0.069 0.155\n", "agepreg2 -0.0018 0.000 -4.447 0.000 -0.003 -0.001\n", "==============================================================================\n", "Omnibus: 1007.149 Durbin-Watson: 1.616\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 3003.343\n", "Skew: -0.594 Prob(JB): 0.00\n", "Kurtosis: 5.562 Cond. No. 1.39e+04\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 1.39e+04. 
This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n", "\"\"\"" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "live['agepreg2'] = live.agepreg**2\n", "formula = 'totalwgt_lb ~ isfirst + agepreg + agepreg2'\n", "results = smf.ols(formula, data=live).fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we do that, the apparent effect of `isfirst` gets even smaller, and is no longer statistically significant.\n", "\n", "These results suggest that the apparent difference in weight between first babies and others might be explained by differences in mothers' ages, at least in part." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Mining\n", "\n", "We can use `join` to combine variables from the pregnancy and respondent tables." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dct\")\n", "download(\"https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemResp.dat.gz\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8884, 3333)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nsfg\n", "\n", "live = live[live.prglngth>30]\n", "resp = nsfg.ReadFemResp()\n", "resp.index = resp.caseid\n", "join = live.join(resp, on='caseid', rsuffix='_r')\n", "join.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can search for variables with explanatory power.\n", "\n", "Because we don't clean most of the variables, we are probably missing some good ones." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import patsy\n", "\n", "def GoMining(df):\n", " \"\"\"Searches for variables that predict birth weight.\n", "\n", " df: DataFrame of pregnancy records\n", "\n", " returns: list of (rsquared, variable name) pairs\n", " \"\"\"\n", " variables = []\n", " for name in df.columns:\n", " try:\n", " if df[name].var() < 1e-7:\n", " continue\n", "\n", " formula = 'totalwgt_lb ~ agepreg + ' + name\n", " model = smf.ols(formula, data=df)\n", " if model.nobs < len(df)/2:\n", " continue\n", "\n", " results = model.fit()\n", " except (ValueError, TypeError, patsy.PatsyError) as e:\n", " continue\n", " \n", " variables.append((results.rsquared, name))\n", "\n", " return variables" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.005357647323640413, 'caseid'),\n", " (0.005750013985077129, 'pregordr'),\n", " (0.006330980237390205, 'pregend1'),\n", " (0.016017752709788113, 'nbrnaliv'),\n", " (0.005543156193094756, 'cmprgend'),\n", " (0.005442800591640151, 'cmprgbeg'),\n", " (0.005327612601561116, 'gestasun_m'),\n", " (0.007023552638453112, 'gestasun_w'),\n", " (0.12340041363361076, 'wksgest'),\n", " (0.02714427463957958, 'mosgest'),\n", " (0.0053368691675173, 'bpa_bdscheck1'),\n", " (0.018550925293942533, 'babysex'),\n", " (0.9498127305978009, 'birthwgt_lb'),\n", " (0.013102457615706498, 'birthwgt_oz'),\n", " (0.005543156193094756, 'cmbabdob'),\n", " (0.005684952650027997, 'kidage'),\n", " (0.006165319836040295, 'hpagelb'),\n", " (0.008066317368677245, 'matchfound'),\n", " (0.012529022541810653, 'anynurse'),\n", " (0.004409820583625601, 'frsteatd_n'),\n", " (0.004263973471709814, 'frsteatd_p'),\n", " (0.004020131462736054, 'frsteatd'),\n", " (0.005830571770254145, 'cmlastlb'),\n", " (0.005356747266123785, 'cmfstprg'),\n", " (0.005428333650990047, 'cmlstprg'),\n", " (0.005731401733759189, 'cmintstr'),\n", " 
(0.005543156193094756, 'cmintfin'),\n", " (0.00993306080712264, 'evuseint'),\n", " (0.009315099704132801, 'stopduse'),\n", " (0.00372683328673018, 'wantbold'),\n", " (0.0070729951341236275, 'timingok'),\n", " (0.005042504093811795, 'wthpart1'),\n", " (0.006835771483523323, 'hpwnold'),\n", " (0.006349094713449799, 'timokhp'),\n", " (0.00262913761511141, 'cohpbeg'),\n", " (0.0018043469091931774, 'cohpend'),\n", " (0.00808960003494219, 'tellfath'),\n", " (0.009056250355562567, 'whentell'),\n", " (0.005369974278794709, 'anyusint'),\n", " (0.13012519488625085, 'prglngth'),\n", " (0.005545615084230682, 'birthord'),\n", " (0.005591745847583596, 'datend'),\n", " (0.005327282505071085, 'agepreg'),\n", " (0.00566538884322787, 'datecon'),\n", " (0.10203149928156052, 'agecon'),\n", " (0.010461691367377068, 'fmarout5'),\n", " (0.009840804911715684, 'pmarpreg'),\n", " (0.011354138472805753, 'rmarout6'),\n", " (0.0106049646842995, 'fmarcon5'),\n", " (0.3008240784470769, 'lbw1'),\n", " (0.012193688404495417, 'bfeedwks'),\n", " (0.007984835684252678, 'oldwantr'),\n", " (0.00640138668536383, 'oldwantp'),\n", " (0.007980832538658, 'wantresp'),\n", " (0.006334468987300168, 'wantpart'),\n", " (0.00559161600442204, 'cmbirth'),\n", " (0.0055903980552193255, 'ager'),\n", " (0.0055903980552193255, 'agescrn'),\n", " (0.009944942659110723, 'fmarital'),\n", " (0.008267774071422429, 'rmarital'),\n", " (0.006450913803300651, 'educat'),\n", " (0.0066919868225499, 'hieduc'),\n", " (0.016199503586253106, 'race'),\n", " (0.005351273101023457, 'hispanic'),\n", " (0.01123834930203138, 'hisprace'),\n", " (0.005415425347505165, 'rcurpreg'),\n", " (0.0060378317082542265, 'pregnum'),\n", " (0.00650372032144908, 'parity'),\n", " (0.005444228863618061, 'insuranc'),\n", " (0.009858545642850713, 'pubassis'),\n", " (0.009743158975296873, 'poverty'),\n", " (0.006124250620027971, 'laborfor'),\n", " (0.005476246226178816, 'religion'),\n", " (0.005908687699079596, 'metro'),\n", " (0.0053296353237825, 
'brnout'),\n", " (0.005388240758326224, 'prglngth_i'),\n", " (0.0053720967087050875, 'datend_i'),\n", " (0.005666104281317086, 'agepreg_i'),\n", " (0.0053480888696338935, 'datecon_i'),\n", " (0.005612740210896416, 'agecon_i'),\n", " (0.005733140260446579, 'fmarout5_i'),\n", " (0.005422598571288018, 'pmarpreg_i'),\n", " (0.005498885939111409, 'rmarout6_i'),\n", " (0.0057702817140151685, 'fmarcon5_i'),\n", " (0.005355587358294223, 'learnprg_i'),\n", " (0.005464552651942123, 'pncarewk_i'),\n", " (0.005911575701061822, 'paydeliv_i'),\n", " (0.005327282505071085, 'lbw1_i'),\n", " (0.005422843440833436, 'bfeedwks_i'),\n", " (0.005456277033588197, 'maternlv_i'),\n", " (0.005397823762493981, 'oldwantr_i'),\n", " (0.005330102063603626, 'oldwantp_i'),\n", " (0.005397823762493981, 'wantresp_i'),\n", " (0.00538826132872694, 'wantpart_i'),\n", " (0.005415854205569115, 'hieduc_i'),\n", " (0.005327282505070974, 'hispanic_i'),\n", " (0.005662161985408476, 'parity_i'),\n", " (0.005490192077694744, 'insuranc_i'),\n", " (0.005588263662201554, 'pubassis_i'),\n", " (0.005674668721373788, 'poverty_i'),\n", " (0.005635393818939516, 'laborfor_i'),\n", " (0.005329126750794222, 'religion_i'),\n", " (0.007266083159805259, 'basewgt'),\n", " (0.006863344757269019, 'adj_mod_basewgt'),\n", " (0.007414601906967189, 'finalwgt'),\n", " (0.005996732588561704, 'secu_p'),\n", " (0.00540529186826777, 'sest'),\n", " (1.0, 'totalwgt_lb'),\n", " (0.005617788291713333, 'isfirst'),\n", " (0.007529217800612997, 'agepreg2'),\n", " (0.005357647323640413, 'caseid_r'),\n", " (0.005327683657806448, 'rscrinf'),\n", " (0.005394521556674303, 'rdormres'),\n", " (0.005643925975030384, 'rostscrn'),\n", " (0.005404878128178137, 'rscreenhisp'),\n", " (0.009651605370030958, 'rscreenrace'),\n", " (0.005578533477514025, 'age_a'),\n", " (0.0055903980552193255, 'age_r'),\n", " (0.00559161600442204, 'cmbirth_r'),\n", " (0.0055903980552193255, 'agescrn_r'),\n", " (0.008267774071422429, 'marstat'),\n", " (0.009944942659110723, 
'fmarit'),\n", " (0.009091376003145912, 'evrmarry'),\n", " (0.00535044323233691, 'hisp'),\n", " (0.005516347262115917, 'numrace'),\n", " (0.0053556674242533076, 'roscnt'),\n", " (0.007003475396870851, 'hplocale'),\n", " (0.007768164334321925, 'manrel'),\n", " (0.005348262928552283, 'fl_rrace'),\n", " (0.005330593394374583, 'fl_rhisp'),\n", " (0.005986092508868612, 'goschol'),\n", " (0.0062182476084240434, 'higrade'),\n", " (0.005595972205724942, 'compgrd'),\n", " (0.0061609210437847395, 'havedip'),\n", " (0.006413101502788066, 'dipged'),\n", " (0.006061422266963268, 'cmhsgrad'),\n", " (0.00532855360811324, 'wthparnw'),\n", " (0.005280615627084373, 'onown'),\n", " (0.0061291662740246, 'intact'),\n", " (0.006038819191132916, 'parmarr'),\n", " (0.005635183345626071, 'momdegre'),\n", " (0.006872582290938678, 'momworkd'),\n", " (0.00533037621783472, 'momchild'),\n", " (0.005349294568680385, 'momfstch'),\n", " (0.006245111146334303, 'daddegre'),\n", " (0.005328519387111874, 'bothbiol'),\n", " (0.006182128607900683, 'intact18'),\n", " (0.005340451644766264, 'onown18'),\n", " (0.00650372032144908, 'numbabes'),\n", " (0.005550348810967609, 'totplacd'),\n", " (0.005337228144332018, 'nplaced'),\n", " (0.005530586313726937, 'ndied'),\n", " (0.005539874079434126, 'nadoptv'),\n", " (0.005830571770254145, 'cmlastlb_r'),\n", " (0.005356747266123785, 'cmfstprg_r'),\n", " (0.005428333650990047, 'cmlstprg_r'),\n", " (0.0059155825615551105, 'menarche'),\n", " (0.00534384518588682, 'pregnowq'),\n", " (0.0060378317082542265, 'numpregs'),\n", " (0.005415425347504943, 'currpreg'),\n", " (0.005780984893938856, 'giveadpt'),\n", " (0.0052547662523021454, 'otherkid'),\n", " (0.005289464417210787, 'everadpt'),\n", " (0.0058286851663723604, 'seekadpt'),\n", " (0.0065420652227655696, 'evwntano'),\n", " (0.007146484736500702, 'timesmar'),\n", " (0.006260110249016293, 'hsbverif'),\n", " (0.0065101987688339635, 'cmmarrhx'),\n", " (0.007559824082689293, 'hxagemar'),\n", " (0.007432269596385654, 
'cmhsbdobx'),\n", " (0.00783511174689866, 'lvtoghx'),\n", " (0.006732535398132455, 'hisphx'),\n", " (0.00854385589625295, 'racehx1'),\n", " (0.009526403068447875, 'chedmarn'),\n", " (0.006464090481781537, 'marbefhx'),\n", " (0.007928273601067293, 'kidshx'),\n", " (0.007183543845538876, 'cmmarrch'),\n", " (0.006761687432294994, 'cmdobch'),\n", " (0.006548743638518095, 'prevhusb'),\n", " (0.006295608867143532, 'cmstrthp'),\n", " (0.007320288484473303, 'evrcohab'),\n", " (0.0073365000741104636, 'liveoth'),\n", " (0.006131749211457538, 'prevcohb'),\n", " (0.005640038945984638, 'cmfstsex'),\n", " (0.0054298693534173825, 'agefstsx'),\n", " (0.0010172326396578057, 'grfstsx'),\n", " (0.005745270584702311, 'sameman'),\n", " (0.0054772600593092635, 'fpage'),\n", " (0.0053827337115199825, 'knowfp'),\n", " (0.006043726456224308, 'cmlsexfp'),\n", " (0.0059106934677164435, 'cmfplast'),\n", " (0.005423734887383902, 'lifeprt'),\n", " (0.0053417476676171916, 'mon12prt'),\n", " (0.005349388865006244, 'parts12'),\n", " (0.006502466569061172, 'ptsb4mar'),\n", " (0.00627002683228417, 'p1yrage'),\n", " (0.004475735734043584, 'p1yhsage'),\n", " (0.0046862977809064565, 'p1yrf'),\n", " (0.007927219826495358, 'cmfsexx'),\n", " (0.006085450415460381, 'pcurrntx'),\n", " (0.006103433602601682, 'cmlsexx'),\n", " (0.006129604796508592, 'cmlstsxx'),\n", " (0.0060299492496314056, 'cmlstsx12'),\n", " (0.005339302768089693, 'lifeprts'),\n", " (0.0053282763855250215, 'cmlastsx'),\n", " (0.005790778139281083, 'currprtt'),\n", " (0.006691333575973957, 'currprts'),\n", " (0.004259919311988658, 'cmpart1y1'),\n", " (0.0053727944846478914, 'evertubs'),\n", " (0.005394668684848614, 'everhyst'),\n", " (0.0051812137216755705, 'everovrs'),\n", " (0.0055216206466339735, 'everothr'),\n", " (0.00549788837202414, 'anyfster'),\n", " (0.005806983909926733, 'fstrop12'),\n", " (0.0064869433773807605, 'anyopsmn'),\n", " (0.0064756285532651114, 'anymster'),\n", " (0.005457516875982504, 'rsurgstr'),\n", " 
(0.006286783510051408, 'psurgstr'),\n", " (0.0058616476673256646, 'onlytbvs'),\n", " (0.0038784396522586473, 'posiblpg'),\n", " (0.008011147208074498, 'canhaver'),\n", " (0.003893424250926647, 'pregnono'),\n", " (0.005331594693332997, 'rstrstat'),\n", " (0.006809171829395333, 'pstrstat'),\n", " (0.00644307628137919, 'pill'),\n", " (0.005456714711317701, 'condom'),\n", " (0.006878239843079559, 'vasectmy'),\n", " (0.005647132703301971, 'widrawal'),\n", " (0.0067777099610751845, 'depoprov'),\n", " (0.005360766267951678, 'norplant'),\n", " (0.006558270799857713, 'rhythm'),\n", " (0.006781472906158714, 'tempsafe'),\n", " (0.005333255591501329, 'mornpill'),\n", " (0.005588775806413815, 'diafragm'),\n", " (0.008181466230171464, 'wocondom'),\n", " (0.00571244189146225, 'foamalon'),\n", " (0.005605892825595982, 'jelcrmal'),\n", " (0.006669505309996104, 'cervlcap'),\n", " (0.005328971051790532, 'supposit'),\n", " (0.0067739501854262585, 'todayspg'),\n", " (0.005348443535178604, 'iud'),\n", " (0.005669693175719415, 'lunelle'),\n", " (0.0055746648157347645, 'patch'),\n", " (0.006317045357729589, 'othrmeth'),\n", " (0.005328912963907251, 'everused'),\n", " (0.0062621002452544205, 'methdiss'),\n", " (0.006783255187156723, 'methstop01'),\n", " (0.005827582700002165, 'firsmeth01'),\n", " (0.0059969573119960096, 'numfirsm'),\n", " (0.0057600954802820015, 'numfirsm1'),\n", " (0.006335882371607426, 'numfirsm2'),\n", " (0.006336231852240637, 'drugdev'),\n", " (0.004369823481886859, 'firstime2'),\n", " (0.005183357448185544, 'cmfstuse'),\n", " (0.006498991331905568, 'cmfirsm'),\n", " (0.004437510523967125, 'agefstus'),\n", " (0.005735952227613139, 'usefstsx'),\n", " (0.005546199546767272, 'intr_ec3'),\n", " (0.006421238927684092, 'monsx1177'),\n", " (0.005995975528061859, 'monsx1178'),\n", " (0.005871772632105254, 'monsx1179'),\n", " (0.005355244912861767, 'monsx1180'),\n", " (0.005475780215880466, 'monsx1181'),\n", " (0.005878264818758194, 'monsx1182'),\n", " (0.00593529016936678, 
'monsx1183'),\n", " (0.0061367227367412625, 'monsx1184'),\n", " (0.006114461327954457, 'monsx1185'),\n", " (0.005914492828921203, 'monsx1186'),\n", " (0.00571514572173959, 'monsx1187'),\n", " (0.0059426297773013115, 'monsx1188'),\n", " (0.006403878523879025, 'monsx1189'),\n", " (0.006652431486344534, 'monsx1190'),\n", " (0.005859464888209431, 'monsx1191'),\n", " (0.0061836704118874986, 'monsx1192'),\n", " (0.005970697654319124, 'monsx1193'),\n", " (0.006185930348181823, 'monsx1194'),\n", " (0.006261722615339971, 'monsx1195'),\n", " (0.005883642310483994, 'monsx1196'),\n", " (0.006312131013808564, 'monsx1197'),\n", " (0.006334490158298567, 'monsx1198'),\n", " (0.006238157095211028, 'monsx1199'),\n", " (0.005958303111450847, 'monsx1200'),\n", " (0.005971611823831213, 'monsx1201'),\n", " (0.006232965940293433, 'monsx1202'),\n", " (0.0061567646880547056, 'monsx1203'),\n", " (0.00646363990894494, 'monsx1204'),\n", " (0.006792900648230571, 'monsx1205'),\n", " (0.006223101663507369, 'monsx1206'),\n", " (0.006017807331303304, 'monsx1207'),\n", " (0.006329241899985627, 'monsx1208'),\n", " (0.0063566695058372424, 'monsx1209'),\n", " (0.006029258683564187, 'monsx1210'),\n", " (0.005520990260007963, 'monsx1211'),\n", " (0.005987728957053462, 'monsx1212'),\n", " (0.007198033668191051, 'monsx1213'),\n", " (0.006899020398532851, 'monsx1214'),\n", " (0.006151064433626341, 'monsx1215'),\n", " (0.006041191941944746, 'monsx1216'),\n", " (0.006215321754830194, 'monsx1217'),\n", " (0.005713521284034573, 'monsx1218'),\n", " (0.005933855306221036, 'monsx1219'),\n", " (0.006167667783488429, 'monsx1220'),\n", " (0.006029948736347213, 'monsx1221'),\n", " (0.005840062379099065, 'monsx1222'),\n", " (0.005619611941114044, 'monsx1223'),\n", " (0.006563305715290513, 'monsx1224'),\n", " (0.006284349732578187, 'monsx1225'),\n", " (0.006323173014771255, 'monsx1226'),\n", " (0.005534365979908085, 'monsx1227'),\n", " (0.006374798787152525, 'monsx1228'),\n", " (0.007520765906871785, 'monsx1229'),\n", 
" (0.008083697967827597, 'monsx1230'),\n", " (0.007233374656298253, 'monsx1231'),\n", " (0.007985217080101137, 'monsx1232'),\n", " (0.007191033197050278, 'monsx1233'),\n", " (0.006261425077471072, 'cmstrtmc'),\n", " (0.005825958517251317, 'cmendmc'),\n", " (0.005603506388835111, 'methhist011'),\n", " (0.007284980353802539, 'cmdatbgn'),\n", " (0.005635409045157691, 'nummult'),\n", " (0.005551844008685136, 'methhist021'),\n", " (0.0054895117434013985, 'nummult2'),\n", " (0.0055628303964547765, 'methhist031'),\n", " (0.005546883063649921, 'nummult3'),\n", " (0.005487202405734637, 'methhist041'),\n", " (0.005502460937693798, 'nummult4'),\n", " (0.0055361891691549925, 'methhist051'),\n", " (0.005452729882170382, 'nummult5'),\n", " (0.005489245446082758, 'methhist061'),\n", " (0.005475895500679395, 'nummult6'),\n", " (0.0054732146908181845, 'methhist071'),\n", " (0.005428876448518971, 'nummult7'),\n", " (0.005594046854832779, 'methhist081'),\n", " (0.0054730049104988465, 'nummult8'),\n", " (0.005482617158302006, 'methhist091'),\n", " (0.0054272326646152, 'nummult9'),\n", " (0.005503393156441327, 'methhist101'),\n", " (0.005437143233952169, 'nummult10'),\n", " (0.005516430998187216, 'methhist111'),\n", " (0.005502305304443622, 'nummult11'),\n", " (0.005470302098175561, 'methhist121'),\n", " (0.005504255019818993, 'nummult12'),\n", " (0.005607458874307136, 'methhist131'),\n", " (0.005549721074183167, 'nummult13'),\n", " (0.005546717068292573, 'methhist141'),\n", " (0.00554268385974932, 'nummult14'),\n", " (0.005539692295560172, 'methhist151'),\n", " (0.0055324680547158556, 'nummult15'),\n", " (0.005536492647946201, 'methhist161'),\n", " (0.005536401342006614, 'nummult16'),\n", " (0.00551933216368472, 'methhist171'),\n", " (0.005493486008020909, 'nummult17'),\n", " (0.005507179974851173, 'methhist181'),\n", " (0.0054913983634302665, 'nummult18'),\n", " (0.005471902813541596, 'methhist191'),\n", " (0.005487640499653779, 'nummult19'),\n", " (0.005504743168494475, 
'methhist201'),\n", " (0.005466739231583695, 'nummult20'),\n", " (0.005510186988941679, 'methhist211'),\n", " (0.005491203193220606, 'nummult21'),\n", " (0.005506337448605736, 'methhist221'),\n", " (0.005511607291997178, 'nummult22'),\n", " (0.0055295450491826825, 'methhist231'),\n", " (0.005530806017465695, 'nummult23'),\n", " (0.00555002676240679, 'methhist241'),\n", " (0.005539948289882801, 'nummult24'),\n", " (0.0055211355102111614, 'methhist251'),\n", " (0.00556828550933397, 'nummult25'),\n", " (0.005540129688542339, 'methhist261'),\n", " (0.005673797628286015, 'nummult26'),\n", " (0.0055319741942275735, 'methhist271'),\n", " (0.005784089177182872, 'nummult27'),\n", " (0.0055784115374736265, 'methhist281'),\n", " (0.005600401746815531, 'nummult28'),\n", " (0.005578714034812471, 'methhist291'),\n", " (0.0056155148617611506, 'nummult29'),\n", " (0.0056429313351300525, 'methhist301'),\n", " (0.00573173986023956, 'nummult30'),\n", " (0.005616953589053675, 'methhist311'),\n", " (0.005669936949214471, 'nummult31'),\n", " (0.00564355522220783, 'methhist321'),\n", " (0.00574989591519115, 'nummult32'),\n", " (0.005675590741276326, 'methhist331'),\n", " (0.005780466079303159, 'nummult33'),\n", " (0.005727245232216571, 'methhist341'),\n", " (0.00587256497631905, 'nummult34'),\n", " (0.005746152259455073, 'methhist351'),\n", " (0.005842619866505694, 'nummult35'),\n", " (0.005750154151443976, 'methhist361'),\n", " (0.005681701891443347, 'nummult36'),\n", " (0.006075297759302933, 'methhist371'),\n", " (0.005704951869176855, 'nummult37'),\n", " (0.006209219638329877, 'methhist381'),\n", " (0.005708874481709425, 'nummult38'),\n", " (0.00615001891252942, 'methhist391'),\n", " (0.005747940396037099, 'nummult39'),\n", " (0.006692950545528653, 'methhist401'),\n", " (0.006258857669329654, 'nummult40'),\n", " (0.007456274427272258, 'methhist411'),\n", " (0.006932216156475546, 'nummult41'),\n", " (0.008359250704236931, 'methhist421'),\n", " (0.007521964733084974, 'nummult42'),\n", " 
(0.009953102543862835, 'methhist431'),\n", " (0.007971602243966425, 'nummult43'),\n", " (0.009584160754145143, 'methhist441'),\n", " (0.007606603124692746, 'nummult44'),\n", " (0.009508144086923354, 'methhist451'),\n", " (0.006828747690285741, 'nummult45'),\n", " (0.007584417482075834, 'currmeth1'),\n", " (0.00627037344004322, 'lastmonmeth1'),\n", " (0.006727716986788201, 'uselstp'),\n", " (0.007019085471651421, 'lstmthp11'),\n", " (0.00558079811714618, 'usefstp'),\n", " (0.006042935277705164, 'pst4wksx'),\n", " (0.006741983069778024, 'pswkcond2'),\n", " (0.006418878609596557, 'p12mocon'),\n", " (0.00534642916952166, 'bthcon12'),\n", " (0.005332954623398445, 'medtst12'),\n", " (0.005341624685738289, 'bccns12'),\n", " (0.005429737619422337, 'stcns12'),\n", " (0.005470924106891761, 'eccns12'),\n", " (0.005348972077159231, 'prgtst12'),\n", " (0.0053943749433755794, 'abort12'),\n", " (0.005367446414242583, 'pap12'),\n", " (0.005672332993427065, 'pelvic12'),\n", " (0.0062589483576193095, 'stdtst12'),\n", " (0.004842342942612765, 'numbcvis'),\n", " (0.00439966100781275, 'papplbc2'),\n", " (0.004437846218583008, 'pappelec'),\n", " (0.005725980591540503, 'rwant'),\n", " (0.0059484588815127415, 'pwant'),\n", " (0.005332008335859562, 'hlpprg'),\n", " (0.005561944793632145, 'prgvisit'),\n", " (0.005743806819574981, 'hlpmc'),\n", " (0.006676017274232948, 'duchfreq'),\n", " (0.005813100370292146, 'pid'),\n", " (0.005690158662538858, 'diabetes'),\n", " (0.005531743131489408, 'ovacyst'),\n", " (0.006038196805351448, 'uf'),\n", " (0.005934224089218509, 'endo'),\n", " (0.005440537772177567, 'ovuprob'),\n", " (0.0061251320208634, 'limited'),\n", " (0.005541214057692367, 'equipmnt'),\n", " (0.005981350416892073, 'donbld85'),\n", " (0.005436612794017082, 'hivtest'),\n", " (0.00447797159580221, 'cmhivtst'),\n", " (0.004355089528519485, 'plchiv'),\n", " (0.004314612061914858, 'hivtst'),\n", " (0.00561821377110816, 'talkdoct'),\n", " (0.005329810046365013, 'retrovir'),\n", " 
(0.006663672692549416, 'cover12'),\n", " (0.006693078733038593, 'coverhow01'),\n", " (0.0060500431013753575, 'sameadd'),\n", " (0.0053296353237825, 'brnout_r'),\n", " (0.014003795578114597, 'paydu'),\n", " (0.005407180634571018, 'relraisd'),\n", " (0.005370356923925956, 'relcurr'),\n", " (0.005655636115156848, 'fundam'),\n", " (0.005504988604336347, 'reldlife'),\n", " (0.005867952875404425, 'attndnow'),\n", " (0.005933481080438563, 'evwrk6mo'),\n", " (0.0063049738571671066, 'cmbfstwk'),\n", " (0.005644145001417411, 'evrntwrk'),\n", " (0.005328150312206126, 'wrk12mos'),\n", " (0.007178560512177579, 'fpt12mos'),\n", " (0.006445359844623688, 'dolastwk1'),\n", " (0.004018698456840442, 'dolastwk2'),\n", " (0.00521674319332921, 'dolastwk3'),\n", " (0.0062574600046569895, 'rwrkst'),\n", " (0.005644484119356141, 'everwork'),\n", " (0.005177779293875973, 'rnumjob'),\n", " (0.005293304160078116, 'rftptx'),\n", " (0.0059898508613943635, 'rearnty'),\n", " (0.007212528591326373, 'splstwk1'),\n", " (0.008528896910858341, 'spwrkst'),\n", " (0.005713999272682013, 'spnumjob'),\n", " (0.006751568523647111, 'spftptx'),\n", " (0.005877030054991406, 'spearnty'),\n", " (0.004886896120925077, 'chcarany'),\n", " (0.005366116602676385, 'better'),\n", " (0.005356197123854156, 'staytog'),\n", " (0.00532884602247885, 'samesex'),\n", " (0.0053541143477805475, 'anyact'),\n", " (0.0054050041417265104, 'sxok18'),\n", " (0.005997965901475055, 'sxok16'),\n", " (0.005984938037659759, 'chreward'),\n", " (0.00597338721593621, 'chsuppor'),\n", " (0.005352012775372117, 'gayadopt'),\n", " (0.005785901183284814, 'okcohab'),\n", " (0.005330226512754721, 'warm'),\n", " (0.005334826170625084, 'achieve'),\n", " (0.006122955918854811, 'family'),\n", " (0.005413165880377879, 'acasilang'),\n", " (0.007834205022225316, 'wage'),\n", " (0.00632142355555132, 'selfinc'),\n", " (0.006237964556073061, 'socsec'),\n", " (0.005330070889624117, 'disabil'),\n", " (0.005337980306478363, 'retire'),\n", " 
(0.008398675180815607, 'ssi'),\n", " (0.006015715048611758, 'unemp'),\n", " (0.005327537441129571, 'chldsupp'),\n", " (0.009477120535617223, 'interest'),\n", " (0.007333419685710885, 'dividend'),\n", " (0.005327382416789428, 'othinc'),\n", " (0.006775124578427327, 'toincwmy'),\n", " (0.005352217186313735, 'totinc'),\n", " (0.0064238214634630975, 'pubasst'),\n", " (0.008558622074178679, 'foodstmp'),\n", " (0.006637757779645814, 'wic'),\n", " (0.005378525056025318, 'hlptrans'),\n", " (0.005920897262712499, 'hlpchldc'),\n", " (0.005478955696682664, 'hlpjob'),\n", " (0.0055903980552193255, 'ager_r'),\n", " (0.009944942659110723, 'fmarital_r'),\n", " (0.006450913803300651, 'educat_r'),\n", " (0.0066919868225499, 'hieduc_r'),\n", " (0.005351273101023457, 'hispanic_r'),\n", " (0.016199503586253106, 'race_r'),\n", " (0.01123834930203138, 'hisprace_r'),\n", " (0.005329986431589662, 'numkdhh'),\n", " (0.005394528433483536, 'numfmhh'),\n", " (0.006182128607901127, 'intctfam'),\n", " (0.005525721076047874, 'parage14'),\n", " (0.005473746177716121, 'educmom'),\n", " (0.0058533592637336485, 'agemomb1'),\n", " (0.005415854205569115, 'hieduc_i_r'),\n", " (0.005327282505070974, 'hispanic_i_r'),\n", " (0.005329835819694262, 'parage14_i'),\n", " (0.006383228212261338, 'educmom_i'),\n", " (0.006160171070091369, 'agemomb1_i'),\n", " (0.005415425347505165, 'rcurpreg_r'),\n", " (0.0060378317082542265, 'pregnum_r'),\n", " (0.0060943638898968144, 'compreg'),\n", " (0.005630995365189961, 'lossnum'),\n", " (0.005522608888729241, 'abortion'),\n", " (0.0056503224424951926, 'lbpregs'),\n", " (0.00650372032144908, 'parity_r'),\n", " (0.005417941356685496, 'births5'),\n", " (0.005764362444789728, 'outcom01'),\n", " (0.005931010998080022, 'outcom02'),\n", " (0.006122853755040514, 'outcom03'),\n", " (0.005419025041086045, 'datend01'),\n", " (0.006029071803283159, 'datend02'),\n", " (0.006108693948040367, 'datend03'),\n", " (0.007278279163181578, 'ageprg01'),\n", " (0.008535289365551813, 
'ageprg02'),\n", " (0.009220965964321981, 'ageprg03'),\n", " (0.005389678393071362, 'datcon01'),\n", " (0.006145518207557932, 'datcon02'),\n", " (0.0062167895640961035, 'datcon03'),\n", " (0.007067187542220688, 'agecon01'),\n", " (0.008474519661758162, 'agecon02'),\n", " (0.00889389736970092, 'agecon03'),\n", " (0.011269357246806444, 'marout01'),\n", " (0.010141720907287155, 'marout02'),\n", " (0.011807801994375033, 'marout03'),\n", " (0.011407737138640073, 'rmarout01'),\n", " (0.010546913206564978, 'rmarout02'),\n", " (0.01343006646571343, 'rmarout03'),\n", " (0.009234871431322733, 'marcon01'),\n", " (0.01048140179553414, 'marcon02'),\n", " (0.011752599354395321, 'marcon03'),\n", " (0.011437770919637269, 'cebow'),\n", " (0.007442478133716124, 'cebowc'),\n", " (0.005331372450776084, 'datbaby1'),\n", " (0.006431931319703765, 'agebaby1'),\n", " (0.006043928783913133, 'liv1chld'),\n", " (0.005662161985408476, 'lossnum_i'),\n", " (0.005662161985408476, 'abortion_i'),\n", " (0.005662161985408476, 'lbpregs_i'),\n", " (0.005662161985408476, 'parity_i_r'),\n", " (0.005662161985408476, 'births5_i'),\n", " (0.005917896845371695, 'outcom02_i'),\n", " (0.005328062662569244, 'outcom03_i'),\n", " (0.005673431831919484, 'outcom04_i'),\n", " (0.005512637793773423, 'outcom05_i'),\n", " (0.005331053251030227, 'outcom06_i'),\n", " (0.005336032874002861, 'outcom07_i'),\n", " (0.005431343175452796, 'outcom08_i'),\n", " (0.005365193139261981, 'outcom09_i'),\n", " (0.005548660472480371, 'outcom10_i'),\n", " (0.005517931754874139, 'datend01_i'),\n", " (0.005690934538439829, 'datend02_i'),\n", " (0.005332715022836387, 'datend03_i'),\n", " (0.005328883282733954, 'datend04_i'),\n", " (0.005521053140913668, 'datend05_i'),\n", " (0.00561428828090571, 'datend06_i'),\n", " (0.005663078981643865, 'datend07_i'),\n", " (0.005858198017675953, 'datend08_i'),\n", " (0.005795817124290448, 'datend09_i'),\n", " (0.005548660472480371, 'datend10_i'),\n", " (0.005537193660306361, 'datend12_i'),\n", " 
(0.005537193660306361, 'datend13_i'),\n", " (0.005572507513472824, 'ageprg01_i'),\n", " (0.006081858146695152, 'ageprg02_i'),\n", " (0.005418070111408713, 'ageprg03_i'),\n", " (0.0053816621737194925, 'ageprg04_i'),\n", " (0.005613913002136761, 'ageprg05_i'),\n", " (0.005489445784257141, 'ageprg06_i'),\n", " (0.005564320884191343, 'ageprg07_i'),\n", " (0.005806143379294859, 'ageprg08_i'),\n", " (0.005795817124290448, 'ageprg09_i'),\n", " (0.005548660472480371, 'ageprg10_i'),\n", " (0.005537193660306361, 'ageprg12_i'),\n", " (0.005537193660306361, 'ageprg13_i'),\n", " (0.005405602200151627, 'datcon01_i'),\n", " (0.005690934538439829, 'datcon02_i'),\n", " (0.005332715022836387, 'datcon03_i'),\n", " (0.005328593623978639, 'datcon04_i'),\n", " (0.005498332824140473, 'datcon05_i'),\n", " (0.00561428828090571, 'datcon06_i'),\n", " (0.005591773203161621, 'datcon07_i'),\n", " (0.005858198017675953, 'datcon08_i'),\n", " (0.005795817124290448, 'datcon09_i'),\n", " (0.005548660472480371, 'datcon10_i'),\n", " (0.005537193660306361, 'datcon12_i'),\n", " (0.005537193660306361, 'datcon13_i'),\n", " (0.005464459762510643, 'agecon01_i'),\n", " (0.005953867386222278, 'agecon02_i'),\n", " (0.005440344671912678, 'agecon03_i'),\n", " (0.005462357369145465, 'agecon04_i'),\n", " (0.005585209645216915, 'agecon05_i'),\n", " (0.005565046255331274, 'agecon06_i'),\n", " (0.005690797025869498, 'agecon07_i'),\n", " (0.005858198017675953, 'agecon08_i'),\n", " (0.005795817124290448, 'agecon09_i'),\n", " (0.005548660472480371, 'agecon10_i'),\n", " (0.005537193660306361, 'agecon12_i'),\n", " (0.005537193660306361, 'agecon13_i'),\n", " (0.005495846986551367, 'marout01_i'),\n", " (0.005617451126696205, 'marout02_i'),\n", " (0.0057675316941988575, 'marout03_i'),\n", " (0.006058555904314256, 'marout04_i'),\n", " (0.007831854171707842, 'marout05_i'),\n", " (0.006525666016629406, 'marout06_i'),\n", " (0.007242086468897568, 'marout07_i'),\n", " (0.006008832051372925, 'marout08_i'),\n", " 
(0.005363835534778705, 'marout09_i'),\n", " (0.005434828901757727, 'marout10_i'),\n", " (0.005588512919647237, 'marout11_i'),\n", " (0.005427474035637592, 'rmarout01_i'),\n", " (0.005678427980718048, 'rmarout02_i'),\n", " (0.005498637325022315, 'rmarout03_i'),\n", " (0.005526063051171093, 'rmarout04_i'),\n", " (0.006880617443338566, 'rmarout05_i'),\n", " (0.006242519070738695, 'rmarout06_i'),\n", " (0.006539245819281447, 'rmarout07_i'),\n", " (0.006008832051372925, 'rmarout08_i'),\n", " (0.005363835534778705, 'rmarout09_i'),\n", " (0.005434828901757727, 'rmarout10_i'),\n", " (0.005588512919647237, 'rmarout11_i'),\n", " (0.00561503495200022, 'marcon01_i'),\n", " (0.005611341957289406, 'marcon02_i'),\n", " (0.006181472870089744, 'marcon03_i'),\n", " (0.005767269267942132, 'marcon04_i'),\n", " (0.007880198703264507, 'marcon05_i'),\n", " (0.00680621180187746, 'marcon06_i'),\n", " (0.006785317780935385, 'marcon07_i'),\n", " (0.005977443508035418, 'marcon08_i'),\n", " (0.005345000467353644, 'marcon09_i'),\n", " (0.005434828901757727, 'marcon10_i'),\n", " (0.005588512919647237, 'marcon11_i'),\n", " (0.005662161985408476, 'cebow_i'),\n", " (0.005662161985408476, 'cebowc_i'),\n", " (0.005794281025144787, 'datbaby1_i'),\n", " (0.005943478567578153, 'agebaby1_i'),\n", " (0.005662161985408476, 'liv1chld_i'),\n", " (0.008267774071422429, 'rmarital_r'),\n", " (0.0062973555918356405, 'fmarno'),\n", " (0.006812670300484491, 'mardat01'),\n", " (0.0065210569478660885, 'fmar1age'),\n", " (0.0109615635907514, 'mar1diss'),\n", " (0.010112030016853457, 'mar1bir1'),\n", " (0.009060360846987803, 'mar1con1'),\n", " (0.008161692294398337, 'con1mar1'),\n", " (0.010059124870520297, 'b1premar'),\n", " (0.007255508514300568, 'cohever'),\n", " (0.006194957970746207, 'evmarcoh'),\n", " (0.00271939367881302, 'cohab1'),\n", " (0.0066280015790392, 'cohstat'),\n", " (0.0031380350118120903, 'cohout'),\n", " (0.005420479710673054, 'coh1dur'),\n", " (0.005336364971175067, 'sexever'),\n", " 
(0.005944534641656896, 'vry1stag'),\n", " (0.00614261333826438, 'sex1age'),\n", " (0.005331809873838411, 'vry1stsx'),\n", " (0.005308860111875258, 'datesex1'),\n", " (0.005353980891277033, 'sexonce'),\n", " (0.005355525286814045, 'fsexpage'),\n", " (0.00673129712673215, 'sexmar'),\n", " (0.0065143616191194464, 'sex1for'),\n", " (0.005443436227038578, 'parts1yr'),\n", " (0.0058559384438559015, 'lsexdate'),\n", " (0.005966988984321908, 'lsexrage'),\n", " (0.0063365994925745905, 'lifprtnr'),\n", " (0.005707996133960003, 'fmarno_i'),\n", " (0.005413354626453648, 'mardat01_i'),\n", " (0.005347124675960324, 'mardat02_i'),\n", " (0.005977057012740761, 'mardis01_i'),\n", " (0.0054115975647813785, 'mardis02_i'),\n", " (0.005421911068140828, 'mardis03_i'),\n", " (0.005404865865728525, 'mardis04_i'),\n", " (0.005455649152207198, 'mardis05_i'),\n", " (0.00556056629225854, 'marend01_i'),\n", " (0.005343277856898254, 'marend02_i'),\n", " (0.005329489935103182, 'marend03_i'),\n", " (0.005344829446110255, 'marend04_i'),\n", " (0.005466515793662197, 'fmar1age_i'),\n", " (0.006229712679169608, 'agediss1_i'),\n", " (0.005818093852332562, 'agedd1_i'),\n", " (0.005952534717923896, 'mar1diss_i'),\n", " (0.005562511474124343, 'dd1remar_i'),\n", " (0.0056514514657068915, 'mar1bir1_i'),\n", " (0.005519846049074295, 'mar1con1_i'),\n", " (0.005451844769302938, 'con1mar1_i'),\n", " (0.005682316811192689, 'b1premar_i'),\n", " (0.006692214376486705, 'cohab1_i'),\n", " (0.005675247994851085, 'cohstat_i'),\n", " (0.005512329697066498, 'cohout_i'),\n", " (0.006334563438736507, 'coh1dur_i'),\n", " (0.006043991819789318, 'sexever_i'),\n", " (0.0053941533797434715, 'vry1stag_i'),\n", " (0.005426725797249454, 'sex1age_i'),\n", " (0.005541139244667259, 'vry1stsx_i'),\n", " (0.0056974344065520155, 'datesex1_i'),\n", " (0.005352361370883685, 'fsexpage_i'),\n", " (0.0054768804615887845, 'sexmar_i'),\n", " (0.0055561700178216045, 'sex1for_i'),\n", " (0.0053912178048813875, 'parts1yr_i'),\n", " 
(0.005329254109874948, 'lsexdate_i'),\n", " (0.005436061430343253, 'lsexrage_i'),\n", " (0.005582651346653922, 'lifprtnr_i'),\n", " (0.005600500611765757, 'strloper'),\n", " (0.005363580062538564, 'tubs'),\n", " (0.006426387530719002, 'vasect'),\n", " (0.0055320684576875, 'hyst'),\n", " (0.0053307160638501605, 'ovarect'),\n", " (0.005518449429605998, 'othr'),\n", " (0.005817702615502629, 'othrm'),\n", " (0.005480431084785797, 'fecund'),\n", " (0.006048959480912219, 'anybc36'),\n", " (0.006011414881141541, 'nosex36'),\n", " (0.006212312054061253, 'infert'),\n", " (0.005921888449095469, 'anybc12'),\n", " (0.005328912963906918, 'anymthd'),\n", " (0.005580860850871838, 'nosex12'),\n", " (0.005549719784021412, 'sexp3mo'),\n", " (0.005984026261863229, 'sex3mo'),\n", " (0.006202060748728977, 'constat1'),\n", " (0.005357513743306619, 'constat2'),\n", " (0.005370972720360245, 'constat3'),\n", " (0.005465488617125702, 'constat4'),\n", " (0.006346969014618287, 'pillr'),\n", " (0.005464121795870747, 'condomr'),\n", " (0.005455089604600505, 'sex1mthd1'),\n", " (0.004003403776326686, 'sex1mthd2'),\n", " (0.006175243601095892, 'mthuse12'),\n", " (0.0072929966330781415, 'meth12m1'),\n", " (0.006372415393107289, 'mthuse3'),\n", " (0.006873907333843077, 'meth3m1'),\n", " (0.005951079009971272, 'nump3mos'),\n", " (0.005765060026368007, 'fmethod1'),\n", " (0.006312363569584423, 'dateuse1'),\n", " (0.005473240225484011, 'oldwp01'),\n", " (0.0061220725634859585, 'oldwp02'),\n", " (0.006566679752625704, 'oldwp03'),\n", " (0.0062996315229642, 'oldwr01'),\n", " (0.007236400760750383, 'oldwr02'),\n", " (0.009306831251880698, 'oldwr03'),\n", " (0.006285282977210982, 'wantrp01'),\n", " (0.007236400760750383, 'wantrp02'),\n", " (0.009306831251880698, 'wantrp03'),\n", " (0.0054865812407531855, 'wantp01'),\n", " (0.0060748063719646694, 'wantp02'),\n", " (0.00661996064952286, 'wantp03'),\n", " (0.00548528475124066, 'wantp5'),\n", " (0.005448957689459855, 'infert_i'),\n", " (0.0067450133507920285, 
'nosex12_i'),\n", " (0.005480667731327604, 'sexp3mo_i'),\n", " (0.0053272936863250075, 'sex3mo_i'),\n", " (0.005829898683175627, 'constat1_i'),\n", " (0.0054638333403213, 'constat2_i'),\n", " (0.005562510845659396, 'constat3_i'),\n", " (0.005562510845659396, 'constat4_i'),\n", " (0.00532790310884812, 'pillr_i'),\n", " (0.00532790310884812, 'condomr_i'),\n", " (0.005480538682948177, 'sex1mthd1_i'),\n", " (0.005618082757721465, 'sex1mthd2_i'),\n", " (0.005618082757721465, 'sex1mthd3_i'),\n", " (0.005618082757721465, 'sex1mthd4_i'),\n", " (0.0053273478704319865, 'mthuse12_i'),\n", " (0.0053352895922493815, 'meth12m1_i'),\n", " (0.005328505216348645, 'meth12m2_i'),\n", " (0.00532850671281937, 'meth12m3_i'),\n", " (0.00532850671281937, 'meth12m4_i'),\n", " (0.005352822010079472, 'mthuse3_i'),\n", " (0.005374562641145442, 'meth3m1_i'),\n", " (0.005346414776090325, 'meth3m2_i'),\n", " (0.005338022356888628, 'meth3m3_i'),\n", " (0.005338022356888628, 'meth3m4_i'),\n", " (0.0062734191728233135, 'nump3mos_i'),\n", " (0.005378877823191908, 'fmethod1_i'),\n", " (0.006076376776896986, 'dateuse1_i'),\n", " (0.005368456842452463, 'sourcem1_i'),\n", " (0.00534327273291324, 'sourcem2_i'),\n", " (0.005333841105334969, 'sourcem3_i'),\n", " (0.005333841105334969, 'sourcem4_i'),\n", " (0.005665470931053074, 'oldwp01_i'),\n", " (0.005532484672147064, 'oldwp02_i'),\n", " (0.006038794473755105, 'oldwp03_i'),\n", " (0.005995955760873195, 'oldwp04_i'),\n", " (0.005457954823496536, 'oldwp05_i'),\n", " (0.005401403098434621, 'oldwp06_i'),\n", " (0.0054589299665898094, 'oldwp07_i'),\n", " (0.0057501139160012205, 'oldwp08_i'),\n", " (0.005364106218509024, 'oldwr01_i'),\n", " (0.005490172335675836, 'oldwr02_i'),\n", " (0.005363893387949736, 'oldwr03_i'),\n", " (0.005640306695205988, 'oldwr04_i'),\n", " (0.0053813426061590786, 'oldwr05_i'),\n", " (0.005396951687654639, 'oldwr06_i'),\n", " (0.0055298718389822366, 'oldwr07_i'),\n", " (0.005587196021343832, 'oldwr08_i'),\n", " (0.005365193139261981, 
'oldwr09_i'),\n", " (0.005364106218509024, 'wantrp01_i'),\n", " (0.005490172335675836, 'wantrp02_i'),\n", " (0.005363893387949736, 'wantrp03_i'),\n", " (0.005640306695205988, 'wantrp04_i'),\n", " (0.0053813426061590786, 'wantrp05_i'),\n", " (0.005396951687654639, 'wantrp06_i'),\n", " (0.0055298718389822366, 'wantrp07_i'),\n", " (0.005587196021343832, 'wantrp08_i'),\n", " (0.005365193139261981, 'wantrp09_i'),\n", " (0.0054993860861720645, 'wantp01_i'),\n", " (0.005741447585054793, 'wantp02_i'),\n", " (0.005728830977663635, 'wantp03_i'),\n", " (0.006172654291036195, 'wantp04_i'),\n", " (0.005648956835502261, 'wantp05_i'),\n", " (0.00538592984977615, 'wantp06_i'),\n", " (0.005629613043803383, 'wantp07_i'),\n", " (0.005779919154650259, 'wantp08_i'),\n", " (0.006596791609596697, 'wantp5_i'),\n", " (0.00535023906134624, 'fptit12_i'),\n", " (0.0054279461800811335, 'fptitmed_i'),\n", " (0.005346447844715718, 'fpregmed_i'),\n", " (0.0055473444255559334, 'r_stclin'),\n", " (0.005492834434038474, 'intent'),\n", " (0.00532913283383607, 'addexp'),\n", " (0.005959748183472446, 'intent_i'),\n", " (0.005862368582374766, 'addexp_i'),\n", " (0.005328595910998213, 'anyprghp'),\n", " (0.005834911540965271, 'anymschp'),\n", " (0.005608926827060601, 'infever'),\n", " (0.0058056953191234495, 'pidtreat'),\n", " (0.00532728544675054, 'evhivtst'),\n", " (0.005490881370955658, 'anyprghp_i'),\n", " (0.005696589912781214, 'anymschp_i'),\n", " (0.0055920576323432725, 'infever_i'),\n", " (0.005490881370955658, 'ovulate_i'),\n", " (0.005490881370955658, 'tubes_i'),\n", " (0.005490881370955658, 'infertr_i'),\n", " (0.005490881370955658, 'inferth_i'),\n", " (0.005490881370955658, 'advice_i'),\n", " (0.005490881370955658, 'insem_i'),\n", " (0.005490881370955658, 'invitro_i'),\n", " (0.005490881370955658, 'endomet_i'),\n", " (0.005490881370955658, 'fibroids_i'),\n", " (0.0053273436347913705, 'pidtreat_i'),\n", " (0.005384796955323012, 'evhivtst_i'),\n", " (0.005444228863618061, 'insuranc_r'),\n", " 
(0.005908687699079596, 'metro_r'),\n", " (0.005476246226178816, 'religion_r'),\n", " (0.006124250620027971, 'laborfor_r'),\n", " (0.005490192077694744, 'insuranc_i_r'),\n", " (0.005329126750794222, 'religion_i_r'),\n", " (0.005635393818939516, 'laborfor_i_r'),\n", " (0.009743158975296873, 'poverty_r'),\n", " (0.011870069031173158, 'totincr'),\n", " (0.00988503292074805, 'pubassis_r'),\n", " (0.005674668721373788, 'poverty_i_r'),\n", " (0.005674668721373788, 'totincr_i'),\n", " (0.005588263662201554, 'pubassis_i_r'),\n", " (0.007266083159805259, 'basewgt_r'),\n", " (0.006863344757269019, 'adj_mod_basewgt_r'),\n", " (0.007414601906967189, 'finalwgt_r'),\n", " (0.006008491880136968, 'secu_r'),\n", " (0.00540529186826777, 'sest_r'),\n", " (0.005425914889651273, 'cmintvw_r'),\n", " (0.005425914889651051, 'cmlstyr'),\n", " (0.005823670091816058, 'intvlngth')]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "variables = GoMining(join)\n", "variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following functions report the variables with the highest values of $R^2$." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "def ReadVariables():\n", "    \"\"\"Reads Stata dictionary files for NSFG data.\n", "\n", "    returns: DataFrame that maps variable names to descriptions\n", "    \"\"\"\n", "    vars1 = thinkstats2.ReadStataDct('2002FemPreg.dct').variables\n", "    vars2 = thinkstats2.ReadStataDct('2002FemResp.dct').variables\n", "\n", "    all_vars = pd.concat([vars1, vars2])\n", "    all_vars.index = all_vars.name\n", "    return all_vars\n", "\n", "def MiningReport(variables, n=30):\n", "    \"\"\"Prints variables with the highest R^2.\n", "\n", "    variables: list of (R^2, variable name) pairs; sorted in place\n", "    n: number of pairs to print\n", "    \"\"\"\n", "    all_vars = ReadVariables()\n", "\n", "    variables.sort(reverse=True)\n", "    for r2, name in variables[:n]:\n", "        key = re.sub('_r$', '', name)\n", "        try:\n", "            desc = all_vars.loc[key].desc\n", "            if isinstance(desc, pd.Series):\n", "                desc = desc.iloc[0]\n", "            print(name, r2, desc)\n", "        except (KeyError, IndexError):\n", "            print(name, r2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the variables that do well are not useful for prediction because they are not known ahead of time; birth weight in pounds (`birthwgt_lb`), for example, is recorded only after the birth."
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "totalwgt_lb 1.0\n", "birthwgt_lb 0.9498127305978009 BD-3 BIRTHWEIGHT IN POUNDS - 1ST BABY FROM THIS PREGNANCY\n", "lbw1 0.3008240784470769 LOW BIRTHWEIGHT - BABY 1\n", "prglngth 0.13012519488625085 DURATION OF COMPLETED PREGNANCY IN WEEKS\n", "wksgest 0.12340041363361076 GESTATIONAL LENGTH OF COMPLETED PREGNANCY (IN WEEKS)\n", "agecon 0.10203149928156052 AGE AT TIME OF CONCEPTION\n", "mosgest 0.02714427463957958 GESTATIONAL LENGTH OF COMPLETED PREGNANCY (IN MONTHS)\n", "babysex 0.018550925293942533 BD-2 SEX OF 1ST LIVEBORN BABY FROM THIS PREGNANCY\n", "race_r 0.016199503586253106 RACE\n", "race 0.016199503586253106 RACE\n", "nbrnaliv 0.016017752709788113 BC-2 NUMBER OF BABIES BORN ALIVE FROM THIS PREGNANCY\n", "paydu 0.014003795578114597 IB-10 CURRENT LIVING QUARTERS OWNED/RENTED, ETC\n", "rmarout03 0.01343006646571343 INFORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 3RD\n", "birthwgt_oz 0.013102457615706498 BD-3 BIRTHWEIGHT IN OUNCES - 1ST BABY FROM THIS PREGNANCY\n", "anynurse 0.012529022541810653 BH-1 WHETHER R BREASTFED THIS CHILD AT ALL - 1ST FROM THIS PREG\n", "bfeedwks 0.012193688404495417 DURATION OF BREASTFEEDING IN WEEKS\n", "totincr 0.011870069031173158 TOTAL INCOME OF R'S FAMILY\n", "marout03 0.011807801994375033 FORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 3RD\n", "marcon03 0.011752599354395321 FORMAL MARITAL STATUS WHEN PREGNANCY BEGAN - 3RD\n", "cebow 0.011437770919637269 NUMBER OF CHILDREN BORN OUT OF WEDLOCK\n", "rmarout01 0.011407737138640073 INFORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 1ST\n", "rmarout6 0.011354138472805753 INFORMAL MARITAL STATUS AT PREGNANCY OUTCOME - 6 CATEGORIES\n", "marout01 0.011269357246806444 FORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 1ST\n", "hisprace_r 0.01123834930203138 RACE AND HISPANIC ORIGIN\n", "hisprace 0.01123834930203138 RACE AND HISPANIC ORIGIN\n", "mar1diss 
0.0109615635907514 MONTHS BTW/1ST MARRIAGE & DISSOLUTION (OR INTERVIEW)\n", "fmarcon5 0.0106049646842995 FORMAL MARITAL STATUS AT CONCEPTION - 5 CATEGORIES\n", "rmarout02 0.010546913206564978 INFORMAL MARITAL STATUS WHEN PREGNANCY ENDED - 2ND\n", "marcon02 0.01048140179553414 FORMAL MARITAL STATUS WHEN PREGNANCY BEGAN - 2ND\n", "fmarout5 0.010461691367377068 FORMAL MARITAL STATUS AT PREGNANCY OUTCOME\n" ] } ], "source": [ "MiningReport(variables)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining the variables that seem to have the most explanatory power." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: totalwgt_lb R-squared: 0.060
Model: OLS Adj. R-squared: 0.059
Method: Least Squares F-statistic: 79.98
Date: Sun, 09 Apr 2023 Prob (F-statistic): 4.86e-113
Time: 09:52:11 Log-Likelihood: -14295.
No. Observations: 8781 AIC: 2.861e+04
Df Residuals: 8773 BIC: 2.866e+04
Df Model: 7
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
Intercept 6.6303 0.065 102.223 0.000 6.503 6.757
C(race)[T.2] 0.3570 0.032 11.215 0.000 0.295 0.419
C(race)[T.3] 0.2665 0.051 5.175 0.000 0.166 0.367
babysex == 1[T.True] 0.2952 0.026 11.216 0.000 0.244 0.347
nbrnaliv > 1[T.True] -1.3783 0.108 -12.771 0.000 -1.590 -1.167
paydu == 1[T.True] 0.1196 0.031 3.861 0.000 0.059 0.180
agepreg 0.0074 0.003 2.921 0.004 0.002 0.012
totincr 0.0122 0.004 3.110 0.002 0.005 0.020
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 398.813 Durbin-Watson: 1.604
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1388.362
Skew: -0.037 Prob(JB): 3.32e-302
Kurtosis: 4.947 Cond. No. 221.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: totalwgt_lb R-squared: 0.060\n", "Model: OLS Adj. R-squared: 0.059\n", "Method: Least Squares F-statistic: 79.98\n", "Date: Sun, 09 Apr 2023 Prob (F-statistic): 4.86e-113\n", "Time: 09:52:11 Log-Likelihood: -14295.\n", "No. Observations: 8781 AIC: 2.861e+04\n", "Df Residuals: 8773 BIC: 2.866e+04\n", "Df Model: 7 \n", "Covariance Type: nonrobust \n", "========================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------\n", "Intercept 6.6303 0.065 102.223 0.000 6.503 6.757\n", "C(race)[T.2] 0.3570 0.032 11.215 0.000 0.295 0.419\n", "C(race)[T.3] 0.2665 0.051 5.175 0.000 0.166 0.367\n", "babysex == 1[T.True] 0.2952 0.026 11.216 0.000 0.244 0.347\n", "nbrnaliv > 1[T.True] -1.3783 0.108 -12.771 0.000 -1.590 -1.167\n", "paydu == 1[T.True] 0.1196 0.031 3.861 0.000 0.059 0.180\n", "agepreg 0.0074 0.003 2.921 0.004 0.002 0.012\n", "totincr 0.0122 0.004 3.110 0.002 0.005 0.020\n", "==============================================================================\n", "Omnibus: 398.813 Durbin-Watson: 1.604\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 1388.362\n", "Skew: -0.037 Prob(JB): 3.32e-302\n", "Kurtosis: 4.947 Cond. No. 
221.\n", "==============================================================================\n", "\n", "Notes:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula = ('totalwgt_lb ~ agepreg + C(race) + babysex==1 + '\n", " 'nbrnaliv>1 + paydu==1 + totincr')\n", "results = smf.ols(formula, data=join).fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic regression\n", "\n", "Example: suppose we are trying to predict `y` using explanatory variables `x1` and `x2`." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "y = np.array([0, 1, 0, 1])\n", "x1 = np.array([0, 0, 0, 1])\n", "x2 = np.array([0, 1, 1, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the logit model, the log odds for each element of $y$, given the corresponding elements of $x_1$ and $x_2$, is\n", "\n", "$\\log o = \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2$\n", "\n", "So let's start with an arbitrary guess about the elements of $\\beta$:\n", "\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "beta = [-1.5, 2.8, 1.1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plugging these values into the model, we get the log odds." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-1.5, -0.4, -0.4, 2.4])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "log_o = beta[0] + beta[1] * x1 + beta[2] * x2\n", "log_o" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which we can convert to odds."
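, "\n", "\n", "As a quick worked check, assuming the arbitrary guess for `beta` above (not fitted values), the odds are the exponential of the log odds:\n", "\n", "$o = \\exp(\\beta_0 + \\beta_1 x_1 + \\beta_2 x_2)$\n", "\n", "For the first element, $x_1 = x_2 = 0$, so $\\log o = -1.5$ and $o = e^{-1.5} \\approx 0.223$. For the last element, $x_1 = x_2 = 1$, so $\\log o = -1.5 + 2.8 + 1.1 = 2.4$ and $o = e^{2.4} \\approx 11.02$."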
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.22313016, 0.67032005, 0.67032005, 11.02317638])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "o = np.exp(log_o)\n", "o" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then convert to probabilities." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.18242552, 0.40131234, 0.40131234, 0.9168273 ])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p = o / (o+1)\n", "p" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The likelihoods of the actual outcomes are $p$ where $y$ is 1 and $1-p$ where $y$ is 0. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.81757448, 0.40131234, 0.59868766, 0.9168273 ])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "likes = np.where(y, p, 1-p)\n", "likes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The likelihood of $y$ given $\\beta$ is the product of `likes`:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1800933529673034" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "like = np.prod(likes)\n", "like" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Logistic regression works by searching for the values in $\\beta$ that maximize `like`.\n", "\n", "Here's an example using variables in the NSFG respondent file to predict whether a baby will be a boy or a girl." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "import first\n", "live, firsts, others = first.MakeFrames()\n", "live = live[live.prglngth>30]\n", "live['boy'] = (live.babysex==1).astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mother's age seems to have a small effect, but it is not statistically significant (p-value 0.78)." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.693015\n", " Iterations 3\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "\n", "\"\"\"\n", " Logit Regression Results \n", "==============================================================================\n", "Dep. Variable: boy No. Observations: 8884\n", "Model: Logit Df Residuals: 8882\n", "Method: MLE Df Model: 1\n", "Date: Sun, 09 Apr 2023 Pseudo R-squ.: 6.144e-06\n", "Time: 09:52:12 Log-Likelihood: -6156.7\n", "converged: True LL-Null: -6156.8\n", "Covariance Type: nonrobust LLR p-value: 0.7833\n", "==============================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "Intercept 0.0058 0.098 0.059 0.953 -0.185 0.197\n", "agepreg 0.0010 0.004 0.275 0.783 -0.006 0.009\n", "==============================================================================\n", "\"\"\"" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = smf.logit('boy ~ agepreg', data=live)\n", "results = model.fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the variables that seemed most promising." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.692944\n", " Iterations 3\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
" ], "text/plain": [ "\n", "\"\"\"\n", " Logit Regression Results \n", "==============================================================================\n", "Dep. Variable: boy No. Observations: 8782\n", "Model: Logit Df Residuals: 8776\n", "Method: MLE Df Model: 5\n", "Date: Sun, 09 Apr 2023 Pseudo R-squ.: 0.0001440\n", "Time: 09:52:12 Log-Likelihood: -6085.4\n", "converged: True LL-Null: -6086.3\n", "Covariance Type: nonrobust LLR p-value: 0.8822\n", "================================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "--------------------------------------------------------------------------------\n", "Intercept -0.0301 0.104 -0.290 0.772 -0.234 0.173\n", "C(race)[T.2] -0.0224 0.051 -0.439 0.660 -0.122 0.077\n", "C(race)[T.3] -0.0005 0.083 -0.005 0.996 -0.163 0.162\n", "agepreg -0.0027 0.006 -0.484 0.629 -0.014 0.008\n", "hpagelb 0.0047 0.004 1.112 0.266 -0.004 0.013\n", "birthord 0.0050 0.022 0.227 0.821 -0.038 0.048\n", "================================================================================\n", "\"\"\"" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formula = 'boy ~ agepreg + hpagelb + birthord + C(race)'\n", "model = smf.logit(formula, data=live)\n", "results = model.fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make a prediction, we have to extract the exogenous and endogenous variables." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "endog = pd.DataFrame(model.endog, columns=[model.endog_names])\n", "exog = pd.DataFrame(model.exog, columns=model.exog_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The baseline prediction strategy is to guess \"boy\". In that case, we're right almost 51% of the time." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.507173764518333" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "actual = endog['boy']\n", "baseline = actual.mean()\n", "baseline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we use the previous model, we can compute the number of predictions we get right." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3944.0, 548.0)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predict = (results.predict() >= 0.5)\n", "true_pos = predict * actual\n", "true_neg = (1 - predict) * (1 - actual)\n", "sum(true_pos), sum(true_neg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the accuracy, which is slightly higher than the baseline." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5115007970849464" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acc = (sum(true_pos) + sum(true_neg)) / len(actual)\n", "acc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make a prediction for an individual, we have to get their information into a `DataFrame`." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.513091\n", "dtype: float64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = ['agepreg', 'hpagelb', 'birthord', 'race']\n", "new = pd.DataFrame([[35, 39, 3, 2]], columns=columns)\n", "y = results.predict(new)\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This person has a 51% chance of having a boy (according to the model)." 
] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**Exercise:** Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "import first\n", "live, firsts, others = first.MakeFrames()\n", "live = live[live.prglngth>30]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** The Trivers-Willard hypothesis suggests that for many mammals the sex ratio depends on “maternal condition”; that is, factors like the mother’s age, size, health, and social status. See https://en.wikipedia.org/wiki/Trivers-Willard_hypothesis\n", "\n", "Some studies have shown this effect among humans, but results are mixed. In this chapter we tested some variables related to these factors, but didn’t find any with a statistically significant effect on sex ratio.\n", "\n", "As an exercise, use a data mining approach to test the other variables in the pregnancy and respondent files. Can you find any factors with a substantial effect?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called `poisson`. It works the same way as `ols` and `logit`. As an exercise, let’s use it to predict how many children a woman has borne; in the NSFG dataset, this variable is called `numbabes`.\n", "\n", "Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has borne?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can predict the number of children for a woman who is 35 years old, black, and a college\n", "graduate whose annual household income exceeds $75,000." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called `mnlogit`. 
As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called `rmarital`.\n", "\n", "Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, etc.?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a prediction for a woman who is 25 years old, white, and a high\n", "school graduate whose annual household income is about $45,000." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" } }, "nbformat": 4, "nbformat_minor": 1 }