{ "cells": [ { "cell_type": "markdown", "id": "9e8b40d2", "metadata": {}, "source": [ "**Notes**:\n", "This notebook prepares the [example sleep data - sleep.csv](https://github.com/LSYS/pyforestplot/blob/main/examples/data/sleep.csv).\n", "\n", "The resulting CSV file ([sleep.csv](https://github.com/LSYS/pyforestplot/blob/main/examples/data/sleep.csv)) shows how certain individual characteristics correlate with the amount of sleep one gets per week.\n", "Each row is a variable correlated with sleep. Columns include the computed Pearson correlation coefficient, sample size, p-value, 95% confidence interval, etc.\n", "The `pingouin` package is used to compute the correlations.\n", "\n", "**Raw src**:\n", "* `sleep75.csv` (/wooldridge/sleep75) from https://vincentarelbundock.github.io/Rdatasets/articles/data.html\n", "* See https://rdrr.io/cran/wooldridge/man/sleep75.html for the labels of the variables in `sleep75.csv`.\n", "\n", "\n", "\n", "**Requirements**: Mainly `pingouin`. See the imports in the first code cell for the full requirements." ] }, { "cell_type": "code", "execution_count": 1, "id": "e05019fe", "metadata": { "ExecuteTime": { "end_time": "2022-09-18T05:01:11.859723Z", "start_time": "2022-09-18T05:01:04.107411Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead><tr><th></th><th>age</th><th>black</th><th>clerical</th><th>construc</th><th>educ</th><th>earns74</th><th>gdhlth</th><th>inlf</th><th>smsa</th><th>lhrwage</th><th>...</th><th>spwrk75</th><th>totwrk</th><th>union</th><th>worknrm</th><th>workscnd</th><th>exper</th><th>yngkid</th><th>yrsmarr</th><th>hrwage</th><th>agesq</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>1</th><td>32</td><td>0</td><td>0.0</td><td>0.0</td><td>12</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1.955861</td><td>...</td><td>0</td><td>3438</td><td>0</td><td>3438</td><td>0</td><td>14</td><td>0</td><td>13</td><td>7.070004</td><td>1024</td></tr>\n",
"<tr><th>2</th><td>31</td><td>0</td><td>0.0</td><td>0.0</td><td>14</td><td>9500</td><td>1</td><td>1</td><td>0</td><td>0.357674</td><td>...</td><td>0</td><td>5020</td><td>0</td><td>5020</td><td>0</td><td>11</td><td>0</td><td>0</td><td>1.429999</td><td>961</td></tr>\n",
"<tr><th>3</th><td>44</td><td>0</td><td>0.0</td><td>0.0</td><td>17</td><td>42500</td><td>1</td><td>1</td><td>1</td><td>3.021887</td><td>...</td><td>1</td><td>2815</td><td>0</td><td>2815</td><td>0</td><td>21</td><td>0</td><td>0</td><td>20.529997</td><td>1936</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>3 rows × 30 columns</p>\n",
"</div>
" ], "text/plain": [ " age black clerical construc educ earns74 gdhlth inlf smsa \\\n", "1 32 0 0.0 0.0 12 0 0 1 0 \n", "2 31 0 0.0 0.0 14 9500 1 1 0 \n", "3 44 0 0.0 0.0 17 42500 1 1 1 \n", "\n", " lhrwage ... spwrk75 totwrk union worknrm workscnd exper yngkid \\\n", "1 1.955861 ... 0 3438 0 3438 0 14 0 \n", "2 0.357674 ... 0 5020 0 5020 0 11 0 \n", "3 3.021887 ... 1 2815 0 2815 0 21 0 \n", "\n", " yrsmarr hrwage agesq \n", "1 13 7.070004 1024 \n", "2 0 1.429999 961 \n", "3 0 20.529997 1936 \n", "\n", "[3 rows x 30 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import pingouin as pg\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "_url = \"https://vincentarelbundock.github.io/Rdatasets/csv/wooldridge/sleep75.csv\"\n", "drop_var = ['case', 'leis1', 'leis2', 'leis3']\n", "df = (pd.read_csv(_url, index_col=0)\n", " .drop(drop_var, axis=1)\n", " )\n", "df.head(3)" ] }, { "cell_type": "code", "execution_count": 2, "id": "e66b467a", "metadata": { "ExecuteTime": { "end_time": "2022-09-18T05:01:11.892027Z", "start_time": "2022-09-18T05:01:11.862006Z" }, "code_folding": [], "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead><tr><th></th><th>var</th><th>group</th><th>label</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>age</td><td>age</td><td>in years</td></tr>\n",
"<tr><th>1</th><td>black</td><td>other factors</td><td>=1 if black</td></tr>\n",
"<tr><th>2</th><td>clerical</td><td>occupation</td><td>=1 if clerical worker</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " var group label\n", "0 age age in years\n", "1 black other factors =1 if black\n", "2 clerical occupation =1 if clerical worker" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Prep variable lablels (fold cell)\n", "# varlabels: http://fmwww.bc.edu/ec-p/data/wooldridge/sleep75.des\n", "df_label = (pd.read_csv('data/sleep75-des.csv', encoding=\"ISO-8859-1\")\n", " .assign(label=lambda df: df['des'].str.encode('ascii', 'ignore').str.decode('ascii'))\n", " .drop(['des'], axis=1)\n", " .set_index('var')\n", " .drop(drop_var)\n", " .reset_index()\n", " )\n", "\n", "df_label.head(3)" ] }, { "cell_type": "code", "execution_count": 3, "id": "efdb6621", "metadata": { "ExecuteTime": { "end_time": "2022-09-18T05:01:14.801972Z", "start_time": "2022-09-18T05:01:11.897008Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead><tr><th></th><th>n</th><th>r</th><th>CI95%</th><th>p-val</th><th>BF10</th><th>power</th><th>var</th><th>hl</th><th>ll</th><th>moerror</th><th>group</th><th>label</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>706</td><td>0.090373</td><td>[0.02, 0.16]</td><td>1.630887e-02</td><td>0.839</td><td>0.67</td><td>age</td><td>0.16</td><td>0.02</td><td>0.069627</td><td>age</td><td>in years</td></tr>\n",
"<tr><th>1</th><td>706</td><td>-0.027057</td><td>[-0.1, 0.05]</td><td>4.728889e-01</td><td>0.061</td><td>0.11</td><td>black</td><td>0.05</td><td>-0.10</td><td>0.077057</td><td>other factors</td><td>=1 if black</td></tr>\n",
"<tr><th>2</th><td>706</td><td>0.048081</td><td>[-0.03, 0.12]</td><td>2.019484e-01</td><td>0.106</td><td>0.25</td><td>clerical</td><td>0.12</td><td>-0.03</td><td>0.071919</td><td>occupation</td><td>=1 if clerical worker</td></tr>\n",
"<tr><th>3</th><td>706</td><td>0.041229</td><td>[-0.03, 0.11]</td><td>2.739475e-01</td><td>0.086</td><td>0.19</td><td>construc</td><td>0.11</td><td>-0.03</td><td>0.068771</td><td>occupation</td><td>=1 if construction worker</td></tr>\n",
"<tr><th>4</th><td>706</td><td>-0.095004</td><td>[-0.17, -0.02]</td><td>1.155151e-02</td><td>1.137</td><td>0.72</td><td>educ</td><td>-0.02</td><td>-0.17</td><td>0.075004</td><td>labor factors</td><td>years of schooling</td></tr>\n",
"<tr><th>5</th><td>706</td><td>-0.076890</td><td>[-0.15, -0.0]</td><td>4.110934e-02</td><td>0.378</td><td>0.53</td><td>earns74</td><td>-0.00</td><td>-0.15</td><td>0.076890</td><td>labor factors</td><td>total earnings, 1974</td></tr>\n",
"<tr><th>6</th><td>706</td><td>-0.102825</td><td>[-0.18, -0.03]</td><td>6.246660e-03</td><td>1.967</td><td>0.78</td><td>gdhlth</td><td>-0.03</td><td>-0.18</td><td>0.072825</td><td>health factors</td><td>=1 if in good or excel. health</td></tr>\n",
"<tr><th>7</th><td>706</td><td>-0.027126</td><td>[-0.1, 0.05]</td><td>4.717698e-01</td><td>0.061</td><td>0.11</td><td>inlf</td><td>0.05</td><td>-0.10</td><td>0.077126</td><td>labor factors</td><td>=1 if in labor force</td></tr>\n",
"<tr><th>8</th><td>706</td><td>-0.066997</td><td>[-0.14, 0.01]</td><td>7.524015e-02</td><td>0.229</td><td>0.43</td><td>smsa</td><td>0.01</td><td>-0.14</td><td>0.076997</td><td>area of residence</td><td>=1 if live in smsa</td></tr>\n",
"<tr><th>9</th><td>532</td><td>-0.067197</td><td>[-0.15, 0.02]</td><td>1.216222e-01</td><td>0.179</td><td>0.34</td><td>lhrwage</td><td>0.02</td><td>-0.15</td><td>0.087197</td><td>labor factors</td><td>log hourly wage</td></tr>\n",
"<tr><th>10</th><td>706</td><td>0.036661</td><td>[-0.04, 0.11]</td><td>3.306971e-01</td><td>0.076</td><td>0.16</td><td>lothinc</td><td>0.11</td><td>-0.04</td><td>0.073339</td><td>labor factors</td><td>log othinc, unless othinc &lt; 0</td></tr>\n",
"<tr><th>11</th><td>706</td><td>-0.035909</td><td>[-0.11, 0.04]</td><td>3.407214e-01</td><td>0.074</td><td>0.16</td><td>male</td><td>0.04</td><td>-0.11</td><td>0.075909</td><td>other factors</td><td>=1 if male</td></tr>\n",
"<tr><th>12</th><td>706</td><td>0.053757</td><td>[-0.02, 0.13]</td><td>1.536188e-01</td><td>0.13</td><td>0.30</td><td>marr</td><td>0.13</td><td>-0.02</td><td>0.076243</td><td>family factors</td><td>=1 if married</td></tr>\n",
"<tr><th>13</th><td>706</td><td>0.027147</td><td>[-0.05, 0.1]</td><td>4.714176e-01</td><td>0.061</td><td>0.11</td><td>prot</td><td>0.10</td><td>-0.05</td><td>0.072853</td><td>other factors</td><td>=1 if Protestant</td></tr>\n",
"<tr><th>14</th><td>706</td><td>0.867744</td><td>[0.85, 0.88]</td><td>6.051022e-216</td><td>6.697e+211</td><td>1.00</td><td>rlxall</td><td>0.88</td><td>0.85</td><td>0.012256</td><td>other sleep factors</td><td>slpnaps + personal activs</td></tr>\n",
"<tr><th>15</th><td>706</td><td>0.001782</td><td>[-0.07, 0.08]</td><td>9.623058e-01</td><td>0.047</td><td>0.05</td><td>selfe</td><td>0.08</td><td>-0.07</td><td>0.078218</td><td>labor factors</td><td>=1 if self employed</td></tr>\n",
"<tr><th>16</th><td>706</td><td>0.893043</td><td>[0.88, 0.91]</td><td>2.339108e-246</td><td>1.38e+242</td><td>1.00</td><td>slpnaps</td><td>0.91</td><td>0.88</td><td>0.016957</td><td>other sleep factors</td><td>minutes sleep, inc. naps</td></tr>\n",
"<tr><th>17</th><td>706</td><td>0.078600</td><td>[0.0, 0.15]</td><td>3.679946e-02</td><td>0.415</td><td>0.55</td><td>south</td><td>0.15</td><td>0.00</td><td>0.071400</td><td>area of residence</td><td>=1 if live in south</td></tr>\n",
"<tr><th>18</th><td>706</td><td>0.007881</td><td>[-0.07, 0.08]</td><td>8.344125e-01</td><td>0.048</td><td>0.06</td><td>spsepay</td><td>0.08</td><td>-0.07</td><td>0.072119</td><td>other factors</td><td>spousal wage income</td></tr>\n",
"<tr><th>19</th><td>706</td><td>0.007868</td><td>[-0.07, 0.08]</td><td>8.346888e-01</td><td>0.048</td><td>0.05</td><td>spwrk75</td><td>0.08</td><td>-0.07</td><td>0.072132</td><td>other factors</td><td>=1 if spouse works</td></tr>\n",
"<tr><th>20</th><td>706</td><td>-0.321384</td><td>[-0.39, -0.25]</td><td>1.994095e-18</td><td>1.961e+15</td><td>1.00</td><td>totwrk</td><td>-0.25</td><td>-0.39</td><td>0.071384</td><td>labor factors</td><td>mins worked per week</td></tr>\n",
"<tr><th>21</th><td>706</td><td>0.009965</td><td>[-0.06, 0.08]</td><td>7.915440e-01</td><td>0.049</td><td>0.06</td><td>union</td><td>0.08</td><td>-0.06</td><td>0.070035</td><td>labor factors</td><td>=1 if belong to union</td></tr>\n",
"<tr><th>22</th><td>706</td><td>-0.322300</td><td>[-0.39, -0.25]</td><td>1.577335e-18</td><td>2.471e+15</td><td>1.00</td><td>worknrm</td><td>-0.25</td><td>-0.39</td><td>0.072300</td><td>labor factors</td><td>mins work main job</td></tr>\n",
"<tr><th>23</th><td>706</td><td>0.001139</td><td>[-0.07, 0.07]</td><td>9.759034e-01</td><td>0.047</td><td>0.05</td><td>workscnd</td><td>0.07</td><td>-0.07</td><td>0.068861</td><td>labor factors</td><td>mins work second job</td></tr>\n",
"<tr><th>24</th><td>706</td><td>0.104191</td><td>[0.03, 0.18]</td><td>5.587422e-03</td><td>2.175</td><td>0.79</td><td>exper</td><td>0.18</td><td>0.03</td><td>0.075809</td><td>labor factors</td><td>age - educ - 6</td></tr>\n",
"<tr><th>25</th><td>706</td><td>-0.013262</td><td>[-0.09, 0.06]</td><td>7.250012e-01</td><td>0.05</td><td>0.06</td><td>yngkid</td><td>0.06</td><td>-0.09</td><td>0.073262</td><td>family factors</td><td>=1 if children &lt; 3 present</td></tr>\n",
"<tr><th>26</th><td>706</td><td>0.063997</td><td>[-0.01, 0.14]</td><td>8.928507e-02</td><td>0.199</td><td>0.40</td><td>yrsmarr</td><td>0.14</td><td>-0.01</td><td>0.076003</td><td>family factors</td><td>years married</td></tr>\n",
"<tr><th>27</th><td>532</td><td>-0.049450</td><td>[-0.13, 0.04]</td><td>2.548774e-01</td><td>0.104</td><td>0.21</td><td>hrwage</td><td>0.04</td><td>-0.13</td><td>0.089450</td><td>labor factors</td><td>hourly wage</td></tr>\n",
"<tr><th>28</th><td>706</td><td>0.099722</td><td>[0.03, 0.17]</td><td>8.010946e-03</td><td>1.574</td><td>0.76</td><td>agesq</td><td>0.17</td><td>0.03</td><td>0.070278</td><td>age</td><td>age^2</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " n r CI95% p-val BF10 power var \\\n", "0 706 0.090373 [0.02, 0.16] 1.630887e-02 0.839 0.67 age \n", "1 706 -0.027057 [-0.1, 0.05] 4.728889e-01 0.061 0.11 black \n", "2 706 0.048081 [-0.03, 0.12] 2.019484e-01 0.106 0.25 clerical \n", "3 706 0.041229 [-0.03, 0.11] 2.739475e-01 0.086 0.19 construc \n", "4 706 -0.095004 [-0.17, -0.02] 1.155151e-02 1.137 0.72 educ \n", "5 706 -0.076890 [-0.15, -0.0] 4.110934e-02 0.378 0.53 earns74 \n", "6 706 -0.102825 [-0.18, -0.03] 6.246660e-03 1.967 0.78 gdhlth \n", "7 706 -0.027126 [-0.1, 0.05] 4.717698e-01 0.061 0.11 inlf \n", "8 706 -0.066997 [-0.14, 0.01] 7.524015e-02 0.229 0.43 smsa \n", "9 532 -0.067197 [-0.15, 0.02] 1.216222e-01 0.179 0.34 lhrwage \n", "10 706 0.036661 [-0.04, 0.11] 3.306971e-01 0.076 0.16 lothinc \n", "11 706 -0.035909 [-0.11, 0.04] 3.407214e-01 0.074 0.16 male \n", "12 706 0.053757 [-0.02, 0.13] 1.536188e-01 0.13 0.30 marr \n", "13 706 0.027147 [-0.05, 0.1] 4.714176e-01 0.061 0.11 prot \n", "14 706 0.867744 [0.85, 0.88] 6.051022e-216 6.697e+211 1.00 rlxall \n", "15 706 0.001782 [-0.07, 0.08] 9.623058e-01 0.047 0.05 selfe \n", "16 706 0.893043 [0.88, 0.91] 2.339108e-246 1.38e+242 1.00 slpnaps \n", "17 706 0.078600 [0.0, 0.15] 3.679946e-02 0.415 0.55 south \n", "18 706 0.007881 [-0.07, 0.08] 8.344125e-01 0.048 0.06 spsepay \n", "19 706 0.007868 [-0.07, 0.08] 8.346888e-01 0.048 0.05 spwrk75 \n", "20 706 -0.321384 [-0.39, -0.25] 1.994095e-18 1.961e+15 1.00 totwrk \n", "21 706 0.009965 [-0.06, 0.08] 7.915440e-01 0.049 0.06 union \n", "22 706 -0.322300 [-0.39, -0.25] 1.577335e-18 2.471e+15 1.00 worknrm \n", "23 706 0.001139 [-0.07, 0.07] 9.759034e-01 0.047 0.05 workscnd \n", "24 706 0.104191 [0.03, 0.18] 5.587422e-03 2.175 0.79 exper \n", "25 706 -0.013262 [-0.09, 0.06] 7.250012e-01 0.05 0.06 yngkid \n", "26 706 0.063997 [-0.01, 0.14] 8.928507e-02 0.199 0.40 yrsmarr \n", "27 532 -0.049450 [-0.13, 0.04] 2.548774e-01 0.104 0.21 hrwage \n", "28 706 0.099722 [0.03, 0.17] 8.010946e-03 1.574 0.76 
agesq \n", "\n", " hl ll moerror group label \n", "0 0.16 0.02 0.069627 age in years \n", "1 0.05 -0.10 0.077057 other factors =1 if black \n", "2 0.12 -0.03 0.071919 occupation =1 if clerical worker \n", "3 0.11 -0.03 0.068771 occupation =1 if construction worker \n", "4 -0.02 -0.17 0.075004 labor factors years of schooling \n", "5 -0.00 -0.15 0.076890 labor factors total earnings, 1974 \n", "6 -0.03 -0.18 0.072825 health factors =1 if in good or excel. health \n", "7 0.05 -0.10 0.077126 labor factors =1 if in labor force \n", "8 0.01 -0.14 0.076997 area of residence =1 if live in smsa \n", "9 0.02 -0.15 0.087197 labor factors log hourly wage \n", "10 0.11 -0.04 0.073339 labor factors log othinc, unless othinc < 0 \n", "11 0.04 -0.11 0.075909 other factors =1 if male \n", "12 0.13 -0.02 0.076243 family factors =1 if married \n", "13 0.10 -0.05 0.072853 other factors =1 if Protestant \n", "14 0.88 0.85 0.012256 other sleep factors slpnaps + personal activs \n", "15 0.08 -0.07 0.078218 labor factors =1 if self employed \n", "16 0.91 0.88 0.016957 other sleep factors minutes sleep, inc. 
naps \n", "17 0.15 0.00 0.071400 area of residence =1 if live in south \n", "18 0.08 -0.07 0.072119 other factors spousal wage income \n", "19 0.08 -0.07 0.072132 other factors =1 if spouse works \n", "20 -0.25 -0.39 0.071384 labor factors mins worked per week \n", "21 0.08 -0.06 0.070035 labor factors =1 if belong to union \n", "22 -0.25 -0.39 0.072300 labor factors mins work main job \n", "23 0.07 -0.07 0.068861 labor factors mins work second job \n", "24 0.18 0.03 0.075809 labor factors age - educ - 6 \n", "25 0.06 -0.09 0.073262 family factors =1 if children < 3 present \n", "26 0.14 -0.01 0.076003 family factors years married \n", "27 0.04 -0.13 0.089450 labor factors hourly wage \n", "28 0.17 0.03 0.070278 age age^2 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute correlations\n", "df_corr = (pg.pairwise_corr(df)\n", " .rename(columns={'p-unc': 'p-val'})\n", " .query('Y==\"sleep\"|X==\"sleep\"')\n", " .assign(var=lambda df: df['X'])\n", " .assign(var=lambda df: np.where(df['var']==\"sleep\", df['Y'], df['var']))\n", " .drop([\"Y\", \"X\", \"method\", \"alternative\"], axis=1)\n", " .assign(\n", " hl=lambda df: [float(ci[1]) for ci in df['CI95%']],\n", " ll=lambda df: [float(ci[0]) for ci in df['CI95%']],\n", " moerror=lambda df: df['hl'] - df['r'],\n", " power=lambda df: df.power.round(decimals=2),\n", " n=lambda df: df.n.map(str)\n", " )\n", " # Get labels\n", " .merge(df_label, how='left', on='var', validate='1:1')\n", " .reset_index(drop=True)\n", " )\n", "df_corr" ] }, { "cell_type": "code", "execution_count": 4, "id": "c988902c", "metadata": { "ExecuteTime": { "end_time": "2022-09-18T05:01:14.852081Z", "start_time": "2022-09-18T05:01:14.801972Z" } }, "outputs": [], "source": [ "df_corr.to_csv('data/sleep-untruncated.csv', index=False)\n", "\n", "_drop = ['earns74', 'inlf', 'lothinc', 'workscnd', 'lhrwage', 'worknrm', \n", " 'spwrk75', 'marr', 'black', 'agesq', 'union', 'exper', 'rlxall', 
'slpnaps']\n", "df_corr.query('var not in @_drop').to_csv('data/sleep.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 5, "id": "d0b63fbd", "metadata": { "ExecuteTime": { "end_time": "2022-09-18T05:01:14.876068Z", "start_time": "2022-09-18T05:01:14.852081Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| | var | r | moerror | label | group | ll | hl | n | power | p-val |\n", "|---:|:---------|-----------:|----------:|:----------------------|:--------------|------:|-----:|----:|--------:|----------:|\n", "| 0 | age | 0.0903729 | 0.0696271 | in years | age | 0.02 | 0.16 | 706 | 0.67 | 0.0163089 |\n", "| 1 | black | -0.0270573 | 0.0770573 | =1 if black | other factors | -0.1 | 0.05 | 706 | 0.11 | 0.472889 |\n", "| 2 | clerical | 0.0480811 | 0.0719189 | =1 if clerical worker | occupation | -0.03 | 0.12 | 706 | 0.25 | 0.201948 |\n" ] } ], "source": [ "_cols = ['var', 'r', 'moerror', 'label', 'group', 'll', 'hl', 'n', 'power', 'p-val']\n", "print(df_corr[_cols].head(3).to_markdown())" ] }, { "cell_type": "code", "execution_count": null, "id": "d8a67137", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": 
"print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }