{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Health and Lifestyle Survey Questions Tutorial\n", "\n", "In this tutorial, we showcase how the Protodash explainer algorithm from AI Explainability 360 Toolkit implemented through the _ProtodashExplainer_ class could be used to summarize the National Health and Nutrition Examination Survey (NHANES) datasets ([Study 1](#study1)) available through the Center for Disease Control and Prevention (CDC). Moreover, we also show how the algorithm could be used to distill interesting relationships between different facets of life (i.e. early childhood and income), which were found by scientists ([Study 2](#study2)) through decades of rigorous experimentation. This study shows that in using Protodash, one can potentially uncover such insights cheaply, which could then be reaffirmed through rigorous experimentation.\n", "\n", " Data from this survey is typically used in epidemiological studies and health science research, which helps develop public health policy, direct and design health programs and services, and expand health knowledge. Thus, the impact of understanding these datasets and the relationships that may exist between them are far reaching for a social scientist." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Introduction to Center for Disease Control and Prevention (CDC) datasets\n", "\n", "The [NHANES CDC questionnaire datasets](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2013) are surveys conducted by the organization involving thousands of civilians about various facets of their daily lives. There are 44 questionnaires that collect data about income, occupation, health, early childhood and many other behavioral and lifestyle aspects of individuals living in the US. These questionnaires are thus a rich source of information indicative of the quality of life of many civilians. 
\n", "\n", "This tutorial presents two studies. We first see how a CDC questionnaire answered by thousands of individuals can be summarized by looking at answers given by a few prototypical users. Next, an interesting endeavor is to uncover relationships between different aspects of life by analyzing data across the different CDC questionnaires. In the second study, we do exactly that with the help of the Protodash explainer algorithm. We show how the algorithm is able to uncover, solely from the questionnaire datasets, an interesting [insight](https://www.theatlantic.com/business/archive/2016/07/social-mobility-america/491240/) previously known only through decades of experimentation. This by no means suggests the method as a substitute for rigorous experimentation, but showcases it as an avenue for obtaining interesting insights at low cost, which could inspire further in-depth studies. We accomplish this by finding prototypical individuals for each of the questionnaires and then evaluating how well they represent the income questionnaire (w.r.t. the method's objective function). The more representative these prototypes are, the more that questionnaire is indicative/representative of income. \n", "\n", "For this use case, we are selecting prototypes from specific questionnaires. Hence, the group we want to explain is the dataset itself, which in this case is the set of questionnaires. We are not training an AI model. Rather, we are trying to summarize each questionnaire, which was filled out by thousands of people, by selecting a few representative individuals for each of them.\n", "\n", "\n", "The rest of the tutorial is organized as follows:
\n", "[Explore Income questionnaire](#explore)
\n", "[Study 1: Summarize Income Questionnaire using Prototypes](#study1)
\n", "[Study 2: Find Questionnaire/s most representative of Income](#study2)
\n", "\n", "\n", "###### [Protodash: Fast Interpretable Prototype Selection](https://arxiv.org/abs/1707.01212)\n", "\n", "We now provide a brief overview of the method. The method takes as input a datapoint (or group of datapoints) that we want to explain with respect to instances in a training set belonging to the same feature space. The method then tries to minimize the maximum mean discrepancy (MMD) between the datapoints we want to explain and a prespecified number of instances that it selects from the training set. In other words, it tries to select training instances that have the same distribution as the datapoints we want to explain. The method performs greedy selection and comes with quality guarantees. It also returns importance weights for the chosen prototypical training instances, which indicate how similar/representative they are. \n", "\n", "\n", "###### Why Protodash?\n", "\n", "Before we showcase the two studies, we provide some motivation for using this method. The method selects, in a deterministic fashion, examples from a dataset that represent its different segments; we call these examples prototypes. For example, if we take people who answered the income questionnaire, we might find that there are three categories of people: i) those who are high earners, ii) those who are middle class and iii) those who don't earn much or are unemployed and receive unemployment benefits. Protodash would be able to find these segments by pointing to specific individuals that lie in these categories. Looking at the objective function value of Protodash, one would also be able to say that three segments is the right number here, as adding one more segment may not improve the objective value by much.\n", "\n", "Compared with other methods such as k-medoids clustering, where the centers are typically randomly initialized, it has the advantage of being deterministic. 
So the solutions are repeatable and it picks prototypes that are representative as well as diverse, which may not be the case with standard distance metrics such as Euclidean distance. Diversity is important in practical settings (viz. income example above) where we want to capture all the different segments/modes in the dataset, not missing any of the key behaviors.\n", "\n", "Another benefit of the method is that, since it performs distribution matching between the user(s) in question and those available in the training set, it could, in principle, also be applied in non-iid settings such as time series data. Other approaches which find similar profiles using standard distance measures (viz. Euclidean, cosine) do not have this property. Additionally, we can also highlight important features for the different prototypes that made them similar to the user(s) in question." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import Statements\n", "\n", "Import relevant libraries, datasets and Protodash explainer algorithm." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "import os\n", "import requests\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "from aix360.algorithms.protodash import ProtodashExplainer\n", "from aix360.datasets.cdc_dataset import CDCDataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load CDC dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "nhanes = CDCDataset()\n", "nhanes_files = nhanes.get_csv_file_names()\n", "(nhanesinfo, _, _) = nhanes._cdc_files_info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Explore Income questionnaire\n", "\n", "Now let us explore the income questionnaire dataset and find out the types of responses received in the survey. Each column in the dataset corresponds to a question and each row denotes the answers given by a respondent to those questions. Both column names and answers by respondents are encoded. For example, 'SEQN' denotes the sequence number assigned to a respondent and 'IND235' corresponds to a question about monthly family income. 
As seen below, in most cases a value of 1 implies \"Yes\" to the question, while a value of 2 implies \"No.\" More details about the income questionnaire and how questions and answers are encoded can be seen [here](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/INQ_H.htm).\n", "\n", "|Column |Description | Values and Meaning|\n", "|-------|----------------------------|---------|\n", "|SEQN | Respondent sequence number | |\n", "|INQ020 | Income from wages/salaries |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|INQ012 | Income from self employment|1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|INQ030 | Income from Social Security or RR |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|INQ060 | Income from other disability pension |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|INQ080 | Income from retirement/survivor pension |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|INQ090 | Income from Supplemental Security Income |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|INQ132 | Income from state/county cash assistance |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|INQ140 | Income from interest/dividends or rental |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|INQ150 | Income from other sources |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|IND235 | Monthly family income |1-12->Increasing income brackets, 77->Refused, 99->Don't know|\n", "|INDFMMPI | Family monthly poverty level index |0-5->Higher value more affluent|\n", "|INDFMMPC | Family monthly poverty level category |1-3->Increasing INDFMMPI brackets, 7->Refused, 9->Don't know|\n", "|INQ244 | Family has savings more than $5000 |1->Yes, 2->No, 7->Refused, 9->Don't know|\n", "|IND247 | Total savings/cash assets for the family |1-6->Increasing savings brackets, 77->Refused, 99->Don't know|" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Answers given by some respondents to the income questionnaire:\n" ] }, { "data": { 
"text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Respondent sequence number</th><td>73557.00</td><td>73558.00</td><td>73559.00</td><td>73560.00</td><td>73561.0</td></tr>\n",
"<tr><th>Income from wages/salaries</th><td>2.00</td><td>1.00</td><td>2.00</td><td>1.00</td><td>2.0</td></tr>\n",
"<tr><th>Income from self employment</th><td>2.00</td><td>1.00</td><td>2.00</td><td>2.00</td><td>2.0</td></tr>\n",
"<tr><th>Income from Social Security or RR</th><td>1.00</td><td>1.00</td><td>1.00</td><td>2.00</td><td>1.0</td></tr>\n",
"<tr><th>Income from other disability pension</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.0</td></tr>\n",
"<tr><th>Income from retirement/survivor pension</th><td>2.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>2.0</td></tr>\n",
"<tr><th>Income from Supplemental Security Income</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.0</td></tr>\n",
"<tr><th>Income from state/county cash assistance</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.0</td></tr>\n",
"<tr><th>Income from interest/dividends or rental</th><td>2.00</td><td>1.00</td><td>1.00</td><td>2.00</td><td>2.0</td></tr>\n",
"<tr><th>Income from other sources</th><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td><td>2.0</td></tr>\n",
"<tr><th>Monthly family income</th><td>4.00</td><td>5.00</td><td>10.00</td><td>9.00</td><td>11.0</td></tr>\n",
"<tr><th>Family monthly poverty level index</th><td>0.86</td><td>0.92</td><td>4.37</td><td>2.52</td><td>5.0</td></tr>\n",
"<tr><th>Family monthly poverty level category</th><td>1.00</td><td>1.00</td><td>3.00</td><td>3.00</td><td>3.0</td></tr>\n",
"<tr><th>Family has savings more than $5000</th><td>9.00</td><td>1.00</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>Total savings/cash assets for the family</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 0 1 2 \\\n", "Respondent sequence number 73557.00 73558.00 73559.00 \n", "Income from wages/salaries 2.00 1.00 2.00 \n", "Income from self employment 2.00 1.00 2.00 \n", "Income from Social Security or RR 1.00 1.00 1.00 \n", "Income from other disability pension 2.00 2.00 2.00 \n", "Income from retirement/survivor pension 2.00 2.00 1.00 \n", "Income from Supplemental Security Income 2.00 2.00 2.00 \n", "Income from state/county cash assistance 2.00 2.00 2.00 \n", "Income from interest/dividends or rental 2.00 1.00 1.00 \n", "Income from other sources 2.00 2.00 2.00 \n", "Monthly family income 4.00 5.00 10.00 \n", "Family monthly poverty level index 0.86 0.92 4.37 \n", "Family monthly poverty level category 1.00 1.00 3.00 \n", "Family has savings more than $5000 9.00 1.00 NaN \n", "Total savings/cash assets for the family NaN NaN NaN \n", "\n", " 3 4 \n", "Respondent sequence number 73560.00 73561.0 \n", "Income from wages/salaries 1.00 2.0 \n", "Income from self employment 2.00 2.0 \n", "Income from Social Security or RR 2.00 1.0 \n", "Income from other disability pension 2.00 2.0 \n", "Income from retirement/survivor pension 2.00 2.0 \n", "Income from Supplemental Security Income 2.00 2.0 \n", "Income from state/county cash assistance 2.00 2.0 \n", "Income from interest/dividends or rental 2.00 2.0 \n", "Income from other sources 1.00 2.0 \n", "Monthly family income 9.00 11.0 \n", "Family monthly poverty level index 2.52 5.0 \n", "Family monthly poverty level category 3.00 3.0 \n", "Family has savings more than $5000 NaN NaN \n", "Total savings/cash assets for the family NaN NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replace encoded column names by the associated question text. 
\n", "df_inc = nhanes.get_csv_file('INQ_H.csv')\n", "df_inc.columns[0]\n", "dict_inc = {\n", "'SEQN': 'Respondent sequence number', \n", "'INQ020': 'Income from wages/salaries',\n", "'INQ012': 'Income from self employment',\n", "'INQ030':'Income from Social Security or RR',\n", "'INQ060': 'Income from other disability pension', \n", "'INQ080': 'Income from retirement/survivor pension',\n", "'INQ090': 'Income from Supplemental Security Income',\n", "'INQ132': 'Income from state/county cash assistance', \n", "'INQ140': 'Income from interest/dividends or rental', \n", "'INQ150': 'Income from other sources',\n", "'IND235': 'Monthly family income',\n", "'INDFMMPI': 'Family monthly poverty level index', \n", "'INDFMMPC': 'Family monthly poverty level category',\n", "'INQ244': 'Family has savings more than $5000',\n", "'IND247': 'Total savings/cash assets for the family'\n", "}\n", "qlist = []\n", "for i in range(len(df_inc.columns)):\n", " qlist.append(dict_inc[df_inc.columns[i]])\n", "df_inc.columns = qlist\n", "print(\"Answers given by some respondents to the income questionnaire:\")\n", "df_inc.head(5).transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, to get more of a feel for the dataset, let us look at the distribution of responses for two questions related to family financial status." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of respondents to Income questionnaire: 10175\n", "Distribution of answers to 'monthly family income' and 'Family savings' questions:\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "print(\"Number of respondents to Income questionnaire:\", df_inc.shape[0])\n", "print(\"Distribution of answers to \\'monthly family income\\' and \\'Family savings\\' questions:\")\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(10,5))\n", "fig.subplots_adjust(wspace=0.5)\n", "hist1 = df_inc['Monthly family income'].value_counts().plot(kind='bar', ax=axes[0])\n", "hist2 = df_inc['Family has savings more than $5000'].value_counts().plot(kind='bar', ax=axes[1])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Observe that the majority of individuals responded with a \"12\" o the question related to monthly family income, which means their income is above USD 8400 as explained [here](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/INQ_H.htm#IND235). Similarly, to the question of whether the family has savings more than USD 5000, the majority of individuals responded with a \"2\", which means \"No\". " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Study 1: Summarize Income Questionnaire using Prototypes\n", "\n", "We just explored the income dataset and looked at the distribution of answers for a couple of questions. Now, consider a social scientist who would like to quickly obtain a summary report of this dataset in terms of types of people that span this dataset. Is it possible to summarize this dataset by looking at answers given by a few representative/prototypical respondents? \n", "\n", "We now show how the Protodash algorithm can be used to obtain a few prototypical respondents (about 10 in this example) that span the diverse set of individuals answering the income questionnaire making it easy for the social scientist to summarize the dataset." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# convert pandas dataframe to numpy\n", "data = df_inc.to_numpy()\n", "\n", "#sort the rows by sequence numbers in 1st column \n", "idx = np.argsort(data[:, 0]) \n", "data = data[idx, :]\n", "\n", "# replace nan's (missing values) with 0's\n", "original = data\n", "original[np.isnan(original)] = 0\n", "\n", "# delete 1st column (sequence numbers)\n", "original = original[:, 1:]\n", "\n", "# one hot encode all features as they are categorical\n", "onehot_encoder = OneHotEncoder(sparse=False)\n", "onehot_encoded = onehot_encoder.fit_transform(original)\n", "\n", "explainer = ProtodashExplainer()\n", "\n", "# call Protodash explainer\n", "# S contains indices of the selected prototypes\n", "# W contains importance weights associated with the selected prototypes \n", "(W, S, _) = explainer.explain(onehot_encoded, onehot_encoded, m=10) \n", "\n", "# sort the order of prototypes in set S\n", "idx = np.argsort(S)\n", "S = S[idx]\n", "W = W[idx]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>8</th><th>132</th><th>690</th><th>1475</th><th>2449</th><th>2912</th><th>3899</th><th>5077</th><th>6895</th><th>7475</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Respondent sequence number</th><td>73565.00</td><td>73689.00</td><td>74247.00</td><td>75032.00</td><td>76006.00</td><td>76469.00</td><td>77456.00</td><td>78634.00</td><td>80452.00</td><td>81032.00</td></tr>\n",
"<tr><th>Income from wages/salaries</th><td>1.00</td><td>1.00</td><td>2.00</td><td>1.00</td><td>1.00</td><td>1.00</td><td>1.00</td><td>1.00</td><td>1.00</td><td>2.00</td></tr>\n",
"<tr><th>Income from self employment</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td></tr>\n",
"<tr><th>Income from Social Security or RR</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>2.00</td><td>1.00</td></tr>\n",
"<tr><th>Income from other disability pension</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td></tr>\n",
"<tr><th>Income from retirement/survivor pension</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>2.00</td></tr>\n",
"<tr><th>Income from Supplemental Security Income</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td><td>2.00</td></tr>\n",
"<tr><th>Income from state/county cash assistance</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td><td>2.00</td></tr>\n",
"<tr><th>Income from interest/dividends or rental</th><td>2.00</td><td>1.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td></tr>\n",
"<tr><th>Income from other sources</th><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td><td>2.00</td></tr>\n",
"<tr><th>Monthly family income</th><td>12.00</td><td>11.00</td><td>8.00</td><td>4.00</td><td>1.00</td><td>6.00</td><td>7.00</td><td>2.00</td><td>5.00</td><td>3.00</td></tr>\n",
"<tr><th>Family monthly poverty level index</th><td>5.00</td><td>4.30</td><td>3.05</td><td>1.65</td><td>0.00</td><td>1.32</td><td>2.71</td><td>0.44</td><td>0.86</td><td>1.08</td></tr>\n",
"<tr><th>Family monthly poverty level category</th><td>3.00</td><td>3.00</td><td>3.00</td><td>2.00</td><td>1.00</td><td>2.00</td><td>3.00</td><td>1.00</td><td>1.00</td><td>1.00</td></tr>\n",
"<tr><th>Family has savings more than $5000</th><td>NaN</td><td>NaN</td><td>NaN</td><td>2.00</td><td>2.00</td><td>1.00</td><td>NaN</td><td>2.00</td><td>2.00</td><td>2.00</td></tr>\n",
"<tr><th>Total savings/cash assets for the family</th><td>NaN</td><td>NaN</td><td>NaN</td><td>3.00</td><td>1.00</td><td>NaN</td><td>NaN</td><td>1.00</td><td>1.00</td><td>1.00</td></tr>\n",
"<tr><th>Weights of Prototypes</th><td>0.18</td><td>0.07</td><td>0.09</td><td>0.06</td><td>0.15</td><td>0.07</td><td>0.12</td><td>0.07</td><td>0.09</td><td>0.10</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 8 132 690 \\\n", "Respondent sequence number 73565.00 73689.00 74247.00 \n", "Income from wages/salaries 1.00 1.00 2.00 \n", "Income from self employment 2.00 2.00 2.00 \n", "Income from Social Security or RR 2.00 2.00 2.00 \n", "Income from other disability pension 2.00 2.00 2.00 \n", "Income from retirement/survivor pension 2.00 2.00 2.00 \n", "Income from Supplemental Security Income 2.00 2.00 2.00 \n", "Income from state/county cash assistance 2.00 2.00 2.00 \n", "Income from interest/dividends or rental 2.00 1.00 2.00 \n", "Income from other sources 2.00 2.00 2.00 \n", "Monthly family income 12.00 11.00 8.00 \n", "Family monthly poverty level index 5.00 4.30 3.05 \n", "Family monthly poverty level category 3.00 3.00 3.00 \n", "Family has savings more than $5000 NaN NaN NaN \n", "Total savings/cash assets for the family NaN NaN NaN \n", "Weights of Prototypes 0.18 0.07 0.09 \n", "\n", " 1475 2449 2912 \\\n", "Respondent sequence number 75032.00 76006.00 76469.00 \n", "Income from wages/salaries 1.00 1.00 1.00 \n", "Income from self employment 2.00 1.00 2.00 \n", "Income from Social Security or RR 2.00 2.00 2.00 \n", "Income from other disability pension 2.00 2.00 1.00 \n", "Income from retirement/survivor pension 2.00 2.00 1.00 \n", "Income from Supplemental Security Income 2.00 2.00 2.00 \n", "Income from state/county cash assistance 2.00 2.00 2.00 \n", "Income from interest/dividends or rental 1.00 2.00 2.00 \n", "Income from other sources 2.00 1.00 2.00 \n", "Monthly family income 4.00 1.00 6.00 \n", "Family monthly poverty level index 1.65 0.00 1.32 \n", "Family monthly poverty level category 2.00 1.00 2.00 \n", "Family has savings more than $5000 2.00 2.00 1.00 \n", "Total savings/cash assets for the family 3.00 1.00 NaN \n", "Weights of Prototypes 0.06 0.15 0.07 \n", "\n", " 3899 5077 6895 \\\n", "Respondent sequence number 77456.00 78634.00 80452.00 \n", "Income from wages/salaries 1.00 1.00 1.00 \n", "Income from self employment 
2.00 2.00 2.00 \n", "Income from Social Security or RR 1.00 2.00 2.00 \n", "Income from other disability pension 2.00 2.00 2.00 \n", "Income from retirement/survivor pension 2.00 1.00 2.00 \n", "Income from Supplemental Security Income 2.00 2.00 1.00 \n", "Income from state/county cash assistance 2.00 2.00 1.00 \n", "Income from interest/dividends or rental 2.00 2.00 2.00 \n", "Income from other sources 2.00 2.00 2.00 \n", "Monthly family income 7.00 2.00 5.00 \n", "Family monthly poverty level index 2.71 0.44 0.86 \n", "Family monthly poverty level category 3.00 1.00 1.00 \n", "Family has savings more than $5000 NaN 2.00 2.00 \n", "Total savings/cash assets for the family NaN 1.00 1.00 \n", "Weights of Prototypes 0.12 0.07 0.09 \n", "\n", " 7475 \n", "Respondent sequence number 81032.00 \n", "Income from wages/salaries 2.00 \n", "Income from self employment 2.00 \n", "Income from Social Security or RR 1.00 \n", "Income from other disability pension 2.00 \n", "Income from retirement/survivor pension 2.00 \n", "Income from Supplemental Security Income 2.00 \n", "Income from state/county cash assistance 2.00 \n", "Income from interest/dividends or rental 1.00 \n", "Income from other sources 2.00 \n", "Monthly family income 3.00 \n", "Family monthly poverty level index 1.08 \n", "Family monthly poverty level category 1.00 \n", "Family has savings more than $5000 2.00 \n", "Total savings/cash assets for the family 1.00 \n", "Weights of Prototypes 0.10 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Display the prototypes along with their computed weights\n", "inc_prototypes = df_inc.iloc[S, :].copy()\n", "# Compute normalized importance weights for prototypes\n", "inc_prototypes[\"Weights of Prototypes\"] = np.around(W/np.sum(W), 2) \n", "inc_prototypes.transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Explanation:\n", "The 10 people shown above (i.e. 
10 prototypes) are representative of the income questionnaire according to Protodash. First, in the distribution plot for family finance related questions, we saw that there were roughly five times as many people without savings in excess of $5000 as with them. Our prototypes have a similar spread, which is reassuring. Also, for monthly family income, we get a more even spread over the more commonly occurring categories. This is a kind of spot check to see if our prototypes actually match the distribution of values in the dataset.\n", "\n", "Looking at the other questions in the questionnaire and the corresponding answers given by the prototypical people above, the social scientist realizes that most people are employed (3rd question) and work for an organization earning through salary/wages (1st two questions). Most of them are also young (5th question) and fit to work (4th question). However, they don't seem to have much savings (last question). The insights that the social scientist acquired from studying the prototypes could also be conveyed to the appropriate government authorities that shape future public policy decisions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Study 2: Find Questionnaire/s that are most representative of Income\n", "\n", "We now move on to our second study, where we want to see how the remaining 39 questionnaires represent or relate to income. This will provide us with an idea of which lifestyle factors are likely to affect income the most. To do this, we compute prototypes for each of the questionnaires and evaluate how well they represent the income questionnaire relative to our objective function. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compute prototypes for all questionnaires\n", "\n", "This step uses the Protodash explainer to compute 10 prototypes for each of the questionnaires and saves these for further evaluation. 
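\n", "\n", "To make the selection step less of a black box, the sketch below illustrates the idea on toy data: greedily pick the candidate with the largest gradient of the objective f(w) = w.mu - 0.5 * w.K.w, then re-fit non-negative weights on the selected set. It is a simplified illustration written for this tutorial and not the ProtodashExplainer implementation; the Gaussian kernel, its width sigma, the projected-gradient weight update and all function names are assumptions made for the example.\n", "\n", "
```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel between the rows of A and the rows of B.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def greedy_prototypes(X, Z, m, sigma=1.0, steps=200):
    # X: points to explain, Z: candidate pool, m: number of prototypes to pick.
    # Hypothetical helper for illustration only.
    K = gaussian_kernel(Z, Z, sigma)
    mu = gaussian_kernel(Z, X, sigma).mean(axis=1)  # mean similarity of each candidate to X
    S, w = [], np.zeros(0)
    for _ in range(m):
        # Gradient of f(w) = w.mu - 0.5 * w.K.w with respect to every candidate.
        grad = mu.copy()
        if S:
            grad -= K[:, S] @ w
            grad[S] = -np.inf  # never pick the same candidate twice
        S.append(int(np.argmax(grad)))
        # Re-fit non-negative weights on the selected set (projected gradient ascent).
        K_S, mu_S = K[np.ix_(S, S)], mu[S]
        w = np.append(w, 0.0)
        step = 1.0 / np.linalg.eigvalsh(K_S)[-1]
        for _ in range(steps):
            w = np.maximum(0.0, w + step * (mu_S - K_S @ w))
    return np.array(S), w

# Toy data: two tight, well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
low = rng.normal(0.0, 0.1, size=(20, 2))
high = rng.normal(10.0, 0.1, size=(20, 2))
Z = np.vstack([low, high])

S, w = greedy_prototypes(Z, Z, m=2)
print('prototype indices:', S)
print('normalized weights:', np.round(w / w.sum(), 2))
```
\n", "\n", "In this toy run the two prototypes come from different clusters, which is the diversity property discussed earlier. 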
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing ACQ_H.csv\n", "processing ALQ_H.csv\n", "processing BPQ_H.csv\n", "processing CDQ_H.csv\n", "processing CFQ_H.csv\n", "processing CBQ_H.csv\n", "processing CKQ_H.csv\n", "processing HSQ_H.csv\n", "processing DEQ_H.csv\n", "processing DIQ_H.csv\n", "processing DBQ_H.csv\n", "processing DLQ_H.csv\n", "processing DUQ_H.csv\n", "processing ECQ_H.csv\n", "processing FSQ_H.csv\n", "processing HIQ_H.csv\n", "processing HEQ_H.csv\n", "processing HUQ_H.csv\n", "processing HOQ_H.csv\n", "processing IMQ_H.csv\n", "processing INQ_H.csv\n", "processing KIQ_U_H.csv\n", "processing MCQ_H.csv\n", "processing DPQ_H.csv\n", "processing OCQ_H.csv\n", "processing OHQ_H.csv\n", "processing OSQ_H.csv\n", "processing PAQ_H.csv\n", "processing PFQ_H.csv\n", "processing RXQASA_H.csv\n", "processing RHQ_H.csv\n", "processing SXQ_H.csv\n", "processing SLQ_H.csv\n", "processing SMQFAM_H.csv\n", "processing SMQRTU_H.csv\n", "processing SMQSHS_H.csv\n", "processing CSQ_H.csv\n", "processing VTQ_H.csv\n", "processing WHQ_H.csv\n", "processing WHQMEC_H.csv\n" ] } ], "source": [ "# Iterate through all questionnaire datasets and find 10 prototypes for each.\n", "\n", "prototypes = {}\n", "\n", "for i in range(len(nhanes_files)):\n", " \n", " f = nhanes_files[i]\n", " \n", " print(\"processing \", f)\n", " \n", " # read data to pandas dataframe\n", " df = nhanes.get_csv_file(f)\n", " \n", " # convert data to numpy\n", " data = df.to_numpy()\n", "\n", " #sort the rows by sequence numbers in 1st column \n", " idx = np.argsort(data[:, 0])\n", " data = data[idx, :]\n", "\n", " # replace nan's with 0's.\n", " original = data\n", " original[np.isnan(original)] = 0\n", "\n", " # delete 1st column (contains sequence numbers)\n", " original = original[:, 1:]\n", "\n", " # one hot encode all features as they are categorical\n", " onehot_encoder = 
OneHotEncoder(sparse=False)\n", " onehot_encoded = onehot_encoder.fit_transform(original)\n", "\n", " explainer = ProtodashExplainer()\n", "\n", " # call Protodash explainer\n", " # S contains indices of the selected prototypes\n", " # W contains importance weights associated with the selected prototypes \n", "\n", " (W, S, _) = explainer.explain(onehot_encoded, onehot_encoded, m=10) \n", "\n", " prototypes[f]={}\n", " prototypes[f]['W']= W\n", " prototypes[f]['S']= S\n", " prototypes[f]['data'] = data\n", " prototypes[f]['original'] = original" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluate the set of prototypical respondents from various questionnaires using their income questionnaire. \n", "\n", "Now that we have the prototypes for each of the questionnaires, we evaluate how well the prototypes of each questionnaire represent the Income questionnaire based on the objective function that Protodash uses. We see below a ranked list of the questionnaires with their objective function values in ascending order. The higher a questionnaire appears in the list, the better its prototypes represent the income questionnaire. The values on the right are the objective values, where lower is better." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>Questionaire</th><th>Prototypes representative of Income</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Early Childhood</td><td>-96.119374</td></tr>\n",
"<tr><th>1</th><td>Physical Functioning</td><td>-95.584090</td></tr>\n",
"<tr><th>2</th><td>Acculturation</td><td>-95.355652</td></tr>\n",
"<tr><th>3</th><td>Disability</td><td>-93.984344</td></tr>\n",
"<tr><th>4</th><td>Physical Activity</td><td>-93.945023</td></tr>\n",
"<tr><th>5</th><td>Smoking - Secondhand Smoke Exposure</td><td>-93.193000</td></tr>\n",
"<tr><th>6</th><td>Cognitive Functioning</td><td>-93.050538</td></tr>\n",
"<tr><th>7</th><td>Sleep Disorders</td><td>-92.330593</td></tr>\n",
"<tr><th>8</th><td>Diabetes</td><td>-91.381703</td></tr>\n",
"<tr><th>9</th><td>Taste &amp; Smell</td><td>-90.633708</td></tr>\n",
"<tr><th>10</th><td>Smoking - Recent Tobacco Use</td><td>-88.301894</td></tr>\n",
"<tr><th>11</th><td>Preventive Aspirin Use</td><td>-88.292714</td></tr>\n",
"<tr><th>12</th><td>Kidney Conditions - Urology</td><td>-87.817560</td></tr>\n",
"<tr><th>13</th><td>Cardiovascular Health</td><td>-85.699530</td></tr>\n",
"<tr><th>14</th><td>Mental Health - Depression Screener</td><td>-85.634763</td></tr>\n",
"<tr><th>15</th><td>Volatile Toxicant (Subsample)</td><td>-85.580458</td></tr>\n",
"<tr><th>16</th><td>Smoking - Household Smokers</td><td>-85.514068</td></tr>\n",
"<tr><th>17</th><td>Alcohol Use</td><td>-84.735015</td></tr>\n",
"<tr><th>18</th><td>Dermatology</td><td>-84.350459</td></tr>\n",
"<tr><th>19</th><td>Consumer Behavior</td><td>-84.151348</td></tr>\n",
"<tr><th>20</th><td>Food Security</td><td>-82.769481</td></tr>\n",
"<tr><th>21</th><td>Immunization</td><td>-82.186591</td></tr>\n",
"<tr><th>22</th><td>Housing Characteristics</td><td>-81.655651</td></tr>\n",
"<tr><th>23</th><td>Drug Use</td><td>-81.187136</td></tr>\n",
"<tr><th>24</th><td>Occupation</td><td>-81.005558</td></tr>\n",
"<tr><th>25</th><td>Oral Health</td><td>-78.920883</td></tr>\n",
"<tr><th>26</th><td>Weight History - Youth</td><td>-77.454574</td></tr>\n",
"<tr><th>27</th><td>Income</td><td>-76.364734</td></tr>\n",
"<tr><th>28</th><td>Diet Behavior &amp; Nutrition</td><td>-75.799336</td></tr>\n",
"<tr><th>29</th><td>Weight History</td><td>-75.136793</td></tr>\n",
"<tr><th>30</th><td>Blood Pressure &amp; Cholesterol</td><td>-74.227314</td></tr>\n",
"<tr><th>31</th><td>Reproductive Health</td><td>-73.813445</td></tr>\n",
"<tr><th>32</th><td>Osteoporosis</td><td>-71.564194</td></tr>\n",
"<tr><th>33</th><td>Creatine Kinase</td><td>-67.288085</td></tr>\n",
"<tr><th>34</th><td>Hepatitis</td><td>-67.152639</td></tr>\n",
"<tr><th>35</th><td>Medical Conditions</td><td>-65.222360</td></tr>\n",
"<tr><th>36</th><td>Current Health Status</td><td>-44.587781</td></tr>\n",
"<tr><th>37</th><td>Hospital Utilization &amp; Access to Care</td><td>10.916130</td></tr>\n",
"<tr><th>38</th><td>Sexual Behavior</td><td>53.155880</td></tr>\n",
"<tr><th>39</th><td>Health Insurance</td><td>146.419436</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Questionaire Prototypes representative of Income\n", "0 Early Childhood -96.119374\n", "1 Physical Functioning -95.584090\n", "2 Acculturation -95.355652\n", "3 Disability -93.984344\n", "4 Physical Activity -93.945023\n", "5 Smoking - Secondhand Smoke Exposure -93.193000\n", "6 Cognitive Functioning -93.050538\n", "7 Sleep Disorders -92.330593\n", "8 Diabetes -91.381703\n", "9 Taste & Smell -90.633708\n", "10 Smoking - Recent Tobacco Use -88.301894\n", "11 Preventive Aspirin Use -88.292714\n", "12 Kidney Conditions - Urology -87.817560\n", "13 Cardiovascular Health -85.699530\n", "14 Mental Health - Depression Screener -85.634763\n", "15 Volatile Toxicant (Subsample) -85.580458\n", "16 Smoking - Household Smokers -85.514068\n", "17 Alcohol Use -84.735015\n", "18 Dermatology -84.350459\n", "19 Consumer Behavior -84.151348\n", "20 Food Security -82.769481\n", "21 Immunization -82.186591\n", "22 Housing Characteristics -81.655651\n", "23 Drug Use -81.187136\n", "24 Occupation -81.005558\n", "25 Oral Health -78.920883\n", "26 Weight History - Youth -77.454574\n", "27 Income -76.364734\n", "28 Diet Behavior & Nutrition -75.799336\n", "29 Weight History -75.136793\n", "30 Blood Pressure & Cholesterol -74.227314\n", "31 Reproductive Health -73.813445\n", "32 Osteoporosis -71.564194\n", "33 Creatine Kinase -67.288085\n", "34 Hepatitis -67.152639\n", "35 Medical Conditions -65.222360\n", "36 Current Health Status -44.587781\n", "37 Hospital Utilization & Access to Care 10.916130\n", "38 Sexual Behavior 53.155880\n", "39 Health Insurance 146.419436" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#load income dataset INQ_H and its prototypes\n", "X = prototypes['INQ_H.csv']['original']\n", "Xdata = prototypes['INQ_H.csv']['data']\n", "\n", "# Iterate through all questionnaires and evaluate how well their prototypes represent the income dataset. 
\n", "objs = []\n", "for i in range(len(nhanes_files)):\n", " \n", " #load a dataset, its prototypes & weights\n", "\n", " f = nhanes_files[i]\n", " \n", " Ydata = prototypes[f]['data']\n", " S = prototypes[f]['S']\n", " W = prototypes[f]['W']\n", " \n", " \n", " # sort the order of prototypes in set S\n", " idx = np.argsort(S)\n", " S = S[idx]\n", " W = W[idx]\n", " \n", " # access corresponding prototypes in X. \n", " XS = X[np.isin(Xdata[:, 0], Ydata[S, 0]), :]\n", " \n", " #print(Ydata[S, 0])\n", " #print(Xdata[np.isin(Xdata[:, 0], Ydata[S, 0]), 0]) \n", " \n", " # u: mean dot-product similarity of each selected prototype to all income respondents\n", " temp = np.dot(XS, np.transpose(X)) \n", " u = np.sum(temp, axis=1)/temp.shape[1]\n", " \n", " # K: pairwise dot-product similarities among the selected prototypes\n", " K = np.dot(XS, XS.T)\n", " \n", " # evaluate prototypes on income based on our objective function with dot product as similarity measure\n", " obj = 0.5 * np.dot(np.dot(W.T, K), W) - np.dot(W.T, u)\n", " objs.append(obj) \n", " \n", "\n", "# sort the objectives (ascending order) \n", "index = np.argsort(np.array(objs))\n", "\n", "# load the results in a dataframe to print\n", "evalresult = []\n", "for i in range(0,len(index)): \n", " evalresult.append([ nhanesinfo[index[i]], objs[index[i]] ])\n", " \n", " \n", "df_evalresult = pd.DataFrame.from_records(evalresult)\n", "df_evalresult.columns = ['Questionaire', 'Prototypes representative of Income']\n", "df_evalresult" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Insight from Protodash\n", "\n", "Looking at the table above, the interesting result is that the early childhood questionnaire's prototypes represent income best. The early childhood questionnaire has information about the environment in which the child was born and raised. This is consistent with a long-term study (https://www.theatlantic.com/business/archive/2016/07/social-mobility-america/491240/), which describes a significant decrease in social mobility in recent times and stresses that your childhood affects how monetarily successful you are likely to be. 
It is interesting that our method was able to uncover this relationship with access to just these survey questionnaires. Other such insights could be obtained, and those that a social scientist or policy maker finds interesting could potentially motivate long-term studies like the one just mentioned." ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.15" } }, "nbformat": 4, "nbformat_minor": 2 }