{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Health and Lifestyle Survey Questions Tutorial\n",
    "\n",
    "In this tutorial, we showcase how the Protodash explainer algorithm from AI Explainability 360 Toolkit implemented through the _ProtodashExplainer_ class could be used to summarize the National Health and Nutrition Examination Survey (NHANES) datasets ([Study 1](#study1)) available through the Center for Disease Control and Prevention (CDC). Moreover, we also show how the algorithm could be used to distill interesting relationships between different facets of life (i.e. early childhood and income), which were found by scientists ([Study 2](#study2)) through decades of rigorous experimentation. This study shows that in using Protodash, one can potentially uncover such insights cheaply, which could then be reaffirmed through rigorous experimentation.\n",
    "\n",
    " Data from this survey is typically used in epidemiological studies and health science research, which helps develop public health policy, direct and design health programs and services, and expand health knowledge. Thus, the impact of understanding these datasets and the relationships that may exist between them are far reaching for a social scientist."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"intro\"></a>\n",
    "## Introduction to Center for Disease Control and Prevention (CDC) datasets\n",
    "\n",
    "The [NHANES CDC questionnaire datasets](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&CycleBeginYear=2013) are surveys conducted by the organization involving thousands of civilians about various facets of their daily lives. There are 44 questionnaires that collect data about income, occupation, health, early childhood and many other behavioral and lifestyle aspects of individuals living in the US. These questionnaires are thus a rich source of information indicative of the quality of life of many civilians. \n",
    "\n",
    "This tutorial presents two studies. We first see how a CDC questionaire answered by thousands of individuals could be summarized by looking at answers given by a few prototypical users. Next, an interesting endeavor is to uncover relationships between different aspects of life by analyzing data across the different CDC questionnaires. In the second study, we do exactly that with the help of the Protodash explainer algorithm. We show how the algorithm is able to uncover an interesting [insight](https://www.theatlantic.com/business/archive/2016/07/social-mobility-america/491240/) known only through decades of experimentation, solely from the questionnaire datasets. This by no means suggests the method as a substitute for rigorous experimentation, but showcases it as an avenue for obtaining interesting insights at low cost, which could inspire further indepth studies. The manner in which this is accomplished is by finding prototypical individuals for each of the questionnaires and then evaluating how well they represent the income questionnaire (w.r.t. the method's objective function). The more representative these prototypes are, the more that questionnaire is indicative/representative of income. \n",
    "\n",
    "For this use case, we are selecting prototypes from specific questionnaires. Hence, the group we want to explain is the dataset itself, which — in this case — are the questionnaires. We are not training an AI model. Rather, we are trying to summarize each questionnaire, which was filled by thousands of people, by selecting a few representative individuals for each of them.\n",
    "\n",
    "\n",
    "The rest of the tutorial is organized as follows: <br>\n",
    "[Explore Income questionaire](#explore)<br>\n",
    "[Study 1: Summarize Income Questionnaire using Prototypes](#study1)<br>\n",
    "[Study 2: Find Questionnaire/s most representative of Income](#study2)<br>\n",
    "\n",
    "\n",
    "###### [Protodash: Fast Interpretable Prototype Selection](https://arxiv.org/abs/1707.01212)\n",
    "\n",
    "We now provide a brief overview of the method. The method takes as input a datapoint (or group of datapoints) that we want to explain with respect to instances in a training set belonging to the same feature space. The method then tries to minimize the maximum mean discrepancy (MMD metric) between the datapoints we want to explain and a prespecified number of instances from the training set that it will select. In other words, it will try to select training instances that have the same distribution as the datapoints we want to explain. The method does greedy selection and has quality guarantees with it also returning importance weights for the chosen prototypical training instances indicative of how similar/representative they are. \n",
    "\n",
    "\n",
    "###### Why Protodash?\n",
    "\n",
    "Before we showcase the two studies, we provide some motivation for using this method. The method is able to select in a deterministic fashion examples from a dataset, which we term as prototypes that represent the different segments in a dataset. For example, if we take people that answered the income questionnaire, we might find that there are three categories of people: i) those that are high earners, ii) those that are middle class and iii) those that don't earn much or are unemployed and receive unemployment benefits. Protodash would be able to find these segments by pointing to specific individuals that lie in these categories. Looking at the objective function value of Protodash, one would also be able to say that three segments is the right number here as adding one more segment may not improve the objective value by much.\n",
    "\n",
    "Compared with other methods such as k-medoids, it has the advantage that it is deterministic and does not have randomizations as in, say, k-medoids clustering, where the centers a typically randomly initialized. So the solutions are repeatable and it picks prototypes that are representative as well as diverse, which may not be the case with standard distance metrics such as euclidean distance. Diversity is important in practical settings (viz. income example above) where we want to capture all the different segments/modes in the dataset, not missing any of the key behaviors.\n",
    "\n",
    "Another benefit of the method is that, since it performs distribution matching between the user/users in question and those available in the training set, it could, in principle, also be applied in non-iid settings such as for time series data. Other approaches which find similar profiles using standard distance measures (viz. euclidean, cosine) do not have this property. Additionally, we can also highlight important features for the different prototypes that made them similar to the user/users in question."
   ]
  },
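  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the distribution matching described above more concrete, the following is a minimal sketch of a squared MMD estimate with a Gaussian kernel. It is not the implementation used by _ProtodashExplainer_; the helper names `gaussian_kernel` and `mmd_squared`, the bandwidth `sigma`, and the synthetic data are all assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def gaussian_kernel(a, b, sigma=1.0):\n",
    "    # Pairwise squared Euclidean distances between rows of a and rows of b.\n",
    "    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T\n",
    "    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))\n",
    "\n",
    "def mmd_squared(x, z, sigma=1.0):\n",
    "    # Biased estimate of the squared MMD between samples x and z.\n",
    "    k_xx = gaussian_kernel(x, x, sigma).mean()\n",
    "    k_xz = gaussian_kernel(x, z, sigma).mean()\n",
    "    k_zz = gaussian_kernel(z, z, sigma).mean()\n",
    "    return k_xx - 2.0 * k_xz + k_zz\n",
    "\n",
    "# Synthetic data (hypothetical, not NHANES records).\n",
    "rng = np.random.default_rng(0)\n",
    "x = rng.normal(0.0, 1.0, size=(200, 2))         # group to explain\n",
    "z_similar = rng.normal(0.0, 1.0, size=(20, 2))  # candidates from the same distribution\n",
    "z_shifted = rng.normal(3.0, 1.0, size=(20, 2))  # candidates from a shifted distribution\n",
    "\n",
    "mmd_similar = mmd_squared(x, z_similar)\n",
    "mmd_shifted = mmd_squared(x, z_shifted)\n",
    "print(mmd_similar < mmd_shifted)\n",
    "```\n",
    "\n",
    "A candidate set drawn from the same distribution as the group to explain should yield the smaller value, which is the sense in which a good set of prototypes matches the distribution of the data it summarizes."
   ]
  },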
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Import Statements\n",
    "\n",
    "Import relevant libraries, datasets and Protodash explainer algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import os\n",
    "import requests\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "from aix360.algorithms.protodash import ProtodashExplainer\n",
    "from aix360.datasets.cdc_dataset import CDCDataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load CDC dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "nhanes = CDCDataset()\n",
    "nhanes_files = nhanes.get_csv_file_names()\n",
    "(nhanesinfo, _, _) = nhanes._cdc_files_info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"explore\"></a>\n",
    "## Explore Income questionnaire\n",
    "\n",
    "Now let us explore the income questionnaire dataset and find out the types of responses received in the survey. Each column in the dataset corresponds to a question and each row denotes the answers given by a respondent to those questions. Both column names and answers by respondents are encoded. For example, 'SEQN' denotes the sequence number assigned to a respondent and 'IND235' corresponds to a question about monthly family income. As seen below, in most cases a value of 1 implies \"Yes\" to the question, while a value of 2 implies \"No.\" More details about the income questionaire and how questions and answers are encoded can be seen [here](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/INQ_H.htm)\n",
    "\n",
    "|Column  |Description                    | Values and Meaning|\n",
    "|-------|----------------------------|---------|\n",
    "|SEQN   | Respondent sequence number |\n",
    "|INQ020 | Income from wages/salaries |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|INQ012 | Income from self employment|1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|INQ030 | Income from Social Security or RR |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|INQ060 | Income from other disability pension |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|INQ080 | Income from retirement/survivor pension |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|INQ090 | Income from Supplemental Security Income |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|INQ132 | Income from state/county cash assistance |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|INQ140 | Income from interest/dividends or rental |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|INQ150 | Income from other sources |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|IND235 | Monthly family income |1-12->Increasing income brackets, 77->Refused, 99->Don't know|\n",
    "|INDFMMPI | Family monthly poverty level index |0-5->Higher value more affluent|\n",
    "|INDFMMPC | Family monthly poverty level category |1-3->Increasing INDFMMPI brackets, 7->Refused, 9->Don't know|\n",
    "|INQ244 | Family has savings more than $5000 |1->Yes, 2->No, 7->Refused, 9->Don't know|\n",
    "|IND247 | Total savings/cash assets for the family |1-6->Increasing savings brackets, 77->Refused, 99->Don't know|"
   ]
  },
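  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The coded answers in the table above can be turned into readable labels with a simple lookup. The following is a minimal sketch on a hypothetical toy frame (the rows are made up, not real NHANES records), assuming the yes/no code book listed in the table; the names `yes_no_codes`, `toy` and `decoded` are ours.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Code book for the yes/no style questions, taken from the table above.\n",
    "yes_no_codes = {1: 'Yes', 2: 'No', 7: 'Refused', 9: \"Don't know\"}\n",
    "\n",
    "# Hypothetical toy answers; None marks a question that was not answered.\n",
    "toy = pd.DataFrame({'INQ020': [2, 1, 2, 1, 9], 'INQ244': [9, 1, None, None, 2]})\n",
    "\n",
    "# Map every coded answer to its label; missing answers stay missing.\n",
    "decoded = toy.apply(lambda col: col.map(yes_no_codes))\n",
    "print(decoded)\n",
    "```\n",
    "\n",
    "Columns with bracketed answers, such as 'IND235', would need their own code book, so this lookup should only be applied to the yes/no style questions."
   ]
  },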
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Answers given by some respondents to the income questionnaire:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Respondent sequence number</th>\n",
       "      <td>73557.00</td>\n",
       "      <td>73558.00</td>\n",
       "      <td>73559.00</td>\n",
       "      <td>73560.00</td>\n",
       "      <td>73561.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from wages/salaries</th>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from self employment</th>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from Social Security or RR</th>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from other disability pension</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from retirement/survivor pension</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from Supplemental Security Income</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from state/county cash assistance</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from interest/dividends or rental</th>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from other sources</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Monthly family income</th>\n",
       "      <td>4.00</td>\n",
       "      <td>5.00</td>\n",
       "      <td>10.00</td>\n",
       "      <td>9.00</td>\n",
       "      <td>11.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Family monthly poverty level index</th>\n",
       "      <td>0.86</td>\n",
       "      <td>0.92</td>\n",
       "      <td>4.37</td>\n",
       "      <td>2.52</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Family monthly poverty level category</th>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>3.00</td>\n",
       "      <td>3.00</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Family has savings more than $5000</th>\n",
       "      <td>9.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total savings/cash assets for the family</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 0         1         2  \\\n",
       "Respondent sequence number                73557.00  73558.00  73559.00   \n",
       "Income from wages/salaries                    2.00      1.00      2.00   \n",
       "Income from self employment                   2.00      1.00      2.00   \n",
       "Income from Social Security or RR             1.00      1.00      1.00   \n",
       "Income from other disability pension          2.00      2.00      2.00   \n",
       "Income from retirement/survivor pension       2.00      2.00      1.00   \n",
       "Income from Supplemental Security Income      2.00      2.00      2.00   \n",
       "Income from state/county cash assistance      2.00      2.00      2.00   \n",
       "Income from interest/dividends or rental      2.00      1.00      1.00   \n",
       "Income from other sources                     2.00      2.00      2.00   \n",
       "Monthly family income                         4.00      5.00     10.00   \n",
       "Family monthly poverty level index            0.86      0.92      4.37   \n",
       "Family monthly poverty level category         1.00      1.00      3.00   \n",
       "Family has savings more than $5000            9.00      1.00       NaN   \n",
       "Total savings/cash assets for the family       NaN       NaN       NaN   \n",
       "\n",
       "                                                 3        4  \n",
       "Respondent sequence number                73560.00  73561.0  \n",
       "Income from wages/salaries                    1.00      2.0  \n",
       "Income from self employment                   2.00      2.0  \n",
       "Income from Social Security or RR             2.00      1.0  \n",
       "Income from other disability pension          2.00      2.0  \n",
       "Income from retirement/survivor pension       2.00      2.0  \n",
       "Income from Supplemental Security Income      2.00      2.0  \n",
       "Income from state/county cash assistance      2.00      2.0  \n",
       "Income from interest/dividends or rental      2.00      2.0  \n",
       "Income from other sources                     1.00      2.0  \n",
       "Monthly family income                         9.00     11.0  \n",
       "Family monthly poverty level index            2.52      5.0  \n",
       "Family monthly poverty level category         3.00      3.0  \n",
       "Family has savings more than $5000             NaN      NaN  \n",
       "Total savings/cash assets for the family       NaN      NaN  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# replace encoded column names by the associated question text. \n",
    "df_inc = nhanes.get_csv_file('INQ_H.csv')\n",
    "df_inc.columns[0]\n",
    "dict_inc = {\n",
    "'SEQN': 'Respondent sequence number', \n",
    "'INQ020': 'Income from wages/salaries',\n",
    "'INQ012': 'Income from self employment',\n",
    "'INQ030':'Income from Social Security or RR',\n",
    "'INQ060':  'Income from other disability pension', \n",
    "'INQ080':  'Income from retirement/survivor pension',\n",
    "'INQ090':  'Income from Supplemental Security Income',\n",
    "'INQ132':  'Income from state/county cash assistance', \n",
    "'INQ140':  'Income from interest/dividends or rental', \n",
    "'INQ150':  'Income from other sources',\n",
    "'IND235':  'Monthly family income',\n",
    "'INDFMMPI':  'Family monthly poverty level index', \n",
    "'INDFMMPC':  'Family monthly poverty level category',\n",
    "'INQ244':  'Family has savings more than $5000',\n",
    "'IND247':  'Total savings/cash assets for the family'\n",
    "}\n",
    "qlist = []\n",
    "for i in range(len(df_inc.columns)):\n",
    "    qlist.append(dict_inc[df_inc.columns[i]])\n",
    "df_inc.columns = qlist\n",
    "print(\"Answers given by some respondents to the income questionnaire:\")\n",
    "df_inc.head(5).transpose()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, to get more of a feel for the dataset, let us look at the distribution of responses for two questions related to family financial status."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of respondents to Income questionnaire: 10175\n",
      "Distribution of answers to 'monthly family income' and 'Family savings' questions:\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAlwAAAE8CAYAAAAVAG93AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAAsTAAALEwEAmpwYAAAhAElEQVR4nO3df7RdZX3n8feXRFC0kiBphiahYWqsxc6AmIF0aKeMWAg/lqEdi2BHItJJZxVbO7aV0LqGLpVZsdMllWmlzZho6ChIUYdUsJAi1OW0IAlQfgUkxWCS8iMawLa02uB3/thP4Hi9N/fcu89z7jkn79daZ929n72fvZ99z93nfO7ez947MhNJkiTVc9BMN0CSJGnUGbgkSZIqM3BJkiRVZuCSJEmqzMAlSZJUmYFLkiSpMgOXJFUSEdsj4r6IuCciNpeywyNiU0Q8Un7OLeUREVdExLaIuDciju9Yzsoy/yMRsXKmtkfS9MUg34friCOOyMWLF890M6QD0pYtW76RmfNmuh3DLCK2A0sz8xsdZb8L7MnMNRGxGpibmRdHxBnArwBnACcCH8nMEyPicGAzsBRIYAvwhsx8eqL1+tkpzYz9fW7O7ndjpmLx4sVs3rx5ppshHZAi4rGZbsOIWgGcXIY3ALcBF5fyq7L5L/j2iJgTEUeWeTdl5h6AiNgELAeunmgFfnZKM2N/n5ueUpSkehK4OSK2RMSqUjY/Mx8vw08A88vwAmBHR92dpWyi8u8REasiYnNEbN69e3cvt0FSDwz0ES5JGnI/mZm7IuIHgU0R8VDnxMzMiOhJv47MXAusBVi6dOng9hWRDlAe4ZKkSjJzV/n5FPA54ATgyXKqkPLzqTL7LmBRR/WFpWyicklDxMAlSRVExMsj4gf2DQOnAvcDG4F9VxquBK4vwxuB88vVisuAZ8upx5uAUyNibrmi8dRSJmmIeEpRkuqYD3wuIqD5rP1UZv55RNwJXBsRFwKPAeeU+W+kuUJxG/AccAFAZu6JiA8Ad5b53r+vA72k4WHgkqQKMvNR4Nhxyr8JnDJOeQIXTbCs9cD6XrdRUv94SlGSJKkyA5ckSVJlBi5JkqTKDFySJEmVGbgkSZIqG6qrFBevvmG/07evObNPLZGk0TPZZ2yv+ZmtA8mkR7giYn1EPBUR93eU/c+IeCgi7o2Iz0XEnI5pl0TEtoh4OCJO6yhfXsq2RcTqnm+JJEnSgOrmlOInaJ5M32kT8OOZ+W+BrwKXAETEMcC5wOtKnY9GxKyImAX8IXA6cAxwXplXkiRp5E0auDLzS8CeMWU3Z+beMno7zbO9AFYA12TmtzPzazR3TD6hvLZl5qOZ+R3gmjKvJEnSyOtFp/l3Al8owwuAHR3TdpayicolSZJGXqvAFRG/DewFPtmb5kBErIqIzRGxeffu3b1arCRJ0oyZduCKiHcAZwG/UJ4BBrALWNQx28JSNlH598nMtZm5NDOXzps3b7rNkyRJGhjTClwRsRx4L/DmzHyuY9JG4NyIOCQijgaWAF+hecr9kog4OiIOpulYv7Fd0yVJkobDpPfhioirgZOBIyJiJ3ApzVWJhwCbIgLg9sz8r5n5QERcCzxIc6rxosx8viznXcBNwCxgfWY+UGF7JEmSBs6kgSszzxuneN1+5r8MuGyc8huBG6fUOkmSpBHgo30kSZIqM3BJkiRVZuCSJEmqzMAlSZJUmYFLkiSpMgOXJElSZQYuSZKkygxckiRJlRm4JEmSKjNwSZIkVWbgkiRJqszAJUmSVJmBS5IkqTIDlyRJUmUGLkmSpMoMXJIkSZUZuCRJkiozcEmSJFVm4JIkSarMwCVJklSZgUuSJKkyA5ckSVJlBi5JkqTKDFySJEmVGbgkSZIqM3BJkiRVZuCSJEmqzMAlSZJUmYFLkiSpMgOXJElSZQYuSaooImZFxN0R8fkyfnRE3BER2yLi0xFxcCk/pIxvK9MXdyzjklL+cESc
NkObIqkFA5ck1fVuYGvH+IeAyzPz1cDTwIWl/ELg6VJ+eZmPiDgGOBd4HbAc+GhEzOpT2yX1iIFLkiqJiIXAmcDHyngAbwSuK7NsAM4uwyvKOGX6KWX+FcA1mfntzPwasA04oS8bIKlnDFySVM/vA+8FvlvGXwU8k5l7y/hOYEEZXgDsACjTny3zv1A+Tp0XRMSqiNgcEZt3797d482Q1NbsyWaIiPXAWcBTmfnjpexw4NPAYmA7cE5mPl3+G/sIcAbwHPCOzLyr1FkJvK8s9oOZuYE+W7z6hv1O377mzD61RNKoi4h9n5tbIuLk2uvLzLXAWoClS5dm7fVJmppJAxfwCeAPgKs6ylYDt2TmmohYXcYvBk4HlpTXicCVwIkloF0KLAUS2BIRGzPz6V5tSD8Y2CRNwUnAmyPiDOClwCtp/iGdExGzy1GshcCuMv8uYBGwMyJmA4cB3+wo36ezjqQhMekpxcz8ErBnTHFnX4OxfRCuysbtNB8sRwKnAZsyc08JWZtoOn9K0kjKzEsyc2FmLqbp9P7FzPwF4FbgLWW2lcD1ZXhjGadM/2JmZik/t1zFeDTNP7Rf6dNmSOqRbo5wjWd+Zj5ehp8A5pfhifoadNUHAZp+CMAqgKOOOmqazZOkgXUxcE1EfBC4G1hXytcBfxIR22j+yT0XIDMfiIhrgQeBvcBFmfl8/5stqY3pBq4XZGZGRM/6C9gPQdKoyczbgNvK8KOMc5VhZv4z8PMT1L8MuKxeCyXVNt2rFJ8spwopP58q5RP1NbAPgiRJOmBNN3B19jUY2wfh/GgsA54tpx5vAk6NiLkRMRc4tZRJkiSNvG5uC3E1cDJwRETspLnacA1wbURcCDwGnFNmv5HmlhDbaG4LcQFAZu6JiA8Ad5b53p+ZYzvijzyvcpQk6cA0aeDKzPMmmHTKOPMmcNEEy1kPrJ9S6yRJkkaAd5qXJEmqzMAlSZJUmYFLkiSpMgOXJElSZQYuSZKkygxckiRJlRm4JEmSKjNwSZIkVWbgkiRJqszAJUmSVJmBS5IkqTIDlyRJUmUGLkmSpMoMXJIkSZUZuCRJkiozcEmSJFVm4JIkSarMwCVJklSZgUuSJKkyA5ckSVJlBi5JkqTKDFySJEmVGbgkSZIqmz3TDVD3Fq++Yb/Tt685s08tkSRJU+ERLkmSpMoMXJIkSZUZuCRJkiozcEmSJFVm4JIkSarMwCVJklSZgUuSJKkyA5ckSVJlBi5JkqTKWgWuiPhvEfFARNwfEVdHxEsj4uiIuCMitkXEpyPi4DLvIWV8W5m+uCdbIEmSNOCmHbgiYgHwq8DSzPxxYBZwLvAh4PLMfDXwNHBhqXIh8HQpv7zMJ0mSNPLanlKcDbwsImYDhwKPA28ErivTNwBnl+EVZZwy/ZSIiJbrlyRJGnjTDlyZuQv4PeDrNEHrWWAL8Exm7i2z7QQWlOEFwI5Sd2+Z/1VjlxsRqyJic0Rs3r1793SbJ0mSNDDanFKcS3PU6mjgh4CXA8vbNigz12bm0sxcOm/evLaLkyRJmnFtTim+CfhaZu7OzH8BPgucBMwppxgBFgK7yvAuYBFAmX4Y8M0W65ckSRoKsyefZUJfB5ZFxKHAPwGnAJuBW4G3ANcAK4Hry/wby/hfl+lfzMxssX5N0eLVN+x3+vY1Z/apJZIkHVja9OG6g6bz+13AfWVZa4GLgfdExDaaPlrrSpV1wKtK+XuA1S3aLUmSNDTaHOEiMy8FLh1T/Chwwjjz/jPw823WJ0nDIiJeCnwJOITms/a6zLw0Io6mOQPwKpoLjd6emd+JiEOAq4A30HS3eGtmbi/LuoTm1jrPA7+amTf1e3skteOd5iWpjm8Db8zMY4HjgOURsYwp3qswIo6hucfh62guTPpoRMzq54ZIaq/VES4dWOwDJnWv9FH9hzL6kvJKmnsVvq2UbwB+B7iS5qrv3ynl1wF/UO5VuAK4JjO/DXytdMs4gaY/rKQh4REuSaokImZF
xD3AU8Am4G+Z+r0KXygfp07nuryHoTTADFySVElmPp+Zx9HcIucE4LUV1+U9DKUBZuCSpMoy8xmaW+b8BFO/V+EL5ePUkTQkDFySVEFEzIuIOWX4ZcDPAFt58V6FMP69CuF771W4ETg3Ig4pVzguAb7Sl42Q1DN2mpekOo4ENpQrCg8Crs3Mz0fEg8A1EfFB4G6+916Ff1I6xe+huTKRzHwgIq4FHgT2Ahdl5vN93hZJLRm4JKmCzLwXeP045VO+V2FmXgZc1us2SuofTylKkiRVZuCSJEmqzMAlSZJUmYFLkiSpMjvNq298NJAk6UDlES5JkqTKDFySJEmVGbgkSZIqM3BJkiRVZuCSJEmqzMAlSZJUmYFLkiSpMgOXJElSZQYuSZKkygxckiRJlRm4JEmSKjNwSZIkVWbgkiRJqszAJUmSVJmBS5IkqTIDlyRJUmUGLkmSpMoMXJIkSZUZuCRJkiozcEmSJFXWKnBFxJyIuC4iHoqIrRHxExFxeERsiohHys+5Zd6IiCsiYltE3BsRx/dmEyRJkgZb2yNcHwH+PDNfCxwLbAVWA7dk5hLgljIOcDqwpLxWAVe2XLckSdJQmHbgiojDgP8ArAPIzO9k5jPACmBDmW0DcHYZXgFclY3bgTkRceR01y9JkjQs2hzhOhrYDXw8Iu6OiI9FxMuB+Zn5eJnnCWB+GV4A7Oiov7OUSZIkjbQ2gWs2cDxwZWa+HvhHXjx9CEBmJpBTWWhErIqIzRGxeffu3S2aJ0mSNBjaBK6dwM7MvKOMX0cTwJ7cd6qw/HyqTN8FLOqov7CUfY/MXJuZSzNz6bx581o0T5IkaTBMO3Bl5hPAjoj40VJ0CvAgsBFYWcpWAteX4Y3A+eVqxWXAsx2nHiVJkkbW7Jb1fwX4ZEQcDDwKXEAT4q6NiAuBx4Bzyrw3AmcA24DnyrySJEkjr1Xgysx7gKXjTDplnHkTuKjN+iRJkoaRd5qXJEmqzMAlSZJUmYFLkiSpMgOXJElSZQYuSZKkygxckiRJlRm4JEmSKmt741OpbxavvmG/07evObNPLZEkaWo8wiVJklSZgUuSJKkyA5ckSVJlBi5JqiAiFkXErRHxYEQ8EBHvLuWHR8SmiHik/JxbyiMiroiIbRFxb0Qc37GslWX+RyJi5Uxtk6TpM3BJUh17gV/PzGOAZcBFEXEMsBq4JTOXALeUcYDTgSXltQq4EpqABlwKnAicAFy6L6RJGh4GLkmqIDMfz8y7yvDfA1uBBcAKYEOZbQNwdhleAVyVjduBORFxJHAasCkz92Tm08AmYHn/tkRSLxi4JKmyiFgMvB64A5ifmY+XSU8A88vwAmBHR7WdpWyi8rHrWBURmyNi8+7du3u7AZJaM3BJUkUR8QrgM8CvZea3OqdlZgLZi/Vk5trMXJqZS+fNm9eLRUrqIQOXJFUSES+hCVufzMzPluIny6lCys+nSvkuYFFH9YWlbKJySUPEwCVJFUREAOuArZn54Y5JG4F9VxquBK7vKD+/XK24DHi2nHq8CTg1IuaWzvKnljJJQ8RH++iA4aOB1GcnAW8H7ouIe0rZbwFrgGsj4kLgMeCcMu1G4AxgG/AccAFAZu6JiA8Ad5b53p+Ze/qyBZJ6xsAlSRVk5peBmGDyKePMn8BFEyxrPbC+d62T1G+eUpQkSarMwCVJklSZgUuSJKkyA5ckSVJldpqXuuRVjpKk6fIIlyRJUmUGLkmSpMoMXJIkSZUZuCRJkiozcEmSJFVm4JIkSarMwCVJklSZgUuSJKkyA5ckSVJlrQNXRMyKiLsj4vNl/OiIuCMitkXEpyPi4FJ+SBnfVqYvbrtuSZKkYdCLI1zvBrZ2jH8IuDwzXw08DVxYyi8Eni7ll5f5JEmSRl6rwBURC4EzgY+V8QDeCFxXZtkAnF2GV5RxyvRTyvySJEkjre3Dq38feC/wA2X8VcAzmbm3jO8EFpThBcAOgMzcGxHPlvm/0bnAiFgFrAI46qij
WjZPGhw+/FqSDlzTPsIVEWcBT2Xmlh62h8xcm5lLM3PpvHnzerloSZKkGdHmCNdJwJsj4gzgpcArgY8AcyJidjnKtRDYVebfBSwCdkbEbOAw4Jst1i9JkjQUpn2EKzMvycyFmbkYOBf4Ymb+AnAr8JYy20rg+jK8sYxTpn8xM3O665ckSRoWNe7DdTHwnojYRtNHa10pXwe8qpS/B1hdYd2SJEkDp22neQAy8zbgtjL8KHDCOPP8M/DzvVifJEnSMPFO85IkSZUZuCRJkiozcEmSJFVm4JIkSarMwCVJklSZgUuSJKkyA5ckSVJlBi5JkqTKDFySJEmVGbgkSZIqM3BJkiRV1pNnKUqqb/HqG/Y7ffuaM/vUEknSVHmES5IkqTIDlyRJUmUGLkmSpMoMXJIkSZUZuCRJkiozcEmSJFVm4JIkSarMwCVJklSZgUuSJKkyA5ckSVJlBi5JqiAi1kfEUxFxf0fZ4RGxKSIeKT/nlvKIiCsiYltE3BsRx3fUWVnmfyQiVs7Etkhqz8AlSXV8Alg+pmw1cEtmLgFuKeMApwNLymsVcCU0AQ24FDgROAG4dF9IkzRcDFySVEFmfgnYM6Z4BbChDG8Azu4ovyobtwNzIuJI4DRgU2buycyngU18f4iTNAQMXJLUP/Mz8/Ey/AQwvwwvAHZ0zLezlE1U/n0iYlVEbI6Izbt37+5tqyW1ZuCSpBmQmQlkD5e3NjOXZubSefPm9WqxknrEwCVJ/fNkOVVI+flUKd8FLOqYb2Epm6hc0pAxcElS/2wE9l1puBK4vqP8/HK14jLg2XLq8Sbg1IiYWzrLn1rKJA2Z2TPdAEn9sXj1Dfudvn3NmX1qyYEhIq4GTgaOiIidNFcbrgGujYgLgceAc8rsNwJnANuA54ALADJzT0R8ALizzPf+zBzbEV/SEDBwSVIFmXneBJNOGWfeBC6aYDnrgfU9bJqkGeApRUmSpMoMXJIkSZVNO3BFxKKIuDUiHoyIByLi3aV8yo+ukCRJGmVtjnDtBX49M48BlgEXRcQxTPHRFZIkSaNu2oErMx/PzLvK8N8DW2nugDzVR1dIkiSNtJ704YqIxcDrgTuY+qMrxi7Lx1NIkqSR0jpwRcQrgM8Av5aZ3+qcNp1HV/h4CkmSNGpaBa6IeAlN2PpkZn62FE/10RWSJEkjrc1VigGsA7Zm5oc7Jk310RWSJEkjrc2d5k8C3g7cFxH3lLLfYoqPrpAkSRp10w5cmfllICaYPKVHV0iSJI0y7zQvSZJUmYFLkiSpMgOXJElSZQYuSZKkygxckiRJlRm4JEmSKjNwSZIkVWbgkiRJqszAJUmSVJmBS5IkqbI2z1KUJGloLF59Q1/Xt33NmX1dnwabR7gkSZIqM3BJkiRVZuCSJEmqzMAlSZJUmYFLkiSpMgOXJElSZQYuSZKkygxckiRJlRm4JEmSKjNwSZIkVWbgkiRJqszAJUmSVJmBS5IkqTIDlyRJUmUGLkmSpMoMXJIkSZUZuCRJkiozcEmSJFU2e6YbIEmS2lu8+oa+rm/7mjP7ur5h5xEuSZKkygxckiRJlXlKUZIkDbx+njKtcbq070e4ImJ5RDwcEdsiYnW/1y9Jw8jPTmm49TVwRcQs4A+B04FjgPMi4ph+tkGSho2fndLw6/cRrhOAbZn5aGZ+B7gGWNHnNkjSsPGzUxpy/Q5cC4AdHeM7S5kkaWJ+dkpDLjKzfyuLeAuwPDN/sYy/HTgxM9/VMc8qYFUZ/VHg4f0s8gjgGy2aZH3rW39iP5yZ81osXz1S4bOz19r+LQ46t2+49XP7Jvzc7PdViruARR3jC0vZCzJzLbC2m4VFxObMXDrdxljf+taffn31VU8/O3tt1P+W3L7hNijb1+9TincCSyLi6Ig4GDgX2NjnNkjSsPGzUxpyfT3ClZl7I+JdwE3ALGB9Zj7QzzZI0rDxs1Mafn2/8Wlm3gjc2KPFtT18bn3rW19Docefnb026n9Lbt9wG4jt
62uneUmSpAORz1KUJEmqzMAlSZJUmYFLkiSpsr53mh8EEXE4QGbumUbd+bx4h+ddmfnkMNUvy5j29veifhuD8PuT5L407Eb5/RvUbTtgOs1HxFHA7wKnAM8AAbwS+CKwOjO3T1L/OOCPgMN48YaDC8uyfjkz7xrw+m23v1X9tgbg9/dammfXvbATAxszc+sUtuEwYPmYZdyUmc9MYRnT+iDpRfslaL8vDYtB/dJua5Tfv4HftswcqhfwWuBi4Iryuhj4sS7q/TXwVmBWR9ksmhsI3t5F/XtoHqUxtnwZ8DdDUL/t9req3+a9m+nfX2nnPcBq4D+X1+p9ZV22/3zgb4ErgfeV1x+VsvO7qH8ccDuwFfiL8nqolB1fu/2+fO17td0XB/3VZl8bhtcov3+Dvm1DdYQrIi4GzgOuoXl4KzTp9Vzgmsxcs5+6j2TmkqlO67L+tsx89RDXb7v93dSf9nvXxfqr/v4i4qvA6zLzX8aUHww8MNm2l3kfpvkgeGZM+Vzgjsx8zST17wF+KTPvGFO+DPjjzDy2Zvulfdrui4Ouzb42DEb5/Rv0bRu2PlwXMv4Xx4eBB4D9fWlviYiPAhuAHaVsEbASuLuLdX8hIm4ArhpT/3zgz4egftvtb1u/zXsHM/v7+y7wQ8BjY8qPLNO6EcB4/918t0ybzMvHfgEAZObtEfHySer2ov3SPm33xUHXZl8bBqP8/g30tg3bEa6HgNMy87Ex5T8M3JyZP7qfugfTfOl39mPZCfwZsC4zv93F+k9n/H4wXd39eSbrt93+HtSf9nvXMe+M/P4iYjnwB8AjvLgTHwW8GnhXZk66I0fESuC/AzePWcbPAB/IzE9MUv8K4EcY/4Pka5n5rprtlzq13RcHWZt9bViM+Ps3sNs2bIHLL44hNezvXUQcBJzA9+7Ed2bm81NYxlzgNL6/0/zTXdZvE7hbt186UAzyl7aG11AFLqjzxRERZ2Xm51vUX5WZ035W0wDUb7v9XdWv9aU/078/SQ33peE2yu/fIGzb0N34NDO/m5m3Z+Znyuv2HvyX/u9a1u+mD84g12+7/V3Vr/TewQz+/iJi2kG1YxmtPgQiYlWLuq3bL3Vouy8OtDb72pAY5fdvxrdt6I5wTSQiPp+ZZ00yzwlAZuadEXEMzT2RHurylMyvAp/LzB2TzTtB/ROBrZn5rYh4Gc1l+ccDDwL/IzOfneLyfpLmaNH9mXnzNNt0VWae3+W8B9NcUfh3mfkXEfE24N/TXDq9dmxn+Cm2Y9L3rsz3r4Gfo+lP8TzwVeBTmfmtLtfzWpqja3dk5j90lC+f7inNiDgyMx+fTt2OZbwhM7e0qP9LmfnH06zbuv068NTYl4ZBm31tULT9Lht0bb8nahqlwLXfL46IuBQ4nebKzE3AicCtNJ2Wb8rMyyZZ/rPAP9LcN+lq4E8zc/cU2vcAcGxm7i1HNJ4DrqO5keixmflzk9T/SmaeUIb/C3AR8DngVODPuritwsaxRcB/pLlxKZn55knqf5Lmd3cozU3kXgF8trSfzHzH/upPsuxJv/TLh8RZwJeAM2iujHwG+FmaG9rd1kX9i2gC4nHAuzPz+jLtrsw8frrtn2kRcUFmfnym26EDwyjvS5MZhX2t7XfZIGv7PVFdDsDNyvrxAu6juVHnocC3gFeW8pcB93ZR/26aU7CnAuuA3TSXma4EfqCL+ls7hu8aM+2ebtbfMXwnMK8Mvxy4r4v6dwH/BzgZ+Ony8/Ey/NNd1L+3/JwNPEm5ASpNcJv099er968MHwrcVoaP6vzdTFL/FWV4MbCZ5ouCyerThMv309y+4tny3t8OvGMK7T+M5tYXDwF7gG/SfGGtAea0/N18vYv3/n3Aj9R+n3yN/qvNvjTsr8n2tWF4tf0uG+RX2++J2q+hug9XRLwSuITmhplfyMxPdUz7aGb+8n6q782mv9BzEfG3WQ4vZuY/RUQ39yLKzPwuzWX9N0fES2iOmJ0H/B4wb5L693f8
d/Q3EbE0MzdHxGuAbk7HHVSucjuI5sjk7tKof4yIvV3UXwq8G/ht4Dcz856I+KfM/Msu6u5b/8E0Ae9QmgCxBzgEeEmXyxhXRHwhM0/vYtbZNIeID6EJQWTm18t7MZmDspz6yMztEXEycF25LcVk5/Y/SXM08TTgHJrfwTXA+yLiNZn5W12s/1qao4knZ+YTABHxr2g+5K6l+fCbUETcO9EkYP4k654LzAFujYgnaP6r/XRm/l0X7ZbGarMvDbyW+9owaPtdNujafE9UNVSBC/g4zW0FPgO8MyL+E/C2bO4BtWySut+JiEMz8zngDfsKo3m+XTeB63s+SLLps7QR2BgRh3ZR/xeBj0TE+4BvAH8dETtobpHwi13UPwzYUtqR+07DRcQrxrZtPGUHuzwi/rT8fJKpvf/raI7OzKIJbX8aEY/S/N6vmaxyREx0miFoTktM5mPAnRFxB/BTwIfKcufRBL/JPBkRx2XmPQCZ+Q8RcRawHvg3k9RdnC/eJ+vDEXFnZn4gIi6g6YPXTeBanJkf6iwowetDEfHOLurPpwl8Y28hEcBfTVL36cz8DeA3IuKnaD5Y74qIrcDVOaJXJamaNvvSMGizrw2Dtt9lg6zt90RVQ9WHKyLuyczjOsZ/m+Y87ZuBTbmfvgMRcUiOc3POiDgCODIz75tk3a/JzK9Ou/EvLueVwNE0YWdntnwgatlB5mfm16ZY70zgpC6Pzuyr80MAmfl3ETEHeBPNIfavdFH3eeAvGT8cLsvMl3WxjNcBP0ZzocBD3ba71F1Ic5TziXGmnZSZ/28/df8KeG9mfjkiVtD0BTitTHs4u7tp6800z2TbsO89j+bhuO8AfiYz3zRJ/XXAxzPzy+NM+1Rmvm0/de/OzNePKZtF03/xrZl5wWTtl/Zpsy8Ngzb72jDo1XfZoGrzPVHbsAWurTSPh/luR9k7gN+k6VPwwzPVNu1fRNwP/GxmPjLOtB2ZuWgGmtWViDgW+N/AEpp+XO/MzK+W/5rOy8wruljGXJorU1cAP1iKn6T5z3JNdnnz0+mIiGsy89xay5ckTW7YAtfv0jwG5i/GlC8H/lf6EN6BFRFvoenc//A4087OzP/b/1Z1LyJ+hBcvNd5LDy81rn3l06hfBi5Jw2CoAtf+jMLlugeqQX/val9qHBFfz8yjWjZzf8sf2cvAJWlYjFLgqvqlpXoG/b2LiPuA4zLz+dJn7sbMPDkijgKuH9s/aoJl7O/Kp9dk5iE9bPLYdd9Nc6HIm4C30vR53EITvj6bmX9fa92SpMZQXaV4AFyuO7JG4L1re6nxTF75NOqXgUvSwBuqwMXoX647yob5vevFpcafp7mw456xEyLitt40c0KjfBm4JA2FoTqlOOqX646yYX/vBvlS48mM+mXgkjQMhipwSZIkDaODZroBkiRJo87AJUmSVJmBS5IkqTIDlyRJUmX/HxVHwhJkOxj/AAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 720x360 with 2 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "print(\"Number of respondents to Income questionnaire:\", df_inc.shape[0])\n",
    "print(\"Distribution of answers to \\'monthly family income\\' and \\'Family savings\\' questions:\")\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(10,5))\n",
    "fig.subplots_adjust(wspace=0.5)\n",
    "hist1 = df_inc['Monthly family income'].value_counts().plot(kind='bar', ax=axes[0])\n",
    "hist2 = df_inc['Family has savings more than $5000'].value_counts().plot(kind='bar', ax=axes[1])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"Dplot\"></a>\n",
    "Observe that the majority of individuals responded with a \"12\" to the question related to monthly family income, which means their income is above USD 8400, as explained [here](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/INQ_H.htm#IND235). Similarly, to the question of whether the family has savings of more than USD 5000, the majority of individuals responded with a \"2\", which means \"No\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"study1\"></a>\n",
    "## Study 1: Summarize Income Questionnaire using Prototypes\n",
    "\n",
    "We just explored the income dataset and looked at the distribution of answers for a couple of questions. Now, consider a social scientist who would like to quickly obtain a summary report of this dataset in terms of types of people that span this dataset. Is it possible to summarize this dataset by looking at answers given by a few representative/prototypical respondents? \n",
    "\n",
    "We now show how the Protodash algorithm can be used to obtain a few prototypical respondents (about 10 in this example) that span the diverse set of individuals answering the income questionnaire making it easy for the social scientist to summarize the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# convert pandas dataframe to numpy\n",
    "data = df_inc.to_numpy()\n",
    "\n",
    "# sort the rows by sequence numbers in 1st column\n",
    "idx = np.argsort(data[:, 0])\n",
    "data = data[idx, :]\n",
    "\n",
    "# replace nan's (missing values) with 0's\n",
    "# (note: original aliases data, so data is modified in place as well)\n",
    "original = data\n",
    "original[np.isnan(original)] = 0\n",
    "\n",
    "# delete 1st column (sequence numbers)\n",
    "original = original[:, 1:]\n",
    "\n",
    "# one hot encode all features as they are categorical\n",
    "# (scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`)\n",
    "onehot_encoder = OneHotEncoder(sparse=False)\n",
    "onehot_encoded = onehot_encoder.fit_transform(original)\n",
    "\n",
    "explainer = ProtodashExplainer()\n",
    "\n",
    "# call Protodash explainer\n",
    "# S contains indices of the selected prototypes\n",
    "# W contains importance weights associated with the selected prototypes \n",
    "(W, S, _) = explainer.explain(onehot_encoded, onehot_encoded, m=10) \n",
    "\n",
    "# sort the order of prototypes in set S\n",
    "idx = np.argsort(S)\n",
    "S = S[idx]\n",
    "W = W[idx]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>8</th>\n",
       "      <th>132</th>\n",
       "      <th>690</th>\n",
       "      <th>1475</th>\n",
       "      <th>2449</th>\n",
       "      <th>2912</th>\n",
       "      <th>3899</th>\n",
       "      <th>5077</th>\n",
       "      <th>6895</th>\n",
       "      <th>7475</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Respondent sequence number</th>\n",
       "      <td>73565.00</td>\n",
       "      <td>73689.00</td>\n",
       "      <td>74247.00</td>\n",
       "      <td>75032.00</td>\n",
       "      <td>76006.00</td>\n",
       "      <td>76469.00</td>\n",
       "      <td>77456.00</td>\n",
       "      <td>78634.00</td>\n",
       "      <td>80452.00</td>\n",
       "      <td>81032.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from wages/salaries</th>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from self employment</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from Social Security or RR</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from other disability pension</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from retirement/survivor pension</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from Supplemental Security Income</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from state/county cash assistance</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from interest/dividends or rental</th>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Income from other sources</th>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Monthly family income</th>\n",
       "      <td>12.00</td>\n",
       "      <td>11.00</td>\n",
       "      <td>8.00</td>\n",
       "      <td>4.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>6.00</td>\n",
       "      <td>7.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>5.00</td>\n",
       "      <td>3.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Family monthly poverty level index</th>\n",
       "      <td>5.00</td>\n",
       "      <td>4.30</td>\n",
       "      <td>3.05</td>\n",
       "      <td>1.65</td>\n",
       "      <td>0.00</td>\n",
       "      <td>1.32</td>\n",
       "      <td>2.71</td>\n",
       "      <td>0.44</td>\n",
       "      <td>0.86</td>\n",
       "      <td>1.08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Family monthly poverty level category</th>\n",
       "      <td>3.00</td>\n",
       "      <td>3.00</td>\n",
       "      <td>3.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>3.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Family has savings more than $5000</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "      <td>2.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Total savings/cash assets for the family</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Weights of Prototypes</th>\n",
       "      <td>0.18</td>\n",
       "      <td>0.07</td>\n",
       "      <td>0.09</td>\n",
       "      <td>0.06</td>\n",
       "      <td>0.15</td>\n",
       "      <td>0.07</td>\n",
       "      <td>0.12</td>\n",
       "      <td>0.07</td>\n",
       "      <td>0.09</td>\n",
       "      <td>0.10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                              8         132       690   \\\n",
       "Respondent sequence number                73565.00  73689.00  74247.00   \n",
       "Income from wages/salaries                    1.00      1.00      2.00   \n",
       "Income from self employment                   2.00      2.00      2.00   \n",
       "Income from Social Security or RR             2.00      2.00      2.00   \n",
       "Income from other disability pension          2.00      2.00      2.00   \n",
       "Income from retirement/survivor pension       2.00      2.00      2.00   \n",
       "Income from Supplemental Security Income      2.00      2.00      2.00   \n",
       "Income from state/county cash assistance      2.00      2.00      2.00   \n",
       "Income from interest/dividends or rental      2.00      1.00      2.00   \n",
       "Income from other sources                     2.00      2.00      2.00   \n",
       "Monthly family income                        12.00     11.00      8.00   \n",
       "Family monthly poverty level index            5.00      4.30      3.05   \n",
       "Family monthly poverty level category         3.00      3.00      3.00   \n",
       "Family has savings more than $5000             NaN       NaN       NaN   \n",
       "Total savings/cash assets for the family       NaN       NaN       NaN   \n",
       "Weights of Prototypes                         0.18      0.07      0.09   \n",
       "\n",
       "                                              1475      2449      2912  \\\n",
       "Respondent sequence number                75032.00  76006.00  76469.00   \n",
       "Income from wages/salaries                    1.00      1.00      1.00   \n",
       "Income from self employment                   2.00      1.00      2.00   \n",
       "Income from Social Security or RR             2.00      2.00      2.00   \n",
       "Income from other disability pension          2.00      2.00      1.00   \n",
       "Income from retirement/survivor pension       2.00      2.00      1.00   \n",
       "Income from Supplemental Security Income      2.00      2.00      2.00   \n",
       "Income from state/county cash assistance      2.00      2.00      2.00   \n",
       "Income from interest/dividends or rental      1.00      2.00      2.00   \n",
       "Income from other sources                     2.00      1.00      2.00   \n",
       "Monthly family income                         4.00      1.00      6.00   \n",
       "Family monthly poverty level index            1.65      0.00      1.32   \n",
       "Family monthly poverty level category         2.00      1.00      2.00   \n",
       "Family has savings more than $5000            2.00      2.00      1.00   \n",
       "Total savings/cash assets for the family      3.00      1.00       NaN   \n",
       "Weights of Prototypes                         0.06      0.15      0.07   \n",
       "\n",
       "                                              3899      5077      6895  \\\n",
       "Respondent sequence number                77456.00  78634.00  80452.00   \n",
       "Income from wages/salaries                    1.00      1.00      1.00   \n",
       "Income from self employment                   2.00      2.00      2.00   \n",
       "Income from Social Security or RR             1.00      2.00      2.00   \n",
       "Income from other disability pension          2.00      2.00      2.00   \n",
       "Income from retirement/survivor pension       2.00      1.00      2.00   \n",
       "Income from Supplemental Security Income      2.00      2.00      1.00   \n",
       "Income from state/county cash assistance      2.00      2.00      1.00   \n",
       "Income from interest/dividends or rental      2.00      2.00      2.00   \n",
       "Income from other sources                     2.00      2.00      2.00   \n",
       "Monthly family income                         7.00      2.00      5.00   \n",
       "Family monthly poverty level index            2.71      0.44      0.86   \n",
       "Family monthly poverty level category         3.00      1.00      1.00   \n",
       "Family has savings more than $5000             NaN      2.00      2.00   \n",
       "Total savings/cash assets for the family       NaN      1.00      1.00   \n",
       "Weights of Prototypes                         0.12      0.07      0.09   \n",
       "\n",
       "                                              7475  \n",
       "Respondent sequence number                81032.00  \n",
       "Income from wages/salaries                    2.00  \n",
       "Income from self employment                   2.00  \n",
       "Income from Social Security or RR             1.00  \n",
       "Income from other disability pension          2.00  \n",
       "Income from retirement/survivor pension       2.00  \n",
       "Income from Supplemental Security Income      2.00  \n",
       "Income from state/county cash assistance      2.00  \n",
       "Income from interest/dividends or rental      1.00  \n",
       "Income from other sources                     2.00  \n",
       "Monthly family income                         3.00  \n",
       "Family monthly poverty level index            1.08  \n",
       "Family monthly poverty level category         1.00  \n",
       "Family has savings more than $5000            2.00  \n",
       "Total savings/cash assets for the family      1.00  \n",
       "Weights of Prototypes                         0.10  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the prototypes along with their computed weights\n",
    "inc_prototypes = df_inc.iloc[S, :].copy()\n",
    "# Compute normalized importance weights for prototypes\n",
    "inc_prototypes[\"Weights of Prototypes\"] = np.around(W/np.sum(W), 2) \n",
    "inc_prototypes.transpose()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Explanation:\n",
    "The 10 people shown above (i.e. the 10 prototypes) are representative of the income questionnaire according to Protodash. First, in the distribution plot for the family finance questions, we saw that roughly five times as many people reported not having savings in excess of $5000 as reported having them. Our prototypes show a similar spread, which is reassuring. For monthly family income, the prototypes are spread more evenly over the commonly occurring categories. This serves as a spot check that the prototypes match the distribution of values in the dataset.\n",
    "\n",
    "Looking at the other questions in the questionnaire and the corresponding answers given by the prototypical people above, the social scientist can see that most people are employed (3rd question) and work for an organization, earning through salary or wages (1st two questions). Most of them are also young (5th question) and fit to work (4th question). However, they do not seem to have much savings (last question). The insights gained from studying the prototypes could also be conveyed to the government authorities who shape future public policy decisions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"study2\"></a>\n",
    "## Study 2: Find Questionnaire/s that are most representative of Income\n",
    "\n",
    "We now move on to our second study, where we want to see how the remaining 39 questionnaires represent or relate to income. This will give us an idea of which lifestyle factors are likely to affect income the most. To do this, we compute prototypes for each of the questionnaires and evaluate how well they represent the income questionnaire, as measured by the objective function that Protodash uses."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compute prototypes for all questionnaires\n",
    "\n",
    "This step uses the Protodash explainer to compute 10 prototypes for each of the questionnaires and saves them for further evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "processing  ACQ_H.csv\n",
      "processing  ALQ_H.csv\n",
      "processing  BPQ_H.csv\n",
      "processing  CDQ_H.csv\n",
      "processing  CFQ_H.csv\n",
      "processing  CBQ_H.csv\n",
      "processing  CKQ_H.csv\n",
      "processing  HSQ_H.csv\n",
      "processing  DEQ_H.csv\n",
      "processing  DIQ_H.csv\n",
      "processing  DBQ_H.csv\n",
      "processing  DLQ_H.csv\n",
      "processing  DUQ_H.csv\n",
      "processing  ECQ_H.csv\n",
      "processing  FSQ_H.csv\n",
      "processing  HIQ_H.csv\n",
      "processing  HEQ_H.csv\n",
      "processing  HUQ_H.csv\n",
      "processing  HOQ_H.csv\n",
      "processing  IMQ_H.csv\n",
      "processing  INQ_H.csv\n",
      "processing  KIQ_U_H.csv\n",
      "processing  MCQ_H.csv\n",
      "processing  DPQ_H.csv\n",
      "processing  OCQ_H.csv\n",
      "processing  OHQ_H.csv\n",
      "processing  OSQ_H.csv\n",
      "processing  PAQ_H.csv\n",
      "processing  PFQ_H.csv\n",
      "processing  RXQASA_H.csv\n",
      "processing  RHQ_H.csv\n",
      "processing  SXQ_H.csv\n",
      "processing  SLQ_H.csv\n",
      "processing  SMQFAM_H.csv\n",
      "processing  SMQRTU_H.csv\n",
      "processing  SMQSHS_H.csv\n",
      "processing  CSQ_H.csv\n",
      "processing  VTQ_H.csv\n",
      "processing  WHQ_H.csv\n",
      "processing  WHQMEC_H.csv\n"
     ]
    }
   ],
   "source": [
    "# Iterate through all questionnaire datasets and find 10 prototypes for each.\n",
    "\n",
    "prototypes = {}\n",
    "\n",
    "for i in range(len(nhanes_files)):\n",
    "    \n",
    "    f = nhanes_files[i]\n",
    "    \n",
    "    print(\"processing \", f)\n",
    "    \n",
    "    # read data to pandas dataframe\n",
    "    df = nhanes.get_csv_file(f)\n",
    "    \n",
    "    # convert data to numpy\n",
    "    data = df.to_numpy()\n",
    "\n",
    "    #sort the rows by sequence numbers in 1st column \n",
    "    idx = np.argsort(data[:, 0])\n",
    "    data = data[idx, :]\n",
    "\n",
    "    # replace nan's with 0's.\n",
    "    original = data\n",
    "    original[np.isnan(original)] = 0\n",
    "\n",
    "    # delete 1st column (contains sequence numbers)\n",
    "    original = original[:, 1:]\n",
    "\n",
    "    # one hot encode all features as they are categorical\n",
    "    onehot_encoder = OneHotEncoder(sparse=False)\n",
    "    onehot_encoded = onehot_encoder.fit_transform(original)\n",
    "\n",
    "    explainer = ProtodashExplainer()\n",
    "\n",
    "    # call Protodash explainer\n",
    "    # S contains indices of the selected prototypes\n",
    "    # W contains importance weights associated with the selected prototypes \n",
    "\n",
    "    (W, S, _) = explainer.explain(onehot_encoded, onehot_encoded, m=10) \n",
    "\n",
    "    prototypes[f]={}\n",
    "    prototypes[f]['W']= W\n",
    "    prototypes[f]['S']= S\n",
    "    prototypes[f]['data'] = data\n",
    "    prototypes[f]['original'] = original"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Evaluate the set of prototypical respondents from various questionnaires using their income questionnaire\n",
    "\n",
    "Now that we have the prototypes for each of the questionnaires, we evaluate how well the prototypes of each questionnaire represent the Income questionnaire, based on the objective function that Protodash uses. Below is a ranked list of the questionnaires with their objective function values in ascending order. The higher a questionnaire appears in the list, the better its prototypes represent the income questionnaire. The values on the right are the objective values, where lower is better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Questionaire</th>\n",
       "      <th>Prototypes representative of Income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Early Childhood</td>\n",
       "      <td>-96.119374</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Physical Functioning</td>\n",
       "      <td>-95.584090</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Acculturation</td>\n",
       "      <td>-95.355652</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Disability</td>\n",
       "      <td>-93.984344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Physical Activity</td>\n",
       "      <td>-93.945023</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Smoking - Secondhand Smoke Exposure</td>\n",
       "      <td>-93.193000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Cognitive Functioning</td>\n",
       "      <td>-93.050538</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Sleep Disorders</td>\n",
       "      <td>-92.330593</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Diabetes</td>\n",
       "      <td>-91.381703</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Taste &amp; Smell</td>\n",
       "      <td>-90.633708</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Smoking - Recent Tobacco Use</td>\n",
       "      <td>-88.301894</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Preventive Aspirin Use</td>\n",
       "      <td>-88.292714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Kidney Conditions - Urology</td>\n",
       "      <td>-87.817560</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Cardiovascular Health</td>\n",
       "      <td>-85.699530</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Mental Health - Depression Screener</td>\n",
       "      <td>-85.634763</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Volatile Toxicant (Subsample)</td>\n",
       "      <td>-85.580458</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Smoking - Household Smokers</td>\n",
       "      <td>-85.514068</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Alcohol Use</td>\n",
       "      <td>-84.735015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Dermatology</td>\n",
       "      <td>-84.350459</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Consumer Behavior</td>\n",
       "      <td>-84.151348</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Food Security</td>\n",
       "      <td>-82.769481</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Immunization</td>\n",
       "      <td>-82.186591</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Housing Characteristics</td>\n",
       "      <td>-81.655651</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Drug Use</td>\n",
       "      <td>-81.187136</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Occupation</td>\n",
       "      <td>-81.005558</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Oral Health</td>\n",
       "      <td>-78.920883</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Weight History - Youth</td>\n",
       "      <td>-77.454574</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Income</td>\n",
       "      <td>-76.364734</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>Diet Behavior &amp; Nutrition</td>\n",
       "      <td>-75.799336</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Weight History</td>\n",
       "      <td>-75.136793</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>Blood Pressure &amp; Cholesterol</td>\n",
       "      <td>-74.227314</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Reproductive Health</td>\n",
       "      <td>-73.813445</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>Osteoporosis</td>\n",
       "      <td>-71.564194</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Creatine Kinase</td>\n",
       "      <td>-67.288085</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>Hepatitis</td>\n",
       "      <td>-67.152639</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>Medical Conditions</td>\n",
       "      <td>-65.222360</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>Current Health Status</td>\n",
       "      <td>-44.587781</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>Hospital Utilization &amp; Access to Care</td>\n",
       "      <td>10.916130</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>Sexual Behavior</td>\n",
       "      <td>53.155880</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>Health Insurance</td>\n",
       "      <td>146.419436</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                             Questionaire  Prototypes representative of Income\n",
       "0                         Early Childhood                           -96.119374\n",
       "1                    Physical Functioning                           -95.584090\n",
       "2                           Acculturation                           -95.355652\n",
       "3                              Disability                           -93.984344\n",
       "4                       Physical Activity                           -93.945023\n",
       "5     Smoking - Secondhand Smoke Exposure                           -93.193000\n",
       "6                   Cognitive Functioning                           -93.050538\n",
       "7                         Sleep Disorders                           -92.330593\n",
       "8                                Diabetes                           -91.381703\n",
       "9                           Taste & Smell                           -90.633708\n",
       "10           Smoking - Recent Tobacco Use                           -88.301894\n",
       "11                 Preventive Aspirin Use                           -88.292714\n",
       "12            Kidney Conditions - Urology                           -87.817560\n",
       "13                  Cardiovascular Health                           -85.699530\n",
       "14    Mental Health - Depression Screener                           -85.634763\n",
       "15          Volatile Toxicant (Subsample)                           -85.580458\n",
       "16            Smoking - Household Smokers                           -85.514068\n",
       "17                            Alcohol Use                           -84.735015\n",
       "18                            Dermatology                           -84.350459\n",
       "19                      Consumer Behavior                           -84.151348\n",
       "20                          Food Security                           -82.769481\n",
       "21                           Immunization                           -82.186591\n",
       "22                Housing Characteristics                           -81.655651\n",
       "23                               Drug Use                           -81.187136\n",
       "24                             Occupation                           -81.005558\n",
       "25                            Oral Health                           -78.920883\n",
       "26                 Weight History - Youth                           -77.454574\n",
       "27                                 Income                           -76.364734\n",
       "28              Diet Behavior & Nutrition                           -75.799336\n",
       "29                         Weight History                           -75.136793\n",
       "30           Blood Pressure & Cholesterol                           -74.227314\n",
       "31                    Reproductive Health                           -73.813445\n",
       "32                           Osteoporosis                           -71.564194\n",
       "33                        Creatine Kinase                           -67.288085\n",
       "34                              Hepatitis                           -67.152639\n",
       "35                     Medical Conditions                           -65.222360\n",
       "36                  Current Health Status                           -44.587781\n",
       "37  Hospital Utilization & Access to Care                            10.916130\n",
       "38                        Sexual Behavior                            53.155880\n",
       "39                       Health Insurance                           146.419436"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#load income dataset INQ_H and its prototypes\n",
    "X = prototypes['INQ_H.csv']['original']\n",
    "Xdata = prototypes['INQ_H.csv']['data']\n",
    "\n",
    "# Iterate through all questionnaires and evaluate how well their prototypes represent the income dataset. \n",
    "objs = []\n",
    "for f in nhanes_files:\n",
    "\n",
    "    # load a questionnaire's data, its prototype indices S and weights W\n",
    "    \n",
    "    Ydata = prototypes[f]['data']\n",
    "    S = prototypes[f]['S']\n",
    "    W = prototypes[f]['W']\n",
    "    \n",
    "    \n",
    "    # sort the prototype indices in S in ascending order and reorder the weights W to match\n",
    "    idx = np.argsort(S)\n",
    "    S = S[idx]\n",
    "    W = W[idx]\n",
    "    \n",
    "    # select the rows of X (income) whose identifier in column 0 matches a prototype of the current questionnaire\n",
    "    XS = X[np.isin(Xdata[:, 0], Ydata[S, 0]), :]\n",
    "    \n",
    "    #print(Ydata[S, 0])\n",
    "    #print(Xdata[np.isin(Xdata[:, 0], Ydata[S, 0]), 0])   \n",
    "    \n",
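    "    # u: average dot-product similarity of each selected row to all rows of X\n",
    "    temp = np.dot(XS, np.transpose(X))\n",
    "    u = np.sum(temp, axis=1)/temp.shape[1]\n",
    "    \n",
    "    # K: pairwise dot-product similarities among the selected rows\n",
    "    K = np.dot(XS, XS.T)\n",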
    "    \n",
    "    # evaluate prototypes on income based on our objective function with dot product as similarity measure\n",
    "    obj = 0.5 * np.dot(np.dot(W.T, K), W) - np.dot(W.T, u)\n",
    "    objs.append(obj)    \n",
    "    \n",
    "\n",
    "# sort the objectives (ascending order) \n",
    "index = np.argsort(np.array(objs))\n",
    "\n",
    "# load the results in a dataframe to print\n",
    "evalresult = [[nhanesinfo[i], objs[i]] for i in index]\n",
    "    \n",
    "    \n",
    "df_evalresult = pd.DataFrame.from_records(evalresult)\n",
    "df_evalresult.columns = ['Questionnaire', 'Prototypes representative of Income']\n",
    "df_evalresult"
   ]
  },
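  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reading of the code cell above (a sketch inferred from the code itself, not a formula quoted from the Protodash paper), the score computed for each questionnaire can be written as\n",
    "\n",
    "$$f(w) = \\frac{1}{2}\\, w^\\top K w - w^\\top u, \\qquad K = X_S X_S^\\top, \\qquad u = \\frac{1}{n}\\, X_S X^\\top \\mathbf{1}_n$$\n",
    "\n",
    "Here $X$ stands for the income matrix `X` with $n$ rows, $X_S$ for the rows selected into `XS`, $w$ for the weight vector `W`, and $\\mathbf{1}_n$ for a vector of $n$ ones; these symbols are introduced only for this note. The similarity measure is the plain dot product, so $K$ holds the similarities among the selected rows and $u$ holds each selected row's average similarity to all income rows. Lower values of $f(w)$ indicate a better representation of income, which is why the results are sorted in ascending order."
   ]
  },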
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Insight from Protodash\n",
    "\n",
    "Looking at the table above, the Early Childhood questionnaire attains the lowest objective value, meaning its prototypes represent income best. The Early Childhood questionnaire has information about the environment that the child was born and raised in. This is consistent with a [long-term study](https://www.theatlantic.com/business/archive/2016/07/social-mobility-america/491240/) that reports a significant decrease in social mobility in recent times, stressing that your childhood affects how monetarily successful you are likely to be. It is interesting that our method was able to uncover this relationship with access to just these survey questionnaires. Other such insights could be obtained, and those that a social scientist or policy maker finds interesting could potentially spawn long-term studies like the one just mentioned."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Edit Metadata",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}