{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning and Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-print"
    ]
   },
   "source": [
    "This is the first in a series of notebooks that make up a [case study in exploratory data analysis](https://allendowney.github.io/PoliticalAlignmentCaseStudy/).\n",
    "This case study is part of the [*Elements of Data Science*](https://allendowney.github.io/ElementsOfDataScience/) curriculum."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook, we \n",
    "\n",
    "1. Read data from the General Social Survey (GSS),\n",
    "\n",
    "2. Clean the data, particularly dealing with special codes that indicate missing data,\n",
    "\n",
    "3. Validate the data by comparing the values in the dataset with values documented in the codebook.\n",
    "\n",
    "4. Generate resampled datasets that correct for deliberate oversampling in the dataset, and\n",
    "\n",
    "5. Store the resampled data in a binary format (HDF5) that makes it easier to work with in the notebooks that follow this one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following cell loads the packages we need.  If you have everything installed, there should be no error messages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading the data\n",
    "\n",
    "The data we'll use is from the General Social Survey (GSS).  Using the [GSS Data Explorer](https://gssdataexplorer.norc.org), I selected a subset of the variables in the GSS and made it available along with this notebook.\n",
    "The following cell downloads this extract."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from os.path import basename, exists\n",
    "\n",
    "\n",
    "def download(url):\n",
    "    filename = basename(url)\n",
    "    if not exists(filename):\n",
    "        from urllib.request import urlretrieve\n",
    "\n",
    "        local, _ = urlretrieve(url, filename)\n",
    "        print(\"Downloaded \" + local)\n",
    "\n",
    "\n",
    "download(\"https://github.com/AllenDowney/GssExtract/raw/main/data/interim/gss_pacs.hdf\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(68846, 204)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gss = pd.read_hdf(\"gss_pacs.hdf\", \"gss\")\n",
    "gss.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use `head` to see what the `DataFrame` looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>id</th>\n",
       "      <th>divorce</th>\n",
       "      <th>sibs</th>\n",
       "      <th>childs</th>\n",
       "      <th>age</th>\n",
       "      <th>educ</th>\n",
       "      <th>degree</th>\n",
       "      <th>sex</th>\n",
       "      <th>race</th>\n",
       "      <th>...</th>\n",
       "      <th>ballot</th>\n",
       "      <th>wtssall</th>\n",
       "      <th>sexbirth</th>\n",
       "      <th>sexnow</th>\n",
       "      <th>eqwlth</th>\n",
       "      <th>realinc</th>\n",
       "      <th>realrinc</th>\n",
       "      <th>coninc</th>\n",
       "      <th>conrinc</th>\n",
       "      <th>commun</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1972</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>16.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.4446</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18951.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>25926.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1972</td>\n",
       "      <td>2</td>\n",
       "      <td>2.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>70.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.8893</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>24366.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>33333.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1972</td>\n",
       "      <td>3</td>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>48.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.8893</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>24366.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>33333.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1972</td>\n",
       "      <td>4</td>\n",
       "      <td>2.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>17.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.8893</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>30458.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>41667.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1972</td>\n",
       "      <td>5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>61.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.8893</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50763.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>69444.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 204 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   year  id  divorce  sibs  childs   age  educ  degree  sex  race  ...  \\\n",
       "0  1972   1      NaN   3.0     0.0  23.0  16.0     3.0  2.0   1.0  ...   \n",
       "1  1972   2      2.0   4.0     5.0  70.0  10.0     0.0  1.0   1.0  ...   \n",
       "2  1972   3      2.0   5.0     4.0  48.0  12.0     1.0  2.0   1.0  ...   \n",
       "3  1972   4      2.0   5.0     0.0  27.0  17.0     3.0  2.0   1.0  ...   \n",
       "4  1972   5      2.0   2.0     2.0  61.0  12.0     1.0  2.0   1.0  ...   \n",
       "\n",
       "   ballot  wtssall  sexbirth  sexnow  eqwlth  realinc  realrinc   coninc  \\\n",
       "0     NaN   0.4446       NaN     NaN     NaN  18951.0       NaN  25926.0   \n",
       "1     NaN   0.8893       NaN     NaN     NaN  24366.0       NaN  33333.0   \n",
       "2     NaN   0.8893       NaN     NaN     NaN  24366.0       NaN  33333.0   \n",
       "3     NaN   0.8893       NaN     NaN     NaN  30458.0       NaN  41667.0   \n",
       "4     NaN   0.8893       NaN     NaN     NaN  50763.0       NaN  69444.0   \n",
       "\n",
       "   conrinc  commun  \n",
       "0      NaN     NaN  \n",
       "1      NaN     NaN  \n",
       "2      NaN     NaN  \n",
       "3      NaN     NaN  \n",
       "4      NaN     NaN  \n",
       "\n",
       "[5 rows x 204 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gss.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset has 64814 rows, one for each respondent, and 204 columns, one for each variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Validation\n",
    "\n",
    "Now that we've got the data loaded, it is important to validate it, which means checking for errors.\n",
    "\n",
    "The kinds of errors you have to check for depend on the nature of the data, the collection process, how the data is stored and transmitted, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this dataset, there are three kinds of validation we'll think about:\n",
    "\n",
    "1) We need to check the **integrity** of the dataset; that is, whether the data were corrupted or changed during transmission, storage, or conversion from one format to another.\n",
    "\n",
    "2) We need to check our **interpretation** of the data; for example, whether the numbers used to encode the data mean what we think they mean.\n",
    "\n",
    "3) We will also keep an eye out for **invalid** data; for example, missing data might be represented using special codes, or there might be patterns in the data that indicate problems with the survey process and the recording of the data.  \n",
    "\n",
    "In a different dataset I worked with, I found a surprising number of respondents whose height was supposedly 62 centimeters.  After investigating, I concluded that they were probably 6 feet, 2 inches, and their heights were recorded incorrectly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Validating data can be a tedious process, but it is important.  If you interpret data incorrectly and publish invalid results, you will be embarrassed in the best case, and in the worst case you might do real harm.  See [this article](https://www.vox.com/future-perfect/2019/6/4/18650969/married-women-miserable-fake-paul-dolan-happiness) for a recent example.\n",
    "\n",
    "However, I don't expect you to validate every variable in this dataset.  Instead, I will demonstrate the process, and then ask you to validate one additional variable as an exercise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first variable we'll validate is called `polviews`.  It records responses to the following question:\n",
    "\n",
    ">We hear a lot of talk these days about liberals and conservatives. \n",
    "I'm going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal--point 1--to extremely conservative--point 7. Where would you place yourself on this scale?\n",
    "\n",
    "You can [read the documentation of this variable in the GSS codebook](https://gssdataexplorer.norc.org/variables/178/vshow)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The responses are encoded like this:\n",
    "\n",
    "```\n",
    "1\tExtremely liberal\n",
    "2\tLiberal\n",
    "3\tSlightly liberal\n",
    "4\tModerate\n",
    "5\tSlghtly conservative\n",
    "6\tConservative\n",
    "7\tExtremely conservative\n",
    "8\tDon't know\n",
    "9\tNo answer\n",
    "0\tNot applicable\n",
    "```\n",
    "\n",
    "The following function, `values`, takes a Series that represents a single variable and returns the values in the series and their frequencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def values(series):\n",
    "    \"\"\"Count the values and sort.\n",
    "\n",
    "    series: pd.Series\n",
    "\n",
    "    returns: series mapping from values to frequencies\n",
    "    \"\"\"\n",
    "    return series.value_counts().sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the values for the variable `polviews`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "polviews = gss[\"polviews\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0     1889\n",
       "2.0     7137\n",
       "3.0     7500\n",
       "4.0    22747\n",
       "5.0     9166\n",
       "6.0     8847\n",
       "7.0     2006\n",
       "Name: polviews, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "values(polviews)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the integrity of the data and confirm that we have loaded it correctly, we'll do a \"spot check\"; that is, we'll pick one year and compare the values we see in the dataset to the values reported in the codebook.\n",
    "\n",
    "We can select values from a single year like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "one_year = gss[\"year\"] == 1974"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And look at the values and their frequencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0     22\n",
       "2.0    201\n",
       "3.0    207\n",
       "4.0    564\n",
       "5.0    221\n",
       "6.0    160\n",
       "7.0     35\n",
       "Name: polviews, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "values(polviews[one_year])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you [compare these results to the values in the codebook](https://gssdataexplorer.norc.org/variables/178/vshow), you should see that they agree.\n",
    "\n",
    "**Exercise:** Go back and change 1974 to another year, and compare the results to the codebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Missing data\n",
    "\n",
    "For many variables, missing values are encoded with numerical codes that we need to replace before we do any analysis.\n",
    "\n",
    "For `polviews`, the values 8, 9, and 0 represent \"Don't know\", \"No answer\", and \"Not applicable\".\n",
    "\"Not applicable\" usually means the respondent was not asked a particular question.\n",
    "\n",
    "To keep things simple, we'll treat all of these values as equivalent, but we lose some information by doing that.  For example, if a respondent refuses to answer a question, that might suggest something about their answer.  If so, treating their response as missing data might bias the results.\n",
    "\n",
    "Fortunately, for most questions the number of respondents who refused to answer is small."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I'll replace the numeric codes 8, 9, and 0 with `np.nan`, which is a special value used to indicate missing data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "source": [
    "[NOTE: When pd.NA is mature, we might switch to it.\n",
    "https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "NA = np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        NaN\n",
       "1        NaN\n",
       "2        NaN\n",
       "3        NaN\n",
       "4        NaN\n",
       "        ... \n",
       "68841    2.0\n",
       "68842    5.0\n",
       "68843    4.0\n",
       "68844    1.0\n",
       "68845    4.0\n",
       "Name: polviews, Length: 68846, dtype: float64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clean = polviews.replace([0, 8, 9], NA)\n",
    "clean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When `np.nan` is displayed, it appears as `NaN`, which stands for \"not a number\".\n",
    "\n",
    "We can use `notna` and `sum` to count the valid responses:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "59292"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clean.notna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we use `isna` to count the missing responses:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9554"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clean.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can [check these results against the codebook](https://gssdataexplorer.norc.org/variables/178/vshow); at the bottom of that page, it reports the number of \"Valid cases\" and \"Missing cases\".\n",
    "\n",
    "In this example, the results don't match.  The codebook reports 53081 valid cases and 9385 missing cases.\n",
    "It turns out that the numbers in the codebook are not up to date, so for now we'll have to assume that the numbers in the dataset are correct."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Replacing missing data\n",
    "\n",
    "For the other variables in this dataset, I read through the code book and identified the special values that indicate missing data.\n",
    "I recorded that information in the following function, which is intended to replace special values with `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def replace_invalid(df, columns, bad):\n",
    "    for column in columns:\n",
    "        df[column].replace(bad, NA, inplace=True)\n",
    "\n",
    "\n",
    "def gss_replace_invalid(df):\n",
    "    \"\"\"Replace invalid data with NaN.\n",
    "\n",
    "    df: DataFrame\n",
    "    \"\"\"\n",
    "    # different variables use different codes for invalid data\n",
    "    df.cohort.replace([0, 9999], NA, inplace=True)\n",
    "\n",
    "    # since there are a lot of variables that use 0, 8, and 9 for invalid data,\n",
    "    # I'll use a loop to replace all of them\n",
    "    columns = [\n",
    "        \"abany\",\n",
    "        \"abdefect\",\n",
    "        \"abhlth\",\n",
    "        \"abnomore\",\n",
    "        \"abpoor\",\n",
    "        \"abrape\",\n",
    "        \"absingle\",\n",
    "        \"acqntsex\",\n",
    "        \"affrmact\",\n",
    "        \"bible\",\n",
    "        \"cappun\",\n",
    "        \"colath\",\n",
    "        \"colcom\",\n",
    "        \"colhomo\",\n",
    "        \"colmil\",\n",
    "        \"colmslm\",\n",
    "        \"colrac\",\n",
    "        \"colsoc\",\n",
    "        \"compuse\",\n",
    "        \"conarmy\",\n",
    "        \"conbus\",\n",
    "        \"conclerg\",\n",
    "        \"coneduc\",\n",
    "        \"confed\",\n",
    "        \"confinan\",\n",
    "        \"conjudge\",\n",
    "        \"conlabor\",\n",
    "        \"conlegis\",\n",
    "        \"conmedic\",\n",
    "        \"conpress\",\n",
    "        \"consci\",\n",
    "        \"contv\",\n",
    "        \"databank\",\n",
    "        \"discaffm\",\n",
    "        \"discaffw\",\n",
    "        \"divlaw\",\n",
    "        \"divorce\",\n",
    "        \"eqwlth\",\n",
    "        \"fair\",\n",
    "        \"fear\",\n",
    "        \"fechld\",\n",
    "        \"fefam\",\n",
    "        \"fehelp\",\n",
    "        \"fehire\",\n",
    "        \"fehome\",\n",
    "        \"fejobaff\",\n",
    "        \"fepol\",\n",
    "        \"fepres\",\n",
    "        \"fepresch\",\n",
    "        \"fework\",\n",
    "        \"finrela\",\n",
    "        \"frndsex\",\n",
    "        \"fund\",\n",
    "        \"god\",\n",
    "        \"goodlife\",\n",
    "        \"grass\",\n",
    "        \"gunlaw\",\n",
    "        \"hapmar\",\n",
    "        \"happy\",\n",
    "        \"health\",\n",
    "        \"helpful\",\n",
    "        \"hhrace\",\n",
    "        \"homosex\",\n",
    "        \"hunt\",\n",
    "        \"libath\",\n",
    "        \"libcom\",\n",
    "        \"libhomo\",\n",
    "        \"libmil\",\n",
    "        \"libmslm\",\n",
    "        \"librac\",\n",
    "        \"libsoc\",\n",
    "        \"life\",\n",
    "        \"matesex\",\n",
    "        \"meovrwrk\",\n",
    "        \"miracles\",\n",
    "        \"nataid\",\n",
    "        \"natarms\",\n",
    "        \"natchld\",\n",
    "        \"natcity\",\n",
    "        \"natcrime\",\n",
    "        \"natdrug\",\n",
    "        \"nateduc\",\n",
    "        \"natenrgy\",\n",
    "        \"natenvir\",\n",
    "        \"natfare\",\n",
    "        \"natheal\",\n",
    "        \"natmass\",\n",
    "        \"natpark\",\n",
    "        \"natrace\",\n",
    "        \"natroad\",\n",
    "        \"natsci\",\n",
    "        \"natsoc\",\n",
    "        \"natspac\",\n",
    "        \"othersex\",\n",
    "        \"paidsex\",\n",
    "        \"pikupsex\",\n",
    "        \"polabuse\",\n",
    "        \"polattak\",\n",
    "        \"polescap\",\n",
    "        \"polhitok\",\n",
    "        \"polmurdr\",\n",
    "        \"polviews\",\n",
    "        \"popespks\",\n",
    "        \"pornlaw\",\n",
    "        \"postlife\",\n",
    "        \"pray\",\n",
    "        \"prayer\",\n",
    "        \"premarsx\",\n",
    "        \"pres00\",\n",
    "        \"pres04\",\n",
    "        \"pres08\",\n",
    "        \"pres12\",\n",
    "        \"pres96\",\n",
    "        \"racchurh\",\n",
    "        \"racclos\",\n",
    "        \"racdif1\",\n",
    "        \"racdif2\",\n",
    "        \"racdif3\",\n",
    "        \"racdif4\",\n",
    "        \"racdin\",\n",
    "        \"racdis\",\n",
    "        \"racfew\",\n",
    "        \"rachaf\",\n",
    "        \"rachome\",\n",
    "        \"racinteg\",\n",
    "        \"raclive\",\n",
    "        \"racmar\",\n",
    "        \"racmost\",\n",
    "        \"racopen\",\n",
    "        \"racpres\",\n",
    "        \"racpush\",\n",
    "        \"racschol\",\n",
    "        \"racseg\",\n",
    "        \"racwork\",\n",
    "        \"reborn\",\n",
    "        \"relexp\",\n",
    "        \"relexper\",\n",
    "        \"reliten\",\n",
    "        \"relpersn\",\n",
    "        \"res16\",\n",
    "        \"rowngun\",\n",
    "        \"satfin\",\n",
    "        \"satjob\",\n",
    "        \"savesoul\",\n",
    "        \"sexbirth\",\n",
    "        \"sexeduc\",\n",
    "        \"sexnow\",\n",
    "        \"sexornt\",\n",
    "        \"sexsex\",\n",
    "        \"sexsex5\",\n",
    "        \"spanking\",\n",
    "        \"spkath\",\n",
    "        \"spkcom\",\n",
    "        \"spkhomo\",\n",
    "        \"spklang\",\n",
    "        \"spkmil\",\n",
    "        \"spkmslm\",\n",
    "        \"spkrac\",\n",
    "        \"spksoc\",\n",
    "        \"sprtprsn\",\n",
    "        \"teensex\",\n",
    "        \"trdunion\",\n",
    "        \"trust\",\n",
    "        \"union\",\n",
    "        \"wkharsex\",\n",
    "        \"wkracism\",\n",
    "        \"wksexism\",\n",
    "        \"xmarsex\",\n",
    "        \"commun\",\n",
    "    ]\n",
    "    replace_invalid(df, columns, [0, 8, 9])\n",
    "\n",
    "    columns = [\"degree\", \"partyid\"]\n",
    "    replace_invalid(df, columns, [8, 9])\n",
    "\n",
    "    df.phone.replace([0, 2, 9], NA, inplace=True)\n",
    "    df.owngun.replace([0, 3, 8, 9], NA, inplace=True)\n",
    "    df.pistol.replace([0, 3, 8, 9], NA, inplace=True)\n",
    "    df[\"class\"].replace([0, 5, 8, 9], NA, inplace=True)\n",
    "\n",
    "    df.chldidel.replace([-1, 8, 9], NA, inplace=True)\n",
    "    df.sexfreq.replace([-1, 8, 9], NA, inplace=True)\n",
    "\n",
    "    df.attend.replace([9], NA, inplace=True)\n",
    "    df.childs.replace([9], NA, inplace=True)\n",
    "    df.adults.replace([9], NA, inplace=True)\n",
    "\n",
    "    df.age.replace([0, 98, 99], NA, inplace=True)\n",
    "    df.relig.replace([0, 98, 99], NA, inplace=True)\n",
    "    df.relig16.replace([0, 98, 99], NA, inplace=True)\n",
    "    df.relactiv.replace([0, 98, 89], NA, inplace=True)\n",
    "\n",
    "    # note: sibs contains some unlikely numbers\n",
    "    df.sibs.replace([-1, 98, 99], NA, inplace=True)\n",
    "    df.hrsrelax.replace([-1, 98, 99], NA, inplace=True)\n",
    "\n",
    "    df.educ.replace([97, 98, 99], NA, inplace=True)\n",
    "\n",
    "    df.realinc.replace([0], NA, inplace=True)\n",
    "    df.realrinc.replace([0], NA, inplace=True)\n",
    "\n",
    "    df.income.replace([0, 13, 98, 99], NA, inplace=True)\n",
    "    df.rincome.replace([0, 13, 98, 99], NA, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss_replace_invalid(gss)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "# check that we've cleaned all columns that need it;\n",
    "# all columns we've added NaN to should be floats\n",
    "\n",
    "# some columns have no missing values\n",
    "clean_columns = [\"id\", \"year\", \"ballot\", \"sex\", \"race\", \"reg16\", \"region\", \"srcbelt\"]\n",
    "\n",
    "for column in gss.columns:\n",
    "    if gss[column].dtype == int and column not in clean_columns:\n",
    "        print(f\"'{column}', \", end=\"\")\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point, I have only moderate confidence that this code is correct.  I'm not sure I have dealt with every variable in the dataset, and I'm not sure that the special values for every variable are correct.\n",
    "\n",
    "So I will ask for your help."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**: In order to validate the other variables, I'd like each person who works with this notebook to validate one variable.\n",
    "\n",
    "If you run the following cell, it will choose one of the columns from the dataset at random.  That's the variable you will check.\n",
    "\n",
    "If you get `year` or `id_`, run the cell again to get a different variable name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'trust'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.seed(None)\n",
    "np.random.choice(gss.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Go back through the previous two sections of this notebook and replace `polviews` with your randomly chosen variable.  Then run the cells again and go to [this online survey to report the results](https://forms.gle/tmST8YCu4qLc414F7). \n",
    "\n",
    "Note: Not all questions were asked during all years.  If your variable doesn't have data for 1974 or 2018, you might have to choose different years."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Resampling\n",
    "\n",
    "The GSS uses stratified sampling, which means that some groups are deliberately oversampled to help with statistical validity.\n",
    "\n",
    "As a result, each respondent has a sampling weight which is proportional to the number of people in the population they represent.\n",
    "\n",
    "Before running any analysis, we can compensate for stratified sampling by \"resampling\", that is, by drawing a random sample from the dataset, where each respondent's chance of appearing in the sample is proportional to their sampling weight."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def resample_rows_weighted(df, column):\n",
    "    \"\"\"Resamples a DataFrame using probabilities proportional to given column.\n",
    "\n",
    "    df: DataFrame\n",
    "    column: string column name to use as weights\n",
    "\n",
    "    returns: DataFrame\n",
    "    \"\"\"\n",
    "    weights = df[column]\n",
    "    sample = df.sample(n=len(df), replace=True, weights=weights)\n",
    "    return sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def resample_by_year(df, column):\n",
    "    \"\"\"Resample rows within each year.\n",
    "\n",
    "    df: DataFrame\n",
    "    column: string name of weight variable\n",
    "\n",
    "    returns DataFrame\n",
    "    \"\"\"\n",
    "    grouped = df.groupby(\"year\")\n",
    "    samples = [resample_rows_weighted(group, column) for _, group in grouped]\n",
    "    sample = pd.concat(samples, ignore_index=True)\n",
    "    return sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(19)\n",
    "sample = resample_by_year(gss, \"wtssall\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving the results\n",
    "\n",
    "I'll save the results to an HDF5 file, which is a binary format that makes it much faster to read the data back.\n",
    "\n",
    "First I'll save the original (not resampled) data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An HDF5 file is like a dictionary on disk.  It contains keys and corresponding values.\n",
    "\n",
    "`to_hdf` takes three arguments:\n",
    "\n",
    "* The filename, `gss_pacs_clean.hdf`.\n",
    "\n",
    "* The key, `gss`\n",
    "\n",
    "* The compression level, which controls how hard the algorithm works to compress the file.\n",
    "\n",
    "So this file contains a single key, `gss`, which maps to the DataFrame with the original GSS data.\n",
    "\n",
    "The argument `w` says that if the file already exists, we should overwrite it.\n",
    "\n",
    "With compression level `6`, it reduces the size of the file by a factor of more than 10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if the file already exists, remove it\n",
    "import os\n",
    "\n",
    "if os.path.isfile(\"gss_pacs_clean.hdf\"):\n",
    "    !rm gss_pacs_resampled.hdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save the original\n",
    "\n",
    "gss.to_hdf(\"gss_pacs_clean.hdf\", \"gss\", \"w\", complevel=6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 downey downey 7.6M Aug 28 17:53 gss_pacs_clean.hdf\r\n"
     ]
    }
   ],
   "source": [
    "!ls -lh gss_pacs_clean.hdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And I'll create a second file with three random resamplings of the original dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if the file already exists, remove it\n",
    "import os\n",
    "\n",
    "if os.path.isfile(\"gss_pacs_resampled.hdf\"):\n",
    "    !rm gss_pacs_resampled.hdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This file contains three keys, `gss0`, `gss1`, and `gss2`, which map to three DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate and store three resamplings\n",
    "keys = [\"gss0\", \"gss1\", \"gss2\"]\n",
    "\n",
    "for i in range(3):\n",
    "    np.random.seed(i)\n",
    "    sample = resample_by_year(gss, \"wtssall\")\n",
    "\n",
    "    sample.to_hdf(\"gss_pacs_resampled.hdf\", keys[i], complevel=6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 downey downey 23M Aug 28 17:54 gss_pacs_resampled.hdf\r\n"
     ]
    }
   ],
   "source": [
    "!ls -lh gss_pacs_resampled.hdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the other notebooks in this case study, we'll load this resampled data rather than reading and cleaning the data every time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "source": [
    "Political Alignment Case Study\n",
    "\n",
    "Copyright 2020 Allen B. Downey\n",
    "\n",
    "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}