{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Homework 1\n",
    "\n",
    "Load and validate GSS data\n",
    "\n",
    "Allen Downey\n",
    "\n",
    "[MIT License](https://en.wikipedia.org/wiki/MIT_License)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "sns.set(style='white')\n",
    "\n",
    "import utils\n",
    "from utils import decorate\n",
    "from thinkstats2 import Pmf, Cdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading and validation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_gss(dirname):\n",
    "    \"\"\"Reads GSS files from the given directory.\n",
    "    \n",
    "    dirname: string\n",
    "    \n",
    "    returns: DataFrame\n",
    "    \"\"\"\n",
    "    dct = utils.read_stata_dict(dirname + '/GSS.dct')\n",
    "    gss = dct.read_fixed_width(dirname + '/GSS.dat.gz',\n",
    "                             compression='gzip')\n",
    "    return gss"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read the variables I selected from the GSS dataset.  You can look up these variables at https://gssdataexplorer.norc.org/variables/vfilter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss = read_gss('gss_eda')\n",
    "print(gss.shape)\n",
    "gss.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most variables use special codes to indicate missing data.  We have to be careful not to use these codes as numerical data; one way to manage that is to replace them with `NaN`, which Pandas recognizes as a missing value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def replace_invalid(df):\n",
    "    df.realinc.replace([0], np.nan, inplace=True)                  \n",
    "    df.educ.replace([98,99], np.nan, inplace=True)\n",
    "    # 89 means 89 or older\n",
    "    df.age.replace([98, 99], np.nan, inplace=True) \n",
    "    df.cohort.replace([9999], np.nan, inplace=True)\n",
    "    df.adults.replace([9], np.nan, inplace=True)\n",
    "\n",
    "replace_invalid(gss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are summary statistics for the variables I have validated and cleaned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss['year'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss['sex'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss['age'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss['cohort'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss['race'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss['educ'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss['realinc'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "gss['wtssall'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise** \n",
    "\n",
    "1. Look through the column headings to find a few variables that look interesting.  Look them up on the GSS data explorer.  \n",
    "\n",
    "2. Use `value_counts` to see what values appear in the dataset, and compare the results with the counts in the code book.  \n",
    "\n",
    "3. Identify special values that indicate missing data and replace them with `NaN`.\n",
    "\n",
    "4. Use `describe` to compute summary statistics.  What do you notice?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize distributions\n",
    "\n",
    "Let's visualize the distributions of the variables we've selected.\n",
    "\n",
    "Here's a Hist of the values in `educ`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from thinkstats2 import Hist, Pmf, Cdf\n",
    "import thinkplot\n",
    "\n",
    "hist_educ = Hist(gss.educ)\n",
    "thinkplot.hist(hist_educ)\n",
    "decorate(xlabel='Years of education', \n",
    "         ylabel='Count')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Hist` as defined in `thinkstats2` is different from `hist` as defined in Matplotlib.  The difference is that `Hist` keeps all unique values and does not put them in bins.  Also, `hist` does not handle `NaN`.\n",
    "\n",
    "One of the hazards of using `hist` is that the shape of the result depends on the bin size.\n",
    "\n",
    "**Exercise:** \n",
    "\n",
    "1. Run the following cell and compare the result to the `Hist` above.\n",
    "\n",
    "2. Add the keyword argument `bins=11` to `plt.hist` and see how it changes the results.\n",
    "\n",
    "3. Experiment with other numbers of bins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.hist(gss.educ.dropna())\n",
    "decorate(xlabel='Years of education', \n",
    "         ylabel='Count')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, a drawback of `Hist` and `Pmf` is that they basically don't work when the number of unique values is large, as in this example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist_realinc = Hist(gss.realinc)\n",
    "thinkplot.hist(hist_realinc)\n",
    "decorate(xlabel='Real income (1986 USD)', \n",
    "         ylabel='Count')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:**\n",
    "    \n",
    "1. Make and plot a `Hist` of `age`.\n",
    "\n",
    "2. Make and plot a `Pmf` of `educ`.\n",
    "\n",
    "3. What fraction of people have 12, 14, and 16 years of education?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:**\n",
    "    \n",
    "1. Make and plot a `Cdf` of `educ`.\n",
    "\n",
    "2. What fraction of people have more than 12 years of education?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:**\n",
    "    \n",
    "1. Make and plot a `Cdf` of `age`.\n",
    "\n",
    "2. What is the median age?  What is the inter-quartile range (IQR)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:**\n",
    "\n",
    "Find another numerical variable, plot a histogram, PMF, and CDF, and compute any statistics of interest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:**\n",
    "\n",
    "1. Compute the CDF of `realinc` for male and female respondents, and plot both CDFs on the same axes.\n",
    "\n",
    "2. What is the difference in median income between the two groups?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:**\n",
    "\n",
    "Use a variable to break the dataset into groups and plot multiple CDFs to compare distribution of something within groups.\n",
    "\n",
    "Note: Try to find something interesting, but be cautious about overinterpreting the results.  Between any two groups, there are often many differences, with many possible causes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save the cleaned data\n",
    "\n",
    "Now that we have the data in good shape, we'll save it in a binary format (HDF5), which will make it faster to load later.\n",
    "\n",
    "Also, we have to do some resampling to make the results representative.  We'll talk about this in class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(19)\n",
    "sample = utils.resample_by_year(gss, 'wtssall')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm gss.hdf5\n",
    "sample.to_hdf('gss.hdf5', 'gss')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load it and see how fast it is!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time gss = pd.read_hdf('gss.hdf5', 'gss')\n",
    "gss.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}