{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning and Validation" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "remove-print" ] }, "source": [ "This is the first in a series of notebooks that make up a [case study in exploratory data analysis](https://allendowney.github.io/PoliticalAlignmentCaseStudy/).\n", "This case study is part of the [*Elements of Data Science*](https://allendowney.github.io/ElementsOfDataScience/) curriculum." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we\n", "\n", "1. Read data from the General Social Survey (GSS),\n", "\n", "2. Clean the data, particularly dealing with special codes that indicate missing data,\n", "\n", "3. Validate the data by comparing the values in the dataset with values documented in the codebook,\n", "\n", "4. Generate resampled datasets that correct for deliberate oversampling in the dataset, and\n", "\n", "5. Store the resampled data in a binary format (HDF5) that makes it easier to work with in the notebooks that follow this one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following cell loads the packages we need. If you have everything installed, there should be no error messages." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading the data\n", "\n", "The data we'll use is from the General Social Survey (GSS). Using the [GSS Data Explorer](https://gssdataexplorer.norc.org), I selected a subset of the variables in the GSS and made it available along with this notebook.\n", "The following cell downloads this extract."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from os.path import basename, exists\n", "\n", "\n", "def download(url):\n", "    # Download the file only if it is not already present locally\n", "    filename = basename(url)\n", "    if not exists(filename):\n", "        from urllib.request import urlretrieve\n", "\n", "        local, _ = urlretrieve(url, filename)\n", "        print(\"Downloaded \" + local)\n", "\n", "\n", "download(\"https://github.com/AllenDowney/GssExtract/raw/main/data/interim/gss_pacs.hdf\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(68846, 204)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gss = pd.read_hdf(\"gss_pacs.hdf\", \"gss\")\n", "gss.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `head` to see what the `DataFrame` looks like." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>year</th><th>id</th><th>divorce</th><th>sibs</th><th>childs</th><th>age</th><th>educ</th><th>degree</th><th>sex</th><th>race</th><th>...</th><th>ballot</th><th>wtssall</th><th>sexbirth</th><th>sexnow</th><th>eqwlth</th><th>realinc</th><th>realrinc</th><th>coninc</th><th>conrinc</th><th>commun</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>1972</td><td>1</td><td>NaN</td><td>3.0</td><td>0.0</td><td>23.0</td><td>16.0</td><td>3.0</td><td>2.0</td><td>1.0</td><td>...</td><td>NaN</td><td>0.4446</td><td>NaN</td><td>NaN</td><td>NaN</td><td>18951.0</td><td>NaN</td><td>25926.0</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>1</th><td>1972</td><td>2</td><td>2.0</td><td>4.0</td><td>5.0</td><td>70.0</td><td>10.0</td><td>0.0</td><td>1.0</td><td>1.0</td><td>...</td><td>NaN</td><td>0.8893</td><td>NaN</td><td>NaN</td><td>NaN</td><td>24366.0</td><td>NaN</td><td>33333.0</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>2</th><td>1972</td><td>3</td><td>2.0</td><td>5.0</td><td>4.0</td><td>48.0</td><td>12.0</td><td>1.0</td><td>2.0</td><td>1.0</td><td>...</td><td>NaN</td><td>0.8893</td><td>NaN</td><td>NaN</td><td>NaN</td><td>24366.0</td><td>NaN</td><td>33333.0</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>1972</td><td>4</td><td>2.0</td><td>5.0</td><td>0.0</td><td>27.0</td><td>17.0</td><td>3.0</td><td>2.0</td><td>1.0</td><td>...</td><td>NaN</td><td>0.8893</td><td>NaN</td><td>NaN</td><td>NaN</td><td>30458.0</td><td>NaN</td><td>41667.0</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>1972</td><td>5</td><td>2.0</td><td>2.0</td><td>2.0</td><td>61.0</td><td>12.0</td><td>1.0</td><td>2.0</td><td>1.0</td><td>...</td><td>NaN</td><td>0.8893</td><td>NaN</td><td>NaN</td><td>NaN</td><td>50763.0</td><td>NaN</td><td>69444.0</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n", "</table>\n",
"<p>5 rows × 204 columns</p>\n", 