{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project 1: Digital Divide\n",
    "### Data Prep\n",
    "\n",
    "#### Based on PPIC's Just the Facts report [\"California's Digital Divide\"](https://www.ppic.org/publication/californias-digital-divide/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Research Question(s):\n",
    "1. What share households in X state have access to high-speed internet? \n",
    "2. Does this number vary across demographic groups? (in this case race/ethnicity)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Goal:\n",
    "* explore datafiles (`acsdata.data.gz`) and create a _working dataset_ from it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Context:\n",
    "Obtained American Community Survey (ACS) survey data from [IPUMS](https://usa.ipums.org/usa/). <br>\n",
    "It contains basic demographics:\n",
    "  - age\n",
    "  - gender\n",
    "  - race/ethnicity\n",
    "\n",
    "and geographic indicators:\n",
    "  - state\n",
    "  - county"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 1: Set up your working environment.\n",
    "\n",
    "Import all necessary libraries and create `Path`s to your data directories. This ensures reproducibility across file systems (windows uses `\\` instead of `/`)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need \n",
    "1. `pandas` to work with the data.\n",
    "2. `pathlib`, and more specifically its `Path` object, to work with paths. This will ensure our code works in both Windows (which uses `\\` in its file paths) and MacOS/Linux (which uses `/`).\n",
    "3. `datetime` - tip: There are version control systems for data but tagging your data files with the date is not a bad first step if you're getting started."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "27-Apr-19\n"
     ]
    }
   ],
   "source": [
    "# setting up working environment\n",
    "import _____ as pd\n",
    "from _____ import Path\n",
    "from datetime import datetime as dt\n",
    "today = __.today().strftime(\"%d-%b-%y\")\n",
    "\n",
    "print(today)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_note: even if you are on windows you can type the path forward slashes_ `/` _below_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# data folder and paths\n",
    "RAW_DATA_PATH = ____(\"../data/raw/\")\n",
    "XXXX_XXXXX_XXXX = ____(\"../data/interim/\")\n",
    "YYYY_YYYYY_YYYY = ____(\"../data/processed/\")\n",
    "ZZZZ_ZZZZZ_ZZZZ = ____(\"../data/final/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**NOTE:** I've included a `tools.py` script with the function `tree` which displays a directory's tree (obtained from [RealPython's tutorial on the `pathlib` module](https://realpython.com/python-pathlib/)).\n",
    "\n",
    "    from our tools script import tree so we can use it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree(________)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 2: Load and explore the data\n",
    "\n",
    "With `pandas` loading data is as easy as `.read_csv(PATH_TO_CSV_FILE)` and that works most of the time. `Pandas` `read_csv()` is so powerful it'll even read compressed files without any other parameter specification. Try the following:\n",
    "\n",
    "```python\n",
    "data = pd.read_csv(RAW_DATA_PATH / 'acs_data.csv.gz')\n",
    "data.head()\n",
    "```\n",
    "_*make sure you change_ `RAW_DATA_PATH` _to match whatever variable name you chose for it earlier._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "IPUMS offers a few data formats which can be more useful [[docs]](https://usa.ipums.org/usa-action/faq#ques12):\n",
    "> In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and Stata are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, we'll be using a **Stata** file (`.dta`). The main reason is that `.dta` files can store *value labels* which `pandas` can then read and convert columns to `Categorical` columns in our pandas DataFrame. This 1) saves memory, and 2) is good practice because certain social sciences really, _really_, ***really*** love Stata so their interesting datasets are likely `.dta` files. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, `pandas` cannot read compressed `.dta` directly like it can `.csv` files. IPUMS, uses *gzip* compressed format and `python` includes a `gzip` module in its standard library.\n",
    "\n",
    "**Import** gzip and try the following:\n",
    "```python\n",
    "with gzip.open(RAW_DATA_PATH / 'acs_data.dta.gz') as file:\n",
    "    data = pd.read_stata(file)\n",
    "```\n",
    "\n",
    "and then display the first five rows of your `data` DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import gzip and load data\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# display first 5 rows\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 3: Familiarize yourself with the dataset\n",
    "\n",
    "We've already seen `.head()` - the `pandas` method that will display the first 5 rows of your DataFrame. This gives you an idea of what your data looks like. However, there are is a lot more `.info()` you can get out of your dataframe. You can also just ask the data to `.describe()` itself..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find out more info about your dataframe\n",
    "data.____()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# describing your data\n",
    "data.____()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check out the `shape` of your data with it's `.shape` attribute. Notice the lack of parentheses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data._____"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 4: Trim your data\n",
    "\n",
    "Right now you're working with your **masterfile** - a dataset containing everything you _could_ need for your analysis. You don't really want to modify this dataset because you might be using it for other analyses. For example, we're going to be analyzing access to high-speed internet in a state of your choosing but next week you might want to run the same analysis on another state or maybe just on a specific county. To make sure you can **reuse** your data and code later let's create an _analytical file_ or a _working dataset_, a dataset that contains only the data needed for **this** specific analysis at hand."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we are only interested in finding the _\"Digital Divide\"_ of one state right now. The **masterfile** contains data for all 50 states and the Disctric of Columbia. \n",
    "\n",
    "What you want to do is find all the rows where the `statefip` matches the your state's name. This is called boolean indexing.\n",
    "\n",
    "Try the following\n",
    "```python\n",
    "data['statefip'] == 'ohio'\n",
    "```\n",
    "_Note: you can change 'ohio' to any other of the 50 states or 'district of columbia' for DC._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try boolean indexing\n",
    "___['______'] == '_______'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is going to return a `pandas.Series` of booleans (Trues and Falses) which then you can use to filter out any unnecessary rows.\n",
    "\n",
    "It's good practice to save these as a variable early in your code (if you know them beforehand) or right before you use them in case you use these conditionals in more than one place. This is going to save you time if you decide to change the value you're comparing, `'ohio'` for `'california'` for example.\n",
    "\n",
    "```python\n",
    "mask_state = (data['statefip'] == 'ohio')\n",
    "data[mask_state].head()\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try it yourself\n",
    "mask_state = (________________________ == _______)\n",
    "data[mask_state].____()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "let's save it to another variable with a more useful name:\n",
    "\n",
    "```python\n",
    "state_data = data[mask_state].copy()\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You have to use `.copy()` to create actual copies of the data. If you ran\n",
    "```python\n",
    "state_data = data[mask_state]\n",
    "```\n",
    "`state_data` would be a _view_ of the `data` dataframe. This can have unintended consequences down the road if you modify your dataframes. A lot of the times you'd get just a warning and your code will run just as intented - but why take risks, right?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save your data to state_data\n",
    "state_data = __________.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's see what `.columns` we have in our dataframe. You can find these the same way you found the `.shape` of it earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "state_data._____"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Are there any columns that you are **confident** you don't need? If you are not 90% sure you won't need a variable don't drop it. \n",
    "\n",
    "Dropping columns is as easy using `.drop()` on your dataframe.\n",
    "\n",
    "```python\n",
    "state_data.drop(columns = ['list', 'of', 'columns', 'to', 'drop'])\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If there are variables you _think_ you won't need but you're not very sure that's the case, you should explore them. \n",
    "\n",
    "`pandas` dataframe's columns are `pandas.Series` and they have methods and attributes just like dataframes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's explore the variable `gq` which stands for `Group Quarters`. From the IPUMS [docs](https://usa.ipums.org/usa-action/variables/GQ#description_section):\n",
    ">Group quarters are largely institutions and other group living arrangements, such as rooming houses and military barracks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what `.unique()` values the `state_data['gq']` series has..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also see the `.value_counts()` which would give us a better idea of how useful this column might be. For example, if a column has 2 values but 99% of the observations have one value and 1% have the other - you could drop column altogether since it might not add a lot value to your analysis. \n",
    "\n",
    "Some variables have 100% of it's rows with the same value... \\*cough\\* \\*cough\\* `state_data['year']`..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From IPUMS [docs](https://usa.ipums.org/usa-action/variables/GQ#comparability_section):\n",
    ">There are three slightly different definitions of group quarters in the IPUMS. For the period 1940-1970 (excluding the 1940 100% dataset), group quarters are housing units with five or more individuals unrelated to the householder. Before 1940 and in 1980-1990, units with 10 or more individuals unrelated to the householder are considered group quarters. **In the 2000 census, 2010 census, the ACS and the PRCS, no threshold was applied; for a household to be considered group quarters, it had to be on a list of group quarters that is continuously maintained by the Census Bureau. In earlier years, a similar list was used, with the unrelated-persons rule imposed as a safeguard.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because of this and the fact that most of our observations fall into the 1970 and 1990 definition, we'll stick to those 2 for our analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's create another _mask_ to filter out households that don't fit our definition.\n",
    "\n",
    "For multiple conditions we use: `&` and `|` operators (**and** and **or**, respectively)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mask_household = ( CONDITION ONE ) | ( CONDITION TWO )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**note**: another value added from having categorical variables is that, **if** they are ordered, you can use the `<`, `>` operators for conditions as well.\n",
    "```python\n",
    "mask_household = (state_data['gq'] <= 'additional households under 1990 definition')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_note: since you are overwriting_ `state_data` _you don't need to use_ `.copy()` _but it doesn't hurt and if you're a beginner at_ `pandas` _it's good practice for when you actually need to use_ `.copy()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "state_data = state_data[mask_household].______()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point you're really close to a `working_data` dataset. You have:\n",
    "1. Kept one state's information and dropped the rest.\n",
    "2. Kept only those _households_ you're interested in and dropped the rest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our research question 1 is: \"What share of households in X state have access to high-speed internet?\"\n",
    "\n",
    "Mathematically, \n",
    "$$ \\frac{households\\ with\\ high\\ speed\\ internet}{households\\ in\\ state}$$\n",
    "\n",
    "Your `state_data` dataset contains all you need to find the answer. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 5: Save your data\n",
    "\n",
    "Now that you have trimmed your **masterfile** into a `working_data` dataset you should save it. \n",
    "\n",
    "We've been working with a `.dta` file and it'd be best if we keep it that way. \n",
    "\n",
    "Try the following:\n",
    "```python\n",
    "state_data.to_stata(INTERIM_DATA_PATH / f'state_data-{today}.dta', write_index = False)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few things:\n",
    "1. We're using `f-strings` to tag our datafile with today's date.\n",
    "2. You're turning off the `write_index` flag so you don't add a 'index' column to your `.dta` file. In this dataset, our index isn't meaningful. In other analysis you might have a meaningful index and you won't want to turn off this flag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 6: Bonus\n",
    "What if we changed our research question a little bit, from <br>_\"What share of households in X state have access to high-speed internet?_ <br>to <br>_\"What share of households **with school-age children** in X state have access to high-speed internet?\"_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This would be an interesting statistic to policy-makers, especially if we find discrepancies across demographic groups (research question 2).\n",
    "\n",
    "The challenge here is that the **unit of observation** in our `state_data` file is a (weighted) person and we want to _filter_ out those **households** without any school-age children in them. This might sound a little complicated at first but it just requires modifying our previous workflow just a little.\n",
    "\n",
    "We need to do a few things:\n",
    "1. Define what we mean by school-age children.\n",
    "2. Create a _mask_ to grab all households where these children are.\n",
    "3. Create a list of unduplicated household identifiers (`'serial'`) \n",
    "4. Use that list to drop unwanted observations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Step 6.1: School-age children\n",
    "\n",
    "Most people would agree school age (Elementary through High School) is 6 - 17 year olds. Some people are interested in K-12 (5 - 17 or 18). Some people wouldn't include 18 year olds. Whatever measure you choose you must be able to defend why you are choosing it. \n",
    "\n",
    "For this analysis, I'll suggest we use 5 - 18 year olds (K-12) but you can choose whatever age range you want. Maybe high-school kids 14-18? That'd be interesting, you probably need access to high-speed internet at home a lot more in high school than you do in kindergarden. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mask_children = (state_data['age'] >= ___) & (___________________ <= )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<summary> <i>What data type is</i> <span style='font-family:monospace'>state_data['age']</span> <i>again?</i> \n",
    "    <details> \n",
    "        Categorical. This means that you even though its values _look_ like numbers, they're actually _value labels_ aka strings.\n",
    "    </details>\n",
    "</summary>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have our _mask_, we can use it to create a list of households with children in them.\n",
    "\n",
    "Earlier we applied a mask to a dataframe and saved it to another variable. Here, we'll go a step further and grab just a column of that _filtered out_ dataframe.\n",
    "\n",
    "Try it yourself first.\n",
    "\n",
    "*Hint: How did we grab and explore a single column of a dataframe earlier?*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "households_with_children = _________________________________________"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "households_with_children.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How do you think we can `.drop_duplicates()`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have your unduplicated list of households with children all you have to do is to check if a `serial` value from our `state_data` dataset `.isin()` our `households_with_children` series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's save that as our `working_data` dataset and save that to memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "working_data = _____________________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "working_data.to_stata(INTERIM_DATA_PATH / f'working_data-{today}.dta', write_index = False)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}