{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "title: \"Assignment 9: Data Wrangling, in Python\"\n", "author: \"Student Name\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OVERVIEW\n", "\n", "This exercise accompanies the lessons in Environmental Data Analytics (ENV872L) on data wrangling in Python.
_(This assignment mimics Assignment 4: Data Wrangling.)_\n", "\n", "Helpful materials: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf and https://duke-nsoe.github.io/ENV872/Python-R.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Directions\n", "1. Replace \"Student Name\" on line 2 (above) with your name.\n", "\n", "\n", "2. Use the lesson as a guide. It contains code that can be modified to complete the assignment.\n", "\n", "\n", "3. Work through the steps, **creating code and output** that fulfill each instruction.\n", "\n", "\n", "4. Be sure to **answer the questions** in this assignment document.
Create a new cell in the Jupyter notebook and set its type to `Markdown`. Type your answer there. \n", "\n", "\n", "5. When you have completed the assignment, save your file as an HTML document via the `File`>`Download as` menu.\n", "\n", "\n", "6. After converting to HTML, please submit the completed exercise to the dropbox in Sakai. Please add your last name to the file name (e.g., \"Fay_A09_DataWranglingInPython.html\") prior to submission.\n", "\n", "The completed exercise is due on **Tuesday March 5** before class begins.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up your session \n", "1a. Import pandas " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1b. Read in the following CSV files into dataframes named as follows:\n", " * `EPAair_PM25_NC2017_raw.csv` → `pm_2017`\n", " * `EPAair_PM25_NC2018_raw.csv` → `pm_2018`\n", " * `EPAair_O3_NC2017_raw.csv` → `o3_2017`\n", " * `EPAair_O3_NC2018_raw.csv` → `o3_2018`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pm_2017 = \n", "pm_2018 = \n", "o3_2017 = \n", "o3_2018 = " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Generate a few lines of code to get to know your `pm_2017` dataset (basic data summaries, etc.)."
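, "\n", "\n", "For example, a minimal sketch (assuming `pm_2017` was loaded in the step above):\n", "\n", "```python\n", "pm_2017.info()           # structure: columns, dtypes, non-null counts\n", "pm_2017.describe()       # summary statistics for numeric columns\n", "pm_2017.head()           # first 5 rows\n", "pm_2017['Date'].dtype    # data type of the Date field\n", "```"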
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Show the structure of the dataframe\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Show a summary of the data in the dataframe\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Show the \"head\" or the first 5 rows of data\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Show what data type the \"Date\" field is\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrangle individual datasets to create processed files.\n", "3. Change `Date` field's data type to date" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Convert the date field to a datetime object\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Select the following columns: `Date`, `DAILY_AQI_VALUE`, `Site.Name`, `AQS_PARAMETER_DESC`, `COUNTY`, `SITE_LATITUDE`, `SITE_LONGITUDE`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Select the subset of fields\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. For the PM2.5 datasets, create a new column called `AQS_PARAMETER_DESC` filled with the value \"PM2.5\" (all cells in this column should be identical)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Save all four processed datasets in the Processed folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combine datasets\n", "7. Merge (`concat`) the four datasets into one. Make sure your column names are identical prior to running this code. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Concatenate the 4 data frames into one\n", "allData = \n", "allData.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7a. What are the dimensions of the concatenated data frames?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Report the shape of the dataframe\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subsetting and splitting\n", "\n", "#### 8. Wrangle your new dataset...\n", " 1. Subset only records in Blackstone, Bryson City, Triple Oak \n", " 1. Create a Boolean mask for each site name (`query` won't work because \"Site Name\" has a space in it!)\n", " 2. Apply these Boolean masks using the appropriate logical operators (\"`&`\", \"`|`\", \"`~`\") to select records in any of these sites into a new dataframe. \n", " 3. Make a \"deep\" copy of this dataframe (to avoid warnings about working on a subset of a dataframe...)\n", " 4. Display the dimensions of the resulting dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create 3 Boolean masks, one each for Blackstone, Bryson City, Triple Oak\n", "blackstoneMask = \n", "brysonMask = \n", "tripleoakMask = " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Apply the mask to the dataframe\n", "allDataSelect = " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Create a deep copy of the above dataframe\n", "allDataSelect = " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Display the dimensions of the dataframe\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
Add columns for \"Month\" and \"Year\" by parsing your \"Date\" column (hint: see [`pd.DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html)...)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#C. Add columns for Month and Year\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**9. Spread (`pivot_table`) your datasets** such that AQI values for ozone and PM2.5 are in separate columns. Each location on a specific date should now occupy only one row." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "allPivot = " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "10. Call up the dimensions of your new tidy dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "11. Save your processed dataset with the following file name: \"EPAair_O3_PM25_NC1718_Processed.csv\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate summary tables\n", "12. Use the split-apply-combine strategy to generate two new data frames: \n", " 1. A summary table of mean AQI values for O3 and PM2.5 by month\n", " 2. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#12a. A summary table of mean AQI values for O3 and PM2.5 by month\n", "group = allDataSelect.groupby(['Month','AQS_PARAMETER_DESC'])\n", "aggData = group['DAILY_AQI_VALUE'].mean().reset_index()\n", "aggData.pivot_table(index='Month',columns='AQS_PARAMETER_DESC',values='DAILY_AQI_VALUE')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#12b. 
A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }