{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"title: \"Assignment 9: Data Wrangling, in Python\"\n",
"author: \"Student Name\"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## OVERVIEW\n",
"\n",
"This exercise accompanies the lessons in Environmental Data Analytics (ENV872L) on data wrangling in Python.
(_This assignment mimics Assignment 4: Data Wrangling)_\n",
"\n",
"Helpful materials: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf and https://duke-nsoe.github.io/ENV872/Python-R.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Directions\n",
"1. Change \"Student Name\" on line 2 (above) with your name.\n",
"\n",
"\n",
"2. Use the lesson as a guide. It contains code that can be modified to complete the assignment.\n",
"\n",
"\n",
"3. Work through the steps, **creating code and output** that fulfill each instruction.\n",
"\n",
"\n",
"4. Be sure to **answer the questions** in this assignment document.
Create a new cell in the Jupyter notebook and set its type to `Markdown`. Type your answer there. \n",
"\n",
"\n",
"5. When you have completed the assignment, save your file as an HTML document via the `File`>`Download as` menu.\n",
"\n",
"\n",
"6. After convertng to HTML, please submit the completed exercise to the dropbox in Sakai. Please add your last name into the file name (e.g., \"Fay_A07_DataWranglingInPython.html\") prior to submission.\n",
"\n",
"The completed exercise is due on **Tuesday March 5** before class begins.\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up your session \n",
"1a. Import pandas "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1b. Read in the following CSV files into dataframes named as follows:\n",
" * `EPAair_PM25_NC2017_raw.csv` → `pm_2017`\n",
" * `EPAair_PM25_NC2018_raw.csv` → `pm_2018`\n",
" * `EPAair_O3_NC2017_raw.csv` → `o3_2017`\n",
 * `EPAair_O3_NC2018_raw.csv` → `o3_2018`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pm_2017 = \n",
"pm_2018 = \n",
"o3_2017 = \n",
"o3_2018 = "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Generate a few lines of code to get to know your `pm_2017` dataset (basic data summaries, etc.)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show the structure of the dataframe\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show a summary of the data in the dataframe\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show the \"head\" or the first 5 rows of data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show what data type the \"Date\" field is\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Wrangle individual datasets to create processed files.\n",
"3. Change `Date` field's data type to date"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Convert the date field to a datetime object\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"4. Select the following columns: `Date`, `DAILY_AQI_VALUE`, `Site.Name`, `AQS_PARAMETER_DESC`, `COUNTY`, `SITE_LATITUDE`, `SITE_LONGITUDE`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Select the subset of fields\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"5. For the PM2.5 datasets, create a new column called `AQS_PARAMETER_DESC` filled with the value \"PM2.5\" (all cells in this column should be identical)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"6. Save all four processed datasets in the Processed folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine datasets\n",
"7. Merge (`concat`) the four datasets into one. Make sure your column names are identical prior to running this code. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Concatenate the 4 data frames into one\n",
"allData = \n",
"allData.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"7a. What are the dimensions of the concatenated data frames?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Report the shape of the dataframe\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Subsetting and splitting\n",
"\n",
"#### 8. Wrangle your new dataset...\n",
" 1. Subset only records in Blackstone, Bryson City, Triple Oak \n",
" 1. Create a Boolean mask for each site name (`query` won't work because \"Site Name\" has a space in it!)\n",
" 2. Apply these Boolean masks using the appropriate logical operators (\"`&`\", \"`|`\", \"`~`\") to select records in any of these sites into a new dataframe. \n",
" 3. Make a \"deep\" copy of this dataframe (to avoid warnings about working on a subset of a dataframe...)\n",
" 4. Display the dimensions of the resulting dataframe"
]
},
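{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Not part of the assignment:* a quick illustrative sketch, using a made-up toy dataframe (the site names and values here are invented), of how Boolean masks combine with `|`:\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy dataframe standing in for the EPA data (values are made up)\n",
"df = pd.DataFrame({'Site Name': ['Blackstone', 'Bryson City', 'Garinger', 'Triple Oak'],\n",
"                   'DAILY_AQI_VALUE': [40, 55, 32, 61]})\n",
"\n",
"# One mask per site; | (\"or\") keeps rows matching any of the masks\n",
"maskA = df['Site Name'] == 'Blackstone'\n",
"maskB = df['Site Name'] == 'Triple Oak'\n",
"subset = df[maskA | maskB].copy()  # .copy() returns an independent (deep) copy\n",
"subset.shape  # → (2, 2)\n",
"```"
]
},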
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create 3 Boolean masks, one each for Blackstone, Bryson City, Triple Oak\n",
"blackstoneMask = \n",
"brysonMask = \n",
"tripleoakMask = "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Apply the mask to the dataframe\n",
"allDataSelect = "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Create a deep copy of the above dataframe\n",
"allDataSelect = "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Display the dimensions of the dataframe\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Add columns for \"Month\" and \"Year\" by parsing your \"Date\" column (hint: see [`pd.DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html)...)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#C. Add columns for Month and Year\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**9. Spread (`pivot_table`) your datasets** such that AQI values for ozone and PM2.5 are in separate columns. Each location on a specific date should now occupy only one row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"allPivot = "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"10. Call up the dimensions of your new tidy dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"11. Save your processed dataset with the following file name: \"EPAair_O3_PM25_NC1718_Processed.csv\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generate summary tables\n",
"12. Use the split-apply-combine strategy to generate two new data frames: \n",
" 1. A summary table of mean AQI values for O3 and PM2.5 by month\n",
" 2. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#12a. A summary table of mean AQI values for O3 and PM2.5 by month\n",
"group = allDataSelect.groupby(['Month','AQS_PARAMETER_DESC'])\n",
"aggData = group['DAILY_AQI_VALUE'].mean().reset_index()\n",
"aggData.pivot_table(index='Month',columns='AQS_PARAMETER_DESC',values='DAILY_AQI_VALUE')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#12b. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}