{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"Environmental Data Analytics | John Fay and Luana Lima | Developed by Kateri Salk \n",
"Spring 2023\n",
"\n",
"---\n",
"# 3. Data Exploration\n",
"\n",
"## LESSON OBJECTIVES\n",
"1. Set up a data analysis session in Jupyter\n",
"2. Import and explore datasets in Python\n",
"3. Apply data exploration skills to a real-world example dataset\n",
"\n",
"A handy link: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## BEST PRACTICES FOR PYTHON/JUPYTER\n",
"\n",
"In many situations in data analytics, you may be expected to work from multiple computers or share projects among multiple users. A few general best practices will avoid common pitfalls related to collaborative work. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Relative paths in Jupyter notebooks\n",
"\n",
"Jupyter notebooks can use absolute or relative paths, but relative paths are more robust and should be used where possible. Relative paths will be relative to where the Jupyter notebook lives and OS commands can navigate up or down the directory structure."
]
},
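{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (a sketch, assuming a \"data\" subfolder sits next to this notebook), Python's built-in `os.path` functions show how relative paths resolve:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show how relative paths resolve against the notebook's folder\n",
"import os\n",
"print(os.path.abspath('.'))      #Absolute path of the notebook's folder\n",
"print(os.path.abspath('..'))     #Absolute path of its parent folder\n",
"print(os.path.exists('./data'))  #True only if a \"data\" subfolder exists here"
]
},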
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Listing contents of folders using OS commands followed by `!`\n",
"\n",
"OS-specific commands can be called within Jupyter by preceding them with a \"`!`\". For example, in Windows you can list the contents of the folder containing the script you are running using \"`! dir`\". On unix machines, this would be \"`! ls`\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#OS specific command for showing the current working directory\n",
"!pwd #for mac/linux based machines (!cd #for PCs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#List the contents of the current directory (\"!ls\" also works)\n",
"!dir "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#List the contents of the \"data\" sub directory \n",
"!dir data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#List the contents of the directory containing the current notebook\n",
"!dir .."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Navigating folders using Python's built-in `os` module"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Import the os module\n",
"import os"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Create a variable holding the current working directory\n",
"projectDir = os.getcwd()\n",
"#Display the current working directory\n",
"projectDir"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Change the directory to the data folder\n",
"os.chdir('data')\n",
"os.getcwd()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Go back to the current working directory (stored in the \"projectDir\" variable above)\n",
"os.chdir(projectDir)\n",
"os.listdir()"
]
},
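{
"cell_type": "markdown",
"metadata": {},
"source": [
"A related tip (a sketch, assuming the \"data/Raw\" subfolder used later in this lesson): build paths with `os.path.join` so they work on any operating system."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Construct an OS-independent path from the project directory\n",
"rawDataDir = os.path.join(projectDir, 'data', 'Raw')\n",
"#Confirm the folder exists before trying to read from it\n",
"os.path.exists(rawDataDir)"
]
},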
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load your packages\n",
"As in R, packages should be loaded early in the script. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd #Import pandas, refering to it as \"pd\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
">#### A note on installing packages\n",
">Most packages you'll need are already installed in this containerized environment. However, if you need to install a package, you can use \"pip\". For example, to install pandas, you'd issue the command: \n",
">```bash\n",
"!pip install pandas\n",
">```\n",
">To install other packages, just replace pandas with the package you want to install. "
]
},
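{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to confirm a package is installed (a sketch): import it and check its `__version__` attribute."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Confirm pandas is available and report which version is installed\n",
"import pandas as pd\n",
"pd.__version__"
]
},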
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import your data\n",
"The easiest way to import CSV data for data analysis is using Panda's [`read_csv()` function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) which reads data directly into a Pandas dataframe object.\n",
"\n",
"As in R, we supply the path to the CSV file, using relative path conventions. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_USGS = pd.read_csv('./data/Raw/USGS_Site02085000_Flow_Raw.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## EXPLORE YOUR DATASET\n",
"Take a moment to read through the README file associated with the USGS dataset on discharge at the Eno River. Where can you find this file? How does the placement and information found in this file relate to the best practices for reproducible data analysis?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#View all records\n",
"df_USGS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Viewing properties of your dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Confirm the data type -- R: class(df_USGS)\n",
"type(df_USGS)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Display the column names -- R: colnames(df_USGS)\n",
"df_USGS.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Rename columns -- R: colnames(df_USGS) <- c(...)\n",
"df_USGS.columns = (\"agency_cd\", \"site_no\", \"datetime\", \n",
" \"discharge_max\", \"discharge_max_approval\", \n",
" \"discharge_min\", \"discharge_min_approval\", \n",
" \"discharge_mean\", \"discharge_mean_approval\", \n",
" \"gage_height_max\", \"gage_height_max_approval\", \n",
" \"gage_height_min\", \"gage_height_min-approval\", \n",
" \"gage_height_mean\", \"gage_height_mean_approval\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Display the structure of the dataframe -- R: str(df_USGS))\n",
"df_USGS.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Display the dimensions\n",
"df_USGS.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_USGS.size"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Viewing records in a dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#View the head (first 5 records) of the dataset\n",
"df_USGS.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Altenatively, view the first 9 records\n",
"df_USGS.head(9)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Or 6 records, selected at random\n",
"df_USGS.sample(6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Or, the last 3 records\n",
"df_USGS.tail(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#View records 30000 to 30005, columns 3, 8, and 14\n",
"df_USGS.iloc[29999:30004,[2,7,13]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show the count of values in the discharge_min_approval category\n",
"df_USGS['discharge_max_approval'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show the data type of the 'datetime' column\n",
"df_USGS['datetime'].dtype"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show the data type of all columns\n",
"df_USGS.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Summary of all data\n",
"df_USGS.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Summary of a specific column\n",
"df_USGS['discharge_mean'].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Formatting dates\n",
"Yep, as in R, dates can be a pain. By default they are imported as generic, non-numeric \"objects\" (hence the dtype of \"O\" above). \n",
"\n",
"The Pandas `to_datetime` function ([link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)) works like R's `as.Date` function to help convert dates, in various formats, into actual date objects (called \"timestamps\" in Pandas lingo)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Create an example of a date, in string format\n",
"exampleDate = \"2021-04-14\"\n",
"#Convert to a Pandas \"timestamp\" object\n",
"dateObj = pd.to_datetime(exampleDate)\n",
"dateObj"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the date is in a non-standard format, we tell the command what format..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Create a date string in a non-standard format\n",
"exampleDate2 = \"Wednesday, 14 Apr. 2021\"\n",
"dateObj2 = pd.to_datetime(exampleDate2,format = '%A, %d %b. %Y')\n",
"dateObj2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Timestamp objects can be displayed in various other date formats using the `strftime` function. See http://strftime.org/ for all the formatting options and try a few yourself. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Display the timestamp objects in various formats using \"strftime\"\n",
"print(dateObj.strftime('%m/%d/%Y'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" ---> See if you can get the date to read: Wednesday, Apr. 14, 2021
\n",
" print(dateObj.strftime('%A, %b. %d, %Y'))
\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(dateObj.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" ---> What number day of the year is this date?
\n",
" print(dateObj.strftime('%j'))
\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(dateObj.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Convert our dataframes `datetime` values to timestamps\n",
"We can apply the `.to_datetime()` function to our datetime column. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Update the datetime column to be dates, not strings\n",
"df_USGS['datetime'] = pd.to_datetime(df_USGS['datetime'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Display a few samples\n",
"df_USGS.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As in our R example, the 2-digit dates in our raw data file are mistakenly assumed to be in the 21st century. We need to convert back to the 20th century. As we did in R, we'll apply a function to find and fix these dates..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_USGS.iloc[-1,2] > pd.to_datetime('2019-01-10')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Create a function called \"fixDate\" that corrects date values\n",
"def fixDate(d):\n",
" if d > pd.to_datetime('2019-01-10'):\n",
" return d - pd.DateOffset(years=100)\n",
" else:\n",
" return d"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Apply the function to the datetime values\n",
"df_USGS['datetime'] = df_USGS['datetime'].apply(fixDate)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#View the result\n",
"df_USGS.head()"
]
},
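{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `apply` approach above calls `fixDate` once per row. For larger datasets, an equivalent vectorized sketch (same cutoff date, same offset) uses a boolean mask instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Vectorized alternative: shift all dates past the cutoff back 100 years at once\n",
"mask = df_USGS['datetime'] > pd.to_datetime('2019-01-10')\n",
"df_USGS.loc[mask, 'datetime'] = df_USGS.loc[mask, 'datetime'] - pd.DateOffset(years=100)\n",
"#Dates were already fixed above, so the mask should now be empty\n",
"mask.sum()"
]
},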
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adjusting Datasets\n",
"\n",
"### Removing NAs\n",
"\n",
"Notice in our dataset that our discharge and gage height observations have many NAs, meaning no measurement was recorded for a specific day. In some cases, it might be in our best interest to remove NAs from a dataset. Removing NAs or not will depend on your research question."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#List the number of missing values in each column (sum across rows)\n",
"df_USGS.isna().sum(axis='rows')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Show NAs in just one variable\n",
"df_USGS['discharge_mean'].isna().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Drop rows that have missing data in any column; -- R: \"omit.na\"\n",
"df_USGS_cleaned = df_USGS.dropna()\n",
"df_USGS_cleaned.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving datasets\n",
"We just edited our raw dataset into a processed form. We may want to return to this processed dataset later, which will be easier to do if we save it as a spreadsheet. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Save the file\n",
"df_USGS_cleaned.to_csv(\"./data/Processed/USGS_Site02085000_Flow_Processed.csv\", index=False)"
]
},
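{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check (a sketch, assuming the file was just written), we can read the processed file back in and confirm its dimensions match:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Read the processed file back in and confirm its shape matches\n",
"df_check = pd.read_csv('./data/Processed/USGS_Site02085000_Flow_Processed.csv')\n",
"df_check.shape"
]
},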
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TIPS AND TRICKS: SPREADSHEETS\n",
"\n",
"* Files should be saved as .csv or .txt for easy import into Pandas. Note that complex formatting, including formulas in Excel, are not saved when spreadsheets are converted to comma separated or text formats (i.e., values alone are saved).\n",
"\n",
"\n",
"* The first row is reserved for column headers.\n",
"\n",
"\n",
"* A second, secondary row for column headers (e.g., units) should not be used if data are being imported into R. Incorporate units into the first row column headers if necessary.\n",
"\n",
"\n",
"* Short names are preferred for column headers, to the extent they are informative. Additional information can be stored in comments within Python scripts and/or in README files.\n",
"\n",
"\n",
"* Spaces in column names are allowed in Pandas, but should be replaced with underscores (\"`_`\") to avoid issues. \n",
"\n",
"\n",
"* Avoid symbols in column headers. This can cause issues when importing into Pandas."
]
}
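,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that last pair of tips, using a small hypothetical dataframe: replace spaces in column names with underscores."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Hypothetical example: clean column names by swapping spaces for underscores\n",
"df_example = pd.DataFrame({'site no': [1, 2], 'gage height': [2.5, 3.1]})\n",
"df_example.columns = df_example.columns.str.replace(' ', '_', regex=False)\n",
"df_example.columns"
]
}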
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}