{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "So far, we’ve used Python and the pandas library to explore and manipulate individual datasets by hand, much like we would do in a spreadsheet. The beauty of using a programming language like Python, though, comes from the ability to automate data processing through the use of loops and functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# For loops\n", "\n", "Loops allow us to repeat a workflow (or series of actions) a given number of times or while some condition is true. We would use a loop to automatically process data that’s stored in multiple files (daily values with one file per year, for example). Loops lighten our work load by performing repeated tasks without our direct involvement and make it less likely that we’ll introduce errors by making mistakes while processing each file by hand.\n", "\n", "Let’s write a simple for loop that simulates what a kid might see during a visit to the zoo:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['lion', 'tiger', 'crocodile', 'vulture', 'hippo']\n" ] } ], "source": [ "animals = [\"lion\", \"tiger\", \"crocodile\", \"vulture\", \"hippo\"]\n", "print(animals)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lion\n", "tiger\n", "crocodile\n", "vulture\n", "hippo\n" ] } ], "source": [ "for creature in animals:\n", " print(creature)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The line defining the loop must start with for and end with a colon, and the body of the loop must be indented.\n", "\n", "In this example, **creature is the loop variable** that takes the value of the next entry in animals every time the loop goes around. We can call the loop variable anything we like. 
After the loop finishes, the loop variable will still exist and will have the value of the last entry in the collection:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hippo\n" ] } ], "source": [ "print(creature)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next loop, we are not asking Python to print the value of the loop variable anymore, but the for loop still runs and the value of creature changes on each pass through the loop. The statement pass in the body of the loop means “do nothing”." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']\n", "for creature in animals:\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** What happens if we don’t include the pass statement?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Automating data processing using For Loops\n", "Remark on paths for this exercise:\n", "* The surveys file should be in the Desktop/data-carpentry/data folder\n", "* The notebook should be opened in the Desktop/data-carpentry folder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The file we’ve been using so far, surveys.csv, contains 25 years of data and is very large. We would like to separate the data for each year into a separate file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s start by making a new directory inside the folder data to store all of these files using the module os. 
The os module in Python provides functions for interacting with the operating system." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.mkdir('data/yearly_files')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The command os.mkdir is equivalent to mkdir in the shell; like mkdir, it fails with an error if the directory already exists. Just so we are sure, we can check that the new directory was created within the data folder:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['bouldercreek_09_2013.txt',\n", " 'plots.csv',\n", " 'portal_mammals.sqlite',\n", " 'pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc',\n", " 'pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc',\n", " 'README.txt',\n", " 'sftlf_fx_ACCESS1-3_historical_r0i0p0.nc',\n", " 'sftlf_fx_CSIRO-Mk3-6-0_historical_r0i0p0.nc',\n", " 'species.csv',\n", " 'speciesSubset.csv',\n", " 'surveys.csv',\n", " 'surveys2001.csv',\n", " 'surveys2002.csv',\n", " 'surveys_analysed.csv',\n", " 'yearly_files']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "os.listdir('data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The command os.listdir is equivalent to ls in the shell.\n", "\n", "In previous lessons, we saw how to use the library pandas to load the survey data into memory as a DataFrame, how to select a subset of the data using some criteria, and how to write the DataFrame into a CSV file. 
Let’s write a script that performs those three steps in sequence for the year 2002:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Load the data into a DataFrame\n", "surveys_df = pd.read_csv('data/surveys.csv')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Select only data for the year 2002\n", "surveys2002 = surveys_df[surveys_df[\"year\"] == 2002]\n", "#alternative: surveys2002 = surveys_df[surveys_df.year == 2002]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Write the new DataFrame to a CSV file\n", "surveys2002.to_csv('data/yearly_files/surveys2002.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create yearly data files, we could repeat the last two commands over and over, once for each year of data. Repeating code is neither elegant nor practical, and is very likely to introduce errors into your code. We want to turn what we’ve just written into a loop that repeats the last two commands for every year in the dataset.\n", "\n", "Let’s start by writing a loop that prints the names of the files we want to create - the dataset we are using covers 1977 through 2002, and we’ll create a separate file for each of those years. Listing the filenames is a good way to confirm that the loop is behaving as we expect.\n", "\n", "We have seen that we can loop over a list of items, so we need a list of years to loop over. We can get the years in our DataFrame with:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1977\n", "1 1977\n", "2 1977\n", "3 1977\n", "4 1977\n", " ... 
\n", "35544 2002\n", "35545 2002\n", "35546 2002\n", "35547 2002\n", "35548 2002\n", "Name: year, Length: 35549, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Selecting the column year\n", "surveys_df['year']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but we want only the unique years, which we can get with the unique method that we have already seen." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,\n", " 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,\n", " 1999, 2000, 2001, 2002], dtype=int64)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "surveys_df['year'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting this into a for loop, we get:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1977\n", "1978\n", "1979\n", "1980\n", "1981\n", "1982\n", "1983\n", "1984\n", "1985\n", "1986\n", "1987\n", "1988\n", "1989\n", "1990\n", "1991\n", "1992\n", "1993\n", "1994\n", "1995\n", "1996\n", "1997\n", "1998\n", "1999\n", "2000\n", "2001\n", "2002\n" ] } ], "source": [ "for year in surveys_df['year'].unique():\n", " print(year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now add the rest of the steps we need to create separate CSV files:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# Load the data into a DataFrame\n", "surveys_df = pd.read_csv('data/surveys.csv')\n", "\n", "for year in surveys_df['year'].unique():\n", "\n", " # Select data for the year\n", " surveys_year = surveys_df[surveys_df.year == year]\n", "\n", " # Write the new DataFrame to a CSV file\n", " filename = 'data/yearly_files/surveys' + 
str(year) + '.csv'\n", " surveys_year.to_csv(filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Writing Unique File Names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the code above created a unique filename for each year." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "filename = 'data/yearly_files/surveys' + str(year) + '.csv'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s break down the parts of this name:\n", "\n", "* The first part is some text that specifies the directory to store our data file in (data/yearly_files/) and the first part of the file name (surveys): 'data/yearly_files/surveys'\n", "* We can concatenate this with the value of a variable, in this case year, by using the plus sign (+) followed by the variable we want to add to the file name, converted to text with str: + str(year)\n", "* Then we add the file extension as another text string: + '.csv'\n", "\n", "Notice that we use single quotes to add text strings. The variable is not surrounded by quotes. For the year 2002, this code produces the string data/yearly_files/surveys2002.csv, which contains the path to the new file AND the file name itself." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenge:\n", "Instead of splitting out the data by years, a colleague wants to analyze each species separately. How would you write a unique CSV file for each species?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Building reusable and modular code with functions\n", "\n", "Suppose that separating large data files into individual yearly files is a task that we frequently have to perform. We could write a for loop like the one above every time we needed to do it, but that would be time-consuming and error-prone. A more elegant solution would be to create a reusable tool that performs this task with minimum input from the user. 
To do this, we are going to turn the code we’ve already written into a function.\n", "\n", "Functions are reusable, self-contained pieces of code that are called with a single command. They can be designed to accept arguments as input and return values, but they don’t need to do either. Variables declared inside functions only exist while the function is running. If a variable within the function (a local variable) has the same name as a variable somewhere else in the code, the local variable hides but doesn’t overwrite the other.\n", "\n", "Many of the tools we have used so far (for example, print) are functions, and the libraries we import (say, pandas) are largely collections of functions. We will only use functions that are housed within the same code that uses them, but we can also write functions that can be used by different programs.\n", "\n", "Functions are declared following this general structure:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def this_is_the_function_name(input_argument1, input_argument2):\n", "\n", " # The body of the function is indented\n", " # This function prints the two arguments to screen\n", " print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')\n", "\n", " # And returns their product\n", " return input_argument1 * input_argument2\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function declaration starts with the word def, followed by the function name and any arguments in parentheses, and ends in a colon. The body of the function is indented just like loops are. 
A function that returns something includes a return statement, usually at the end of its body.\n", "\n", "This is how we call the function:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The function arguments are: 2 5 (this is done inside the function!)\n" ] } ], "source": [ "product_of_inputs = this_is_the_function_name(2, 5)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Their product is: 10 (this is done outside the function!)\n" ] } ], "source": [ "print('Their product is:', product_of_inputs, '(this is done outside the function!)')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Challenge:**\n", "\n", "* Change the values of the arguments in the function call and check its output\n", "* Try calling the function by giving it the wrong number of arguments (not 2) or not assigning the function call to a variable (no product_of_inputs =)\n", "* Declare a variable inside the function and test to see where it exists (Hint: can you print it from outside the function?)\n", "* Explore what happens when a variable inside the function and a variable outside it have the same name. What happens to the global variable when you change the value of the local variable?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now turn our code for saving yearly data files into a function. There are many different “chunks” of this code that we can turn into functions, and we can even create functions that call other functions inside them. 
Let’s first write a function that separates data for just one year and saves that data to a file:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def one_year_csv_writer(this_year, all_data):\n", " \"\"\"\n", " Writes a csv file for data from a given year.\n", "\n", " this_year -- year for which data is extracted\n", " all_data -- DataFrame with multi-year data\n", " \"\"\"\n", "\n", " # Select data for the year\n", " surveys_year = all_data[all_data.year == this_year]\n", "\n", " # Write the new DataFrame to a csv file\n", " filename = 'data/yearly_files/function_surveys' + str(this_year) + '.csv'\n", " surveys_year.to_csv(filename)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The text between the two sets of triple double quotes is called a **docstring** and contains the documentation for the function. It does nothing when the function is running and is therefore not necessary, but it is good practice to include docstrings as a reminder of what the code does. Docstrings in functions also become part of their ‘official’ documentation:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "one_year_csv_writer?" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "one_year_csv_writer(2002, surveys_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We changed the root of the name of the CSV file so we can distinguish it from the one we wrote before. Check the `yearly_files` directory for the file. Did it do what you expect?\n", "\n", "What we really want to do, though, is create files for multiple years without having to request them one by one. 
Let’s write another function that replaces the entire `for` loop by looping through a sequence of years and repeatedly calling the function we just wrote, `one_year_csv_writer`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def yearly_data_csv_writer(start_year, end_year, all_data):\n", " \"\"\"\n", " Writes separate CSV files for each year of data.\n", "\n", " start_year -- the first year of data we want\n", " end_year -- the last year of data we want\n", " all_data -- DataFrame with multi-year data\n", " \"\"\"\n", "\n", " # \"end_year\" is the last year of data we want to pull, so we loop to end_year+1\n", " for year in range(start_year, end_year+1):\n", " one_year_csv_writer(year, all_data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because people will naturally expect that the end year for the files is the last year with data, the for loop inside the function ends at end_year + 1. By writing the entire loop into a function, we’ve made a reusable tool for whenever we need to break a large data file into yearly files. Because we can specify the first and last year for which we want files, we can even use this function to create files for a subset of the years available. This is how we call this function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the data into a DataFrame\n", "surveys_df = pd.read_csv('data/surveys.csv')\n", "\n", "# Create CSV files\n", "yearly_data_csv_writer(1977, 2002, surveys_df)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "BEWARE! If you are using IPython Notebooks and you modify a function, you MUST re-run that cell in order for the changed function to be available to the rest of the code. Nothing will visibly happen when you do this, though, because defining a function without calling it doesn’t produce an output. 
Any cells that use the now-changed functions will also have to be re-run for their output to change." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Challenge**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The functions we wrote demand that we give them a value for every argument. Ideally, we would like these functions to be as flexible and independent as possible. Let’s modify the function yearly_data_csv_writer so that the start_year and end_year default to the full range of the data if they are not supplied by the user. Arguments can be given default values with an equal sign in the function declaration. Any argument without a default value (here, all_data) is required and MUST come before the arguments with default values (which are optional in the function call)." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Both optional arguments:\t 1988 1993\n", "Default values:\t\t\t 1977 2002\n" ] } ], "source": [ "def yearly_data_arg_test(all_data, start_year=1977, end_year=2002):\n", " \"\"\"\n", " Modified from yearly_data_csv_writer to test default argument values!\n", "\n", " all_data -- DataFrame with multi-year data\n", " start_year -- the first year of data we want (default 1977)\n", " end_year -- the last year of data we want (default 2002)\n", " \"\"\"\n", "\n", " return start_year, end_year\n", "\n", "\n", "start, end = yearly_data_arg_test(surveys_df, 1988, 1993)\n", "print('Both optional arguments:\\t', start, end)\n", "\n", "start, end = yearly_data_arg_test(surveys_df)\n", "print('Default values:\\t\\t\\t', start, end)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each “\\t” in the print statements is a tab character, used to align the text and make it easier to read.\n", "\n", "But what if our dataset doesn’t start in 1977 and end in 2002? 
We can modify the function so that it looks for the start and end years in the dataset if those dates are not provided:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Both optional arguments:\t 1988 1993\n", "Default values:\t\t\t 1977 2002\n" ] } ], "source": [ "def yearly_data_arg_test(all_data, start_year=None, end_year=None):\n", " \"\"\"\n", " Modified from yearly_data_csv_writer to test default argument values!\n", "\n", " all_data -- DataFrame with multi-year data\n", " start_year -- the first year of data we want; taken from all_data if None (default None)\n", " end_year -- the last year of data we want; taken from all_data if None (default None)\n", " \"\"\"\n", "\n", " if start_year is None:\n", " start_year = min(all_data.year)\n", " if end_year is None:\n", " end_year = max(all_data.year)\n", "\n", " return start_year, end_year\n", "\n", "\n", "start, end = yearly_data_arg_test(surveys_df, 1988, 1993)\n", "print('Both optional arguments:\\t', start, end)\n", "\n", "start, end = yearly_data_arg_test(surveys_df)\n", "print('Default values:\\t\\t\\t', start, end)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default values of the start_year and end_year arguments in the function yearly_data_arg_test are now None. This is a built-in constant in Python that indicates the absence of a value: the variable exists in the namespace of the function (the directory of variable names), but it has not been given a meaningful value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# If Statements\n", "\n", "The body of the test function now has two conditionals (if statements) that check the values of start_year and end_year. If statements execute a segment of code when some condition is met. 
They commonly look something like this:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a is a positive number\n" ] } ], "source": [ "a = 5\n", "\n", "if a < 0: # Meets first condition?\n", "\n", " # if a IS less than zero\n", " print('a is a negative number')\n", "\n", "elif a > 0: # Did not meet first condition. Meets second condition?\n", "\n", " # if a ISN'T less than zero and IS more than zero\n", " print('a is a positive number')\n", "\n", "else: # Met neither condition\n", "\n", " # if a ISN'T less than zero and ISN'T more than zero\n", " print('a must be zero!')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Change the value of a to see how this code works. The statement elif means “else if”, and all of the conditional statements must end in a colon.\n", "\n", "The if statements in the function yearly_data_arg_test check whether a value was supplied for start_year and end_year. If those variables are None, the conditions evaluate to the boolean True and the bodies of the if statements are executed. On the other hand, if the variable names are associated with some value (they got a number in the function call), the conditions evaluate to False and the bodies are skipped. The opposite conditional statements, which would be True if the variables had received a value in the function call, would be if start_year is not None and if end_year is not None. The shorter forms if start_year and if end_year also work for years, because they are true for any nonzero number.\n", "\n", "As we’ve written it so far, the function yearly_data_arg_test associates values in the function call with arguments in the function definition just based on their order. If the function gets only two values in the function call, the first one will be associated with all_data and the second with start_year, regardless of what we intended them to be. 
We can get around this problem by calling the function using keyword arguments, where each of the arguments in the function definition is associated with a keyword and the function call passes values to the function using these keywords:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default values:\t\t\t 1977 2002\n", "No keywords:\t\t\t 1988 1993\n", "Both keywords, in order:\t 1988 1993\n", "Both keywords, flipped:\t\t 1988 1993\n", "One keyword, default end:\t 1988 2002\n", "One keyword, default start:\t 1977 1993\n" ] } ], "source": [ "start, end = yearly_data_arg_test(surveys_df)\n", "print('Default values:\\t\\t\\t', start, end)\n", "\n", "start, end = yearly_data_arg_test(surveys_df, 1988, 1993)\n", "print('No keywords:\\t\\t\\t', start, end)\n", "\n", "start, end = yearly_data_arg_test(surveys_df, start_year=1988, end_year=1993)\n", "print('Both keywords, in order:\\t', start, end)\n", "\n", "start, end = yearly_data_arg_test(surveys_df, end_year=1993, start_year=1988)\n", "print('Both keywords, flipped:\\t\\t', start, end)\n", "\n", "start, end = yearly_data_arg_test(surveys_df, start_year=1988)\n", "print('One keyword, default end:\\t', start, end)\n", "\n", "start, end = yearly_data_arg_test(surveys_df, end_year=1993)\n", "print('One keyword, default start:\\t', start, end)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }