{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Why automate your workflow, and how to approach the process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Questions for students to consider:**\n", "\n", "1. In the data exploration section you made some plots from your data. What if you want to look at other relationships? \n", "2. Are there computational processes you do often? How do you implement these? \n", "3. Do you have a clear workflow you could replicate from data to conclusions?\n", "4. Could you plug this new data set into your old workflow? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Level of Python / Jupyter Automation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. __Good__ - Document all analysis steps in enough detail that they can be reproduced successfully.\n", "2. __Better__ - Script your analysis.\n", "3. __Best__ - Script your analysis and write tests to validate each step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key takehomes\n", "- Code is read much more often than it is written\n", "- You are NEVER finished with an analysis (drafts, reviewer comments, new data etc.). 
Make your own future life easy!\n", "- Repeating yourself creates opportunity for mistakes/divergence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning Objectives of Automation Module: (total time, 3 hrs including 15 min break)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Lesson 1](#lesson-1) \n", "- Employ best practices of naming a variable including: don’t use existing function names, avoid periods in names, don’t use numbers at the beginning of a variable name.\n", "- Defensive programming: catch errors instead of just fixing them.\n", "\n", "### [Lesson 2](#lesson-2) \n", "- Define \"Don't Repeat Yourself\" (DRY) and provide examples of how you would implement DRY in your code\n", "- Identify code that can be modularized following DRY and implement a modular workflow using functions.\n", "\n", "### [Lesson 3](#lesson-3) \n", "- Know how to construct a function: variables, function name, syntax, documentation, return values\n", "- Demonstrate use of function within the notebook / code. \n", "- Construct and compose function documentation that clearly defines inputs, output variables and behaviour.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Basic Overview of the suggested workflow using Socrative (Optional)__" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "- Use Socrative quiz to collect answers from student activities (students can run their code in their notebooks, and post to socrative). This will allow the instructor to see what solutions students came up with, and identify any places where misconceptions and confusion are coming up. 
Using Socrative quizzes also keeps a record of the student work that can be analyzed after class to see how students are learning and where they are having trouble.\n", "- Prepared Socrative quizzes designed for the automation module can be shared with each teacher by URL link so they do not have to be remade." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Please download the cleaned data file:__\n", "\n", "https://raw.githubusercontent.com/Reproducible-Science-Curriculum/automation-RR-Jupyter/gh-pages/notebooks/gapminder_cleaned.csv\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 1 \n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin by creating a new Jupyter notebook.\n", "\n", "__Question:__ \n", "\n", "- According to the organization we set up, where should we put this notebook?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review of good variable practices\n", "**Learning Objective:** Employ best practices of naming a variable including: don’t use existing function names, avoid periods in names, don’t use numbers at the beginning of a variable name\n", "\n", "### Types of variables:\n", "- strings, integers, etc.\n", "\n", "References:\n", "- PEP8 - Style Guide for Python Code - https://www.python.org/dev/peps/pep-0008/ \n", "- https://www.tutorialspoint.com/python3/python_variable_types.htm\n", "\n", "### Keep in mind that code is read many more times than it is written!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other useful conventions with variables to follow\n", "1. Set up variables at the beginning of your page, after importing libraries\n", "2. 
Use variables instead of hard-coded file names, values or strings. If you need to change something, you then change the variable once at the top instead of searching through all your code to make sure you made the change everywhere. This will also make your code more reproducible in the end. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's Get Started" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get started we will import the Python modules that we will use in the session. These modules are developed by programmers and made available as open source packages for Python. We would normally have to install each of these ourselves, but they are included as part of the [Anaconda Python Distribution](https://www.continuum.io/downloads).\n", "\n", "The _%matplotlib inline_ statement is part of the Jupyter and IPython magic that enables plots generated by the matplotlib package to be displayed as output in the Jupyter Notebook instead of opening in a separate window." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import matplotlib\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will continue where the _data exploration_ module left off by importing the cleaned gapminder dataset and assigning it to a new variable named __df__ to denote that we have imported a _pandas_ dataframe.\n", "\n", "As validation that we have imported the data we will also look at the top five rows of data using the _head_ method of pandas." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Defensive programming" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "ename": "FileNotFoundError", "evalue": "[Errno 2] No such file or directory: 'data/gapminder_cleaned.csv'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mcleaned_data_location\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'data/gapminder_cleaned.csv'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcleaned_data_location\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mread_csv\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)\u001b[0m\n\u001b[1;32m 608\u001b[0m 
\u001b[0mkwds\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkwds_defaults\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 609\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 610\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 611\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 612\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 460\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 461\u001b[0m \u001b[0;31m# Create the parser.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 462\u001b[0;31m \u001b[0mparser\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 463\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 464\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m 817\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"has_index_names\"\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"has_index_names\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 818\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 819\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 820\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 821\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m_make_engine\u001b[0;34m(self, engine)\u001b[0m\n\u001b[1;32m 1048\u001b[0m )\n\u001b[1;32m 1049\u001b[0m \u001b[0;31m# error: Too many arguments for \"ParserBase\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1050\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mmapping\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# type: ignore[call-arg]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1051\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1052\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_failover_to_python\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;32m/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, src, **kwds)\u001b[0m\n\u001b[1;32m 1865\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1866\u001b[0m \u001b[0;31m# open handles\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1867\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_open_handles\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msrc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1868\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhandles\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1869\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;32min\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m\"storage_options\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"encoding\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"memory_map\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"compression\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m_open_handles\u001b[0;34m(self, src, kwds)\u001b[0m\n\u001b[1;32m 1360\u001b[0m \u001b[0mLet\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mreaders\u001b[0m \u001b[0mopen\u001b[0m \u001b[0mIOHanldes\u001b[0m \u001b[0mafter\u001b[0m \u001b[0mthey\u001b[0m \u001b[0mare\u001b[0m \u001b[0mdone\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mtheir\u001b[0m \u001b[0mpotential\u001b[0m \u001b[0mraises\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1361\u001b[0m \"\"\"\n\u001b[0;32m-> 1362\u001b[0;31m self.handles = get_handle(\n\u001b[0m\u001b[1;32m 
1363\u001b[0m \u001b[0msrc\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1364\u001b[0m \u001b[0;34m\"r\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/opt/anaconda3/lib/python3.8/site-packages/pandas/io/common.py\u001b[0m in \u001b[0;36mget_handle\u001b[0;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[1;32m 640\u001b[0m \u001b[0merrors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"replace\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 641\u001b[0m \u001b[0;31m# Encoding\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 642\u001b[0;31m handle = open(\n\u001b[0m\u001b[1;32m 643\u001b[0m \u001b[0mhandle\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 644\u001b[0m \u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmode\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: 'data/gapminder_cleaned.csv'" ] } ], "source": [ "cleaned_data_location = 'data/gapminder_cleaned.csv'\n", "df = pd.read_csv(cleaned_data_location)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Whoops! That doesn't look great, and we know that the file *does* exist. We just downloaded it!\n", "The long traceback boils down to its last line: pandas could not find a file at the path we gave it.\n", "Let's do some defensive programming to prevent things from breaking." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your most common collaborator is YOU, in the future. 
Including error handling and messages helps your colleagues and students, but most importantly, YOU.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Catching errors using try-except\n", "`try` and `except` statements are a convenient way to catch and deal with errors: if the code in the `try` block raises the named exception, the `except` block runs instead of the program stopping with a traceback." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Couldn't find data file, check path? You tried data/gapminder_cleaned.csv\n" ] } ], "source": [ "cleaned_data_location = 'data/gapminder_cleaned.csv'\n", "\n", "try:\n", "    df = pd.read_csv(cleaned_data_location)\n", "\n", "except FileNotFoundError:\n", "    print(\"Couldn't find data file, check path? You tried\", cleaned_data_location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: what do you need to fix to actually open that data file?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "cleaned_data_location = '../data/gapminder_cleaned.csv'\n", "\n", "try:\n", "    df = pd.read_csv(cleaned_data_location)\n", "\n", "except FileNotFoundError:\n", "    print(\"Couldn't find data file, check path? You tried\", cleaned_data_location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can set a flag at the top of our script, e.g. `VERBOSE`, to control how much information we want to see.\n", "\n", "When the variable is set to `True` we print out extra information." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "   year       pop  lifeexp   gdppercap      country  continent\n", "0  1952   8425333   28.801  779.445314  afghanistan       asia\n", "1  1957   9240934   30.332  820.853030  afghanistan       asia\n", "2  1962  10267083   31.997  853.100710  afghanistan       asia\n", "3  1967  11537966   34.020  836.197138  afghanistan       asia\n", "4  1972  13079460   36.088  739.981106  afghanistan       asia\n" ] } ], "source": [ "VERBOSE = True\n", "\n", "try:\n", "    df = pd.read_csv(cleaned_data_location)\n", "    if VERBOSE:\n", "        print(df.head())\n", "\n", "except FileNotFoundError:\n", "    print(\"Couldn't find data file, check path? You tried\", cleaned_data_location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using assert statements to be explicit about assumptions\n", "`assert` raises an `AssertionError` if the condition is not true. Nothing happens if it is true." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "years = df['year'].unique()\n", "years.sort()\n", "assert years[-1] == 2007  # Check that the most recent year is as expected" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Lesson 2 \n", "---" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "__Learning Objectives__\n", "- Define \"Don't Repeat Yourself\" (DRY) and provide examples of how you would implement DRY in your code\n", "- Identify code that can be modularized following DRY and implement a modular workflow using functions." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "As you write software there comes a time when you want to repeat an analysis step that you have already done earlier in your analysis. Our natural tendency is to copy the code that we wrote and paste it into the new location for reuse. Sounds easy, right? 
Copy, paste, move on...not so fast.\n", "\n", "What happens if there is a problem with the code or you decide to tweak it, just a little, to change a format or enhance it?\n", "\n", "You will have to change the code in every place you have copied it. How do you know if you got _all_ of the copies? What happens if one of the copies is _not_ changed?\n", "\n", "These examples illustrate the principle of \"Don't Repeat Yourself\". We are going to look at how to __refactor__ our code and pull pieces out by making them functions. Then we will __call__ the function every time we want to use that code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# What if we want to ask questions about variables over time?\n", "## Let's start with mean life expectancy in Asia" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Decide on a year, and calculate the statistic of interest\n", "## do it point by point!! DON'T RUSH\n", "\n", "mask_asia = df['continent'] == 'asia'\n", "df_asia = df[mask_asia]\n", "\n", "mask_1952 = df_asia['year'] == 1952\n", "df_1952 = df_asia[mask_1952]\n", "\n", "value = np.mean(df_1952['lifeexp'])\n", "\n", "# create an empty list\n", "result = []\n", "\n", "# append a row to the list with a tuple containing your result\n", "# (the year is kept as an integer, matching the 'year' column of df)\n", "result.append(('asia', 1952, value))\n", "\n", "# Turn the summary into a dataframe so that we can visualize easily\n", "result = pd.DataFrame(result, columns=['continent', 'year', 'lifeexp'])\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>continent</th><th>year</th><th>lifeexp</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>asia</td><td>1952</td><td>46.314394</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "   continent  year    lifeexp\n", "0       asia  1952  46.314394" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Use a for loop to go through the years and calculate the statistic of interest\n", "\n", "mask_asia = df['continent'] == 'asia'\n", "df_asia = df[mask_asia]\n", "\n", "years = df_asia['year'].unique()\n", "summary = []\n", "\n", "for year in years:\n", "    mask_year = df_asia['year'] == year\n", "    df_year = df_asia[mask_year]\n", "    value = np.mean(df_year['lifeexp'])\n", "    summary.append(('asia', year, value))\n", "\n", "# Turn the summary into a dataframe so that we can visualize easily\n", "summary = pd.DataFrame(summary, columns=['continent', 'year', 'lifeexp'])" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>continent</th><th>year</th><th>lifeexp</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>asia</td><td>1952</td><td>46.314394</td></tr>\n",
"    <tr><th>1</th><td>asia</td><td>1957</td><td>49.318544</td></tr>\n",
"    <tr><th>2</th><td>asia</td><td>1962</td><td>51.563223</td></tr>\n",
"    <tr><th>3</th><td>asia</td><td>1967</td><td>54.663640</td></tr>\n",
"    <tr><th>4</th><td>asia</td><td>1972</td><td>57.319269</td></tr>\n",
"    <tr><th>5</th><td>asia</td><td>1977</td><td>59.610556</td></tr>\n",
"    <tr><th>6</th><td>asia</td><td>1982</td><td>62.617939</td></tr>\n",
"    <tr><th>7</th><td>asia</td><td>1987</td><td>64.851182</td></tr>\n",
"    <tr><th>8</th><td>asia</td><td>1992</td><td>66.537212</td></tr>\n",
"    <tr><th>9</th><td>asia</td><td>1997</td><td>68.020515</td></tr>\n",
"    <tr><th>10</th><td>asia</td><td>2002</td><td>69.233879</td></tr>\n",
"    <tr><th>11</th><td>asia</td><td>2007</td><td>70.728485</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "    continent  year    lifeexp\n", "0        asia  1952  46.314394\n", "1        asia  1957  49.318544\n", "2        asia  1962  51.563223\n", "3        asia  1967  54.663640\n", "4        asia  1972  57.319269\n", "5        asia  1977  59.610556\n", "6        asia  1982  62.617939\n", "7        asia  1987  64.851182\n", "8        asia  1992  66.537212\n", "9        asia  1997  68.020515\n", "10       asia  2002  69.233879\n", "11       asia  2007  70.728485" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# But now your PI wants that information for all the years for a different continent!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Activity: How could we use variables to make it easier to re-run this for different continents?" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Define which continent / category we will use\n", "category = 'lifeexp'\n", "continent = 'asia'\n", "\n", "# Create a mask that selects the continent of choice\n", "# (use the variable, not a hard-coded string, so only the line above needs changing)\n", "mask_continent = df['continent'] == continent\n", "df_continent = df[mask_continent]\n", "\n", "# Loop through years and calculate the statistic of interest\n", "years = df_continent['year'].unique()\n", "summary = []\n", "\n", "for year in years:\n", "    mask_year = df_continent['year'] == year\n", "    df_year = df_continent[mask_year]\n", "    value = np.mean(df_year[category])\n", "    summary.append((continent, year, value))\n", "\n", "# Turn the summary into a dataframe so that we can visualize easily\n", "summary = pd.DataFrame(summary, columns=['continent', 'year', category])" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "summary.plot.line('year', category, label=\"life expectancy\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 3 \n", "## Building functions\n", "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "__Learning Objectives__\n", "- Know how to construct a function: variables, function name, syntax, documentation, return values\n", "- Demonstrate use of function within the notebook / code.\n", "- Construct and compose function documentation that clearly defines inputs, output variables and behaviour." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def calculate_mean_over_time(data, category, continent, verbose=False):\n", "\n", "    # Create a mask that selects the continent of choice\n", "    mask_continent = data['continent'] == continent\n", "    data_continent = data[mask_continent]\n", "\n", "    # Loop through years and calculate the statistic of interest\n", "    years = data_continent['year'].unique()\n", "    summary = []\n", "    for year in years:\n", "        if verbose:\n", "            print(year)\n", "        mask_year = data_continent['year'] == year\n", "        data_year = data_continent[mask_year]\n", "        value = np.mean(data_year[category])\n", "        summary.append((continent, year, value))\n", "\n", "    # Turn the summary into a dataframe so that we can visualize easily\n", "    summary = pd.DataFrame(summary, columns=['continent', 'year', category])\n", "    return summary" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>continent</th><th>year</th><th>lifeexp</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>asia</td><td>1952</td><td>46.314394</td></tr>\n",
"    <tr><th>1</th><td>asia</td><td>1957</td><td>49.318544</td></tr>\n",
"    <tr><th>2</th><td>asia</td><td>1962</td><td>51.563223</td></tr>\n",
"    <tr><th>3</th><td>asia</td><td>1967</td><td>54.663640</td></tr>\n",
"    <tr><th>4</th><td>asia</td><td>1972</td><td>57.319269</td></tr>\n",
"    <tr><th>5</th><td>asia</td><td>1977</td><td>59.610556</td></tr>\n",
"    <tr><th>6</th><td>asia</td><td>1982</td><td>62.617939</td></tr>\n",
"    <tr><th>7</th><td>asia</td><td>1987</td><td>64.851182</td></tr>\n",
"    <tr><th>8</th><td>asia</td><td>1992</td><td>66.537212</td></tr>\n",
"    <tr><th>9</th><td>asia</td><td>1997</td><td>68.020515</td></tr>\n",
"    <tr><th>10</th><td>asia</td><td>2002</td><td>69.233879</td></tr>\n",
"    <tr><th>11</th><td>asia</td><td>2007</td><td>70.728485</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "    continent  year    lifeexp\n", "0        asia  1952  46.314394\n", "1        asia  1957  49.318544\n", "2        asia  1962  51.563223\n", "3        asia  1967  54.663640\n", "4        asia  1972  57.319269\n", "5        asia  1977  59.610556\n", "6        asia  1982  62.617939\n", "7        asia  1987  64.851182\n", "8        asia  1992  66.537212\n", "9        asia  1997  68.020515\n", "10       asia  2002  69.233879\n", "11       asia  2007  70.728485" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "VERBOSE = False\n", "calculate_mean_over_time(df, \"lifeexp\", \"asia\", VERBOSE)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Activity:\n", "How would you make a function to calculate the median through time?\n", "(Hint: focus on DRY.)" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def calculate_statistic_over_time(data, category, continent, func):\n", "\n", "    # Create a mask that selects the continent of choice\n", "    mask_continent = data['continent'] == continent\n", "    data_continent = data[mask_continent]\n", "\n", "    # Loop through years and calculate the statistic of interest\n", "    years = data_continent['year'].unique()\n", "    summary = []\n", "    for year in years:\n", "        mask_year = data_continent['year'] == year\n", "        data_year = data_continent[mask_year]\n", "        # func is any function that reduces a column to one value (e.g. np.mean, np.median)\n", "        value = func(data_year[category])\n", "        summary.append((continent, year, value))\n", "\n", "    # Turn the summary into a dataframe so that we can visualize easily\n", "    summary = pd.DataFrame(summary, columns=['continent', 'year', category])\n", "    return summary" ] },
{ "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
continentyearlifeexp
0asia195244.869
1asia195748.284
2asia196249.325
3asia196753.655
4asia197256.950
5asia197760.765
6asia198263.739
7asia198766.295
8asia199268.690
9asia199770.265
10asia200271.028
11asia200772.396
\n", "
" ], "text/plain": [ " continent year lifeexp\n", "0 asia 1952 44.869\n", "1 asia 1957 48.284\n", "2 asia 1962 49.325\n", "3 asia 1967 53.655\n", "4 asia 1972 56.950\n", "5 asia 1977 60.765\n", "6 asia 1982 63.739\n", "7 asia 1987 66.295\n", "8 asia 1992 68.690\n", "9 asia 1997 70.265\n", "10 asia 2002 71.028\n", "11 asia 2007 72.396" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "calculate_statistic_over_time(df, \"lifeexp\", \"asia\", np.median)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Including docstrings with any functions is good programming practice, and helps out your collaborators (i.e., you!)\n", "\n", "More on docstrings: https://www.python.org/dev/peps/pep-0257/#what-is-a-docstring" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def calculate_statistic_over_time(data, category, continent, func):\n", " \"\"\"calculate values of a statistic through time\n", "\n", " Args:\n", " data: a data frame\n", " category: one of the column headers of the data frame (e.g. 'lifeexp')\n", " continent: possible value of the continent column of that data frame (e.g. 'asia')\n", " func: the funtion to apply to data values (e.g. 
np.mean)\n", " \n", " Returns:\n", " a summary table of value per year.\n", "\n", " \"\"\"\n", " \n", " # Create a mask that selects the continent of choice\n", " mask_continent = data['continent'] == continent\n", " data_continent = data[mask_continent]\n", "\n", " # Loop through years and calculate the statistic of interest\n", " years = data_continent['year'].unique()\n", " summary = []\n", " for year in years:\n", " mask_year = data_continent['year'] == year\n", " data_year = data_continent[mask_year]\n", " value = func(data_year[category])\n", " summary.append((continent, year, value))\n", "\n", " # Turn the summary into a dataframe so that we can visualize easily\n", " summary = pd.DataFrame(summary, columns=['continent', 'year', category])\n", " return summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Defensive programming activity\n", "How would you check to make sure input values are reasonable?\n", "Use assertions or try-except statements, and add options with the VERBOSE flag" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def calculate_statistic_over_time(data, category, continent, func, verbose=False):\n", " \"\"\"calculate values of a statistic through time\n", "\n", " Args:\n", " data: a pandas data frame\n", " category: one of the column headers of the data frame (e.g. 'lifeexp')\n", " continent: possible value of the continent column of that data frame (e.g. 'asia')\n", " func: the funtion to apply to data values (e.g. 
np.mean)\n",
"\n",
"    Returns:\n",
"        a summary table of value per year.\n",
"\n",
"    \"\"\"\n",
"\n",
"    # Check that the inputs are usable before doing any work\n",
"    assert category in data.columns.values, f\"no column named {category!r}\"\n",
"    assert 'continent' in data.columns.values, \"no 'continent' column\"\n",
"    assert continent in data['continent'].unique(), f\"unknown continent {continent!r}\"\n",
"\n",
"    # Create a mask that selects the continent of choice\n",
"    mask_continent = data['continent'] == continent\n",
"    data_continent = data[mask_continent]\n",
"\n",
"    # Loop through years and calculate the statistic of interest\n",
"    years = data_continent['year'].unique()\n",
"    if verbose:\n",
"        print(\"years include\", years)\n",
"    summary = []\n",
"    for year in years:\n",
"        mask_year = data_continent['year'] == year\n",
"        data_year = data_continent[mask_year]\n",
"        value = func(data_year[category])\n",
"        summary.append((continent, year, value))\n",
"\n",
"    # Turn the summary into a dataframe so that we can visualize easily\n",
"    summary = pd.DataFrame(summary, columns=['continent', 'year', category])\n",
"    return summary" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>continent</th><th>year</th><th>lifeexp</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>asia</td><td>1952</td><td>44.869</td></tr>\n",
"    <tr><th>1</th><td>asia</td><td>1957</td><td>48.284</td></tr>\n",
"    <tr><th>2</th><td>asia</td><td>1962</td><td>49.325</td></tr>\n",
"    <tr><th>3</th><td>asia</td><td>1967</td><td>53.655</td></tr>\n",
"    <tr><th>4</th><td>asia</td><td>1972</td><td>56.950</td></tr>\n",
"    <tr><th>5</th><td>asia</td><td>1977</td><td>60.765</td></tr>\n",
"    <tr><th>6</th><td>asia</td><td>1982</td><td>63.739</td></tr>\n",
"    <tr><th>7</th><td>asia</td><td>1987</td><td>66.295</td></tr>\n",
"    <tr><th>8</th><td>asia</td><td>1992</td><td>68.690</td></tr>\n",
"    <tr><th>9</th><td>asia</td><td>1997</td><td>70.265</td></tr>\n",
"    <tr><th>10</th><td>asia</td><td>2002</td><td>71.028</td></tr>\n",
"    <tr><th>11</th><td>asia</td><td>2007</td><td>72.396</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " continent year lifeexp\n", "0 asia 1952 44.869\n", "1 asia 1957 48.284\n", "2 asia 1962 49.325\n", "3 asia 1967 53.655\n", "4 asia 1972 56.950\n", "5 asia 1977 60.765\n", "6 asia 1982 63.739\n", "7 asia 1987 66.295\n", "8 asia 1992 68.690\n", "9 asia 1997 70.265\n", "10 asia 2002 71.028\n", "11 asia 2007 72.396" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "calculate_statistic_over_time(df, \"lifeexp\", \"asia\", np.median)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function calculate_statistic_over_time in module __main__:\n", "\n", "calculate_statistic_over_time(data, category, continent, func, verbose=False)\n", " calculate values of a statistic through time\n", " \n", " Args:\n", " data: a pandas data frame\n", " category: one of the column headers of the data frame (e.g. 'lifeexp')\n", " continent: possible value of the continent column of that data frame (e.g. 'asia')\n", " func: the funtion to apply to data values (e.g. np.mean)\n", " \n", " Returns:\n", " a summary table of value per year.\n", "\n" ] } ], "source": [ "help(calculate_statistic_over_time)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#use this function to plot life expectancy through time\n", "continents = df['continent'].unique()\n", "VERBOSE = False\n", "fig, ax = plt.subplots()\n", "\n", "for continent in continents:\n", " output = calculate_statistic_over_time(df,\"lifeexp\", continent, np.median)\n", " output.plot.line('year', \"lifeexp\", ax=ax, label=continent)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Demo: Importing your own functions as module, (main.ipynb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 1 }