{ "cells": [ { "cell_type": "markdown", "id": "behind-console", "metadata": {}, "source": [ "---\n", "*Environmental Data Analytics | John Fay and Luana Lima | Developed by Kateri Salk* \n", "*Spring 2023*\n", "\n", "---\n", "# 2: Reproducibility & Coding Basics\n", "\n", "## Objectives\n", "\n", "1. Discuss the benefits of, and approach to, reproducible data analysis\n", "2. Perform simple operations using Python coding syntax\n", "3. Call and create functions in Python" ] }, { "cell_type": "markdown", "id": "above-insertion", "metadata": {}, "source": [ "## Reproducible Data Analysis\n", "\n", "### Fundamentals of reproducibility\n", "\n", "**Reproducibility**: when someone else (e.g., future self) can obtain the same outcomes from the same dataset and analysis\n", "\n", "* Raw data are always separate from processed data\n", "* Link data transformations with a reproducible pipeline\n", "* Raw datasets are NEVER changed\n", "* Cleaning/transformations done through coding, not by editing within Excel\n", "* Edits documented by well-commented code\n", "* Majority of time spent in the data processing phase (clean, wrangle)\n", "\n", "### Rules and Conventions\n", "\n", "* Data stored in nonproprietary formats (e.g., .csv, .md, .txt)\n", "* File names in ASCII text\n", "* No spaces in file names!\n", "* Consistent file naming conventions\n", "* Store data, code, and output in separate folders\n", "\n", "### Version Control\n", "\n", "This semester, we will incorporate the fundamentals of **version control**, the process by which all changes to code, text, and files are tracked. In this manner, we can maintain the data and information needed to support collaborative projects and also make sure your analyses are preserved.\n", "\n", "Before coming to class, you were asked to create a GitHub.com account. **GitHub** is the web hosting platform for maintaining our Git repositories. Our version control system for the purposes of this course is **Git**. 
\n", "\n", "## JupyterLab Basics\n", "When you open your JupyterLab container, you will see the JupyterLab interface. Documentation on the interface is provided here: \n", " \n", "\n", "## Jupyter Notebooks\n", "A Jupyter Notebook is an analog to an R Markdown document. Like an Rmd file, it can include text chunks and code chunks (Python, in this course) that can be viewed together. A few ways notebooks differ from Rmd files:\n", " * Notebooks are not knitted. Instead they can be run and then exported into various formats. Notebooks are more WYSIWYG (what you see is what you get) than Rmd files.\n", " * All outputs are provided in the document itself; there is no separate console to which you can direct output. (Though you can certainly save files to your filesystem.)\n", " * Notebooks are organized into \"cells\", and each cell is defined as either a code cell or a markdown cell. \n", "\n", "After a small learning curve, notebooks and the JupyterLab interface should become fairly intuitive. So, rather than write all this down here, we'll do some hands-on work that will be recorded for your benefit..." ] }, { "cell_type": "markdown", "id": "stainless-cholesterol", "metadata": {}, "source": [ "## Python Coding basics\n", "\n", "### Python as a calculator\n", "Below is a code cell. 
You can run code cells in a few ways: \n", "* Click the ► button in the menu.\n", "* Hit - on your keyboard (or - on a Mac)\n", " \n", "*→Note that you can't run single lines in Jupyter notebooks; you have to run the entire code cell*" ] }, { "cell_type": "markdown", "id": "baking-celebrity", "metadata": {}, "source": [ "##### Basic math" ] }, { "cell_type": "code", "execution_count": null, "id": "expanded-norman", "metadata": {}, "outputs": [], "source": [ "1 + 1" ] }, { "cell_type": "code", "execution_count": null, "id": "northern-helena", "metadata": {}, "outputs": [], "source": [ "1 - 1" ] }, { "cell_type": "code", "execution_count": null, "id": "quantitative-function", "metadata": {}, "outputs": [], "source": [ "2 * 2" ] }, { "cell_type": "code", "execution_count": null, "id": "engaged-suite", "metadata": {}, "outputs": [], "source": [ "1 / 2" ] }, { "cell_type": "code", "execution_count": null, "id": "opposed-check", "metadata": {}, "outputs": [], "source": [ "1 / 200 * 30" ] }, { "cell_type": "code", "execution_count": null, "id": "injured-replacement", "metadata": {}, "outputs": [], "source": [ "5 + 2 * 3" ] }, { "cell_type": "code", "execution_count": null, "id": "fresh-genius", "metadata": {}, "outputs": [], "source": [ "(5 + 2) * 3" ] }, { "cell_type": "markdown", "id": "accepting-cache", "metadata": {}, "source": [ "##### Common terms" ] }, { "cell_type": "code", "execution_count": null, "id": "innocent-cleaners", "metadata": {}, "outputs": [], "source": [ "import math  # we need to import the `math` module for this\n", "math.sqrt(25)" ] }, { "cell_type": "code", "execution_count": null, "id": "cleared-lafayette", "metadata": {}, "outputs": [], "source": [ "math.sin(3)" ] }, { "cell_type": "code", "execution_count": null, "id": "younger-humidity", "metadata": {}, "outputs": [], "source": [ "math.pi" ] }, { "cell_type": "markdown", "id": "heated-overhead", "metadata": {}, "source": [ "##### Summary statistics" ] }, { "cell_type": "code", 
"execution_count": null, "id": "terminal-buffer", "metadata": {}, "outputs": [], "source": [ "import statistics as stats #we need to import the statistics package for this\n", "stats.mean([5, 4, 6, 4, 6])" ] }, { "cell_type": "code", "execution_count": null, "id": "innocent-going", "metadata": {}, "outputs": [], "source": [ "stats.median([5, 4, 6, 4, 6])" ] }, { "cell_type": "markdown", "id": "neural-sense", "metadata": {}, "source": [ "##### Conditional statements" ] }, { "cell_type": "code", "execution_count": null, "id": "threatened-tuition", "metadata": {}, "outputs": [], "source": [ "4 > 5" ] }, { "cell_type": "code", "execution_count": null, "id": "enabling-idaho", "metadata": {}, "outputs": [], "source": [ "4 < 5" ] }, { "cell_type": "code", "execution_count": null, "id": "swedish-merchant", "metadata": {}, "outputs": [], "source": [ "4 != 5" ] }, { "cell_type": "code", "execution_count": null, "id": "lucky-three", "metadata": {}, "outputs": [], "source": [ "4 == 5" ] }, { "cell_type": "markdown", "id": "animated-netscape", "metadata": {}, "source": [ "### Objects\n", "Python does not use R's an *assignment* statement; it just uses `=`. " ] }, { "cell_type": "code", "execution_count": null, "id": "periodic-positive", "metadata": {}, "outputs": [], "source": [ "x = 3*4" ] }, { "cell_type": "markdown", "id": "future-guatemala", "metadata": {}, "source": [ "Now, call up the object `x`. " ] }, { "cell_type": "code", "execution_count": null, "id": "acquired-attack", "metadata": {}, "outputs": [], "source": [ "x" ] }, { "cell_type": "markdown", "id": "welcome-shooting", "metadata": {}, "source": [ "Unlike R-Studio, Jupyter lab does not have a built in variable explorer. (There are [extensions](https://github.com/lckr/jupyterlab-variableInspector) for this, but we won't go into those here...) However, we can run the `%whos` command to reveal all named objects in our current session (including packages). 
" ] }, { "cell_type": "code", "execution_count": null, "id": "introductory-technology", "metadata": {}, "outputs": [], "source": [ "%whos" ] }, { "cell_type": "markdown", "id": "false-tampa", "metadata": {}, "source": [ "### Naming\n", "Python objects can be named with a combination of letters, numbers, and underscore (`_`) - **BUT NO PERIODS (`.`)**. The best object names are *informative*. Resist the temptation to call your object something convenient, like \"a\", \"b\", and so on. Calling your object something specific means that you can call up that object later and have an idea of what it contains, with less need for specific context. \n", "\n", "Informative names are the first illustration of a common data management recommendation: take the time to use best management practices at the outset, and it will save you time in the long term. \n", "\n", "Run the first code cell below. Then, type in \"long\" and press `tab`. What happens?\n" ] }, { "cell_type": "code", "execution_count": null, "id": "greater-facility", "metadata": {}, "outputs": [], "source": [ "long_name_for_illustration = 11" ] }, { "cell_type": "code", "execution_count": null, "id": "perceived-opening", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "according-outside", "metadata": {}, "source": [ "What happens if there is a typo in your code? \n", "Type the following in the R window: \n", "`Long_name_for_illustration` \n", "`longnameforillustration` " ] }, { "cell_type": "code", "execution_count": null, "id": "stable-central", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "institutional-thesaurus", "metadata": {}, "source": [ "### Comments\n", "Within your Python code, it is often useful to include notes about your workflow. So that these aren't interpreted by the software as code, precede the notes with a `#` sign. Your editor will display this comment as a different color to indicate it will not be run in the console. 
Comments can be placed on their own lines or at the end of a line of code." ] }, { "cell_type": "code", "execution_count": null, "id": "brazilian-porter", "metadata": {}, "outputs": [], "source": [ "# I am demonstrating a comment here. " ] }, { "cell_type": "code", "execution_count": null, "id": "primary-income", "metadata": {}, "outputs": [], "source": [ "1 + 1 # This is a simple math problem" ] }, { "cell_type": "markdown", "id": "neither-ebony", "metadata": {}, "source": [ "### Functions\n", "Functions are Python's major tool. Functions can do virtually unlimited things within the Python universe, but each function requires specific inputs that are provided with a specific syntax. We will start with a simple function that is built into Python, `len()`, which returns the length of an object." ] }, { "cell_type": "code", "execution_count": null, "id": "parallel-yugoslavia", "metadata": {}, "outputs": [], "source": [ "len(\"ABCDEF\")" ] }, { "cell_type": "markdown", "id": "fabulous-daily", "metadata": {}, "source": [ "*To mimic the code in the R counterpart of this document, we actually need two functions in Python. The `range()` function works like R's `seq()` function, but it starts at 0 by default, excludes the stop value, and returns a \"range\" object, not a vector. 
To convert this to something like an R vector, we coerce the range object into a list with the `list()` function...*" ] }, { "cell_type": "code", "execution_count": null, "id": "acute-metabolism", "metadata": {}, "outputs": [], "source": [ "list(range(10))" ] }, { "cell_type": "code", "execution_count": null, "id": "unnecessary-thousand", "metadata": {}, "outputs": [], "source": [ "ten_sequence = list(range(10))\n", "ten_sequence" ] }, { "cell_type": "code", "execution_count": null, "id": "champion-shannon", "metadata": {}, "outputs": [], "source": [ "list(range(1,10,2))" ] }, { "cell_type": "code", "execution_count": null, "id": "consolidated-neutral", "metadata": {}, "outputs": [], "source": [ "?range" ] }, { "cell_type": "markdown", "id": "furnished-journey", "metadata": {}, "source": [ "#### Defining your own function\n", "The basic form of a function call is `functionname()`, and the packages we use in this class follow this basic form. However, there may be situations when you will want to create your own function. Below is a description of how to write functions through the metaphor of creating a recipe (credit: @IsabellaGhement on Twitter). \n", "\n", "Writing a function is like writing a recipe. Your function will need a recipe name (functionname). Your recipe ingredients will go inside the parentheses. In R, the recipe steps and end product go inside the curly brackets.\n", "\n", ">→ *Note that Python does not use curly braces \"{ }\" to indicate which code goes into the function. 
Instead it uses **indentation**: all indented code will be part of the function's code...* \n", "```python\n", "def functionname():\n", "    statement_1\n", "    statement_2\n", "    return result\n", "```" ] }, { "cell_type": "markdown", "id": "senior-accommodation", "metadata": {}, "source": [ "♦ A single ingredient recipe: " ] }, { "cell_type": "code", "execution_count": null, "id": "informed-trash", "metadata": {}, "outputs": [], "source": [ "# Write the recipe\n", "def recipe1(x):\n", "    mix = x*2\n", "    return mix" ] }, { "cell_type": "code", "execution_count": null, "id": "authorized-millennium", "metadata": {}, "outputs": [], "source": [ "# Bake the recipe\n", "simplemeal = recipe1(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "whole-preview", "metadata": {}, "outputs": [], "source": [ "# Serve the recipe\n", "simplemeal" ] }, { "cell_type": "markdown", "id": "bigger-plain", "metadata": {}, "source": [ "♦ Two single ingredient recipes, baked at the same time: " ] }, { "cell_type": "code", "execution_count": null, "id": "congressional-gates", "metadata": {}, "outputs": [], "source": [ "def recipe2(x):\n", "    mix1 = x*2\n", "    mix2 = x/2\n", "    return [mix1,  # the expression continues on the next line because it is inside brackets\n", "            mix2]" ] }, { "cell_type": "code", "execution_count": null, "id": "drawn-quebec", "metadata": {}, "outputs": [], "source": [ "doublesimplemeal = recipe2(6)" ] }, { "cell_type": "code", "execution_count": null, "id": "handmade-invention", "metadata": {}, "outputs": [], "source": [ "doublesimplemeal" ] }, { "cell_type": "markdown", "id": "primary-twelve", "metadata": {}, "source": [ "♦ Two double ingredient recipes, baked at the same time: " ] }, { "cell_type": "code", "execution_count": null, "id": "representative-pension", "metadata": {}, "outputs": [], "source": [ "def recipe3(x, f):\n", "    mix1 = x*f\n", "    mix2 = x/f\n", "    return [mix1, mix2]" ] }, { "cell_type": "code", "execution_count": null, "id": 
"wooden-discretion", "metadata": {}, "outputs": [], "source": [ "doublecomplexmeal = recipe3(x = 5, f = 2)\n", "doublecomplexmeal" ] }, { "cell_type": "code", "execution_count": null, "id": "pressing-contractor", "metadata": {}, "outputs": [], "source": [ "#Show the first item in the returned list\n", "doublecomplexmeal[0]" ] }, { "cell_type": "markdown", "id": "coated-journal", "metadata": {}, "source": [ "♦Make a recipe based on the ingredients you have" ] }, { "cell_type": "code", "execution_count": null, "id": "italian-convert", "metadata": {}, "outputs": [], "source": [ "def recipe4(x):\n", " if(x < 3):\n", " return x*2 \n", " else:\n", " return x/2 \n", "\n", "def recipe5(x):\n", " if(x < 3): return x*2\n", " elif(x > 3): return x/2\n", " else: return x" ] }, { "cell_type": "code", "execution_count": null, "id": "respected-covering", "metadata": {}, "outputs": [], "source": [ "meal = recipe4(4); meal" ] }, { "cell_type": "code", "execution_count": null, "id": "developed-keeping", "metadata": {}, "outputs": [], "source": [ "meal2 = recipe4(2); meal2" ] }, { "cell_type": "code", "execution_count": null, "id": "raised-victoria", "metadata": {}, "outputs": [], "source": [ "meal3 = recipe5(3); meal3" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }