{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Data Science using Python\n", "\n", "\n", "## Abhijit Dasgupta, PhD" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Plans for this workshop\n", "\n", "We're meeting today and tomorrow 1:30 - 3:00 pm" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "| Day 1 | Day 2 |\n", "| ---------------------------- | -------------------- |\n", "| Why Python for Data Science? | Data visualization |\n", "| A Python Primer | Statistical modeling |\n", "| Pandas for data munging | Machine learning |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Scope\n", "\n", "Obviously we are going to cover each topic at a high level, given time constraints. I intend to give you a taste for what is possible" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There are much more detailed resources available as Jupyter notebooks.\n", "\n", "[https://github.com/districtdatalabs/Brookings_Python_DS](https://github.com/districtdatalabs/Brookings_Python_DS)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "| Topic | Notebook |\n", "| ---------------------------------------------- | ------------------ |\n", "| Python primer | 00_python_primer |\n", "| Numpy and the data science stack (not covered) | 01_python_tools_ds |\n", "| Pandas for data munging | 02_python_pandas |\n", "| Data visualization | 03_python_vis |\n", "| Statistical modeling | 04_python_stat |\n", "| Machine learning | 05_python_learning |\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Running notebooks on Binder\n", "\n", "Binder is a free service that allows Python 
resources to be run on the web from Github repositories. \n", "\n", "[Binder demo](https://github.com/districtdatalabs/Brookings_Python_DS)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Why Python for Data Science?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Who is a data scientist?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## One definition\n", "\n", "\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Unclear definition\n", "\n", "+ Statistician\n", "+ Computer scientist\n", "+ Database engineer\n", "+ Software engineer\n", "+ Data engineer\n", "+ Mathematician\n", "\n", "Some of the best ones I know are
\n", "neurobiologists and physicists" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## A broad umbrella\n", "\n", "Anyone who wants to work with data to solve problems within particular domains" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Data Science\n", "\n", "## What it involves\n", "\n", "![DSPipeline](graphs/DSPipeline.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What it involves\n", "\n", "![data-science-explore](graphs/data-science-explore.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What it involves\n", "\n", "1. Managing and cleaning data\n", "1. Interest in exploring relationships between things, informed by domain knowledge\n", "\n", "1. Statistical know-how\n", "1. Computational skills\n", "\n", "1. Tools" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## We're here for the tools\n", "\n", "The two main tools are\n", "\n", "1. Python (https://www.python.org)\n", "1. R (https://www.r-project.org)\n", "\n", "There is a perpetual flame war between the two camps\n", "\n", "That is not important" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Why Python?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Pros\n", "\n", "1. Very popular general-purpose programming language\n", "1. Strong ecosystem through packages (over 230K projects)\n", "1. Succinct syntax\n", "1. Reasonably fast while also relatively easy to program\n", " \n", " + Computational time vs Developer time\n", "1. Self-documenting\n", "1. 
Easier to integrate into production pipelines that already use Python\n", " + Web frameworks (Django, Flask, ...)\n", " + Workflow managers (Luigi, ...)\n", "1. Increasingly strong Data Science Stack" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Cons\n", "\n", "1. Not a rich-enough ecosystem for some purposes\n", "1. More computer science-y, less statistical\n", "1. Poorer frameworks for display and dissemination of information\n", "\n", "These are areas where R tends to shine. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Python Data Science stack" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Contributed packages over past 30 years\n", "\n", "+ To emulate Matlab\n", " + Numpy\n", " + Scipy\n", " + Matplotlib\n", "+ To emulate Maple\n", " + Sympy\n", "+ To add statistics/data science\n", " + Pandas\n", " + Various data visualization packages\n", " - seaborn\n", " - plotly" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "+ Many more user-contributed packages\n", "+ The basic philosophy has been to concentrate on a few monolithic comprehensive packages\n", " - statsmodels (Statistics)\n", " - scikit-learn (Machine Learning)\n", " - pillow (Image analysis)\n", " - nltk (Natural Language Processing)\n", " - tensorflow & PyTorch (Deep learning)\n", " - PyMC3 (Bayesian learning)\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Python as glue" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](graphs/r_py_glue.png)\n", "\n", "* The `rpy2` Python package is not developed on Windows\n", "* The `reticulate` R package actually works quite well" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": 
"subslide" } }, "source": [ "1. Data I/O\n", " + We can read data from a variety of formats into Python\n", " - Some proprietary\n", " - R, SAS, Stata, SQL, Parquet, JSON\n", "1. There are ways of running R, SAS, others from within Python\n", "1. The Jupyter sub-ecosystem allows the same interface for [many languages](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)\n", " + R, SAS, Julia, Haskell, Javascript\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# A Python Primer" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Python is a popular, general purpose scripting language. The [TIOBE index](https://www.tiobe.com/tiobe-index/) ranks Python as the third most popular programming language after C and Java, while this recent article in IEEE Computer Society says" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "> \"Python can be used for web and desktop applications, GUI-based desktop applications, machine learning, data science, and network servers. The programming language enjoys immense community support and offers several open-source libraries, frameworks, and modules that make application development a cakewalk.\" ([Belani, 2020](https://www.computer.org/publications/tech-news/trends/programming-languages-you-should-learn-in-2020))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Python is a modular language\n", "\n", "Python is not a monolithic language but is comprised of \n", "\n", "1. a base programming language\n", "1. numerous modules or libraries that add functionality to the language. 
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Python is a scripting language" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Using Python requires typing!! \n", "\n", "1. You write *code* in Python \n", "1. that is then interpreted by the Python interpreter \n", "1. to make the computer implement your instructions. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Your code is like a recipe that you write for the computer**. \n", "\n", "Python is a *high-level language* in that the code is English-like and human-readable, which reduces the time needed for a person to create the recipe. \n", "\n", "It is a language in that it has nouns (*variables* or *objects*), verbs (*functions*) and a structure or grammar that allows the programmer to write recipes for different functionalities." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Scripting can be frustrating in the beginning. You will find that the code you wrote doesn't work \"for some reason\", though it looks like you wrote it fine. The first things I look for, in order, are" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One thing that is important to note in Python: **case is important!** If we have two objects named `data` and `Data`, they will refer to different things. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "1. Did I spell all the variables and functions correctly?\n", "1. Did I close all the brackets I have opened?\n", "1. Did I finish all the quotes I started, and pair single and double quotes correctly?\n", "1. Did I already import the right module for the function I'm trying to use?\n", "1. Do I have the right indentation in my code?"
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "These may not make sense right now, but as we get further into Python, I hope you will remember them to help debug your code. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## An example\n", "\n", "Let's consider the following piece of Python code:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lower: [0, 1, 2, 3]\n", "upper: [4, 5, 6, 7, 8, 9]\n" ] } ], "source": [ "# set a splitting point\n", "split_point = 3\n", "\n", "# make two empty lists\n", "lower = []; upper = []\n", "\n", "# Split numbers from 0 to 9 into two groups, \n", "# one lower or equal to the split point and \n", "# one higher than the split point\n", "\n", "for i in range(10):  # count from 0 to 9\n", "    if i <= split_point:\n", "        lower.append(i)\n", "    else:\n", "        upper.append(i)\n", "\n", "print(\"lower:\", lower)\n", "print(\"upper:\", upper)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "First note that any line (or part of a line) starting with `#` is a **comment** in Python and is ignored by the interpreter. This makes it possible for us to write substantial text to remind us what each piece of our code does.\n", "\n", "The first piece of code that the Python interpreter actually reads is" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "name": "00-python-primer-2", "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "split_point = 3" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This takes the number 3 and stores it in the **variable** `split_point`. Variables are just names where some Python object is stored. 
A variable name really works as an address to some particular part of your computer's memory, telling the Python interpreter to look for the value stored there. Variable names make your code human-readable, since they let you write expressive names that remind you what you are storing. The rules for variable names are:\n", "\n", "1. Variable names must start with a letter or underscore\n", "2. The rest of the name can have letters, numbers or underscores\n", "3. Names are case-sensitive\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The next piece of code initializes two **lists**, named `lower` and `upper`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "name": "00-python-primer-3", "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "lower = []; upper = []" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The semicolon tells Python that one instruction ends there, so that a second instruction can follow on the same line.\n", "\n", "Lists are a catch-all data structure that can store different kinds of things. In this case we'll use them to store numbers.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2, "name": "00-python-primer-4", "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "for i in range(10):  # count from 0 to 9\n", "    if i <= split_point:\n", "        lower.append(i)\n", "    else:\n", "        upper.append(i)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This is a _for-loop_.\n", "\n", "1. Start with the numbers 0-9 (this is achieved by `range(10)`)\n", "2. Loop through each number, naming it `i` each time\n", "    1. Computer programs allow you to overwrite a variable with a new value\n", "3. 
If the number currently stored in `i` is less than or equal to the value of `split_point`, i.e., 3, then add it to the list `lower`. Otherwise add it to the list `upper`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2, "name": "00-python-primer-4", "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "for i in range(10):  # count from 0 to 9\n", "    if i <= split_point:\n", "        lower.append(i)\n", "    else:\n", "        upper.append(i)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note the indentation in the code. **This is not by accident**. Python uses the indentation to understand the extent of a particular block of code within a for-loop (or within an `if` statement). \n", "\n", "In this segment there are 3 code blocks:\n", "\n", "1. The for-loop as a whole (1st indentation)\n", "2. The `if` statement testing if the number is less than or equal to the split point, telling Python what to do if the test is true\n", "3. The `else` statement stating what to do if the test in the `if` statement is false\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The last bit of code prints out the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "name": "00-python-primer-5", "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "print(\"lower:\", lower)\n", "print(\"upper:\", upper)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `print` function prints the text label first, and then a representation of the object stored in the variable being printed. In this example, this is a list, and is printed as\n", "\n", "```\n", "lower: [0, 1, 2, 3]\n", "upper: [4, 5, 6, 7, 8, 9]\n", "```\n", "\n", "We will expand on these concepts in the next few sections."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Some general rules on Python syntax\n", "\n", "1. Comments are marked by `#`\n", "2. A statement is terminated by the end of a line, or by a `;`.\n", "3. Indentation specifies blocks of code within particular structures. Whitespace at the beginning of lines matters. Typically you want to have 2 or 4 spaces to specify indentation, not a tab (\\t) character. This can be set up in your IDE.\n", "4. Whitespace within lines does not matter, so you can use spaces liberally to make your code more readable\n", "5. Parentheses (`()`) are for grouping pieces of code or for calling functions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "There are several conventions about code styling including the one in [PEP8](https://www.python.org/dev/peps/pep-0008/#function-and-variable-names) (PEP = Python Enhancement Proposal) and one proposed by [Google](https://google.github.io/styleguide/pyguide.html#316-naming). We will typically be using lower case names, with words separated by underscores, in this workshop, basically following PEP8. Other conventions are of course allowed as long as they are within the basic rules stated above." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Data types" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Numbers\n", "\n", "1. Floats (decimal numbers) : `float`\n", "1. 
Integers : `int`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "| Operation | Result |\n", "| --------- | ---------------------------------- |\n", "| x + y | The sum of x and y |\n", "| x - y | The difference of x and y |\n", "| x * y | The product of x and y |\n", "| x / y | The quotient of x and y |\n", "| - x | The negative of x |\n", "| abs(x) | The absolute value of x |\n", "| x ** y | x raised to the power y |\n", "| int(x) | Convert a number to integer |\n", "| float(x) | Convert a number to floating point |" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "-119" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = 3; y = 5\n", "\n", "(2*x) - (5 * y**2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Strings" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_name = 'Abhijit'\n", "last_name = \"Dasgupta\"\n", "\n", "'jit' in last_name" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## String operations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_name + last_name\n", "\n", "first_name*3\n", "\n", "\"gup\" in last_name" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Truthiness\n", "\n", "Truthiness means evaluating the truth of a statement. This typically results in a Boolean object, which can take values `True` and `False`, but Python has several equivalent representations. 
The following values are considered the same as `False`:\n", "\n", "> `None`, `False`, zero (`0`, `0.0`), any empty sequence (`[]`, `''`, `()`), and a few others\n", "\n", "All other values are considered `True`. Usually we'll denote truth by `True` and the number `1`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "| Operation | Result |\n", "| --------- | --------------------------------- |\n", "| x < y | x is strictly less than y |\n", "| x <= y | x is less than or equal to y |\n", "| x == y | x equals y (note, it's 2 = signs) |\n", "| x != y | x is not equal to y |\n", "| x > y | x is strictly greater than y |\n", "| x >= y | x is greater than or equal to y |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can combine these comparisons using Boolean operations\n", "\n", "| Operation | Result |\n", "| --------- | ---------------------------------------- |\n", "| x \\| y | Either x is true or y is true or both |\n", "| x & y | Both x and y are true |\n", "| not x | if x is true, then false, and vice versa |" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = 5\n", "\n", "(x < 3) | (x <= 7)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Variables are like individual ingredients in your recipe. It's *mise en place* or setting the table for any operations (*functions*) we want to do to them. \n", "\n", "+ Variables are like *nouns*, \n", "+ which will be acted on by verbs (*functions*). \n", "\n", "In the next section we'll look at collections of variables. These collections are important because they allow us to organize our variables with some structure. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Data structures" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "1. Lists (`[]`)\n", "2. Tuples (`()`)\n", "3. Dictionaries or dicts (`{}`)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Lists are baskets that can contain different kinds of things. They are ordered, so that there is a first element, a second element, and so on up to a last element. However, the things in a single list don't have to be of the same type." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Tuples are basically like lists, except that they are *immutable*, i.e., once they are created, individual values can't be changed. They are also ordered, so there is a first element, a second element and so on." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Dictionaries are collections of key-value pairs, which are very fast for looking up things. They are implemented as hash tables. Since Python 3.7 a dictionary remembers the order in which its keys were inserted, but elements still need to be referred to by *key*, not by position. Dictionaries will be very useful to us as we progress towards the PyData stack."
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "'apple'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_list = [\"apple\", 3, True, \"Harvey\", 48205]\n", "test_tuple = (\"apple\", 3, True, \"Harvey\", 48205)\n", "\n", "test_list[0]\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "| | | | | | |\n", "| ------------------ | ------- | ---- | ---- | -------- | ----- |\n", "| index | 0 | 1 | 2 | 3 | 4 |\n", "| element | 'apple' | 3 | True | 'Harvey' | 48205 |\n", "| counting backwards | -5 | -4 | -3 | -2 | -1 |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "contact = {\n", " \"first_name\": \"Abhijit\",\n", " \"last_name\": \"Dasgupta\",\n", " \"Age\": 48,\n", " \"address\": \"124 Main St\",\n", " \"Employed\": True,\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "contact['first_name']\n", "contact['address']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "contact.keys()\n", "contact.values()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Operations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Loops" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "```pseudocode\n", "Start with a list of datasets, one for each state\n", "for each state\n", " compute and store fraction of votes that are Republican\n", " compute and store fraction of votes that are Democratic\n", "```\n" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "for i in range(len(test_list)):\n", " print(test_list[i])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "for u in test_list:\n", " print(u)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_list2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", "mysum = 0\n", "for u in test_list2:\n", " mysum = mysum + u\n", "print(mysum)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`enumerate` automatically creates both the index and the value for each element of a list." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0\n", "1 2\n", "2 4\n", "3 6\n", "4 8\n" ] } ], "source": [ "L = [0, 2, 4, 6, 8]\n", "for i, val in enumerate(L):\n", " print(i, val)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`zip` puts multiple lists together and creates a composite iterator. You can have any number of iterators in zip, and the length of the result is determined by the length of the shortest iterator. 
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Han Solo  :  light\n", "Luke Skywalker  :  light\n", "Leia Skywalker  :  light\n", "Anakin Skywalker  :  light/dark/light\n" ] } ], "source": [ "first = [\"Han\", \"Luke\", \"Leia\", \"Anakin\"]\n", "last = [\"Solo\", \"Skywalker\", \"Skywalker\", \"Skywalker\"]\n", "types = ['light','light','light','light/dark/light']\n", "\n", "for val1, val2, val3 in zip(first, last, types):\n", "    print(val1, val2, ' : ', val3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## List comprehensions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_list2 = [1,2,3,4,5,6]\n", "\n", "squares = [u**2 for u in test_list2]\n", "squares" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Conditional evaluation\n", "\n", "```pseudocode\n", "if Condition 1 is true then\n", "    do Recipe 1\n", "else if (elif) Condition 2 is true then\n", "    do Recipe 2\n", "else\n", "    do Recipe 3\n", "```\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Negative', 'Negative', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even']\n" ] } ], "source": [ "\n", "x = [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", "y = []  # an empty list\n", "\n", "for u in x:\n", "    if u < 0:\n", "        y.append(\"Negative\")\n", "    elif u % 2 == 1:  # remainder is 1 when dividing by 2\n", "        y.append(\"Odd\")\n", "    else:\n", "        y.append(\"Even\")\n", "\n", "print(y)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" 
} }, "outputs": [], "source": [ "def my_mean(x):\n", "    y = 0\n", "    for u in x:\n", "        y += u\n", "    y = y / len(x)\n", "    return y" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A Python function must start with the keyword `def` followed by the name of the function, the arguments within parentheses, and then a colon. The actual code for the function is indented, just like in for-loops and if-elif-else structures. It typically ends with a `return` statement, which specifies the output of the function.\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def my_mean(x):\n", "    \"\"\"\n", "    A function to compute the mean of a list of numbers.\n", "    \n", "    INPUTS:\n", "    x : a list containing numbers\n", "    \n", "    OUTPUT:\n", "    The arithmetic mean of the list of numbers\n", "    \"\"\"\n", "    y = 0\n", "    for u in x:\n", "        y = y + u\n", "    y = y / len(x)\n", "    return y" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function my_mean in module __main__:\n", "\n", "my_mean(x)\n", "    A function to compute the mean of a list of numbers.\n", "    \n", "    INPUTS:\n", "    x : a list containing numbers\n", "    \n", "    OUTPUT:\n", "    The arithmetic mean of the list of numbers\n", "\n" ] } ], "source": [ "help(my_mean)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pandas (Python Data Analysis)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "+ Data ingestion\n", "+ Data cleaning and transformation\n", "+ Data can be passed on to modeling and visualization packages" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Activating packages for use\n", "\n", "+ Use the `import` statement\n", "+ Optionally provide an 
alias for the package" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Data import\n", "\n", "| Format type | Description | reader | writer |\n", "| ----------- | ----------- | ------------ | ---------- |\n", "| text | CSV | read_csv | to_csv |\n", "| | Excel | read_excel | to_excel |\n", "| text | JSON | read_json | to_json |\n", "| binary | Feather | read_feather | to_feather |\n", "| binary | SAS | read_sas | |\n", "| SQL | SQL | read_sql | to_sql |\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "mtcars = pd.read_csv('data/mtcars.csv')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "> One of the big differences between a spreadsheet program and a programming language from the data science perspective is that you have to load data into the programming language. It's not \"just there\" like Excel. This is a good thing, since it allows the common functionality of the programming language to work across multiple data sets, and also keeps the original data set pristine. Excel users can run into problems and [corrupt their data](https://nature.berkeley.edu/garbelottoat/?p=1488) if they are not careful." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exploring data" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Creating a DataFrame" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B C D E\n", "a 0.228273 1.026890 -0.839585 -0.591182 -0.956888\n", "b -0.222326 -0.619915 1.837905 -2.053231 0.868583\n", "c -0.920734 -0.232312 2.152957 -1.334661 0.076380\n", "d -1.246089 1.202272 -1.049942 1.056610 -0.419678" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.RandomState(25)\n", "d2 = pd.DataFrame(rng.normal(0,1, (4, 5)), \n", " columns = ['A','B','C','D','E'], \n", " index = ['a','b','c','d'])\n", "d2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "A DataFrame has (mutable)\n", "\n", "+ An `index` (row names)\n", "+ A `column` (column names)‘" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b', 'c', 'd'], dtype='object')" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d2.columns\n", "d2.index" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B C D E F\n", "0 3.0 0.481343 2020-05-12 6 yes NIH\n", "1 3.0 0.516502 2020-05-12 6 no NIH\n", "2 3.0 0.383048 2020-05-12 6 no NIH\n", "3 3.0 0.997541 2020-05-12 6 yes NIH\n", "4 3.0 0.514244 2020-05-12 6 no NIH" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({\n", " 'A':3.,\n", " 'B':rng.random_sample(5),\n", " 'C': pd.Timestamp('20200512'),\n", " 'D': np.array([6] * 5),\n", " 'E': pd.Categorical(['yes','no','no','yes','no']),\n", " 'F': 'NIH'})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use a `dict` to create a `DataFrame`. If elements aren't of the same size, errors will be thrown, unless it is a single element. Then it will be repeated. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Slicing and dicing a DataFrame" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.481343\n", "1 0.516502\n", "2 0.383048\n", "3 0.997541\n", "4 0.514244\n", "Name: B, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['B']\n", "df.B\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "There are two extractor functions in `pandas`:\n", "\n", "+ `loc` extracts by label (index label, column label, slice of labels, etc.\n", "+ `iloc` extracts by index (integers, slice objects, etc.\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 2020-05-12\n", "2 2020-05-12\n", "3 2020-05-12\n", "Name: C, dtype: datetime64[ns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ " df.loc[1:3, 'C']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "You can also extract rows by 
condition (filter)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " one two three four five\n", "a -0.847315 0.663926 0.182211 20 False\n", "c 1.682927 -0.056388 0.244149 20 True\n", "e 0.185502 0.554072 -0.739743 20 True\n", "f -0.151335 -0.172999 -0.656354 20 False\n", "g 0.672965 -0.680025 -0.065153 20 True" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randn(5, 3), index = ['a','c','e', 'f','g'], columns = ['one','two','three']) # pre-specify index and column names\n", "df['four'] = 20 # add a column named \"four\", which will all be 20\n", "df['five'] = df['one'] > 0\n", "df" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " one two three four five\n", "c 1.682927 -0.056388 0.244149 20 True" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df.one > 1) & (df.three < 0)]\n", "\n", "df.query('(one > 1) & (three > 0)')\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Replacing values" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " one two three four five\n", "a -0.847315 0.663926 0.182211 20 False\n", "c 1.682927 -0.056388 0.244149 20 True\n", "e 0.185502 0.554072 -0.739743 20 True\n", "f -0.151335 -0.172999 -0.656354 20 False\n", "g 0.672965 -0.680025 -0.065153 20 True" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#df2.replace(0, -9) # replace 0 with -9\n", "\n", "df.replace({'one': {5: 500}, 'three':{0:-9, 8:800}})" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Joins\n", "\n", "There are basically four kinds of joins:\n", "\n", "| pandas | R | SQL | Description |\n", "| ------ | ---------- | ----------- | ------------------------------- |\n", "| left | left_join | left outer | keep all rows on left |\n", "| right | right_join | right outer | keep all rows on right |\n", "| outer | outer_join | full outer | keep all rows from both |\n", "| inner | inner_join | inner | keep only rows with common keys |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Joins\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "survey = pd.read_csv('data/survey_survey.csv')\n", "visited = pd.read_csv('data/survey_visited.csv')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " taken person quant reading ident site dated\n", "0 619 dyer rad 9.82 619 DR-1 1927-02-08\n", "1 619 dyer sal 0.13 619 DR-1 1927-02-08\n", "2 622 dyer rad 7.80 622 DR-1 1927-02-10\n", "3 622 dyer sal 0.09 622 DR-1 1927-02-10\n", "4 734 pb rad 8.41 734 DR-3 1939-01-07\n", "5 734 lake sal 0.05 734 DR-3 1939-01-07\n", "6 734 pb temp -21.50 734 DR-3 1939-01-07\n", "7 735 pb rad 7.22 735 DR-3 1930-01-12\n", "8 735 NaN sal 0.06 735 DR-3 1930-01-12\n", "9 735 NaN temp -26.00 735 DR-3 1930-01-12\n", "10 751 pb rad 4.35 751 DR-3 1930-02-26\n", "11 751 pb temp -18.50 751 DR-3 1930-02-26\n", "12 751 lake sal 0.10 751 DR-3 1930-02-26\n", "13 752 lake rad 2.19 752 DR-3 NaN\n", "14 752 lake sal 0.09 752 DR-3 NaN\n", "15 752 lake temp -16.00 752 DR-3 NaN\n", "16 752 roe sal 41.60 752 DR-3 NaN\n", "17 837 lake rad 1.46 837 MSK-4 1932-01-14\n", "18 837 lake sal 0.21 837 MSK-4 1932-01-14\n", "19 837 roe sal 22.50 837 MSK-4 1932-01-14\n", "20 844 roe rad 11.25 844 DR-1 1932-03-22" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(survey, visited, left_on = 'taken', right_on = 'ident', how = 'left')\n", "# survey.merge(visited, left_on = 'taken', right_on = 'ident', how = 'left')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Here, the left dataset is `survey` and the right one is `visited`. \n", "\n", "Since we're doing a left join, we keed all the rows from `survey` and add columns from `visited`, matching on the common key, called \"taken\" in one dataset and \"ident\" in the other. \n", "\n", "Note that the rows of `visited` are repeated as needed to line up with all the rows with common \"taken\" values. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data aggregation and split-apply-combine\n", "\n", "![](graphs/split-apply-combine.png)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " country continent year lifeExp pop gdpPercap\n", "0 Afghanistan Asia 1952 28.801 8425333 779.445314\n", "1 Afghanistan Asia 1957 30.332 9240934 820.853030\n", "2 Afghanistan Asia 1962 31.997 10267083 853.100710\n", "3 Afghanistan Asia 1967 34.020 11537966 836.197138\n", "4 Afghanistan Asia 1972 36.088 13079460 739.981106" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder = pd.read_csv('data/gapminder.tsv', sep = '\\t') \n", "gapminder.head()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "country\n", "Afghanistan 37.478833\n", "Albania 68.432917\n", "Algeria 59.030167\n", "Angola 37.883500\n", "Argentina 69.060417\n", " ... \n", "Vietnam 57.479500\n", "West Bank and Gaza 60.328667\n", "Yemen, Rep. 46.780417\n", "Zambia 45.996333\n", "Zimbabwe 52.663167\n", "Name: lifeExp, Length: 142, dtype: float64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder.groupby('country')['lifeExp'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "gapminder.groupby('country').get_group('United Kingdom')" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "continent\n", "Africa 47.7920\n", "Americas 67.0480\n", "Asia 61.7915\n", "Europe 72.2410\n", "Oceania 73.6650\n", "Name: lifeExp, dtype: float64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder.groupby('continent').lifeExp.agg(np.median) # Medians" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " year lifeExp pop gdpPercap\n", "0 1952 49.057620 3943953.0 1968.528344\n", "1 1957 51.507401 4282942.0 2173.220291\n", "2 1962 53.609249 4686039.5 2335.439533\n", "3 1967 55.678290 5170175.5 2678.334741\n", "4 1972 57.647386 5877996.5 3339.129407\n", "5 1977 59.570157 6404036.5 3798.609244\n", "6 1982 61.533197 7007320.0 4216.228428\n", "7 1987 63.212613 7774861.5 4280.300366\n", "8 1992 64.160338 8688686.5 4386.085502\n", "9 1997 65.014676 9735063.5 4781.825478\n", "10 2002 65.694923 10372918.5 5319.804524\n", "11 2007 67.007423 10517531.0 6124.371109" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gapminder.groupby('year').agg({'lifeExp': np.mean, 'pop': np.median, 'gdpPercap': np.median}).reset_index()" ] } ], "metadata": { "celltoolbar": "Slideshow", "jupytext": { "formats": "ipynb,Rmd" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }