{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Introduction to Data Science using Python\n",
    "\n",
    "\n",
    "## Abhijit Dasgupta, PhD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Plans for this workshop\n",
    "\n",
    "We're meeting today and tomorrow 1:30 - 3:00 pm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "| Day 1                        | Day 2                |\n",
    "| ---------------------------- | -------------------- |\n",
    "| Why Python for Data Science? | Data visualization   |\n",
    "| A Python Primer              | Statistical modeling |\n",
    "| Pandas for data munging      | Machine learning     |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Scope\n",
    "\n",
    "Obviously we are going to cover each topic at a high level, given time constraints. I intend to give you a taste for what is possible"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "There are much more detailed resources available as Jupyter notebooks.\n",
    "\n",
    "[https://github.com/districtdatalabs/Brookings_Python_DS](https://github.com/districtdatalabs/Brookings_Python_DS)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "| Topic                                          | Notebook           |\n",
    "| ---------------------------------------------- | ------------------ |\n",
    "| Python primer                                  | 00_python_primer   |\n",
    "| Numpy and the data science stack (not covered) | 01_python_tools_ds |\n",
    "| Pandas for data munging                        | 02_python_pandas   |\n",
    "| Data visualization                             | 03_python_vis      |\n",
    "| Statistical modeling                           | 04_python_stat     |\n",
    "| Machine learning                               | 05_python_learning |\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Running notebooks on Binder\n",
    "\n",
    "Binder is a free service that allows Python resources to be run on the web from Github repositories. \n",
    "\n",
    "[Binder demo](https://github.com/districtdatalabs/Brookings_Python_DS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Why Python for Data Science?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Who is a data scientist?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## One definition\n",
    "\n",
    "\n",
    "<img align=\"center\" src=\"graphs/Data_Science_VD.png\"/>\n",
    "\n",
    "<aside class=\"notes\">\n",
    "1. It's the confluence of statistics, computer science, domain knowledge\n",
    "2. You need all three to make this work\n",
    "3. We'll be talking more about stat/CS than domain expertise here\n",
    "\n",
    "</aside>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Unclear definition\n",
    "\n",
    "+ Statistician\n",
    "+ Computer scientist\n",
    "+ Database engineer\n",
    "+ Software engineer\n",
    "+ Data engineer\n",
    "+ Mathematician\n",
    "\n",
    "Some of the best ones I know are <br/>\n",
    "neurobiologists and physicists"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## A broad umbrella\n",
    "\n",
    "Anyone who wants to work with data to solve problems within particular domains"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Data Science\n",
    "\n",
    "## What it involves\n",
    "\n",
    "![DSPipeline](graphs/DSPipeline.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## What it involves\n",
    "\n",
    "![data-science-explore](graphs/data-science-explore.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## What it involves\n",
    "\n",
    "1. Managing and cleaning data\n",
    "1. Interest in exploring relationships between things, informed by domain knowledge\n",
    "\n",
    "1. Statistical know-how\n",
    "1. Computational skills\n",
    "\n",
    "1. Tools"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## We're here for the tools\n",
    "\n",
    "The main two tools are\n",
    "\n",
    "1. Python (https://www.python.org)\n",
    "1. R (https://www.r-project.org)\n",
    "\n",
    "There is a perpetual flame war between the two camps\n",
    "\n",
    "That is not important"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Why Python?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Pros\n",
    "\n",
    "1. Very popular general purpose programming language\n",
    "1. Strong ecosystem through packages (over 230K projects)\n",
    "1. Succint syntax \n",
    "1. Reasonably fast while also relatively easy to program\n",
    "  \n",
    "    + Computational time vs Developer time\n",
    "1. Self-documenting\n",
    "1. Easier to integrate into production pipelines that already use Python\n",
    "    + Web frameworks (Django, Flask, ...)\n",
    "    + Workflow managers (Luigi, ...)\n",
    "1. Increasingly strong Data Science Stack"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Cons\n",
    "\n",
    "1. Not a rich-enough ecosystem for some purposes\n",
    "1. More computer science-y, less statistical\n",
    "1. Poorer frameworks for display and dissemination of information\n",
    "\n",
    "These are areas where R tends to shine. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Python Data Science stack"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Contributed packages over past 30 years\n",
    "\n",
    "+ To emulate Matlab\n",
    "    + Numpy\n",
    "    + Scipy\n",
    "    + Matplotlib\n",
    "+ To emulate Maple\n",
    "    + Sympy\n",
    "+ To add statistics/data science\n",
    "    + Pandas\n",
    "    + Various data visualization packages\n",
    "        - seaborn\n",
    "        - plotly"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "+ Many more user-contributed packages\n",
    "+ The basic philosophy has been to concentrate on a few monolithic comprehensive packages\n",
    "    - statsmodels (Statistics)\n",
    "    - scikit-learn (Machine Learning)\n",
    "    - pillow (Image analysis)\n",
    "    - nltk (Natural Language Processing)\n",
    "    - tensorflow & PyTorch (Deep learning)\n",
    "    - PyMC3 (Bayesian learning)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Python as glue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "![](graphs/r_py_glue.png)\n",
    "\n",
    "* The `rpy2` Python package is not developed on Windows\n",
    "* The `reticulate` R package actually works quite well"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "1. Data I/O\n",
    "    + We can read data from a variety of formats into Python\n",
    "        - Some proprietary\n",
    "        - R, SAS, Stata, SQL, Parquet, JSON\n",
    "1. There are ways of running R, SAS, others from within Python\n",
    "1. The Jupyter sub-ecosystem allows the same interface for [many languages](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)\n",
    "    + R, SAS, Julia, Haskell, Javascript\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# A Python Primer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Python is a popular, general purpose scripting language. The [TIOBE index](https://www.tiobe.com/tiobe-index/) ranks Python as the third most popular programming language after C and Java, while this recent article in IEEE Computer Society says"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "> \"Python can be used for web and desktop applications, GUI-based desktop applications, machine learning, data science, and network servers. The programming language enjoys immense community support and offers several open-source libraries, frameworks, and modules that make application development a cakewalk.\" ([Belani, 2020](https://www.computer.org/publications/tech-news/trends/programming-languages-you-should-learn-in-2020))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Python is a modular language\n",
    "\n",
    "Python is not a monolithic language but is comprised of \n",
    "\n",
    "1. a base programming language\n",
    "1. numerous modules or libraries that add functionality to the language. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Python is a scripting language"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Using Python requires typing!! \n",
    "\n",
    "1. You write *code* in Python \n",
    "1. that is then interpreted by the Python interpreter \n",
    "1. to make the computer implement your instructions. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Your code is like a recipe that you write for the computer**. \n",
    "\n",
    "Python is a *high-level language* in that the code is English-like and human-readable and understandable, which reduces the time needed for a person to create the recipe. \n",
    "\n",
    "It is a language in that it has nouns (*variables* or *objects*), verbs (*functions*) and a structure or grammar that allows the programmer to write recipes for different functionalities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Scripting can be frustrating in the beginning. You will find that the code you wrote doesn't work \"for some reason\", though it looks like you wrote it fine. The first things I look for, in order, are"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "One thing that is important to note in Python: **case is important!**. If we have two objects named `data` and `Data`, they will refer to different things. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "1. Did I spell all the variables and functions correctly\n",
    "1. Did I close all the brackets I have opened\n",
    "1. Did I finish all the quotes I started, and paired single- and double-quotes\n",
    "1. Did I already import the right module for the function I'm trying to use. \n",
    "1. Do I have the right indentations in my code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "These may not make sense right now, but as we go into Python, I hope you will remember these to help debug your code. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## An example\n",
    "\n",
    "Let's consider the following piece of Python code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "lower: [0, 1, 2, 3]\n",
      "upper: [4, 5, 6, 7, 8, 9]\n"
     ]
    }
   ],
   "source": [
    "# set a splitting point\n",
    "split_point = 3\n",
    "\n",
    "# make two empty lists\n",
    "lower = []; upper = []\n",
    "\n",
    "# Split numbers from 0 to 9 into two groups, \n",
    "# one lower or equal to the split point and \n",
    "# one higher than the split point\n",
    "\n",
    "for i in range(10):  # count from 0 to 9\n",
    "    if i <= split_point:\n",
    "        lower.append(i)\n",
    "    else:\n",
    "        upper.append(i)\n",
    "\n",
    "print(\"lower:\", lower)\n",
    "print(\"upper:\", upper)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "First note that any line (or part of a line) starting with `#` is a **comment** in Python and is ignored by the interpreter. This makes it possible for us to write substantial text to remind us what each piece of our code does\n",
    "\n",
    "The first piece of code that the Python interpreter actually reads is"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "name": "00-python-primer-2",
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "split_point = 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "This takes the number 3 and stores it in the **variable** `split_point`. Variables are just names where some Python object is stored. It really works as an address to some particular part of your computer's memory, telling the Python interpreter to look for the value stored at that particular part of memory. Variable names allow your code to be human-readable since it allows you to write expressive names to remind yourself what you are storing. The rules of variable names are:\n",
    "\n",
    "1. Variable names must start with a letter or underscore\n",
    "2. The rest of the name can have letters, numbers or underscores\n",
    "3. Names are case-sensitive\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "The next piece of code initializes two **lists**, named `lower` and `upper`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "name": "00-python-primer-3",
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "lower = []; upper = []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The semi-colon tells Python that, even though written on the same line, a particular instruction ends at the semi-colon, then another piece of instruction is written.\n",
    "\n",
    "Lists are a catch-all data structure that can store different kinds of things, In this case we'll use them to store numbers.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2,
    "name": "00-python-primer-4",
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "for i in range(10):  # count from 0 to 9\n",
    "    if i <= split_point \n",
    "        lower.append(i)\n",
    "    else:\n",
    "        upper.append(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "This is a _for-loop_.\n",
    "\n",
    "1. State with the numbers 0-9 (this is achieved in `range(10)`)\n",
    "2. Loop through each number, naming it `i` each time\n",
    "    1. Computer programs allow you to over-write a variable with a new value\n",
    "3. If the number currently stored in `i` is less than or equal to the value of `split_point`, i.e., 3 then add it to the list `lower`. Otherwise add it to the list `upper`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2,
    "name": "00-python-primer-4",
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "for i in range(10):  # count from 0 to 9\n",
    "    if i <= split_point:\n",
    "        lower.append(i)\n",
    "    else:\n",
    "        upper.append(i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Note the indentation in the code. **This is not by accident**. Python understands the extent of a particular block of code within a for-loop (or within a `if` statement) using the indentations. \n",
    "\n",
    "In this segment there are 3 code blocks:\n",
    "\n",
    "1. The for-loop as a whole (1st indentation)\n",
    "2. The `if` statement testing if the number is less than or equal to the split point, telling Python what to do if the test is true\n",
    "3. The `else` statement stating what to do if the test in the `if` statement is false\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "The last bit of code prints out the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "name": "00-python-primer-5",
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "print(\"lower:\", lower)\n",
    "print(\"upper:\", upper)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The `print` statement adds some text, and then prints out a representation of the object stored in the variable being printed. In this example, this is a list, and is printed as\n",
    "\n",
    "```\n",
    "lower: [0, 1, 2, 3]\n",
    "upper: [4, 5, 6, 7, 8, 9]\n",
    "```\n",
    "\n",
    "We will expand on these concepts in the next few sections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Some general rules on Python syntax\n",
    "\n",
    "1. Comments are marked by `#`\n",
    "2. A statement is terminated by the end of a line, or by a `;`.\n",
    "3. Indentation specifies blocks of code within particular structures. Whitespace at the beginning of lines matters. Typically you want to have 2 or 4 spaces to specify indentation, not a tab (\\t) character. This can be set up in your IDE.\n",
    "4. Whitespace within lines does not matter, so you can use spaces liberally to make your code more readable\n",
    "5. Parentheses (`()`) are for grouping pieces of code or for calling functions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "There are several conventions about code styling including the one in [PEP8](https://www.python.org/dev/peps/pep-0008/#function-and-variable-names) (PEP = Python Enhancement Proposal) and one proposed by [Google](https://google.github.io/styleguide/pyguide.html#316-naming). We will typically be using lower case names, with words separated by underscores, in this workshop, basically following PEP8. Other conventions are of course allowed as long as they are within the basic rules stated above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Data types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Numbers\n",
    "\n",
    "1. Floats (decimal numbers) : `float`\n",
    "1. Integers : `int`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "| Operation | Result                             |\n",
    "| --------- | ---------------------------------- |\n",
    "| x + y     | The sum of x and y                 |\n",
    "| x - y     | The difference of x and y          |\n",
    "| x * y     | The product of x and y             |\n",
    "| x / y     | The quotient of x and y            |\n",
    "| - x       | The negative of x                  |\n",
    "| abs(x)    | The absolute value of x            |\n",
    "| x ** y    | x raised to the power y            |\n",
    "| int(x)    | Convert a number to integer        |\n",
    "| float(x)  | Convert a number to floating point |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-119"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = 3; y = 5\n",
    "\n",
    "(2*x) -  (5 * y**2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_name = 'Abhijit'\n",
    "last_name = \"Dasgupta\"\n",
    "\n",
    "'jit' in last_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## String operations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_name + last_name\n",
    "\n",
    "first_name*3\n",
    "\n",
    "\"gup\" in last_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Truthiness\n",
    "\n",
    "Truthiness means evaluating the truth of a statement. This typically results in a Boolean object, which can take values `True` and `False`, but Python has several equivalent representations. The following values are considered the same as False:\n",
    "\n",
    "> `None`, `False`, zero (`0`, `0L`, `0.0`), any empty sequence (`[]`, `''`, `()`), and a few others\n",
    "\n",
    "All other values are considered True. Usually we'll denote truth by `True` and the number `1`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "| Operation | Result                            |\n",
    "| --------- | --------------------------------- |\n",
    "| x < y     | x is strictly less than y         |\n",
    "| x <= y    | x is less than or equal to y      |\n",
    "| x == y    | x equals y (note, it's 2 = signs) |\n",
    "| x != y    | x is not equal to y               |\n",
    "| x > y     | x is strictly greater than y      |\n",
    "| x >= y    | x is greater or equal to y        |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "We can chain these comparisons using Boolean operations\n",
    "\n",
    "| Operation | Result                                   |\n",
    "| --------- | ---------------------------------------- |\n",
    "| x \\| y    | Either x is true or y is true or both    |\n",
    "| x & y     | Both x and y are true                    |\n",
    "| not x     | if x is true, then false, and vice versa |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = 5\n",
    "\n",
    "(x < 3) | (x <= 7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Variables are like individual ingredients in your recipe. It's *mis en place* or setting the table for any operations (*functions*) we want to do to them. \n",
    "\n",
    "+ Variables are like *nouns*, \n",
    "+ which will be acted on by verbs (*functions*). \n",
    "\n",
    "In the next section we'll look at collections of variables. These collections are important in that it allows us to organize our variables with some structure. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Data structures"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "1. Lists (`[]`)\n",
    "2. Tuples (`()`)\n",
    "3. Dictionaries or dicts (`{}`)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Lists are baskets that can contain different kinds of things. They  are ordered, so that there is a first element, and a second element, and a last element, in order. However, the *kinds* of things in a single list doesn't have to be the same type."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Tuples are basically like lists, except that they are *immutable*, i.e., once they are created, individual values can't be changed. They are also ordered, so there is a first element, a second element and so on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Dictionaries are **unordered** key-value pairs, which  are very fast for looking up things. They work almost like hash tables.  Dictionaries will be very useful to us as we progress towards the PyData stack. Elements need to be referred to by *key*, not by position."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'apple'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_list = [\"apple\", 3, True, \"Harvey\", 48205]\n",
    "test_tuple = (\"apple\", 3, True, \"Harvey\", 48205)\n",
    "\n",
    "test_list[0]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "|                    |         |      |      |          |       |\n",
    "| ------------------ | ------- | ---- | ---- | -------- | ----- |\n",
    "| index              | 0       | 1    | 2    | 3        | 4     |\n",
    "| element            | 'apple' | 3    | True | 'Harvey' | 48205 |\n",
    "| counting backwards | -5      | -4   | -3   | -2       | -1    |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "contact = {\n",
    "    \"first_name\": \"Abhijit\",\n",
    "    \"last_name\": \"Dasgupta\",\n",
    "    \"Age\": 48,\n",
    "    \"address\": \"124 Main St\",\n",
    "    \"Employed\": True,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "contact['first_name']\n",
    "contact['address']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "contact.keys()\n",
    "contact.values()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Loops"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "```pseudocode\n",
    "Start with a list of datasets, one for each state\n",
    "for each state\n",
    "    compute and store fraction of votes that are Republican\n",
    "    compute and store fraction of votes that are Democratic\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "for i in range(len(test_list)):\n",
    "    print(test_list[i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "for u in test_list:\n",
    "    print(u)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_list2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n",
    "mysum = 0\n",
    "for u in test_list2:\n",
    "    mysum = mysum + u\n",
    "print(mysum)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "`enumerate` automatically creates both the index and the value for each element of a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 0\n",
      "1 2\n",
      "2 4\n",
      "3 6\n",
      "4 8\n"
     ]
    }
   ],
   "source": [
    "L = [0, 2, 4, 6, 8]\n",
    "for i, val in enumerate(L):\n",
    "    print(i, val)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "`zip` puts multiple lists together and creates a composite iterator. You can have any number of iterators in zip, and the length of the result is determined by the length of the shortest iterator. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Han Solo  :  light\n",
      "Luke Skywalker  :  light\n",
      "Leia Skywaker  :  light\n",
      "Anakin Skywalker  :  light/dark/light\n"
     ]
    }
   ],
   "source": [
    "first = [\"Han\", \"Luke\", \"Leia\", \"Anakin\"]\n",
    "last = [\"Solo\", \"Skywalker\", \"Skywaker\", \"Skywalker\"]\n",
    "types = ['light','light','light','light/dark/light']\n",
    "\n",
    "for val1, val2, val3 in zip(first, last, types):\n",
    "    print(val1, val2, ' : ', val3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## List comprehensions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_list2 = [1,2,3,4,5,6]\n",
    "\n",
    "squares = [u**2 for u in test_list2]\n",
    "squares"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Conditional evaluation\n",
    "\n",
    "```pseudocode\n",
    "if Condition 1 is true then\n",
    "\tdo Recipe 1\n",
    "else if (elif) Condition 2 is true then\n",
    "  do Recipe 2\n",
    "else\n",
    "  do Recipe 3\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Negative', 'Negative', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even']\n"
     ]
    }
   ],
   "source": [
    "\n",
    "x = [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n",
    "y = []  # an empty list\n",
    "\n",
    "for u in x:\n",
    "    if u < 0:\n",
    "        y.append(\"Negative\")\n",
    "    elif u % 2 == 1:  # what is remainder when dividing by 2\n",
    "        y.append(\"Odd\")\n",
    "    else:\n",
    "        y.append(\"Even\")\n",
    "\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "def my_mean(x):\n",
    "    y = 0\n",
    "    for u in x:\n",
    "        y += u \n",
    "    y = y / len(x)\n",
    "    return y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A Python function must start with the keyword `def` followed by the name of the function, the arguments within parentheses, and then a colon. The actual code for the function is indented, just like in for-loops and if-elif-else structures. It ends with a `return` function which specifies the output of the function.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def my_mean(x):\n",
    "    \"\"\"\n",
    "  A function to compute the mean of a list of numbers.\n",
    "  \n",
    "  INPUTS:\n",
    "  x : a list containing numbers\n",
    "  \n",
    "  OUTPUT:\n",
    "  The arithmetic mean of the list of numbers\n",
    "  \"\"\"\n",
    "    y = 0\n",
    "    for u in x:\n",
    "        y = y + u\n",
    "    y = y / len(x)\n",
    "    return y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on function my_mean in module __main__:\n",
      "\n",
      "my_mean(x)\n",
      "    A function to compute the mean of a list of numbers.\n",
      "    \n",
      "    INPUTS:\n",
      "    x : a list containing numbers\n",
      "    \n",
      "    OUTPUT:\n",
      "    The arithmetic mean of the list of numbers\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(my_mean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Pandas (Python Data Analysis)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "+ Data ingestion\n",
    "+ Data cleaning and transformation\n",
    "+ Data can be passed on to modeling and visualization packages"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Activating packages for use\n",
    "\n",
    "+ Use the `import` command\n",
    "+ Maybe provide an alias for the package"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Data import\n",
    "\n",
    "| Format type | Description | reader       | writer     |\n",
    "| ----------- | ----------- | ------------ | ---------- |\n",
    "| text        | CSV         | read_csv     | to_csv     |\n",
    "|             | Excel       | read_excel   | to_excel   |\n",
    "| text        | JSON        | read_json    | to_json    |\n",
    "| binary      | Feather     | read_feather | to_feather |\n",
    "| binary      | SAS         | read_sas     |            |\n",
    "| SQL         | SQL         | read_sql     | to_sql     |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "mtcars = pd.read_csv('data/mtcars.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "> One of the big differences between a spreadsheet program and a programming language from the data science perspective is that you have to load data into the programming language. It's not \"just there\" like Excel. This is a good thing, since it allows the common functionality of the programming language to work across multiple data sets, and also keeps the original data set pristine. Excel users can run into problems and [corrupt their data](https://nature.berkeley.edu/garbelottoat/?p=1488) if they are not careful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Exploring data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Creating a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "      <th>E</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>0.228273</td>\n",
       "      <td>1.026890</td>\n",
       "      <td>-0.839585</td>\n",
       "      <td>-0.591182</td>\n",
       "      <td>-0.956888</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>-0.222326</td>\n",
       "      <td>-0.619915</td>\n",
       "      <td>1.837905</td>\n",
       "      <td>-2.053231</td>\n",
       "      <td>0.868583</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>-0.920734</td>\n",
       "      <td>-0.232312</td>\n",
       "      <td>2.152957</td>\n",
       "      <td>-1.334661</td>\n",
       "      <td>0.076380</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>-1.246089</td>\n",
       "      <td>1.202272</td>\n",
       "      <td>-1.049942</td>\n",
       "      <td>1.056610</td>\n",
       "      <td>-0.419678</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          A         B         C         D         E\n",
       "a  0.228273  1.026890 -0.839585 -0.591182 -0.956888\n",
       "b -0.222326 -0.619915  1.837905 -2.053231  0.868583\n",
       "c -0.920734 -0.232312  2.152957 -1.334661  0.076380\n",
       "d -1.246089  1.202272 -1.049942  1.056610 -0.419678"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rng = np.random.RandomState(25)\n",
    "d2 = pd.DataFrame(rng.normal(0,1, (4, 5)), \n",
    "                  columns = ['A','B','C','D','E'], \n",
    "                  index = ['a','b','c','d'])\n",
    "d2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "A DataFrame has (mutable)\n",
    "\n",
    "+ An `index` (row names)\n",
    "+ A `column` (column names)‘"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['a', 'b', 'c', 'd'], dtype='object')"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d2.columns\n",
    "d2.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "      <th>D</th>\n",
       "      <th>E</th>\n",
       "      <th>F</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0.481343</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>6</td>\n",
       "      <td>yes</td>\n",
       "      <td>NIH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0.516502</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>6</td>\n",
       "      <td>no</td>\n",
       "      <td>NIH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0.383048</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>6</td>\n",
       "      <td>no</td>\n",
       "      <td>NIH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0.997541</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>6</td>\n",
       "      <td>yes</td>\n",
       "      <td>NIH</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3.0</td>\n",
       "      <td>0.514244</td>\n",
       "      <td>2020-05-12</td>\n",
       "      <td>6</td>\n",
       "      <td>no</td>\n",
       "      <td>NIH</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     A         B          C  D    E    F\n",
       "0  3.0  0.481343 2020-05-12  6  yes  NIH\n",
       "1  3.0  0.516502 2020-05-12  6   no  NIH\n",
       "2  3.0  0.383048 2020-05-12  6   no  NIH\n",
       "3  3.0  0.997541 2020-05-12  6  yes  NIH\n",
       "4  3.0  0.514244 2020-05-12  6   no  NIH"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({\n",
    "    'A':3.,\n",
    "    'B':rng.random_sample(5),\n",
    "    'C': pd.Timestamp('20200512'),\n",
    "    'D': np.array([6] * 5),\n",
    "    'E': pd.Categorical(['yes','no','no','yes','no']),\n",
    "    'F': 'NIH'})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also use a `dict` to create a `DataFrame`. If elements aren't of the same size, errors will be thrown, unless it is a single element. Then it will be repeated. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Slicing and dicing a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.481343\n",
       "1    0.516502\n",
       "2    0.383048\n",
       "3    0.997541\n",
       "4    0.514244\n",
       "Name: B, dtype: float64"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['B']\n",
    "df.B\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "There are two extractor functions in `pandas`:\n",
    "\n",
    "+ `loc` extracts by label (index label, column label, slice of labels, etc.\n",
    "+ `iloc` extracts by index (integers, slice objects, etc.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1   2020-05-12\n",
       "2   2020-05-12\n",
       "3   2020-05-12\n",
       "Name: C, dtype: datetime64[ns]"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    " df.loc[1:3, 'C']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "You can also extract rows by condition (filter)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "      <th>three</th>\n",
       "      <th>four</th>\n",
       "      <th>five</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>-0.847315</td>\n",
       "      <td>0.663926</td>\n",
       "      <td>0.182211</td>\n",
       "      <td>20</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>1.682927</td>\n",
       "      <td>-0.056388</td>\n",
       "      <td>0.244149</td>\n",
       "      <td>20</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>e</th>\n",
       "      <td>0.185502</td>\n",
       "      <td>0.554072</td>\n",
       "      <td>-0.739743</td>\n",
       "      <td>20</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>f</th>\n",
       "      <td>-0.151335</td>\n",
       "      <td>-0.172999</td>\n",
       "      <td>-0.656354</td>\n",
       "      <td>20</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>g</th>\n",
       "      <td>0.672965</td>\n",
       "      <td>-0.680025</td>\n",
       "      <td>-0.065153</td>\n",
       "      <td>20</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        one       two     three  four   five\n",
       "a -0.847315  0.663926  0.182211    20  False\n",
       "c  1.682927 -0.056388  0.244149    20   True\n",
       "e  0.185502  0.554072 -0.739743    20   True\n",
       "f -0.151335 -0.172999 -0.656354    20  False\n",
       "g  0.672965 -0.680025 -0.065153    20   True"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(np.random.randn(5, 3), index = ['a','c','e', 'f','g'], columns = ['one','two','three']) # pre-specify index and column names\n",
    "df['four'] = 20 # add a column named \"four\", which will all be 20\n",
    "df['five'] = df['one'] > 0\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "      <th>three</th>\n",
       "      <th>four</th>\n",
       "      <th>five</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>1.682927</td>\n",
       "      <td>-0.056388</td>\n",
       "      <td>0.244149</td>\n",
       "      <td>20</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        one       two     three  four  five\n",
       "c  1.682927 -0.056388  0.244149    20  True"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[(df.one > 1) & (df.three < 0)]\n",
    "\n",
    "df.query('(one > 1) & (three > 0)')\n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Replacing values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "      <th>three</th>\n",
       "      <th>four</th>\n",
       "      <th>five</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>-0.847315</td>\n",
       "      <td>0.663926</td>\n",
       "      <td>0.182211</td>\n",
       "      <td>20</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>1.682927</td>\n",
       "      <td>-0.056388</td>\n",
       "      <td>0.244149</td>\n",
       "      <td>20</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>e</th>\n",
       "      <td>0.185502</td>\n",
       "      <td>0.554072</td>\n",
       "      <td>-0.739743</td>\n",
       "      <td>20</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>f</th>\n",
       "      <td>-0.151335</td>\n",
       "      <td>-0.172999</td>\n",
       "      <td>-0.656354</td>\n",
       "      <td>20</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>g</th>\n",
       "      <td>0.672965</td>\n",
       "      <td>-0.680025</td>\n",
       "      <td>-0.065153</td>\n",
       "      <td>20</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        one       two     three  four   five\n",
       "a -0.847315  0.663926  0.182211    20  False\n",
       "c  1.682927 -0.056388  0.244149    20   True\n",
       "e  0.185502  0.554072 -0.739743    20   True\n",
       "f -0.151335 -0.172999 -0.656354    20  False\n",
       "g  0.672965 -0.680025 -0.065153    20   True"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#df2.replace(0, -9) # replace 0 with -9\n",
    "\n",
    "df.replace({'one': {5: 500}, 'three':{0:-9, 8:800}})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Joins\n",
    "\n",
    "There are basically four kinds of joins:\n",
    "\n",
    "| pandas | R          | SQL         | Description                     |\n",
    "| ------ | ---------- | ----------- | ------------------------------- |\n",
    "| left   | left_join  | left outer  | keep all rows on left           |\n",
    "| right  | right_join | right outer | keep all rows on right          |\n",
    "| outer  | outer_join | full outer  | keep all rows from both         |\n",
    "| inner  | inner_join | inner       | keep only rows with common keys |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Joins\n",
    "\n",
    "<img src=\"graphs/joins.png\" align=\"center\"/>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "survey = pd.read_csv('data/survey_survey.csv')\n",
    "visited = pd.read_csv('data/survey_visited.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>taken</th>\n",
       "      <th>person</th>\n",
       "      <th>quant</th>\n",
       "      <th>reading</th>\n",
       "      <th>ident</th>\n",
       "      <th>site</th>\n",
       "      <th>dated</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>619</td>\n",
       "      <td>dyer</td>\n",
       "      <td>rad</td>\n",
       "      <td>9.82</td>\n",
       "      <td>619</td>\n",
       "      <td>DR-1</td>\n",
       "      <td>1927-02-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>619</td>\n",
       "      <td>dyer</td>\n",
       "      <td>sal</td>\n",
       "      <td>0.13</td>\n",
       "      <td>619</td>\n",
       "      <td>DR-1</td>\n",
       "      <td>1927-02-08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>622</td>\n",
       "      <td>dyer</td>\n",
       "      <td>rad</td>\n",
       "      <td>7.80</td>\n",
       "      <td>622</td>\n",
       "      <td>DR-1</td>\n",
       "      <td>1927-02-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>622</td>\n",
       "      <td>dyer</td>\n",
       "      <td>sal</td>\n",
       "      <td>0.09</td>\n",
       "      <td>622</td>\n",
       "      <td>DR-1</td>\n",
       "      <td>1927-02-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>734</td>\n",
       "      <td>pb</td>\n",
       "      <td>rad</td>\n",
       "      <td>8.41</td>\n",
       "      <td>734</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1939-01-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>734</td>\n",
       "      <td>lake</td>\n",
       "      <td>sal</td>\n",
       "      <td>0.05</td>\n",
       "      <td>734</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1939-01-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>734</td>\n",
       "      <td>pb</td>\n",
       "      <td>temp</td>\n",
       "      <td>-21.50</td>\n",
       "      <td>734</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1939-01-07</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>735</td>\n",
       "      <td>pb</td>\n",
       "      <td>rad</td>\n",
       "      <td>7.22</td>\n",
       "      <td>735</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1930-01-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>735</td>\n",
       "      <td>NaN</td>\n",
       "      <td>sal</td>\n",
       "      <td>0.06</td>\n",
       "      <td>735</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1930-01-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>735</td>\n",
       "      <td>NaN</td>\n",
       "      <td>temp</td>\n",
       "      <td>-26.00</td>\n",
       "      <td>735</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1930-01-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>751</td>\n",
       "      <td>pb</td>\n",
       "      <td>rad</td>\n",
       "      <td>4.35</td>\n",
       "      <td>751</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1930-02-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>751</td>\n",
       "      <td>pb</td>\n",
       "      <td>temp</td>\n",
       "      <td>-18.50</td>\n",
       "      <td>751</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1930-02-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>751</td>\n",
       "      <td>lake</td>\n",
       "      <td>sal</td>\n",
       "      <td>0.10</td>\n",
       "      <td>751</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>1930-02-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>752</td>\n",
       "      <td>lake</td>\n",
       "      <td>rad</td>\n",
       "      <td>2.19</td>\n",
       "      <td>752</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>752</td>\n",
       "      <td>lake</td>\n",
       "      <td>sal</td>\n",
       "      <td>0.09</td>\n",
       "      <td>752</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>752</td>\n",
       "      <td>lake</td>\n",
       "      <td>temp</td>\n",
       "      <td>-16.00</td>\n",
       "      <td>752</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>752</td>\n",
       "      <td>roe</td>\n",
       "      <td>sal</td>\n",
       "      <td>41.60</td>\n",
       "      <td>752</td>\n",
       "      <td>DR-3</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>837</td>\n",
       "      <td>lake</td>\n",
       "      <td>rad</td>\n",
       "      <td>1.46</td>\n",
       "      <td>837</td>\n",
       "      <td>MSK-4</td>\n",
       "      <td>1932-01-14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>837</td>\n",
       "      <td>lake</td>\n",
       "      <td>sal</td>\n",
       "      <td>0.21</td>\n",
       "      <td>837</td>\n",
       "      <td>MSK-4</td>\n",
       "      <td>1932-01-14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>837</td>\n",
       "      <td>roe</td>\n",
       "      <td>sal</td>\n",
       "      <td>22.50</td>\n",
       "      <td>837</td>\n",
       "      <td>MSK-4</td>\n",
       "      <td>1932-01-14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>844</td>\n",
       "      <td>roe</td>\n",
       "      <td>rad</td>\n",
       "      <td>11.25</td>\n",
       "      <td>844</td>\n",
       "      <td>DR-1</td>\n",
       "      <td>1932-03-22</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    taken person quant  reading  ident   site       dated\n",
       "0     619   dyer   rad     9.82    619   DR-1  1927-02-08\n",
       "1     619   dyer   sal     0.13    619   DR-1  1927-02-08\n",
       "2     622   dyer   rad     7.80    622   DR-1  1927-02-10\n",
       "3     622   dyer   sal     0.09    622   DR-1  1927-02-10\n",
       "4     734     pb   rad     8.41    734   DR-3  1939-01-07\n",
       "5     734   lake   sal     0.05    734   DR-3  1939-01-07\n",
       "6     734     pb  temp   -21.50    734   DR-3  1939-01-07\n",
       "7     735     pb   rad     7.22    735   DR-3  1930-01-12\n",
       "8     735    NaN   sal     0.06    735   DR-3  1930-01-12\n",
       "9     735    NaN  temp   -26.00    735   DR-3  1930-01-12\n",
       "10    751     pb   rad     4.35    751   DR-3  1930-02-26\n",
       "11    751     pb  temp   -18.50    751   DR-3  1930-02-26\n",
       "12    751   lake   sal     0.10    751   DR-3  1930-02-26\n",
       "13    752   lake   rad     2.19    752   DR-3         NaN\n",
       "14    752   lake   sal     0.09    752   DR-3         NaN\n",
       "15    752   lake  temp   -16.00    752   DR-3         NaN\n",
       "16    752    roe   sal    41.60    752   DR-3         NaN\n",
       "17    837   lake   rad     1.46    837  MSK-4  1932-01-14\n",
       "18    837   lake   sal     0.21    837  MSK-4  1932-01-14\n",
       "19    837    roe   sal    22.50    837  MSK-4  1932-01-14\n",
       "20    844    roe   rad    11.25    844   DR-1  1932-03-22"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.merge(survey, visited, left_on = 'taken', right_on = 'ident', how = 'left')\n",
    "# survey.merge(visited, left_on = 'taken', right_on = 'ident', how = 'left')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Here, the left dataset is `survey` and the right one is `visited`. \n",
    "\n",
    "Since we're doing a left join, we keed all the rows from `survey` and add columns from `visited`, matching on the common key, called \"taken\" in one dataset and \"ident\" in the other. \n",
    "\n",
    "Note that the rows of `visited` are repeated as needed to line up with all the rows with common \"taken\" values. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Data aggregation and split-apply-combine\n",
    "\n",
    "![](graphs/split-apply-combine.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>country</th>\n",
       "      <th>continent</th>\n",
       "      <th>year</th>\n",
       "      <th>lifeExp</th>\n",
       "      <th>pop</th>\n",
       "      <th>gdpPercap</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>Asia</td>\n",
       "      <td>1952</td>\n",
       "      <td>28.801</td>\n",
       "      <td>8425333</td>\n",
       "      <td>779.445314</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>Asia</td>\n",
       "      <td>1957</td>\n",
       "      <td>30.332</td>\n",
       "      <td>9240934</td>\n",
       "      <td>820.853030</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>Asia</td>\n",
       "      <td>1962</td>\n",
       "      <td>31.997</td>\n",
       "      <td>10267083</td>\n",
       "      <td>853.100710</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>Asia</td>\n",
       "      <td>1967</td>\n",
       "      <td>34.020</td>\n",
       "      <td>11537966</td>\n",
       "      <td>836.197138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>Asia</td>\n",
       "      <td>1972</td>\n",
       "      <td>36.088</td>\n",
       "      <td>13079460</td>\n",
       "      <td>739.981106</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       country continent  year  lifeExp       pop   gdpPercap\n",
       "0  Afghanistan      Asia  1952   28.801   8425333  779.445314\n",
       "1  Afghanistan      Asia  1957   30.332   9240934  820.853030\n",
       "2  Afghanistan      Asia  1962   31.997  10267083  853.100710\n",
       "3  Afghanistan      Asia  1967   34.020  11537966  836.197138\n",
       "4  Afghanistan      Asia  1972   36.088  13079460  739.981106"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gapminder =  pd.read_csv('data/gapminder.tsv', sep = '\\t') \n",
    "gapminder.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "country\n",
       "Afghanistan           37.478833\n",
       "Albania               68.432917\n",
       "Algeria               59.030167\n",
       "Angola                37.883500\n",
       "Argentina             69.060417\n",
       "                        ...    \n",
       "Vietnam               57.479500\n",
       "West Bank and Gaza    60.328667\n",
       "Yemen, Rep.           46.780417\n",
       "Zambia                45.996333\n",
       "Zimbabwe              52.663167\n",
       "Name: lifeExp, Length: 142, dtype: float64"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gapminder.groupby('country')['lifeExp'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "gapminder.groupby('country').get_group('United Kingdom')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "continent\n",
       "Africa      47.7920\n",
       "Americas    67.0480\n",
       "Asia        61.7915\n",
       "Europe      72.2410\n",
       "Oceania     73.6650\n",
       "Name: lifeExp, dtype: float64"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gapminder.groupby('continent').lifeExp.agg(np.median) # Medians"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>lifeExp</th>\n",
       "      <th>pop</th>\n",
       "      <th>gdpPercap</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1952</td>\n",
       "      <td>49.057620</td>\n",
       "      <td>3943953.0</td>\n",
       "      <td>1968.528344</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1957</td>\n",
       "      <td>51.507401</td>\n",
       "      <td>4282942.0</td>\n",
       "      <td>2173.220291</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1962</td>\n",
       "      <td>53.609249</td>\n",
       "      <td>4686039.5</td>\n",
       "      <td>2335.439533</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1967</td>\n",
       "      <td>55.678290</td>\n",
       "      <td>5170175.5</td>\n",
       "      <td>2678.334741</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1972</td>\n",
       "      <td>57.647386</td>\n",
       "      <td>5877996.5</td>\n",
       "      <td>3339.129407</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1977</td>\n",
       "      <td>59.570157</td>\n",
       "      <td>6404036.5</td>\n",
       "      <td>3798.609244</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1982</td>\n",
       "      <td>61.533197</td>\n",
       "      <td>7007320.0</td>\n",
       "      <td>4216.228428</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1987</td>\n",
       "      <td>63.212613</td>\n",
       "      <td>7774861.5</td>\n",
       "      <td>4280.300366</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1992</td>\n",
       "      <td>64.160338</td>\n",
       "      <td>8688686.5</td>\n",
       "      <td>4386.085502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1997</td>\n",
       "      <td>65.014676</td>\n",
       "      <td>9735063.5</td>\n",
       "      <td>4781.825478</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2002</td>\n",
       "      <td>65.694923</td>\n",
       "      <td>10372918.5</td>\n",
       "      <td>5319.804524</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2007</td>\n",
       "      <td>67.007423</td>\n",
       "      <td>10517531.0</td>\n",
       "      <td>6124.371109</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    year    lifeExp         pop    gdpPercap\n",
       "0   1952  49.057620   3943953.0  1968.528344\n",
       "1   1957  51.507401   4282942.0  2173.220291\n",
       "2   1962  53.609249   4686039.5  2335.439533\n",
       "3   1967  55.678290   5170175.5  2678.334741\n",
       "4   1972  57.647386   5877996.5  3339.129407\n",
       "5   1977  59.570157   6404036.5  3798.609244\n",
       "6   1982  61.533197   7007320.0  4216.228428\n",
       "7   1987  63.212613   7774861.5  4280.300366\n",
       "8   1992  64.160338   8688686.5  4386.085502\n",
       "9   1997  65.014676   9735063.5  4781.825478\n",
       "10  2002  65.694923  10372918.5  5319.804524\n",
       "11  2007  67.007423  10517531.0  6124.371109"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gapminder.groupby('year').agg({'lifeExp': np.mean, 'pop': np.median, 'gdpPercap': np.median}).reset_index()"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "jupytext": {
   "formats": "ipynb,Rmd"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": false,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}