{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### 20th Century Economic History \n", "\n", "## Module 0 Problem Set: Introduction to Python\n", "#### Due 2020-08-30 Su 11:59 pm via upload to \n", "\n", "### J. Bradford DeLong\n", "\n", "Welcome to the python jupyter universe! \n", "\n", "This introductory notebook will familiarize you with the software we will use during this course. Along the way it will provide a (very brief) introduction to (a small subset of) computer programming—and it will remind you of some math, and some economics.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Our Computing Environment: The Jupyter Notebook \n", "This webpage is called a jupyter notebook. A notebook is an editable computer document in which you can write computer programs; view their results; and comment, annotate, and explain what is going on. Project jupyter is headquartered here at Berkeley, where jupyter originator and ringmaster Fernando Pérez works: its purpose is to build human-friendly frameworks for interactive computing.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Text Cells\n", "With your mouse or on your trackpad, doubleclick on this line of text. \n", "\n", "Now you see these lines as a block of text and symbols inside a rectangle with a grey background. You see at their top a brief line of text and symbols that starts with four hashtag symbols, thus: \"####\" and continues with \" Text Cells\". \n", "\n", "Now simultaneously press control and return on your computer keyboard. Alternatively, simply click on the button in the toolbar at the top of this computer window that looks like ▶. Notice how the rectangle changes: the grey background goes away, and the first line becomes larger and darker. Pressing control and return at the same time tells the notebook: interpret this block as text to be formatted using John Gruber's *markdown* formatting language (see: ). 
The four hashtags #### are a command to the markdown interpreter to format that line as a (small) header. Markdown is a simple text formatting computer language—it just has a few commands—that is worth learning. With it, you can add things like *italics*, **bold**, \n", "\n", "#### small section headings\n", "\n", "and:\n", "\n", "# large section headings\n", "\n", "as well as:\n", "\n", "1. lists\n", "2. of\n", "3. items\n", "\n", "plus:\n", "\n", "* other things\n", "\n", "to the text you will write in your notebook. It would probably be a good idea to learn markdown.\n", "\n", "Now doubleclick on this line again. The grey background comes back. And the text is now in a form in which you can (a) click on it to set the location of the computer cursor, and then (b) edit the text.\n", "\n", "If you look up and down in this document, you will see a number of such grey-background rectangles (or blocks of words and symbols that can be turned into such grey-background rectangles by doubleclicking in them).\n", "\n", "Each such grey-background rectangle is called a *cell*. Cells can be *markdown* cells, which contain words and symbols for humans to read. Cells can be *code* cells, which contain instructions for the computer—the python interpreter—to execute.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Understanding Check 1**: This paragraph is in its own text cell. Try editing it so that this sentence is the last sentence in the paragraph, and then either press control-return or click the run cell ▶ button in the toolbar just above this computer document window. This sentence, for example, should be deleted. So should this one.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code cells\n", "Some cells (like this one) contain text (and symbols) that are intended to be formatted using markdown. 
Other cells (like the one immediately below this one) contain code to be sent to the computer's python 3 interpreter, and then executed as commands to manipulate data. Running a code cell will execute all of the lines of code it contains.\n", "\n", "To run the code in a code cell, first click on that cell to activate it. It'll be highlighted with a little green or blue rectangle. Next, either press ▶, or hold down the control key and press return or enter, or hold down the shift key and press return or enter.\n", "\n", "Try running the code cell immediately below:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello, World!\n" ] } ], "source": [ "print(\"Hello, World!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And try running this one:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "👋, 🌏!\n" ] } ], "source": [ "print(\"\\N{WAVING HAND SIGN}, \\N{EARTH GLOBE ASIA-AUSTRALIA}!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fundamental building block of Python code is an expression. Cells can contain multiple lines with multiple expressions. When you run a cell, the lines of code are executed in the order in which they appear. Every print expression prints a line. Run the next cell and notice the order of the output."
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First this line is printed,\n", "and then this one.\n" ] } ], "source": [ "print(\"First this line is printed,\")\n", "print(\"and then this one.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Understanding Check 2**: Change the code cell above so that, when run, it prints out:\n", "\n", " First this line,\n", " then the whole 🌏,\n", " and then this one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't be scared if you see a \"Kernel Restarting\" message! Your data and work will still be saved. Once you see \"Kernel Ready\" in a light blue box on the top right of the notebook, you'll be ready to work again. You should rerun any cells with imports, variables, and loaded data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Parenthetically, this is an unfriendly message—especially in the use of the word \"died\". When an animal dies, it is irretrievably broken. When a python interpreter kernel dies, it is simply that the python interpreter has decided that it needs to reset itself to its basic state in order to make sure that its subsequent calculations will be without errors other than those mistakes made by the programmer. **NOTHING YOU WRITE IN A JUPYTER NOTEBOOK CAN BREAK THE COMPUTER**. At worst, you may simply have to download a new copy of this notebook file from the github repository in which it is kept.)\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Writing Jupyter notebooks\n", "You can use Jupyter notebooks for your own projects or documents. When you make your own notebook, you'll need to create your own cells for text and code.\n", "\n", "To add a cell, click the + button in the menu bar. 
It'll start out as a text cell. You can change it to a code cell by clicking inside it so it's highlighted, clicking the drop-down box next to the restart (⟳) button in the menu bar, and choosing \"Code\".\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Understanding Check 3** Add a code cell below this one. Write code in it that prints out:\n", " \n", " A whole new cell! ♪🌏♪\n", "\n", "(That musical note symbol is like the Earth symbol. Its long-form name is \\N{EIGHTH NOTE}.)\n", "\n", "Run your cell to verify that it works.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Errors\n", "Python is a language, and like natural human languages, it has rules. It differs from natural language in two important ways:\n", "1. The rules are *simple*. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.\n", "2. The rules are *rigid*. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running Python code is not smart enough to do that.\n", "\n", "Whenever you write code, you'll make mistakes. When you run a code cell that has mistakes so that the python interpreter cannot understand what you are asking it to do, the python interpreter will stop calculating and will instead print out to the screen what is called an \"error message\"—that it thinks that you have made an error in programming, in that you have asked the python interpreter to execute commands that it cannot understand.\n", "\n", "Errors are okay.\n", "\n", "Even—especially—experienced programmers make many errors. When you issue a command that causes the computer to print out an error message, you just have to find the source of the problem, fix it, and move on. 
It is actually much worse when you make a programming mistake and the python interpreter does not return an error message to you: then the python interpreter has carried out calculations, but they are not the calculations that you wanted it to carry out.\n", "\n", "We have made an error in the next cell. Run it and see what happens." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "unexpected EOF while parsing (, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m print(\"This line is missing something.\"\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m unexpected EOF while parsing\n" ] } ], "source": [ "print(\"This line is missing something.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see something like this (minus our annotations):\n", "\n", "\n", "\n", "The last line of the error output attempts to tell you what went wrong. The *syntax* of a language is its structure, and this SyntaxError tells you that you have created an illegal structure. \"EOF\" means \"end of file\". In this case, the \"file\" is the code cell, so the message is saying that the python interpreter expected to find something more (in this case, a right parenthesis to balance the left parenthesis just after the print), but it reached the end of the cell instead.\n", "\n", "There's a lot of terminology in programming languages. But you don't need to know it all in order to program effectively. If you see a cryptic message like this, you can often get by without deciphering it. 
(Of course, if you're frustrated, feel free to ask a friend or post on the class Piazza.)\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Understanding Check 4** Try to fix the code above so that you can run the cell and see the intended message instead of an error.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Submitting your work\n", "All assignments in the course will be distributed as notebooks like this one, and you will submit your work from the notebook. We will use a system called OK that checks your work and helps you submit. At the top of each assignment, you'll see a cell like the one below that prompts you to identify yourself. Run it and follow the instructions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Don't change this cell; just run it. \n", "# The result will give you directions about how to log in to the submission system, called OK.\n", "# Once you're logged in, you can run this cell again, but it won't ask you who you are because\n", "# it remembers you. However, you will need to log in once per assignment.\n", "\n", "# !pip install -U okpy\n", "# from client.api.notebook import Notebook\n", "# ok = Notebook('PS3_Intro_to_Python.ok')\n", "# _ = ok.auth(force=True, inline=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you finish an assignment, you need to submit it by running the submit command below. It's OK to submit multiple times; OK will only try to grade your final submission for each assignment. Don't forget to submit your assignment, even if you haven't finished everything."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = ok.submit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you are comfortable with our computing environment, we are going to be moving into more of the fundamentals of Python, but first, run the cell below to ensure all the libraries needed for this notebook are installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install numpy\n", "!pip install pandas\n", "!pip install matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "### Introduction to Programming Concepts " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Part 1: Python basics \n", "Before getting into the more advanced analysis techniques that will be required in this course, we need to cover a few of the foundational elements of programming in Python.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### A. Expressions\n", "The departure point for all programming is the concept of the __expression__. An expression is a combination of variables, operators, and other Python elements that the language interprets and acts upon. Expressions act as a set of instructions to be fed through the interpreter, with the goal of generating specific outcomes. See below for some examples of basic expressions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Examples of expressions:\n", "\n", "#addition\n", "print(2 + 2)\n", "\n", "#string concatenation \n", "print('me' + ' and I')\n", "\n", "#you can print a number with a string if you cast it \n", "print(\"me\" + str(2))\n", "\n", "#exponents\n", "print(12 ** 2)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will notice that only the last line in a cell gets printed out. 
If you want to see the values of previous expressions, you need to call print on that expression. Try adding print statements to some of the above expressions to get them to display.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### B. Variables\n", "In the example below, a and b are Python objects known as __variables__. We are giving an object (in this case, an integer and a float, two Python data types) a name that we can store for later use. To use that value, we can simply type the name that we stored the value as. Variables are stored within the notebook's environment, meaning stored variable values carry over from cell to cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = 4\n", "b = 10/5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that when you create a variable, unlike what you previously saw with the expressions, it does not print anything out." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Notice that 'a' retains its value.\n", "print(a)\n", "a + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Question 1: Variables\n", "See if you can write a series of expressions that creates two new variables called __x__ and __y__ and assigns them values of __10.5__ and __7.2__. Then assign their product to the variable __combo__ and print it." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "75.60000000000001\n" ] } ], "source": [ "# Fill in the missing lines to complete the expressions.\n", "\n", "x = 10.5\n", "y = 7.2\n", "combo = x * y\n", "print(combo)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running the cell below will give you some feedback on your responses. 
Though the OK tests are not always comprehensive (passing all of the tests does not guarantee full credit for questions), they give you a pretty good indication as to whether or not you're on track." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# ok.grade('q01')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### C. Lists\n", "The next topic is particularly useful in the kind of data manipulation that you will see throughout 101B. The following few cells will introduce the concept of __lists__ (and their counterpart, numpy arrays). Read through the following cell to understand the basic structure of a list. \n", "\n", "A list is an ordered collection of objects. They allow us to store and access groups of variables and other objects for easy access and analysis. Check out this [documentation](https://www.tutorialspoint.com/python/python_lists.htm) for an in-depth look at the capabilities of lists.\n", "\n", "To initialize a list, you use brackets. Putting objects separated by commas in between the brackets will add them to the list. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[]\n", "[1, 3, 6, 'lists', 'are', 'fun', 4]\n" ] } ], "source": [ "# an empty list\n", "lst = []\n", "print(lst)\n", "\n", "# reassigning our empty list to a new list\n", "lst = [1, 3, 6, 'lists', 'are', 'fun', 4]\n", "print(lst)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To access a value in the list, put the index of the item you wish to access in brackets following the variable that stores the list. Lists in Python are zero-indexed, so the indices for lst are 0, 1, 2, 3, 4, 5, and 6."
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6\n" ] } ], "source": [ "# Elements are selected like this:\n", "example = lst[2]\n", "\n", "# The above line selects the 3rd element of lst (list indices are 0-offset) and sets it to a variable named example.\n", "print(example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to note that when you store a list to a variable, you are actually storing the **pointer** to the list. That means if you assign your list to another variable, and you change the elements in your other variable, then you are changing the same data as in the original list. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "a = [1,2,3] #original list\n", "b = a #b now points to list a \n", "b[0] = 4 \n", "print(a[0]) #prints 4 since we modified the first element of the list pointed to by a and b " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Slicing lists\n", "As you can see from above, lists do not have to be made up of elements of the same kind. Indices do not have to be taken one at a time, either. Instead, we can take a slice of indices and return the elements at those indices as a separate list." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3, 6, 'lists']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### This line will store the elements at indices 1 (inclusive) through 4 (exclusive) of lst as a new list called lst_2:\n", "lst_2 = lst[1:4]\n", "\n", "lst_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Question 2: Lists\n", "Build a list of length 10 containing whatever elements you'd like. 
Then, slice it into a new list of length five using index slicing. Finally, assign the last element in your sliced list to the given variable and print it." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [], "source": [ "### Fill in the ellipses to complete the question.\n", "...\n", "...\n", "...\n", "..." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# ok.grade('q02')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lists can also be operated on with a few built-in analysis functions. These include min and max, among others. Lists can also be concatenated together. Find some examples below." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Max of a_list: 13\n", "Min of b_list: 2\n", "Concatenated: [1, 6, 4, 8, 13, 2, 4, 5, 2, 14, 9, 11]\n" ] } ], "source": [ "# A list containing six integers.\n", "a_list = [1, 6, 4, 8, 13, 2]\n", "\n", "# Another list containing six integers.\n", "b_list = [4, 5, 2, 14, 9, 11]\n", "\n", "print('Max of a_list:', max(a_list))\n", "print('Min of b_list:', min(b_list))\n", "\n", "# Concatenate a_list and b_list:\n", "c_list = a_list + b_list\n", "print('Concatenated:', c_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### D. Numpy Arrays\n", "Closely related to the concept of a list is the array, a nested sequence of elements that is structurally identical to a list. Arrays, however, can be operated on arithmetically with much more versatility than regular lists. For the purpose of later data manipulation, we'll access arrays through [Numpy](https://docs.scipy.org/doc/numpy/reference/routines.html), which will require an import statement.\n", "\n", "Now run the next cell to import the numpy library into your notebook, and examine how numpy arrays can be used."
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Undoubled Array:\n", "[0 1 2 3 4 5 6 7 8 9]\n", "Doubled Array:\n", "[ 0 2 4 6 8 10 12 14 16 18]\n" ] } ], "source": [ "# Initialize an array of integers 0 through 9.\n", "example_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])\n", "\n", "# This can also be accomplished using np.arange\n", "example_array_2 = np.arange(10)\n", "print('Undoubled Array:')\n", "print(example_array_2)\n", "\n", "# Double the values in example_array and print the new array.\n", "double_array = example_array*2\n", "print('Doubled Array:')\n", "print(double_array)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This behavior differs from that of a list. See below what happens if you multiply a list." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]\n", "example_list * 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that instead of multiplying each of the elements by two, multiplying a list and a number returns that many copies of that list. This is the reason that we will sometimes use Numpy over lists. Other mathematical operations have interesting behaviors with lists that you should explore on your own. \n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### E. Looping\n", "[Loops](https://www.tutorialspoint.com/python/python_loops.htm) are often useful in manipulating, iterating over, or transforming large lists and arrays. The first type we will discuss is the __for loop__. 
For loops are helpful in traversing a list and performing an action at each element. For example, the following code moves through every element in example_array, adds 5 to it, and appends the result to a new list." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[5, 6, 7, 8, 9, 10, 11, 12, 13, 14]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_list = []\n", "\n", "for element in example_array:\n", " new_element = element + 5\n", " new_list.append(new_element)\n", "\n", "new_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most important line in the above cell is the \"for element in...\" line. This statement sets the structure of our loop, instructing the machine to stop at every number in example_array, perform the indicated operations, and then move on. Once Python has stopped at every element in example_array, the loop is completed and the final line, which outputs new_list, is executed. It's important to note that \"element\" is an arbitrary variable name used to represent whichever element the loop is currently operating on. We can change the variable name to whatever we want and achieve the same result, as long as we stay consistent. For example:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[5, 6, 7, 8, 9, 10, 11, 12, 13, 14]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newer_list = []\n", "\n", "for completely_arbitrary_name in example_array:\n", " newer_element = completely_arbitrary_name + 5\n", " newer_list.append(newer_element)\n", " \n", "newer_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For loops can also iterate over ranges of numerical values. 
If I wanted to alter example_array without copying it over to a new list, I would use a numerical iterator to access list indices rather than the elements themselves. This iterator, called i, would range from 0, the value of the first index, to 9, the value of the last. I can make sure of this by using the built-in range and len functions." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for i in range(len(example_array)):\n", " example_array[i] = example_array[i] + 5\n", "\n", "example_array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Other types of loops\n", "The __while loop__ repeatedly performs operations until a conditional is no longer satisfied. A conditional is a [boolean expression](https://en.wikipedia.org/wiki/Boolean_expression), that is an expression that evaluates to True or False. \n", "\n", "In the below example, an array of integers 0 to 9 is generated. When the program enters the while loop on the subsequent line, it notices that the maximum value of the array is less than 50. Because of this, it adds 1 to the fifth element, as instructed. Once the instructions embedded in the loop are complete, the program refers back to the conditional. Again, the maximum value is less than 50. This process repeats until the fifth element, now the maximum value of the array, is equal to 50, at which point the conditional is no longer true and the loop breaks."
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before: [0 1 2 3 4 5 6 7 8 9]\n", "After: [ 0 1 2 3 50 5 6 7 8 9]\n" ] } ], "source": [ "while_array = np.arange(10) # Generate our array of values\n", "\n", "print('Before:', while_array)\n", "\n", "while(max(while_array) < 50): # Set our conditional\n", " while_array[4] += 1 # Add 1 to the fifth element if the conditional is satisfied \n", " \n", "print('After:', while_array)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Question 3: Loops\n", "In the following cell, partial steps to manipulate an array are included. You must fill in the blanks to accomplish the following:
\n", "1. Iterate over the entire array, checking if each element is a multiple of 5\n", "2. If an element is not a multiple of 5, add 1 to it repeatedly until it is\n", "3. Iterate back over the list and print each element.\n", "\n", "> Hint: To check if an integer x is a multiple of y, use the modulus operator %. Typing x % y will return the remainder when x is divided by y. Therefore, (x % y != 0) will return True when y __does not divide__ x, and False when it does." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ellipsis" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "# Make use of iterators, range, length, while loops, and indices to complete this question.\n", "question_3 = np.array([12, 31, 50, 0, 22, 28, 19, 105, 44, 12, 77])\n", "\n", "...\n", "...\n", "..." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# ok.grade('q03')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### F. Functions!\n", "Functions are useful when you want to repeat a series of steps on multiple different objects, but don't want to type out the steps over and over again. Many functions are built into Python already; for example, you've already made use of len() to retrieve the number of elements in a list. You can also write your own functions, and at this point you already have the skills to do so.\n", "\n", "Functions generally take a set of __parameters__ (also called inputs), which define the objects they will use when they are run. For example, the len() function takes a list or array as its parameter, and returns the length of that list.\n", "\n", "\n", "The following cell gives an example of an extremely simple function, called add_two, which takes as its parameter an integer and returns that integer with, you guessed it, 2 added to it." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# An adder function that adds 2 to the given n.\n", "def add_two(n):\n", " return n + 2" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "add_two(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Easy enough, right? Let's look at a function that takes two parameters, compares them somehow, and then returns a boolean value (True or False) depending on the comparison. The is_multiple function below takes as parameters an integer m and an integer n, checks if m is a multiple of n, and returns True if it is. Otherwise, it returns False. \n", "\n", "if statements, just like while loops, are dependent on boolean expressions. If the conditional is True, then the following indented code block will be executed. If the conditional evaluates to False, then the code block will be skipped over. Read more about if statements [here](https://www.tutorialspoint.com/python/python_if_else.htm)." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def is_multiple(m, n):\n", " if (m % n == 0):\n", " return True\n", " else:\n", " return False" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_multiple(12, 4)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "is_multiple(12, 7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sidenote:** Another way to write is_multiple is below; think about why it works.\n", "\n", " def is_multiple(m, n):\n", " return m % n == 0\n", " \n", "Since functions are so easily replicable, we can include them in loops if we want. For instance, our is_multiple function can be used to check if a number is prime! See for yourself by testing some possible prime numbers in the cell below." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9999991 is prime\n" ] } ], "source": [ "# Change possible_prime to any integer to test its primality\n", "# NOTE: If you happen to stumble across a large (> 8 digits) prime number, the cell could take a very, very long time\n", "# to run and will likely crash your kernel. 
Just click kernel>interrupt if it looks like it's caught.\n", "\n", "possible_prime = 9999991\n", "\n", "for i in range(2, possible_prime):\n", " if (is_multiple(possible_prime, i)):\n", " print(possible_prime, 'is not prime') \n", " break\n", " if (i >= possible_prime/2):\n", " print(possible_prime, 'is prime')\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Question 4: Writing functions\n", "In the following cell, complete a function that will take as its parameters a list and two integers x and y, iterate through the list, and replace any number in the list that is a multiple of x with y.\n", "> Hint: use the is_multiple() function to streamline your code." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def replace_with_y(lst, x, y):\n", " for i in range(...):\n", " if(...):\n", " ...\n", " return lst" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# ok.grade('q04')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "#### Part 2: Pandas Dataframes \n", "\n", "We will be using Pandas dataframes for much of this class to organize and sort through economic data. [Pandas](http://pandas.pydata.org/pandas-docs/stable/) is one of the most widely used Python libraries in data science. It is commonly used for data cleaning, and with good reason: it’s very powerful and flexible, among many other things. Like we did with numpy, we will have to import pandas." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Creating dataframes\n", "\n", "The rows and columns of a pandas dataframe are essentially a collection of lists stacked on top/next to each other. 
For example, if I wanted to store the top 10 movies and their ratings in a datatable, I could create 10 lists that each contain a rating and a corresponding title, and these lists would be the rows of the table:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RatingMovie
09.2The Shawshank Redemption (1994)
19.2The Godfather (1972)
29.0The Godfather: Part II (1974)
38.9Pulp Fiction (1994)
48.9Schindler's List (1993)
58.9The Lord of the Rings: The Return of the King ...
68.912 Angry Men (1957)
78.9The Dark Knight (2008)
88.9Il buono, il brutto, il cattivo (1966)
98.8The Lord of the Rings: The Fellowship of the R...
\n", "
" ], "text/plain": [ " Rating Movie\n", "0 9.2 The Shawshank Redemption (1994)\n", "1 9.2 The Godfather (1972)\n", "2 9.0 The Godfather: Part II (1974)\n", "3 8.9 Pulp Fiction (1994)\n", "4 8.9 Schindler's List (1993)\n", "5 8.9 The Lord of the Rings: The Return of the King ...\n", "6 8.9 12 Angry Men (1957)\n", "7 8.9 The Dark Knight (2008)\n", "8 8.9 Il buono, il brutto, il cattivo (1966)\n", "9 8.8 The Lord of the Rings: The Fellowship of the R..." ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_10_movies = pd.DataFrame(data=np.array(\n", " [[9.2, 'The Shawshank Redemption (1994)'],\n", " [9.2, 'The Godfather (1972)'],\n", " [9., 'The Godfather: Part II (1974)'],\n", " [8.9, 'Pulp Fiction (1994)'],\n", " [8.9, \"Schindler's List (1993)\"],\n", " [8.9, 'The Lord of the Rings: The Return of the King (2003)'],\n", " [8.9, '12 Angry Men (1957)'],\n", " [8.9, 'The Dark Knight (2008)'],\n", " [8.9, 'Il buono, il brutto, il cattivo (1966)'],\n", " [8.8, 'The Lord of the Rings: The Fellowship of the Ring (2001)']]), columns=[\"Rating\", \"Movie\"])\n", "top_10_movies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can store data in a dictionary instead of in lists. A dictionary maps keys to values, and each key is unique. Using our top 10 movies example, we could create a dictionary that stores the list of ratings under one key and the list of movie titles under another key."
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "top_10_movies_dict = {\"Rating\" : [9.2, 9.2, 9., 8.9, 8.9, 8.9, 8.9, 8.9, 8.9, 8.8], \n", " \"Movie\" : ['The Shawshank Redemption (1994)',\n", " 'The Godfather (1972)',\n", " 'The Godfather: Part II (1974)',\n", " 'Pulp Fiction (1994)',\n", " \"Schindler's List (1993)\",\n", " 'The Lord of the Rings: The Return of the King (2003)',\n", " '12 Angry Men (1957)',\n", " 'The Dark Knight (2008)',\n", " 'Il buono, il brutto, il cattivo (1966)',\n", " 'The Lord of the Rings: The Fellowship of the Ring (2001)']}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can use this dictionary to create a table with columns Rating and Movie" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RatingMovie
09.2The Shawshank Redemption (1994)
19.2The Godfather (1972)
29.0The Godfather: Part II (1974)
38.9Pulp Fiction (1994)
48.9Schindler's List (1993)
58.9The Lord of the Rings: The Return of the King ...
68.912 Angry Men (1957)
78.9The Dark Knight (2008)
88.9Il buono, il brutto, il cattivo (1966)
98.8The Lord of the Rings: The Fellowship of the R...
\n", "
" ], "text/plain": [ " Rating Movie\n", "0 9.2 The Shawshank Redemption (1994)\n", "1 9.2 The Godfather (1972)\n", "2 9.0 The Godfather: Part II (1974)\n", "3 8.9 Pulp Fiction (1994)\n", "4 8.9 Schindler's List (1993)\n", "5 8.9 The Lord of the Rings: The Return of the King ...\n", "6 8.9 12 Angry Men (1957)\n", "7 8.9 The Dark Knight (2008)\n", "8 8.9 Il buono, il brutto, il cattivo (1966)\n", "9 8.8 The Lord of the Rings: The Fellowship of the R..." ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_10_movies_2 = pd.DataFrame(data=top_10_movies_dict, columns=[\"Rating\", \"Movie\"])\n", "top_10_movies_2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how both ways return the same table! However, the list method created the table by essentially taking the lists and making up the rows of the table, while the dictionary method took the keys from the dictionary to make up the columns of the table. In this way, dataframes can be viewed as a collection of basic data structures, either through collecting rows or columns. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Reading in Dataframes\n", "\n", "Luckily for you, most datatables in this course will be premade and given to you in a form that is easily read into a pandas method, which creates the table for you. A common file type that is used for economic data is a Comma-Separated Values (.csv) file, which stores tabular data. It is not necessary for you to know exactly how .csv files store data, but you should know how to read a file in as a pandas dataframe. You can use the \"read_csv\" method from pandas, which takes in one parameter which is the path to the csv file you are reading in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will read in a .csv file that contains quarterly real GDI, real GDP, and nominal GDP data in the U.S. from 1947 to the present." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Run this cell to read in the table\n", "accounts = pd.read_csv(\"data/Quarterly_Accounts.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pd.read_csv function expects a path to a .csv file as its input, and will return a data table created from the data contained in the csv.\n", "We have provided Quarterly_Accounts.csv in the data directory, which sits inside the current working directory (the folder that contains this assignment). For this reason, we must tell read_csv to look in the data directory: in the path data/Quarterly_Accounts.csv, the / separates the directory name from the file name.\n", "\n", "Here is a sample of some of the rows in this datatable:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearQuarterReal GDIReal GDPNominal GDP
01947Q11912.51934.5243.1
11947Q21910.91932.3246.3
21947Q31914.01930.3250.1
31947Q41932.01960.7260.3
41948Q11984.41989.5266.2
\n", "
" ], "text/plain": [ " Year Quarter Real GDI Real GDP Nominal GDP\n", "0 1947 Q1 1912.5 1934.5 243.1\n", "1 1947 Q2 1910.9 1932.3 246.3\n", "2 1947 Q3 1914.0 1930.3 250.1\n", "3 1947 Q4 1932.0 1960.7 260.3\n", "4 1948 Q1 1984.4 1989.5 266.2" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Indexing Dataframes\n", "\n", "Oftentimes, tables will contain a lot of extraneous data that muddles our data tables, making it more difficult to quickly and accurately obtain the data we need. To correct for this, we can select out columns or rows that we need by indexing our dataframes. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to index into a table is with square bracket notation. Suppose you wanted to obtain all of the Real GDP data from the data. Using a single pair of square brackets, you could index the table for \"Real GDP\"" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1934.5\n", "1 1932.3\n", "2 1930.3\n", "3 1960.7\n", "4 1989.5\n", " ... 
\n", "276 16571.6\n", "277 16663.5\n", "278 16778.1\n", "279 16851.4\n", "280 16903.2\n", "Name: Real GDP, Length: 281, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Run this cell and see what it outputs\n", "accounts[\"Real GDP\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the above cell returns a pandas Series (a labeled, one-dimensional array) of all the real GDP values in their original order.\n", "Now, if you wanted to get the first real GDP value from this Series, you could index it with another pair of square brackets:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1934.5" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts[\"Real GDP\"][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas columns have many of the same properties as numpy arrays. Keep in mind that pandas dataframes, like many other data structures, are zero-indexed by default, meaning indexes start at 0 and end at the number of elements minus one. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to create a new datatable with selected columns from the original table, you can index with double brackets." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearQuarterReal GDPReal GDI
01947Q11934.51912.5
11947Q21932.31910.9
21947Q31930.31914.0
31947Q41960.71932.0
41948Q11989.51984.4
\n", "
" ], "text/plain": [ " Year Quarter Real GDP Real GDI\n", "0 1947 Q1 1934.5 1912.5\n", "1 1947 Q2 1932.3 1910.9\n", "2 1947 Q3 1930.3 1914.0\n", "3 1947 Q4 1960.7 1932.0\n", "4 1948 Q1 1989.5 1984.4" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Note: .head() returns the first five rows of the table\n", "accounts[[\"Year\", \"Quarter\", \"Real GDP\", \"Real GDI\"]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can get rid of columns you don't need using .drop(). Passing axis=1 tells pandas to drop a column rather than a row." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearQuarterReal GDIReal GDP
01947Q11912.51934.5
11947Q21910.91932.3
21947Q31914.01930.3
31947Q41932.01960.7
41948Q11984.41989.5
\n", "
" ], "text/plain": [ " Year Quarter Real GDI Real GDP\n", "0 1947 Q1 1912.5 1934.5\n", "1 1947 Q2 1910.9 1932.3\n", "2 1947 Q3 1914.0 1930.3\n", "3 1947 Q4 1932.0 1960.7\n", "4 1948 Q1 1984.4 1989.5" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts.drop(\"Nominal GDP\", axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can use a single set of square brackets with a slice to select rows by position. You must specify a range of the form start:stop, and the stop position itself is excluded. For example, if I wanted the rows at positions 20 through 30 of accounts (the 21st through 31st rows, since counting starts at 0):" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearQuarterReal GDIReal GDPNominal GDP
201952Q12398.32423.5360.2
211952Q22412.62428.5361.4
221952Q32435.02446.1368.1
231952Q42509.52526.4381.2
241953Q12554.32573.4388.5
251953Q22572.22593.5392.3
261953Q32555.72578.9391.7
271953Q42504.12539.8386.5
281954Q12510.12528.0385.9
291954Q22514.52530.7386.7
301954Q32537.12559.4391.6
\n", "
" ], "text/plain": [ " Year Quarter Real GDI Real GDP Nominal GDP\n", "20 1952 Q1 2398.3 2423.5 360.2\n", "21 1952 Q2 2412.6 2428.5 361.4\n", "22 1952 Q3 2435.0 2446.1 368.1\n", "23 1952 Q4 2509.5 2526.4 381.2\n", "24 1953 Q1 2554.3 2573.4 388.5\n", "25 1953 Q2 2572.2 2593.5 392.3\n", "26 1953 Q3 2555.7 2578.9 391.7\n", "27 1953 Q4 2504.1 2539.8 386.5\n", "28 1954 Q1 2510.1 2528.0 385.9\n", "29 1954 Q2 2514.5 2530.7 386.7\n", "30 1954 Q3 2537.1 2559.4 391.6" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts[20:31]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Filtering Data\n", "\n", "As the previous example shows, selecting rows by position is only useful when you already know which rows you need, and it can only return a contiguous range of entries. Working with data often involves huge datasets, making it inefficient and sometimes impossible to know exactly which positions to look at. On top of that, most data analysis looks for patterns or specific conditions in the data, which simple position-based selection cannot find. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thankfully, you can also use square bracket notation to filter data based on a condition. Suppose we only wanted real GDP and nominal GDP data from the year 2000 onward:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Real GDPNominal GDP
21212359.110031.0
21312592.510278.3
21412607.710357.4
21512679.310472.3
21612643.310508.1
.........
27616571.618325.2
27716663.518538.0
27816778.118729.1
27916851.418905.5
28016903.219057.7
\n", "

69 rows × 2 columns

\n", "
" ], "text/plain": [ " Real GDP Nominal GDP\n", "212 12359.1 10031.0\n", "213 12592.5 10278.3\n", "214 12607.7 10357.4\n", "215 12679.3 10472.3\n", "216 12643.3 10508.1\n", ".. ... ...\n", "276 16571.6 18325.2\n", "277 16663.5 18538.0\n", "278 16778.1 18729.1\n", "279 16851.4 18905.5\n", "280 16903.2 19057.7\n", "\n", "[69 rows x 2 columns]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts[accounts[\"Year\"] >= 2000][[\"Real GDP\", \"Nominal GDP\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The accounts table is indexed by the condition accounts[\"Year\"] >= 2000, which keeps only the rows whose \"Year\" is greater than or equal to 2000. We then index this table with the double bracket notation from the previous section to get only the real GDP and nominal GDP columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose now we wanted a table with first-quarter data where real GDP was less than 5000 or nominal GDP was greater than 15,000." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearQuarterReal GDIReal GDPNominal GDP
01947Q11912.51934.5243.1
41948Q11984.41989.5266.2
81949Q12001.52007.5275.4
121950Q12060.12084.6281.2
161951Q12281.02304.5336.4
201952Q12398.32423.5360.2
241953Q12554.32573.4388.5
281954Q12510.12528.0385.9
321955Q12661.62683.8413.8
361956Q12775.42770.0440.5
401957Q12862.02854.5470.6
441958Q12779.92772.7468.4
481959Q12976.52976.6511.1
521960Q13121.93123.2543.3
561961Q13109.93102.3545.9
601962Q13328.63336.8595.2
641963Q13469.13456.1622.7
681964Q13658.63672.7671.1
721965Q13885.53873.5719.2
761966Q14167.84201.9797.3
801967Q14286.54324.9846.0
841968Q14465.64490.6911.1
881969Q14665.44691.6995.4
921970Q14690.44707.11053.5
961971Q14778.04834.31137.8
2562011Q114924.414881.315238.4
2602012Q115500.415291.015973.9
2642013Q115642.715491.916475.4
2682014Q115912.815757.617031.3
2722015Q116599.616350.017874.7
2762016Q116776.116571.618325.2
2802017Q116992.116903.219057.7
\n", "
" ], "text/plain": [ " Year Quarter Real GDI Real GDP Nominal GDP\n", "0 1947 Q1 1912.5 1934.5 243.1\n", "4 1948 Q1 1984.4 1989.5 266.2\n", "8 1949 Q1 2001.5 2007.5 275.4\n", "12 1950 Q1 2060.1 2084.6 281.2\n", "16 1951 Q1 2281.0 2304.5 336.4\n", "20 1952 Q1 2398.3 2423.5 360.2\n", "24 1953 Q1 2554.3 2573.4 388.5\n", "28 1954 Q1 2510.1 2528.0 385.9\n", "32 1955 Q1 2661.6 2683.8 413.8\n", "36 1956 Q1 2775.4 2770.0 440.5\n", "40 1957 Q1 2862.0 2854.5 470.6\n", "44 1958 Q1 2779.9 2772.7 468.4\n", "48 1959 Q1 2976.5 2976.6 511.1\n", "52 1960 Q1 3121.9 3123.2 543.3\n", "56 1961 Q1 3109.9 3102.3 545.9\n", "60 1962 Q1 3328.6 3336.8 595.2\n", "64 1963 Q1 3469.1 3456.1 622.7\n", "68 1964 Q1 3658.6 3672.7 671.1\n", "72 1965 Q1 3885.5 3873.5 719.2\n", "76 1966 Q1 4167.8 4201.9 797.3\n", "80 1967 Q1 4286.5 4324.9 846.0\n", "84 1968 Q1 4465.6 4490.6 911.1\n", "88 1969 Q1 4665.4 4691.6 995.4\n", "92 1970 Q1 4690.4 4707.1 1053.5\n", "96 1971 Q1 4778.0 4834.3 1137.8\n", "256 2011 Q1 14924.4 14881.3 15238.4\n", "260 2012 Q1 15500.4 15291.0 15973.9\n", "264 2013 Q1 15642.7 15491.9 16475.4\n", "268 2014 Q1 15912.8 15757.6 17031.3\n", "272 2015 Q1 16599.6 16350.0 17874.7\n", "276 2016 Q1 16776.1 16571.6 18325.2\n", "280 2017 Q1 16992.1 16903.2 19057.7" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts[(accounts[\"Quarter\"] == \"Q1\") & ((accounts[\"Real GDP\"] < 5000) | (accounts[\"Nominal GDP\"] > 15000))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can combine many conditions in one filter, using the & (and) and | (or) operators to connect them. Make sure to wrap each condition in parentheses!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to reorganize data to make it more convenient is to sort the data by the values in a specific column. 
For example, if we wanted to find the highest real GDP since 1947, we could sort the table for real GDP:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearQuarterReal GDIReal GDPNominal GDP
21947Q31914.01930.3250.1
11947Q21910.91932.3246.3
01947Q11912.51934.5243.1
31947Q41932.01960.7260.3
41948Q11984.41989.5266.2
..................
2762016Q116776.116571.618325.2
2772016Q216783.016663.518538.0
2782016Q316953.016778.118729.1
2792016Q416882.116851.418905.5
2802017Q116992.116903.219057.7
\n", "

281 rows × 5 columns

\n", "
" ], "text/plain": [ " Year Quarter Real GDI Real GDP Nominal GDP\n", "2 1947 Q3 1914.0 1930.3 250.1\n", "1 1947 Q2 1910.9 1932.3 246.3\n", "0 1947 Q1 1912.5 1934.5 243.1\n", "3 1947 Q4 1932.0 1960.7 260.3\n", "4 1948 Q1 1984.4 1989.5 266.2\n", ".. ... ... ... ... ...\n", "276 2016 Q1 16776.1 16571.6 18325.2\n", "277 2016 Q2 16783.0 16663.5 18538.0\n", "278 2016 Q3 16953.0 16778.1 18729.1\n", "279 2016 Q4 16882.1 16851.4 18905.5\n", "280 2017 Q1 16992.1 16903.2 19057.7\n", "\n", "[281 rows x 5 columns]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts.sort_values(\"Real GDP\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But wait! The table is sorted in increasing order, so the highest value ends up at the bottom. This is because sort_values defaults to ascending order. To sort from highest to lowest, add the optional parameter ascending=False:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearQuarterReal GDIReal GDPNominal GDP
2802017Q116992.116903.219057.7
2792016Q416882.116851.418905.5
2782016Q316953.016778.118729.1
2772016Q216783.016663.518538.0
2762016Q116776.116571.618325.2
..................
41948Q11984.41989.5266.2
31947Q41932.01960.7260.3
01947Q11912.51934.5243.1
11947Q21910.91932.3246.3
21947Q31914.01930.3250.1
\n", "

281 rows × 5 columns

\n", "
" ], "text/plain": [ " Year Quarter Real GDI Real GDP Nominal GDP\n", "280 2017 Q1 16992.1 16903.2 19057.7\n", "279 2016 Q4 16882.1 16851.4 18905.5\n", "278 2016 Q3 16953.0 16778.1 18729.1\n", "277 2016 Q2 16783.0 16663.5 18538.0\n", "276 2016 Q1 16776.1 16571.6 18325.2\n", ".. ... ... ... ... ...\n", "4 1948 Q1 1984.4 1989.5 266.2\n", "3 1947 Q4 1932.0 1960.7 260.3\n", "0 1947 Q1 1912.5 1934.5 243.1\n", "1 1947 Q2 1910.9 1932.3 246.3\n", "2 1947 Q3 1914.0 1930.3 250.1\n", "\n", "[281 rows x 5 columns]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts.sort_values(\"Real GDP\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can clearly see that the highest real GDP in the table was attained in the first quarter of 2017, with a value of 16903.2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "##### Useful Functions for Numeric Data\n", "\n", "Here are a few useful functions when dealing with numeric data columns.\n", "To find the minimum value in a column, call min() on a column of the table." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1930.3" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts[\"Real GDP\"].min()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To find the maximum value, call max()." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19057.7" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts[\"Nominal GDP\"].max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And to find the average value of a column, use mean()."
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7890.370462633456" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accounts[\"Real GDI\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "#### Part 3: Visualization " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you can read in data and manipulate it, you are ready to learn how to visualize it. To begin, run the cell below to import the packages we will be using." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be using US unemployment data from [FRED](https://fred.stlouisfed.org/) to show what we can do with data. The statement below will put the csv file into a pandas DataFrame." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetotal_unemployedmore_than_15_weeksnot_in_labor_searched_for_workmulti_jobsleaverslosershousing_price_index
011/1/1016.98696253167085.763.0186.07
112/1/1016.68549260968996.461.2183.27
21/1/1116.28393280068166.560.1181.35
32/1/1116.08175273067416.460.2179.66
43/1/1115.98166243467356.460.3178.84
\n", "
" ], "text/plain": [ " date total_unemployed more_than_15_weeks \\\n", "0 11/1/10 16.9 8696 \n", "1 12/1/10 16.6 8549 \n", "2 1/1/11 16.2 8393 \n", "3 2/1/11 16.0 8175 \n", "4 3/1/11 15.9 8166 \n", "\n", " not_in_labor_searched_for_work multi_jobs leavers losers \\\n", "0 2531 6708 5.7 63.0 \n", "1 2609 6899 6.4 61.2 \n", "2 2800 6816 6.5 60.1 \n", "3 2730 6741 6.4 60.2 \n", "4 2434 6735 6.4 60.3 \n", "\n", " housing_price_index \n", "0 186.07 \n", "1 183.27 \n", "2 181.35 \n", "3 179.66 \n", "4 178.84 " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unemployment_data = pd.read_csv(\"data/detailed_unemployment.csv\")\n", "unemployment_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the advantages of pandas is its built-in plotting methods. We can simply call .plot() on a dataframe to plot columns against one another. All that we have to do is specify which column to plot on which axis. Something special that pandas does is attempt to automatically parse dates into something that it can understand and order them sequentially.\n", "\n", "**Sidenote:** total_unemployed is a percent." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "