{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Lab 1: Introduction to Notebooks, Python, and Tables\n", "\n", "Welcome to lab 1! \n", "\n", "We hope you have enjoyed the first day of the summer workshop. This lab is intended to be a hands-on introduction to some of the concepts taught in Data 8. We will work through a subset of the topics covered in the first three to four weeks of the course. \n", "\n", "We have also set up some autograder tests for this assignment. You can check your solutions and see how students check and submit their assignments using OKpy, the autograder used in Data 8. As you work through the lab, there will be lab assistants in the room to answer any of your questions. If you get stuck at any point, feel free to ask a neighbor or one of the lab assistants for help.\n", "\n", "## What this lab will cover:\n", "* [1. Intro to Jupyter Notebooks](#notebooks) \n", "* [2. Intro to Python](#python)\n", "* [3. Intro to Tables](#tables) \n", "* [4. Visualizations](#visualizations) \n", "\n", "## What you need to do:\n", "* Read the content, complete the questions \n", "* Load the autograder tests, log in\n", "* Autograde your solutions for questions 2-3\n", "* Submit the assignment\n", "\n", "## Quick links to questions:\n", "* [Question 1](#1)\n", "* [Question 2](#2)\n", "* [Question 3](#3)\n", "* [Question 4](#4)\n", "* [Question 5](#5)\n", "* [Question 6](#6)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# 1. Intro to Jupyter Notebooks\n", "\n", "You are currently working in a Jupyter Notebook. A Notebook allows text and code to be combined into one document. Each rectangular section of a notebook is called a \"cell.\" There are two types of cells in this notebook: text cells and code cells. 
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 1.1 Text Cells\n", "\n", "Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings. If you want to write your own notebooks, it will be useful to know the basics of Markdown.\n", "\n", "After you edit a text cell, click the \"run cell\" button at the top that looks like ▶ to confirm any changes. Alternatively, you can hold down the `shift` key and then press `return` or `enter`." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Question 1\n", "\n", "This paragraph is in its own text cell. Try editing it so that this sentence is the last sentence in the paragraph. This sentence, for example, should be deleted. So should this one." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 1.2 Code Cells\n", "\n", "Other cells contain code in the Python 3 language. Running a code cell will execute all of the code it contains. To run the code in a code cell, first click on that cell to activate it. Next, either click ▶ or hold down the `shift` key and then press `return` or `enter`.\n", "\n", "Try running the two cells below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "print(\"Hello, World!\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "print(\"The plural of anecdote is not data.\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Code cells can contain multiple lines. When you run a cell with multiple lines, each line will be executed. 
Every print expression in the following cell prints a line. Run the next cell and notice the order of the output." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "print(\"First this line is printed,\")\n", "print(\"and then this one.\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 1.3 Comments\n", "\n", "Code cells can also contain text in the form of comments. Comments don't make anything happen in Python; Python ignores anything on a line after a #. Instead, comments are there to communicate something about the code to you, the human reader. Comments are extremely useful. The cell below contains two examples of how comments can be used.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# This is a comment\n", "print(\"Hello, World!\") # This is also a comment" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Load Tests and Log In\n", "\n", "After running the cell below, all tests for this assignment will be loaded. Only questions 2 and 3 have autograder tests for this lab. Here is what you need to do:\n", "* Open the URL that is shown.\n", "* Select an email account.\n", "* Copy the code that is shown.\n", "* Paste the code into the text box shown below. You may need to run the cell again to see the text box. \n", "* After you have pasted in the code, run the cell again and you will be logged in!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# These lines load the autograder tests. 
\n", "from client.api.notebook import Notebook\n", "ok = Notebook('lab01.ok')\n", "_ = ok.auth(inline=False)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# 2. Intro to Python" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Python is a language, and like natural human languages, it has rules. It differs from natural language in at least two important ways:\n", "\n", "1. The rules are *simple*. You can learn most of them in a few weeks and gain reasonable proficiency with the language in a semester.\n", "2. The rules are *rigid*. If you're proficient in a natural language, you can understand a non-proficient speaker, glossing over small mistakes. A computer running Python code is not smart enough to do that.\n", "\n", "We will cover some of these rules in this lab. Because the rules are so rigid, you may come across syntax errors throughout the lab. Don't hesitate to ask a lab assistant for help if you see such errors." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2.1 Numbers\n", "\n", "Quantitative information arises everywhere in data science. Python has two different types of numbers we can work with: integers and floats (numbers with a decimal point). Try running the first cell below, which contains an integer. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "3" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Notice that we didn't have to `print`. When you run a notebook cell, if the last line has a value, then Jupyter helpfully prints out that value for you. The value of the integer 3 is simply 3. 
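
A minimal sketch of the two kinds of numbers introduced in section 2.1: the built-in `type` function, which this lab does not otherwise use, reports what kind of value it is given. The names `whole`, `decimal`, and `mixed` are made up for this illustration.

```python
# The names below are hypothetical, chosen only for this illustration.
whole = 3        # an integer: no decimal point
decimal = 3.14   # a float: has a decimal point

print(type(whole))     # <class 'int'>
print(type(decimal))   # <class 'float'>

# Combining an integer and a float in one expression gives a float.
mixed = whole + decimal
print(type(mixed))     # <class 'float'>
```

Because this sketch uses `print` on every line, all three results are shown, not only the last one.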
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "3.14" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Similarly, the value of the float 3.14 is simply 3.14. \n", "\n", "Only the value of the last line of a cell is automatically displayed. If you want to display the value of previous lines, you must explicitly do so using `print`. Notice that 3 does not get displayed when you run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "print(2.5) # display by printing\n", "3 # does not display\n", "4 # displays automatically" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2.2 Text\n", "\n", "Programming doesn't just concern numbers. Text is one of the most common types of values used in programs.\n", "\n", "A snippet of text is represented by a string value in Python. The word **\"string\"** is a programming term for a sequence of characters. A string might contain a single character, a word, a sentence, or a whole book.\n", "\n", "To distinguish text data from actual code, we demarcate strings by putting quotation marks around them. Single quotes (') and double quotes (\") are both valid, but the types of opening and closing quotation marks must match. The contents can be any sequence of characters, including numbers and symbols." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "\"hello\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "'world'" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2.3 Expressions\n", "\n", "Expressions are the building blocks of programs, and they describe to the computer how to combine pieces of data. We can combine numbers using expressions. For example, we can add two numbers using + and multiply two numbers using \\*. \n", "\n", "Run the following cell to evaluate the multiplication expression shown. The value of the expression, 12 in this case, will be displayed below the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "3 * 4" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "When there are multiple expressions in a cell, only the value of the final expression will be automatically displayed upon running the cell. If you want to display the value of previous expressions, you must explicitly do so using `print`. Notice that 5 does not get displayed when you run the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "print(2 * 2) # display by printing\n", "1 * 5 # does not display\n", "4 + 2 # displays automatically" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can also create expressions using strings. Adding two strings together will concatenate them into a larger string." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "\"data\" + \"science\"" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Expressions are not limited to two values. They can consist of multiple values. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "\"I \" + \"love \" + \"data \" + \"science\"" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2.4 Names\n", "\n", "In natural language, we have terminology that lets us quickly reference very complicated concepts. We don't say, \"That's a large mammal with brown fur and sharp teeth!\" Instead, we say, \"Bear!\"\n", "\n", "Similarly, an effective strategy for writing code is to define names for data as we compute it, like a lawyer would define terms for complex ideas at the start of a legal document. Names are given to values in Python using an **assignment statement**. In an assignment, a name is followed by =, which is followed by any expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "a = 10\n", "b = 20\n", "a + b" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "A previously assigned name can be used in the expression to the right of = ." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "quarter = 1/4\n", "half = 2 * quarter\n", "half" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Names are sometimes also called variables. We say that the variable named `quarter` is equal to 0.25.\n", "\n", "Names can also be used with strings." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "favorite_class = \"Data 8\"\n", "print(\"My favorite class is \" + favorite_class)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Question 2\n", "Define a variable called `total_sum` that is equal to the sum of 10, 14, 55, 86, and 290. Then, define a variable called `average` that is equal to the average of 10, 14, 55, 86, and 290. You can use the `/` operator to do division." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Replace the ... with an expression\n", "total_sum = ...\n", "\n", "# This line will display the value of total_sum when you run this cell\n", "total_sum" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Replace the ... with an expression\n", "# Use the total_sum variable in your expression\n", "average = ...\n", "\n", "# This line will display the value of average when you run this cell\n", "average" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Test your solution!\n", "\n", "The cell below is an autograder test that is set up to check your solution to Question 2. Run the cell to see whether your solution is correct. 
If the solution is correct, the final line will look like this: \n", "\n", " [ooooooooook] 100.0% passed\n", "\n", "If the solution is incorrect, the final line will look like this: \n", "\n", " [k..........] 0.0% passed" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "_ = ok.grade('q2')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2.5 Functions\n", "\n", "We saw how to build expressions using + and \\*. Aside from addition and multiplication, we can combine or manipulate values in Python in many other ways by calling functions. Python comes with many built-in functions that perform common operations. Functions are *called* on *arguments* and *return* some value as the output. \n", "\n", "For example, the `abs` function takes a single number as its argument and returns the absolute value of that number. So `abs(5)` is 5 and `abs(-5)` is also 5." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "abs(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "abs(-5)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Some functions take multiple arguments, separated by commas. For example, the built-in `max` function can be called with many arguments and returns the maximum argument passed to it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "max(2, -3, 4, -5)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can also call functions on strings. 
`print` is one such function that we have used throughout this lab. Another built-in function is `len`, which returns the length of the argument passed in. For example, the length of the string \"Berkeley\" is 8 characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "len(\"Berkeley\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2.6 Nested Expressions\n", "\n", "Function calls and arithmetic expressions can themselves contain expressions. These are called nested expressions. Here is a simple example of a nested expression:\n", "\n", "    abs(42-34)\n", "\n", "Nested expressions can turn into complicated-looking code. However, the way in which complicated expressions break down is very regular.\n", "\n", "Suppose we are interested in heights that are very unusual. We'll say that a height is unusual to the extent that it's far away on the number line from the average human height. [An estimate](http://press.endocrine.org/doi/full/10.1210/jcem.86.9.7875?ck=nck&) of the average adult human height (averaging, we hope, over all humans on Earth today) is 1.688 meters.\n", "\n", "So if Aditya is 1.21 meters tall, then his height is $|1.21 - 1.688|$, or $.478$, meters away from the average. Here's how we'd write that in one line of Python code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "abs(1.21 - 1.688)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "What's going on here? `abs` takes just one argument, so the stuff inside the parentheses is all part of that *single argument*. Specifically, the argument is the value of the expression `1.21 - 1.688`. The value of that expression is `-.478`. 
That value then becomes the argument to `abs`. The absolute value of `-.478` is `.478`, so `.478` is the value of the full expression `abs(1.21 - 1.688)`.\n", "\n", "Picture simplifying the expression in several steps:\n", "\n", "1. `abs(1.21 - 1.688)`\n", "2. `abs(-.478)`\n", "3. `.478`\n", "\n", "In fact, that's basically what Python does to compute the value of the expression." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Question 3\n", "Say that Botan's height is 1.85 meters. In the next cell, use `abs` to compute the absolute value of the difference between Botan's height and the average human height. Give that value the name `botan_distance_from_average_m`.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Replace the ... with an expression to compute the absolute\n", "# value of the difference between Botan's height (1.85m) and\n", "# the average human height.\n", "botan_distance_from_average_m = ...\n", "\n", "# We've written this here so that the distance you\n", "# compute will get printed when you run this cell.\n", "botan_distance_from_average_m" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Test your solution!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "_ = ok.grade('q3')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 2.7 Importing code\n", "\n", "> What has been will be again, \n", "> what has been done will be done again; \n", "> there is nothing new under the sun.\n", "\n", "Most programming involves work that is very similar to work that has been done before. Since writing code is time-consuming, it's good to rely on others' published code when you can. 
So far, we have only used built-in functions that are available by default. Python also includes many useful modules that are just an `import` away.\n", "\n", "Rather than rewriting existing functions, we can *import* relevant *modules* containing those functions. We'll look at the `math` module as a first example.\n", "\n", "Suppose we want to very accurately compute the circumference of a circle with radius 5 meters. For that, we need the constant $\\pi$, which is roughly 3.14. Conveniently, the `math` module defines `pi` for us:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import math # syntax option 1\n", "radius = 5\n", "circumference_of_circle = 2 * math.pi * radius # have to use math.pi\n", "circumference_of_circle" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "In order to use a module at all, we must first write the statement `import module_name`. `pi` is defined inside `math`, and the way that we access names that are inside modules is by writing the module's name, then a dot, then the name of the thing we want:\n", "\n", "    module_name.name\n", "\n", "Another way to use a module is with the syntax shown below. With this alternate syntax, we can simply refer to `pi`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from math import * # syntax option 2\n", "radius = 5\n", "circumference_of_circle = 2 * pi * radius # can just use pi\n", "circumference_of_circle" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# 3. Intro to Tables" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "A table is a fundamental object type for representing data sets. 
A table can be viewed in two ways:\n", "* A sequence of named columns that each describe a single aspect of all entries in a data set\n", "* A sequence of rows that each contain all information about a single entry in a data set\n", "\n", "Students in Data 8 and connector courses work extensively with data in tables. They use a simple table data structure from the `datascience` module that was designed for these courses. \n", "\n", "Each column of this table data structure must contain values of the same kind. For example, a table can have a column of integers or a column of strings. We can create a table from scratch or from a CSV file. Let's try reading a CSV file into a table.\n", "\n", "Before doing so, we must first import the `datascience` module." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "# import datascience module\n", "from datascience import * " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# read the CSV into a table\n", "imdb = Table.read_table(\"data/imdb_ratings.csv\") \n", "imdb" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "If we want just the rank of each movie, we can extract the `Rank` column. Each column of a table is represented as an **array**. Similar to integers, floats, and strings, arrays are another **data type** that we can use in Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "imdb.column(\"Rank\") # Select just the \"Rank\" column" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "You may have noticed that we were not able to see all the rows in the `imdb` table. There was a line below the table that said \"`... 
(240 rows omitted)`.\" However, selecting a column allowed us to see all the values in that column.\n", "\n", "Besides extracting a specific column, we can manipulate the `imdb` table in other ways. For example, we can sort the rows by the value in the `Rank` column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# default sort order is ascending\n", "imdb.sort(\"Rank\") " ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "When using `sort`, the default sort order is ascending. However, we can also sort the table in descending order using the `descending` argument, shown below. The `descending` argument is optional and is set to `False` when it is not passed in. We can use the syntax shown below to pass in a different value for the `descending` argument. Doing so will sort the table in descending order.\n", "\n", "Whenever you see a function argument of the form `name=value`, the argument is optional and does not need to be passed in. Not passing in the argument will cause the default value to be used."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "imdb.sort(\"Rank\", descending=True)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Summary of Data Types\n", "\n", "For your later reference, here are the principal types of data we'll work with in this course.\n", "\n", "|English name|Python name|Example|Example Python expressions|\n", "|-|-|-|-|\n", "|Number|`float` (numbers with decimals) or `int` (integers)|The number of words in a book|`2`, `.25`, `2+2`|\n", "|Text|`str`|A word, chapter, or whole text of a book|`\"I <3 Data Science\"`|\n", "|A collection of multiple kinds of data|`Table`|The letter grades and all the project, midterm, and final exam scores in a class|`Table.read_table(\"data/imdb_ratings.csv\")`|\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# 4. Visualizations" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Using tables, we can easily create different types of visualizations. This section will go over how to create bar charts and histograms. Bar charts allow us to visualize **categorical distributions** and histograms allow us to visualize **numerical distributions**. Before we get started, please run the cell below to import the necessary modules."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Run this cell, but please don't change it.\n", "\n", "# These lines import the NumPy, datascience, and math modules.\n", "import numpy as np\n", "import math\n", "from datascience import *\n", "\n", "# These lines do some fancy plotting magic.\n", "import matplotlib\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('fivethirtyeight')\n", "import warnings\n", "warnings.simplefilter('ignore', FutureWarning)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 4.1 Bar Charts" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's start by looking at the table below, called `top`. The `top` table consists of the top grossing movies of all time in the United States. The first column contains the title of the movie; *Star Wars: The Force Awakens* has the top rank, with a box office gross amount of more than 900 million dollars in the United States. The second column contains the name of the studio that produced the movie. The third contains the domestic box office gross in dollars, and the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices. The fifth contains the release year of the movie." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "top = Table.read_table('data/top_movies.csv')\n", "top" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Which studios will appear most frequently if we look among all 200 rows?\n", "\n", "To figure this out, notice that all we need is a table with the movies and the studios; the other information is unnecessary. 
Let's create a table called `movies_and_studios` that contains just the `Title` and `Studio` columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Extract just the \"Title\" and \"Studio\" columns\n", "movies_and_studios = top.select('Title', 'Studio')\n", "movies_and_studios" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The Table method `group` allows us to count how frequently each studio appears in the table, by calling each studio a category and assigning each row to one category. The `group` method takes as its argument the label of the column that contains the categories, and returns a table of counts of rows in each category." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Count how frequently each studio appears in the table\n", "studio_distribution = movies_and_studios.group('Studio')\n", "studio_distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Sort rows in the studio_distribution table in descending order, based on count\n", "studio_distribution = studio_distribution.sort('count', descending=True)\n", "studio_distribution" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### The `barh` Method \n", "\n", "Now we can draw a bar chart to visualize this data. The bar chart is a familiar way of visualizing categorical distributions. It displays a bar for each category. The bars are equally spaced and equally wide. The length of each bar is proportional to the frequency of the corresponding category.\n", "\n", "We will draw bar charts with horizontal bars because it's easier to label the bars that way. 
The Table method is therefore called `barh`. It takes two arguments: the first is the column label of the categories, and the second is the column label of the frequencies." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Draw a bar chart\n", "studio_distribution.barh('Studio', 'count')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Warner Brothers and Buena Vista are the most common studios among the top 200 movies. Warner Brothers produces the Harry Potter movies and Buena Vista produces Star Wars." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## 4.2 Histograms" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "In this section we will draw graphs of the distribution of the numerical variable in the `Gross (Adjusted)` column in the `top` table. Let's look at that table once again." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "top" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "For simplicity, let's create a smaller table with two columns that has the information that we need. And since three-digit numbers are easier to work with than nine-digit numbers, let's measure the `Adjusted Gross` receipts in millions of dollars. `round` is used to retain only two decimal places.\n", "\n", "Read the comments in the code cell to understand what the code is doing." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# Python/computers start counting at 0, so top.select(0) extracts the \"Title\" column \n", "# from the top table (column 0)\n", "titles = top.select(0)\n", "\n", "# Each value from \"Gross (Adjusted)\" is divided by 1,000,000, then rounded to 2 decimal places,\n", "# so that all values are in millions\n", "gross_adjusted_in_millions = np.round(top.column(3)/1000000, 2)\n", "\n", "# .with_column adds a second column, called \"Adjusted Gross\"\n", "\n", "# The \"Adjusted Gross\" column is filled with data from the \"Gross (Adjusted)\" column (column 3)\n", "\n", "millions = titles.with_column('Adjusted Gross', gross_adjusted_in_millions)\n", "millions" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### The `hist` Method\n", "\n", "A *histogram* of a numerical dataset looks very much like a bar chart, though it has some important differences that we will examine in this section. First, let's just draw a histogram of the adjusted values. We can use the `hist` method to generate a histogram of the values in a column. \n", "\n", "The first argument to `hist` is the name of the column containing the data we want to display. The optional `unit` argument is used in the labels on the two axes. The `unit=\"Million Dollars\"` part tells the plot to say \"Percent per Million Dollars\" on the y-axis. Our histogram will show the distribution of the adjusted gross amounts, in millions of 2016 dollars. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "millions.hist('Adjusted Gross', unit=\"Million Dollars\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### The Horizontal Axis\n", "\n", "The amounts have been grouped into contiguous intervals called *bins*. Although no movie in this dataset grossed an amount that is exactly on the edge between two bins, `hist` does have to account for situations where there might have been values at the edges. So `hist` has an *endpoint convention*: bins include the data at their left endpoint, but not the data at their right endpoint. \n", "\n", "We will use the notation [*a*, *b*) for the bin that starts at *a* and ends at *b* but doesn't include *b*.\n", "\n", "We can see that there are 10 bins (some bars are so low that they are hard to see), and that they all have the same width. We can also see that none of the movies grossed fewer than 300 million dollars; that is because we are considering only the top grossing movies of all time. \n", "\n", "### Specifying Bins\n", "\n", "The `hist` function has an optional argument, `bins`, that can be used to specify the endpoints of the bins. It must consist of a sequence of numbers that starts with the left end of the first bin and ends with the right end of the last bin. We will start by setting the numbers in `bins` to be 300, 400, 500, and so on, ending with 2000." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "even = np.arange(300,2001,100) \n", "millions.hist('Adjusted Gross', bins=even, unit=\"Million Dollars\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### The Vertical Axis: Density Scale\n", "\n", "The horizontal axis of a histogram is straightforward to read, once we have taken care of details like the ends of the bins. The features of the vertical axis require a little more attention. \n", "\n", "**The height of each bar is the percent of elements that fall into the corresponding bin, relative to the width of the bin.** The height of a bar is **not** the percent of entries in the bin; it is the percent of entries in the bin relative to the amount of space in the bin. That is why the height measures crowdedness or *density*. **The area of each bar is proportional to the number of entries in the bin.** The total area of all the bars in the histogram is 100%. Speaking in terms of proportions, we say that the areas of all the bars in a histogram \"sum to 1\".\n", "\n", "How do we calculate the area of each bar? \n", "\n", "$$ \\mbox{area of bar} = \\mbox{percent of entries in bin} = \\mbox{height of bar} \\times \\mbox{width of bin} $$\n", "\n", "How do we determine the height of each bar? Let's walk through the height calculation for the [300, 400) bin. Remember that there are 200 movies in the dataset. The [300, 400) bin contains 81 movies. That's 40.5% of all the movies: \n", "\n", "$$ \\mbox{percent of entries in bin} = \\frac{81}{200} \\cdot 100 = 40.5 $$\n", "\n", "The width of the [300, 400) bin is $ 400 - 300 = 100$. 
So \n", "\n", "$$ \\mbox{height of bar} = \\frac{\\mbox{area of bar}}{\\mbox{width of bin}} = \\frac{\\mbox{percent of entries in bin}}{\\mbox{width of bin}} = \\frac{40.5}{100} = 0.405 $$\n", "\n", "\n", "### Unequal Bin Widths\n", "An advantage of the histogram over a bar chart is that a histogram can contain bins of unequal width. Below, the values in the `Adjusted Gross` column are binned into three bins of unequal width." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "uneven = make_array(300, 400, 600, 1500)\n", "millions.hist('Adjusted Gross', bins=uneven, unit=\"Million Dollars\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Question 4\n", "Create another histogram of this data with the bins specified by `uneven_again`. Fill in the correct arguments for the `hist` method. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "uneven_again = make_array(300, 350, 400, 450, 1500)\n", "\n", "# Replace the ... for each argument\n", "millions.hist(..., bins=..., unit=...)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Answer the questions below by typing your responses in the text cells provided. Look again at the histogram above, and compare the [400, 450) bin with the [450, 1500) bin. You can compare both of your answers to the solutions shown at the bottom of this [page](https://www.inferentialthinking.com/chapters/06/2/visualizing-numerical-distributions.html).\n", "\n", "### Question 5\n", "Which has more movies in it? 
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "*Write your answer here, replacing this text.*" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Question 6\n", "Then why is the [450, 1500) bar so much shorter than the [400, 450) bar? " ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "*Write your answer here, replacing this text.*" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Submit the assignment\n", "You have now completed lab 1! You can run the first cell below to regrade questions 1 and 2, for which autograder tests were provided. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# For your convenience, you can run this cell to run all the tests at once!\n", "import os\n", "_ = [ok.grade(q[:-3]) for q in os.listdir(\"tests\") if q.startswith('q')]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Once you have checked your solutions, please run the below cell to submit your lab to the OKpy autograder site. Once you run the cell, you will see a URL for the OKpy autograder site. You can click on this URL to verify that your lab was properly submitted." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "_ = ok.submit()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }