{ "cells": [ { "cell_type": "markdown", "source": [ "# Case Study 1: Diamonds" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "In this lesson, we're going to do some basic data analyses on a set of diamond characteristics and prices. " ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "import os\n", "my_dir = os.getcwd() # get current working directory\n", "my_dir" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "You should see something like `'/home/user/work/DIBS_materials/python'`. Make sure it ends with `/DIBS_materials/python`." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "!mkdir data # make directory called \"data\"\n", "target_dir = os.path.join(my_dir, 'data/')\n", "!wget -P \"$target_dir\" \"https://people.duke.edu/~jmp33/dibs/minerals.csv\" # download csv to data folder\n", "\n", "# if this doesn't work, manually download `minerals.csv` from https://people.duke.edu/~jmp33/dibs/ \n", "# to your local machine, and upload it to `data` folder" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "If you open the file in a text editor, you will see that it consists of a bunch of lines, each with a bunch of commas. This is a csv or \"comma-separated value\" file. Every row represents a record (like a row in a spreadsheet), with each cell separated by a comma. The first row has the same format, but the entries are the names of the columns." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Loading the data" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "We would like to load this data into Python. But to do so, we will need to tell Python that we want to get some tools ready. These tools are located in a library (the Python term is \"module\") called Pandas. 
So we do this:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 1, "source": [ "import pandas as pd" ], "outputs": [], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "This tells Python to load the `pandas` module and nickname it `pd`. Then, whenever we want to use functionality from `pandas`, we do so by giving the address of the function we want. In this case, we want to load data from a csv file, so we call the function `read_csv`:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 2, "source": [ "# we can include comments like this\n", "\n", "# note that for the following to work, you will need to be running the notebook from a folder\n", "# with a subdirectory called data that has the minerals.csv file inside\n", "data = pd.read_csv('data/minerals.csv')" ], "outputs": [], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Let's read from the left:\n", "\n", "`data =` \n", "\n", "tells Python that we want to do something (on the right-hand side of the equals sign) and assign its output to a variable called `data`. We could have called it `duck` or `shotput` or `harmony`, but we want to give our variables meaningful names, and in cases where we are only going to be exploring a single dataset, it's convenient to name it the obvious thing.\n", "\n", "On the right-hand side, we're doing several things:\n", "- We're telling Python to look in the `pandas` module (nicknamed `pd`)\n", "- We're telling Python to call a function named `read_csv` that's found there (we will find that looking up functions in modules is a lot like looking up files in directories)\n", "- We're giving the function the (local) path to the file in quotes\n", "\n", "We will see this pattern repeatedly in Python. We use names like `read_csv` to tell Python to perform actions. In parentheses, we will supply variables or pieces of information needed as inputs by Python to perform those actions. 
The actions are called functions (related to the idea of functions in math, which are objects that take inputs and produce an output), and the pieces of information inside parentheses are called \"arguments.\" Much more on all of this later." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Examining the data" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "So what did we accomplish?\n", "\n", "The easiest way to see is by asking Python to print the variable to the screen. We can do this with" ], "metadata": {} }, { "cell_type": "code", "execution_count": 3, "source": [ "print(data)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " item shape carat cut color clarity polish symmetry depth table \\\n", "0 1 PS 0.28 F F SI2 VG G 61.6 50.0 \n", "1 2 PS 0.23 G F SI1 VG G 67.5 51.0 \n", "2 3 EC 0.34 G I VS2 VG VG 71.6 65.0 \n", "3 4 MQ 0.34 G J VS1 G G 68.2 55.0 \n", "4 5 PS 0.23 G D SI1 VG VG 55.0 58.0 \n", "5 6 MQ 0.23 F G VS2 VG G 71.6 55.0 \n", "6 7 RA 0.37 F I SI2 G G 79.0 76.0 \n", "7 8 EC 0.24 VG E SI1 VG VG 68.6 65.0 \n", "8 9 RD 0.24 I G SI2 VG VG 62.0 59.0 \n", "9 10 MQ 0.33 G H SI2 G G 66.7 61.0 \n", "10 11 RD 0.23 VG E SI2 G VG 59.7 59.0 \n", "11 12 PS 0.35 F H SI2 G G 50.0 55.0 \n", "12 13 PS 0.23 VG D SI1 EX VG 65.8 61.0 \n", "13 14 RD 0.25 I H SI2 VG EX 61.2 58.0 \n", "14 15 RD 0.25 VG D SI2 VG VG 62.4 59.0 \n", "15 16 RA 0.33 G J VVS1 VG VG 59.9 69.0 \n", "16 17 RD 0.27 I H SI2 VG VG 61.7 56.0 \n", "17 18 CU 0.23 G D VS2 VG VG 56.3 66.0 \n", "18 19 RA 0.23 VG F VS1 EX VG 65.2 71.0 \n", "19 20 RA 0.23 VG F VS2 VG G 68.6 74.0 \n", "20 21 PS 0.39 G J VS1 G G 54.2 66.0 \n", "21 22 OV 0.23 F F VS1 EX G 51.4 65.0 \n", "22 23 OV 0.33 F I SI2 VG VG 52.4 53.0 \n", "23 24 RA 0.23 G G VVS2 VG VG 59.4 71.0 \n", "24 25 EC 0.32 VG I SI1 EX EX 67.8 60.0 \n", "25 26 CU 0.32 VG I VS2 EX VG 63.8 61.0 \n", "26 27 MQ 0.36 VG G SI2 G G 64.9 60.0 \n", "27 28 EC 0.23 G G VVS1 VG VG 68.8 73.0 \n", "28 29 EC 0.31 VG F SI2 VG VG 
65.9 69.0 \n", "29 30 EC 0.31 VG D SI2 VG VG 68.2 68.0 \n", "... ... ... ... .. ... ... ... ... ... ... \n", "65346 65347 RD 6.08 I F VVS2 EX EX 62.0 56.0 \n", "65347 65348 EC 5.00 VG D IF VG G 67.3 60.0 \n", "65348 65349 EC 8.01 VG F VVS2 G G 68.7 61.0 \n", "65349 65350 PS 16.34 G G VVS2 VG G 54.1 62.0 \n", "65350 65351 RD 12.33 I J VS1 EX EX 60.4 58.0 \n", "65351 65352 OV 5.27 VG D IF EX EX 60.8 61.0 \n", "65352 65353 RD 8.75 I G VS1 EX EX 62.1 57.0 \n", "65353 65354 PS 6.22 G D IF VG VG 56.0 54.0 \n", "65354 65355 EC 11.11 VG H VVS2 EX G 60.9 64.0 \n", "65355 65356 RA 11.16 VG D VS2 G G 66.2 68.0 \n", "65356 65357 RD 5.24 I E IF EX EX 59.7 59.0 \n", "65357 65358 RD 5.27 I D VVS1 EX EX 61.8 56.0 \n", "65358 65359 RD 6.02 VG D VVS1 VG G 61.1 59.0 \n", "65359 65360 RD 4.83 I D IF EX EX 61.9 55.0 \n", "65360 65361 RD 7.53 I F VVS2 EX EX 62.7 54.0 \n", "65361 65362 EC 13.11 VG H VVS2 EX EX 62.5 63.0 \n", "65362 65363 OV 11.15 G F VS2 VG VG 65.3 58.0 \n", "65363 65364 RD 5.24 I D IF VG VG 62.6 54.0 \n", "65364 65365 EC 7.25 F D IF VG VG 69.4 53.0 \n", "65365 65366 RD 5.34 I D IF EX EX 61.6 57.0 \n", "65366 65367 RD 5.47 I D IF EX EX 62.4 54.0 \n", "65367 65368 RD 6.02 I D FL EX EX 62.8 57.0 \n", "65368 65369 RD 6.54 VG D VVS1 VG VG 58.0 64.0 \n", "65369 65370 RD 9.56 I F FL EX EX 60.3 60.0 \n", "65370 65371 CU 8.40 G D IF VG G 57.9 59.0 \n", "65371 65372 RD 10.13 I F IF EX EX 60.3 58.0 \n", "65372 65373 RD 20.13 I J VS1 EX EX 59.2 59.0 \n", "65373 65374 RD 12.35 I G IF EX EX 59.8 60.0 \n", "65374 65375 RD 9.19 I E IF EX EX 60.9 60.0 \n", "65375 65376 RD 10.13 I D FL EX EX 62.5 57.0 \n", "\n", " fluorescence price per carat culet length to width ratio \\\n", "0 Faint 864 None 1.65 \n", "1 None 1057 None 1.46 \n", "2 Faint 812 None 1.40 \n", "3 Faint 818 None 1.52 \n", "4 None 1235 None 1.42 \n", "5 None 1248 None 1.95 \n", "6 None 781 None 1.03 \n", "7 Faint 1204 None 1.34 \n", "8 None 1204 None 1.01 \n", "9 Medium 888 None 2.02 \n", "10 None 1278 None 1.01 \n", "11 
None 860 Small 1.35 \n", "12 None 1322 None 1.52 \n", "13 None 1216 None 1.00 \n", "14 None 1280 None 1.01 \n", "15 None 976 None 1.11 \n", "16 None 1193 None 1.01 \n", "17 Faint 1426 None 1.08 \n", "18 None 1443 None 1.13 \n", "19 None 1452 None 1.15 \n", "20 None 859 Slightly Large 1.37 \n", "21 None 1461 None 1.35 \n", "22 Faint 1018 None 1.41 \n", "23 None 1474 None 1.09 \n", "24 None 1059 None 1.45 \n", "25 None 1059 None 1.10 \n", "26 None 944 Small 1.96 \n", "27 None 1496 None 1.30 \n", "28 None 1110 None 1.44 \n", "29 None 1113 None 1.35 \n", "... ... ... ... ... \n", "65346 None 101250 None 1.00 \n", "65347 None 123700 Small 1.47 \n", "65348 None 77715 Small 1.49 \n", "65349 None 38295 None 1.59 \n", "65350 Faint 51862 None 1.01 \n", "65351 None 121968 None 1.47 \n", "65352 None 74829 Very Small 1.01 \n", "65353 None 105450 None 1.51 \n", "65354 None 59629 Very Small 1.13 \n", "65355 Medium Blue 61372 None 1.17 \n", "65356 None 133488 None 1.01 \n", "65357 None 132894 None 1.01 \n", "65358 Strong 122082 Very Small 1.01 \n", "65359 None 155322 None 1.00 \n", "65360 None 104247 None 1.01 \n", "65361 Faint 60384 Small 1.21 \n", "65362 Faint 80420 Very Small 1.37 \n", "65363 None 173186 None 1.01 \n", "65364 None 133200 Medium 1.33 \n", "65365 None 185328 None 1.00 \n", "65366 None 183475 None 1.00 \n", "65367 Faint 168340 None 1.01 \n", "65368 None 169873 Very Small 1.01 \n", "65369 None 120336 Very Small 1.01 \n", "65370 None 144300 Slightly Large 1.20 \n", "65371 None 130112 Very Small 1.01 \n", "65372 None 66420 None 1.01 \n", "65373 Faint 110662 None 1.01 \n", "65374 None 150621 None 1.00 \n", "65375 None 256150 None 1.00 \n", "\n", " delivery date price \n", "0 \\r\\nJul 8\\r\\n 242.0 \n", "1 \\r\\nJul 8\\r\\n 243.0 \n", "2 \\r\\nJul 12\\r\\n 276.0 \n", "3 \\r\\nJul 8\\r\\n 278.0 \n", "4 \\r\\nJul 14\\r\\n 284.0 \n", "5 \\r\\nJul 8\\r\\n 287.0 \n", "6 \\r\\nJul 8\\r\\n 289.0 \n", "7 \\r\\nJul 8\\r\\n 289.0 \n", "8 \\r\\nJul 8\\r\\n 289.0 \n", "9 
\\r\\nJul 8\\r\\n 293.0 \n", "10 \\r\\nJul 14\\r\\n 294.0 \n", "11 \\r\\nJul 8\\r\\n 301.0 \n", "12 \\r\\nJul 14\\r\\n 304.0 \n", "13 \\r\\nJul 14\\r\\n 304.0 \n", "14 \\r\\nJul 14\\r\\n 320.0 \n", "15 \\r\\nJul 8\\r\\n 322.0 \n", "16 \\r\\nJul 14\\r\\n 322.0 \n", "17 \\r\\nJul 8\\r\\n 328.0 \n", "18 \\r\\nJul 8\\r\\n 332.0 \n", "19 \\r\\nJul 8\\r\\n 334.0 \n", "20 \\r\\nJul 8\\r\\n 335.0 \n", "21 \\r\\nJul 8\\r\\n 336.0 \n", "22 \\r\\nJul 8\\r\\n 336.0 \n", "23 \\r\\nJul 8\\r\\n 339.0 \n", "24 \\r\\nJul 8\\r\\n 339.0 \n", "25 \\r\\nJul 8\\r\\n 339.0 \n", "26 \\r\\nJul 8\\r\\n 340.0 \n", "27 \\r\\nJul 8\\r\\n 344.0 \n", "28 \\r\\nJul 8\\r\\n 344.0 \n", "29 \\r\\nJul 8\\r\\n 345.0 \n", "... ... ... \n", "65346 \\r\\nJul 8\\r\\n 615600.0 \n", "65347 \\r\\nJul 12\\r\\n 618498.0 \n", "65348 \\r\\nJul 12\\r\\n 622495.0 \n", "65349 \\r\\nJul 14\\r\\n 625741.0 \n", "65350 \\r\\nJul 13\\r\\n 639454.0 \n", "65351 \\r\\nJul 12\\r\\n 642772.0 \n", "65352 \\r\\nJul 8\\r\\n 654753.0 \n", "65353 \\r\\nJul 8\\r\\n 655899.0 \n", "65354 \\r\\nJul 8\\r\\n 662481.0 \n", "65355 \\r\\nJul 8\\r\\n 684911.0 \n", "65356 \\r\\nJul 8\\r\\n 699478.0 \n", "65357 \\r\\nJul 8\\r\\n 700352.0 \n", "65358 \\r\\nJul 12\\r\\n 734935.0 \n", "65359 \\r\\nJul 18\\r\\n 750205.0 \n", "65360 \\r\\nJul 12\\r\\n 784980.0 \n", "65361 \\r\\nJul 8\\r\\n 791635.0 \n", "65362 \\r\\nJul 8\\r\\n 896678.0 \n", "65363 \\r\\nJul 8\\r\\n 907494.0 \n", "65364 \\r\\nJul 8\\r\\n 965700.0 \n", "65365 \\r\\nJul 8\\r\\n 989652.0 \n", "65366 \\r\\nJul 12\\r\\n 1003607.0 \n", "65367 \\r\\nJul 12\\r\\n 1013405.0 \n", "65368 \\r\\nJul 13\\r\\n 1110971.0 \n", "65369 \\r\\nJul 8\\r\\n 1150413.0 \n", "65370 \\r\\nJul 8\\r\\n 1212120.0 \n", "65371 \\r\\nJul 8\\r\\n 1318034.0 \n", "65372 \\r\\nJul 14\\r\\n 1337035.0 \n", "65373 \\r\\nJul 12\\r\\n 1366679.0 \n", "65374 \\r\\nJul 13\\r\\n 1384207.0 \n", "65375 \\r\\nJul 12\\r\\n 2594800.0 \n", "\n", "[65376 rows x 16 columns]\n" ] } ], "metadata": { "collapsed": false } }, { 
"cell_type": "markdown", "source": [ "You should be able to see that the data consists of a bunch of rows and columns, and that, at some point in the middle, Python puts a bunch of ...'s, indicating it's not printing all the information. That's good in this case, since the data are pretty big." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "So how big are the data? We can find this out by typing" ], "metadata": {} }, { "cell_type": "code", "execution_count": 4, "source": [ "data.shape" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(65376, 16)" ] }, "metadata": {}, "execution_count": 4 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "We could have gotten the same answer by typing\n", "\n", "```python\n", "print(data.shape)\n", "```\n", "but when we just type the variable name, Python assumes we mean `print`." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "So what does this answer mean? It means that our data have something like 65,000 rows and 16 columns. Here, the convention is (rows, columns)." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Notice also that the way we got this piece of information was by typing the variable, followed by `.`, followed by the name of a property (called an \"attribute\"). Again, you can think of this variable as an object having both pieces of information (attributes) and pieces of behavior (functions or methods) tucked inside of it like a file system. The way we access those is by giving a path, except with `.` instead of `/`." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "But there's an even friendlier way to look at our data that's special to the notebook. 
To look at the first few rows of our data, we can type" ], "metadata": {} }, { "cell_type": "code", "execution_count": 5, "source": [ "data.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " item shape carat cut color clarity polish symmetry depth table \\\n", "0 1 PS 0.28 F F SI2 VG G 61.6 50.0 \n", "1 2 PS 0.23 G F SI1 VG G 67.5 51.0 \n", "2 3 EC 0.34 G I VS2 VG VG 71.6 65.0 \n", "3 4 MQ 0.34 G J VS1 G G 68.2 55.0 \n", "4 5 PS 0.23 G D SI1 VG VG 55.0 58.0 \n", "\n", " fluorescence price per carat culet length to width ratio \\\n", "0 Faint 864 None 1.65 \n", "1 None 1057 None 1.46 \n", "2 Faint 812 None 1.40 \n", "3 Faint 818 None 1.52 \n", "4 None 1235 None 1.42 \n", "\n", " delivery date price \n", "0 \\r\\nJul 8\\r\\n 242.0 \n", "1 \\r\\nJul 8\\r\\n 243.0 \n", "2 \\r\\nJul 12\\r\\n 276.0 \n", "3 \\r\\nJul 8\\r\\n 278.0 \n", "4 \\r\\nJul 14\\r\\n 284.0 " ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 5 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "This gives 5 rows by default (**note that counting starts at 0!**), but we can easily ask Python for 10:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 6, "source": [ "data.head(10)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " item shape carat cut color clarity polish symmetry depth table \\\n", "0 1 PS 0.28 F F SI2 VG G 61.6 50.0 \n", "1 2 PS 0.23 G F SI1 VG G 67.5 51.0 \n", "2 3 EC 0.34 G I VS2 VG VG 71.6 65.0 \n", "3 4 MQ 0.34 G J VS1 G G 68.2 55.0 \n", "4 5 PS 0.23 G D SI1 VG VG 55.0 58.0 \n", "5 6 MQ 0.23 F G VS2 VG G 71.6 55.0 \n", "6 7 RA 0.37 F I SI2 G G 79.0 76.0 \n", "7 8 EC 0.24 VG E SI1 VG VG 68.6 65.0 \n", "8 9 RD 0.24 I G SI2 VG VG 62.0 59.0 \n", "9 10 MQ 0.33 G H SI2 G G 66.7 61.0 \n", "\n", " fluorescence price per carat culet length to width ratio \\\n", "0 Faint 864 None 1.65 \n", "1 None 1057 None 1.46 \n", "2 Faint 812 None 1.40 \n", "3 Faint 818 None 1.52 \n", "4 None 1235 None 1.42 \n", "5 None 1248 None 1.95 \n", "6 None 781 None 1.03 \n", "7 Faint 1204 None 1.34 \n", "8 None 1204 None 1.01 \n", "9 Medium 888 None 2.02 \n", "\n", " delivery date price \n", "0 \\r\\nJul 8\\r\\n 242.0 \n", "1 \\r\\nJul 8\\r\\n 243.0 \n", "2 \\r\\nJul 12\\r\\n 276.0 \n", "3 \\r\\nJul 8\\r\\n 278.0 \n", "4 \\r\\nJul 14\\r\\n 284.0 \n", "5 \\r\\nJul 8\\r\\n 287.0 \n", "6 \\r\\nJul 8\\r\\n 289.0 \n", "7 \\r\\nJul 8\\r\\n 289.0 \n", "8 \\r\\nJul 8\\r\\n 289.0 \n", "9 \\r\\nJul 8\\r\\n 293.0 " ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 6 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Here, `head` is a method of the variable `data` (meaning it's a function stored in the data object). In the second case, we explicitly told `head` how many rows we wanted, while in the first, the number defaulted to 5. " ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Likewise, we can ask for the last few rows of the dataset with `tail`:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 7, "source": [ "data.tail(7)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " item shape carat cut color clarity polish symmetry depth table \\\n", "65369 65370 RD 9.56 I F FL EX EX 60.3 60.0 \n", "65370 65371 CU 8.40 G D IF VG G 57.9 59.0 \n", "65371 65372 RD 10.13 I F IF EX EX 60.3 58.0 \n", "65372 65373 RD 20.13 I J VS1 EX EX 59.2 59.0 \n", "65373 65374 RD 12.35 I G IF EX EX 59.8 60.0 \n", "65374 65375 RD 9.19 I E IF EX EX 60.9 60.0 \n", "65375 65376 RD 10.13 I D FL EX EX 62.5 57.0 \n", "\n", " fluorescence price per carat culet length to width ratio \\\n", "65369 None 120336 Very Small 1.01 \n", "65370 None 144300 Slightly Large 1.20 \n", "65371 None 130112 Very Small 1.01 \n", "65372 None 66420 None 1.01 \n", "65373 Faint 110662 None 1.01 \n", "65374 None 150621 None 1.00 \n", "65375 None 256150 None 1.00 \n", "\n", " delivery date price \n", "65369 \\r\\nJul 8\\r\\n 1150413.0 \n", "65370 \\r\\nJul 8\\r\\n 1212120.0 \n", "65371 \\r\\nJul 8\\r\\n 1318034.0 \n", "65372 \\r\\nJul 14\\r\\n 1337035.0 \n", "65373 \\r\\nJul 12\\r\\n 1366679.0 \n", "65374 \\r\\nJul 13\\r\\n 1384207.0 \n", "65375 \\r\\nJul 12\\r\\n 2594800.0 " ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 7 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "If you look carefully, you might notice that the rows seem to be sorted by the last column, price. The odd characters under delivery date are a result of the process used to download these data from the internet." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Finally, as a first pass, we might just want to know some simple summary statistics of our data:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 8, "source": [ "data.describe()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " item carat depth table \\\n", "count 65376.000000 65376.000000 65376.000000 65376.000000 \n", "mean 32688.500000 0.878652 63.245745 59.397770 \n", "std 18872.569936 0.783895 3.861799 4.868447 \n", "min 1.000000 0.230000 6.700000 6.000000 \n", "25% 16344.750000 0.400000 61.200000 57.000000 \n", "50% 32688.500000 0.700000 62.200000 58.000000 \n", "75% 49032.250000 1.020000 64.400000 61.000000 \n", "max 65376.000000 20.130000 80.000000 555.000000 \n", "\n", " price per carat length to width ratio price \n", "count 65376.000000 65376.000000 6.537600e+04 \n", "mean 6282.785365 1.111838 9.495346e+03 \n", "std 7198.173546 0.212317 3.394968e+04 \n", "min 781.000000 0.770000 2.420000e+02 \n", "25% 2967.000000 1.010000 1.121000e+03 \n", "50% 4150.000000 1.010000 2.899000e+03 \n", "75% 6769.000000 1.110000 6.752250e+03 \n", "max 256150.000000 3.120000 2.594800e+06 " ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 8 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "# The fine art of looking things up" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "The Python ecosystem is *huge*. Nobody knows all the functions for all the libraries. This means that when you start analyzing data in earnest, you will need to learn the parts important for solving your particular problem. Initially, this will be difficult; everything will be new to you. Eventually, though, you develop a conceptual base that will be easy to add to." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "So what should you do in this case? How do we learn more about the functions we've used so far?\n", "\n", "First off, let's figure out what type of object `data` is. Every variable in Python is an object (meaning it has both information and behavior stored inside of it), and every object has a type. We can find this by using the `type` function:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 9, "source": [ "type(1)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "int" ] }, "metadata": {}, "execution_count": 9 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Here, `int` means integer, a number with no decimal part." ], "metadata": {} }, { "cell_type": "code", "execution_count": 10, "source": [ "type(1.5)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "float" ] }, "metadata": {}, "execution_count": 10 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Float is anything with a decimal point. Be aware that the precision of `float` variables is limited, and there is the potential for roundoff errors in calculations if you ever start to do serious data crunching (though most of the time you'll be fine)." 
], "metadata": {} }, { "cell_type": "code", "execution_count": 11, "source": [ "type(data)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "metadata": {}, "execution_count": 11 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "So what in the world is this?\n", "\n", "Read it like so: the type of the object `data` is defined in the `pandas` module, in the `core` submodule, in the `frame` sub-submodule, and it is `DataFrame`. Again, using our filesystem analogy, the `data` variable has type `DataFrame`, and Python gives us the full path to its definition. As we will see, dataframes are a very convenient type of object in which to store our data, since this type of object carries with it very powerful behaviors (methods) that can be used to clean and analyze data. " ], "metadata": {} }, { "cell_type": "markdown", "source": [ "So if `type` tells us the type of object, how do we find out what's in it?\n", "\n", "We can do that with the `dir` function. 
`dir` is short for \"directory,\" and tells us the name of all the attributes (information) and methods (behaviors) associated with an object:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 12, "source": [ "dir(data)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['T',\n", " '_AXIS_ALIASES',\n", " '_AXIS_IALIASES',\n", " '_AXIS_LEN',\n", " '_AXIS_NAMES',\n", " '_AXIS_NUMBERS',\n", " '_AXIS_ORDERS',\n", " '_AXIS_REVERSED',\n", " '_AXIS_SLICEMAP',\n", " '__abs__',\n", " '__add__',\n", " '__and__',\n", " '__array__',\n", " '__array_wrap__',\n", " '__bool__',\n", " '__bytes__',\n", " '__class__',\n", " '__contains__',\n", " '__delattr__',\n", " '__delitem__',\n", " '__dict__',\n", " '__dir__',\n", " '__div__',\n", " '__doc__',\n", " '__eq__',\n", " '__finalize__',\n", " '__floordiv__',\n", " '__format__',\n", " '__ge__',\n", " '__getattr__',\n", " '__getattribute__',\n", " '__getitem__',\n", " '__getstate__',\n", " '__gt__',\n", " '__hash__',\n", " '__iadd__',\n", " '__imul__',\n", " '__init__',\n", " '__invert__',\n", " '__ipow__',\n", " '__isub__',\n", " '__iter__',\n", " '__itruediv__',\n", " '__le__',\n", " '__len__',\n", " '__lt__',\n", " '__mod__',\n", " '__module__',\n", " '__mul__',\n", " '__ne__',\n", " '__neg__',\n", " '__new__',\n", " '__nonzero__',\n", " '__or__',\n", " '__pow__',\n", " '__radd__',\n", " '__rand__',\n", " '__rdiv__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__rfloordiv__',\n", " '__rmod__',\n", " '__rmul__',\n", " '__ror__',\n", " '__round__',\n", " '__rpow__',\n", " '__rsub__',\n", " '__rtruediv__',\n", " '__rxor__',\n", " '__setattr__',\n", " '__setitem__',\n", " '__setstate__',\n", " '__sizeof__',\n", " '__str__',\n", " '__sub__',\n", " '__subclasshook__',\n", " '__truediv__',\n", " '__unicode__',\n", " '__weakref__',\n", " '__xor__',\n", " '_accessors',\n", " '_add_numeric_operations',\n", " '_add_series_only_operations',\n", " 
'_add_series_or_dataframe_operations',\n", " '_agg_by_level',\n", " '_align_frame',\n", " '_align_series',\n", " '_apply_broadcast',\n", " '_apply_empty_result',\n", " '_apply_raw',\n", " '_apply_standard',\n", " '_at',\n", " '_box_col_values',\n", " '_box_item_values',\n", " '_check_inplace_setting',\n", " '_check_is_chained_assignment_possible',\n", " '_check_percentile',\n", " '_check_setitem_copy',\n", " '_clear_item_cache',\n", " '_combine_const',\n", " '_combine_frame',\n", " '_combine_match_columns',\n", " '_combine_match_index',\n", " '_combine_series',\n", " '_combine_series_infer',\n", " '_compare_frame',\n", " '_compare_frame_evaluate',\n", " '_consolidate_inplace',\n", " '_construct_axes_dict',\n", " '_construct_axes_dict_for_slice',\n", " '_construct_axes_dict_from',\n", " '_construct_axes_from_arguments',\n", " '_constructor',\n", " '_constructor_expanddim',\n", " '_constructor_sliced',\n", " '_convert',\n", " '_count_level',\n", " '_create_indexer',\n", " '_dir_additions',\n", " '_dir_deletions',\n", " '_ensure_valid_index',\n", " '_expand_axes',\n", " '_flex_compare_frame',\n", " '_from_arrays',\n", " '_from_axes',\n", " '_get_agg_axis',\n", " '_get_axis',\n", " '_get_axis_name',\n", " '_get_axis_number',\n", " '_get_axis_resolvers',\n", " '_get_block_manager_axis',\n", " '_get_bool_data',\n", " '_get_cacher',\n", " '_get_index_resolvers',\n", " '_get_item_cache',\n", " '_get_numeric_data',\n", " '_get_values',\n", " '_getitem_array',\n", " '_getitem_column',\n", " '_getitem_frame',\n", " '_getitem_multilevel',\n", " '_getitem_slice',\n", " '_iat',\n", " '_iget_item_cache',\n", " '_iloc',\n", " '_indexed_same',\n", " '_info_axis',\n", " '_info_axis_name',\n", " '_info_axis_number',\n", " '_info_repr',\n", " '_init_dict',\n", " '_init_mgr',\n", " '_init_ndarray',\n", " '_internal_names',\n", " '_internal_names_set',\n", " '_is_cached',\n", " '_is_datelike_mixed_type',\n", " '_is_mixed_type',\n", " '_is_numeric_mixed_type',\n", " '_is_view',\n", " 
'_ix',\n", " '_ixs',\n", " '_join_compat',\n", " '_loc',\n", " '_maybe_cache_changed',\n", " '_maybe_update_cacher',\n", " '_metadata',\n", " '_needs_reindex_multi',\n", " '_nsorted',\n", " '_protect_consolidate',\n", " '_reduce',\n", " '_reindex_axes',\n", " '_reindex_axis',\n", " '_reindex_columns',\n", " '_reindex_index',\n", " '_reindex_multi',\n", " '_reindex_with_indexers',\n", " '_repr_fits_horizontal_',\n", " '_repr_fits_vertical_',\n", " '_repr_html_',\n", " '_repr_latex_',\n", " '_reset_cache',\n", " '_reset_cacher',\n", " '_sanitize_column',\n", " '_series',\n", " '_set_as_cached',\n", " '_set_axis',\n", " '_set_axis_name',\n", " '_set_is_copy',\n", " '_set_item',\n", " '_setitem_array',\n", " '_setitem_frame',\n", " '_setitem_slice',\n", " '_setup_axes',\n", " '_slice',\n", " '_stat_axis',\n", " '_stat_axis_name',\n", " '_stat_axis_number',\n", " '_typ',\n", " '_unpickle_frame_compat',\n", " '_unpickle_matrix_compat',\n", " '_update_inplace',\n", " '_validate_dtype',\n", " '_values',\n", " '_xs',\n", " 'abs',\n", " 'add',\n", " 'add_prefix',\n", " 'add_suffix',\n", " 'align',\n", " 'all',\n", " 'any',\n", " 'append',\n", " 'apply',\n", " 'applymap',\n", " 'as_blocks',\n", " 'as_matrix',\n", " 'asfreq',\n", " 'assign',\n", " 'astype',\n", " 'at',\n", " 'at_time',\n", " 'axes',\n", " 'between_time',\n", " 'bfill',\n", " 'blocks',\n", " 'bool',\n", " 'boxplot',\n", " 'carat',\n", " 'clarity',\n", " 'clip',\n", " 'clip_lower',\n", " 'clip_upper',\n", " 'color',\n", " 'columns',\n", " 'combine',\n", " 'combineAdd',\n", " 'combineMult',\n", " 'combine_first',\n", " 'compound',\n", " 'consolidate',\n", " 'convert_objects',\n", " 'copy',\n", " 'corr',\n", " 'corrwith',\n", " 'count',\n", " 'cov',\n", " 'culet',\n", " 'cummax',\n", " 'cummin',\n", " 'cumprod',\n", " 'cumsum',\n", " 'cut',\n", " 'depth',\n", " 'describe',\n", " 'diff',\n", " 'div',\n", " 'divide',\n", " 'dot',\n", " 'drop',\n", " 'drop_duplicates',\n", " 'dropna',\n", " 'dtypes',\n", " 
'duplicated',\n", " 'empty',\n", " 'eq',\n", " 'equals',\n", " 'eval',\n", " 'ewm',\n", " 'expanding',\n", " 'ffill',\n", " 'fillna',\n", " 'filter',\n", " 'first',\n", " 'first_valid_index',\n", " 'floordiv',\n", " 'fluorescence',\n", " 'from_csv',\n", " 'from_dict',\n", " 'from_items',\n", " 'from_records',\n", " 'ftypes',\n", " 'ge',\n", " 'get',\n", " 'get_dtype_counts',\n", " 'get_ftype_counts',\n", " 'get_value',\n", " 'get_values',\n", " 'groupby',\n", " 'gt',\n", " 'head',\n", " 'hist',\n", " 'iat',\n", " 'icol',\n", " 'idxmax',\n", " 'idxmin',\n", " 'iget_value',\n", " 'iloc',\n", " 'index',\n", " 'info',\n", " 'insert',\n", " 'interpolate',\n", " 'irow',\n", " 'is_copy',\n", " 'isin',\n", " 'isnull',\n", " 'item',\n", " 'items',\n", " 'iteritems',\n", " 'iterkv',\n", " 'iterrows',\n", " 'itertuples',\n", " 'ix',\n", " 'join',\n", " 'keys',\n", " 'kurt',\n", " 'kurtosis',\n", " 'last',\n", " 'last_valid_index',\n", " 'le',\n", " 'loc',\n", " 'lookup',\n", " 'lt',\n", " 'mad',\n", " 'mask',\n", " 'max',\n", " 'mean',\n", " 'median',\n", " 'memory_usage',\n", " 'merge',\n", " 'min',\n", " 'mod',\n", " 'mode',\n", " 'mul',\n", " 'multiply',\n", " 'ndim',\n", " 'ne',\n", " 'nlargest',\n", " 'notnull',\n", " 'nsmallest',\n", " 'pct_change',\n", " 'pipe',\n", " 'pivot',\n", " 'pivot_table',\n", " 'plot',\n", " 'polish',\n", " 'pop',\n", " 'pow',\n", " 'price',\n", " 'prod',\n", " 'product',\n", " 'quantile',\n", " 'query',\n", " 'radd',\n", " 'rank',\n", " 'rdiv',\n", " 'reindex',\n", " 'reindex_axis',\n", " 'reindex_like',\n", " 'rename',\n", " 'rename_axis',\n", " 'reorder_levels',\n", " 'replace',\n", " 'resample',\n", " 'reset_index',\n", " 'rfloordiv',\n", " 'rmod',\n", " 'rmul',\n", " 'rolling',\n", " 'round',\n", " 'rpow',\n", " 'rsub',\n", " 'rtruediv',\n", " 'sample',\n", " 'select',\n", " 'select_dtypes',\n", " 'sem',\n", " 'set_axis',\n", " 'set_index',\n", " 'set_value',\n", " 'shape',\n", " 'shift',\n", " 'size',\n", " 'skew',\n", " 
'slice_shift',\n", " 'sort',\n", " 'sort_index',\n", " 'sort_values',\n", " 'sortlevel',\n", " 'squeeze',\n", " 'stack',\n", " 'std',\n", " 'style',\n", " 'sub',\n", " 'subtract',\n", " 'sum',\n", " 'swapaxes',\n", " 'swaplevel',\n", " 'symmetry',\n", " 'table',\n", " 'tail',\n", " 'take',\n", " 'to_clipboard',\n", " 'to_csv',\n", " 'to_dense',\n", " 'to_dict',\n", " 'to_excel',\n", " 'to_gbq',\n", " 'to_hdf',\n", " 'to_html',\n", " 'to_json',\n", " 'to_latex',\n", " 'to_msgpack',\n", " 'to_panel',\n", " 'to_period',\n", " 'to_pickle',\n", " 'to_records',\n", " 'to_sparse',\n", " 'to_sql',\n", " 'to_stata',\n", " 'to_string',\n", " 'to_timestamp',\n", " 'to_wide',\n", " 'to_xarray',\n", " 'transpose',\n", " 'truediv',\n", " 'truncate',\n", " 'tshift',\n", " 'tz_convert',\n", " 'tz_localize',\n", " 'unstack',\n", " 'update',\n", " 'values',\n", " 'var',\n", " 'where',\n", " 'xs']" ] }, "metadata": {}, "execution_count": 12 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Whoa!\n", "\n", "Okay, keep in mind a few things:\n", "- One of the sayings in the Python credo is \"We're all adults here.\" Python trusts you as a programmer. Sometimes too much. This means that, in a case like this, you might get more info than you really need or can handle.\n", "- Any name in that list (and the output of this function is indeed a type of object called a list) that begins with `_` or `__` is a private variable (like files beginning with `.` in the shell). You can safely ignore these until you are much more experienced.\n", "- I don't know what half of these do, either. If I need to know, I look them up." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "How do we do that?\n", "\n", "First, IPython has some pretty spiffy things that can help us right from the shell or notebook. 
For instance, if I want to learn about the `sort_values` item in the list, I can type" ], "metadata": {} }, { "cell_type": "code", "execution_count": 13, "source": [ "data.sort_values?" ], "outputs": [], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "And IPython will pop up a handy (or cryptic, as you might feel at first) documentation window for the function. At the very least, we are told at the top of the page that the type of `data.sort_values` is a method, meaning that it's a behavior and not a piece of information. The help then goes on to tell us what inputs the function takes and what outputs it gives." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "More realistically, I would Google\n", "\n", "`DataFrame.sort_values python`\n", "\n", "(Remember, DataFrame is the type of the data object, and the internet doesn't know that `data` is a variable name for us. So I ask for the type.method and throw in the keyword \"python\" so Google knows what I'm talking about.)\n", "\n", "The first result that pops up should be [this](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html).\n", "\n", "Good news!\n", "- This page is much easier to read.\n", "- This page is located on the website for Pandas, the library we're using. Even if we don't understand this particular help page, there's probably a tutorial somewhere nearby that will get us started." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "From this page, if we look carefully, we might be able to puzzle out that if we want to sort the dataset by price per carat, we can do" ], "metadata": {} }, { "cell_type": "code", "execution_count": 14, "source": [ "data_sorted = data.sort_values(by='price per carat')" ], "outputs": [], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Notice that we have to save the output of the sort here, since `sort_values` doesn't modify the `data` variable. 
Instead, it returns a new data frame that we need to assign to a new variable name. In other words," ], "metadata": {} }, { "cell_type": "code", "execution_count": 15, "source": [ "data.head(10)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " item shape carat cut color clarity polish symmetry depth table \\\n", "0 1 PS 0.28 F F SI2 VG G 61.6 50.0 \n", "1 2 PS 0.23 G F SI1 VG G 67.5 51.0 \n", "2 3 EC 0.34 G I VS2 VG VG 71.6 65.0 \n", "3 4 MQ 0.34 G J VS1 G G 68.2 55.0 \n", "4 5 PS 0.23 G D SI1 VG VG 55.0 58.0 \n", "5 6 MQ 0.23 F G VS2 VG G 71.6 55.0 \n", "6 7 RA 0.37 F I SI2 G G 79.0 76.0 \n", "7 8 EC 0.24 VG E SI1 VG VG 68.6 65.0 \n", "8 9 RD 0.24 I G SI2 VG VG 62.0 59.0 \n", "9 10 MQ 0.33 G H SI2 G G 66.7 61.0 \n", "\n", " fluorescence price per carat culet length to width ratio \\\n", "0 Faint 864 None 1.65 \n", "1 None 1057 None 1.46 \n", "2 Faint 812 None 1.40 \n", "3 Faint 818 None 1.52 \n", "4 None 1235 None 1.42 \n", "5 None 1248 None 1.95 \n", "6 None 781 None 1.03 \n", "7 Faint 1204 None 1.34 \n", "8 None 1204 None 1.01 \n", "9 Medium 888 None 2.02 \n", "\n", " delivery date price \n", "0 \\r\\nJul 8\\r\\n 242.0 \n", "1 \\r\\nJul 8\\r\\n 243.0 \n", "2 \\r\\nJul 12\\r\\n 276.0 \n", "3 \\r\\nJul 8\\r\\n 278.0 \n", "4 \\r\\nJul 14\\r\\n 284.0 \n", "5 \\r\\nJul 8\\r\\n 287.0 \n", "6 \\r\\nJul 8\\r\\n 289.0 \n", "7 \\r\\nJul 8\\r\\n 289.0 \n", "8 \\r\\nJul 8\\r\\n 289.0 \n", "9 \\r\\nJul 8\\r\\n 293.0 " ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 15 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "looks the same as before, while" ], "metadata": {} }, { "cell_type": "code", "execution_count": 16, "source": [ "data_sorted.head(10)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " item shape carat cut color clarity polish symmetry depth table \\\n", "6 7 RA 0.37 F I SI2 G G 79.0 76.0 \n", "2 3 EC 0.34 G I VS2 VG VG 71.6 65.0 \n", "3 4 MQ 0.34 G J VS1 G G 68.2 55.0 \n", "20 21 PS 0.39 G J VS1 G G 54.2 66.0 \n", "11 12 PS 0.35 F H SI2 G G 50.0 55.0 \n", "0 1 PS 0.28 F F SI2 VG G 61.6 50.0 \n", "9 10 MQ 0.33 G H SI2 G G 66.7 61.0 \n", "49 50 MQ 0.39 VG I SI1 G G 64.9 61.0 \n", "48 49 PS 0.39 F G SI2 G G 57.0 68.0 \n", "26 27 MQ 0.36 VG G SI2 G G 64.9 60.0 \n", "\n", " fluorescence price per carat culet length to width ratio \\\n", "6 None 781 None 1.03 \n", "2 Faint 812 None 1.40 \n", "3 Faint 818 None 1.52 \n", "20 None 859 Slightly Large 1.37 \n", "11 None 860 Small 1.35 \n", "0 Faint 864 None 1.65 \n", "9 Medium 888 None 2.02 \n", "49 Faint 938 None 1.73 \n", "48 Faint 938 None 1.37 \n", "26 None 944 Small 1.96 \n", "\n", " delivery date price \n", "6 \\r\\nJul 8\\r\\n 289.0 \n", "2 \\r\\nJul 12\\r\\n 276.0 \n", "3 \\r\\nJul 8\\r\\n 278.0 \n", "20 \\r\\nJul 8\\r\\n 335.0 \n", "11 \\r\\nJul 8\\r\\n 301.0 \n", "0 \\r\\nJul 8\\r\\n 242.0 \n", "9 \\r\\nJul 8\\r\\n 293.0 \n", "49 \\r\\nJul 8\\r\\n 366.0 \n", "48 \\r\\nJul 8\\r\\n 366.0 \n", "26 \\r\\nJul 8\\r\\n 340.0 " ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 16 } ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 17, "source": [ "data_sorted.tail(10)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " item shape carat cut color clarity polish symmetry depth table \\\n", "65315 65316 RD 3.01 I D FL EX EX 61.8 58.0 \n", "65370 65371 CU 8.40 G D IF VG G 57.9 59.0 \n", "65374 65375 RD 9.19 I E IF EX EX 60.9 60.0 \n", "65359 65360 RD 4.83 I D IF EX EX 61.9 55.0 \n", "65367 65368 RD 6.02 I D FL EX EX 62.8 57.0 \n", "65368 65369 RD 6.54 VG D VVS1 VG VG 58.0 64.0 \n", "65363 65364 RD 5.24 I D IF VG VG 62.6 54.0 \n", "65366 65367 RD 5.47 I D IF EX EX 62.4 54.0 \n", "65365 65366 RD 5.34 I D IF EX EX 61.6 57.0 \n", "65375 65376 RD 10.13 I D FL EX EX 62.5 57.0 \n", "\n", " fluorescence price per carat culet length to width ratio \\\n", "65315 None 143063 None 1.00 \n", "65370 None 144300 Slightly Large 1.20 \n", "65374 None 150621 None 1.00 \n", "65359 None 155322 None 1.00 \n", "65367 Faint 168340 None 1.01 \n", "65368 None 169873 Very Small 1.01 \n", "65363 None 173186 None 1.01 \n", "65366 None 183475 None 1.00 \n", "65365 None 185328 None 1.00 \n", "65375 None 256150 None 1.00 \n", "\n", " delivery date price \n", "65315 \\r\\nJul 15\\r\\n 430619.0 \n", "65370 \\r\\nJul 8\\r\\n 1212120.0 \n", "65374 \\r\\nJul 13\\r\\n 1384207.0 \n", "65359 \\r\\nJul 18\\r\\n 750205.0 \n", "65367 \\r\\nJul 12\\r\\n 1013405.0 \n", "65368 \\r\\nJul 13\\r\\n 1110971.0 \n", "65363 \\r\\nJul 8\\r\\n 907494.0 \n", "65366 \\r\\nJul 12\\r\\n 1003607.0 \n", "65365 \\r\\nJul 8\\r\\n 989652.0 \n", "65375 \\r\\nJul 12\\r\\n 2594800.0 " ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 17 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Looks like it worked!" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Exercises" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Question 1\n", "By default, the sort function sorts in ascending order (lowest to highest). Use Google-fu to figure out how to sort the data in descending order by carat." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Question 2\n", "What happens if we sort by shape instead? What sort order is used?\n" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Making the most of data frames" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "So let's get down to some real data analysis.\n", "\n", "If we want to know what the columns in our data frame are (if, for instance, there are too many to fit onscreen, or we need the list of them to manipulate), we can do" ], "metadata": {} }, { "cell_type": "code", "execution_count": 18, "source": [ "data.columns" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['item', 'shape', 'carat', 'cut', 'color', 'clarity', 'polish',\n", " 'symmetry', 'depth', 'table', 'fluorescence', 'price per carat',\n", " 'culet', 'length to width ratio', 'delivery date', 'price'],\n", " dtype='object')" ] }, "metadata": {}, "execution_count": 18 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "So we see columns is an attribute, not a method (the giveaway in this case is that we did not have to use parentheses afterward, as we would with a function/method).\n", "\n", "What type of object is this?" 
], "metadata": {} }, { "cell_type": "code", "execution_count": 19, "source": [ "type(data.columns)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "pandas.indexes.base.Index" ] }, "metadata": {}, "execution_count": 19 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "And what can we do with an Index?" ], "metadata": {} }, { "cell_type": "code", "execution_count": 20, "source": [ "dir(data.columns)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['T',\n", " '__abs__',\n", " '__add__',\n", " '__and__',\n", " '__array__',\n", " '__array_priority__',\n", " '__array_wrap__',\n", " '__bool__',\n", " '__bytes__',\n", " '__class__',\n", " '__contains__',\n", " '__copy__',\n", " '__deepcopy__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__floordiv__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " '__getitem__',\n", " '__gt__',\n", " '__hash__',\n", " '__iadd__',\n", " '__init__',\n", " '__inv__',\n", " '__iter__',\n", " '__le__',\n", " '__len__',\n", " '__lt__',\n", " '__module__',\n", " '__mul__',\n", " '__ne__',\n", " '__neg__',\n", " '__new__',\n", " '__nonzero__',\n", " '__or__',\n", " '__pos__',\n", " '__pow__',\n", " '__radd__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__rfloordiv__',\n", " '__rmul__',\n", " '__rpow__',\n", " '__rtruediv__',\n", " '__setattr__',\n", " '__setitem__',\n", " '__setstate__',\n", " '__sizeof__',\n", " '__str__',\n", " '__sub__',\n", " '__subclasshook__',\n", " '__truediv__',\n", " '__unicode__',\n", " '__weakref__',\n", " '__xor__',\n", " '_add_comparison_methods',\n", " '_add_logical_methods',\n", " '_add_logical_methods_disabled',\n", " '_add_numeric_methods',\n", " '_add_numeric_methods_binary',\n", " '_add_numeric_methods_disabled',\n", " '_add_numeric_methods_unary',\n", " '_add_numericlike_set_methods_disabled',\n", " 
'_allow_datetime_index_ops',\n", " '_allow_index_ops',\n", " '_allow_period_index_ops',\n", " '_arrmap',\n", " '_assert_can_do_op',\n", " '_assert_can_do_setop',\n", " '_assert_take_fillable',\n", " '_attributes',\n", " '_box_scalars',\n", " '_can_hold_na',\n", " '_can_reindex',\n", " '_cleanup',\n", " '_coerce_scalar_to_index',\n", " '_coerce_to_ndarray',\n", " '_comparables',\n", " '_constructor',\n", " '_convert_can_do_setop',\n", " '_convert_for_op',\n", " '_convert_list_indexer',\n", " '_convert_scalar_indexer',\n", " '_convert_slice_indexer',\n", " '_convert_tolerance',\n", " '_data',\n", " '_dir_additions',\n", " '_dir_deletions',\n", " '_engine',\n", " '_engine_type',\n", " '_ensure_compat_append',\n", " '_ensure_compat_concat',\n", " '_evaluate_with_datetime_like',\n", " '_evaluate_with_timedelta_like',\n", " '_evalute_compare',\n", " '_filter_indexer_tolerance',\n", " '_format_attrs',\n", " '_format_data',\n", " '_format_native_types',\n", " '_format_space',\n", " '_format_with_header',\n", " '_formatter_func',\n", " '_get_attributes_dict',\n", " '_get_consensus_name',\n", " '_get_duplicates',\n", " '_get_fill_indexer',\n", " '_get_fill_indexer_searchsorted',\n", " '_get_level_number',\n", " '_get_names',\n", " '_get_nearest_indexer',\n", " '_groupby',\n", " '_has_complex_internals',\n", " '_id',\n", " '_infer_as_myclass',\n", " '_inner_indexer',\n", " '_invalid_indexer',\n", " '_is_numeric_dtype',\n", " '_isnan',\n", " '_join_level',\n", " '_join_monotonic',\n", " '_join_multi',\n", " '_join_non_unique',\n", " '_join_precedence',\n", " '_left_indexer',\n", " '_left_indexer_unique',\n", " '_make_str_accessor',\n", " '_maybe_cast_indexer',\n", " '_maybe_cast_slice_bound',\n", " '_maybe_update_attributes',\n", " '_mpl_repr',\n", " '_na_value',\n", " '_nan_idxs',\n", " '_outer_indexer',\n", " '_possibly_promote',\n", " '_reduce',\n", " '_reindex_non_unique',\n", " '_reset_cache',\n", " '_reset_identity',\n", " '_scalar_data_error',\n", " 
'_searchsorted_monotonic',\n", " '_set_names',\n", " '_shallow_copy',\n", " '_shallow_copy_with_infer',\n", " '_simple_new',\n", " '_string_data_error',\n", " '_to_embed',\n", " '_to_safe_for_reshape',\n", " '_typ',\n", " '_unpickle_compat',\n", " '_update_inplace',\n", " '_validate_for_numeric_binop',\n", " '_validate_for_numeric_unaryop',\n", " '_validate_index_level',\n", " '_validate_indexer',\n", " '_values',\n", " '_wrap_joined_index',\n", " '_wrap_union_result',\n", " 'all',\n", " 'any',\n", " 'append',\n", " 'argmax',\n", " 'argmin',\n", " 'argsort',\n", " 'asi8',\n", " 'asof',\n", " 'asof_locs',\n", " 'astype',\n", " 'base',\n", " 'copy',\n", " 'data',\n", " 'delete',\n", " 'diff',\n", " 'difference',\n", " 'drop',\n", " 'drop_duplicates',\n", " 'dtype',\n", " 'dtype_str',\n", " 'duplicated',\n", " 'equals',\n", " 'factorize',\n", " 'fillna',\n", " 'flags',\n", " 'format',\n", " 'get_duplicates',\n", " 'get_indexer',\n", " 'get_indexer_for',\n", " 'get_indexer_non_unique',\n", " 'get_level_values',\n", " 'get_loc',\n", " 'get_slice_bound',\n", " 'get_value',\n", " 'get_values',\n", " 'groupby',\n", " 'has_duplicates',\n", " 'hasnans',\n", " 'holds_integer',\n", " 'identical',\n", " 'inferred_type',\n", " 'insert',\n", " 'intersection',\n", " 'is_',\n", " 'is_all_dates',\n", " 'is_boolean',\n", " 'is_categorical',\n", " 'is_floating',\n", " 'is_integer',\n", " 'is_lexsorted_for_tuple',\n", " 'is_mixed',\n", " 'is_monotonic',\n", " 'is_monotonic_decreasing',\n", " 'is_monotonic_increasing',\n", " 'is_numeric',\n", " 'is_object',\n", " 'is_type_compatible',\n", " 'is_unique',\n", " 'isin',\n", " 'item',\n", " 'itemsize',\n", " 'join',\n", " 'map',\n", " 'max',\n", " 'memory_usage',\n", " 'min',\n", " 'name',\n", " 'names',\n", " 'nbytes',\n", " 'ndim',\n", " 'nlevels',\n", " 'nunique',\n", " 'order',\n", " 'putmask',\n", " 'ravel',\n", " 'reindex',\n", " 'rename',\n", " 'repeat',\n", " 'searchsorted',\n", " 'set_names',\n", " 'set_value',\n", " 'shape',\n", " 
'shift',\n", " 'size',\n", " 'slice_indexer',\n", " 'slice_locs',\n", " 'sort',\n", " 'sort_values',\n", " 'sortlevel',\n", " 'str',\n", " 'strides',\n", " 'summary',\n", " 'sym_diff',\n", " 'symmetric_difference',\n", " 'take',\n", " 'to_datetime',\n", " 'to_native_types',\n", " 'to_series',\n", " 'tolist',\n", " 'transpose',\n", " 'union',\n", " 'unique',\n", " 'value_counts',\n", " 'values',\n", " 'view']" ] }, "metadata": {}, "execution_count": 20 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Oh, a lot.\n", "\n", "What will often be very useful to us is to get a single column out of the data frame. We can do this like so:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 21, "source": [ "data.price" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 242.0\n", "1 243.0\n", "2 276.0\n", "3 278.0\n", "4 284.0\n", "5 287.0\n", "6 289.0\n", "7 289.0\n", "8 289.0\n", "9 293.0\n", "10 294.0\n", "11 301.0\n", "12 304.0\n", "13 304.0\n", "14 320.0\n", "15 322.0\n", "16 322.0\n", "17 328.0\n", "18 332.0\n", "19 334.0\n", "20 335.0\n", "21 336.0\n", "22 336.0\n", "23 339.0\n", "24 339.0\n", "25 339.0\n", "26 340.0\n", "27 344.0\n", "28 344.0\n", "29 345.0\n", " ... 
\n", "65346 615600.0\n", "65347 618498.0\n", "65348 622495.0\n", "65349 625741.0\n", "65350 639454.0\n", "65351 642772.0\n", "65352 654753.0\n", "65353 655899.0\n", "65354 662481.0\n", "65355 684911.0\n", "65356 699478.0\n", "65357 700352.0\n", "65358 734935.0\n", "65359 750205.0\n", "65360 784980.0\n", "65361 791635.0\n", "65362 896678.0\n", "65363 907494.0\n", "65364 965700.0\n", "65365 989652.0\n", "65366 1003607.0\n", "65367 1013405.0\n", "65368 1110971.0\n", "65369 1150413.0\n", "65370 1212120.0\n", "65371 1318034.0\n", "65372 1337035.0\n", "65373 1366679.0\n", "65374 1384207.0\n", "65375 2594800.0\n", "Name: price, dtype: float64" ] }, "metadata": {}, "execution_count": 21 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "or" ], "metadata": {} }, { "cell_type": "code", "execution_count": 22, "source": [ "data['price']" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 242.0\n", "1 243.0\n", "2 276.0\n", "3 278.0\n", "4 284.0\n", "5 287.0\n", "6 289.0\n", "7 289.0\n", "8 289.0\n", "9 293.0\n", "10 294.0\n", "11 301.0\n", "12 304.0\n", "13 304.0\n", "14 320.0\n", "15 322.0\n", "16 322.0\n", "17 328.0\n", "18 332.0\n", "19 334.0\n", "20 335.0\n", "21 336.0\n", "22 336.0\n", "23 339.0\n", "24 339.0\n", "25 339.0\n", "26 340.0\n", "27 344.0\n", "28 344.0\n", "29 345.0\n", " ... 
\n", "65346 615600.0\n", "65347 618498.0\n", "65348 622495.0\n", "65349 625741.0\n", "65350 639454.0\n", "65351 642772.0\n", "65352 654753.0\n", "65353 655899.0\n", "65354 662481.0\n", "65355 684911.0\n", "65356 699478.0\n", "65357 700352.0\n", "65358 734935.0\n", "65359 750205.0\n", "65360 784980.0\n", "65361 791635.0\n", "65362 896678.0\n", "65363 907494.0\n", "65364 965700.0\n", "65365 989652.0\n", "65366 1003607.0\n", "65367 1013405.0\n", "65368 1110971.0\n", "65369 1150413.0\n", "65370 1212120.0\n", "65371 1318034.0\n", "65372 1337035.0\n", "65373 1366679.0\n", "65374 1384207.0\n", "65375 2594800.0\n", "Name: price, dtype: float64" ] }, "metadata": {}, "execution_count": 22 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "The second method (but not the first) also works when the column name has spaces:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 23, "source": [ "data['length to width ratio']" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 1.65\n", "1 1.46\n", "2 1.40\n", "3 1.52\n", "4 1.42\n", "5 1.95\n", "6 1.03\n", "7 1.34\n", "8 1.01\n", "9 2.02\n", "10 1.01\n", "11 1.35\n", "12 1.52\n", "13 1.00\n", "14 1.01\n", "15 1.11\n", "16 1.01\n", "17 1.08\n", "18 1.13\n", "19 1.15\n", "20 1.37\n", "21 1.35\n", "22 1.41\n", "23 1.09\n", "24 1.45\n", "25 1.10\n", "26 1.96\n", "27 1.30\n", "28 1.44\n", "29 1.35\n", " ... 
\n", "65346 1.00\n", "65347 1.47\n", "65348 1.49\n", "65349 1.59\n", "65350 1.01\n", "65351 1.47\n", "65352 1.01\n", "65353 1.51\n", "65354 1.13\n", "65355 1.17\n", "65356 1.01\n", "65357 1.01\n", "65358 1.01\n", "65359 1.00\n", "65360 1.01\n", "65361 1.21\n", "65362 1.37\n", "65363 1.01\n", "65364 1.33\n", "65365 1.00\n", "65366 1.00\n", "65367 1.01\n", "65368 1.01\n", "65369 1.01\n", "65370 1.20\n", "65371 1.01\n", "65372 1.01\n", "65373 1.01\n", "65374 1.00\n", "65375 1.00\n", "Name: length to width ratio, dtype: float64" ] }, "metadata": {}, "execution_count": 23 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Note that this operation doesn't return a 1-column data frame, but a Series:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 24, "source": [ "type(data['length to width ratio'])" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "pandas.core.series.Series" ] }, "metadata": {}, "execution_count": 24 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "What you might be able to guess here is that a DataFrame is an object that (more or less) contains a bunch of Series objects named in an Index." 
], "metadata": {} }, { "cell_type": "markdown", "source": [ "Finally, we can get multiple columns at once:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 25, "source": [ "data[['price', 'price per carat']]" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " price price per carat\n", "0 242.0 864\n", "1 243.0 1057\n", "2 276.0 812\n", "3 278.0 818\n", "4 284.0 1235\n", "5 287.0 1248\n", "6 289.0 781\n", "7 289.0 1204\n", "8 289.0 1204\n", "9 293.0 888\n", "10 294.0 1278\n", "11 301.0 860\n", "12 304.0 1322\n", "13 304.0 1216\n", "14 320.0 1280\n", "15 322.0 976\n", "16 322.0 1193\n", "17 328.0 1426\n", "18 332.0 1443\n", "19 334.0 1452\n", "20 335.0 859\n", "21 336.0 1461\n", "22 336.0 1018\n", "23 339.0 1474\n", "24 339.0 1059\n", "25 339.0 1059\n", "26 340.0 944\n", "27 344.0 1496\n", "28 344.0 1110\n", "29 345.0 1113\n", "... ... ...\n", "65346 615600.0 101250\n", "65347 618498.0 123700\n", "65348 622495.0 77715\n", "65349 625741.0 38295\n", "65350 639454.0 51862\n", "65351 642772.0 121968\n", "65352 654753.0 74829\n", "65353 655899.0 105450\n", "65354 662481.0 59629\n", "65355 684911.0 61372\n", "65356 699478.0 133488\n", "65357 700352.0 132894\n", "65358 734935.0 122082\n", "65359 750205.0 155322\n", "65360 784980.0 104247\n", "65361 791635.0 60384\n", "65362 896678.0 80420\n", "65363 907494.0 173186\n", "65364 965700.0 133200\n", "65365 989652.0 185328\n", "65366 1003607.0 183475\n", "65367 1013405.0 168340\n", "65368 1110971.0 169873\n", "65369 1150413.0 120336\n", "65370 1212120.0 144300\n", "65371 1318034.0 130112\n", "65372 1337035.0 66420\n", "65373 1366679.0 110662\n", "65374 1384207.0 150621\n", "65375 2594800.0 256150\n", "\n", "[65376 rows x 2 columns]" ], "text/html": [ "
" ] }, "metadata": {}, "execution_count": 25 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "This actually does return a data frame. Note that the expression we put between the brackets in this case was a comma-separated list of strings (things in quotes), also enclosed in brackets. This is a type of object called a list in Python:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 26, "source": [ "mylist = [1, 'a string', 'another string', 6.7, True]\n", "type(mylist)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "list" ] }, "metadata": {}, "execution_count": 26 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Clearly, lists are pretty useful for holding, well, lists. And the items in the list don't have to be of the same type. You can learn about Python lists in any basic Python intro, and we'll do more with them, but the key idea is that Python has several data types like this that we can use as *collections of other objects*. As we meet the other types, including tuples and dictionaries, we will discuss when it's best to use one or another. 
" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Handily, `pandas` has some functions we can use to calculate basic statistics of our data:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 27, "source": [ "# here, let's save these outputs to variable names to make the code cleaner\n", "ppc_mean = data['price per carat'].mean()\n", "ppc_std = data['price per carat'].std()\n", "\n", "# we can concatenate strings by using +, but first we have to use the str function to convert the numbers to strings\n", "print(\"The average price per carat is \" + str(ppc_mean))\n", "print(\"The standard deviation of price per carat is \" + str(ppc_std))\n", "print(\"The coefficient of variation (std / mean) is thus \" + str(ppc_mean / ppc_std))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The average price per carat is 6282.785364659814\n", "The standard deviation of price per carat is 7198.173546249848\n", "The coefficient of variation (std / mean) is thus 0.8728304929426245\n" ] } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "In fact, these functions will work column-by-column where appropriate:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 28, "source": [ "data.max()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "item 65376\n", "shape RD\n", "carat 20.13\n", "cut VG\n", "color J\n", "clarity VVS2\n", "polish VG\n", "symmetry VG\n", "depth 80\n", "table 555\n", "fluorescence Very Strong Blue\n", "price per carat 256150\n", "culet Very Small\n", "length to width ratio 3.12\n", "delivery date \\r\\nJul 8\\r\\n\n", "price 2.5948e+06\n", "dtype: object" ] }, "metadata": {}, "execution_count": 28 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Finally, we might eventually want to take select subsets of our data, for which there are lots of methods described 
[here](http://pandas.pydata.org/pandas-docs/stable/indexing.html). \n", "\n", "For example, say we wanted to look only at diamonds between 1 and 2 carats. One of the nicest ways to select these data is to use the `query` method:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 29, "source": [ "subset = data.query('(carat >= 1) & (carat <= 2)')\n", "print(\"The mean for the whole dataset is \" + str(data['price'].mean()))\n", "print(\"The mean for the subset is \" + str(subset['price'].mean()))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The mean for the whole dataset is 9495.346426823298\n", "The mean for the subset is 11344.487346603253\n" ] } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "# Exercises" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Question 1:\n", "Extract the subset of diamonds weighing less than 1 carat with cut equal to Very Good (VG). (Hint: the double equals `==` operator tests for equality. The normal equals sign is only for assigning values to variables.)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Question 2:\n", "Extract the subset of data with color other than J. (Hint: if you have a query that would return all the rows you *don't* want, you can negate that query by putting `~` in front of it.)" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Plotting" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "Plotting is just fun. It is also, far and away, the best method for exploring your data." 
], "metadata": {} }, { "cell_type": "code", "execution_count": 30, "source": [ "# this magic makes sure our plots appear in the browser\n", "%matplotlib inline" ], "outputs": [], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 31, "source": [ "# let's look at some distributions of variables\n", "data['price'].plot(kind='hist')" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 31 }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Hmm. Let's fix two things:\n", "1. Let's plot the logarithm of price. Prices here run from a few hundred dollars to over two million, so a log scale spreads them out far more evenly.\n", "2. Let's suppress the matplotlib.axes business (that line is just the text representation of the object our plot command returns). We can do this by ending the line with a semicolon." ], "metadata": {} }, { "cell_type": "code", "execution_count": 32, "source": [ "# first, import numpy, which has the logarithm function. We'll also give it a nickname.\n", "import numpy as np\n", "\n", "# the apply method applies a function to every element of a series (on a data frame, it works on whole rows or columns)\n", "# we will use this to create a new column in the data frame called log_price\n", "data['log_price'] = data['price'].apply(np.log10)\n", "data['log_price'].plot(kind='hist');" ], "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "But we can do so much better!" 
], "metadata": {} }, { "cell_type": "code", "execution_count": 33, "source": [ "#let's pull out the big guns\n", "import matplotlib.pyplot as plt\n", "\n", "data['log_price'].plot(kind='hist', bins=100)\n", "plt.xlabel('Log_10 price (dollars)')\n", "plt.ylabel('Count');" ], "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "What about other types of plots?" ], "metadata": {} }, { "cell_type": "code", "execution_count": 34, "source": [ "# the value_counts() method counts the number of times each value occurs in the data['color'] series\n", "data['color'].value_counts().plot(kind='bar');" ], "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "That's a bit ugly. Let's change plot styles." ], "metadata": {} }, { "cell_type": "code", "execution_count": 35, "source": [ "plt.style.use('ggplot')" ], "outputs": [], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 36, "source": [ "# scatter plot the relationship between carat and price\n", "data.plot(kind='scatter', x='carat', y='price');\n", "\n", "# do the same thing, but plot y on a log scale\n", "data.plot(kind='scatter', x='carat', y='price', logy=True);" ], "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": 
j3jVj3idPZiNDan/m9/1j62J4XwI+Bnykqs6pqrOBy4CPJfnSeFub1+lV1f6G+2NV9XJzZPbucTW1CH8M/EnzuLo1ffyxolXV2qo6c5bH2tUYAG1V9a9P9jY7eyTA219EZ36JYtYvVawwdYLp1eLTwM9X1Q+PF6rqz5P8MoOPCn9tbJ3N7+z2k6r6Quvpe1nhqmrP8ekkv95+ru7p8pHAXC+iq+FF9UNJftQcGv9MM/2jJH+Z5Efjbm4B3tkOgOOa6wLvHEM/i/FHSX51ZjHJP2XwRcfVZDX8t65l1OUjgQ81L5YB3tV64QxwxvjaWpiqOm3cPYzo/w45byX4EvCtJL8EfK+pfZjBtYGtY+tKGoKfDtJYJHkT+N+zzQLOqKqVfjRAkk8w+P0LgB9U1aPj7GehZlxY/Qng/xyfxSq9sKrhGQKS1GFdviYgSZ1nCEhShxkCktRhhoAkdZghIEkd9v8AIYHk+Q4Wcr8AAAAASUVORK5CYII=" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "That's a bit ugly. Let's change plot styles." ], "metadata": {} }, { "cell_type": "code", "execution_count": 35, "source": [ "plt.style.use('ggplot')" ], "outputs": [], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 36, "source": [ "# scatter plot the relationship between carat and price\n", "data.plot(kind='scatter', x='carat', y='price');\n", "\n", "# do the same thing, but plot y on a log scale\n", "data.plot(kind='scatter', x='carat', y='price', logy=True);" ], "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAawAAAEWCAYAAAA6maO/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAIABJREFUeJzs3X14VPWd+P33mack85BMJiSQEDBARLfBBGpAxC2g2NoqvarUTeuv99XGqz7QKtLU3S3++uDdVaFsVQjy0Ad2WyvubVGM9713r1t31xBYJWhYGySx1qaRhwAhZCYPM5nJzJw55/5jyIEhARMMSWb4vK6Li8yZc858v2dmzme+3/P5fo+i67qOEEIIMcGZxrsAQgghxHBIwBJCCJEUJGAJIYRIChKwhBBCJAUJWEIIIZKCBCwhhBBJwTLWLxiNRnn88cdRVZVYLMbChQv5u7/7OwKBABs3buT06dPk5eVRVVWF3W4HoKamht27d2M2m6msrKSsrAyA1tZWtm7dSjQaZd68eVRWVgKgqiqbN2+mtbUVl8tFVVUVkyZNAqCuro6amhoAVqxYwZIlSwDo6OigurqaQCDAjBkzWLVqFWazeYyPjhBCiAvSx0F/f7+u67oei8X0//2//7f+l7/8RX/hhRf01157Tdd1Xa+pqdF37Nih67quHzt2TP+Hf/gHXVVV/dSpU/rDDz+sa5qm67quP/bYY/pf/vIXXdd1fe3atfof//hHXdd1/Y033tB//etf67qu62+//ba+YcMGXdd13e/36w8//LDe19enBwIB429d1/Vnn31W37dvn67ruv6rX/1K/4//+I9h1aWpqelTH4+JLJXrl8p103WpX7KT+g02Ll2CaWlpQLy1FYvFADhw4IDR2lm6dCkNDQ3G8kWLFmE2m8nLyyM/P5+Wlha6u7sJhUIUFxcDsHjxYmObhoYGY18LFy6kqakJgIMHD1JaWordbsfhcFBaWkpjYyMATU1N3HDDDQAsWbKEd999d1h1aW5u/tTHYyJL5fqlct1A6pfspH6DjXmXIICmaaxZs4ZTp05x2223UVxcTE9PD263GwC3201PTw8APp+P2bNnG9t6PB58Ph9ms5mcnBxjeU5ODj6fz9hm4DmTyYTdbicQCCQsP3dffr8fp9OJyWQy9tXV1XV5D4IQQogRGZeAZTKZ+Od//meCwSBPP/00x44dG7SOoiij9nr6MGafGs46Qgghxs+4BKwBdrudz3zmMzQ2NuJ2u+nu7jb+z8rKAuKtoM7OTmMbr9eLx+PB4/Hg9XoHLR/YZuCxpmmEQiGcTicejyehGer1epkzZw4ul4tgMIimaZhMpoR9na+5uTlhHxUVFaN6TCaaVK5fKtcNpH7J7kqo386dO43HJSUllJSUXHSbMQ9Yvb29WCwW7HY7kUiEQ4cO8ZWvfIXrr7+euro67rzzTurq6igvLwegvLycTZs2sXz5cnw+H+3t7RQXF6MoCna7nZaWFmbNmsXevXv50pe+ZGyzZ88err76aurr65kzZw4AZWVlvPTSS0ZwOnToEN/4xjeA+MHav38/ixYtYs+ePcbrn2+og3rixInLdbjGncvlwu/3j3cxLotUrhtI/ZJdqtevoKBgxEFZ0ce4L+zo0aNs2bIFTdPQdZ1FixaxYsUKAoEAGzZsoLOzk9zcXKqqqnA4HEA8rb22thaLxTIorX3Lli1GWvu9994LxJM5nnvuOQ4fPozL5WL16tXk5eUB8bT2V199FUVRBqW1b9y4kb6+PoqKili1ahUWy/DiuQSs5JTKdQOpX7JL9foVFBSMeJsxD1ipSAJWckrluoHUL9mlev0uJWDJTBdCCCGSggQsIYQQSUEClhBCiKQgAUsIIURSkIAlhBAiKUjAEkIIkRQkYAkhhEgKErCEEEIkBQlYQgghkoIELCGEEElBApYQQoikIAFLCCFEUpCAJYQQIilIwBJCCJEUJGAJIYRIChKwhBBCJAUJWEIIIZKCBCwhhBBJQQKWEEKIpGAZ7wIIIS4PNaZxuFflpD9CvsvGVZkWlPEulBCfggQsIVLUR6cDPPqHv6J
qOhaTwtO3z2JGlnzlRfKSLkEhUtSJ3jCqpgOgajrtgcg4l0iIT0cClhApqiAzDYsp3gloMSnku2zjXCIhPp0x7x/wer1s3ryZnp4eFEXh1ltv5Utf+hIvv/wyb775JllZWQDcc889zJ07F4Camhp2796N2WymsrKSsrIyAFpbW9m6dSvRaJR58+ZRWVkJgKqqbN68mdbWVlwuF1VVVUyaNAmAuro6ampqAFixYgVLliwBoKOjg+rqagKBADNmzGDVqlWYzeaxPDRCjKpr8pw8ffss2gNnr2EJkczG/BNsNpv51re+RVFREf39/fzgBz+gtLQUgOXLl7N8+fKE9dva2qivr2fDhg14vV6eeOIJNm3ahKIobN++nZUrV1JcXMy6detobGxk7ty51NbW4nQ62bRpE/v27WPHjh1873vfIxAIsGvXLtavX4+u66xZs4b58+djt9t58cUXWb58OTfeeCO//vWvqa2t5fOf//xYHx4hRo3ZZGJGlkWuW4mUMeZdgm63m6KiIgDS09OZOnUqPp8PAF3XB61/4MABFi1ahNlsJi8vj/z8fFpaWuju7iYUClFcXAzA4sWLaWhoAKChocFoOS1cuJCmpiYADh48SGlpKXa7HYfDQWlpKY2NjQA0NTVxww03ALBkyRLefffdy3cQhBBCjNi4XsPq6OjgyJEjXH311QC8/vrr/MM//AO/+MUvCAaDAPh8PqM7D8Dj8eDz+fD5fOTk5BjLc3JyjMB37nMmkwm73U4gEBi0zcC+/H4/TqcTk8lk7Kurq+vyVl4IIcSIjFtfQX9/P88++yyVlZWkp6dz2223cffdd6MoCi+99BK/+93vWLly5ai81lAtt0tZB6C5uZnm5mbjcUVFBS6X65LLNtHZbLaUrV8q1w2kfsku1esHsHPnTuPvkpISSkpKLrr+uASsWCzGM888w+LFi5k/fz4AmZmZxvPLli1j/fr1QLwV1NnZaTzn9XrxeDx4PB68Xu+g5QPbDDzWNI1QKITT6cTj8SQEG6/Xy5w5c3C5XASDQTRNw2QyJezrfEMdVL/f/ymPyMTlcrlStn6pXDeQ+iW7K6F+FRUVI9pmXLoEt23bRmFhIbfffruxrLu72/j7nXfeYdq0aQCUl5ezb98+VFWlo6OD9vZ2iouLcbvd2O12Wlpa0HWdvXv3GsGvvLycPXv2AFBfX8+cOXMAKCsr49ChQwSDQQKBAIcOHTIyDktKSti/fz8Ae/bsoby8/PIfCCGEEMOm6MPtCxslH374IY8//jjTp09HURQUReGee+7hrbfe4vDhwyiKQm5uLg888AButxuIp7XX1tZisVgGpbVv2bLFSGu/9957AYhGozz33HMcPnwYl8vF6tWrycvLA+Jp7a+++iqKogxKa9+4cSN9fX0UFRWxatUqLJbhNUBPnDgx2odpwkjlX3mpXDeQ+iW7VK9fQUHBiLcZ84CViiRgJadUrhtI/ZJdqtfvUgKWzHQhhBAiKUjAEkIIkRQkYAkhhEgKErCEEEIkBQlYQgghkoIELCGEEElBApYQQoikIAFLCCFEUpCAJYQQIilIwBJCCJEUJGAJIYRIChKwhBBCJAUJWEIIIZKCBCwhhBBJQQKWEEKIpCABSwghRFKQgCWEECIpSMASQgiRFCRgCSGESAoSsIQQQiQFCVhCCCGSggQsIYQQSUEClhBCiKRgGesX9Hq9bN68mZ6eHhRFYdmyZdx+++0EAgE2btzI6dOnycvLo6qqCrvdDkBNTQ27d+/GbDZTWVlJWVkZAK2trWzdupVoNMq8efOorKwEQFVVNm/eTGtrKy6Xi6qqKiZNmgRAXV0dNTU1AKxYsYIlS5YA0NHRQXV1NYFAgBkzZrBq1SrMZvMYHx0hhBAXMuYtLLPZzLe+9S2effZZnnrqKd544w2OHz/Oa6+9xnXXXUd1dTUlJSVGUGlra6O+vp4NGzbw2GOPsX37dnRdB2D79u2sXLmS6upqTp48SWNjIwC1tbU
4nU42bdrEHXfcwY4dOwAIBALs2rWLdevWsXbtWl555RWCwSAAL774IsuXL6e6uhqHw0Ftbe1YHxohhBAXMeYBy+12U1RUBEB6ejpTp07F6/Vy4MABo7WzdOlSGhoaADhw4ACLFi3CbDaTl5dHfn4+LS0tdHd3EwqFKC4uBmDx4sXGNg0NDca+Fi5cSFNTEwAHDx6ktLQUu92Ow+GgtLTUCHJNTU3ccMMNACxZsoR33313bA6IEEKIYRnXa1gdHR0cOXKE2bNn09PTg9vtBuJBraenBwCfz2d05wF4PB58Ph8+n4+cnBxjeU5ODj6fz9hm4DmTyYTdbicQCAzaZmBffr8fp9OJyWQy9tXV1XV5Ky+EEGJExvwa1oD+/n6effZZKisrSU9PH/S8oiij9loDXYifdh2A5uZmmpubjccVFRW4XK5LLttEZ7PZUrZ+qVw3kPolu1SvH8DOnTuNv0tKSigpKbno+uMSsGKxGM888wyLFy9m/vz5QLxV1d3dbfyflZUFxFtBnZ2dxrZerxePx4PH48Hr9Q5aPrDNwGNN0wiFQjidTjweT0Kw8Xq9zJkzB5fLRTAYRNM0TCZTwr7ON9RB9fv9o3NgJiCXy5Wy9UvluoHUL9ldCfWrqKgY0Tbj0iW4bds2CgsLuf32241l119/PXV1dUA8k6+8vByA8vJy9u3bh6qqdHR00N7eTnFxMW63G7vdTktLC7qus3fvXiP4lZeXs2fPHgDq6+uZM2cOAGVlZRw6dIhgMEggEODQoUNGxmFJSQn79+8HYM+ePcbrCyGEmBgUfbh9YaPkww8/5PHHH2f69OkoioKiKNxzzz0UFxezYcMGOjs7yc3NpaqqCofDAcTT2mtra7FYLIPS2rds2WKktd97770ARKNRnnvuOQ4fPozL5WL16tXk5eUB8WD46quvoijKoLT2jRs30tfXR1FREatWrcJiGV4D9MSJE6N9mCaMVP6Vl8p1A6lfskv1+hUUFIx4mzEPWKlIAlZySuW6gdQv2aV6/S4lYMlMF0IIIZKCBCwhhBBJQQKWEEKIpCABSwghRFKQgCWEECIpSMASQgiRFCRgCSGESAoSsIQQQiQFCVhCCCGSggQsIYQQSUEClhBCiKQgAUsIIURSkIAlhBAiKUjAEkIIkRQkYAkhhEgKErCEEEIkBQlYQgghkoIELCGEEElBApYQQoikIAFLCCFEUpCAJYQQIilIwBJCCJEUJGAJIYRICpaxfsFt27bx3nvvkZWVxdNPPw3Ayy+/zJtvvklWVhYA99xzD3PnzgWgpqaG3bt3YzabqayspKysDIDW1la2bt1KNBpl3rx5VFZWAqCqKps3b6a1tRWXy0VVVRWTJk0CoK6ujpqaGgBWrFjBkiVLAOjo6KC6uppAIMCMGTNYtWoVZrN5zI6JEEKITzbmLaybb76ZH/7wh4OWL1++nPXr17N+/XojWLW1tVFfX8+GDRt47LHH2L59O7quA7B9+3ZWrlxJdXU1J0+epLGxEYDa2lqcTiebNm3ijjvuYMeOHQAEAgF27drFunXrWLt2La+88grBYBCAF198keXLl1NdXY3D4aC2tnYsDoUQQogRGPOAde211+JwOAYtHwhE5zpw4ACLFi3CbDaTl5dHfn4+LS0tdHd3EwqFKC4uBmDx4sU0NDQA0NDQYLScFi5cSFNTEwAHDx6ktLQUu92Ow+GgtLTUCHJNTU3ccMMNACxZsoR333139CsuhBDiUxnzLsELef3119m7dy+zZs3im9/8Jna7HZ/Px+zZs411PB4PPp8Ps9lMTk6OsTwnJwefzweAz+cznjOZTNjtdgKBQMLyc/fl9/txOp2YTCZjX11dXWNRZSGEECMwIQLWbbfdxt13342iKLz00kv87ne/Y+XKlaOy76FabpeyzoDm5maam5uNxxUVFbhcrksqWzKw2WwpW79UrhtI/ZJdqtcPYOfOncbfJSUllJSUXHT9CRGwMjMzjb+XLVvG+vXrgXg
rqLOz03jO6/Xi8XjweDx4vd5Bywe2GXisaRqhUAin04nH40kINF6vlzlz5uByuQgGg2iahslkStjXUIY6qH6//9MdgAnM5XKlbP1SuW4g9Ut2V0L9KioqRrTNuKS167qe0Krp7u42/n7nnXeYNm0aAOXl5ezbtw9VVeno6KC9vZ3i4mLcbjd2u52WlhZ0XWfv3r3Mnz/f2GbPnj0A1NfXM2fOHADKyso4dOgQwWCQQCDAoUOHjIzDkpIS9u/fD8CePXsoLy+//AdBCCHEiCj6SPrDRkF1dTUffPABfr+frKwsKioqaG5u5vDhwyiKQm5uLg888AButxuIp7XX1tZisVgGpbVv2bLFSGu/9957AYhGozz33HMcPnwYl8vF6tWrycvLA+Jp7a+++iqKogxKa9+4cSN9fX0UFRWxatUqLJbhNz5PnDgxmodoQknlX3mpXDeQ+iW7VK9fQUHBiLcZ84CViiRgJadUrhtI/ZJdqtfvUgLWiLsEOzs7+eijj0b8QkIIIcSnMex+r87OTqqrqzl8+DAAL7zwAvv376exsXHUMvqEEBen6XDUr3LSHyHfZeOqTAvKeBdKiDEy7BbWr371K+bNm8fzzz9vXN8pLS3l/fffv2yFE0IkOupXefQPf+VndUd59A9/5XCPOm5l0XQ43KtSfzzI4V4VubYgLrdht7BaWlpYs2aNMcAWwG63G9MbCSEuv5P+CKoWDw2qptMeiDAja3xGpwwET1XTsZgUnr591riVRVwZht3CysrKor29PWFZW1ubMbGsEOLyy3fZsJjinYAWk0K+yzZuZRkqeIrRoenwQXuvtF7PM+yfQ1/+8pdZv349d955J5qm8dZbb1FTU8Odd955OcsnhDjHVZkWnr59Fu2Bs9ewLkSNaRzuvXzXuwaC50ALazyDZ6qR1uvQhn0EbrnlFlwuF//1X/9FTk4Oe/fu5Wtf+xoLFiy4nOUTQpxDAWZkWYZ18vrodOCynvRGEjzFyEykrt+JZERHYP78+caMEkKIie1Eb/iynvRGEjzFyEjrdWjDvob1r//6r/z5z39OWPbnP/+Z3/72t6NdJiHEKCjITJsw17vEyFyVaWHjl2ezZul0nrljlrRezxh2wHr77beZNWtWwrKZM2fy1ltvjXqhhBCf3jV5Tp6+fZac9JKQAnwmP5Mbp9opkrF2hmF/ghVFQdO0hGWapo3o1hxCiLFjNpmky06klGG3sK699lpeeuklI2hpmsbLL7/Mtddee9kKJ4QQQgwY9k+ve++9l5/97Gc8+OCDTJo0ic7OTrKzs/nBD35wOcsnxBVDpl0S4uKGHbBycnJYv349LS0teL1ecnJyKC4uTpj5Qghx6WTsjRAXN6Jvg8lkYvbs2ZerLEJc0WTsjRAXd9FvQ1VVFRs2bADgO9/5zgXX27Zt2+iWSogrkIy9EeLiLhqwHnzwQePvVatWXfbCCHElu5SZI+S6l7iSXPQbMZABqGkatbW1PPjgg1it1jEpmBBXmkuZOUKue4krybAyJkwmE++//z6KIr/dhBgwEe4HJTOmiyvJsFP87rjjDnbu3Imqjt8N44SYSCbCzRQn0u1GhLjcht138Prrr9Pd3c0f/vAHMjMzE56TpAtxJZoIWX0yY7q4kgz70y1JF0IkmghZfTJjuriSDPtTPnv2bHbt2sXbb79NV1cX2dnZLFq0iBUrVlzO8gkxYUnrRoixNexv2K9//WtOnDjBvffeS25uLqdPn6ampgafz8d3v/vdYb/gtm3beO+998jKyuLpp58GIBAIsHHjRk6fPk1eXh5VVVXY7XYAampq2L17N2azmcrKSsrKygBobW1l69atRKNR5s2bR2VlJQCqqrJ582ZaW1txuVxUVVUxadIkAOrq6qipqQFgxYoVLFmyBICOjg6qq6sJBALMmDGDVatWYTabh10ncWWS1o0QY2vYSRcNDQ2sWbOGefPmUVhYyLx58/jHf/xHGhoaRvSCN998Mz/84Q8Tlr3
22mtcd911VFdXU1JSYgSVtrY26uvr2bBhA4899hjbt283Zoffvn07K1eupLq6mpMnT9LY2AhAbW0tTqeTTZs2cccdd7Bjxw4gHhR37drFunXrWLt2La+88grBYBCAF198keXLl1NdXY3D4aC2tnZEdRKpbSJkAwohRhCw3G434XA4YVkkEiE7O3tEL3jttdficDgSlh04cMBo7SxdutQIggcOHGDRokWYzWby8vLIz8+npaWF7u5uQqEQxcXFACxevNjYpqGhwdjXwoULaWpqAuDgwYOUlpZit9txOByUlpYaQa6pqYkbbrgBgCVLlvDuu++OqE4iNVwoME2EbEAhxAi6BBcvXszatWv54he/SE5ODl6vlzfeeIPFixcbQQFgzpw5Iy5ET08PbrcbiAfGnp4eAHw+X8LchR6PB5/Ph9lsJicnx1iek5ODz+czthl4zmQyYbfbCQQCCcvP3Zff78fpdBqT+Obk5NDV1TXiOojkNTBbhD+i8X/+58eDBuFOhGxAIcQIAtZ//ud/AhjddecuH3hOURQ2b978qQs1mgOUh3ODyZHchLK5uZnm5mbjcUVFBS6X65LKlgxsNlvK1m+gbh+09/LoH/7K/5o7OSEwdQRVSguzKXTrCdmAhe6MYR0TNabx0ekAJ3rDFGSmcU2eE/MY3t0gld87kPqlgp07dxp/l5SUUFJSctH1hx2wtmzZcuml+gRut5vu7m7j/6ysLCDeCurs7DTW83q9eDwePB4PXq930PKBbQYea5pGKBTC6XTi8XgSAo3X62XOnDm4XC6CwSCapmEymRL2NZShDqrf7x+V4zARuVyulK3fQN3aukOomo7DZk4ITJMdFvx+P4UOJSEbsNChDOuYHO4d32mTUvm9A6lfsnO5XFRUVIxom3G5mZWu6wmtmuuvv566ujognslXXl4OQHl5Ofv27UNVVTo6Omhvb6e4uBi3243dbqelpQVd19m7dy/z5883ttmzZw8A9fX1RhdlWVkZhw4dIhgMEggEOHTokJFxWFJSwv79+wHYs2eP8friyjAwnmpXUwf3Lyjg0cXTeOaOWUaa+kA24I1T7RSNYHJZmTZJiNGl6CPpDxsF1dXVfPDBB/j9frKysqioqGD+/Pls2LCBzs5OcnNzqaqqMhIzampqqK2txWKxDEpr37Jli5HWfu+99wIQjUZ57rnnOHz4MC6Xi9WrV5OXlwfEg+Grr76KoiiD0to3btxIX18fRUVFrFq1Cotl+L+ET5w4MZqHaEJJ5V95A3XTgcM9asJ4qtHolD6/hfXMHbMoGsOxWqn83oHUL9kVFBSMeJsxD1ipSAJWcnK5XPT0+i/b7TkuVyAcrlR+70Dql+wuJWBJqpO4ol3O23Mk08Biua+WSAYT/5skxGUkKetxcl8tkQzGJelCiIlCbs8RJwkiIhnITyhxRZMJbOMmwszzQnySK/PbKcQZyXSd6XKSwC2SgXwqhRASuEVSkE+nEOKykexDMZokYAkxBq7UE7dkH4rRJJ8cIcbAlXrilmEDYjRJWru4opx7z6sP2nvH7GaMV2rauAwbEKNJfuqIK8qltHRGozvvSk0bl+xDMZrk0yOuKJfSRTUa3XlX6olbsg/FaJJPkbiinN/SmeK08WFX1LjJ4uxs66B+8tG4DiMnbiE+Pfn2iCvKuS2dQncGYVXlh6+3GgHsqdtmcq3HmrDNldCdd6VmMYrkIgFLXFHObem4XC7+76aTCa2nE/7woIB1JXTnXalZjCK5SJaguKIVZKYlZLEVZKYNWmcgyN1QYAdg//Egh3vVy55heG5G4+V+vZO9V2YWo0gu8hNKXNGK3Vae+MJMTvrj17CuzrZecN2xboWM5etlO6wJ3Z7ZGRc+DgOkG1GMNQlY4oozcKI9dfI0WelmfvwfrcMKCqM9CPaTTvhjOeg2GFa5b0EBwUgMh81MMKICFw9aydiNKEE2uU3sT5cQl8G5J9pvfXZKQlA41hsedEv7gZOcM83yick
XIzkhftIJfyyTPTx2K0/tPmq81jN3zPrEbZJxFotkDLLiLHmnxBXn3BOt3WZOCAqhqMYz9ccTTmYDJzmP3cL9CwqwW01Md6cNmXxx/gnxqdtmco3HagStcwOaqukXPeGPZbLHpbxWMmZPJmOQFWfJOyWuOOeeaF9r7uCp22bS0RfBYTXzi3eOA4kns4GTXEcgyrb9x1mzdDpFFzihn39C/OBUH2lmp3FSPDeg/fTzMy563Wgsx25dymslY/ZkMgZZcdbE/4QJMUKqBi09Fx4MPHCibeuN4M6w4LAqWE0Kp/ui+IIqkDjv3UhOcueva7eZE37FnxvQTvaGR3zdaCJJxsHQyRhkxVnybomU09ITvehgYAVQFNj41jFjnSdvm8nz/3OS+xYUEIrE+MwUh3EyG8lJ7qpMC0/dNpMPTvVht5l5rbmDNUuvMp4/N6CZTAq/eufEiK4biU8nGYOsOGtCvWsPPfQQdrsdRVEwm82sW7eOQCDAxo0bOX36NHl5eVRVVWG3x8fD1NTUsHv3bsxmM5WVlZSVlQHQ2trK1q1biUajzJs3j8rKSgBUVWXz5s20trbicrmoqqpi0qRJANTV1VFTUwPAihUrWLJkydgfAHHJzr02FIpqFx0MrOlwrCecsM4RX4jv/e10ukJR/iY3IyFZYuAkd1Vm/HrW/uPBCyZUKMA1HitpZiftgQhrll6VEOASZtrItMmvfSFGYEJ9QxRF4fHHH8fpdBrLXnvtNa677jq+8pWv8Nprr1FTU8M3vvEN2traqK+vZ8OGDXi9Xp544gk2bdqEoihs376dlStXUlxczLp162hsbGTu3LnU1tbidDrZtGkT+/btY8eOHXzve98jEAiwa9cu1q9fj67rrFmzhvnz5xuBUUx8F7s2dP5g4KN+lRx7YtcdikJHX4TPTXOMOKtvqMzAC/2KH+oXvvzaF2J4JtRMF7quo+uJ4/kPHDhgtHaWLl1KQ0ODsXzRokWYzWby8vLIz8+npaWF7u5uQqEQxcXFACxevNjYpqFazBN5AAAgAElEQVShwdjXwoULaWpqAuDgwYOUlpZit9txOByUlpbS2Ng4JnUWo+Pca0MvNbbz08/PYPVNhTz1xZnMPm8w8El/hK5QhB8tK+Jbn53C/QsKeK25g1BU43CPOqzXOHc2iIFA9rO6ozz6h79edB9jRdPhg/beMZklQ4ixMqEClqIoPPnkkzz22GO8+eabAPT09OB2uwFwu9309PQA4PP5jO48AI/Hg8/nw+fzkZOTYyzPycnB5/MZ2ww8ZzKZsNvtBAKBQdsM7Eskj3NvFPgXbz9tPWEyrCauHWL29XyXDXeGjV/ubyPDZiYYifHITdN55VDHRackutDNCCfizRmP+lW+9/98NKGCqBCf1oTqi3jiiSfIzs6mt7eXJ598koKCgkHrKMrojUs/vzU3HM3NzTQ3NxuPKyoqcLlco1amicZmsyVF/T7j0Fj7xVk0tweMZIcfLps5ZNk/49D4oL2bVTdNp90fZnqeg/5IPEOw0J1xwfp+xqGx8cuzOXFmGqdr8pyYTSYK3XpC9+LF9jFWTp08nRBEO4IqpYXZ41qm0ZYsn81Ller1A9i5c6fxd0lJCSUlJRddf0IFrOzs+BcqMzOT+fPn09LSgtvtpru72/g/KysLiLeCOjs7jW29Xi8ejwePx4PX6x20fGCbgceaphEKhXA6nXg8noQg5PV6mTNnzpBlHOqg+v3+0TkAE5DL5ZqQ9RvqutHsbAs209lkh0KHcsGym1A41h2iX9Xw2DUiGjz1xZlMvcg2ANOcCtOc6QAE+/oAKHQoCckTF3vdsZLnSJyVI89h4dDxrpSakmiifjZHy5VQv4qKihFtM2G6BMPhMP39/QD09/fz/vvvM336dK6//nrq6uqAeCZfeXk5AOXl5ezbtw9VVeno6KC9vZ3i4mLcbjd2u52WlhZ0XWfv3r3Mnz/f2GbPnj0A1NfXG0GprKyMQ4cOEQwGCQQCHDp
0yMg4FONH1eDDrii1RwJ82BVFO+e5I0NcNxpIaLhxqp2iTzghDzwX03Q6AhH6VY0fvt7KR77osK77nDuT+pFelaJhvu5YsShw/4ICvvnZKTxwQwExTZ9w19mEGKkJ08Lq6enh5z//OYqiEIvF+NznPkdZWRmzZs1iw4YN7N69m9zcXKqqqgAoLCzkxhtvpKqqCovFwn333Wd0F377299my5YtRlr73LlzAbjlllt47rnneOSRR3C5XKxevRoAp9PJV7/6VdasWYOiKNx99904HI7xORDCMNR4qnSLwkl/hEgscVqjYz3hIbPthmqJ6Tp83B1m2/7jxnrfPDOn4Aen+nj+vfZBWYDHAyrdYY2uUJRpWfGsw78fZsbgeASwtt5IQv1W/22hTEkkkt6E+cTm5eXx85//fNByp9PJj3/84yG3ueuuu7jrrrsGLZ85cybPPPPMoOVWq5Xvf//7Q+5r6dKlLF26dGSFFpfVid7EsVInesNsqT+Oqun8ZFlR4rRG9qFniBgqFV1RIBTVErZ3nJlT0G4zG683cFI/6lf5c2coYZDv6puGDgBH/OqQgWyo2Tc4E9y8wSiONAtdweioBbnzZ9wYuO+XTEkkktmECVhCnO/8k+xkV5oRJPxhlfsXFNAXieFMM+NOG7p3+0IZfK8c6uC+BQWEVY2rJ9k52hXiR8uK+OX+NmBwFmAgHEvYjztj6PtHnT8g+VhPGEWBvqjOT/5jcGvx0T/8lfsXFLD2nJnSh5pBfKQtt6syLWz88mzaekLku2xMlymJRAqQT62YsGZmWfmnL8zkRE+YSU4rFuVsNt5LB0+x6qbp9PbHWyWFrqG74y40D6AvqPKL/fFZ2X+0rIjtDSfx2C18dU4edpuJaVlpmBWoPx4k226lNxxL2I/VzJDzALrTEwOZO8PKo3/4KytvmJoQyI73hrGa4+v1RWJDttbONdLbYijAZ/IzmeY8G9ZkSiKR7OTTKyas1p4oakzjF+/EuwHX3jYzoVWVZoEbp56djeTIECf1oqx4y+JYbxh3upWQquGymfinL8zkzx3x+f5ePhgfaPzR6SBF2elcnW3lL91R/v7M9bN8l5Xvf246j986g+7++DUsiwLb3x08D6DVTEIZLaZ4EJrkTAxkHruVjkAkoTty4LkpQ3TXXcptMdSYxuHe8b+eJsRokYAlJqwTvWEKM9P40bIiPvaG6O2PMsVlwxeMUpCVxsws63nrD31SVxSofqvNCAj3LyigMCsNu81MXyTGklkeOvsiPP9eO2uWTueoReGD9r6zs6r7ozS19zGvwElpbjwZR4chu9ic1sSuSbNJwWJS+H1jOz9aVoQvGMVjt/L7xna8IZWf3DqDYETlR8uKONLVT4bVhHmIqHIpt8X46HRAblYoUop8esWEVZCZxrtHulhwVTYzczLo6VeZkm6mLC9tyJZCtv287rh0K0f9Koe7E68r9UViHOvuB0UhK91CTNd5sfGU0YV3rCc86MaOdpuZj7v7UZR0prksHPOrRrCa5rJw5ExLpjDTxjWTMjjWGyYU1fhNwwnuX1BAhtVERyBCmlnhyTcPG/tt94cJhGP87r12ox5rlk5nuivxq3kpt8U4P2lFMgNFspNPr5iwit1WNC2bYz3xzLq/ne7AfOa5c69XTXHZsCjgC0YSriv5QhGO9sRHb50bfJxp8b1srT9OvsvKgwsLqbhuMjFdpysYISvNwr/9sc1o2Q3MnHFnSR6P/uGvPHXbTH74xtkEiidvm8mP3khMqFA1na318bTyP50+zuq/LWRr/XHynFbuW1BAmjkeLH/xznG+OifvE1tPl3JbDMkMFKlGApaYsNr8Kj8+L7OuKxRPstBJHAd1/4ICJrtsPPvfZ1svP/38DNr9QV7/yMt3Fk4FINdpJTvdTL+q863PTsFuM/PL/W3cWZLHr989wU8/P4POvgh3luTR3htmRk4Gvf0qd5bksaupw7hdybktl7bzWnAD98IaKq28IxBl+7s
n+MmtRZzsjeALquxq6jBaYVe500Ytg++aPKdkBoqUIp9gMWGd7B18u/mBQb3nj4Pqi8T4fWM7T3xhJsfPZBUGwlGcaWZ8QZXn9rUZyRGFTgstXRFm5mTgC0Z55KbpdIci/GRZEf7+KC82nmLFnDxsFtOZYGNjzf93NnBOcSa2XM5PqLDbzOxqiqfNWxQFDR2nVTGSP0JRja5glNeaO4wbRk522dA1jaJRDCpmk2nYrbKJMuBZiIuRgCUmrGzH4EAA8QCVdd44KGeamT+dDqHrOjFdp9UbYpLDwkxPekJ231WZ8etNKPHrR/2qRrZdw6QoWM0mnn+v3WgF/fTzM/jhG61854aChK7GDn+/kVU4kGX4T1+YSW9/lGy7lY3/fdTYx/0LCti2/zhrlk7nxql2jvWGja7B795YaHQ5/nJ/Gz9YetWwsvouR3AZadq8EONBPpFiwgqGVSNQzMzJSBjUaz4zV57FpGBSoDArjVU3FeIPxzMJTwei5DnTKMqyctyv0hdWCMd0/tgeoi+qMdNjY7o7nZP+MGlmE3kOE0e7Ijy4sJDOQJRcpxV/fzReEEVJSGG/f0EBnX2RhCxDk6Jz41Q7OvCDJVdxtCceDF851HEmdd3Coc6IMU6rIxDll/vbjLFka5ZehVmB7/2/nxw0LkdwuZS0eSHGmnwixYSk6WBPs7D9zAwQA8kRnYEoMV3nWHc/v3jnBN/87BR+9147j98ynWlZafRH9fh8f+40/P1RWrpISIi4b0EB093pRDWMyXR1IBTVef69dh5cWEhvWCU/08bzZzL3Xmvu4EfLiujpVwmfCUJmBR5cWEirN2SU+XCvyjSXhbCm8381tvOVkjy+VjaZmKbz3L5j+IIqP/38jIRxWpk2hZKc+Fiy+uPBYQWN4QSXgRs4tnWHBrXCRjLAWoiJRAKWmHA0HT7qjrLxv4/yo2VF9Par9KsaW+vbuPu6eHLEQOtqYNDttndPGtl2MV0npumYTKZBCRHBSAxfX4TsdJtxqwIFCIRj3DUnj631bfiCKo/cNJVHbprOh2cGF/9yfxuV109hkiOdiusmM9llxd+vMslhpV/VePa/4wHpqdtm8sGpPk76o/xi/3EjoA441t3P1Kx0+iKqMUPHgOEGjeGsd7FW2FDPDQywlgQNMZHJp1JMOEf9Kh+0x0/6rd4QORlm8pw2vl42mSlOG6sWFdIfjfHTL8wk0B+Nt35CUXKdaXSHImSl27CYIBBRByVEOGxmXOkWjnZHsNusmJX4c3abGUdE52ulk5nksGI1KfSFo2TYzGRYTXz3xkIUwB+KMDUrjXZ/mPzMNBRd4wevf2yU/URv4hiu+GwXZ1/fZFJwpZkozbWj6Rjjt0Yy399wxmRdrBV2oedk6iYx0cmnU0w4J/0R46TvsJkJa/Dkm4eNuf4sZgUUhc1vH2XNzdMJRsBqVjAB4TO3HVFQ0HV4+WC70Z2X57BxtCuEGouRY7fRF9VQFLBbFfzhGB2BCDNyMvjF/jYeXFiIyWRi+7vHeOq2GYRUnc6+KAWZafRHVAqy0rgm28qRXnXQBL3PvX3UyP4ryk5n7Rdncqw7THaGhUBEJRTVjPttDTWz+ycFjeGMybpYK0y6/0SyUvRLuU+8SHDixInxLsJlMxZ3PT3/1htpZoX1dUf4Skkeiq5TlJNBMKLR268yJdOGGtM52Rsh12klwxJvtaBDX1SjKxTvpjMDHX1RsjIsbK1viw8M1jRynTa6QipTXRZ0xUxXKEJ2hg1fMILHbsPbF8GZbuHlg+3cdk0OmelWeoIR8lzxBI3JThsZVoWZWVYU4te/DvecnfUiEtNo9YWNa1TTstLo6Vf5Wd1Ro77f/OwU/q3xFKtvKuSZ/z5mLB/IJBwNOtAW0I3Z2s+9hnV+mQeeS7bU9ivhjrypXL+CgoIRbyMtLDHuzr9R4/ovzeQHS67iWE+YSQ4bnYEI2XYbEVUj3QJBTcFkAqvZhD8cpbNPpawgg5huwqSApum
k20y4bIpxTSrXaaXdH+F0X3z80/c/N52T/gjudAtmE1yVbSMUhawMK12hKF+bm0+GReF0X4TJmen4+iJ0h1RUTWdGdjptfhVVh/YzJ/eFU+0owJ98kYS6KcrgFo3DZh7yFiWj2dIZarb2c58bqoUmqe1iopNPoxg3A7/oj593D6mjPWGWTneimHRiGmTbbZzyh5nqTiemQVcowhRXGi/8zwn+4u3n/gUFqBrE9Pj2VrMJRddIt1nj171caaRb4Bf7O4jp8NU5eTisOrkOGyf98VZdhgV6+3UsCrjTrXSHomhpFqIxnRf+5wR/VzYFAlF0HU73RQhEtIQbOg6c3B3nTX7rsJqY6rTw1G0zafdHCMfOpror6Amzd4x3ooOktouJTj6NYkxFYnDEHyUY0VAUhZ/+18f86Ly7B+e70mjqjDDZqeANwil/mCmueJp6zGbBaTOj6zq3Xp3D1+Za+eX+Nq6eNB01Fh8AHN+Pid5QlFyHDQWdDn+Uv7tuMroCBa40Mmw6PeF4mTQd/BFQYzGsFgs9oTCTM9MIRjRUDb4+dwpb69s46Y9iMSk88YWZtPv7BgXZoguc3I/5VX74RiurFk1F1+GLs3NwppnxBqOkWUyj1g34acm1LTHRScASY+qvPVGOdvfzq3dO8L/mTkbV9IRbbwy0nK4vzCLfdfZWHgAFWVbauqPYLCbSrQqvHDpltJiiqobVbELT9TO39IhvM3CzxYgG7gwL/rDK3r928vlrJhGN6ZiV+MBjuxXCqgVfMBLPNgxGyUy3YlIgw2ri4UXT6OyLMtllI90yOPvPaTNzuEflWE+YbfuPG/V99HPTjBs1vth4ypjdQtfhpYOnWLP0qjF+By7sUmaEF2IsySdSjKkTvWHjdvMDY6j+dDrEk28e5v4FBXzY0UdzR4jFMz2Dto03aHQyrGaeP3DCGLirKJDjMOENxrvgrCbwhzUiWnw8Vk8oQmFWGv0q9PSrLJqRQ04GRNT49Z1oTCcQUXDaIBKz0tMfxZVupd0fZpLDSqA/CiYzUU1H06GrL8bUzDS+u3Aq3f0qzjQz/rBKXyQ26LqUO8OKK82UMLvF9/52Ol2h+OwWEykoXMqM8EKMJflkistK0+F4QKU3otHZFyXflWaczAdmKU+3mphkt7J53zHuLMkznvubvOkARqbaiZ4IWRk2wlGV/+OzBei6ziSHlUl2K9EYmBRIt5o56Y8w2WUj12Sloy/KZFcavSGVNKuFdEv8hormIcp6OqCSbrOQlW6hPRDvhjzSFWJmTgZt3WF6+1U0XSfNrPBSwwlW3TSdnn7VaC1998ZpWEyJdxw2K2BWBt/sUcE6RAmEEBcjAUtcFqoGvWqUUwGMLkBV03n2jpkUZafz4zNjo3Ls8Tn7vMH4LT0U4KdfmEmHP0yaGYJgXJfKddrw96tkZljoV+FUIMJkVxpWk44JHVVT8AbjQTHdEr9WlueMd+t1hmJMtVqIqBq6rhPWBmfPpVnNdAQieDKsFGalET4zZspigimZaSgKTHGl0dsfpaJ0Mr2hs/MJ3jUnD6s5Ph7sXDpwvDfCjVPt0nIR4lOSb5AYFZEYdPZHCanxVpWqga7Dhx196GAkKMR0hY99/fSrGsU5Geg6xIBJdhvhmE4grGIC0i0KoRgkDAQ604Jq64mQnWHlKreNiAYneqOoTiuKAmYlfv0qEI6/no6Cry9KYVYaGVYwmUwo58UqRYGYBqf7orjTLURUFYvJyl+74gOJj3eHmeRMO7M/mJJppS8SL+tkl40jXf0AOK0mprosRFSMW5z82x9P8sANUy/noRfiiiEB6xyNjY389re/Rdd1br75Zu68887xLlJSCEeg9UzKdySmcSoQwWpS0AG7zYyinL3jr1nRmeZOp90fxqQoWEwangwb7WcyAcNRzrRkrDgtOqFIfG5Aq1khw6JzvCdKntNGTyhKusViZGSkWSCsQkzXUTXIsMKUM0EsoulEYhoWkwm7RcHfr5KdYUUnvr5
ZUchMg7BqxRuKkuew4UyDmZ4MvMF4pqHFdDb5o6svSrbdxmdy7BzuUel3Wo15ARXgb3LiA5rbAxEeuGHqhLpOJUQyk2/SGZqm8S//8i/85Cc/ITs7m8cee4z58+czdar8Oj5XOALdapTeCKDDFAe0BeLJFACaFp9I9to8Bwqw6e2jfL1sMo8vK6KrX8VkVrDp8RbV6b5491u6FaZl2QiEVbIybJzsjc/Tl3vOhSZNh2AUCrNsBCI64ZhOWIU0i85kV3xbq/nsx9liUmjriZDrsJFmVrCaTPj6othtFrLSwWmBPnO8HB2BCCbFxiQHpFlsBKMax3tiZKZb0XQdDVBjOiZFIRrTyMywGrNADJWkIMkLQlwe8o06o6Wlhfz8fHJzcwG46aabaGhouOIDVqc3QISo8bgtEO/qO9YT7waLavEpi/Jd8S6zSExDQyfdAlnpOqtvms5JfxiL2YTDGk926NchGFHJdcSDVjZWrCZwpFk4fiZYWRWdKAp2G+S70owBvjaTjs0U79PTdB2IDxLOTLcQjEKGJd4l2B2K3zzEpEBmmiU+Y4XTRncoSne/TkauLd6NGYxQkJnGFKfOqT6FWCw+D2FE0zGbwJ1uxp1mQtXi3ZTudPOEn7JIiFQlAesMn89HTk6O8djj8dDS0jKOJZoYWgMRzp1t8qQ/3pIKhGPG43xXGnYbWE3Qr5pwWNMxm3Q6AooRaFw2iMTMBMIxFMVsTFKba7eSZlXoV3U6/BEmO9PoCUXJsJqZhEIoCh2BeEvJaoJ0qwIKTHPb6O6LEjObOBVQyXVYsZkVIjHoV3UcNhMOm5XjvWcyBh1WOvoiTMtKoyjTgi9yNgjHr6MpXJN9bubewKDZs8ukxSTE+JJv4Ag1NzfT3NxsPK6oqMDlco1jiS6vE4dPJjweaEkNJFHku9KY5sJog0W1+LWkY9742Kc8R7zLDaeNUCRGmj2xdWKzxgfu+vtVprhs9Efj3X0Om0JnENLO6RbUz+w/GoOOvggFLhuhaLwcFrNiXJOymBTSLVCU42LetKFna8hQVTr7/JgUBbNJYWqWC6sltb4ONpstpT+bUr/kt3PnTuPvkpISSkpKLrp+an1DPwWPx0NnZ6fx2Ofz4fEMHrw61EFN6RmVM9MSWlg59niLxGJKByA/PhkF7X6MxAnrmSCjauCwKWi6lVP+CHlOGy5b/FqUx24lFosRUS3xe0u50rCYICsd0i1WzGaFvAxo74vPUNEVimK32lBj0BWKMj0r7ZyuubNBafY5rSQ1HMYfDl+wbuXTs433rj8Uon80DtgEkuqzfUv9kpvL5aKiomJE20jAOqO4uJj29nZOnz5NdnY2b7/9NqtXrx7vYo27WTk2wuGzM5BHiaeAq1q8O1BR0rCf8ynSiV83uio7Pm7JaoJAJJ7Z5wvGkxuu9lgTBu5+JufCg2gzjViUZiwrdsugWyGuRBKwzjCZTHz729/mySefRNd1brnlFgoLC8e7WOMux+nErw/+lZebnhhopjovHHimOc8sv0hgEkKITyIB6xxz586lurp6vIshhBBiCKZPXkUIIYQYfxKwhBBCJAUJWEIIIZKCBCwhhBBJQQKWEEKIpCABSwghRFKQgCWEECIpSMASQgiRFCRgCSGESAoSsIQQQiQFCVhCCCGSggQsIYQQSUEClhBCiKQgAUsIIURSkIAlhBAiKUjAEkIIkRQkYAkhhEgKErCEEEIkBQlYQgghkoIELCGEEElBApYQQoikIAFLCCFEUrCMdwEAXn75Zd58802ysrIAuOeee5g7dy4ANTU17N69G7PZTGVlJWVlZQC0traydetWotEo8+bNo7KyEgBVVdm8eTOtra24XC6qqqqYNGkSAHV1ddTU1ACwYsUKlixZAkBHRwfV1dUEAgFmzJjBqlWrMJvNY3kIhBBCfIIJ08Javnw569evZ/369Uawamtro76+ng0bNvDYY4+xfft2dF0HYPv27axcuZLq6mpOnjxJY2MjALW
1tTidTjZt2sQdd9zBjh07AAgEAuzatYt169axdu1aXnnlFYLBIAAvvvgiy5cvp7q6GofDQW1t7TgcASGEEBczYQLWQCA614EDB1i0aBFms5m8vDzy8/NpaWmhu7ubUChEcXExAIsXL6ahoQGAhoYGo+W0cOFCmpqaADh48CClpaXY7XYcDgelpaVGkGtqauKGG24AYMmSJbz77ruXvb5CCCFGZkJ0CQK8/vrr7N27l1mzZvHNb34Tu92Oz+dj9uzZxjoejwefz4fZbCYnJ8dYnpOTg8/nA8Dn8xnPmUwm7HY7gUAgYfm5+/L7/TidTkwmk7Gvrq6usaiyEEKIERizgPXEE0/Q09NjPNZ1HUVR+PrXv85tt93G3XffjaIovPTSS/zud79j5cqVo/K6Q7XcLmUdIYQQ42vMAtaPf/zjYa23bNky1q9fD8RbQZ2dncZzXq8Xj8eDx+PB6/UOWj6wzcBjTdMIhUI4nU48Hg/Nzc0J28yZMweXy0UwGETTNEwmU8K+htLc3Jywn4qKCgoKCoZ3EJKUy+Ua7yJcNqlcN5D6JbtUr9/OnTuNv0tKSigpKbno+hPiGlZ3d7fx9zvvvMO0adMAKC8vZ9++faiqSkdHB+3t7RQXF+N2u7Hb7bS0tKDrOnv37mX+/PnGNnv27AGgvr6eOXPmAFBWVsahQ4cIBoMEAgEOHTpkZByWlJSwf/9+APbs2UN5efkFy1pSUkJFRYXx79wDnopSuX6pXDeQ+iW7K6F+555LPylYwQS5hrVjxw4OHz6Moijk5ubywAMPAFBYWMiNN95IVVUVFouF++67D0VRAPj2t7/Nli1bjLT2gczCW265heeee45HHnkEl8vF6tWrAXA6nXz1q19lzZo1KIrC3XffjcPhAOAb3/gGGzdu5Pe//z1FRUXccsst43AUhBBCXMyECFgPP/zwBZ+76667uOuuuwYtnzlzJs8888yg5Varle9///tD7mvp0qUsXbp00PK8vDzWrl07/AILIYQYcxOiSzCZDacZm8xSuX6pXDeQ+iU7qd9gii4pckIIIZKAtLCEEEIkBQlYQgghksKESLpIdhebvDdZNTY28tvf/hZd17n55pu58847x7tIo+qhhx7CbrejKApms5l169aNd5E+lW3btvHee++RlZXF008/DcTnz9y4cSOnT58mLy+Pqqoq7Hb7OJf00gxVv1T53nm9XjZv3kxPTw+KorBs2TJuv/32lHn/zq/frbfeype+9KVLe/908ant3LlT//d///fxLsaoicVi+sMPP6x3dHTo0WhU//u//3u9ra1tvIs1qh566CHd7/ePdzFGzZ/+9Cf9448/1h999FFj2QsvvKC/9tpruq7rek1Njb5jx47xKt6nNlT9UuV719XVpX/88ce6rut6KBTSH3nkEb2trS1l3r8L1e9S3j/pEhwlegrlrrS0tJCfn09ubi4Wi4WbbrrJmFw4Vei6nlLv2bXXXmuMKxxw4MABYyLopUuXJvV7OFT9IDW+d263m6KiIgDS09OZOnUqXq83Zd6/oeo3MPfrSN8/6RIcJUNN3pushpoouKWlZRxLNPoUReHJJ5/EZDKxbNkybr311vEu0qjr6enB7XYD8ZPGuXN5popU+t5B/N58R44cYfbs2Sn5/g3U7+qrr+bDDz8c8fsnAWuYRjJ57/PPP893vvOdcSyt+CRPPPEE2dnZ9Pb28sQTT1BYWMi111473sW6rAZmiUkVqfa96+/v59lnn6WyspL09PRBzyf7+3d+/S7l/ZOANUyXMnlvsjp/0mGfz3fRCYGTUXZ2NgCZmZksWLCAlpaWlAtYbreb7u5u4/+Bi9upIjMz0/g72b93sViMZ555hsWLFxvzoqbS+zdU/S7l/ZNrWKPgQpP3Jqvi4mLa29s5ffo0qqry9ttvX3RC4GQTDofp7+8H4r/63n///aR/z2Dwdbnrr7+euro6AOrq6gDi21cAAAO9SURBVJL+PTy/fqn0vdu2bRu
FhYXcfvvtxrJUev+Gqt+lvH8y08Uo2Lx586DJewf6npNVY2Mjv/nNb9B1nVtuuSWl0to7Ojr4+c9/jqIoxGIxPve5zyV9/aqrq/nggw/w+/1kZWVRUVHB/Pnz2bBhA52dneTm5lJVVTVk4kIyGKp+zc3NKfG9+/DDD3n88ceZPn06iqKgKAr33HMPxcXFKfH+Xah+b7311ojfPwlYQgghkoJ0CQohhEgKErCEEEIkBQlYQgghkoIELCGEEElBApYQQoikIAFLCCFEUpCAJYQQIilIwBJCUFdXx09+8pPxLoYQFyUBS4gUp2nasNZL9slVReqTmS6EmOC8Xi+/+c1v+PDDD9F1nZtuuok77riDX/7ylxw5cgRFUSgtLeW+++4zbs/w0EMP8YUvfIG33nqLEydO8MIL/397d8zSSBCGcfyfiStBsrhqIAppBNEgphUDai+xFLQIiI2FHyBbWIgfQBvFQiQWB9oHUkq0UrAXBK3ERhMxCrKiDFcIwdxxx3GcZ1aeXzs7C2/17LwM+36jVCpxcHDAw8MDiUSC2dlZRkdHub6+plAoYK3FcRyi0Si7u7ufXLXIzxRYIi3MWovv+2QyGebm5jDGcHl5ied53NzcMDw8zNPTE2tra/T39zM/Pw+8BVY8Hsf3fVzXxXEcTk5OSKfTeJ7H8fExW1tbbGxs4Hkeh4eHVCoVVldXP7likV9TS1CkhV1cXHB/f08+n6e9vZ22tjaGhoZIJpNkMhmi0Siu65LL5Tg7O2vaOzU1RXd3N47jADA2Ntb4uWg2m6Wvr+/LDeaUr03zsERaWK1WI5FIYEzzt2W9Xm+0CYMgwFpLPB5veub91GiAo6MjyuUyt7e3wNtolcfHx48tQOQfUmCJtLCenh6q1SrW2qbQ2t/fxxjD+vo6HR0dnJ6eUiwWm/a+v0RRrVbZ3t5mZWWFwcFBAAqFQmO+lC5cSBioJSjSwgYGBujq6mJvb4/n52deXl44Pz8nCAJisRixWIy7uztKpdJv3xMEAZFIBNd1sdZSqVS4urpqrHd2dlKr1Xh9ff3okkT+mk5YIi3MGIPv+xSLRZaWlohEIoyPjzMzM8Pm5iYLCwv09vYyMTFBuVxu7PvxxJRKpZienmZ5eRljDJOTk6TT6cb6yMgIqVSKxcVFjDHs7Oz8txpF/pRuCYqISCioJSgiIqGgwBIRkVBQYImISCgosEREJBQUWCIiEgoKLBERCQUFloiIhIICS0REQkGBJSIiofAdP6sxIODmLVwAAAAASUVORK5CYII=" }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 37, "source": [ "data.boxplot(column='log_price', by='color');\n", "data.boxplot(column='log_price', by='cut');" ], "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "## Challenge Question:\n", "You can see from the above above that color and cut don't seem to 
matter much to price. Can you think of a reason why this might be?" ], "metadata": {} }, { "cell_type": "code", "execution_count": 38, "source": [ "data.boxplot(column='carat', by='color');\n", "plt.ylim(0, 3);\n", "data.boxplot(column='carat', by='cut');\n", "plt.ylim(0, 3);" ], "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 39, "source": [ "from pandas.plotting import scatter_matrix\n", "\n", "column_list = ['carat', 'price per carat', 'log_price']\n", "scatter_matrix(data[column_list]);" ], "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "# Final thoughts:" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "There are lots and lots and lots of plotting libraries out there. \n", "- [Matplotlib](http://matplotlib.org/) is the standard (and most full-featured), but it's built to look and work like Matlab, which is not known for the prettiness of its plots. \n", "- There is an unofficial port of the excellent [ggplot2](https://ggplot2.tidyverse.org/) library from R to [Python](https://yhat.github.io/ggpy/). It lacks some features, but does follow ggplot's unique \"grammar of graphics\" approach.\n", "- [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) is what the cool kids seem to be using right now. ggplot-quality results, but with a more Python-y syntax. 
It focuses on good-looking defaults relative to Matplotlib, with less typing and swap-in stylesheets that give plots a consistent look and feel.\n", "- [Bokeh](http://bokeh.pydata.org/en/latest/) has a focus on web output and large or streaming datasets.\n", "- [plot.ly](https://plot.ly) has a focus on data sharing and collaboration. It may not be best for quick-and-dirty data exploration, but it is nice for showing results to colleagues.\n", "\n", "## Very important:\n", "Plotting is lots of fun to play around with, but almost no plot is going to be of publication quality without some tweaking. Once you pick a package, you will want to spend time learning how to get labels, spacing, tick marks, etc. right. All of the packages above are very powerful, but inevitably, you will want to do something that seems simple and turns out to be hard.\n", "\n", "Why not just take the plot that's easy to make and pretty it up in Adobe Illustrator? **Any plot that winds up in a paper will be redone many times in the course of revision and peer review. Learn to let the program do the hard work.** You want code that will get you 90-95% of the way to publication quality.\n", "\n", "Thankfully, it's very, very easy to learn how to do this. Because plotting routines give such nice visual feedback, there are lots and lots of examples online with code that will show you how to make gorgeous plots. Here again, documentation and StackOverflow are your friends!"
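, "\n", "\n", "To make the kind of tweaking described above concrete, here is a minimal Matplotlib sketch that sets labels, tick marks, and spacing in code. The synthetic data, the label text, and the specific tick positions are assumptions chosen for illustration; they are not taken from the diamonds dataset.\n", "\n", "
```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption for illustration, not the diamonds file)
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, size=200)
price = 4000 * carat ** 1.7 * rng.lognormal(0.0, 0.2, size=200)

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(carat, price, s=10, alpha=0.5)

# Axis labels with units, and a title
ax.set_xlabel('Carat')
ax.set_ylabel('Price (USD)')
ax.set_title('Price vs. carat (synthetic data)')

# Log-scale the y axis and choose the x tick positions explicitly
ax.set_yscale('log')
ax.set_xticks([0, 1, 2, 3])

# Hide the top and right borders, then tighten the spacing
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
fig.tight_layout()
```
", "\n", "Because every adjustment lives in code, the figure can be regenerated unchanged after each round of revision; calling `fig.savefig` with a file name and a `dpi` value would then write it to disk."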
], "metadata": {} }, { "cell_type": "markdown", "source": [ "# Case Study 2: Time allocation" ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Now let's practice some of what we learned by analyzing data from a very simple survey.\n", "\n", "I asked members of CCN and Neurobiology to answer the following question:\n", "![survey image](qualtrics.png)" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "!wget -P \"$target_dir\" \"https://people.duke.edu/~jmp33/dibs/time_alloc.csv\" # download csv to data folder\n", "\n", "# if this doesn't work, manually download `time_alloc.csv` from https://people.duke.edu/~jmp33/dibs/ \n", "# to your local machine, and upload it to `data` folder" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": 40, "source": [ "dat = pd.read_csv('data/time_alloc.csv')" ], "outputs": [], "metadata": { "collapsed": true } }, { "cell_type": "code", "execution_count": 41, "source": [ "dat.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Start Date End Date Progress Duration (in seconds) Finished \\\n", "0 8/30/16 10:41 8/30/16 10:43 100 82 True \n", "1 8/30/16 8:42 8/30/16 8:44 100 99 True \n", "2 8/30/16 4:47 8/30/16 4:51 100 252 True \n", "3 8/29/16 19:23 8/29/16 19:26 100 161 True \n", "4 8/29/16 18:13 8/29/16 18:15 100 105 True \n", "\n", " Recorded Date Response ID Recipient Last Name \\\n", "0 8/30/16 10:43 R_1kO0775KprWgA91 NaN \n", "1 8/30/16 8:44 R_2P4C0IepjotM86v NaN \n", "2 8/30/16 4:51 R_6gGZesX5RTFq0Mh NaN \n", "3 8/29/16 19:26 R_3iCkovXNWsaIlmc NaN \n", "4 8/29/16 18:15 R_1I5jOg0c96ZUMsH NaN \n", "\n", " Recipient First Name Recipient Email External Reference \\\n", "0 NaN NaN NaN \n", "1 NaN NaN NaN \n", "2 NaN NaN NaN \n", "3 NaN NaN NaN \n", "4 NaN NaN NaN \n", "\n", " LocationLatitude - Location Latitude \\\n", "0 35.995407 \n", "1 52.516693 \n", "2 52.516693 \n", "3 35.946594 \n", "4 35.995407 \n", "\n", " 
LocationLongitude - Location Longitude \\\n", "0 -78.901901 \n", "1 13.399994 \n", "2 13.399994 \n", "3 -78.797699 \n", "4 -78.901901 \n", "\n", " DistributionChannel - Distribution Channel Q1_1 - Experimental design \\\n", "0 anonymous 5 \n", "1 anonymous 12 \n", "2 anonymous 11 \n", "3 anonymous 20 \n", "4 anonymous 40 \n", "\n", " Q1_2 - Piloting Q1_3 - Data collection Q1_4 - Data analysis \\\n", "0 5 40 30 \n", "1 4 24 38 \n", "2 10 7 59 \n", "3 10 20 20 \n", "4 25 10 10 \n", "\n", " Q1_5 - Writing results Q1_6 - Review process \n", "0 10 10 \n", "1 16 6 \n", "2 10 3 \n", "3 20 10 \n", "4 10 5 " ], "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Start Date</th><th>End Date</th><th>Progress</th><th>Duration (in seconds)</th><th>Finished</th><th>Recorded Date</th><th>Response ID</th><th>Recipient Last Name</th><th>Recipient First Name</th><th>Recipient Email</th><th>External Reference</th><th>LocationLatitude - Location Latitude</th><th>LocationLongitude - Location Longitude</th><th>DistributionChannel - Distribution Channel</th><th>Q1_1 - Experimental design</th><th>Q1_2 - Piloting</th><th>Q1_3 - Data collection</th><th>Q1_4 - Data analysis</th><th>Q1_5 - Writing results</th><th>Q1_6 - Review process</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>8/30/16 10:41</td><td>8/30/16 10:43</td><td>100</td><td>82</td><td>True</td><td>8/30/16 10:43</td><td>R_1kO0775KprWgA91</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>35.995407</td><td>-78.901901</td><td>anonymous</td><td>5</td><td>5</td><td>40</td><td>30</td><td>10</td><td>10</td></tr>\n",
"    <tr><th>1</th><td>8/30/16 8:42</td><td>8/30/16 8:44</td><td>100</td><td>99</td><td>True</td><td>8/30/16 8:44</td><td>R_2P4C0IepjotM86v</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>52.516693</td><td>13.399994</td><td>anonymous</td><td>12</td><td>4</td><td>24</td><td>38</td><td>16</td><td>6</td></tr>\n",
"    <tr><th>2</th><td>8/30/16 4:47</td><td>8/30/16 4:51</td><td>100</td><td>252</td><td>True</td><td>8/30/16 4:51</td><td>R_6gGZesX5RTFq0Mh</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>52.516693</td><td>13.399994</td><td>anonymous</td><td>11</td><td>10</td><td>7</td><td>59</td><td>10</td><td>3</td></tr>\n",
"    <tr><th>3</th><td>8/29/16 19:23</td><td>8/29/16 19:26</td><td>100</td><td>161</td><td>True</td><td>8/29/16 19:26</td><td>R_3iCkovXNWsaIlmc</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>35.946594</td><td>-78.797699</td><td>anonymous</td><td>20</td><td>10</td><td>20</td><td>20</td><td>20</td><td>10</td></tr>\n",
"    <tr><th>4</th><td>8/29/16 18:13</td><td>8/29/16 18:15</td><td>100</td><td>105</td><td>True</td><td>8/29/16 18:15</td><td>R_1I5jOg0c96ZUMsH</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>35.995407</td><td>-78.901901</td><td>anonymous</td><td>40</td><td>25</td><td>10</td><td>10</td><td>10</td><td>5</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ] }, "metadata": {}, "execution_count": 41 } ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 42, "source": [ "dat.columns" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['Start Date', 'End Date', 'Progress', 'Duration (in seconds)',\n", " 'Finished', 'Recorded Date', 'Response ID', 'Recipient Last Name',\n", " 'Recipient First Name', 'Recipient Email', 'External Reference',\n", " 'LocationLatitude - Location Latitude',\n", " 'LocationLongitude - Location Longitude',\n", " 'DistributionChannel - Distribution Channel',\n", " 'Q1_1 - Experimental design', 'Q1_2 - Piloting',\n", " 'Q1_3 - Data collection', 'Q1_4 - Data analysis',\n", " 'Q1_5 - Writing results', 'Q1_6 - Review process'],\n", " dtype='object')" ] }, "metadata": {}, "execution_count": 42 } ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 43, "source": [ "dat.shape" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(96, 20)" ] }, "metadata": {}, "execution_count": 43 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Let's pull out just the data we care about: the columns whose names start with `'Q1_'`:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 44, "source": [ "cols_to_extract = [c for c in dat.columns if c.startswith('Q1_')]\n", "print(cols_to_extract)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "['Q1_1 - Experimental design', 'Q1_2 - Piloting', 'Q1_3 - Data collection', 'Q1_4 - Data analysis', 'Q1_5 - Writing results', 'Q1_6 - Review process']\n" ] } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Now we want to shorten the column names to just the descriptive part:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 45, "source": [ "col_names = [n.split(' - ')[-1] for n in cols_to_extract]\n", "print(col_names)" ], "outputs": [ { "output_type": "stream",
"name": "stdout", "text": [ "['Experimental design', 'Piloting', 'Data collection', 'Data analysis', 'Writing results', 'Review process']\n" ] } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Finally, make a reduced dataset that contains only the columns we want, and set its column names to the \n", "descriptions we extracted:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 46, "source": [ "dat_red = dat[cols_to_extract]\n", "dat_red.columns = col_names\n", "dat_red.head()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Experimental design Piloting Data collection Data analysis \\\n", "0 5 5 40 30 \n", "1 12 4 24 38 \n", "2 11 10 7 59 \n", "3 20 10 20 20 \n", "4 40 25 10 10 \n", "\n", " Writing results Review process \n", "0 10 10 \n", "1 16 6 \n", "2 10 3 \n", "3 20 10 \n", "4 10 5 " ], "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Experimental design</th><th>Piloting</th><th>Data collection</th><th>Data analysis</th><th>Writing results</th><th>Review process</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>5</td><td>5</td><td>40</td><td>30</td><td>10</td><td>10</td></tr>\n",
"    <tr><th>1</th><td>12</td><td>4</td><td>24</td><td>38</td><td>16</td><td>6</td></tr>\n",
"    <tr><th>2</th><td>11</td><td>10</td><td>7</td><td>59</td><td>10</td><td>3</td></tr>\n",
"    <tr><th>3</th><td>20</td><td>10</td><td>20</td><td>20</td><td>20</td><td>10</td></tr>\n",
"    <tr><th>4</th><td>40</td><td>25</td><td>10</td><td>10</td><td>10</td><td>5</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ] }, "metadata": {}, "execution_count": 46 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "Now, we can figure out the average percent time allocated to each aspect of a project:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 47, "source": [ "dat_red.mean()" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Experimental design 14.500000\n", "Piloting 10.729167\n", "Data collection 25.239583\n", "Data analysis 26.135417\n", "Writing results 14.197917\n", "Review process 9.197917\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 47 } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "And we can visualize this with a box plot:" ], "metadata": {} }, { "cell_type": "code", "execution_count": 48, "source": [ "import seaborn as sns" ], "outputs": [], "metadata": { "collapsed": true } }, { "cell_type": "code", "execution_count": 49, "source": [ "plt.figure(figsize=(10, 5))\n", "sns.boxplot(data=dat_red)" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 49 }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 50, "source": [ "plt.figure(figsize=(10, 5))\n", "sns.violinplot(data=dat_red)\n", "plt.ylim([0, 100])" ], "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(0, 100)" ] }, "metadata": {}, "execution_count": 50 }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "image/png": "" }, "metadata": {} } ], "metadata": { "collapsed": false } }, { "cell_type": "markdown", "source": [ "And we might want to know how these covary (bearing in mind that the values have to sum to 100):" ], "metadata": {} }, { "cell_type": "code", "execution_count": 51, "source": [ "dat_red.corr()" ], "outputs": [ { "output_type": 
"execute_result", "data": { "text/plain": [ " Experimental design Piloting Data collection \\\n", "Experimental design 1.000000 -0.057772 -0.394450 \n", "Piloting -0.057772 1.000000 -0.116101 \n", "Data collection -0.394450 -0.116101 1.000000 \n", "Data analysis -0.418189 -0.062659 -0.181762 \n", "Writing results -0.149596 -0.332602 -0.395939 \n", "Review process -0.007290 -0.215466 -0.344578 \n", "\n", " Data analysis Writing results Review process \n", "Experimental design -0.418189 -0.149596 -0.007290 \n", "Piloting -0.062659 -0.332602 -0.215466 \n", "Data collection -0.181762 -0.395939 -0.344578 \n", "Data analysis 1.000000 -0.146576 -0.340260 \n", "Writing results -0.146576 1.000000 0.375177 \n", "Review process -0.340260 0.375177 1.000000 " ], "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Experimental design</th><th>Piloting</th><th>Data collection</th><th>Data analysis</th><th>Writing results</th><th>Review process</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>Experimental design</th><td>1.000000</td><td>-0.057772</td><td>-0.394450</td><td>-0.418189</td><td>-0.149596</td><td>-0.007290</td></tr>\n",
"    <tr><th>Piloting</th><td>-0.057772</td><td>1.000000</td><td>-0.116101</td><td>-0.062659</td><td>-0.332602</td><td>-0.215466</td></tr>\n",
"    <tr><th>Data collection</th><td>-0.394450</td><td>-0.116101</td><td>1.000000</td><td>-0.181762</td><td>-0.395939</td><td>-0.344578</td></tr>\n",
"    <tr><th>Data analysis</th><td>-0.418189</td><td>-0.062659</td><td>-0.181762</td><td>1.000000</td><td>-0.146576</td><td>-0.340260</td></tr>\n",
"    <tr><th>Writing results</th><td>-0.149596</td><td>-0.332602</td><td>-0.395939</td><td>-0.146576</td><td>1.000000</td><td>0.375177</td></tr>\n",
"    <tr><th>Review process</th><td>-0.007290</td><td>-0.215466</td><td>-0.344578</td><td>-0.340260</td><td>0.375177</td><td>1.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ] }, "metadata": {}, "execution_count": 51 } ], "metadata": { "collapsed": false } } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }