{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Homework with Chipotle Data - Explained\n", "\n", "*Original version written by [Alex Sherman](https://www.linkedin.com/in/alexjmsherman)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1\n", "\n", "- Read in the file with csv.reader() and store it in an object called 'file_nested_list'.\n", "- Hint: This is a TSV file, and csv.reader() needs to be told how to handle it." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "c:\\Users\\Kevin\\Desktop\\DAT8\\data\n" ] } ], "source": [ "# Change the working directory to the 'data' directory\n", "%cd ../data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# To use csv.reader, we must import the csv module\n", "import csv\n", "\n", "# The csv.reader has a delimeter parameter, which we set to '\\t' to indicate that the file is tab-separated\n", "with open('chipotle.tsv', mode='rU') as f: # We temporarily refer to the file by the variable name f for file\n", " file_nested_list = [row for row in csv.reader(f, delimiter='\\t')] # Create a list by looping through each line in f" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why use csv.reader?\n", "\n", "As stated in the [CSV file reading and writing documentation](https://docs.python.org/2/library/csv.html):\n", "\n", "> There is no \"CSV standard\", so the format is operationally defined by the many applications which \n", "read and write it. The lack of a standard means that subtle differences often exist in the data \n", "produced and consumed by different applications. These differences can make it annoying to process \n", "CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the \n", "overall format is similar enough that it is possible to write a single module which can efficiently\n", "manipulate such data, hiding the details of reading and writing the data from the programmer.\n", "\n", "In other words, depending on the source, there may be intricacies in the data format. These are not always easy to distinguish - for instance, non-visible new line characters. The csv.reader module is built to handle these intricacies, and thus provides an efficient way to load data.\n", "\n", "This is why we prefer: `file_nested_list = [row for row in csv.reader(f, delimiter='\\t')]`\n", "\n", "Instead of: `file_nested_list = [row.split('\\t') for row in f]`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2\n", "\n", "- Separate 'file_nested_list' into the 'header' and the 'data'." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "header = file_nested_list[0]\n", "data = file_nested_list[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3\n", "\n", "- Calculate the average price of an order.\n", "- **Hint:** Examine the data to see if the 'quantity' column is relevant to this calculation.\n", "- **Hint:** Think carefully about the simplest way to do this!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to find the average price of an order. This means we need the **sum of the price of all orders** and the **total number of orders**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating the sum of the price of all orders" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']\n", "['2', '2', 'Chicken Bowl', '[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]', '$16.98 ']\n", "['3', '1', 'Chicken Bowl', '[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]', '$10.98 ']\n" ] } ], "source": [ "# After exploring our data for a minute, we find two orders for the same item - Chicken Bowl - differing by the quantity\n", "print header\n", "print data[4]\n", "print data[5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the item_price field reflects the quantity ordered. Thus, to calculate the total value of all orders, we can safely ignore the quantity column because the item_price takes quantity into account." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['$2.39 ', '$3.39 ', '$3.39 ', '$2.39 ', '$16.98 ']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We want the sum of all the order prices - the last item in each list. Here are two ways we could get this data:\n", "\n", "# Option 1\n", "prices = [row[4] for row in data] # slice to position four\n", "\n", "# Option 2\n", "prices = [row[-1] for row in data] # slice to the last position\n", "\n", "# Let's look at the first five results:\n", "prices[0:5]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Each item in the list is a sting. We can tell this because the results above are wrapped in quotes.\n", "# To confirm, let's explicity check the type of the first item in the list:\n", "type(prices[0])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['2.39 ', '3.39 ', '3.39 ', '2.39 ', '16.98 ']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Since we want to do a calculation, we need to change the type from string to float. \n", "# To do this, we first need to remove the $. Here are two different ways to accomplish this:\n", "\n", "# Option 1\n", "prices = [row[4][1:] for row in data] # remove the dollar sign by slicing\n", "\n", "# Option 2\n", "prices = [row[4].replace('$', '') for row in data] # remove the dollar sign by replacing '$' with an empty string\n", "\n", "# Let's look at the first five results:\n", "prices[0:5]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2.39, 3.39, 3.39, 2.39, 16.98]\n", "\n" ] } ], "source": [ "# Now we can convert our results to floats\n", "prices = [float(row[4][1:]) for row in data]\n", "\n", "# Let's look at the first five results and check the type of the first item:\n", "print prices[0:5]\n", "print type(prices[0])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "34500.16000000046" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Finally, we calculate our total order sum with the built-in sum function\n", "total_order_sum = sum([float(row[4][1:]) for row in data]) \n", "total_order_sum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating the total number of orders" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']\n", "['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 ']\n", "['1834', '1', 'Chicken Salad Bowl', '[Fresh Tomato Salsa, [Fajita Vegetables, Pinto Beans, Lettuce]]', '$8.75 ']\n" ] } ], "source": [ "# We can look at the first and last items in the list\n", "print header\n", "print data[0]\n", "print data[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems that there are 1834 orders. You could assume this since that's the maximum order_id, but it is best to check, as we are not certain that the data is clean. If the data was not sorted by order or if there was a missing order, then 1834 might not be correct.\n", "\n", "So, let's confirm this assumption:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['1', '1', '1', '1', '2', '3', '3', '4', '4', '5']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First, let's build a list of the order_ids\n", "order_ids = [row[0] for row in data]\n", "\n", "# Let's look at the first ten results\n", "order_ids[0:10]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'1', '2', '3', '4', '5'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We only want to count each order once. We can get the distinct order values with the set function:\n", "set(order_ids[0:10])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1834" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only keep unique order_ids\n", "unique_order_ids = set(order_ids)\n", "\n", "# Use the len function to determine the number of unique order_ids\n", "num_orders = len(unique_order_ids)\n", "num_orders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating the average price" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "18.811428571428824" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Finally, we answer the question by calculating the average\n", "average_order_price = total_order_sum / num_orders\n", "average_order_price" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "18.81" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's recap by looking at the final code:\n", "total_order_sum = sum([float(row[4][1:]) for row in data])\n", "num_orders = len(set([row[0] for row in data]))\n", "average_order_price = round(total_order_sum / num_orders, 2) # Let's round our result to 2 decimal places\n", "average_order_price" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4\n", "\n", "- Create a list (or set) of all unique sodas and soft drinks that they sell.\n", "- **Note:** Just look for 'Canned Soda' and 'Canned Soft Drink', and ignore other drinks like 'Izze'." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'6 Pack Soft Drink',\n", " 'Barbacoa Bowl',\n", " 'Barbacoa Burrito',\n", " 'Barbacoa Crispy Tacos',\n", " 'Barbacoa Salad Bowl',\n", " 'Barbacoa Soft Tacos',\n", " 'Bottled Water',\n", " 'Bowl',\n", " 'Burrito',\n", " 'Canned Soda',\n", " 'Canned Soft Drink',\n", " 'Carnitas Bowl',\n", " 'Carnitas Burrito',\n", " 'Carnitas Crispy Tacos',\n", " 'Carnitas Salad',\n", " 'Carnitas Salad Bowl',\n", " 'Carnitas Soft Tacos',\n", " 'Chicken Bowl',\n", " 'Chicken Burrito',\n", " 'Chicken Crispy Tacos',\n", " 'Chicken Salad',\n", " 'Chicken Salad Bowl',\n", " 'Chicken Soft Tacos',\n", " 'Chips',\n", " 'Chips and Fresh Tomato Salsa',\n", " 'Chips and Guacamole',\n", " 'Chips and Mild Fresh Tomato Salsa',\n", " 'Chips and Roasted Chili Corn Salsa',\n", " 'Chips and Roasted Chili-Corn Salsa',\n", " 'Chips and Tomatillo Green Chili Salsa',\n", " 'Chips and Tomatillo Red Chili Salsa',\n", " 'Chips and Tomatillo-Green Chili Salsa',\n", " 'Chips and Tomatillo-Red Chili Salsa',\n", " 'Crispy Tacos',\n", " 'Izze',\n", " 'Nantucket Nectar',\n", " 'Salad',\n", " 'Side of Chips',\n", " 'Steak Bowl',\n", " 'Steak Burrito',\n", " 'Steak Crispy Tacos',\n", " 'Steak Salad',\n", " 'Steak Salad Bowl',\n", " 'Steak Soft Tacos',\n", " 'Veggie Bowl',\n", " 'Veggie Burrito',\n", " 'Veggie Crispy Tacos',\n", " 'Veggie Salad',\n", " 'Veggie Salad Bowl',\n", " 'Veggie Soft Tacos'}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First let's look at all of the items\n", "distinct_items = set([row[2] for row in data])\n", "distinct_items " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our first goal is to reduce the dataset to only soda and soft drink orders.\n", "\n", "It appears that the only items that use the word 'Canned' are 'Canned Soda' and 'Canned Soft Drink.'\n", "\n", "This means we only need to use one filter criteria: **Look for rows with the word 'Canned'**" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['9', '2', 'Canned Soda', '[Sprite]', '$2.18 '],\n", " ['14', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 '],\n", " ['23', '2', 'Canned Soda', '[Mountain Dew]', '$2.18 '],\n", " ['24', '1', 'Canned Soda', '[Sprite]', '$1.09 '],\n", " ['47', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 ']]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a list only including soda and soft drink orders\n", "soda_orders = []\n", "for row in data:\n", " if 'Canned' in row[2]:\n", " soda_orders.append(row)\n", "\n", "# Let's look at the first five results:\n", "soda_orders[0:5]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This can also be done using a list comprehension with an 'if' condition\n", "soda_orders = [row for row in data if 'Canned' in row[2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just out of interest, let's look at two other ways we could have filtered the data:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['9', '2', 'Canned Soda', '[Sprite]', '$2.18 '],\n", " ['14', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 '],\n", " ['23', '2', 'Canned Soda', '[Mountain Dew]', '$2.18 '],\n", " ['24', '1', 'Canned Soda', '[Sprite]', '$1.09 '],\n", " ['47', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 ']]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soda_orders = [row for row in data if 'Canned Soda' in row[2] or 'Canned Soft Drink' in row[2]]\n", "soda_orders[0:5]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['9', '2', 'Canned Soda', '[Sprite]', '$2.18 '],\n", " ['14', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 '],\n", " ['23', '2', 'Canned Soda', '[Mountain Dew]', '$2.18 '],\n", " ['24', '1', 'Canned Soda', '[Sprite]', '$1.09 '],\n", " ['47', '1', 'Canned Soda', '[Dr. Pepper]', '$1.09 ']]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soda_orders = [row for row in data if 'Canned Soda' == row[2] or 'Canned Soft Drink' == row[2]]\n", "soda_orders[0:5]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['[Sprite]', '[Dr. Pepper]', '[Mountain Dew]', '[Sprite]', '[Dr. Pepper]']" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We only want the choice_description (e.g. Sprite, Mountain Dew). This is the fourth item in the list.\n", "# Since Python uses 0-based indexing, we get this by using row[3] as the first argument in our list comprehension:\n", "sodas = [row[3] for row in data if 'Canned' in row[2]]\n", "\n", "# Let's look at the first five results\n", "sodas[0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results above may look like 5 lists inside of a larger list. Let's assume that's the case, and try to get the first Sprite:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'['" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sodas[0][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is going on?\n", "\n", "The raw data for choice_description includues brackets (e.g. [Sprite]). We loaded this data in as a string, so while it looks like we have lists inside lists, the result is actually just one list. This is indicated by the quotes wrapping each item in the list, which means the list contains strings." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Sprite]\n", "\n", "8\n" ] } ], "source": [ "# Print the first list element\n", "print sodas[0]\n", "\n", "# Show that it's a string\n", "print type(sodas[0])\n", "\n", "# It is 8 characters long, including the brackets\n", "print len(sodas[0])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Sprite', 'Dr. Pepper', 'Mountain Dew', 'Sprite', 'Dr. Pepper']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's strip the brackets at the start and end of each soda name, using [1:-1] to remove the first and last characters\n", "sodas = [row[3][1:-1] for row in data if 'Canned' in row[2]]\n", "\n", "# Let's look at the first five results\n", "sodas[0:5]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'Coca Cola',\n", " 'Coke',\n", " 'Diet Coke',\n", " 'Diet Dr. Pepper',\n", " 'Dr. Pepper',\n", " 'Lemonade',\n", " 'Mountain Dew',\n", " 'Nestea',\n", " 'Sprite'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Almost done - we just need to get rid of duplicate values\n", "unique_sodas = set([row[3][1:-1] for row in data if 'Canned' in row[2]]) # Success in one line of code!\n", "unique_sodas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just for reference, how would this look if we did not use a list comprehension?" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# build a list of all sodas\n", "sodas = []\n", "for row in data:\n", " if 'Canned' in row[2]:\n", " sodas.append(row[3][1:-1]) # strip the brackets\n", "\n", "# create a set of unique sodas\n", "unique_sodas = set(sodas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5\n", "\n", "- Calculate the average number of toppings per burrito.\n", "- **Note:** Let's ignore the 'quantity' column to simplify this task.\n", "- **Hint:** Think carefully about the easiest way to count the number of toppings!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To calculate the average number of toppings, we simply need to divide the **total number of burritos** by the **total number of toppings**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating the total number of burritos" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# keep a running total\n", "burrito_count = 0\n", "\n", "# loop through the data, looking for lines containing 'Burrito'\n", "for row in data:\n", " if 'Burrito' in row[2]:\n", " burrito_count = burrito_count + 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like many programming languages, Python allows you to use `x += 1` as a replacement for `x = x + 1`. Let's use that instead:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# keep a running total\n", "burrito_count = 0\n", "\n", "# loop through the data, looking for lines containing 'Burrito'\n", "for row in data:\n", " if 'Burrito' in row[2]:\n", " burrito_count += 1 # this is the only line that changed" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1172" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "burrito_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The count is 1172, which seems reasonable given the total number of orders (1834)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating the total number of toppings" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['4',\n", " '1',\n", " 'Steak Burrito',\n", " '[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]',\n", " '$11.75 ']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look at a single burrito order\n", "data[7]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# There appear to be 8 toppings:\n", "data[7][3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With all of this formatting within the string, what's the easiest way to count the number of toppings?\n", "\n", "Start by asking yourself: How did you count the number of toppings? You probably looked for **commas**!" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the string method 'count' to count the number of commas\n", "data[7][3].count(',')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And of course, if there are 7 commas, that means there are 8 toppings.\n", "\n", "So, let's revise our original loop:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1172\n", "6323\n" ] } ], "source": [ "# keep a running total of burritos and toppings\n", "burrito_count = 0\n", "topping_count = 0\n", "\n", "# calculate number of toppings by counting the commas and adding 1\n", "for row in data:\n", " if 'Burrito' in row[2]:\n", " burrito_count += 1\n", " topping_count += (row[3].count(',') + 1)\n", "\n", "print burrito_count\n", "print topping_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculating the average number of toppings" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "5.4" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the average topping count and round to 2 digits\n", "round(topping_count / float(burrito_count), 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just for reference, how would this look if we used list comprehensions?" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "5.4" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "burrito_count = sum(1 for row in data if 'Burrito' in row[2])\n", "topping_count = sum([row[3].count(',') + 1 for row in data if 'Burrito' in row[2]])\n", "round(topping_count / float(burrito_count), 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 6\n", "\n", "- Create a dictionary in which the keys represent chip orders and the values represent the total number of orders.\n", "- **Expected output:** {'Chips and Roasted Chili-Corn Salsa': 18, ... }\n", "- **Note:** Please take the 'quantity' column into account!\n", "- **Optional:** Learn how to use 'defaultdict' to simplify your code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building a dictionary of names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's pretend I have a list of four names, and I want to make a dictionary in which the **key** is the name, and the **value** is the count of that name." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This is my list of names\n", "names = ['Ben', 'Victor', 'Laura', 'Victor']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I want to create a dictionary that looks like this:\n", "\n", "`{'Ben':1, 'Laura':1, 'Victor':2}`\n", "\n", "How would I do that? Here's my first attempt:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'Ben': 1, 'Laura': 1, 'Victor': 1}" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create empty dictionary\n", "name_count = {}\n", "\n", "# loop through list of names\n", "for name in names:\n", " # set the name as the key and 1 as the value\n", " name_count[name] = 1\n", "\n", "name_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, that creates a dictionary, but it didn't count Victor twice.\n", "\n", "Let's try again:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "ename": "KeyError", "evalue": "'Ben'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mname\u001b[0m \u001b[1;32min\u001b[0m \u001b[0mnames\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;31m# increment the value\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 5\u001b[1;33m \u001b[0mname_count\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mname\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m+=\u001b[0m \u001b[1;36m1\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 6\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 7\u001b[0m \u001b[0mname_count\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mKeyError\u001b[0m: 'Ben'" ] } ], "source": [ "name_count = {}\n", "\n", "for name in names:\n", " # increment the value\n", " name_count[name] += 1\n", "\n", "name_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That doesn't work because the dictionary starts out empty, and you can't tell Python to \"increment the Ben value by 1\" unless the Ben value starts at 0.\n", "\n", "Let's try to fix that:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'Ben': 1, 'Laura': 1, 'Victor': 2}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name_count = {}\n", "\n", "for name in names:\n", " # initially set every name to 0\n", " name_count[name] = 0\n", "\n", "for name in names:\n", " # increment the value\n", " name_count[name] += 1\n", "\n", "name_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By looping through the list twice, we fixed the problem. But that's kind of clunky.\n", "\n", "Here's what we really want to do:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'Ben': 1, 'Laura': 1, 'Victor': 2}" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name_count = {}\n", "\n", "for name in names:\n", " \n", " # check if the key is already present in the dictionary\n", " if name not in name_count:\n", " name_count[name] = 1 # this is a new key, so create key/value pair\n", " else:\n", " name_count[name] += 1 # this is an existing key, so add to the value\n", "\n", "name_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Applying this technique to chip orders" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reminder on the header\n", "header" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['1', '1', 'Chips and Fresh Tomato Salsa', 'NULL', '$2.39 '],\n", " ['1', '1', 'Chips and Tomatillo-Green Chili Salsa', 'NULL', '$2.39 '],\n", " ['3', '1', 'Side of Chips', 'NULL', '$1.69 '],\n", " ['5', '1', 'Chips and Guacamole', 'NULL', '$4.45 '],\n", " ['7', '1', 'Chips and Guacamole', 'NULL', '$4.45 ']]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find all the chip orders\n", "chip_orders = [row for row in data if 'Chips' in row[2]]\n", "\n", "# Look at the first five\n", "chip_orders[:5]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['1', '1', '1', '1', '1']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The chip quantities are easily accessible\n", "chip_quantities = [row[1] for row in data if 'Chips' in row[2]]\n", "\n", "# Look at the first five\n", "chip_quantities[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's put this all together!" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'Chips': 230,\n", " 'Chips and Fresh Tomato Salsa': 130,\n", " 'Chips and Guacamole': 506,\n", " 'Chips and Mild Fresh Tomato Salsa': 1,\n", " 'Chips and Roasted Chili Corn Salsa': 23,\n", " 'Chips and Roasted Chili-Corn Salsa': 18,\n", " 'Chips and Tomatillo Green Chili Salsa': 45,\n", " 'Chips and Tomatillo Red Chili Salsa': 50,\n", " 'Chips and Tomatillo-Green Chili Salsa': 33,\n", " 'Chips and Tomatillo-Red Chili Salsa': 25,\n", " 'Side of Chips': 110}" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# start with an empty dictionary\n", "chips = {}\n", "\n", "# if chip order is not in dictionary, then add a new key/value pair\n", "# if chip order is already in dictionary, then update the value for that key\n", "for row in data:\n", " if 'Chips' in row[2]:\n", " if row[2] not in chips:\n", " chips[row[2]] = int(row[1]) # this is a new key, so create key/value pair\n", " else:\n", " chips[row[2]] += int(row[1]) # this is an existing key, so add to the value\n", "\n", "chips" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using defaultdict instead" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[defaultdict](https://docs.python.org/2/library/collections.html) simplifies this task, because it saves you the trouble of checking whether a key already exists.\n", "\n", "Here's a simple example using the names data:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "defaultdict(int, {'Ben': 1, 'Laura': 1, 'Victor': 2})" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is a tiny variation of our code that previously raised an error\n", "\n", "# Create an empty dictionary that will eventually contain integers (and thus the default value is 0)\n", "from collections import defaultdict\n", "name_count = defaultdict(int)\n", "\n", "# We no longer have to check if the key is present\n", "for name in names:\n", " name_count[name] += 1\n", "\n", "name_count" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'Ben': 1, 'Laura': 1, 'Victor': 2}" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# It will print nicely if we convert it to a regular dictionary\n", "dict(name_count)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'Chips': 230,\n", " 'Chips and Fresh Tomato Salsa': 130,\n", " 'Chips and Guacamole': 506,\n", " 'Chips and Mild Fresh Tomato Salsa': 1,\n", " 'Chips and Roasted Chili Corn Salsa': 23,\n", " 'Chips and Roasted Chili-Corn Salsa': 18,\n", " 'Chips and Tomatillo Green Chili Salsa': 45,\n", " 'Chips and Tomatillo Red Chili Salsa': 50,\n", " 'Chips and Tomatillo-Green Chili Salsa': 33,\n", " 'Chips and Tomatillo-Red Chili Salsa': 25,\n", " 'Side of Chips': 110}" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Apply this to the chip orders\n", "dchips = defaultdict(int)\n", "\n", "for row in data:\n", " if 'Chips' in row[2]:\n", " dchips[row[2]] += int(row[1])\n", "\n", "dict(dchips)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }