{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 3\n", "\n", "The US Department of Agriculture publishes price estimates for fruits and vegetables [online](https://www.ers.usda.gov/data-products/fruit-and-vegetable-prices/fruit-and-vegetable-prices/). The most recent estimates are based on a 2013 survey of US retail stores.\n", "\n", "The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The `assignment3_data.zip` file contains the fruit and vegetable files in the directories `fruit` and `vegetables`, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise 1.1.__ Use pandas to extract the \"Fresh\" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:\n", "\n", "type | food | form | price_per_lb | yield | lb_per_cup | price_per_cup\n", "---------- | ---------- | ------ | ------------ | ----- | ---------- | -------------\n", "fruit | watermelon | Fresh1 | 0.333412 | 0.52 | 0.330693 | 0.212033\n", "fruit | cantaloupe | Fresh1 | 0.535874 | 0.51 | 0.374786 | 0.3938\n", "vegetables | onions | Fresh1 | 1.03811 | 0.9 | 0.35274 | 0.406868\n", "... | | | | | |\n", "\n", "\n", "It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:\n", "\n", "* [`str` methods](https://docs.python.org/2/library/stdtypes.html#string-methods)\n", "* [`os`](https://docs.python.org/2/library/os.html)\n", "* [`os.path`](https://docs.python.org/2/library/os.path.html)\n", "* [pandas](http://pandas.pydata.org/pandas-docs/stable/): `read_excel()`, `concat()`, `.fillna()`, `.str`, plotting methods\n", "\n", "Ask questions and search the documentation/web to find the functions you need.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def read_data(path, Type):\n", " \"\"\"Extract data from each file, then extract the fresh row and delete irrelevant\n", " columns. Arguments are a string for the path and a string for the Type: fruit or vegetable\"\"\"\n", " files = os.listdir(path)\n", " df = pd.DataFrame()\n", " for f in files:\n", " data = pd.read_excel(path + '/' + f, header = None, skiprows = [0,1,2])\n", " data = data.iloc[:,:7]\n", " data = data.rename(columns=lambda i: ['Form', 'Price Per lb', 'Unit', 'Yield', 'lb Per Cup', 'Unit',\n", " 'Price per Cup'][i])\n", " data = data.drop(data.columns[[2,5]], axis=1)\n", " x1 = pd.ExcelFile(path + '/' + f)\n", " a = x1.sheet_names\n", " data['Food'] = str(a[0])\n", " data['Type'] = Type\n", " data = data[data[\"Form\"].str.startswith(\"Fresh\") == True]\n", " \n", " df = df.append(data)\n", " \n", " return df " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fruit_data = read_data('assignment3_data/fruit', 'Fruit')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise 1.2.__ Reuse your code from exercise 1.1 to extract the \"Fresh\" row(s) from the vegetable Excel files.\n", "\n", "Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?\n", "\n", "You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "veg_data = read_data('assignment3_data/vegetables', 'Vegetable')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise 1.3.__ Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "veg_data = veg_data.dropna()\n", "final_data = pd.concat([fruit_data, veg_data])\n", "cols = final_data.columns.tolist()\n", "cols = cols[-1:] + [str(cols[5])] + cols[:-2]\n", "final_data = final_data[cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Exercise 1.4.__ Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.\n", "\n", "* What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?\n", "* How do the price distributions compare for fruit and vegetables?\n", "* Which foods are the best value for the price?\n", "* What's something surprising about this data set?\n", "* Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We add a column to each data frames that computes the ratio of \"Price Per lb\" to \"Yield\". " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fruit_data['Adj Price'] = fruit_data['Price Per lb']/fruit_data['Yield']\n", "veg_data['Adj Price'] = veg_data['Price Per lb']/veg_data['Yield']\n", "final_data['Adj Price'] = final_data['Price Per lb']/final_data['Yield']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compute the mean of the three price columns among fruits and vegetables below. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "a = fruit_data[\"Price Per lb\"].mean()\n", "b = fruit_data[\"Adj Price\"].mean()\n", "c = fruit_data[\"Price per Cup\"].mean()\n", "d = veg_data[\"Price Per lb\"].mean()\n", "e = veg_data[\"Adj Price\"].mean()\n", "g = veg_data[\"Price per Cup\"].mean()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | Price Per lb | \n", "Adj Price | \n", "Price per Cup | \n", "
---|---|---|---|
fruit | \n", "2.076877 | \n", "2.512136 | \n", "0.862056 | \n", "
vegetable | \n", "1.838701 | \n", "2.514681 | \n", "0.856272 | \n", "