{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "----------\n", "# Cab Consulting\n", "_______" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Author Information\n", "Jessen Hobson, Ph.D.
\n", "Associate Professor of Accountancy
\n", "R.C. Evans Data Analytics Fellow
\n", "University of Illinois at Urbana-Champaign" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning Objectives\n", "* Retrieve data via API\n", "* Store and access data via Pandas and CSV\n", "* Clean data in Pandas\n", "* Descriptive analytics: \n", " * grouping data\n", " * visualization via box plots, scatter plots, histograms, and heat maps\n", "* Predictive analytics using regression\n", "* Story telling from a business analytics point of view" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Case Background\n", "Yellow Cab Chicago has hired your small accounting and consulting firm to help them shave costs and increase revenues. Like many traditional cab companies, Yellow Cab is feeling significant pressure from ride-hailing companies like Lyft and Uber. You are a recent graduate and new to the firm and are excited to learn you will have role in this engagement. Your excitement quickly turns to apprehension, however, when the partner pulls you into her office.\n", "\n", "\"Thanks for meeting with me,\" she says. \"This new engagement is really important to the firm, and I am happy you are on board. I heard that you have some data analytics skills. I would like you to see if you there is any data out there that might help us learn more about the situation Yellow Cab is in and if there are any interesting revenue opportunities out there that we have not thought of. You know, what is Yellow cab facing and were should they be focusing their efforts?\"\n", "\n", "You are a little overwhelmed, but after a quick internet search you find out that the city of Chicago puts a bunch of data out there about cab trips. This is just what you need!\n", " \n", "-------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Relevant Packages and Dependencies\n", "\n", "-------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set up the environment, load packages and dependencies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# database tool\n", "import pandas as pd \n", "import numpy as np \n", "# for use with API to get data\n", "from sodapy import Socrata \n", "# to show graphs inline\n", "%matplotlib inline \n", "# disable an unneeded warning\n", "pd.options.mode.chained_assignment = None # default='warn'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To set home directory. Replace `C:\\...` with your own home directory (where you want your files to be stored)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%cd C:\\Users\\jlhobson\\Documents\\Data Analytics General\\Cases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data and Save Data\n", "Load through data.cityofchicago.org using their API. An application token was not needed for 2,000,000 observations. E.g., https://digital.cityofchicago.org/index.php/chicago-taxi-data-released/\n", "\n", "------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Unauthenticated client only works with public data sets. Note 'None'\n", "# in place of application token, and no username or password:\n", "client = Socrata(\"data.cityofchicago.org\", None)\n", "\n", "# returned as JSON from API / converted to Python list of\n", "# dictionaries by sodapy.\n", "# to get a subset of 2,000,000\n", "results = client.get(\"wrvz-psew\", limit=2000000)\n", "\n", "# Convert to pandas DataFrame\n", "df = pd.DataFrame.from_records(results)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save data as a CSV file so that you can save and return to your work without having to use an application token in the API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ORIGINAL FILE\n", "file = 'cab_data.csv' " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Export file for later use\n", "df.to_csv(file, index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load data for use\n", "df = pd.read_csv(file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Look at the Data\n", "--------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: take a look at 10 random rows of the DataFrame to get a sense of what you have downloaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Clean Data\n", "--------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Eliminate bad data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1.1 Delete trips with no cost, since these are mistakes or uninteresting cases" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1.2 Delete trips with NaN mileage and NaN seconds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1.2 Delete trips with no mileage but starting and stopping in a different community, since these are likely mistakes.\n", "#### 1.1.3 Delete trips with no seconds but starting and stopping in a different community, since these are likely mistakes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Fix datetime" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.1 Create a new column that converts `trip_start_timestamp` and `trip_end_timestamp` to a datetime object" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Questions;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2.2 Create four new columns - for the year, month, day, and hour, respectively (use the start of the trip, since that is something the taxi driver can control)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Deal with outliers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main focus in outliers is getting `trip_total` right, since you are trying to understand how Yellow Cab can make money. `trip_seconds` and `trip_duration` also have outliers, which you should look at, but we will focus on `trip_total` and `tips`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, draw box plots of those four variables to see how significant the outlier situation is and then describe what you see in a few sentences." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your description of what you see:
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question:
\n", "Next, draw two scatter plots to see how cost and miles and cost and seconds relate to each other." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question:
\n", "These plots indicate that most very high cost trips are probably errors or uninteresting since they have took very few seconds and went very few miles. For example, it does not seem plausible that a $8,000 cab fare took went about 0 miles and took about 0 seconds. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's correct this by eliminating rows with unusually high dollars per second (i.e., trip with unreasonably low seconds but high payment)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question:
\n", "Next, rerun the two scatterplots from above to visually inspect whether this worked. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your description of whether this worked:
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Descriptive Analytics - explore the data\n", "----------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Next, explore the data using descriptive analytics. First graph in a bar graph / histogram five analyses--the number of trips by year, by month, by day of the month, by day of the week, and by hour of the day. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Using total cost and tips as a dependent measure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.1 Costs and tips by pick-up community area" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Graph on a histogram the top 20 community areas to pick up in if you want to maximize total cost and tips. Thus, two graphs are required. Can you draw any useful conclusions from this analysis?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you draw any useful conclusions from this analysis?:
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.2 Costs and tips by hour of the day and day of week" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Graph on a histogram the mean cost of the trip and tips by hour of the day and by day of the week. Thus, four graphs are required. Can you draw any useful conclusions from this analysis?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you draw any useful conclusions from this analysis?:
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.3 Costs and tips by means of payment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Graph on a histogram mean total cost and tips by type of payment. Thus, two graphs are required. Can you draw any useful conclusions from this analysis?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you draw any useful conclusions from this analysis?:
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3.1 Visualizing tips and total bill by location" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Next, explore some of the geographical characteristics of the data. Graph the data based upon latitude and longitude. Use the pick-up location, since this is something the driver can control. Use built in features of a Python package (e.g., pyplot from matplotlib) to also see the total bill and the tip amount on this graph. (Hint: outliers will skew the graph, look at trips under $50.) What conclusions can be drawn from this analysis?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What conclusions can be drawn from this analysis?
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Predictive Analytics - Regression\n", "----------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Two Basic Regressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Next, use regression to examine the factors that are predictive of the total bill and tips. What does this analysis tell you?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does this analysis tell you?
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Tell a story" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Question: Finally, use descriptive and predictive analytics, as has been modeled above, to make recommendations to your supervisor for the Yellow Cab Company client." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Recommendation 1: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }