{ "metadata": { "name": "", "signature": "sha256:78547353f3f8241e5b59c6405ea034f347c8de7405b28bf0e945c00859a6526d" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "4. Obtaining a sample and improving the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The size of the dataset makes it impossible to perform the learning in a reasonably short time. One solution is to select a sample from the dataset and learn from it. Let's select a random sample of 1,000,000 trips, distributed across the months of the year in proportion to the number of trips in each month." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "4.1. Preparing the notebook" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "%config InlineBackend.figure_format='retina'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import division\n", "\n", "import datetime\n", "import os\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "sns.set(font='sans')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "read_path = '../data/cleaned/cleaned_{0}.csv'\n", "save_path = '../data/dataset/dataset.csv'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "# This auxiliary function applies another function to every row of a DataFrame, in chunks, to fill in new columns.\n", "def iterate_and_apply(dataframe, function, necesary_columns):\n", "    perform = True\n", "    step = 100000\n", "    start = 0\n", "    to = step\n", "\n", "    while perform:\n", "        new_columns = dataframe[start:to][necesary_columns].apply(function, axis=1)\n", "        if len(new_columns) == 0:\n", "            perform = False\n", "        else:\n", "            dataframe.update(new_columns)\n", "        new_columns = None\n", "        start += step\n", "        to += step\n", "\n", "    return dataframe" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "4.2. Obtaining the subdataset" ] }, { "cell_type": "code", "collapsed": false, "input": [ "complete_length = 88156805\n", "final_length = 1000000\n", "current_length = 0\n", "\n", "first = True\n", "data = None" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "months = range(1, 13)\n", "for month in months:\n", "\n", "    data_aux = pd.read_csv(read_path.format(month), index_col=0)\n", "\n", "    # Each month contributes in proportion to its size; December takes whatever is left to reach final_length.\n", "    if month != 12:\n", "        this_length = int(final_length * (data_aux.shape[0] / complete_length))\n", "    else:\n", "        this_length = final_length - current_length\n", "    current_length += this_length\n", "\n", "    data_aux = data_aux.loc[np.random.choice(data_aux.index, this_length, replace=False)].copy()\n", "    data_aux = data_aux.reset_index(drop=True)\n", "\n", "    if first:\n", "        data = data_aux.copy()\n", "        first = False\n", "    else:\n", "        data = pd.concat([data, data_aux], ignore_index=True)\n", "    data_aux = None" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have obtained the sample from the dataset, let's create new attributes. This task would be impractical on the entire dataset. Now is the perfect time: the data is clean and *small*." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "4.3. Getting `datetime` attributes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The existing date and time attribute can't be fed directly to `scikit-learn`'s algorithms, so we need to decompose it. 
Below is a possible list of attributes extracted from `pickup_datetime`, with an explanation for those that need one:\n", "\n", "* **pickup_month**\n", "* **pickup_weekday**\n", "* **pickup_day**\n", "* **pickup_time_in_mins**: The pickup time expressed in minutes since midnight. For example, 16:30 would be 990.\n", "* **pickup_non_working_today**: A boolean value that shows whether that day was a non-working day (a weekend day or a holiday).\n", "* **pickup_non_working_tomorrow**: A boolean value that shows whether the following day was a non-working day (a weekend day or a holiday).\n", "\n", "To create the last two attributes, the dataset `nyc_2013_holidays.csv` (mentioned in the [first notebook](1. Preparing the environment.ipynb)) has to be used." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# A mini-dataset with the 2013 holidays in NYC.\n", "annual_holidays = pd.read_csv('../data/nyc_2013_holidays.csv')\n", "\n", "# Columns needed.\n", "datetime_necesary_columns = ['pickup_datetime']\n", "datetime_column_names = ['pickup_month', 'pickup_weekday', 'pickup_day', 'pickup_time_in_mins', 'pickup_non_working_today',\n", "                         'pickup_non_working_tomorrow']\n", "\n", "# Returns the number of matching rows: non-zero (truthy) if the day is a holiday in NYC.\n", "def is_in_annual_holidays(the_day):\n", "    return annual_holidays[(annual_holidays.month == the_day.month) & (annual_holidays.day == the_day.day)].shape[0]\n", "\n", "# It calculates data related to 'pickup_datetime'.\n", "def calculate_datetime_extra(row):\n", "    dt = datetime.datetime.strptime(row.pickup_datetime, '%Y-%m-%d %H:%M:%S')\n", "    pickup_month = dt.month\n", "    # weekday() counts from Monday = 0, so 4, 5 and 6 are Friday, Saturday and Sunday.\n", "    pickup_weekday = dt.weekday()\n", "    pickup_day = dt.day\n", "    pickup_time_in_mins = (dt.hour * 60) + dt.minute\n", "    pickup_non_working_today = int((pickup_weekday == 5) or (pickup_weekday == 6) or is_in_annual_holidays(dt))\n", "    pickup_non_working_tomorrow = int((pickup_weekday == 4) or (pickup_weekday == 5) or\n", "                                      is_in_annual_holidays(dt + datetime.timedelta(days=1)))\n", "\n", "    return pd.Series({\n", "        datetime_column_names[0]: pickup_month,\n", "        datetime_column_names[1]: pickup_weekday,\n", "        datetime_column_names[2]: pickup_day,\n", "        datetime_column_names[3]: pickup_time_in_mins,\n", "        datetime_column_names[4]: pickup_non_working_today,\n", "        datetime_column_names[5]: pickup_non_working_tomorrow\n", "    })" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "for column in datetime_column_names:\n", "    data[column] = np.nan\n", "\n", "data = iterate_and_apply(data, calculate_datetime_extra, datetime_necesary_columns)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "4.4. Getting a label to predict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting the tip percentage of a trip is a regression problem. Given the human nature of the data, and the noise that we couldn't clean, performing a regression could give us disastrous results. Instead, we could turn the problem into a classification one. Let's create a new attribute to predict. 
It will take one of six labels, each denoting a range of tip percentages:\n", "\n", "$$\n", "[0,\\:10),\\:[10,\\:15),\\:[15,\\:20),\\:[20,\\:25),\\:[25,\\:30)\\:\\text{and}\\:[30,\\:+\\infty)\n", "$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tip_label_column_name = 'tip_label'\n", "\n", "tip_labels = ['[0-10)', '[10-15)', '[15-20)', '[20-25)', '[25-30)', '[30-inf)']\n", "# Note: the last range is bounded at 51.0 rather than infinity, so a tip_perc of 51 or more would keep an empty label.\n", "tip_ranges_by_label = [[0.0, 10.0], [10.0, 15.0], [15.0, 20.0], [20.0, 25.0], [25.0, 30.0], [30.0, 51.0]]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "data[tip_label_column_name] = ''\n", "\n", "for i, tip_label in enumerate(tip_labels):\n", "    tip_mask = ((data.tip_perc >= tip_ranges_by_label[i][0]) & (data.tip_perc < tip_ranges_by_label[i][1]))\n", "    # Assign through .loc to avoid chained assignment, which may write to a temporary copy.\n", "    data.loc[tip_mask, tip_label_column_name] = tip_label\n", "\n", "    tip_mask = None" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "ax = data.groupby('tip_label').size().plot(kind='bar')\n", "\n", "ax.set_xlabel('tip_label', fontsize=18)\n", "ax.set_ylabel('Number of trips', fontsize=18)\n", "ax.tick_params(labelsize=12)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": { "png": { "height": 407, "width": 533 } }, "output_type": "display_data", "png": "\n", "text": [ "" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the classes that we are going to predict!" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "4.5. Saving the file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we are going to save this dataset with a particular column order, and use it in the [next notebook](5. Learning, badly.ipynb)."
 ] }, { "cell_type": "code", "collapsed": false, "input": [ "tip_label_order = ['medallion', 'hack_license', 'vendor_id', 'pickup_datetime', 'pickup_month', 'pickup_weekday',\n", "                   'pickup_day', 'pickup_time_in_mins', 'pickup_non_working_today', 'pickup_non_working_tomorrow',\n", "                   'fare_amount', 'surcharge', 'tip_amount', 'tip_perc', 'tip_label', 'tolls_amount', 'total_amount',\n", "                   'passenger_count', 'trip_time_in_secs', 'trip_distance', 'pickup_longitude', 'pickup_latitude',\n", "                   'dropoff_longitude', 'dropoff_latitude']\n", "\n", "# reindex(columns=...) reorders the columns; it replaces the older reindex_axis(..., axis=1) call.\n", "data = data.reindex(columns=tip_label_order)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "if not os.path.exists('../data/dataset/'):\n", "    os.makedirs('../data/dataset/')\n", "\n", "data.to_csv(save_path, index=False)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 } ], "metadata": {} } ] }