{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# First Contact with Data\n", "\n", "Every time I encounter new data file. There are few initial \"looks\" that I take on it. This help me understand if I can load the whole set to memory and what are the fields there. Since I'm command line oriented, I use linux command line utilities to do that (which are easily accesible from Jupython with `!`), but it's easily done with Python as well.\n", "\n", "As an example, we'll use a subset of the [NYC taxi dataset](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). The file is called `taxi.csv`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## File Size" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 miki miki 16M Oct 2 12:40 taxi.csv\r\n" ] } ], "source": [ "# Command line\n", "!ls -lh taxi.csv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15789.13 KB\n", "15.42 MB\n" ] } ], "source": [ "# Python\n", "from os import path\n", "print('%.2f KB' % (path.getsize('taxi.csv')/(1<<10)))\n", "print('%.2f MB' % (path.getsize('taxi.csv')/(1<<20)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of Lines" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100001 taxi.csv\r\n" ] } ], "source": [ "# Command line\n", "!wc -l taxi.csv" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100001\n" ] } ], "source": [ "# Python\n", "with open('taxi.csv') as fp:\n", " print(sum(1 for _ in fp))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Header" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VendorID\n", "lpep_pickup_datetime\n", "Lpep_dropoff_datetime\n", "Store_and_fwd_flag\n", "RateCodeID\n", "Pickup_longitude\n", "Pickup_latitude\n", "Dropoff_longitude\n", "Dropoff_latitude\n", "Passenger_count\n", "Trip_distance\n", "Fare_amount\n", "Extra\n", "MTA_tax\n", "Tip_amount\n", "Tolls_amount\n", "Ehail_fee\n", "improvement_surcharge\n", "Total_amount\n", "Payment_type\n", "Trip_type \n", "21 fields" ] } ], "source": [ "# Command line\n", "!head -1 taxi.csv | tr , \\\\n\n", "!printf \"%d fields\" $(head -1 taxi.csv | tr , \\\\n | wc -l)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VendorID\n", "lpep_pickup_datetime\n", "Lpep_dropoff_datetime\n", "Store_and_fwd_flag\n", "RateCodeID\n", "Pickup_longitude\n", "Pickup_latitude\n", "Dropoff_longitude\n", "Dropoff_latitude\n", "Passenger_count\n", "Trip_distance\n", "Fare_amount\n", "Extra\n", "MTA_tax\n", "Tip_amount\n", "Tolls_amount\n", "Ehail_fee\n", "improvement_surcharge\n", "Total_amount\n", "Payment_type\n", "Trip_type \n", "21 fields\n" ] } ], "source": [ "# Python\n", "import csv\n", "with open('taxi.csv') as fp:\n", " fields = next(csv.reader(fp))\n", "print('\\n'.join(fields))\n", "print('%d fields' % len(fields))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "2015-03-04 15:39:16\n", "2015-03-04 15:42:30\n", "N\n", "1\n", "-73.992240905761719\n", "40.690120697021484\n", "-73.999664306640625\n", "40.684993743896484\n", "2\n", ".71\n", "4.5\n", "0\n", "0.5\n", "0\n", "0\n", "\n", "0.3\n", "5.3\n", "2\n", "1\n", "\n", "\n", "23 values" ] } ], "source": [ "# Command line\n", "!head -2 taxi.csv | tail -1 | tr , \\\\n\n", "!printf \"%d values\" $(head -2 taxi.csv | tail -1 | tr , \\\\n | wc -l)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "2015-03-04 15:39:16\n", "2015-03-04 15:42:30\n", "N\n", "1\n", "-73.992240905761719\n", "40.690120697021484\n", "-73.999664306640625\n", "40.684993743896484\n", "2\n", ".71\n", "4.5\n", "0\n", "0.5\n", "0\n", "0\n", "\n", "0.3\n", "5.3\n", "2\n", "1\n", "\n", "\n", "23 values\n" ] } ], "source": [ "# Python\n", "with open('taxi.csv') as fp:\n", " fp.readline() # Skip header\n", " values = next(csv.reader(fp))\n", "print('\\n'.join(values))\n", "print('%d values' % len(values))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VendorID : 2\n", "lpep_pickup_datetime: 2015-03-04 15:39:16\n", "Lpep_dropoff_datetime: 2015-03-04 15:42:30\n", "Store_and_fwd_flag : N\n", "RateCodeID : 1\n", "Pickup_longitude : -73.992240905761719\n", "Pickup_latitude : 40.690120697021484\n", "Dropoff_longitude : -73.999664306640625\n", "Dropoff_latitude : 40.684993743896484\n", "Passenger_count : 2\n", "Trip_distance : .71\n", "Fare_amount : 4.5\n", "Extra : 0\n", "MTA_tax : 0.5\n", "Tip_amount : 0\n", "Tolls_amount : 0\n", "Ehail_fee : \n", "improvement_surcharge: 0.3\n", "Total_amount : 5.3\n", "Payment_type : 2\n", "Trip_type : 1\n", "??? : \n", "??? : \n" ] } ], "source": [ "# Python (with field names)\n", "from itertools import zip_longest\n", "with open('taxi.csv') as fp:\n", " reader = csv.reader(fp)\n", " header = next(reader)\n", " values = next(reader)\n", "for col, val in zip_longest(header, values, fillvalue='???'):\n", " print('%-20s: %s' % (col, val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both methods (with fields or without) we see that we have some extra empty fields at the end of each data row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading as DataFrame\n", "\n", "After the initial look, we know we can load the whole data to memory and have a good idea what to tell pandas for parsing it." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VendorIDlpep_pickup_datetimeLpep_dropoff_datetimeStore_and_fwd_flagRateCodeIDPickup_longitudePickup_latitudeDropoff_longitudeDropoff_latitudePassenger_count...Fare_amountExtraMTA_taxTip_amountTolls_amountEhail_feeimprovement_surchargeTotal_amountPayment_typeTrip_type
022015-03-04 15:39:162015-03-04 15:42:30N1-73.99224140.690121-73.99966440.6849942...4.50.00.50.000.0NaN0.35.3021
122015-03-22 17:36:492015-03-22 17:45:39N5-73.93003840.819576-73.90717340.8113062...12.00.00.00.000.0NaN0.012.0022
222015-03-25 22:08:452015-03-25 22:53:29N1-73.96108240.807022-73.98464240.6631471...45.00.50.59.260.0NaN0.355.5611
322015-03-16 13:45:202015-03-16 13:52:04N1-73.91320040.777962-73.92699440.7727432...6.50.00.50.000.0NaN0.37.3021
422015-03-19 18:53:502015-03-19 18:59:04N1-73.92588840.827602-73.91635140.8249661...5.51.00.50.000.0NaN0.37.3021
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag \\\n", "0 2 2015-03-04 15:39:16 2015-03-04 15:42:30 N \n", "1 2 2015-03-22 17:36:49 2015-03-22 17:45:39 N \n", "2 2 2015-03-25 22:08:45 2015-03-25 22:53:29 N \n", "3 2 2015-03-16 13:45:20 2015-03-16 13:52:04 N \n", "4 2 2015-03-19 18:53:50 2015-03-19 18:59:04 N \n", "\n", " RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude \\\n", "0 1 -73.992241 40.690121 -73.999664 \n", "1 5 -73.930038 40.819576 -73.907173 \n", "2 1 -73.961082 40.807022 -73.984642 \n", "3 1 -73.913200 40.777962 -73.926994 \n", "4 1 -73.925888 40.827602 -73.916351 \n", "\n", " Dropoff_latitude Passenger_count ... Fare_amount Extra MTA_tax \\\n", "0 40.684994 2 ... 4.5 0.0 0.5 \n", "1 40.811306 2 ... 12.0 0.0 0.0 \n", "2 40.663147 1 ... 45.0 0.5 0.5 \n", "3 40.772743 2 ... 6.5 0.0 0.5 \n", "4 40.824966 1 ... 5.5 1.0 0.5 \n", "\n", " Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount \\\n", "0 0.00 0.0 NaN 0.3 5.30 \n", "1 0.00 0.0 NaN 0.0 12.00 \n", "2 9.26 0.0 NaN 0.3 55.56 \n", "3 0.00 0.0 NaN 0.3 7.30 \n", "4 0.00 0.0 NaN 0.3 7.30 \n", "\n", " Payment_type Trip_type \n", "0 2 1 \n", "1 2 2 \n", "2 1 1 \n", "3 2 1 \n", "4 2 1 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "date_cols = ['lpep_pickup_datetime', 'Lpep_dropoff_datetime']\n", "with open('taxi.csv') as fp:\n", " header = next(csv.reader(fp))\n", " df = pd.read_csv(fp, names=header, usecols=np.arange(len(header)), parse_dates=date_cols)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }