{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# First Contact with Data\n", "\n", "Every time I encounter new data file. There are few initial \"looks\" that I take on it. This help me understand if I can load the whole set to memory and what are the fields there. Since I'm command line oriented, I use linux command line utilities to do that (which are easily accesible from Jupython with `!`), but it's easily done with Python as well.\n", "\n", "As an example, we'll use a subset of the [NYC taxi dataset](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). The file is called `taxi.csv`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## File Size" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-r--r-- 1 miki miki 16M Oct 2 12:40 taxi.csv\r\n" ] } ], "source": [ "# Command line\n", "!ls -lh taxi.csv" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15789.13 KB\n", "15.42 MB\n" ] } ], "source": [ "# Python\n", "from os import path\n", "print('%.2f KB' % (path.getsize('taxi.csv')/(1<<10)))\n", "print('%.2f MB' % (path.getsize('taxi.csv')/(1<<20)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Number of Lines" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100001 taxi.csv\r\n" ] } ], "source": [ "# Command line\n", "!wc -l taxi.csv" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100001\n" ] } ], "source": [ "# Python\n", "with open('taxi.csv') as fp:\n", " print(sum(1 for _ in fp))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Header" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VendorID\n", "lpep_pickup_datetime\n", "Lpep_dropoff_datetime\n", "Store_and_fwd_flag\n", "RateCodeID\n", "Pickup_longitude\n", "Pickup_latitude\n", "Dropoff_longitude\n", "Dropoff_latitude\n", "Passenger_count\n", "Trip_distance\n", "Fare_amount\n", "Extra\n", "MTA_tax\n", "Tip_amount\n", "Tolls_amount\n", "Ehail_fee\n", "improvement_surcharge\n", "Total_amount\n", "Payment_type\n", "Trip_type \n", "21 fields" ] } ], "source": [ "# Command line\n", "!head -1 taxi.csv | tr , \\\\n\n", "!printf \"%d fields\" $(head -1 taxi.csv | tr , \\\\n | wc -l)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VendorID\n", "lpep_pickup_datetime\n", "Lpep_dropoff_datetime\n", "Store_and_fwd_flag\n", "RateCodeID\n", "Pickup_longitude\n", "Pickup_latitude\n", "Dropoff_longitude\n", "Dropoff_latitude\n", "Passenger_count\n", "Trip_distance\n", "Fare_amount\n", "Extra\n", "MTA_tax\n", "Tip_amount\n", "Tolls_amount\n", "Ehail_fee\n", "improvement_surcharge\n", "Total_amount\n", "Payment_type\n", "Trip_type \n", "21 fields\n" ] } ], "source": [ "# Python\n", "import csv\n", "with open('taxi.csv') as fp:\n", " fields = next(csv.reader(fp))\n", "print('\\n'.join(fields))\n", "print('%d fields' % len(fields))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample Data" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "2015-03-04 15:39:16\n", "2015-03-04 15:42:30\n", "N\n", "1\n", "-73.992240905761719\n", "40.690120697021484\n", "-73.999664306640625\n", "40.684993743896484\n", "2\n", ".71\n", "4.5\n", "0\n", "0.5\n", "0\n", "0\n", "\n", "0.3\n", "5.3\n", "2\n", "1\n", "\n", "\n", "23 values" ] } ], "source": [ "# Command line\n", "!head -2 taxi.csv | tail -1 | tr , \\\\n\n", "!printf \"%d values\" $(head -2 taxi.csv | tail -1 | tr , \\\\n | wc -l)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "2015-03-04 15:39:16\n", "2015-03-04 15:42:30\n", "N\n", "1\n", "-73.992240905761719\n", "40.690120697021484\n", "-73.999664306640625\n", "40.684993743896484\n", "2\n", ".71\n", "4.5\n", "0\n", "0.5\n", "0\n", "0\n", "\n", "0.3\n", "5.3\n", "2\n", "1\n", "\n", "\n", "23 values\n" ] } ], "source": [ "# Python\n", "with open('taxi.csv') as fp:\n", " fp.readline() # Skip header\n", " values = next(csv.reader(fp))\n", "print('\\n'.join(values))\n", "print('%d values' % len(values))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VendorID : 2\n", "lpep_pickup_datetime: 2015-03-04 15:39:16\n", "Lpep_dropoff_datetime: 2015-03-04 15:42:30\n", "Store_and_fwd_flag : N\n", "RateCodeID : 1\n", "Pickup_longitude : -73.992240905761719\n", "Pickup_latitude : 40.690120697021484\n", "Dropoff_longitude : -73.999664306640625\n", "Dropoff_latitude : 40.684993743896484\n", "Passenger_count : 2\n", "Trip_distance : .71\n", "Fare_amount : 4.5\n", "Extra : 0\n", "MTA_tax : 0.5\n", "Tip_amount : 0\n", "Tolls_amount : 0\n", "Ehail_fee : \n", "improvement_surcharge: 0.3\n", "Total_amount : 5.3\n", "Payment_type : 2\n", "Trip_type : 1\n", "??? : \n", "??? : \n" ] } ], "source": [ "# Python (with field names)\n", "from itertools import zip_longest\n", "with open('taxi.csv') as fp:\n", " reader = csv.reader(fp)\n", " header = next(reader)\n", " values = next(reader)\n", "for col, val in zip_longest(header, values, fillvalue='???'):\n", " print('%-20s: %s' % (col, val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both methods (with fields or without) we see that we have some extra empty fields at the end of each data row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading as DataFrame\n", "\n", "After the initial look, we know we can load the whole data to memory and have a good idea what to tell pandas for parsing it." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | VendorID | \n", "lpep_pickup_datetime | \n", "Lpep_dropoff_datetime | \n", "Store_and_fwd_flag | \n", "RateCodeID | \n", "Pickup_longitude | \n", "Pickup_latitude | \n", "Dropoff_longitude | \n", "Dropoff_latitude | \n", "Passenger_count | \n", "... | \n", "Fare_amount | \n", "Extra | \n", "MTA_tax | \n", "Tip_amount | \n", "Tolls_amount | \n", "Ehail_fee | \n", "improvement_surcharge | \n", "Total_amount | \n", "Payment_type | \n", "Trip_type | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "2 | \n", "2015-03-04 15:39:16 | \n", "2015-03-04 15:42:30 | \n", "N | \n", "1 | \n", "-73.992241 | \n", "40.690121 | \n", "-73.999664 | \n", "40.684994 | \n", "2 | \n", "... | \n", "4.5 | \n", "0.0 | \n", "0.5 | \n", "0.00 | \n", "0.0 | \n", "NaN | \n", "0.3 | \n", "5.30 | \n", "2 | \n", "1 | \n", "
1 | \n", "2 | \n", "2015-03-22 17:36:49 | \n", "2015-03-22 17:45:39 | \n", "N | \n", "5 | \n", "-73.930038 | \n", "40.819576 | \n", "-73.907173 | \n", "40.811306 | \n", "2 | \n", "... | \n", "12.0 | \n", "0.0 | \n", "0.0 | \n", "0.00 | \n", "0.0 | \n", "NaN | \n", "0.0 | \n", "12.00 | \n", "2 | \n", "2 | \n", "
2 | \n", "2 | \n", "2015-03-25 22:08:45 | \n", "2015-03-25 22:53:29 | \n", "N | \n", "1 | \n", "-73.961082 | \n", "40.807022 | \n", "-73.984642 | \n", "40.663147 | \n", "1 | \n", "... | \n", "45.0 | \n", "0.5 | \n", "0.5 | \n", "9.26 | \n", "0.0 | \n", "NaN | \n", "0.3 | \n", "55.56 | \n", "1 | \n", "1 | \n", "
3 | \n", "2 | \n", "2015-03-16 13:45:20 | \n", "2015-03-16 13:52:04 | \n", "N | \n", "1 | \n", "-73.913200 | \n", "40.777962 | \n", "-73.926994 | \n", "40.772743 | \n", "2 | \n", "... | \n", "6.5 | \n", "0.0 | \n", "0.5 | \n", "0.00 | \n", "0.0 | \n", "NaN | \n", "0.3 | \n", "7.30 | \n", "2 | \n", "1 | \n", "
4 | \n", "2 | \n", "2015-03-19 18:53:50 | \n", "2015-03-19 18:59:04 | \n", "N | \n", "1 | \n", "-73.925888 | \n", "40.827602 | \n", "-73.916351 | \n", "40.824966 | \n", "1 | \n", "... | \n", "5.5 | \n", "1.0 | \n", "0.5 | \n", "0.00 | \n", "0.0 | \n", "NaN | \n", "0.3 | \n", "7.30 | \n", "2 | \n", "1 | \n", "
5 rows × 21 columns
\n", "