{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#### New to Plotly?\n", "Plotly's Python library is free and open source! [Get started](https://plotly.com/python/getting-started/) by downloading the client and [reading the primer](https://plotly.com/python/getting-started/).\n", "
You can set up Plotly to work in [online](https://plotly.com/python/getting-started/#initialization-for-online-plotting) or [offline](https://plotly.com/python/getting-started/#initialization-for-offline-plotting) mode, or in [Jupyter notebooks](https://plotly.com/python/getting-started/#start-plotting-online).\n", "
We also have a quick-reference [cheatsheet](https://images.plot.ly/plotly-documentation/images/python_cheat_sheet.pdf) (new!) to help you get started!\n", "#### Version Check\n", "Plotly's Python package is updated frequently. Run `pip install plotly --upgrade` to use the latest version." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2.0.1'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import plotly\n", "plotly.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Imports\n", "This notebook explores a 3.9 GB CSV file containing NYC's 311 complaints since 2003. It's the most popular data set in [NYC's open data portal](https://nycopendata.socrata.com/data). This is a primer on out-of-memory data analysis with\n", "- [pandas](http://pandas.pydata.org/): A library with easy-to-use data structures and data analysis tools. It also interfaces with on-disk databases like SQLite.\n", "- [IPython notebook](https://ipython.org/notebook.html): An interface for writing and sharing Python code, text, and plots.\n", "- [SQLite](https://www.sqlite.org/): A self-contained, serverless database that's easy to set up and query from pandas.\n", "- [Plotly](https://plotly.com/python/): A platform for publishing beautiful, interactive graphs from Python to the web.\n", "\n", "The dataset is too large to load into a pandas dataframe all at once. Instead, we'll perform out-of-memory aggregations with SQLite and load only the (much smaller) query results into dataframes with pandas' `read_sql_query`. It's easy to stream a CSV into SQLite, and SQLite requires no setup. The SQL query language is fairly intuitive coming from a pandas mindset." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import plotly.tools as tls\n", "tls.embed('https://plotly.com/~chris/7365')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "from sqlalchemy import create_engine  # database connection\n", "import datetime as dt\n", "from IPython.display import display\n", "\n", "import plotly.plotly as py  # interactive graphing\n", "import plotly.graph_objs as go" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import the CSV data into SQLite\n", "1. Load the CSV, chunk by chunk, into a DataFrame\n", "2. Process each chunk a bit and strip out the uninteresting columns\n", "3. Append it to the SQLite database\n", "\n", "A bare-bones sketch of this pattern appears just before the full import loop below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pd.read_csv('311_100M.csv', nrows=2).head()  # Peek at the first two rows to see the columns" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 8281035\r\n" ] } ], "source": [ "!wc -l < 311_100M.csv  # Number of lines in the dataset" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "disk_engine = create_engine('sqlite:///311_8M.db')  # Initializes a database with the filename 311_8M.db in the current directory" ] },
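{ "cell_type": "markdown", "metadata": {}, "source": [ "Before the real import, here is the whole pattern in miniature - a sketch only; the full loop below adds timing, datetime parsing, and column pruning:\n", "\n", "```python\n", "# Stream the CSV in 20,000-row chunks, appending each chunk to the 'data' table\n", "for df in pd.read_csv('311_100M.csv', chunksize=20000, iterator=True, encoding='utf-8'):\n", "    df.to_sql('data', disk_engine, if_exists='append')\n", "```" ] },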
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "//anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1164: DtypeWarning:\n", "\n", "Columns (17) have mixed types. Specify dtype option on import or set low_memory=False.\n", "\n", "//anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1164: DtypeWarning:\n", "\n", "Columns (8,46) have mixed types. Specify dtype option on import or set low_memory=False.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "6 seconds: completed 20000 rows\n", "12 seconds: completed 40000 rows\n", "18 seconds: completed 60000 rows\n", "24 seconds: completed 80000 rows\n", "30 seconds: completed 100000 rows\n", "... [progress lines trimmed] ...\n", "3045 seconds: completed 8280000 rows\n", "3047 seconds: completed 8300000 rows\n" ] } ], "source": [ "start = dt.datetime.now()\n", "chunksize = 20000\n", "j = 0\n", "index_start = 1\n", "\n", "for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):\n", "\n", "    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})  # Remove spaces from column names\n", "\n", "    df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])  # Convert to datetimes\n", "    df['ClosedDate'] = pd.to_datetime(df['ClosedDate'])\n", "\n", "    df.index += index_start\n", "\n", "    # Keep only the interesting columns and drop the rest\n", "    columns = ['Agency', 'CreatedDate', 'ClosedDate', 'ComplaintType',\n", "               'Descriptor', 'TimeToCompletion', 'City']\n", "\n", "    for c in df.columns:\n", "        if c not in columns:\n", "            df = df.drop(c, axis=1)\n", "\n", "    j += 1\n", "    print '{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize)\n", "\n", "    df.to_sql('data', disk_engine, if_exists='append')\n", "    index_start = df.index[-1] + 1" ] },
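{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check - a sketch that wasn't part of the original session - you can count the rows that landed in SQLite:\n", "\n", "```python\n", "pd.read_sql_query('SELECT COUNT(*) FROM data', disk_engine)\n", "```" ] },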
{ "cell_type": "markdown", "metadata": {}, "source": [ "###### `LIMIT` the number of rows that are retrieved" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_sql_query('SELECT Agency, Descriptor FROM data LIMIT 3', disk_engine)\n", "df = pd.read_sql_query('SELECT ComplaintType, Descriptor, Agency '\n", "                       'FROM data '\n", "                       'LIMIT 10', disk_engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Filter rows with `WHERE`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_sql_query('SELECT ComplaintType, Descriptor, Agency '\n", "                       'FROM data '\n", "                       'WHERE Agency = \"NYPD\" '\n", "                       'LIMIT 10', disk_engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Filter multiple values in a column with `WHERE` and `IN`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_sql_query('SELECT ComplaintType, Descriptor, Agency '\n", "                       'FROM data '\n", "                       'WHERE Agency IN (\"NYPD\", \"DOB\") '\n", "                       'LIMIT 10', disk_engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Find the unique values in a column with `DISTINCT`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_sql_query('SELECT DISTINCT City FROM data', disk_engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Query value counts with `COUNT(*)` and `GROUP BY`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '\n", "                       'FROM data '\n", "                       'GROUP BY Agency ', disk_engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Order the results with `ORDER BY` and `-`\n", "The Housing and Development Dept receives the most complaints" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '\n", "                       'FROM data '\n", "                       'GROUP BY Agency '\n", "                       'ORDER BY -num_complaints', disk_engine)\n", "\n", "py.iplot([go.Bar(x=df.Agency, y=df.num_complaints)], filename='311/most common complaints by agency')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Heat / Hot Water is the most common complaint" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql_query('SELECT ComplaintType, COUNT(*) as `num_complaints`, Agency '\n", "                       'FROM data '\n", "                       'GROUP BY `ComplaintType` '\n", "                       'ORDER BY -num_complaints', disk_engine)\n", "\n", "most_common_complaints = df  # used later\n", "py.iplot({\n", "    'data': [go.Bar(x=df['ComplaintType'], y=df.num_complaints)],\n", "    'layout': {\n", "        'margin': {'b': 150},  # Make the bottom margin a bit bigger to handle the long text\n", "        'xaxis': {'tickangle': 40}}  # Angle the labels a bit\n", "    }, filename='311/most common complaints by complaint type')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*This graph is interactive. Click-and-drag horizontally to zoom, shift-click to pan, double-click to autoscale.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### What's the most common complaint in each city?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's see how many cities are recorded in the dataset" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1758" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(pd.read_sql_query('SELECT DISTINCT City FROM data', disk_engine))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yikes - let's just plot the 10 most complained-about cities" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '\n", "                       'FROM data '\n", "                       'GROUP BY `City` '\n", "                       'ORDER BY -num_complaints '\n", "                       'LIMIT 10 ', disk_engine)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Flushing and FLUSHING, Jamaica and JAMAICA... the city names are case-sensitive." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Perform case-insensitive queries with `GROUP BY` and `COLLATE NOCASE`" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '\n", "                       'FROM data '\n", "                       'GROUP BY `City` '\n", "                       'COLLATE NOCASE '\n", "                       'ORDER BY -num_complaints '\n", "                       'LIMIT 11 ', disk_engine)\n", "cities = list(df.City)\n", "cities.remove(None)  # drop the NULL city\n", "\n", "traces = []  # the series in the graph - one trace for each city\n", "\n", "for city in cities:\n", "    df = pd.read_sql_query('SELECT ComplaintType, COUNT(*) as `num_complaints` '\n", "                           'FROM data '\n", "                           'WHERE City = \"{}\" COLLATE NOCASE '\n", "                           'GROUP BY `ComplaintType` '\n", "                           'ORDER BY -num_complaints'.format(city), disk_engine)\n", "\n", "    traces.append(go.Bar(x=df['ComplaintType'], y=df.num_complaints, name=city.capitalize()))\n", "\n", "py.iplot({'data': traces, 'layout': go.Layout(barmode='stack', xaxis={'tickangle': 40}, margin={'b': 150})}, filename='311/complaints by city stacked')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*You can also `click` on the legend entries to hide/show the traces. Click-and-drag to zoom in and shift-drag to pan.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's normalize these counts. This is super easy now that the data has been reduced into a dataframe." ] },
{ "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for trace in traces:\n", "    trace['y'] = 100. * trace['y'] / sum(trace['y'])\n", "\n", "py.iplot({'data': traces,\n", "          'layout': go.Layout(\n", "              barmode='group',\n", "              xaxis={'tickangle': 40, 'autorange': False, 'range': [-0.5, 16]},\n", "              yaxis={'title': 'Percent of Complaints by City'},\n", "              margin={'b': 150},\n", "              title='Relative Number of 311 Complaints by City')\n", "          }, filename='311/relative complaints by city', validate=False)" ] },
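{ "cell_type": "markdown", "metadata": {}, "source": [ "For a single query result, the same normalization is a pandas one-liner (a sketch, assuming a dataframe `df` with a `num_complaints` column):\n", "\n", "```python\n", "pct = 100. * df['num_complaints'] / df['num_complaints'].sum()\n", "```" ] },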
{ "cell_type": "markdown", "metadata": {}, "source": [ "- New York is loud\n", "- Staten Island is moldy, wet, and vacant\n", "- Flushing's muni meters are broken\n", "- Trash collection is great in the Bronx\n", "- Woodside doesn't like its graffiti\n", "\n", "Click and drag to pan across the graph and see more of the complaints." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2: SQLite time series with pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Filter SQLite rows with timestamp strings: `YYYY-MM-DD hh:mm:ss`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_sql_query('SELECT ComplaintType, CreatedDate, City '\n", "                       'FROM data '\n", "                       'WHERE CreatedDate < \"2014-11-16 23:47:00\" '\n", "                       'AND CreatedDate > \"2014-11-16 23:45:00\"', disk_engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Pull out the hour unit from timestamps with `strftime`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df = pd.read_sql_query('SELECT CreatedDate, '\n", "                       'strftime(\\'%H\\', CreatedDate) as hour, '\n", "                       'ComplaintType '\n", "                       'FROM data '\n", "                       'LIMIT 5 ', disk_engine)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Count the number of complaints (rows) per hour with `strftime`, `GROUP BY`, and `COUNT(*)`" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql_query('SELECT CreatedDate, '\n", "                       'strftime(\\'%H\\', CreatedDate) as hour, '\n", "                       'COUNT(*) as `Complaints per Hour` '\n", "                       'FROM data '\n", "                       'GROUP BY hour', disk_engine)\n", "\n", "py.iplot({\n", "    'data': [go.Bar(x=df['hour'], y=df['Complaints per Hour'])],\n", "    'layout': go.Layout(xaxis={'title': 'Hour in Day'},\n", "                        yaxis={'title': 'Number of Complaints'})}, filename='311/complaints per hour')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Filter noise complaints by hour" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ " CreatedDate hour Complaints per Hour\n", "0 2004-08-19 00:54:43.000000 00 41373\n", "1 2008-08-29 01:07:39.000000 01 34588" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_sql_query('SELECT CreatedDate, '\n", "                       'strftime(\\'%H\\', CreatedDate) as `hour`, '\n", "                       'COUNT(*) as `Complaints per Hour` '\n", "                       'FROM data '\n", "                       'WHERE ComplaintType IN (\"Noise\", '\n", "                       '\"Noise - Street/Sidewalk\", '\n", "                       '\"Noise - Commercial\", '\n", "                       '\"Noise - Vehicle\", '\n", "                       '\"Noise - Park\", '\n", "                       '\"Noise - House of Worship\", '\n", "                       '\"Noise - Helicopter\", '\n", "                       '\"Collection Truck Noise\") '\n", "                       'GROUP BY hour', disk_engine)\n", "\n", "py.iplot({\n", "    'data': [go.Bar(x=df['hour'], y=df['Complaints per Hour'])],\n", "    'layout': go.Layout(xaxis={'title': 'Hour in Day'},\n", "                        yaxis={'title': 'Number of Complaints'},\n", "                        title='Number of Noise Complaints in NYC by Hour in Day'\n", "                        )}, filename='311/noise complaints per hour')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Segregate complaints by hour" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "complaint_traces = {}  # Each series in the graph will represent a complaint\n", "complaint_traces['Other'] = {}\n", "\n", "for hour in range(1, 24):\n", "    hour_str = '0' + str(hour) if hour < 10 else str(hour)\n", "    df = pd.read_sql_query('SELECT CreatedDate, '\n", "                           'ComplaintType, '\n", "                           'strftime(\\'%H\\', CreatedDate) as `hour`, '\n", "                           'COUNT(*) as num_complaints '\n", "                           'FROM data '\n", "                           'WHERE hour = \"{}\" '\n", "                           'GROUP BY ComplaintType '\n", "                           'ORDER BY -num_complaints'.format(hour_str), disk_engine)\n", "\n", "    complaint_traces['Other'][hour] = sum(df.num_complaints)\n", "\n", "    # Grab the 7 most common complaints for that hour\n", "    for i in range(7):\n", "        complaint = df.get_value(i, 'ComplaintType')\n", "        count = df.get_value(i, 'num_complaints')\n", "        complaint_traces['Other'][hour] -= count\n", "        if complaint in complaint_traces:\n", "            complaint_traces[complaint][hour] = count\n", "        else:\n", "            complaint_traces[complaint] = {hour: count}\n", "\n", "traces = []\n", "for complaint in complaint_traces:\n", "    traces.append({\n", "        'x': range(25),\n", "        'y': [complaint_traces[complaint].get(i, None) for i in range(25)],\n", "        'name': complaint,\n", "        'type': 'bar'\n", "    })\n", "\n", "py.iplot({\n", "    'data': traces,\n", "    'layout': {\n", "        'barmode': 'stack',\n", "        'xaxis': {'title': 'Hour in Day'},\n", "        'yaxis': {'title': 'Number of Complaints'},\n", "        'title': 'The 7 Most Common 311 Complaints by Hour in a Day'\n", "    }}, filename='311/most common complaints by hour')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Aggregated time series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, create a new column with timestamps rounded to the previous 15-minute interval" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, `GROUP BY` that interval and `COUNT(*)`" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "minutes = 
15  # length of each time bin\n", "\n", "# Rounding trick: strftime('%s', CreatedDate) is the Unix time in seconds (text that\n", "# SQLite coerces to an integer). Integer division by {seconds}, then multiplication,\n", "# snaps each timestamp down to the start of its bin; datetime(..., 'unixepoch')\n", "# converts the rounded value back into a timestamp string.\n", "seconds = 15*60\n", "\n", "df = pd.read_sql_query('SELECT CreatedDate, '\n", "                       'datetime(('\n", "                       'strftime(\\'%s\\', CreatedDate) / {seconds}) * {seconds}, \\'unixepoch\\') interval '\n", "                       'FROM data '\n", "                       'LIMIT 10 '.format(seconds=seconds), disk_engine)\n", "\n", "df = pd.read_sql_query('SELECT datetime(('\n", "                       'strftime(\\'%s\\', CreatedDate) / {seconds}) * {seconds}, \\'unixepoch\\') interval, '\n", "                       'COUNT(*) as \"Complaints / interval\" '\n", "                       'FROM data '\n", "                       'GROUP BY interval '\n", "                       'ORDER BY interval '\n", "                       'LIMIT 500'.format(seconds=seconds), disk_engine)\n", "\n", "py.iplot(\n", "    {\n", "        'data': [{\n", "            'x': df.interval,\n", "            'y': df['Complaints / interval'],\n", "            'type': 'bar'\n", "        }],\n", "        'layout': {\n", "            'title': 'Number of 311 Complaints per 15 Minutes'\n", "        }\n", "    }, filename='311/complaints per 15 minutes')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hours = 24\n", "minutes = hours*60\n", "seconds = minutes*60\n", "\n", "df = pd.read_sql_query('SELECT datetime(('\n", "                       'strftime(\\'%s\\', CreatedDate) / {seconds}) * {seconds}, \\'unixepoch\\') interval, '\n", "                       'COUNT(*) as \"Complaints / interval\" '\n", "                       'FROM data '\n", "                       'GROUP BY interval '\n", "                       'ORDER BY interval'.format(seconds=seconds), disk_engine)\n", "\n", "py.iplot(\n", "    {\n", "        'data': [{\n", "            'x': df.interval,\n", "            'y': df['Complaints / interval'],\n", "            'type': 'bar'\n", "        }],\n", "        'layout': {\n", "            'title': 'Number of 311 Complaints per Day'\n", "        }\n", "    }, filename='311/complaints per day')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### References\n", "\n", "- Find more open data sets on [Data.gov](https://data.gov) and [NYC Open Data](https://nycopendata.socrata.com)\n", "- Learn how to set up [MySQL with pandas and Plotly](http://moderndata.plot.ly/graph-data-from-mysql-database-in-python/)\n", "- Add [interactive widgets to IPython notebooks](http://moderndata.plot.ly/widgets-in-ipython-notebook-and-plotly/) for customized data exploration\n", "- Big data workflows with [HDF5 and pandas](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas)\n", "- [Interactive graphing with Plotly](https://plotly.com/python/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting git+https://github.com/plotly/publisher.git\n", " Cloning https://github.com/plotly/publisher.git to c:\\users\\brand\\appdata\\local\\temp\\pip-req-build-c28th7u5\n", "Installing collected packages: publisher\n", " Found existing installation: publisher 0.11\n", " Uninstalling publisher-0.11:\n", " Successfully uninstalled publisher-0.11\n", " Running setup.py install for publisher: started\n", " Running setup.py install for publisher: finished with status 'done'\n", "Successfully installed publisher-0.11\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Python27\\lib\\site-packages\\IPython\\nbconvert.py:13: ShimWarning: The `IPython.nbconvert` package has been deprecated since IPython 4.0. 
You should import from nbconvert instead.\n", " \"You should import from nbconvert instead.\", ShimWarning)\n", "C:\\Python27\\lib\\site-packages\\publisher\\publisher.py:53: UserWarning: Did you \"Save\" this notebook before running this command? Remember to save, always save.\n", " warnings.warn('Did you \"Save\" this notebook before running this command? '\n" ] } ], "source": [ "from IPython.display import display, HTML\n", "\n", "display(HTML(''))\n", "display(HTML(''))\n", "\n", "! pip install git+https://github.com/plotly/publisher.git --upgrade\n", "import publisher\n", "publisher.publish(\n", " 'sqlite.ipynb', 'python/big-data-analytics-with-pandas-and-sqlite/', 'Big Data Analytics with Pandas and SQLite', \n", " 'A primer on out-of-memory analytics of large datasets with Pandas, SQLite, and IPython notebooks.',\n", " title='Big Data Workflow with Pandas and SQLite | Plotly', has_thumbnail='false',\n", " redirect_from='ipython-notebooks/big-data-analytics-with-pandas-and-sqlite/',\n", " language='python', page_type='example_index', display_as='databases', order=4, ipynb= '~notebook_demo/1')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 1 }