{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.2. Getting started with exploratory data analysis in IPython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will download and process a dataset about attendance on Montreal's bicycle tracks. This example is largely inspired by a presentation from [Julia Evans](http://nbviewer.ipython.org/github/jvns/talks/blob/master/mtlpy35/pistes-cyclables.ipynb)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. The very first step is to import the scientific packages we will be using in this recipe, namely NumPy, Pandas, and matplotlib. We also instruct matplotlib to render the figures as PNG images in the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Now, we create a new Python variable called `url` that contains the address to a CSV (**Comma-separated values**) data file. This standard text-based file format is used to store tabular data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "url = \"https://github.com/ipython-books/cookbook-data/raw/master/bikes.csv\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Pandas defines a `read_csv` function that can read any CSV file. Here, we give it the URL to the file. Pandas will automatically download and parse the file, and return a `DataFrame` object. We need to specify a few options to make sure the dates are parsed correctly." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.read_csv(url, index_col='Date', parse_dates=True, dayfirst=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. The `df` variable contains a `DataFrame` object, a specific Pandas data structure that contains 2D tabular data. The `head(n)` method displays the first `n` rows of this table." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "strip_output": [ 0, 0 ] }, "outputs": [], "source": [ "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each row corresponds to one day and contains the number of bicycles counted on each track of the city that day." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. We can get some summary statistics of the table with the `describe` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "strip_output": [ 0, 0 ] }, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Let's display some figures! We will plot the daily attendance of two tracks. First, we select the two columns `'Berri1'` and `'PierDup'`. Then, we call the `plot` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# The styling '-' and '--' is just to make the figure\n", "# readable in the black & white printed version of this book.\n", "df[['Berri1', 'PierDup']].plot(figsize=(8,4),\n", " style=['-', '--']);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. Now, we move to a slightly more advanced analysis. We will look at the attendance of all tracks as a function of the weekday. We can get the weekday easily with Pandas: the `index` attribute of the `DataFrame` contains the dates of all rows in the table. This index has a few date-related attributes, including `weekday`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.index.weekday" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we would like to have names (Monday, Tuesday, etc.) instead of numbers between 0 and 6. This can be done easily. First, we create an array `days` with all weekday names. Then, we index it by `df.index.weekday`. This operation replaces every integer in the index by the corresponding name in `days`. The first element, `Monday`, has index 0, so every 0 in `df.index.weekday` is replaced by `Monday`, and so on. We assign this new index to a new column `Weekday` in the `DataFrame`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "days = np.array(['Monday', 'Tuesday', 'Wednesday', \n", " 'Thursday', 'Friday', 'Saturday', \n", " 'Sunday'])\n", "df['Weekday'] = days[df.index.weekday]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. To get the attendance as a function of the weekday, we need to group the table by the weekday. The `groupby` method lets us do just that. Once grouped, we can sum all rows in every group." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_week = df.groupby('Weekday').sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "strip_output": [ 0, 0 ] }, "outputs": [], "source": [ "df_week" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. We can now display this information in a figure. Because `groupby` sorts the groups alphabetically, we first need to reorder the rows by weekday using `loc` (label-based indexing). Then, we plot the table, specifying the line width and the figure size." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_week.loc[days].plot(lw=3, figsize=(6,4));\n", "plt.ylim(0); # Set the bottom axis to 0." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "10. Finally, let's illustrate the new interactive capabilities of the notebook in IPython 2.0. We will plot a *smoothed* version of the track attendance as a function of time (**rolling mean**). The idea is to compute the mean value in the neighborhood of any day. The larger the neighborhood, the smoother the curve. We will create an interactive slider in the notebook to vary this parameter in real time in the plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from ipywidgets import interact\n", "#from IPython.html.widgets import interact # IPython < 4.x\n", "@interact\n", "def plot(n=(1, 30)):\n", " plt.figure(figsize=(8,4));\n", " # Rolling mean over a window of n days.\n", " df['Berri1'].rolling(n).mean().dropna().plot();\n", " plt.ylim(0, 8000);\n", " plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n", "\n", "> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }