{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reproducible Data Analysis in Jupyter\n", "\n", "*Jake VanderPlas, March 2017*\n", "\n", "Jupyter notebooks provide a useful environment for interactive exploration of data. A common question, though, is how you can progress from this nonlinear, interactive, trial-and-error style of analysis to a more linear and reproducible analysis based on organized, well-tested code. This series of videos shows an example of how I approach reproducible data analysis within the Jupyter notebook.\n", "\n", "Each video is approximately 5-8 minutes; the videos are\n", "available in a [YouTube Playlist](https://www.youtube.com/playlist?list=PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ).\n", "Alternatively, you can find the videos below, along with brief descriptions and lists of relevant resources." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Quick utility to embed the videos below\n", "from IPython.display import YouTubeVideo\n", "def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):\n", "    return YouTubeVideo('', index=index - 1, list=playlist, width=600, height=350)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Loading and Visualizing Data\n", "\n", "*In this video, I introduce the dataset and use the Jupyter notebook to download and visualize it.*" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relevant Resources:\n", "\n", "- [Fremont Bridge Bike Counter](http://www.seattle.gov/transportation/bikecounter_fremont.htm): the website where you can explore the data\n", "\n", "- [A Whirlwind Tour of
Python](https://github.com/jakevdp/WhirlwindTourOfPython): my book introducing the Python programming language, aimed at scientists and engineers.\n", "\n", "- [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook): my book introducing Python's data science tools, including an introduction to the IPython, Pandas, and Matplotlib tools used here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Further Data Exploration\n", "\n", "*In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.*" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relevant Resources:\n", "\n", "- [Pivot Tables Section](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.09-Pivot-Tables.ipynb) from the Python Data Science Handbook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3: Version Control with Git & GitHub\n", "\n", "*In this video, I set up a repository on GitHub and commit the notebook into version control.*" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(3)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Relevant Resources:\n", "\n", "- [Version Control With Git](https://swcarpentry.github.io/git-novice/): an excellent novice-level tutorial from Software Carpentry\n", "- [GitHub Guides](https://guides.github.com/): a set of tutorials on using GitHub\n", "- [The Whys and Hows of Licensing Scientific
Code](http://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/): my 2014 blog post on AstroBetter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 4: Working with Data and GitHub\n", "\n", "*In this video, I refactor the data download script so that it only downloads the data when needed.*" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 5: Creating a Python Package\n", "\n", "*In this video, I move the data download utility into its own separate package.*" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relevant Resources:\n", "\n", "- [How To Package Your Python Code](https://python-packaging.readthedocs.io/): a broad tutorial on Python packaging."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 6: Unit Testing with PyTest\n", "\n", "*In this video, I add unit tests for the data download utility.*" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relevant Resources:\n", "\n", "- [Pytest Documentation](http://doc.pytest.org/)\n", "- [Getting Started with Pytest](https://jacobian.org/writing/getting-started-with-pytest/): a nice tutorial by Jacob Kaplan-Moss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 7: Refactoring for Speed\n", "\n", "*In this video, I refactor the data download function to be a bit faster.*" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relevant Resources:\n", "\n", "- [Python ``strftime`` reference](http://strftime.org/)\n", "- [Pandas Datetime Section](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.11-Working-with-Time-Series.ipynb) from the Python Data Science Handbook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 8: Debugging a Broken Function\n", "\n", "*In this video, I discover that my refactoring has caused a bug.
I debug it and fix it.*" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 8.5: Finding and Fixing a scikit-learn Bug\n", "\n", "*In this video, I discover a bug in the scikit-learn codebase and go through the process of submitting a GitHub Pull Request that fixes it.*" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 9: Further Data Exploration: PCA and GMM\n", "\n", "*In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it.*" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relevant Resources:\n", "\n", "- [Principal Component Analysis In-Depth](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.09-Principal-Component-Analysis.ipynb) from the Python Data Science Handbook\n", "- [Gaussian Mixture Models In-Depth](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb) from the Python Data Science Handbook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 10: Cleaning Up the Notebook\n", "\n", "*In this video, I clean up the unsupervised learning analysis to
make it more reproducible and presentable.*" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embed_video(11)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relevant Resources:\n", "\n", "- [Learning Seattle's Work Habits from Bicycle Counts](https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/): my 2015 blog post using Fremont Bridge data" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3.5", "language": "", "name": "python3.5" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }