{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reproducible Data Analysis in Jupyter\n",
"\n",
"*Jake VanderPlas, March 2017*\n",
"\n",
"Jupyter notebooks provide a useful environment for interactive exploration of data. A common question, though, is how you can progress from this nonlinear, interactive, trial-and-error style of analysis to a more linear and reproducible analysis based on organized, well-tested code. This series of videos shows an example of how I approach reproducible data analysis within the Jupyter notebook.\n",
"\n",
"Each video is approximately 5-8 minutes; the videos are\n",
"available in a [YouTube Playlist](https://www.youtube.com/playlist?list=PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ).\n",
"Alternatively, below you can find the videos with some description and lists of relevant resources"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Quick utility to embed the videos below\n",
"from IPython.display import YouTubeVideo\n",
"def embed_video(index, playlist='PLYCpMb24GpOC704uO9svUrihl-HY1tTJJ'):\n",
" return YouTubeVideo('', index=index - 1, list=playlist, width=600, height=350)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 1: Loading and Visualizing Data\n",
"\n",
"*In this video, I introduce the dataset, and use the Jupyter notebook to download and visualize it.*"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relevant resources:\n",
"\n",
"- [Fremont Bridge Bike Counter](http://www.seattle.gov/transportation/bikecounter_fremont.htm): the website where you can explore the data\n",
"\n",
"- [A Whirlwind Tour of Python](https://github.com/jakevdp/WhirlwindTourOfPython): my book introducing the Python programming language, aimed at scientists and engineers.\n",
"\n",
"- [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook): my book introducing Python's data science tools, including an introduction to the IPython, Pandas, and Matplotlib tools used here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 2: Further Data Exploration\n",
"\n",
"*In this video, I do some slightly more sophisticated visualization with the data, using matplotlib and pandas.*"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relevant Resources:\n",
"\n",
"- [Pivot Tables Section](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.09-Pivot-Tables.ipynb) from the Python Data Science Handbook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 3: Version Control with Git & GitHub\n",
"\n",
"*In this video, I set up a repository on GitHub and commit the notebook into version control.*"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(3)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Relevant Resources:\n",
"\n",
"- [Version Control With Git](https://swcarpentry.github.io/git-novice/): excellent novice-level tutorial from Software Carpentry\n",
"- [Github Guides](https://guides.github.com/): set of tutorials on using GitHub\n",
"- [The Whys and Hows of Licensing Scientific Code](http://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/): my 2014 blog post on AstroBetter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 4: Working with Data and GitHub\n",
"\n",
"*In this video, I refactor the data download script so that it only downloads the data when needed*"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 5: Creating a Python Package\n",
"\n",
"*In this video, I move the data download utility into its own separate package*"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relevant Resources:\n",
"\n",
"- [How To Package Your Python Code](https://python-packaging.readthedocs.io/): broad tutorial on Python packaging."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 6: Unit Testing with PyTest\n",
"\n",
"*In this video, I add unit tests for the data download utility*"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relevant resources:\n",
"\n",
"- [Pytest Documentation](http://doc.pytest.org/)\n",
"- [Getting Started with Pytest](https://jacobian.org/writing/getting-started-with-pytest/): a nice tutorial by Jacob Kaplan-Moss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 7: Refactoring for Speed\n",
"\n",
"*In this video, I refactor the data download function to be a bit faster*"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relevant Resources:\n",
"\n",
"- [Python ``strftime`` reference](http://strftime.org/)\n",
"- [Pandas Datetime Section](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.11-Working-with-Time-Series.ipynb) from the Python Data Science Handbook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 8: Debugging a Broken Function\n",
"\n",
"*In this video, I discover that my refactoring has caused a bug. I debug it and fix it.*"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(8)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 8.5: Finding and Fixing a scikit-learn bug\n",
"\n",
"*In this video, I discover a bug in the scikit-learn codebase, and go through the process of submitting a GitHub Pull Request fixing the bug*"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(9)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 9: Further Data Exploration: PCA and GMM\n",
"\n",
"*In this video, I apply unsupervised learning techniques to the data to explore what we can learn from it*"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relevant Resources:\n",
"\n",
"- [Principal Component Analysis In-Depth](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.09-Principal-Component-Analysis.ipynb) from the Python Data Science Handbook\n",
"- [Gaussian Mixture Models In-Depth](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.12-Gaussian-Mixtures.ipynb) from the Python Data Science Handbook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part 10: Cleaning-up the Notebook\n",
"\n",
"*In this video, I clean-up the unsupervised learning analysis to make it more reproducible and presentable.*"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embed_video(11)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relevant Resources:\n",
"\n",
"- [Learning Seattle's Work Habits from Bicycle Counts](https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/): My 2015 blog post using Fremont Bridge data"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3.5",
"language": "",
"name": "python3.5"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}