{
"metadata": {
"name": "",
"signature": "sha256:2c00f0ddb7215d7c4cea19200a00bf8d1c743b2cbb0d2fadb582c83fab517cbf"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Intro\n",
"\n",
"Below is:\n",
"\n",
"- A *summary* of key points that cropped up again and again.\n",
"- A set of *proposed readings by topic*\n",
"\n",
"Note that these notes haven't been edited or extended with audio recordings I've made, and some need updating with significantly more material. See the \"Proposed Readings by Topic* section or ask me for more details.\n",
"\n",
"## Summary\n",
"\n",
"- **Both the financial and academic worlds are increasingly adopting Python** for the same reasons.\n",
" - They're often encumbered with extremely large, heterogenous, legacy systems.\n",
" - Old work isn't discarded. Incremental additions, re-use over interfaces.\n",
" - Think COBOL/Excel/VBA for finance, FORTRAN/C in academia.\n",
" - They're both seeking **one** paradigm as an end-to-end high-level solution for all users.\n",
" - Neither can sacrifice performance, yet are finding the time-to-market and development lifecycles too long with legacy homogenous systems.\n",
" - Python has long had a reputation for being unable to deliver performance, but this is now commonly acknowledged to no longer be true.\n",
" - Python easily serves as a glue to high performance wrappers such as NumPy and SciPy, file formats such as HDF5, and heterogenous computation backends such as shared-memory parallelism (SMP) (OpenMP via Cython), GPUs (CUDA) and FPGAs (OpenCL).\n",
" - Achieving C/C++ performance can be done with significantly simpler code and designs, yet requires sophisticated knowledge of memory cache hierarchies, disk I/O patterns, and SMP issues.\n",
" - The financial industry has long relied on Python as an interface to core, high-performance components, but are paranoid and extremely closed and the only way knowledge gets shared is by stealing employees from other financial companies.\n",
" - More information:\n",
" - [\"06 - Python in the Financial Industry (KEYNOTE)\"](06%20-%20Python%20in%20the%20Financial%20Industry%20%28KEYNOTE%29.ipynb)\n",
" - [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n",
" - [\"20 - Python for High Throughput Science\"](20%20-%20Python%20for%20High%20Throughput%20Science.ipynb)\n",
"\n",
"- **IPython Notebook** is universally used and loved.\n",
" - Most technical talks used IPython Notebook either for all the content or to host demos/examples.\n",
" - Of those talks around half provided direct links to a pre-initialised IPython Notebook, so attendees could either download and follow in real-time, or download later.\n",
" - Deep and expanding integration with entire scientific Python ecosystem, e.g. `matplotlib`, `pandas`, `numpy`, `bokeh`, `sympy`, ...\n",
" - Good examples are:\n",
" - [\"04 - Visualisations Using Bokeh\"](04%20-%20Visualisations%20Using%20Bokeh.ipynb)\n",
" - [\"16 - presenter notes\" (for \"Generator Showcase Showdown\")](16%20-%20presenter%20notes.ipynb)\n",
" - [\"18 - presenter notes\" (for \"Measuring Similarity and Clustering Data\")](18%20-%20presenter%20notes.ipynb)\n",
"\n",
"- **No clear future for visualisations in Python**.\n",
" - `matplotlib` is universally used for publication-quality charts. API is difficult to use but powerful and well engineered.\n",
" - It's clear that web-based visualisations are the future, and very important even for publications.\n",
" - In order to reach the browser it's also acknowledged that JavaScript is the ideal interface, rather than static images.\n",
" - But how to reach browser? Many different perspectives:\n",
" - IPython Notebook - use `matplotlib` magic incantation to draw charts, no interactivity, no JavaScript.\n",
" - Other libraries build on top of `matplotlib` of course work just as well: `ggplot`, `seaborn`, `prettyplotlib`\n",
" - [\"04 - Visualisations in Bokeh\"](04%20-%20Visualisations%20Using%20Bokeh.ipynb): people love `ggplot` in R because of the Grammar of Graphics, and people love Python because it's a one-stop stop\n",
" - So use Python with a ggplot-like grammar to auto-generate HTML5-canvas backed web visualisations using JavaScript.\n",
" - HTML5-canvas is an investment and should reap rewards over SVG-based libraries like d3.js for very complex visualisations.\n",
" - [\"12 - Getting it out there - Python-JS-web-viz\"](12%20-%20Getting%20it%20out%20there%20-%20Python-JS-web-viz.ipynb): forget Python, just code front-end in JavaScript and defer back-end and data cleaning to Python.\n",
" - d3.js, nvd3, crossfilter, rickshaw, ...\n",
" - Lightning talks\n",
" - One presenter uses Python over a websocket bridge to Angular.js to create an RShiny-type interactive chart environment.\n",
" - Another presenter showed off IPython version 2 (coming end of April 2014) functionality with interactive widgets, dynamically recreating charts based on user input.\n",
" - There is no clear conclusion, except `matplotlib` is fantastic work and stood the test of time.\n",
" - Bokeh seems very exciting but rough around the edges with a large and difficult to install set of dependencies, worth exploring the tutorials in full (!!AI which I will, and post a new article).\n",
" \n",
"- **Cython is almost universally used, but more agile methods are being sought**\n",
" - Cython is a strict superset of Python that, with annotations, allow it to reach C-like speeds.\n",
" - These annotations no longer allow Python compatibility. Can the Python community do better? Perhaps, with Shedskin, Pythran, or Numba. PyPy may eventually reach the Holy Grail of numpy compatibility.\n",
" - See:\n",
" - [\"10 - The High Performance Python Landscape\"](10%20-%20The%20High%20Performance%20Python%20Landscape.ipynb)\n",
" - [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n",
" - [\"03 - Faster Python Programs through Optimization\"](03%20-%20Faster%20Python%20Programs%20through%20Optimization.ipynb) (!!AI there is a significant quantity of missing information, I will type up soon).\n",
" - [\"11 - Shared Memory Parallelism with Python\"](11%20-%20Shared%20Memory%20Parallelism%20with%20Python.ipynb)\n",
"\n",
"- **Everyone uses `scikit-learn`**\n",
" - Well thought out, very opinionated design, strong and diverse set of core contributors.\n",
" - Stands out amongst Python packages as having > 10 contributors who equally make the same volume of contributions.\n",
" - At the very least prototype in `scikit-learn` and `nltk`.\n",
" - If you hit scalability issues people usually scale vertically (bigger boxes) or use Cython.\n",
" - See:\n",
" - [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n",
" - [\"22 - Correcting 10 years of messy CRM data\"](22%20-%20Correcting%2010%20years%20of%20messy%20CRM%20data.ipynb)\n",
" - [\"19 - Gradient Boosted Regression Trees in scikit-learn\"](19%20-%20Gradient%20Boosted%20Regression%20Trees%20in%20scikit-learn.ipynb)\n",
"\n",
"- **MapReduce/clusters have less hype and traction than you'd expect**\n",
" - Certainly there are some who use it for their data processing pipeline, e.g. \"07 - Hierarchical Text Clustering in Python and Hive\".\n",
" - Given a large data set that cannot fit onto one disk, prefer to create large RDBMS clusters. See:\n",
" - [\"05 - Databases for Scientists\"](05%20-%20Databases%20for%20Scientists.ipynb)\n",
" - [\"08 - Massively Parallel Processing with Procedural Python\"](08%20-%20Massively%20Parallel%20Processing%20with%20Procedural%20Python.ipynb)\n",
" - [Presenter Notes](08%20-%20presenter%20notes.ipynb)\n",
" - Given a large data set that cannot fit into memory prefer to use e.g. HDF5 to disk-back it, or create additional abstractions on top of NumPy/HDF5, a la \"23 - Manipulating massive disk-backed arrays\".\n",
" - `scikit-learn` core contributors strongly prefer shared-memory parallelism to clusters, and are actively creating OpenMP-style abstractions (with better debugging and NumPy array performance).\n",
" - [\"15 - Building a Cutting Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n",
" - Fantastic lightning talk on end used [FireDrake](http://firedrakeproject.org/index.html) to easily switch computing backend from SMP to GPU, but again no mention of clusters.\n",
"\n",
"## Proposed readings by group\n",
"\n",
"### Culture, industry background\n",
"\n",
"- [\"06 - Python in the Financial Industry (KEYNOTE)\"](06%20-%20Python%20in%20the%20Financial%20Industry%20%28KEYNOTE%29.ipynb)\n",
"- [\"14 - Panel Discussion - Shouldn't companies be doing more data science?\"](14%20-%20Panel%20Discussion%20-%20Shouldn%27t%20companies%20be%20doing%20more%20data%20science%3F.ipynb)\n",
"\n",
"### Case studies\n",
"\n",
"- [\"07 - Hierarchical Text Clustering in Python and Hive\"](07%20-%20Hierarchical%20Text%20Clustering%20in%20Python%20and%20Hive.ipynb)\n",
"- [\"09 - Measuring the digital economy using big data\"](09%20-%20Measuring%20the%20digital%20economy%20using%20big%20data.ipynb)\n",
"- [\"17 - Adaptive Filtering of Tweets with Machine Learning\"](17%20-%20Adaptive%20Filtering%20of%20Tweets%20with%20Machine%20Learning.ipynb)\n",
"- [\"20 - Python for High Throughput Science\"](20%20-%20Python%20for%20High%20Throughput%20Science.ipynb)\n",
"- [\"22 - Correcting 10 years of messy CRM data\"](22%20-%20Correcting%2010%20years%20of%20messy%20CRM%20data.ipynb)\n",
"\n",
"### Technical - software engineering \n",
"\n",
"- [\"03 - Faster Python Programs through Optimization\"](03%20-%20Faster%20Python%20Programs%20through%20Optimization.ipynb)\n",
" - Needs updating with significant amount of presenter material we didn't cover.\n",
"- [\"10 - The High Performance Python Landscape\"](10%20-%20The%20High%20Performance%20Python%20Landscape.ipynb)\n",
"- [\"11 - Shared Memory Parallelism with Python\"](11%20-%20Shared%20Memory%20Parallelism%20with%20Python.ipynb)\n",
"- [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n",
"\n",
"### Technical - mathematical\n",
"\n",
"- [\"02 - Introduction to Action Recognition\"](02%20-%20Introduction%20to%20Action%20Recognition.ipynb)\n",
"- [\"13 - Python for Optimization\"](13%20-%20Python%20for%20Optimization.ipynb)\n",
"- [\"18 - Measuring Similarity and Clustering Data\"](18%20-%20Measuring%20Similarity%20and%20Clustering%20Data.ipynb)\n",
" - And presenter notes: [\"18 - presenter notes\"](18%20-%20presenter%20notes.ipynb)\n",
"- [\"19 - Gradient Boosted Regression Trees in scikit-learn\"](19%20-%20Gradient%20Boosted%20Regression%20Trees%20in%20scikit-learn.ipynb)\n",
"\n",
"### Technical - other\n",
"\n",
"- [\"01 - Interactive Financial Analytics with Python and IPython\"](01%20-%20Interactive%20Financial%20Analytics%20with%20Python%20and%20IPython.ipynb)\n",
" - Presenter's tutorial where I followed along with exercises are here: [\"01 - YH_PyData_Eurex_Tutorial\"](01%20-%20YH_PyData_Eurex_Tutorial.ipynb)\n",
"- [\"05 - Databases for Scientists\"](05%20-%20Databases%20for%20Scientists.ipynb)\n",
" - Needs updating with presenter's material.\n",
"- [\"08 - Massively Parallel Processing with Procedural Python\"](08%20-%20Massively%20Parallel%20Processing%20with%20Procedural%20Python.ipynb)\n",
" - And presenter notes: [\"08 - presenter notes\"](08%20-%20presenter%20notes.ipynb)\n",
"- [\"16 - Generator Showcase Showdown\"](16%20-%20Generator%20Showcase%20Showdown.ipynb)\n",
" - And presenter notes: [\"16 - presenter notes\"](16%20-%20presenter%20notes.ipynb)\n",
"- [\"23 - Manipulating massive disk-backed arrays\"](23%20-%20Manipulating%20massive%20disk-backed%20arrays.ipynb)\n",
"\n",
"### Visualisations\n",
"\n",
"- [\"04 - Visualisations Using Bokeh\"](04%20-%20Visualisations%20Using%20Bokeh.ipynb)\n",
" - I need to significantly update with the rest of their tutorial examples.\n",
"- [\"12 - Getting it out there - Python-JS-web-viz\"](12%20-%20Getting%20it%20out%20there%20-%20Python-JS-web-viz.ipynb)\n",
"- [\"21 - Winning Ways for Your Visualization Plays\"](21%20-%20Winning%20Ways%20for%20Your%20Visualization%20Plays.ipynb)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from IPython.core.display import HTML\n",
"def css_styling():\n",
" styles = open(\"styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"\n"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 23,
"text": [
""
]
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%autosave 10"
],
"language": "python",
"metadata": {},
"outputs": [
{
"javascript": [
"IPython.notebook.set_autosave_interval(10000)"
],
"metadata": {},
"output_type": "display_data"
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Autosaving every 10 seconds\n"
]
}
],
"prompt_number": 19
}
],
"metadata": {}
}
]
}