{ "metadata": { "name": "", "signature": "sha256:2c00f0ddb7215d7c4cea19200a00bf8d1c743b2cbb0d2fadb582c83fab517cbf" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Intro\n", "\n", "Below are:\n", "\n", "- A *summary* of key points that cropped up again and again.\n", "- A set of *proposed readings by topic*.\n", "\n", "These notes haven't yet been edited or extended using the audio recordings I made, and some need updating with significantly more material. See the \"Proposed readings by group\" section or ask me for more details.\n", "\n", "## Summary\n", "\n", "- **Both the financial and academic worlds are increasingly adopting Python** for the same reasons.\n", " - They're often encumbered with extremely large, heterogeneous, legacy systems.\n", " - Old work isn't discarded; it is extended incrementally and re-used through interfaces.\n", " - Think COBOL/Excel/VBA for finance, FORTRAN/C in academia.\n", " - They're both seeking **one** paradigm as an end-to-end high-level solution for all users.\n", " - Neither can sacrifice performance, yet both are finding the time-to-market and development lifecycles too long with legacy homogeneous systems.\n", " - Python has long had a reputation for being unable to deliver performance, but this is now commonly acknowledged to no longer be true.\n", " - Python easily serves as glue for high-performance wrappers such as NumPy and SciPy, file formats such as HDF5, and heterogeneous computation backends such as shared-memory parallelism (SMP) (OpenMP via Cython), GPUs (CUDA) and FPGAs (OpenCL).\n", " - C/C++ performance can be achieved with significantly simpler code and designs, yet it still requires sophisticated knowledge of memory cache hierarchies, disk I/O patterns, and SMP issues.\n", " - The financial industry has long relied on Python as an interface to core, high-performance components, but it is paranoid and extremely closed, and the only way knowledge gets shared 
is by stealing employees from other financial companies.\n", " - More information:\n", " - [\"06 - Python in the Financial Industry (KEYNOTE)\"](06%20-%20Python%20in%20the%20Financial%20Industry%20%28KEYNOTE%29.ipynb)\n", " - [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n", " - [\"20 - Python for High Throughput Science\"](20%20-%20Python%20for%20High%20Throughput%20Science.ipynb)\n", "\n", "- **IPython Notebook** is universally used and loved.\n", " - Most technical talks used IPython Notebook either for all the content or to host demos/examples.\n", " - Of those talks, around half provided direct links to a pre-initialised IPython Notebook, so attendees could either download it and follow along in real time, or download it later.\n", " - Deep and expanding integration with the entire scientific Python ecosystem, e.g. `matplotlib`, `pandas`, `numpy`, `bokeh`, `sympy`, ...\n", " - Good examples are:\n", " - [\"04 - Visualisations Using Bokeh\"](04%20-%20Visualisations%20Using%20Bokeh.ipynb)\n", " - [\"16 - presenter notes\" (for \"Generator Showcase Showdown\")](16%20-%20presenter%20notes.ipynb)\n", " - [\"18 - presenter notes\" (for \"Measuring Similarity and Clustering Data\")](18%20-%20presenter%20notes.ipynb)\n", "\n", "- **No clear future for visualisations in Python**.\n", " - `matplotlib` is universally used for publication-quality charts. The API is difficult to use but powerful and well engineered.\n", " - It's clear that web-based visualisations are the future, and very important even for publications.\n", " - It's also acknowledged that JavaScript, rather than static images, is the ideal interface for reaching the browser.\n", " - But how to reach the browser? 
Many different perspectives:\n", " - IPython Notebook - use the `matplotlib` magic incantation to draw charts: no interactivity, no JavaScript.\n", " - Other libraries built on top of `matplotlib` of course work just as well: `ggplot`, `seaborn`, `prettyplotlib`.\n", " - [\"04 - Visualisations Using Bokeh\"](04%20-%20Visualisations%20Using%20Bokeh.ipynb): people love `ggplot` in R because of the Grammar of Graphics, and people love Python because it's a one-stop shop.\n", " - So use Python with a ggplot-like grammar to auto-generate HTML5-canvas-backed web visualisations using JavaScript.\n", " - HTML5 canvas is an investment that should pay off over SVG-based libraries like d3.js for very complex visualisations.\n", " - [\"12 - Getting it out there - Python-JS-web-viz\"](12%20-%20Getting%20it%20out%20there%20-%20Python-JS-web-viz.ipynb): forget Python for the front-end; code it in JavaScript and leave the back-end and data cleaning to Python.\n", " - d3.js, nvd3, crossfilter, rickshaw, ...\n", " - Lightning talks\n", " - One presenter used Python over a websocket bridge to Angular.js to create an RShiny-type interactive chart environment.\n", " - Another presenter showed off IPython version 2 (coming end of April 2014) functionality with interactive widgets, dynamically recreating charts based on user input.\n", " - There is no clear conclusion, except that `matplotlib` is fantastic work and has stood the test of time.\n", " - Bokeh seems very exciting but rough around the edges, with a large and difficult-to-install set of dependencies. It's worth exploring the tutorials in full (!!AI which I will, and post a new article).\n", " \n", "- **Cython is almost universally used, but more agile methods are being sought**\n", " - Cython is a superset of Python that, with type annotations, allows code to reach C-like speeds.\n", " - Annotated code is no longer valid Python. Can the Python community do better? Perhaps, with Shedskin, Pythran, or Numba. 
PyPy may eventually reach the Holy Grail of NumPy compatibility.\n", " - See:\n", " - [\"10 - The High Performance Python Landscape\"](10%20-%20The%20High%20Performance%20Python%20Landscape.ipynb)\n", " - [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n", " - [\"03 - Faster Python Programs through Optimization\"](03%20-%20Faster%20Python%20Programs%20through%20Optimization.ipynb) (!!AI a significant quantity of information is missing; I will type it up soon).\n", " - [\"11 - Shared Memory Parallelism with Python\"](11%20-%20Shared%20Memory%20Parallelism%20with%20Python.ipynb)\n", "\n", "- **Everyone uses `scikit-learn`**\n", " - Well thought out, very opinionated design, strong and diverse set of core contributors.\n", " - Stands out amongst Python packages in having more than 10 contributors who each make a similar volume of contributions.\n", " - At the very least prototype in `scikit-learn` and `nltk`.\n", " - When people hit scalability issues they usually scale vertically (bigger boxes) or use Cython.\n", " - See:\n", " - [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n", " - [\"22 - Correcting 10 years of messy CRM data\"](22%20-%20Correcting%2010%20years%20of%20messy%20CRM%20data.ipynb)\n", " - [\"19 - Gradient Boosted Regression Trees in scikit-learn\"](19%20-%20Gradient%20Boosted%20Regression%20Trees%20in%20scikit-learn.ipynb)\n", "\n", "- **MapReduce/clusters have less hype and traction than you'd expect**\n", " - Certainly some people use them for their data processing pipelines, e.g. \"07 - Hierarchical Text Clustering in Python and Hive\".\n", " - Given a large data set that cannot fit onto one disk, the preference is to create large RDBMS clusters. 
See:\n", " - [\"05 - Databases for Scientists\"](05%20-%20Databases%20for%20Scientists.ipynb)\n", " - [\"08 - Massively Parallel Processing with Procedural Python\"](08%20-%20Massively%20Parallel%20Processing%20with%20Procedural%20Python.ipynb)\n", " - [Presenter Notes](08%20-%20presenter%20notes.ipynb)\n", " - Given a large data set that cannot fit into memory, the preference is to disk-back it with e.g. HDF5, or to create additional abstractions on top of NumPy/HDF5, a la \"23 - Manipulating massive disk-backed arrays\".\n", " - `scikit-learn` core contributors strongly prefer shared-memory parallelism to clusters, and are actively creating OpenMP-style abstractions (with better debugging and NumPy array performance).\n", " - [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n", " - A fantastic lightning talk at the end used [Firedrake](http://firedrakeproject.org/index.html) to easily switch the computing backend from SMP to GPU, but again there was no mention of clusters.\n", "\n", "## Proposed readings by group\n", "\n", "### Culture, industry background\n", "\n", "- [\"06 - Python in the Financial Industry (KEYNOTE)\"](06%20-%20Python%20in%20the%20Financial%20Industry%20%28KEYNOTE%29.ipynb)\n", "- [\"14 - Panel Discussion - Shouldn't companies be doing more data science?\"](14%20-%20Panel%20Discussion%20-%20Shouldn%27t%20companies%20be%20doing%20more%20data%20science%3F.ipynb)\n", "\n", "### Case studies\n", "\n", "- [\"07 - Hierarchical Text Clustering in Python and Hive\"](07%20-%20Hierarchical%20Text%20Clustering%20in%20Python%20and%20Hive.ipynb)\n", "- [\"09 - Measuring the digital economy using big data\"](09%20-%20Measuring%20the%20digital%20economy%20using%20big%20data.ipynb)\n", "- [\"17 - Adaptive Filtering of Tweets with Machine Learning\"](17%20-%20Adaptive%20Filtering%20of%20Tweets%20with%20Machine%20Learning.ipynb)\n", "- [\"20 - Python for High Throughput 
Science\"](20%20-%20Python%20for%20High%20Throughput%20Science.ipynb)\n", "- [\"22 - Correcting 10 years of messy CRM data\"](22%20-%20Correcting%2010%20years%20of%20messy%20CRM%20data.ipynb)\n", "\n", "### Technical - software engineering\n", "\n", "- [\"03 - Faster Python Programs through Optimization\"](03%20-%20Faster%20Python%20Programs%20through%20Optimization.ipynb)\n", " - Needs updating with a significant amount of presenter material we didn't cover.\n", "- [\"10 - The High Performance Python Landscape\"](10%20-%20The%20High%20Performance%20Python%20Landscape.ipynb)\n", "- [\"11 - Shared Memory Parallelism with Python\"](11%20-%20Shared%20Memory%20Parallelism%20with%20Python.ipynb)\n", "- [\"15 - Building a Cutting-Edge Data Processing Environment on a Budget\"](15%20-%20Building%20a%20Cutting-Edge%20Data%20Processing%20Environment%20on%20a%20Budget.ipynb)\n", "\n", "### Technical - mathematical\n", "\n", "- [\"02 - Introduction to Action Recognition\"](02%20-%20Introduction%20to%20Action%20Recognition.ipynb)\n", "- [\"13 - Python for Optimization\"](13%20-%20Python%20for%20Optimization.ipynb)\n", "- [\"18 - Measuring Similarity and Clustering Data\"](18%20-%20Measuring%20Similarity%20and%20Clustering%20Data.ipynb)\n", " - And presenter notes: [\"18 - presenter notes\"](18%20-%20presenter%20notes.ipynb)\n", "- [\"19 - Gradient Boosted Regression Trees in scikit-learn\"](19%20-%20Gradient%20Boosted%20Regression%20Trees%20in%20scikit-learn.ipynb)\n", "\n", "### Technical - other\n", "\n", "- [\"01 - Interactive Financial Analytics with Python and IPython\"](01%20-%20Interactive%20Financial%20Analytics%20with%20Python%20and%20IPython.ipynb)\n", " - The presenter's tutorial, where I followed along with the exercises, is here: [\"01 - YH_PyData_Eurex_Tutorial\"](01%20-%20YH_PyData_Eurex_Tutorial.ipynb)\n", "- [\"05 - Databases for Scientists\"](05%20-%20Databases%20for%20Scientists.ipynb)\n", " - Needs updating with the presenter's material.\n", "- [\"08 - Massively Parallel 
Processing with Procedural Python\"](08%20-%20Massively%20Parallel%20Processing%20with%20Procedural%20Python.ipynb)\n", " - And presenter notes: [\"08 - presenter notes\"](08%20-%20presenter%20notes.ipynb)\n", "- [\"16 - Generator Showcase Showdown\"](16%20-%20Generator%20Showcase%20Showdown.ipynb)\n", " - And presenter notes: [\"16 - presenter notes\"](16%20-%20presenter%20notes.ipynb)\n", "- [\"23 - Manipulating massive disk-backed arrays\"](23%20-%20Manipulating%20massive%20disk-backed%20arrays.ipynb)\n", "\n", "### Visualisations\n", "\n", "- [\"04 - Visualisations Using Bokeh\"](04%20-%20Visualisations%20Using%20Bokeh.ipynb)\n", " - I need to update this significantly with the rest of their tutorial examples.\n", "- [\"12 - Getting it out there - Python-JS-web-viz\"](12%20-%20Getting%20it%20out%20there%20-%20Python-JS-web-viz.ipynb)\n", "- [\"21 - Winning Ways for Your Visualization Plays\"](21%20-%20Winning%20Ways%20for%20Your%20Visualization%20Plays.ipynb)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.display import HTML\n", "def css_styling():\n", "    # Read the custom stylesheet and wrap it so the notebook renders it as HTML.\n", "    with open(\"styles/custom.css\", \"r\") as f:\n", "        styles = f.read()\n", "    return HTML(styles)\n", "css_styling()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n" ], "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 19 } ], "metadata": {} } ] }