{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%autosave 10"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "javascript": [
        "IPython.notebook.set_autosave_interval(10000)"
       ],
       "metadata": {},
       "output_type": "display_data"
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Autosaving every 10 seconds\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## What, problem\n",
      "\n",
      "- Python allows rapid prototyping\n",
      "- But after profiling and finding slowdown need to speed up the bottleneck.\n",
      "- Need to keep team speed high.\n",
      "    - Need to profile quickly, 30 minutes, not set up an expensive framework.\n",
      "    - Yet the bus factor of heavily optimized code must be larger than one.\n",
      "        - \"Bus factor\": how many people can get hit by buses until system unmaintainable.\n",
      "    - So performance optimizations can't be esoteric.\n",
      "- Why is this important\n",
      "    - Want to keep tasks fast and yet fit onto one machine.\n",
      "    - Else need to manage clusters\n",
      "    - 8GB RAM, 4 cores, ~hundreds GB SSD.\n",
      "- Book: \"High Performance Python\", O'Reilly.\n",
      "\n",
      "##\u00a0cProfile\n",
      "\n",
      "- CPU profiler, traces calls.\n",
      "- Combine with `RunSnakeRun` to visualise.\n",
      "- But can't drill into Python C modules, e.g. `abs`.\n",
      "- Don't get argument analysis; what set of arguments is causing pathological behaviour for a given function?\n",
      "    !!AI surely you'd make more than one function to allow this type of profiling?\n",
      "\n",
      "## line_profiler\n",
      "\n",
      "- line-by-line profiling\n",
      "- requires a decorator that'll fail your unit tests.\n",
      "    - !!AI in a previous talk presenter said you can make a dummy decorator\n",
      "- line_profiler, indeed all profilers, can't interrogate compound statements\n",
      "    - !!AI again, just break it down.\n",
      "\n",
      "##\u00a0memory_profiler\n",
      "\n",
      "- same decorator, method as line_profiler.\n",
      "- uses `psutil` to ask OS for memory consumption.\n",
      "    - we're not asking Python for memory occupancy of objects.\n",
      "- C modules don't tell Python how big they are, but since we're asking OS still works.\n",
      "- In IPython, `%memit` is magic incantation, e.g.\n",
      "\n",
      "    %memit [0]*1000000\n",
      "\n",
      "### memory_profiler mprof\n",
      "\n",
      "- measure difference between two codebases.\n",
      "- did my pull request make a meanginful difference? How does the difference vary over time?\n",
      "- `scikit-learn` pull request 2248.\n",
      "\n",
      "### transforming memory_profiler into a resource profiler?\n",
      "\n",
      "- Talking with author to also measure I/O, both on disk and over network.\n",
      "- Draw plots comparing CPU / memory / I/O over time.\n",
      "- So can do: CPU, memory, disk I/O, network I/O\n",
      "- `psutil` could also let us:\n",
      "    - mmaps?\n",
      "    - file handles?\n",
      "    - network connecions?\n",
      "    - cache utilisation via libperf?\n",
      "        - instructions per cycle. could be too low, using numppy improves it.\n",
      "        - if data set too big can't fit into L1/L2, and this could tell you.\n",
      "- Could allow quick overview of an application without having to do deep code reading.\n",
      "- Presenter has used `perfstat` to profile CPython externally, no reason why `libperf` couldn't be used too.\n",
      "\n",
      "##\u00a0Cython\n",
      "\n",
      "- Hands-down, easiest and fastest way to optimize Python.\n",
      "- But you need to annotate code, write C-like code, so reduces team agility.\n",
      "- If you've profiled and found one hot function, great use Cython.\n",
      "- But once you've done it the bus factor drops, you have to educate people on Cython and compiling C.\n",
      "\n",
      "## Cython + NumPy + OpenMP nogil\n",
      "\n",
      "- Use NumPy to escape from CPython control; just a continguous array of bytes.\n",
      "- Then escape the GIL, use OpenMP to transaprently parallelise over cores.\n",
      "\n",
      "## Shedskin\n",
      "\n",
      "- Point Shedskin at module with a main routine.\n",
      "- Shedskin does autonomous type annotation, then converts to C.\n",
      "- It's just like Cython, but you do no work.\n",
      "- However it doesn't work with NumPy, doesn't work on byte arrays.\n",
      "    - Shedskin copies all Python datastructures into C world, so double memory occupancy.\n",
      "- Idea: why not take AST of annotated Shedskin output and create a dodgy first guess annotated Cython file.\n",
      "    - wouldn't work first-time, but a hell of a hint.\n",
      "    - not implemented, an idea.\n",
      "\n",
      "## Pythran\n",
      "\n",
      "- Pass in another DSL, not same as Cython.\n",
      "    - Still, superior to Cython because you just need two lines for his example.\n",
      "- Use `#pythran` annotation.\n",
      "- Support of OpenMP on numpy arrays.\n",
      "\n",
      "## PyPy\n",
      "\n",
      "- Fast, production, Python 2.7 compatible, ready for pure-Python code.\n",
      "    - Many companies have switched to it for e.g. web servers.\n",
      "- Limited support for pre-existing C extentions\n",
      "- `numpypy` has bugs, incomplete, not production ready. If you try it add extensive unit tests.\n",
      "\n",
      "## Numba\n",
      "\n",
      "- Simple decorator, `@jit(nopython=True)`\n",
      "- LLVM, compile down to LLVM instruction language.\n",
      "    - So not just C as output, but can compile down to GPU instructions.\n",
      "- API is very unstable, in flux.\n",
      "    - You need to experiment and play with it.\n",
      "\n",
      "## Tool tradeoffs\n",
      "\n",
      "- PyPy, no learning curve, easiest win, pure Python only.\n",
      "- ShedSkin easy, pure Python only.\n",
      "- Cython, pure Python, hours to learn, team cost low.\n",
      "- Cython + NumPy + OpenMP, days to learn, high cost.\n",
      "- Numba has extreme dependency requirements (mainly LLVM), tricky to install. Could use Anaconda, but then depend on Anaconda.\n",
      "- Pythran is simple, hours to learn. Short projects looking for quick win then try it.\n",
      "- numexpr (not covered), intelligently vecotirses numpy expressions.\n",
      "    - !!AI pandas transparently uses this.\n",
      "\n",
      "##\u00a0Wrapup\n",
      "\n",
      "- Need better, richer profiling tools.\n",
      "- 4-12 physical cores is becoming commonplace. Need to exploit it.\n",
      "- Hand-annotating code reduces agility\n",
      "- JIT/AST compilers getting better, still requires manual intervention.\n",
      "- Ultimately: hardware is cheaper than people. So consider costs of this too."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Questions\n",
      "\n",
      "- Author's Cython workflow is to use its annotation mode, which shows yellow for code that calls into CPython. Want to avoid yellow.\n",
      "    - He makes six-seven subdirectories of different code, makes six-seven HTML annotation output, then compare yellowness to CPU times."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}