{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What, problem\n", "\n", "- Python allows rapid prototyping\n", "- But after profiling and finding a slowdown, need to speed up the bottleneck.\n", "- Need to keep team speed high.\n", " - Need to profile quickly, 30 minutes, not set up an expensive framework.\n", " - Yet the bus factor of heavily optimized code must be larger than one.\n", " - \"Bus factor\": how many people can get hit by buses until the system is unmaintainable.\n", " - So performance optimizations can't be esoteric.\n", "- Why is this important\n", " - Want to keep tasks fast and yet fit onto one machine.\n", " - Else need to manage clusters\n", " - 8GB RAM, 4 cores, ~hundreds GB SSD.\n", "- Book: \"High Performance Python\", O'Reilly.\n", "\n", "## cProfile\n", "\n", "- CPU profiler, traces calls.\n", "- Combine with `RunSnakeRun` to visualise.\n", "- But can't drill into Python C modules, e.g. 
`abs`.\n", "- Don't get argument analysis; what set of arguments is causing pathological behaviour for a given function?\n", " !!AI surely you'd make more than one function to allow this type of profiling?\n", "\n", "## line_profiler\n", "\n", "- line-by-line profiling\n", "- requires a decorator that'll fail your unit tests.\n", " - !!AI in a previous talk presenter said you can make a dummy decorator\n", "- line_profiler, indeed all profilers, can't interrogate compound statements\n", " - !!AI again, just break it down.\n", "\n", "## memory_profiler\n", "\n", "- same decorator, method as line_profiler.\n", "- uses `psutil` to ask the OS for memory consumption.\n", " - we're not asking Python for memory occupancy of objects.\n", "- C modules don't tell Python how big they are, but since we're asking the OS it still works.\n", "- In IPython, `%memit` is the magic incantation, e.g.\n", "\n", " %memit [0]*1000000\n", "\n", "### memory_profiler mprof\n", "\n", "- measure difference between two codebases.\n", "- did my pull request make a meaningful difference? How does the difference vary over time?\n", "- `scikit-learn` pull request 2248.\n", "\n", "### transforming memory_profiler into a resource profiler?\n", "\n", "- Talking with author to also measure I/O, both on disk and over network.\n", "- Draw plots comparing CPU / memory / I/O over time.\n", "- So can do: CPU, memory, disk I/O, network I/O\n", "- `psutil` could also let us:\n", " - mmaps?\n", " - file handles?\n", " - network connections?\n", " - cache utilisation via libperf?\n", " - instructions per cycle. 
could be too low, using NumPy improves it.\n", " - if data set too big can't fit into L1/L2, and this could tell you.\n", "- Could allow quick overview of an application without having to do deep code reading.\n", "- Presenter has used `perf stat` to profile CPython externally, no reason why `libperf` couldn't be used too.\n", "\n", "## Cython\n", "\n", "- Hands-down, easiest and fastest way to optimize Python.\n", "- But you need to annotate code, write C-like code, so reduces team agility.\n", "- If you've profiled and found one hot function, great, use Cython.\n", "- But once you've done it the bus factor drops, you have to educate people on Cython and compiling C.\n", "\n", "## Cython + NumPy + OpenMP nogil\n", "\n", "- Use NumPy to escape from CPython control; just a contiguous array of bytes.\n", "- Then escape the GIL, use OpenMP to transparently parallelise over cores.\n", "\n", "## Shedskin\n", "\n", "- Point Shedskin at module with a main routine.\n", "- Shedskin does autonomous type annotation, then converts to C.\n", "- It's just like Cython, but you do no work.\n", "- However it doesn't work with NumPy, doesn't work on byte arrays.\n", " - Shedskin copies all Python datastructures into C world, so double memory occupancy.\n", "- Idea: why not take AST of annotated Shedskin output and create a dodgy first guess annotated Cython file.\n", " - wouldn't work first-time, but a hell of a hint.\n", " - not implemented, an idea.\n", "\n", "## Pythran\n", "\n", "- Pass in another DSL, not same as Cython.\n", " - Still, superior to Cython because you just need two lines for his example.\n", "- Use `#pythran` annotation.\n", "- Support of OpenMP on numpy arrays.\n", "\n", "## PyPy\n", "\n", "- Fast, production, Python 2.7 compatible, ready for pure-Python code.\n", " - Many companies have switched to it for e.g. web servers.\n", "- Limited support for pre-existing C extensions\n", "- `numpypy` has bugs, incomplete, not production ready. 
If you try it add extensive unit tests.\n", "\n", "## Numba\n", "\n", "- Simple decorator, `@jit(nopython=True)`\n", "- LLVM, compile down to LLVM instruction language.\n", " - So not just C as output, but can compile down to GPU instructions.\n", "- API is very unstable, in flux.\n", " - You need to experiment and play with it.\n", "\n", "## Tool tradeoffs\n", "\n", "- PyPy, no learning curve, easiest win, pure Python only.\n", "- ShedSkin easy, pure Python only.\n", "- Cython, pure Python, hours to learn, team cost low.\n", "- Cython + NumPy + OpenMP, days to learn, high cost.\n", "- Numba has extreme dependency requirements (mainly LLVM), tricky to install. Could use Anaconda, but then depend on Anaconda.\n", "- Pythran is simple, hours to learn. Short projects looking for quick win then try it.\n", "- numexpr (not covered), intelligently vectorises numpy expressions.\n", " - !!AI pandas transparently uses this.\n", "\n", "## Wrapup\n", "\n", "- Need better, richer profiling tools.\n", "- 4-12 physical cores are becoming commonplace. Need to exploit them.\n", "- Hand-annotating code reduces agility\n", "- JIT/AST compilers getting better, still require manual intervention.\n", "- Ultimately: hardware is cheaper than people. So consider costs of this too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions\n", "\n", "- Author's Cython workflow is to use its annotation mode, which shows yellow for code that calls into CPython. Want to avoid yellow.\n", " - He makes six-seven subdirectories of different code, makes six-seven HTML annotation outputs, then compares yellowness to CPU times." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }