{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "- Computing is a tool to achieve an objective, not en ends in itself\n", " - Based on experience of PhD in Quantum Physics\n", "\n", "## Problem\n", "\n", "- Using machine learning to understand brain function\n", " - Functional MRI; records time/spatial report of brain activity\n", "- Learn link between brain activity and cognitive function\n", " - Feature engineering an input to map it accurately to brain function.\n", " - This might inform on how brain is actually doing the mapping.\n", "\n", "## Prior Art\n", "\n", "- Visual image reconstruction from brain activity, Miyawaki et al 2008\n", " - Very impressive, but not reproducible.\n", " - Science needs to be reproducible.\n", "- Make it work, make it right, make it boring.\n", "- Want to robustly and reliably reproduce scientific results, make them boring\n", "\n", "## What is a good theory\n", "\n", "1. Accurately describe a large class of observations (training)\n", "2. Definite predictions about future observations (testing)\n", "\n", "- This is machine learning!\n", "- Not just minimising error; data driven science is using data to derive better models, not just making classifiers.\n", "\n", "## Problems with software development in labs\n", "\n", "- Labs are like startups\n", " - Recruiting talent, keeping them.\n", " - Limited resources.\n", " - \"Bus factor\". 
How many people can be hit by a bus before your project stops.\n", "- You really need to engineer software well so the project survives you moving on, for whatever reason.\n", "- Technical debt\n", " - You need to do the maintenance, documentation, and testing, or else your project will inevitably die off.\n", "\n", "## Patterns in data processing\n", "\n", "1. Interact with data manually.\n", "2. Automate the interaction.\n", "3. Go to 1.\n", "\n", "- Iteration goes with consolidation.\n", " - As you iterate you reduce technical debt and get closer to the goal.\n", "- Academia is moving from statistics to statistical learning (formal machine learning)\n", " - Mainly due to the dimensionality of feature sets.\n", "- From parameter inference to prediction.\n", "\n", "## Design philosophy\n", "\n", "1. Don't solve hard problems; bend the original problem.\n", " - Judo technique.\n", "2. Easy setup.\n", " - Think about installation steps, dependencies, convention over configuration.\n", "3. Fail gracefully.\n", " - Robust.\n", " - Easy to debug (a major, key success point of Python; much easier to debug than C).\n", "4. Quality.\n", "5. Don't invent a kitchen sink.\n", " - Keep your focus as narrow as possible.\n", " - This increases the bus factor.\n", " - As you need features, create new projects and link them.\n", "\n", "## scikit-learn\n", "\n", "- Presenter is a core contributor.\n", "- Vision: machine learning without knowing the math.\n", " - A black box, but one that can be opened.\n", "- Apple vs Linux.\n", " - Older geeks tend to use Apple products. Things should just work.\n", "- This module can't magically solve feature engineering for you.\n", " - But Python is the perfect language for solving this yourself.\n", "- Sticking to high-level programming keeps scikit-learn alive.\n", " - But how do you stay performant at this high level?\n", " - Optimise algorithms, not low-level stuff.\n", " - Know NumPy and SciPy perfectly.\n", " - All data must be arrays/memoryviews. 
Avoid memory copies, defer to BLAS/LAPACK.\n", " - Cython.\n", " - scikit-learn actively avoids C/C++.\n", " - Increases the bus factor.\n", " - New contributors always complain, but this philosophy works.\n", "- http://scipy-lectures.github.io\n", "\n", "## Hierarchical clustering\n", "\n", "- Pull request 2199\n", "- How\n", " 1. Take the two closest clusters.\n", " 2. Merge them.\n", " 3. Update the distance matrix.\n", "- First approach:\n", " - How to find the minimum? Heaps!\n", " - Sparse growable structures? Skip lists in Cython!\n", "- Second approach:\n", " - But C++ `map[int, float]` is what I need? So wrap it in Cython!\n", "\n", "## Data vs operations\n", "\n", "- Conceptually, have a big blob of data, and operations are agents that walk over the data.\n", "- Want an imperative-like language.\n", " - Declarative programming is great in theory but doesn't work in practice.\n", "- Core grammar\n", " - `fit`, `predict`, `transform`, `score`, `partial_fit`\n", "- Grammar instantiated **without** data.\n", "- Build pipelines around the grammar without data.\n", " - Configuration/run pattern, a la `traits`, `pyre`.\n", " - This is just convention, very light. You can ignore it if you want, but if you submit a pull request ignoring it you'll get rejected.\n", "- A la currying in functional programming.\n", "- A la the MVC pattern.\n", "- APIs are important, and informed by prior art and heuristics, despite how simple they seem.\n", "\n", "## Big data on small hardware\n", "\n", "- Can't afford Hadoop, and want to use Python end-to-end.\n", "- Off-the-shelf commodity hardware (laptops!)\n", "- One trick: online algorithms\n", " - Compute something one element at a time.\n", " - e.g. mean of a gazillion numbers? Just do a running mean.\n", " - Use algorithms that statistically converge to the true value with some estimable error.\n", "- e.g. 
K-Means clustering.\n", " - `scipy.cluster.vq.kmeans` is precise, slow.\n", " - `sklearn.cluster.MiniBatchKMeans` is statistical, much faster.\n", "- People complain \"I need a cluster to add petabytes of arrays\"\n", " - Why?? Use online algorithms.\n", "\n", "## Data reductions\n", "\n", "- Remember memory is hierarchical. Reducing data sets allows more to fit in higher levels of the hierarchy.\n", "- Take a random subset\n", " - Random projection: `sklearn.random_projection` (averages features)\n", " - e.g. Randomized SVD, `sklearn.utils.extmath.randomized_svd`\n", " - Their randomized solution is more accurate than other supposedly precise solutions.\n", "\n", "## Their box\n", "\n", "- 48 cores, 384GB RAM, 70T storage (SSD cache on RAID controller)\n", "- Faster than an 800-CPU cluster!\n", "- Do you really need a cluster? Think about data access patterns.\n", "\n", "## Parallel processing\n", "\n", "- Only want to care about embarrassingly parallel problems.\n", "- Data access / the memory bus is going to be the bottleneck.\n", "- `joblib`.\n", " - OpenMP-style. Why not e.g. `IPython`, `multiprocessing`, `celery`?\n", " - No dependencies.\n", " - Better tracebacks.\n", " - Automatic mmap'ing of big arrays, no copies.\n", " - Lazy dispatching, important for big jobs.\n", " - With random forests, perfect 100% multi-core CPU allocation with low memory allocation.\n", "\n", "## Need caching\n", "\n", "- `joblib.Memory`, memoize pattern.\n", "- Stores very large results on disk, only returns if you `get`, and even then only if you iterate over it.\n", "- You must write functions; he will never make a context manager.\n", "- How to hash input arguments for the memoize decorator?\n", " - `hashlib.md5`, robust, no dependencies.\n", " - Subclass the pickler, which is a state machine that walks the object graph.\n", " - If you walk and find something e.g. ndarrays, don't turn it into a string, just pass a pointer. 
Avoid copies, use memoryviews.\n", "- When persisting objects, again subclass the pickler and e.g. `np.save` big NumPy arrays.\n", "- How to handle locking when persisting results to the cache?\n", " - Rely on renaming directories being atomic, a basic POSIX operation.\n", "- Should I compress data to/from disk?\n", " - Single core: faster uncompressed.\n", " - Multi core: `zlib.compress` is faster; used again because it has no dependencies.\n", " - But use it in an online way.\n", " - Copyless compression: store metadata too.\n", "- Challenges\n", " - How to stream large results in a cluster.\n", " - Because too many files are slow; file open is slow on a cluster.\n", "\n", "## Bigger picture - how to make a sustainable project\n", "\n", "- 200 contributors, ~12 core contributors.\n", "- The huge feature set is due to this size of team.\n", "- Random Forests are getting faster by orders of magnitude because of community contributions.\n", "\n", "1. Focus on quality.\n", "2. Build great docs and examples.\n", "\n", "- scikit-learn has a very large number of contributors making a proportionally large number of commits.\n", " - Unlike many other Python modules.\n", "\n", "## Tragedy of the Commons\n", "\n", "- SciPy, NumPy - everyone uses them, but not enough people contribute.\n", "- These core, vital projects **don't** get funding.\n", "\n", "## Heuristics\n", "\n", "1. Set goals correctly. 80/20 rule. Focus on core goals.\n", "2. Use the simplest technology available. This requires great sophistication.\n", "3. Don't forget - real humans use your package.\n", "\n", "## Questions\n", "\n", "- How to encourage contributors?\n", " - Avoid GUIs with a passion. Don't do it.\n", " - Just focus on getting users; the rest follows.\n", " - Don't dumb down the problem." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }