{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "- Computing is a tool to achieve an objective, not en ends in itself\n", " - Based on experience of PhD in Quantum Physics\n", "\n", "## Problem\n", "\n", "- Using machine learning to understand brain function\n", " - Functional MRI; records time/spatial report of brain activity\n", "- Learn link between brain activity and cognitive function\n", " - Feature engineering an input to map it accurately to brain function.\n", " - This might inform on how brain is actually doing the mapping.\n", "\n", "## Prior Art\n", "\n", "- Visual image reconstruction from brain activity, Miyawaki et al 2008\n", " - Very impressive, but not reproducible.\n", " - Science needs to be reproducible.\n", "- Make it work, make it right, make it boring.\n", "- Want to robustly and reliably reproduce scientific results, make them boring\n", "\n", "## What is a good theory\n", "\n", "1. Accurately describe a large class of observations (training)\n", "2. Definite predictions about future observations (testing)\n", "\n", "- This is machine learning!\n", "- Not just minimising error; data driven science is using data to derive better models, not just making classifiers.\n", "\n", "## Problems with software development in labs\n", "\n", "- Labs are like startups\n", " - Recruiting talent, keeping them.\n", " - Limited resources.\n", " - \"Bus factor\". 
How many people can be hit by a bus before your project stops.\n", "- You really need to engineer software well so the project survives you moving on, for whatever reason.\n", "- Technical debt\n", " - You need to do the maintenance, documentation, and testing, or else your project will inevitably die off.\n", "\n", "## Patterns in data processing\n", "\n", "1. Interact with data manually.\n", "2. Automate the interaction.\n", "3. Go to 1.\n", "\n", "- Iteration goes with consolidation.\n", " - As you iterate you reduce technical debt and get closer to the goal.\n", "- Academia is moving from statistics to statistical learning (formal machine learning)\n", " - Mainly due to the dimensionality of feature sets.\n", "- From parameter inference to prediction.\n", "\n", "## Design philosophy\n", "\n", "1. Don't solve hard problems; bend the original problem.\n", " - Judo technique.\n", "2. Easy setup.\n", " - Think about installation steps, dependencies, convention over configuration.\n", "3. Fail gracefully.\n", " - Robust.\n", " - Easy to debug (a major, key success point of Python; much easier to debug than C).\n", "4. Quality.\n", "5. Don't invent a kitchen sink.\n", " - Keep your focus as narrow as possible.\n", " - This increases the bus factor.\n", " - As you need features, create new projects and link them.\n", "\n", "## scikit-learn\n", "\n", "- Presenter is a core contributor.\n", "- Vision: machine learning without knowing the math.\n", " - A black box, but one that can be opened.\n", "- Apple vs Linux.\n", " - Older geeks tend to use Apple products. Things should just work.\n", "- This module can't magically solve feature engineering for you.\n", " - But Python is the perfect language for solving this yourself.\n", "- Sticking to high-level programming keeps scikit-learn alive.\n", " - But how do you stay performant at this high level?\n", " - Optimise algorithms, not low-level stuff.\n", " - Know NumPy and SciPy perfectly.\n", " - All data must be arrays/memoryviews. 
Avoid memory copies, defer to BLAS/LAPACK.\n", " - Cython.\n", " - scikit-learn actively avoids C/C++.\n", " - Increases the bus factor.\n", " - New contributors always complain, but this philosophy works.\n", "- http://scipy-lectures.github.io\n", "\n", "## Hierarchical clustering\n", "\n", "- Pull request 2199\n", "- How\n", " 1. Take the two closest clusters.\n", " 2. Merge them.\n", " 3. Update the distance matrix.\n", "- First approach:\n", " - How to find the minimum? Heaps!\n", " - Sparse growable structures? Skip lists in Cython!\n", "- Second approach:\n", " - But C++ `map[int, float]` is what I need? So wrap it in Cython!\n", "\n", "## Data vs operations\n", "\n", "- Conceptually, have a big blob of data, and operations are agents that walk over the data.\n", "- Want an imperative-like language.\n", " - Declarative programming is great in theory but doesn't work in practice.\n", "- Core grammar\n", " - `fit`, `predict`, `transform`, `score`, `partial_fit`\n", "- Grammar instantiated **without** data.\n", "- Build pipelines around the grammar without data.\n", " - Configuration/run pattern, a la `traits`, `pyre`.\n", " - This is just convention, very light. You can ignore it if you want, but if you submit a pull request ignoring it you'll get rejected.\n", "- A la currying in functional programming.\n", "- A la the MVC pattern.\n", "- APIs are important, and informed by prior art and heuristics, despite how simple they seem.\n", "\n", "## Big data on small hardware\n", "\n", "- Can't afford Hadoop, and want to use Python end-to-end.\n", "- Off-the-shelf commodity hardware (laptops!)\n", "- One trick: online algorithms\n", " - Compute something one element at a time.\n", " - e.g. mean of a gazillion numbers? Just do a running mean.\n", " - Use algorithms that statistically converge to the true value with some estimable error.\n", "- e.g. 
K-Means clustering.\n", " - `scipy.cluster.vq.kmeans` is precise, slow.\n", " - `sklearn.cluster.MiniBatchKMeans` is statistical, much faster.\n", "- People complain \"I need a cluster to add petabytes of arrays\"\n", " - Why?? Use online algorithms.\n", "\n", "## Data reductions\n", "\n", "- Remember memory is hierarchical. Reducing data sets allows more to fit in higher levels of the hierarchy.\n", "- Take a random subset\n", " - Random projection: `sklearn.random_projection` (averages features)\n", " - e.g. Randomized SVD, `sklearn.utils.extmath.randomized_svd`\n", " - Their randomized solution is more accurate than other supposedly precise solutions.\n", "\n", "## Their box\n", "\n", "- 48 cores, 384GB RAM, 70T storage (SSD cache on RAID controller)\n", "- Faster than an 800-CPU cluster!\n", "- Do you really need a cluster? Think about data access patterns.\n", "\n", "## Parallel processing\n", "\n", "- Only want to care about embarrassingly parallel problems.\n", "- Data access / the memory bus is going to be the bottleneck.\n", "- `joblib`.\n", " - OpenMP-style. Why not e.g. `IPython`, `multiprocessing`, `celery`?\n", " - No dependencies.\n", " - Better tracebacks.\n", " - Automatic mmap'ing of big arrays, no copies.\n", " - Lazy dispatching, important for big jobs.\n", " - With random forests, perfect 100% multi-core CPU allocation with low memory allocation.\n", "\n", "## Need caching\n", "\n", "- `joblib.Memory`, memoize pattern.\n", "- Stores very large results on disk, only returns if you `get`, and even then only if you iterate over it.\n", "- You must write functions; he will never make a context manager.\n", "- How to hash input arguments for the memoize decorator?\n", " - `hashlib.md5`, robust, no dependencies.\n", " - Subclass the pickler, which is a state machine that walks the object graph.\n", " - If you walk and find something e.g. ndarrays, don't turn it into a string, just pass a pointer. 
Avoid copies, use memoryviews.\n", "- When persisting objects, again subclass the pickler and e.g. `np.save` big NumPy arrays.\n", "- How to handle locking when persisting results to the cache?\n", " - Rely on renaming directories being atomic, a basic POSIX operation.\n", "- Should I compress data to/from disk?\n", " - Single core: faster uncompressed.\n", " - Multi core: `zlib.compress` is faster; used again because it has no dependencies.\n", " - But use it in an online way.\n", " - Copyless compression: store metadata too.\n", "- Challenges\n", " - How to stream large results in a cluster.\n", " - Because too many files are slow; file open is slow on a cluster.\n", "\n", "## Bigger picture - how to make a sustainable project\n", "\n", "- 200 contributors, ~12 core contributors.\n", "- The huge feature set is due to this size of team.\n", "- Random Forests are getting faster by orders of magnitude because of community contributions.\n", "\n", "1. Focus on quality.\n", "2. Build great docs and examples.\n", "\n", "- scikit-learn has a very large number of contributors making a proportionally large number of commits.\n", " - Unlike many other Python modules.\n", "\n", "## Tragedy of the Commons\n", "\n", "- SciPy, NumPy - everyone uses them, but not enough people contribute.\n", "- These core, vital projects **don't** get funding.\n", "\n", "## Heuristics\n", "\n", "1. Set goals correctly. 80/20 rule. Focus on core goals.\n", "2. Use the simplest technology available. This requires great sophistication.\n", "3. Don't forget - real humans use your package.\n", "\n", "## Questions\n", "\n", "- How to encourage contributors?\n", " - Avoid GUIs with a passion. Don't do it.\n", " - Just focus on getting users; the rest follows.\n", " - Don't dumb down the problem." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }