{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Intro\n", "\n", "- Particle accelerator, close to speed to light. Synchotron.\n", "- Data analysis during and after run.\n", "- Many users, time constraints.\n", "- Java-based code, General Data Acquisition (GDA)\n", " - But want scientists to script/extend it.\n", " - Jython.\n", "- Every increasing throughput of data.\n", " - 2007: 10MB/s\n", " - 2009: 60MB/s\n", " - 2011: 150MB/s\n", " - 2013: 600MB/s\n", " - 2015: 6000MB/s\n", " - doubling every 7.5 months!\n", " - peaks at 1TB/day right now\n", "\n", "##\u00a0Data storage\n", "\n", "- 1PB near-line, 0.5PB on-line.\n", "- 200M+ files.\n", "- High performance parallel file systems *hate* lots of small files.\n", "- So they moved bigger files as HDF5, but it hates ASCII files.\n", "\n", "##\u00a0Big data\n", "\n", "- Volume, variety, veracity, velocity\n", "\n", "##\u00a0Tooling\n", "\n", "- Excel, MATLAB, etc., all assume data fits onto laptop\n", "- These tools do not scale to big data, at least at reasonable price.\n", "\n", "## Python\n", "\n", "- Free, easily distributable\n", "- Already used some via Jython\n", "- But how to spread it?\n", " - Extend their existing acquisition tools to spin off new analysis tools\n", "- Use PyDev and Eclipse heavily\n", " - Spun off as \"Dawn\" product\n", "- Use `scisoftpy` to tie PyDev with HDF5 storage.\n", " - Like `matplotlib` but in Eclipse, do e.g. line fitting\n", "- MATLAB/IDL requires expensive support, but with Python easier to support.\n", " - Easier to create sustainable software that survives over time.\n", "\n", "##\u00a0Optimization\n", "\n", "- Need a magnetic array to wiggle light beam as it travels\n", "- But magnets have imperfections.\n", "- Use Python to optimize an objective function.\n", "- Originally, Python was 1k slower than Fortran, because direct port.\n", "- Then NumPy, 10 times slower, but much cleaner than Fortran.\n", "- It became obvious how to improve caching, then speed same.\n", "- Eventually Python is 100 times faster.\n", "- Then instead of simulated annealing, used Artificial Immune systems\n", " - Global.\n", " - Slower.\n", " - Parallelization, very simple, embarassingly parallel.\n", " - numpy with threads.\n", " - 25 machines, 200 CPUs.\n", "\n", "## Data reduction / processing\n", "\n", "- A lot of pre-existing FORTRAN code, but only single core\n", "- Want to create a data pipeline that parallelises.\n", "- Python to glue together data pipeline.\n", " - Not so much Python to do core processing, but DIALS is a project to do this.\n", "\n", "## Tomography (Current Implementation)\n", "\n", "- Need large TIFFs to do processing, single GPU, uses CUDA\n", "- Store in HDF5, use Python to write out TIFFs, multi GPUs.\n", "- Going forward\n", " - `zeromq` as transport, `blosc` as compression\n", " - can still use CUDA via Python.\n", "\n", "##\u00a0multiprocessing + MPI \"profiling\"\n", "\n", "- Drew a multi-thread multi-thread image of profile, time series.\n", "- Based on log messages, not fine grained like `cProfile`\n", "- But discovered `h5py` holds the GIL\n", "- `google.visualization.DataTable()` JavaScript, use Python to parse logs, `jinja2` to produce HTML\n", "- Much better than grepping!\n", "\n", "## Summary\n", "\n", "- Python is the best tool to let scientists to become developers.\n", "- Rate of change of techniques and equipment is increasing.\n", " - Python supports this.\n", " \n", "http://www.dawnsci.org\n", "http://www.diamond.ac.uk" ] } ], "metadata": {} } ] }