{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# THIS NOTEBOOK IS A WORK IN PROGRESS\n", "\n", "##### Last modified: 2.11.2015 14:38\n", "\n", "The aim is to illustrate the usage of `naklar`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# modified version of\n", "# http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html#example-svm-plot-iris-py\n", "\n", "import os\n", "import json\n", "import pickle\n", "import pprint\n", "import numbers\n", "import itertools\n", "from functools import partial\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn import svm, datasets\n", "from sklearn import metrics\n", "\n", "import joblib\n", "\n", "from naklar import experiment\n", "\n", "\n", "# record these metrics for each experiment\n", "metric_fns = [('precision', partial(metrics.precision_score, average='micro')),\n", " ('recall', partial(metrics.recall_score, average='micro')),\n", " ('f1', partial(metrics.f1_score, average='micro')),\n", " ('accuracy', metrics.accuracy_score)]\n", "\n", "# import some data to play with\n", "iris = datasets.load_iris()\n", "X = iris.data[:, :2] # we only take the first two features. We could\n", " # avoid this ugly slicing by using a two-dim dataset\n", "y = iris.target\n", "\n", "def evaluate_settings(i, output='results', record_metrics=[], dump=['pickle', 'json', 'joblib'], **kwargs):\n", " \"\"\"Evaluate an experimental condition and dump results onto disk.\n", " \"\"\"\n", " svc = svm.SVC(**kwargs).fit(X, y)\n", " pred = svc.predict(X)\n", " \n", " conf = kwargs\n", " conf['i'] = i\n", " if record_metrics:\n", " metrics = {}\n", " for metric_name, metric in record_metrics:\n", " conf[metric_name] = metric(y, pred)\n", " \n", " # store results on disk\n", " conf['results'] = pred # results is a NumPy array\n", " if not os.path.exists(os.path.join('.', output, str(i))):\n", " os.makedirs(os.path.join('.', output, str(i)))\n", " \n", " if 'pickle' in dump:\n", " conf['path'] = os.path.join(output, str(i))\n", " conf_file = os.path.join(output, str(i), 'conf.pkl')\n", " with open(conf_file, 'wb') as out:\n", " pickle.dump(conf, out)\n", " \n", " # dump a json file\n", " if 'json' in dump:\n", " conf_file = os.path.join(output, str(i), 'conf.json')\n", " with open(conf_file, 'w') as out:\n", " conf_ = {k:v for k, v in conf.items() if isinstance(v, (str, numbers.Number, list, set))}\n", " json.dump(conf_, out)\n", " \n", " # dump a joblib file\n", " # the nice thing about joblib is that it'll dump large NumPy arrays\n", " # as separate files\n", " if 'joblib' in dump:\n", " conf_file = os.path.join(output, str(i), 'conf.jbl')\n", " joblib.dump(conf, conf_file)\n", " \n", " # store the trained model as well\n", " with open(os.path.join(output, str(i), 'svc.skl'), 'wb') as out:\n", " pickle.dump(svc, out)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kernels = ['linear', 'rbf', 'poly']\n", "Cs = [0.001, 0.1, 1, 2]\n", "\n", "# could also use sklearn.cross_validation\n", "jobs = itertools.product(kernels, Cs)\n", "for i, (kernel, C) in enumerate(jobs):\n", " evaluate_settings(i, record_metrics=metric_fns, kernel=kernel, C=C)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The experimental framework\n", "\n", "TBW" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Initialisation / Loading experiments\n", "\n", "The entire purpose of 
, { "cell_type": "markdown", "metadata": {}, "source": [ "# The experimental framework\n", "\n", "TBW" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Initialisation / Loading experiments\n", "\n", "The entire purpose of `naklar` is to make it easy to load experiment results and to select certain parts of them for analysis. The motivation comes from personal experience with hyperparameter optimisation, different preprocessing pipelines, cross-validation and the like.\n", "\n", "It's a good idea to store results from experiments on disk and then load them up separately for analysis, as opposed to doing the analysis as part of the experiment and outputting only figures. The experiments are normally not interactive and take a long time to run, whereas the analysis tends to be interactive and relatively quick. A simple experiment (such as the one above) can easily turn into hundreds of separate output directories on disk, and traversing all these file paths over and over with for loops is tedious, error prone and downright annoying.\n", "\n", "This is what `naklar` helps you with. The only assumption `naklar` makes is that result sets are separated into one directory per experimental condition and that each directory contains a dictionary detailing the settings of that one condition. Assuming there is a host of experiment results under `./results/` you can load all of them up easily." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "from naklar import experiment\n", "\n", "# experiment.reset()  # if you've loaded results into memory before you have to reset the DB\n", "experiment.initialise('./results/', autoload=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This traverses the `./results/` directory, descending into all subdirectories, and loads settings from a pickled Python dictionary called `conf.pkl`. If you've called the settings file something else, use the `dict_file` parameter." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# experiment.reset()\n", "experiment.initialise('./results/', autoload=True, dict_file='settings.pickle')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling `experiment.initialise` creates a `sqlalchemy` table definition out of all the (key, value) pairs found in the dictionaries. The data types are reflected in the following order: `[DateTime, Float, Integer, Boolean, String]`, so if a parameter does not map onto any other `SQL` type a string is used. If the data cannot be represented as a string (for instance a NumPy array) an error is raised.\n", "\n", "By default `experiment.initialise` creates an in-memory SQLite database to store the results in. Other backends can be used by calling `experiment.connect` before calling `experiment.initialise`; the `.connect` function passes all parameters (`*args` and `**kwargs`) on to `sqlalchemy.create_engine` (http://docs.sqlalchemy.org/en/rel_1_0/core/engines.html). Alternatively the connect parameters can be passed into `.initialise` as `**kwargs`.\n", "\n", "Passing `autoload=True` tells `naklar` that, in addition to traversing the entire directory tree under `./results` and creating a database schema out of the dictionaries found, it should also load all of those dictionaries into the newly created database table." ] }
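, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, to keep the database around between sessions you could point `naklar` at a file-backed SQLite database before initialising. A sketch - `experiments.db` is an arbitrary file name, and the connection string can be anything `sqlalchemy.create_engine` accepts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# sketch: use a file-backed SQLite database instead of the default\n", "# in-memory one; the string is passed through to create_engine\n", "experiment.connect('sqlite:///experiments.db')\n", "experiment.initialise('./results/', autoload=True)" ] }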
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading JSON or joblib dumps\n", "\n", "`naklar` is not limited to pickled Python dictionaries either: by providing a `load_func` it can also load other file types, for instance `json` or `joblib` dumps. The provided `load_func` callable is called once for each matching file path in any subdirectory of the root directory, with the appropriate file path as its argument, and it should return a Python dictionary." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(0, 'results/0', 0.76),\n", " (1, 'results/1', 0.8000000000000002),\n", " (10, 'results/10', 0.8133333333333334),\n", " (11, 'results/11', 0.82),\n", " (2, 'results/2', 0.82),\n", " (3, 'results/3', 0.82),\n", " (4, 'results/4', 0.8066666666666665),\n", " (5, 'results/5', 0.82),\n", " (6, 'results/6', 0.8266666666666667),\n", " (7, 'results/7', 0.8266666666666667),\n", " (8, 'results/8', 0.8066666666666665),\n", " (9, 'results/9', 0.82)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "\n", "\n", "def load_json(fpath):\n", "    with open(fpath, 'r') as fh:\n", "        conf = json.load(fh)\n", "    return conf\n", "\n", "# load json\n", "experiment.reset()\n", "experiment.initialise('./results/', autoload=True, dict_file='conf.json', load_func=load_json)\n", "experiment.select('i', 'path', 'f1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`joblib` dumps are slightly more complicated as they can contain `numpy` arrays. It's a good idea to use memory mapping to load the `numpy` arrays as they can easily take up a lot of memory; however, the memory-mapped arrays have to be closed explicitly. There is a helper function, `naklar.util.load_joblib`, that replaces numpy array values with their file path and makes sure the memory-mapped array is closed." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[(0, 'results/0', '0.76'),\n", " (1, 'results/1', '0.8'),\n", " (10, 'results/10', '0.813333333333333'),\n", " (11, 'results/11', '0.82'),\n", " (2, 'results/2', '0.82'),\n", " (3, 'results/3', '0.82'),\n", " (4, 'results/4', '0.806666666666667'),\n", " (5, 'results/5', '0.82'),\n", " (6, 'results/6', '0.826666666666667'),\n", " (7, 'results/7', '0.826666666666667'),\n", " (8, 'results/8', '0.806666666666667'),\n", " (9, 'results/9', '0.82')]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import naklar.util\n", "\n", "# load joblib\n", "# it's usually a good idea to use memory mapping in case joblib has dumped\n", "# large numpy arrays into the dictionary\n", "experiment.reset()\n", "experiment.initialise('./results/', autoload=True, dict_file='conf.jbl', load_func=naklar.util.load_joblib)\n", "experiment.select('i', 'path', 'f1')" ] }
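, { "cell_type": "markdown", "metadata": {}, "source": [ "If the helper doesn't fit your needs, a custom loader along these lines can be passed as `load_func` instead. A minimal sketch - the name `load_joblib_flat` is made up, and unlike `naklar.util.load_joblib` it relies on garbage collection to close the memory-mapped arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import joblib\n", "\n", "\n", "def load_joblib_flat(fpath):\n", "    # load the dump with memory mapping and replace any numpy array\n", "    # values with the dump's file path so that only database-friendly\n", "    # values remain in the returned dictionary\n", "    conf = joblib.load(fpath, mmap_mode='r')\n", "    return {k: (fpath if isinstance(v, np.ndarray) else v)\n", "            for k, v in conf.items()}\n", "\n", "# experiment.reset()\n", "# experiment.initialise('./results/', autoload=True, dict_file='conf.jbl',\n", "#                       load_func=load_joblib_flat)" ] }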
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameter Value Transformation Using Decorators\n", "#### Custom Getters and Setters\n", "\n", "If some of the parameters need custom `getter` code (`setter` support is somewhat untested), those can be provided to `.initialise`. One use case for this is path translation between different file systems. For instance, if the experiments are run on a separate computing cluster and the results are then downloaded onto a desktop for analysis, the path references will differ. These can be automatically translated using a custom `getter`.\n", "\n", "##### NOTE: using decorators is an experimental feature" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from functools import partial\n", "from naklar import decorator\n", "\n", "output_path = partial(decorator.translate_path,\n", "                      ptrn='/home/me/results/',\n", "                      replace='/usr/local/scratch/results/')\n", "\n", "experiment.initialise(root_dir='./results', autoload=True,\n", "                      decorators={'output': (output_path, )})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The custom getters and setters are passed to `.initialise` as a dictionary. The keys should match those defined in the settings dictionary and each value needs to be a 1- or 2-tuple (a 1-tuple is getter only, a 2-tuple is getter and setter). The get/set callable should take at least one parameter `self` and return the altered value _also in the case of a setter_.\n", "\n", "#### Adding Parameters Using Decorators\n", "\n", "It is also possible to add parameters to the `experiment` table/object using decorators, for instance adding a timestamp parsed from the experiment file path. These parameters are also defined in the decorators dictionary and they become extra columns in the table definition.\n", "\n", "Consider a settings dictionary that contains an `output` field which is the output file path for the experiment:\n", "- `/home/user/results/exp.39034.21102015141054`\n", "- `/home/user/results/exp.89045.21102015113236`\n", "\n", "where the last part is a timestamp of when the experiment was started. This timestamp can be added to the table definition by defining only a `setter` decorator; however, since `naklar` has no way of knowing what the data type of this parameter is, a column definition must also be provided." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import sqlalchemy\n", "\n", "def set_startdate(self, v):\n", "    # parse the trailing ddmmyyyyHHMMSS timestamp from the output path\n", "    pth = self.output\n", "    ts = pth[-14:]\n", "    ts = (ts[:2], ts[2:4], ts[4:8], ts[8:10], ts[10:12], ts[12:])\n", "    ts = pd.to_datetime('{}.{}.{} {}:{}:{}'.format(*ts), dayfirst=True)\n", "    return ts\n", "\n", "startdate = (None, set_startdate, sqlalchemy.Column('startdate', sqlalchemy.DateTime))\n", "experiment.initialise(root_dir='./results', autoload=True, decorators={'startdate': startdate})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first value of the decorator tuple is `None` as this corresponds to the getter method. As we don't want the getter to do anything special - the default getter that just returns the value is enough - the value can be set to anything that isn't a callable. The second parameter is the setter, and the third is the column definition." ] }
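, { "cell_type": "markdown", "metadata": {}, "source": [ "Once loaded, the derived parameter behaves like any other column on the instantiated objects. A sketch, assuming the `output` paths follow the format above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# sketch: the setter fills in 'startdate' when the settings\n", "# dictionaries are loaded, so every experiment object carries the\n", "# parsed timestamp\n", "for exp in experiment.select():\n", "    print(exp.output, exp.startdate)" ] }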
, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "##### Technical note\n", "The decorators are implemented as `sqlalchemy` hybrid properties but only ever use the first two parameters of the hybrid property - the get and set methods for the Python object. The latter two parameters are the `SQL`-side get and set methods, but _they are currently not supported_.\n", "\n", "This means that if the field `output` has a decorator, calling `experiment.select('output')` will fail, as this attempts to access `output` on the class (`experiment.E.output`) rather than on an instance. Calling `experiment.select()[0].output` is still perfectly fine as the elements in the list are instantiated objects.\n", "\n", "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The `experiment.E` object\n", "\n", "After the initialisation the reflected database definition is stored as `experiment.E`. Any parameters found in the settings dictionaries become properties on `experiment.E`; all these parameters are `sqlalchemy` properties." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "naklar.experiment.Exp" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.E" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(<sqlalchemy.orm.attributes.InstrumentedAttribute at 0x...>,\n", " <sqlalchemy.orm.attributes.InstrumentedAttribute at 0x...>)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.E.f1, experiment.E.path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bypassing `naklar`\n", "\n", "Since `naklar` is basically just a wrapper around `sqlalchemy` you can bypass the whole thing and access the backend directly. The `.initialise()` function creates a database engine as `experiment._engine` that can be used to create sessions.\n", "\n", "Using raw `sqlalchemy` queries is much more verbose but may in some cases be necessary." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('linear',), ('linear',), ('linear',), ('linear',)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('kernel', kernel='linear')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('linear',), ('linear',), ('linear',), ('linear',)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('kernel', experiment.E.kernel == 'linear')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('linear',), ('linear',), ('linear',), ('linear',)]\n" ] } ], "source": [ "from sqlalchemy.orm import Session\n", "\n", "session = Session(bind=experiment._engine)\n", "q = session.query(experiment.E.kernel)\n", "q = q.filter(experiment.E.kernel.in_(['linear']))\n", "pprint.pprint(q.all())\n", "session.close()" ] }, { "cell_type": "code", "execution_count": 206, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('results/0',), ('results/1',), ('results/2',), ('results/3',)]\n" ] } ], "source": [ "from sqlalchemy.orm import Session\n", "\n", "session = Session(bind=experiment._engine)\n", "q = session.query('path')\n", "q = q.filter(experiment.E.kernel.in_(['linear']))\n", "pprint.pprint(q.all())\n", "session.close()" ] }
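, { "cell_type": "markdown", "metadata": {}, "source": [ "Direct access to the engine also makes it easy to pull everything into `pandas` for analysis. A sketch using `pandas.read_sql`; the table `naklar` creates is named `experiment` (see the `InvalidRequestError` message in the debugging section below):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# sketch: read the experiment table into a DataFrame through the\n", "# engine created by experiment.initialise()\n", "df = pd.read_sql('SELECT i, kernel, C, f1 FROM experiment', experiment._engine)\n", "df.head()" ] }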
, { "cell_type": "markdown", "metadata": {}, "source": [ "# Selecting experiments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Select all columns of all experiments\n", "\n", "Calling `experiment.select()` with no arguments fetches all experiments currently in the database and returns a list of the reflected `sqlalchemy` rows." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[<naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Select some columns of all experiments\n", "\n", "Specifying column names instead returns `NamedTuple`s." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('results/0', 0.001, '0.76'),\n", " ('results/1', 0.1, '0.8'),\n", " ('results/10', 1.0, '0.813333333333333'),\n", " ('results/11', 2.0, '0.82'),\n", " ('results/2', 1.0, '0.82'),\n", " ('results/3', 2.0, '0.82'),\n", " ('results/4', 0.001, '0.806666666666667'),\n", " ('results/5', 0.1, '0.82'),\n", " ('results/6', 1.0, '0.826666666666667'),\n", " ('results/7', 2.0, '0.826666666666667'),\n", " ('results/8', 0.001, '0.806666666666667'),\n", " ('results/9', 0.1, '0.82')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('path', 'C', 'f1')" ] }
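, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the rows are `NamedTuple`s they unpack like ordinary tuples, which is handy in analysis loops:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# NamedTuples unpack like ordinary tuples\n", "for path, C, f1 in experiment.select('path', 'C', 'f1'):\n", "    print(path, C, f1)" ] }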
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filter experiments by parameter values" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[<naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>,\n", " <naklar.experiment.Exp at 0x...>]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select(kernel='linear')" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('results/10',), ('results/2',), ('results/6',)]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('path', C=1)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('results/10',),\n", " ('results/11',),\n", " ('results/2',),\n", " ('results/3',),\n", " ('results/6',),\n", " ('results/7',)]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('path', C=[1, 2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Non-existent settings don't matter - filter values that don't appear in the database are simply ignored:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('results/10', 1.0),\n", " ('results/11', 2.0),\n", " ('results/2', 1.0),\n", " ('results/3', 2.0),\n", " ('results/6', 1.0),\n", " ('results/7', 2.0)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('path', 'C', C=[1, 2, 3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the select statements above are translated into `sqlalchemy` queries. The `**kwargs` provided to `.select()` are applied to the query as filters and translate directly into `sqlalchemy` query filters. The following two statements are equivalent." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('results/10',),\n", " ('results/11',),\n", " ('results/2',),\n", " ('results/3',),\n", " ('results/6',),\n", " ('results/7',)]" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('path', C=[1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('results/10',),\n", " ('results/11',),\n", " ('results/2',),\n", " ('results/3',),\n", " ('results/6',),\n", " ('results/7',)]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('path', experiment.E.C.in_([1, 2, 3]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice however that for equality comparisons there is a slight syntactic difference: the `sqlalchemy` version needs `==` (double equals) for the query filter to be valid; a single equals sign is a syntax error." ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('results/10',), ('results/2',), ('results/6',)]" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('path', C=1)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('results/10',), ('results/2',), ('results/6',)]" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the sqlalchemy version uses double equals ==\n", "experiment.select('path', experiment.E.C == 1)" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "ename": "SyntaxError", "evalue": "keyword can't be an expression (<ipython-input>, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m  File \u001b[0;32m\"<ipython-input>\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m    experiment.select('path', experiment.E.C = 1)\u001b[0m\n\u001b[0m                                             ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m keyword can't be an expression\n" ] } ], "source": [ "experiment.select('path', experiment.E.C = 1)" ] }
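, { "cell_type": "markdown", "metadata": {}, "source": [ "Both filter styles can be mixed in a single call. This is a sketch - it assumes, as the examples above suggest, that `.select()` simply combines positional filter expressions and `**kwargs` filters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# sketch: a sqlalchemy filter expression combined with a keyword\n", "# filter -- linear-kernel experiments with C of at least 1\n", "experiment.select('path', experiment.E.C >= 1, kernel='linear')" ] }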
, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `experiment.E` table definition is a little cumbersome for simple queries where a parameter value has to match something or be one of a defined set. In some cases, however, it is not possible to express the query filter as a keyword argument (`C=1`), for instance when selecting all values that lie above or below some threshold. In those cases `experiment.E` becomes useful.\n", "\n", "http://docs.sqlalchemy.org/en/rel_1_0/orm/query.html?highlight=filter#sqlalchemy.orm.query.Query.filter\n", "\n", "#### Select all conditions where a metric is above some threshold\n", "\n", "Given the following results we only want to select those where the F1-score is greater than 82%, that is experiments 6 and 7." ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(0, 'linear', 0.001, '0.76'),\n", " (1, 'linear', 0.1, '0.8'),\n", " (10, 'poly', 1.0, '0.813333333333333'),\n", " (11, 'poly', 2.0, '0.82'),\n", " (2, 'linear', 1.0, '0.82'),\n", " (3, 'linear', 2.0, '0.82'),\n", " (4, 'rbf', 0.001, '0.806666666666667'),\n", " (5, 'rbf', 0.1, '0.82'),\n", " (6, 'rbf', 1.0, '0.826666666666667'),\n", " (7, 'rbf', 2.0, '0.826666666666667'),\n", " (8, 'poly', 0.001, '0.806666666666667'),\n", " (9, 'poly', 0.1, '0.82')]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('i', 'kernel', 'C', 'f1')" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[(6, 'rbf', 1.0, '0.826666666666667'), (7, 'rbf', 2.0, '0.826666666666667')]" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "experiment.select('i', 'kernel', 'C', 'f1', experiment.E.f1 > 0.82)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Hybrid Properties and Decorators\n", "\n", "`sqlalchemy` allows you to define hybrid attributes (http://docs.sqlalchemy.org/en/rel_1_0/orm/extensions/hybrid.html) for a database schema definition. These can be used to map values on disk to some other values.\n", "\n", "## Decorators\n", "Some experiment outputs are not easy to store in a database: things like file system paths or single performance metrics are fine, but entire tables of cross-validated results are a bit trickier. One way to get around this problem is to dump the metric data to a separate file (say a `pandas` data frame) and store just a path reference to that file.\n", "\n", "A decorator can then easily be added to the `experiment.E` object that returns the data frame instead of the path.\n", "\n", "\n", "## Hybrid Properties\n", "Consider for instance the case of running experiments on a different machine from the one where the analysis is done. It is likely that the file system path mappings won't be the same: the settings dictionary will contain paths that are valid on the experiment machine but not on the analysis machine - a computing cluster vs. a desktop machine for instance. These paths can easily be mapped using regular expressions.\n", "\n", "`naklar.decorator` contains some ready-made decorator functions that can be used to do this.\n", "\n", "Say for instance you needed to map `/mnt/scratch/results/` to `/usr/local/scratch/results/`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from functools import partial\n", "from naklar import decorator\n", "\n", "home_path = partial(decorator.translate_path,\n", "                    ptrn='/mnt/scratch/results/',\n", "                    replace='/usr/local/scratch/results/')\n", "\n", "experiment.initialise('/usr/local/scratch/lustre/results/', autoload=True, dict_file='args.final',\n", "                      decorators={'output': (home_path, None)})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Errors / Debugging" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## InvalidRequestError\n", "\n", "`InvalidRequestError: Table 'experiment' is already defined for this MetaData instance. Specify 'extend_existing=True' to redefine options and columns on an existing Table object.`\n", "\n", "This happens when a table definition is loaded on top of an already existing table definition. The fix is simple: just call `experiment.reset()` before `experiment.initialise()` - `.reset()` will remove the existing table definition." ] }
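, { "cell_type": "markdown", "metadata": {}, "source": [ "In code, the fix looks like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# drop the stale table definition, then initialise again\n", "experiment.reset()\n", "experiment.initialise('./results/', autoload=True)" ] }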
 ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }