{
 "metadata": {
  "name": "",
  "signature": "sha256:c743a32b7a0580b5370cac247dfeaa7811bf93559c9e6eb4b817b658d84c2233"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Introduction to MEmPaMaL\n",
      "========================\n",
      "\n",
      "Example with Scikit-learn\n",
      "-------------------------\n",
      "\n",
      "In this example, we take the classical iris dataset."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "from mempamal.datasets import iris\n",
      "X, y = iris.get_data()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The pipeline will contains:\n",
      "\n",
      "- scaling of the data: centering and scaling wrt. the standard deviation\n",
      "- logistic regression with default parameters\n",
      "\n",
      "The goodness of fit is estimated with:\n",
      "\n",
      "- a 5-folds (stratified) cross-validation\n",
      "- the score function is a F1 score"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "from sklearn.linear_model.logistic import LogisticRegression\n",
      "from sklearn.preprocessing.data import StandardScaler\n",
      "from sklearn.cross_validation import StratifiedKFold\n",
      "from sklearn.pipeline import Pipeline\n",
      "from sklearn.metrics import f1_score"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "s1 = StandardScaler(with_mean=True, with_std=True)\n",
      "s2 = LogisticRegression()\n",
      "p = [(\"scaler\", s1), (\"logit\", s2)]\n",
      "est = Pipeline(p)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here is an illustration on only one of the folds:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fold_iter = StratifiedKFold(y, n_folds=5)\n",
      "train, test = fold_iter.__iter__().next()\n",
      "X_train, X_test = X[train], X[test]\n",
      "y_train, y_test = y[train], y[test]\n",
      "y_pred = est.fit(X_train, y_train).predict(X_test)\n",
      "f1_score(y_test, y_pred)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "0.82949701619778338"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Example with Scikit-learn + MEmPaMaL + Soma-Workflow\n",
      "----------------------------------------------------"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "from mempamal.configuration import JSONify_estimator, JSONify_cv, build_dataset\n",
      "from mempamal.workflow import create_wf, save_wf"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We just take the same estimator:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "s1 = StandardScaler(with_mean=True, with_std=True)\n",
      "s2 = LogisticRegression(C=1e4)\n",
      "p = [(\"scaler\", s1), (\"logit\", s2)]\n",
      "est = Pipeline(p)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We jsonify the estimator and the cross-validation configuration:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "method_conf = JSONify_estimator(est, out=\"./est.json\")\n",
      "cv_conf = JSONify_cv(StratifiedKFold, cv_kwargs={\"n_folds\": 5},\n",
      "                     score_func=f1_score, stratified=True,\n",
      "                     out=\"./cv.json\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We build the dataset in the current directory. \n",
      "It's create a ``dataset.joblib`` file. \n",
      "Then we create the workflow in our internal format (``create_wf``). \n",
      "With ``verbose=True``, it prints the commands on ``stdout``.\n",
      "And finally, we output the workflow (``save_wf``) in the soma-workflow format \n",
      "and write it to ``workflow.json`` (need soma-workflow)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dataset = build_dataset(X, y, method_conf, cv_conf, \".\")\n",
      "wfi = create_wf(dataset['folds'], cv_conf, method_conf, \".\",\n",
      "               verbose=True)\n",
      "wf = save_wf(wfi, \"./workflow.json\", mode=\"soma-workflow\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_0.pkl 0\n",
        "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_1.pkl 1\n",
        "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_2.pkl 2\n",
        "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_3.pkl 3\n",
        "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_4.pkl 4\n",
        "python mempamal/scripts/outer_reducer.py ./final_res.pkl ./red_res_{outer}.pkl\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We print all the dependencies and we can check that \n",
      "the *Final Reduce* depends on all *map tasks*."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for dep in wfi[1]: print(dep)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "('|--- Map outer=0', '|- Final reduce')\n",
        "('|--- Map outer=1', '|- Final reduce')\n",
        "('|--- Map outer=2', '|- Final reduce')\n",
        "('|--- Map outer=3', '|- Final reduce')\n",
        "('|--- Map outer=4', '|- Final reduce')\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we create a ``WorkflowController`` and we submit the workflow. \n",
      "We wait for workflow completion then we read the final results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from soma_workflow.client import WorkflowController\n",
      "\n",
      "import time\n",
      "import json\n",
      "import sklearn.externals.joblib as joblib\n",
      "\n",
      "controller = WorkflowController()\n",
      "wf_id = controller.submit_workflow(workflow=wf, name=\"first example\")\n",
      "\n",
      "while controller.workflow_status(wf_id) != 'workflow_done':\n",
      "    time.sleep(2)\n",
      "print(joblib.load('./final_res.pkl'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "light mode\n",
        "{'std': 0.025080367485459092, 'raw': array([ 0.93333333,  1.        ,  0.96658312,  0.96658312,  0.93265993]), 'median': 0.96658312447786132, 'mean': 0.9598319029897977}"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 10
    }
   ],
   "metadata": {}
  }
 ]
}