{ "metadata": { "name": "", "signature": "sha256:c743a32b7a0580b5370cac247dfeaa7811bf93559c9e6eb4b817b658d84c2233" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Introduction to MEmPaMaL\n", "========================\n", "\n", "Example with Scikit-learn\n", "-------------------------\n", "\n", "In this example, we take the classical iris dataset." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from mempamal.datasets import iris\n", "X, y = iris.get_data()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pipeline will contains:\n", "\n", "- scaling of the data: centering and scaling wrt. the standard deviation\n", "- logistic regression with default parameters\n", "\n", "The goodness of fit is estimated with:\n", "\n", "- a 5-folds (stratified) cross-validation\n", "- the score function is a F1 score" ] }, { "cell_type": "code", "collapsed": true, "input": [ "from sklearn.linear_model.logistic import LogisticRegression\n", "from sklearn.preprocessing.data import StandardScaler\n", "from sklearn.cross_validation import StratifiedKFold\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.metrics import f1_score" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": true, "input": [ "s1 = StandardScaler(with_mean=True, with_std=True)\n", "s2 = LogisticRegression()\n", "p = [(\"scaler\", s1), (\"logit\", s2)]\n", "est = Pipeline(p)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an illustration on only one of the folds:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "fold_iter = StratifiedKFold(y, n_folds=5)\n", "train, test = fold_iter.__iter__().next()\n", "X_train, X_test = X[train], X[test]\n", "y_train, y_test = y[train], y[test]\n", "y_pred = est.fit(X_train, y_train).predict(X_test)\n", "f1_score(y_test, y_pred)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "0.82949701619778338" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example with Scikit-learn + MEmPaMaL + Soma-Workflow\n", "----------------------------------------------------" ] }, { "cell_type": "code", "collapsed": true, "input": [ "from mempamal.configuration import JSONify_estimator, JSONify_cv, build_dataset\n", "from mempamal.workflow import create_wf, save_wf" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We just take the same estimator:" ] }, { "cell_type": "code", "collapsed": true, "input": [ "s1 = StandardScaler(with_mean=True, with_std=True)\n", "s2 = LogisticRegression(C=1e4)\n", "p = [(\"scaler\", s1), (\"logit\", s2)]\n", "est = Pipeline(p)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We jsonify the estimator and the cross-validation configuration:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "method_conf = JSONify_estimator(est, out=\"./est.json\")\n", "cv_conf = JSONify_cv(StratifiedKFold, cv_kwargs={\"n_folds\": 5},\n", " score_func=f1_score, stratified=True,\n", " out=\"./cv.json\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We build the dataset in the current directory. \n", "It's create a ``dataset.joblib`` file. \n", "Then we create the workflow in our internal format (``create_wf``). \n", "With ``verbose=True``, it prints the commands on ``stdout``.\n", "And finally, we output the workflow (``save_wf``) in the soma-workflow format \n", "and write it to ``workflow.json`` (need soma-workflow)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "dataset = build_dataset(X, y, method_conf, cv_conf, \".\")\n", "wfi = create_wf(dataset['folds'], cv_conf, method_conf, \".\",\n", " verbose=True)\n", "wf = save_wf(wfi, \"./workflow.json\", mode=\"soma-workflow\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_0.pkl 0\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_1.pkl 1\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_2.pkl 2\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_3.pkl 3\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./red_res_4.pkl 4\n", "python mempamal/scripts/outer_reducer.py ./final_res.pkl ./red_res_{outer}.pkl\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We print all the dependencies and we can check that \n", "the *Final Reduce* depends on all *map tasks*." ] }, { "cell_type": "code", "collapsed": false, "input": [ "for dep in wfi[1]: print(dep)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "('|--- Map outer=0', '|- Final reduce')\n", "('|--- Map outer=1', '|- Final reduce')\n", "('|--- Map outer=2', '|- Final reduce')\n", "('|--- Map outer=3', '|- Final reduce')\n", "('|--- Map outer=4', '|- Final reduce')\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we create a ``WorkflowController`` and we submit the workflow. \n", "We wait for workflow completion then we read the final results." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from soma_workflow.client import WorkflowController\n", "\n", "import time\n", "import json\n", "import sklearn.externals.joblib as joblib\n", "\n", "controller = WorkflowController()\n", "wf_id = controller.submit_workflow(workflow=wf, name=\"first example\")\n", "\n", "while controller.workflow_status(wf_id) != 'workflow_done':\n", " time.sleep(2)\n", "print(joblib.load('./final_res.pkl'))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "light mode\n", "{'std': 0.025080367485459092, 'raw': array([ 0.93333333, 1. , 0.96658312, 0.96658312, 0.93265993]), 'median': 0.96658312447786132, 'mean': 0.9598319029897977}" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 10 } ], "metadata": {} } ] }