{ "metadata": { "name": "", "signature": "sha256:ba4d280b153a4bf67c8aac3a90ea29444bacffe2655282a1374cecdda9a687ab" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Model selection with MEmPaMaL\n", "=============================\n", "\n", "Example with Scikit-learn + MEmPaMaL + Soma-Workflow\n", "----------------------------------------------------\n", "\n", "In this example, we use the classic iris dataset." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from mempamal.datasets import iris\n", "X, y = iris.get_data()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": true, "input": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.cross_validation import StratifiedKFold\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.metrics import f1_score" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pipeline contains:\n", "\n", "- scaling of the data: centering and scaling with respect to 
the standard deviation\n", "- logistic regression with default parameters\n", "\n", "The goodness of fit is estimated (outer loop) with:\n", "\n", "- a 5-fold stratified cross-validation\n", "- the F1 score as the score function\n", "\n", "The model selection is performed (inner loop) with:\n", "\n", "- a 5-fold stratified cross-validation\n", "- the F1 score as the score function" ] }, { "cell_type": "code", "collapsed": true, "input": [ "# Step 1: center each feature and scale it to unit standard deviation\n", "s1 = StandardScaler(with_mean=True, with_std=True)\n", "# Step 2: logistic regression with default parameters\n", "s2 = LogisticRegression()\n", "p = [(\"scaler\", s1), (\"logit\", s2)]\n", "est = Pipeline(p)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": true, "input": [ "from mempamal.configuration import JSONify_estimator, JSONify_cv, build_dataset\n", "from mempamal.examples.parameters_grid import make_log_grid\n", "from mempamal.workflow import create_wf, save_wf" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We serialize the estimator and the cross-validation configuration to JSON:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "method_conf = JSONify_estimator(est, out=\"./est.json\")\n", "cv_conf = JSONify_cv(StratifiedKFold, cv_kwargs={\"n_folds\": 5},\n", "                     score_func=f1_score,\n", "                     inner_cv=StratifiedKFold,\n", "                     inner_cv_kwargs={\"n_folds\": 5},\n", "                     inner_score_func=f1_score,\n", "                     stratified=True,\n", "                     out=\"./cv.json\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We build the dataset in the current directory. \n", "This creates a ``dataset.joblib`` file. \n", "Then we create the workflow in our internal format (``create_wf``). \n", "With ``verbose=True``, it prints the commands to ``stdout``.\n", "Finally, we export the workflow (``save_wf``) in the soma-workflow format \n", "and write it to ``workflow.json`` (this requires soma-workflow)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "grid = make_log_grid(X, y)\n", "dataset = build_dataset(X, y, method_conf, cv_conf, \".\", grid=grid)\n", "wfi = create_wf(dataset['folds'], cv_conf, method_conf, \".\",\n", " verbose=True)\n", "wf = save_wf(wfi, \"./workflow.json\", mode=\"soma-workflow\")" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_0_0.pkl 0 --inner 0\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_0_1.pkl 0 --inner 1\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_0_2.pkl 0 --inner 2\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_0_3.pkl 0 --inner 3\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_0_4.pkl 0 --inner 4\n", "\n", "python mempamal/scripts/inner_reducer.py ./cv.json ./est.json ./dataset.joblib ./red_res_0.pkl ./map_res_0_{inner}.pkl 0\n", "\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_1_0.pkl 1 --inner 0\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_1_1.pkl 1 --inner 1\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_1_2.pkl 1 --inner 2\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_1_3.pkl 1 --inner 3\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_1_4.pkl 1 --inner 4\n", "\n", "python mempamal/scripts/inner_reducer.py ./cv.json ./est.json ./dataset.joblib ./red_res_1.pkl ./map_res_1_{inner}.pkl 1\n", "\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_2_0.pkl 2 --inner 0\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_2_1.pkl 2 --inner 1\n", "python 
mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_2_2.pkl 2 --inner 2\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_2_3.pkl 2 --inner 3\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_2_4.pkl 2 --inner 4\n", "\n", "python mempamal/scripts/inner_reducer.py ./cv.json ./est.json ./dataset.joblib ./red_res_2.pkl ./map_res_2_{inner}.pkl 2\n", "\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_3_0.pkl 3 --inner 0\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_3_1.pkl 3 --inner 1\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_3_2.pkl 3 --inner 2\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_3_3.pkl 3 --inner 3\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_3_4.pkl 3 --inner 4\n", "\n", "python mempamal/scripts/inner_reducer.py ./cv.json ./est.json ./dataset.joblib ./red_res_3.pkl ./map_res_3_{inner}.pkl 3\n", "\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_4_0.pkl 4 --inner 0\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_4_1.pkl 4 --inner 1\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_4_2.pkl 4 --inner 2\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_4_3.pkl 4 --inner 3\n", "python mempamal/scripts/mapper.py ./cv.json ./est.json ./dataset.joblib ./map_res_4_4.pkl 4 --inner 4\n", "\n", "python mempamal/scripts/inner_reducer.py ./cv.json ./est.json ./dataset.joblib ./red_res_4.pkl ./map_res_4_{inner}.pkl 4\n", "\n", "python mempamal/scripts/outer_reducer.py ./final_res.pkl ./red_res_{outer}.pkl\n" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we create a 
``WorkflowController`` and submit the workflow. \n", "We wait for the workflow to complete, then read the final results." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from soma_workflow.client import WorkflowController\n", "\n", "import time\n", "import sklearn.externals.joblib as joblib\n", "\n", "controller = WorkflowController()\n", "wf_id = controller.submit_workflow(workflow=wf, name=\"second example\")\n", "\n", "# Poll the workflow status every 2 seconds until it is done\n", "while controller.workflow_status(wf_id) != 'workflow_done':\n", "    time.sleep(2)\n", "print(joblib.load('./final_res.pkl'))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "light mode\n", "{'std': 0.025080367485459092, 'raw': array([ 0.93333333, 1. , 0.96658312, 0.96658312, 0.93265993]), 'median': 0.96658312447786132, 'mean': 0.9598319029897977}" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 7 } ], "metadata": {} } ] }