{ "metadata": { "celltoolbar": "Slideshow", "name": "", "signature": "sha256:bf7fec4b30ca04619e75d92ba9714cc460fe502cfd5bb1a076c2e22b5a790841" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Machine Learning at Scale with Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* **[Andrea Zonca](http://twitter.com/andreazonca)**\n", "* 20 March 2014\n", "* at San Diego Supercomputer Center\n", "* **[San Diego Data Science meetup](http://www.meetup.com/San-Diego-Data-Science-R-Users-Group/events/170967362/)**" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Machine Learning at Scale with Python" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "* Setup [StarCluster](http://star.mit.edu/cluster/) to launch EC2 instances \n", "* Running IPython Notebook on Amazon EC2 \n", "* Running single node Machine Learning jobs using multiple cores \n", "* Distributing jobs with IPython parallel to multiple EC2 instances" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Machine Learning at Scale with Python" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "* Slides available at [zonca.github.io/machine-learning-at-scale-with-python](http://zonca.github.io/machine-learning-at-scale-with-python) or [http://bit.ly/ml-ec2](http://bit.ly/ml-ec2)\n", "* Slides are an `IPython notebook`\n", "* Available in executable notebook format on [github.com/zonca/machine-learning-at-scale-with-python](http://github.com/zonca/machine-learning-at-scale-with-python)\n", "* Notebook on nbviewer at [http://bit.ly/ml-ec2-ipynb](http://bit.ly/ml-ec2-ipynb)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": 
"slide" } }, "source": [ "Setup virtual environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Easier to just run the following commands at a terminal" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "# Miniconda allows to download binary instead of compiling like pip\n", "# get Miniconda from http://conda.pydata.org/miniconda.html\n", "wget http://repo.continuum.io/miniconda/Miniconda-3.0.5-Linux-x86_64.sh\n", "bash Miniconda-3.0.5-Linux-x86_64.sh\n", "conda create -n pysc --yes ipython pyzmq tornado jinja2 pandas pygments pip pycrypto\n", "source activate pysc\n", "pip install --upgrade starcluster" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then open the ipython notebook within that conda environment, i.e. activate and run \"`ipython notebook`\"" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "StarCluster configuration file" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from ConfigParser import ConfigParser\n", "config = ConfigParser()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "StarCluster configuration: credentials" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Login at [https://console.aws.amazon.com/iam/home?#users](https://console.aws.amazon.com/iam/home?#users)\n", "* Create new user\n", "* Download credentials as `credentials.csv`\n", "* Add Full EC2 and S3 permissions under Permissions\n", "![AWS create user](files/screenshots/aws-create-user.png)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "# extract first row of csv\n", "credentials = pd.read_csv(\"credentials.csv\").ix[0]" ], 
"language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "config.add_section(\"aws info\")\n", "config.set(\"aws info\", \"aws_access_key_id\", credentials[\"Access Key Id\"])\n", "config.set(\"aws info\", \"aws_secret_access_key\", credentials[\"Secret Access Key\"])" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 3 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "StarCluster configuration: key pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* From EC2 console: [https://console.aws.amazon.com/ec2](https://console.aws.amazon.com/ec2)\n", "* Click on Key Pairs\n", "* Create new Key Pair named `starcluster`\n", "* Move `starcluster.pem` in your working folder\n", "* Set permissions: `chmod 400 starcluster.pem`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# key pairs are region-specific\n", "config.set(\"aws info\", \"aws_region_name\", \"us-west-2\")\n", "config.set(\"aws info\", \"aws_region_host\", \"ec2.us-west-2.amazonaws.com\")\n", "config.add_section(\"keypair starcluster\")\n", "config.set(\"keypair starcluster\", \"key_location\", \"starcluster.pem\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import os.path\n", "def write_sc_conf(sc_conf):\n", " \"\"\"Write starcluster configuration to ~/.starcluster/config\"\"\"\n", " folder = os.path.join(os.path.expanduser(\"~\"), \".starcluster\")\n", " try:\n", " os.makedirs(folder)\n", " except:\n", " pass\n", " with open(os.path.join(folder, \"config\"), \"w\") as f:\n", " config.write(f)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ 
"write_sc_conf(config)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Available starcluster images" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "starcluster listpublic | grep 64" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "64bit Images:\n", "[0] ami-04bedf34 us-west-2 starcluster-base-ubuntu-13.04-x86_64 (EBS)\n", "[1] ami-80bedfb0 us-west-2 starcluster-base-ubuntu-13.04-x86_64-hvm (HVM-EBS)\n", "[2] ami-486afe78 us-west-2 starcluster-base-ubuntu-12.04-x86_64-hvm (HVM-EBS)\n", "[3] ami-706afe40 us-west-2 starcluster-base-ubuntu-12.04-x86_64 (EBS)\n", "[4] ami-c6bd30f6 us-west-2 starcluster-base-ubuntu-11.10-x86_64 (EBS)\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "StarCluster - (http://star.mit.edu/cluster) (v. 0.95.3)\n", "Software Tools for Academics and Researchers (STAR)\n", "Please submit bug reports to starcluster@mit.edu\n", "\n" ] } ], "prompt_number": 40 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "StarCluster configuration: cluster" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sec = \"cluster pyec2\"\n", "config.add_section(sec)\n", "config.set(sec, \"keyname\", \"starcluster\")\n", "config.set(sec, \"cluster_size\", 1)\n", "config.set(sec, \"cluster_user\", \"ipuser\")\n", "config.set(sec, \"disable_queue\", True)\n", "ami = \"ami-706afe40\"\n", "instance = \"t1.micro\"\n", "for name in [\"master\", \"node\"]:\n", " config.set(sec, name + \"_image_id\", ami)\n", " config.set(sec, name + \"_instance_type\", instance)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "config.add_section(\"global\")\n", "config.set(\"global\", 
\"default_template\", \"pyec2\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "write_sc_conf(config)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "StarCluster configuration: EBS volume" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "starcluster createvolume -n ebs1gbwest2a -i ami-fa9cf1ca --detach-volume 1 us-west-2a " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "StarCluster - (http://star.mit.edu/cluster) (v. 0.95.2)\n", "Software Tools for Academics and Researchers (STAR)\n", "Please submit bug reports to starcluster@mit.edu\n", "\n", ">>> No keypair specified, picking one from config...\n", ">>> Using keypair: starcluster\n", ">>> Creating security group @sc-volumecreator...\n", ">>> No instance in group @sc-volumecreator for zone us-west-2a, launching one now.\n", "Reservation:r-eb9f81e2\n", ">>> Waiting for volume host to come up... (updating every 30s)\n", ">>> Waiting for all nodes to be in a 'running' state...\n", "1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", ">>> Waiting for SSH to come up on all nodes...\n", "1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", ">>> Waiting for cluster to come up took 0.606 mins\n", ">>> Checking for required remote commands...\n", ">>> Creating 1GB volume in zone us-west-2a\n", ">>> New volume id: vol-53829851\n", ">>> Waiting for vol-53829851 to become 'available'... \n", ">>> Attaching volume vol-53829851 to instance i-635cbd6b...\n", ">>> Waiting for vol-53829851 to transition to: attached... 
\n", ">>> Formatting volume...\n", "Filesystem label=\n", "OS type: Linux\n", "Block size=4096 (log=2)\n", "Fragment size=4096 (log=2)\n", "Stride=0 blocks, Stripe width=0 blocks\n", "65536 inodes, 262144 blocks\n", "13107 blocks (5.00%) reserved for the super user\n", "First data block=0\n", "Maximum filesystem blocks=268435456\n", "8 block groups\n", "32768 blocks per group, 32768 fragments per group\n", "8192 inodes per group\n", "Superblock backups stored on blocks: \n", "\t32768, 98304, 163840, 229376\n", "\n", "Allocating group tables: done \n", "Writing inode tables: done \n", "Creating journal (8192 blocks): done\n", "Writing superblocks and filesystem accounting information: done\n", "\n", "mke2fs 1.42 (29-Nov-2011)\n", ">>> Detaching volume vol-53829851 from instance i-635cbd6b\n", ">>> Not terminating host instance i-635cbd6b\n", ">>> Your new 1GB volume vol-53829851 has been created successfully\n", "*** WARNING - There are still volume hosts running: i-635cbd6b\n", ">>> Creating volume took 0.947 mins\n", "```" ] }, { "cell_type": "code", "collapsed": false, "input": [ "config.add_section(\"volume data\")\n", "# this is the Amazon EBS volume id\n", "config.set(\"volume data\", \"volume_id\", \"vol-53829851\")\n", "# the path to mount this EBS volume on\n", "# (this path will also be nfs shared to all nodes in the cluster)\n", "config.set(\"volume data\", \"mount_path\", \"/data\")\n", "config.set(\"cluster pyec2\", \"volumes\", \"data\")" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "write_sc_conf(config)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "prompt_number": 29 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![AWS volume](files/screenshots/aws-volume.png)" ] }, { "cell_type": "heading", "level": 1, 
"metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "StarCluster configuration: IPython" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sec = \"plugin ipcluster\"\n", "config.add_section(sec)\n", "config.set(sec, \"setup_class\", \"starcluster.plugins.ipcluster.IPCluster\")\n", "config.set(sec, \"enable_notebook\", True)\n", "# set a password for the notebook for increased security\n", "config.set(sec, \"notebook_passwd\", \"mysupersecretpassword\")\n", "\n", "# store notebooks on EBS!\n", "config.set(sec, \"notebook_directory\", \"/data\")\n", "\n", "# pickle is faster for communication than the default JSON\n", "config.set(sec, \"packer\", \"pickle\")\n", "\n", "config.add_section(\"plugin pypackages\")\n", "config.set(\"plugin pypackages\", \"setup_class\", \"starcluster.plugins.pypkginstaller.PyPkgInstaller\")\n", "config.set(\"plugin pypackages\", \"packages\", \"scikit-learn, psutil\")\n", "\n", "config.set(\"cluster pyec2\", \"plugins\", \"pypackages, ipcluster\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "write_sc_conf(config)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Launch a single instance" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "starcluster start -s 1 pyec2" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ ">>> Using default cluster template: pyec2\n", ">>> Validating cluster template settings...\n", ">>> Cluster template settings are valid\n", ">>> Starting cluster...\n", ">>> Launching a 1-node cluster...\n", ">>> Creating security group @sc-pyec2...\n", ">>> Waiting for security group @sc-pyec2... 
\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b\n", "Reservation:r-02948d0b\n", ">>> Waiting for cluster to come up... (updating every 30s)\n", ">>> Waiting for all nodes to be in a 'running' state...\n", ">>> Waiting for SSH to come up on all nodes...\n", ">>> Waiting for cluster to come up took 1.057 mins\n", ">>> The master node is ec2-54-186-36-53.us-west-2.compute.amazonaws.com\n", ">>> Configuring cluster...\n", ">>> Attaching volume vol-53829851 to master node on /dev/sdz ...\n", ">>> Waiting for vol-53829851 to transition to: attached... 
\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b\n", ">>> Running plugin starcluster.clustersetup.DefaultClusterSetup\n", ">>> Configuring hostnames...\n", ">>> Mounting EBS volume vol-53829851 on /data...\n", ">>> Creating cluster user: ipuser (uid: 1001, gid: 1001)\n", ">>> Configuring scratch space for user(s): ipuser\n", ">>> Configuring /etc/hosts on each node\n", ">>> Starting NFS server on master\n", ">>> Setting up NFS took 0.031 mins\n", ">>> Configuring passwordless ssh for root\n", ">>> Configuring passwordless ssh for ipuser\n", ">>> Running plugin pypackages\n", ">>> Installing Python packages on all nodes:\n", ">>> $ pip install scikit-learn\n", ">>> $ pip install psutil\n", ">>> PyPkgInstaller took 2.622 mins\n", ">>> Running plugin ipcluster\n", ">>> Writing IPython cluster config files\n", ">>> Starting the IPython controller and 1 engines on master\n", ">>> Waiting for JSON connector file... \b \b|\b \b/\b \b\n", ">>> Creating IPCluster cache directory: /home/zonca/.starcluster/ipcluster\n", ">>> Authorizing tcp ports [1000-65535] on 0.0.0.0/0 for: IPython controller\n", ">>> Setting up IPython web notebook for user: ipuser\n", ">>> Creating SSL certificate for user ipuser\n", ">>> Authorizing tcp ports [8888-8888] on 0.0.0.0/0 for: notebook\n", ">>> IPython notebook URL: https://ec2-54-186-36-53.us-west-2.compute.amazonaws.com:8888\n", ">>> The notebook password is: mysupersecretpassword\n", "*** WARNING - Please check your local firewall settings if you're having\n", "*** WARNING - issues connecting to the IPython notebook\n", ">>> IPCluster has been started on SecurityGroup:@sc-pyec2 for user 'ipuser'\n", "with 1 engines on 1 nodes.\n", "\n", "To connect to cluster from your local machine use:\n", "\n", "from IPython.parallel import Client\n", "client = Client('/home/zonca/.starcluster/ipcluster/SecurityGroup:@sc-pyec2-us-west-2.json', sshkey='starcluster.pem')\n", "\n", "See the IPCluster plugin doc for usage details:\n", 
"http://star.mit.edu/cluster/docs/latest/plugins/ipython.html\n", "\n", ">>> IPCluster took 0.364 mins\n", ">>> Configuring cluster took 3.475 mins\n", ">>> Starting cluster took 7.204 mins\n", "\n", "The cluster is now ready to use. To login to the master node\n", "as root, run:\n", "\n", " $ starcluster sshmaster pyec2\n", "\n", "If you're having issues with the cluster you can reboot the\n", "instances and completely reconfigure the cluster from\n", "scratch using:\n", "\n", " $ starcluster restart pyec2\n", "\n", "When you're finished using the cluster and wish to terminate\n", "it and stop paying for service:\n", "\n", " $ starcluster terminate pyec2\n", "\n", "Alternatively, if the cluster uses EBS instances, you can\n", "use the 'stop' command to shutdown all nodes and put them\n", "into a 'stopped' state preserving the EBS volumes backing\n", "the nodes:\n", "\n", " $ starcluster stop pyec2\n", "\n", "WARNING: Any data stored in ephemeral storage (usually /mnt)\n", "will be lost!\n", "\n", "You can activate a 'stopped' cluster by passing the -x\n", "option to the 'start' command:\n", "\n", " $ starcluster start -x pyec2\n", "\n", "This will start all 'stopped' nodes and reconfigure the\n", "cluster.\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "StarCluster - (http://star.mit.edu/cluster) (v. 
0.95.3)\n", "Software Tools for Academics and Researchers (STAR)\n", "Please submit bug reports to starcluster@mit.edu\n", "\n", "0/1 | | 0% \r", "0/1 | | 0% \r", "1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", "0/1 | | 0% \r", "1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", "0/1 | | 0% \r", "1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", "0/1 | | 0% \r", "1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", "0/1 | | 0% \r", "1/1 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | 
| 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "0/1 | | 0% \r", "1/1 
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 100% \n", "/home/zonca/.starcluster/ipcluster/SecurityGroup:@sc-pyec2-us-west-2.json 0% || ETA: --:--:-- 0.00 B/s\r", "/home/zonca/.starcluster/ipcluster/SecurityGroup:@sc-pyec2-us-west-2.json 100% || Time: 00:00:00 2.96 K/s\n" ] } ], "prompt_number": 32 }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "starcluster sshmaster -A pyec2 # -A use local keys remotely" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " _ _ _\n", "__/\\_____| |_ __ _ _ __ ___| |_ _ ___| |_ ___ _ __\n", "\\ / __| __/ _` | '__/ __| | | | / __| __/ _ \\ '__|\n", "/_ _\\__ \\ || (_| | | | (__| | |_| \\__ \\ || __/ |\n", " \\/ |___/\\__\\__,_|_| \\___|_|\\__,_|___/\\__\\___|_|\n", "\n", "StarCluster Ubuntu 12.04 AMI\n", "Software Tools for Academics and Researchers (STAR)\n", "Homepage: http://star.mit.edu/cluster\n", "Documentation: http://star.mit.edu/cluster/docs/latest\n", "Code: https://github.com/jtriley/StarCluster\n", "Mailing list: starcluster@mit.edu\n", "\n", "This AMI Contains:\n", "\n", " * Open Grid Scheduler (OGS - formerly SGE) queuing system\n", " * Condor workload management system\n", " * OpenMPI compiled with Open Grid Scheduler support\n", " * OpenBLAS- Highly optimized Basic Linear Algebra Routines\n", " * NumPy/SciPy linked against OpenBlas\n", " * IPython 0.13 with parallel support\n", " * and more! (use 'dpkg -l' to show all installed packages)\n", "\n", "Open Grid Scheduler/Condor cheat sheet:\n", "\n", " * qstat/condor_q - show status of batch jobs\n", " * qhost/condor_status- show status of hosts, queues, and jobs\n", " * qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)\n", " * qdel/condor_rm - delete batch jobs (e.g. 
qdel 7)\n", " * qconf - configure Open Grid Scheduler system\n", "\n", "Current System Stats:\n", "\n", " System load: 0.06 Processes: 93\n", " Usage of /: 27.8% of 9.84GB Users logged in: 1\n", " Memory usage: 60% IP address for eth0: 172.31.17.235\n", " Swap usage: 0%\n", "\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "StarCluster - (http://star.mit.edu/cluster) (v. 0.95.3)\n", "Software Tools for Academics and Researchers (STAR)\n", "Please submit bug reports to starcluster@mit.edu\n", "\n", "Pseudo-terminal will not be allocated because stdin is not a terminal.\r\n", "stdin: is not a tty\n" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "starcluster terminate -c pyec2 # -c does not prompt for confirm" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ ">>> Running plugin starcluster.plugins.ipcluster.IPCluster\n", ">>> Running plugin starcluster.plugins.pypkginstaller.PyPkgInstaller\n", ">>> Running plugin starcluster.clustersetup.DefaultClusterSetup\n", ">>> Detaching volume vol-53829851 from master\n", ">>> Terminating node: master (i-1d9a7915)\n", ">>> Waiting for cluster to terminate... 
\b \b|\b \b\n", ">>> Removing security group: @sc-pyec2 \b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b|\b \b/\b \b-\b \b\\\b \b-\b \b\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "StarCluster - (http://star.mit.edu/cluster) (v. 
0.95.3)\n", "Software Tools for Academics and Researchers (STAR)\n", "Please submit bug reports to starcluster@mit.edu\n", "\n" ] } ], "prompt_number": 39 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Running scikit-learn on a single instance on EC2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See example notebook on face recognition run on a **t1.micro** instance (half GB RAM)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Most interesting instance types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "* Micro for testing: **t1.micro**, 1 core, 0.6GB RAM, just EBS disk, 0.02 dollars/h\n", "* Medium size instance: **c3.2xlarge**, 8 cores, 15GB RAM, 2 x 80GB SSD\tat 0.6 dollars/h\n", "* Current largest instance: **c3.8xlarge**, 32 cores, 60GB RAM, 2 x 320GB SSD at 2.4 dollars/h\n" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Running scikit-learn on a single instance on EC2: workflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* launch **t1.micro** instance with starcluster\n", "* `rsync` data from local drive to `/data`\n", "* launch **c3.2xlarge** or **c3.8xlarge**\n", "* connect to the notebook\n", "* upload local notebook by dragging it ot the dashboard\n", "* run and debug interactively" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Running scikit-learn on a single instance on EC2: multi-core" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Supported out-of-the-box by `scikit-learn`:\n", "\n", "* `sklearn.cluster.KMeans(n_jobs=-1)`\n", "* `sklearn.ensemble.RandomForestClassifier(n_jobs=-1)`\n", "\n", "Just set `n_jobs` to the number of processes or -1 for automatically set it to the number of cores.\n", "\n", "Multi-core support provided 
by `joblib`: [https://pythonhosted.org/joblib/parallel.html](https://pythonhosted.org/joblib/parallel.html)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Using EC2 spot instances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* bid on extra EC2 capacity, same type of instances\n", "* save up to 5x!\n", "* market with fluctuating price based on demand\n", "* if price raises over your bid, instance gets **killed**\n", "* no charge for the last portion of an hour if killed\n", "* e.g. price is 0.2/h, I bid 0.5/h, I pay 0.2/h, if price -> 0.4/h, just pay more, if price -> 0.55/h, instance killed" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "starcluster spothistory c3.2xlarge" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ ">>> Fetching spot history for c3.2xlarge (VPC)\n", ">>> Current price: $0.1281\n", ">>> Max price: $2.4000\n", ">>> Average price: $0.3339\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "StarCluster - (http://star.mit.edu/cluster) (v. 0.95.2)\n", "Software Tools for Academics and Researchers (STAR)\n", "Please submit bug reports to starcluster@mit.edu\n", "\n" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "* On-demand price was 0.6/h, so saving about a factor of 4.\n", "\n", "* Current largest instance: c3.8xlarge, at 2.4/h , was 0.5/h this morning!" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "starcluster start -s 1 --force-spot-master -b 0.5 -I c3.2xlarge singlenode" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Distributed scikit-learn on EC2 with IPython parallel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Most suitable for cross-validation and hyperparameters optimizations, which are intrinsecally trivially parallel jobs\n", "* Based on [Olivier Grisel ML tutorial](http://nbviewer.ipython.org/github/ogrisel/parallel_ml_tutorial/blob/master/solutions/03%20-%20Distributed%20Model%20Selection%20and%20Assessment.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show local example to understand `ipcontroller`, `ipengines`, client" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Distributed scikit-learn on EC2 with IPython parallel: launch cluster" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "starcluster start -c pyec2 -s 5 -b 0.5 -I c3.2xlarge fivenodescluster" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Distributed scikit-learn on EC2 with IPython parallel: Distribute Cross-Validation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "* Open IPython Notebook on the master node" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def compute_evaluation(filename, model, params):\n", " \"\"\"Function executed by a worker to evaluate a model\"\"\"\n", " # All module imports should be executed in the worker namespace\n", " from sklearn.externals import joblib\n", "\n", " X_train, y_train, X_validation, y_validation = 
joblib.load(\n", " filename, mmap_mode='c')\n", " \n", " model.set_params(**params)\n", " model.fit(X_train, y_train)\n", " validation_score = model.score(X_validation, y_validation)\n", " return validation_score" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "svc_params = {\n", " 'C': np.logspace(-1, 2, 4),\n", " 'gamma': np.logspace(-4, 0, 5),\n", "}" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.grid_search import ParameterGrid\n", "list(ParameterGrid(svc_params))" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "[{'C': 0.10000000000000001, 'gamma': 0.0001},\n", " {'C': 0.10000000000000001, 'gamma': 0.001},\n", " {'C': 0.10000000000000001, 'gamma': 0.01},\n", " {'C': 0.10000000000000001, 'gamma': 0.10000000000000001},\n", " {'C': 0.10000000000000001, 'gamma': 1.0},\n", " {'C': 1.0, 'gamma': 0.0001},\n", " {'C': 1.0, 'gamma': 0.001},\n", " {'C': 1.0, 'gamma': 0.01},\n", " {'C': 1.0, 'gamma': 0.10000000000000001},\n", " {'C': 1.0, 'gamma': 1.0},\n", " {'C': 10.0, 'gamma': 0.0001},\n", " {'C': 10.0, 'gamma': 0.001},\n", " {'C': 10.0, 'gamma': 0.01},\n", " {'C': 10.0, 'gamma': 0.10000000000000001},\n", " {'C': 10.0, 'gamma': 1.0},\n", " {'C': 100.0, 'gamma': 0.0001},\n", " {'C': 100.0, 'gamma': 0.001},\n", " {'C': 100.0, 'gamma': 0.01},\n", " {'C': 100.0, 'gamma': 0.10000000000000001},\n", " {'C': 100.0, 'gamma': 1.0}]" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "len(list(ParameterGrid(svc_params)))" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": 
"pyout", "prompt_number": 24, "text": [ "20" ] } ], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "def compute_evaluation(filename, model, params):\n", " \"\"\"Function executed by a worker to evaluate a model\"\"\"\n", " # All module imports should be executed in the worker namespace\n", " from sklearn.externals import joblib\n", "\n", " X_train, y_train, X_validation, y_validation = joblib.load(\n", " filename, mmap_mode='c')\n", " \n", " model.set_params(**params)\n", " model.fit(X_train, y_train)\n", " validation_score = model.score(X_validation, y_validation)\n", " return validation_score" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.parallel import Client\n", "rc = Client()\n", "# create the load-balanced view object\n", "lview = rc.load_balanced_view()\n", "\n", "tasks = []\n", "# iterate over the parameter combinations, not over the dict keys\n", "for params in ParameterGrid(svc_params):\n", " tasks.append(lview.apply_async(compute_evaluation, \"data/input.pkl\", model_1, params))" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Monitor progress" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def progress(tasks):\n", " return np.mean([task.ready() for task in tasks])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(\"Tasks completed: {0}%\".format(100 * progress(tasks)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def find_best(tasks, n_top=5):\n", " \"\"\"Return the top scores among the completed tasks\"\"\"\n", " scores = [t.get() for t in tasks if t.ready()] \n", " return sorted(scores, reverse=True)[:n_top]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, 
"outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(\"Tasks completed: {0}%\".format(100 * progress(tasks)))\n", "find_best(tasks)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Distributed scikit-learn on EC2 with IPython parallel: Optimizations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Reading data from `EBS` over `NFS` does not scale: with large data and more than a few tens of engines, it is better to use S3." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* `StarCluster` sets the number of engines equal to the number of cores; we can configure fewer engines -> more memory per engine\n", "* or a combination of IPython parallel and multi-core `joblib`\n", "* Also, use just 1 engine per node to copy data from S3 to local disk, then memory-map the file to share it across engines" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Olivier Grisel has scripts for automating the distribution of tasks, see `model_selection.RandomizedGridSeach`: [https://github.com/ogrisel/parallel_ml_tutorial/](https://github.com/ogrisel/parallel_ml_tutorial/)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "SDSC Cloud" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Based on OpenStack\n", "* Storage (like S3) already available, also to industry: [cloud.sdsc.edu](https://cloud.sdsc.edu)\n", "* In the next few months, also cloud instances (like EC2) based on OpenStack Nova\n", "* Local support from SDSC staff\n", "![](files/screenshots/sdsc_cloud.png)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Thank you!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [twitter.com/andreazonca](http://twitter.com/andreazonca)\n", "* [zonca.github.io](http://zonca.github.io)\n", "* email on `sdsc.edu`: `zonca` " ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Distributed scikit-learn on EC2 with IPython parallel: More performance" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svc_params = {\n", " 'C': np.logspace(-1, 2, 4),\n", " 'gamma': np.logspace(-4, 0, 5),\n", "}" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.externals import joblib\n", "from sklearn.cross_validation import ShuffleSplit\n", "import os\n", "\n", "def persist_cv_splits(X, y, n_cv_iter=5, name='data',\n", " suffix=\"_cv_%03d.pkl\", test_size=0.25, random_state=None):\n", " \"\"\"Materialize randomized train test splits of a dataset.\"\"\"\n", "\n", " cv = ShuffleSplit(X.shape[0], n_iter=n_cv_iter,\n", " test_size=test_size, random_state=random_state)\n", " cv_split_filenames = []\n", " \n", " for i, (train, test) in enumerate(cv):\n", " cv_fold = (X[train], y[train], X[test], y[test])\n", " cv_split_filename = name + suffix % i\n", " cv_split_filename = os.path.abspath(cv_split_filename)\n", " joblib.dump(cv_fold, cv_split_filename)\n", " cv_split_filenames.append(cv_split_filename)\n", " \n", " return cv_split_filenames" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_digits\n", "\n", "digits = load_digits()\n", "digits_split_filenames = persist_cv_splits(digits.data, digits.target,\n", " name='digits', random_state=42)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, 
"input": [ "def compute_evaluation(cv_split_filename, model, params):\n", " \"\"\"Function executed by a worker to evaluate a model on a CV split\"\"\"\n", " # All module imports should be executed in the worker namespace\n", " from sklearn.externals import joblib\n", "\n", " X_train, y_train, X_validation, y_validation = joblib.load(\n", " cv_split_filename, mmap_mode='c')\n", " \n", " model.set_params(**params)\n", " model.fit(X_train, y_train)\n", " validation_score = model.score(X_validation, y_validation)\n", " return validation_score" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def grid_search(lb_view, model, cv_split_filenames, param_grid):\n", " \"\"\"Launch all grid search evaluation tasks.\"\"\"\n", " all_tasks = []\n", " all_parameters = list(ParameterGrid(param_grid))\n", " \n", " for i, params in enumerate(all_parameters):\n", " task_for_params = []\n", " \n", " for j, cv_split_filename in enumerate(cv_split_filenames): \n", " t = lb_view.apply(\n", " compute_evaluation, cv_split_filename, model, params)\n", " task_for_params.append(t) \n", " \n", " all_tasks.append(task_for_params)\n", " \n", " return all_parameters, all_tasks" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.svm import SVC\n", "from IPython.parallel import Client\n", "\n", "client = Client()\n", "lb_view = client.load_balanced_view()\n", "model = SVC()\n", "svc_params = {\n", " 'C': np.logspace(-1, 2, 4),\n", " 'gamma': np.logspace(-4, 0, 5),\n", "}\n", "\n", "all_parameters, all_tasks = grid_search(\n", " lb_view, model, digits_split_filenames, svc_params)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def find_best(all_parameters, all_tasks, 
n_top=5):\n", " \"\"\"Compute the mean score of the completed tasks\"\"\"\n", " mean_scores = []\n", " \n", " for param, task_group in zip(all_parameters, all_tasks):\n", " scores = [t.get() for t in task_group if t.ready()]\n", " if len(scores) == 0:\n", " continue\n", " mean_scores.append((np.mean(scores), param))\n", " \n", " return sorted(mean_scores, reverse=True)[:n_top]\n", "\n", "find_best(all_parameters, all_tasks)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [] } ], "metadata": {} } ] }