{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Copyright (c) Microsoft Corporation. All rights reserved.\n", "\n", "Licensed under the MIT License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing different Hyperparameters and Benchmarking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we'll cover how to test different hyperparameters for a particular dataset and how to benchmark different parameters across a group of datasets using AzureML. We assume familiarity with the basic concepts and parameters, which are discussed in the [01_training_introduction.ipynb](01_training_introduction.ipynb), [02_multilabel_classification.ipynb](02_multilabel_classification.ipynb) and [03_training_accuracy_vs_speed.ipynb](03_training_accuracy_vs_speed.ipynb) notebooks. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to [11_exploring_hyperparameters.ipynb](https://github.com/microsoft/ComputerVision/blob/master/classification/notebooks/11_exploring_hyperparameters.ipynb), we will learn more about how different learning rates and different image sizes affect our model's accuracy when restricted to 16 epochs, and we want to build an AzureML experiment to test out these hyperparameters. \n", "\n", "We will be using a ResNet18 model to classify a set of images into 4 categories: 'can', 'carton', 'milk_bottle', 'water_bottle'. We will then conduct hyper-parameter tuning to find the best set of parameters for this model. For this,\n", "we present an overall process of utilizing AzureML, specifically [Hyperdrive](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive?view=azure-ml-py) component to run this tuning in parallel (and not successively).We demonstrate the following key steps: \n", "* Configure AzureML Workspace\n", "* Create Remote Compute Target (GPU cluster)\n", "* Prepare Data\n", "* Prepare Training Script\n", "* Setup and Run Hyperdrive Experiment\n", "* Model Import, Re-train and Test\n", "\n", "For key concepts of AzureML see this [tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml?view=azure-ml-py&toc=https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fpython%2Fapi%2Fazureml_py_toc%2Ftoc.json%3Fview%3Dazure-ml-py&bc=https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fpython%2Fazureml_py_breadcrumb%2Ftoc.json%3Fview%3Dazure-ml-py) on model training and evaluation." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "sys.path.append(\"../../\")\n", "\n", "import fastai\n", "from fastai.vision import *\n", "import scrapbook as sb\n", "\n", "import azureml.core\n", "from azureml.core import Workspace, Experiment\n", "from azureml.core.compute import ComputeTarget, AmlCompute\n", "from azureml.core.compute_target import ComputeTargetException\n", "import azureml.data\n", "from azureml.train.estimator import Estimator\n", "from azureml.train.hyperdrive import (\n", " RandomParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, choice, uniform\n", ")\n", "import azureml.widgets as widgets\n", "\n", "from utils_cv.classification.data import Urls\n", "from utils_cv.common.data import unzip_url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ensure edits to libraries are loaded and plotting is shown in the notebook." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now define some parameters which will be used in this notebook:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "# Azure resources\n", "subscription_id = \"YOUR_SUBSCRIPTION_ID\"\n", "resource_group = \"YOUR_RESOURCE_GROUP_NAME\" \n", "workspace_name = \"YOUR_WORKSPACE_NAME\" \n", "workspace_region = \"YOUR_WORKSPACE_REGION\" #Possible values eastus, eastus2, etc.\n", "\n", "# Choose a size for our cluster and the maximum number of nodes\n", "VM_SIZE = \"STANDARD_NC6\" #\"STANDARD_NC6S_V3\"\n", "MAX_NODES = 12\n", "\n", "# Hyperparameter search space\n", "IM_SIZES = [150, 300]\n", "LEARNING_RATE_MAX = 1e-3\n", "LEARNING_RATE_MIN = 1e-5\n", "MAX_TOTAL_RUNS = 10 #Set to higher value to test more parameter combinations\n", "\n", "# Image data\n", "DATA = unzip_url(Urls.fridge_objects_path, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Config AzureML workspace\n", "Below we setup (or load an existing) AzureML workspace, and get all its details as follows. Note that the resource group and workspace will get created if they do not yet exist. For more information regaring the AzureML workspace see also the [20_azure_workspace_setup.ipynb](20_azure_workspace_setup.ipynb) notebook.\n", "\n", "To simplify clean-up (see end of this notebook), we recommend creating a new resource group to run this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from utils_cv.common.azureml import get_or_create_workspace\n", "\n", "ws = get_or_create_workspace(\n", " subscription_id,\n", " resource_group,\n", " workspace_name,\n", " workspace_region)\n", "\n", "# Print the workspace attributes\n", "print('Workspace name: ' + ws.name, \n", " 'Workspace region: ' + ws.location, \n", " 'Subscription id: ' + ws.subscription_id, \n", " 'Resource group: ' + ws.resource_group, sep = '\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Create Remote Target\n", "We create a GPU cluster as our remote compute target. If a cluster with the same name already exists in our workspace, the script will load it instead. This [link](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets#compute-targets-for-training) provides more information about how to set up a compute target on different locations.\n", "\n", "By default, the VM size is set to use STANDARD\\_NC6 machines. However, if quota is available, our recommendation is to use STANDARD\\_NC6S\\_V3 machines which come with the much faster V100 GPU. We set the minimum number of nodes to zero so that the cluster won't incur additional compute charges when not in use." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating a new compute target...\n", "Creating\n", "Succeeded\n", "AmlCompute wait for completion finished\n", "Minimum number of nodes requested have been provisioned\n", "{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-06T15:57:12.457000+00:00', 'errors': None, 'creationTime': '2019-08-06T15:56:43.315467+00:00', 'modifiedTime': '2019-08-06T15:57:25.740370+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 12, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}\n" ] } ], "source": [ "CLUSTER_NAME = \"gpu-cluster\"\n", "\n", "try:\n", " # Retrieve if a compute target with the same cluster name already exists\n", " compute_target = ComputeTarget(workspace=ws, name=CLUSTER_NAME)\n", " print('Found existing compute target.')\n", " \n", "except ComputeTargetException:\n", " # If it doesn't already exist, we create a new one with the name provided\n", " print('Creating a new compute target...')\n", " compute_config = AmlCompute.provisioning_configuration(vm_size=VM_SIZE,\n", " min_nodes=0,\n", " max_nodes=MAX_NODES)\n", "\n", " # create the cluster\n", " compute_target = ComputeTarget.create(ws, CLUSTER_NAME, compute_config)\n", " compute_target.wait_for_completion(show_output=True)\n", "\n", "# we can use get_status() to get a detailed status for the current cluster. \n", "print(compute_target.get_status().serialize())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Prepare data\n", "In this notebook, we'll use the Fridge Objects dataset, which is already stored in the correct format. We then upload our data to the AzureML workspace.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieving default datastore that got automatically created when we setup a workspace\n", "ds = ws.get_default_datastore()\n", "\n", "# We now upload the data to the 'data' folder on the Azure portal\n", "ds.upload(\n", " src_dir=os.path.dirname(DATA),\n", " target_path='data',\n", " overwrite=True, # with \"overwrite=True\", if this data already exists on the Azure blob storage, it will be overwritten\n", " show_progress=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Here's where you can see the data in your portal: \n", "\"Datastore\n", "\n", "### 4. Prepare training script\n", "\n", "Next step is to prepare scripts that AzureML Hyperdrive will use to train and evaluate models with selected hyperparameters." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# creating a folder for the training script here\n", "script_folder = os.path.join(os.getcwd(), \"hyperdrive\")\n", "os.makedirs(script_folder, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting C:\\Users\\pabuehle\\Desktop\\ComputerVision\\classification\\notebooks\\hyperparameter/train.py\n" ] } ], "source": [ "%%writefile $script_folder/train.py\n", "\n", "import argparse\n", "import numpy as np\n", "import os\n", "from sklearn.externals import joblib\n", "import sys\n", "\n", "import fastai\n", "from fastai.vision import *\n", "from fastai.vision.data import *\n", "\n", "from azureml.core import Run\n", "\n", "run = Run.get_context()\n", "\n", "\n", "#------------------------------------------------------------------\n", "# Define parameters that we are going to use for training\n", "ARCHITECTURE = models.resnet18\n", "EPOCHS_HEAD = 4\n", "EPOCHS_BODY = 12\n", "BATCH_SIZE = 16\n", "#------------------------------------------------------------------\n", "\n", "\n", "# Parse arguments passed by Hyperdrive\n", "parser = argparse.ArgumentParser()\n", "\n", "# Data path\n", "parser.add_argument('--data-folder', type=str, dest='DATA_DIR', help=\"Datastore path\")\n", "parser.add_argument('--im_size', type=int, dest='IM_SIZE')\n", "parser.add_argument('--learning_rate', type=float, dest='LEARNING_RATE')\n", "args = parser.parse_args()\n", "params = vars(args)\n", "\n", "if params['IM_SIZE'] is None:\n", " raise ValueError(\"Image Size empty\")\n", "if params['LEARNING_RATE'] is None:\n", " raise ValueError(\"Learning Rate empty\")\n", "if params['DATA_DIR'] is None:\n", " raise ValueError(\"Data folder empty\")\n", "\n", "# Getting training and validation data\n", "path = params['DATA_DIR'] + '/data/fridgeObjects'\n", "data = (ImageList.from_folder(path)\n", " .split_by_rand_pct(valid_pct=0.5, seed=10)\n", " .label_from_folder() \n", " .transform(size=params['IM_SIZE']) \n", " .databunch(bs=BATCH_SIZE) \n", " .normalize(imagenet_stats))\n", "\n", "# Get model and run training\n", "learn = cnn_learner(\n", " data,\n", " ARCHITECTURE,\n", " metrics=[accuracy]\n", ")\n", "learn.fit_one_cycle(EPOCHS_HEAD, params['LEARNING_RATE'])\n", "learn.unfreeze()\n", "learn.fit_one_cycle(EPOCHS_BODY, params['LEARNING_RATE'])\n", "\n", "# Add log entries\n", "training_losses = [x.numpy().ravel()[0] for x in learn.recorder.losses]\n", "accuracy = [100*x[0].numpy().ravel()[0] for x in learn.recorder.metrics][-1]\n", "run.log('data_dir',params['DATA_DIR'])\n", "run.log('im_size', params['IM_SIZE'])\n", "run.log('learning_rate', params['LEARNING_RATE'])\n", "run.log('accuracy', float(accuracy)) # Logging our primary metric 'accuracy'\n", "\n", "# Save trained model\n", "current_directory = os.getcwd()\n", "output_folder = os.path.join(current_directory, 'outputs')\n", "model_name = 'im_classif_resnet' # Name we will give our model both locally and on Azure\n", "os.makedirs(output_folder, exist_ok=True)\n", "learn.export(os.path.join(output_folder, model_name + \".pkl\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Setup and run Hyperdrive experiment\n", "\n", "#### 5.1 Create Experiment \n", "Experiment is the main entry point into experimenting with AzureML. 
To create a new Experiment, or to retrieve an existing one, we pass our experiment name 'hyperparameter-tuning'.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "experiment_name = 'hyperparameter-tuning'\n", "exp = Experiment(workspace=ws, name=experiment_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.2 Define search space\n", "\n", "Now we define the search space of hyperparameters. As shown below, to test discrete parameter values use 'choice()', and for uniform sampling use 'uniform()'. For more options, see [Hyperdrive parameter expressions](https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.parameter_expressions?view=azure-ml-py).\n", "\n", "Hyperdrive provides three different parameter sampling methods: 'RandomParameterSampling', 'GridParameterSampling', and 'BayesianParameterSampling'. Details about each method can be found [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters). Here, we use 'RandomParameterSampling'." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Hyperparameter search space\n", "param_sampling = RandomParameterSampling({\n", "    '--learning_rate': uniform(LEARNING_RATE_MIN, LEARNING_RATE_MAX),\n", "    '--im_size': choice(IM_SIZES)\n", "})\n", "\n", "early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An AzureML Estimator is the building block for training. It encapsulates the training code and its parameters, as well as the compute resources and runtime environment for a particular training scenario.\n", "We create one for our experiment with the dependencies our model requires:\n", "\n", "```python\n", "pip_packages=['fastai']\n", "conda_packages=['scikit-learn']\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "script_params = {\n", "    '--data-folder': ds.as_mount()\n", "}\n", "\n", "est = Estimator(source_directory=script_folder,\n", "                script_params=script_params,\n", "                compute_target=compute_target,\n", "                entry_script='train.py',\n", "                use_gpu=True,\n", "                pip_packages=['fastai'],\n", "                conda_packages=['scikit-learn'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now create a HyperDriveConfig object, which bundles the parameter space sampling, the termination policy, the primary metric, the estimator, and the compute target on which to execute the experiment runs. We feed the following parameters to it:\n", "\n", "- the estimator object we created in the cell above\n", "- the hyperparameter sampling method, in this case random parameter sampling\n", "- the early termination policy, in this case the [Bandit Policy](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#bandit-policy) defined above (illustrated with a small example below)\n", "- the name of the primary metric reported by our runs, in this case 'accuracy'\n", "- the goal, which determines whether the primary metric is to be maximized or minimized, in this case to maximize our accuracy\n", "- the total number of child runs\n", "\n", "The larger the search space, the higher MAX_TOTAL_RUNS should be set so that enough parameter combinations are tested."
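] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the early-termination bullet above concrete: for a metric that is being maximized, the Bandit policy cancels a run whose best reported metric falls below the current overall best divided by (1 + slack_factor), and only after the first delay_evaluation intervals have passed. A small illustrative calculation with made-up numbers:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative only -- hypothetical numbers, not outputs of this experiment\n", "slack_factor = 0.15\n", "best_accuracy_so_far = 92.5  # best accuracy reported by any run at a given evaluation interval\n", "cutoff = best_accuracy_so_far / (1 + slack_factor)\n", "print(f\"Runs whose best accuracy is below {cutoff:.1f}% would be terminated early.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell assembles the estimator, the sampling method, and the termination policy into the HyperDriveConfig."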
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "hyperdrive_run_config = HyperDriveConfig(estimator=est,\n", " hyperparameter_sampling=param_sampling,\n", " policy=early_termination_policy,\n", " primary_metric_name='accuracy',\n", " primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,\n", " max_total_runs=MAX_TOTAL_RUNS,\n", " max_concurrent_runs=MAX_NODES)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.3 Run Experiment" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5c51804ba4794f3aa163354fef634c59", "version_major": 2, "version_minor": 0 }, "text/plain": [ "_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Now we submit the Run to our experiment. \n", "hyperdrive_run = exp.submit(config=hyperdrive_run_config)\n", "\n", "# We can see the experiment progress from this notebook by using \n", "widgets.RunDetails(hyperdrive_run).show()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'runId': 'hyperparameter-tuning_1565107066432',\n", " 'target': 'gpu-cluster',\n", " 'status': 'Completed',\n", " 'startTimeUtc': '2019-08-06T15:57:46.90426Z',\n", " 'endTimeUtc': '2019-08-06T16:13:21.185098Z',\n", " 'properties': {'primary_metric_config': '{\"name\": \"accuracy\", \"goal\": \"maximize\"}',\n", " 'runTemplate': 'HyperDrive',\n", " 'azureml.runsource': 'hyperdrive',\n", " 'platform': 'AML',\n", " 'baggage': 'eyJvaWQiOiAiNWFlYTJmMzAtZjQxZC00ZDA0LWJiOGUtOWU0NGUyZWQzZGQ2IiwgInRpZCI6ICI3MmY5ODhiZi04NmYxLTQxYWYtOTFhYi0yZDdjZDAxMWRiNDciLCAidW5hbWUiOiAiMDRiMDc3OTUtOGRkYi00NjFhLWJiZWUtMDJmOWUxYmY3YjQ2In0',\n", " 'ContentSnapshotId': 'c662f56a-ff58-432e-b732-8a3bc6818778'},\n", " 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://pabuehlestorage1c7e31216.blob.core.windows.net/azureml/ExperimentRun/dcid.hyperparameter-tuning_1565107066432/azureml-logs/hyperdrive.txt?sv=2018-11-09&sr=b&sig=8D2gwxb%2BYn7nbzgGVHE7QSzJ%2FG7C1swzmLD7%2Fior2vE%3D&st=2019-08-06T17%3A36%3A08Z&se=2019-08-07T01%3A46%3A08Z&sp=r'}}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hyperdrive_run.wait_for_completion()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can check from the Azure portal with the url link we get by running \n", "```python \n", "hyperdrive_run.get_portal_url().```\n", "\n", "To load an existing Hyperdrive Run instead of start new one, we can use \n", "```python\n", "hyperdrive_run = azureml.train.hyperdrive.HyperDriveRun(exp, , hyperdrive_run_config=hyperdrive_run_config)\n", "```\n", "We also can cancel the Run with \n", "```python \n", "hyperdrive_run_config.cancel().\n", "```\n", "\n", "Once all the child-runs are finished, we can get the best run and the metrics." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* Best Run Id:hyperparameter-tuning_1565107066432_8\n", "Run(Experiment: hyperparameter-tuning,\n", "Id: hyperparameter-tuning_1565107066432_8,\n", "Type: azureml.scriptrun,\n", "Status: Completed)\n", "\n", "* Best hyperparameters:\n", "{'--data-folder': '$AZUREML_DATAREFERENCE_workspaceblobstore', '--im_size': '150', '--learning_rate': '0.000552896672441507'}\n", "Accuracy = 92.53731369972229\n" ] } ], "source": [ "# Get best run and print out metrics\n", "best_run = hyperdrive_run.get_best_run_by_primary_metric()\n", "best_run_metrics = best_run.get_metrics()\n", "parameter_values = best_run.get_details()['runDefinition']['arguments']\n", "best_parameters = dict(zip(parameter_values[::2], parameter_values[1::2]))\n", "\n", "print(f\"* Best Run Id:{best_run.id}\")\n", "print(best_run)\n", "print(\"\\n* Best hyperparameters:\")\n", "print(best_parameters)\n", "print(f\"Accuracy = {best_run_metrics['accuracy']}\")\n", "#print(\"Learning Rate =\", best_run_metrics['learning_rate'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Download and test the model\n", "\n", "We can download the best model from the outputs/ folder and inspect it." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading outputs/im_classif_resnet.pkl..\n" ] } ], "source": [ "import joblib\n", "current_directory = os.getcwd()\n", "output_folder = os.path.join(current_directory, 'outputs')\n", "os.makedirs(output_folder, exist_ok=True)\n", "\n", "for f in best_run.get_file_names():\n", " if f.startswith('outputs/im_classif_resnet'):\n", " print(\"Downloading {}..\".format(f))\n", " best_run.download_file('outputs/im_classif_resnet.pkl')\n", "saved_model =joblib.load('im_classif_resnet.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use the retrieved best model to get predictions on unseen images as done in [03_training_accuracy_vs_speed.ipynb](https://github.com/microsoft/ComputerVision/blob/master/classification/notebooks/03_training_accuracy_vs_speed.ipynb) notebook using\n", "```python\n", "saved_model.predict(image)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Clean up\n", "\n", "To avoid unnecessary expenses, all resources which were created in this notebook need to get deleted once parameter search is concluded. To simplify this clean-up step, we recommend creating a new resource group to run this notebook. This resource group can then be deleted, e.g. using the Azure Portal, which will remove all created resources." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Log some outputs using scrapbook which are used during testing to verify correct notebook execution\n", "sb.glue(\"best_accuracy\", best_run_metrics['accuracy'])" ] } ], "metadata": { "kernelspec": { "display_name": "Python (cv)", "language": "python", "name": "cv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }