{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Easy distributed training with TensorFlow using train_and_evaluate()\n", "\n", "## Introduction\n", "\n", "TensorFlow release 1.4 introduced the function [`tf.estimator.train_and_evaluate`](https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate), which simplifies training, evaluation, and exporting of [`Estimator`](https://www.tensorflow.org/get_started/estimator) models. It enables distributed execution for training and evaluation, while also supporting local execution, and provides consistent behavior for across both local/non-distributed and distributed configurations.\n", "\n", "This means that using `tf.estimator.train_and_evaluate`, you can run the same code on both locally and in the cloud, on different devices and using different cluster configurations, without making any code changes. A train-and-evaluate loop is automatically supported. When you're done training (or even at intermediate stages), the trained model is automatically exported in a form suitable for serving (e.g. for [Cloud ML Engine online prediction](https://cloud.google.com/ml-engine/docs/prediction-overview), or [TensorFlow serving](https://www.tensorflow.org/serving/)).\n", "\n", "In this example, we'll walk through how to use `tf.estimator.train_and_evaluate` with an Estimator model, and then show how easy it is to do distributed training of the model on [Cloud ML Engine](https://cloud.google.com/ml-engine) (CMLE), and to move between different cluster configurations with just a config tweak.\n", "\n", "The example also includes the use of [Datasets](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset) to manage our input data. This API is part of TensorFlow 1.4, and is an easier and more performant way to create input pipelines to TensorFlow models.\n", "\n", "For our example, we'll use the The [Census Income Data\n", "Set](https://archive.ics.uci.edu/ml/datasets/Census+Income) hosted by the [UC Irvine Machine Learning\n", "Repository](https://archive.ics.uci.edu/ml/datasets/). We have hosted the data\n", "on Google Cloud Storage in a slightly cleaned form. We'll use this dataset to predict income category based on various information about a person.\n", "\n", "The example in this notebook is a slightly modified version of [this example](https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer).\n", "\n", "### Prerequisites\n", "\n", "This example requires you to have TensorFlow 1.4 or higher installed, and Python 2.7 or 3.\n", "We strongly recommend that you install TensorFlow into a virtual environment, as described in the [installation instructions](https://www.tensorflow.org/install/).\n", "\n", "You'll also need to have [Google Cloud SDK (gcloud) installed](https://cloud.google.com/sdk/downloads).\n", "\n", "In a later section of this example, you'll need to create a GCP project (to use Cloud ML Engine). We'll point you to that info when we get there." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First step: create an Estimator\n", "\n", "In this section, we'll do some setup and then create an [`Estimator`](https://www.tensorflow.org/get_started/estimator) model using a prebuilt Estimator subclass, [`DNNLinearCombinedClassifier`](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNLinearCombinedClassifier). (More on this Estimator below). 
\n", "\n", "We're using the Estimator class because it gives us built-in support for distributed training and evaluation (along with other nice features). You should nearly always use an Estimator to create your TensorFlow models. You can build a Custom Estimator if none of the prebuilt (\"canned\") Estimators suit your purpose.\n", "\n", "First, copy the training and test data to a local directory and set some vars to point to the files. You can skip the download step if you've already grabbed these datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! mkdir -p census_data\n", "! gsutil cp gs://cloudml-public/census/data/adult.data.csv census_data/adult.data.csv\n", "! gsutil cp gs://cloudml-public/census/data/adult.test.csv census_data/adult.test.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# list the contents of the data directory as a check\n", "!ls -l census_data\n", "! head census_data/adult.data.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "TRAIN_FILES = ['census_data/adult.data.csv']\n", "EVAL_FILES = ['census_data/adult.test.csv']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%env TRAIN_FILE=census_data/adult.data.csv\n", "%env EVAL_FILE=census_data/adult.test.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do some imports and check your version of TensorFlow. It should be >=1.4." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import division\n", "from __future__ import print_function\n", "\n", "import argparse\n", "import multiprocessing\n", "import os\n", "import time\n", "\n", "import tensorflow as tf\n", "from tensorflow.contrib.learn.python.learn.utils import (\n", " saved_model_export_utils)\n", "print(tf.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll begin defining our estimator. First we'll define the format of the input data. \n", "`income_bracket` is our `LABEL_COLUMN`, meaning that this is the value we'll predict. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "CSV_COLUMNS = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',\n", " 'marital_status', 'occupation', 'relationship', 'race', 'gender',\n", " 'capital_gain', 'capital_loss', 'hours_per_week',\n", " 'native_country', 'income_bracket']\n", "CSV_COLUMN_DEFAULTS = [[0], [''], [0], [''], [0], [''], [''], [''], [''], [''],\n", " [0], [0], [0], [''], ['']]\n", "LABEL_COLUMN = 'income_bracket'\n", "LABELS = [' <=50K', ' >50K']\n", "\n", "# Define the initial ingestion of each feature used by your model.\n", "# Additionally, provide metadata about the feature.\n", "INPUT_COLUMNS = [\n", " # Categorical base columns\n", "\n", " # For categorical columns with known values we can provide lists\n", " # of values ahead of time.\n", " tf.feature_column.categorical_column_with_vocabulary_list(\n", " 'gender', [' Female', ' Male']),\n", "\n", " tf.feature_column.categorical_column_with_vocabulary_list(\n", " 'race',\n", " [' Amer-Indian-Eskimo', ' Asian-Pac-Islander',\n", " ' Black', ' Other', ' White']\n", " ),\n", " tf.feature_column.categorical_column_with_vocabulary_list(\n", " 'education',\n", " [' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',\n", " ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',\n", " ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th',\n", " ' 1st-4th', ' Preschool', ' 12th']),\n", " tf.feature_column.categorical_column_with_vocabulary_list(\n", " 'marital_status',\n", " [' Married-civ-spouse', ' Divorced', ' Married-spouse-absent',\n", " ' Never-married', ' Separated', ' Married-AF-spouse', ' Widowed']),\n", " tf.feature_column.categorical_column_with_vocabulary_list(\n", " 'relationship',\n", " [' Husband', ' Not-in-family', ' Wife', ' Own-child', ' Unmarried',\n", " ' Other-relative']),\n", " tf.feature_column.categorical_column_with_vocabulary_list(\n", " 'workclass',\n", " [' Self-emp-not-inc', ' Private', ' State-gov',\n", " ' Federal-gov', ' Local-gov', ' ?', ' Self-emp-inc',\n", " ' Without-pay', ' Never-worked']\n", " ),\n", "\n", " # For columns with a large number of values, or unknown values\n", " # We can use a hash function to convert to categories.\n", " tf.feature_column.categorical_column_with_hash_bucket(\n", " 'occupation', hash_bucket_size=100, dtype=tf.string),\n", " tf.feature_column.categorical_column_with_hash_bucket(\n", " 'native_country', hash_bucket_size=100, dtype=tf.string),\n", "\n", " # Continuous base columns.\n", " tf.feature_column.numeric_column('age'),\n", " tf.feature_column.numeric_column('education_num'),\n", " tf.feature_column.numeric_column('capital_gain'),\n", " tf.feature_column.numeric_column('capital_loss'),\n", " tf.feature_column.numeric_column('hours_per_week'),\n", "]\n", "\n", "# Now we'll define the unused columns-- those we won't use for this example.\n", "# In this case, there's just one: 'fnlwgt'.\n", "UNUSED_COLUMNS = set(CSV_COLUMNS) - {col.name for col in INPUT_COLUMNS} - {LABEL_COLUMN}\n", "print('unused columns: %s' % UNUSED_COLUMNS)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll define a function that builds our Estimator.\n", "\n", "We will use the [`DNNLinearCombinedClassifier`](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNLinearCombinedClassifier) class to create our Estimator.\n", "\n", "This is a so-called \"wide and deep\" model.\n", "Wide and deep models use deep neural nets to learn high level abstractions about complex 
features or interactions between such features. These models then combined the outputs from the DNN with a linear regression performed on simpler features. This provides a balance between power and speed that is effective on many structured data problems.\n", "\n", "You can read more about this model and its use [here](https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html). You can learn more about using feature columns [here](https://www.tensorflow.org/versions/master/get_started/feature_columns)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def build_estimator(config, embedding_size=8, hidden_units=None):\n", " \"\"\"Build a wide and deep model for predicting income category.\n", " \"\"\"\n", " (gender, race, education, marital_status, relationship,\n", " workclass, occupation, native_country, age,\n", " education_num, capital_gain, capital_loss, hours_per_week) = INPUT_COLUMNS\n", "\n", " # Continuous columns can be converted to categorical via bucketization\n", " age_buckets = tf.feature_column.bucketized_column(\n", " age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])\n", "\n", " # Wide columns and deep columns.\n", " wide_columns = [\n", " # Interactions between different categorical features can also\n", " # be added as new virtual features.\n", " tf.feature_column.crossed_column(\n", " ['education', 'occupation'], hash_bucket_size=int(1e4)),\n", " tf.feature_column.crossed_column(\n", " [age_buckets, race, 'occupation'], hash_bucket_size=int(1e6)),\n", " tf.feature_column.crossed_column(\n", " ['native_country', 'occupation'], hash_bucket_size=int(1e4)),\n", " gender,\n", " native_country,\n", " education,\n", " occupation,\n", " workclass,\n", " marital_status,\n", " relationship,\n", " age_buckets,\n", " ]\n", "\n", " deep_columns = [\n", " # Use indicator columns for low dimensional vocabularies\n", " tf.feature_column.indicator_column(workclass),\n", " tf.feature_column.indicator_column(education),\n", " tf.feature_column.indicator_column(marital_status),\n", " tf.feature_column.indicator_column(gender),\n", " tf.feature_column.indicator_column(relationship),\n", " tf.feature_column.indicator_column(race),\n", "\n", " # Use embedding columns for high dimensional vocabularies\n", " tf.feature_column.embedding_column(\n", " native_country, dimension=embedding_size),\n", " tf.feature_column.embedding_column(occupation, dimension=embedding_size),\n", " age,\n", " education_num,\n", " capital_gain,\n", " capital_loss,\n", " hours_per_week,\n", " ]\n", "\n", " return tf.estimator.DNNLinearCombinedClassifier(\n", " config=config,\n", " linear_feature_columns=wide_columns,\n", " dnn_feature_columns=deep_columns,\n", " dnn_hidden_units=hidden_units or [100, 70, 50, 25]\n", " )\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we'll create an estimator object using the function we defined, and our config values." 
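, "\n", "The `hidden_units` list passed to the estimator is built with exponential decay: layer *i* gets `max(2, int(FIRST_LAYER_SIZE * SCALE_FACTOR**i))` nodes. With the values used in the next cells (first layer size 100, scale factor 0.7, 4 layers), that works out to roughly 100, 70, 49, and 34 nodes per layer.\n"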
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "output_dir = \"census_%s\" % (int(time.time()))\n", "print(output_dir)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "run_config = tf.estimator.RunConfig()\n", "run_config = run_config.replace(model_dir=output_dir)\n", "\n", "FIRST_LAYER_SIZE = 100 # Number of nodes in the first layer of the DNN\n", "NUM_LAYERS = 4 # Number of layers in the DNN\n", "SCALE_FACTOR = 0.7 # How quickly should the size of the layers in the DNN decay\n", "EMBEDDING_SIZE = 8 # Number of embedding dimensions for categorical columns\n", "\n", "estimator = build_estimator(\n", " embedding_size=EMBEDDING_SIZE,\n", " # Construct layers sizes with exponential decay\n", " hidden_units=[\n", " max(2, int(FIRST_LAYER_SIZE *\n", " SCALE_FACTOR**i))\n", " for i in range(NUM_LAYERS)\n", " ],\n", " config=run_config\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define input functions (using Datasets)\n", "\n", "\n", "To train and evaluate the estimator model, we'll need to tell it how to get its training and eval data. We'll define a function (`input_fn`) that knows how to generate features and labels for training or evaluation, then use that definition to create the actual train and eval input functions.\n", "\n", "We'll use [Datasets](https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset) in the `input_fn` to access our data. \n", "This API is part of TensorFlow 1.4, and is a new way to create [input pipelines to TensorFlow models](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/docs_src/performance/datasets_performance.md). The `Dataset` API is much more performant than using `feed_dict` or the queue-based pipelines, and it's [cleaner and easier](https://developers.googleblog.com/2017/09/introducing-tensorflow-datasets.html) to use.\n", "\n", "(In this simple example, our datasets are too small for the use of the Datasets API to make a large difference, but with larger datasets it becomes more important).\n", "\n", "We'll first define a couple of helper functions. `parse_label_column` is used to convert the label strings (in our case, ' <=50K' and ' >50K') into one-hot encodings." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def parse_label_column(label_string_tensor):\n", " \"\"\"Parses a string tensor into the label tensor\n", " \"\"\"\n", " # Build a Hash Table inside the graph\n", " table = tf.contrib.lookup.index_table_from_tensor(tf.constant(LABELS))\n", "\n", " # Use the hash table to convert string labels to ints and one-hot encode\n", " return table.lookup(label_string_tensor)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def parse_csv(rows_string_tensor):\n", " \"\"\"Takes the string input tensor and returns a dict of rank-2 tensors.\"\"\"\n", "\n", " # Takes a rank-1 tensor and converts it into rank-2 tensor\n", " # Example if the data is ['csv,line,1', 'csv,line,2', ..] 
to\n", " # [['csv,line,1'], ['csv,line,2']] which after parsing will result in a\n", " # tuple of tensors: [['csv'], ['csv']], [['line'], ['line']], [[1], [2]]\n", " row_columns = tf.expand_dims(rows_string_tensor, -1)\n", " columns = tf.decode_csv(row_columns, record_defaults=CSV_COLUMN_DEFAULTS)\n", " features = dict(zip(CSV_COLUMNS, columns))\n", "\n", " # Remove unused columns\n", " for col in UNUSED_COLUMNS:\n", " features.pop(col)\n", " return features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now define the input function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This function returns a (features, indices) tuple, where features is a dictionary of\n", "# Tensors, and indices is a single Tensor of label indices.\n", "def input_fn(filenames,\n", " num_epochs=None,\n", " shuffle=True,\n", " skip_header_lines=0,\n", " batch_size=200):\n", " \"\"\"Generates features and labels for training or evaluation.\n", " \"\"\"\n", "\n", " dataset = tf.data.TextLineDataset(filenames).skip(skip_header_lines).map(parse_csv)\n", "\n", " if shuffle:\n", " dataset = dataset.shuffle(buffer_size=batch_size * 10)\n", " dataset = dataset.repeat(num_epochs)\n", " dataset = dataset.batch(batch_size)\n", " iterator = dataset.make_one_shot_iterator()\n", " features = iterator.get_next()\n", " return features, parse_label_column(features.pop(LABEL_COLUMN))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we'll use `input_fn` to define both the `train_input` and `eval_input` functions. We just need to pass `input_fn` the different source files to use for training versus evaluation. As we'll see below, these will be used to define a `TrainSpec` and `EvalSpec` used by `train_and_evaluate`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train_input = lambda: input_fn(\n", " TRAIN_FILES,\n", " batch_size=40\n", ")\n", "\n", "# Don't shuffle evaluation data\n", "eval_input = lambda: input_fn(\n", " EVAL_FILES,\n", " batch_size=40,\n", " shuffle=False\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define training and eval specs\n", "\n", "\n", "Now we're nearly set. We just need to define the the `TrainSpec` and `EvalSpec` used by `tf.estimator.train_and_evaluate`. These specify not only the input functions, but how to export our trained model.\n", "\n", "First, we'll define the [`TrainSpec`](https://www.tensorflow.org/api_docs/python/tf/estimator/TrainSpec), which takes as an arg `train_input`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train_spec = tf.estimator.TrainSpec(train_input,\n", " max_steps=1000\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our [`EvalSpec`](https://www.tensorflow.org/api_docs/python/tf/estimator/EvalSpec), we'll instantiate it with something additional -- a list of exporters, that specify how to export a trained model so that it can be used for serving.\n", "\n", "To specify our exporter, we first define a *serving input function*. A serving input function should produce a [ServingInputReceiver](https://www.tensorflow.org/api_docs/python/tf/estimator/export/ServingInputReceiver).\n", "\n", "A `ServingInputReceiver` is instantiated with two arguments -- `features`, and `receiver_tensors`. The `features` represent the inputs to our Estimator when it is being served for prediction. 
The `receiver_tensor` represent inputs to the server. These will not necessarily always be the same — in some cases we may want to make some edits. [Here's](https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/estimator/trainer/model.py#L197) one example of that, where the inputs to the server (csv-formatted rows) include a field to be removed.\n", "\n", "However, in our case, the inputs to the server are the same as the features input to the model. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def json_serving_input_fn():\n", " \"\"\"Build the serving inputs.\"\"\"\n", " inputs = {}\n", " for feat in INPUT_COLUMNS:\n", " inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)\n", "\n", " return tf.estimator.export.ServingInputReceiver(inputs, inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we define an [Exporter](https://www.tensorflow.org/api_docs/python/tf/estimator/FinalExporter) in terms of that serving input function, and pass the `EvalSpec` constructor a list of exporters.\n", "(We're just using one exporter here, but if you define multiple exporters, training will result in multiple saved models)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "exporter = tf.estimator.FinalExporter('census',\n", " json_serving_input_fn)\n", "eval_spec = tf.estimator.EvalSpec(eval_input,\n", " steps=100,\n", " exporters=[exporter],\n", " name='census-eval'\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train your model, using `train_and_evaluate`\n", "\n", "Now, we have defined everything we need to train and evaluate our model, and export the trained model for serving, via a call to **`train_and_evaluate`**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've just trained your model and exported the result in a format that makes it easy to use it for serving! The training behavior will be consistent across both local/non-distributed and distributed configurations, thanks to `train_and_evaluate`.\n", "\n", "### Look at the signature of your exported model\n", "\n", "TensorFlow ships with a CLI that allows you to inspect the *signature* of exported binary files. To do this, first locate your model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "# List the directory that contains the model. You'll use this info in the next section too.\n", "!ls -R $output_dir/export/census" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the listing above, find the directory that includes `saved_model.pb`, and edit the command below to use it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Now, view the model signature\n", "# This is an example. Edit this command to use your own directory path\n", "!saved_model_cli show --dir $output_dir/export/census/ --tag serve --signature_def predict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This lets us confirm the expected inputs, and shows the outputs we'll get when we run a prediction." 
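, "\n", "As an optional check, you can also load the exported SavedModel directly in Python and call it, rather than using the CLI. The snippet below is only a sketch: it assumes `tf.contrib.predictor` is available in your TensorFlow build, and the export path is a placeholder you'd replace with the timestamped directory from the listing above.\n", "\n", "```python\nfrom tensorflow.contrib import predictor\n\n# Hypothetical path -- edit to match the directory that contains saved_model.pb.\nexport_dir = 'census_1234567890/export/census/1234567891'\npredict_fn = predictor.from_saved_model(export_dir, signature_def_key='predict')\n\n# One example instance; every input listed by saved_model_cli must be present.\nexample = {\n    'age': [25], 'workclass': [' Private'], 'education': [' 11th'],\n    'education_num': [7], 'marital_status': [' Never-married'],\n    'occupation': [' Machine-op-inspct'], 'relationship': [' Own-child'],\n    'race': [' Black'], 'gender': [' Male'], 'capital_gain': [0],\n    'capital_loss': [0], 'hours_per_week': [40],\n    'native_country': [' United-States'],\n}\nprint(predict_fn(example))\n```\n"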
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use the Google Cloud SDK to make predictions on your trained model.\n", "\n", "Next, we'll use the Google Cloud SDK (gcloud) as an easy way to make predictions using our model.\n", "This section requires that you have the [Google Cloud SDK (gcloud) installed](https://cloud.google.com/sdk/downloads).\n", "\n", "We can use `gcloud` to easily make predictions using our local learned model, using \n", "`gcloud ml-engine local predict`. \n", "(Note the 'local' modifier; this is a good way to check locally that your exported model is behaving as expected. Later in this notebook we'll look at how to use scalable Cloud ML Engine Online Prediction instead.)\n", "\n", "We'll use the example input in `test.json`. As we saw above when we built our model, we'll be predicting 'income bracket' based on the features encoded in the `test.json` instance.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Edit the following path to point to your `$output_dir/export/census/` subdirectory from the listing above, the one that contains `saved_model.pb`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "! cat test.json\n", "# This is an example. Edit this command to use your own directory path.\n", "! gcloud ml-engine local predict --model-dir $output_dir/export/census/ --json-instances test.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see how the input fields in `test.json` correspond to the inputs listed by the `saved_model_cli` command above, and how the prediction outputs correspond to the outputs listed by `saved_model_cli`.\n", "In this model, Class 0 indicates income <= 50k and Class 1 indicates income >50k. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Cloud ML Engine for easy distributed training and scalable online prediction\n", "\n", "In the previous section, we looked at how to use `tf.estimator.train_and_evaluate` to train and export a model, and then make predictions using the trained model.\n", "\n", "In this section, you'll see how easy it is to use the same code — without any changes — to do distributed training on Cloud ML Engine (CMLE), thanks to the `Estimator` class and `train_and_evaluate`. Then we'll use CMLE Online Prediction to scalably serve the trained model.\n", "\n", "This section requires that you have [set up a GCP project and enabled the use of CMLE](https://cloud.google.com/ml-engine/docs/command-line)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run training on CMLE, we can use `gcloud`. We'll need to package our code so that it can be deployed, and specify the Python file to run to start the training (`--module-name`). \n", "\n", "If you take a look in the `trainer` subdirectory of this directory, you'll see that it contains essentially the same code that's in this notebook, just packaged for deployment. `trainer.task` is the entry point, and when that file is run, it calls `tf.estimator.train_and_evaluate`. \n", "(You can read more about how to package your code [here](https://cloud.google.com/ml-engine/docs/packaging-trainer)). 
\n", "\n", "We'll test training via `gcloud` locally first, to make sure that we have everything packaged up correctly.\n", "\n", "### Test training locally via `gcloud`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "output_dir = \"census_%s\" % (int(time.time()))\n", "%env OUTPUT_DIR=$output_dir\n", "! gcloud ml-engine local train --package-path trainer \\\n", " --module-name trainer.task \\\n", " -- \\\n", " --train-files $TRAIN_FILE \\\n", " --eval-files $EVAL_FILE \\\n", " --train-steps 1000 \\\n", " --job-dir $OUTPUT_DIR \\\n", " --eval-steps 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Launch a distributed training job on Cloud ML Engine\n", "\n", "Now, let's use Cloud ML Engine (CMLE) to do distributed training in the cloud. Here's where you'll use your GCP project and CMLE setup. The CMLE setup instructions included creation of a Google Cloud Storage (GCS) bucket, which we'll use below.\n", "\n", "We'll set the training job to use the `SCALE_TIER_STANDARD_1` scale spec. This [gives you](https://cloud.google.com/ml-engine/docs/training-overview#job_configuration_parameters) one 'master' instance, plus four workers and three parameter servers. \n", "\n", "The cool thing about this is that **we don't need to change our code at all to use this distributed config**. Our use of the Estimator class in conjunction with the CMLE scale spec allows the distributed training config to be transparent to us -- it just works.\n", "For example, we could alternately set the `--scale-tier` config to [use GPUs](https://cloud.google.com/ml-engine/docs/using-gpus), without making any changes to our code.\n", "\n", "\n", "Notes: Each job requires a unique name, so rerun the cell that sets the env vars below each time you submit another job, if you want to run the following more than once. \n", "Your CMLE training job can take a few minutes to spin up, but for larger training jobs the startup time is a small percentage of the overall computation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "job_name = \"census_job_%s\" % (int(time.time()))\n", "\n", "# Edit the following to point to your GCS bucket directory\n", "gcs_job_dir = \"gs://your-gcs-bucket/path/%s\" % job_name\n", "# For training on CMLE, we'll use datasets stored in Google Cloud Storage (GCS) instead of local files.\n", "%env GCS_TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv\n", "%env GCS_EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv\n", "%env SCALE_TIER=STANDARD_1\n", "%env JOB_NAME=$job_name\n", "%env GCS_JOB_DIR=$gcs_job_dir" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# submit your distributed training job to CMLE\n", "!gcloud ml-engine jobs submit training $JOB_NAME --scale-tier $SCALE_TIER \\\n", " --runtime-version 1.4 --job-dir $GCS_JOB_DIR \\\n", " --module-name trainer.task --package-path trainer/ \\\n", " --region us-central1 \\\n", " -- --train-steps 10000 --train-files $GCS_TRAIN_FILE --eval-files $GCS_EVAL_FILE --eval-steps 100 " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "You can monitor the status of your job via the `stream-logs` command indicated above. \n", "(You can call `gcloud ml-engine jobs submit training` with the `--stream-logs` flag to stream the output logs right away). 
\n", "You can also monitor the status of your job in the Cloud Console: [console.cloud.google.com/mlengine/jobs](https://console.cloud.google.com/mlengine/jobs) \n", "In the logs, you'll see output from 4 worker replicas, numbered 0-3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scalably serve your trained model with CMLE Online Prediction\n", "\n", "Once your job is finished, you'll find the exported model under `$GCS_JOB_DIR`, in addition to other data such as checkpoints.\n", "You can now deploy the exported model to Cloud ML Engine and scalably serve it for **prediction**, using the CMLE Online Prediction service." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Run this when the training job is finished. Look for the directory with the 'saved_model.pb' file.\n", "!gsutil ls -R $GCS_JOB_DIR" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This is just an example.\n", "# Edit this path to point to the GCS directory that contains your saved_model.pb binary\n", "%env MODEL_BINARY=gs://$gcs_job_dir/export/census//" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a 'census' model on CMLE (you'll get an error if it already exists)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!gcloud ml-engine models create census --regions us-central1\n", "!gcloud ml-engine models list\n", "!gcloud ml-engine versions list --model census" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, deploy your trained model binary to CMLE as `v1` of the `census` model. This will let you use it for prediction. (You'll get an error if version 'v1' already exists. In that case, you can use a different version name)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!gcloud ml-engine versions create v1 --model census --origin $MODEL_BINARY --runtime-version 1.4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can look at the versions of a model in the Cloud Console, as well as set the default version: [console.cloud.google.com/mlengine/models](https://console.cloud.google.com/mlengine/models)\n", "\n", "Now you can use your deployed model for prediction. We've included a file, `test.json`, that encodes the input instance.\n", "\n", "[**add info about setting min instances to reduce warmup time**?]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Use your deployed model for prediction\n", "!cat test.json\n", "!gcloud ml-engine predict --model census --version v1 --json-instances test.json" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extras: Train on CMLE using a custom GPU cluster\n", "\n", "Above, we used the `STANDARD_1` scale tier to train our model. 
If you had wanted to train on 1 GPU, you could have used `BASIC_GPU` instead.\n", "\n", "You can train on a larger GPU cluster just as easily; with `gcloud`, it's just a matter of defining a .yaml config file that describes your cluster, and passing that config when you submit your training job.\n", "Note that using GPUs is [more expensive](https://cloud.google.com/ml-engine/pricing), so it will **cost more** to run this part of the example.\n", "\n", "To see how we'd do this, let's first take a look at the config file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!cat config_custom_gpus.yaml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're using NVIDIA Tesla P100 GPUs for our master and worker nodes (which is quite overkill for this small example!)\n", "We're using standard CPU nodes for the parameter servers. You can find more info on the node types [here](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs).\n", "\n", "We'll just run our job as before, except now we specify `CUSTOM` scale tier, and point to our config file. \n", "As before, you'll need to edit the GCS bucket path in the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "job_name = \"census_job_%s\" % (int(time.time()))\n", "# Edit the following to point to your GCS bucket directory\n", "gcs_job_dir = \"gs://your-gcs-bucket/path/%s\" % job_name\n", "%env GCS_TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv\n", "%env GCS_EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv\n", "%env SCALE_TIER=CUSTOM\n", "%env JOB_NAME=$job_name\n", "%env GCS_JOB_DIR=$gcs_job_dir" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!gcloud ml-engine jobs submit training $JOB_NAME --scale-tier $SCALE_TIER \\\n", " --runtime-version 1.4 --job-dir $GCS_JOB_DIR \\\n", " --module-name trainer.task --package-path trainer/ \\\n", " --region us-central1 --config config_custom_gpus.yaml \\\n", " -- --train-steps 15000 --train-files $GCS_TRAIN_FILE --eval-files $GCS_EVAL_FILE --eval-steps 100 \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extras: Use Hyperparameter Tuning\n", "\n", "CMLE makes it easy to do [Hyperparameter tuning](https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview). See the documentation for [more info](https://cloud.google.com/ml-engine/docs/using-hyperparameter-tuning).\n", "\n", "For this run, we'll go back to using the `STANDARD_1` tier. Note that because HP tuning does multiple runs — in this case, it will be 6 — this will be **more expensive** than the previous single runs.\n", "As before, you'll need to edit the GCS bucket path in the next cell to point to your bucket." 
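, "\n", "If you don't have the repo's `hptuning_config.yaml` handy, a config of this kind generally looks something like the sketch below. The parameter name, metric tag, and ranges here are illustrative, not necessarily what this sample's file uses:\n", "\n", "```yaml\ntrainingInput:\n  hyperparameters:\n    goal: MAXIMIZE\n    hyperparameterMetricTag: accuracy\n    maxTrials: 6\n    maxParallelTrials: 2\n    params:\n      - parameterName: first-layer-size\n        type: INTEGER\n        minValue: 50\n        maxValue: 500\n        scaleType: UNIT_LINEAR_SCALE\n```\n"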
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "job_name = \"census_job_%s\" % (int(time.time()))\n", "# Edit the following to point to your GCS bucket directory\n", "gcs_job_dir = \"gs://your-gcs-bucket/path/%s\" % job_name\n", "%env GCS_TRAIN_FILE=gs://cloudml-public/census/data/adult.data.csv\n", "%env GCS_EVAL_FILE=gs://cloudml-public/census/data/adult.test.csv\n", "%env SCALE_TIER=STANDARD_1\n", "%env JOB_NAME=$job_name\n", "%env GCS_JOB_DIR=$gcs_job_dir" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We'll use the `hptuning_config.yaml` file for this run.\n", "!cat hptuning_config.yaml" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!gcloud ml-engine jobs submit training $JOB_NAME --scale-tier $SCALE_TIER \\\n", " --runtime-version 1.4 --job-dir $GCS_JOB_DIR \\\n", " --module-name trainer.task --package-path trainer/ \\\n", " --region us-central1 --config hptuning_config.yaml \\\n", " -- --train-steps 15000 --train-files $GCS_TRAIN_FILE --eval-files $GCS_EVAL_FILE --eval-steps 100 \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can easily look at the results in the [Cloud Console](https://console.cloud.google.com/mlengine/jobs) — click on a job to see the its details, including the HP trial outcomes. You can also see information about each trial reflected in the job logs. The checkpoints and export for each trial are saved to separate subdirectories organized by trial number under your job dir." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "-----------------------------------\n", "\n", "Copyright 2018 Google Inc. All Rights Reserved. Licensed under the Apache\n", "License, Version 2.0 (the \"License\"); you may not use this file except in\n", "compliance with the License. You may obtain a copy of the License at \n", "http://www.apache.org/licenses/LICENSE-2.0\n", "\n", "Unless required by applicable law or agreed to in writing, software\n", "distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT\n", "WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the\n", "License for the specific language governing permissions and limitations under\n", "the License." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 2 }