{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2019 The TensorFlow Authors.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "# Distributed training in TensorFlow" ] }, { "cell_type": "markdown", "metadata": { "id": "r6P32iYYV27b" }, "source": [ "\n", " \n", " \n", "
\n", " Run in Google Colab\n", " \n", " View source on GitHub\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "r6P32iYYV27b" }, "source": [ "> Note: This is an archived TF1 notebook. These are configured\n", "to run in TF2's \n", "[compatbility mode](https://www.tensorflow.org/guide/migrate)\n", "but will run in TF1 as well. To use TF1 in Colab, use the\n", "[%tensorflow_version 1.x](https://colab.research.google.com/notebooks/tensorflow_version.ipynb)\n", "magic." ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "## Overview\n", "\n", "The `tf.distribute.Strategy` API provides an abstraction for distributing your training\n", "across multiple processing units. The goal is to allow users to enable distributed training using existing models and training code, with minimal changes.\n", "\n", "This tutorial uses the `tf.distribute.MirroredStrategy`, which\n", "does in-graph replication with synchronous training on many GPUs on one machine.\n", "Essentially, it copies all of the model's variables to each processor.\n", "Then, it uses [all-reduce](http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/) to combine the gradients from all processors and applies the combined value to all copies of the model.\n", "\n", "`MirroredStategy` is one of several distribution strategy available in TensorFlow core. You can read about more strategies at [distribution strategy guide](../../guide/distribute_strategy.ipynb).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "MUXex9ctTuDB" }, "source": [ "### Keras API\n", "\n", "This example uses the `tf.keras` API to build the model and training loop. For custom training loops, see [this tutorial](training_loops.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "id": "Dney9v7BsJij" }, "source": [ "## Import Dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "74rHkS_DB3X2" }, "outputs": [], "source": [ "# Import TensorFlow\n", "import tensorflow.compat.v1 as tf\n", "\n", "import tensorflow_datasets as tfds\n", "\n", "import os" ] }, { "cell_type": "markdown", "metadata": { "id": "hXhefksNKk2I" }, "source": [ "## Download the dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "OtnnUwvmB3X5" }, "source": [ "Download the MNIST dataset and load it from [TensorFlow Datasets](https://www.tensorflow.org/datasets). This returns a dataset in `tf.data` format." ] }, { "cell_type": "markdown", "metadata": { "id": "lHAPqG8MtS8M" }, "source": [ "Setting `with_info` to `True` includes the metadata for the entire dataset, which is being saved here to `ds_info`.\n", "Among other things, this metadata object includes the number of train and test examples.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iXMJ3G9NB3X6" }, "outputs": [], "source": [ "datasets, ds_info = tfds.load(name='mnist', with_info=True, as_supervised=True)\n", "mnist_train, mnist_test = datasets['train'], datasets['test']" ] }, { "cell_type": "markdown", "metadata": { "id": "GrjVhv-eKuHD" }, "source": [ "## Define Distribution Strategy" ] }, { "cell_type": "markdown", "metadata": { "id": "TlH8vx6BB3X9" }, "source": [ "Create a `MirroredStrategy` object. This will handle distribution, and provides a context manager (`tf.distribute.MirroredStrategy.scope`) to build your model inside." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4j0tdf4YB3X9" }, "outputs": [], "source": [ "strategy = tf.distribute.MirroredStrategy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cY3KA_h2iVfN" }, "outputs": [], "source": [ "print ('Number of devices: {}'.format(strategy.num_replicas_in_sync))" ] }, { "cell_type": "markdown", "metadata": { "id": "lNbPv0yAleW8" }, "source": [ "## Setup Input pipeline" ] }, { "cell_type": "markdown", "metadata": { "id": "psozqcuptXhK" }, "source": [ "If a model is trained on multiple GPUs, the batch size should be increased accordingly so as to make effective use of the extra computing power. Moreover, the learning rate should be tuned accordingly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "p1xWxKcnhar9" }, "outputs": [], "source": [ "# You can also do ds_info.splits.total_num_examples to get the total\n", "# number of examples in the dataset.\n", "\n", "num_train_examples = ds_info.splits['train'].num_examples\n", "num_test_examples = ds_info.splits['test'].num_examples\n", "\n", "BUFFER_SIZE = 10000\n", "\n", "BATCH_SIZE_PER_REPLICA = 64\n", "BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync" ] }, { "cell_type": "markdown", "metadata": { "id": "0Wm5rsL2KoDF" }, "source": [ "Pixel values, which are 0-255, [have to be normalized to the 0-1 range](https://en.wikipedia.org/wiki/Feature_scaling). Define this scale in a function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Eo9a46ZeJCkm" }, "outputs": [], "source": [ "def scale(image, label):\n", " image = tf.cast(image, tf.float32)\n", " image /= 255\n", "\n", " return image, label" ] }, { "cell_type": "markdown", "metadata": { "id": "WZCa5RLc5A91" }, "source": [ "Apply this function to the training and test data, shuffle the training data, and [batch it for training](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gRZu2maChwdT" }, "outputs": [], "source": [ "train_dataset = mnist_train.map(scale).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)\n", "eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)" ] }, { "cell_type": "markdown", "metadata": { "id": "4xsComp8Kz5H" }, "source": [ "## Create the model" ] }, { "cell_type": "markdown", "metadata": { "id": "1BnQYQTpB3YA" }, "source": [ "Create and compile the Keras model in the context of `strategy.scope`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IexhL_vIB3YA" }, "outputs": [], "source": [ "with strategy.scope():\n", " model = tf.keras.Sequential([\n", " tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),\n", " tf.keras.layers.MaxPooling2D(),\n", " tf.keras.layers.Flatten(),\n", " tf.keras.layers.Dense(64, activation='relu'),\n", " tf.keras.layers.Dense(10, activation='softmax')\n", " ])\n", "\n", " model.compile(loss='sparse_categorical_crossentropy',\n", " optimizer=tf.keras.optimizers.Adam(),\n", " metrics=['accuracy'])" ] }, { "cell_type": "markdown", "metadata": { "id": "8i6OU5W9Vy2u" }, "source": [ "## Define the callbacks.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YOXO5nvvK3US" }, "source": [ "The callbacks used here are:\n", "\n", "* *Tensorboard*: This callback writes a log for Tensorboard which allows you to visualize the graphs.\n", "* *Model Checkpoint*: This callback saves the model after every epoch.\n", "* *Learning Rate Scheduler*: Using this callback, you can schedule the learning rate to change after every epoch/batch.\n", "\n", "For illustrative purposes, add a print callback to display the *learning rate* in the notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A9bwLCcXzSgy" }, "outputs": [], "source": [ "# Define the checkpoint directory to store the checkpoints\n", "\n", "checkpoint_dir = './training_checkpoints'\n", "# Name of the checkpoint files\n", "checkpoint_prefix = os.path.join(checkpoint_dir, \"ckpt_{epoch}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wpU-BEdzJDbK" }, "outputs": [], "source": [ "# Function for decaying the learning rate.\n", "# You can define any decay function you need.\n", "def decay(epoch):\n", " if epoch < 3:\n", " return 1e-3\n", " elif epoch >= 3 and epoch < 7:\n", " return 1e-4\n", " else:\n", " return 1e-5" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jKhiMgXtKq2w" }, "outputs": [], "source": [ "# Callback for printing the LR at the end of each epoch.\n", "class PrintLR(tf.keras.callbacks.Callback):\n", " def on_epoch_end(self, epoch, logs=None):\n", " print ('\\nLearning rate for epoch {} is {}'.format(\n", " epoch + 1, tf.keras.backend.get_value(model.optimizer.lr)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YVqAbR6YyNQh" }, "outputs": [], "source": [ "callbacks = [\n", " tf.keras.callbacks.TensorBoard(log_dir='./logs'),\n", " tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,\n", " save_weights_only=True),\n", " tf.keras.callbacks.LearningRateScheduler(decay),\n", " PrintLR()\n", "]" ] }, { "cell_type": "markdown", "metadata": { "id": "70HXgDQmK46q" }, "source": [ "## Train and evaluate" ] }, { "cell_type": "markdown", "metadata": { "id": "6EophnOAB3YD" }, "source": [ "Now, train the model in the usual way, calling `fit` on the model and passing in the dataset created at the beginning of the tutorial. This step is the same whether you are distributing the training or not.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7MVw_6CqB3YD" }, "outputs": [], "source": [ "model.fit(train_dataset, epochs=10, callbacks=callbacks)" ] }, { "cell_type": "markdown", "metadata": { "id": "NUcWAUUupIvG" }, "source": [ "As you can see below, the checkpoints are getting saved." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JQ4zeSTxKEhB" }, "outputs": [], "source": [ "# check the checkpoint directory\n", "!ls {checkpoint_dir}" ] }, { "cell_type": "markdown", "metadata": { "id": "qor53h7FpMke" }, "source": [ "To see how the model perform, load the latest checkpoint and call `evaluate` on the test data.\n", "\n", "Call `evaluate` as before using appropriate datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JtEwxiTgpQoP" }, "outputs": [], "source": [ "model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))\n", "\n", "eval_loss, eval_acc = model.evaluate(eval_dataset)\n", "print ('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))" ] }, { "cell_type": "markdown", "metadata": { "id": "IIeF2RWfYu4N" }, "source": [ "To see the output, you can download and view the TensorBoard logs at the terminal.\n", "\n", "```\n", "$ tensorboard --logdir=path/to/log-directory\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LnyscOkvKKBR" }, "outputs": [], "source": [ "!ls -sh ./logs" ] }, { "cell_type": "markdown", "metadata": { "id": "kBLlogrDvMgg" }, "source": [ "## Export to SavedModel" ] }, { "cell_type": "markdown", "metadata": { "id": "Xa87y_A0vRma" }, "source": [ "If you want to export the graph and the variables, SavedModel is the best way of doing this. The model can be loaded back with or without the scope. Moreover, SavedModel is platform agnostic." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "h8Q4MKOLwG7K" }, "outputs": [], "source": [ "path = 'saved_model/'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4HvcDmVsvQoa" }, "outputs": [], "source": [ "tf.keras.experimental.export_saved_model(model, path)" ] }, { "cell_type": "markdown", "metadata": { "id": "vKJT4w5JwVPI" }, "source": [ "Load the model without `strategy.scope`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "T_gT0RbRvQ3o" }, "outputs": [], "source": [ "unreplicated_model = tf.keras.experimental.load_from_saved_model(path)\n", "\n", "unreplicated_model.compile(\n", " loss='sparse_categorical_crossentropy',\n", " optimizer=tf.keras.optimizers.Adam(),\n", " metrics=['accuracy'])\n", "\n", "eval_loss, eval_acc = unreplicated_model.evaluate(eval_dataset)\n", "print ('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))" ] }, { "cell_type": "markdown", "metadata": { "id": "8uNqWRdDMl5S" }, "source": [ "## What's next?\n", "\n", "Read the [distribution strategy guide](../../guide/distribute_strategy_tf1.ipynb).\n", "\n", "Note: `tf.distribute.Strategy` is actively under development and we will be adding more examples and tutorials in the near future. Please give it a try. We welcome your feedback via [issues on GitHub](https://github.com/tensorflow/tensorflow/issues/new)." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "keras.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }