{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Pnn4rDWGqDZL" }, "source": [ "##### Copyright 2018 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "cellView": "form", "colab": {}, "colab_type": "code", "id": "l534d35Gp68G" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "3TI3Q3XBesaS" }, "source": [ "# Training checkpoints" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "yw_a0iGucY8z" }, "source": [ "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n", " \u003ctd\u003e\n", " \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/beta/guide/checkpoints\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n", " \u003c/td\u003e\n", " \u003ctd\u003e\n", " \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/r2/guide/checkpoints.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n", " \u003c/td\u003e\n", " \u003ctd\u003e\n", " \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/en/r2/guide/checkpoints.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n", " \u003c/td\u003e\n", " \u003ctd\u003e\n", " \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/site/en/r2/guide/checkpoints.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n", " \u003c/td\u003e\n", "\u003c/table\u003e" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "LeDp7dovcbus" }, "source": [ "\n", "The phrase \"Saving a TensorFlow model\" typically means one of two things: (1) Checkpoints, OR (2) SavedModel.\n", "\n", "Checkpoints capture the exact value of all parameters (`tf.Variable` objects) used by a model. Checkpoints do not contain any description of the computation defined by the model and thus are typically only useful when source code that will use the saved parameter values is available.\n", "\n", "The SavedModel format on the other hand includes a serialized description of the computation defined by the model in addition to the parameter values (checkpoint). Models in this format are independent of the source code that created the model. They are thus suitable for deployment via TensorFlow Serving, TensorFlow Lite, TensorFlow.js, or programs in other programming languages (the C, C++, Java, Go, Rust, C# etc. TensorFlow APIs).\n", "\n", "This guide covers APIs for writing and reading checkpoints." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "5vsq3-pffo1I" }, "source": [ "## Saving from `tf.keras` training APIs\n", "\n", "See the [`tf.keras` guide on saving and\n", "restoring](./keras.ipynb#save_and_restore).\n", "\n", "`tf.keras.Model.save_weights`\n", "optionally saves in the TensorFlow checkpoint format. This guide explains the format in more depth, and introduces APIs for managing checkpoints in custom training loops." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "XseWX5jDg4lQ" }, "source": [ "## Writing checkpoints manually" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "1jpZPz76ZP3K" }, "source": [ "The persistent state of a TensorFlow model is stored in `tf.Variable` objects. These can be constructed directly, but are often created through high-level APIs like `tf.keras.layers`.\n", "\n", "The easiest way to manage variables is by attaching them to Python objects, then referencing those objects. Subclasses of `tf.train.Checkpoint`, `tf.keras.layers.Layer`, and `tf.keras.Model` automatically track variables assigned to their attributes. The following example constructs a simple linear model, then writes checkpoints which contain values for all of the model's variables." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "VEvpMYAKsC4z" }, "outputs": [], "source": [ "from __future__ import absolute_import, division, print_function, unicode_literals\n", "!pip install tensorflow==2.0.0-beta1\n", "import tensorflow as tf" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "BR5dChK7rXnj" }, "outputs": [], "source": [ "class Net(tf.keras.Model):\n", " \"\"\"A simple linear model.\"\"\"\n", "\n", " def __init__(self):\n", " super(Net, self).__init__()\n", " self.l1 = tf.keras.layers.Dense(5)\n", "\n", " def call(self, x):\n", " return self.l1(x)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "fNjf9KaLdIRP" }, "source": [ "Although it's not the focus of this guide, to be executable the example needs data and an optimization step. The model will train on slices of an in-memory dataset." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "tSNyP4IJ9nkU" }, "outputs": [], "source": [ "def toy_dataset():\n", " inputs = tf.range(10.)[:, None]\n", " labels = inputs * 5. + tf.range(5.)[None, :]\n", " return tf.data.Dataset.from_tensor_slices(\n", " dict(x=inputs, y=labels)).repeat(10).batch(2)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "ICm1cufh_JH8" }, "outputs": [], "source": [ "def train_step(net, example, optimizer):\n", " \"\"\"Trains `net` on `example` using `optimizer`.\"\"\"\n", " with tf.GradientTape() as tape:\n", " output = net(example['x'])\n", " loss = tf.reduce_mean(tf.abs(output - example['y']))\n", " variables = net.trainable_variables\n", " gradients = tape.gradient(loss, variables)\n", " optimizer.apply_gradients(zip(gradients, variables))\n", " return loss" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "NP9IySmCeCkn" }, "source": [ "The following training loop creates an instance of the model and of an optimizer, then gathers them into a `tf.train.Checkpoint` object. It calls the training step in a loop on each batch of data, and periodically writes checkpoints to disk." 
] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "BbCS5A6K1VSH" }, "outputs": [], "source": [ "opt = tf.keras.optimizers.Adam(0.1)\n", "net = Net()\n", "ckpt = tf.train.Checkpoint(step=tf.Variable(1), optimizer=opt, net=net)\n", "manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)\n", "ckpt.restore(manager.latest_checkpoint)\n", "if manager.latest_checkpoint:\n", " print(\"Restored from {}\".format(manager.latest_checkpoint))\n", "else:\n", " print(\"Initializing from scratch.\")\n", "\n", "for example in toy_dataset():\n", " loss = train_step(net, example, opt)\n", " ckpt.step.assign_add(1)\n", " if int(ckpt.step) % 10 == 0:\n", " save_path = manager.save()\n", " print(\"Saved checkpoint for step {}: {}\".format(int(ckpt.step), save_path))\n", " print(\"loss {:1.2f}\".format(loss.numpy()))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "lw1QeyRBgsLE" }, "source": [ "The preceding snippet will randomly initialize the model variables when it first runs. After the first run it will resume training from where it left off:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "UjilkTOV2PBK" }, "outputs": [], "source": [ "opt = tf.keras.optimizers.Adam(0.1)\n", "net = Net()\n", "ckpt = tf.train.Checkpoint(step=tf.Variable(1), optimizer=opt, net=net)\n", "manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)\n", "ckpt.restore(manager.latest_checkpoint)\n", "if manager.latest_checkpoint:\n", " print(\"Restored from {}\".format(manager.latest_checkpoint))\n", "else:\n", " print(\"Initializing from scratch.\")\n", "\n", "for example in toy_dataset():\n", " loss = train_step(net, example, opt)\n", " ckpt.step.assign_add(1)\n", " if int(ckpt.step) % 10 == 0:\n", " save_path = manager.save()\n", " print(\"Saved checkpoint for step {}: {}\".format(int(ckpt.step), save_path))\n", " print(\"loss {:1.2f}\".format(loss.numpy()))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "dxJT9vV-2PnZ" }, "source": [ "The `tf.train.CheckpointManager` object deletes old checkpoints. Above it's configured to keep only the three most recent checkpoints." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "3zmM0a-F5XqC" }, "outputs": [], "source": [ "print(manager.checkpoints) # List the three remaining checkpoints" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "qwlYDyjemY4P" }, "source": [ "These paths, e.g. `'./tf_ckpts/ckpt-10'`, are not files on disk. Instead they are prefixes for an `index` file and one or more data files which contain the variable values. These prefixes are grouped together in a single `checkpoint` file (`'./tf_ckpts/checkpoint'`) where the `CheckpointManager` saves its state." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "t1feej9JntV_" }, "outputs": [], "source": [ "!ls ./tf_ckpts" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "DR2wQc9x6b3X" }, "source": [ "\u003ca id=\"loading_mechanics\"/\u003e\n", "## Loading mechanics\n", "\n", "TensorFlow matches variables to checkpointed values by traversing a directed graph with named edges, starting from the object being loaded. Edge names typically come from attribute names in objects, for example the `\"l1\"` in `self.l1 = tf.keras.layers.Dense(5)`. 
, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "depGraphViz1" }, "source": [ "The dependency graph from the example above looks like this:\n", "\n", "![Visualization of the dependency graph for the example training loop](http://tensorflow.org/images/guide/whole_checkpoint.svg)\n", "\n", "The optimizer is red, regular variables are blue, and optimizer slot variables are orange. The other nodes, for example the one representing the `tf.train.Checkpoint`, are black.\n", "\n", "Slot variables are part of the optimizer's state, but are created for a specific variable. For example the `'m'` edges above correspond to momentum, which the Adam optimizer tracks for each variable. Slot variables are only saved in a checkpoint if both the variable and the optimizer would be saved, hence the dashed edges." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "VpY5IuanUEQ0" }, "source": [ "Calling `restore()` on a `tf.train.Checkpoint` object queues the requested restorations, restoring variable values as soon as there's a matching path from the `Checkpoint` object. For example, we can load just the bias from the model we defined above by reconstructing one path to it through the network and the layer." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "wmX2AuyH7TVt" }, "outputs": [], "source": [ "to_restore = tf.Variable(tf.zeros([5]))\n", "print(to_restore.numpy()) # All zeros\n", "fake_layer = tf.train.Checkpoint(bias=to_restore)\n", "fake_net = tf.train.Checkpoint(l1=fake_layer)\n", "new_root = tf.train.Checkpoint(net=fake_net)\n", "status = new_root.restore(tf.train.latest_checkpoint('./tf_ckpts/'))\n", "print(to_restore.numpy()) # We get the restored value now" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GqEW-_pJDAnE" }, "source": [ "The dependency graph for these new objects is a much smaller subgraph of the larger checkpoint we wrote above. It includes only the bias and a save counter that `tf.train.Checkpoint` uses to number checkpoints.\n", "\n", "![Visualization of a subgraph for the bias variable](http://tensorflow.org/images/guide/partial_checkpoint.svg)\n", "\n", "`restore()` returns a status object, which has optional assertions. All of the objects we've created in our new `Checkpoint` have been restored, so `status.assert_existing_objects_matched()` passes." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "P9TQXl81Dq5r" }, "outputs": [], "source": [ "status.assert_existing_objects_matched()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GoMwf8CFDu9r" }, "source": [ "There are many objects in the checkpoint that haven't matched, including the layer's kernel and the optimizer's variables. `status.assert_consumed()` only passes if the checkpoint and the program match exactly, and would throw an exception here." ] }
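, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "assertConsumedSketchMd" }, "source": [ "To see that failure without interrupting the notebook, here is a small sketch, assuming the `status` object from the cells above, that catches the expected `AssertionError`:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "assertConsumedSketchCode" }, "outputs": [], "source": [ "# assert_consumed() raises because the checkpoint still holds values, such as\n", "# the layer's kernel and the optimizer's variables, with no matching objects.\n", "try:\n", "  status.assert_consumed()\n", "except AssertionError:\n", "  print('assert_consumed raised, as expected')" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "KCcmJ-2j9RUP" }, "source": [ "### Delayed restorations\n", "\n", "`Layer` objects in TensorFlow may delay the creation of variables until their first call, when input shapes are available. For example, the shape of a `Dense` layer's kernel depends on both the layer's input and output shapes, and so the output shape required as a constructor argument is not enough information to create the variable on its own. 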
Since calling a `Layer` also reads the variable's value, a restore must happen between the variable's creation and its first use.\n", "\n", "To support this idiom, `tf.train.Checkpoint` queues restores that don't yet have a matching variable." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "TXYUCO3v-I72" }, "outputs": [], "source": [ "delayed_restore = tf.Variable(tf.zeros([1, 5]))\n", "print(delayed_restore.numpy()) # Not restored; still zeros\n", "fake_layer.kernel = delayed_restore\n", "print(delayed_restore.numpy()) # Restored" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-DWhJ3glyobN" }, "source": [ "### Manually inspecting checkpoints\n", "\n", "`tf.train.list_variables` lists the checkpoint keys and shapes of variables in a checkpoint. Checkpoint keys are paths in the graph displayed above." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "RlRsADTezoBD" }, "outputs": [], "source": [ "tf.train.list_variables(tf.train.latest_checkpoint('./tf_ckpts/'))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "5fxk_BnZ4W1b" }, "source": [ "### List and dictionary tracking\n", "\n", "As with direct attribute assignments like `self.l1 = tf.keras.layers.Dense(5)`, assigning lists and dictionaries to attributes will track their contents." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "rfaIbDtDHAr_" }, "outputs": [], "source": [ "save = tf.train.Checkpoint()\n", "save.listed = [tf.Variable(1.)]\n", "save.listed.append(tf.Variable(2.))\n", "save.mapped = {'one': save.listed[0]}\n", "save.mapped['two'] = save.listed[1]\n", "save_path = save.save('./tf_list_example')\n", "\n", "restore = tf.train.Checkpoint()\n", "v2 = tf.Variable(0.)\n", "assert 0. == v2.numpy() # Not restored yet\n", "restore.mapped = {'two': v2}\n", "restore.restore(save_path)\n", "assert 2. == v2.numpy()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "UTKvbxHcI3T2" }, "source": [ "You may notice wrapper objects for lists and dictionaries. These wrappers are checkpointable versions of the underlying data structures. Just like attribute-based loading, these wrappers restore a variable's value as soon as it's added to the container." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "s0Uq1Hv5JCmm" }, "outputs": [], "source": [ "restore.listed = []\n", "print(restore.listed) # ListWrapper([])\n", "v1 = tf.Variable(0.)\n", "restore.listed.append(v1) # Restores v1, from restore() in the previous cell\n", "assert 1. == v1.numpy()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OxCIf2J6JyQ8" }, "source": [ "The same tracking is automatically applied to subclasses of `tf.keras.Model`, and may be used, for example, to track lists of layers." ] }
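, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "layerListSketchMd" }, "source": [ "For instance, here is a minimal sketch; the `Stack` model and the `'./stack_sketch'` path are hypothetical, not used elsewhere in this guide:" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "layerListSketchCode" }, "outputs": [], "source": [ "class Stack(tf.keras.Model):\n", "  \"\"\"A model whose list of layers is tracked automatically.\"\"\"\n", "\n", "  def __init__(self):\n", "    super(Stack, self).__init__()\n", "    self.stacked = [tf.keras.layers.Dense(3), tf.keras.layers.Dense(1)]\n", "\n", "  def call(self, x):\n", "    for layer in self.stacked:\n", "      x = layer(x)\n", "    return x\n", "\n", "stack = Stack()\n", "stack(tf.ones([1, 2]))  # Builds the variables\n", "ckpt_path = tf.train.Checkpoint(model=stack).save('./stack_sketch')\n", "tf.train.list_variables(ckpt_path)  # Keys include 'model/stacked/0/...'" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "zGG1tOM0L6iM" }, "source": [ "## Saving object-based checkpoints with Estimator\n", "\n", "See the [guide to Estimator](https://www.tensorflow.org/guide/estimators).\n", "\n", "Estimators by default save checkpoints with variable names rather than the object graph described in the previous sections. `tf.train.Checkpoint` will accept name-based checkpoints, but variable names may change when moving parts of a model outside of the Estimator's `model_fn`. 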
Saving object-based checkpoints makes it easier to train a model inside an Estimator and then use it outside of one." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "-8AMJeueNyoM" }, "outputs": [], "source": [ "import tensorflow.compat.v1 as tf_compat" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "T6fQsBzJQN2y" }, "outputs": [], "source": [ "def model_fn(features, labels, mode):\n", " net = Net()\n", " opt = tf.keras.optimizers.Adam(0.1)\n", " ckpt = tf.train.Checkpoint(step=tf_compat.train.get_global_step(),\n", " optimizer=opt, net=net)\n", " with tf.GradientTape() as tape:\n", " output = net(features['x'])\n", " loss = tf.reduce_mean(tf.abs(output - features['y']))\n", " variables = net.trainable_variables\n", " gradients = tape.gradient(loss, variables)\n", " return tf.estimator.EstimatorSpec(\n", " mode,\n", " loss=loss,\n", " train_op=tf.group(opt.apply_gradients(zip(gradients, variables)),\n", " ckpt.step.assign_add(1)),\n", " # Tell the Estimator to save \"ckpt\" in an object-based format.\n", " scaffold=tf_compat.train.Scaffold(saver=ckpt))\n", "\n", "tf.keras.backend.clear_session()\n", "est = tf.estimator.Estimator(model_fn, './tf_estimator_example/')\n", "est.train(toy_dataset, steps=10)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "tObYHnrrb_mL" }, "source": [ "`tf.train.Checkpoint` can then load the Estimator's checkpoints from its `model_dir`." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "Q6IP3Y_wb-fs" }, "outputs": [], "source": [ "opt = tf.keras.optimizers.Adam(0.1)\n", "net = Net()\n", "ckpt = tf.train.Checkpoint(\n", " step=tf.Variable(1, dtype=tf.int64), optimizer=opt, net=net)\n", "ckpt.restore(tf.train.latest_checkpoint('./tf_estimator_example/'))\n", "ckpt.step.numpy() # From est.train(..., steps=10)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "knyUFMrJg8y4" }, "source": [ "## Summary\n", "\n", "TensorFlow objects provide an easy automatic mechanism for saving and restoring the values of variables they use.\n" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "checkpoints.ipynb", "private_outputs": true, "provenance": [], "toc_visible": true, "version": "0.3.2" }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }