{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Good practices in Modern Tensorflow for NLP" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "__author__ = 'Guillaume Genthial'\n", "__date__ = '2018-09-22'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "See the [README.md](README.md) for an overview of this notebook and its goals.\n", "\n", "0. [Eager execution](#Eager-execution)\n", "0. [`tf.data`: feeding data into the graph](#tf.data:-feeding-data-into-the-graph)\n", " 0. [Placeholders (before)](#Placeholders-(before))\n", " 0. [Dataset from `np.array`](#Dataset-from-np.array)\n", " 0. [Dataset from text file](#Dataset-from-text-file)\n", " 0. [Dataset from custom generator](#Dataset-from-custom-generator)\n", "0. [`tf.data`: Dataset Transforms](#tf.data:-Dataset-Transforms)\n", " 0. [Shuffle](#Shuffle)\n", " 0. [Repeat](#Repeat)\n", " 0. [Map](#Map)\n", " 0. [Batch](#Batch)\n", " 0. [Padded batch](#Padded-batch)\n", "0. [NLP: preprocessing in Tensorflow](#NLP:-preprocessing-in-Tensorflow)\n", " 0. [Tokenizing by white space in TensorFlow](#Tokenizing-by-white-space-in-TensorFlow)\n", " 0. [Lookup token index from vocab file in TensorFlow](#Lookup-token-index-from-vocab-file-in-TensorFlow)\n", "0. [Full Example](#Full-Example)\n", " 0. [Task and Data](#Task-and-Data)\n", " 0. [Graph (test with eager execution)](#Graph-(test-with-eager-execution))\n", " 0. [Model (`tf.estimator`)](#Model-(tf.estimator))\n", " 0. [Before: custom model classes](#Before:-custom-model-classes)\n", " 0. [Now: `tf.estimator`](#Now:--tf.estimator)\n", " 0. [`input_fn`](#input_fn)\n", " 0. [`model_fn`](#model_fn)\n", " 0. [Instantiate and train your Estimator](#Instantiate-and-train-your-Estimator)\n", " 0. [TensorBoard, `train_and_evaluate`, `predict` etc.](#TensorBoard,-train_and_evaluate,-predict-etc.)\n", "0. [A word about TensorFlow model serving](#A-word-about-TensorFlow-model-serving)\n", " 0. [Serving interface](#Serving-interface)\n", " 0. [Docker Image](#Docker-Image)\n", " 0. [Pull existing image](#Pull-existing-image)\n", " 0. [Run](#Run)\n", " 0. [Rest API POST with curl](#Rest-API-POST-with-curl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from distutils.version import LooseVersion\n", "import sys\n", "\n", "if LooseVersion(sys.version) < LooseVersion('3.4'):\n", " raise Exception('You need python>=3.4, but you have {}'.format(sys.version))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Standard\n", "from pathlib import Path\n", "\n", "# External\n", "import numpy as np\n", "import tensorflow as tf" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "if LooseVersion(tf.__version__) < LooseVersion('1.9'):\n", " raise Exception('You need tensorflow>=1.9, but you have {}'.format(tf.__version__))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eager execution\n", "\n", "Compatible with `numpy` (similar behavior to `pyTorch`). 
For a full review, see [this notebook from the TensorFlow team](https://colab.research.google.com/github/tensorflow/tensorflow/blob/r1.10/tensorflow/contrib/eager/python/examples/notebooks/eager_basics.ipynb).\n", "\n", "It's a great tool for __debugging__, and it allows dynamic graph building (if you really need it...)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# You need to activate it at program startup\n", "tf.enable_eager_execution()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(\n", "[[0.43946627 0.5605337 ]\n", " [0.6169051 0.38309485]], shape=(2, 2), dtype=float32)\n", "[[0.43946627 0.5605337 ]\n", " [0.6169051 0.38309485]]\n" ] } ], "source": [ "X = tf.random_normal([2, 4])\n", "h = tf.layers.dense(X, 2, activation=tf.nn.relu)\n", "y = tf.nn.softmax(h)\n", "print(y)\n", "print(y.numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, `X, h, y` are nodes of the computational graph. But you can actually get the value of these nodes!\n", "\n", "In the past you would have written\n", "\n", "```python\n", "X = tf.placeholder(dtype=tf.float32, shape=[2, 4])\n", "h = tf.layers.dense(X, 2, activation=tf.nn.relu)\n", "y = tf.nn.softmax(h)\n", "with tf.Session() as sess:\n", "    sess.run(tf.global_variables_initializer())\n", "    sess.run(y, feed_dict={X: np.random.normal(size=[2, 4])})\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `tf.data`: feeding data into the graph\n", "\n", "`tf.placeholder` is replaced by `tf.data.Dataset`.\n", "\n", "### Placeholders (before)\n", "\n", "```python\n", "x = tf.placeholder(dtype=tf.int32, shape=[None, 5])\n", "with tf.Session() as sess:\n", "    x_eval = sess.run(x, feed_dict={x: x_np})\n", "    print(x_eval)\n", "```\n", "\n", "### Dataset from `np.array`\n", "\n", "Below is a simple example where we have a `np.array`; one row = one example." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, 0, 0],\n", "       [1, 1, 1, 1, 1],\n", "       [2, 2, 2, 2, 2],\n", "       [3, 3, 3, 3, 3],\n", "       [4, 4, 4, 4, 4],\n", "       [5, 5, 5, 5, 5],\n", "       [6, 6, 6, 6, 6],\n", "       [7, 7, 7, 7, 7],\n", "       [8, 8, 8, 8, 8],\n", "       [9, 9, 9, 9, 9]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x_np = np.array([[i]*5 for i in range(10)])\n", "x_np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a `Dataset` from this array. \n", "\n", "This dataset is a node of the graph. \n", "Each time you query its value, it will move to the next row of the underlying `np.array`."
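, "\n", "\n", "For reference, in a program that does not enable eager execution, you would pull elements through an iterator instead. A minimal sketch in graph mode:\n", "\n", "```python\n", "dataset = tf.data.Dataset.from_tensor_slices(x_np)\n", "next_el = dataset.make_one_shot_iterator().get_next()\n", "with tf.Session() as sess:\n", "    print(sess.run(next_el))  # [0 0 0 0 0]\n", "    print(sess.run(next_el))  # [1 1 1 1 1]\n", "```"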
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataset = tf.data.Dataset.from_tensor_slices(x_np)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor([0 0 0 0 0], shape=(5,), dtype=int64)\n", "tf.Tensor([1 1 1 1 1], shape=(5,), dtype=int64)\n", "tf.Tensor([2 2 2 2 2], shape=(5,), dtype=int64)\n", "tf.Tensor([3 3 3 3 3], shape=(5,), dtype=int64)\n", "tf.Tensor([4 4 4 4 4], shape=(5,), dtype=int64)\n", "tf.Tensor([5 5 5 5 5], shape=(5,), dtype=int64)\n", "tf.Tensor([6 6 6 6 6], shape=(5,), dtype=int64)\n", "tf.Tensor([7 7 7 7 7], shape=(5,), dtype=int64)\n", "tf.Tensor([8 8 8 8 8], shape=(5,), dtype=int64)\n", "tf.Tensor([9 9 9 9 9], shape=(5,), dtype=int64)\n" ] } ], "source": [ "for el in dataset:\n", " print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`el` is the equivalent of the former `tf.placeholder`. It's a node of the graph, to which you can apply any Tensorflow operations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset from text file\n", "\n", "Let's just display the content of the file." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello world 1\n", "Hello world 2\n", "Hello world 3\n", "Hello world 4\n", "Hello world 5\n", "Hello world 6\n", "Hello world 7\n", "Hello world 8\n", "Hello world 9\n", "Hello world 1 2 3\n", "\n" ] } ], "source": [ "path = 'test.txt'\n", "with Path(path).open() as f:\n", " print(f.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following does just the same as above, but now `el` is a `tf.Tensor` of `dtype=tf.string`!" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(b'Hello world 1', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 2', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 3', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 4', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 5', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 6', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 7', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 8', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 9', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world 1 2 3', shape=(), dtype=string)\n" ] } ], "source": [ "dataset = tf.data.TextLineDataset([path])\n", "for el in dataset:\n", " print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset from custom generator\n", "\n", "__The best of both worlds, perfect for NLP__\n", "\n", "It will allow you do put all your logic in pure python, in your `generator_fn`, before feeding it to the Graph." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def generator_fn():\n", " for _ in range(2):\n", " yield b'Hello world'" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataset = (tf.data.Dataset.from_generator(\n", " generator_fn, \n", " output_types=(tf.string), # Define type and shape of your generator_fn output\n", " output_shapes=())) # like you would have for your `placeholders`" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(b'Hello world', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world', shape=(), dtype=string)\n" ] } ], "source": [ "for el in dataset:\n", " print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `tf.data`: Dataset Transforms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Shuffle\n", "\n", "Note: the `buffer_size` is the number of elements you load in the RAM before starting to sample from it. If it's too small (1 is no shuffling at all), it won't be efficient. Ideally, your `buffer_size` is the same as the number of elements in your dataset. But because not all datasets fit in RAM, you need to be able to set it manually." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataset = dataset.shuffle(buffer_size=10)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(b'Hello world', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world', shape=(), dtype=string)\n" ] } ], "source": [ "for el in dataset:\n", " print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Repeat\n", "\n", "Repeat your dataset to perform multiple epochs!" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataset = dataset.repeat(2) # 2 epochs" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(b'Hello world', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world', shape=(), dtype=string)\n", "tf.Tensor(b'Hello world', shape=(), dtype=string)\n" ] } ], "source": [ "for el in dataset:\n", " print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Map\n", "\n", "Note: while `map` is super handy when working with images (TensorFlow has a lot of image preprocessing functions and efficiency is crucial), it's not as practical for NLP, because you're now working with tensors. We found it easier to write the most of the preprocessing logic in python, in a `generator_fn`, before feeding it to the graph." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataset = dataset.map(\n", " lambda t: tf.string_split([t], delimiter=' ').values, \n", " num_parallel_calls=4) # Multithreading" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor([b'Hello' b'world'], shape=(2,), dtype=string)\n", "tf.Tensor([b'Hello' b'world'], shape=(2,), dtype=string)\n", "tf.Tensor([b'Hello' b'world'], shape=(2,), dtype=string)\n", "tf.Tensor([b'Hello' b'world'], shape=(2,), dtype=string)\n" ] } ], "source": [ "for el in dataset:\n", " print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataset = dataset.batch(batch_size=3)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(\n", "[[b'Hello' b'world']\n", " [b'Hello' b'world']\n", " [b'Hello' b'world']], shape=(3, 2), dtype=string)\n", "tf.Tensor([[b'Hello' b'world']], shape=(1, 2), dtype=string)\n" ] } ], "source": [ "for el in dataset:\n", " print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Padded batch\n", "\n", "In NLP, we usually work with sentences of different length. When building your batch, we need to 'pad', i.e. add some fake elements at the end of the shorter sentences. You can perform this operation easily in TensorFlow.\n", "\n", "Here is a dummy example:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def generator_fn():\n", " yield [1, 2]\n", " yield [1, 2, 3]\n", "\n", "dataset = tf.data.Dataset.from_generator(\n", " generator_fn, \n", " output_types=(tf.int32), \n", " output_shapes=([None]))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataset = dataset.padded_batch(\n", " batch_size=2, \n", " padded_shapes=([None]), \n", " padding_values=(4)) # Optional, if not set will default to 0" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor(\n", "[[1 2 4]\n", " [1 2 3]], shape=(2, 3), dtype=int32)\n" ] } ], "source": [ "for el in dataset:\n", " print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that a `4` has been appended at the end of the first row.\n", "\n", "And much more: `prefetch`, `zip`, `concatenate`, `skip`, `take` etc.\n", "See the [documentation](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).\n", "\n", "Note: the recommended standard workflow is \n", "\n", "1. `shuffle`\n", "2. `repeat` (repeat after shuffle so that one epoch = all the examples)\n", "3. `map`, using the `num_parallel_calls` argument to get multithreading for free.\n", "4. `batch` or `padded_batch`\n", "5. `prefetch` (will prefetch data on the GPU so that it doesn't suffer from any data starvation – and only use 80% of your expensive GPU)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NLP: preprocessing in Tensorflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenizing by white space in TensorFlow\n", "\n", "This is an example of why using `map` is kind of annoying in NLP. 
It works, but it's not as easy as just using `.split()` or any other Python code." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tf_tokenize(t):\n", "    return tf.string_split([t], delimiter=' ').values" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor([b'Hello' b'world' b'1'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'2'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'3'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'4'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'5'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'6'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'7'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'8'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'9'], shape=(3,), dtype=string)\n", "tf.Tensor([b'Hello' b'world' b'1' b'2' b'3'], shape=(5,), dtype=string)\n" ] } ], "source": [ "dataset = tf.data.TextLineDataset(['test.txt'])\n", "dataset = dataset.map(tf_tokenize)\n", "for el in dataset:\n", "    print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lookup token index from vocab file in TensorFlow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You're probably used to performing the lookup `token -> token_idx` outside TensorFlow. However, `tf.contrib.lookup` provides exactly this functionality. It's fast, and when exporting the model for serving, it will consider your `vocab.txt` file as a model resource and keep it with the model!" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0  ->  Hello\n", "1  ->  world\n" ] } ], "source": [ "# One lexeme per line\n", "path_vocab = 'vocab.txt'\n", "with Path(path_vocab).open() as f:\n", "    for idx, line in enumerate(f):\n", "        print(idx, ' -> ', line.strip())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use it in TensorFlow: " ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# The last idx (2) will be used for unknown words\n", "lookup_table = tf.contrib.lookup.index_table_from_file(\n", "    path_vocab, num_oov_buckets=1)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2], shape=(3,), dtype=int64)\n", "tf.Tensor([0 1 2 2 2], shape=(5,), dtype=int64)\n" ] } ], "source": [ "for el in dataset:\n", "    print(lookup_table.lookup(el))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Example\n", "\n", "### Task and Data\n", "\n", "The `tokens_generator` yields lists of token ids. We map even/odd numbers to 2 different ids.\n", "\n", "The `labels_generator` yields lists of label ids. 
We want to predict if a token is \n", "- a word (label `0`)\n", "- an odd number (label `1`)\n", "- an even number (label `2`)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We tokenize by white space and assign these ids\n", "tok_to_idx = {'hello': 0, 'world': 1, '<odd>': 2, '<even>': 3}\n", "\n", "def tokens_generator():\n", "    with Path(path).open() as f:\n", "        for line in f:\n", "            # Tokenize by white space\n", "            tokens = line.strip().split()\n", "            token_ids = []\n", "            for tok in tokens:\n", "                # Look for digits\n", "                if tok.isdigit():\n", "                    if int(tok) % 2 == 0:\n", "                        tok = '<even>'\n", "                    else:\n", "                        tok = '<odd>'\n", "                token_ids.append(tok_to_idx.get(tok.lower(), len(tok_to_idx)))\n", "            yield (token_ids, len(token_ids))\n", "\n", "def get_label(token_id):\n", "    if token_id == 2:\n", "        return 1\n", "    elif token_id == 3:\n", "        return 2\n", "    else:\n", "        return 0\n", "\n", "def labels_generator():\n", "    for token_ids, _ in tokens_generator():\n", "        yield [get_label(tok_id) for tok_id in token_ids]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 2], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 3], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 2], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 3], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 2], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 3], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 2], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 3], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(3,), dtype=int32, numpy=array([0, 1, 2], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=3>)\n", "(<tf.Tensor: id=..., shape=(5,), dtype=int32, numpy=array([0, 1, 2, 3, 2], dtype=int32)>, <tf.Tensor: id=..., shape=(), dtype=int32, numpy=5>)\n" ] } ], "source": [ "dataset = tf.data.Dataset.from_generator(\n", "    tokens_generator, \n", "    output_types=(tf.int32, tf.int32), \n", "    output_shapes=([None], ()))\n", "for el in dataset:\n", "    print(el)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graph (test with eager execution)\n", "\n", "Let's build a model that predicts the classes `0`, `1` and `2` above.\n", "\n", "We test our graph logic here, with eager execution activated." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "batch_size = 4\n", "vocab_size = 4\n", "dim = 100" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "shapes = ([None], ())\n", "defaults = (0, 0)\n", "# The last sentence is longer: need padding\n", "dataset = dataset.padded_batch(\n", "    batch_size, shapes, defaults)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Define all variables (in eager mode, this must be done just once)\n", "# Otherwise you would create new variables at each loop iteration!\n", "embeddings = tf.get_variable('embeddings', shape=[vocab_size, dim])\n", "lstm_cell = tf.contrib.rnn.LSTMCell(100)\n", "dense_layer = tf.layers.Dense(3, activation=tf.nn.relu)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(4, 3, 3)\n", "(4, 3, 3)\n", "(2, 5, 3)\n" ] } ], "source": [ "for tokens, sequence_length in dataset:\n", "    token_embeddings = tf.nn.embedding_lookup(embeddings, tokens)\n", "    lstm_output, _ = tf.nn.dynamic_rnn(\n", "        lstm_cell, token_embeddings, dtype=tf.float32, sequence_length=sequence_length)\n", "    logits = dense_layer(lstm_output)\n", "    print(logits.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No error/cryptic messages about some shape mismatch – seems like our TensorFlow logic is fine."
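, "\n", "\n", "You can sanity-check the loss computation the same way. A minimal sketch, assuming you also zip a padded batch of labels into the dataset (as `input_fn` does below):\n", "\n", "```python\n", "for (tokens, sequence_length), labels in dataset:\n", "    token_embeddings = tf.nn.embedding_lookup(embeddings, tokens)\n", "    lstm_output, _ = tf.nn.dynamic_rnn(\n", "        lstm_cell, token_embeddings, dtype=tf.float32, sequence_length=sequence_length)\n", "    logits = dense_layer(lstm_output)\n", "    # Mask the padded positions so they don't contribute to the loss\n", "    weights = tf.sequence_mask(sequence_length)\n", "    loss = tf.losses.sparse_softmax_cross_entropy(\n", "        logits=logits, labels=labels, weights=weights)\n", "    print(loss)\n", "```"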
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model (`tf.estimator`)\n", "\n", "`tf.estimator` uses the traditional graph-based environment (no eager execution).\n", "\n", "If you use the `tf.estimator` interface, you will get for free :\n", "\n", "1. Tensorboard\n", "2. Weights serialization\n", "3. Logging\n", "4. Model export for serving\n", "5. Unified structure compatible with open-source code\n", "\n", "#### Before: custom model classes\n", "People used to write custom model classes\n", "\n", "```python\n", "class Model:\n", "\n", " def get_feed_dict(self, X, y):\n", " return {self.X: X, self.y: y}\n", "\n", " def build(self):\n", " do_stuff()\n", "\n", " def train(self, X, y):\n", " with tf.Session() as sess:\n", " do_some_training()\n", "```\n", "\n", "\n", "#### Now: `tf.estimator`\n", "\n", "Now there is a common interface for models.\n", "\n", "```python\n", "def input_fn():\n", " # Return a tf.data.Dataset that yields a tuple features, labels\n", " return dataset\n", " \n", "def model_fn(features, labels, mode, params):\n", " \"\"\"\n", " Parameters\n", " ----------\n", " features: tf.Tensor or nested structure\n", " Returned by `input_fn`\n", " labels: tf.Tensor of nested structure\n", " Returned by `input_fn`\n", " mode: tf.estimator.ModeKeys\n", " Either PREDICT / EVAL / TRAIN\n", " params: dict\n", " Hyperparams\n", " \n", " Returns\n", " -------\n", " tf.estimator.EstimatorSpec\n", " \"\"\"\n", " if mode == tf.estimator.ModeKeys.TRAIN:\n", " return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)\n", " elif mode == ...\n", " ...\n", "\n", "estimator = tf.estimator.Estimator(\n", " model_fn=model_fn, params=params)\n", "estimator.train(input_fn)```" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Clear all the objects we defined above, to be sure \n", "# we don't mess with anything\n", "tf.reset_default_graph() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `input_fn`\n", "\n", "A callable that returns a dataset that yields tuples of features, labels" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def input_fn():\n", " # Create datasets for features and labels\n", " dataset_tokens = tf.data.Dataset.from_generator(\n", " tokens_generator, \n", " output_types=(tf.int32, tf.int32), \n", " output_shapes=([None], ()))\n", " dataset_output = tf.data.Dataset.from_generator(\n", " labels_generator, \n", " output_types=(tf.int32), \n", " output_shapes=([None]))\n", " \n", " # Zip features and labels in one Dataset\n", " dataset = tf.data.Dataset.zip((dataset_tokens, dataset_output))\n", " \n", " # Shuffle, repeat, batch and prefetch\n", " shapes = (([None], ()), [None])\n", " defaults = ((0, 0), 0)\n", " dataset = (dataset\n", " .shuffle(10)\n", " .repeat(100)\n", " .padded_batch(4, shapes, defaults)\n", " .prefetch(1))\n", "\n", " # Dataset yields tuple of features, labels\n", " return dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `model_fn`\n", "\n", "Inputs `(features, labels, mode, params)`; returns `EstimatorSpec` objects." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def model_fn(features, labels, mode, params):\n", " # Args features and labels are the same as returned by the dataset\n", " tokens, sequence_length = features\n", " \n", " # For Serving (ignore this)\n", " if isinstance(features, dict):\n", " tokens = features['tokens']\n", " sequence_length = features['sequence_length']\n", " \n", " # 1. Define the graph\n", " vocab_size = params['vocab_size']\n", " dim = params['dim']\n", " embeddings = tf.get_variable('embeddings', shape=[vocab_size, dim])\n", " token_embeddings = tf.nn.embedding_lookup(embeddings, tokens)\n", " lstm_cell = tf.contrib.rnn.LSTMCell(20)\n", " lstm_output, _ = tf.nn.dynamic_rnn(\n", " lstm_cell, token_embeddings, dtype=tf.float32)\n", " \n", " logits = tf.layers.dense(lstm_output, 3)\n", " preds = tf.argmax(logits, axis=-1)\n", " \n", " # 2. Define EstimatorSpecs for PREDICT\n", " if mode == tf.estimator.ModeKeys.PREDICT:\n", " # Predictions is any nested object (dict is convenient)\n", " predictions = {'logits': logits, 'preds': preds}\n", " # export_outputs is for serving (ignore this)\n", " export_outputs = {\n", " 'predictions': tf.estimator.export.PredictOutput(predictions)}\n", " return tf.estimator.EstimatorSpec(mode, predictions=predictions, \n", " export_outputs=export_outputs)\n", " else:\n", " # 3. Define loss and metrics\n", " # Define weights to account for padding\n", " weights = tf.sequence_mask(sequence_length)\n", " loss = tf.losses.sparse_softmax_cross_entropy(\n", " logits=logits, labels=labels, weights=weights)\n", " metrics = {\n", " 'accuracy': tf.metrics.accuracy(labels=labels, predictions=preds),\n", " }\n", " # For Tensorboard\n", " for k, v in metrics.items():\n", " # v[1] is the update op of the metrics object\n", " tf.summary.scalar(k, v[1])\n", " \n", " # 4. Define EstimatorSpecs for EVAL\n", " # Having an eval mode and metrics in Tensorflow allows you to use\n", " # built-in early stopping (see later)\n", " if mode == tf.estimator.ModeKeys.EVAL:\n", " return tf.estimator.EstimatorSpec(mode, loss=loss, \n", " eval_metric_ops=metrics)\n", " \n", " # 5. Define EstimatorSpecs for TRAIN\n", " elif mode == tf.estimator.ModeKeys.TRAIN:\n", " global_step = tf.train.get_or_create_global_step()\n", " train_op = (tf.train.AdamOptimizer(learning_rate=0.1)\n", " .minimize(loss, global_step=global_step))\n", " return tf.estimator.EstimatorSpec(mode, loss=loss, \n", " train_op=train_op)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you think about this `model_fn`? It seems like we wrote only things that matter (not a lot of boilerplate!)\n", "\n", "\n", "#### Instantiate and train your Estimator\n", "\n", "Now, let's define our estimator and train it!" 
] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Using default config.\n", "INFO:tensorflow:Using config: {'_session_config': None, '_task_id': 0, '_save_summary_steps': 100, '_tf_random_seed': None, '_service': None, '_log_step_count_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_task_type': 'worker', '_num_worker_replicas': 1, '_cluster_spec': , '_num_ps_replicas': 0, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_train_distribute': None, '_device_fn': None, '_master': '', '_is_chief': True, '_evaluation_master': '', '_global_id_in_cluster': 0, '_model_dir': 'model'}\n" ] } ], "source": [ "params = {\n", " 'vocab_size': 4,\n", " 'dim': 3\n", "}\n", "estimator = tf.estimator.Estimator(\n", " model_fn=model_fn,\n", " model_dir='model', # Will save the weights here automatically\n", " params=params)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n", "INFO:tensorflow:Done calling model_fn.\n", "INFO:tensorflow:Create CheckpointSaverHook.\n", "INFO:tensorflow:Graph was finalized.\n", "INFO:tensorflow:Restoring parameters from model/model.ckpt-2000\n", "INFO:tensorflow:Running local_init_op.\n", "INFO:tensorflow:Done running local_init_op.\n", "INFO:tensorflow:Saving checkpoints for 2000 into model/model.ckpt.\n", "INFO:tensorflow:loss = 5.960464e-08, step = 2001\n", "INFO:tensorflow:global_step/sec: 179.608\n", "INFO:tensorflow:loss = 6.953875e-08, step = 2101 (0.558 sec)\n", "INFO:tensorflow:global_step/sec: 274.294\n", "INFO:tensorflow:loss = 5.960464e-08, step = 2201 (0.364 sec)\n", "INFO:tensorflow:Saving checkpoints for 2250 into model/model.ckpt.\n", "INFO:tensorflow:Loss for final step: 1.8732878e-07.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "estimator.train(input_fn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### TensorBoard, `train_and_evaluate`, `predict` etc.\n", "\n", "Now, the `estimator` is trained, serialized to disk etc. You also have access to TensorBoard. (Lots of stuff for free, without having to write boilerplate code!)\n", "\n", "To access tensorboard :\n", "\n", "```\n", "tensorboard --logdir model\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check `evaluate`, `train_and_evaluate`... 
[documentation](https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate).\n", "\n", "Example with early stopping, where we run evaluation every 2 minutes (`120` seconds).\n", "\n", "```python\n", "hook = tf.contrib.estimator.stop_if_no_increase_hook(\n", "    estimator, 'accuracy', 500, min_steps=8000, run_every_secs=120)\n", "train_spec = tf.estimator.TrainSpec(input_fn=input_fn, hooks=[hook])\n", "eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, throttle_secs=120)\n", "tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)\n", "```" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n", "INFO:tensorflow:Done calling model_fn.\n", "INFO:tensorflow:Graph was finalized.\n", "INFO:tensorflow:Restoring parameters from model/model.ckpt-2250\n", "INFO:tensorflow:Running local_init_op.\n", "INFO:tensorflow:Done running local_init_op.\n", "[0 0 1 2 0]\n", "[0 0 1 2 1]\n" ] } ], "source": [ "# Iterate over the first 2 elements of the (shuffled) dataset and yield predictions\n", "# You need to write variants of your input_fn for eval / predict modes\n", "for idx, predictions in enumerate(estimator.predict(input_fn)):\n", "    print(predictions['preds'])\n", "    if idx > 0:\n", "        break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A word about TensorFlow model serving" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exporting an inference graph and the serving signature is \"simple\" (though the `serving_input_fn` interface could be improved). The cool thing is that once you have your `tf.estimator` and your `serving_input_fn`, you can just use `tensorflow_serving` and get a RESTful API serving your model!"
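, "\n", "\n", "Once the model is exported (see below), you can also sanity-check it directly from Python with `tf.contrib.predictor`. A minimal sketch, where the `'input'` feed key is an assumption (inspect the actual signature with `saved_model_cli show --dir export/... --all`):\n", "\n", "```python\n", "from tensorflow.contrib import predictor\n", "\n", "# Use the export directory returned by export_savedmodel below\n", "predict_fn = predictor.from_saved_model('export/1537767208')\n", "print(predict_fn({'input': [[0, 1, 2], [0, 1, 3]]}))\n", "```"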
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Serving interface" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def serving_input_fn():\n", " tokens = tf.placeholder(\n", " dtype=tf.int32, shape=[None, None], name=\"tokens\")\n", " sequence_length = tf.size(tokens)\n", " features = {'tokens': tokens, 'sequence_length': sequence_length}\n", " return tf.estimator.export.ServingInputReceiver(\n", " features=features, receiver_tensors=tokens)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:tensorflow:Calling model_fn.\n", "INFO:tensorflow:Done calling model_fn.\n", "INFO:tensorflow:Signatures INCLUDED in export for Predict: ['predictions', 'serving_default']\n", "INFO:tensorflow:Signatures INCLUDED in export for Regress: None\n", "INFO:tensorflow:Signatures INCLUDED in export for Train: None\n", "INFO:tensorflow:Signatures INCLUDED in export for Classify: None\n", "INFO:tensorflow:Signatures INCLUDED in export for Eval: None\n", "INFO:tensorflow:Restoring parameters from model/model.ckpt-2250\n", "INFO:tensorflow:Assets added to graph.\n", "INFO:tensorflow:No assets to write.\n", "INFO:tensorflow:SavedModel written to: export/temp-b'1537767208'/saved_model.pb\n" ] }, { "data": { "text/plain": [ "b'export/1537767208'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "estimator.export_savedmodel('export', serving_input_fn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Docker Image\n", "\n", "\n", "#### Pull existing image\n", "\n", "```\n", "docker pull tensorflow/serving\n", "```\n", "\n", "#### Run\n", "\n", "```\n", "docker run -p 8501:8501 \\\n", "--mount type=bind,\\\n", "source=path_to_your_export_model,\\\n", "target=/models/dummy \\\n", "-e MODEL_NAME=dummy -t tensorflow/serving &\n", "```\n", "\n", "#### Rest API POST with curl\n", "\n", "```\n", "curl -d '{\"instances\": [[0, 1, 2],[0, 1, 3]]}' -X POST \\\n", "http://localhost:8501/v1/models/dummy:predict\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python [mtf]", "language": "python", "name": "Python [mtf]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 2 }