{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Preprocessing structured data in TF 2.3.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7TFM07NMV5bl",
        "colab_type": "text"
      },
      "source": [
        "# Preprocessing structured data in TF 2.3"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BwF-oSahsDT1",
        "colab_type": "text"
      },
      "source": [
        "In TF 2.3, Keras adds new preprocessing layers for image, text and structured data. This notebook explores the new layers for structured data.\n",
        "\n",
        "For a complete example of how to use the new preprocessing layers for structured data, see the [Keras example](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4uXfSpzsHeoc",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#hide\n",
        "!pip install -q tf-nightly"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YLxfwMIVHmox",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "6c3a7764-1d54-4015-e8cf-e59eadf871e2"
      },
      "source": [
        "#hide\n",
        "import pandas as pd\n",
        "import tensorflow as tf\n",
        "print('TF version: ', tf.__version__)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "TF version:  2.4.0-dev20200802\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tb_Q4T8oIAAv",
        "colab_type": "text"
      },
      "source": [
        "## Structured data\n",
        "Create a small toy dataset to inspect the output of each preprocessing layer."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1Uv22PklKsAU",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 70
        },
        "outputId": "37d801b0-c5cd-4d5a-86a2-cce0416e42e9"
      },
      "source": [
        "xdf = pd.DataFrame({\n",
        "  'categorical_string': ['LOW', 'HIGH', 'HIGH', 'MEDIUM'],\n",
        "  'categorical_integer_1': [1, 0, 1, 0],\n",
        "  'categorical_integer_2': [1, 2, 3, 4],\n",
        "  'numerical_1': [2.3, 0.2, 1.9, 5.8],\n",
        "  'numerical_2': [16, 32, 8, 60]\n",
        "})\n",
        "ydf = pd.DataFrame({'target': [0, 0, 0, 1]})\n",
        "ds = tf.data.Dataset.from_tensor_slices((dict(xdf), ydf))\n",
        "for x, y in ds.take(1):\n",
        "  print('X:', x)\n",
        "  print('y:', y)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "X: {'categorical_string': <tf.Tensor: shape=(), dtype=string, numpy=b'LOW'>, 'categorical_integer_1': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'categorical_integer_2': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'numerical_1': <tf.Tensor: shape=(), dtype=float64, numpy=2.3>, 'numerical_2': <tf.Tensor: shape=(), dtype=int64, numpy=16>}\n",
            "y: tf.Tensor([0], shape=(1,), dtype=int64)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kZvc-SXCH6h7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from tensorflow.keras.layers.experimental.preprocessing import Normalization\n",
        "from tensorflow.keras.layers.experimental.preprocessing import CategoryEncoding\n",
        "from tensorflow.keras.layers.experimental.preprocessing import StringLookup"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cBvtZLCZVgse",
        "colab_type": "text"
      },
      "source": [
        "## Pre-processing Numerical columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z44X_lm1M8x0",
        "colab_type": "text"
      },
      "source": [
        "Preprocessing helper function to encode **numerical** features, e.g. 0.1, 0.2, etc."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "oQSJ91STM8IM",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def create_numerical_encoder(dataset, name):\n",
        "    # Create a Normalization layer for our feature\n",
        "    normalizer = Normalization()\n",
        "\n",
        "    # Prepare a Dataset that only yields our feature\n",
        "    feature_ds = dataset.map(lambda x, y: x[name])\n",
        "    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))\n",
        "\n",
        "    # Learn the statistics of the data\n",
        "    normalizer.adapt(feature_ds)\n",
        "\n",
        "    return normalizer"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YnWPc9QcOdrl",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 101
        },
        "outputId": "783d5934-2ac0-40d5-9c68-917aeb8e66f2"
      },
      "source": [
        "# Apply normalization to a numerical feature\n",
        "normalizer = create_numerical_encoder(ds, 'numerical_1')\n",
        "normalizer.apply(xdf['numerical_1'].values)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<tf.Tensor: shape=(4, 1), dtype=float32, numpy=\n",
              "array([[-0.7615536],\n",
              "       [-1.2528784],\n",
              "       [-0.7615536],\n",
              "       [-1.2528784]], dtype=float32)>"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 25
        }
      ]
    },
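    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "As a rough cross-check, the same transformation can be sketched with NumPy. This is a minimal sketch, not the layer's implementation: it assumes that `adapt` stores the mean and the population variance of the feature and that calling the layer computes `(x - mean) / sqrt(variance)`. The names `values` and `manual` are introduced here for illustration only."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import numpy as np\n",
        "\n",
        "# Same values as the 'numerical_1' column of the toy dataframe\n",
        "values = np.array([2.3, 0.2, 1.9, 5.8], dtype='float32')\n",
        "\n",
        "# Statistics that adapt() is assumed to learn from the data\n",
        "mean = values.mean()\n",
        "variance = values.var()  # population variance (ddof=0)\n",
        "\n",
        "# Standardize by hand: zero mean and unit variance\n",
        "manual = (values - mean) / np.sqrt(variance)\n",
        "print(manual)"
      ],
      "execution_count": null,
      "outputs": []
    },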
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Vj47KP7AVlMw",
        "colab_type": "text"
      },
      "source": [
        "## Pre-processing Integer categorical columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "l41f7qUINiy0",
        "colab_type": "text"
      },
      "source": [
        "Preprocessing helper function to encode **integer categorical** features, e.g. 1, 2, 3."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Wf2UrNG0Ng96",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def create_integer_categorical_encoder(dataset, name):\n",
        "    # Create a CategoryEncoding for our integer indices\n",
        "    encoder = CategoryEncoding(output_mode=\"binary\")\n",
        "\n",
        "    # Prepare a Dataset that only yields our feature\n",
        "    feature_ds = dataset.map(lambda x, y: x[name])\n",
        "    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))\n",
        "\n",
        "    # Learn the space of possible indices\n",
        "    encoder.adapt(feature_ds)\n",
        "\n",
        "    return encoder"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pQp8Yrk9OUGq",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 101
        },
        "outputId": "52f13ccd-939a-453b-9dfd-a68d2fb4b0b1"
      },
      "source": [
        "# Apply one-hot encoding to an integer categorical feature\n",
        "encoder1 = create_integer_categorical_encoder(ds, 'categorical_integer_1')\n",
        "encoder1.apply(xdf['categorical_integer_1'].values)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<tf.Tensor: shape=(4, 2), dtype=float32, numpy=\n",
              "array([[0., 1.],\n",
              "       [1., 0.],\n",
              "       [0., 1.],\n",
              "       [1., 0.]], dtype=float32)>"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 29
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "vudMemikPHTH",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 101
        },
        "outputId": "4e60debb-43d2-4842-b87e-fa75a4aecc5c"
      },
      "source": [
        "# Apply one-hot encoding to an integer categorical feature\n",
        "encoder2 = create_integer_categorical_encoder(ds, 'categorical_integer_2')\n",
        "encoder2.apply(xdf['categorical_integer_2'].values)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<tf.Tensor: shape=(4, 5), dtype=float32, numpy=\n",
              "array([[0., 1., 0., 0., 0.],\n",
              "       [0., 0., 1., 0., 0.],\n",
              "       [0., 0., 0., 1., 0.],\n",
              "       [0., 0., 0., 0., 1.]], dtype=float32)>"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 30
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iKUiomfSVoCs",
        "colab_type": "text"
      },
      "source": [
        "## Pre-processing String categorical columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S-he2CIaNGk1",
        "colab_type": "text"
      },
      "source": [
        "Preprocessing helper function to encode **string categorical** features, e.g. LOW, HIGH, MEDIUM.\n",
        "\n",
        "This applies the following steps to the input feature:\n",
        "1. Create a token-to-index lookup table\n",
        "2. Apply one-hot encoding to the token indices"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7lOKw0EcICor",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def create_string_categorical_encoder(dataset, name):\n",
        "    # Create a StringLookup layer which will turn strings into integer indices\n",
        "    index = StringLookup()\n",
        "\n",
        "    # Prepare a Dataset that only yields our feature\n",
        "    feature_ds = dataset.map(lambda x, y: x[name])\n",
        "    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))\n",
        "\n",
        "    # Learn the set of possible string values and assign them a fixed integer index\n",
        "    index.adapt(feature_ds)\n",
        "\n",
        "    # Create a CategoryEncoding for our integer indices\n",
        "    encoder = CategoryEncoding(output_mode=\"binary\")\n",
        "\n",
        "    # Prepare a dataset of indices\n",
        "    feature_ds = feature_ds.map(index)\n",
        "\n",
        "    # Learn the space of possible indices\n",
        "    encoder.adapt(feature_ds)\n",
        "\n",
        "    return index, encoder"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1GKl0qpNQ9yz",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 101
        },
        "outputId": "011df6c2-f426-4bbf-a341-6818d714f6b3"
      },
      "source": [
        "# Apply one-hot encoding to a string categorical feature\n",
        "indexer, encoder3 = create_string_categorical_encoder(ds, 'categorical_string')\n",
        "# Turn the string input into integer indices\n",
        "indices = indexer.apply(xdf['categorical_string'].values)\n",
        "# Apply one-hot encoding to our indices\n",
        "encoder3.apply(indices)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<tf.Tensor: shape=(4, 5), dtype=float32, numpy=\n",
              "array([[0., 0., 0., 0., 1.],\n",
              "       [0., 0., 1., 0., 0.],\n",
              "       [0., 0., 1., 0., 0.],\n",
              "       [0., 0., 0., 1., 0.]], dtype=float32)>"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 41
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rWRSQ_AHVIqB",
        "colab_type": "text"
      },
      "source": [
        "Notice that the string categorical column was one-hot encoded into 5 columns, whereas the input dataframe has only 3 unique values. This is because the indexer reserves 2 extra tokens by default: the empty string `''` for masked values and `'[UNK]'` for out-of-vocabulary values. See the vocabulary:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3dNIEtKsTYTQ",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "93db990e-c93d-443b-cd6b-eae5adef47b3"
      },
      "source": [
        "indexer.get_vocabulary()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "['', '[UNK]', 'HIGH', 'MEDIUM', 'LOW']"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 42
        }
      ]
    },
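    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The effect of the two reserved slots can be reproduced with a plain-Python sketch of the lookup and one-hot steps. This is not the Keras implementation. It assumes index 0 is reserved for the empty mask token and index 1 for `'[UNK]'`, and the order of the remaining tokens is an assumption as well. The helpers `to_index` and `one_hot` and the input `'VERY_HIGH'` are hypothetical and are not part of Keras or of the dataset."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Assumed vocabulary: two reserved tokens followed by the observed values\n",
        "vocabulary = ['', '[UNK]', 'HIGH', 'MEDIUM', 'LOW']\n",
        "token_to_index = {token: i for i, token in enumerate(vocabulary)}\n",
        "\n",
        "def to_index(token):\n",
        "    # Strings outside the vocabulary fall back to the '[UNK]' slot\n",
        "    return token_to_index.get(token, token_to_index['[UNK]'])\n",
        "\n",
        "def one_hot(index, depth):\n",
        "    # Vector of zeros with a single 1.0 at the given index\n",
        "    return [1.0 if i == index else 0.0 for i in range(depth)]\n",
        "\n",
        "# 'VERY_HIGH' is not in the vocabulary, so it maps to the '[UNK]' column\n",
        "for token in ['LOW', 'HIGH', 'VERY_HIGH']:\n",
        "    print(token, one_hot(to_index(token), len(vocabulary)))"
      ],
      "execution_count": null,
      "outputs": []
    },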
    {
      "cell_type": "code",
      "metadata": {
        "id": "r83mpLv6VFnz",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}