{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Preprocessing structured data in TF 2.3.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "7TFM07NMV5bl", "colab_type": "text" }, "source": [ "# Preprocessing structured data in TF 2.3" ] }, { "cell_type": "markdown", "metadata": { "id": "BwF-oSahsDT1", "colab_type": "text" }, "source": [ "In TF 2.3, Keras adds new preprocessing layers for image, text and structured data. This notebook explores the new layers for structured data.\n", "\n", "For a complete example of how to use the new preprocessing layers for structured data, see the [Keras structured data classification example](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/)." ] }, { "cell_type": "code", "metadata": { "id": "4uXfSpzsHeoc", "colab_type": "code", "colab": {} }, "source": [ "#hide\n", "%%bash\n", "pip install -q tf-nightly" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "YLxfwMIVHmox", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "outputId": "6c3a7764-1d54-4015-e8cf-e59eadf871e2" }, "source": [ "#hide\n", "import pandas as pd\n", "import tensorflow as tf\n", "print('TF version: ', tf.__version__)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "TF version: 2.4.0-dev20200802\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "tb_Q4T8oIAAv", "colab_type": "text" }, "source": [ "## Structured data\n", "Create a small toy dataset to experiment with and to inspect the output of each preprocessing layer." 
] }, { "cell_type": "code", "metadata": { "id": "1Uv22PklKsAU", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 70 }, "outputId": "37d801b0-c5cd-4d5a-86a2-cce0416e42e9" }, "source": [ "xdf = pd.DataFrame({\n", " 'categorical_string': ['LOW', 'HIGH', 'HIGH', 'MEDIUM'],\n", " 'categorical_integer_1': [1, 0, 1, 0],\n", " 'categorical_integer_2': [1, 2, 3, 4],\n", " 'numerical_1': [2.3, 0.2, 1.9, 5.8],\n", " 'numerical_2': [16, 32, 8, 60]\n", "})\n", "ydf = pd.DataFrame({'target': [0, 0, 0, 1]})\n", "ds = tf.data.Dataset.from_tensor_slices((dict(xdf), ydf))\n", "for x, y in ds.take(1):\n", " print('X:', x)\n", " print('y:', y)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "X: {'categorical_string': , 'categorical_integer_1': , 'categorical_integer_2': , 'numerical_1': , 'numerical_2': }\n", "y: tf.Tensor([0], shape=(1,), dtype=int64)\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "kZvc-SXCH6h7", "colab_type": "code", "colab": {} }, "source": [ "from tensorflow.keras.layers.experimental.preprocessing import Normalization\n", "from tensorflow.keras.layers.experimental.preprocessing import CategoryEncoding\n", "from tensorflow.keras.layers.experimental.preprocessing import StringLookup" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cBvtZLCZVgse", "colab_type": "text" }, "source": [ "## Pre-processing Numerical columns" ] }, { "cell_type": "markdown", "metadata": { "id": "z44X_lm1M8x0", "colab_type": "text" }, "source": [ "Preprocessing helper function to encode **numerical** features, e.g. 0.1, 0.2, etc." 
] }, { "cell_type": "code", "metadata": { "id": "oQSJ91STM8IM", "colab_type": "code", "colab": {} }, "source": [ "def create_numerical_encoder(dataset, name):\n", " # Create a Normalization layer for our feature\n", " normalizer = Normalization()\n", "\n", " # Prepare a Dataset that only yields our feature\n", " feature_ds = dataset.map(lambda x, y: x[name])\n", " feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))\n", "\n", " # Learn the statistics of the data\n", " normalizer.adapt(feature_ds)\n", "\n", " return normalizer" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "YnWPc9QcOdrl", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 101 }, "outputId": "783d5934-2ac0-40d5-9c68-917aeb8e66f2" }, "source": [ "# Apply normalization to a numerical feature\n", "normalizer = create_numerical_encoder(ds, 'numerical_1')\n", "normalizer.apply(xdf['numerical_1'].values)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 25 } ] }, { "cell_type": "markdown", "metadata": { "id": "Vj47KP7AVlMw", "colab_type": "text" }, "source": [ "## Pre-processing Integer categorical columns" ] }, { "cell_type": "markdown", "metadata": { "id": "l41f7qUINiy0", "colab_type": "text" }, "source": [ "Preprocessing helper function to encode **integer categorical** features, e.g. 
1, 2, 3" ] }, { "cell_type": "code", "metadata": { "id": "Wf2UrNG0Ng96", "colab_type": "code", "colab": {} }, "source": [ "def create_integer_categorical_encoder(dataset, name):\n", " # Create a CategoryEncoding for our integer indices\n", " encoder = CategoryEncoding(output_mode=\"binary\")\n", "\n", " # Prepare a Dataset that only yields our feature\n", " feature_ds = dataset.map(lambda x, y: x[name])\n", " feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))\n", "\n", " # Learn the space of possible indices\n", " encoder.adapt(feature_ds)\n", "\n", " return encoder" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "pQp8Yrk9OUGq", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 101 }, "outputId": "52f13ccd-939a-453b-9dfd-a68d2fb4b0b1" }, "source": [ "# Apply one-hot encoding to an integer categorical feature\n", "encoder1 = create_integer_categorical_encoder(ds, 'categorical_integer_1')\n", "encoder1.apply(xdf['categorical_integer_1'].values)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 29 } ] }, { "cell_type": "code", "metadata": { "id": "vudMemikPHTH", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 101 }, "outputId": "4e60debb-43d2-4842-b87e-fa75a4aecc5c" }, "source": [ "# Apply one-hot encoding to an integer categorical feature\n", "encoder2 = create_integer_categorical_encoder(ds, 'categorical_integer_2')\n", "encoder2.apply(xdf['categorical_integer_2'].values)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 30 } ] }, { "cell_type": "markdown", "metadata": { "id": "iKUiomfSVoCs", "colab_type": "text" }, "source": [ "## Pre-processing String categorical columns" ] }, { "cell_type": "markdown", "metadata": { "id": 
"S-he2CIaNGk1", "colab_type": "text" }, "source": [ "Preprocessing helper function to encode **string categorical** features, e.g. LOW, HIGH, MEDIUM.\n", "\n", "This will applying the following to the input feature:\n", "1. Create a token to index lookup table\n", "2. Apply one-hot encoding to the tokens indices" ] }, { "cell_type": "code", "metadata": { "id": "7lOKw0EcICor", "colab_type": "code", "colab": {} }, "source": [ "def create_string_categorical_encoder(dataset, name):\n", " # Create a StringLookup layer which will turn strings into integer indices\n", " index = StringLookup()\n", "\n", " # Prepare a Dataset that only yields our feature\n", " feature_ds = dataset.map(lambda x, y: x[name])\n", " feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))\n", "\n", " # Learn the set of possible string values and assign them a fixed integer index\n", " index.adapt(feature_ds)\n", "\n", " # Create a CategoryEncoding for our integer indices\n", " encoder = CategoryEncoding(output_mode=\"binary\")\n", "\n", " # Prepare a dataset of indices\n", " feature_ds = feature_ds.map(index)\n", "\n", " # Learn the space of possible indices\n", " encoder.adapt(feature_ds)\n", "\n", " return index, encoder" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "1GKl0qpNQ9yz", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 101 }, "outputId": "011df6c2-f426-4bbf-a341-6818d714f6b3" }, "source": [ "# Apply one-hot encoding to an integer categorical feature\n", "indexer, encoder3 = create_string_categorical_encoder(ds, 'categorical_string')\n", "# Turn the string input into integer indices\n", "indices = indexer.apply(xdf['categorical_string'].values)\n", "# Apply one-hot encoding to our indices\n", "encoder3.apply(indices)" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 41 } ] }, { "cell_type": 
"markdown", "metadata": { "id": "rWRSQ_AHVIqB", "colab_type": "text" }, "source": [ "Notice that the string categorical column was hot encoded into 5 tokens whereas in the input dataframe there is only 3 unique values. This is because the indexer adds 2 more tokens. See the vocabulary:" ] }, { "cell_type": "code", "metadata": { "id": "3dNIEtKsTYTQ", "colab_type": "code", "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "outputId": "93db990e-c93d-443b-cd6b-eae5adef47b3" }, "source": [ "indexer.get_vocabulary()" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['', '[UNK]', 'cat2', 'cat3', 'cat1']" ] }, "metadata": { "tags": [] }, "execution_count": 42 } ] }, { "cell_type": "code", "metadata": { "id": "r83mpLv6VFnz", "colab_type": "code", "colab": {} }, "source": [ "" ], "execution_count": null, "outputs": [] } ] }