{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "DweYe9FcbMK_" }, "source": [ "##### Copyright 2019 The TensorFlow Authors.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "AVV2e0XKbJeX" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "sUtoed20cRJJ" }, "source": [ "# Load CSV data" ] }, { "cell_type": "markdown", "metadata": { "id": "1ap_W4aQcgNT" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "C-3Xbt0FfGfs" }, "source": [ "This tutorial provides examples of how to use CSV data with TensorFlow.\n", "\n", "There are two main parts to this:\n", "\n", "1. **Loading the data off disk**\n", "2. **Pre-processing it into a form suitable for training.**\n", "\n", "This tutorial focuses on the loading, and gives some quick examples of preprocessing. For a tutorial that focuses on the preprocessing aspect see the [preprocessing layers guide](https://www.tensorflow.org/guide/keras/preprocessing_layers#quick_recipes) and [tutorial](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers). \n" ] }, { "cell_type": "markdown", "metadata": { "id": "fgZ9gjmPfSnK" }, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "baYFZMW_bJHh" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Make numpy values easier to read.\n", "np.set_printoptions(precision=3, suppress=True)\n", "\n", "import tensorflow as tf\n", "from tensorflow.keras import layers\n", "from tensorflow.keras.layers.experimental import preprocessing" ] }, { "cell_type": "markdown", "metadata": { "id": "1ZhJYbJxHNGJ" }, "source": [ "## In memory data" ] }, { "cell_type": "markdown", "metadata": { "id": "ny5TEgcmHjVx" }, "source": [ "For any small CSV dataset the simplest way to train a TensorFlow model on it is to load it into memory as a pandas Dataframe or a NumPy array. \n" ] }, { "cell_type": "markdown", "metadata": { "id": "LgpBOuU8PGFf" }, "source": [ "A relatively simple example is the [abalone dataset](https://archive.ics.uci.edu/ml/datasets/abalone). \n", "\n", "* The dataset is small. \n", "* All the input features are all limited-range floating point values. \n", "\n", "Here is how to download the data into a [Pandas `DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IZVExo9DKoNz" }, "outputs": [], "source": [ "abalone_train = pd.read_csv(\n", " \"https://storage.googleapis.com/download.tensorflow.org/data/abalone_train.csv\",\n", " names=[\"Length\", \"Diameter\", \"Height\", \"Whole weight\", \"Shucked weight\",\n", " \"Viscera weight\", \"Shell weight\", \"Age\"])\n", "\n", "abalone_train.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "hP22mdyPQ1_t" }, "source": [ "The dataset contains a set of measurements of [abalone](https://en.wikipedia.org/wiki/Abalone), a type of sea snail. \n", "\n", "![an abalone shell](https://tensorflow.org/images/abalone_shell.jpg)\n", "\n", " [“Abalone shell”](https://www.flickr.com/photos/thenickster/16641048623/) (by [Nicki Dugan Pogue](https://www.flickr.com/photos/thenickster/), CC BY-SA 2.0)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "vlfGrk_9N-wf" }, "source": [ "The nominal task for this dataset is to predict the age from the other measurements, so separate the features and labels for training:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "udOnDJOxNi7p" }, "outputs": [], "source": [ "abalone_features = abalone_train.copy()\n", "abalone_labels = abalone_features.pop('Age')" ] }, { "cell_type": "markdown", "metadata": { "id": "seK9n71-UBfT" }, "source": [ "For this dataset you will treat all features identically. 
Pack the features into a single NumPy array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Dp3N5McbUMwb" }, "outputs": [], "source": [ "abalone_features = np.array(abalone_features)\n", "abalone_features" ] }, { "cell_type": "markdown", "metadata": { "id": "1C1yFOxLOdxh" }, "source": [ "Next make a regression model to predict the age. Since there is only a single input tensor, a `keras.Sequential` model is sufficient here." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "d8zzNrZqOmfB" }, "outputs": [], "source": [ "abalone_model = tf.keras.Sequential([\n", " layers.Dense(64),\n", " layers.Dense(1)\n", "])\n", "\n", "abalone_model.compile(loss = tf.losses.MeanSquaredError(),\n", " optimizer = tf.optimizers.Adam())" ] }, { "cell_type": "markdown", "metadata": { "id": "j6IWeP78O2wE" }, "source": [ "To train that model, pass the features and labels to `Model.fit`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uZdpCD92SN3Z" }, "outputs": [], "source": [ "abalone_model.fit(abalone_features, abalone_labels, epochs=10)" ] }, { "cell_type": "markdown", "metadata": { "id": "GapLOj1OOTQH" }, "source": [ "You have just seen the most basic way to train a model using CSV data. Next, you will learn how to apply preprocessing to normalize numeric columns." ] }, { "cell_type": "markdown", "metadata": { "id": "B87Rd1SOUv02" }, "source": [ "## Basic preprocessing" ] }, { "cell_type": "markdown", "metadata": { "id": "yCrB2Jd-U0Vt" }, "source": [ "It's good practice to normalize the inputs to your model. The `experimental.preprocessing` layers provide a convenient way to build this normalization into your model. \n", "\n", "The layer will precompute the mean and variance of each column, and use these to normalize the data.\n", "\n", "First you create the layer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "H2WQpDU5VRk7" }, "outputs": [], "source": [ "normalize = preprocessing.Normalization()" ] }, { "cell_type": "markdown", "metadata": { "id": "hGgEZE-7Vpt6" }, "source": [ "Then you use the `Normalization.adapt()` method to adapt the normalization layer to your data.\n", "\n", "Note: Only use your training data to `.adapt()` preprocessing layers. Do not use your validation or test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2WgOPIiOVpLg" }, "outputs": [], "source": [ "normalize.adapt(abalone_features)" ] }, { "cell_type": "markdown", "metadata": { "id": "rE6vh0byV7cE" }, "source": [ "Then use the normalization layer in your model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "quPcZ9dTWA9A" }, "outputs": [], "source": [ "norm_abalone_model = tf.keras.Sequential([\n", " normalize,\n", " layers.Dense(64),\n", " layers.Dense(1)\n", "])\n", "\n", "norm_abalone_model.compile(loss = tf.losses.MeanSquaredError(),\n", " optimizer = tf.optimizers.Adam())\n", "\n", "norm_abalone_model.fit(abalone_features, abalone_labels, epochs=10)" ] }, { "cell_type": "markdown", "metadata": { "id": "Wuqj601Qw0Ml" }, "source": [ "## Mixed data types\n", "\n", "The \"Titanic\" dataset contains information about the passengers on the Titanic. The nominal task on this dataset is to predict who survived.
\n", "\n", "![The Titanic](images/csv/Titanic.jpg)\n", "\n", "Image [from Wikimedia](https://commons.wikimedia.org/wiki/File:RMS_Titanic_3.jpg)\n", "\n", "The raw data can easily be loaded as a Pandas `DataFrame`, but is not immediately usable as input to a TensorFlow model. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GS-dBMpuYMnz" }, "outputs": [], "source": [ "titanic = pd.read_csv(\"https://storage.googleapis.com/tf-datasets/titanic/train.csv\")\n", "titanic.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D8rCGIK1ZzKx" }, "outputs": [], "source": [ "titanic_features = titanic.copy()\n", "titanic_labels = titanic_features.pop('survived')" ] }, { "cell_type": "markdown", "metadata": { "id": "urHOwpCDYtcI" }, "source": [ "Because of the different data types and ranges, you can't simply stack the features into a NumPy array and pass it to a `keras.Sequential` model. Each column needs to be handled individually. \n", "\n", "As one option, you could preprocess your data offline (using any tool you like) to convert categorical columns to numeric columns, then pass the processed output to your TensorFlow model. The disadvantage to that approach is that if you save and export your model the preprocessing is not saved with it. The `experimental.preprocessing` layers avoid this problem because they're part of the model.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Bta4Sx0Zau5v" }, "source": [ "In this example, you'll build a model that implements the preprocessing logic using the [Keras functional API](https://www.tensorflow.org/guide/keras/functional.ipynb). You could also do it by [subclassing](https://www.tensorflow.org/guide/keras/custom_layers_and_models).\n", "\n", "The functional API operates on \"symbolic\" tensors. Normal \"eager\" tensors have a value. In contrast, these \"symbolic\" tensors do not. Instead they keep track of which operations are run on them, and build a representation of the calculation that you can run later. Here's a quick example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "730F16_97D-3" }, "outputs": [], "source": [ "# Create a symbolic input\n", "input = tf.keras.Input(shape=(), dtype=tf.float32)\n", "\n", "# Do a calculation using it\n", "result = 2*input + 1\n", "\n", "# The result doesn't have a value\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RtcNXWB18kMJ" }, "outputs": [], "source": [ "calc = tf.keras.Model(inputs=input, outputs=result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fUGQOUqZ8sa-" }, "outputs": [], "source": [ "print(calc(1).numpy())\n", "print(calc(2).numpy())" ] }, { "cell_type": "markdown", "metadata": { "id": "rNS9lT7f6_U2" }, "source": [ "To build the preprocessing model, start by building a set of symbolic `keras.Input` objects, matching the names and data-types of the CSV columns."
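, "\n", "\n", "(As an optional aside, the cell right below just lists each column's pandas dtype, which is what the `keras.Input` loop that follows keys off of. It isn't required for anything later.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional check: see which columns pandas parsed as numbers and which as object (strings).\n", "titanic_features.dtypes"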
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5WODe_1da3yw" }, "outputs": [], "source": [ "inputs = {}\n", "\n", "for name, column in titanic_features.items():\n", " dtype = column.dtype\n", " if dtype == object:\n", " dtype = tf.string\n", " else:\n", " dtype = tf.float32\n", "\n", " inputs[name] = tf.keras.Input(shape=(1,), name=name, dtype=dtype)\n", "\n", "inputs" ] }, { "cell_type": "markdown", "metadata": { "id": "aaheJFmymq8l" }, "source": [ "The first step in your preprocessing logic is to concatenate the numeric inputs together, and run them through a normalization layer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wPRC_E6rkp8D" }, "outputs": [], "source": [ "numeric_inputs = {name:input for name,input in inputs.items()\n", " if input.dtype==tf.float32}\n", "\n", "x = layers.Concatenate()(list(numeric_inputs.values()))\n", "norm = preprocessing.Normalization()\n", "norm.adapt(np.array(titanic[numeric_inputs.keys()]))\n", "all_numeric_inputs = norm(x)\n", "\n", "all_numeric_inputs" ] }, { "cell_type": "markdown", "metadata": { "id": "-JoR45Uj712l" }, "source": [ "Collect all the symbolic preprocessing results, to concatenate them later." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "M7jIJw5XntdN" }, "outputs": [], "source": [ "preprocessed_inputs = [all_numeric_inputs]" ] }, { "cell_type": "markdown", "metadata": { "id": "r0Hryylyosfm" }, "source": [ "For the string inputs use the `preprocessing.StringLookup` function to map from strings to integer indices in a vocabulary. Next, use `preprocessing.CategoryEncoding` to convert the indexes into `float32` data appropriate for the model. \n", "\n", "The default settings for the `preprocessing.CategoryEncoding` layer create a one-hot vector for each input. A `layers.Embedding` would also work. See the [preprocessing layers guide](https://www.tensorflow.org/guide/keras/preprocessing_layers#quick_recipes) and [tutorial](../structured_data/preprocessing_layers.ipynb) for more on this topic." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "79fi1Cgan2YV" }, "outputs": [], "source": [ "for name, input in inputs.items():\n", " if input.dtype == tf.float32:\n", " continue\n", " \n", " lookup = preprocessing.StringLookup(vocabulary=np.unique(titanic_features[name]))\n", " one_hot = preprocessing.CategoryEncoding(max_tokens=lookup.vocab_size())\n", "\n", " x = lookup(input)\n", " x = one_hot(x)\n", " preprocessed_inputs.append(x)" ] }, { "cell_type": "markdown", "metadata": { "id": "Wnhv0T7itnc7" }, "source": [ "With the collection of `inputs` and `processed_inputs`, you can concatenate all the preprocessed inputs together, and build a model that handles the preprocessing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XJRzUTe8ukXc" }, "outputs": [], "source": [ "preprocessed_inputs_cat = layers.Concatenate()(preprocessed_inputs)\n", "\n", "titanic_preprocessing = tf.keras.Model(inputs, preprocessed_inputs_cat)\n", "\n", "tf.keras.utils.plot_model(model = titanic_preprocessing , rankdir=\"LR\", dpi=72, show_shapes=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "PNHxrNW8vdda" }, "source": [ "This `model` just contains the input preprocessing. You can run it to see what it does to your data. Keras models don't automatically convert Pandas `DataFrames` because it's not clear if it should be converted to one tensor or to a dictionary of tensors. 
So convert it to a dictionary of tensors:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5YjdYyMEacwQ" }, "outputs": [], "source": [ "titanic_features_dict = {name: np.array(value) \n", " for name, value in titanic_features.items()}" ] }, { "cell_type": "markdown", "metadata": { "id": "0nKJYoPByada" }, "source": [ "Slice out the first training example and pass it to this preprocessing model; you will see the numeric features and string one-hots all concatenated together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SjnmU8PSv8T3" }, "outputs": [], "source": [ "features_dict = {name:values[:1] for name, values in titanic_features_dict.items()}\n", "titanic_preprocessing(features_dict)" ] }, { "cell_type": "markdown", "metadata": { "id": "qkBf4LvmzMDp" }, "source": [ "Now build the model on top of this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "coIPtGaCzUV7" }, "outputs": [], "source": [ "def titanic_model(preprocessing_head, inputs):\n", " body = tf.keras.Sequential([\n", " layers.Dense(64),\n", " layers.Dense(1)\n", " ])\n", "\n", " preprocessed_inputs = preprocessing_head(inputs)\n", " result = body(preprocessed_inputs)\n", " model = tf.keras.Model(inputs, result)\n", "\n", " model.compile(loss=tf.losses.BinaryCrossentropy(from_logits=True),\n", " optimizer=tf.optimizers.Adam())\n", " return model\n", "\n", "titanic_model = titanic_model(titanic_preprocessing, inputs)" ] }, { "cell_type": "markdown", "metadata": { "id": "LK5uBQQF2KbZ" }, "source": [ "When you train the model, pass the dictionary of features as `x`, and the label as `y`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "D1gVfwJ61ejz" }, "outputs": [], "source": [ "titanic_model.fit(x=titanic_features_dict, y=titanic_labels, epochs=10)" ] }, { "cell_type": "markdown", "metadata": { "id": "LxgJarZk3bfH" }, "source": [ "Since the preprocessing is part of the model, you can save the model and reload it somewhere else and get identical results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ay-8ymNA2ZCh" }, "outputs": [], "source": [ "titanic_model.save('test')\n", "reloaded = tf.keras.models.load_model('test')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Qm6jMTpD20lK" }, "outputs": [], "source": [ "features_dict = {name:values[:1] for name, values in titanic_features_dict.items()}\n", "\n", "before = titanic_model(features_dict)\n", "after = reloaded(features_dict)\n", "assert (before-after)<1e-3\n", "print(before)\n", "print(after)" ] }, { "cell_type": "markdown", "metadata": { "id": "7VsPlxIRZpXf" }, "source": [ "## Using tf.data\n" ] }, { "cell_type": "markdown", "metadata": { "id": "NyVDCwGzR5HW" }, "source": [ "In the previous section you relied on the model's built-in data shuffling and batching while training the model. \n", "\n", "If you need more control over the input data pipeline or need to use data that doesn't easily fit into memory, use `tf.data`. \n", "\n", "For more examples see the [tf.data guide](../../guide/data.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "id": "gP5Y1jM2Sor0" }, "source": [ "### On in memory data\n", "\n", "As a first example of applying `tf.data` to CSV data, consider the following code to manually slice up the dictionary of features from the previous section.\n", "
For each index, it takes that index for each feature:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "i8wE-MVuVu7_" }, "outputs": [], "source": [ "import itertools\n", "\n", "def slices(features):\n", " for i in itertools.count():\n", " # For each feature take index `i`\n", " example = {name:values[i] for name, values in features.items()}\n", " yield example" ] }, { "cell_type": "markdown", "metadata": { "id": "cQ3RTbS9YEal" }, "source": [ "Run this and print the first example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Wwq8XK88WwFk" }, "outputs": [], "source": [ "for example in slices(titanic_features_dict):\n", " for name, value in example.items():\n", " print(f\"{name:19s}: {value}\")\n", " break" ] }, { "cell_type": "markdown", "metadata": { "id": "vvp8Dct6YOIE" }, "source": [ "The most basic in-memory `tf.data.Dataset` loader is the `Dataset.from_tensor_slices` constructor. This returns a `tf.data.Dataset` that implements a generalized version of the above `slices` function, in TensorFlow. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2gEJthslYxeV" }, "outputs": [], "source": [ "features_ds = tf.data.Dataset.from_tensor_slices(titanic_features_dict)" ] }, { "cell_type": "markdown", "metadata": { "id": "-ZC0rTpMZMZK" }, "source": [ "You can iterate over a `tf.data.Dataset` like any other Python iterable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gOHbiefaY4ag" }, "outputs": [], "source": [ "for example in features_ds:\n", " for name, value in example.items():\n", " print(f\"{name:19s}: {value}\")\n", " break" ] }, { "cell_type": "markdown", "metadata": { "id": "uwcFoVJWZY5F" }, "source": [ "The `from_tensor_slices` function can handle any structure of nested dictionaries or tuples. The following code makes a dataset of `(features_dict, labels)` pairs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xIHGBy76Zcrx" }, "outputs": [], "source": [ "titanic_ds = tf.data.Dataset.from_tensor_slices((titanic_features_dict, titanic_labels))" ] }, { "cell_type": "markdown", "metadata": { "id": "gQwxitt8c2GK" }, "source": [ "To train a model using this `Dataset`, you'll need to at least `shuffle` and `batch` the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SbJcbldhddeC" }, "outputs": [], "source": [ "titanic_batches = titanic_ds.shuffle(len(titanic_labels)).batch(32)" ] }, { "cell_type": "markdown", "metadata": { "id": "-4FRqhRFuoJx" }, "source": [ "Instead of passing `features` and `labels` to `Model.fit`, you pass the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8yXkNPumdBtB" }, "outputs": [], "source": [ "titanic_model.fit(titanic_batches, epochs=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "qXuibiv9exT7" }, "source": [ "### From a single file\n", "\n", "So far this tutorial has worked with in-memory data. `tf.data` is a highly scalable toolkit for building data pipelines, and provides a few functions for loading CSV files. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ncf5t6tgL5ZI" }, "outputs": [], "source": [ "titanic_file_path = tf.keras.utils.get_file(\"train.csv\", \"https://storage.googleapis.com/tf-datasets/titanic/train.csv\")" ] }, { "cell_type": "markdown", "metadata": { "id": "t4N-plO4tDXd" }, "source": [ "Now read the CSV data from the file and create a `tf.data.Dataset`.
\n", "\n", "(For the full documentation, see `tf.data.experimental.make_csv_dataset`)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yIbUscB9sqha" }, "outputs": [], "source": [ "titanic_csv_ds = tf.data.experimental.make_csv_dataset(\n", " titanic_file_path,\n", " batch_size=5, # Artificially small to make examples easier to show.\n", " label_name='survived',\n", " num_epochs=1,\n", " ignore_errors=True,)" ] }, { "cell_type": "markdown", "metadata": { "id": "Sf3v3BKgy4AG" }, "source": [ "This function includes many convenient features so the data is easy to work with. This includes:\n", "\n", "* Using the column headers as dictionary keys.\n", "* Automatically determining the type of each column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v4oMO9MIxgTG" }, "outputs": [], "source": [ "for batch, label in titanic_csv_ds.take(1):\n", " for key, value in batch.items():\n", " print(f\"{key:20s}: {value}\")\n", " print()\n", " print(f\"{'label':20s}: {label}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "k-TgA6o2Ja6U" }, "source": [ "Note: if you run the above cell twice it will produce different results. The default settings for `make_csv_dataset` include `shuffle_buffer_size=1000`, which is more than sufficient for this small dataset, but may not be for a real-world dataset." ] }, { "cell_type": "markdown", "metadata": { "id": "d6uviU_KCCWD" }, "source": [ "It can also decompress the data on the fly. Here's a gzipped CSV file containing the [metro interstate traffic dataset](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume)\n", "\n", "![A traffic jam.](images/csv/traffic.jpg)\n", "\n", "Image [from Wikimedia](https://commons.wikimedia.org/wiki/File:Trafficjam.jpg)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kT7oZI2E46Q8" }, "outputs": [], "source": [ "traffic_volume_csv_gz = tf.keras.utils.get_file(\n", " 'Metro_Interstate_Traffic_Volume.csv.gz', \n", " \"https://archive.ics.uci.edu/ml/machine-learning-databases/00492/Metro_Interstate_Traffic_Volume.csv.gz\",\n", " cache_dir='.', cache_subdir='traffic')" ] }, { "cell_type": "markdown", "metadata": { "id": "F-IOsFHbCw0i" }, "source": [ "Set the `compression_type` argument to read directly from the compressed file: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ar0MPEVJ5NeA" }, "outputs": [], "source": [ "traffic_volume_csv_gz_ds = tf.data.experimental.make_csv_dataset(\n", " traffic_volume_csv_gz,\n", " batch_size=256,\n", " label_name='traffic_volume',\n", " num_epochs=1,\n", " compression_type=\"GZIP\")\n", "\n", "for batch, label in traffic_volume_csv_gz_ds.take(1):\n", " for key, value in batch.items():\n", " print(f\"{key:20s}: {value[:5]}\")\n", " print()\n", " print(f\"{'label':20s}: {label[:5]}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "p12Y6tGq8D6M" }, "source": [ "Note: If you need to parse those date-time strings in the `tf.data` pipeline you can use `tfa.text.parse_time`." ] }, { "cell_type": "markdown", "metadata": { "id": "EtrAXzYGP3l0" }, "source": [ "### Caching" ] }, { "cell_type": "markdown", "metadata": { "id": "fN2dL_LRP83r" }, "source": [ "There is some overhead to parsing the csv data. For small models this can be the bottleneck in training.\n", "\n", "Depending on your use case it may be a good idea to use `Dataset.cache` or `data.experimental.snapshot` so that the csv data is only parsed on the first epoch. 
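As a minimal sketch (the timed cells below do the real comparison, and the snapshot path here is just an example), the two patterns look like:\n", "\n", "```python\n", "# Parse once, then reuse: cache in memory, or snapshot to disk.\n", "cached = traffic_volume_csv_gz_ds.cache()\n", "\n", "# 'traffic.tfsnap' is just an example directory for the snapshot files.\n", "snapshotted = traffic_volume_csv_gz_ds.apply(\n", "    tf.data.experimental.snapshot('traffic.tfsnap'))\n", "```\n", "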
\n", "\n", "The main difference between the `cache` and `snapshot` methods is that `cache` files can only be used by the TensorFlow process that created them, but `snapshot` files can be read by other processes.\n", "\n", "For example, iterating over the `traffic_volume_csv_gz_ds` 20 times takes ~15 seconds without caching, or ~2s with caching." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Qk38Sw4MO4eh" }, "outputs": [], "source": [ "%%time\n", "for i, (batch, label) in enumerate(traffic_volume_csv_gz_ds.repeat(20)):\n", " if i % 40 == 0:\n", " print('.', end='')\n", "print()" ] }, { "cell_type": "markdown", "metadata": { "id": "pN3HtDONh5TX" }, "source": [ "Note: `Dataset.cache` stores the data from the first epoch and replays it in order. So using `.cache` disables any shuffles earlier in the pipeline. Below, the `.shuffle` is added back in after `.cache`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "r5Jj72MrPbnh" }, "outputs": [], "source": [ "%%time\n", "caching = traffic_volume_csv_gz_ds.cache().shuffle(1000)\n", "\n", "for i, (batch, label) in enumerate(caching.shuffle(1000).repeat(20)):\n", " if i % 40 == 0:\n", " print('.', end='')\n", "print()" ] }, { "cell_type": "markdown", "metadata": { "id": "wN7uUBjmgNZ9" }, "source": [ "Note: `snapshot` files are meant for *temporary* storage of a dataset while in use. This is *not* a format for long-term storage. The file format is considered an internal detail, and not guaranteed between TensorFlow versions. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PHGD1E8ktUvW" }, "outputs": [], "source": [ "%%time\n", "snapshot = tf.data.experimental.snapshot('titanic.tfsnap')\n", "snapshotting = traffic_volume_csv_gz_ds.apply(snapshot).shuffle(1000)\n", "\n", "for i, (batch, label) in enumerate(snapshotting.shuffle(1000).repeat(20)):\n", " if i % 40 == 0:\n", " print('.', end='')\n", "print()" ] }, { "cell_type": "markdown", "metadata": { "id": "fUSSegnMCGRz" }, "source": [ "If your data loading is slowed by loading csv files, and `cache` and `snapshot` are insufficient for your use case, consider re-encoding your data into a more streamlined format." ] }, { "cell_type": "markdown", "metadata": { "id": "M0iGXv9pC5kr" }, "source": [ "### Multiple files" ] }, { "cell_type": "markdown", "metadata": { "id": "9FFzHQrCDH4w" }, "source": [ "All the examples so far in this section could easily be done without `tf.data`.\n", "
One place where `tf.data` can really simplify things is when dealing with collections of files.\n", "\n", "For example, the [character font images](https://archive.ics.uci.edu/ml/datasets/Character+Font+Images) dataset is distributed as a collection of csv files, one per font.\n", "\n", "![Fonts](images/csv/fonts.jpg)\n", "\n", "Image by Willi Heidelbach from Pixabay\n", "\n", "Download the dataset, and have a look at the files inside:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RmVknMdJh5ks" }, "outputs": [], "source": [ "fonts_zip = tf.keras.utils.get_file(\n", " 'fonts.zip', \"https://archive.ics.uci.edu/ml/machine-learning-databases/00417/fonts.zip\",\n", " cache_dir='.', cache_subdir='fonts',\n", " extract=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xsDlMCnyi55e" }, "outputs": [], "source": [ "import pathlib\n", "font_csvs = sorted(str(p) for p in pathlib.Path('fonts').glob(\"*.csv\"))\n", "\n", "font_csvs[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lRAEJx9ROAGl" }, "outputs": [], "source": [ "len(font_csvs)" ] }, { "cell_type": "markdown", "metadata": { "id": "19Udrw9iG-FS" }, "source": [ "When dealing with a bunch of files you can pass a glob-style `file_pattern` to the `experimental.make_csv_dataset` function. The order of the files is shuffled each iteration.\n", "\n", "Use the `num_parallel_reads` argument to set how many files are read in parallel and interleaved together." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6TSUNdT6iG58" }, "outputs": [], "source": [ "fonts_ds = tf.data.experimental.make_csv_dataset(\n", " file_pattern = \"fonts/*.csv\",\n", " batch_size=10, num_epochs=1,\n", " num_parallel_reads=20,\n", " shuffle_buffer_size=10000)" ] }, { "cell_type": "markdown", "metadata": { "id": "XMoexinLHYFa" }, "source": [ "These csv files have the images flattened out into a single row. The column names are formatted `r{row}c{column}`. Here's the first batch:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RmFvBWxxi3pq" }, "outputs": [], "source": [ "for features in fonts_ds.take(1):\n", " for i, (name, value) in enumerate(features.items()):\n", " if i>15:\n", " break\n", " print(f\"{name:20s}: {value}\")\n", "print('...')\n", "print(f\"[total: {len(features)} features]\")" ] }, { "cell_type": "markdown", "metadata": { "id": "xrC3sKdeOhb5" }, "source": [ "#### Optional: Packing fields\n", "\n", "You probably don't want to work with each pixel in separate columns like this. Before trying to use this dataset be sure to pack the pixels into an image-tensor. 
\n", "\n", "Here is code that parses the column names to build images for each example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hct5EMEWNyfH" }, "outputs": [], "source": [ "import re\n", "\n", "def make_images(features):\n", " image = [None]*400\n", " new_feats = {}\n", "\n", " for name, value in features.items():\n", " match = re.match('r(\\d+)c(\\d+)', name)\n", " if match:\n", " image[int(match.group(1))*20+int(match.group(2))] = value\n", " else:\n", " new_feats[name] = value\n", "\n", " image = tf.stack(image, axis=0)\n", " image = tf.reshape(image, [20, 20, -1])\n", " new_feats['image'] = image\n", "\n", " return new_feats" ] }, { "cell_type": "markdown", "metadata": { "id": "61qy8utAwARP" }, "source": [ "Apply that function to each batch in the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DJnnfIW9baE4" }, "outputs": [], "source": [ "fonts_image_ds = fonts_ds.map(make_images)\n", "\n", "for features in fonts_image_ds.take(1):\n", " break" ] }, { "cell_type": "markdown", "metadata": { "id": "_ThqrthGwHSm" }, "source": [ "Plot the resulting images:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "I5dcey31T_tk" }, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "\n", "plt.figure(figsize=(6,6), dpi=120)\n", "\n", "for n in range(9):\n", " plt.subplot(3,3,n+1)\n", " plt.imshow(features['image'][..., n])\n", " plt.title(chr(features['m_label'][n]))\n", " plt.axis('off')" ] }, { "cell_type": "markdown", "metadata": { "id": "7-nNR0Nncdd1" }, "source": [ "## Lower level functions" ] }, { "cell_type": "markdown", "metadata": { "id": "3jiGZeUijJNd" }, "source": [ "So far this tutorial has focused on the highest-level utilities for reading csv data. There are two other APIs that may be helpful for advanced users if your use-case doesn't fit the basic patterns.\n", "\n", "* `tf.io.decode_csv` - a function for parsing lines of text into a list of CSV column tensors.\n", "* `tf.data.experimental.CsvDataset` - a lower level csv dataset constructor.\n", "\n", "This section recreates functionality provided by `make_csv_dataset`, to demonstrate how this lower level functionality can be used.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "LL_ixywomOHW" }, "source": [ "### `tf.io.decode_csv`\n", "\n", "This function decodes a string, or list of strings, into a list of columns.\n", "\n", "Unlike `make_csv_dataset`, this function does not try to guess column data-types.\n", "
You specify the column types by providing a list of `record_defaults` containing a value of the correct type for each column.\n", "\n", "To read the Titanic data **as strings** using `decode_csv`, you would say: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "m1D2C-qdlqeW" }, "outputs": [], "source": [ "text = pathlib.Path(titanic_file_path).read_text()\n", "lines = text.split('\\n')[1:-1]\n", "\n", "all_strings = [str()]*10\n", "all_strings" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9W4UeJYyHPx5" }, "outputs": [], "source": [ "features = tf.io.decode_csv(lines, record_defaults=all_strings) \n", "\n", "for f in features:\n", " print(f\"type: {f.dtype.name}, shape: {f.shape}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "j8TaHSQFoQL4" }, "source": [ "To parse them with their actual types, create a list of `record_defaults` of the corresponding types: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rzUjR59yoUe1" }, "outputs": [], "source": [ "print(lines[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7sPTunxwoeWU" }, "outputs": [], "source": [ "titanic_types = [int(), str(), float(), int(), int(), float(), str(), str(), str(), str()]\n", "titanic_types" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n3NlViCzoB7F" }, "outputs": [], "source": [ "features = tf.io.decode_csv(lines, record_defaults=titanic_types) \n", "\n", "for f in features:\n", " print(f\"type: {f.dtype.name}, shape: {f.shape}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "m-LkTUTnpn2P" }, "source": [ "Note: it is more efficient to call `decode_csv` on large batches of lines than on individual lines of csv text." ] }, { "cell_type": "markdown", "metadata": { "id": "Yp1UItJmqGqw" }, "source": [ "### `tf.data.experimental.CsvDataset`\n", "\n", "The `tf.data.experimental.CsvDataset` class provides a minimal CSV `Dataset` interface without the convenience features of the `make_csv_dataset` function: column header parsing, column type-inference, automatic shuffling, file interleaving.\n", "\n", "This constructor uses `record_defaults` the same way as `io.decode_csv`:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9OzZLp3krP-t" }, "outputs": [], "source": [ "simple_titanic = tf.data.experimental.CsvDataset(titanic_file_path, record_defaults=titanic_types, header=True)\n", "\n", "for example in simple_titanic.take(1):\n", " print([e.numpy() for e in example])" ] }, { "cell_type": "markdown", "metadata": { "id": "_HBmfI-Ks7dw" }, "source": [ "The above code is basically equivalent to:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "E5O5d69Yq7gG" }, "outputs": [], "source": [ "def decode_titanic_line(line):\n", " return tf.io.decode_csv(line, titanic_types)\n", "\n", "manual_titanic = (\n", " # Load the lines of text\n", " tf.data.TextLineDataset(titanic_file_path)\n", " # Skip the header row.\n", " .skip(1)\n", " # Decode the line.\n", " .map(decode_titanic_line)\n", ")\n", "\n", "for example in manual_titanic.take(1):\n", " print([e.numpy() for e in example])" ] }, { "cell_type": "markdown", "metadata": { "id": "5R3ralsnt2AC" }, "source": [ "#### Multiple files\n", "\n", "To parse the fonts dataset using `experimental.CsvDataset`, you first need to determine the column types for the `record_defaults`.\n", "
Start by inspecting the first row of one file: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3tlFOTjCvAI5" }, "outputs": [], "source": [ "font_line = pathlib.Path(font_csvs[0]).read_text().splitlines()[1]\n", "print(font_line)" ] }, { "cell_type": "markdown", "metadata": { "id": "etyGu8K_ySRz" }, "source": [ "Only the first two fields are strings, the rest are ints or floats, and you can get the total number of features by counting the commas:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "crgZZn0BzkSB" }, "outputs": [], "source": [ "num_font_features = font_line.count(',')+1\n", "font_column_types = [str(), str()] + [float()]*(num_font_features-2)" ] }, { "cell_type": "markdown", "metadata": { "id": "YeK2Pw540RNj" }, "source": [ "The `CsvDataset` constructor can take a list of input files, but reads them sequentially. The first file in the list of CSVs is `AGENCY.csv`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_SvL5Uvl0r0N" }, "outputs": [], "source": [ "font_csvs[0]" ] }, { "cell_type": "markdown", "metadata": { "id": "EfAX3G8Xywy6" }, "source": [ "So when you pass the list of files to `CsvDataset`, the records from `AGENCY.csv` are read first:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Gtr1E66VmBqj" }, "outputs": [], "source": [ "simple_font_ds = tf.data.experimental.CsvDataset(\n", " font_csvs, \n", " record_defaults=font_column_types, \n", " header=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k750Mgq4yt_o" }, "outputs": [], "source": [ "for row in simple_font_ds.take(10):\n", " print(row[0].numpy())" ] }, { "cell_type": "markdown", "metadata": { "id": "NiqWKQV21FrE" }, "source": [ "To interleave multiple files, use `Dataset.interleave`.\n", "\n", "Here's an initial dataset that contains the csv file names: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t9dS3SNb23W8" }, "outputs": [], "source": [ "font_files = tf.data.Dataset.list_files(\"fonts/*.csv\")" ] }, { "cell_type": "markdown", "metadata": { "id": "TNiLHMXpzHy5" }, "source": [ "This shuffles the file names each epoch:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zNd-TYyNzIgg" }, "outputs": [], "source": [ "print('Epoch 1:')\n", "for f in list(font_files)[:5]:\n", " print(\" \", f.numpy())\n", "print(' ...')\n", "print()\n", "\n", "print('Epoch 2:')\n", "for f in list(font_files)[:5]:\n", " print(\" \", f.numpy())\n", "print(' ...')" ] }, { "cell_type": "markdown", "metadata": { "id": "B0QB1PtU3WAN" }, "source": [ "The `interleave` method takes a `map_func` that creates a child-`Dataset` for each element of the parent-`Dataset`. \n", "\n", "Here, you want to create a `CsvDataset` from each element of the dataset of files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QWp4rH0Q4uPh" }, "outputs": [], "source": [ "def make_font_csv_ds(path):\n", " return tf.data.experimental.CsvDataset(\n", " path, \n", " record_defaults=font_column_types, \n", " header=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "VxRGdLMB5nRF" }, "source": [ "The `Dataset` returned by `interleave` returns elements by cycling over a number of the child-`Dataset`s.\n", "
Note, below, how the dataset cycles over `cycle_length=3` font files:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OePMNF_x1_Cc" }, "outputs": [], "source": [ "font_rows = font_files.interleave(make_font_csv_ds,\n", " cycle_length=3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UORIGWLy54-E" }, "outputs": [], "source": [ "fonts_dict = {'font_name':[], 'character':[]}\n", "\n", "for row in font_rows.take(10):\n", " fonts_dict['font_name'].append(row[0].numpy().decode())\n", " fonts_dict['character'].append(chr(row[2].numpy()))\n", "\n", "pd.DataFrame(fonts_dict)" ] }, { "cell_type": "markdown", "metadata": { "id": "mkKZa_HX8zAm" }, "source": [ "#### Performance\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8BtGHraUApdJ" }, "source": [ "Earlier, it was noted that `io.decode_csv` is more efficient when run on a batch of strings.\n", "\n", "It is possible to take advantage of this fact, when using large batch sizes, to improve CSV loading performance (but try [caching](#caching) first)." ] }, { "cell_type": "markdown", "metadata": { "id": "d35zWMH7MDL1" }, "source": [ "With the built-in loader, 20 batches of 2048 examples take about 17s. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ieUVAPryjpJS" }, "outputs": [], "source": [ "BATCH_SIZE=2048\n", "fonts_ds = tf.data.experimental.make_csv_dataset(\n", " file_pattern = \"fonts/*.csv\",\n", " batch_size=BATCH_SIZE, num_epochs=1,\n", " num_parallel_reads=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MUC2KW4LkQIz" }, "outputs": [], "source": [ "%%time\n", "for i,batch in enumerate(fonts_ds.take(20)):\n", " print('.',end='')\n", "\n", "print()" ] }, { "cell_type": "markdown", "metadata": { "id": "5lhnh6rZEDS2" }, "source": [ "Passing **batches of text lines** to `decode_csv` runs faster, in about 5s:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4XbPZV1okVF9" }, "outputs": [], "source": [ "fonts_files = tf.data.Dataset.list_files(\"fonts/*.csv\")\n", "fonts_lines = fonts_files.interleave(\n", " lambda fname:tf.data.TextLineDataset(fname).skip(1), \n", " cycle_length=100).batch(BATCH_SIZE)\n", "\n", "fonts_fast = fonts_lines.map(lambda x: tf.io.decode_csv(x, record_defaults=font_column_types))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "te9C2km-qO8W" }, "outputs": [], "source": [ "%%time\n", "for i,batch in enumerate(fonts_fast.take(20)):\n", " print('.',end='')\n", "\n", "print()" ] }, { "cell_type": "markdown", "metadata": { "id": "aebC1plsMeOi" }, "source": [ "For another example of increasing csv performance by using large batches, see the [overfit and underfit tutorial](../keras/overfit_and_underfit.ipynb).\n", "\n", "This sort of approach may work, but consider other options like `cache` and `snapshot`, or re-encoding your data into a more streamlined format." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "csv.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }