{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "zwBCE43Cv3PH" }, "source": [ "##### Copyright 2019 The TensorFlow Authors.\n", "\n", "Licensed under the Apache License, Version 2.0 (the \"License\");" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "fOad0I2cv569" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "YQB7yiF6v9GR" }, "source": [ "# Load a pandas.DataFrame" ] }, { "cell_type": "markdown", "metadata": { "id": "Oqa952X4wQKK" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "UmyEaf4Awl2v" }, "source": [ "This tutorial provides an example of how to load pandas DataFrames into a `tf.data.Dataset`.\n", "\n", "This tutorial uses a small [dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) provided by the Cleveland Clinic Foundation for Heart Disease. There are several hundred rows in the CSV. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task." ] }, { "cell_type": "markdown", "metadata": { "id": "iiyC7HkqxlUD" }, "source": [ "## Read data using pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5IoRbCA2n0_V" }, "outputs": [], "source": [ "import pandas as pd\n", "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": { "id": "-2kBGy_pxn47" }, "source": [ "Download the CSV file containing the heart dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VS4w2LePn9g3" }, "outputs": [], "source": [ "# Download the file (or reuse a cached copy) and return its local path.\n", "csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "6BXRPD2-xtQ1" }, "source": [ "Read the CSV file using pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UEfJ8TcMpe-2" }, "outputs": [], "source": [ "df = pd.read_csv(csv_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8FkK6QIRpjd4" }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_MOAKz654CT5" }, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": { "id": "ww4lRDCS3qPh" }, "source": [ "Convert the `thal` column, which is an `object` in the dataframe, to a discrete numerical value." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LmCl5R5C2IKo" }, "outputs": [], "source": [ "# Replace each category in `thal` with its integer code.\n", "df['thal'] = pd.Categorical(df['thal'])\n", "df['thal'] = df.thal.cat.codes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s4XA1SNW2QyI" }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "WWRhH6r4xxQu" }, "source": [ "## Load data using `tf.data.Dataset`" ] }, { "cell_type": "markdown", "metadata": { "id": "GuqmVVH_yApQ" }, "source": [ "Use `tf.data.Dataset.from_tensor_slices` to read the values from a pandas dataframe.\n", "\n", "One of the advantages of using `tf.data.Dataset` is that it allows you to write simple, highly efficient data pipelines. Read the [loading data guide](https://www.tensorflow.org/guide/data) to find out more." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2wwhILm1ycSp" }, "outputs": [], "source": [ "# Remove the label column from the features and keep it separately.\n", "target = df.pop('target')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W6Yc-D3aqyBb" }, "outputs": [], "source": [ "dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "chEnp_Swsf0a" }, "outputs": [], "source": [ "for feat, targ in dataset.take(5):\n", "  print('Features: {}, Target: {}'.format(feat, targ))" ] }, { "cell_type": "markdown", "metadata": { "id": "GzwlAhX6xH9Q" }, "source": [ "Since a `pd.Series` implements the `__array__` protocol, it can be used transparently nearly anywhere you would use a `np.array` or a `tf.Tensor`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GnpHHkpktl5y" }, "outputs": [], "source": [ "tf.constant(df['thal'])" ] }, { "cell_type": "markdown", "metadata": { "id": "9XLxRHS10Ylp" }, "source": [ "Shuffle and batch the dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "R3dQ-83Ztsgl" }, "outputs": [], "source": [ "# A buffer as large as the dataset gives a full shuffle; each batch holds one example.\n", "train_dataset = dataset.shuffle(len(df)).batch(1)" ] }, { "cell_type": "markdown", "metadata": { "id": "bB9C0XJkyQEk" }, "source": [ "## Create and train a model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FQd9PcPRpkP4" }, "outputs": [], "source": [ "def get_compiled_model():\n", "  model = tf.keras.Sequential([\n", "    tf.keras.layers.Dense(10, activation='relu'),\n", "    tf.keras.layers.Dense(10, activation='relu'),\n", "    tf.keras.layers.Dense(1)\n", "  ])\n", "\n", "  # The final layer outputs a logit, so the loss uses from_logits=True.\n", "  model.compile(optimizer='adam',\n", "                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "                metrics=['accuracy'])\n", "  return model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ybDzNUheqxJw" }, "outputs": [], "source": [ "model = get_compiled_model()\n", "model.fit(train_dataset, epochs=15)" ] }, { "cell_type": "markdown", "metadata": { "id": "d6V_6F_MBiG9" }, "source": [ "## Alternative to feature columns" ] }, { "cell_type": "markdown", "metadata": { "id": "X63B9vDsD8Ly" }, "source": [ "Passing a dictionary as an input to a model is as easy as creating a matching dictionary of `tf.keras.layers.Input` layers, applying any pre-processing, and stacking them up using the [functional API](../../guide/keras/functional.ipynb). You can use this as an alternative to [feature columns](../keras/feature_columns.ipynb)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FwQ47_WmOBnY" }, "outputs": [], "source": [ "# One scalar Input per DataFrame column, stacked into a single feature vector.\n", "inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}\n", "x = tf.stack(list(inputs.values()), axis=-1)\n", "\n", "x = tf.keras.layers.Dense(10, activation='relu')(x)\n", "output = tf.keras.layers.Dense(1)(x)\n", "\n", "model_func = tf.keras.Model(inputs=inputs, outputs=output)\n", "\n", "model_func.compile(optimizer='adam',\n", "                   loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n", "                   metrics=['accuracy'])" ] }, { "cell_type": "markdown", "metadata": { "id": "qSCN5f_vUURE" }, "source": [ "The easiest way to preserve the column structure of a `pd.DataFrame` when used with `tf.data` is to convert the `pd.DataFrame` to a `dict` and slice that dictionary." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wUjRKgEhPZqK" }, "outputs": [], "source": [ "# Each element is a (dict of column name -> value, target) pair; batch 16 at a time.\n", "dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WWRaiwxeyA9Z" }, "outputs": [], "source": [ "for dict_slice in dict_slices.take(1):\n", "  print(dict_slice)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8nTrfczNyKup" }, "outputs": [], "source": [ "model_func.fit(dict_slices, epochs=15)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "pandas_dataframe.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }