{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zwBCE43Cv3PH"
      },
      "source": [
        "##### Copyright 2019 The TensorFlow Authors.\n",
        "\n",
        "Licensed under the Apache License, Version 2.0 (the \"License\");"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "fOad0I2cv569"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YQB7yiF6v9GR"
      },
      "source": [
        "# Load a pandas.DataFrame"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Oqa952X4wQKK"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://www.tensorflow.org/tutorials/load_data/pandas_dataframe\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/pandas_dataframe.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/en/tutorials/load_data/pandas_dataframe.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/load_data/pandas_dataframe.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UmyEaf4Awl2v"
      },
      "source": [
        "This tutorial provides an example of how to load pandas dataframes into a `tf.data.Dataset`.\n",
        "\n",
        "This tutorials uses a small [dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) provided by the Cleveland Clinic Foundation for Heart Disease. There are several hundred rows in the CSV. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iiyC7HkqxlUD"
      },
      "source": [
        "## Read data using pandas"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5IoRbCA2n0_V"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import tensorflow as tf"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-2kBGy_pxn47"
      },
      "source": [
        "Download the csv file containing the heart dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VS4w2LePn9g3"
      },
      "outputs": [],
      "source": [
        "csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6BXRPD2-xtQ1"
      },
      "source": [
        "Read the csv file using pandas."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UEfJ8TcMpe-2"
      },
      "outputs": [],
      "source": [
        "df = pd.read_csv(csv_file)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8FkK6QIRpjd4"
      },
      "outputs": [],
      "source": [
        "df.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_MOAKz654CT5"
      },
      "outputs": [],
      "source": [
        "df.dtypes"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ww4lRDCS3qPh"
      },
      "source": [
        "Convert `thal` column which is an `object` in the dataframe to a discrete numerical value."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LmCl5R5C2IKo"
      },
      "outputs": [],
      "source": [
        "df['thal'] = pd.Categorical(df['thal'])\n",
        "df['thal'] = df.thal.cat.codes"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "s4XA1SNW2QyI"
      },
      "outputs": [],
      "source": [
        "df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WWRhH6r4xxQu"
      },
      "source": [
        "## Load data using `tf.data.Dataset`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GuqmVVH_yApQ"
      },
      "source": [
        "Use `tf.data.Dataset.from_tensor_slices` to read the values from a pandas dataframe. \n",
        "\n",
        "One of the advantages of using `tf.data.Dataset` is it allows you to write simple, highly efficient data pipelines. Read the [loading data guide](https://www.tensorflow.org/guide/data) to find out more."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2wwhILm1ycSp"
      },
      "outputs": [],
      "source": [
        "target = df.pop('target')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "W6Yc-D3aqyBb"
      },
      "outputs": [],
      "source": [
        "dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "chEnp_Swsf0a"
      },
      "outputs": [],
      "source": [
        "for feat, targ in dataset.take(5):\n",
        "  print ('Features: {}, Target: {}'.format(feat, targ))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GzwlAhX6xH9Q"
      },
      "source": [
        "Since a `pd.Series` implements the `__array__` protocol it can be used transparently nearly anywhere you would use a `np.array` or a `tf.Tensor`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GnpHHkpktl5y"
      },
      "outputs": [],
      "source": [
        "tf.constant(df['thal'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9XLxRHS10Ylp"
      },
      "source": [
        "Shuffle and batch the dataset."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "R3dQ-83Ztsgl"
      },
      "outputs": [],
      "source": [
        "train_dataset = dataset.shuffle(len(df)).batch(1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bB9C0XJkyQEk"
      },
      "source": [
        "## Create and train a model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FQd9PcPRpkP4"
      },
      "outputs": [],
      "source": [
        "def get_compiled_model():\n",
        "  model = tf.keras.Sequential([\n",
        "    tf.keras.layers.Dense(10, activation='relu'),\n",
        "    tf.keras.layers.Dense(10, activation='relu'),\n",
        "    tf.keras.layers.Dense(1)\n",
        "  ])\n",
        "\n",
        "  model.compile(optimizer='adam',\n",
        "                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n",
        "                metrics=['accuracy'])\n",
        "  return model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ybDzNUheqxJw"
      },
      "outputs": [],
      "source": [
        "model = get_compiled_model()\n",
        "model.fit(train_dataset, epochs=15)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d6V_6F_MBiG9"
      },
      "source": [
        "## Alternative to feature columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X63B9vDsD8Ly"
      },
      "source": [
        "Passing a dictionary as an input to a model is as easy as creating a matching dictionary of `tf.keras.layers.Input` layers, applying any pre-processing and stacking them up using the [functional api](../../guide/keras/functional.ipynb). You can use this as an alternative to [feature columns](../keras/feature_columns.ipynb)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FwQ47_WmOBnY"
      },
      "outputs": [],
      "source": [
        "inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}\n",
        "x = tf.stack(list(inputs.values()), axis=-1)\n",
        "\n",
        "x = tf.keras.layers.Dense(10, activation='relu')(x)\n",
        "output = tf.keras.layers.Dense(1)(x)\n",
        "\n",
        "model_func = tf.keras.Model(inputs=inputs, outputs=output)\n",
        "\n",
        "model_func.compile(optimizer='adam',\n",
        "                   loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n",
        "                   metrics=['accuracy'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qSCN5f_vUURE"
      },
      "source": [
        "The easiest way to preserve the column structure of a `pd.DataFrame` when used with `tf.data` is to convert the `pd.DataFrame` to a `dict`, and slice that dictionary."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wUjRKgEhPZqK"
      },
      "outputs": [],
      "source": [
        "dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "WWRaiwxeyA9Z"
      },
      "outputs": [],
      "source": [
        "for dict_slice in dict_slices.take(1):\n",
        "  print (dict_slice)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8nTrfczNyKup"
      },
      "outputs": [],
      "source": [
        "model_func.fit(dict_slices, epochs=15)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "pandas_dataframe.ipynb",
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}