{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "FhGuhbZ6M5tl" }, "source": [ "##### Copyright 2018 The TensorFlow Authors.\n", "https://www.tensorflow.org/tutorials/keras/regression" ] }, { "cell_type": "markdown", "metadata": { "id": "EIdT9iu_Z4Rb" }, "source": [ "# Basic regression: Predict fuel efficiency" ] }, { "cell_type": "markdown", "metadata": { "id": "AHp3M9ZmrIxj" }, "source": [ "In a *regression* problem, the aim is to predict a continuous value, like a price or a probability. Contrast this with a *classification* problem, where the aim is to select a class from a list of classes (for example, where a picture contains an apple or an orange, recognizing which fruit is in the picture).\n", "\n", "This tutorial uses the classic [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) dataset and demonstrates how to build a model to predict the fuel efficiency of automobiles from the late 1970s and early 1980s. To do this, you will provide the model with a description of many automobiles from that time period. This description includes attributes like cylinders, displacement, horsepower, and weight.\n", "\n", "This example uses the Keras API. 
(Visit the Keras [tutorials](https://www.tensorflow.org/tutorials/keras) and [guides](https://www.tensorflow.org/guide/keras) to learn more.)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "9xQKvCJ85kCQ", "outputId": "cb07b90d-55c7-4323-df7f-dd18c901d6d2", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "2.12.0\n" ] } ], "source": [ "import tensorflow as tf\n", "from tensorflow import keras\n", "from tensorflow.keras import layers\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "print(tf.__version__)" ] }, { "cell_type": "markdown", "metadata": { "id": "F_72b0LCNbjx" }, "source": [ "## The Auto MPG dataset\n", "\n", "The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/).\n" ] }, { "cell_type": "markdown", "metadata": { "id": "gFh9ne3FZ-On" }, "source": [ "### Get the data\n", "First download and import the dataset using pandas:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "CiX2FI4gZtTt" }, "outputs": [], "source": [ "url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'\n", "column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',\n", " 'Acceleration', 'Model Year', 'Origin']\n", "\n", "raw_dataset = pd.read_csv(url, names=column_names,\n", " na_values='?', comment='\\t',\n", " sep=' ', skipinitialspace=True)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "2oY3pMPagJrO", "outputId": "cecfb141-1c3b-4d57-9c27-783df427e52f", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " MPG Cylinders Displacement Horsepower Weight Acceleration \\\n", "393 27.0 4 140.0 86.0 2790.0 15.6 \n", "394 44.0 4 97.0 52.0 2130.0 24.6 \n", "395 32.0 4 135.0 84.0 2295.0 11.6 \n", "396 28.0 4 120.0 79.0 2625.0 
18.6 \n", "397 31.0 4 119.0 82.0 2720.0 19.4 \n", "\n", " Model Year Origin \n", "393 82 1 \n", "394 82 2 \n", "395 82 1 \n", "396 82 1 \n", "397 82 1 " ], "text/html": [ "\n", "
\n" ] }, "metadata": {}, "execution_count": 3 } ], "source": [ "dataset = raw_dataset.copy()\n", "dataset.tail()" ] }, { "cell_type": "markdown", "metadata": { "id": "3MWuJTKEDM-f" }, "source": [ "### Clean the data\n", "\n", "The dataset contains a few unknown values:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "JEJHhN65a2VV", "outputId": "c33b1219-36f5-4357-c2a3-fc52b3b5ee3a", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "MPG 0\n", "Cylinders 0\n", "Displacement 0\n", "Horsepower 6\n", "Weight 0\n", "Acceleration 0\n", "Model Year 0\n", "Origin 0\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 4 } ], "source": [ "dataset.isna().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "9UPN0KBHa_WI" }, "source": [ "Drop those rows to keep this initial tutorial simple:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "4ZUDosChC1UN" }, "outputs": [], "source": [ "dataset = dataset.dropna()" ] }, { "cell_type": "markdown", "metadata": { "id": "8XKitwaH4v8h" }, "source": [ "The `\"Origin\"` column is categorical, not numeric. So the next step is to one-hot encode the values in the column with [pd.get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).\n", "\n", "Note: You can set up the `tf.keras.Model` to do this kind of transformation for you but that's beyond the scope of this tutorial. Check out the [Classify structured data using Keras preprocessing layers](../structured_data/preprocessing_layers.ipynb) or [Load CSV data](../load_data/csv.ipynb) tutorials for examples." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "gWNTD2QjBWFJ" }, "outputs": [], "source": [ "dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})" ] }, { "cell_type": "code", "source": [ "dataset.tail()" ], "metadata": { "id": "k1dNncezB9Zl", "outputId": "58d5e9b8-756b-4832-d576-f16a66ec3abd", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " MPG Cylinders Displacement Horsepower Weight Acceleration \\\n", "393 27.0 4 140.0 86.0 2790.0 15.6 \n", "394 44.0 4 97.0 52.0 2130.0 24.6 \n", "395 32.0 4 135.0 84.0 2295.0 11.6 \n", "396 28.0 4 120.0 79.0 2625.0 18.6 \n", "397 31.0 4 119.0 82.0 2720.0 19.4 \n", "\n", " Model Year Origin \n", "393 82 USA \n", "394 82 Europe \n", "395 82 USA \n", "396 82 USA \n", "397 82 USA " ], "text/html": [ "\n", "
\n" ] }, "metadata": {}, "execution_count": 7 } ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "ulXz4J7PAUzk", "outputId": "6c9cd8c4-c2ed-4116-d192-17a879653b81", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " MPG Cylinders Displacement Horsepower Weight Acceleration \\\n", "393 27.0 4 140.0 86.0 2790.0 15.6 \n", "394 44.0 4 97.0 52.0 2130.0 24.6 \n", "395 32.0 4 135.0 84.0 2295.0 11.6 \n", "396 28.0 4 120.0 79.0 2625.0 18.6 \n", "397 31.0 4 119.0 82.0 2720.0 19.4 \n", "\n", " Model Year Europe Japan USA \n", "393 82 0 0 1 \n", "394 82 1 0 0 \n", "395 82 0 0 1 \n", "396 82 0 0 1 \n", "397 82 0 0 1 " ], "text/html": [ "\n", "
\n" ] }, "metadata": {}, "execution_count": 8 } ], "source": [ "dataset = pd.get_dummies(dataset, columns=['Origin'], prefix='', prefix_sep='')\n", "dataset.tail()" ] }, { "cell_type": "markdown", "metadata": { "id": "Cuym4yvk76vU" }, "source": [ "### Split the data into training and test sets\n", "\n", "Now, split the dataset into a training set (a random 80% of the rows) and a test set (the remaining 20%). You will use the test set in the final evaluation of your model." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "qn-IGhUE7_1H" }, "outputs": [], "source": [ "train_dataset = dataset.sample(frac=0.8, random_state=0)\n", "test_dataset = dataset.drop(train_dataset.index)" ] }, { "cell_type": "markdown", "source": [ "### Split features from labels\n", "\n", "Separate the target value (the \"label\") from the features. This label is the value that you will train the model to predict." ], "metadata": { "id": "ULlGz3eVDUl4" } }, { "cell_type": "code", "source": [ "train_features = train_dataset.copy()\n", "test_features = test_dataset.copy()\n", "\n", "train_labels = train_features.pop('MPG')\n", "test_labels = test_features.pop('MPG')\n" ], "metadata": { "id": "JLFE3wcdDUFe" }, "execution_count": 10, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "mRklxK5s388r" }, "source": [ "## Normalization\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-ywmerQ6dSox" }, "source": [ "It is good practice to normalize features that use different scales and ranges.\n", "\n", "One reason this matters is that the features are multiplied by the model weights, so the scale of the outputs and the scale of the gradients are affected by the scale of the inputs.\n", "\n", "Although a model *might* converge without feature normalization, normalization makes training much more stable.\n", "\n", "Note: There is no advantage to normalizing the one-hot features; it is done here for simplicity. 
For more details on how to use the preprocessing layers, refer to the [Working with preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) guide and the [Classify structured data using Keras preprocessing layers](../structured_data/preprocessing_layers.ipynb) tutorial." ] }, { "cell_type": "markdown", "metadata": { "id": "aFJ6ISropeoo" }, "source": [ "### The Normalization layer\n", "\n", "The `tf.keras.layers.Normalization` layer is a clean and simple way to add feature normalization to your model.\n", "\n", "The first step is to create the layer:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "JlC5ooJrgjQF" }, "outputs": [], "source": [ "normalizer = tf.keras.layers.Normalization(axis=-1)" ] }, { "cell_type": "markdown", "metadata": { "id": "XYA2Ap6nVOha" }, "source": [ "Then, fit the state of the preprocessing layer to the data by calling `Normalization.adapt`:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "CrBbbjbwV91f" }, "outputs": [], "source": [ "normalizer.adapt(np.array(train_features))" ] }, { "cell_type": "markdown", "metadata": { "id": "oZccMR5yV9YV" }, "source": [ "Calling `adapt` calculates the mean and variance of each feature and stores them in the layer. Print them to inspect the result:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "GGn-ukwxSPtx", "outputId": "20eb5419-bdeb-4d9a-d22f-e85015fc7067", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[[5.47770691e+00 1.95318497e+02 1.04869446e+02 2.99025171e+03\n", " 1.55592356e+01 7.58980942e+01 1.78343967e-01 1.97452217e-01\n", " 6.24203861e-01]]\n", "[[2.8800766e+00 1.0850413e+04 1.4466993e+03 7.0989688e+05 7.7550268e+00\n", " 1.3467321e+01 1.4653738e-01 1.5846483e-01 2.3457341e-01]]\n" ] } ], "source": [ "print(normalizer.mean.numpy())\n", "print(normalizer.variance.numpy())" ] }, { "cell_type": "markdown", "metadata": { "id": "SmjdzxKzEu1-" }, "source": [ "## Regression with a deep neural 
network (DNN)" ] }, { "cell_type": "markdown", "metadata": { "id": "6SWtkIjhrZwa" }, "source": [ "This model contains the following layers:\n", "\n", "* The normalization layer (`normalizer`), adapted to the training features above.\n", "* Two hidden, non-linear `Dense` layers with the ReLU (`relu`) activation function.\n", "* A linear `Dense` single-output layer.\n", "\n", "The model is compiled with the mean absolute error loss and the Adam optimizer at a learning rate of 0.001." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "c26juK7ZG8j-" }, "outputs": [], "source": [ "model = keras.Sequential([\n", "    normalizer,\n", "    layers.Dense(64, activation='relu'),\n", "    layers.Dense(64, activation='relu'),\n", "    layers.Dense(1)\n", "])\n", "\n", "model.compile(loss='mean_absolute_error',\n", "              optimizer=tf.keras.optimizers.Adam(0.001))\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "id": "c0mhscXh2k36", "outputId": "6297cfd0-562d-485e-db44-22a4e5e3dab2", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Model: \"sequential\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " normalization (Normalizatio (None, 9) 19 \n", " n) \n", " \n", " dense (Dense) (None, 64) 640 \n", " \n", " dense_1 (Dense) (None, 64) 4160 \n", " \n", " dense_2 (Dense) (None, 1) 65 \n", " \n", "=================================================================\n", "Total params: 4,884\n", "Trainable params: 4,865\n", "Non-trainable params: 19\n", "_________________________________________________________________\n" ] } ], "source": [ 
"model.summary()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "CXDENACl2tuW", "outputId": "1ddafc66-230c-4ebb-c1a9-a4b7bbb8c651", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "CPU times: user 8.13 s, sys: 241 ms, total: 8.37 s\n", "Wall time: 17.3 s\n" ] } ], "source": [ "%%time\n", "history = model.fit(\n", " train_features,\n", " train_labels,\n", " validation_split=0.2,\n", " verbose=0, epochs=100)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "-9Dbj0fX23RQ", "outputId": "240d8b1e-58f6-4b83-d8e5-cface25d5b32", "colab": { "base_uri": "https://localhost:8080/", "height": 455 } }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ], "source": [ "def plot_loss(history):\n", " plt.plot(history.history['loss'], label='loss')\n", " plt.plot(history.history['val_loss'], label='val_loss')\n", " plt.ylim([0, 10])\n", " plt.xlabel('Epoch')\n", " plt.ylabel('Error [MPG]')\n", " plt.legend()\n", " plt.grid(True)\n", "\n", "plot_loss(history)" ] }, { "cell_type": "markdown", "metadata": { "id": "hWoVYS34fJPZ" }, "source": [ "Collect the results on the test set:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "-bZIa96W3c7K", "outputId": "2e62134d-0d31-4dc3-a789-bffd5d3e816f", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1.6882120370864868" ] }, "metadata": {}, "execution_count": 18 } ], "source": [ "model.evaluate(test_features, test_labels, verbose=0)" ] }, { "cell_type": "markdown", "metadata": { "id": "ft603OzXuEZC" }, "source": [ "### Make predictions\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "Xe7RXH3N3CWU", "outputId": "dd48050b-579e-4724-fc55-54f1f4be1183", "colab": { "base_uri": "https://localhost:8080/", "height": 473 } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "3/3 [==============================] - 1s 11ms/step\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ], "source": [ "test_predictions = model.predict(test_features).flatten()\n", "\n", "a = plt.axes(aspect='equal')\n", "plt.scatter(test_labels, test_predictions)\n", "plt.xlabel('True Values [MPG]')\n", "plt.ylabel('Predictions [MPG]')\n", "lims = [0, 50]\n", "plt.xlim(lims)\n", "plt.ylim(lims)\n", "_ = plt.plot(lims, lims)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "19wyogbOSU5t" }, "source": [ "It appears that the model predicts reasonably well.\n", "\n", "Now, check the error distribution:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "f-OHX4DiXd8x", "outputId": "eed773fe-d181-455e-ba47-84f84455c37c", "colab": { "base_uri": "https://localhost:8080/", "height": 449 } }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ], "source": [ "error = test_predictions - test_labels\n", "plt.hist(error, bins=25)\n", "plt.xlabel('Prediction Error [MPG]')\n", "_ = plt.ylabel('Count')" ] } ], "metadata": { "colab": { "name": "regression.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }