{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Linear models\n", "> In this chapter, you will learn how to build, solve, and make predictions with models in TensorFlow 2.0. You will focus on a simple class of models – the linear regression model – and will try to predict housing prices. By the end of the chapter, you will know how to load and manipulate data, construct loss functions, perform minimization, make predictions, and reduce resource use with batch training. This is a summary of the lecture \"Introduction to TensorFlow in Python\", via DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Tensorflow-Keras, Deep_Learning]\n", "- image: images/fitted_linreg.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2.1.0'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import tensorflow as tf\n", "import pandas as pd\n", "import numpy as np\n", "\n", "tf.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load data using pandas\n", "Before you can train a machine learning model, you must first import data. There are several valid ways to do this, but for now, we will use a simple one-liner from pandas: `pd.read_csv()`. Recall from the video that the first argument specifies the path or URL. All other arguments are optional.\n", "\n", "In this exercise, you will import the King County housing dataset, which we will use to train a linear model later in the chapter." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 221900.0\n", "1 538000.0\n", "2 180000.0\n", "3 604000.0\n", "4 510000.0\n", " ...\n",
"21608 360000.0\n", "21609 400000.0\n", "21610 402101.0\n", "21611 400000.0\n", "21612 325000.0\n", "Name: price, Length: 21613, dtype: float64\n" ] } ], "source": [ "# Load the dataset as a dataframe named housing\n", "housing = pd.read_csv('./dataset/kc_house_data.csv')\n", "\n", "# Print the price column of housing\n", "print(housing['price'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that you did not have to specify a delimiter with the `sep` parameter, since the dataset was stored in the default, comma-separated format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting the data type\n", "In this exercise, you will both load data and set its type. You will use numpy and tensorflow to convert columns of `housing` into arrays and tensors with a specified data type. Recall that you can select the `price` column, for instance, from `housing` using `housing['price']`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[221900. 538000. 180000. ... 402101. 400000. 325000.]\n", "tf.Tensor([False False False ... False False False], shape=(21613,), dtype=bool)\n" ] } ], "source": [ "# Use a numpy array to define price as a 32-bit float\n", "price = np.array(housing['price'], np.float32)\n", "\n", "# Define waterfront as a Boolean using cast\n", "waterfront = tf.cast(housing['waterfront'], tf.bool)\n", "\n", "# Print price and waterfront\n", "print(price)\n", "print(waterfront)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that printing `price` yielded a numpy array, whereas printing `waterfront` yielded a `tf.Tensor`."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loss functions\n", "- Loss function\n", " - Fundamental tensorflow operation\n", " - Used to train model\n", " - Measure a model fit\n", " - Higher value -> worse fit\n", " - Minimize the loss function\n", "- Common loss functions in Tensorflow\n", " - Mean squared error (MSE)\n", " - Mean absolute error (MAE)\n", " - Huber error\n", "- Why do we care about loss functions?\n", " - MSE\n", " - Strongly penalizes outliers\n", " - High (gradient) sensitivity near minimum\n", " - MAE\n", " - Scales linearly with size of error\n", " - Low sensitivity near minimum\n", " - Huber\n", " - Similar to MSE near minimum\n", " - Similar to MAE away from minimum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loss functions in TensorFlow\n", "In this exercise, you will compute the loss using data from the King County housing dataset. You are given a target, `price`, which is a tensor of house prices, and `predictions`, which is a tensor of predicted house prices. 
You will evaluate the loss function and print out the value of the loss.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "kc_sample = pd.read_csv('./dataset/loss_price.csv')\n", "price = kc_sample['price'].to_numpy()\n", "predictions = kc_sample['pred'].to_numpy()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "141171604777.12717\n" ] } ], "source": [ "# Compute the mean squared error (mse)\n", "loss = tf.keras.losses.mse(price, predictions)\n", "\n", "# Print the mean squared error (mse)\n", "print(loss.numpy())" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "268827.9930208799\n" ] } ], "source": [ "# Compute the mean absolute error (mae)\n", "loss = tf.keras.losses.mae(price, predictions)\n", "\n", "# Print the mean absolute error (mae)\n", "print(loss.numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may have noticed that the MAE was much smaller than the MSE, even though `price` and `predictions` were the same. This is because the two loss functions penalize deviations of `predictions` from `price` differently. MSE squares each deviation, so large errors are penalized far more heavily than under MAE, which grows only linearly with the size of the error." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifying the loss function\n", "In the previous exercise, you defined a tensorflow loss function and then evaluated it once for a set of actual and predicted values. In this exercise, you will compute the loss within another function called `loss_function()`, which first generates predicted values from the data and variables. The purpose of this is to construct a function of the trainable model variables that returns the loss. You can then repeatedly evaluate this function for different variable values until you find the minimum. 
In practice, you will pass this function to an optimizer in tensorflow." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "features = tf.constant([1, 2, 3, 4, 5], dtype=tf.float32)\n", "targets = tf.constant([2, 4, 6, 8, 10], dtype=tf.float32)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.0\n" ] } ], "source": [ "# Initialize a variable named scalar (dtype must be passed by keyword;\n", "# the second positional argument of tf.Variable is `trainable`)\n", "scalar = tf.Variable(1.0, dtype=tf.float32)\n", "\n", "# Define the model\n", "def model(scalar, features=features):\n", "    return scalar * features\n", "\n", "# Define a loss function\n", "def loss_function(scalar, features=features, targets=targets):\n", "    # Compute the predicted values\n", "    predictions = model(scalar, features)\n", "\n", "    # Return the mean absolute error loss\n", "    return tf.keras.losses.mae(targets, predictions)\n", "\n", "# Evaluate the loss function and print the loss\n", "print(loss_function(scalar).numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up a linear regression\n", "A univariate linear regression identifies the relationship between a single feature and the target tensor. In this exercise, we will use a property's lot size and price. Just as we discussed in the video, we will take the natural logarithms of both tensors, which are available as `price_log` and `size_log`.\n", "\n", "In this exercise, you will define the model and the loss function. You will then evaluate the loss function for two different values of `intercept` and `slope`. Remember that the predicted values are given by `intercept + features * slope`."
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "size_log = np.log(np.array(housing['sqft_lot'], np.float32))\n", "price_log = np.log(np.array(housing['price'], np.float32))\n", "bedrooms = np.array(housing['bedrooms'], np.float32)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "145.44653\n", "71.866\n" ] } ], "source": [ "# Define a linear regression model\n", "def linear_regression(intercept, slope, features=size_log):\n", "    return intercept + slope * features\n", "\n", "# Set loss_function() to take the variables as arguments\n", "def loss_function(intercept, slope, features=size_log, targets=price_log):\n", "    # Set the predicted values\n", "    predictions = linear_regression(intercept, slope, features)\n", "\n", "    # Return the mean squared error loss\n", "    return tf.keras.losses.mse(targets, predictions)\n", "\n", "# Compute the loss function for different slope and intercept values\n", "print(loss_function(0.1, 0.1).numpy())\n", "print(loss_function(0.1, 0.5).numpy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train a linear model\n", "In this exercise, we will pick up where the previous exercise ended. The variables `intercept` and `slope` have been defined and initialized. Additionally, a function has been defined, `loss_function(intercept, slope)`, which computes the loss using the data and model variables.\n", "\n", "You will now define an optimization operation as `opt`. You will then train a univariate linear model by minimizing the loss to find the optimal values of `intercept` and `slope`. Note that the `opt` operation will try to move closer to the optimum with each step, but will require many steps to find it. Thus, you must repeatedly execute the operation."
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "def plot_results(intercept, slope):\n", "    size_range = np.linspace(6, 14, 100)\n", "    price_pred = [intercept + slope * s for s in size_range]\n", "    plt.figure(figsize=(8, 8))\n", "    plt.scatter(size_log, price_log, color='black')\n", "    plt.plot(size_range, price_pred, linewidth=3.0, color='red')\n", "    plt.xlabel('log(size)')\n", "    plt.ylabel('log(price)')\n", "    plt.title('Scatterplot of data and fitted regression line')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "65.26133\n", "1.4909142\n", "2.3818178\n", "2.9086726\n", "2.6110873\n", "1.7604784\n", "1.3467994\n", "1.3559676\n", "1.288407\n", "1.2425306\n" ] }, { "data": { "image/png": "\n", "text/plain": [ ""
] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "intercept = tf.Variable(0.0, dtype=tf.float32)\n", "slope = tf.Variable(0.0, dtype=tf.float32)\n", "\n", "# Initialize an adam optimizer\n", "opt = tf.keras.optimizers.Adam(learning_rate=0.5)\n", "\n", "for j in range(100):\n", "    # Apply minimize, pass the loss function, and supply the variables\n", "    opt.minimize(lambda: loss_function(intercept, slope), var_list=[intercept, slope])\n", "\n", "    # Print every 10th value of the loss\n", "    if j % 10 == 0:\n", "        print(loss_function(intercept, slope).numpy())\n", "\n", "# Plot data and regression line\n", "plot_results(intercept, slope)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we printed `loss_function(intercept, slope)` on every 10th step of 100. The loss did not fall at every printed step (it rose from about 1.49 to about 2.91 before declining again), but overall it moved toward the minimum as the optimizer brought the `slope` and `intercept` parameters closer to their optimal values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple linear regression\n", "In most cases, performing a univariate linear regression will not yield a model that is useful for making accurate predictions. In this exercise, you will perform a multiple regression, which uses more than one feature.\n", "\n", "You will use `price_log` as your target and `size_log` and `bedrooms` as your features. Each of these tensors has been defined and is available. You will also switch from the mean squared error loss to the mean absolute error loss: `keras.losses.mae()`. Finally, the predicted values are computed as follows: `params[0] + feature1 * params[1] + feature2 * params[2]`. Note that we've defined a vector of parameters, `params`, as a single variable, rather than using three variables. 
Here, `params[0]` is the intercept and `params[1]` and `params[2]` are the slopes.\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def print_results(params):\n", "    # Print the current loss and the three model parameters\n", "    print('loss: {:0.3f}, intercept: {:0.3f}, slope_1: {:0.3f}, slope_2: {:0.3f}'\n", "          .format(loss_function(params).numpy(),\n", "                  params[0].numpy(),\n", "                  params[1].numpy(),\n", "                  params[2].numpy()))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "loss: 12.418, intercept: 0.101, slope_1: 0.051, slope_2: 0.021\n", "loss: 12.404, intercept: 0.102, slope_1: 0.052, slope_2: 0.022\n", "loss: 12.391, intercept: 0.103, slope_1: 0.053, slope_2: 0.023\n", "loss: 12.377, intercept: 0.104, slope_1: 0.054, slope_2: 0.024\n", "loss: 12.364, intercept: 0.105, slope_1: 0.055, slope_2: 0.025\n", "loss: 12.351, intercept: 0.106, slope_1: 0.056, slope_2: 0.026\n", "loss: 12.337, intercept: 0.107, slope_1: 0.057, slope_2: 0.027\n", "loss: 12.324, intercept: 0.108, slope_1: 0.058, slope_2: 0.028\n", "loss: 12.311, intercept: 0.109, slope_1: 0.059, slope_2: 0.029\n", "loss: 12.297, intercept: 0.110, slope_1: 0.060, slope_2: 0.030\n" ] } ], "source": [ "params = tf.Variable([0.1, 0.05, 0.02], dtype=tf.float32)\n", "\n", "# Define the linear regression model\n", "def linear_regression(params, feature1=size_log, feature2=bedrooms):\n", "    return params[0] + feature1 * params[1] + feature2 * params[2]\n", "\n", "# Define the loss function\n", "def loss_function(params, targets=price_log, feature1=size_log, feature2=bedrooms):\n", "    # Set the predicted values\n", "    predictions = linear_regression(params, feature1, feature2)\n", "\n", "    # Use the mean absolute error loss\n", "    return tf.keras.losses.mae(targets, predictions)\n", "\n", "# Define the optimize operation\n", "opt = tf.keras.optimizers.Adam()\n", "\n", "# Perform minimization and print trainable variables\n", "for j in 
range(10):\n", "    opt.minimize(lambda: loss_function(params), var_list=[params])\n", "    print_results(params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `params[2]` tells us how much the price will increase in percentage terms if we add one more bedroom. Ten steps are far too few for the parameters to converge, as the slowly falling loss shows; you could train `params[2]` and the other model parameters further by increasing the number of optimization steps." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch training\n", "- Full sample versus batch training\n", " - Full sample\n", " 1. One update per epoch\n", " 2. Accepts dataset without modification\n", " 3. Limited by memory\n", " - Batch Training\n", " 1. Multiple updates per epoch\n", " 2. Requires division of dataset\n", " 3. No limit on dataset size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing to batch train\n", "Before we can train a linear model in batches, we must first define variables, a loss function, and an optimization operation. In this exercise, we will prepare to train a model that will predict `price_batch`, a batch of house prices, using `size_batch`, a batch of lot sizes in square feet. In contrast to the previous lesson, we will do this by loading batches of data using pandas, converting them to numpy arrays, and then using them to minimize the loss function in steps.\n", "\n", "Note that you should not set default argument values for either the model or loss function, since we will generate the data in batches during the training process."
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Define the intercept and slope\n", "intercept = tf.Variable(10.0, dtype=tf.float32)\n", "slope = tf.Variable(0.5, dtype=tf.float32)\n", "\n", "# Define the model\n", "def linear_regression(intercept, slope, features):\n", "    # Define the predicted values\n", "    return intercept + slope * features\n", "\n", "# Define the loss function\n", "def loss_function(intercept, slope, targets, features):\n", "    # Define the predicted values\n", "    predictions = linear_regression(intercept, slope, features)\n", "\n", "    # Define the MSE loss\n", "    return tf.keras.losses.mse(targets, predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we did not use default argument values for the input data, `features` and `targets`. This is because the input data has not been defined in advance. Instead, with batch training, we will load it during the training process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training a linear model in batches\n", "In this exercise, we will train a linear regression model in batches, starting where we left off in the previous exercise. We will do this by stepping through the dataset in batches and updating the model's variables, `intercept` and `slope`, after each step. This approach will allow us to train with datasets that are otherwise too large to hold in memory.\n", "\n", "Note that the loss function, `loss_function(intercept, slope, targets, features)`, has been defined for you. The trainable variables should be entered into `var_list` in the order in which they appear as loss function arguments."
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10.217888 0.7016001\n" ] } ], "source": [ "intercept = tf.Variable(10.0, dtype=tf.float32)\n", "slope = tf.Variable(0.5, dtype=tf.float32)\n", "\n", "# Initialize adam optimizer\n", "opt = tf.keras.optimizers.Adam()\n", "\n", "# Load data in batches of 100 rows\n", "for batch in pd.read_csv('./dataset/kc_house_data.csv', chunksize=100):\n", "    # Extract the lot size values for the current batch\n", "    size_batch = np.array(batch['sqft_lot'], np.float32)\n", "\n", "    # Extract the price values for the current batch\n", "    price_batch = np.array(batch['price'], np.float32)\n", "\n", "    # Minimize the loss on the current batch with respect to intercept and slope\n", "    opt.minimize(lambda: loss_function(intercept, slope, price_batch, size_batch),\n", "                 var_list=[intercept, slope])\n", "\n", "# Print trained parameters\n", "print(intercept.numpy(), slope.numpy())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }