{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "1gRj-x7h332N"
},
"source": [
"# Training Neural Networks\n",
"\n",
"The network we built in the previous part isn't so smart, it doesn't know anything about our handwritten digits. Neural networks with non-linear activations work like universal function approximators. There is some function that maps your input to the output. For example, images of handwritten digits to class probabilities. The power of neural networks is that we can train them to approximate this function, and basically any function given enough data and compute time.\n",
"\n",
"\n",
"\n",
"At first the network is naive, it doesn't know the function mapping the inputs to the outputs. We train the network by showing it examples of real data, then adjusting the network parameters such that it approximates this function.\n",
"\n",
"To find these parameters, we need to know how poorly the network is predicting the real outputs. For this we calculate a **loss function** (also called the cost), a measure of our prediction error. For example, the mean squared loss is often used in regression and binary classification problems\n",
"\n",
"$$\n",
"\\large \\ell = \\frac{1}{2n}\\sum_i^n{\\left(y_i - \\hat{y}_i\\right)^2}\n",
"$$\n",
"\n",
"where $n$ is the number of training examples, $y_i$ are the true labels, and $\\hat{y}_i$ are the predicted labels.\n",
"\n",
"By minimizing this loss with respect to the network parameters, we can find configurations where the loss is at a minimum and the network is able to predict the correct labels with high accuracy. We find this minimum using a process called **gradient descent**. The gradient is the slope of the loss function and points in the direction of fastest change. To get to the minimum in the least amount of time, we then want to follow the gradient (downwards). You can think of this like descending a mountain by following the steepest slope to the base.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "C-bEg-Zz4Q7z"
},
"source": [
"## Backpropagation\n",
"\n",
"For single layer networks, gradient descent is straightforward to implement. However, it's more complicated for deeper, multilayer neural networks like the one we've built. Complicated enough that it took about 30 years before researchers figured out how to train multilayer networks.\n",
"\n",
"Training multilayer networks is done through **backpropagation** which is really just an application of the chain rule from calculus. It's easiest to understand if we convert a two layer network into a graph representation.\n",
"\n",
"\n",
"\n",
"In the forward pass through the network, our data and operations go from bottom to top here. We pass the input $x$ through a linear transformation $L_1$ with weights $W_1$ and biases $b_1$. The output then goes through the sigmoid operation $S$ and another linear transformation $L_2$. Finally we calculate the loss $\\ell$. We use the loss as a measure of how bad the network's predictions are. The goal then is to adjust the weights and biases to minimize the loss.\n",
"\n",
"To train the weights with gradient descent, we propagate the gradient of the loss backwards through the network. Each operation has some gradient between the inputs and outputs. As we send the gradients backwards, we multiply the incoming gradient with the gradient for the operation. Mathematically, this is really just calculating the gradient of the loss with respect to the weights using the chain rule.\n",
"\n",
"$$\n",
"\\large \\frac{\\partial \\ell}{\\partial W_1} = \\frac{\\partial L_1}{\\partial W_1} \\frac{\\partial S}{\\partial L_1} \\frac{\\partial L_2}{\\partial S} \\frac{\\partial \\ell}{\\partial L_2}\n",
"$$\n",
"\n",
"**Note:** I'm glossing over a few details here that require some knowledge of vector calculus, but they aren't necessary to understand what's going on.\n",
"\n",
"We update our weights using this gradient with some learning rate $\\alpha$. \n",
"\n",
"$$\n",
"\\large W^\\prime_1 = W_1 - \\alpha \\frac{\\partial \\ell}{\\partial W_1}\n",
"$$\n",
"\n",
"The learning rate $\\alpha$ is set such that the weight update steps are small enough that the iterative method settles in a minimum."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "worDfYepJH6j"
},
"source": [
"## Import Resources"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "jFdhxHwr57Yn"
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"%config InlineBackend.figure_format = 'retina'\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import tensorflow as tf\n",
"import tensorflow_datasets as tfds\n",
"tfds.disable_progress_bar()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"logger = tf.get_logger()\n",
"logger.setLevel(logging.ERROR)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 85
},
"colab_type": "code",
"id": "yCtUH8paXqBQ",
"outputId": "1a4c93cf-21a8-4574-d121-f238912d28e8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using:\n",
"\t• TensorFlow version: 2.16.1\n",
"\t• Running on GPU\n"
]
}
],
"source": [
"print('Using:')\n",
"print('\\t\\u2022 TensorFlow version:', tf.__version__)\n",
"print('\\t\\u2022 Running on GPU' if tf.test.is_gpu_available() else '\\t\\u2022 GPU device not found. Running on CPU')"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"physical_devices = tf.config.list_physical_devices('GPU')\n",
"for device in physical_devices:\n",
" tf.config.experimental.set_memory_growth(device, True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "3zQV8MLaJOjN"
},
"source": [
"## Load the Dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 360
},
"colab_type": "code",
"id": "Att74swb7Ol0",
"outputId": "a98f6ee1-9881-4d8d-8766-b8b00a2cb4f8"
},
"outputs": [],
"source": [
"training_set, dataset_info = tfds.load('mnist', split='train', as_supervised = True, with_info = True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IiSe5BPrJquE"
},
"source": [
"## Create Pipeline"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "9r4EMOdT9pM3"
},
"outputs": [],
"source": [
"def normalize(image, label):\n",
" image = tf.cast(image, tf.float32)\n",
" image /= 255\n",
" return image, label\n",
"\n",
"num_training_examples = dataset_info.splits['train'].num_examples\n",
"\n",
"batch_size = 64\n",
"\n",
"training_batches = training_set.cache().shuffle(num_training_examples//4).batch(batch_size).map(normalize).prefetch(1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "K9SC4gnUJucy"
},
"source": [
"## Build the Model"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Mo2DfMVvAdbd"
},
"outputs": [],
"source": [
"model = tf.keras.Sequential([\n",
" tf.keras.layers.Flatten(input_shape=(28, 28, 1)),\n",
" tf.keras.layers.Dense(128, activation='relu'),\n",
" tf.keras.layers.Dense(64, activation='relu'),\n",
" tf.keras.layers.Dense(10, activation='softmax')\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "5TCpaAlcKCDB"
},
"source": [
"## Getting the Model Ready For Training\n",
"\n",
"Before we can train our model we need to set the parameters we are going to use to train it. We can configure our model for training using the `.compile` method. The main parameters we need to specify in the `.compile` method are:\n",
"\n",
"* **Optimizer:** The algorithm that we'll use to update the weights of our model during training. Throughout these lessons we will use the [`adam`](http://arxiv.org/abs/1412.6980) optimizer. Adam is an optimization of the stochastic gradient descent algorithm. For a full list of the optimizers available in `tf.keras` check out the [optimizers documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/optimizers#classes).\n",
"\n",
"\n",
"* **Loss Function:** The loss function we are going to use during training to measure the difference between the true labels of the images in your dataset and the predictions made by your model. In this lesson we will use the `sparse_categorical_crossentropy` loss function. We use the `sparse_categorical_crossentropy` loss function when our dataset has labels that are integers, and the `categorical_crossentropy` loss function when our dataset has one-hot encoded labels. For a full list of the loss functions available in `tf.keras` check out the [losses documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses#classes).\n",
"\n",
"\n",
"* **Metrics:** A list of metrics to be evaluated by the model during training. Throughout these lessons we will measure the `accuracy` of our model. The `accuracy` calculates how often our model's predictions match the true labels of the images in our dataset. For a full list of the metrics available in `tf.keras` check out the [metrics documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/metrics#classes).\n",
"\n",
"These are the main parameters we are going to set throught these lesson. You can check out all the other configuration parameters in the [TensorFlow documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model#compile)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "jYv3pv5-InR1"
},
"outputs": [],
"source": [
"model.compile(optimizer='adam',\n",
" loss='sparse_categorical_crossentropy',\n",
" metrics=['accuracy'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Y5CjYa8ES3OI"
},
"source": [
"## Taking a Look at the Loss and Accuracy Before Training\n",
"\n",
"Before we train our model, let's take a look at how our model performs when it is just using random weights. Let's take a look at the `loss` and `accuracy` values when we pass a single batch of images to our un-trained model. To do this, we will use the `.evaluate(data, true_labels)` method. The `.evaluate(data, true_labels)` method compares the predicted output of our model on the given `data` with the given `true_labels` and returns the `loss` and `accuracy` values."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
},
"colab_type": "code",
"id": "u_7aijzvJQZ7",
"outputId": "f66f355e-d030-4c30-e50c-7bba125a20cf"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2/2 [==============================] - 9s 7ms/step - loss: 2.3233 - accuracy: 0.0938\n",
"\n",
"Loss before training: 2.323\n",
"Accuracy before training: 9.375%\n"
]
}
],
"source": [
"for image_batch, label_batch in training_batches.take(1):\n",
" loss, accuracy = model.evaluate(image_batch, label_batch)\n",
"\n",
"print(f'\\nLoss before training: {loss:,.3f}')\n",
"print(f'Accuracy before training: {accuracy:.3%}')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zvsfbLEMZjZ5"
},
"source": [
"## Training the Model\n",
"\n",
"Now let's train our model by using all the images in our training set. Some nomenclature, one pass through the entire dataset is called an *epoch*. To train our model for a given number of epochs we use the `.fit` method, as seen below:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
},
"colab_type": "code",
"id": "Z-CgmnKBZDjq",
"outputId": "38ab455c-767a-4705-c172-9d7cc926c239"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/5\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"938/938 [==============================] - 6s 3ms/step - loss: 0.2724 - accuracy: 0.9203\n",
"Epoch 2/5\n",
"938/938 [==============================] - 2s 2ms/step - loss: 0.1110 - accuracy: 0.9674\n",
"Epoch 3/5\n",
"938/938 [==============================] - 2s 2ms/step - loss: 0.0778 - accuracy: 0.9757\n",
"Epoch 4/5\n",
"938/938 [==============================] - 2s 2ms/step - loss: 0.0579 - accuracy: 0.9822\n",
"Epoch 5/5\n",
"938/938 [==============================] - 2s 2ms/step - loss: 0.0457 - accuracy: 0.9854\n"
]
}
],
"source": [
"EPOCHS = 5\n",
"\n",
"history = model.fit(training_batches, epochs=EPOCHS)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IFgG_WfUjCic"
},
"source": [
"The `.fit` method returns a `History` object which contains a record of training accuracy and loss values at successive epochs, as well as validation accuracy and loss values when applicable. We will discuss the history object in a later lesson. \n",
"\n",
"With our model trained, we can check out it's predictions."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 243
},
"colab_type": "code",
"id": "ghr7z-SnctRw",
"outputId": "8e946c9a-56b5-45f4-e79f-c6451ff8b7d5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2/2 [==============================] - 0s 3ms/step\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"