{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true,
      "authorship_tag": "ABX9TyMZABU6MSzIxB/Yv3Z7UrIZ",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/taehoonh/gpt-like-development/blob/main/gpt_dev.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### GPT-like large language model development using tiny Shakespear dataset"
      ],
      "metadata": {
        "id": "TJBttkA46aAQ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 1.** \n",
        "\n",
        "The following code uses the `requests` library, which is a more flexible and user-friendly way to handle HTTP requests in Python. It downloads the dataset from the URL and saves it as a file called `input.txt.` The status message at the end of the code lets you know when the dataset has been saved."
      ],
      "metadata": {
        "id": "FNXV_XwE31Nm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import requests\n",
        "\n",
        "# Download the tiny shakespeare dataset\n",
        "!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n",
        "\n",
        "# Read the file\n",
        "with open('input.txt', 'r', encoding='utf-8') as f:\n",
        "    text = f.read()"
      ],
      "metadata": {
        "id": "1a7J6Kha3l2U",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "5dda6538-8972-40bc-9c9e-400e1a0a3558"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2023-02-18 07:16:23--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 1115394 (1.1M) [text/plain]\n",
            "Saving to: ‘input.txt.2’\n",
            "\n",
            "\rinput.txt.2           0%[                    ]       0  --.-KB/s               \rinput.txt.2         100%[===================>]   1.06M  --.-KB/s    in 0.05s   \n",
            "\n",
            "2023-02-18 07:16:23 (22.5 MB/s) - ‘input.txt.2’ saved [1115394/1115394]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 2.** \n",
        "This code prints the length of the text string, which is the contents of the `input.txt file`. The length of the string is computed using the built-in `len` function and is expressed in characters. The resulting value is then printed to the console using the `print` function."
      ],
      "metadata": {
        "id": "VW2DsbFN4use"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(\"Length of the dataset in chracters: \", len(text))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "uf0TJ3Uk7BbB",
        "outputId": "4e0d2019-f434-46fd-ae2c-4ee0eba238fc"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Length of the dataset in chracters:  1115394\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 3.** \n",
        "Let's check out by printing the first 500 characters of the `text` variable, which was previously loaded from a file named `input.txt` using the with open statement. The `[:500]` syntax is used to slice the first 500 characters of the `text` string."
      ],
      "metadata": {
        "id": "oaCUkErS4_7U"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(text[:500])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OE8trwIN7R5k",
        "outputId": "5edfa26c-4a07-4f77-ac04-fdbf940b16ba"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "First Citizen:\n",
            "Before we proceed any further, hear me speak.\n",
            "\n",
            "All:\n",
            "Speak, speak.\n",
            "\n",
            "First Citizen:\n",
            "You are all resolved rather to die than to famish?\n",
            "\n",
            "All:\n",
            "Resolved. resolved.\n",
            "\n",
            "First Citizen:\n",
            "First, you know Caius Marcius is chief enemy to the people.\n",
            "\n",
            "All:\n",
            "We know't, we know't.\n",
            "\n",
            "First Citizen:\n",
            "Let us kill him, and we'll have corn at our own price.\n",
            "Is't a verdict?\n",
            "\n",
            "All:\n",
            "No more talking on't; let it be done: away, away!\n",
            "\n",
            "Second Citizen:\n",
            "One word, good citizens.\n",
            "\n",
            "First Citizen:\n",
            "We are accounted poor\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 4.** \n",
        "\n",
        "After that, we are generating a set of unique characters in the text, then sorts it. The sorted set of unique characters are then joined together and printed out as a string. The number of unique characters is also calculated and printed out with a formatted message \"Unique characters:\".\n",
        "\n",
        "The set comprehension `{char for char in text}` is used to extract unique characters in the `text` variable, which was read from the file. The set comprehension will only include one occurrence of each character in the `text`, hence getting the unique characters in the text.\n",
        "\n",
        "After getting the unique characters, the `len` function is used to get the length of the set of unique characters which gives us the count of unique characters. This value is stored in the `vocab_size` variable and is printed out with a formatted message."
      ],
      "metadata": {
        "id": "hqeepQVb5cAt"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Check out the unique characters that occur in this text dataset\n",
        "chars = sorted({char for char in text})\n",
        "vocab_size = len(chars)\n",
        "print(''.join(chars))\n",
        "print(f\"Unique characters: {vocab_size}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "m_PANT4d7beC",
        "outputId": "70ba0760-52de-4057-f5dd-886a6197fc46"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\n",
            " !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\n",
            "Unique characters: 65\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 5.** \n",
        "As a next step, we will create two dictionaries, `char_to_int` and `int_to_char`, that map characters in the text dataset to unique integers, and vice versa, respectively. Then, two functions `encode` and `decode` are defined to convert the text dataset to a list of integers, and vice versa. The code tests the two functions by encoding the string \"hello there\" and decoding the result to make sure the process works as expected. The output should be the encoded list of integers for \"hello there\" and the decoded string \"hello there\"."
      ],
      "metadata": {
        "id": "tGLMdxjD6jDi"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create mappings from characters to integers and vice versa\n",
        "char_to_int = {char: index for index, char in enumerate(chars)}\n",
        "int_to_char = {index: char for index, char in enumerate(chars)}\n",
        "\n",
        "# Define encoding and decoding functions\n",
        "def encode(text):\n",
        "  return [char_to_int[char] for char in text]\n",
        "\n",
        "def decode(encoded):\n",
        "  return ''.join([int_to_char[index] for index in encoded])\n",
        "\n",
        "# Test the encoding and decoding functions\n",
        "encoded = encode(\"hello there\")\n",
        "print(encoded)\n",
        "print(decode(encoded))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Rfz8z03y7nBH",
        "outputId": "e5930d54-1f54-4b46-f33c-6fb13226683e"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[46, 43, 50, 50, 53, 1, 58, 46, 43, 56, 43]\n",
            "hello there\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 6.** \n",
        "\n",
        "As we checked encoding and decoding works well, we will import the PyTorch library and create a 1-dimensional tensor (i.e. a torch.LongTensor) called `data` that holds the encoding of the entire text dataset.\n",
        "\n",
        "The encoding of the text is done using the `encode` function that was defined earlier and takes a string as input and returns a list of integers. The resulting list is then passed to the `torch.tensor` function which creates a tensor from the input data. The `dtype` argument is set to `torch.long` which specifies the data type of the tensor as long integers.\n",
        "\n",
        "Finally, the shape and data type of the tensor are printed, as well as the first 500 characters in their encoded form. The output of the `print` statements will give us some basic information about the tensor."
      ],
      "metadata": {
        "id": "SFA75VnV63_Y"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Encode the entire text dataset and save it into a torch.Tensor\n",
        "\n",
        "import torch # PyTorch: https://pytorch.org\n",
        "\n",
        "data = torch.tensor(encode(text), dtype=torch.long)\n",
        "print(f\"Shape of data tensor: {data.shape}\")\n",
        "print(f\"Data type of data tensor: {data.dtype}\")\n",
        "print(f\"First 500 characters in encoded form: {data[:500]}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Kw5JMbFb8dvN",
        "outputId": "d601e5b8-6d5d-43b0-feaf-29efdd7d269b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Shape of data tensor: torch.Size([1115394])\n",
            "Data type of data tensor: torch.int64\n",
            "First 500 characters in encoded form: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,\n",
            "        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,\n",
            "         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,\n",
            "        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,\n",
            "         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,\n",
            "        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,\n",
            "         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,\n",
            "        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,\n",
            "        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,\n",
            "         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,\n",
            "         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,\n",
            "        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,\n",
            "        47, 59, 57,  1, 47, 57,  1, 41, 46, 47, 43, 44,  1, 43, 52, 43, 51, 63,\n",
            "         1, 58, 53,  1, 58, 46, 43,  1, 54, 43, 53, 54, 50, 43,  8,  0,  0, 13,\n",
            "        50, 50, 10,  0, 35, 43,  1, 49, 52, 53, 61,  5, 58,  6,  1, 61, 43,  1,\n",
            "        49, 52, 53, 61,  5, 58,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47, 58,\n",
            "        47, 64, 43, 52, 10,  0, 24, 43, 58,  1, 59, 57,  1, 49, 47, 50, 50,  1,\n",
            "        46, 47, 51,  6,  1, 39, 52, 42,  1, 61, 43,  5, 50, 50,  1, 46, 39, 60,\n",
            "        43,  1, 41, 53, 56, 52,  1, 39, 58,  1, 53, 59, 56,  1, 53, 61, 52,  1,\n",
            "        54, 56, 47, 41, 43,  8,  0, 21, 57,  5, 58,  1, 39,  1, 60, 43, 56, 42,\n",
            "        47, 41, 58, 12,  0,  0, 13, 50, 50, 10,  0, 26, 53,  1, 51, 53, 56, 43,\n",
            "         1, 58, 39, 50, 49, 47, 52, 45,  1, 53, 52,  5, 58, 11,  1, 50, 43, 58,\n",
            "         1, 47, 58,  1, 40, 43,  1, 42, 53, 52, 43, 10,  1, 39, 61, 39, 63,  6,\n",
            "         1, 39, 61, 39, 63,  2,  0,  0, 31, 43, 41, 53, 52, 42,  1, 15, 47, 58,\n",
            "        47, 64, 43, 52, 10,  0, 27, 52, 43,  1, 61, 53, 56, 42,  6,  1, 45, 53,\n",
            "        53, 42,  1, 41, 47, 58, 47, 64, 43, 52, 57,  8,  0,  0, 18, 47, 56, 57,\n",
            "        58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 35, 43,  1, 39, 56, 43,  1,\n",
            "        39, 41, 41, 53, 59, 52, 58, 43, 42,  1, 54, 53, 53, 56])\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 7.** \n",
        "\n",
        "We will split the data tensor into training and validation set using a specified train ratio (0.8 in this case).\n",
        "\n",
        "The length of the training data is calculated as the product of the length of the data tensor and the train ratio (0.8). The first part of the data tensor with length equal to the calculated training data length becomes the training data. The rest of the data tensor becomes the validation data.\n",
        "\n",
        "The lengths of the training data and validation data are then printed out."
      ],
      "metadata": {
        "id": "iHJAYXF37sPW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Split the data into training and validation sets\n",
        "\n",
        "train_ratio = 0.8\n",
        "train_data_length = int(len(data) * train_ratio)\n",
        "train_data = data[:train_data_length]\n",
        "val_data = data[train_data_length:]\n",
        "\n",
        "print(f\"Length of training data: {len(train_data)}\")\n",
        "print(f\"Length of validation data: {len(val_data)}\")"
      ],
      "metadata": {
        "id": "tODMXekl9Ng0",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "994aaced-1f13-49a9-d38c-43315153da13"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Length of training data: 892315\n",
            "Length of validation data: 223079\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 8.** \n",
        "\n",
        "Let's define a block size of 8 characters, and print the first 8 + 1 = 9 characters in the training data.\n",
        "\n",
        "It first converts the training data tensor slice to a list using `tolist()` method, then passes this list to the decode function to get the characters represented by the integers. The `decode` function takes a list of integers, which represent characters as indices in the `int_to_char` mapping, and converts the indices back to characters using the `int_to_char` mapping.\n",
        "\n",
        "The `print` statement outputs the decoded characters, allowing us to see a portion of the original text."
      ],
      "metadata": {
        "id": "BOUt_uKg8NXI"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "block_size = 8\n",
        "print(\"First\", block_size + 1, \"characters in training data:\")\n",
        "print(decode(train_data[:block_size+1].tolist()))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jlrDJHr99WTx",
        "outputId": "2da487ca-eb18-4b66-e362-c0d5689cb5a3"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "First 9 characters in training data:\n",
            "First Cit\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 9.** \n",
        "\n",
        "Also, let's split the training data into two parts `x` and `y` with `block_size` characters each. `x` contains the first `block_size` characters of the training data and `y` contains the next `block_size` characters.\n",
        "\n",
        "Then, the code loops through the range `t` from 0 to `block_size` and uses `t` as the index to extract the context and target. The context is `x[:t+1]`, which is a slice of the `x` array that contains all elements up to and including the `t`-th element. The target is `y[t]`, which is the `t`-th element of the `y` array.\n",
        "\n",
        "Finally, the code uses the `decode` function to convert the context and target from encoded integers back to characters. The `decode` function takes a list of integers as input and returns the string that corresponds to the concatenation of the characters corresponding to the integers.\n",
        "\n",
        "The code prints the context and target for each value of `t`, with a message that describes what each one represents."
      ],
      "metadata": {
        "id": "UjtfuGIo8dOv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "x = train_data[:block_size]\n",
        "y = train_data[1:block_size+1]\n",
        "for t in range(block_size):\n",
        "    context = x[:t+1]\n",
        "    target = y[t]\n",
        "    print(f\"when input is {context} the target: {target}\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_j_DJtdF9ZGO",
        "outputId": "384718a6-e906-4e36-b373-a39d03aa0826"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "when input is tensor([18]) the target: 47\n",
            "when input is tensor([18, 47]) the target: 56\n",
            "when input is tensor([18, 47, 56]) the target: 57\n",
            "when input is tensor([18, 47, 56, 57]) the target: 58\n",
            "when input is tensor([18, 47, 56, 57, 58]) the target: 1\n",
            "when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15\n",
            "when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47\n",
            "when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 10.** \n",
        "\n",
        "The code generates a random batch of inputs (x) and targets (y) for either the training data or validation data. The data is split based on the input argument passed to the function `get_batch()`. The `block_size` specifies the maximum context length for predictions. The `batch_size` specifies the number of independent sequences that will be processed in parallel.\n",
        "\n",
        "The `torch.randint` function is used to generate a tensor of shape (batch_size, ) containing random integers between 0 and len(data) - block_size. These integers represent the starting indices of each sequence in the batch. The `torch.stack` function is used to create tensors `x` and `y` from the corresponding slices of the data tensor. `x` contains the blocks of size `block_size` from `data`, and `y` contains the blocks of size `block_size` shifted by one from `data`.\n",
        "\n",
        "The generated inputs and targets are printed out, along with their shapes. The code then loops over the generated batch and for each sequence in the batch, it prints out the context and the corresponding target. The `context` is taken as the slice of the input tensor `xb` from the beginning of the sequence to the current time step, and the target is taken as the corresponding element in the target tensor `yb`. The `tolist` method is used to convert the tensor to a Python list. The target is then printed out."
      ],
      "metadata": {
        "id": "2i7he-gt-89E"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "torch.manual_seed(1337)\n",
        "batch_size = 4 # The number of independent sequences that will be processed in parallel\n",
        "block_size = 8 # Max context length for predictions\n",
        "\n",
        "\n",
        "def get_batch(split):\n",
        "    data = train_data if split == 'train' else val_data\n",
        "    ix = torch.randint(0, len(data) - block_size, (batch_size,))\n",
        "    x = torch.stack([data[i:i+block_size] for i in ix])\n",
        "    y = torch.stack([data[i+1:i+block_size+1] for i in ix])\n",
        "    return x, y\n",
        "\n",
        "xb, yb = get_batch('train')\n",
        "print('Inputs:')\n",
        "print(xb.shape)\n",
        "print(xb)\n",
        "print('Targets:')\n",
        "print(yb.shape)\n",
        "print(yb)\n",
        "\n",
        "print('----')\n",
        "\n",
        "for b in range(batch_size):\n",
        "    for t in range(block_size):\n",
        "        context = xb[b, :t+1]\n",
        "        target = yb[b, t]\n",
        "        print(f\"When input is {context.tolist()}, the target is: {target}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "2YAlgj3T9ck9",
        "outputId": "e04aabd4-a0b1-42da-a81a-3d1205a92f10"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Inputs:\n",
            "torch.Size([4, 8])\n",
            "tensor([[58, 63,  8,  0,  0, 19, 24, 27],\n",
            "        [39, 59, 45, 46, 58,  1, 46, 43],\n",
            "        [49, 43, 57,  1, 53, 50, 42,  1],\n",
            "        [52, 41, 47, 43, 52, 58,  1, 56]])\n",
            "Targets:\n",
            "torch.Size([4, 8])\n",
            "tensor([[63,  8,  0,  0, 19, 24, 27, 33],\n",
            "        [59, 45, 46, 58,  1, 46, 43,  1],\n",
            "        [43, 57,  1, 53, 50, 42,  1, 46],\n",
            "        [41, 47, 43, 52, 58,  1, 56, 47]])\n",
            "----\n",
            "When input is [58], the target is: 63\n",
            "When input is [58, 63], the target is: 8\n",
            "When input is [58, 63, 8], the target is: 0\n",
            "When input is [58, 63, 8, 0], the target is: 0\n",
            "When input is [58, 63, 8, 0, 0], the target is: 19\n",
            "When input is [58, 63, 8, 0, 0, 19], the target is: 24\n",
            "When input is [58, 63, 8, 0, 0, 19, 24], the target is: 27\n",
            "When input is [58, 63, 8, 0, 0, 19, 24, 27], the target is: 33\n",
            "When input is [39], the target is: 59\n",
            "When input is [39, 59], the target is: 45\n",
            "When input is [39, 59, 45], the target is: 46\n",
            "When input is [39, 59, 45, 46], the target is: 58\n",
            "When input is [39, 59, 45, 46, 58], the target is: 1\n",
            "When input is [39, 59, 45, 46, 58, 1], the target is: 46\n",
            "When input is [39, 59, 45, 46, 58, 1, 46], the target is: 43\n",
            "When input is [39, 59, 45, 46, 58, 1, 46, 43], the target is: 1\n",
            "When input is [49], the target is: 43\n",
            "When input is [49, 43], the target is: 57\n",
            "When input is [49, 43, 57], the target is: 1\n",
            "When input is [49, 43, 57, 1], the target is: 53\n",
            "When input is [49, 43, 57, 1, 53], the target is: 50\n",
            "When input is [49, 43, 57, 1, 53, 50], the target is: 42\n",
            "When input is [49, 43, 57, 1, 53, 50, 42], the target is: 1\n",
            "When input is [49, 43, 57, 1, 53, 50, 42, 1], the target is: 46\n",
            "When input is [52], the target is: 41\n",
            "When input is [52, 41], the target is: 47\n",
            "When input is [52, 41, 47], the target is: 43\n",
            "When input is [52, 41, 47, 43], the target is: 52\n",
            "When input is [52, 41, 47, 43, 52], the target is: 58\n",
            "When input is [52, 41, 47, 43, 52, 58], the target is: 1\n",
            "When input is [52, 41, 47, 43, 52, 58, 1], the target is: 56\n",
            "When input is [52, 41, 47, 43, 52, 58, 1, 56], the target is: 47\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "print(xb) # My input to the transformer"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Ai4YhFG_9nXk",
        "outputId": "71df7e96-f5f1-4d78-e26e-adba3dc63546"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "tensor([[58, 63,  8,  0,  0, 19, 24, 27],\n",
            "        [39, 59, 45, 46, 58,  1, 46, 43],\n",
            "        [49, 43, 57,  1, 53, 50, 42,  1],\n",
            "        [52, 41, 47, 43, 52, 58,  1, 56]])\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 11.**\n",
        "Now, we will introduce a Bigram Language Model using PyTorch. The Bigram Language Model is a type of language model that predicts the next word in a sentence based on the current word.\n",
        "\n",
        "The model is implemented as a custom PyTorch `nn.Module` named `BigramLanguageModel`.\n",
        "\n",
        "The model has two main components:\n",
        "\n",
        "- The token embedding table, implemented as an instance of `nn.Embedding` with `vocab_size` output features, which is used to embed each token (word) in the input sequence into a dense representation.\n",
        "- The `forward` method, which takes an input sequence `idx` of token indices (integers) and optional `targets` (also an integer sequence), and computes the logits for the next token predictions and the cross-entropy loss if targets are provided. The logits are computed by passing the input sequence through the token embedding table.\n",
        "\n",
        "The model also has a `generate` method, which generates a sequence of new tokens based on an initial context `idx` and a specified number of tokens `max_new_tokens` to generate. The method uses the `forward` method to obtain the logits for each new token, applies softmax to obtain probabilities, and samples from the distribution to obtain the next token index. The generated token indices are concatenated to the input sequence to obtain the new context for the next prediction.\n",
        "\n",
        "In the code, an instance of the `BigramLanguageModel` is created with a specified `vocab_size` and is used to compute the logits and loss for an input sequence `xb` and target sequence `yb`, and to generate a new sequence of tokens with an initial context of a single token with an index of 0 and 100 new tokens to generate."
      ],
      "metadata": {
        "id": "Sjj8oBUcCHHc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import torch\n",
        "import torch.nn as nn\n",
        "from torch.nn import functional as F\n",
        "torch.manual_seed(1337)\n",
        "\n",
        "class BigramLanguageModel(nn.Module):\n",
        "\n",
        "    def __init__(self, vocab_size):\n",
        "        super().__init__()\n",
        "        self.embedding = nn.Embedding(vocab_size, vocab_size)\n",
        "\n",
        "    def forward(self, idx, targets=None):\n",
        "        logits = self.embedding(idx)\n",
        "        B, T, C = logits.shape\n",
        "        logits = logits.reshape(B * T, C)\n",
        "        if targets is None:\n",
        "            return logits, None\n",
        "        targets = targets.reshape(-1)\n",
        "        loss = F.cross_entropy(logits, targets)\n",
        "        return logits, loss\n",
        "    \n",
        "    def generate(self, idx, max_new_tokens):\n",
        "        for i in range(max_new_tokens):\n",
        "            logits = self.embedding(idx)\n",
        "            logits = logits[:, -1, :]\n",
        "            probs = F.softmax(logits, dim=-1)\n",
        "            next_token = torch.multinomial(probs, num_samples=1)\n",
        "            idx = torch.cat((idx, next_token), dim=1)\n",
        "        return idx\n",
        "\n",
        "m = BigramLanguageModel(vocab_size)\n",
        "logits, loss = m(xb, yb)\n",
        "print(logits.shape)\n",
        "print(loss)\n",
        "\n",
        "print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0ynS-Z8h9vV7",
        "outputId": "f9f8ddaa-5afa-4dd0-fe1e-8aab4dbb9664"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([32, 65])\n",
            "tensor(5.0493, grad_fn=<NllLossBackward0>)\n",
            "\n",
            "SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp\n",
            "wnYWmnxKWWev-tDqXErVKLgJ\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 12.**\n",
        "\n",
        "Here, we create an instance of the `AdamW` optimizer in PyTorch. The optimizer will adjust the parameters of the model `m` to minimize the loss during training.\n",
        "\n",
        "The `AdamW` optimizer is a variant of the popular `Adam` optimizer that incorporates weight decay, which helps to regularize the model to prevent overfitting. The optimizer takes as input the parameters of the model `m` and sets their learning rate to 0.001."
      ],
      "metadata": {
        "id": "WisMJEvoC6-8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create a PyTorch optimizer\n",
        "optimizer = torch.optim.AdamW(m.parameters(), lr=0.001)"
      ],
      "metadata": {
        "id": "3nBbBfJl94aV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 13.**\n",
        "\n",
        "After then, we will train a PyTorch model using early stopping and TensorBoard logging.\n",
        "\n",
        "- A PyTorch AdamW optimizer is created with a learning rate of 0.001 for the model parameters.\n",
        "- A TensorBoard writer is created to log the training and validation losses.\n",
        "- The code trains the model in a loop of 100 steps (max_steps) and evaluates the model on the validation set at each step.\n",
        "- The best validation loss is recorded and compared with the current validation loss.\n",
        "- If the current validation loss is worse than the best validation loss for a number of early_stop_steps, the training stops and the loop exits.\n",
        "- The training loss is printed and logged in TensorBoard every 10 steps.\n",
        "- After the loop, the TensorBoard writer is closed. The final train loss is printed."
      ],
      "metadata": {
        "id": "jJlQZUK1D2EZ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from torch.utils.tensorboard import SummaryWriter\n",
        "\n",
        "batch_size = 32\n",
        "max_steps = 100 # Increase number of steps for better results... \n",
        "early_stop_steps = 10\n",
        "\n",
        "# Create a TensorBoard writer\n",
        "writer = SummaryWriter()\n",
        "\n",
        "# Keep track of the best validation loss\n",
        "best_val_loss = float(\"inf\")\n",
        "\n",
        "# Early stopping counter\n",
        "early_stop_counter = 0\n",
        "\n",
        "for steps in range(max_steps):\n",
        "    # Obtain a batch of data as sample\n",
        "    xb, yb = get_batch('train')\n",
        "\n",
        "    # Loss evaluation\n",
        "    logits, loss = m(xb, yb)\n",
        "    optimizer.zero_grad(set_to_none=True)\n",
        "    loss.backward()\n",
        "    optimizer.step()\n",
        "\n",
        "    if steps % 10 == 0:\n",
        "        print(\"Step {}: Train Loss: {}\".format(steps, loss.item()))\n",
        "        writer.add_scalar('train_loss', loss.item(), steps)\n",
        "    \n",
        "    # Evaluate the model on the validation set\n",
        "    with torch.no_grad():\n",
        "        xb, yb = get_batch('val')\n",
        "        logits, val_loss = m(xb, yb)\n",
        "        writer.add_scalar('val_loss', val_loss.item(), steps)\n",
        "        \n",
        "        if val_loss.item() < best_val_loss:\n",
        "            best_val_loss = val_loss.item()\n",
        "            early_stop_counter = 0\n",
        "        else:\n",
        "            early_stop_counter += 1\n",
        "        \n",
        "        if early_stop_counter >= early_stop_steps:\n",
        "            print(\"Early stopping at step {}\".format(steps))\n",
        "            break\n",
        "\n",
        "writer.close()\n",
        "print(\"Final Train Loss: {}\".format(loss.item()))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6ueVB1NQ97BV",
        "outputId": "1bdbf79e-688d-4036-c80e-6583c112bf51"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Step 0: Train Loss: 4.679175853729248\n",
            "Step 10: Train Loss: 4.761965751647949\n",
            "Early stopping at step 10\n",
            "Final Train Loss: 4.761965751647949\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### **Step 14.**\n",
        "\n",
        "Finally, we will generate text using a language model `m`, and prints the decoded result. The function `m.generate` generates text with the given input `idx`, which is an array of shape (1, 1) containing the starting index (e.g., the first word of the text), encoded as an integer. The argument `max_new_tokens=500` specifies the maximum number of tokens to generate in the output text. The generated text is then converted to a list of integers, `[0].tolist()`. Finally, the function `decode` is applied to the list of integers to convert the encoded text back to a string of human-readable words."
      ],
      "metadata": {
        "id": "SXBMCh7aEvel"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lAd6TN10-D9K",
        "outputId": "8a7de5a1-c2c2-4382-99aa-1c07310e9e13"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "\n",
            "dzX:COtRIDiOskzytNCxrfSjum;auw$CA'Oc!PO;DT:CSGKzzYC33,i!!:'ruHAUQJ;ZJi?NpHFP,h?jBjagJ.xYDHM,3-gnQQbjmJGGJEHxr'BirVMGXrvU,ZL$HA-!uD.vcWRlgH-s&LC,e?OJMo?yLTQb?qx;xzW f$-FZv$;.igBjU'AXgF-bGN&ZmZb&yFCaPZSJA'rA'KHx?w$YHAGjHRURSPwHo-W:MlapJ.\n",
            "jxLvUAQmZBL&$zdbA!BCPjxTfiKmJMQTFafjxI!udZV,SPGGSPlyYWNT;a;Q-BGrIu$Ca'PTR C&,SywwcPyFWgC3ryxfNd?EX&jF.WCq;3fq-ofcla!--UG&SBoiw'rt,rcIcmYcLC?OLfpOpX-ZK;vm,lDW?nZTbmJJrYdYZTH!abIJ&sXcoUEXrUZVm;K:vi-vTJaMPiH-UnZ??yFk$cOKBjThuq.ywEb$zLTQgUZayZ!pzd,RL&evVjZUAElx;pgOYPh\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Some short examples"
      ],
      "metadata": {
        "id": "YTBRfz94FDD9"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### (1) **Example 1**: Brief illustration of how matrix multiplication can perform a weighted combination\n",
        "\n",
        "The following creates a lower triangular matrix of ones (`a`) and normalizes the matrix along the rows. Then, it generates random weights from a normal distribution (`b`) and computes the weighted aggregation of `a` and `b` using matrix multiplication. The results are then printed."
      ],
      "metadata": {
        "id": "wNL57JVfbbRA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Set random seed for reproducibility\n",
        "torch.manual_seed(42)\n",
        "\n",
        "# Create a lower triangular matrix of ones\n",
        "a = torch.tril(torch.ones(3, 3))\n",
        "\n",
        "# Normalize the lower triangular matrix along the rows\n",
        "a = a / torch.sum(a, 1, keepdim=True)\n",
        "\n",
        "# Generate random weights from a normal distribution\n",
        "b = torch.randn(3, 2).float()\n",
        "\n",
        "# Compute the weighted aggregation using matrix multiplication\n",
        "c = a @ b\n",
        "\n",
        "# Print results\n",
        "print('a=')\n",
        "print(a)\n",
        "print('--')\n",
        "print('b=')\n",
        "print(b)\n",
        "print('--')\n",
        "print('c=')\n",
        "print(c)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "s2FySr9--Kam",
        "outputId": "6e9665e2-9a42-47fa-aa92-836dc7e7f080"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "a=\n",
            "tensor([[1.0000, 0.0000, 0.0000],\n",
            "        [0.5000, 0.5000, 0.0000],\n",
            "        [0.3333, 0.3333, 0.3333]])\n",
            "--\n",
            "b=\n",
            "tensor([[ 0.3367,  0.1288],\n",
            "        [ 0.2345,  0.2303],\n",
            "        [-1.1229, -0.1863]])\n",
            "--\n",
            "c=\n",
            "tensor([[ 0.3367,  0.1288],\n",
            "        [ 0.2856,  0.1796],\n",
            "        [-0.1839,  0.0576]])\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here, we set the random seed for reproducibility, generate a tensor with random values (`x`) of shape (`B`, `T`, `C`), and prints the shape of the tensor. The code then attempts to compute the cumulative sum of `x` along the second dimension (time) and divide it by the range of the time steps."
      ],
      "metadata": {
        "id": "6ajMexQy2mrW"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Set random seed for reproducibility\n",
        "torch.manual_seed(1337)\n",
        "\n",
        "# Define the batch size (B), time steps (T), and number of channels (C)\n",
        "B, T, C = 4, 8, 2\n",
        "\n",
        "# Generate a tensor with random values, with shape (B, T, C)\n",
        "x = torch.randn(B, T, C)\n",
        "\n",
        "# Print the shape of the tensor\n",
        "print(\"Shape of the tensor:\", x.shape)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hd9980Z6-O6d",
        "outputId": "2c72fb90-3684-4e97-f76f-6539ac0590af"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Shape of the tensor: torch.Size([4, 8, 2])\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The code calculates a moving average of a 3D tensor `x` with dimensions `[B, T, C]` along the time axis (`T`).\n",
        "\n",
        "`xbow` is initialized as a 3D tensor of zeros with the same dimensions as `x`.\n",
        "\n",
        "The calculation is performed by dividing the cumulative sum of the elements of `x` along the `T` axis by a sequence of integers from `1` to `T`. The `cumsum()` method is used to compute the cumulative sum along the `T` axis of `x`. The method returns a tensor with the same dimensions as `x`. The sequence of integers is created using `torch.arange()`, which returns a 1D tensor of consecutive integers.\n",
        "\n",
        "The resulting tensor is a 3D tensor of the same shape as `x`, where `xbow[b,t]` contains the average of `x[b,:t+1]`."
      ],
      "metadata": {
        "id": "DpmPjFbFWENd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# We want x[b,t] = mean_{i<=t} x[b,i]\n",
        "xbow = torch.zeros((B,T,C))\n",
        "\n",
        "# for b in range(B):\n",
        "#    for t in range(T):\n",
        "#        xprev = x[b,:t+1] # (t,C)\n",
        "#        xbow[b,t] = torch.mean(xprev, 0)\n",
        "\n",
        "xbow = x.cumsum(1) / torch.arange(1, T+1, dtype=torch.float32).unsqueeze(0).unsqueeze(2)"
      ],
      "metadata": {
        "id": "awaZDvu5-Q_l"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### (2) **Example 2**: Using matrix multiply for a weighted aggregation\n",
        "\n",
        "This code generates a tensor `wei` that is a lower triangular matrix with ones and then normalizes it along the rows. It then generates a tensor `x` with shape `(B, T, C)` and performs a matrix multiplication between `wei` and `x` using `@`. The resulting tensor `xbow2` has the same shape as `x` and each element `xbow2[b, t]` is the weighted average of all the previous elements in `x[b]` up to and including `x[b, t]`, where the weights are given by the corresponding element in the `t`-th row of `wei`. Finally, it checks if `xbow` and `xbow2` are element-wise close using the `torch.allclose()` function."
      ],
      "metadata": {
        "id": "hFZUljyX2DF8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "wei = torch.tril(torch.ones(T, T))\n",
        "wei = wei / wei.sum(1, keepdim=True)\n",
        "xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)\n",
        "torch.allclose(xbow, xbow2)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "TIgH4mop-Sf-",
        "outputId": "ca96936d-f418-45cf-f29c-69ebd4a44e65"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 45
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### (3) **Example 3**: Using Softmax\n",
        "\n",
        "The provided code appears to be calculating the weights for each time step in the tensor `x` using a mask of a lower triangular matrix and a subsequent softmax operation to obtain weights that sum to 1 across each row. These weights are then used to obtain a weighted average of the values in `x` for each time step."
      ],
      "metadata": {
        "id": "_PKTPITN3S-t"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create a mask for the lower triangular matrix of shape (T, T)\n",
        "mask = torch.tril(torch.ones(T, T)).bool()\n",
        "\n",
        "# Set the values outside the lower triangle to -inf to ensure zero weights\n",
        "mask = mask.float().masked_fill(~mask, float('-inf'))\n",
        "\n",
        "# Apply softmax to the masked values along the last dimension to obtain the weights\n",
        "weights = F.softmax(mask, dim=-1)\n",
        "\n",
        "# Obtain the weighted average of x using the computed weights\n",
        "xbow3 = torch.matmul(weights, x)\n",
        "\n",
        "# Check if xbow and xbow3 are equal\n",
        "torch.allclose(xbow, xbow3)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "j8t3c6D8-eCz",
        "outputId": "a6687287-dc79-4d51-c71a-02ab3a873fdb"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {},
          "execution_count": 46
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### (4) **Example 4**: Self-attention\n",
        "\n",
        "The code defines a single head of self-attention using linear transformations for the key, query, and value inputs. The input tensor `x` has shape `(B, T, C)` representing a batch of sequences with `T` timesteps and `C` features. The key and query are transformed using a linear layer with `C` input channels and `head_size` output channels, while the value is transformed using a linear layer with `C` input channels and `head_size` output channels.\n",
        "\n",
        "The dot product of the transformed query and key, divided by the square root of `head_size`, gives an attention weight matrix of shape `(B, T, T)`. A lower triangular mask is applied to the attention weight matrix to ensure that information flows only from the past to the present. The softmax operation is then applied along the last dimension of the resulting attention weight matrix to obtain a probability distribution. Finally, the output is obtained by multiplying the probability distribution with the value tensor.\n",
        "\n",
        "The output has shape `(B, T, head_size)`."
      ],
      "metadata": {
        "id": "ckZO2SVMXRnB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Set the random seed to a fixed value for reproducibility\n",
        "torch.manual_seed(1337)\n",
        "\n",
        "# Define the batch size (B), number of time steps (T), and number of channels (C)\n",
        "B,T,C = 4,8,32\n",
        "\n",
        "# Generate a tensor with random values of shape (B, T, C)\n",
        "x = torch.randn(B,T,C)\n",
        "\n",
        "# Define the size of each head in the attention mechanism\n",
        "head_size = 16\n",
        "\n",
        "# Define the linear layers for the keys, queries, and values\n",
        "key = nn.Linear(C, head_size, bias=False)\n",
        "query = nn.Linear(C, head_size, bias=False)\n",
        "value = nn.Linear(C, head_size, bias=False)\n",
        "\n",
        "# Apply the linear layer for the keys to the input tensor to get k of shape (B, T, head_size)\n",
        "k = key(x)\n",
        "\n",
        "# Apply the linear layer for the queries to the input tensor to get q of shape (B, T, head_size)\n",
        "q = query(x)\n",
        "\n",
        "# Calculate the attention weights by multiplying q and k transposed together. This produces a tensor wei of shape (B, T, T).\n",
        "wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)\n",
        "\n",
        "# Create a lower triangular matrix with ones and zero out the upper triangular part\n",
        "tril = torch.tril(torch.ones(T, T))\n",
        "\n",
        "# Replace the 0 elements in the triangular part of the attention weights with negative infinity\n",
        "wei = wei.masked_fill(tril == 0, float('-inf'))\n",
        "\n",
        "# Apply the softmax function along the last dimension of wei to get the final attention weights\n",
        "wei = F.softmax(wei, dim=-1)\n",
        "\n",
        "# Apply the linear layer for the values to the input tensor to get v of shape (B, T, head_size)\n",
        "v = value(x)\n",
        "\n",
        "# Compute the weighted sum of the values with the attention weights to get the output tensor of shape (B, T, head_size)\n",
        "out = wei @ v\n",
        "\n",
        "# Print the shape of the output tensor\n",
        "print(out.shape)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "olnwj18wYGZT",
        "outputId": "b10140d6-c47a-4e28-a57e-c8870219b51b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([4, 8, 16])\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "print(wei[0])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mltsK9eb-hol",
        "outputId": "3976be09-3664-47de-9f85-864287c33483"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n",
            "        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n",
            "        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],\n",
            "        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],\n",
            "        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],\n",
            "        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],\n",
            "        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],\n",
            "        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],\n",
            "       grad_fn=<SelectBackward0>)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Notes:\n",
        "- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.\n",
        "- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.\n",
        "- Each example across batch dimension is of course processed completely independently and never \"talk\" to each other\n",
        "- In an \"encoder\" attention block just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a \"decoder\" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.\n",
        "- \"self-attention\" just means that the keys and values are produced from the same source as queries. In \"cross-attention\", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)\n",
        "- \"Scaled\" attention additional divides wei by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below"
      ],
      "metadata": {
        "id": "SFBBh0e0-kus"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "The provided code computes the scaled dot-product attention between two tensors `k` and `q`, and applies the softmax function on the output to get the attention weights. \n",
        "\n",
        "- We calculate the scale factor outside the product of `q` and `k` and then multiply the product by the scale factor.\n",
        "- We explicitly cast the integer `head_size` to `float32` using `torch.float32`.\n",
        "We print the variance of `k`, `q`, and `wei` to ensure they have the expected values.\n",
        "We use the `softmax()` function from PyTorch instead of performing the softmax operation manually.\n",
        "We also use the `softmax()` function with a larger scaling factor (`8` in this case), which results in a smoother attention distribution."
      ],
      "metadata": {
        "id": "VIlguVrUY9Uo"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "k = torch.randn(B, T, head_size)\n",
        "q = torch.randn(B, T, head_size)\n",
        "scale = torch.rsqrt(torch.tensor(head_size, dtype=torch.float32))\n",
        "wei = q @ k.transpose(-2, -1) * scale\n",
        "\n",
        "print(\"variance of k:\", k.var())\n",
        "print(\"variance of q:\", q.var())\n",
        "print(\"variance of wei:\", wei.var())\n",
        "\n",
        "# softmax over last dimension\n",
        "softmaxed_wei = torch.softmax(wei, dim=-1)\n",
        "\n",
        "# a different softmax that reduces the \"peakiness\" of the attention distribution\n",
        "sm = torch.nn.Softmax(dim=-1)\n",
        "scaled_sm_wei = sm(wei * 8)\n"
      ],
      "metadata": {
        "id": "fMxREQRr-q48",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "9f7fe0ca-5bba-4a7f-d43c-6b773363aaf9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "variance of k: tensor(0.9006)\n",
            "variance of q: tensor(1.0037)\n",
            "variance of wei: tensor(0.9957)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zxFp_1pp-yn7",
        "outputId": "8b82a2e3-f312-489d-c331-6d999c4ce26c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])"
            ]
          },
          "metadata": {},
          "execution_count": 55
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "WZ0B65JF-z_d",
        "outputId": "5174af1a-32da-45f2-f93a-47f35b0de5a8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])"
            ]
          },
          "metadata": {},
          "execution_count": 56
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The code defines a `LayerNorm1d` class, which is used to implement Layer Normalization for 1D input tensors. The `__init__` method sets the values of the `eps` parameter and the learnable parameters `gamma` and `beta`. The `__call__` method takes an input tensor `x`, computes the layer normalization operation, and returns the normalized output. The `parameters` method returns the learnable parameters of the layer.\n",
        "\n",
        "The code initializes an instance of the `LayerNorm1d` class with 100 dimensions and applies it to a 32 x 100 input tensor. The output tensor has the same shape as the input tensor."
      ],
      "metadata": {
        "id": "PlrbNR4FZxeh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "class LayerNorm1d:\n",
        "    def __init__(self, dim, eps=1e-5, momentum=0.1):\n",
        "        self.eps = eps\n",
        "        self.gamma = torch.ones(dim)\n",
        "        self.beta = torch.zeros(dim)\n",
        "        \n",
        "    def __call__(self, x):\n",
        "        # Calculate the forward pass\n",
        "        xmean = x.mean(1, keepdim=True)  # Batch mean\n",
        "        xvar = x.var(1, keepdim=True)  # Batch variance\n",
        "        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)  # Normalize to unit variance\n",
        "        self.out = self.gamma * xhat + self.beta\n",
        "        return self.out\n",
        "    \n",
        "    def parameters(self):\n",
        "        return [self.gamma, self.beta]\n",
        "\n",
        "# Instantiate the module\n",
        "torch.manual_seed(1337)\n",
        "module = LayerNorm1d(100)\n",
        "\n",
        "# Generate a batch of size 32 with 100-dimensional vectors\n",
        "x = torch.randn(32, 100)\n",
        "\n",
        "# Apply the layer normalization to the batch\n",
        "x = module(x)\n",
        "\n",
        "# Print the shape of the output tensor\n",
        "print(f\"Output shape: {x.shape}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "QoGxDngl-1ar",
        "outputId": "ca6e40c0-0578-4464-add1-805a5aada10d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Output shape: torch.Size([32, 100])\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Compute the mean and standard deviation of the first feature across all inputs in the batch\n",
        "batch_feature_mean, batch_feature_std = x[:, 0].mean(), x[:, 0].std()\n",
        "print(f\"Batch feature mean: {batch_feature_mean}, Batch feature std: {batch_feature_std}\")\n",
        "\n",
        "# Compute the mean and standard deviation of all features for a single input in the batch\n",
        "single_input_mean, single_input_std = x[0, :].mean(), x[0, :].std()\n",
        "print(f\"Single input mean: {single_input_mean}, Single input std: {single_input_std}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "IflOPa0Ja_eZ",
        "outputId": "92f5fdfa-5ae9-4847-c59b-2895f2a7d539"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Batch feature mean: 0.14685693383216858, Batch feature std: 0.8803138732910156\n",
            "Single input mean: -9.53674295089968e-09, Single input std: 0.9999954700469971\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### References\n",
        "- `nanoGPT`: https://github.com/karpathy/nanoGPT\n",
        "- `SentencePiece`: https://github.com/google/sentencepiece\n",
        "- `Attention Is All You Need`: https://arxiv.org/abs/1706.03762\n",
        "- `Training language models to follow instructions with human feedback`: https://cdn.openai.com/papers/Training_language_models_to_follow_instructions_with_human_feedback.pdf\n",
        "- `The New Version of GPT-3 Is Much, Much Better`: https://towardsdatascience.com/the-new-version-of-gpt-3-is-much-much-better-53ac95f21cfb"
      ],
      "metadata": {
        "id": "cZyEYxCU8ZRI"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### As a Whole"
      ],
      "metadata": {
        "id": "VtvuQcY0eNQB"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "This is a PyTorch implementation of a language model using a transformer architecture. Here's a brief explanation of the code:\n",
        "\n",
        "- Sets hyperparameters for the model, such as batch size, block size, and learning rate.\n",
        "- Loads input text data from a file and creates character mappings.\n",
        "- Splits data into training and validation sets.\n",
        "- Defines a function to load batches of data for training or validation.\n",
        "- Defines a function to estimate loss during training.\n",
        "- Defines the self-attention head, which is a component of the transformer architecture.\n",
        "- Defines the multi-head attention module, which consists of multiple self-attention heads in parallel.\n",
        "- Defines the feed-forward module, which is another component of the transformer architecture.\n",
        "- Defines the transformer block, which is a combination of the self-attention head and feed-forward module.\n",
        "\n",
        "Overall, this code defines a language model that uses a transformer architecture and is trained using stochastic gradient descent with the AdamW optimizer. The transformer architecture consists of multiple transformer blocks, which are composed of self-attention heads and feed-forward modules. The self-attention heads enable the model to consider the relationships between all tokens in a sequence, while the feed-forward modules provide additional nonlinear transformations. The model is trained to predict the next token in a sequence given a context of previous tokens. The code prints the train and validation loss at specified intervals during training, and generates new text using the trained model."
      ],
      "metadata": {
        "id": "BCUtL3hYd-5u"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import torch\n",
        "import torch.nn as nn\n",
        "from torch.nn import functional as F\n",
        "\n",
        "# Hyperparameters\n",
        "batch_size = 16 \n",
        "block_size = 32 \n",
        "max_iters = 5000\n",
        "eval_interval = 100\n",
        "learning_rate = 1e-3\n",
        "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
        "eval_iters = 200\n",
        "n_embd = 64\n",
        "n_head = 4\n",
        "n_layer = 4\n",
        "dropout = 0.0\n",
        "\n",
        "torch.manual_seed(1337)\n",
        "\n",
        "# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt\n",
        "with open('input.txt', 'r', encoding='utf-8') as f:\n",
        "    text = f.read()\n",
        "\n",
        "# Create a sorted list of unique characters in the text\n",
        "chars = sorted(list(set(text)))\n",
        "\n",
        "# Get the number of unique characters\n",
        "vocab_size = len(chars)\n",
        "\n",
        "# Create a dictionary mapping each character to an integer index\n",
        "stoi = {ch: i for i, ch in enumerate(chars)}\n",
        "\n",
        "# Create a dictionary mapping each integer index to a character\n",
        "itos = {i: ch for i, ch in enumerate(chars)}\n",
        "\n",
        "# Define a lambda function to encode a string as a list of integers\n",
        "encode = lambda s: [stoi[c] for c in s]\n",
        "\n",
        "# Define a lambda function to decode a list of integers as a string\n",
        "decode = lambda l: ''.join([itos[i] for i in l])\n",
        "\n",
        "# Convert the text to a tensor of integers\n",
        "data = torch.tensor(encode(text), dtype=torch.long)\n",
        "\n",
        "# Split the data into train and validation sets\n",
        "n = int(0.9 * len(data))  # use first 90% for training, last 10% for validation\n",
        "train_data = data[:n]\n",
        "val_data = data[n:]\n",
        "\n",
        "# Define a function to generate a small batch of input-target pairs\n",
        "def get_batch(split):\n",
        "    # Select either the training or validation set\n",
        "    data = train_data if split == 'train' else val_data\n",
        "    \n",
        "    # Generate random indices to start each block of input\n",
        "    ix = torch.randint(len(data) - block_size, (batch_size,))\n",
        "    \n",
        "    # Select block_size characters starting at each index for input\n",
        "    x = torch.stack([data[i:i + block_size] for i in ix])\n",
        "    \n",
        "    # Select block_size characters starting at each index + 1 for target\n",
        "    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])\n",
        "    \n",
        "    # Send tensors to GPU if available\n",
        "    x, y = x.to(device), y.to(device)\n",
        "    return x, y\n",
        "\n",
        "# Define a function to estimate the model's loss on the train and validation sets\n",
        "@torch.no_grad()\n",
        "def estimate_loss():\n",
        "    out = {}\n",
        "    model.eval()\n",
        "    for split in ['train', 'val']:\n",
        "        losses = torch.zeros(eval_iters)\n",
        "        for k in range(eval_iters):\n",
        "            X, Y = get_batch(split)\n",
        "            logits, loss = model(X, Y)\n",
        "            losses[k] = loss.item()\n",
        "        out[split] = losses.mean()\n",
        "    model.train()\n",
        "    return out\n",
        "\n",
        "\n",
        "class Head(nn.Module):\n",
        "    \"\"\" one head of self-attention \"\"\"\n",
        "\n",
        "    def __init__(self, head_size):\n",
        "        super().__init__()\n",
        "\n",
        "        # Linear transformations for key, query, and value\n",
        "        self.key = nn.Linear(n_embd, head_size, bias=False)\n",
        "        self.query = nn.Linear(n_embd, head_size, bias=False)\n",
        "        self.value = nn.Linear(n_embd, head_size, bias=False)\n",
        "\n",
        "        # Lower triangular matrix to mask future values\n",
        "        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))\n",
        "\n",
        "        # Dropout layer\n",
        "        self.dropout = nn.Dropout(dropout)\n",
        "\n",
        "    def forward(self, x):\n",
        "        # Get batch size, sequence length, and number of features\n",
        "        B, T, C = x.shape\n",
        "\n",
        "        # Linear transformations of key and query\n",
        "        k = self.key(x)   # (B,T,C)\n",
        "        q = self.query(x) # (B,T,C)\n",
        "\n",
        "        # Compute attention scores (\"affinities\")\n",
        "        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)\n",
        "\n",
        "        # Mask future values\n",
        "        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)\n",
        "\n",
        "        # Apply softmax to get attention weights\n",
        "        wei = F.softmax(wei, dim=-1) # (B, T, T)\n",
        "\n",
        "        # Apply dropout to attention weights\n",
        "        wei = self.dropout(wei)\n",
        "\n",
        "        # Linear transformation of value\n",
        "        v = self.value(x) # (B,T,C)\n",
        "\n",
        "        # Weighted sum of values using attention weights\n",
        "        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)\n",
        "\n",
        "        return out\n",
        "\n",
        "\n",
        "class MultiHeadAttention(nn.Module):\n",
        "    \"\"\" multiple heads of self-attention in parallel \"\"\"\n",
        "\n",
        "    def __init__(self, num_heads, head_size):\n",
        "        super().__init__()  # Initialize the superclass (nn.Module)\n",
        "        # Instantiate a list of head modules, and assign it to self.heads\n",
        "        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])\n",
        "        # A linear transformation to project the concatenated attention heads back to the input dimension\n",
        "        self.proj = nn.Linear(n_embd, n_embd)\n",
        "        # Dropout layer to avoid overfitting\n",
        "        self.dropout = nn.Dropout(dropout)\n",
        "\n",
        "    def forward(self, x):\n",
        "        # Apply the self-attention mechanism by passing the input through each attention head and concatenate the results along the feature dimension\n",
        "        out = torch.cat([h(x) for h in self.heads], dim=-1)\n",
        "        # Apply the projection layer and the dropout layer to the concatenated output\n",
        "        out = self.dropout(self.proj(out))\n",
        "        return out\n",
        "\n",
        "\n",
        "class FeedFoward(nn.Module):\n",
        "    \"\"\" a simple linear layer followed by a non-linearity \"\"\"\n",
        "\n",
        "    def __init__(self, n_embd):\n",
        "        super().__init__()\n",
        "\n",
        "        # Define a sequential neural network composed of two linear layers and a ReLU activation function\n",
        "        self.net = nn.Sequential(\n",
        "            nn.Linear(n_embd, 4 * n_embd),\n",
        "            nn.ReLU(),\n",
        "            nn.Linear(4 * n_embd, n_embd),\n",
        "            nn.Dropout(dropout),\n",
        "        )\n",
        "\n",
        "    def forward(self, x):\n",
        "        # Forward pass through the neural network\n",
        "        return self.net(x)\n",
        "\n",
        "\n",
        "class Block(nn.Module):\n",
        "    \"\"\" Transformer block: communication followed by computation \"\"\"\n",
        "\n",
        "    def __init__(self, n_embd, n_head):\n",
        "        # n_embd: embedding dimension, n_head: the number of heads we'd like\n",
        "        super().__init__()\n",
        "        head_size = n_embd // n_head\n",
        "        \n",
        "        # Multi-head self-attention layer\n",
        "        self.sa = MultiHeadAttention(n_head, head_size)\n",
        "        \n",
        "        # Feedforward layer\n",
        "        self.ffwd = FeedFoward(n_embd)\n",
        "        \n",
        "        # Layer normalization layer 1\n",
        "        self.ln1 = nn.LayerNorm(n_embd)\n",
        "        \n",
        "        # Layer normalization layer 2\n",
        "        self.ln2 = nn.LayerNorm(n_embd)\n",
        "\n",
        "    def forward(self, x):\n",
        "        # Communication followed by computation\n",
        "        x = x + self.sa(self.ln1(x))\n",
        "        x = x + self.ffwd(self.ln2(x))\n",
        "        return x\n",
        "\n",
        "\n",
        "# super simple bigram model\n",
        "class BigramLanguageModel(nn.Module):\n",
        "\n",
        "    def __init__(self):\n",
        "        super().__init__()\n",
        "        # Define the model architecture using embedding, multi-head attention, and linear layers\n",
        "        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)  # Lookup table for token embedding\n",
        "        self.position_embedding_table = nn.Embedding(block_size, n_embd)  # Lookup table for position embedding\n",
        "        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])  # A sequence of transformer blocks\n",
        "        self.ln_f = nn.LayerNorm(n_embd)  # Layer normalization\n",
        "        self.lm_head = nn.Linear(n_embd, vocab_size)  # Linear layer to get logits\n",
        "\n",
        "    def forward(self, idx, targets=None):\n",
        "        B, T = idx.shape\n",
        "\n",
        "        # Embed tokens and positions\n",
        "        tok_emb = self.token_embedding_table(idx)  # (B,T,C)\n",
        "        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T,C)\n",
        "        x = tok_emb + pos_emb  # (B,T,C)\n",
        "\n",
        "        # Pass through the transformer blocks\n",
        "        x = self.blocks(x)  # (B,T,C)\n",
        "\n",
        "        # Apply layer normalization and linear layer\n",
        "        x = self.ln_f(x)  # (B,T,C)\n",
        "        logits = self.lm_head(x)  # (B,T,vocab_size)\n",
        "\n",
        "        # Compute the cross-entropy loss if targets are provided\n",
        "        if targets is None:\n",
        "            loss = None\n",
        "        else:\n",
        "            B, T, C = logits.shape\n",
        "            logits = logits.view(B*T, C)\n",
        "            targets = targets.view(B*T)\n",
        "            loss = F.cross_entropy(logits, targets)\n",
        "\n",
        "        return logits, loss\n",
        "\n",
        "    def generate(self, idx, max_new_tokens):\n",
        "        # Generate new text by sampling from the learned distribution\n",
        "        for _ in range(max_new_tokens):\n",
        "            # Crop idx to the last block_size tokens\n",
        "            idx_cond = idx[:, -block_size:]\n",
        "            # Get the predictions\n",
        "            logits, loss = self(idx_cond)\n",
        "            # Focus only on the last time step\n",
        "            logits = logits[:, -1, :]  # becomes (B, C)\n",
        "            # Apply softmax to get probabilities\n",
        "            probs = F.softmax(logits, dim=-1)  # (B, C)\n",
        "            # Sample from the distribution\n",
        "            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)\n",
        "            # Append the sampled index to the running sequence\n",
        "            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)\n",
        "        return idx\n",
        "\n",
        "# Instantiate the bigram language model\n",
        "model = BigramLanguageModel()  \n",
        "\n",
        "# Move the model to the specified device (CPU or GPU)\n",
        "m = model.to(device)  \n",
        "\n",
        "# Print the number of parameters in the model\n",
        "print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')  \n",
        "\n",
        "# Create a PyTorch optimizer with the AdamW algorithm\n",
        "optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)\n",
        "\n",
        "# Loop over training iterations\n",
        "for iter in range(max_iters):  \n",
        "\n",
        "    # Every once in a while evaluate the loss on train and val sets\n",
        "    if iter % eval_interval == 0 or iter == max_iters - 1:\n",
        "        losses = estimate_loss()\n",
        "        print(f\"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}\")\n",
        "\n",
        "    # Sample a batch of data (inputs and targets)\n",
        "    xb, yb = get_batch('train')\n",
        "\n",
        "    # Evaluate the loss and update the model parameters\n",
        "    logits, loss = model(xb, yb)\n",
        "    optimizer.zero_grad(set_to_none=True)\n",
        "    loss.backward()\n",
        "    optimizer.step()\n",
        "\n",
        "# Generate text from the model\n",
        "\n",
        "# Initialize the context with a zero tensor\n",
        "context = torch.zeros((1, 1), dtype=torch.long, device=device)  \n",
        "\n",
        "# Generate a sequence of tokens using the model\n",
        "generated_sequence = m.generate(context, max_new_tokens=2000)\n",
        "\n",
        "# Decode the generated sequence into a string\n",
        "decoded_sequence = decode(generated_sequence[0].tolist())  \n",
        "\n",
        "# Print the generated text\n",
        "print(decoded_sequence)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "q0zYKKEh--V8",
        "outputId": "8b94df00-3710-4207-cc88-3840cd0ad8c9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.209729 M parameters\n",
            "step 0: train loss 4.4116, val loss 4.4022\n",
            "step 100: train loss 2.6568, val loss 2.6670\n",
            "step 200: train loss 2.5090, val loss 2.5058\n",
            "step 300: train loss 2.4198, val loss 2.4340\n",
            "step 400: train loss 2.3503, val loss 2.3567\n",
            "step 500: train loss 2.2970, val loss 2.3136\n",
            "step 600: train loss 2.2410, val loss 2.2506\n",
            "step 700: train loss 2.2062, val loss 2.2198\n",
            "step 800: train loss 2.1638, val loss 2.1871\n",
            "step 900: train loss 2.1232, val loss 2.1494\n",
            "step 1000: train loss 2.1020, val loss 2.1293\n",
            "step 1100: train loss 2.0704, val loss 2.1196\n",
            "step 1200: train loss 2.0382, val loss 2.0798\n",
            "step 1300: train loss 2.0249, val loss 2.0640\n",
            "step 1400: train loss 1.9922, val loss 2.0354\n",
            "step 1500: train loss 1.9707, val loss 2.0308\n",
            "step 1600: train loss 1.9614, val loss 2.0474\n",
            "step 1700: train loss 1.9393, val loss 2.0130\n",
            "step 1800: train loss 1.9070, val loss 1.9943\n",
            "step 1900: train loss 1.9057, val loss 1.9871\n",
            "step 2000: train loss 1.8834, val loss 1.9954\n",
            "step 2100: train loss 1.8719, val loss 1.9758\n",
            "step 2200: train loss 1.8582, val loss 1.9623\n",
            "step 2300: train loss 1.8546, val loss 1.9517\n",
            "step 2400: train loss 1.8410, val loss 1.9476\n",
            "step 2500: train loss 1.8167, val loss 1.9455\n",
            "step 2600: train loss 1.8263, val loss 1.9401\n",
            "step 2700: train loss 1.8108, val loss 1.9340\n",
            "step 2800: train loss 1.8040, val loss 1.9247\n",
            "step 2900: train loss 1.8044, val loss 1.9304\n",
            "step 3000: train loss 1.7963, val loss 1.9242\n",
            "step 3100: train loss 1.7687, val loss 1.9147\n",
            "step 3200: train loss 1.7547, val loss 1.9102\n",
            "step 3300: train loss 1.7557, val loss 1.9037\n",
            "step 3400: train loss 1.7547, val loss 1.8946\n",
            "step 3500: train loss 1.7385, val loss 1.8968\n",
            "step 3600: train loss 1.7260, val loss 1.8914\n",
            "step 3700: train loss 1.7257, val loss 1.8808\n",
            "step 3800: train loss 1.7204, val loss 1.8919\n",
            "step 3900: train loss 1.7215, val loss 1.8788\n",
            "step 4000: train loss 1.7146, val loss 1.8639\n",
            "step 4100: train loss 1.7095, val loss 1.8724\n",
            "step 4200: train loss 1.7079, val loss 1.8707\n",
            "step 4300: train loss 1.7035, val loss 1.8502\n",
            "step 4400: train loss 1.7043, val loss 1.8693\n",
            "step 4500: train loss 1.6914, val loss 1.8522\n",
            "step 4600: train loss 1.6853, val loss 1.8357\n",
            "step 4700: train loss 1.6862, val loss 1.8483\n",
            "step 4800: train loss 1.6671, val loss 1.8434\n",
            "step 4900: train loss 1.6736, val loss 1.8415\n",
            "step 4999: train loss 1.6635, val loss 1.8226\n",
            "\n",
            "FlY BOLINGLO:\n",
            "Them thrumply towiter arts the\n",
            "muscue rike begatt the sea it\n",
            "What satell in rowers that some than othis Marrity.\n",
            "\n",
            "LUCENTVO:\n",
            "But userman these that, where can is not diesty rege;\n",
            "What and see to not. But's eyes. What?\n",
            "\n",
            "JOHN MARGARET:\n",
            "Than up I wark, what out, I ever of and love,\n",
            "one these do sponce, vois I me;\n",
            "But my pray sape to ries all to the not erralied in may.\n",
            "\n",
            "BENVOLIO:\n",
            "To spits as stold's bewear I would and say mesby all\n",
            "on sworn make he anough\n",
            "As cousins the solle, whose be my conforeful may lie them yet\n",
            "nobe allimely untraled to be thre I say be,\n",
            "Notham a brotes theme an make come,\n",
            "And that his reach to the duke ento\n",
            "the grmeants bell! and now there king-liff-or grief?\n",
            "\n",
            "GLOUCESTER:\n",
            "All the bettle dreene, for To his like thou thron!\n",
            "\n",
            "MENENIUS:\n",
            "Then, if I knom her all.\n",
            "My lord, but terruly friend\n",
            "Rish of the ploceiness and wilt tends sure?\n",
            "Is you knows a fasir wead\n",
            "That with him my spaut,\n",
            "I shall not tas where's not, becomity; my coulds sting,\n",
            "then the wit be dong to tyget our hereefore,\n",
            "Who strop me, mend here, if agains, bitten, thy lack.\n",
            "The but these it were is tus. For the her skeep the fasting. joy tweet Bumner:-\n",
            "How the enclady: It you and how,\n",
            "I am in him, And ladderle:\n",
            "Their hand whose wife, it my hithre,\n",
            "Roman and where sposs gives'd you.\n",
            "\n",
            "TROMIOLANUS:\n",
            "But livants you great, I shom mistrot come, for to she to lot\n",
            "for smy to men ventry mehus. Gazise;\n",
            "Full't were some the cause, and stouch set,\n",
            "Or promises, which a kingsasted to your gove them; and sterrer,\n",
            "And that wae love him.\n",
            "\n",
            "BRUTUS:\n",
            "You shape with these sweet.\n",
            "\n",
            "CORTENGONO:\n",
            "Lo, where 'twon elmes, 'morth young agres;\n",
            "Sir, azavoust to striel accurded we missery sets crave.\n",
            "\n",
            "ANGOLUM:\n",
            "For is Henry to have gleise the dreason\n",
            "That I ant shorfold wefth their servy in enscy.\n",
            "\n",
            "ISABELLA:\n",
            "O, I better you eyse such formfetrews.\n",
            "\n",
            "BUCKINGHARENT:\n",
            "Qead my lightle this righanneds flase them\n",
            "Wam which an take was our some pleasurs,\n",
            "Lovisoname to me, then fult me?--have it?\n",
            "\n",
            "HENRY BOLINGBROY:\n",
            "That wha\n"
          ]
        }
      ]
    }
  ]
}