{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Complete the transformer architecture" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set up the env\n", "\n", "import pytest\n", "import ipytest\n", "import unittest\n", "\n", "ipytest.autoconfig()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformer Model\n", "\n", "The encoder-decoder architecture based on the Transformer structure is illustrated in figure below. The left and right sides correspond to the encoder and decoder structures, respectively. They consist of several basic Transformer blocks (represented by the gray boxes in the figure), stacked N times. Each component comprises multiple Transformer blocks, which are stacked N times.\n", "\n", "Here's an overview of the key components and processes involved in the semantic abstraction process from input to output:\n", "\n", "Encoder:\n", "\n", "The encoder takes an input sequence {xi}ti=1, where each xi represents the representation of a word in the text sequence.\n", "It consists of stacked Transformer blocks. Each block includes:\n", "Attention Layer: Utilizes multi-head attention mechanisms to capture dependencies between words in the input sequence, facilitating the modeling of long-range dependencies without traditional recurrent structures.\n", "Position-wise Feedforward Layer: Applies complex transformations to the representations of each word in the input sequence.\n", "Residual Connections: Directly connect the input and output of the attention and feedforward layers, aiding in efficient information flow and model optimization.\n", "Layer Normalization: Normalizes the output representations of the attention and feedforward layers, stabilizing optimization.\n", "Decoder:\n", "\n", "The decoder generates an output sequence {yi}ti=1 based on the representations learned by the encoder.\n", "Similar to the encoder, it consists of stacked Transformer blocks, each including the same components as described above.\n", "In addition, the decoder includes an additional attention mechanism that focuses on the encoder's output to incorporate context information during sequence generation.\n", "Overall, the encoder-decoder architecture based on the Transformer structure allows for effective semantic abstraction by leveraging attention mechanisms, position-wise feedforward layers, residual connections, and layer normalization. This architecture enables the model to capture complex dependencies between words in the input sequence and generate meaningful outputs for various sequence-to-sequence tasks.\n", "\n", ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/llm/Transformer-python-%281%29.png\n", "Transformer-based encoder and decoder Architecture\n", ":::\n", "\n", "Next, we'll discuss the specific functionalities and implementation methods of each module in detail." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Embedding Layer\n", "\n", "The Embedding Layer in the Transformer model is responsible for converting discrete token indices into continuous vector representations. Each token index is mapped to a high-dimensional vector, which is learned during the training process. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Embedding Layer\n", "\n", "The embedding layer in the Transformer model is responsible for converting discrete token indices into continuous vector representations. Each token index is mapped to a high-dimensional vector that is learned during training, and these embeddings capture semantic and syntactic information about the tokens. Because the attention mechanism itself has no notion of word order, the embeddings are combined with positional encodings that inject information about each token's position in the sequence.\n", "\n", "Implementation in PyTorch:\n", "\n", "- We define a `PositionalEncoder` class that inherits from `nn.Module`.\n", "- The constructor builds the positional encoding matrix `pe` from the given `d_model` (dimension of the model) and `max_seq_len` (maximum sequence length) and registers it as a buffer.\n", "- The `forward` method scales the input embeddings `x` by the square root of the model dimension and adds the positional encoding matrix `pe` to them.\n", "- Because `pe` is registered as a buffer, it is saved and moved between devices together with the model, but it is not a learnable parameter and no gradients flow through it.\n", "- Finally, the `PositionalEncoder` class can be used within a larger PyTorch model to incorporate positional information into word embeddings (see the usage sketch after the implementation)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "import torch.nn as nn\n", "import math\n", "import copy\n", "import time\n", "import torch.optim as optim\n", "import torch.nn.functional as F\n", "from torch.autograd import Variable\n", "import numpy as np\n", "\n", "class PositionalEncoder(nn.Module):\n", "    def __init__(self, d_model, max_seq_len=80):\n", "        super().__init__()\n", "        self.d_model = d_model\n", "        # Creating a constant PE matrix based on pos and i:\n", "        # PE(pos, 2k) = sin(pos / 10000^(2k/d_model)), PE(pos, 2k+1) = cos(pos / 10000^(2k/d_model))\n", "        pe = torch.zeros(max_seq_len, d_model)\n", "        for pos in range(max_seq_len):\n", "            for i in range(0, d_model, 2):\n", "                pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))\n", "                pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))\n", "        pe = pe.unsqueeze(0)\n", "        # Registering pe as a buffer: stored with the model and moved across devices, but not trained\n", "        self.register_buffer('pe', pe)\n", "\n", "    def forward(self, x):\n", "        # Scaling word embeddings to make them relatively larger than the positional constants\n", "        x = x * math.sqrt(self.d_model)\n", "        # Adding positional constants to word embedding representations\n", "        seq_len = x.size(1)\n", "        x = x + self.pe[:, :seq_len]\n", "        return x" ] },
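{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick usage sketch, the cell below combines a learned token embedding (`nn.Embedding`) with the `PositionalEncoder` defined above. The wrapper class `TokenAndPositionEmbedding` and the vocabulary size and batch used here are illustrative assumptions for demonstration, not part of the model built later in this chapter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Usage sketch (illustrative): learned token embeddings + fixed positional encodings\n", "vocab_size, d_model = 1000, 512  # assumed values for demonstration\n", "\n", "class TokenAndPositionEmbedding(nn.Module):\n", "    def __init__(self, vocab_size, d_model, max_seq_len=80):\n", "        super().__init__()\n", "        self.embed = nn.Embedding(vocab_size, d_model)  # learned token embeddings\n", "        self.pos = PositionalEncoder(d_model, max_seq_len)  # fixed sinusoidal encodings\n", "\n", "    def forward(self, tokens):\n", "        # tokens: (batch, seq_len) of token indices -> (batch, seq_len, d_model)\n", "        return self.pos(self.embed(tokens))\n", "\n", "emb = TokenAndPositionEmbedding(vocab_size, d_model)\n", "tokens = torch.randint(0, vocab_size, (2, 10))  # a dummy batch of token indices\n", "print(emb(tokens).shape)  # torch.Size([2, 10, 512])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "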