{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"collapsed_sections": [],
"mount_file_id": "1DXQ2nyL3PBZqaxXXHuOji_Ff49Tcu7Ws",
"authorship_tag": "ABX9TyP7AbGxg6Hn7Mz56WhP0tfb",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"gpuClass": "standard",
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"source": [
"# Natural Language Processing Demystified | Transformers, Pre-training, and Transfer Learning\n",
"https://nlpdemystified.org
\n",
"https://github.com/nitinpunjabi/nlp-demystified
\n",
"\n",
"Course module for this demo: https://www.nlpdemystified.org/course/transformers"
],
"metadata": {
"id": "KmEyadzTtGxY"
}
},
{
"cell_type": "markdown",
"source": [
"**IMPORTANT**
\n",
"Enable **GPU acceleration** by going to *Runtime > Change Runtime Type*. Keep in mind that, on certain tiers, you're not guaranteed GPU access depending on usage history and current load.\n",
"
\n",
"Also, if you're running this in the cloud rather than a local Jupyter server on your machine, then the notebook will *timeout* after a period of inactivity.\n",
"
\n",
"Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:
\n",
"https://research.google.com/colaboratory/local-runtimes.html"
],
"metadata": {
"id": "uOVYaAveQJia"
}
},
{
"cell_type": "code",
"source": [
"!pip install BPEmb\n",
"\n",
"import math\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"from bpemb import BPEmb"
],
"metadata": {
"id": "vWyjB-YNwTG_"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Transformers From Scratch"
],
"metadata": {
"id": "MenE2varZEXc"
}
},
{
"cell_type": "markdown",
"source": [
"We'll build a transformer from scratch, layer-by-layer. We'll start with the **Multi-Head Self-Attention** layer since that's the most involved bit. Once we have that working, the rest of the model will look familiar if you've been following the course so far."
],
"metadata": {
"id": "mDkTVv3KMJX_"
}
},
{
"cell_type": "markdown",
"source": [
"## Multi-Head Self-Attention"
],
"metadata": {
"id": "LqX04fFXBdxy"
}
},
{
"cell_type": "markdown",
"source": [
"#### Scaled Dot Product Self-Attention"
],
"metadata": {
"id": "-XnKHnlYyijq"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"Inside each attention head is a **Scaled Dot Product Self-Attention** operation as we covered in the slides. Given *queries*, *keys*, and *values*, the operation returns a new \"mix\" of the values.\n",
"\n",
"$$Attention(Q, K, V) = softmax(\\frac{QK^T)}{\\sqrt{d_k}})V$$\n",
"\n",
"The following function implements this and also takes a mask to account for padding and for masking future tokens for decoding (i.e. **look-ahead mask**). We'll cover masking later in the notebook."
],
"metadata": {
"id": "3NAf9HP7RsQu"
}
},
{
"cell_type": "code",
"source": [
"def scaled_dot_product_attention(query, key, value, mask=None):\n",
" key_dim = tf.cast(tf.shape(key)[-1], tf.float32)\n",
" scaled_scores = tf.matmul(query, key, transpose_b=True) / np.sqrt(key_dim)\n",
"\n",
" if mask is not None:\n",
" scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)\n",
"\n",
" softmax = tf.keras.layers.Softmax()\n",
" weights = softmax(scaled_scores) \n",
" return tf.matmul(weights, value), weights"
],
"metadata": {
"id": "7hpO6cGEN7HK"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Suppose our *queries*, *keys*, and *values* are each a length of 3 with a dimension of 4."
],
"metadata": {
"id": "lC_HhsreXh3H"
}
},
{
"cell_type": "code",
"source": [
"seq_len = 3\n",
"embed_dim = 4\n",
"\n",
"queries = np.random.rand(seq_len, embed_dim)\n",
"keys = np.random.rand(seq_len, embed_dim)\n",
"values = np.random.rand(seq_len, embed_dim)\n",
"\n",
"print(\"Queries:\\n\", queries)"
],
"metadata": {
"id": "WB2cDybgX5LZ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"This would be the self-attention output and weights."
],
"metadata": {
"id": "QuNdMuz5vb1c"
}
},
{
"cell_type": "code",
"source": [
"output, attn_weights = scaled_dot_product_attention(queries, keys, values)\n",
"\n",
"print(\"Output\\n\", output, \"\\n\")\n",
"print(\"Weights\\n\", attn_weights)"
],
"metadata": {
"id": "pxKj56hNX5UO"
},
"execution_count": null,
"outputs": []
},
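{
"cell_type": "markdown",
"source": [
"As a quick sanity check (an extra step, not part of the original walkthrough): because each row of weights comes from a softmax, every row should sum to 1, and the output should have the same shape as the values."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Each query's attention weights are a softmax over the keys, so every row sums to 1.\n",
"print(\"Attention weight row sums:\", np.sum(attn_weights.numpy(), axis=-1), \"\\n\")\n",
"\n",
"# The output is a weighted mix of the values, so it has the same shape as the values.\n",
"print(\"Output shape:\", output.shape, \"| Values shape:\", values.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},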
{
"cell_type": "markdown",
"source": [
"#### Generating queries, keys, and values for multiple heads."
],
"metadata": {
"id": "O8NLm6qaN7DE"
}
},
{
"cell_type": "markdown",
"source": [
"Now that we have a way to calculate self-attention, let's actually generate the input *queries*, *keys*, and *values* for multiple heads.\n",
"
\n",
"In the slides (and in most references), each attention head had its own separate set of *query*, *key*, and *value* weights. Each weight matrix was of dimension $d\\ x \\ d/h$ where h was the number of heads. "
],
"metadata": {
"id": "wBm9jbpSN6-L"
}
},
{
"cell_type": "markdown",
"source": [
"![](https://drive.google.com/uc?export=view&id=1SLWkHQgy4nQPFvvjG5_V8UTtpSAJ2zrr)"
],
"metadata": {
"id": "YLiJy9OzfMu5"
}
},
{
"cell_type": "markdown",
"source": [
"It's easier to understand things this way and we can certainly code it this way as well. But we can also \"simulate\" different heads with a single query matrix, single key matrix, and single value matrix.\n",
"
\n",
"We'll do both. First we'll create *query*, *key*, and *value* vectors using separate weights per head.\n",
"
\n",
"In the slides, we used an example of 12 dimensional embeddings processed by three attentions heads, and we'll do the same here."
],
"metadata": {
"id": "3tKPwmi3fbys"
}
},
{
"cell_type": "code",
"source": [
"batch_size = 1\n",
"seq_len = 3\n",
"embed_dim = 12\n",
"num_heads = 3\n",
"head_dim = embed_dim // num_heads\n",
"\n",
"print(f\"Dimension of each head: {head_dim}\")"
],
"metadata": {
"id": "rJLyGtqbX3uW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"**Using separate weight matrices per head**"
],
"metadata": {
"id": "JDl37YzAf7bh"
}
},
{
"cell_type": "markdown",
"source": [
"Suppose these are our input embeddings. Here we have a batch of 1 containing a sequence of length 3, with each element being a 12-dimensional embedding."
],
"metadata": {
"id": "xQ_KoJq3fv-A"
}
},
{
"cell_type": "code",
"source": [
"x = np.random.rand(batch_size, seq_len, embed_dim).round(1)\n",
"print(\"Input shape: \", x.shape, \"\\n\")\n",
"print(\"Input:\\n\", x)"
],
"metadata": {
"id": "7NcX3KBrX3uW"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We'll declare three sets of *query* weights (one for each head), three sets of *key* weights, and three sets of *value* weights. Remember each weight matrix should have a dimension of $\\text{d}\\ \\text{x}\\ \\text{d/h}$."
],
"metadata": {
"id": "uvJicbp6f7pI"
}
},
{
"cell_type": "code",
"source": [
"# The query weights for each head.\n",
"wq0 = np.random.rand(embed_dim, head_dim).round(1)\n",
"wq1 = np.random.rand(embed_dim, head_dim).round(1)\n",
"wq2 = np.random.rand(embed_dim, head_dim).round(1)\n",
"\n",
"# The key weights for each head. \n",
"wk0 = np.random.rand(embed_dim, head_dim).round(1)\n",
"wk1 = np.random.rand(embed_dim, head_dim).round(1)\n",
"wk2 = np.random.rand(embed_dim, head_dim).round(1)\n",
"\n",
"# The value weights for each head.\n",
"wv0 = np.random.rand(embed_dim, head_dim).round(1)\n",
"wv1 = np.random.rand(embed_dim, head_dim).round(1)\n",
"wv2 = np.random.rand(embed_dim, head_dim).round(1)"
],
"metadata": {
"id": "8zdg7rqrX3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"print(\"The three sets of query weights (one for each head):\")\n",
"print(\"wq0:\\n\", wq0)\n",
"print(\"wq1:\\n\", wq1)\n",
"print(\"wq2:\\n\", wq1)"
],
"metadata": {
"id": "QzMRHZooX3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We'll generate our *queries*, *keys*, and *values* for each head by multiplying our input by the weights."
],
"metadata": {
"id": "HmwGKV9qgch-"
}
},
{
"cell_type": "code",
"source": [
"# Geneated queries, keys, and values for the first head.\n",
"q0 = np.dot(x, wq0)\n",
"k0 = np.dot(x, wk0)\n",
"v0 = np.dot(x, wv0)\n",
"\n",
"# Geneated queries, keys, and values for the second head.\n",
"q1 = np.dot(x, wq1)\n",
"k1 = np.dot(x, wk1)\n",
"v1 = np.dot(x, wv1)\n",
"\n",
"# Geneated queries, keys, and values for the third head.\n",
"q2 = np.dot(x, wq2)\n",
"k2 = np.dot(x, wk2)\n",
"v2 = np.dot(x, wv2)"
],
"metadata": {
"id": "NucbYNNSX3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"These are the resulting *query*, *key*, and *value* vectors for the first head."
],
"metadata": {
"id": "AIDiwWZ0gqhm"
}
},
{
"cell_type": "code",
"source": [
"print(\"Q, K, and V for first head:\\n\")\n",
"\n",
"print(f\"q0 {q0.shape}:\\n\", q0, \"\\n\")\n",
"print(f\"k0 {k0.shape}:\\n\", k0, \"\\n\")\n",
"print(f\"v0 {v0.shape}:\\n\", v0)"
],
"metadata": {
"id": "NMcMmbkqX3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now that we have our Q, K, V vectors, we can just pass them to our self-attention operation. Here we're calculating the output and attention weights for the first head."
],
"metadata": {
"id": "iw5CQ9i6qZDv"
}
},
{
"cell_type": "code",
"source": [
"out0, attn_weights0 = scaled_dot_product_attention(q0, k0, v0)\n",
"\n",
"print(\"Output from first attention head: \", out0, \"\\n\")\n",
"print(\"Attention weights from first head: \", attn_weights0)"
],
"metadata": {
"id": "i7tHIvXKX3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Here are the other two (attention weights are ignored)."
],
"metadata": {
"id": "DoYEXSm7qr_A"
}
},
{
"cell_type": "code",
"source": [
"out1, _ = scaled_dot_product_attention(q1, k1, v1)\n",
"out2, _ = scaled_dot_product_attention(q2, k2, v2)\n",
"\n",
"print(\"Output from second attention head: \", out1, \"\\n\")\n",
"print(\"Output from third attention head: \", out2,)"
],
"metadata": {
"id": "otnqbaDSqpJ7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"As we covered in the slides, once we have each head's output, we concatenate them and then put them through a linear layer for further processing."
],
"metadata": {
"id": "lOV717bqX3uX"
}
},
{
"cell_type": "code",
"source": [
"combined_out_a = np.concatenate((out0, out1, out2), axis=-1)\n",
"print(f\"Combined output from all heads {combined_out_a.shape}:\")\n",
"print(combined_out_a)\n",
"\n",
"# The final step would be to run combined_out_a through a linear/dense layer \n",
"# for further processing."
],
"metadata": {
"id": "gmSv5trtt2v9"
},
"execution_count": null,
"outputs": []
},
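{
"cell_type": "markdown",
"source": [
"As a rough sketch of that final step (illustrative only: the dense layer below is freshly initialized rather than trained, so the actual numbers aren't meaningful). This is the role the final `Dense` layer plays in the `MultiHeadSelfAttention` class we build later in this notebook."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Illustrative only: an untrained dense layer standing in for the learned\n",
"# output projection applied to the concatenated head outputs.\n",
"final_linear = tf.keras.layers.Dense(embed_dim)\n",
"final_out_a = final_linear(combined_out_a)\n",
"\n",
"print(f\"Output after the final linear layer {final_out_a.shape}:\")\n",
"print(final_out_a)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},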
{
"cell_type": "markdown",
"source": [
"So that's a complete run of **multi-head self-attention** using separate sets of weights per head.
\n",
"\n",
"Let's now get the same thing done using a single query weight matrix, single key weight matrix, and single value weight matrix.
\n",
"These were our separate per-head query weights:"
],
"metadata": {
"id": "RRZpFR0Wt8h9"
}
},
{
"cell_type": "code",
"source": [
"print(\"Query weights for first head: \\n\", wq0, \"\\n\")\n",
"print(\"Query weights for second head: \\n\", wq1, \"\\n\")\n",
"print(\"Query weights for third head: \\n\", wq2)"
],
"metadata": {
"id": "XoJmLAsUX3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Suppose instead of declaring three separate query weight matrices, we had declared one. i.e. a single $d\\ x\\ d$ matrix. We're concatenating our per-head query weights here instead of declaring a new set of weights so that we get the same results."
],
"metadata": {
"id": "oa_p3bk8mO9D"
}
},
{
"cell_type": "code",
"source": [
"wq = np.concatenate((wq0, wq1, wq2), axis=1)\n",
"print(f\"Single query weight matrix {wq.shape}: \\n\", wq)"
],
"metadata": {
"id": "7jh6zeg1X3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"In the same vein, pretend we declared a single key weight matrix, and single value weight matrix."
],
"metadata": {
"id": "-9MzE5Okmdbl"
}
},
{
"cell_type": "code",
"source": [
"wk = np.concatenate((wk0, wk1, wk2), axis=1)\n",
"wv = np.concatenate((wv0, wv1, wv2), axis=1)\n",
"\n",
"print(f\"Single key weight matrix {wk.shape}:\\n\", wk, \"\\n\")\n",
"print(f\"Single value weight matrix {wv.shape}:\\n\", wv)"
],
"metadata": {
"id": "xq2guuobX3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now we can calculate all our *queries*, *keys*, and *values* with three dot products."
],
"metadata": {
"id": "WA7dl1VRnXHz"
}
},
{
"cell_type": "code",
"source": [
"q_s = np.dot(x, wq)\n",
"k_s = np.dot(x, wk)\n",
"v_s = np.dot(x, wv)"
],
"metadata": {
"id": "UQ5i98bLX3uX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"These are our resulting query vectors (we'll call them \"combined queries\"). How do we simulate different heads with this?"
],
"metadata": {
"id": "xkAzG-bgnx1U"
}
},
{
"cell_type": "code",
"source": [
"print(f\"Query vectors using a single weight matrix {q_s.shape}:\\n\", q_s)"
],
"metadata": {
"id": "H-qKM3jZr242"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Somehow, we need to separate these vectors such they're treated like three separate sets by the self-attention operation."
],
"metadata": {
"id": "qsUULAgRsB2n"
}
},
{
"cell_type": "code",
"source": [
"print(q0, \"\\n\")\n",
"print(q1, \"\\n\")\n",
"print(q2)"
],
"metadata": {
"id": "FKXYVHbJvnGp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Notice how each set of per-head queries looks like we took the combined queries, and chopped them vertically every four dimensions.\n",
"
\n",
"We can split our combined queries into $\\text{d}\\ \\text{x}\\ \\text{d/h}$ heads using **reshape** and **transpose**.
\n",
"The first step is to *reshape* our combined queries from a shape of:
\n",
"(batch_size, seq_len, embed_dim)
\n",
"\n",
"into a shape of
\n",
" (batch_size, seq_len, num_heads, head_dim).\n",
"
\n",
"\n",
" https://www.tensorflow.org/api_docs/python/tf/reshape"
],
"metadata": {
"id": "twXi0Sx-sTut"
}
},
{
"cell_type": "code",
"source": [
"# Note: we can achieve the same thing by passing -1 instead of seq_len.\n",
"q_s_reshaped = tf.reshape(q_s, (batch_size, seq_len, num_heads, head_dim))\n",
"print(f\"Combined queries: {q_s.shape}\\n\", q_s, \"\\n\")\n",
"print(f\"Reshaped into separate heads: {q_s_reshaped.shape}\\n\", q_s_reshaped)"
],
"metadata": {
"id": "d3iHh7XxX3uY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"At this point, we have our desired shape. The next step is to *transpose* it such that simulates vertically chopping our combined queries. By transposing, our matrix dimensions become:
\n",
"(batch_size, num_heads, seq_len, head_dim)
\n",
"\n",
"https://www.tensorflow.org/api_docs/python/tf/transpose"
],
"metadata": {
"id": "6fIWohaZvVs9"
}
},
{
"cell_type": "code",
"source": [
"q_s_transposed = tf.transpose(q_s_reshaped, perm=[0, 2, 1, 3]).numpy()\n",
"print(f\"Queries transposed into \\\"separate\\\" heads {q_s_transposed.shape}:\\n\", \n",
" q_s_transposed)"
],
"metadata": {
"id": "6Vv3kV3jX3uY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"If we compare this against the separate per-head queries we calculated previously, we see the same result except we now have all our queries in a single matrix."
],
"metadata": {
"id": "J2DOWEPewUns"
}
},
{
"cell_type": "code",
"source": [
"print(\"The separate per-head query matrices from before: \")\n",
"print(q0, \"\\n\")\n",
"print(q1, \"\\n\")\n",
"print(q2)"
],
"metadata": {
"id": "ZMLEBmtowQ02"
},
"execution_count": null,
"outputs": []
},
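{
"cell_type": "markdown",
"source": [
"We can also verify this programmatically (an extra check): each slice along the head axis of the transposed matrix should match the corresponding per-head queries."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Compare each head's slice of the combined, transposed queries against the\n",
"# queries we computed earlier with separate per-head weights.\n",
"print(\"Head 0 matches q0:\", np.allclose(q_s_transposed[:, 0], q0))\n",
"print(\"Head 1 matches q1:\", np.allclose(q_s_transposed[:, 1], q1))\n",
"print(\"Head 2 matches q2:\", np.allclose(q_s_transposed[:, 2], q2))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},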
{
"cell_type": "markdown",
"source": [
"Let's do the exact same thing with our combined keys and values."
],
"metadata": {
"id": "kmVPAaE3wmGj"
}
},
{
"cell_type": "code",
"source": [
"k_s_transposed = tf.transpose(tf.reshape(k_s, (batch_size, -1, num_heads, head_dim)), perm=[0, 2, 1, 3]).numpy()\n",
"v_s_transposed = tf.transpose(tf.reshape(v_s, (batch_size, -1, num_heads, head_dim)), perm=[0, 2, 1, 3]).numpy()\n",
"\n",
"print(f\"Keys for all heads in a single matrix {k_s.shape}: \\n\", k_s_transposed, \"\\n\")\n",
"print(f\"Values for all heads in a single matrix {v_s.shape}: \\n\", v_s_transposed)"
],
"metadata": {
"id": "vauGkBv3X3uY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Set up this way, we can now calculate the outputs from all attention heads with a single call to our self-attention operation."
],
"metadata": {
"id": "ebGFAKGrxCoe"
}
},
{
"cell_type": "code",
"source": [
"all_heads_output, all_attn_weights = scaled_dot_product_attention(q_s_transposed, \n",
" k_s_transposed, \n",
" v_s_transposed)\n",
"print(\"Self attention output:\\n\", all_heads_output)"
],
"metadata": {
"id": "hIElo1ObX3uY"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"As a sanity check, we can compare this against the outputs from individual heads we calculated earlier:"
],
"metadata": {
"id": "PCPtOI_awd-Z"
}
},
{
"cell_type": "code",
"source": [
"print(\"Per head outputs from using separate sets of weights per head:\")\n",
"print(out0, \"\\n\")\n",
"print(out1, \"\\n\")\n",
"print(out2)"
],
"metadata": {
"id": "bXIB_z11xsh7"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"To get the final concatenated result, we need to reverse our **reshape** and **transpose** operation, starting with the **transpose** this time."
],
"metadata": {
"id": "hPlpXbZI74mX"
}
},
{
"cell_type": "code",
"source": [
"combined_out_b = tf.reshape(tf.transpose(all_heads_output, perm=[0, 2, 1, 3]), \n",
" shape=(batch_size, seq_len, embed_dim))\n",
"print(\"Final output from using single query, key, value matrices:\\n\", \n",
" combined_out_b, \"\\n\")\n",
"print(\"Final output from using separate query, key, value matrices per head:\\n\", \n",
" combined_out_a)"
],
"metadata": {
"id": "9lWtCPk1wuod"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"We can encapsulate everything we just covered in a class."
],
"metadata": {
"id": "Wi8WnhwL9UIa"
}
},
{
"cell_type": "code",
"source": [
"class MultiHeadSelfAttention(tf.keras.layers.Layer):\n",
" def __init__(self, d_model, num_heads):\n",
" super(MultiHeadSelfAttention, self).__init__()\n",
" self.d_model = d_model\n",
" self.num_heads = num_heads\n",
"\n",
" self.d_head = self.d_model // self.num_heads\n",
"\n",
" self.wq = tf.keras.layers.Dense(self.d_model)\n",
" self.wk = tf.keras.layers.Dense(self.d_model)\n",
" self.wv = tf.keras.layers.Dense(self.d_model)\n",
"\n",
" # Linear layer to generate the final output.\n",
" self.dense = tf.keras.layers.Dense(self.d_model)\n",
" \n",
" def split_heads(self, x):\n",
" batch_size = x.shape[0]\n",
"\n",
" split_inputs = tf.reshape(x, (batch_size, -1, self.num_heads, self.d_head))\n",
" return tf.transpose(split_inputs, perm=[0, 2, 1, 3])\n",
" \n",
" def merge_heads(self, x):\n",
" batch_size = x.shape[0]\n",
"\n",
" merged_inputs = tf.transpose(x, perm=[0, 2, 1, 3])\n",
" return tf.reshape(merged_inputs, (batch_size, -1, self.d_model))\n",
"\n",
" def call(self, q, k, v, mask):\n",
" qs = self.wq(q)\n",
" ks = self.wk(k)\n",
" vs = self.wv(v)\n",
"\n",
" qs = self.split_heads(qs)\n",
" ks = self.split_heads(ks)\n",
" vs = self.split_heads(vs)\n",
"\n",
" output, attn_weights = scaled_dot_product_attention(qs, ks, vs, mask)\n",
" output = self.merge_heads(output)\n",
"\n",
" return self.dense(output), attn_weights\n"
],
"metadata": {
"id": "Sd_IgJI34vP4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"mhsa = MultiHeadSelfAttention(12, 3)\n",
"\n",
"output, attn_weights = mhsa(x, x, x, None)\n",
"print(f\"MHSA output{output.shape}:\")\n",
"print(output)"
],
"metadata": {
"id": "nuvv-8cg6owq"
},
"execution_count": null,
"outputs": []
},
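{
"cell_type": "markdown",
"source": [
"A couple of quick shape checks (added here as a sanity check): the layer's output has the same shape as its input, and the attention weights contain one seq_len-by-seq_len matrix of scores per head."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Output shape matches the input: (batch_size, seq_len, d_model).\n",
"print(\"Input shape:  \", x.shape)\n",
"print(\"Output shape: \", output.shape)\n",
"\n",
"# One score matrix per head: (batch_size, num_heads, seq_len, seq_len).\n",
"print(\"Attention weights shape:\", attn_weights.shape)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},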
{
"cell_type": "markdown",
"source": [
"## Encoder Block"
],
"metadata": {
"id": "uAk-GG2yMM59"
}
},
{
"cell_type": "markdown",
"source": [
"We can now build our **Encoder Block**. In addition to the **Multi-Head Self Attention** layer, the **Encoder Block** also has **skip connections**, **layer normalization steps**, and a **two-layer feed-forward neural network**. The original **Attention Is All You Need** paper also included some **dropout** applied to the self-attention output which isn't shown in the illustration below (see references for a link to the paper).\n",
"\n",
"