{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "import matplotlib\n", "matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%load_ext tikzmagic" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image\n", "import random" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Transformer Language Models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Self-attention\n", "* Masked langauge models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Attention is all you need\n", "\n", "*Transformers* replace the whole LSTM with *self-attention* ([Vaswani et al., 2017](https://arxiv.org/pdf/1706.03762.pdf))\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "All tokens attend to each other:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: The Illustrated Transformer)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### The full transformer model\n", "\n", "Deep multi-head self-attention encoder-decoder with sinusodial positional encodings:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Vaswani et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Transformer unit\n", "\n", "Add residual connections, layer normalization and feed-forward layers (MLPs):\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: Vaswani et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Multi-head self-attention\n", "\n", "Repeat this multiple times with multiple sets of parameter matrices, then concatenate:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: Vaswani et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Dot-Product Attention\n", "\n", "For each index $i$ in the sequence (of length $n$), use hidden representation $\\mathbf{h}_i$ to create three other $d_{\\mathbf{h}}$-dimensional vectors:\n", "query vector $\\color{purple}{\\mathbf{q}_i}=W^q\\mathbf{h}_i$,\n", "key vector $\\color{orange}{\\mathbf{k}_i}=W^k\\mathbf{h}_i$,\n", "value vector $\\color{blue}{\\mathbf{v}_i}=W^v\\mathbf{h}_i$.\n", "\n", "We use them to calculate the attention probability distribution $\\mathbf{\\alpha}$, which is an $n\\times n$ matrix, and the new hidden representation (for the next layer) $\\mathbf{h}_i^\\prime$, which is a $d_{\\mathbf{h}}$-dimensional vector:\n", "$$\n", "\\mathbf{\\alpha}_i = \\text{softmax}\\left(\n", "\\begin{array}{c}\n", "\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_1} \\\\\n", "\\ldots \\\\\n", "\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_n}\n", "\\end{array}\n", "\\right) \\\\\n", "\\mathbf{h}_i^\\prime = \\sum_{j=1}^n \\mathbf{\\alpha}_{i,j} \\color{blue}{\\mathbf{v}_j}\n", "$$\n", "\n", "$W^q$, $W^k$ and $W^v$ are all trained." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### ***Scaled*** Dot-Product Attention\n", "\n", "If we assume that $\\color{purple}{\\mathbf{q}_i}$ and $\\color{orange}{\\mathbf{k}_i}$ are $d_{\\mathbf{h}}$-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, $\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_i} = \\sum_{j=1}^{d_{\\mathbf{h}}} \\mathbf{q}_{ij} \\mathbf{k}_{ij}$, has mean 0 and variance $d_{\\mathbf{h}}$. Since we would prefer these values to have variance 1, we divide by $d_{\\mathbf{h}}$.\n", "\n", "$$\n", "\\mathbf{\\alpha}_i = \\text{softmax}\\left(\n", "\\begin{array}{c}\n", "\\frac{\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_1}}{\\sqrt{d_{\\mathbf{h}}}} \\\\\n", "\\ldots \\\\\n", "\\frac{\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_n}}{\\sqrt{d_{\\mathbf{h}}}}\n", "\\end{array}\n", "\\right) \\\\\n", "\\mathbf{h}_i^\\prime = \\sum_{j=1}^n \\mathbf{\\alpha}_{i,j} \\color{blue}{\\mathbf{v}_j}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In matrix form:\n", "\n", "$$\n", "\\text{Attention}(Q,K,V)=\n", "\\text{softmax}\\left(\n", "\\frac{\\color{purple}{Q}\n", "\\color{orange}{K}^\\intercal}\n", "{\\sqrt{d_{\\mathbf{h}}}}\n", "\\right) \\color{blue}{V}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$\n", "\\text{MultiHead}(Q,K,V)=\\text{Concat}(\\text{head}_1,\\ldots,\\text{head}_h)W^O\n", "$$\n", "where\n", "$$\n", "\\text{head}_i=\\text{Attention}(QW_i^q,KW_i^k,VW_i^v)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Transformer\n", "\n", "Repeat this for multiple layers, each using the previous as input:\n", "\n", "\n", "$$\n", "\\text{MultiHead}^\\ell(Q^\\ell,K^\\ell,V^\\ell)=\\text{Concat}(\\text{head}_1^\\ell,\\ldots,\\text{head}_h^\\ell)W_\\ell^O\n", "$$\n", "where\n", "$$\n", "\\text{head}_i^\\ell=\\text{Attention}(Q^\\ell W_{i,\\ell}^q,K^\\ell W_{i,\\ell}^k,V^\\ell W_{i,\\ell}^v)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 
Long-distance dependencies\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: Vaswani et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Back to bag-of-words?\n", "\n", "RNNs process tokens sequentially, but\n", "Transformers process all tokens **at once**.\n", "\n", "In fact, we did not even provide any information about the order of tokens..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Positional Encoding ##\n", "\n", "Represent **positions** with fixed-length vectors, with the same dimensionality as word embeddings:\n", "\n", "(1st position, 2nd position, 3rd position, ...) $\\to$ Must decide on maximum sequence length\n", "\n", "Add to word embeddings at the input layer:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Positional Encoding ##\n", "\n", "Alternatives:\n", "\n", "* Learned position embeddings (like word embeddings)\n", "* **Static position encoding:**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Picture source: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Superiority over LSTM\n", "\n", "Transformers at similar parameter counts to LSTMs are:\n", "* better at language modelling\n", "* better at effectively using more input tokens" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", " (Kaplan et al. arXiv:2001.0836)\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Transformers for decoding\n", "\n", "Attends to encoded input *and* to partial output.\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: The Illustrated GPT-2)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Can only attend to already-generated tokens.\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: The Illustrated GPT-2)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The encoder transformer is sometimes called \"bidirectional transformer\"." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## BERT\n", "\n", "**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)).\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reminder: BERT training objective (1): **masked** language model\n", "\n", "Predict masked words given context on both sides:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reminder: BERT Training objective (2): next sentence prediction\n", "\n", "**Conditional encoding** of both sentences:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT architecture\n", "\n", "Transformer with $L$ layers of dimension $H$, and $A$ self-attention heads.\n", "\n", "* BERT$_\\mathrm{BASE}$: $L=12, H=768, A=12$\n", "* BERT$_\\mathrm{LARGE}$: $L=24, H=1024, A=16$\n", "\n", "(Many other variations available through [HuggingFace Transformers](https://huggingface.co/docs/transformers/index))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Trained on 16GB of text from Wikipedia + BookCorpus.\n", "\n", "* BERT$_\\mathrm{BASE}$: 4 TPUs for 4 days\n", "* BERT$_\\mathrm{LARGE}$: 16 TPUs for 4 days" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNLI results\n", "\n", "| Model | Accuracy |\n", "|---|---|\n", "| LSTM | 77.6 |\n", "| LSTMs with conditional encoding | 80.9 |\n", "| LSTMs with conditional encoding + attention | 82.3 |\n", "| LSTMs with word-by-word attention | 83.5 |\n", "| Self-attention | 85.6 |\n", "| BERT$_\\mathrm{BASE}$ | 89.2 |\n", "| BERT$_\\mathrm{LARGE}$ | 90.4 |\n", "\n", "([Zhang et al., 2019](https://bcmi.sjtu.edu.cn/home/zhangzs/pubs/paclic33.pdf))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## RoBERTa\n", "\n", "Same architecture as BERT but better hyperparameter tuning and more training data ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)):\n", "\n", "- CC-News (76GB)\n", "- OpenWebText (38GB)\n", "- Stories (31GB)\n", "\n", "and **no** next-sentence-prediction task (only masked LM).\n", "\n", "Training: 1024 GPUs for one day.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNLI results\n", "\n", "| Model | Accuracy |\n", "|---|---|\n", "| LSTM | 77.6 |\n", "| LSTMs with conditional encoding | 80.9 |\n", "| LSTMs with conditional encoding + attention | 82.3 |\n", "| LSTMs with word-by-word attention | 83.5 |\n", "| Self-attention | 85.6 |\n", "| BERT$_\\mathrm{BASE}$ | 89.2 |\n", "| BERT$_\\mathrm{LARGE}$ | 90.4 |\n", "| RoBERTa$_\\mathrm{BASE}$ | 90.7 |\n", "| RoBERTa$_\\mathrm{LARGE}$ | 91.4 |\n", "\n", "([Sun et al., 2020](https://arxiv.org/abs/2012.01786))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Transformer LMs as pre-trained representations\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Radford et al., 2018)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Text and position embeddings in BERT (and friends)\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Devlin et al., 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends)\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Devlin et al., 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends) for various tasks\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Devlin et al., 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Which layer to use?\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multilingual BERT\n", "\n", "* One model pre-trained on 104 languages with the largest Wikipedias\n", "* 110k *shared* WordPiece vocabulary\n", "* Same architecture as BERT$_\\mathrm{BASE}$: $L=12, H=768, A=12$\n", "* Same training objectives, **no cross-lingual signal**\n", "\n", "https://github.com/google-research/bert/blob/master/multilingual.md" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Other multilingual transformers\n", "\n", "+ XLM and XLM-R ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf))\n", "+ DistilmBERT ([Sanh et al., 2020](https://arxiv.org/pdf/1910.01108.pdf)) is a lighter version of mBERT\n", "+ Many monolingual BERTs for languages other than English\n", "([CamemBERT](https://arxiv.org/pdf/1911.03894.pdf),\n", "[BERTje](https://arxiv.org/pdf/1912.09582),\n", "[Nordic BERT](https://github.com/botxo/nordic_bert)...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Outlook\n", "\n", "* Transformer models keep coming out: larger, trained on more data, languages and domains, etc.\n", " + Increasing energy usage and climate impact: see https://github.com/danielhers/climate-awareness-nlp\n", "* In the machine translation lecture, you will learn how to use them for cross-lingual tasks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "## Additional Reading\n", "\n", "+ [Jurafsky & Martin Chapter 9](https://web.stanford.edu/~jurafsky/slp3/9.pdf)\n", "+ [Jurafsky & Martin Chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf)\n", "+ Jay Alammar's blog posts:\n", " + [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)\n", " + [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/)" ] } ], "metadata": { "celltoolbar": "Slideshow", "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 4 }