{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "import matplotlib\n", "matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%load_ext tikzmagic" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from IPython.display import Image\n", "import random" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Transformer Language Models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Self-attention\n", "* Masked langauge models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Attention is all you need\n", "\n", "*Transformers* replace the whole LSTM with *self-attention* ([Vaswani et al., 2017](https://arxiv.org/pdf/1706.03762.pdf))\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "All tokens attend to each other:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: The Illustrated Transformer)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### The full transformer model\n", "\n", "Deep multi-head self-attention encoder-decoder with sinusodial positional encodings:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Vaswani et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Transformer unit\n", "\n", "Add residual connections, layer normalization and feed-forward layers (MLPs):\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: Vaswani et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Multi-head self-attention\n", "\n", "Repeat this multiple times with multiple sets of parameter matrices, then concatenate:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: Vaswani et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Dot-Product Attention\n", "\n", "For each index $i$ in the sequence (of length $n$), use hidden representation $\\mathbf{h}_i$ to create three other $d_{\\mathbf{h}}$-dimensional vectors:\n", "query vector $\\color{purple}{\\mathbf{q}_i}=W^q\\mathbf{h}_i$,\n", "key vector $\\color{orange}{\\mathbf{k}_i}=W^k\\mathbf{h}_i$,\n", "value vector $\\color{blue}{\\mathbf{v}_i}=W^v\\mathbf{h}_i$.\n", "\n", "We use them to calculate the attention probability distribution $\\mathbf{\\alpha}$, which is an $n\\times n$ matrix, and the new hidden representation (for the next layer) $\\mathbf{h}_i^\\prime$, which is a $d_{\\mathbf{h}}$-dimensional vector:\n", "$$\n", "\\mathbf{\\alpha}_i = \\text{softmax}\\left(\n", "\\begin{array}{c}\n", "\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_1} \\\\\n", "\\ldots \\\\\n", "\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_n}\n", "\\end{array}\n", "\\right) \\\\\n", "\\mathbf{h}_i^\\prime = \\sum_{j=1}^n \\mathbf{\\alpha}_{i,j} \\color{blue}{\\mathbf{v}_j}\n", "$$\n", "\n", "$W^q$, $W^k$ and $W^v$ are all trained." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### ***Scaled*** Dot-Product Attention\n", "\n", "If we assume that $\\color{purple}{\\mathbf{q}_i}$ and $\\color{orange}{\\mathbf{k}_i}$ are $d_{\\mathbf{h}}$-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, $\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_i} = \\sum_{j=1}^{d_{\\mathbf{h}}} \\mathbf{q}_{ij} \\mathbf{k}_{ij}$, has mean 0 and variance $d_{\\mathbf{h}}$. Since we would prefer these values to have variance 1, we divide by $d_{\\mathbf{h}}$.\n", "\n", "$$\n", "\\mathbf{\\alpha}_i = \\text{softmax}\\left(\n", "\\begin{array}{c}\n", "\\frac{\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_1}}{\\sqrt{d_{\\mathbf{h}}}} \\\\\n", "\\ldots \\\\\n", "\\frac{\\color{purple}{\\mathbf{q}_i} \\cdot \\color{orange}{\\mathbf{k}_n}}{\\sqrt{d_{\\mathbf{h}}}}\n", "\\end{array}\n", "\\right) \\\\\n", "\\mathbf{h}_i^\\prime = \\sum_{j=1}^n \\mathbf{\\alpha}_{i,j} \\color{blue}{\\mathbf{v}_j}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In matrix form:\n", "\n", "$$\n", "\\text{Attention}(Q,K,V)=\n", "\\text{softmax}\\left(\n", "\\frac{\\color{purple}{Q}\n", "\\color{orange}{K}^\\intercal}\n", "{\\sqrt{d_{\\mathbf{h}}}}\n", "\\right) \\color{blue}{V}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "$$\n", "\\text{MultiHead}(Q,K,V)=\\text{Concat}(\\text{head}_1,\\ldots,\\text{head}_h)W^O\n", "$$\n", "where\n", "$$\n", "\\text{head}_i=\\text{Attention}(QW_i^q,KW_i^k,VW_i^v)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Transformer\n", "\n", "Repeat this for multiple layers, each using the previous as input:\n", "\n", "\n", "$$\n", "\\text{MultiHead}^\\ell(Q^\\ell,K^\\ell,V^\\ell)=\\text{Concat}(\\text{head}_1^\\ell,\\ldots,\\text{head}_h^\\ell)W_\\ell^O\n", "$$\n", "where\n", "$$\n", "\\text{head}_i^\\ell=\\text{Attention}(Q^\\ell W_{i,\\ell}^q,K^\\ell W_{i,\\ell}^k,V^\\ell W_{i,\\ell}^v)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### 
Long-distance dependencies\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: Vaswani et al., 2017)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Back to bag-of-words?\n", "\n", "RNNs process tokens sequentially, but\n", "Transformers process all tokens **at once**.\n", "\n", "In fact, we did not even provide any information about the order of tokens..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Positional Encoding ##\n", "\n", "Represent **positions** with fixed-length vectors, with the same dimensionality as word embeddings:\n", "\n", "(1st position, 2nd position, 3rd position, ...) $\\to$ Must decide on maximum sequence length\n", "\n", "Add to word embeddings at the input layer:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Positional Encoding ##\n", "\n", "Alternatives:\n", "\n", "* Learned position embeddings (like word embeddings)\n", "* **Static position encoding:**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "Picture source: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Superiority over LSTM\n", "\n", "Transformers at similar parameter counts to LSTMs are:\n", "* better at language modelling\n", "* better at effectively using more input tokens" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", " (Kaplan et al. arXiv:2001.0836)\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Transformers for decoding\n", "\n", "Attends to encoded input *and* to partial output.\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: The Illustrated GPT-2)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Can only attend to already-generated tokens.\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (Image source: The Illustrated GPT-2)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The encoder transformer is sometimes called \"bidirectional transformer\"." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## BERT\n", "\n", "**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)).\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reminder: BERT training objective (1): **masked** language model\n", "\n", "Predict masked words given context on both sides:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reminder: BERT Training objective (2): next sentence prediction\n", "\n", "**Conditional encoding** of both sentences:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT architecture\n", "\n", "Transformer with $L$ layers of dimension $H$, and $A$ self-attention heads.\n", "\n", "* BERT$_\\mathrm{BASE}$: $L=12, H=768, A=12$\n", "* BERT$_\\mathrm{LARGE}$: $L=24, H=1024, A=16$\n", "\n", "(Many other variations available through [HuggingFace Transformers](https://huggingface.co/docs/transformers/index))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Trained on 16GB of text from Wikipedia + BookCorpus.\n", "\n", "* BERT$_\\mathrm{BASE}$: 4 TPUs for 4 days\n", "* BERT$_\\mathrm{LARGE}$: 16 TPUs for 4 days" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNLI results\n", "\n", "| Model | Accuracy |\n", "|---|---|\n", "| LSTM | 77.6 |\n", "| LSTMs with conditional encoding | 80.9 |\n", "| LSTMs with conditional encoding + attention | 82.3 |\n", "| LSTMs with word-by-word attention | 83.5 |\n", "| Self-attention | 85.6 |\n", "| BERT$_\\mathrm{BASE}$ | 89.2 |\n", "| BERT$_\\mathrm{LARGE}$ | 90.4 |\n", "\n", "([Zhang et al., 2019](https://bcmi.sjtu.edu.cn/home/zhangzs/pubs/paclic33.pdf))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## RoBERTa\n", "\n", "Same architecture as BERT but better hyperparameter tuning and more training data ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)):\n", "\n", "- CC-News (76GB)\n", "- OpenWebText (38GB)\n", "- Stories (31GB)\n", "\n", "and **no** next-sentence-prediction task (only masked LM).\n", "\n", "Training: 1024 GPUs for one day.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNLI results\n", "\n", "| Model | Accuracy |\n", "|---|---|\n", "| LSTM | 77.6 |\n", "| LSTMs with conditional encoding | 80.9 |\n", "| LSTMs with conditional encoding + attention | 82.3 |\n", "| LSTMs with word-by-word attention | 83.5 |\n", "| Self-attention | 85.6 |\n", "| BERT$_\\mathrm{BASE}$ | 89.2 |\n", "| BERT$_\\mathrm{LARGE}$ | 90.4 |\n", "| RoBERTa$_\\mathrm{BASE}$ | 90.7 |\n", "| RoBERTa$_\\mathrm{LARGE}$ | 91.4 |\n", "\n", "([Sun et al., 2020](https://arxiv.org/abs/2012.01786))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Transformer LMs as pre-trained representations\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Radford et al., 2018)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Text and position embeddings in BERT (and friends)\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Devlin et al., 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends)\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Devlin et al., 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends) for various tasks\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Devlin et al., 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Which layer to use?\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multilingual BERT\n", "\n", "* One model pre-trained on 104 languages with the largest Wikipedias\n", "* 110k *shared* WordPiece vocabulary\n", "* Same architecture as BERT$_\\mathrm{BASE}$: $L=12, H=768, A=12$\n", "* Same training objectives, **no cross-lingual signal**\n", "\n", "https://github.com/google-research/bert/blob/master/multilingual.md" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Other multilingual transformers\n", "\n", "+ XLM and XLM-R ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf))\n", "+ DistilmBERT ([Sanh et al., 2020](https://arxiv.org/pdf/1910.01108.pdf)) is a lighter version of mBERT\n", "+ Many monolingual BERTs for languages other than English\n", "([CamemBERT](https://arxiv.org/pdf/1911.03894.pdf),\n", "[BERTje](https://arxiv.org/pdf/1912.09582),\n", "[Nordic BERT](https://github.com/botxo/nordic_bert)...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Outlook\n", "\n", "* Transformer models keep coming out: larger, trained on more data, languages and domains, etc.\n", " + Increasing energy usage and climate impact: see https://github.com/danielhers/climate-awareness-nlp\n", "* In the machine translation lecture, you will learn how to use them for cross-lingual tasks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "## Additional Reading\n", "\n", "+ [Jurafsky & Martin Chapter 9](https://web.stanford.edu/~jurafsky/slp3/9.pdf)\n", "+ [Jurafsky & Martin Chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf)\n", "+ Jay Alammar's blog posts:\n", " + [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)\n", " + [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/)" ] } ], "metadata": { "celltoolbar": "Slideshow", "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 4 }