{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2016-12-02T17:30:29.181539", "start_time": "2016-12-02T17:30:29.172204" }, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "{'theme': 'white',\n", " 'transition': 'none',\n", " 'controls': 'false',\n", " 'progress': 'true'}" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reveal.js\n", "from notebook.services.config import ConfigManager\n", "cm = ConfigManager()\n", "cm.update('livereveal', {\n", " 'theme': 'white',\n", " 'transition': 'none',\n", " 'controls': 'false',\n", " 'progress': 'true',\n", "})" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "# %cd ..\n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "util.execute_notebook('language_models.ipynb')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "<script>\n", " function code_toggle() {\n", " if (code_shown){\n", " $('div.input').hide('500');\n", " $('#toggleButton').val('Show Code')\n", " } else {\n", " $('div.input').show('500');\n", " $('#toggleButton').val('Hide Code')\n", " }\n", " code_shown = !code_shown\n", " }\n", "\n", " $( document ).ready(function(){\n", " code_shown=false;\n", " $('div.input').hide()\n", " });\n", "</script>\n", "<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>\n" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "<script>\n", " function code_toggle() {\n", " if (code_shown){\n", " $('div.input').hide('500');\n", " $('#toggleButton').val('Show Code')\n", " } else {\n", " $('div.input').show('500');\n", " $('#toggleButton').val('Hide Code')\n", " }\n", " code_shown = !code_shown\n", " }\n", "\n", " $( document ).ready(function(){\n", " code_shown=false;\n", " $('div.input').hide()\n", " });\n", "</script>\n", "<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from IPython.display import Image\n", "import random" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Transformer Language Models" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "<img src=\"mt_figures/transformer.png?0.4768851170915196\" width=\"500\"/>" ], "text/plain": [ "<IPython.core.display.Image object>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='mt_figures/transformer.png'+'?'+str(random.random()), width=500)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## BERT\n", "\n", "**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)).\n", "\n", "<center>\n", " <img src=\"https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg\" width=40%/>\n", 
"</center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "<center>\n", "<a href=\"slides/mlm.pdf\"><img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Sesame_Street_logo.svg/500px-Sesame_Street_logo.svg.png\"></a>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT training objective (1): **masked** language model\n", "\n", "Predict masked words given context on both sides:\n", "\n", "<center>\n", " <img src=\"http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png\" width=50%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"http://jalammar.github.io/illustrated-bert/\">The Illustrated BERT</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT Training objective (2): next sentence prediction\n", "\n", "**Conditional encoding** of both sentences:\n", "\n", "<center>\n", " <img src=\"http://jalammar.github.io/images/bert-next-sentence-prediction.png\" width=60%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"http://jalammar.github.io/illustrated-bert/\">The Illustrated BERT</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT architecture\n", "\n", "Transformer with $L$ layers of dimension $H$, and $A$ self-attention heads.\n", "\n", "* BERT$_\\mathrm{BASE}$: $L=12, H=768, A=12$\n", "* BERT$_\\mathrm{LARGE}$: $L=24, H=1024, A=16$\n", "\n", "(Many other variations available through [HuggingFace Transformers](https://huggingface.co/docs/transformers/index))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Trained on 16GB of text from Wikipedia + BookCorpus.\n", "\n", "* BERT$_\\mathrm{BASE}$: 4 TPUs for 4 days\n", "* BERT$_\\mathrm{LARGE}$: 16 TPUs for 4 days" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNLI results\n", "\n", "| Model | Accuracy |\n", "|---|---|\n", "| LSTM | 77.6 |\n", "| LSTMs with conditional encoding | 80.9 |\n", "| LSTMs with conditional encoding + attention | 82.3 |\n", "| LSTMs with word-by-word attention | 83.5 |\n", "| Self-attention | 85.6 |\n", "| BERT$_\\mathrm{BASE}$ | 89.2 |\n", "| BERT$_\\mathrm{LARGE}$ | 90.4 |\n", "\n", "([Zhang et al., 2019](https://bcmi.sjtu.edu.cn/home/zhangzs/pubs/paclic33.pdf))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### RoBERTa\n", "\n", "Same architecture as BERT but better hyperparameter tuning and more training data ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)):\n", "\n", "- CC-News (76GB)\n", "- OpenWebText (38GB)\n", "- Stories (31GB)\n", "\n", "and **no** next-sentence-prediction task (only masked LM).\n", "\n", "Training: 1024 GPUs for one day.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNLI results\n", "\n", "| Model | Accuracy |\n", "|---|---|\n", "| LSTM | 77.6 |\n", "| LSTMs with conditional encoding | 80.9 |\n", "| LSTMs with conditional encoding + attention | 82.3 |\n", "| LSTMs with word-by-word attention | 83.5 |\n", "| Self-attention | 85.6 |\n", "| BERT$_\\mathrm{BASE}$ | 89.2 |\n", "| BERT$_\\mathrm{LARGE}$ | 90.4 |\n", "| RoBERTa$_\\mathrm{BASE}$ | 90.7 |\n", "| 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### How is that different from ELMo and GPT-$n$?\n", "\n", "<center>\n", " <img src=\"mt_figures/bert_gpt_elmo.png\" width=100%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT tokenisation: not words, but WordPieces\n", "\n", "WordPiece and BPE (byte-pair encoding) tokenise text to **subwords** ([Sennrich et al., 2016](https://aclanthology.org/P16-1162/), [Wu et al., 2016](https://arxiv.org/abs/1609.08144v2))" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* BERT has a [30,000 WordPiece vocabulary](https://huggingface.co/bert-base-cased/blob/main/vocab.txt), including ~10,000 unique characters.\n", "* (Almost) no unknown words: rare words are split into subwords that are in the vocabulary!" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "<center>\n", " <img src=\"https://vamvas.ch/assets/bert-for-ner/tokenizer.png\" width=60%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://vamvas.ch/bert-for-ner\">BERT for NER</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "Pretty similar to [word2vec](dl-representations_simple.ipynb):\n", "\n", "<center>\n", " <img src=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html\">Visualizing BERT</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "<center>\n", " <img src=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc-house.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html\">Visualizing BERT</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "<center>\n", " <img src=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc-suffixes.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html\">Visualizing BERT</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Transformer LMs as pre-trained representations\n", "\n", "<center>\n", " <img src=\"https://d3i71xaburhd42.cloudfront.net/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035/4-Figure1-1.png\" width=90%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf\">Radford et al., 2018</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Token, segment, and position embeddings in BERT (and friends)\n", "\n", "<center>\n", " <img src=\"https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/5-Figure2-1.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] },
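{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Inspecting the input encoding\n", "\n", "A minimal sketch (our own example, assuming the `bert-base-cased` checkpoint) of how a sentence pair becomes WordPieces, token ids, and segment ids with HuggingFace Transformers:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# Tokenise a sentence pair the way BERT sees it.\n", "from transformers import AutoTokenizer\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n", "\n", "encoding = tokenizer(\"Copenhagen is lovely.\", \"It really is.\")\n", "\n", "# WordPieces, including [CLS]/[SEP] and '##' continuation pieces:\n", "print(tokenizer.convert_ids_to_tokens(encoding[\"input_ids\"]))\n", "\n", "# Segment ids distinguish sentence A (0) from sentence B (1);\n", "# position embeddings are added inside the model by token index.\n", "print(encoding[\"token_type_ids\"])" ] },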
" <img src=\"https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/5-Figure2-1.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends)\n", "\n", "<center>\n", " <img src=\"https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/3-Figure1-1.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends) for NLI\n", "\n", "<center>\n", " <img src=\"https://production-media.paperswithcode.com/models/roberta-multichoice.png-0000000931-36fb4743.png\" width=70%/>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends) for various tasks\n", "\n", "<center>\n", " <img src=\"http://jalammar.github.io/images/bert-tasks.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Which layer to use?\n", "\n", "<center>\n", " <img src=\"http://jalammar.github.io/images/bert-contexualized-embeddings.png\" width=80%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"http://jalammar.github.io/illustrated-bert/\">The Illustrated BERT</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "<center>\n", " <img src=\"http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png\" width=80%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"http://jalammar.github.io/illustrated-bert/\">The Illustrated BERT</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Multilingual BERT\n", "\n", "* One model pre-trained on 104 languages with the largest Wikipedias\n", "* 110k *shared* WordPiece vocabulary\n", "* Same architecture as BERT$_\\mathrm{BASE}$: $L=12, H=768, A=12$\n", "* Same training objectives, **no cross-lingual signal**\n", "\n", "https://github.com/google-research/bert/blob/master/multilingual.md" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Other multilingual transformers\n", "\n", "+ XLM and XLM-R ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf))\n", "+ DistilmBERT ([Sanh et al., 2020](https://arxiv.org/pdf/1910.01108.pdf)) is a lighter version of mBERT\n", "+ Many monolingual BERTs for languages other than English\n", "([CamemBERT](https://arxiv.org/pdf/1911.03894.pdf),\n", "[BERTje](https://arxiv.org/pdf/1912.09582),\n", "[Nordic BERT](https://github.com/botxo/nordic_bert)...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary #\n", "\n", "* Static word embeddings do not differ depending on context\n", "* Contextualised representations are dynamic\n", "* 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary #\n", "\n", "* Static word embeddings do not differ depending on context\n", "* Contextualised representations are dynamic\n", "* Popular pre-trained contextual representations:\n", "  * ELMo: bidirectional language model with LSTMs\n", "  * GPT: transformer language models\n", "  * BERT: transformer masked language model" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Outlook #\n", "\n", "* Transformer models keep coming out: larger, and trained on more data, more languages, and more domains\n", "  + Increasing energy usage and climate impact: see https://github.com/danielhers/climate-awareness-nlp\n", "* In the machine translation lecture, you will learn how to use them for cross-lingual tasks" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Additional Reading #\n", "\n", "+ [Jurafsky & Martin Chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf)\n", "+ Jay Alammar's blog posts:\n", "  + [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)\n", "  + [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/)" ] }
], "metadata": { "celltoolbar": "Slideshow", "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 1 }