{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2016-12-02T17:30:29.181539", "start_time": "2016-12-02T17:30:29.172204" }, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "{'theme': 'white',\n", " 'transition': 'none',\n", " 'controls': 'false',\n", " 'progress': 'true'}" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reveal.js\n", "from notebook.services.config import ConfigManager\n", "cm = ConfigManager()\n", "cm.update('livereveal', {\n", " 'theme': 'white',\n", " 'transition': 'none',\n", " 'controls': 'false',\n", " 'progress': 'true',\n", "})" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "# %cd ..\n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "util.execute_notebook('language_models.ipynb')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "<script>\n", " function code_toggle() {\n", " if (code_shown){\n", " $('div.input').hide('500');\n", " $('#toggleButton').val('Show Code')\n", " } else {\n", " $('div.input').show('500');\n", " $('#toggleButton').val('Hide Code')\n", " }\n", " code_shown = !code_shown\n", " }\n", "\n", " $( document ).ready(function(){\n", " code_shown=false;\n", " $('div.input').hide()\n", " });\n", "</script>\n", "<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>\n" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "<script>\n", " function code_toggle() {\n", " if (code_shown){\n", " $('div.input').hide('500');\n", " $('#toggleButton').val('Show Code')\n", " } else {\n", " $('div.input').show('500');\n", " $('#toggleButton').val('Hide Code')\n", " }\n", " code_shown = !code_shown\n", " }\n", "\n", " $( document ).ready(function(){\n", " code_shown=false;\n", " $('div.input').hide()\n", " });\n", "</script>\n", "<form action=\"javascript:code_toggle()\"><input type=\"submit\" id=\"toggleButton\" value=\"Show Code\"></form>" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from IPython.display import Image\n", "import random" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Transformer Language Models" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "<img src=\"mt_figures/transformer.png?0.4768851170915196\" width=\"500\"/>" ], "text/plain": [ "<IPython.core.display.Image object>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='mt_figures/transformer.png'+'?'+str(random.random()), width=500)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## BERT\n", "\n", "**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)).\n", "\n", "<center>\n", " <img src=\"https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg\" width=40%/>\n", 
"</center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "<center>\n", "<a href=\"slides/mlm.pdf\"><img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Sesame_Street_logo.svg/500px-Sesame_Street_logo.svg.png\"></a>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT training objective (1): **masked** language model\n", "\n", "Predict masked words given context on both sides:\n", "\n", "<center>\n", " <img src=\"http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png\" width=50%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"http://jalammar.github.io/illustrated-bert/\">The Illustrated BERT</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT Training objective (2): next sentence prediction\n", "\n", "**Conditional encoding** of both sentences:\n", "\n", "<center>\n", " <img src=\"http://jalammar.github.io/images/bert-next-sentence-prediction.png\" width=60%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"http://jalammar.github.io/illustrated-bert/\">The Illustrated BERT</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT architecture\n", "\n", "Transformer with $L$ layers of dimension $H$, and $A$ self-attention heads.\n", "\n", "* BERT$_\\mathrm{BASE}$: $L=12, H=768, A=12$\n", "* BERT$_\\mathrm{LARGE}$: $L=24, H=1024, A=16$\n", "\n", "(Many other variations available through [HuggingFace Transformers](https://huggingface.co/docs/transformers/index))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Trained on 16GB of text from Wikipedia + BookCorpus.\n", "\n", "* BERT$_\\mathrm{BASE}$: 4 TPUs for 4 days\n", "* BERT$_\\mathrm{LARGE}$: 16 TPUs for 4 days" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNLI results\n", "\n", "| Model | Accuracy |\n", "|---|---|\n", "| LSTM | 77.6 |\n", "| LSTMs with conditional encoding | 80.9 |\n", "| LSTMs with conditional encoding + attention | 82.3 |\n", "| LSTMs with word-by-word attention | 83.5 |\n", "| Self-attention | 85.6 |\n", "| BERT$_\\mathrm{BASE}$ | 89.2 |\n", "| BERT$_\\mathrm{LARGE}$ | 90.4 |\n", "\n", "([Zhang et al., 2019](https://bcmi.sjtu.edu.cn/home/zhangzs/pubs/paclic33.pdf))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### RoBERTa\n", "\n", "Same architecture as BERT but better hyperparameter tuning and more training data ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)):\n", "\n", "- CC-News (76GB)\n", "- OpenWebText (38GB)\n", "- Stories (31GB)\n", "\n", "and **no** next-sentence-prediction task (only masked LM).\n", "\n", "Training: 1024 GPUs for one day.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### SNLI results\n", "\n", "| Model | Accuracy |\n", "|---|---|\n", "| LSTM | 77.6 |\n", "| LSTMs with conditional encoding | 80.9 |\n", "| LSTMs with conditional encoding + attention | 82.3 |\n", "| LSTMs with word-by-word attention | 83.5 |\n", "| Self-attention | 85.6 |\n", "| BERT$_\\mathrm{BASE}$ | 89.2 |\n", "| BERT$_\\mathrm{LARGE}$ | 90.4 |\n", "| RoBERTa$_\\mathrm{BASE}$ | 90.7 |\n", "| 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### How is that different from ELMo and GPT-$n$?\n", "\n", "<center>\n", " <img src=\"mt_figures/bert_gpt_elmo.png\" width=100%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT tokenisation: not words, but WordPieces\n", "\n", "WordPiece and BPE (byte-pair encoding) tokenise text to **subwords** ([Sennrich et al., 2016](https://aclanthology.org/P16-1162/), [Wu et al., 2016](https://arxiv.org/abs/1609.08144v2))" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* BERT has a [30,000 WordPiece vocabulary](https://huggingface.co/bert-base-cased/blob/main/vocab.txt), including ~10,000 unique characters.\n", "* (Almost) no unknown words: rare words are split into subwords that are in the vocabulary!" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "<center>\n", " <img src=\"https://vamvas.ch/assets/bert-for-ner/tokenizer.png\" width=60%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://vamvas.ch/bert-for-ner\">BERT for NER</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "Pretty similar to [word2vec](dl-representations_simple.ipynb):\n", "\n", "<center>\n", " <img src=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html\">Visualizing BERT</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "<center>\n", " <img src=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc-house.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html\">Visualizing BERT</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "<center>\n", " <img src=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc-suffixes.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html\">Visualizing BERT</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Transformer LMs as pre-trained representations\n", "\n", "<center>\n", " <img src=\"https://d3i71xaburhd42.cloudfront.net/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035/4-Figure1-1.png\" width=90%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf\">Radford et al., 2018</a>)\n", "</div>" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Token, segment, and position embeddings in BERT (and friends)\n", "\n", "<center>\n", " <img src=\"https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/5-Figure2-1.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] },
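{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Inspecting the input encoding\n", "\n", "A minimal sketch (our own example, assuming the `bert-base-cased` checkpoint) of how a sentence pair becomes WordPieces, token ids, and segment ids with HuggingFace Transformers:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# Tokenise a sentence pair the way BERT sees it.\n", "from transformers import AutoTokenizer\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")\n", "\n", "encoding = tokenizer(\"Copenhagen is lovely.\", \"It really is.\")\n", "\n", "# WordPieces, including [CLS]/[SEP] and '##' continuation pieces:\n", "print(tokenizer.convert_ids_to_tokens(encoding[\"input_ids\"]))\n", "\n", "# Segment ids distinguish sentence A (0) from sentence B (1);\n", "# position embeddings are added inside the model by token index.\n", "print(encoding[\"token_type_ids\"])" ] },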
" <img src=\"https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/5-Figure2-1.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends)\n", "\n", "<center>\n", " <img src=\"https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/3-Figure1-1.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends) for NLI\n", "\n", "<center>\n", " <img src=\"https://production-media.paperswithcode.com/models/roberta-multichoice.png-0000000931-36fb4743.png\" width=70%/>\n", "</center>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Using BERT (and friends) for various tasks\n", "\n", "<center>\n", " <img src=\"http://jalammar.github.io/images/bert-tasks.png\" width=70%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"https://www.aclweb.org/anthology/N19-1423.pdf\">Devlin et al., 2019</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Which layer to use?\n", "\n", "<center>\n", " <img src=\"http://jalammar.github.io/images/bert-contexualized-embeddings.png\" width=80%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"http://jalammar.github.io/illustrated-bert/\">The Illustrated BERT</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "<center>\n", " <img src=\"http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png\" width=80%/>\n", "</center>\n", "\n", "<div style=\"text-align: right;\">\n", " (from <a href=\"http://jalammar.github.io/illustrated-bert/\">The Illustrated BERT</a>)\n", "</div>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Multilingual BERT\n", "\n", "* One model pre-trained on 104 languages with the largest Wikipedias\n", "* 110k *shared* WordPiece vocabulary\n", "* Same architecture as BERT$_\\mathrm{BASE}$: $L=12, H=768, A=12$\n", "* Same training objectives, **no cross-lingual signal**\n", "\n", "https://github.com/google-research/bert/blob/master/multilingual.md" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Other multilingual transformers\n", "\n", "+ XLM and XLM-R ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf))\n", "+ DistilmBERT ([Sanh et al., 2020](https://arxiv.org/pdf/1910.01108.pdf)) is a lighter version of mBERT\n", "+ Many monolingual BERTs for languages other than English\n", "([CamemBERT](https://arxiv.org/pdf/1911.03894.pdf),\n", "[BERTje](https://arxiv.org/pdf/1912.09582),\n", "[Nordic BERT](https://github.com/botxo/nordic_bert)...)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary #\n", "\n", "* Static word embeddings do not differ depending on context\n", "* Contextualised representations are dynamic\n", "* 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary #\n", "\n", "* Static word embeddings do not differ depending on context\n", "* Contextualised representations are dynamic\n", "* Popular pre-trained contextual representations:\n", "  * ELMo: bidirectional language model with LSTMs\n", "  * GPT: transformer language models\n", "  * BERT: transformer masked language model" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Outlook #\n", "\n", "* Transformer models keep coming out: larger, and trained on more data, more languages, and more domains\n", "  + Increasing energy usage and climate impact: see https://github.com/danielhers/climate-awareness-nlp\n", "* In the machine translation lecture, you will learn how to use them for cross-lingual tasks" ] },
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Additional Reading #\n", "\n", "+ [Jurafsky & Martin Chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf)\n", "+ Jay Alammar's blog posts:\n", "  + [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)\n", "  + [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/)" ] }
], "metadata": { "celltoolbar": "Slideshow", "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 1 }