In [1]:
# Reveal.js
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
 'theme': 'white',
 'transition': 'none',
 'controls': False,  # booleans, not strings: the string 'false' is truthy in JavaScript
 'progress': True,
})

{'theme': 'white',
 'transition': 'none',
 'controls': False,
 'progress': True}

In [2]:
%%capture
%load_ext autoreload
%autoreload 2
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
util.execute_notebook('language_models.ipynb')

In [3]:
%%html
<script>
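 // Track whether the notebook's code cells are currently visible.
 var code_shown = false;

 function code_toggle() {
   if (code_shown) {
     $('div.input').hide(500);  // numeric duration in ms; the string '500' is not a valid speed
     $('#toggleButton').val('Show Code');
   } else {
     $('div.input').show(500);
     $('#toggleButton').val('Hide Code');
   }
   code_shown = !code_shown;
 }

 $(document).ready(function() {
   code_shown = false;
   $('div.input').hide();
 });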
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [3]:
from IPython.display import Image
import random

# Transformer Language Models

In [8]:
Image(url='mt_figures/transformer.png'+'?'+str(random.random()), width=500)

## BERT

**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)).

<center>
 <img src="https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg" width=40%/>
</center>

<center>
<a href="slides/mlm.pdf"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Sesame_Street_logo.svg/500px-Sesame_Street_logo.svg.png"></a>
</center>

### BERT training objective (1): **masked** language model

Predict masked words given context on both sides:

<center>
 <img src="http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png" width=50%/>
</center>

<div style="text-align: right;">
 (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
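
A minimal sketch of how masked-LM training examples could be built, assuming the corruption scheme described by Devlin et al. (2019): about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token and 10% stay unchanged. The function name `mask_tokens` and the toy vocabulary are illustrative, and each token is selected independently here, which is a simplification.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    if rng is None:
        rng = random.Random(0)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                      # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)                # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                corrupted.append(token)               # 10%: keep the original token
        else:
            corrupted.append(token)                   # not selected: left as-is
            labels.append(None)
    return corrupted, labels

sentence = "the man went to the store to buy a gallon of milk".split()
toy_vocab = sorted(set(sentence))                     # hypothetical vocabulary
corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

The loss is computed only at positions whose label is not `None`; because the encoder sees the whole corrupted sentence, each prediction can use context on both sides.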

### BERT training objective (2): next sentence prediction

**Conditional encoding** of both sentences:

<center>
 <img src="http://jalammar.github.io/images/bert-next-sentence-prediction.png" width=60%/>
</center>

<div style="text-align: right;">
 (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
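
A rough sketch of how a next-sentence-prediction example could be assembled: both sentences are packed into one sequence, `[CLS] A [SEP] B [SEP]`, with segment ids telling them apart. The helper `make_nsp_example` is hypothetical, and drawing the negative sentence from the same small list (rather than from another document) is a simplification.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one NSP example from a list of at least three tokenised sentences."""
    first = sentences[index]
    if rng.random() < 0.5 and index + 1 < len(sentences):
        # positive example: the sentence that really follows
        second, is_next = sentences[index + 1], True
    else:
        # negative example: any sentence except the first one and its true successor
        candidates = [i for i in range(len(sentences)) if i not in (index, index + 1)]
        second, is_next = sentences[rng.choice(candidates)], False
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, segment_ids, is_next

sentences = [
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
    "penguins are flightless birds".split(),
]
tokens, segment_ids, is_next = make_nsp_example(sentences, 0, random.Random(0))
print(tokens, segment_ids, is_next)
```

The binary label `is_next` is predicted from the final representation of `[CLS]`.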

### BERT architecture

Transformer encoder with $L$ layers, hidden size $H$, and $A$ self-attention heads per layer.

* BERT$_\mathrm{BASE}$: $L=12, H=768, A=12$
* BERT$_\mathrm{LARGE}$: $L=24, H=1024, A=16$

(Many other variations available through [HuggingFace Transformers](https://huggingface.co/docs/transformers/index))
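
To get a feel for what $L$ and $H$ imply, here is a back-of-the-envelope parameter count. It is a sketch under assumptions that are not stated on this slide: a 30,522-entry vocabulary (the uncased English checkpoint), 512 positions, 2 segment types, a feed-forward inner size of $4H$, and a pooling layer on top of `[CLS]`.

```python
def bert_parameter_count(L, H, vocab_size=30522, max_positions=512, segments=2):
    """Approximate number of weights in a BERT encoder (defaults are assumptions)."""
    layer_norm = 2 * H                                      # scale and bias
    embeddings = (vocab_size + max_positions + segments) * H + layer_norm
    attention = 4 * (H * H + H)                             # query, key, value, output projections
    feed_forward = (H * 4 * H + 4 * H) + (4 * H * H + H)    # H -> 4H -> H
    layer = attention + feed_forward + 2 * layer_norm
    pooler = H * H + H                                      # dense layer on top of [CLS]
    return embeddings + L * layer + pooler

print(bert_parameter_count(L=12, H=768))    # 109482240, i.e. roughly 110M
print(bert_parameter_count(L=24, H=1024))   # 335141888, i.e. roughly 335M
```

$A$ does not appear in the count: the heads split the $H$ dimensions between them, so more heads do not add parameters.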

Trained on 16GB of text from English Wikipedia and BookCorpus.

* BERT$_\mathrm{BASE}$: 4 Cloud TPUs (16 TPU chips) for 4 days
* BERT$_\mathrm{LARGE}$: 16 Cloud TPUs (64 TPU chips) for 4 days

#### SNLI results

| Model | Accuracy (%) |
|---|---|
| LSTM | 77.6 |
| LSTMs with conditional encoding | 80.9 |
| LSTMs with conditional encoding + attention | 82.3 |
| LSTMs with word-by-word attention | 83.5 |
| Self-attention | 85.6 |
| BERT$_\mathrm{BASE}$ | 89.2 |
| BERT$_\mathrm{LARGE}$ | 90.4 |

([Zhang et al., 2019](https://bcmi.sjtu.edu.cn/home/zhangzs/pubs/paclic33.pdf))

### RoBERTa

Same architecture as BERT, but with better-tuned hyperparameters and more training data ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)). On top of BERT's 16GB:

- CC-News (76GB)
- OpenWebText (38GB)
- Stories (31GB)

and **no** next-sentence-prediction task (only masked LM).

Training: 1024 GPUs for one day.


#### SNLI results

| Model | Accuracy (%) |
|---|---|
| LSTM | 77.6 |
| LSTMs with conditional encoding | 80.9 |
| LSTMs with conditional encoding + attention | 82.3 |
| LSTMs with word-by-word attention | 83.5 |
| Self-attention | 85.6 |
| BERT$_\mathrm{BASE}$ | 89.2 |
| BERT$_\mathrm{LARGE}$ | 90.4 |
| RoBERTa$_\mathrm{BASE}$ | 90.7 |
| RoBERTa$_\mathrm{LARGE}$ | 91.4 |

([Sun et al., 2020](https://arxiv.org/abs/2012.01786))

### How is that different from ELMo and GPT-$n$?

<center>
 <img src="mt_figures/bert_gpt_elmo.png" width=100%/>
</center>

<div style="text-align: right;">
 (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>
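
One way to read the figure is in terms of which positions each token may look at. The sketch below builds the two attention patterns for a toy sequence; the function name `attention_mask` is illustrative. ELMo has no attention mask at all: it concatenates the states of separately trained left-to-right and right-to-left LSTMs.

```python
def attention_mask(n, causal):
    """mask[i][j] is True if position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def show(mask):
    for row in mask:
        print(" ".join("x" if allowed else "." for allowed in row))

print("BERT (bidirectional):")
show(attention_mask(4, causal=False))
print("GPT (left-to-right):")
show(attention_mask(4, causal=True))
```

With the causal mask, the representation of token $i$ depends only on tokens $1..i$, which is what allows GPT to be trained as an ordinary language model; BERT's full mask is why it needs the masked-LM objective instead.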

### BERT tokenisation: not words, but WordPieces

BPE (byte-pair encoding; [Sennrich et al., 2016](https://aclanthology.org/P16-1162/)) and WordPiece ([Wu et al., 2016](https://arxiv.org/abs/1609.08144v2)) tokenise text into **subwords**.

* BERT has a [30,000 WordPiece vocabulary](https://huggingface.co/bert-base-cased/blob/main/vocab.txt), including ~10,000 unique characters.
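* (Almost) no unknown words: a word missing from the vocabulary is split into known subwords, down to single characters if necessary.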

<center>
 <img src="https://vamvas.ch/assets/bert-for-ner/tokenizer.png" width=60%/>
</center>

<div style="text-align: right;">
 (from <a href="https://vamvas.ch/bert-for-ner">BERT for NER</a>)
</div>
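
A minimal sketch of WordPiece-style tokenisation of a single word, assuming greedy longest-match-first lookup with `##` marking word-internal pieces. The toy vocabulary is made up for illustration; the real vocabulary is learned from data.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                        # try the longest substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate      # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                         # not even a single character matched
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}  # hypothetical
print(wordpiece_tokenize("playing", toy_vocab))        # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", toy_vocab))   # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))            # ['[UNK]']
```

Since the real vocabulary also contains individual characters, the `[UNK]` fallback is only reached for characters the vocabulary has never seen.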

### Visualizing BERT word embeddings

Pretty similar to [word2vec](dl-representations_simple.ipynb):

<center>
 <img src="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc.png" width=70%/>
</center>

<div style="text-align: right;">
 (from <a href="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html">Visualizing BERT</a>)
</div>

### Visualizing BERT word embeddings

<center>
 <img src="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc-house.png" width=70%/>
</center>

<div style="text-align: right;">
 (from <a href="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html">Visualizing BERT</a>)
</div>

### Visualizing BERT word embeddings

<center>
 <img src="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert-voc-suffixes.png" width=70%/>
</center>

<div style="text-align: right;">
 (from <a href="https://home.ttic.edu/~kgimpel/viz-bert/viz-bert.html">Visualizing BERT</a>)
</div>

## Transformer LMs as pre-trained representations

<center>
 <img src="https://d3i71xaburhd42.cloudfront.net/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035/4-Figure1-1.png" width=90%/>
</center>

<div style="text-align: right;">
 (from <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">Radford et al., 2018</a>)
</div>

### Text and position embeddings in BERT (and friends)

<center>
 <img src="https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/5-Figure2-1.png" width=70%/>
</center>

<div style="text-align: right;">
 (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>
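
The figure can be sketched as three lookup tables whose rows are added element-wise. The code below uses tiny random tables and plain Python lists; the names `build_table` and `embed` are illustrative, and the layer normalisation and dropout that follow in the real model are left out.

```python
import random

def build_table(rows, dim, rng):
    """A toy embedding table: one random vector per row."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Input vector at each position = token + segment + position embedding."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows = zip(token_table[tok], segment_table[seg], position_table[position])
        vectors.append([t + s + p for t, s, p in rows])
    return vectors

rng = random.Random(0)
token_table = build_table(10, 4, rng)      # toy vocabulary of 10 tokens, dimension 4
segment_table = build_table(2, 4, rng)     # sentence A / sentence B
position_table = build_table(8, 4, rng)    # up to 8 positions

# token 3 occurs twice, but in different segments and at different positions
inputs = embed([3, 5, 3], [0, 0, 1], token_table, segment_table, position_table)
print(inputs[0] == inputs[2])              # False
```

The same token therefore enters the first layer with a different vector depending on where it occurs and which sentence it belongs to.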

### Using BERT (and friends)

<center>
 <img src="https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/3-Figure1-1.png" width=70%/>
</center>

<div style="text-align: right;">
 (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>

### Using BERT (and friends) for NLI

<center>
 <img src="https://production-media.paperswithcode.com/models/roberta-multichoice.png-0000000931-36fb4743.png" width=70%/>
</center>
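
For NLI, premise and hypothesis are encoded together as `[CLS] premise [SEP] hypothesis [SEP]`, and a small classifier reads the final `[CLS]` vector. The sketch below shows only that classification head, assuming a single linear layer followed by a softmax over the three SNLI labels; the weights and the 2-dimensional `[CLS]` vector are made-up toy values.

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, bias):
    """Linear layer over the [CLS] representation, then softmax."""
    scores = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(scores)
    return LABELS[probs.index(max(probs))], probs

cls_vector = [1.0, 0.0]                              # toy [CLS] representation
weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]      # one row per label
bias = [0.0, 0.0, 0.0]
label, probs = classify(cls_vector, weights, bias)
print(label)                                         # entailment
```

During fine-tuning, both this head and the encoder underneath it are updated.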

### Using BERT (and friends) for various tasks

<center>
 <img src="http://jalammar.github.io/images/bert-tasks.png" width=70%/>
</center>

<div style="text-align: right;">
 (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>

### Which layer to use?

<center>
 <img src="http://jalammar.github.io/images/bert-contexualized-embeddings.png" width=80%/>
</center>

<div style="text-align: right;">
 (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>

<center>
 <img src="http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png" width=80%/>
</center>

<div style="text-align: right;">
 (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
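
When BERT is used as a feature extractor, the per-layer vectors of a token can be combined in several ways, for example by taking only the last layer, summing the last four, or concatenating the last four. The sketch below illustrates these on toy data; the hidden states are placeholders, not real model outputs.

```python
def sum_layers(layers):
    """Element-wise sum of several layer vectors for one token."""
    return [sum(values) for values in zip(*layers)]

def concat_layers(layers):
    """Concatenate several layer vectors for one token into one long vector."""
    return [value for layer in layers for value in layer]

# toy hidden states for a single token: 12 layers, dimension 4,
# where layer k is simply the vector [k, k, k, k]
hidden_states = [[float(k)] * 4 for k in range(1, 13)]

last_layer = hidden_states[-1]
sum_last_four = sum_layers(hidden_states[-4:])
concat_last_four = concat_layers(hidden_states[-4:])

print(last_layer)              # [12.0, 12.0, 12.0, 12.0]
print(sum_last_four)           # [42.0, 42.0, 42.0, 42.0]
print(len(concat_last_four))   # 16
```

Summing keeps the dimension at $H$, while concatenating $k$ layers yields a $kH$-dimensional feature vector for the downstream model.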

## Multilingual BERT

* One model pre-trained on the 104 languages with the largest Wikipedias
* *Shared* WordPiece vocabulary of 110k tokens
* Same architecture as BERT$_\mathrm{BASE}$: $L=12, H=768, A=12$
* Same training objectives, **no cross-lingual signal**

See the [multilingual BERT README](https://github.com/google-research/bert/blob/master/multilingual.md).

### Other multilingual transformers

+ XLM ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf)) and XLM-R ([Conneau et al., 2020](https://arxiv.org/pdf/1911.02116.pdf))
+ DistilmBERT ([Sanh et al., 2020](https://arxiv.org/pdf/1910.01108.pdf)) is a lighter version of mBERT
+ Many monolingual BERTs for languages other than English
([CamemBERT](https://arxiv.org/pdf/1911.03894.pdf),
[BERTje](https://arxiv.org/pdf/1912.09582),
[Nordic BERT](https://github.com/botxo/nordic_bert)...)

# Summary #

* Static word embeddings do not differ depending on context
* Contextualised representations are dynamic
* Popular pre-trained contextual representations:
    * ELMo: bidirectional language model with LSTMs
    * GPT: left-to-right transformer language model
    * BERT: transformer masked language model

# Outlook #

* Transformer models keep coming out: larger, trained on more data, languages and domains, etc.
    + Increasing energy usage and climate impact: see https://github.com/danielhers/climate-awareness-nlp
* In the machine translation lecture, you will learn how to use them for cross-lingual tasks

# Additional Reading #

+ [Jurafsky & Martin Chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf)
+ Jay Alammar's blog posts:
    + [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)
    + [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/)