{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2016-12-02T17:30:29.181539", "start_time": "2016-12-02T17:30:29.172204" }, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "{'theme': 'white',\n", " 'transition': 'none',\n", " 'controls': 'false',\n", " 'progress': 'true'}" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reveal.js\n", "from notebook.services.config import ConfigManager\n", "cm = ConfigManager()\n", "cm.update('livereveal', {\n", " 'theme': 'white',\n", " 'transition': 'none',\n", " 'controls': 'false',\n", " 'progress': 'true',\n", "})" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "# %cd ..\n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "util.execute_notebook('language_models.ipynb')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from IPython.display import Image\n", "import random" ] }, { "cell_type": "markdown", "metadata": { "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Contextualised Word Representations\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What makes a good word representation? ##\n", "\n", "1. Representations are **distinct**\n", "2. **Similar** words have **similar** representations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Reminder: word2vec\n", "\n", "
\n", "\n", "
\n", " (word2vec: Mikolov et al., 2013)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Disadvantage of Static Word Embeddings\n", "\n", "* No context (or maybe small fixed context window) - the representation depends only on the word itself\n", "\n", "How can we address this shortcoming?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## What does this mean? ##\n", "\n", "\n", "* \"Yesterday I saw a bass ...\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/bass_1.jpg'+'?'+str(random.random()), width=300)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/bass_2.svg'+'?'+str(random.random()), width=100)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Contextualised Representations #\n", "\n", "* Static embeddings (e.g., [word2vec](dl-representations_simple.ipynb)) have one representation per word *type*, regardless of context\n", "\n", "* Contextualised representations use the context surrounding the word *token*\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Contextualised Representations Example ##\n", "\n", "\n", "* a) \"Yesterday I saw a bass swimming in the lake\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/bass_1.jpg'+'?'+str(random.random()), width=300)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* b) \"Yesterday I saw a bass in the music shop\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/bass_2.svg'+'?'+str(random.random()), width=100)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Contextualised Representations Example ##\n", "\n", "\n", "* a) \"Yesterday I saw a bass swimming in the lake\".\n", "* b) \"Yesterday I saw a bass in the music shop\"." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/bass_visualisation.jpg'+'?'+str(random.random()), width=500)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What makes a good representation? ##\n", "\n", "1. Representations are **distinct**\n", "2. 
**Similar** words have **similar** representations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Additional criterion:\n", "\n", "3. Representations take **context** into account" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## How to train contextualised representations ##\n", "\n", "Basically like word2vec: predict a word from its context (or vice versa).\n", "\n", "We cannot just use a lookup table (i.e., an embedding matrix) any more.\n", "\n", "Train a network with the sequence as input! Does this remind you of anything?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/elmo_1.png'+'?'+str(random.random()), width=800)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "The hidden state of an RNN LM is a contextualised word representation!\n", "\n", "For the LM to use the hidden state to predict the next word, it should be a generally good sequence representation.\n", "\n", "In this example: a two-layer LSTM LM taken from a model called *ELMo*.\n", "\n", "*(from The Illustrated BERT)*" ] }
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Bidirectional RNN LM ##\n", "\n", "An RNN (or LSTM) LM only considers preceding context.\n", "\n", "ELMo (Embeddings from Language Models) is based on a biLM: *bidirectional language model* ([Peters et al., 2018](https://www.aclweb.org/anthology/N18-1202/))." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/elmo_2.png'+'?'+str(random.random()), width=1200)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/elmo_3.png'+'?'+str(random.random()), width=1200)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "# [tinyurl.com/diku-nlp-bilm](https://tinyurl.com/diku-nlp-bilm)\n", "([Responses](https://docs.google.com/forms/d/1BimPo-S12XWt1qOJLXBTIGjRpt-bVW8H7hmT3j0iRRQ/edit#responses))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Problem: Long-Term Dependencies ##\n", "\n", "LSTMs have *longer-term* memory, but they still forget.\n", "\n", "Solution: *transformers*! ([Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* In 2024, all state-of-the-art LMs are transformers.\n", " * Yes, also GPT-4\n", " * But some [RNN-inspired models](https://github.com/state-spaces/mamba) are in fact on the rise" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## OpenAI GPT (Generative Pre-trained Transformer)\n", "\n", "Series of *decoder-only* neural language models using the *transformer* architecture.\n", "\n", "As contextualised representations, can be accessed using the Embeddings API: https://platform.openai.com/docs/api-reference/embeddings\n", "\n", "As a language model, can be accessed using the chat completions API: https://platform.openai.com/docs/api-reference/chat\n", "\n", "See more in the Transformers lecture.\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "
\n", " (from The Illustrated GPT-2)\n", "
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Image(url='../img/transformers.png'+'?'+str(random.random()), width=400)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## BERT\n", "\n", "**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)), an *encoder-only* transformer.\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT training objective (1): **masked** language modelling (MLM)\n", "\n", "Predict *masked* words given context on both sides:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT Training objective (2): next sentence prediction (NSP)\n", "\n", "Classify whether one sentence follows another using *conditional encoding* of both sentences:\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from The Illustrated BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### How is that different from ELMo and GPT?\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Devlin et al., 2019)\n", "
\n", "\n", "See more in the Attention lecture." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## T5 (Text-to-Text Transfer Transformer)\n", "\n", "An *encoder-decoder* model.\n", "\n", "See more in the Transfer Learning lecture.\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Raffel et al., 2019)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### BERT tokenisation: not words, but WordPieces\n", "\n", "WordPiece and BPE (byte-pair encoding) tokenise text to **subwords** ([Sennrich et al., 2016](https://aclanthology.org/P16-1162/), [Wu et al., 2016](https://arxiv.org/abs/1609.08144v2))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* BERT has a [30,000 WordPiece vocabulary](https://huggingface.co/bert-base-cased/blob/main/vocab.txt), including ~10,000 unique characters.\n", "* No unknown words!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " \n", "
\n", "\n", "
\n", " (from BERT for NER)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "Pretty similar to [word2vec](dl-representations_simple.ipynb):\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Visualizing BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Visualizing BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Visualizing BERT word embeddings\n", "\n", "
\n", " \n", "
\n", "\n", "
\n", " (from Visualizing BERT)\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary #\n", "\n", "* Static word embeddings do not differ depending on context\n", "* Contextualised representations are dynamic\n", "* Popular pre-trained contextual representations:\n", " * ELMo: bidirectional language model with LSTMs\n", " * GPT: transformer language models (decoder-only)\n", " * BERT: transformer masked language model (encoder-only)\n", " * T5: text-to-text transformer (encoder-decoder)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Additional Reading #\n", "\n", "+ [Jurafsky & Martin Chapter 7](https://web.stanford.edu/~jurafsky/slp3/7.pdf)\n", "+ [Jurafsky & Martin Chapter 8](https://web.stanford.edu/~jurafsky/slp3/8.pdf)" ] } ], "metadata": { "celltoolbar": "Slideshow", "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 1 }