{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
\n", "" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "import statnlpbook.sequence as seq\n", "from statnlpbook.gmb import load_gmb_dataset\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "plt.rcParams['figure.figsize'] = (8.0, 5.0)\n", "from collections import defaultdict, Counter\n", "from random import random\n", "\n", "from IPython.display import Image" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\bar}{\\,|\\,}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { 
"slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The tikzmagic extension is already loaded. To reload it, use:\n", " %reload_ext tikzmagic\n" ] } ], "source": [ "%load_ext tikzmagic" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sequence Labelling" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "+ POS tagging\n", "+ Log-linear models\n", "+ IOB encoding\n", "+ Named entity recognition\n", "+ Evaluating sequence labelers\n", "+ Maximum-entropy Markov models" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sequence Labelling\n", "\n", "+ Assigning exactly one label to each element in a sequence\n", "\n", "+ In context of RNNs or other sequence models, example of **one-to-one** paradigm\n", "\n", "
\n", "\n", "\n", "(In the example: Universal Semantic Tags from [Abzianidze and Bos (2017)](https://www.aclweb.org/anthology/W17-6901.pdf))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Parts of speech (POS)\n", "\n", "- Group words with **similar grammatical properties**\n", "- [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) is the most commonly used POS tag set for English\n", " - Has 36 POS tags and 12 other tags (for punctuation and currency symbols)\n", " - For example, distinguishes four types of nouns and six types of verbs\n", "\n", "| | | |\n", "|-|-|-|\n", "| **NN** | noun, singular or mass | *cat, rain* |\n", "| **NNS** | noun, plural | *cats, tables* |\n", "| **NNP** | proper noun, singular | *John, IBM* |\n", "| **NNPS** | proper noun, plural | *Muslims, Philippines* |\n", "\n", "\n", "- Granularity of tags can differ" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Task: POS tagging\n", "\n", "Assign each word in a sentence its **part-of-speech (POS) tag**.\n", "\n", "| 1 | 2 | 3 | 4 | 5 | 6 | 7 |\n", "|-|-|-|-|-|-|-|\n", "| I | predict | that | it | will | rain | tonight |\n", "| PRP | VBP | IN | PRP | MD | VB | NN |" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Example\n", "\n", "Let's look at the [GMB (Groningen Meaning Bank) dataset](https://www.kaggle.com/shoumikgoswami/annotated-gmb-corpus/), annotated with the Penn Treebank tag set\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th><th>13</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>They</td><td>marched</td><td>from</td><td>the</td><td>Houses</td><td>of</td><td>Parliament</td><td>to</td><td>a</td><td>rally</td><td>in</td><td>Hyde</td><td>Park</td><td>.</td></tr>\n",
"    <tr><th>1</th><td>PRP</td><td>VBD</td><td>IN</td><td>DT</td><td>NNS</td><td>IN</td><td>NN</td><td>TO</td><td>DT</td><td>NN</td><td>IN</td><td>NNP</td><td>NNP</td><td>.</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 10 11 \\\n", "0 They marched from the Houses of Parliament to a rally in Hyde \n", "1 PRP VBD IN DT NNS IN NN TO DT NN IN NNP \n", "\n", " 12 13 \n", "0 Park . \n", "1 NNP . " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens, pos, ents = load_gmb_dataset('../data/gmb/GMB_dataset_utf8.txt')\n", "\n", "pd.DataFrame([tokens[2], pos[2]])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "examples = {}\n", "counts = Counter(tag for sent in pos for tag in sent)\n", "words = defaultdict(set)\n", "for x_s, y_s in zip(tokens, pos):\n", " for i, (x, y) in enumerate(zip(x_s, y_s)):\n", " if (y not in examples) or (random() > 0.97):\n", " examples[y] = [x_s[j] + \"/\" + y_s[j] if i == j else x_s[j] for j in range(max(i-2,0),min(i+3,len(x_s)))]\n", " words[y].add(x)\n", "sorted_tags = sorted(counts.items(),key=lambda x:-x[1])\n", "sorted_tags_with_examples = [(t,c,len(words[t]),\" \".join(examples[t])) for t,c in sorted_tags]\n", "\n", "sorted_tags_table = pd.DataFrame(sorted_tags_with_examples, columns=['Tag','Count','Unique Tokens','Example'])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Tag</th><th>Count</th><th>Unique Tokens</th><th>Example</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>NN</td><td>9307</td><td>2087</td><td>'s transitional government-in-exile/NN met Tue...</td></tr>\n",
"    <tr><th>1</th><td>NNP</td><td>8189</td><td>2069</td><td>restrictions on U.N./NNP peacekeepers in</td></tr>\n",
"    <tr><th>2</th><td>IN</td><td>7759</td><td>94</td><td>way out of/IN the stalemate</td></tr>\n",
"    <tr><th>3</th><td>DT</td><td>6310</td><td>40</td><td>The/DT chief investigating</td></tr>\n",
"    <tr><th>4</th><td>JJ</td><td>4875</td><td>1214</td><td>It is unclear/JJ how many</td></tr>\n",
"    <tr><th>5</th><td>NNS</td><td>4803</td><td>1102</td><td>Both men/NNS have yet</td></tr>\n",
"    <tr><th>6</th><td>.</td><td>2992</td><td>3</td><td>peace process ./.</td></tr>\n",
"    <tr><th>7</th><td>VBD</td><td>2429</td><td>470</td><td>Israel completed/VBD its handover</td></tr>\n",
"    <tr><th>8</th><td>VBN</td><td>2060</td><td>588</td><td>2001 have put/VBN $ 880</td></tr>\n",
"    <tr><th>9</th><td>,</td><td>1953</td><td>1</td><td>said Hezbollah ,/, along with</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Tag Count Unique Tokens \\\n", "0 NN 9307 2087 \n", "1 NNP 8189 2069 \n", "2 IN 7759 94 \n", "3 DT 6310 40 \n", "4 JJ 4875 1214 \n", "5 NNS 4803 1102 \n", "6 . 2992 3 \n", "7 VBD 2429 470 \n", "8 VBN 2060 588 \n", "9 , 1953 1 \n", "\n", " Example \n", "0 's transitional government-in-exile/NN met Tue... \n", "1 restrictions on U.N./NNP peacekeepers in \n", "2 way out of/IN the stalemate \n", "3 The/DT chief investigating \n", "4 It is unclear/JJ how many \n", "5 Both men/NNS have yet \n", "6 peace process ./. \n", "7 Israel completed/VBD its handover \n", "8 2001 have put/VBN $ 880 \n", "9 said Hezbollah ,/, along with " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted_tags_table[:10]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.xticks(rotation=45, fontsize=7)\n", "plt.bar(sorted(counts.keys(), key=counts.get), sorted(counts.values()))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Sequence Labelling as Structured Prediction\n", "\n", "* Input Space $\\Xs$: sequences of items to label\n", "* Output Space $\\Ys$: sequences of output labels\n", "* Model: $s_{\\params}(\\x,\\y)$\n", "* Prediction: $\\argmax_\\y s_{\\params}(\\x,\\y)$" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Conditional Models\n", "Model probability distributions over label sequences $\\y$ conditioned on input sequences $\\x$\n", "\n", "$$\n", "s_{\\params}(\\x,\\y) = \\prob_\\params(\\y|\\x)\n", "$$\n", "\n", "* Just like the conditional models from the [text classification](doc_classify_slides_short.ipynb) chapter\n", "\n", "* But the label space is *exponential* (as a function of sequence length)!\n", "\n", "* Most unique $\\y$ are never even seen in training\n", "\n", "* Might be useful to **break it up**?" 
] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Local Models / Classifiers\n", "A **fully factorised** or **local** model:\n", "\n", "$$\n", "p_\\params(\\y|\\x) = \\prod_{i=1}^n p_\\params(y_i|\\x,i,y_{1,\\ldots,i-1}) \\approx \\prod_{i=1}^n p_\\params(y_i|\\x,i)\n", "$$\n", "\n", "* Assumption: labels are independent of each other given the input\n", "* Inference in this model is trivial: **greedy decoding**" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "scrolled": false, "slideshow": { "slide_type": "fragment" } }, "source": [ "$$\n", "\\prob_\\params(\\text{\"PRP MD VB\"} \\bar \\text{\"it will rain\"}) \\approx \\\\\\\\ \\prob_\\params(\\text{\"PRP\"}\\bar \\text{\"it will rain\"},1) \\cdot \\\\ \\prob_\\params(\\text{\"MD\"} \\bar \\text{\"it will rain\"},2) \\cdot \\\\ \\prob_\\params(\\text{\"VB\"} \\bar \\text{\"it will rain\"},3)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "scrolled": true, "slideshow": { "slide_type": "fragment" } }, "source": [ "Does this remind you of anything you've seen in previous lectures?" 
] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "scrolled": false, "slideshow": { "slide_type": "skip" } }, "source": [ "$$\n", "\\prob_\\params(\\text{\"it will rain\"}) \\approx \\prob_\\params(\\text{\"it\"}) \\cdot \\prob_\\params(\\text{\"will\"} \\bar \\text{\"it\"}) \\cdot \\prob_\\params(\\text{\"rain\"} \\bar \\text{\"it will\"})\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "scrolled": false, "slideshow": { "slide_type": "skip" } }, "source": [ "### Graphical Representation\n", "\n", "- Models can be represented as factor graphs\n", "- Each variable of the model (our per-token tag labels and the input sequence $\\x$) is drawn as a circle\n", "- *Observed* variables are shaded\n", "- Each factor in the model (a term in the product) is drawn as a box that connects the variables appearing in the corresponding term\n", "  - For example, the term $p_\\params(y_3|\\x,3)$ would connect the variables $y_3$ and $\\x$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Parametrisation\n", "\n", "**Log-linear multiclass classifier** $p_\\params(y\\bar\\x,i)$ that predicts the tag $y$ for sentence $\\x$ at position $i$\n", "\n", "$$\n", " p_\\params(y\\bar\\x,i) = \\frac{1}{Z_\\x} \\exp \\langle \\repr(\\x,i),\\params_y \\rangle\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "+ $\\repr(\\x,i)$ is a **feature function**\n", "+ ${Z_\\x} > 0$ is a normalisation factor to ensure that $\\sum_{y} p_\\params(y\\bar\\x,i) = 1$\n", "\n", "+ How far can we get with very simple features that only consider the word types (and no context)?" 
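] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "It may help to spell out the normaliser of the log-linear classifier above. This is only a sketch under the definitions already given: $y'$ is an assumed name for a summation variable ranging over all tags, and strictly speaking $Z_\\x$ also depends on the position $i$.\n",
"\n",
"$$\n",
"Z_\\x = \\sum_{y'} \\exp \\langle \\repr(\\x,i),\\params_{y'} \\rangle\n",
"$$\n",
"\n",
"Because $Z_\\x$ is positive and identical for every candidate tag $y$, and $\\exp$ is monotone, picking the most probable tag at position $i$ only requires comparing the unnormalised scores:\n",
"\n",
"$$\n",
"\\argmax_y \\, p_\\params(y\\bar\\x,i) = \\argmax_y \\, \\langle \\repr(\\x,i),\\params_y \\rangle\n",
"$$" 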
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Bias:\n", "$$\n", "\\repr_0(\\x,i) = 1\n", "$$\n", "\n", "Word at token to tag:\n", "$$\n", "\\repr_w(\\x,i) = \\begin{cases}1 \\text{ if }x_i=w \\\\\\\\ 0 \\text{ else} \\end{cases}\n", "$$" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def feat_1(x,i):\n", " return {\n", " 'bias': 1.0,\n", " 'word:' + x[i]: 1.0,\n", " }\n", "\n", "train = list(zip(tokens[:-200], pos[:-200]))\n", "dev = list(zip(tokens[-200:], pos[-200:]))\n", "\n", "local_1 = seq.LocalSequenceLabeler(feat_1, train, class_weight='balanced')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can assess the accuracy of this model on the development set." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0.8872215709261431" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq.accuracy(dev, local_1.predict(dev))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Problem 1: unknown words\n", "\n", "Many words are new, but we should still be able to tag them based on form or context:\n", "\n", "> ’Twas brillig, and the slithy toves \n", "Did gyre and gimble in the wabe: \n", "All mimsy were the borogoves, \n", "And the mome raths outgrabe.\n", "\n", "(Jabberwocky by Lewis Carroll)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### How to Improve?\n", "\n", "Look at **confusion matrix**" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ 
"import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize'] = [7, 7]\n", "import matplotlib.pylab as plb\n", "plb.rcParams['figure.dpi'] = 120" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "seq.plot_confusion_matrix(dev, local_1.predict(dev), normalise=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* mostly strong diagonal (good predictions)\n", "* `NN` receives a lot of wrong counts, often confused with `NNP`" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Previous\n", "  \n", " Next\n", "
\n", "
Thewalkoutwillshutdownthecity
DTNNMDVBDTNN
DTNNPMDNNDTNN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:walkout
1.01.0
3.730.00
3.730.00
1 / 60
\n", "
,andthrowthedailycommuteofitssevenmillion
,CCVBDTJJNNINPRP$CDCD
,CCVBDTRBNNPINPRP$CDCD
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:commute
1.01.0
3.730.00
3.730.00
2 / 60
\n", "
Contracttalksbetweenthetwo
NNNNSINDTCD
NNPNNSINDTCD
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:Contract
1.01.0
3.730.00
3.730.00
3 / 60
\n", "
deadlockedoversuchissuesaswageincreasesandatwhat
VBNINJJNNSINNNNNSCCINWP
NNPRPPDTNNSINNNPNNSCCINWP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:wage
1.01.0
3.730.00
3.730.00
4 / 60
\n", "
eligibletoreceiveafullpension.
JJTOVBDTJJNN.
JJTOVBDTJJNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:pension
1.01.0
3.730.00
3.730.00
5 / 60
\n", "
UnionheadRogerToussaintcalls
NNNNNNPNNPVBZ
NNPVBNNPNNPVBZ
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:Union
1.01.0
3.73-0.52
3.734.19
6 / 60
\n", "
ineffecttopreventmassivetrafficjamsonNewYork
INNNTOVBJJNNNNSINNNPNNP
INNNTOVBJJNNPNNPINNNPNNP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:traffic
1.01.0
3.730.00
3.730.00
7 / 60
\n", "
hopeCambodiacanbeamodelfortherestof
NNNNPMDVBDTNNINDTNNIN
VBPNNPMDVBDTNNPINDTNNIN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:model
1.01.0
3.730.00
3.730.00
8 / 60
\n", "
Clinton'svisitwillraiseawarenessaboutH.I.V.andhelp
NNPPOSNNMDVBNNINNNPCCVB
NNPPOSVBMDVBNNPINNNPCCVB
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:awareness
1.01.0
3.730.00
3.730.00
9 / 60
\n", "
allactsofabuseandbrutalityandtreatsallegationsof
DTNNSINNNCCNNCCVBZNNSIN
PDTNNSINNNCCNNPCCNNPNNSIN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:brutality
1.01.0
3.730.00
3.730.00
10 / 60
\n", "
itwasreleasedbytheNewsoftheWorldnewspaper
PRPVBDVBNINDTNNINDTNNPNN
PRPVBDVBNINDTNNPINDTNNPNN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:News
1.01.0
3.73-0.39
3.732.51
11 / 60
\n", "
isheadofMogadishu'sambulanceservice,saysat
VBZNNINNNPPOSNNNN,VBZIN
VBZVBINNNPPOSNNPNN,VBZIN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:ambulance
1.01.0
3.730.00
3.730.00
12 / 60
\n", "
promptingsoldierstolaunchanartillerybarrage.
VBGNNSTOVBDTNNNN.
VBGNNSTOVBDTNNPNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:artillery
1.01.0
3.730.00
3.730.00
13 / 60
\n", "
soldierstolaunchanartillerybarrage.
NNSTOVBDTNNNN.
NNSTOVBDTNNPNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:barrage
1.01.0
3.730.00
3.730.00
14 / 60
\n", "
twodecadesofviolenceandlawlessnesssincethefallof
CDNNSINNNCCNNINDTNNIN
CDNNSINNNCCNNPINDTVBIN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:lawlessness
1.01.0
3.730.00
3.730.00
15 / 60
\n", "
ThefounderofMicrosoft,Bill
DTNNINNNP,NNP
DTNNPINNNP,NNP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:founder
1.01.0
3.730.00
3.730.00
16 / 60
\n", "
universitystudentseagerforaglimpseoftheworld's
NNNNSJJINDTNNINDTNNPOS
NNNNSNNPINDTNNPINDTNNPOS
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:glimpse
1.01.0
3.730.00
3.730.00
17 / 60
\n", "
thegovernmentlandedahightechdealwhenleadingchipmaker
DTNNVBDDTJJNNNNWRBVBGNN
DTNNVBNDTJJNNPNNWRBVBGNNP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:tech
1.01.0
3.730.00
3.730.00
18 / 60
\n", "
hightechdealwhenleadingchipmakerIntelCorporationannouncedit
JJNNNNWRBVBGNNNNPNNPVBDPRP
JJNNPNNWRBVBGNNPNNPNNPVBDPRP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:chipmaker
1.01.0
3.730.00
3.730.00
19 / 60
\n", "
Theingredient,calledartemisinin,isextractedfrom
DTNN,VBNNN,VBZVBNIN
DTNN,VBDNNP,VBZVBNIN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:artemisinin
1.01.0
3.730.00
3.730.00
20 / 60
\n", "
formofmalaria,calledfalciparum.
NNINNN,VBDNN.
VBINNN,VBDNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:falciparum
1.01.0
3.730.00
3.730.00
21 / 60
\n", "
usingtheeuroastheircurrencyhavenotdoneas
VBGDTNNINPRP$NNVBPRBVBNRB
VBGDTNNINPRP$NNPVBPRBVBNIN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:currency
1.01.0
3.730.00
3.730.00
22 / 60
\n", "
donotyetknowtheoriginofthedeadbirds
VBPRBRBVBDTNNINDTJJNNS
VBRBRBVBDTNNPINDTJJNNS
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:origin
1.01.0
3.730.00
3.730.00
23 / 60
\n", "
,ValerySitnikov,saidbio-terrorismcannotberuled
,NNPNNP,VBDNNMDRBVBVBN
,NNPNNP,VBDNNPMDRBVBVBN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:bio-terrorism
1.01.0
3.730.00
3.730.00
24 / 60
\n", "
fourmurdersandwastheco-founderoftheinfamousCrips
CDNNSCCVBDDTNNINDTJJNNP
CDNNSCCVBDDTNNPINDTNNPNNP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:co-founder
1.01.0
3.730.00
3.730.00
25 / 60
\n", "
Intwootherdeathpenaltycaseshehasrefused
INCDJJNNNNNNSPRPVBZVBN
INCDJJNNNNPNNSPRPVBZVBN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:penalty
1.01.0
3.730.00
3.730.00
26 / 60
\n", "
Aftertheregistrationperiodends,applicants
INDTNNNNVBZ,NNS
INDTNNPNNVBZ,NNS
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:registration
1.01.0
3.730.00
3.730.00
27 / 60
\n", "
,whichwilldeterminethemake-upofthepresidentialballot
,WDTMDVBDTNNINDTJJNN
,WDTMDVBDTNNPINDTJJNN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:make-up
1.01.0
3.730.00
3.730.00
28 / 60
\n", "
Competitivedivingisoneofthe
JJNNVBZCDINDT
NNPNNPVBZCDINDT
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:diving
1.01.0
3.730.00
3.730.00
29 / 60
\n", "
Recreationaldivingisalsoagrowing
JJNNVBZRBDTVBG
NNPNNPVBZRBDTVBG
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:diving
1.01.0
3.730.00
3.730.00
30 / 60
\n", "
divingisalsoagrowingamateursportintheU.S.
NNVBZRBDTVBGNNNNINDTNNP
NNPVBZRBDTVBGNNPNNINDTNNP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:amateur
1.01.0
3.730.00
3.730.00
31 / 60
\n", "
Unfortunately,thejoyofjumpingoffthe
RB,DTNNINVBGRPDT
NNP,DTNNPINNNPRPDT
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:joy
1.01.0
3.730.00
3.730.00
32 / 60
\n", "
divingboardatthelocalswimmingpoolhastoooften
VBGNNINDTJJNNNNVBZRBRB
NNPNNINDTJJNNPNNPVBZRBRB
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:swimming
1.01.0
3.730.00
3.730.00
33 / 60
\n", "
boardatthelocalswimmingpoolhastoooftenbeen
NNINDTJJNNNNVBZRBRBVBN
NNINDTJJNNPNNPVBZRBRBVBN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:pool
1.01.0
3.730.00
3.730.00
34 / 60
\n", "
AnofficialwithCanada'sspyagencyhassaidthat
DTNNINNNPPOSNNNNVBZVBNIN
DTJJINNNPPOSNNPNNVBZVBDWDT
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:spy
1.01.0
3.730.00
3.730.00
35 / 60
\n", "
beforeMr.Chavez'safternoondeparture.
INNNPNNPPOSNNNN.
INNNPNNPPOSNNNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:departure
1.01.0
3.730.00
3.730.00
36 / 60
\n", "
cancellationsonthepope'sschedule,butdidnot
NNSINDTNNPOSNN,CCVBDRB
NNPINDTNNPOSNNP,CCVBDRB
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:schedule
1.01.0
3.730.00
3.730.00
37 / 60
\n", "
fromParkinson'sdiseaseandarthritis,butcontinuesto
INNNPPOSNNCCNN,CCVBZTO
INNNPPOSNNCCNNP,CCVBZTO
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:arthritis
1.01.0
3.730.00
3.730.00
38 / 60
\n", "
tomaintainafulltravelschedule,holdaudiencesand
TOVBDTJJNNNN,NNNNSCC
TOVBDTJJVBNNP,VBNNPCC
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:schedule
1.01.0
3.730.00
3.730.00
39 / 60
\n", "
holdaudiencesandperformhispapalduties.
NNNNSCCVBPRP$NNNNS.
VBNNPCCVBPPRP$NNPNNS.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:papal
1.01.0
3.730.00
3.730.00
40 / 60
\n", "
IsraelcompleteditshandoveroftheWestBank
NNPVBDPRP$NNINDTNNPNNP
NNPVBNPRP$NNPINDTNNPNNP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:handover
1.01.0
3.730.00
3.730.00
41 / 60
\n", "
armedmenstormedintoabarinwesternMexicoand
JJNNSVBDINDTNNINJJNNPCC
JJNNSVBDINDTNNPINJJNNPCC
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:bar
1.01.0
3.730.00
3.730.00
42 / 60
\n", "
airastheyenteredthebarinMichoacanstatebefore
NNINPRPVBDDTNNINNNPNNIN
NNINPRPVBDDTNNPINNNPNNIN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:bar
1.01.0
3.730.00
3.730.00
43 / 60
\n", "
barinMichoacanstatebeforedawnWednesday.
NNINNNPNNINNNNNP.
NNPINNNPNNINNNPNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:dawn
1.01.0
3.730.00
3.730.00
44 / 60
\n", "
Thegroupalsoleftanotesayingthekillingswere
DTNNRBVBDDTNNVBGDTNNSVBD
DTNNRBVBDDTNNPVBGDTNNSVBD
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:note
1.01.0
3.730.00
3.730.00
45 / 60
\n", "
ofwhatitcalled\"divinejustice.\"
INWPPRPVBD``NNNN.``
INWPPRPVBD``NNPNN.``
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:divine
1.01.0
3.730.00
3.730.00
46 / 60
\n", "
childrenwerekilledinastampedeinthesoutherncity
NNSVBDVBNINDTNNINDTJJNN
NNSVBDVBNINDTNNPINDTJJNN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:stampede
1.01.0
3.730.00
3.730.00
47 / 60
\n", "
werewaitingtogetfreeflour.
VBDVBGTOVBJJNN.
VBDVBGTOVBVBNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:flour
1.01.0
3.730.00
3.730.00
48 / 60
\n", "
groupwasgivingouttheflourinhonorofthe
NNVBDVBGRPDTNNINNNINDT
NNVBDVBGRPDTNNPINVBINDT
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:flour
1.01.0
3.730.00
3.730.00
49 / 60
\n", "
ofthevictimsdiedofsuffocation.
INDTNNSVBDINNN.
INDTNNSVBDINNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:suffocation
1.01.0
3.730.00
3.730.00
50 / 60
\n", "
50membersofIsrael'stheatercommunityhavesigneda
CDNNSINNNPPOSNNNNVBPVBNDT
CDNNSINNNPPOSNNPNNVBPVBNDT
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:theater
1.01.0
3.730.00
3.730.00
51 / 60
\n", "
theatercommunityhavesignedapetitiontoboycottperformancesat
NNNNVBPVBNDTNNTOVBNNSIN
NNPNNVBPVBNDTNNPTOVBNNSIN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:petition
1.01.0
3.730.00
3.730.00
52 / 60
\n", "
boycottperformancesatastate-fundedtheaterinthenorthernWest
VBNNSINDTJJNNINDTJJNNP
VBNNSINDTNNPNNPINDTJJNNP
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:theater
1.01.0
3.730.00
3.730.00
53 / 60
\n", "
Somalia'stransitionalgovernment-in-exilemetTuesdaytotry
NNPPOSJJNNVBDNNPTOVB
NNPPOSJJNNPVBDNNPTOVB
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:government-in-exile
1.01.0
3.730.00
3.730.00
54 / 60
\n", "
weretoperformtheminorpilgrimageknownasOmra.
VBDTOVBDTJJNNVBNINNNP.
VBDTOVBPDTJJNNPVBNINNNP.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:pilgrimage
1.01.0
3.730.00
3.730.00
55 / 60
\n", "
rangingfrommaintainingthestatusquotoafullwithdrawal
VBGINVBGDTNNNNTODTJJNN
VBGINVBGDTNNNNPTODTJJNN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:quo
1.01.0
3.730.00
3.730.00
56 / 60
\n", "
theoperationtoeitheranobserverorliaisoneffort.
DTNNTODTDTNNCCNNNN.
DTNNTORBDTNNPCCNNPNN.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:observer
1.01.0
3.730.00
3.730.00
57 / 60
\n", "
toeitheranobserverorliaisoneffort.
TODTDTNNCCNNNN.
TORBDTNNPCCNNPNN.
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:liaison
1.01.0
3.730.00
3.730.00
58 / 60
\n", "
anyparticularoptionandsaidnoneofferedanidealway
DTJJNNCCVBDNNVBDDTJJNN
DTNNPNNCCVBDNNPVBNDTNNPNN
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:none
1.01.0
3.730.00
3.730.00
59 / 60
\n", "
InthisphotographreleasedbytheIraqi
INDTNNVBNINDTJJ
INDTNNPVBNINDTJJ
\n", " \n", " \n", " \n", " \n", " \n", "
biasword:photograph
1.01.0
3.730.00
3.730.00
60 / 60
\n", "
\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "util.Carousel(local_1.errors(dev,\n", " filter_gold=lambda y: y=='NN',\n", " filter_guess=lambda y: y=='NNP'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* \"walkout\", \"commute\", \"wage\" are misclassified as proper nouns\n", "* The weights of their word features $\\repr_w$ are $0$ for both `NN` and `NNP`, so only the bias contributes to the score\n", "\n", "These words never appeared in the training set!" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Proper nouns tend to be capitalised! Can we capture that with a feature?" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.9087924970691676" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def feat_2(x,i):\n", "    return {\n", "        'bias': 1.0,\n", "        'word:' + x[i].lower(): 1.0,\n", "        'first_upper:' + str(x[i][0].isupper()): 1.0,\n", "    }\n", "local_2 = seq.LocalSequenceLabeler(feat_2, train)\n", "seq.accuracy(dev, local_2.predict(dev))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Is this gain actually caused by improved `NN`/`NNP` prediction?" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "seq.plot_confusion_matrix(dev, local_2.predict(dev), normalise=True)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "\n", "
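<div>\n",
"  <table>\n",
"    <tr><td>Contract</td><td>talks</td><td>between</td><td>the</td><td>two</td></tr>\n",
"    <tr><td>NN</td><td>NNS</td><td>IN</td><td>DT</td><td>CD</td></tr>\n",
"    <tr><td>NNP</td><td>NNS</td><td>IN</td><td>DT</td><td>CD</td></tr>\n",
"  </table>\n",
"  <table>\n",
"    <tr><td>bias</td><td>first_upper:True</td><td>word:contract</td></tr>\n",
"    <tr><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
"    <tr><td>6.71</td><td>3.72</td><td>2.25</td></tr>\n",
"    <tr><td>6.72</td><td>8.41</td><td>-0.01</td></tr>\n",
"  </table>\n",
"  1 / 3\n",
"</div>\n",
"<div>\n",
"  <table>\n",
"    <tr><td>Union</td><td>head</td><td>Roger</td><td>Toussaint</td><td>calls</td></tr>\n",
"    <tr><td>NN</td><td>NN</td><td>NNP</td><td>NNP</td><td>VBZ</td></tr>\n",
"    <tr><td>NNP</td><td>NN</td><td>NNP</td><td>NNP</td><td>VBZ</td></tr>\n",
"  </table>\n",
"  <table>\n",
"    <tr><td>bias</td><td>first_upper:True</td><td>word:union</td></tr>\n",
"    <tr><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
"    <tr><td>6.71</td><td>3.72</td><td>1.44</td></tr>\n",
"    <tr><td>6.72</td><td>8.41</td><td>2.39</td></tr>\n",
"  </table>\n",
"  2 / 3\n",
"</div>\n",
"<div>\n",
"  <table>\n",
"    <tr><td>it</td><td>was</td><td>released</td><td>by</td><td>the</td><td>News</td><td>of</td><td>the</td><td>World</td><td>newspaper</td></tr>\n",
"    <tr><td>PRP</td><td>VBD</td><td>VBN</td><td>IN</td><td>DT</td><td>NN</td><td>IN</td><td>DT</td><td>NNP</td><td>NN</td></tr>\n",
"    <tr><td>PRP</td><td>VBD</td><td>VBN</td><td>IN</td><td>DT</td><td>NNP</td><td>IN</td><td>DT</td><td>NNP</td><td>NN</td></tr>\n",
"  </table>\n",
"  <table>\n",
"    <tr><td>bias</td><td>first_upper:True</td><td>word:news</td></tr>\n",
"    <tr><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
"    <tr><td>6.71</td><td>3.72</td><td>3.98</td></tr>\n",
"    <tr><td>6.72</td><td>8.41</td><td>3.08</td></tr>\n",
"  </table>\n",
"  3 / 3\n",
"</div>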
\n", " " ], "text/plain": [ "" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "util.Carousel(local_2.errors(dev,\n", " filter_gold=lambda y: y=='NN',\n", " filter_guess=lambda y: y=='NNP'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Problem 2: ambiguity\n", "\n", "*Polysemous* words or *homonyms* have multiple senses, and often multiple possible POS tags. For example, *back*:\n", "\n", "Noun:\n", "\n", "| | | | | | |\n", "|-|-|-|-|-|-|\n", "| He | is | treated | for | **back** | injury |\n", "| PRP | VBZ | VBN | IN | **NN** | NN |\n", "\n", "Adverb:\n", "\n", "| | | | | | |\n", "|-|-|-|-|-|-|\n", "| He | is | sent | **back** | to | prison |\n", "| PRP | VBZ | VBN | **RB** | TO | NN |\n", "\n", "Verb:\n", "\n", "| | | | | |\n", "|-|-|-|-|-|\n", "| I | can | **back** | this | up |\n", "| PRP | MD | **VB** | DT | RP |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### What other features would you try for English POS tagging?\n", "\n", "
\n", "\n", "# [tinyurl.com/diku-nlp-pos](https://tinyurl.com/diku-nlp-pos)\n", "\n", "([Responses](https://docs.google.com/forms/d/1K8l0D6sTWmC4KmKJwr61zEN0eNjgUOpaDj3WSGxTuIM/edit#responses))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Task: Named entity recognition (NER)\n", "\n", "\n", "| |\n", "|-|\n", "| \\[Barack Obama\\]per was born in \\[Hawaii\\]gpe |\n", "\n", "\n", " per = Person\n", " gpe = Geopolitical Entity" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "... but this is not sequence labeling, is it?" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## IOB encoding\n", "\n", "Label tokens as beginning (B), inside (I), or outside (O) a **named entity:**\n", "\n", "| | | | | | |\n", "|-|-|-|-|-|-|\n", "| Barack | Obama | was | born | in | Hawaii |\n", "| B-per | I-per | O | O | O | B-gpe |\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "+ Many tasks can be framed as sequence labelling using this idea!" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Named entity types in GMB dataset\n", "\n", " geo = Geographical Entity\n", " org = Organization\n", " per = Person\n", " gpe = Geopolitical Entity\n", " tim = Time indicator\n", " art = Artifact\n", " eve = Event\n", " nat = Natural Phenomenon" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "Example sentence from GMB:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Iran</td><td>'s</td><td>new</td><td>President</td><td>Mahmoud</td><td>Ahmadinejad</td><td>said</td><td>Tuesday</td><td>that</td><td>European</td><td>incentives</td></tr>\n",
"    <tr><th>1</th><td>NNP</td><td>POS</td><td>JJ</td><td>NNP</td><td>NNP</td><td>NNP</td><td>VBD</td><td>NNP</td><td>IN</td><td>JJ</td><td>NNS</td></tr>\n",
"    <tr><th>2</th><td>B-gpe</td><td>O</td><td>O</td><td>B-per</td><td>I-per</td><td>I-per</td><td>O</td><td>B-tim</td><td>O</td><td>B-gpe</td><td>O</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 \\\n", "0 Iran 's new President Mahmoud Ahmadinejad said Tuesday that \n", "1 NNP POS JJ NNP NNP NNP VBD NNP IN \n", "2 B-gpe O O B-per I-per I-per O B-tim O \n", "\n", " 9 10 \n", "0 European incentives \n", "1 JJ NNS \n", "2 B-gpe O " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame([tokens[12][:11], pos[12][:11], ents[12][:11]])" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "examples = {}\n", "counts_ent = Counter(tag[2:] for sent in ents for tag in sent if tag.startswith(\"B-\"))\n", "in_entity = False\n", "for x_s, y_s in zip(tokens, ents):\n", " for i, (x, y) in enumerate(zip(x_s, y_s)):\n", " if y == \"O\":\n", " in_entity = False\n", " continue\n", " y_ent = y[2:]\n", " if y[0] == \"B\":\n", " if y_ent not in examples or random() > 0.6:\n", " examples[y_ent] = [x]\n", " in_entity = True\n", " else:\n", " in_entity = False\n", " if y[0] == \"I\" and in_entity:\n", " examples[y_ent].append(x)\n", "\n", "sorted_ents = sorted(counts_ent.items(),key=lambda x:-x[1])\n", "sorted_ents_with_examples = [(t,c,\" \".join(examples[t])) for t,c in sorted_ents]\n", "\n", "sorted_ents_table = pd.DataFrame(sorted_ents_with_examples, columns=['Entity Type','Count','Example'])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Entity Type</th><th>Count</th><th>Example</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>geo</td><td>2070</td><td>Dujail</td></tr>\n",
"    <tr><th>1</th><td>org</td><td>1237</td><td>Ethiopia and Eritrea</td></tr>\n",
"    <tr><th>2</th><td>gpe</td><td>1230</td><td>Iraqi</td></tr>\n",
"    <tr><th>3</th><td>tim</td><td>1160</td><td>1982</td></tr>\n",
"    <tr><th>4</th><td>per</td><td>1107</td><td>Saddam Hussein</td></tr>\n",
"    <tr><th>5</th><td>art</td><td>53</td><td>USA Today</td></tr>\n",
"    <tr><th>6</th><td>eve</td><td>45</td><td>Summer Olympics</td></tr>\n",
"    <tr><th>7</th><td>nat</td><td>20</td><td>H5N1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Entity Type Count Example\n", "0 geo 2070 Dujail\n", "1 org 1237 Ethiopia and Eritrea\n", "2 gpe 1230 Iraqi\n", "3 tim 1160 1982\n", "4 per 1107 Saddam Hussein\n", "5 art 53 USA Today\n", "6 eve 45 Summer Olympics\n", "7 nat 20 H5N1" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted_ents_table" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Can we run our simple **local model** on this?" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0.9348182883939039" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_ner = list(zip(tokens[:-200], ents[:-200]))\n", "dev_ner = list(zip(tokens[-200:], ents[-200:]))\n", "\n", "def feat_2(x,i):\n", " return {\n", " 'bias': 1.0,\n", " 'word:' + x[i].lower(): 1.0,\n", " 'first_upper:' + str(x[i][0].isupper()): 1.0,\n", " }\n", "local_2 = seq.LocalSequenceLabeler(feat_2, train_ner)\n", "local_2_pred_dev = local_2.predict(dev_ner)\n", "seq.accuracy(dev_ner, local_2_pred_dev)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "This seems great, but tag distribution is also **highly skewed**:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "hist = Counter(tag for _, tags in dev_ner for tag in tags)\n", "plt.bar(sorted(hist.keys(), key=hist.get), sorted(hist.values()))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A baseline that always predicts `O` is already pretty good:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0.8527549824150059" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "only_o = [tuple(['O'] * len(tags)) for _, tags in dev_ner]\n", "seq.accuracy(dev_ner, only_o)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "def get_spans(labels):\n", " spans = []\n", " current = [None, None, None]\n", " for i, label in enumerate(labels):\n", " if label.startswith(\"I-\") and label[2:] == current[0]:\n", " # continued span\n", " continue\n", " # push span, if there is any\n", " if current[0] is not None:\n", " current[2] = i\n", " spans.append(current)\n", " current = [None, None, None]\n", " if label.startswith(\"B-\"):\n", " current[0] = label[2:]\n", " current[1] = i\n", " if current[0] is not None:\n", " current[2] = len(labels)\n", " spans.append(current)\n", " return spans\n", "\n", "def _calculate_prf(preds, golds):\n", " total_pred, total_gold, match = 0, 0, 0\n", " for pred, gold in zip(preds, golds):\n", " pred_s = get_spans(pred)\n", " gold_s = get_spans(gold)\n", " total_pred += len(pred_s)\n", " total_gold += len(gold_s)\n", " match += sum(s in pred_s for s in gold_s)\n", " # precision: % of entities found by the system that are correct\n", " p = match / total_pred if total_pred else 0.0\n", " # recall: % of entities in dataset found by the system\n", " r = match / 
total_gold if total_gold else 0.0\n", " # f-score: harmonic mean of precision and recall\n", " f = 2 * (p * r) / (p + r) if p + r else 0.0\n", "\n", " return p, r, f\n", "\n", "def print_prf(goldset, preds):\n", " p, r, f = _calculate_prf(preds, [s[1] for s in goldset])\n", " print(f\"precision: {p:.2f}\\nrecall: {r:.2f}\\nf-score: {f:.2f}\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Tasks like NER are more commonly evaluated with...\n", "\n", "### Precision, recall, and F-score\n", "\n", "\\begin{align}\n", "\\text{precision} & = \\frac{|\\text{predicted}\\cap\\text{annotated}|}{|\\text{predicted}|} \\\\[.5em]\n", "\\text{recall} & = \\frac{|\\text{predicted}\\cap\\text{annotated}|}{|\\text{annotated}|} \\\\[.5em]\n", "F & = 2 \\cdot \\frac{\\text{precision}\\cdot\\text{recall}}{\\text{precision}+\\text{recall}} \\\\\n", "\\end{align}\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Example:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>The</td><td>newspaper</td><td>says</td><td>the</td><td>tape</td><td>was</td><td>shot</td><td>in</td><td>2004</td><td>in</td><td>southern</td><td>Iraq</td><td>.</td></tr>\n",
"    <tr><th>1</th><td>O</td><td>O</td><td>O</td><td>O</td><td>O</td><td>O</td><td>O</td><td>O</td><td>B-tim</td><td>O</td><td>B-geo</td><td>I-geo</td><td>O</td></tr>\n",
"    <tr><th>2</th><td>O</td><td>O</td><td>O</td><td>O</td><td>O</td><td>O</td><td>O</td><td>O</td><td>B-tim</td><td>O</td><td>O</td><td>B-geo</td><td>O</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 10 11 \\\n", "0 The newspaper says the tape was shot in 2004 in southern Iraq \n", "1 O O O O O O O O B-tim O B-geo I-geo \n", "2 O O O O O O O O B-tim O O B-geo \n", "\n", " 12 \n", "0 . \n", "1 O \n", "2 O " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame([dev_ner[18][0], dev_ner[18][1], local_2_pred_dev[18]])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "predicted = {2004, Iraq}\n", "\n", "annotated = {2004, southern Iraq}" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "precision: 0.50\n", "recall: 0.50\n", "f-score: 0.50\n" ] } ], "source": [ "print_prf([dev_ner[18]], [local_2_pred_dev[18]])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "back to the full dev set..." 
] }, { "cell_type": "code", "execution_count": 54, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "precision: 0.73\n", "recall: 0.59\n", "f-score: 0.65\n" ] } ], "source": [ "local_2_pred_dev = local_2.predict(dev_ner)\n", "print_prf(dev_ner, local_2_pred_dev)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "precision: 0.00\n", "recall: 0.00\n", "f-score: 0.00\n" ] } ], "source": [ "print_prf(dev_ner, only_o)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sequence labelling with neural networks" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "We can use BiLSTMs for that!\n", "\n", "
\n", " \n", "
\n", "\n", "Source: https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reminder\n", "A recurrent neural network (plain RNN, LSTM, GRU, ...) computes its output based on a hidden, internal state:\n", "\n", "$$\n", " {\\mathbf{y}}_{t} = \\text{RNN}(\\x_t, {\\mathbf{h}}_{t})\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "A **bi-directional** RNN is just two uni-directional RNNs combined:\n", "\n", "\\begin{align}\n", " \\overrightarrow{\\mathbf{y}_{t}} & = \\overrightarrow{\\text{RNN}}(\\x_t, \\overrightarrow{\\mathbf{h}_{t}})\\\\\n", " \\overleftarrow{\\mathbf{y}_{t}} & = \\overleftarrow{\\text{RNN}}(\\x_t, \\overleftarrow{\\mathbf{h}_{t}})\n", " \\\\\n", " {\\mathbf{y}}_{t} & = \\overrightarrow{\\mathbf{y}_{t}} \\oplus \\overleftarrow{\\mathbf{y}_{t}} \\\\\n", "\\end{align}\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To predict label probabilities, we use the **softmax function**:\n", "\n", "\n", "$$\n", "\\begin{aligned}\n", " {\\mathbf{y}}_{t} & = \\overrightarrow{\\mathbf{y}_{t}} \\oplus \\overleftarrow{\\mathbf{y}_{t}} \\\\\n", " \\hat{\\mathbf{y}}_{t} & = \\text{softmax}(\\mathbf{W}^o \\mathbf{y}_{t}) \\in \\mathbb{R}^{|\\Ys|} \\\\\n", "\\end{aligned}\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also use transformers such as BERT:\n", "![bert_ner](../img/bert_ner.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Note on tokenisation\n", "\n", "Parts of speech are defined for *words*.\n", "\n", "A tagger must output one tag per word, even if it uses a different tokenisation internally." 
] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "### Tokenisation\n", "Combining word representations with character representations improves POS tagging:\n", "\n", "
\n", " \n", "
\n", "\n", "([Plank et al., 2016](https://aclanthology.org/P16-2067/))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### An important technical detail\n", "\n", "The linear transformation $\\mathbf{W}^o \\mathbf{y}_{t}$ is usually not modelled as part of the RNN itself in most deep learning frameworks.\n", "\n", "Instead, look for one of\n", "\n", "+ **feed-forward layer**\n", "+ **dense layer** (*e.g. in Keras*)\n", "+ **linear layer** (*e.g. in PyTorch*)\n", "\n", "with a softmax activation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Is it all the same?\n", "\n", "Remember the log-linear classifier:\n", "\n", "$$\n", " p_\\params(y\\bar\\x,i) = \\frac{1}{Z_\\x} \\exp \\langle \\repr(\\x,i),\\params_y \\rangle\n", "$$\n", "\n", "A neural sequence model with a softmax layer on top is also modelling $p_\\params(y\\bar\\x,i)$\n", "\n", "So if you take $\\params$ to be the set of parameters of the neural network, then:\n", "\n", "\\begin{align}\n", " \\hat{\\mathbf{y}}_{t} & = \\text{softmax}(\\hat{\\mathbf{h}}_{t}) \\\\\n", " &= \\frac{1}{Z_\\x} \\exp \\langle \\hat{\\mathbf{h}}_{t},\\params_y \\rangle \\\\\n", "\\end{align}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### What is the difference?\n", "\n", "
\n", "\n", "# [tinyurl.com/diku-nlp-seq](https://tinyurl.com/diku-nlp-seq)\n", "\n", "([Responses](https://docs.google.com/forms/d/1X6LKsQ3a9_XcsZm5gpa8p-q9brXosjlbTubp2nqR1pk/edit#responses))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Solution\n", "\n", "* Incorrect: That neural sequence models can use context: log-linear models can too\n", "* Correct: That neural sequence models have more parameters (generally)\n", "* Incorrect: That neural sequence models are global (not local) models (both can be either local or global)\n", "* Correct: That neural sequence models learn features on their own\n", "* Correct: That log-linear models can only combine features linearly\n", "* Incorrect: That log-linear models perform greedy inference\n", "* Incorrect: That log-linear models can only use the left context (preceding tokens)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "What haven't we modelled yet?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## There are *dependencies* between consecutive labels!" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Can you think of words that fit this POS tag sequence?\n", "\n", "| | | |\n", "|-|-|-|\n", "| DT | JJ | NN |\n", "| *determiner* | *adjective* | *noun (singular or mass)* |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "What about this one?\n", "\n", "| | |\n", "|-|-|\n", "| DT | VB |\n", "| *determiner* | *verb (base form)* |" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "+ After determiners (`DT`), adjectives and nouns are much more likely than verbs\n", "+ *Local* models cannot *directly* capture this" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
<div>\n",
"  <p>Carousel of 85 dev-set errors in which the gold tag starts with <code>B-</code> and the predicted tag starts with <code>I-</code>. Each slide shows the tokens, the gold tags, the predicted tags, and the feature table for the mislabelled token. Slide 1 / 85:</p>\n",
"  <table>\n",
"    <tr><td>Former</td><td>U.S.</td><td>president</td><td><b>Bill</b></td><td>Clinton</td><td>has</td><td>signed</td><td>an</td></tr>\n",
"    <tr><td>O</td><td>B-geo</td><td>O</td><td><b>B-per</b></td><td>I-per</td><td>O</td><td>O</td><td>O</td></tr>\n",
"    <tr><td>O</td><td>B-geo</td><td>O</td><td><b>I-per</b></td><td>I-per</td><td>O</td><td>O</td><td>O</td></tr>\n",
"  </table>\n",
"  <table>\n",
"    <tr><td>bias</td><td>first_upper:True</td><td>word:bill</td></tr>\n",
"    <tr><td>1.0</td><td>1.0</td><td>1.0</td></tr>\n",
"    <tr><td>0.60</td><td>1.14</td><td>1.72</td></tr>\n",
"    <tr><td>1.27</td><td>0.96</td><td>2.54</td></tr>\n",
"  </table>\n",
"</div>
\n", " " ], "text/plain": [ "" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "util.Carousel(local_2.errors(dev_ner,\n", " filter_guess=lambda y: y.startswith(\"I-\"),\n", " filter_gold=lambda y: y.startswith(\"B-\")))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "In the IOB tagging scheme:\n", "\n", "+ `I-[label]` can logically **only** appear after `B-[label]` or another `I-[label]` of the same type!\n", "\n", "The following can **never** be valid tag sequences:\n", "\n", "* `O I-per`\n", "\n", "* `B-per I-geo`\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Remember that\n", "\n", "$$\n", "p_\\params(\\y|\\x) = \\prod_{i=1}^n p_\\params(y_i|\\x,i,y_{1,\\ldots,i-1})\n", "$$\n", "\n", "What if we went from this...\n", "\n", "$$\n", "\\approx \\prod_{i=1}^n p_\\params(y_i|\\x,i)\n", "$$\n", "\n", "...to this?\n", "\n", "$$\n", "\\approx \\prod_{i=1}^n p_\\params(y_i|\\x,\\color{red}{y_{i-1}},i)\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "scrolled": true, "slideshow": { "slide_type": "skip" } }, "source": [ "Does this remind you of anything you've seen in previous lectures?" 
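, "\n", "\n", "The IOB constraint stated earlier (an `I-[label]` tag may only continue an entity of the same type) can be checked mechanically. A minimal sketch, assuming tags are plain strings such as `B-per`; the helper name `is_valid_iob` is hypothetical and not part of `statnlpbook`:

```python
def is_valid_iob(tags):
    # Return True if every I-x tag continues an entity of the same type x.
    prev = 'O'
    for tag in tags:
        if tag.startswith('I-'):
            label = tag[2:]
            # I-x may only follow B-x or I-x
            if prev not in ('B-' + label, 'I-' + label):
                return False
        prev = tag
    return True

print(is_valid_iob(['O', 'B-per', 'I-per']))  # True
print(is_valid_iob(['O', 'I-per']))           # False
print(is_valid_iob(['B-per', 'I-geo']))       # False
```

A classifier that predicts each tag independently cannot guarantee that its output passes this check, which is one motivation for conditioning on the previous label."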
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### First-order Markov assumption\n", "\n", "* Probability of a label depends only on (the input and) the previous label\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Example\n", "\n", "$$\n", "\\prob_\\params(\\text{\"O B-per I-per\"} \\bar \\text{\"president Bill Clinton\"}) = \\\\\n", "\\prob_\\params(\\text{\"O\"}\\bar \\text{\"president Bill Clinton\"},\\text{\"\"},1) ~ \\cdot \\\\\n", "\\prob_\\params(\\text{\"B-per\"} \\bar \\text{\"president Bill Clinton\"},\\text{\"O\"},2) ~ \\cdot \\\\\n", "\\prob_\\params(\\text{\"I-per\"} \\bar \\text{\"president Bill Clinton\"},\\text{\"B-per\"},3) \\\\\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "## Maximum Entropy Markov Models (MEMM)\n", "\n", "Log-linear version with access to previous label:\n", "\n", "$$\n", " p_\\params(y_i|\\x,y_{i-1},i) = \\frac{1}{Z_{\\x,y_{i-1},i}} \\exp \\langle \\repr(\\x,y_{i-1},i),\\params_{y_i} \\rangle\n", "$$\n", "\n", "where\n", "$$Z_{\\x,y_{i-1},i}=\\sum_{y'} \\exp \\langle \\repr(\\x,y_{i-1},i),\\params_{y'} \\rangle$$\n", "is a *local* per-token normalisation factor that sums over all candidate labels $y'$ at position $i$." 
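, "\n", "\n", "The local normalisation above is a softmax over the candidate labels at one position. A minimal sketch, assuming sparse features stored in a dict; the function name `memm_local_probs`, the feature name `prev:B-per` and all weight values are invented for illustration:

```python
import math

def memm_local_probs(features, weights, labels):
    # score(y) = <f(x, y_prev, i), theta_y>, with features as a dict name -> value
    scores = {
        y: sum(v * weights[y].get(k, 0.0) for k, v in features.items())
        for y in labels
    }
    # Subtract the maximum score for numerical stability
    m = max(scores.values())
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())  # local normaliser for this position only
    return {y: e / z for y, e in exp_scores.items()}

# Toy weights, invented for illustration only
labels = ['O', 'B-per', 'I-per']
weights = {
    'O': {'bias': 1.0},
    'B-per': {'first_upper:True': 1.0},
    'I-per': {'first_upper:True': 1.0, 'prev:B-per': 2.0},
}
# Features of a capitalised token whose previous label is B-per
features = {'bias': 1.0, 'first_upper:True': 1.0, 'prev:B-per': 1.0}
probs = memm_local_probs(features, weights, labels)
print(probs)
```

Because the previous label enters only as a feature, the probabilities at each position sum to 1 regardless of how well the previous label itself was supported."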
] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "scrolled": false, "slideshow": { "slide_type": "skip" } }, "source": [ "### Graphical Representation\n", "\n", "- Reminder: models can be represented as factor graphs\n", "- Each variable of the model (our per-token tag labels and the input sequence $\\x$) is drawn using a circle\n", "- As before, *observed* variables are shaded\n", "- Each factor in the model (terms in the product) is drawn as a box that connects the variables that appear in the corresponding term" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Training MEMMs\n", "Optimising the conditional log-likelihood\n", "\n", "$$\n", "\\sum_{(\\x,\\y) \\in \\train} \\log \\prob_\\params(\\y|\\x)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Decomposes nicely:\n", "$$\n", "\\sum_{(\\x,\\y) \\in \\train} \\sum_{i=1}^{|\\x|} \\log \\prob_\\params(y_i|\\x,y_{i-1},i)\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Easy to train\n", "* Equivalent to a **logistic regression objective** for a classifier that assigns labels based on previous gold labels" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "However...\n", "\n", "### Local normalisation introduces *label bias*\n", "\n", "+ Tag probabilities always sum to 1 at each position\n", "+ Can lead to MEMMs effectively \"ignoring\" the inputs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Conditional Random Fields (CRF)\n", "\n", "Replace *local* with *global* normalisation.\n", "\n", "Instead of normalising across all possible next states $y_{i+1}$ given a current state $y_i$ and observation $\\x$,\n", "\n", "the CRF normalises across all possible *sequences* $\\y$ given observation $\\x$." 
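, "\n", "\n", "To make the contrast concrete, here is a minimal brute-force sketch of global normalisation on a toy three-token input. The scores in `EMIT`, the penalty in `toy_score` and the function names are invented for illustration. Enumerating every sequence is exponential in the input length, so this only works for toy inputs; practical CRFs compute the normaliser with dynamic programming.

```python
import itertools
import math

LABELS = ['O', 'B-per', 'I-per']

# Invented per-position scores for the tokens: president, Bill, Clinton
EMIT = [
    {'O': 2.0, 'B-per': 0.0, 'I-per': 0.0},
    {'O': 0.0, 'B-per': 1.0, 'I-per': 1.5},
    {'O': 0.0, 'B-per': 1.0, 'I-per': 1.5},
]

def toy_score(i, prev, y):
    # Log-space score of label y at position i given the previous label;
    # I-per is penalised unless it continues a person entity
    penalty = -5.0 if y == 'I-per' and prev not in ('B-per', 'I-per') else 0.0
    return EMIT[i][y] + penalty

def crf_probs(score, labels, n):
    # Unnormalised score of a sequence: exp of the summed per-position scores
    unnorm = {}
    for seq in itertools.product(labels, repeat=n):
        total, prev = 0.0, '<s>'
        for i, y in enumerate(seq):
            total += score(i, prev, y)
            prev = y
        unnorm[seq] = math.exp(total)
    z = sum(unnorm.values())  # one global normaliser for the whole input
    return {seq: u / z for seq, u in unnorm.items()}

probs = crf_probs(toy_score, LABELS, 3)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Under these toy scores the second token on its own prefers `I-per` (1.5 against 1.0 for `B-per`), but the best full sequence is `O B-per I-per`, because probability mass is distributed over whole sequences and not position by position."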
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Formally:\n", "\n", "$$\n", " p_\\params(\\y|\\x) = \\frac{1}{Z_{\\x}} \\prod_{i=1}^{|\\x|} \\exp \\langle \\repr(\\x,y_{i-1},i),\\params_{y_i} \\rangle\n", "$$\n", "\n", "where\n", "$$Z_{\\x}=\\sum_\\y \\prod_{i=1}^{|\\x|} \\exp \\langle \\repr(\\x,y_{i-1},i), \\params_{y_i} \\rangle$$\n", "is a *global* normalisation constant depending on $\\x$.\n", "\n", "Notably, each term $\\exp \\langle \\repr(\\x,y_{i-1},i), \\params_{y_i} \\rangle$ in the product can now take on any value in $(0,\\infty)$, whereas the MEMM terms are probabilities in $(0,1)$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "***\n", "\n", "+ More precisely, this is a **linear-chain CRF**.\n", "\n", " (CRFs can be applied to any graph structure, but we are only considering sequences.)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Pros and cons of CRFs\n", "\n", "\n", "### 👍\n", "\n", "+ Finds globally optimal label sequence\n", "+ Eliminates label bias\n", "\n", "### 👎\n", "\n", "+ More difficult to train (the objective can no longer be broken down into local terms!)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The best of both worlds?\n", "\n", "## Neural CRF\n", "\n", "+ We can **combine** our neural sequence models with a CRF!\n", "\n", "\n", "$$\n", " p_\\params(\\y|\\x) = \\frac{1}{Z_{\\x}} \\prod_{i=1}^{|\\x|} \\exp \\langle \\hat{\\mathbf{h}}_{i},\\params_{y_i} \\rangle\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "\n", "![](https://www.gabormelli.com/RKB/images/thumb/1/1e/N16-1030_fig1.png/400px-N16-1030_fig1.png)\n", "\n", "(from [Lample et al., 2016](https://www.aclweb.org/anthology/N16-1030))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Prediction in 
MEMMs, CRFs, neural CRFs, ...\n", "\n", "To predict the best label sequence, find a $\\y^*$ with maximal conditional probability\n", "\n", "$$\n", "\\y^* =\\argmax_\\y \\prob_\\params(\\y|\\x).\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Greedy Prediction\n", "\n", "Simplest option:\n", "* Choose the highest-scoring label for token 1\n", "* Choose the highest-scoring label for token 2, conditioned on the best label for token 1\n", "* etc." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "But...\n", "\n", "+ May lead to **search errors**, where the returned $\\y^*$ is not the highest-scoring **global** solution" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Problem\n", "\n", "We cannot simply choose each label in isolation because **decisions depend on each other.**" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "skip" } }, "source": [ "## Beam Search\n", "\n", "Keep a \"beam\" of the $\\beta$ best partial solutions\n", "\n", "1. Choose the $\\beta$ highest-scoring labels for token 1\n", "2. 1. For each of the $\\beta$ paths in the beam: predict probabilities for the next label, conditioned on the previous label(s)\n", "   2. **Sum** the log-probabilities of the path so far and the next label\n", "   3. **Prune** the beam by only keeping the top $\\beta$ paths\n", "3. 
Repeat until end of sequence" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Summary\n", "\n", "\n", "- Many problems can be cast as sequence labelling\n", " - POS tagging\n", " - Named entity recognition (with IOB encoding)\n", " \n", "- Models are similar to sequence **classifiers** but are sequential\n", " - Log-linear models rely on good feature engineering\n", " - Neural sequence labelers rely on substantial amounts of training data but generally perform better\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Background Material \n", "\n", "- Longer introduction to sequence labelling with linear chain models: [notes](chapters/sequence_labeling.ipynb)\n", "- Longer introduction to sequence labelling with CRFs: [slides](chapters/sequence_labeling_crf_slides.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Jurafsky & Martin, Speech and Language Processing, [Chapter 17](https://web.stanford.edu/~jurafsky/slp3/17.pdf)\n", "- Tutorial on CRFs: Sutton & McCallum, [An Introduction to Conditional Random Fields for Relational Learning](https://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf)\n", "- LSTM-CRF architecture: [Huang et al., Bidirectional LSTM-CRF for Sequence Tagging](https://arxiv.org/pdf/1508.01991v1.pdf)\n", "- Globally Normalized Transition-Based Neural Networks: [Andor et al., 2016](https://arxiv.org/abs/1603.06042)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, 
"nbformat": 4, "nbformat_minor": 1 }