{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font size=6>\n",
    "    <b>Model_Training_with_BERT.ipynb:</b>\n",
    "    <p>Use Text Extensions for Pandas to integrate BERT tokenization with model training for named entity recognition on Pandas.</p>\n",
    "</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "This notebook shows how to use the open source library [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) to seamlessly integrate BERT tokenization and embeddings with model training for named entity recognition using [Pandas](https://pandas.pydata.org/) DataFrames.\n",
    "\n",
    "This example builds on the analysis of the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) corpus done in [Analyze_Model_Outputs](./Analyze_Model_Outputs.ipynb) to train a new model for named entity recognition (NER) using state-of-the-art natural language understanding with BERT tokenization and embeddings. Although the model used is rather simple and achieves only modest scores, the purpose is to demonstrate how Text Extensions for Pandas integrates BERT from [Huggingface Transformers](https://huggingface.co/transformers/index.html) with the `TensorArray` extension for model training and scoring, all within Pandas DataFrames. See [Text_Extension_for_Pandas_Overview](./Text_Extension_for_Pandas_Overview.ipynb) for the `TensorArray` specification and more example usage.\n",
    "\n",
    "The notebook is divided into the following steps:\n",
    "\n",
    "1. Retokenize the entire corpus using a \"BERT-compatible\" tokenizer, and map the token/entity labels from the original corpus onto the new tokenization.\n",
    "1. Generate BERT embeddings for every token in the entire corpus in one pass, and store those embeddings in a DataFrame column (of type `TensorDtype`) alongside the tokens and labels.\n",
    "1. Persist the DataFrame with computed BERT embeddings to disk as a checkpoint.\n",
    "1. Use the embeddings to train a multinomial logistic regression model to perform named entity recognition.\n",
    "1. Compute precision/recall for the model predictions on a test set.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "This notebook requires a Python 3.7 or later environment with NumPy, Pandas, scikit-learn, PyTorch and Huggingface `transformers`. \n",
    "\n",
    "The notebook also requires the `text_extensions_for_pandas` library. You can satisfy this dependency in two ways:\n",
    "\n",
    "* Run `pip install text_extensions_for_pandas` before running this notebook. This command adds the library to your Python environment.\n",
    "* Run this notebook out of your local copy of the Text Extensions for Pandas project's [source tree](https://github.com/CODAIT/text-extensions-for-pandas). In this case, the notebook will use the version of Text Extensions for Pandas in your local source tree **if the package is not installed in your Python environment**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gc\n",
    "import os\n",
    "import sys\n",
    "from typing import *\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import sklearn.pipeline\n",
    "import sklearn.linear_model\n",
    "import torch\n",
    "import transformers\n",
    "\n",
    "# And of course we need the text_extensions_for_pandas library itself.\n",
    "try:\n",
    "    import text_extensions_for_pandas as tp\n",
    "except ModuleNotFoundError as e:\n",
    "    # If we're running from within the project source tree and the parent Python\n",
    "    # environment doesn't have the text_extensions_for_pandas package, use the\n",
    "    # version in the local source tree.\n",
    "    if not os.getcwd().endswith(\"notebooks\"):\n",
    "        raise e\n",
    "    if \"..\" not in sys.path:\n",
    "        sys.path.insert(0, \"..\")\n",
    "    import text_extensions_for_pandas as tp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Named Entity Recognition with BERT on CoNLL-2003\n",
    "\n",
    "[CoNLL](https://www.conll.org/), the SIGNLL Conference on Computational Natural Language Learning, is an annual academic conference for natural language processing researchers. Each year's conference features a competition involving a challenging NLP task. The task for the 2003 competition involved identifying mentions of [named entities](https://en.wikipedia.org/wiki/Named-entity_recognition) in English and German news articles from the late 1990s. The corpus for this 2003 competition is one of the most widely used benchmarks for the performance of named entity recognition models. Current [state-of-the-art results](https://paperswithcode.com/sota/named-entity-recognition-ner-on-conll-2003) on this corpus produce an F1 score (harmonic mean of precision and recall) of 0.93. The best F1 score in the original competition was 0.89.\n",
    "\n",
    "For more information about this data set, we recommend reading the conference paper about the competition results, [\"Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition,\"](https://www.aclweb.org/anthology/W03-0419/).\n",
    "\n",
    "**Note that the data set is licensed for research use only. Be sure to adhere to the terms of the license when using this data set!**\n",
    "\n",
    "The developers of the CoNLL-2003 corpus defined a file format for the corpus, based on the file format used in the earlier [Message Understanding Conference](https://en.wikipedia.org/wiki/Message_Understanding_Conference) competition. This format is generally known as \"CoNLL format\" or \"CoNLL-2003 format\".\n",
    "\n",
    "In the following cell, we use the facilities of Text Extensions for Pandas to download a copy of the CoNLL-2003 data set. Later in this notebook, we read the CoNLL-2003-format files containing each fold of the corpus and translate the data into collections of Pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) objects, one DataFrame per document. We then display the DataFrame for an example document from the `test` fold of the corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'train': 'outputs/eng.train',\n",
       " 'dev': 'outputs/eng.testa',\n",
       " 'test': 'outputs/eng.testb'}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Download and cache the data set.\n",
    "# NOTE: This data set is licensed for research use only. Be sure to adhere\n",
    "#  to the terms of the license when using this data set!\n",
    "data_set_info = tp.io.conll.maybe_download_conll_data(\"outputs\")\n",
    "data_set_info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show How to Retokenize with a BERT Tokenizer\n",
    "\n",
    "The BERT model originates from the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The model is pre-trained with masked language modeling and next sentence prediction objectives, which make it effective for masked token prediction and natural language understanding (NLU).\n",
    "\n",
    "Once the CoNLL-2003 corpus is loaded, we need to retokenize it using a \"BERT-compatible\" tokenizer. Then we can map the token/entity labels from the original corpus onto the new tokenization.\n",
    "\n",
    "We will start by showing the retokenizing process for a single document before doing the same on the entire corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>sentence</th>\n",
       "      <th>line_num</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[0, 10): '-DOCSTART-'</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[0, 10): '-DOCSTART-'</td>\n",
       "      <td>1469</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[11, 18): 'CRICKET'</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...</td>\n",
       "      <td>1471</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[18, 19): '-'</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...</td>\n",
       "      <td>1472</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[20, 28): 'PAKISTAN'</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...</td>\n",
       "      <td>1473</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[29, 30): 'V'</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...</td>\n",
       "      <td>1474</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>350</th>\n",
       "      <td>[1620, 1621): '8'</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[1590, 1634): 'Third one-day match: December 8...</td>\n",
       "      <td>1865</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351</th>\n",
       "      <td>[1621, 1622): ','</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[1590, 1634): 'Third one-day match: December 8...</td>\n",
       "      <td>1866</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>352</th>\n",
       "      <td>[1623, 1625): 'in'</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[1590, 1634): 'Third one-day match: December 8...</td>\n",
       "      <td>1867</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>353</th>\n",
       "      <td>[1626, 1633): 'Karachi'</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[1590, 1634): 'Third one-day match: December 8...</td>\n",
       "      <td>1868</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>354</th>\n",
       "      <td>[1633, 1634): '.'</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[1590, 1634): 'Third one-day match: December 8...</td>\n",
       "      <td>1869</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>355 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                        span ent_iob ent_type  \\\n",
       "0      [0, 10): '-DOCSTART-'       O     None   \n",
       "1        [11, 18): 'CRICKET'       O     None   \n",
       "2              [18, 19): '-'       O     None   \n",
       "3       [20, 28): 'PAKISTAN'       B      LOC   \n",
       "4              [29, 30): 'V'       O     None   \n",
       "..                       ...     ...      ...   \n",
       "350        [1620, 1621): '8'       O     None   \n",
       "351        [1621, 1622): ','       O     None   \n",
       "352       [1623, 1625): 'in'       O     None   \n",
       "353  [1626, 1633): 'Karachi'       B      LOC   \n",
       "354        [1633, 1634): '.'       O     None   \n",
       "\n",
       "                                              sentence  line_num  \n",
       "0                                [0, 10): '-DOCSTART-'      1469  \n",
       "1    [11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...      1471  \n",
       "2    [11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...      1472  \n",
       "3    [11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...      1473  \n",
       "4    [11, 62): 'CRICKET- PAKISTAN V NEW ZEALAND ONE...      1474  \n",
       "..                                                 ...       ...  \n",
       "350  [1590, 1634): 'Third one-day match: December 8...      1865  \n",
       "351  [1590, 1634): 'Third one-day match: December 8...      1866  \n",
       "352  [1590, 1634): 'Third one-day match: December 8...      1867  \n",
       "353  [1590, 1634): 'Third one-day match: December 8...      1868  \n",
       "354  [1590, 1634): 'Third one-day match: December 8...      1869  \n",
       "\n",
       "[355 rows x 5 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read in the corpus in its original tokenization.\n",
    "corpus_raw = {}\n",
    "for fold_name, file_name in data_set_info.items():\n",
    "    df_list = tp.io.conll.conll_2003_to_dataframes(file_name, \n",
    "                                                   [\"pos\", \"phrase\", \"ent\"],\n",
    "                                                   [False, True, True])\n",
    "    corpus_raw[fold_name] = [\n",
    "        df.drop(columns=[\"pos\", \"phrase_iob\", \"phrase_type\"])\n",
    "        for df in df_list\n",
    "    ]\n",
    "\n",
    "test_raw = corpus_raw[\"test\"]\n",
    "\n",
    "# Pick out the dataframe for a single example document.\n",
    "example_df = test_raw[5]\n",
    "example_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`example_df` contains the columns `span` and `sentence`, with dtypes `SpanDtype` and `TokenSpanDtype` respectively. Both represent spans of the target text: each `span` value covers a single token, and each `sentence` value covers the sentence containing that token. See the notebook [Text_Extension_for_Pandas_Overview](./Text_Extension_for_Pandas_Overview.ipynb) for more on `SpanArray` and `TokenSpanArray`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "span             SpanDtype\n",
       "ent_iob             object\n",
       "ent_type            object\n",
       "sentence    TokenSpanDtype\n",
       "line_num             int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example_df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert IOB-Tagged Data to Lists of Entity Mentions\n",
    "\n",
    "The data we've looked at so far has been in [IOB2 format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). \n",
    "Each row of our DataFrame represents a token, and each token is tagged with an entity type (`ent_type`) and an IOB tag (`ent_iob`). The first token of each named entity mention is tagged `B`, while subsequent tokens are tagged `I`. Tokens that aren't part of any named entity are tagged `O`.\n",
    "\n",
    "IOB2 format is a convenient way to represent a corpus, but it is a less useful representation for analyzing the result quality of named entity recognition models. Because most tokens in a typical NER corpus are tagged `O`, any measure of error rate in terms of tokens will over-emphasize the tokens that are part of entities. Token-level error rate also implicitly assigns higher weight to named entity mentions that consist of multiple tokens, further unbalancing error metrics. Most crucially, a naive comparison of IOB tags can mark an incorrect answer as correct. Consider a case where the correct sequence of labels is `B, B, I` but the model has output `B, I, I`. In this case, the last two tokens of the model output are both incorrect (the model has assigned them to the same entity as the first token), but a naive token-level comparison will consider the last token to be correct.\n",
    "\n",
    "The CoNLL-2003 competition used the number of errors in extracting *entire* entity mentions to measure the result quality of the entries. We will use the same metric in this notebook. To compute entity-level errors, we convert the IOB-tagged tokens into pairs of `<entity span, entity type>`.\n",
    "Text Extensions for Pandas includes a function `iob_to_spans()` that will handle this conversion for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[20, 28): 'PAKISTAN'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[31, 42): 'NEW ZEALAND'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[80, 83): 'GMT'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[85, 92): 'SIALKOT'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[94, 102): 'Pakistan'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>69</th>\n",
       "      <td>[1488, 1501): 'Shahid Afridi'</td>\n",
       "      <td>PER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>70</th>\n",
       "      <td>[1512, 1523): 'Salim Malik'</td>\n",
       "      <td>PER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71</th>\n",
       "      <td>[1535, 1545): 'Ijaz Ahmad'</td>\n",
       "      <td>PER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72</th>\n",
       "      <td>[1565, 1573): 'Pakistan'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>73</th>\n",
       "      <td>[1626, 1633): 'Karachi'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>74 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                             span ent_type\n",
       "0            [20, 28): 'PAKISTAN'      LOC\n",
       "1         [31, 42): 'NEW ZEALAND'      LOC\n",
       "2                 [80, 83): 'GMT'     MISC\n",
       "3             [85, 92): 'SIALKOT'      LOC\n",
       "4           [94, 102): 'Pakistan'      LOC\n",
       "..                            ...      ...\n",
       "69  [1488, 1501): 'Shahid Afridi'      PER\n",
       "70    [1512, 1523): 'Salim Malik'      PER\n",
       "71     [1535, 1545): 'Ijaz Ahmad'      PER\n",
       "72       [1565, 1573): 'Pakistan'      LOC\n",
       "73        [1626, 1633): 'Karachi'      LOC\n",
       "\n",
       "[74 rows x 2 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Convert the IOB2-tagged DataFrame to one with entity span and entity type columns.\n",
    "spans_df = tp.io.conll.iob_to_spans(example_df)\n",
    "spans_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initialize our BERT Tokenizer and Model\n",
    "\n",
    "Here we configure and initialize the [Huggingface transformers BERT tokenizer and model](https://huggingface.co/transformers/model_doc/bert.html). Text Extensions for Pandas provides a `make_bert_tokens()` function that uses the tokenizer to create BERT tokens as a span column in a DataFrame, suitable for computing BERT embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>[0, 1): '-'</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>[1, 2): 'D'</td>\n",
       "      <td>141</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>[2, 4): 'OC'</td>\n",
       "      <td>9244</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>[4, 6): 'ST'</td>\n",
       "      <td>9272</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>684</th>\n",
       "      <td>684</td>\n",
       "      <td>[1621, 1622): ','</td>\n",
       "      <td>117</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>685</th>\n",
       "      <td>685</td>\n",
       "      <td>[1623, 1625): 'in'</td>\n",
       "      <td>1107</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>686</th>\n",
       "      <td>686</td>\n",
       "      <td>[1626, 1633): 'Karachi'</td>\n",
       "      <td>16237</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>687</th>\n",
       "      <td>687</td>\n",
       "      <td>[1633, 1634): '.'</td>\n",
       "      <td>119</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>688</th>\n",
       "      <td>688</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>102</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>689 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     token_id                     span  input_id  token_type_id  \\\n",
       "0           0               [0, 0): ''       101              0   \n",
       "1           1              [0, 1): '-'       118              0   \n",
       "2           2              [1, 2): 'D'       141              0   \n",
       "3           3             [2, 4): 'OC'      9244              0   \n",
       "4           4             [4, 6): 'ST'      9272              0   \n",
       "..        ...                      ...       ...            ...   \n",
       "684       684        [1621, 1622): ','       117              0   \n",
       "685       685       [1623, 1625): 'in'      1107              0   \n",
       "686       686  [1626, 1633): 'Karachi'     16237              0   \n",
       "687       687        [1633, 1634): '.'       119              0   \n",
       "688       688               [0, 0): ''       102              0   \n",
       "\n",
       "     attention_mask  special_tokens_mask  \n",
       "0                 1                 True  \n",
       "1                 1                False  \n",
       "2                 1                False  \n",
       "3                 1                False  \n",
       "4                 1                False  \n",
       "..              ...                  ...  \n",
       "684               1                False  \n",
       "685               1                False  \n",
       "686               1                False  \n",
       "687               1                False  \n",
       "688               1                 True  \n",
       "\n",
       "[689 rows x 6 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Huggingface transformers BERT Configuration.\n",
    "bert_model_name = \"dslim/bert-base-NER\"\n",
    "\n",
    "tokenizer = transformers.BertTokenizerFast.from_pretrained(bert_model_name)\n",
    "\n",
    "# Disable the warning about long sequences. We know what we're doing.\n",
    "# Different versions of transformers disable this warning differently,\n",
    "# so we need to do this twice.\n",
    "tokenizer.deprecation_warnings[\n",
    "    \"sequence-length-is-longer-than-the-specified-maximum\"] = True\n",
    "tokenizer.model_max_length = 16384\n",
    "\n",
    "# Retokenize the document's text with the BERT tokenizer as a DataFrame \n",
    "# with a span column.\n",
    "bert_toks_df = tp.io.bert.make_bert_tokens(example_df[\"span\"].values[0].target_text, tokenizer)\n",
    "bert_toks_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>688</th>\n",
       "      <td>688</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>102</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     token_id        span  input_id  token_type_id  attention_mask  \\\n",
       "0           0  [0, 0): ''       101              0               1   \n",
       "688       688  [0, 0): ''       102              0               1   \n",
       "\n",
       "     special_tokens_mask  \n",
       "0                   True  \n",
       "688                 True  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# BERT tokenization includes special zero-length tokens.\n",
    "bert_toks_df[bert_toks_df[\"special_tokens_mask\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>original_span</th>\n",
       "      <th>bert_spans</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[20, 28): 'PAKISTAN'</td>\n",
       "      <td>[20, 28): 'PAKISTAN'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[31, 42): 'NEW ZEALAND'</td>\n",
       "      <td>[31, 42): 'NEW ZEALAND'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[80, 83): 'GMT'</td>\n",
       "      <td>[80, 83): 'GMT'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[85, 92): 'SIALKOT'</td>\n",
       "      <td>[85, 92): 'SIALKOT'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[94, 102): 'Pakistan'</td>\n",
       "      <td>[94, 102): 'Pakistan'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>69</th>\n",
       "      <td>[1488, 1501): 'Shahid Afridi'</td>\n",
       "      <td>[1488, 1501): 'Shahid Afridi'</td>\n",
       "      <td>PER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>70</th>\n",
       "      <td>[1512, 1523): 'Salim Malik'</td>\n",
       "      <td>[1512, 1523): 'Salim Malik'</td>\n",
       "      <td>PER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71</th>\n",
       "      <td>[1535, 1545): 'Ijaz Ahmad'</td>\n",
       "      <td>[1535, 1545): 'Ijaz Ahmad'</td>\n",
       "      <td>PER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72</th>\n",
       "      <td>[1565, 1573): 'Pakistan'</td>\n",
       "      <td>[1565, 1573): 'Pakistan'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>73</th>\n",
       "      <td>[1626, 1633): 'Karachi'</td>\n",
       "      <td>[1626, 1633): 'Karachi'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>74 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                    original_span                     bert_spans ent_type\n",
       "0            [20, 28): 'PAKISTAN'           [20, 28): 'PAKISTAN'      LOC\n",
       "1         [31, 42): 'NEW ZEALAND'        [31, 42): 'NEW ZEALAND'      LOC\n",
       "2                 [80, 83): 'GMT'                [80, 83): 'GMT'     MISC\n",
       "3             [85, 92): 'SIALKOT'            [85, 92): 'SIALKOT'      LOC\n",
       "4           [94, 102): 'Pakistan'          [94, 102): 'Pakistan'      LOC\n",
       "..                            ...                            ...      ...\n",
       "69  [1488, 1501): 'Shahid Afridi'  [1488, 1501): 'Shahid Afridi'      PER\n",
       "70    [1512, 1523): 'Salim Malik'    [1512, 1523): 'Salim Malik'      PER\n",
       "71     [1535, 1545): 'Ijaz Ahmad'     [1535, 1545): 'Ijaz Ahmad'      PER\n",
       "72       [1565, 1573): 'Pakistan'       [1565, 1573): 'Pakistan'      LOC\n",
       "73        [1626, 1633): 'Karachi'        [1626, 1633): 'Karachi'      LOC\n",
       "\n",
       "[74 rows x 3 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Align the BERT tokens with the original tokenization.\n",
    "bert_spans = tp.TokenSpanArray.align_to_tokens(bert_toks_df[\"span\"],\n",
    "                                               spans_df[\"span\"])\n",
    "pd.DataFrame({\n",
    "    \"original_span\": spans_df[\"span\"],\n",
    "    \"bert_spans\": bert_spans,\n",
    "    \"ent_type\": spans_df[\"ent_type\"]\n",
    "})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>[15, 17): 'KE'</td>\n",
       "      <td>22441</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>[17, 18): 'T'</td>\n",
       "      <td>1942</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>[18, 19): '-'</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>[20, 22): 'PA'</td>\n",
       "      <td>8544</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>[22, 23): 'K'</td>\n",
       "      <td>2428</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>[23, 25): 'IS'</td>\n",
       "      <td>6258</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>[25, 27): 'TA'</td>\n",
       "      <td>9159</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>[27, 28): 'N'</td>\n",
       "      <td>2249</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>[29, 30): 'V'</td>\n",
       "      <td>159</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>[31, 33): 'NE'</td>\n",
       "      <td>26546</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    token_id            span  input_id  token_type_id  attention_mask  \\\n",
       "10        10  [15, 17): 'KE'     22441              0               1   \n",
       "11        11   [17, 18): 'T'      1942              0               1   \n",
       "12        12   [18, 19): '-'       118              0               1   \n",
       "13        13  [20, 22): 'PA'      8544              0               1   \n",
       "14        14   [22, 23): 'K'      2428              0               1   \n",
       "15        15  [23, 25): 'IS'      6258              0               1   \n",
       "16        16  [25, 27): 'TA'      9159              0               1   \n",
       "17        17   [27, 28): 'N'      2249              0               1   \n",
       "18        18   [29, 30): 'V'       159              0               1   \n",
       "19        19  [31, 33): 'NE'     26546              0               1   \n",
       "\n",
       "    special_tokens_mask ent_iob ent_type  \n",
       "10                False       O     <NA>  \n",
       "11                False       O     <NA>  \n",
       "12                False       O     <NA>  \n",
       "13                False       B      LOC  \n",
       "14                False       I      LOC  \n",
       "15                False       I      LOC  \n",
       "16                False       I      LOC  \n",
       "17                False       I      LOC  \n",
       "18                False       O     <NA>  \n",
       "19                False       B      LOC  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate IOB2 tags and entity labels that align with the BERT tokens.\n",
    "# See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)\n",
    "bert_toks_df[[\"ent_iob\", \"ent_type\"]] = tp.io.conll.spans_to_iob(bert_spans, \n",
    "                                                        spans_df[\"ent_type\"])\n",
    "bert_toks_df[10:20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CategoricalDtype(categories=['O', 'B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC',\n",
       "                  'I-ORG', 'I-PER'],\n",
       ", ordered=False, categories_dtype=object)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a Pandas categorical type for consistent encoding of categories\n",
    "# across all documents.\n",
    "ENTITY_TYPES = [\"LOC\", \"MISC\", \"ORG\", \"PER\"]\n",
    "token_class_dtype, int_to_label, label_to_int = tp.io.conll.make_iob_tag_categories(ENTITY_TYPES)\n",
    "token_class_dtype"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>token_class_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>[0, 1): '-'</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>[1, 2): 'D'</td>\n",
       "      <td>141</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>[2, 4): 'OC'</td>\n",
       "      <td>9244</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>[4, 6): 'ST'</td>\n",
       "      <td>9272</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>684</th>\n",
       "      <td>684</td>\n",
       "      <td>[1621, 1622): ','</td>\n",
       "      <td>117</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>685</th>\n",
       "      <td>685</td>\n",
       "      <td>[1623, 1625): 'in'</td>\n",
       "      <td>1107</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>686</th>\n",
       "      <td>686</td>\n",
       "      <td>[1626, 1633): 'Karachi'</td>\n",
       "      <td>16237</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>B-LOC</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>687</th>\n",
       "      <td>687</td>\n",
       "      <td>[1633, 1634): '.'</td>\n",
       "      <td>119</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>688</th>\n",
       "      <td>688</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>102</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>689 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     token_id                     span  input_id  token_type_id  \\\n",
       "0           0               [0, 0): ''       101              0   \n",
       "1           1              [0, 1): '-'       118              0   \n",
       "2           2              [1, 2): 'D'       141              0   \n",
       "3           3             [2, 4): 'OC'      9244              0   \n",
       "4           4             [4, 6): 'ST'      9272              0   \n",
       "..        ...                      ...       ...            ...   \n",
       "684       684        [1621, 1622): ','       117              0   \n",
       "685       685       [1623, 1625): 'in'      1107              0   \n",
       "686       686  [1626, 1633): 'Karachi'     16237              0   \n",
       "687       687        [1633, 1634): '.'       119              0   \n",
       "688       688               [0, 0): ''       102              0   \n",
       "\n",
       "     attention_mask  special_tokens_mask ent_iob ent_type token_class  \\\n",
       "0                 1                 True       O     <NA>           O   \n",
       "1                 1                False       O     <NA>           O   \n",
       "2                 1                False       O     <NA>           O   \n",
       "3                 1                False       O     <NA>           O   \n",
       "4                 1                False       O     <NA>           O   \n",
       "..              ...                  ...     ...      ...         ...   \n",
       "684               1                False       O     <NA>           O   \n",
       "685               1                False       O     <NA>           O   \n",
       "686               1                False       B      LOC       B-LOC   \n",
       "687               1                False       O     <NA>           O   \n",
       "688               1                 True       O     <NA>           O   \n",
       "\n",
       "     token_class_id  \n",
       "0                 0  \n",
       "1                 0  \n",
       "2                 0  \n",
       "3                 0  \n",
       "4                 0  \n",
       "..              ...  \n",
       "684               0  \n",
       "685               0  \n",
       "686               1  \n",
       "687               0  \n",
       "688               0  \n",
       "\n",
       "[689 rows x 10 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The traditional way to transform NER to token classification is to \n",
    "# treat each combination of {I,O,B} X {entity type} as a different\n",
    "# class. Generate class labels in that format.\n",
    "classes_df = tp.io.conll.add_token_classes(bert_toks_df, token_class_dtype)\n",
    "classes_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Show how to compute BERT embeddings\n",
    "\n",
    "We are going to use the BERT embeddings as the feature vector to train our model. First, we will show how they are computed "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>input_id</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>embedding</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>[15, 17): 'KE'</td>\n",
       "      <td>22441</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>[   -0.19854149,     -0.4689835,     0.7755610...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>[17, 18): 'T'</td>\n",
       "      <td>1942</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>[   -0.24190366,    -0.42399386,      0.955406...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>[18, 19): '-'</td>\n",
       "      <td>118</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>[   -0.20076706,    -0.74819326,      1.302213...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>[20, 22): 'PA'</td>\n",
       "      <td>8544</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>B-LOC</td>\n",
       "      <td>[    0.20202558,    -0.26199856,     0.3297638...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>[22, 23): 'K'</td>\n",
       "      <td>2428</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>[   -0.54621553,    -0.90924287,   -0.05836811...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>[23, 25): 'IS'</td>\n",
       "      <td>6258</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>[    -0.3740038,    -0.68907374,    -0.1446250...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>[25, 27): 'TA'</td>\n",
       "      <td>9159</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>[   -0.46548516,    -0.87174106,     0.3557471...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>[27, 28): 'N'</td>\n",
       "      <td>2249</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>[   -0.18682763,      -0.900818,      0.360149...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>[29, 30): 'V'</td>\n",
       "      <td>159</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>[   -0.16640016,    -0.83638126,      0.874061...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>[31, 33): 'NE'</td>\n",
       "      <td>26546</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>B-LOC</td>\n",
       "      <td>[   -0.30241072,    -0.83826625,      1.105809...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    token_id            span  input_id ent_iob ent_type token_class  \\\n",
       "10        10  [15, 17): 'KE'     22441       O     <NA>           O   \n",
       "11        11   [17, 18): 'T'      1942       O     <NA>           O   \n",
       "12        12   [18, 19): '-'       118       O     <NA>           O   \n",
       "13        13  [20, 22): 'PA'      8544       B      LOC       B-LOC   \n",
       "14        14   [22, 23): 'K'      2428       I      LOC       I-LOC   \n",
       "15        15  [23, 25): 'IS'      6258       I      LOC       I-LOC   \n",
       "16        16  [25, 27): 'TA'      9159       I      LOC       I-LOC   \n",
       "17        17   [27, 28): 'N'      2249       I      LOC       I-LOC   \n",
       "18        18   [29, 30): 'V'       159       O     <NA>           O   \n",
       "19        19  [31, 33): 'NE'     26546       B      LOC       B-LOC   \n",
       "\n",
       "                                            embedding  \n",
       "10  [   -0.19854149,     -0.4689835,     0.7755610...  \n",
       "11  [   -0.24190366,    -0.42399386,      0.955406...  \n",
       "12  [   -0.20076706,    -0.74819326,      1.302213...  \n",
       "13  [    0.20202558,    -0.26199856,     0.3297638...  \n",
       "14  [   -0.54621553,    -0.90924287,   -0.05836811...  \n",
       "15  [    -0.3740038,    -0.68907374,    -0.1446250...  \n",
       "16  [   -0.46548516,    -0.87174106,     0.3557471...  \n",
       "17  [   -0.18682763,      -0.900818,      0.360149...  \n",
       "18  [   -0.16640016,    -0.83638126,      0.874061...  \n",
       "19  [   -0.30241072,    -0.83826625,      1.105809...  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Initialize the BERT model that will be used to generate embeddings.\n",
    "bert = transformers.BertModel.from_pretrained(bert_model_name)\n",
    "\n",
    "# Force garbage collection in case this notebook is running on a low-RAM environment.\n",
    "gc.collect()\n",
    "\n",
    "# Compute BERT embeddings with the BERT model and add result to our example DataFrame.\n",
    "embeddings_df = tp.io.bert.add_embeddings(classes_df, bert)\n",
    "embeddings_df[[\"token_id\", \"span\", \"input_id\", \"ent_iob\", \"ent_type\", \"token_class\", \"embedding\"]].iloc[10:20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>embedding</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>70</th>\n",
       "      <td>[155, 168): 'international'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>[   0.23405074,   -0.55348676,     0.9083985, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71</th>\n",
       "      <td>[169, 176): 'between'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>[   0.27792946,   -0.68537986,     1.1050353, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>72</th>\n",
       "      <td>[177, 185): 'Pakistan'</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[   0.19718906,   -0.46341145,    0.51823384, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>73</th>\n",
       "      <td>[186, 189): 'and'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>[   0.20423515,    -0.6375882,     0.8287437, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>74</th>\n",
       "      <td>[190, 193): 'New'</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[   0.28740603,   -0.47174266,     0.7771937, ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           span ent_iob ent_type  \\\n",
       "70  [155, 168): 'international'       O     <NA>   \n",
       "71        [169, 176): 'between'       O     <NA>   \n",
       "72       [177, 185): 'Pakistan'       B      LOC   \n",
       "73            [186, 189): 'and'       O     <NA>   \n",
       "74            [190, 193): 'New'       B      LOC   \n",
       "\n",
       "                                            embedding  \n",
       "70  [   0.23405074,   -0.55348676,     0.9083985, ...  \n",
       "71  [   0.27792946,   -0.68537986,     1.1050353, ...  \n",
       "72  [   0.19718906,   -0.46341145,    0.51823384, ...  \n",
       "73  [   0.20423515,    -0.6375882,     0.8287437, ...  \n",
       "74  [   0.28740603,   -0.47174266,     0.7771937, ...  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings_df[[\"span\", \"ent_iob\", \"ent_type\", \"embedding\"]].iloc[70:75]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<text_extensions_for_pandas.array.tensor.TensorDtype at 0x31c110c50>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The `embedding` column is an extension type `TensorDtype` that holds a \n",
    "#`TensorArray` provided by Text Extensions for Pandas.\n",
    "embeddings_df[\"embedding\"].dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A `TensorArray` can be constructed with a NumPy array of arbitrary dimensions, added to a DataFrame, then used with standard Pandas functionality. See the notebook [Text_Extension_for_Pandas_Overview](./Text_Extensions_for_Pandas.ipynb) for more on `TensorArray`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(dtype('float32'), (689, 768))"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Zero-copy conversion to NumPy can be done by first unwrapping the\n",
    "# `TensorArray` with `.array` and calling `to_numpy()`.\n",
    "embeddings_arr = embeddings_df[\"embedding\"].array.to_numpy()\n",
    "embeddings_arr.dtype, embeddings_arr.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate BERT tokens and BERT embeddings for the entire corpus\n",
    "\n",
    "Text Extensions for Pandas has a convenience function that will combine the above cells to create BERT tokens and embeddings. We will use this to add embeddings to the entire corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>token_class_id</th>\n",
       "      <th>embedding</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.08307027,     -0.3595905,      1.015068...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>[0, 1): '-'</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.22862527,    -0.49313563,      1.284232...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>[1, 2): 'D'</td>\n",
       "      <td>141</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   0.028480597,    -0.17874269,      1.543209...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>[2, 4): 'OC'</td>\n",
       "      <td>9244</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.46517557,    -0.29836097,      1.073769...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>[4, 6): 'ST'</td>\n",
       "      <td>9272</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[  -0.107307605,    -0.33720982,      1.226980...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>684</th>\n",
       "      <td>684</td>\n",
       "      <td>[1621, 1622): ','</td>\n",
       "      <td>117</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.12806726,  -0.0023241118,      0.678130...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>685</th>\n",
       "      <td>685</td>\n",
       "      <td>[1623, 1625): 'in'</td>\n",
       "      <td>1107</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    0.30534184,    -0.52625746,      0.828170...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>686</th>\n",
       "      <td>686</td>\n",
       "      <td>[1626, 1633): 'Karachi'</td>\n",
       "      <td>16237</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>B-LOC</td>\n",
       "      <td>1</td>\n",
       "      <td>[   -0.04873915,     -0.3379735,    -0.0583515...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>687</th>\n",
       "      <td>687</td>\n",
       "      <td>[1633, 1634): '.'</td>\n",
       "      <td>119</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[ -0.0052883998,    -0.29743025,     0.7161748...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>688</th>\n",
       "      <td>688</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>102</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[     -0.503024,     0.36253858,     0.7314936...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>689 rows × 11 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     token_id                     span  input_id  token_type_id  \\\n",
       "0           0               [0, 0): ''       101              0   \n",
       "1           1              [0, 1): '-'       118              0   \n",
       "2           2              [1, 2): 'D'       141              0   \n",
       "3           3             [2, 4): 'OC'      9244              0   \n",
       "4           4             [4, 6): 'ST'      9272              0   \n",
       "..        ...                      ...       ...            ...   \n",
       "684       684        [1621, 1622): ','       117              0   \n",
       "685       685       [1623, 1625): 'in'      1107              0   \n",
       "686       686  [1626, 1633): 'Karachi'     16237              0   \n",
       "687       687        [1633, 1634): '.'       119              0   \n",
       "688       688               [0, 0): ''       102              0   \n",
       "\n",
       "     attention_mask  special_tokens_mask ent_iob ent_type token_class  \\\n",
       "0                 1                 True       O     <NA>           O   \n",
       "1                 1                False       O     <NA>           O   \n",
       "2                 1                False       O     <NA>           O   \n",
       "3                 1                False       O     <NA>           O   \n",
       "4                 1                False       O     <NA>           O   \n",
       "..              ...                  ...     ...      ...         ...   \n",
       "684               1                False       O     <NA>           O   \n",
       "685               1                False       O     <NA>           O   \n",
       "686               1                False       B      LOC       B-LOC   \n",
       "687               1                False       O     <NA>           O   \n",
       "688               1                 True       O     <NA>           O   \n",
       "\n",
       "     token_class_id                                          embedding  \n",
       "0                 0  [   -0.08307027,     -0.3595905,      1.015068...  \n",
       "1                 0  [   -0.22862527,    -0.49313563,      1.284232...  \n",
       "2                 0  [   0.028480597,    -0.17874269,      1.543209...  \n",
       "3                 0  [   -0.46517557,    -0.29836097,      1.073769...  \n",
       "4                 0  [  -0.107307605,    -0.33720982,      1.226980...  \n",
       "..              ...                                                ...  \n",
       "684               0  [   -0.12806726,  -0.0023241118,      0.678130...  \n",
       "685               0  [    0.30534184,    -0.52625746,      0.828170...  \n",
       "686               1  [   -0.04873915,     -0.3379735,    -0.0583515...  \n",
       "687               0  [ -0.0052883998,    -0.29743025,     0.7161748...  \n",
       "688               0  [     -0.503024,     0.36253858,     0.7314936...  \n",
       "\n",
       "[689 rows x 11 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Example usage of the convenience function to create BERT tokens and embeddings.\n",
    "tp.io.bert.conll_to_bert(example_df, tokenizer, bert, token_class_dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When this notebook is running on a resource-constrained environment like [Binder](https://mybinder.org/),\n",
    "there may not be enough RAM available to hold all the embeddings in memory.\n",
    "So we use [Gaussian random projection](https://scikit-learn.org/stable/modules/random_projection.html#gaussian-random-projection) to reduce the size of the embeddings.\n",
    "The projection shrinks the embeddings by a factor of 3 at the expense of a small\n",
    "decrease in model accuracy.\n",
    "\n",
    "Change the constant `SHRINK_EMBEDDINGS` in the following cell to `False` if you want to disable this behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "SHRINK_EMBEDDINGS = False\n",
    "PROJECTION_DIMS = 256\n",
    "RANDOM_SEED=42\n",
    "\n",
    "import sklearn.random_projection\n",
    "projection = sklearn.random_projection.GaussianRandomProjection(\n",
    "    n_components=PROJECTION_DIMS, random_state=RANDOM_SEED)\n",
    "\n",
    "def maybe_shrink_embeddings(df):\n",
    "    if SHRINK_EMBEDDINGS:\n",
    "        df[\"embedding\"] = tp.TensorArray(projection.fit_transform(df[\"embedding\"]))\n",
    "    return df"
   ]
  },
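  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea behind the projection concrete, here is a rough NumPy-only sketch of a Gaussian random projection. It is not the scikit-learn implementation used above: the helper `gaussian_random_projection` and the array `fake_embeddings` are hypothetical, and the random data merely stands in for real BERT embeddings. The sketch assumes the usual construction in which the projection matrix has independent entries drawn from a normal distribution with mean 0 and variance `1 / n_components`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def gaussian_random_projection(embeddings, n_components, seed):\n",
    "    # Project each row of `embeddings` onto `n_components` random directions.\n",
    "    rng = np.random.default_rng(seed)\n",
    "    n_features = embeddings.shape[1]\n",
    "    # Scaling by 1 / sqrt(n_components) preserves squared norms in expectation.\n",
    "    components = rng.normal(\n",
    "        loc=0.0, scale=1.0 / np.sqrt(n_components), size=(n_features, n_components)\n",
    "    )\n",
    "    return embeddings @ components\n",
    "\n",
    "# Stand-in for one document's embeddings: 689 tokens x 768 dimensions.\n",
    "data_rng = np.random.default_rng(0)\n",
    "fake_embeddings = data_rng.normal(size=(689, 768)).astype(np.float32)\n",
    "\n",
    "projected = gaussian_random_projection(fake_embeddings, n_components=256, seed=42)\n",
    "print(projected.shape)  # (689, 256)\n",
    "\n",
    "# Distances between rows are approximately preserved by the projection.\n",
    "before = np.linalg.norm(fake_embeddings[0] - fake_embeddings[1])\n",
    "after = np.linalg.norm(projected[0] - projected[1])\n",
    "print(after / before)\n",
    "```\n",
    "\n",
    "The printed ratio should be close to 1, which is why a model trained on the smaller vectors can be expected to lose only a little accuracy. Because the sketch draws its own random matrix, it will not reproduce the numbers from `GaussianRandomProjection`, even with the same seed."
   ]
  },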
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing fold 'train'...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "7d66c6a67e354c08a20400f09b8c5262",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=946, style=ProgressStyle(desc…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing fold 'dev'...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f82cff3396ba45bc9dbef2fbd5639d1c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=216, style=ProgressStyle(desc…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing fold 'test'...\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "13c9a59b3e544a4280f50e452ec51ba7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=231, style=ProgressStyle(desc…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>token_class_id</th>\n",
       "      <th>embedding</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    -0.1766959,    -0.39899594,     0.9088877...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>[0, 1): '-'</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.38553804,     -0.5023272,      1.173233...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>[1, 2): 'D'</td>\n",
       "      <td>141</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.11718983,    -0.12701103,      1.389693...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>[2, 4): 'OC'</td>\n",
       "      <td>9244</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.39025718,    -0.25043368,      1.074508...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>[4, 6): 'ST'</td>\n",
       "      <td>9272</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.27732685,    -0.26160043,      1.078760...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2154</th>\n",
       "      <td>2154</td>\n",
       "      <td>[5704, 5705): ')'</td>\n",
       "      <td>114</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   0.015393254,   -0.040650375,      1.001184...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2155</th>\n",
       "      <td>2155</td>\n",
       "      <td>[5706, 5708): '39'</td>\n",
       "      <td>3614</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    0.07503936,    0.014401494,       1.04323...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2156</th>\n",
       "      <td>2156</td>\n",
       "      <td>[5708, 5709): '.'</td>\n",
       "      <td>119</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[  -0.085797176,     0.05905599,      1.114640...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2157</th>\n",
       "      <td>2157</td>\n",
       "      <td>[5709, 5711): '93'</td>\n",
       "      <td>5429</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   0.011378761,     -0.2638729,     0.8818034...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2158</th>\n",
       "      <td>2158</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>102</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    0.48513296,      1.5709878,      0.592933...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2159 rows × 11 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      token_id                span  input_id  token_type_id  attention_mask  \\\n",
       "0            0          [0, 0): ''       101              0               1   \n",
       "1            1         [0, 1): '-'       118              0               1   \n",
       "2            2         [1, 2): 'D'       141              0               1   \n",
       "3            3        [2, 4): 'OC'      9244              0               1   \n",
       "4            4        [4, 6): 'ST'      9272              0               1   \n",
       "...        ...                 ...       ...            ...             ...   \n",
       "2154      2154   [5704, 5705): ')'       114              0               1   \n",
       "2155      2155  [5706, 5708): '39'      3614              0               1   \n",
       "2156      2156   [5708, 5709): '.'       119              0               1   \n",
       "2157      2157  [5709, 5711): '93'      5429              0               1   \n",
       "2158      2158          [0, 0): ''       102              0               1   \n",
       "\n",
       "      special_tokens_mask ent_iob ent_type token_class  token_class_id  \\\n",
       "0                    True       O     <NA>           O               0   \n",
       "1                   False       O     <NA>           O               0   \n",
       "2                   False       O     <NA>           O               0   \n",
       "3                   False       O     <NA>           O               0   \n",
       "4                   False       O     <NA>           O               0   \n",
       "...                   ...     ...      ...         ...             ...   \n",
       "2154                False       O     <NA>           O               0   \n",
       "2155                False       O     <NA>           O               0   \n",
       "2156                False       O     <NA>           O               0   \n",
       "2157                False       O     <NA>           O               0   \n",
       "2158                 True       O     <NA>           O               0   \n",
       "\n",
       "                                              embedding  \n",
       "0     [    -0.1766959,    -0.39899594,     0.9088877...  \n",
       "1     [   -0.38553804,     -0.5023272,      1.173233...  \n",
       "2     [   -0.11718983,    -0.12701103,      1.389693...  \n",
       "3     [   -0.39025718,    -0.25043368,      1.074508...  \n",
       "4     [   -0.27732685,    -0.26160043,      1.078760...  \n",
       "...                                                 ...  \n",
       "2154  [   0.015393254,   -0.040650375,      1.001184...  \n",
       "2155  [    0.07503936,    0.014401494,       1.04323...  \n",
       "2156  [  -0.085797176,     0.05905599,      1.114640...  \n",
       "2157  [   0.011378761,     -0.2638729,     0.8818034...  \n",
       "2158  [    0.48513296,      1.5709878,      0.592933...  \n",
       "\n",
       "[2159 rows x 11 columns]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Run the entire corpus through our processing pipeline.\n",
    "bert_toks_by_fold = {}\n",
    "for fold_name in corpus_raw.keys():\n",
    "    print(f\"Processing fold '{fold_name}'...\")\n",
    "    raw = corpus_raw[fold_name]\n",
    "    with torch.inference_mode():  # This line cuts CPU usage by ~50%\n",
    "        bert_toks_by_fold[fold_name] = tp.jupyter.run_with_progress_bar(\n",
    "            len(raw), lambda i: maybe_shrink_embeddings(tp.io.bert.conll_to_bert(\n",
    "                raw[i], tokenizer, bert, token_class_dtype)))\n",
    "    \n",
    "bert_toks_by_fold[\"dev\"][20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Collate the data structures we've generated so far"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fold</th>\n",
       "      <th>doc_num</th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>token_class_id</th>\n",
       "      <th>embedding</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[  -0.098505564,    -0.40501904,     0.7428880...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>[0, 1): '-'</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.05702211,     -0.4811217,     0.9898696...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>[1, 2): 'D'</td>\n",
       "      <td>141</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.04824175,    -0.25329986,       1.16719...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>[2, 4): 'OC'</td>\n",
       "      <td>9244</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.26682886,     -0.3100877,      1.007474...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>[4, 6): 'ST'</td>\n",
       "      <td>9272</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.22296861,    -0.21308465,      0.933102...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416536</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>314</td>\n",
       "      <td>[1386, 1393): 'brother'</td>\n",
       "      <td>1711</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[  -0.028172327,    -0.08062323,     0.9804876...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416537</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>315</td>\n",
       "      <td>[1393, 1394): ','</td>\n",
       "      <td>117</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    0.11817275,   -0.070084594,     0.8654851...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416538</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>316</td>\n",
       "      <td>[1395, 1400): 'Bobby'</td>\n",
       "      <td>5545</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>PER</td>\n",
       "      <td>B-PER</td>\n",
       "      <td>4</td>\n",
       "      <td>[    -0.3568941,     0.31400397,      1.573852...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416539</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>317</td>\n",
       "      <td>[1400, 1401): '.'</td>\n",
       "      <td>119</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.18957107,     -0.2458124,      0.662574...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416540</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>318</td>\n",
       "      <td>[0, 0): ''</td>\n",
       "      <td>102</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.44689038,    -0.31665286,      0.779687...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>416541 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         fold  doc_num  token_id                     span  input_id  \\\n",
       "0       train        0         0               [0, 0): ''       101   \n",
       "1       train        0         1              [0, 1): '-'       118   \n",
       "2       train        0         2              [1, 2): 'D'       141   \n",
       "3       train        0         3             [2, 4): 'OC'      9244   \n",
       "4       train        0         4             [4, 6): 'ST'      9272   \n",
       "...       ...      ...       ...                      ...       ...   \n",
       "416536   test      230       314  [1386, 1393): 'brother'      1711   \n",
       "416537   test      230       315        [1393, 1394): ','       117   \n",
       "416538   test      230       316    [1395, 1400): 'Bobby'      5545   \n",
       "416539   test      230       317        [1400, 1401): '.'       119   \n",
       "416540   test      230       318               [0, 0): ''       102   \n",
       "\n",
       "        token_type_id  attention_mask  special_tokens_mask ent_iob ent_type  \\\n",
       "0                   0               1                 True       O     <NA>   \n",
       "1                   0               1                False       O     <NA>   \n",
       "2                   0               1                False       O     <NA>   \n",
       "3                   0               1                False       O     <NA>   \n",
       "4                   0               1                False       O     <NA>   \n",
       "...               ...             ...                  ...     ...      ...   \n",
       "416536              0               1                False       O     <NA>   \n",
       "416537              0               1                False       O     <NA>   \n",
       "416538              0               1                False       B      PER   \n",
       "416539              0               1                False       O     <NA>   \n",
       "416540              0               1                 True       O     <NA>   \n",
       "\n",
       "       token_class  token_class_id  \\\n",
       "0                O               0   \n",
       "1                O               0   \n",
       "2                O               0   \n",
       "3                O               0   \n",
       "4                O               0   \n",
       "...            ...             ...   \n",
       "416536           O               0   \n",
       "416537           O               0   \n",
       "416538       B-PER               4   \n",
       "416539           O               0   \n",
       "416540           O               0   \n",
       "\n",
       "                                                embedding  \n",
       "0       [  -0.098505564,    -0.40501904,     0.7428880...  \n",
       "1       [   -0.05702211,     -0.4811217,     0.9898696...  \n",
       "2       [   -0.04824175,    -0.25329986,       1.16719...  \n",
       "3       [   -0.26682886,     -0.3100877,      1.007474...  \n",
       "4       [   -0.22296861,    -0.21308465,      0.933102...  \n",
       "...                                                   ...  \n",
       "416536  [  -0.028172327,    -0.08062323,     0.9804876...  \n",
       "416537  [    0.11817275,   -0.070084594,     0.8654851...  \n",
       "416538  [    -0.3568941,     0.31400397,      1.573852...  \n",
       "416539  [   -0.18957107,     -0.2458124,      0.662574...  \n",
       "416540  [   -0.44689038,    -0.31665286,      0.779687...  \n",
       "\n",
       "[416541 rows x 13 columns]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a single DataFrame with the entire corpus's embeddings.\n",
    "corpus_df = tp.io.conll.combine_folds(bert_toks_by_fold)\n",
    "corpus_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Checkpoint\n",
    "\n",
    "With the `TensorArray` from Text Extensions for Pandas, the computed embeddings can be persisted as a tensor along with the rest of the DataFrame using standard Pandas input/output methods. Since computing the embeddings is costly and the results are deterministic, checkpointing the data to disk here can save a lot of time. We can then continue with model training without recomputing the BERT embeddings.\n",
    " \n",
    "### Save DataFrame with Embeddings Tensor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write the tokenized corpus with embeddings to a Feather file.\n",
    "# Span columns that cover multiple documents can't currently be serialized\n",
    "# (see issue #73: https://github.com/CODAIT/text-extensions-for-pandas/issues/73),\n",
    "# so drop the span columns before writing the Feather file.\n",
    "cols_to_drop = [c for c in corpus_df.columns if \"span\" in c]\n",
    "corpus_df.drop(columns=cols_to_drop).to_feather(\"outputs/corpus.feather\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load DataFrame with Previously Computed Embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fold</th>\n",
       "      <th>doc_num</th>\n",
       "      <th>token_id</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>token_class_id</th>\n",
       "      <th>embedding</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[  -0.098505564,    -0.40501904,     0.7428880...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.05702211,     -0.4811217,     0.9898696...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>141</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.04824175,    -0.25329986,       1.16719...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>9244</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.26682886,     -0.3100877,      1.007474...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>9272</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.22296861,    -0.21308465,      0.933102...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416536</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>314</td>\n",
       "      <td>1711</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[  -0.028172327,    -0.08062323,     0.9804876...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416537</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>315</td>\n",
       "      <td>117</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    0.11817275,   -0.070084594,     0.8654851...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416538</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>316</td>\n",
       "      <td>5545</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>PER</td>\n",
       "      <td>B-PER</td>\n",
       "      <td>4</td>\n",
       "      <td>[    -0.3568941,     0.31400397,      1.573852...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416539</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>317</td>\n",
       "      <td>119</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.18957107,     -0.2458124,      0.662574...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>416540</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>318</td>\n",
       "      <td>102</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.44689038,    -0.31665286,      0.779687...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>416541 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         fold  doc_num  token_id  input_id  token_type_id  attention_mask  \\\n",
       "0       train        0         0       101              0               1   \n",
       "1       train        0         1       118              0               1   \n",
       "2       train        0         2       141              0               1   \n",
       "3       train        0         3      9244              0               1   \n",
       "4       train        0         4      9272              0               1   \n",
       "...       ...      ...       ...       ...            ...             ...   \n",
       "416536   test      230       314      1711              0               1   \n",
       "416537   test      230       315       117              0               1   \n",
       "416538   test      230       316      5545              0               1   \n",
       "416539   test      230       317       119              0               1   \n",
       "416540   test      230       318       102              0               1   \n",
       "\n",
       "        special_tokens_mask ent_iob ent_type token_class  token_class_id  \\\n",
       "0                      True       O     <NA>           O               0   \n",
       "1                     False       O     <NA>           O               0   \n",
       "2                     False       O     <NA>           O               0   \n",
       "3                     False       O     <NA>           O               0   \n",
       "4                     False       O     <NA>           O               0   \n",
       "...                     ...     ...      ...         ...             ...   \n",
       "416536                False       O     <NA>           O               0   \n",
       "416537                False       O     <NA>           O               0   \n",
       "416538                False       B      PER       B-PER               4   \n",
       "416539                False       O     <NA>           O               0   \n",
       "416540                 True       O     <NA>           O               0   \n",
       "\n",
       "                                                embedding  \n",
       "0       [  -0.098505564,    -0.40501904,     0.7428880...  \n",
       "1       [   -0.05702211,     -0.4811217,     0.9898696...  \n",
       "2       [   -0.04824175,    -0.25329986,       1.16719...  \n",
       "3       [   -0.26682886,     -0.3100877,      1.007474...  \n",
       "4       [   -0.22296861,    -0.21308465,      0.933102...  \n",
       "...                                                   ...  \n",
       "416536  [  -0.028172327,    -0.08062323,     0.9804876...  \n",
       "416537  [    0.11817275,   -0.070084594,     0.8654851...  \n",
       "416538  [    -0.3568941,     0.31400397,      1.573852...  \n",
       "416539  [   -0.18957107,     -0.2458124,      0.662574...  \n",
       "416540  [   -0.44689038,    -0.31665286,      0.779687...  \n",
       "\n",
       "[416541 rows x 12 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Read the serialized embeddings back in so that you can rerun the model \n",
    "# training parts of this notebook (the cells from here onward) without \n",
    "# regenerating the embeddings.\n",
    "corpus_df = pd.read_feather(\"outputs/corpus.feather\")\n",
    "corpus_df"
   ]
  },
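  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Taken together, the save and load cells above form a compute-once, reload-later checkpoint. As a minimal sketch of that pattern, the helper below runs the expensive step only when no checkpoint file exists yet. The names `load_or_compute`, `expensive`, `save_json`, and `load_json` are hypothetical and not part of Text Extensions for Pandas, and JSON stands in for Feather so that the example runs with only the standard library.\n",
    "\n",
    "```python\n",
    "import json\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def load_or_compute(path, compute, save, load):\n",
    "    '''Return load(path) if the checkpoint exists; otherwise compute, save, and return.'''\n",
    "    path = Path(path)\n",
    "    if path.exists():\n",
    "        return load(path)\n",
    "    result = compute()\n",
    "    # Create the output directory on first use.\n",
    "    path.parent.mkdir(parents=True, exist_ok=True)\n",
    "    save(result, path)\n",
    "    return result\n",
    "\n",
    "\n",
    "# Hypothetical stand-in for the costly embedding step; records each run.\n",
    "calls = []\n",
    "\n",
    "\n",
    "def expensive():\n",
    "    calls.append(1)\n",
    "    return {'embedding': [0.1, 0.2, 0.3]}\n",
    "\n",
    "\n",
    "def save_json(obj, path):\n",
    "    path.write_text(json.dumps(obj))\n",
    "\n",
    "\n",
    "def load_json(path):\n",
    "    return json.loads(path.read_text())\n",
    "\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    ckpt = Path(tmp) / 'outputs' / 'corpus.json'\n",
    "    first = load_or_compute(ckpt, expensive, save_json, load_json)   # computes and saves\n",
    "    second = load_or_compute(ckpt, expensive, save_json, load_json)  # loads from disk\n",
    "\n",
    "print(first == second, len(calls))  # prints: True 1\n",
    "```\n",
    "\n",
    "In this notebook, `compute` would wrap the embedding cells, `save` could be `lambda df, path: df.to_feather(path)` applied after dropping the span columns, and `load` could be `pd.read_feather`."
   ]
  },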
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training a model on the BERT embeddings\n",
    "\n",
    "Now we will use the loaded BERT embeddings to train a multinomial logistic regression model that predicts each token's class from its embedding tensor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fold</th>\n",
       "      <th>doc_num</th>\n",
       "      <th>token_id</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>token_class_id</th>\n",
       "      <th>embedding</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[  -0.098505564,    -0.40501904,     0.7428880...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.05702211,     -0.4811217,     0.9898696...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>141</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.04824175,    -0.25329986,       1.16719...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>9244</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.26682886,     -0.3100877,      1.007474...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>train</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>9272</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.22296861,    -0.21308465,      0.933102...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281104</th>\n",
       "      <td>train</td>\n",
       "      <td>945</td>\n",
       "      <td>53</td>\n",
       "      <td>17057</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>ORG</td>\n",
       "      <td>B-ORG</td>\n",
       "      <td>3</td>\n",
       "      <td>[     0.7556377,      -0.918912,    -0.1403013...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281105</th>\n",
       "      <td>train</td>\n",
       "      <td>945</td>\n",
       "      <td>54</td>\n",
       "      <td>122</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.11528667,      -0.444921,      0.471555...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281106</th>\n",
       "      <td>train</td>\n",
       "      <td>945</td>\n",
       "      <td>55</td>\n",
       "      <td>4617</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>ORG</td>\n",
       "      <td>B-ORG</td>\n",
       "      <td>3</td>\n",
       "      <td>[     0.4560219,    -0.89708394,     0.0678624...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281107</th>\n",
       "      <td>train</td>\n",
       "      <td>945</td>\n",
       "      <td>56</td>\n",
       "      <td>123</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.19713758,    -0.54272026,     0.2940197...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281108</th>\n",
       "      <td>train</td>\n",
       "      <td>945</td>\n",
       "      <td>57</td>\n",
       "      <td>102</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    -0.5765076,    -0.42160636,     0.9947052...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>281109 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         fold  doc_num  token_id  input_id  token_type_id  attention_mask  \\\n",
       "0       train        0         0       101              0               1   \n",
       "1       train        0         1       118              0               1   \n",
       "2       train        0         2       141              0               1   \n",
       "3       train        0         3      9244              0               1   \n",
       "4       train        0         4      9272              0               1   \n",
       "...       ...      ...       ...       ...            ...             ...   \n",
       "281104  train      945        53     17057              0               1   \n",
       "281105  train      945        54       122              0               1   \n",
       "281106  train      945        55      4617              0               1   \n",
       "281107  train      945        56       123              0               1   \n",
       "281108  train      945        57       102              0               1   \n",
       "\n",
       "        special_tokens_mask ent_iob ent_type token_class  token_class_id  \\\n",
       "0                      True       O     <NA>           O               0   \n",
       "1                     False       O     <NA>           O               0   \n",
       "2                     False       O     <NA>           O               0   \n",
       "3                     False       O     <NA>           O               0   \n",
       "4                     False       O     <NA>           O               0   \n",
       "...                     ...     ...      ...         ...             ...   \n",
       "281104                False       B      ORG       B-ORG               3   \n",
       "281105                False       O     <NA>           O               0   \n",
       "281106                False       B      ORG       B-ORG               3   \n",
       "281107                False       O     <NA>           O               0   \n",
       "281108                 True       O     <NA>           O               0   \n",
       "\n",
       "                                                embedding  \n",
       "0       [  -0.098505564,    -0.40501904,     0.7428880...  \n",
       "1       [   -0.05702211,     -0.4811217,     0.9898696...  \n",
       "2       [   -0.04824175,    -0.25329986,       1.16719...  \n",
       "3       [   -0.26682886,     -0.3100877,      1.007474...  \n",
       "4       [   -0.22296861,    -0.21308465,      0.933102...  \n",
       "...                                                   ...  \n",
       "281104  [     0.7556377,      -0.918912,    -0.1403013...  \n",
       "281105  [   -0.11528667,      -0.444921,      0.471555...  \n",
       "281106  [     0.4560219,    -0.89708394,     0.0678624...  \n",
       "281107  [   -0.19713758,    -0.54272026,     0.2940197...  \n",
       "281108  [    -0.5765076,    -0.42160636,     0.9947052...  \n",
       "\n",
       "[281109 rows x 12 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Extract the training set DataFrame.\n",
    "train_df = corpus_df[corpus_df[\"fold\"] == \"train\"]\n",
    "train_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 10min 45s, sys: 1min 1s, total: 11min 46s\n",
      "Wall time: 1min 3s\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {\n",
       "  /* Definition of color scheme common for light and dark mode */\n",
       "  --sklearn-color-text: #000;\n",
       "  --sklearn-color-text-muted: #666;\n",
       "  --sklearn-color-line: gray;\n",
       "  /* Definition of color scheme for unfitted estimators */\n",
       "  --sklearn-color-unfitted-level-0: #fff5e6;\n",
       "  --sklearn-color-unfitted-level-1: #f6e4d2;\n",
       "  --sklearn-color-unfitted-level-2: #ffe0b3;\n",
       "  --sklearn-color-unfitted-level-3: chocolate;\n",
       "  /* Definition of color scheme for fitted estimators */\n",
       "  --sklearn-color-fitted-level-0: #f0f8ff;\n",
       "  --sklearn-color-fitted-level-1: #d4ebff;\n",
       "  --sklearn-color-fitted-level-2: #b3dbfd;\n",
       "  --sklearn-color-fitted-level-3: cornflowerblue;\n",
       "\n",
       "  /* Specific color for light theme */\n",
       "  --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
       "  --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-icon: #696969;\n",
       "\n",
       "  @media (prefers-color-scheme: dark) {\n",
       "    /* Redefinition of color scheme for dark theme */\n",
       "    --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
       "    --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-icon: #878787;\n",
       "  }\n",
       "}\n",
       "\n",
       "#sk-container-id-1 {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 pre {\n",
       "  padding: 0;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-hidden--visually {\n",
       "  border: 0;\n",
       "  clip: rect(1px 1px 1px 1px);\n",
       "  clip: rect(1px, 1px, 1px, 1px);\n",
       "  height: 1px;\n",
       "  margin: -1px;\n",
       "  overflow: hidden;\n",
       "  padding: 0;\n",
       "  position: absolute;\n",
       "  width: 1px;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-dashed-wrapped {\n",
       "  border: 1px dashed var(--sklearn-color-line);\n",
       "  margin: 0 0.4em 0.5em 0.4em;\n",
       "  box-sizing: border-box;\n",
       "  padding-bottom: 0.4em;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-container {\n",
       "  /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
       "     but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
       "     so we also need the `!important` here to be able to override the\n",
       "     default hidden behavior on the sphinx rendered scikit-learn.org.\n",
       "     See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
       "  display: inline-block !important;\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-text-repr-fallback {\n",
       "  display: none;\n",
       "}\n",
       "\n",
       "div.sk-parallel-item,\n",
       "div.sk-serial,\n",
       "div.sk-item {\n",
       "  /* draw centered vertical line to link estimators */\n",
       "  background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
       "  background-size: 2px 100%;\n",
       "  background-repeat: no-repeat;\n",
       "  background-position: center center;\n",
       "}\n",
       "\n",
       "/* Parallel-specific style estimator block */\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item::after {\n",
       "  content: \"\";\n",
       "  width: 100%;\n",
       "  border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
       "  flex-grow: 1;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel {\n",
       "  display: flex;\n",
       "  align-items: stretch;\n",
       "  justify-content: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:first-child::after {\n",
       "  align-self: flex-end;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:last-child::after {\n",
       "  align-self: flex-start;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:only-child::after {\n",
       "  width: 0;\n",
       "}\n",
       "\n",
       "/* Serial-specific style estimator block */\n",
       "\n",
       "#sk-container-id-1 div.sk-serial {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "  align-items: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  padding-right: 1em;\n",
       "  padding-left: 1em;\n",
       "}\n",
       "\n",
       "\n",
       "/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
       "clickable and can be expanded/collapsed.\n",
       "- Pipeline and ColumnTransformer use this feature and define the default style\n",
       "- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
       "*/\n",
       "\n",
       "/* Pipeline and ColumnTransformer style (default) */\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable {\n",
       "  /* Default theme specific background. It is overwritten whether we have a\n",
       "  specific estimator or a Pipeline/ColumnTransformer */\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "/* Toggleable label */\n",
       "#sk-container-id-1 label.sk-toggleable__label {\n",
       "  cursor: pointer;\n",
       "  display: flex;\n",
       "  width: 100%;\n",
       "  margin-bottom: 0;\n",
       "  padding: 0.5em;\n",
       "  box-sizing: border-box;\n",
       "  text-align: center;\n",
       "  align-items: start;\n",
       "  justify-content: space-between;\n",
       "  gap: 0.5em;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label .caption {\n",
       "  font-size: 0.6rem;\n",
       "  font-weight: lighter;\n",
       "  color: var(--sklearn-color-text-muted);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label-arrow:before {\n",
       "  /* Arrow on the left of the label */\n",
       "  content: \"▸\";\n",
       "  float: left;\n",
       "  margin-right: 0.25em;\n",
       "  color: var(--sklearn-color-icon);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "/* Toggleable content - dropdown */\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content {\n",
       "  max-height: 0;\n",
       "  max-width: 0;\n",
       "  overflow: hidden;\n",
       "  text-align: left;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content pre {\n",
       "  margin: 0.2em;\n",
       "  border-radius: 0.25em;\n",
       "  color: var(--sklearn-color-text);\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content.fitted pre {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
       "  /* Expand drop-down */\n",
       "  max-height: 200px;\n",
       "  max-width: 100%;\n",
       "  overflow: auto;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
       "  content: \"▾\";\n",
       "}\n",
       "\n",
       "/* Pipeline/ColumnTransformer-specific style */\n",
       "\n",
       "#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator-specific style */\n",
       "\n",
       "/* Colorize estimator box */\n",
       "#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label label.sk-toggleable__label,\n",
       "#sk-container-id-1 div.sk-label label {\n",
       "  /* The background is the default theme color */\n",
       "  color: var(--sklearn-color-text-on-default-background);\n",
       "}\n",
       "\n",
       "/* On hover, darken the color of the background */\n",
       "#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "/* Label box, darken color on hover, fitted */\n",
       "#sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator label */\n",
       "\n",
       "#sk-container-id-1 div.sk-label label {\n",
       "  font-family: monospace;\n",
       "  font-weight: bold;\n",
       "  display: inline-block;\n",
       "  line-height: 1.2em;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label-container {\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "/* Estimator-specific */\n",
       "#sk-container-id-1 div.sk-estimator {\n",
       "  font-family: monospace;\n",
       "  border: 1px dotted var(--sklearn-color-border-box);\n",
       "  border-radius: 0.25em;\n",
       "  box-sizing: border-box;\n",
       "  margin-bottom: 0.5em;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "/* on hover */\n",
       "#sk-container-id-1 div.sk-estimator:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
       "\n",
       "/* Common style for \"i\" and \"?\" */\n",
       "\n",
       ".sk-estimator-doc-link,\n",
       "a:link.sk-estimator-doc-link,\n",
       "a:visited.sk-estimator-doc-link {\n",
       "  float: right;\n",
       "  font-size: smaller;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1em;\n",
       "  height: 1em;\n",
       "  width: 1em;\n",
       "  text-decoration: none !important;\n",
       "  margin-left: 0.5em;\n",
       "  text-align: center;\n",
       "  /* unfitted */\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted,\n",
       "a:link.sk-estimator-doc-link.fitted,\n",
       "a:visited.sk-estimator-doc-link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "/* Span, style for the box shown on hovering the info icon */\n",
       ".sk-estimator-doc-link span {\n",
       "  display: none;\n",
       "  z-index: 9999;\n",
       "  position: relative;\n",
       "  font-weight: normal;\n",
       "  right: .2ex;\n",
       "  padding: .5ex;\n",
       "  margin: .5ex;\n",
       "  width: min-content;\n",
       "  min-width: 20ex;\n",
       "  max-width: 50ex;\n",
       "  color: var(--sklearn-color-text);\n",
       "  box-shadow: 2pt 2pt 4pt #999;\n",
       "  /* unfitted */\n",
       "  background: var(--sklearn-color-unfitted-level-0);\n",
       "  border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted span {\n",
       "  /* fitted */\n",
       "  background: var(--sklearn-color-fitted-level-0);\n",
       "  border: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link:hover span {\n",
       "  display: block;\n",
       "}\n",
       "\n",
       "/* \"?\"-specific style due to the `<a>` HTML tag */\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link {\n",
       "  float: right;\n",
       "  font-size: 1rem;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1rem;\n",
       "  height: 1rem;\n",
       "  width: 1rem;\n",
       "  text-decoration: none;\n",
       "  /* unfitted */\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "#sk-container-id-1 a.estimator_doc_link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;mlogreg&#x27;,\n",
       "                 LogisticRegression(C=0.1, max_iter=10000, verbose=1))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" ><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow\"><div><div>Pipeline</div></div><div><a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.6/modules/generated/sklearn.pipeline.Pipeline.html\">?<span>Documentation for Pipeline</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></div></label><div class=\"sk-toggleable__content fitted\"><pre>Pipeline(steps=[(&#x27;mlogreg&#x27;,\n",
       "                 LogisticRegression(C=0.1, max_iter=10000, verbose=1))])</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow\"><div><div>LogisticRegression</div></div><div><a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.6/modules/generated/sklearn.linear_model.LogisticRegression.html\">?<span>Documentation for LogisticRegression</span></a></div></label><div class=\"sk-toggleable__content fitted\"><pre>LogisticRegression(C=0.1, max_iter=10000, verbose=1)</pre></div> </div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('mlogreg',\n",
       "                 LogisticRegression(C=0.1, max_iter=10000, verbose=1))])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# Train a multinomial logistic regression model on the training set.\n",
    "#MULTI_CLASS = \"multinomial\"\n",
    "    \n",
    "# How many iterations to run the L-BFGS optimizer when fitting logistic\n",
    "# regression models. 100 ==> Fast; 10000 ==> Full convergence\n",
    "LBGFS_ITERATIONS = 10000\n",
    "_REGULARIZATION_COEFF = 1e-1  # Passed as C (inverse strength): smaller values ==> more regularization\n",
    "\n",
    "base_pipeline = sklearn.pipeline.Pipeline([\n",
    "    # Standard scaler. This only makes a difference for certain classes\n",
    "    # of embeddings.\n",
    "    #(\"scaler\", sklearn.preprocessing.StandardScaler()),\n",
    "    (\"mlogreg\", sklearn.linear_model.LogisticRegression(\n",
    "        #multi_class=MULTI_CLASS,\n",
    "        verbose=1,\n",
    "        max_iter=LBGFS_ITERATIONS,\n",
    "        C=_REGULARIZATION_COEFF\n",
    "    ))\n",
    "])\n",
    "\n",
    "X_train = train_df[\"embedding\"].values\n",
    "Y_train = train_df[\"token_class_id\"]\n",
    "base_model = base_pipeline.fit(X_train, Y_train)\n",
    "base_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make Predictions on Token Class from BERT Embeddings\n",
    "\n",
    "Using our model, we can now predict the class of each token in the test set from its precomputed BERT embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define a function that will let us make predictions on a fold of the corpus.\n",
    "def predict_on_df(df: pd.DataFrame, id_to_class: Dict[int, str], predictor):\n",
    "    \"\"\"\n",
    "    Run a trained model on a DataFrame of tokens with embeddings.\n",
    "\n",
    "    :param df: DataFrame of tokens for a document, containing a TensorArray\n",
    "     column called \"embedding\" that holds the embedding of each token.\n",
    "    :param id_to_class: Mapping from class ID to class name, as returned by\n",
    "     :func:`text_extensions_for_pandas.make_iob_tag_categories`\n",
    "    :param predictor: Python object with a `predict_proba` method that accepts\n",
    "     a numpy array of embeddings.\n",
    "    :returns: A copy of `df`, with the following additional columns:\n",
    "     `predicted_id`, `predicted_class`, `predicted_iob`, `predicted_type`\n",
    "     and `predicted_class_pr`.\n",
    "    \"\"\"\n",
    "    result_df = df.copy()\n",
    "    embeddings = result_df[\"embedding\"].to_numpy()\n",
    "    class_pr = tp.TensorArray(predictor.predict_proba(embeddings))\n",
    "    result_df[\"predicted_id\"] = np.argmax(class_pr, axis=1)\n",
    "    result_df[\"predicted_class\"] = [id_to_class[i]\n",
    "                                    for i in result_df[\"predicted_id\"].values]\n",
    "    iobs, types = tp.io.conll.decode_class_labels(result_df[\"predicted_class\"].values)\n",
    "    result_df[\"predicted_iob\"] = iobs\n",
    "    result_df[\"predicted_type\"] = types\n",
    "    result_df[\"predicted_class_pr\"] = class_pr\n",
    "    return result_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fold</th>\n",
       "      <th>doc_num</th>\n",
       "      <th>token_id</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>token_class_id</th>\n",
       "      <th>embedding</th>\n",
       "      <th>predicted_id</th>\n",
       "      <th>predicted_class</th>\n",
       "      <th>predicted_iob</th>\n",
       "      <th>predicted_type</th>\n",
       "      <th>predicted_class_pr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>351001</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>101</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>True</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.19626567,    -0.45093697,       0.67753...</td>\n",
       "      <td>0</td>\n",
       "      <td>O</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[    0.9996154604786717,  1.744378080606689e-0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351002</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    -0.3187216,     -0.5074786,      1.046451...</td>\n",
       "      <td>0</td>\n",
       "      <td>O</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[    0.9992898679359635,  4.547117251747756e-0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351003</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>141</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.08053854,    -0.24774702,      1.356256...</td>\n",
       "      <td>0</td>\n",
       "      <td>O</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[    0.9992201884500852,  0.000299488669725774...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351004</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>9244</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.68785733,    -0.30290136,     0.8842703...</td>\n",
       "      <td>0</td>\n",
       "      <td>O</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[    0.9987328699744995,  6.450692993212812e-0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351005</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>9272</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[    -0.2963217,    -0.23313195,      0.939882...</td>\n",
       "      <td>0</td>\n",
       "      <td>O</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[    0.9999286498617879, 1.1408958794709684e-0...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        fold  doc_num  token_id  input_id  token_type_id  attention_mask  \\\n",
       "351001  test        0         0       101              0               1   \n",
       "351002  test        0         1       118              0               1   \n",
       "351003  test        0         2       141              0               1   \n",
       "351004  test        0         3      9244              0               1   \n",
       "351005  test        0         4      9272              0               1   \n",
       "\n",
       "        special_tokens_mask ent_iob ent_type token_class  token_class_id  \\\n",
       "351001                 True       O     <NA>           O               0   \n",
       "351002                False       O     <NA>           O               0   \n",
       "351003                False       O     <NA>           O               0   \n",
       "351004                False       O     <NA>           O               0   \n",
       "351005                False       O     <NA>           O               0   \n",
       "\n",
       "                                                embedding  predicted_id  \\\n",
       "351001  [   -0.19626567,    -0.45093697,       0.67753...             0   \n",
       "351002  [    -0.3187216,     -0.5074786,      1.046451...             0   \n",
       "351003  [   -0.08053854,    -0.24774702,      1.356256...             0   \n",
       "351004  [   -0.68785733,    -0.30290136,     0.8842703...             0   \n",
       "351005  [    -0.2963217,    -0.23313195,      0.939882...             0   \n",
       "\n",
       "       predicted_class predicted_iob predicted_type  \\\n",
       "351001               O             O           None   \n",
       "351002               O             O           None   \n",
       "351003               O             O           None   \n",
       "351004               O             O           None   \n",
       "351005               O             O           None   \n",
       "\n",
       "                                       predicted_class_pr  \n",
       "351001  [    0.9996154604786717,  1.744378080606689e-0...  \n",
       "351002  [    0.9992898679359635,  4.547117251747756e-0...  \n",
       "351003  [    0.9992201884500852,  0.000299488669725774...  \n",
       "351004  [    0.9987328699744995,  6.450692993212812e-0...  \n",
       "351005  [    0.9999286498617879, 1.1408958794709684e-0...  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Make predictions on the test set.\n",
    "test_results_df = predict_on_df(corpus_df[corpus_df[\"fold\"] == \"test\"], int_to_label, base_model)\n",
    "test_results_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fold</th>\n",
       "      <th>doc_num</th>\n",
       "      <th>token_id</th>\n",
       "      <th>input_id</th>\n",
       "      <th>token_type_id</th>\n",
       "      <th>attention_mask</th>\n",
       "      <th>special_tokens_mask</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>token_class</th>\n",
       "      <th>token_class_id</th>\n",
       "      <th>embedding</th>\n",
       "      <th>predicted_id</th>\n",
       "      <th>predicted_class</th>\n",
       "      <th>predicted_iob</th>\n",
       "      <th>predicted_type</th>\n",
       "      <th>predicted_class_pr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>351041</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>3309</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I-PER</td>\n",
       "      <td>8</td>\n",
       "      <td>[   -0.21029335,     -0.8535667,  0.0002728667...</td>\n",
       "      <td>6</td>\n",
       "      <td>I-MISC</td>\n",
       "      <td>I</td>\n",
       "      <td>MISC</td>\n",
       "      <td>[ 0.0007534354222208874, 1.6836613367126546e-0...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351042</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>41</td>\n",
       "      <td>1306</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I-PER</td>\n",
       "      <td>8</td>\n",
       "      <td>[   -0.23205441,     -0.9290749,      0.388911...</td>\n",
       "      <td>6</td>\n",
       "      <td>I-MISC</td>\n",
       "      <td>I</td>\n",
       "      <td>MISC</td>\n",
       "      <td>[  0.009452956025954758,   0.00547183453730283...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351043</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>42</td>\n",
       "      <td>2001</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I-PER</td>\n",
       "      <td>8</td>\n",
       "      <td>[    0.36844233,    -0.68090975,   -0.10591102...</td>\n",
       "      <td>5</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[  0.006176520691561816,    0.1195221604259170...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351044</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>43</td>\n",
       "      <td>1181</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I-PER</td>\n",
       "      <td>8</td>\n",
       "      <td>[    -0.3013107,     -0.6545994,    -0.1726927...</td>\n",
       "      <td>8</td>\n",
       "      <td>I-PER</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>[  0.013546332504567974,  0.000950956053344989...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351045</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>44</td>\n",
       "      <td>2293</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I-PER</td>\n",
       "      <td>8</td>\n",
       "      <td>[   -0.16116214,    -0.69890887,     0.2342461...</td>\n",
       "      <td>5</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[  0.011681190424447915,   0.01903022200617198...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351046</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>45</td>\n",
       "      <td>18589</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>B-LOC</td>\n",
       "      <td>1</td>\n",
       "      <td>[  -0.058566056,    -0.79558563,      0.336061...</td>\n",
       "      <td>5</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[   0.03106093699696433,    0.4497557937871749...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351047</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>46</td>\n",
       "      <td>118</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>5</td>\n",
       "      <td>[    0.20376033,     -0.7373088,    -0.0888546...</td>\n",
       "      <td>5</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[    0.3021401447717397,   0.00750658170497511...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351048</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>47</td>\n",
       "      <td>19016</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>5</td>\n",
       "      <td>[   -0.10341236,    -0.33681706,     0.1738456...</td>\n",
       "      <td>5</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[   0.06367579347291549,     0.421286404138072...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351049</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>48</td>\n",
       "      <td>2249</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>5</td>\n",
       "      <td>[   -0.40542558,    -0.65165085,     0.2469604...</td>\n",
       "      <td>5</td>\n",
       "      <td>I-LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>[ 0.0009965409993564457,  0.002211756265164430...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351050</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>49</td>\n",
       "      <td>117</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>0</td>\n",
       "      <td>[   -0.16829309,     -0.6475864,      0.814903...</td>\n",
       "      <td>0</td>\n",
       "      <td>O</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "      <td>[    0.9999666135640827,  6.878966783337114e-0...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        fold  doc_num  token_id  input_id  token_type_id  attention_mask  \\\n",
       "351041  test        0        40      3309              0               1   \n",
       "351042  test        0        41      1306              0               1   \n",
       "351043  test        0        42      2001              0               1   \n",
       "351044  test        0        43      1181              0               1   \n",
       "351045  test        0        44      2293              0               1   \n",
       "351046  test        0        45     18589              0               1   \n",
       "351047  test        0        46       118              0               1   \n",
       "351048  test        0        47     19016              0               1   \n",
       "351049  test        0        48      2249              0               1   \n",
       "351050  test        0        49       117              0               1   \n",
       "\n",
       "        special_tokens_mask ent_iob ent_type token_class  token_class_id  \\\n",
       "351041                False       I      PER       I-PER               8   \n",
       "351042                False       I      PER       I-PER               8   \n",
       "351043                False       I      PER       I-PER               8   \n",
       "351044                False       I      PER       I-PER               8   \n",
       "351045                False       I      PER       I-PER               8   \n",
       "351046                False       B      LOC       B-LOC               1   \n",
       "351047                False       I      LOC       I-LOC               5   \n",
       "351048                False       I      LOC       I-LOC               5   \n",
       "351049                False       I      LOC       I-LOC               5   \n",
       "351050                False       O     <NA>           O               0   \n",
       "\n",
       "                                                embedding  predicted_id  \\\n",
       "351041  [   -0.21029335,     -0.8535667,  0.0002728667...             6   \n",
       "351042  [   -0.23205441,     -0.9290749,      0.388911...             6   \n",
       "351043  [    0.36844233,    -0.68090975,   -0.10591102...             5   \n",
       "351044  [    -0.3013107,     -0.6545994,    -0.1726927...             8   \n",
       "351045  [   -0.16116214,    -0.69890887,     0.2342461...             5   \n",
       "351046  [  -0.058566056,    -0.79558563,      0.336061...             5   \n",
       "351047  [    0.20376033,     -0.7373088,    -0.0888546...             5   \n",
       "351048  [   -0.10341236,    -0.33681706,     0.1738456...             5   \n",
       "351049  [   -0.40542558,    -0.65165085,     0.2469604...             5   \n",
       "351050  [   -0.16829309,     -0.6475864,      0.814903...             0   \n",
       "\n",
       "       predicted_class predicted_iob predicted_type  \\\n",
       "351041          I-MISC             I           MISC   \n",
       "351042          I-MISC             I           MISC   \n",
       "351043           I-LOC             I            LOC   \n",
       "351044           I-PER             I            PER   \n",
       "351045           I-LOC             I            LOC   \n",
       "351046           I-LOC             I            LOC   \n",
       "351047           I-LOC             I            LOC   \n",
       "351048           I-LOC             I            LOC   \n",
       "351049           I-LOC             I            LOC   \n",
       "351050               O             O           None   \n",
       "\n",
       "                                       predicted_class_pr  \n",
       "351041  [ 0.0007534354222208874, 1.6836613367126546e-0...  \n",
       "351042  [  0.009452956025954758,   0.00547183453730283...  \n",
       "351043  [  0.006176520691561816,    0.1195221604259170...  \n",
       "351044  [  0.013546332504567974,  0.000950956053344989...  \n",
       "351045  [  0.011681190424447915,   0.01903022200617198...  \n",
       "351046  [   0.03106093699696433,    0.4497557937871749...  \n",
       "351047  [    0.3021401447717397,   0.00750658170497511...  \n",
       "351048  [   0.06367579347291549,     0.421286404138072...  \n",
       "351049  [ 0.0009965409993564457,  0.002211756265164430...  \n",
       "351050  [    0.9999666135640827,  6.878966783337114e-0...  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Take a slice to show a region with more entities.\n",
    "test_results_df.iloc[40:50]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute Precision and Recall\n",
    "\n",
    "With our model predictions on the test set, we can now compute precision and recall. To do this, we will use the following steps:\n",
    "\n",
    "1. Split up test set predictions by document, so we can work on the document level.\n",
    "1. Join the test predictions with token information into one DataFrame per document.\n",
    "1. Convert each DataFrame from IOB2 format to (span, entity type) pairs, as done before.\n",
    "1. Compute accuracy for each document as a DataFrame.\n",
    "1. Aggregate per-document accuracy to get overall precision and recall."
   ]
  },
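  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last two steps above reduce to counting. The following cell is a minimal sketch of that arithmetic, not the implementation inside `tp.io.conll`. It assumes each entity can be reduced to a hypothetical `(begin, end, ent_type)` tuple and that only exact matches count as true positives. The helper names `doc_counts` and `global_scores` are made up for this illustration.\n",
    "\n",
    "Summing the counts before dividing weights every entity equally, so long documents influence the overall score more than short ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch only; not the tp.io.conll implementation.\n",
    "def doc_counts(actual, predicted):\n",
    "    # Count exact-match true positives for a single document.\n",
    "    actual, predicted = set(actual), set(predicted)\n",
    "    return {\n",
    "        \"num_true_positives\": len(actual & predicted),\n",
    "        \"num_extracted\": len(predicted),\n",
    "        \"num_entities\": len(actual),\n",
    "    }\n",
    "\n",
    "\n",
    "def global_scores(per_doc):\n",
    "    # Sum the per-document counts, then compute precision, recall and F1.\n",
    "    num_tp = sum(d[\"num_true_positives\"] for d in per_doc)\n",
    "    extracted = sum(d[\"num_extracted\"] for d in per_doc)\n",
    "    entities = sum(d[\"num_entities\"] for d in per_doc)\n",
    "    precision = num_tp / extracted if extracted else 0.0\n",
    "    recall = num_tp / entities if entities else 0.0\n",
    "    total = precision + recall\n",
    "    f1 = 2 * precision * recall / total if total else 0.0\n",
    "    return {\"precision\": precision, \"recall\": recall, \"F1\": f1}\n",
    "\n",
    "\n",
    "# Two toy documents with made-up spans.\n",
    "doc_a = doc_counts(\n",
    "    actual=[(0, 5, \"LOC\"), (10, 15, \"PER\")],\n",
    "    predicted=[(0, 5, \"LOC\"), (10, 15, \"ORG\")],\n",
    ")\n",
    "doc_b = doc_counts(\n",
    "    actual=[(3, 9, \"ORG\")],\n",
    "    predicted=[(3, 9, \"ORG\")],\n",
    ")\n",
    "global_scores([doc_a, doc_b])"
   ]
  },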
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token_id</th>\n",
       "      <th>span</th>\n",
       "      <th>ent_iob</th>\n",
       "      <th>ent_type</th>\n",
       "      <th>predicted_iob</th>\n",
       "      <th>predicted_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>40</td>\n",
       "      <td>[68, 70): 'di'</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <td>41</td>\n",
       "      <td>[70, 71): 'm'</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>42</td>\n",
       "      <td>[72, 74): 'La'</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>43</td>\n",
       "      <td>[74, 75): 'd'</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>44</td>\n",
       "      <td>[75, 77): 'ki'</td>\n",
       "      <td>I</td>\n",
       "      <td>PER</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>45</td>\n",
       "      <td>[78, 80): 'AL'</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>46</td>\n",
       "      <td>[80, 81): '-'</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>47</td>\n",
       "      <td>[81, 83): 'AI'</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>48</td>\n",
       "      <td>[83, 84): 'N'</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>49</td>\n",
       "      <td>[84, 85): ','</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50</th>\n",
       "      <td>50</td>\n",
       "      <td>[86, 92): 'United'</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>51</th>\n",
       "      <td>51</td>\n",
       "      <td>[93, 97): 'Arab'</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52</th>\n",
       "      <td>52</td>\n",
       "      <td>[98, 106): 'Emirates'</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "      <td>I</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>53</td>\n",
       "      <td>[107, 111): '1996'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>54</td>\n",
       "      <td>[111, 112): '-'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>55</th>\n",
       "      <td>55</td>\n",
       "      <td>[112, 114): '12'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56</th>\n",
       "      <td>56</td>\n",
       "      <td>[114, 115): '-'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>57</th>\n",
       "      <td>57</td>\n",
       "      <td>[115, 117): '06'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58</th>\n",
       "      <td>58</td>\n",
       "      <td>[118, 123): 'Japan'</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "      <td>B</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59</th>\n",
       "      <td>59</td>\n",
       "      <td>[124, 129): 'began'</td>\n",
       "      <td>O</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>O</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    token_id                   span ent_iob ent_type predicted_iob  \\\n",
       "40        40         [68, 70): 'di'       I      PER             I   \n",
       "41        41          [70, 71): 'm'       I      PER             I   \n",
       "42        42         [72, 74): 'La'       I      PER             I   \n",
       "43        43          [74, 75): 'd'       I      PER             I   \n",
       "44        44         [75, 77): 'ki'       I      PER             I   \n",
       "45        45         [78, 80): 'AL'       B      LOC             I   \n",
       "46        46          [80, 81): '-'       I      LOC             I   \n",
       "47        47         [81, 83): 'AI'       I      LOC             I   \n",
       "48        48          [83, 84): 'N'       I      LOC             I   \n",
       "49        49          [84, 85): ','       O     <NA>             O   \n",
       "50        50     [86, 92): 'United'       B      LOC             B   \n",
       "51        51       [93, 97): 'Arab'       I      LOC             I   \n",
       "52        52  [98, 106): 'Emirates'       I      LOC             I   \n",
       "53        53     [107, 111): '1996'       O     <NA>             O   \n",
       "54        54        [111, 112): '-'       O     <NA>             O   \n",
       "55        55       [112, 114): '12'       O     <NA>             O   \n",
       "56        56        [114, 115): '-'       O     <NA>             O   \n",
       "57        57       [115, 117): '06'       O     <NA>             O   \n",
       "58        58    [118, 123): 'Japan'       B      LOC             B   \n",
       "59        59    [124, 129): 'began'       O     <NA>             O   \n",
       "\n",
       "   predicted_type  \n",
       "40           MISC  \n",
       "41           MISC  \n",
       "42            LOC  \n",
       "43            PER  \n",
       "44            LOC  \n",
       "45            LOC  \n",
       "46            LOC  \n",
       "47            LOC  \n",
       "48            LOC  \n",
       "49           None  \n",
       "50            LOC  \n",
       "51            LOC  \n",
       "52            LOC  \n",
       "53           None  \n",
       "54           None  \n",
       "55           None  \n",
       "56           None  \n",
       "57           None  \n",
       "58            LOC  \n",
       "59           None  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Split model outputs for an entire fold back into documents and add\n",
    "# token information.\n",
    "\n",
    "# Get unique documents per fold.\n",
    "fold_and_doc = test_results_df[[\"fold\", \"doc_num\"]] \\\n",
    "        .drop_duplicates() \\\n",
    "        .to_records(index=False)\n",
    "\n",
    "# Index by fold, doc and token id, then make sure sorted.\n",
    "indexed_df = test_results_df \\\n",
    "        .set_index([\"fold\", \"doc_num\", \"token_id\"], verify_integrity=True) \\\n",
    "        .sort_index()\n",
    "\n",
    "# Join predictions with token information, for each document.\n",
    "test_results_by_doc = {}\n",
    "for collection, doc_num in fold_and_doc:\n",
    "    doc_slice = indexed_df.loc[collection, doc_num].reset_index()\n",
    "    doc_toks = bert_toks_by_fold[collection][doc_num][\n",
    "        [\"token_id\", \"span\", \"ent_iob\", \"ent_type\"]\n",
    "    ]\n",
    "    joined_df = doc_toks.copy().merge(\n",
    "        doc_slice[[\"token_id\", \"predicted_iob\", \"predicted_type\"]])\n",
    "    test_results_by_doc[(collection, doc_num)] = joined_df\n",
    "    \n",
    "# Test results are now in one DataFrame per document.\n",
    "test_results_by_doc[(\"test\", 0)].iloc[40:60]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[19, 24): 'JAPAN'</td>\n",
       "      <td>PER</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[29, 34): 'LUCKY'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[40, 45): 'CHINA'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[66, 84): 'Nadim Ladki AL-AIN'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[86, 106): 'United Arab Emirates'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                span ent_type\n",
       "0                  [19, 24): 'JAPAN'      PER\n",
       "1                  [29, 34): 'LUCKY'      LOC\n",
       "2                  [40, 45): 'CHINA'      ORG\n",
       "3     [66, 84): 'Nadim Ladki AL-AIN'      LOC\n",
       "4  [86, 106): 'United Arab Emirates'      LOC"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Convert IOB2 format to spans, entity type with `tp.io.conll.iob_to_spans()`.\n",
    "test_actual_spans = {k: tp.io.conll.iob_to_spans(v) for k, v in test_results_by_doc.items()}\n",
    "test_model_spans = {k:\n",
    "        tp.io.conll.iob_to_spans(v, iob_col_name = \"predicted_iob\",\n",
    "                                 entity_type_col_name = \"predicted_type\")\n",
    "            .rename(columns={\"predicted_type\": \"ent_type\"})\n",
    "        for k, v in test_results_by_doc.items()}\n",
    "\n",
    "test_model_spans[(\"test\", 0)].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fold</th>\n",
       "      <th>doc_num</th>\n",
       "      <th>num_true_positives</th>\n",
       "      <th>num_extracted</th>\n",
       "      <th>num_entities</th>\n",
       "      <th>precision</th>\n",
       "      <th>recall</th>\n",
       "      <th>F1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>46</td>\n",
       "      <td>45</td>\n",
       "      <td>0.869565</td>\n",
       "      <td>0.888889</td>\n",
       "      <td>0.879121</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>test</td>\n",
       "      <td>1</td>\n",
       "      <td>41</td>\n",
       "      <td>42</td>\n",
       "      <td>44</td>\n",
       "      <td>0.976190</td>\n",
       "      <td>0.931818</td>\n",
       "      <td>0.953488</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>test</td>\n",
       "      <td>2</td>\n",
       "      <td>52</td>\n",
       "      <td>54</td>\n",
       "      <td>54</td>\n",
       "      <td>0.962963</td>\n",
       "      <td>0.962963</td>\n",
       "      <td>0.962963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>test</td>\n",
       "      <td>3</td>\n",
       "      <td>42</td>\n",
       "      <td>44</td>\n",
       "      <td>44</td>\n",
       "      <td>0.954545</td>\n",
       "      <td>0.954545</td>\n",
       "      <td>0.954545</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>test</td>\n",
       "      <td>4</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "      <td>19</td>\n",
       "      <td>0.947368</td>\n",
       "      <td>0.947368</td>\n",
       "      <td>0.947368</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>test</td>\n",
       "      <td>226</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>0.857143</td>\n",
       "      <td>0.857143</td>\n",
       "      <td>0.857143</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227</th>\n",
       "      <td>test</td>\n",
       "      <td>227</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "      <td>21</td>\n",
       "      <td>0.947368</td>\n",
       "      <td>0.857143</td>\n",
       "      <td>0.900000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>228</th>\n",
       "      <td>test</td>\n",
       "      <td>228</td>\n",
       "      <td>24</td>\n",
       "      <td>27</td>\n",
       "      <td>27</td>\n",
       "      <td>0.888889</td>\n",
       "      <td>0.888889</td>\n",
       "      <td>0.888889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>229</th>\n",
       "      <td>test</td>\n",
       "      <td>229</td>\n",
       "      <td>25</td>\n",
       "      <td>27</td>\n",
       "      <td>27</td>\n",
       "      <td>0.925926</td>\n",
       "      <td>0.925926</td>\n",
       "      <td>0.925926</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>230</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>25</td>\n",
       "      <td>27</td>\n",
       "      <td>28</td>\n",
       "      <td>0.925926</td>\n",
       "      <td>0.892857</td>\n",
       "      <td>0.909091</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>231 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     fold  doc_num  num_true_positives  num_extracted  num_entities  \\\n",
       "0    test        0                  40             46            45   \n",
       "1    test        1                  41             42            44   \n",
       "2    test        2                  52             54            54   \n",
       "3    test        3                  42             44            44   \n",
       "4    test        4                  18             19            19   \n",
       "..    ...      ...                 ...            ...           ...   \n",
       "226  test      226                   6              7             7   \n",
       "227  test      227                  18             19            21   \n",
       "228  test      228                  24             27            27   \n",
       "229  test      229                  25             27            27   \n",
       "230  test      230                  25             27            28   \n",
       "\n",
       "     precision    recall        F1  \n",
       "0     0.869565  0.888889  0.879121  \n",
       "1     0.976190  0.931818  0.953488  \n",
       "2     0.962963  0.962963  0.962963  \n",
       "3     0.954545  0.954545  0.954545  \n",
       "4     0.947368  0.947368  0.947368  \n",
       "..         ...       ...       ...  \n",
       "226   0.857143  0.857143  0.857143  \n",
       "227   0.947368  0.857143  0.900000  \n",
       "228   0.888889  0.888889  0.888889  \n",
       "229   0.925926  0.925926  0.925926  \n",
       "230   0.925926  0.892857  0.909091  \n",
       "\n",
       "[231 rows x 8 columns]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Compute per-document statistics into a single DataFrame.\n",
    "test_stats_by_doc = tp.io.conll.compute_accuracy_by_document(test_actual_spans, test_model_spans)\n",
    "test_stats_by_doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_true_positives': 4879,\n",
       " 'num_entities': 5648,\n",
       " 'num_extracted': 5621,\n",
       " 'precision': 0.8679950186799502,\n",
       " 'recall': 0.8638456090651558,\n",
       " 'F1': 0.8659153429763067}"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Collection-wide precision and recall can be computed by aggregating\n",
    "# our DataFrame.\n",
    "tp.io.conll.compute_global_accuracy(test_stats_by_doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adjusting the BERT Model Output\n",
    "\n",
    "The above results aren't bad for a first shot, but a look at some of the predictions shows that tokens from the original corpus are sometimes split across multiple entities. This is because the BERT tokenizer uses WordPiece to produce subword tokens; see https://huggingface.co/transformers/tokenizer_summary.html and https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf for more information.\n",
    "\n",
    "This causes a problem when computing precision/recall because we are comparing exact spans: if an entity is split, it will be counted as a false negative _and_ possibly one or more false positives. Luckily, we can fix this up with Text Extensions for Pandas.\n",
    "\n",
    "Let's drill down to see an example of the issue and how to correct it."
   ]
  },
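  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the cost of a split concrete, the cell below scores a single toy entity with plain Python sets. It is only a sketch: the `(begin, end, ent_type)` tuples are hypothetical values chosen for illustration, and scoring is assumed to require an exact match on both span and type. Under those assumptions, one split entity yields no true positives, one false negative and two false positives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy example; the tuples are illustrative, not read from the corpus.\n",
    "gold = {(150, 156, \"ORG\")}\n",
    "split_prediction = {(150, 151, \"ORG\"), (151, 156, \"ORG\")}\n",
    "\n",
    "# Exact-match scoring: a predicted span only counts if it equals a gold span.\n",
    "true_positives = gold & split_prediction\n",
    "false_negatives = gold - split_prediction\n",
    "false_positives = split_prediction - gold\n",
    "\n",
    "print(len(true_positives), len(false_negatives), len(false_positives))"
   ]
  },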
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[11, 22): 'RUGBY UNION'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[24, 31): 'BRITISH'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[41, 47): 'LONDON'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[70, 77): 'British'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[111, 125): 'Pilkington Cup'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>[139, 146): 'Reading'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>[150, 151): 'W'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>[151, 156): 'idnes'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>[159, 166): 'English'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>[180, 184): 'Bath'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           span ent_type\n",
       "0       [11, 22): 'RUGBY UNION'      ORG\n",
       "1           [24, 31): 'BRITISH'     MISC\n",
       "2            [41, 47): 'LONDON'      LOC\n",
       "3           [70, 77): 'British'     MISC\n",
       "4  [111, 125): 'Pilkington Cup'     MISC\n",
       "5         [139, 146): 'Reading'      ORG\n",
       "6               [150, 151): 'W'      ORG\n",
       "7           [151, 156): 'idnes'      ORG\n",
       "8         [159, 166): 'English'     MISC\n",
       "9            [180, 184): 'Bath'      ORG"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Every once in a while, the BERT model will split a token in the original data\n",
    "# set into multiple entities. For example, look at document 202 of the test set:\n",
    "test_model_spans[(\"test\", 202)].head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice `[150, 151): 'W'` and `[151, 156): 'idnes'`. These outputs are part\n",
    "of the same original token, but have been split by the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>corpus_token</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[11, 22): 'RUGBY UNION'</td>\n",
       "      <td>[11, 16): 'RUGBY'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[11, 22): 'RUGBY UNION'</td>\n",
       "      <td>[17, 22): 'UNION'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[24, 31): 'BRITISH'</td>\n",
       "      <td>[24, 31): 'BRITISH'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[41, 47): 'LONDON'</td>\n",
       "      <td>[41, 47): 'LONDON'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[70, 77): 'British'</td>\n",
       "      <td>[70, 77): 'British'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>[111, 125): 'Pilkington Cup'</td>\n",
       "      <td>[111, 121): 'Pilkington'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>[111, 125): 'Pilkington Cup'</td>\n",
       "      <td>[122, 125): 'Cup'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>[139, 146): 'Reading'</td>\n",
       "      <td>[139, 146): 'Reading'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>[150, 151): 'W'</td>\n",
       "      <td>[150, 156): 'Widnes'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>[151, 156): 'idnes'</td>\n",
       "      <td>[150, 156): 'Widnes'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           span              corpus_token ent_type\n",
       "0       [11, 22): 'RUGBY UNION'         [11, 16): 'RUGBY'      ORG\n",
       "1       [11, 22): 'RUGBY UNION'         [17, 22): 'UNION'      ORG\n",
       "2           [24, 31): 'BRITISH'       [24, 31): 'BRITISH'     MISC\n",
       "3            [41, 47): 'LONDON'        [41, 47): 'LONDON'      LOC\n",
       "4           [70, 77): 'British'       [70, 77): 'British'     MISC\n",
       "5  [111, 125): 'Pilkington Cup'  [111, 121): 'Pilkington'     MISC\n",
       "6  [111, 125): 'Pilkington Cup'         [122, 125): 'Cup'     MISC\n",
       "7         [139, 146): 'Reading'     [139, 146): 'Reading'      ORG\n",
       "8               [150, 151): 'W'      [150, 156): 'Widnes'      ORG\n",
       "9           [151, 156): 'idnes'      [150, 156): 'Widnes'      ORG"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We can use spanner algebra in `tp.spanner.overlap_join()`\n",
    "# to fix up these outputs.\n",
    "spans_df = test_model_spans[(\"test\", 202)]\n",
    "toks_df = test_raw[202]\n",
    "\n",
    "# First, find which tokens the spans overlap with:\n",
    "overlaps_df = (\n",
    "    tp.spanner.overlap_join(spans_df[\"span\"], toks_df[\"span\"],\n",
    "                            \"span\", \"corpus_token\")\n",
    "        .merge(spans_df)\n",
    ")\n",
    "overlaps_df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>corpus_token</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[11, 22): 'RUGBY UNION'</td>\n",
       "      <td>[11, 22): 'RUGBY UNION'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[24, 31): 'BRITISH'</td>\n",
       "      <td>[24, 31): 'BRITISH'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[41, 47): 'LONDON'</td>\n",
       "      <td>[41, 47): 'LONDON'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[70, 77): 'British'</td>\n",
       "      <td>[70, 77): 'British'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[111, 125): 'Pilkington Cup'</td>\n",
       "      <td>[111, 125): 'Pilkington Cup'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>[139, 146): 'Reading'</td>\n",
       "      <td>[139, 146): 'Reading'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>[150, 151): 'W'</td>\n",
       "      <td>[150, 156): 'Widnes'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>[151, 156): 'idnes'</td>\n",
       "      <td>[150, 156): 'Widnes'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>[159, 166): 'English'</td>\n",
       "      <td>[159, 166): 'English'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>[180, 184): 'Bath'</td>\n",
       "      <td>[180, 184): 'Bath'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           span                  corpus_token ent_type\n",
       "0       [11, 22): 'RUGBY UNION'       [11, 22): 'RUGBY UNION'      ORG\n",
       "1           [24, 31): 'BRITISH'           [24, 31): 'BRITISH'     MISC\n",
       "2            [41, 47): 'LONDON'            [41, 47): 'LONDON'      LOC\n",
       "3           [70, 77): 'British'           [70, 77): 'British'     MISC\n",
       "4  [111, 125): 'Pilkington Cup'  [111, 125): 'Pilkington Cup'     MISC\n",
       "5         [139, 146): 'Reading'         [139, 146): 'Reading'      ORG\n",
       "6               [150, 151): 'W'          [150, 156): 'Widnes'      ORG\n",
       "7           [151, 156): 'idnes'          [150, 156): 'Widnes'      ORG\n",
       "8         [159, 166): 'English'         [159, 166): 'English'     MISC\n",
       "9            [180, 184): 'Bath'            [180, 184): 'Bath'      ORG"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Next, compute the minimum span that covers all the corpus tokens\n",
    "# that overlap with each entity span.\n",
    "agg_df = (\n",
    "    overlaps_df\n",
    "    .groupby(\"span\")\n",
    "    .aggregate({\"corpus_token\": \"sum\", \"ent_type\": \"first\"})\n",
    "    .reset_index()\n",
    ")\n",
    "agg_df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[11, 22): 'RUGBY UNION'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[24, 31): 'BRITISH'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[41, 47): 'LONDON'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[70, 77): 'British'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[111, 125): 'Pilkington Cup'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>[139, 146): 'Reading'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>[150, 156): 'Widnes'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>[159, 166): 'English'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>[180, 184): 'Bath'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>[188, 198): 'Harlequins'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            span ent_type\n",
       "0        [11, 22): 'RUGBY UNION'      ORG\n",
       "1            [24, 31): 'BRITISH'     MISC\n",
       "2             [41, 47): 'LONDON'      LOC\n",
       "3            [70, 77): 'British'     MISC\n",
       "4   [111, 125): 'Pilkington Cup'     MISC\n",
       "5          [139, 146): 'Reading'      ORG\n",
       "6           [150, 156): 'Widnes'      ORG\n",
       "8          [159, 166): 'English'     MISC\n",
       "9             [180, 184): 'Bath'      ORG\n",
       "10      [188, 198): 'Harlequins'      ORG"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Finally, take unique values and covert character-based spans to token\n",
    "# spans in the corpus tokenization (since the new offsets might not match a\n",
    "# BERT tokenizer token boundary).\n",
    "cons_df = (\n",
    "    tp.spanner.consolidate(agg_df, \"corpus_token\")[[\"corpus_token\", \"ent_type\"]]\n",
    "        .rename(columns={\"corpus_token\": \"span\"})\n",
    ")\n",
    "cons_df[\"span\"] = tp.TokenSpanArray.align_to_tokens(toks_df[\"span\"],\n",
    "                                                    cons_df[\"span\"])\n",
    "cons_df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[11, 22): 'RUGBY UNION'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[24, 31): 'BRITISH'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[41, 47): 'LONDON'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[70, 77): 'British'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[111, 125): 'Pilkington Cup'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>[139, 146): 'Reading'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>[150, 156): 'Widnes'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>[159, 166): 'English'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>[180, 184): 'Bath'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>[188, 198): 'Harlequins'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            span ent_type\n",
       "0        [11, 22): 'RUGBY UNION'      ORG\n",
       "1            [24, 31): 'BRITISH'     MISC\n",
       "2             [41, 47): 'LONDON'      LOC\n",
       "3            [70, 77): 'British'     MISC\n",
       "4   [111, 125): 'Pilkington Cup'     MISC\n",
       "5          [139, 146): 'Reading'      ORG\n",
       "6           [150, 156): 'Widnes'      ORG\n",
       "8          [159, 166): 'English'     MISC\n",
       "9             [180, 184): 'Bath'      ORG\n",
       "10      [188, 198): 'Harlequins'      ORG"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Text Extensions for Pandas contains a single function that repeats the actions of the \n",
    "# previous 3 cells.\n",
    "tp.io.bert.align_bert_tokens_to_corpus_tokens(test_model_spans[(\"test\", 202)], test_raw[202]).head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "76959b8483d74269ac98d37bf7f3e072",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "IntProgress(value=0, description='Starting...', layout=Layout(width='100%'), max=231, style=ProgressStyle(desc…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>span</th>\n",
       "      <th>ent_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[11, 22): 'RUGBY UNION'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[24, 31): 'BRITISH'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[41, 47): 'LONDON'</td>\n",
       "      <td>LOC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[70, 77): 'British'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[111, 125): 'Pilkington Cup'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>[139, 146): 'Reading'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>[150, 156): 'Widnes'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>[159, 166): 'English'</td>\n",
       "      <td>MISC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>[180, 184): 'Bath'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>[188, 198): 'Harlequins'</td>\n",
       "      <td>ORG</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            span ent_type\n",
       "0        [11, 22): 'RUGBY UNION'      ORG\n",
       "1            [24, 31): 'BRITISH'     MISC\n",
       "2             [41, 47): 'LONDON'      LOC\n",
       "3            [70, 77): 'British'     MISC\n",
       "4   [111, 125): 'Pilkington Cup'     MISC\n",
       "5          [139, 146): 'Reading'      ORG\n",
       "6           [150, 156): 'Widnes'      ORG\n",
       "8          [159, 166): 'English'     MISC\n",
       "9             [180, 184): 'Bath'      ORG\n",
       "10      [188, 198): 'Harlequins'      ORG"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Run all of our DataFrames through `align_bert_tokens_to_corpus_tokens()`.\n",
    "keys = list(test_model_spans.keys())\n",
    "new_values = tp.jupyter.run_with_progress_bar(\n",
    "    len(keys), \n",
    "    lambda i: tp.io.bert.align_bert_tokens_to_corpus_tokens(test_model_spans[keys[i]], test_raw[keys[i][1]]))\n",
    "test_model_spans = {k: v for k, v in zip(keys, new_values)}\n",
    "test_model_spans[(\"test\", 202)].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>fold</th>\n",
       "      <th>doc_num</th>\n",
       "      <th>num_true_positives</th>\n",
       "      <th>num_extracted</th>\n",
       "      <th>num_entities</th>\n",
       "      <th>precision</th>\n",
       "      <th>recall</th>\n",
       "      <th>F1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>test</td>\n",
       "      <td>0</td>\n",
       "      <td>41</td>\n",
       "      <td>46</td>\n",
       "      <td>45</td>\n",
       "      <td>0.891304</td>\n",
       "      <td>0.911111</td>\n",
       "      <td>0.901099</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>test</td>\n",
       "      <td>1</td>\n",
       "      <td>41</td>\n",
       "      <td>42</td>\n",
       "      <td>44</td>\n",
       "      <td>0.976190</td>\n",
       "      <td>0.931818</td>\n",
       "      <td>0.953488</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>test</td>\n",
       "      <td>2</td>\n",
       "      <td>52</td>\n",
       "      <td>54</td>\n",
       "      <td>54</td>\n",
       "      <td>0.962963</td>\n",
       "      <td>0.962963</td>\n",
       "      <td>0.962963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>test</td>\n",
       "      <td>3</td>\n",
       "      <td>42</td>\n",
       "      <td>44</td>\n",
       "      <td>44</td>\n",
       "      <td>0.954545</td>\n",
       "      <td>0.954545</td>\n",
       "      <td>0.954545</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>test</td>\n",
       "      <td>4</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "      <td>19</td>\n",
       "      <td>0.947368</td>\n",
       "      <td>0.947368</td>\n",
       "      <td>0.947368</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>test</td>\n",
       "      <td>226</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227</th>\n",
       "      <td>test</td>\n",
       "      <td>227</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "      <td>21</td>\n",
       "      <td>0.947368</td>\n",
       "      <td>0.857143</td>\n",
       "      <td>0.900000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>228</th>\n",
       "      <td>test</td>\n",
       "      <td>228</td>\n",
       "      <td>24</td>\n",
       "      <td>27</td>\n",
       "      <td>27</td>\n",
       "      <td>0.888889</td>\n",
       "      <td>0.888889</td>\n",
       "      <td>0.888889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>229</th>\n",
       "      <td>test</td>\n",
       "      <td>229</td>\n",
       "      <td>26</td>\n",
       "      <td>27</td>\n",
       "      <td>27</td>\n",
       "      <td>0.962963</td>\n",
       "      <td>0.962963</td>\n",
       "      <td>0.962963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>230</th>\n",
       "      <td>test</td>\n",
       "      <td>230</td>\n",
       "      <td>26</td>\n",
       "      <td>27</td>\n",
       "      <td>28</td>\n",
       "      <td>0.962963</td>\n",
       "      <td>0.928571</td>\n",
       "      <td>0.945455</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>231 rows × 8 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     fold  doc_num  num_true_positives  num_extracted  num_entities  \\\n",
       "0    test        0                  41             46            45   \n",
       "1    test        1                  41             42            44   \n",
       "2    test        2                  52             54            54   \n",
       "3    test        3                  42             44            44   \n",
       "4    test        4                  18             19            19   \n",
       "..    ...      ...                 ...            ...           ...   \n",
       "226  test      226                   7              7             7   \n",
       "227  test      227                  18             19            21   \n",
       "228  test      228                  24             27            27   \n",
       "229  test      229                  26             27            27   \n",
       "230  test      230                  26             27            28   \n",
       "\n",
       "     precision    recall        F1  \n",
       "0     0.891304  0.911111  0.901099  \n",
       "1     0.976190  0.931818  0.953488  \n",
       "2     0.962963  0.962963  0.962963  \n",
       "3     0.954545  0.954545  0.954545  \n",
       "4     0.947368  0.947368  0.947368  \n",
       "..         ...       ...       ...  \n",
       "226   1.000000  1.000000  1.000000  \n",
       "227   0.947368  0.857143  0.900000  \n",
       "228   0.888889  0.888889  0.888889  \n",
       "229   0.962963  0.962963  0.962963  \n",
       "230   0.962963  0.928571  0.945455  \n",
       "\n",
       "[231 rows x 8 columns]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Compute per-document statistics into a single DataFrame.\n",
    "test_stats_by_doc = tp.io.conll.compute_accuracy_by_document(test_actual_spans, test_model_spans)\n",
    "test_stats_by_doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_true_positives': 4974,\n",
       " 'num_entities': 5648,\n",
       " 'num_extracted': 5588,\n",
       " 'precision': 0.8901216893342878,\n",
       " 'recall': 0.8806657223796034,\n",
       " 'F1': 0.8853684585261659}"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Collection-wide precision and recall can be computed by aggregating\n",
    "# our DataFrame.\n",
    "tp.io.conll.compute_global_accuracy(test_stats_by_doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These results are a bit better than before, and while the F1 score is not high compared to todays standards, it is decent enough for a simplistic model. More importantly, we did show it was fairly easy to create a model for named entity recognition and analyze the output by leveraging the functionalitiy of Pandas DataFrames along with [Text Extensions for Pandas](https://github.com/CODAIT/text-extensions-for-pandas) `SpanArray`, `TensorArray` and integration with BERT from [Huggingface Transformers](https://huggingface.co/transformers/index.html)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}