{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "1MABwVpcVXQJ" }, "source": [ "# Byte-Pair Encoding tokenization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "BPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let’s say our corpus uses these five words:\n", "\n", "This material was adapted from the Huggingface tutorial available here:\n", "\n", "https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "corpus = [\"hug\", \"pug\", \"pun\", \"bun\", \"hugs\"]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'g', 'n', 'p', 'u', 's', 'h', 'b'}\n" ] } ], "source": [ "vocab = set([ c for w in corpus for c in w ])\n", "print(vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to merge two elements of the existing vocabulary together into a new one. So, at the beginning these merges will create tokens with two characters, and then, as training progresses, longer subwords.\n", "\n", "At any step during the tokenizer training, the BPE algorithm will search for the most frequent pair of existing tokens (by “pair,” here we mean two consecutive tokens in a word). That most frequent pair is the one that will be merged, and we rinse and repeat for the next step.\n", "\n", "Going back to our previous example, let’s assume the words had the following frequencies:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "corpus = [(\"hug\", 10), (\"pug\", 5), (\"pun\", 12), (\"bun\", 4), (\"hugs\", 5)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"hug\" was present 10 times in the corpus, \"pug\" 5 times, \"pun\" 12 times, \"bun\" 4 times, and \"hugs\" 5 times. We start the training by splitting each word into characters (the ones that form our initial vocabulary) so we can see each word as a list of tokens:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "corpus = [(\"h\" \"u\" \"g\", 10), (\"p\" \"u\" \"g\", 5), (\"p\" \"u\" \"n\", 12), (\"b\" \"u\" \"n\", 4), (\"h\" \"u\" \"g\" \"s\", 5)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we look at pairs. The pair (\"h\", \"u\") is present in the words \"hug\" and \"hugs\", so 15 times total in the corpus. It’s not the most frequent pair, though: the most frequent is (\"u\", \"g\"), which is present in \"hug\", \"pug\", and \"hugs\", for a grand total of 20 times in the vocabulary.\n", "\n", "Thus, the first merge rule learned by the tokenizer is (\"u\", \"g\") -> \"ug\", which means that \"ug\" will be added to the vocabulary, and the pair should be merged in all the words of the corpus. 
At the end of this stage, the vocabulary and corpus look like this:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "vocab = [\"b\", \"g\", \"h\", \"n\", \"p\", \"s\", \"u\", \"ug\"]\n", "corpus = [([\"h\", \"ug\"], 10), ([\"p\", \"ug\"], 5), ([\"p\", \"u\", \"n\"], 12), ([\"b\", \"u\", \"n\"], 4), ([\"h\", \"ug\", \"s\"], 5)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have some pairs that result in a token longer than two characters: the pair (\"h\", \"ug\"), for instance (present 15 times in the corpus). The most frequent pair at this stage is (\"u\", \"n\"), however, present 16 times in the corpus, so the second merge rule learned is (\"u\", \"n\") -> \"un\". Adding that to the vocabulary and merging all existing occurrences leads us to:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "vocab = [\"b\", \"g\", \"h\", \"n\", \"p\", \"s\", \"u\", \"ug\", \"un\"]\n", "corpus = [([\"h\", \"ug\"], 10), ([\"p\", \"ug\"], 5), ([\"p\", \"un\"], 12), ([\"b\", \"un\"], 4), ([\"h\", \"ug\", \"s\"], 5)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the most frequent pair is (\"h\", \"ug\"), so we learn the merge rule (\"h\", \"ug\") -> \"hug\", which gives us our first three-letter token. After the merge, the corpus looks like this:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "vocab = [\"b\", \"g\", \"h\", \"n\", \"p\", \"s\", \"u\", \"ug\", \"un\", \"hug\"]\n", "corpus = [([\"hug\"], 10), ([\"p\", \"ug\"], 5), ([\"p\", \"un\"], 12), ([\"b\", \"un\"], 4), ([\"hug\", \"s\"], 5)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we continue like this until we reach the desired vocabulary size. In practice, we specify either a target vocabulary size or the number of merges to learn, and training stops once that target is reached." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization Algorithm\n", "\n", "Tokenization follows the training process closely, in the sense that new inputs are tokenized by applying the following steps:\n", "\n", "1. Normalization\n", "1. Pre-tokenization\n", "1. Splitting the words into individual characters\n", "1. Applying the merge rules learned in order on those splits\n", "\n", "Let’s take the example we used during training, with the three merge rules learned:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "(\"u\", \"g\") -> \"ug\"\n", "(\"u\", \"n\") -> \"un\"\n", "(\"h\", \"ug\") -> \"hug\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The word \"bug\" will be tokenized as [\"b\", \"ug\"]. \"mug\", however, will be tokenized as [\"[UNK]\", \"ug\"] since the letter \"m\" was not in the base vocabulary. Likewise, the word \"thug\" will be tokenized as [\"[UNK]\", \"hug\"]: the letter \"t\" is not in the base vocabulary, and applying the merge rules results first in \"u\" and \"g\" being merged and then \"h\" and \"ug\" being merged. A short code sketch of this procedure appears below, just before the installation step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: How do you think the word \"unhug\" will be tokenized?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementing BPE for sub-word tokenization" ] }, { "cell_type": "markdown", "metadata": { "id": "iDtJ2ithVXQL" }, "source": [ "Install the Transformers, Datasets, and Evaluate libraries to run this notebook."
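, "\n", "\n", "Before running the installation, here is the promised rough sketch (not part of the original tutorial) of the toy tokenization procedure: split a word into characters, map any character outside the toy base vocabulary to \"[UNK]\", then apply the three merge rules in the order they were learned. The \"[UNK]\" handling and the `toy_tokenize` helper are assumptions made purely for illustration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rough sketch of the toy tokenization procedure (illustrative only).\n", "base_vocab = {\"b\", \"g\", \"h\", \"n\", \"p\", \"s\", \"u\"}\n", "toy_merges = [((\"u\", \"g\"), \"ug\"), ((\"u\", \"n\"), \"un\"), ((\"h\", \"ug\"), \"hug\")]\n", "\n", "def toy_tokenize(word):\n", "    # Split into characters, mapping unknown characters to \"[UNK]\".\n", "    tokens = [c if c in base_vocab else \"[UNK]\" for c in word]\n", "    # Apply each merge rule in the order it was learned.\n", "    for (a, b), merged in toy_merges:\n", "        i = 0\n", "        while i < len(tokens) - 1:\n", "            if tokens[i] == a and tokens[i + 1] == b:\n", "                tokens = tokens[:i] + [merged] + tokens[i + 2:]\n", "            else:\n", "                i += 1\n", "    return tokens\n", "\n", "print(toy_tokenize(\"bug\"))   # expected: ['b', 'ug']\n", "print(toy_tokenize(\"thug\"))  # expected: ['[UNK]', 'hug']"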
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "Ni3jt_LXVXQL" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting datasets\n", " Obtaining dependency information for datasets from https://files.pythonhosted.org/packages/09/7e/fd4d6441a541dba61d0acb3c1fd5df53214c2e9033854e837a99dd9e0793/datasets-2.14.5-py3-none-any.whl.metadata\n", " Using cached datasets-2.14.5-py3-none-any.whl.metadata (19 kB)\n", "Collecting evaluate\n", " Using cached evaluate-0.4.0-py3-none-any.whl (81 kB)\n", "Collecting transformers[sentencepiece]\n", " Obtaining dependency information for transformers[sentencepiece] from https://files.pythonhosted.org/packages/1a/d1/3bba59606141ae808017f6fde91453882f931957f125009417b87a281067/transformers-4.34.0-py3-none-any.whl.metadata\n", " Using cached transformers-4.34.0-py3-none-any.whl.metadata (121 kB)\n", "Requirement already satisfied: numpy>=1.17 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from datasets) (1.26.0)\n", "Collecting pyarrow>=8.0.0 (from datasets)\n", " Obtaining dependency information for pyarrow>=8.0.0 from https://files.pythonhosted.org/packages/77/0d/3a698f5fee20e6086017ae8a0fe8eac40eebceb7dc66e96993b10503ad58/pyarrow-13.0.0-cp310-cp310-macosx_11_0_arm64.whl.metadata\n", " Using cached pyarrow-13.0.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (3.0 kB)\n", "Collecting dill<0.3.8,>=0.3.0 (from datasets)\n", " Obtaining dependency information for dill<0.3.8,>=0.3.0 from https://files.pythonhosted.org/packages/f5/3a/74a29b11cf2cdfcd6ba89c0cecd70b37cd1ba7b77978ce611eb7a146a832/dill-0.3.7-py3-none-any.whl.metadata\n", " Using cached dill-0.3.7-py3-none-any.whl.metadata (9.9 kB)\n", "Requirement already satisfied: pandas in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from datasets) (2.1.1)\n", "Requirement already satisfied: requests>=2.19.0 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from datasets) (2.31.0)\n", "Requirement already satisfied: tqdm>=4.62.1 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from datasets) (4.66.1)\n", "Collecting xxhash (from datasets)\n", " Obtaining dependency information for xxhash from https://files.pythonhosted.org/packages/ad/7f/dfdf25e416b67970e89d7b85b0e6a4860ec8a227544cb5db069617cc323e/xxhash-3.4.1-cp310-cp310-macosx_11_0_arm64.whl.metadata\n", " Downloading xxhash-3.4.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (12 kB)\n", "Collecting multiprocess (from datasets)\n", " Obtaining dependency information for multiprocess from https://files.pythonhosted.org/packages/35/a8/36d8d7b3e46b377800d8dec47891cdf05842d1a2366909ae4a0c89fbc5e6/multiprocess-0.70.15-py310-none-any.whl.metadata\n", " Using cached multiprocess-0.70.15-py310-none-any.whl.metadata (7.2 kB)\n", "Collecting fsspec[http]<2023.9.0,>=2023.1.0 (from datasets)\n", " Obtaining dependency information for fsspec[http]<2023.9.0,>=2023.1.0 from https://files.pythonhosted.org/packages/e3/bd/4c0a4619494188a9db5d77e2100ab7d544a42e76b2447869d8e124e981d8/fsspec-2023.6.0-py3-none-any.whl.metadata\n", " Using cached fsspec-2023.6.0-py3-none-any.whl.metadata (6.7 kB)\n", "Collecting aiohttp (from datasets)\n", " Obtaining dependency information for aiohttp from https://files.pythonhosted.org/packages/94/a9/61f60723b20f9accdf4c9dc812ad4a61c1c63bdc732bc4e81fde9e6c40a9/aiohttp-3.8.6-cp310-cp310-macosx_11_0_arm64.whl.metadata\n", " Downloading 
aiohttp-3.8.6-cp310-cp310-macosx_11_0_arm64.whl.metadata (7.7 kB)\n", "Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)\n", " Obtaining dependency information for huggingface-hub<1.0.0,>=0.14.0 from https://files.pythonhosted.org/packages/ef/b5/b6107bd65fa4c96fdf00e4733e2fe5729bb9e5e09997f63074bb43d3ab28/huggingface_hub-0.18.0-py3-none-any.whl.metadata\n", " Downloading huggingface_hub-0.18.0-py3-none-any.whl.metadata (13 kB)\n", "Requirement already satisfied: packaging in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from datasets) (23.2)\n", "Requirement already satisfied: pyyaml>=5.1 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from datasets) (6.0.1)\n", "Collecting responses<0.19 (from evaluate)\n", " Using cached responses-0.18.0-py3-none-any.whl (38 kB)\n", "Requirement already satisfied: filelock in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from transformers[sentencepiece]) (3.12.4)\n", "Requirement already satisfied: regex!=2019.12.17 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from transformers[sentencepiece]) (2023.10.3)\n", "Collecting tokenizers<0.15,>=0.14 (from transformers[sentencepiece])\n", " Obtaining dependency information for tokenizers<0.15,>=0.14 from https://files.pythonhosted.org/packages/ec/2d/b0ce807327959036aad8431304a1e4a115efe678d68920f6cd192152c17b/tokenizers-0.14.1-cp310-cp310-macosx_11_0_arm64.whl.metadata\n", " Downloading tokenizers-0.14.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (6.7 kB)\n", "Collecting safetensors>=0.3.1 (from transformers[sentencepiece])\n", " Obtaining dependency information for safetensors>=0.3.1 from https://files.pythonhosted.org/packages/f5/d1/d00ec454c4fd0cd363245f4b475ebfe929e096f7fec086590ad173d82857/safetensors-0.4.0-cp310-cp310-macosx_11_0_arm64.whl.metadata\n", " Downloading safetensors-0.4.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (3.8 kB)\n", "Collecting sentencepiece!=0.1.92,>=0.1.91 (from transformers[sentencepiece])\n", " Using cached sentencepiece-0.1.99-cp310-cp310-macosx_11_0_arm64.whl (1.2 MB)\n", "Collecting protobuf (from transformers[sentencepiece])\n", " Obtaining dependency information for protobuf from https://files.pythonhosted.org/packages/88/12/efb5896c901382548ecb58d0449885a8f9aa62bb559d65e5a8a47f122629/protobuf-4.24.4-cp37-abi3-macosx_10_9_universal2.whl.metadata\n", " Downloading protobuf-4.24.4-cp37-abi3-macosx_10_9_universal2.whl.metadata (540 bytes)\n", "Requirement already satisfied: attrs>=17.3.0 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from aiohttp->datasets) (23.1.0)\n", "Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from aiohttp->datasets) (3.3.0)\n", "Collecting multidict<7.0,>=4.5 (from aiohttp->datasets)\n", " Using cached multidict-6.0.4-cp310-cp310-macosx_11_0_arm64.whl (29 kB)\n", "Collecting async-timeout<5.0,>=4.0.0a3 (from aiohttp->datasets)\n", " Obtaining dependency information for async-timeout<5.0,>=4.0.0a3 from https://files.pythonhosted.org/packages/a7/fa/e01228c2938de91d47b307831c62ab9e4001e747789d0b05baf779a6488c/async_timeout-4.0.3-py3-none-any.whl.metadata\n", " Using cached async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)\n", "Collecting yarl<2.0,>=1.0 (from aiohttp->datasets)\n", " Using cached yarl-1.9.2-cp310-cp310-macosx_11_0_arm64.whl (62 kB)\n", "Collecting frozenlist>=1.1.1 (from 
aiohttp->datasets)\n", " Obtaining dependency information for frozenlist>=1.1.1 from https://files.pythonhosted.org/packages/67/6a/55a49da0fa373ac9aa49ccd5b6393ecc183e2a0904d9449ea3ee1163e0b1/frozenlist-1.4.0-cp310-cp310-macosx_11_0_arm64.whl.metadata\n", " Using cached frozenlist-1.4.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (5.2 kB)\n", "Collecting aiosignal>=1.1.2 (from aiohttp->datasets)\n", " Using cached aiosignal-1.3.1-py3-none-any.whl (7.6 kB)\n", "Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from huggingface-hub<1.0.0,>=0.14.0->datasets) (4.8.0)\n", "Requirement already satisfied: idna<4,>=2.5 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (3.4)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (2.0.6)\n", "Requirement already satisfied: certifi>=2017.4.17 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from requests>=2.19.0->datasets) (2023.7.22)\n", "Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)\n", " Obtaining dependency information for huggingface-hub<1.0.0,>=0.14.0 from https://files.pythonhosted.org/packages/aa/f3/3fc97336a0e90516901befd4f500f08d691034d387406fdbde85bea827cc/huggingface_hub-0.17.3-py3-none-any.whl.metadata\n", " Using cached huggingface_hub-0.17.3-py3-none-any.whl.metadata (13 kB)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from pandas->datasets) (2.8.2)\n", "Requirement already satisfied: pytz>=2020.1 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from pandas->datasets) (2023.3.post1)\n", "Requirement already satisfied: tzdata>=2022.1 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from pandas->datasets) (2023.3)\n", "Requirement already satisfied: six>=1.5 in /Users/anoop/git-repos/teaching/nlp-class/venv/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets) (1.16.0)\n", "Using cached datasets-2.14.5-py3-none-any.whl (519 kB)\n", "Using cached dill-0.3.7-py3-none-any.whl (115 kB)\n", "Downloading aiohttp-3.8.6-cp310-cp310-macosx_11_0_arm64.whl (347 kB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m347.7/347.7 kB\u001b[0m \u001b[31m2.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m[31m4.8 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\n", "\u001b[?25hUsing cached pyarrow-13.0.0-cp310-cp310-macosx_11_0_arm64.whl (23.7 MB)\n", "Downloading safetensors-0.4.0-cp310-cp310-macosx_11_0_arm64.whl (425 kB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m425.4/425.4 kB\u001b[0m \u001b[31m14.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hDownloading tokenizers-0.14.1-cp310-cp310-macosx_11_0_arm64.whl (2.5 MB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.5/2.5 MB\u001b[0m \u001b[31m19.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0mm eta \u001b[36m0:00:01\u001b[0m0:01\u001b[0m:01\u001b[0m\n", "\u001b[?25hDownloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m295.0/295.0 kB\u001b[0m \u001b[31m7.8 
MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hUsing cached multiprocess-0.70.15-py310-none-any.whl (134 kB)\n", "Downloading protobuf-4.24.4-cp37-abi3-macosx_10_9_universal2.whl (409 kB)\n", "\u001b[2K \u001b[38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m409.4/409.4 kB\u001b[0m \u001b[31m17.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hUsing cached transformers-4.34.0-py3-none-any.whl (7.7 MB)\n", "Downloading xxhash-3.4.1-cp310-cp310-macosx_11_0_arm64.whl (30 kB)\n", "Using cached async_timeout-4.0.3-py3-none-any.whl (5.7 kB)\n", "Using cached frozenlist-1.4.0-cp310-cp310-macosx_11_0_arm64.whl (46 kB)\n", "Using cached fsspec-2023.6.0-py3-none-any.whl (163 kB)\n", "Installing collected packages: sentencepiece, xxhash, safetensors, pyarrow, protobuf, multidict, fsspec, frozenlist, dill, async-timeout, yarl, responses, multiprocess, huggingface-hub, aiosignal, tokenizers, aiohttp, transformers, datasets, evaluate\n", " Attempting uninstall: fsspec\n", " Found existing installation: fsspec 2023.9.2\n", " Uninstalling fsspec-2023.9.2:\n", " Successfully uninstalled fsspec-2023.9.2\n", "Successfully installed aiohttp-3.8.6 aiosignal-1.3.1 async-timeout-4.0.3 datasets-2.14.5 dill-0.3.7 evaluate-0.4.0 frozenlist-1.4.0 fsspec-2023.6.0 huggingface-hub-0.17.3 multidict-6.0.4 multiprocess-0.70.15 protobuf-4.24.4 pyarrow-13.0.0 responses-0.18.0 safetensors-0.4.0 sentencepiece-0.1.99 tokenizers-0.14.1 transformers-4.34.0 xxhash-3.4.1 yarl-1.9.2\n" ] } ], "source": [ "!pip install datasets evaluate transformers[sentencepiece]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need a corpus, so let’s create a simple one with a few sentences:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "id": "R5mBudXbVXQM" }, "outputs": [], "source": [ "corpus = [\n", " \"This is a sample corpus.\",\n", " \"This corpus will be used to show how subword tokenization works.\",\n", " \"This section shows several tokenizer algorithms.\",\n", " \"Hopefully, you will be able to understand how they are trained and generate tokens.\",\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to pre-tokenize that corpus into words. 
Since we are replicating a BPE tokenizer (like GPT-2), we will use the gpt2 tokenizer for the pre-tokenization:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "id": "kgRuhjLgVXQM" }, "outputs": [], "source": [ "from transformers import AutoTokenizer\n", "\n", "tokenizer = AutoTokenizer.from_pretrained(\"gpt2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "id": "G4jtqOBBVXQM", "outputId": "3488cffa-f18d-4f52-8eca-310e5359b5a6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "defaultdict(<class 'int'>, {'This': 3, 'Ġis': 1, 'Ġa': 1, 'Ġsample': 1, 'Ġcorpus': 2, '.': 4, 'Ġwill': 2, 'Ġbe': 2, 'Ġused': 1, 'Ġto': 2, 'Ġshow': 1, 'Ġhow': 2, 'Ġsubword': 1, 'Ġtokenization': 1, 'Ġworks': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġable': 1, 'Ġunderstand': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})\n" ] } ], "source": [ "from collections import defaultdict\n", "\n", "word_freqs = defaultdict(int)\n", "\n", "for text in corpus:\n", "    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)\n", "    new_words = [word for word, offset in words_with_offsets]\n", "    for word in new_words:\n", "        word_freqs[word] += 1\n", "\n", "print(word_freqs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to compute the base vocabulary, formed by all the characters used in the corpus:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "id": "BIBZjTFpVXQM", "outputId": "9077443f-a451-4948-f0c8-c909426442d2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[',', '.', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ']\n" ] } ], "source": [ "alphabet = []\n", "\n", "for word in word_freqs.keys():\n", "    for letter in word:\n", "        if letter not in alphabet:\n", "            alphabet.append(letter)\n", "alphabet.sort()\n", "\n", "print(alphabet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also add the special tokens used by the model at the beginning of that vocabulary. In the case of GPT-2, the only special token is `<|endoftext|>`:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "id": "OI1pyl1cVXQN" }, "outputs": [], "source": [ "vocab = ['<|endoftext|>'] + alphabet.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now need to split each word into individual characters, to be able to start training:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "X6GYtG4xVXQN" }, "outputs": [], "source": [ "splits = {word: [c for c in word] for word in word_freqs.keys()}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we are ready for training, let’s write a function that computes the frequency of each pair. 
We’ll need to use this at each step of the training:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "id": "q7B7oGEJVXQN" }, "outputs": [], "source": [ "def compute_pair_freqs(splits):\n", "    pair_freqs = defaultdict(int)\n", "    for word, freq in word_freqs.items():\n", "        split = splits[word]\n", "        if len(split) == 1:\n", "            continue\n", "        for i in range(len(split) - 1):\n", "            pair = (split[i], split[i + 1])\n", "            pair_freqs[pair] += freq\n", "    return pair_freqs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s have a look at a part of this dictionary after the initial splits:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "id": "aehIyCnxVXQO", "outputId": "d9db566f-81ed-4cd0-aefc-5e32285778f2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('T', 'h'): 3\n", "('h', 'i'): 3\n", "('i', 's'): 4\n", "('Ġ', 'i'): 1\n", "('Ġ', 'a'): 5\n", "('Ġ', 's'): 6\n" ] } ], "source": [ "pair_freqs = compute_pair_freqs(splits)\n", "\n", "for i, key in enumerate(pair_freqs.keys()):\n", "    print(f\"{key}: {pair_freqs[key]}\")\n", "    if i >= 5:\n", "        break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finding the most frequent pair only takes a quick loop:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "id": "FqQm7KP6VXQO", "outputId": "3e589bd5-4f88-40be-9af7-52b886b46e0c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Ġ', 't') 7\n" ] } ], "source": [ "best_pair = \"\"\n", "max_freq = None\n", "\n", "for pair, freq in pair_freqs.items():\n", "    if max_freq is None or max_freq < freq:\n", "        best_pair = pair\n", "        max_freq = freq\n", "\n", "print(best_pair, max_freq)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the first merge to learn is ('Ġ', 't') -> 'Ġt', and we add 'Ġt' to the vocabulary:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "id": "8itZ8G5HVXQO" }, "outputs": [], "source": [ "merges = {(\"Ġ\", \"t\"): \"Ġt\"}\n", "vocab.append(\"Ġt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To continue, we need to apply that merge in our splits dictionary. Let’s write another function for this:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "id": "N1Vt3OOQVXQP" }, "outputs": [], "source": [ "def merge_pair(a, b, splits):\n", "    for word in word_freqs:\n", "        split = splits[word]\n", "        if len(split) == 1:\n", "            continue\n", "\n", "        i = 0\n", "        while i < len(split) - 1:\n", "            if split[i] == a and split[i + 1] == b:\n", "                split = split[:i] + [a + b] + split[i + 2 :]\n", "            else:\n", "                i += 1\n", "        splits[word] = split\n", "    return splits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can have a look at the result of the first merge:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "id": "t5YWe95CVXQP", "outputId": "cee31056-563c-4922-d67f-a75ff6e37b7b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Ġt', 'r', 'a', 'i', 'n', 'e', 'd']\n" ] } ], "source": [ "splits = merge_pair(\"Ġ\", \"t\", splits)\n", "print(splits[\"Ġtrained\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have everything we need to loop until we have learned all the merges we want. 
Let’s aim for a vocab size of 50:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "id": "assVtBa3VXQP" }, "outputs": [], "source": [ "vocab_size = 50\n", "\n", "while len(vocab) < vocab_size:\n", "    pair_freqs = compute_pair_freqs(splits)\n", "    best_pair = \"\"\n", "    max_freq = None\n", "    for pair, freq in pair_freqs.items():\n", "        if max_freq is None or max_freq < freq:\n", "            best_pair = pair\n", "            max_freq = freq\n", "    splits = merge_pair(*best_pair, splits)\n", "    merges[best_pair] = best_pair[0] + best_pair[1]\n", "    vocab.append(best_pair[0] + best_pair[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a result, we’ve learned 21 merge rules (the initial vocabulary had a size of 29: the 28 characters in the alphabet, plus the special token):" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "id": "w8auY3s8VXQP", "outputId": "22ef896d-1eac-4356-a7b5-689aac8c695d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{('Ġ', 't'): 'Ġt', ('Ġ', 's'): 'Ġs', ('Ġ', 'a'): 'Ġa', ('o', 'r'): 'or', ('Ġt', 'o'): 'Ġto', ('i', 's'): 'is', ('h', 'o'): 'ho', ('ho', 'w'): 'how', ('e', 'n'): 'en', ('e', 'r'): 'er', ('T', 'h'): 'Th', ('Th', 'is'): 'This', ('u', 's'): 'us', ('Ġ', 'w'): 'Ġw', ('l', 'l'): 'll', ('Ġto', 'k'): 'Ġtok', ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('l', 'e'): 'le', ('Ġ', 'c'): 'Ġc', ('Ġc', 'or'): 'Ġcor'}\n" ] } ], "source": [ "print(merges)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "id": "lfCia12VVXQP", "outputId": "0d8d78a2-9f60-45d4-be59-d0934c72fad0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['<|endoftext|>', ',', '.', 'H', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z', 'Ġ', 'Ġt', 'Ġs', 'Ġa', 'or', 'Ġto', 'is', 'ho', 'how', 'en', 'er', 'Th', 'This', 'us', 'Ġw', 'll', 'Ġtok', 'Ġtoken', 'nd', 'le', 'Ġc', 'Ġcor']\n" ] } ], "source": [ "print(vocab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To tokenize a new text, we pre-tokenize it, split it, then apply all the merge rules learned:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "id": "UT7xPzO6VXQP" }, "outputs": [], "source": [ "def tokenize(text):\n", "    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)\n", "    pre_tokenized_text = [word for word, offset in pre_tokenize_result]\n", "    splits = [[l for l in word] for word in pre_tokenized_text]\n", "    for pair, merge in merges.items():\n", "        for idx, split in enumerate(splits):\n", "            i = 0\n", "            while i < len(split) - 1:\n", "                if split[i] == pair[0] and split[i + 1] == pair[1]:\n", "                    split = split[:i] + [merge] + split[i + 2 :]\n", "                else:\n", "                    i += 1\n", "            splits[idx] = split\n", "\n", "    return sum(splits, [])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can try this on any text composed of characters in the alphabet:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "id": "OBqg_Nq4VXQP", "outputId": "34e4d4a7-c6b4-48b2-d0f5-7c8886c37f1c" }, "outputs": [ { "data": { "text/plain": [ "['This', 'Ġ', 'is', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenize(\"This is not a token.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our implementation will throw an error if there is an unknown character since we didn’t do anything to handle them. 
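" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sketch (an assumption for illustration, not part of the original tutorial), one simple way to guard against this is to substitute a placeholder such as \"[UNK]\" for any character that is missing from our vocabulary before applying the merges. This reuses the `tokenizer`, `vocab`, and `merges` objects defined above; the `[UNK]` placeholder is our own convention, not something GPT-2 uses." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def tokenize_with_unk(text, unk_token=\"[UNK]\"):\n", "    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)\n", "    pre_tokenized_text = [word for word, offset in pre_tokenize_result]\n", "    # Replace characters we have never seen with the placeholder token.\n", "    splits = [[c if c in vocab else unk_token for c in word] for word in pre_tokenized_text]\n", "    for pair, merge in merges.items():\n", "        for idx, split in enumerate(splits):\n", "            i = 0\n", "            while i < len(split) - 1:\n", "                if split[i] == pair[0] and split[i + 1] == pair[1]:\n", "                    split = split[:i] + [merge] + split[i + 2 :]\n", "                else:\n", "                    i += 1\n", "            splits[idx] = split\n", "\n", "    return sum(splits, [])\n", "\n", "print(tokenize_with_unk(\"This is not a token!\"))  # \"!\" is not in our alphabet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "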
GPT-2 doesn’t actually have an unknown token (it’s impossible to get an unknown character when using byte-level BPE), but this could happen here because we did not include all the possible bytes in the initial vocabulary. This aspect of BPE is beyond the scope of this section, so we’ve left the details out." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training a transformers library tokenizer" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['This is a sample corpus.'], ['This corpus will be used to show how subword tokenization works.'], ['This section shows several tokenizer algorithms.'], ['Hopefully, you will be able to understand how they are trained and generate tokens.']]\n" ] } ], "source": [ "training_corpus = [ [i] for i in corpus ]\n", "print(training_corpus)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "['This', 'Ġ', 'is', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken']\n" ] } ], "source": [ "bpe_tokenizer = tokenizer.train_new_from_iterator(training_corpus, 275) # target vocabulary size of 275\n", "tokens = bpe_tokenizer.tokenize(\"This is not a token\")\n", "print(tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## OpenAI vocab\n", "\n", "This 50K vocabulary was created with OpenAI's byte-level variant of BPE sub-word tokenization (the same encoding is shipped with OpenAI's [tiktoken](https://github.com/openai/tiktoken) library) and is available here:\n", "\n", "https://huggingface.co/gpt2/blob/main/vocab.json" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "50256\n", "Anoop does not exist\n", "2025\n", "11224\n" ] } ], "source": [ "import json\n", "gpt_vocab = None\n", "with open('gpt_vocab.json', 'r') as f:\n", "    gpt_vocab = json.load(f)\n", "if gpt_vocab:\n", "    print(gpt_vocab['<|endoftext|>'])\n", "    try:\n", "        print(gpt_vocab['Anoop'])\n", "    except KeyError:\n", "        print('Anoop does not exist')\n", "    print(gpt_vocab['An'])\n", "    print(gpt_vocab['oop'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## End" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.core.display import HTML\n", "\n", "\n", "def css_styling():\n", "    styles = open(\"../css/notebook.css\", \"r\").read()\n", "    return HTML(styles)\n", "css_styling()" ] } ], "metadata": { "colab": { "name": "Byte-Pair Encoding tokenization", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }