{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%cd ..\n", "import statnlpbook.tokenization as tok" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Tokenisation" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "* Identify the **meaningful units** in a string of characters: for example, **words**.\n", "\n", "![nospaces](../img/nospaces.jpg)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "In Python you can tokenise text via `split`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.\\nWhy',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"Mr. Bob Dobolina is thinkin' of a master plan.\n", "Why doesn't he quit?\"\"\"\n", "text.split(\" \")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "What is wrong with this?" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "In Python you can also tokenise using **patterns** at which to split tokens:\n", "### Regular Expressions" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "A **regular expression** is a compact definition of a **set** of (character) sequences (strings).\n", "\n", "Examples:\n", "* `Mr.`: all strings containing `Mr` followed by any single character\n", "* `Mr\\.`: only the string `Mr.`\n", "*  `|\\n|!!!`: only the strings   (space), `\\n` and `!!!`\n", "* `[abc]`: only the characters `a`, `b` and `c`\n", "* `\\s`: all whitespace characters\n", "* `1+`: all sequences of at least one `1`\n", "* `\\w+`: all sequences of alphanumeric characters and `_`\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "re.compile('\\s').split(text)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Problems:\n", "* Bad treatment of punctuation. \n", "* Easier to **define a token** than a gap. 
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us use `findall` instead:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Mr',\n", " '.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " 'thinkin',\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " 'doesn',\n", " 't',\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\w+|[.?]').findall(text)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Problems:\n", "* `Mr.` and `doesn't` are split into two tokens each.\n", "* Lost an apostrophe (`thinkin'`).\n", "\n", "Both are fixed below ..." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('Mr\\.|[\\w\\']+|[.?]').findall(text)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Learning to Tokenise?\n", "* For English, simple pattern matching is often sufficient.\n", "* In some languages (e.g., Japanese), words are not separated by whitespace.\n", "* In some languages (e.g., Vietnamese), whitespace does not indicate word boundary.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "jap = \"今日もしないといけない。\"\n", "viet = \"thuế thu nhập cá nhân\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "Try lexicon-based tokenisation ..." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['今日', 'もし', 'と', 'いけない']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('もし|今日|も|しない|と|いけない').findall(jap)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['thuế thu nhập', 'cá nhân']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('thuế thu nhập|cá nhân').findall(viet)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Equally complex for certain English domains (e.g., biomedical text)." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "bio = \"\"\"We developed a nanocarrier system of herceptin-conjugated nanoparticles\n", "of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin\n", "prodrug ...\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "* d-alpha-tocopheryl-co-poly is **one** token\n", "* (TPGS)-cisplatin are **five**: \n", " * ( \n", " * TPGS \n", " * ) \n", " * - \n", " * cisplatin " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['We',\n", " 'developed',\n", " 'a',\n", " 'nanocarrier',\n", " 'system',\n", " 'of',\n", " 'herceptin-conjugated',\n", " 'nanoparticles',\n", " 'of',\n", " 'd-alpha-tocopheryl-co-poly(ethylene',\n", " 'glycol)',\n", " '1000',\n", " 'succinate',\n", " '(TPGS)-cisplatin',\n", " 'prodrug']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\s').split(bio)[:15]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['New', 'York-based', 'companies']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\s').split(\"New York-based companies\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Solution: Treat tokenisation as a **statistical problem**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Subword Tokenisation\n", "\n", "Learn from data what is the best way to break down strings to tokens.\n", "\n", "- **Why Subword Tokenization?**\n", " - Efficient handling of Out-of-Vocabulary (OOV) words.\n", " - Capture meaningful subword information.\n", "- **Popular Algorithms**: \n", " - WordPiece\n", " - Byte Pair Encoding (BPE)\n", " - Unigram (SentencePiece)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## WordPiece\n", "- **Used In**: BERT, DistillBERT\n", "- **Origin**: Developed by Google for speech recognition and later adapted for text.\n", "- **How it Works**: \n", " 1. Initialize vocabulary with characters and special tokens.\n", " 2. Merge subwords iteratively based on scoring criteria to form the new vocabulary." 
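, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The scoring step is what distinguishes WordPiece from BPE: BPE merges the most *frequent* adjacent pair, whereas WordPiece normalises the pair frequency by how frequent its two parts are on their own. A toy sketch of one scoring step (the corpus and the bookkeeping here are invented for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from collections import Counter\n", "\n", "corpus = ['hug', 'hugging', 'hugs']  # toy corpus, assumed for illustration\n", "\n", "# start from characters; non-initial pieces get the '##' continuation prefix\n", "words = [[w[0]] + ['##' + c for c in w[1:]] for w in corpus]\n", "\n", "unit_freq, pair_freq = Counter(), Counter()\n", "for w in words:\n", "    unit_freq.update(w)\n", "    pair_freq.update(zip(w, w[1:]))\n", "\n", "# WordPiece score: freq(pair) / (freq(first) * freq(second)); BPE would use freq(pair)\n", "scores = {p: f / (unit_freq[p[0]] * unit_freq[p[1]]) for p, f in pair_freq.items()}\n", "print(max(scores, key=scores.get))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "At tokenisation time the learned vocabulary is applied greedily: each word is split by repeatedly taking the longest prefix that is in the vocabulary, writing non-initial pieces with `##`." ] }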
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### WordPiece Examples\n", "- **English**: \"hugging\": \n", " - Initial: (\"h\", \"u\", \"g\", \"g\", \"i\", \"n\", \"g\")\n", " - After training: (\"hug\", \"##ging\")\n", "- **Danish**: \"hygge\"\n", " - Initial: (\"h\", \"y\", \"g\", \"g\", \"e\")\n", " - After training: (\"hy\", \"##gge\")\n", "- **Japanese**: \"こんにちは\" \n", " - Initial: (\"こ\", \"ん\", \"に\", \"ち\", \"は\")\n", " - After training: (\"こん\", \"##に\", \"##ち\", \"##は\")\n", "- **Vietnamese**: \"xin chào\"\n", " - Initial: (\"x\", \"i\", \"n\", \" \", \"c\", \"h\", \"à\", \"o\")\n", " - After training: (\"x\", \"##in\", \" \", \"##ch\", \"##à\", \"##o\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Byte Pair Encoding (BPE)\n", "- **Used In**: GPT, RoBERTa\n", "- **Origin**: Initially developed for data compression.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Unigram Algorithm in SentencePiece\n", "- **Used In**: ALBERT, T5, mBART, Big Bird, XLNet\n", "- **Origin**: Developed by Google for machine translation." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Sentence Segmentation\n", "\n", "* Many NLP tools work sentence-by-sentence. \n", "* Often trivial after tokenisation: split sentences at sentence-ending punctuation tokens." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "\"Mr. Bob Dobolina is thinkin' of a master plan.\\nWhy doesn't he quit?\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.'],\n", " ['Why', \"doesn't\", 'he', 'quit', '?']]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = re.compile('Mr.|[\\w\\']+|[.?]').findall(text)\n", "# try different regular expressions\n", "tok.sentence_segment(re.compile('\\.'), tokens)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "
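, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "`tok.sentence_segment` is provided by the book's `statnlpbook.tokenization` module. A minimal sketch of what such a function might look like (this re-implementation is an assumption, not the module's actual code):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def sentence_segment(match_regex, tokens):\n", "    # start a new sentence after every token the regex fully matches\n", "    sentences = [[]]\n", "    for t in tokens:\n", "        sentences[-1].append(t)\n", "        if match_regex.fullmatch(t):\n", "            sentences.append([])\n", "    return [s for s in sentences if s]\n", "\n", "sentence_segment(re.compile(r'\.|\?'), tokens)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "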
\n", "\n", "What are the challenges in sentence splitting? \n", "\n", "Discuss and enter your answer(s) here:\n", "\n", "# [tinyurl.com/diku-nlp-q2](https://tinyurl.com/diku-nlp-q2)\n", "\n", "([Responses](https://docs.google.com/forms/d/1WANt_ndHZhGkOwPu1klR4HmGAUH1QL9W4AAkNwU6Ulg/edit))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Background Reading\n", "\n", "* Jurafsky & Martin, [Speech and Language Processing (Third Edition)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf): Chapter 2, Regular Expressions, Text Normalization, Edit Distance.\n", "* Hugging Face's excellent NLP course: [Tokenizers](https://huggingface.co/learn/nlp-course/chapter6/1)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 1 }