{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%html\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%cd ..\n", "import statnlpbook.tokenization as tok" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Tokenisation" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "* Identify the **meaningful units** in a string of characters: for example, **words**.\n", "\n", "![nospaces](../img/nospaces.jpg)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "In Python you can tokenise text via `split`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.\\nWhy',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"Mr. Bob Dobolina is thinkin' of a master plan.\n", "Why doesn't he quit?\"\"\"\n", "text.split(\" \")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "What is wrong with this?" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "In Python you can also tokenise using **patterns** at which to split tokens:\n", "### Regular Expressions" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "A **regular expression** is a compact definition of a **set** of (character) sequences (strings).\n", "\n", "Examples:\n", "* `Mr.`: all strings containing `Mr` followed by any single character\n", "* `Mr\\.`: only the string `Mr.`\n", "*  `|\\n|!!!`: only the strings   (space), `\\n` and `!!!`\n", "* `[abc]`: only the characters `a`, `b` and `c`\n", "* `\\s`: all whitespace characters\n", "* `1+`: all sequences of at least one `1`\n", "* `\\w+`: all sequences of alphanumeric characters and `_`\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "re.compile('\\s').split(text)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Problems:\n", "* Bad treatment of punctuation. \n", "* Easier to **define a token** than a gap. 
" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us use `findall` instead:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Mr',\n", " '.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " 'thinkin',\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " 'doesn',\n", " 't',\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\w+|[.?]').findall(text)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Problems:\n", "* `Mr.` and `doesn't` are split into two tokens each.\n", "* Lost an apostrophe (`thinkin'`).\n", "\n", "Both are fixed below ..." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('Mr\\.|[\\w\\']+|[.?]').findall(text)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Learning to Tokenise?\n", "* For English, simple pattern matching is often sufficient.\n", "* In some languages (e.g., Japanese), words are not separated by whitespace.\n", "* In some languages (e.g., Vietnamese), whitespace does not indicate word boundary.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "jap = \"今日もしないといけない。\"\n", "viet = \"thuế thu nhập cá nhân\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "Try lexicon-based tokenisation ..." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['今日', 'もし', 'と', 'いけない']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('もし|今日|も|しない|と|いけない').findall(jap)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['thuế thu nhập', 'cá nhân']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('thuế thu nhập|cá nhân').findall(viet)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Equally complex for certain English domains (e.g., biomedical text)." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "bio = \"\"\"We developed a nanocarrier system of herceptin-conjugated nanoparticles\n", "of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin\n", "prodrug ...\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "* d-alpha-tocopheryl-co-poly is **one** token\n", "* (TPGS)-cisplatin are **five**: \n", " * ( \n", " * TPGS \n", " * ) \n", " * - \n", " * cisplatin " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['We',\n", " 'developed',\n", " 'a',\n", " 'nanocarrier',\n", " 'system',\n", " 'of',\n", " 'herceptin-conjugated',\n", " 'nanoparticles',\n", " 'of',\n", " 'd-alpha-tocopheryl-co-poly(ethylene',\n", " 'glycol)',\n", " '1000',\n", " 'succinate',\n", " '(TPGS)-cisplatin',\n", " 'prodrug']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\s').split(bio)[:15]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['New', 'York-based', 'companies']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\s').split(\"New York-based companies\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "Solution: Treat tokenisation as a **statistical problem**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Subword Tokenisation\n", "\n", "Learn from data what is the best way to break down strings to tokens.\n", "\n", "- **Why Subword Tokenization?**\n", " - Efficient handling of Out-of-Vocabulary (OOV) words.\n", " - Capture meaningful subword information.\n", "- **Popular Algorithms**: \n", " - WordPiece\n", " - Byte Pair Encoding (BPE)\n", " - Unigram (SentencePiece)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## WordPiece\n", "- **Used In**: BERT, DistillBERT\n", "- **Origin**: Developed by Google for speech recognition and later adapted for text.\n", "- **How it Works**: \n", " 1. Initialize vocabulary with characters and special tokens.\n", " 2. Merge subwords iteratively based on scoring criteria to form the new vocabulary." 
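, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The scoring step is what distinguishes WordPiece from BPE: BPE merges the most *frequent* adjacent pair, whereas WordPiece normalises the pair frequency by how frequent its two parts are on their own. A toy sketch of one scoring step (the corpus and the bookkeeping here are invented for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from collections import Counter\n", "\n", "corpus = ['hug', 'hugging', 'hugs']  # toy corpus, assumed for illustration\n", "\n", "# start from characters; non-initial pieces get the '##' continuation prefix\n", "words = [[w[0]] + ['##' + c for c in w[1:]] for w in corpus]\n", "\n", "unit_freq, pair_freq = Counter(), Counter()\n", "for w in words:\n", "    unit_freq.update(w)\n", "    pair_freq.update(zip(w, w[1:]))\n", "\n", "# WordPiece score: freq(pair) / (freq(first) * freq(second)); BPE would use freq(pair)\n", "scores = {p: f / (unit_freq[p[0]] * unit_freq[p[1]]) for p, f in pair_freq.items()}\n", "print(max(scores, key=scores.get))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "At tokenisation time the learned vocabulary is applied greedily: each word is split by repeatedly taking the longest prefix that is in the vocabulary, writing non-initial pieces with `##`." ] }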
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### WordPiece Examples\n", "- **English**: \"hugging\": \n", " - Initial: (\"h\", \"u\", \"g\", \"g\", \"i\", \"n\", \"g\")\n", " - After training: (\"hug\", \"##ging\")\n", "- **Danish**: \"hygge\"\n", " - Initial: (\"h\", \"y\", \"g\", \"g\", \"e\")\n", " - After training: (\"hy\", \"##gge\")\n", "- **Japanese**: \"こんにちは\" \n", " - Initial: (\"こ\", \"ん\", \"に\", \"ち\", \"は\")\n", " - After training: (\"こん\", \"##に\", \"##ち\", \"##は\")\n", "- **Vietnamese**: \"xin chào\"\n", " - Initial: (\"x\", \"i\", \"n\", \" \", \"c\", \"h\", \"à\", \"o\")\n", " - After training: (\"x\", \"##in\", \" \", \"##ch\", \"##à\", \"##o\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Byte Pair Encoding (BPE)\n", "- **Used In**: GPT, RoBERTa\n", "- **Origin**: Initially developed for data compression.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Unigram Algorithm in SentencePiece\n", "- **Used In**: ALBERT, T5, mBART, Big Bird, XLNet\n", "- **Origin**: Developed by Google for machine translation." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Sentence Segmentation\n", "\n", "* Many NLP tools work sentence-by-sentence. \n", "* Often trivial after tokenisation: split sentences at sentence-ending punctuation tokens." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "\"Mr. Bob Dobolina is thinkin' of a master plan.\\nWhy doesn't he quit?\"" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.'],\n", " ['Why', \"doesn't\", 'he', 'quit', '?']]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = re.compile('Mr.|[\\w\\']+|[.?]').findall(text)\n", "# try different regular expressions\n", "tok.sentence_segment(re.compile('\\.'), tokens)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "
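, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "fragment" } }, "source": [ "`tok.sentence_segment` is provided by the book's `statnlpbook.tokenization` module. A minimal sketch of what such a function might look like (this re-implementation is an assumption, not the module's actual code):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def sentence_segment(match_regex, tokens):\n", "    # start a new sentence after every token the regex fully matches\n", "    sentences = [[]]\n", "    for t in tokens:\n", "        sentences[-1].append(t)\n", "        if match_regex.fullmatch(t):\n", "            sentences.append([])\n", "    return [s for s in sentences if s]\n", "\n", "sentence_segment(re.compile(r'\.|\?'), tokens)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "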
\n", "\n", "What are the challenges in sentence splitting? \n", "\n", "Discuss and enter your answer(s) here:\n", "\n", "# [tinyurl.com/diku-nlp-q2](https://tinyurl.com/diku-nlp-q2)\n", "\n", "([Responses](https://docs.google.com/forms/d/1WANt_ndHZhGkOwPu1klR4HmGAUH1QL9W4AAkNwU6Ulg/edit))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Background Reading\n", "\n", "* Jurafsky & Martin, [Speech and Language Processing (Third Edition)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf): Chapter 2, Regular Expressions, Text Normalization, Edit Distance.\n", "* Hugging Face's excellent NLP course: [Tokenizers](https://huggingface.co/learn/nlp-course/chapter6/1)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 1 }