{ "cells": [ { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%cd ..\n", "import statnlpbook.tokenization as tok" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Tokenization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* Identify the **words** in a string of characters. \n", "* Improve input **representation** in [structured prediction recipe](structured_prediction.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In Python you can tokenize a text via `split`:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.\\nWhy',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"Mr. Bob Dobolina is thinkin' of a master plan.\n", "Why doesn't he quit?\"\"\"\n", "text.split(\" \")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Why is this suboptimal? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Python allows users to construct tokenizers using \n", "### Regular Expressions \n", "that define **patterns** at which to split tokens." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A **regular expression** is a compact definition of a **set** of (character) sequences.\n", "\n", "Examples:\n", "* `\"Mr.\"`: set containing only `\"Mr.\"`\n", "* `\" |\\n|!!!\"`: set containing the sequences `\" \"`, `\"\\n\"` and `\"!!!\"`\n", "* `\"[abc]\"`: set containing only the characters `a`, `b` and `c`\n", "* `\"\\s\"`: set of all whitespace characters\n", "* `\"1+\"`: set of all sequences of at least one `\"1\"` \n", "* etc.\n" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "re.compile('\\s').split(text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Problems:\n", "* Bad treatment of punctuation. \n", "* Easier to **define a token** than a gap. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let us use `findall` instead:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['Mr',\n", " '.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " 'thinkin',\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " 'doesn',\n", " 't',\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\w+|[.?]').findall(text) # \\w+|[.?]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Problems:\n", "* \"Mr.\" is split into two tokens, should be single. \n", "* Lost an apostrophe. \n", "\n", "Both is fixed below ..." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('Mr.|[\\w\\']+|[.?]').findall(text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Learning to Tokenize?\n", "* For English simple pattern matching often sufficient. \n", "* In other languages (e.g. Japanese), words are not separated by whitespace.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "jap = \"今日もしないといけない。\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Try lexicon-based tokenization ..." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "['今日', 'もし', 'と', 'けない']" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('もし|今日|も|しない|と|けない').findall(jap)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Equally complex for certain English domains (eg. bio-medical text). 
" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "bio = \"\"\"We developed a nanocarrier system of herceptin-conjugated nanoparticles\n", "of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin\n", "prodrug ...\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "* d-alpha-tocopheryl-co-poly is **one** token\n", "* (TPGS)-cisplatin are **five**: \n", " * ( \n", " * TPGS \n", " * ) \n", " * - \n", " * cisplatin " ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['We',\n", " 'developed',\n", " 'a',\n", " 'nanocarrier',\n", " 'system',\n", " 'of',\n", " 'herceptin-conjugated',\n", " 'nanoparticles',\n", " 'of',\n", " 'd-alpha-tocopheryl-co-poly(ethylene',\n", " 'glycol)',\n", " '1000',\n", " 'succinate',\n", " '(TPGS)-cisplatin',\n", " 'prodrug']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\s').split(bio)[:15]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Solution: Treat tokenization as a **statistical NLP problem** (and as structured prediction)! \n", " * [classification](doc_classify.ipynb)\n", " * [sequence labelling](sequence_labelling.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sentence Segmentation\n", "\n", "* Many NLP tools work sentence-by-sentence. \n", "* Often trivial after tokenisation: split sentences at sentence-ending punctuation tokens." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.'],\n", " ['Why', \"doesn't\", 'he', 'quit', '?']]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = re.compile('Mr.|[\\w\\']+|[.?]').findall(text)\n", "# try different regular expressions\n", "tok.sentence_segment(re.compile('\\.'), tokens)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What do you do with transcribed speech? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Background Reading\n", "\n", "* Jurafsky & Martin, Speech and Language Processing: Chapter 2, Regular Expressions and Automata.\n", "* Manning, Raghavan & Schuetze, Introduction to Information Retrieval: [Tokenization](http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }