{ "cells": [ { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%cd ..\n", "import statnlpbook.tokenization as tok" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Tokenization" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "* Identify the **words** in a string of characters. \n", "* Improve input **representation** in [structured prediction recipe](structured_prediction.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In Python you can tokenize a text via `split`:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.\\nWhy',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"\"\"Mr. Bob Dobolina is thinkin' of a master plan.\n", "Why doesn't he quit?\"\"\"\n", "text.split(\" \")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Why is this suboptimal? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Python allows users to construct tokenizers using \n", "### Regular Expressions \n", "that define **patterns** at which to split tokens." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A **regular expression** is a compact definition of a **set** of (character) sequences.\n", "\n", "Examples:\n", "* `\"Mr.\"`: set containing only `\"Mr.\"`\n", "* `\" |\\n|!!!\"`: set containing the sequences `\" \"`, `\"\\n\"` and `\"!!!\"`\n", "* `\"[abc]\"`: set containing only the characters `a`, `b` and `c`\n", "* `\"\\s\"`: set of all whitespace characters\n", "* `\"1+\"`: set of all sequences of at least one `\"1\"` \n", "* etc.\n" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "re.compile('\\s').split(text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Problems:\n", "* Bad treatment of punctuation. \n", "* Easier to **define a token** than a gap. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let us use `findall` instead:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['Mr',\n", " '.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " 'thinkin',\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " 'doesn',\n", " 't',\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\w+|[.?]').findall(text) # \\w+|[.?]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Problems:\n", "* \"Mr.\" is split into two tokens, should be single. \n", "* Lost an apostrophe. \n", "\n", "Both is fixed below ..." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('Mr.|[\\w\\']+|[.?]').findall(text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Learning to Tokenize?\n", "* For English simple pattern matching often sufficient. \n", "* In other languages (e.g. Japanese), words are not separated by whitespace.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "jap = \"今日もしないといけない。\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Try lexicon-based tokenization ..." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "['今日', 'もし', 'と', 'けない']" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('もし|今日|も|しない|と|けない').findall(jap)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Equally complex for certain English domains (eg. bio-medical text). 
" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "bio = \"\"\"We developed a nanocarrier system of herceptin-conjugated nanoparticles\n", "of d-alpha-tocopheryl-co-poly(ethylene glycol) 1000 succinate (TPGS)-cisplatin\n", "prodrug ...\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "* d-alpha-tocopheryl-co-poly is **one** token\n", "* (TPGS)-cisplatin are **five**: \n", " * ( \n", " * TPGS \n", " * ) \n", " * - \n", " * cisplatin " ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['We',\n", " 'developed',\n", " 'a',\n", " 'nanocarrier',\n", " 'system',\n", " 'of',\n", " 'herceptin-conjugated',\n", " 'nanoparticles',\n", " 'of',\n", " 'd-alpha-tocopheryl-co-poly(ethylene',\n", " 'glycol)',\n", " '1000',\n", " 'succinate',\n", " '(TPGS)-cisplatin',\n", " 'prodrug']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.compile('\\s').split(bio)[:15]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Solution: Treat tokenization as a **statistical NLP problem** (and as structured prediction)! \n", " * [classification](doc_classify.ipynb)\n", " * [sequence labelling](sequence_labelling.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Sentence Segmentation\n", "\n", "* Many NLP tools work sentence-by-sentence. \n", "* Often trivial after tokenisation: split sentences at sentence-ending punctuation tokens." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.'],\n", " ['Why', \"doesn't\", 'he', 'quit', '?']]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = re.compile('Mr.|[\\w\\']+|[.?]').findall(text)\n", "# try different regular expressions\n", "tok.sentence_segment(re.compile('\\.'), tokens)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What do you do with transcribed speech? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Background Reading\n", "\n", "* Jurafsky & Martin, Speech and Language Processing: Chapter 2, Regular Expressions and Automata.\n", "* Manning, Raghavan & Schuetze, Introduction to Information Retrieval: [Tokenization](http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }