{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%cd ..\n", "import statnlpbook.tokenization as tok" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tokenization\n", "\n", "Before a program can process natural language, we need identify the _words_ that constitute a string of characters. This, in fact, can be seen as a crucial transformation step to improve the input *representation* of language in the [structured prediction recipe](structured_prediction.ipynb).\n", "\n", "By default text on a computer is represented through `String` values. These values store a sequence of characters (nowadays mostly in [UTF-8](http://en.wikipedia.org/wiki/UTF-8) format). The first step of an NLP pipeline is therefore to split the text into smaller units corresponding to the words of the language we are considering. In the context of NLP we often refer to these units as _tokens_, and the process of extracting these units is called _tokenization_. Tokenization is considered boring by most, but it's hard to overemphasize its importance, seeing as it's the first step in a long pipeline of NLP processors, and if you get this step wrong, all further steps will suffer. \n", "\n", "\n", "In Python a simple way to tokenize a text is via the `split` method that divides a text wherever a particular substring is found. In the code below this pattern is simply the whitespace character, and this seems like a reasonable starting point for an English tokenization approach." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.\\nWhy',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = \"Mr. Bob Dobolina is thinkin' of a master plan.\" + \\\n", " \"\\nWhy doesn't he quit?\"\n", "text.split(\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenization with Regular Expressions\n", "Python allows users to construct tokenizers using [regular expressions](http://en.wikipedia.org/wiki/Regular_expression) that define the character sequence patterns at which to either split tokens, or patterns that define what constitutes a token. In general regular expressions are a powerful tool NLP practitioners can use when working with text, and they come in handy when you work with command line tools such as [grep](http://en.wikipedia.org/wiki/Grep). In the code below we use a simple pattern `\\\\s` that matches any whitespace to define where to split." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit?']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "gap = re.compile('\\s')\n", "gap.split(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One shortcoming of this tokenization is its treatment of punctuation because it considers \"plan.\" as a token whereas ideally we would prefer \"plan\" and \".\" to be distinct tokens (why?). 
 { "cell_type": "markdown", "metadata": {}, "source": [ "One shortcoming of this tokenization is its treatment of punctuation: it considers \"plan.\" a single token, whereas ideally we would prefer \"plan\" and \".\" to be two distinct tokens (why?). It is easier to address this problem if we define what a token is, instead of what constitutes a gap. Below we define a token as either a sequence of alphanumeric characters or an individual punctuation character." ] },
 { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Mr',\n", " '.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " 'thinkin',\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " 'doesn',\n", " 't',\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token = re.compile('\\w+|[.?:]')\n", "token.findall(text)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "This still isn't perfect: \"Mr.\" is split into two tokens when it should remain a single one, and we have lost the apostrophes. Both problems are fixed below, by treating \"Mr.\" as a special case and by allowing apostrophes inside tokens, although we now fail to break up the contraction \"doesn't\"." ] },
 { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.',\n", " 'Why',\n", " \"doesn't\",\n", " 'he',\n", " 'quit',\n", " '?']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token = re.compile('Mr\\.|[\\w\\']+|[.?]')\n", "tokens = token.findall(text)\n", "tokens" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning to Tokenize\n", "For most English domains, powerful and robust tokenizers can be built using the simple pattern-matching approach shown above. However, in languages such as Japanese, words are not separated by whitespace, which makes tokenization substantially more challenging. Try, for example, to find a good *generic* regular expression pattern to tokenize the following sentence (\"He loves listening to music.\"). The pattern below essentially just lists the words of this particular sentence; simply memorising a vocabulary is not a general solution either, since new text will always contain words we have not seen before." ] },
 { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['彼', 'は', '音楽', 'を', '聞くの', 'が', '大好き', 'です']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jap = \"彼は音楽を聞くのが大好きです\"\n", "re.compile('彼|は|く|音楽|を|聞くの|が|大好き|です').findall(jap)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "Even in certain English domains, such as biomedical papers, tokenization is non-trivial (see an analysis of why [here](https://aclweb.org/anthology/W/W15/W15-2605.pdf)).\n", "\n", "When tokenization is too challenging to capture in a few rules, a machine-learning-based approach can be useful. In a nutshell, we can treat tokenization as a character classification problem: for each position in the text, a classifier decides whether a token boundary occurs there. If needed, this can be cast more generally as a sequential labelling problem." ] },
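 { "cell_type": "markdown", "metadata": {}, "source": [ "To make this framing concrete, the cell below is a toy sketch (not part of the `statnlpbook.tokenization` module) that treats tokenization as character classification: it memorises, for every pair of adjacent characters, whether a token boundary tends to occur between them, using the sentence we tokenized with the hand-written regular expression above as training data. A real learned tokenizer would use richer features and a proper classifier or sequence model, but the underlying formulation is the same." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "def train_boundary_classifier(text, tokens):\n", "    # Recover the gold boundary positions from the gold token sequence.\n", "    gold = set()\n", "    pos = 0\n", "    for t in tokens:\n", "        pos = text.index(t, pos)\n", "        gold.add(pos)                        # boundary before the token\n", "        pos += len(t)\n", "        gold.add(pos)                        # boundary after the token\n", "    # For every pair of adjacent characters, count how often a token boundary\n", "    # falls between them -- a very crude character-level classifier.\n", "    counts = defaultdict(lambda: [0, 0])     # (left, right) -> [no boundary, boundary]\n", "    for i in range(1, len(text)):\n", "        counts[text[i - 1], text[i]][i in gold] += 1\n", "    return {pair for pair, (no, yes) in counts.items() if yes > no}\n", "\n", "def classify_tokenize(text, boundary_pairs):\n", "    # Split the text at every position whose surrounding character pair was\n", "    # classified as a boundary, then drop whitespace-only tokens.\n", "    result, start = [], 0\n", "    for i in range(1, len(text)):\n", "        if (text[i - 1], text[i]) in boundary_pairs:\n", "            result.append(text[start:i])\n", "            start = i\n", "    result.append(text[start:])\n", "    return [t for t in result if t.strip()]\n", "\n", "# Train on the sentence we tokenized with the hand-written regex above ...\n", "boundary_pairs = train_boundary_classifier(text, tokens)\n", "# ... and apply the learned boundaries to a new sentence. On this toy example\n", "# the result should be ['Mr.', 'Bob', 'Dobolina', 'has', 'a', 'plan', '.'].\n", "classify_tokenize(\"Mr. Bob Dobolina has a plan.\", boundary_pairs)" ] },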
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Mr.',\n", " 'Bob',\n", " 'Dobolina',\n", " 'is',\n", " \"thinkin'\",\n", " 'of',\n", " 'a',\n", " 'master',\n", " 'plan',\n", " '.'],\n", " ['Why', \"doesn't\", 'he', 'quit', '?']]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tok.sentence_segment(re.compile('\\.'), tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Background Reading\n", "\n", "* Jurafsky & Martin, Speech and Language Processing: Chapter 2, Regular Expressions and Automata.\n", "* Manning, Raghavan & Schuetze, Introduction to Information Retrieval: [Tokenization](http://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }