{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lemmatization\n",
    "\n",
    "CLTK uses a backoff lemmatizer consisting of a chain of different lemmatizers:\n",
    "\n",
    "1) dictionary-based lemmatizer with high-frequency words\n",
    "\n",
    "2) a training-data-based lemmatizer based on 4,000 sentences from the [Perseus Latin Dependency Treebanks](https://perseusdl.github.io/treebank_data/)\n",
    "\n",
    "3) a regular-expression-based lemmatizer stripping word affixes \n",
    "\n",
    "4) a dictionary-based lemmatizer with the complete set of Morpheus lemmas\n",
    "\n",
    "5) an ‘identity’ lemmatizer returning the token itself as the lemma\n",
    "\n",
    "We'll later see some of those in detail.\n"
   ]
  },
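  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before turning to CLTK's ready-made chain, here is a minimal pure-Python sketch of the backoff idea (the `Sketch*` classes are made-up illustrations, not part of CLTK): each stage returns a lemma when it can and otherwise defers to the next stage in the chain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical sketch of backoff chaining; not CLTK's actual implementation.\n",
    "class SketchDictLemmatizer:\n",
    "    def __init__(self, lemmas, backoff=None):\n",
    "        self.lemmas = lemmas    # token -> lemma lookup table\n",
    "        self.backoff = backoff  # next lemmatizer in the chain\n",
    "\n",
    "    def lemmatize_token(self, token):\n",
    "        if token in self.lemmas:\n",
    "            return self.lemmas[token]\n",
    "        if self.backoff is not None:\n",
    "            return self.backoff.lemmatize_token(token)  # defer down the chain\n",
    "        return None\n",
    "\n",
    "class SketchIdentityLemmatizer:\n",
    "    def lemmatize_token(self, token):\n",
    "        return token  # last resort: the token is its own lemma\n",
    "\n",
    "chain = SketchDictLemmatizer({'mellis': 'mel'}, backoff=SketchIdentityLemmatizer())\n",
    "[chain.lemmatize_token(t) for t in ['mellis', 'ignotum']]"
   ]
  },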
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer = BackoffLatinLemmatizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from  data.word_tokenized_text import word_tokenized_text as text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatized = lemmatizer.lemmatize(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Conditi', 'Conditi'),\n",
       " ('paradoxi', 'paradoxus'),\n",
       " ('compositio', 'compositio'),\n",
       " ('mellis', 'mel'),\n",
       " ('pondo', 'pondo'),\n",
       " ('XV', 'XV'),\n",
       " ('in', 'in'),\n",
       " ('aeneum', 'aeneus'),\n",
       " ('vas', 'vas'),\n",
       " ('mittuntur', 'mitto'),\n",
       " ('praemissis', 'praemitto'),\n",
       " ('vini', 'vinum'),\n",
       " ('sextariis', 'sextarius'),\n",
       " ('duobus', 'duo'),\n",
       " ('ut', 'ut'),\n",
       " ('in', 'in'),\n",
       " ('coctura', 'coquo'),\n",
       " ('mellis', 'mel'),\n",
       " ('vinum', 'vinum'),\n",
       " ('decoquas', 'decoquo'),\n",
       " ('quod', 'qui'),\n",
       " ('igni', 'ignis'),\n",
       " ('lento', 'lentus'),\n",
       " ('et', 'et'),\n",
       " ('aridis', 'aridus'),\n",
       " ('lignis', 'lignum'),\n",
       " ('calefactum', 'calefacio'),\n",
       " ('commotum', 'commoveo'),\n",
       " ('ferula', 'ferula'),\n",
       " ('dum', 'dum')]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lemmatized[:30]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "line_lemmatized_text = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from data.line_tokenized_text import line_tokenized_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cltk.stem.latin.j_v import JVReplacer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "for line in line_tokenized_text:\n",
    "    _line = lemmatizer.lemmatize(JVReplacer().replace(line).split(\" \"))\n",
    "    line_lemmatized_text.append(_line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Conditi', 'Conditi'),\n",
       " ('paradoxi', 'paradoxus'),\n",
       " ('compositio', 'compositio'),\n",
       " ('mellis', 'mel'),\n",
       " ('pondo', 'pondo'),\n",
       " ('XU', 'XU'),\n",
       " ('in', 'in'),\n",
       " ('aeneum', 'aeneus'),\n",
       " ('uas', 'uas'),\n",
       " ('mittuntu', 'mittuntu'),\n",
       " ('praemissis', 'praemitto'),\n",
       " ('uini', 'uinum'),\n",
       " ('sextariis', 'sextarius'),\n",
       " ('duobu', 'duobu'),\n",
       " ('ut', 'ut'),\n",
       " ('in', 'in'),\n",
       " ('coctura', 'coquo'),\n",
       " ('mellis', 'mel'),\n",
       " ('uinum', 'uinum'),\n",
       " ('decoquas', 'decoquo')]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "line_lemmatized_text[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Detecting the ingredients in the text\n",
    "\n",
    "Suppose we want to detect which tokens of the text correspond to the ingredients needed for each recipe. We could in theory create a huge list of all the ingredients and simply check if each token is on the list. That however would mean the we'd have to also include each grammatical case and orthographic variation. However, using a lemmatizer we can restrict the list to the lemmata of the ingredients and simply check if the lemmatized token belongs to the list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "from data.ingredient_list import ingredients"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "ingredients = set(ingredients)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "ingredient_indices = []\n",
    "\n",
    "for i, lemma in enumerate(lemmatized):\n",
    "    if lemma[1] in ingredients:\n",
    "        ingredient_indices.append(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[3,\n",
       " 11,\n",
       " 17,\n",
       " 18,\n",
       " 34,\n",
       " 83,\n",
       " 88,\n",
       " 102,\n",
       " 119,\n",
       " 122,\n",
       " 137,\n",
       " 140,\n",
       " 147,\n",
       " 152,\n",
       " 185,\n",
       " 206,\n",
       " 214,\n",
       " 221,\n",
       " 234,\n",
       " 247,\n",
       " 256,\n",
       " 280,\n",
       " 307,\n",
       " 314,\n",
       " 322,\n",
       " 334,\n",
       " 357,\n",
       " 360,\n",
       " 388,\n",
       " 394]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ingredient_indices[:30]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we check how many of the first 100 tokens are ingredients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3 ('mellis', 'mel')\n",
      "11 ('vini', 'vinum')\n",
      "17 ('mellis', 'mel')\n",
      "18 ('vinum', 'vinum')\n",
      "34 ('vini', 'vinum')\n",
      "83 ('vino', 'vinum')\n",
      "88 ('vini', 'vinum')\n"
     ]
    }
   ],
   "source": [
    "for i, lemma in enumerate(lemmatized[:100]):\n",
    "    if lemma[1] in ingredients:\n",
    "        print(i, lemmatized[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train your own lemmatizer\n",
    "\n",
    "If you have a collection of tagged lemmata you can also train your own lemmatizers based on nltks models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Identity Lemmatizer\n",
    "The identity lemmatizer simply returns the token itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cltk.lemmatize.backoff import IdentityLemmatizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer_identity = IdentityLemmatizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('mellis', 'mellis'), ('vino', 'vino')]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lemmatizer_identity.lemmatize(['mellis', 'vino'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dictionary Lemmatizer\n",
    "The dictionary looks up the lemma on a user-defined dictionary. It is suitable for commonly occuring tokens (mainly words with irregular inflection). You can also define a \"backoff\" lemmatizer which your lemmatizer calls in case of undefined tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cltk.lemmatize.backoff import DictLemmatizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmata_dict = {'mellis': 'mel', 'vino': 'vinum'}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer_dict = DictLemmatizer(lemmas=lemmata_dict, backoff=lemmatizer_identity)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('mellis', 'mel'), ('vino', 'vinum'), ('vini', 'vini')]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lemmatizer_dict.lemmatize(['mellis', 'vino', 'vini'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unigram Lemmatizer\n",
    "The unigram lemmatizer accepts a tagged list of lemmata and returns the lemma with the highest frequency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cltk.lemmatize.backoff import UnigramLemmatizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data = [[('dactylum', 'dactylus'), ('dactylum', 'dactilus'), ('dactylum', 'dactylus')]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer_unigram = UnigramLemmatizer(train_data, backoff=lemmatizer_dict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('dactylum', 'dactylus')]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lemmatizer_unigram.lemmatize(['dactylum'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regular-expression Lemmatizer\n",
    "You can also create a regex rule-based lemmatizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cltk.lemmatize.backoff import RegexpLemmatizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "regexps = [\n",
    "    (r'^(.+)(a|is|ii)$', r'\\1um')\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer_regexp = RegexpLemmatizer(regexps=regexps, backoff=lemmatizer_unigram)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('vinis', 'vinum'), ('mella', 'mellum'), ('dactylum', 'dactylus')]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lemmatizer_regexp.lemmatize(['vinis', 'mella', 'dactylum'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Your Turn 💡 \n",
    "\n",
    "It is very common that the same words can be spelled differently even in the same text. Write a simple program to detect words in the ingredient list that may have up to one different letter (bonus point for including inserted or deleted characters).\n",
    "\n",
    "#### Example:\n",
    "\n",
    "``>> isIngredient(\"daktylus\")``\n",
    "``>> True``\n",
    "\n",
    "``>> isIngredient(\"dactyllus\")``\n",
    "``>> True``\n",
    "\n",
    "``>> isIngredient(\"daktyllus\")``\n",
    "``>> False``"
   ]
  }
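  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### One possible solution sketch\n",
    "\n",
    "A sketch, not the only answer: it assumes `isIngredient` checks a token against the lemmata in the imported `ingredients` set (e.g. that 'dactylus' is among them), and treats two strings as matching when they differ by at most one substitution, insertion, or deletion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Possible solution sketch; assumes 'dactylus' is among the ingredient lemmata.\n",
    "def within_one_edit(a, b):\n",
    "    \"\"\"True if a and b differ by at most one substitution, insertion or deletion.\"\"\"\n",
    "    if abs(len(a) - len(b)) > 1:\n",
    "        return False\n",
    "    if len(a) > len(b):\n",
    "        a, b = b, a  # make a the shorter string\n",
    "    edits = i = j = 0\n",
    "    while i < len(a) and j < len(b):\n",
    "        if a[i] == b[j]:\n",
    "            i += 1\n",
    "            j += 1\n",
    "            continue\n",
    "        edits += 1\n",
    "        if edits > 1:\n",
    "            return False\n",
    "        if len(a) == len(b):\n",
    "            i += 1  # same length: the mismatch is a substitution\n",
    "        j += 1  # otherwise skip one character of the longer string\n",
    "    # a leftover character of the longer string counts as one more edit\n",
    "    return edits + (len(b) - j) <= 1\n",
    "\n",
    "def isIngredient(token):\n",
    "    return any(within_one_edit(token, ingredient) for ingredient in ingredients)\n",
    "\n",
    "isIngredient(\"daktylus\"), isIngredient(\"dactyllus\"), isIngredient(\"daktyllus\")"
   ]
  }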
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}