{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Stemming and Lemmatization\n",
    "\n",
    "[![Open In Colab](colab-badge.svg)](https://colab.research.google.com/github/alexisperrier/intro2nlp/blob/master/notebooks/intro2nlp_04_lemmatization_stemming.ipynb)\n",
    "\n",
    "[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the algorithmic process of determining the lemma of a word based on its intended meaning.\n",
    "\n",
    "[Stemming](https://en.wikipedia.org/wiki/Stemming) is the process of reducing inflected or sometimes derived words to their word stem, base or root form\n",
    "\n",
    "Let's see how stemming works on the Wikipedia Earth's page\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "def wikipedia_page(title):\n",
    "    '''\n",
    "    This function returns the raw text of a wikipedia page \n",
    "    given a wikipedia page title\n",
    "    '''\n",
    "    params = { \n",
    "        'action': 'query', \n",
    "        'format': 'json', # request json formatted content\n",
    "        'titles': title, # title of the wikipedia page\n",
    "        'prop': 'extracts', \n",
    "        'explaintext': True\n",
    "    }\n",
    "    # send a request to the wikipedia api \n",
    "    response = requests.get(\n",
    "         'https://en.wikipedia.org/w/api.php',\n",
    "         params= params\n",
    "     ).json()\n",
    "\n",
    "    # Parse the result\n",
    "    page = next(iter(response['query']['pages'].values()))\n",
    "    # return the page content \n",
    "    if 'extract' in page.keys():\n",
    "        return page['extract']\n",
    "    else:\n",
    "        return \"Page not found\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install NLTK and download relevant resources if you haven't done so already\n",
    "!pip install nltk \n",
    "import nltk\n",
    "nltk.download('popular')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.tokenize import WordPunctTokenizer\n",
    "from nltk.stem import PorterStemmer\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "# Get the text, for instance from Wikipedia. \n",
    "text    = wikipedia_page('Earth').lower()\n",
    "\n",
    "# Tokenize and remove stopwords\n",
    "tokens  = WordPunctTokenizer().tokenize(text)\n",
    "tokens = [tk for tk in tokens if tk not in stopwords.words('english')]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Instantiate a stemmer\n",
    "ps      = PorterStemmer()\n",
    "\n",
    "# and stem\n",
    "stems   = [ps.stem(tk) for tk in tokens ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# look at a random selection of stemmed tokens\n",
    "import numpy as np\n",
    "for i in range(5):\n",
    "    print()\n",
    "    print(np.random.choice(stems, size = 10))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your results will differ but we see that some words are brutally truncated."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lemmatize with spacy\n",
    "Since stemming can be brutal, we need a smarter way to reduce the number of forms of words.\n",
    "Lemmatization reduces a word to its lemma. And the lemma is the word form you would find in a dictionary.\n",
    "\n",
    "Let's see how we can tokenize and lemmatize with the library [spacy.io](https://spacy.io/)\n",
    "\n",
    "\n",
    "see this page to install spacy: https://spacy.io/usage and download the models\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# install spacy\n",
    "!pip install -U spacy\n",
    "!python -m spacy download en_core_web_sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import the library\n",
    "import spacy\n",
    "\n",
    "# load the small English model\n",
    "nlp = spacy.load(\"en_core_web_sm\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tokenization\n",
    "Right out of the box\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# parse a text\n",
    "doc = nlp(\"Roads? Where we’re going we don’t need roads!\")\n",
    "\n",
    "for token in doc:\n",
    "    print(token)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lemmatization\n",
    "Also right out of the box\n",
    "\n",
    "The lemma of a token is directly available via ```token.lemma_```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load(\"en_core_web_sm\")\n",
    "doc = nlp(\"I came in and met with her teammates at the meeting.\")\n",
    "\n",
    "print(f\"{'Token':10}\\t Lemma \")\n",
    "print(f\"{'-------':10}\\t ------- \")\n",
    "for token in doc:\n",
    "    print(f\"{token.text:10}\\t {token.lemma_} \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how the word \"met\" was correctly lemmatized to \"meet\" while the noun \"meeting\" remained lemmatized to \"meeting\". Lemmatization of a word depends on its context and its grammatical role."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Form detection\n",
    "Spacy offers many other functions including some handy word caracterization methods\n",
    "\n",
    "- is_space\n",
    "- is_punct\n",
    "- is_upper\n",
    "- is_digit\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load(\"en_core_web_sm\")\n",
    "doc = nlp(\"All aboard! \\t Train NXH123 departs from platform 22 at 3:16 sharp.\")\n",
    "\n",
    "\n",
    "print(f\"token \\t\\tspace? \\tpunct?\\tupper?\\tdigit?\")\n",
    "\n",
    "token, token.is_space, token.is_punct, token.is_upper, token.is_digit\n",
    "\n",
    "for token in doc:\n",
    "    print(f\"{str(token):10} \\t{token.is_space} \\t{token.is_punct} \\t{token.is_upper} \\t{token.is_digit}\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}