{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Stemming and Lemmatization\n", "\n", "[![Open In Colab](colab-badge.svg)](https://colab.research.google.com/github/alexisperrier/intro2nlp/blob/master/notebooks/intro2nlp_04_lemmatization_stemming.ipynb)\n", "\n", "[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the algorithmic process of determining the lemma of a word based on its intended meaning.\n", "\n", "[Stemming](https://en.wikipedia.org/wiki/Stemming) is the process of reducing inflected or sometimes derived words to their word stem, base or root form\n", "\n", "Let's see how stemming works on the Wikipedia Earth's page\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "def wikipedia_page(title):\n", " '''\n", " This function returns the raw text of a wikipedia page \n", " given a wikipedia page title\n", " '''\n", " params = { \n", " 'action': 'query', \n", " 'format': 'json', # request json formatted content\n", " 'titles': title, # title of the wikipedia page\n", " 'prop': 'extracts', \n", " 'explaintext': True\n", " }\n", " # send a request to the wikipedia api \n", " response = requests.get(\n", " 'https://en.wikipedia.org/w/api.php',\n", " params= params\n", " ).json()\n", "\n", " # Parse the result\n", " page = next(iter(response['query']['pages'].values()))\n", " # return the page content \n", " if 'extract' in page.keys():\n", " return page['extract']\n", " else:\n", " return \"Page not found\"\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install NLTK and download relevant resources if you haven't done so already\n", "!pip install nltk \n", "import nltk\n", "nltk.download('popular')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.tokenize import WordPunctTokenizer\n", "from nltk.stem import PorterStemmer\n", "from nltk.corpus import stopwords\n", "\n", "# Get the text, for instance from Wikipedia. \n", "text = wikipedia_page('Earth').lower()\n", "\n", "# Tokenize and remove stopwords\n", "tokens = WordPunctTokenizer().tokenize(text)\n", "tokens = [tk for tk in tokens if tk not in stopwords.words('english')]\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Instantiate a stemmer\n", "ps = PorterStemmer()\n", "\n", "# and stem\n", "stems = [ps.stem(tk) for tk in tokens ]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# look at a random selection of stemmed tokens\n", "import numpy as np\n", "for i in range(5):\n", " print()\n", " print(np.random.choice(stems, size = 10))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your results will differ but we see that some words are brutally truncated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lemmatize with spacy\n", "Since stemming can be brutal, we need a smarter way to reduce the number of forms of words.\n", "Lemmatization reduces a word to its lemma. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Lemmatize with spaCy\n", "Since stemming can be brutal, we need a smarter way to reduce the number of forms of a word.\n", "Lemmatization reduces a word to its lemma, which is the form you would find in a dictionary.\n", "\n", "Let's see how we can tokenize and lemmatize with the library [spacy.io](https://spacy.io/).\n", "\n", "See https://spacy.io/usage for how to install spaCy and download its models.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# install spacy\n", "!pip install -U spacy\n", "!python -m spacy download en_core_web_sm" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import the library\n", "import spacy\n", "\n", "# load the small English model\n", "nlp = spacy.load(\"en_core_web_sm\")\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Tokenization\n", "Tokenization works right out of the box.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# parse a text\n", "doc = nlp(\"Roads? Where we’re going we don’t need roads!\")\n", "\n", "for token in doc:\n", "    print(token)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Lemmatization\n", "Lemmatization also works right out of the box.\n", "\n", "The lemma of a token is directly available via `token.lemma_`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "doc = nlp(\"I came in and met with her teammates at the meeting.\")\n", "\n", "print(f\"{'Token':10}\\t Lemma \")\n", "print(f\"{'-------':10}\\t ------- \")\n", "for token in doc:\n", "    print(f\"{token.text:10}\\t {token.lemma_} \")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the verb \"met\" was correctly lemmatized to \"meet\", while the noun \"meeting\" was left as \"meeting\". The lemmatization of a word depends on its context and its grammatical role." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Form detection\n", "spaCy offers many other functions, including some handy word characterization methods:\n", "\n", "- is_space\n", "- is_punct\n", "- is_upper\n", "- is_digit\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "doc = nlp(\"All aboard! \\t Train NXH123 departs from platform 22 at 3:16 sharp.\")\n", "\n", "print(\"token \\t\\tspace? \\tpunct?\\tupper?\\tdigit?\")\n", "\n", "for token in doc:\n", "    print(f\"{str(token):10} \\t{token.is_space} \\t{token.is_punct} \\t{token.is_upper} \\t{token.is_digit}\")\n" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }