{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Stopwords and word frequency\n",
    "\n",
    "[![Open In Colab](colab-badge.svg)](https://colab.research.google.com/github/alexisperrier/intro2nlp/blob/master/notebooks/intro2nlp_02_remove_stop_words.ipynb)\n",
    "\n",
    "\n",
    "Let's count the frequency of the words in the Wikipedia Earth page\n",
    "\n",
    "We use the same function to download content from wikipedia\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if you haven't done so already, install the wordcloud library\n",
    "!pip install wordcloud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "def wikipedia_page(title):\n",
    "    '''\n",
    "    This function returns the raw text of a wikipedia page \n",
    "    given a wikipedia page title\n",
    "    '''\n",
    "    params = { \n",
    "        'action': 'query', \n",
    "        'format': 'json', # request json formatted content\n",
    "        'titles': title, # title of the wikipedia page\n",
    "        'prop': 'extracts', \n",
    "        'explaintext': True\n",
    "    }\n",
    "    # send a request to the wikipedia api \n",
    "    response = requests.get(\n",
    "         'https://en.wikipedia.org/w/api.php',\n",
    "         params= params\n",
    "     ).json()\n",
    "\n",
    "    # Parse the result\n",
    "    page = next(iter(response['query']['pages'].values()))\n",
    "    # return the page content \n",
    "    if 'extract' in page.keys():\n",
    "        return page['extract']\n",
    "    else:\n",
    "        return \"Page not found\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = wikipedia_page('Earth').lower()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we count the number of times each word is present in the text.\n",
    "\n",
    "First we split the text over whitespaces and then use the Counter class to find the 20 most common words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "# we transform the text into a list of words \n",
    "# by splitting over the space character ' '\n",
    "word_list = text.split(' ')\n",
    "# and count the words\n",
    "word_counts = Counter(word_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "word_counts.most_common(20)"
   ]
  },
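  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Splitting on a single space leaves punctuation and newlines attached to the words, so `earth`, `earth.` and `earth,` are counted as different tokens. As a minimal sketch of one alternative (not part of the workflow above), the cell below extracts runs of letters with a regular expression before counting. The `tokenize` helper and the `sample` string are illustrative names, and the pattern `[a-z]+` is an assumption that only suits lowercased English text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from collections import Counter\n",
    "\n",
    "def tokenize(text):\n",
    "    # hypothetical helper: keep only runs of lowercase ASCII letters,\n",
    "    # which drops punctuation, digits and newlines\n",
    "    return re.findall(r'[a-z]+', text.lower())\n",
    "\n",
    "# small stand-in text; replace it with the notebook's `text` variable\n",
    "sample = \"Earth is the third planet.\\nEarth, the planet, orbits the Sun.\"\n",
    "\n",
    "# counts after splitting on spaces: punctuation stays glued to the words\n",
    "print(Counter(sample.lower().split(' ')).most_common(3))\n",
    "# counts after regex tokenization: variants of the same word are merged\n",
    "print(Counter(tokenize(sample)).most_common(3))"
   ]
  },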
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's remove the stopwords\n",
    "\n",
    "We define a list of stopwords, frequent words that are mostly meaningless and remove them from the text. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform the text into a list of words\n",
    "words_list = text.split(' ')\n",
    "# define the list of words you want to remove from the text\n",
    "stopwords = ['the', 'of', 'and', 'is','to','in','a','from','by','that', 'with', 'this', 'as', 'an', 'are','its', 'at', 'for']\n",
    "# use a python list comprehension to remove the stopwords from words_list\n",
    "words_without_stopwords = [ word for word in words_list if word not in stopwords ]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get a very different list of frequent words. Much more relevant to the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Counter(words_without_stopwords).most_common(20)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can generate a wordcloud on the text with the stopwords removed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from wordcloud import WordCloud\n",
    "# Instantiate a new wordcloud.\n",
    "wordcloud = WordCloud(\n",
    "        random_state = 8,\n",
    "        normalize_plurals = False,\n",
    "        width = 600, \n",
    "        height= 300,\n",
    "        max_words = 300,\n",
    "        stopwords = [])\n",
    "\n",
    "# Transform the list of words back into a string \n",
    "text_without_stopwords  = ' '.join(words_without_stopwords)\n",
    "\n",
    "# Apply the wordcloud to the text.\n",
    "wordcloud.generate(text_without_stopwords)\n",
    "\n",
    "# And plot\n",
    "import matplotlib.pyplot as plt\n",
    "fig, ax = plt.subplots(1,1, figsize = (9,6))\n",
    "plt.imshow(wordcloud, interpolation='bilinear')\n",
    "plt.axis(\"off\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# WordCloud stopwords\n",
    "\n",
    "Wordcloud comes with its own predefined list of stopwords\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(f\"Wordcloud has {len(WordCloud().stopwords)} stopwords:\") \n",
    "print()\n",
    "print(list(WordCloud().stopwords))\n"
   ]
  },
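  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of how this built-in list can be used outside of a word cloud, the cell below filters a list of words with `STOPWORDS`, the set exported by the `wordcloud` package. The `sample_words` list and the extra words added to the set are made up for illustration. Storing the stopwords in a set keeps each membership test fast, which helps on long texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from wordcloud import STOPWORDS\n",
    "\n",
    "# combine the built-in set with a few custom words (illustrative choice)\n",
    "custom_stopwords = set(STOPWORDS) | {'also', 'one'}\n",
    "\n",
    "# stand-in for the notebook's words_list\n",
    "sample_words = ['the', 'earth', 'is', 'also', 'one', 'planet']\n",
    "\n",
    "# keep only the words that are not in the combined stopword set\n",
    "filtered = [word for word in sample_words if word not in custom_stopwords]\n",
    "print(filtered)"
   ]
  },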
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# More exhaustive lists of stopwords\n",
    "\n",
    "You don't need to use default lists of stopwords. There a couple of github repos that offer much more exhaustive lists.\n",
    "\n",
    "For instance \n",
    "\n",
    "* https://github.com/Alir3z4/stop-words has stopwords in several languages\n",
    "* https://github.com/igorbrigadir/stopwords lists several sources of stopwords. \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Zipf's law\n",
    "\n",
    "Looking at the frequency of words from the original text we notice a pattern.\n",
    "\n",
    "The frequency of the nth word is roughly proportional to 1/n. The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. \n",
    "\n",
    "Let's calculate the observed relative frequency of a token: \n",
    "\n",
    "    occurence of the token / occurence of \"the\" \n",
    "\n",
    "where \"the\" is the most common token\n",
    "\n",
    "and compare it to the inverse of the rank of the token "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "text = wikipedia_page('Earth').lower()\n",
    "word_list = text.split(' ')\n",
    "word_counts = Counter(word_list).most_common(10)\n",
    "\n",
    "maxfreq = word_counts[0][1]\n",
    "print(f\" rank word  observed frequency ~= Zipf frequency\")\n",
    "\n",
    "for i in range(10):\n",
    "    print(f\"{i+1:4}) {word_counts[i][0]:10} freq: {np.round(word_counts[i][1] / word_counts[0][1],2):5} ~= {np.round(1/(i+1),2)}\")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Zipf's law is an empirical law observed in multiple domains. The [wikipedia article](https://en.wikipedia.org/wiki/Zipf%27s_law) explains it all."
   ]
  },
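  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One common way to check Zipf's law visually is to plot frequency against rank on log-log axes: if frequency is proportional to 1/rank, the points fall close to a straight line with slope -1. The cell below is a minimal sketch of that check. The helper name `rank_frequency` is illustrative, and the short `sample` list is only a placeholder that is far too small to show the pattern; pass the notebook's `word_list` instead to see it on the Earth page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def rank_frequency(words):\n",
    "    # hypothetical helper: return ranks 1..n and the matching counts,\n",
    "    # sorted from the most to the least frequent word\n",
    "    counts = [count for _, count in Counter(words).most_common()]\n",
    "    ranks = list(range(1, len(counts) + 1))\n",
    "    return ranks, counts\n",
    "\n",
    "# placeholder data; use word_list from the cells above for the real text\n",
    "sample = \"the earth and the moon and the sun\".split(' ')\n",
    "ranks, counts = rank_frequency(sample)\n",
    "\n",
    "fig, ax = plt.subplots(1, 1, figsize=(6, 4))\n",
    "# observed counts as points\n",
    "ax.loglog(ranks, counts, marker='.', linestyle='none', label='observed')\n",
    "# ideal Zipf curve: count of the top word divided by the rank\n",
    "ax.loglog(ranks, [counts[0] / r for r in ranks], label='1/rank')\n",
    "ax.set_xlabel('rank')\n",
    "ax.set_ylabel('frequency')\n",
    "ax.legend()\n",
    "plt.show()"
   ]
  },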
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}