{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create your first wordcloud\n",
    "\n",
    "[![Open In Colab](colab-badge.svg)](https://colab.research.google.com/github/alexisperrier/intro2nlp/blob/master/notebooks/intro2nlp_01_Create_a_wordcloud.ipynb)\n",
    "\n",
    "In this first chapter of the Intro to NLP course, you learn how to load a page from Wikipedia and create a wordcloud from its text.\n",
    "\n",
    "Let's start by installing the [wordcloud](https://github.com/amueller/word_cloud) library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install wordcloud"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Content from Wikipedia\n",
    "\n",
    "Wikipedia is a great source of quality text.\n",
    "We use the Wikipedia API to get the text of a page given its title.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1) import the necessary library\n",
    "import requests\n",
    "\n",
    "# 2) set the title of the page (change the string to load another page)\n",
    "title = 'Earth'\n",
    "\n",
    "# 3) send a request to the Wikipedia API,\n",
    "# asking for the plain-text content of the page, formatted as JSON\n",
    "response = requests.get(\n",
    "    'https://en.wikipedia.org/w/api.php',\n",
    "    params={\n",
    "        'action': 'query',\n",
    "        'format': 'json',\n",
    "        'titles': title,\n",
    "        'prop': 'extracts',\n",
    "        'explaintext': True,\n",
    "    }).json()\n",
    "\n",
    "# 4) Parse the result and extract the text\n",
    "page = next(iter(response['query']['pages'].values()))\n",
    "text = page['extract']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print the first 300 characters of the text\n",
    "print(text[:300])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's wrap the code that gets text from Wikipedia into a convenient function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "def wikipedia_page(title):\n",
    "    '''\n",
    "    Return the plain text of a Wikipedia page given its title,\n",
    "    or the string \"Page not found\" if the page has no extract.\n",
    "    '''\n",
    "    params = {\n",
    "        'action': 'query',\n",
    "        'format': 'json',     # request JSON-formatted content\n",
    "        'titles': title,      # title of the Wikipedia page\n",
    "        'prop': 'extracts',   # return the text extract of the page\n",
    "        'explaintext': True,  # plain text instead of HTML\n",
    "    }\n",
    "    # send a request to the Wikipedia API and decode the JSON response\n",
    "    response = requests.get(\n",
    "        'https://en.wikipedia.org/w/api.php',\n",
    "        params=params,\n",
    "    ).json()\n",
    "\n",
    "    # the result holds a single page, keyed by its page id\n",
    "    page = next(iter(response['query']['pages'].values()))\n",
    "    # return the page content, or a placeholder when the page is missing\n",
    "    return page.get('extract', 'Page not found')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We lowercase the text to avoid having to deal with uppercase and capitalized words\n",
    "text = wikipedia_page('Earth').lower()\n",
    "print(text) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create a wordcloud\n",
    "We use the [wordcloud](https://github.com/amueller/word_cloud) library.\n",
    "\n",
    "Modify the parameters to get different results (`width`, `height`, `max_words`, ...).\n",
    "\n",
    "The wordcloud library comes with its own list of stopwords. To disable it, we set the `stopwords` parameter to an empty list:\n",
    "\n",
    "            stopwords = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the wordcloud library\n",
    "from wordcloud import WordCloud\n",
    "# Instantiate a new wordcloud.\n",
    "wordcloud = WordCloud(\n",
    "        random_state = 8,\n",
    "        normalize_plurals = False,\n",
    "        width = 600, \n",
    "        height= 300,\n",
    "        max_words = 300,\n",
    "        stopwords = [])\n",
    "\n",
    "# Generate the wordcloud from the text.\n",
    "wordcloud.generate(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use matplotlib to display the wordcloud as an image:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "# create a figure\n",
    "fig, ax = plt.subplots(1,1, figsize = (9,6))\n",
    "# add interpolation = bilinear to smooth things out\n",
    "plt.imshow(wordcloud, interpolation='bilinear')\n",
    "# and remove the axis\n",
    "plt.axis(\"off\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We mostly see stopwords: _the_, _of_, _by_, _in_, etc.\n",
    "\n",
    "To get rid of these stopwords, we build a new wordcloud, this time leaving the `stopwords` parameter unset so that the library's default stopword list applies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordcloud = WordCloud(\n",
    "        random_state = 8,\n",
    "        normalize_plurals = False,\n",
    "        width = 800, \n",
    "        height= 400,\n",
    "        max_words = 300)\n",
    "wordcloud.generate(text)\n",
    "# plot\n",
    "fig, ax = plt.subplots(1,1, figsize = (9,6))\n",
    "plt.imshow(wordcloud, interpolation='bilinear')\n",
    "plt.axis(\"off\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This wordcloud is much more representative of the Earth Wikipedia page.\n",
    "\n",
    "Let's see what we get for another page, for instance [New York](https://en.wikipedia.org/wiki/New_York)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Text\n",
    "text = wikipedia_page('New_York').lower()\n",
    "# Wordcloud\n",
    "wordcloud = WordCloud(\n",
    "        random_state = 8,\n",
    "        normalize_plurals = False,\n",
    "        width = 800, \n",
    "        height= 400,\n",
    "        max_words = 400)\n",
    "wordcloud.generate(text)\n",
    "# plot\n",
    "fig, ax = plt.subplots(1,1, figsize = (9,6))\n",
    "plt.imshow(wordcloud, interpolation='bilinear')\n",
    "plt.axis(\"off\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "You get the gist :)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project Gutenberg\n",
    "\n",
    "[Project Gutenberg](https://www.gutenberg.org) is another great source of text.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "# this is the url for Frankenstein, by Mary Wollstonecraft Shelley\n",
    "frankenstein_url = 'https://www.gutenberg.org/files/84/84-0.txt'\n",
    "\n",
    "# this is the URL for Alice in Wonderland, by Lewis Carroll\n",
    "alice_url = 'https://www.gutenberg.org/files/11/11-0.txt'\n",
    "\n",
    "# get the text from Alice in Wonderland\n",
    "r = requests.get(alice_url)\n",
    "\n",
    "# remove the header, the footer and any non-ASCII characters\n",
    "text = ' '.join(r.text.split('***')[1:])\n",
    "text = text.split(\"END OF THE PROJECT GUTENBERG\")[0]\n",
    "text = text.encode('ascii', errors='ignore').decode('ascii')\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Wordcloud\n",
    "wordcloud = WordCloud(\n",
    "        random_state = 8,\n",
    "        normalize_plurals = True,\n",
    "        width = 800, \n",
    "        height= 400,\n",
    "        max_words = 400)\n",
    "wordcloud.generate(text)\n",
    "# plot\n",
    "fig, ax = plt.subplots(1,1, figsize = (9,6))\n",
    "plt.imshow(wordcloud, interpolation='bilinear')\n",
    "plt.axis(\"off\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# and the Frankenstein Wordcloud\n",
    "\n",
    "import requests\n",
    "# this is the url for Frankenstein, by Mary Wollstonecraft Shelley\n",
    "frankenstein_url = 'https://www.gutenberg.org/files/84/84-0.txt'\n",
    "\n",
    "# get the text of Frankenstein\n",
    "r = requests.get(frankenstein_url)\n",
    "\n",
    "# remove the header, the footer and any non-ASCII characters\n",
    "text = ' '.join(r.text.split('***')[1:])\n",
    "text = text.split(\"END OF THE PROJECT GUTENBERG\")[0]\n",
    "text = text.encode('ascii', errors='ignore').decode('ascii')\n",
    "print(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Wordcloud\n",
    "wordcloud = WordCloud(\n",
    "        random_state = 8,\n",
    "        normalize_plurals = True,\n",
    "        width = 800, \n",
    "        height= 400,\n",
    "        max_words = 400)\n",
    "wordcloud.generate(text)\n",
    "# plot\n",
    "fig, ax = plt.subplots(1,1, figsize = (9,6))\n",
    "plt.imshow(wordcloud, interpolation='bilinear')\n",
    "plt.axis(\"off\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}