{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create your first wordcloud\n", "\n", "[![Open In Colab](colab-badge.svg)](https://colab.research.google.com/github/alexisperrier/intro2nlp/blob/master/notebooks/intro2nlp_01_Create_a_wordcloud.ipynb)\n", "\n", "In this first chapter of the Intro to NLP course, you learn how to load a page from Wikipedia and create a wordcloud.\n", "\n", "Let's start by installing the [wordcloud](https://github.com/amueller/word_cloud) library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install wordcloud" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Content from Wikipedia\n", "\n", "Wikipedia is a great source of quality text.\n", "We use the Wikipedia API to get the text of a page given its title.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1) import the necessary library\n", "import requests\n", "\n", "# 2) set the title of the page (change it to load another page)\n", "title = 'Earth'\n", "\n", "# 3) send a request to the Wikipedia API,\n", "# asking it to return the content of the page formatted as JSON\n", "\n", "response = requests.get(\n", "    'https://en.wikipedia.org/w/api.php',\n", "    params={\n", "        'action': 'query',\n", "        'format': 'json',\n", "        'titles': title,\n", "        'prop': 'extracts',\n", "        'explaintext': True,\n", "    }).json()\n", "\n", "# 4) parse the result and extract the text\n", "page = next(iter(response['query']['pages'].values()))\n", "text = page['extract']\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print the first 300 characters of the text\n", "print(text[:300])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's wrap the code that gets the text of a Wikipedia page into a convenient function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import 
requests\n", "\n", "def wikipedia_page(title):\n", "    '''\n", "    Return the raw text of a Wikipedia page\n", "    given the title of the page.\n", "    '''\n", "    params = {\n", "        'action': 'query',\n", "        'format': 'json',  # request JSON-formatted content\n", "        'titles': title,  # title of the Wikipedia page\n", "        'prop': 'extracts',\n", "        'explaintext': True\n", "    }\n", "    # send a request to the Wikipedia API\n", "    response = requests.get(\n", "        'https://en.wikipedia.org/w/api.php',\n", "        params=params\n", "    ).json()\n", "\n", "    # parse the result\n", "    page = next(iter(response['query']['pages'].values()))\n", "    # return the page content if the page exists\n", "    if 'extract' in page:\n", "        return page['extract']\n", "    else:\n", "        return \"Page not found\"\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We lowercase the text to avoid having to deal with uppercase and capitalized words\n", "text = wikipedia_page('Earth').lower()\n", "print(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Create a wordcloud\n", "We use the [wordcloud](https://github.com/amueller/word_cloud) library.\n", "\n", "Modify the parameters to get different results (width, height, max_words, ...).\n", "\n", "The wordcloud library comes with its own list of stopwords. 
To disable it, we set the list of stopwords to be empty:\n", "\n", "    stopwords = []" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import the wordcloud library\n", "from wordcloud import WordCloud\n", "# instantiate a new wordcloud\n", "wordcloud = WordCloud(\n", "    random_state = 8,\n", "    normalize_plurals = False,\n", "    width = 600,\n", "    height = 300,\n", "    max_words = 300,\n", "    stopwords = [])\n", "\n", "# generate the wordcloud from the text\n", "wordcloud.generate(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use matplotlib to display the wordcloud as an image:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "# create a figure\n", "fig, ax = plt.subplots(1,1, figsize = (9,6))\n", "# add interpolation = bilinear to smooth things out\n", "plt.imshow(wordcloud, interpolation='bilinear')\n", "# and remove the axis\n", "plt.axis(\"off\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We mostly see stopwords: _the_, _of_, _by_, _in_, etc.\n", "\n", "To get rid of these stopwords, we build a new wordcloud, this time without setting the stopwords parameter to an empty list, so that the library's default list of stopwords applies." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcloud = WordCloud(\n", "    random_state = 8,\n", "    normalize_plurals = False,\n", "    width = 800,\n", "    height = 400,\n", "    max_words = 300)\n", "wordcloud.generate(text)\n", "# plot\n", "fig, ax = plt.subplots(1,1, figsize = (9,6))\n", "plt.imshow(wordcloud, interpolation='bilinear')\n", "plt.axis(\"off\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This wordcloud is much more representative of the Earth Wikipedia page.\n", "\n", "Let's see what we get for another page, 
for instance [New York](https://en.wikipedia.org/wiki/New_York)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Text\n", "text = wikipedia_page('New_York').lower()\n", "# Wordcloud\n", "wordcloud = WordCloud(\n", "    random_state = 8,\n", "    normalize_plurals = False,\n", "    width = 800,\n", "    height = 400,\n", "    max_words = 400)\n", "wordcloud.generate(text)\n", "# plot\n", "fig, ax = plt.subplots(1,1, figsize = (9,6))\n", "plt.imshow(wordcloud, interpolation='bilinear')\n", "plt.axis(\"off\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You get the gist :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Project Gutenberg\n", "\n", "[Project Gutenberg](https://www.gutenberg.org) is another great source of text.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "# this is the url for Frankenstein, by Mary Wollstonecraft Shelley\n", "frankenstein_url = 'https://www.gutenberg.org/files/84/84-0.txt'\n", "\n", "# this is the url for Alice in Wonderland, by Lewis Carroll\n", "alice_url = 'https://www.gutenberg.org/files/11/11-0.txt'\n", "\n", "# get the text from Alice in Wonderland\n", "r = requests.get(alice_url)\n", "\n", "# remove the header, the footer and the non-ASCII characters\n", "text = ' '.join(r.text.split('***')[1:])\n", "text = text.split(\"END OF THE PROJECT GUTENBERG\")[0]\n", "text = text.encode('ascii',errors='ignore').decode('utf-8')\n", "print(text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Wordcloud\n", "wordcloud = WordCloud(\n", "    random_state = 8,\n", "    normalize_plurals = True,\n", "    width = 800,\n", "    height = 400,\n", "    max_words = 400)\n", "wordcloud.generate(text)\n", "# plot\n", "fig, ax = plt.subplots(1,1, figsize = (9,6))\n", "plt.imshow(wordcloud, interpolation='bilinear')\n", "plt.axis(\"off\")" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and the Frankenstein wordcloud\n", "\n", "import requests\n", "# this is the url for Frankenstein, by Mary Wollstonecraft Shelley\n", "frankenstein_url = 'https://www.gutenberg.org/files/84/84-0.txt'\n", "\n", "# get the text from Frankenstein\n", "r = requests.get(frankenstein_url)\n", "\n", "# remove the header, the footer and the non-ASCII characters\n", "text = ' '.join(r.text.split('***')[1:])\n", "text = text.split(\"END OF THE PROJECT GUTENBERG\")[0]\n", "text = text.encode('ascii',errors='ignore').decode('utf-8')\n", "print(text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Wordcloud\n", "wordcloud = WordCloud(\n", "    random_state = 8,\n", "    normalize_plurals = True,\n", "    width = 800,\n", "    height = 400,\n", "    max_words = 400)\n", "wordcloud.generate(text)\n", "# plot\n", "fig, ax = plt.subplots(1,1, figsize = (9,6))\n", "plt.imshow(wordcloud, interpolation='bilinear')\n", "plt.axis(\"off\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }