{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Frequencies of words in novels: a Data Science pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this code-along session, you will use some basic Natural Language Processing to plot the most frequently occurring words in the novel _Moby Dick_. In doing so, you'll also see the efficacy of thinking in terms of the following Data Science pipeline with a constant regard for process:\n", "1. State your question;\n", "2. Get your data;\n", "3. Wrangle your data to answer your question;\n", "4. Answer your question;\n", "5. Present your solution so that others can understand it.\n", "\n", "For example, what would the following word frequency distribution be from?\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pre-steps\n", "\n", "Follow the instructions in the README.md to get your system set up and ready to go." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. State your question\n", "\n", "What are the most frequent words in the novel _Moby Dick_ and how often do they occur?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Get your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your raw data is the text of Melville's novel _Moby Dick_. We can find it at [Project Gutenberg](https://www.gutenberg.org/). \n", "\n", "**TO DO:** Head there, find _Moby Dick_ and then store the relevant url in your Python namespace:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Store url\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You're going to use [`requests`](http://docs.python-requests.org/en/master/) to get the web data.\n", "You can find out more in DataCamp's [Importing Data in Python (Part 2) course](https://www.datacamp.com/courses/importing-data-in-python-part-2). \n", "\n", "\n", "\n", "According to the `requests` package website:\n", "\n", "> Requests is one of the most downloaded Python packages of all time, pulling in over 13,000,000 downloads every month. All the cool kids are doing it!\n", "\n", "You'll be making a `GET` request from the website, which means you're _getting_ data from it. `requests` make this easy with its `get` function. \n", "\n", "**TO DO:** Make the request here and check the object type returned." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import `requests`\n", "\n", "\n", "# Make the request and check object type\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a `Response` object. You can see in the [`requests` kickstart guide](http://docs.python-requests.org/en/master/user/quickstart/) that a `Response` object has an attribute `text` that allows you to get the HTML from it! \n", "\n", "**TO DO:** Get the HTML and print the HTML to check it out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Extract HTML from Response object and print\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK! This HTML is not quite what you want. However, it does _contain_ what you want: the text of _Moby Dick_. What you need to do now is _wrangle_ this HTML to extract the novel. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Recap:** \n", "\n", "* you have now scraped the web to get _Moby Dick_ from Project Gutenberg.\n", "\n", "**Up next:** it's time for you to parse the html and extract the text of the novel." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Wrangle your data to answer the question" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1: getting the text from the HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Here you'll use the package [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/). The package website says:\n", "\n", "\n", "\n", "\n", "**TO DO:** Create a `BeautifulSoup` object from the HTML." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import BeautifulSoup from bs4\n", "\n", "\n", "# Create a BeautifulSoup object from the HTML\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these soup objects, you can extract all types of interesting information about the website you're scraping, such as title:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get soup title\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or the title as a string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get soup title as string\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or all URLs found within a page’s < a > tags (hyperlinks):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get hyperlinks from soup and check out first 10\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What you want to do is to extract the text from the `soup` and there's a super helpful `.get_text()` method precisely for this. \n", "\n", "**TO DO:** Get the text, print it out and have a look at it. Is it what you want?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the text out of the soup and print it\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this is now nearly what you want. You'll need to do a bit more work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Recap:** \n", "\n", "* you have scraped the web to get _Moby Dick_ from Project Gutenberg;\n", "* you have also now parsed the html and extracted the text of the novel.\n", "\n", "**Up next:** you'll use Natural Language Processing, tokenization and regular expressions to extract the list of words in _Moby Dick_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2: Extract words from your text using NLP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll now use `nltk`, the Natural Language Toolkit, to\n", "\n", "1. Tokenize the text (fancy term for splitting into tokens, such as words);\n", "2. Remove stopwords (words such as 'a' and 'the' that occur a great deal in ~ nearly all English language texts.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 1: Tokenize\n", "\n", "You want to tokenize your text, that is, split it into a list a words.\n", "\n", "To do this, you're going to use a powerful tool called _regular expressions_, or _regex_.\n", "\n", "* Example: you have the string 'peter piper picked a peck of pickled peppers' and you want to extract from the list of _all_ words in it that start with a 'p'. \n", "\n", "The regular expression that matches all words beginning with 'p' is 'p\\w+'. Let's unpack this: \n", "\n", "* the 'p' at the beginning of the regular expression means that you'll only match sequences of characters that start with a 'p';\n", "* the '\\w' is a special character that will match any alphanumeric A-Z, a-z, 0-9, along with underscores;\n", "* The '+' tells you that the previous character in the regex can appear as many times as you want in strings that you're trying to match. This means that '\\w+' will match arbitrary sequences of alphanumeric characters and underscores.\n", "\n", "**You'll now use the built-in Python package `re` to extract all words beginning with 'p' from the sentence 'peter piper picked a peck of pickled peppers' as a warm-up.**\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import regex package\n", "\n", "\n", "\n", "# Define sentence\n", "\n", "\n", "# Define regex\n", "\n", "\n", "\n", "# Find all words in sentence that match the regex and print them\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks pretty good. Now, if 'p\\w+' is the regex that matches words beginning with 'p', what's the regex that matches all words?\n", "\n", "**It's your job to now do this for our toy Peter Piper sentence above.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Find all words and print them\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO:** use regex to get all the words in _Moby Dick_:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Find all words in Moby Dick and print several\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Recap:** \n", "\n", "* you have scraped the web to get _Moby Dick_ from Project Gutenberg;\n", "* you have parsed the html and extracted the text of the novel;\n", "* you have used tokenization and regular expressions to extract the list of words in _Moby Dick_.\n", "\n", "**Up next:** extract the list of words in _Moby Dick_ using `nltk`, the Natural Language Toolkit.\n", "\n", "Go get it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import RegexpTokenizer from nltk.tokenize\n", "\n", "\n", "# Create tokenizer\n", "\n", "\n", "\n", "# Create tokens\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO:** Create a list containing all the words in _Moby Dick_ such that all words contain only lower case letters. You'll find the string method `.lower()` handy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Initialize new list\n", "\n", "\n", "# Loop through list tokens and make lower case\n", "\n", "\n", "\n", "# Print several items from list as sanity check\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Recap:** \n", "\n", "* you have scraped the web to get _Moby Dick_ from Project Gutenberg;\n", "* you have parsed the html and extracted the text of the novel;\n", "* you have used tokenization and regular expressions to extract the list of words in _Moby Dick_.\n", "\n", "**Up next:** remove common words such as 'a' and 'the' from the list of words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 2: Remove stop words\n", "\n", "It is common practice to remove words that appear alot in the English language such as 'the', 'of' and 'a' (known as stopwords) because they're not so interesting. For more on all of these techniques, check out our [Natural Language Processing Fundamentals in Python course](https://www.datacamp.com/courses/nlp-fundamentals-in-python). \n", "\n", "The package `nltk` has a list of stopwords in English which you'll now store as `sw` and print the first several elements of:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import nltk\n", "\n", "\n", "\n", "# Get English stopwords and print some of them\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You want the list of all words in `words` that are *not* in `sw`. One way to get this list is to loop over all elements of `words` and add the to a new list if they are *not* in `sw`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "# Initialize new list\n", "\n", "\n", "\n", "# Add to words_ns all words that are in words but not in sw\n", "\n", "\n", "\n", "\n", "# Print several list items as sanity check\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Recap:** \n", "\n", "* you have scraped the web to get _Moby Dick_ from Project Gutenberg;\n", "* you have parsed the html and extracted the text of the novel;\n", "* you have used tokenization and regular expressions to extract the list of words in _Moby Dick_.\n", "* you have removed common words such as 'a' and 'the' from the list of words.\n", "\n", "**Up next:** plot the word frequency distribution of words in _Moby Dick_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Answer your question" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Our question was 'What are the most frequent words in the novel Moby Dick and how often do they occur?' \n", "\n", "You can now plot a frequency distribution of words in _Moby Dick_ in two line of code using `nltk`. To do this,\n", "\n", "* Create a frequency distribution object using the function `nltk.FreqDist()`;\n", "* Using the plot method of the resulting object." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Import datavis libraries\n", "\n", "\n", "\n", "# Figures inline and set visualization style\n", "\n", "\n", "\n", "# Create freq dist and plot\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Recap:** \n", "\n", "* you have scraped the web to get _Moby Dick_ from Project Gutenberg;\n", "* you have parsed the html and extracted the text of the novel;\n", "* you have used tokenization and regular expressions to extract the list of words in _Moby Dick_.\n", "* you have removed common words such as 'a' and 'the' from the list of words.\n", "* you have plotted the word frequency distribution of words in _Moby Dick_.\n", "\n", "**Up next:** adding more stopwords." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add more stop words" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import stopwords from sklearn\n", "\n", "\n", "# Add sklearn stopwords to words_sw\n", "\n", "\n", "# Initialize new list\n", "\n", "\n", "# Add to words_ns all words that are in words but not in sw\n", "\n", "\n", "\n", "\n", "# Print several list items as sanity check\n", "\n", "# Create freq dist and plot\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Present your solution so that others can understand it.\n", "\n", "The cool thing is that, in using `nltk` to answer our question, we actually already presented our solution in a manner that can be communicated to other: a frequency distribution plot! You can read off the most common words, along with their frequency. For example, 'whale' is the most common word in the novel (go figure), excepting stopwords, and it occurs a whopping >1200 times! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "## BONUS MATERIAL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you have seen that there are lots of novels on Project Gutenberg we can make these word frequency distributions of, it makes sense to write your own function that does all of this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }