{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Frequencies of words in novels: a Data Science pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this code-along session, you will use some basic Natural Language Processing to plot the most frequently occurring words in the novel _Moby Dick_. In doing so, you'll also see the efficacy of thinking in terms of the following Data Science pipeline with a constant regard for process:\n",
"1. State your question;\n",
"2. Get your data;\n",
"3. Wrangle your data to answer your question;\n",
"4. Answer your question;\n",
"5. Present your solution so that others can understand it.\n",
"\n",
"For example, what would the following word frequency distribution be from?\n",
"\n",
"_(Figure: a word frequency distribution plot.)_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pre-steps\n",
"\n",
"Follow the instructions in the README.md to get your system set up and ready to go."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. State your question\n",
"\n",
"What are the most frequent words in the novel _Moby Dick_ and how often do they occur?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Get your data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your raw data is the text of Melville's novel _Moby Dick_. We can find it at [Project Gutenberg](https://www.gutenberg.org/). \n",
"\n",
"**TO DO:** Head there, find _Moby Dick_ and then store the relevant url in your Python namespace:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Store url\n",
"url = 'https://www.gutenberg.org/files/2701/2701-h/2701-h.htm'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"You're going to use [`requests`](http://docs.python-requests.org/en/master/) to get the web data.\n",
"You can find out more in DataCamp's [Importing Data in Python (Part 2) course](https://www.datacamp.com/courses/importing-data-in-python-part-2). \n",
"\n",
"\n",
"According to the `requests` package website:\n",
"\n",
"> Requests is one of the most downloaded Python packages of all time, pulling in over 13,000,000 downloads every month. All the cool kids are doing it!\n",
"\n",
"You'll be making a `GET` request to the website, which means you're _getting_ data from it. `requests` makes this easy with its `get()` function. \n",
"\n",
"**TO DO:** Make the request here and check the object type returned."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"requests.models.Response"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Import `requests`\n",
"import requests\n",
"\n",
"# Make the request and check object type\n",
"r = requests.get(url)\n",
"type(r)\n"
]
},
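{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before extracting anything from `r`, it's worth confirming that the request actually succeeded. A quick sketch using standard `requests` attributes -- a status code of 200 means OK:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check that the request succeeded: 200 means OK\n",
"print(r.status_code)\n",
"\n",
"# raise_for_status() raises an exception on a 4xx/5xx response\n",
"r.raise_for_status()"
]
},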
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a `Response` object. You can see in the [`requests` quickstart guide](http://docs.python-requests.org/en/master/user/quickstart/) that a `Response` object has an attribute `text` that allows you to get the HTML from it! \n",
"\n",
"**TO DO:** Get the HTML and print the HTML to check it out:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Extract the HTML from the Response object\n",
"html = r.text\n",
"\n",
"# Print just the first 500 characters -- the full page is very long\n",
"print(html[:500])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK! This HTML is not quite what you want. However, it does _contain_ what you want: the text of _Moby Dick_. What you need to do now is _wrangle_ this HTML to extract the novel. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Recap:** \n",
"\n",
"* You have now scraped the web to get _Moby Dick_ from Project Gutenberg.\n",
"\n",
"**Up next:** it's time to parse the HTML and extract the text of the novel."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Wrangle your data to answer the question"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Part 1: getting the text from the HTML"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Here you'll use the package [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/). According to its documentation, Beautiful Soup is a Python library for pulling data out of HTML and XML files.\n",
"\n",
"**TO DO:** Create a `BeautifulSoup` object from the HTML."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bs4.BeautifulSoup"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Import BeautifulSoup from bs4\n",
"from bs4 import BeautifulSoup\n",
"\n",
"\n",
"# Create a BeautifulSoup object from the HTML\n",
"soup = BeautifulSoup(html, \"html5lib\")\n",
"type(soup)\n"
]
},
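{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on the second argument: `\"html5lib\"` names the parser. If `html5lib` isn't installed on your system, Python's built-in `\"html.parser\"` is a dependency-free alternative (a sketch; parsers can differ slightly on malformed HTML):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The built-in parser works without installing html5lib\n",
"soup_alt = BeautifulSoup(html, \"html.parser\")\n",
"type(soup_alt)"
]
},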
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From these soup objects, you can extract all kinds of interesting information about the website you're scraping, such as its title:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"