{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Plotting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ElementsOfDataScience/blob/master/06_plotting.ipynb) or\n",
    "[click here to download it](https://github.com/AllenDowney/ElementsOfDataScience/raw/master/06_plotting.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This chapter presents ways to create figures and graphs, more generally called **data visualizations**.  As examples, we'll generate three figures:\n",
    "\n",
    "* We'll replicate a figure from the Pew Research Center that shows changes in religious affiliation in the U.S. over time.\n",
    "\n",
    "* We'll replicate the figure from *The Economist* that shows the prices of sandwiches in Boston and London (we saw this data back in Chapter 3).\n",
    "\n",
    "* We'll make a plot to test Zipf's law, which describes the relationship between word frequencies and their ranks.\n",
    "\n",
    "With the tools in this chapter, you can generate a variety of simple graphs.  We will see more visualization tools in later chapters.\n",
    "But before we get started with plotting, we need a new language feature: keyword arguments."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Keyword Arguments\n",
    "\n",
    "When you call most functions, you have to provide values.  For example, when you call `np.exp`, which raises $e$ to a given power, the value you provide is a number: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "np.exp(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you call `np.power`, you have to provide two numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.power(10, 6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values you provide are called **arguments**.  Specifically, the values in these examples are **positional arguments** because their position determines how they are used.\n",
    "In the second example, `power` computes `10` to the sixth power, not `6` to the 10th power because of the order of the arguments.\n",
    "\n",
    "Many functions also take **keyword arguments**, which are identified by name.  For example, we have previously used `int` to convert a string to an integer.\n",
    "Here's how we use it with a string as a positional argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "int('21')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, `int` assumes that the number is in base 10.  But you can provide a keyword argument that specifies a different base.\n",
    "For example, the string `'21'`, interpreted in base 8, represents the number `2 * 8 + 1 = 17`.  Here's how we do this conversion using `int`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "int('21', base=8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The integer value `8` is a keyword argument, with the keyword `base`.\n",
    "\n",
    "Specifying a keyword argument looks like an assignment statement, but it does not create a new variable.\n",
    "And when you specify a keyword argument, you don't choose the variable name.  In this example, the keyword name, `base`, is part of the definition of `int`.  If you specify another keyword name, you get an error."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "Run the following line in the next cell to see what happens.\n",
    "\n",
    "```\n",
    "int('123', bass=11)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** The `print` function takes a keyword argument called `end` that specifies the character it prints at the end of the line.  By default, `end` is the newline character, `\\n`.  So if you call `print` more than once, the results normally appear on separate lines, like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "for x in [1, 2, 3]:\n",
    "    print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Modify the previous example so it prints the elements of the list, all on one line, with spaces between them.\n",
    "Then modify it to print an open bracket at the beginning and a close bracket and newline at the end."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Graphing Religious Affiliation\n",
    "\n",
    "Now we're ready to make some graphs.\n",
    "In October 2019 the Pew Research Center published \"In U.S., Decline of Christianity Continues at Rapid Pace\" at <https://www.pewresearch.org/religion/2019/10/17/in-u-s-decline-of-christianity-continues-at-rapid-pace/>.\n",
    "It includes this figure, which shows changes in religious affiliation among adults in the U.S. over the previous 10 years.\n",
    "\n",
    "<img src=\"https://github.com/AllenDowney/ElementsOfDataScience/raw/master/figs/pew_religion_figure1.png\" width=\"400\">\n",
    "\n",
    "As an exercise, we'll replicate this figure.  It shows results from two sources, Religious Landscape Studies and Pew Research Political Surveys.  The political surveys provide data from more years, so we'll focus on that.\n",
    "\n",
    "The data from the figure are available from Pew Research at <https://www.pewforum.org/wp-content/uploads/sites/7/2019/10/Detailed-Tables-v1-FOR-WEB.pdf>, but they are in a PDF document.  It is sometimes possible to extract data from PDF documents, but for now we'll enter the data by hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "year = [2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]\n",
    "\n",
    "christian = [77, 76, 75, 73, 73, 71, 69, 68, 67, 65]\n",
    "\n",
    "unaffiliated = [17, 17, 19, 19, 20, 21, 24, 23, 25, 26]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The library we'll use for plotting is Matplotlib; more specifically, we'll use a part of it called Pyplot, which we'll import with the nickname `plt`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pyplot provides a function called `plot` that makes a line plot.  It takes two sequences as arguments, the `x` values and the `y` values.  The sequences can be tuples, lists, or arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(year, christian);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The semi-colon at the end of the line prevents the return value from `plot`, which is an object representing the line, from being displayed.\n",
    "\n",
    "If you plot multiple lines in a single cell, they appear on the same axes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(year, christian)\n",
    "plt.plot(year, unaffiliated);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plotting them on the same axes makes it possible to compare them directly.\n",
    "However, notice that Pyplot chooses the range for the axes automatically; in this example the `y` axis starts around 15, not zero.\n",
    "\n",
    "As a result, it provides a misleading picture, making the ratio of the two lines look bigger than it really is.\n",
    "We can set the limits of the `y` axis using the function `plt.ylim`.  The argument is a list with two values, the lower bound and the upper bound."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(year, christian)\n",
    "plt.plot(year, unaffiliated)\n",
    "\n",
    "plt.ylim([0, 80]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's better, but this graph is missing some of the most important elements: labels for the axes and a title.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Decorating the Axes\n",
    "\n",
    "To label the axes and add a title, we'll use Pyplot functions `xlabel`, `ylabel`, and `title`.  All of them take strings as arguments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(year, christian)\n",
    "plt.plot(year, unaffiliated)\n",
    "\n",
    "plt.ylim([0, 80])\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('% of adults')\n",
    "plt.title('Religious affiliation of U.S adults');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's add another important element, a legend that indicates which line is which.\n",
    "To do that, we add a label to each line, using the keyword argument `label`.\n",
    "Then we call `plt.legend` to create the legend."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(year, christian, label='Christian')\n",
    "plt.plot(year, unaffiliated, label='Unaffiliated')\n",
    "\n",
    "plt.ylim([0, 80])\n",
    "plt.xlabel('Year')\n",
    "plt.ylabel('% of adults')\n",
    "plt.title('Religious affiliation of U.S adults')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The legend shows the labels we provided when we created the lines."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** The original figure plots lines between the data points, but it also plots markers showing the location of each data point.  It is generally good practice to include markers, especially if data are not available for every year.\n",
    "\n",
    "Modify the previous example to include a keyword argument `marker` with the string value `'o'`, which indicates that you want to plot circles as markers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** In the original figure, the line labeled `'Christian'` is red and the line labeled `'Unaffiliated'` is grey.\n",
    "\n",
    "Find the online documentation of `plt.plot` and figure out how to use keyword arguments to specify colors.  Choose colors to (roughly) match the original figure.\n",
    "\n",
    "The `legend` function takes a keyword argument that specifies the location of the legend.  Read the documentation of this function and move the legend to the center left of the figure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting Sandwich Prices\n",
    "\n",
    "In Chapter 3 we used data from an article in *The Economist* comparing sandwich prices in Boston and London: \"Why Americans pay more for lunch than Britons do\" at <https://www.economist.com/finance-and-economics/2019/09/07/why-americans-pay-more-for-lunch-than-britons-do>.\n",
    "\n",
    "The article includes this graph showing prices of several sandwiches in the two cities:\n",
    "\n",
    "<img src=\"https://github.com/AllenDowney/ElementsOfDataScience/raw/master/figs/20190907_FNC941.png\"  width=\"400\">\n",
    "\n",
    "As an exercise, let's see if we can replicate this figure.\n",
    "Here's the data from the article again: the names of the sandwiches and the price list for each city."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "name_list = [\n",
    "    'Lobster roll',\n",
    "    'Chicken caesar',\n",
    "    'Bang bang chicken',\n",
    "    'Ham and cheese',\n",
    "    'Tuna and cucumber',\n",
    "    'Egg'\n",
    "]\n",
    "\n",
    "boston_price_list = [9.99, 7.99, 7.49, 7, 6.29, 4.99]\n",
    "london_price_list = [7.5, 5, 4.4, 5, 3.75, 2.25]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous section we plotted percentages on the `y` axis versus time on the `x` axis.\n",
    "Now we want to plot the sandwich names on the `y` axis and the prices on the `x` axis.\n",
    "Here's how:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(boston_price_list, name_list)\n",
    "plt.xlabel('Price in USD');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`name_list` is a list of strings; Pyplot orders them from top to bottom, equally spaced.\n",
    "\n",
    "By default Pyplot connects the points with lines, but in this example the lines don't make sense because the sandwich names are discrete; there are no intermediate points between an egg sandwich and a tuna sandwich.\n",
    "\n",
    "We can turn on markers and turn off lines with keyword arguments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(boston_price_list, name_list, \n",
    "         marker='o', linestyle='')\n",
    "plt.xlabel('Price in USD');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we can do the same thing more concisely by providing a **format string** as a positional argument.\n",
    "In this example, `'o'` indicates a circle marker and `'s'` indicates a square.\n",
    "You can read the documentation of `plt.plot` to learn more about format strings.\n",
    "\n",
    "And let's add a title while we're at it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(boston_price_list, name_list, 'o')\n",
    "plt.plot(london_price_list, name_list, 's')\n",
    "\n",
    "plt.xlabel('Price in USD')\n",
    "plt.title('Pret a Manger prices in Boston and London');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, to approximate the colors in the original figure, we can use the strings `'C3'` and `'C0'`, which specify colors from the default color sequence.  You can read more about specifying colors in the Pyplot documentation: <https://matplotlib.org/3.1.1/tutorials/colors/colors.html>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(boston_price_list, name_list, 'o', color='C3')\n",
    "plt.plot(london_price_list, name_list, 's', color='C0')\n",
    "\n",
    "plt.xlabel('Price in USD')\n",
    "plt.title('Pret a Manger prices in Boston and London');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To connect the dots with lines, we'll use `plt.hlines`, which draws horizontal lines.  It takes three arguments: a sequence of values on the `y` axis, which are the sandwich names in this example, and two sequences of values on the `x` axis, which are the London prices and Boston prices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hlines(name_list, london_price_list, boston_price_list, color='gray')\n",
    "\n",
    "plt.plot(boston_price_list, name_list, 'o', color='C3')\n",
    "plt.plot(london_price_list, name_list, 's', color='C0')\n",
    "\n",
    "plt.xlabel('Price in USD')\n",
    "plt.title('Pret a Manger prices in Boston and London');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** To finish off this example, add a legend that identifies the London and Boston prices.  Remember that you have to add a `label` keyword each time you call `plt.plot`, and then call `plt.legend`.\n",
    "\n",
    "Notice that the sandwiches in our figure are in the opposite order of the sandwiches in the original figure.  There is a Pyplot function that inverts the `y` axis; see if you can find it and use it to reverse the order of the sandwich list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Zipf's Law\n",
    "\n",
    "In almost any book, in almost any language, if you count the number of unique words the the number of times each word appears, you will find a remarkable pattern: the most common word appears twice as often as the second most common, at least approximately, three times as often as the third most common, and so on.\n",
    "\n",
    "In general, if we sort the words in descending order of frequency, there is an inverse relationship between the rank of the words -- first, second, third, etc. -- and the number of times they appear.\n",
    "This observation was most famously made by George Kingsley Zipf, so it is called Zipf's law.\n",
    "\n",
    "To see if this law holds for the words in *War and Peace*, we'll make a Zipf plot, which shows:\n",
    "\n",
    "* The frequency of each word on the `y` axis, and\n",
    "\n",
    "* The rank of each word on the `x` axis, starting from 1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "First, let's download the book again.  \n",
    "When you run the following cell, it checks to see whether you already have a file named `2600-0.txt`, which is the name of the file that contains the text of *War and Peace*.\n",
    "If not, it copies the file from Project Gutenberg to your computer.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from os.path import basename, exists\n",
    "\n",
    "def download(url):\n",
    "    filename = basename(url)\n",
    "    if not exists(filename):\n",
    "        from urllib.request import urlretrieve\n",
    "        local, _ = urlretrieve(url, filename)\n",
    "        print('Downloaded ' + local)\n",
    "    \n",
    "download('https://www.gutenberg.org/files/2600/2600-0.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous chapter, we looped through the book and made a string that contains all punctuation characters.\n",
    "Here are the results, which we will need again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_punctuation = ',.-:[#]*/“’—‘!?”;()%@'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code loops through the book and makes a dictionary that maps from each word to the number of times it appears."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# On Windows, it might be necessary to specify the encoding \n",
    "\n",
    "# fp = open('2600-0.txt', encoding='utf-8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_line = \"CHAPTER I\\n\"\n",
    "last_line = \"*** END OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***\\n\"\n",
    "\n",
    "fp = open('2600-0.txt')\n",
    "for line in fp:\n",
    "    if line == first_line:\n",
    "        break\n",
    "\n",
    "unique_words = {}\n",
    "for line in fp:\n",
    "    if line == last_line:\n",
    "        break\n",
    "        \n",
    "    for word in line.split():\n",
    "        word = word.lower()\n",
    "        word = word.strip(all_punctuation)\n",
    "        if word in unique_words:\n",
    "            unique_words[word] += 1\n",
    "        else:\n",
    "            unique_words[word] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In `unique_words`, the keys are words and the values are their frequencies.  We can use the `values` function to get the values from the dictionary.  The result has the type `dict_values`: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [],
   "source": [
    "freqs = unique_words.values()\n",
    "type(freqs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we plot them, we have to sort them, but the `sort` function doesn't work with `dict_values`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "Try this to see what happens:\n",
    "\n",
    "```\n",
    "freqs.sort()\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use `list` to make a list of frequencies:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "freqs = list(unique_words.values())\n",
    "type(freqs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we can use `sort`.  By default it sorts in ascending order, but we can pass a keyword argument to reverse the order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "freqs.sort(reverse=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, for the ranks, we need a sequence that counts from 1 to `n`, where `n` is the number of elements in `freqs`.  We can use the `range` function, which returns a value with type `range`.\n",
    "\n",
    "As a small example, here's the range from 1 to 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "range(1, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, there's a catch.  If we use the range to make a list, we see that \"the range from 1 to 5\" includes 1, but it doesn't include 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(range(1, 5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That might seem strange, but it is often more convenient to use `range` when it is defined this way, rather than what might seem like the more natural way (see <https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html>).\n",
    "Anyway, we can get what we want by increasing the second argument by one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "list(range(1, 6))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, finally, we can make a range that represents the ranks from `1` to `n`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "n = len(freqs)\n",
    "ranks = range(1, n+1)\n",
    "ranks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we can plot the frequencies versus the ranks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(ranks, freqs)\n",
    "\n",
    "plt.xlabel('Rank')\n",
    "plt.ylabel('Frequency')\n",
    "plt.title(\"War and Peace and Zipf's law\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The few most common words are very common, but the great majority of words are rare.  So that's consistent with Zipf's law, but Zipf's law is more specific.  It claims that the frequencies should be inversely proportional to the ranks.  If that's true, we can write:\n",
    "\n",
    "$f = k / r$\n",
    "\n",
    "where $r$ is the rank of a word, $f$ is its frequency, and $k$ is an unknown constant of proportionality.  If we take the log of both sides, we get\n",
    "\n",
    "$\\log f = \\log k - \\log r$\n",
    "\n",
    "This equation implies that if we plot $f$ versus $r$ on a log-log scale, we expect to see a straight line with intercept at $\\log k$ and slope -1."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logarithmic Scales\n",
    "\n",
    "We can use `plt.xscale` to plot the `x` axis on a log scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(ranks, freqs)\n",
    "\n",
    "plt.xlabel('Rank')\n",
    "plt.ylabel('Frequency')\n",
    "plt.title(\"War and Peace and Zipf's law\")\n",
    "plt.xscale('log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And `plt.yscale` to plot the `y` axis on a log scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(ranks, freqs)\n",
    "\n",
    "plt.xlabel('Rank')\n",
    "plt.ylabel('Frequency')\n",
    "plt.title(\"War and Peace and Zipf's law\")\n",
    "plt.xscale('log')\n",
    "plt.yscale('log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is not quite a straight line, but it is close.  We can get a sense of the slope by connecting the end points with a line.\n",
    "We'll select the first and last elements from `xs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "xs = ranks[0], ranks[-1]\n",
    "xs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the first and last elements from `ys`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [],
   "source": [
    "ys = freqs[0], freqs[-1]\n",
    "ys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And plot a line between them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(xs, ys, color='gray')\n",
    "plt.plot(ranks, freqs)\n",
    "\n",
    "plt.xlabel('Rank')\n",
    "plt.ylabel('Frequency')\n",
    "plt.title(\"War and Peace and Zipf's law\")\n",
    "plt.xscale('log')\n",
    "plt.yscale('log')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The slope of this line is the \"rise over run\", that is, the difference on the `y` axis divided by the difference on the `x` axis.\n",
    "\n",
    "We can compute the rise using `np.log10` to compute the log base 10 of the first and last values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.log10(ys)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can use `np.diff` to compute the difference between the elements:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "rise = np.diff(np.log10(ys))\n",
    "rise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following exercise, you'll compute the run and the slope of the gray line."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise:** Use `log10` and `diff` to compute the run, that is, the difference on the `x` axis.  Then divide the rise by the run to get the slope of the grey line.\n",
    "Is it close to -1, as Zipf's law predicts?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "This chapter introduces the Matplotlib library, which we used to replicate two figures a Zipf plot.\n",
    "These examples demonstrate the most common elements of data visualization, including lines and markers, values and labels on the axes, a legend and a title.\n",
    "The Zipf plot also shows the power of plotting data on logarithmic scales."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "*Elements of Data Science*\n",
    "\n",
    "Copyright 2021 [Allen B. Downey](https://allendowney.com)\n",
    "\n",
    "License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}