{ "cells": [ { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# HIDDEN\n", "from datascience import *\n", "import numpy as np\n", "import matplotlib\n", "# The 'warn' keyword was removed from matplotlib.use in newer matplotlib releases.\n", "matplotlib.use('Agg')\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import warnings\n", "warnings.simplefilter(action=\"ignore\", category=FutureWarning)\n", "\n", "from urllib.request import urlopen\n", "import re\n", "def read_url(url):\n", "    # Download the text and collapse each run of whitespace into a single space.\n", "    return re.sub('\\\\s+', ' ', urlopen(url).read().decode())" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# HIDDEN\n", "\n", "# Read two books, fast (again)!\n", "\n", "huck_finn_url = 'https://www.inferentialthinking.com/chapters/01/3/huck_finn.txt'\n", "huck_finn_text = read_url(huck_finn_url)\n", "huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]\n", "\n", "little_women_url = 'https://www.inferentialthinking.com/chapters/01/3/little_women.txt'\n", "little_women_text = read_url(little_women_url)\n", "little_women_chapters = little_women_text.split('CHAPTER ')[1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In some situations, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information and develop methods for combining multiple sources of uncertain information to make decisions.\n", "\n", "As an example of visualizing information derived from multiple sources, let us first use the computer to get some information that would be tedious to acquire by hand. In the context of novels, the word \"character\" has a second meaning: a printed symbol such as a letter, number, or punctuation mark. Here, we ask the computer to count the number of characters and the number of periods in each chapter of both *Huckleberry Finn* and *Little Women*." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# In each chapter, count the number of all characters;\n", "# call this the \"length\" of the chapter.\n", "# Also count the number of periods.\n", "\n", "chars_periods_huck_finn = Table().with_columns([\n", " 'Huck Finn Chapter Length', [len(s) for s in huck_finn_chapters],\n", " 'Number of Periods', np.char.count(huck_finn_chapters, '.')\n", " ])\n", "chars_periods_little_women = Table().with_columns([\n", " 'Little Women Chapter Length', [len(s) for s in little_women_chapters],\n", " 'Number of Periods', np.char.count(little_women_chapters, '.')\n", " ])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the data for *Huckleberry Finn*. Each row of the table corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general – the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks. " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
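<table>
<thead><tr><th>Huck Finn Chapter Length</th><th>Number of Periods</th></tr></thead>
<tbody>
<tr><td>7026</td><td>66</td></tr>
<tr><td>11982</td><td>117</td></tr>
<tr><td>8529</td><td>72</td></tr>
<tr><td>6799</td><td>84</td></tr>
<tr><td>8166</td><td>91</td></tr>
<tr><td>14550</td><td>125</td></tr>
<tr><td>13218</td><td>127</td></tr>
<tr><td>22208</td><td>249</td></tr>
<tr><td>8081</td><td>71</td></tr>
<tr><td>7036</td><td>70</td></tr>
</tbody>
</table>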
\n", "

... (33 rows omitted)\n", "<table>\n", "<thead><tr><th>Little Women Chapter Length</th><th>Number of Periods</th></tr></thead>\n", "<tbody>\n", "<tr><td>21759</td><td>189</td></tr>\n", "<tr><td>22148</td><td>188</td></tr>\n", "<tr><td>20558</td><td>231</td></tr>\n", "<tr><td>25526</td><td>195</td></tr>\n", "<tr><td>23395</td><td>255</td></tr>\n", "<tr><td>14622</td><td>140</td></tr>\n", "<tr><td>14431</td><td>131</td></tr>\n", "<tr><td>22476</td><td>214</td></tr>\n", "<tr><td>33767</td><td>337</td></tr>\n", "<tr><td>18508</td><td>185</td></tr>\n", "</tbody>\n", "</table>\n", "

... (37 rows omitted)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Periods go on the horizontal axis, chapter length on the vertical axis.\n", "plots.figure(figsize=(6, 6))\n", "plots.scatter(chars_periods_huck_finn.column(1), \n", " chars_periods_huck_finn.column(0), \n", " color='darkblue')\n", "plots.scatter(chars_periods_little_women.column(1), \n", " chars_periods_little_women.column(0), \n", " color='gold')\n", "plots.xlabel('Number of periods in chapter')\n", "plots.ylabel('Number of characters in chapter');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot shows us that many but not all of the chapters of *Little Women* are longer than those of *Huckleberry Finn*, as we had observed by just looking at the numbers. But it also shows us something more. Notice how the blue points are roughly clustered around a straight line, as are the yellow points. Moreover, it looks as though both colors of points might be clustered around the *same* straight line." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now look at all the chapters that contain about 100 periods. The plot shows that those chapters contain roughly 10,000 to 15,000 characters. That's about 100 to 150 characters per period.\n", "\n", "Indeed, it appears from looking at the plot that on average both books tend to have somewhere between 100 and 150 characters between periods, as a very rough estimate. Perhaps these two great 19th-century novels were signaling something so very familiar to us now: the 140-character limit of Twitter." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 1 }