{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 2017-08-30 (More than a) Few Words About \"Computer Literacy\" in the Twenty-First Century..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "## 2017-08-13 Educational Technologies Here at Berkeley\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Why We Are Moving to Computer Problem Sets\n", "\n", "The Economics Department is responding to the current Berkeley budget crisis by enforcing a work speedup on our GSI section leaders\n", "\n", "* Instead of teaching 2 sections of 30 for a 50% GSI salary\n", "* They will be teaching 3 sections of 25\n", "* They are underpaid anyway\n", "* Hence we need to offload the problem set grading part of their job onto our machines...\n", "\n", "This is an opportunity and a challenge\n", "\n", "* The opportunity: you have to learn (or remember) a little Python for simple programming exercises\n", "* The challenge: you have to learn (or remember) a little Python for simple programming exercises\n", "* The opportunity: you learn how to do more things in problem sets\n", "* The challenge: we can demand that you do more things in problem sets\n", "* The opportunity: you will master important tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Why We Are Moving to Computing\n", "\n", "Four stages in the western European academic intellectual tradition\n", "\n", "1. The Visible College: Scribal literacy and Roman numeracy\n", "2. The Invisible College: Print literacy and Arabic numeracy\n", "3. The Republic of Letters: Enlightenment critical thinking and the scientific method\n", "4. The Unseen University...\n", "\n", "Data Science Wizard skills as the equivalent today of the chancery hand of the fourteenth century..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Data Science as Today's Equivalent of the Chancery Hand\n", "\n", "Needed: anecdotes and data\n", "\n", "* **Data** so that we can tell whether the stories we hear (and the stories we tell) are in any sense representative\n", "* **Anecdotes**—case studies—thick description—so that we can understand what the not-atypical cases and patterns _really are_\n", "\n", " \n", "\n", "Data Science wizard skills needed for handling Big Data (and small data)\n", "\n", "* Datasets\n", "* Web scraping\n", "* Statistical analysis\n", "* Emergence\n", "* Simulation\n", "* Presentation\n", "\n", " \n", "\n", "Anecdotes\n", "\n", "* Read stuff\n", "* But, these days, so much to read!\n", " * We are not just writing the history of Europe from the archive of Venetian ambassadors' reports back to the _Serenissima_ anymore, are we?\n", " * Cf. **Leopold von Ranke** and his use of the _Relazione_: **Gino Benzoni** \n", "* Point-and-click simply does not scale...\n", " * Data science skills essential for wrangling what you are going to read about your anecdotes—case studies—thick description exemplars as well\n", " \n", " \n", "\n", "Compare to the _chancery hand_ of the late Medieval period\n", "\n", "* The abiity to write quickly, legibly, and formally a huge leg up in your career\n", "* A _chancery hand_ not absolutely essential—but very, very useful and practical\n", "* You went to the late Medieval university to learn the \"liberal arts\"\n", " * That isk the skills needed to make your way as a free person\n", " * Not a serf\n", " * But, also, not a feudal lord and not a feudal vassal\n", " * Needed\n", " * The _trivium_: grammar, logic, and rhetoric: how to read, how to think, and how to present and persuade\n", " * The _quadrivium_: arithmetic, geometry, music theory, and astronomy—arithmetic, algebra (including accounting), geometry (including logistics), trigonometry (including surveying), frequency, harmony, and astronomy (including weather forecasting and navigation) \n", " * Plus: law, medicine, architecture, etc....\n", " * What were you?\n", " * A judge\n", " * A reeve or bailiff\n", " * A courtier\n", " * A theologian\n", " * Working in some noble's incipient bureaucracy\n", " * Working in some bishop's not-so-incipient bureaucracy\n", " * Working for a city government\n", " * Working for somebody in a guild merchant" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Rules for Computer Programming\n", "\n", "Five principles:\n", "\n", "1. Half an hour a day...\n", "2. Find a program that does almost what you want to do:\n", " * Inspect the source code...\n", " * Tweak it and see if that does the job...\n", "3. Google the error message...\n", "4. Remember: Reinstall and rerun (i.e., conda install/update) is your best friend...\n", "5. One language at a time!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "#### Something Wise Seth Lloyd Said to Me in May...\n", "\n", "The discussion was about how frequently potential employers are changing the stack that they expect CUNY computer science B.A.s to have seen in the major\n", "\n", "* Seth: \"I have made it a lifelong goal to have to know only one computer language at a time...\"\n", "\n", "Seth has had some success in his career:\n", "\n", "* Nam P. Suh Professor of Mechanical Engineering and Engineering Systems at MIT\n", "* Director of the Center for Extreme Quantum Information Theory at MIT\n", "* External Adjunct Fellow at the Santa Fe Institute\n", "* Ph.D. (Physics), Rockefeller University. M. Phil. (Philosophy), Cambridge University. B.A. (Physics, _summa cum laude_), Harvard University\n", "* Calls himself a \"Quantum Mechanic\"—although no blue overalls in evidence...\n", "\n", "\"Seth\n", "\n", "**Since you cannot learn every tool, make sure you learn tools that are broadly useful. Python is broadly useful**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "\"Preview\n", "\n", "#### Berkeley as Python (or at Least jupyter notebook) Central\n", "\n", "Lots of Python resources around\n", "\n", "* **Jean Mark Gawron**: Python for Social Science \n", "* **Python Practice**: Learning Resources \n", "* **Python Practice**: Python Community \n", "\n", " \n", "\n", "Python as Least Unfriendly Thing to Learn\n", "\n", "* Kunal Marwaha says\n", " * Python is the best of both worlds—the functionality of a general-purpose language but with different “packages” for different disciplines…\n", " * Python is friendly to novices. Most languages can do what you need, but efficiency depends on how quickly you can learn it…\n", " * The user community of each language makes a big difference…\n", " * Python is the fifth most popular programming language and is open source.\n", " * If you have never programmed and are working on a research problem, Python is almost certainly the best language to try first…" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "## 2017-08-26 Let's Do Something Useful with Python: Solving Allen Downey's Bayesian Dice Problem\n", "\n", "And, in the process, think a little bit about how to use computers—how to learn the equivalent of what learning to write in a fine chancery hand was for Oxford students back in the fourteenth century...\n", "\n", "cf: **Allen Downey** _Think Bayes_ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Python Environment Setup\n", "\n", "Get the library modules we will use..." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np # the standard Python numerical manipulation library module...\n", "import matplotlib.pyplot as pyplot # basic Python graphing module—imitates MATLAB...\n", "import random # basic Python pseudo-random number generator...\n", "from collections import Counter # see below..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the system up to show graphs in the notebook..." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### On Counter...\n", "\n", "In Python, you feed \"Counter\" a list of numbers or objects (for example, \"our_list_of_results\") and then label what Counter gives us back (in this case, \"our counts\") like this:\n", "\n", " our_counts = Counter(our_list_of_results)\n", "\n", "Counter then gives you what is essentially a two-column table \"our_counts\": The first column is a list of all the things that Counter found in our_list_of_results—all the outcomes. The second column is the number of times each of those outcomes appeared in our_list_of_results. Thus if you feed Counter this list of die rolls:\n", "\n", " our_list_of_results = [1, 2, 4, 2, 3, 2]\n", "\n", "Then Counter will give us:\n", "\n", " our_counts = {\n", " 1: 1 \n", " 2: 3 \n", " 3: 1 \n", " 4: 1 \n", " }\n", "\n", "By some abuse of language, we then do not say: \"'our_counts' is what is returned by the Counter function applied to 'our_list_of_results'\". We say, instead: \"'our_counts' is a Counter object\". \n", "\n", "IMHO, \"object\" is a bad word to people who are not already object-oriented computer programmers. Think, instead: \"Counter\" is a Python program somebody has already written. It takse a list, and burps back a table (a \"dict\" or \"dictionary\" data structure in Python) that tells us what outcomes are in the list and how many times each outcome is in the list.\n", "\n", "The fact that the language programmers of Python have already done the work of creating the \"Counter\" class and including it in the \"collections\" module makes our work here much, much easier. We are going to roll our (virtual) dice a million times. Counting up the outcomes would be tedious, if not for \"Counter\".\n", "\n", "Why are we rollling a million (virtual) dice? It is the quickest-and-easiest way to solve Allen Downey's Bayesian Dice Problem :\n", "\n", ">Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die, each of them with sides labeled from 1 on up sequentially. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and truthfully report to you that I get a 6. What is the probability that I rolled each die?\n", "\n", "Professor Downey solves this via creative thought and clever programming, using Python's object-oriented language structure. But we are, first, going to solve it much more easily and quickly—in literally five minutes of time spent programming—using what we might as well call: Brute Force and Massive Ignorance.\n", "\n", "I am going to present you with both ways because they together nearly span the ways we interact with computers: getting them to do lots of really simply and dumb calculations really fast, and keeping track of the results, on the one hand; building and keeping track of very complex, sophisticated, and clever chains of reasoning, on the other. It is useful to see both sides. (And this was how I originally started to learn this stuff in AM 110 in the spring of 1979: we worked close to the bare metal of the DEC PDP-11/70 via Assembly language on the one hand; we worked as far from the bare metal as was then possible with LISP on... well... in theory, at least, it did not matter what it was on on the other hand.) What I call BFMI is not Assembly, but it is not far. The type of Python Allen Downey writes is not LISP, but it is not far either...)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "## Brute Force and Massive Ignorance...\n", "\n", "We are, first, going to solve Allen Downey's Bayesian Dice Problem :\n", "\n", ">Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die, each of them with sides labeled from 1 on up sequentially. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and truthfully report to you that I get a 6. What is the probability that I rolled each die?\n", "\n", "forcefully, brutally, and quickly. We are simply going to roll the (virtual) dice a million times, and we will see what results. We will then be able to say:\n", "\n", ">Hmmmm. When the result of the die roll was a six:\n", "\n", ">* 0.0% of the time the die rolled was four-sided (duh!)\n", ">* 39.4% of the time the die rolled was six-sided\n", ">* 29.6% of the time the die rolled was eight-sided\n", ">* 19.4% of the time the die rolled was twelve-sided\n", ">* 11.7% of the time the die rolled was twenty-sided\n", "\n", "By the Central Limit Theorem, and by the hope that a million dice rolls is enough trials for the CLM to apply, we then assign those numbers as our probabilities of what die was rolled in the case the result of the die roll was a six.\n", "\n", "So let's get to work: here is the program:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "random.seed(1) # seed the Mersenne Twister random number generator so that results can \n", " # be duplicated whenever we want...\n", "\n", "data = [[] for x in range(20)] # set up a list of 20 empty sublists to track the results: one individual \n", " # sublist for each of the 20 possible numbers from 1 to 20 that the dice \n", " # might roll...\n", "\n", "dice = [4, 6, 8, 12, 20] # set up a list of the five dice you have: 4-sided, 6-sided, 8-sided,\n", " # 12-sided, and 20-sided...\n", "\n", "for i in range(1000000): # tell the computer do the following block of instructions a million times...\n", " die = random.choice(dice) # choose a die at random—a 20% chance each of 4-, 6-, 8-, 12-, 20-sided...\n", " result = random.randint(1, die) # roll the chosen die; get a number from 1 and the die's number of sides...\n", " data[result-1].append(die) # add the number of sides of the die that made that roll onto the\n", " # sublist of the times that number was rolled\n", "\n", "odds = {4: [0] * 20, 6: [0] * 20, 8: [0] * 20, # set up a 20 x 5 table of the relative frequences each \n", " 12: [0] * 20, 20: [0] * 20} # number was rolled by each die; the braces \"{}\", colons \":\",\n", " # and the 4, 6, 8, 12, 20 tell Python that each column of\n", " # this tab;e has a label: the number of sides of the die\n", " # that it corresponds to (this is another \"dict\" object \n", " # in Python, if you care)...\n", " \n", "for i in range(20): # for each of the 20 possible numbers rolled...\n", " for j in dice: # for each of the five possible dice...\n", " odds[j][i] = (Counter(data[i])[j]/ # calculate the probability by using counter to see how many\n", " sum(Counter(data[i]).values())) # times number i was rolled by die j, and then divide by the\n", " # total number of times i was rolled...\n", " \n", " # and we are done: the table \"odds\" is what we want..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here is how we print out our table of results" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MONTE CARLO DICE PROBLEM RESULTS (10^6 TRIALS)\n", " \n", "roll 4-sd 6-sd 8-sd 12-sd 20-sd\n", " \n", "1 37.04% 24.81% 18.41% 12.34% 7.41%\n", "2 37.07% 24.85% 18.32% 12.39% 7.37%\n", "3 36.93% 24.59% 18.50% 12.39% 7.58%\n", "4 37.04% 24.71% 18.55% 12.26% 7.44%\n", "5 0.00% 39.36% 29.12% 19.64% 11.88%\n", "6 0.00% 39.26% 29.32% 19.78% 11.65%\n", "7 0.00% 0.00% 48.61% 32.15% 19.24%\n", "8 0.00% 0.00% 48.06% 32.54% 19.40%\n", "9 0.00% 0.00% 0.00% 62.71% 37.29%\n", "10 0.00% 0.00% 0.00% 62.29% 37.71%\n", "11 0.00% 0.00% 0.00% 62.45% 37.55%\n", "12 0.00% 0.00% 0.00% 62.45% 37.55%\n", "13 0.00% 0.00% 0.00% 0.00% 100.00%\n", "14 0.00% 0.00% 0.00% 0.00% 100.00%\n", "15 0.00% 0.00% 0.00% 0.00% 100.00%\n", "16 0.00% 0.00% 0.00% 0.00% 100.00%\n", "17 0.00% 0.00% 0.00% 0.00% 100.00%\n", "18 0.00% 0.00% 0.00% 0.00% 100.00%\n", "19 0.00% 0.00% 0.00% 0.00% 100.00%\n", "20 0.00% 0.00% 0.00% 0.00% 100.00%\n" ] } ], "source": [ "print(\"MONTE CARLO DICE PROBLEM RESULTS (10^6 TRIALS)\") # print the title of the table of probabilities...\n", "print(\" \") # blank line\n", "print(\"roll 4-sd 6-sd 8-sd 12-sd 20-sd\") # header row showing which column goes with which die\n", "print(\" \") # blank line\n", "for i in range(20): # the loop to print all 20 lines\n", " print(i+1, \" \", \n", " '{:.2%}'.format(round(odds[4][i], 5)), \n", " '{:.2%}'.format(round(odds[6][i], 5)), \n", " '{:.2%}'.format(round(odds[8][i], 5)),\n", " '{:.2%}'.format(round(odds[12][i], 5)), \n", " '{:.2%}'.format(round(odds[20][i], 5)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion\n", "\n", "That took about 15 minutes of my time: five minutes to program up the million dice rolls (the program ran first time—that almost never happens) and then count, sort, and divide to produce the odds table; five more minutes to add the comments to the main program; and a last five minutes to look up how to print out the ugly version of the odds table just above.\n", "\n", "Oh. And computer time? Cells [1], [2], and [4] saw the computer came back finished before I could think of what to do next. Cell [3] took about 6 seconds to run on my machine (13\" MacBookAir, Early 2015; 2.2 GHz Intel Core i7; 8 GB 1600 MHz DDR3; C1MPQ2QJG944). \n", "\n", "Note that this table is not exactly right. The first four rows should be _identical_: the only information we get from a die roll of 1-4 is \"this roll could have been produced by any of the dice\". Similarly, rows 5-6 (\"this roll could have been produced by any except the 4-sided die\"), rows 7-8 (\"this roll could not have been produced by the 4- or by the 6-sided die\"), rows 9-12 (\"this roll could only have been produced by the 12- or the 20-sided die), and rows 13-20 (\"yup: the die rolled has more than 12 sides\") should be identical. But they are not quite. (Even a ten million rolls run would still leave occasional jiggles up or down by 0.1% in som elements of the table: I tried it, and it took the computer a minute. If I wanted to nail the numbers—have CLT convergence down to the last tenth of a percentage point—I would have to let the microprocessor spend as much time running it as I spent programming it: 15 minutes or so and five hundred million rolls.)\n", "\n", "How should we grasp what is going on in this odds table? Rows 13-20 are simple: one of the dice has been chosen and rolled, and the number rolled is one that could only have been rolled by the 20-sided die. So the chance the 20-sided die was the one chosen and rolled is 100%.\n", "\n", "But what is going on in rows 9-12? A die is chosen, and the resulting roll could not have been made by the 4-, the 6-, or the 8-sided die. We thus understand why the first three numbers in each line are: 0%, 0%, 0%: we know that either the 12- and 20-sided die was the one chosen. But we know that each had an equal chance of being chosen. So why are the last two numbers not 50%, 50% but instead 62.5%, 37.5%? Because we know more than just that the die roll could not have been made by any of the 4-, 6-, or 8-sided dice. We also know, in addition, that _if_ the 20-sided die were chosen _then_ it gave us a relatively unlikely relatively low roll: a 9-12 and not the 13-20 that would have been more likely had we known only (a) that the 20-sided die were chosen and (b) that it rolled a number greater than 8.\n", "\n", "To be more concrete: 1/5 of the time the 12-sided die is rolled. 1/3 of the time that the 12-sided die is rolled, the result is a 9-12. Thus (a) 12-sided die and (b) 9-12 occurs 1/5 x 1/3 = 1/15 of the time a die is rolled. 1/5 of the time the 20-sided die is rolled. 1/5 of the time that the 20-sided die is rolled the result is a 9-12. Thus (a) 12-sided die and (b) 9-12 occurs 1/5 x 1/5 = 1/25 of the time a die is rolled. Seeing a 9-12 thus takes place 1/15 + 1/25 = 8/75 of the time—3/75 with a 20- and 5/75 with a 5-sided die. Hence 37.5% and 62.5%.\n", "\n", "These arguments _sound_ convincing. But how do we know they are really right? How do we know that this is the way things work? We know because we rolled the (virtual) dice a million times. 5/8 of the time that a 9-12 was rolled, the die doing the rolling was the 12-sided die; only 3/8 of the time was the die doing the rolling the 20-sided die." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "## Math, Clever Programming, and Deep Insight...\n", "\n", "An alternative way to solve this Bayesian Dice Problem :\n", "\n", ">Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die, each of them with sides labeled from 1 on up sequentially. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and truthfully report to you that I get a 6. What is the probability that I rolled each die?\n", "\n", "is the way that Professor Allen Downey of Olin College solves it: via math, clever programming, and deep insight.\n", "\n", "First, Allen Downey of Olin College extends Counter—building it into what we will here call a \"ProbabilityMass\" by adding a \"normalize\" to it so that what it will contain (after \"normalize\") is not counts but rather frequencies that sum to one—probabilities, in other words. Then he extends ProbabilityMass, building it into what he calls a \"Suite\"—a suite, that is, of the subjective Bayesian probabilities of complete and non-overlapping hypotheses about the world. He does so by adding a \"bayesian_update\" function that takes a piece of data and uses it to update the Bayesian probabilities of the hypotheses in the Suite if you know the likelihoods of the data appearing under each of the hyptheses. Last, he extends Suite, building it into what he calls a \"DiceSuite\", by adding on a function calculating the likelihood that a particular roll was made by a certain die given how many sides the die has: " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class ProbabilityMass(Counter):\n", " \"\"\"A Counter extended to have probabilities. \"normalize\" computes the total of the \n", " frequencies and divides through, yielding probabilities that add to 1. __add__ enumerates \n", " all pairs of value and returns a new ProbabilityMass that represents the distribution of the sum.\n", " __hash__ and __id__ make ProbabilityMass hashable. render returns the values and probabilities \n", " in a form ready for plotting\"\"\"\n", "\n", " def normalize(self):\n", " \"\"\"Normalizes the PM so the probabilities add to 1.\"\"\"\n", " total = float(sum(self.values()))\n", " for key in self:\n", " self[key] /= total\n", "\n", " def __add__(self, other):\n", " \"\"\"Adds two distributions.\n", "\n", " The result is the distribution of sums of values from the\n", " two distributions.\n", "\n", " other: Pmf\n", "\n", " returns: new Pmf\n", " \"\"\"\n", " pmf = Pmf()\n", " for key1, prob1 in self.items():\n", " for key2, prob2 in other.items():\n", " pmf[key1 + key2] += prob1 * prob2\n", " return pmf\n", "\n", " def __hash__(self):\n", " \"\"\"Returns an integer hash value.\"\"\"\n", " return id(self)\n", " \n", " def __eq__(self, other):\n", " return self is other\n", "\n", " def render(self):\n", " \"\"\"Returns values and their probabilities, suitable for plotting.\"\"\"\n", " return zip(*sorted(self.items()))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class Suite(ProbabilityMass):\n", " \"\"\"Map from hypothesis to probability.\"\"\"\n", "\n", " def bayesian_update(self, data):\n", " \"\"\"Performs a Bayesian update. data: result of a die roll\"\"\"\n", " for hypo in self:\n", " like = self.likelihood(data, hypo)\n", " self[hypo] = self[hypo] * like\n", " self.normalize()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class DiceSuite(Suite):\n", " \"\"\"DiceSuite inherits bayesian_update from Suite and provides likelihood. data\n", " is the observed die roll. hypo is the hypothetical die I might have rolled; to get \n", " the likelihood, select, from the given die, the probability of the given value\"\"\"\n", " \n", " def likelihood(self, data, hypo):\n", " \"\"\"Computes the likelihood of the data under the hypothesis. data: result of a die roll\n", " hypo: die object\"\"\"\n", "\n", " return hypo[data] # note: if data > hypo, the die roll is greater than the number\n", " # of faces on the die, and so then hypo[data] = 0. If data ≤ hypo,\n", " # then hypo[data] = 1/hypo—equal one divided by the number of faces\n", " # on the die. When \"likelihood\" is called inside \"Bayesian update\", it\n", " # thus produces the right multiplicative factor with which to multiply\n", " # the previous hypothesis probability in the line \"self[hypo] *= like\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last, he defines a function to make an object representing a k-sided die, if you feed it the number k:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def make_die(num_sides):\n", " die = ProbabilityMass(range(1, num_sides+1))\n", " die.name = 'd%d' % num_sides\n", " die.normalize()\n", " return die" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then the programming is trivial: make the box of dice, and then one loop to calculate how the probabilities of each die are updated from their initial equal chances for each of the 20 possible numbers that might result from the die roll, and a subloop to print out the lines of the output table with percent signs:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BAYESIAN UPDATE RESULTS\n", " \n", "roll 4-sd 6-sd 8-sd 12-sd 20-sd\n", " \n", "1 37.04% 24.69% 18.52% 12.35% 7.41%\n", "2 37.04% 24.69% 18.52% 12.35% 7.41%\n", "3 37.04% 24.69% 18.52% 12.35% 7.41%\n", "4 37.04% 24.69% 18.52% 12.35% 7.41%\n", "5 0.00% 39.22% 29.41% 19.61% 11.77%\n", "6 0.00% 39.22% 29.41% 19.61% 11.77%\n", "7 0.00% 0.00% 48.39% 32.26% 19.36%\n", "8 0.00% 0.00% 48.39% 32.26% 19.36%\n", "9 0.00% 0.00% 0.00% 62.50% 37.50%\n", "10 0.00% 0.00% 0.00% 62.50% 37.50%\n", "11 0.00% 0.00% 0.00% 62.50% 37.50%\n", "12 0.00% 0.00% 0.00% 62.50% 37.50%\n", "13 0.00% 0.00% 0.00% 0.00% 100.00%\n", "14 0.00% 0.00% 0.00% 0.00% 100.00%\n", "15 0.00% 0.00% 0.00% 0.00% 100.00%\n", "16 0.00% 0.00% 0.00% 0.00% 100.00%\n", "17 0.00% 0.00% 0.00% 0.00% 100.00%\n", "18 0.00% 0.00% 0.00% 0.00% 100.00%\n", "19 0.00% 0.00% 0.00% 0.00% 100.00%\n", "20 0.00% 0.00% 0.00% 0.00% 100.00%\n" ] } ], "source": [ "dice = [make_die(x) for x in [4, 6, 8, 12, 20]]\n", "\n", "print(\"BAYESIAN UPDATE RESULTS\") # print the table odds, of probabilities...\n", "print(\" \")\n", "print(\"roll 4-sd 6-sd 8-sd 12-sd 20-sd\") # header row\n", "print(\" \") \n", "\n", "for i in range(20): # loop over die roll values from 1 to 20\n", " dice_suite = DiceSuite(dice) # set up the equzl prior probabilities\n", " dice_suite.bayesian_update(i+1) # do a Bayesian update for the current die roll\n", " \n", " rounded = [\"\"] * 5\n", " processing = list(dice_suite.values())\n", " for j in range(5):\n", " rounded[j] = round(processing[j],5)\n", " print(i+1, \" \", '{:.2%}'.format(rounded[0]), \n", " '{:.2%}'.format(rounded[1]), \n", " '{:.2%}'.format(rounded[2]), \n", " '{:.2%}'.format(rounded[3]), \n", " '{:.2%}'.format(rounded[4]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Recap\n", "\n", "The math, clever programming, and deep insight way has given us the same table—save for the jittering caused by sampling error, because 1,000,000 dice rolls is not quite enough for the CLT to nail things to even the first decimal place—as in the BFMI section above.\n", "\n", "How did that work? Did that go by too fast? Let's run through this again:\n", "\n", "Allen Downey starts with the already-programmed up \"Counter\" from the _collections_ module:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1})" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = Counter([1, 2, 3, 4, 5, 6])\n", "\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Applying \"Counter\" to the list [1, 2, 3, 4, 5, 6] gets us a table (here printed by Python on one line). The 1, 2, 3, 4, 5, and 6es before the colons tell us the outcomes that are in the list (the \"keys\" of the Python \"dict\" object that Counter returns), and the 1s after the colons tell us the counts of how many times each outcome was found in the list (the \"values\" of the Python \"dict\" object that Counter returns).\n", "\n", "Then Allen Downey extends Counter to ProbabilityMass:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class ProbabilityMass(Counter):\n", " \"\"\"A Counter extended to have probabilities. \"normalize\" computes the total of the \n", " frequencies and divides through, yielding probabilities that add to 1. __add__ enumerates \n", " all pairs of value and returns a new ProbabilityMass that represents the distribution of the sum.\n", " __hash__ and __id__ make ProbabilityMass hashable. render returns the values and probabilities \n", " in a form ready for plotting\"\"\"\n", "\n", " def normalize(self):\n", " \"\"\"Normalizes the PM so the probabilities add to 1.\"\"\"\n", " total = float(sum(self.values()))\n", " for key in self:\n", " self[key] /= total\n", "\n", " def __add__(self, other):\n", " \"\"\"Adds two distributions.\n", "\n", " The result is the distribution of sums of values from the\n", " two distributions.\n", "\n", " other: Pmf\n", "\n", " returns: new Pmf\n", " \"\"\"\n", " pmf = Pmf()\n", " for key1, prob1 in self.items():\n", " for key2, prob2 in other.items():\n", " pmf[key1 + key2] += prob1 * prob2\n", " return pmf\n", "\n", " def __hash__(self):\n", " \"\"\"Returns an integer hash value.\"\"\"\n", " return id(self)\n", " \n", " def __eq__(self, other):\n", " return self is other\n", "\n", " def render(self):\n", " \"\"\"Returns values and their probabilities, suitable for plotting.\"\"\"\n", " return zip(*sorted(self.items()))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class Suite(ProbabilityMass):\n", " \"\"\"Map from hypothesis to probability.\"\"\"\n", "\n", " def bayesian_update(self, data):\n", " \"\"\"Performs a Bayesian update. data: result of a die roll\"\"\"\n", " for hypo in self:\n", " like = self.likelihood(data, hypo)\n", " self[hypo] = self[hypo] * like\n", " self.normalize()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class DiceSuite(Suite):\n", " \"\"\"DiceSuite inherits bayesian_update from Suite and provides likelihood. data\n", " is the observed die roll. hypo is the hypothetical die I might have rolled; to get \n", " the likelihood, select, from the given die, the probability of the given value\"\"\"\n", " \n", " def likelihood(self, data, hypo):\n", " \"\"\"Computes the likelihood of the data under the hypothesis. data: result of a die roll\n", " hypo: die object\"\"\"\n", "\n", " return hypo[data] # note: if data > hypo, the die roll is greater than the number\n", " # of faces on the die, and so then hypo[data] = 0. If data ≤ hypo,\n", " # then hypo[data] = 1/hypo—equal one divided by the number of faces\n", " # on the die. When \"likelihood\" is called inside \"Bayesian update\", it\n", " # thus produces the right multiplicative factor with which to multiply\n", " # the previous hypothesis probability in the line \"self[hypo] *= like\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last, he defines a function to make an object representing a k-sided die, if you feed it the number k:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def make_die(num_sides):\n", " die = ProbabilityMass(range(1, num_sides+1))\n", " die.name = 'd%d' % num_sides\n", " die.normalize()\n", " return die" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that everything is set up, let us get to work. Let us make our five dice:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dice = [make_die(x) for x in [4, 6, 8, 12, 20]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let us see what we have made:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[ProbabilityMass({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}),\n", " ProbabilityMass({1: 0.16666666666666666,\n", " 2: 0.16666666666666666,\n", " 3: 0.16666666666666666,\n", " 4: 0.16666666666666666,\n", " 5: 0.16666666666666666,\n", " 6: 0.16666666666666666}),\n", " ProbabilityMass({1: 0.125,\n", " 2: 0.125,\n", " 3: 0.125,\n", " 4: 0.125,\n", " 5: 0.125,\n", " 6: 0.125,\n", " 7: 0.125,\n", " 8: 0.125}),\n", " ProbabilityMass({1: 0.08333333333333333,\n", " 2: 0.08333333333333333,\n", " 3: 0.08333333333333333,\n", " 4: 0.08333333333333333,\n", " 5: 0.08333333333333333,\n", " 6: 0.08333333333333333,\n", " 7: 0.08333333333333333,\n", " 8: 0.08333333333333333,\n", " 9: 0.08333333333333333,\n", " 10: 0.08333333333333333,\n", " 11: 0.08333333333333333,\n", " 12: 0.08333333333333333}),\n", " ProbabilityMass({1: 0.05,\n", " 2: 0.05,\n", " 3: 0.05,\n", " 4: 0.05,\n", " 5: 0.05,\n", " 6: 0.05,\n", " 7: 0.05,\n", " 8: 0.05,\n", " 9: 0.05,\n", " 10: 0.05,\n", " 11: 0.05,\n", " 12: 0.05,\n", " 13: 0.05,\n", " 14: 0.05,\n", " 15: 0.05,\n", " 16: 0.05,\n", " 17: 0.05,\n", " 18: 0.05,\n", " 19: 0.05,\n", " 20: 0.05})]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have just asked Python to give us back, as its output in Out[], the thing that \"dice\" currently is. And Python has delivered\n", "\n", "The Python object \"dice\" is rather complicated. First, it is a list—a list of 5 ProbabilityMass objects. That is what the opening \"[\", the closing \"]\", and the four commas before the second, third, fourth, and fifth occurances of \"ProductivityMass\" mean to Python: they tell it that this \"die\" thing is a five-item list.\n", "\n", "Each of the elements of the list is an instantiation—big word, but I don't know of a better one—a thing that belongs to the \"ProbabilityMass\" class of things. That is what the \"ProbabilityMass(\" and the closing \")\" mean to Python. Inside each ProbabilityMass wrapper is a Python dict object—that's what the opening \"{\" and the closing \"}\" mean. The first item in the dict inside the fifth ProbabilityMass object is \"1: 0.05,\"—the \"1\" is the _key_, in this case, a potential roll of this particular die, an outcome that might be obtained; the \":\" tells Python that the key is over; the \"0.05\" is the _value_, in this case that probability that this die, when rolled, will yield the outcome that is the _key_; and the \",\" tells Python: on to the next _key : value_ pair. The dict object inside the fifth ProbabilityMass has 20 _key: value_ entries—it is a 20-sided die, after all. The dict object inside the first ProbabilityMass has 4 _key: value_ entries—it is a 4-sided die.\n", "\n", "Next let us print the table title, a blank line, the table header row telling us that the first column is the outcome of the die roll and also which column is associated with which die, and then another blank line:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BAYESIAN UPDATE RESULTS\n", "\n", "roll 4-sd 6-sd 8-sd 12-sd 20-sd\n", "\n" ] } ], "source": [ "print(\"BAYESIAN UPDATE RESULTS\") # print the table odds, of probabilities...\n", "print(\"\") # blank line\n", "print(\"roll 4-sd 6-sd 8-sd 12-sd 20-sd\") # header row\n", "print(\"\") # blank line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next let us loop over all 20 possible outcome die rolls, calculating what the odds were that each roll was made by each die if each die had an equal 1/5 chance beforehand of being selected and printing out the results:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 37.04% 24.69% 18.52% 12.35% 7.41%\n", "2 37.04% 24.69% 18.52% 12.35% 7.41%\n", "3 37.04% 24.69% 18.52% 12.35% 7.41%\n", "4 37.04% 24.69% 18.52% 12.35% 7.41%\n", "5 0.00% 39.22% 29.41% 19.61% 11.77%\n", "6 0.00% 39.22% 29.41% 19.61% 11.77%\n", "7 0.00% 0.00% 48.39% 32.26% 19.36%\n", "8 0.00% 0.00% 48.39% 32.26% 19.36%\n", "9 0.00% 0.00% 0.00% 62.50% 37.50%\n", "10 0.00% 0.00% 0.00% 62.50% 37.50%\n", "11 0.00% 0.00% 0.00% 62.50% 37.50%\n", "12 0.00% 0.00% 0.00% 62.50% 37.50%\n", "13 0.00% 0.00% 0.00% 0.00% 100.00%\n", "14 0.00% 0.00% 0.00% 0.00% 100.00%\n", "15 0.00% 0.00% 0.00% 0.00% 100.00%\n", "16 0.00% 0.00% 0.00% 0.00% 100.00%\n", "17 0.00% 0.00% 0.00% 0.00% 100.00%\n", "18 0.00% 0.00% 0.00% 0.00% 100.00%\n", "19 0.00% 0.00% 0.00% 0.00% 100.00%\n", "20 0.00% 0.00% 0.00% 0.00% 100.00%\n" ] } ], "source": [ "for i in range(20): # loop over die roll values from 1 to 20\n", " dice_suite = DiceSuite(dice) # set up the equzl prior probabilities\n", " dice_suite.bayesian_update(i+1) # do a Bayesian update for the current die roll\n", " rounded = [\"\"] * 5\n", " processing = list(dice_suite.values())\n", " for j in range(5):\n", " rounded[j] = round(processing[j],5)\n", " print(i+1, \" \", '{:.2%}'.format(rounded[0]), \n", " '{:.2%}'.format(rounded[1]), \n", " '{:.2%}'.format(rounded[2]), \n", " '{:.2%}'.format(rounded[3]), \n", " '{:.2%}'.format(rounded[4]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But what was going on inside the loop? To see, let's go inside just the sixth iteration of the loop, and print out what dice_suite looks like as the loop proceeds. Firs:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DiceSuite({ProbabilityMass({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}): 0.2,\n", " ProbabilityMass({1: 0.16666666666666666,\n", " 2: 0.16666666666666666,\n", " 3: 0.16666666666666666,\n", " 4: 0.16666666666666666,\n", " 5: 0.16666666666666666,\n", " 6: 0.16666666666666666}): 0.2,\n", " ProbabilityMass({1: 0.125,\n", " 2: 0.125,\n", " 3: 0.125,\n", " 4: 0.125,\n", " 5: 0.125,\n", " 6: 0.125,\n", " 7: 0.125,\n", " 8: 0.125}): 0.2,\n", " ProbabilityMass({1: 0.08333333333333333,\n", " 2: 0.08333333333333333,\n", " 3: 0.08333333333333333,\n", " 4: 0.08333333333333333,\n", " 5: 0.08333333333333333,\n", " 6: 0.08333333333333333,\n", " 7: 0.08333333333333333,\n", " 8: 0.08333333333333333,\n", " 9: 0.08333333333333333,\n", " 10: 0.08333333333333333,\n", " 11: 0.08333333333333333,\n", " 12: 0.08333333333333333}): 0.2,\n", " ProbabilityMass({1: 0.05,\n", " 2: 0.05,\n", " 3: 0.05,\n", " 4: 0.05,\n", " 5: 0.05,\n", " 6: 0.05,\n", " 7: 0.05,\n", " 8: 0.05,\n", " 9: 0.05,\n", " 10: 0.05,\n", " 11: 0.05,\n", " 12: 0.05,\n", " 13: 0.05,\n", " 14: 0.05,\n", " 15: 0.05,\n", " 16: 0.05,\n", " 17: 0.05,\n", " 18: 0.05,\n", " 19: 0.05,\n", " 20: 0.05}): 0.2})" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dice_suite = DiceSuite(dice) # set up the equzl prior probabilities\n", "dice_suite.normalize()\n", "dice_suite" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point—just after its creation—\"dice_suite\" looks a lot like \"dice\" looked like, except that the first \"ProbabilityMass(...etc....)\" has had a \"{\" added before it, and a \": 0.2\" added after its final \")\"; the rest of the ProbabilityMass(...etc....) objects have had a \": 0.2\" added after their final \")\"s; and the last \"ProbabilityMass(...etc....): 0.2\" has had a \"}\" added after it. \n", "\n", "All the stuff from each \"P\" to the following \")\" is one of the dice. And each of the \"0.2\"s after the colon outside the first five closing brace-parenthesis pairs \"})\" is the probability that that particular die was chosen to be rolled.\n", "\n", "Next the loop asked Python to do:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dice_suite.bayesian_update(6) # do a Bayesian update for the die roll of 6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does dice_suite look like now, after the update?" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DiceSuite({ProbabilityMass({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}): 0.0,\n", " ProbabilityMass({1: 0.16666666666666666,\n", " 2: 0.16666666666666666,\n", " 3: 0.16666666666666666,\n", " 4: 0.16666666666666666,\n", " 5: 0.16666666666666666,\n", " 6: 0.16666666666666666}): 0.3921568627450981,\n", " ProbabilityMass({1: 0.125,\n", " 2: 0.125,\n", " 3: 0.125,\n", " 4: 0.125,\n", " 5: 0.125,\n", " 6: 0.125,\n", " 7: 0.125,\n", " 8: 0.125}): 0.2941176470588236,\n", " ProbabilityMass({1: 0.08333333333333333,\n", " 2: 0.08333333333333333,\n", " 3: 0.08333333333333333,\n", " 4: 0.08333333333333333,\n", " 5: 0.08333333333333333,\n", " 6: 0.08333333333333333,\n", " 7: 0.08333333333333333,\n", " 8: 0.08333333333333333,\n", " 9: 0.08333333333333333,\n", " 10: 0.08333333333333333,\n", " 11: 0.08333333333333333,\n", " 12: 0.08333333333333333}): 0.19607843137254904,\n", " ProbabilityMass({1: 0.05,\n", " 2: 0.05,\n", " 3: 0.05,\n", " 4: 0.05,\n", " 5: 0.05,\n", " 6: 0.05,\n", " 7: 0.05,\n", " 8: 0.05,\n", " 9: 0.05,\n", " 10: 0.05,\n", " 11: 0.05,\n", " 12: 0.05,\n", " 13: 0.05,\n", " 14: 0.05,\n", " 15: 0.05,\n", " 16: 0.05,\n", " 17: 0.05,\n", " 18: 0.05,\n", " 19: 0.05,\n", " 20: 0.05}): 0.11764705882352945})" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dice_suite" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note what has changed: only the five numbers after the \"):\"s. Those are your subjective probabilities of which die was chosen after seeing that the die roll was 6.\n", "\n", "The probability for the four-sided die is now zero. The \"bayesian_update\" function looked inside the four-sided die object—that is, inside\n", "\n", " ProbabilityMass({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})\n", " \n", "for that is Python's representation of the four-sided die—for the chance that a four-sided die rolls a 6. It found nothing. And so it then multiplied zero by the previous probability of a four-sided die equal to 0.2—the 0.2 in the\n", "\n", " ProbabilityMass({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}): 0.2\n", "\n", "obtained zero, and so substituted that zero in for 0.2.\n", "\n", "The probability for the six-sided die is now 0.3921568627450981. The \"bayesian_update\" function looked inside the six-sided die object—that is, inside\n", "\n", " ProbabilityMass({1: 0.16666666666666666, 2: 0.16666666666666666, 3: 0.16666666666666666, 4: 0.16666666666666666, 5: 0.16666666666666666, 6: 0.16666666666666666})\n", " \n", "for that is Python's representation of the six-sided die—for the chance that a six-sided die rolls a 6. It found 0.1666666666. And so it then multiplied the previous probability of a six-sided die equal to 0.2—the 0.2 in the \n", "\n", " ProbabilityMass({1: 0.16666666666666666, 2: 0.16666666666666666, 3: 0.16666666666666666, 4: 0.16666666666666666, 5: 0.16666666666666666, 6: 0.16666666666666666}): 0.2\n", " \n", "by 0.16666666666666, and, after doing the analogous multiplications for the eight-, twelve-, and twenty-sided dice, normalized them by multiplying them by 11.76470588235765 so that they all added up to the one that a proper set of probabilities needs to add up to. That is how we got the 0.3921568627450981. And, similarly, the 0.2941, the 0.1961, and the 0.11.77.\n", "\n", "Then the sixth run of the loop prints out the corresponding table line:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6 0.00% 39.22% 29.41% 19.61% 11.77%\n" ] } ], "source": [ "rounded = [\"\"] * 5\n", "processing = list(dice_suite.values())\n", "for j in range(5):\n", " rounded[j] = round(processing[j],5)\n", "print(6, \" \", '{:.2%}'.format(rounded[0]), \n", " '{:.2%}'.format(rounded[1]), \n", " '{:.2%}'.format(rounded[2]), \n", " '{:.2%}'.format(rounded[3]), \n", " '{:.2%}'.format(rounded[4]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "giving the probabilities of each die after seeing that the number rolled was a six.\n", "\n", "Putting the whole loop back together once again:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 37.04% 24.69% 18.52% 12.35% 7.41%\n", "2 37.04% 24.69% 18.52% 12.35% 7.41%\n", "3 37.04% 24.69% 18.52% 12.35% 7.41%\n", "4 37.04% 24.69% 18.52% 12.35% 7.41%\n", "5 0.00% 39.22% 29.41% 19.61% 11.77%\n", "6 0.00% 39.22% 29.41% 19.61% 11.77%\n", "7 0.00% 0.00% 48.39% 32.26% 19.36%\n", "8 0.00% 0.00% 48.39% 32.26% 19.36%\n", "9 0.00% 0.00% 0.00% 62.50% 37.50%\n", "10 0.00% 0.00% 0.00% 62.50% 37.50%\n", "11 0.00% 0.00% 0.00% 62.50% 37.50%\n", "12 0.00% 0.00% 0.00% 62.50% 37.50%\n", "13 0.00% 0.00% 0.00% 0.00% 100.00%\n", "14 0.00% 0.00% 0.00% 0.00% 100.00%\n", "15 0.00% 0.00% 0.00% 0.00% 100.00%\n", "16 0.00% 0.00% 0.00% 0.00% 100.00%\n", "17 0.00% 0.00% 0.00% 0.00% 100.00%\n", "18 0.00% 0.00% 0.00% 0.00% 100.00%\n", "19 0.00% 0.00% 0.00% 0.00% 100.00%\n", "20 0.00% 0.00% 0.00% 0.00% 100.00%\n" ] } ], "source": [ "for i in range(20): # loop over die roll values from 1 to 20\n", " dice_suite = DiceSuite(dice) # set up the equzl prior probabilities\n", " dice_suite.bayesian_update(i+1) # do a Bayesian update for the current die roll\n", " rounded = [\"\"] * 5\n", " processing = list(dice_suite.values())\n", " for j in range(5):\n", " rounded[j] = round(processing[j],5)\n", " print(i+1, \" \", '{:.2%}'.format(rounded[0]), \n", " '{:.2%}'.format(rounded[1]), \n", " '{:.2%}'.format(rounded[2]), \n", " '{:.2%}'.format(rounded[3]), \n", " '{:.2%}'.format(rounded[4]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once all the preliminary building of the scaffolding is done—the creation of the classes and the functions and the methods—the actual program is three lines of code:\n", "\n", "```\n", "for i in range(20): # do this once for each die roll value\n", " dice_suite = DiceSuite(dice) # set up the box of dice and pick one\n", " dice_suite.bayesian_update(i+1) # roll the die; calculate _posterior_ odds\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion: Why Do Things Downey's—the _Pythonic_—Way?\n", "\n", "Now I do not know about you, but that took me not fifteen minutes but more like six hours: to copy the program logic from Allen Downey's github repository, poke around it until I felt that I understood its logic well enough to explain it, modify it so it prints out the whole table of results rather than just line 6, and then write this up. It would have taken me a lot more than six hours if I had had to come up with it rather than just follow along in the footsteps of Allen Downey's _Think Bayes_ .\n", "\n", "If it took you significantly less time, then congratulations: you have a bring future ahead of you in high-paying jobs working as an object-oriented computer programmer!\n", "\n", "What are the reasons to do things Downey's way—a style of programming that is object-oriented, Pythonic, using clever data structures, the theoretical insight of statistics developed by the Rev. Thomas Bayes that:\n", "\n", "$$P(A|B) = \\frac {P(B|A)P(A)}{P(B)}$$\n", "\n", "and other such intellectual tools, rather than \"let's roll a million (virtual) dice and see\"?\n", "\n", "* First, Allen Downey now has a suite of tools that he can easily adapt to answer related—and more complicated questions. What if you want to know: \"What are the probabilities if my counterparty rolls 3, 6, 4, 2, 5, 4, 3\"? What if you want to know: \"How many die rolls do I need to see—none of which are above 12—before I can conclude that it is highly, highly unlikely that the die chosen is the twenty-sided die?\"?\n", "\n", "* Second, sooner or later BFMI fails because you do not have enough brute force at your disposal. Suppose you had not one but 100 dice rolls, and not one but 100 possible ways that the initial box from which the die is chosen could be set up. We are now—if you want to nail the probabilities to even one decismal place—talking about not fifteen minutes but 2500 hours of computer time.\n", "\n", "* Third, even in the best of cases, BFMI—let's simulate it a lot of times and see what happens—requires that you know a simple thing to simulate, and the kind of knowledge behind what Downey is doing is the only thing that makes setting up the right simple thing to simulate many times possible.\n", "\n", "* Fourth, after programming it up, someone doing it Allen Downey's way would understand and could explain why the numbers are what they are in a way that somebody doing it the BRMI way could not. Recall the discussion beginning \"How should we grasp what is going on in this odds table?\" above: I had to bolt that onto the BFMI run in order to give you a chance of gaining some insight into the situation. Somebody doing it Allen Downey's way would have to learn and understand that discussion already before they could finish programming.\n", "\n", "* Fifth, the BFMI way makes my students cry, because my code reads as though it was badly translated from the FORTRAN computer language written over 1953-1957 by John Backus, Richard Goldberg, Sheldon F. Best, Harlan Herrick, Peter Sheridan, Roy Nutt, Robert Nelson, Irving Ziller, Lois Haibt, and David Sayre . " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion: Dangers of the _Pythonic_ Way\n", "\n", "Of course, the benefits of eschewing BFMI—tool building, a big head start on related problems, economy of computer power when it becomes scarce, understanding of what are first- and what are third-order aspects of the phenomena, and insight—apply _only if you actually understand the program suite you are running_. \n", "\n", "If you do not, you are a mere _Code Monkey_ typing uncomprehended symbols. You are, at best, in the position of the underbriefed and undertrained _Sorcerer's Apprentice_.\n", "\n", "And everyone finds themselves in such a position soone or later. Indeed, there is a strange affinity between fantasy magic on the one hand and computer programming on the other—especially as computer science evolves languages and frameworks that are more and more abstract, object-oriented, recursive, etc. We see this, in fact, in the original _Hacker's Dictionary_:\n", "\n", ">WIZARD n. 1. A person who knows how a complex piece of software or hardware works; someone who can find and fix his bugs in an emergency. Rarely used at MIT, where HACKER is the preferred term. 2. A person who is permitted to do things forbidden to ordinary people, e.g., a \"net wizard\" on a TENEX may run programs which speak low-level host-imp protocol; an ADVENT wizard at SAIL may play Adventure during the day. —Paul Dourish: THE ORIGINAL HACKER'S DICTIONARY \n", "\n", "We see this more so in the Introduction to Abselson, Sussman, and Sussman's (1993) _Structure and Interpretation of Computer Programs_—a book that actually has a wizard on its cover:\n", "\n", ">Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.\n", "\n", ">The programs we use to conjure processes are like a sorcerer's spells. They are carefully composed from symbolic expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform.\n", "\n", ">A computational process, in a correctly working computer, executes programs precisely and accurately. Thus, like the sorcerer's apprentice, novice programmers must learn to understand and to anticipate the consequences of their conjuring… \n", "\n", "At some point knowledge decays into superstition, as the programmers lose control of what they are doing because they do not understand the system—or because the system is itself buggy, and does not behave as they were taught it would. Thus things go wrng—here we have an example: Gandalf the Grey, Python νb, confronting unexpected behavior from a pandas.DataFrame:\n", "\n", "\"Durin's\n", "\n", "or xkcd:\n", "\n", "\"Xkcd\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discussion: Strike a Balance\n", "\n", "Of course, it is not one or the other: one does not have to, and one should never be bullied into, declaring permanent allegiance to BFMI or to being clever and _Pythonic_. It is both/and. It is sometimes one/sometimes the other.\n", "\n", "We believe in rough consensus so the members of the community can help one another rather than get in one another's way; we believe in code that runs; and we believe in getting work done.\n", "\n", "Go thou, and do likewise..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "----\n", "\n", "-----\n", "\n", "### Addendum: Gandalf the Grey Talking Shop\n", "\n", ">'I found myself suddenly faced by something that I have not met before. I could think of nothing to do but to try and put a shutting-spell on the door. I know many; but to do things of that kind rightly requires time, and even then the door can be broken by strength. As I stood there I could hear orc-voices on the other side: at any moment I thought they would burst it open. I could not hear what was said; they seemed to be talking in their own hideous language. All I caught was _ghâsh_: that is “fire”. Then something came into the chamber–I felt it through the door, and the orcs themselves were afraid and fell silent. It laid hold of the iron ring, and then it perceived me and my spell. What it was I cannot guess, but I have never felt such a challenge. The counter-spell was terrible. It nearly broke me. For an instant the door left my control and began to open! I had to speak a word of Command. That proved too great a strain. The door burst in pieces. Something dark as a cloud was blocking out all the light inside, and I was thrown backwards down the stairs. All the wall gave way, and the roof of the chamber as well, I think…'\n", "\n", ">Picking up a faggot [Gandalf] held it aloft for a moment, and then with a word of command, _naur an edraith ammen!_, he thrust the end of his staff into the midst of it. At once a great spout of green and blue flame sprang out, and the wood flared and sputtered. ‘If there are any to see, then I at least am revealed to them’, he said. ‘I have written Gandalf is here in signs that all can read from Rivendell to the mouths of Anduin…’" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Addendum: Philosophy: Magic and Python\n", "\n", "Seth Lloyd, 1982 First Marshall of the Alpha of Massachusetts Chapter of ΦΒΚ, Nam Pyo Suh Professor of Mechanical Engineering at MIT, Miller Fellow at the Santa Fe Institute, Director of the W.M. Keck Center for Extreme Quantum Information Theory at MIT, the director of the Program in Quantum Information at the Institute for Scientific Interchange, and author of _Programming the Universe_, says:\n", "\n", ">I have made it a principle of my life to learn as few pieces of software as possible!\n", "\n", "My graduate students have been known to say that when I program it looks as though everything has been badly translated from FORTRAN. It is time to fix that. And the youngs say that the thing to do is Python, because here at Berkeley there is a greater concentration of Python wizards than of anything else, and in fact the greatest concentration of Python wizards in the world—a veritable Pythonic information singularity: if it can be done in Python, there is somebody within a mile who has done it; and if it can't be done in Python, there is somebody within a mile extending Python to do it.\n", "\n", "But...\n", "\n", "Python is an odd, extendable and extended beast. Everything is an object. Everything you do is poking objects in one way or another to get them to respond. And if you do not understand what the objects you are dealing with really are, Python will surprise you in unpleasant ways. Here is a simple example, but that there is such a simple example is something I find quite alarming;" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "16" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eight = 8\n", "eight + eight" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'88'" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eight = \"8\"\n", "eight + eight" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[8, 8]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eight = [8]\n", "eight + eight" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Addendum: On Not Positive But Negative Models for \"Data Science\"\n", "\n", "As you learn about and as you learn how to do what is now called \"data science\", it is important—not as important as knowing how to do \"data science\" right, but important—to be able to recognize and critique when data science is done wrong. There is lots of information out there put forward by lazy, undertrained, cynical, self-interested, and—sometimes—malevolent people. It is important to be able to recognize it, assess it, and then be able to identify and explain what has gone wrong.\n", "\n", "\"Data\n", "\n", "As **Drew Conway** put it in 2010 in his piece: \"[The Data Science Venn Diagram][]\", we have to deal with the fact that a number of skills need to be combined to do \"data science\" in a way that has a positive impact on the world: hacking skills, knowledge of the field, and also appropriate math and statistics:\n", "\n", ">**Hacking skills**, math and stats knowledge, and substantive expertise... each... very valuable.... Being able to manipulate text files at the command-line, understanding vectorized operations, thinking algorithmically... the hacking skills... apply[ing] **appropriate math and statistics** methods... [but] data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science... [which] is about discovery and building **knowledge**....\n", "\n", ">The hacking skills plus substantive expertise danger zone... people who “know enough to be dangerous”... capable of extracting and structuring data... related to a field they know quite a bit about and... run a linear regression... lack[ing] any understanding of what those coefficients mean. It is from this part of the diagram that the phrase “lies, damned lies, and statistics” emanates, because either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created. \n", "\n", ">Fortunately, it requires near willful ignorance to acquire hacking skills and substantive expertise without also learning some math and statistics along the way. As such, the danger zone is sparsely populated, however, it does not take many to produce a lot of damage...\n", "\n", "I am much less optimistic than Conway. It seems to be easy—these days at least—to find people who will create a simulacrum of a data science analysis that is in fact grossly misleading, whether through laziness, undertraining, cynicism, self-interest, or—sometimes—malevolence.\n", "\n", "[The Data Science Venn Diagram]: http://www.dataists.com/2010/09/the-data-science-venn-diagram/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Addendum: How Allen Downey Explains It\n", "\n", "If you want, Here is how Allen Downey explains his Bayesian Dice Problem, from his book _Think Bayes_ :\n", "\n", ">To represent a distribution in Python, you could use a dictionary that maps from each value to its probability. I have written a class called Pmf that uses a Python dictionary in exactly that way, and provides a number of useful methods. I called the class Pmf in reference to a probability mass function, which is a way to represent a distribution mathematically. Pmf is defined in a Python module I wrote....\n", "\n", ">The following code builds a Pmf to represent the distribution of outcomes for a six-sided die:\n", "\n", " pmf = Pmf()\n", " for x in [1,2,3,4,5,6]:\n", " pmf.Set(x, 1/6.0)\n", "\n", ">Pmf creates an empty Pmf with no values. The Set method sets the probability associated with each value to 1/6.... Pmf provides a method, Normalize....\n", "\n", " pmf.Normalize()....\n", "\n", ">Pmf uses a Python dictionary to store the values and their probabilities [note: this is confusing: the \"values\" of the random variable are the \"keys\" to the dictionary object; and the probabilities are the \"values\" of the dictionary object], so the values in the Pmf can be any hashable type. The probabilities can be any numerical type, but they are usually floating-point numbers....\n", "\n", ">Now that we see what elements of the framework are the same [for these problems], we can encapsulate them in an object—a Suite is a Pmf that provides __init__, Update, and Print:\n", "\n", " class Suite(Pmf):\n", " \"\"\"Represents a suite of hypotheses and their probabilities.\"\"\"\n", "\n", " def __init__(self, hypo=tuple()):\n", " \"\"\"Initializes the distribution.\"\"\"\n", "\n", " def Update(self, data):\n", " \"\"\"Updates each hypothesis based on the data.\"\"\"\n", "\n", " def Print(self):\n", " \"\"\"Prints the hypotheses and their probabilities.\"\"\"....\n", "\n", ">This chapter presents the Suite class, which encapsulates the Bayesian update framework. Suite is an abstract type, which means that it defines the interface a Suite is supposed to have, but does not provide a complete implementation. The Suite interface includes Update and Likelihood, but the Suite class only provides an implementation of Update, not Likelihood. A concrete type is a class that extends an abstract parent class and provides an implementation of the missing methods....\n", "\n", ">**Estimation: The dice problem**: Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?\n", "\n", ">Let me suggest a three-step strategy for approaching a problem like this:\n", "\n", ">* Choose a representation for the hypotheses.\n", ">* Choose a representation for the data.\n", ">* Write the likelihood function.\n", "\n", ">In previous examples I used strings to represent hypotheses and data, but for the die problem I’ll use numbers. Specifically, I’ll use the integers 4, 6, 8, 12, and 20 to represent hypotheses:\n", "\n", " In [3]: \n", " from dice import Dice\n", " suite = Dice([4, 6, 8, 12, 20])\n", "\n", ">And integers from 1 to 20 for the data. These representations make it easy to write the likelihood function:\n", "\n", " class Dice(Suite):\n", "\n", " def Likelihood(self, data, hypo):\n", " if hypo < data:\n", " return 0\n", " else:\n", " return 1.0/hypo\n", "\n", ">Here’s how Likelihood works. If hypo < data, that means the roll is greater than the number of sides on the die. That can’t happen, so the likelihood is 0.\n", "\n", ">Otherwise the question is, “Given that there are _hypo_ sides on the dice, what is the chance of rolling [the number] _data_?” The answer is 1/hypo, regardless of data.\n", "\n", ">Here is the statement that does the update (if I roll a 6):\n", "\n", " In [4]: suite.Update(6)\n", "\n", " Out[4]: 0.08500000000000002\n", "\n", ">And here is the posterior distribution:\n", "\n", " In [5]: suite.Print()\n", "\n", " Out[5]:\n", " 4 0.0 \n", " 6 0.3921568627450979 \n", " 8 0.2941176470588235 \n", " 12 0.19607843137254896 \n", " 20 0.11764705882352941\n", "\n", ">After we roll a 6, the probability for the 4-sided die is 0. The most likely alternative is the 6-sided die, but there is still almost a 12% chance for the 20-sided die.\n", "\n", "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Addendum: Why Python?\n", "\n", "\"Seth\n", "\n", "**You Cannot Learn Every Tool**:\n", "\n", "* Something Seth Lloyd said to me in May: \"I have made it a lifelong goal to only have to know one computer language at a time...\"\n", " * He has had some success in his career: \"Nam P. Suh Professor of Mechanical Engineering and Engineering Systems and Director of the Center for Extreme Quantum Information Theory at MIT. External Adjunct Fellow at the Santa Fe Institute. Ph.D. (Physics), Rockefeller University. M. Phil. (Philosophy), Cambridge University. B.A. (Physics), Harvard University...\"\n", " * Calls himself a \"Quantum Mechanic\"\n", "* Since you cannot learn every tool, make sure you learn tools that are broadly useful. Python is broadly useful.\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Berkeley as Python (or, at Least, Jupyter Notebook) Central**:\n", "\n", "\"61A\n", "\n", "* **Jean Mark Gawron**: Python for Social Science \n", "* **Python Practice**: Learning Resources \n", "* **Python Practice**: Python Community \n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Python as Least Unfriendly Thing to Learn**:\n", "\n", "\"TIOBE\n", "\n", "**Kunal Marwaha says:**\n", "\n", "* Python is the best of both worlds—the functionality of a general-purpose language but with different “packages” for different disciplines…\n", "* Python is friendly to novices. Most languages can do what you need, but efficiency depends on how quickly you can learn it…\n", "* The user community of each language makes a big difference…\n", "* Python is the fifth most popular programming language and is open source.\n", "* If you have never programmed and are working on a research problem, Python is almost certainly the best language to try first…" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "### Shoebox: Resources from Data Sciences 100\n", "\n", "* **DS 100**: Computer Setup \n", "* Python Data Science Web Resources \n", " * **Justin Johnson**: Python Numpy Tutorial \n", " * **Hernan Rojas**: Python 101 Sample Notebook \n", " * **John Hunter et al.**: Pyplot tutorial—Matplotlib 2.0.2 documentation \n", " * **Michael Waskom**: Seaborn tutorial \n", " * **Julia Evans**: Pandas Cookbook \n", " * **Scott Chacon and Ben Straub**: Pro Git " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Shoebox: There was a time...\n", "\n", "There was a time, a century and more ago, when the high-tech bleeding-edge electricity sector was an important but discrete part of the economy.\n", "\n", "Today, if one asks \"where is the electricity sector? What is the impact of electricity on the economy?\" The only answer is: everywhere. Electricity is no longer in any sense any sort of discrete sector. Electricity is everywhere.\n", "\n", "So it is rapidly becoming with computers and \"data science\". Computers, data, simulations, emergence from the bottom up, communication--they are no longer a discrete piece of our society and our educational system. Soon they will be everywhere: soon it will make as little sense to talk about the discrete pieces of the university that make use of computers as it makes sense to talk about the discrete pieces of the university that use electricity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " \n", "\n", "----\n", "\n", "**Note to Self**: View jupyter notebook from dropbox (or other) links:\n", "\n", "* Dropbox: change: \"https://www.dropbox.com/\" to: \"http://nbviewer.jupyter.org/urls/dl.dropbox.com/\"; then load the url...\n", "* Else: change: \"https://\" to \"http://nbviewer.jupyter.org/\"; then load the url..." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }