{ "cells": [ { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# HIDDEN\n", "\n", "from datascience import *\n", "import numpy as np\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# HIDDEN\n", "\n", "# Construct a 52-card deck\n", "from itertools import product\n", "\n", "ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']\n", "suits = ['♠︎', '♥︎', '♦︎', '♣︎']\n", "cards = product(ranks, suits)\n", "\n", "deck = Table(['rank', 'suit']).with_rows(cards)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Repetition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is often the case when programming that you will wish to repeat the same operation multiple times, perhaps with slightly different behavior each time. You could copy-paste the code 10 times, but that's tedious and prone to typos, and if you wanted to do it a thousand times (or a million times), forget it. \n", "\n", "A better solution is to use a `for` statement to loop over the contents of a sequence. A `for` statement begins with the word `for`, followed by a name for the item in the sequence, followed by the word `in`, and ending with an expression that evaluates to a sequence. The indented body of the `for` statement is executed once *for each item in that sequence*." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n" ] } ], "source": [ "for i in np.arange(5):\n", " print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A typical use of a `for` statement is to build up a table by repeating a random computation many times and storing each result as a new row. 
The `append` method of a table takes a sequence and adds a new row. It's different from `with_row` because a new table is not created; instead, the original table is extended. The cell below draws 100 cards, but keeps only the aces." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Rank Suit
A ♠︎
A ♣︎
A ♥︎
A ♦︎
A ♠︎
A ♦︎
" ], "text/plain": [ "Rank | Suit\n", "A | ♠︎\n", "A | ♣︎\n", "A | ♥︎\n", "A | ♦︎\n", "A | ♠︎\n", "A | ♦︎" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aces = Table(['Rank', 'Suit'])\n", "for i in np.arange(100):\n", " card = deck.row(np.random.randint(deck.num_rows))\n", " if card.item(0) == 'A':\n", " aces.append(card)\n", " \n", "aces" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This pattern can be used to track the results of repeated experiments. For example, perhaps we want to learn about the empirical properties of some randomly drawn poker hands. Below, we track whether the hand contains four-of-a-kind or five cards of the same suit (called a *flush*). " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Four-of-a-kind Flush
False False
False False
False False
False False
False False
False False
False False
False False
False False
False False
\n", "

... (9990 rows omitted)= 5:\n", " return not true_answer\n", " else:\n", " return true_answer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can try it. Assume our true answer is 'no'; let's see what happens this time:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "respond(False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, if you were to run it many times, you might get a different result each time. Below, we build a table of the responses for many responses when the true answer is always `False`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Truth Response
False False
False False
False False
False False
False False
False False
False True
False False
False True
False False
\n", "

... (990 rows omitted)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "responses.group('Response').barh('Response')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "656" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "responses.where('Response', False).num_rows" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "344" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "responses.where('Response', True).num_rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise for you:** If `N` out of 1000 responses are `True`, approximately what fraction of the population has truly sung a Justin Bieber song in the shower?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method is called \"randomized response\". It is one way to poll people about sensitive subjects, while still protecting their privacy. You can see how it is a nice example of randomness at work.\n", "\n", "It turns out that randomized response has beautiful generalizations. For instance, your Chrome web browser uses it to anonymously report feedback to Google, in a way that won't violate your privacy. That's all we'll say about it for this semester, but if you take an upper-division course, maybe you'll get to see some generalizations of this beautiful technique." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The steps in the randomized response survey can be visualized using a *tree diagram*. The diagram partitions all the survey respondents according to their true answer and the answer that they eventually give. 
It also displays the proportions of respondents whose true answers are 1 (\"True\") and 0 (\"False\"), as well as the chances that determine the answers that they give. As in the code above, we have used *p* to denote the proportion whose true answer is 1.\n", "\n", "![Tree Diagram](/images/rand_response_tree.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The respondents who answer 1 split into two groups. The first group consists of the respondents whose true and given answers are both 1. If the number of respondents is large, the proportion in this group is likely to be about 2/3 of *p*. The second group consists of the respondents whose true answer is 0 and given answer is 1. The proportion in this group is likely to be about 1/3 of *1-p*.\n", "\n", "We can observe $p^*$, the proportion of 1's among the given answers. Thus\n", "$$\n", "p^* ~\\approx ~ \\frac{2}{3} \\times p ~+~ \\frac{1}{3} \\times (1-p)\n", "$$\n", "\n", "The right-hand side simplifies to $\\frac{1}{3} \\times p + \\frac{1}{3}$, which allows us to solve for an approximate value of *p*:\n", "$$\n", "p ~ \\approx ~ 3p^* - 1\n", "$$\n", "\n", "In this way we can use the observed proportion of 1's to \"work backwards\" and get an estimate of *p*, the proportion in which we are interested. \n", "\n", "**Technical note.** It is worth noting the conditions under which this estimate is valid. The calculation of the proportions in the two groups whose given answer is 1 relies on *each of the groups* being large enough so that the Law of Averages allows us to make estimates about how their dice are going to land. This means that it is not only the total number of respondents that has to be large – the number of respondents whose true answer is 1 has to be large, as does the number whose true answer is 0. For this to happen, *p* must be neither close to 0 nor close to 1. If the characteristic of interest is either extremely rare or extremely common in the population, the method of randomized response described in this example might not work well."
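] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimate $p \\approx 3p^* - 1$ derived above can be checked with a small simulation. The cell below is a minimal sketch that uses only NumPy. The names `simulate_survey` and `estimate_p`, the seed, the sample size, and the value `p = 0.4` are illustrative assumptions and not part of the survey code; the only thing taken from the text is that each respondent flips their true answer with chance 1/3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Hypothetical helper: simulate n randomized responses when a proportion p\n", "# of the respondents has a true answer of 1 (True).\n", "def simulate_survey(p, n, rng):\n", "    truth = rng.random(n) < p                  # true answers, True with chance p\n", "    lie = rng.integers(1, 7, size=n) >= 5      # die shows 5 or 6: flip with chance 1/3\n", "    return np.where(lie, ~truth, truth)        # given answers\n", "\n", "# Invert p* = (2/3)p + (1/3)(1-p) to recover p from the observed proportion p*.\n", "def estimate_p(responses):\n", "    return 3 * np.mean(responses) - 1\n", "\n", "rng = np.random.default_rng(0)                 # assumed seed, for reproducibility only\n", "estimate_p(simulate_survey(0.4, 100000, rng))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a large number of respondents the estimate should land close to the assumed 0.4. With only a few hundred respondents it can wander noticeably, and because $3p^* - 1$ ranges from -1 to 2, a small sample can even produce an estimate below 0 or above 1."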
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try out this method on some real data. The chance of drawing a poker hand with no aces is\n", "\n", "$$\\frac{48}{52} \\times \\frac{47}{51} \\times \\frac{46}{50} \\times \\frac{45}{49} \\times \\frac{44}{48}$$" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.65884199833779666" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.product(np.arange(48, 43,-1) / np.arange(52, 47, -1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is quite embarassing indeed to draw a hand with no aces. The table below contains one column for the truth of whether a hand has no aces and another for the randomized response." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Truth Response
False True
True True
False True
True False
True False
True True
False False
True False
False False
True True
\n", "

... (9990 rows omitted)