{ "cells": [ { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "# HIDDEN\n", "\n", "from datascience import *\n", "import numpy as np\n", "\n", "import matplotlib\n", "matplotlib.use('Agg', warn=False)\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Randomness ###\n", "\n", "In the previous chapters we developed skills needed to make insightful descriptions of data. Data scientists also have to be able to understand randomness. For example, they have to be able to assign individuals to treatment and control groups at random, and then try to say whether any observed differences in the outcomes of the two groups are simply due to the random assignment or genuinely due to the treatment.\n", "\n", "In this chapter, we begin our analysis of randomness. To start off, we will use Python to make choices at random. In `numpy` there is a sub-module called `random` that contains many functions that involve random selection. One of these functions is called `choice`. It picks one item at random from an array, and it is equally likely to pick any of the items. The function call is `np.random.choice(array_name)`, where `array_name` is the name of the array from which to make the choice.\n", "\n", "Thus the following code evaluates to `treatment` with chance 50%, and `control` with chance 50%." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "'treatment'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "two_groups = make_array('treatment', 'control')\n", "np.random.choice(two_groups)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The big difference between the code above and all the other code we have run thus far is that the code above doesn't always return the same value. It can return either `treatment` or `control`, and we don't know ahead of time which one it will pick. We can repeat the process by providing a second argument, the number of times to repeat the process." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['treatment', 'control', 'treatment', 'control', 'control',\n", " 'treatment', 'treatment', 'control', 'control', 'control'], \n", " dtype=' 1 + 1" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The value `True` indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between `3` and `1+1`. The full set of common comparison operators are listed below.\n", "\n", "| Comparison | Operator | True example | False Example |\n", "|--------------------|----------|--------------|---------------|\n", "| Less than | < | 2 < 3 | 2 < 2 |\n", "| Greater than | > | 3 > 2 | 3 > 3 |\n", "| Less than or equal | <= | 2 <= 2 | 3 <= 2 |\n", "| Greater or equal | >= | 3 >= 3 | 2 >= 3 |\n", "| Equal | == | 3 == 3 | 3 == 2 |\n", "| Not equal | != | 3 != 2 | 2 != 2 |" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Notice the two equal signs `==` in the comparison to determine equality. This is necessary because Python already uses `=` to mean assignment to a name, as we have seen. It can't use the same symbol for a different purpose. Thus if you want to check whether 5 is equal to the 10/2, then you have to be careful: `5 = 10/2` returns an error message because Python assumes you are trying to assign the value of the expression 10/2 to a name that is the numeral 5. Instead, you must use `5 == 10/2`, which evaluates to `True`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "ename": "SyntaxError", "evalue": "can't assign to literal (, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m 5 = 10/2\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m can't assign to literal\n" ] } ], "source": [ "5 = 10/2" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "5 == 10/2" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be `True`. For example, we can express that `1+1` is between `1` and `3` using the following expression." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 < 1 + 1 < 3" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers `x` and `y` below. You can try different values of `x` and `y` to confirm this relationship." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = 12\n", "y = 5\n", "min(x, y) <= (x+y)/2 <= max(x, y)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Comparing Strings ###\n", "\n", "Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "'Dog' > 'Catastrophe' > 'Cat'" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's return to random selection. Recall the array `two_groups` which consists of just two elements, `treatment` and `control`. To see whether a randomly assigned individual went to the treatment group, you can use a comparison:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.choice(two_groups) == 'treatment'" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "As before, the random choice will not always be the same, so the result of the comparison won't always be the same either. It will depend on whether `treatment` or `control` was chosen. With any cell that involves random selection, it is a good idea to run the cell several times to get a sense of the variability in the result." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "deletable": true, "editable": true }, "source": [ "### Comparing an Array and a Value ###\n", "Recall that we can perform arithmetic operations on many numbers in an array at once. For example, `make_array(0, 5, 2)*2` is equivalent to `make_array(0, 10, 4)`. In similar fashion, if we compare an array and one value, each element of the array is compared to that value, and the comparison evaluates to an array of Booleans." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "array([False, True, False, True, True], dtype=bool)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tosses = make_array('Tails', 'Heads', 'Tails', 'Heads', 'Heads')\n", "tosses == 'Heads'" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The `numpy` method `count_nonzero` evaluates to the number of non-zero (that is, `True`) elements of the array." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.count_nonzero(tosses == 'Heads')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }