{ "cells": [ { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# HIDDEN\n", "from datascience import *\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import math\n", "from scipy import stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference about the mean of an unknown population\n", "\n", "Data scientists often have to make inferences based on incomplete data. One such situation is when they are trying to make inferences about an unknown population based on one large random sample.\n", "\n", "Suppose the goal is to estimate the population mean. We know that the sample mean is a good estimate if the sample size is large. But we expect it to differ from the population mean by about one SE of the sample average. To calculate that SE exactly, we need the population SD. But if the population data are unknown, we don't know the population SD. After all, we don't even know the population mean; that's what we are trying to estimate!\n", "\n", "Fortunately, a simple approximation takes care of this. In place of the population SD, we can simply use the SD of the sample. It will not be equal to the population SD, but when we divide it by $\\sqrt{\\mbox{sample size}}$, the error is greatly reduced and the approximation works." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\mbox{SE of the sample average} ~=~\n", "\\frac{\\mbox{population SD}}{\\sqrt{\\mbox{sample size}}}\n", "~~\\approx~~ \\frac{\\mbox{sample SD}}{\\sqrt{\\mbox{sample size}}} ~~~\n", "\\mbox{when the random sample is large}\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use in estimation\n", "\n", "As a consequence of the Central Limit Theorem, we can use proportions derived from the normal curve in statements about the distance between the sample mean and the population mean. 
\n", "\n", "For example, in about 95% of the samples:\n", "\n", "- the sample average is in the range \"population average $~\\pm~$ 2 $\\times$ SE of sample average\"\n", "\n", "This statement is equivalent to the following:\n", "\n", "**In about 95% of the samples:**\n", "\n", "- **the population average is in the range \"sample average $~\\pm~$ 2 $\\times$ SE of sample average\"**\n", "\n", "This gives rise to a method of estimating the population mean." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An approximate 95%-confidence interval for the population mean\n", "A *confidence interval* is a range of estimates. An *approximate 95%-confidence interval for the population mean* is given by:\n", "\n", "$$\n", "\\mbox{sample average} ~\\pm~ 2 \\times \\mbox{SE of sample average}\n", "$$\n", "\n", "We estimate that the population mean is in this range.\n", "\n", "The 95% confidence refers to the procedure, not to any single interval.\n", "If we repeat this procedure many times, we will get many intervals, one for each repetition. About 95% of the intervals will contain the population average, which is what we are trying to estimate.\n", "\n", "The *level of confidence*, 95%, can be replaced by any other percentage. The number of SEs on either side of the sample average has to be adjusted accordingly, by using the normal curve." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table ``baby`` contains data on a random sample of 1,174 mothers and their newborn babies. The column ``birthwt`` contains the birth weight of the baby, in ounces; ``gest_days`` is the number of gestational days, that is, the number of days the baby was in the womb. There are also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.\n", "\n", "We will examine this dataset in some detail in the following sections. For now, we will concentrate on the column ``mat_age``. This is the age of the mother, in years, when she gave birth. 
The goal is to try to estimate the average age of the women giving birth in the population." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
birthwt | gest_days | mat_age | mat_ht | mat_pw | m_smoker | \n", "
---|---|---|---|---|---|
120 | 284 | 27 | 62 | 100 | 0 | \n", "
113 | 282 | 33 | 64 | 135 | 0 | \n", "
128 | 279 | 28 | 64 | 115 | 1 | \n", "
108 | 282 | 23 | 67 | 125 | 1 | \n", "
136 | 286 | 25 | 62 | 93 | 0 | \n", "
138 | 244 | 33 | 62 | 178 | 0 | \n", "
132 | 245 | 23 | 65 | 140 | 0 | \n", "
120 | 289 | 25 | 62 | 125 | 0 | \n", "
143 | 299 | 30 | 66 | 136 | 1 | \n", "
140 | 351 | 27 | 68 | 120 | 0 | \n", "
... (1164 rows omitted)
" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "baby = Table.read_table('baby.csv')\n", "baby" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a histogram of the ages of the new mothers in the sample. Most of the women were in the mid-twenties to low thirties." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "