{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 03 - Stats Review: The Most Dangerous Equation\n", "\n", "In his famous 2007 article, Howard Wainer writes about very dangerous equations:\n", "\n", "\"Some equations are dangerous if you know them, and others are dangerous if you do not. The first category may pose danger because the secrets within its bounds open doors behind which lies terrible peril. The obvious winner in this is Einstein’s iconic equation $E = MC^2$, for it provides a measure of the enormous energy hidden within ordinary matter. \\[...\\] Instead I am interested in equations that unleash their danger not when we know about them, but rather when we do not. Kept close at hand, these equations allow us to understand things clearly, but their absence leaves us dangerously ignorant.\"\n", "\n", "The equation he talks about is de Moivre’s equation:\n", "\n", "$\n", "SE = \\dfrac{\\sigma}{\\sqrt{n}} \n", "$\n", "\n", "where $SE$ is the standard error of the mean, $\\sigma$ is the standard deviation, and $n$ is the sample size. Sounds like a piece of math the brave and true should master, so let's get to it.\n", "\n", "To see why not knowing this equation is very dangerous, let's look at some education data. I've compiled data on ENEM scores (Brazilian standardised high school scores, similar to the SAT) from different schools over 3 years. I also cleaned the data to keep only the information relevant to us. The original data can be downloaded from the [Inep website](http://portal.inep.gov.br/web/guest/microdados#).\n", "\n", "If we look at the top-performing schools, something catches the eye: those schools have a fairly small number of students. 
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "import pandas as pd\n", "import numpy as np\n", "from scipy import stats\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "from matplotlib import style\n", "style.use(\"fivethirtyeight\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | year | school_id | number_of_students | avg_score |
|---|---|---|---|---|
| 16670 | 2007 | 33062633 | 68 | 82.97 |
| 16796 | 2007 | 33065403 | 172 | 82.04 |
| 16668 | 2005 | 33062633 | 59 | 81.89 |
| 16794 | 2005 | 33065403 | 177 | 81.66 |
| 10043 | 2007 | 29342880 | 43 | 80.32 |
| 18121 | 2007 | 33152314 | 14 | 79.82 |
| 16781 | 2007 | 33065250 | 80 | 79.67 |
| 3026 | 2007 | 22025740 | 144 | 79.52 |
| 14636 | 2007 | 31311723 | 222 | 79.41 |
| 17318 | 2007 | 33087679 | 210 | 79.38 |