{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "NBIO208.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "papermill": { "duration": 11.308284, "end_time": "2021-01-28T05:37:53.680473", "environment_variables": {}, "exception": null, "input_path": "__notebook__.ipynb", "output_path": "__notebook__.ipynb", "parameters": {}, "start_time": "2021-01-28T05:37:42.372189", "version": "2.1.0" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "0hgErj1IiJG1" }, "source": [ "# NBIO208. Statistical Testing in Python" ] }, { "cell_type": "markdown", "metadata": { "id": "4DBpQPRniJG1" }, "source": [ "The goal of this lesson is to introduce Python libraries, load and inspect a dataset, and begin some common statistical testing. It includes:\n", "- [1 The simplicity underlying common statistical tests](#anchor1)\n", "- [2 The Python environment](#anchor2)\n", "- [A note about $p$-values](#anchor2b)\n", "- [3 Pearson and Spearman correlation](#anchor3)\n", "- [4 One mean comparisons](#anchor4)\n", "- [5 Many mean comparisons](#anchor5)\n", "- [6 Checking assumptions](#anchor6)\n", "- [Resources](#anchor7)" ] }, { "cell_type": "markdown", "metadata": { "id": "0b3h8W5-iJG2" }, "source": [ "\n", "## 1 The simplicity underlying common statistical tests\n", "\n", "Most of the common statistical models (t-test, correlation, ANOVA, chi-square, etc.) are special cases of linear models, or very close approximations. This beautiful simplicity means that there is less to learn. In particular, it all comes down to $y = a \\cdot x + b$, where $a$ is the slope of the line and $b$ is the y-intercept, the point where the line crosses the y-axis. 
\n", "\n", "We check certain assumptions before using these \"parametric\" tests. When the assumptions are not met, we have alternative \"non-parametric\" counterparts. We will think of \"non-parametric\" tests as ranked versions of the corresponding parametric tests. \n", "\n", "This lesson is adapted from [Tests as Linear](https://github.com/eigenfoo/tests-as-linear), which is also available [in R](https://github.com/lindeloev/tests-as-linear). \n", "\n", "**See the [Cheat Sheet](https://lindeloev.github.io/tests-as-linear/linear_tests_cheat_sheet.pdf)**" ] }, { "cell_type": "markdown", "metadata": { "id": "3hYxOGLBchMJ" }, "source": [ "\n", "## 2 The Python Environment" ] }, { "cell_type": "markdown", "metadata": { "id": "Sno4smqRiJG3" }, "source": [ "In part 2, we will:\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "RwuaB2M7iJG3" }, "source": [ "### 2.1. Load libraries. \n", "Think of these as useful, powerful toolboxes we are opening up on our workbench. Many additional libraries are available from the Python Package Index.\n", "\n", "We `import` some key scientific libraries, sometimes using a shorter *alias* for ease of coding, with the syntax `import library_name as alias_name` \n", "\n", "- [numpy](https://numpy.org) - adds support for large, multi-dimensional arrays and matrices, and mathematical functions for arrays.\n", "- [pandas](https://pandas.pydata.org/) - offers data structures and operations for manipulating numerical tables and time series.\n", "- [matplotlib](https://matplotlib.org/) - the most widely used scientific plotting library in Python.\n", "- [seaborn](https://seaborn.pydata.org/) - for drawing attractive and informative statistical graphics.\n", "- [statsmodels](https://www.statsmodels.org/stable/index.html) and [scipy](https://www.scipy.org/) - for hypothesis testing and regression models." 
] }, { "cell_type": "code", "metadata": { "id": "8TEg4K1AiJG3" }, "source": [ "# Python libraries for data structures, arrays, and math\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Libraries for plotting\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Standard-library module for basic descriptive statistics\n", "import statistics\n", "\n", "# Library for hypothesis testing (importing the submodule makes scipy.stats available)\n", "import scipy.stats\n", "\n", "# Libraries for regression modeling\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n", "\n", "# Tools for displaying LaTeX and Markdown output\n", "from IPython.display import Latex, display, Markdown" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "mbR3OGMiiJG4" }, "source": [ "### 2.2 Load data.\n", "This loads a copy of the data into our environment. Unlike working with a spreadsheet, it does not affect the original file.\n", "\n", "*Our data:* This synthetic dataset contains information on newborn babies and their parents. It comes from [here](https://www.sheffield.ac.uk/mash/statistics/datasets). \n", "\n", "Read a Comma Separated Values (CSV) data file with `pd.read_csv()`. \n", "*[Need to read in a different file type?](https://realpython.com/pandas-read-write-files/)*\n", "- The argument is the name of the file to be read.\n", "- Assign the result to a variable to store the data that was read." ] }, { "cell_type": "code", "metadata": { "id": "jWi0xwl1iJG4" }, "source": [ "# url is the name and path of the data file\n", "url = \"https://raw.githubusercontent.com/DeisData/datasets/main/Birthweight_reduced_kg_R.csv\"\n", "\n", "# data is the name of the DataFrame we are storing our data in\n", "# pd is pandas and read_csv is a tool in pandas for reading in a csv file\n", "data = pd.read_csv(url) " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hG_fqj-WiJG5" }, "source": [ "### 2.3. 
Inspect the data table \n", "There are lots of functions and methods we can apply to the dataframe to start inspecting it." ] }, { "cell_type": "markdown", "metadata": { "id": "5YtAh5rqiJG5" }, "source": [ "

| | ID | Length | Birthweight | Headcirc | Gestation | smoker | mage | mnocig | mheight | mppwt | fage | fedyrs | fnocig | fheight | lowbwt | mage35 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1360 | 56 | 4.55 | 34 | 44 | 0 | 20 | 0 | 162 | 57 | 23 | 10 | 35 | 179 | 0 | 0 |
| 1 | 1016 | 53 | 4.32 | 36 | 40 | 0 | 19 | 0 | 171 | 62 | 19 | 12 | 0 | 183 | 0 | 0 |
| 2 | 462 | 58 | 4.10 | 39 | 41 | 0 | 35 | 0 | 172 | 58 | 31 | 16 | 25 | 185 | 0 | 1 |
| 3 | 1187 | 53 | 4.07 | 38 | 44 | 0 | 20 | 0 | 174 | 68 | 26 | 14 | 25 | 189 | 0 | 0 |
| 4 | 553 | 54 | 3.94 | 37 | 42 | 0 | 24 | 0 | 175 | 66 | 30 | 12 | 0 | 184 | 0 | 0 |

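
A few of the inspection calls mentioned above can be sketched on a small stand-in table. This is a minimal sketch, assuming a DataFrame named `sample` (a hypothetical name, not from the notebook) built by hand from the five rows displayed above so that it runs without downloading the file. With the real data, the same calls would apply to `data`.

```python
import pandas as pd

# Column names and values copied from the five rows displayed above.
# `sample` is an illustrative stand-in, not the full dataset.
columns = ["ID", "Length", "Birthweight", "Headcirc", "Gestation", "smoker",
           "mage", "mnocig", "mheight", "mppwt", "fage", "fedyrs", "fnocig",
           "fheight", "lowbwt", "mage35"]
rows = [
    [1360, 56, 4.55, 34, 44, 0, 20, 0, 162, 57, 23, 10, 35, 179, 0, 0],
    [1016, 53, 4.32, 36, 40, 0, 19, 0, 171, 62, 19, 12, 0, 183, 0, 0],
    [462, 58, 4.10, 39, 41, 0, 35, 0, 172, 58, 31, 16, 25, 185, 0, 1],
    [1187, 53, 4.07, 38, 44, 0, 20, 0, 174, 68, 26, 14, 25, 189, 0, 0],
    [553, 54, 3.94, 37, 42, 0, 24, 0, 175, 66, 30, 12, 0, 184, 0, 0],
]
sample = pd.DataFrame(rows, columns=columns)

# Number of rows and columns
print(sample.shape)

# Data type pandas inferred for each column
print(sample.dtypes)

# First and last rows (the argument sets how many rows to show)
print(sample.head(3))
print(sample.tail(2))

# Summary statistics for a single column
print(sample["Birthweight"].describe())
```

`head()` and `tail()` are useful for a quick sanity check that the file was read as expected, while `dtypes` and `describe()` help spot columns that were read as text or that contain implausible values.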

This dataset contains information on newborn babies and their parents. It contains mostly continuous variables (although some take only a few values, e.g. number of cigarettes smoked per day) and is most useful for correlation and regression.

Main dependent variable = Birthweight (kg)

| Name | Variable | Data type |
|---|---|---|
| ID | Baby number | |
| Length | Length of baby (cm) | Scale |
| Birthweight | Weight of baby (kg) | Scale |
| Headcirc | Head circumference | Scale |
| Gestation | Gestation (weeks) | Scale |
| smoker | Mother smokes: 1 = smoker, 0 = non-smoker | Binary |
| mage | Maternal age | Scale |
| mnocig | Number of cigarettes smoked per day by mother | Scale |
| mheight | Mother's height (cm) | Scale |
| mppwt | Mother's pre-pregnancy weight (kg) | Scale |
| fage | Father's age | Scale |
| fedyrs | Father's years in education | Scale |
| fnocig | Number of cigarettes smoked per day by father | Scale |
| fheight | Father's height (cm) | Scale |
| lowbwt | Low birth weight: 0 = no, 1 = yes | Binary |
| mage35 | Mother over 35: 0 = no, 1 = yes | Binary |


| | ID | Length | Birthweight | Headcirc | Gestation | smoker | mage | mnocig | mheight | mppwt | fage | fedyrs | fnocig | fheight | lowbwt | mage35 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 792 | 53 | 3.64 | 38 | 40 | 1 | 20 | 2 | 170 | 59 | 24 | 12 | 12 | 185 | 0 | 0 |
| 21 | 1388 | 51 | 3.14 | 33 | 41 | 1 | 22 | 7 | 160 | 53 | 24 | 16 | 12 | 176 | 0 | 0 |
| 22 | 575 | 50 | 2.78 | 30 | 37 | 1 | 19 | 7 | 165 | 60 | 20 | 14 | 0 | 183 | 0 | 0 |
| 23 | 569 | 50 | 2.51 | 35 | 39 | 1 | 22 | 7 | 159 | 52 | 23 | 14 | 25 | 200 | 1 | 0 |
| 24 | 1363 | 48 | 2.37 | 30 | 37 | 1 | 20 | 7 | 163 | 47 | 20 | 10 | 35 | 185 | 1 | 0 |

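
The idea from section 1, that common tests are special cases of $y = a \cdot x + b$, can be illustrated with the ten rows displayed so far (the first five and the last five). This is a rough sketch under stated assumptions: the arrays are typed in by hand from those displays, the rows are not a random sample of the dataset, and `np.polyfit` is used here only as a convenient least-squares line fitter, not as the method the lesson itself uses.

```python
import numpy as np

# Ten rows copied from the two displays above (not the full dataset)
gestation = np.array([44, 40, 41, 44, 42, 40, 41, 37, 39, 37], dtype=float)
birthweight = np.array([4.55, 4.32, 4.10, 4.07, 3.94,
                        3.64, 3.14, 2.78, 2.51, 2.37])
smoker = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)


def standardize(values):
    """Rescale to mean 0 and sample standard deviation 1."""
    return (values - values.mean()) / values.std(ddof=1)


# 1) Pearson correlation is the slope a when both variables are standardized
r = np.corrcoef(gestation, birthweight)[0, 1]
slope_std, intercept_std = np.polyfit(
    standardize(gestation), standardize(birthweight), 1
)
print("Pearson r:", r)
print("Slope on standardized data:", slope_std)

# 2) A two-group comparison is a line fitted to a 0/1 indicator:
#    b is the mean of group 0, and a is the difference between group means
slope_grp, intercept_grp = np.polyfit(smoker, birthweight, 1)
mean_non = birthweight[smoker == 0].mean()
mean_smk = birthweight[smoker == 1].mean()
print("Difference in means:", mean_smk - mean_non)
print("Slope on smoker indicator:", slope_grp)
print("Intercept (non-smoker mean):", intercept_grp)
```

In both cases the familiar statistic falls out of an ordinary straight-line fit, which is why a single modeling tool can stand in for several named tests.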

| | ID | Before | After4weeks | After8weeks | Margarine |
|---|---|---|---|---|---|
| 0 | 1 | 6.42 | 5.83 | 5.75 | B |
| 1 | 2 | 6.76 | 6.20 | 6.13 | A |
| 2 | 3 | 6.56 | 5.83 | 5.71 | B |
| 3 | 4 | 4.80 | 4.27 | 4.15 | A |
| 4 | 5 | 8.43 | 7.71 | 7.67 | B |

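
The `cholesterol_diff` column in the following table appears to be `After8weeks` minus `Before`, since that subtraction reproduces the displayed values. A minimal sketch of how such a column could be added, assuming a DataFrame named `chol` (a hypothetical name; the notebook's own variable is not shown here) built from the five rows above:

```python
import pandas as pd

# Five rows copied from the table above; `chol` is an illustrative name
chol = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Before": [6.42, 6.76, 6.56, 4.80, 8.43],
    "After4weeks": [5.83, 6.20, 5.83, 4.27, 7.71],
    "After8weeks": [5.75, 6.13, 5.71, 4.15, 7.67],
    "Margarine": ["B", "A", "B", "A", "B"],
})

# Column arithmetic is element-wise: one difference per row.
# Rounding to 2 decimals hides floating-point noise in the display.
chol["cholesterol_diff"] = (chol["After8weeks"] - chol["Before"]).round(2)

print(chol)
```

A negative difference means cholesterol was lower after eight weeks than before.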

| | ID | Before | After4weeks | After8weeks | Margarine | cholesterol_diff |
|---|---|---|---|---|---|---|
| 0 | 1 | 6.42 | 5.83 | 5.75 | B | -0.67 |
| 1 | 2 | 6.76 | 6.20 | 6.13 | A | -0.63 |
| 2 | 3 | 6.56 | 5.83 | 5.71 | B | -0.85 |
| 3 | 4 | 4.80 | 4.27 | 4.15 | A | -0.65 |
| 4 | 5 | 8.43 | 7.71 | 7.67 | B | -0.76 |

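
With one difference per subject, a paired comparison reduces to a one-mean question: is the average difference distinguishable from zero? In linear-model terms this is the intercept-only model $y = b$. The sketch below computes the t statistic by hand with NumPy on the five differences shown above. It is an illustration on a tiny excerpt, not an analysis of the full dataset, and the variable names are assumptions.

```python
import numpy as np

# The five cholesterol_diff values displayed above
diffs = np.array([-0.67, -0.63, -0.85, -0.65, -0.76])

n = diffs.size
mean_diff = diffs.mean()

# Standard error of the mean uses the sample standard deviation (ddof=1)
std_err = diffs.std(ddof=1) / np.sqrt(n)

# One-sample t statistic against a null mean of 0, with n - 1 degrees of freedom
t_stat = mean_diff / std_err
dof = n - 1

# The same mean is the least-squares intercept of the model y = b
design = np.ones((n, 1))
coef, *_ = np.linalg.lstsq(design, diffs, rcond=None)
intercept = coef[0]

print("Mean difference:", mean_diff)
print("Intercept of y = b:", intercept)
print("t statistic:", t_stat, "with", dof, "degrees of freedom")
```

For a p-value, `scipy.stats.ttest_1samp(diffs, 0)` performs the same one-sample test, and `scipy.stats.ttest_rel` applied to the two original columns is the equivalent paired form.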