{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Statistics example problems\n", "\n", "[1. ANOVA - geological example](#1.-ANOVA-geological-example)\n", "\n", "[2. Non-parametric tests](#2.-Non-parametric-tests)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "import pandas as pd\n", "import pingouin as pg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. ANOVA geological example\n", "\n", "#### a. Implementation\n", "\n", "Data come from Table 10.1 of McKillup and Dyar, Geostatistics Explained, Cambridge University Press, 2010 (excerpt available on class Google Drive). Values represent the weight percent of MgO present in tourmalines from three locations in Maine. \n", "\n", "Use two different methods to test whether there is a significant difference in the mean MgO content between the three different sites.\n", "\n", "##### Method 1: Scipy" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('data/MgO_example/MgO_Maine.csv') # dataframe" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Mount MicaSebago BatholithBlack Mountain
0741
1852
21074
31185
\n", "
" ], "text/plain": [ " Mount Mica Sebago Batholith Black Mountain\n", "0 7 4 1\n", "1 8 5 2\n", "2 10 7 4\n", "3 11 8 5" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "F_onewayResult(statistic=10.799999999999999, pvalue=0.004058306777237465)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.f_oneway(df['Mount Mica'],df['Sebago Batholith'],df['Black Mountain'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Method 2: Pingouin" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df2 = pd.read_csv('data/MgO_example/MgO_Maine_list.csv')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MgOLocation
07Mount Mica
18Mount Mica
210Mount Mica
311Mount Mica
44Sebago Batholith
55Sebago Batholith
67Sebago Batholith
78Sebago Batholith
81Black Mountain
92Black Mountain
104Black Mountain
115Black Mountain
\n", "
" ], "text/plain": [ " MgO Location\n", "0 7 Mount Mica\n", "1 8 Mount Mica\n", "2 10 Mount Mica\n", "3 11 Mount Mica\n", "4 4 Sebago Batholith\n", "5 5 Sebago Batholith\n", "6 7 Sebago Batholith\n", "7 8 Sebago Batholith\n", "8 1 Black Mountain\n", "9 2 Black Mountain\n", "10 4 Black Mountain\n", "11 5 Black Mountain" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sourceddof1ddof2Fp-uncnp2
0Location2910.80.0040580.705882
\n", "
" ], "text/plain": [ " Source ddof1 ddof2 F p-unc np2\n", "0 Location 2 9 10.8 0.004058 0.705882" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pg.anova(data=df2,dv='MgO',between='Location')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Post-hoc test" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABmean(A)mean(B)diffseTp-tukeyhedges
0Black MountainMount Mica39-61.290994-4.647580.001000-2.857683
1Black MountainSebago Batholith36-31.290994-2.323790.061925-1.428841
2Mount MicaSebago Batholith9631.2909942.323790.0619251.428841
\n", "
" ], "text/plain": [ " A B mean(A) mean(B) diff se \\\n", "0 Black Mountain Mount Mica 3 9 -6 1.290994 \n", "1 Black Mountain Sebago Batholith 3 6 -3 1.290994 \n", "2 Mount Mica Sebago Batholith 9 6 3 1.290994 \n", "\n", " T p-tukey hedges \n", "0 -4.64758 0.001000 -2.857683 \n", "1 -2.32379 0.061925 -1.428841 \n", "2 2.32379 0.061925 1.428841 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pg.pairwise_tukey(data=df2,dv='MgO',between='Location')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b. ANOVA interpretation\n", "\n", "Write a summary of your interpretation of the statistical results conducted above. Address the following questions.\n", "\n", "* What is the null hypothesis being tested?\n", "* Should the null hypothesis be accepted or rejected?\n", "* What does the post-hoc test tell you?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Non-parametric tests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### a. Wilcoxon signed-rank test: implementation\n", "\n", "This example uses data from:\n", "http://www.biostathandbook.com/wilcoxonsignedrank.html\n", "\n", "The data are observations of aluminum content in 13 different poplar clones in a polluted area. The scientific question is whether there is a significant change in the aluminum content between August and November." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "df_al = pd.read_csv('data/wilcoxon_example/Al_content.csv',\n", " delimiter='\\t')" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CloneAugustNovember
0Columbia River18.312.7
1Fritzi Pauley13.311.1
2Hazendans16.515.3
3Primo12.612.7
4Raspalje9.510.5
5Hoogvorst13.615.6
6Balsam Spire8.111.2
7Gibecq8.914.2
8Beaupre10.016.3
9Unal8.315.5
10Trichobel7.919.9
11Gaver8.120.4
12Wolterson13.436.8
\n", "
" ], "text/plain": [ " Clone August November\n", "0 Columbia River 18.3 12.7\n", "1 Fritzi Pauley 13.3 11.1\n", "2 Hazendans 16.5 15.3\n", "3 Primo 12.6 12.7\n", "4 Raspalje 9.5 10.5\n", "5 Hoogvorst 13.6 15.6\n", "6 Balsam Spire 8.1 11.2\n", "7 Gibecq 8.9 14.2\n", "8 Beaupre 10.0 16.3\n", "9 Unal 8.3 15.5\n", "10 Trichobel 7.9 19.9\n", "11 Gaver 8.1 20.4\n", "12 Wolterson 13.4 36.8" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_al" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAKn0lEQVR4nO3cf4jk913H8dfbu0hLG2g1awlN1hURoQRN5Ih/REoNRdJGrApKA0oF4fzDQgqCRv+x9a8oWvxHxNMGo9aWQFotDf4I2FADmnoX05p4LZZyamzIJYTS3D9K0rd/7Fxyve7eTpKb3ffOPR6w3OzM92beHz63T+a+M7PV3QFgrm876AEAuDShBhhOqAGGE2qA4YQaYLijq7jTa665pre2tlZx1wBr6dSpU89298ZOt60k1FtbWzl58uQq7hpgLVXVf+52m1MfAMMJNcBwQg0wnFADDCfUAMMJNcBwS709r6rOJHk+yYtJXujuY6scCoCXvZL3Uf9odz+7skkA2JFTHwDDLfuMupP8fVV1kj/q7hMXH1BVx5McT5LNzc1XPdDWXQ+86r/7Wpy5+/YDeVyAvSz7jPqW7v6hJO9K8stV9faLD+juE919rLuPbWzs+HF1AF6FpULd3V9d/Hk2ySeT3LzKoQB42Z6hrqo3VNXV5y8n+bEkj696MAC2LXOO+i1JPllV54//y+7+25VOBcBL9gx1d38lyQ/uwywA7MDb8wCGE2qA4YQaYDihBhhOqAGGE2qA4YQaYDihBhhOqAGGE2qA4YQaYDihBhhOqAGGE2qA4YQaYDihBhhOqAGGE2qA4YQaYDihBhhOqAGGE2qA4YQaYDihBhhOqAGGE2qA4YQaYDihBhhOqAGGE2qA4YQaYDihBhhu6VBX1ZGq+teq+vQqBwLgm72SZ9R3Jjm9qkEA2NlSoa6q65LcnuRPVjsOABc7uuRxv5/kV5NcvdsBVXU8yfEk2dzcfM2D7betux44sMc+c/ftB/bYwHx7PqOuqh9Pcra7T13quO4+0d3HuvvYxsbGZRsQ4Eq3zKmPW5L8RFWdSfLxJLdW1V+sdCoAXrJnqLv717v7uu7eSvLeJP/Q3T+38skASOJ91ADjLftiYpKkux9K8tBKJgFgR55RAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMNyeoa6q11XV56rq81X1RFV9aD8GA2Db0SWO+d8kt3b3uaq6KsnDVfU33f3PK54NgCwR6u7uJOcW3161+OpVDgXAy5Y6R11VR6rqsSRnkzzY3Y+sdCoAXrJUqLv7xe6+Mcl1SW6uqhsuPqaqjlfVyao6+cwzz1zmMQGuXK/oXR/d/bUkDyW5bYfbTnT3se4+trGxcXmmA2Cpd31sVNWbFpdfn+SdSb644rkAWFjmXR/XJrm3qo5kO+z3dfenVzsWAOct866PLyS5aR9mAWAHPpkIMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMNyeoa6q66vqM1V1uqqeqKo792MwALYdXeKYF5L8Snc/WlVXJzlVVQ9297+veDYAssQz6u5+qrsfXVx+PsnpJG9d9WAAbFvmGfVLqmoryU1JHtnhtuNJjifJ5ubm5ZiNFdu664EDe+wzd99+YI8Nh83SLyZW1RuT3J/kA9399Ytv7+4T3X2su49tbGxczhkBrmhLhbqqrsp2pD/a3Z9Y7UgAXGiZd31Uko8kOd3dH179SABcaJln1Lck+fkkt1bVY4uvd694LgAW9nwxsbsfTlL7MAsAO/DJRIDhhBpgOKEGGE6oAYYTaoDhhBpgOKEGGE6oAYYTaoDhhBpgOKEGGE6oAYYTaoDhhBpgOKEGGE6oAYYTaoDhhBpgOKEGGE6oAYYTaoDhhBpgOKEGGE6oAYYTaoDhhBpgOKEGGE6oAYYTaoDhhBpgOKEGGE6oAYbbM9RVdU9Vna2qx/djIAC+2TLPqP80yW0rngOAXewZ6u7+bJLn9mEWAHZw9HLdUVUdT3I8STY3Ny/X3V4Rtu564KBHgLVyUD9TZ+6+fSX3e9leTOzuE919rLuPbWxsXK67BbjiedcHwHBCDTDcMm/P+1iSf0ry/VX1ZFX94urHAuC8PV9M7O479mMQAHbm1AfAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEINMJxQAwwn1ADDCTXAcEuFuqpuq6ovVdWXq+quVQ8FwMv2DHVVHUnyB0neleRtSe6oqretejAAti3zjPrmJF/u7q909/8l+XiS96x2LADOO7rEMW9N8t8XfP9kkh+++KCqOp7k+OLbc1X1pT3u95okzy4z5CFnnTuo317hJKt1JeznlbDGZAXrfI3/rr97txuWCXXtcF1/yxXdJ5KcWHaiqjrZ3ceWPf6wss71ciWs80pYY3K41rnMqY8nk1x/wffXJfnqasYB4GLLhPpfknxfVX1PVX17kvcm+dRqxwLgvD1PfXT3C1X1/iR/l+RIknu6+4nL8NhLnyY55KxzvVwJ67wS1pgconVW97ecbgZgEJ9MBBhOqAGG25dQV9U9VXW2qh6/4LrvqKoHq+o/Fn++eT9mWaVd1vnBqvqfqnps8fXug5zxtaqq66vqM1V1uqqeqKo7F9ev1X5eYp3rtp+vq6rPVdXnF+v80OL6ddvP3dZ5KPZzX85RV9Xbk5xL8mfdfcPiut9J8lx33734/SFv7u5fW/kwK7TLOj+Y5Fx3/+5Bzna5VNW1Sa7t7ker6uokp5L8ZJJfyBrt5yXW+bNZr/2sJG/o7nNVdVWSh5PcmeSns177uds6b8sh2M99eUbd3Z9N8txFV78nyb2Ly/dm+4fgUNtlnWulu5/q7kcXl59Pcjrbn15dq/28xDrXSm87t/j2qsVXZ/32c7d1HgoHeY76Ld39VLL9Q5Hkuw5wllV7f1V9YXFq5FD/F/JCVbWV5KYkj2SN9/OidSZrtp9VdaSqHktyNsmD3b2W+7nLOpNDsJ9eTFy9P0zyvUluTPJUkt870Gkuk6p6Y5L7k3ygu79+0POsyg7rXLv97O4Xu/vGbH/q+OaquuGAR1qJXdZ5KPbzIEP99OI84PnzgWcPcJaV6e6nF/9AvpHkj7P92wgPtcU5vvuTfLS7P7G4eu32c6d1ruN+ntfdX0vyULbP267dfp534ToPy34eZKg/leR9i8vvS/LXBzjLypz/x77wU0ke3+3Yw2DxosxHkpzu7g9fcNNa7edu61zD/dyoqjctLr8+yTuTfDHrt587rvOw7Od+vevjY0neke1fK/h0kt9M8ldJ7kuymeS/kvxMdx/qF+J2Wec7sv3fqk5yJskvnT/3dxhV1Y8k+cck/5bkG4urfyPb52/XZj8vsc47sl77+QPZfrHwSLafuN3X3b9VVd+Z9drP3db55zkE++kj5ADDeTERYDihBhhOqAGGE2qA4YQaYDihBhhOqAGG+38/s+3T2oZmYwAAAABJRU5ErkJggg==", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure()\n", "plt.hist(df_al['November']);" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "SkewtestResult(statistic=3.449022139607473, pvalue=0.0005626205706886182)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.skewtest(df_al['November'])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NormaltestResult(statistic=21.55304457655946, pvalue=2.0884103462437462e-05)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.normaltest(df_al['November'])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "WilcoxonResult(statistic=16.0, pvalue=0.039794921875)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stats.wilcoxon(df_al['August'],df_al['November'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b. Interpretation\n", "\n", "Under what situations are non-parametric statistics useful? What are the potential drawbacks in using non-parametric statistics when a parametric approach is justified?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3.10.6 ('data-book')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:38:29) [Clang 13.0.1 ]" }, "vscode": { "interpreter": { "hash": "0ef88d3abb6b62f34a20525ce337090c4512fe8aecf32c74604482b944e1c3bd" } } }, "nbformat": 4, "nbformat_minor": 4 }