{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis of results of the 2015 FINA World Swimming Championships\n", "> In this chapter, you will practice your EDA, parameter estimation, and hypothesis testing skills on the results of the 2015 FINA World Swimming Championships. This is the Summary of lecture \"Case Studies in Statistical Thinking\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Statistics]\n", "- image: images/swim_slowdown.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import dc_stat_think as dcst\n", "\n", "plt.rcParams['figure.figsize'] = (10, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to swimming data\n", "- Strokes at the World Championships\n", " - Freestyle\n", " - Breaststroke\n", " - Butterfly\n", " - Backstroke\n", "- Events at the World Championships\n", " - Defined by gender, distance, stroke\n", "- Rounds of events\n", " - Heats: First round\n", " - Semifinals: Penultimate round in some events\n", " - Finals: The final round; the winner is champion\n", "- Data source\n", " - Data are freely available from [OMEGA](http://www.omegatiming.com)\n", "- Domain-specific knowledge\n", " - Imperative\n", " - An absolute pleasure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graphical EDA of men's 200 free heats\n", "In the heats, all contestants swim, the very fast and the very slow. To explore how the swim times are distributed, plot an ECDF of the men's 200 freestyle.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
athleteidlastnamefirstnamebirthdategendernamecodeeventidheatlane...swimtimesplitcumswimtimesplitdistancedaytimerounddistancerelaycountstrokesplitswimtime
0100784BORSHINOEL1996-02-13FAlbaniaALB114...63.65129.6350930.0PRE1001FLY29.63
1100784BORSHINOEL1996-02-13FAlbaniaALB114...63.65263.65100930.0PRE1001FLY34.02
2100784BORSHINOEL1996-02-13FAlbaniaALB2018...140.28131.33501014.0PRE2001FLY31.33
3100784BORSHINOEL1996-02-13FAlbaniaALB2018...140.28266.811001014.0PRE2001FLY35.48
4100784BORSHINOEL1996-02-13FAlbaniaALB2018...140.283103.291501014.0PRE2001FLY36.48
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " athleteid lastname firstname birthdate gender name code eventid \\\n", "0 100784 BORSHI NOEL 1996-02-13 F Albania ALB 1 \n", "1 100784 BORSHI NOEL 1996-02-13 F Albania ALB 1 \n", "2 100784 BORSHI NOEL 1996-02-13 F Albania ALB 20 \n", "3 100784 BORSHI NOEL 1996-02-13 F Albania ALB 20 \n", "4 100784 BORSHI NOEL 1996-02-13 F Albania ALB 20 \n", "\n", " heat lane ... swimtime split cumswimtime splitdistance daytime \\\n", "0 1 4 ... 63.65 1 29.63 50 930.0 \n", "1 1 4 ... 63.65 2 63.65 100 930.0 \n", "2 1 8 ... 140.28 1 31.33 50 1014.0 \n", "3 1 8 ... 140.28 2 66.81 100 1014.0 \n", "4 1 8 ... 140.28 3 103.29 150 1014.0 \n", "\n", " round distance relaycount stroke splitswimtime \n", "0 PRE 100 1 FLY 29.63 \n", "1 PRE 100 1 FLY 34.02 \n", "2 PRE 200 1 FLY 31.33 \n", "3 PRE 200 1 FLY 35.48 \n", "4 PRE 200 1 FLY 36.48 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "swim = pd.read_csv('./dataset/2015_FINA.csv', skiprows=4)\n", "swim.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "mens_200_free_heats_df = swim[(swim['gender'] == 'M') & \n", " (swim['distance'] == 200) & \n", " (swim['stroke'] == 'FREE') &\n", " (swim['round'] == 'PRE') &\n", " (swim['split'] == 4)]\n", "mens_200_free_heats = mens_200_free_heats_df['cumswimtime'].unique()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Generate x and y values for ECDF: x, y\n", "x, y = dcst.ecdf(mens_200_free_heats)\n", "\n", "# Plot the ECDF as dots\n", "_ = plt.plot(x, y, marker='.', linestyle='none')\n", "\n", "# Label axes and show plot\n", "_ = plt.xlabel('time (s)')\n", "_ = plt.ylabel('ECDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that fast swimmers are below 115 seconds, with a smattering of slow swimmers past that, including one very slow swimmer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 200 m free time with confidence interval\n", "Now, you will practice parameter estimation and computation of confidence intervals by computing the mean and median swim time for the men's 200 freestyle heats. The median is useful because it is immune to heavy tails in the distribution of swim times, such as the slow swimmers in the heats." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "mean time: 111.63 sec.\n", "95% conf int of mean: [110.49, 112.92] sec.\n", "\n", "median time: 110.04 sec.\n", "95% conf int of median: [108.96, 111.01] sec.\n", "\n" ] } ], "source": [ "# Compute mean and median swim times\n", "mean_time = np.mean(mens_200_free_heats)\n", "median_time = np.median(mens_200_free_heats)\n", "\n", "# Draw 10,000 bootstrap replicates of the mean and median\n", "bs_reps_mean = dcst.draw_bs_reps(mens_200_free_heats, np.mean, size=10000)\n", "bs_reps_median = dcst.draw_bs_reps(mens_200_free_heats, np.median, size=10000)\n", "\n", "# Compute the 95% confidence intervals\n", "conf_int_mean = np.percentile(bs_reps_mean, [2.5, 97.5])\n", "conf_int_median = np.percentile(bs_reps_median, [2.5, 97.5])\n", "\n", "# Print the result to the screen\n", "print(\"\"\"\n", "mean time: {0:.2f} sec.\n", "95% conf int of mean: [{1:.2f}, {2:.2f}] sec.\n", "\n", 
"median time: {3:.2f} sec.\n", "95% conf int of median: [{4:.2f}, {5:.2f}] sec.\n", "\"\"\".format(mean_time, *conf_int_mean, median_time, *conf_int_median))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Do swimmers go faster in the finals?\n", "- Question : Do swimmers swim faster in the finals than in other rounds?\n", " - Individual swimmers, or the whole field?\n", " - Faster than heats? Faster than semifinals?\n", " - For what strokes? For what distances?\n", " \n", "- Question: Do individual female swimmers swim faster in the finals compared to the semifinals?\n", " - Events: 50, 100, 200 meter freestyle, breaststroke, butterfly, backstroke\n", " \n", "- Fractional improvement\n", "\n", "$$ f = \\frac{\\text{semifinals time} - \\text{finals time}}{\\text{semifinals time}} $$\n", "\n", "- Sharpened questions\n", " - What is the frational improvement of individual female swimmers from the semifinals to the finals?\n", " - Is the observed fractional improvement commensurate with there being no difference in performance in the semifinals and finals?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA: finals versus semifinals\n", "First, you will get an understanding of how athletes' performance changes from the semifinals to the finals by computing the fractional improvement from the semifinals to finals and plotting an ECDF of all of these values.\n", "\n", "The arrays `final_times` and `semi_times` contain the swim times of the respective rounds. The arrays are aligned such that `final_times[i]` and `semi_times[i]` are for the same swimmer/event. If you are interested in the strokes/events, you can check out the data frame df in your namespace, which has more detailed information, but is not used in the analysis." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
athleteidlastnamefirstnamebirthdategendernamecodeeventidheatlane...swimtimesplitcumswimtimesplitdistancedaytimerounddistancerelaycountstrokesplitswimtime
303100537CAMPBELLBRONTE1994-05-14FAustraliaAUS22325...53.00253.001001732.0SEM1001FREE27.44
305100537CAMPBELLBRONTE1994-05-14FAustraliaAUS12313...52.52252.521001732.0FIN1001FREE27.37
307100537CAMPBELLBRONTE1994-05-14FAustraliaAUS23426...24.32124.32501828.0SEM501FREE24.32
308100537CAMPBELLBRONTE1994-05-14FAustraliaAUS13416...24.12124.12501805.0FIN501FREE24.12
315100631CAMPBELLCATE1992-05-20FAustraliaAUS22314...52.84252.841001732.0SEM1001FREE27.49
\n", "

5 rows × 22 columns

\n", "
" ], "text/plain": [ " athleteid lastname firstname birthdate gender name code \\\n", "303 100537 CAMPBELL BRONTE 1994-05-14 F Australia AUS \n", "305 100537 CAMPBELL BRONTE 1994-05-14 F Australia AUS \n", "307 100537 CAMPBELL BRONTE 1994-05-14 F Australia AUS \n", "308 100537 CAMPBELL BRONTE 1994-05-14 F Australia AUS \n", "315 100631 CAMPBELL CATE 1992-05-20 F Australia AUS \n", "\n", " eventid heat lane ... swimtime split cumswimtime splitdistance \\\n", "303 223 2 5 ... 53.00 2 53.00 100 \n", "305 123 1 3 ... 52.52 2 52.52 100 \n", "307 234 2 6 ... 24.32 1 24.32 50 \n", "308 134 1 6 ... 24.12 1 24.12 50 \n", "315 223 1 4 ... 52.84 2 52.84 100 \n", "\n", " daytime round distance relaycount stroke splitswimtime \n", "303 1732.0 SEM 100 1 FREE 27.44 \n", "305 1732.0 FIN 100 1 FREE 27.37 \n", "307 1828.0 SEM 50 1 FREE 24.32 \n", "308 1805.0 FIN 50 1 FREE 24.12 \n", "315 1732.0 SEM 100 1 FREE 27.49 \n", "\n", "[5 rows x 22 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "women_swim_df = swim[(swim['gender'] == \"F\") & \n", " (swim['stroke'] != \"MEDLEY\") & \n", " (swim['distance'].isin([100, 50, 200])) & \n", " (swim['round'].isin(['SEM', 'FIN'])) & \n", " (swim['splitdistance'] == swim['distance'])]\n", "women_swim_df.head(n = 5)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "women_swim_df = women_swim_df[['athleteid', 'stroke', 'distance', 'lastname', 'cumswimtime', 'round']]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "women_swim_df_fin = women_swim_df.loc[(women_swim_df['round'] == 'FIN')]\n", "women_swim_df_sem = women_swim_df.loc[(women_swim_df['round'] == 'SEM')]\n", "\n", "women_swim_df_w = women_swim_df_fin.merge(women_swim_df_sem, how = 'left', on = ['athleteid', 'stroke', 'distance', 'lastname'])\n", "\n", "df = women_swim_df_w.rename(index = str, columns = {\"cumswimtime_x\" : \"final_swimtime\", 
\"cumswimtime_y\" : \"semi_swimtime\"})\n", "df = df[['athleteid', 'stroke', 'distance', 'lastname', 'final_swimtime', 'semi_swimtime']]\n", "\n", "final_times = df['final_swimtime'].values\n", "semi_times = df['semi_swimtime'].values" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
athleteidstrokedistancelastnamefinal_swimtimesemi_swimtime
0100537FREE100CAMPBELL52.5253.00
1100537FREE50CAMPBELL24.1224.32
2100631FREE100CAMPBELL52.8252.84
3100631FREE50CAMPBELL24.3624.22
4100650FLY100MCKEON57.6757.59
.....................
91105595BACK200FRANKLIN126.34127.79
92105607BREAST50HARDY30.2030.25
93105640FLY200MCLAUGHLIN126.95127.52
94105676BACK100BAKER59.9959.63
95105686FLY200ADAMS126.40127.57
\n", "

96 rows × 6 columns

\n", "
" ], "text/plain": [ " athleteid stroke distance lastname final_swimtime semi_swimtime\n", "0 100537 FREE 100 CAMPBELL 52.52 53.00\n", "1 100537 FREE 50 CAMPBELL 24.12 24.32\n", "2 100631 FREE 100 CAMPBELL 52.82 52.84\n", "3 100631 FREE 50 CAMPBELL 24.36 24.22\n", "4 100650 FLY 100 MCKEON 57.67 57.59\n", ".. ... ... ... ... ... ...\n", "91 105595 BACK 200 FRANKLIN 126.34 127.79\n", "92 105607 BREAST 50 HARDY 30.20 30.25\n", "93 105640 FLY 200 MCLAUGHLIN 126.95 127.52\n", "94 105676 BACK 100 BAKER 59.99 59.63\n", "95 105686 FLY 200 ADAMS 126.40 127.57\n", "\n", "[96 rows x 6 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Compute fractional difference in time between finals and semis\n", "f = (df['semi_swimtime'] - df['final_swimtime']) / df['semi_swimtime']\n", "\n", "# Generate x and y values for the ECDF: x, y\n", "x, y = dcst.ecdf(f)\n", "\n", "# Make a plot of the ECDF\n", "_ = plt.plot(x, y, marker='.', linestyle='none')\n", "\n", "# Label axes\n", "_ = plt.xlabel('f')\n", "_ = plt.ylabel('ECDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The median of the ECDF is juuuust above zero. But at first glance, it does not look like there is much of any difference between semifinals and finals. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameter estimates of difference between finals and semifinals\n", "Compute the mean fractional improvement from the semifinals to finals, along with a 95% confidence interval of the mean." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "mean frac. diff.: 0.00040\n", "95% conf int of mean frac. diff.: [-0.00093, 0.00173]\n" ] } ], "source": [ "# Mean fractional time difference: f_mean\n", "f_mean = np.mean(f)\n", "\n", "# Get bootstrap reps of means: bs_reps\n", "bs_reps = dcst.draw_bs_reps(f, np.mean, size=10000)\n", "\n", "# Compute confidence intervals: conf_int\n", "conf_int = np.percentile(bs_reps, [2.5, 97.5])\n", "\n", "# Report\n", "print(\"\"\"\n", "mean frac. diff.: {0:.5f}\n", "95% conf int of mean frac. diff.: [{1:.5f}, {2:.5f}]\"\"\".format(f_mean, *conf_int))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like the mean finals time is juuuust faster than the mean semifinal time, and they very well may be the same. We'll test this hypothesis next." 
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to do the permutation test\n", "Based on our EDA and parameter estimates, it is tough to discern improvement from the semifinals to finals. In the next exercise, you will test the hypothesis that there is no difference in performance between the semifinals and finals. A permutation test is fitting for this. We'll get test statistics with the following strategy:\n", "- Take an array of semifinal times and an array of final times for each swimmer for each stroke/distance pair.\n", "- Go through each array, and for each index, swap the entry in the respective final and semifinal array with a 50% probability.\n", "- Use the resulting final and semifinal arrays to compute f and then the mean of f.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating permutation samples\n", "As you worked out in the last exercise, we need to generate a permutation sample by randomly swapping corresponding entries in the `semi_times` and `final_times` arrays. Write a function with signature `swap_random(a, b)` that returns arrays where random indices have the entries in `a` and `b` swapped." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def swap_random(a, b):\n", "    \"\"\"Randomly swap entries in two arrays\"\"\"\n", "    # Boolean mask: each index is selected for swapping with probability 0.5\n", "    swap_inds = np.random.random(size=len(a)) < 0.5\n", "    \n", "    # Make copies of arrays a and b for output\n", "    a_out = np.copy(a)\n", "    b_out = np.copy(b)\n", "    \n", "    # Swap values\n", "    a_out[swap_inds] = b[swap_inds]\n", "    b_out[swap_inds] = a[swap_inds]\n", "    \n", "    return a_out, b_out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hypothesis test: Do women swim the same way in semis and finals?\n", "Test the hypothesis that performance in the finals and semifinals is identical, using the mean of the fractional improvement as your test statistic. 
The test statistic under the null hypothesis is considered to be at least as extreme as what was observed if it is greater than or equal to `f_mean`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p = 0.255\n" ] } ], "source": [ "# Set up array of permutation replicates\n", "perm_reps = np.empty(1000)\n", "\n", "for i in range(1000):\n", "    # Generate a permutation sample\n", "    semi_perm, final_perm = swap_random(semi_times, final_times)\n", "    \n", "    # Compute f from the permutation sample (kept separate from the observed f)\n", "    f_perm = (semi_perm - final_perm) / semi_perm\n", "    \n", "    # Compute and store permutation replicate\n", "    perm_reps[i] = np.mean(f_perm)\n", "    \n", "# Compute and print p-value\n", "print('p =', np.sum(perm_reps >= f_mean) / 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The p-value is large, about 0.26 in this run, which suggests that the results of the 2015 World Championships are consistent with there being no difference in performance between the finals and semifinals."
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How does the performance of swimmers decline over long events?\n", "- Swimming background\n", "  - Split: The time it takes to swim one length of the pool\n", "- Quantifying slowdown\n", "  - Use women's 800 m freestyle heats\n", "  - Omit first and last 100 meters\n", "  - Compute mean split time for each split number\n", "  - Perform linear regression to get slowdown per split\n", "  - Perform hypothesis test: can the slowdown be explained by random variation?\n", "- Hypothesis tests for correlation\n", "  - Posit null hypothesis: split time and split number are completely uncorrelated\n", "  - Simulate data assuming null hypothesis is true\n", "  - Use the Pearson correlation, $\\rho$, as the test statistic\n", "  - Compute p-value as the fraction of replicates that have $\\rho$ at least as large as observed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA: Plot all your data\n", "To get a graphical overview of a data set, it is often useful to plot all of your data. In this exercise, plot all of the splits for all female swimmers in the 800 meter heats. The data are available in the NumPy arrays `split_number` and `splits`. The arrays are organized such that `splits[i,j]` is the split time for swimmer i for `split_number[j]`."
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "free_800_w = swim.loc[(swim['gender'] == 'F') &\n", " (swim['stroke'] == 'FREE') &\n", " (swim['distance'] == 800) &\n", " (swim['round'].isin(['PRE'])) &\n", " (~swim['split'].isin([1,2,15,16]))]\n", "free_800_w = free_800_w[['split', 'splitswimtime']]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "splits = np.reshape(free_800_w['splitswimtime'].values, (-1, 12))\n", "split_number = free_800_w['split'].unique()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Plot the splits for each swimmer\n", "for splitset in splits:\n", " _ = plt.plot(split_number, splitset, linewidth=1, color='lightgray')\n", " \n", "# Compute the mean split times\n", "mean_splits = np.mean(splits, axis=0)\n", "\n", "# Plot the mean split time\n", "_ = plt.plot(split_number, mean_splits, marker='.', linewidth=3, markersize=12)\n", "\n", "# Label axes\n", "_ = plt.xlabel('split number')\n", "_ = plt.ylabel('split time (s)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that there is wide variability in the splits among the swimmers, and what appears to be a slight trend toward slower split times." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear regression of average split time\n", "We will assume that the swimmers slow down in a linear fashion over the course of the 800 m event. The slowdown per split is then the slope of the mean split time versus split number plot. Perform a linear regression to estimate the slowdown per split and compute a pairs bootstrap 95% confidence interval on the slowdown. Also show a plot of the best fit line.\n", "\n", "> Note: We can compute error bars for the mean split times and use those in the regression analysis, but we will not take those into account here, as that is beyond the scope of this course." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "mean slowdown: 0.065 sec./split\n", "95% conf int of mean slowdown: [0.052, 0.079] sec./split\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Perform regression\n", "slowdown, split_3 = np.polyfit(split_number, mean_splits, 1)\n", "\n", "# Compute pairs bootstrap\n", "bs_reps, _ = dcst.draw_bs_pairs_linreg(split_number, mean_splits, size=10000)\n", "\n", "# Compute confidence interval\n", "conf_int = np.percentile(bs_reps, [2.5, 97.5])\n", "\n", "# Plot the data with regressions line\n", "_ = plt.plot(split_number, mean_splits, marker='.', linestyle='none')\n", "_ = plt.plot(split_number, slowdown * split_number + split_3, '-')\n", "\n", "# Label axes and show plot\n", "_ = plt.xlabel('split number')\n", "_ = plt.ylabel('split time (s)')\n", "\n", "# Print the slowdown per split\n", "print(\"\"\"\n", "mean slowdown: {0:.3f} sec./split\n", "95% conf int of mean slowdown: [{1:.3f}, {2:.3f}] sec./split\"\"\".format(\n", " slowdown, *conf_int))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a small (about 6 hundreths of a second), but discernible, slowdown per split. We'll do a hypothesis test next." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hypothesis test: are they slowing down?\n", "Now we will test the null hypothesis that the swimmer's split time is not at all correlated with the distance they are at in the swim. We will use the Pearson correlation coefficient (computed using dcst.pearson_r()) as the test statistic." 
 ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p = 0.0\n" ] } ], "source": [ "# Observed correlation\n", "rho = dcst.pearson_r(split_number, mean_splits)\n", "\n", "# Initialize permutation reps\n", "perm_reps_rho = np.empty(10000)\n", "\n", "# Make permutation reps\n", "for i in range(10000):\n", "    # Scramble the split number array\n", "    scrambled_split_number = np.random.permutation(split_number)\n", "    \n", "    # Compute the Pearson correlation coefficient\n", "    perm_reps_rho[i] = dcst.pearson_r(scrambled_split_number, mean_splits)\n", "    \n", "# Compute and print p-value\n", "p_val = np.sum(perm_reps_rho >= rho) / 10000\n", "print('p =', p_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tiny effect is very real! With 10,000 replicates, we never got a correlation as big as observed under the hypothesis that the swimmers do not change speed as the race progresses. In fact, I did the test with a million replicates, and still never got a single replicate as big as the observed Pearson correlation coefficient." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }