{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Current Controversy of the 2013 World Championships\n", "> Some swimmers said that they felt it was easier to swim in one direction than the other at the 2013 World Championships. Some analysts have posited that there was a swirling current in the pool. In this chapter, you'll investigate this claim! References - Quartz Media, Washington Post, SwimSwam, and Cornett et al. This is a summary of the lecture \"Case Studies in Statistical Thinking\" on DataCamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Statistics]\n", "- image: images/linreg_swim_lane.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import dc_stat_think as dcst\n", "\n", "plt.rcParams['figure.figsize'] = (10, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to the current controversy\n", "- Task\n", "    - Investigate improvement of individual swimmers moving from low- to high-numbered lanes in 50 m events\n", "    - Compute the size of the effect\n", "    - Test the hypothesis that on average there is no difference between low- and high-numbered lanes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ECDF of improvement from low to high lanes\n", "Now that you have a metric for improvement going from low- to high-numbered lanes, plot an ECDF of this metric. I have put together the swim times of all swimmers who swam a 50 m semifinal in a high-numbered lane and the final in a low-numbered lane, and vice versa. The swim times are stored in the NumPy arrays `swimtime_high_lanes` and `swimtime_low_lanes`. Entry `i` in each array is for the same swimmer in the same event." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "swimtime_high_lanes = np.array([24.62, 22.9 , 27.05, 24.76, 30.31, 24.54, 26.12, 27.71, 23.15,\n", " 23.11, 21.62, 28.02, 24.73, 24.95, 25.83, 30.61, 27.04, 21.67,\n", " 27.16, 30.23, 21.51, 22.97, 28.05, 21.65, 24.54, 26.06])\n", "\n", "swimtime_low_lanes = np.array([24.66, 23.28, 27.2 , 24.95, 32.34, 24.66, 26.17, 27.93, 23.35,\n", " 22.93, 21.93, 28.33, 25.14, 25.19, 26.11, 31.31, 27.44, 21.85,\n", " 27.48, 30.66, 21.74, 23.22, 27.93, 21.42, 24.79, 26.46])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Compute the fractional improvement of being in high lane: f\n", "f = (swimtime_low_lanes - swimtime_high_lanes) / swimtime_low_lanes\n", "\n", "# Make x and y values for ECDF: x, y\n", "x, y = dcst.ecdf(f)\n", "\n", "# Plot the ECDF as dots\n", "_ = plt.plot(x, y, marker='.', linestyle='none')\n", "\n", "# Label the axes and show the plot\n", "_ = plt.xlabel('f')\n", "_ = plt.ylabel('ECDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oooo, this is starting to paint a picture of lane bias. The ECDF demonstrates that all but three of the 26 swimmers swam faster in the high-numbered lanes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Estimation of mean improvement\n", "You will now estimate how big this current effect is. Compute the mean fractional improvement for being in a high-numbered lane versus a low-numbered lane, along with a 95% confidence interval of the mean.\n", "\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "mean frac. diff.: 0.01051\n", "95% conf int of mean frac. diff.: [0.00630, 0.01596]\n" ] } ], "source": [ "# Compute the mean difference: f_mean\n", "f_mean = np.mean(f)\n", "\n", "# Draw 10,000 bootstrap replicates: bs_reps\n", "bs_reps = dcst.draw_bs_reps(f, np.mean, size=10000)\n", "\n", "# Compute 95% confidence interval: conf_int\n", "conf_int = np.percentile(bs_reps, [2.5, 97.5])\n", "\n", "# Print the result\n", "print(\"\"\"\n", "mean frac. diff.: {0:.5f}\n", "95% conf int of mean frac. diff.: [{1:.5f}, {2:.5f}]\"\"\".format(f_mean, *conf_int))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean improvement is about 1%, and the confidence interval lies entirely above zero, so it sure looks like swimmers are faster in lanes 6-8." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How should we test the hypothesis?\n", "Q: You are interested in the presence of lane bias toward higher lanes, presumably due to a slight current in the pool. A natural null hypothesis to test, then, is that the mean fractional improvement going from low to high lane numbers is zero. Which of the following is a good way to simulate this null hypothesis?\n", "\n", "A: Subtract the mean of `f` from `f` to generate `f_shift`. Then, take bootstrap replicates of the mean from this `f_shift`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hypothesis test: Does lane assignment affect performance?\n", "Perform a bootstrap hypothesis test of the null hypothesis that the mean fractional improvement going from low-numbered lanes to high-numbered lanes is zero. Take the mean fractional improvement as your test statistic, and take \"at least as extreme as\" to mean that the test statistic under the null hypothesis is greater than or equal to what was observed." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p = 0.00028\n" ] } ], "source": [ "# Shift f: f_shift\n", "f_shift = f - f_mean\n", "\n", "# Draw 100,000 bootstrap replicates of the mean: bs_reps\n", "bs_reps = dcst.draw_bs_reps(f_shift, np.mean, 100000)\n", "\n", "# Compute and report the p-value\n", "p_val = np.sum(bs_reps >= f_mean) / 100000\n", "print('p =', p_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A p-value of about 0.0003 is quite small and suggests that the mean fractional improvement is greater than zero. For fun, I tested the more restrictive hypothesis that lane number has no bearing at all on performance (item (1) in the previous MCQ), and I got an even smaller p-value of about 0.00001. You can perform that test, too, for practice if you like." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Did the 2015 event have this problem?\n", "You would like to know if this is a typical problem with pools in competitive swimming. To address this question, perform a similar analysis for the results of the 2015 FINA World Championships. That is, compute the mean fractional improvement for going from lanes 1-3 to lanes 6-8 for the 2015 competition, along with a 95% confidence interval on the mean. Also test the hypothesis that the mean fractional improvement is zero." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "swimtime_high_lanes_15 = np.array([27.7 , 24.64, 23.21, 23.09, 26.87, 30.74, 21.88, 24.5 , 21.86,\n", " 25.9 , 26.2 , 24.73, 30.13, 26.92, 24.31, 30.25, 26.76])\n", "swimtime_low_lanes_15 = np.array([27.66, 24.69, 23.29, 23.05, 26.87, 31.03, 22.04, 24.51, 21.86,\n", " 25.64, 25.91, 24.77, 30.14, 27.23, 24.31, 30.2 , 26.86])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "mean frac. diff.: 0.00079\n", "95% conf int of mean frac. diff.: [-0.00191, 0.00336]\n", "p-value: 0.28205\n" ] } ], "source": [ "# Compute f and its mean\n", "f = (swimtime_low_lanes_15 - swimtime_high_lanes_15) / swimtime_low_lanes_15\n", "f_mean = np.mean(f)\n", "\n", "# Draw 10,000 bootstrap replicates\n", "bs_reps = dcst.draw_bs_reps(f, np.mean, 10000)\n", "\n", "# Compute 95% confidence interval\n", "conf_int = np.percentile(bs_reps, [2.5, 97.5])\n", "\n", "# Shift f\n", "f_shift = f - f_mean\n", "\n", "# Draw 100,000 bootstrap replicates of the mean\n", "bs_reps = dcst.draw_bs_reps(f_shift, np.mean, 100000)\n", "\n", "# Compute the p-value\n", "p_val = np.sum(bs_reps >= f_mean) / 100000\n", "\n", "# Print the results\n", "print(\"\"\"\n", "mean frac. diff.: {0:.5f}\n", "95% conf int of mean frac. diff.: [{1:.5f}, {2:.5f}]\n", "p-value: {3:.5f}\"\"\".format(f_mean, *conf_int, p_val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both the confidence interval, which contains zero, and the p-value of about 0.28 suggest that there was no lane bias in 2015." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The zigzag effect\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Which splits should we consider?\n", "Q: As you proceed to quantitatively analyze the zigzag effect in the 1500 m, which splits should you include in your analysis?\n", "\n", "A: \n", "You should include all splits except the first two and the last two. You should neglect the last two because swimmers stop pacing themselves and \"kick\" for the final stretch. The first two are different because they involve jumping off the starting blocks and more underwater swimming than others." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA: mean differences between odd and even splits\n", "To investigate the differences between odd and even splits, you first need to define a difference metric. In previous exercises, you investigated the improvement of moving from a low-numbered lane to a high-numbered lane, defining $f = (t_a - t_b) / t_a$. There, the $t_a$ in the denominator served as the reference time for improvement. Here, you are considering both improvement and decline in performance depending on the direction of swimming, so you want the reference to be an average. So, we will define the fractional difference as $f = 2 \\frac{(t_a - t_b)}{(t_a + t_b)}$.\n", "\n", "Your task here is to plot the mean fractional difference between odd and even splits versus lane number." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "f_13 = np.array([-0.01562214, -0.0146381 , -0.00977673, -0.00525713, 0.00204104,\n", " 0.00381014, 0.0075664 , 0.01525869])\n", "f_15 = np.array([-0.00516018, -0.00392952, -0.00099284, 0.00059953, -0.002424 ,\n", " -0.00451099, 0.00047467, 0.00081962])\n", "lanes = np.array([1, 2, 3, 4, 5, 6, 7, 8])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Plot the fractional difference for 2013 and 2015\n", "_ = plt.plot(lanes, f_13, marker='.', markersize=12, linestyle='none')\n", "_ = plt.plot(lanes, f_15, marker='.', markersize=12, linestyle='none')\n", "\n", "# Add a legend\n", "_ = plt.legend((2013, 2015))\n", "\n", "# Label axes\n", "_ = plt.xlabel('lane')\n", "_ = plt.ylabel('frac. diff. (odd - even)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EDA has exposed a strong slope in 2013 compared to 2015!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How does the current effect depend on lane position?\n", "To quantify the effect of lane number on performance, perform a linear regression on the f_13 versus lanes data. Do a pairs bootstrap calculation to get a 95% confidence interval. Finally, make a plot of the regression.\n", "\n", "Note that we could compute error bars on the mean fractional differences and use them in the regression, but that is beyond the scope of this course." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "slope: 0.00447 per lane\n", "95% conf int: [0.00393, 0.00502] per lane\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Compute the slope and intercept of the frac. diff. vs. lane line\n", "slope, intercept = np.polyfit(lanes, f_13, 1)\n", "\n", "# Compute pairs bootstrap replicates of the slope and intercept\n", "bs_reps_slope, bs_reps_int = dcst.draw_bs_pairs_linreg(lanes, f_13, 10000)\n", "\n", "# Compute 95% confidence interval of slope\n", "conf_int = np.percentile(bs_reps_slope, [2.5, 97.5])\n", "\n", "# Print slope and confidence interval\n", "print(\"\"\"\n", "slope: {0:.5f} per lane\n", "95% conf int: [{1:.5f}, {2:.5f}] per lane\"\"\".format(slope, *conf_int))\n", "\n", "# x-values for plotting regression lines\n", "x = np.array([1, 8])\n", "\n", "# Plot 100 bootstrap replicate lines\n", "for i in range(100):\n", "    _ = plt.plot(x, bs_reps_slope[i] * x + bs_reps_int[i],\n", "                 color='red', alpha=0.2, linewidth=0.5)\n", "\n", "# Update the plot\n", "_ = plt.plot(lanes, f_13, marker='.', markersize=12, linestyle='none')\n", "plt.draw()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The slope is a fractional difference of about 0.4% per lane. This is quite a substantial difference at this elite level of swimming where races can be decided by tiny differences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hypothesis test: can this be by chance?\n", "The EDA and linear regression analysis are pretty conclusive. Nonetheless, you will top off the analysis of the zigzag effect by using a permutation test to test the hypothesis that lane assignment has nothing to do with the mean fractional difference between odd and even splits. You will use the Pearson correlation coefficient, which you can compute with `dcst.pearson_r()`, as the test statistic." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p = 0.0\n" ] } ], "source": [ "# Compute observed correlation: rho\n", "rho = dcst.pearson_r(lanes, f_13)\n", "\n", "# Initialize permutation reps: perm_reps_rho\n", "perm_reps_rho = np.empty(10000)\n", "\n", "# Make permutation reps\n", "for i in range(10000):\n", "    # Scramble the lanes array: scrambled_lanes\n", "    scrambled_lanes = np.random.permutation(lanes)\n", "\n", "    # Compute the Pearson correlation coefficient\n", "    perm_reps_rho[i] = dcst.pearson_r(scrambled_lanes, f_13)\n", "\n", "# Compute and print p-value\n", "p_val = np.sum(perm_reps_rho >= rho) / 10000\n", "print('p =', p_val)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "None of the 10,000 permutation replicates had a correlation at least as large as the one observed, so the p-value is below about 0.0001. It is very small, as you would expect from the confidence interval of the last exercise." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }