{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NBA Free throw analysis\n", "\n", "Now let's see some of these methods in action on real world data.\n", "I'm not a basketball guru by any means, but I thought it would be fun to see whether we can find players that perform differently in free throws when playing at home versus away.\n", "[Basketballvalue.com](http://basketballvalue.com/downloads.php) has\n", "some nice play by play data on season and playoff data between 2007 and 2012, which we will use for this analysis.\n", "It's not perfect, for example it only records player's last names, but it will do for the purpose of demonstration.\n", "\n", "\n", "## Getting data:\n", "\n", "- Download and extract play by play data from 2007 - 2012 data from http://basketballvalue.com/downloads.php\n", "- Concatenate all text files into file called `raw.data`\n", "- Run following to extract free throw data into `free_throws.csv`\n", "```\n", "cat raw.data | ack Free Throw | sed -E 's/[0-9]+([A-Z]{3})([A-Z]{3})[[:space:]][0-9]*[[:space:]].?[0-9]{2}:[0-9]{2}:[0-9]{2}[[:space:]]*\\[([A-z]{3}).*\\][[:space:]](.*)[[:space:]]Free Throw.*(d|\\))/\\1,\\2,\\3,\\4,\\5/ ; s/(.*)d$/\\10/ ; s/(.*)\\)$/\\11/' > free_throws.csv\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import division\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import scipy as sp\n", "\n", "import scipy.stats\n", "\n", "import toyplot as tp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data munging\n", "\n", "Because only last name is included, we analyze \"player-team\" combinations to avoid duplicates.\n", "This could mean that the same player has multiple rows if he changed teams." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
awayhometeamplayerscoreat_home
0CHIMIACHIGordon1False
1CHIMIACHIGordon1False
2CHIMIACHIDeng1False
3CHIMIACHIDeng0False
4CHIMIAMIAWade1True
\n", "
" ], "text/plain": [ " away home team player score at_home\n", "0 CHI MIA CHI Gordon 1 False\n", "1 CHI MIA CHI Gordon 1 False\n", "2 CHI MIA CHI Deng 1 False\n", "3 CHI MIA CHI Deng 0 False\n", "4 CHI MIA MIA Wade 1 True" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('free_throws.csv', names=[\"away\", \"home\", \"team\", \"player\", \"score\"])\n", "\n", "df[\"at_home\"] = df[\"home\"] == df[\"team\"]\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overall free throw%\n", "\n", "We note that at home the ft% is slightly higher, but there is not much difference" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
score
at_home
False0.757928
True0.759924
\n", "
" ], "text/plain": [ " score\n", "at_home \n", "False 0.757928\n", "True 0.759924" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(\"at_home\").mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating to player level\n", "\n", "We use a pivot table to get statistics on every player." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerteamatm_awayatm_homescore_awayscore_homeatm_totalscore_total
543FreijeATL605065
1187OwensIND11238173425
980M. WilliamsATL322351270276673546
38AllenDAL48381211
222BrezecDET31125147
397DavisMIN143133124107276231
1071MilesMEM171212102922
388DavidsonCHA2531712818
1544UdohGSW6761493912888
683HaywardUTA164158132125322257
\n", "
" ], "text/plain": [ " player team atm_away atm_home score_away score_home atm_total \\\n", "543 Freije ATL 6 0 5 0 6 \n", "1187 Owens IND 11 23 8 17 34 \n", "980 M. Williams ATL 322 351 270 276 673 \n", "38 Allen DAL 4 8 3 8 12 \n", "222 Brezec DET 3 11 2 5 14 \n", "397 Davis MIN 143 133 124 107 276 \n", "1071 Miles MEM 17 12 12 10 29 \n", "388 Davidson CHA 25 3 17 1 28 \n", "1544 Udoh GSW 67 61 49 39 128 \n", "683 Hayward UTA 164 158 132 125 322 \n", "\n", " score_total \n", "543 5 \n", "1187 25 \n", "980 546 \n", "38 11 \n", "222 7 \n", "397 231 \n", "1071 22 \n", "388 18 \n", "1544 88 \n", "683 257 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sdf = pd.pivot_table(df, index=[\"player\", \"team\"], columns=\"at_home\", values=[\"score\"], \n", " aggfunc=[len, sum], fill_value=0).reset_index()\n", "sdf.columns = ['player', 'team', 'atm_away', 'atm_home', 'score_away', 'score_home']\n", "\n", "sdf['atm_total'] = sdf['atm_away'] + sdf['atm_home']\n", "sdf['score_total'] = sdf['score_away'] + sdf['score_home']\n", "\n", "sdf.sample(10)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Individual tests\n", "\n", "For each player, we assume each free throw is an independent draw from a Bernoulli distribution with probability $p_{ij}$ of succeeding where $i$ denotes the player and $j=\\{a, h\\}$ denoting away or home, respectively.\n", "\n", "Our null hypotheses are that there is no difference between playing at home and away, versus the alternative that there is a difference.\n", "While you could argue a one-sided test for home advantage is also appropriate, I am sticking with a two-sided test.\n", "\n", "$$\\begin{aligned}\n", "H_{0, i}&: p_{i, a} = p_{i, h},\\\\\n", "H_{1, i}&: p_{i, a} \\neq p_{i, h}.\n", "\\end{aligned}$$\n", "\n", "To get test statistics, we conduct a simple two-sample proportions test, where our test statistic is:\n", "\n", "$$Z = \\frac{\\hat p_h - \\hat p_a}{\\sqrt{\\hat p (1-\\hat p) (\\frac{1}{n_h} + \\frac{1}{n_a})}}$$\n", "\n", "where\n", "- $n_h$ and $n_a$ are the number of attempts at home and away, respectively\n", "- $X_h$ and $X_a$ are the number of free throws made at home and away\n", "- $\\hat p_h = X_h / n_h$ is the MLE for the free throw percentage at home\n", "- likewise, $\\hat p_a = X_a / n_a$ for away ft%\n", "- $\\hat p = \\frac{X_h + X_a}{n_h + n_a}$ is the MLE for overall ft%, used for the pooled variance estimator \n", "\n", "Then we know from Stats 101 that $Z \\sim N(0, 1)$ under the null hypothesis that there is no difference in free throw percentages.\n", "\n", "For a normal approximation to hold, we need $np > 5$ and $n(1-p) > 5$, since $p \\approx 0.75$, let's be a little conservative and say we need at least 50 samples for a player to get a good normal approximation.\n", "\n", "This leads to data on 936 players, and for each one we compute Z, and the corresponding p-value." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "936" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = sdf.query('atm_total > 50').copy()\n", "len(data)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data['p_home'] = data['score_home'] / data['atm_home']\n", "data['p_away'] = data['score_away'] / data['atm_away']\n", "data['p_ovr'] = (data['score_total']) / (data['atm_total'])\n", "\n", "# two-sided\n", "data['zval'] = (data['p_home'] - data['p_away']) / np.sqrt(data['p_ovr'] * (1-data['p_ovr']) * (1/data['atm_away'] + 1/data['atm_home']))\n", "data['pval'] = 2*(1-sp.stats.norm.cdf(np.abs(data['zval'])))\n", "\n", "# one-sided testing home advantage\n", "# data['zval'] = (data['p_home'] - data['p_away']) / np.sqrt(data['p_ovr'] * (1-data['p_ovr']) * (1/data['atm_away'] + 1/data['atm_home']))\n", "# data['pval'] = (1-sp.stats.norm.cdf(data['zval']))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
playerteamatm_awayatm_homescore_awayscore_homeatm_totalscore_totalp_homep_awayp_ovrzvalpval
1640WilcoxSEA3022931991975953960.6723550.6589400.6655460.3467350.728790
1078MillerHOU6034492994780.8529410.8166670.8297870.4496490.652964
626GreenSAS46543247100790.8703700.6956520.7900002.1379170.032524
683HaywardUTA1641581321253222570.7911390.8048780.798137-0.3070470.758808
476ElsonSAS59654749124960.7538460.7966100.774194-0.5687970.569494
1212PaulNOK2021531661243552900.8104580.8217820.816901-0.2732150.784688
1059McLeodGSW3122252253471.0000000.8064520.8867922.1912660.028433
554GainesNJN54622743116700.6935480.5000000.6034482.1256090.033536
921KrsticBOS3845293383620.7333330.7631580.746988-0.3113910.755504
1018MartinSAC912916792773182815650.8438860.8684210.856127-1.4944350.135062
\n", "
" ], "text/plain": [ " player team atm_away atm_home score_away score_home atm_total \\\n", "1640 Wilcox SEA 302 293 199 197 595 \n", "1078 Miller HOU 60 34 49 29 94 \n", "626 Green SAS 46 54 32 47 100 \n", "683 Hayward UTA 164 158 132 125 322 \n", "476 Elson SAS 59 65 47 49 124 \n", "1212 Paul NOK 202 153 166 124 355 \n", "1059 McLeod GSW 31 22 25 22 53 \n", "554 Gaines NJN 54 62 27 43 116 \n", "921 Krstic BOS 38 45 29 33 83 \n", "1018 Martin SAC 912 916 792 773 1828 \n", "\n", " score_total p_home p_away p_ovr zval pval \n", "1640 396 0.672355 0.658940 0.665546 0.346735 0.728790 \n", "1078 78 0.852941 0.816667 0.829787 0.449649 0.652964 \n", "626 79 0.870370 0.695652 0.790000 2.137917 0.032524 \n", "683 257 0.791139 0.804878 0.798137 -0.307047 0.758808 \n", "476 96 0.753846 0.796610 0.774194 -0.568797 0.569494 \n", "1212 290 0.810458 0.821782 0.816901 -0.273215 0.784688 \n", "1059 47 1.000000 0.806452 0.886792 2.191266 0.028433 \n", "554 70 0.693548 0.500000 0.603448 2.125609 0.033536 \n", "921 62 0.733333 0.763158 0.746988 -0.311391 0.755504 \n", "1018 1565 0.843886 0.868421 0.856127 -1.494435 0.135062 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sample(10)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
0.00.51.00.00.51.01.52.0Histogram p-values-2020.00.10.20.30.4Histogram z-values
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "canvas = tp.Canvas(800, 300)\n", "ax1 = canvas.axes(grid=(1, 2, 0), label=\"Histogram p-values\")\n", "hist_p = ax1.bars(np.histogram(data[\"pval\"], bins=50, normed=True), color=\"steelblue\")\n", "hisp_p_density = ax1.plot([0, 1], [1, 1], color=\"red\")\n", "ax2 = canvas.axes(grid=(1, 2, 1), label=\"Histogram z-values\")\n", "hist_z = ax2.bars(np.histogram(data[\"zval\"], bins=50, normed=True), color=\"orange\")\n", "x = np.linspace(-3, 3, 200)\n", "hisp_z_density = ax2.plot(x, sp.stats.norm.pdf(x), color=\"red\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Global tests\n", "\n", "We can test the global null hypothesis, that is, there is no difference in free throw % between playing at home and away for any player using both Fisher's Combination Test and the Bonferroni method.\n", "Which one is preferred in this case? I would expect to see many small difference in effects rather than a few players showing huge effects, so Fisher's Combination Test probably has much better power.\n", "\n", "## Fisher's combination test\n", "\n", "We expect this test to have good power: if there is a difference between playing at home and away we would expect to see a lot of little effects." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p-value for Fisher Combination Test: 6.280e-04\n" ] } ], "source": [ "T = -2 * np.sum(np.log(data[\"pval\"]))\n", "print 'p-value for Fisher Combination Test: {:.3e}'.format(1 - sp.stats.chi2.cdf(T, 2*len(data)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bonferroni's method\n", "\n", "The theory would suggest this test has a lot less power, it's unlikely to have a few players where the difference is relatively huge." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"p-value\" Bonferroni: 1.000e+00\n" ] } ], "source": [ "print '\"p-value\" Bonferroni: {:.3e}'.format(min(1, data[\"pval\"].min() * len(data)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Indeed, we find a small p-value for Fisher's Combination Test, while Bonferroni's method does not reject the null hypothesis.\n", "In fact, if we multiply the smallest p-value by the number of hypotheses, we get a number larger than 1, so we aren't even remotely close to any significance.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiple tests\n", "\n", "So there definitely seems some evidence that there is a difference in performance.\n", "If you tell a sport's analyst that there is evidence that at least some players that perform differently away versus at home, their first question will be: \"So who is?\"\n", "Let's see if we can properly answer that question.\n", "\n", "## Naive method\n", "\n", "Let's first test each null hypothesis ignoring the fact that we are dealing with many hypotheses. Please don't do this at home!" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rejections: 65\n" ] } ], "source": [ "alpha = 0.05\n", "data[\"reject_naive\"] = 1*(data[\"pval\"] < alpha)\n", "\n", "print 'Number of rejections: {}'.format(data[\"reject_naive\"].sum())" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "If we don't correct for multiple comparisons, there are actually 65 \"significant\" results (at $\\alpha = 0.05$), which corresponds to about 7% of the players.\n", "We expect around 46 rejections by chance, so it's a bit more than expected, but this is a bogus method so no matter what, we should discard the results.\n", "\n", "\n", "\n", "## Bonferroni correction\n", "\n", "Let's do it the proper way though, first using Bonferroni correction.\n", "Since this method is basically the same as the Bonferroni global test, we expect no rejections:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from statsmodels.sandbox.stats.multicomp import multipletests" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rejections: 0\n" ] } ], "source": [ "data[\"reject_bc\"] = 1*(data[\"pval\"] < alpha / len(data))\n", "print 'Number of rejections: {}'.format(data[\"reject_bc\"].sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, no rejections.\n", "\n", "## Benjamini-Hochberg\n", "\n", "Let's also try the BHq procedure, which has a bit more power than Bonferonni." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "is_reject, corrected_pvals, _, _ = multipletests(data[\"pval\"], alpha=0.1, method='fdr_bh')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rejections: 0\n" ] } ], "source": [ "data[\"reject_fdr\"] = 1*is_reject\n", "data[\"pval_fdr\"] = corrected_pvals\n", "print 'Number of rejections: {}'.format(data[\"reject_fdr\"].sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the BHq procedure has more power, we can't reject any of the individual hypothesis, hence we don't find sufficient evidence for any of the players that free throw performance is affected by location.\n", "\n", "\n", "# Taking a step back\n", "\n", "If we take a step back and take another look at our data, we quickly find that we shouldn't be surprised with our results.\n", "In particular, our tests are clearly underpowered. \n", "That is, the probability of rejecting the null hypothesis when there is a true effect is very small given the effect sizes that are reasonable.\n", "\n", "While there are definitely sophisticated approaches to power analysis, we can use a [simple tool](http://statpages.org/proppowr.html) to get a rough estimate.\n", "The free throw% is around 75% percent, and at that level it takes almost 2500 total attempts to detect a difference in ft% of 5% ($\\alpha = 0.05$, power = $0.8$), and 5% is a pretty remarkable difference when only looking at home and away difference.\n", "For most players, the observed difference is not even close to 5%, and we have only 11 players in our dataset with more than 2500 free throws.\n", "\n", "\n", "To have any hope to detect effects for those few players that have plenty of data, the worst thing one can do is throw in a bunch of powerless tests.\n", "It would have been much better to restrict our analysis to players where we have a lot of data.\n", "Don't worry, I've already done that and again we cannot reject a single hypothesis.\n", "\n", "So unfortunately it seems we won't be impressing our friends with cool results, more likely we will be the annoying person pointing out the fancy stats during a game don't really mean anything.\n", "\n", "There is one cool take-away though: Fisher's combination test did reject the global null hypothesis even though each single test had almost no power, combined they did yield a significant result.\n", "If we aggregate the data across all players first and then conduct a single test of proportions, it turns out we cannot reject that hypothesis." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "11" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(data.query(\"atm_total > 2500\"))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rejections: 0\n" ] } ], "source": [ "reduced_data = data.query(\"atm_total > 2500\").copy()\n", "\n", "is_reject2, corrected_pvals2, _, _ = multipletests(reduced_data[\"pval\"], alpha=0.1, method='fdr_bh')\n", "reduced_data[\"reject_fdr2\"] = 1*is_reject2\n", "reduced_data[\"pval_fdr2\"] = corrected_pvals2\n", "\n", "\n", "print 'Number of rejections: {}'.format(reduced_data[\"reject_fdr2\"].sum())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }