{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NCAA March Madness\n", "\n", "### Pomeroy Ratings and First Round Upset Picks\n", "\n", "This notebook continues our examination of strategies for picking first round March Madness upsets. In this notebook, we will look at Ken Pomeroy's [KenPom college basketball ratings](https://kenpom.com/) and see how they can help us predict wins and losses.\n", "\n", "We're going to merge the KenPom data with the Washington Post NCAA game history that we [analyzed previously](http://practicallypredictable.com/2018/03/07/march-madness-first-round-upsets/). By combining these data sets, we'll see whether the KenPom data would have helped predict upsets in previous NCAA tournaments." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "pd.options.display.float_format = '{:.3f}'.format\n", "pd.options.display.max_rows = 100" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set(context='notebook', palette='colorblind')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import math\n", "import warnings" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent\n", "SCRAPED_DIR = PROJECT_DIR / 'data' / 'scraped'\n", "PREPARED_DIR = PROJECT_DIR / 'data' / 'prepared'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the KenPom Data\n", "\n", "First let's read in the previously scraped KenPom data.\n", "\n", "See [this 
notebook](https://nbviewer.jupyter.org/github/practicallypredictable/posts/blob/master/basketball/ncaa/notebooks/ncaa-scrape-kenpom.ipynb) for how to scrape historical KenPom ratings." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def read_kenpom_csv(filepath):\n", "    \"\"\"Read scraped KenPom CSV file.\"\"\"\n", "    filename = 'kenpom-historical.csv'\n", "    csvfile = filepath.joinpath(filename)\n", "    df = pd.read_csv(csvfile).dropna()\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One challenge in combining different data sets is making sure the data are all consistent. It turns out that the KenPom and Washington Post data aren't consistent in how the schools are named. KenPom uses certain abbreviations that the Washington Post doesn't use, and vice versa. I needed to make adjustments to both data sets to make the school names line up. Ultimately, it doesn't matter which naming choice you make, as long as it's consistent in both data sets.\n", "\n", "Here's a function which changes certain school names in the KenPom data. A corresponding function for the Washington Post data is below."
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def fix_kenpom_team_name(s):\n", "    \"\"\"Correct certain team name problems and make consistent with Washington Post team names.\"\"\"\n", "    # Convert \"Saint\" to \"St.\"\n", "    s = s.replace('Saint', 'St.')\n", "    # Add \"St.\" to certain school names\n", "    add_st = {\n", "        'Middle Tennessee',\n", "        'Central Connecticut',\n", "    }\n", "    for team in add_st:\n", "        s = s.replace(team, team+' St.')\n", "    # Expand abbreviations for certain school names\n", "    change_abbrs = {\n", "        'BYU': 'Brigham Young',\n", "        'SMU': 'Southern Methodist',\n", "        'UAB': 'Alabama Birmingham',\n", "        'UCF': 'Central Florida',\n", "        'UMBC': 'Maryland Baltimore County',\n", "        'USC': 'Southern California',\n", "        'UT Arlington': 'Texas Arlington',\n", "        'UTSA': 'Texas San Antonio',\n", "        'VCU': 'Virginia Commonwealth',\n", "    }\n", "    for team in change_abbrs:\n", "        s = s.replace(team, change_abbrs[team])\n", "    # Miscellaneous fixes\n", "    fix_misc = {\n", "        'Southern Miss': 'Southern Mississippi',\n", "        'Albany': 'Albany (N.Y.)',\n", "        'Loyola MD': 'Loyola (Md.)',\n", "        'Miami FL': 'Miami (Fla.)',\n", "        'Miami OH': 'Miami (Ohio)',\n", "        'Troy St.': 'Troy',\n", "        'Corpus Chris': 'Corpus Christi',\n", "    }\n", "    for team in fix_misc:\n", "        s = s.replace(team, fix_misc[team])\n", "    return s" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def rename_kenpom_cols(df):\n", "    \"\"\"Rename KenPom columns.\"\"\"\n", "    cols = [col.lower().replace(' ', '_').replace('adj', 'adj_').replace('opp', 'opp_') for col in df.columns]\n", "    df.columns = cols\n", "    return df" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def load_kenpom(filepath=PREPARED_DIR):\n", "    \"\"\"Load Ken Pomeroy data since 2002 and format it.\"\"\"\n", "    df = read_kenpom_csv(filepath)\n", "    df['Seed'] = df['Seed'].astype(int)\n", "    df['Team'] = df['Team'].apply(fix_kenpom_team_name)\n", "    df = rename_kenpom_cols(df)\n", "    return df" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1061, 24)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom = load_kenpom(PREPARED_DIR)\n", "kenpom.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Understanding KenPom Ratings\n", "\n", "Now that we have the KenPom data loaded, let's see what sort of data we have." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['year', 'team', 'conf', 'seed', 'wins', 'losses', 'kenpom', 'adj_em',\n", " 'adj_o', 'adj_d', 'adj_t', 'luck', 'sos_adj_em', 'opp_o', 'opp_d',\n", " 'ncsos_adj_em', 'adj_o_rank', 'adj_d_rank', 'adj_t_rank', 'luck_rank',\n", " 'sos_adj_em_rank', 'opp_o_rank', 'opp_d_rank', 'ncsos_adj_em_rank'],\n", " dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom.columns" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>team</th>\n", " <th>conf</th>\n", " <th>seed</th>\n", " <th>wins</th>\n", " <th>losses</th>\n", " <th>kenpom</th>\n", " <th>adj_em</th>\n", " <th>adj_o</th>\n", " <th>adj_d</th>\n", " <th>...</th>\n", " <th>opp_d</th>\n", " <th>ncsos_adj_em</th>\n", " <th>adj_o_rank</th>\n", " <th>adj_d_rank</th>\n", " <th>adj_t_rank</th>\n", " <th>luck_rank</th>\n", " <th>sos_adj_em_rank</th>\n", 
<th>opp_o_rank</th>\n", " <th>opp_d_rank</th>\n", " <th>ncsos_adj_em_rank</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2002</td>\n", " <td>Duke</td>\n", " <td>ACC</td>\n", " <td>1</td>\n", " <td>31</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>34.190</td>\n", " <td>121.000</td>\n", " <td>86.800</td>\n", " <td>...</td>\n", " <td>99.500</td>\n", " <td>6.660</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>8</td>\n", " <td>223</td>\n", " <td>18</td>\n", " <td>13</td>\n", " <td>31</td>\n", " <td>34</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2002</td>\n", " <td>Cincinnati</td>\n", " <td>CUSA</td>\n", " <td>1</td>\n", " <td>31</td>\n", " <td>4</td>\n", " <td>2</td>\n", " <td>30.190</td>\n", " <td>118.100</td>\n", " <td>87.900</td>\n", " <td>...</td>\n", " <td>100.000</td>\n", " <td>3.480</td>\n", " <td>7</td>\n", " <td>3</td>\n", " <td>194</td>\n", " <td>165</td>\n", " <td>57</td>\n", " <td>66</td>\n", " <td>44</td>\n", " <td>80</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2002</td>\n", " <td>Maryland</td>\n", " <td>ACC</td>\n", " <td>1</td>\n", " <td>32</td>\n", " <td>4</td>\n", " <td>3</td>\n", " <td>29.250</td>\n", " <td>119.200</td>\n", " <td>89.900</td>\n", " <td>...</td>\n", " <td>99.500</td>\n", " <td>1.620</td>\n", " <td>4</td>\n", " <td>7</td>\n", " <td>15</td>\n", " <td>104</td>\n", " <td>16</td>\n", " <td>11</td>\n", " <td>32</td>\n", " <td>120</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2002</td>\n", " <td>Kansas</td>\n", " <td>B12</td>\n", " <td>1</td>\n", " <td>33</td>\n", " <td>4</td>\n", " <td>4</td>\n", " <td>28.990</td>\n", " <td>118.700</td>\n", " <td>89.700</td>\n", " <td>...</td>\n", " <td>99.900</td>\n", " <td>8.320</td>\n", " <td>5</td>\n", " <td>6</td>\n", " <td>3</td>\n", " <td>109</td>\n", " <td>10</td>\n", " <td>4</td>\n", " <td>40</td>\n", " <td>23</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2002</td>\n", " <td>Oklahoma</td>\n", " <td>B12</td>\n", " 
<td>2</td>\n", " <td>31</td>\n", " <td>5</td>\n", " <td>5</td>\n", " <td>26.040</td>\n", " <td>114.900</td>\n", " <td>88.900</td>\n", " <td>...</td>\n", " <td>100.400</td>\n", " <td>-0.440</td>\n", " <td>20</td>\n", " <td>4</td>\n", " <td>228</td>\n", " <td>69</td>\n", " <td>26</td>\n", " <td>15</td>\n", " <td>62</td>\n", " <td>169</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 24 columns</p>\n", "</div>" ], "text/plain": [ " year team conf seed wins losses kenpom adj_em adj_o adj_d \\\n", "0 2002 Duke ACC 1 31 4 1 34.190 121.000 86.800 \n", "1 2002 Cincinnati CUSA 1 31 4 2 30.190 118.100 87.900 \n", "2 2002 Maryland ACC 1 32 4 3 29.250 119.200 89.900 \n", "3 2002 Kansas B12 1 33 4 4 28.990 118.700 89.700 \n", "4 2002 Oklahoma B12 2 31 5 5 26.040 114.900 88.900 \n", "\n", " ... opp_d ncsos_adj_em adj_o_rank adj_d_rank \\\n", "0 ... 99.500 6.660 1 1 \n", "1 ... 100.000 3.480 7 3 \n", "2 ... 99.500 1.620 4 7 \n", "3 ... 99.900 8.320 5 6 \n", "4 ... 100.400 -0.440 20 4 \n", "\n", " adj_t_rank luck_rank sos_adj_em_rank opp_o_rank opp_d_rank \\\n", "0 8 223 18 13 31 \n", "1 194 165 57 66 44 \n", "2 15 104 16 11 32 \n", "3 3 109 10 4 40 \n", "4 228 69 26 15 62 \n", "\n", " ncsos_adj_em_rank \n", "0 34 \n", "1 80 \n", "2 120 \n", "3 23 \n", "4 169 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can read about the most recent version of KenPom ratings [here](https://kenpom.com/blog/ratings-methodology-update/).\n", "\n", "#### Adjusted Offensive and Defensive Efficiencies\n", "\n", "The main building blocks are Adjusted Offensive Efficiency Rating (_AdjO_) and Adjusted Defensive Efficiency Rating (_AdjD_). _AdjO_ is a prediction of the team's points scored per 100 possessions, against an \"average\" team on a neutral court. 
KenPom defines \"average\" as being an average Division-I opponent.\n", "\n", "_AdjD_ is a prediction of how many points per 100 possessions an average opponent will score against the team on a neutral court.\n", "\n", "Equivalently, if you divide either rating by 100, it's the predicted points scored by or against the team per possession.\n", "\n", "Note that KenPom uses an assumption that home court advantage is worth 3.75 points in estimating the ratings. Since NCAA tournament games are played on neutral courts, no home court adjustment is made for predicting tournament games.\n", "\n", "#### Adjusted Efficiency Margin\n", "\n", "The Adjusted Efficiency Margin (_AdjEM_) is simply the difference between the team's offensive and defensive efficiency ratings (_AdjO_ minus _AdjD_). It can be positive or negative. If it is negative, it could be because of relatively weak offense, relatively weak defense, or both.\n", "\n", "#### KenPom Ranking\n", "\n", "In the KenPom methodology, teams are ranked by their _AdjEM_.\n", "\n", "Virginia is the top team as of Sunday, March 11, 2018 (and a first seed in this year's March Madness), with an _AdjEM_ of +32.15. This is composed of a 116.5 _AdjO_ less its 84.4 _AdjD_. Notice that Virginia's _AdjO_ is only ranked $21^{st}$, while its _AdjD_ is ranked number 1. In contrast, the number 2 team, Villanova, has the top KenPom _AdjO_ of 127.4, but only the $22^{nd}$ ranked _AdjD_ of 96.0.\n", "\n", "#### Tempo\n", "\n", "Notice that by estimating points scored or allowed per possession, the KenPom methodology controls for the effect of team pace. In any particular game, both teams get roughly the same number of possessions. This is true whether one or both teams push the ball up the floor or not. You can [read Ken Pomeroy's own post on possessions here](https://kenpom.com/blog/the-possession/).\n", "\n", "If you want to predict scores or point differentials, you need to include the effect of pace.\n", "\n", "Fortunately, KenPom also helps to predict pace. 
The Adjusted Tempo (_AdjT_) is a prediction of the number of possessions the team will have against an \"average\" team on a neutral court.\n", "\n", "If you want to estimate the number of possessions each team will have in a given game, you can just average the _AdjT_ of the two teams. So, for example, if Virginia (_AdjT_ of 59.2) were to eventually meet Villanova (_AdjT_ of 68.3) in the 2018 NCAA tournament, KenPom would predict (59.2 + 68.3) / 2 = 63.75, or roughly 64, possessions for each team. That is 127.5 possessions in total, each lasting 18.8 seconds on average over a 40-minute game.\n", "\n", "#### Strength of Schedule Ratings\n", "\n", "The strength of schedule (SOS) ratings can be used to get a sense of the average quality of opponents that a team faces during a given season.\n", "\n", "The _SOS AdjEM_ is the _AdjEM_ of a hypothetical team that would be predicted to win half of its games against the team's full-season schedule (excluding post-season play). The non-conference strength of schedule measure (_NCSOS AdjEM_) is similar, except that it only looks at the team's non-conference opponents.\n", "\n", "#### Luck\n", "\n", "The [KenPom ratings glossary](https://kenpom.com/blog/ratings-glossary/) describes the luck rating as:\n", "\n", "> A measure of the deviation between a team’s actual winning percentage and what one would expect from its game-by-game efficiencies. It’s a Dean Oliver invention. Essentially, a team involved in a lot of close games should not win (or lose) all of them. Those that do will be viewed as lucky (or unlucky).\n", "\n", "You can read more about Dean Oliver and his contributions to basketball analytics [here](https://en.wikipedia.org/wiki/Dean_Oliver_(statistician%29).\n", "\n", "Theoretically, once a game is close and goes down to the wire, an average team should win roughly 50% of the time. 
The luck rating simply measures whether teams have historically won (or lost) a meaningfully different fraction of these close games.\n", "\n", "You can interpret the luck rating in two completely different ways.\n", "\n", "1. Teams with high luck ratings have an ability (not captured by other KenPom ratings) to \"win in the clutch.\" You should expect those teams to continue to be \"lucky\" relative to their regular KenPom projections.\n", "2. Teams with high luck ratings just got very lucky in the past, and you shouldn't rely on that luck to continue in the future in close games. Maybe those teams aren't as good as their records suggest?\n", "\n", "You can read about what one researcher learned about these questions [here](http://blog.minitab.com/blog/the-statistics-game/analyzing-luck-in-college-basketball-part-1) and [here](http://blog.minitab.com/blog/the-statistics-game/analyzing-luck-in-college-basketball-part-ii). His conclusion was that you should go with the second interpretation, and be wary of teams with high luck ratings.\n", "\n", "#### Ranks\n", "\n", "The KenPom data also include each team's rank on every statistic. Our scraper pulled in this data, but we won't use it in this analysis.\n", "\n", "#### One Caveat with Historical KenPom Data\n", "\n", "The KenPom data for 2002-2017 that are used in this notebook are as of the end of the respective seasons, and include post-season play. This could be problematic for studying historical NCAA tournament results, since the KenPom ratings include the teams' performance in those tournaments. To run a correct analysis, you want to have the KenPom data as it appeared just prior to the tournament start each year. If the tournament results meaningfully changed the final KenPom ratings compared to how they appeared right before the tournament, our prediction model would be biased by using the \"incorrect\" KenPom data.\n", "\n", "Furthermore, the methodology behind KenPom data changed in 2016. 
The \"historical\" data shown on the KenPom website for seasons prior to 2016 were computed using the new methodology. Even if we had the KenPom data as it stood just before the previous tournaments, it wouldn't be consistent with how the KenPom data are computed today. To make sure our analysis is using apples-to-apples data, we would need KenPom data computed using the current methodology from historical games up to, but not including, the NCAA tournament in each year.\n", "\n", "Unfortunately, we don't currently have access to such data, so we will use the data we have. This is equivalent to assuming that the final post-season KenPom ratings (computed using the current methodology) in prior years weren't too different from how they would have looked just prior to the tournaments.\n", "\n", "A team plays at most 6 games in the NCAA tournament, and most teams play far fewer than that. Since the KenPom data are based on the entire season, which includes many more games, you might hope that errors introduced by this assumption would be relatively small. We'll proceed with the analysis but we need to keep this potential source of error in mind.\n", "\n", "### A First Look at the KenPom Ratings by Seed\n", "\n", "Let's look at the average values of these ratings by tournament seed."
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "kenpom_cols = ['kenpom', 'adj_em', 'adj_o', 'adj_d', 'adj_t', 'luck', 'sos_adj_em', 'ncsos_adj_em']" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>kenpom</th>\n", " <th>adj_em</th>\n", " <th>adj_o</th>\n", " <th>adj_d</th>\n", " <th>adj_t</th>\n", " <th>luck</th>\n", " <th>sos_adj_em</th>\n", " <th>ncsos_adj_em</th>\n", " </tr>\n", " <tr>\n", " <th>seed</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>3.703</td>\n", " <td>28.848</td>\n", " <td>119.002</td>\n", " <td>90.155</td>\n", " <td>67.161</td>\n", " <td>0.021</td>\n", " <td>9.023</td>\n", " <td>1.739</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>9.188</td>\n", " <td>24.768</td>\n", " <td>117.092</td>\n", " <td>92.333</td>\n", " <td>66.211</td>\n", " <td>0.026</td>\n", " <td>9.033</td>\n", " <td>1.272</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>15.062</td>\n", " <td>22.185</td>\n", " <td>115.855</td>\n", " <td>93.677</td>\n", " <td>65.972</td>\n", " <td>0.020</td>\n", " <td>8.786</td>\n", " <td>-0.239</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>16.781</td>\n", " <td>21.637</td>\n", " <td>115.041</td>\n", " <td>93.406</td>\n", " <td>66.211</td>\n", " <td>0.002</td>\n", " <td>8.631</td>\n", " <td>-0.182</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " 
<td>21.906</td>\n", " <td>19.932</td>\n", " <td>113.461</td>\n", " <td>93.525</td>\n", " <td>65.717</td>\n", " <td>0.004</td>\n", " <td>7.921</td>\n", " <td>-0.326</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>27.859</td>\n", " <td>18.406</td>\n", " <td>113.277</td>\n", " <td>94.872</td>\n", " <td>65.666</td>\n", " <td>0.010</td>\n", " <td>7.735</td>\n", " <td>-0.750</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>31.328</td>\n", " <td>17.669</td>\n", " <td>113.355</td>\n", " <td>95.681</td>\n", " <td>65.053</td>\n", " <td>0.009</td>\n", " <td>6.922</td>\n", " <td>0.443</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>33.531</td>\n", " <td>16.792</td>\n", " <td>112.794</td>\n", " <td>95.997</td>\n", " <td>66.384</td>\n", " <td>0.012</td>\n", " <td>7.103</td>\n", " <td>0.159</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>39.531</td>\n", " <td>15.692</td>\n", " <td>111.056</td>\n", " <td>95.372</td>\n", " <td>65.631</td>\n", " <td>0.000</td>\n", " <td>6.647</td>\n", " <td>0.123</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>40.125</td>\n", " <td>15.778</td>\n", " <td>112.144</td>\n", " <td>96.367</td>\n", " <td>65.625</td>\n", " <td>-0.003</td>\n", " <td>7.147</td>\n", " <td>0.210</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>46.151</td>\n", " <td>14.613</td>\n", " <td>111.632</td>\n", " <td>97.015</td>\n", " <td>65.400</td>\n", " <td>0.003</td>\n", " <td>4.758</td>\n", " <td>0.587</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>54.985</td>\n", " <td>13.131</td>\n", " <td>110.819</td>\n", " <td>97.685</td>\n", " <td>65.110</td>\n", " <td>0.031</td>\n", " <td>1.999</td>\n", " <td>0.215</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>78.769</td>\n", " <td>9.617</td>\n", " <td>107.963</td>\n", " <td>98.354</td>\n", " <td>66.398</td>\n", " <td>0.027</td>\n", " <td>-0.534</td>\n", " <td>1.849</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>95.262</td>\n", " <td>7.396</td>\n", " <td>107.888</td>\n", " 
<td>100.500</td>\n", " <td>65.960</td>\n", " <td>0.030</td>\n", " <td>-2.054</td>\n", " <td>1.460</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>142.031</td>\n", " <td>2.276</td>\n", " <td>104.758</td>\n", " <td>102.481</td>\n", " <td>66.195</td>\n", " <td>0.047</td>\n", " <td>-4.209</td>\n", " <td>1.166</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>200.011</td>\n", " <td>-3.355</td>\n", " <td>101.279</td>\n", " <td>104.629</td>\n", " <td>66.360</td>\n", " <td>0.039</td>\n", " <td>-6.672</td>\n", " <td>2.394</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " kenpom adj_em adj_o adj_d adj_t luck sos_adj_em ncsos_adj_em\n", "seed \n", "1 3.703 28.848 119.002 90.155 67.161 0.021 9.023 1.739\n", "2 9.188 24.768 117.092 92.333 66.211 0.026 9.033 1.272\n", "3 15.062 22.185 115.855 93.677 65.972 0.020 8.786 -0.239\n", "4 16.781 21.637 115.041 93.406 66.211 0.002 8.631 -0.182\n", "5 21.906 19.932 113.461 93.525 65.717 0.004 7.921 -0.326\n", "6 27.859 18.406 113.277 94.872 65.666 0.010 7.735 -0.750\n", "7 31.328 17.669 113.355 95.681 65.053 0.009 6.922 0.443\n", "8 33.531 16.792 112.794 95.997 66.384 0.012 7.103 0.159\n", "9 39.531 15.692 111.056 95.372 65.631 0.000 6.647 0.123\n", "10 40.125 15.778 112.144 96.367 65.625 -0.003 7.147 0.210\n", "11 46.151 14.613 111.632 97.015 65.400 0.003 4.758 0.587\n", "12 54.985 13.131 110.819 97.685 65.110 0.031 1.999 0.215\n", "13 78.769 9.617 107.963 98.354 66.398 0.027 -0.534 1.849\n", "14 95.262 7.396 107.888 100.500 65.960 0.030 -2.054 1.460\n", "15 142.031 2.276 104.758 102.481 66.195 0.047 -4.209 1.166\n", "16 200.011 -3.355 101.279 104.629 66.360 0.039 -6.672 2.394" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom.groupby('seed')[kenpom_cols].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe the following patterns in the above table:\n", "\n", "- Higher seeded teams tend to have better (lower) KenPom 
rankings and better (higher) _AdjEM_ ratings.\n", "- The same pattern applies to the underlying _AdjO_ and _AdjD_ ratings. Higher seeded teams have better (higher) _AdjO_ ratings and better (lower) _AdjD_ ratings.\n", "- There does not appear to be a significant pattern in tempo, as shown by the _AdjT_ ratings.\n", "- There does not appear to be much of a pattern in the luck rating by seed, although the fact that most seeds have positive average values suggests that these teams have slightly higher win percentages than the underlying KenPom ratings would have projected.\n", "- Higher seeded teams appear to have played against stronger opponents overall prior to the tournament, although there does not appear to be a pattern in the non-conference SOS ratings.\n", "\n", "\n", "### Loading the Washington Post Tournament Data\n", "\n", "Now we will move on to read in the historical tournament game data. This data was previously scraped as demonstrated in [this notebook](https://nbviewer.jupyter.org/github/practicallypredictable/posts/blob/master/basketball/ncaa/notebooks/ncaa-scrape-washpost.ipynb)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def read_washpost_csv(filepath):\n", "    \"\"\"Read Washington Post NCAA Tournament games history CSV file.\"\"\"\n", "    filename = 'game_history-washpost-1985_2017.csv'\n", "    csvfile = filepath.joinpath(filename)\n", "    df = pd.read_csv(csvfile)\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to look only at first round games since 2002 (the first year for which we have KenPom data). So, we will drop all the other historical game information from our analysis."
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def filter_games(df):\n", "    \"\"\"Limit games to first round since 2002 (to match KenPom data).\"\"\"\n", "    df = df[df['Round'] == 1]\n", "    df = df.drop(columns=['Round'])\n", "    df = df[df['Year'] >= 2002]\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to correct a few of the school names to match up with our KenPom data set. This function will do that." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def fix_washpost_team_name(s):\n", "    \"\"\"Correct Washington Post team names to match KenPom.\"\"\"\n", "    s = s.replace('State', 'St.')\n", "    s = s.replace('-', ' ')\n", "    fix_misc = {\n", "        'Long Island': 'LIU Brooklyn',\n", "        'St. Mary\\'s (Cal.)': 'St. Mary\\'s',\n", "        'Bakersfied': 'Bakersfield',\n", "    }\n", "    for team in fix_misc:\n", "        s = s.replace(team, fix_misc[team])\n", "    return s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Washington Post data has game results (team, seed and score) in terms of the game winner and loser. This isn't what we need for this analysis. The problem is that we want the columns to be consistent by seed. Having winner and loser columns means that sometimes the higher seed will be in the winner column, and sometimes in the loser column. We want to organize the teams by higher seed versus lower seed.\n", "\n", "The next few functions will take care of this for us."
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def team_hi(row):\n", "    \"\"\"Return the name of the higher-seeded team.\"\"\"\n", "    if row['WinnerSeed'] == row['HigherSeed']:\n", "        return row['Winner']\n", "    else:\n", "        return row['Loser']" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def team_lo(row):\n", "    \"\"\"Return the name of the lower-seeded team.\"\"\"\n", "    if row['WinnerSeed'] == row['HigherSeed']:\n", "        return row['Loser']\n", "    else:\n", "        return row['Winner']" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def score_hi(row):\n", "    \"\"\"Return the score of the higher-seeded team.\"\"\"\n", "    if row['WinnerSeed'] == row['HigherSeed']:\n", "        return row['WinnerScore']\n", "    else:\n", "        return row['LoserScore']" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def score_lo(row):\n", "    \"\"\"Return the score of the lower-seeded team.\"\"\"\n", "    if row['WinnerSeed'] == row['HigherSeed']:\n", "        return row['LoserScore']\n", "    else:\n", "        return row['WinnerScore']" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def fix_washpost_cols(df):\n", "    \"\"\"Adjust columns to reflect high/low seed, not winner/loser.\"\"\"\n", "    df['team_hi'] = df.apply(team_hi, axis=1)\n", "    df['team_lo'] = df.apply(team_lo, axis=1)\n", "    df['seed_hi'] = df['HigherSeed']\n", "    df['seed_lo'] = 17 - df['seed_hi']\n", "    df['score_hi'] = df.apply(score_hi, axis=1)\n", "    df['score_lo'] = df.apply(score_lo, axis=1)\n", "    df = df.drop(columns=['HigherSeed', 'Winner', 'WinnerSeed', 'WinnerScore', 'Loser', 'LoserSeed', 'LoserScore'])\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can put everything together to read in the data, filter out the games we don't want, fix the team names and then reorganize the columns."
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def rename_washpost_cols(df):\n", "    \"\"\"Rename Washington Post game history columns.\"\"\"\n", "    df = df.rename(columns={\n", "        'Year': 'year',\n", "    })\n", "    cols = ['year'] + [col for col in df.columns if '_hi' in col] + [col for col in df.columns if '_lo' in col]\n", "    return df[cols]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def load_first_round_games(filepath):\n", "    \"\"\"Load NCAA first round tournament games since 2002.\"\"\"\n", "    df = read_washpost_csv(filepath)\n", "    df = filter_games(df)\n", "    df['Winner'] = df['Winner'].apply(fix_washpost_team_name)\n", "    df['Loser'] = df['Loser'].apply(fix_washpost_team_name)\n", "    df = fix_washpost_cols(df)\n", "    df = rename_washpost_cols(df)\n", "    return df.reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(512, 7)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games = load_first_round_games(PREPARED_DIR)\n", "games.shape" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>team_hi</th>\n", " <th>seed_hi</th>\n", " <th>score_hi</th>\n", " <th>team_lo</th>\n", " <th>seed_lo</th>\n", " <th>score_lo</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2002</td>\n", " <td>North Carolina St.</td>\n", " <td>7</td>\n", " <td>69</td>\n", " <td>Michigan 
St.</td>\n", " <td>10</td>\n", " <td>58</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2002</td>\n", " <td>Illinois</td>\n", " <td>4</td>\n", " <td>93</td>\n", " <td>San Diego St.</td>\n", " <td>13</td>\n", " <td>64</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2002</td>\n", " <td>Texas</td>\n", " <td>6</td>\n", " <td>70</td>\n", " <td>Boston College</td>\n", " <td>11</td>\n", " <td>57</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2002</td>\n", " <td>Cincinnati</td>\n", " <td>1</td>\n", " <td>90</td>\n", " <td>Boston University</td>\n", " <td>16</td>\n", " <td>52</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2002</td>\n", " <td>Georgia</td>\n", " <td>3</td>\n", " <td>85</td>\n", " <td>Murray St.</td>\n", " <td>14</td>\n", " <td>68</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>2002</td>\n", " <td>Texas Tech</td>\n", " <td>6</td>\n", " <td>68</td>\n", " <td>Southern Illinois</td>\n", " <td>11</td>\n", " <td>76</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>2002</td>\n", " <td>Maryland</td>\n", " <td>1</td>\n", " <td>85</td>\n", " <td>Siena</td>\n", " <td>16</td>\n", " <td>70</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>2002</td>\n", " <td>Oklahoma</td>\n", " <td>2</td>\n", " <td>71</td>\n", " <td>Illinois Chicago</td>\n", " <td>15</td>\n", " <td>63</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>2002</td>\n", " <td>California</td>\n", " <td>6</td>\n", " <td>82</td>\n", " <td>Penn</td>\n", " <td>11</td>\n", " <td>75</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>2002</td>\n", " <td>Xavier</td>\n", " <td>7</td>\n", " <td>70</td>\n", " <td>Hawaii</td>\n", " <td>10</td>\n", " <td>58</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year team_hi seed_hi score_hi team_lo seed_lo \\\n", "0 2002 North Carolina St. 7 69 Michigan St. 10 \n", "1 2002 Illinois 4 93 San Diego St. 
13 \n", "2 2002 Texas 6 70 Boston College 11 \n", "3 2002 Cincinnati 1 90 Boston University 16 \n", "4 2002 Georgia 3 85 Murray St. 14 \n", "5 2002 Texas Tech 6 68 Southern Illinois 11 \n", "6 2002 Maryland 1 85 Siena 16 \n", "7 2002 Oklahoma 2 71 Illinois Chicago 15 \n", "8 2002 California 6 82 Penn 11 \n", "9 2002 Xavier 7 70 Hawaii 10 \n", "\n", " score_lo \n", "0 58 \n", "1 64 \n", "2 57 \n", "3 52 \n", "4 68 \n", "5 76 \n", "6 70 \n", "7 63 \n", "8 75 \n", "9 58 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perfect, we have 16 years of data, with 32 first round games per year, for 512 total first round NCAA games.\n", "\n", "### Checking that the Team Names Match\n", "\n", "The code below looks at team names that are in the Washington Post data and don't match a team in the KenPom data, and vice versa. This is the code I used to identify team name mismatches, which were corrected in the code already shown above. I left this code in this notebook to demonstrate the importance of validating data, and how to do it. It's also a useful check when re-running this code in the future." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def first_round_teams(df):\n", "    \"\"\"Get unique team names from Washington Post data set.\"\"\"\n", "    return set(df['team_hi'].unique()) | set(df['team_lo'].unique())" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "237" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "washpost_teams = first_round_teams(games)\n", "len(washpost_teams)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "243" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom_teams = set(kenpom['team'].unique())\n", "len(kenpom_teams)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "set()" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "washpost_teams - kenpom_teams" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Alabama A&M',\n", " 'Alcorn St.',\n", " 'Coppin St.',\n", " 'Lamar',\n", " 'New Orleans',\n", " 'North Florida'}" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom_teams - washpost_teams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the only team names that still don't match. They are in the KenPom data, but not the Washington Post historical NCAA tournament data.\n", "\n", "Let's look at the specific KenPom rows that have these teams."
] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>team</th>\n", " <th>conf</th>\n", " <th>seed</th>\n", " <th>wins</th>\n", " <th>losses</th>\n", " <th>kenpom</th>\n", " <th>adj_em</th>\n", " <th>adj_o</th>\n", " <th>adj_d</th>\n", " <th>...</th>\n", " <th>opp_d</th>\n", " <th>ncsos_adj_em</th>\n", " <th>adj_o_rank</th>\n", " <th>adj_d_rank</th>\n", " <th>adj_t_rank</th>\n", " <th>luck_rank</th>\n", " <th>sos_adj_em_rank</th>\n", " <th>opp_o_rank</th>\n", " <th>opp_d_rank</th>\n", " <th>ncsos_adj_em_rank</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>251</th>\n", " <td>2002</td>\n", " <td>Alcorn St.</td>\n", " <td>SWAC</td>\n", " <td>16</td>\n", " <td>20</td>\n", " <td>10</td>\n", " <td>252</td>\n", " <td>-8.710</td>\n", " <td>100.100</td>\n", " <td>108.800</td>\n", " <td>...</td>\n", " <td>109.800</td>\n", " <td>2.600</td>\n", " <td>213</td>\n", " <td>268</td>\n", " <td>23</td>\n", " <td>8</td>\n", " <td>325</td>\n", " <td>325</td>\n", " <td>325</td>\n", " <td>101</td>\n", " </tr>\n", " <tr>\n", " <th>1257</th>\n", " <td>2005</td>\n", " <td>Alabama A&M</td>\n", " <td>SWAC</td>\n", " <td>16</td>\n", " <td>15</td>\n", " <td>14</td>\n", " <td>278</td>\n", " <td>-12.490</td>\n", " <td>90.500</td>\n", " <td>102.900</td>\n", " <td>...</td>\n", " <td>106.900</td>\n", " <td>1.140</td>\n", " <td>316</td>\n", " <td>169</td>\n", " <td>10</td>\n", " <td>224</td>\n", " <td>328</td>\n", " <td>327</td>\n", " <td>308</td>\n", " <td>128</td>\n", " </tr>\n", " <tr>\n", " <th>2277</th>\n", " 
<td>2008</td>\n", " <td>Coppin St.</td>\n", " <td>MEAC</td>\n", " <td>16</td>\n", " <td>16</td>\n", " <td>21</td>\n", " <td>298</td>\n", " <td>-14.410</td>\n", " <td>92.400</td>\n", " <td>106.800</td>\n", " <td>...</td>\n", " <td>105.500</td>\n", " <td>7.870</td>\n", " <td>321</td>\n", " <td>230</td>\n", " <td>261</td>\n", " <td>36</td>\n", " <td>306</td>\n", " <td>321</td>\n", " <td>229</td>\n", " <td>18</td>\n", " </tr>\n", " <tr>\n", " <th>3464</th>\n", " <td>2012</td>\n", " <td>Lamar</td>\n", " <td>Slnd</td>\n", " <td>16</td>\n", " <td>23</td>\n", " <td>12</td>\n", " <td>108</td>\n", " <td>6.330</td>\n", " <td>107.600</td>\n", " <td>101.300</td>\n", " <td>...</td>\n", " <td>103.400</td>\n", " <td>5.210</td>\n", " <td>88</td>\n", " <td>138</td>\n", " <td>123</td>\n", " <td>246</td>\n", " <td>252</td>\n", " <td>310</td>\n", " <td>173</td>\n", " <td>37</td>\n", " </tr>\n", " <tr>\n", " <th>4542</th>\n", " <td>2015</td>\n", " <td>North Florida</td>\n", " <td>ASun</td>\n", " <td>16</td>\n", " <td>23</td>\n", " <td>12</td>\n", " <td>143</td>\n", " <td>2.260</td>\n", " <td>108.400</td>\n", " <td>106.200</td>\n", " <td>...</td>\n", " <td>108.200</td>\n", " <td>2.230</td>\n", " <td>88</td>\n", " <td>204</td>\n", " <td>51</td>\n", " <td>230</td>\n", " <td>331</td>\n", " <td>312</td>\n", " <td>321</td>\n", " <td>91</td>\n", " </tr>\n", " <tr>\n", " <th>5279</th>\n", " <td>2017</td>\n", " <td>New Orleans</td>\n", " <td>Slnd</td>\n", " <td>16</td>\n", " <td>20</td>\n", " <td>12</td>\n", " <td>178</td>\n", " <td>-0.520</td>\n", " <td>101.300</td>\n", " <td>101.800</td>\n", " <td>...</td>\n", " <td>107.600</td>\n", " <td>8.320</td>\n", " <td>235</td>\n", " <td>107</td>\n", " <td>272</td>\n", " <td>67</td>\n", " <td>287</td>\n", " <td>267</td>\n", " <td>291</td>\n", " <td>13</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>6 rows × 24 columns</p>\n", "</div>" ], "text/plain": [ " year team conf seed wins losses kenpom adj_em adj_o \\\n", "251 2002 Alcorn St. 
SWAC 16 20 10 252 -8.710 100.100 \n", "1257 2005 Alabama A&M SWAC 16 15 14 278 -12.490 90.500 \n", "2277 2008 Coppin St. MEAC 16 16 21 298 -14.410 92.400 \n", "3464 2012 Lamar Slnd 16 23 12 108 6.330 107.600 \n", "4542 2015 North Florida ASun 16 23 12 143 2.260 108.400 \n", "5279 2017 New Orleans Slnd 16 20 12 178 -0.520 101.300 \n", "\n", " adj_d ... opp_d ncsos_adj_em adj_o_rank adj_d_rank \\\n", "251 108.800 ... 109.800 2.600 213 268 \n", "1257 102.900 ... 106.900 1.140 316 169 \n", "2277 106.800 ... 105.500 7.870 321 230 \n", "3464 101.300 ... 103.400 5.210 88 138 \n", "4542 106.200 ... 108.200 2.230 88 204 \n", "5279 101.800 ... 107.600 8.320 235 107 \n", "\n", " adj_t_rank luck_rank sos_adj_em_rank opp_o_rank opp_d_rank \\\n", "251 23 8 325 325 325 \n", "1257 10 224 328 327 308 \n", "2277 261 36 306 321 229 \n", "3464 123 246 252 310 173 \n", "4542 51 230 331 312 321 \n", "5279 272 67 287 267 291 \n", "\n", " ncsos_adj_em_rank \n", "251 101 \n", "1257 128 \n", "2277 18 \n", "3464 37 \n", "4542 91 \n", "5279 13 \n", "\n", "[6 rows x 24 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom[kenpom['team'].isin(kenpom_teams-washpost_teams)]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>team</th>\n", " <th>conf</th>\n", " <th>seed</th>\n", " <th>wins</th>\n", " <th>losses</th>\n", " <th>kenpom</th>\n", " <th>adj_em</th>\n", " <th>adj_o</th>\n", " <th>adj_d</th>\n", " <th>...</th>\n", " <th>opp_d</th>\n", " <th>ncsos_adj_em</th>\n", " 
<th>adj_o_rank</th>\n", " <th>adj_d_rank</th>\n", " <th>adj_t_rank</th>\n", " <th>luck_rank</th>\n", " <th>sos_adj_em_rank</th>\n", " <th>opp_o_rank</th>\n", " <th>opp_d_rank</th>\n", " <th>ncsos_adj_em_rank</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>161</th>\n", " <td>2002</td>\n", " <td>Holy Cross</td>\n", " <td>Pat</td>\n", " <td>16</td>\n", " <td>18</td>\n", " <td>15</td>\n", " <td>162</td>\n", " <td>-0.240</td>\n", " <td>98.100</td>\n", " <td>98.400</td>\n", " <td>...</td>\n", " <td>106.100</td>\n", " <td>-3.380</td>\n", " <td>245</td>\n", " <td>75</td>\n", " <td>313</td>\n", " <td>268</td>\n", " <td>264</td>\n", " <td>261</td>\n", " <td>262</td>\n", " <td>246</td>\n", " </tr>\n", " <tr>\n", " <th>173</th>\n", " <td>2002</td>\n", " <td>Siena</td>\n", " <td>MAAC</td>\n", " <td>16</td>\n", " <td>17</td>\n", " <td>19</td>\n", " <td>174</td>\n", " <td>-1.210</td>\n", " <td>100.800</td>\n", " <td>102.000</td>\n", " <td>...</td>\n", " <td>104.400</td>\n", " <td>-0.420</td>\n", " <td>201</td>\n", " <td>141</td>\n", " <td>223</td>\n", " <td>296</td>\n", " <td>237</td>\n", " <td>245</td>\n", " <td>200</td>\n", " <td>167</td>\n", " </tr>\n", " <tr>\n", " <th>193</th>\n", " <td>2002</td>\n", " <td>Boston University</td>\n", " <td>AE</td>\n", " <td>16</td>\n", " <td>22</td>\n", " <td>10</td>\n", " <td>194</td>\n", " <td>-2.740</td>\n", " <td>101.300</td>\n", " <td>104.000</td>\n", " <td>...</td>\n", " <td>106.900</td>\n", " <td>0.250</td>\n", " <td>192</td>\n", " <td>187</td>\n", " <td>282</td>\n", " <td>7</td>\n", " <td>299</td>\n", " <td>305</td>\n", " <td>284</td>\n", " <td>154</td>\n", " </tr>\n", " <tr>\n", " <th>218</th>\n", " <td>2002</td>\n", " <td>Winthrop</td>\n", " <td>BSth</td>\n", " <td>16</td>\n", " <td>17</td>\n", " <td>12</td>\n", " <td>219</td>\n", " <td>-5.310</td>\n", " <td>97.300</td>\n", " <td>102.600</td>\n", " <td>...</td>\n", " <td>107.900</td>\n", " <td>-0.980</td>\n", " <td>258</td>\n", " <td>155</td>\n", " 
<td>243</td>\n", " <td>79</td>\n", " <td>311</td>\n", " <td>313</td>\n", " <td>304</td>\n", " <td>188</td>\n", " </tr>\n", " <tr>\n", " <th>251</th>\n", " <td>2002</td>\n", " <td>Alcorn St.</td>\n", " <td>SWAC</td>\n", " <td>16</td>\n", " <td>20</td>\n", " <td>10</td>\n", " <td>252</td>\n", " <td>-8.710</td>\n", " <td>100.100</td>\n", " <td>108.800</td>\n", " <td>...</td>\n", " <td>109.800</td>\n", " <td>2.600</td>\n", " <td>213</td>\n", " <td>268</td>\n", " <td>23</td>\n", " <td>8</td>\n", " <td>325</td>\n", " <td>325</td>\n", " <td>325</td>\n", " <td>101</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 24 columns</p>\n", "</div>" ], "text/plain": [ " year team conf seed wins losses kenpom adj_em \\\n", "161 2002 Holy Cross Pat 16 18 15 162 -0.240 \n", "173 2002 Siena MAAC 16 17 19 174 -1.210 \n", "193 2002 Boston University AE 16 22 10 194 -2.740 \n", "218 2002 Winthrop BSth 16 17 12 219 -5.310 \n", "251 2002 Alcorn St. SWAC 16 20 10 252 -8.710 \n", "\n", " adj_o adj_d ... opp_d ncsos_adj_em adj_o_rank \\\n", "161 98.100 98.400 ... 106.100 -3.380 245 \n", "173 100.800 102.000 ... 104.400 -0.420 201 \n", "193 101.300 104.000 ... 106.900 0.250 192 \n", "218 97.300 102.600 ... 107.900 -0.980 258 \n", "251 100.100 108.800 ... 109.800 2.600 213 \n", "\n", " adj_d_rank adj_t_rank luck_rank sos_adj_em_rank opp_o_rank \\\n", "161 75 313 268 264 261 \n", "173 141 223 296 237 245 \n", "193 187 282 7 299 305 \n", "218 155 243 79 311 313 \n", "251 268 23 8 325 325 \n", "\n", " opp_d_rank ncsos_adj_em_rank \n", "161 262 246 \n", "173 200 167 \n", "193 284 154 \n", "218 304 188 \n", "251 325 101 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom[(kenpom['year'] == 2002) & (kenpom['seed'] == 16)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, KenPom has 5 16 seeds for the 2002 tournament. Alcorn St. lost in the play-in game. 
The other 4 16 seeds played in the first round." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>team_hi</th>\n", " <th>seed_hi</th>\n", " <th>score_hi</th>\n", " <th>team_lo</th>\n", " <th>seed_lo</th>\n", " <th>score_lo</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>3</th>\n", " <td>2002</td>\n", " <td>Cincinnati</td>\n", " <td>1</td>\n", " <td>90</td>\n", " <td>Boston University</td>\n", " <td>16</td>\n", " <td>52</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>2002</td>\n", " <td>Maryland</td>\n", " <td>1</td>\n", " <td>85</td>\n", " <td>Siena</td>\n", " <td>16</td>\n", " <td>70</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>2002</td>\n", " <td>Duke</td>\n", " <td>1</td>\n", " <td>84</td>\n", " <td>Winthrop</td>\n", " <td>16</td>\n", " <td>37</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>2002</td>\n", " <td>Kansas</td>\n", " <td>1</td>\n", " <td>70</td>\n", " <td>Holy Cross</td>\n", " <td>16</td>\n", " <td>59</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year team_hi seed_hi score_hi team_lo seed_lo score_lo\n", "3 2002 Cincinnati 1 90 Boston University 16 52\n", "6 2002 Maryland 1 85 Siena 16 70\n", "15 2002 Duke 1 84 Winthrop 16 37\n", "19 2002 Kansas 1 70 Holy Cross 16 59" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games[(games['year'] == 2002) & (games['seed_hi'] == 1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find a list of historical NCAA 
play-in games [here](https://en.wikipedia.org/wiki/NCAA_Men%27s_Division_I_Basketball_Opening_Round#Single_game_results_(2001%E2%80%932010)). If you look at the list, you will see that the extra KenPom names all correspond to teams that lost in play-in games, and didn't otherwise play in the NCAA tournament. So, it's not a problem that they are missing from the Washington Post data.\n", "\n", "### Merging the KenPom and Washington Post Data\n", "\n", "The next step is to build a table combining the historical NCAA tournament data with the KenPom data. Each row will correspond to a first round game, with the KenPom data for both teams playing in that game.\n", "\n", "Using `pandas`, the way to achieve this is with the `DataFrame` [`merge()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) method. We want to merge the KenPom data as new columns in the NCAA game `DataFrame`.\n", "\n", "Since we are going to be merging KenPom data for each team in a game, let's first write a function to merge the data for either the higher seeded teams or the lower seeded teams. We also need to make sure to clean up any unnecessary columns that get added by the `merge()` method, as well as make sure the new columns have the correct names. Each merged KenPom column will have the suffix `'_hi'` or `'_lo'` depending upon which team is being merged." 
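, "\n", "\n",
"As a minimal sketch of this merge pattern, assuming two small made-up tables (the `toy_games` and `toy_kenpom` names and all of their values are illustrative, not part of the real data), merging on year and team name and then renaming the new column might look like this:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-ins for the real tables (made-up values).\n",
"toy_games = pd.DataFrame({\n",
"    'year': [2017, 2017],\n",
"    'team_hi': ['Team A', 'Team B'],\n",
"    'team_lo': ['Team C', 'Team D'],\n",
"})\n",
"toy_kenpom = pd.DataFrame({\n",
"    'year': [2017, 2017, 2017, 2017],\n",
"    'team': ['Team A', 'Team B', 'Team C', 'Team D'],\n",
"    'kenpom': [5, 40, 60, 35],\n",
"})\n",
"\n",
"# Attach the rating of the higher seeded team. With validate='one_to_one',\n",
"# pandas raises a MergeError if a (year, team) key is duplicated on either side.\n",
"merged = toy_games.merge(\n",
"    toy_kenpom,\n",
"    left_on=['year', 'team_hi'],\n",
"    right_on=['year', 'team'],\n",
"    validate='one_to_one',\n",
")\n",
"# Drop the redundant key column and tag the rating with the '_hi' suffix.\n",
"merged = merged.drop(columns=['team']).rename(columns={'kenpom': 'kenpom_hi'})\n",
"print(merged)\n",
"```"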
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def merge_kenpom(games, kenpom, HL):\n", "    \"\"\"Merge KenPom information for one team (high or low seed) with NCAA Tournament games.\"\"\"\n", "    if HL == 'H':\n", "        suffix = '_hi'\n", "    elif HL == 'L':\n", "        suffix = '_lo'\n", "    else:\n", "        raise ValueError('invalid HL', HL)\n", "    df = games.merge(kenpom, left_on=['year', 'team'+suffix], right_on=['year', 'team'], validate='one_to_one')\n", "    df = df.drop(columns=['team', 'seed'])\n", "    cols = [col for col in kenpom.columns if col != 'year']\n", "    new_cols = {col: col+suffix for col in cols}\n", "    df = df.rename(columns=new_cols)\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can just call this function twice: once for the higher seeded teams, and once for the lower seeded teams." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def merge_ncaa_kenpom(games, kenpom):\n", "    \"\"\"Merge KenPom information with NCAA Tournament games.\"\"\"\n", "    df = merge_kenpom(games, kenpom, 'H')\n", "    df = merge_kenpom(df, kenpom, 'L')\n", "    return df" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(512, 49)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = merge_ncaa_kenpom(games, kenpom)\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we still have 512 rows, we know we didn't lose any game information by merging. Let's see what columns we have now."
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['year', 'team_hi', 'seed_hi', 'score_hi', 'team_lo', 'seed_lo',\n", " 'score_lo', 'conf_hi', 'wins_hi', 'losses_hi', 'kenpom_hi', 'adj_em_hi',\n", " 'adj_o_hi', 'adj_d_hi', 'adj_t_hi', 'luck_hi', 'sos_adj_em_hi',\n", " 'opp_o_hi', 'opp_d_hi', 'ncsos_adj_em_hi', 'adj_o_rank_hi',\n", " 'adj_d_rank_hi', 'adj_t_rank_hi', 'luck_rank_hi', 'sos_adj_em_rank_hi',\n", " 'opp_o_rank_hi', 'opp_d_rank_hi', 'ncsos_adj_em_rank_hi', 'conf_lo',\n", " 'wins_lo', 'losses_lo', 'kenpom_lo', 'adj_em_lo', 'adj_o_lo',\n", " 'adj_d_lo', 'adj_t_lo', 'luck_lo', 'sos_adj_em_lo', 'opp_o_lo',\n", " 'opp_d_lo', 'ncsos_adj_em_lo', 'adj_o_rank_lo', 'adj_d_rank_lo',\n", " 'adj_t_rank_lo', 'luck_rank_lo', 'sos_adj_em_rank_lo', 'opp_o_rank_lo',\n", " 'opp_d_rank_lo', 'ncsos_adj_em_rank_lo'],\n", " dtype='object')" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we have all the original game information, plus KenPom information for both the higher seeded and lower seeded teams.\n", "\n", "### Preparing the Data for Analysis\n", "\n", "We need to do a few more things to the data before we begin our upset analysis.\n", "\n", "#### Dummy Variables\n", "\n", "First, we will create two new important columns. The first will indicate that the game was an \"upset\", defined as the lower seed winning. The second will flag games where the team with the worse KenPom ranking was actually seeded higher. We will refer to this situation as a \"KenPom inversion\".\n", "\n", "These columns will each be \"dummy\" variables, with the value 0 denoting False/No, and 1 denoting True/Yes." 
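, "\n", "\n",
"As a minimal sketch of what these dummy variables look like, here is a made-up three-game table (the column names match the merged data, but the values are illustrative). It uses vectorized comparisons rather than a row-by-row function; on the same data both approaches should produce the same 0/1 values:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Three made-up games: scores and KenPom ranks for the higher and lower seeds.\n",
"toy = pd.DataFrame({\n",
"    'score_hi': [70, 61, 80],\n",
"    'score_lo': [65, 72, 79],\n",
"    'kenpom_hi': [20, 45, 30],\n",
"    'kenpom_lo': [50, 38, 25],\n",
"})\n",
"# 1 if the lower seed outscored the higher seed, else 0.\n",
"toy['upset'] = (toy['score_lo'] > toy['score_hi']).astype(int)\n",
"# 1 if the lower seed has the better (numerically smaller) KenPom rank, else 0.\n",
"toy['kenpom_inverted'] = (toy['kenpom_lo'] < toy['kenpom_hi']).astype(int)\n",
"print(toy[['upset', 'kenpom_inverted']])\n",
"```"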
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "def is_upset(row):\n", "    \"\"\"Return 1 if the lower seed outscored the higher seed, else 0.\"\"\"\n", "    if row['score_lo'] > row['score_hi']:\n", "        return 1\n", "    else:\n", "        return 0" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "def kenpom_inverted(row):\n", "    \"\"\"Return 1 if the lower seed has the better (smaller) KenPom rank, else 0.\"\"\"\n", "    if row['kenpom_lo'] < row['kenpom_hi']:\n", "        return 1\n", "    else:\n", "        return 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An upset is simple to determine: the lower seeded team outscores the higher seeded team.\n", "\n", "On the other hand, you may have to stare at the \"KenPom inversion\" function a bit to convince yourself that it's correct. The best KenPom ranking is 1, so the function is looking for games where the lower seeded team in the tournament has a lower number (higher ranking) from KenPom.\n", "\n", "Now we can put these functions together to merge the data and create the dummy variables. Also, since no first seed was upset in the 2002-2017 games covered here, we will drop the 64 1-16 games from our data set going forward."
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def merge_data(games, kenpom):\n", "    \"\"\"Merge data and create dummy variables.\"\"\"\n", "    df = merge_ncaa_kenpom(games, kenpom)\n", "    # Copy the filtered frame so the column assignments below don't trigger SettingWithCopyWarning.\n", "    df = df[df['seed_hi'] != 1].copy()\n", "    df['upset'] = df.apply(is_upset, axis=1)\n", "    df['kenpom_inverted'] = df.apply(kenpom_inverted, axis=1)\n", "    start_cols = [\n", "        'year',\n", "        'upset',\n", "        'kenpom_inverted',\n", "        'team_hi', 'seed_hi', 'kenpom_hi', 'team_lo', 'seed_lo', 'kenpom_lo',\n", "    ]\n", "    cols = (\n", "        start_cols +\n", "        [col for col in df.columns if '_hi' in col and 'rank' not in col and col not in start_cols] +\n", "        [col for col in df.columns if '_lo' in col and 'rank' not in col and col not in start_cols]\n", "    )\n", "    return df[cols].reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(448, 35)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = merge_data(games, kenpom)\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['year', 'upset', 'kenpom_inverted', 'team_hi', 'seed_hi', 'kenpom_hi',\n", " 'team_lo', 'seed_lo', 'kenpom_lo', 'score_hi', 'conf_hi', 'wins_hi',\n", " 'losses_hi', 'adj_em_hi', 'adj_o_hi', 'adj_d_hi', 'adj_t_hi', 'luck_hi',\n", " 'sos_adj_em_hi', 'opp_o_hi', 'opp_d_hi', 'ncsos_adj_em_hi', 'score_lo',\n", " 'conf_lo', 'wins_lo', 'losses_lo', 'adj_em_lo', 'adj_o_lo', 'adj_d_lo',\n", " 'adj_t_lo', 'luck_lo', 'sos_adj_em_lo', 'opp_o_lo', 'opp_d_lo',\n", " 'ncsos_adj_em_lo'],\n", " dtype='object')" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Upsets and KenPom Ranks\n", "\n", "Let's take a quick look at how many upsets occurred during the 2002-2017 first round games.
In our [prior analysis on first round upsets](http://practicallypredictable.com/2018/03/07/march-madness-first-round-upsets/), we used NCAA tournament data going back to 1985, when the modern 64-team format started. We should make sure that the 2002-2017 sub-period isn't very different from the full data set before we delve too deeply in our analysis.\n", "\n", "Since the upset dummy variable is either 0 or 1, we can sum it to get the count of upsets that occurred. If we group by the higher seed, we can compare the upsets in the 2002-2017 period with the full 1985-2017 history." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "seed_hi\n", "2 0.062\n", "3 0.125\n", "4 0.188\n", "5 0.422\n", "6 0.438\n", "7 0.359\n", "8 0.406\n", "Name: upset, dtype: float64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['seed_hi'])['upset'].sum() / (4*16)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the most part, these upset frequencies aren't significantly different from the full 33-year period. The eighth seed upset frequency above is 40.6%, whereas in the full 33-year period it's 49.2%. However, as we discussed in the earlier post, the 8-9 game is never really an upset in the correct meaning of the word, so this shouldn't concern us too much. It's unlikely that there was a meaningful change in the probability of eighth seeds winning these games after 2002 compared to previously.\n", "\n", "Overall, there are 128 \"upsets\" (again, including the 8-9 games) in our data set of first round games since 2002." 
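, "\n", "\n",
"To make the arithmetic behind these frequencies explicit: each seed line has 4 first round games per year over 16 years, and 7 seed lines remain once the 1-16 games are dropped, so the overall upset frequency implied by the count above is\n",
"\n",
"$$ \\frac{128}{7 \\times (4 \\times 16)} = \\frac{128}{448} \\approx 0.286 $$"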
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 320\n", "1 128\n", "Name: upset, dtype: int64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['upset'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, there are 62 cases where the better-ranked team (by KenPom) was not the higher seeded team in the NCAA tournament first round game." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 386\n", "1 62\n", "Name: kenpom_inverted, dtype: int64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['kenpom_inverted'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at how these variables interact, using `pandas` [`crosstab()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html) function." 
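, "\n", "\n",
"As a small illustration on made-up data (six invented games, not the tournament results), `crosstab()` counts each combination of two 0/1 columns, and its `normalize` argument converts the counts into proportions within each column:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Six made-up games with the two dummy variables.\n",
"toy = pd.DataFrame({\n",
"    'upset': [0, 0, 1, 1, 0, 1],\n",
"    'kenpom_inverted': [0, 0, 0, 1, 1, 1],\n",
"})\n",
"# Raw counts of each (upset, kenpom_inverted) combination.\n",
"counts = pd.crosstab(index=toy['upset'], columns=toy['kenpom_inverted'])\n",
"# Share of upsets within each kenpom_inverted group (each column sums to 1).\n",
"shares = pd.crosstab(index=toy['upset'], columns=toy['kenpom_inverted'], normalize='columns')\n",
"print(counts)\n",
"print(shares)\n",
"```"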
] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>kenpom_inverted</th>\n", " <th>0</th>\n", " <th>1</th>\n", " </tr>\n", " <tr>\n", " <th>upset</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>300</td>\n", " <td>20</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>86</td>\n", " <td>42</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "kenpom_inverted 0 1\n", "upset \n", "0 300 20\n", "1 86 42" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(index=df['upset'], columns=df['kenpom_inverted'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look first at the \"1\" column for the KenPom inversion dummy variable. Of the 62 times a team with higher KenPom rank was seeded worse in the tournament, 42 (or more than $\\frac 2 3$ of the time) resulted in an \"upset\" (again, including the 8-9 games).\n", "\n", "That seems promising. If you see a matchup between teams where the tournament seed and KenPom rankings are inverted, it would suggest you have much better than even odds going with the KenPom ranking in picking the winner.\n", "\n", "On the flip side, however, look at the \"1\" row for the upset dummy variable. Of the 128 upsets, only 42 (or roughly $\\frac 1 3$) had the KenPom rankings inverted.\n", "\n", "In other words, conditional on seeing a KenPom inversion, an upset is more likely than not. 
However, if you only focus on KenPom inversions, you'll miss a lot of potential upsets.\n", "\n", "We can shed more light on this by looking at the distribution of upsets and KenPom inversions by the seed of the higher seeded team." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>kenpom_inverted</th>\n", " <th>upset</th>\n", " </tr>\n", " <tr>\n", " <th>seed_hi</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>12</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>5</td>\n", " <td>27</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>11</td>\n", " <td>28</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>23</td>\n", " <td>23</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>23</td>\n", " <td>26</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " kenpom_inverted upset\n", "seed_hi \n", "2 0 4\n", "3 0 8\n", "4 0 12\n", "5 5 27\n", "6 11 28\n", "7 23 23\n", "8 23 26" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['seed_hi'])[['kenpom_inverted', 'upset']].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that KenPom inversions only occur in games where the higher seed is a fifth seed or lower, and are much more common in the 7-10 and 8-9 matchups.
This makes sense, as it's clearly more likely that NCAA seeding would diverge from KenPom rankings for the middle-quality teams.\n", "\n", "We can slice the data more finely to better see where the KenPom inversion might be useful in predicting upsets." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead tr th {\n", " text-align: left;\n", " }\n", "\n", " .dataframe thead tr:last-of-type th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr>\n", " <th>kenpom_inverted</th>\n", " <th colspan=\"2\" halign=\"left\">0</th>\n", " <th colspan=\"2\" halign=\"left\">1</th>\n", " </tr>\n", " <tr>\n", " <th>upset</th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>0</th>\n", " <th>1</th>\n", " </tr>\n", " <tr>\n", " <th>seed_hi</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>60</td>\n", " <td>4</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>56</td>\n", " <td>8</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>52</td>\n", " <td>12</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>35</td>\n", " <td>24</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>35</td>\n", " <td>18</td>\n", " <td>1</td>\n", " <td>10</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>34</td>\n", " <td>7</td>\n", " <td>7</td>\n", " <td>16</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>28</td>\n", " <td>13</td>\n", " <td>10</td>\n", " <td>13</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ 
"kenpom_inverted 0 1 \n", "upset 0 1 0 1\n", "seed_hi \n", "2 60 4 0 0\n", "3 56 8 0 0\n", "4 52 12 0 0\n", "5 35 24 2 3\n", "6 35 18 1 10\n", "7 34 7 7 16\n", "8 28 13 10 13" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upset_counts = pd.DataFrame(\n", " df.groupby(['seed_hi', 'upset', 'kenpom_inverted']\n", ").size()).unstack().unstack().fillna(0).astype(int)\n", "upset_counts.columns = upset_counts.columns.droplevel()\n", "upset_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The chance of picking an upset simply using inverted KenPom rankings appears to be most favorable for sixth and seventh seed games. For fifth seed and eighth seed games, the impact doesn't seem as powerful.\n", "\n", "#### An Important Note on NCAA Tournament Selection, Seeding and Picking Upsets\n", "\n", "There has been a long and fierce debate about the methods and inputs the NCAA tournament committee has used to select teams for March Madness over the years. You can read about the history and some of the issues [here](https://fivethirtyeight.com/features/the-ncaa-is-modernizing-the-way-it-picks-march-madness-teams/), [here](http://www.espn.com/mens-college-basketball/story/_/id/22144913/ncaa-says-year-test-run-new-evaluation-system) and in the underlying linked articles. This year, [the committee has modified its process](https://www.ncaa.com/news/basketball-men/bracketiq/2018-02-15/ncaa-selection-committee-adjusts-team-sheets-emphasizing) to include additional data, including KenPom, into the selection decisions.\n", "\n", "I raise this issue here because it's important to note that one reason why KenPom rankings and tournament seeding divered in prior years may have been due (at least in part) to the fact that the tournament committee didn't look at KenPom data. 
This year, the committee will be looking at many new pieces of information, of which KenPom will be a part.\n", "\n", "The point is, the selection process itself is an input into the likelihood of an upset. If the selection and seeding process changes dramatically, historical data may be less useful in picking upsets going forward. This is something to keep an eye on as the committee continues to modify its process.\n", "\n", "#### Repeating the Caveat about the Historical KenPom Data\n", "\n", "Also, keep in mind that the historical KenPom data include the impact of the NCAA tournament. Suffering an upset in the first round could negatively affect a higher seed's KenPom ranking, perhaps lowering it enough to actually cause the inversion in KenPom rankings. If this occurred, rather than _predicting_ the upset, the KenPom inversion would be _because_ of the upset. Until we obtain pre-tournament historical KenPom data, we can't verify whether this situation actually occurred often enough to matter.\n", "\n", "### Making Predictions with KenPom Data\n", "\n", "Let's see how to use the KenPom data to estimate win probability.\n", "\n", "The procedure is relatively simple:\n", "\n", "- Estimate the number of possessions in the game for each team.\n", "- Estimate the points per possession differential between the two teams.\n", "- Estimate the point spread as the product of the number of possessions per team and the points per possession differential.\n", "- Convert the point spread into a win probability.\n", "\n", "Let's go through this one step at a time.\n", "\n", "#### Estimating the Number of Possessions Per Team\n", "\n", "Each team has its own Adjusted Tempo measure, $AdjT_H$ and $AdjT_L$, for the higher and lower seed teams, respectively.\n", "\n", "Simply average these two tempo measures to estimate the number of possessions each team has (which are assumed to be equal). 
Call the number of possessions for each team $POSS$:\n", "\n", "$$ POSS = \\frac 1 2 (AdjT_H + AdjT_L) $$\n", "\n", "#### Estimating the Points Per Possession Differential\n", "\n", "We'll need to do a little algebra now.\n", "\n", "If we want to predict the offensive output of a team, we need to consider both the team's offensive efficiency and its opponent's defensive efficiency. If the opponent has a good defense (i.e., a low _AdjD_), that should drag down the team's offensive production compared to an average opponent. On the other hand, versus a bad defense (high _AdjD_), the team might put up even better offensive numbers.\n", "\n", "Let's focus first on the higher seeded team, with offensive efficiency $AdjO_H$, playing against a lower seeded opponent with defensive efficiency $AdjD_L$.\n", "\n", "The predicted KenPom points per possession for the higher seeded team is:\n", "\n", "$$ PPP_H = \\frac 1 2 (AdjO_H + AdjD_L) $$\n", "\n", "Similarly, the predicted KenPom points per possession for the lower seeded team is:\n", "\n", "$$ PPP_L = \\frac 1 2 (AdjO_L + AdjD_H) $$\n", "\n", "If you subtract these two, you'll get the points per possession differential. 
The team with the better (positive) differential is expected to win the game in the KenPom framework.\n", "\n", "$$\n", "\\begin{aligned}\n", "PPP_H - PPP_L &= \\frac 1 2 (AdjO_H + AdjD_L) - \\frac 1 2 (AdjO_L + AdjD_H) \\\\\n", "&= \\frac 1 2 (AdjO_H + AdjD_L - AdjO_L - AdjD_H) \\\\\n", "&= \\frac 1 2 (AdjO_H - AdjD_H - AdjO_L + AdjD_L) \\\\\n", "&= \\frac 1 2 ((AdjO_H - AdjD_H) - (AdjO_L - AdjD_L)) \\\\\n", "&= \\frac 1 2 (AdjEM_H - AdjEM_L)\n", "\\end{aligned}\n", "$$\n", "\n", "As you can see, the estimated points per possession differential is proportional to the difference of the efficiency margins for each team.\n", "\n", "#### Estimate the Point Spread\n", "\n", "To get a spread measured in points, multiply the efficiency margin difference by the number of possessions per team and divide by 100, since KenPom efficiencies are expressed in points per 100 possessions. Note that the spread uses the full efficiency margin difference, without the factor of $\\frac 1 2$ from the averaging above; this is the convention used in the calculations in the rest of this notebook.\n", "\n", "$$ PS_{HL} = POSS \\times \\frac{AdjEM_H - AdjEM_L}{100} $$\n", "\n", "#### Estimate the Win Probability\n", "\n", "Sometimes, the point spread is interesting information in and of itself (for example, if you are comparing the KenPom estimate to a betting line).\n", "\n", "Under the KenPom framework, the team with the higher efficiency margin is expected to win the game. This is true whether the predicted point spread is 0.1 points or 20 points. 
To convert the point spread into a win probability, use the following steps:\n", "\n", "- Divide the point spread by 11 points, the assumed standard deviation of the game margin (see [here](https://www.reddit.com/r/CollegeBasketball/comments/5xir8t/calculating_win_probability_and_margin_of_victory/) and [here](https://www.reddit.com/r/CollegeBasketball/comments/5tl6gj/calculating_probability_based_on_kenpoms_adjusted/))\n", "- Take this scaled value and plug it into the [standard normal cumulative distribution function](http://www.itl.nist.gov/div898/handbook/eda/section3/eda3661.htm)\n", "- The result is the estimated win probability\n", "\n", "The function below computes the win probability given the KenPom predicted point spread.\n", "\n", "It uses the [error function](https://en.wikipedia.org/wiki/Error_function) `erf()` in the Python `math` module to compute the standard normal cumulative distribution function. This is the probability of a standard normal random variable being less than the parameter (in this case, the KenPom point spread divided by 11)." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "def kenpom_prob(point_spread, std=11):\n", "    \"\"\"Calculate team win probability using KenPom predicted point spread.\"\"\"\n", "    # Standard normal CDF evaluated at point_spread / std\n", "    return 0.5 * (1 + math.erf(point_spread / (std * math.sqrt(2))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A Quick Example\n", "\n", "Let's look at the upcoming first round matchup between Virginia and UMBC, using KenPom data as of Monday, March 12, 2018.\n", "\n", "| KenPom Value | Virginia | UMBC |\n", "| :----------- | -------: | --------: |\n", "| _AdjT_ | 59.1 | 68.1 |\n", "| _AdjO_ | 116.5 | 103.3 |\n", "| _AdjD_ | 84.4 | 105.3 |\n", "| _AdjEM_ | +32.15 | -1.99 |\n", "\n", "From these KenPom values, we see that the predicted number of possessions per game is 63.6 for each team (about 18.9 seconds per possession).\n", "\n", "The efficiency margin differential is 34.14 points per 100 possessions in favor of Virginia. 
This leads to a KenPom prediction that the Cavaliers should beat the Retrievers by 21.7 points.\n", "\n", "Plug this point spread into the above function:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9757366785052486" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kenpom_prob(21.7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This formula predicts that Virginia has greater than 97.5% probability of advancing to the second round.\n", "\n", "### Analyzing Historical Upsets with KenPom Win Probabilities\n", "\n", "Let's apply this method to our historical KenPom data." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "def kenpom_predictions(df):\n", " \"\"\"Calculate historical KenPom point spread and win probability predictions.\"\"\"\n", " df = df.copy()\n", " df['kenpom_poss'] = (df['adj_t_hi'] + df['adj_t_lo']) / 2\n", " df['kenpom_em_diff'] = df['adj_em_hi'] - df['adj_em_lo']\n", " df['kenpom_spread'] = df['kenpom_em_diff'] * df['kenpom_poss'] / 100\n", " df['win_prob_hi'] = df['kenpom_spread'].apply(kenpom_prob)\n", " df['win_prob_lo'] = 1 - df['win_prob_hi']\n", " start_cols = [\n", " 'year', 'upset', 'kenpom_inverted', 'kenpom_spread', 'kenpom_em_diff',\n", " ]\n", " end_cols = ['kenpom_poss',]\n", " cols = start_cols + [col for col in df.columns if col not in start_cols and col not in end_cols] + end_cols\n", " return df[cols]" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['year', 'upset', 'kenpom_inverted', 'kenpom_spread', 'kenpom_em_diff',\n", " 'team_hi', 'seed_hi', 'kenpom_hi', 'team_lo', 'seed_lo', 'kenpom_lo',\n", " 'score_hi', 'conf_hi', 'wins_hi', 'losses_hi', 'adj_em_hi', 'adj_o_hi',\n", " 'adj_d_hi', 'adj_t_hi', 'luck_hi', 'sos_adj_em_hi', 'opp_o_hi',\n", " 'opp_d_hi', 'ncsos_adj_em_hi', 'score_lo', 
'conf_lo', 'wins_lo',\n", " 'losses_lo', 'adj_em_lo', 'adj_o_lo', 'adj_d_lo', 'adj_t_lo', 'luck_lo',\n", " 'sos_adj_em_lo', 'opp_o_lo', 'opp_d_lo', 'ncsos_adj_em_lo',\n", " 'win_prob_hi', 'win_prob_lo', 'kenpom_poss'],\n", " dtype='object')" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = kenpom_predictions(df)\n", "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the range of predicted first round NCAA tournament win probabilities for the 2-15 seeds in our historical KenPom data. Remember the caveat that these win probabilities are probably affected by the fact that the KenPom data were retroactively calculated, and include NCAA tournament game results in the calculations." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>count</th>\n", " <th>mean</th>\n", " <th>std</th>\n", " <th>min</th>\n", " <th>25%</th>\n", " <th>50%</th>\n", " <th>75%</th>\n", " <th>max</th>\n", " </tr>\n", " <tr>\n", " <th>seed_hi</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>64.000</td>\n", " <td>0.904</td>\n", " <td>0.050</td>\n", " <td>0.713</td>\n", " <td>0.887</td>\n", " <td>0.917</td>\n", " <td>0.935</td>\n", " <td>0.983</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>64.000</td>\n", " <td>0.804</td>\n", " <td>0.084</td>\n", " <td>0.561</td>\n", " <td>0.759</td>\n", " <td>0.823</td>\n", 
" <td>0.866</td>\n", " <td>0.958</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>64.000</td>\n", " <td>0.758</td>\n", " <td>0.088</td>\n", " <td>0.536</td>\n", " <td>0.699</td>\n", " <td>0.779</td>\n", " <td>0.830</td>\n", " <td>0.903</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>64.000</td>\n", " <td>0.652</td>\n", " <td>0.104</td>\n", " <td>0.435</td>\n", " <td>0.565</td>\n", " <td>0.653</td>\n", " <td>0.734</td>\n", " <td>0.844</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>64.000</td>\n", " <td>0.589</td>\n", " <td>0.100</td>\n", " <td>0.270</td>\n", " <td>0.531</td>\n", " <td>0.596</td>\n", " <td>0.664</td>\n", " <td>0.792</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>64.000</td>\n", " <td>0.543</td>\n", " <td>0.110</td>\n", " <td>0.256</td>\n", " <td>0.474</td>\n", " <td>0.541</td>\n", " <td>0.629</td>\n", " <td>0.758</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>64.000</td>\n", " <td>0.526</td>\n", " <td>0.082</td>\n", " <td>0.269</td>\n", " <td>0.480</td>\n", " <td>0.526</td>\n", " <td>0.584</td>\n", " <td>0.709</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "seed_hi \n", "2 64.000 0.904 0.050 0.713 0.887 0.917 0.935 0.983\n", "3 64.000 0.804 0.084 0.561 0.759 0.823 0.866 0.958\n", "4 64.000 0.758 0.088 0.536 0.699 0.779 0.830 0.903\n", "5 64.000 0.652 0.104 0.435 0.565 0.653 0.734 0.844\n", "6 64.000 0.589 0.100 0.270 0.531 0.596 0.664 0.792\n", "7 64.000 0.543 0.110 0.256 0.474 0.541 0.629 0.758\n", "8 64.000 0.526 0.082 0.269 0.480 0.526 0.584 0.709" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(['seed_hi'])['win_prob_hi'].describe()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th 
{\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>count</th>\n", " <th>mean</th>\n", " <th>std</th>\n", " <th>min</th>\n", " <th>25%</th>\n", " <th>50%</th>\n", " <th>75%</th>\n", " <th>max</th>\n", " </tr>\n", " <tr>\n", " <th>seed_lo</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>9</th>\n", " <td>64.000</td>\n", " <td>0.474</td>\n", " <td>0.082</td>\n", " <td>0.291</td>\n", " <td>0.416</td>\n", " <td>0.474</td>\n", " <td>0.520</td>\n", " <td>0.731</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>64.000</td>\n", " <td>0.457</td>\n", " <td>0.110</td>\n", " <td>0.242</td>\n", " <td>0.371</td>\n", " <td>0.459</td>\n", " <td>0.526</td>\n", " <td>0.744</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>64.000</td>\n", " <td>0.411</td>\n", " <td>0.100</td>\n", " <td>0.208</td>\n", " <td>0.336</td>\n", " <td>0.404</td>\n", " <td>0.469</td>\n", " <td>0.730</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>64.000</td>\n", " <td>0.348</td>\n", " <td>0.104</td>\n", " <td>0.156</td>\n", " <td>0.266</td>\n", " <td>0.347</td>\n", " <td>0.435</td>\n", " <td>0.565</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>64.000</td>\n", " <td>0.242</td>\n", " <td>0.088</td>\n", " <td>0.097</td>\n", " <td>0.170</td>\n", " <td>0.221</td>\n", " <td>0.301</td>\n", " <td>0.464</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>64.000</td>\n", " <td>0.196</td>\n", " <td>0.084</td>\n", " <td>0.042</td>\n", " <td>0.134</td>\n", " <td>0.177</td>\n", " <td>0.241</td>\n", " <td>0.439</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>64.000</td>\n", " <td>0.096</td>\n", " <td>0.050</td>\n", " <td>0.017</td>\n", " 
<td>0.065</td>\n", " <td>0.083</td>\n", " <td>0.113</td>\n", " <td>0.287</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " count mean std min 25% 50% 75% max\n", "seed_lo \n", "9 64.000 0.474 0.082 0.291 0.416 0.474 0.520 0.731\n", "10 64.000 0.457 0.110 0.242 0.371 0.459 0.526 0.744\n", "11 64.000 0.411 0.100 0.208 0.336 0.404 0.469 0.730\n", "12 64.000 0.348 0.104 0.156 0.266 0.347 0.435 0.565\n", "13 64.000 0.242 0.088 0.097 0.170 0.221 0.301 0.464\n", "14 64.000 0.196 0.084 0.042 0.134 0.177 0.241 0.439\n", "15 64.000 0.096 0.050 0.017 0.065 0.083 0.113 0.287" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.groupby(['seed_lo'])['win_prob_lo'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not surprisingly, the average win probabilities drop steadily as a function of the seed. There is quite a range in the values by seed, however.\n", "\n", "### Visualizing the Historical Win Probability Distributions\n", "\n", "We can create a [violin plot](https://en.wikipedia.org/wiki/Violin_plot) to better understand the distribution of win probability by seed. A violin plot shows the frequency distribution of data organized by category. We can organize the data by seed and whether the game was an upset or not, and see if there is a pattern in the KenPom win probability distributions." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "def violinplot_kenpom(data, ax, diff, ylabel, legend_loc='upper right'):\n", " \"\"\"Violin plot of statistics by higher seed, upset vs. 
no upset.\"\"\"\n", " ax = sns.violinplot(data=data, x='seed_hi', y=diff, hue='upset', split=True, inner='quart', cut=0, ax=ax)\n", " ax.set_xlabel('Higher Seed')\n", " ax.set_ylabel(ylabel)\n", " ax.legend(loc=legend_loc, title='upset')\n", " return ax" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<matplotlib.figure.Figure at 0x10b072a58>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(10, 7))\n", "ax = violinplot_kenpom(data, ax, 'win_prob_hi', 'Higher Seed KenPom Win Prob.')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few things are immediately clear from this plot:\n", "\n", "- Overall, the win probabilities slope down and to the right, since win probability drops for lower seeds.\n", "- The green distributions (where an upset actually occurred) are lower than the blue distributions (where no upset occurred).\n", "- There is substantial overlap between the green and blue distributions, so KenPom doesn't do a perfect job of identifying upsets. Far from it. But, if you focus on the dashed lines in the distributions (which represent the 25%, 50% and 75% quantiles of the distributions), you'll see that there is a noticeable distinction between the upset and non-upset distributions.\n", "\n", "### Visualizing the Efficiency Margin Differences\n", "\n", "The KenPom win probability is a function of the difference between the teams' efficiency margins. We can also look at those efficiency margins directly. The story isn't really that different, as we'll soon see. But, since the efficiency margins are relatively easy to understand, it may help your intuition to look at the data from this perspective." 
] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<matplotlib.figure.Figure at 0x1a1a5aaac8>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(10, 7))\n", "palette_colors = sns.color_palette()\n", "data[data['upset'] == 0].plot(\n", "    kind='scatter',\n", "    x='adj_em_lo',\n", "    y='adj_em_hi',\n", "    label='No Upset',\n", "    color=palette_colors[1],\n", "    ax=ax,\n", ")\n", "data[data['upset'] == 1].plot(\n", "    kind='scatter',\n", "    x='adj_em_lo',\n", "    y='adj_em_hi',\n", "    label='Upset',\n", "    marker='x',\n", "    color=palette_colors[2],\n", "    ax=ax,\n", ")\n", "ax.set_xlabel('Lower Seed KenPom Adj EM')\n", "ax.set_ylabel('Higher Seed KenPom Adj EM')\n", "ax.set_title('First Round NCAA Tournament Upsets and KenPom Adj EMs')\n", "ax.text(x=-4, y=11, s='Excludes 1 and 16 seeds')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The upsets (marked with 'x' for each data point) are clearly skewed to the lower right relative to the non-upset dots. Upsets have tended to occur more often when the lower seeded team has a relatively high KenPom _AdjEM_, or when the higher seeded team has a relatively low KenPom _AdjEM_. Of course, this is just another way of saying that the higher seeded team has a lower KenPom win probability.\n", "\n", "#### Visualizing the Distribution by Seed\n", "\n", "The overall picture makes it clear that KenPom data should be somewhat useful in predicting first round upsets. However, it's important to look at the results in more detail by seed.\n", "\n", "Let's decompose the scatter plot above into 6 plots, one for each seed 3 through 8. We will skip the second seed games since there are so few upsets.\n", "\n", "First, we need a function to isolate the data we need for each sub-plot." 
] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "def select(data, upset=None, seed_hi=None):\n", "    \"\"\"Extract data based upon higher seed and upset dummy variable.\"\"\"\n", "    # Compare against None explicitly: upset=0 is a valid filter but is falsy.\n", "    if upset is not None and seed_hi is not None:\n", "        return data[(data['upset'] == upset) & (data['seed_hi'] == seed_hi)]\n", "    elif upset is not None:\n", "        return data[data['upset'] == upset]\n", "    elif seed_hi is not None:\n", "        return data[data['seed_hi'] == seed_hi]\n", "    else:\n", "        raise ValueError('invalid parameters')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function converts a seed into a `(row, column)` tuple telling us which sub-plot goes with which seed." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "def ax_map(seed):\n", "    \"\"\"Map higher seed to a sub-plot.\"\"\"\n", "    return (seed - 3) % 3, (seed - 3) // 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function will draw the scatter plot for one particular seed game." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "def subplot_kenpom(data, ax, seed_hi, x, y):\n", "    \"\"\"Scatter sub-plot for a particular seed.\"\"\"\n", "    palette_colors = sns.color_palette()\n", "    df0 = select(data, upset=0, seed_hi=seed_hi)\n", "    df1 = select(data, upset=1, seed_hi=seed_hi)\n", "    ax.plot(df0[x], df0[y], '.', ms=8.0, color=palette_colors[1])\n", "    ax.plot(df1[x], df1[y], 'x', ms=8.0, mew=2, color=palette_colors[2])\n", "    ax.set_title(f'{seed_hi}-{17-seed_hi} Games')\n", "    return ax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can put everything together in one function which will generate the 6 sub-plots." 
] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "def scatterplot_kenpom(data, x='adj_em_lo', y='adj_em_hi'):\n", " \"\"\"Create matrix of scatter sub-plots by seed.\"\"\"\n", " fig, axarr = plt.subplots(nrows=3, ncols=2, figsize=(12, 12), sharex='col', sharey='row')\n", " for seed_hi in range(3, 9):\n", " ax = axarr[ax_map(seed_hi)]\n", " subplot_kenpom(data, ax, seed_hi, x, y)\n", " axarr[0, 0].legend(['No Upset', 'Upset'], loc='upper right')\n", " axarr[2, 0].set_xlabel(x)\n", " axarr[2, 1].set_xlabel(x)\n", " axarr[0, 0].set_ylabel(y)\n", " axarr[1, 0].set_ylabel(y)\n", " axarr[2, 0].set_ylabel(y)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<matplotlib.figure.Figure at 0x1a1a5cbbe0>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "scatterplot_kenpom(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These plots show that there is a lot of noise under the surface, and that we should be cautious about relying on KenPom data in picking upsets. Although the overall patterns make sense, the historical predictive accuracy by seed is mixed. And of course, the data set is relatively small.\n", "\n", "### Logistic Regression Plots\n", "\n", "Let's look at the distribution by seed in another way. We can use the [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) plotting functionality in the [`seaborn`](https://seaborn.pydata.org/) package to visualize the estimated upset probability as a function of the KenPom efficiency margin difference, by seed.\n", "\n", "We won't try to explain the math behind logistic regression here. That will need to wait for future posts. The main idea is it tries to predict the value of binary variables (taking the value 0 or 1) based on the values of other variables. In our case, the binary variable is whether the upset occurred or not. 
The predictive variable (we hope) is the KenPom efficiency margin difference between the teams.\n", "\n", "Using `seaborn`, we can create the plots and simultaneously have the package estimate the logistic regressions for us.\n", "\n", "The idea is similar to the 6 scatter sub-plots we did above, except in this case we want 6 logistic regression plots." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "def subplot_kenpom_logistic(data, ax, seed_hi, x):\n", " \"\"\"Logistic regression sub-plot for a particular seed.\"\"\"\n", " palette_colors = sns.color_palette()\n", " ax = sns.regplot(\n", " data=data[data['seed_hi'] == seed_hi],\n", " x=x,\n", " y='upset',\n", " logistic=True,\n", " n_boot=500,\n", " y_jitter=0.03,\n", " ax=ax,\n", " )\n", " ax.set_title(f'{seed_hi}-{17-seed_hi} Games')\n", " return ax" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "def regplot_kenpom(data, x='kenpom_em_diff'):\n", " \"\"\"Create matrix of logistic regression sub-plots by seed.\"\"\"\n", " fig, axarr = plt.subplots(nrows=3, ncols=2, figsize=(12, 12), sharex='col', sharey='row')\n", " for seed_hi in range(3, 9):\n", " ax = axarr[ax_map(seed_hi)]\n", " subplot_kenpom_logistic(data, ax, seed_hi, x)\n", " axarr[2, 0].set_xlabel(x)\n", " axarr[2, 1].set_xlabel(x)\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<matplotlib.figure.Figure at 0x1a1a953e48>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "with warnings.catch_warnings():\n", " warnings.simplefilter(\"ignore\")\n", " regplot_kenpom(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In these plots, the upsets are the dots with the value 'upset' equal to 1, while the 0 values are the non-upsets. 
The curved lines (with the surrounding shading) are the estimated probabilities of an upset, given the value of the KenPom efficiency margin difference on the horizontal axis.\n", "\n", "For example, for the 4-13 game, if the KenPom efficiency margin difference is 5, the estimated upset probability is about 60%. On the other hand, if the KenPom efficiency margin difference is 10, the estimated upset probability is only about 20% for the 4-13 game.\n", "\n", "You can see that the model doesn't ever predict a high upset probability for the 3-14 games. The message is similar to the scatter plots we saw above. The model is more confident predicting upsets in the 4-13 games than in the 5-12 games (since the probability curve gets to higher values). Don't be too confident in this result, however. The data set is quite small and this could be the result of noise, or the KenPom data caveats we've discussed previously.\n", "\n", "In any case, if you are inclined to pick upsets, this analysis shows one way to put the available KenPom data to work in making your picks.\n", "\n", "Good luck with your brackets!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sports_py36]", "language": "python", "name": "conda-env-sports_py36-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }