{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NBA Home Court Advantage\n", "\n", "### Part 2: A Deeper Look at Win Percentages\n", "\n", "In the [first notebook in this series on NBA home court advantage](http://nbviewer.jupyter.org/github/practicallypredictable/posts/blob/master/notebooks/nba_home_court-part1.ipynb), we saw that home court win percentages varied over the past 21 seasons, but have averaged around 60%.\n", "\n", "In this notebook, we'll try to drill a little deeper into the distribution of home court win percentages.\n", "\n", "Within a given NBA season, teams vary enormously in quality. From the historical data, we know that an average team (with a 50% win percentage) gains roughly 10 percentage points in win probability from playing at home. But what about low-quality teams or elite teams? What does home court advantage look like for teams that aren't average?\n", "\n", "Let's look at the data and see. We'll use the same matchup data from the 1996-97 through 2016-17 regular seasons." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "pd.options.display.max_rows = 999\n", "pd.options.display.max_columns = 999\n", "pd.options.display.float_format = '{:.3f}'.format" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "from matplotlib.colors import rgb2hex\n", "import seaborn as sns\n", "sns.set()\n", "sns.set_context('notebook')\n", "plt.style.use('ggplot')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent / 'basketball' / 'nba'\n", "DATA_DIR = PROJECT_DIR / 'data' / 'prepared'\n", "DATA_DIR.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def load_nba_historical_matchups(input_dir):\n", " \"\"\"Load pickle file of NBA matchups prepared for analytics.\"\"\"\n", " PKLFILENAME = 'stats_nba_com-matchups-1996_97-2016_17.pkl'\n", " pklfile = input_dir.joinpath(PKLFILENAME)\n", " return pd.read_pickle(pklfile)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(26787, 41)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matchups = load_nba_historical_matchups(DATA_DIR)\n", "matchups.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seasons = sorted(list(matchups['season'].unique()))\n", "len(seasons)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def 
prepare_regular_season(matchups):\n", " df = matchups.copy()\n", " df = df[df['season_type'] == 'regular']\n", " return df" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(24797, 41)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg = prepare_regular_season(matchups)\n", "reg.shape" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "teams = sorted(list(reg['team_curr_h'].unique()))\n", "len(teams)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping by Season and Team\n", "\n", "Let's group the matchup information by season and team so we can look for patterns. To do this, we'll use the `pandas` [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) methods. These methods are very powerful but can be tricky to use correctly.\n", "\n", "Grouping will allow us to calculate aggregate statistics for a particular team in a particular season. Teams change over time, as rosters, coaches and management evolve. Of course, there's also variation in team lineups within a season, due to trades and injuries. We will eventually get to the player level of detail, but for now we will focus on the regular season as a reasonable time unit for analyzing a team.\n", "\n", "We will also split the home and away games in this grouping. We can subtract the away game statistics from the home game statistics, controlling for season and team. This will allow us to develop a purer estimate of the impact of home court.\n", "\n", "For now, we're only going to compute win percentages. In future analysis, we'll apply the same idea to box scores." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def win_loss_information(matchups, seasons, teams):\n", " # Split home and away games\n", " home_games = matchups.groupby(['season', 'team_curr_h', 'won'])\n", " away_games = matchups.groupby(['season', 'team_curr_a', 'won'])\n", " # Get counts of each type of game within each group\n", " home_wl_count = home_games['date'].count()\n", " away_wl_count = away_games['date'].count()\n", " # Get counts of home wins, home losses, away wins and away losses\n", " home_wins = {}\n", " home_losses = {}\n", " for season, team, winner in home_games.groups.keys():\n", " if winner == 'H':\n", " home_wins[(season, team)] = home_wl_count[(season, team, winner)]\n", " else:\n", " home_losses[(season, team)] = home_wl_count[(season, team, winner)]\n", " away_wins = {}\n", " away_losses = {}\n", " for season, team, winner in away_games.groups.keys():\n", " if winner == 'A':\n", " away_wins[(season, team)] = away_wl_count[(season, team, winner)]\n", " else:\n", " away_losses[(season, team)] = away_wl_count[(season, team, winner)]\n", " # Create DataFrame of counts and win/loss percentages\n", " df = pd.DataFrame({\n", " 'home_wins': home_wins,\n", " 'home_losses': home_losses,\n", " 'away_wins': away_wins,\n", " 'away_losses': away_losses,\n", " })\n", " df['home_games'] = df['home_wins'] + df['home_losses']\n", " df['away_games'] = df['away_wins'] + df['away_losses']\n", " df['games'] = df['home_games'] + df['away_games']\n", " df['wins'] = df['away_wins'] + df['home_wins']\n", " df['losses'] = df['away_losses'] + df['home_losses']\n", " df['win_percentage'] = df['wins'] / df['games']\n", " df['home_win_percentage'] = df['home_wins'] / df['home_games']\n", " df['away_win_percentage'] = df['away_wins'] / df['away_games']\n", " df = df.reset_index().rename(columns={'level_0': 'season', 'level_1': 'team'})\n", " return df[[\n", " 'season',\n", " 'team',\n", " 'games',\n", " 'wins',\n", " 
'losses',\n", " 'win_percentage',\n", " 'home_games',\n", " 'home_wins',\n", " 'home_losses',\n", " 'home_win_percentage',\n", " 'away_games',\n", " 'away_wins',\n", " 'away_losses',\n", " 'away_win_percentage',\n", " ]]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>season</th>\n", " <th>team</th>\n", " <th>games</th>\n", " <th>wins</th>\n", " <th>losses</th>\n", " <th>win_percentage</th>\n", " <th>home_games</th>\n", " <th>home_wins</th>\n", " <th>home_losses</th>\n", " <th>home_win_percentage</th>\n", " <th>away_games</th>\n", " <th>away_wins</th>\n", " <th>away_losses</th>\n", " <th>away_win_percentage</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1996-97</td>\n", " <td>ATL</td>\n", " <td>82</td>\n", " <td>56</td>\n", " <td>26</td>\n", " <td>0.683</td>\n", " <td>41</td>\n", " <td>36</td>\n", " <td>5</td>\n", " <td>0.878</td>\n", " <td>41</td>\n", " <td>20</td>\n", " <td>21</td>\n", " <td>0.488</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1996-97</td>\n", " <td>BKN</td>\n", " <td>82</td>\n", " <td>26</td>\n", " <td>56</td>\n", " <td>0.317</td>\n", " <td>41</td>\n", " <td>16</td>\n", " <td>25</td>\n", " <td>0.390</td>\n", " <td>41</td>\n", " <td>10</td>\n", " <td>31</td>\n", " <td>0.244</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1996-97</td>\n", " <td>BOS</td>\n", " <td>82</td>\n", " <td>15</td>\n", " <td>67</td>\n", " <td>0.183</td>\n", " <td>41</td>\n", " <td>11</td>\n", " <td>30</td>\n", " <td>0.268</td>\n", " <td>41</td>\n", " <td>4</td>\n", " <td>37</td>\n", 
" <td>0.098</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1996-97</td>\n", " <td>CHI</td>\n", " <td>82</td>\n", " <td>69</td>\n", " <td>13</td>\n", " <td>0.841</td>\n", " <td>41</td>\n", " <td>39</td>\n", " <td>2</td>\n", " <td>0.951</td>\n", " <td>41</td>\n", " <td>30</td>\n", " <td>11</td>\n", " <td>0.732</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1996-97</td>\n", " <td>CLE</td>\n", " <td>82</td>\n", " <td>42</td>\n", " <td>40</td>\n", " <td>0.512</td>\n", " <td>41</td>\n", " <td>25</td>\n", " <td>16</td>\n", " <td>0.610</td>\n", " <td>41</td>\n", " <td>17</td>\n", " <td>24</td>\n", " <td>0.415</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " season team games wins losses win_percentage home_games home_wins \\\n", "0 1996-97 ATL 82 56 26 0.683 41 36 \n", "1 1996-97 BKN 82 26 56 0.317 41 16 \n", "2 1996-97 BOS 82 15 67 0.183 41 11 \n", "3 1996-97 CHI 82 69 13 0.841 41 39 \n", "4 1996-97 CLE 82 42 40 0.512 41 25 \n", "\n", " home_losses home_win_percentage away_games away_wins away_losses \\\n", "0 5 0.878 41 20 21 \n", "1 25 0.390 41 10 31 \n", "2 30 0.268 41 4 37 \n", "3 2 0.951 41 30 11 \n", "4 16 0.610 41 17 24 \n", "\n", " away_win_percentage \n", "0 0.488 \n", "1 0.244 \n", "2 0.098 \n", "3 0.732 \n", "4 0.415 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wl = win_loss_information(reg, seasons, teams)\n", "wl.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have data for season-by-season win percentages (overall, home and away) for each NBA team. As a reminder, we are using the current NBA team name for historical data, to keep track of team moves and name changes. Please see [the original post on scraping the data](http://practicallypredictable.com/2017/12/21/web-scraping-nba-team-matchups-box-scores/) for more information.\n", "\n", "Since it took some work to create this view of the data, let's save it to a CSV file. 
That way, we can easily use the data in different analyses." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "csvfile = DATA_DIR.joinpath('stats_nba_com-team_records-1996_97-2016_17.csv')\n", "wl.to_csv(csvfile, index=False, float_format='%g')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyzing the Win Percentages\n", "\n", "Now let's get a high level overview of the win percentage and home win percentage, for each team/season pair." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>win_percentage</th>\n", " <th>home_win_percentage</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>622.000</td>\n", " <td>622.000</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>0.500</td>\n", " <td>0.598</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>0.155</td>\n", " <td>0.170</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>0.106</td>\n", " <td>0.121</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>0.390</td>\n", " <td>0.488</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " <td>0.512</td>\n", " <td>0.610</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>0.610</td>\n", " <td>0.732</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>0.890</td>\n", " <td>0.976</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " win_percentage home_win_percentage\n", "count 622.000 622.000\n", "mean 0.500 0.598\n", "std 0.155 0.170\n", "min 0.106 0.121\n", "25% 0.390 0.488\n", "50% 
0.512 0.610\n", "75% 0.610 0.732\n", "max 0.890 0.976" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hca = wl[['season', 'team', 'win_percentage', 'home_win_percentage']].copy()\n", "hca.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 622 distinct team/season pairs. For some of these 21 seasons, there were only 29 NBA teams (prior to the Charlotte expansion).\n", "\n", "As required by basic arithmetic, the average NBA team has a 50% win percentage. And we see the same 59.8% home team win percentage from our prior analysis. What's interesting in this table, though, is the distribution information.\n", "\n", "For example, we see that the win percentage has ranged from just over 10% to as high as 89%, and that 50% of the win percentages are between 39% and 61%. To see that, look at the \"25%\" and \"75%\" rows in the table above. This table shows the [quartiles](https://en.wikipedia.org/wiki/Quartile) of the win percentage distribution. Quartiles are a particular example of [quantiles](https://en.wikipedia.org/wiki/Quantile).\n", "\n", "The 25% and 75% quantiles of the home win percentage distribution are roughly 49% and 73%. Each is roughly 10 percentage points higher than the corresponding quantile of the overall win percentage. So, how correct is it to simply add 10 percentage points to any NBA team's win percentage to predict their home win percentage? Or conversely, how correct is it to subtract 10 percentage points from any NBA team's home win percentage to predict their overall season win percentage?\n", "\n", "Before we answer that question, let's look at how these percentages vary over time, by season.\n", "\n", "### Variation by Season\n", "\n", "For simplicity, we'll just look at the average win percentage and home win percentage by season. You could also do this analysis with the [median](https://en.wikipedia.org/wiki/Median) (the 50% quantile) or some other statistic." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>home_win_percentage</th>\n", " <th>win_percentage</th>\n", " </tr>\n", " <tr>\n", " <th>season</th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1996-97</th>\n", " <td>0.575</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>1997-98</th>\n", " <td>0.595</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>1998-99</th>\n", " <td>0.623</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>1999-00</th>\n", " <td>0.611</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2000-01</th>\n", " <td>0.598</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2001-02</th>\n", " <td>0.591</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2002-03</th>\n", " <td>0.628</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2003-04</th>\n", " <td>0.614</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2004-05</th>\n", " <td>0.605</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2005-06</th>\n", " <td>0.603</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2006-07</th>\n", " <td>0.591</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2007-08</th>\n", " <td>0.601</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2008-09</th>\n", " <td>0.608</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2009-10</th>\n", " <td>0.594</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2010-11</th>\n", " <td>0.604</td>\n", " <td>0.500</td>\n", " </tr>\n", " 
<tr>\n", " <th>2011-12</th>\n", " <td>0.586</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2012-13</th>\n", " <td>0.612</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2013-14</th>\n", " <td>0.580</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2014-15</th>\n", " <td>0.575</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2015-16</th>\n", " <td>0.589</td>\n", " <td>0.500</td>\n", " </tr>\n", " <tr>\n", " <th>2016-17</th>\n", " <td>0.584</td>\n", " <td>0.500</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " home_win_percentage win_percentage\n", "season \n", "1996-97 0.575 0.500\n", "1997-98 0.595 0.500\n", "1998-99 0.623 0.500\n", "1999-00 0.611 0.500\n", "2000-01 0.598 0.500\n", "2001-02 0.591 0.500\n", "2002-03 0.628 0.500\n", "2003-04 0.614 0.500\n", "2004-05 0.605 0.500\n", "2005-06 0.603 0.500\n", "2006-07 0.591 0.500\n", "2007-08 0.601 0.500\n", "2008-09 0.608 0.500\n", "2009-10 0.594 0.500\n", "2010-11 0.604 0.500\n", "2011-12 0.586 0.500\n", "2012-13 0.612 0.500\n", "2013-14 0.580 0.500\n", "2014-15 0.575 0.500\n", "2015-16 0.589 0.500\n", "2016-17 0.584 0.500" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hca.groupby('season')[['home_win_percentage', 'win_percentage']].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The average win percentage isn't too interesting, since it's always 50%. The home win percentage fluctuates by season as we saw in our [preliminary analysis](http://nbviewer.jupyter.org/github/practicallypredictable/posts/blob/master/notebooks/nba_home_court-part1.ipynb).\n", "\n", "### Ranking by Quartile\n", "\n", "Let's make this analysis a little more interesting. Let's organize the data by quartile, so we can see if there is any useful pattern for weak, mediocre, good and elite teams.\n", "\n", "We can use `pandas` to compute each team's percentile rank within its season, which is the first step toward assigning quartiles." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>home_win_percentage</th>\n", " <th>win_percentage</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.931</td>\n", " <td>0.776</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.241</td>\n", " <td>0.241</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.086</td>\n", " <td>0.069</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.000</td>\n", " <td>1.000</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.534</td>\n", " <td>0.552</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " home_win_percentage win_percentage\n", "0 0.931 0.776\n", "1 0.241 0.241\n", "2 0.086 0.069\n", "3 1.000 1.000\n", "4 0.534 0.552" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hca_ranks = hca.groupby('season').rank(numeric_only=True, pct=True)\n", "hca_ranks.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, all we need to do is to label our data with the quartile number. For every team/season win percentage and home win percentage, we can just look up in the above table what quartile that team is for that season.\n", "\n", "Fortunately, `pandas` lets us do this in just a few lines of Python." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def quartile_index(x):\n", "    return pd.qcut(x, q=4, labels=False)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def get_quartiles(df):\n", "    df_q = df.groupby('season').rank(numeric_only=True, pct=True).apply(quartile_index)\n", "    # Cast the quartile labels (0 = bottom, 3 = top) to plain integers\n", "    return df_q.astype(int)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>home_win_percentage</th>\n", " <th>win_percentage</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>3</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>3</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2</td>\n", " <td>2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " home_win_percentage win_percentage\n", "0 3 3\n", "1 0 0\n", "2 0 0\n", "3 3 3\n", "4 2 2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_quartiles(hca).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a moment to reflect on the above example. What we did was apply a function (to look up the quartile number) to our original `DataFrame`, and create a new `DataFrame` with those quartile values instead of the original data. 
This table only displays the first 5 lines of the data set, which comprise 622 team/season pairs. Let's put all this in a function to return the full 622-row table." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def get_hca_quartile(hca, col, quartile):\n", " df = get_quartiles(hca)\n", " return df[df[col] == quartile]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### An Interesting Figure\n", "\n", "Since a picture is worth a thousand words, let's make a picture. We can use different colors to signify the different quartiles. Let's see the `seaborn` color palette." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZQAAABECAYAAACmjMM7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAAZ9JREFUeJzt2jFKXFEAhtE7MtvIBlKb1iIhCLoIS+sphFnCQArrlC4iQhAt0sbaDWQbgZdeBkXyPZ7oOeW9PPi7Dy5vNU3TAID/dbD0AADeBkEBICEoACQEBYCEoACQWD9z7xcwAPZZPT54Lijjz+nhPFNegQ8/7senb3dLz5jN74vP4/vlw9IzZnO++Ti22+3SM2ax2+3G319nS8+YzfroalzefFl6xmw2X2/HuP659Iz5nBzvPfbkBUBCUABICAoACUEBICEoACQEBYCEoACQEBQAEoICQEJQAEgICgAJQQEgISgAJAQFgISgAJAQFAASggJAQlAASAgKAAlBASAhKAAkBAWAhKAAkBAUABKCAkBCUABICAoACUEBICEoACQEBYCEoACQEBQAEoICQEJQAEgICgAJQQEgISgAJAQFgISgAJAQFAASggJAQlAASAgKAAlBASAhKAAkBAWAhKAAkBAUABKCAkBCUABICAoACUEBICEoACRW0zQ9df/kJQDv1urxwfqlHwDAPp68AEgICgAJQQEgISgAJAQFgMQ/px8ciO76oEoAAAAASUVORK5CYII=\n", "text/plain": [ "<matplotlib.figure.Figure at 0x1a15ac2438>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "colors = sns.color_palette()\n", "sns.palplot(colors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the 4 colors on the right to signify the bottom, third, second and top quartile teams. We need a function to plot the data for each quartile in the correct color. We'll call this function 4 times, once for each quartile. 
The quartile is based on the team's win percentage in that season. The top quartile contains the elite teams in that season, irrespective of home win percentage, and so on for the lower quartiles." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def plot_hca_quartile(ax, hca, x_col, y_col, quartile, label, colors):\n", " lookup_df = get_hca_quartile(hca, y_col, quartile)\n", " data_df = hca.loc[lookup_df.index, :]\n", " ax = sns.regplot(\n", " data=data_df, x=x_col, y=y_col, ax=ax, fit_reg=False,\n", " label=label,\n", " scatter_kws={'alpha': 0.5, 'facecolors': colors[6-quartile]}\n", " )\n", " return ax" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<matplotlib.figure.Figure at 0x1a1704d6a0>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(7, 7))\n", "x_col = 'home_win_percentage'\n", "y_col = 'win_percentage'\n", "ax = plot_hca_quartile(ax, hca, x_col, y_col, 0, 'bottom quartile win % for season', colors)\n", "ax = plot_hca_quartile(ax, hca, x_col, y_col, 1, '3rd quartile win % for season', colors)\n", "ax = plot_hca_quartile(ax, hca, x_col, y_col, 2, '2nd quartile win % for season', colors)\n", "ax = plot_hca_quartile(ax, hca, x_col, y_col, 3, 'top quartile win % for season', colors)\n", "ax.set_xlim(0, 1)\n", "ax.set_xlabel('Team Home Win Percentage')\n", "ax.set_ylim(0, 1)\n", "ax.set_ylabel('Team Win Percentage')\n", "ax.set_title('Team Overall Win Percentage and Home Win Percentage')\n", "ax.legend()\n", "ax.plot(ax.get_xlim(), ax.get_ylim(), linestyle='--', color='black', alpha=0.5)\n", "ax.axhline(y=0.5, linestyle='--', alpha=0.5, color='black')\n", "ax.axvline(x=0.5, linestyle='--', alpha=0.5, color='black')\n", "ax.axvline(x=hca['home_win_percentage'].mean(), alpha=0.5, color='black')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The figure has 4 
lines overlaid to help in the interpretation.\n", "\n", "The horizontal dashed line separates team/season pairs with overall winning records in that season (above the line) from those with losing records (below it).\n", "\n", "The vertical dashed line separates team/season pairs with better-than-even home win percentages in that season (to the right) from those with losing home records (to the left). As you would expect, most of the observations fall to the right of this vertical line.\n", "\n", "There are some bottom-quartile teams with winning home records, and there are some third-quartile teams with losing home records.\n", "\n", "As a curiosity, look at the team sticking out on the lower-right. This team had a better than 65% home win record, but was sub-40% overall. Who was it? The 2002-03 Chicago Bulls, with their [franchise-worst 3-38 road record](https://en.wikipedia.org/wiki/2002%E2%80%9303_Chicago_Bulls_season). They got 27 of their 30 wins at home." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>season</th>\n", " <th>team</th>\n", " <th>win_percentage</th>\n", " <th>home_win_percentage</th>\n", " <th>wins</th>\n", " <th>losses</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>177</th>\n", " <td>2002-03</td>\n", " <td>CHI</td>\n", " <td>0.366</td>\n", " <td>0.659</td>\n", " <td>30</td>\n", " <td>52</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " season team win_percentage home_win_percentage wins losses\n", "177 2002-03 CHI 0.366 0.659 30 52" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wl.loc[\n", " 
(wl['win_percentage'] < 0.38) & (wl['home_win_percentage'] > 0.62),\n", " ['season', 'team', 'win_percentage', 'home_win_percentage', 'wins', 'losses']\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vertical solid line shows the location of the home win percentage average for all teams and seasons (59.8%).\n", "\n", "The diagonal line has slope 1, and is where the team win percentage equals the home win percentage. In other words, _teams on this line had no observable home court advantage_ in that particular season.\n", "\n", "In fact, we see that there are a few teams which had _worse_ home records than overall records. Here is a list of the 11 times where that occurred." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>season</th>\n", " <th>team</th>\n", " <th>win_percentage</th>\n", " <th>home_win_percentage</th>\n", " <th>wins</th>\n", " <th>losses</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>14</th>\n", " <td>1996-97</td>\n", " <td>MIA</td>\n", " <td>0.744</td>\n", " <td>0.707</td>\n", " <td>61</td>\n", " <td>21</td>\n", " </tr>\n", " <tr>\n", " <th>162</th>\n", " <td>2001-02</td>\n", " <td>NOP</td>\n", " <td>0.537</td>\n", " <td>0.512</td>\n", " <td>44</td>\n", " <td>38</td>\n", " </tr>\n", " <tr>\n", " <th>272</th>\n", " <td>2005-06</td>\n", " <td>HOU</td>\n", " <td>0.415</td>\n", " <td>0.366</td>\n", " <td>34</td>\n", " <td>48</td>\n", " </tr>\n", " <tr>\n", " <th>300</th>\n", " <td>2006-07</td>\n", " <td>DET</td>\n", " <td>0.646</td>\n", " <td>0.634</td>\n", " 
<td>53</td>\n", " <td>29</td>\n", " </tr>\n", " <tr>\n", " <th>343</th>\n", " <td>2007-08</td>\n", " <td>ORL</td>\n", " <td>0.634</td>\n", " <td>0.610</td>\n", " <td>52</td>\n", " <td>30</td>\n", " </tr>\n", " <tr>\n", " <th>369</th>\n", " <td>2008-09</td>\n", " <td>MIN</td>\n", " <td>0.293</td>\n", " <td>0.268</td>\n", " <td>24</td>\n", " <td>58</td>\n", " </tr>\n", " <tr>\n", " <th>384</th>\n", " <td>2009-10</td>\n", " <td>BOS</td>\n", " <td>0.610</td>\n", " <td>0.585</td>\n", " <td>50</td>\n", " <td>32</td>\n", " </tr>\n", " <tr>\n", " <th>404</th>\n", " <td>2009-10</td>\n", " <td>PHI</td>\n", " <td>0.329</td>\n", " <td>0.293</td>\n", " <td>27</td>\n", " <td>55</td>\n", " </tr>\n", " <tr>\n", " <th>437</th>\n", " <td>2010-11</td>\n", " <td>SAC</td>\n", " <td>0.293</td>\n", " <td>0.268</td>\n", " <td>24</td>\n", " <td>58</td>\n", " </tr>\n", " <tr>\n", " <th>443</th>\n", " <td>2011-12</td>\n", " <td>BKN</td>\n", " <td>0.333</td>\n", " <td>0.273</td>\n", " <td>22</td>\n", " <td>44</td>\n", " </tr>\n", " <tr>\n", " <th>579</th>\n", " <td>2015-16</td>\n", " <td>MIN</td>\n", " <td>0.354</td>\n", " <td>0.341</td>\n", " <td>29</td>\n", " <td>53</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " season team win_percentage home_win_percentage wins losses\n", "14 1996-97 MIA 0.744 0.707 61 21\n", "162 2001-02 NOP 0.537 0.512 44 38\n", "272 2005-06 HOU 0.415 0.366 34 48\n", "300 2006-07 DET 0.646 0.634 53 29\n", "343 2007-08 ORL 0.634 0.610 52 30\n", "369 2008-09 MIN 0.293 0.268 24 58\n", "384 2009-10 BOS 0.610 0.585 50 32\n", "404 2009-10 PHI 0.329 0.293 27 55\n", "437 2010-11 SAC 0.293 0.268 24 58\n", "443 2011-12 BKN 0.333 0.273 22 44\n", "579 2015-16 MIN 0.354 0.341 29 53" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wl.loc[\n", " wl['win_percentage'] > wl['home_win_percentage'],\n", " ['season', 'team', 'win_percentage', 'home_win_percentage', 'wins', 'losses']\n", "]" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "### Conclusion\n", "\n", "There is clearly a strong association between a team's home win percentage and the team's overall win percentage. The 10% extra home win probability rule of thumb is reasonable for an average team. However, there is a lot of variation around the averages.\n", "\n", "Moving along the horizontal line, we see that a team with an average win percentage could have anywhere from a below-average (50%) to a good (70%) home win percentage.\n", "\n", "Similarly, moving along the dashed vertical line, we see that a team with a below-average 50% home win percentage could either have an average season (50% overall) or a very poor season." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sports_py36]", "language": "python", "name": "conda-env-sports_py36-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }