{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NCAA March Madness\n", "\n", "### Preliminary Analysis of First Round Upsets\n", "\n", "The objective of the [Statsketball Tournament 2018](http://thisisstatistics.org/home-2/statsketball/) [Upset Challenge](http://thisisstatistics.org/home-2/statsketball/statsketball-guidelines/) is to predict the 32 winners of the First Round in the [2018 NCAA Men's Division I Basketball Tournament](https://www.ncaa.com/march-madness).\n", "\n", "To do well in this contest, you need to decide how many upsets to pick in the first round. After all, the purpose of the seeding is to make it more likely that the higher-seeded teams make it to the later rounds. If the tournament committee does its job well, the higher-seeded teams should be measurably better than their first round opponents.\n", "\n", "Picking upsets is also important for filling out a regular March Madness bracket. To avoid having upsets bust your bracket, it can help to have a sense of a reasonable number of upsets to pick, and of which seeds are more likely to fall early in the tournament.\n", "\n", "One reasonable way to assess the likelihood of first round upsets is to see how many have occurred historically. This notebook examines all first round NCAA tournament results from 1985 through 2017 to see how the higher seeds have fared." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "from matplotlib.colors import rgb2hex\n", "import seaborn as sns\n", "sns.set()\n", "sns.set_context('notebook')\n", "plt.style.use('ggplot')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent\n", "DATA_DIR = PROJECT_DIR / 'data' / 'scraped'\n", "DATA_DIR.mkdir(exist_ok=True, parents=True)\n", "OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'\n", "OUTPUT_DIR.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loading the Data\n", "\n", "We previously obtained historical NCAA tournament data from the [Washington Post's NCAA Tournament site](https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/search/) and saved it as a CSV file.\n", "\n", "Let's load in the data and prepare it for analysis." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def load_tournament_games():\n", " filename = 'games-washpost.csv'\n", " csvfile = DATA_DIR.joinpath(filename)\n", " df = pd.read_csv(csvfile)\n", " df.columns = df.columns.str.rstrip()\n", " return df.rename(columns={'year': 'Year', 'round': 'Round',})" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Year</th>\n", " <th>Round</th>\n", " <th>seed1</th>\n", " <th>team1</th>\n", " <th>score1</th>\n", " <th>seed2</th>\n", " <th>team2</th>\n", " <th>score2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2017</td>\n", " <td>1</td>\n", " <td>5</td>\n", " <td>Notre Dame</td>\n", " <td>60</td>\n", " <td>12</td>\n", " <td>Princeton</td>\n", " <td>58</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2017</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>West Virginia</td>\n", " <td>86</td>\n", " <td>13</td>\n", " <td>Bucknell</td>\n", " <td>80</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2017</td>\n", " <td>1</td>\n", " <td>5</td>\n", " <td>Virginia</td>\n", " <td>76</td>\n", " <td>12</td>\n", " <td>UNC Wilmington</td>\n", " <td>71</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2017</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>Florida</td>\n", " <td>80</td>\n", " <td>13</td>\n", " <td>East Tennessee State</td>\n", " <td>65</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2017</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Gonzaga</td>\n", " <td>66</td>\n", " <td>16</td>\n", " 
<td>South Dakota State</td>\n", " <td>46</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Year Round seed1 team1 score1 seed2 team2 \\\n", "0 2017 1 5 Notre Dame 60 12 Princeton \n", "1 2017 1 4 West Virginia 86 13 Bucknell \n", "2 2017 1 5 Virginia 76 12 UNC Wilmington \n", "3 2017 1 4 Florida 80 13 East Tennessee State \n", "4 2017 1 1 Gonzaga 66 16 South Dakota State \n", "\n", " score2 \n", "0 58 \n", "1 80 \n", "2 71 \n", "3 65 \n", "4 46 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "load_tournament_games().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Determining Winners, Losers and Upsets\n", "\n", "We will add several new columns to our `DataFrame` so that we can analyze the upsets in different ways.\n", "\n", "We will use the `pandas` `apply()` method to add columns with the correct information for each row of data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def winner(row):\n", "    if row['score1'] > row['score2']:\n", "        return row['team1']\n", "    else:\n", "        return row['team2']" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def loser(row):\n", "    if row['score1'] > row['score2']:\n", "        return row['team2']\n", "    else:\n", "        return row['team1']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def winner_seed(row):\n", "    if row['score1'] > row['score2']:\n", "        return row['seed1']\n", "    else:\n", "        return row['seed2']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def loser_seed(row):\n", "    if row['score1'] > row['score2']:\n", "        return row['seed2']\n", "    else:\n", "        return row['seed1']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def higher_seed(row):\n", "    # The higher seed is the team with the lower seed number\n", "    return min(row['seed1'], row['seed2'])" ] }, { 
"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def winner_score(row):\n", " return max(row['score1'], row['score2']) " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def loser_score(row):\n", " return min(row['score1'], row['score2']) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can put everything together in one function to load and format the data." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def tournament_games():\n", " df = load_tournament_games()\n", " df['HigherSeed'] = df.apply(higher_seed, axis=1)\n", " df['Winner'] = df.apply(winner, axis=1)\n", " df['Loser'] = df.apply(loser, axis=1)\n", " df['WinnerSeed'] = df.apply(winner_seed, axis=1)\n", " df['LoserSeed'] = df.apply(loser_seed, axis=1)\n", " df['WinnerScore'] = df.apply(winner_score, axis=1)\n", " df['LoserScore'] = df.apply(loser_score, axis=1)\n", " df = df.drop(columns=[\n", " 'seed1',\n", " 'team1',\n", " 'score1',\n", " 'seed2',\n", " 'team2',\n", " 'score2',\n", " ])\n", " winner_cols = [col for col in df.columns if 'Winner' in col]\n", " loser_cols = [col for col in df.columns if 'Loser' in col]\n", " cols = ['Year', 'Round', 'HigherSeed'] + winner_cols + loser_cols\n", " return df[cols].sort_values(['Year', 'Round']).reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2079, 9)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games = tournament_games()\n", "games.shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: 
right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Year</th>\n", " <th>Round</th>\n", " <th>HigherSeed</th>\n", " <th>Winner</th>\n", " <th>WinnerSeed</th>\n", " <th>WinnerScore</th>\n", " <th>Loser</th>\n", " <th>LoserSeed</th>\n", " <th>LoserScore</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1985</td>\n", " <td>1</td>\n", " <td>6</td>\n", " <td>Boston College</td>\n", " <td>11</td>\n", " <td>55</td>\n", " <td>Texas Tech</td>\n", " <td>6</td>\n", " <td>53</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1985</td>\n", " <td>1</td>\n", " <td>7</td>\n", " <td>Alabama</td>\n", " <td>7</td>\n", " <td>50</td>\n", " <td>Arizona</td>\n", " <td>10</td>\n", " <td>41</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1985</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>Virginia Commonwealth</td>\n", " <td>2</td>\n", " <td>81</td>\n", " <td>Marshall</td>\n", " <td>15</td>\n", " <td>65</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1985</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Illinois</td>\n", " <td>3</td>\n", " <td>76</td>\n", " <td>Northeastern</td>\n", " <td>14</td>\n", " <td>57</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1985</td>\n", " <td>1</td>\n", " <td>6</td>\n", " <td>Georgia</td>\n", " <td>6</td>\n", " <td>67</td>\n", " <td>Wichita State</td>\n", " <td>11</td>\n", " <td>59</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Year Round HigherSeed Winner WinnerSeed WinnerScore \\\n", "0 1985 1 6 Boston College 11 55 \n", "1 1985 1 7 Alabama 7 50 \n", "2 1985 1 2 Virginia Commonwealth 2 81 \n", "3 1985 1 3 Illinois 3 76 \n", "4 1985 1 6 Georgia 6 67 \n", "\n", " Loser LoserSeed LoserScore \n", "0 Texas Tech 6 53 \n", "1 Arizona 10 41 \n", "2 Marshall 15 65 \n", "3 Northeastern 14 57 \n", "4 Wichita State 11 59 " ] }, "execution_count": 16, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "games.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This `DataFrame` is useful for analyzing later-round games as well, so let's save it for future analysis." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "filename = 'game_history-washpost-1985_2017.csv'\n", "csvfile = OUTPUT_DIR.joinpath(filename)\n", "games.to_csv(csvfile, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### First Round Results\n", "\n", "We will focus only on the first round in this notebook." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1056" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round1 = games[games['Round'] == 1].copy()\n", "len(round1)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "33" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(round1['Year'].unique())" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "32.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1056/33" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have 33 years of NCAA Tournament data, with 32 games played in the first round each year.\n", "\n", "Now let's see what proportion of first round games each seed has won in the first round. Remember, there are 4 regions, with 16 seeds in each region." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 1.000000\n", "2 0.939394\n", "3 0.840909\n", "4 0.803030\n", "5 0.643939\n", "6 0.628788\n", "7 0.613636\n", "8 0.507576\n", "9 0.492424\n", "10 0.386364\n", "11 0.371212\n", "12 0.356061\n", "13 0.196970\n", "14 0.159091\n", "15 0.060606\n", "Name: WinnerSeed, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round1['WinnerSeed'].value_counts() / (4*33)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you [look at Wikipedia](https://en.wikipedia.org/wiki/NCAA_Division_I_Men%27s_Basketball_Tournament#First_and_Second_Rounds), you'll see identical frequency data for first round victories as of the time I wrote this.\n", "\n", "We can turn these into \"upset\" frequencies by subtracting them from 1. For seeds 2 through 8, the result is the fraction of first round games in which the higher seed lost." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 0.000000\n", "2 0.060606\n", "3 0.159091\n", "4 0.196970\n", "5 0.356061\n", "6 0.371212\n", "7 0.386364\n", "8 0.492424\n", "9 0.507576\n", "10 0.613636\n", "11 0.628788\n", "12 0.643939\n", "13 0.803030\n", "14 0.840909\n", "15 0.939394\n", "Name: WinnerSeed, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 - round1['WinnerSeed'].value_counts() / (4*33)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<matplotlib.figure.Figure at 0x10d56b518>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(9, 6))\n", "ax = (\n", "    pd.DataFrame(round1['WinnerSeed'].value_counts() / (4*33))[1:15]\n", "    .rename(columns={'WinnerSeed': 'Seed'})\n", "    .plot(kind='barh', ax=ax)\n", ")\n", "ax.set_title('First Round Win Percentage for Seeds 2-15')\n", "ax.set_xticks(np.arange(0, 1, 0.1))\n", "ax.axvline(0.5, 
linestyle='--', color='grey')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Don't Pick Sixteenth Seeds\n", "\n", "Notice that in the 1985-2017 data a first seed has never lost in the first round, which is why the sixteenth seed doesn't appear in the above results. We will therefore exclude the 1-16 matchups from further analysis.\n", "\n", "Of course, if the NCAA Tournament continues with the same format for another 30, 50 or 100 years, eventually a sixteenth seed will advance. But unless you think the tournament committee has made a huge mistake in the seeding, don't pick the sixteenth seed to advance.\n", "\n", "#### Eighth and Ninth Seeds\n", "\n", "At the other end of the spectrum, notice that the eighth and ninth seed first round win frequencies are both roughly 50%.\n", "\n", "This means that the 8-9 matchups look like coin tosses in the historical data, with only a slight bias in favor of the eighth seed (about 50.8% versus 49.2%).\n", "\n", "#### Upsets by Seed\n", "\n", "Let's filter out the 1-16 matchups and define an upset as any game in which the higher seed loses. Of course, it's not really correct to view a ninth seed victory as an upset, since those games historically look like toss-ups." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def filter_upsets(row):\n", "    # An upset: the higher seed lost, for higher seeds 2-8 only (1-16 games are excluded)\n", "    higher_seed_lost = row['HigherSeed'] == row['LoserSeed']\n", "    return bool(higher_seed_lost and 2 <= row['HigherSeed'] <= 8)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "round1['Upset'] = round1.apply(filter_upsets, axis=1)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "upsets = round1[round1['Upset']]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "HigherSeed\n", "2 8\n", "3 21\n", "4 26\n", "5 47\n", "6 49\n", "7 51\n", "8 65\n", "Name: Year, dtype: int64" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upsets.groupby(['HigherSeed'])['Year'].count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Upsets per Year\n", "\n", "The above counts give the total number of upsets over the 33-year historical period, grouped by seed. Let's look at the historical data grouped by year instead." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 33.000000\n", "mean 8.090909\n", "std 2.454125\n", "min 3.000000\n", "25% 7.000000\n", "50% 8.000000\n", "75% 10.000000\n", "max 13.000000\n", "dtype: float64" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upsets.groupby(['Year']).size().describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, on average there are about 8 upsets per year (including ninth seed wins over eighth seeds), with as few as 3 and as many as 13 in a single year. However, what we really want is the distribution of upsets by seed within a given year." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "upsets_by_year = upsets.pivot_table(\n", " index='Year',\n", " columns='HigherSeed',\n", " values='WinnerSeed', aggfunc='count'\n", ").fillna(0)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>HigherSeed</th>\n", " <th>2</th>\n", " <th>3</th>\n", " <th>4</th>\n", " <th>5</th>\n", " <th>6</th>\n", " <th>7</th>\n", " <th>8</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>count</th>\n", " <td>33.000000</td>\n", " <td>33.000000</td>\n", " <td>33.000000</td>\n", " <td>33.000000</td>\n", " <td>33.000000</td>\n", " <td>33.000000</td>\n", " <td>33.000000</td>\n", " </tr>\n", " <tr>\n", " <th>mean</th>\n", " <td>0.242424</td>\n", " <td>0.636364</td>\n", " <td>0.787879</td>\n", " <td>1.424242</td>\n", " <td>1.484848</td>\n", " <td>1.545455</td>\n", " <td>1.969697</td>\n", " </tr>\n", " <tr>\n", " <th>std</th>\n", " <td>0.501890</td>\n", " <td>0.652791</td>\n", " <td>0.599874</td>\n", " <td>0.867118</td>\n", " <td>1.003781</td>\n", " <td>0.904534</td>\n", " <td>1.185455</td>\n", " </tr>\n", " <tr>\n", " <th>min</th>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>25%</th>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " </tr>\n", " <tr>\n", " <th>50%</th>\n", " 
<td>0.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " <td>2.000000</td>\n", " </tr>\n", " <tr>\n", " <th>75%</th>\n", " <td>0.000000</td>\n", " <td>1.000000</td>\n", " <td>1.000000</td>\n", " <td>2.000000</td>\n", " <td>2.000000</td>\n", " <td>2.000000</td>\n", " <td>3.000000</td>\n", " </tr>\n", " <tr>\n", " <th>max</th>\n", " <td>2.000000</td>\n", " <td>2.000000</td>\n", " <td>2.000000</td>\n", " <td>3.000000</td>\n", " <td>4.000000</td>\n", " <td>4.000000</td>\n", " <td>4.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "HigherSeed 2 3 4 5 6 7 \\\n", "count 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 \n", "mean 0.242424 0.636364 0.787879 1.424242 1.484848 1.545455 \n", "std 0.501890 0.652791 0.599874 0.867118 1.003781 0.904534 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 \n", "50% 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "75% 0.000000 1.000000 1.000000 2.000000 2.000000 2.000000 \n", "max 2.000000 2.000000 2.000000 3.000000 4.000000 4.000000 \n", "\n", "HigherSeed 8 \n", "count 33.000000 \n", "mean 1.969697 \n", "std 1.185455 \n", "min 0.000000 \n", "25% 1.000000 \n", "50% 2.000000 \n", "75% 3.000000 \n", "max 4.000000 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upsets_by_year.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check that the above pivot table ties back to the original win frequency data. If we sum the number of upsets for each higher seed and divide by the number of first round games that seed has played (4 regions × 33 years = 132), we should get back the upset frequencies we computed above." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "HigherSeed\n", "2 0.060606\n", "3 0.159091\n", "4 0.196970\n", "5 0.356061\n", "6 0.371212\n", "7 0.386364\n", "8 0.492424\n", "dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upsets_by_year.sum() / (4*33)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since a picture is worth a thousand words, let's visualize the number of first round upsets by seed, by year." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<matplotlib.figure.Figure at 0x1128ab4e0>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(10, 7))\n", "upsets_by_year.plot(ax=ax, kind='bar', stacked=True)\n", "ax.set_ylim(0, 15)\n", "ax.set_xlabel('Year')\n", "ax.set_ylabel('First Round Upsets (No. of Games)')\n", "ax.set_title('First Round NCAA Tournament Losses by Seeds 2-8')\n", "ax.legend(title='Higher Seed')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The chart makes it easy to see how much the upsets vary from year to year, both in total and by seed.\n", "\n", "We can also summarize the above data in a small table." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " <th>4</th>\n", " </tr>\n", " <tr>\n", " <th>HigherSeed</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>26</td>\n", " <td>6</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>15</td>\n", " <td>15</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>10</td>\n", " <td>20</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>10</td>\n", " <td>4</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>5</td>\n", " <td>13</td>\n", " <td>10</td>\n", " <td>4</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>3</td>\n", " <td>14</td>\n", " <td>12</td>\n", " <td>3</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>3</td>\n", " <td>10</td>\n", " <td>9</td>\n", " <td>7</td>\n", " <td>4</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3 4\n", "HigherSeed \n", "2 26 6 1 0 0\n", "3 15 15 3 0 0\n", "4 10 20 3 0 0\n", "5 4 15 10 4 0\n", "6 5 13 10 4 1\n", "7 3 14 12 3 1\n", "8 3 10 9 7 4" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upset_freq = 
upsets_by_year.apply(pd.value_counts).fillna(0).astype(int).reset_index(drop=True).transpose()\n", "upset_freq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This frequency table will be useful for making specific decisions about how many first round upsets to pick.\n", "\n", "Since each row sums to 33, we can divide each value by 33 to estimate a probability distribution for the number of upsets by seed per year." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "HigherSeed\n", "2 33\n", "3 33\n", "4 33\n", "5 33\n", "6 33\n", "7 33\n", "8 33\n", "dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upset_freq.sum(axis=1)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " <th>4</th>\n", " </tr>\n", " <tr>\n", " <th>HigherSeed</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2</th>\n", " <td>0.787879</td>\n", " <td>0.181818</td>\n", " <td>0.030303</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.454545</td>\n", " <td>0.454545</td>\n", " <td>0.090909</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.303030</td>\n", " <td>0.606061</td>\n", " <td>0.090909</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " 
<th>5</th>\n", " <td>0.121212</td>\n", " <td>0.454545</td>\n", " <td>0.303030</td>\n", " <td>0.121212</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>0.151515</td>\n", " <td>0.393939</td>\n", " <td>0.303030</td>\n", " <td>0.121212</td>\n", " <td>0.030303</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>0.090909</td>\n", " <td>0.424242</td>\n", " <td>0.363636</td>\n", " <td>0.090909</td>\n", " <td>0.030303</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>0.090909</td>\n", " <td>0.303030</td>\n", " <td>0.272727</td>\n", " <td>0.212121</td>\n", " <td>0.121212</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3 4\n", "HigherSeed \n", "2 0.787879 0.181818 0.030303 0.000000 0.000000\n", "3 0.454545 0.454545 0.090909 0.000000 0.000000\n", "4 0.303030 0.606061 0.090909 0.000000 0.000000\n", "5 0.121212 0.454545 0.303030 0.121212 0.000000\n", "6 0.151515 0.393939 0.303030 0.121212 0.030303\n", "7 0.090909 0.424242 0.363636 0.090909 0.030303\n", "8 0.090909 0.303030 0.272727 0.212121 0.121212" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "upset_freq / 33" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How Many Upsets to Pick?\n", "\n", "In summary, here are some preliminary conclusions based upon the above data and analysis. We'll try to develop a more precise framework in a later post.\n", "\n", "#### Second Seed Upsets\n", "\n", "Although a second seed does occasionally lose in the first round (and in 2012, two second seeds lost), it's rare. Don't pick the fifteenth seed to advance, unless you have a very strong view that the tournament committee has made a mistake in either the second or the fifteenth seeds. If you do, however, the historical data say you're unlikely to be correct.\n", "\n", "#### Third and Fourth Seed Upsets\n", "\n", "Things are more promising for calling upsets in the third and fourth seed games. 
On average, there are about 1.4 upsets among these 8 first round games in a given year (roughly 0.64 for the third seeds plus 0.79 for the fourth seeds). There is often one upset in each of the third and fourth seed games, but there are rarely more than 2 upsets overall in these 8 games.\n", "\n", "The data suggest that you should pick an upset among the third and fourth seeds as a group. You should also try hard to identify a second upset at the seed you didn't pick the first time. In other words, if you already picked a third seed upset, try to pick a fourth seed upset, and vice versa.\n", "\n", "#### Fifth Seed Upsets\n", "\n", "Most tournaments (29 of 33) have featured at least one fifth seed upset in the first round, and 14 have featured two or more. You should definitely look to pick one upset. It's probably reasonable to pick a second upset in this category if you have strong views about the matchup.\n", "\n", "#### Sixth and Seventh Seed Upsets\n", "\n", "The data tell a similar story for the sixth and seventh seed games. You should try to pick one sixth seed and one seventh seed upset in the first round. If you have strong views about particular matchups, it's reasonable to look for additional potential upsets. However, you should keep in mind that the overall number of upsets (excluding the 8-9 games) rarely exceeds 8 in a given year.\n", "\n", "In summary, among the second through seventh seeds, the historical data suggest you should aim to pick 5 or 6 upsets, and venture beyond that only if you have high conviction about a few additional games.\n", "\n", "#### The 8-9 Matchups\n", "\n", "As mentioned above, a ninth seed victory isn't really an upset. In 3 of the 33 years all four eighth seeds advanced, and in 4 of them all four ninth seeds advanced.\n", "\n", "For regular bracket selection, your goal is just to get as many teams as possible from your bracket into the second round. For that purpose, you should analyze the 8-9 games strictly on the merits of the matchups. 
In contrast, the Statsketball Upset Challenge awards bonus points for correctly picking an upset, defined as the lower seed beating the higher seed. With the possibility of bonus points, you have somewhat greater incentive to pick the ninth seed, even if the game is a true toss-up. We'll study the impact of Upset Challenge bonus points in a future post." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sports_py36]", "language": "python", "name": "conda-env-sports_py36-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }