{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NCAA March Madness\n", "\n", "### Web Scraping Pomeroy College Basketball Ratings\n", "\n", "This short notebook shows how to scrape historical ratings from Ken Pomeroy's (KenPom) [college basketball ratings](https://kenpom.com/) site.\n", "\n", "KenPom has ratings for NCAA men's basketball teams going back to 2002." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pracpred.scrape as pps" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm_notebook" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent\n", "DATA_DIR = PROJECT_DIR / 'data' / 'scraped'\n", "DATA_DIR.mkdir(exist_ok=True, parents=True)\n", "OUTPUT_DIR = PROJECT_DIR / 'data' / 'kenpom'\n", "OUTPUT_DIR.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by scraping 2017 data. We can use the HTML table scraping functionality in our `pracpred` package.\n", "\n", "You can find information on the `pracpred` package [here](https://github.com/practicallypredictable/pracpred) and [here](https://pypi.python.org/pypi/pracpred). You can install the package in your sports analytics environment by running the command `pip install pracpred` in Terminal (Mac or Linux) or Windows Anaconda Prompt." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "url = 'https://kenpom.com/index.php?y=2017'" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables = pps.HTMLTables(url)\n", "len(tables)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " <th>4</th>\n", " <th>5</th>\n", " <th>6</th>\n", " <th>7</th>\n", " <th>8</th>\n", " <th>9</th>\n", " <th>...</th>\n", " <th>11</th>\n", " <th>12</th>\n", " <th>13</th>\n", " <th>14</th>\n", " <th>15</th>\n", " <th>16</th>\n", " <th>17</th>\n", " <th>18</th>\n", " <th>19</th>\n", " <th>20</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td></td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>...</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>Strength of Schedule</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NCSOS</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Rank</td>\n", " <td>Team</td>\n", " <td>Conf</td>\n", " <td>W-L</td>\n", " <td>AdjEM</td>\n", " <td>AdjO</td>\n", " <td>NaN</td>\n", " <td>AdjD</td>\n", " <td>NaN</td>\n", " <td>AdjT</td>\n", " <td>...</td>\n", " <td>Luck</td>\n", " <td>NaN</td>\n", " <td>AdjEM</td>\n", " <td>NaN</td>\n", " <td>OppO</td>\n", " <td>NaN</td>\n", " <td>OppD</td>\n", " <td>NaN</td>\n", " <td>AdjEM</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1</td>\n", " <td>Gonzaga 1</td>\n", " <td>WCC</td>\n", " <td>37-2</td>\n", " <td>+32.05</td>\n", " <td>118.4</td>\n", " <td>16</td>\n", " <td>86.3</td>\n", " <td>1</td>\n", " <td>70.1</td>\n", " <td>...</td>\n", " <td>+.020</td>\n", " <td>133</td>\n", " <td>+2.99</td>\n", " <td>89</td>\n", " <td>106.2</td>\n", " <td>84</td>\n", " <td>103.3</td>\n", " <td>105</td>\n", " <td>+1.01</td>\n", " <td>127</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2</td>\n", " <td>Villanova 1</td>\n", " <td>BE</td>\n", " <td>32-4</td>\n", " <td>+29.88</td>\n", " <td>122.4</td>\n", " <td>3</td>\n", " <td>92.5</td>\n", " <td>12</td>\n", " <td>64.1</td>\n", " <td>...</td>\n", " <td>+.010</td>\n", " <td>166</td>\n", " <td>+9.33</td>\n", " <td>33</td>\n", " <td>109.8</td>\n", " <td>38</td>\n", " <td>100.5</td>\n", " <td>32</td>\n", " <td>+3.55</td>\n", " <td>61</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>3</td>\n", " <td>North Carolina 1</td>\n", " <td>ACC</td>\n", " <td>33-7</td>\n", " <td>+28.22</td>\n", " <td>120.7</td>\n", " <td>9</td>\n", " <td>92.5</td>\n", " <td>11</td>\n", " <td>71.3</td>\n", " <td>...</td>\n", " <td>+.037</td>\n", " <td>85</td>\n", " <td>+12.49</td>\n", " <td>6</td>\n", " <td>112.0</td>\n", " <td>4</td>\n", " <td>99.5</td>\n", " <td>19</td>\n", " <td>+3.87</td>\n", " <td>53</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 21 columns</p>\n", "</div>" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 \\\n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 Rank Team Conf W-L AdjEM AdjO NaN AdjD NaN AdjT \n", "2 1 Gonzaga 1 WCC 37-2 +32.05 118.4 16 86.3 1 70.1 \n", "3 2 Villanova 1 BE 32-4 +29.88 122.4 3 92.5 12 64.1 \n", "4 3 North Carolina 1 ACC 33-7 +28.22 120.7 9 92.5 11 71.3 \n", "\n", " ... 11 12 13 14 15 16 17 18 19 \\\n", "0 ... NaN NaN Strength of Schedule NaN NaN NaN NaN NaN NCSOS \n", "1 ... Luck NaN AdjEM NaN OppO NaN OppD NaN AdjEM \n", "2 ... +.020 133 +2.99 89 106.2 84 103.3 105 +1.01 \n", "3 ... +.010 166 +9.33 33 109.8 38 100.5 32 +3.55 \n", "4 ... +.037 85 +12.49 6 112.0 4 99.5 19 +3.87 \n", "\n", " 20 \n", "0 NaN \n", "1 NaN \n", "2 127 \n", "3 61 \n", "4 53 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables[0].to_df().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we go further, let's put this simple logic in a function which will get the data for any particular year." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def raw_kenpom(year):\n", " url = f'https://kenpom.com/index.php?y={year}'\n", " tables = pps.HTMLTables(url)\n", " return tables[0].to_df()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " <th>4</th>\n", " <th>5</th>\n", " <th>6</th>\n", " <th>7</th>\n", " <th>8</th>\n", " <th>9</th>\n", " <th>...</th>\n", " <th>11</th>\n", " <th>12</th>\n", " <th>13</th>\n", " <th>14</th>\n", " <th>15</th>\n", " <th>16</th>\n", " <th>17</th>\n", " <th>18</th>\n", " <th>19</th>\n", " <th>20</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td></td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>...</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>Strength of Schedule</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NCSOS</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Rank</td>\n", " <td>Team</td>\n", " <td>Conf</td>\n", " <td>W-L</td>\n", " <td>AdjEM</td>\n", " <td>AdjO</td>\n", " <td>NaN</td>\n", " <td>AdjD</td>\n", " <td>NaN</td>\n", " <td>AdjT</td>\n", " <td>...</td>\n", " <td>Luck</td>\n", " <td>NaN</td>\n", " <td>AdjEM</td>\n", " <td>NaN</td>\n", " <td>OppO</td>\n", " <td>NaN</td>\n", " <td>OppD</td>\n", " <td>NaN</td>\n", " <td>AdjEM</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1</td>\n", " <td>Gonzaga 1</td>\n", " <td>WCC</td>\n", " <td>37-2</td>\n", " <td>+32.05</td>\n", " <td>118.4</td>\n", " <td>16</td>\n", " <td>86.3</td>\n", " <td>1</td>\n", " <td>70.1</td>\n", " <td>...</td>\n", " <td>+.020</td>\n", " <td>133</td>\n", " <td>+2.99</td>\n", " <td>89</td>\n", " <td>106.2</td>\n", " <td>84</td>\n", " <td>103.3</td>\n", " <td>105</td>\n", " <td>+1.01</td>\n", " <td>127</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2</td>\n", " <td>Villanova 1</td>\n", " <td>BE</td>\n", " <td>32-4</td>\n", " <td>+29.88</td>\n", " <td>122.4</td>\n", " <td>3</td>\n", " <td>92.5</td>\n", " <td>12</td>\n", " <td>64.1</td>\n", " <td>...</td>\n", " <td>+.010</td>\n", " <td>166</td>\n", " <td>+9.33</td>\n", " <td>33</td>\n", " <td>109.8</td>\n", " <td>38</td>\n", " <td>100.5</td>\n", " <td>32</td>\n", " <td>+3.55</td>\n", " <td>61</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>3</td>\n", " <td>North Carolina 1</td>\n", " <td>ACC</td>\n", " <td>33-7</td>\n", " <td>+28.22</td>\n", " <td>120.7</td>\n", " <td>9</td>\n", " <td>92.5</td>\n", " <td>11</td>\n", " <td>71.3</td>\n", " <td>...</td>\n", " <td>+.037</td>\n", " <td>85</td>\n", " <td>+12.49</td>\n", " <td>6</td>\n", " <td>112.0</td>\n", " <td>4</td>\n", " <td>99.5</td>\n", " <td>19</td>\n", " <td>+3.87</td>\n", " <td>53</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 21 columns</p>\n", "</div>" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 \\\n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 Rank Team Conf W-L AdjEM AdjO NaN AdjD NaN AdjT \n", "2 1 Gonzaga 1 WCC 37-2 +32.05 118.4 16 86.3 1 70.1 \n", "3 2 Villanova 1 BE 32-4 +29.88 122.4 3 92.5 12 64.1 \n", "4 3 North Carolina 1 ACC 33-7 +28.22 120.7 9 92.5 11 71.3 \n", "\n", " ... 11 12 13 14 15 16 17 18 19 \\\n", "0 ... NaN NaN Strength of Schedule NaN NaN NaN NaN NaN NCSOS \n", "1 ... Luck NaN AdjEM NaN OppO NaN OppD NaN AdjEM \n", "2 ... +.020 133 +2.99 89 106.2 84 103.3 105 +1.01 \n", "3 ... +.010 166 +9.33 33 109.8 38 100.5 32 +3.55 \n", "4 ... +.037 85 +12.49 6 112.0 4 99.5 19 +3.87 \n", "\n", " 20 \n", "0 NaN \n", "1 NaN \n", "2 127 \n", "3 61 \n", "4 53 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw = raw_kenpom(2017)\n", "raw.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Formatting the Columns Names\n", "\n", "In the raw table, the first two rows are the table headers. Many of these cells are blank (NaN), because the header spans two rows. \"NaN\" stands for not-a-number, and is a value defined in the `numpy` package on top of which `pandas` is built. You can read more about how `pandas` handles missing values [here](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).\n", "\n", "Most of the useful column names are in the second row. Also, we see that the columns which have blank (NaN) values in this row correspond to the ranks of each statistic in the KenPom table. The ranks show up as small text to the right of the statistic value. For example, Gonzaga had the sixteenth best Adjusted Offensive Efficiency and the highest-ranked Adjusted Defensive Efficiency for 2017, according to KenPom." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Rank\n", "1 Team\n", "2 Conf\n", "3 W-L\n", "4 AdjEM\n", "5 AdjO\n", "6 NaN\n", "7 AdjD\n", "8 NaN\n", "9 AdjT\n", "10 NaN\n", "11 Luck\n", "12 NaN\n", "13 AdjEM\n", "14 NaN\n", "15 OppO\n", "16 NaN\n", "17 OppD\n", "18 NaN\n", "19 AdjEM\n", "20 NaN\n", "Name: 1, dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw.iloc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's write a function to pull out the useful column names from the raw data." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def column_names(raw_cols):\n", " cols = raw_cols.copy()\n", " adj_em_count = 0\n", " adj_em_col = {\n", " 1: 'AdjEM',\n", " 2: 'SOS AdjEM',\n", " 3: 'NCSOS AdjEM',\n", " }\n", " for i, col in enumerate(cols):\n", " if col == 'AdjEM':\n", " adj_em_count += 1\n", " cols[i] = adj_em_col[adj_em_count]\n", " if str(col) == 'nan':\n", " cols[i] = cols[i-1] + '_rank'\n", " if col == 'Rank':\n", " cols[i] = 'KenPom'\n", " return cols" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 KenPom\n", "1 Team\n", "2 Conf\n", "3 W-L\n", "4 AdjEM\n", "5 AdjO\n", "6 AdjO_rank\n", "7 AdjD\n", "8 AdjD_rank\n", "9 AdjT\n", "10 AdjT_rank\n", "11 Luck\n", "12 Luck_rank\n", "13 SOS AdjEM\n", "14 SOS AdjEM_rank\n", "15 OppO\n", "16 OppO_rank\n", "17 OppD\n", "18 OppD_rank\n", "19 NCSOS AdjEM\n", "20 NCSOS AdjEM_rank\n", "Name: 1, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "column_names(raw.iloc[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can write a function to take the raw data, pull out the column names, and create a new `DataFrame` with the column names.\n", "\n", "This function drops the first two rows of the raw data since they are no longer needed once we have the column names. It also drops any rows with blank data, in particular the column headers which repeat several times on the web site." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def formatted(raw):\n", " df = raw.copy()\n", " df.columns = column_names(raw.iloc[1])\n", " df.columns.name = None\n", " df = df.drop([0, 1])\n", " return df.dropna().reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(351, 21)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formatted(raw).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 351 rows of data, which matches the number of teams in the 2017 KenPom web site table.\n", "\n", "### Getting the NCAA Tournament Seed\n", "\n", "Now we need to split the school name from the NCAA tournament seed. To do this, we'll use [regular expressions](https://docs.python.org/3/howto/regex.html) to separate the text from the number. You can read more about working with text data in `pandas` [here](https://pandas.pydata.org/pandas-docs/stable/text.html)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Gonzaga</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Villanova</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>North Carolina</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Kentucky</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Florida</td>\n", " <td>4</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1\n", "0 Gonzaga 1\n", "1 Villanova 1\n", "2 North Carolina 1\n", "3 Kentucky 2\n", "4 Florida 4" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formatted(raw)['Team'].str.extract('(.+) (\\d+)', expand=True).head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>346</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>347</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>348</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>349</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>350</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1\n", "346 NaN NaN\n", "347 NaN NaN\n", "348 NaN NaN\n", "349 NaN NaN\n", "350 NaN NaN" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formatted(raw)['Team'].str.extract('(.+) (\\d+)', expand=True).tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that for rows which don't have an NCAA seed, the regular expression pattern matching fails. In this case, we end up with NaN for both columns in such rows. To correct for this, we need to fall back to use the team name using `pandas` `fillna()` method." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def ncaa_seed(df):\n", " split = df['Team'].str.extract('(.+) (\\d+)', expand=True)\n", " team = split[0].fillna(df['Team'])\n", " df['Team'] = team.str.replace(';', '')\n", " df['Seed'] = split[1]\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wins and Losses\n", "\n", "Next, we want to split the wins and losses string to get one column for each value." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def wins_losses(df):\n", " df['Wins'] = df['W-L'].str.split('-').str.get(0).astype(int)\n", " df['Losses'] = df['W-L'].str.split('-').str.get(1).astype(int)\n", " df = df.drop(columns=['W-L'])\n", " return df" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Team</th>\n", " <th>Seed</th>\n", " <th>Wins</th>\n", " <th>Losses</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Gonzaga</td>\n", " <td>1</td>\n", " <td>37</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Villanova</td>\n", " <td>1</td>\n", " <td>32</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>North Carolina</td>\n", " <td>1</td>\n", " <td>33</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Kentucky</td>\n", " <td>2</td>\n", " <td>32</td>\n", " <td>6</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Florida</td>\n", " <td>4</td>\n", " <td>27</td>\n", " <td>9</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Team Seed Wins Losses\n", "0 Gonzaga 1 37 2\n", "1 Villanova 1 32 4\n", "2 North Carolina 1 33 7\n", "3 Kentucky 2 32 6\n", "4 Florida 4 27 9" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wins_losses(ncaa_seed(formatted(raw)))[['Team', 'Seed', 'Wins', 'Losses']].head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Team</th>\n", " <th>Seed</th>\n", " <th>Wins</th>\n", " <th>Losses</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>346</th>\n", " <td>Longwood</td>\n", " <td>NaN</td>\n", " <td>6</td>\n", " <td>24</td>\n", " </tr>\n", " <tr>\n", " <th>347</th>\n", " <td>Arkansas Pine Bluff</td>\n", " <td>NaN</td>\n", " <td>7</td>\n", " <td>25</td>\n", " </tr>\n", " <tr>\n", " <th>348</th>\n", " <td>North Carolina A&T</td>\n", " <td>NaN</td>\n", " <td>3</td>\n", " <td>29</td>\n", " </tr>\n", " <tr>\n", " <th>349</th>\n", " <td>Presbyterian</td>\n", " <td>NaN</td>\n", " <td>5</td>\n", " <td>25</td>\n", " </tr>\n", " <tr>\n", " <th>350</th>\n", " <td>Alabama A&M</td>\n", " <td>NaN</td>\n", " <td>2</td>\n", " <td>27</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Team Seed Wins Losses\n", "346 Longwood NaN 6 24\n", "347 Arkansas Pine Bluff NaN 7 25\n", "348 North Carolina A&T NaN 3 29\n", "349 Presbyterian NaN 5 25\n", "350 Alabama A&M NaN 2 27" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wins_losses(ncaa_seed(formatted(raw)))[['Team', 'Seed', 'Wins', 'Losses']].tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning Up\n", "\n", "The last things we need to do are make sure all of the columns are of the correct data type and order them the way we want." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def numeric(df):\n", " cols_to_strip = [\n", " 'AdjEM',\n", " 'Luck',\n", " 'SOS AdjEM',\n", " 'NCSOS AdjEM',\n", " ]\n", " for col in cols_to_strip:\n", " df[col] = pd.to_numeric(df[col])\n", " rank_cols = [col for col in df.columns if '_rank' in col]\n", " for col in rank_cols:\n", " df[col] = df[col].astype(int)\n", " other_cols = [\n", " 'AdjO',\n", " 'AdjD',\n", " 'AdjT',\n", " 'OppO',\n", " 'OppD',\n", " ]\n", " for col in other_cols:\n", " df[col] = df[col].astype(float)\n", " return df" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def ordered(df):\n", " first_cols = [\n", " 'Year',\n", " 'Team',\n", " 'Conf',\n", " 'Seed',\n", " 'Wins',\n", " 'Losses',\n", " 'KenPom',\n", " ]\n", " last_cols = [col for col in df.columns if 'rank' in col]\n", " cols = first_cols + [col for col in df.columns if col not in first_cols and col not in last_cols] + last_cols\n", " return df[cols]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def formatted_kenpom(year):\n", " raw = raw_kenpom(year)\n", " df = formatted(raw)\n", " df = ncaa_seed(df)\n", " df = wins_losses(df)\n", " df = numeric(df)\n", " df['Year'] = int(year)\n", " df = ordered(df)\n", " return df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(351, 24)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = formatted_kenpom(2017)\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Year int64\n", "Team object\n", "Conf object\n", "Seed object\n", "Wins int64\n", "Losses int64\n", "KenPom object\n", "AdjEM float64\n", "AdjO float64\n", "AdjD float64\n", "AdjT float64\n", "Luck float64\n", "SOS AdjEM float64\n", "OppO float64\n", "OppD float64\n", "NCSOS AdjEM float64\n", "AdjO_rank int64\n", "AdjD_rank int64\n", "AdjT_rank int64\n", "Luck_rank int64\n", "SOS AdjEM_rank int64\n", "OppO_rank int64\n", "OppD_rank int64\n", "NCSOS AdjEM_rank int64\n", "dtype: object" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Year</th>\n", " <th>Team</th>\n", " <th>Conf</th>\n", " <th>Seed</th>\n", " <th>Wins</th>\n", " <th>Losses</th>\n", " <th>KenPom</th>\n", " <th>AdjEM</th>\n", " <th>AdjO</th>\n", " <th>AdjD</th>\n", " <th>...</th>\n", " <th>OppD</th>\n", " <th>NCSOS AdjEM</th>\n", " <th>AdjO_rank</th>\n", " <th>AdjD_rank</th>\n", " <th>AdjT_rank</th>\n", " <th>Luck_rank</th>\n", " <th>SOS AdjEM_rank</th>\n", " <th>OppO_rank</th>\n", " <th>OppD_rank</th>\n", " <th>NCSOS AdjEM_rank</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2017</td>\n", " <td>Gonzaga</td>\n", " <td>WCC</td>\n", " <td>1</td>\n", " <td>37</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>32.05</td>\n", " <td>118.4</td>\n", " <td>86.3</td>\n", " <td>...</td>\n", " <td>103.3</td>\n", " <td>1.01</td>\n", " <td>16</td>\n", " <td>1</td>\n", " <td>76</td>\n", " <td>133</td>\n", " <td>89</td>\n", " <td>84</td>\n", " <td>105</td>\n", " <td>127</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2017</td>\n", " <td>Villanova</td>\n", " <td>BE</td>\n", " <td>1</td>\n", " <td>32</td>\n", " <td>4</td>\n", " <td>2</td>\n", " <td>29.88</td>\n", " <td>122.4</td>\n", " <td>92.5</td>\n", " <td>...</td>\n", " <td>100.5</td>\n", " <td>3.55</td>\n", " <td>3</td>\n", " <td>12</td>\n", " <td>324</td>\n", " <td>166</td>\n", " <td>33</td>\n", " <td>38</td>\n", " <td>32</td>\n", " <td>61</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2017</td>\n", " <td>North Carolina</td>\n", " <td>ACC</td>\n", " <td>1</td>\n", " <td>33</td>\n", " <td>7</td>\n", " <td>3</td>\n", " <td>28.22</td>\n", " <td>120.7</td>\n", " <td>92.5</td>\n", " <td>...</td>\n", " <td>99.5</td>\n", " <td>3.87</td>\n", " <td>9</td>\n", " <td>11</td>\n", " <td>40</td>\n", " <td>85</td>\n", " <td>6</td>\n", " <td>4</td>\n", " <td>19</td>\n", " <td>53</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2017</td>\n", " <td>Kentucky</td>\n", " <td>SEC</td>\n", " <td>2</td>\n", " <td>32</td>\n", " <td>6</td>\n", " <td>4</td>\n", " <td>27.72</td>\n", " <td>119.1</td>\n", " <td>91.4</td>\n", " <td>...</td>\n", " <td>99.5</td>\n", " <td>3.74</td>\n", " <td>12</td>\n", " <td>7</td>\n", " <td>26</td>\n", " <td>175</td>\n", " <td>19</td>\n", " <td>24</td>\n", " <td>15</td>\n", " <td>56</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2017</td>\n", " <td>Florida</td>\n", " <td>SEC</td>\n", " <td>4</td>\n", " <td>27</td>\n", " <td>9</td>\n", " <td>5</td>\n", " <td>27.50</td>\n", " <td>116.9</td>\n", " <td>89.5</td>\n", " <td>...</td>\n", " <td>97.8</td>\n", " <td>8.19</td>\n", " <td>25</td>\n", " <td>5</td>\n", " <td>117</td>\n", " <td>286</td>\n", " <td>7</td>\n", " <td>28</td>\n", " <td>2</td>\n", " <td>15</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 24 columns</p>\n", "</div>" ], "text/plain": [ " Year Team Conf Seed Wins Losses KenPom AdjEM AdjO AdjD \\\n", "0 2017 Gonzaga WCC 1 37 2 1 32.05 118.4 86.3 \n", "1 2017 Villanova BE 1 32 4 2 29.88 122.4 92.5 \n", "2 2017 North Carolina ACC 1 33 7 3 28.22 120.7 92.5 \n", "3 2017 Kentucky SEC 2 32 6 4 27.72 119.1 91.4 \n", "4 2017 Florida SEC 4 27 9 5 27.50 116.9 89.5 \n", "\n", " ... OppD NCSOS AdjEM AdjO_rank AdjD_rank AdjT_rank \\\n", "0 ... 103.3 1.01 16 1 76 \n", "1 ... 100.5 3.55 3 12 324 \n", "2 ... 99.5 3.87 9 11 40 \n", "3 ... 99.5 3.74 12 7 26 \n", "4 ... 97.8 8.19 25 5 117 \n", "\n", " Luck_rank SOS AdjEM_rank OppO_rank OppD_rank NCSOS AdjEM_rank \n", "0 133 89 84 105 127 \n", "1 166 33 38 32 61 \n", "2 85 6 4 19 53 \n", "3 175 19 24 15 56 \n", "4 286 7 28 2 15 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping all the Data\n", "\n", "Now that we have one function that will scrape and format the data for one year, all we need to do is call it for each year since 2002. Then we can save the data to a CSV file for later analysis." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def scrape_kenpom(year=None):\n", " if year:\n", " return formatted_kenpom(year)\n", " else:\n", " dfs = dict()\n", " for year in tqdm_notebook(range(2002, 2018)):\n", " dfs[year] = formatted_kenpom(year)\n", " return pd.concat([dfs[year] for year in dfs], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "43b96256abb148bfa92449eb44578912", "version_major": 2, "version_minor": 0 }, "text/html": [ "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n", "<p>\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n", " Widgets Documentation</a> for setup instructions.\n", "</p>\n", "<p>\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n", " it may mean that your frontend doesn't currently support widgets.\n", "</p>\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, max=16), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "(5453, 24)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = scrape_kenpom()\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "filename = 'kenpom-historical.csv'\n", "csvfile = OUTPUT_DIR.joinpath(filename)\n", "df.to_csv(csvfile, index=False, float_format='%g')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sports_py36]", "language": "python", "name": "conda-env-sports_py36-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }