{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NCAA March Madness\n", "\n", "### Web Scraping Pomeroy College Basketball Ratings\n", "\n", "This short notebook shows how to scrape historical ratings from Ken Pomeroy's (KenPom) [college basketball ratings](https://kenpom.com/) site.\n", "\n", "KenPom has ratings for NCAA men's basketball teams going back to 2002." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pracpred.scrape as pps" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm_notebook" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent\n", "DATA_DIR = PROJECT_DIR / 'data' / 'scraped'\n", "DATA_DIR.mkdir(exist_ok=True, parents=True)\n", "OUTPUT_DIR = PROJECT_DIR / 'data' / 'kenpom'\n", "OUTPUT_DIR.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by scraping 2017 data. We can use the HTML table scraping functionality in our `pracpred` package.\n", "\n", "You can find information on the `pracpred` package [here](https://github.com/practicallypredictable/pracpred) and [here](https://pypi.python.org/pypi/pracpred). You can install the package in your sports analytics environment by running the command `pip install pracpred` in Terminal (Mac or Linux) or Windows Anaconda Prompt." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "url = 'https://kenpom.com/index.php?y=2017'" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables = pps.HTMLTables(url)\n", "len(tables)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...11121314151617181920
0NaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNStrength of ScheduleNaNNaNNaNNaNNaNNCSOSNaN
1RankTeamConfW-LAdjEMAdjONaNAdjDNaNAdjT...LuckNaNAdjEMNaNOppONaNOppDNaNAdjEMNaN
21Gonzaga 1WCC37-2+32.05118.41686.3170.1...+.020133+2.9989106.284103.3105+1.01127
32Villanova 1BE32-4+29.88122.4392.51264.1...+.010166+9.3333109.838100.532+3.5561
43North Carolina 1ACC33-7+28.22120.7992.51171.3...+.03785+12.496112.0499.519+3.8753
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 \\\n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 Rank Team Conf W-L AdjEM AdjO NaN AdjD NaN AdjT \n", "2 1 Gonzaga 1 WCC 37-2 +32.05 118.4 16 86.3 1 70.1 \n", "3 2 Villanova 1 BE 32-4 +29.88 122.4 3 92.5 12 64.1 \n", "4 3 North Carolina 1 ACC 33-7 +28.22 120.7 9 92.5 11 71.3 \n", "\n", " ... 11 12 13 14 15 16 17 18 19 \\\n", "0 ... NaN NaN Strength of Schedule NaN NaN NaN NaN NaN NCSOS \n", "1 ... Luck NaN AdjEM NaN OppO NaN OppD NaN AdjEM \n", "2 ... +.020 133 +2.99 89 106.2 84 103.3 105 +1.01 \n", "3 ... +.010 166 +9.33 33 109.8 38 100.5 32 +3.55 \n", "4 ... +.037 85 +12.49 6 112.0 4 99.5 19 +3.87 \n", "\n", " 20 \n", "0 NaN \n", "1 NaN \n", "2 127 \n", "3 61 \n", "4 53 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables[0].to_df().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we go further, let's put this simple logic in a function which will get the data for any particular year." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def raw_kenpom(year):\n", " url = f'https://kenpom.com/index.php?y={year}'\n", " tables = pps.HTMLTables(url)\n", " return tables[0].to_df()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789...11121314151617181920
0NaNNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNStrength of ScheduleNaNNaNNaNNaNNaNNCSOSNaN
1RankTeamConfW-LAdjEMAdjONaNAdjDNaNAdjT...LuckNaNAdjEMNaNOppONaNOppDNaNAdjEMNaN
21Gonzaga 1WCC37-2+32.05118.41686.3170.1...+.020133+2.9989106.284103.3105+1.01127
32Villanova 1BE32-4+29.88122.4392.51264.1...+.010166+9.3333109.838100.532+3.5561
43North Carolina 1ACC33-7+28.22120.7992.51171.3...+.03785+12.496112.0499.519+3.8753
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 \\\n", "0 NaN NaN NaN NaN NaN NaN NaN NaN NaN \n", "1 Rank Team Conf W-L AdjEM AdjO NaN AdjD NaN AdjT \n", "2 1 Gonzaga 1 WCC 37-2 +32.05 118.4 16 86.3 1 70.1 \n", "3 2 Villanova 1 BE 32-4 +29.88 122.4 3 92.5 12 64.1 \n", "4 3 North Carolina 1 ACC 33-7 +28.22 120.7 9 92.5 11 71.3 \n", "\n", " ... 11 12 13 14 15 16 17 18 19 \\\n", "0 ... NaN NaN Strength of Schedule NaN NaN NaN NaN NaN NCSOS \n", "1 ... Luck NaN AdjEM NaN OppO NaN OppD NaN AdjEM \n", "2 ... +.020 133 +2.99 89 106.2 84 103.3 105 +1.01 \n", "3 ... +.010 166 +9.33 33 109.8 38 100.5 32 +3.55 \n", "4 ... +.037 85 +12.49 6 112.0 4 99.5 19 +3.87 \n", "\n", " 20 \n", "0 NaN \n", "1 NaN \n", "2 127 \n", "3 61 \n", "4 53 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw = raw_kenpom(2017)\n", "raw.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Formatting the Columns Names\n", "\n", "In the raw table, the first two rows are the table headers. Many of these cells are blank (NaN), because the header spans two rows. \"NaN\" stands for not-a-number, and is a value defined in the `numpy` package on top of which `pandas` is built. You can read more about how `pandas` handles missing values [here](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).\n", "\n", "Most of the useful column names are in the second row. Also, we see that the columns which have blank (NaN) values in this row correspond to the ranks of each statistic in the KenPom table. The ranks show up as small text to the right of the statistic value. For example, Gonzaga had the sixteenth best Adjusted Offensive Efficiency and the highest-ranked Adjusted Defensive Efficiency for 2017, according to KenPom." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Rank\n", "1 Team\n", "2 Conf\n", "3 W-L\n", "4 AdjEM\n", "5 AdjO\n", "6 NaN\n", "7 AdjD\n", "8 NaN\n", "9 AdjT\n", "10 NaN\n", "11 Luck\n", "12 NaN\n", "13 AdjEM\n", "14 NaN\n", "15 OppO\n", "16 NaN\n", "17 OppD\n", "18 NaN\n", "19 AdjEM\n", "20 NaN\n", "Name: 1, dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw.iloc[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's write a function to pull out the useful column names from the raw data." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def column_names(raw_cols):\n", " cols = raw_cols.copy()\n", " adj_em_count = 0\n", " adj_em_col = {\n", " 1: 'AdjEM',\n", " 2: 'SOS AdjEM',\n", " 3: 'NCSOS AdjEM',\n", " }\n", " for i, col in enumerate(cols):\n", " if col == 'AdjEM':\n", " adj_em_count += 1\n", " cols[i] = adj_em_col[adj_em_count]\n", " if str(col) == 'nan':\n", " cols[i] = cols[i-1] + '_rank'\n", " if col == 'Rank':\n", " cols[i] = 'KenPom'\n", " return cols" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 KenPom\n", "1 Team\n", "2 Conf\n", "3 W-L\n", "4 AdjEM\n", "5 AdjO\n", "6 AdjO_rank\n", "7 AdjD\n", "8 AdjD_rank\n", "9 AdjT\n", "10 AdjT_rank\n", "11 Luck\n", "12 Luck_rank\n", "13 SOS AdjEM\n", "14 SOS AdjEM_rank\n", "15 OppO\n", "16 OppO_rank\n", "17 OppD\n", "18 OppD_rank\n", "19 NCSOS AdjEM\n", "20 NCSOS AdjEM_rank\n", "Name: 1, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "column_names(raw.iloc[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can write a function to take the raw data, pull out the column names, and create a new `DataFrame` with the column names.\n", "\n", "This function drops the first two rows of the raw data since they are no longer needed once we have the column names. It also drops any rows with blank data, in particular the column headers which repeat several times on the web site." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def formatted(raw):\n", " df = raw.copy()\n", " df.columns = column_names(raw.iloc[1])\n", " df.columns.name = None\n", " df = df.drop([0, 1])\n", " return df.dropna().reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(351, 21)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formatted(raw).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 351 rows of data, which matches the number of teams in the 2017 KenPom web site table.\n", "\n", "### Getting the NCAA Tournament Seed\n", "\n", "Now we need to split the school name from the NCAA tournament seed. To do this, we'll use [regular expressions](https://docs.python.org/3/howto/regex.html) to separate the text from the number. You can read more about working with text data in `pandas` [here](https://pandas.pydata.org/pandas-docs/stable/text.html)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
0Gonzaga1
1Villanova1
2North Carolina1
3Kentucky2
4Florida4
\n", "
" ], "text/plain": [ " 0 1\n", "0 Gonzaga 1\n", "1 Villanova 1\n", "2 North Carolina 1\n", "3 Kentucky 2\n", "4 Florida 4" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formatted(raw)['Team'].str.extract('(.+) (\\d+)', expand=True).head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
346NaNNaN
347NaNNaN
348NaNNaN
349NaNNaN
350NaNNaN
\n", "
" ], "text/plain": [ " 0 1\n", "346 NaN NaN\n", "347 NaN NaN\n", "348 NaN NaN\n", "349 NaN NaN\n", "350 NaN NaN" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "formatted(raw)['Team'].str.extract('(.+) (\\d+)', expand=True).tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that for rows which don't have an NCAA seed, the regular expression pattern matching fails. In this case, we end up with NaN for both columns in such rows. To correct for this, we need to fall back to use the team name using `pandas` `fillna()` method." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def ncaa_seed(df):\n", " split = df['Team'].str.extract('(.+) (\\d+)', expand=True)\n", " team = split[0].fillna(df['Team'])\n", " df['Team'] = team.str.replace(';', '')\n", " df['Seed'] = split[1]\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wins and Losses\n", "\n", "Next, we want to split the wins and losses string to get one column for each value." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def wins_losses(df):\n", " df['Wins'] = df['W-L'].str.split('-').str.get(0).astype(int)\n", " df['Losses'] = df['W-L'].str.split('-').str.get(1).astype(int)\n", " df = df.drop(columns=['W-L'])\n", " return df" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TeamSeedWinsLosses
0Gonzaga1372
1Villanova1324
2North Carolina1337
3Kentucky2326
4Florida4279
\n", "
" ], "text/plain": [ " Team Seed Wins Losses\n", "0 Gonzaga 1 37 2\n", "1 Villanova 1 32 4\n", "2 North Carolina 1 33 7\n", "3 Kentucky 2 32 6\n", "4 Florida 4 27 9" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wins_losses(ncaa_seed(formatted(raw)))[['Team', 'Seed', 'Wins', 'Losses']].head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TeamSeedWinsLosses
346LongwoodNaN624
347Arkansas Pine BluffNaN725
348North Carolina A&TNaN329
349PresbyterianNaN525
350Alabama A&MNaN227
\n", "
" ], "text/plain": [ " Team Seed Wins Losses\n", "346 Longwood NaN 6 24\n", "347 Arkansas Pine Bluff NaN 7 25\n", "348 North Carolina A&T NaN 3 29\n", "349 Presbyterian NaN 5 25\n", "350 Alabama A&M NaN 2 27" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wins_losses(ncaa_seed(formatted(raw)))[['Team', 'Seed', 'Wins', 'Losses']].tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning Up\n", "\n", "The last things we need to do are make sure all of the columns are of the correct data type and order them the way we want." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def numeric(df):\n", " cols_to_strip = [\n", " 'AdjEM',\n", " 'Luck',\n", " 'SOS AdjEM',\n", " 'NCSOS AdjEM',\n", " ]\n", " for col in cols_to_strip:\n", " df[col] = pd.to_numeric(df[col])\n", " rank_cols = [col for col in df.columns if '_rank' in col]\n", " for col in rank_cols:\n", " df[col] = df[col].astype(int)\n", " other_cols = [\n", " 'AdjO',\n", " 'AdjD',\n", " 'AdjT',\n", " 'OppO',\n", " 'OppD',\n", " ]\n", " for col in other_cols:\n", " df[col] = df[col].astype(float)\n", " return df" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def ordered(df):\n", " first_cols = [\n", " 'Year',\n", " 'Team',\n", " 'Conf',\n", " 'Seed',\n", " 'Wins',\n", " 'Losses',\n", " 'KenPom',\n", " ]\n", " last_cols = [col for col in df.columns if 'rank' in col]\n", " cols = first_cols + [col for col in df.columns if col not in first_cols and col not in last_cols] + last_cols\n", " return df[cols]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def formatted_kenpom(year):\n", " raw = raw_kenpom(year)\n", " df = formatted(raw)\n", " df = ncaa_seed(df)\n", " df = wins_losses(df)\n", " df = numeric(df)\n", " df['Year'] = int(year)\n", " df = ordered(df)\n", " return df" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(351, 24)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = formatted_kenpom(2017)\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Year int64\n", "Team object\n", "Conf object\n", "Seed object\n", "Wins int64\n", "Losses int64\n", "KenPom object\n", "AdjEM float64\n", "AdjO float64\n", "AdjD float64\n", "AdjT float64\n", "Luck float64\n", "SOS AdjEM float64\n", "OppO float64\n", "OppD float64\n", "NCSOS AdjEM float64\n", "AdjO_rank int64\n", "AdjD_rank int64\n", "AdjT_rank int64\n", "Luck_rank int64\n", "SOS AdjEM_rank int64\n", "OppO_rank int64\n", "OppD_rank int64\n", "NCSOS AdjEM_rank int64\n", "dtype: object" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearTeamConfSeedWinsLossesKenPomAdjEMAdjOAdjD...OppDNCSOS AdjEMAdjO_rankAdjD_rankAdjT_rankLuck_rankSOS AdjEM_rankOppO_rankOppD_rankNCSOS AdjEM_rank
02017GonzagaWCC1372132.05118.486.3...103.31.01161761338984105127
12017VillanovaBE1324229.88122.492.5...100.53.5531232416633383261
22017North CarolinaACC1337328.22120.792.5...99.53.879114085641953
32017KentuckySEC2326427.72119.191.4...99.53.741272617519241556
42017FloridaSEC4279527.50116.989.5...97.88.19255117286728215
\n", "

5 rows × 24 columns

\n", "
" ], "text/plain": [ " Year Team Conf Seed Wins Losses KenPom AdjEM AdjO AdjD \\\n", "0 2017 Gonzaga WCC 1 37 2 1 32.05 118.4 86.3 \n", "1 2017 Villanova BE 1 32 4 2 29.88 122.4 92.5 \n", "2 2017 North Carolina ACC 1 33 7 3 28.22 120.7 92.5 \n", "3 2017 Kentucky SEC 2 32 6 4 27.72 119.1 91.4 \n", "4 2017 Florida SEC 4 27 9 5 27.50 116.9 89.5 \n", "\n", " ... OppD NCSOS AdjEM AdjO_rank AdjD_rank AdjT_rank \\\n", "0 ... 103.3 1.01 16 1 76 \n", "1 ... 100.5 3.55 3 12 324 \n", "2 ... 99.5 3.87 9 11 40 \n", "3 ... 99.5 3.74 12 7 26 \n", "4 ... 97.8 8.19 25 5 117 \n", "\n", " Luck_rank SOS AdjEM_rank OppO_rank OppD_rank NCSOS AdjEM_rank \n", "0 133 89 84 105 127 \n", "1 166 33 38 32 61 \n", "2 85 6 4 19 53 \n", "3 175 19 24 15 56 \n", "4 286 7 28 2 15 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping all the Data\n", "\n", "Now that we have one function that will scrape and format the data for one year, all we need to do is call it for each year since 2002. Then we can save the data to a CSV file for later analysis." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "def scrape_kenpom(year=None):\n", " if year:\n", " return formatted_kenpom(year)\n", " else:\n", " dfs = dict()\n", " for year in tqdm_notebook(range(2002, 2018)):\n", " dfs[year] = formatted_kenpom(year)\n", " return pd.concat([dfs[year] for year in dfs], ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "43b96256abb148bfa92449eb44578912", "version_major": 2, "version_minor": 0 }, "text/html": [ "

Failed to display Jupyter Widget of type HBox.

\n", "

\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the Jupyter\n", " Widgets Documentation for setup instructions.\n", "

\n", "

\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or NBViewer),\n", " it may mean that your frontend doesn't currently support widgets.\n", "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, max=16), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "(5453, 24)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = scrape_kenpom()\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "filename = 'kenpom-historical.csv'\n", "csvfile = OUTPUT_DIR.joinpath(filename)\n", "df.to_csv(csvfile, index=False, float_format='%g')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sports_py36]", "language": "python", "name": "conda-env-sports_py36-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }