{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## NCAA March Madness\n",
    "\n",
    "### Web Scraping Pomeroy College Basketball Ratings\n",
    "\n",
    "This short notebook shows how to scrape historical ratings from Ken Pomeroy's (KenPom) [college basketball ratings](https://kenpom.com/) site.\n",
    "\n",
    "KenPom has ratings for NCAA men's basketball teams going back to 2002."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pracpred.scrape as pps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm_notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT_DIR = Path.cwd().parent\n",
    "DATA_DIR = PROJECT_DIR / 'data' / 'scraped'\n",
    "DATA_DIR.mkdir(exist_ok=True, parents=True)\n",
    "OUTPUT_DIR = PROJECT_DIR / 'data' / 'kenpom'\n",
    "OUTPUT_DIR.mkdir(exist_ok=True, parents=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by scraping 2017 data. We can use the HTML table scraping functionality in our `pracpred` package.\n",
    "\n",
    "You can find information on the `pracpred` package [here](https://github.com/practicallypredictable/pracpred) and [here](https://pypi.python.org/pypi/pracpred). You can install the package in your sports analytics environment by running the command `pip install pracpred` in Terminal (Mac or Linux) or Windows Anaconda Prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'https://kenpom.com/index.php?y=2017'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tables = pps.HTMLTables(url)\n",
    "len(tables)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>...</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "      <th>15</th>\n",
       "      <th>16</th>\n",
       "      <th>17</th>\n",
       "      <th>18</th>\n",
       "      <th>19</th>\n",
       "      <th>20</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td></td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Strength of Schedule</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NCSOS</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Rank</td>\n",
       "      <td>Team</td>\n",
       "      <td>Conf</td>\n",
       "      <td>W-L</td>\n",
       "      <td>AdjEM</td>\n",
       "      <td>AdjO</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AdjD</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AdjT</td>\n",
       "      <td>...</td>\n",
       "      <td>Luck</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AdjEM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>OppO</td>\n",
       "      <td>NaN</td>\n",
       "      <td>OppD</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AdjEM</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Gonzaga 1</td>\n",
       "      <td>WCC</td>\n",
       "      <td>37-2</td>\n",
       "      <td>+32.05</td>\n",
       "      <td>118.4</td>\n",
       "      <td>16</td>\n",
       "      <td>86.3</td>\n",
       "      <td>1</td>\n",
       "      <td>70.1</td>\n",
       "      <td>...</td>\n",
       "      <td>+.020</td>\n",
       "      <td>133</td>\n",
       "      <td>+2.99</td>\n",
       "      <td>89</td>\n",
       "      <td>106.2</td>\n",
       "      <td>84</td>\n",
       "      <td>103.3</td>\n",
       "      <td>105</td>\n",
       "      <td>+1.01</td>\n",
       "      <td>127</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Villanova 1</td>\n",
       "      <td>BE</td>\n",
       "      <td>32-4</td>\n",
       "      <td>+29.88</td>\n",
       "      <td>122.4</td>\n",
       "      <td>3</td>\n",
       "      <td>92.5</td>\n",
       "      <td>12</td>\n",
       "      <td>64.1</td>\n",
       "      <td>...</td>\n",
       "      <td>+.010</td>\n",
       "      <td>166</td>\n",
       "      <td>+9.33</td>\n",
       "      <td>33</td>\n",
       "      <td>109.8</td>\n",
       "      <td>38</td>\n",
       "      <td>100.5</td>\n",
       "      <td>32</td>\n",
       "      <td>+3.55</td>\n",
       "      <td>61</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3</td>\n",
       "      <td>North Carolina 1</td>\n",
       "      <td>ACC</td>\n",
       "      <td>33-7</td>\n",
       "      <td>+28.22</td>\n",
       "      <td>120.7</td>\n",
       "      <td>9</td>\n",
       "      <td>92.5</td>\n",
       "      <td>11</td>\n",
       "      <td>71.3</td>\n",
       "      <td>...</td>\n",
       "      <td>+.037</td>\n",
       "      <td>85</td>\n",
       "      <td>+12.49</td>\n",
       "      <td>6</td>\n",
       "      <td>112.0</td>\n",
       "      <td>4</td>\n",
       "      <td>99.5</td>\n",
       "      <td>19</td>\n",
       "      <td>+3.87</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     0                 1     2     3       4      5    6     7    8     9   \\\n",
       "0                     NaN   NaN   NaN     NaN    NaN  NaN   NaN  NaN   NaN   \n",
       "1  Rank              Team  Conf   W-L   AdjEM   AdjO  NaN  AdjD  NaN  AdjT   \n",
       "2     1         Gonzaga 1   WCC  37-2  +32.05  118.4   16  86.3    1  70.1   \n",
       "3     2       Villanova 1    BE  32-4  +29.88  122.4    3  92.5   12  64.1   \n",
       "4     3  North Carolina 1   ACC  33-7  +28.22  120.7    9  92.5   11  71.3   \n",
       "\n",
       "  ...      11   12                    13   14     15   16     17   18     19  \\\n",
       "0 ...     NaN  NaN  Strength of Schedule  NaN    NaN  NaN    NaN  NaN  NCSOS   \n",
       "1 ...    Luck  NaN                 AdjEM  NaN   OppO  NaN   OppD  NaN  AdjEM   \n",
       "2 ...   +.020  133                 +2.99   89  106.2   84  103.3  105  +1.01   \n",
       "3 ...   +.010  166                 +9.33   33  109.8   38  100.5   32  +3.55   \n",
       "4 ...   +.037   85                +12.49    6  112.0    4   99.5   19  +3.87   \n",
       "\n",
       "    20  \n",
       "0  NaN  \n",
       "1  NaN  \n",
       "2  127  \n",
       "3   61  \n",
       "4   53  \n",
       "\n",
       "[5 rows x 21 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tables[0].to_df().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we go further, let's put this simple logic in a function which will get the data for any particular year."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def raw_kenpom(year):\n",
    "    url = f'https://kenpom.com/index.php?y={year}'\n",
    "    tables = pps.HTMLTables(url)\n",
    "    return tables[0].to_df()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>...</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "      <th>15</th>\n",
       "      <th>16</th>\n",
       "      <th>17</th>\n",
       "      <th>18</th>\n",
       "      <th>19</th>\n",
       "      <th>20</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td></td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Strength of Schedule</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NCSOS</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Rank</td>\n",
       "      <td>Team</td>\n",
       "      <td>Conf</td>\n",
       "      <td>W-L</td>\n",
       "      <td>AdjEM</td>\n",
       "      <td>AdjO</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AdjD</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AdjT</td>\n",
       "      <td>...</td>\n",
       "      <td>Luck</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AdjEM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>OppO</td>\n",
       "      <td>NaN</td>\n",
       "      <td>OppD</td>\n",
       "      <td>NaN</td>\n",
       "      <td>AdjEM</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Gonzaga 1</td>\n",
       "      <td>WCC</td>\n",
       "      <td>37-2</td>\n",
       "      <td>+32.05</td>\n",
       "      <td>118.4</td>\n",
       "      <td>16</td>\n",
       "      <td>86.3</td>\n",
       "      <td>1</td>\n",
       "      <td>70.1</td>\n",
       "      <td>...</td>\n",
       "      <td>+.020</td>\n",
       "      <td>133</td>\n",
       "      <td>+2.99</td>\n",
       "      <td>89</td>\n",
       "      <td>106.2</td>\n",
       "      <td>84</td>\n",
       "      <td>103.3</td>\n",
       "      <td>105</td>\n",
       "      <td>+1.01</td>\n",
       "      <td>127</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>Villanova 1</td>\n",
       "      <td>BE</td>\n",
       "      <td>32-4</td>\n",
       "      <td>+29.88</td>\n",
       "      <td>122.4</td>\n",
       "      <td>3</td>\n",
       "      <td>92.5</td>\n",
       "      <td>12</td>\n",
       "      <td>64.1</td>\n",
       "      <td>...</td>\n",
       "      <td>+.010</td>\n",
       "      <td>166</td>\n",
       "      <td>+9.33</td>\n",
       "      <td>33</td>\n",
       "      <td>109.8</td>\n",
       "      <td>38</td>\n",
       "      <td>100.5</td>\n",
       "      <td>32</td>\n",
       "      <td>+3.55</td>\n",
       "      <td>61</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3</td>\n",
       "      <td>North Carolina 1</td>\n",
       "      <td>ACC</td>\n",
       "      <td>33-7</td>\n",
       "      <td>+28.22</td>\n",
       "      <td>120.7</td>\n",
       "      <td>9</td>\n",
       "      <td>92.5</td>\n",
       "      <td>11</td>\n",
       "      <td>71.3</td>\n",
       "      <td>...</td>\n",
       "      <td>+.037</td>\n",
       "      <td>85</td>\n",
       "      <td>+12.49</td>\n",
       "      <td>6</td>\n",
       "      <td>112.0</td>\n",
       "      <td>4</td>\n",
       "      <td>99.5</td>\n",
       "      <td>19</td>\n",
       "      <td>+3.87</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     0                 1     2     3       4      5    6     7    8     9   \\\n",
       "0                     NaN   NaN   NaN     NaN    NaN  NaN   NaN  NaN   NaN   \n",
       "1  Rank              Team  Conf   W-L   AdjEM   AdjO  NaN  AdjD  NaN  AdjT   \n",
       "2     1         Gonzaga 1   WCC  37-2  +32.05  118.4   16  86.3    1  70.1   \n",
       "3     2       Villanova 1    BE  32-4  +29.88  122.4    3  92.5   12  64.1   \n",
       "4     3  North Carolina 1   ACC  33-7  +28.22  120.7    9  92.5   11  71.3   \n",
       "\n",
       "  ...      11   12                    13   14     15   16     17   18     19  \\\n",
       "0 ...     NaN  NaN  Strength of Schedule  NaN    NaN  NaN    NaN  NaN  NCSOS   \n",
       "1 ...    Luck  NaN                 AdjEM  NaN   OppO  NaN   OppD  NaN  AdjEM   \n",
       "2 ...   +.020  133                 +2.99   89  106.2   84  103.3  105  +1.01   \n",
       "3 ...   +.010  166                 +9.33   33  109.8   38  100.5   32  +3.55   \n",
       "4 ...   +.037   85                +12.49    6  112.0    4   99.5   19  +3.87   \n",
       "\n",
       "    20  \n",
       "0  NaN  \n",
       "1  NaN  \n",
       "2  127  \n",
       "3   61  \n",
       "4   53  \n",
       "\n",
       "[5 rows x 21 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw = raw_kenpom(2017)\n",
    "raw.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Formatting the Columns Names\n",
    "\n",
    "In the raw table, the first two rows are the table headers. Many of these cells are blank (NaN), because the header spans two rows. \"NaN\" stands for not-a-number, and is a value defined in the `numpy` package on top of which `pandas` is built. You can read more about how `pandas` handles missing values [here](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).\n",
    "\n",
    "Most of the useful column names are in the second row. Also, we see that the columns which have blank (NaN) values in this row correspond to the ranks of each statistic in the KenPom table. The ranks show up as small text to the right of the statistic value. For example, Gonzaga had the sixteenth best Adjusted Offensive Efficiency and the highest-ranked Adjusted Defensive Efficiency for 2017, according to KenPom."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      Rank\n",
       "1      Team\n",
       "2      Conf\n",
       "3       W-L\n",
       "4     AdjEM\n",
       "5      AdjO\n",
       "6       NaN\n",
       "7      AdjD\n",
       "8       NaN\n",
       "9      AdjT\n",
       "10      NaN\n",
       "11     Luck\n",
       "12      NaN\n",
       "13    AdjEM\n",
       "14      NaN\n",
       "15     OppO\n",
       "16      NaN\n",
       "17     OppD\n",
       "18      NaN\n",
       "19    AdjEM\n",
       "20      NaN\n",
       "Name: 1, dtype: object"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw.iloc[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's write a function to pull out the useful column names from the raw data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def column_names(raw_cols):\n",
    "    cols = raw_cols.copy()\n",
    "    adj_em_count = 0\n",
    "    adj_em_col = {\n",
    "        1: 'AdjEM',\n",
    "        2: 'SOS AdjEM',\n",
    "        3: 'NCSOS AdjEM',\n",
    "    }\n",
    "    for i, col in enumerate(cols):\n",
    "        if col == 'AdjEM':\n",
    "            adj_em_count += 1\n",
    "            cols[i] = adj_em_col[adj_em_count]\n",
    "        if str(col) == 'nan':\n",
    "            cols[i] = cols[i-1] + '_rank'\n",
    "        if col == 'Rank':\n",
    "            cols[i] = 'KenPom'\n",
    "    return cols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0               KenPom\n",
       "1                 Team\n",
       "2                 Conf\n",
       "3                  W-L\n",
       "4                AdjEM\n",
       "5                 AdjO\n",
       "6            AdjO_rank\n",
       "7                 AdjD\n",
       "8            AdjD_rank\n",
       "9                 AdjT\n",
       "10           AdjT_rank\n",
       "11                Luck\n",
       "12           Luck_rank\n",
       "13           SOS AdjEM\n",
       "14      SOS AdjEM_rank\n",
       "15                OppO\n",
       "16           OppO_rank\n",
       "17                OppD\n",
       "18           OppD_rank\n",
       "19         NCSOS AdjEM\n",
       "20    NCSOS AdjEM_rank\n",
       "Name: 1, dtype: object"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "column_names(raw.iloc[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can write a function to take the raw data, pull out the column names, and create a new `DataFrame` with the column names.\n",
    "\n",
    "This function drops the first two rows of the raw data since they are no longer needed once we have the column names. It also drops any rows with blank data, in particular the column headers which repeat several times on the web site."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def formatted(raw):\n",
    "    df = raw.copy()\n",
    "    df.columns = column_names(raw.iloc[1])\n",
    "    df.columns.name = None\n",
    "    df = df.drop([0, 1])\n",
    "    return df.dropna().reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(351, 21)"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "formatted(raw).shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 351 rows of data, which matches the number of teams in the 2017 KenPom web site table.\n",
    "\n",
    "### Getting the NCAA Tournament Seed\n",
    "\n",
    "Now we need to split the school name from the NCAA tournament seed. To do this, we'll use [regular expressions](https://docs.python.org/3/howto/regex.html) to separate the text from the number. You can read more about working with text data in `pandas` [here](https://pandas.pydata.org/pandas-docs/stable/text.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Gonzaga</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Villanova</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>North Carolina</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Kentucky</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Florida</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                0  1\n",
       "0         Gonzaga  1\n",
       "1       Villanova  1\n",
       "2  North Carolina  1\n",
       "3        Kentucky  2\n",
       "4         Florida  4"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "formatted(raw)['Team'].str.extract('(.+) (\\d+)', expand=True).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>346</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>347</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>348</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>349</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>350</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       0    1\n",
       "346  NaN  NaN\n",
       "347  NaN  NaN\n",
       "348  NaN  NaN\n",
       "349  NaN  NaN\n",
       "350  NaN  NaN"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "formatted(raw)['Team'].str.extract('(.+) (\\d+)', expand=True).tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that for rows which don't have an NCAA seed, the regular expression pattern matching fails. In this case, we end up with NaN for both columns in such rows. To correct for this, we need to fall back to use the team name using `pandas` `fillna()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ncaa_seed(df):\n",
    "    split = df['Team'].str.extract('(.+) (\\d+)', expand=True)\n",
    "    team = split[0].fillna(df['Team'])\n",
    "    df['Team'] = team.str.replace(';', '')\n",
    "    df['Seed'] = split[1]\n",
    "    return df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Wins and Losses\n",
    "\n",
    "Next, we want to split the wins and losses string to get one column for each value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def wins_losses(df):\n",
    "    df['Wins'] = df['W-L'].str.split('-').str.get(0).astype(int)\n",
    "    df['Losses'] = df['W-L'].str.split('-').str.get(1).astype(int)\n",
    "    df = df.drop(columns=['W-L'])\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Team</th>\n",
       "      <th>Seed</th>\n",
       "      <th>Wins</th>\n",
       "      <th>Losses</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Gonzaga</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Villanova</td>\n",
       "      <td>1</td>\n",
       "      <td>32</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>North Carolina</td>\n",
       "      <td>1</td>\n",
       "      <td>33</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Kentucky</td>\n",
       "      <td>2</td>\n",
       "      <td>32</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Florida</td>\n",
       "      <td>4</td>\n",
       "      <td>27</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Team Seed  Wins  Losses\n",
       "0         Gonzaga    1    37       2\n",
       "1       Villanova    1    32       4\n",
       "2  North Carolina    1    33       7\n",
       "3        Kentucky    2    32       6\n",
       "4         Florida    4    27       9"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wins_losses(ncaa_seed(formatted(raw)))[['Team', 'Seed', 'Wins', 'Losses']].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Team</th>\n",
       "      <th>Seed</th>\n",
       "      <th>Wins</th>\n",
       "      <th>Losses</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>346</th>\n",
       "      <td>Longwood</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>347</th>\n",
       "      <td>Arkansas Pine Bluff</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>348</th>\n",
       "      <td>North Carolina A&amp;T</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>349</th>\n",
       "      <td>Presbyterian</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>350</th>\n",
       "      <td>Alabama A&amp;M</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>27</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    Team Seed  Wins  Losses\n",
       "346             Longwood  NaN     6      24\n",
       "347  Arkansas Pine Bluff  NaN     7      25\n",
       "348   North Carolina A&T  NaN     3      29\n",
       "349         Presbyterian  NaN     5      25\n",
       "350          Alabama A&M  NaN     2      27"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wins_losses(ncaa_seed(formatted(raw)))[['Team', 'Seed', 'Wins', 'Losses']].tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cleaning Up\n",
    "\n",
    "The last things we need to do are make sure all of the columns are of the correct data type and order them the way we want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def numeric(df):\n",
    "    cols_to_strip = [\n",
    "        'AdjEM',\n",
    "        'Luck',\n",
    "        'SOS AdjEM',\n",
    "        'NCSOS AdjEM',\n",
    "    ]\n",
    "    for col in cols_to_strip:\n",
    "        df[col] = pd.to_numeric(df[col])\n",
    "    rank_cols = [col for col in df.columns if '_rank' in col]\n",
    "    for col in rank_cols:\n",
    "        df[col] = df[col].astype(int)\n",
    "    other_cols = [\n",
    "        'AdjO',\n",
    "        'AdjD',\n",
    "        'AdjT',\n",
    "        'OppO',\n",
    "        'OppD',\n",
    "    ]\n",
    "    for col in other_cols:\n",
    "        df[col] = df[col].astype(float)\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ordered(df):\n",
    "    first_cols = [\n",
    "        'Year',\n",
    "        'Team',\n",
    "        'Conf',\n",
    "        'Seed',\n",
    "        'Wins',\n",
    "        'Losses',\n",
    "        'KenPom',\n",
    "    ]\n",
    "    last_cols = [col for col in df.columns if 'rank' in col]\n",
    "    cols = first_cols + [col for col in df.columns if col not in first_cols and col not in last_cols] + last_cols\n",
    "    return df[cols]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "def formatted_kenpom(year):\n",
    "    raw = raw_kenpom(year)\n",
    "    df = formatted(raw)\n",
    "    df = ncaa_seed(df)\n",
    "    df = wins_losses(df)\n",
    "    df = numeric(df)\n",
    "    df['Year'] = int(year)\n",
    "    df = ordered(df)\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(351, 24)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = formatted_kenpom(2017)\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Year                  int64\n",
       "Team                 object\n",
       "Conf                 object\n",
       "Seed                 object\n",
       "Wins                  int64\n",
       "Losses                int64\n",
       "KenPom               object\n",
       "AdjEM               float64\n",
       "AdjO                float64\n",
       "AdjD                float64\n",
       "AdjT                float64\n",
       "Luck                float64\n",
       "SOS AdjEM           float64\n",
       "OppO                float64\n",
       "OppD                float64\n",
       "NCSOS AdjEM         float64\n",
       "AdjO_rank             int64\n",
       "AdjD_rank             int64\n",
       "AdjT_rank             int64\n",
       "Luck_rank             int64\n",
       "SOS AdjEM_rank        int64\n",
       "OppO_rank             int64\n",
       "OppD_rank             int64\n",
       "NCSOS AdjEM_rank      int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Year</th>\n",
       "      <th>Team</th>\n",
       "      <th>Conf</th>\n",
       "      <th>Seed</th>\n",
       "      <th>Wins</th>\n",
       "      <th>Losses</th>\n",
       "      <th>KenPom</th>\n",
       "      <th>AdjEM</th>\n",
       "      <th>AdjO</th>\n",
       "      <th>AdjD</th>\n",
       "      <th>...</th>\n",
       "      <th>OppD</th>\n",
       "      <th>NCSOS AdjEM</th>\n",
       "      <th>AdjO_rank</th>\n",
       "      <th>AdjD_rank</th>\n",
       "      <th>AdjT_rank</th>\n",
       "      <th>Luck_rank</th>\n",
       "      <th>SOS AdjEM_rank</th>\n",
       "      <th>OppO_rank</th>\n",
       "      <th>OppD_rank</th>\n",
       "      <th>NCSOS AdjEM_rank</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2017</td>\n",
       "      <td>Gonzaga</td>\n",
       "      <td>WCC</td>\n",
       "      <td>1</td>\n",
       "      <td>37</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>32.05</td>\n",
       "      <td>118.4</td>\n",
       "      <td>86.3</td>\n",
       "      <td>...</td>\n",
       "      <td>103.3</td>\n",
       "      <td>1.01</td>\n",
       "      <td>16</td>\n",
       "      <td>1</td>\n",
       "      <td>76</td>\n",
       "      <td>133</td>\n",
       "      <td>89</td>\n",
       "      <td>84</td>\n",
       "      <td>105</td>\n",
       "      <td>127</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2017</td>\n",
       "      <td>Villanova</td>\n",
       "      <td>BE</td>\n",
       "      <td>1</td>\n",
       "      <td>32</td>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "      <td>29.88</td>\n",
       "      <td>122.4</td>\n",
       "      <td>92.5</td>\n",
       "      <td>...</td>\n",
       "      <td>100.5</td>\n",
       "      <td>3.55</td>\n",
       "      <td>3</td>\n",
       "      <td>12</td>\n",
       "      <td>324</td>\n",
       "      <td>166</td>\n",
       "      <td>33</td>\n",
       "      <td>38</td>\n",
       "      <td>32</td>\n",
       "      <td>61</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2017</td>\n",
       "      <td>North Carolina</td>\n",
       "      <td>ACC</td>\n",
       "      <td>1</td>\n",
       "      <td>33</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>28.22</td>\n",
       "      <td>120.7</td>\n",
       "      <td>92.5</td>\n",
       "      <td>...</td>\n",
       "      <td>99.5</td>\n",
       "      <td>3.87</td>\n",
       "      <td>9</td>\n",
       "      <td>11</td>\n",
       "      <td>40</td>\n",
       "      <td>85</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>19</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2017</td>\n",
       "      <td>Kentucky</td>\n",
       "      <td>SEC</td>\n",
       "      <td>2</td>\n",
       "      <td>32</td>\n",
       "      <td>6</td>\n",
       "      <td>4</td>\n",
       "      <td>27.72</td>\n",
       "      <td>119.1</td>\n",
       "      <td>91.4</td>\n",
       "      <td>...</td>\n",
       "      <td>99.5</td>\n",
       "      <td>3.74</td>\n",
       "      <td>12</td>\n",
       "      <td>7</td>\n",
       "      <td>26</td>\n",
       "      <td>175</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>15</td>\n",
       "      <td>56</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2017</td>\n",
       "      <td>Florida</td>\n",
       "      <td>SEC</td>\n",
       "      <td>4</td>\n",
       "      <td>27</td>\n",
       "      <td>9</td>\n",
       "      <td>5</td>\n",
       "      <td>27.50</td>\n",
       "      <td>116.9</td>\n",
       "      <td>89.5</td>\n",
       "      <td>...</td>\n",
       "      <td>97.8</td>\n",
       "      <td>8.19</td>\n",
       "      <td>25</td>\n",
       "      <td>5</td>\n",
       "      <td>117</td>\n",
       "      <td>286</td>\n",
       "      <td>7</td>\n",
       "      <td>28</td>\n",
       "      <td>2</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   Year            Team Conf Seed  Wins  Losses KenPom  AdjEM   AdjO  AdjD  \\\n",
       "0  2017         Gonzaga  WCC    1    37       2      1  32.05  118.4  86.3   \n",
       "1  2017       Villanova   BE    1    32       4      2  29.88  122.4  92.5   \n",
       "2  2017  North Carolina  ACC    1    33       7      3  28.22  120.7  92.5   \n",
       "3  2017        Kentucky  SEC    2    32       6      4  27.72  119.1  91.4   \n",
       "4  2017         Florida  SEC    4    27       9      5  27.50  116.9  89.5   \n",
       "\n",
       "         ...          OppD  NCSOS AdjEM  AdjO_rank  AdjD_rank  AdjT_rank  \\\n",
       "0        ...         103.3         1.01         16          1         76   \n",
       "1        ...         100.5         3.55          3         12        324   \n",
       "2        ...          99.5         3.87          9         11         40   \n",
       "3        ...          99.5         3.74         12          7         26   \n",
       "4        ...          97.8         8.19         25          5        117   \n",
       "\n",
       "   Luck_rank  SOS AdjEM_rank  OppO_rank  OppD_rank  NCSOS AdjEM_rank  \n",
       "0        133              89         84        105               127  \n",
       "1        166              33         38         32                61  \n",
       "2         85               6          4         19                53  \n",
       "3        175              19         24         15                56  \n",
       "4        286               7         28          2                15  \n",
       "\n",
       "[5 rows x 24 columns]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scraping all the Data\n",
    "\n",
    "Now that we have one function that will scrape and format the data for one year, all we need to do is call it for each year since 2002. Then we can save the data to a CSV file for later analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "def scrape_kenpom(year=None):\n",
    "    if year:\n",
    "        return formatted_kenpom(year)\n",
    "    else:\n",
    "        dfs = dict()\n",
    "        for year in tqdm_notebook(range(2002, 2018)):\n",
    "            dfs[year] = formatted_kenpom(year)\n",
    "        return pd.concat([dfs[year] for year in dfs], ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "43b96256abb148bfa92449eb44578912",
       "version_major": 2,
       "version_minor": 0
      },
      "text/html": [
       "<p>Failed to display Jupyter Widget of type <code>HBox</code>.</p>\n",
       "<p>\n",
       "  If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
       "  that the widgets JavaScript is still loading. If this message persists, it\n",
       "  likely means that the widgets JavaScript library is either not installed or\n",
       "  not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
       "  Widgets Documentation</a> for setup instructions.\n",
       "</p>\n",
       "<p>\n",
       "  If you're reading this message in another frontend (for example, a static\n",
       "  rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
       "  it may mean that your frontend doesn't currently support widgets.\n",
       "</p>\n"
      ],
      "text/plain": [
       "HBox(children=(IntProgress(value=0, max=16), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(5453, 24)"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = scrape_kenpom()\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = 'kenpom-historical.csv'\n",
    "csvfile = OUTPUT_DIR.joinpath(filename)\n",
    "df.to_csv(csvfile, index=False, float_format='%g')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:sports_py36]",
   "language": "python",
   "name": "conda-env-sports_py36-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}