{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NCAA March Madness\n", "\n", "### Web Scraping Pomeroy College Basketball Ratings\n", "\n", "This short notebook shows how to scrape historical ratings from Ken Pomeroy's (KenPom) [college basketball ratings](https://kenpom.com/) site.\n", "\n", "KenPom has ratings for NCAA men's basketball teams going back to 2002." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pracpred.scrape as pps" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm_notebook" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent\n", "DATA_DIR = PROJECT_DIR / 'data' / 'scraped'\n", "DATA_DIR.mkdir(exist_ok=True, parents=True)\n", "OUTPUT_DIR = PROJECT_DIR / 'data' / 'kenpom'\n", "OUTPUT_DIR.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by scraping 2017 data. We can use the HTML table scraping functionality in our `pracpred` package.\n", "\n", "You can find information on the `pracpred` package [here](https://github.com/practicallypredictable/pracpred) and [here](https://pypi.python.org/pypi/pracpred). You can install the package in your sports analytics environment by running the command `pip install pracpred` in Terminal (Mac or Linux) or Windows Anaconda Prompt." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "url = 'https://kenpom.com/index.php?y=2017'" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tables = pps.HTMLTables(url)\n", "len(tables)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | Strength of Schedule | NaN | NaN | NaN | NaN | NaN | NCSOS | NaN |
| 1 | Rank | Team | Conf | W-L | AdjEM | AdjO | NaN | AdjD | NaN | AdjT | ... | Luck | NaN | AdjEM | NaN | OppO | NaN | OppD | NaN | AdjEM | NaN |
| 2 | 1 | Gonzaga 1 | WCC | 37-2 | +32.05 | 118.4 | 16 | 86.3 | 1 | 70.1 | ... | +.020 | 133 | +2.99 | 89 | 106.2 | 84 | 103.3 | 105 | +1.01 | 127 |
| 3 | 2 | Villanova 1 | BE | 32-4 | +29.88 | 122.4 | 3 | 92.5 | 12 | 64.1 | ... | +.010 | 166 | +9.33 | 33 | 109.8 | 38 | 100.5 | 32 | +3.55 | 61 |
| 4 | 3 | North Carolina 1 | ACC | 33-7 | +28.22 | 120.7 | 9 | 92.5 | 11 | 71.3 | ... | +.037 | 85 | +12.49 | 6 | 112.0 | 4 | 99.5 | 19 | +3.87 | 53 |
5 rows × 21 columns
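
The raw table keeps KenPom's two header rows as data rows 0 and 1, and each unlabeled column (NaN in row 1) holds the national rank for the statistic immediately to its left. The following is a minimal sketch of one way to derive usable column names, assuming every unlabeled column is a rank for the preceding labeled one. `make_column_names` is a hypothetical helper and is not part of `pracpred`; duplicate labels such as the three `AdjEM` columns would still need to be disambiguated separately.

```python
import pandas as pd


def make_column_names(header):
    # Name each unlabeled (NaN) column '<previous label>_rank'.
    names = []
    previous = None
    for label in header:
        if pd.isna(label):
            if previous is None:
                raise ValueError('first column has no label')
            names.append(f'{previous}_rank')
        else:
            previous = str(label)
            names.append(previous)
    return names


# A shortened version of header row 1 from the table above.
header = ['Rank', 'Team', 'AdjEM', 'AdjO', float('nan'), 'AdjD', float('nan')]
print(make_column_names(header))
```
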

|   | 0 | 1 |
|---|---|---|
| 0 | Gonzaga | 1 |
| 1 | Villanova | 1 |
| 2 | North Carolina | 1 |
| 3 | Kentucky | 2 |
| 4 | Florida | 4 |

|   | 0 | 1 |
|---|---|---|
| 346 | NaN | NaN |
| 347 | NaN | NaN |
| 348 | NaN | NaN |
| 349 | NaN | NaN |
| 350 | NaN | NaN |
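
The two tables above show the head and tail of the team column after splitting it into a name and an NCAA tournament seed; for teams that missed the tournament the pattern does not match, so both parts come back as NaN. The cell that produced this output is not shown here, so the following is only a sketch of one way to do such a split with pandas, assuming raw values look like `Gonzaga 1` (name, space, seed). The helper name `split_team_seed` is hypothetical, and unlike the output above it keeps the team name when no seed is present.

```python
import pandas as pd


def split_team_seed(team):
    # Split strings like 'Gonzaga 1' into a team name and an optional seed.
    # A trailing integer is treated as the seed; unseeded teams get <NA>.
    pattern = r'^(?P<Team>.*?)(?:\s+(?P<Seed>\d+))?$'
    parts = team.str.strip().str.extract(pattern)
    parts['Seed'] = pd.to_numeric(parts['Seed']).astype('Int64')
    return parts


raw_teams = pd.Series(['Gonzaga 1', 'North Carolina 1', 'Florida 4', 'Longwood'])
print(split_team_seed(raw_teams))
```

Making the seed group optional avoids a second pass to fill in the names of unseeded teams.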

|   | Team | Seed | Wins | Losses |
|---|---|---|---|---|
| 0 | Gonzaga | 1 | 37 | 2 |
| 1 | Villanova | 1 | 32 | 4 |
| 2 | North Carolina | 1 | 33 | 7 |
| 3 | Kentucky | 2 | 32 | 6 |
| 4 | Florida | 4 | 27 | 9 |

|   | Team | Seed | Wins | Losses |
|---|---|---|---|---|
| 346 | Longwood | NaN | 6 | 24 |
| 347 | Arkansas Pine Bluff | NaN | 7 | 25 |
| 348 | North Carolina A&T | NaN | 3 | 29 |
| 349 | Presbyterian | NaN | 5 | 25 |
| 350 | Alabama A&M | NaN | 2 | 27 |
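
In the tables above the `W-L` record has become separate `Wins` and `Losses` columns. As a hedged illustration (the original cell is not shown), a record string such as `37-2` can be split on the hyphen and converted to integers. `split_record` is a hypothetical helper name, and the sketch assumes every record is two integers joined by a single hyphen.

```python
import pandas as pd


def split_record(record):
    # Turn 'W-L' strings such as '37-2' into integer Wins and Losses columns.
    parts = record.str.split('-', expand=True).astype(int)
    parts.columns = ['Wins', 'Losses']
    return parts


records = pd.Series(['37-2', '32-4', '2-27'])
print(split_record(records))
```
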

|   | Year | Team | Conf | Seed | Wins | Losses | KenPom | AdjEM | AdjO | AdjD | ... | OppD | NCSOS AdjEM | AdjO_rank | AdjD_rank | AdjT_rank | Luck_rank | SOS AdjEM_rank | OppO_rank | OppD_rank | NCSOS AdjEM_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | Gonzaga | WCC | 1 | 37 | 2 | 1 | 32.05 | 118.4 | 86.3 | ... | 103.3 | 1.01 | 16 | 1 | 76 | 133 | 89 | 84 | 105 | 127 |
| 1 | 2017 | Villanova | BE | 1 | 32 | 4 | 2 | 29.88 | 122.4 | 92.5 | ... | 100.5 | 3.55 | 3 | 12 | 324 | 166 | 33 | 38 | 32 | 61 |
| 2 | 2017 | North Carolina | ACC | 1 | 33 | 7 | 3 | 28.22 | 120.7 | 92.5 | ... | 99.5 | 3.87 | 9 | 11 | 40 | 85 | 6 | 4 | 19 | 53 |
| 3 | 2017 | Kentucky | SEC | 2 | 32 | 6 | 4 | 27.72 | 119.1 | 91.4 | ... | 99.5 | 3.74 | 12 | 7 | 26 | 175 | 19 | 24 | 15 | 56 |
| 4 | 2017 | Florida | SEC | 4 | 27 | 9 | 5 | 27.50 | 116.9 | 89.5 | ... | 97.8 | 8.19 | 25 | 5 | 117 | 286 | 7 | 28 | 2 | 15 |
5 rows × 24 columns
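
In the cleaned table the ratings are plain numbers (`32.05`) rather than the signed strings of the raw scrape (`+32.05`, `+.020`). A minimal sketch of that conversion follows, assuming the raw values are strings that Python's `float` can parse. The column names come from the table above, but the helper `to_float_columns` is hypothetical.

```python
import pandas as pd


def to_float_columns(df, columns):
    # Return a copy of df with the named string columns converted to floats.
    # float() accepts a leading '+' and a missing leading zero ('+.020').
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(float)
    return out


raw = pd.DataFrame({'Team': ['Gonzaga', 'Villanova'],
                    'AdjEM': ['+32.05', '+29.88'],
                    'Luck': ['+.020', '+.010']})
clean = to_float_columns(raw, ['AdjEM', 'Luck'])
print(clean.dtypes)
```
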
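
The last step repeats the single-season scrape for every season and stacks the results; 2002 through 2017 is 16 seasons, which is consistent with the progress bar's `max=16`. The body of `scrape_kenpom` is not shown here, so the following is only a sketch of the general pattern. `scrape_years` and `fake_scrape` are hypothetical names, and the per-season scraper is passed in as a function so the example runs without network access.

```python
import pandas as pd

# URL pattern taken from the single-season example (index.php?y=2017).
BASE_URL = 'https://kenpom.com/index.php?y={year}'


def scrape_years(scrape_one, first=2002, last=2017):
    # Call scrape_one(url) for each season and stack the results.
    # scrape_one must return one DataFrame per season.
    frames = []
    for year in range(first, last + 1):
        frame = scrape_one(BASE_URL.format(year=year)).copy()
        frame.insert(0, 'Year', year)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


def fake_scrape(url):
    # Stand-in for a real scraper so the sketch runs offline.
    return pd.DataFrame({'Team': ['A', 'B'], 'AdjEM': [1.0, -1.0]})


df_all = scrape_years(fake_scrape, first=2016, last=2017)
print(df_all)
```

When scraping a live site, pausing between requests keeps the load on the server low.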
\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, max=16), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "(5453, 24)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = scrape_kenpom()\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "filename = 'kenpom-historical.csv'\n", "csvfile = OUTPUT_DIR.joinpath(filename)\n", "df.to_csv(csvfile, index=False, float_format='%g')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sports_py36]", "language": "python", "name": "conda-env-sports_py36-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }