{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NCAA March Madness\n", "\n", "### Scrape NCAA Team History from the Washington Post\n", "\n", "This notebook scrapes data from the [Washington Post's NCAA Tournament History site](https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/search/).\n", "\n", "First, the site was used to create a list of all NCAA tournament games going back to 1985. This list was manually saved in [this spreadsheet](https://github.com/practicallypredictable/posts/blob/master/basketball/ncaa/data/scraped/ncaa_tournament_games-washpost.xlsx). The \"Team Links\" tab of this spreadsheet contains all the distinct teams which have played in the NCAA Tournament, in the form of hyperlinks to detailed Washington Post team pages. The team names and the related hyperlink URLs were saved into [this CSV file](https://github.com/practicallypredictable/posts/blob/master/basketball/ncaa/data/scraped/teams-washpost.csv).\n", "\n", "The code in this notebook loops through every team in the CSV file, and scrapes the historical team information into a new CSV file." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm_notebook" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent\n", "DATA_DIR = PROJECT_DIR / 'data' / 'scraped'\n", "DATA_DIR.mkdir(exist_ok=True, parents=True)\n", "OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'\n", "OUTPUT_DIR.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(297, 1)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filename = 'teams-washpost.csv'\n", "teamsfile = DATA_DIR.joinpath(filename)\n", "teams = pd.read_csv(teamsfile).set_index('name')\n", "teams.shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
url
name
Air Forcehttps://apps.washingtonpost.com/sports/apps/li...
Akronhttps://apps.washingtonpost.com/sports/apps/li...
Alabamahttps://apps.washingtonpost.com/sports/apps/li...
Alabama A&Mhttps://apps.washingtonpost.com/sports/apps/li...
Alabama Statehttps://apps.washingtonpost.com/sports/apps/li...
\n", "
" ], "text/plain": [ " url\n", "name \n", "Air Force https://apps.washingtonpost.com/sports/apps/li...\n", "Akron https://apps.washingtonpost.com/sports/apps/li...\n", "Alabama https://apps.washingtonpost.com/sports/apps/li...\n", "Alabama A&M https://apps.washingtonpost.com/sports/apps/li...\n", "Alabama State https://apps.washingtonpost.com/sports/apps/li..." ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "teams.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "url = teams.loc['Air Force', 'url']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `pandas` built-in function to read the simple HTML tables on each team's page." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
YearSeedRecordRound reachedBid typeRegionCoach
020061324-71First RoundAt LargeEastJeff Bzdelik
120041122-71First RoundAt LargeSouthJoe Scott
\n", "
" ], "text/plain": [ " Year Seed Record Round reached Bid type Region Coach \n", "0 2006 13 24-7 1First Round At Large East Jeff Bzdelik\n", "1 2004 11 22-7 1First Round At Large South Joe Scott" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_html(url)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function below attempts to read each team's page, and returns a small `DataFrame` for each team. Note that at the time I wrote this, several Washington Post pages were \"under construction\" and returned HTTP errors. Therefore, there is exception-handling code in this function to make sure the code scrapes as much information as possible." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def team_info(team, url):\n", " try:\n", " df = pd.read_html(url)[0]\n", " except Exception as e:\n", " print(f'Failed for {team}:', e)\n", " print(url)\n", " return None\n", " df.columns = df.columns.str.rstrip()\n", " df['Team'] = team\n", " cols = [\n", " 'Team',\n", " 'Year',\n", " 'Seed',\n", " 'Region',\n", " 'Bid type',\n", " 'Round reached',\n", " 'Coach',\n", " 'Record',\n", " ]\n", " return df[cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to put all the information into one large `DataFrame`. The function below loops through all the teams and uses the `concat()` `pandas` function to do this." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def get_info(teams):\n", " info = dict()\n", " for name in tqdm_notebook(teams.index):\n", " url = teams.loc[name, 'url']\n", " info[name] = team_info(name, url)\n", " return pd.concat([info[name] for name in teams.index], ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can scrape all the information." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "489a65dcbd6e41e0b9555b5ee24cc016", "version_major": 2, "version_minor": 0 }, "text/html": [ "

Failed to display Jupyter Widget of type HBox.

\n", "

\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the Jupyter\n", " Widgets Documentation for setup instructions.\n", "

\n", "

\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or NBViewer),\n", " it may mean that your frontend doesn't currently support widgets.\n", "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, max=297), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Failed for California: HTTP Error 500: Internal Server Error\n", "https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/california/\n", "Failed for Vanderbilt: HTTP Error 500: Internal Server Error\n", "https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/vanderbilt/\n", "Failed for Wichita State: HTTP Error 500: Internal Server Error\n", "https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/wichita-state/\n", "\n" ] } ], "source": [ "df = get_info(teams)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2113" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TeamYearSeedRegionBid typeRound reachedCoachRecord
0Air Force200613EastAt Large1First RoundJeff Bzdelik24-7
1Air Force200411SouthAt Large1First RoundJoe Scott22-7
2Akron201312SouthAutomatic Qualifier1First RoundKeith Dambrot26-7
3Akron201115SouthwestAutomatic Qualifier1First RoundKeith Dambrot23-13
4Akron200913SouthAutomatic Qualifier1First RoundKeith Dambrot23-13
\n", "
" ], "text/plain": [ " Team Year Seed Region Bid type Round reached \\\n", "0 Air Force 2006 13 East At Large 1First Round \n", "1 Air Force 2004 11 South At Large 1First Round \n", "2 Akron 2013 12 South Automatic Qualifier 1First Round \n", "3 Akron 2011 15 Southwest Automatic Qualifier 1First Round \n", "4 Akron 2009 13 South Automatic Qualifier 1First Round \n", "\n", " Coach Record \n", "0 Jeff Bzdelik 24-7 \n", "1 Joe Scott 22-7 \n", "2 Keith Dambrot 26-7 \n", "3 Keith Dambrot 23-13 \n", "4 Keith Dambrot 23-13 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's clean up the \"Round reached\" column to split the numeric value and the text values into two distinct columns." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['1First Round', '2Second Round', '4Elite Eight', '3Sweet 16',\n", " '0Play-In', '6National Championship', '5Final Four'], dtype=object)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Round reached'].unique()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def format_round(df):\n", " df[['Eliminated', 'Round Reached']] = df['Round reached'].str.extract('^(\\d)(.+)', expand=True)\n", " return df.drop(columns=['Round reached'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TeamYearSeedRegionBid typeCoachRecordEliminatedRound Reached
0Air Force200613EastAt LargeJeff Bzdelik24-71First Round
1Air Force200411SouthAt LargeJoe Scott22-71First Round
2Akron201312SouthAutomatic QualifierKeith Dambrot26-71First Round
3Akron201115SouthwestAutomatic QualifierKeith Dambrot23-131First Round
4Akron200913SouthAutomatic QualifierKeith Dambrot23-131First Round
\n", "
" ], "text/plain": [ " Team Year Seed Region Bid type Coach \\\n", "0 Air Force 2006 13 East At Large Jeff Bzdelik \n", "1 Air Force 2004 11 South At Large Joe Scott \n", "2 Akron 2013 12 South Automatic Qualifier Keith Dambrot \n", "3 Akron 2011 15 Southwest Automatic Qualifier Keith Dambrot \n", "4 Akron 2009 13 South Automatic Qualifier Keith Dambrot \n", "\n", " Record Eliminated Round Reached \n", "0 24-7 1 First Round \n", "1 22-7 1 First Round \n", "2 26-7 1 First Round \n", "3 23-13 1 First Round \n", "4 23-13 1 First Round " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = format_round(df)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "filename = 'team_history-washpost.csv'\n", "csvfile = OUTPUT_DIR.joinpath(filename)\n", "df.to_csv(csvfile, index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sports_py36]", "language": "python", "name": "conda-env-sports_py36-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }