{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NCAA March Madness\n", "\n", "### Scrape NCAA Team History from the Washington Post\n", "\n", "This notebook scrapes data from the [Washington Post's NCAA Tournament History site](https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/search/).\n", "\n", "First, the site was used to create a list of all NCAA tournament games going back to 1985. This list was manually saved in [this spreadsheet](https://github.com/practicallypredictable/posts/blob/master/basketball/ncaa/data/scraped/ncaa_tournament_games-washpost.xlsx). The \"Team Links\" tab of this spreadsheet lists every distinct team that has played in the NCAA Tournament, as hyperlinks to the detailed Washington Post team pages. The team names and the related hyperlink URLs were saved into [this CSV file](https://github.com/practicallypredictable/posts/blob/master/basketball/ncaa/data/scraped/teams-washpost.csv).\n", "\n", "The code in this notebook loops through every team in the CSV file and scrapes each team's tournament history (year, seed, record, round reached, bid type, region, and coach) into a new CSV file."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm_notebook" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent\n", "DATA_DIR = PROJECT_DIR / 'data' / 'scraped'\n", "DATA_DIR.mkdir(exist_ok=True, parents=True)\n", "OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'\n", "OUTPUT_DIR.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(297, 1)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "filename = 'teams-washpost.csv'\n", "teamsfile = DATA_DIR.joinpath(filename)\n", "teams = pd.read_csv(teamsfile).set_index('name')\n", "teams.shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| name | url |\n",
"|---|---|\n",
"| Air Force | https://apps.washingtonpost.com/sports/apps/li... |\n",
"| Akron | https://apps.washingtonpost.com/sports/apps/li... |\n",
"| Alabama | https://apps.washingtonpost.com/sports/apps/li... |\n",
"| Alabama A&M | https://apps.washingtonpost.com/sports/apps/li... |\n",
"| Alabama State | https://apps.washingtonpost.com/sports/apps/li... |\n", "
\n",
"| | Year | Seed | Record | Round reached | Bid type | Region | Coach |\n",
"|---|---|---|---|---|---|---|---|\n",
"| 0 | 2006 | 13 | 24-7 | 1First Round | At Large | East | Jeff Bzdelik |\n",
"| 1 | 2004 | 11 | 22-7 | 1First Round | At Large | South | Joe Scott |\n", "
\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, max=297), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Failed for California: HTTP Error 500: Internal Server Error\n", "https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/california/\n", "Failed for Vanderbilt: HTTP Error 500: Internal Server Error\n", "https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/vanderbilt/\n", "Failed for Wichita State: HTTP Error 500: Internal Server Error\n", "https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/wichita-state/\n", "\n" ] } ], "source": [ "df = get_info(teams)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2113" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n",
"| | Team | Year | Seed | Region | Bid type | Round reached | Coach | Record |\n",
"|---|---|---|---|---|---|---|---|---|\n",
"| 0 | Air Force | 2006 | 13 | East | At Large | 1First Round | Jeff Bzdelik | 24-7 |\n",
"| 1 | Air Force | 2004 | 11 | South | At Large | 1First Round | Joe Scott | 22-7 |\n",
"| 2 | Akron | 2013 | 12 | South | Automatic Qualifier | 1First Round | Keith Dambrot | 26-7 |\n",
"| 3 | Akron | 2011 | 15 | Southwest | Automatic Qualifier | 1First Round | Keith Dambrot | 23-13 |\n",
"| 4 | Akron | 2009 | 13 | South | Automatic Qualifier | 1First Round | Keith Dambrot | 23-13 |\n", "
\n",
"| | Team | Year | Seed | Region | Bid type | Coach | Record | Eliminated | Round Reached |\n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | Air Force | 2006 | 13 | East | At Large | Jeff Bzdelik | 24-7 | 1 | First Round |\n",
"| 1 | Air Force | 2004 | 11 | South | At Large | Joe Scott | 22-7 | 1 | First Round |\n",
"| 2 | Akron | 2013 | 12 | South | Automatic Qualifier | Keith Dambrot | 26-7 | 1 | First Round |\n",
"| 3 | Akron | 2011 | 15 | Southwest | Automatic Qualifier | Keith Dambrot | 23-13 | 1 | First Round |\n",
"| 4 | Akron | 2009 | 13 | South | Automatic Qualifier | Keith Dambrot | 23-13 | 1 | First Round |\n", "