{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Web Scraping NBA Team Matchups and Box Scores\n", "\n", "We are going to use machine learning and statistics to predict NBA matchups. To do this, we are going to need data on NBA games, and lots of it. So let's get all the team matchups and box scores from [stats.nba.com](http://stats.nba.com/), and make them ready for use.\n", "\n", "This post has two purposes. The first is to show you how to do the actual web scraping. The second purpose is to show you how to examine data before you us it. Data are almost always a bit messy and need to be handled with care. It's important to take some time to look at data and to make sure it's clean before use.\n", "\n", "This post has a lot of code. If you don't want to spend time going through it, you can just run the code if you want. The code in this post will scrape and store the data for you. You won't need to run this code again after you get the data.\n", "\n", "I think the most useful information in this post is learning how to assess the data quality, and make smart decisions about how to address potential problems. If you decide to just skim the code and focus on the decisions, hopefully this example will still be useful for you. You may also find it interesting to see how real-world NBA facts (such as expansion teams, team moves and name changes) actually impacts the data!\n", "\n", "### The Data\n", "\n", "We are going to scrape [Team Box Scores for completed NBA seasons from stats.nba.com](https://stats.nba.com/teams/boxscores-traditional/?Season=2016-17&SeasonType=Regular%20Season). We are going to start with tradtional box scores, which give us plenty of statistics to consider, as well as all the matchup information we need. If you explore the site, you'll see there are also other more advanced data available. We'll consider these other data in future posts.\n", "\n", "The advanced data are only available going back to the 1996-97 season, so we'll limit our traditional box scores to the same time period. The site actually has traditional box scores going back to the 1946-47 season. Such old data are interesting, but aren't useful for predicting current matchups. We have to consider how much the game has changed over the years. Although [there have been a number of NBA rule changes since 1996](http://www.nba.com/analysis/rules_history.html) and game style has evolved, I think data over this time period are still useful today.\n", "\n", "The data from 1996-97 to 2016-17 comprise 21 complete seasons, including pre-season data from recent years. This is plenty of data for our purposes. Notice that we're not going to download data from the current 2017-18 season yet. We are going to use the historical data for building prediction models, since we know how the seasons turned out. After we build the model, we can apply it to the current season and see how it does over the remainder of the season.\n", "\n", "### Why Write a Web Scraper\n", "\n", "As I mentioned, there's a lot of code in this post. You may wonder why it's worth it to build a web scraper.\n", "\n", "First, in many cases it's the easiest way to get the data you need. Even if it takes a few hours to build a scraper, it's a lot faster than manually downloading it. Consider the excellent site, [Basketball Reference](https://www.basketball-reference.com/). Basketball Reference offers you the ability to download games into a CSV file. [Here is a link to the 2016-17 season games](https://www.basketball-reference.com/leagues/NBA_2017_games.html).\n", "\n", "Notice that this site shows games grouped by month. You can download the games by looking at the \"Share & More\" drop-down menu next to the month schedule header. Each month requires a separate download. So, if you want to download all the games for the 21 complete seasons since 1996-97, you'd need to manually download around 180 CSV files (there are 9 months in the typical NBA season). Then, to create a master spreadsheet of matchup results, you'd need to open them up in Microsoft Excel or Google Sheets and put all the data together in one place.\n", "\n", "Also, if you click on one of the box score links (for example, [the October 25, 2016 matchup between the Knicks and the Cavaliers](https://www.basketball-reference.com/boxscores/201610250CLE.html)), you'll see very detailed box score information, but no easy way to extract and download all the box scores at once. Scraping the original data is much faster and more practical. As you'll see, it's not actually that hard.\n", "\n", "Lastly, writing a scraper is essential if you want to update the data relatively frequently. If you create a matchup prediction model, you'll want to run it after every game, to update the probabilities for future matchups. Imagine if you had to type information into a spreadsheet or download a few new spreadsheets every night during the NBA season?\n", "\n", "The web scraper we'll develop here will also be able to scrape current season games. So once you have it working, you can run each day during the NBA season to get the lastest matchup results and team box scores automatically.\n", "\n", "### Scraping a JavaScript-Rendered Web Site\n", "\n", "[stats.nba.com](http://stats.nba.com/) is a JavaScript-rendered site. As [discussed in our previous post on web-scraping](http://practicallypredictable.com/2017/11/21/introduction-to-web-scraping/#html-versus-javascript-rendered-pages), these sites work differently from HTML-rendered sites.\n", "\n", "We can still use the [Requests](http://docs.python-requests.org/en/master/) package to communicate with the web site, but we can't use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) like we did previously. BeautifulSoup is great for scraping HTML-rendered sites, but it doesn't work with JavaScript-rendered sites.\n", "\n", "Scraping the box scores isn't actually very hard, but it has one very tricky aspect. The URL to send the GET request isn't the same URL that you see in your browser. Remember that the whole point of JavaScript-rendered pages is the web site returns JavaScript code to your browser, which then runs the JavaScript to get the data to render in the browser window. The URL you want to scrape is really that second URL, which returns the data.\n", "\n", "Let's look at the specific links for the box scores. The URL for the link to the page we can view is:\n", "\n", "`https://stats.nba.com/teams/boxscores-traditional/?Season=2016-17&SeasonType=Regular%20Season`\n", "\n", "It's clear from the URL that this page has traditional box scores for the 2016-17 regular season. It's pretty clear that we can access different seasons by changing the _parameters_ at the end of the URL.\n", "\n", "Now here's the trick. The URL which returns the actual data is:\n", "\n", "`https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00`\n", "`&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0`\n", "`&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season`\n", "`&ShotClockRange=&VsConference=&VsDivision=`\n", "\n", "That's all actually one line, which I've broken up for clarity. You can [open up the a browser tab with the raw data by clicking this link](https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&VsConference=&VsDivision=). Notice that this URL has a lot of parameters.\n", "\n", "So, how did I figure out that this is the correct URL?\n", "\n", "### Use the Browser Developer Tools\n", "\n", "As mentioned in [the prior post on web scraping](http://practicallypredictable.com/2017/11/21/introduction-to-web-scraping/#use-your-browser8217s-developer-tools), your browser's developer tools make it easier to see what is going on. In that post, we used the **Elements** section of the developer tools to look at the different HTML tags. To scrape this JavaScript-rendered page, we will instead use the **Network** section. [Here are instructions for how to use these tools in Chrome](https://developers.google.com/web/tools/chrome-devtools/network-performance/reference). [Here are similar instructions for Firefox](https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor). I think these browsers are the easiest to use to understand JavaScript-rendered sites.\n", "\n", "Under the **Network** section, check out the **XHR** tab. _XHR_ is short for [XMLHttpRequest](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest), but the details don't matter. The point is that if you load (or reload) the page (with the \"normal\" URL above), you will see a few lines show up under the **XHR** tab, one of which starts with `teamgamelogs`. That's the item we want.\n", "\n", "If you select that `teamgamelogs` item, in either Chrome or Firefox, you can explore what's in that object. I'll move on the scraping now, but hopefully this brief introduction will serve to explain how you could figure this out for yourself on a different web page. As we go through the steps below, see if you can follow along and find the corresponding information in your browser's developer tools on that **XHR** tab.\n", "\n", "If you explore other pages on [stats.nba.com](https://stats.nba.com/) with the developer tools open to the **XHR** tab, you'll be able to apply these techniques to figure out how to scrape those other pages, too.\n", "\n", "### Scraping the Site\n", "\n", "Let's do all the necessary imports." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from itertools import chain\n", "from pathlib import Path\n", "from time import sleep\n", "from datetime import datetime" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to use Requests to scrape the JavaScript." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from tqdm import tqdm\n", "tqdm.monitor_interval = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use `pandas` and `numpy` to explore and organize the data toward the end." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "pd.options.display.max_rows = 999\n", "pd.options.display.max_columns = 999" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will store the data (both in raw from and processed for later analysis) on disk. The code below will create directories on your computer if they don't exist already. Feel free to change the directories below if you're running this code on your computer." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "PROJECT_DIR = Path.cwd().parent\n", "DATA_DIR = PROJECT_DIR / 'data' / 'scraped' / 'teamgamelogs'\n", "DATA_DIR.mkdir(exist_ok=True, parents=True)\n", "OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'\n", "OUTPUT_DIR.mkdir(exist_ok=True, parents=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "nbpresent": { "id": "55f833c3-57f7-4744-984a-7c3123cb79b1" } }, "outputs": [], "source": [ "USER_AGENT = (\n", " 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) ' +\n", " 'AppleWebKit/537.36 (KHTML, like Gecko) ' +\n", " 'Chrome/61.0.3163.100 Safari/537.36'\n", ")\n", "\n", "REQUEST_HEADERS = {\n", " 'user-agent': USER_AGENT,\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is the base URL for the team box scores. This type of URL is also called a [_web API endpoint_](https://en.wikipedia.org/wiki/Web_API#Endpoints). Recall that an API is an _application programming interface_, which makes it easier to communicate with a web site and obtain information.\n", "\n", "The NBA data defines the NBA itself as league '00'." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "NBA_URL = 'http://stats.nba.com/stats/teamgamelogs'\n", "NBA_ID = '00'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The site can give us information for the regular season, playoffs and pre-season." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "NBA_SEASON_TYPES = {\n", " 'regular': 'Regular Season',\n", " 'playoffs': 'Playoffs',\n", " 'preseason': 'Pre Season',\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To explore, we will start with the 2016-17 season, the most recent completed season. We will end up scraping all completed seasons since 1996-97." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "season = '2016-17'\n", "season_type = NBA_SEASON_TYPES['regular']" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "nba_params = {\n", " 'LeagueID': NBA_ID,\n", " 'Season': season,\n", " 'SeasonType': season_type,\n", "}" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "nbpresent": { "id": "5081b049-3625-4b66-877e-92a7911abd14" } }, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = requests.get(NBA_URL, params=nba_params, headers=REQUEST_HEADERS, allow_redirects=False, timeout=15)\n", "r.status_code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Figuring out the JSON\n", "\n", "The GET request worked. Let's look at what type of data was returned." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "nbpresent": { "id": "6aa2505f-a678-4be4-8cc2-cc74937a076f" } }, "outputs": [ { "data": { "text/plain": [ "'application/json; charset=utf-8'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r.headers['content-type']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data are in [_JavaScript Object Notation_](https://en.wikipedia.org/wiki/JSON), or JSON. Remember, if you want to look at the raw data in JSON format, you can [open up the a browser tab pointed to the web API endpoint by clicking this link](https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&VsConference=&VsDivision=), or you can explore with your browser's development tools.\n", "\n", "Requests makes it easy to process JSON." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "nbpresent": { "id": "a50de907-d262-44ca-87ef-798348329138" } }, "outputs": [ { "data": { "text/plain": [ "dict" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "json = r.json()\n", "type(json)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['resource', 'parameters', 'resultSets'])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "json.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see which parameters are available, and how they were filled in by the GET request." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'DateFrom': None,\n", " 'DateTo': None,\n", " 'GameSegment': None,\n", " 'LastNGames': None,\n", " 'LeagueID': '00',\n", " 'Location': None,\n", " 'MeasureType': None,\n", " 'Month': None,\n", " 'OppTeamID': None,\n", " 'Outcome': None,\n", " 'PORound': None,\n", " 'PerMode': None,\n", " 'Period': None,\n", " 'PlayerID': None,\n", " 'SeasonSegment': None,\n", " 'SeasonType': 'Regular Season',\n", " 'SeasonYear': '2016-17',\n", " 'ShotClockRange': None,\n", " 'TeamID': None,\n", " 'VsConference': None,\n", " 'VsDivision': None}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "json['parameters']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's get the results." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "nbpresent": { "id": "b3e8e21c-e69e-4d31-8b50-3cc6feabad1a" } }, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(json['resultSets'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(json['resultSets'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = json['resultSets'][0]\n", "type(results)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "nbpresent": { "id": "adec8d96-78d0-4682-9980-f80fa4519691" } }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['name', 'headers', 'rowSet'])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.keys()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "nbpresent": { "id": "d44a8df2-b15a-4869-b3b9-903f0b66401b" } }, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "headers = results['headers']\n", "type(headers)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "56" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(headers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 56 columns of data per row." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "nbpresent": { "id": "61c381e6-9404-4f55-a2a4-7de05e02ec20" } }, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rows = results['rowSet']\n", "type(rows)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "nbpresent": { "id": "c87ab975-75cd-481d-bd1d-8aa2d0341633" } }, "outputs": [ { "data": { "text/plain": [ "2460" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 30 NBA teams, and each team played 82 games in the 2016-17 season. Since 82 times 30 is 2460, there had better be 2460 rows in the results." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "nbpresent": { "id": "9ee47a21-13ec-4f9d-a557-beebc17d7792" } }, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(rows[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each row is itself a list. Let's see how many columns are in the row." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "56" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(rows[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, there are 56 columns per row, same as we saw for the column headers. Let's look at the first row in the results." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "nbpresent": { "id": "5b4203c7-1922-451d-a915-825bb2645787" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['2016-17', 1610612753, 'ORL', 'Orlando Magic', '0021601217', '2017-04-12T00:00:00', 'ORL vs. DET', 'W', 48.0, 42, 87, 0.483, 13, 25, 0.52, 16, 17, 0.941, 6, 45, 51, 27, 11.0, 7, 6, 3, 22, 15, 113, 4.0, 1, 1, 1, 1, 141, 597, 897, 787, 339, 1391, 120, 1370, 1908, 73, 2023, 38, 263, 419, 451, 1253, 537, 440, 1622, 2104, 621, 975]\n" ] } ], "source": [ "print(rows[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the Houston Rockets playing against the Minnesota Timberwolves in April 12, 2017. The Rockets won. To make sense of all the numbers, let's put things into a more useful framework.\n", "\n", "### Scraping and Saving the Raw Data\n", "\n", "First, let's put what we've learned about scraping the JSON into a function. We are going to store our data in a `pandas` `DataFrame` to make analysis easier." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def scrape_teamgamelogs(season, season_type, sleep_for=None):\n", " \"\"\"Process JSON from stats.nba.com teamgamelogs endpoint and return unformatted DataFrame.\"\"\"\n", " if sleep_for:\n", " sleep(sleep_for) # be nice to server by sleeping if we are scraping inside a loop\n", " nba_params = {\n", " 'LeagueID': NBA_ID,\n", " 'Season': season,\n", " 'SeasonType': season_type,\n", " }\n", " r = requests.get(\n", " NBA_URL,\n", " params=nba_params,\n", " headers=REQUEST_HEADERS,\n", " allow_redirects=False,\n", " timeout=15,\n", " )\n", " r.raise_for_status()\n", " results = r.json()['resultSets'][0]\n", " headers = results['headers']\n", " rows = results['rowSet']\n", " return pd.DataFrame(rows, columns=headers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's import when scraping data from the web to save it to your computer, especially if it's data that won't change in the future. There's no point scraping data again and again if it won't change. In this case, we are looking at NBA games from prior seasons, so once we scrape these games we are done.\n", "\n", "Let's call our scraping function from inside another function. This outer function will first check if a file already exists on our computer that has the data. If the file exists, we don't need to scrape the data; we can just load it from the file into a `DataFrame`.\n", "\n", "On the other hand, if we scrape the data, our function needs to save them to a file so we don't need to scrape the web site again. Also, our function has a parameter to overwrite the files on disk with freshly scraped data, in case we need or want to do that.\n", "\n", "It's important when scraping data (or collecting data of any kind) to save the raw results first. We are going to be doing a lot of processing to get the data into a useful format. We are going to end up saving those results too. However, if you have an error in your code and don't save the raw data first, you might not be able to figure out later where you went wrong. That's why it's good practice to save the raw data, so you can always start over and reconstruct your results. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def raw_teamgamelogs(season, season_type, data_dir=None, overwrite=False, sleep_for=None):\n", " \"\"\"Scrape stats.nba.com teamgamelogs or read from a CSV file if it exists.\"\"\"\n", " if data_dir:\n", " csv_filename = 'stats_nba_com-teamgamelogs-{season}-{season_type}.csv'.format(\n", " season=season.replace('-', '_'),\n", " season_type=season_type.replace(' ', '_').lower(),\n", " )\n", " csvfile = data_dir.joinpath(csv_filename)\n", " else:\n", " csvfile = None\n", " if csvfile and not overwrite and csvfile.exists():\n", " df = pd.read_csv(csvfile)\n", " else:\n", " df = scrape_teamgamelogs(season, season_type, sleep_for)\n", " if csvfile:\n", " df.to_csv(csvfile, index=False)\n", " return df " ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2460, 56)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw = raw_teamgamelogs(season, season_type, data_dir=DATA_DIR)\n", "raw.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that this function has also saved the raw data for us behind the scenes. If we call it again for this season and season type, it will read it from the file, rather than scrape it from the web site." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "nbpresent": { "id": "90a8e768-00b4-46a7-9588-3794fba86f36" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SEASON_YEARTEAM_IDTEAM_ABBREVIATIONTEAM_NAMEGAME_IDGAME_DATEMATCHUPWLMINFGMFGAFG_PCTFG3MFG3AFG3_PCTFTMFTAFT_PCTOREBDREBREBASTTOVSTLBLKBLKAPFPFDPTSPLUS_MINUSGP_RANKW_RANKL_RANKW_PCT_RANKMIN_RANKFGM_RANKFGA_RANKFG_PCT_RANKFG3M_RANKFG3A_RANKFG3_PCT_RANKFTM_RANKFTA_RANKFT_PCT_RANKOREB_RANKDREB_RANKREB_RANKAST_RANKTOV_RANKSTL_RANKBLK_RANKBLKA_RANKPF_RANKPFD_RANKPTS_RANKPLUS_MINUS_RANK
02016-171610612745HOUHouston Rockets216012242017-04-12T00:00:00HOU vs. MINW48.045970.46419560.33914220.6361631473710.0117314151235.01111141244113111219314001704123622161281551625102802353244401592104164898
12016-171610612737ATLAtlanta Hawks216012262017-04-12T00:00:00ATL @ INDL48.030720.41711290.37915190.789829371920.0547151486-18.0112311231123114123752396188668780095615311657110115921862204017622250190412131925241222123242220
22016-171610612749MILMilwaukee Bucks216012192017-04-12T00:00:00MIL @ BOSL48.037840.4406230.26114170.824934432617.0693201794-18.0112311231123114114971342152620081700206117041908748134210341203533185715941094401170175720232220
32016-171610612760OKCOklahoma City Thunder216012252017-04-12T00:00:00OKC vs. DENL48.0431010.4268250.32011170.6471437512012.010882417105-6.01123112311231141460451757149113911591209719082190307533263158466839619521382005175712421564
42016-171610612746LACLA Clippers216012282017-04-12T00:00:00LAC vs. SACW48.043830.51810280.35719250.760113243296.0461151811520.0111114146014953309209081184899870142984413751203231192142537402411532504165
\n", "
" ], "text/plain": [ " SEASON_YEAR TEAM_ID TEAM_ABBREVIATION TEAM_NAME GAME_ID \\\n", "0 2016-17 1610612745 HOU Houston Rockets 21601224 \n", "1 2016-17 1610612737 ATL Atlanta Hawks 21601226 \n", "2 2016-17 1610612749 MIL Milwaukee Bucks 21601219 \n", "3 2016-17 1610612760 OKC Oklahoma City Thunder 21601225 \n", "4 2016-17 1610612746 LAC LA Clippers 21601228 \n", "\n", " GAME_DATE MATCHUP WL MIN FGM FGA FG_PCT FG3M FG3A \\\n", "0 2017-04-12T00:00:00 HOU vs. MIN W 48.0 45 97 0.464 19 56 \n", "1 2017-04-12T00:00:00 ATL @ IND L 48.0 30 72 0.417 11 29 \n", "2 2017-04-12T00:00:00 MIL @ BOS L 48.0 37 84 0.440 6 23 \n", "3 2017-04-12T00:00:00 OKC vs. DEN L 48.0 43 101 0.426 8 25 \n", "4 2017-04-12T00:00:00 LAC vs. SAC W 48.0 43 83 0.518 10 28 \n", "\n", " FG3_PCT FTM FTA FT_PCT OREB DREB REB AST TOV STL BLK BLKA PF \\\n", "0 0.339 14 22 0.636 16 31 47 37 10.0 11 7 3 14 \n", "1 0.379 15 19 0.789 8 29 37 19 20.0 5 4 7 15 \n", "2 0.261 14 17 0.824 9 34 43 26 17.0 6 9 3 20 \n", "3 0.320 11 17 0.647 14 37 51 20 12.0 10 8 8 24 \n", "4 0.357 19 25 0.760 11 32 43 29 6.0 4 6 1 15 \n", "\n", " PFD PTS PLUS_MINUS GP_RANK W_RANK L_RANK W_PCT_RANK MIN_RANK \\\n", "0 15 123 5.0 1 1 1 1 141 \n", "1 14 86 -18.0 1 1231 1231 1231 141 \n", "2 17 94 -18.0 1 1231 1231 1231 141 \n", "3 17 105 -6.0 1 1231 1231 1231 141 \n", "4 18 115 20.0 1 1 1 1 141 \n", "\n", " FGM_RANK FGA_RANK FG_PCT_RANK FG3M_RANK FG3A_RANK FG3_PCT_RANK \\\n", "0 244 113 1112 19 3 1400 \n", "1 2375 2396 1886 687 800 956 \n", "2 1497 1342 1526 2008 1700 2061 \n", "3 460 45 1757 1491 1391 1591 \n", "4 460 1495 330 920 908 1184 \n", "\n", " FTM_RANK FTA_RANK FT_PCT_RANK OREB_RANK DREB_RANK REB_RANK AST_RANK \\\n", "0 1704 1236 2216 128 1551 625 10 \n", "1 1531 1657 1101 1592 1862 2040 1762 \n", "2 1704 1908 748 1342 1034 1203 533 \n", "3 2097 1908 2190 307 533 263 1584 \n", "4 899 870 1429 844 1375 1203 231 \n", "\n", " TOV_RANK STL_RANK BLK_RANK BLKA_RANK PF_RANK PFD_RANK PTS_RANK \\\n", "0 280 235 324 440 159 2104 164 \n", "1 2250 1904 1213 1925 241 2221 2324 \n", "2 1857 1594 109 440 1170 1757 2023 \n", "3 668 396 195 2138 2005 1757 1242 \n", "4 19 2142 537 40 241 1532 504 \n", "\n", " PLUS_MINUS_RANK \n", "0 898 \n", "1 2220 \n", "2 2220 \n", "3 1564 \n", "4 165 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning up the Columns\n", "\n", "Now we have the 2016-17 regular season games inside a `pandas` `DataFrame`. Let's see what the columns are." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['SEASON_YEAR', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',\n", " 'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M',\n", " 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST',\n", " 'TOV', 'STL', 'BLK', 'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS',\n", " 'GP_RANK', 'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK', 'FGM_RANK',\n", " 'FGA_RANK', 'FG_PCT_RANK', 'FG3M_RANK', 'FG3A_RANK', 'FG3_PCT_RANK',\n", " 'FTM_RANK', 'FTA_RANK', 'FT_PCT_RANK', 'OREB_RANK', 'DREB_RANK',\n", " 'REB_RANK', 'AST_RANK', 'TOV_RANK', 'STL_RANK', 'BLK_RANK', 'BLKA_RANK',\n", " 'PF_RANK', 'PFD_RANK', 'PTS_RANK', 'PLUS_MINUS_RANK'],\n", " dtype='object')" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We don't need a number of these columns. All of the '\\_RANK' columns are designed to help the browser sort the table by different columns. The data in the '\\_RANK' columns are just the sort orders for each row, if the user wants to sort the results by that particular column. (For example, the first row is the 244th row by field goals made 'FGM_RANK').\n", "\n", "We are going to throw away all these '\\_RANK' columns, since we can sort in `pandas` by any column we wish anyway. We are also going to throw away the '\\_PCT' columns, since we can compute any percentages we wish if we need them. For example, 'FG_PCT' = 'FGM' / 'FGA' for field goals, and similarly for three-point field goals ('FG3') and free throws ('FT').\n", "\n", "We can through away a few other columsn too. The first is 'PLUS_MINUS', which is just margin of victory (i.e., winner points minus loser points). We can compute this later from the actual points. Next is 'REB' = 'OREB' + 'DREB'. The last two are 'BLKA' (blocks allowed) and 'PFD' (personal fouls drawn). These are stats that apply to the opponent. As we will soon see, there are better ways to get this information.\n", "\n", "Let's write a function to exclude the columns we don't need." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def drop_raw_columns(df):\n", " cols = [col for col in df.columns if '_RANK' not in col]\n", " cols = [col for col in cols if '_PCT' not in col]\n", " cols = [col for col in cols if col not in ['PLUS_MINUS', 'REB', 'BLKA', 'PFD']]\n", " return df[cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's write a function to rename the columns to tidy things up." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def rename_raw_columns(df):\n", " df = df.rename(columns={\n", " 'SEASON_YEAR': 'season',\n", " 'TEAM_ABBREVIATION': 'team',\n", " 'GAME_DATE': 'date',\n", " })\n", " df.columns = df.columns.str.lower()\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we apply these functions to the data, here's what we'd get so far. Let's look at the column names and the data types." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['season', 'team_id', 'team', 'team_name', 'game_id', 'date', 'matchup',\n", " 'wl', 'min', 'fgm', 'fga', 'fg3m', 'fg3a', 'ftm', 'fta', 'oreb', 'dreb',\n", " 'ast', 'tov', 'stl', 'blk', 'pf', 'pts'],\n", " dtype='object')" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rename_raw_columns(drop_raw_columns(raw)).columns" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "season object\n", "team_id int64\n", "team object\n", "team_name object\n", "game_id int64\n", "date object\n", "matchup object\n", "wl object\n", "min float64\n", "fgm int64\n", "fga int64\n", "fg3m int64\n", "fg3a int64\n", "ftm int64\n", "fta int64\n", "oreb int64\n", "dreb int64\n", "ast int64\n", "tov float64\n", "stl int64\n", "blk int64\n", "pf int64\n", "pts int64\n", "dtype: object" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rename_raw_columns(drop_raw_columns(raw)).dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For some reason, game minutes ('MIN') and turnovers ('TOV') are floating-point, not integers. Let's keep things simple and round game minutes to the nearest minute. Also, let's make sure the game date is a `datetime` type." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def type_raw_columns(df):\n", " df['date'] = pd.to_datetime(df['date'])\n", " df['min'] = df['min'].round().astype(int) # round minutes\n", " df['tov'] = df['tov'].astype(int)\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, let's reorder the columns to make the table a little easier to read." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "def reorder_raw_columns(df):\n", " first_cols = [\n", " 'season',\n", " 'date',\n", " 'team',\n", " 'matchup',\n", " 'wl',\n", " 'pts',\n", " 'min',\n", " ]\n", " last_cols = [\n", " 'game_id',\n", " 'team_id',\n", " 'team_name',\n", " ]\n", " cols = (\n", " first_cols +\n", " [col for col in df.columns if col not in (first_cols+last_cols)] +\n", " last_cols\n", " )\n", " return df[cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can put all these steps in one function to clean up the raw data." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "def formatted_columns(df):\n", " \"\"\"Formatted stats.nba.com teamgamelogs DataFrame.\"\"\"\n", " # Order must be drop -> rename -> others don't matter\n", " df = drop_raw_columns(df)\n", " df = rename_raw_columns(df)\n", " df = type_raw_columns(df)\n", " df = reorder_raw_columns(df)\n", " return df" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seasondateteammatchupwlptsminfgmfgafg3mfg3aftmftaorebdrebasttovstlblkpfgame_idteam_idteam_name
02016-172017-04-12HOUHOU vs. MINW123484597195614221631371011714216012241610612745Houston Rockets
12016-172017-04-12ATLATL @ INDL864830721129151982919205415216012261610612737Atlanta Hawks
22016-172017-04-12MILMIL @ BOSL94483784623141793426176920216012191610612749Milwaukee Bucks
32016-172017-04-12OKCOKC vs. DENL105484310182511171437201210824216012251610612760Oklahoma City Thunder
42016-172017-04-12LACLAC vs. SACW1154843831028192511322964615216012281610612746LA Clippers
\n", "
" ], "text/plain": [ " season date team matchup wl pts min fgm fga fg3m fg3a \\\n", "0 2016-17 2017-04-12 HOU HOU vs. MIN W 123 48 45 97 19 56 \n", "1 2016-17 2017-04-12 ATL ATL @ IND L 86 48 30 72 11 29 \n", "2 2016-17 2017-04-12 MIL MIL @ BOS L 94 48 37 84 6 23 \n", "3 2016-17 2017-04-12 OKC OKC vs. DEN L 105 48 43 101 8 25 \n", "4 2016-17 2017-04-12 LAC LAC vs. SAC W 115 48 43 83 10 28 \n", "\n", " ftm fta oreb dreb ast tov stl blk pf game_id team_id \\\n", "0 14 22 16 31 37 10 11 7 14 21601224 1610612745 \n", "1 15 19 8 29 19 20 5 4 15 21601226 1610612737 \n", "2 14 17 9 34 26 17 6 9 20 21601219 1610612749 \n", "3 11 17 14 37 20 12 10 8 24 21601225 1610612760 \n", "4 19 25 11 32 29 6 4 6 15 21601228 1610612746 \n", "\n", " team_name \n", "0 Houston Rockets \n", "1 Atlanta Hawks \n", "2 Milwaukee Bucks \n", "3 Oklahoma City Thunder \n", "4 LA Clippers " ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = formatted_columns(raw)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Figuring out the Matchups\n", "\n", "Now we're getting somewhere. What we need to do now is figure out how to represent the matchup information in a more useful way. It's pretty clear that the '@' sign means an away game, while 'vs.' is a home game. We can use `pandas` to split apart the matchup information into a more useful representation.\n", "\n", "Let's try out a few things." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 True\n", "3 False\n", "4 False\n", "Name: matchup, dtype: bool" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['matchup'].str.contains('@').head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [HOU, vs., MIN]\n", "1 [ATL, @, IND]\n", "2 [MIL, @, BOS]\n", "3 [OKC, vs., DEN]\n", "4 [LAC, vs., SAC]\n", "Name: matchup, dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['matchup'].str.split(' ').head()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 MIN\n", "1 IND\n", "2 BOS\n", "3 DEN\n", "4 SAC\n", "Name: matchup, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['matchup'].str.split(' ').str.get(-1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can put this together into a new function which will remove the 'matchup' column and replace it with two new, more useful columns. The first is whether the game is home ('H') or away ('A'). The second is the opponent abbreviation." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "def parse_matchup(df):\n", " \"\"\"Add more useful columns based upon matchup information.\"\"\"\n", " df['ha'] = np.where(df['matchup'].str.contains('@'), 'A', 'H')\n", " df['opp'] = df['matchup'].str.split(' ').str.get(-1)\n", " # Put new columns where matchup used to be, and drop matchup\n", " cols = []\n", " for col in df.columns:\n", " if col not in ['matchup', 'ha', 'opp']:\n", " cols.append(col)\n", " elif col == 'matchup':\n", " cols.append('ha')\n", " cols.append('opp')\n", " return df[cols]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seasondateteamhaoppwlptsminfgmfgafg3mfg3aftmftaorebdrebasttovstlblkpfgame_idteam_idteam_name
02016-172017-04-12HOUHMINW123484597195614221631371011714216012241610612745Houston Rockets
12016-172017-04-12ATLAINDL864830721129151982919205415216012261610612737Atlanta Hawks
22016-172017-04-12MILABOSL94483784623141793426176920216012191610612749Milwaukee Bucks
32016-172017-04-12OKCHDENL105484310182511171437201210824216012251610612760Oklahoma City Thunder
42016-172017-04-12LACHSACW1154843831028192511322964615216012281610612746LA Clippers
\n", "
" ], "text/plain": [ " season date team ha opp wl pts min fgm fga fg3m fg3a ftm \\\n", "0 2016-17 2017-04-12 HOU H MIN W 123 48 45 97 19 56 14 \n", "1 2016-17 2017-04-12 ATL A IND L 86 48 30 72 11 29 15 \n", "2 2016-17 2017-04-12 MIL A BOS L 94 48 37 84 6 23 14 \n", "3 2016-17 2017-04-12 OKC H DEN L 105 48 43 101 8 25 11 \n", "4 2016-17 2017-04-12 LAC H SAC W 115 48 43 83 10 28 19 \n", "\n", " fta oreb dreb ast tov stl blk pf game_id team_id \\\n", "0 22 16 31 37 10 11 7 14 21601224 1610612745 \n", "1 19 8 29 19 20 5 4 15 21601226 1610612737 \n", "2 17 9 34 26 17 6 9 20 21601219 1610612749 \n", "3 17 14 37 20 12 10 8 24 21601225 1610612760 \n", "4 25 11 32 29 6 4 6 15 21601228 1610612746 \n", "\n", " team_name \n", "0 Houston Rockets \n", "1 Atlanta Hawks \n", "2 Milwaukee Bucks \n", "3 Oklahoma City Thunder \n", "4 LA Clippers " ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parse_matchup(df).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More Data Cleanup\n", "\n", "Now things are looking good. Let's put all this formatting into a new function, which will return the formatted `DataFrame`. We will also use the 'game_id' as the index for the `DataFrame`. Notice that the 'game_id' is an integer. This is an internal identifier that the web site database is using to uniquely identify each NBA game. We should keep track of this identifier, since we might want to go back to get more data on this game in the future. For instance, we might want to get shot charts or possession-by-possession information. We would need that 'game_id' in order to scrape the more detailed game data." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "def formatted_teamgamelogs(df):\n", " \"\"\"Formatted stats.nba.com teamgamelogs DataFrame from raw DataFrame.\"\"\"\n", " df = formatted_columns(df)\n", " df = parse_matchup(df)\n", " return df.set_index('game_id')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seasondateteamhaoppwlptsminfgmfgafg3mfg3aftmftaorebdrebasttovstlblkpfteam_idteam_name
game_id
216012242016-172017-04-12HOUHMINW1234845971956142216313710117141610612745Houston Rockets
216012262016-172017-04-12ATLAINDL8648307211291519829192054151610612737Atlanta Hawks
216012192016-172017-04-12MILABOSL944837846231417934261769201610612749Milwaukee Bucks
216012252016-172017-04-12OKCHDENL1054843101825111714372012108241610612760Oklahoma City Thunder
216012282016-172017-04-12LACHSACW11548438310281925113229646151610612746LA Clippers
\n", "
" ], "text/plain": [ " season date team ha opp wl pts min fgm fga fg3m fg3a \\\n", "game_id \n", "21601224 2016-17 2017-04-12 HOU H MIN W 123 48 45 97 19 56 \n", "21601226 2016-17 2017-04-12 ATL A IND L 86 48 30 72 11 29 \n", "21601219 2016-17 2017-04-12 MIL A BOS L 94 48 37 84 6 23 \n", "21601225 2016-17 2017-04-12 OKC H DEN L 105 48 43 101 8 25 \n", "21601228 2016-17 2017-04-12 LAC H SAC W 115 48 43 83 10 28 \n", "\n", " ftm fta oreb dreb ast tov stl blk pf team_id \\\n", "game_id \n", "21601224 14 22 16 31 37 10 11 7 14 1610612745 \n", "21601226 15 19 8 29 19 20 5 4 15 1610612737 \n", "21601219 14 17 9 34 26 17 6 9 20 1610612749 \n", "21601225 11 17 14 37 20 12 10 8 24 1610612760 \n", "21601228 19 25 11 32 29 6 4 6 15 1610612746 \n", "\n", " team_name \n", "game_id \n", "21601224 Houston Rockets \n", "21601226 Atlanta Hawks \n", "21601219 Milwaukee Bucks \n", "21601225 Oklahoma City Thunder \n", "21601228 LA Clippers " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games = formatted_teamgamelogs(raw)\n", "games.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see what happens if we select all rows in the `DataFrame` with that first 'game_id'." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seasondateteamhaoppwlptsminfgmfgafg3mfg3aftmftaorebdrebasttovstlblkpfteam_idteam_name
game_id
216012242016-172017-04-12HOUHMINW1234845971956142216313710117141610612745Houston Rockets
216012242016-172017-04-12MINAHOUL118484891112311151039371573151610612750Minnesota Timberwolves
\n", "
" ], "text/plain": [ " season date team ha opp wl pts min fgm fga fg3m fg3a \\\n", "game_id \n", "21601224 2016-17 2017-04-12 HOU H MIN W 123 48 45 97 19 56 \n", "21601224 2016-17 2017-04-12 MIN A HOU L 118 48 48 91 11 23 \n", "\n", " ftm fta oreb dreb ast tov stl blk pf team_id \\\n", "game_id \n", "21601224 14 22 16 31 37 10 11 7 14 1610612745 \n", "21601224 11 15 10 39 37 15 7 3 15 1610612750 \n", "\n", " team_name \n", "game_id \n", "21601224 Houston Rockets \n", "21601224 Minnesota Timberwolves " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games.loc[21601224]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interesting. Each game has two rows in the `DataFrame`, one for each team. One of the teams is home, and the other is away. One of the teams won, and the other lost. Each team has their box score recorded on its own row.\n", "\n", "This explains why we could drop the columns 'BLKA' and 'PFD' earlier. For example, the 'BLKA' from the Rockets' row is the same as the 'BLK' column from the Timberwolves' row. The 'BLKA' and 'PFD' information is redundant.\n", "\n", "Let's look at how manu unique 'game_id' values there are. You can think of these as unique matchups." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1230" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(games.index.unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are half as many matchups as there are team games played. That's because there are two teams per matchup." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1230, 23)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "home_games = games[games['ha'] == 'H']\n", "home_games.shape" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1230, 23)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "away_games = games[games['ha'] == 'A']\n", "away_games.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a New Table of Matchups\n", "\n", "What we really want for analysis is a table of matchups. We want to predict matchups, so that is the right unit of data. We can build a new table of matchups, where each matchup is identified by the 'game_id'. We can also keep track of which team was the home eam, and which was the away team. We can also put both teams' box scores on the same row, as long as we keep track of which stats belong to the home team and which to the away team." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
season_hdate_hteam_hha_hopp_hwl_hpts_hmin_hfgm_hfga_hfg3m_hfg3a_hftm_hfta_horeb_hdreb_hast_htov_hstl_hblk_hpf_hteam_id_hteam_name_hseason_adate_ateam_aha_aopp_awl_apts_amin_afgm_afga_afg3m_afg3a_aftm_afta_aoreb_adreb_aast_atov_astl_ablk_apf_ateam_id_ateam_name_a
game_id
216012242016-172017-04-12HOUHMINW1234845971956142216313710117141610612745Houston Rockets2016-172017-04-12MINAHOUL118484891112311151039371573151610612750Minnesota Timberwolves
216012252016-172017-04-12OKCHDENL1054843101825111714372012108241610612760Oklahoma City Thunder2016-172017-04-12DENAOKCW1114839857252632538221768171610612743Denver Nuggets
216012282016-172017-04-12LACHSACW11548438310281925113229646151610612746LA Clippers2016-172017-04-12SACALACL954840815111014730271121181610612758Sacramento Kings
216012182016-172017-04-12CLEHTORL83482876103717241032201557141610612739Cleveland Cavaliers2016-172017-04-12TORACLEW984840868261016123620982201610612761Toronto Raptors
216012302016-172017-04-12PORHNOPL10048418513305131439222542181610612757Portland Trail Blazers2016-172017-04-12NOPAPORW103484090824152111312011133171610612740New Orleans Pelicans
\n", "
" ], "text/plain": [ " season_h date_h team_h ha_h opp_h wl_h pts_h min_h fgm_h \\\n", "game_id \n", "21601224 2016-17 2017-04-12 HOU H MIN W 123 48 45 \n", "21601225 2016-17 2017-04-12 OKC H DEN L 105 48 43 \n", "21601228 2016-17 2017-04-12 LAC H SAC W 115 48 43 \n", "21601218 2016-17 2017-04-12 CLE H TOR L 83 48 28 \n", "21601230 2016-17 2017-04-12 POR H NOP L 100 48 41 \n", "\n", " fga_h fg3m_h fg3a_h ftm_h fta_h oreb_h dreb_h ast_h tov_h \\\n", "game_id \n", "21601224 97 19 56 14 22 16 31 37 10 \n", "21601225 101 8 25 11 17 14 37 20 12 \n", "21601228 83 10 28 19 25 11 32 29 6 \n", "21601218 76 10 37 17 24 10 32 20 15 \n", "21601230 85 13 30 5 13 14 39 22 25 \n", "\n", " stl_h blk_h pf_h team_id_h team_name_h season_a \\\n", "game_id \n", "21601224 11 7 14 1610612745 Houston Rockets 2016-17 \n", "21601225 10 8 24 1610612760 Oklahoma City Thunder 2016-17 \n", "21601228 4 6 15 1610612746 LA Clippers 2016-17 \n", "21601218 5 7 14 1610612739 Cleveland Cavaliers 2016-17 \n", "21601230 4 2 18 1610612757 Portland Trail Blazers 2016-17 \n", "\n", " date_a team_a ha_a opp_a wl_a pts_a min_a fgm_a fga_a \\\n", "game_id \n", "21601224 2017-04-12 MIN A HOU L 118 48 48 91 \n", "21601225 2017-04-12 DEN A OKC W 111 48 39 85 \n", "21601228 2017-04-12 SAC A LAC L 95 48 40 81 \n", "21601218 2017-04-12 TOR A CLE W 98 48 40 86 \n", "21601230 2017-04-12 NOP A POR W 103 48 40 90 \n", "\n", " fg3m_a fg3a_a ftm_a fta_a oreb_a dreb_a ast_a tov_a stl_a \\\n", "game_id \n", "21601224 11 23 11 15 10 39 37 15 7 \n", "21601225 7 25 26 32 5 38 22 17 6 \n", "21601228 5 11 10 14 7 30 27 11 2 \n", "21601218 8 26 10 16 12 36 20 9 8 \n", "21601230 8 24 15 21 11 31 20 11 13 \n", "\n", " blk_a pf_a team_id_a team_name_a \n", "game_id \n", "21601224 3 15 1610612750 Minnesota Timberwolves \n", "21601225 8 17 1610612743 Denver Nuggets \n", "21601228 1 18 1610612758 Sacramento Kings \n", "21601218 2 20 1610612761 Toronto Raptors \n", "21601230 3 17 1610612740 New Orleans Pelicans " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "home_games.join(away_games, lsuffix='_h', rsuffix='_a').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This new `DataFrame` is the right next step, but we need to clean it up a bit. First, we don't need two win/loss columns. We only need one, telling us whether the home team won or the away team won." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "def home_away_win(row):\n", " home = row['wl_h']\n", " away = row['wl_a']\n", " assert home in ['W', 'L']\n", " assert away in ['W', 'L']\n", " assert home != away\n", " if home == 'W':\n", " return 'H'\n", " else:\n", " return 'A'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can drop a bunch of redundant columns and rename the ones that apply to the matchup itself, rather than particular teams." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "def drop_matchup_columns(df):\n", " return df.drop([\n", " 'wl_h',\n", " 'wl_a',\n", " 'season_a',\n", " 'date_a',\n", " 'min_a',\n", " 'ha_h',\n", " 'opp_h',\n", " 'ha_a',\n", " 'opp_a',\n", " ], axis='columns')" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "def rename_matchup_columns(df):\n", " return df.rename(columns={\n", " 'season_h': 'season',\n", " 'date_h': 'date',\n", " 'min_h': 'min',\n", " })" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "def reorder_matchup_columns(df):\n", " \"\"\"Column order for game statistics DataFrame, alternating home and away stats.\"\"\"\n", " first_cols = ['season', 'date', 'won', 'min',]\n", " raw_cols = [col.replace('_h', '') for col in df.columns if col.endswith('_h')]\n", " cols = list(chain.from_iterable((col+'_h', col+'_a') for col in raw_cols))\n", " return df[first_cols + cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Putting everything together in one place, here is a function which will create the matchups and reformat the resulting `DataFrame`. It will also add a column storing which team (home or away) won the matchup." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "def matchups_from_teamgamelogs(df):\n", " \"\"\"DataFrame with one unique game_id per row and team stats identified by home and away teams.\"\"\"\n", " home_games = df[df['ha'] == 'H']\n", " away_games = df[df['ha'] == 'A']\n", " matchups = home_games.join(away_games, lsuffix='_h', rsuffix='_a')\n", " # Add new 'won' column: 'H' if home time wins or 'A' if away team wins\n", " matchups['won'] = matchups.apply(home_away_win, axis='columns')\n", " matchups = drop_matchup_columns(matchups)\n", " matchups = rename_matchup_columns(matchups)\n", " matchups = reorder_matchup_columns(matchups)\n", " return matchups" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seasondatewonminteam_hteam_apts_hpts_afgm_hfgm_afga_hfga_afg3m_hfg3m_afg3a_hfg3a_aftm_hftm_afta_hfta_aoreb_horeb_adreb_hdreb_aast_hast_atov_htov_astl_hstl_ablk_hblk_apf_hpf_ateam_id_hteam_id_ateam_name_hteam_name_a
game_id
216012242016-172017-04-12H48HOUMIN123118454897911911562314112215161031393737101511773141516106127451610612750Houston RocketsMinnesota Timberwolves
216012252016-172017-04-12A48OKCDEN1051114339101858725251126173214537382022121710688241716106127601610612743Oklahoma City ThunderDenver Nuggets
216012282016-172017-04-12H48LACSAC1159543408381105281119102514117323029276114261151816106127461610612758LA ClippersSacramento Kings
216012182016-172017-04-12A48CLETOR8398284076861083726171024161012323620201595872142016106127391610612761Cleveland CavaliersToronto Raptors
216012302016-172017-04-12A48PORNOP1001034140859013830245151321141139312220251141323181716106127571610612740Portland Trail BlazersNew Orleans Pelicans
\n", "
" ], "text/plain": [ " season date won min team_h team_a pts_h pts_a fgm_h \\\n", "game_id \n", "21601224 2016-17 2017-04-12 H 48 HOU MIN 123 118 45 \n", "21601225 2016-17 2017-04-12 A 48 OKC DEN 105 111 43 \n", "21601228 2016-17 2017-04-12 H 48 LAC SAC 115 95 43 \n", "21601218 2016-17 2017-04-12 A 48 CLE TOR 83 98 28 \n", "21601230 2016-17 2017-04-12 A 48 POR NOP 100 103 41 \n", "\n", " fgm_a fga_h fga_a fg3m_h fg3m_a fg3a_h fg3a_a ftm_h ftm_a \\\n", "game_id \n", "21601224 48 97 91 19 11 56 23 14 11 \n", "21601225 39 101 85 8 7 25 25 11 26 \n", "21601228 40 83 81 10 5 28 11 19 10 \n", "21601218 40 76 86 10 8 37 26 17 10 \n", "21601230 40 85 90 13 8 30 24 5 15 \n", "\n", " fta_h fta_a oreb_h oreb_a dreb_h dreb_a ast_h ast_a tov_h \\\n", "game_id \n", "21601224 22 15 16 10 31 39 37 37 10 \n", "21601225 17 32 14 5 37 38 20 22 12 \n", "21601228 25 14 11 7 32 30 29 27 6 \n", "21601218 24 16 10 12 32 36 20 20 15 \n", "21601230 13 21 14 11 39 31 22 20 25 \n", "\n", " tov_a stl_h stl_a blk_h blk_a pf_h pf_a team_id_h \\\n", "game_id \n", "21601224 15 11 7 7 3 14 15 1610612745 \n", "21601225 17 10 6 8 8 24 17 1610612760 \n", "21601228 11 4 2 6 1 15 18 1610612746 \n", "21601218 9 5 8 7 2 14 20 1610612739 \n", "21601230 11 4 13 2 3 18 17 1610612757 \n", "\n", " team_id_a team_name_h team_name_a \n", "game_id \n", "21601224 1610612750 Houston Rockets Minnesota Timberwolves \n", "21601225 1610612743 Oklahoma City Thunder Denver Nuggets \n", "21601228 1610612758 LA Clippers Sacramento Kings \n", "21601218 1610612761 Cleveland Cavaliers Toronto Raptors \n", "21601230 1610612740 Portland Trail Blazers New Orleans Pelicans " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matchups_from_teamgamelogs(games).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One last step is to set some categorical data types and reorder the columns. Categories make a `pandas` `DataFrame` faster and more memory-efficient for columns that only take on a few possible values. For instance, the 'won' column can only be 'W' or 'L'. The function below sets certain columns to be categorical data. It also makes sure the box score stats are integers." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "def format_matchups(df):\n", " df['season'] = df['season'].astype('category')\n", " df['season_type'] = df['season_type'].astype('category')\n", " df['won'] = df['won'].astype('category')\n", " df['team_h'] = df['team_h'].astype('category')\n", " df['team_a'] = df['team_a'].astype('category')\n", " # Set all 'object' columns to int (except for team names)\n", " for col in df.columns:\n", " if 'team' not in col and (col.endswith('_h') or col.endswith('_a')):\n", " if df[col].dtype == 'object':\n", " df[col] = df[col].astype(int)\n", " first_cols = ['season', 'season_type', 'date', 'won']\n", " cols = first_cols + [col for col in df.columns if col not in first_cols]\n", " return df[cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Processing an Entire NBA Season\n", "\n", "Now we are almost done. We want a function which can scrape (or read from a file) the raw data for a given season, for all possible season types (regular, playoffs or pre-season). This function should then process the raw data and create the matchups for that season. The rows for each season type should be stored so we can tell the regular season games from the other types later." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "def nba_season_matchups(season=None, data_dir=None, overwrite=False):\n", " \"\"\"All NBA matchups for a given season (regular, playoffs and pre-season).\"\"\" \n", " matchups = None\n", " for season_type in NBA_SEASON_TYPES:\n", " df = raw_teamgamelogs(\n", " season,\n", " NBA_SEASON_TYPES[season_type],\n", " data_dir,\n", " overwrite,\n", " sleep_for=2,\n", " )\n", " if len(df) == 0:\n", " # no rows came back; this is probably a pre-season with no data, so just continue\n", " continue\n", " df = formatted_teamgamelogs(df)\n", " df = matchups_from_teamgamelogs(df)\n", " df['season_type'] = season_type\n", " if matchups is None:\n", " matchups = df.copy()\n", " else:\n", " matchups = matchups.append(df)\n", " return format_matchups(matchups)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can run our master function on the entire 2016-17 season." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "regular 1230\n", "preseason 102\n", "playoffs 79\n", "Name: season_type, dtype: int64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matchups = nba_season_matchups(season, data_dir=DATA_DIR)\n", "matchups['season_type'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we already new, there are 1230 regular season matchups. We also see we got the pre-season and playoff matchups as well." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
seasonseason_typedatewonminteam_hteam_apts_hpts_afgm_hfgm_afga_hfga_afg3m_hfg3m_afg3a_hfg3a_aftm_hftm_afta_hfta_aoreb_horeb_adreb_hdreb_aast_hast_atov_htov_astl_hstl_ablk_hblk_apf_hpf_ateam_id_hteam_id_ateam_name_hteam_name_a
game_id
216012242016-17regular2017-04-12H48HOUMIN123118454897911911562314112215161031393737101511773141516106127451610612750Houston RocketsMinnesota Timberwolves
216012252016-17regular2017-04-12A48OKCDEN1051114339101858725251126173214537382022121710688241716106127601610612743Oklahoma City ThunderDenver Nuggets
216012282016-17regular2017-04-12H48LACSAC1159543408381105281119102514117323029276114261151816106127461610612758LA ClippersSacramento Kings
216012182016-17regular2017-04-12A48CLETOR8398284076861083726171024161012323620201595872142016106127391610612761Cleveland CavaliersToronto Raptors
216012302016-17regular2017-04-12A48PORNOP1001034140859013830245151321141139312220251141323181716106127571610612740Portland Trail BlazersNew Orleans Pelicans
\n", "
" ], "text/plain": [ " season season_type date won min team_h team_a pts_h pts_a \\\n", "game_id \n", "21601224 2016-17 regular 2017-04-12 H 48 HOU MIN 123 118 \n", "21601225 2016-17 regular 2017-04-12 A 48 OKC DEN 105 111 \n", "21601228 2016-17 regular 2017-04-12 H 48 LAC SAC 115 95 \n", "21601218 2016-17 regular 2017-04-12 A 48 CLE TOR 83 98 \n", "21601230 2016-17 regular 2017-04-12 A 48 POR NOP 100 103 \n", "\n", " fgm_h fgm_a fga_h fga_a fg3m_h fg3m_a fg3a_h fg3a_a ftm_h \\\n", "game_id \n", "21601224 45 48 97 91 19 11 56 23 14 \n", "21601225 43 39 101 85 8 7 25 25 11 \n", "21601228 43 40 83 81 10 5 28 11 19 \n", "21601218 28 40 76 86 10 8 37 26 17 \n", "21601230 41 40 85 90 13 8 30 24 5 \n", "\n", " ftm_a fta_h fta_a oreb_h oreb_a dreb_h dreb_a ast_h ast_a \\\n", "game_id \n", "21601224 11 22 15 16 10 31 39 37 37 \n", "21601225 26 17 32 14 5 37 38 20 22 \n", "21601228 10 25 14 11 7 32 30 29 27 \n", "21601218 10 24 16 10 12 32 36 20 20 \n", "21601230 15 13 21 14 11 39 31 22 20 \n", "\n", " tov_h tov_a stl_h stl_a blk_h blk_a pf_h pf_a team_id_h \\\n", "game_id \n", "21601224 10 15 11 7 7 3 14 15 1610612745 \n", "21601225 12 17 10 6 8 8 24 17 1610612760 \n", "21601228 6 11 4 2 6 1 15 18 1610612746 \n", "21601218 15 9 5 8 7 2 14 20 1610612739 \n", "21601230 25 11 4 13 2 3 18 17 1610612757 \n", "\n", " team_id_a team_name_h team_name_a \n", "game_id \n", "21601224 1610612750 Houston Rockets Minnesota Timberwolves \n", "21601225 1610612743 Oklahoma City Thunder Denver Nuggets \n", "21601228 1610612758 LA Clippers Sacramento Kings \n", "21601218 1610612761 Cleveland Cavaliers Toronto Raptors \n", "21601230 1610612740 Portland Trail Blazers New Orleans Pelicans " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matchups.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting All Completed NBA Seasons\n", "\n", "The next step is to get all the completed seasons since 1996-97 (the earliest season for which the web site has advanced stats). To do this, all we need to do is call our `nba_season_matchups()` function above for each completed season. Let's write a few small funcitions to help us figure out what is the latest completed NBA season." ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "def nba_season_from_year(y):\n", " \"\"\"Convert integer year into stats.nba.com season string.\"\"\"\n", " return str(y) + '-' + str(int(y)+1)[2:]" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "def nba_year_from_season(s):\n", " \"\"\"Convert stats.nba.com season string into integer year\"\"\"\n", " return int(s[:4])" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "def current_nba_season_year():\n", " \"\"\"Year in which current NBA season began.\"\"\"\n", " now = datetime.now()\n", " if now.month > 8:\n", " return now.year\n", " else:\n", " return now.year - 1" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "def current_nba_season():\n", " \"\"\"Current NBA season.\"\"\"\n", " return nba_season_from_year(current_nba_season_year())" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "def completed_nba_season_year():\n", " \"\"\"Year in which most recent completed NBA season began.\"\"\"\n", " return current_nba_season_year() - 1" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "def completed_nba_season():\n", " \"\"\"Most recent completed NBA season.\"\"\"\n", " return nba_season_from_year(completed_nba_season_year())" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2016-17'" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "completed_nba_season()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's easy to just call our season-by-season scraper and collect all the results in one large `DataFrame`. Notice that we are using the `tqdm` package here. This package creates a \"progress bar\" to show you what percentage of an iteration has been accomplished, along with some timing information. This is helpful when scraping, since scraping can be slow." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "def completed_nba_season_matchups(data_dir=None, overwrite=False):\n", " \"\"\"Get NBA matchups for completed seasons.\"\"\"\n", " end_year = completed_nba_season_year()\n", " seasons = [nba_season_from_year(y) for y in range(1996, end_year+1)]\n", " matchups = None\n", " for season in tqdm(seasons):\n", " df = nba_season_matchups(season, data_dir, overwrite)\n", " if matchups is None:\n", " matchups = df.copy()\n", " else:\n", " matchups = matchups.append(df)\n", " return format_matchups(matchups)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 21/21 [00:19<00:00, 1.06it/s]\n" ] } ], "source": [ "matchups = completed_nba_season_matchups(data_dir=DATA_DIR)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(26812, 39)" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matchups.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sanity-Checking the Data\n", "\n", "We have almost 27,000 matchups in our `DataFrame`. Let's break things out by season and type of game to make sure they make sense." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
season_typeplayoffspreseasonregularAll
season
1996-9772011891261
1997-9871011891260
1998-99660725791
1999-0075011891264
2000-0171011891260
2001-0271011891260
2002-0388011891277
2003-0482011891271
2004-0584012301314
2005-0689012301319
2006-0779012301309
2007-0886012301316
2008-0985012301315
2009-1082012301312
2010-1181012301311
2011-128409901074
2012-1385012291314
2013-1489012301319
2014-158111812301429
2015-168610912301425
2016-177910212301411
All16863292479726812
\n", "
" ], "text/plain": [ "season_type playoffs preseason regular All\n", "season \n", "1996-97 72 0 1189 1261\n", "1997-98 71 0 1189 1260\n", "1998-99 66 0 725 791\n", "1999-00 75 0 1189 1264\n", "2000-01 71 0 1189 1260\n", "2001-02 71 0 1189 1260\n", "2002-03 88 0 1189 1277\n", "2003-04 82 0 1189 1271\n", "2004-05 84 0 1230 1314\n", "2005-06 89 0 1230 1319\n", "2006-07 79 0 1230 1309\n", "2007-08 86 0 1230 1316\n", "2008-09 85 0 1230 1315\n", "2009-10 82 0 1230 1312\n", "2010-11 81 0 1230 1311\n", "2011-12 84 0 990 1074\n", "2012-13 85 0 1229 1314\n", "2013-14 89 0 1230 1319\n", "2014-15 81 118 1230 1429\n", "2015-16 86 109 1230 1425\n", "2016-17 79 102 1230 1411\n", "All 1686 329 24797 26812" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matchups.pivot_table(\n", " index='season',\n", " columns='season_type',\n", " values='team_h',\n", " aggfunc='count',\n", " fill_value=0,\n", " margins=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We only have pre-season games since the 2014-15 season.\n", "\n", "Looking at the season totals, you can see the effect of the lockouts in 1998-99 and 2011-12. You can also see the impact of the league expansion from 29 to 30 teams in 2004 (with the addition of Charlotte). See [the history of NBA seasons here](https://en.wikipedia.org/wiki/List_of_National_Basketball_Association_seasons).\n", "\n", "You might think we are done, but there is one complication that we need to work through. This shows the importance of looking carefully at your data. It's hard to go through almost 27,000 rows by eye, so you need to think carefully and make sure things make sense.\n", "\n", "### Checking the Teams\n", "\n", "One thing we can and should check is: what teams are represented in our data?" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "def unique_home_away_teams(df, ha):\n", " \"\"\"Unique home or away teams (as specified) from matchups DataFrame.\"\"\"\n", " ha = ha.lower()\n", " if ha not in ['h', 'a']:\n", " raise ValueError('invalid home or away symbol')\n", " team_abbr = 'team_' + ha\n", " team_id = 'team_id_' + ha\n", " team_name = 'team_name_' + ha\n", " return (\n", " df[[team_abbr, team_id, team_name]]\n", " .reset_index()\n", " .drop(columns=['game_id'])\n", " .drop_duplicates()\n", " .sort_values(team_id)\n", " .rename(\n", " columns={\n", " team_abbr: 'team',\n", " team_id: 'team_id',\n", " team_name: 'team_name',\n", " }\n", " )\n", " .set_index('team_id')\n", " )" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "def unique_teams(df):\n", " \"\"\"Unique teams in matchups DataFrame.\"\"\"\n", " teams = unique_home_away_teams(df, 'h')\n", " teams.append(unique_home_away_teams(df, 'a'))\n", " return teams.drop_duplicates().sort_index()" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
teamteam_name
team_id
94MLNMilano Olimpia Milano
12304FCBFC Barcelona Lassa
12315RMDReal Madrid
12315RMDMadrid Real Madrid
12321FBUIstanbul Fenerbahce Ulker
12323ALBBerlin Alba Berlin
12325FLAFlamengo Flamengo
1610612737ATLAtlanta Hawks
1610612738BOSBoston Celtics
1610612739CLECleveland Cavaliers
1610612740NOKNew Orleans Hornets
1610612740NOPNew Orleans Pelicans
1610612740NOHNew Orleans Hornets
1610612741CHIChicago Bulls
1610612742DALDallas Mavericks
1610612743DENDenver Nuggets
1610612744GSWGolden State Warriors
1610612745HOUHouston Rockets
1610612746LACLos Angeles Clippers
1610612746LACLA Clippers
1610612747LALLos Angeles Lakers
1610612748MIAMiami Heat
1610612749MILMilwaukee Bucks
1610612750MINMinnesota Timberwolves
1610612751NJNNew Jersey Nets
1610612751BKNBrooklyn Nets
1610612752NYKNew York Knicks
1610612753ORLOrlando Magic
1610612754INDIndiana Pacers
1610612755PHIPhiladelphia 76ers
1610612756PHXPhoenix Suns
1610612757PORPortland Trail Blazers
1610612758SACSacramento Kings
1610612759SASSan Antonio Spurs
1610612760OKCOklahoma City Thunder
1610612760SEASeattle SuperSonics
1610612761TORToronto Raptors
1610612762UTAUtah Jazz
1610612763MEMMemphis Grizzlies
1610612763VANVancouver Grizzlies
1610612764WASWashington Bullets
1610612764WASWashington Wizards
1610612765DETDetroit Pistons
1610612766CHACharlotte Bobcats
1610612766CHACharlotte Hornets
1610612766CHHCharlotte Hornets
\n", "
" ], "text/plain": [ " team team_name\n", "team_id \n", "94 MLN Milano Olimpia Milano\n", "12304 FCB FC Barcelona Lassa\n", "12315 RMD Real Madrid\n", "12315 RMD Madrid Real Madrid\n", "12321 FBU Istanbul Fenerbahce Ulker\n", "12323 ALB Berlin Alba Berlin\n", "12325 FLA Flamengo Flamengo\n", "1610612737 ATL Atlanta Hawks\n", "1610612738 BOS Boston Celtics\n", "1610612739 CLE Cleveland Cavaliers\n", "1610612740 NOK New Orleans Hornets\n", "1610612740 NOP New Orleans Pelicans\n", "1610612740 NOH New Orleans Hornets\n", "1610612741 CHI Chicago Bulls\n", "1610612742 DAL Dallas Mavericks\n", "1610612743 DEN Denver Nuggets\n", "1610612744 GSW Golden State Warriors\n", "1610612745 HOU Houston Rockets\n", "1610612746 LAC Los Angeles Clippers\n", "1610612746 LAC LA Clippers\n", "1610612747 LAL Los Angeles Lakers\n", "1610612748 MIA Miami Heat\n", "1610612749 MIL Milwaukee Bucks\n", "1610612750 MIN Minnesota Timberwolves\n", "1610612751 NJN New Jersey Nets\n", "1610612751 BKN Brooklyn Nets\n", "1610612752 NYK New York Knicks\n", "1610612753 ORL Orlando Magic\n", "1610612754 IND Indiana Pacers\n", "1610612755 PHI Philadelphia 76ers\n", "1610612756 PHX Phoenix Suns\n", "1610612757 POR Portland Trail Blazers\n", "1610612758 SAC Sacramento Kings\n", "1610612759 SAS San Antonio Spurs\n", "1610612760 OKC Oklahoma City Thunder\n", "1610612760 SEA Seattle SuperSonics\n", "1610612761 TOR Toronto Raptors\n", "1610612762 UTA Utah Jazz\n", "1610612763 MEM Memphis Grizzlies\n", "1610612763 VAN Vancouver Grizzlies\n", "1610612764 WAS Washington Bullets\n", "1610612764 WAS Washington Wizards\n", "1610612765 DET Detroit Pistons\n", "1610612766 CHA Charlotte Bobcats\n", "1610612766 CHA Charlotte Hornets\n", "1610612766 CHH Charlotte Hornets" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unique_teams(matchups)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whoa, right away we see that we have some non-NBA teams represented, from pre-season exhibition-type games. We will exclude those in a moment, since we only want to have NBA matchups in our analysis.\n", "\n", "That's the easy part. The harder part is that team names and abbreviations have changed over time. Let's see what's going on in more detail.\n", "\n", "First, there are some simple cases that aren't too difficult to understand. The Clippers changed their name in 2015 from the Los Angeles Clippers to the LA Clippers. The abbreviation and 'team_id' didn't change, however. No big deal. There's a similar situation with Washington.\n", "\n", "For a team move like the Thunder, you have to keep track of the abbreviation change. Although the 'team_id' stays the same, the abbrevation changes from 'SEA' to 'OKC' reflecting the team move from Seattle to Oklahoma City in 2008. There's a similar situation with the Grizzlies and the Nets.\n", "\n", "The most complicated situation is with Charlotte and New Orleans.\n", "\n", "### Hornets and Pelicans\n", "\n", "As [descibed in detail here](https://en.wikipedia.org/wiki/Charlotte_Hornets), the Charlotte Hornets moved to New Orleans in 2002. The Charlotte Bobcats joined the NBA as an expansion team in 2004. Prior to the 2014-15 season, the Charlotte Bobcats and the New Orleans Hornets agreed to a deal whereby New Orleans would become the Pelicans and the Hornets name and history would revert to Charlotte. This explains why the abbreviation 'CHA' is used for both the Charlotte Bobcats and the Charlotte Hornets, in addition to the 'CHH' abbreviation for the other Charlotte Hornets (the team that moved in 2002 and ultimately became the Pelicans).\n", "\n", "Another complication is that [because of Hurricane Katrina](https://en.wikipedia.org/wiki/Effect_of_Hurricane_Katrina_on_the_New_Orleans_Hornets), the New Orleans Hornets had to relocate for much of the 2005-6 and 2006-7 seasons to Oklahoma City. That's why there is an abbreviation 'NOK' to represent the games that were played under the label New Orleans/Oklahoma City Hornets.\n", "\n", "Let's sort this out and make sure we understand this in detail.\n", "\n", "First, let build a table with the current 30 NBA teams, so we can keep track of historical abbreviations and how they map to current teams. Remember that the 'team_id' is meant to be a unique identifier for each team, even if the abbreviation or team name changes." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "def current_teams(df):\n", " \"\"\"Team abbreviations for the most recent completed NBA season.\"\"\"\n", " current_teams = (\n", " df.loc[\n", " (df['season'] == completed_nba_season()) & (df['season_type'] == 'regular'),\n", " ['team_h', 'team_id_h']\n", " ]\n", " .drop_duplicates()\n", " .sort_values(['team_id_h'])\n", " .rename(\n", " columns={\n", " 'team_h': 'curr_team',\n", " 'team_id_h': 'team_id',\n", " }\n", " )\n", " .set_index('team_id')\n", " )\n", " return current_teams" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "30" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(current_teams(matchups))" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "teams = unique_teams(matchups).merge(current_teams(matchups), left_index=True, right_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's build a table of any temas that have had a name change or an abbreviation change." ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
teamcurr_team
team_id
1610612740NOKNOP
1610612740NOPNOP
1610612740NOHNOP
1610612746LACLAC
1610612746LACLAC
1610612751NJNBKN
1610612751BKNBKN
1610612760OKCOKC
1610612760SEAOKC
1610612763MEMMEM
1610612763VANMEM
1610612764WASWAS
1610612764WASWAS
1610612766CHACHA
1610612766CHACHA
1610612766CHHCHA
\n", "
" ], "text/plain": [ " team curr_team\n", "team_id \n", "1610612740 NOK NOP\n", "1610612740 NOP NOP\n", "1610612740 NOH NOP\n", "1610612746 LAC LAC\n", "1610612746 LAC LAC\n", "1610612751 NJN BKN\n", "1610612751 BKN BKN\n", "1610612760 OKC OKC\n", "1610612760 SEA OKC\n", "1610612763 MEM MEM\n", "1610612763 VAN MEM\n", "1610612764 WAS WAS\n", "1610612764 WAS WAS\n", "1610612766 CHA CHA\n", "1610612766 CHA CHA\n", "1610612766 CHH CHA" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "team_changes = teams[['team', 'curr_team']].groupby('curr_team').filter(lambda team: len(team) > 1)\n", "team_changes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's look at how many games, per season, were played by various Charlotte and New Orleans franchises." ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "regular_season_summary = (\n", " matchups.loc[\n", " matchups['season_type'] == 'regular',\n", " ['season', 'team_h', 'team_a']\n", " ].pivot_table(\n", " index='season',\n", " columns='team_h',\n", " values='team_a',\n", " aggfunc='count',\n", " fill_value=0,\n", " margins=True,\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
team_hCHACHHNOHNOKNOP
season
1996-97041000
1997-98041000
1998-99025000
1999-00041000
2000-01041000
2001-02041000
2002-03004100
2003-04004100
2004-054104100
2005-064100410
2006-074100410
2007-084104100
2008-094104100
2009-104104100
2010-114104100
2011-123303300
2012-134104100
2013-144100041
2014-154100041
2015-164100041
2016-174100041
All52523036182164
\n", "
" ], "text/plain": [ "team_h CHA CHH NOH NOK NOP\n", "season \n", "1996-97 0 41 0 0 0\n", "1997-98 0 41 0 0 0\n", "1998-99 0 25 0 0 0\n", "1999-00 0 41 0 0 0\n", "2000-01 0 41 0 0 0\n", "2001-02 0 41 0 0 0\n", "2002-03 0 0 41 0 0\n", "2003-04 0 0 41 0 0\n", "2004-05 41 0 41 0 0\n", "2005-06 41 0 0 41 0\n", "2006-07 41 0 0 41 0\n", "2007-08 41 0 41 0 0\n", "2008-09 41 0 41 0 0\n", "2009-10 41 0 41 0 0\n", "2010-11 41 0 41 0 0\n", "2011-12 33 0 33 0 0\n", "2012-13 41 0 41 0 0\n", "2013-14 41 0 0 0 41\n", "2014-15 41 0 0 0 41\n", "2015-16 41 0 0 0 41\n", "2016-17 41 0 0 0 41\n", "All 525 230 361 82 164" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regular_season_summary.loc[:, ['CHA', 'CHH', 'NOH', 'NOK', 'NOP']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This table makes it clear that 'CHH' is the original Charlotte Hornets, who became the New Orleans Hornets (and the New Orleans/Oklahoma City Hornets), before becoming the New Orleans Pelicans ('NOP'). The abbreviation 'CHA' refers to the Bobcats, who then became the new Charlotte Hornets in the 2014-15 season.\n", "\n", "### \"Fixing\" the Team Data\n", "\n", "What might surprise you about this, is that **the NBA team data are wrong**!\n", "\n", "If you look at the team table above, you'll see that 'CHH' has the same 'team_id' as 'CHA'. Now, you may think this is correct, since the Bobcats and the New Orleans Hornets did a deal returning the history of the original Charlotte Hornets ('CHH') to Charlotte. That's why 'CHH' and 'CHA' have the same 'team_id'.\n", "\n", "On the other hand, if you goal is to track the performance of group of players, as their franchise changed cities, this is incorrect. The guys who played on the 'CHH' squad in the 2001-2 season should be connected with the 'NOH' squad in the 2002-3 season.\n", "\n", "The NBA team data as recorded don't reflect that.\n", "\n", "To keep track of this, we will do something that you should never do lightly. We are going to override our team data table to make sure the 'CHH' games are grouped with the current Pelicans franchise. We are effectively going to undo the Charlotte/New Orleans deal to transfer the Hornets history." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "teams.loc[teams['team'] =='CHH', 'curr_team'] = 'NOP'" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
teamteam_namecurr_team
team_id
1610612740NOKNew Orleans HornetsNOP
1610612740NOPNew Orleans PelicansNOP
1610612740NOHNew Orleans HornetsNOP
1610612766CHACharlotte BobcatsCHA
1610612766CHACharlotte HornetsCHA
1610612766CHHCharlotte HornetsNOP
\n", "
" ], "text/plain": [ " team team_name curr_team\n", "team_id \n", "1610612740 NOK New Orleans Hornets NOP\n", "1610612740 NOP New Orleans Pelicans NOP\n", "1610612740 NOH New Orleans Hornets NOP\n", "1610612766 CHA Charlotte Bobcats CHA\n", "1610612766 CHA Charlotte Hornets CHA\n", "1610612766 CHH Charlotte Hornets NOP" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "teams.loc[teams['team'].isin(['CHA', 'CHH', 'NOH', 'NOK', 'NOP']), :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, we're not overriding the raw data. But it's important to keep track anytime you change data.\n", "\n", "Let's write a function to record what we're doing." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "def team_info(df):\n", " \"\"\"DataFrame of NBA team IDs, unadjusted abbreviations, names and adjusted abbrevations.\"\"\"\n", " teams = unique_teams(df).merge(current_teams(df), left_index=True, right_index=True)\n", " teams.loc[teams['team'] =='CHH', 'curr_team'] = 'NOP' # OVERRIDING Hornets/Pelicans\n", " return teams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, we will write a function to exclude non-NBA matchups. Again, these were pre-season exhibition-style games. To exclude those games all we need to do is filter out any teams whose 'team_id' isn't one of the current 30 NBA teams." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "def exclude_non_nba_matchups(df):\n", " \"\"\"Filter out matchups involving non-NBA teams.\"\"\"\n", " nba_teams = current_teams(df)\n", " non_nba_matchups = list(\n", " df.loc[~df['team_id_h'].isin(nba_teams.index)].index\n", " )\n", " non_nba_matchups.extend(list(\n", " df.loc[~df['team_id_a'].isin(nba_teams.index)].index)\n", " )\n", " return df.loc[~df.index.isin(non_nba_matchups), :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test this to see how many games get filtered out." ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(26812, 39)" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matchups.shape" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(26787, 39)" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exclude_non_nba_matchups(matchups).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function filtered out 25 games.\n", "\n", "Let's put all these team cleanup steps together. First, we will write a function to give us a mapping between the historical team abbreviation, and the \"correct\" current team abbreviation. In this case, \"correct\" means adjusted for the Hornets/Pelicans issue." ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "def team_mapping(df):\n", " \"\"\"Dict mapping historical NBA team abbreviation to abbreviation of current franchise.\"\"\"\n", " teams = team_info(df) # this creates the lookup table with Hornets/Pelicans issue corrected\n", " list_of = teams.to_dict(orient='list')\n", " return dict(zip(list_of['team'], list_of['curr_team']))" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "def update_matchups(df):\n", " \"\"\"Prepare matchups for use in analytics.\"\"\"\n", " nba_teams = current_teams(df)\n", " df = exclude_non_nba_matchups(df) # filters any matchups that include non-NBA teams\n", " abbr_to_current = team_mapping(df)\n", " home_map = df['team_h'].map(abbr_to_current).rename('team_curr_h')\n", " away_map = df['team_a'].map(abbr_to_current).rename('team_curr_a')\n", " # Add new columns for current team abbreviations \n", " df = pd.concat([df.copy(), home_map, away_map], axis='columns').reset_index()\n", " df['team_curr_h'] = df['team_curr_h'].astype('category')\n", " df['team_curr_a'] = df['team_curr_a'].astype('category')\n", " return df.set_index('game_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the above function doesn't actually replace the team abbreviations in our matchups table. Rather, it creates two new columns for the current teams (home and away). That way, we can see what the original information was and make sure it maps correctly in our analysis. This is safer than just overwriting the old team abbreviations.\n", "\n", "We will use the current team abbreviations in building our matchup prediction models. This will allow us to track changes in team quality over time without getting confused by name changes or franchise moves.\n", "\n", "One last step is to write a function that will save our final, processed `DataFrame` for future analysis. This `DataFrame` is relatively large, and we've put some effort into formatting it. So, instead of using CSV, we'll use another built-in format that `pandas` can write for us: Python's [`pickle`](https://docs.python.org/3/library/pickle.html) format. The `pickle` format is an efficient, compressed format that will store all the information in our `DataFrame`, including formatting. \n", "\n", "As noted in the `pickle` documentation, you should be very careful about unpickling an object that you obtained from the Internet or via email. Only unpickle objects that you have previously pickled yourself or you obtained from somebody you really, really trust. By the way, it turns out that hackers have even [figured out ways to inject malware into CSV files](http://georgemauer.net/2017/10/07/csv-injection.html), so the general and best advice is: always be very careful opening any type of file that you've downloaded or received by email!" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "def save_matchups(df, output_dir):\n", " \"\"\"Save pickle file of NBA matchups prepared for analytics.\"\"\"\n", " seasons = list(df['season'].unique())\n", " start_season = min(seasons).replace('-', '_')\n", " end_season = max(seasons).replace('-', '_')\n", " PKL_TEMPLATE = 'stats_nba_com-matchups-{start_season}-{end_season}'\n", " pklfilename = (\n", " PKL_TEMPLATE.format(start_season=start_season, end_season=end_season) +\n", " '.pkl'\n", " )\n", " pklfile = output_dir.joinpath(pklfilename)\n", " if pklfile.exists():\n", " timestamp = str(datetime.now().strftime(\"%Y-%m-%d-%H-%M\"))\n", " backupfilename = (\n", " PKL_TEMPLATE.format(start_season=start_season, end_season=end_season) +\n", " '.bak.{timestamp}'.format(timestamp=timestamp) +\n", " '.pkl'\n", " )\n", " backupfile = output_dir.joinpath(backupfilename)\n", " pklfile.rename(backupfile)\n", " df.to_pickle(pklfile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finishing Up\n", "\n", "Now, we can finally put everything together into one master function which reads all the matchups for since 1996 and saves it to a `pickle` file, ready for later analysis." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "def nba_stats_matchups(data_dir=None, output_dir=None, overwrite=False):\n", " \"\"\"Get and prepare stats.nba.com matchups for all completed seasons since 1996.\"\"\"\n", " df = completed_nba_season_matchups(data_dir=data_dir, overwrite=overwrite)\n", " df = update_matchups(df)\n", " if output_dir:\n", " save_matchups(df, output_dir)\n", " return df" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 21/21 [00:16<00:00, 1.30it/s]\n" ] } ], "source": [ "matchups = nba_stats_matchups(data_dir=DATA_DIR, output_dir=OUTPUT_DIR, overwrite=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To conclude the data gathering, here's a summary of all the information we've obtained." ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 26787 entries, 29601187 to 11600001\n", "Data columns (total 41 columns):\n", "season 26787 non-null category\n", "season_type 26787 non-null category\n", "date 26787 non-null datetime64[ns]\n", "won 26787 non-null category\n", "min 26787 non-null int64\n", "team_h 26787 non-null category\n", "team_a 26787 non-null category\n", "pts_h 26787 non-null int64\n", "pts_a 26787 non-null int64\n", "fgm_h 26787 non-null int64\n", "fgm_a 26787 non-null int64\n", "fga_h 26787 non-null int64\n", "fga_a 26787 non-null int64\n", "fg3m_h 26787 non-null int64\n", "fg3m_a 26787 non-null int64\n", "fg3a_h 26787 non-null int64\n", "fg3a_a 26787 non-null int64\n", "ftm_h 26787 non-null int64\n", "ftm_a 26787 non-null int64\n", "fta_h 26787 non-null int64\n", "fta_a 26787 non-null int64\n", "oreb_h 26787 non-null int64\n", "oreb_a 26787 non-null int64\n", "dreb_h 26787 non-null int64\n", "dreb_a 26787 non-null int64\n", "ast_h 26787 non-null int64\n", "ast_a 26787 non-null int64\n", "tov_h 26787 non-null int64\n", "tov_a 26787 non-null int64\n", "stl_h 26787 non-null int64\n", "stl_a 26787 non-null int64\n", "blk_h 26787 non-null int64\n", "blk_a 26787 non-null int64\n", "pf_h 26787 non-null int64\n", "pf_a 26787 non-null int64\n", "team_id_h 26787 non-null int64\n", "team_id_a 26787 non-null int64\n", "team_name_h 26787 non-null object\n", "team_name_a 26787 non-null object\n", "team_curr_h 26787 non-null category\n", "team_curr_a 26787 non-null category\n", "dtypes: category(7), datetime64[ns](1), int64(31), object(2)\n", "memory usage: 7.3+ MB\n" ] } ], "source": [ "matchups.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've accomplished a lot. Now that we have data, we're ready to move on to the matchup predictions in future posts. To conclude, let's look at one important fact we can immediately learn from the data.\n", "\n", "### A Quick Look at Home Court Advantage in the NBA\n", "\n", "Let's calculate what home court has been worth in terms of regular season win percentage between the 1996-97 and 2016-17 seasons." ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "H 14830\n", "A 9967\n", "Name: won, dtype: int64" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matchups.loc[matchups['season_type'] == 'regular', 'won'].value_counts()" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "def count_home_away_wins(df, ha, season_type='regular'):\n", " return df.loc[df['season_type'] == season_type, 'won'].value_counts()[ha]" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "home_wins = count_home_away_wins(matchups, 'H')\n", "away_wins = count_home_away_wins(matchups, 'A')" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.59805621647779972" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "home_wins/(home_wins+away_wins)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.40194378352220028" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "away_wins/(home_wins+away_wins)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The home team has won about 60% of the time on average during the NBA regular season. Any prediction model is going to have to address this fact. We'll look a lot more closely at this topic in future posts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:sports_py36]", "language": "python", "name": "conda-env-sports_py36-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "nbpresent": { "slides": {}, "themes": { "default": "7d3c7744-074d-4952-8785-e246175997fa", "theme": { "7d3c7744-074d-4952-8785-e246175997fa": { "backgrounds": { "backgroundColor": { "background-color": "backgroundColor", "id": "backgroundColor" } }, "id": "7d3c7744-074d-4952-8785-e246175997fa", "palette": { "backgroundColor": { "id": "backgroundColor", "rgb": [ 34, 34, 34 ] }, "headingColor": { "id": "headingColor", "rgb": [ 256, 256, 256 ] }, "linkColor": { "id": "linkColor", "rgb": [ 66, 175, 250 ] }, "mainColor": { "id": "mainColor", "rgb": [ 256, 256, 256 ] } }, "rules": { "a": { "color": "linkColor" }, "h1": { "color": "headingColor", "font-family": "Source Sans Pro", "font-size": 5.25 }, "h2": { "color": "headingColor", "font-family": "Source Sans Pro", "font-size": 4 }, "h3": { "color": "headingColor", "font-family": "Source Sans Pro", "font-size": 3.5 }, "h4": { "color": "headingColor", "font-family": "Source Sans Pro", "font-size": 3 }, "h5": { "color": "headingColor", "font-family": "Source Sans Pro" }, "h6": { "color": "headingColor", "font-family": "Source Sans Pro" }, "h7": { "color": "headingColor", "font-family": "Source Sans Pro" }, "li": { "color": "mainColor", "font-family": "Source Sans Pro", "font-size": 6 }, "p": { "color": "mainColor", "font-family": "Source Sans Pro", "font-size": 6 } }, "text-base": { "color": "mainColor", "font-family": "Source Sans Pro", "font-size": 6 } }, "f553e7a0-6b02-4aec-bca5-34769eabef9c": { "backgrounds": { "backgroundColor": { "background-color": "backgroundColor", "id": "backgroundColor" } }, "id": "f553e7a0-6b02-4aec-bca5-34769eabef9c", "palette": { "backgroundColor": { "id": "backgroundColor", "rgb": [ 34, 34, 34 ] }, "headingColor": { "id": "headingColor", "rgb": [ 256, 256, 256 ] }, "linkColor": { "id": "linkColor", "rgb": [ 66, 175, 250 ] }, "mainColor": { "id": "mainColor", "rgb": [ 256, 256, 256 ] } }, "rules": { "a": { "color": "linkColor" }, "h1": { "color": "headingColor", "font-family": "Source Sans Pro", "font-size": 5.25 }, "h2": { "color": "headingColor", "font-family": "Source Sans Pro", "font-size": 4 }, "h3": { "color": "headingColor", "font-family": "Source Sans Pro", "font-size": 3.5 }, "h4": { "color": "headingColor", "font-family": "Source Sans Pro", "font-size": 3 }, "h5": { "color": "headingColor", "font-family": "Source Sans Pro" }, "h6": { "color": "headingColor", "font-family": "Source Sans Pro" }, "h7": { "color": "headingColor", "font-family": "Source Sans Pro" }, "li": { "color": "mainColor", "font-family": "Source Sans Pro", "font-size": 6 }, "p": { "color": "mainColor", "font-family": "Source Sans Pro", "font-size": 6 } }, "text-base": { "color": "mainColor", "font-family": "Source Sans Pro", "font-size": 6 } } } } } }, "nbformat": 4, "nbformat_minor": 2 }