{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Checking Retrosheet and Lahman Data Consistency\n", "\n", "An incredible amount of work has gone into making the open-source Retrosheet and Lahman data sets as accurate as possible, but they are not perfectly accurate.\n", "\n", "Prior to performing any data analysis, it is necessary to check the accuracy of the data. The limitations of the data will define the limitations of the analysis.\n", "\n", "Performing data consistency tests is also a great way to validate the data processing pipeline and to learn about the data.\n", "\n", "The pytest test suite supplied in this repo automatically runs a superset of the tests performed in this notebook. The purpose of this notebook is to explain the key data consistency tests in more detail to better understand the data.\n", "\n", "A brief summary of the wrangled data is provided below at: [MLB Data Summary](#MLB-Data-Summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Consistency Tests\n", "Prior to performing the data consistency tests, it is necessary to verify that the fields which should uniquely identify a record, actually do. This is performed in the section called [Primary and Foreign Key tests](#Primary-and-Foreign-Key-Tests).\n", "\n", "Three types of data consistency tests will be performed. \n", "\n", "**1. Retrosheet Stats vs Lahman Stats** \n", "* aggregate Retrosheet individual batting/pitching/fielding stats per game to season level \n", "* compare with Lahman's individual batting/pitching/fielding season stats\n", "\n", "**2. Individual Stats vs Team Stats** \n", "\n", "**Retrosheet** \n", "* aggregate Retrosheet individual batting/pitching/fielding stats per game to team level\n", "* compare with Retrosheet team stats per game\n", "\n", "Notes: \n", "* Retrosheet individual batting/pitching/fielding stats per game were wrangled from the results of using the cwdaily parser on the Retrosheet play-by-play data. These correspond to the batting/pitching/fielding csv files in the Retrosheet wrangled directory.\n", "* Retrosheet team batting/pitching/fielding stats per game were wrangled from the results of using the cwgame parser on the Retrosheet play-by-play data. These correspond to the team_game csv file in the Retrosheet wrangled directory.\n", "\n", "**Lahman** \n", "* aggregate individual batting/pitching/fielding stats per season to team level\n", "* compare with team stats per season\n", "\n", "**3. Batting Stats vs Pitching Allowed Stats** \n", "\n", "**Retrosheet** \n", "* aggregate individual batting stats per game and pitching stats per game to ensure that the pitchers are those that were faced by the batters\n", "* compare batting stats with pitcher allowed stats \n", " * for example, the number of home runs hit by batters should equal the number of home runs allowed by pitchers\n", "\n", "**Lahman** \n", "* aggregate individual batting stats per season and pitching stats per season to ensure that the pitchers are those that were faced by the batters\n", "* compare batting stats with pitcher allowed stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Consistency Tolerances\n", "All data between 1974 and 2019 inclusive will be used. Retrosheet has all play-by-play data over this time period.\n", "\n", "The data consistency tests use several different hard-coded tolerance values: \n", "* relative tolerances range between 0% and 1%\n", "* absolute tolerances range between 0 and 2\n", "\n", "It is reasonable to expect that:\n", "* data accuracy for new data, such as the upcoming 2020 season, will be similar to the data accuracy between 1974 and 2019\n", "* with these tight data tolerances, the data wrangling scripts must have worked correctly" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Tests\n", "[Primary and Foreign Key Tests](#Primary-and-Foreign-Key-Tests) \n", "\n", "**Retrosheet vs Lahman** \n", "[Retrosheet Batting vs Lahman Batting Players](#Retrosheet-Batting-vs-Lahman-Batting-Players) \n", "[Retrosheet Batting vs Lahman Batting Stats](#Retrosheet-Batting-vs-Lahman-Batting-Stats) \n", "[Retrosheet Pitching vs Lahman Pitching Players](#Retrosheet-Pitching-vs-Lahman-Pitching-Players) \n", "[Retrosheet Pitching vs Lahman Pitching Stats](#Retrosheet-Pitching-vs-Lahman-Pitching-Stats) \n", "[Retrosheet Fielding vs Lahman Fielding Players](#Retrosheet-Fielding-vs-Lahman-Fielding-Players) \n", "[Retrosheet Fielding vs Lahman Fielding Stats](#Retrosheet-Fielding-vs-Lahman-Fielding-Stats) \n", "\n", "**Batting vs Pitching Allowed** \n", "[Retrosheet Pitching Allowed vs Retrosheet Hitting](#Retrosheet-Pitching-Allowed-vs-Retrosheet-Hitting) \n", "[Lahman Pitching Allowed vs Lahman Hitting](#Lahman-Pitching-Allowed-vs-Lahman-Hitting) \n", "\n", "**Individual vs Team** \n", "[Retrosheet Batting vs Retrosheet Team Batting](#Retrosheet-Batting-vs-Retrosheet-Team-Batting) \n", "[Lahman Batting vs Lahman Team Batting](#Lahman-Batting-vs-Lahman-Team-Batting) \n", "[Retrosheet Fielding vs Retrosheet Team Fielding](#Retrosheet-Fielding-vs-Retrosheet-Team-Fielding) \n", "[Lahman Fielding vs Lahman Team Fielding](#Lahman-Fielding-vs-Lahman-Team-Fielding) \n", "[Retrosheet Pitching vs Retrosheet Team Pitching](#Retrosheet-Pitching-vs-Retrosheet-Team-Pitching) \n", "[Lahman Pitching vs Lahman Team Pitching](#Lahman-Pitching-vs-Lahman-Team-Pitching) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preliminaries\n", "\n", "This notebook assumes that the Lahman and Retrosheet data sets have been downloaded and wrangled using the scripts in the `../download_scripts` directory of this repo.\n", "\n", "For this notebook, Retrosheet data from 1974 through 2019 inclusive is used.\n", "\n", "The `../download_scripts/data_helper.py` function: `from_csv_with_types()` uses pd.read_csv() with dtypes set to the type data read in from: <filename\\>_types.csv. This allows Pandas to use the previously optimized data types which require about 1/3rd as much memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MLB Data Summary\n", "\n", "Main csv files.\n", "\n", "**Lahman** \n", "* Stats per Player per Year:\n", " * batting.csv\n", " * pitching.csv\n", " * fielding.csv\n", "* Stats per Team per Year:\n", " * teams.csv -- contains team_id for both Lahman and Retrosheet\n", "* Other\n", " * people.csv -- contains player_id for Lahman, Retrosheet and Baseball-Reference\n", " \n", "**Retrosheet** \n", "* Stats per Player per Game:\n", " * batting.csv.gz\n", " * pitching.csv.gz\n", " * fielding.csv.gz\n", "* Stats per Team per Game:\n", " * team_game.csv.gz\n", "* Stats per Game:\n", " * game.csv.gz " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports and Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import numpy as np\n", "from pathlib import Path\n", "import re\n", "from scipy.stats import linregress" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import matplotlib as mpl\n", "mpl.rcParams['figure.dpi'] = 100 # increase dpi, will make figures larger and clearer" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "# import data_helper.py from download_scripts directory\n", "sys.path.append('../download_scripts')\n", "import data_helper as dh" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "data_dir = Path('../data')\n", "lahman_data = data_dir.joinpath('lahman/wrangled').resolve()\n", "retrosheet_data = data_dir.joinpath('retrosheet/wrangled').resolve()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "pd.set_option(\"display.max_columns\", 50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load the Data\n", "\n", "Loading all the data up front makes the code clearer, but uses more memory.\n", "\n", "As optimized Pandas data types were persisted when the data was wrangled, the total memory usage is about 3 times less than if Pandas inferred the data types with pd.read_csv(). \n", "\n", "A minor drawback to using the smallest integer type that will hold the data, is that some Pandas methods do not compare different integer types properly without first prompting them to the same larger integer type. Aggregations work properly for all Pandas data types.\n", "\n", "Notes:\n", "* every Retrosheet player who appears in a game is in retro_batting, even if they had no plate appearances\n", "* every Retrosheet team appears in retro_team_game\n", "* the Lahman stint value is incremented each time a player is traded in the same year" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "lahman_people = dh.from_csv_with_types(lahman_data / 'people.csv')\n", "lahman_teams = dh.from_csv_with_types(lahman_data / 'teams.csv')\n", "lahman_batting = dh.from_csv_with_types(lahman_data / 'batting.csv')\n", "lahman_pitching = dh.from_csv_with_types(lahman_data / 'pitching.csv')\n", "lahman_fielding = dh.from_csv_with_types(lahman_data / 'fielding.csv')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# restrict Lahman data to be between 1974 and 2019 inclusive\n", "lahman_batting = lahman_batting.query('1974 <= year <= 2019')\n", "lahman_pitching = lahman_pitching.query('1974 <= year <= 2019')\n", "lahman_fielding = lahman_fielding.query('1974 <= year <= 2019')\n", "lahman_teams = lahman_teams.query('1974 <= year <= 2019')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "retro_batting = dh.from_csv_with_types(retrosheet_data / 'batting.csv.gz')\n", "retro_pitching = dh.from_csv_with_types(retrosheet_data / 'pitching.csv.gz')\n", "retro_fielding = dh.from_csv_with_types(retrosheet_data / 'fielding.csv.gz')\n", "retro_team_game = dh.from_csv_with_types(retrosheet_data / 'team_game.csv.gz')\n", "retro_game = dh.from_csv_with_types(retrosheet_data / 'game.csv.gz')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# restrict Retrosheet data to be between 1974 and 2019 inclusive\n", "retro_batting = retro_batting.query('1974 <= year <= 2019')\n", "retro_pitching = retro_pitching.query('1974 <= year <= 2019')\n", "retro_fielding = retro_fielding.query('1974 <= year <= 2019')\n", "retro_team_game = retro_team_game.query('1974 <= year <= 2019')\n", "retro_game = retro_game.query('1974 <= game_start.dt.year <= 2019')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify these years exist in the downloaded data\n", "(retro_batting['year'].agg(['min', 'max']) == (1974, 2019)).all()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify all the years are in the downloaded data\n", "retro_batting['year'].nunique() == (2019 - 1974) + 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Primary and Foreign Key Tests\n", "Perform these tests before checking for data consistency.\n", "\n", "In the optional `../download_scripts/postgres_load_data.py` script, primary and foreign key constraints are used to enforce these data checks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lahman People and Lahman/Retrosheet Player IDs\n", "The Lahman people.csv file contains a mapping between Lahman player_id and Retrosheet player_id.\n", "\n", "Ensure that this mapping is \"one to one\" and \"onto\", which just means:\n", "* for every Lahman player_id there is exactly one Retrosheet player_id\n", "* for every Retrosheet player_id there is exactly one Lahman player_id\n", "\n", "Although the code that follows allows for a missing mapping, there are no missing mappings for players between 1974 and 2019 (as of the December 2019 Lahman data update)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify Lahman player_id is unique\n", "dh.is_unique(lahman_people, ['player_id'])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify Lahman's retrosheet player_id is unqiue or missing\n", "dh.is_unique(lahman_people, ['retro_id'], ignore_null=True)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 9383 unique players in Retrosheet\n" ] } ], "source": [ "# get the unique player ids from Retrosheet\n", "retro_players = pd.Series(retro_batting['player_id'].unique(), name='player_id')\n", "print(f'There are {len(retro_players)} unique players in Retrosheet')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 9383 player_ids that match between Lahman and Retrosheet\n" ] } ], "source": [ "# use an inner join to verify that the Lahman player_id to Retrosheet player_id\n", "# mapping is one-to-one and onto\n", "mapping = lahman_people[['player_id', 'retro_id']].merge(retro_players, how='inner', \n", " left_on=['retro_id'], right_on=['player_id'],\n", " suffixes=('_lahman', '_retro'))\n", "print(f'There are {len(mapping)} player_ids that match between Lahman and Retrosheet')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(retro_players) == len(mapping)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Given the above, the following must be True\n", "r_players = set(retro_players)\n", "l_players = set(lahman_people['retro_id'])\n", "r_players.issubset(l_players)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lahman Teams and Lahman/Retrosheet Team IDs\n", "Same analysis as above, but with (team_id, year) instead of player_id." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify Lahman (team_id, year) is unique\n", "dh.is_unique(lahman_teams, ['team_id', 'year'])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify Lahman's retrosheet (team__id, year) is unqiue\n", "dh.is_unique(lahman_teams, ['team_id_retro', 'year'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 1288 unique (team_id, year) in Retrosheet\n" ] } ], "source": [ "# alternatively Panda's drop_duplicates() could have been used\n", "retro_team_ids = set(zip(retro_team_game['team_id'], \n", " retro_team_game['year']))\n", "retro_team_ids_df = pd.DataFrame(retro_team_ids, columns=['team_id', 'year'])\n", "print(f'There are {len(retro_team_ids_df)} unique (team_id, year) in Retrosheet')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 1288 (team_id, year) that match between Lahman and Retrosheet\n" ] } ], "source": [ "# use an inner join to verify that the mapping is one-to-one and onto\n", "mapping = lahman_teams.merge(retro_team_ids_df, how='inner', \n", " left_on=['team_id_retro', 'year'], \n", " right_on=['team_id','year'])\n", "print(f'There are {len(mapping)} (team_id, year) that match between Lahman and Retrosheet')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(retro_team_ids) == len(mapping)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Given the above, the following must be True\n", "lahman_team_ids = set(zip(lahman_teams['team_id_retro'], lahman_teams['year']))\n", "retro_team_ids.issubset(lahman_team_ids)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lahman Batting/Pitching/Fielding" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dh.is_unique(lahman_batting, ['player_id', 'year', 'stint'])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dh.is_unique(lahman_pitching, ['player_id', 'year', 'stint'])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dh.is_unique(lahman_fielding, ['player_id', 'year', 'stint', 'pos'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrosheet Batting/Pitching/Fielding/Team_Game/Game" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<a href=\"https://www.baseball-reference.com/boxes/BOS/BOS201708250.shtml\">BOS201708250</a>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Note: 1 duplicate key was cleaned during wrangling for data between 1948 and 2019\n", "# Additional entries were cleaned for earlier years\n", "# The two entries for Chris Young (younc004) were summed into a single entry\n", "\n", "# Manually spot check with the box score for this game at Baseball Reference\n", "# to verify that summing all stats (other than g) was the correct approach\n", "dh.game_id_to_url('BOS201708250')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dh.is_unique(retro_batting, ['player_id', 'game_id'])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dh.is_unique(retro_pitching, ['player_id', 'game_id'])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dh.is_unique(retro_fielding, ['player_id', 'game_id', 'pos'])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dh.is_unique(retro_team_game, ['team_id', 'game_id'])" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dh.is_unique(retro_game, ['game_id'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Consistency Tests\n", "The download scripts ensured that columns with the same meaning, were given the same name. This makes it easy to compare data between CSV files using their column names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrosheet Batting vs Lahman Batting Players" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify that Lahman and Retrosheet have stats on the same set of batters\n", "lahman_batters = pd.merge(lahman_batting['player_id'], lahman_people[['player_id', 'retro_id']])\n", "l_batters = set(lahman_batters['retro_id'].unique())\n", "r_batters = set(retro_batting['player_id'].unique())\n", "r_batters == l_batters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrosheet Batting vs Lahman Batting Stats" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# batting columns to compare\n", "cols = set(retro_batting.columns) & set(lahman_batting.columns)\n", "cols -= {'player_id', 'team_id', 'year'}\n", "len(cols)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ab',\n", " 'bb',\n", " 'cs',\n", " 'double',\n", " 'g',\n", " 'gidp',\n", " 'h',\n", " 'hbp',\n", " 'hr',\n", " 'ibb',\n", " 'r',\n", " 'rbi',\n", " 'sb',\n", " 'sf',\n", " 'sh',\n", " 'so',\n", " 'triple'}" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# aggregate the stats in common for all players for all years (1974 thru 2019)\n", "l = lahman_batting[cols]\n", "r = retro_batting[cols]\n", "\n", "l_sums = l.agg('sum')\n", "l_sums.sort_index(inplace=True)\n", "\n", "r_sums = r.agg('sum')\n", "r_sums.sort_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ab 1.426074e-07\n", "bb 1.487730e-06\n", "cs 3.295273e-05\n", "double 2.922456e-06\n", "g 1.397775e-06\n", "gidp 8.906306e-05\n", "h 0.000000e+00\n", "hbp 0.000000e+00\n", "hr 0.000000e+00\n", "ibb 1.776294e-05\n", "r 0.000000e+00\n", "rbi 5.724236e-06\n", "sb 2.964764e-05\n", "sf 0.000000e+00\n", "sh 1.411652e-05\n", "so 0.000000e+00\n", "triple 0.000000e+00\n", "dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute the relative differences\n", "np.abs(1.0 - (l_sums / r_sums))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.000089\n" ] } ], "source": [ "# find the largest relative difference\n", "print(f'{np.abs(1.0 - (l_sums / r_sums)).max():8.6f}')" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# all 17 batting attributes from 1974-2019 between Lahman and Retrosheet\n", "# are within plus/minus 0.01% of each other when summed\n", "(np.abs(1.0 - (l_sums / r_sums)) < .0001).all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrosheet Pitching vs Lahman Pitching Players" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify that Lahman and Retrosheet have stats on exactly the same set of pitchers\n", "lahman_pitchers = pd.merge(lahman_pitching['player_id'], lahman_people[['player_id', 'retro_id']])\n", "l_pitchers = set(lahman_pitchers['retro_id'].unique())\n", "r_pitchers = set(retro_pitching['player_id'].unique())\n", "r_pitchers == l_pitchers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrosheet Pitching vs Lahman Pitching Stats" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pitching columns to compare\n", "cols = set(retro_pitching.columns) & set(lahman_pitching.columns)\n", "cols -= {'player_id', 'team_id', 'year'}\n", "len(cols)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bb',\n", " 'bk',\n", " 'cg',\n", " 'er',\n", " 'g',\n", " 'gf',\n", " 'gidp',\n", " 'gs',\n", " 'h',\n", " 'hbp',\n", " 'hr',\n", " 'ibb',\n", " 'l',\n", " 'r',\n", " 'sf',\n", " 'sh',\n", " 'sho',\n", " 'so',\n", " 'sv',\n", " 'w',\n", " 'wp'}" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# aggregate the stats in common for all players for all years (1974 thru 2019)\n", "l = lahman_pitching[cols]\n", "r = retro_pitching[cols]\n", "\n", "l_sums = l.agg('sum')\n", "l_sums.sort_index(inplace=True)\n", "\n", "r_sums = r.agg('sum')\n", "r_sums.sort_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bb 0.000001\n", "bk 0.000403\n", "cg 0.000327\n", "er 0.000002\n", "g 0.000105\n", "gf 0.000032\n", "gidp 0.000006\n", "gs 0.000000\n", "h 0.000000\n", "hbp 0.000000\n", "hr 0.000005\n", "ibb 0.000018\n", "l 0.000000\n", "r 0.000000\n", "sf 0.000000\n", "sh 0.000014\n", "sho 0.000598\n", "so 0.000000\n", "sv 0.000081\n", "w 0.000000\n", "wp 0.000355\n", "dtype: float64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute the relative differences\n", "np.abs(1.0 - (l_sums / r_sums))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.000598\n" ] } ], "source": [ "# find the largest relative difference\n", "print(f'{np.abs(1.0 - (l_sums / r_sums)).max():8.6f}')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify all values between 1974 and 2019 are within plus/minus 0.06% of each other\n", "(np.abs(1.0 - (l_sums / r_sums)) < .0006).all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrosheet Fielding vs Lahman Fielding Players" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify that Lahman and Retrosheet have stats on exactly the same set of fielders\n", "lahman_fielders = pd.merge(lahman_fielding['player_id'], lahman_people[['player_id', 'retro_id']])\n", "l_fielders = set(lahman_fielders['retro_id'].unique())\n", "r_fielders = set(retro_fielding['player_id'].unique())\n", "r_fielders == l_fielders" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "set()" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Lahman has no fielders that are not in Retrosheet\n", "l_fielders - r_fielders" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'olivt102'}" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Retrosheet has one fielder that is not in Lahman\n", "r_fielders - l_fielders" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "# missing Lahmann fielder\n", "missing_fielder = f'{(r_fielders - l_fielders).pop()}'" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>game_id</th>\n", " <th>player_id</th>\n", " <th>pos</th>\n", " <th>team_id</th>\n", " <th>g</th>\n", " <th>gs</th>\n", " <th>inn_outs</th>\n", " <th>tc</th>\n", " <th>po</th>\n", " <th>a</th>\n", " <th>e</th>\n", " <th>dp</th>\n", " <th>tp</th>\n", " <th>pb</th>\n", " <th>xi</th>\n", " <th>game_start</th>\n", " <th>year</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>791764</th>\n", " <td>BOS197604190</td>\n", " <td>olivt102</td>\n", " <td>2B</td>\n", " <td>MIN</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1976-04-19 11:05:00</td>\n", " <td>1976</td>\n", " </tr>\n", " <tr>\n", " <th>791786</th>\n", " <td>BOS197604200</td>\n", " <td>olivt102</td>\n", " <td>2B</td>\n", " <td>MIN</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1976-04-20 15:03:00</td>\n", " <td>1976</td>\n", " </tr>\n", " <tr>\n", " <th>816021</th>\n", " <td>NYA197604180</td>\n", " <td>olivt102</td>\n", " <td>2B</td>\n", " <td>MIN</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1976-04-18 14:07:00</td>\n", " <td>1976</td>\n", " </tr>\n", " <tr>\n", " <th>831226</th>\n", " <td>TEX197604110</td>\n", " <td>olivt102</td>\n", " <td>2B</td>\n", " <td>MIN</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1976-04-11 00:00:00</td>\n", " <td>1976</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " game_id player_id pos team_id g gs inn_outs tc po a e dp \\\n", "791764 BOS197604190 olivt102 2B MIN 0 1 0 0 0 0 0 0 \n", "791786 BOS197604200 olivt102 2B MIN 0 1 0 0 0 0 0 0 \n", "816021 NYA197604180 olivt102 2B MIN 0 1 0 0 0 0 0 0 \n", "831226 TEX197604110 olivt102 2B MIN 0 1 0 0 0 0 0 0 \n", "\n", " tp pb xi game_start year \n", "791764 0 0 0 1976-04-19 11:05:00 1976 \n", "791786 0 0 0 1976-04-20 15:03:00 1976 \n", "816021 0 0 0 1976-04-18 14:07:00 1976 \n", "831226 0 0 0 1976-04-11 00:00:00 1976 " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "missing = retro_fielding.query(f'player_id == \"{missing_fielder}\"')\n", "missing" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Tony had no fielding Total Chances (tc)\n", "missing['tc'].sum()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Tony was on the field for 0 outs (inn_outs)\n", "missing['inn_outs'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tony Oliva (olivt102) started 4 games as a second baseman. \n", "He had zero total chances and no outs were recorded during the time he played second base.\n", "\n", "This sounds like a pinch hitter or a designated hitter, rather than a fielder. Let's check the box scores and player information on Baseball-Reference." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<a href=\"https://www.baseball-reference.com/boxes/BOS/BOS197604190.shtml\">BOS197604190</a>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "<a href=\"https://www.baseball-reference.com/boxes/BOS/BOS197604200.shtml\">BOS197604200</a>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "<a href=\"https://www.baseball-reference.com/boxes/NYA/NYA197604180.shtml\">NYA197604180</a>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "<a href=\"https://www.baseball-reference.com/boxes/TEX/TEX197604110.shtml\">TEX197604110</a>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# right click on each generated link to see the box score on Baseball Reference\n", "for game_id in missing['game_id']:\n", " dh.game_id_to_url(game_id)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<a href=\"https://www.baseball-reference.com/players/o/olivato01.shtml\">olivato01</a>" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# right click on generated link to see Tony Oliva on Baseball Reference\n", "bb_player_id = lahman_people.query('retro_id == \"olivt102\"')['bb_ref_id'].values[0]\n", "dh.player_id_to_url(bb_player_id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tony Oliva lead off in each of these four away games and was immediately replaced, whether he got on base or not. He was only in the game to hit once. He might have been unable to run or field well for these games, but he could still hit, so he was in the lineup.\n", "\n", "Tony was listed as the starting second baseman, even though he was a career right fielder. Tony was not going to play second base, the guy who replaced him, Jerry Terrell, was.\n", "\n", "The cwdaily parser created a fielding record for Tony as he was on the lineup card as the starting second baseman.\n", "\n", "It would be hard to argue that either the Lahman or Retrosheet data is wrong in this scenario. The only difference is that Retrosheet show 4 starts as a second baseman for Tony Oliva that Lahman does not.\n", "\n", "The fielding players are effectively the same in Lahman and Retrosheet (1974 through 2019)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrosheet Fielding vs Lahman Fielding Stats" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fielding columns to compare\n", "cols = set(retro_fielding.columns) & set(lahman_fielding.columns)\n", "cols -= {'player_id', 'team_id', 'year'}\n", "len(cols)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'a', 'dp', 'e', 'g', 'gs', 'inn_outs', 'po', 'pos'}" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "# aggregate the stats in common per position over all players over all years (1974 thru 2019)\n", "l = lahman_fielding[cols].copy()\n", "r = retro_fielding[cols]\n", "\n", "# work-around for Pandas 1.0.1 bugs\n", "# 1) sum does not up-cast for nullable integer types\n", "# 2) select_dtypes does not distinguish between nullable and non-nullable integer types\n", "idx = l.dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])\n", "for col in l.columns[idx]:\n", " l[col] = l[col].astype('Int32')\n", "\n", "l_sums = l.groupby('pos').agg('sum')\n", "l_sums.sort_index(inplace=True)\n", "\n", "r_sums = r.groupby('pos').agg('sum')\n", "r_sums.sort_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>inn_outs</th>\n", " <th>dp</th>\n", " <th>g</th>\n", " <th>a</th>\n", " <th>e</th>\n", " <th>gs</th>\n", " <th>po</th>\n", " </tr>\n", " <tr>\n", " <th>pos</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1B</th>\n", " <td>2.470157e-05</td>\n", " <td>0.000737</td>\n", " <td>0.000008</td>\n", " <td>0.006393</td>\n", " <td>0.000833</td>\n", " <td>0.000029</td>\n", " <td>0.000482</td>\n", " </tr>\n", " <tr>\n", " <th>2B</th>\n", " <td>2.179551e-06</td>\n", " <td>0.000007</td>\n", " <td>0.000017</td>\n", " <td>0.001008</td>\n", " <td>0.000309</td>\n", " <td>0.000063</td>\n", " <td>0.002122</td>\n", " </tr>\n", " <tr>\n", " <th>3B</th>\n", " <td>9.807978e-06</td>\n", " <td>0.000908</td>\n", " <td>0.000177</td>\n", " <td>0.001483</td>\n", " <td>0.000109</td>\n", " <td>0.000019</td>\n", " <td>0.001850</td>\n", " </tr>\n", " <tr>\n", " <th>C</th>\n", " <td>5.448877e-07</td>\n", " <td>0.000982</td>\n", " <td>0.000059</td>\n", " <td>0.007810</td>\n", " <td>0.000683</td>\n", " <td>0.000005</td>\n", " <td>0.000668</td>\n", " </tr>\n", " <tr>\n", " <th>CF</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>LF</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>OF</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>P</th>\n", " <td>4.722360e-06</td>\n", " <td>0.002302</td>\n", " <td>0.000105</td>\n", " <td>0.001890</td>\n", " <td>0.001415</td>\n", " <td>0.000000</td>\n", " <td>0.005532</td>\n", " </tr>\n", " <tr>\n", " <th>RF</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>SS</th>\n", " <td>8.354944e-06</td>\n", " <td>0.000730</td>\n", " <td>0.000004</td>\n", " <td>0.001009</td>\n", " <td>0.000102</td>\n", " <td>0.000068</td>\n", " <td>0.001301</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " inn_outs dp g a e gs po\n", "pos \n", "1B 2.470157e-05 0.000737 0.000008 0.006393 0.000833 0.000029 0.000482\n", "2B 2.179551e-06 0.000007 0.000017 0.001008 0.000309 0.000063 0.002122\n", "3B 9.807978e-06 0.000908 0.000177 0.001483 0.000109 0.000019 0.001850\n", "C 5.448877e-07 0.000982 0.000059 0.007810 0.000683 0.000005 0.000668\n", "CF NaN NaN NaN NaN NaN NaN NaN\n", "LF NaN NaN NaN NaN NaN NaN NaN\n", "OF NaN NaN NaN NaN NaN NaN NaN\n", "P 4.722360e-06 0.002302 0.000105 0.001890 0.001415 0.000000 0.005532\n", "RF NaN NaN NaN NaN NaN NaN NaN\n", "SS 8.354944e-06 0.000730 0.000004 0.001009 0.000102 0.000068 0.001301" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute the relative differences\n", "np.abs(1.0 - (l_sums / r_sums))" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>inn_outs</th>\n", " <th>dp</th>\n", " <th>g</th>\n", " <th>a</th>\n", " <th>e</th>\n", " <th>gs</th>\n", " <th>po</th>\n", " </tr>\n", " <tr>\n", " <th>pos</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1B</th>\n", " <td>5505722.0</td>\n", " <td>171005.0</td>\n", " <td>237097.0</td>\n", " <td>138119.0</td>\n", " <td>14409.0</td>\n", " <td>205280.0</td>\n", " <td>1798401.0</td>\n", " </tr>\n", " <tr>\n", " <th>2B</th>\n", " <td>5505722.0</td>\n", " <td>135428.0</td>\n", " <td>235233.0</td>\n", " <td>599404.0</td>\n", " <td>19443.0</td>\n", " <td>205280.0</td>\n", " <td>422753.0</td>\n", " </tr>\n", " <tr>\n", " <th>3B</th>\n", " <td>5505722.0</td>\n", " <td>37465.0</td>\n", " <td>237078.0</td>\n", " <td>404527.0</td>\n", " <td>27552.0</td>\n", " <td>205280.0</td>\n", " <td>149770.0</td>\n", " </tr>\n", " <tr>\n", " <th>C</th>\n", " <td>5505722.0</td>\n", " <td>13236.0</td>\n", " <td>236247.0</td>\n", " <td>106279.0</td>\n", " <td>14642.0</td>\n", " <td>205280.0</td>\n", " <td>1349560.0</td>\n", " </tr>\n", " <tr>\n", " <th>OF</th>\n", " <td>16517166.0</td>\n", " <td>8256.0</td>\n", " <td>735430.0</td>\n", " <td>38885.0</td>\n", " <td>24381.0</td>\n", " <td>615840.0</td>\n", " <td>1352944.0</td>\n", " </tr>\n", " <tr>\n", " <th>P</th>\n", " <td>5505722.0</td>\n", " <td>17807.0</td>\n", " <td>695259.0</td>\n", " <td>243922.0</td>\n", " <td>16956.0</td>\n", " <td>205280.0</td>\n", " <td>112264.0</td>\n", " </tr>\n", " <tr>\n", " <th>SS</th>\n", " <td>5505722.0</td>\n", " <td>127447.0</td>\n", " <td>232003.0</td>\n", " <td>610365.0</td>\n", " <td>29333.0</td>\n", " <td>205280.0</td>\n", " <td>319687.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " inn_outs dp g a e gs po\n", "pos \n", "1B 5505722.0 171005.0 237097.0 138119.0 14409.0 205280.0 1798401.0\n", "2B 5505722.0 135428.0 235233.0 599404.0 19443.0 205280.0 422753.0\n", "3B 5505722.0 37465.0 237078.0 404527.0 27552.0 205280.0 149770.0\n", "C 5505722.0 13236.0 236247.0 106279.0 14642.0 205280.0 1349560.0\n", "OF 16517166.0 8256.0 735430.0 38885.0 24381.0 615840.0 1352944.0\n", "P 5505722.0 17807.0 695259.0 243922.0 16956.0 205280.0 112264.0\n", "SS 5505722.0 127447.0 232003.0 610365.0 29333.0 205280.0 319687.0" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Lahman uses OF for sum of LF, CF, RF -- account for this\n", "r_sums.loc['OF'] = r_sums.loc['LF'] + r_sums.loc['CF'] + r_sums.loc['RF']\n", "r_sums = r_sums.drop(['LF', 'CF', 'RF'])\n", "r_sums.sort_index(inplace=True)\n", "r_sums" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>inn_outs</th>\n", " <th>dp</th>\n", " <th>g</th>\n", " <th>a</th>\n", " <th>e</th>\n", " <th>gs</th>\n", " <th>po</th>\n", " </tr>\n", " <tr>\n", " <th>pos</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1B</th>\n", " <td>5505722.0</td>\n", " <td>171005.0</td>\n", " <td>237097.0</td>\n", " <td>138119.0</td>\n", " <td>14409.0</td>\n", " <td>205280.0</td>\n", " <td>1798401.0</td>\n", " </tr>\n", " <tr>\n", " <th>2B</th>\n", " <td>5505722.0</td>\n", " <td>135428.0</td>\n", " <td>235233.0</td>\n", " <td>599404.0</td>\n", " <td>19443.0</td>\n", " <td>205280.0</td>\n", " <td>422753.0</td>\n", " </tr>\n", " <tr>\n", " <th>3B</th>\n", " <td>5505722.0</td>\n", " <td>37465.0</td>\n", " <td>237078.0</td>\n", " <td>404527.0</td>\n", " <td>27552.0</td>\n", " <td>205280.0</td>\n", " <td>149770.0</td>\n", " </tr>\n", " <tr>\n", " <th>C</th>\n", " <td>5505722.0</td>\n", " <td>13236.0</td>\n", " <td>236247.0</td>\n", " <td>106279.0</td>\n", " <td>14642.0</td>\n", " <td>205280.0</td>\n", " <td>1349560.0</td>\n", " </tr>\n", " <tr>\n", " <th>OF</th>\n", " <td>16517166.0</td>\n", " <td>8256.0</td>\n", " <td>711605.0</td>\n", " <td>38885.0</td>\n", " <td>24381.0</td>\n", " <td>615840.0</td>\n", " <td>1352944.0</td>\n", " </tr>\n", " <tr>\n", " <th>P</th>\n", " <td>5505722.0</td>\n", " <td>17807.0</td>\n", " <td>695259.0</td>\n", " <td>243922.0</td>\n", " <td>16956.0</td>\n", " <td>205280.0</td>\n", " <td>112264.0</td>\n", " </tr>\n", " <tr>\n", " <th>SS</th>\n", " <td>5505722.0</td>\n", " <td>127447.0</td>\n", " <td>232003.0</td>\n", " <td>610365.0</td>\n", " <td>29333.0</td>\n", " <td>205280.0</td>\n", " <td>319687.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " inn_outs dp g a e gs po\n", "pos \n", "1B 5505722.0 171005.0 237097.0 138119.0 14409.0 205280.0 1798401.0\n", "2B 5505722.0 135428.0 235233.0 599404.0 19443.0 205280.0 422753.0\n", "3B 5505722.0 37465.0 237078.0 404527.0 27552.0 205280.0 149770.0\n", "C 5505722.0 13236.0 236247.0 106279.0 14642.0 205280.0 1349560.0\n", "OF 16517166.0 8256.0 711605.0 38885.0 24381.0 615840.0 1352944.0\n", "P 5505722.0 17807.0 695259.0 243922.0 16956.0 205280.0 112264.0\n", "SS 5505722.0 127447.0 232003.0 610365.0 29333.0 205280.0 319687.0" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The above will overcount games, as Lahman will have a player in the game once as an OF, \n", "# whereas Retrosheet may have him in the game as both a LF and CF, for example.\n", "r_of = retro_fielding.query('pos in [\"LF\", \"CF\", \"RF\"]')\n", "\n", "total_dups = r_of.duplicated(subset=['player_id', 'game_id'], keep=False).sum()\n", "counted_dups = r_of.duplicated(subset=['player_id', 'game_id'], keep='first').sum()\n", "\n", "r_sums.loc['OF', 'g'] -= (total_dups - counted_dups)\n", "r_sums" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>inn_outs</th>\n", " <th>dp</th>\n", " <th>g</th>\n", " <th>a</th>\n", " <th>e</th>\n", " <th>gs</th>\n", " <th>po</th>\n", " </tr>\n", " <tr>\n", " <th>pos</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1B</th>\n", " <td>2.470157e-05</td>\n", " <td>0.000737</td>\n", " <td>0.000008</td>\n", " <td>0.006393</td>\n", " <td>0.000833</td>\n", " <td>0.000029</td>\n", " <td>0.000482</td>\n", " </tr>\n", " <tr>\n", " <th>2B</th>\n", " <td>2.179551e-06</td>\n", " <td>0.000007</td>\n", " <td>0.000017</td>\n", " <td>0.001008</td>\n", " <td>0.000309</td>\n", " <td>0.000063</td>\n", " <td>0.002122</td>\n", " </tr>\n", " <tr>\n", " <th>3B</th>\n", " <td>9.807978e-06</td>\n", " <td>0.000908</td>\n", " <td>0.000177</td>\n", " <td>0.001483</td>\n", " <td>0.000109</td>\n", " <td>0.000019</td>\n", " <td>0.001850</td>\n", " </tr>\n", " <tr>\n", " <th>C</th>\n", " <td>5.448877e-07</td>\n", " <td>0.000982</td>\n", " <td>0.000059</td>\n", " <td>0.007810</td>\n", " <td>0.000683</td>\n", " <td>0.000005</td>\n", " <td>0.000668</td>\n", " </tr>\n", " <tr>\n", " <th>OF</th>\n", " <td>6.599195e-06</td>\n", " <td>0.001453</td>\n", " <td>0.000311</td>\n", " <td>0.002546</td>\n", " <td>0.000656</td>\n", " <td>0.000013</td>\n", " <td>0.000072</td>\n", " </tr>\n", " <tr>\n", " <th>P</th>\n", " <td>4.722360e-06</td>\n", " <td>0.002302</td>\n", " <td>0.000105</td>\n", " <td>0.001890</td>\n", " <td>0.001415</td>\n", " <td>0.000000</td>\n", " <td>0.005532</td>\n", " </tr>\n", " <tr>\n", " <th>SS</th>\n", " <td>8.354944e-06</td>\n", " <td>0.000730</td>\n", " <td>0.000004</td>\n", " <td>0.001009</td>\n", " <td>0.000102</td>\n", " <td>0.000068</td>\n", " <td>0.001301</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " inn_outs dp g a e gs po\n", "pos \n", "1B 2.470157e-05 0.000737 0.000008 0.006393 0.000833 0.000029 0.000482\n", "2B 2.179551e-06 0.000007 0.000017 0.001008 0.000309 0.000063 0.002122\n", "3B 9.807978e-06 0.000908 0.000177 0.001483 0.000109 0.000019 0.001850\n", "C 5.448877e-07 0.000982 0.000059 0.007810 0.000683 0.000005 0.000668\n", "OF 6.599195e-06 0.001453 0.000311 0.002546 0.000656 0.000013 0.000072\n", "P 4.722360e-06 0.002302 0.000105 0.001890 0.001415 0.000000 0.005532\n", "SS 8.354944e-06 0.000730 0.000004 0.001009 0.000102 0.000068 0.001301" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute relative differences\n", "np.abs(1.0 - l_sums / r_sums)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.007810\n" ] } ], "source": [ "# find the largest relative difference\n", "print(f'{np.abs(1.0 - (l_sums / r_sums)).max().max():8.6f}')" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# verify all values between 1974 and 2019 are within plus/minus 0.8% of each other\n", "np.abs(1.0 - (l_sums / r_sums)).max().max() < 0.008" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Retrosheet Pitching Allowed vs Retrosheet Hitting" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['gidp',\n", " 'triple',\n", " 'sf',\n", " 'hr',\n", " 'h',\n", " 'so',\n", " 'ab',\n", " 'hr4',\n", " 'hbp',\n", " 'double',\n", " 'r',\n", " 'ibb',\n", " 'sh',\n", " 'xi',\n", " 'bb',\n", " 'tb']" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exclude = ['game_id', 'team_id', 'player_id', 'g', 'game_start', 'year']\n", "cols = set(retro_pitching.columns) & set(retro_batting.columns) - set(exclude)\n", "cols = list(cols)\n", "cols" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gidp 157192\n", "triple 41532\n", "sf 58275\n", "hr 195985\n", "h 1826823\n", "so 1301700\n", "ab 7012260\n", "hr4 4710\n", "hbp 58097\n", "double 342178\n", "r 923811\n", "ibb 56297\n", "sh 70839\n", "xi 1060\n", "bb 672165\n", "tb 2840020\n", "dtype: int64" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sum over all pitchers over all years\n", "p = retro_pitching[cols].agg('sum')\n", "p" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gidp 157192\n", "triple 41532\n", "sf 58275\n", "hr 195985\n", "h 1826823\n", "so 1301700\n", "ab 7012260\n", "hr4 4710\n", "hbp 58097\n", "double 342178\n", "r 923811\n", "ibb 56297\n", "sh 70839\n", "xi 1060\n", "bb 672165\n", "tb 2840020\n", "dtype: int64" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sum over all batters over all years\n", "b = retro_batting[cols].agg('sum')\n", "b" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Retrosheet is completely consistent\n", "p.equals(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lahman Pitching Allowed vs Lahman Hitting" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bb', 'gidp', 'h', 'hbp', 'hr', 'ibb', 'r', 'sf', 'sh', 'so'}" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exclude = ['lg_id', 'player_id', 'stint', 'team_id', 'year', 'g']\n", "pkey = ['player_id', 'stint', 'year']\n", "cols = set(lahman_pitching.columns) & set(lahman_batting.columns)\n", "cols -= set(exclude)\n", "cols" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gidp 157193.0\n", "so 1301700.0\n", "sf 58275.0\n", "h 1826823.0\n", "hr 195986.0\n", "hbp 58097.0\n", "r 923811.0\n", "ibb 56296.0\n", "sh 70838.0\n", "bb 672166.0\n", "dtype: float64" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sum over all pitchers over all years (1974 thru 2019)\n", "p = lahman_pitching[cols].agg('sum')\n", "p" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gidp 157178.0\n", "so 1301700.0\n", "sf 58275.0\n", "h 1826823.0\n", "hr 195985.0\n", "hbp 58097.0\n", "r 923811.0\n", "ibb 56296.0\n", "sh 70838.0\n", "bb 672166.0\n", "dtype: float64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sum over all batters over all years (1974 thru 2019)\n", "b = lahman_batting[cols].agg('sum')\n", "b" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gidp 0.000095\n", "so 0.000000\n", "sf 0.000000\n", "h 0.000000\n", "hr 0.000005\n", "hbp 0.000000\n", "r 0.000000\n", "ibb 0.000000\n", "sh 0.000000\n", "bb 0.000000\n", "dtype: float64" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute the relative differences\n", "np.abs(1.0 - p/b)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9.543320311999892e-05" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find the max difference\n", "np.abs(1.0 - p/b).max()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the biggest difference is less than 0.01%\n", "np.abs(1.0 - p/b).max() < 0.0001" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Retrosheet Batting vs Retrosheet Team Batting" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exclude = ['game_id', 'team_id', 'player_id', 'game_start', 'year']\n", "cols = set(retro_batting.columns) & set(retro_team_game.columns) - set(exclude)\n", "cols = list(cols)\n", "len(cols)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['gidp',\n", " 'triple',\n", " 'sf',\n", " 'hr',\n", " 'h',\n", " 'cs',\n", " 'ab',\n", " 'rbi',\n", " 'hbp',\n", " 'so',\n", " 'double',\n", " 'r',\n", " 'ibb',\n", " 'sh',\n", " 'sb',\n", " 'xi',\n", " 'bb']" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>game_id</th>\n", " <th>team_id</th>\n", " <th>gidp</th>\n", " <th>triple</th>\n", " <th>sf</th>\n", " <th>hr</th>\n", " <th>h</th>\n", " <th>cs</th>\n", " <th>ab</th>\n", " <th>rbi</th>\n", " <th>hbp</th>\n", " <th>so</th>\n", " <th>double</th>\n", " <th>r</th>\n", " <th>ibb</th>\n", " <th>sh</th>\n", " <th>sb</th>\n", " <th>xi</th>\n", " <th>bb</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA199704020</td>\n", " <td>ANA</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>12</td>\n", " <td>0</td>\n", " <td>38</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA199704020</td>\n", " <td>BOS</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>36</td>\n", " <td>6</td>\n", " <td>1</td>\n", " <td>12</td>\n", " <td>2</td>\n", " <td>6</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA199704030</td>\n", " <td>ANA</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>27</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA199704030</td>\n", " <td>BOS</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>29</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " game_id team_id gidp triple sf hr h cs ab rbi hbp so \\\n", "0 ANA199704020 ANA 0 0 1 1 12 0 38 5 0 7 \n", "1 ANA199704020 BOS 0 0 0 1 9 0 36 6 1 12 \n", "2 ANA199704030 ANA 2 0 1 0 7 0 27 2 0 1 \n", "3 ANA199704030 BOS 3 0 0 0 5 0 29 0 0 5 \n", "\n", " double r ibb sh sb xi bb \n", "0 0 5 0 0 2 0 5 \n", "1 2 6 0 0 0 0 8 \n", "2 2 2 1 0 0 0 4 \n", "3 0 0 0 0 0 0 0 " ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# retro_batting was wrangled from the output of cwdaily\n", "b = retro_batting[['game_id', 'team_id'] + cols].groupby(['game_id', 'team_id']).agg('sum')\n", "b = b.reset_index()\n", "b.head(4)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>game_id</th>\n", " <th>team_id</th>\n", " <th>gidp</th>\n", " <th>triple</th>\n", " <th>sf</th>\n", " <th>hr</th>\n", " <th>h</th>\n", " <th>cs</th>\n", " <th>ab</th>\n", " <th>rbi</th>\n", " <th>hbp</th>\n", " <th>so</th>\n", " <th>double</th>\n", " <th>r</th>\n", " <th>ibb</th>\n", " <th>sh</th>\n", " <th>sb</th>\n", " <th>xi</th>\n", " <th>bb</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA199704020</td>\n", " <td>ANA</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>12</td>\n", " <td>0</td>\n", " <td>38</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA199704020</td>\n", " <td>BOS</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>36</td>\n", " <td>6</td>\n", " <td>1</td>\n", " <td>12</td>\n", " <td>2</td>\n", " <td>6</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA199704030</td>\n", " <td>ANA</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>7</td>\n", " <td>0</td>\n", " <td>27</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA199704030</td>\n", " <td>BOS</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>29</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " game_id team_id gidp triple sf hr h cs ab rbi hbp so \\\n", "0 ANA199704020 ANA 0 0 1 1 12 0 38 5 0 7 \n", "1 ANA199704020 BOS 0 0 0 1 9 0 36 6 1 12 \n", "2 ANA199704030 ANA 2 0 1 0 7 0 27 2 0 1 \n", "3 ANA199704030 BOS 3 0 0 0 5 0 29 0 0 5 \n", "\n", " double r ibb sh sb xi bb \n", "0 0 5 0 0 2 0 5 \n", "1 2 6 0 0 0 0 8 \n", "2 2 2 1 0 0 0 4 \n", "3 0 0 0 0 0 0 0 " ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# retro_team_game was wrangled from the output of cwgame\n", "tg = retro_team_game[['game_id', 'team_id'] + cols].sort_values(['game_id', 'team_id']).reset_index(drop=True)\n", "tg.head(4)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Retrosheet is completely consistent between cwdaily and cwgame for batting\n", "b.equals(tg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lahman Batting vs Lahman Team Batting" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exclude = ['lg_id', 'team_id', 'year', 'g']\n", "key = ['team_id', 'year']\n", "cols = set(lahman_batting.columns) & set(lahman_teams.columns) - set(exclude)\n", "cols = list(cols)\n", "len(cols)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['triple', 'sf', 'hr', 'h', 'cs', 'ab', 'so', 'hbp', 'double', 'r', 'sb', 'bb']" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "# work-around for Pandas 1.0.1 bugs\n", "# 1) sum does not up-cast for nullable integer types\n", "# 2) select_dtypes does not distinguish between nullable and non-nullable int types\n", "idx = lahman_batting[cols].dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])\n", "for col in lahman_batting[cols].columns[idx]:\n", " lahman_batting[col] = lahman_batting[col].astype('Int32')\n", "\n", "idx = lahman_teams[cols].dtypes.isin([pd.UInt8Dtype(), pd.UInt16Dtype()])\n", "for col in lahman_teams[cols].columns[idx]:\n", " lahman_teams[col] = lahman_teams[col].astype('Int32')" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>triple</th>\n", " <th>sf</th>\n", " <th>hr</th>\n", " <th>h</th>\n", " <th>cs</th>\n", " <th>ab</th>\n", " <th>so</th>\n", " <th>hbp</th>\n", " <th>double</th>\n", " <th>r</th>\n", " <th>sb</th>\n", " <th>bb</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA</td>\n", " <td>1997</td>\n", " <td>25.0</td>\n", " <td>57</td>\n", " <td>161.0</td>\n", " <td>1531</td>\n", " <td>72</td>\n", " <td>5628</td>\n", " <td>953</td>\n", " <td>45</td>\n", " <td>279.0</td>\n", " <td>829.0</td>\n", " <td>126</td>\n", " <td>617.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA</td>\n", " <td>1998</td>\n", " <td>27.0</td>\n", " <td>41</td>\n", " <td>147.0</td>\n", " <td>1530</td>\n", " <td>45</td>\n", " <td>5630</td>\n", " <td>1028</td>\n", " <td>48</td>\n", " <td>314.0</td>\n", " <td>787.0</td>\n", " <td>93</td>\n", " <td>510.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA</td>\n", " <td>1999</td>\n", " <td>22.0</td>\n", " <td>42</td>\n", " <td>158.0</td>\n", " <td>1404</td>\n", " <td>45</td>\n", " <td>5494</td>\n", " <td>1022</td>\n", " <td>43</td>\n", " <td>248.0</td>\n", " <td>711.0</td>\n", " <td>71</td>\n", " <td>511.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA</td>\n", " <td>2000</td>\n", " <td>34.0</td>\n", " <td>43</td>\n", " <td>236.0</td>\n", " <td>1574</td>\n", " <td>52</td>\n", " <td>5628</td>\n", " <td>1024</td>\n", " <td>47</td>\n", " <td>309.0</td>\n", " <td>864.0</td>\n", " <td>93</td>\n", " <td>608.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year triple sf hr h cs ab so hbp double r \\\n", "0 ANA 1997 25.0 57 161.0 1531 72 5628 953 45 279.0 829.0 \n", "1 ANA 1998 27.0 41 147.0 1530 45 5630 1028 48 314.0 787.0 \n", "2 ANA 1999 22.0 42 158.0 1404 45 5494 1022 43 248.0 711.0 \n", "3 ANA 2000 34.0 43 236.0 1574 52 5628 1024 47 309.0 864.0 \n", "\n", " sb bb \n", "0 126 617.0 \n", "1 93 510.0 \n", "2 71 511.0 \n", "3 93 608.0 " ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = lahman_batting[key + cols].groupby(key).agg('sum')\n", "b = b.reset_index()\n", "b.head(4)" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>triple</th>\n", " <th>sf</th>\n", " <th>hr</th>\n", " <th>h</th>\n", " <th>cs</th>\n", " <th>ab</th>\n", " <th>so</th>\n", " <th>hbp</th>\n", " <th>double</th>\n", " <th>r</th>\n", " <th>sb</th>\n", " <th>bb</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA</td>\n", " <td>1997</td>\n", " <td>25</td>\n", " <td>57</td>\n", " <td>161</td>\n", " <td>1531</td>\n", " <td>72</td>\n", " <td>5628</td>\n", " <td>953</td>\n", " <td>45</td>\n", " <td>279</td>\n", " <td>829</td>\n", " <td>126</td>\n", " <td>617</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA</td>\n", " <td>1998</td>\n", " <td>27</td>\n", " <td>41</td>\n", " <td>147</td>\n", " <td>1530</td>\n", " <td>45</td>\n", " <td>5630</td>\n", " <td>1028</td>\n", " <td>48</td>\n", " <td>314</td>\n", " <td>787</td>\n", " <td>93</td>\n", " <td>510</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA</td>\n", " <td>1999</td>\n", " <td>22</td>\n", " <td>42</td>\n", " <td>158</td>\n", " <td>1404</td>\n", " <td>45</td>\n", " <td>5494</td>\n", " <td>1022</td>\n", " <td>43</td>\n", " <td>248</td>\n", " <td>711</td>\n", " <td>71</td>\n", " <td>511</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA</td>\n", " <td>2000</td>\n", " <td>34</td>\n", " <td>43</td>\n", " <td>236</td>\n", " <td>1574</td>\n", " <td>52</td>\n", " <td>5628</td>\n", " <td>1024</td>\n", " <td>47</td>\n", " <td>309</td>\n", " <td>864</td>\n", " <td>93</td>\n", " <td>608</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year triple sf hr h cs ab so hbp double r sb \\\n", "0 ANA 1997 25 57 161 1531 72 5628 953 45 279 829 126 \n", "1 ANA 1998 27 41 147 1530 45 5630 1028 48 314 787 93 \n", "2 ANA 1999 22 42 158 1404 45 5494 1022 43 248 711 71 \n", "3 ANA 2000 34 43 236 1574 52 5628 1024 47 309 864 93 \n", "\n", " bb \n", "0 617 \n", "1 510 \n", "2 511 \n", "3 608 " ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = lahman_teams[key + cols].sort_values(key).reset_index(drop=True)\n", "t.head(4)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "# ensure the dtypes are the same\n", "for col in t.columns:\n", " if not col == 'team_id' and not col =='year':\n", " b[col] = b[col].astype('int')\n", " t[col] = t[col].astype('int')" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[cols].equals(t[cols])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Retrosheet Fielding vs Retrosheet Team Fielding\n", "Double plays (dp) are not compared because each fielder involved gets credit for a double play whereas the team as a whole only gets credit for the complete double play." ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a', 'e', 'po', 'pb']" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['a', 'e', 'po', 'pb']\n", "cols" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>game_id</th>\n", " <th>team_id</th>\n", " <th>a</th>\n", " <th>e</th>\n", " <th>po</th>\n", " <th>pb</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA199704020</td>\n", " <td>ANA</td>\n", " <td>6</td>\n", " <td>1</td>\n", " <td>27</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA199704020</td>\n", " <td>BOS</td>\n", " <td>8</td>\n", " <td>1</td>\n", " <td>27</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA199704030</td>\n", " <td>ANA</td>\n", " <td>13</td>\n", " <td>0</td>\n", " <td>27</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA199704030</td>\n", " <td>BOS</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>24</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " game_id team_id a e po pb\n", "0 ANA199704020 ANA 6 1 27 0\n", "1 ANA199704020 BOS 8 1 27 0\n", "2 ANA199704030 ANA 13 0 27 0\n", "3 ANA199704030 BOS 9 0 24 0" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# retro_fielding was wrangled from the output of cwdaily\n", "f = retro_fielding[['game_id', 'team_id'] + cols].groupby(['game_id', 'team_id']).agg('sum')\n", "f = f.reset_index()\n", "f.head(4)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>game_id</th>\n", " <th>team_id</th>\n", " <th>a</th>\n", " <th>e</th>\n", " <th>po</th>\n", " <th>pb</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA199704020</td>\n", " <td>ANA</td>\n", " <td>6</td>\n", " <td>1</td>\n", " <td>27</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA199704020</td>\n", " <td>BOS</td>\n", " <td>8</td>\n", " <td>1</td>\n", " <td>27</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA199704030</td>\n", " <td>ANA</td>\n", " <td>13</td>\n", " <td>0</td>\n", " <td>27</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA199704030</td>\n", " <td>BOS</td>\n", " <td>9</td>\n", " <td>0</td>\n", " <td>24</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " game_id team_id a e po pb\n", "0 ANA199704020 ANA 6 1 27 0\n", "1 ANA199704020 BOS 8 1 27 0\n", "2 ANA199704030 ANA 13 0 27 0\n", "3 ANA199704030 BOS 9 0 24 0" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# retro_team_game was wrangled from the output of cwgame\n", "tg = retro_team_game[['game_id', 'team_id'] + cols].sort_values(\n", " ['game_id', 'team_id']).reset_index(drop=True)\n", "tg.head(4)" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Retrosheet is completely consistent between cwdaily and cwgame for fielding\n", "f.equals(tg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lahman Fielding vs Lahman Team Fielding" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exclude = ['lg_id', 'team_id', 'year', 'g', 'dp', 'player_id']\n", "key = ['team_id', 'year']\n", "cols = set(lahman_fielding.columns) & set(lahman_teams.columns) - set(exclude)\n", "cols = list(cols)\n", "len(cols)" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['e']" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>e</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA</td>\n", " <td>1997</td>\n", " <td>123</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA</td>\n", " <td>1998</td>\n", " <td>106</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA</td>\n", " <td>1999</td>\n", " <td>106</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA</td>\n", " <td>2000</td>\n", " <td>134</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year e\n", "0 ANA 1997 123\n", "1 ANA 1998 106\n", "2 ANA 1999 106\n", "3 ANA 2000 134" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = lahman_fielding[key + cols].groupby(key).agg('sum')\n", "f = f.sort_index().reset_index()\n", "f.head(4)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>e</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA</td>\n", " <td>1997</td>\n", " <td>123</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA</td>\n", " <td>1998</td>\n", " <td>106</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA</td>\n", " <td>1999</td>\n", " <td>106</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA</td>\n", " <td>2000</td>\n", " <td>134</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year e\n", "0 ANA 1997 123\n", "1 ANA 1998 106\n", "2 ANA 1999 106\n", "3 ANA 2000 134" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = lahman_teams[key+cols]\n", "t = t.sort_values(key).reset_index(drop=True)\n", "t.head(4)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "# ensure the dtypes are the same for errors (e)\n", "f[cols] = f[cols].astype('int')\n", "t[cols] = t[cols].astype('int')" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "e 2\n", "dtype: int64" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(f[cols] - t[cols]).max()" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "m = pd.merge(f, t, \n", " left_on=['team_id', 'year'],\n", " right_on=['team_id', 'year'],\n", " suffixes=['_fielding', '_teams'])" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>e_fielding</th>\n", " <th>e_teams</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>653</th>\n", " <td>MIN</td>\n", " <td>1977</td>\n", " <td>144</td>\n", " <td>143</td>\n", " </tr>\n", " <tr>\n", " <th>1014</th>\n", " <td>SDN</td>\n", " <td>2007</td>\n", " <td>94</td>\n", " <td>92</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year e_fielding e_teams\n", "653 MIN 1977 144 143\n", "1014 SDN 2007 94 92" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m.query(\"e_fielding != e_teams\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When comparing large numbers, it is best to examine their relative differences. \n", "When comparing small numbers, it is best to look at their absolute differences" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "((f[cols] - t[cols]).max() <= 2).all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Retrosheet Pitching vs Retrosheet Team Pitching" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['wp', 'bk', 'er']" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['wp', 'bk', 'er']\n", "cols" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "wp 64799\n", "bk 9919\n", "er 841197\n", "dtype: int64" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# retro_pitching was wrangled from the output of cwdaily\n", "p = retro_pitching[cols].agg('sum')\n", "p" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "wp 64799\n", "bk 9919\n", "er 841197\n", "dtype: int64" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# retro_team_game was wrangled from the output of cwgame\n", "tg = retro_team_game[cols].agg('sum')\n", "tg" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Retrosheet is completely consistent between cwdaily and cwgame for fielding\n", "p.equals(tg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lahman Pitching vs Lahman Team Pitching\n", "Most of the common column names between pitching and team pitching are for the batting stats, not the pitching stats.\n", "\n", "Notes:\n", "* era cannot be summed\n", "* shut outs (sho) per team are by MLB rules recorded differently that shut outs per pitcher\n", " * if 2+ pitchers combine to pitch a shut out, the team gets credit for a shut out but neither pitcher does\n", "* earned runs (er) per team are by MLB rules recorded differently than earned runs per pitcher\n", " * if a relief pitcher comes in with two outs, the inning having already been extended by an error, that relief pitcher can give up earned runs that are changed to him but not to the team as the inning should have been over" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exclude = ['lg_id', 'team_id', 'year', 'g', 'era',\n", " 'bb', 'h', 'hbp', 'hr', 'r', 'sf', 'so', 'sho', 'er']\n", "key = ['team_id', 'year']\n", "cols = set(lahman_pitching.columns) & set(lahman_teams.columns) - set(exclude)\n", "cols = list(cols)\n", "len(cols)" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['cg', 'l', 'sv', 'w', 'ip_outs']" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>cg</th>\n", " <th>l</th>\n", " <th>sv</th>\n", " <th>w</th>\n", " <th>ip_outs</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA</td>\n", " <td>1997</td>\n", " <td>9</td>\n", " <td>78</td>\n", " <td>39</td>\n", " <td>84</td>\n", " <td>4364</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA</td>\n", " <td>1998</td>\n", " <td>3</td>\n", " <td>77</td>\n", " <td>52</td>\n", " <td>85</td>\n", " <td>4332</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA</td>\n", " <td>1999</td>\n", " <td>4</td>\n", " <td>92</td>\n", " <td>37</td>\n", " <td>70</td>\n", " <td>4294</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA</td>\n", " <td>2000</td>\n", " <td>5</td>\n", " <td>80</td>\n", " <td>46</td>\n", " <td>82</td>\n", " <td>4344</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year cg l sv w ip_outs\n", "0 ANA 1997 9 78 39 84 4364\n", "1 ANA 1998 3 77 52 85 4332\n", "2 ANA 1999 4 92 37 70 4294\n", "3 ANA 2000 5 80 46 82 4344" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p = lahman_pitching[key + cols].groupby(key).agg('sum').reset_index()\n", "p.head(4)" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>cg</th>\n", " <th>l</th>\n", " <th>sv</th>\n", " <th>w</th>\n", " <th>ip_outs</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>ANA</td>\n", " <td>1997</td>\n", " <td>9</td>\n", " <td>78</td>\n", " <td>39</td>\n", " <td>84</td>\n", " <td>4364</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>ANA</td>\n", " <td>1998</td>\n", " <td>3</td>\n", " <td>77</td>\n", " <td>52</td>\n", " <td>85</td>\n", " <td>4332</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>ANA</td>\n", " <td>1999</td>\n", " <td>4</td>\n", " <td>92</td>\n", " <td>37</td>\n", " <td>70</td>\n", " <td>4294</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>ANA</td>\n", " <td>2000</td>\n", " <td>5</td>\n", " <td>80</td>\n", " <td>46</td>\n", " <td>82</td>\n", " <td>4344</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year cg l sv w ip_outs\n", "0 ANA 1997 9 78 39 84 4364\n", "1 ANA 1998 3 77 52 85 4332\n", "2 ANA 1999 4 92 37 70 4294\n", "3 ANA 2000 5 80 46 82 4344" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t = lahman_teams[key+cols].sort_values(key).reset_index(drop=True)\n", "t.head(4)" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "# dtypes need to be the same\n", "for col in p.columns:\n", " if not col == 'year' and not col == 'team_id':\n", " p[col] = p[col].astype('int')\n", " t[col] = t[col].astype('int')" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cg 0\n", "l 1\n", "sv 0\n", "w 1\n", "ip_outs 0\n", "dtype: int64" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(p[cols] - t[cols]).max()" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "m = pd.merge(p, t, on = ['team_id','year'], suffixes=['_p', '_t'])\n", "s_cols = m.columns.sort_values()\n", "m = m[s_cols]" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>l_p</th>\n", " <th>l_t</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>196</th>\n", " <td>CHA</td>\n", " <td>1979</td>\n", " <td>86</td>\n", " <td>87</td>\n", " </tr>\n", " <tr>\n", " <th>329</th>\n", " <td>CLE</td>\n", " <td>1974</td>\n", " <td>84</td>\n", " <td>85</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year l_p l_t\n", "196 CHA 1979 86 87\n", "329 CLE 1974 84 85" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team_id</th>\n", " <th>year</th>\n", " <th>w_p</th>\n", " <th>w_t</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>407</th>\n", " <td>DET</td>\n", " <td>1979</td>\n", " <td>84</td>\n", " <td>85</td>\n", " </tr>\n", " <tr>\n", " <th>1184</th>\n", " <td>TEX</td>\n", " <td>1974</td>\n", " <td>83</td>\n", " <td>84</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team_id year w_p w_t\n", "407 DET 1979 84 85\n", "1184 TEX 1974 83 84" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# display the differences\n", "for col in cols:\n", " cp = col+'_p'\n", " ct = col+'_t'\n", " query = cp + ' != '+ ct\n", " tmp = m.query(query)[['team_id', 'year'] +[cp, ct]]\n", " if len(tmp):\n", " display(tmp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Twice, the team value for wins and losses was 1 greater than the sum of that team's pitchers wins and losses. \n", "\n", "Unlike shut outs and earned runs which are intentionally counted differently between individual and teams, wins and losses should be counted the same. The above shows a minor data discrepancy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "It may seem like a lot of work to cross check the data in so many different ways, but doing so provides accuracy bounds which are important to the subsequent data analysis." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }