{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluating Passes Assignemnt\n",
"\n",
"The chief scout of your club wants to find good passers of the ball. He is interested in pass success rate, but realises also that the pure success rate isn't a reliable stat for his purposes. If players always take the safest option then they will never create.\n",
"\n",
"What he really wants to find is players who are able to make difficult, but successful passes. To solve this problem start by creating a statistical model (logistic regression for example) that predicts pass success as a function of where on the pitch the pass is taken. In the first model use the only the start co-ordinates of the pass. to predict success. \n",
"\n",
"Improve your model as much as possible by including x^2 , y^2 , x*y, goal angle, end co-ordinates of the pass etc. You can also include whether the pass was a cross or another type of ball. \n",
"\n",
"When you feel you can't improve your model any more, then use it to rank players in the Wyscout free data. Which players are particularly good at making difficult passes?\n",
"\n",
"**Submission should consist of 2 parts.**\n",
"\n",
"1. A two page document containing: a non-technical description of how your method works\n",
"an explanation of the strengths and weaknesses of your approach.\n",
"choose one playing position (goalkeeper, full back, centre back, central midfielder, attacking midfielder or striker) and one of the leagues \n",
"2. A runnable, commented code as a (preferably) Python or R script that generates all the plots from the report and explains the method you have used. Important: this code should be a single file run immediately if placed with in the same directory as the Wyscout folder, which in turn contains Wyscout/events etc. i.e. exactly as is done in the expected goals tutorial. Do not use non-standard libraries and make sure there are no errors when run. \n",
"\n",
"This hand-in is graded on a mark 0-10. The total possible points for the course is 40. To pass with grade 3 requires 18 points, grade 4 requires 25 or higher, grade 5 is 32 or higher. This hand-in is also anonymously peer-reviewed, to help you improve, but the grade is decided by the teacher.\n",
"\n"
]
},
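{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the first model described in the brief, the sketch below fits a logistic regression of pass success on the start co-ordinates only, using `statsmodels`. It runs on synthetic data, not Wyscout events: the `df_demo` table, the coefficients used to simulate outcomes, and the `xPass` column name are all assumptions made purely for this example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"import statsmodels.formula.api as smf\n",
"\n",
"# Synthetic stand-in for the pass table (NOT Wyscout data): start co-ordinates on a 0-100 grid\n",
"rng = np.random.default_rng(0)\n",
"n = 5000\n",
"df_demo = pd.DataFrame({'start_x': rng.uniform(0, 100, n), 'start_y': rng.uniform(0, 100, n)})\n",
"\n",
"# Assumed ground truth for the demo only: passes get harder further up the pitch\n",
"p_true = 1 / (1 + np.exp(-(2.5 - 0.03 * df_demo['start_x'])))\n",
"df_demo['successFlag'] = rng.binomial(1, p_true)\n",
"\n",
"# Baseline model: pass success as a function of the start co-ordinates only\n",
"xpass_model = smf.logit('successFlag ~ start_x + start_y', data=df_demo).fit(disp=0)\n",
"\n",
"# Predicted success probability for each pass\n",
"df_demo['xPass'] = xpass_model.predict(df_demo)\n",
"print(xpass_model.params)\n",
"```\n",
"\n",
"Ranking players by the gap between actual and predicted success (`successFlag - xPass`) is one possible way to reward difficult passes; the features and model used for the final ranking may well differ.\n"
]
},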
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# standard imports\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from collections import Counter\n",
"import json\n",
"import os\n",
"\n",
"# stats packages\n",
"import statsmodels.api as sm\n",
"import statsmodels.formula.api as smf\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn import metrics\n",
"from sklearn.calibration import calibration_curve\n",
"\n",
"# pprint to make json easier to read\n",
"import pprint as pp\n",
"\n",
"# plotting\n",
"from mplsoccer.pitch import Pitch\n",
"\n",
"# to deal with the unicode characters of players names / team names in Wyscout\n",
"import codecs\n",
"\n",
"pd.set_option('display.max_rows', 500)\n",
"pd.set_option('display.max_columns', 100)\n",
"pd.options.mode.chained_assignment = None\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Helper Functions"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def show_event_breakdown(df_events, dic_tags):\n",
" \"\"\"\n",
" Produces a full breakdown of the events, subevents, and the tags for the Wyscout dataset\n",
" Use this to look at the various tags attributed to the event taxonomy\n",
" \"\"\"\n",
"\n",
" df_event_breakdown = df_events.groupby(['eventName','subEventName'])\\\n",
" .agg({'id':'nunique','tags':lambda x: list(x)})\\\n",
" .reset_index()\\\n",
" .rename(columns={'id':'numSubEvents','tags':'tagList'})\n",
"\n",
" # creating a histogram of the tags per sub event\n",
" df_event_breakdown['tagHist'] = df_event_breakdown.tagList.apply(lambda x: Counter([dic_tags[j] for i in x for j in i]))\n",
"\n",
" dic = {}\n",
"\n",
" for i, cols in df_event_breakdown.iterrows():\n",
" eventName, subEventName, numEvents, tagList, tagHist = cols\n",
"\n",
" for key in tagHist:\n",
"\n",
" dic[f'{i}-{key}'] = [eventName, subEventName, numEvents, key, tagHist[key]]\n",
"\n",
" df_event_breakdown = pd.DataFrame.from_dict(dic, orient='index', columns=['eventName','subEventName','numSubEvents','tagKey','tagFrequency'])\\\n",
" .sort_values(['eventName','numSubEvents','tagFrequency'], ascending=[True, False, False])\\\n",
" .reset_index(drop=True)\\\n",
"\n",
" return df_event_breakdown"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def home_and_away(df):\n",
" \"\"\"\n",
" Picks out the home and away teamIds and their scores\n",
" \"\"\"\n",
" teamsData = df['teamsData']\n",
" \n",
" for team in teamsData:\n",
" teamData = teamsData[team]\n",
" if teamData.get('side') == 'home':\n",
" homeTeamId = team\n",
" homeScore = teamData.get('score')\n",
" homeFormation = teamData.get('hasFormation')\n",
" elif teamData.get('side') == 'away':\n",
" awayTeamId = team\n",
" awayScore = teamData.get('score')\n",
" \n",
" df['homeTeamId'], df['homeScore'] = homeTeamId, homeScore\n",
" df['awayTeamId'], df['awayScore'] = awayTeamId, awayScore\n",
" \n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def possession_indicator(df):\n",
" \"\"\"\n",
" Function that identifies which team is in possession of the ball\n",
" If the event is a found, interruption of offside, return a 0\n",
" Winner of a duel is deemed in possession of the ball\n",
" \"\"\"\n",
" \n",
" # team identifiers\n",
" teamId = df['teamId']\n",
" homeTeamId = df['homeTeamId']\n",
" awayTeamId = df['awayTeamId']\n",
" teams = set([homeTeamId, awayTeamId])\n",
" otherTeamId = list(teams - set([teamId]))[0]\n",
" \n",
" # eventName and subEventNames\n",
" eventName = df['eventName']\n",
" \n",
" # success flag\n",
" successFlag = df['successFlag']\n",
" \n",
" # assigning possession teamId\n",
" if eventName in ['Pass','Free Kick','Others on the ball','Shot','Save attempt','Goalkeeper leaving line']:\n",
" possessionTeamId = teamId\n",
" elif eventName == 'Duel':\n",
" possessionTeamId = teamId if successFlag == 1 else otherTeamId\n",
" else:\n",
" possessionTeamId = np.NaN\n",
" \n",
" return possessionTeamId\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def strong_foot_flag(df):\n",
" \"\"\"\n",
" Compare foot of pass with footedness of player\n",
" Provides flag = 1 if pass played with strong foot of the player\n",
" \"\"\"\n",
" tags = df['tags']\n",
" foot = df['foot']\n",
" \n",
" # tags\n",
" if 401 in tags:\n",
" passFoot = 'L'\n",
" elif 402 in tags:\n",
" passFoot = 'R'\n",
" elif 403 in tags:\n",
" passFoot = 'H'\n",
" else:\n",
" passFoot = 'N'\n",
" \n",
" # feature\n",
" if (passFoot == 'L') and (foot in ['L','B']):\n",
" strongFlag = 1\n",
" elif (passFoot == 'R') and (foot in ['R','B']):\n",
" strongFlag = 1\n",
" else:\n",
" strongFlag = 0\n",
" \n",
" return strongFlag\n",
"\n",
"\n",
"def weak_foot_flag(df):\n",
" \"\"\"\n",
" Compare foot of pass with footedness of player\n",
" Provides flag = 1 if pass played with weak foot of the player\n",
" \"\"\"\n",
" tags = df['tags']\n",
" foot = df['foot']\n",
" \n",
" # tags\n",
" if 401 in tags:\n",
" passFoot = 'L'\n",
" elif 402 in tags:\n",
" passFoot = 'R'\n",
" elif 403 in tags:\n",
" passFoot = 'H'\n",
" else:\n",
" passFoot = 'N'\n",
" \n",
" # feature\n",
" if (passFoot == 'L') and (foot == 'R'):\n",
" weakFlag = 1\n",
" elif (passFoot == 'R') and (foot == 'L'):\n",
" weakFlag = 1\n",
" else:\n",
" weakFlag = 0\n",
" \n",
" return weakFlag"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Data Loader Functions\n",
"\n",
"* Players\n",
"* Teams\n",
"* Tags\n",
"* Matches\n",
"* Formations\n",
"* Events"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def get_players(player_file):\n",
" \"\"\"\n",
" Returns dataframe of players\n",
" \"\"\"\n",
" \n",
" with open(player_file) as f:\n",
" players_data = json.load(f)\n",
"\n",
" player_cols = ['playerId','shortName','foot','height','weight','birthDate','birthCountry','role','roleCode']\n",
"\n",
" df_players = pd.DataFrame([[i.get('wyId'),codecs.unicode_escape_decode(i.get('shortName'))[0],i.get('foot'),i.get('height'),i.get('weight'),i.get('birthDate')\\\n",
" ,i.get('passportArea').get('name'),i.get('role').get('name'),i.get('role').get('code3')] for i in players_data], columns = player_cols)\n",
"\n",
" return df_players\n",
"\n",
"\n",
"\n",
"def get_teams(team_file):\n",
" \"\"\"\n",
" Returns dataframe of teams\n",
" \"\"\"\n",
" \n",
" with open(team_file) as f:\n",
" teams_data = json.load(f)\n",
"\n",
" team_cols = ['teamId','teamName','officialTeamName','teamType','teamArea']\n",
"\n",
" df_teams = pd.DataFrame([[i.get('wyId'),codecs.unicode_escape_decode(i.get('name'))[0],codecs.unicode_escape_decode(i.get('officialName'))[0],i.get('type')\\\n",
" ,i.get('area').get('name')] for i in teams_data], columns=team_cols)\n",
" \n",
" return df_teams\n",
"\n",
"\n",
"\n",
"dic_tags = {\n",
" 101: 'Goal',\n",
" 102: 'Own goal',\n",
" 301: 'Assist',\n",
" 302: 'Key pass',\n",
" 1901: 'Counter attack',\n",
" 401: 'Left foot',\n",
" 402: 'Right foot',\n",
" 403: 'Head/body',\n",
" 1101: 'Direct',\n",
" 1102: 'Indirect',\n",
" 2001: 'Dangerous ball lost',\n",
" 2101: 'Blocked',\n",
" 801: 'High',\n",
" 802: 'Low',\n",
" 1401: 'Interception',\n",
" 1501: 'Clearance',\n",
" 201: 'Opportunity',\n",
" 1301: 'Feint',\n",
" 1302: 'Missed ball',\n",
" 501: 'Free space right',\n",
" 502: 'Free space left',\n",
" 503: 'Take on left',\n",
" 504: 'Take on right',\n",
" 1601: 'Sliding tackle',\n",
" 601: 'Anticipated',\n",
" 602: 'Anticipation',\n",
" 1701: 'Red card',\n",
" 1702: 'Yellow card',\n",
" 1703: 'Second yellow card',\n",
" 1201: 'Position: Goal low center',\n",
" 1202: 'Position: Goal low right',\n",
" 1203: 'Position: Goal center',\n",
" 1204: 'Position: Goal center left',\n",
" 1205: 'Position: Goal low left',\n",
" 1206: 'Position: Goal center right',\n",
" 1207: 'Position: Goal high center',\n",
" 1208: 'Position: Goal high left',\n",
" 1209: 'Position: Goal high right',\n",
" 1210: 'Position: Out low right',\n",
" 1211: 'Position: Out center left',\n",
" 1212: 'Position: Out low left',\n",
" 1213: 'Position: Out center right',\n",
" 1214: 'Position: Out high center',\n",
" 1215: 'Position: Out high left',\n",
" 1216: 'Position: Out high right',\n",
" 1217: 'Position: Post low right',\n",
" 1218: 'Position: Post center left',\n",
" 1219: 'Position: Post low left',\n",
" 1220: 'Position: Post center right',\n",
" 1221: 'Position: Post high center',\n",
" 1222: 'Position: Post high left',\n",
" 1223: 'Position: Post high right',\n",
" 901: 'Through',\n",
" 1001: 'Fairplay',\n",
" 701: 'Lost',\n",
" 702: 'Neutral',\n",
" 703: 'Won',\n",
" 1801: 'Accurate',\n",
" 1802: 'Not accurate'\n",
"}\n",
"\n",
"\n",
"\n",
"def get_matches(match_repo):\n",
" \"\"\"\n",
" Return dataframe of matches\n",
" \"\"\"\n",
" \n",
" match_files = os.listdir(match_repo)\n",
"\n",
" lst_df_matches = []\n",
"\n",
" # note, this does not include groupName\n",
" match_cols = [\"status\",\"roundId\",\"gameweek\",\"teamsData\",\"seasonId\",\"dateutc\",\"winner\",\"venue\"\\\n",
" ,\"wyId\",\"label\",\"date\",\"referees\",\"duration\",\"competitionId\",\"source\"]\n",
"\n",
" for match_file in match_files:\n",
"\n",
" print (f'Processing {match_file}...')\n",
"\n",
" with open(f'matches/{match_file}') as f:\n",
" data = json.load(f)\n",
" df = pd.DataFrame(data)\n",
"\n",
" # adding some file source metadata\n",
" df['source'] = match_file.replace('matches_','').replace('.json','')\n",
"\n",
" # dealing with the groupName column that's only in the international competitions\n",
" df = df[match_cols]\n",
" lst_df_matches.append(df)\n",
"\n",
" # concatenating match files\n",
" df_matches = pd.concat(lst_df_matches, ignore_index=True)\n",
"\n",
" # applying home and away transformations using helper functions\n",
" df_matches = df_matches.apply(home_and_away, axis=1)\n",
"\n",
" # and changing the wyId to matchId\n",
" df_matches = df_matches.rename(columns={'wyId':'matchId'})\n",
"\n",
" # and filtering columns (may want to change this later)\n",
" match_cols_final = [\"source\",\"competitionId\",\"seasonId\",\"roundId\",\"gameweek\",\"matchId\",\"teamsData\",\"dateutc\",\"date\"\\\n",
" ,\"homeTeamId\",\"homeScore\",\"awayTeamId\",\"awayScore\",\"duration\",\"winner\",\"venue\",\"label\"]\n",
"\n",
" df_matches = df_matches[match_cols_final] \n",
" \n",
" return df_matches\n",
"\n",
"\n",
"\n",
"def get_formations(df_matches):\n",
" \"\"\"\n",
" Returns dataframe of formations within a match for all matches\n",
" Adapted from https://github.com/CleKraus/soccer_analytics\n",
" \"\"\"\n",
"\n",
" lst_formations = list()\n",
" \n",
" for idx, match in df_matches.iterrows():\n",
"\n",
" matchId = match['matchId']\n",
"\n",
" # loop through the two teams\n",
" for team in [0, 1]:\n",
" team = match['teamsData'][list(match['teamsData'])[team]]\n",
" teamId = team['teamId']\n",
"\n",
" # get all players that started on the bench\n",
" player_bench = [player['playerId'] for player in team['formation']['bench']]\n",
" df_bench = pd.DataFrame()\n",
" df_bench['playerId'] = player_bench\n",
" df_bench['lineup'] = 0\n",
"\n",
" # get all players that were in the lineup\n",
" player_lineup = [\n",
" player['playerId'] for player in team['formation']['lineup']\n",
" ]\n",
" df_lineup = pd.DataFrame()\n",
" df_lineup['playerId'] = player_lineup\n",
" df_lineup['lineup'] = 1\n",
"\n",
" # in case there were no substitutions in the match\n",
" if team['formation']['substitutions'] == 'null':\n",
" player_in = []\n",
" player_out = []\n",
" sub_minute = []\n",
" # if there were substitutions\n",
" else:\n",
" player_in = [\n",
" sub['playerIn'] for sub in team['formation']['substitutions']\n",
" ]\n",
" player_out = [\n",
" sub['playerOut'] for sub in team['formation']['substitutions']\n",
" ]\n",
" sub_minute = [\n",
" sub['minute'] for sub in team['formation']['substitutions']\n",
" ]\n",
"\n",
" # build a data frame who and when was substituted in\n",
" df_player_in = pd.DataFrame()\n",
" df_player_in['playerId'] = player_in\n",
" df_player_in['substituteIn'] = sub_minute\n",
"\n",
" # build a data frame who and when was substituted out\n",
" df_player_out = pd.DataFrame()\n",
" df_player_out['playerId'] = player_out\n",
" df_player_out['substituteOut'] = sub_minute\n",
"\n",
" # get the formation by concatenating lineup and bench players\n",
" df_formation = pd.concat([df_lineup, df_bench], axis=0)\n",
" df_formation['matchId'] = matchId\n",
" df_formation['teamId'] = teamId\n",
"\n",
" # add information about substitutions\n",
" df_formation = pd.merge(df_formation, df_player_in, how='left')\n",
" df_formation = pd.merge(df_formation, df_player_out, how='left')\n",
"\n",
" lst_formations.append(df_formation)\n",
"\n",
" df_formations = pd.concat(lst_formations)\n",
"\n",
" # get the minute the player started and the minute the player ended the match\n",
" df_formations['minuteStart'] = np.where(\n",
" df_formations['substituteIn'].isnull(), 0, df_formations['substituteIn']\n",
" )\n",
" df_formations['minuteEnd'] = np.where(\n",
" df_formations['substituteOut'].isnull(), 90, df_formations['substituteOut']\n",
" )\n",
"\n",
" # make sure the match always lasts 90 minutes\n",
" df_formations['minuteStart'] = np.minimum(df_formations['minuteStart'], 90)\n",
" df_formations['minuteEnd'] = np.minimum(df_formations['minuteEnd'], 90)\n",
"\n",
" # set minuteEnd to 0 in case the player was not in the lineup and did not get substituted in\n",
" df_formations['minuteEnd'] = np.where(\n",
" (df_formations['lineup'] == 0) & (df_formations['substituteIn'].isnull()),\n",
" 0,\n",
" df_formations['minuteEnd'],\n",
" )\n",
"\n",
" # compute the minutes played\n",
" df_formations['minutesPlayed'] = (\n",
" df_formations['minuteEnd'] - df_formations['minuteStart']\n",
" )\n",
"\n",
" # use a binary flag of substitution rather than a minute and NaNs\n",
" df_formations['substituteIn'] = 1 * (df_formations['substituteIn'].notnull())\n",
" df_formations['substituteOut'] = 1 * (df_formations['substituteOut'].notnull())\n",
"\n",
" return df_formations\n",
"\n",
"\n",
"\n",
"def get_events(event_repo, leagueSelectionFlag = 0, leagueSelection = 'England'):\n",
" \"\"\"\n",
" Returns dataframe of events\n",
" \"\"\"\n",
" \n",
" events_files = os.listdir(event_repo)\n",
" \n",
" lst_df_events = []\n",
"\n",
" if leagueSelectionFlag == 1:\n",
" events_files = [i for i in events_files if i == f'events_{leagueSelection}.json']\n",
"\n",
" event_cols = ['source','matchId','matchPeriod','eventSec','teamId','id','eventId','eventName','subEventId','subEventName','playerId','positions','tags']\n",
"\n",
" for events_file in events_files:\n",
"\n",
" print (f'Processing {events_file}...')\n",
"\n",
" with open(f'events/{events_file}') as f:\n",
" data = json.load(f)\n",
" df = pd.DataFrame(data)\n",
" df['source'] = events_file.replace('events_','').replace('.json','') \n",
" lst_df_events.append(df)\n",
"\n",
" df_events = pd.concat(lst_df_events, ignore_index=True)\n",
"\n",
" # applying column re-ordering\n",
" df_events = df_events[event_cols]\n",
" \n",
" return df_events"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Event Feature Engineering Function\n",
"\n",
"* Rejigs tags\n",
"* Applies `homeFlag`\n",
"* Applies event `successFlag`\n",
"* Applies `matchEventIndex` (an ordering of every event that occurs within a match from 1-n)\n",
"* Applies `possessionTeamId` (the teamId that's in possession of the ball)\n",
"* Applies `possessionSequenceIndex`\n",
"* Applies `goalDelta` (the game state)\n",
"* Applies `numReds` (the cumulative number of red cards a team has accrued throughout a match)\n",
"* Applies `weakFlag` and `strongFlag` (for footedness of player and foot used for pass)\n",
"* Unpacks `positions`\n",
"* Applies `possessionStartSec`\n",
"* Applies `playerPossessionTimeSec`\n",
"* Re-orders and filters `df_events`"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def event_feature_engineering(df_events):\n",
" \"\"\"\n",
" Takes in raw df_events dataframe and returns an augmented df_events dataframe with features for xPass model feature engineering\n",
" \"\"\"\n",
" \n",
" # Re-jigging tags -> list of integers\n",
" print ('Rejigging tags...')\n",
" df_events['tags'] = df_events.tags.apply(lambda x: [i.get('id') for i in x])\n",
" \n",
" \n",
" # Applies homeFlag by 1) first merging on df_matches and then 2) applying helper function\n",
" print ('Applying homeFlag...')\n",
" ## 1)\n",
" df_events = df_events.merge(df_matches, on=['matchId','source'], how = 'inner')\n",
" ## 2)\n",
" df_events['homeFlag'] = df_events.apply(lambda x: 1 if int(x.teamId) == int(x.homeTeamId) else 0, axis=1)\n",
" \n",
" \n",
" # Applying success flag\n",
" print ('Applying successFlag...')\n",
" df_events['successFlag'] = df_events.tags.apply(lambda x: 1 if 1801 in x else 0)\n",
" \n",
" \n",
" # 1) Ordering of events so that they're in precisely chronological order, and then 2) resorting (as the merge with df_matches will cause df_events to become unsorted)\n",
" print ('Applying matchEventIndex...')\n",
" ## 1)\n",
" df_events['matchEventIndex'] = df_events.sort_values(['matchId','matchPeriod','eventSec'], ascending=[True, True, True])\\\n",
" .groupby('matchId')\\\n",
" .cumcount() + 1\n",
" ## 2)\n",
" df_events = df_events.sort_values(['matchId','matchEventIndex'], ascending=[True,True])\n",
" \n",
" \n",
" # 1) Applying possession team indicator and then 2) forward filling the NaNs with the existing team (until possession is explicitly transferred)\n",
" print ('Applying possessionTeamId...')\n",
" ## 1)\n",
" df_events['possessionTeamId'] = df_events.apply(possession_indicator, axis=1)\n",
" ## 2) Filling the nans\n",
" df_events['possessionTeamId'] = df_events.possessionTeamId.fillna(method='ffill')\n",
" \n",
" \n",
" # Sequencing the possessions (each possession will have it's own index per match)\n",
" print ('Applying possessionSequenceIndex...')\n",
" ## 1) initiate sequence at 0\n",
" df_events['possessionSequenceIndex'] = 0\n",
" ## 2) every time there's a change in sequence, you set a value of 1\n",
" df_events['possessionSequenceIndex'][((df_events['possessionTeamId'] != df_events['possessionTeamId'].shift(1))) \\\n",
" | ((df_events['matchPeriod'] != df_events['matchPeriod'].shift(1)))] = 1\n",
" ## 3) take a cumulative sum of the 1s per match\n",
" df_events['possessionSequenceIndex'] = df_events.groupby('matchId')['possessionSequenceIndex'].cumsum()\n",
" \n",
" \n",
" # Applying Game State\n",
" ## Note this method is only 95% accurate; suspect that's sufficiently fine for this feature for this application\n",
" print ('Applying gameState...')\n",
" ## 1) getting goals scored flag\n",
" df_events['goalScoredFlag'] = df_events.apply(lambda x: 1 if 101 in x.tags and x.eventName in ['Shot','Free Kick'] else 0, axis=1)\n",
" ## 2) getting goal conceded flag\n",
" df_events['goalsConcededFlag'] = df_events.apply(lambda x: 1 if 101 in x.tags and x.eventName == 'Save attempt' else 0, axis=1)\n",
" ## 3) Cumulatively summing the goals scored\n",
" df_events['goalsScored'] = df_events.sort_values(['matchId','matchPeriod','eventSec'], ascending=[True, True, True])\\\n",
" .groupby(['matchId','teamId'])\\\n",
" ['goalScoredFlag'].cumsum()\n",
" ## 4) Cumulatively summing the goals conceded\n",
" df_events['goalsConceded'] = df_events.sort_values(['matchId','matchPeriod','eventSec'], ascending=[True, True, True])\\\n",
" .groupby(['matchId','teamId'])\\\n",
" ['goalsConcededFlag'].cumsum()\n",
" ## 5) Calculating the goal delta\n",
" df_events['goalDelta'] = df_events['goalsScored'] - df_events['goalsConceded']\n",
" \n",
" \n",
" # Applying red cards to calculate the difference in the number of players on each team\n",
" print ('Applying numReds...')\n",
" ## 1) Applying red card flag\n",
" df_events['redCardFlag'] = df_events.tags.apply(lambda x: -1 if 1701 in x else 0)\n",
" \n",
" ## 2) Applying Excess Player flag to the other team\n",
" df_reds = df_events.loc[df_events['redCardFlag'] == -1, ['matchId','teamId','matchEventIndex','id']]\n",
"\n",
" lst_redOtherTeamFlag = []\n",
"\n",
" for idx, cols in df_reds.iterrows():\n",
" matchId, teamId, matchEventIndex, Id = cols\n",
" try:\n",
" redOtherTeamId = df_events.loc[(df_events['matchId'] == matchId) & (df_events['teamId'] != teamId) & (df_events['matchEventIndex'] > matchEventIndex)].sort_values('matchEventIndex', ascending=True)['id'].values[0]\n",
" lst_redOtherTeamFlag.append(redOtherTeamId)\n",
" except:\n",
" continue\n",
"\n",
" df_events.loc[df_events['id'].isin(lst_redOtherTeamFlag), 'redCardFlag'] = 1\n",
" \n",
" ## 3) Cumulatively summing the number of red cards on a team throughout a game\n",
" df_events['numReds'] = df_events.sort_values(['matchId','matchPeriod','eventSec'], ascending=[True, True, True])\\\n",
" .groupby(['matchId','teamId'])\\\n",
" ['redCardFlag'].cumsum()\n",
" \n",
" \n",
" # Applying strong and weak foot flags\n",
" print ('Applying weakFlag and strongFlag for footedness...')\n",
" ## 1) adding player metadata\n",
" df_events = df_events.merge(df_players, on='playerId', how='inner')\n",
" ## 2) Cleaning up the foot preference flags of the players\n",
" df_events['foot'] = df_events.foot.apply(lambda x: 'L' if x == 'left' else 'R' if x == 'right' else 'B' if x == 'both' else 'N')\n",
" ## 3) Applying weak foot flag (mainly impacts crosses)\n",
" df_events['weakFlag'] = df_events.apply(weak_foot_flag, axis=1)\n",
" ## 4) Applying strong foot flag (this isn't seen as significant in the logistic regression, but keeping it in for completeness)\n",
" df_events['strongFlag'] = df_events.apply(strong_foot_flag, axis=1)\n",
" \n",
" \n",
" # Unpacking positions: Found that this multi-lambda method is by far and away the quickest rather than a multi-stage apply when dealing with 3M events\n",
" print ('Unpacking positions...')\n",
" # (this takes about a minute for the full Wyscout dataset which is pretty good)\n",
" ## 1) counting the number of positions found in the position dic\n",
" df_events['numPositions'] = df_events.positions.apply(lambda x: len(x))\n",
" ## 2) Getting the starting x,y\n",
" df_events['startPositions'] = df_events.positions.apply(lambda x: x[0])\n",
" df_events['start_x'] = df_events.startPositions.apply(lambda x: x.get('x'))\n",
" df_events['start_y'] = df_events.startPositions.apply(lambda x: x.get('y'))\n",
" ## 3) Getting the ending x,y\n",
" df_events['endPositions'] = df_events.apply(lambda x: x.positions[1] if x.numPositions == 2 else {}, axis=1)\n",
" df_events['end_x'] = df_events.endPositions.apply(lambda x: x.get('x', None))\n",
" df_events['end_y'] = df_events.endPositions.apply(lambda x: x.get('y', None))\n",
" \n",
" \n",
" # Getting the time that the team has been in possession until the pass has been made (1) takes a while, but allows 2) to be vectorised)\n",
" print ('Applying possessionStartSec...')\n",
" ## 1) getting the time since the possession started\n",
" df_events['possessionStartSec'] = df_events.loc[df_events.groupby(['matchId','possessionSequenceIndex'])['eventSec'].transform('idxmin'), 'eventSec'].values\n",
" ## 2) calculating the time of the posession\n",
" df_events['possessionTimeSec'] = df_events['eventSec'] - df_events['possessionStartSec']\n",
" \n",
" \n",
" # Getting the time that the player has been in possession\n",
" print ('Applying playerPossessionTimeSec...')\n",
" ## 1) initialising at 0\n",
" df_events['playerPossessionTimeSec'] = 0\n",
" ## 2) checks that the previous event was part of the same possession sequence within the same match, and if it is, calculates possession time in seconds\n",
" df_events['playerPossessionTimeSec'][((df_events['matchId'] == df_events['matchId'].shift(1)) &\\\n",
" (df_events['possessionSequenceIndex'] == df_events['possessionSequenceIndex'].shift(1)))]\\\n",
" = df_events['eventSec'] - df_events['eventSec'].shift(1)\n",
" \n",
" \n",
" # Getting previous event\n",
" print ('Grabbing previous event...')\n",
" df_events['previousSubEventName'] = 'Match Start'\n",
" df_events['previousSubEventName'][df_events['matchId'] == df_events['matchId'].shift(1)] = df_events['subEventName'].shift(1)\n",
"\n",
" \n",
" # finally, tidying up columns\n",
" df_events = df_events[['source','matchId','matchPeriod','eventSec','possessionTimeSec','playerPossessionTimeSec','matchEventIndex','teamId','homeTeamId','homeScore','awayTeamId','awayScore','homeFlag','id'\\\n",
" ,'eventName','subEventName','previousSubEventName','possessionTeamId','possessionSequenceIndex','playerId','shortName','roleCode','strongFlag','weakFlag','goalDelta','numReds'\\\n",
" ,'start_x','start_y','end_x','end_y','tags','successFlag']].sort_values(['matchId','matchEventIndex'], ascending=[True,True])\n",
" \n",
" print ('Outputting df_events.')\n",
" return df_events"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def pass_feature_engineering(df_events, pitchLength = 105, pitchWidth = 68, outputToCsvFlag = 0):\n",
" \"\"\"\n",
" Highly vectorised set of transformations\n",
" \n",
" Takes in the feature enriched df_events\n",
" Filters on the different pass types\n",
" Applies pass specific features\n",
" Outputs df_passes\n",
" \"\"\"\n",
" \n",
" dic_passes = {\n",
" 'Simple pass':['Pass','Simple pass'],\n",
" 'High pass':['Pass','High pass'],\n",
" 'Head pass':['Pass','Head pass'],\n",
" 'Cross':['Pass','Cross'],\n",
" 'Launch':['Pass','Launch'],\n",
" 'Smart pass':['Pass','Smart pass'],\n",
" 'Hand pass':['Pass','Hand pass'],\n",
" 'Free kick cross':['Free Kick','Free kick cross'],\n",
" 'Corner':['Free Kick','Corner'],\n",
" 'Free Kick':['Free Kick','Free Kick'],\n",
" 'Throw in':['Free Kick','Throw in']\n",
" }\n",
" \n",
" \n",
" # Filtering df_events on relevant pass events\n",
" ## 1) Applying filter\n",
" print ('Applying pass filter...')\n",
" df_passes = df_events.loc[df_events['subEventName'].isin(list(dic_passes.keys()))].copy()\n",
" ## 2) DQ step: getting rid of two passes that don't have and end co-ord\n",
" df_passes = df_passes.loc[pd.isna(df_passes['end_x']) == False].copy()\n",
" \n",
" \n",
" # Series of geometric transformations\n",
" print ('Applying geometric transformations...')\n",
" ## 1) splitting the pitch into thirds and capturing the transition between thirds\n",
" df_passes['startThird'] = df_passes.start_x.apply(lambda x: 1 if x < 34 else 2 if x < 67 else 3)\n",
" df_passes['endThird'] = df_passes.end_x.apply(lambda x: 1 if x < 34 else 2 if x < 67 else 3)\n",
" df_passes['thirdTransitionDelta'] = df_passes['endThird'] - df_passes['startThird']\n",
" \n",
" ## 2) transforming pitch dimensions from 100x100 grid to dimensions in meters\n",
" df_passes['startPassM_x'] = df_passes.start_x*pitchLength/100\n",
" df_passes['startPassM_y'] = df_passes.start_y*pitchWidth/100\n",
" df_passes['endPassM_x'] = df_passes.end_x*pitchLength/100\n",
" df_passes['endPassM_y'] = df_passes.end_y*pitchWidth/100\n",
"\n",
" ## 3) getting the squares of the x's\n",
" df_passes['startPassM_xSquared'] = df_passes['startPassM_x']**2\n",
" df_passes['endPassM_xSquared'] = df_passes['endPassM_x']**2\n",
"\n",
" ## 4) getting some central y stats and squared stats (same definitions as in David's code)\n",
" df_passes['start_c'] = abs(df_passes['start_y'] - 50)\n",
" df_passes['end_c'] = abs(df_passes['end_y'] - 50)\n",
" df_passes['startM_c'] = df_passes['start_c']*pitchWidth/100\n",
" df_passes['endM_c'] = df_passes['end_c']*pitchWidth/100\n",
" df_passes['start_cSquared'] = df_passes['start_c']**2\n",
" df_passes['end_cSquared'] = df_passes['end_c']**2\n",
" df_passes['startM_cSquared'] = df_passes['startM_c']**2\n",
" df_passes['endM_cSquared'] = df_passes['endM_c']**2\n",
"\n",
" ## 5) getting distance to ball\n",
" df_passes['vec_x'] = df_passes['endPassM_x'] - df_passes['startPassM_x']\n",
" df_passes['vec_y'] = df_passes['endPassM_y'] - df_passes['startPassM_y']\n",
" df_passes['D'] = np.sqrt(df_passes['vec_x']**2 + df_passes['vec_y']**2)\n",
" df_passes['Dsquared'] = df_passes.D**2\n",
" df_passes['Dcubed'] = df_passes.D**3\n",
"\n",
" ## 6) DQ step: getting rid of events where the vec_x = vec_y = 0 (look like data errors)\n",
" df_passes = df_passes.loc[~((df_passes['vec_x'] == 0) & (df_passes['vec_y'] == 0))].copy()\n",
"\n",
" ## 7) calculating passing angle in radians\n",
" df_passes['a'] = np.arctan(df_passes['vec_x'] / abs(df_passes['vec_y']))\n",
" #df_passes['aNew'] = np.arctan(df_passes['vec_x'] / (df_passes['endM_c'] - df_passes['startM_c']))\n",
" \n",
" ## 8) calculating shooting angle from initial position\n",
" df_passes['aShooting'] = np.arctan(7.32 * df_passes['startPassM_x'] / (df_passes['startPassM_x']**2 + df_passes['startM_c']**2 - (7.32/2)**2))\n",
" df_passes['aShooting'] = df_passes.aShooting.apply(lambda x: x+np.pi if x<0 else x)\n",
" \n",
" ## 9) calculating shooting angle from final position (i.e. )\n",
" df_passes['aShootingFinal'] = np.arctan(7.32 * df_passes['endPassM_x'] / (df_passes['endPassM_x']**2 + df_passes['endM_c']**2 - (7.32/2)**2))\n",
" df_passes['aShootingFinal'] = df_passes.aShootingFinal.apply(lambda x: x+np.pi if x<0 else x)\n",
" \n",
" ## 10) change in shooting angle caused by the pass\n",
" df_passes['aShootingChange'] = df_passes['aShootingFinal'] - df_passes['aShooting']\n",
" \n",
" ## 11) distance to goal\n",
" df_passes['DGoalStart'] = np.sqrt((pitchLength - df_passes['startPassM_x'])**2 + df_passes['startM_c']**2)\n",
" df_passes['DGoalEnd'] = np.sqrt((pitchLength - df_passes['endPassM_x'])**2 + df_passes['endM_c']**2)\n",
" df_passes['DGoalChange'] = df_passes['DGoalEnd'] - df_passes['DGoalStart']\n",
" \n",
" ## final) re-ordering cols\n",
" df_passes = df_passes.sort_values(['matchId','matchEventIndex'], ascending=[True,True])\n",
" \n",
" \n",
" \n",
" # Within each possession sequence, applies the pass index (so the first pass in a possession is 1, and the second is 2, etc.)\n",
" print ('Applying passIndexWithinSequence...')\n",
" ## 1) produces index\n",
" df_passes['passIndexWithinSequence'] = df_passes.sort_values(['matchId','possessionSequenceIndex','matchEventIndex'])\\\n",
" .groupby(['matchId','possessionSequenceIndex'])\\\n",
" .cumcount() + 1\n",
" ## 2) LOOKAHEAD BIAS: WILL NOT INCLUDE THIS IN FINAL MODEL\n",
" ## Calculating mean number of passes per possession per team\n",
" df_meanNumPasses = pd.DataFrame(df_passes.groupby(['teamId','possessionSequenceIndex'])\\\n",
" .agg({'passIndexWithinSequence':np.mean})\\\n",
" .groupby('teamId')\\\n",
" .passIndexWithinSequence.mean())\\\n",
" .reset_index()\\\n",
" .rename(columns={'passIndexWithinSequence':'meanNumPassesPerSequence'})\n",
" ## 3) Re-introducing this mean number of passes per possession via a join\n",
" df_passes = df_passes.merge(df_meanNumPasses, how='inner', on='teamId')\n",
" ## 4) getting the over under for the number of passes for that team\n",
" ## COULD POTENTIALLY USE THIS IF HAD MULTIPLE YEARS OF HISTORY, AS IT SHOWS A CHARACTERISTIC OF A TEAM\n",
" df_passes['numPassOverUnder'] = df_passes['passIndexWithinSequence'] - df_passes['meanNumPassesPerSequence']\n",
" \n",
" \n",
" \n",
" # Final set of flags (some are post-hoc so can't be used in the regression, but just adding for completeness)\n",
" print ('Applying final set of flags...')\n",
" ## 1) applying interception flag - this is of course highly correlated to an unsuccessful outcome, so won't be part of the regression\n",
" df_passes['interceptionFlag'] = df_passes.tags.apply(lambda x: 1 if 1401 in x else 0)\n",
" ## 2) applying dangerousBallLostFlag - this will also NOT be part of the regression\n",
" df_passes['dangerousBallLostFlag'] = df_passes.tags.apply(lambda x: 1 if 2001 in x else 0)\n",
" ## 3) counter attack flag\n",
" df_passes['counterAttackFlag'] = df_passes.tags.apply(lambda x: 1 if 1901 in x else 0)\n",
" ## 4) assist flag\n",
" df_passes['assistFlag'] = df_passes.tags.apply(lambda x: 1 if 301 in x else 0)\n",
" \n",
" \n",
" if outputToCsvFlag == 1:\n",
" print ('Outputting df_passes to CSV...')\n",
" df_passes.to_csv('df_passes.csv', index=None)\n",
" \n",
" print ('Outputting df_passes.')\n",
" return df_passes\n"
]
},
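{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shooting-angle features above use a closed-form expression for the angle that the goal mouth subtends at a point on the pitch. As a sanity check, the sketch below compares that expression with the direct difference between the bearings of the two posts. It assumes `x` is the distance in metres to the goal line and `c` is the distance from the central axis of the pitch; the function names are illustrative and not part of this notebook.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"GOAL_WIDTH = 7.32  # metres between the posts\n",
"\n",
"\n",
"def goal_angle_closed_form(x, c):\n",
"    '''Angle (radians) subtended by the goal, using the arctan expression from the feature code.'''\n",
"    angle = math.atan(GOAL_WIDTH * x / (x**2 + c**2 - (GOAL_WIDTH / 2)**2))\n",
"    # arctan is negative when the denominator is negative (very close to goal), so shift by pi\n",
"    return angle + math.pi if angle < 0 else angle\n",
"\n",
"\n",
"def goal_angle_from_posts(x, c):\n",
"    '''Same angle, computed as the difference between the bearings of the two posts.'''\n",
"    return math.atan2(c + GOAL_WIDTH / 2, x) - math.atan2(c - GOAL_WIDTH / 2, x)\n",
"\n",
"\n",
"for x, c in [(11.0, 0.0), (30.0, 10.0), (1.0, 0.0), (5.0, 20.0)]:\n",
"    print(x, c, round(goal_angle_closed_form(x, c), 6), round(goal_angle_from_posts(x, c), 6))\n",
"```\n",
"\n",
"The two columns should agree for every point, including the case close to goal where the raw arctan is negative and the `+ pi` correction applies."
]
},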
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Model Application: Applying Four Models to Produce **xP** Variations to **Test** Data"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# applying basic, added, and advanced models to test data\n",
"def apply_xP_model_to_test(models):\n",
" \"\"\"\n",
" Applying the four different models to produce four xP values\n",
" \"\"\"\n",
" basic, added, adv_canonical, adv_probit = models\n",
" \n",
" print ('Applying models...')\n",
" df_passes_test['xP_basic'] = basic.predict(df_passes_test)\n",
" df_passes_test['xP_added'] = added.predict(df_passes_test)\n",
" df_passes_test['xP_logit'] = adv_canonical.predict(df_passes_test)\n",
" df_passes_test['xP'] = adv_probit.predict(df_passes_test)\n",
" print (f'Done applying {len(models)} models.')\n",
" \n",
" return df_passes_test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Model Validation: Calibration Curves"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def plot_calibration_curve(df_passes_test, show_advanced=1, save_output=0):\n",
"\n",
" fig = plt.figure(figsize=(10, 10))\n",
"\n",
" # Plotting perfect calibration (line y=x)\n",
" plt.plot([0, 1], [0, 1], 'k:', label='Perfectly Calibrated Model')\n",
"\n",
" alpha = 0.6\n",
" numBins = 25\n",
"\n",
" # FOUR calibration curves - Tricky to plot all four at a time, so just do a Simple Vs Advanced\n",
" if show_advanced == 0:\n",
" ## 1) Simple Model\n",
" fraction_of_positives, mean_predicted_value = calibration_curve(df_passes_test.successFlag, df_passes_test.xP_basic, n_bins=numBins)\n",
" plt.plot(mean_predicted_value, fraction_of_positives, \"s-\", label='Basic Model', alpha = alpha, color='red')\n",
"\n",
" ## 2) Added Model\n",
" fraction_of_positives, mean_predicted_value = calibration_curve(df_passes_test.successFlag, df_passes_test.xP_added, n_bins=numBins)\n",
" plt.plot(mean_predicted_value, fraction_of_positives, \"s-\", label='Added Features', alpha = alpha, color='blue')\n",
"\n",
" elif show_advanced == 1:\n",
" ## 3) Advanced Model: Canonical (Logit) Link function\n",
" fraction_of_positives, mean_predicted_value = calibration_curve(df_passes_test.successFlag, df_passes_test.xP_logit, n_bins=numBins)\n",
" plt.plot(mean_predicted_value, fraction_of_positives, \"s-\", label='Advanced Features: Logit Link', alpha = alpha, color='black')\n",
"\n",
" ## 4) Advanced Model: Probit Link function\n",
" fraction_of_positives, mean_predicted_value = calibration_curve(df_passes_test.successFlag, df_passes_test.xP, n_bins=numBins)\n",
" plt.plot(mean_predicted_value, fraction_of_positives, \"s-\", label='Advanced Features: Probit Link', alpha = alpha, color='orange')\n",
"\n",
" plt.ylabel('Fraction of Successful Passes', fontsize=18)\n",
" plt.xlabel('Mean xP', fontsize=18)\n",
"\n",
" plt.ylim([-0.05, 1.05])\n",
" plt.xlim([-0.05, 1.05])\n",
"\n",
" plt.legend(loc=\"lower right\", fontsize=18)\n",
" #plt.title('Calibration Plot', fontsize=24)\n",
"\n",
" plt.yticks(fontsize=14)\n",
" plt.xticks(fontsize=14)\n",
"\n",
" plt.tight_layout()\n",
" \n",
" if save_output == 1:\n",
" plt.savefig(f'calibration_{show_advanced}.pdf', dpi=300, format='pdf', bbox_inches='tight')\n",
" \n",
" return plt.show()"
]
},
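{
"cell_type": "markdown",
"metadata": {},
"source": [
"A calibration curve groups the predictions into probability bins and compares, per bin, the mean predicted xP with the observed pass success rate. The sketch below illustrates that idea in plain Python with equal-width bins. It is a simplified stand-in for `calibration_curve`, not its actual implementation, and the toy data and function name are made up for illustration.\n",
"\n",
"```python\n",
"def calibration_bins(outcomes, probabilities, n_bins=5):\n",
"    '''Return (mean predicted probability, observed success rate, count) for each non-empty bin.'''\n",
"    totals = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of probabilities, successes, count]\n",
"    for outcome, p in zip(outcomes, probabilities):\n",
"        # equal-width bins on [0, 1]; p == 1.0 is placed in the last bin\n",
"        idx = min(int(p * n_bins), n_bins - 1)\n",
"        totals[idx][0] += p\n",
"        totals[idx][1] += outcome\n",
"        totals[idx][2] += 1\n",
"    return [(s / n, k / n, n) for s, k, n in totals if n > 0]\n",
"\n",
"\n",
"# toy data: 1 = successful pass, 0 = unsuccessful pass\n",
"outcomes = [0, 0, 1, 1, 1, 0, 1, 1]\n",
"probabilities = [0.10, 0.30, 0.35, 0.70, 0.75, 0.78, 0.90, 0.95]\n",
"\n",
"for mean_xp, success_rate, n in calibration_bins(outcomes, probabilities, n_bins=5):\n",
"    print(f'mean xP = {mean_xp:.2f}, observed = {success_rate:.2f}, n = {n}')\n",
"```\n",
"\n",
"For a well-calibrated model the two numbers in each bin are close, which is what the dotted `y = x` line in the plot represents."
]
},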
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# Model Validation: Metric Scores\n",
"\n",
"* Brier Score\n",
"* Precision, Recall, F1\n",
"* AUC\n",
"* Accuracy"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def calculate_model_metrics(df_passes_test, xPtype='xP', log_reg_decision_threshold = 0.65):\n",
" '''\n",
" Applies Logistic Regression Decision Threshold (i.e. applying the model to attribute whether a pass would or would have not been successful)\n",
" And calculates a bunch of related metrics\n",
" '''\n",
" \n",
" df_passes_test['predictedSuccess'] = df_passes_test[xPtype].apply(lambda x: 1 if x > log_reg_decision_threshold else 0)\n",
"\n",
" brierScore = metrics.brier_score_loss(df_passes_test.successFlag, df_passes_test[xPtype])\n",
"\n",
" # precision = TRUE POSITIVE / (TRUE POSITIVE + FALSE POSITIVE)\n",
" # ratio of correctly positive observations / all predicted positive observations\n",
" precisionScore = metrics.precision_score(df_passes_test.successFlag, df_passes_test.predictedSuccess)\n",
"\n",
" # recall = TRUE POSITIVE / (TRUE POSITIVE + FALSE NEGATIVE)\n",
" # ratio of correctly positive observations / all true positive observations (that were either correctly picked TP or missed FN)\n",
" recallScore = metrics.recall_score(df_passes_test.successFlag, df_passes_test.predictedSuccess)\n",
"\n",
" # weighted average of precision and recall\n",
" f1Score = metrics.f1_score(df_passes_test.successFlag, df_passes_test.predictedSuccess)\n",
"\n",
" AUCScore = metrics.roc_auc_score(df_passes_test.successFlag, df_passes_test.predictedSuccess)\n",
"\n",
" # overall accuracy score: ratio of all correct over count of all observations\n",
" accuracyScore = metrics.accuracy_score(df_passes_test.successFlag, df_passes_test.predictedSuccess)\n",
"\n",
" return print (f'Brier Score: {brierScore}\\n\\nPrecision Score: {precisionScore}\\n\\nRecall Score: {recallScore}\\n\\nF1 Score: {f1Score}\\n\\nAUC Score: {AUCScore}\\n\\nAccuracyScore: {accuracyScore}')"
]
},
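{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the metric definitions concrete, here is a small hand-computed example in plain Python. The Brier score is calculated directly from the predicted probabilities, whereas precision, recall, F1 and accuracy need a decision threshold first. The toy outcomes, probabilities and helper name are illustrative only; the notebook itself uses `sklearn.metrics`.\n",
"\n",
"```python\n",
"def threshold_metrics(outcomes, probabilities, threshold=0.65):\n",
"    '''Brier score plus thresholded precision, recall, F1 and accuracy for binary outcomes.\n",
"\n",
"    Assumes at least one predicted positive and one actual positive (no zero-division handling).\n",
"    '''\n",
"    n = len(outcomes)\n",
"    brier = sum((p - y) ** 2 for y, p in zip(outcomes, probabilities)) / n\n",
"\n",
"    predicted = [1 if p > threshold else 0 for p in probabilities]\n",
"    tp = sum(1 for y, yhat in zip(outcomes, predicted) if y == 1 and yhat == 1)\n",
"    fp = sum(1 for y, yhat in zip(outcomes, predicted) if y == 0 and yhat == 1)\n",
"    fn = sum(1 for y, yhat in zip(outcomes, predicted) if y == 1 and yhat == 0)\n",
"    tn = n - tp - fp - fn\n",
"\n",
"    precision = tp / (tp + fp)\n",
"    recall = tp / (tp + fn)\n",
"    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall\n",
"    accuracy = (tp + tn) / n\n",
"    return {'brier': brier, 'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}\n",
"\n",
"\n",
"outcomes = [1, 1, 1, 0, 1, 0]\n",
"probabilities = [0.9, 0.8, 0.6, 0.7, 0.7, 0.2]\n",
"print(threshold_metrics(outcomes, probabilities))\n",
"```\n",
"\n",
"Moving the threshold changes every metric except the Brier score, which is one reason to report a probability-based score alongside the thresholded ones."
]
},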
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"**CODE STARTS HERE**\n",
"\n",
"---\n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"# 1) Loading Data\n",
"\n",
"### Loading Players, Teams, Matches, Formations, Events"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Processing matches_World_Cup.json...\n",
"Processing matches_Italy.json...\n",
"Processing matches_Germany.json...\n",
"Processing matches_England.json...\n",
"Processing matches_France.json...\n",
"Processing matches_Spain.json...\n",
"Processing matches_European_Championship.json...\n",
"Processing events_France.json...\n",
"Processing events_Spain.json...\n",
"Processing events_Germany.json...\n",
"Processing events_European_Championship.json...\n",
"Processing events_World_Cup.json...\n",
"Processing events_Italy.json...\n",
"Processing events_England.json...\n",
"Done\n",
"CPU times: user 1min 30s, sys: 5.18 s, total: 1min 36s\n",
"Wall time: 1min 37s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"df_players = get_players('players.json')\n",
"df_teams = get_teams('teams.json')\n",
"df_matches = get_matches('matches')\n",
"df_formations = get_formations(df_matches)\n",
"df_events = get_events('events', leagueSelectionFlag = 0, leagueSelection = 'England')\n",
"\n",
"print ('Done')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# 2) Event Feature Engineering\n",
"\n",
"**Longest part of data preparation due to all of the nested feature extraction from the events data**:\n",
"\n",
"> Takes about 3 minutes if a single league is selected (`leagueSelectionFlag = 1` above).\n",
"\n",
"> Takes about 10 minutes if all leagues and international competitions are thrown into the mix."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Rejigging tags...\n",
"Applying homeFlag...\n",
"Applying successFlag...\n",
"Applying matchEventIndex...\n",
"Applying possessionTeamId...\n",
"Applying possessionSequenceIndex...\n",
"Applying gameState...\n",
"Applying numReds...\n",
"Applying weakFlag and strongFlag for footedness...\n",
"Unpacking positions...\n",
"Applying possessionStartSec...\n",
"Applying playerPossessionTimeSec...\n",
"Grabbing previous event...\n",
"Outputting df_events.\n",
"CPU times: user 9min 45s, sys: 44.4 s, total: 10min 29s\n",
"Wall time: 10min 43s\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" source | \n",
" matchId | \n",
" matchPeriod | \n",
" eventSec | \n",
" possessionTimeSec | \n",
" playerPossessionTimeSec | \n",
" matchEventIndex | \n",
" teamId | \n",
" homeTeamId | \n",
" homeScore | \n",
" awayTeamId | \n",
" awayScore | \n",
" homeFlag | \n",
" id | \n",
" eventName | \n",
" subEventName | \n",
" previousSubEventName | \n",
" possessionTeamId | \n",
" possessionSequenceIndex | \n",
" playerId | \n",
" shortName | \n",
" roleCode | \n",
" strongFlag | \n",
" weakFlag | \n",
" goalDelta | \n",
" numReds | \n",
" start_x | \n",
" start_y | \n",
" end_x | \n",
" end_y | \n",
" tags | \n",
" successFlag | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 1.255990 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 1 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178642 | \n",
" Pass | \n",
" Simple pass | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 26010 | \n",
" O. Giroud | \n",
" FWD | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 50 | \n",
" 48 | \n",
" 47.0 | \n",
" 50.0 | \n",
" [1801] | \n",
" 1 | \n",
"
\n",
" \n",
" 1388 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 2.351908 | \n",
" 1.095918 | \n",
" 0.0 | \n",
" 2 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178643 | \n",
" Pass | \n",
" Simple pass | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 3682 | \n",
" A. Griezmann | \n",
" FWD | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 47 | \n",
" 50 | \n",
" 41.0 | \n",
" 48.0 | \n",
" [1801] | \n",
" 1 | \n",
"
\n",
" \n",
" 3995 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 3.241028 | \n",
" 1.985038 | \n",
" 0.0 | \n",
" 3 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178644 | \n",
" Pass | \n",
" Simple pass | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 31528 | \n",
" N. Kanté | \n",
" MID | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 41 | \n",
" 48 | \n",
" 32.0 | \n",
" 35.0 | \n",
" [1801] | \n",
" 1 | \n",
"
\n",
" \n",
" 7938 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 6.033681 | \n",
" 4.777691 | \n",
" 0.0 | \n",
" 4 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178645 | \n",
" Pass | \n",
" High pass | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 7855 | \n",
" L. Koscielny | \n",
" DEF | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 32 | \n",
" 35 | \n",
" 89.0 | \n",
" 6.0 | \n",
" [1802] | \n",
" 0 | \n",
"
\n",
" \n",
" 10780 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 13.143591 | \n",
" 11.887601 | \n",
" 0.0 | \n",
" 5 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178646 | \n",
" Duel | \n",
" Ground defending duel | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 25437 | \n",
" B. Matuidi | \n",
" MID | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 89 | \n",
" 6 | \n",
" 85.0 | \n",
" 0.0 | \n",
" [702, 1801] | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" source matchId matchPeriod eventSec \\\n",
"0 European_Championship 1694390 1H 1.255990 \n",
"1388 European_Championship 1694390 1H 2.351908 \n",
"3995 European_Championship 1694390 1H 3.241028 \n",
"7938 European_Championship 1694390 1H 6.033681 \n",
"10780 European_Championship 1694390 1H 13.143591 \n",
"\n",
" possessionTimeSec playerPossessionTimeSec matchEventIndex teamId \\\n",
"0 0.000000 0.0 1 4418 \n",
"1388 1.095918 0.0 2 4418 \n",
"3995 1.985038 0.0 3 4418 \n",
"7938 4.777691 0.0 4 4418 \n",
"10780 11.887601 0.0 5 4418 \n",
"\n",
" homeTeamId homeScore awayTeamId awayScore homeFlag id \\\n",
"0 4418 2 11944 1 1 88178642 \n",
"1388 4418 2 11944 1 1 88178643 \n",
"3995 4418 2 11944 1 1 88178644 \n",
"7938 4418 2 11944 1 1 88178645 \n",
"10780 4418 2 11944 1 1 88178646 \n",
"\n",
" eventName subEventName previousSubEventName possessionTeamId \\\n",
"0 Pass Simple pass Match Start 4418 \n",
"1388 Pass Simple pass Match Start 4418 \n",
"3995 Pass Simple pass Match Start 4418 \n",
"7938 Pass High pass Match Start 4418 \n",
"10780 Duel Ground defending duel Match Start 4418 \n",
"\n",
" possessionSequenceIndex playerId shortName roleCode strongFlag \\\n",
"0 1 26010 O. Giroud FWD 0 \n",
"1388 1 3682 A. Griezmann FWD 0 \n",
"3995 1 31528 N. Kanté MID 0 \n",
"7938 1 7855 L. Koscielny DEF 0 \n",
"10780 1 25437 B. Matuidi MID 0 \n",
"\n",
" weakFlag goalDelta numReds start_x start_y end_x end_y \\\n",
"0 0 0 0 50 48 47.0 50.0 \n",
"1388 0 0 0 47 50 41.0 48.0 \n",
"3995 0 0 0 41 48 32.0 35.0 \n",
"7938 0 0 0 32 35 89.0 6.0 \n",
"10780 0 0 0 89 6 85.0 0.0 \n",
"\n",
" tags successFlag \n",
"0 [1801] 1 \n",
"1388 [1801] 1 \n",
"3995 [1801] 1 \n",
"7938 [1802] 0 \n",
"10780 [702, 1801] 1 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"df_events = event_feature_engineering(df_events)\n",
"df_events.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### For some bizarre reason, this code doesn't work if contained within the event feature engineering function, so adding it on here.\n",
"\n",
"(Doesn't effect modelling, only plotting.)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Applying recipient of an event...\n"
]
}
],
"source": [
"# Getting the recipient player of an action (need to do this pre-pass filter as the next action may well not be a pass, and it'd be \n",
"## highly suboptimal to clip dangerous passes that resulted in shots and goals)\n",
"print ('Applying recipient of an event...')\n",
"possessionEventNames = ['Pass','Others on the ball','Shot']\n",
"\n",
"df_events['passRecipientPlayerIdNext1'] = None\n",
"df_events['passRecipientPlayerIdNext2'] = None\n",
"df_events['passRecipientPlayerIdNext3'] = None\n",
"df_events['passRecipientPlayerIdNext4'] = None\n",
"\n",
"df_events['passRecipientPlayerIdNext1'][((df_events['matchId'] == df_events['matchId'].shift(-1)) &\\\n",
" (df_events['matchEventIndex'] == (df_events['matchEventIndex'].shift(-1) - 1)) &\\\n",
" (df_events['possessionSequenceIndex'] == df_events['possessionSequenceIndex'].shift(-1)) &\\\n",
" (df_events['eventName'].shift(-1).isin(possessionEventNames)) &\\\n",
" (df_events['end_x'] == df_events['start_x'].shift(-1)) &\\\n",
" (df_events['end_y'] == df_events['start_y'].shift(-1)) &\\\n",
" (df_events['successFlag'] == 1))]\\\n",
" = df_events['playerId'].shift(-1)\n",
"\n",
"df_events['passRecipientPlayerIdNext2'][((df_events['matchId'] == df_events['matchId'].shift(-2)) &\\\n",
" (df_events['matchEventIndex'] == (df_events['matchEventIndex'].shift(-2) - 2)) &\\\n",
" (df_events['possessionSequenceIndex'] == df_events['possessionSequenceIndex'].shift(-2)) &\\\n",
" (df_events['eventName'].shift(-2).isin(possessionEventNames)) &\\\n",
" (df_events['end_x'] == df_events['start_x'].shift(-2)) &\\\n",
" (df_events['end_y'] == df_events['start_y'].shift(-2)) &\\\n",
" (df_events['successFlag'] == 1))]\\\n",
" = df_events['playerId'].shift(-2)\n",
"\n",
"df_events['passRecipientPlayerIdNext3'][((df_events['matchId'] == df_events['matchId'].shift(-3)) &\\\n",
" (df_events['matchEventIndex'] == (df_events['matchEventIndex'].shift(-3) - 3)) &\\\n",
" (df_events['possessionSequenceIndex'] == df_events['possessionSequenceIndex'].shift(-3)) &\\\n",
" (df_events['eventName'].shift(-3).isin(possessionEventNames)) &\\\n",
" (df_events['end_x'] == df_events['start_x'].shift(-3)) &\\\n",
" (df_events['end_y'] == df_events['start_y'].shift(-3)) &\\\n",
" (df_events['successFlag'] == 1))]\\\n",
" = df_events['playerId'].shift(-3)\n",
"\n",
"df_events['passRecipientPlayerIdNext4'][((df_events['matchId'] == df_events['matchId'].shift(-4)) &\\\n",
" (df_events['matchEventIndex'] == (df_events['matchEventIndex'].shift(-4) - 4)) &\\\n",
" (df_events['possessionSequenceIndex'] == df_events['possessionSequenceIndex'].shift(-4)) &\\\n",
" (df_events['eventName'].shift(-4).isin(possessionEventNames)) &\\\n",
" (df_events['end_x'] == df_events['start_x'].shift(-4)) &\\\n",
" (df_events['end_y'] == df_events['start_y'].shift(-4)) &\\\n",
" (df_events['successFlag'] == 1))]\\\n",
" = df_events['playerId'].shift(-4)\n",
"\n",
"\n",
"df_events['passRecipientPlayerId'] = df_events.apply(lambda x: int(x.passRecipientPlayerIdNext1) if x.passRecipientPlayerIdNext1 != None else\\\n",
" int(x.passRecipientPlayerIdNext2) if x.passRecipientPlayerIdNext2 != None else\\\n",
" int(x.passRecipientPlayerIdNext3) if x.passRecipientPlayerIdNext3 != None else\\\n",
" int(x.passRecipientPlayerIdNext4) if x.passRecipientPlayerIdNext4 != None else None, axis=1) \n"
]
},
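{
"cell_type": "markdown",
"metadata": {},
"source": [
"The recipient lookup above can be hard to read in vectorised form, so here is the same idea written as a plain Python loop over a few events. This is a simplified sketch: it only checks the possession sequence and the start/end coordinates, and omits the match, event-index and event-type checks. The function name is hypothetical, and the toy events reuse the first rows of `df_events`.\n",
"\n",
"```python\n",
"def find_recipient(events, i, max_lookahead=4):\n",
"    '''Return the playerId of the first event within max_lookahead that starts where event i ended, else None.'''\n",
"    current = events[i]\n",
"    if current['successFlag'] != 1:\n",
"        return None\n",
"    for k in range(1, max_lookahead + 1):\n",
"        if i + k >= len(events):\n",
"            break\n",
"        nxt = events[i + k]\n",
"        same_possession = nxt['possessionSequenceIndex'] == current['possessionSequenceIndex']\n",
"        starts_at_end = (nxt['start_x'], nxt['start_y']) == (current['end_x'], current['end_y'])\n",
"        if same_possession and starts_at_end:\n",
"            return nxt['playerId']\n",
"    return None\n",
"\n",
"\n",
"events = [\n",
"    {'playerId': 26010, 'possessionSequenceIndex': 1, 'start_x': 50, 'start_y': 48, 'end_x': 47, 'end_y': 50, 'successFlag': 1},\n",
"    {'playerId': 3682, 'possessionSequenceIndex': 1, 'start_x': 47, 'start_y': 50, 'end_x': 41, 'end_y': 48, 'successFlag': 1},\n",
"    {'playerId': 31528, 'possessionSequenceIndex': 1, 'start_x': 41, 'start_y': 48, 'end_x': 32, 'end_y': 35, 'successFlag': 1},\n",
"    {'playerId': 7855, 'possessionSequenceIndex': 1, 'start_x': 32, 'start_y': 35, 'end_x': 89, 'end_y': 6, 'successFlag': 0},\n",
"]\n",
"\n",
"print([find_recipient(events, i) for i in range(len(events))])\n",
"```\n",
"\n",
"Looking several events ahead, rather than only one, allows for intermediate events logged between a pass and its reception."
]
},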
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# 3) Pass Specific Feature Engineering\n",
"\n",
"**Highly vectorised, so only takes a minute with all leagues loaded in**\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Applying pass filter...\n",
"Applying geometric transformations...\n",
"Applying passIndexWithinSequence...\n",
"Applying final set of flags...\n",
"Outputting df_passes.\n",
"CPU times: user 23.1 s, sys: 7.62 s, total: 30.7 s\n",
"Wall time: 31.5 s\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" source | \n",
" matchId | \n",
" matchPeriod | \n",
" eventSec | \n",
" possessionTimeSec | \n",
" playerPossessionTimeSec | \n",
" matchEventIndex | \n",
" teamId | \n",
" homeTeamId | \n",
" homeScore | \n",
" awayTeamId | \n",
" awayScore | \n",
" homeFlag | \n",
" id | \n",
" eventName | \n",
" subEventName | \n",
" previousSubEventName | \n",
" possessionTeamId | \n",
" possessionSequenceIndex | \n",
" playerId | \n",
" shortName | \n",
" roleCode | \n",
" strongFlag | \n",
" weakFlag | \n",
" goalDelta | \n",
" numReds | \n",
" start_x | \n",
" start_y | \n",
" end_x | \n",
" end_y | \n",
" tags | \n",
" successFlag | \n",
" passRecipientPlayerIdNext1 | \n",
" passRecipientPlayerIdNext2 | \n",
" passRecipientPlayerIdNext3 | \n",
" passRecipientPlayerIdNext4 | \n",
" passRecipientPlayerId | \n",
" startThird | \n",
" endThird | \n",
" thirdTransitionDelta | \n",
" startPassM_x | \n",
" startPassM_y | \n",
" endPassM_x | \n",
" endPassM_y | \n",
" startPassM_xSquared | \n",
" endPassM_xSquared | \n",
" start_c | \n",
" end_c | \n",
" startM_c | \n",
" endM_c | \n",
" start_cSquared | \n",
" end_cSquared | \n",
" startM_cSquared | \n",
" endM_cSquared | \n",
" vec_x | \n",
" vec_y | \n",
" D | \n",
" Dsquared | \n",
" Dcubed | \n",
" a | \n",
" aShooting | \n",
" aShootingFinal | \n",
" aShootingChange | \n",
" DGoalStart | \n",
" DGoalEnd | \n",
" DGoalChange | \n",
" passIndexWithinSequence | \n",
" meanNumPassesPerSequence | \n",
" numPassOverUnder | \n",
" interceptionFlag | \n",
" dangerousBallLostFlag | \n",
" counterAttackFlag | \n",
" assistFlag | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 1.255990 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 1 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178642 | \n",
" Pass | \n",
" Simple pass | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 26010 | \n",
" O. Giroud | \n",
" FWD | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 50 | \n",
" 48 | \n",
" 47.0 | \n",
" 50.0 | \n",
" [1801] | \n",
" 1 | \n",
" 3682 | \n",
" None | \n",
" None | \n",
" None | \n",
" 3682.0 | \n",
" 2 | \n",
" 2 | \n",
" 0 | \n",
" 52.50 | \n",
" 32.64 | \n",
" 49.35 | \n",
" 34.00 | \n",
" 2756.2500 | \n",
" 2435.4225 | \n",
" 2 | \n",
" 0.0 | \n",
" 1.36 | \n",
" 0.00 | \n",
" 4 | \n",
" 0.0 | \n",
" 1.8496 | \n",
" 0.0000 | \n",
" -3.15 | \n",
" 1.36 | \n",
" 3.431049 | \n",
" 11.7721 | \n",
" 40.390657 | \n",
" -1.163226 | \n",
" 0.139111 | \n",
" 0.148057 | \n",
" 0.008946 | \n",
" 52.517612 | \n",
" 55.650000 | \n",
" 3.132388 | \n",
" 1 | \n",
" 2.699072 | \n",
" -1.699072 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 2.351908 | \n",
" 1.095918 | \n",
" 0.0 | \n",
" 2 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178643 | \n",
" Pass | \n",
" Simple pass | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 3682 | \n",
" A. Griezmann | \n",
" FWD | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 47 | \n",
" 50 | \n",
" 41.0 | \n",
" 48.0 | \n",
" [1801] | \n",
" 1 | \n",
" 31528 | \n",
" None | \n",
" None | \n",
" None | \n",
" 31528.0 | \n",
" 2 | \n",
" 2 | \n",
" 0 | \n",
" 49.35 | \n",
" 34.00 | \n",
" 43.05 | \n",
" 32.64 | \n",
" 2435.4225 | \n",
" 1853.3025 | \n",
" 0 | \n",
" 2.0 | \n",
" 0.00 | \n",
" 1.36 | \n",
" 0 | \n",
" 4.0 | \n",
" 0.0000 | \n",
" 1.8496 | \n",
" -6.30 | \n",
" -1.36 | \n",
" 6.445122 | \n",
" 41.5396 | \n",
" 267.727798 | \n",
" -1.358186 | \n",
" 0.148057 | \n",
" 0.169460 | \n",
" 0.021403 | \n",
" 55.650000 | \n",
" 61.964926 | \n",
" 6.314926 | \n",
" 2 | \n",
" 2.699072 | \n",
" -0.699072 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 3.241028 | \n",
" 1.985038 | \n",
" 0.0 | \n",
" 3 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178644 | \n",
" Pass | \n",
" Simple pass | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 31528 | \n",
" N. Kanté | \n",
" MID | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 41 | \n",
" 48 | \n",
" 32.0 | \n",
" 35.0 | \n",
" [1801] | \n",
" 1 | \n",
" 7855 | \n",
" None | \n",
" None | \n",
" None | \n",
" 7855.0 | \n",
" 2 | \n",
" 1 | \n",
" -1 | \n",
" 43.05 | \n",
" 32.64 | \n",
" 33.60 | \n",
" 23.80 | \n",
" 1853.3025 | \n",
" 1128.9600 | \n",
" 2 | \n",
" 15.0 | \n",
" 1.36 | \n",
" 10.20 | \n",
" 4 | \n",
" 225.0 | \n",
" 1.8496 | \n",
" 104.0400 | \n",
" -9.45 | \n",
" -8.84 | \n",
" 12.940174 | \n",
" 167.4481 | \n",
" 2166.807530 | \n",
" -0.818737 | \n",
" 0.169460 | \n",
" 0.198996 | \n",
" 0.029537 | \n",
" 61.964926 | \n",
" 72.124892 | \n",
" 10.159965 | \n",
" 3 | \n",
" 2.699072 | \n",
" 0.300928 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 6.033681 | \n",
" 4.777691 | \n",
" 0.0 | \n",
" 4 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178645 | \n",
" Pass | \n",
" High pass | \n",
" Match Start | \n",
" 4418 | \n",
" 1 | \n",
" 7855 | \n",
" L. Koscielny | \n",
" DEF | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 32 | \n",
" 35 | \n",
" 89.0 | \n",
" 6.0 | \n",
" [1802] | \n",
" 0 | \n",
" None | \n",
" None | \n",
" None | \n",
" None | \n",
" NaN | \n",
" 1 | \n",
" 3 | \n",
" 2 | \n",
" 33.60 | \n",
" 23.80 | \n",
" 93.45 | \n",
" 4.08 | \n",
" 1128.9600 | \n",
" 8732.9025 | \n",
" 15 | \n",
" 44.0 | \n",
" 10.20 | \n",
" 29.92 | \n",
" 225 | \n",
" 1936.0 | \n",
" 104.0400 | \n",
" 895.2064 | \n",
" 59.85 | \n",
" -19.72 | \n",
" 63.015085 | \n",
" 3970.9009 | \n",
" 250226.656557 | \n",
" 1.252508 | \n",
" 0.198996 | \n",
" 0.071027 | \n",
" -0.127969 | \n",
" 72.124892 | \n",
" 32.071933 | \n",
" -40.052958 | \n",
" 4 | \n",
" 2.699072 | \n",
" 1.300928 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" European_Championship | \n",
" 1694390 | \n",
" 1H | \n",
" 27.053006 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 7 | \n",
" 4418 | \n",
" 4418 | \n",
" 2 | \n",
" 11944 | \n",
" 1 | \n",
" 1 | \n",
" 88178648 | \n",
" Free Kick | \n",
" Throw in | \n",
" Match Start | \n",
" 4418 | \n",
" 3 | \n",
" 7915 | \n",
" P. Evra | \n",
" DEF | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 85 | \n",
" 0 | \n",
" 93.0 | \n",
" 16.0 | \n",
" [1802] | \n",
" 0 | \n",
" None | \n",
" None | \n",
" None | \n",
" None | \n",
" NaN | \n",
" 3 | \n",
" 3 | \n",
" 0 | \n",
" 89.25 | \n",
" 0.00 | \n",
" 97.65 | \n",
" 10.88 | \n",
" 7965.5625 | \n",
" 9535.5225 | \n",
" 50 | \n",
" 34.0 | \n",
" 34.00 | \n",
" 23.12 | \n",
" 2500 | \n",
" 1156.0 | \n",
" 1156.0000 | \n",
" 534.5344 | \n",
" 8.40 | \n",
" 10.88 | \n",
" 13.745341 | \n",
" 188.9344 | \n",
" 2596.967760 | \n",
" 0.657470 | \n",
" 0.071605 | \n",
" 0.070958 | \n",
" -0.000648 | \n",
" 37.470822 | \n",
" 24.260192 | \n",
" -13.210630 | \n",
" 1 | \n",
" 2.699072 | \n",
" -1.699072 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" source matchId matchPeriod eventSec possessionTimeSec \\\n",
"0 European_Championship 1694390 1H 1.255990 0.000000 \n",
"1 European_Championship 1694390 1H 2.351908 1.095918 \n",
"2 European_Championship 1694390 1H 3.241028 1.985038 \n",
"3 European_Championship 1694390 1H 6.033681 4.777691 \n",
"4 European_Championship 1694390 1H 27.053006 0.000000 \n",
"\n",
" playerPossessionTimeSec matchEventIndex teamId homeTeamId homeScore \\\n",
"0 0.0 1 4418 4418 2 \n",
"1 0.0 2 4418 4418 2 \n",
"2 0.0 3 4418 4418 2 \n",
"3 0.0 4 4418 4418 2 \n",
"4 0.0 7 4418 4418 2 \n",
"\n",
" awayTeamId awayScore homeFlag id eventName subEventName \\\n",
"0 11944 1 1 88178642 Pass Simple pass \n",
"1 11944 1 1 88178643 Pass Simple pass \n",
"2 11944 1 1 88178644 Pass Simple pass \n",
"3 11944 1 1 88178645 Pass High pass \n",
"4 11944 1 1 88178648 Free Kick Throw in \n",
"\n",
" previousSubEventName possessionTeamId possessionSequenceIndex playerId \\\n",
"0 Match Start 4418 1 26010 \n",
"1 Match Start 4418 1 3682 \n",
"2 Match Start 4418 1 31528 \n",
"3 Match Start 4418 1 7855 \n",
"4 Match Start 4418 3 7915 \n",
"\n",
" shortName roleCode strongFlag weakFlag goalDelta numReds start_x \\\n",
"0 O. Giroud FWD 0 0 0 0 50 \n",
"1 A. Griezmann FWD 0 0 0 0 47 \n",
"2 N. Kanté MID 0 0 0 0 41 \n",
"3 L. Koscielny DEF 0 0 0 0 32 \n",
"4 P. Evra DEF 0 0 0 0 85 \n",
"\n",
" start_y end_x end_y tags successFlag passRecipientPlayerIdNext1 \\\n",
"0 48 47.0 50.0 [1801] 1 3682 \n",
"1 50 41.0 48.0 [1801] 1 31528 \n",
"2 48 32.0 35.0 [1801] 1 7855 \n",
"3 35 89.0 6.0 [1802] 0 None \n",
"4 0 93.0 16.0 [1802] 0 None \n",
"\n",
" passRecipientPlayerIdNext2 passRecipientPlayerIdNext3 \\\n",
"0 None None \n",
"1 None None \n",
"2 None None \n",
"3 None None \n",
"4 None None \n",
"\n",
" passRecipientPlayerIdNext4 passRecipientPlayerId startThird endThird \\\n",
"0 None 3682.0 2 2 \n",
"1 None 31528.0 2 2 \n",
"2 None 7855.0 2 1 \n",
"3 None NaN 1 3 \n",
"4 None NaN 3 3 \n",
"\n",
" thirdTransitionDelta startPassM_x startPassM_y endPassM_x endPassM_y \\\n",
"0 0 52.50 32.64 49.35 34.00 \n",
"1 0 49.35 34.00 43.05 32.64 \n",
"2 -1 43.05 32.64 33.60 23.80 \n",
"3 2 33.60 23.80 93.45 4.08 \n",
"4 0 89.25 0.00 97.65 10.88 \n",
"\n",
" startPassM_xSquared endPassM_xSquared start_c end_c startM_c endM_c \\\n",
"0 2756.2500 2435.4225 2 0.0 1.36 0.00 \n",
"1 2435.4225 1853.3025 0 2.0 0.00 1.36 \n",
"2 1853.3025 1128.9600 2 15.0 1.36 10.20 \n",
"3 1128.9600 8732.9025 15 44.0 10.20 29.92 \n",
"4 7965.5625 9535.5225 50 34.0 34.00 23.12 \n",
"\n",
" start_cSquared end_cSquared startM_cSquared endM_cSquared vec_x vec_y \\\n",
"0 4 0.0 1.8496 0.0000 -3.15 1.36 \n",
"1 0 4.0 0.0000 1.8496 -6.30 -1.36 \n",
"2 4 225.0 1.8496 104.0400 -9.45 -8.84 \n",
"3 225 1936.0 104.0400 895.2064 59.85 -19.72 \n",
"4 2500 1156.0 1156.0000 534.5344 8.40 10.88 \n",
"\n",
" D Dsquared Dcubed a aShooting aShootingFinal \\\n",
"0 3.431049 11.7721 40.390657 -1.163226 0.139111 0.148057 \n",
"1 6.445122 41.5396 267.727798 -1.358186 0.148057 0.169460 \n",
"2 12.940174 167.4481 2166.807530 -0.818737 0.169460 0.198996 \n",
"3 63.015085 3970.9009 250226.656557 1.252508 0.198996 0.071027 \n",
"4 13.745341 188.9344 2596.967760 0.657470 0.071605 0.070958 \n",
"\n",
" aShootingChange DGoalStart DGoalEnd DGoalChange \\\n",
"0 0.008946 52.517612 55.650000 3.132388 \n",
"1 0.021403 55.650000 61.964926 6.314926 \n",
"2 0.029537 61.964926 72.124892 10.159965 \n",
"3 -0.127969 72.124892 32.071933 -40.052958 \n",
"4 -0.000648 37.470822 24.260192 -13.210630 \n",
"\n",
" passIndexWithinSequence meanNumPassesPerSequence numPassOverUnder \\\n",
"0 1 2.699072 -1.699072 \n",
"1 2 2.699072 -0.699072 \n",
"2 3 2.699072 0.300928 \n",
"3 4 2.699072 1.300928 \n",
"4 1 2.699072 -1.699072 \n",
"\n",
" interceptionFlag dangerousBallLostFlag counterAttackFlag assistFlag \n",
"0 0 0 0 0 \n",
"1 0 0 0 0 \n",
"2 0 0 0 0 \n",
"3 0 0 0 0 \n",
"4 0 0 0 0 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"df_passes = pass_feature_engineering(df_events, outputToCsvFlag=0)\n",
"df_passes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# 4) Model Fitting\n",
"\n",
"### Splitting `df_passes` into training and test dataset, stratifying the dependent variable"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Stratified Pass Success Rates:\n",
"\n",
"Overall: 83.0%\n",
"Train: 83.0%\n",
"Test: 83.0%\n",
"\n"
]
}
],
"source": [
"# splitting into a dataframe for training and dataframe for testing\n",
"## stratifying the successFlag\n",
"df_passes_train, df_passes_test = train_test_split(df_passes, test_size=0.3, stratify=df_passes.successFlag, random_state=1, shuffle=True)\n",
"\n",
"print (f'Stratified Pass Success Rates:\\n\\nOverall: {100*np.round(df_passes.successFlag.mean(),3)}%\\nTrain: {100*np.round(df_passes_train.successFlag.mean(), 3)}%\\nTest: {100*np.round(df_passes_test.successFlag.mean(), 3)}%\\n')"
]
},
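{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what the stratified split above guarantees, run on a synthetic frame rather than on `df_passes` (the `toy` frame and its 83% success rate are illustrative assumptions). With `stratify=...`, `train_test_split` keeps the share of successful passes nearly identical in both partitions.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic stand-in for df_passes: 1000 passes, 830 successful (assumed values)\n",
"toy = pd.DataFrame({'successFlag': np.repeat([1, 0], [830, 170]),\n",
"                    'x': np.arange(1000)})\n",
"\n",
"# Same call pattern as the notebook: 70/30 split, stratified on the target\n",
"toy_train, toy_test = train_test_split(\n",
"    toy, test_size=0.3, stratify=toy.successFlag, random_state=1, shuffle=True)\n",
"\n",
"# Success rate in each partition; both should sit close to the overall rate\n",
"print(round(toy_train.successFlag.mean(), 3), round(toy_test.successFlag.mean(), 3))\n",
"```\n",
"\n",
"Without `stratify`, the two rates would usually still be close on a sample this large, but they can drift apart on small or heavily imbalanced samples."
]
},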
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fitting basic model to **training** data:\n",
"* Starting X\n",
"* Starting Y"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 15.3 s, sys: 2.61 s, total: 17.9 s\n",
"Wall time: 4.35 s\n"
]
},
{
"data": {
"text/plain": [
"\n",
"\"\"\"\n",
" Results: Generalized linear model\n",
"===================================================================\n",
"Model: GLM AIC: 1140927.9736 \n",
"Link Function: logit BIC: -16674264.5204\n",
"Dependent Variable: successFlag Log-Likelihood: -5.7046e+05 \n",
"Date: 2020-09-20 22:00 LL-Null: -5.7736e+05 \n",
"No. Observations: 1267740 Deviance: 1.1409e+06 \n",
"Df Model: 2 Pearson chi2: 1.28e+06 \n",
"Df Residuals: 1267737 Scale: 1.0000 \n",
"Method: IRLS \n",
"--------------------------------------------------------------------\n",
" Coef. Std.Err. z P>|z| [0.025 0.975]\n",
"--------------------------------------------------------------------\n",
"Intercept 2.3204 0.0069 335.5973 0.0000 2.3069 2.3340\n",
"startPassM_x -0.0090 0.0001 -85.7712 0.0000 -0.0092 -0.0088\n",
"startM_c -0.0136 0.0003 -54.0379 0.0000 -0.0141 -0.0131\n",
"===================================================================\n",
"\n",
"\"\"\""
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"pass_model_basic = smf.glm(formula=\"successFlag ~ startPassM_x + startM_c\", data=df_passes_train\\\n",
" ,family=sm.families.Binomial()).fit()\n",
"\n",
"pass_model_basic.summary2()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fitting additional features:\n",
"* Starting X\n",
"* Starting Y\n",
"* X\\*Y (Interaction Term)\n",
"* Starting X^2\n",
"* Starting Y^2\n",
"* End X\n",
"* End Y\n",
"* Shooting Angle (Initial)\n",
"* Sub Event Type"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 45.3 s, sys: 6.05 s, total: 51.4 s\n",
"Wall time: 18.1 s\n"
]
},
{
"data": {
"text/plain": [
"\n",
"\"\"\"\n",
" Results: Generalized linear model\n",
"===================================================================================\n",
"Model: GLM AIC: 926516.8995 \n",
"Link Function: logit BIC: -16888482.7506\n",
"Dependent Variable: successFlag Log-Likelihood: -4.6324e+05 \n",
"Date: 2020-09-20 22:00 LL-Null: -5.7736e+05 \n",
"No. Observations: 1267740 Deviance: 9.2648e+05 \n",
"Df Model: 18 Pearson chi2: 1.22e+06 \n",
"Df Residuals: 1267721 Scale: 1.0000 \n",
"Method: IRLS \n",
"-----------------------------------------------------------------------------------\n",
" Coef. Std.Err. z P>|z| [0.025 0.975]\n",
"-----------------------------------------------------------------------------------\n",
"Intercept 2.2667 0.0328 69.0961 0.0000 2.2024 2.3310\n",
"C(subEventName)[T.Cross] -1.5938 0.0245 -65.0123 0.0000 -1.6418 -1.5457\n",
"C(subEventName)[T.Free Kick] 1.0961 0.0336 32.5966 0.0000 1.0302 1.1620\n",
"C(subEventName)[T.Free kick cross] -0.9004 0.0358 -25.1346 0.0000 -0.9706 -0.8302\n",
"C(subEventName)[T.Hand pass] 1.9723 0.0643 30.6929 0.0000 1.8464 2.0982\n",
"C(subEventName)[T.Head pass] -0.6328 0.0279 -22.6792 0.0000 -0.6875 -0.5781\n",
"C(subEventName)[T.High pass] -0.5167 0.0280 -18.4845 0.0000 -0.5715 -0.4619\n",
"C(subEventName)[T.Launch] -0.7976 0.0292 -27.2993 0.0000 -0.8548 -0.7403\n",
"C(subEventName)[T.Simple pass] 1.2257 0.0266 46.0856 0.0000 1.1736 1.2779\n",
"C(subEventName)[T.Smart pass] -1.1578 0.0296 -39.0535 0.0000 -1.2159 -1.0997\n",
"C(subEventName)[T.Throw in] 1.6786 0.0284 59.1520 0.0000 1.6230 1.7342\n",
"startPassM_x 0.0190 0.0007 26.8825 0.0000 0.0176 0.0203\n",
"startM_c -0.0253 0.0015 -16.8990 0.0000 -0.0282 -0.0223\n",
"startPassM_x:startM_c 0.0009 0.0000 56.2755 0.0000 0.0008 0.0009\n",
"endPassM_x -0.0162 0.0002 -97.0289 0.0000 -0.0165 -0.0158\n",
"endM_c -0.0159 0.0003 -54.5851 0.0000 -0.0165 -0.0153\n",
"aShooting -0.0319 0.0453 -0.7047 0.4810 -0.1207 0.0569\n",
"startPassM_xSquared -0.0003 0.0000 -51.8340 0.0000 -0.0003 -0.0003\n",
"startM_cSquared -0.0011 0.0000 -30.3210 0.0000 -0.0011 -0.0010\n",
"===================================================================================\n",
"\n",
"\"\"\""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"pass_model_added = smf.glm(formula=\"successFlag ~ C(subEventName) + startPassM_x*startM_c + endPassM_x + endM_c + aShooting +\\\n",
" startPassM_xSquared + startM_cSquared\", data=df_passes_train\\\n",
" ,family=sm.families.Binomial()).fit()\n",
"\n",
"pass_model_added.summary2()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fitting model to **training** data with advanced features, using two different link functions:\n",
"\n",
"#### Features:\n",
"\n",
"* Event Name\n",
"* Sub Event Name\n",
"* Starting X\n",
"* Starting Y\n",
"* Starting X\\*Y (Interaction Term)\n",
"* End X\n",
"* End Y\n",
"* End X\\*Y (Interaction Term)\n",
"* Start Y^2\n",
"* End Y^2\n",
"* Start Y^2 \\* End Y^2 (Interaction Term)\n",
"* Start X^2\n",
"* End X^2\n",
"* Start X^2 \\* End X^2 (Interaction Term)\n",
"* Distance to Goal (Initial)\n",
"* Passing Distance\n",
"* Passing Distance^2\n",
"* Passing Distance^3\n",
"* Passing Angle\n",
"* Shooting Angle (Initial)\n",
"* Shooting Angle (Change Before and After Pass)\n",
"* Transition Through Thirds (1->2, 2->3, etc.)\n",
"* Home / Away Flag\n",
"* Counter Attack Flag\n",
"* Number of Red Cards\n",
"* Game State (Delta Between Teams for Number of Goals Scored)\n",
"* Time of Current Possession Sequence\n",
"* Time of Passing Player Possession\n",
"* Passing Index Within Possession Sequence\n",
"\n",
"#### Link Functions:\n",
"* Logit (Canonical link function for Binomial family of distributions)\n",
"* Probit\n"
]
},
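{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two link functions listed above map the linear predictor to a probability through different S-shaped curves: the logistic function for logit and the standard normal CDF for probit. A minimal standard-library sketch of both inverse links follows. The `scale` value of 1.7 is a commonly quoted rule of thumb, not something estimated from this data, and the helper names are illustrative assumptions.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def inv_logit(eta):\n",
"    # Inverse of the logit link: linear predictor -> probability\n",
"    return 1.0 / (1.0 + math.exp(-eta))\n",
"\n",
"def inv_probit(eta):\n",
"    # Inverse of the probit link: the standard normal CDF\n",
"    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))\n",
"\n",
"# Rule-of-thumb ratio between logit and probit scales (an approximation)\n",
"scale = 1.7\n",
"\n",
"# Compare the two curves, evaluating probit on the rescaled predictor\n",
"for eta in (-3.0, -1.0, 0.0, 1.0, 3.0):\n",
"    print(eta, round(inv_logit(eta), 3), round(inv_probit(eta / scale), 3))\n",
"```\n",
"\n",
"Because the two curves nearly coincide after rescaling, coefficients fitted under the two links are not directly comparable in magnitude, while the predicted probabilities tend to be similar."
]
},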
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2min 10s, sys: 12.9 s, total: 2min 23s\n",
"Wall time: 55.9 s\n"
]
},
{
"data": {
"text/plain": [
"\n",
"\"\"\"\n",
" Results: Generalized linear model\n",
"======================================================================================\n",
"Model: GLM AIC: 878059.9874 \n",
"Link Function: logit BIC: -16936722.7132\n",
"Dependent Variable: successFlag Log-Likelihood: -4.3899e+05 \n",
"Date: 2020-09-20 22:01 LL-Null: -5.7736e+05 \n",
"No. Observations: 1267740 Deviance: 8.7799e+05 \n",
"Df Model: 36 Pearson chi2: 1.48e+06 \n",
"Df Residuals: 1267703 Scale: 1.0000 \n",
"Method: IRLS \n",
"--------------------------------------------------------------------------------------\n",
" Coef. Std.Err. z P>|z| [0.025 0.975]\n",
"--------------------------------------------------------------------------------------\n",
"Intercept 4.6394 0.4439 10.4523 0.0000 3.7695 5.5094\n",
"C(eventName)[T.Pass] -0.5497 0.0334 -16.4579 0.0000 -0.6152 -0.4842\n",
"C(subEventName)[T.Cross] -1.0629 0.0153 -69.4302 0.0000 -1.0929 -1.0329\n",
"C(subEventName)[T.Free Kick] 0.8217 0.0438 18.7819 0.0000 0.7360 0.9075\n",
"C(subEventName)[T.Free kick cross] -0.7933 0.0450 -17.6457 0.0000 -0.8814 -0.7052\n",
"C(subEventName)[T.Hand pass] 1.8936 0.0535 35.3916 0.0000 1.7887 1.9985\n",
"C(subEventName)[T.Head pass] -0.4716 0.0130 -36.2374 0.0000 -0.4971 -0.4461\n",
"C(subEventName)[T.High pass] -0.4830 0.0122 -39.4346 0.0000 -0.5070 -0.4590\n",
"C(subEventName)[T.Launch] -0.6207 0.0160 -38.8133 0.0000 -0.6521 -0.5894\n",
"C(subEventName)[T.Simple pass] 1.1445 0.0112 101.7804 0.0000 1.1225 1.1666\n",
"C(subEventName)[T.Smart pass] -0.9496 0.0165 -57.3987 0.0000 -0.9820 -0.9171\n",
"C(subEventName)[T.Throw in] 1.1878 0.0393 30.2234 0.0000 1.1107 1.2648\n",
"C(homeFlag)[T.1] 0.0939 0.0055 16.9337 0.0000 0.0830 0.1047\n",
"C(counterAttackFlag)[T.1] 0.4868 0.0232 20.9958 0.0000 0.4414 0.5323\n",
"startPassM_x -0.0481 0.0046 -10.5313 0.0000 -0.0571 -0.0392\n",
"startM_c -0.0264 0.0020 -13.0491 0.0000 -0.0303 -0.0224\n",
"startPassM_x:startM_c 0.0008 0.0000 24.3176 0.0000 0.0007 0.0008\n",
"endPassM_x 0.0257 0.0010 24.9444 0.0000 0.0237 0.0278\n",
"endM_c 0.0548 0.0018 30.9418 0.0000 0.0513 0.0582\n",
"endPassM_x:endM_c 0.0004 0.0000 24.0117 0.0000 0.0004 0.0004\n",
"start_cSquared -0.0002 0.0000 -8.0138 0.0000 -0.0002 -0.0001\n",
"end_cSquared -0.0010 0.0000 -53.9567 0.0000 -0.0010 -0.0009\n",
"start_cSquared:end_cSquared -0.0000 0.0000 -5.8665 0.0000 -0.0000 -0.0000\n",
"startPassM_xSquared -0.0000 0.0000 -3.5110 0.0004 -0.0001 -0.0000\n",
"endPassM_xSquared -0.0004 0.0000 -42.7420 0.0000 -0.0004 -0.0004\n",
"startPassM_xSquared:endPassM_xSquared 0.0000 0.0000 8.1066 0.0000 0.0000 0.0000\n",
"D 0.0790 0.0011 69.7190 0.0000 0.0767 0.0812\n",
"DGoalStart -0.0435 0.0037 -11.7103 0.0000 -0.0508 -0.0362\n",
"Dsquared -0.0015 0.0000 -46.3579 0.0000 -0.0016 -0.0014\n",
"Dcubed 0.0000 0.0000 22.6732 0.0000 0.0000 0.0000\n",
"a -0.3321 0.0062 -53.9585 0.0000 -0.3441 -0.3200\n",
"aShooting 4.6372 0.1464 31.6803 0.0000 4.3503 4.9240\n",
"aShootingChange 4.8917 0.1391 35.1681 0.0000 4.6191 5.1643\n",
"thirdTransitionDelta -0.1285 0.0068 -18.7578 0.0000 -0.1419 -0.1151\n",
"numReds 0.2331 0.0163 14.2799 0.0000 0.2011 0.2651\n",
"goalDelta -0.0169 0.0024 -7.0626 0.0000 -0.0217 -0.0122\n",
"possessionTimeSec 0.0064 0.0003 19.6586 0.0000 0.0058 0.0071\n",
"playerPossessionTimeSec 0.0045 0.0007 6.8592 0.0000 0.0032 0.0058\n",
"passIndexWithinSequence 0.0429 0.0014 30.0619 0.0000 0.0401 0.0457\n",
"======================================================================================\n",
"\n",
"\"\"\""
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"# Logit is the canonical link for the Binomial family\n",
"# (the probit alternative is fitted in the next cell).\n",
"# The link is passed as an instance, not as a class.\n",
"\n",
"pass_model_advanced_canonical = smf.glm(formula=\"successFlag ~ C(eventName) + C(subEventName) +\\\n",
" startPassM_x*startM_c + endPassM_x*endM_c + start_cSquared*end_cSquared +\\\n",
" startPassM_xSquared*endPassM_xSquared +\\\n",
" D + DGoalStart + Dsquared + Dcubed +\\\n",
" a + aShooting + aShootingChange +\\\n",
" thirdTransitionDelta +\\\n",
" C(homeFlag) + C(counterAttackFlag) +\\\n",
" numReds + goalDelta +\\\n",
" possessionTimeSec + playerPossessionTimeSec + passIndexWithinSequence\", data=df_passes_train\\\n",
" ,family=sm.families.Binomial(link=sm.families.links.logit())).fit()\n",
"\n",
"pass_model_advanced_canonical.summary2()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2min 49s, sys: 15.2 s, total: 3min 4s\n",
"Wall time: 1min 9s\n"
]
},
{
"data": {
"text/plain": [
"\n",
"\"\"\"\n",
" Results: Generalized linear model\n",
"======================================================================================\n",
"Model: GLM AIC: 875453.9179 \n",
"Link Function: probit BIC: -16939328.7828\n",
"Dependent Variable: successFlag Log-Likelihood: -4.3769e+05 \n",
"Date: 2020-09-20 22:02 LL-Null: -5.7736e+05 \n",
"No. Observations: 1267740 Deviance: 8.7538e+05 \n",
"Df Model: 36 Pearson chi2: 3.55e+07 \n",
"Df Residuals: 1267703 Scale: 1.0000 \n",
"Method: IRLS \n",
"--------------------------------------------------------------------------------------\n",
" Coef. Std.Err. z P>|z| [0.025 0.975]\n",
"--------------------------------------------------------------------------------------\n",
"Intercept 2.6608 0.2509 10.6031 0.0000 2.1690 3.1526\n",
"C(eventName)[T.Pass] -0.3362 0.0190 -17.6570 0.0000 -0.3736 -0.2989\n",
"C(subEventName)[T.Cross] -0.6321 0.0086 -73.6553 0.0000 -0.6489 -0.6153\n",
"C(subEventName)[T.Free Kick] 0.4951 0.0248 19.9859 0.0000 0.4465 0.5436\n",
"C(subEventName)[T.Free kick cross] -0.4661 0.0266 -17.5292 0.0000 -0.5182 -0.4139\n",
"C(subEventName)[T.Hand pass] 1.0295 0.0243 42.3806 0.0000 0.9819 1.0771\n",
"C(subEventName)[T.Head pass] -0.2535 0.0071 -35.7649 0.0000 -0.2674 -0.2396\n",
"C(subEventName)[T.High pass] -0.2514 0.0066 -37.8607 0.0000 -0.2645 -0.2384\n",
"C(subEventName)[T.Launch] -0.3298 0.0091 -36.4452 0.0000 -0.3476 -0.3121\n",
"C(subEventName)[T.Simple pass] 0.6608 0.0058 113.7434 0.0000 0.6494 0.6722\n",
"C(subEventName)[T.Smart pass] -0.5597 0.0095 -58.8821 0.0000 -0.5783 -0.5410\n",
"C(subEventName)[T.Throw in] 0.6663 0.0224 29.7952 0.0000 0.6225 0.7102\n",
"C(homeFlag)[T.1] 0.0514 0.0031 16.8498 0.0000 0.0454 0.0574\n",
"C(counterAttackFlag)[T.1] 0.2704 0.0128 21.1762 0.0000 0.2454 0.2955\n",
"startPassM_x -0.0241 0.0026 -9.3021 0.0000 -0.0291 -0.0190\n",
"startM_c -0.0133 0.0011 -11.9171 0.0000 -0.0155 -0.0111\n",
"startPassM_x:startM_c 0.0004 0.0000 23.4883 0.0000 0.0004 0.0004\n",
"endPassM_x 0.0132 0.0006 22.9958 0.0000 0.0121 0.0143\n",
"endM_c 0.0256 0.0010 26.8241 0.0000 0.0237 0.0274\n",
"endPassM_x:endM_c 0.0003 0.0000 30.8649 0.0000 0.0003 0.0003\n",
"start_cSquared -0.0001 0.0000 -8.3969 0.0000 -0.0001 -0.0001\n",
"end_cSquared -0.0005 0.0000 -52.6469 0.0000 -0.0005 -0.0005\n",
"start_cSquared:end_cSquared -0.0000 0.0000 -8.3416 0.0000 -0.0000 -0.0000\n",
"startPassM_xSquared -0.0000 0.0000 -6.0799 0.0000 -0.0001 -0.0000\n",
"endPassM_xSquared -0.0002 0.0000 -45.3239 0.0000 -0.0002 -0.0002\n",
"startPassM_xSquared:endPassM_xSquared 0.0000 0.0000 10.7485 0.0000 0.0000 0.0000\n",
"D 0.0452 0.0006 73.3264 0.0000 0.0440 0.0464\n",
"DGoalStart -0.0230 0.0021 -10.9519 0.0000 -0.0271 -0.0189\n",
"Dsquared -0.0009 0.0000 -51.0188 0.0000 -0.0009 -0.0009\n",
"Dcubed 0.0000 0.0000 26.8450 0.0000 0.0000 0.0000\n",
"a -0.1743 0.0033 -52.4998 0.0000 -0.1808 -0.1678\n",
"aShooting 1.8915 0.0711 26.5953 0.0000 1.7521 2.0309\n",
"aShootingChange 1.9902 0.0660 30.1367 0.0000 1.8608 2.1197\n",
"thirdTransitionDelta -0.0662 0.0037 -17.7817 0.0000 -0.0735 -0.0589\n",
"numReds 0.1339 0.0089 14.9810 0.0000 0.1164 0.1514\n",
"goalDelta -0.0091 0.0013 -6.9512 0.0000 -0.0117 -0.0065\n",
"possessionTimeSec 0.0036 0.0002 20.4972 0.0000 0.0033 0.0040\n",
"playerPossessionTimeSec 0.0023 0.0004 6.5261 0.0000 0.0016 0.0030\n",
"passIndexWithinSequence 0.0239 0.0008 31.1508 0.0000 0.0224 0.0254\n",
"======================================================================================\n",
"\n",
"\"\"\""
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"\n",
"# Same features as the logit model above, fitted with the probit link.\n",
"# The link is passed as an instance, not as a class.\n",
"\n",
"pass_model_advanced_probit = smf.glm(formula=\"successFlag ~ C(eventName) + C(subEventName) +\\\n",
" startPassM_x*startM_c + endPassM_x*endM_c + start_cSquared*end_cSquared +\\\n",
" startPassM_xSquared*endPassM_xSquared +\\\n",
" D + DGoalStart + Dsquared + Dcubed +\\\n",
" a + aShooting + aShootingChange +\\\n",
" thirdTransitionDelta +\\\n",
" C(homeFlag) + C(counterAttackFlag) +\\\n",
" numReds + goalDelta +\\\n",
" possessionTimeSec + playerPossessionTimeSec + passIndexWithinSequence\", data=df_passes_train\\\n",
" ,family=sm.families.Binomial(link=sm.families.links.probit())).fit()\n",
"\n",
"pass_model_advanced_probit.summary2()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# 5) Applying models to **test** data\n",
"\n",
"1. Basic model: starting-position features only;\n",
"2. Added model: the features outlined in the problem statement;\n",
"3. Advanced model (logit link): the extended feature set listed in section 4;\n",
"4. Advanced model (probit link): same features as the logit model."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Applying models...\n",
"Done applying 4 models.\n"
]
}
],
"source": [
"df_passes_test = apply_xP_model_to_test([pass_model_basic, pass_model_added, pass_model_advanced_canonical, pass_model_advanced_probit])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"# 6) Model Validation: Calibration Curves of Models Applied to **Test** Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calibration Curve: Basic Vs Added Models"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_calibration_curve(df_passes_test, show_advanced=0, save_output=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calibration Curve - Advanced Models: Logit Vs Probit"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plot_calibration_curve(df_passes_test, show_advanced=1, save_output=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# 7) Applying Logistic Regression Classifier and Calculating Model Fit Metrics"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Brier Score: 0.10642146614710772\n",
"\n",
"Precision Score: 0.9005178325572422\n",
"\n",
"Recall Score: 0.9105388462288744\n",
"\n",
"F1 Score: 0.9055006150495346\n",
"\n",
"AUC Score: 0.7092062601966227\n",
"\n",
"AccuracyScore: 0.8422029087937452\n"
]
}
],
"source": [
"calculate_model_metrics(df_passes_test, 'xP')"
]
},
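{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how metrics of this kind can be computed with scikit-learn on simulated data. It is not the notebook's `calculate_model_metrics` helper, whose body is not shown in this section; the simulated probabilities, the variable names and the 0.5 classification threshold are all assumptions. The Brier score and AUC are computed from the predicted probabilities, while accuracy needs hard labels obtained by thresholding.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import accuracy_score, brier_score_loss, roc_auc_score\n",
"\n",
"rng = np.random.default_rng(0)\n",
"p_true = rng.uniform(0.5, 1.0, size=5000)  # hypothetical pass-success probabilities\n",
"y = rng.binomial(1, p_true)                # simulated pass outcomes (1 = successful)\n",
"xP = p_true                                # a perfectly calibrated stand-in model\n",
"\n",
"brier = brier_score_loss(y, xP)            # mean squared error of the probabilities\n",
"auc_prob = roc_auc_score(y, xP)            # ranking quality, from the probabilities\n",
"\n",
"y_hat = (xP >= 0.5).astype(int)            # hard labels at the assumed threshold\n",
"acc = accuracy_score(y, y_hat)             # share of passes labelled correctly\n",
"auc_label = roc_auc_score(y, y_hat)        # AUC from hard labels loses the ranking\n",
"\n",
"print(round(brier, 3), round(auc_prob, 3), round(acc, 3), round(auc_label, 3))\n",
"```\n",
"\n",
"In this simulation every pass has a predicted probability of at least 0.5, so the thresholded labels are all 1: accuracy simply equals the overall success rate and the label-based AUC carries no information. With an imbalanced target such as pass success, probability-based metrics are therefore usually the more informative ones."
]
},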
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"# 8) Applying advanced model to `df_passes` (**test + training** data) and **summarising best forwards at \"risky\" passes in England**"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"leagues = ['England'] #['England','Spain','Italy','Germany','France']\n",
"positions = ['FWD']\n",
"xPthreshold = 0.5\n",
"minMatches = 10\n",
"minPasses = 50"
]
},
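{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one plausible way to turn the parameters above into a ranking: keep only passes whose predicted success probability is below `xPthreshold`, then sum completed minus expected passes per player. The toy values are invented, and the aggregation is an assumption about how a summary of this kind can be built, not the notebook's own helper. The `minMatches` and `minPasses` filters are left out for brevity.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy pass table: column names mirror the notebook's, values are invented\n",
"toy_passes = pd.DataFrame({\n",
"    'playerId':    [1, 1, 1, 2, 2, 2],\n",
"    'successFlag': [1, 1, 0, 1, 0, 0],\n",
"    'xP':          [0.40, 0.30, 0.45, 0.35, 0.20, 0.90],\n",
"})\n",
"xPthreshold = 0.5\n",
"\n",
"# Keep only the risky passes, i.e. those the model rates below the threshold\n",
"risky = toy_passes.loc[toy_passes.xP < xPthreshold].copy()\n",
"risky['overxP'] = risky.successFlag - risky.xP  # completed minus expected\n",
"\n",
"# Aggregate per player and normalise by the number of risky attempts\n",
"summary = risky.groupby('playerId').agg(\n",
"    overxP=('overxP', 'sum'),\n",
"    totSuccessful=('successFlag', 'sum'),\n",
"    totAttempted=('successFlag', 'size'),\n",
")\n",
"summary['overxPper100attempts'] = 100 * summary.overxP / summary.totAttempted\n",
"summary = summary.sort_values('overxPper100attempts', ascending=False)\n",
"print(summary.round(2))\n",
"```\n",
"\n",
"Normalising per 100 attempts stops high-volume passers from dominating the ranking purely through the number of risky passes they attempt."
]
},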
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" playerId | \n",
" shortName | \n",
" roleCode | \n",
" overxP | \n",
" overxPper90mins | \n",
" overxPper100attempts | \n",
" totSuccessful | \n",
" totAttempted | \n",
" pcCompleted | \n",
" minutesPlayed | \n",
" totMatches | \n",
" pcTrickyBall | \n",
" overallPcCompleted | \n",
" vec_x | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 25707 | \n",
" E. Hazard | \n",
" FWD | \n",
" 17.1 | \n",
" 0.6 | \n",
" 13.9 | \n",
" 59 | \n",
" 123 | \n",
" 48.0 | \n",
" 2432 | \n",
" 34 | \n",
" 8.8 | \n",
" 85.0 | \n",
" 748.65 | \n",
"
\n",
" \n",
" 2 | \n",
" 7944 | \n",
" W. Rooney | \n",
" FWD | \n",
" 9.0 | \n",
" 0.4 | \n",
" 8.4 | \n",
" 50 | \n",
" 108 | \n",
" 46.3 | \n",
" 2238 | \n",
" 30 | \n",
" 9.0 | \n",
" 80.0 | \n",
" 2674.35 | \n",
"
\n",
" \n",
" 6 | \n",
" 8325 | \n",
" S. Agüero | \n",
" FWD | \n",
" 4.2 | \n",
" 0.2 | \n",
" 8.1 | \n",
" 21 | \n",
" 52 | \n",
" 40.4 | \n",
" 1739 | \n",
" 22 | \n",
" 8.5 | \n",
" 84.7 | \n",
" 22.05 | \n",
"
\n",
" \n",
" 4 | \n",
" 25747 | \n",
" S. Mané | \n",
" FWD | \n",
" 6.8 | \n",
" 0.3 | \n",
" 6.5 | \n",
" 40 | \n",
" 104 | \n",
" 38.5 | \n",
" 2184 | \n",
" 27 | \n",
" 11.0 | \n",
" 82.4 | \n",
" -30.45 | \n",
"
\n",
" \n",
" 5 | \n",
" 9637 | \n",
" J. King | \n",
" FWD | \n",
" 6.3 | \n",
" 0.3 | \n",
" 6.1 | \n",
" 40 | \n",
" 104 | \n",
" 38.5 | \n",
" 2160 | \n",
" 28 | \n",
" 16.2 | \n",
" 75.3 | \n",
" 489.30 | \n",
"
\n",
" \n",
" 1 | \n",
" 3361 | \n",
" A. Sánchez | \n",
" FWD | \n",
" 14.3 | \n",
" 0.5 | \n",
" 6.0 | \n",
" 99 | \n",
" 236 | \n",
" 41.9 | \n",
" 2553 | \n",
" 31 | \n",
" 15.7 | \n",
" 74.0 | \n",
" 2404.50 | \n",
"
\n",
" \n",
" 12 | \n",
" 7879 | \n",
" T. Walcott | \n",
" FWD | \n",
" 3.1 | \n",
" 0.2 | \n",
" 4.5 | \n",
" 26 | \n",
" 69 | \n",
" 37.7 | \n",
" 1188 | \n",
" 17 | \n",
" 19.4 | \n",
" 74.4 | \n",
" 359.10 | \n",
"
\n",
" \n",
" 18 | \n",
" 25413 | \n",
" A. Lacazette | \n",
" FWD | \n",
" 2.1 | \n",
" 0.1 | \n",
" 3.6 | \n",
" 21 | \n",
" 57 | \n",
" 36.8 | \n",
" 1832 | \n",
" 25 | \n",
" 8.7 | \n",
" 79.1 | \n",
" 130.20 | \n",
"
\n",
" \n",
" 20 | \n",
" 9123 | \n",
" A. Barnes | \n",
" FWD | \n",
" 1.9 | \n",
" 0.1 | \n",
" 3.4 | \n",
" 21 | \n",
" 55 | \n",
" 38.2 | \n",
" 1779 | \n",
" 26 | \n",
" 13.4 | \n",
" 67.9 | \n",
" -106.05 | \n",
"
\n",
" \n",
" 11 | \n",
" 134513 | \n",
" A. Martial | \n",
" FWD | \n",
" 3.2 | \n",
" 0.2 | \n",
" 3.2 | \n",
" 34 | \n",
" 99 | \n",
" 34.3 | \n",
" 1551 | \n",
" 27 | \n",
" 12.9 | \n",
" 80.8 | \n",
" -371.70 | \n",
"
\n",
" \n",
" 10 | \n",
" 14703 | \n",
" M. Arnautović | \n",
" FWD | \n",
" 3.2 | \n",
" 0.1 | \n",
" 2.9 | \n",
" 39 | \n",
" 111 | \n",
" 35.1 | \n",
" 2249 | \n",
" 29 | \n",
" 15.6 | \n",
" 71.5 | \n",
" -255.15 | \n",
"
\n",
" \n",
" 9 | \n",
" 120353 | \n",
" Mohamed Salah | \n",
" FWD | \n",
" 3.4 | \n",
" 0.1 | \n",
" 2.7 | \n",
" 45 | \n",
" 123 | \n",
" 36.6 | \n",
" 2748 | \n",
" 34 | \n",
" 12.3 | \n",
" 77.0 | \n",
" 316.05 | \n",
"
\n",
" \n",
" 13 | \n",
" 11066 | \n",
" R. Sterling | \n",
" FWD | \n",
" 2.8 | \n",
" 0.1 | \n",
" 2.5 | \n",
" 37 | \n",
" 111 | \n",
" 33.3 | \n",
" 2580 | \n",
" 32 | \n",
" 9.3 | \n",
" 85.5 | \n",
" -888.30 | \n",
"
\n",
" \n",
" 15 | \n",
" 8717 | \n",
" H. Kane | \n",
" FWD | \n",
" 2.4 | \n",
" 0.1 | \n",
" 2.5 | \n",
" 36 | \n",
" 96 | \n",
" 37.5 | \n",
" 2794 | \n",
" 33 | \n",
" 15.9 | \n",
" 74.0 | \n",
" 855.75 | \n",
"
\n",
" \n",
" 33 | \n",
" 8958 | \n",
" J. Rodriguez | \n",
" FWD | \n",
" 0.7 | \n",
" 0.0 | \n",
" 0.9 | \n",
" 27 | \n",
" 74 | \n",
" 36.5 | \n",
" 2593 | \n",
" 34 | \n",
" 11.7 | \n",
" 74.8 | \n",
" 701.40 | \n",
"
\n",
" \n",
" 37 | \n",
" 12829 | \n",
" J. Vardy | \n",
" FWD | \n",
" 0.6 | \n",
" 0.0 | \n",
" 0.8 | \n",
" 25 | \n",
" 80 | \n",
" 31.2 | \n",
" 3074 | \n",
" 35 | \n",
" 19.5 | \n",
" 66.7 | \n",
" -581.70 | \n",
"
\n",
" \n",
" 31 | \n",
" 3802 | \n",
" Philippe Coutinho | \n",
" FWD | \n",
" 0.8 | \n",
" 0.1 | \n",
" 0.8 | \n",
" 37 | \n",
" 103 | \n",
" 35.9 | \n",
" 1115 | \n",
" 14 | \n",
" 13.9 | \n",
" 78.8 | \n",
" 1159.20 | \n",
"
\n",
" \n",
" 36 | \n",
" 7905 | \n",
" R. Lukaku | \n",
" FWD | \n",
" 0.6 | \n",
" 0.0 | \n",
" 0.6 | \n",
" 36 | \n",
" 107 | \n",
" 33.6 | \n",
" 2585 | \n",
" 30 | \n",
" 17.3 | \n",
" 72.1 | \n",
" 534.45 | \n",
"
\n",
" \n",
" 40 | \n",
" 230883 | \n",
" Ayoze Pérez | \n",
" FWD | \n",
" 0.4 | \n",
" 0.0 | \n",
" 0.4 | \n",
" 30 | \n",
" 88 | \n",
" 34.1 | \n",
" 2381 | \n",
" 32 | \n",
" 13.0 | \n",
" 75.0 | \n",
" 270.90 | \n",
"
\n",
" \n",
" 51 | \n",
" 8903 | \n",
" T. Deeney | \n",
" FWD | \n",
" -0.3 | \n",
" -0.0 | \n",
" -0.4 | \n",
" 25 | \n",
" 64 | \n",
" 39.1 | \n",
" 1684 | \n",
" 21 | \n",
" 15.9 | \n",
" 67.2 | \n",
" 622.65 | \n",
"
\n",
" \n",
" 59 | \n",
" 14911 | \n",
" Son Heung-Min | \n",
" FWD | \n",
" -0.5 | \n",
" -0.0 | \n",
" -0.5 | \n",
" 32 | \n",
" 103 | \n",
" 31.1 | \n",
" 1908 | \n",
" 27 | \n",
" 10.8 | \n",
" 83.0 | \n",
" -317.10 | \n",
"
\n",
" \n",
" 53 | \n",
" 3324 | \n",
" Álvaro Morata | \n",
" FWD | \n",
" -0.3 | \n",
" -0.0 | \n",
" -0.6 | \n",
" 15 | \n",
" 52 | \n",
" 28.8 | \n",
" 1776 | \n",
" 24 | \n",
" 10.4 | \n",
" 76.8 | \n",
" -443.10 | \n",
"
\n",
" \n",
" 67 | \n",
" 397178 | \n",
" M. Rashford | \n",
" FWD | \n",
" -1.2 | \n",
" -0.1 | \n",
" -1.2 | \n",
" 30 | \n",
" 103 | \n",
" 29.1 | \n",
" 1582 | \n",
" 25 | \n",
" 16.0 | \n",
" 75.0 | \n",
" -208.95 | \n",
"
\n",
" \n",
" 72 | \n",
" 15808 | \n",
" Roberto Firmino | \n",
" FWD | \n",
" -1.6 | \n",
" -0.1 | \n",
" -1.2 | \n",
" 44 | \n",
" 139 | \n",
" 31.7 | \n",
" 2739 | \n",
" 35 | \n",
" 12.4 | \n",
" 75.7 | \n",
" 517.65 | \n",
"
\n",
" \n",
" 68 | \n",
" 8677 | \n",
" M. Antonio | \n",
" FWD | \n",
" -1.4 | \n",
" -0.1 | \n",
" -1.5 | \n",
" 30 | \n",
" 93 | \n",
" 32.3 | \n",
" 1354 | \n",
" 21 | \n",
" 21.1 | \n",
" 66.4 | \n",
" 330.75 | \n",
"
\n",
" \n",
" 78 | \n",
" 15054 | \n",
" M. Diouf | \n",
" FWD | \n",
" -2.0 | \n",
" -0.1 | \n",
" -2.3 | \n",
" 26 | \n",
" 87 | \n",
" 29.9 | \n",
" 2273 | \n",
" 28 | \n",
" 16.9 | \n",
" 64.6 | \n",
" -30.45 | \n",
"
\n",
" \n",
" 79 | \n",
" 15198 | \n",
" E. Choupo-Moting | \n",
" FWD | \n",
" -2.0 | \n",
" -0.1 | \n",
" -2.4 | \n",
" 25 | \n",
" 84 | \n",
" 29.8 | \n",
" 2134 | \n",
" 26 | \n",
" 13.4 | \n",
" 76.5 | \n",
" -237.30 | \n",
"
\n",
" \n",
" 89 | \n",
" 25571 | \n",
" J. Ayew | \n",
" FWD | \n",
" -2.8 | \n",
" -0.1 | \n",
" -3.5 | \n",
" 22 | \n",
" 80 | \n",
" 27.5 | \n",
" 2355 | \n",
" 28 | \n",
" 10.0 | \n",
" 83.3 | \n",
" 121.80 | \n",
"
\n",
" \n",
" 93 | \n",
" 3577 | \n",
" S. Rondón | \n",
" FWD | \n",
" -3.4 | \n",
" -0.1 | \n",
" -3.7 | \n",
" 28 | \n",
" 92 | \n",
" 30.4 | \n",
" 2457 | \n",
" 29 | \n",
" 13.5 | \n",
" 72.7 | \n",
" 136.50 | \n",
"
\n",
" \n",
" 95 | \n",
" 377071 | \n",
" Richarlison | \n",
" FWD | \n",
" -5.0 | \n",
" -0.2 | \n",
" -4.3 | \n",
" 30 | \n",
" 116 | \n",
" 25.9 | \n",
" 2635 | \n",
" 35 | \n",
" 15.4 | \n",
" 69.7 | \n",
" -127.05 | \n",
"
\n",
" \n",
" 90 | \n",
" 293687 | \n",
" D. Calvert-Lewin | \n",
" FWD | \n",
" -2.8 | \n",
" -0.2 | \n",
" -5.0 | \n",
" 19 | \n",
" 57 | \n",
" 33.3 | \n",
" 1639 | \n",
" 23 | \n",
" 15.6 | \n",
" 71.2 | \n",
" 426.30 | \n",
"
\n",
" \n",
" 96 | \n",
" 3360 | \n",
" Pedro | \n",
" FWD | \n",
" -5.4 | \n",
" -0.4 | \n",
" -8.0 | \n",
" 17 | \n",
" 67 | \n",
" 25.4 | \n",
" 1307 | \n",
" 23 | \n",
" 9.8 | \n",
" 80.9 | \n",
" 241.50 | \n",
"
\n",
" \n",
" 97 | \n",
" 8416 | \n",
" G. Murray | \n",
" FWD | \n",
" -9.0 | \n",
" -0.4 | \n",
" -11.5 | \n",
" 21 | \n",
" 78 | \n",
" 26.9 | \n",
" 2050 | \n",
" 28 | \n",
" 14.7 | \n",
" 66.8 | \n",
" 428.40 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" playerId shortName roleCode overxP overxPper90mins \\\n",
"0 25707 E. Hazard FWD 17.1 0.6 \n",
"2 7944 W. Rooney FWD 9.0 0.4 \n",
"6 8325 S. Agüero FWD 4.2 0.2 \n",
"4 25747 S. Mané FWD 6.8 0.3 \n",
"5 9637 J. King FWD 6.3 0.3 \n",
"1 3361 A. Sánchez FWD 14.3 0.5 \n",
"12 7879 T. Walcott FWD 3.1 0.2 \n",
"18 25413 A. Lacazette FWD 2.1 0.1 \n",
"20 9123 A. Barnes FWD 1.9 0.1 \n",
"11 134513 A. Martial FWD 3.2 0.2 \n",
"10 14703 M. Arnautović FWD 3.2 0.1 \n",
"9 120353 Mohamed Salah FWD 3.4 0.1 \n",
"13 11066 R. Sterling FWD 2.8 0.1 \n",
"15 8717 H. Kane FWD 2.4 0.1 \n",
"33 8958 J. Rodriguez FWD 0.7 0.0 \n",
"37 12829 J. Vardy FWD 0.6 0.0 \n",
"31 3802 Philippe Coutinho FWD 0.8 0.1 \n",
"36 7905 R. Lukaku FWD 0.6 0.0 \n",
"40 230883 Ayoze Pérez FWD 0.4 0.0 \n",
"51 8903 T. Deeney FWD -0.3 -0.0 \n",
"59 14911 Son Heung-Min FWD -0.5 -0.0 \n",
"53 3324 Álvaro Morata FWD -0.3 -0.0 \n",
"67 397178 M. Rashford FWD -1.2 -0.1 \n",
"72 15808 Roberto Firmino FWD -1.6 -0.1 \n",
"68 8677 M. Antonio FWD -1.4 -0.1 \n",
"78 15054 M. Diouf FWD -2.0 -0.1 \n",
"79 15198 E. Choupo-Moting FWD -2.0 -0.1 \n",
"89 25571 J. Ayew FWD -2.8 -0.1 \n",
"93 3577 S. Rondón FWD -3.4 -0.1 \n",
"95 377071 Richarlison FWD -5.0 -0.2 \n",
"90 293687 D. Calvert-Lewin FWD -2.8 -0.2 \n",
"96 3360 Pedro FWD -5.4 -0.4 \n",
"97 8416 G. Murray FWD -9.0 -0.4 \n",
"\n",
" overxPper100attempts totSuccessful totAttempted pcCompleted \\\n",
"0 13.9 59 123 48.0 \n",
"2 8.4 50 108 46.3 \n",
"6 8.1 21 52 40.4 \n",
"4 6.5 40 104 38.5 \n",
"5 6.1 40 104 38.5 \n",
"1 6.0 99 236 41.9 \n",
"12 4.5 26 69 37.7 \n",
"18 3.6 21 57 36.8 \n",
"20 3.4 21 55 38.2 \n",
"11 3.2 34 99 34.3 \n",
"10 2.9 39 111 35.1 \n",
"9 2.7 45 123 36.6 \n",
"13 2.5 37 111 33.3 \n",
"15 2.5 36 96 37.5 \n",
"33 0.9 27 74 36.5 \n",
"37 0.8 25 80 31.2 \n",
"31 0.8 37 103 35.9 \n",
"36 0.6 36 107 33.6 \n",
"40 0.4 30 88 34.1 \n",
"51 -0.4 25 64 39.1 \n",
"59 -0.5 32 103 31.1 \n",
"53 -0.6 15 52 28.8 \n",
"67 -1.2 30 103 29.1 \n",
"72 -1.2 44 139 31.7 \n",
"68 -1.5 30 93 32.3 \n",
"78 -2.3 26 87 29.9 \n",
"79 -2.4 25 84 29.8 \n",
"89 -3.5 22 80 27.5 \n",
"93 -3.7 28 92 30.4 \n",
"95 -4.3 30 116 25.9 \n",
"90 -5.0 19 57 33.3 \n",
"96 -8.0 17 67 25.4 \n",
"97 -11.5 21 78 26.9 \n",
"\n",
" minutesPlayed totMatches pcTrickyBall overallPcCompleted vec_x \n",
"0 2432 34 8.8 85.0 748.65 \n",
"2 2238 30 9.0 80.0 2674.35 \n",
"6 1739 22 8.5 84.7 22.05 \n",
"4 2184 27 11.0 82.4 -30.45 \n",
"5 2160 28 16.2 75.3 489.30 \n",
"1 2553 31 15.7 74.0 2404.50 \n",
"12 1188 17 19.4 74.4 359.10 \n",
"18 1832 25 8.7 79.1 130.20 \n",
"20 1779 26 13.4 67.9 -106.05 \n",
"11 1551 27 12.9 80.8 -371.70 \n",
"10 2249 29 15.6 71.5 -255.15 \n",
"9 2748 34 12.3 77.0 316.05 \n",
"13 2580 32 9.3 85.5 -888.30 \n",
"15 2794 33 15.9 74.0 855.75 \n",
"33 2593 34 11.7 74.8 701.40 \n",
"37 3074 35 19.5 66.7 -581.70 \n",
"31 1115 14 13.9 78.8 1159.20 \n",
"36 2585 30 17.3 72.1 534.45 \n",
"40 2381 32 13.0 75.0 270.90 \n",
"51 1684 21 15.9 67.2 622.65 \n",
"59 1908 27 10.8 83.0 -317.10 \n",
"53 1776 24 10.4 76.8 -443.10 \n",
"67 1582 25 16.0 75.0 -208.95 \n",
"72 2739 35 12.4 75.7 517.65 \n",
"68 1354 21 21.1 66.4 330.75 \n",
"78 2273 28 16.9 64.6 -30.45 \n",
"79 2134 26 13.4 76.5 -237.30 \n",
"89 2355 28 10.0 83.3 121.80 \n",
"93 2457 29 13.5 72.7 136.50 \n",
"95 2635 35 15.4 69.7 -127.05 \n",
"90 1639 23 15.6 71.2 426.30 \n",
"96 1307 23 9.8 80.9 241.50 \n",
"97 2050 28 14.7 66.8 428.40 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# applying advanced model to predict xP\n",
"df_passes['xP'] = pass_model_advanced_probit.predict(df_passes)\n",
"# calculating \"overxP\"\n",
"df_passes['overxP'] = df_passes['successFlag'] - df_passes['xP']\n",
"\n",
"## get average distance and average y distance metrics, too\n",
"\n",
"df_overall_pcSuccess = df_passes.loc[df_passes['source'].isin(leagues)]\\\n",
" .groupby(['playerId','shortName'])\\\n",
" .agg({'successFlag':np.sum,'id':'nunique'})\\\n",
" .rename(columns={'successFlag':'overallSuccessful','id':'overallAttempted'})\n",
"\n",
"df_overall_pcSuccess['overallPcCompleted'] = np.round(100*df_overall_pcSuccess['overallSuccessful'] / df_overall_pcSuccess['overallAttempted'], 1)\n",
"\n",
"# producing summary, adding in formations data to calculate the excess xP per 90 minutes\n",
"df_summary_passer = df_passes.loc[(df_passes['roleCode'].isin(positions))\\\n",
" & (df_passes['source'].isin(leagues))\\\n",
" & (df_passes['xP'] < xPthreshold)]\\\n",
" .merge(df_formations, on=['matchId','teamId','playerId'], how='inner')\\\n",
" .groupby(['matchId','playerId','roleCode','minutesPlayed','shortName'])\\\n",
" .agg({'overxP':np.sum,'id':'nunique','successFlag':np.sum,'vec_x':np.sum})\\\n",
" .reset_index()\\\n",
" .rename(columns={'id':'totAttemptedPerMatch','successFlag':'totSuccessfulPerMatch'})\\\n",
" .groupby(['playerId','shortName','roleCode'])\\\n",
" .agg({'overxP':np.sum,'totAttemptedPerMatch':np.sum,'totSuccessfulPerMatch':np.sum,'minutesPlayed':np.sum,'matchId':'nunique','vec_x':np.sum})\\\n",
" .rename(columns={'totAttemptedPerMatch':'totAttempted','totSuccessfulPerMatch':'totSuccessful','matchId':'totMatches'})\\\n",
" .sort_values('overxP', ascending=False)\\\n",
" .reset_index()\n",
"\n",
"# getting the overall fraction of completed passes\n",
"df_summary_passer['pcCompleted'] = np.round(100*df_summary_passer['totSuccessful'] / df_summary_passer['totAttempted'], 1)\n",
"# getting a normalised metric per 90 minutes of play\n",
"df_summary_passer['overxPper90mins'] = np.round(90*(df_summary_passer['overxP'] / df_summary_passer['minutesPlayed']), 1)\n",
"# getting a normalised metric per 100 attempts\n",
"df_summary_passer['overxPper100attempts'] = np.round(100*(df_summary_passer['overxP'] / df_summary_passer['totAttempted']), 1)\n",
"# explicitly making mins played an integer\n",
"df_summary_passer['minutesPlayed'] = df_summary_passer.minutesPlayed.apply(lambda x: int(x))\n",
"# rounding overxP score\n",
"df_summary_passer['overxP'] = np.round(df_summary_passer['overxP'], 1)\n",
"\n",
"# joining difficult pass table to overall pc completed table\n",
"df_summary_passer = df_summary_passer.merge(df_overall_pcSuccess, on=['playerId','shortName'], how='inner')\n",
"df_summary_passer['pcTrickyBall'] = np.round(100*df_summary_passer['totAttempted'] / df_summary_passer['overallAttempted'], 1)\n",
"\n",
"# filtering using minimum criteria\n",
"df_summary_passer = df_summary_passer.loc[(df_summary_passer['totMatches'] >= minMatches) & (df_summary_passer['totAttempted'] >= minPasses)]\n",
"\n",
"df_summary_passer = df_summary_passer[['playerId','shortName','roleCode','overxP','overxPper90mins','overxPper100attempts','totSuccessful','totAttempted','pcCompleted','minutesPlayed','totMatches','pcTrickyBall','overallPcCompleted','vec_x']]\n",
"\n",
"df_summary_passer.sort_values('overxPper100attempts', ascending=False).head(40)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Latex Table of Results"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\\begin{tabular}{lrrr}\n",
"\\toprule\n",
" shortName & overxPper100attempts & totAttempted & pcTrickyBall \\\\\n",
"\\midrule\n",
" E. Hazard & 13.9 & 123 & 8.8 \\\\\n",
" W. Rooney & 8.4 & 108 & 9.0 \\\\\n",
" S. Agüero & 8.1 & 52 & 8.5 \\\\\n",
" S. Mané & 6.5 & 104 & 11.0 \\\\\n",
" J. King & 6.1 & 104 & 16.2 \\\\\n",
" A. Sánchez & 6.0 & 236 & 15.7 \\\\\n",
" T. Walcott & 4.5 & 69 & 19.4 \\\\\n",
" A. Lacazette & 3.6 & 57 & 8.7 \\\\\n",
" A. Barnes & 3.4 & 55 & 13.4 \\\\\n",
" A. Martial & 3.2 & 99 & 12.9 \\\\\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\n"
]
}
],
"source": [
"print(df_summary_passer.sort_values('overxPper100attempts', ascending=False).head(10).to_latex(index_names=False, index=None\\\n",
" ,columns=['shortName','overxPper100attempts'\\\n",
" ,'totAttempted','pcTrickyBall']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hazard distribution visualisation"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"df_hazard = df_passes.loc[(df_passes['playerId'] == 25707) & (df_passes['source'] == 'England')\\\n",
" & (df_passes['xP'] < 0.5)]\n",
"\n",
"df_hazard = df_hazard.merge(df_players.rename(columns={'playerId':'passRecipientPlayerId'}), on='passRecipientPlayerId', how='inner', suffixes=(['_hazard','_receiver']))\n",
"\n",
"hazard_main_receivers = ['Álvaro Morata','Willian','Pedro','Fàbregas','V. Moses','O. Giroud']\n",
"\n",
"df_hazard_main_receivers = df_hazard.loc[df_hazard['shortName_receiver'].isin(hazard_main_receivers)][['source','matchId','eventSec','subEventName','previousSubEventName','shortName_hazard','shortName_receiver','xP','overxP','startPassM_x','startPassM_y','endPassM_x','endPassM_y']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotting Hazard Passes by Receiver"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pitch = Pitch(layout=(3, 2), tight_layout=False, constrained_layout=True, view='half', orientation='vertical', figsize=(12,12))\n",
"\n",
"fig, axs = pitch.draw()\n",
"\n",
"for ax, receiver in zip(axs.flat, hazard_main_receivers):\n",
" \n",
" df_receiver = df_hazard_main_receivers.loc[df_hazard_main_receivers['shortName_receiver'] == receiver]\n",
" \n",
" # smart passes\n",
" pitch.lines(df_receiver.loc[df_receiver['subEventName'] == 'Smart pass'].startPassM_x*120/105, df_receiver.loc[df_receiver['subEventName'] == 'Smart pass'].startPassM_y*80/68,\\\n",
" df_receiver.loc[df_receiver['subEventName'] == 'Smart pass'].endPassM_x*120/105, df_receiver.loc[df_receiver['subEventName'] == 'Smart pass'].endPassM_y*80/68,\n",
" lw=10, transparent=True, comet=True,\n",
" label=f'Smart passes to {receiver}', color='red', ax=ax, alpha_start=0.01, alpha_end=1)\n",
" \n",
" # crosses\n",
" pitch.lines(df_receiver.loc[df_receiver['subEventName'] == 'Cross'].startPassM_x*120/105, df_receiver.loc[df_receiver['subEventName'] == 'Cross'].startPassM_y*80/68,\\\n",
" df_receiver.loc[df_receiver['subEventName'] == 'Cross'].endPassM_x*120/105, df_receiver.loc[df_receiver['subEventName'] == 'Cross'].endPassM_y*80/68,\n",
" lw=10, transparent=True, comet=True,\n",
" label=f'Crosses to {receiver}', ax=ax, alpha_start=0.01, alpha_end=1)\n",
"\n",
" leg = ax.legend(borderpad=1, markerscale=1.5, labelspacing=1.5, loc='lower left', fontsize=12)\n",
"\n",
"fig.savefig('Hazard.pdf', format='pdf',dpi=300,pad_inches=0,bbox_inches='tight', transparent=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Looking at Receivers"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" playerId | \n",
" shortName | \n",
" overxP | \n",
" totAttempted | \n",
" totCompleted | \n",
" minutesPlayed | \n",
" totMatches | \n",
" fracCompleted | \n",
" overxPper90received | \n",
"
\n",
" \n",
" \n",
" \n",
" 26 | \n",
" 26010.0 | \n",
" O. Giroud | \n",
" 23.778349 | \n",
" 37 | \n",
" 37 | \n",
" 728.0 | \n",
" 17 | \n",
" 1.0 | \n",
" 2.939631 | \n",
"
\n",
" \n",
" 15 | \n",
" 11669.0 | \n",
" C. Wilson | \n",
" 28.063730 | \n",
" 44 | \n",
" 44 | \n",
" 1238.0 | \n",
" 18 | \n",
" 1.0 | \n",
" 2.040174 | \n",
"
\n",
" \n",
" 22 | \n",
" 10663.0 | \n",
" A. Gray | \n",
" 24.590565 | \n",
" 40 | \n",
" 40 | \n",
" 1120.0 | \n",
" 18 | \n",
" 1.0 | \n",
" 1.976028 | \n",
"
\n",
" \n",
" 9 | \n",
" 8384.0 | \n",
" S. Long | \n",
" 31.249384 | \n",
" 50 | \n",
" 50 | \n",
" 1453.0 | \n",
" 22 | \n",
" 1.0 | \n",
" 1.935612 | \n",
"
\n",
" \n",
" 5 | \n",
" 9123.0 | \n",
" A. Barnes | \n",
" 35.533073 | \n",
" 57 | \n",
" 57 | \n",
" 1665.0 | \n",
" 21 | \n",
" 1.0 | \n",
" 1.920707 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" playerId shortName overxP totAttempted totCompleted minutesPlayed \\\n",
"26 26010.0 O. Giroud 23.778349 37 37 728.0 \n",
"15 11669.0 C. Wilson 28.063730 44 44 1238.0 \n",
"22 10663.0 A. Gray 24.590565 40 40 1120.0 \n",
"9 8384.0 S. Long 31.249384 50 50 1453.0 \n",
"5 9123.0 A. Barnes 35.533073 57 57 1665.0 \n",
"\n",
" totMatches fracCompleted overxPper90received \n",
"26 17 1.0 2.939631 \n",
"15 18 1.0 2.040174 \n",
"22 18 1.0 1.976028 \n",
"9 22 1.0 1.935612 \n",
"5 21 1.0 1.920707 "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# producing summary, adding in formations data to calculate the excess xP per 90 minutes\n",
"\n",
"df_passes_receiver = df_passes.copy()\n",
"df_passes_receiver = df_passes_receiver.drop(columns=['playerId','shortName','roleCode'])\n",
"df_passes_receiver = df_passes_receiver.rename(columns={'passRecipientPlayerId':'playerId'})\n",
"df_passes_receiver = df_passes_receiver.loc[pd.isna(df_passes_receiver['playerId']) == False]\n",
"\n",
"df_passes_receiver = df_passes_receiver.merge(df_players, on=['playerId'], how='inner')\n",
"\n",
"df_summary_receiver = df_passes_receiver.loc[(df_passes_receiver['roleCode'].isin(positions))\\\n",
" & (df_passes_receiver['source'].isin(leagues))\\\n",
" & (df_passes_receiver['xP'] < xPthreshold)]\\\n",
" .merge(df_formations, on=['matchId','teamId','playerId'], how='inner')\\\n",
" .groupby(['matchId','playerId','minutesPlayed','shortName'])\\\n",
" .agg({'overxP':np.sum,'id':'nunique','successFlag':np.sum})\\\n",
" .reset_index()\\\n",
" .rename(columns={'id':'totAttemptedPerMatch','successFlag':'totCompletedPerMatch'})\\\n",
" .groupby(['playerId','shortName'])\\\n",
" .agg({'overxP':np.sum,'totAttemptedPerMatch':np.sum,'totCompletedPerMatch':np.sum,'minutesPlayed':np.sum,'matchId':'nunique'})\\\n",
" .rename(columns={'totAttemptedPerMatch':'totAttempted','totCompletedPerMatch':'totCompleted','matchId':'totMatches'})\\\n",
" .sort_values('overxP', ascending=False)\\\n",
" .reset_index()\n",
"\n",
"df_summary_receiver['fracCompleted'] = df_summary_receiver['totCompleted'] / df_summary_receiver['totAttempted']\n",
"df_summary_receiver['overxPper90received'] = 90*(df_summary_receiver['overxP'] / df_summary_receiver['minutesPlayed'])\n",
"df_summary_receiver = df_summary_receiver.loc[df_summary_receiver['totMatches'] >= 10]\n",
"\n",
"df_summary_receiver.sort_values('overxPper90received', ascending=False).head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py37_football",
"language": "python",
"name": "py37_football"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}