{ "cells": [ { "cell_type": "markdown", "id": "93c92455", "metadata": {}, "source": [ "# ... and in the end, the most experienced team wins\n", "\n", "Author: Frédéric DITH | Creation date: 2024-03-11 | Last updated: 2024-03-14\n", "\n", "`python`, `pandas`, `plotly-express`, `data-analytics`, `statistics`, `football`, `championsleague`\n", "\n", "## Context\n", "\n", "This self-assigned project is a simple exploration of a dataset of football games: all the games played in the Champions League from the 2016-17 season through the 2022-23 season.\n", "\n", "## Findings\n", "\n", "### [1] The data suggests that there's an \"experience advantage\"\n", "\n", "The correlation coefficient between the number of games played and the overall win rate is +0.777 (a strong positive correlation). This does not establish causality on its own, and part of the relationship runs the other way: teams that win advance further in the competition and therefore play more games. With that caveat, the value suggests that experience is closely tied to a team's success in the competition. One way to read it: *in any matchup, the team with more Champions League experience can, on average, be expected to win*.\n", "\n", "### [2] The data suggests that there's a \"home advantage\"\n", "\n", "- **Overall**: across the entire set of 744 games, the proportion of home team wins (43.95%) is about 9 percentage points higher than the proportion of away team wins (34.68%), suggesting a potential home advantage.\n", "- **Group stage**: in the subset of 576 group stage games, the proportion of home team wins (44.10%) remains higher than the proportion of away team wins (32.64%), consistent with the trend observed in the entire dataset. The proportion of draws (23.26%) is higher than in the entire set (21.37%), which may indicate that group stage games are more evenly matched.\n", "- **Knockout games**: in the subset of 168 knockout games, the proportion of home team wins (43.45%) is only slightly higher than the proportion of away team wins (41.67%); the gap is much smaller than in the group stage games and in the entire dataset. 
The proportion of draws (14.88%) is the lowest of the three sets, indicating that knockout games tend to have clearer outcomes.\n", "- **Excluding draws**: in the subset of 585 games that did not end in a draw, 55.9% were won by the home team and 44.1% by the away team.\n", "\n", "Our analysis suggests that while there might be a home advantage overall and in group stage games, the advantage might diminish or become less significant in knockout games.\n", "\n", "## Methodology\n", "\n", "### Initial data source\n", "https://www.kaggle.com/datasets/cbxkgl/uefa-champions-league-2016-2022-data/data: 744 games, 74 teams\n", "\n", "### Added data\n", "The initial dataset lists all games (home team, away team, home goals, away goals). We added the following to it:\n", "- `home` and `away`: a value for each team that reflects whether the game was played at home or away\n", "- `win`, `loss` or `draw`: the result of the game for that team\n", "- `group` or `knockout`: I simply used the date of the game as a proxy. If the game took place between September and December, it is a group stage game. 
Otherwise it is a knockout game.\n", "\n", "As a result, each team in each game is given one of these twelve new distinct values:\n", "- Homewin_group\n", "- Homewin_knockout\n", "- Homeloss_group\n", "- Homeloss_knockout\n", "- Homedraw_group\n", "- Homedraw_knockout\n", "- Awaywin_group\n", "- Awaywin_knockout\n", "- Awayloss_group\n", "- Awayloss_knockout\n", "- Awaydraw_group\n", "- Awaydraw_knockout\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "50683fe4", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import plotly.express as px\n", "\n", "df = pd.read_csv('matches.csv')\n", "\n", "# ADD group stage or knockout\n", "# If the game took place between September and December it is a group stage game, otherwise knockout\n", "df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'], format='%d-%b-%y %I.%M.%S.%f %p')\n", "df['Month'] = df['DATE_TIME'].dt.strftime('%b')\n", "\n", "def map_month_to_group(month):\n", "    if month in ['Sep', 'Oct', 'Nov', 'Dec']:\n", "        return 'group'\n", "    else:\n", "        return 'knockout'\n", "\n", "df['stage'] = df['Month'].apply(map_month_to_group)\n", "df['Result_hometeam_stage'] = df['Result_hometeam'].str.cat(df['stage'], sep='_')\n", "df['Result_awayteam_stage'] = df['Result_awayteam'].str.cat(df['stage'], sep='_')\n", "\n", "# Count the number of games played by each team\n", "teams = pd.concat([df['HOME_TEAM'], df['AWAY_TEAM']]).unique()\n", "games_played = {team: 0 for team in teams}\n", "\n", "for index, row in df.iterrows():\n", "    games_played[row['HOME_TEAM']] += 1\n", "    games_played[row['AWAY_TEAM']] += 1\n", "games_played_df = pd.DataFrame.from_dict(games_played, orient='index', columns=['Games Played'])\n", "games_played_df.index.name = 'Team'\n", "games_played_df = games_played_df.sort_values(by='Games Played', ascending=False)\n", "\n", "\n", "# Store the counts for each team and result in each stage\n", "homewin_group = {}\n", "homewin_knockout = {}\n", "homeloss_group 
= {}\n", "homeloss_knockout = {}\n", "homedraw_group = {}\n", "homedraw_knockout = {}\n", "awaywin_group = {}\n", "awaywin_knockout = {}\n", "awayloss_group = {}\n", "awayloss_knockout = {}\n", "awaydraw_group = {}\n", "awaydraw_knockout = {}\n", "\n", "# Iterate through each row of the DataFrame\n", "for index, row in df.iterrows():\n", "    # Update counts based on the stage\n", "    stage = row['stage']\n", "    home_result = row['Result_hometeam']\n", "    away_result = row['Result_awayteam']\n", "\n", "    # Decide which dictionary to use based on stage and result\n", "    if stage == 'group':\n", "        home_dict = homewin_group if home_result == 'homewin' else (homeloss_group if home_result == 'homeloss' else homedraw_group)\n", "        away_dict = awaywin_group if away_result == 'awaywin' else (awayloss_group if away_result == 'awayloss' else awaydraw_group)\n", "    else:  # Knockout stage\n", "        home_dict = homewin_knockout if home_result == 'homewin' else (homeloss_knockout if home_result == 'homeloss' else homedraw_knockout)\n", "        away_dict = awaywin_knockout if away_result == 'awaywin' else (awayloss_knockout if away_result == 'awayloss' else awaydraw_knockout)\n", "\n", "    # Update counts for home team\n", "    home_team = row['HOME_TEAM']\n", "    if home_team not in home_dict:\n", "        home_dict[home_team] = 0\n", "    home_dict[home_team] += 1\n", "\n", "    # Update counts for away team\n", "    away_team = row['AWAY_TEAM']\n", "    if away_team not in away_dict:\n", "        away_dict[away_team] = 0\n", "    away_dict[away_team] += 1\n", "\n", "# Create DataFrames to store the counts for each team\n", "results_group_df = pd.DataFrame({'Homewin_group': homewin_group, 'Homeloss_group': homeloss_group, 'Homedraw_group': homedraw_group,\n", "                                 'Awaywin_group': awaywin_group, 'Awayloss_group': awayloss_group, 'Awaydraw_group': awaydraw_group})\n", "results_group_df.index.name = 'Team'\n", "\n", "results_knockout_df = pd.DataFrame({'Homewin_knockout': homewin_knockout, 'Homeloss_knockout': homeloss_knockout, 
'Homedraw_knockout': homedraw_knockout, \n", " 'Awaywin_knockout': awaywin_knockout, 'Awayloss_knockout': awayloss_knockout, 'Awaydraw_knockout': awaydraw_knockout})\n", "results_knockout_df.index.name = 'Team'\n", "\n", "merged_results_df = pd.merge(results_group_df, results_knockout_df, on='Team', how='outer')\n", "\n", "# merge results and games played\n", "merged_df = pd.merge(merged_results_df, games_played_df, left_index=True, right_index=True)\n", "merged_df.insert(0, 'Games Played', merged_df.pop('Games Played'))\n", "merged_df['Games Played'] = merged_df['Games Played'].astype(float)\n", "merged_df = merged_df.sort_values(by='Games Played', ascending=False)\n", "\n", "# ADD overall win, draw and loss rate\n", "merged_df['ovr_win_count'] = merged_df[['Homewin_group', 'Awaywin_group', 'Homewin_knockout', 'Awaywin_knockout']].sum(axis=1)\n", "merged_df['ovr_draw_count'] = merged_df[['Homedraw_group', 'Awaydraw_group', 'Homedraw_knockout', 'Awaydraw_knockout']].sum(axis=1)\n", "merged_df['ovr_loss_count'] = merged_df[['Homeloss_group', 'Awayloss_group', 'Homeloss_knockout', 'Awayloss_knockout']].sum(axis=1)\n", "\n", "merged_df['ovr_win'] = merged_df['ovr_win_count'] / merged_df['Games Played']\n", "merged_df['ovr_draw'] = merged_df['ovr_draw_count'] / merged_df['Games Played']\n", "merged_df['ovr_loss'] = merged_df['ovr_loss_count'] / merged_df['Games Played']\n", "\n", "merged_df.insert(1, 'ovr_win', merged_df.pop('ovr_win'))\n", "merged_df.insert(2, 'ovr_draw', merged_df.pop('ovr_draw'))\n", "merged_df.insert(3, 'ovr_loss', merged_df.pop('ovr_loss'))\n", "\n", "# ADD count games played in group and games played in knockout\n", "merged_df['Games_Played_Group'] = merged_df[['Homewin_group', 'Homeloss_group', 'Homedraw_group',\n", " 'Awaywin_group', 'Awayloss_group', 'Awaydraw_group']].sum(axis=1)\n", "\n", "merged_df['Games_Played_Knockout'] = merged_df[['Homewin_knockout', 'Homeloss_knockout', 'Homedraw_knockout',\n", " 'Awaywin_knockout', 
'Awayloss_knockout', 'Awaydraw_knockout']].sum(axis=1)\n", "\n", "# ADD win rates by venue and stage (home group, away group, home knockout, away knockout)\n", "# Note: each rate divides by ALL games the team played in that stage (home and away combined)\n", "merged_df['Winrate_home_stage'] = merged_df['Homewin_group'] / merged_df['Games_Played_Group']\n", "merged_df['Winrate_away_stage'] = merged_df['Awaywin_group'] / merged_df['Games_Played_Group']\n", "merged_df['Winrate_home_knockout'] = merged_df['Homewin_knockout'] / merged_df['Games_Played_Knockout']\n", "merged_df['Winrate_away_knockout'] = merged_df['Awaywin_knockout'] / merged_df['Games_Played_Knockout']\n", "\n", "merged_df['Winrate_group'] = (merged_df['Homewin_group'] + merged_df['Awaywin_group']) / merged_df['Games_Played_Group']\n", "merged_df['Winrate_knockout'] = (merged_df['Homewin_knockout'] + merged_df['Awaywin_knockout']) / merged_df['Games_Played_Knockout']\n", "\n", "# Replace NaN with zeros for further calculation\n", "merged_df.fillna(0, inplace=True)\n", "#merged_df.head(20)\n", "#merged_df.describe()" ] }, { "cell_type": "markdown", "id": "5eeaf9c5", "metadata": {}, "source": [ "# The experience advantage\n", "\n", "\n", "### How to read this chart\n", "- A team further to the right of the chart has played more games, that is, has more experience\n", "- A team higher up has won a larger share of the games it played\n", "- The trendline (method: Ordinary Least Squares) represents **the level of performance expected, according to the number of games played in the competition**. While many other parameters should be accounted for, it gives us some insight into how a team can be expected to perform, given its experience.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "d25d8e07", "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hovertemplate": "Games Played=%{x}
<br>Overall win rate=%{y}<br>
Team=%{text}", "legendgroup": "", "marker": { "color": "#636efa", "symbol": "circle" }, "mode": "markers+text", "name": "", "orientation": "v", "showlegend": false, "text": [ "Real Madrid", "Manchester City", "Bayern München", "Liverpool FC", "Juventus", "Paris Saint-Germain", "FC Barcelona", "Atlético Madrid", "Borussia Dortmund", "FC Porto", "Chelsea FC", "SL Benfica", "Tottenham Hotspur", "Sevilla FC", "Manchester United", "Shakhtar Donetsk", "AFC Ajax", "Club Brugge KV", "RB Leipzig", "SSC Napoli", "Inter", "Olympique Lyon", "AS Monaco", "Atalanta", "Beşiktaş", "RB Salzburg", "Sporting CP", "AS Roma", "CSKA Moskva", "Lokomotiv Moskva", "Olympiakos Piraeus", "Zenit St. Petersburg", "Dinamo Kiev", "Bayer Leverkusen", "Valencia CF", "FC Basel", "Bor. Mönchengladbach", "Lille OSC", "Galatasaray", "PSV Eindhoven", "Celtic FC", "Dinamo Zagreb", "Crvena Zvezda", "Villarreal CF", "BSC Young Boys", "Leicester City", "Arsenal FC", "FC Schalke 04", "Lazio Roma", "KRC Genk", "Slavia Praha", "AEK Athen", "Stade Rennes", "1899 Hoffenheim", "FC Sheriff", "RSC Anderlecht", "Qarabağ FK", "NK Maribor", "APOEL Nikosia", "Ferencvárosi TC", "AC Milan", "FC Midtjylland", "Malmö FF", "FC København", "Legia Warszawa", "FK Rostov", "VfL Wolfsburg", "Feyenoord", "Spartak Moskva", "Olympique Marseille", "Viktoria Plzeň", "FK Krasnodar", "İstanbul Başakşehir", "PFC Ludogorets Razgrad" ], "textposition": "bottom right", "type": "scatter", "x": [ 67, 62, 61, 57, 57, 55, 55, 53, 48, 42, 39, 36, 35, 32, 32, 32, 32, 30, 30, 28, 26, 24, 24, 23, 20, 20, 20, 20, 18, 18, 18, 18, 18, 14, 14, 14, 14, 14, 12, 12, 12, 12, 12, 12, 12, 10, 8, 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6 ], "xaxis": "x", "y": [ 0.6119402985074627, 0.6451612903225806, 0.7213114754098361, 0.6140350877192983, 0.631578947368421, 0.5636363636363636, 0.5818181818181818, 0.4528301886792453, 0.4375, 0.4523809523809524, 0.5641025641025641, 0.2777777777777778, 0.45714285714285713, 0.375, 0.46875, 
0.28125, 0.53125, 0.13333333333333333, 0.4666666666666667, 0.35714285714285715, 0.34615384615384615, 0.2916666666666667, 0.25, 0.34782608695652173, 0.25, 0.3, 0.3, 0.5, 0.2777777777777778, 0.1111111111111111, 0.1111111111111111, 0.16666666666666666, 0.1111111111111111, 0.2857142857142857, 0.35714285714285715, 0.35714285714285715, 0.21428571428571427, 0.21428571428571427, 0.08333333333333333, 0, 0.08333333333333333, 0.08333333333333333, 0.16666666666666666, 0.4166666666666667, 0.16666666666666666, 0.5, 0.5, 0.375, 0.25, 0, 0, 0, 0, 0, 0.3333333333333333, 0.16666666666666666, 0, 0, 0, 0, 0.16666666666666666, 0, 0, 0.3333333333333333, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.16666666666666666, 0.3333333333333333, 0.16666666666666666, 0.16666666666666666, 0 ], "yaxis": "y" }, { "hovertemplate": "OLS trendline
<br>ovr_win = 0.00897004 * Games Played + 0.0859575<br>R<sup>2</sup>=0.604494<br><br>Games Played=%{x}<br>
Overall win rate=%{y} (trend)", "legendgroup": "", "marker": { "color": "#636efa", "symbol": "circle" }, "mode": "lines", "name": "", "showlegend": false, "textposition": "bottom right", "type": "scatter", "x": [ 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 10, 12, 12, 12, 12, 12, 12, 12, 14, 14, 14, 14, 14, 18, 18, 18, 18, 18, 20, 20, 20, 20, 23, 24, 24, 26, 28, 30, 30, 32, 32, 32, 32, 35, 36, 39, 42, 48, 53, 55, 55, 57, 57, 61, 62, 67 ], "xaxis": "x", "y": [ 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1397776925752266, 0.1577177731580514, 0.1577177731580514, 0.1577177731580514, 0.17565785374087622, 0.193597934323701, 0.193597934323701, 0.193597934323701, 0.193597934323701, 0.193597934323701, 0.193597934323701, 0.193597934323701, 0.2115380149065258, 0.2115380149065258, 0.2115380149065258, 0.2115380149065258, 0.2115380149065258, 0.24741817607217542, 0.24741817607217542, 0.24741817607217542, 0.24741817607217542, 0.24741817607217542, 0.26535825665500024, 0.26535825665500024, 0.26535825665500024, 0.26535825665500024, 0.2922683775292374, 0.3012384178206498, 0.3012384178206498, 0.3191784984034746, 0.3371185789862994, 0.3550586595691242, 0.3550586595691242, 0.372998740151949, 0.372998740151949, 0.372998740151949, 0.372998740151949, 0.3999088610261862, 0.40887890131759863, 0.4357890221918358, 0.462699143066073, 0.5165193848145474, 0.5613695862716094, 0.5793096668544342, 0.5793096668544342, 0.597249747437259, 0.597249747437259, 0.6331299086029086, 0.642099948894321, 0.686950150351383 ], "yaxis": "y" } ], 
"layout": { "height": 1000, "legend": { "tracegroupgap": 0 }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 
0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": 
{ "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 
0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { 
"standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": { "text": "Games Played vs Overall win rate" }, "xaxis": { "anchor": "y", "domain": [ 0, 1 ], "title": { "text": "Games Played" } }, "yaxis": { "anchor": "x", "domain": [ 0, 1 ], "title": { "text": "Overall win rate" } } } }, "text/html": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = px.scatter(merged_df, x='Games Played', y='ovr_win',\n", " text=merged_df.index,\n", " title='Games Played vs Overall win rate',\n", " labels={'Games Played': 'Games Played', 'ovr_win': 'Overall win rate'},\n", " trendline='ols')\n", "fig.update_layout(height=1000)\n", "fig.update_traces(textposition='bottom right')\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "29422237", "metadata": {}, "source": [ "### Correlation between Games played and Overall win rate" ] }, { "cell_type": "code", "execution_count": 3, "id": "5b2f5449", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Correlation coefficient between games played and overall win rate: 0.777491894098059\n" ] } ], "source": [ "correlation = merged_df['Games Played'].corr(merged_df['ovr_win'])\n", "print(\"Correlation coefficient between games played and overall win rate:\", correlation)" ] }, { "cell_type": "markdown", "id": "6dfc6d1a", "metadata": {}, "source": [ "+0.777 suggests a strong positive correlation between the number of games played and the overall win rate for the teams." ] }, { "cell_type": "markdown", "id": "ca4321bf", "metadata": {}, "source": [ "# The home advantage\n", "\n", "Instead of looking at team level data, we will go back to the main dataset and look at the outcome of all games.\n", "\n", "### Overall games distribution by outcomes" ] }, { "cell_type": "code", "execution_count": 4, "id": "ccee629e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>outcome</th><th>count</th><th>pct</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Homewin_group</td><td>254.0</td><td>0.341398</td></tr>\n",
"    <tr><th>1</th><td>Homeloss_group</td><td>188.0</td><td>0.252688</td></tr>\n",
"    <tr><th>2</th><td>Homedraw_group</td><td>134.0</td><td>0.180108</td></tr>\n",
"    <tr><th>3</th><td>Awaywin_group</td><td>188.0</td><td>0.252688</td></tr>\n",
"    <tr><th>4</th><td>Awayloss_group</td><td>254.0</td><td>0.341398</td></tr>\n",
"    <tr><th>5</th><td>Awaydraw_group</td><td>134.0</td><td>0.180108</td></tr>\n",
"    <tr><th>6</th><td>Homewin_knockout</td><td>73.0</td><td>0.098118</td></tr>\n",
"    <tr><th>7</th><td>Homeloss_knockout</td><td>70.0</td><td>0.094086</td></tr>\n",
"    <tr><th>8</th><td>Homedraw_knockout</td><td>25.0</td><td>0.033602</td></tr>\n",
"    <tr><th>9</th><td>Awaywin_knockout</td><td>70.0</td><td>0.094086</td></tr>\n",
"    <tr><th>10</th><td>Awayloss_knockout</td><td>73.0</td><td>0.098118</td></tr>\n",
"    <tr><th>11</th><td>Awaydraw_knockout</td><td>25.0</td><td>0.033602</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " outcome count pct\n", "0 Homewin_group 254.0 0.341398\n", "1 Homeloss_group 188.0 0.252688\n", "2 Homedraw_group 134.0 0.180108\n", "3 Awaywin_group 188.0 0.252688\n", "4 Awayloss_group 254.0 0.341398\n", "5 Awaydraw_group 134.0 0.180108\n", "6 Homewin_knockout 73.0 0.098118\n", "7 Homeloss_knockout 70.0 0.094086\n", "8 Homedraw_knockout 25.0 0.033602\n", "9 Awaywin_knockout 70.0 0.094086\n", "10 Awayloss_knockout 73.0 0.098118\n", "11 Awaydraw_knockout 25.0 0.033602" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_outcomes = pd.DataFrame(merged_results_df.sum(axis=0))\n", "all_outcomes = all_outcomes.reset_index()\n", "all_outcomes.columns = ['outcome','count']\n", "all_outcomes['pct'] = all_outcomes['count'] / 744\n", "all_outcomes" ] }, { "cell_type": "markdown", "id": "268dc3b9", "metadata": {}, "source": [ "### Games distribution by outcomes, and split by conditions (home/away, group/knockout)" ] }, { "cell_type": "code", "execution_count": 5, "id": "e8edbb0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th></th><th>Total Wins</th><th>Total Losses</th><th>Total Draws</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th rowspan=\"2\">Home</th><th>Group Stage</th><td>254.0</td><td>188.0</td><td>134.0</td></tr>\n",
"    <tr><th>Knockout</th><td>73.0</td><td>70.0</td><td>25.0</td></tr>\n",
"    <tr><th rowspan=\"2\">Away</th><th>Group Stage</th><td>188.0</td><td>254.0</td><td>134.0</td></tr>\n",
"    <tr><th>Knockout</th><td>70.0</td><td>73.0</td><td>25.0</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ "                  Total Wins  Total Losses  Total Draws\n", "Home Group Stage       254.0         188.0        134.0\n", "     Knockout           73.0          70.0         25.0\n", "Away Group Stage       188.0         254.0        134.0\n", "     Knockout           70.0          73.0         25.0" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "home_total_wins = merged_df[['Homewin_group','Homewin_knockout']].sum()\n", "home_total_loss = merged_df[['Homeloss_group','Homeloss_knockout']].sum()\n", "home_total_draw = merged_df[['Homedraw_group','Homedraw_knockout']].sum()\n", "away_total_wins = merged_df[['Awaywin_group','Awaywin_knockout']].sum()\n", "away_total_loss = merged_df[['Awayloss_group','Awayloss_knockout']].sum()\n", "away_total_draw = merged_df[['Awaydraw_group','Awaydraw_knockout']].sum()\n", "\n", "home_stats = pd.DataFrame({\n", "    'Total Wins': [home_total_wins['Homewin_group'], home_total_wins['Homewin_knockout']],\n", "    'Total Losses': [home_total_loss['Homeloss_group'], home_total_loss['Homeloss_knockout']],\n", "    'Total Draws': [home_total_draw['Homedraw_group'], home_total_draw['Homedraw_knockout']]\n", "}, index=['Group Stage', 'Knockout'])\n", "\n", "away_stats = pd.DataFrame({\n", "    'Total Wins': [away_total_wins['Awaywin_group'], away_total_wins['Awaywin_knockout']],\n", "    'Total Losses': [away_total_loss['Awayloss_group'], away_total_loss['Awayloss_knockout']],\n", "    'Total Draws': [away_total_draw['Awaydraw_group'], away_total_draw['Awaydraw_knockout']]\n", "}, index=['Group Stage', 'Knockout'])\n", "total_stats = pd.concat([home_stats, away_stats], keys=['Home', 'Away'])\n", "total_stats\n" ] }, { "cell_type": "markdown", "id": "2f881f2b", "metadata": {}, "source": [ "As expected, the numbers in the 'home' and 'away' subsets mirror each other:\n", "- The 254 home wins in the group stage equal the number of away losses in the group stage\n", "- The 188 home losses in the group stage equal the number of away wins in the group stage\n", "- Draws are identical in both subsets (when a 
game ends in a draw, both teams record a draw)\n", "- etc.\n", "\n", "Instead of looking at games as ending with \"a winner and a loser\" (or with both teams drawing), we can take a slightly different approach:\n", "\n", "**Any game results in one of three mutually exclusive outcomes**:\n", "- the home team wins, or\n", "- the away team wins, or\n", "- the game ends in a draw\n", "\n", "We can now compute the share of games for each of these outcomes.\n", "\n", "**Entire set** (744 games):\n", "- 43.95% ended with a win for the home team ((254+73)/744)\n", "- 34.68% ended with a win for the away team ((188+70)/744)\n", "- 21.37% ended with a draw ((134+25)/744)\n", "\n", "**Subset: group stage games** (576 games):\n", "- 44.10% ended with a win for the home team (254/576)\n", "- 32.64% ended with a win for the away team (188/576)\n", "- 23.26% ended with a draw (134/576)\n", "\n", "**Subset: knockout games** (168 games):\n", "- 43.45% ended with a win for the home team (73/168)\n", "- 41.67% ended with a win for the away team (70/168)\n", "- 14.88% ended with a draw (25/168)\n", "\n", "This suggests that while there might be a home advantage overall and in group stage games, the advantage might diminish or become less significant in knockout games." ] }, { "cell_type": "markdown", "id": "695e59b7", "metadata": {}, "source": [ "### Exclude all draws\n", "\n", "Let's now look at all the games that did not end in a draw. We know from the previous table that 159 games ended in a draw (134+25). This means that 585 games (744-159) ended with one of the teams winning.\n", "\n", "For this subset of 585 games, how many were won by the home team, and how many were won by the away team?" ] }, { "cell_type": "code", "execution_count": 6, "id": "8c6d5d07", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>result</th><th>count</th><th>pct</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>homewin</td><td>327</td><td>0.558974</td></tr>\n",
"    <tr><th>1</th><td>awaywin</td><td>258</td><td>0.441026</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ "    result  count       pct\n", "0  homewin    327  0.558974\n", "1  awaywin    258  0.441026" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count each game outcome, then keep only decisive games (drop draws)\n", "result_counts = df['Result'].value_counts()\n", "result_counts_df = result_counts.to_frame().reset_index()\n", "result_counts_df.columns = ['result', 'count']\n", "# Filter by label rather than by row position, so the ordering of value_counts does not matter\n", "result_counts_df = result_counts_df[result_counts_df['result'].isin(['homewin', 'awaywin'])].reset_index(drop=True)\n", "# Divide by the number of decisive games (585 here) rather than a hard-coded constant\n", "result_counts_df['pct'] = result_counts_df['count'] / result_counts_df['count'].sum()\n", "result_counts_df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 5 }