{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting English Premier League Results\n", "\n", "I've been a fan of football for as long as I can remember, so as part of my quest to learn as much as possible about data science, I decided to build a results predictor for English Premier League matches!\n", "\n", "I'll start by importing the commonly used data analysis libraries and loading the data into Python." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "\n", "# filter warnings, just so the jupyter notebook looks cleaner\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, to import the data, one file per season from 2016-17 to 2018-19. This data was collected from https://datahub.io/sports-data/english-premier-league" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "epl_1819 = pd.read_json(\"season-1819_json.json\")\n", "epl_1718 = pd.read_json(\"season-1718_json.json\")\n", "epl_1617 = pd.read_json(\"season-1617_json.json\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
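<table><thead><tr><th></th><th>AC</th><th>AF</th><th>AR</th><th>AS</th><th>AST</th><th>AY</th><th>AwayTeam</th><th>B365A</th><th>B365D</th><th>B365H</th><th>...</th><th>PSCH</th><th>PSD</th><th>PSH</th><th>Referee</th><th>VCA</th><th>VCD</th><th>VCH</th><th>WHA</th><th>WHD</th><th>WHH</th></tr></thead><tbody>
<tr><th>0</th><td>5</td><td>8</td><td>0</td><td>13</td><td>4</td><td>1</td><td>Leicester</td><td>7.50</td><td>3.9</td><td>1.57</td><td>...</td><td>1.55</td><td>3.93</td><td>1.58</td><td>A Marriner</td><td>7.00</td><td>4.0</td><td>1.57</td><td>6.00</td><td>3.8</td><td>1.57</td></tr>
<tr><th>1</th><td>4</td><td>9</td><td>0</td><td>10</td><td>1</td><td>1</td><td>Cardiff</td><td>4.50</td><td>3.6</td><td>1.90</td><td>...</td><td>1.88</td><td>3.63</td><td>1.89</td><td>K Friend</td><td>4.75</td><td>3.6</td><td>1.87</td><td>4.00</td><td>3.5</td><td>1.91</td></tr>
<tr><th>2</th><td>5</td><td>11</td><td>0</td><td>10</td><td>9</td><td>2</td><td>Crystal Palace</td><td>3.00</td><td>3.4</td><td>2.50</td><td>...</td><td>2.62</td><td>3.46</td><td>2.50</td><td>M Dean</td><td>3.00</td><td>3.4</td><td>2.50</td><td>2.80</td><td>3.3</td><td>2.45</td></tr>
<tr><th>3</th><td>5</td><td>8</td><td>0</td><td>13</td><td>4</td><td>1</td><td>Chelsea</td><td>1.61</td><td>4.0</td><td>6.50</td><td>...</td><td>7.24</td><td>4.02</td><td>6.41</td><td>C Kavanagh</td><td>1.62</td><td>4.0</td><td>6.50</td><td>1.57</td><td>3.9</td><td>5.80</td></tr>
<tr><th>4</th><td>5</td><td>12</td><td>0</td><td>15</td><td>5</td><td>2</td><td>Tottenham</td><td>2.04</td><td>3.5</td><td>3.90</td><td>...</td><td>4.74</td><td>3.57</td><td>3.83</td><td>M Atkinson</td><td>2.10</td><td>3.4</td><td>3.90</td><td>2.05</td><td>3.2</td><td>3.80</td></tr></tbody></table>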
\n", "

5 rows × 65 columns

\n", "
" ], "text/plain": [ " AC AF AR AS AST AY AwayTeam B365A B365D B365H ... PSCH \\\n", "0 5 8 0 13 4 1 Leicester 7.50 3.9 1.57 ... 1.55 \n", "1 4 9 0 10 1 1 Cardiff 4.50 3.6 1.90 ... 1.88 \n", "2 5 11 0 10 9 2 Crystal Palace 3.00 3.4 2.50 ... 2.62 \n", "3 5 8 0 13 4 1 Chelsea 1.61 4.0 6.50 ... 7.24 \n", "4 5 12 0 15 5 2 Tottenham 2.04 3.5 3.90 ... 4.74 \n", "\n", " PSD PSH Referee VCA VCD VCH WHA WHD WHH \n", "0 3.93 1.58 A Marriner 7.00 4.0 1.57 6.00 3.8 1.57 \n", "1 3.63 1.89 K Friend 4.75 3.6 1.87 4.00 3.5 1.91 \n", "2 3.46 2.50 M Dean 3.00 3.4 2.50 2.80 3.3 2.45 \n", "3 4.02 6.41 C Kavanagh 1.62 4.0 6.50 1.57 3.9 5.80 \n", "4 3.57 3.83 M Atkinson 2.10 3.4 3.90 2.05 3.2 3.80 \n", "\n", "[5 rows x 65 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three seasons the same way\n", "df = pd.concat([epl_1819, epl_1718, epl_1617])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are spoiled with data here, with 65 columns for each match. However, most of these columns hold betting odds from different bookmakers. While we can look at those values later to evaluate our models, the most important columns for now are:\n", "\n", "1) HomeTeam\n", "\n", "2) AwayTeam\n", "\n", "3) FTHG (Full-Time Home Goals)\n", "\n", "4) FTAG (Full-Time Away Goals)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
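<table><thead><tr><th></th><th>HomeTeam</th><th>AwayTeam</th><th>FTHG</th><th>FTAG</th></tr></thead><tbody>
<tr><th>0</th><td>Man United</td><td>Leicester</td><td>2</td><td>1</td></tr>
<tr><th>1</th><td>Bournemouth</td><td>Cardiff</td><td>2</td><td>0</td></tr>
<tr><th>2</th><td>Fulham</td><td>Crystal Palace</td><td>0</td><td>2</td></tr>
<tr><th>3</th><td>Huddersfield</td><td>Chelsea</td><td>0</td><td>3</td></tr>
<tr><th>4</th><td>Newcastle</td><td>Tottenham</td><td>1</td><td>2</td></tr></tbody></table>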
\n", "
" ], "text/plain": [ " HomeTeam AwayTeam FTHG FTAG\n", "0 Man United Leicester 2 1\n", "1 Bournemouth Cardiff 2 0\n", "2 Fulham Crystal Palace 0 2\n", "3 Huddersfield Chelsea 0 3\n", "4 Newcastle Tottenham 1 2" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df[['HomeTeam','AwayTeam','FTHG','FTAG']]\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's add a column to represent the total goals scored in each game:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
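<table><thead><tr><th></th><th>HomeTeam</th><th>AwayTeam</th><th>HomeGoals</th><th>AwayGoals</th><th>TotalGoals</th></tr></thead><tbody>
<tr><th>0</th><td>Man United</td><td>Leicester</td><td>2</td><td>1</td><td>3</td></tr>
<tr><th>1</th><td>Bournemouth</td><td>Cardiff</td><td>2</td><td>0</td><td>2</td></tr>
<tr><th>2</th><td>Fulham</td><td>Crystal Palace</td><td>0</td><td>2</td><td>2</td></tr>
<tr><th>3</th><td>Huddersfield</td><td>Chelsea</td><td>0</td><td>3</td><td>3</td></tr>
<tr><th>4</th><td>Newcastle</td><td>Tottenham</td><td>1</td><td>2</td><td>3</td></tr></tbody></table>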
\n", "
" ], "text/plain": [ " HomeTeam AwayTeam HomeGoals AwayGoals TotalGoals\n", "0 Man United Leicester 2 1 3\n", "1 Bournemouth Cardiff 2 0 2\n", "2 Fulham Crystal Palace 0 2 2\n", "3 Huddersfield Chelsea 0 3 3\n", "4 Newcastle Tottenham 1 2 3" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['FTG'] = df['FTHG'] + df['FTAG']\n", "df = df.rename(columns={'FTHG': 'HomeGoals', 'FTAG': 'AwayGoals', 'FTG':'TotalGoals'})\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "HomeGoals 1.565789\n", "AwayGoals 1.200877\n", "TotalGoals 2.766667\n", "dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# numeric_only=True skips the team-name columns, which recent pandas versions refuse to average\n", "df.mean(numeric_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see here that the home team scores more goals on average than the away team (about 1.57 against 1.20 per match). This is consistent with the concept of home field advantage, which is present in most sporting events. Let's take a look at some distributions to see if we can spot any statistical patterns.\n", "\n", "I want to plot the count of the number of goals for each category, just to see the data in its broadest form at first. Instead of building matplotlib subplots by hand, I preferred to reshape the data with pd.melt() so that it can be passed straight to Seaborn.\n", "\n", "Note: Normally, in order to properly examine distributions, I would have to normalize the y-axis to proportions (create a pmf). However, just to get an overall idea first, I will be looking at the distribution with raw count values." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
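<table><thead><tr><th></th><th>variable</th><th>value</th></tr></thead><tbody>
<tr><th>0</th><td>HomeGoals</td><td>2</td></tr>
<tr><th>1</th><td>HomeGoals</td><td>2</td></tr>
<tr><th>2</th><td>HomeGoals</td><td>0</td></tr>
<tr><th>3</th><td>HomeGoals</td><td>0</td></tr>
<tr><th>4</th><td>HomeGoals</td><td>1</td></tr></tbody></table>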
\n", "
" ], "text/plain": [ " variable value\n", "0 HomeGoals 2\n", "1 HomeGoals 2\n", "2 HomeGoals 0\n", "3 HomeGoals 0\n", "4 HomeGoals 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfMelted = pd.melt(df[['HomeGoals','AwayGoals','TotalGoals']])\n", "dfMelted.head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(15,8))\n", "sns.set_style('darkgrid')\n", "ax = sns.countplot(data=dfMelted,x='variable',hue='value', palette='coolwarm')\n", "ax.set(xlabel='Goals per Match', ylabel='Number of Matches',title='Number of Goals per Match')\n", "plt.legend(title=\"Number of Goals per Match\", fontsize='small', fancybox=True,facecolor='white')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, two potential distributions come to mind, purely due to the shape of the plots:\n", "\n", "1) Lognormal Distribution (similar to a normal distribution, skewed to the right)\n", "\n", "2) Poisson Distribution\n", "\n", "**The Poisson Distribution makes more sense in this case. Goal counts are non-negative integers, whereas the lognormal is a continuous distribution, and a Poisson distribution is fully described by its mean, so the different HomeGoals and AwayGoals means we saw earlier translate directly into different distributions. Moreover, the Poisson Distribution fits our problem pretty well, since it:**\n", "\n", "1) Is a Discrete Probability Distribution, and there are a discrete number of goals to be scored in a match\n", "\n", "2) Models the number of events within a fixed time period, which, in this case, is a 90 minute game of football\n", "\n", "3) Assumes events occur independently of one another (i.e. the probability of a goal being scored does not depend on whether or not goals have already been scored in the match)\n", "\n", "The formula for the Poisson Distribution is as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$P(x) = \\frac{e^{-\\lambda} \\lambda^{x}}{x!}, \\quad \\lambda > 0$$\n", "where $x$ is the number of goals and $\\lambda$ is the average number of goals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test this, let's compare the observed distributions of HomeGoals, AwayGoals and TotalGoals with Poisson Distributions that have the same means:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "_lambdaTotal = np.mean(df['TotalGoals'])\n", "_lambdaAway = np.mean(df['AwayGoals'])\n", "_lambdaHome = np.mean(df['HomeGoals'])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
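<table><thead><tr><th></th><th>Goals</th><th>Poisson Probability - Home</th><th>Poisson Probability - Away</th><th>Poisson Probability - Total</th></tr></thead><tbody>
<tr><th>0</th><td>0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>1</th><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>2</th><td>2</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>3</th><td>3</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>4</th><td>4</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>5</th><td>5</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>6</th><td>6</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>7</th><td>7</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><th>8</th><td>8</td><td>0.0</td><td>0.0</td><td>0.0</td></tr></tbody></table>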
\n", "
" ], "text/plain": [ " Goals Poisson Probability - Home Poisson Probability - Away \\\n", "0 0 0.0 0.0 \n", "1 1 0.0 0.0 \n", "2 2 0.0 0.0 \n", "3 3 0.0 0.0 \n", "4 4 0.0 0.0 \n", "5 5 0.0 0.0 \n", "6 6 0.0 0.0 \n", "7 7 0.0 0.0 \n", "8 8 0.0 0.0 \n", "\n", " Poisson Probability - Total \n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 \n", "5 0.0 \n", "6 0.0 \n", "7 0.0 \n", "8 0.0 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "poisson_pmf = pd.DataFrame({'Goals':range(max(df['TotalGoals'])),\n", " 'Poisson Probability - Home':np.zeros(max(df['TotalGoals'])),\n", " 'Poisson Probability - Away':np.zeros(max(df['TotalGoals'])),\n", " 'Poisson Probability - Total':np.zeros(max(df['TotalGoals'])),\n", " })\n", "poisson_pmf" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
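<table><thead><tr><th></th><th>Goals</th><th>Poisson Probability - Home</th><th>Poisson Probability - Away</th><th>Poisson Probability - Total</th></tr></thead><tbody>
<tr><th>0</th><td>0</td><td>0.208923</td><td>0.300930</td><td>0.062871</td></tr>
<tr><th>1</th><td>1</td><td>0.327129</td><td>0.361380</td><td>0.173944</td></tr>
<tr><th>2</th><td>2</td><td>0.256108</td><td>0.216987</td><td>0.240622</td></tr>
<tr><th>3</th><td>3</td><td>0.133670</td><td>0.086858</td><td>0.221907</td></tr>
<tr><th>4</th><td>4</td><td>0.052325</td><td>0.026076</td><td>0.153486</td></tr>
<tr><th>5</th><td>5</td><td>0.016386</td><td>0.006263</td><td>0.084929</td></tr>
<tr><th>6</th><td>6</td><td>0.004276</td><td>0.001254</td><td>0.039162</td></tr>
<tr><th>7</th><td>7</td><td>0.000957</td><td>0.000215</td><td>0.015478</td></tr>
<tr><th>8</th><td>8</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr></tbody></table>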
\n", "
" ], "text/plain": [ " Goals Poisson Probability - Home Poisson Probability - Away \\\n", "0 0 0.208923 0.300930 \n", "1 1 0.327129 0.361380 \n", "2 2 0.256108 0.216987 \n", "3 3 0.133670 0.086858 \n", "4 4 0.052325 0.026076 \n", "5 5 0.016386 0.006263 \n", "6 6 0.004276 0.001254 \n", "7 7 0.000957 0.000215 \n", "8 8 0.000000 0.000000 \n", "\n", " Poisson Probability - Total \n", "0 0.062871 \n", "1 0.173944 \n", "2 0.240622 \n", "3 0.221907 \n", "4 0.153486 \n", "5 0.084929 \n", "6 0.039162 \n", "7 0.015478 \n", "8 0.000000 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import math\n", "\n", "for i in range(max(poisson_pmf['Goals'])):\n", " poisson_pmf['Poisson Probability - Home'][i] = (math.exp(-_lambdaHome)*_lambdaHome**(i)/math.factorial(i))\n", " poisson_pmf['Poisson Probability - Away'][i] = (math.exp(-_lambdaAway)*_lambdaAway**(i)/math.factorial(i))\n", " poisson_pmf['Poisson Probability - Total'][i] = (math.exp(-_lambdaTotal)*_lambdaTotal**(i)/math.factorial(i))\n", "poisson_pmf" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "poisson_pmf['Observed Probability - Home'] = np.zeros(9)\n", "poisson_pmf['Observed Probability - Away'] = np.zeros(9)\n", "poisson_pmf['Observed Probability - Total'] = np.zeros(9)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
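<table><thead><tr><th></th><th>Goals</th><th>Poisson Probability - Home</th><th>Poisson Probability - Away</th><th>Poisson Probability - Total</th><th>Observed Probability - Home</th><th>Observed Probability - Away</th><th>Observed Probability - Total</th></tr></thead><tbody>
<tr><th>0</th><td>0</td><td>0.208923</td><td>0.300930</td><td>0.062871</td><td>0.228947</td><td>0.338596</td><td>0.071053</td></tr>
<tr><th>1</th><td>1</td><td>0.327129</td><td>0.361380</td><td>0.173944</td><td>0.319298</td><td>0.327193</td><td>0.158772</td></tr>
<tr><th>2</th><td>2</td><td>0.256108</td><td>0.216987</td><td>0.240622</td><td>0.237719</td><td>0.196491</td><td>0.240351</td></tr>
<tr><th>3</th><td>3</td><td>0.133670</td><td>0.086858</td><td>0.221907</td><td>0.122807</td><td>0.086842</td><td>0.219298</td></tr>
<tr><th>4</th><td>4</td><td>0.052325</td><td>0.026076</td><td>0.153486</td><td>0.060526</td><td>0.038596</td><td>0.167544</td></tr>
<tr><th>5</th><td>5</td><td>0.016386</td><td>0.006263</td><td>0.084929</td><td>0.024561</td><td>0.008772</td><td>0.088596</td></tr>
<tr><th>6</th><td>6</td><td>0.004276</td><td>0.001254</td><td>0.039162</td><td>0.005263</td><td>0.002632</td><td>0.034211</td></tr>
<tr><th>7</th><td>7</td><td>0.000957</td><td>0.000215</td><td>0.015478</td><td>0.000877</td><td>0.000877</td><td>0.014035</td></tr>
<tr><th>8</th><td>8</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0.002632</td></tr></tbody></table>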
\n", "
" ], "text/plain": [ " Goals Poisson Probability - Home Poisson Probability - Away \\\n", "0 0 0.208923 0.300930 \n", "1 1 0.327129 0.361380 \n", "2 2 0.256108 0.216987 \n", "3 3 0.133670 0.086858 \n", "4 4 0.052325 0.026076 \n", "5 5 0.016386 0.006263 \n", "6 6 0.004276 0.001254 \n", "7 7 0.000957 0.000215 \n", "8 8 0.000000 0.000000 \n", "\n", " Poisson Probability - Total Observed Probability - Home \\\n", "0 0.062871 0.228947 \n", "1 0.173944 0.319298 \n", "2 0.240622 0.237719 \n", "3 0.221907 0.122807 \n", "4 0.153486 0.060526 \n", "5 0.084929 0.024561 \n", "6 0.039162 0.005263 \n", "7 0.015478 0.000877 \n", "8 0.000000 0.000000 \n", "\n", " Observed Probability - Away Observed Probability - Total \n", "0 0.338596 0.071053 \n", "1 0.327193 0.158772 \n", "2 0.196491 0.240351 \n", "3 0.086842 0.219298 \n", "4 0.038596 0.167544 \n", "5 0.008772 0.088596 \n", "6 0.002632 0.034211 \n", "7 0.000877 0.014035 \n", "8 0.000000 0.002632 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for i in poisson_pmf['Goals']:\n", " poisson_pmf['Observed Probability - Home'][i] = df['HomeGoals'][df['HomeGoals']==i].count()/df['HomeGoals'].count()\n", " poisson_pmf['Observed Probability - Away'][i] = df['AwayGoals'][df['AwayGoals']==i].count()/df['AwayGoals'].count()\n", " poisson_pmf['Observed Probability - Total'][i] = df['TotalGoals'][df['TotalGoals']==i].count()/df['TotalGoals'].count()\n", "poisson_pmf" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, axes = plt.subplots(3, 1, figsize=(15,8))\n", "sns.set_style('darkgrid')\n", "\n", "# One panel per category: observed frequencies against the fitted Poisson pmf\n", "for ax, category in zip(axes, ['Home', 'Away', 'Total']):\n", "    ax.plot(poisson_pmf['Goals'], poisson_pmf['Observed Probability - ' + category], label='Observed Probability', marker='o')\n", "    ax.plot(poisson_pmf['Goals'], poisson_pmf['Poisson Probability - ' + category], label='Poisson Probability', marker='x')\n", "    ax.set_title(category + ' Goals')\n", "    ax.legend(fontsize='large', fancybox=True, facecolor='white')\n", "    ax.set_ylabel('Probability')\n", "    ax.set_xlabel('Number of Goals in a Match')\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the Poisson distribution aligns decently well with the observed distribution of the number of goals scored in a match. That's a good sign!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For further confirmation, let's take a look at the Skellam distribution, which describes the difference between two independent Poisson-distributed variables. In this case, that is the difference between HomeGoals and AwayGoals." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import skellam" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
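<table><thead><tr><th></th><th>Difference</th><th>Skellam</th></tr></thead><tbody>
<tr><th>0</th><td>-10</td><td>NaN</td></tr>
<tr><th>1</th><td>-9</td><td>NaN</td></tr>
<tr><th>2</th><td>-8</td><td>NaN</td></tr>
<tr><th>3</th><td>-7</td><td>NaN</td></tr>
<tr><th>4</th><td>-6</td><td>NaN</td></tr></tbody></table>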
\n", "
" ], "text/plain": [ " Difference Skellam\n", "0 -10 NaN\n", "1 -9 NaN\n", "2 -8 NaN\n", "3 -7 NaN\n", "4 -6 NaN" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set up a dataframe with values of \"Difference\", the difference of goals between the home team and away team,\n", "# from -10 to 10, just to be sure we encompass all possible values\n", "\n", "values = []\n", "values.extend(range(-10,11))\n", "skellamdf = pd.DataFrame(columns=[\"Difference\",\"Skellam\"])\n", "skellamdf.Difference = values\n", "skellamdf.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Skellam pmf of (HomeGoals - AwayGoals), using the two observed means as the Poisson rates.\n", "# The columns are selected by name rather than by position in df.mean().\n", "home_mean = df['HomeGoals'].mean()\n", "away_mean = df['AwayGoals'].mean()\n", "skellamdf['Skellam'] = [skellam.pmf(i, home_mean, away_mean) for i in skellamdf['Difference']]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
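<table><thead><tr><th></th><th>HomeTeam</th><th>AwayTeam</th><th>HomeGoals</th><th>AwayGoals</th><th>TotalGoals</th><th>Difference</th></tr></thead><tbody>
<tr><th>0</th><td>Man United</td><td>Leicester</td><td>2</td><td>1</td><td>3</td><td>1</td></tr>
<tr><th>1</th><td>Bournemouth</td><td>Cardiff</td><td>2</td><td>0</td><td>2</td><td>2</td></tr>
<tr><th>2</th><td>Fulham</td><td>Crystal Palace</td><td>0</td><td>2</td><td>2</td><td>-2</td></tr>
<tr><th>3</th><td>Huddersfield</td><td>Chelsea</td><td>0</td><td>3</td><td>3</td><td>-3</td></tr>
<tr><th>4</th><td>Newcastle</td><td>Tottenham</td><td>1</td><td>2</td><td>3</td><td>-1</td></tr></tbody></table>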
\n", "
" ], "text/plain": [ " HomeTeam AwayTeam HomeGoals AwayGoals TotalGoals Difference\n", "0 Man United Leicester 2 1 3 1\n", "1 Bournemouth Cardiff 2 0 2 2\n", "2 Fulham Crystal Palace 0 2 2 -2\n", "3 Huddersfield Chelsea 0 3 3 -3\n", "4 Newcastle Tottenham 1 2 3 -1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Difference'] = df['HomeGoals'] - df['AwayGoals']\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Create a distribution\n", "\n", "differenceDistribution = []\n", "\n", "for i in skellamdf['Difference']:\n", " differenceDistribution.append(df['Difference'][df['Difference']==i].count()/df['Difference'].count())" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "skellamdf['Observed'] = differenceDistribution" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Probability of Goal Difference per Match')" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Skellam vs Actual\n", "\n", "plt.figure(figsize=(15,8))\n", "sns.set_style('darkgrid')\n", "sns.barplot(data=skellamdf,x=skellamdf['Difference'], y='Observed', palette='coolwarm', label = 'Observed')\n", "sns.lineplot(data=skellamdf,x=skellamdf['Difference'], y='Skellam', marker='o', color='green', label = 'Skellam')\n", "plt.legend(fontsize='large', fancybox=True,facecolor='white')\n", "plt.title('Probability of Goal Difference per Match')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we see a pretty good match here! The Skellam distribution mimics the observed distribution of the difference in goals between the home team and the away team. We can see that it is almost impossible for the home team to win by more than 6 goals or lose by more than 5 goals. The single most likely difference is 0, indicating a draw, but the distribution leans toward positive differences, so a home win is more likely than an away win, which is consistent with what we have seen previously." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating the Poisson Distribution Model" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Generalized Linear Model Regression Results
Dep. Variable: goals No. Observations: 2280
Model: GLM Df Residuals: 2228
Model Family: Poisson Df Model: 51
Link Function: log Scale: 1.0000
Method: IRLS Log-Likelihood: -3229.2
Date: Tue, 04 Feb 2020 Deviance: 2433.8
Time: 19:38:27 Pearson chi2: 2.10e+03
No. Iterations: 5
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err z P>|z| [0.025 0.975]
Intercept 0.4798 0.110 4.364 0.000 0.264 0.695
team[T.Bournemouth] -0.3454 0.104 -3.306 0.001 -0.550 -0.141
team[T.Brighton] -0.7580 0.138 -5.496 0.000 -1.028 -0.488
team[T.Burnley] -0.6206 0.113 -5.478 0.000 -0.843 -0.399
team[T.Cardiff] -0.7688 0.184 -4.168 0.000 -1.130 -0.407
team[T.Chelsea] -0.0770 0.096 -0.800 0.424 -0.266 0.112
team[T.Crystal Palace] -0.4210 0.107 -3.951 0.000 -0.630 -0.212
team[T.Everton] -0.3369 0.104 -3.249 0.001 -0.540 -0.134
team[T.Fulham] -0.7568 0.184 -4.102 0.000 -1.118 -0.395
team[T.Huddersfield] -1.0707 0.157 -6.836 0.000 -1.378 -0.764
team[T.Hull] -0.6882 0.178 -3.869 0.000 -1.037 -0.340
team[T.Leicester] -0.3610 0.105 -3.449 0.001 -0.566 -0.156
team[T.Liverpool] 0.0992 0.092 1.077 0.281 -0.081 0.280
team[T.Man City] 0.2080 0.090 2.318 0.020 0.032 0.384
team[T.Man United] -0.1929 0.099 -1.945 0.052 -0.387 0.002
team[T.Middlesbrough] -1.0307 0.204 -5.050 0.000 -1.431 -0.631
team[T.Newcastle] -0.6068 0.130 -4.671 0.000 -0.861 -0.352
team[T.Southampton] -0.5935 0.112 -5.281 0.000 -0.814 -0.373
team[T.Stoke] -0.6638 0.133 -4.991 0.000 -0.925 -0.403
team[T.Sunderland] -0.9434 0.198 -4.771 0.000 -1.331 -0.556
team[T.Swansea] -0.7032 0.135 -5.208 0.000 -0.968 -0.439
team[T.Tottenham] -0.0019 0.094 -0.021 0.984 -0.187 0.183
team[T.Watford] -0.4854 0.109 -4.458 0.000 -0.699 -0.272
team[T.West Brom] -0.6990 0.134 -5.204 0.000 -0.962 -0.436
team[T.West Ham] -0.4087 0.106 -3.844 0.000 -0.617 -0.200
team[T.Wolves] -0.4672 0.161 -2.903 0.004 -0.783 -0.152
opponent[T.Bournemouth] 0.2826 0.109 2.586 0.010 0.068 0.497
opponent[T.Brighton] 0.1149 0.125 0.918 0.359 -0.130 0.360
opponent[T.Burnley] 0.0692 0.114 0.606 0.545 -0.155 0.293
opponent[T.Cardiff] 0.3037 0.146 2.074 0.038 0.017 0.591
opponent[T.Chelsea] -0.2888 0.126 -2.284 0.022 -0.537 -0.041
opponent[T.Crystal Palace] 0.1321 0.113 1.171 0.242 -0.089 0.353
opponent[T.Everton] -0.0080 0.117 -0.069 0.945 -0.237 0.221
opponent[T.Fulham] 0.4645 0.139 3.343 0.001 0.192 0.737
opponent[T.Huddersfield] 0.2674 0.120 2.231 0.026 0.033 0.502
opponent[T.Hull] 0.4663 0.139 3.344 0.001 0.193 0.740
opponent[T.Leicester] 0.1351 0.113 1.197 0.231 -0.086 0.356
opponent[T.Liverpool] -0.3506 0.129 -2.713 0.007 -0.604 -0.097
opponent[T.Man City] -0.4771 0.135 -3.542 0.000 -0.741 -0.213
opponent[T.Man United] -0.2874 0.126 -2.279 0.023 -0.535 -0.040
opponent[T.Middlesbrough] 0.0438 0.161 0.273 0.785 -0.271 0.359
opponent[T.Newcastle] -0.0618 0.132 -0.468 0.640 -0.321 0.197
opponent[T.Southampton] 0.1126 0.113 0.995 0.320 -0.109 0.334
opponent[T.Stoke] 0.2083 0.122 1.703 0.089 -0.031 0.448
opponent[T.Sunderland] 0.3100 0.146 2.117 0.034 0.023 0.597
opponent[T.Swansea] 0.2229 0.122 1.829 0.067 -0.016 0.462
opponent[T.Tottenham] -0.3686 0.130 -2.844 0.004 -0.623 -0.115
opponent[T.Watford] 0.2397 0.110 2.177 0.030 0.024 0.456
opponent[T.West Brom] 0.0596 0.127 0.467 0.640 -0.190 0.309
opponent[T.West Ham] 0.2222 0.111 2.008 0.045 0.005 0.439
opponent[T.Wolves] -0.0898 0.169 -0.530 0.596 -0.422 0.242
home 0.2653 0.036 7.386 0.000 0.195 0.336
" ], "text/plain": [ "\n", "\"\"\"\n", " Generalized Linear Model Regression Results \n", "==============================================================================\n", "Dep. Variable: goals No. Observations: 2280\n", "Model: GLM Df Residuals: 2228\n", "Model Family: Poisson Df Model: 51\n", "Link Function: log Scale: 1.0000\n", "Method: IRLS Log-Likelihood: -3229.2\n", "Date: Tue, 04 Feb 2020 Deviance: 2433.8\n", "Time: 19:38:27 Pearson chi2: 2.10e+03\n", "No. Iterations: 5 \n", "Covariance Type: nonrobust \n", "==============================================================================================\n", " coef std err z P>|z| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------\n", "Intercept 0.4798 0.110 4.364 0.000 0.264 0.695\n", "team[T.Bournemouth] -0.3454 0.104 -3.306 0.001 -0.550 -0.141\n", "team[T.Brighton] -0.7580 0.138 -5.496 0.000 -1.028 -0.488\n", "team[T.Burnley] -0.6206 0.113 -5.478 0.000 -0.843 -0.399\n", "team[T.Cardiff] -0.7688 0.184 -4.168 0.000 -1.130 -0.407\n", "team[T.Chelsea] -0.0770 0.096 -0.800 0.424 -0.266 0.112\n", "team[T.Crystal Palace] -0.4210 0.107 -3.951 0.000 -0.630 -0.212\n", "team[T.Everton] -0.3369 0.104 -3.249 0.001 -0.540 -0.134\n", "team[T.Fulham] -0.7568 0.184 -4.102 0.000 -1.118 -0.395\n", "team[T.Huddersfield] -1.0707 0.157 -6.836 0.000 -1.378 -0.764\n", "team[T.Hull] -0.6882 0.178 -3.869 0.000 -1.037 -0.340\n", "team[T.Leicester] -0.3610 0.105 -3.449 0.001 -0.566 -0.156\n", "team[T.Liverpool] 0.0992 0.092 1.077 0.281 -0.081 0.280\n", "team[T.Man City] 0.2080 0.090 2.318 0.020 0.032 0.384\n", "team[T.Man United] -0.1929 0.099 -1.945 0.052 -0.387 0.002\n", "team[T.Middlesbrough] -1.0307 0.204 -5.050 0.000 -1.431 -0.631\n", "team[T.Newcastle] -0.6068 0.130 -4.671 0.000 -0.861 -0.352\n", "team[T.Southampton] -0.5935 0.112 -5.281 0.000 -0.814 -0.373\n", "team[T.Stoke] -0.6638 0.133 -4.991 0.000 -0.925 -0.403\n", "team[T.Sunderland] -0.9434 0.198 -4.771 
0.000 -1.331 -0.556\n", "team[T.Swansea] -0.7032 0.135 -5.208 0.000 -0.968 -0.439\n", "team[T.Tottenham] -0.0019 0.094 -0.021 0.984 -0.187 0.183\n", "team[T.Watford] -0.4854 0.109 -4.458 0.000 -0.699 -0.272\n", "team[T.West Brom] -0.6990 0.134 -5.204 0.000 -0.962 -0.436\n", "team[T.West Ham] -0.4087 0.106 -3.844 0.000 -0.617 -0.200\n", "team[T.Wolves] -0.4672 0.161 -2.903 0.004 -0.783 -0.152\n", "opponent[T.Bournemouth] 0.2826 0.109 2.586 0.010 0.068 0.497\n", "opponent[T.Brighton] 0.1149 0.125 0.918 0.359 -0.130 0.360\n", "opponent[T.Burnley] 0.0692 0.114 0.606 0.545 -0.155 0.293\n", "opponent[T.Cardiff] 0.3037 0.146 2.074 0.038 0.017 0.591\n", "opponent[T.Chelsea] -0.2888 0.126 -2.284 0.022 -0.537 -0.041\n", "opponent[T.Crystal Palace] 0.1321 0.113 1.171 0.242 -0.089 0.353\n", "opponent[T.Everton] -0.0080 0.117 -0.069 0.945 -0.237 0.221\n", "opponent[T.Fulham] 0.4645 0.139 3.343 0.001 0.192 0.737\n", "opponent[T.Huddersfield] 0.2674 0.120 2.231 0.026 0.033 0.502\n", "opponent[T.Hull] 0.4663 0.139 3.344 0.001 0.193 0.740\n", "opponent[T.Leicester] 0.1351 0.113 1.197 0.231 -0.086 0.356\n", "opponent[T.Liverpool] -0.3506 0.129 -2.713 0.007 -0.604 -0.097\n", "opponent[T.Man City] -0.4771 0.135 -3.542 0.000 -0.741 -0.213\n", "opponent[T.Man United] -0.2874 0.126 -2.279 0.023 -0.535 -0.040\n", "opponent[T.Middlesbrough] 0.0438 0.161 0.273 0.785 -0.271 0.359\n", "opponent[T.Newcastle] -0.0618 0.132 -0.468 0.640 -0.321 0.197\n", "opponent[T.Southampton] 0.1126 0.113 0.995 0.320 -0.109 0.334\n", "opponent[T.Stoke] 0.2083 0.122 1.703 0.089 -0.031 0.448\n", "opponent[T.Sunderland] 0.3100 0.146 2.117 0.034 0.023 0.597\n", "opponent[T.Swansea] 0.2229 0.122 1.829 0.067 -0.016 0.462\n", "opponent[T.Tottenham] -0.3686 0.130 -2.844 0.004 -0.623 -0.115\n", "opponent[T.Watford] 0.2397 0.110 2.177 0.030 0.024 0.456\n", "opponent[T.West Brom] 0.0596 0.127 0.467 0.640 -0.190 0.309\n", "opponent[T.West Ham] 0.2222 0.111 2.008 0.045 0.005 0.439\n", "opponent[T.Wolves] -0.0898 0.169 
-0.530 0.596 -0.422 0.242\n", "home 0.2653 0.036 7.386 0.000 0.195 0.336\n", "==============================================================================================\n", "\"\"\"" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "goal_model_data = pd.concat([df[['HomeTeam','AwayTeam','HomeGoals']].assign(home=1).rename(\n", " columns={'HomeTeam':'team', 'AwayTeam':'opponent','HomeGoals':'goals'}),\n", " df[['AwayTeam','HomeTeam','AwayGoals']].assign(home=0).rename(\n", " columns={'AwayTeam':'team', 'HomeTeam':'opponent','AwayGoals':'goals'})])\n", "\n", "poisson_model = smf.glm(formula=\"goals ~ home + team + opponent\", data=goal_model_data, \n", " family=sm.families.Poisson()).fit()\n", "poisson_model.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This produces an output similar to a regression table. Because the model uses a log link, each \"coef\" acts multiplicatively on expected goals: a positive \"coef\" means more expected goals, a negative \"coef\" means fewer, and values close to 0 mean little difference from the baseline. The baseline is the team left out of the table, Arsenal, whose effect is absorbed into the intercept.\n", "\n", "For example, Man City have a positive team coefficient of 0.2080, meaning they are expected to score about exp(0.2080) ≈ 1.23 times as many goals as Arsenal, all else being equal. Man City's **opponent value** is negative at -0.4771. The opponent value adjusts a team's expected goals for the defensive strength of the opposition: a negative coefficient means it is harder to score against that opponent than against the baseline, while a positive coefficient means it is easier. The \"home\" coefficient of 0.2653 means a team is expected to score about exp(0.2653) ≈ 1.30 times as many goals at home as away.\n", "\n", "Now, let's try predicting a match between Chelsea and Man City. To do so, I will create the predict_match function.\n", "\n", "This function takes 4 arguments:\n", "\n", "1) foot_model: The fitted statistical model used to predict expected goals. 
In this case, we'll use 'poisson_model'.\n", "\n", "2) homeTeam: A string containing the home team's name.\n", "\n", "3) awayTeam: A string containing the away team's name.\n", "\n", "4) max_goals: The maximum number of goals considered for each team (not the combined total). Scorelines beyond this are ignored. The default is 10." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import poisson\n", "\n", "def predict_match(foot_model, homeTeam, awayTeam, max_goals=10):\n", "    # Expected goals for each side, predicted by the fitted model\n", "    home_goals_avg = foot_model.predict(pd.DataFrame(data={'team': homeTeam,\n", "                                                           'opponent': awayTeam, 'home': 1},\n", "                                                     index=[1])).values[0]\n", "    away_goals_avg = foot_model.predict(pd.DataFrame(data={'team': awayTeam,\n", "                                                           'opponent': homeTeam, 'home': 0},\n", "                                                     index=[1])).values[0]\n", "    # Poisson probability of each side scoring 0..max_goals goals\n", "    team_pred = [[poisson.pmf(i, team_avg) for i in range(0, max_goals+1)]\n", "                 for team_avg in [home_goals_avg, away_goals_avg]]\n", "    # Rows index home goals, columns index away goals\n", "    return np.outer(np.array(team_pred[0]), np.array(team_pred[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's simulate that match between Chelsea and Man City:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.06715664, 0.10008115, 0.07457369, 0.03704484],\n", " [0.08129063, 0.12114453, 0.0902687 , 0.04484141],\n", " [0.04919965, 0.0733205 , 0.05463346, 0.02713944],\n", " [0.01985146, 0.02958392, 0.02204393, 0.01095043]])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chelsea_city = predict_match(poisson_model, 'Chelsea', 'Man City',max_goals=3)\n", "chelsea_city" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This array gives the probability of each scoreline. Chelsea's goals are indexed by rows, and Man City's goals by columns. 
For example:\n", "\n", "- The entry at [0,0], 0.06715664, is the probability of Chelsea scoring 0 goals and Man City scoring 0 goals.\n", "\n", "- The entry at [0,1], 0.10008115, is the probability of Chelsea scoring 0 goals and Man City scoring 1 goal.\n", "\n", "- The entry at [2,1], 0.0733205, is the probability of Chelsea scoring 2 goals and Man City scoring 1 goal.\n", "\n", "... And so on\n", "\n", "Now, let's calculate the probability of Chelsea winning. Chelsea win whenever their goals (row index) exceed Man City's (column index), so that probability is the sum of the entries strictly below the diagonal (the lower triangle). The sum of the entries strictly above the diagonal (the upper triangle) is the probability of a Man City win, and the sum of the diagonal entries is the probability of a draw." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Use the default max_goals=10 so that the grid covers almost all possible scorelines\n", "chelsea_city = predict_match(poisson_model, 'Chelsea', 'Man City')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.999999414005042" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(chelsea_city)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Based off of our data, in a match of Chelsea vs. Man City at home:\n", "\n", "The probability of Chelsea winning is 0.308133\n", "The probability of Man City winning is 0.436653\n", "The probability of a draw is 0.255213\n" ] } ], "source": [ "chelsea_prob = np.sum(np.tril(chelsea_city,-1))\n", "\n", "city_prob = np.sum(np.triu(chelsea_city,1))\n", "\n", "draw_prob = np.sum(np.diag(chelsea_city))\n", "\n", "print('''Based off of our data, in a match of Chelsea vs. 
Man City at home:\\n\\nThe probability of Chelsea winning is %f\n", "The probability of Man City winning is %f\\nThe probability of a draw is %f''' %(chelsea_prob,city_prob,draw_prob))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the actual result of the game in the 2018/19 season was..." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chelsea 2 - 0 Man City\n" ] } ], "source": [ "# .iloc[0] extracts the single matching value so that %d receives a scalar\n", "chelsea_goals = epl_1819[(epl_1819['HomeTeam']=='Chelsea')&(epl_1819['AwayTeam']=='Man City')]['FTHG'].iloc[0]\n", "\n", "city_goals = epl_1819[(epl_1819['HomeTeam']=='Chelsea')&(epl_1819['AwayTeam']=='Man City')]['FTAG'].iloc[0]\n", "\n", "print(\"Chelsea %d - %d Man City\" %(chelsea_goals,city_goals))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the actual result went Chelsea's way rather than the Man City win the model considered most likely. Football is full of surprises and upsets. Still, the probabilities make sense: in the 2018-19 season, Man City won the league with 98 points, while Chelsea finished 3rd with 72 points, so Man City were clearly the stronger team over the season, which is what the probabilities are telling us too!\n", "\n", "A single match can't validate the model, but we can sanity-check these probabilities against a betting company's odds, which we have handy!" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Chelsea Win</th><th>Draw</th><th>Man City Win</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>154</th><td>4.0</td><td>3.8</td><td>1.95</td></tr>\n",
"</tbody>\n",
"</table>" ], "text/plain": [ " Chelsea Win Draw Man City Win\n", "154 4.0 3.8 1.95" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bet365odds = epl_1819[(epl_1819['HomeTeam']=='Chelsea')&(epl_1819['AwayTeam']=='Man City')][['B365H','B365D','B365A']]\n", "bet365odds = bet365odds.rename(columns={'B365H':'Chelsea Win','B365D':'Draw','B365A':'Man City Win'})\n", "bet365odds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's convert these odds into probabilities using the following formula:\n", "\n", "$$\\text{Probability} = \\frac{1}{\\text{Decimal Odds}}$$\n", "\n", "Because the bookmaker's margin is built into the odds, probabilities obtained this way sum to slightly more than 1." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Chelsea Win</th><th>Draw</th><th>Man City Win</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>154</th><td>0.25</td><td>0.263158</td><td>0.512821</td></tr>\n",
"</tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " Chelsea Win Draw Man City Win\n", "154 0.25 0.263158 0.512821" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bet365odds['Chelsea Win'] = 1/bet365odds['Chelsea Win']\n", "bet365odds['Man City Win'] = 1/bet365odds['Man City Win']\n", "bet365odds['Draw'] = 1/bet365odds['Draw']\n", "bet365odds" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Chelsea Win</th><th>Draw</th><th>Man City Win</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0.308133</td><td>0.255213</td><td>0.436653</td></tr>\n",
"</tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " Chelsea Win Draw Man City Win\n", "0 0.308133 0.255213 0.436653" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_prob = pd.DataFrame.from_dict({'Chelsea Win': [chelsea_prob], 'Draw': [draw_prob], 'Man City Win': [city_prob]})\n", "model_prob" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Chelsea Win</th><th>Draw</th><th>Man City Win</th></tr>\n",
"<tr><th>Source</th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Bet 365</th><td>0.250000</td><td>0.263158</td><td>0.512821</td></tr>\n",
"<tr><th>Our Model</th><td>0.308133</td><td>0.255213</td><td>0.436653</td></tr>\n",
"</tbody>\n",
"</table>" ], "text/plain": [ " Chelsea Win Draw Man City Win\n", "Source \n", "Bet 365 0.250000 0.263158 0.512821\n", "Our Model 0.308133 0.255213 0.436653" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pd.concat replaces DataFrame.append, which was removed in pandas 2.0\n", "bet365odds = pd.concat([bet365odds, model_prob]).reset_index().rename(columns={'index':'Source'})\n", "bet365odds['Source'] = ['Bet 365','Our Model']\n", "bet365odds = bet365odds.set_index('Source')\n", "bet365odds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The differences between our probabilities and Bet 365's (ours minus theirs) are as follows:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Chelsea Win 0.058133\n", "Draw -0.007945\n", "Man City Win -0.076168\n", "Name: Our Model, dtype: float64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bet365odds.diff(axis=0).loc['Our Model']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluating Our Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, we weren't too far off the bookmaker's numbers. Relative to Bet 365, our model gave Man City a win probability about 7.6 percentage points lower and Chelsea one about 5.8 points higher, and that's fine! One plausible reason is that Man City became a progressively better team over these seasons thanks to extensive purchases of great players, a great manager, increased team chemistry, and more, while our model treats all three seasons equally. Weighting the data towards more recent results would likely bring our estimate closer to the bookmaker's. This suggests that sometimes less data might indeed be better, because so much changes from season to season within the sport.\n", "\n", "Moreover, as previously mentioned, the Poisson Distribution may oversimplify things. There are plenty of factors that could be taken into consideration, such as:\n", "\n", "1) The expected starting lineups of each side (i.e. Are certain players injured? Will the manager decide to start his best team?)\n", "\n", "2) Form (i.e. 
Assigning more weight to more recent results)\n", "\n", "3) Motivation (i.e. How far into the season are we? What does the team have to play for? Are they competing for the title, or will a win make no difference whatsoever?)\n", "\n", "4) Historical results between the two teams (Even if Man City are the better team now, do Chelsea have a good track record against them? Maybe Chelsea have a playing style that counters Man City's very well)\n", "\n", "5) Recency weighting. The model currently has no reasonable way to weight data towards more recent results. We could potentially use a weighted mean with higher weights on more recent results, but it wouldn't tell the whole story.\n", "\n", "And many more!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }