{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Defining the Question" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### a) Specifying the Question\n", "As the football analyst for Mchezopesa Ltd, I have been tasked with creating a model that predicts the outcome of a football match between national teams." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b) Defining the Metric for Success\n", "i. Create a model that predicts whether the home team will win, lose, or draw in a football match.\n", "\n", "ii. Create a model that predicts the number of goals that the home team will score.\n", "\n", "iii. Create a model that predicts the number of goals that the away team will score in a given match." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c) Understanding the context \n", "The term ‘odds’ is commonly used in betting and often refers to the probability of an event occurring. In a football match, the bookmaker assigns odds to each result, i.e. a win, loss, or draw (relative to the home team), based on the true likelihood of that result while also factoring in the team’s form, team statistics, historical precedents, expert opinion, and team motivation, among other factors surrounding each match. To make a profit, bookmakers then adjust the odds in their own favour before offering the bet to punters. While factors such as expert opinion and team motivation are hard to measure, team statistics such as wins, losses, goals scored, goals conceded, and team ranks are recorded and can be used to predict the outcome of matches. \n", "\n", "FIFA keeps good data on the different matches, and it also has a ranking system that is used to measure the performance of national teams over time. The FIFA ranking system is updated periodically to ensure that the team rankings are reflective of team performances. 
The latest review of the ranking system was done in 2018, replacing a system that had been in place since 2006. More information about the FIFA ranking system can be found [here](https://en.wikipedia.org/wiki/FIFA_World_Rankings). Aside from the FIFA website, bookmakers also source team information from team news releases and professional contacts within different national teams. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### d) Recording the Experimental Design\n", "To predict the match outcome, I am tasked with creating a logistic regression model. \n", "To predict the match scores, I am tasked with creating a polynomial regression model.\n", "\n", "To improve model performance, I will perform feature engineering and parameter tuning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### e) Data Relevance\n", "Two datasets were provided for this project. The first data set contains different football matches played by national teams across different tournaments between 1872 and 2019. This data set includes the home team, away team, match scores, and the country and city where the match was played, along with whether or not the playing ground was neutral. The second data set contains the national team ranks of different countries in the world. The ranking data set is updated monthly depending on the performance of the different teams in their respective matches.\n", "\n", "To predict the outcome of a match, it is important to factor in team performance, which is reflected in the ranking, while also considering previous results, which are reflected in the first data set." 
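, "\n", "\n", "As a rough illustration of how the two data sets relate, the sketch below pairs one match with the ranks of both teams for the month the match was played and labels the result relative to the home team. It is a minimal pure-Python sketch and not the notebook's own implementation: the names `outcome_label`, `rank_at`, `match_features`, and `away_team_rank` are hypothetical, the ranks are made-up values, and the 2/1/0 encoding for win/draw/loss is assumed.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the ranking data set:
# (country, year, month) -> rank. Values used with it are illustrative only.
RankKey = Tuple[str, int, int]


def outcome_label(home_score: int, away_score: int) -> int:
    # Encode the result relative to the home team:
    # 2 = win, 1 = draw, 0 = loss (assumed encoding).
    if home_score > away_score:
        return 2
    if home_score == away_score:
        return 1
    return 0


def rank_at(ranks: Dict[RankKey, int], team: str, year: int, month: int) -> Optional[int]:
    # Look up the rank published for a team in a given year and month.
    # Returns None when no ranking exists for that month.
    return ranks.get((team, year, month))


def match_features(ranks, home, away, year, month, home_score, away_score):
    # Combine one match with both teams' ranks for the match month.
    home_rank = rank_at(ranks, home, year, month)
    away_rank = rank_at(ranks, away, year, month)
    if home_rank is None or away_rank is None:
        # Behaves like an inner join: matches without a ranking are dropped.
        return None
    return {
        'home_team_rank': home_rank,
        'away_team_rank': away_rank,
        'outcome': outcome_label(home_score, away_score),
    }


# Made-up ranks for two placeholder teams.
example_ranks = {('Team A', 2010, 6): 5, ('Team B', 2010, 6): 40}
print(match_features(example_ranks, 'Team A', 'Team B', 2010, 6, 1, 2))
```

Returning `None` for a team with no ranking in that month is a design choice that keeps only matches where both sources agree on team name, year, and month."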
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pip install -U pandas-profiling" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#Importing libraries\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sb\n", "import pandas_profiling\n", "sb.set_style()\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Loading and Checking the data sets" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>rank</th><th>country_full</th><th>country_abrv</th><th>total_points</th><th>previous_points</th><th>rank_change</th><th>cur_year_avg</th><th>cur_year_avg_weighted</th><th>last_year_avg</th><th>last_year_avg_weighted</th><th>two_year_ago_avg</th><th>two_year_ago_weighted</th><th>three_year_ago_avg</th><th>three_year_ago_weighted</th><th>confederation</th><th>rank_date</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>Germany</td><td>GER</td><td>0.0</td><td>57</td><td>0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>UEFA</td><td>1993-08-08</td></tr>\n",
"<tr><th>1</th><td>2</td><td>Italy</td><td>ITA</td><td>0.0</td><td>57</td><td>0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>UEFA</td><td>1993-08-08</td></tr>\n",
"<tr><th>2</th><td>3</td><td>Switzerland</td><td>SUI</td><td>0.0</td><td>50</td><td>9</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>UEFA</td><td>1993-08-08</td></tr>\n",
"<tr><th>3</th><td>4</td><td>Sweden</td><td>SWE</td><td>0.0</td><td>55</td><td>0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>UEFA</td><td>1993-08-08</td></tr>\n",
"<tr><th>4</th><td>5</td><td>Argentina</td><td>ARG</td><td>0.0</td><td>51</td><td>5</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CONMEBOL</td><td>1993-08-08</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " rank country_full country_abrv total_points previous_points rank_change \\\n", "0 1 Germany GER 0.0 57 0 \n", "1 2 Italy ITA 0.0 57 0 \n", "2 3 Switzerland SUI 0.0 50 9 \n", "3 4 Sweden SWE 0.0 55 0 \n", "4 5 Argentina ARG 0.0 51 5 \n", "\n", " cur_year_avg cur_year_avg_weighted last_year_avg last_year_avg_weighted \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "\n", " two_year_ago_avg two_year_ago_weighted three_year_ago_avg \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " three_year_ago_weighted confederation rank_date \n", "0 0.0 UEFA 1993-08-08 \n", "1 0.0 UEFA 1993-08-08 \n", "2 0.0 UEFA 1993-08-08 \n", "3 0.0 UEFA 1993-08-08 \n", "4 0.0 CONMEBOL 1993-08-08 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# previewing top of fifa ranking data set\n", "fifa_ranking = pd.read_csv('/home/practitioner/Downloads/fifa_ranking.csv')\n", "fifa_ranking.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>rank</th><th>country_full</th><th>country_abrv</th><th>total_points</th><th>previous_points</th><th>rank_change</th><th>cur_year_avg</th><th>cur_year_avg_weighted</th><th>last_year_avg</th><th>last_year_avg_weighted</th><th>two_year_ago_avg</th><th>two_year_ago_weighted</th><th>three_year_ago_avg</th><th>three_year_ago_weighted</th><th>confederation</th><th>rank_date</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>57788</th><td>206</td><td>Anguilla</td><td>AIA</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CONCACAF</td><td>2018-06-07</td></tr>\n",
"<tr><th>57789</th><td>206</td><td>Bahamas</td><td>BAH</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CONCACAF</td><td>2018-06-07</td></tr>\n",
"<tr><th>57790</th><td>206</td><td>Eritrea</td><td>ERI</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CAF</td><td>2018-06-07</td></tr>\n",
"<tr><th>57791</th><td>206</td><td>Somalia</td><td>SOM</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CAF</td><td>2018-06-07</td></tr>\n",
"<tr><th>57792</th><td>206</td><td>Tonga</td><td>TGA</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>OFC</td><td>2018-06-07</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " rank country_full country_abrv total_points previous_points \\\n", "57788 206 Anguilla AIA 0.0 0 \n", "57789 206 Bahamas BAH 0.0 0 \n", "57790 206 Eritrea ERI 0.0 0 \n", "57791 206 Somalia SOM 0.0 0 \n", "57792 206 Tonga TGA 0.0 0 \n", "\n", " rank_change cur_year_avg cur_year_avg_weighted last_year_avg \\\n", "57788 1 0.0 0.0 0.0 \n", "57789 1 0.0 0.0 0.0 \n", "57790 1 0.0 0.0 0.0 \n", "57791 1 0.0 0.0 0.0 \n", "57792 1 0.0 0.0 0.0 \n", "\n", " last_year_avg_weighted two_year_ago_avg two_year_ago_weighted \\\n", "57788 0.0 0.0 0.0 \n", "57789 0.0 0.0 0.0 \n", "57790 0.0 0.0 0.0 \n", "57791 0.0 0.0 0.0 \n", "57792 0.0 0.0 0.0 \n", "\n", " three_year_ago_avg three_year_ago_weighted confederation rank_date \n", "57788 0.0 0.0 CONCACAF 2018-06-07 \n", "57789 0.0 0.0 CONCACAF 2018-06-07 \n", "57790 0.0 0.0 CAF 2018-06-07 \n", "57791 0.0 0.0 CAF 2018-06-07 \n", "57792 0.0 0.0 OFC 2018-06-07 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# preview last five rows\n", "fifa_ranking.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick overview of the data indicates that the ranks of the different countries span August 1993 to June 2018. According to FIFA, the ranking system was last revised in 2018; before that, the previous system was in place from 2006 to 2018. For consistency, I will only rely on data that spans 2006 to 2018 to predict the outcome of the 2018 World Cup matches." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>date</th><th>home_team</th><th>away_team</th><th>home_score</th><th>away_score</th><th>tournament</th><th>city</th><th>country</th><th>neutral</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1872-11-30</td><td>Scotland</td><td>England</td><td>0</td><td>0</td><td>Friendly</td><td>Glasgow</td><td>Scotland</td><td>False</td></tr>\n",
"<tr><th>1</th><td>1873-03-08</td><td>England</td><td>Scotland</td><td>4</td><td>2</td><td>Friendly</td><td>London</td><td>England</td><td>False</td></tr>\n",
"<tr><th>2</th><td>1874-03-07</td><td>Scotland</td><td>England</td><td>2</td><td>1</td><td>Friendly</td><td>Glasgow</td><td>Scotland</td><td>False</td></tr>\n",
"<tr><th>3</th><td>1875-03-06</td><td>England</td><td>Scotland</td><td>2</td><td>2</td><td>Friendly</td><td>London</td><td>England</td><td>False</td></tr>\n",
"<tr><th>4</th><td>1876-03-04</td><td>Scotland</td><td>England</td><td>3</td><td>0</td><td>Friendly</td><td>Glasgow</td><td>Scotland</td><td>False</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " date home_team away_team home_score away_score tournament city \\\n", "0 1872-11-30 Scotland England 0 0 Friendly Glasgow \n", "1 1873-03-08 England Scotland 4 2 Friendly London \n", "2 1874-03-07 Scotland England 2 1 Friendly Glasgow \n", "3 1875-03-06 England Scotland 2 2 Friendly London \n", "4 1876-03-04 Scotland England 3 0 Friendly Glasgow \n", "\n", " country neutral \n", "0 Scotland False \n", "1 England False \n", "2 Scotland False \n", "3 England False \n", "4 Scotland False " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# previewing top of match results data set\n", "match_results = pd.read_csv('/home/practitioner/Downloads/results.csv')\n", "match_results.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>date</th><th>home_team</th><th>away_team</th><th>home_score</th><th>away_score</th><th>tournament</th><th>city</th><th>country</th><th>neutral</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>40834</th><td>2019-07-18</td><td>American Samoa</td><td>Tahiti</td><td>8</td><td>1</td><td>Pacific Games</td><td>Apia</td><td>Samoa</td><td>True</td></tr>\n",
"<tr><th>40835</th><td>2019-07-18</td><td>Fiji</td><td>Solomon Islands</td><td>4</td><td>4</td><td>Pacific Games</td><td>Apia</td><td>Samoa</td><td>True</td></tr>\n",
"<tr><th>40836</th><td>2019-07-19</td><td>Senegal</td><td>Algeria</td><td>0</td><td>1</td><td>African Cup of Nations</td><td>Cairo</td><td>Egypt</td><td>True</td></tr>\n",
"<tr><th>40837</th><td>2019-07-19</td><td>Tajikistan</td><td>North Korea</td><td>0</td><td>1</td><td>Intercontinental Cup</td><td>Ahmedabad</td><td>India</td><td>True</td></tr>\n",
"<tr><th>40838</th><td>2019-07-20</td><td>Papua New Guinea</td><td>Fiji</td><td>1</td><td>1</td><td>Pacific Games</td><td>Apia</td><td>Samoa</td><td>True</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " date home_team away_team home_score away_score \\\n", "40834 2019-07-18 American Samoa Tahiti 8 1 \n", "40835 2019-07-18 Fiji Solomon Islands 4 4 \n", "40836 2019-07-19 Senegal Algeria 0 1 \n", "40837 2019-07-19 Tajikistan North Korea 0 1 \n", "40838 2019-07-20 Papua New Guinea Fiji 1 1 \n", "\n", " tournament city country neutral \n", "40834 Pacific Games Apia Samoa True \n", "40835 Pacific Games Apia Samoa True \n", "40836 African Cup of Nations Cairo Egypt True \n", "40837 Intercontinental Cup Ahmedabad India True \n", "40838 Pacific Games Apia Samoa True " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# preview last five rows\n", "match_results.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assuming the results dataset is also ordered by date, the recorded matches span 1872 to 2019. To synchronize the two datasets, I will only use matches that were played between 2006 and 2018." 
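, "\n", "\n", "The date window can be applied with plain string comparison, because ISO 8601 dates (YYYY-MM-DD, zero padded) sort the same way as strings and as calendar dates. The sketch below illustrates this on three dates taken from the previews in this notebook. The helper names `in_window` and `filter_matches` are hypothetical, and the exact bounds (1 January 2006 to 7 June 2018, the last ranking date previewed) are an assumption.

```python
from typing import Dict, List

# Assumed window bounds; the upper bound is the last ranking date previewed.
WINDOW_START = '2006-01-01'
WINDOW_END = '2018-06-07'


def in_window(date: str, start: str = WINDOW_START, end: str = WINDOW_END) -> bool:
    # Zero-padded YYYY-MM-DD strings compare in chronological order,
    # so no date parsing is needed for this check.
    return start <= date <= end


def filter_matches(matches: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # Keep only the matches whose date falls inside the window.
    return [m for m in matches if in_window(m['date'])]


sample = [
    {'date': '1872-11-30', 'home_team': 'Scotland', 'away_team': 'England'},
    {'date': '2006-01-02', 'home_team': 'Qatar', 'away_team': 'Libya'},
    {'date': '2019-07-20', 'home_team': 'Papua New Guinea', 'away_team': 'Fiji'},
]
kept = filter_matches(sample)
print([m['date'] for m in kept])
```

This shortcut only holds while every date is zero padded; a value such as 2006-1-2 would need to be parsed into a real date first."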
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fifa ranking dataset shape: (57793, 16)\n", "Results dataset shape: (40839, 9)\n" ] } ], "source": [ "# checking the shape of our datasets\n", "print('Fifa ranking dataset shape:', fifa_ranking.shape)\n", "print('Results dataset shape:', match_results.shape)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "rank int64\n", "country_full object\n", "country_abrv object\n", "total_points float64\n", "previous_points int64\n", "rank_change int64\n", "cur_year_avg float64\n", "cur_year_avg_weighted float64\n", "last_year_avg float64\n", "last_year_avg_weighted float64\n", "two_year_ago_avg float64\n", "two_year_ago_weighted float64\n", "three_year_ago_avg float64\n", "three_year_ago_weighted float64\n", "confederation object\n", "rank_date object\n", "dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# checking the data types\n", "fifa_ranking.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data types seem appropriate for the different columns aside from the rank date which is stored as an object. 
During data preparation, this will be converted to a date-time data type." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "date object\n", "home_team object\n", "away_team object\n", "home_score int64\n", "away_score int64\n", "tournament object\n", "city object\n", "country object\n", "neutral bool\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking the data types\n", "match_results.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aside from the date column, which is stored as an object rather than a date-time data type, all the other data types are appropriate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. External Data Source Validation\n", "I confirmed the validity of the FIFA world rankings using information from the official FIFA website, which can be found [here](https://www.fifa.com/fifa-world-ranking/ranking-table/men/). I also confirmed the accuracy of the match scores recorded for different games through a series of validation checks against different tournament websites on the internet." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Tidying the Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### a. 
Match Results dataset" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "date 0\n", "home_team 0\n", "away_team 0\n", "home_score 0\n", "away_score 0\n", "tournament 0\n", "city 0\n", "country 0\n", "neutral 0\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# checking for null values in the results dataset\n", "match_results.isna().sum()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "date object\n", "home_team object\n", "away_team object\n", "home_score int64\n", "away_score int64\n", "tournament object\n", "city object\n", "country object\n", "neutral bool\n", "year int64\n", "month int64\n", "day int64\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# split date column into year, month, and day\n", "md = match_results['date'].str.split('-',n=2, expand=True)\n", "match_results['year'] = md[0]\n", "match_results['month'] = md[1]\n", "match_results['day'] = md[2]\n", "#match_results = match_results.drop('date', 1)\n", "match_results[['year', 'month', 'day']] = match_results[['year', 'month', 'day']].astype(int)\n", "match_results.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Splitting the date column will assist in the merging of the home and away teams and their FIFA ranks for the respective years." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>date</th><th>home_team</th><th>away_team</th><th>home_score</th><th>away_score</th><th>tournament</th><th>neutral</th><th>year</th><th>month</th><th>day</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1872-11-30</td><td>Scotland</td><td>England</td><td>0</td><td>0</td><td>Friendly</td><td>False</td><td>1872</td><td>11</td><td>30</td></tr>\n",
"<tr><th>1</th><td>1873-03-08</td><td>England</td><td>Scotland</td><td>4</td><td>2</td><td>Friendly</td><td>False</td><td>1873</td><td>3</td><td>8</td></tr>\n",
"<tr><th>2</th><td>1874-03-07</td><td>Scotland</td><td>England</td><td>2</td><td>1</td><td>Friendly</td><td>False</td><td>1874</td><td>3</td><td>7</td></tr>\n",
"<tr><th>3</th><td>1875-03-06</td><td>England</td><td>Scotland</td><td>2</td><td>2</td><td>Friendly</td><td>False</td><td>1875</td><td>3</td><td>6</td></tr>\n",
"<tr><th>4</th><td>1876-03-04</td><td>Scotland</td><td>England</td><td>3</td><td>0</td><td>Friendly</td><td>False</td><td>1876</td><td>3</td><td>4</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " date home_team away_team home_score away_score tournament neutral \\\n", "0 1872-11-30 Scotland England 0 0 Friendly False \n", "1 1873-03-08 England Scotland 4 2 Friendly False \n", "2 1874-03-07 Scotland England 2 1 Friendly False \n", "3 1875-03-06 England Scotland 2 2 Friendly False \n", "4 1876-03-04 Scotland England 3 0 Friendly False \n", "\n", " year month day \n", "0 1872 11 30 \n", "1 1873 3 8 \n", "2 1874 3 7 \n", "3 1875 3 6 \n", "4 1876 3 4 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop columns not needed (axis passed by keyword; newer pandas rejects a positional axis)\n", "match_results = match_results.drop(['city', 'country'], axis=1)\n", "match_results.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>date</th><th>home_team</th><th>away_team</th><th>home_score</th><th>away_score</th><th>tournament</th><th>neutral</th><th>year</th><th>month</th><th>day</th><th>score_difference</th><th>outcome</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1872-11-30</td><td>Scotland</td><td>England</td><td>0</td><td>0</td><td>Friendly</td><td>False</td><td>1872</td><td>11</td><td>30</td><td>0</td><td>1</td></tr>\n",
"<tr><th>1</th><td>1873-03-08</td><td>England</td><td>Scotland</td><td>4</td><td>2</td><td>Friendly</td><td>False</td><td>1873</td><td>3</td><td>8</td><td>2</td><td>2</td></tr>\n",
"<tr><th>2</th><td>1874-03-07</td><td>Scotland</td><td>England</td><td>2</td><td>1</td><td>Friendly</td><td>False</td><td>1874</td><td>3</td><td>7</td><td>1</td><td>2</td></tr>\n",
"<tr><th>3</th><td>1875-03-06</td><td>England</td><td>Scotland</td><td>2</td><td>2</td><td>Friendly</td><td>False</td><td>1875</td><td>3</td><td>6</td><td>0</td><td>1</td></tr>\n",
"<tr><th>4</th><td>1876-03-04</td><td>Scotland</td><td>England</td><td>3</td><td>0</td><td>Friendly</td><td>False</td><td>1876</td><td>3</td><td>4</td><td>3</td><td>2</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " date home_team away_team home_score away_score tournament neutral \\\n", "0 1872-11-30 Scotland England 0 0 Friendly False \n", "1 1873-03-08 England Scotland 4 2 Friendly False \n", "2 1874-03-07 Scotland England 2 1 Friendly False \n", "3 1875-03-06 England Scotland 2 2 Friendly False \n", "4 1876-03-04 Scotland England 3 0 Friendly False \n", "\n", " year month day score_difference outcome \n", "0 1872 11 30 0 1 \n", "1 1873 3 8 2 2 \n", "2 1874 3 7 1 2 \n", "3 1875 3 6 0 1 \n", "4 1876 3 4 3 2 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# adding a score difference and win/draw/lose column relative to the home team\n", "match_results['score_difference'] = match_results['home_score'] - match_results['away_score']\n", "conditions = [(match_results['score_difference'] > 0), (match_results['score_difference'] == 0), (match_results['score_difference'] < 0)]\n", "values = [2, 1, 0] #where 2 is win, 1 is draw, 0 is loss\n", "match_results['outcome'] = np.select(conditions, values)\n", "match_results.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The score difference helps in the evaluation of each team's performance against their opponents. I recorded wins as 2, draws as 1, and losses as 0." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(11801, 12)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#selecting matches that took place between January 2006 and June 2018\n", "recent_results = match_results[match_results['date'] >= '2006-01-01']\n", "recent_results = recent_results[recent_results['date'] <= '2018-06-07']\n", "recent_results.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As was evident earlier, the match results dataset holds matches from as far back as 1872, and matches that old are of little use in predicting the outcome of matches today. 
To get a sample that better reflects modern-day football, the data needs to be filtered to a recent period. The choice of 2006 as the lower year bound is based on the fact that the most recent update of FIFA's ranking system before the 2018 World Cup was done in 2006. The ranking procedures are revised with each update, and this could affect the consistency of ranking as a predictor of team performance across different eras. 2006 to 2018 seemed like a viable time span to work with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b. Fifa ranking dataset" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "rank 0\n", "country_full 0\n", "country_abrv 0\n", "total_points 0\n", "previous_points 0\n", "rank_change 0\n", "cur_year_avg 0\n", "cur_year_avg_weighted 0\n", "last_year_avg 0\n", "last_year_avg_weighted 0\n", "two_year_ago_avg 0\n", "two_year_ago_weighted 0\n", "three_year_ago_avg 0\n", "three_year_ago_weighted 0\n", "confederation 0\n", "rank_date 0\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# checking for null values in the ranking dataset\n", "fifa_ranking.isna().sum()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>rank</th><th>country_full</th><th>country_abrv</th><th>total_points</th><th>previous_points</th><th>rank_change</th><th>cur_year_avg</th><th>cur_year_avg_weighted</th><th>last_year_avg</th><th>last_year_avg_weighted</th><th>two_year_ago_avg</th><th>two_year_ago_weighted</th><th>three_year_ago_avg</th><th>three_year_ago_weighted</th><th>confederation</th><th>rank_date</th><th>year_rank</th><th>month_rank</th><th>day_rank</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>57788</th><td>206</td><td>Anguilla</td><td>AIA</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CONCACAF</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"<tr><th>57789</th><td>206</td><td>Bahamas</td><td>BAH</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CONCACAF</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"<tr><th>57790</th><td>206</td><td>Eritrea</td><td>ERI</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CAF</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"<tr><th>57791</th><td>206</td><td>Somalia</td><td>SOM</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>CAF</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"<tr><th>57792</th><td>206</td><td>Tonga</td><td>TGA</td><td>0.0</td><td>0</td><td>1</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>OFC</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " rank country_full country_abrv total_points previous_points \\\n", "57788 206 Anguilla AIA 0.0 0 \n", "57789 206 Bahamas BAH 0.0 0 \n", "57790 206 Eritrea ERI 0.0 0 \n", "57791 206 Somalia SOM 0.0 0 \n", "57792 206 Tonga TGA 0.0 0 \n", "\n", " rank_change cur_year_avg cur_year_avg_weighted last_year_avg \\\n", "57788 1 0.0 0.0 0.0 \n", "57789 1 0.0 0.0 0.0 \n", "57790 1 0.0 0.0 0.0 \n", "57791 1 0.0 0.0 0.0 \n", "57792 1 0.0 0.0 0.0 \n", "\n", " last_year_avg_weighted two_year_ago_avg two_year_ago_weighted \\\n", "57788 0.0 0.0 0.0 \n", "57789 0.0 0.0 0.0 \n", "57790 0.0 0.0 0.0 \n", "57791 0.0 0.0 0.0 \n", "57792 0.0 0.0 0.0 \n", "\n", " three_year_ago_avg three_year_ago_weighted confederation rank_date \\\n", "57788 0.0 0.0 CONCACAF 2018-06-07 \n", "57789 0.0 0.0 CONCACAF 2018-06-07 \n", "57790 0.0 0.0 CAF 2018-06-07 \n", "57791 0.0 0.0 CAF 2018-06-07 \n", "57792 0.0 0.0 OFC 2018-06-07 \n", "\n", " year_rank month_rank day_rank \n", "57788 2018 06 07 \n", "57789 2018 06 07 \n", "57790 2018 06 07 \n", "57791 2018 06 07 \n", "57792 2018 06 07 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# splitting date column to year, month, day\n", "new = fifa_ranking['rank_date'].str.split('-',n=2, expand=True)\n", "fifa_ranking['year_rank'] = new[0]\n", "fifa_ranking['month_rank'] = new[1]\n", "fifa_ranking['day_rank'] = new[2]\n", "#fifa_ranking = fifa_ranking.drop('rank_date', 1)\n", "fifa_ranking.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the split, the year and month columns will be used to merge with the data set containing match results." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>rank</th><th>rank_change</th><th>country_full</th><th>rank_date</th><th>year_rank</th><th>month_rank</th><th>day_rank</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>0</td><td>Germany</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"<tr><th>1</th><td>2</td><td>0</td><td>Italy</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"<tr><th>2</th><td>3</td><td>9</td><td>Switzerland</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"<tr><th>3</th><td>4</td><td>0</td><td>Sweden</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"<tr><th>4</th><td>5</td><td>5</td><td>Argentina</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " rank rank_change country_full rank_date year_rank month_rank day_rank\n", "0 1 0 Germany 1993-08-08 1993 08 08\n", "1 2 0 Italy 1993-08-08 1993 08 08\n", "2 3 9 Switzerland 1993-08-08 1993 08 08\n", "3 4 0 Sweden 1993-08-08 1993 08 08\n", "4 5 5 Argentina 1993-08-08 1993 08 08" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# selecting appropriate columns\n", "select_ranking = fifa_ranking[['rank', 'rank_change', 'country_full', 'rank_date', 'year_rank', 'month_rank', 'day_rank']]\n", "select_ranking.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many of the columns in the FIFA ranking data set are not useful to our analysis, so we keep only the columns we need: the country name, rank, rank change, and the ranking date with its year, month, and day parts." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>rank</th><th>rank_change</th><th>country_full</th><th>rank_date</th><th>year_rank</th><th>month_rank</th><th>day_rank</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>0</td><td>Germany</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"<tr><th>1</th><td>2</td><td>0</td><td>Italy</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"<tr><th>2</th><td>3</td><td>9</td><td>Switzerland</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"<tr><th>3</th><td>4</td><td>0</td><td>Sweden</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"<tr><th>4</th><td>5</td><td>5</td><td>Argentina</td><td>1993-08-08</td><td>1993</td><td>08</td><td>08</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " rank rank_change country_full rank_date year_rank month_rank day_rank\n", "0 1 0 Germany 1993-08-08 1993 08 08\n", "1 2 0 Italy 1993-08-08 1993 08 08\n", "2 3 9 Switzerland 1993-08-08 1993 08 08\n", "3 4 0 Sweden 1993-08-08 1993 08 08\n", "4 5 5 Argentina 1993-08-08 1993 08 08" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# renaming countries with different names in the two data sets\n", "select_ranking = select_ranking.replace(\"Côte d'Ivoire\", 'Ivory Coast')\n", "select_ranking.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the countries have different names in the two data sets so we need to rename them. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead><tr><th></th><th>rank</th><th>rank_change</th><th>country_full</th><th>rank_date</th><th>year_rank</th><th>month_rank</th><th>day_rank</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>57788</th><td>206</td><td>1</td><td>Anguilla</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"<tr><th>57789</th><td>206</td><td>1</td><td>Bahamas</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"<tr><th>57790</th><td>206</td><td>1</td><td>Eritrea</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"<tr><th>57791</th><td>206</td><td>1</td><td>Somalia</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"<tr><th>57792</th><td>206</td><td>1</td><td>Tonga</td><td>2018-06-07</td><td>2018</td><td>06</td><td>07</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " rank rank_change country_full rank_date year_rank month_rank \\\n", "57788 206 1 Anguilla 2018-06-07 2018 06 \n", "57789 206 1 Bahamas 2018-06-07 2018 06 \n", "57790 206 1 Eritrea 2018-06-07 2018 06 \n", "57791 206 1 Somalia 2018-06-07 2018 06 \n", "57792 206 1 Tonga 2018-06-07 2018 06 \n", "\n", " day_rank \n", "57788 07 \n", "57789 07 \n", "57790 07 \n", "57791 07 \n", "57792 07 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#selecting fifa team rankings between 2006 and 2018\n", "recent_ranking = select_ranking[select_ranking['rank_date'] >= '2006-01-01']\n", "recent_ranking.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For consistency, we select team ranks that fall within the same review period." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "#drop date columns (axis passed by keyword; newer pandas rejects a positional axis)\n", "recent_rank = recent_ranking.drop('rank_date', axis=1)\n", "recent_games = recent_results.drop('date', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We drop the date columns since we already have the year, month, and day columns." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "rank int64\n", "rank_change int64\n", "country_full object\n", "year_rank int64\n", "month_rank int64\n", "day_rank int64\n", "dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert year, month, and day to integer for merging\n", "convert_dict = {'year_rank': int,\n", " 'month_rank': int, \n", " 'day_rank': int\n", " } \n", " \n", "recent_rank = recent_rank.astype(convert_dict) \n", "recent_rank.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c. 
Merging the data sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To merge the data sets, I used an inner join between the recent matches and recent ranks tables, matching the 'home_team', 'year', and 'month' columns of the matches data set against the 'country_full', 'year_rank', and 'month_rank' columns of the ranks data set. Because the join keys are the team, year, and month only, each match picks up the rank published for that team in the same calendar month. After merging, I dropped the 'year_rank', 'month_rank', 'day_rank', and 'country_full' columns so that the same merge could be repeated to attach the ranks of the away teams." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
   home_team  away_team     home_score  away_score  tournament              neutral  year  month  day  score_difference  outcome  home_team_rank  home_rank_change
0  Qatar      Libya         2           0           Friendly                False    2006  1      2    2                 2        89              6
1  Egypt      Zimbabwe      2           0           Friendly                False    2006  1      5    2                 2        32              0
2  Egypt      South Africa  1           2           Friendly                False    2006  1      14   -1                0        32              0
3  Egypt      Libya         3           0           African Cup of Nations  False    2006  1      20   3                 2        32              0
4  Egypt      Morocco       0           0           African Cup of Nations  False    2006  1      24   0                 1        32              0
\n", "
" ], "text/plain": [ " home_team away_team home_score away_score tournament \\\n", "0 Qatar Libya 2 0 Friendly \n", "1 Egypt Zimbabwe 2 0 Friendly \n", "2 Egypt South Africa 1 2 Friendly \n", "3 Egypt Libya 3 0 African Cup of Nations \n", "4 Egypt Morocco 0 0 African Cup of Nations \n", "\n", " neutral year month day score_difference outcome home_team_rank \\\n", "0 False 2006 1 2 2 2 89 \n", "1 False 2006 1 5 2 2 32 \n", "2 False 2006 1 14 -1 0 32 \n", "3 False 2006 1 20 3 2 32 \n", "4 False 2006 1 24 0 1 32 \n", "\n", " home_rank_change \n", "0 6 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# merging tables using inner join for home team rank and rank change\n", "combo1 = pd.merge(recent_games, recent_rank, how='inner', left_on=['home_team','year', 'month'], right_on=['country_full', 'year_rank', 'month_rank'])\n", "# drop the ranking-side join keys (axis is passed by keyword, as required since pandas 2.0)\n", "combo1 = combo1.drop(['year_rank', 'month_rank', 'day_rank', 'country_full'], axis=1)\n", "combo1.rename(columns={'rank':'home_team_rank'}, inplace=True)\n", "combo1.rename(columns={'rank_change':'home_rank_change'}, inplace=True)\n", "combo1.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
   home_team  away_team  home_score  away_score  tournament              neutral  year  month  day  score_difference  outcome  home_team_rank  home_rank_change  away_team_rank  away_rank_change
0  Qatar      Libya      2           0           Friendly                False    2006  1      2    2                 2        89              6                 80              0
1  Egypt      Libya      3           0           African Cup of Nations  False    2006  1      20   3                 2        32              0                 80              0
2  Tunisia    Libya      1           0           Friendly                False    2006  1      12   1                 2        28              0                 80              0
3  Egypt      Zimbabwe   2           0           Friendly                False    2006  1      5    2                 2        32              0                 53              0
4  Morocco    Zimbabwe   1           0           Friendly                False    2006  1      14   1                 2        35              1                 53              0
\n", "
" ], "text/plain": [ " home_team away_team home_score away_score tournament \\\n", "0 Qatar Libya 2 0 Friendly \n", "1 Egypt Libya 3 0 African Cup of Nations \n", "2 Tunisia Libya 1 0 Friendly \n", "3 Egypt Zimbabwe 2 0 Friendly \n", "4 Morocco Zimbabwe 1 0 Friendly \n", "\n", " neutral year month day score_difference outcome home_team_rank \\\n", "0 False 2006 1 2 2 2 89 \n", "1 False 2006 1 20 3 2 32 \n", "2 False 2006 1 12 1 2 28 \n", "3 False 2006 1 5 2 2 32 \n", "4 False 2006 1 14 1 2 35 \n", "\n", " home_rank_change away_team_rank away_rank_change \n", "0 6 80 0 \n", "1 0 80 0 \n", "2 0 80 0 \n", "3 0 53 0 \n", "4 1 53 0 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# merging tables using inner join for away team rank and rank change\n", "combo2 = pd.merge(combo1, recent_rank, how='inner', left_on=['away_team','year', 'month'], right_on=['country_full', 'year_rank', 'month_rank'])\n", "# drop the ranking-side join keys (axis is passed by keyword, as required since pandas 2.0)\n", "combo2 = combo2.drop(['year_rank', 'month_rank', 'day_rank', 'country_full'], axis=1)\n", "combo2.rename(columns={'rank':'away_team_rank'}, inplace=True)\n", "combo2.rename(columns={'rank_change':'away_rank_change'}, inplace=True)\n", "combo2.head()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Friendly', 'Other competition', 'FIFA World Cup',\n", " 'FIFA World Cup qualification'], dtype=object)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reducing cardinality of tournament column\n", "combo2.replace(to_replace=['African Cup of Nations', 'Lunar New Year Cup',\n", " 'AFC Asian Cup qualification', 'Cyprus International Tournament',\n", " 'Malta International Tournament', 'AFC Challenge Cup',\n", " 'COSAFA Cup', 'Kirin Cup',\n", " 'Merdeka Tournament',\n", " 'CFU Caribbean Cup qualification',\n", " 'African Cup of Nations qualification', 'Copa del Pacífico',\n", " 'AFF Championship', 'ELF Cup', 'CECAFA Cup',\n", " 
 'UAFA Cup qualification', \"King's Cup\", 'CFU Caribbean Cup',\n", " 'Gulf Cup', 'UNCAF Cup', 'EAFF Championship', 'Copa América',\n", " 'Gold Cup', 'WAFF Championship', 'Island Games', 'AFC Asian Cup',\n", " 'Nehru Cup', 'South Pacific Games',\n", " 'Amílcar Cabral Cup', 'AFC Challenge Cup qualification',\n", " 'Baltic Cup', 'SAFF Cup',\n", " 'African Nations Championship', 'VFF Cup', 'Confederations Cup',\n", " 'Dragon Cup', 'ABCS Tournament', 'Nile Basin Tournament',\n", " 'Nations Cup', 'Copa Paz del Chaco', 'Pacific Games',\n", " 'Oceania Nations Cup qualification', 'Oceania Nations Cup',\n", " 'UAFA Cup', 'OSN Cup', 'Windward Islands Tournament',\n", " 'Gold Cup qualification', 'Copa América qualification',\n", " 'Intercontinental Cup', 'UEFA Euro qualification', 'UEFA Euro'], value = 'Other competition', inplace=True)\n", "combo2.tournament.unique()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "48" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# checking for duplicates\n", "combo2.duplicated().sum()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# dropping duplicates\n", "combo2 = combo2.drop_duplicates()\n", "# confirming that no duplicates remain\n", "combo2.duplicated().sum()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#checking for outliers\n", "sb.boxplot(x=combo2['score_difference'])\n", "plt.title('Goal Difference')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAc8AAAFsCAYAAACq8DJrAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAdIklEQVR4nO3de1TUdf7H8dcAogmutxTNXF2zrLRExdYLiqKUQabm/bS7ZXk83nLT1NKo1A0vtKuldSzXc1LLNV0F3TQrb6jlba310sWNas1SAkJ0QVdhmO/vDw/zi8DL22aAsefjHM/xO9+Zz3y+M8BzPl84My7HcRwBAIArFlTREwAAINAQTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8EfBiY2O1a9euEpelpKRo6NChFTSjC4YPH642bdqoTZs2atmypVq1auXdfvbZZytkTunp6XrkkUd01113KSoqSg888IC2b99eIXMBAllIRU8AuFYtXrzY+/+nnnpKERERGj9+fAXOSBo5cqSGDh2qV199VZJ0+PBh+fp9Utxut0JC+NGCaxsrT/wifPXVV/r973+vqKgoJSQkaMuWLd59Tz31lKZNm+ZdKQ4ZMkTZ2dlKSkpS+/bt1atXL3322Wfe62dmZuqxxx5Thw4dFBsbq2XLlpnns23bNvXp00dRUVEaMmSIjhw54t23aNEi9ezZU23atFF8fLw2bdrk3ZeSkqIhQ4Zo5syZioqKUo8ePfTxxx8rJSVFMTEx6tixo1JTU8u8z5MnT+q7777ToEGDFBoaqtDQULVr105RUVHe62zevFl9+vRR27Zt1bNnT+3YscN7zCNHjtRdd92luLg4rVq1ynubBQsWaNy4cZo4caLatm2r1NRU5eXlaerUqYqOjlaXLl00b948FRUVmR8noLIinrjmFRYWauTIkercubN27dqlxMRETZw4UV9//bX3Ohs3btTjjz+uPXv2KDQ0VIMHD1bLli21Z88e3XPPPZo1a5YkyePxaNSoUWrRooV27NihpUuXaunSpdq5c+cVz+ezzz7T1KlTNWPGDO3du1eDBw/W6NGjVVBQIElq3Lixli9fro8++khjx47VpEmTlJWV5b39oUOH1KJFC+3du1f33XefJkyYoMOHD2vTpk164YUXNGPGDJ05c6bU/dauXVtNmjTRpEmTtHnzZv3www8l9h86dEhPPvmkJk+erP3792v58uVq1KiRJGnChAlq0KCBdu7cqfnz52vu3LnavXu397ZbtmxRr169tH//fvXu3VtPPfWUQkJC9P7772vt2rX68MMP9fe///2KHyOgsiOeuCaMGTNGUVFR3n/Tp0/37jt48KDOnj2rESNGKDQ0VB07dlT37t21YcMG73Xi4uLUqlUrVa1aVXFxcapatar69u2r4OBgxcfH6/PPP5d04TTnyZMnNXbsWIWGhqpx48YaNGiQ3nnnnSue68qVKzV48GC1bt1awcHB6tevn6pUqaIDBw5Iku69915FREQoKChI8fHxatKkiQ4dOuS9/Y033qj+/ft755aRkaExY8YoNDRU0dHRCg0N1bFjx0rdr8vl0rJly9SoUSPNnj1b0dHRevDBB3X06FFJ0urVq9W/f3917txZQUFBioiI0E033aSMjAx9/PHHmjhxoqpWrarbbrtNAwcO1Lp167xjR0ZGqmfPngoKClJ+fr6
2b9+uqVOnqnr16qpbt64efvjhEo83EOj4xQSuCa+88oo6derk3U5JSfGudLKystSgQQMFBf3/a8UbbrhBmZmZ3u26det6/1+tWjVdf/31JbbPnj0rSTp+/LiysrJKnOosKioqsX05J06c0Nq1a/Xmm296LyssLPSuLteuXavXX39dx48flySdPXtWubm5F52rpBLzrVq1apkrT0lq0KCB94+VMjIy9Mwzz+jJJ5/UypUrlZGRoZiYmFK3ycrKUs2aNRUeHu697IYbbtAnn3xSYtwfH5/b7VZ0dLT3Mo/Ho4YNG17qYQECCvHENa9+/fr6/vvv5fF4vAHNyMhQ06ZNzWM1bNhQN954o95///2rnk/Dhg01cuRIjRo1qtS+48ePKzExUUuWLFGbNm0UHBysPn36XPV9XW4eDz74oCZMmODdLmvFWr9+fZ0+fVr5+fnegGZkZCgiIsJ7HZfL5f1/gwYNFBoaqj179vCHQ7hmcdoW17w777xT1apV0+LFi1VYWKi9e/dq69atio+Pv6qxwsLCtGjRIp07d05FRUX64osvSpxWvZyBAwfqrbfe0sGDB+U4js6ePau0tDTl5+frf//7n1wul+rUqSNJWrNmjdLT083zLMvp06c1f/58ffPNN/J4PDp58qTWrFmjyMhISdKAAQOUkpKi3bt3y+PxKDMzU1999ZUaNmyoNm3aaO7cuTp//ryOHDmi1atX6/777y/zfurXr6/OnTtr9uzZys/Pl8fj0bFjx7Rv3z6fHAdQGRBPXPNCQ0P16quvaseOHerQoYOmT5+u5ORk3XTTTeaxgoOD9eqrr+rIkSPq0aOHOnTooMTEROXn51/xGHfccYf+9Kc/acaMGWrfvr3uvvtupaSkSJKaN2+uRx55REOGDFGnTp30xRdfqG3btuZ5lqVKlSo6fvy4hg0bpnbt2ql3794KDQ3V7NmzJV14YTBr1izNnDlT7dq10+9+9zudOHFCkjR37lwdP35cXbp00dixY/XYY4+VOE3+U8nJySosLFR8fLzat2+vcePGKTs72yfHAVQGLj4MGwAAG1aeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYhZTXHY0dO1bZ2dl+Gz8vL0+SVKNGDb/dR7169fTyyy/7bXwAQGAot3hmZ2fr+8wsOaFhfhnfVfA/SVK+2z+LaVfBGb+MCwAIPOUWT0lyQsN0pvUgv4wddnCVJPl9fAAA+J0nAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGIeV1R3l5eXK5C8vr7lDOFi5cKEkaNWpUBc8EAPyv3Fae586dkzzu8ro7lLO0tDSlpaVV9DQAoFxw2hYAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwA
AjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAkK3bt28/wJt/H79+qlbt27q37+/z8eWpDlz5qhbt276y1/+EpDj79+/X7Gxsfroo4/8Mv7WrVvVrVs3bdu2zS/j+1tOTo7GjRunnJwcv4z/5ZdfKiEhQV9++aVfxvcnfz82l0I8AT/Lzc2VJL99g2/cuFGS9Pbbbwfk+NOmTZPH49Fzzz3nl/FnzpwpSUpKSvLL+P62dOlSHT58WMuWLfPL+M8//7zOnDmj559/3i/j+5O/H5tLIZ6o9H66GvT16tCf4/fr16/Etq9Xn3PmzCmx7evVob/H379/v/Lz8yVJ+fn5Pl99bt26VW63W5LkdrsDbvWZk5Ojd999V47j6N133/X5C7Avv/xSR48elSQdPXo0oFaf/n5sLiekXO8tgLnc55WdfVaDBw+u6KlUStnZ2apWrVpFT6PSKV51FvP1N3jxqrDY22+/rSeeeCJgxp82bVqJ7eeee07r16/32fjFq85iSUlJ6t69u8/G97elS5fK4/FIkoqKirRs2TKNHz/eZ+P/dLX5/PPPa8mSJT4b35/8/dhcDitPABWmeNV5se2fq3jVebHtym7z5s0lVs6bNm3y6fjFq86LbVdm/n5sLoeV5xVyQqqqXu0wrVy5sqKnUimxIsfVCA8PLxHM8PBwn44fEhJSIpghIYH1I69nz55655135Ha7FRISori4OJ+O37Rp0xLBbNq0qU/H9yd/PzaXw8oT8KPatWuX2K5bt65Px7/33ntLbPfu3Tugxv/padvp06f7dPypU6eW2H766ad9Or6/PfTQQwoKuvBjOjg4WH/4wx98On5iYuIltyszfz82l0M8UemlpaVdcrsyj5+amlpie82aNT4bW5KefPLJEtu+/H1keYwfFRXlXW2Gh4erXbt2Ph0/NjbWu9oMCQkJqN93ShdebPXq1Usul0u9evXy+Yuv5s2be1ebTZs2VfPmzX06vj/5+7G5HOIJ+Fnx6tNf39zFq0NfrwrLa/xp06YpKCjI56vOYsWrz0BbdRZ76KGHdMcdd/htZZWYmKiwsLCAWnUW8/djcykux3Gc8rij2NhYFXkc5bcf5pfxww6ukiSdaT3Ib+M35HeeF1X8O08eHwC/BKw8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAEfEEAMCIeAIAYEQ8AQAwIp4AABgRTwAAjIgnAABGxBMAACPiCQCAUUh53VG1atV05lxhed0dylm3bt0qegoAUG7KLZ41atRQvvtMed0dytmoUaMqegoAUG44bQsAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gk
AgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAAAj4gkAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJwAARsQTAACjkPK8M1fBGYUdXOW3sSX5efwwv4wNAAgs5RbPevXq+XX8vDyPJKlGDX8FLszvxwAACAwux3Gcip4EAACBhN95AgBgRDwBADAingAAGBFPAACMiCcAAEbEEwAAI+IJAIAR8QQAwIh4AgBgRDwBADAingAAGBFPAACMiCcAAEbEEwAAI+IJAIAR8QQAwIh4AgBgRDwBADAingAAGBFPAACMiCcAAEbEEwAAI+IJAIAR8QQAwIh4AgBgRDwBADAingAAGBFPAACMiCcAAEbEEwAAI+IJAIAR8QQAwIh4AgBgRDwBADAingAAGBFPAACMiCcAAEbEEwAAI+IJAIAR8QQAwCjEH4Pu2LFDSUlJ8ng8GjhwoEaMGFFif0FBgSZPnqxPP/1UtWrV0rx583TjjTf6Yyp+l5GRocmTJysnJ0cul0uDBg3SQw89VOI6e/fu1ejRo73HGBcXp7Fjx1bEdH0iNjZWYWFhCgoKUnBwsFJSUkrsdxxHSUlJ2r59u6pVq6bZs2erZcuWFTTbq/f1119r/Pjx3u1vv/1W48aN08MPP+y9LNCf2ylTpigtLU1169bV+vXrJUmnTp3S+PHjdfz4cTVq1EgvvviiatasWeq2qampWrhwoSRp1KhR6tevX7nO/WqUdbxz5szRtm3bVKVKFf3617/WrFmz9Ktf/arUbS/3dV8ZlXW8CxYs0KpVq1SnTh1J0oQJExQTE1Pqtpf7OV7ZlHWsjz/+uP7zn/9IkvLy8lSjRg2tW7eu1G2v6rl1fMztdjs9evRwjh075pw/f97p3bu3k56eXuI6b775pvPMM884juM469evd/74xz/6ehrlJjMz0/nkk08cx3GcvLw85+677y51vHv27HFGjBhREdPzi+7duzs5OTkX3Z+WluY8+uijjsfjcf71r385AwYMKMfZ+Yfb7XY6derkfPfddyUuD/Tndt++fc4nn3ziJCQkeC+bM2eO89prrzmO4zivvfaak5ycXOp2ubm5TmxsrJObm+ucOnXKiY2NdU6dOlVu875aZR3vzp07ncLCQsdxHCc5ObnM43Wcy3/dV0ZlHe/8+fOdxYsXX/J2V/JzvLIp61h/bNasWc6CBQvK3Hc1z63PT9seOnRITZo0UePGjRUaGqqEhARt2bKlxHW2bt3qfZV6zz33aPfu3XIcx9dTKRf169f3rqrCw8PVrFkzZWZmVvCsKtaWLVvUt29fuVwuRUZG6r///a+ysrIqelo/y+7du9W4cWM1atSooqfiU+3bty+1qix+/iSpb9++2rx5c6nbffDBB+rcubNq1aqlmjVrqnPnztq5c2d5TPlnKet4o6OjFRJy4SRcZGSkvv/++4qYml+UdbxX4kp+jlc2lzpWx3G0ceNG3XfffT67P5/HMzMzUw0aNPBuR0RElIpJZmamGjZsKEkKCQlRjRo1lJub6+uplLvvvvtOn3/+uVq3bl1q34EDB3T//fdr+PDhSk9Pr4DZ+dajjz6qBx54QCtXriy176dfAw0aNAj4FxQbNmy46Dfetfbc5uTkqH79+pKkevXqKScnp9R1ruT7PBCtWbNGXbt2vej+S33dB5Lly5erd+/emjJlik6fPl1q/7X2/O7fv19169ZV06ZNL3od63Prl995/hKdOXNG48aN09SpUxUeHl5iX8uWLbV161aFhYVp+/btGjNmjN5///0KmunPt2LFCkVERCgnJ0fDhg1Ts2bN1L59+4qelt8UFBRo69ateuKJJ0rtu9ae259yuVxyuVwVPY1ysXDhQgUHB+v+++8vc/+18nU/dOhQjR49Wi6XSy+99JJmz56tWbNmVfS0/Gr9+vWXXHVezXPr85VnREREidMemZmZioiIKHWdjIwMSZLb7VZeXp5q167t66mUm8LCQo0bN06
9e/fW3XffXWp/eHi4wsLCJEkxMTFyu906efJkeU/TZ4qfz7p16youLk6HDh0qtf/HXwPff/99qa+BQLJjxw61bNlS119/fal919pzK114XotPs2dlZXn/sOTHruT7PJCkpKQoLS1Nf/7zny/6YuFyX/eB4vrrr1dwcLCCgoI0cOBAHT58uNR1rqXn1+12a9OmTYqPj7/oda7mufV5PO+44w4dPXpU3377rQoKCrRhwwbFxsaWuE5sbKxSU1MlSe+99546dOgQsK9uHcfR008/rWbNmmnYsGFlXic7O9v7O91Dhw7J4/EE7IuFs2fPKj8/3/v/Dz/8UDfffHOJ68TGxmrt2rVyHEcHDhxQjRo1vKcBA9GGDRuUkJBQ5r5r6bktVvz8SdLatWvVo0ePUteJjo7WBx98oNOnT+v06dP64IMPFB0dXc4z9Y0dO3Zo8eLFWrhwoa677royr3MlX/eB4sd/f7B58+Yyj+NKfo4Hil27dqlZs2YlTkP/2NU+tz4/bRsSEqJnn31Ww4cPV1FRkfr376+bb75ZL730klq1aqUePXpowIABmjRpkuLi4lSzZk3NmzfP19MoNx999JHWrVunW265RX369JF04U+/T5w4IenCKZL33ntPK1asUHBwsKpVq6a5c+cG7IuFnJwcjRkzRpJUVFSk++67T127dtWKFSskXTjemJgYbd++XXFxcbruuus0c+bMipzyz3L27Fnt2rVLM2bM8F7242MN9Od2woQJ2rdvn3Jzc9W1a1c99thjGjFihB5//HGtXr1aN9xwg1588UVJ0uHDh/XWW28pKSlJtWrV0ujRozVgwABJ0pgxY1SrVq2KO5ArVNbxLlq0SAUFBd4Xv61bt9aMGTOUmZmpxMRE/fWvf73o131lV9bx7tu3T0eOHJEkNWrUyPu1/ePjvdjP8cqsrGMdOHCg3nnnnVIvfn3x3LqcQP0zVwAAKgjvMAQAgBHxBADAiHgCAGBEPAEAMCKeAAAYEU8AAIyIJ1CGFi1a6MyZMxU9DQCVFPEEfsE8Hk/AfqIRUJF4Y3jgIt544w1t2rRJp06d0uTJk3XPPfdIuvB2bnPnzlVRUZHq1KmjGTNmqEmTJtq7d6+SkpJ055136uDBgwoJCVFycrJefvllpaenq2HDhlqwYIGqV6+ugoICzZs3T//85z9VUFCgFi1aaNq0ad73yf2pnJwcPfHEE95POOnYsaOmTp0qSXrttde0fv16uVwuVa9eXX/7298UFBSkRYsW6R//+IekC2+3lpiYqLCwMC1YsEDp6enKz8/XiRMntHLlSh04cEALFy5UQUGBqlSpoilTpigyMtL/DzIQqEyf/gn8Qtxyyy3OG2+84TiO4+zfv9+Jjo52HMdxfvjhB+e3v/2t94OBV61a5f2w7z179ji3336789lnnzmO4zjTpk1zunTp4mRkZDiO4zjDhw93Vq1a5TiO47zyyivOK6+84r2/5ORkZ+7cuRedz+uvv+79AHnHcbwfPJ2SkuIMGjTIycvLcxzHcU6ePOk4zoUPJE9ISHDy8vIcj8fjTJo0yfshz/Pnz3diYmK8H/77zTfflBjjiy++cGJiYq7qcQN+KVh5AhdR/CkMkZGRysrK0vnz53Xw4EHdeuutat68uSSpf//+mj59uveNpX/zm9/otttukyTdfvvtOnHihPcNqVu2bKlvvvlG0oUPhM/Pz9d7770n6cLHnt16660XnUvr1q21ZMkSzZkzR3fddZf3Tdi3bdumoUOHej8Gr/hN6Xfv3q34+Hjv5YMGDSrxHsNdu3b1flrKzp07dezYMT344IPe/W63Wz/88EOZnyQDgNO2wEVVrVpVkhQcHCzpQlAuJzQ01Pv/4OBg7xjF2+fPn5d04dN4nnvuOXXs2PGK5tKmTRulpqZq165dWrdunRYtWuR9g/qr8dPTw126dFFycvJVjwf80vAHQ4BBZGSkjhw5oq+++kqSlJqaqttvv73UB6BfTmxsrJYsWaJ
z585JkvLz871jluXbb79VeHi4EhISNGXKFH366afyeDzq3r27VqxY4V355ubmSrrwO9GNGzcqPz9fjuNo9erV6tSpU5ljd+7cWTt37lR6err3skD9rEqgvLDyBAzq1Kmj5ORkTZw4UW63W3Xq1NELL7xgHmfEiBF6+eWXNWDAALlcLrlcLo0dO1Y33XRTmdfft2+flixZoqCgIHk8Hk2fPl1BQUHq27evMjMzNXjwYIWEhKh69epavny5YmJi9O9//1tDhgyRJLVq1UqjRo0qc+ymTZvqhRde0NNPP61z586psLBQbdu21Z133mk+LuCXgo8kAwDAiNO2AAAYcdoWqESeffZZHTx4sMRlwcHBSklJqaAZASgLp20BADDitC0AAEbEEwAAI+IJAIAR8QQAwOj/APFGQfnkFa/fAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#checking for outliers\n", "sb.boxplot(x=combo2['home_score'])\n", "plt.title('Home Team Score')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#checking for outliers\n", "sb.boxplot(x=combo2['away_score'])\n", "plt.title('Away Team Score')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
       home_score   away_score   year         month        day          score_difference  outcome      home_team_rank  home_rank_change  away_team_rank  away_rank_change
count  9437.000000  9437.000000  9437.000000  9437.000000  9437.000000  9437.000000       9437.000000  9437.000000     9437.000000       9437.000000     9437.000000
mean   1.539472     1.055632     2011.721734  6.956978     14.341846    0.483840          1.193494     80.305818       0.690580          82.900922       0.071527
std    1.516916     1.228537     3.500398     3.297470     8.615898     2.104135          0.849075     52.664288       7.788676          53.258717       7.891007
min    0.000000     0.000000     2006.000000  1.000000     1.000000     -15.000000        0.000000     1.000000        -62.000000        1.000000        -62.000000
25%    0.000000     0.000000     2009.000000  4.000000     7.000000     -1.000000         0.000000     35.000000       -2.000000         38.000000       -3.000000
50%    1.000000     1.000000     2012.000000  7.000000     13.000000    0.000000          1.000000     76.000000       0.000000          78.000000       0.000000
75%    2.000000     2.000000     2015.000000  10.000000    22.000000    2.000000          2.000000     119.000000      3.000000          121.000000      2.000000
max    17.000000    15.000000    2018.000000  12.000000    31.000000    17.000000         2.000000     209.000000      73.000000         209.000000      82.000000
\n", "
" ], "text/plain": [ " home_score away_score year month day \\\n", "count 9437.000000 9437.000000 9437.000000 9437.000000 9437.000000 \n", "mean 1.539472 1.055632 2011.721734 6.956978 14.341846 \n", "std 1.516916 1.228537 3.500398 3.297470 8.615898 \n", "min 0.000000 0.000000 2006.000000 1.000000 1.000000 \n", "25% 0.000000 0.000000 2009.000000 4.000000 7.000000 \n", "50% 1.000000 1.000000 2012.000000 7.000000 13.000000 \n", "75% 2.000000 2.000000 2015.000000 10.000000 22.000000 \n", "max 17.000000 15.000000 2018.000000 12.000000 31.000000 \n", "\n", " score_difference outcome home_team_rank home_rank_change \\\n", "count 9437.000000 9437.000000 9437.000000 9437.000000 \n", "mean 0.483840 1.193494 80.305818 0.690580 \n", "std 2.104135 0.849075 52.664288 7.788676 \n", "min -15.000000 0.000000 1.000000 -62.000000 \n", "25% -1.000000 0.000000 35.000000 -2.000000 \n", "50% 0.000000 1.000000 76.000000 0.000000 \n", "75% 2.000000 2.000000 119.000000 3.000000 \n", "max 17.000000 2.000000 209.000000 73.000000 \n", "\n", " away_team_rank away_rank_change \n", "count 9437.000000 9437.000000 \n", "mean 82.900922 0.071527 \n", "std 53.258717 7.891007 \n", "min 1.000000 -62.000000 \n", "25% 38.000000 -3.000000 \n", "50% 78.000000 0.000000 \n", "75% 121.000000 2.000000 \n", "max 209.000000 82.000000 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#checking table statistics\n", "combo2.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the box plots and summary statistics above, we can clearly observe outliers in the 'home_score', 'away_score', and 'score_difference' columns. In 'home_score', the highest number of goals scored is 17, yet the 75th percentile is only 2 goals. A similar gap is evident in 'away_score' (maximum 15, 75th percentile 2) and 'score_difference' (range -15 to 17, 75th percentile 2). Since we intend to fit regression models to this data, and such extreme scorelines can distort the fit, we remove the outliers. 
" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6.5, -5.5)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# compute iqr score to remove outliers for age and household size\n", "q1_diff, q3_diff = np.percentile(combo2['score_difference'], [25, 75])\n", "iqr_diff = q3_diff - q1_diff\n", "\n", "lower_diff = q1_diff - (1.5 * iqr_diff)\n", "upper_diff = q3_diff + (1.5 * iqr_diff)\n", "upper_diff, lower_diff" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(9437, 15) : With outliers\n", "(9158, 15) : No outliers\n" ] } ], "source": [ "#removing outliers\n", "print(combo2.shape, ': With outliers')\n", "\n", "combo2 = combo2.drop(combo2[combo2['score_difference'] > 5].index)\n", "combo2 = combo2.drop(combo2[combo2['score_difference'] < -5].index)\n", "combo2 = combo2.drop(combo2[combo2['home_score'] > 5.5].index)\n", "combo2 = combo2.drop(combo2[combo2['away_score'] > 5.5].index)\n", "\n", "print(combo2.shape, ': No outliers')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAcwAAAFsCAYAAABBx4loAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAWoUlEQVR4nO3da3BUhf2H8e+SECQXbhYCCYEq1TiWIqmxoaHVAJHrhKCgQG1tAcNliEABIS1tFRhTUYHKxRAG0nARkUtqMITqRJuGFsJEoVI6SNHpgATYhHJLtkLC5vxfMG6bPxd/Bs2y4fm8ytmcPee3y5An5+TMHpfjOI4AAMB1NfP3AAAABAKCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTOAG9e3bV7t27Wrw87dt26axY8f6lj/44AP1799fcXFxKioq0qlTp/TEE08oLi5OL7zwwlcxMoAGCPb3AEBj2L59u3Jzc3X48GG1bNlSnTt31rBhw/SjH/1ILpfra9tvRkaGCgoK1Lx5c0lSdHS0+vTpo/HjxysiIkKSNHToUA0dOtT3nCVLluiJJ57QT3/6U0nS8uXL1bZtW+3du/drnRXA9XGEiSYvJydHzz//vMaNG6e//OUv2rVrl+bOnau9e/eqtrb2a9//uHHjtG/fPpWWliozM1N/+9vfNHr0aP3nP/+56vrHjx/XXXfdVW+5W7duDYrlpUuXGjw3gPoIJpq0qqoqLVmyRM8++6wGDhyo8PBwuVwu3XvvvVq4cKFCQkJ8682aNUu9evVSnz599Oqrr6qurk6SdPToUT355JNKSEhQQkKCZsyYofPnz3/pWVq0aKEePXooKytLZ8+eVV5eniQpLy9Po0ePliQlJyfr008/1cSJExUXF6fp06frzTff1OrVqxUXF6ddu3aprq5OK1euVHJyshISEjR16lSdPXtWknTs2DHFxsZq8+bNSkpK8h2lbtmyRYMGDdIDDzygcePGqby83DdXbGysXn/9dfXv31/x8fGaO3eu/vcDwDZt2qRBgwYpLi5OgwcP1j/+8Q9Jktvt1tNPP61evXqpb9++Wrt27Zd+T4BAQjDRpO3bt081NTXq16/fddebP3++qqqqVFRUpHXr1ik/P19bt26VJDmOowkTJmjnzp3asWOHTp48qaVLlzZ4pvDwcCUmJur999+/4ntFRUWKiorSihUrtG/fPi1atEgpKSm+o9TExEStW7dORUVFWr9+vXbu3KnWrVtr3rx59bZTVlamwsJCrV69WkVFRcrOztayZcu0e/du3X///ZoxY0a99YuLi7VlyxZt27ZNO3bs0M6dOyVJO3bs0NKlS7VgwQLt3btXWVlZatOmjerq6jRp0iTFxsaqpKREa9as0Zo1a3zPA5oigokm7cyZM2rbtq2Cg//75/pRo0YpPj5ePXr0UFlZmbxerwoLCzVjxgyFh4erc+fOGjNmjLZt2yZJ6tq1q3r37q2QkBC1a9dOY8aMUVlZ2Q3N1aFDB507d65Bz924caN+/vOfq2PHjgoJCVF6errefvvteqdfn376aYWGhuq2227Txo0bNX78eHXr1k3BwcGaOHGiDh48WO8oMy0tTa1atVJUVJQSEhL00UcfSbp8ZPrUU0+pR48ecrlc6tq1q6Kjo/X3v/9dp0+fVnp6ukJCQhQTE6PHH39chYWFN/S+ADczLvpBk9amTRudOXNGly5d8kVz48aNkqQHH3xQdXV1OnPmjGpraxUVFeV7XlRUlNxutyTp1KlTev755/X+++/L4/HIcRy1atXqhuZyu91q3bp1g557/PhxTZ48Wc2a/ff33WbNmunf//63b7ljx4711s/MzNSCBQt8jzmOI7fbrejoaElS+/btfd9r2bKlPB6PJOnEiRPq0qXLFTOUl5eroqJC8fHxvse8Xm+9ZaCpIZho0uLi4hQSEqJ3331XAwYMuOo6bdu2VfPmzXX8+HF961vfknQ5FJGRkZKkRYsWyeVy6a233lK
bNm1UVFR0xSnQL8Pj8Wj37t2aOHFig57fsWNHZWZm6v7777/ie8eOHZOkehcIderUSRMnTqx3Ja5Vp06ddPTo0as+3rlzZ73zzjtfeptAoOKULJq0Vq1aafLkyZo7d67++Mc/qrq6WnV1dTp48KA+++wzSVJQUJAGDhyoxYsXq7q6WuXl5fr973/vC4zH41FoaKgiIiLkdru1atWqBs1SU1OjAwcOaPLkyWrVqpUeffTRBm1n9OjR+t3vfuc7pXr69GkVFRVdc/1Ro0Zp5cqVOnz4sKTLFzjt2LHDtK8RI0YoJydHBw4ckOM4OnLkiMrLy9WjRw+FhYVp5cqVunDhgrxer/75z39q//79DXpNQCDgCBNNXlpamiIjI7Vq1SrNnj1bLVu2VExMjGbOnKm4uDhJ0q9//WvNnz9fycnJatGihR577DENHz5ckpSenq7Zs2crPj5eXbp0UWpqqnJzc837X716te8K0qioKCUlJWnJkiUKDQ1t0Ot58skn5TiOxo4dq4qKCt1+++0aPHiwkpOTr7r+ww8/LI/Ho+nTp6u8vFwRERFKTEzUoEGDvnBfgwYN0tmzZzVjxgxVVFQoOjpaL774oqKjo7VixQotWLBA/fr1U01Nje644w5NmzatQa8JCAQubiANAMAX45QsAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGwf4eAGhs6enpqqys9PcY11RVVSVJioiI8PMkga99+/ZatmyZv8dAE0EwccuprKzUSXeFnJAwf49yVa6azyRJ1Zc4AXQjXDUef4+AJoZg4pbkhITJc9/j/h7jqsI+3CRJN+18geLz9xH4qvArLAAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABo0WzKysLGVlZTXW7gAATVxjd6XRgllcXKzi4uLG2h0AoIlr7K5wShYAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAQXBj7aiqqkoXLlzQyJEjG2uXwFVVVlbKpSB/j4GvmevSRVVW/oefOU1YZWWlbrvttkbbH0eYAAAYNNoRZkREhCIiIvTGG2801i6Bqxo5cqROnPH4ewx8zZzgFmrfNoyfOU1YY5894AgTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgA
ABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgEFwY+0oKSmpsXYFALgFNHZXGi2YkyZNaqxdAQBuAY3dFU7JAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwCDY3wMA/uCq8Sjsw03+HuOqXDUeSbpp5wsUl9/HMH+PgSaEYOKW0759e3+PcF1VVXWSpIgIftjfmLCb/t8agcXlOI7j7yEAALjZ8TdMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBDM68jJyVFsbKxOnz7t71ECzoIFCzRw4EClpKRo8uTJOn/+vL9HChglJSUaMGCAHn74Ya1cudLf4wScEydO6Cc/+YkGDx6sIUOGaM2aNf4eKWB5vV4NGzZMEyZM8PcoNwWCeQ0nTpzQX//6V0VFRfl7lIDUu3dvFRQU6K233tI3v/lNZWdn+3ukgOD1ejVv3jytWrVK27dvV0FBgT7++GN/jxVQgoKClJGRocLCQr3xxhvasGED72EDrV27Vt26dfP3GDcNgnkNv/3tb/XMM8/I5XL5e5SA9IMf/EDBwcGSpJ49e+rkyZN+nigw7N+/X127dlVMTIxCQkI0ZMgQvfvuu/4eK6B06NBB3/72tyVJ4eHhuvPOO+V2u/08VeA5efKkiouLNWLECH+PctMgmFdRVFSkDh066J577vH3KE3C1q1b9eCDD/p7jIDgdrvVsWNH33JkZCQ/7G/AsWPHdPDgQd13333+HiXgZGZm6plnnlGzZmTic8H+HsBffvazn+nUqVNXPD5t2jRlZ2crJyfHD1MFluu9h8nJyZKkrKwsBQUFaejQoY09Hm5xHo9HU6ZM0S9/+UuFh4f7e5yA8qc//Unt2rVT9+7dtWfPHn+Pc9O4ZYOZm5t71ccPHTqkY8eOKTU1VdLl0xKPPvqoNm/erPbt2zfihDe/a72Hn8vLy1NxcbFyc3M5tW0UGRlZ7/S12+1WZGSkHycKTLW1tZoyZYpSUlLUv39/f48TcPbu3av33ntPJSUlunjxoqqrqzVz5ky9/PLL/h7Nr1yO4zj+HuJm1rdvX23ZskXt2rXz9ygBpaSkRC+88ILWr1/Pe/clXLp0SQMGDFBubq4iIyM1YsQILVy4UHfddZe/RwsYjuNo9uzZat26tebMmePvcQLenj17lJO
Tw4V7uoWPMPH1mj9/vmpqajRmzBhJ0n333ad58+b5eaqbX3BwsH7zm9/oqaeektfr1fDhw4nll/TBBx8oPz9fd999t+9M0fTp0/XQQw/5eTIEOo4wAQAw4PInAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABP8rIyND69eslSa+88ooKCwslSTU1NUpLS1NKSooyMzOvWAbQ+PjgAqCBvF6vgoKCvrLtTZ061ff1wYMHdfz4cW3fvl2S9OGHH9Zbtrp06ZLvrjEAbgz/k9DkffbZZ5o9e7Y+/vhjBQcH64477tArr7yiLVu2aO3atZKk5s2bKzs7W9/4xjf05ptvavXq1ZKkLl26aN68ebr99tuVl5enbdu2KSwsTEeOHNFLL72kmpoavfzyy/J4PJKkKVOmKCkp6ZqzuN1uzZo1S5WVlYqOjq53J4iMjAx1795diYmJmjlzpioqKpSamqohQ4Zo8+bNvuUJEyYoOTlZixcvVllZmWpqahQbG6vnnntOYWFhysjIUFBQkP71r3/J4/EoPz9ff/jDH7RhwwZ5vV6Fh4frueee05133qm8vDwVFBSoVatWOnz4sCIiIrR06VLf5yZnZ2eroKBALpdLoaGh2rBhg5o1a3bN7QFNmgM0ce+8844zduxY3/LZs2ed0tJSJzk52amoqHAcx3Gqq6udCxcuOIcOHXJ69+7tuN1ux3EcZ/Hixc7UqVMdx3GcrVu3Oj179nSOHDniOI7jnDt3zklNTfWt63a7nR/+8IfOuXPnrjlLenq6s3TpUsdxHOfo0aNOz549nXXr1jmO4zizZ8/2fV1aWuo88sgjvuf9/+Xly5c7y5cv9y2/+OKLzqJFi3zbeeSRRxyPx+M4juOUlZU5aWlpzsWLFx3HcZzi4mJn5MiRvtcUHx/vHD9+3HEcx5kzZ45vO3l5ec7jjz/uVFVVOY7jOKdPn/7C7QFNGUeYaPLuueceffLJJ5o7d66+973vKSkpScXFxUpNTfUdSYWFhUm6/EHTDz30kDp06CBJGjVqlO/zSCXpu9/9rrp06SJJ2rdvn44dO6a0tDTf910ul44cOaLvfOc7V51lz549+tWvfiVJiomJ0fe///0Gvab33ntP1dXVevvttyVd/pvn/96/deDAgQoNDfWt+9FHH+mxxx6TdPnDyc+fP1/vNXXq1EnS5c/83bVrl6TLt3gaPXq079ZYbdu2NW0PaKoIJpq8mJgYFRQUqLS0VCUlJVq8eLH69evXoG19HlbpcihiY2P12muvfVWjmjmOo2efffaawf08lp+vO3z48Hp/I/1fLVq08H0dFBQkr9f7hfu+3vaApoqrZNHknTx5UkFBQUpOTtYvfvELnT59Wvfee6/y8/N9N8D2eDy6ePGiEhIS9Oc//1mVlZWSpE2bNikxMfGq242Li9ORI0dUWlrqe2z//v1yrnM/g169emnr1q2SpE8//VS7d+9u0Gvq27evcnNzdeHCBUlSdXW1Pvnkk2uum5+f77vPptfr1YEDB75wH3369NHrr7+u6upqSdKZM2duaHtAoOMIE03eoUOHtHDhQklSXV2dxo8fr5SUFF24cEFjxoyRy+VSSEiIVqxYobvvvlszZ87U2LFjJV0+Or3Wbclat26tV199VS+99JIyMzNVW1urmJgYrVix4po3zJ4zZ45mzZqlgoICde7cWQkJCQ16TePHj9eyZcs0YsQIuVwuuVwupaenq1u3bles+8ADD2jatGmaNGmSvF6vamtrNXDgQHXv3v26+xg2bJjcbrdGjhyp4OBghYaG6rXXXmvw9oBAx+29AAAw4JQsAAAGnJIFvmIHDx5URkbGFY//+Mc/9l1ZCiDwcEoWAAADTskCAGBAMAEAMCCYAAAYEEwAAAz+DyOIaSXb0NeyAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#checking for outliers\n", "sb.boxplot(x=combo2['score_difference'])\n", "plt.title('Goal Difference')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAcwAAAFsCAYAAABBx4loAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAWTUlEQVR4nO3df2xV9f3H8ddtyy2hMGNRSlWCAyYKCC0UhFLoqCCsyOpoactw2XSE0FGarZafIoMuICtLTfgRkZnwwzEGwwKLzkwES4lQEImtGzIaFkFLaUuLrre19Mc9+4N4v18U57s47vXW5yMx6f11Pu97JH1yTi89LsdxHAEAgP8qJNADAAAQDAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgIeklJSTp69Oh19xUVFWnmzJkBmuia2bNnKzY2VrGxsRo8eLCGDBniu718+fKAzFRRUaEnn3xSo0aNUlxcnKZPn67Dhw8HZBYg2IQFegCgs3rxxRd9Xy9evFhRUVH61a9+FcCJpLlz52rmzJnatGmTJOm9997T//p3l7S1tSksjG8t6Hw4wsS3wrlz5/STn/xEcXFxmjp1qg4ePOh7bPHixVqxYoXviDAzM1O1tbVatWqVRo4cqSlTpuj06dO+51dXV2v+/PkaPXq0kpKStH379g7P8+abbyolJUVxcXHKzMzUmTNnfI9t3rxZEydOVGxsrJKTk3XgwAHfY0VFRcrMzNTq1asVFxenhx9+WKdOnVJRUZESExM1ZswY7d2794Zr1tfX66OPPlJ6errcbrfcbrdGjBihuLg433PeeOMNpaSkaPjw4Zo4caJKSkp873nu3LkaNWqUJk2apN27d/tes379euXk5CgvL0/Dhw/X3r171dDQoKVLlyohIUHjxo3Tc889p/b29g7vJ+CbhGCi02ttbdXcuXM1duxYHT16VMuWLVNeXp7+9a9/+Z7z2muv6Ze//KVKS0vldruVkZGhwYMHq7S0VJMnT9azzz4rSfJ6vcrKytLAgQNVUlKibdu2adu2bTpy5Ih5ntOnT2vp0qXKz8/X8ePHlZGRoV/84hdqaWmRJPXp00c7duzQO++8o+zsbC1YsEA1NTW+15eXl2vgwIE6fvy4Hn30UeXm5uq9997TgQMHtHbtWuXn56uxsfEL695+++3q27evFixYoDfeeEOXL1++7vHy8nItWrRICxcu1MmTJ7Vjxw7dfffdkqTc3Fz17t1bR44c0bp161RYWKhjx475Xnvw4EFNmTJFJ0+e1LRp07R48WKFhYXp9ddf1759+/TWW2/pz3/+s3kfAd9EBBOdwrx58xQXF+f7b+XKlb7HysrK1NTUpDlz5sjtdmvMmDGaMGGCXn31Vd9zJk2apCFDhig8PFyTJk1SeHi4HnvsMYWGhio5OVnvv/++pGunMOvr65WdnS23260+ffooPT1df/3rX82z7tq1SxkZGRo2bJhCQ0P1ox/9SF26dNG7774rSfrBD36gqKgohYSEKDk5WX379lV5ebnv9ffcc49SU1N9s1VVVWnevHlyu91KSEiQ2+3WhQsXvrCuy+XS9u3bdffdd2vNmjVKSEjQrFmz9MEHH0iS9uzZo9TUVI0dO1YhISGKiopS//79VVVVpVOnTikvL0/h4eF64IEHNGPGDO3fv9+37ZiYGE2cOFEhISHyeDw6fPiwli5dqm7
duqlnz5762c9+dt3+BoIRP2hAp7Bx40bFx8f7bhcVFfmOaGpqatS7d2+FhPzf3w/vuusuVVdX+2737NnT93XXrl11xx13XHe7qalJklRZWamamprrTmO2t7dfd/urXLx4Ufv27dMf/vAH332tra2+o8h9+/Zpy5YtqqyslCQ1NTXpypUrXzqrpOvmDQ8Pv+ERpiT17t3b94GjqqoqPfPMM1q0aJF27dqlqqoqJSYmfuE1NTU1uu2229S9e3fffXfddZf+/ve/X7fd///+2tralJCQ4LvP6/UqOjr6v+0W4BuPYKLT69Wrly5duiSv1+uLZlVVle69994Obys6Olr33HOPXn/99ZueJzo6WnPnzlVWVtYXHqusrNSyZcu0detWxcbGKjQ0VCkpKTe91lfNMWvWLOXm5vpu3+jItFevXvrkk0/k8Xh80ayqqlJUVJTvOS6Xy/d179695Xa7VVpayod/0KlwShad3tChQ9W1a1e9+OKLam1t1fHjx3Xo0CElJyff1LYiIiK0efNmNTc3q729XWfPnr3ulOlXmTFjhv70pz+prKxMjuOoqalJxcXF8ng8+vTTT+VyuRQZGSlJevnll1VRUdHhOW/kk08+0bp163T+/Hl5vV7V19fr5ZdfVkxMjCQpLS1NRUVFOnbsmLxer6qrq3Xu3DlFR0crNjZWhYWFunr1qs6cOaM9e/bohz/84Q3X6dWrl8aOHas1a9bI4/HI6/XqwoULOnHixP/kfQCBQjDR6bndbm3atEklJSUaPXq0Vq5cqYKCAvXv37/D2woNDdWmTZt05swZPfzwwxo9erSWLVsmj8dj3saDDz6o3/zmN8rPz9fIkSP1yCOPqKioSJI0YMAAPfnkk8rMzFR8fLzOnj2r4cOHd3jOG+nSpYsqKyv1xBNPaMSIEZo2bZrcbrfWrFkj6dpfBp599lmtXr1aI0aM0OOPP66LFy9KkgoLC1VZWalx48YpOztb8+fPv+4U+OcVFBSotbVVycnJGjlypHJyclRbW/s/eR9AoLi4gDQAAF+NI0wAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAzC/LVQdna2amtr/bVcp9XQ0CBJ6tGjR4AnCW533nmnNmzYEOgxAAQRvwWztrZWl6pr5Lgj/LVkp+Rq+VSS5Gnj5MDNcrU0BnoEAEHIb8GUJMcdocZh6f5cstOJKNstSezHr+GzfQgAHcFhCgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABmH+WqihoUGutlZ/LQcA6OSef/55SVJWVpZf1vPbEWZzc7PkbfPXcgCATq64uFjFxcV+W49TsgAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBB
MAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMwgI9AOBvrrarqq1tUkZGRqBHAfA11NbWqmvXrn5bjyNMAAAMOMLEt44TFq47b4/Qrl27Aj0KgK/B32eJOMIEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYBDmr4W6du2qxuZWfy0HAOjkvv/97/t1Pb8Fs0ePHvK0NfprOQBAJ5eVleXX9TglCwCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAIMwfy7mamlURNlufy7Z6bhaGiWJ/fg1XNuHEYEeA0CQ8Vsw77zzTn8t1ak1NHglST168A3/5kXw5xFAh7kcx3ECPQQAAN90/AwTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABmH+WqikpESrVq2S1+vVjBkzNGfOHH8t3WksWbJExcXF6tmzp1555ZVAjxN0qqqqtHDhQtXV1cnlcik9PV0//elPAz1W0Ll69apmzZqllpYWtbe3a/LkycrJyQn0WEGnvb1dqampioqK0gsvvBDocYJSUlKSIiIiFBISotDQUBUVFd3S9fwSzPb2duXn52vLli2KiopSWlqakpKSNGDAAH8s32lMnz5djz/+uBYtWhToUYJSaGioFi9erMGDB8vj8Sg1NVVjx47lz2EHud1ubdu2TREREWptbdWPf/xjjR8/XjExMYEeLahs375d/fv3l8fjCfQoQW3btm2KjIz0y1p+OSVbXl6uvn37qk+fPnK73Zo6daoOHjzoj6U7lZEjR+q2224L9BhBq1evXho8eLAkqXv37urXr5+qq6sDPFXwcblcioiIkCS1tbWpra1NLpcrwFMFl0uXLqm4uFhpaWmBHgUd4JdgVldXq3fv3r7bUVFRfKNCQH300Ud6//33NWzYsECPEpTa29uVkpKi+Ph4xcfHsx87aPXq1Vq
wYIFCQvgYydf185//XNOnT9euXbtu+Vr838K3TmNjo3JycrR06VJ179490OMEpdDQUO3fv1+HDx9WeXm5zp49G+iRgsabb76pyMhIDRkyJNCjBL2dO3dq7969+v3vf68dO3bo7bffvqXr+SWYUVFRunTpku92dXW1oqKi/LE0cJ3W1lbl5ORo2rRpeuSRRwI9TtD7zne+o4ceekhHjhwJ9ChB49SpUzp06JCSkpKUm5ur0tJS5eXlBXqsoPRZR3r27KlJkyapvLz8lq7nl2A++OCD+uCDD/Thhx+qpaVFr776qpKSkvyxNODjOI6efvpp9evXT0888USgxwla9fX1+ve//y1Jam5u1tGjR9WvX78ATxU8nnrqKZWUlOjQoUMqLCzU6NGj9bvf/S7QYwWdpqYm3wemmpqa9NZbb+l73/veLV3TL5+SDQsL0/LlyzV79mzfR6lv9RvrjHJzc3XixAlduXJF48eP1/z58zVjxoxAjxU03nnnHe3fv1/33XefUlJSJF3bp4mJiQGeLLjU1NRo8eLFam9vl+M4mjJliiZMmBDosfAtU1dXp3nz5km69jP1Rx99VOPHj7+la7ocx3Fu6QoAAHQCfOgHAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEzgBgYOHKjGxsZAjwHgG4RgAt9iXq9X/FNswMZvF5AGgs1LL72kAwcO6OOPP9bChQs1efJkSdcuhl5YWKj29nZFRkYqPz9fffv21fHjx7Vq1SoNHTpUZWVlCgsLU0FBgTZs2KCKigpFR0dr/fr16tatm1paWvTcc8/p7bffVktLiwYOHKgVK1b4Lpv1eXV1dXrqqadUV1cnSRozZoyWLl0qSXrhhRf0yiuvyOVyqVu3bvrjH/+okJAQbd68WX/5y18kXfv1lMuWLVNERITWr1+viooKeTweXbx4Ubt27dK7776r559/Xi0tLerSpYuWLFnC9S2Bz3MAfMF9993nvPTSS47jOM7JkyedhIQEx3Ec5/Lly85DDz3kVFRUOI7jOLt373bS0tIcx3Gc0tJSZ9CgQc7p06cdx3GcFStWOOPGjXOqqqocx3Gc2bNnO7t373Ycx3E2btzobNy40bdeQUGBU1hY+KXzbNmyxXnmmWd8tz/++GPHcRynqKjISU9PdxoaGhzHcZz6+nrHcRynuLjYmTp1qtPQ0OB4vV5nwYIFTkFBgeM4jrNu3TonMTHRqaurcxzHcc6fP3/dNs6ePeskJibe1H4DOjOOMIEvkZycLEmKiYlRTU2Nrl69qrKyMt1///0aMGCAJCk1NVUrV670/RLo7373u3rggQckSYMGDdLFixd914IdPHiwzp8/L0k6dOiQPB6P/va3v0mSWlpadP/993/pLMOGDdPWrVv129/+VqNGjVJCQoKka5eKmjlzpu8yZbfffrsk6dixY0pOTvbdn56ertWrV/u2N378eN9V6o8cOaILFy5o1qxZvsfb2tp0+fJl3XHHHTe9/4DOhmACXyI8PFzStWs/Stci8lXcbrfv69DQUN82Prt99epVSdeunPLrX/9aY8aMMc0SGxurvXv36ujRo9q/f782b96snTt3mt/L533+1O+4ceNUUFBw09sDvg340A/QATExMTpz5ozOnTsnSdq7d68GDRrU4QtRJyUlaevWrWpubpYkeTwe3zZv5MMPP1T37t01depULVmyRP/4xz/k9Xo1YcIE7dy503eEe+XKFUnXfsb52muvyePxyHEc7dmzR/Hx8Tfc9tixY3XkyBFVVFT47rvV1xUEghFHmEAHREZGqqCgQHl5eWpra1NkZKTWrl3b4e3MmTNHGzZsUFpamlwul1wul7Kzs9W/f/8bPv/EiRPaunWrQkJC5PV6tXLlSoWEhOixxx5TdXW1MjIyFBYWpm7dumnHjh1KTEzUP//5T2VmZkqShgwZoqysrBtu+95779XatWv19NNPq7m5Wa2trRo+fLiGDh3a4fcFdGZc3gsAAAN
OyQIAYMApWeAbZPny5SorK7vuvtDQUBUVFQVoIgCf4ZQsAAAGnJIFAMCAYAIAYEAwAQAwIJgAABj8B6t9FHMCkD0DAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#checking for outliers\n", "sb.boxplot(x=combo2['home_score'])\n", "plt.title('Home Team Score')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAcwAAAFrCAYAAABcwrnQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAXmElEQVR4nO3deXDUhd3H8U/ImoQmWIEJESxooQJW7hwkhKMJgxjIEowBWiDDCAhhuGYAJaUcLWNpRUqnA4wclsNRkRRLEJCx5YyIQkIYaDsgDNAqRw4OyWXYHL/nD8Z9hgfw+VJh143v14wzSTab7zc/Ie/8frvDBjmO4wgAAHyjRv5eAACAQEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAgcvfCwD3W2Zmpk6ePKmPP/5YISEhftlh8ODBunjxoiSpurpaLpdLLtfNv24TJ05UVlaWz3cqKCjQkiVLdPr0aQUHB6tt27aaM2eOunTp4vNdgEBEMNGgnD9/XgUFBWrSpIl2796tlJQUv+yxY8cO79uZmZkaMmSIhg0b5pddJKmiokJZWVn69a9/rZSUFNXU1KigoOC+/0JRV1en4ODg+/o1ge8KLsmiQcnNzVXXrl313HPPKTc31/vx2bNna+3atZKk4uJidejQQW+//bYk6fPPP1dcXJzq6+t1/fp1TZw4UfHx8YqNjdXEiRNVVFQkSdq5c6fS09Nvmbdu3TpNmjTpnnbcvHmzUlJSFBsbq3HjxunChQve21555RX169dPPXr0UHp6ugoKCry3LVu2TNOmTdOsWbPUvXt3ud1unTt3TqtWrVJCQoL69eunAwcO3HHmuXPnJEmpqakKDg5WWFiYevfurY4dO3o/JycnRykpKerevbsGDRqkf/3rX5KkM2fOKDMzUzExMRo8eLB2797tvU92drYWLFigF198Ud26ddOhQ4dUXFysqVOnKj4+XsnJyXrzzTfv6fgA31UEEw3K1q1b5Xa75Xa7deDAAV2+fFmSFBsbq8OHD0uSDh8+rNatWys/P9/7fnR0tBo1aqT6+nqlp6dr79692rt3r0JDQ7Vw4UJJUv/+/XX+/HmdOXPmlnlDhw4177dr1y6tWrVKy5cv1yeffKLo6GjNnDnTe3vnzp2Vm5urw4cPKzU1VdOnT9eNGze8t+/du1dpaWnKz8/XU089pXHjxqm+vl55eXmaPHmy5s+ff8e5P/7xjxUcHKzZs2dr//79un79+i2379y5U8uWLdOrr76qwsJCvf7663rkkUdUU1OjrKwsJSYm6uDBg5o7d65mzZqls2fPeu+7fft2ZWVlqbCwUN27d9ekSZPUoUMH5eXlacOGDdqwYYM++ugj8zECvqsIJhqMgoICXbx4USkpKerUqZNat26t7du3S5Li4uJ05MgR1dfXKz8/X+PHj1dhYaEkKT8/X3FxcZKkpk2bauDAgWrcuLEiIiI0adIkb1hDQkKUkpKi999/X5J0+vRpXbhwQUlJSeYd3333XU2YMEHt2rWTy+VSVlaWTpw44T3LTEtLU9OmTeVyuTR27Fh5PB7v2aEkxcTEqE+fPnK5XHr22Wd17do1TZgwQQ899JAGDRqkCxcuqKys7La5EREReueddxQUFKR58+YpISFBWVlZ3l8oNm/erPHjx6tLly4KCgrS448/rscee0zHjh1TVVWVJkyYoJCQECUkJCgpKemWS
879+/f3/sJx6tQpXb16VVOmTFFISIhat26t4cOH64MPPjAfI+C7imCiwcjNzVViYqKaNWsm6eblxy1btkiS2rRpo8aNG+vEiRM6cuSIkpKS1KJFC509e1b5+fmKjY2VJH311VeaP3++kpKS1KNHD40aNUplZWWqq6uTJD333HPatm2bHMfR1q1blZKSck+PA168eFGLFi1STEyMYmJiFBcXJ8dxVFxcLEn685//rJSUFEVHRysmJkbl5eW6du2a9/7Nmzf3vh0WFqamTZt6HzMMCwuTJFVVVd1xdrt27fT73/9eeXl52rZtm0pKSrRo0SJJ0qVLl9SmTZvb7lNSUqJHH31UjRr974+KVq1aefeVpJYtW3rfvnDhgkpKSrzfX0xMjFauXOkNMxDIeNIPGoTq6mrt3LlT9fX1SkxMlCR5PB6VlZXp5MmT6tixo2JjY/Xhhx+qpqZGUVFRio2NVW5urq5fv66nnnpKkrR27VqdO3dOOTk5ioyM1IkTJzR06FB9/aI+3bp100MPPaSCggJt375dS5Ysuac9W7ZsqaysLA0ZMuS22woKCvTGG29o/fr1evLJJ9WoUSPFxsbqQbygULt27ZSenq5NmzZ59/r8889v+7wWLVqoqKhI9fX13mheunRJTzzxxB2/bsuWLfWjH/1If/vb3+77zoC/cYaJBmHXrl0KDg7Wjh07lJubq9zcXH3wwQeKiYnxPvknLi5Ob731lmJiYiRJPXv21FtvvaXo6GjvWVplZaVCQ0P18MMP68svv9Ty5ctvmzV06FAtXLhQLpfL+7Wsfv7zn2v16tU6ffq0JKm8vFw7d+70zg4ODlazZs1UW1ur5cuXq6Ki4r89JLc4c+aM1q5d630C06VLl7R9+3Z17dpVkpSRkaG1a9fqn//8pxzH0X/+8x9duHBBXbp0UVhYmN544w3V1NTo0KFD2rNnjwYNGnTHOV26dFF4eLhWr16t6upq1dXV6dSpUzp+/Ph9+T4AfyKYaBC2bNmi9PR0tWrVSpGRkd7/Ro0apW3btqm2tlaxsbGqrKz0Xn6Njo5WdXX1LdEbM2aMbty4ofj4eI0YMUJ9+vS5bVZaWppOnz59x7PE/8+AAQM0fvx4zZgxQz169FBqaqry8vIkSb1791afPn00cOBAJScnKzQ09JbLnd9GRESEjh07pmHDhqlbt24aPny42rdvr+zsbElSSkqKsrKyNHPmTPXo0UOTJ0/W9evXFRISopUrVyovL0/x8fH6zW9+o8WLF6tdu3Z3nBMcHKyVK1fq5MmT6t+/v+Lj4zV37tz7Fn7An4J4AWng3lRXVyshIUFbtmy566VJAA0PZ5jAPdq4caM6d+5MLIHvGZ70A9yD5ORkOY6jFStW+HsVAD7GJVkAAAy4JAsAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAxcvho0ZcoUlZaW+mpcg1VeXi5JatKkiZ83CWyRkZFavny5v9cAEEB8FszS0lIVFZfICQn31cgGKcjzlSSpopaLA/+tIE+lv1cAEIB8FkxJckLCVdl1uC9HNjjhx3IkieP4LXx9DAHgXnCaAgCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEA
MCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJgAABgQTAAADAgmAAAGBBMAAAOCCQCAgctXg8rLyxVUW+OrcQCABu7111+XJE2aNMkn83x2hlldXS3V1/pqHACggdu3b5/27dvns3lckgUAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBg4PL3AoCvBdXeUGlplUaMGOHvVQB8C6WlpQoLC/PZPM4wAQAw4AwT3zuOK1SRTcO1adMmf68C4Fvw9VUizjABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGLh8NSgsLEyV1TW+GgcAaOB+9rOf+XSez4LZpEkTVdRW+mocAKCBmzRpkk/ncUkWAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABgQTAAADggkAgAHBBADAgGACAGBAMAEAMCCYAAAYEEwAAAwIJgAABi5fDgvyVCr8WI4vRzY4QZ5KSeI4fgs3j2G4v9cAEGB8FszIyEhfjWrQysvrJUlNmvAD/78Xzp9HAPcsyHEcx99LAADwXcdjmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADAgmAAAGBBMAAAMCCYAAAYEEwAAA4IJAIABwQQAwIBgAgBgQDABADBw+WpQXl6efvvb36q+vl7Dhg3ThAkTfDW6wfjlL3+pffv2qXnz5tq+fbu/1wk4ly5d0ssvv6wrV64oKChIw4cP15gxY/y9VsC5ceOGRo0aJY/Ho7q6Og0cOFDTpk3z91oBp66uTs8//7yioqK0atUqf68TkJKTkxUeHq5GjRopODhYf/3rXx/oPJ8Es66uTgsXLtS6desUFRWljIwMJScn6yc/+
YkvxjcY6enpGj16tGbPnu3vVQJScHCwsrOz9fTTT6uiokLPP/+8EhMT+XN4j0JCQrRhwwaFh4erpqZGI0eOVN++fdWtWzd/rxZQ3nzzTbVr104VFRX+XiWgbdiwQc2aNfPJLJ9ckj1+/Lgef/xxtW7dWiEhIRo8eLB2797ti9ENSmxsrH74wx/6e42A1aJFCz399NOSpIiICLVt21bFxcV+3irwBAUFKTw8XJJUW1ur2tpaBQUF+XmrwFJUVKR9+/YpIyPD36vgHvgkmMXFxXr00Ue970dFRfGDCn51/vx5nThxQl27dvX3KgGprq5OaWlp6tWrl3r16sVxvEeLFi3SSy+9pEaNeBrJtzVu3Dilp6dr06ZND3wW/7fwvVNZWalp06Zpzpw5ioiI8Pc6ASk4OFhbt27V/v37dfz4cZ06dcrfKwWMvXv3qlmzZurUqZO/Vwl4Gzdu1JYtW7RmzRq9/fbbys/Pf6DzfBLMqKgoFRUVed8vLi5WVFSUL0YDt6ipqdG0adPkdrv1zDPP+HudgPfwww+rZ8+e+uijj/y9SsAoLCzUnj17lJycrBkzZujTTz/VrFmz/L1WQPq6I82bN9eAAQN0/PjxBzrPJ8Hs3Lmz/v3vf+uLL76Qx+PRjh07lJyc7IvRgJfjOPrVr36ltm3b6oUXXvD3OgHr6tWrKisrkyRVV1fr4MGDatu2rZ+3ChwzZ85UXl6e9uzZo6VLlyo+Pl5Llizx91oBp6qqyvuEqaqqKn388cd68sknH+hMnzxL1uVyaf78+Ro/frz3qdQP+htriGbMmKHDhw/r2rVr6tu3r6ZOnaphw4b5e62AceTIEW3dulXt27dXWlqapJvHtF+/fn7eLLCUlJQoOztbdXV1chxHzz77rJKSkvy9Fr5nrly5osmTJ0u6+Zh6amqq+vbt+0BnBjmO4zzQCQAANAA86QcAAAOCCQCAAcEEAMCAYAIAYEAwAQAwIJjA91Rtba2/VwACCsEEvsHMmTOVnp4ut9utyZMn6/r165oxY4Z27twpSVqzZo2io6NVV1cnSRo0aJDOnTun0tJSZWZmKj09XYMHD9bixYsl3XxprN69e6ukpMQ745VXXtHKlSvvusOuXbvkdruVlpam1NRUHTp0SNLNfzFr6tSpcrvdcrvd3peIunz5siZPnuz9eG5urvdrJScna8mSJcrIyND8+fPl8Xj06quvKiMjQ0OGDNFLL72kysrK+3oMgQbDAXBXV65c8b69dOlS57XXXnNycnKcefPmOY7jOGPHjnVGjBjhHD161CkuLnb69evnOI7jVFdXOxUVFY7jOI7H43EyMzOd/fv3O47jOK+99pqzbNkyx3Ecp6KiwomPj3cuX7581x3cbrdTWFjoOI7j1NbWOuXl5Y7jOM7o0aOdNWvW3Lbr9OnTnT/+8Y+O4zhOcXGxk5iY6Hz22WeO4zhOUlKSs2DBAu99VqxY4axYscL7/uLFi52lS5fe20ECvid89gLSQCDaunWrtm3bppqaGlVVVemJJ57QiBEjtHr1ank8HhUVFWncuHE6ePCgWrVqpZ49e0q6+S+PLF68WEePHpXjOLp8+bJOnjypvn37atSoURo1apSysrL0/vvvKzExUc2bN7/rDvHx8frd736nZ555Rn379lX79u1VWVmpo0ePat26dd7P+/o1AT/55BNlZ2dLuvmSZv369dOhQ4fUvn17SdLQoUO999mzZ48qKir04YcfSpI8Ho86dux4X48h0FAQTOAuCgoKtHHjRr377rtq1qyZtm3bppycHLVu3Vr19fXasWOHunXrpoSEBL388st67LHHlJCQIElat26dysrK9Je//EWhoaGaN2+ebty4IUlq2bKlOnXqpN27d+udd97RwoULv3GPOXPm6LPPPtOnn36q6dOn64UXXtDgwYP/6+/rBz/4gfdtx3G0YMEC794A7o7HMIG7KCsrU0REhB555BF5PB6999573tvi4+O1bNky9erVSy1bt
tSXX36pAwcOeMNTXl6uyMhIhYaGqri4+LYXTB89erQWLVokl8ul7t27f+MeZ8+eVYcOHTRmzBgNGTJE//jHPxQeHq7u3btr/fr13s+7evWqJCkhIUE5OTmSpNLSUu3fv1/x8fF3/NrJyclav369qqurJUkVFRU6c+bMvR0o4HuCYAJ30adPH7Vp00YDBw7U6NGj9dOf/tR7W0JCgi5evOgNUXR0tMLDw70vN5SZmanCwkKlpqZqzpw5t53BxcXFKTQ0VCNHjvx/9/jDH/6g1NRUpaWl6eDBg3rxxRclSUuWLPHOGDJkiDZv3ixJmjt3rk6ePCm3262xY8dq1qxZd32xgwkTJqhjx47KyMiQ2+3WyJEjCSZwF/zj64AffPHFF/rFL36hv//972rcuLG/1wFgwGOYgI/96U9/0nvvvafs7GxiCQQQzjCB74ArV65o7Nixt318wIABmjJlih82AvB/EUwAAAx40g8AAAYEEwAAA4IJAIABwQQAwIBgAgBg8D/23rVE66AYgQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#checking for outliers\n", "sb.boxplot(x=combo2['away_score'])\n", "plt.title('Away Team Score')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All outliers were effectively removed." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Friendly 3682\n", "Other competition 3425\n", "FIFA World Cup qualification 1984\n", "FIFA World Cup 67\n", "Name: tournament, dtype: int64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# checking the count for tournament types\n", "combo2.tournament.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before the dependent variables can be used to make predictions, we need to get dummy variables for the categorical columns. Due to the high cardinality of the team names, we will only encode the tournament type." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
home_scoreaway_scoreneutralyearmonthdayscore_differenceoutcomehome_team_rankhome_rank_changeaway_team_rankaway_rank_changetournament_FIFA World Cup qualificationtournament_Friendlytournament_Other competitionaway_teamhome_team
020False20061222896800010LibyaQatar
130False200612032320800001LibyaEgypt
210False200611212280800010LibyaTunisia
320False20061522320530010ZimbabweEgypt
410False200611412351530010ZimbabweMorocco
\n", "
" ], "text/plain": [ " home_score away_score neutral year month day score_difference \\\n", "0 2 0 False 2006 1 2 2 \n", "1 3 0 False 2006 1 20 3 \n", "2 1 0 False 2006 1 12 1 \n", "3 2 0 False 2006 1 5 2 \n", "4 1 0 False 2006 1 14 1 \n", "\n", " outcome home_team_rank home_rank_change away_team_rank \\\n", "0 2 89 6 80 \n", "1 2 32 0 80 \n", "2 2 28 0 80 \n", "3 2 32 0 53 \n", "4 2 35 1 53 \n", "\n", " away_rank_change tournament_FIFA World Cup qualification \\\n", "0 0 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 \n", "\n", " tournament_Friendly tournament_Other competition away_team home_team \n", "0 1 0 Libya Qatar \n", "1 0 1 Libya Egypt \n", "2 1 0 Libya Tunisia \n", "3 1 0 Zimbabwe Egypt \n", "4 1 0 Zimbabwe Morocco " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# encode categorical columns\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.width', None)\n", "rresults_dummies = pd.get_dummies(combo2.drop(['away_team', 'home_team'], 1), prefix_sep='_', drop_first=True)\n", "rresults_dummies[['away_team', 'home_team']] = combo2[['away_team', 'home_team']]\n", "rresults_dummies.head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
home_scoreaway_scoreneutralyearmonthdayscore_differenceoutcomehome_team_rankhome_rank_changeaway_team_rankaway_rank_changetournament_FIFA World Cup qualificationtournament_Friendlytournament_Other competitionaway_teamhome_team
948011False20186501126712910010LatviaLithuania
948110False2018661253-5550010PanamaNorway
948211False2018660178151-2010HungaryBelarus
948330False2018673214395-7010UzbekistanUruguay
948430False201867324066-2010AlgeriaPortugal
\n", "
" ], "text/plain": [ " home_score away_score neutral year month day score_difference \\\n", "9480 1 1 False 2018 6 5 0 \n", "9481 1 0 False 2018 6 6 1 \n", "9482 1 1 False 2018 6 6 0 \n", "9483 3 0 False 2018 6 7 3 \n", "9484 3 0 False 2018 6 7 3 \n", "\n", " outcome home_team_rank home_rank_change away_team_rank \\\n", "9480 1 126 7 129 \n", "9481 2 53 -5 55 \n", "9482 1 78 1 51 \n", "9483 2 14 3 95 \n", "9484 2 4 0 66 \n", "\n", " away_rank_change tournament_FIFA World Cup qualification \\\n", "9480 10 0 \n", "9481 0 0 \n", "9482 -2 0 \n", "9483 -7 0 \n", "9484 -2 0 \n", "\n", " tournament_Friendly tournament_Other competition away_team home_team \n", "9480 1 0 Latvia Lithuania \n", "9481 1 0 Panama Norway \n", "9482 1 0 Hungary Belarus \n", "9483 1 0 Uzbekistan Uruguay \n", "9484 1 0 Algeria Portugal " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# previewing last five rows of table with dummies\n", "rresults_dummies.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Exploratory Data Analysis" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "# creating a profile report for the combined data set\n", "from pandas_profiling import ProfileReport\n", "profile = ProfileReport(combo2, title='FIFA Matches and World Rankings Report')" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3f13d5aacf1940e5aa3f77dc2f0190ee", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, description='Summarize dataset', max=30.0, style=ProgressStyle(descrip…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "53a785b41e22443baa0321627d9df523", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, description='Generate report structure', max=1.0, style=ProgressStyle(…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "3cf602a4eae84204aaa30910de0954b5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, description='Render HTML', max=1.0, style=ProgressStyle(description_wi…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "profile.to_notebook_iframe()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "f9e198d99f1d4193b53a21617a6ecf2f", "version_major": 2, "version_minor": 0 }, "text/plain": [ 
"HBox(children=(FloatProgress(value=0.0, description='Export report to file', max=1.0, style=ProgressStyle(desc…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# save profile repot as html file\n", "profile.to_file(\"FIFA Matches and World Rankings Report.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Modelling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### a. Logistic Regression model" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
home_scoreaway_scoreneutralyearmonthscore_differenceoutcomehome_team_rankhome_rank_changeaway_team_rankaway_rank_changetournament_FIFA World Cup qualificationtournament_Friendlytournament_Other competitionaway_teamhome_team
772020False2015122216661940001Sri LankaIndia
772140True201512421506182-2001BangladeshAfghanistan
772203True201512-30188-8182-2001BangladeshBhutan
772331True201512221606188-8001BhutanMaldives
772403True201512-30188-81506001AfghanistanBhutan
\n", "
" ], "text/plain": [ " home_score away_score neutral year month score_difference outcome \\\n", "7720 2 0 False 2015 12 2 2 \n", "7721 4 0 True 2015 12 4 2 \n", "7722 0 3 True 2015 12 -3 0 \n", "7723 3 1 True 2015 12 2 2 \n", "7724 0 3 True 2015 12 -3 0 \n", "\n", " home_team_rank home_rank_change away_team_rank away_rank_change \\\n", "7720 166 6 194 0 \n", "7721 150 6 182 -2 \n", "7722 188 -8 182 -2 \n", "7723 160 6 188 -8 \n", "7724 188 -8 150 6 \n", "\n", " tournament_FIFA World Cup qualification tournament_Friendly \\\n", "7720 0 0 \n", "7721 0 0 \n", "7722 0 0 \n", "7723 0 0 \n", "7724 0 0 \n", "\n", " tournament_Other competition away_team home_team \n", "7720 1 Sri Lanka India \n", "7721 1 Bangladesh Afghanistan \n", "7722 1 Bangladesh Bhutan \n", "7723 1 Bhutan Maldives \n", "7724 1 Afghanistan Bhutan " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Logistic regression\n", "training1 = rresults_dummies[rresults_dummies['year']<2016]\n", "training1 = training1.drop(['day'],1)\n", "training1.tail()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "# assign features and target\n", "X = training1[['home_rank_change', 'away_rank_change', 'home_team_rank', 'away_team_rank', 'tournament_Friendly', 'tournament_FIFA World Cup qualification', 'tournament_Other competition']]\n", "y = training1['outcome']\n", "\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I used a test size of 10% in the split since I have a different set of matches that I will use for evaluation of model performance after training. 
" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "#tranform the data using standard scaler\n", "from sklearn.preprocessing import StandardScaler\n", "feature_scaler = StandardScaler()\n", "X_train = feature_scaler.fit_transform(X_train)\n", "X_test = feature_scaler.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# Fitting our model\n", "# \n", "from sklearn.linear_model import LogisticRegression\n", "\n", "LogReg = LogisticRegression(C=1)\n", "LogReg.fit(X_train, y_train)\n", "\n", "# Using the model to make predictions\n", "y_pred = LogReg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[125 0 69]\n", " [ 58 0 111]\n", " [ 50 0 333]]\n", "Accuracy 0.613941018766756\n" ] } ], "source": [ "# Evaluating the model\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.metrics import accuracy_score\n", "\n", "cm = confusion_matrix(y_test, y_pred)\n", "print(cm)\n", "print('Accuracy' , accuracy_score(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a test size of 0.1 from matches played between 2006 and 2016, my model predicted 61.4 % of the matches played correctly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Logistic Model Evaluation using validation data set" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
home_scoreaway_scoreneutralyearscore_differenceoutcomehome_team_rankhome_rank_changeaway_team_rankaway_rank_changetournament_FIFA World Cup qualificationtournament_Friendlytournament_Other competitionaway_teamhome_team
772521False2016121633153-3001AfghanistanIndia
772611False2016019110572010CameroonRwanda
772700False201601621572010CameroonUganda
772801True2016-101050572001CameroonAngola
772910False2016129110190001Ivory CoastRwanda
\n", "
" ], "text/plain": [ " home_score away_score neutral year score_difference outcome \\\n", "7725 2 1 False 2016 1 2 \n", "7726 1 1 False 2016 0 1 \n", "7727 0 0 False 2016 0 1 \n", "7728 0 1 True 2016 -1 0 \n", "7729 1 0 False 2016 1 2 \n", "\n", " home_team_rank home_rank_change away_team_rank away_rank_change \\\n", "7725 163 3 153 -3 \n", "7726 91 10 57 2 \n", "7727 62 1 57 2 \n", "7728 105 0 57 2 \n", "7729 91 10 19 0 \n", "\n", " tournament_FIFA World Cup qualification tournament_Friendly \\\n", "7725 0 0 \n", "7726 0 1 \n", "7727 0 1 \n", "7728 0 0 \n", "7729 0 0 \n", "\n", " tournament_Other competition away_team home_team \n", "7725 1 Afghanistan India \n", "7726 0 Cameroon Rwanda \n", "7727 0 Cameroon Uganda \n", "7728 1 Cameroon Angola \n", "7729 1 Ivory Coast Rwanda " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# selecting matches played in 2016 and beyond\n", "validation = rresults_dummies[rresults_dummies['year']>=2016]\n", "validation = validation.drop(['month', 'day'],1)\n", "validation.head()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[271 0 196]\n", " [130 1 328]\n", " [ 93 0 684]]\n", "Accuracy0.5613623018203171\n" ] } ], "source": [ "# assigning features and targets\n", "X_wc = validation[['home_rank_change', 'away_rank_change', 'home_team_rank', 'away_team_rank', 'tournament_Friendly', 'tournament_FIFA World Cup qualification', 'tournament_Other competition']]\n", "y_wc = validation['outcome']\n", "\n", "# transforming the data\n", "X_wc = feature_scaler.transform(X_wc)\n", "wc_pred = LogReg.predict(X_wc)\n", "\n", "# checking model accuracy\n", "cm = confusion_matrix(y_wc, wc_pred)\n", "print(cm)\n", "print('Accuracy' + str(accuracy_score(y_wc, wc_pred)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This validation tests the model on 'future' matches (2016-2018), and it accurately 
predicts 56.1% of the matches correctly which is slightly lower than 61.4% accuracy recorded in training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b. Performing PCA" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "#Assigning the features amd targets\n", "X = training1[['home_rank_change', 'away_rank_change', 'home_team_rank', 'away_team_rank', 'tournament_Friendly', 'tournament_FIFA World Cup qualification', 'tournament_Other competition']]\n", "y = training1['outcome']\n", "\n", "# Splitting the dataset into train and test\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I used a test size of 10% in the split since I have a different set of matches that I will use for evaluation of model performance after training. " ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "# scale the features using standard scaler\n", "from sklearn.preprocessing import StandardScaler\n", "sc = StandardScaler()\n", "X_train = sc.fit_transform(X_train)\n", "X_test = sc.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "# import PCA\n", "from sklearn.decomposition import PCA\n", "# train using 6 principal components\n", "pca = PCA(n_components=6)\n", "X_train = pca.fit_transform(X_train)\n", "X_test = pca.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "# import Random Forest Classifier for predictions\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "classifier = RandomForestClassifier(max_depth=4, random_state=9)\n", "classifier.fit(X_train, y_train)\n", "\n", "y_pred = classifier.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[104 0 90]\n", " [ 49 0 120]\n", " [ 38 0 345]]\n", "Accuracy 0.6018766756032171\n" ] } ], "source": [ "# check model accuracy\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.metrics import accuracy_score\n", "\n", "cm = confusion_matrix(y_test, y_pred)\n", "print(cm)\n", "print('Accuracy' , accuracy_score(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using PCA for feature reduction, and a test size of 0.1 from matches played between 2006 and 2016, my model predicted 60.2 % of the matches played correctly. This score was attained using 6 principal components." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model Evaluation" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "# Assigning features and targets\n", "X_wc = validation[['home_score', 'away_score', 'home_team_rank', 'away_team_rank', 'tournament_Friendly', 'tournament_FIFA World Cup qualification', 'tournament_Other competition']]\n", "y_wc = validation['outcome']\n", "\n", "#Transforming the features\n", "X_wc = sc.transform(X_wc)\n", "X_wc = pca.transform(X_wc)\n", "\n", "#making predictions\n", "wc_pred = classifier.predict(X_wc)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[288 0 179]\n", " [133 0 326]\n", " [ 42 0 735]]\n", "Accuracy0.6007046388725779\n" ] } ], "source": [ "# Checking model accuracy\n", "cm = confusion_matrix(y_wc, wc_pred)\n", "print(cm)\n", "print('Accuracy' + str(accuracy_score(y_wc, wc_pred)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For matches played between 2016 and 2018, my model correctly predicted 60% of the outcomes which is almost equal to the training score. 
On the recorded figures, the PCA model performs better on validation than the logistic regression model, which had an accuracy of 56.1%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c. Performing LDA" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "# Assigning the features and targets\n", "X = training1[['home_rank_change', 'away_rank_change', 'home_team_rank', 'away_team_rank', 'tournament_Friendly', 'tournament_FIFA World Cup qualification', 'tournament_Other competition']]\n", "y = training1['outcome']\n", "\n", "# Splitting the dataset into train and test\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=10)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "# scale the features using standard scaler\n", "from sklearn.preprocessing import StandardScaler\n", "sc = StandardScaler()\n", "X_train = sc.fit_transform(X_train)\n", "X_test = sc.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "# import LDA\n", "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n", "\n", "# train using 2 linear discriminants\n", "lda = LDA(n_components=2)\n", "X_train = lda.fit_transform(X_train, y_train)\n", "X_test = lda.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "# import Random Forest Classifier for predictions\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "classifier = RandomForestClassifier(max_depth=5, random_state=300)\n", "classifier.fit(X_train, y_train)\n", "y_pred = classifier.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[130 0 64]\n", " [ 61 0 108]\n", " [ 46 0 337]]\n", 
"Accuracy0.6260053619302949\n" ] } ], "source": [ "# check model accuracy\n", "cm = confusion_matrix(y_test, y_pred)\n", "print(cm)\n", "print('Accuracy' + str(accuracy_score(y_test, y_pred)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using 2 linear discriminants, my model predicted 62.6% of matches correctly which is the highest accuracy of all models in the training phase." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model Evaluation" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "# assigning features and targets\n", "X_wc = validation[['home_rank_change', 'away_rank_change', 'home_team_rank', 'away_team_rank', 'tournament_Friendly', 'tournament_FIFA World Cup qualification', 'tournament_Other competition']]\n", "y_wc = validation['outcome']\n", "\n", "X_wc = sc.transform(X_wc)\n", "X_wc = lda.transform(X_wc)\n", "wc_pred = classifier.predict(X_wc)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[273 0 194]\n", " [135 0 324]\n", " [ 90 0 687]]\n", "Accuracy0.5637110980622431\n" ] } ], "source": [ "# checking model accuracy\n", "cm = confusion_matrix(y_wc, wc_pred)\n", "print(cm)\n", "print('Accuracy' + str(accuracy_score(y_wc, wc_pred)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the model performed relatively highly in the training phase, its accuracy drops to 56.3% which is almost equal to the perfomaance of the logistic regression model on the validation data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### d. 
Polynomial regression model (Home Team Score)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "# assigning features and target\n", "X = training1[['home_team_rank', 'away_team_rank', 'home_rank_change', 'away_rank_change','tournament_Friendly', 'tournament_FIFA World Cup qualification', 'tournament_Other competition']]\n", "y = training1['home_score']" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.183756378945109" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit linear regression to the data set for comparison with polynomial regression\n", "from sklearn.linear_model import LinearRegression\n", "\n", "reg_line = LinearRegression()\n", "reg_line.fit(X,y)\n", "\n", "# check score (R-squared on the data used for fitting)\n", "reg_line.score(X,y)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.16485339296207635" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit polynomial model\n", "# degree=1 adds only a bias column, so this fit is equivalent to a linear model\n", "from sklearn.preprocessing import PolynomialFeatures\n", "poly_reg = PolynomialFeatures(degree=1)\n", "X_poly = poly_reg.fit_transform(X)\n", "\n", "X_poly_train, X_poly_test, y_train, y_test = train_test_split(X_poly,y, test_size = 0.1, random_state=10)\n", "\n", "pol_reg = LinearRegression()\n", "model = pol_reg.fit(X_poly_train, y_train)\n", "score = model.score(X_poly_test, y_test)\n", "\n", "# check score (R-squared on the held-out test split)\n", "score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### e. 
Polynomial regression model (Away Team Score)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "# assigning features and target\n", "X = training1[['home_team_rank', 'away_team_rank', 'tournament_Friendly', 'tournament_FIFA World Cup qualification', 'tournament_Other competition']]\n", "y = training1['away_score']" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1398407211283721" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit linear regression to the data set for comparison with polynomial regression\n", "from sklearn.linear_model import LinearRegression\n", "\n", "reg_line = LinearRegression()\n", "reg_line.fit(X,y)\n", "\n", "# check score (R-squared on the data used for fitting)\n", "reg_line.score(X,y)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.11769616259646476" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit polynomial model\n", "# degree=1 adds only a bias column, so this fit is equivalent to a linear model\n", "from sklearn.preprocessing import PolynomialFeatures\n", "poly_reg = PolynomialFeatures(degree=1)\n", "X_poly = poly_reg.fit_transform(X)\n", "\n", "X_poly_train, X_poly_test, y_train, y_test = train_test_split(X_poly,y, test_size = 0.15, random_state=300)\n", "\n", "pol_reg = LinearRegression()\n", "model = pol_reg.fit(X_poly_train, y_train)\n", "score = model.score(X_poly_test, y_test)\n", "score" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "combo2.plot(x='home_team_rank', y='score_difference', style='o')\n", "plt.ylabel('Score difference')\n", "plt.title('Home Team Rank vs Score Difference')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary and Recommendations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this project, I was tasked with creating three models that predict the outcome of a football match between national teams. All models were trained on the team ranks and tournament type and they gave different predictions. The logistic regression model predicted the match outcome as win, draw, or loss relative to the home team. The polynomial regression models predicted the match scores for the home and away teams. The logistic regression model had a 56% accuracy. After performing feature engineering, I performed linear discriminant analysis and used a random forest classifier to make predictions which had no effect on the models performance on the validation data set with a 56% accuracy level. With feature engineering, principal component analysis produced the highest accuracy scores with 60% on the validation data set.\n", "\n", "As evidenced by the accuracy scores and the profile report, linear regression and polynomial regression models do not suit the data. To predict the the scores of different teams, I would recommend a Poisson model which predict the goals scored by each team. This model will consider the average number of goals scored by each time over the period identified and predict the likelihood of future scores deviating from the average based on changes in the team's performance based on their ranks.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Challenging the result\n", "A major challenge I experienced was regarding the high cardinality of the categorical columns. 
While training the models, encoding columns such as tournament type beyond the World Cup, World Cup qualification, and friendly categories was not feasible, since there are many distinct tournaments and team names. This was a major challenge since my computer's memory was unable to complete the computation.\n", "\n", "I tuned the model parameters manually, which helped to improve the scores. However, systematic hyperparameter tuning could yield better scores.\n", "\n", "To improve the scores further, there is a need to gather more information about the different national teams, including player details, player form, coaching staff, and team tactics, among others, which would give a clearer indication of team performance than team ranks alone.\n", "\n", "I will also look to create a Poisson regression model to predict the match scores, as it is better suited to the task." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }