{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ELO Team Ratings Data Parsing\n",
"##### Notebook to parse team rating data from landed from [Club ELO](http://clubelo.com/) through their API.\n",
"\n",
"### By [Edd Webster](https://www.twitter.com/eddwebster)\n",
"Notebook first written: 02/09/2021 \n",
"Notebook last updated: 04/09/2021\n",
"\n",
"![title](../../img/logos/elo.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"\n",
"\n",
"## Introduction\n",
"This notebook engineers transfer data data from [Club ELO](http://clubelo.com/), and manipulates this landed data as DataFrames using [pandas](http://pandas.pydata.org/) and [matplotlib](https://matplotlib.org/) for visualisation.\n",
"\n",
"For more information about this notebook and the author, I'm available through all the following channels:\n",
"* [eddwebster.com](https://www.eddwebster.com/);\n",
"* edd.j.webster@gmail.com;\n",
"* [@eddwebster](https://www.twitter.com/eddwebster);\n",
"* [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);\n",
"* [github/eddwebster](https://github.com/eddwebster/);\n",
"* [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);\n",
"* [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and\n",
"* [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).\n",
"\n",
"![title](../../img/fifa21eddwebsterbanner.png)\n",
"\n",
"The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/A%29%20Web%20Scraping/TransferMarkt%20Web%20Scraping%20and%20Parsing.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"\n",
"\n",
"## Notebook Contents\n",
"1. [Notebook Dependencies](#section1) \n",
"2. [Project Brief](#section2) \n",
"3. [Data Sources](#section3) \n",
" 1. [Introduction](#section3.1) \n",
" 2. [Data Dictionary](#section3.2) \n",
" 3. [Creating the DataFrame](#section3.3) \n",
" 4. [Initial Data Handling](#section3.4) \n",
" 5. [Export the Raw DataFrame](#section3.5) \n",
"4. [Data Engineering](#section4) \n",
" 1. [Introduction](#section4.1) \n",
" 2. [Columns of Interest](#section4.2) \n",
"5. [Exploratory Data Analysis (EDA)](#section5) \n",
" 1. [...](#section5.1) \n",
" 2. [...](#section5.2) \n",
" 3. [...](#section5.3) \n",
"6. [Summary](#section6) \n",
"7. [Next Steps](#section7) \n",
"8. [Bibliography](#section8) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"\n",
"\n",
"## 1. Notebook Dependencies\n",
"This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:\n",
"* [`Jupyter notebooks`](https://jupyter.org/) for this notebook environment with which this project is presented;\n",
"* [`NumPy`](http://www.numpy.org/) for multidimensional array computing;\n",
"* [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation; and\n",
"* [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations.\n",
"\n",
"All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import Libraries and Modules"
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setup Complete\n"
]
}
],
"source": [
"# Python ≥3.5 (ideally)\n",
"import platform\n",
"import sys, getopt\n",
"assert sys.version_info >= (3, 5)\n",
"import csv\n",
"\n",
"# Import Dependencies\n",
"%matplotlib inline\n",
"\n",
"# Math Operations\n",
"import numpy as np\n",
"import math\n",
"from math import pi\n",
"\n",
"# Datetime\n",
"import datetime\n",
"from datetime import date, timedelta\n",
"import time\n",
"\n",
"# Data Preprocessing\n",
"import pandas as pd\n",
"#import pandas_profiling as pp\n",
"import os\n",
"import re\n",
"import random\n",
"from io import BytesIO, StringIO\n",
"from pathlib import Path\n",
"\n",
"# Reading directories\n",
"import glob\n",
"import os\n",
"from os.path import basename\n",
"\n",
"# Flatten lists\n",
"from functools import reduce\n",
"\n",
"# Working with JSON\n",
"import json\n",
"from pandas.io.json import json_normalize\n",
"\n",
"# Web Scraping\n",
"import requests\n",
"from requests.adapters import HTTPAdapter\n",
"from requests.packages.urllib3.util.retry import Retry\n",
"from bs4 import BeautifulSoup\n",
"import re\n",
"\n",
"# Currency Convert\n",
"from forex_python.converter import CurrencyRates\n",
"\n",
"# Data Visualisation\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"plt.style.use('seaborn-whitegrid')\n",
"import missingno as msno\n",
"\n",
"# Progress Bar\n",
"from tqdm import tqdm\n",
"\n",
"# Display in Jupyter\n",
"from IPython.display import Image, YouTubeVideo\n",
"from IPython.core.display import HTML\n",
"\n",
"# Ignore Warnings\n",
"import warnings\n",
"warnings.filterwarnings(action=\"ignore\", message=\"^internal gelsd\")\n",
"\n",
"print('Setup Complete')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python: 3.7.6\n",
"NumPy: 1.20.3\n",
"pandas: 1.3.2\n",
"matplotlib: 3.4.2\n"
]
}
],
"source": [
"# Python / module versions used here for reference\n",
"print('Python: {}'.format(platform.python_version()))\n",
"print('NumPy: {}'.format(np.__version__))\n",
"print('pandas: {}'.format(pd.__version__))\n",
"print('matplotlib: {}'.format(mpl.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Filepaths"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Set up initial paths to subfolders\n",
"base_dir = os.path.join('..', '..', )\n",
"data_dir = os.path.join(base_dir, 'data')\n",
"data_dir_elo = os.path.join(base_dir, 'data', 'elo')\n",
"img_dir = os.path.join(base_dir, 'img')\n",
"fig_dir = os.path.join(base_dir, 'img', 'fig')\n",
"video_dir = os.path.join(base_dir, 'video')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Variables"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Defined Variables\n",
"\n",
"## Define today's date\n",
"today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Dictionaries"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [],
"source": [
"# Defined Dictionaries\n",
"\n",
"## Define seasons\n",
"dict_seasons = {1975: '1975/1976',\n",
" 1976: '1976/1977',\n",
" 1977: '1977/1978',\n",
" 1978: '1978/1979',\n",
" 1979: '1979/1980',\n",
" 1980: '1980/1981',\n",
" 1981: '1981/1982',\n",
" 1982: '1982/1983',\n",
" 1983: '1983/1984',\n",
" 1984: '1984/1985',\n",
" 1985: '1985/1986',\n",
" 1986: '1986/1987',\n",
" 1987: '1987/1988',\n",
" 1988: '1988/1989',\n",
" 1989: '1989/1990',\n",
" 1990: '1990/1991',\n",
" 1991: '1991/1992',\n",
" 1992: '1992/1993',\n",
" 1993: '1993/1994',\n",
" 1994: '1994/1995',\n",
" 1995: '1995/1996',\n",
" 1996: '1996/1997',\n",
" 1997: '1997/1998',\n",
" 1998: '1998/1999',\n",
" 1999: '1999/2000',\n",
" 2000: '2000/2001',\n",
" 2001: '2001/2002',\n",
" 2002: '2002/2003',\n",
" 2003: '2003/2004',\n",
" 2004: '2004/2005',\n",
" 2005: '2005/2006',\n",
" 2006: '2006/2007',\n",
" 2007: '2007/2008',\n",
" 2008: '2008/2009',\n",
" 2009: '2009/2010',\n",
" 2010: '2010/2011',\n",
" 2011: '2011/2012',\n",
" 2012: '2012/2013',\n",
" 2013: '2013/2014',\n",
" 2014: '2014/2015',\n",
" 2015: '2015/2016',\n",
" 2016: '2016/2017',\n",
" 2017: '2017/2018',\n",
" 2018: '2018/2019',\n",
" 2019: '2019/2020',\n",
" 2020: '2020/2021',\n",
" 2021: '2021/2022',\n",
" 2022: '2022/2023',\n",
" 2023: '2023/2024',\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Lists"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"# Defined Lists\n",
"\n",
"## Define list of folders\n",
"lst_folders = ['raw', 'engineered']\n",
"\n",
"\n",
"## Define list of data types\n",
"lst_data_types = ['ratings_by_date', 'ratings_by_team']\n",
"\n",
"\n",
"## Define lists of Teams and Countries\n",
"df_elo_teams_leagues = pd.read_csv(data_dir + '/reference/teams/elo_team_leagues_latest.csv')\n",
"\n",
"### Countries\n",
"lst_countries = list(df_elo_teams_leagues['Country'].unique())\n",
"\n",
"### 'Big 5' European League Countries\n",
"lst_countries_big5 = ['ENG', 'ESP', 'FRA', 'GER', 'ITA']\n",
"\n",
"### Teams \n",
"lst_teams = list(df_elo_teams_leagues['Club'].unique())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom Functions"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Define retry requests session function - taken from: https://www.peterbe.com/plog/best-practice-with-retries-with-requests\n",
"def requests_retry_session(\n",
" retries=3,\n",
" backoff_factor=0.3,\n",
" status_forcelist=(500, 502, 504),\n",
" session=None,\n",
"):\n",
" session = session or requests.Session()\n",
" retry = Retry(\n",
" total=retries,\n",
" read=retries,\n",
" connect=retries,\n",
" backoff_factor=backoff_factor,\n",
" status_forcelist=status_forcelist,\n",
" )\n",
" adapter = HTTPAdapter(max_retries=retry)\n",
" session.mount('http://', adapter)\n",
" session.mount('https://', adapter)\n",
" return session"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"# Define function for scraping the Elo rating, team-by-team\n",
"def get_elo_team_ratings():\n",
" \n",
" \"\"\"\n",
" Function to scrape the Elo ratings of teams, team-by-team.\n",
" \"\"\" \n",
" \n",
" ## Start timer\n",
" tic = datetime.datetime.now()\n",
" \n",
" \n",
" ## Print time scraping started\n",
" print(f'Scraping started at: {tic}')\n",
" \n",
" ## Define list of countries\n",
" lst_countries =list(df_elo_teams_leagues['Country'].unique())\n",
" \n",
" ## Loop through countries\n",
" for country in lst_countries:\n",
" \n",
" ### Print statement\n",
" print(f'Scraping started for Team Elo data for {country}...')\n",
" \n",
" df_temp = df_elo_teams_leagues[df_elo_teams_leagues['Country']==country]\n",
" lst_teams = list(df_temp['Club'].unique())\n",
" \n",
" ### Loop through teams\n",
" for team in lst_teams:\n",
" \n",
" #### Define team name used when saving to folder (Snailcase)\n",
" team_folder = team.replace(' ', '_')\n",
" \n",
" #### Define team name for use with URL (lowercase, no spaces)\n",
" team_stripped = team.lower().strip().replace(' ', '')\n",
" \n",
" #### Save Player URL List (if not already saved)\n",
" if not os.path.exists(os.path.join(data_dir_elo + f'/raw/ratings_by_team/{country}/{team_folder}_{country}_elo_rating_latest.csv')):\n",
"\n",
" ##### Scraping\n",
"\n",
" ###### Print statement\n",
" print(f'Scraping started for Team Elo data {team} in {country}...')\n",
"\n",
" ###### Define URL\n",
" url = f'http://api.clubelo.com/{team_stripped}'\n",
"\n",
" ###### Call requests_retry_session function, defined in previous command\n",
" response = requests_retry_session().get(url)\n",
" #print(response.status_code)\n",
"\n",
" ###### \n",
" s = requests.Session()\n",
" s.auth = ('user', 'pass')\n",
" s.headers.update({'x-test': 'true'})\n",
"\n",
" ##### \n",
" response = requests_retry_session(session=s).get(url)\n",
"\n",
" ###### \n",
" data = StringIO(response .text)\n",
" df = pd.read_csv(data, sep=',')\n",
"\n",
" \n",
" ###### Data Engineering\n",
" \n",
" ####### Convert 'From' and 'To' to datatime \n",
" df['From'] = pd.to_datetime(df['From'])\n",
" df['To'] = pd.to_datetime(df['To'])\n",
"\n",
" ####### Drop duplicate rows\n",
" df = df.drop_duplicates()\n",
"\n",
" \n",
" ##### Save DataFrame\n",
" \n",
" ###### Save latest version\n",
" df.to_csv(data_dir_elo + f'/raw/ratings_by_team/{country}/{team_folder}_{country}_elo_rating_latest.csv', index=None, header=True) \n",
"\n",
" ###### Export a copy to the 'archive' subfolder, including the date\n",
" df.to_csv(data_dir_elo + f'/raw/ratings_by_team/{country}/archive/{team_folder}_{country}_elo_rating_last_updated_{today}.csv', index=None, header=True) \n",
"\n",
" \n",
" ##### Print statement for team\n",
" print(f'Team Elo rating data for {team} scraped and saved.')\n",
"\n",
"\n",
" ##### If Team Elo rating already saved\n",
" else:\n",
"\n",
" ###### Print statement\n",
" print(f'Team Elo rating data for {team} already saved as a CSV file.') \n",
"\n",
" \n",
" ## End timer\n",
" toc = datetime.datetime.now()\n",
" \n",
" \n",
" ## Print time scraping ended\n",
" print(f'Scraping ended at: {toc}')\n",
"\n",
" \n",
" ## Calculate time take\n",
" total_time = (toc-tic).total_seconds()\n",
" print(f'Time taken to scrape the Team Elo for {len(teams)} teams in {len(countries)} countries is: {total_time/60:0.2f} minutes.')\n",
"\n",
" \n",
" ## Unify individual CSV files as a single DataFrame\n",
" \n",
" ### Show files in directory\n",
" all_files = glob.glob(os.path.join(data_dir_elo + f'/raw/ratings_by_team/*/*_elo_rating_latest.csv'))\n",
" \n",
" ### Create an empty list of Players URLs\n",
" lst_team_elo_ratings_all = []\n",
"\n",
" ### Loop through list of files and read into temporary DataFrames\n",
" for filename in all_files:\n",
" df_temp = pd.read_csv(filename, index_col=None, header=0)\n",
" lst_team_elo_ratings_all.append(df_temp)\n",
"\n",
" ### Concatenate the files into a single DataFrame\n",
" df_elo_teams_all = pd.concat(lst_team_elo_ratings_all, axis=0, ignore_index=True)\n",
" \n",
" ### Reset index\n",
" df_elo_teams_all = df_elo_teams_all.reset_index(drop=True)\n",
" \n",
" ### Sort DataFrame\n",
" df_elo_teams_all = df_elo_teams_all.sort_values(['Country', 'Club', 'To'], ascending=[True, True, True])\n",
"\n",
" \n",
" ## Export DataFrame\n",
" \n",
" ### Save latest version\n",
" df_elo_teams_all.to_csv(data_dir_elo + f'/raw/ratings_by_team/elo_team_ratings_combined_latest.csv', index=None, header=True)\n",
" \n",
" ### Save a copy to archive folder (dated)\n",
" df_elo_teams_all.to_csv(data_dir_elo + f'/raw/ratings_by_team/archive/elo_team_ratings_combined_last_updated_{today}.csv', index=None, header=True)\n",
" \n",
" \n",
" ## Distinct number of clubs and countries\n",
" total_countries = df_elo_teams_all['Country'].nunique()\n",
" total_clubs = df_elo_teams_all['Club'].nunique()\n",
"\n",
"\n",
" ## Print statement\n",
" print(f'Combined Elo DataFrame contains {total_countries} countries and {total_clubs} clubs.')\n",
" \n",
" \n",
" ## Return final list of Player URLs\n",
" return(df_elo_teams_all)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Directory Structure\n",
"Create folders and subfolders for data, if not already created."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Make the data directory structure\n",
"for folder in lst_folders:\n",
" path = os.path.join(data_dir_elo, folder)\n",
" if not os.path.exists(path):\n",
" os.mkdir(path)\n",
" for data_types in lst_data_types:\n",
" path = os.path.join(data_dir_elo, folder, data_types)\n",
" if not os.path.exists(path):\n",
" os.mkdir(path)\n",
" for country in lst_countries:\n",
" path = os.path.join(data_dir_elo, folder, data_types, country)\n",
" if not os.path.exists(path):\n",
" os.mkdir(path)\n",
" os.mkdir(os.path.join(path, 'archive'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Settings"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Display all columns of pandas DataFrames\n",
"pd.set_option('display.max_columns', None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 2. Project Brief\n",
"This Jupyter notebook is part of a series of notebooks to scrape, parse, engineer, unify, and the model, culminating in a an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to determine the under- and over-performing players based on their on-the-pitch output against transfer fee and wages.\n",
"\n",
"This particular notebook is one of several data parsing notebooks, that cleans team rating data from [Club ELO](http://clubelo.com/) using [pandas](http://pandas.pydata.org/).\n",
"\n",
"This notebook, along with the other notebooks in this project workflow are shown in the following diagram:\n",
"\n",
"![roadmap](../../img/football_analytics_data_roadmap.png)\n",
"\n",
"Links to these notebooks in the [`football_analytics`](https://github.com/eddwebster/football_analytics) GitHub repository can be found at the following:\n",
"* [Webscraping](https://github.com/eddwebster/football_analytics/tree/master/notebooks/1_data_scraping)\n",
" + [FBref Player Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Player%20Stats%20Web%20Scraping.ipynb)\n",
" + [TransferMarket Player Bio and Status Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Bio%20and%20Status%20Web%20Scraping.ipynb)\n",
" + [TransferMarket Player Valuation Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Valuation%20Web%20Scraping.ipynb)\n",
" + [TransferMarkt Player Recorded Transfer Fees Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Webscraping.ipynb)\n",
" + [Capology Player Salary Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/Capology%20Player%20Salary%20Web%20Scraping.ipynb)\n",
" + [FBref Team Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Team%20Stats%20Web%20Scraping.ipynb)\n",
"* [Data Parsing](https://github.com/eddwebster/football_analytics/tree/master/notebooks/2_data_parsing)\n",
" + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/2_data_parsing/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)\n",
"* [Data Engineering](https://github.com/eddwebster/football_analytics/tree/master/notebooks/3_data_engineering)\n",
" + [FBref Player Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Player%20Stats%20Data%20Engineering.ipynb)\n",
" + [TransferMarket Player Bio and Status Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Bio%20and%20Status%20Data%20Engineering.ipynb)\n",
" + [TransferMarket Player Valuation Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Valuation%20Data%20Engineering.ipynb)\n",
" + [TransferMarkt Player Recorded Transfer Fees Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb)\n",
" + [Capology Player Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb)\n",
" + [FBref Team Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Team%20Stats%20Data%20Engineering.ipynb)\n",
" + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)\n",
" + [TransferMarkt Team Recorded Transfer Fee Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Team%20Recorded%20Transfer%20Fee%20Data%20Engineering.ipynb) (aggregated from [TransferMarkt Player Recorded Transfer Fees notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb))\n",
" + [Capology Team Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Team%20Salary%20Data%20Engineering.ipynb) (aggregated from [Capology Player Salary notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb))\n",
"* [Data Unification](https://github.com/eddwebster/football_analytics/tree/master/notebooks/4_data_unification)\n",
" + [Golden ID for Player Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Player%20Level%20Datasets.ipynb)\n",
" + [Golden ID for Team Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Team%20Level%20Datasets.ipynb)\n",
"* [Production Datasets](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets)\n",
" + [Player Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Player%20Performance/Market%20Value%20Dataset.ipynb)\n",
" + [Team Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Team%20Performance/Market%20Value%20Dataset.ipynb)\n",
"* [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling)\n",
" + [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling/Expected%20Transfer%20%20Modeling.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 3. Data Sources\n",
"\n",
"Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section [Section 3](#section3) and cleaned in the Data Engineering section [Section 4](#section4).\n",
"\n",
"We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### 3.1. Data Dictionaries\n",
"The [Club ELO](http://clubelo.com/) dataset has eight features (columns) with the following definitions and data types:\n",
"\n",
"| Feature | Data type |\n",
"|------|-----|\n",
"| `Rank` | object |\n",
"| `Club` | object |\n",
"| `Country` | object |\n",
"| `Level` | object |\n",
"| `Elo` | object |\n",
"| `From` | object |\n",
"| `To` | object |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### 3.2. Scrape the Data\n",
"See the [Club ELO](http://clubelo.com/) API [[link](http://clubelo.com/API)].\n",
"\n",
"Potential methodology to approach this problem\n",
"- Day-by-day rankings seems to be not loading well, but teams are. Day-by-day rankings also only includes team rankings that are in the top 1/2 leagues, per competition. An example is Sheffield Wednesday that does not make the cut in 2021.\n",
"- Download the day-by-day rankings once or twice per season, for the last x season. Use this to get all the unique team names.\n",
"- From the unique team names gathered in the previous step, created a DataFrame of unique team names. From this DataFrame, create a list to loop through and download all their individual rankings. Expecting the final DataFrame to have at least 3.5M rows, if not more for lower-league English teams\n",
"- In the same method as the TM Transfer History script, find a way to aggregate the rankings to an annual value, per team, per season."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"#### 3.2.1. Read in Data of Elo Rankings by Date\n",
"- Data downloaded from: api.clubelo.com/YYYY-MM-DD .\n",
"- Data downloads very slow and was done manually.\n",
"- Following dataset will be used to determine a DataFrame of unique teams and their leagues. The data itself is not required."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Team ELO Rating by Date DataFrame contains 1,497 teams.\n"
]
}
],
"source": [
"## Unify individual CSV files as a single DataFrame\n",
"\n",
"### Show files in directory\n",
"all_files = glob.glob(os.path.join(data_dir_elo + '/raw/ratings_by_date/*/*.csv'))\n",
"\n",
"### Create an empty list of Players URLs\n",
"lst_team_elo_rating_all = []\n",
"\n",
"### Loop through list of files and read into temporary DataFrames\n",
"for filename in all_files:\n",
" df_temp = pd.read_csv(filename, index_col=None, header=0)\n",
" lst_team_elo_rating_all.append(df_temp)\n",
"\n",
"### Concatenate the files into a single DataFrame\n",
"df_elo_team_ratings_all = pd.concat(lst_team_elo_rating_all, axis=0, ignore_index=True)\n",
"\n",
"\n",
"## Reset index\n",
"df_elo_team_ratings_all = df_elo_team_ratings_all.reset_index(drop=True)\n",
"\n",
"\n",
"## Sort DataFrame\n",
"df_elo_team_ratings_all = df_elo_team_ratings_all.sort_values(['Country', 'Club'], ascending=[True, True])\n",
"\n",
"\n",
"## Export DataFrame\n",
"\n",
"### Save latest version\n",
"df_elo_team_ratings_all.to_csv(data_dir_elo + '/raw/ratings_by_date/elo_team_rating_by_date_combined_latest.csv', index=None, header=True)\n",
"\n",
"### Save a copy to archive folder (dated)\n",
"df_elo_team_ratings_all.to_csv(data_dir_elo + f'/raw/ratings_by_date/archive/elo_team_rating_by_date_combined_last_updated_{today}.csv', index=None, header=True)\n",
"\n",
"\n",
"## Distinct number of players\n",
"total_teams = df_elo_team_ratings_all['Club'].nunique()\n",
"\n",
"\n",
"## Print statement\n",
"print(f'The Team ELO Rating by Date DataFrame contains {total_teams:,} teams.')"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" Club Country\n",
"0 Besa Kavaje ALB\n",
"1 Bylis Ballsh ALB\n",
"2 Dinamo Tirana ALB\n",
"3 Elbasani ALB\n",
"4 Flamurtari ALB"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create DataFrame of unique clubs and countries for reference\n",
"\n",
"## Define columns of interest\n",
"lst_cols = ['Club', 'Country']\n",
"\n",
"## Select columns of interest\n",
"df_elo_teams_leagues = df_elo_team_ratings_all[lst_cols]\n",
"\n",
"\n",
"# Drop duplicates\n",
"df_elo_teams_leagues = df_elo_teams_leagues.drop_duplicates()\n",
"\n",
"\n",
"# Sort DataFrame by Country/Club\n",
"df_elo_teams_leagues = df_elo_teams_leagues.sort_values(['Country', 'Club'], ascending=[True, True])\n",
"\n",
"\n",
"# Reset index\n",
"df_elo_teams_leagues = df_elo_teams_leagues.reset_index(drop=True)\n",
"\n",
"\n",
"# Export DataFrame\n",
"if not os.path.exists(os.path.join(data_dir + f'/reference/teams/archive/elo_team_leagues_last_updated_{today}.csv')):\n",
" \n",
" ### Save latest version\n",
" df_elo_teams_leagues.to_csv(data_dir + '/reference/teams/elo_team_leagues_latest.csv', index=None, header=True)\n",
"\n",
" ### Save a copy to archive folder (dated)\n",
" df_elo_teams_leagues.to_csv(data_dir + f'/reference/teams/archive/elo_team_leagues_last_updated_{today}.csv', index=None, header=True) \n",
"\n",
"else:\n",
" df_elo_teams_leagues = pd.read_csv(data_dir + '/reference/teams/elo_team_leagues_latest.csv')\n",
" print('Data already saved previously')\n",
"\n",
" \n",
"# Display DataFrame\n",
"df_elo_teams_leagues.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_elo_teams_leagues.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## All unique Teams\n",
"lst_teams = list(df_elo_teams_leagues['Club_Trimmed'].unique())"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1497"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(lst_teams)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"## All unique Countries\n",
"lst_countries = list(df_elo_teams_leagues['Country'].unique())"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"56"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(lst_countries)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"#### 3.2.2. Scrape Data of Elo Rankings by Date\n",
"Format of scraping data: 'api.clubelo.com/YYYY-MM-DD' e.g. https://api.clubelo.com/2021-09-02."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Create list of dates for the last number of defined years\n",
"\n",
"## Define today's date\n",
"today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')\n",
"\n",
"## Create an empty list for scraped player bio data to be appended\n",
"lst_dates = []\n",
"\n",
"## Define start and end date\n",
"sdate = date(2016, 8, 1) # start date\n",
"edate = date(2021, 9, 2) # end date ## CHANGE TO TODAY'S DATE\n",
"\n",
"## Determine the difference between the two dates\n",
"delta = edate - sdate # as timedelta\n",
"\n",
"##\n",
"for i in range(delta.days + 1):\n",
" day = sdate + timedelta(days=i)\n",
" day = day.strftime('%Y/%m/%d').replace('/', '')\n",
" lst_dates.append(day)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1859"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(lst_dates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"#### 3.2.3. Scrape Data of Elo Rankings by Team\n",
"Data scraped for individual teams and then saved before being combined to a unified DataFrame, ready for Data Engineering, using the `get_elo_team_ratings()` defined in the **Custom Functions** section of the notebook. This function incorporates ClubElo's API [[link](http://clubelo.com/API)] which has a 'One club's rating history' feature, that provides the ranking history for an individual club, going back to the mid forties (note however that Club Elo states that values before 1960 should be considered provisional).\n",
"\n",
"The format for obtaining a team's ranking is: api.clubelo.com/CLUBNAME i.e. [api.clubelo.com/mancity](api.clubelo.com/mancity)."
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scraping started at: 2021-09-04 18:21:20.001197\n",
"Scraping started for Team Elo data for ALB...\n",
"Team Elo rating data for Besa Kavaje already saved as a CSV file.\n",
"Team Elo rating data for Bylis Ballsh already saved as a CSV file.\n",
"Team Elo rating data for Dinamo Tirana already saved as a CSV file.\n",
"Team Elo rating data for Elbasani already saved as a CSV file.\n",
"Team Elo rating data for Flamurtari already saved as a CSV file.\n",
"Team Elo rating data for Kukesi already saved as a CSV file.\n",
"Team Elo rating data for Laci already saved as a CSV file.\n",
"Team Elo rating data for Luftetari already saved as a CSV file.\n",
"Team Elo rating data for Partizani Tirana already saved as a CSV file.\n",
"Team Elo rating data for Skenderbeu already saved as a CSV file.\n",
"Team Elo rating data for Teuta Durres already saved as a CSV file.\n",
"Team Elo rating data for Tirana already saved as a CSV file.\n",
"Team Elo rating data for Tomori Berat already saved as a CSV file.\n",
"Team Elo rating data for Vllaznia already saved as a CSV file.\n",
"Scraping started for Team Elo data for AND...\n",
"Team Elo rating data for CE Principat already saved as a CSV file.\n",
"Team Elo rating data for Constelacio Esportiva already saved as a CSV file.\n",
"Team Elo rating data for Encamp already saved as a CSV file.\n",
"Team Elo rating data for Engordany already saved as a CSV file.\n",
"Team Elo rating data for Escaldes already saved as a CSV file.\n",
"Team Elo rating data for Escale already saved as a CSV file.\n",
"Team Elo rating data for FC Rangers already saved as a CSV file.\n",
"Team Elo rating data for Lusitanos already saved as a CSV file.\n",
"Team Elo rating data for Matecosa Sant Julia already saved as a CSV file.\n",
"Team Elo rating data for Santa Coloma already saved as a CSV file.\n",
"Scraping started for Team Elo data for ARM...\n",
"Team Elo rating data for Alashkert already saved as a CSV file.\n",
"Team Elo rating data for Ararat already saved as a CSV file.\n",
"Team Elo rating data for Ararat-Armenia already saved as a CSV file.\n",
"Team Elo rating data for Banants already saved as a CSV file.\n",
"Team Elo rating data for FC Yerevan already saved as a CSV file.\n",
"Team Elo rating data for Gandzasar Kapan already saved as a CSV file.\n",
"Team Elo rating data for MIKA already saved as a CSV file.\n",
"Team Elo rating data for Noah already saved as a CSV file.\n",
"Team Elo rating data for Pyunik already saved as a CSV file.\n",
"Team Elo rating data for Shirak already saved as a CSV file.\n",
"Team Elo rating data for Spartak Yerevan already saved as a CSV file.\n",
"Team Elo rating data for Ulisses already saved as a CSV file.\n",
"Team Elo rating data for Zvartnots already saved as a CSV file.\n",
"Scraping started for Team Elo data for AUT...\n",
"Team Elo rating data for Admira already saved as a CSV file.\n",
"Team Elo rating data for Altach already saved as a CSV file.\n",
"Team Elo rating data for Austria Kaernten already saved as a CSV file.\n",
"Team Elo rating data for Austria Wien already saved as a CSV file.\n",
"Team Elo rating data for Bregenz already saved as a CSV file.\n",
"Team Elo rating data for FCK Kaernten already saved as a CSV file.\n",
"Team Elo rating data for GAK already saved as a CSV file.\n",
"Team Elo rating data for Groedig already saved as a CSV file.\n",
"Team Elo rating data for Hartberg already saved as a CSV file.\n",
"Team Elo rating data for Kapfenberg already saved as a CSV file.\n",
"Team Elo rating data for Klagenfurt already saved as a CSV file.\n",
"Team Elo rating data for LASK already saved as a CSV file.\n",
"Team Elo rating data for Lustenau already saved as a CSV file.\n",
"Team Elo rating data for Mattersburg already saved as a CSV file.\n",
"Team Elo rating data for Pasching already saved as a CSV file.\n",
"Team Elo rating data for Rapid Wien already saved as a CSV file.\n",
"Team Elo rating data for Ried already saved as a CSV file.\n",
"Team Elo rating data for Salzburg already saved as a CSV file.\n",
"Team Elo rating data for St Poelten already saved as a CSV file.\n",
"Team Elo rating data for Sturm Graz already saved as a CSV file.\n",
"Team Elo rating data for Wacker Innsbruck already saved as a CSV file.\n",
"Team Elo rating data for Wattens already saved as a CSV file.\n",
"Team Elo rating data for Wiener Neustadt already saved as a CSV file.\n",
"Team Elo rating data for Wolfsberg already saved as a CSV file.\n",
"Scraping started for Team Elo data for AZE...\n",
"Team Elo rating data for AZAL Baku already saved as a CSV file.\n",
"Team Elo rating data for FK Baku already saved as a CSV file.\n",
"Team Elo rating data for FK Karvan Evlakh already saved as a CSV file.\n",
"Team Elo rating data for Gabala already saved as a CSV file.\n",
"Team Elo rating data for Inter Baku already saved as a CSV file.\n",
"Team Elo rating data for Kapaz Ganca already saved as a CSV file.\n",
"Team Elo rating data for Karabakh Agdam already saved as a CSV file.\n",
"Team Elo rating data for Khazar already saved as a CSV file.\n",
"Team Elo rating data for MKT Araz Imishli already saved as a CSV file.\n",
"Team Elo rating data for Neftchi already saved as a CSV file.\n",
"Team Elo rating data for Sabail already saved as a CSV file.\n",
"Team Elo rating data for Shafa already saved as a CSV file.\n",
"Team Elo rating data for Shamkir already saved as a CSV file.\n",
"Team Elo rating data for Simurq Zaqatala already saved as a CSV file.\n",
"Team Elo rating data for Sumqayit already saved as a CSV file.\n",
"Team Elo rating data for Zira already saved as a CSV file.\n",
"Scraping started for Team Elo data for BEL...\n",
"Team Elo rating data for Aalst already saved as a CSV file.\n",
"Team Elo rating data for Anderlecht already saved as a CSV file.\n",
"Team Elo rating data for Antwerp already saved as a CSV file.\n",
"Team Elo rating data for Beerschot AC already saved as a CSV file.\n",
"Team Elo rating data for Bergen already saved as a CSV file.\n",
"Team Elo rating data for Beveren already saved as a CSV file.\n",
"Team Elo rating data for Brugge already saved as a CSV file.\n",
"Team Elo rating data for Cercle Brugge already saved as a CSV file.\n",
"Team Elo rating data for Charleroi already saved as a CSV file.\n",
"Team Elo rating data for Dender already saved as a CSV file.\n",
"Team Elo rating data for Eupen already saved as a CSV file.\n",
"Team Elo rating data for FC Brussels already saved as a CSV file.\n",
"Team Elo rating data for Geel already saved as a CSV file.\n",
"Team Elo rating data for Genk already saved as a CSV file.\n",
"Team Elo rating data for Gent already saved as a CSV file.\n",
"Team Elo rating data for Harelbeke already saved as a CSV file.\n",
"Team Elo rating data for Heusden Zolder already saved as a CSV file.\n",
"Team Elo rating data for Kortrijk already saved as a CSV file.\n",
"Team Elo rating data for Leuven already saved as a CSV file.\n",
"Team Elo rating data for Lierse already saved as a CSV file.\n",
"Team Elo rating data for Lokeren already saved as a CSV file.\n",
"Team Elo rating data for Lommel already saved as a CSV file.\n",
"Team Elo rating data for Louvieroise already saved as a CSV file.\n",
"Team Elo rating data for Mechelen already saved as a CSV file.\n",
"Team Elo rating data for Molenbeek already saved as a CSV file.\n",
"Team Elo rating data for Mouscron already saved as a CSV file.\n",
"Team Elo rating data for Oostende already saved as a CSV file.\n",
"Team Elo rating data for Roeselare already saved as a CSV file.\n",
"Team Elo rating data for Seraing already saved as a CSV file.\n",
"Team Elo rating data for St Gillis already saved as a CSV file.\n",
"Team Elo rating data for St Truiden already saved as a CSV file.\n",
"Team Elo rating data for Standard already saved as a CSV file.\n",
"Team Elo rating data for Tubize already saved as a CSV file.\n",
"Team Elo rating data for Westerlo already saved as a CSV file.\n",
"Team Elo rating data for Zulte Waregem already saved as a CSV file.\n",
"Scraping started for Team Elo data for BHZ...\n",
"Team Elo rating data for Borac Banja Luka already saved as a CSV file.\n",
"Team Elo rating data for Brotnjo Citluk already saved as a CSV file.\n",
"Team Elo rating data for Buduchnost Banovici already saved as a CSV file.\n",
"Team Elo rating data for FK Sarajevo already saved as a CSV file.\n",
"Team Elo rating data for Leotar Trebinje already saved as a CSV file.\n",
"Team Elo rating data for Modrica Maksima already saved as a CSV file.\n",
"Team Elo rating data for Olimpik Sarajevo already saved as a CSV file.\n",
"Team Elo rating data for Orasje already saved as a CSV file.\n",
"Team Elo rating data for Radnik Bijeljina already saved as a CSV file.\n",
"Team Elo rating data for Siroki Brijeg already saved as a CSV file.\n",
"Team Elo rating data for Slavija Istocno already saved as a CSV file.\n",
"Team Elo rating data for Sloboda Tuzla already saved as a CSV file.\n",
"Team Elo rating data for Velez Mostar already saved as a CSV file.\n",
"Team Elo rating data for Zeljeznicar already saved as a CSV file.\n",
"Team Elo rating data for Zepce Limorad already saved as a CSV file.\n",
"Team Elo rating data for Zrinjski Mostar already saved as a CSV file.\n",
"Scraping started for Team Elo data for BLR...\n",
"Team Elo rating data for BATE already saved as a CSV file.\n",
"Team Elo rating data for Bobruisk already saved as a CSV file.\n",
"Team Elo rating data for Darida already saved as a CSV file.\n",
"Team Elo rating data for Dinamo Brest already saved as a CSV file.\n",
"Team Elo rating data for Dinamo Minsk already saved as a CSV file.\n",
"Team Elo rating data for Dnepr Mogilev already saved as a CSV file.\n",
"Team Elo rating data for FK Minsk already saved as a CSV file.\n",
"Team Elo rating data for FK Vitebsk already saved as a CSV file.\n",
"Team Elo rating data for Gomel already saved as a CSV file.\n",
"Team Elo rating data for Gorodeya already saved as a CSV file.\n",
"Team Elo rating data for Granit Mikashevichy already saved as a CSV file.\n",
"Team Elo rating data for Isloch already saved as a CSV file.\n",
"Team Elo rating data for Kamunalnik already saved as a CSV file.\n",
"Team Elo rating data for Krumkachy already saved as a CSV file.\n",
"Team Elo rating data for Lida already saved as a CSV file.\n",
"Team Elo rating data for Lok Minsk already saved as a CSV file.\n",
"Team Elo rating data for Luch Minsk already saved as a CSV file.\n",
"Team Elo rating data for MTZ-RIPO already saved as a CSV file.\n",
"Team Elo rating data for Maladzyechna already saved as a CSV file.\n",
"Team Elo rating data for Neman Grodno already saved as a CSV file.\n",
"Team Elo rating data for Novopolotsk already saved as a CSV file.\n",
"Team Elo rating data for Rukh Brest already saved as a CSV file.\n",
"Team Elo rating data for Savit already saved as a CSV file.\n",
"Team Elo rating data for Slavia Mozyr already saved as a CSV file.\n",
"Team Elo rating data for Slutsk already saved as a CSV file.\n",
"Team Elo rating data for Smolevichi already saved as a CSV file.\n",
"Team Elo rating data for Smorgon already saved as a CSV file.\n",
"Team Elo rating data for Soligorsk already saved as a CSV file.\n",
"Team Elo rating data for Sputnik already saved as a CSV file.\n",
"Team Elo rating data for Svisloch-Krovlya already saved as a CSV file.\n",
"Team Elo rating data for Torpedo Minsk already saved as a CSV file.\n",
"Team Elo rating data for Torpedo Mogilev already saved as a CSV file.\n",
"Team Elo rating data for Torpedo Zhodino already saved as a CSV file.\n",
"Team Elo rating data for Vedrych already saved as a CSV file.\n",
"Team Elo rating data for Zvyazda BDU already saved as a CSV file.\n",
"Scraping started for Team Elo data for BUL...\n",
"Team Elo rating data for Akademik Sofia already saved as a CSV file.\n",
"Team Elo rating data for Arda already saved as a CSV file.\n",
"Team Elo rating data for Belasitsa Petrich already saved as a CSV file.\n",
"Team Elo rating data for Beroe Stara Zagora already saved as a CSV file.\n",
"Team Elo rating data for Blagoevgrad already saved as a CSV file.\n",
"Team Elo rating data for Botev Plovdiv already saved as a CSV file.\n",
"Team Elo rating data for Botev Vratsa already saved as a CSV file.\n",
"Team Elo rating data for CSKA 1948 Sofia already saved as a CSV file.\n",
"Team Elo rating data for CSKA Sofia already saved as a CSV file.\n",
"Team Elo rating data for Cherno More already saved as a CSV file.\n",
"Team Elo rating data for Chernomorets Burgas already saved as a CSV file.\n",
"Team Elo rating data for Dobrudzha already saved as a CSV file.\n",
"Team Elo rating data for Dunav Ruse already saved as a CSV file.\n",
"Team Elo rating data for Etar already saved as a CSV file.\n",
"Team Elo rating data for Gotse Delchev already saved as a CSV file.\n",
"Team Elo rating data for Haskovo already saved as a CSV file.\n",
"Team Elo rating data for Hebar already saved as a CSV file.\n",
"Team Elo rating data for Kaliakra Kavarna already saved as a CSV file.\n",
"Team Elo rating data for Levski already saved as a CSV file.\n",
"Team Elo rating data for Litex already saved as a CSV file.\n",
"Team Elo rating data for Lok Gorna already saved as a CSV file.\n",
"Team Elo rating data for Lok Plovdiv already saved as a CSV file.\n",
"Team Elo rating data for Lok Sofia already saved as a CSV file.\n",
"Team Elo rating data for Lyubimets already saved as a CSV file.\n",
"Team Elo rating data for Marek Dupnitza already saved as a CSV file.\n",
"Team Elo rating data for Mezdra already saved as a CSV file.\n",
"Team Elo rating data for Minyor Pernik already saved as a CSV file.\n",
"Team Elo rating data for Montana 1921 already saved as a CSV file.\n",
"Team Elo rating data for N Burgas already saved as a CSV file.\n",
"Team Elo rating data for Nesebar already saved as a CSV file.\n",
"Team Elo rating data for Pirin 1922 already saved as a CSV file.\n",
"Team Elo rating data for Razgrad already saved as a CSV file.\n",
"Team Elo rating data for Rodopa Smolian already saved as a CSV file.\n",
"Team Elo rating data for Samakov already saved as a CSV file.\n",
"Team Elo rating data for Septemvri Sofia already saved as a CSV file.\n",
"Team Elo rating data for Sevlievo already saved as a CSV file.\n",
"Team Elo rating data for Shumen already saved as a CSV file.\n",
"Team Elo rating data for Slavia Sofia already saved as a CSV file.\n",
"Team Elo rating data for Sliven already saved as a CSV file.\n",
"Team Elo rating data for Spartak Pleven already saved as a CSV file.\n",
"Team Elo rating data for Spartak Varna already saved as a CSV file.\n",
"Team Elo rating data for Svetkavitsa already saved as a CSV file.\n",
"Team Elo rating data for Svoge already saved as a CSV file.\n",
"Team Elo rating data for Tsarsko Selo already saved as a CSV file.\n",
"Team Elo rating data for Velbazhd already saved as a CSV file.\n",
"Team Elo rating data for Vereya already saved as a CSV file.\n",
"Team Elo rating data for Vihren Sandanski already saved as a CSV file.\n",
"Team Elo rating data for Vitosha Bistritsa already saved as a CSV file.\n",
"Scraping started for Team Elo data for CRO...\n",
"Team Elo rating data for Cakovec already saved as a CSV file.\n",
"Team Elo rating data for Cibalia Vinkovci already saved as a CSV file.\n",
"Team Elo rating data for Croatia Sesvete already saved as a CSV file.\n",
"Team Elo rating data for Dinamo Zagreb already saved as a CSV file.\n",
"Team Elo rating data for Dragovoljac already saved as a CSV file.\n",
"Team Elo rating data for HNK Gorica already saved as a CSV file.\n",
"Team Elo rating data for Hajduk already saved as a CSV file.\n",
"Team Elo rating data for Inter Zapresic already saved as a CSV file.\n",
"Team Elo rating data for Istra already saved as a CSV file.\n",
"Team Elo rating data for Istra 1961 already saved as a CSV file.\n",
"Team Elo rating data for Kamen Ingrad already saved as a CSV file.\n",
"Team Elo rating data for Karlovac already saved as a CSV file.\n",
"Team Elo rating data for Lok Zagreb already saved as a CSV file.\n",
"Team Elo rating data for Lucko already saved as a CSV file.\n",
"Team Elo rating data for Marsonia already saved as a CSV file.\n",
"Team Elo rating data for Medjimurje already saved as a CSV file.\n",
"Team Elo rating data for NK Zagreb already saved as a CSV file.\n",
"Team Elo rating data for Osijek already saved as a CSV file.\n",
"Team Elo rating data for Pomorac already saved as a CSV file.\n",
"Team Elo rating data for Rijeka already saved as a CSV file.\n",
"Team Elo rating data for Rudes Zagreb already saved as a CSV file.\n",
"Team Elo rating data for Sibenik already saved as a CSV file.\n",
"Team Elo rating data for Slaven Belupo already saved as a CSV file.\n",
"Team Elo rating data for Split 1912 already saved as a CSV file.\n",
"Team Elo rating data for Topolovac already saved as a CSV file.\n",
"Team Elo rating data for Varazdin already saved as a CSV file.\n",
"Team Elo rating data for Varteks already saved as a CSV file.\n",
"Team Elo rating data for Vukovar 91 already saved as a CSV file.\n",
"Team Elo rating data for Zadar already saved as a CSV file.\n",
"Scraping started for Team Elo data for CSR...\n",
"Scraping started for Team Elo data Artmedia in CSR...\n"
]
},
{
"ename": "FileNotFoundError",
"evalue": "[Errno 2] No such file or directory: '../../data/elo/raw/ratings_by_team/CSR/Artmedia_CSR_elo_rating_latest.csv'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/var/folders/d7/wvbp_b411h75p3zvbmrvql080000gn/T/ipykernel_81883/3656556382.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf_elo_team_ratings_all_raw\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mget_elo_team_ratings\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/var/folders/d7/wvbp_b411h75p3zvbmrvql080000gn/T/ipykernel_81883/2911065457.py\u001b[0m in \u001b[0;36mget_elo_team_ratings\u001b[0;34m()\u001b[0m\n\u001b[1;32m 75\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 76\u001b[0m \u001b[0;31m###### Save latest version\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 77\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_dir_elo\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34mf'/raw/ratings_by_team/{country}/{team_folder}_{country}_elo_rating_latest.csv'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mheader\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 78\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 79\u001b[0m \u001b[0;31m###### Export a copy to the 'archive' subfolder, including the date\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36mto_csv\u001b[0;34m(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options)\u001b[0m\n\u001b[1;32m 3480\u001b[0m \u001b[0mdoublequote\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdoublequote\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3481\u001b[0m \u001b[0mescapechar\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mescapechar\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3482\u001b[0;31m \u001b[0mstorage_options\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mstorage_options\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3483\u001b[0m )\n\u001b[1;32m 3484\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/anaconda3/lib/python3.7/site-packages/pandas/io/formats/format.py\u001b[0m in \u001b[0;36mto_csv\u001b[0;34m(self, path_or_buf, encoding, sep, columns, index_label, mode, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, errors, storage_options)\u001b[0m\n\u001b[1;32m 1103\u001b[0m \u001b[0mformatter\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfmt\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1104\u001b[0m )\n\u001b[0;32m-> 1105\u001b[0;31m \u001b[0mcsv_formatter\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1106\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1107\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcreated_buffer\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/anaconda3/lib/python3.7/site-packages/pandas/io/formats/csvs.py\u001b[0m in \u001b[0;36msave\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 241\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merrors\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 242\u001b[0m \u001b[0mcompression\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompression\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 243\u001b[0;31m \u001b[0mstorage_options\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstorage_options\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 244\u001b[0m ) as handles:\n\u001b[1;32m 245\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/opt/anaconda3/lib/python3.7/site-packages/pandas/io/common.py\u001b[0m in \u001b[0;36mget_handle\u001b[0;34m(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)\u001b[0m\n\u001b[1;32m 704\u001b[0m \u001b[0mencoding\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mioargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mencoding\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 705\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merrors\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 706\u001b[0;31m \u001b[0mnewline\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 707\u001b[0m )\n\u001b[1;32m 708\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '../../data/elo/raw/ratings_by_team/CSR/Artmedia_CSR_elo_rating_latest.csv'"
]
}
],
"source": [
"df_elo_team_ratings_all_raw = get_elo_team_ratings()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### 3.3. Preliminary Data Handling\n",
"Let's quality of the dataset by looking first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"#### 3.3.1. Summary Report\n",
"Initial step of the data handling and Exploratory Data Analysis (EDA) is to create a quick summary report of the dataset using [pandas Profiling Report](https://github.com/pandas-profiling/pandas-profiling)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary of the data using pandas Profiling Report\n",
"#pp.ProfileReport(df_elo_team_ratings_all_raw)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"#### 3.3.2. Further Inspection\n",
"The following commands go into more bespoke summary of the dataset. Some of the commands include content covered in the [pandas Profiling](https://github.com/pandas-profiling/pandas-profiling) summary above, but using the standard [pandas](https://pandas.pydata.org/) functions and methods that most peoplem will be more familiar with.\n",
"\n",
"First check the quality of the dataset by looking first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Rank
\n",
"
Club
\n",
"
Country
\n",
"
Level
\n",
"
Elo
\n",
"
From
\n",
"
To
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1203829
\n",
"
None
\n",
"
Besa Kavaje
\n",
"
ALB
\n",
"
0
\n",
"
1325.261230
\n",
"
1972-07-01
\n",
"
1972-09-13
\n",
"
\n",
"
\n",
"
1203830
\n",
"
None
\n",
"
Besa Kavaje
\n",
"
ALB
\n",
"
0
\n",
"
1337.468506
\n",
"
1972-09-14
\n",
"
1972-09-27
\n",
"
\n",
"
\n",
"
1203831
\n",
"
None
\n",
"
Besa Kavaje
\n",
"
ALB
\n",
"
0
\n",
"
1338.146362
\n",
"
1972-09-28
\n",
"
1972-10-25
\n",
"
\n",
"
\n",
"
1203832
\n",
"
None
\n",
"
Besa Kavaje
\n",
"
ALB
\n",
"
0
\n",
"
1335.007324
\n",
"
1972-10-26
\n",
"
1972-11-08
\n",
"
\n",
"
\n",
"
1203833
\n",
"
None
\n",
"
Besa Kavaje
\n",
"
ALB
\n",
"
0
\n",
"
1334.845581
\n",
"
1972-11-09
\n",
"
1978-07-01
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Rank Club Country Level Elo From To\n",
"1203829 None Besa Kavaje ALB 0 1325.261230 1972-07-01 1972-09-13\n",
"1203830 None Besa Kavaje ALB 0 1337.468506 1972-09-14 1972-09-27\n",
"1203831 None Besa Kavaje ALB 0 1338.146362 1972-09-28 1972-10-25\n",
"1203832 None Besa Kavaje ALB 0 1335.007324 1972-10-26 1972-11-08\n",
"1203833 None Besa Kavaje ALB 0 1334.845581 1972-11-09 1978-07-01"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the first five rows of the raw DataFrame, df_elo_team_ratings_all_raw\n",
"df_elo_team_ratings_all_raw.head()"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Rank
\n",
"
Club
\n",
"
Country
\n",
"
Level
\n",
"
Elo
\n",
"
From
\n",
"
To
\n",
"
\n",
" \n",
" \n",
"
\n",
"
1804701
\n",
"
None
\n",
"
Zlatibor
\n",
"
YUG
\n",
"
1
\n",
"
1306.602783
\n",
"
1992-04-30
\n",
"
1992-05-03
\n",
"
\n",
"
\n",
"
1804702
\n",
"
None
\n",
"
Zlatibor
\n",
"
YUG
\n",
"
1
\n",
"
1312.926880
\n",
"
1992-05-04
\n",
"
1992-05-17
\n",
"
\n",
"
\n",
"
1804703
\n",
"
None
\n",
"
Zlatibor
\n",
"
YUG
\n",
"
1
\n",
"
1312.077393
\n",
"
1992-05-18
\n",
"
1992-05-24
\n",
"
\n",
"
\n",
"
1804704
\n",
"
None
\n",
"
Zlatibor
\n",
"
YUG
\n",
"
1
\n",
"
1331.334717
\n",
"
1992-05-25
\n",
"
1992-06-29
\n",
"
\n",
"
\n",
"
1804705
\n",
"
None
\n",
"
Zlatibor
\n",
"
YUG
\n",
"
1
\n",
"
1331.334717
\n",
"
1992-06-30
\n",
"
1995-07-01
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Rank Club Country Level Elo From To\n",
"1804701 None Zlatibor YUG 1 1306.602783 1992-04-30 1992-05-03\n",
"1804702 None Zlatibor YUG 1 1312.926880 1992-05-04 1992-05-17\n",
"1804703 None Zlatibor YUG 1 1312.077393 1992-05-18 1992-05-24\n",
"1804704 None Zlatibor YUG 1 1331.334717 1992-05-25 1992-06-29\n",
"1804705 None Zlatibor YUG 1 1331.334717 1992-06-30 1995-07-01"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the last five rows of the raw DataFrame, df_elo_team_ratings_all_raw\n",
"df_elo_team_ratings_all_raw.tail()"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1854899, 7)\n"
]
}
],
"source": [
"# Print the shape of the raw DataFrame, df_elo_team_ratings_all_raw\n",
"print(df_elo_team_ratings_all_raw.shape)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['Rank', 'Club', 'Country', 'Level', 'Elo', 'From', 'To'], dtype='object')\n"
]
}
],
"source": [
"# Print the column names of the raw DataFrame, df_elo_team_ratings_all_raw\n",
"print(df_elo_team_ratings_all_raw.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset has twelve features (columns)."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Rank object\n",
"Club object\n",
"Country object\n",
"Level int64\n",
"Elo float64\n",
"From object\n",
"To object\n",
"dtype: object"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Data types of the features of the raw DataFrame, df_elo_team_ratings_all_raw\n",
"df_elo_team_ratings_all_raw.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 1854899 entries, 1203829 to 1804705\n",
"Data columns (total 7 columns):\n",
" # Column Dtype \n",
"--- ------ ----- \n",
" 0 Rank object \n",
" 1 Club object \n",
" 2 Country object \n",
" 3 Level int64 \n",
" 4 Elo float64\n",
" 5 From object \n",
" 6 To object \n",
"dtypes: float64(1), int64(1), object(5)\n",
"memory usage: 113.2+ MB\n"
]
}
],
"source": [
"# Info for the raw DataFrame, df_elo_team_ratings_all_raw\n",
"df_elo_team_ratings_all_raw.info()"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"