{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TransferMarkt Player Recorded Transfer Fees Data Engineering\n",
"##### Notebook to engineer player recorded transfer fee data scraped from [TransferMarkt](https://www.transfermarkt.co.uk/).\n",
"\n",
"### By [Edd Webster](https://www.twitter.com/eddwebster)\n",
"Notebook first written: 22/08/2021 \n",
"Notebook last updated: 22/08/2021\n",
"\n",
"![title](../../img/transfermarkt-logo-banner.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"\n",
"\n",
"## Introduction\n",
"This notebook engineers transfer data data from [TransferMarkt](https://www.transfermarkt.co.uk/) provided by [ewenme](https://github.com/ewenme/transfers), and manipulates this landed data as DataFrames using [pandas](http://pandas.pydata.org/) and [matplotlib](https://matplotlib.org/) for visualisation.\n",
"\n",
"For more information about this notebook and the author, I'm available through all the following channels:\n",
"* [eddwebster.com](https://www.eddwebster.com/);\n",
"* edd.j.webster@gmail.com;\n",
"* [@eddwebster](https://www.twitter.com/eddwebster);\n",
"* [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);\n",
"* [github/eddwebster](https://github.com/eddwebster/);\n",
"* [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);\n",
"* [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and\n",
"* [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).\n",
"\n",
"![title](../../img/fifa21eddwebsterbanner.png)\n",
"\n",
"The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/A%29%20Web%20Scraping/TransferMarkt%20Web%20Scraping%20and%20Parsing.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"\n",
"\n",
"## Notebook Contents\n",
"1. [Notebook Dependencies](#section1) \n",
"2. [Project Brief](#section2) \n",
"3. [Data Sources](#section3) \n",
" 1. [Introduction](#section3.1) \n",
" 2. [Data Dictionary](#section3.2) \n",
" 3. [Creating the DataFrame](#section3.3) \n",
" 4. [Initial Data Handling](#section3.4) \n",
" 5. [Export the Raw DataFrame](#section3.5) \n",
"4. [Data Engineering](#section4) \n",
" 1. [Introduction](#section4.1) \n",
" 2. [Columns of Interest](#section4.2) \n",
" 3. [String Cleaning](#section4.3) \n",
" 4. [Converting Data Types](#section4.4) \n",
" 5. [Export the Engineered DataFrame](#section4.5) \n",
"5. [Exploratory Data Analysis (EDA)](#section5) \n",
" 1. [...](#section5.1) \n",
" 2. [...](#section5.2) \n",
" 3. [...](#section5.3) \n",
"6. [Summary](#section6) \n",
"7. [Next Steps](#section7) \n",
"8. [Bibliography](#section8) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"\n",
"\n",
"## 1. Notebook Dependencies\n",
"This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:\n",
"* [`Jupyter notebooks`](https://jupyter.org/) for this notebook environment with which this project is presented;\n",
"* [`NumPy`](http://www.numpy.org/) for multidimensional array computing;\n",
"* [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation; and\n",
"* [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations.\n",
"\n",
"All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import Libraries and Modules"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setup Complete\n"
]
}
],
"source": [
"# Python ≥3.5 (ideally)\n",
"import platform\n",
"import sys, getopt\n",
"assert sys.version_info >= (3, 5)\n",
"import csv\n",
"\n",
"# Import Dependencies\n",
"%matplotlib inline\n",
"\n",
"# Math Operations\n",
"import numpy as np\n",
"import math\n",
"from math import pi\n",
"\n",
"# Datetime\n",
"import datetime\n",
"from datetime import date\n",
"import time\n",
"\n",
"# Data Preprocessing\n",
"import pandas as pd\n",
"#import pandas_profiling as pp\n",
"import os\n",
"import re\n",
"import random\n",
"from io import BytesIO\n",
"from pathlib import Path\n",
"\n",
"# Reading directories\n",
"import glob\n",
"import os\n",
"from os.path import basename\n",
"\n",
"# Flatten lists\n",
"from functools import reduce\n",
"\n",
"# Working with JSON\n",
"import json\n",
"from pandas.io.json import json_normalize\n",
"\n",
"# Web Scraping\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"import re\n",
"\n",
"# Currency Convert\n",
"from forex_python.converter import CurrencyRates\n",
"\n",
"# APIs\n",
"from tyrone_mings import * \n",
"\n",
"# Fuzzy Matching - Record Linkage\n",
"import recordlinkage\n",
"import jellyfish\n",
"import numexpr as ne\n",
"\n",
"# Data Visualisation\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"plt.style.use('seaborn-whitegrid')\n",
"import missingno as msno\n",
"\n",
"# Progress Bar\n",
"from tqdm import tqdm\n",
"\n",
"# Display in Jupyter\n",
"from IPython.display import Image, YouTubeVideo\n",
"from IPython.core.display import HTML\n",
"\n",
"# Ignore Warnings\n",
"import warnings\n",
"warnings.filterwarnings(action=\"ignore\", message=\"^internal gelsd\")\n",
"\n",
"print('Setup Complete')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python: 3.7.6\n",
"NumPy: 1.20.3\n",
"pandas: 1.3.2\n",
"matplotlib: 3.4.2\n"
]
}
],
"source": [
"# Python / module versions used here for reference\n",
"print('Python: {}'.format(platform.python_version()))\n",
"print('NumPy: {}'.format(np.__version__))\n",
"print('pandas: {}'.format(pd.__version__))\n",
"print('matplotlib: {}'.format(mpl.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Filepaths"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Set up initial paths to subfolders\n",
"base_dir = os.path.join('..', '..', )\n",
"data_dir = os.path.join(base_dir, 'data')\n",
"data_dir_tm = os.path.join(base_dir, 'data', 'tm')\n",
"img_dir = os.path.join(base_dir, 'img')\n",
"fig_dir = os.path.join(base_dir, 'img', 'fig')\n",
"video_dir = os.path.join(base_dir, 'video')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Variables"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Define today's date\n",
"today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Lists"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Defined Lists\n",
"\n",
"## Define list of league codes\n",
"df_leagues = pd.read_csv(data_dir_tm + '/reference/tm_leagues_comps.csv')\n",
"lst_league_codes = df_leagues['league_code'].to_numpy().tolist()\n",
"\n",
"## Define list of 'Big 5' European Leagues and MLS league names\n",
"lst_big5_mls_leagues = ['1 Bundesliga', 'Ligue 1', 'Premier League', 'Serie A', 'Primera Division', 'MLS']\n",
"\n",
"## Define list of 'Big 5' European Leagues and MLS codes\n",
"lst_big5_mls_league_codes = ['GB1', 'FR1', 'L1', 'IT1', 'ES1', 'MLS1']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Dictionaries"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"### Defined Dictionaries\n",
"\n",
"## League names and TM codes\n",
"dict_league_codes = {'Primera Division': 'ES1',\n",
" 'Ligue 1': 'FR1',\n",
" 'Premier League': 'GB1',\n",
" 'Championship': 'GB2',\n",
" 'Serie A': 'IT1',\n",
" '1 Bundesliga': 'L1',\n",
" 'MLS': 'MLS1',\n",
" 'Eredivisie': 'NL1',\n",
" 'Liga Nos': 'PO1',\n",
" 'Premier Liga': 'RU1'\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Settings"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Display all columns of pandas DataFrames\n",
"pd.set_option('display.max_columns', None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 2. Project Brief\n",
"This Jupyter notebook is part of a series of notebooks to scrape, parse, engineer, unify, and the model, culminating in a an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to determine the under- and over-performing players based on their on-the-pitch output against transfer fee and wages.\n",
"\n",
"This particular notebook is one of several data engineering notebooks, that cleans recorded transfer fee player data from [TransferMarkt](https://www.transfermarkt.co.uk/) using [pandas](http://pandas.pydata.org/).\n",
"\n",
"[TransferMarkt](https://www.transfermarkt.co.uk/) is a German-based website owned by [Axel Springer](https://www.axelspringer.com/en/) and is the leading website for the football transfer market. The website posts football related data, including: scores and results, football news, transfer rumours, and most usefully for us - calculated estimates ofthe market values for teams and individual players.\n",
"\n",
"To read more about how these estimations are made, [Beyond crowd judgments: Data-driven estimation of market value in association football](https://www.sciencedirect.com/science/article/pii/S0377221717304332) by Oliver Müllera, Alexander Simons, and Markus Weinmann does an excellent job of explaining how the estimations are made and their level of accuracy.\n",
"\n",
"This notebook, along with the other notebooks in this project workflow are shown in the following diagram:\n",
"\n",
"![roadmap](../../img/football_analytics_data_roadmap.png)\n",
"\n",
"Links to these notebooks in the [`football_analytics`](https://github.com/eddwebster/football_analytics) GitHub repository can be found at the following:\n",
"* [Webscraping](https://github.com/eddwebster/football_analytics/tree/master/notebooks/1_data_scraping)\n",
" + [FBref Player Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Player%20Stats%20Web%20Scraping.ipynb)\n",
" + [TransferMarket Player Bio and Status Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Bio%20and%20Status%20Web%20Scraping.ipynb)\n",
" + [TransferMarket Player Valuation Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Valuation%20Web%20Scraping.ipynb)\n",
" + [TransferMarkt Player Recorded Transfer Fees Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Webscraping.ipynb)\n",
" + [Capology Player Salary Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/Capology%20Player%20Salary%20Web%20Scraping.ipynb)\n",
" + [FBref Team Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Team%20Stats%20Web%20Scraping.ipynb)\n",
"* [Data Parsing](https://github.com/eddwebster/football_analytics/tree/master/notebooks/2_data_parsing)\n",
" + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/2_data_parsing/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)\n",
"* [Data Engineering](https://github.com/eddwebster/football_analytics/tree/master/notebooks/3_data_engineering)\n",
" + [FBref Player Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Player%20Stats%20Data%20Engineering.ipynb)\n",
" + [TransferMarket Player Bio and Status Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Bio%20and%20Status%20Data%20Engineering.ipynb)\n",
" + [TransferMarket Player Valuation Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Valuation%20Data%20Engineering.ipynb)\n",
" + [TransferMarkt Player Recorded Transfer Fees Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb)\n",
" + [Capology Player Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb)\n",
" + [FBref Team Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Team%20Stats%20Data%20Engineering.ipynb)\n",
" + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)\n",
" + [TransferMarkt Team Recorded Transfer Fee Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Team%20Recorded%20Transfer%20Fee%20Data%20Engineering.ipynb) (aggregated from [TransferMarkt Player Recorded Transfer Fees notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb))\n",
" + [Capology Team Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Team%20Salary%20Data%20Engineering.ipynb) (aggregated from [Capology Player Salary notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb))\n",
"* [Data Unification](https://github.com/eddwebster/football_analytics/tree/master/notebooks/4_data_unification)\n",
" + [Golden ID for Player Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Player%20Level%20Datasets.ipynb)\n",
" + [Golden ID for Team Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Team%20Level%20Datasets.ipynb)\n",
"* [Production Datasets](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets)\n",
" + [Player Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Player%20Performance/Market%20Value%20Dataset.ipynb)\n",
" + [Team Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Team%20Performance/Market%20Value%20Dataset.ipynb)\n",
"* [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling)\n",
" + [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling/Expected%20Transfer%20%20Modeling.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 3. Data Sources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### 3.1. Introduction\n",
"Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section [Section 3](#section3) and cleaned in the Data Engineering section [Section 4](#section4).\n",
"\n",
"We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2. Data Dictionaries\n",
"The [TransferMarkt](https://www.transfermarkt.co.uk/) dataset has six features (columns) with the following definitions and data types:\n",
"\n",
"| Feature | Data type |\n",
"|------|-----|\n",
"| `position_number` | object |\n",
"| `position_description` | object |\n",
"| `name` | object |\n",
"| `dob` | object |\n",
"| `nationality` | object |\n",
"| `value` | object |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"### 3.3. Read in Data"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Unify individual CSV files as a single DataFrame\n",
"\n",
"## Show files in directory\n",
"all_files = glob.glob(os.path.join(data_dir_tm + f'/raw/transfer_history/*/*/*.csv'))\n",
"\n",
"## Create an empty list of Players URLs\n",
"lst_player_transfer_history_all = []\n",
"\n",
"## Loop through list of files and read into temporary DataFrames, appending to Player URL list \n",
"for filename in all_files:\n",
" df_temp = pd.read_csv(filename, index_col=None, header=0)\n",
" lst_player_transfer_history_all.append(df_temp)\n",
"\n",
"## Concatenate the files into a single DataFrame\n",
"df_tm_transfer_history_all_raw = pd.concat(lst_player_transfer_history_all, axis=0, ignore_index=True)\n",
"\n",
"\n",
"# Sort DataFrame\n",
"df_tm_transfer_history_all_raw = df_tm_transfer_history_all.sort_values(['league_name', 'season', 'player_name'], ascending=[True, True, True])\n",
"\n",
"\n",
"# Display DataFrame\n",
"df_tm_transfer_history_all_raw.head()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"# Export DataFrame\n",
" \n",
"##\n",
"df_tm_transfer_history_all_raw.to_csv(data_dir_tm + f'/raw/transfer_history/tm_player_transfer_history_latest.csv', index=None, header=True)\n",
"\n",
"##\n",
"#df_tm_transfer_history_all_raw.to_csv(data_dir_tm + f'/raw/transfer_history/archive/tm_player_transfer_history_last_updated_{today}.csv', index=None, header=True)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(169208, 12)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_transfer_history_all_raw.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.4. Preliminary Data Handling\n",
"Let's quality of the dataset by looking first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"#### 3.4.1. Summary Report\n",
"Initial step of the data handling and Exploratory Data Analysis (EDA) is to create a quick summary report of the dataset using [pandas Profiling Report](https://github.com/pandas-profiling/pandas-profiling)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary of the data using pandas Profiling Report\n",
"#pp.ProfileReport(df_tm_transfer_history_all_raw)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"#### 3.4.1. Further Inspection\n",
"The following commands go into more bespoke summary of the dataset. Some of the commands include content covered in the [pandas Profiling](https://github.com/pandas-profiling/pandas-profiling) summary above, but using the standard [pandas](https://pandas.pydata.org/) functions and methods that most peoplem will be more familiar with.\n",
"\n",
"First check the quality of the dataset by looking first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
club_name
\n",
"
player_name
\n",
"
age
\n",
"
position
\n",
"
club_involved_name
\n",
"
fee
\n",
"
transfer_movement
\n",
"
transfer_period
\n",
"
fee_cleaned
\n",
"
league_name
\n",
"
year
\n",
"
season
\n",
"
\n",
" \n",
" \n",
"
\n",
"
159524
\n",
"
VfB Stuttgart
\n",
"
Adrian Knup
\n",
"
23.0
\n",
"
Centre-Forward
\n",
"
FC Luzern
\n",
"
?
\n",
"
in
\n",
"
Summer
\n",
"
NaN
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
\n",
"
\n",
"
159671
\n",
"
1. FC Köln
\n",
"
Adrian Spyrka
\n",
"
24.0
\n",
"
Central Midfield
\n",
"
Stuttg. Kickers
\n",
"
End of loanJun 30, 1992
\n",
"
in
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
\n",
"
\n",
"
159545
\n",
"
Karlsruher SC
\n",
"
Alexander Famulla
\n",
"
31.0
\n",
"
Goalkeeper
\n",
"
FC 08 Homburg
\n",
"
?
\n",
"
out
\n",
"
Summer
\n",
"
NaN
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
\n",
"
\n",
"
159510
\n",
"
SV Werder Bremen
\n",
"
Alexander Malchow
\n",
"
22.0
\n",
"
Centre-Back
\n",
"
VfB Oldenburg
\n",
"
Free transfer
\n",
"
out
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
\n",
"
\n",
"
159699
\n",
"
SG Dynamo Dresden
\n",
"
Alexander Zickler
\n",
"
18.0
\n",
"
Centre-Forward
\n",
"
D. Dresden U19
\n",
"
-
\n",
"
in
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" club_name player_name age position \\\n",
"159524 VfB Stuttgart Adrian Knup 23.0 Centre-Forward \n",
"159671 1. FC Köln Adrian Spyrka 24.0 Central Midfield \n",
"159545 Karlsruher SC Alexander Famulla 31.0 Goalkeeper \n",
"159510 SV Werder Bremen Alexander Malchow 22.0 Centre-Back \n",
"159699 SG Dynamo Dresden Alexander Zickler 18.0 Centre-Forward \n",
"\n",
" club_involved_name fee transfer_movement \\\n",
"159524 FC Luzern ? in \n",
"159671 Stuttg. Kickers End of loanJun 30, 1992 in \n",
"159545 FC 08 Homburg ? out \n",
"159510 VfB Oldenburg Free transfer out \n",
"159699 D. Dresden U19 - in \n",
"\n",
" transfer_period fee_cleaned league_name year season \n",
"159524 Summer NaN 1 Bundesliga 1992 1992/1993 \n",
"159671 Summer 0.0 1 Bundesliga 1992 1992/1993 \n",
"159545 Summer NaN 1 Bundesliga 1992 1992/1993 \n",
"159510 Summer 0.0 1 Bundesliga 1992 1992/1993 \n",
"159699 Summer 0.0 1 Bundesliga 1992 1992/1993 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the first five rows of the raw DataFrame, df_tm_transfer_history_all_raw\n",
"df_tm_transfer_history_all_raw.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
club_name
\n",
"
player_name
\n",
"
age
\n",
"
position
\n",
"
club_involved_name
\n",
"
fee
\n",
"
transfer_movement
\n",
"
transfer_period
\n",
"
fee_cleaned
\n",
"
league_name
\n",
"
year
\n",
"
season
\n",
"
\n",
" \n",
" \n",
"
\n",
"
69864
\n",
"
FC Crotone
\n",
"
Zak Ruggiero
\n",
"
20.0
\n",
"
Second Striker
\n",
"
Pro Sesto
\n",
"
loan transfer
\n",
"
out
\n",
"
Winter
\n",
"
NaN
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
\n",
"
\n",
"
69523
\n",
"
SSC Napoli
\n",
"
Zinédine Machach
\n",
"
24.0
\n",
"
Central Midfield
\n",
"
VVV-Venlo
\n",
"
loan transfer
\n",
"
out
\n",
"
Summer
\n",
"
NaN
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
\n",
"
\n",
"
69137
\n",
"
Cagliari Calcio
\n",
"
Zito Luvumbo
\n",
"
18.0
\n",
"
Right Winger
\n",
"
1º de Agosto
\n",
"
£900Th.
\n",
"
in
\n",
"
Summer
\n",
"
0.9
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
\n",
"
\n",
"
69762
\n",
"
Torino FC
\n",
"
Álex Berenguer
\n",
"
25.0
\n",
"
Left Winger
\n",
"
Athletic
\n",
"
£10.80m
\n",
"
out
\n",
"
Summer
\n",
"
10.8
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
\n",
"
\n",
"
69423
\n",
"
Juventus FC
\n",
"
Álvaro Morata
\n",
"
27.0
\n",
"
Centre-Forward
\n",
"
Atlético Madrid
\n",
"
Loan fee:£9.00m
\n",
"
in
\n",
"
Summer
\n",
"
9.0
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" club_name player_name age position \\\n",
"69864 FC Crotone Zak Ruggiero 20.0 Second Striker \n",
"69523 SSC Napoli Zinédine Machach 24.0 Central Midfield \n",
"69137 Cagliari Calcio Zito Luvumbo 18.0 Right Winger \n",
"69762 Torino FC Álex Berenguer 25.0 Left Winger \n",
"69423 Juventus FC Álvaro Morata 27.0 Centre-Forward \n",
"\n",
" club_involved_name fee transfer_movement transfer_period \\\n",
"69864 Pro Sesto loan transfer out Winter \n",
"69523 VVV-Venlo loan transfer out Summer \n",
"69137 1º de Agosto £900Th. in Summer \n",
"69762 Athletic £10.80m out Summer \n",
"69423 Atlético Madrid Loan fee:£9.00m in Summer \n",
"\n",
" fee_cleaned league_name year season \n",
"69864 NaN Serie A 2020 2020/2021 \n",
"69523 NaN Serie A 2020 2020/2021 \n",
"69137 0.9 Serie A 2020 2020/2021 \n",
"69762 10.8 Serie A 2020 2020/2021 \n",
"69423 9.0 Serie A 2020 2020/2021 "
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the last five rows of the raw DataFrame, df_tm_transfer_history_all_raw\n",
"df_tm_transfer_history_all_raw.tail()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(169208, 12)\n"
]
}
],
"source": [
"# Print the shape of the raw DataFrame, df_tm_transfer_history_all_raw\n",
"print(df_tm_transfer_history_all_raw.shape)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['club_name', 'player_name', 'age', 'position', 'club_involved_name',\n",
" 'fee', 'transfer_movement', 'transfer_period', 'fee_cleaned',\n",
" 'league_name', 'year', 'season'],\n",
" dtype='object')\n"
]
}
],
"source": [
"# Print the column names of the raw DataFrame, df_tm_transfer_history_all_raw\n",
"print(df_tm_transfer_history_all_raw.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset has twelve features (columns)."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"club_name object\n",
"player_name object\n",
"age float64\n",
"position object\n",
"club_involved_name object\n",
"fee object\n",
"transfer_movement object\n",
"transfer_period object\n",
"fee_cleaned float64\n",
"league_name object\n",
"year int64\n",
"season object\n",
"dtype: object"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Data types of the features of the raw DataFrame, df_tm_transfer_history_all_raw\n",
"df_tm_transfer_history_all_raw.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 169208 entries, 159524 to 69423\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 club_name 169208 non-null object \n",
" 1 player_name 169208 non-null object \n",
" 2 age 169155 non-null float64\n",
" 3 position 169205 non-null object \n",
" 4 club_involved_name 169208 non-null object \n",
" 5 fee 169064 non-null object \n",
" 6 transfer_movement 169208 non-null object \n",
" 7 transfer_period 154776 non-null object \n",
" 8 fee_cleaned 141044 non-null float64\n",
" 9 league_name 169208 non-null object \n",
" 10 year 169208 non-null int64 \n",
" 11 season 169208 non-null object \n",
"dtypes: float64(2), int64(1), object(9)\n",
"memory usage: 16.8+ MB\n"
]
}
],
"source": [
"# Info for the raw DataFrame, df_tm_transfer_history_all_raw\n",
"df_tm_transfer_history_all_raw.info()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_transfer_history_all_raw\n",
"msno.matrix(df_tm_transfer_history_all_raw, figsize = (30, 7))"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age 53\n",
"position 3\n",
"fee 144\n",
"transfer_period 14432\n",
"fee_cleaned 28164\n",
"dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Counts of missing values\n",
"null_value_stats = df_tm_transfer_history_all_raw.isnull().sum(axis=0)\n",
"null_value_stats[null_value_stats != 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The visualisation shows us very quickly that there a few missing values in the `age`, `position`, `fee`, and then more missing values in the `transfer_period` and `fee_cleaned` column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 4. Data Engineering\n",
"Before we answer the questions in the brief through [Exploratory Data Analysis (EDA)](#section5), we'll first need to clean and wrangle the datasets to a form that meet our needs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1. Assign Raw DataFrames to New Engineered DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"# Assign Raw DataFrames to new Engineered DataFrames\n",
"df_tm_transfer_history_all = df_tm_transfer_history_all_raw.copy()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
club_name
\n",
"
player_name
\n",
"
age
\n",
"
position
\n",
"
club_involved_name
\n",
"
fee
\n",
"
transfer_movement
\n",
"
transfer_period
\n",
"
fee_cleaned
\n",
"
league_name
\n",
"
year
\n",
"
season
\n",
"
league_code
\n",
"
\n",
" \n",
" \n",
"
\n",
"
159524
\n",
"
VfB Stuttgart
\n",
"
Adrian Knup
\n",
"
23.0
\n",
"
Centre-Forward
\n",
"
FC Luzern
\n",
"
?
\n",
"
in
\n",
"
Summer
\n",
"
NaN
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
159671
\n",
"
1. FC Köln
\n",
"
Adrian Spyrka
\n",
"
24.0
\n",
"
Central Midfield
\n",
"
Stuttg. Kickers
\n",
"
End of loanJun 30, 1992
\n",
"
in
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
159545
\n",
"
Karlsruher SC
\n",
"
Alexander Famulla
\n",
"
31.0
\n",
"
Goalkeeper
\n",
"
FC 08 Homburg
\n",
"
?
\n",
"
out
\n",
"
Summer
\n",
"
NaN
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
159510
\n",
"
SV Werder Bremen
\n",
"
Alexander Malchow
\n",
"
22.0
\n",
"
Centre-Back
\n",
"
VfB Oldenburg
\n",
"
Free transfer
\n",
"
out
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
159699
\n",
"
SG Dynamo Dresden
\n",
"
Alexander Zickler
\n",
"
18.0
\n",
"
Centre-Forward
\n",
"
D. Dresden U19
\n",
"
-
\n",
"
in
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" club_name player_name age position \\\n",
"159524 VfB Stuttgart Adrian Knup 23.0 Centre-Forward \n",
"159671 1. FC Köln Adrian Spyrka 24.0 Central Midfield \n",
"159545 Karlsruher SC Alexander Famulla 31.0 Goalkeeper \n",
"159510 SV Werder Bremen Alexander Malchow 22.0 Centre-Back \n",
"159699 SG Dynamo Dresden Alexander Zickler 18.0 Centre-Forward \n",
"\n",
" club_involved_name fee transfer_movement \\\n",
"159524 FC Luzern ? in \n",
"159671 Stuttg. Kickers End of loanJun 30, 1992 in \n",
"159545 FC 08 Homburg ? out \n",
"159510 VfB Oldenburg Free transfer out \n",
"159699 D. Dresden U19 - in \n",
"\n",
" transfer_period fee_cleaned league_name year season league_code \n",
"159524 Summer NaN 1 Bundesliga 1992 1992/1993 L1 \n",
"159671 Summer 0.0 1 Bundesliga 1992 1992/1993 L1 \n",
"159545 Summer NaN 1 Bundesliga 1992 1992/1993 L1 \n",
"159510 Summer 0.0 1 Bundesliga 1992 1992/1993 L1 \n",
"159699 Summer 0.0 1 Bundesliga 1992 1992/1993 L1 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_transfer_history_all.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2. String Cleaning"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# Map TM league codes to DataFrame\n",
"df_tm_transfer_history_all['league_code'] = df_tm_transfer_history_all['league_name'].map(dict_league_codes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3. Convert to GBP (£)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# Currency Convert\n",
"from currency_converter import CurrencyConverter\n",
"import math"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.90053"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get EUR to GBP exchange rate\n",
"\n",
"## Get latest currency rates\n",
"c = CurrencyConverter()\n",
"\n",
"## Get conversion rate from EUR to GBP\n",
"rate_eur_gbp = (c.convert(1, 'EUR', 'GBP'))\n",
"rate_eur_gbp"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Overwrite\n",
"rate_eur_gbp = 0.90\n",
"rate_eur_gbp"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# NOT NEEDED AS ALREADY IN POUNDS\n",
"\n",
"# Convert Euros to Pounds\n",
"#df_tm_transfer_history_all['fee_cleaned_pounds'] = df_tm_transfer_history_all['fee_cleaned'] * rate_eur_gbp"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.4. Filter Players in 'Big 5' European Leagues and MLS\n",
"No MLS data in the dataset currently"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['1 Bundesliga', 'Championship', 'Eredivisie', 'Liga Nos',\n",
" 'Ligue 1', 'Premier League', 'Premier Liga', 'Primera Division',\n",
" 'Serie A'], dtype=object)"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_transfer_history_all['league_name'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"# Filter list of 'Big 5' European Leagues and MLS from DataFrame\n",
"df_tm_transfer_history_big5 = df_tm_transfer_history_all[df_tm_transfer_history_all['league_name'].isin(lst_big5_mls_leagues)]"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
club_name
\n",
"
player_name
\n",
"
age
\n",
"
position
\n",
"
club_involved_name
\n",
"
fee
\n",
"
transfer_movement
\n",
"
transfer_period
\n",
"
fee_cleaned
\n",
"
league_name
\n",
"
year
\n",
"
season
\n",
"
league_code
\n",
"
\n",
" \n",
" \n",
"
\n",
"
159524
\n",
"
VfB Stuttgart
\n",
"
Adrian Knup
\n",
"
23.0
\n",
"
Centre-Forward
\n",
"
FC Luzern
\n",
"
?
\n",
"
in
\n",
"
Summer
\n",
"
NaN
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
159671
\n",
"
1. FC Köln
\n",
"
Adrian Spyrka
\n",
"
24.0
\n",
"
Central Midfield
\n",
"
Stuttg. Kickers
\n",
"
End of loanJun 30, 1992
\n",
"
in
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
159545
\n",
"
Karlsruher SC
\n",
"
Alexander Famulla
\n",
"
31.0
\n",
"
Goalkeeper
\n",
"
FC 08 Homburg
\n",
"
?
\n",
"
out
\n",
"
Summer
\n",
"
NaN
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
159510
\n",
"
SV Werder Bremen
\n",
"
Alexander Malchow
\n",
"
22.0
\n",
"
Centre-Back
\n",
"
VfB Oldenburg
\n",
"
Free transfer
\n",
"
out
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
159699
\n",
"
SG Dynamo Dresden
\n",
"
Alexander Zickler
\n",
"
18.0
\n",
"
Centre-Forward
\n",
"
D. Dresden U19
\n",
"
-
\n",
"
in
\n",
"
Summer
\n",
"
0.0
\n",
"
1 Bundesliga
\n",
"
1992
\n",
"
1992/1993
\n",
"
L1
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
69864
\n",
"
FC Crotone
\n",
"
Zak Ruggiero
\n",
"
20.0
\n",
"
Second Striker
\n",
"
Pro Sesto
\n",
"
loan transfer
\n",
"
out
\n",
"
Winter
\n",
"
NaN
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
IT1
\n",
"
\n",
"
\n",
"
69523
\n",
"
SSC Napoli
\n",
"
Zinédine Machach
\n",
"
24.0
\n",
"
Central Midfield
\n",
"
VVV-Venlo
\n",
"
loan transfer
\n",
"
out
\n",
"
Summer
\n",
"
NaN
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
IT1
\n",
"
\n",
"
\n",
"
69137
\n",
"
Cagliari Calcio
\n",
"
Zito Luvumbo
\n",
"
18.0
\n",
"
Right Winger
\n",
"
1º de Agosto
\n",
"
£900Th.
\n",
"
in
\n",
"
Summer
\n",
"
0.9
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
IT1
\n",
"
\n",
"
\n",
"
69762
\n",
"
Torino FC
\n",
"
Álex Berenguer
\n",
"
25.0
\n",
"
Left Winger
\n",
"
Athletic
\n",
"
£10.80m
\n",
"
out
\n",
"
Summer
\n",
"
10.8
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
IT1
\n",
"
\n",
"
\n",
"
69423
\n",
"
Juventus FC
\n",
"
Álvaro Morata
\n",
"
27.0
\n",
"
Centre-Forward
\n",
"
Atlético Madrid
\n",
"
Loan fee:£9.00m
\n",
"
in
\n",
"
Summer
\n",
"
9.0
\n",
"
Serie A
\n",
"
2020
\n",
"
2020/2021
\n",
"
IT1
\n",
"
\n",
" \n",
"
\n",
"
93489 rows × 13 columns
\n",
"
"
],
"text/plain": [
" club_name player_name age position \\\n",
"159524 VfB Stuttgart Adrian Knup 23.0 Centre-Forward \n",
"159671 1. FC Köln Adrian Spyrka 24.0 Central Midfield \n",
"159545 Karlsruher SC Alexander Famulla 31.0 Goalkeeper \n",
"159510 SV Werder Bremen Alexander Malchow 22.0 Centre-Back \n",
"159699 SG Dynamo Dresden Alexander Zickler 18.0 Centre-Forward \n",
"... ... ... ... ... \n",
"69864 FC Crotone Zak Ruggiero 20.0 Second Striker \n",
"69523 SSC Napoli Zinédine Machach 24.0 Central Midfield \n",
"69137 Cagliari Calcio Zito Luvumbo 18.0 Right Winger \n",
"69762 Torino FC Álex Berenguer 25.0 Left Winger \n",
"69423 Juventus FC Álvaro Morata 27.0 Centre-Forward \n",
"\n",
" club_involved_name fee transfer_movement \\\n",
"159524 FC Luzern ? in \n",
"159671 Stuttg. Kickers End of loanJun 30, 1992 in \n",
"159545 FC 08 Homburg ? out \n",
"159510 VfB Oldenburg Free transfer out \n",
"159699 D. Dresden U19 - in \n",
"... ... ... ... \n",
"69864 Pro Sesto loan transfer out \n",
"69523 VVV-Venlo loan transfer out \n",
"69137 1º de Agosto £900Th. in \n",
"69762 Athletic £10.80m out \n",
"69423 Atlético Madrid Loan fee:£9.00m in \n",
"\n",
" transfer_period fee_cleaned league_name year season league_code \n",
"159524 Summer NaN 1 Bundesliga 1992 1992/1993 L1 \n",
"159671 Summer 0.0 1 Bundesliga 1992 1992/1993 L1 \n",
"159545 Summer NaN 1 Bundesliga 1992 1992/1993 L1 \n",
"159510 Summer 0.0 1 Bundesliga 1992 1992/1993 L1 \n",
"159699 Summer 0.0 1 Bundesliga 1992 1992/1993 L1 \n",
"... ... ... ... ... ... ... \n",
"69864 Winter NaN Serie A 2020 2020/2021 IT1 \n",
"69523 Summer NaN Serie A 2020 2020/2021 IT1 \n",
"69137 Summer 0.9 Serie A 2020 2020/2021 IT1 \n",
"69762 Summer 10.8 Serie A 2020 2020/2021 IT1 \n",
"69423 Summer 9.0 Serie A 2020 2020/2021 IT1 \n",
"\n",
"[93489 rows x 13 columns]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_transfer_history_big5"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(93489, 13)"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_transfer_history_big5.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.5. Filter for 2016/2017 Season Onwards\n",
"Separate DataFrame created for transfers from 2016/2017 onwards as this is when the other data sources such as FBref start."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"# Filter plays in the Big 5 European Leagues\n",
"\n",
"## Define list of countries\n",
"lst_seasons_1617_2122 = ['2016/2017', '2017/2018', '2018/2019', '2019/2020', '2020/2021', '2021/2022']\n",
"\n",
"## Filter list of Big 5 European League countries from DataFrame\n",
"df_tm_transfer_history_big5_1617_present = df_tm_transfer_history_big5[df_tm_transfer_history_big5['season'].isin(lst_seasons_1617_2122)]"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
club_name
\n",
"
player_name
\n",
"
age
\n",
"
position
\n",
"
club_involved_name
\n",
"
fee
\n",
"
transfer_movement
\n",
"
transfer_period
\n",
"
fee_cleaned
\n",
"
league_name
\n",
"
year
\n",
"
season
\n",
"
league_code
\n",
"
\n",
" \n",
" \n",
"
\n",
"
166018
\n",
"
1.FSV Mainz 05
\n",
"
Aaron Seydel
\n",
"
20.0
\n",
"
Left Winger
\n",
"
FSV Mainz 05 II
\n",
"
-
\n",
"
in
\n",
"
Winter
\n",
"
0.00
\n",
"
1 Bundesliga
\n",
"
2016
\n",
"
2016/2017
\n",
"
L1
\n",
"
\n",
"
\n",
"
165635
\n",
"
FC Schalke 04
\n",
"
Abdul Rahman Baba
\n",
"
22.0
\n",
"
Left-Back
\n",
"
Chelsea
\n",
"
Loan fee:£450Th.
\n",
"
in
\n",
"
Summer
\n",
"
0.45
\n",
"
1 Bundesliga
\n",
"
2016
\n",
"
2016/2017
\n",
"
L1
\n",
"
\n",
"
\n",
"
165998
\n",
"
Borussia Dortmund
\n",
"
Adrián Ramos
\n",
"
31.0
\n",
"
Centre-Forward
\n",
"
Granada CF
\n",
"
Loan
\n",
"
out
\n",
"
Winter
\n",
"
0.00
\n",
"
1 Bundesliga
\n",
"
2016
\n",
"
2016/2017
\n",
"
L1
\n",
"
\n",
"
\n",
"
165809
\n",
"
FC Augsburg
\n",
"
Albian Ajeti
\n",
"
19.0
\n",
"
Centre-Forward
\n",
"
FC St. Gallen
\n",
"
Loan
\n",
"
out
\n",
"
Summer
\n",
"
0.00
\n",
"
1 Bundesliga
\n",
"
2016
\n",
"
2016/2017
\n",
"
L1
\n",
"
\n",
"
\n",
"
165833
\n",
"
SV Werder Bremen
\n",
"
Alejandro Gálvez
\n",
"
27.0
\n",
"
Centre-Back
\n",
"
SD Eibar
\n",
"
£1.35m
\n",
"
out
\n",
"
Summer
\n",
"
1.35
\n",
"
1 Bundesliga
\n",
"
2016
\n",
"
2016/2017
\n",
"
L1
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" club_name player_name age position \\\n",
"166018 1.FSV Mainz 05 Aaron Seydel 20.0 Left Winger \n",
"165635 FC Schalke 04 Abdul Rahman Baba 22.0 Left-Back \n",
"165998 Borussia Dortmund Adrián Ramos 31.0 Centre-Forward \n",
"165809 FC Augsburg Albian Ajeti 19.0 Centre-Forward \n",
"165833 SV Werder Bremen Alejandro Gálvez 27.0 Centre-Back \n",
"\n",
" club_involved_name fee transfer_movement transfer_period \\\n",
"166018 FSV Mainz 05 II - in Winter \n",
"165635 Chelsea Loan fee:£450Th. in Summer \n",
"165998 Granada CF Loan out Winter \n",
"165809 FC St. Gallen Loan out Summer \n",
"165833 SD Eibar £1.35m out Summer \n",
"\n",
" fee_cleaned league_name year season league_code \n",
"166018 0.00 1 Bundesliga 2016 2016/2017 L1 \n",
"165635 0.45 1 Bundesliga 2016 2016/2017 L1 \n",
"165998 0.00 1 Bundesliga 2016 2016/2017 L1 \n",
"165809 0.00 1 Bundesliga 2016 2016/2017 L1 \n",
"165833 1.35 1 Bundesliga 2016 2016/2017 L1 "
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_transfer_history_big5_1617_present.head()"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(19756, 13)"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_transfer_history_big5_1617_present.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 5. Exporting the Engineered DataFrames\n",
"Export the two engineered [TransferMarkt](https://www.transfermarkt.co.uk/) DataFrames as CSV files."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Export DataFrame\n",
" \n",
"##\n",
"df_tm_transfer_history_all.to_csv(data_dir_tm + f'/engineered/transfer_history/tm_player_transfer_history_latest.csv', index=None, header=True)\n",
"\n",
"##\n",
"df_tm_transfer_history_all.to_csv(data_dir_tm + f'/engineered/transfer_history/archive/tm_player_transfer_history_last_updated_{today}.csv', index=None, header=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# FINISH OFF THIS SECTION"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Filename\n",
"file_name = 'tm_player_joined_big5_mls'\n",
"\n",
"## Full season string\n",
"full_season_string = '2016/2017 - 2021/2022'\n",
"\n",
"## Short season string\n",
"short_season_string = '1617-2122'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Export DataFrame as a CSV file\n",
"\n",
"## Export a copy to the 'archive' subfolder of the TM folder, including the date\n",
"#df_tm_market_value_bio_status_select.to_csv(data_dir_tm + f'/engineered/{short_season_string}/joined/archive/' + f'{file_name}_{short_season_string}_last_updated_{today}.csv', index=None, header=True)\n",
"\n",
"## Export another copy to the TM folder called 'latest' (can be overwritten)\n",
"#df_tm_market_value_bio_status_select.to_csv(data_dir_tm + f'/engineered/{short_season_string}/joined/' + f'{file_name}_{short_season_string}_latest.csv', index=None, header=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Export DataFrame as a CSV file\n",
"\n",
"## All data\n",
"#df_tm_transfer_history_all.to_csv(data_dir_tm + f'/engineered/transfers/archive/' + f'tm_player_transfers_all_last_updated_{today}.csv', index=None, header=True)\n",
"#df_tm_transfer_history_all.to_csv(data_dir_tm + f'/engineered/transfers/' + f'tm_transfers_player_all_latest.csv', index=None, header=True)\n",
"#df_tm_transfer_history_all.to_csv(data_dir + f'/export/' + f'tm_transfers_player_all_latest.csv', index=None, header=True)\n",
"\n",
"\n",
"## Big 5 European Leagues\n",
"#df_transfers_big5.to_csv(data_dir_tm + f'/engineered/transfers/archive/' + f'tm_player_transfers_big5_last_updated_{today}.csv', index=None, header=True)\n",
"#df_transfers_big5.to_csv(data_dir_tm + f'/engineered/transfers/' + f'tm_player_transfers_big5_latest.csv', index=None, header=True)\n",
"#df_transfers_big5.to_csv(data_dir + f'/export/' + f'tm_player_transfers_big5_latest.csv', index=None, header=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 6. Summary\n",
"This Jupyter notebook engineered scraped football data from [TransferMarkt](https://www.transfermarkt.co.uk/) using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 7. Next Steps\n",
"The step is to use this data to..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"\n",
"\n",
"## 8. References\n",
"\n",
"#### Data and Web Scraping\n",
"* [Beyond crowd judgments: Data-driven estimation of market value in association football](https://www.sciencedirect.com/science/article/pii/S0377221717304332) by Oliver Müllera, Alexander Simons, and Markus Weinmann.\n",
"* [06/04/2020: BBC - Premier League squads 'drop £1.6bn in value'](https://www.bbc.co.uk/sport/football/52221463)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"***Visit my website [eddwebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to the top](#top)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"oldHeight": 642,
"position": {
"height": "664px",
"left": "1119px",
"right": "20px",
"top": "-7px",
"width": "489px"
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"varInspector_section_display": "block",
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}