{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TransferMarkt Historical Market Value Data Engineering\n",
"##### Notebook to engineer data scraped from [TransferMarkt](https://www.transfermarkt.co.uk/) using [Beautifulsoup](https://pypi.org/project/beautifulsoup4/), the [Tyrone Mings web scraper](https://github.com/FCrSTATS/tyrone_mings) by [FCrSTATS](https://twitter.com/FC_rstats) and code from the [`football-progres-analysis`](https://github.com/Shomrey/football-progres-analysis) GitHub repo by [`Shomrey`](https://github.com/Shomrey).\n",
"\n",
"### By [Edd Webster](https://www.twitter.com/eddwebster)\n",
"Notebook first written: 22/08/2021 \n",
"Notebook last updated: 22/08/2021\n",
"\n",
"![title](../../img/transfermarkt-logo-banner.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"\n",
"\n",
"## Introduction\n",
    "This notebook engineers data previously scraped from [TransferMarkt](https://www.transfermarkt.co.uk/) using the [FCrSTATS](https://twitter.com/FC_rstats) [Tyrone Mings webscraper](https://github.com/FCrSTATS/tyrone_mings), and manipulates this landed data as DataFrames using [pandas](http://pandas.pydata.org/), with [matplotlib](https://matplotlib.org/) for visualisation.\n",
"\n",
"For more information about this notebook and the author, I'm available through all the following channels:\n",
"* [eddwebster.com](https://www.eddwebster.com/);\n",
"* edd.j.webster@gmail.com;\n",
"* [@eddwebster](https://www.twitter.com/eddwebster);\n",
"* [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);\n",
"* [github/eddwebster](https://github.com/eddwebster/);\n",
"* [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);\n",
"* [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and\n",
"* [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).\n",
"\n",
"![title](../../img/fifa21eddwebsterbanner.png)\n",
"\n",
"The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/A%29%20Web%20Scraping/TransferMarkt%20Web%20Scraping%20and%20Parsing.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
"\n",
"\n",
"## Notebook Contents\n",
"1. [Notebook Dependencies](#section1) \n",
"2. [Project Brief](#section2) \n",
"3. [Data Sources](#section3) \n",
" 1. [Introduction](#section3.1) \n",
" 2. [Data Dictionary](#section3.2) \n",
" 3. [Creating the DataFrame](#section3.3) \n",
" 4. [Initial Data Handling](#section3.4) \n",
" 5. [Export the Raw DataFrame](#section3.5) \n",
"4. [Data Engineering](#section4) \n",
" 1. [Introduction](#section4.1) \n",
" 2. [Columns of Interest](#section4.2) \n",
" 3. [String Cleaning](#section4.3) \n",
" 4. [Converting Data Types](#section4.4) \n",
" 5. [Export the Engineered DataFrame](#section4.5) \n",
"5. [Exploratory Data Analysis (EDA)](#section5) \n",
" 1. [...](#section5.1) \n",
" 2. [...](#section5.2) \n",
" 3. [...](#section5.3) \n",
"6. [Summary](#section6) \n",
"7. [Next Steps](#section7) \n",
"8. [Bibliography](#section8) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
    "<a id='section1'></a>\n",
"\n",
"## 1. Notebook Dependencies\n",
    "This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:\n",
    "* [`Jupyter notebooks`](https://jupyter.org/) for the environment in which this project is presented;\n",
    "* [`NumPy`](http://www.numpy.org/) for multidimensional array computing;\n",
    "* [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation;\n",
    "* [`Beautifulsoup`](https://pypi.org/project/beautifulsoup4/) for parsing scraped HTML; and\n",
    "* [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations.\n",
"\n",
"All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import Libraries and Modules"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setup Complete\n"
]
}
],
"source": [
"# Python ≥3.5 (ideally)\n",
"import platform\n",
"import sys, getopt\n",
"assert sys.version_info >= (3, 5)\n",
"import csv\n",
"\n",
"# Import Dependencies\n",
"%matplotlib inline\n",
"\n",
"# Math Operations\n",
"import numpy as np\n",
"import math\n",
"from math import pi\n",
"\n",
"# Datetime\n",
"import datetime\n",
"from datetime import date\n",
"import time\n",
"\n",
"# Data Preprocessing\n",
"import pandas as pd\n",
"import os\n",
"import re\n",
"import random\n",
"from io import BytesIO\n",
"from pathlib import Path\n",
"\n",
    "# Reading directories\n",
    "import glob\n",
    "from os.path import basename\n",
"\n",
"# Flatten lists\n",
"from functools import reduce\n",
"\n",
"# Working with JSON\n",
"import json\n",
    "from pandas import json_normalize\n",
"\n",
    "# Web Scraping\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
"\n",
"# Currency Converter\n",
"from currency_converter import CurrencyConverter\n",
"\n",
"# APIs\n",
"from tyrone_mings import * \n",
"\n",
"# Fuzzy Matching - Record Linkage\n",
"import recordlinkage\n",
"import jellyfish\n",
"import numexpr as ne\n",
"\n",
"# Data Visualisation\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"plt.style.use('seaborn-whitegrid')\n",
"import missingno as msno\n",
"\n",
"# Progress Bar\n",
"from tqdm import tqdm\n",
"\n",
"# Display in Jupyter\n",
"from IPython.display import Image, YouTubeVideo\n",
"from IPython.core.display import HTML\n",
"\n",
"# Ignore Warnings\n",
"import warnings\n",
"warnings.filterwarnings(action=\"ignore\", message=\"^internal gelsd\")\n",
"\n",
"print('Setup Complete')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python: 3.7.6\n",
"NumPy: 1.20.3\n",
"pandas: 1.3.2\n",
"matplotlib: 3.4.2\n"
]
}
],
"source": [
"# Python / module versions used here for reference\n",
"print('Python: {}'.format(platform.python_version()))\n",
"print('NumPy: {}'.format(np.__version__))\n",
"print('pandas: {}'.format(pd.__version__))\n",
"print('matplotlib: {}'.format(mpl.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Filepaths"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Set up initial paths to subfolders\n",
"base_dir = os.path.join('..', '..')\n",
"data_dir = os.path.join(base_dir, 'data')\n",
"data_dir_tm = os.path.join(base_dir, 'data', 'tm')\n",
"img_dir = os.path.join(base_dir, 'img')\n",
"fig_dir = os.path.join(base_dir, 'img', 'fig')\n",
"video_dir = os.path.join(base_dir, 'video')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Variables"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Defined Variables\n",
"\n",
    "## Define today's date in DDMMYYYY format\n",
    "today = datetime.datetime.now().strftime('%d%m%Y')"
]
},
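{
"cell_type": "markdown",
"metadata": {},
"source": [
"The date stamp above resolves to an eight-digit `DDMMYYYY` string. As a quick check (not part of the original notebook), with a fixed example date:\n",
"\n",
"```python\n",
"import datetime\n",
"\n",
"# Both forms produce the same eight-digit DDMMYYYY stamp\n",
"d = datetime.datetime(2021, 8, 22)\n",
"print(d.strftime('%d/%m/%Y').replace('/', ''))  # '22082021'\n",
"print(d.strftime('%d%m%Y'))                     # '22082021'\n",
"```"
]
},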
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Dictionaries"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [],
"source": [
"# Defined Dictionaries\n",
"\n",
"## Define seasons\n",
"dict_seasons = {'2000': '2000/2001',\n",
" '2001': '2001/2002',\n",
" '2002': '2002/2003',\n",
" '2003': '2003/2004',\n",
" '2004': '2004/2005',\n",
" '2005': '2005/2006',\n",
" '2006': '2006/2007',\n",
" '2007': '2007/2008',\n",
" '2008': '2008/2009',\n",
" '2009': '2009/2010',\n",
" '2010': '2010/2011',\n",
" '2011': '2011/2012',\n",
" '2012': '2012/2013',\n",
" '2013': '2013/2014',\n",
" '2014': '2014/2015',\n",
" '2015': '2015/2016',\n",
" '2016': '2016/2017',\n",
" '2017': '2017/2018',\n",
" '2018': '2018/2019',\n",
" '2019': '2019/2020',\n",
" '2020': '2020/2021',\n",
" '2021': '2021/2022',\n",
" '2022': '2022/2023',\n",
" '2023': '2023/2024',\n",
" }"
]
},
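{
"cell_type": "markdown",
"metadata": {},
"source": [
"The season mapping above could equivalently be generated programmatically; a minimal sketch (the name `dict_seasons_generated` is illustrative, not from the notebook):\n",
"\n",
"```python\n",
"# Map each season-start year to a 'YYYY/YYYY+1' season label\n",
"dict_seasons_generated = {str(year): '{}/{}'.format(year, year + 1)\n",
"                          for year in range(2000, 2024)}\n",
"print(dict_seasons_generated['2017'])  # '2017/2018'\n",
"```"
]
},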
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Lists"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Defined Lists\n",
"\n",
"## Define list of league codes\n",
"df_leagues = pd.read_csv(data_dir_tm + '/reference/tm_leagues_comps.csv')\n",
"lst_league_codes = df_leagues['league_code'].to_numpy().tolist()\n",
"\n",
"## Define list of 'Big 5' European Leagues and MLS codes\n",
"lst_big5_mls_league_codes = ['GB1', 'FR1', 'L1', 'IT1', 'ES1', 'MLS1']"
]
},
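{
"cell_type": "markdown",
"metadata": {},
"source": [
"These league-code lists are later used to filter DataFrames; a minimal sketch with a toy stand-in for the leagues reference CSV (the toy values and `df_leagues_demo` name are illustrative, not from the notebook):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for the leagues reference data read above\n",
"df_leagues_demo = pd.DataFrame({'league_code': ['GB1', 'FR1', 'L1', 'NL1'],\n",
"                                'league_name': ['Premier League', 'Ligue 1', 'Bundesliga', 'Eredivisie']})\n",
"\n",
"# Restrict to the 'Big 5'/MLS subset defined above\n",
"lst_big5_mls_league_codes = ['GB1', 'FR1', 'L1', 'IT1', 'ES1', 'MLS1']\n",
"df_subset = df_leagues_demo[df_leagues_demo['league_code'].isin(lst_big5_mls_league_codes)]\n",
"print(df_subset['league_code'].tolist())  # ['GB1', 'FR1', 'L1']\n",
"```"
]
},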
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Notebook Settings"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_columns', None)"
]
},
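{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pd.set_option` changes the display setting globally for the session; for a temporary override, pandas also provides `option_context`, which reverts the setting on exiting the block:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Temporarily show at most 5 columns inside the block, then revert\n",
"with pd.option_context('display.max_columns', 5):\n",
"    print(pd.get_option('display.max_columns'))  # 5\n",
"```"
]
},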
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
    "<a id='section2'></a>\n",
"\n",
"## 2. Project Brief\n",
    "This Jupyter notebook is part of a series of notebooks to scrape, parse, engineer, unify, and then model the data, culminating in an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to identify under- and over-performing players by comparing their on-the-pitch output against their transfer fees and wages.\n",
"\n",
    "This particular notebook is one of several data engineering notebooks that clean player valuation data from [TransferMarkt](https://www.transfermarkt.co.uk/) using [pandas](http://pandas.pydata.org/).\n",
"\n",
    "[TransferMarkt](https://www.transfermarkt.co.uk/) is a German-based website owned by [Axel Springer](https://www.axelspringer.com/en/) and is the leading website for the football transfer market. The website posts football-related data, including scores and results, football news, transfer rumours and, most usefully for us, calculated estimates of the market values of teams and individual players.\n",
"\n",
    "To read more about how these estimations are made, [Beyond crowd judgments: Data-driven estimation of market value in association football](https://www.sciencedirect.com/science/article/pii/S0377221717304332) by Oliver Müller, Alexander Simons, and Markus Weinmann does an excellent job of explaining how the estimations are made and their level of accuracy.\n",
"\n",
    "This notebook, along with the other notebooks in this project's workflow, is shown in the following diagram:\n",
"\n",
"![roadmap](../../img/football_analytics_data_roadmap.png)\n",
"\n",
"Links to these notebooks in the [`football_analytics`](https://github.com/eddwebster/football_analytics) GitHub repository can be found at the following:\n",
"* [Webscraping](https://github.com/eddwebster/football_analytics/tree/master/notebooks/1_data_scraping)\n",
" + [FBref Player Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Player%20Stats%20Web%20Scraping.ipynb)\n",
" + [TransferMarket Player Bio and Status Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Bio%20and%20Status%20Web%20Scraping.ipynb)\n",
" + [TransferMarket Player Valuation Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Valuation%20Web%20Scraping.ipynb)\n",
" + [TransferMarkt Player Recorded Transfer Fees Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Webscraping.ipynb)\n",
" + [Capology Player Salary Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/Capology%20Player%20Salary%20Web%20Scraping.ipynb)\n",
" + [FBref Team Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Team%20Stats%20Web%20Scraping.ipynb)\n",
"* [Data Parsing](https://github.com/eddwebster/football_analytics/tree/master/notebooks/2_data_parsing)\n",
" + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/2_data_parsing/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)\n",
"* [Data Engineering](https://github.com/eddwebster/football_analytics/tree/master/notebooks/3_data_engineering)\n",
" + [FBref Player Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Player%20Stats%20Data%20Engineering.ipynb)\n",
" + [TransferMarket Player Bio and Status Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Bio%20and%20Status%20Data%20Engineering.ipynb)\n",
" + [TransferMarket Player Valuation Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Valuation%20Data%20Engineering.ipynb)\n",
" + [TransferMarkt Player Recorded Transfer Fees Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb)\n",
" + [Capology Player Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb)\n",
" + [FBref Team Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Team%20Stats%20Data%20Engineering.ipynb)\n",
" + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)\n",
" + [TransferMarkt Team Recorded Transfer Fee Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Team%20Recorded%20Transfer%20Fee%20Data%20Engineering.ipynb) (aggregated from [TransferMarkt Player Recorded Transfer Fees notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb))\n",
" + [Capology Team Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Team%20Salary%20Data%20Engineering.ipynb) (aggregated from [Capology Player Salary notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb))\n",
"* [Data Unification](https://github.com/eddwebster/football_analytics/tree/master/notebooks/4_data_unification)\n",
" + [Golden ID for Player Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Player%20Level%20Datasets.ipynb)\n",
" + [Golden ID for Team Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Team%20Level%20Datasets.ipynb)\n",
"* [Production Datasets](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets)\n",
" + [Player Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Player%20Performance/Market%20Value%20Dataset.ipynb)\n",
" + [Team Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Team%20Performance/Market%20Value%20Dataset.ipynb)\n",
"* [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling)\n",
" + [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling/Expected%20Transfer%20%20Modeling.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
    "<a id='section3'></a>\n",
"\n",
"## 3. Data Sources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "<a id='section3.1'></a>\n",
"\n",
"### 3.1. Introduction\n",
    "Before conducting our EDA, the data first needs to be imported as a DataFrame in the Data Sources section ([Section 3](#section3)) and then cleaned in the Data Engineering section ([Section 4](#section4)).\n",
    "\n",
    "We'll be using the [pandas](http://pandas.pydata.org/) library to import our data into this notebook as a DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "<a id='section3.2'></a>\n",
    "\n",
    "### 3.2. Data Dictionary\n",
    "The [TransferMarkt](https://www.transfermarkt.co.uk/) dataset has six features (columns) with the following definitions and data types:\n",
    "\n",
    "| Feature | Data type |\n",
    "|------|-----|\n",
    "| `market_value` | int64 |\n",
    "| `club` | object |\n",
    "| `date` | object |\n",
    "| `tm_id` | int64 |\n",
    "| `league_code` | object |\n",
    "| `season` | int64 |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "<a id='section3.3'></a>\n",
"\n",
"### 3.3. Read in Data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Import DataFrame as a CSV file\n",
    "df_tm_market_value_raw = pd.read_csv(data_dir_tm + '/raw/historical_market_values/tm_player_valuations_combined_latest.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "<a id='section3.4'></a>\n",
    "\n",
    "### 3.4. Initial Data Handling\n",
    "Let's assess the quality of the dataset by looking at the first and last rows, using the pandas [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" market_value club date tm_id league_code \\\n",
"0 50000 sc Heerenveen Emmen II Oct 31, 2008 72462 GB1 \n",
"1 200000 sc Heerenveen Emmen II May 25, 2009 72462 GB1 \n",
"2 500000 SC Heerenveen Jan 17, 2011 72462 GB1 \n",
"3 1000000 SC Heerenveen Jun 28, 2011 72462 GB1 \n",
"4 2000000 SC Heerenveen Oct 21, 2011 72462 GB1 \n",
"\n",
" season \n",
"0 2017 \n",
"1 2017 \n",
"2 2017 \n",
"3 2017 \n",
"4 2017 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the first five rows of the raw DataFrame, df_tm_market_value_raw\n",
"df_tm_market_value_raw.head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" market_value club date tm_id league_code \\\n",
"393948 4500000 TSG 1899 Hoffenheim Apr 8, 2020 68033 L1 \n",
"393949 3000000 Eintracht Frankfurt Sep 16, 2020 68033 L1 \n",
"393950 2500000 Eintracht Frankfurt Feb 10, 2021 68033 L1 \n",
"393951 2500000 Eintracht Frankfurt May 25, 2021 68033 L1 \n",
"393952 3500000 Eintracht Frankfurt Jul 15, 2021 68033 L1 \n",
"\n",
" season \n",
"393948 2016 \n",
"393949 2016 \n",
"393950 2016 \n",
"393951 2016 \n",
"393952 2016 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the last five rows of the raw DataFrame, df_tm_market_value_raw\n",
"df_tm_market_value_raw.tail()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(393953, 6)\n"
]
}
],
"source": [
"# Print the shape of the raw DataFrame, df_tm_market_value_raw\n",
"print(df_tm_market_value_raw.shape)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['market_value', 'club', 'date', 'tm_id', 'league_code', 'season'], dtype='object')\n"
]
}
],
"source": [
"# Print the column names of the raw DataFrame, df_tm_market_value_raw\n",
"print(df_tm_market_value_raw.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "The dataset has six features (columns). Full details of these attributes can be found in the [Data Dictionary](#section3.2)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"market_value int64\n",
"club object\n",
"date object\n",
"tm_id int64\n",
"league_code object\n",
"season int64\n",
"dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Data types of the features of the raw DataFrame, df_tm_market_value_raw\n",
"df_tm_market_value_raw.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Three of the six columns (`club`, `date` and `league_code`) have the `object` data type, while the remaining three (`market_value`, `tm_id` and `season`) are `int64`. Full details of these attributes and their data types can be found in the [Data Dictionary](#section3.2)."
]
},
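{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `date` column is currently stored as strings such as `Oct 31, 2008`; converting it to a proper datetime is covered later in the Data Engineering section, but a minimal sketch of the conversion (using toy values, not the full dataset):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Dates in the raw data look like 'Oct 31, 2008'; parse with an explicit format\n",
"s = pd.Series(['Oct 31, 2008', 'May 25, 2009'])\n",
"parsed = pd.to_datetime(s, format='%b %d, %Y')\n",
"print(parsed.dt.year.tolist())  # [2008, 2009]\n",
"```"
]
},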
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
    "<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 393953 entries, 0 to 393952\n",
"Data columns (total 6 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 market_value 393953 non-null int64 \n",
" 1 club 393953 non-null object\n",
" 2 date 389883 non-null object\n",
" 3 tm_id 393953 non-null int64 \n",
" 4 league_code 393953 non-null object\n",
" 5 season 393953 non-null int64 \n",
"dtypes: int64(3), object(3)\n",
"memory usage: 18.0+ MB\n"
]
}
],
"source": [
"# Info for the raw DataFrame, df_tm_market_value_raw\n",
"df_tm_market_value_raw.info()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"