{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Webscraping of TransferMarkt Data\n",
"##### Notebook to scrape raw data using [Beautifulsoup](https://pypi.org/project/beautifulsoup4/) from [TransferMarkt](https://www.transfermarkt.co.uk/).\n",
"\n",
"### By [Edd Webster](https://www.twitter.com/eddwebster)\n",
"Last updated: 31/08/2020"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![title](../../../img/transfermarkt-logo-banner.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click [here](#section5) to jump straight to the Exploratory Data Analysis section and skip the [Task Brief](#section2), [Data Sources](#section3), and [Data Engineering](#section4) sections. Or click [here](#section6) to jump straight to the Conclusion."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"This notebook scrapes data for player valuations using [Beautifulsoup](https://pypi.org/project/beautifulsoup4/) from [TransferMarkt](https://www.transfermarkt.co.uk/) using [pandas](http://pandas.pydata.org/) for data maniuplation through DataFrames, [Beautifulsoup](https://pypi.org/project/beautifulsoup4/) for webscraping.\n",
"\n",
"For more information about this notebook and the author, I'm available through all the following channels:\n",
"* [eddwebster.com](https://www.eddwebster.com/),\n",
"* edd.j.webster@gmail.com,\n",
"* [@eddwebster](https://www.twitter.com/eddwebster),\n",
"* [LinkedIn.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/),\n",
"* [GitHub/eddwebster](https://github.com/eddwebster/),\n",
"* [Kaggle.com/eddwebster](https://www.kaggle.com/eddwebster), and\n",
"* [HackerRank.com/eddwebster](https://www.hackerrank.com/eddwebster)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/fifa-league) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/fifa-league/blob/master/FIFA%2020%20Fantasy%20Football%20League%20using%20TransferMarkt%20Player%20Valuations.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook Contents\n",
"1. [Notebook Dependencies](#section1) \n",
"2. [Project Brief](#section2) \n",
"3. [Data Sources](#section3) \n",
" 1. [Introduction](#section3.1) \n",
" 2. [Data Dictionary](#section3.2) \n",
" 3. [Creating the DataFrame](#section3.3) \n",
" 4. [Initial Data Handling](#section3.4) \n",
" 5. [Export the Raw DataFrame](#section3.5) \n",
"4. [Data Engineering](#section4) \n",
" 1. [Introduction](#section4.1) \n",
" 2. [Columns of Interest](#section4.2) \n",
" 3. [String Cleaning](#section4.3) \n",
" 4. [Converting Data Types](#section4.4) \n",
" 5. [Export the Engineered DataFrame](#section4.5) \n",
"5. [Exploratory Data Analysis (EDA)](#section5) \n",
" 1. [...](#section5.1) \n",
" 2. [...](#section5.2) \n",
" 3. [...](#section5.3) \n",
"6. [Summary](#section6) \n",
"7. [Next Steps](#section7) \n",
"8. [Bibliography](#section8) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Notebook Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:\n",
"* [`Jupyter notebooks`](https://jupyter.org/) for this notebook environment with which this project is presented;\n",
"* [`NumPy`](http://www.numpy.org/) for multidimensional array computing;\n",
"* [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation;\n",
"* `tqdm` for a clean progress bar;\n",
"* `requests` for executing HTTP requests;\n",
"* [`Beautifulsoup`](https://pypi.org/project/beautifulsoup4/) for web scraping; and\n",
"* [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations;\n",
"\n",
"All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/)."
]
},
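{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since BeautifulSoup is the one listed dependency not covered by the Conda distribution, it can be installed directly from a notebook cell. A minimal sketch (a one-off setup step):\n",
"\n",
"```python\n",
"# Install beautifulsoup4 into the kernel's environment\n",
"%pip install beautifulsoup4\n",
"```"
]
},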
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import Libraries and Modules"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Setup Complete\n"
]
}
],
"source": [
"# Python ≥3.5 (ideally)\n",
"import platform\n",
"import sys, getopt\n",
"assert sys.version_info >= (3, 5)\n",
"import csv\n",
"\n",
"# Import Dependencies\n",
"%matplotlib inline\n",
"\n",
"# Math Operations\n",
"import numpy as np\n",
"from math import pi\n",
"\n",
"# Datetime\n",
"import datetime\n",
"from datetime import date\n",
"import time\n",
"\n",
"# Data Preprocessing\n",
"import pandas as pd # version 1.0.3\n",
"import os # used to read the csv filenames\n",
"import re\n",
"import random\n",
"from io import BytesIO\n",
"from pathlib import Path\n",
"\n",
"# Reading directories\n",
"import glob\n",
"import os\n",
"\n",
"# Working with JSON\n",
"import json\n",
"from pandas.io.json import json_normalize\n",
"\n",
"# Web Scraping\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"import re\n",
"\n",
"# Fuzzy Matching - Record Linkage\n",
"import recordlinkage\n",
"import jellyfish\n",
"import numexpr as ne\n",
"\n",
"# Data Visualisation\n",
"import matplotlib as mpl\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"plt.style.use('seaborn-whitegrid')\n",
"import missingno as msno # visually display missing data\n",
"\n",
"# Progress Bar\n",
"from tqdm import tqdm # a clean progress bar library\n",
"\n",
"# Display in Jupyter\n",
"from IPython.display import Image, YouTubeVideo\n",
"from IPython.core.display import HTML\n",
"\n",
"# Ignore Warnings\n",
"import warnings\n",
"warnings.filterwarnings(action=\"ignore\", message=\"^internal gelsd\")\n",
"\n",
"print('Setup Complete')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python: 3.7.6\n",
"NumPy: 1.18.1\n",
"pandas: 1.0.1\n",
"matplotlib: 3.1.3\n",
"Seaborn: 0.10.0\n"
]
}
],
"source": [
"# Python / module versions used here for reference\n",
"print('Python: {}'.format(platform.python_version()))\n",
"print('NumPy: {}'.format(np.__version__))\n",
"print('pandas: {}'.format(pd.__version__))\n",
"print('matplotlib: {}'.format(mpl.__version__))\n",
"print('Seaborn: {}'.format(sns.__version__))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Variables"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Define today's date\n",
"today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defined Filepaths"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Set up initial paths to subfolders\n",
"base_dir = os.path.join('..', '..', )\n",
"data_dir = os.path.join(base_dir, 'data')\n",
"data_dir_fbref = os.path.join(base_dir, 'data', 'fbref')\n",
"data_dir_tm = os.path.join(base_dir, 'data', 'tm')\n",
"img_dir = os.path.join(base_dir, 'img')\n",
"fig_dir = os.path.join(base_dir, 'img', 'fig')\n",
"video_dir = os.path.join(base_dir, 'video')"
]
},
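{
"cell_type": "markdown",
"metadata": {},
"source": [
"Later cells export CSV files to `data/tm/raw/`. When running this notebook from a fresh clone of the repository, that folder may not exist yet; a minimal sketch (assuming the directory layout defined above) to create it up front:\n",
"\n",
"```python\n",
"# Ensure the TransferMarkt raw data folder exists before any CSV export\n",
"os.makedirs(os.path.join(data_dir_tm, 'raw'), exist_ok=True)\n",
"```"
]
},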
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Project Brief\n",
"This Jupyter notebook explores how to scrape football data from [TransferMarkt](https://www.transfermarkt.co.uk/), using [pandas](http://pandas.pydata.org/) for data maniuplation through DataFrames and [Beautifulsoup](https://pypi.org/project/beautifulsoup4/) for webscraping.\n",
"\n",
"The data of player values produced in this notebook is exported to CSV. This data can be further analysed in Python, joined to other datasets, or explored using Tableau, PowerBI, Microsoft Excel."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Data Sources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1. Introduction\n",
"[TransferMarkt](https://www.transfermarkt.co.uk/) is a German-based website owned by [Axel Springer](https://www.axelspringer.com/en/) and is the leading website for the football transfer market. The website posts football related data, including: scores and results, football news, transfer rumours, and most usefully for us - calculated estimates ofthe market values for teams and individual players.\n",
"\n",
"To read more about how these estimations are made, [Beyond crowd judgments: Data-driven estimation of market value in association football](https://www.sciencedirect.com/science/article/pii/S0377221717304332) by Oliver Müllera, Alexander Simons, and Markus Weinmann does an excellent job of explaining how the estimations are made and their level of accuracy.\n",
"\n",
"Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section [Section 3](#section3) and Cleaned in the Data Engineering section [Section 4](#section4).\n",
"\n",
"We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2. Data Dictionaries\n",
"The [TransferMarkt](https://www.transfermarkt.co.uk/) dataset has six features (columns) with the following definitions and data types:\n",
"\n",
"| Feature | Data type |\n",
"|------|-----|\n",
"| `position_number` | object |\n",
"| `position_description` | object |\n",
"| `name` | object |\n",
"| `dob` | object |\n",
"| `nationality` | object |\n",
"| `value` | object |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.3. Creating the DataFrame - scraping the data\n",
"Before scraping data from [TransferMarkt](https://www.transfermarkt.co.uk/), we need to look at the top five leagues that we wish to scrape.\n",
"\n",
"The web scraper for [TransferMarkt](https://www.transfermarkt.co.uk/) is made up of two parts:\n",
"1. In the first part, the scraper takes the webpages for each of the individual leagues e.g. The Premier League, and extract the hyperlinks to the pages of all the individual teams in the league table.\n",
"2. In the second part the script, the webscraper uses the list of invidual teams hyperlinks collected in part 1 to then collect the hyperlinks for each of the players for those teams. From this, the scraper can then extract the information we need for each of these players.\n",
"\n",
"This information collected for all the players is converted to a [pandas](http://pandas.pydata.org/) DataFrame from which we can view and manipulate the data.\n",
"\n",
"An example webpage for a football league is the following: https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/GB1/plus/?saison_id=2019. As we can see, between the subdirectory path of `'/wettbewerb/'` and the `'/plus/'`, there is a 3 or 4 digit code. For The Premier League, the code is GB1. \n",
"\n",
"In order to scrape the webpages from [TransferMarkt](https://www.transfermarkt.co.uk/), the codes of the top five leagues need to be recorded from [TransferMarkt](https://www.transfermarkt.co.uk/), which are the following:\n",
"\n",
"| League Name on FIFA | Country | Corresponding [TransferMarkt](https://www.transfermarkt.co.uk/) League Code |\n",
"|------|-----|-----|\n",
"| LaLiga Santander | Spain | ES1 |\n",
"| Ligue 1 Conforama | France | FR1 |\n",
"| Premier League | England | GB1 |\n",
"| Serie A TIM | Italy | IT1 |\n",
"| Bundesliga | Germany | L1 |"
]
},
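{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate, the league overview URLs can be built from the codes in the table above. A minimal sketch, using the `saison_id` for the 2019/20 season:\n",
"\n",
"```python\n",
"# Build one league overview URL per TransferMarkt league code\n",
"league_codes = ['ES1', 'FR1', 'GB1', 'IT1', 'L1']\n",
"base_url = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id={}'\n",
"league_urls = [base_url.format(code, 2019) for code in league_codes]\n",
"```"
]
},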
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See: https://fcpython.com/blog/scraping-lists-transfermarkt-saving-images"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from os.path import basename"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"headers = {'User-Agent': \n",
" 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"#Process League Table\n",
"page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'\n",
"tree = requests.get(page, headers = headers)\n",
"soup = BeautifulSoup(tree.content, 'html.parser')\n",
"\n",
"#Create an empty list to assign these values to\n",
"teamLinks = []\n",
"\n",
"#Extract all links with the correct CSS selector\n",
"links = soup.select(\"a.vereinprofil_tooltip\")\n",
"\n",
"#We need the location that the link is pointing to, so for each link, take the link location. \n",
"#Additionally, we only need the links in locations 1,3,5,etc. of our list, so loop through those only\n",
"for i in range(1,41,2):\n",
" teamLinks.append(links[i].get(\"href\"))\n",
" \n",
"#For each location that we have taken, add the website before it - this allows us to call it later\n",
"for i in range(len(teamLinks)):\n",
" teamLinks[i] = \"https://www.transfermarkt.co.uk\"+teamLinks[i]"
]
},
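{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `range(1, 41, 2)` indexing above relies on the page listing exactly 20 teams, each linked twice. An equivalent approach that is less sensitive to the page layout is to collect every link and deduplicate while preserving order; a sketch:\n",
"\n",
"```python\n",
"# Deduplicate the team links while preserving their order on the page\n",
"teamLinks = list(dict.fromkeys(\n",
"    'https://www.transfermarkt.co.uk' + link.get('href') for link in links))\n",
"```"
]
},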
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"#Create an empty list for our player links to go into\n",
"playerLinks = []\n",
"\n",
"#Run the scraper through each of our 20 team links\n",
"for i in range(len(teamLinks)):\n",
"\n",
" #Download and process the team page\n",
" page = teamLinks[i]\n",
" tree = requests.get(page, headers = headers)\n",
" soup = BeautifulSoup(tree.content, 'html.parser')\n",
"\n",
" #Extract all links\n",
" links = soup.select(\"a.spielprofil_tooltip\")\n",
" \n",
" #For each link, extract the location that it is pointing to\n",
" for j in range(len(links)):\n",
" playerLinks.append(links[j].get(\"href\"))\n",
"\n",
" #Add the location to the end of the transfermarkt domain to make it ready to scrape\n",
" for j in range(len(playerLinks)):\n",
" playerLinks[j] = \"https://www.transfermarkt.co.uk\"+playerLinks[j]\n",
"\n",
" #The page list the players more than once - let's use list(set(XXX)) to remove the duplicates\n",
" playerLinks = list(set(playerLinks))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"568"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(playerLinks)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
],
"source": [
"for i in range(len(playerLinks)):\n",
"\n",
" #Take site and structure html\n",
" page = playerLinks[i]\n",
" tree = requests.get(page, headers=headers)\n",
" soup = BeautifulSoup(tree.content, 'html.parser')\n",
"\n",
"\n",
" #Find image and save it with the player's name\n",
" #Find the player's name\n",
" name = soup.find_all(\"h1\")\n",
" \n",
" #Use the name to call the image\n",
" image = soup.find_all(\"img\",{\"title\":name[0].text})\n",
" \n",
" #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.\n",
" src = image[0].get('src').split(\"?lm\")[0]\n",
"\n",
" #Save the image under the player's name\n",
" with open(name[0].text+\".jpg\",\"wb\") as f:\n",
" f.write(requests.get(src).content)"
]
},
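{
"cell_type": "markdown",
"metadata": {},
"source": [
"When looping over several hundred player pages, it is courteous (and often more reliable) to pause between requests. A sketch of how a delay could be added to the loop above, assuming a one-second pause is acceptable:\n",
"\n",
"```python\n",
"import time\n",
"\n",
"for link in playerLinks:\n",
"    response = requests.get(link, headers=headers)\n",
"    # ... parse response.content and save the image as above ...\n",
"    time.sleep(1)  # pause between requests to avoid hammering the server\n",
"```"
]
},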
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
" 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# List of leagues by code for which we want to scrape player data - Big 5 European leagues\n",
"lst_leagues = ['ES1', 'FR1', 'GB1', 'IT1', 'L1']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Assign season by year to season variable e.g. 2014/15 season = 2014\n",
"season = '2020' # 2020/21 season"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from os.path import basename"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fetching Links from ES1\n",
"Fetching Links from FR1\n",
"Fetching Links from GB1\n",
"Fetching Links from IT1\n",
"Fetching Links from L1\n",
"Collected 98 Links\n"
]
}
],
"source": [
"### Create empty list of links\n",
"playerLinks = []\n",
"\n",
"url = \"https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020\"\n",
"\n",
"### For loop to iteratre through each league page to collect the team links\n",
"for league in lst_leagues:\n",
" print(f'Fetching Links from {league}')\n",
" r = requests.Session() .get(url.format(league), headers=headers)\n",
" soup = BeautifulSoup(r.content, 'html.parser')\n",
" link = [f\"{url[:31]}{item.next_element.get('href')}\" for item in soup.findAll(\n",
" \"td\", class_=\"hauptlink no-border-links hide-for-small hide-for-pad\")]\n",
" playerLinks.extend(link)\n",
"\n",
"### Print statement for the number of team links found\n",
"print(f'Collected {len(playerLinks)} Links')"
]
},
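{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `url[:31]` slice above works because the first 31 characters of the URL template are exactly the TransferMarkt domain, which is then prepended to each relative `href`:\n",
"\n",
"```python\n",
"url = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020'\n",
"url[:31]  # 'https://www.transfermarkt.co.uk'\n",
"```"
]
},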
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"98"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(playerLinks)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"ename": "IndexError",
"evalue": "list index out of range",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 16\u001b[0m \u001b[0;31m#Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 17\u001b[0;31m \u001b[0msrc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mimage\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'src'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"?lm\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 18\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 19\u001b[0m \u001b[0;31m#Save the image under the player's name\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mIndexError\u001b[0m: list index out of range"
]
}
],
"source": [
"for i in range(len(playerLinks)):\n",
"\n",
" #Take site and structure html\n",
" page = playerLinks[i]\n",
" tree = requests.get(page, headers=headers)\n",
" soup = BeautifulSoup(tree.content, 'html.parser')\n",
"\n",
"\n",
" #Find image and save it with the player's name\n",
" #Find the player's name\n",
" name = soup.find_all(\"h1\")\n",
" \n",
" #Use the name to call the image\n",
" image = soup.find_all(\"img\",{\"title\":name[0].text})\n",
" \n",
" #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.\n",
" src = image[0].get('src').split(\"?lm\")[0]\n",
"\n",
" #Save the image under the player's name\n",
" with open(name[0].text+\".jpg\",\"wb\") as f:\n",
" f.write(requests.get(src).content)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fetching Links from ES1\n",
"Fetching Links from FR1\n",
"Fetching Links from GB1\n",
"Fetching Links from IT1\n",
"Fetching Links from L1\n",
"Collected 98 Links\n",
"Extracting Page# 1\n",
"Extracting Page# 2\n",
"Extracting Page# 3\n",
"Extracting Page# 4\n",
"Extracting Page# 5\n",
"Extracting Page# 6\n",
"Extracting Page# 7\n",
"Extracting Page# 8\n",
"Extracting Page# 9\n",
"Extracting Page# 10\n",
"Extracting Page# 11\n",
"Extracting Page# 12\n",
"Extracting Page# 13\n",
"Extracting Page# 14\n",
"Extracting Page# 15\n",
"Extracting Page# 16\n",
"Extracting Page# 17\n",
"Extracting Page# 18\n",
"Extracting Page# 19\n",
"Extracting Page# 20\n",
"Extracting Page# 21\n",
"Extracting Page# 22\n",
"Extracting Page# 23\n",
"Extracting Page# 24\n",
"Extracting Page# 25\n",
"Extracting Page# 26\n",
"Extracting Page# 27\n",
"Extracting Page# 28\n",
"Extracting Page# 29\n",
"Extracting Page# 30\n",
"Extracting Page# 31\n",
"Extracting Page# 32\n",
"Extracting Page# 33\n",
"Extracting Page# 34\n",
"Extracting Page# 35\n",
"Extracting Page# 36\n",
"Extracting Page# 37\n",
"Extracting Page# 38\n",
"Extracting Page# 39\n",
"Extracting Page# 40\n",
"Extracting Page# 41\n",
"Extracting Page# 42\n",
"Extracting Page# 43\n",
"Extracting Page# 44\n",
"Extracting Page# 45\n",
"Extracting Page# 46\n",
"Extracting Page# 47\n",
"Extracting Page# 48\n",
"Extracting Page# 49\n",
"Extracting Page# 50\n",
"Extracting Page# 51\n",
"Extracting Page# 52\n",
"Extracting Page# 53\n",
"Extracting Page# 54\n",
"Extracting Page# 55\n",
"Extracting Page# 56\n",
"Extracting Page# 57\n",
"Extracting Page# 58\n",
"Extracting Page# 59\n",
"Extracting Page# 60\n",
"Extracting Page# 61\n",
"Extracting Page# 62\n",
"Extracting Page# 63\n",
"Extracting Page# 64\n",
"Extracting Page# 65\n",
"Extracting Page# 66\n",
"Extracting Page# 67\n",
"Extracting Page# 68\n",
"Extracting Page# 69\n",
"Extracting Page# 70\n",
"Extracting Page# 71\n",
"Extracting Page# 72\n",
"Extracting Page# 73\n",
"Extracting Page# 74\n",
"Extracting Page# 75\n",
"Extracting Page# 76\n",
"Extracting Page# 77\n",
"Extracting Page# 78\n",
"Extracting Page# 79\n",
"Extracting Page# 80\n",
"Extracting Page# 81\n",
"Extracting Page# 82\n",
"Extracting Page# 83\n",
"Extracting Page# 84\n",
"Extracting Page# 85\n",
"Extracting Page# 86\n",
"Extracting Page# 87\n",
"Extracting Page# 88\n",
"Extracting Page# 89\n",
"Extracting Page# 90\n",
"Extracting Page# 91\n",
"Extracting Page# 92\n",
"Extracting Page# 93\n",
"Extracting Page# 94\n",
"Extracting Page# 95\n",
"Extracting Page# 96\n",
"Extracting Page# 97\n",
"Extracting Page# 98\n"
]
},
{
"ename": "AttributeError",
"evalue": "'str' object has no attribute 'text'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 52\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mlinks\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 53\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 54\u001b[0;31m \u001b[0mmain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 55\u001b[0m \u001b[0;31m#main(f'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id={season}')\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 56\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m\u001b[0m in \u001b[0;36mmain\u001b[0;34m(url)\u001b[0m\n\u001b[1;32m 39\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 40\u001b[0m \u001b[0;31m#Use the name to call the image\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 41\u001b[0;31m \u001b[0mimage\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msoup\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind_all\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"img\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0;34m\"title\"\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtext\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 42\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 43\u001b[0m \u001b[0;31m#Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mAttributeError\u001b[0m: 'str' object has no attribute 'text'"
]
}
],
"source": [
"# Run this script to scrape latest version of this data from TransferMarkt\n",
"\n",
"## Start timer\n",
"tic = datetime.datetime.now()\n",
"\n",
"\n",
"## Scrape TransferMarkt data\n",
"def main(url):\n",
" with requests.Session() as req:\n",
" \n",
" ### Create empty list of links\n",
" links = []\n",
" \n",
" ### For loop to iteratre through each league page to collect the team links\n",
" for league in lst_leagues:\n",
" print(f'Fetching Links from {league}')\n",
" r = req.get(url.format(league), headers=headers)\n",
" soup = BeautifulSoup(r.content, 'html.parser')\n",
" link = [f\"{url[:31]}{item.next_element.get('href')}\" for item in soup.findAll(\n",
" \"td\", class_=\"hauptlink no-border-links hide-for-small hide-for-pad\")]\n",
" links.extend(link)\n",
" \n",
" ### Print statement for the number of team links found\n",
" print(f'Collected {len(links)} Links')\n",
" \n",
" \n",
" \n",
" \"\"\"\n",
" ### Create empty list of goals\n",
" goals = []\n",
" \n",
" ### For loop to iteratre through each goal to collect each players information and assign to a DF\n",
" for num, link in enumerate(links):\n",
" print(f\"Extracting Page# {num +1}\")\n",
" r = req.get(link, headers=headers)\n",
" soup = BeautifulSoup(r.content, 'html.parser')\n",
" target = soup.find(\"table\", class_=\"items\")\n",
" pn = [pn.text for pn in target.select(\"div.rn_nummer\")]\n",
" pos = [pos.text for pos in target.findAll(\"td\", class_=False)]\n",
" name = [name.text for name in target.select(\"td.hide\")]\n",
" dob = [date.find_next(\n",
" \"td\").text for date in target.select(\"td.hide\")]\n",
" nat = [\" / \".join([a.get(\"alt\") for a in nat.find_all_next(\"td\")[1] if a.get(\"alt\")]) for nat in target.findAll(\n",
" \"td\", itemprop=\"athlete\")]\n",
" val = [val.get_text(strip=True)\n",
" for val in target.select('td.rechts.hauptlink')]\n",
" goal = zip(pn, pos, name, dob, nat, val)\n",
" df = pd.DataFrame(goal, columns=[\n",
" 'position_number', 'position_description', 'name', 'dob', 'nationality', 'value'])\n",
" goals.append(df)\n",
" \n",
" ### Concontate the list of goals\n",
" new = pd.concat(goals)\n",
" \n",
" ### Save DataFrame to a CSV\n",
" new.to_csv(data_dir_tm + '/raw/' + f'players_big5_2021_raw_{today}.csv', index=None, header=True)\n",
" \"\"\"\n",
" \n",
"## Call defined function\n",
"main(\"https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020\")\n",
"#main(f'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id={season}')\n",
"\n",
"\n",
"## End timer\n",
"toc = datetime.datetime.now()\n",
"\n",
"\n",
"## Calculate time take\n",
"total_time = (toc-tic).total_seconds()\n",
"print(f'Time taken to scrape the data of all the players for the Big 5 leagues is: {total_time/60:0.2f} minutes.')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Import data as a pandas DataFrame, df_tm_players_big5_2021_raw\n",
"\n",
"## Look for most recent CSV file\n",
"list_of_files = glob.glob(data_dir_tm + '/raw/*') # * means all if need specific format then *.csv\n",
"filepath_latest_tm = max(list_of_files, key=os.path.getctime)\n",
"\n",
"## Load in most recently parsed CSV file\n",
"df_tm_player_top5_2021_raw = pd.read_csv(filepath_latest_tm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.4. Preliminary Data Handling\n",
"Let's quality of the dataset by looking first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
position_number
\n",
"
position_description
\n",
"
name
\n",
"
dob
\n",
"
nationality
\n",
"
value
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1
\n",
"
Goalkeeper
\n",
"
Marc-André ter Stegen
\n",
"
Apr 30, 1992 (28)
\n",
"
Germany
\n",
"
£64.80m
\n",
"
\n",
"
\n",
"
1
\n",
"
13
\n",
"
Goalkeeper
\n",
"
Neto
\n",
"
Jul 19, 1989 (31)
\n",
"
Brazil / Italy
\n",
"
£13.05m
\n",
"
\n",
"
\n",
"
2
\n",
"
26
\n",
"
Goalkeeper
\n",
"
Iñaki Peña
\n",
"
Mar 2, 1999 (21)
\n",
"
Spain
\n",
"
£2.07m
\n",
"
\n",
"
\n",
"
3
\n",
"
15
\n",
"
Centre-Back
\n",
"
Clément Lenglet
\n",
"
Jun 17, 1995 (25)
\n",
"
France
\n",
"
£43.20m
\n",
"
\n",
"
\n",
"
4
\n",
"
23
\n",
"
Centre-Back
\n",
"
Samuel Umtiti
\n",
"
Nov 14, 1993 (26)
\n",
"
France / Cameroon
\n",
"
£28.80m
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" position_number position_description name \\\n",
"0 1 Goalkeeper Marc-André ter Stegen \n",
"1 13 Goalkeeper Neto \n",
"2 26 Goalkeeper Iñaki Peña \n",
"3 15 Centre-Back Clément Lenglet \n",
"4 23 Centre-Back Samuel Umtiti \n",
"\n",
" dob nationality value \n",
"0 Apr 30, 1992 (28) Germany £64.80m \n",
"1 Jul 19, 1989 (31) Brazil / Italy £13.05m \n",
"2 Mar 2, 1999 (21) Spain £2.07m \n",
"3 Jun 17, 1995 (25) France £43.20m \n",
"4 Nov 14, 1993 (26) France / Cameroon £28.80m "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the first 5 rows of the raw DataFrame, df_tm_player_top5_2021_raw\n",
"df_tm_player_top5_2021_raw.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
position_number
\n",
"
position_description
\n",
"
name
\n",
"
dob
\n",
"
nationality
\n",
"
value
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2885
\n",
"
18
\n",
"
Centre-Forward
\n",
"
Sergio Córdova
\n",
"
Aug 9, 1997 (23)
\n",
"
Venezuela
\n",
"
£1.44m
\n",
"
\n",
"
\n",
"
2886
\n",
"
9
\n",
"
Centre-Forward
\n",
"
Fabian Klos
\n",
"
Dec 2, 1987 (32)
\n",
"
Germany
\n",
"
£900Th.
\n",
"
\n",
"
\n",
"
2887
\n",
"
13
\n",
"
Centre-Forward
\n",
"
Sebastian Müller
\n",
"
Jan 23, 2001 (19)
\n",
"
Germany
\n",
"
£270Th.
\n",
"
\n",
"
\n",
"
2888
\n",
"
36
\n",
"
Centre-Forward
\n",
"
Sven Schipplock
\n",
"
Nov 8, 1988 (31)
\n",
"
Germany
\n",
"
£270Th.
\n",
"
\n",
"
\n",
"
2889
\n",
"
39
\n",
"
Centre-Forward
\n",
"
Prince Osei Owusu
\n",
"
Jan 7, 1997 (23)
\n",
"
Germany / Ghana
\n",
"
£225Th.
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" position_number position_description name \\\n",
"2885 18 Centre-Forward Sergio Córdova \n",
"2886 9 Centre-Forward Fabian Klos \n",
"2887 13 Centre-Forward Sebastian Müller \n",
"2888 36 Centre-Forward Sven Schipplock \n",
"2889 39 Centre-Forward Prince Osei Owusu \n",
"\n",
" dob nationality value \n",
"2885 Aug 9, 1997 (23) Venezuela £1.44m \n",
"2886 Dec 2, 1987 (32) Germany £900Th. \n",
"2887 Jan 23, 2001 (19) Germany £270Th. \n",
"2888 Nov 8, 1988 (31) Germany £270Th. \n",
"2889 Jan 7, 1997 (23) Germany / Ghana £225Th. "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Display the last 5 rows of the raw DataFrame, df_tm_player_top5_2021_raw\n",
"df_tm_player_top5_2021_raw.tail()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2890, 6)\n"
]
}
],
"source": [
"# Print the shape of the raw DataFrame, df_tm_player_top5_2021_raw\n",
"print(df_tm_player_top5_2021_raw.shape)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['position_number', 'position_description', 'name', 'dob', 'nationality',\n",
" 'value'],\n",
" dtype='object')\n"
]
}
],
"source": [
"# Print the column names of the raw DataFrame, df_tm_player_top5_2021_raw\n",
"print(df_tm_player_top5_2021_raw.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset has six features (columns). Full details of these attributes can be found in the [Data Dictionary](section3.3.1)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"position_number object\n",
"position_description object\n",
"name object\n",
"dob object\n",
"nationality object\n",
"value object\n",
"dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Data types of the features of the raw DataFrame, df_tm_player_top5_2021_raw\n",
"df_tm_player_top5_2021_raw.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All six of the columns have the object data type. Full details of these attributes and their data types can be found in the [Data Dictionary](section3.3.1)."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 2890 entries, 0 to 2889\n",
"Data columns (total 6 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 position_number 2890 non-null object\n",
" 1 position_description 2890 non-null object\n",
" 2 name 2890 non-null object\n",
" 3 dob 2890 non-null object\n",
" 4 nationality 2890 non-null object\n",
" 5 value 2854 non-null object\n",
"dtypes: object(6)\n",
"memory usage: 135.6+ KB\n"
]
}
],
"source": [
"# Info for the raw DataFrame, df_tm_player_top5_2021_raw\n",
"df_tm_player_top5_2021_raw.info()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
],
"source": [
"# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_player_top5_2021_raw\n",
"msno.matrix(df_tm_player_top5_2021_raw, figsize = (30, 7))"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"value 36\n",
"dtype: int64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Counts of missing values\n",
"tm_null_value_stats = df_tm_player_top5_2021_raw.isnull().sum(axis=0)\n",
"tm_null_value_stats[tm_null_value_stats != 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The visualisation shows us very quickly that there a few missing values in the `value` column, but otherwise the dataset is complete."
]
},
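{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before engineering the data, the rows with a missing `value` can be inspected to check whether they share a pattern; a sketch:\n",
"\n",
"```python\n",
"# Inspect the rows where the market value is missing\n",
"df_tm_player_top5_2021_raw[df_tm_player_top5_2021_raw['value'].isnull()].head()\n",
"```"
]
},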
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Data Engineering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1. Introduction\n",
"Before we answer the questions in the brief through [Exploratory Data Analysis (EDA)](#section5), we'll first need to clean and wrangle the datasets to a form that meet our needs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2. Assign Raw DataFrames to New Engineered DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Assign Raw DataFrame to new Engineered DataFrame\n",
"df_tm_player_top5_2021 = df_tm_player_top5_2021_raw"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2. String Cleaning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Name"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"df_tm_player_top5_2021['name_lower'] = df_tm_player_top5_2021['name'].str.normalize('NFKD')\\\n",
" .str.encode('ascii', errors='ignore')\\\n",
" .str.decode('utf-8')\\\n",
" .str.lower()"
]
},
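{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the same NFKD-normalise / ASCII-encode / lowercase pipeline applied to a single name, as a sketch using the standard library rather than the pandas string accessor:\n",
"\n",
"```python\n",
"import unicodedata\n",
"\n",
"name = 'Marc-André ter Stegen'\n",
"cleaned = (unicodedata.normalize('NFKD', name)\n",
"           .encode('ascii', 'ignore')\n",
"           .decode('utf-8')\n",
"           .lower())\n",
"cleaned  # 'marc-andre ter stegen'\n",
"```"
]
},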
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# First Name Lower\n",
"df_tm_player_top5_2021['firstname_lower'] = df_tm_player_top5_2021['name_lower'].str.rsplit(' ', 0).str[0]\n",
"\n",
"# Last Name Lower\n",
"df_tm_player_top5_2021['lastname_lower'] = df_tm_player_top5_2021['name_lower'].str.rsplit(' ', 1).str[-1]\n",
"\n",
"# First Initial Lower\n",
"df_tm_player_top5_2021['firstinitial_lower'] = df_tm_player_top5_2021['name_lower'].astype(str).str[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### DoB and Age\n",
"The `dob` column is messy and contains both the date of birth as a string and also the age in brackets.\n",
"\n",
"This string cleaning consists of two parts, firstly, to split this apart into their seperate components. However, once the `age` column is created, we will replaced this by determining the current age using the Python [datetime](https://docs.python.org/3/library/datetime.html) module."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# DoB string cleaning to create birth_date and age columns\n",
"df_tm_player_top5_2021[['birth_date', 'age']] = df_tm_player_top5_2021['dob'].str.extract(r'(.+) \\((\\d+)\\)')"
]
},
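{
"cell_type": "markdown",
"metadata": {},
"source": [
"The regular expression captures two groups: everything before the final bracketed number as the birth date, and the digits inside the brackets as the age. A quick standalone check of the pattern with Python's [re](https://docs.python.org/3/library/re.html) module:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# The same pattern as above, applied to a single sample dob string\n",
"match = re.match(r'(.+) \\((\\d+)\\)', 'Apr 30, 1992 (28)')\n",
"print(match.group(1))    # -> 'Apr 30, 1992'\n",
"print(match.group(2))    # -> '28'"
]
},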
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Nationality\n",
"For the nationality, some of the players have duel nationality.\n",
"\n",
"For example, [Claudio Pizarro](https://www.transfermarkt.co.uk/claudio-pizarro/profil/spieler/532) is a Peruvian-born player who has has made 85 appearances for Peru, scoring 20 goals. However, his citizenship according to [TransferMarkt](https://www.transfermarkt.co.uk/) is 'Peru / Italy'. For our needs, we only want to know the country the player is eligible to play for, not their full heritage which from observations is always the first part of the string. We'll therefore be discarding anything after the first space in the string to form a new `playing_country` column."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Take the first nationality i.e. text before the first space, ex. 'Peru / Italy'\n",
"df_tm_player_top5_2021['playing_country'] = df_tm_player_top5_2021['nationality'].str.split(' /').str[0]"
]
},
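{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick standalone check of the split logic on sample nationality strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The split on ' /' keeps only the first listed country\n",
"print('Peru / Italy'.split(' /')[0])    # -> 'Peru'\n",
"print('Germany'.split(' /')[0])    # -> 'Germany' (single nationalities pass through unchanged)"
]
},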
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Value\n",
"The values of the players have prefixes (£), commas, spaces, and suffixes (m, k) that need to cleaned and replaced before converting to a numerical value."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Value string cleaning from shortened string value to full numerical value\n",
"\n",
"## Convert 'm' to '000000'\n",
"df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('m','0000')\n",
"\n",
"## Convert 'k' to '000'\n",
"df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('k','000')\n",
"\n",
"## Convert 'Th' to '000'\n",
"df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('Th','000')\n",
"\n",
"## Remove '.'\n",
"df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('.','')\n",
"\n",
"## Remove '£' sign\n",
"df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('£','')\n",
"\n",
"## Remove '-'\n",
"df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('-','')\n",
"\n",
"## Remove '  '\n",
"df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('  ','')\n",
"\n",
"## Remove gaps\n",
"df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace(' ','')"
]
},
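{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the chained replacements above can also be expressed as a single-pass parser, which may be easier to extend if new suffixes appear. A minimal sketch (the helper name and the suffix assumptions are ours, based on the formats observed above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def parse_tm_value(value_str):\n",
"    # Parse a TransferMarkt value string such as '£64.80m' or '£225Th.' into an integer number of pounds\n",
"    if not isinstance(value_str, str) or value_str.strip() in ('', '-'):\n",
"        return None\n",
"    s = value_str.replace('£', '').replace(',', '').strip()\n",
"    if s.lower().endswith('m'):\n",
"        return int(float(s[:-1]) * 1_000_000)\n",
"    if s.lower().endswith('th.'):\n",
"        return int(float(s[:-3]) * 1_000)\n",
"    if s.lower().endswith('k'):\n",
"        return int(float(s[:-1]) * 1_000)\n",
"    return int(float(s))\n",
"\n",
"print(parse_tm_value('£64.80m'))    # -> 64800000\n",
"print(parse_tm_value('£225Th.'))    # -> 225000"
]
},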
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Position"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Goalkeeper', 'Centre-Back', 'Left-Back', 'Right-Back',\n",
" 'Defensive Midfield', 'Central Midfield', 'Attacking Midfield',\n",
" 'Left Winger', 'Right Winger', 'Second Striker', 'Centre-Forward',\n",
" 'Right Midfield', 'Left Midfield', 'Midfielder'], dtype=object)"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# List unique values in the df_tm_player_top5_2021['position_description'] column\n",
"df_tm_player_top5_2021['position_description'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"dict_positions = {\n",
" 'Goalkeeper': 'Goalkeeper',\n",
" 'Centre-Back': 'Defender',\n",
" 'Left-Back': 'Defender',\n",
" 'Right-Back': 'Defender',\n",
" 'Defensive Midfield': 'Midfielder',\n",
" 'Central Midfield': 'Midfielder',\n",
" 'Attacking Midfield': 'Midfielder',\n",
" 'Left Winger': 'Forward',\n",
" 'Right Winger': 'Forward',\n",
" 'Second Striker': 'Forward',\n",
" 'Centre-Forward': 'Forward',\n",
" 'Right Midfield': 'Midfielder',\n",
" 'Left Midfield': 'Midfielder',\n",
" 'Midfielder': 'Midfielder'\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"df_tm_player_top5_2021['position_cleaned'] = df_tm_player_top5_2021['position_description'].map(dict_positions)"
]
},
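{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `.map()` returns NaN for any key missing from the mapping dictionary, it is worth confirming that every position was covered:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Any position_description not covered by dict_positions will surface here as an unmapped value\n",
"mask_unmapped = df_tm_player_top5_2021['position_cleaned'].isnull()\n",
"print(df_tm_player_top5_2021.loc[mask_unmapped, 'position_description'].unique())    # expect an empty array"
]
},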
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3. Converting Data Types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### DoB\n",
"First we need to convert the `dob` column from the `object` data type to `datetime64[ns]`, again using the [.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) method."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
position_number
\n",
"
position_description
\n",
"
name
\n",
"
dob
\n",
"
nationality
\n",
"
value
\n",
"
name_lower
\n",
"
firstname_lower
\n",
"
lastname_lower
\n",
"
firstinitial_lower
\n",
"
birth_date
\n",
"
age
\n",
"
playing_country
\n",
"
position_cleaned
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1
\n",
"
Goalkeeper
\n",
"
Marc-André ter Stegen
\n",
"
Apr 30, 1992 (28)
\n",
"
Germany
\n",
"
64800000
\n",
"
marc-andre ter stegen
\n",
"
marc-andre
\n",
"
stegen
\n",
"
m
\n",
"
Apr 30, 1992
\n",
"
28
\n",
"
Germany
\n",
"
Goalkeeper
\n",
"
\n",
"
\n",
"
1
\n",
"
13
\n",
"
Goalkeeper
\n",
"
Neto
\n",
"
Jul 19, 1989 (31)
\n",
"
Brazil / Italy
\n",
"
13050000
\n",
"
neto
\n",
"
neto
\n",
"
neto
\n",
"
n
\n",
"
Jul 19, 1989
\n",
"
31
\n",
"
Brazil
\n",
"
Goalkeeper
\n",
"
\n",
"
\n",
"
2
\n",
"
26
\n",
"
Goalkeeper
\n",
"
Iñaki Peña
\n",
"
Mar 2, 1999 (21)
\n",
"
Spain
\n",
"
2070000
\n",
"
inaki pena
\n",
"
inaki
\n",
"
pena
\n",
"
i
\n",
"
Mar 2, 1999
\n",
"
21
\n",
"
Spain
\n",
"
Goalkeeper
\n",
"
\n",
"
\n",
"
3
\n",
"
15
\n",
"
Centre-Back
\n",
"
Clément Lenglet
\n",
"
Jun 17, 1995 (25)
\n",
"
France
\n",
"
43200000
\n",
"
clement lenglet
\n",
"
clement
\n",
"
lenglet
\n",
"
c
\n",
"
Jun 17, 1995
\n",
"
25
\n",
"
France
\n",
"
Defender
\n",
"
\n",
"
\n",
"
4
\n",
"
23
\n",
"
Centre-Back
\n",
"
Samuel Umtiti
\n",
"
Nov 14, 1993 (26)
\n",
"
France / Cameroon
\n",
"
28800000
\n",
"
samuel umtiti
\n",
"
samuel
\n",
"
umtiti
\n",
"
s
\n",
"
Nov 14, 1993
\n",
"
26
\n",
"
France
\n",
"
Defender
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2885
\n",
"
18
\n",
"
Centre-Forward
\n",
"
Sergio Córdova
\n",
"
Aug 9, 1997 (23)
\n",
"
Venezuela
\n",
"
1440000
\n",
"
sergio cordova
\n",
"
sergio
\n",
"
cordova
\n",
"
s
\n",
"
Aug 9, 1997
\n",
"
23
\n",
"
Venezuela
\n",
"
Forward
\n",
"
\n",
"
\n",
"
2886
\n",
"
9
\n",
"
Centre-Forward
\n",
"
Fabian Klos
\n",
"
Dec 2, 1987 (32)
\n",
"
Germany
\n",
"
900000
\n",
"
fabian klos
\n",
"
fabian
\n",
"
klos
\n",
"
f
\n",
"
Dec 2, 1987
\n",
"
32
\n",
"
Germany
\n",
"
Forward
\n",
"
\n",
"
\n",
"
2887
\n",
"
13
\n",
"
Centre-Forward
\n",
"
Sebastian Müller
\n",
"
Jan 23, 2001 (19)
\n",
"
Germany
\n",
"
270000
\n",
"
sebastian muller
\n",
"
sebastian
\n",
"
muller
\n",
"
s
\n",
"
Jan 23, 2001
\n",
"
19
\n",
"
Germany
\n",
"
Forward
\n",
"
\n",
"
\n",
"
2888
\n",
"
36
\n",
"
Centre-Forward
\n",
"
Sven Schipplock
\n",
"
Nov 8, 1988 (31)
\n",
"
Germany
\n",
"
270000
\n",
"
sven schipplock
\n",
"
sven
\n",
"
schipplock
\n",
"
s
\n",
"
Nov 8, 1988
\n",
"
31
\n",
"
Germany
\n",
"
Forward
\n",
"
\n",
"
\n",
"
2889
\n",
"
39
\n",
"
Centre-Forward
\n",
"
Prince Osei Owusu
\n",
"
Jan 7, 1997 (23)
\n",
"
Germany / Ghana
\n",
"
225000
\n",
"
prince osei owusu
\n",
"
prince
\n",
"
owusu
\n",
"
p
\n",
"
Jan 7, 1997
\n",
"
23
\n",
"
Germany
\n",
"
Forward
\n",
"
\n",
" \n",
"
\n",
"
2890 rows × 14 columns
\n",
"
"
],
"text/plain": [
" position_number position_description name \\\n",
"0 1 Goalkeeper Marc-André ter Stegen \n",
"1 13 Goalkeeper Neto \n",
"2 26 Goalkeeper Iñaki Peña \n",
"3 15 Centre-Back Clément Lenglet \n",
"4 23 Centre-Back Samuel Umtiti \n",
"... ... ... ... \n",
"2885 18 Centre-Forward Sergio Córdova \n",
"2886 9 Centre-Forward Fabian Klos \n",
"2887 13 Centre-Forward Sebastian Müller \n",
"2888 36 Centre-Forward Sven Schipplock \n",
"2889 39 Centre-Forward Prince Osei Owusu \n",
"\n",
" dob nationality value name_lower \\\n",
"0 Apr 30, 1992 (28) Germany 64800000 marc-andre ter stegen \n",
"1 Jul 19, 1989 (31) Brazil / Italy 13050000 neto \n",
"2 Mar 2, 1999 (21) Spain 2070000 inaki pena \n",
"3 Jun 17, 1995 (25) France 43200000 clement lenglet \n",
"4 Nov 14, 1993 (26) France / Cameroon 28800000 samuel umtiti \n",
"... ... ... ... ... \n",
"2885 Aug 9, 1997 (23) Venezuela 1440000 sergio cordova \n",
"2886 Dec 2, 1987 (32) Germany 900000 fabian klos \n",
"2887 Jan 23, 2001 (19) Germany 270000 sebastian muller \n",
"2888 Nov 8, 1988 (31) Germany 270000 sven schipplock \n",
"2889 Jan 7, 1997 (23) Germany / Ghana 225000 prince osei owusu \n",
"\n",
" firstname_lower lastname_lower firstinitial_lower birth_date age \\\n",
"0 marc-andre stegen m Apr 30, 1992 28 \n",
"1 neto neto n Jul 19, 1989 31 \n",
"2 inaki pena i Mar 2, 1999 21 \n",
"3 clement lenglet c Jun 17, 1995 25 \n",
"4 samuel umtiti s Nov 14, 1993 26 \n",
"... ... ... ... ... .. \n",
"2885 sergio cordova s Aug 9, 1997 23 \n",
"2886 fabian klos f Dec 2, 1987 32 \n",
"2887 sebastian muller s Jan 23, 2001 19 \n",
"2888 sven schipplock s Nov 8, 1988 31 \n",
"2889 prince owusu p Jan 7, 1997 23 \n",
"\n",
" playing_country position_cleaned \n",
"0 Germany Goalkeeper \n",
"1 Brazil Goalkeeper \n",
"2 Spain Goalkeeper \n",
"3 France Defender \n",
"4 France Defender \n",
"... ... ... \n",
"2885 Venezuela Forward \n",
"2886 Germany Forward \n",
"2887 Germany Forward \n",
"2888 Germany Forward \n",
"2889 Germany Forward \n",
"\n",
"[2890 rows x 14 columns]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_player_top5_2021"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"position_number object\n",
"position_description object\n",
"name object\n",
"dob object\n",
"nationality object\n",
"value object\n",
"name_lower object\n",
"firstname_lower object\n",
"lastname_lower object\n",
"firstinitial_lower object\n",
"birth_date object\n",
"age object\n",
"playing_country object\n",
"position_cleaned object\n",
"dtype: object"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tm_player_top5_2021.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# Export the engineered DataFrame to a CSV checkpoint\n",
"df_tm_player_top5_2021.to_csv(data_dir + f'test_tm_{today}.csv', index=None, header=True)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"# Replace any 'N/A' strings with empty strings ahead of the type conversions\n",
"df_tm_player_top5_2021 = df_tm_player_top5_2021.replace('N/A', '', regex=True)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Convert birth_date from string to datetime64[ns]\n",
"df_tm_player_top5_2021['birth_date'] = pd.to_datetime(df_tm_player_top5_2021['birth_date'])"
]
},
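{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here `pd.to_datetime()` infers the 'Apr 30, 1992' format automatically. A more defensive variant passes the format explicitly and coerces anything unparseable to NaT, sketched below on sample strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Explicit-format parsing on sample strings; unparseable values become NaT instead of raising\n",
"sample = pd.Series(['Apr 30, 1992', 'Jul 19, 1989', 'not a date'])\n",
"print(pd.to_datetime(sample, format='%b %d, %Y', errors='coerce'))"
]
},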
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Age\n",
"The calculated `age` column needs to be converted from a float to an integer, with all null values ignored, using to [astype()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html) method."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"# Date and time manipulation\n",
"from datetime import datetime"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# Redetermine the age using the newly created birth_date column (after formatted to datetime data type)\n",
"\n",
"## Remove all not numeric values use to_numeric with parameter errors='coerce' - it replaces non numeric to NaNs\n",
"df_tm_player_top5_2021['age'] = pd.to_numeric(df_tm_player_top5_2021['age'], errors='coerce')\n",
"\n",
"## Convert floats to integers and leave null values\n",
"df_tm_player_top5_2021['age'] = np.nan_to_num(df_tm_player_top5_2021['age']).astype(int)\n",
"\n",
"## Calculate current age\n",
"today = datetime.today()\n",
"df_tm_player_top5_2021['age'] = df_tm_player_top5_2021['birth_date'].apply(lambda x: today.year - x.year - \n",
" ((today.month, today.day) < (x.month, x.day)) \n",
" )\n",
"\n",
"\n",
"# df_tm_player_top5_2021['age'] = pd.to_numeric(ddf_tm_player_top5_2021['age'], downcast='signed')"
]
},
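{
"cell_type": "markdown",
"metadata": {},
"source": [
"The row-wise lambda above can also be expressed as a vectorised calculation, which is faster on larger DataFrames. A sketch that assigns to a standalone Series rather than modifying the DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Vectorised equivalent of the row-wise age calculation; missing birth dates stay NaN\n",
"now = pd.Timestamp.today()\n",
"born = df_tm_player_top5_2021['birth_date']\n",
"before_birthday = (born.dt.month > now.month) | ((born.dt.month == now.month) & (born.dt.day > now.day))\n",
"age_vectorised = now.year - born.dt.year - before_birthday.astype(int)"
]
},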
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Value\n",
"The `value` column needs to be converted from a string to an integer using to [to_numeric()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) method."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"# Convert string to integer\n",
"df_tm_player_top5_2021['value'] = pd.to_numeric(df_tm_player_top5_2021['value'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Position\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Attacking Midfield',\n",
" 'Central Midfield',\n",
" 'Centre-Back',\n",
" 'Centre-Forward',\n",
" 'Defensive Midfield',\n",
" 'Goalkeeper',\n",
" 'Left Midfield',\n",
" 'Left Winger',\n",
" 'Left-Back',\n",
" 'Midfielder',\n",
" 'Right Midfield',\n",
" 'Right Winger',\n",
" 'Right-Back',\n",
" 'Second Striker']"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted(df_tm_player_top5_2021['position_description'].unique())"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"dict_positions_tm = {\n",
" 'Attacking Midfield': 'Midfielder',\n",
" 'Central Midfield': 'Midfielder',\n",
" 'Centre-Back': 'Defender',\n",
" 'Centre-Forward': 'Forward',\n",
" 'Defensive Midfield': 'Midfielder',\n",
" 'Forward': 'Forward',\n",
" 'Goalkeeper': 'Goalkeeper',\n",
" 'Left Midfield': 'Midfielder',\n",
" 'Left Winger': 'Forward',\n",
" 'Left-Back': 'Defender',\n",
" 'Right Midfield': 'Midfielder',\n",
" 'Right Winger': 'Forward',\n",
" 'Right-Back': 'Defender',\n",
" 'Second Striker': 'Forward'\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"df_tm_player_top5_2021['position_description_cleaned'] = df_tm_player_top5_2021['position_description'].map(dict_positions_tm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.4. Create New Attributes\n",
"Create new attributes for birth month and birth year."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"df_tm_player_top5_2021['birth_year'] = pd.DatetimeIndex(df_tm_player_top5_2021['birth_date']).year\n",
"df_tm_player_top5_2021['birth_month'] = pd.DatetimeIndex(df_tm_player_top5_2021['birth_date']).month"
]
},
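{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, the same attributes can be derived with the `.dt` accessor on the datetime column, avoiding the intermediate DatetimeIndex:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Equivalent derivation using the .dt accessor (assumes birth_date is already datetime64[ns])\n",
"df_tm_player_top5_2021['birth_year'] = df_tm_player_top5_2021['birth_date'].dt.year\n",
"df_tm_player_top5_2021['birth_month'] = df_tm_player_top5_2021['birth_date'].dt.month"
]
},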
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.5. Columns of Interest\n",
"We are interested in the following thirteen columns in the [TransferMarkt](https://www.transfermarkt.co.uk/) dataset:\n",
"* `name`\n",
"* `name_lower`\n",
"* `firstinitial_lower`\n",
"* `firstname_lower`\n",
"* `lastname_lower`\n",
"* `position_description`\n",
"* `position_description_cleaned`\n",
"* `value`\n",
"* `birth_date`\n",
"* `birth_year`\n",
"* `birth_month`\n",
"* `age`\n",
"* `playing_country`"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"# Select columns of interest\n",
"df_tm_player_top5_2021 = df_tm_player_top5_2021[['name', 'name_lower', 'firstinitial_lower', 'firstname_lower', 'lastname_lower', 'position_description', 'position_description_cleaned', 'value', 'birth_date', 'birth_year', 'birth_month', 'age', 'playing_country']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.6. Split Dataset into Outfielder Players and Goalkeepers"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"# Assign df_tm as a new DataFrame - df_tm_player_top5_all_2021_all, to represent all the players\n",
"df_tm_player_top5_all_2021 = df_tm_player_top5_2021\n",
"\n",
"# Filter rows for position_description is not equal to 'Goalkeeper'\n",
"df_tm_player_top5_outfield_2021 = df_tm_player_top5_all_2021[df_tm_player_top5_all_2021['position_description'] != 'Goalkeeper']\n",
"\n",
"# Filter rows for position_description are equal to 'Goalkeeper'\n",
"df_tm_player_top5_goalkeeper_2021 = df_tm_player_top5_all_2021[df_tm_player_top5_all_2021['position_description'] == 'Goalkeeper']"
]
},
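{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check that the two subsets partition the full dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The outfield and goalkeeper row counts should sum to the full player count\n",
"assert len(df_tm_player_top5_outfield_2021) + len(df_tm_player_top5_goalkeeper_2021) == len(df_tm_player_top5_all_2021)\n",
"\n",
"print(len(df_tm_player_top5_all_2021), len(df_tm_player_top5_outfield_2021), len(df_tm_player_top5_goalkeeper_2021))"
]
},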
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"