{ "cells": [ { "cell_type": "markdown", "id": "e8097196-a89e-402c-9ab7-3bd28201f791", "metadata": {}, "source": [ "# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Capstone Project\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "bdc1ccd8-264c-4f79-8b17-65d01ac92a81", "metadata": { "tags": [] }, "source": [ "#### [Capstone Project, Part 1: Proposal](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part1-proposal.ipynb)\n", "#### [Capstone Project, Part 2: Brief](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part2-brief.ipynb)\n", "- [Writing data to MongoDB](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part2-brief.ipynb#mongo_db)\n", "- [Data Dictionary](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part2-brief.ipynb#data_dictionary)\n", "- [Map of races around the world](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part2-brief.ipynb#world-map)\n", "\n", "#### [Capstone Project, Part 3: Technical Notebook](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part3-technical-notebook.ipynb)\n", "- [Feature Engineering](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part3-technical-notebook.ipynb#feature_eng)\n", "- [Regression Approaches](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part3-technical-notebook.ipynb#regression_approaches)\n", "- [Classification Approaches](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part3-technical-notebook.ipynb#classification_approaches)\n", "- [Feature Importance](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part3-technical-notebook.ipynb#feature_importance)\n", "- [Feature Selection](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part3-technical-notebook.ipynb#feature_selection)\n", "- [Models 
Comparison](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part3-technical-notebook.ipynb#models_comparison)\n", "\n", "#### [Capstone Project, Part 4: Presentation](https://61c08c5e1627a3416b0c37b4--pensive-nobel-d54f9f.netlify.app/)\n", "#### [Capstone Project, Part 5: Appendix](https://nbviewer.org/github/jaeyow/f1-predictor/blob/main/final-project-part5-appendix.ipynb)\n", "\n", "#### [Capstone Project, Part 6: MLOps](https://github.com/jaeyow/f1-predictor/blob/main/.github/workflows/f1-mlops.yml)\n", "Using GitHub Actions as a cheap (and free) MLOps tool alternative:\n", "- invoke MLOps workflow on-demand (or with a cron schedule)\n", "- get latest source\n", "- set up Python build/MLOps environment\n", "- data retrieval and preparation\n", "- feature engineering\n", "- preparation for model training (including dummifying categorical features)\n", "- feature selection\n", "- model building and scoring\n", "- set up serverless (lambda) API in AWS\n", "- deploy model to serverless API\n", "- profit!\n", "\n", "![](./images/f1-mclaren-car.png)" ] }, { "cell_type": "code", "execution_count": 5, "id": "2a8d2696-3532-40a5-b8d1-0bc966315cd0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pandas in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (1.3.5)\n", "Requirement already satisfied: python-dateutil>=2.7.3 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from pandas) (2.8.2)\n", "Requirement already satisfied: numpy>=1.17.3 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from pandas) (1.21.5)\n", "Requirement already satisfied: pytz>=2017.3 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from pandas) (2021.3)\n", "Requirement already satisfied: six>=1.5 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas) 
(1.16.0)\n", "Collecting matplotlib\n", " Downloading matplotlib-3.5.1-cp38-cp38-macosx_10_9_x86_64.whl (7.3 MB)\n", "\u001b[K |████████████████████████████████| 7.3 MB 640 kB/s eta 0:00:01\n", "\u001b[?25hCollecting cycler>=0.10\n", " Downloading cycler-0.11.0-py3-none-any.whl (6.4 kB)\n", "Requirement already satisfied: pyparsing>=2.2.1 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from matplotlib) (3.0.4)\n", "Requirement already satisfied: python-dateutil>=2.7 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from matplotlib) (2.8.2)\n", "Requirement already satisfied: numpy>=1.17 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from matplotlib) (1.21.5)\n", "Collecting pillow>=6.2.0\n", " Downloading Pillow-8.4.0-cp38-cp38-macosx_10_10_x86_64.whl (3.0 MB)\n", "\u001b[K |████████████████████████████████| 3.0 MB 309 kB/s eta 0:00:01\n", "\u001b[?25hCollecting kiwisolver>=1.0.1\n", " Downloading kiwisolver-1.3.2-cp38-cp38-macosx_10_9_x86_64.whl (61 kB)\n", "\u001b[K |████████████████████████████████| 61 kB 560 kB/s eta 0:00:01\n", "\u001b[?25hCollecting fonttools>=4.22.0\n", " Downloading fonttools-4.28.5-py3-none-any.whl (890 kB)\n", "\u001b[K |████████████████████████████████| 890 kB 593 kB/s eta 0:00:01\n", "\u001b[?25hRequirement already satisfied: packaging>=20.0 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from matplotlib) (21.3)\n", "Requirement already satisfied: six>=1.5 in /Users/josereyes/opt/anaconda3/envs/jose_env/lib/python3.8/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)\n", "Installing collected packages: pillow, kiwisolver, fonttools, cycler, matplotlib\n", "Successfully installed cycler-0.11.0 fonttools-4.28.5 kiwisolver-1.3.2 matplotlib-3.5.1 pillow-8.4.0\n" ] } ], "source": [ "# install modules first\n", "!pip install pandas\n", "!pip install matplotlib" ] }, { "cell_type": "code", "execution_count": 3, "id": 
"9de4b0ed-fc65-4915-9980-c48361487aac", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from matplotlib import pyplot as plt\n", "from datetime import date\n", "from datetime import datetime\n", "from timeit import default_timer as timer\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.metrics import precision_score, explained_variance_score\n", "from sklearn.metrics import confusion_matrix, accuracy_score, classification_report\n", "from sklearn.base import BaseEstimator\n", "from sklearn.ensemble import AdaBoostRegressor\n", "from sklearn.ensemble import BaggingRegressor\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.ensemble import ExtraTreesRegressor\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.linear_model import RidgeCV\n", "from sklearn.linear_model import ARDRegression\n", "from sklearn.svm import LinearSVR\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.ensemble import StackingRegressor\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.neural_network import MLPRegressor\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.svm import SVC\n", "from sklearn.ensemble import AdaBoostClassifier\n", "from sklearn.ensemble import ExtraTreesClassifier\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.ensemble import StackingClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.svm import LinearSVC\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.feature_selection import SelectFromModel\n", "from sklearn.model_selection import GridSearchCV\n", "import pickle\n", "\n", 
"import warnings\n", "warnings.filterwarnings('ignore')\n" ] }, { "cell_type": "code", "execution_count": 135, "id": "99dbf969-be0e-4ed8-aef9-4615f4c1597b", "metadata": {}, "outputs": [], "source": [ "results_df = pd.read_csv('results_from_mongo.csv')\n", "results_df.drop(columns=['Unnamed: 0'],inplace=True)\n" ] }, { "cell_type": "code", "execution_count": 136, "id": "608fb66c-4ae2-45ef-b935-d3b6de9661a1", "metadata": {}, "outputs": [], "source": [ "# only keep 2021 for the api\n", "api_df = results_df[results_df['Season']==2021]\n", "api_df.reset_index(drop=True, inplace=True)\n", "api_df.to_csv(f'2021_races_drivers.csv')" ] }, { "cell_type": "markdown", "id": "2e99a9be-b977-4487-ac57-c3659d651d77", "metadata": {}, "source": [ "\n", "### Feature Engineering\n", "\n", "At this stage, we will be creating the following features from exiting data:\n", "- Driver experience\n", "- Constructor experience\n", "- Driver Recent Wins\n", "- Driver Age\n", "- Driver Recent Form\n", "- Driver recent DNFs\n", "- Home Circuit advantage\n", "- Dummify the following categorical parameters: Season?, Race Name?, Number of Laps, Weather\n" ] }, { "cell_type": "markdown", "id": "80b0565e-19e8-490e-b9a0-bb16a802c815", "metadata": {}, "source": [ "#### Driver experience\n", "Driver's experience in Formula 1, where a more experienced F1 driver typically places better than a rookie." 
] }, { "cell_type": "code", "execution_count": 137, "id": "2e4219e9-e62c-4f55-812d-7955e038492f", "metadata": {}, "outputs": [], "source": [ "results_df['DriverExperience'] = 0\n", "drivers = results_df['Driver'].unique()\n", "for driver in drivers:\n", " df_driver = pd.DataFrame(results_df[results_df['Driver']==driver]).tail(60) # Arbitrary window: only look at the driver's last 60 races\n", " df_driver.loc[:,'DriverExperience'] = 1\n", " \n", " results_df.loc[results_df['Driver']==driver, \"DriverExperience\"] = df_driver['DriverExperience'].cumsum()\n", " results_df['DriverExperience'].fillna(value=0,inplace=True)\n" ] }, { "cell_type": "markdown", "id": "d5de233d-d137-4a15-af2f-bfdb9181c451", "metadata": {}, "source": [ "#### Constructor Experience\n", "The constructor's experience in Formula 1; a more experienced F1 constructor typically places better than a newcomer." ] }, { "cell_type": "code", "execution_count": 138, "id": "ffdb661b-3051-4bb5-a0f5-3df6752c06bb", "metadata": {}, "outputs": [], "source": [ "results_df['ConstructorExperience'] = 0\n", "constructors = results_df['Constructor'].unique()\n", "for constructor in constructors:\n", " \n", " df_constructor = pd.DataFrame(results_df[results_df['Constructor']==constructor]).tail(60) # Arbitrary window: only look at the last 60 result rows per constructor\n", " df_constructor.loc[:,'ConstructorExperience'] = 1\n", " \n", " results_df.loc[results_df['Constructor']==constructor, \"ConstructorExperience\"] = df_constructor['ConstructorExperience'].cumsum()\n", " results_df['ConstructorExperience'].fillna(value=0,inplace=True)\n" ] }, { "cell_type": "markdown", "id": "e3402a97-0cdd-4934-965e-c9bf48d6e757", "metadata": {}, "source": [ "#### Driver Recent Wins\n", "A new feature is added to represent the driver's most recent past wins. The result of the current race is excluded to avoid data leakage that might affect the results. 
" ] }, { "cell_type": "code", "execution_count": 139, "id": "74ad0dcd-8ae1-4a9b-b05b-33ec8cc3b398", "metadata": {}, "outputs": [], "source": [ "results_df['DriverRecentWins'] = 0\n", "drivers = results_df['Driver'].unique()\n", "\n", "results_df.loc[results_df['Position']==1, \"DriverRecentWins\"] = 1\n", "for driver in drivers:\n", " mask_first_place_drivers = (results_df['Driver']==driver) & (results_df['Position']==1)\n", " df_driver = results_df[mask_first_place_drivers]\n", " results_df.loc[results_df['Driver']==driver, \"DriverRecentWins\"] = results_df[results_df['Driver']==driver]['DriverRecentWins'].rolling(60).sum() # 60 races, about 3 years rolling\n", " results_df.loc[mask_first_place_drivers, \"DriverRecentWins\"] = results_df[mask_first_place_drivers]['DriverRecentWins'] - 1 # but don't count this race's win\n", " results_df['DriverRecentWins'].fillna(value=0,inplace=True)\n" ] }, { "cell_type": "markdown", "id": "7b9ebca5-0bee-4b28-b37a-bd0a4e0ec256", "metadata": {}, "source": [ "#### Driver Recent DNFs\n", "\n", "A new feature has also been added to represent a driver's recent DNFs (Did Not Finish), whatever/whoever's fault it is. We also have to take care and avoid data leakage into this new feature, by not counting the current race. 
" ] }, { "cell_type": "code", "execution_count": 140, "id": "14c237be-59bf-4545-97fc-912b15e2f8a6", "metadata": {}, "outputs": [], "source": [ "results_df['DriverRecentDNFs'] = 0\n", "drivers = results_df['Driver'].unique()\n", "\n", "results_df.loc[(~results_df['Status'].str.contains('Finished|\\+')), \"DriverRecentDNFs\"] = 1\n", "for driver in drivers:\n", " mask_not_finish_place_drivers = (results_df['Driver']==driver) & (~results_df['Status'].str.contains('Finished|\\+'))\n", " df_driver = results_df[mask_not_finish_place_drivers]\n", " results_df.loc[results_df['Driver']==driver, \"DriverRecentDNFs\"] = results_df[results_df['Driver']==driver]['DriverRecentDNFs'].rolling(60).sum() # 60 races, about 3 years rolling\n", " results_df.loc[mask_not_finish_place_drivers, \"DriverRecentDNFs\"] = results_df[mask_not_finish_place_drivers]['DriverRecentDNFs'] - 1 # but don't count this race\n", " results_df['DriverRecentDNFs'].fillna(value=0,inplace=True)\n", "\n", "# results_df[results_df['Driver']=='Daniel Ricciardo'].tail(60)" ] }, { "cell_type": "markdown", "id": "8709431d-8d6b-46a2-8212-33a86052f7f0", "metadata": {}, "source": [ "#### Fix issues with Recent form values\n", "\n", "In Formula 1, only the top 10 finishers score points, so even if a driver finished 11th, they will not score anything which will not help our calculation. So in this part, we give all finishers a score. The 1st place top points, and lower places get lower points and so on. We can then use this column as a variable (instead of F1's official points) to calclulate for the the Driver's recent form. 
" ] }, { "cell_type": "code", "execution_count": 141, "id": "8cd39925-c7e3-43aa-8345-859462c73d22", "metadata": {}, "outputs": [], "source": [ "# Add new RFPoints column - ALL finishers score points - max points First place and one less for each lesser place (using LogSpace)\n", "seasons = results_df['Season'].unique()\n", "results_df['RFPoints'] = 0\n", "for season in seasons:\n", " rounds = results_df[results_df['Season']==season]['Round'].unique()\n", " for round in rounds:\n", " mask = (results_df['Season']==season) & (results_df['Round']==round)\n", " finisher_mask = ((results_df['Status'].str.contains('Finished|\\+'))) # Count only if finished the race\n", " finished_count = results_df.loc[(mask) & finisher_mask, \"RFPoints\"].count()\n", " point_list = np.round(np.logspace(1,4,40, base=4),4) # use list of LogSpaced numbers\n", " point_list[::-1].sort()\n", " \n", " results_df.loc[(mask) & finisher_mask, \"RFPoints\"] = point_list[:finished_count].tolist()\n" ] }, { "cell_type": "markdown", "id": "484910ea-857a-4d78-b25f-aee2594920e9", "metadata": {}, "source": [ "#### Driver Recent Form\n", "Now that we've got our adjusted points system \"RFPoints\", we can now calculate for a more accurate Driver Recent Form. We also have to take care and avoid data leakage into this new feature." 
] }, { "cell_type": "code", "execution_count": 142, "id": "fd26a682-1d78-48be-920e-2e3c61226d6e", "metadata": {}, "outputs": [], "source": [ "results_df['DriverRecentForm'] = 0\n", "# for all drivers, calculate the rolling 30-race sum of RFPoints and add it to a new column in \n", "# the original data frame; this represents the 'recent form'. NAs are imputed to zero\n", "drivers = results_df['Driver'].unique()\n", "for driver in drivers:\n", " df_driver = results_df[results_df['Driver']==driver]\n", " results_df.loc[results_df['Driver']==driver, \"DriverRecentForm\"] = df_driver['RFPoints'].rolling(30).sum() - df_driver['RFPoints'] # calculate recent form points but don't include this race's points\n", " results_df['DriverRecentForm'].fillna(value=0,inplace=True)\n" ] }, { "cell_type": "markdown", "id": "16e2608a-dc8a-485d-9447-f3391b51569e", "metadata": {}, "source": [ "#### Constructor Recent Form\n", "With the adjusted points system \"RFPoints\" in place, we can also calculate a more accurate Constructor Recent Form. As before, we must take care to avoid data leakage into this new feature." 
] }, { "cell_type": "code", "execution_count": 143, "id": "a8d2030d-b7f0-4529-a16b-e82f6a9a1ff0", "metadata": {}, "outputs": [], "source": [ "results_df['ConstructorRecentForm'] = 0\n", "# for all constructors, calculate the rolling 30-row sum of RFPoints and add it to a new column in \n", "# the original data frame; this represents the 'recent form'. NAs are imputed to zero\n", "constructors = results_df['Constructor'].unique()\n", "for constructor in constructors:\n", " df_constructor = results_df[results_df['Constructor']==constructor]\n", " results_df.loc[results_df['Constructor']==constructor, \"ConstructorRecentForm\"] = df_constructor['RFPoints'].rolling(30).sum() - df_constructor['RFPoints'] # calculate recent form points but don't include this race's points\n", " results_df['ConstructorRecentForm'].fillna(value=0,inplace=True)\n" ] }, { "cell_type": "markdown", "id": "ec4ac735-ff61-40ac-8b3c-6f52e8501788", "metadata": {}, "source": [ "#### Driver Age\n", "A driver's age plausibly has some influence on the outcome of the race. " ] }, { "cell_type": "code", "execution_count": 144, "id": "22b66ffb-ff7b-406c-9e3f-0e34b03d6d59", "metadata": {}, "outputs": [], "source": [ "def calculate_age(born, race):\n", " date_born = datetime.strptime(born,'%Y-%m-%d')\n", " date_race = datetime.strptime(race,'%Y-%m-%d')\n", " return date_race.year - date_born.year - ((date_race.month, date_race.day) < (date_born.month, date_born.day))\n", "\n", "results_df['Age'] = results_df.apply(lambda x: calculate_age(x['DOB'], x['Race Date']), axis=1) \n" ] }, { "cell_type": "markdown", "id": "49e7bab0-218d-4526-849a-dc6c512e61d2", "metadata": {}, "source": [ "#### Home Circuit\n", "Is there such a thing as homecourt advantage in Formula 1 racing? The preliminary EDA does not suggest one; however, I suspect there may be some effect. 
In the following cell, I have created a mapping from driver nationality to race country, which is used to represent the homecourt advantage concept in this model. " ] }, { "cell_type": "code", "execution_count": 145, "id": "4543dda5-bcb8-448c-af4f-fccdf0322316", "metadata": {}, "outputs": [], "source": [ "def is_race_in_home_country(driver_nationality, race_country):\n", " nationality_country_map = {\n", " 'American': ['USA'],\n", " 'American-Italian': ['USA','Italy'],\n", " 'Argentine': ['Argentina'],\n", " 'Argentine-Italian': ['Argentina','Italy'],\n", " 'Australian': ['Australia'],\n", " 'Austrian': ['Austria'],\n", " 'Belgian': ['Belgium'],\n", " 'Brazilian': ['Brazil'],\n", " 'British': ['UK'],\n", " 'Canadian': ['Canada'],\n", " 'Chilean': ['Brazil'],\n", " 'Colombian': ['Brazil'],\n", " 'Czech': ['Austria','Germany'],\n", " 'Danish': ['Germany'],\n", " 'Dutch': ['Netherlands'],\n", " 'East German': ['Germany'],\n", " 'Finnish': ['Germany','Austria'],\n", " 'French': ['France'],\n", " 'German': ['Germany'],\n", " 'Hungarian': ['Hungary'],\n", " 'Indian': ['India'],\n", " 'Indonesian': ['Singapore','Malaysia'],\n", " 'Irish': ['UK'],\n", " 'Italian': ['Italy'],\n", " 'Japanese': ['Japan','Korea'],\n", " 'Liechtensteiner': ['Switzerland','Austria'],\n", " 'Malaysian': ['Malaysia','Singapore'],\n", " 'Mexican': ['Mexico'],\n", " 'Monegasque': ['Monaco'],\n", " 'New Zealander': ['Australia'],\n", " 'Polish': ['Germany'],\n", " 'Portuguese': ['Portugal'],\n", " 'Rhodesian': ['South Africa'],\n", " 'Russian': ['Russia'],\n", " 'South African': ['South Africa'],\n", " 'Spanish': ['Spain','Morocco'],\n", " 'Swedish': ['Sweden'],\n", " 'Swiss': ['Switzerland'],\n", " 'Thai': ['Malaysia'],\n", " 'Uruguayan': ['Argentina'],\n", " 'Venezuelan': ['Brazil']\n", " }\n", " \n", " countries = ['None']\n", " \n", " try:\n", " countries = nationality_country_map[driver_nationality]\n", " except KeyError: # nationality missing from the map above\n", " print(\"Unknown nationality: this driver has no race held 
in his home country.\")\n", " return race_country in countries\n", "\n", "results_df['IsHomeCountry'] = results_df.apply(lambda x: is_race_in_home_country(x['Nationality'], x['Country']), axis=1) \n" ] }, { "cell_type": "markdown", "id": "d72e084d-3562-4bd4-89ac-9a47f7a9269d", "metadata": {}, "source": [ "#### Handle all categorical variables\n", "Dummify applicable categorical variables and ensure that the variables for the model are all numeric.\n", "- Weather\n", "- Race name (circuit)\n", "- Driver nationality\n" ] }, { "cell_type": "markdown", "id": "c9aee8e0-c62f-4683-952c-6ee9884aa045", "metadata": {}, "source": [ "#### Dummify FTW\n", "Dummify the following parameters, and just drop irrelevant columns" ] }, { "cell_type": "code", "execution_count": null, "id": "463e9d69-15a3-4840-a77b-9b7fbfdbeebd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Season',\n", " 'Round',\n", " 'Race Date',\n", " 'Race Time',\n", " 'Position',\n", " 'Points',\n", " 'Grid',\n", " 'Laps',\n", " 'Status',\n", " 'Driver',\n", " 'DOB',\n", " 'Constructor',\n", " 'Constructor Nat',\n", " 'Circuit Name',\n", " 'Race Url',\n", " 'Lat',\n", " 'Long',\n", " 'Locality',\n", " 'Country',\n", " 'DriverExperience',\n", " 'ConstructorExperience',\n", " 'DriverRecentWins',\n", " 'DriverRecentDNFs',\n", " 'RFPoints',\n", " 'DriverRecentForm',\n", " 'ConstructorRecentForm',\n", " 'Age',\n", " 'IsHomeCountry',\n", " 'Weather_weather_cold',\n", " 'Weather_weather_dry',\n", " 'Weather_weather_hot',\n", " 'Weather_weather_warm',\n", " 'Weather_weather_wet',\n", " 'Nationality_Argentine',\n", " 'Nationality_Australian',\n", " 'Nationality_Austrian',\n", " 'Nationality_Belgian',\n", " 'Nationality_Brazilian',\n", " 'Nationality_British',\n", " 'Nationality_Canadian',\n", " 'Nationality_Dutch',\n", " 'Nationality_Finnish',\n", " 'Nationality_French',\n", " 'Nationality_German',\n", " 'Nationality_Italian',\n", " 'Nationality_Japanese',\n", " 'Nationality_Mexican',\n", " 'Nationality_New 
Zealander',\n", " 'Nationality_Spanish',\n", " 'Nationality_Swedish',\n", " 'Nationality_Swiss',\n", " 'Race Name_Abu Dhabi Grand Prix',\n", " 'Race Name_Argentine Grand Prix',\n", " 'Race Name_Australian Grand Prix',\n", " 'Race Name_Austrian Grand Prix',\n", " 'Race Name_Bahrain Grand Prix',\n", " 'Race Name_Belgian Grand Prix',\n", " 'Race Name_Brazilian Grand Prix',\n", " 'Race Name_British Grand Prix',\n", " 'Race Name_Canadian Grand Prix',\n", " 'Race Name_Chinese Grand Prix',\n", " 'Race Name_Detroit Grand Prix',\n", " 'Race Name_Dutch Grand Prix',\n", " 'Race Name_European Grand Prix',\n", " 'Race Name_French Grand Prix',\n", " 'Race Name_German Grand Prix',\n", " 'Race Name_Hungarian Grand Prix',\n", " 'Race Name_Indianapolis 500',\n", " 'Race Name_Italian Grand Prix',\n", " 'Race Name_Japanese Grand Prix',\n", " 'Race Name_Malaysian Grand Prix',\n", " 'Race Name_Mexican Grand Prix',\n", " 'Race Name_Monaco Grand Prix',\n", " 'Race Name_Portuguese Grand Prix',\n", " 'Race Name_Russian Grand Prix',\n", " 'Race Name_San Marino Grand Prix',\n", " 'Race Name_Singapore Grand Prix',\n", " 'Race Name_South African Grand Prix',\n", " 'Race Name_Spanish Grand Prix',\n", " 'Race Name_Swedish Grand Prix',\n", " 'Race Name_Turkish Grand Prix',\n", " 'Race Name_United States Grand Prix',\n", " 'Race Name_United States Grand Prix West']" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Weather\n", "results_df = pd.get_dummies(results_df, columns = ['Weather', 'Nationality', 'Race Name'],drop_first=True)\n", "\n", "for col in results_df.columns:\n", " if 'Nationality' in col and results_df[col].sum() < 300:\n", " results_df.drop(col, axis = 1, inplace = True)\n", " \n", " elif 'Race Name' in col and results_df[col].sum() < 130:\n", " results_df.drop(col, axis = 1, inplace = True)\n", " \n", " else:\n", " pass\n", "results_df.columns.tolist()" ] }, { "cell_type": "code", "execution_count": 108, "id": 
"8f57b7a3-5177-44e6-87b2-a87006c5f41c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(25380, 68)" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#### Drop Columns which are not needed/required for modelling\n", "results_df.drop(['Race Date', 'Race Time', 'Status', 'DOB', 'Constructor', 'Constructor Nat', 'Circuit Name',\n", " 'Race Url', 'Lat', 'Long', 'Locality', 'Country','Laps','Points',\n", " 'RFPoints'], axis=1, inplace=True)\n", "results_df.shape\n" ] }, { "cell_type": "code", "execution_count": 110, "id": "48ef1f71-7c0e-4514-84f6-15a639c75884", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Season', 'Round', 'Position', 'Grid', 'Driver', 'DriverExperience',\n", " 'ConstructorExperience', 'DriverRecentWins', 'DriverRecentDNFs',\n", " 'DriverRecentForm', 'ConstructorRecentForm', 'Age', 'IsHomeCountry',\n", " 'Weather_weather_cold', 'Weather_weather_dry', 'Weather_weather_hot',\n", " 'Weather_weather_warm', 'Weather_weather_wet', 'Nationality_Argentine',\n", " 'Nationality_Australian', 'Nationality_Austrian', 'Nationality_Belgian',\n", " 'Nationality_Brazilian', 'Nationality_British', 'Nationality_Canadian',\n", " 'Nationality_Dutch', 'Nationality_Finnish', 'Nationality_French',\n", " 'Nationality_German', 'Nationality_Italian', 'Nationality_Japanese',\n", " 'Nationality_Mexican', 'Nationality_New Zealander',\n", " 'Nationality_Spanish', 'Nationality_Swedish', 'Nationality_Swiss',\n", " 'Race Name_Abu Dhabi Grand Prix', 'Race Name_Argentine Grand Prix',\n", " 'Race Name_Australian Grand Prix', 'Race Name_Austrian Grand Prix',\n", " 'Race Name_Bahrain Grand Prix', 'Race Name_Belgian Grand Prix',\n", " 'Race Name_Brazilian Grand Prix', 'Race Name_British Grand Prix',\n", " 'Race Name_Canadian Grand Prix', 'Race Name_Chinese Grand Prix',\n", " 'Race Name_Detroit Grand Prix', 'Race Name_Dutch Grand Prix',\n", " 'Race Name_European Grand Prix', 'Race Name_French Grand Prix',\n", " 'Race 
Name_German Grand Prix', 'Race Name_Hungarian Grand Prix',\n", " 'Race Name_Indianapolis 500', 'Race Name_Italian Grand Prix',\n", " 'Race Name_Japanese Grand Prix', 'Race Name_Malaysian Grand Prix',\n", " 'Race Name_Mexican Grand Prix', 'Race Name_Monaco Grand Prix',\n", " 'Race Name_Portuguese Grand Prix', 'Race Name_Russian Grand Prix',\n", " 'Race Name_San Marino Grand Prix', 'Race Name_Singapore Grand Prix',\n", " 'Race Name_South African Grand Prix', 'Race Name_Spanish Grand Prix',\n", " 'Race Name_Swedish Grand Prix', 'Race Name_Turkish Grand Prix',\n", " 'Race Name_United States Grand Prix',\n", " 'Race Name_United States Grand Prix West'],\n", " dtype='object')" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_df['Season'] = pd.to_numeric(results_df['Season'])\n", "results_df.columns" ] }, { "cell_type": "code", "execution_count": 81, "id": "61d735f4-3aef-4008-b3b1-db4f9dcaccd3", "metadata": {}, "outputs": [], "source": [ "scoring_raw ={'model':[], 'params': [], 'score': [], 'train_time': [], 'test_time': []}" ] }, { "cell_type": "code", "execution_count": 82, "id": "dcc9df47-c105-43fe-8997-94557ea92644", "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "# Prepare train set\n", "np.set_printoptions(precision=4)\n", "model_df = results_df.copy()\n", "model_df['Position'] = model_df['Position'].map(lambda x: 1 if x == 1 else 0)\n", "\n", "train = model_df[(model_df['Season'] >= 1950) & (model_df['Season'] < 2021)]\n", "X_train = train.drop(['Position','Driver'], axis = 1)\n", "y_train = train['Position']\n", "\n", "scaler = MinMaxScaler()\n", "X_train = pd.DataFrame(scaler.fit_transform(X_train, y_train), columns = X_train.columns)\n" ] }, { "cell_type": "code", "execution_count": 83, "id": "d4e8bea0-c0f7-4b66-a7cd-f36a4dcb5596", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Season', 'Round', 'Position', 'Grid', 'Driver', 
'DriverExperience',\n", " 'ConstructorExperience', 'DriverRecentWins', 'DriverRecentDNFs',\n", " 'DriverRecentForm', 'ConstructorRecentForm', 'Age', 'IsHomeCountry',\n", " 'Weather_weather_cold', 'Weather_weather_dry', 'Weather_weather_hot',\n", " 'Weather_weather_warm', 'Weather_weather_wet', 'Nationality_Argentine',\n", " 'Nationality_Australian', 'Nationality_Austrian', 'Nationality_Belgian',\n", " 'Nationality_Brazilian', 'Nationality_British', 'Nationality_Canadian',\n", " 'Nationality_Dutch', 'Nationality_Finnish', 'Nationality_French',\n", " 'Nationality_German', 'Nationality_Italian', 'Nationality_Japanese',\n", " 'Nationality_Mexican', 'Nationality_New Zealander',\n", " 'Nationality_Spanish', 'Nationality_Swedish', 'Nationality_Swiss',\n", " 'Race Name_Abu Dhabi Grand Prix', 'Race Name_Argentine Grand Prix',\n", " 'Race Name_Australian Grand Prix', 'Race Name_Austrian Grand Prix',\n", " 'Race Name_Bahrain Grand Prix', 'Race Name_Belgian Grand Prix',\n", " 'Race Name_Brazilian Grand Prix', 'Race Name_British Grand Prix',\n", " 'Race Name_Canadian Grand Prix', 'Race Name_Chinese Grand Prix',\n", " 'Race Name_Detroit Grand Prix', 'Race Name_Dutch Grand Prix',\n", " 'Race Name_European Grand Prix', 'Race Name_French Grand Prix',\n", " 'Race Name_German Grand Prix', 'Race Name_Hungarian Grand Prix',\n", " 'Race Name_Indianapolis 500', 'Race Name_Italian Grand Prix',\n", " 'Race Name_Japanese Grand Prix', 'Race Name_Malaysian Grand Prix',\n", " 'Race Name_Mexican Grand Prix', 'Race Name_Monaco Grand Prix',\n", " 'Race Name_Portuguese Grand Prix', 'Race Name_Russian Grand Prix',\n", " 'Race Name_San Marino Grand Prix', 'Race Name_Singapore Grand Prix',\n", " 'Race Name_South African Grand Prix', 'Race Name_Spanish Grand Prix',\n", " 'Race Name_Swedish Grand Prix', 'Race Name_Turkish Grand Prix',\n", " 'Race Name_United States Grand Prix',\n", " 'Race Name_United States Grand Prix West'],\n", " dtype='object')" ] }, "execution_count": 83, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "model_df.columns" ] }, { "cell_type": "markdown", "id": "1f5936dd-27d5-45a4-baa4-2ab0fd28b907", "metadata": {}, "source": [ "\n", "#### Regression Function" ] }, { "cell_type": "code", "execution_count": 84, "id": "f1a2af21-df0e-489c-a0a7-3e276766b335", "metadata": {}, "outputs": [], "source": [ "def regression_test_score(model, print_output=False):\n", " # --- Test ---\n", " score = 0\n", " races = model_df[(model_df['Season'] == 2021)]['Round'].unique()\n", " for race in races:\n", " test = model_df[(model_df['Season'] == 2021) & (model_df['Round'] == race)]\n", " X_test = test.drop(['Position','Driver'], axis = 1)\n", " y_test = test['Position']\n", " X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)\n", " X_test.to_csv(f'{2021}_{race}.csv')\n", "\n", " # make predictions\n", " prediction_df = pd.DataFrame(model.predict(X_test), columns = ['prediction'])\n", " merged_df = pd.concat([prediction_df, test[['Driver','Position']].reset_index(drop=True)], axis=1)\n", " merged_df.rename(columns = {'Position': 'actual_pos'}, inplace = True)\n", " \n", " # shuffle data to remove original order that will influence selection\n", " # of race winner when there are drivers with identical win probablilities\n", " merged_df = merged_df.sample(frac=1).reset_index(drop=True) \n", " merged_df.sort_values(by='prediction', ascending=False, inplace=True)\n", " merged_df['predicted_pos'] = merged_df['prediction'].map(lambda x: 0)\n", " merged_df.iloc[0, merged_df.columns.get_loc('predicted_pos')] = 1\n", " merged_df.reset_index(drop=True, inplace=True)\n", " if (print_output == True):\n", " print(merged_df)\n", "\n", " # --- Score --- \n", " score += precision_score(merged_df['actual_pos'], merged_df['predicted_pos'], zero_division=0)\n", " \n", " return score / len(races)\n" ] }, { "cell_type": "markdown", "id": "6867c77a-20b9-4331-9cea-9ccd5b5ef4fa", "metadata": {}, "source": [ "#### Dumb Classifier\n", "\n", 
"I have created a a dumb classifier and all it does is create a list of numbers between 0.001 and 0.99 which represents the order of finishing, and shuffle for good measure" ] }, { "cell_type": "code", "execution_count": 22, "id": "2263c436-6688-41cf-aa5b-ea360af41b77", "metadata": {}, "outputs": [], "source": [ "class F1OracleDumbClassifier(BaseEstimator):\n", " def fit(self, X, y=None):\n", " pass\n", " def predict(self, X):\n", " # numbers between 0.001 to 0.99 and shuffle - jose's awesome dumb classifier!\n", " numbers = np.round(np.logspace(np.log10(0.001),np.log10(0.999),len(X)),4)\n", " np.random.shuffle(numbers)\n", " return numbers" ] }, { "cell_type": "code", "execution_count": 23, "id": "8cf8d841-1a30-42ca-a0a4-650ce3c1ae3c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " model params score train_time test_time\n", "0 Jose's Dumb Classifier none 4.762 0.000001 0.12515" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def joses_dumb_classifier(X_train, y_train):\n", " start = timer()\n", " model = F1OracleDumbClassifier()\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append(\"Jose's Dumb Classifier\")\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_raw['params'].append('none')\n", " scoring_raw['test_time'].append(end - start)\n", " \n", "joses_dumb_classifier(X_train, y_train)\n", "running_score = pd.DataFrame(scoring_raw)\n", "running_score\n" ] }, { "cell_type": "markdown", "id": "1e0a879f-fb26-48e2-bdff-239d9d0db40c", "metadata": {}, "source": [ "\n", "#### Feature Importance using Linear Regression Coefficients\n", "\n", "Using Linear Regression, we can identify the most relevant features, so that we may be able to remove the features that don't add any value, but bloat. " ] }, { "cell_type": "code", "execution_count": 355, "id": "e5edb8b4-50ce-4037-b03f-700fa9d37477", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "def feature_importance_using_linear_regression(X_train, y_train):\n", " model = LinearRegression()\n", " model.fit(X_train, y_train)\n", "\n", " # Assess the importance of features using Linear Regression coefficients\n", " importance = model.coef_\n", " importance_df = pd.DataFrame(importance, columns = ['importance'])\n", " features_df = pd.DataFrame(results_df.columns.tolist(), columns = ['feature_name'])\n", " features_df.drop(features_df[(features_df['feature_name'] == 'Driver') | (features_df['feature_name'] == 'Position')].index, inplace=True)\n", " merged_features_df = pd.concat([importance_df, features_df.reset_index(drop=True)], axis=1)\n", " merged_features_df.sort_values(by='importance', ascending=True, inplace=True)\n", " merged_features_df.set_index('feature_name', inplace=True)\n", " selected_features_df = merged_features_df[(merged_features_df['importance'] > 0)][['importance']]\n", "\n", " # plot feature importance \n", " axis = selected_features_df.plot(kind='barh', title=\"F1 Prediction Feature Importance\", figsize=(16, 10), color='#00CC99')\n", " y_label = axis.yaxis.get_label()\n", " y_label.set_visible(False)\n", "\n", "feature_importance_using_linear_regression(X_train, y_train)" ] }, { "cell_type": "markdown", "id": "5f41cd68-7070-4d27-b333-a0f3e26a4317", "metadata": {}, "source": [ "#### Feature importance using Random Forest" ] }, { "cell_type": "code", "execution_count": 533, "id": "5a375479-21af-487f-b522-d105c68ed637", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " imp_mask importance\n", "feature_name \n", "Grid True 0.150197\n", "DriverRecentForm True 0.119112\n", "ConstructorRecentForm True 0.113140\n", "DriverRecentWins True 0.073343\n", "Season True 0.066364\n", "Round True 0.063992\n", "Age True 0.060626\n", "DriverRecentDNFs True 0.047086\n", "DriverExperience True 0.045102\n", "Weather_weather_warm True 0.016472\n", "ConstructorExperience False 0.013140\n", "Weather_weather_dry False 0.010116\n", "Nationality_British False 0.010064\n", "Weather_weather_hot False 0.008448\n", "Race Name_Italian Grand Prix False 0.008315\n", "Weather_weather_wet False 0.008052\n", "Race Name_Belgian Grand Prix False 0.007769\n", "IsHomeCountry False 0.007731\n", "Race Name_British Grand Prix False 0.007701\n", "Race Name_Monaco Grand Prix False 0.007615\n", "Race Name_German Grand Prix False 0.007439\n", "Race Name_French Grand Prix False 0.007222\n", "Nationality_German False 0.006735\n", "Race Name_United States Grand Prix False 0.006140\n", "Nationality_Brazilian False 0.006081\n", "Race Name_Spanish Grand Prix False 0.005951\n", "Race Name_Canadian Grand Prix False 0.005885\n", "Race Name_Brazilian Grand Prix False 0.005844\n", "Race Name_Austrian Grand Prix False 0.005482\n", "Race Name_Hungarian Grand Prix False 0.005455\n", "Nationality_French False 0.005127\n", "Race Name_Australian Grand Prix False 0.004712\n", "Nationality_Finnish False 0.004693\n", "Nationality_Australian False 0.004413\n", "Race Name_Japanese Grand Prix False 0.004334\n", "Race Name_Dutch Grand Prix False 0.003995\n", "Race Name_European Grand Prix False 0.003828\n", "Weather_weather_cold False 0.003777\n", "Nationality_Italian False 0.003739\n", "Race Name_San Marino Grand Prix False 0.003566\n", "Race Name_South African Grand Prix False 0.003518\n", "Nationality_Austrian False 0.003452\n", "Race Name_Argentine Grand Prix False 0.003199\n", "Nationality_Argentine False 0.003165\n", "Race Name_Mexican Grand Prix False 0.003160\n", 
"Race Name_Malaysian Grand Prix False 0.003118\n", "Race Name_Portuguese Grand Prix False 0.003011\n", "Nationality_Spanish False 0.002691\n", "Race Name_Chinese Grand Prix False 0.002614\n", "Nationality_Canadian False 0.002482" ] }, "execution_count": 533, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "def feature_importance_using_random_forest(X_train, y_train):\n", " sel = SelectFromModel(RandomForestClassifier(n_estimators = 100))\n", " sel.fit(X_train, y_train)\n", " features_to_keep = sel.get_support()\n", " feature_importance_df = pd.DataFrame(\n", " {'feature_name': X_train.columns.tolist(),\n", " 'imp_mask': sel.get_support(),\n", " 'importance': sel.estimator_.feature_importances_\n", " })\n", " return feature_importance_df, features_to_keep\n", " \n", "feature_importance_df, features_to_keep = feature_importance_using_random_forest(X_train, y_train)\n", "# feature_importance_df = feature_importance_df[feature_importance_df['imp_mask']==True]\n", "feature_importance_df.sort_values(by='importance', inplace=True)\n", "feature_importance_df.reset_index(drop=True, inplace=True)\n", "feature_importance_df.set_index('feature_name', inplace=True)\n", "\n", "# plot feature importance \n", "axis = feature_importance_df.plot(kind='barh', title=\"F1 Prediction Feature Importance\", figsize=(16, 20), color='#00CC99')\n", "y_label = axis.yaxis.get_label()\n", "y_label.set_visible(False)\n", "feature_importance_df = feature_importance_df.nlargest(50, 'importance')\n", "feature_importance_df\n" ] }, { "cell_type": "markdown", "id": "7704ce0a-49f0-4bf4-8884-f7b0bea18cdb", "metadata": {}, "source": [ "\n", "#### Feature Selection\n", "Based on the feature importance discovered in the previous section, the identified insignificant features can be safely ommitted as it contributes to model bloat. I've decided to use the features identified by **Random Forest**, since I did not trust the ones from **Linear Regression**. 
" ] }, { "cell_type": "code", "execution_count": 535, "id": "66ce8dbd-1e63-4f07-9173-b29444a1c48d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Grid', 'DriverRecentForm', 'ConstructorRecentForm', 'DriverRecentWins',\n", " 'Season', 'Round', 'Age', 'DriverRecentDNFs', 'DriverExperience',\n", " 'Weather_weather_warm', 'ConstructorExperience', 'Weather_weather_dry',\n", " 'Nationality_British', 'Weather_weather_hot',\n", " 'Race Name_Italian Grand Prix', 'Weather_weather_wet',\n", " 'Race Name_Belgian Grand Prix', 'IsHomeCountry',\n", " 'Race Name_British Grand Prix', 'Race Name_Monaco Grand Prix',\n", " 'Race Name_German Grand Prix', 'Race Name_French Grand Prix',\n", " 'Nationality_German', 'Race Name_United States Grand Prix',\n", " 'Nationality_Brazilian', 'Race Name_Spanish Grand Prix',\n", " 'Race Name_Canadian Grand Prix', 'Race Name_Brazilian Grand Prix',\n", " 'Race Name_Austrian Grand Prix', 'Race Name_Hungarian Grand Prix',\n", " 'Nationality_French', 'Race Name_Australian Grand Prix',\n", " 'Nationality_Finnish', 'Nationality_Australian',\n", " 'Race Name_Japanese Grand Prix', 'Race Name_Dutch Grand Prix',\n", " 'Race Name_European Grand Prix', 'Weather_weather_cold',\n", " 'Nationality_Italian', 'Race Name_San Marino Grand Prix',\n", " 'Race Name_South African Grand Prix', 'Nationality_Austrian',\n", " 'Race Name_Argentine Grand Prix', 'Nationality_Argentine',\n", " 'Race Name_Mexican Grand Prix', 'Race Name_Malaysian Grand Prix',\n", " 'Race Name_Portuguese Grand Prix', 'Nationality_Spanish',\n", " 'Race Name_Chinese Grand Prix', 'Nationality_Canadian', 'Position',\n", " 'Driver'],\n", " dtype='object')" ] }, "execution_count": 535, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def keep_significant_features():\n", " # keep significant features (Position and Driver will be dropped later, we still need them)\n", " return results_df[feature_importance_df.index.tolist() + ['Position','Driver']]\n", "results_df = 
keep_significant_features()\n", "results_df.columns" ] }, { "cell_type": "code", "execution_count": 26, "id": "aaa66c1c-9786-4e7c-a8fc-762f1ab30647", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(25380, 68)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_df.shape" ] }, { "cell_type": "code", "execution_count": 537, "id": "dc208cc8-cd2f-4a41-9610-c3eb89d6ac68", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(24800, 50)" ] }, "execution_count": 537, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "# Prepare train set\n", "np.set_printoptions(precision=4)\n", "model_df = results_df.copy()\n", "# binary target: 1 for the race winner, 0 for everyone else\n", "model_df['Position'] = model_df['Position'].map(lambda x: 1 if x == 1 else 0)\n", "\n", "# train on seasons 1950-2020; the 2021 season is held out for testing\n", "train = model_df[(model_df['Season'] >= 1950) & (model_df['Season'] < 2021)]\n", "X_train = train.drop(['Position','Driver'], axis = 1)\n", "y_train = train['Position']\n", "\n", "scaler = MinMaxScaler()\n", "X_train = pd.DataFrame(scaler.fit_transform(X_train, y_train), columns = X_train.columns)\n", "X_train.shape\n" ] }, { "cell_type": "markdown", "id": "921fcb41-6e62-43f5-a979-c1b08d757ea9", "metadata": {}, "source": [ "\n", "### Regression Approaches\n", "\n", "Predicting the winner of a race can be framed as a regression problem. For each race, we pass the feature rows of all entered drivers (typically 20) to the chosen estimator, which returns one score per driver in a \"prediction\" column. The estimator therefore does more than pick a winner: sorting that column in descending order gives a predicted finishing order, and the driver at the top of the list is the predicted winner of the race. 
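To make the ranking step concrete, here is a small pure-Python sketch of how one score per driver becomes a predicted winner and a season score. The driver names, scores and helper names are made up for illustration; the notebook does roughly the same thing with pandas and precision_score inside regression_test_score.

```python
# Hypothetical scores for two races: driver names and values are made up.
race_1 = {'driver_a': 0.12, 'driver_b': 0.71, 'driver_c': 0.33}
race_2 = {'driver_a': 0.55, 'driver_b': 0.20, 'driver_c': 0.40}


def predicted_order(scores):
    # Highest score first, mirroring the descending sort on the prediction column.
    return sorted(scores, key=scores.get, reverse=True)


def winner_hit(scores, actual_winner):
    # 1.0 if the top-ranked driver really won the race, else 0.0.
    # With exactly one predicted winner this equals the precision for that race.
    return 1.0 if predicted_order(scores)[0] == actual_winner else 0.0


# Each entry pairs the scores for a race with the driver who actually won it.
season = [(race_1, 'driver_b'), (race_2, 'driver_c')]
order = predicted_order(race_1)
season_score = sum(winner_hit(s, w) for s, w in season) / len(season)
print(order, season_score)
```

Because exactly one driver is marked as the predicted winner, a race scores 1 when the pick is right and 0 otherwise, so the season score is simply the fraction of races called correctly.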
\n", "\n", "The following regression techniques are applied to help solve our problem:\n", "- Linear Regression\n", "- AdaBoost Regressor\n", "- Bagging Regressor\n", "- Extra Trees Regressor\n", "- Gradient Boosting Regressor\n", "- Random Forest Regressor\n", "- Stacking Regressor\n", "- Neural Network (MLP Regressor)\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "d51fafc3-277b-4be0-81a4-94dd17ccfd32", "metadata": {}, "source": [ "#### Linear Regression" ] }, { "cell_type": "code", "execution_count": 189, "id": "77af755b-eb5f-4501-9943-6b279b37ecce", "metadata": {}, "outputs": [], "source": [ "def linear_regression(X_train, y_train):\n", " # use real booleans: the strings 'True' and 'False' are both truthy\n", " params={'fit_intercept': [True, False]}\n", "\n", " for fit_intercept in params['fit_intercept']:\n", " start = timer()\n", " model_params = (fit_intercept)\n", " model = LinearRegression(fit_intercept = fit_intercept)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append('Linear Regression')\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_raw['params'].append(model_params)\n", " scoring_raw['test_time'].append(end - start)\n", " \n" ] }, { "cell_type": "markdown", "id": "912137be-63a6-4a9c-b597-eae86bca77d0", "metadata": {}, "source": [ "#### AdaBoost Regressor" ] }, { "cell_type": "code", "execution_count": 180, "id": "d79b1772-5bf4-4927-b706-4a49e6657538", "metadata": {}, "outputs": [], "source": [ "def adaboost_regressor(X_train, y_train):\n", " params={'n_estimators': [100,200,300],\n", " 'learning_rate': [0.001,0.01,0.1,1],\n", " 'loss': ['linear','square','exponential']}\n", "\n", " for n_estimators in params['n_estimators']:\n", " for learning_rate in params['learning_rate']:\n", " for loss in params['loss']:\n", " start = timer()\n", " 
model_params = (n_estimators, learning_rate, loss)\n", " 
model = AdaBoostRegressor(random_state=0, n_estimators=n_estimators, learning_rate=learning_rate, loss=loss)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append('AdaBoost Regressor')\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_raw['params'].append(model_params)\n", " scoring_raw['test_time'].append(end - start)\n", " \n" ] }, { "cell_type": "markdown", "id": "d339a9b9-858b-4511-b50b-30968a59b78a", "metadata": {}, "source": [ "#### Bagging Regressor" ] }, { "cell_type": "code", "execution_count": 526, "id": "097c8f28-3b38-4269-950a-4ed9378792fa", "metadata": {}, "outputs": [], "source": [ "def bagging_regressor(X_train, y_train):\n", " params={'n_estimators': [100,200,300],\n", " 'max_samples': [10,20,30],\n", " 'max_features': [20,40,50],\n", " 'bootstrap': [True,False],\n", " 'bootstrap_features': [True,False]}\n", "\n", " for n_estimators in params['n_estimators']:\n", " for max_samples in params['max_samples']:\n", " for max_features in params['max_features']:\n", " for bootstrap in params['bootstrap']:\n", " for bootstrap_features in params['bootstrap_features']:\n", " start = timer()\n", " model_params = (n_estimators, max_samples, max_features, bootstrap, bootstrap_features)\n", " model = BaggingRegressor(random_state=0, base_estimator=DecisionTreeRegressor(),\n", " n_estimators=n_estimators, max_samples=max_samples, max_features=max_features, bootstrap=bootstrap, bootstrap_features=bootstrap_features)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append('Bagging Regressor (DT)')\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " 
scoring_raw['params'].append(model_params)\n", " scoring_raw['test_time'].append(end - start)\n", " \n", "bagging_regressor(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 496, "id": "bf89d1cc-0e25-4f77-b458-bf3e9882c129", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " model params score train_time \\\n", "0 Bagging Regressor (DT) (100, 10, 20, True, True) 38.095 0.233318 \n", "1 Bagging Regressor (DT) (100, 10, 20, True, False) 38.095 0.213632 \n", "2 Bagging Regressor (DT) (100, 10, 20, False, True) 33.333 0.212461 \n", "3 Bagging Regressor (DT) (100, 10, 20, False, False) 38.095 0.211680 \n", "4 Bagging Regressor (DT) (100, 10, 50, True, True) 47.619 0.480269 \n", "5 Bagging Regressor (DT) (100, 10, 50, True, False) 33.333 0.468580 \n", "6 Bagging Regressor (DT) (100, 10, 50, False, True) 42.857 0.467698 \n", "7 Bagging Regressor (DT) (100, 10, 50, False, False) 38.095 0.462829 \n", "8 Bagging Regressor (DT) (100, 10, 60, True, True) 42.857 0.577932 \n", "9 Bagging Regressor (DT) (100, 10, 60, True, False) 42.857 0.586996 \n", "10 Bagging Regressor (DT) (100, 10, 60, False, True) 33.333 0.591305 \n", "11 Bagging Regressor (DT) (100, 10, 60, False, False) 38.095 0.597226 \n", "12 Bagging Regressor (DT) (100, 20, 20, True, True) 42.857 0.211228 \n", "13 Bagging Regressor (DT) (100, 20, 20, True, False) 42.857 0.211595 \n", "14 Bagging Regressor (DT) (100, 20, 20, False, True) 42.857 0.226894 \n", "15 Bagging Regressor (DT) (100, 20, 20, False, False) 42.857 0.213197 \n", "16 Bagging Regressor (DT) (100, 20, 50, True, True) 42.857 0.458122 \n", "17 Bagging Regressor (DT) (100, 20, 50, True, False) 47.619 0.469801 \n", "18 Bagging Regressor (DT) (100, 20, 50, False, True) 38.095 0.475078 \n", "19 Bagging Regressor (DT) (100, 20, 50, False, False) 47.619 0.463000 \n", "20 Bagging Regressor (DT) (100, 20, 60, True, True) 33.333 0.607960 \n", "21 Bagging Regressor (DT) (100, 20, 60, True, False) 38.095 0.586441 \n", "22 Bagging Regressor (DT) (100, 20, 60, False, True) 33.333 0.584668 \n", "23 Bagging Regressor (DT) (100, 20, 60, False, False) 38.095 0.588140 \n", "24 Bagging Regressor (DT) (100, 30, 20, True, True) 33.333 0.214778 \n", "25 Bagging Regressor (DT) (100, 30, 20, True, False) 52.381 0.211576 \n", 
"26 Bagging Regressor (DT) (100, 30, 20, False, True) 33.333 0.218099 \n", "27 Bagging Regressor (DT) (100, 30, 20, False, False) 52.381 0.218288 \n", "28 Bagging Regressor (DT) (100, 30, 50, True, True) 42.857 0.458911 \n", "29 Bagging Regressor (DT) (100, 30, 50, True, False) 42.857 0.474571 \n", "30 Bagging Regressor (DT) (100, 30, 50, False, True) 42.857 0.472724 \n", "31 Bagging Regressor (DT) (100, 30, 50, False, False) 42.857 0.485122 \n", "32 Bagging Regressor (DT) (100, 30, 60, True, True) 38.095 0.595225 \n", "33 Bagging Regressor (DT) (100, 30, 60, True, False) 33.333 0.605169 \n", "34 Bagging Regressor (DT) (100, 30, 60, False, True) 38.095 0.598013 \n", "35 Bagging Regressor (DT) (100, 30, 60, False, False) 33.333 0.610230 \n", "36 Bagging Regressor (DT) (200, 10, 20, True, True) 33.333 0.413500 \n", "37 Bagging Regressor (DT) (200, 10, 20, True, False) 33.333 0.426104 \n", "38 Bagging Regressor (DT) (200, 10, 20, False, True) 33.333 0.421968 \n", "39 Bagging Regressor (DT) (200, 10, 20, False, False) 33.333 0.414768 \n", "40 Bagging Regressor (DT) (200, 10, 50, True, True) 66.667 0.924459 \n", "41 Bagging Regressor (DT) (200, 10, 50, True, False) 38.095 0.934540 \n", "42 Bagging Regressor (DT) (200, 10, 50, False, True) 61.905 0.918467 \n", "43 Bagging Regressor (DT) (200, 10, 50, False, False) 38.095 0.915788 \n", "44 Bagging Regressor (DT) (200, 10, 60, True, True) 33.333 1.191411 \n", "45 Bagging Regressor (DT) (200, 10, 60, True, False) 38.095 1.179926 \n", "46 Bagging Regressor (DT) (200, 10, 60, False, True) 33.333 1.174052 \n", "47 Bagging Regressor (DT) (200, 10, 60, False, False) 38.095 1.199397 \n", "48 Bagging Regressor (DT) (200, 20, 20, True, True) 38.095 0.416047 \n", "49 Bagging Regressor (DT) (200, 20, 20, True, False) 38.095 0.424252 \n", "\n", " test_time \n", "0 0.353067 \n", "1 0.354575 \n", "2 0.357825 \n", "3 0.353857 \n", "4 0.346802 \n", "5 0.346213 \n", "6 0.347006 \n", "7 0.349061 \n", "8 0.353337 \n", "9 0.347714 \n", "10 
0.348392 \n", "11 0.360252 \n", "12 0.339423 \n", "13 0.361912 \n", "14 0.341176 \n", "15 0.350047 \n", "16 0.346450 \n", "17 0.348179 \n", "18 0.346943 \n", "19 0.366208 \n", "20 0.361641 \n", "21 0.346637 \n", "22 0.357027 \n", "23 0.347868 \n", "24 0.341378 \n", "25 0.353158 \n", "26 0.347623 \n", "27 0.350863 \n", "28 0.344267 \n", "29 0.355489 \n", "30 0.344054 \n", "31 0.345893 \n", "32 0.342590 \n", "33 0.344667 \n", "34 0.344473 \n", "35 0.347696 \n", "36 0.484519 \n", "37 0.472942 \n", "38 0.472197 \n", "39 0.470777 \n", "40 0.477019 \n", "41 0.475284 \n", "42 0.473058 \n", "43 0.477709 \n", "44 0.487758 \n", "45 0.486977 \n", "46 0.475252 \n", "47 0.491126 \n", "48 0.476726 \n", "49 0.469221 " ] }, "execution_count": 496, "metadata": {}, "output_type": "execute_result" } ], "source": [ "running_score = pd.DataFrame(scoring_raw)\n", "running_score.score.value_counts()\n", "running_score.head(50)" ] }, { "cell_type": "markdown", "id": "a3ffd36e-c9f7-44d9-bc17-09b4581d2371", "metadata": {}, "source": [ "#### ExtraTrees Regressor" ] }, { "cell_type": "code", "execution_count": 182, "id": "d61bea2a-3a15-400b-bb03-2da7e9e823f7", "metadata": {}, "outputs": [], "source": [ "def extratrees_regressor(X_train, y_train):\n", " params={'n_estimators': [100,200,300],\n", " 'max_depth': [10],\n", " 'min_samples_split': [2,4,6],\n", " 'min_samples_leaf': [1,3,5],\n", " 'max_features': ['auto','sqrt','log2']}\n", "\n", " for n_estimators in params['n_estimators']:\n", " for max_depth in params['max_depth']:\n", " for min_samples_split in params['min_samples_split']:\n", " for min_samples_leaf in params['min_samples_leaf']:\n", " for max_features in params['max_features']:\n", " start = timer()\n", " model_params = (n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features)\n", " model = ExtraTreesRegressor(random_state=0, n_estimators=n_estimators,\n", " max_depth=max_depth, min_samples_split=min_samples_split,\n", " min_samples_leaf=min_samples_leaf, 
max_features=max_features)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append('Extra Trees Regressor')\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_raw['params'].append(model_params)\n", " scoring_raw['test_time'].append(end - start)\n", " \n" ] }, { "cell_type": "markdown", "id": "50ea13d1-d07a-4520-935c-4ef46d154098", "metadata": {}, "source": [ "#### Gradient Boosting Regressor" ] }, { "cell_type": "code", "execution_count": 183, "id": "2494c056-91da-4470-9f91-fa066f149d1d", "metadata": {}, "outputs": [], "source": [ "def gradientboosting_regressor(X_train, y_train):\n", " params={'n_estimators': [100,200,300],\n", " 'learning_rate': [0.001,0.01,0.1,1],\n", " 'subsample': [0.001,0.1,1],\n", " 'max_depth': [5,10,20]}\n", "\n", " for n_estimators in params['n_estimators']:\n", " for learning_rate in params['learning_rate']:\n", " for subsample in params['subsample']:\n", " for max_depth in params['max_depth']:\n", " start = timer()\n", " model_params = (n_estimators, learning_rate, subsample, max_depth)\n", " model = GradientBoostingRegressor(random_state=0, n_estimators=n_estimators,\n", " learning_rate=learning_rate, subsample=subsample,\n", " max_depth=max_depth)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append('Gradient Boosting Regressor')\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_raw['params'].append(model_params)\n", " scoring_raw['test_time'].append(end - start)\n", " \n" ] }, { "cell_type": "markdown", "id": "f6f6b8fd-2839-4ce5-b5fc-e12531dfa2bd", "metadata": {}, "source": [ "#### Random Forest Regressor" ] 
}, { "cell_type": "code", "execution_count": 184, "id": "8ce341c2-25d4-4b01-bb0e-7427fe42287e", "metadata": {}, "outputs": [], "source": [ "def random_forest_regressor(X_train, y_train):\n", " params={'n_estimators': [100,200,300],\n", " 'max_depth': [10],\n", " 'min_samples_split': [2,4,6],\n", " 'min_samples_leaf': [1,3,5],\n", " 'max_features': ['auto','sqrt','log2']}\n", "\n", " for n_estimators in params['n_estimators']:\n", " for max_depth in params['max_depth']:\n", " for min_samples_split in params['min_samples_split']:\n", " for min_samples_leaf in params['min_samples_leaf']:\n", " for max_features in params['max_features']:\n", " start = timer()\n", " model_params = (n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features)\n", " model = RandomForestRegressor(random_state=0, n_estimators=n_estimators, max_depth=max_depth,\n", " min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,\n", " max_features=max_features)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append('Random Forest Regressor')\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_raw['params'].append(model_params)\n", " scoring_raw['test_time'].append(end - start)\n", " \n" ] }, { "cell_type": "markdown", "id": "7fa6a704-a3dd-464e-b29f-deec542a5e95", "metadata": {}, "source": [ "#### Stacking Regressor" ] }, { "cell_type": "code", "execution_count": 185, "id": "8423fdbd-876e-4476-80b0-04fcd4563536", "metadata": {}, "outputs": [], "source": [ "def stacking_regressor(X_train, y_train):\n", " params={'n_estimators': [100,200,300],\n", " 'max_depth': [10]}\n", "\n", " for n_estimators in params['n_estimators']:\n", " for max_depth in params['max_depth']:\n", " start = timer()\n", " model_params = (n_estimators, max_depth)\n", " estimators = [('lr', 
RidgeCV()),('adr', ARDRegression())]\n", " model = StackingRegressor(estimators=estimators,\n", " final_estimator=RandomForestRegressor(random_state=0, n_estimators=n_estimators, max_depth=max_depth))\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append('Stacking Regressor (RF)')\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_raw['params'].append(model_params)\n", " scoring_raw['test_time'].append(end - start)\n", " \n" ] }, { "cell_type": "markdown", "id": "78ec85b4-e1e7-437e-9dca-4ab87bc4b7fd", "metadata": {}, "source": [ "#### Neural Network - MLP Regressor" ] }, { "cell_type": "code", "execution_count": 186, "id": "d2607170-1db7-47eb-b3f1-3fcc2ee87685", "metadata": {}, "outputs": [], "source": [ "def mlp_regressor(X_train, y_train):\n", " params={\n", " 'hidden_layer_sizes': [(80,20,40,5), (75,30,50,10,3)], \n", " 'activation': ['identity', 'relu','logistic',], \n", " 'solver': ['lbfgs','sgd', 'adam'], \n", " 'alpha': np.logspace(-4,1,10)}\n", "\n", " for hidden_layer_sizes in params['hidden_layer_sizes']:\n", " for activation in params['activation']:\n", " for solver in params['solver']:\n", " for alpha in params['alpha']:\n", " start = timer()\n", " model_params = (hidden_layer_sizes, activation, solver, alpha)\n", " model = MLPRegressor(random_state=1, max_iter=500, hidden_layer_sizes=hidden_layer_sizes,\n", " activation=activation, solver=solver, alpha=alpha)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = regression_test_score(model)\n", " end = timer()\n", "\n", " scoring_raw['model'].append('Neural Network - MLP Regressor')\n", " scoring_raw['score'].append(np.round(model_score*100, 3))\n", " 
scoring_raw['params'].append(model_params)\n", " scoring_raw['test_time'].append(end - start)\n", " \n" ] }, { "cell_type": "code", "execution_count": 187, "id": "ce5294b4-b04a-4176-b736-e14fff55b60a", "metadata": {}, "outputs": [], "source": [ "scoring_raw ={'model':[], 'params': [], 'score': [], 'train_time': [], 'test_time': []}" ] }, { "cell_type": "code", "execution_count": 190, "id": "7bcce6ec-0053-4f24-9808-b9ebdc794584", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running Dumb Classifier\n", "Dumb Classifier done => 0.1877s.\n", "Running Linear Regression\n", "Linear Regression done => 0.4183s.\n", "Running AdaBoost Regressor\n", "AdaBoost Regressor done => 332.6744s.\n", "Running Bagging Regressor\n", "Bagging Regressor done => 134.8787s.\n", "Running Gradient Boosting Regressor\n", "Gradient Boosting Regressor done => 1111.0537s.\n", "Running Random Forest Regressor\n", "Random Forest Regressor done => 585.9226s.\n", "Running Stacking Regressor (RF)\n", "Stacking Regressor (RF) done => 16.3893s.\n", "Running Neural Network (MLP Regressor)\n", "Neural Network (MLP Regressor) done => 1744.2518s.\n", "Running all regressors: 3925.7784s\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>model</th><th>params</th><th>score</th><th>train_time</th><th>test_time</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>Jose's Dumb Classifier</td><td>none</td><td>9.524</td><td>0.000002</td><td>0.187482</td></tr>\n",
" <tr><th>1</th><td>Linear Regression</td><td>True</td><td>38.095</td><td>0.037792</td><td>0.181738</td></tr>\n",
" <tr><th>2</th><td>Linear Regression</td><td>False</td><td>38.095</td><td>0.020009</td><td>0.178630</td></tr>\n",
" <tr><th>3</th><td>AdaBoost Regressor</td><td>(100, 0.001, linear)</td><td>47.619</td><td>7.460298</td><td>0.315744</td></tr>\n",
" <tr><th>4</th><td>AdaBoost Regressor</td><td>(100, 0.001, square)</td><td>47.619</td><td>7.335898</td><td>0.278544</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>514</th><td>Neural Network - MLP Regressor</td><td>((75, 30, 50, 10, 3), logistic, adam, 0.059948...</td><td>0.000</td><td>2.893149</td><td>0.199185</td></tr>\n",
" <tr><th>515</th><td>Neural Network - MLP Regressor</td><td>((75, 30, 50, 10, 3), logistic, adam, 0.215443...</td><td>4.762</td><td>3.605735</td><td>0.191327</td></tr>\n",
" <tr><th>516</th><td>Neural Network - MLP Regressor</td><td>((75, 30, 50, 10, 3), logistic, adam, 0.774263...</td><td>4.762</td><td>4.389698</td><td>0.197765</td></tr>\n",
" <tr><th>517</th><td>Neural Network - MLP Regressor</td><td>((75, 30, 50, 10, 3), logistic, adam, 2.782559...</td><td>0.000</td><td>5.892418</td><td>0.195098</td></tr>\n",
" <tr><th>518</th><td>Neural Network - MLP Regressor</td><td>((75, 30, 50, 10, 3), logistic, adam, 10.0)</td><td>0.000</td><td>5.196142</td><td>0.194635</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>519 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " model \\\n", "0 Jose's Dumb Classifier \n", "1 Linear Regression \n", "2 Linear Regression \n", "3 AdaBoost Regressor \n", "4 AdaBoost Regressor \n", ".. ... \n", "514 Neural Network - MLP Regressor \n", "515 Neural Network - MLP Regressor \n", "516 Neural Network - MLP Regressor \n", "517 Neural Network - MLP Regressor \n", "518 Neural Network - MLP Regressor \n", "\n", " params score train_time \\\n", "0 none 9.524 0.000002 \n", "1 True 38.095 0.037792 \n", "2 False 38.095 0.020009 \n", "3 (100, 0.001, linear) 47.619 7.460298 \n", "4 (100, 0.001, square) 47.619 7.335898 \n", ".. ... ... ... \n", "514 ((75, 30, 50, 10, 3), logistic, adam, 0.059948... 0.000 2.893149 \n", "515 ((75, 30, 50, 10, 3), logistic, adam, 0.215443... 4.762 3.605735 \n", "516 ((75, 30, 50, 10, 3), logistic, adam, 0.774263... 4.762 4.389698 \n", "517 ((75, 30, 50, 10, 3), logistic, adam, 2.782559... 0.000 5.892418 \n", "518 ((75, 30, 50, 10, 3), logistic, adam, 10.0) 0.000 5.196142 \n", "\n", " test_time \n", "0 0.187482 \n", "1 0.181738 \n", "2 0.178630 \n", "3 0.315744 \n", "4 0.278544 \n", ".. ... 
\n", "514 0.199185 \n", "515 0.191327 \n", "516 0.197765 \n", "517 0.195098 \n", "518 0.194635 \n", "\n", "[519 rows x 5 columns]" ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# run all Regressors + the Dumb Classifier\n", "regressors = [('Dumb Classifier', joses_dumb_classifier),\n", " ('Linear Regression', linear_regression),\n", " ('AdaBoost Regressor', adaboost_regressor),\n", " ('Bagging Regressor', bagging_regressor),\n", " ('Gradient Boosting Regressor', gradientboosting_regressor),\n", " ('Random Forest Regressor', random_forest_regressor),\n", " ('Stacking Regressor (RF)', stacking_regressor),\n", " ('Neural Network (MLP Regressor)', mlp_regressor)]\n", "\n", "start = timer()\n", "for regressor in regressors:\n", " start_reg = timer()\n", " print(f'Running {regressor[0]}')\n", " regressor[1](X_train, y_train)\n", " end_reg = timer()\n", " print(f'{regressor[0]} done => {np.round(end_reg - start_reg, 4)}s.')\n", "end = timer()\n", "\n", "print(f'Running all regressors: {np.round(end - start, 4)}s')\n", "running_score = pd.DataFrame(scoring_raw)\n", "running_score\n", "\n", "running_score.to_csv('regressions_scores.csv')" ] }, { "cell_type": "markdown", "id": "d8d89384-5c88-4142-8400-6687f0fe423a", "metadata": {}, "source": [ "\n", "### Classification Approaches\n", "\n", "This problem can also be framed as binary classification, since we want to separate two categories: the race winner and everybody else. As with the regression approach, we submit the details of all 20 competitors in a race, and the classifier predicts for each driver whether they WILL WIN or NOT WIN. Because `predict_proba` returns two probability columns (not win, win) for every competitor, we can sort by the win probability and take the driver with the highest value as our predicted winner.
\n", "\n", "The following classification models were used:\n", "\n", "- Logistic Regression\n", "- Random Forest Classifier\n", "- SVM Classifier\n", "- Ada Boost Classifier\n", "- Extra Trees Classifier\n", "- Gradient Boosting\n", "- Stacking Classifier\n", "- Neural Networks (MLP Classifier)\n", "\n", "" ] }, { "cell_type": "markdown", "id": "4369c6aa-ba12-49a5-8e21-8022e2b33557", "metadata": {}, "source": [ "#### Classification Function" ] }, { "cell_type": "code", "execution_count": 262, "id": "253fb4b5-9045-411e-80f8-9e2c95fbe41b", "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "def classification_test_score(model):\n", " # --- Test ---\n", " score = 0\n", " races = model_df[(model_df['Season'] == 2021)]['Round'].unique()\n", " for race in races:\n", " test = model_df[(model_df['Season'] == 2021) & (model_df['Round'] == race)]\n", " X_test = test.drop(['Position','Driver'], axis = 1)\n", " y_test = test['Position']\n", " X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)\n", "\n", " # make predictions\n", " prediction_df = pd.DataFrame(model.predict_proba(X_test), columns = ['proba_not_win', 'proba_win'])\n", " merged_df = pd.concat([prediction_df, test[['Driver','Position']].reset_index(drop=True)], axis=1)\n", " merged_df.rename(columns = {'Position': 'actual_pos'}, inplace = True)\n", " \n", " # shuffle the rows so that the original order does not decide the\n", " # race winner when several drivers have identical win probabilities\n", " merged_df = merged_df.sample(frac=1).reset_index(drop=True)\n", " merged_df.sort_values(by='proba_win', ascending=False, inplace=True)\n", " merged_df['predicted_pos'] = 0\n", " merged_df.iloc[0, merged_df.columns.get_loc('predicted_pos')] = 1\n", " merged_df.reset_index(drop=True, inplace=True)\n", " \n", " # --- Score ---\n", " score += precision_score(merged_df['actual_pos'], merged_df['predicted_pos'], zero_division=0)\n", "\n", " 
return score / len(races)\n" ] }, { "cell_type": "code", "execution_count": 261, "id": "efdee945-e431-488a-87c6-48f8f4ae3395", "metadata": {}, "outputs": [], "source": [ "scoring_clf_raw ={'model':[], 'params': [], 'score': [], 'train_time': [], 'test_time': []}" ] }, { "cell_type": "markdown", "id": "90af902d-4e43-4ca1-b6ee-1bf238b4be0d", "metadata": {}, "source": [ "#### Logistic Regression\n", "\n", "I also tried [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV); however, its precision score was slightly lower than that of the manual search, so I reverted to the manual method. A likely reason is that GridSearchCV picks its parameters by cross-validated accuracy on the training data, whereas the manual search is scored directly with `classification_test_score` on the 2021 races." ] }, { "cell_type": "code", "execution_count": 252, "id": "3f430fe2-bedd-403d-9da2-906a14e33357", "metadata": {}, "outputs": [], "source": [ "def logistic_regression(X_train, y_train):\n", " params={'penalty': ['l1', 'l2'],\n", " 'solver': ['saga', 'liblinear'],\n", " 'C': np.logspace(-3,1,20)}\n", "\n", " for penalty in params['penalty']:\n", " for solver in params['solver']:\n", " for C in params['C']:\n", " start = timer()\n", " model_params = (penalty, solver, C)\n", " model = LogisticRegression(penalty=penalty, solver=solver, C=C, max_iter = 10000)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", "\n", " scoring_clf_raw['model'].append('Logistic Regression')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model_params)\n", " scoring_clf_raw['test_time'].append(end - start)\n", "\n", "def logistic_regression_grid(X_train, y_train):\n", " params={'penalty': ['l1', 'l2'],\n", " 'solver': ['saga', 'liblinear'],\n", " 'C': np.logspace(-3,1,20)}\n", " \n", " start = timer()\n", " model = GridSearchCV(LogisticRegression(max_iter = 10000), params, 
cv=2)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - start)\n", " \n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", " scoring_clf_raw['model'].append('Logistic Regression')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model.best_params_)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " " ] }, { "cell_type": "markdown", "id": "c05aa471-b5ad-47c0-8d07-f8bfc2582a6b", "metadata": {}, "source": [ "#### Random Forest Classifier\n", "\n", "As with Logistic Regression, I also tried [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV); however, its precision score was slightly lower than that of the manual search, so I reverted to the manual method." ] }, { "cell_type": "code", "execution_count": 253, "id": "ad8e1e85-d9a2-427d-94f0-f9cf96611f65", "metadata": {}, "outputs": [], "source": [ "def random_forest_classifier(X_train, y_train):\n", " params={'n_estimators': [100,200,300],\n", " 'max_depth': [10],\n", " 'min_samples_split': [2,4,6],\n", " 'min_samples_leaf': [1,3,5],\n", " 'max_features': ['auto','sqrt','log2']}\n", "\n", " for n_estimators in params['n_estimators']:\n", " for max_depth in params['max_depth']:\n", " for min_samples_split in params['min_samples_split']:\n", " for min_samples_leaf in params['min_samples_leaf']:\n", " for max_features in params['max_features']:\n", " start = timer()\n", " model_params = (n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features)\n", " model = RandomForestClassifier(random_state=0, n_estimators=n_estimators, max_depth=max_depth,\n", " min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,\n", " max_features=max_features)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " 
scoring_clf_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", "\n", " scoring_clf_raw['model'].append('Random Forest Classifier')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model_params)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " \n", "\n", "def random_forest_classifier_grid(X_train, y_train):\n", " params={'n_estimators': [100,200,300],\n", " 'max_depth': [10],\n", " 'min_samples_split': [2,4,6],\n", " 'min_samples_leaf': [1,3,5],\n", " 'max_features': ['auto','sqrt','log2']}\n", " \n", " start = timer()\n", " model = GridSearchCV(RandomForestClassifier(random_state=0), params)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - start)\n", " \n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", " scoring_clf_raw['model'].append('Random Forest Classifier')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model.best_params_)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " " ] }, { "cell_type": "markdown", "id": "065aa5db-cd15-487b-a324-57905b936a67", "metadata": {}, "source": [ "#### SVM Classifier" ] }, { "cell_type": "code", "execution_count": 254, "id": "65974498-bb6e-43f9-9007-1cbc32364aea", "metadata": {}, "outputs": [], "source": [ "def svm_classifier(X_train, y_train):\n", " params={\n", " 'gamma': np.logspace(-4, -1, 3),\n", " 'C': np.logspace(-2, 1, 3),\n", " 'kernel': ['linear']}\n", " \n", " for gamma in params['gamma']:\n", " for C in params['C']:\n", " for kernel in params['kernel']:\n", " start = timer()\n", " model_params = (gamma, C, kernel)\n", " model = SVC(probability = True, gamma=gamma, C=C, kernel=kernel)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - 
start)\n", "\n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", "\n", " scoring_clf_raw['model'].append('SVM Classifier')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model_params)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " \n" ] }, { "cell_type": "markdown", "id": "4ece93b8-a186-4949-8cf1-6c28b6cc1c79", "metadata": {}, "source": [ "#### AdaBoost Classifier" ] }, { "cell_type": "code", "execution_count": 255, "id": "4fdae65b-2e26-4994-bcd8-a5b496a4a82b", "metadata": {}, "outputs": [], "source": [ "def adaboost_classifier(X_train, y_train):\n", " params={\n", " 'n_estimators': [100,200,300],\n", " 'learning_rate': [0.001,0.01,0.1,1],\n", " 'algorithm': ['SAMME', 'SAMME.R']}\n", " \n", " for n_estimators in params['n_estimators']:\n", " for learning_rate in params['learning_rate']:\n", " for algorithm in params['algorithm']:\n", " start = timer()\n", " model_params = (n_estimators, learning_rate, algorithm)\n", " model = AdaBoostClassifier(random_state=0, n_estimators=n_estimators, \n", " learning_rate=learning_rate, algorithm=algorithm)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", "\n", " scoring_clf_raw['model'].append('Ada Boost Classifier')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model_params)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " " ] }, { "cell_type": "markdown", "id": "1abdf32c-41e7-42a4-b483-df1c52311de3", "metadata": {}, "source": [ "#### ExtraTrees Classifier" ] }, { "cell_type": "code", "execution_count": 256, "id": "c1f869e3-062a-449a-9bba-e36c97662177", "metadata": {}, "outputs": [], "source": [ "def extratrees_classifier(X_train, y_train):\n", " params={\n", " 
'n_estimators': [100,200,300],\n", " 'max_depth': [10],\n", " 'min_samples_split': [2,4,6],\n", " 'min_samples_leaf': [1,3,5],\n", " 'max_features': ['auto','sqrt','log2']}\n", " \n", " for n_estimators in params['n_estimators']:\n", " for max_depth in params['max_depth']:\n", " for min_samples_split in params['min_samples_split']:\n", " for min_samples_leaf in params['min_samples_leaf']:\n", " for max_features in params['max_features']:\n", " start = timer()\n", " model_params = (n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features)\n", " model = ExtraTreesClassifier(random_state=0, n_estimators=n_estimators,\n", " max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf,\n", " max_features=max_features)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", "\n", " scoring_clf_raw['model'].append('Extra Trees Classifier')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model_params)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " " ] }, { "cell_type": "markdown", "id": "834db71c-5c41-4649-a901-a4da0d0eac12", "metadata": {}, "source": [ "#### GradientBoosting Classifier" ] }, { "cell_type": "code", "execution_count": 257, "id": "96ad39a8-4954-4d87-8af2-9bcec612aaf6", "metadata": {}, "outputs": [], "source": [ "def gradient_boosting_classifier(X_train, y_train):\n", " params={\n", " 'n_estimators': [100,200,300],\n", " 'learning_rate': [0.001,0.01,0.1,1],\n", " 'subsample': [0.001,0.1,1],\n", " 'max_depth': [5,10,20]}\n", " \n", " for n_estimators in params['n_estimators']:\n", " for learning_rate in params['learning_rate']:\n", " for subsample in params['subsample']:\n", " for max_depth in params['max_depth']:\n", " start = timer()\n", " model_params = (n_estimators, 
learning_rate, subsample, max_depth)\n", " model = GradientBoostingClassifier(random_state=0, n_estimators=n_estimators,\n", " learning_rate=learning_rate, subsample=subsample, max_depth=max_depth)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", "\n", " scoring_clf_raw['model'].append('Gradient Boosting Classifier')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model_params)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " " ] }, { "cell_type": "markdown", "id": "f94119c9-cd6c-4d14-ae19-0e578a93fd83", "metadata": {}, "source": [ "#### Stacking Classifier" ] }, { "cell_type": "code", "execution_count": 258, "id": "52997d2a-6490-4dc4-97f6-2941f64adccb", "metadata": {}, "outputs": [], "source": [ "def stacking_classifier(X_train, y_train):\n", " params={\n", " 'n_estimators': [100,200,300]}\n", " \n", " for n_estimators in params['n_estimators']:\n", " start = timer()\n", " model_params = (n_estimators)\n", " estimators = [('rf', RandomForestClassifier(n_estimators=n_estimators, random_state=42)),\n", " ('svr', LinearSVC(random_state=45))]\n", "\n", " model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", "\n", " scoring_clf_raw['model'].append('Stacking Classifier')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model_params)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " " ] }, { "cell_type": "markdown", "id": "5ae1a022-a65b-47cc-aa82-7bdf5a01e8ae", "metadata": {}, "source": [ "#### Neural Network (MLP Classifier)" 
] }, { "cell_type": "code", "execution_count": 259, "id": "2e4021da-8eda-4efc-81d2-eef41d8ba290", "metadata": {}, "outputs": [], "source": [ "def mlp_classifier(X_train, y_train):\n", " params={\n", " 'hidden_layer_sizes': [(80,20,40,5), (75,30,50,10,3)], \n", " 'activation': ['identity', 'relu','logistic',], \n", " 'solver': ['lbfgs','sgd', 'adam'], \n", " 'alpha': np.logspace(-4,1,5)}\n", " \n", " for hidden_layer_sizes in params['hidden_layer_sizes']:\n", " for activation in params['activation']:\n", " for solver in params['solver']:\n", " for alpha in params['alpha']:\n", " start = timer()\n", " model_params = (hidden_layer_sizes, activation, solver, alpha)\n", " model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, activation=activation, solver=solver, alpha=alpha, max_iter=20)\n", " model.fit(X_train, y_train)\n", " end = timer()\n", " scoring_clf_raw['train_time'].append(end - start)\n", "\n", " start = timer()\n", " model_score = classification_test_score(model)\n", " end = timer()\n", "\n", " scoring_clf_raw['model'].append('Neural Network (MLP Classifier)')\n", " scoring_clf_raw['score'].append(np.round(model_score*100, 3))\n", " scoring_clf_raw['params'].append(model_params)\n", " scoring_clf_raw['test_time'].append(end - start)\n", " " ] }, { "cell_type": "code", "execution_count": 263, "id": "76527677-eed0-464b-89ca-cc844cc5876b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running Logistic Regression\n", "Logistic Regression done => 35.1231s.\n", "Running Random Forest Classifier\n", "Random Forest Classifier done => 250.337s.\n", "Running Neural Network (MLP Classifier)\n", "Neural Network (MLP Classifier) done => 235.9248s.\n", "Running AdaBoost Classifier\n", "AdaBoost Classifier done => 109.1923s.\n", "Running Extra Trees Classifier\n", "Extra Trees Classifier done => 200.5558s.\n", "Running Gradient Boosting Classifier\n", "Gradient Boosting Classifier done => 1413.9461s.\n", "Running Stacking 
Classifier\n", "Stacking Classifier done => 62.2184s.\n", "Running all classifiers: 2307.3023s\n" ] } ], "source": [ "# run all Classifiers (more than 8 hours!)\n", "classifiers = [\n", " ('Logistic Regression', logistic_regression),\n", " ('Random Forest Classifier', random_forest_classifier),\n", " ('Neural Network (MLP Classifier)', mlp_classifier), \n", " ('AdaBoost Classifier', adaboost_classifier),\n", " ('Extra Trees Classifier', extratrees_classifier),\n", " ('Gradient Boosting Classifier', gradient_boosting_classifier),\n", " ('Stacking Classifier', stacking_classifier),\n", " # ('SVM Classifier', svm_classifier) => removed as takes a few hours, and with similar results as other models\n", "]\n", "\n", "start = timer()\n", "for classifier in classifiers:\n", " start_clf = timer()\n", " print(f'Running {classifier[0]}')\n", " classifier[1](X_train, y_train)\n", " end_clf = timer()\n", " print(f'{classifier[0]} done => {np.round(end_clf - start_clf, 4)}s.')\n", "end = timer()\n", "\n", "print(f'Running all classifiers: {np.round(end - start, 4)}s')\n", "running_score = pd.DataFrame(scoring_clf_raw)\n", "\n", "running_score.to_csv('classification_scores.csv')" ] }, { "cell_type": "markdown", "id": "f6fdae0d-4ba5-4214-8810-fb1dd3d44a08", "metadata": {}, "source": [ "\n", "#### Comparing all Machine Learning models used" ] }, { "cell_type": "code", "execution_count": 278, "id": "2cfa45df-93ad-4596-84da-1c351adf3d5e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>params</th><th>score</th><th>train_time</th><th>test_time</th></tr>\n",
" <tr><th>model</th><th></th><th></th><th></th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>Bagging Regressor (DT)</th><td>(200, 10, 50, True, True)</td><td>61.905</td><td>0.975334</td><td>0.416948</td></tr>\n",
" <tr><th>Neural Network - MLP Regressor</th><td>((75, 30, 50, 10, 3), 'identity', 'sgd', 10.0)</td><td>61.905</td><td>10.226832</td><td>0.200022</td></tr>\n",
" <tr><th>Gradient Boosting Regressor</th><td>(300, 0.1, 0.001, 10)</td><td>57.143</td><td>0.606889</td><td>0.171962</td></tr>\n",
" <tr><th>Gradient Boosting Classifier</th><td>(200, 0.01, 0.001, 20)</td><td>57.143</td><td>0.920284</td><td>0.166690</td></tr>\n",
" <tr><th>AdaBoost Regressor</th><td>(200, 0.01, 'square')</td><td>52.381</td><td>14.969050</td><td>0.449939</td></tr>\n",
" <tr><th>Random Forest Regressor</th><td>(300, 10, 2, 5, 'auto')</td><td>52.381</td><td>23.799965</td><td>0.553645</td></tr>\n",
" <tr><th>Ada Boost Classifier</th><td>(200, 0.1, 'SAMME')</td><td>52.381</td><td>3.774158</td><td>0.445910</td></tr>\n",
" <tr><th>Logistic Regression</th><td>('l1', 'liblinear', 0.0069519279617756054)</td><td>47.619</td><td>0.049125</td><td>0.162838</td></tr>\n",
" <tr><th>Neural Network (MLP Classifier)</th><td>((80, 20, 40, 5), 'logistic', 'sgd', 0.0017782...</td><td>42.857</td><td>3.335669</td><td>0.208089</td></tr>\n",
" <tr><th>Linear Regression</th><td>True</td><td>38.095</td><td>0.037792</td><td>0.181738</td></tr>\n",
" <tr><th>Stacking Regressor (RF)</th><td>(100, 10)</td><td>38.095</td><td>2.767714</td><td>0.325875</td></tr>\n",
" <tr><th>Extra Trees Classifier</th><td>(100, 10, 2, 1, 'auto')</td><td>38.095</td><td>1.056110</td><td>0.330396</td></tr>\n",
" <tr><th>Random Forest Classifier</th><td>(100, 10, 2, 3, 'log2')</td><td>38.095</td><td>1.189105</td><td>0.328895</td></tr>\n",
" <tr><th>Stacking Classifier</th><td>100</td><td>33.333</td><td>10.538083</td><td>0.374325</td></tr>\n",
" <tr><th>Jose's Dumb Classifier</th><td>none</td><td>9.524</td><td>0.000002</td><td>0.187482</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " params \\\n", "model \n", "Bagging Regressor (DT) (200, 10, 50, True, True) \n", "Neural Network - MLP Regressor ((75, 30, 50, 10, 3), 'identity', 'sgd', 10.0) \n", "Gradient Boosting Regressor (300, 0.1, 0.001, 10) \n", "Gradient Boosting Classifier (200, 0.01, 0.001, 20) \n", "AdaBoost Regressor (200, 0.01, 'square') \n", "Random Forest Regressor (300, 10, 2, 5, 'auto') \n", "Ada Boost Classifier (200, 0.1, 'SAMME') \n", "Logistic Regression ('l1', 'liblinear', 0.0069519279617756054) \n", "Neural Network (MLP Classifier) ((80, 20, 40, 5), 'logistic', 'sgd', 0.0017782... \n", "Linear Regression True \n", "Stacking Regressor (RF) (100, 10) \n", "Extra Trees Classifier (100, 10, 2, 1, 'auto') \n", "Random Forest Classifier (100, 10, 2, 3, 'log2') \n", "Stacking Classifier 100 \n", "Jose's Dumb Classifier none \n", "\n", " score train_time test_time \n", "model \n", "Bagging Regressor (DT) 61.905 0.975334 0.416948 \n", "Neural Network - MLP Regressor 61.905 10.226832 0.200022 \n", "Gradient Boosting Regressor 57.143 0.606889 0.171962 \n", "Gradient Boosting Classifier 57.143 0.920284 0.166690 \n", "AdaBoost Regressor 52.381 14.969050 0.449939 \n", "Random Forest Regressor 52.381 23.799965 0.553645 \n", "Ada Boost Classifier 52.381 3.774158 0.445910 \n", "Logistic Regression 47.619 0.049125 0.162838 \n", "Neural Network (MLP Classifier) 42.857 3.335669 0.208089 \n", "Linear Regression 38.095 0.037792 0.181738 \n", "Stacking Regressor (RF) 38.095 2.767714 0.325875 \n", "Extra Trees Classifier 38.095 1.056110 0.330396 \n", "Random Forest Classifier 38.095 1.189105 0.328895 \n", "Stacking Classifier 33.333 10.538083 0.374325 \n", "Jose's Dumb Classifier 9.524 0.000002 0.187482 " ] }, "execution_count": 278, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.ticker as mtick\n", "\n", "def read_csv_dataframe(filename):\n", " summary = pd.read_csv(filename)\n", " summary.drop(columns=['Unnamed: 0'],inplace=True)\n", " groups = summary.groupby(by=['model'])\n", " summary = groups.apply(lambda x: x[x['score'] == x['score'].max()].iloc[0])\n", " summary.sort_values(by='score', inplace=True)\n", " summary.set_index('model', inplace=True)\n", " return summary\n", "\n", "summary_df = read_csv_dataframe('regressions_scores.csv')\n", "summary_clf_df = read_csv_dataframe('classification_scores.csv')\n", "\n", "# merge\n", "merged_scores_df = pd.concat([summary_df, summary_clf_df])\n", "merged_scores_df.sort_values(by='score', inplace=True)\n", "merged_scores_df\n", "\n", "# plot feature importance \n", "axis = feature_importance_df.plot(kind='barh', title=\"F1 Prediction Feature Importance\", figsize=(16, 4), color='#05AFF2', grid=True,)\n", "y_label = axis.yaxis.get_label()\n", "y_label.set_visible(False)\n", "\n", "# plot precision scores\n", "axis = merged_scores_df[['score']].plot(kind='barh', figsize=(16,10), color=['#00CC99','#9370DB','#CA98BF'], grid=True, linestyle='--')\n", "axis.xaxis.set_major_formatter(mtick.PercentFormatter())\n", "axis.set_title('F1 Prediction Precision Scores', pad=20)\n", "y_label = axis.yaxis.get_label()\n", "y_label.set_visible(False)\n", "merged_scores_df.sort_values(by='score', inplace=True, ascending=False)\n", "merged_scores_df" ] }, { "cell_type": "markdown", "id": "f78c83a0-77ca-45d5-9c9e-7e381bfbd28d", "metadata": {}, "source": [ "### Conclusion\n", "\n", "Following our [Champion/Challenger](https://medium.com/@awaiskaleem/mlflow-tips-n-tricks-eb1ac013edd1) model, the winning model is the one built with [Bagging Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html) ([Decision Tree 
Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html?highlight=decisiontree#sklearn.tree.DecisionTreeRegressor)). This is an ensemble model from scikit-learn which uses DecisionTreeRegressor as its base estimator.\n", "\n", "[Neural Network - MLP Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html?highlight=mlp#sklearn.neural_network.MLPRegressor) ended up with the same precision score; however, because it took roughly 10x longer to train (10.23s vs 0.98s), I have relegated it to 2nd place. \n", "\n", "- **Bagging Regressor (Decision Tree Regressor)**\n", "- Train time: **0.97s**\n", "- Test time: **0.42s**\n", "- **Grid**, **ConstructorRecentForm**, **DriverRecentForm** and **DriverRecentWins** are the most significant features. The last three are among the features we engineered for this project.\n", "- Precision score: **61.9%**\n", "- For each race, the finishing order is also available, but omitted here for brevity\n", "- Correctly predicted the winner in **13 out of 21** races in the 2021 season\n", "\n" ] }, { "cell_type": "markdown", "id": "9d2bcc6a-3330-4da3-a7cc-c95236e4667b", "metadata": {}, "source": [ "### Preparing for REST API" ] }, { "cell_type": "code", "execution_count": 85, "id": "288a9f49-ef45-407a-878b-48e4975c9009", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "66\n", "bagging_regressor_pickle took 0.9400396940000064 s\n" ] } ], "source": [ "def bagging_regressor_pickle(X_train, y_train):\n", " # now use the winning parameters to build the model\n", " model = BaggingRegressor(random_state=0, base_estimator=DecisionTreeRegressor(),\n", " n_estimators=200, max_samples=10, max_features=50, bootstrap=True, bootstrap_features=True)\n", " model.fit(X_train, y_train)\n", " pickle.dump(model, open('f1-model.pkl', 'wb'))\n", " print(model.n_features_in_)\n", " \n", "\n", "start = timer()\n", "bagging_regressor_pickle(X_train, y_train)\n", "end 
= timer()\n", "print(f'bagging_regressor_pickle took {end - start} s')" ] }, { "cell_type": "code", "execution_count": 96, "id": "6e917f1b-63ba-4892-8719-1ac13903343b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.5714285714285714\n", "0.5714285714285714\n", "0.6190476190476191\n", "0.6190476190476191\n", "0.5714285714285714\n", "0.6190476190476191\n", "0.5238095238095238\n", "0.6190476190476191\n", "0.6666666666666666\n", "0.6190476190476191\n", "bagging_regressor_inference took 4.709704332999536 s\n" ] } ], "source": [ "start = timer()\n", "from_pickled_model = pickle.load(open('f1-model.pkl', 'rb'))\n", "for i in range(0, 10):\n", " f1_model_score = regression_test_score(from_pickled_model)\n", " print(f1_model_score)\n", "end = timer()\n", "print(f'bagging_regressor_inference took {end - start} s')" ] }, { "cell_type": "code", "execution_count": 637, "id": "5c7c7681-182a-42b8-9d4e-1808ebeaf32a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(25380, 68)" ] }, "execution_count": 637, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results_df.shape" ] }, { "cell_type": "code", "execution_count": null, "id": "9dfb2f0f-3962-4473-a897-696a1b3ed934", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }