{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(**Click the link below to open this notebook in Colab**)\n", "\n", "[Open in Colab](https://colab.research.google.com/github/xiangshiyin/machine-learning-for-actuarial-science/blob/main/2025-spring/week06/notebook/demo.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview\n", "\n", "In our last class, we explored the Titanic dataset, examined it from multiple perspectives, and applied various feature engineering techniques to enhance its explanatory variables. Today, we will continue working with the Titanic dataset, focusing on model training and evaluation techniques to gain deeper insights into predictive modeling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load the dataset\n", "\n", "https://www.kaggle.com/competitions/titanic/data\n", "- **The Titanic** https://en.wikipedia.org/wiki/Titanic\n", "\n", "| Variable | Definition | Key |\n", "|------------|-------------------------------------------|--------------------------------------|\n", "| survival | Survival | 0 = No, 1 = Yes |\n", "| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |\n", "| sex | Sex | |\n", "| Age | Age in years | |\n", "| sibsp | # of siblings / spouses aboard the Titanic | |\n", "| parch | # of parents / children aboard the Titanic | |\n", "| ticket | Ticket number | |\n", "| fare | Passenger fare | |\n", "| cabin | Cabin number | |\n", "| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.set_option('display.max_rows', None)\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv('../data/titanic/train.csv')\n", "test = pd.read_csv('../data/titanic/test.csv')\n", "\n", "# convert all column names to lowercase\n", 
"train.columns = train.columns.str.lower()\n", "test.columns = test.columns.str.lower()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>passengerid</th>\n", " <th>survived</th>\n", " <th>pclass</th>\n", " <th>name</th>\n", " <th>sex</th>\n", " <th>age</th>\n", " <th>sibsp</th>\n", " <th>parch</th>\n", " <th>ticket</th>\n", " <th>fare</th>\n", " <th>cabin</th>\n", " <th>embarked</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Braund, Mr. Owen Harris</td>\n", " <td>male</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n", " <td>female</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>C</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Heikkinen, Miss. Laina</td>\n", " <td>female</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>STON/O2. 3101282</td>\n", " <td>7.9250</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " passengerid survived pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "\n", " name sex age sibsp \\\n", "0 Braund, Mr. 
Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "\n", " parch ticket fare cabin embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head(3)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>passengerid</th>\n", " <th>pclass</th>\n", " <th>name</th>\n", " <th>sex</th>\n", " <th>age</th>\n", " <th>sibsp</th>\n", " <th>parch</th>\n", " <th>ticket</th>\n", " <th>fare</th>\n", " <th>cabin</th>\n", " <th>embarked</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>892</td>\n", " <td>3</td>\n", " <td>Kelly, Mr. James</td>\n", " <td>male</td>\n", " <td>34.5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>330911</td>\n", " <td>7.8292</td>\n", " <td>NaN</td>\n", " <td>Q</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>893</td>\n", " <td>3</td>\n", " <td>Wilkes, Mrs. James (Ellen Needs)</td>\n", " <td>female</td>\n", " <td>47.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>363272</td>\n", " <td>7.0000</td>\n", " <td>NaN</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>894</td>\n", " <td>2</td>\n", " <td>Myles, Mr. 
Thomas Francis</td>\n", " <td>male</td>\n", " <td>62.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>240276</td>\n", " <td>9.6875</td>\n", " <td>NaN</td>\n", " <td>Q</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " passengerid pclass name sex age sibsp \\\n", "0 892 3 Kelly, Mr. James male 34.5 0 \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 \n", "2 894 2 Myles, Mr. Thomas Francis male 62.0 0 \n", "\n", " parch ticket fare cabin embarked \n", "0 0 330911 7.8292 NaN Q \n", "1 0 363272 7.0000 NaN S \n", "2 0 240276 9.6875 NaN Q " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "# Streamline the data transformations\n", "\n", "Here are the data exploration and transformation strategies we have used so far:\n", "* Quick survey across key variables\n", "* Detect and address data anomalies\n", "  * Missing values\n", "  * Outliers\n", "* Feature engineering\n", "  * Encode categorical variables\n", "  * Normalize numerical variables\n", "  * Create new features with stronger predictive power\n", "\n", "The data exploration process is typically iterative and complex. Once we understand the data and have candidate strategies for feature engineering, we need to make sure these transformations can be applied easily and consistently to new datasets, such as the test set and new batches of data for model retraining. This calls for a systematic approach that streamlines the transformations, so that we don't have to start from scratch and repeat the same steps for each new dataset. This is especially important in real-world settings, where we want to productionize and automate the data transformation process." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "passengerid 0\n", "survived 0\n", "pclass 0\n", "name 0\n", "sex 0\n", "age 177\n", "sibsp 0\n", "parch 0\n", "ticket 0\n", "fare 0\n", "cabin 687\n", "embarked 2\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing value imputation" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def impute_missing_num_values(df):\n", " \"\"\"\n", " Impute missing values in numerical columns of a DataFrame using the median of each column.\n", "\n", " Args:\n", " df (pandas.DataFrame): The DataFrame to impute missing values in.\n", "\n", " Returns:\n", " pandas.DataFrame: The DataFrame with missing values imputed.\n", " \"\"\"\n", " # Select only the numerical columns\n", " num_cols = df.select_dtypes(include=['float64', 'int64']).columns\n", " # Impute missing values with the median of each column\n", " for col in num_cols:\n", " df[col] = df[col].fillna(df[col].median())\n", " return df\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "passengerid 0\n", "survived 0\n", "pclass 0\n", "name 0\n", "sex 0\n", "age 0\n", "sibsp 0\n", "parch 0\n", "ticket 0\n", "fare 0\n", "cabin 687\n", "embarked 2\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = impute_missing_num_values(train)\n", "train.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# categorical variables could have missing values too\n", "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html\n", "\n", "def impute_missing_cat_values(df, ignore_list):\n", " \"\"\"\n", " Impute missing categorical values with the most 
frequent value.\n", "\n", " Args:\n", " df (pd.DataFrame): DataFrame containing the data.\n", " ignore_list (list): List of column names to ignore. \n", " Returns:\n", " pd.DataFrame: DataFrame with imputed missing categorical values.\n", " \"\"\"\n", " for col in df.columns:\n", " if col not in ignore_list:\n", " if df[col].dtype == 'object':\n", " df[col] = df[col].fillna(df[col].mode()[0])\n", " return df" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "passengerid 0\n", "survived 0\n", "pclass 0\n", "name 0\n", "sex 0\n", "age 0\n", "sibsp 0\n", "parch 0\n", "ticket 0\n", "fare 0\n", "cabin 687\n", "embarked 0\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = impute_missing_cat_values(train, ignore_list=['cabin'])\n", "train.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⚠️ **Attention:** We will treat the missing values in `cabin` in the feature engineering step!!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering\n", "- Encode categorical features\n", "- Normalize numerical features\n", "- Create new features" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['name', 'sex', 'ticket', 'cabin', 'embarked']\n" ] } ], "source": [ "# The categorical variables in the datasets\n", "\n", "cat_cols = [\n", " col\n", " for col in train.columns if train[col].dtype == \"object\"\n", "] \n", "print(cat_cols)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# apply onehot encoding to the categorical columns\n", "# use the sklearn library\n", "\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "def onehot_encode(df, ignore_list):\n", " cat_cols = [\n", " col for col in df.columns if col not in ignore_list and df[col].dtype == 
'object'\n", " ]\n", " encoder = OneHotEncoder()\n", " encoded = encoder.fit_transform(df[cat_cols])\n", " encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out(cat_cols))\n", " df = pd.concat([df, encoded_df], axis=1)\n", " df = df.drop(cat_cols, axis=1)\n", " return df" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "train = onehot_encode(train, ignore_list=['cabin', 'name', 'ticket'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>passengerid</th>\n", " <th>survived</th>\n", " <th>pclass</th>\n", " <th>name</th>\n", " <th>age</th>\n", " <th>sibsp</th>\n", " <th>parch</th>\n", " <th>ticket</th>\n", " <th>fare</th>\n", " <th>cabin</th>\n", " <th>sex_female</th>\n", " <th>sex_male</th>\n", " <th>embarked_C</th>\n", " <th>embarked_Q</th>\n", " <th>embarked_S</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>Braund, Mr. Owen Harris</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>A/5 21171</td>\n", " <td>7.2500</td>\n", " <td>NaN</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>Cumings, Mrs. 
John Bradley (Florence Briggs Th...</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>PC 17599</td>\n", " <td>71.2833</td>\n", " <td>C85</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>Heikkinen, Miss. Laina</td>\n", " <td>26.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>STON/O2. 3101282</td>\n", " <td>7.9250</td>\n", " <td>NaN</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " passengerid survived pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "\n", " name age sibsp parch \\\n", "0 Braund, Mr. Owen Harris 22.0 1 0 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 \n", "2 Heikkinen, Miss. Laina 26.0 0 0 \n", "\n", " ticket fare cabin sex_female sex_male embarked_C \\\n", "0 A/5 21171 7.2500 NaN 0.0 1.0 0.0 \n", "1 PC 17599 71.2833 C85 1.0 0.0 1.0 \n", "2 STON/O2. 
3101282 7.9250 NaN 1.0 0.0 0.0 \n", "\n", " embarked_Q embarked_S \n", "0 0.0 1.0 \n", "1 0.0 0.0 \n", "2 0.0 1.0 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# transform numeric features, log transform `fare`\n", "import numpy as np\n", "\n", "def log_transform(df, features, drop=False):\n", " for feature in features:\n", " df[feature+'_log'] = np.log1p(df[feature]) \n", " if drop:\n", " df = df.drop(features, axis=1)\n", " return df" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['passengerid', 'survived', 'pclass', 'name', 'age', 'sibsp', 'parch',\n", " 'ticket', 'fare', 'cabin', 'sex_female', 'sex_male', 'embarked_C',\n", " 'embarked_Q', 'embarked_S', 'fare_log'],\n", " dtype='object')" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = log_transform(train, features=['fare'])\n", "train.columns" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Create new features\n", "import re\n", "\n", "def create_features(df):\n", " df['has_cabin'] = df['cabin'].apply(lambda x: 0 if type(x) == float else 1)\n", " df['family_size'] = df['sibsp'] + df['parch'] + 1\n", " df['is_alone'] = df['family_size'].apply(lambda x: 1 if x == 1 else 0)\n", " df['title'] = df['name'].apply(lambda x: re.search('([A-Z][a-z]+)\\\\.', x).group(1))\n", " df['cabin'] = df['cabin'].fillna('U0')\n", " df['deck'] = df['cabin'].apply(lambda x: re.search('([A-Z]+)', x).group(1))\n", " df['name_len_cat'] = df['name'].apply(lambda x: 0 if len(x) <= 23 else 1 if len(x) <= 28 else 2 if len(x) <= 40 else 3)\n", " df['age_cat'] = df['age'].apply(lambda x: 0 if x <= 14 else 1 if x <= 30 else 
2 if x <= 40 else 3 if x <= 50 else 4 if x <= 60 else 5)\n", " df['fare_log_cat'] = df['fare_log'].apply(lambda x: 0 if x <= 2.7 else 1 if x <= 3.2 else 2 if x <= 3.6 else 3)\n", " return df" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['passengerid', 'survived', 'pclass', 'name', 'age', 'sibsp', 'parch',\n", " 'ticket', 'fare', 'cabin', 'sex_female', 'sex_male', 'embarked_C',\n", " 'embarked_Q', 'embarked_S', 'fare_log', 'has_cabin', 'family_size',\n", " 'is_alone', 'title', 'deck', 'name_len_cat', 'age_cat', 'fare_log_cat'],\n", " dtype='object')" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = create_features(train)\n", "train.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Put all together" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import re\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "def impute_missing_num_values(df):\n", " \"\"\"\n", " Impute missing values in numerical columns of a DataFrame using the median of each column.\n", "\n", " Args:\n", " df (pandas.DataFrame): The DataFrame to impute missing values in.\n", "\n", " Returns:\n", " pandas.DataFrame: The DataFrame with missing values imputed.\n", " \"\"\"\n", " # Select only the numerical columns\n", " num_cols = df.select_dtypes(include=['float64', 'int64']).columns\n", " # Impute missing values with the median of each column\n", " for col in num_cols:\n", " df[col] = df[col].fillna(df[col].median())\n", " return df\n", "\n", "def impute_missing_cat_values(df, ignore_list):\n", " \"\"\"\n", " Impute missing categorical values with the most frequent value.\n", "\n", " Args:\n", " df (pd.DataFrame): DataFrame containing the data.\n", " ignore_list (list): List of column names to ignore. 
\n", " Returns:\n", " pd.DataFrame: DataFrame with imputed missing categorical values.\n", " \"\"\"\n", " for col in df.columns:\n", " if col not in ignore_list:\n", " if df[col].dtype == 'object':\n", " df[col] = df[col].fillna(df[col].mode()[0])\n", " return df\n", "\n", "def log_transform(df, features, drop=False):\n", " for feature in features:\n", " df[feature+'_log'] = np.log1p(df[feature]) \n", " if drop:\n", " df = df.drop(features, axis=1)\n", " return df\n", "\n", "def create_features(df):\n", " df['has_cabin'] = df['cabin'].apply(lambda x: 0 if type(x) == float else 1)\n", " df['family_size'] = df['sibsp'] + df['parch'] + 1\n", " df['is_alone'] = df['family_size'].apply(lambda x: 1 if x == 1 else 0)\n", " df['title'] = df['name'].apply(lambda x: re.search('([A-Z][a-z]+)\\\\.', x).group(1))\n", " df['cabin'] = df['cabin'].fillna('U0')\n", " df['deck'] = df['cabin'].apply(lambda x: re.search('([A-Z]+)', x).group(1))\n", " df['name_len_cat'] = df['name'].apply(lambda x: 0 if len(x) <= 23 else 1 if len(x) <= 28 else 2 if len(x) <= 40 else 3)\n", " df['age_cat'] = df['age'].apply(lambda x: 0 if x <= 14 else 1 if x <= 30 else 2 if x <= 40 else 3 if x <= 50 else 4 if x <= 60 else 5)\n", " df['fare_log_cat'] = df['fare_log'].apply(lambda x: 0 if x <= 2.7 else 1 if x <= 3.2 else 2 if x <= 3.6 else 3)\n", " return df\n", "\n", "def load_data():\n", " train = pd.read_csv('../data/titanic/train.csv')\n", " test = pd.read_csv('../data/titanic/test.csv')\n", " # convert all column names to lower cases\n", " train.columns = train.columns.str.lower()\n", " test.columns = test.columns.str.lower() \n", " return train, test\n", "\n", "def transform_data(df, encoder=None):\n", " df = impute_missing_num_values(df)\n", " df = impute_missing_cat_values(df, ['cabin', 'embarked'])\n", " df = log_transform(df, ['fare'])\n", " df = create_features(df)\n", " \n", " cat_attributes = ['sex', 'embarked', 'title', 'deck']\n", " if not encoder:\n", " encoder = 
OneHotEncoder(handle_unknown='ignore', sparse_output=False)\n", " encoder.fit(df[cat_attributes])\n", " encoded = encoder.transform(df[cat_attributes])\n", " df = pd.concat([df, pd.DataFrame(encoded, columns=encoder.get_feature_names_out(cat_attributes))], axis=1)\n", " \n", " # drop columns that are not needed\n", " df = df.drop([\n", " 'name', 'ticket', 'cabin', 'fare'\n", " # , 'age', 'fare', 'sibsp', 'parch'\n", " ] + cat_attributes, axis=1)\n", " return df, encoder\n", "\n", "train, test = load_data()\n", "train, encoder = transform_data(train)\n", "test, _ = transform_data(test, encoder)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>passengerid</th>\n", " <th>survived</th>\n", " <th>pclass</th>\n", " <th>age</th>\n", " <th>sibsp</th>\n", " <th>parch</th>\n", " <th>fare_log</th>\n", " <th>has_cabin</th>\n", " <th>family_size</th>\n", " <th>is_alone</th>\n", " <th>name_len_cat</th>\n", " <th>age_cat</th>\n", " <th>fare_log_cat</th>\n", " <th>sex_female</th>\n", " <th>sex_male</th>\n", " <th>embarked_C</th>\n", " <th>embarked_Q</th>\n", " <th>embarked_S</th>\n", " <th>embarked_nan</th>\n", " <th>title_Capt</th>\n", " <th>title_Col</th>\n", " <th>title_Countess</th>\n", " <th>title_Don</th>\n", " <th>title_Dr</th>\n", " <th>title_Jonkheer</th>\n", " <th>title_Lady</th>\n", " <th>title_Major</th>\n", " <th>title_Master</th>\n", " <th>title_Miss</th>\n", " <th>title_Mlle</th>\n", " <th>title_Mme</th>\n", " <th>title_Mr</th>\n", " <th>title_Mrs</th>\n", " <th>title_Ms</th>\n", " <th>title_Rev</th>\n", " 
<th>title_Sir</th>\n", " <th>deck_A</th>\n", " <th>deck_B</th>\n", " <th>deck_C</th>\n", " <th>deck_D</th>\n", " <th>deck_E</th>\n", " <th>deck_F</th>\n", " <th>deck_G</th>\n", " <th>deck_T</th>\n", " <th>deck_U</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>2.110213</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>4.280593</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " 
<td>0.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " passengerid survived pclass age sibsp parch fare_log has_cabin \\\n", "0 1 0 3 22.0 1 0 2.110213 0 \n", "1 2 1 1 38.0 1 0 4.280593 1 \n", "\n", " family_size is_alone name_len_cat age_cat fare_log_cat sex_female \\\n", "0 2 0 0 1 0 0.0 \n", "1 2 0 3 2 3 1.0 \n", "\n", " sex_male embarked_C embarked_Q embarked_S embarked_nan title_Capt \\\n", "0 1.0 0.0 0.0 1.0 0.0 0.0 \n", "1 0.0 1.0 0.0 0.0 0.0 0.0 \n", "\n", " title_Col title_Countess title_Don title_Dr title_Jonkheer title_Lady \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " title_Major title_Master title_Miss title_Mlle title_Mme title_Mr \\\n", "0 0.0 0.0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " title_Mrs title_Ms title_Rev title_Sir deck_A deck_B deck_C deck_D \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 \n", "\n", " deck_E deck_F deck_G deck_T deck_U \n", "0 0.0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 0.0 0.0 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.head(2)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(418, 44)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.shape" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>passengerid</th>\n", " <th>pclass</th>\n", " <th>age</th>\n", " <th>sibsp</th>\n", " <th>parch</th>\n", " 
<th>fare_log</th>\n", " <th>has_cabin</th>\n", " <th>family_size</th>\n", " <th>is_alone</th>\n", " <th>name_len_cat</th>\n", " <th>age_cat</th>\n", " <th>fare_log_cat</th>\n", " <th>sex_female</th>\n", " <th>sex_male</th>\n", " <th>embarked_C</th>\n", " <th>embarked_Q</th>\n", " <th>embarked_S</th>\n", " <th>embarked_nan</th>\n", " <th>title_Capt</th>\n", " <th>title_Col</th>\n", " <th>title_Countess</th>\n", " <th>title_Don</th>\n", " <th>title_Dr</th>\n", " <th>title_Jonkheer</th>\n", " <th>title_Lady</th>\n", " <th>title_Major</th>\n", " <th>title_Master</th>\n", " <th>title_Miss</th>\n", " <th>title_Mlle</th>\n", " <th>title_Mme</th>\n", " <th>title_Mr</th>\n", " <th>title_Mrs</th>\n", " <th>title_Ms</th>\n", " <th>title_Rev</th>\n", " <th>title_Sir</th>\n", " <th>deck_A</th>\n", " <th>deck_B</th>\n", " <th>deck_C</th>\n", " <th>deck_D</th>\n", " <th>deck_E</th>\n", " <th>deck_F</th>\n", " <th>deck_G</th>\n", " <th>deck_T</th>\n", " <th>deck_U</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>892</td>\n", " <td>3</td>\n", " <td>34.5</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>2.178064</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>893</td>\n", " <td>3</td>\n", " <td>47.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " 
<td>2.079442</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>894</td>\n", " <td>2</td>\n", " <td>62.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>2.369075</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " passengerid pclass age sibsp parch fare_log has_cabin family_size \\\n", "0 892 3 34.5 0 0 2.178064 0 1 \n", "1 893 3 47.0 1 0 2.079442 0 2 \n", "2 894 2 62.0 0 0 2.369075 0 1 \n", "\n", " is_alone name_len_cat age_cat fare_log_cat sex_female sex_male \\\n", "0 1 0 2 0 0.0 1.0 \n", "1 0 2 3 0 1.0 0.0 \n", "2 1 1 5 0 0.0 1.0 \n", 
"\n", " embarked_C embarked_Q embarked_S embarked_nan title_Capt title_Col \\\n", "0 0.0 1.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 1.0 0.0 0.0 0.0 \n", "2 0.0 1.0 0.0 0.0 0.0 0.0 \n", "\n", " title_Countess title_Don title_Dr title_Jonkheer title_Lady \\\n", "0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 \n", "\n", " title_Major title_Master title_Miss title_Mlle title_Mme title_Mr \\\n", "0 0.0 0.0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 1.0 \n", "\n", " title_Mrs title_Ms title_Rev title_Sir deck_A deck_B deck_C deck_D \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " deck_E deck_F deck_G deck_T deck_U \n", "0 0.0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 0.0 1.0 \n", "2 0.0 0.0 0.0 0.0 1.0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For standard transfromations, you could also use the `pipeline` modules from `sklearn`\n", "- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train a simple model" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/xiangshiyin/Documents/Teaching/machine-learning-for-actuarial-science/.venv/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:465: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. 
OF ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " n_iter_i = _check_optimize_result(\n" ] } ], "source": [ "# train a simple logistic regression model to predict the survival label\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "lr = LogisticRegression()\n", "lr.fit(\n", " train.drop(columns=['survived', 'passengerid']), # everything except the survival label\n", " train['survived'] # the survival label\n", ")\n", "\n", "pred = lr.predict(test.drop(columns=['passengerid']))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "numpy.ndarray" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(pred)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(418,)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred.shape" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" 
class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>passengerid</th>\n", " <th>survived</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>892</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>893</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>894</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " passengerid survived\n", "0 892 0\n", "1 893 1\n", "2 894 0" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_submission = pd.concat([test['passengerid'], pd.DataFrame(pred, columns=['survived'])], axis=1)\n", "df_submission.head(3)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "df_submission.to_csv('../data/titanic/submission.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate the prediction results\n", "\n", "https://www.kaggle.com/competitions/titanic/overview/evaluation\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AutoML exploration" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "X_train = train.drop(columns=['survived', 'passengerid'])\n", "y_train = train['survived']" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: 
right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>pclass</th>\n", " <th>age</th>\n", " <th>sibsp</th>\n", " <th>parch</th>\n", " <th>fare_log</th>\n", " <th>has_cabin</th>\n", " <th>family_size</th>\n", " <th>is_alone</th>\n", " <th>name_len_cat</th>\n", " <th>age_cat</th>\n", " <th>fare_log_cat</th>\n", " <th>sex_female</th>\n", " <th>sex_male</th>\n", " <th>embarked_C</th>\n", " <th>embarked_Q</th>\n", " <th>embarked_S</th>\n", " <th>embarked_nan</th>\n", " <th>title_Capt</th>\n", " <th>title_Col</th>\n", " <th>title_Countess</th>\n", " <th>title_Don</th>\n", " <th>title_Dr</th>\n", " <th>title_Jonkheer</th>\n", " <th>title_Lady</th>\n", " <th>title_Major</th>\n", " <th>title_Master</th>\n", " <th>title_Miss</th>\n", " <th>title_Mlle</th>\n", " <th>title_Mme</th>\n", " <th>title_Mr</th>\n", " <th>title_Mrs</th>\n", " <th>title_Ms</th>\n", " <th>title_Rev</th>\n", " <th>title_Sir</th>\n", " <th>deck_A</th>\n", " <th>deck_B</th>\n", " <th>deck_C</th>\n", " <th>deck_D</th>\n", " <th>deck_E</th>\n", " <th>deck_F</th>\n", " <th>deck_G</th>\n", " <th>deck_T</th>\n", " <th>deck_U</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>3</td>\n", " <td>22.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>2.110213</td>\n", " <td>0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " 
<td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>38.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>4.280593</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>3</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " pclass age sibsp parch fare_log has_cabin family_size is_alone \\\n", "0 3 22.0 1 0 2.110213 0 2 0 \n", "1 1 38.0 1 0 4.280593 1 2 0 \n", "\n", " name_len_cat age_cat fare_log_cat sex_female sex_male embarked_C \\\n", "0 0 1 0 0.0 1.0 0.0 \n", "1 3 2 3 1.0 0.0 1.0 \n", "\n", " embarked_Q embarked_S embarked_nan title_Capt title_Col \\\n", "0 0.0 1.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "\n", " title_Countess title_Don title_Dr title_Jonkheer title_Lady \\\n", "0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "\n", " title_Major title_Master title_Miss title_Mlle title_Mme title_Mr \\\n", "0 0.0 0.0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " title_Mrs title_Ms title_Rev title_Sir deck_A deck_B deck_C deck_D \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 \n", "\n", " deck_E deck_F deck_G deck_T deck_U \n", "0 0.0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 0.0 0.0 " ] }, "execution_count": 29, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "X_train.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**FLAML** - https://github.com/microsoft/FLAML/tree/main\n", "- `pip install flaml[automl]`\n", "- [Documentation](https://microsoft.github.io/FLAML/docs/Getting-Started)\n", "- Best practices [[link](https://learn.microsoft.com/en-us/fabric/data-science/automated-machine-learning-fabric#automl-workflow)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[flaml.automl.logger: 02-17 20:34:43] {1728} INFO - task = classification\n", "[flaml.automl.logger: 02-17 20:34:43] {1739} INFO - Evaluation method: cv\n", "[flaml.automl.logger: 02-17 20:34:43] {1838} INFO - Minimizing error metric: 1-accuracy\n", "[flaml.automl.logger: 02-17 20:34:43] {1955} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd', 'lrl1']\n", "[flaml.automl.logger: 02-17 20:34:43] {2258} INFO - iteration 0, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:43] {2393} INFO - Estimated sufficient time budget=1181s. 
Estimated necessary time budget=27s.\n", "[flaml.automl.logger: 02-17 20:34:43] {2442} INFO - at 0.2s,\testimator lgbm's best error=0.2189,\tbest estimator lgbm's best error=0.2189\n", "[flaml.automl.logger: 02-17 20:34:43] {2258} INFO - iteration 1, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:43] {2442} INFO - at 0.3s,\testimator lgbm's best error=0.2189,\tbest estimator lgbm's best error=0.2189\n", "[flaml.automl.logger: 02-17 20:34:43] {2258} INFO - iteration 2, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:43] {2442} INFO - at 0.3s,\testimator lgbm's best error=0.1728,\tbest estimator lgbm's best error=0.1728\n", "[flaml.automl.logger: 02-17 20:34:43] {2258} INFO - iteration 3, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:43] {2442} INFO - at 0.5s,\testimator sgd's best error=0.3704,\tbest estimator lgbm's best error=0.1728\n", "[flaml.automl.logger: 02-17 20:34:43] {2258} INFO - iteration 4, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:44] {2442} INFO - at 0.6s,\testimator lgbm's best error=0.1683,\tbest estimator lgbm's best error=0.1683\n", "[flaml.automl.logger: 02-17 20:34:44] {2258} INFO - iteration 5, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:44] {2442} INFO - at 0.9s,\testimator lgbm's best error=0.1683,\tbest estimator lgbm's best error=0.1683\n", "[flaml.automl.logger: 02-17 20:34:44] {2258} INFO - iteration 6, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:44] {2442} INFO - at 1.0s,\testimator lgbm's best error=0.1683,\tbest estimator lgbm's best error=0.1683\n", "[flaml.automl.logger: 02-17 20:34:44] {2258} INFO - iteration 7, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:44] {2442} INFO - at 1.1s,\testimator lgbm's best error=0.1650,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:44] {2258} INFO - iteration 8, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:44] {2442} INFO - at 1.1s,\testimator lgbm's 
best error=0.1650,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:44] {2258} INFO - iteration 9, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:44] {2442} INFO - at 1.3s,\testimator sgd's best error=0.3704,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:44] {2258} INFO - iteration 10, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:44] {2442} INFO - at 1.5s,\testimator xgboost's best error=0.2189,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:44] {2258} INFO - iteration 11, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:45] {2442} INFO - at 1.7s,\testimator xgboost's best error=0.2189,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:45] {2258} INFO - iteration 12, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:45] {2442} INFO - at 1.9s,\testimator xgboost's best error=0.1762,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:45] {2258} INFO - iteration 13, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:34:45] {2442} INFO - at 2.2s,\testimator extra_tree's best error=0.2043,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:45] {2258} INFO - iteration 14, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:45] {2442} INFO - at 2.4s,\testimator rf's best error=0.2177,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:45] {2258} INFO - iteration 15, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:46] {2442} INFO - at 2.8s,\testimator rf's best error=0.2054,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:46] {2258} INFO - iteration 16, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:34:46] {2442} INFO - at 3.0s,\testimator extra_tree's best error=0.2043,\tbest estimator lgbm's best error=0.1650\n", 
"[flaml.automl.logger: 02-17 20:34:46] {2258} INFO - iteration 17, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:46] {2442} INFO - at 3.3s,\testimator rf's best error=0.2054,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:46] {2258} INFO - iteration 18, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:47] {2442} INFO - at 3.9s,\testimator sgd's best error=0.3199,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:47] {2258} INFO - iteration 19, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:47] {2442} INFO - at 4.0s,\testimator lgbm's best error=0.1650,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:47] {2258} INFO - iteration 20, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:34:47] {2442} INFO - at 4.3s,\testimator extra_tree's best error=0.2020,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:47] {2258} INFO - iteration 21, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:47] {2442} INFO - at 4.4s,\testimator lgbm's best error=0.1650,\tbest estimator lgbm's best error=0.1650\n", "[flaml.automl.logger: 02-17 20:34:47] {2258} INFO - iteration 22, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:48] {2442} INFO - at 4.8s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:48] {2258} INFO - iteration 23, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:48] {2442} INFO - at 5.0s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:48] {2258} INFO - iteration 24, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:48] {2442} INFO - at 5.3s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:48] {2258} INFO - iteration 25, current 
learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:49] {2442} INFO - at 5.6s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:49] {2258} INFO - iteration 26, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:49] {2442} INFO - at 6.0s,\testimator sgd's best error=0.3199,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:49] {2258} INFO - iteration 27, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:49] {2442} INFO - at 6.2s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:49] {2258} INFO - iteration 28, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:50] {2442} INFO - at 6.6s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:50] {2258} INFO - iteration 29, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:34:50] {2442} INFO - at 6.9s,\testimator extra_tree's best error=0.2020,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:50] {2258} INFO - iteration 30, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:50] {2442} INFO - at 7.0s,\testimator sgd's best error=0.3199,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:50] {2258} INFO - iteration 31, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:50] {2442} INFO - at 7.1s,\testimator lgbm's best error=0.1650,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:50] {2258} INFO - iteration 32, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:50] {2442} INFO - at 7.2s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:50] {2258} INFO - iteration 33, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:50] 
{2442} INFO - at 7.3s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:50] {2258} INFO - iteration 34, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:50] {2442} INFO - at 7.6s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:50] {2258} INFO - iteration 35, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:51] {2442} INFO - at 7.7s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:51] {2258} INFO - iteration 36, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:51] {2442} INFO - at 7.9s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:51] {2258} INFO - iteration 37, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:51] {2442} INFO - at 8.0s,\testimator sgd's best error=0.3199,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:51] {2258} INFO - iteration 38, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:51] {2442} INFO - at 8.1s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:51] {2258} INFO - iteration 39, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:51] {2442} INFO - at 8.4s,\testimator rf's best error=0.1975,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:51] {2258} INFO - iteration 40, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:52] {2442} INFO - at 8.6s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:52] {2258} INFO - iteration 41, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:52] {2442} INFO - at 9.0s,\testimator xgboost's best error=0.1582,\tbest estimator 
xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:52] {2258} INFO - iteration 42, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:52] {2442} INFO - at 9.4s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:52] {2258} INFO - iteration 43, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:52] {2442} INFO - at 9.5s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:52] {2258} INFO - iteration 44, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:53] {2442} INFO - at 9.9s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:53] {2258} INFO - iteration 45, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:53] {2442} INFO - at 10.1s,\testimator sgd's best error=0.2222,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:53] {2258} INFO - iteration 46, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:53] {2442} INFO - at 10.2s,\testimator sgd's best error=0.2222,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:53] {2258} INFO - iteration 47, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:53] {2442} INFO - at 10.4s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:53] {2258} INFO - iteration 48, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:53] {2442} INFO - at 10.5s,\testimator sgd's best error=0.2222,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:53] {2258} INFO - iteration 49, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:54] {2442} INFO - at 10.7s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:54] 
{2258} INFO - iteration 50, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:54] {2442} INFO - at 11.0s,\testimator rf's best error=0.1852,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:54] {2258} INFO - iteration 51, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:54] {2442} INFO - at 11.2s,\testimator sgd's best error=0.2222,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:54] {2258} INFO - iteration 52, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:54] {2442} INFO - at 11.4s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:54] {2258} INFO - iteration 53, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:55] {2442} INFO - at 11.7s,\testimator rf's best error=0.1762,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:55] {2258} INFO - iteration 54, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:55] {2442} INFO - at 11.8s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:55] {2258} INFO - iteration 55, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:55] {2442} INFO - at 11.9s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:55] {2258} INFO - iteration 56, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:55] {2442} INFO - at 12.2s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:55] {2258} INFO - iteration 57, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:55] {2442} INFO - at 12.5s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:55] {2258} INFO - iteration 58, current learner rf\n", "[flaml.automl.logger: 02-17 
20:34:56] {2442} INFO - at 12.7s,\testimator rf's best error=0.1762,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:56] {2258} INFO - iteration 59, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:56] {2442} INFO - at 13.1s,\testimator rf's best error=0.1684,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:56] {2258} INFO - iteration 60, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:56] {2442} INFO - at 13.2s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:56] {2258} INFO - iteration 61, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:56] {2442} INFO - at 13.5s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:56] {2258} INFO - iteration 62, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:57] {2442} INFO - at 13.6s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:57] {2258} INFO - iteration 63, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:57] {2442} INFO - at 13.9s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:57] {2258} INFO - iteration 64, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:57] {2442} INFO - at 14.1s,\testimator sgd's best error=0.2222,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:57] {2258} INFO - iteration 65, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:57] {2442} INFO - at 14.3s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:57] {2258} INFO - iteration 66, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:57] {2442} INFO - at 14.6s,\testimator rf's best error=0.1684,\tbest 
estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:57] {2258} INFO - iteration 67, current learner sgd\n", "[flaml.automl.logger: 02-17 20:34:58] {2442} INFO - at 14.9s,\testimator sgd's best error=0.2222,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:58] {2258} INFO - iteration 68, current learner rf\n", "[flaml.automl.logger: 02-17 20:34:58] {2442} INFO - at 15.3s,\testimator rf's best error=0.1684,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:58] {2258} INFO - iteration 69, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:34:58] {2442} INFO - at 15.4s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:58] {2258} INFO - iteration 70, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:59] {2442} INFO - at 15.6s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:59] {2258} INFO - iteration 71, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:59] {2442} INFO - at 15.9s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:59] {2258} INFO - iteration 72, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:34:59] {2442} INFO - at 16.2s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:34:59] {2258} INFO - iteration 73, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:00] {2442} INFO - at 16.7s,\testimator sgd's best error=0.2222,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:00] {2258} INFO - iteration 74, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:00] {2442} INFO - at 17.1s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", 
"[flaml.automl.logger: 02-17 20:35:00] {2258} INFO - iteration 75, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:00] {2442} INFO - at 17.2s,\testimator sgd's best error=0.2144,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:00] {2258} INFO - iteration 76, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:00] {2442} INFO - at 17.3s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:00] {2258} INFO - iteration 77, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:00] {2442} INFO - at 17.4s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:00] {2258} INFO - iteration 78, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:01] {2442} INFO - at 17.7s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:01] {2258} INFO - iteration 79, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:01] {2442} INFO - at 17.9s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:01] {2258} INFO - iteration 80, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:01] {2442} INFO - at 18.1s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:01] {2258} INFO - iteration 81, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:01] {2442} INFO - at 18.5s,\testimator extra_tree's best error=0.2020,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:01] {2258} INFO - iteration 82, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:02] {2442} INFO - at 18.7s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:02] {2258} INFO - iteration 
83, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:02] {2442} INFO - at 19.1s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:02] {2258} INFO - iteration 84, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:02] {2442} INFO - at 19.4s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:02] {2258} INFO - iteration 85, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:03] {2442} INFO - at 19.7s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:03] {2258} INFO - iteration 86, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:03] {2442} INFO - at 20.1s,\testimator rf's best error=0.1684,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:03] {2258} INFO - iteration 87, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:03] {2442} INFO - at 20.4s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:03] {2258} INFO - iteration 88, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:03] {2442} INFO - at 20.5s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:03] {2258} INFO - iteration 89, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:03] {2442} INFO - at 20.5s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:03] {2258} INFO - iteration 90, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:04] {2442} INFO - at 20.8s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:04] {2258} INFO - iteration 91, current learner rf\n", "[flaml.automl.logger: 
02-17 20:35:04] {2442} INFO - at 21.2s,\testimator rf's best error=0.1684,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:04] {2258} INFO - iteration 92, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:04] {2442} INFO - at 21.5s,\testimator sgd's best error=0.2144,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:04] {2258} INFO - iteration 93, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:05] {2442} INFO - at 21.9s,\testimator rf's best error=0.1684,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:05] {2258} INFO - iteration 94, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:05] {2442} INFO - at 22.2s,\testimator rf's best error=0.1684,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:05] {2258} INFO - iteration 95, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:05] {2442} INFO - at 22.3s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:05] {2258} INFO - iteration 96, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:05] {2442} INFO - at 22.4s,\testimator sgd's best error=0.2144,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:05] {2258} INFO - iteration 97, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:06] {2442} INFO - at 22.8s,\testimator rf's best error=0.1684,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:06] {2258} INFO - iteration 98, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:06] {2442} INFO - at 23.2s,\testimator rf's best error=0.1661,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:06] {2258} INFO - iteration 99, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:06] {2442} INFO - at 23.3s,\testimator lgbm's best error=0.1583,\tbest estimator 
xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:06] {2258} INFO - iteration 100, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:06] {2442} INFO - at 23.4s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:06] {2258} INFO - iteration 101, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:07] {2442} INFO - at 23.7s,\testimator extra_tree's best error=0.2020,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:07] {2258} INFO - iteration 102, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:07] {2442} INFO - at 24.0s,\testimator rf's best error=0.1661,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:07] {2258} INFO - iteration 103, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:07] {2442} INFO - at 24.2s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:07] {2258} INFO - iteration 104, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:07] {2442} INFO - at 24.5s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:07] {2258} INFO - iteration 105, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:08] {2442} INFO - at 24.7s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:08] {2258} INFO - iteration 106, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:08] {2442} INFO - at 24.9s,\testimator rf's best error=0.1661,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:08] {2258} INFO - iteration 107, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:08] {2442} INFO - at 25.3s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 
02-17 20:35:08] {2258} INFO - iteration 108, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:08] {2442} INFO - at 25.4s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:08] {2258} INFO - iteration 109, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:09] {2442} INFO - at 25.7s,\testimator extra_tree's best error=0.2020,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:09] {2258} INFO - iteration 110, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:09] {2442} INFO - at 26.0s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:09] {2258} INFO - iteration 111, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:09] {2442} INFO - at 26.1s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:09] {2258} INFO - iteration 112, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:09] {2442} INFO - at 26.3s,\testimator sgd's best error=0.2144,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:09] {2258} INFO - iteration 113, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:10] {2442} INFO - at 26.7s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:10] {2258} INFO - iteration 114, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:10] {2442} INFO - at 27.0s,\testimator sgd's best error=0.2144,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:10] {2258} INFO - iteration 115, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:10] {2442} INFO - at 27.2s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:10] {2258} INFO - iteration 116, 
current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:10] {2442} INFO - at 27.4s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:10] {2258} INFO - iteration 117, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:11] {2442} INFO - at 27.7s,\testimator extra_tree's best error=0.1964,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:11] {2258} INFO - iteration 118, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:11] {2442} INFO - at 28.0s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:11] {2258} INFO - iteration 119, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:11] {2442} INFO - at 28.1s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:11] {2258} INFO - iteration 120, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:11] {2442} INFO - at 28.4s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:11] {2258} INFO - iteration 121, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:12] {2442} INFO - at 28.8s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:12] {2258} INFO - iteration 122, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:12] {2442} INFO - at 29.1s,\testimator extra_tree's best error=0.1964,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:12] {2258} INFO - iteration 123, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:12] {2442} INFO - at 29.5s,\testimator xgboost's best error=0.1582,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:12] {2258} INFO - iteration 124, current learner 
lgbm\n", "[flaml.automl.logger: 02-17 20:35:13] {2442} INFO - at 29.6s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:13] {2258} INFO - iteration 125, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:13] {2442} INFO - at 29.8s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:13] {2258} INFO - iteration 126, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:13] {2442} INFO - at 29.9s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:13] {2258} INFO - iteration 127, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:13] {2442} INFO - at 30.0s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:13] {2258} INFO - iteration 128, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:13] {2442} INFO - at 30.4s,\testimator lgbm's best error=0.1583,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:13] {2258} INFO - iteration 129, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:14] {2442} INFO - at 30.7s,\testimator extra_tree's best error=0.1964,\tbest estimator xgboost's best error=0.1582\n", "[flaml.automl.logger: 02-17 20:35:14] {2258} INFO - iteration 130, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:14] {2442} INFO - at 31.0s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:14] {2258} INFO - iteration 131, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:14] {2442} INFO - at 31.3s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:14] {2258} INFO - iteration 132, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:15] {2442} INFO - at 
31.7s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:15] {2258} INFO - iteration 133, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:15] {2442} INFO - at 32.2s,\testimator lgbm's best error=0.1583,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:15] {2258} INFO - iteration 134, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:15] {2442} INFO - at 32.5s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:15] {2258} INFO - iteration 135, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:16] {2442} INFO - at 32.9s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:16] {2258} INFO - iteration 136, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:16] {2442} INFO - at 33.1s,\testimator extra_tree's best error=0.1841,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:16] {2258} INFO - iteration 137, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:16] {2442} INFO - at 33.4s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:16] {2258} INFO - iteration 138, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:16] {2442} INFO - at 33.6s,\testimator lgbm's best error=0.1583,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:16] {2258} INFO - iteration 139, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:17] {2442} INFO - at 33.9s,\testimator extra_tree's best error=0.1841,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:17] {2258} INFO - iteration 140, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:17] {2442} INFO - at 34.3s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", 
"[flaml.automl.logger: 02-17 20:35:17] {2258} INFO - iteration 141, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:18] {2442} INFO - at 34.6s,\testimator extra_tree's best error=0.1807,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:18] {2258} INFO - iteration 142, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:18] {2442} INFO - at 35.0s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:18] {2258} INFO - iteration 143, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:18] {2442} INFO - at 35.2s,\testimator extra_tree's best error=0.1807,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:18] {2258} INFO - iteration 144, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:18] {2442} INFO - at 35.5s,\testimator xgboost's best error=0.1582,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:18] {2258} INFO - iteration 145, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:19] {2442} INFO - at 35.9s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:19] {2258} INFO - iteration 146, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:19] {2442} INFO - at 36.0s,\testimator lgbm's best error=0.1583,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:19] {2258} INFO - iteration 147, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:19] {2442} INFO - at 36.2s,\testimator lgbm's best error=0.1583,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:19] {2258} INFO - iteration 148, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:19] {2442} INFO - at 36.4s,\testimator lgbm's best error=0.1583,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:19] {2258} INFO - iteration 149, current learner 
lgbm\n", "[flaml.automl.logger: 02-17 20:35:19] {2442} INFO - at 36.5s,\testimator lgbm's best error=0.1583,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:19] {2258} INFO - iteration 150, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:20] {2442} INFO - at 36.6s,\testimator lgbm's best error=0.1583,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:20] {2258} INFO - iteration 151, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:20] {2442} INFO - at 37.1s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:20] {2258} INFO - iteration 152, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:20] {2442} INFO - at 37.4s,\testimator xgboost's best error=0.1582,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:20] {2258} INFO - iteration 153, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:21] {2442} INFO - at 37.7s,\testimator rf's best error=0.1560,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:21] {2258} INFO - iteration 154, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:21] {2442} INFO - at 37.9s,\testimator lgbm's best error=0.1583,\tbest estimator rf's best error=0.1560\n", "[flaml.automl.logger: 02-17 20:35:21] {2258} INFO - iteration 155, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:21] {2442} INFO - at 38.1s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:21] {2258} INFO - iteration 156, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:21] {2442} INFO - at 38.2s,\testimator sgd's best error=0.2144,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:21] {2258} INFO - iteration 157, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:22] {2442} INFO - at 38.6s,\testimator rf's best error=0.1560,\tbest 
estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:22] {2258} INFO - iteration 158, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:22] {2442} INFO - at 38.8s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:22] {2258} INFO - iteration 159, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:22] {2442} INFO - at 39.1s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:22] {2258} INFO - iteration 160, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:23] {2442} INFO - at 39.6s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:23] {2258} INFO - iteration 161, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:23] {2442} INFO - at 39.7s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:23] {2258} INFO - iteration 162, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:23] {2442} INFO - at 40.1s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:23] {2258} INFO - iteration 163, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:23] {2442} INFO - at 40.2s,\testimator sgd's best error=0.2144,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:23] {2258} INFO - iteration 164, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:23] {2442} INFO - at 40.6s,\testimator extra_tree's best error=0.1762,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:23] {2258} INFO - iteration 165, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:24] {2442} INFO - at 40.9s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:24] {2258} INFO - 
iteration 166, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:24] {2442} INFO - at 41.4s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:24] {2258} INFO - iteration 167, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:25] {2442} INFO - at 41.6s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:25] {2258} INFO - iteration 168, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:25] {2442} INFO - at 42.0s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:25] {2258} INFO - iteration 169, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:25] {2442} INFO - at 42.3s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:25] {2258} INFO - iteration 170, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:25] {2442} INFO - at 42.5s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:25] {2258} INFO - iteration 171, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:26] {2442} INFO - at 42.6s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:26] {2258} INFO - iteration 172, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:26] {2442} INFO - at 43.0s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:26] {2258} INFO - iteration 173, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:26] {2442} INFO - at 43.2s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:26] {2258} INFO - iteration 174, current learner xgboost\n", 
"[flaml.automl.logger: 02-17 20:35:27] {2442} INFO - at 43.7s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:27] {2258} INFO - iteration 175, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:27] {2442} INFO - at 44.1s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:27] {2258} INFO - iteration 176, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:27] {2442} INFO - at 44.2s,\testimator sgd's best error=0.2144,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:27] {2258} INFO - iteration 177, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:27] {2442} INFO - at 44.6s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:27] {2258} INFO - iteration 178, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:28] {2442} INFO - at 44.8s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:28] {2258} INFO - iteration 179, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:28] {2442} INFO - at 45.2s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:28] {2258} INFO - iteration 180, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:28] {2442} INFO - at 45.5s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:28] {2258} INFO - iteration 181, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:29] {2442} INFO - at 45.8s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:29] {2258} INFO - iteration 182, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:29] {2442} INFO - at 
46.1s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:29] {2258} INFO - iteration 183, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:29] {2442} INFO - at 46.3s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:29] {2258} INFO - iteration 184, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:30] {2442} INFO - at 46.6s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:30] {2258} INFO - iteration 185, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:30] {2442} INFO - at 47.0s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:30] {2258} INFO - iteration 186, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:30] {2442} INFO - at 47.1s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:30] {2258} INFO - iteration 187, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:30] {2442} INFO - at 47.5s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:30] {2258} INFO - iteration 188, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:31] {2442} INFO - at 47.6s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:31] {2258} INFO - iteration 189, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:31] {2442} INFO - at 47.7s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:31] {2258} INFO - iteration 190, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:31] {2442} INFO - at 48.0s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best 
error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:31] {2258} INFO - iteration 191, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:31] {2442} INFO - at 48.2s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:31] {2258} INFO - iteration 192, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:32] {2442} INFO - at 48.7s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:32] {2258} INFO - iteration 193, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:32] {2442} INFO - at 49.0s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:32] {2258} INFO - iteration 194, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:32] {2442} INFO - at 49.3s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:32] {2258} INFO - iteration 195, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:32] {2442} INFO - at 49.6s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:32] {2258} INFO - iteration 196, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:33] {2442} INFO - at 49.7s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:33] {2258} INFO - iteration 197, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:33] {2442} INFO - at 50.1s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:33] {2258} INFO - iteration 198, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:33] {2442} INFO - at 50.3s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:33] {2258} 
INFO - iteration 199, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:34] {2442} INFO - at 50.7s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:34] {2258} INFO - iteration 200, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:34] {2442} INFO - at 51.1s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:34] {2258} INFO - iteration 201, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:34] {2442} INFO - at 51.2s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:34] {2258} INFO - iteration 202, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:34] {2442} INFO - at 51.4s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:34] {2258} INFO - iteration 203, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:35] {2442} INFO - at 51.8s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:35] {2258} INFO - iteration 204, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:35] {2442} INFO - at 51.9s,\testimator sgd's best error=0.2144,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:35] {2258} INFO - iteration 205, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:35] {2442} INFO - at 52.2s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:35] {2258} INFO - iteration 206, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:35] {2442} INFO - at 52.6s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:35] {2258} INFO - iteration 207, current learner extra_tree\n", 
"[flaml.automl.logger: 02-17 20:35:36] {2442} INFO - at 52.9s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:36] {2258} INFO - iteration 208, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:36] {2442} INFO - at 53.2s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:36] {2258} INFO - iteration 209, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:36] {2442} INFO - at 53.6s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:36] {2258} INFO - iteration 210, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:37] {2442} INFO - at 54.0s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:37] {2258} INFO - iteration 211, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:37] {2442} INFO - at 54.3s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:37] {2258} INFO - iteration 212, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:38] {2442} INFO - at 54.6s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:38] {2258} INFO - iteration 213, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:38] {2442} INFO - at 54.9s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:38] {2258} INFO - iteration 214, current learner extra_tree\n", "[flaml.automl.logger: 02-17 20:35:38] {2442} INFO - at 55.3s,\testimator extra_tree's best error=0.1684,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:38] {2258} INFO - iteration 215, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:38] {2442} INFO - at 55.4s,\testimator 
lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:38] {2258} INFO - iteration 216, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:38] {2442} INFO - at 55.5s,\testimator sgd's best error=0.2144,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:38] {2258} INFO - iteration 217, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:39] {2442} INFO - at 55.7s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:39] {2258} INFO - iteration 218, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:39] {2442} INFO - at 56.0s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:39] {2258} INFO - iteration 219, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:39] {2442} INFO - at 56.1s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:39] {2258} INFO - iteration 220, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:39] {2442} INFO - at 56.5s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:39] {2258} INFO - iteration 221, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:40] {2442} INFO - at 56.7s,\testimator lgbm's best error=0.1526,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:40] {2258} INFO - iteration 222, current learner sgd\n", "[flaml.automl.logger: 02-17 20:35:40] {2442} INFO - at 56.8s,\testimator sgd's best error=0.2144,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 02-17 20:35:40] {2258} INFO - iteration 223, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:40] {2442} INFO - at 57.1s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1526\n", "[flaml.automl.logger: 
02-17 20:35:40] {2258} INFO - iteration 224, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:40] {2442} INFO - at 57.5s,\testimator lgbm's best error=0.1515,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:40] {2258} INFO - iteration 225, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:41] {2442} INFO - at 57.6s,\testimator lgbm's best error=0.1515,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:41] {2258} INFO - iteration 226, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:41] {2442} INFO - at 57.8s,\testimator lgbm's best error=0.1515,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:41] {2258} INFO - iteration 227, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:41] {2442} INFO - at 57.9s,\testimator lgbm's best error=0.1515,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:41] {2258} INFO - iteration 228, current learner rf\n", "[flaml.automl.logger: 02-17 20:35:41] {2442} INFO - at 58.2s,\testimator rf's best error=0.1560,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:41] {2258} INFO - iteration 229, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:41] {2442} INFO - at 58.5s,\testimator lgbm's best error=0.1515,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:41] {2258} INFO - iteration 230, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:42] {2442} INFO - at 58.9s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:42] {2258} INFO - iteration 231, current learner lgbm\n", "[flaml.automl.logger: 02-17 20:35:42] {2442} INFO - at 59.1s,\testimator lgbm's best error=0.1515,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:42] {2258} INFO - iteration 232, current learner xgboost\n", 
"[flaml.automl.logger: 02-17 20:35:42] {2442} INFO - at 59.5s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:42] {2258} INFO - iteration 233, current learner xgboost\n", "[flaml.automl.logger: 02-17 20:35:43] {2442} INFO - at 59.9s,\testimator xgboost's best error=0.1582,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:43] {2258} INFO - iteration 234, current learner xgb_limitdepth\n", "[flaml.automl.logger: 02-17 20:35:43] {2442} INFO - at 60.2s,\testimator xgb_limitdepth's best error=0.2099,\tbest estimator lgbm's best error=0.1515\n", "[flaml.automl.logger: 02-17 20:35:43] {2685} INFO - retrain lgbm for 0.1s\n", "[flaml.automl.logger: 02-17 20:35:43] {2688} INFO - retrained model: LGBMClassifier(colsample_bytree=np.float64(0.9263510142224147),\n", " learning_rate=np.float64(0.11073419548910175), max_bin=1023,\n", " min_child_samples=7, n_estimators=26, n_jobs=-1, num_leaves=35,\n", " reg_alpha=np.float64(0.8491669558916534),\n", " reg_lambda=np.float64(0.08792333670079354), verbose=-1)\n", "[flaml.automl.logger: 02-17 20:35:43] {1985} INFO - fit succeeded\n", "[flaml.automl.logger: 02-17 20:35:43] {1986} INFO - Time taken to find the best model: 57.45935106277466\n" ] } ], "source": [ "from flaml import AutoML\n", "\n", "automl = AutoML()\n", "settings = {\n", " \"time_budget\": 60, # total running time in seconds\n", " \"metric\": 'accuracy', \n", " # check the documentation for options of metrics (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)\n", " \"task\": 'classification', # task type\n", " \"log_file_name\": 'automl_experiment.log', # flaml log file\n", " \"seed\": 7654321, # random seed\n", " \"ensemble\": False,\n", "}\n", "automl.fit(X_train, y_train, **settings)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " 
.dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>passengerid</th>\n", " <th>survived</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>892</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>893</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>894</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " passengerid survived\n", "0 892 0\n", "1 893 0\n", "2 894 0" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test = test.drop(columns=['passengerid'])\n", "y_test_pred = automl.predict(X_test)\n", "df_submission = pd.concat([test['passengerid'], pd.DataFrame(y_test_pred, columns=['survived'])], axis=1)\n", "df_submission.head(3)\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "df_submission.to_csv('../data/titanic/submission.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'lgbm'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "automl.best_estimator" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'n_estimators': 26,\n", " 'num_leaves': 35,\n", " 'min_child_samples': 7,\n", " 'learning_rate': np.float64(0.11073419548910175),\n", " 'log_max_bin': 10,\n", " 'colsample_bytree': np.float64(0.9263510142224147),\n", " 'reg_alpha': np.float64(0.8491669558916534),\n", " 'reg_lambda': np.float64(0.08792333670079354)}" ] }, "execution_count": 34, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "automl.best_config" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 32, 262, 1, 4, 292, 16, 28, 0, 29, 1, 2, 17, 3,\n", " 14, 0, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 1, 0, 0, 26, 1, 0, 0, 0, 2, 0, 8, 3, 8,\n", " 0, 0, 0, 5], dtype=int32)" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "automl.model.estimator.feature_importances_" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Feature</th>\n", " <th>Importance</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>4</th>\n", " <td>fare_log</td>\n", " <td>292</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>age</td>\n", " <td>262</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>pclass</td>\n", " <td>32</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>name_len_cat</td>\n", " <td>29</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>family_size</td>\n", " <td>28</td>\n", " </tr>\n", " <tr>\n", " <th>29</th>\n", " <td>title_Mr</td>\n", " <td>26</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>embarked_S</td>\n", " <td>24</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>sex_female</td>\n", " <td>17</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>has_cabin</td>\n", " <td>16</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>embarked_C</td>\n", " <td>14</td>\n", " </tr>\n", " <tr>\n", " <th>36</th>\n", " <td>deck_C</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " 
<th>38</th>\n", " <td>deck_E</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>42</th>\n", " <td>deck_U</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>parch</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>sex_male</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>37</th>\n", " <td>deck_D</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>34</th>\n", " <td>deck_A</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>fare_log_cat</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>sibsp</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>26</th>\n", " <td>title_Miss</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>age_cat</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>30</th>\n", " <td>title_Mrs</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>is_alone</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>title_Col</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>embarked_Q</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>title_Capt</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>embarked_nan</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>25</th>\n", " <td>title_Master</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>24</th>\n", " <td>title_Major</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>23</th>\n", " <td>title_Lady</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>22</th>\n", " <td>title_Jonkheer</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>20</th>\n", " <td>title_Don</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>21</th>\n", " <td>title_Dr</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>title_Countess</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>28</th>\n", " <td>title_Mme</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " 
<th>33</th>\n", " <td>title_Sir</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>32</th>\n", " <td>title_Rev</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>31</th>\n", " <td>title_Ms</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>27</th>\n", " <td>title_Mlle</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>35</th>\n", " <td>deck_B</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>39</th>\n", " <td>deck_F</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>40</th>\n", " <td>deck_G</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>41</th>\n", " <td>deck_T</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Feature Importance\n", "4 fare_log 292\n", "1 age 262\n", "0 pclass 32\n", "8 name_len_cat 29\n", "6 family_size 28\n", "29 title_Mr 26\n", "15 embarked_S 24\n", "11 sex_female 17\n", "5 has_cabin 16\n", "13 embarked_C 14\n", "36 deck_C 8\n", "38 deck_E 8\n", "42 deck_U 5\n", "3 parch 4\n", "12 sex_male 3\n", "37 deck_D 3\n", "34 deck_A 2\n", "10 fare_log_cat 2\n", "2 sibsp 1\n", "26 title_Miss 1\n", "9 age_cat 1\n", "30 title_Mrs 1\n", "7 is_alone 0\n", "18 title_Col 0\n", "14 embarked_Q 0\n", "17 title_Capt 0\n", "16 embarked_nan 0\n", "25 title_Master 0\n", "24 title_Major 0\n", "23 title_Lady 0\n", "22 title_Jonkheer 0\n", "20 title_Don 0\n", "21 title_Dr 0\n", "19 title_Countess 0\n", "28 title_Mme 0\n", "33 title_Sir 0\n", "32 title_Rev 0\n", "31 title_Ms 0\n", "27 title_Mlle 0\n", "35 deck_B 0\n", "39 deck_F 0\n", "40 deck_G 0\n", "41 deck_T 0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get feature importance values\n", "importances = automl.model.estimator.feature_importances_\n", "# Get feature names from the dataset\n", "feature_names = X_train.columns\n", "# Create a DataFrame for better visualization\n", "feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})\n", "# Sort by 
importance (highest first)\n", "feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)\n", "\n", "feature_importance_df" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 1000x600 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "# Plot feature importance\n", "plt.figure(figsize=(10, 6)) # Adjust figure size for better readability\n", "plt.bar(feature_importance_df['Feature'], feature_importance_df['Importance'], color='skyblue')\n", "\n", "# Rotate feature names for readability\n", "plt.xticks(rotation=45, ha='right') # Tilt labels 45 degrees and align to the right\n", "\n", "# Labels and title\n", "plt.xlabel('Feature Name')\n", "plt.ylabel('Importance Score')\n", "plt.title('Feature Importance')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Evaluations\n", "\n", "Model evaluations can be approached from multiple perspectives:\n", "\n", "- From the Perspective of Evaluation Metrics:\n", " - **Prediction Quality**: How well does the model predict or classify new data?\n", " - **Interpretability**: How easily can the model’s predictions be understood and explained?\n", "- From the Perspective of the ML Workflow:\n", " - **Offline Evaluations**: Assessment of model performance on the training and test datasets.\n", " - **Online Evaluations**: Evaluation of model performance using live, production data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction Quality\n", "\n", "### Classification Problems\n", "#### Accuracy\n", "$$\n", "\\mathrm{Accuracy} = \\frac{\\mathrm{TP} + \\mathrm{TN}}{\\mathrm{TP} + \\mathrm{FP} + \\mathrm{FN} + \\mathrm{TN}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 
79.22%\n" ] } ], "source": [ "# Hold-out (train/test split) classification accuracy\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.linear_model import LogisticRegression\n", "url = \"https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv\"\n", "names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']\n", "dataframe = pd.read_csv(url, names=names)\n", "array = dataframe.values\n", "X = array[:,0:8]\n", "Y = array[:,8]\n", "\n", "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)\n", "lr = LogisticRegression()\n", "lr.fit(X_train, Y_train)\n", "\n", "Y_test_pred = lr.predict(X_test)\n", "accuracy = accuracy_score(Y_test, Y_test_pred)\n", "print(\"Accuracy: %.2f%%\" % (accuracy * 100.0))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(0.7922077922077922)" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "# Manual check: the fraction of matching predictions equals accuracy_score\n", "np.sum(Y_test == Y_test_pred) / len(Y_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Precision and Recall\n", "$$\n", "\\mathrm{Precision} = \\frac{\\mathrm{TP}}{\\mathrm{TP} + \\mathrm{FP}}, \\qquad \\mathrm{Recall} = \\frac{\\mathrm{TP}}{\\mathrm{TP} + \\mathrm{FN}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "precision: 0.7906976744186046\n", "recall: 0.5964912280701754\n" ] } ], "source": [ "from sklearn.metrics import precision_score, recall_score\n", "\n", "precision = precision_score(Y_test, Y_test_pred)\n", "recall = recall_score(Y_test, Y_test_pred)\n", "\n", "print(f\"precision: {precision}\")\n", "print(f\"recall: {recall}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### F1 Score\n", "$$\n", "F_1 = \\frac{2 \\cdot \\mathrm{Precision} \\cdot \\mathrm{Recall}}{\\mathrm{Precision} + \\mathrm{Recall}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "f1 score: 
0.68\n" ] } ], "source": [ "from sklearn.metrics import f1_score\n", "# Store the result under a new name so the imported f1_score function is not overwritten\n", "f1_value = f1_score(Y_test, Y_test_pred)\n", "print(f\"f1 score: {f1_value}\")" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/78/njcscll93_s6cc27zw_h0pmr0000gn/T/ipykernel_50750/2920054557.py:7: RuntimeWarning: invalid value encountered in divide\n", " 2 * precision * recall / (precision + recall)\n" ] }, { "data": { "image/png": "", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "precision = np.linspace(0, 1, 1000)\n", "recalls = np.linspace(0, 1, 10)\n", "f1s = [\n", " 2 * precision * recall / (precision + recall)\n", " for recall in recalls\n", "]\n", "# F1 is undefined (0/0) when precision and recall are both 0, hence the RuntimeWarning\n", "for f1, recall in zip(f1s, recalls):\n", " plt.plot(precision, f1, label=f\"recall={recall:.2f}\")\n", "\n", "plt.xlabel(\"Precision\")\n", "plt.ylabel(\"F1 Score\")\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Logistic Loss (Log Loss)\n", "\n", "For a binary classification problem, the log loss is defined as:\n", "$$\n", "L(y, p) = - \\frac{1}{N} \\sum_{i=1}^N \\left[ y_i \\log(p_i) + (1 - y_i) \\log(1 - p_i) \\right]\n", "$$" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Calculated Log Loss Score: 0.495027137138449\n" ] } ], "source": [ "# Calculate the score from scratch\n", "import numpy as np\n", "Y_test_pred_prob = lr.predict_proba(X_test)[:, 1]\n", "log_loss_score = np.sum(-np.log(np.where(Y_test == 1, Y_test_pred_prob, 1 - Y_test_pred_prob))) / len(Y_test)\n", "print(f\"Calculated Log Loss Score: {log_loss_score}\")" ] }, { "cell_type": "code", "execution_count": 
46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Log Loss Score: 0.4950\n" ] } ], "source": [ "from sklearn.metrics import log_loss\n", "\n", "log_loss_score2 = log_loss(Y_test, Y_test_pred_prob)\n", "print(f\"Log Loss Score: {log_loss_score2:.4f}\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Area under ROC curve" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 800x600 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import roc_curve, auc\n", "\n", "fpr, tpr, _ = roc_curve(Y_test, Y_test_pred_prob)\n", "roc_auc = auc(fpr, tpr)\n", "\n", "# plot the ROC curve\n", "plt.figure(figsize=(8, 6))\n", "plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')\n", "plt.plot([0, 1], [0, 1], color='gray', linestyle='--') # Diagonal line for random guessing\n", "plt.xlim([0.0, 1.0])\n", "plt.ylim([0.0, 1.05])\n", "plt.xlabel('False Positive Rate')\n", "plt.ylabel('True Positive Rate')\n", "plt.title('Receiver Operating Characteristic (ROC) Curve')\n", "plt.legend(loc='lower right')\n", "plt.grid()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Confusion Matrix\n", "\n", "- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html\n", "- https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[88 9]\n", " [23 34]]\n" ] } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "matrix = confusion_matrix(Y_test, Y_test_pred)\n", "print(matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Classification Report\n", "- 
https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0.0 0.79 0.91 0.85 97\n", " 1.0 0.79 0.60 0.68 57\n", "\n", " accuracy 0.79 154\n", " macro avg 0.79 0.75 0.76 154\n", "weighted avg 0.79 0.79 0.78 154\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "\n", "report = classification_report(Y_test, Y_test_pred)\n", "print(report)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regression Problems" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Squared Error: 104.20\n" ] }, { "data": { "image/png": "", "text/plain": [ "<Figure size 640x480 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import make_regression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "\n", "# Generate a synthetic regression dataset\n", "X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)\n", "\n", "# Split data into training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "# Train a linear regression model\n", "model = LinearRegression()\n", "model.fit(X_train, y_train)\n", "\n", "# Make predictions\n", "y_pred = model.predict(X_test)\n", "\n", "# Calculate Mean Squared Error\n", "mse = mean_squared_error(y_test, y_pred)\n", "\n", "# Print the result\n", "print(f\"Mean Squared Error: {mse:.2f}\")\n", "\n", "# Plot the regression line\n", "plt.scatter(X_test, y_test, color='blue', label=\"Actual values\")\n", 
"plt.plot(X_test, y_pred, color='red', linewidth=2, label=\"Predicted values\")\n", "plt.xlabel(\"Feature\")\n", "plt.ylabel(\"Target\")\n", "plt.title(\"Linear Regression with MSE Calculation\")\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE: 10.207949183448665\n" ] } ], "source": [ "# Root Mean Squared Error: the square root of the MSE, in the same units as the target\n", "print(f\"RMSE: {np.sqrt(mse)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE: 8.416659922209051\n" ] } ], "source": [ "from sklearn.metrics import mean_absolute_error\n", "\n", "mae = mean_absolute_error(y_test, y_pred)\n", "print(f\"MAE: {mae}\")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R2 score: 0.9374151607623286\n" ] } ], "source": [ "from sklearn.metrics import r2_score\n", "\n", "r2 = r2_score(y_test, y_pred)\n", "print(f\"R2 score: {r2}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `k-fold` cross-validation" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fold 1: MSE = 104.20\n", "Fold 2: MSE = 66.52\n", "Fold 3: MSE = 63.17\n", "Fold 4: MSE = 69.47\n", "Fold 5: MSE = 105.17\n", "\n", "Overall Mean Squared Error across all folds: 81.71\n" ] }, { "data": { "image/png": "", "text/plain": [ "<Figure size 800x500 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import make_regression\n", "from sklearn.linear_model import LinearRegression\n", "from 
sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import KFold\n", "\n", "# Generate a synthetic regression dataset\n", "X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)\n", "\n", "# Define the number of folds\n", "kf = KFold(n_splits=5, shuffle=True, random_state=42)\n", "\n", "# Store MSE for each fold\n", "mse_values = []\n", "\n", "# Perform K-Fold Cross-Validation\n", "for fold, (train_index, test_index) in enumerate(kf.split(X), 1):\n", " X_train, X_test = X[train_index], X[test_index]\n", " y_train, y_test = y[train_index], y[test_index]\n", "\n", " # Train a linear regression model\n", " model = LinearRegression()\n", " model.fit(X_train, y_train)\n", "\n", " # Make predictions\n", " y_pred = model.predict(X_test)\n", "\n", " # Calculate Mean Squared Error\n", " mse = mean_squared_error(y_test, y_pred)\n", " mse_values.append(mse)\n", "\n", " print(f\"Fold {fold}: MSE = {mse:.2f}\")\n", "\n", "# Calculate overall mean MSE\n", "overall_mse = np.mean(mse_values)\n", "print(f\"\\nOverall Mean Squared Error across all folds: {overall_mse:.2f}\")\n", "\n", "# Plot MSE values for each fold\n", "plt.figure(figsize=(8, 5))\n", "plt.plot(range(1, len(mse_values) + 1), mse_values, marker='o', linestyle='-', color='b', label=\"MSE per fold\")\n", "plt.axhline(y=overall_mse, color='r', linestyle='--', label=f\"Overall Mean MSE = {overall_mse:.2f}\")\n", "plt.xlabel(\"Fold Number\")\n", "plt.ylabel(\"Mean Squared Error\")\n", "plt.title(\"K-Fold Cross-Validation MSE\")\n", "plt.legend()\n", "plt.grid()\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE: 81.709 (18.871)\n" ] } ], "source": [ "from sklearn import model_selection\n", "\n", "# Define the number of folds\n", "kf = KFold(n_splits=5, shuffle=True, random_state=42)\n", "\n", "LR = LinearRegression()\n", "k_fold_mse = 
model_selection.cross_val_score(\n", " LR, X, y, cv=kf, scoring='neg_mean_squared_error')\n", "print(\"MSE: %.3f (%.3f)\" % (-1 * k_fold_mse.mean(), k_fold_mse.std()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model interpretation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Forest model" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = train.drop(columns=['survived', 'passengerid'])\n", "y = train['survived']\n" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(891, 43)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 80.00%\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)\n", "rf = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=234)\n", "rf.fit(X_train, y_train)\n", "y_pred = rf.predict(X_test)\n", "\n", "accuracy = accuracy_score(y_test, y_pred)\n", "print(\"Accuracy: %.2f%%\" % (accuracy * 100.0))\n" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Feature</th>\n", " <th>Importance</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>12</th>\n", " 
<td>sex_male</td>\n", " <td>0.463926</td>\n", " </tr>\n", " <tr>\n", " <th>29</th>\n", " <td>title_Mr</td>\n", " <td>0.136208</td>\n", " </tr>\n", " <tr>\n", " <th>30</th>\n", " <td>title_Mrs</td>\n", " <td>0.133414</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>has_cabin</td>\n", " <td>0.078299</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>fare_log_cat</td>\n", " <td>0.057256</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>age</td>\n", " <td>0.022328</td>\n", " </tr>\n", " <tr>\n", " <th>42</th>\n", " <td>deck_U</td>\n", " <td>0.021360</td>\n", " </tr>\n", " <tr>\n", " <th>37</th>\n", " <td>deck_D</td>\n", " <td>0.019512</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>pclass</td>\n", " <td>0.015956</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>family_size</td>\n", " <td>0.013974</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>embarked_C</td>\n", " <td>0.013360</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>name_len_cat</td>\n", " <td>0.008225</td>\n", " </tr>\n", " <tr>\n", " <th>39</th>\n", " <td>deck_F</td>\n", " <td>0.004893</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>age_cat</td>\n", " <td>0.002641</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>is_alone</td>\n", " <td>0.002355</td>\n", " </tr>\n", " <tr>\n", " <th>24</th>\n", " <td>title_Major</td>\n", " <td>0.001872</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>embarked_Q</td>\n", " <td>0.001702</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>parch</td>\n", " <td>0.001431</td>\n", " </tr>\n", " <tr>\n", " <th>40</th>\n", " <td>deck_G</td>\n", " <td>0.001100</td>\n", " </tr>\n", " <tr>\n", " <th>38</th>\n", " <td>deck_E</td>\n", " <td>0.000188</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>sibsp</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>fare_log</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>title_Col</td>\n", " <td>0.000000</td>\n", " 
</tr>\n", " <tr>\n", " <th>15</th>\n", " <td>embarked_S</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>embarked_nan</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>title_Capt</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>sex_female</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>26</th>\n", " <td>title_Miss</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>25</th>\n", " <td>title_Master</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>23</th>\n", " <td>title_Lady</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>22</th>\n", " <td>title_Jonkheer</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>20</th>\n", " <td>title_Don</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>21</th>\n", " <td>title_Dr</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>title_Countess</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>28</th>\n", " <td>title_Mme</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>34</th>\n", " <td>deck_A</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>33</th>\n", " <td>title_Sir</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>32</th>\n", " <td>title_Rev</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>31</th>\n", " <td>title_Ms</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>27</th>\n", " <td>title_Mlle</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>36</th>\n", " <td>deck_C</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>35</th>\n", " <td>deck_B</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>41</th>\n", " <td>deck_T</td>\n", " <td>0.000000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Feature Importance\n", "12 sex_male 0.463926\n", "29 title_Mr 0.136208\n", "30 title_Mrs 0.133414\n", "5 has_cabin 
0.078299\n", "10 fare_log_cat 0.057256\n", "1 age 0.022328\n", "42 deck_U 0.021360\n", "37 deck_D 0.019512\n", "0 pclass 0.015956\n", "6 family_size 0.013974\n", "13 embarked_C 0.013360\n", "8 name_len_cat 0.008225\n", "39 deck_F 0.004893\n", "9 age_cat 0.002641\n", "7 is_alone 0.002355\n", "24 title_Major 0.001872\n", "14 embarked_Q 0.001702\n", "3 parch 0.001431\n", "40 deck_G 0.001100\n", "38 deck_E 0.000188\n", "2 sibsp 0.000000\n", "4 fare_log 0.000000\n", "18 title_Col 0.000000\n", "15 embarked_S 0.000000\n", "16 embarked_nan 0.000000\n", "17 title_Capt 0.000000\n", "11 sex_female 0.000000\n", "26 title_Miss 0.000000\n", "25 title_Master 0.000000\n", "23 title_Lady 0.000000\n", "22 title_Jonkheer 0.000000\n", "20 title_Don 0.000000\n", "21 title_Dr 0.000000\n", "19 title_Countess 0.000000\n", "28 title_Mme 0.000000\n", "34 deck_A 0.000000\n", "33 title_Sir 0.000000\n", "32 title_Rev 0.000000\n", "31 title_Ms 0.000000\n", "27 title_Mlle 0.000000\n", "36 deck_C 0.000000\n", "35 deck_B 0.000000\n", "41 deck_T 0.000000" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "importances = rf.feature_importances_\n", "df_importances = pd.DataFrame({\n", " 'Feature': X.columns,\n", " 'Importance': importances\n", "})\n", "df_importances = df_importances.sort_values(by='Importance', ascending=False)\n", "df_importances" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "X_test2 = test.drop(columns=['passengerid'])\n", "y_pred2 = automl.predict(X_test2)\n", "df_submission = pd.concat([test['passengerid'], pd.DataFrame(y_pred2, columns=['survived'])], axis=1)\n", "df_submission.to_csv('../data/titanic/submission.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SHAP" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ 
"/Users/xiangshiyin/Documents/Teaching/machine-learning-for-actuarial-science/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] }, { "data": { "image/png": "", "text/plain": [ "<Figure size 800x550 with 3 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import xgboost\n", "import shap\n", "\n", "# train an XGBoost model\n", "X, y = shap.datasets.california()\n", "model = xgboost.XGBRegressor().fit(X, y)\n", "\n", "# explain the model's predictions using SHAP\n", "# (same syntax works for LightGBM, CatBoost, scikit-learn, transformers, Spark, etc.)\n", "explainer = shap.Explainer(model)\n", "shap_values = explainer(X)\n", "\n", "# visualize the first prediction's explanation\n", "shap.plots.waterfall(shap_values[0])" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([4.526, 3.585, 3.521, 3.413, 3.422])" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y[:5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 800x950 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import shap\n", "\n", "explainer = shap.Explainer(rf)\n", "shap_values = explainer.shap_values(X_test)\n", "\n", "# Extract SHAP values for the positive class (class 1)\n", "shap_values_class1 = shap_values[:,:,1]\n", "\n", "# Visualize global feature importance for class 1\n", "shap.summary_plot(shap_values_class1, X_test, feature_names=feature_names, 
plot_type=\"bar\")" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 800x650 with 3 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "explainer = shap.Explainer(rf)\n", "shap_values = explainer.shap_values(X_test)\n", "\n", "shap_values_class1 = shap_values[:,:,1]\n", "explanation = shap.Explanation(\n", " values=shap_values_class1,\n", " base_values=explainer.expected_value[1], # Base value for class 1\n", " data=X_test.values, # Input data (as a NumPy array)\n", " feature_names=X_test.columns.tolist() # Feature names\n", ")\n", "shap.plots.waterfall(explanation[1])" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(0.16546512151307932)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred_prob = rf.predict_proba(X_test)\n", "y_pred_prob[1,1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 4 }