{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:36.055576Z", "start_time": "2020-10-27T23:50:35.882507Z" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# code for loading the format for the notebook\n", "import os\n", "\n", "# path : store the current path to convert back to it later\n", "path = os.getcwd()\n", "os.chdir(os.path.join('..', 'notebook_format'))\n", "\n", "from formats import load_style\n", "load_style(css_style='custom2.css', plot_style=False)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:37.965563Z", "start_time": "2020-10-27T23:50:36.958947Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ethen 2020-10-27 16:50:37 \n", "\n", "CPython 3.6.4\n", "IPython 7.15.0\n", "\n", "numpy 1.18.5\n", "pandas 1.0.5\n", "sklearn 0.23.1\n", "matplotlib 3.1.0\n", "xgboost 1.2.1\n", "lightgbm 3.0.0\n" ] } ], "source": [ "os.chdir(path)\n", "\n", "# 1. magic for inline plot\n", "# 2. magic to print version\n", "# 3. magic so that the notebook will reload external python modules\n", "# 4. 
magic to enable retina (high resolution) plots\n", "# https://gist.github.com/minrk/3301035\n", "%matplotlib inline\n", "%load_ext watermark\n", "%load_ext autoreload\n", "%autoreload 2\n", "%config InlineBackend.figure_format='retina'\n", "\n", "import os\n", "import re\n", "import time\n", "import requests\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from xgboost import XGBClassifier\n", "from lightgbm import LGBMClassifier\n", "from lightgbm import plot_importance\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder\n", "\n", "%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,matplotlib,xgboost,lightgbm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# LightGBM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Gradient boosting](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/trees/gbm/gbm.ipynb) is a machine learning technique that produces a prediction model in the form of an ensemble of weak classifiers, optimizing for a differentiable loss function. One of the most popular types of gradient boosting is gradient boosted trees, which internally are made up of an ensemble of weak [decision trees](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/trees/decision_tree.ipynb). There are two different ways to compute the trees: level-wise and leaf-wise as illustrated by the diagram below:\n", "\n", "\n", "\n", "\n", "\n", "> The level-wise strategy adds complexity extending the depth of the tree level by level. As a contrary, the leaf-wise strategy generates branches by optimizing a loss.\n", "\n", "The level-wise strategy grows the tree level by level. In this strategy, each node splits the data prioritizing the nodes closer to the tree root.
The leaf-wise strategy grows the tree by splitting the data at the nodes with the highest loss change. Level-wise growth is usually better for smaller datasets, where leaf-wise growth tends to overfit. Leaf-wise growth tends to [excel in larger datasets](http://researchcommons.waikato.ac.nz/handle/10289/2317) where it is considerably faster than level-wise growth.\n", "\n", "A key challenge in training boosted decision trees is the [computational cost of finding the best split](https://arxiv.org/abs/1706.08359) for each leaf. Conventional techniques find the [exact split](https://arxiv.org/abs/1603.02754) for each leaf, and require scanning through all the data in each iteration. A different approach [approximates the split](https://arxiv.org/abs/1611.01276) by building histograms of the features. That way, the algorithm doesn’t need to evaluate every single value of the features to compute the split, but only the bins of the histogram, which are bounded. This approach turns out to be much more efficient for large datasets, without adversely affecting accuracy.\n", "\n", "With all of that being said, LightGBM is a fast, distributed, high-performance gradient boosting framework that was open-sourced by Microsoft around August 2016. The main advantages of LightGBM include:\n", "\n", "- Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up the training procedure.\n", "- Lower memory usage: Replacing continuous values with discrete bins results in lower memory usage.\n", "- Often better accuracy than other boosting implementations: It produces much more complex trees by following a leaf-wise split approach rather than a level-wise approach, which is the main factor in achieving higher accuracy.
However, it can sometimes lead to overfitting, which can be mitigated by setting the max_depth parameter.\n", "- Compatibility with large datasets: It performs equally well on large datasets, with a significant reduction in training time compared to XGBoost.\n", "- Parallel learning supported.\n", "\n", "The significant speed advantage of LightGBM translates into the ability to do more iterations and/or quicker hyperparameter search, which can be very useful if we have a limited time budget for optimizing our model or want to experiment with different feature engineering ideas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook compares LightGBM with [XGBoost](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/trees/xgboost.ipynb), another extremely popular gradient boosting framework, by applying both algorithms to a dataset and then comparing the models' performance and execution time. Here we will be using the [Adult dataset](http://archive.ics.uci.edu/ml/datasets/Adult) that consists of 32561 observations and 14 features describing individuals from various countries. Our target is to predict whether a person makes <=50k or >50k annually on the basis of the other information available." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:41.005963Z", "start_time": "2020-10-27T23:50:40.817642Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dimensions: (32561, 15)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>age</th><th>workclass</th><th>fnlwgt</th><th>education</th><th>education_num</th><th>marital_status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>capital_gain</th><th>capital_loss</th><th>hours_per_week</th><th>native_country</th><th>income</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>39</td><td>State-gov</td><td>77516</td><td>Bachelors</td><td>13</td><td>Never-married</td><td>Adm-clerical</td><td>Not-in-family</td><td>White</td><td>Male</td><td>2174</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"<tr><th>1</th><td>50</td><td>Self-emp-not-inc</td><td>83311</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>13</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"<tr><th>2</th><td>38</td><td>Private</td><td>215646</td><td>HS-grad</td><td>9</td><td>Divorced</td><td>Handlers-cleaners</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"<tr><th>3</th><td>53</td><td>Private</td><td>234721</td><td>11th</td><td>7</td><td>Married-civ-spouse</td><td>Handlers-cleaners</td><td>Husband</td><td>Black</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"<tr><th>4</th><td>28</td><td>Private</td><td>338409</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>Cuba</td><td>&lt;=50K</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " age workclass fnlwgt education education_num \\\n", "0 39 State-gov 77516 Bachelors 13 \n", "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", "2 38 Private 215646 HS-grad 9 \n", "3 53 Private 234721 11th 7 \n", "4 28 Private 338409 Bachelors 13 \n", "\n", " marital_status occupation relationship race sex \\\n", "0 Never-married Adm-clerical Not-in-family White Male \n", "1 Married-civ-spouse Exec-managerial Husband White Male \n", "2 Divorced Handlers-cleaners Not-in-family White Male \n", "3 Married-civ-spouse Handlers-cleaners Husband Black Male \n", "4 Married-civ-spouse Prof-specialty Wife Black Female \n", "\n", " capital_gain capital_loss hours_per_week native_country income \n", "0 2174 0 40 United-States <=50K \n", "1 0 0 13 United-States <=50K \n", "2 0 0 40 United-States <=50K \n", "3 0 0 40 United-States <=50K \n", "4 0 0 40 Cuba <=50K " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_data():\n", "    file_path = 'adult.csv'\n", "    if not os.path.isfile(file_path):\n", "        def chunks(input_list, n_chunk):\n", "            \"\"\"take a list and break it up into n-size chunks\"\"\"\n", "            for i in range(0, len(input_list), n_chunk):\n", "                yield input_list[i:i + n_chunk]\n", "\n", "        columns = [\n", "            'age',\n", "            'workclass',\n", "            'fnlwgt',\n", "            'education',\n", "            'education_num',\n", "            'marital_status',\n", "            'occupation',\n", "            'relationship',\n", "            'race',\n", "            'sex',\n", "            'capital_gain',\n", "            'capital_loss',\n", "            'hours_per_week',\n", "            'native_country',\n", "            'income'\n", "        ]\n", "\n", "        url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'\n", "        r = requests.get(url)\n", "        raw_text = r.text.replace('\\n', ',')\n", "        splitted_text = re.split(r',\\s*', raw_text)\n", "        data = list(chunks(splitted_text, n_chunk=len(columns)))\n", "        data = pd.DataFrame(data, columns=columns).dropna(axis=0, how='any')\n", "        data.to_csv(file_path, index=False)\n", "\n", "    data = pd.read_csv(file_path)\n", "    return data\n", "\n", "\n", "data = get_data()\n", "print('dimensions:', data.shape)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:42.865677Z", "start_time": "2020-10-27T23:50:42.826673Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of numerical features: 6\n", "number of categorical features: 8\n" ] } ], "source": [ "label_col = 'income'\n", "cat_cols = [\n", "    'workclass',\n", "    'education',\n", "    'marital_status',\n", "    'occupation',\n", "    'relationship',\n", "    'race',\n", "    'sex',\n", "    'native_country'\n", "]\n", "\n", "num_cols = [\n", "    'age',\n", "    'fnlwgt',\n", "    'education_num',\n", "    'capital_gain',\n", "    'capital_loss',\n", "    'hours_per_week'\n", "]\n", "\n", "print('number of numerical features: ', len(num_cols))\n", "print('number of categorical features: ', len(cat_cols))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:44.839694Z", "start_time": "2020-10-27T23:50:44.780245Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "labels distribution: [0.75919044 0.24080956]\n" ] } ], "source": [ "label_encode = LabelEncoder()\n", "data[label_col] = label_encode.fit_transform(data[label_col])\n", "y = data[label_col].values\n", "data = data.drop(label_col, axis=1)\n", "\n", "print('labels distribution:', np.bincount(y) / y.size)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:45.832667Z", "start_time": "2020-10-27T23:50:45.757734Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dimensions: (29304, 14)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>age</th><th>workclass</th><th>fnlwgt</th><th>education</th><th>education_num</th><th>marital_status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>capital_gain</th><th>capital_loss</th><th>hours_per_week</th><th>native_country</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>52</td><td>Private</td><td>168381</td><td>HS-grad</td><td>9</td><td>Widowed</td><td>Other-service</td><td>Unmarried</td><td>Asian-Pac-Islander</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>India</td></tr>\n",
"<tr><th>1</th><td>31</td><td>Private</td><td>134613</td><td>Assoc-voc</td><td>11</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>43</td><td>United-States</td></tr>\n",
"<tr><th>2</th><td>22</td><td>Private</td><td>68678</td><td>HS-grad</td><td>9</td><td>Married-civ-spouse</td><td>Sales</td><td>Husband</td><td>Black</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td></tr>\n",
"<tr><th>3</th><td>55</td><td>Private</td><td>110871</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Sales</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td></tr>\n",
"<tr><th>4</th><td>35</td><td>?</td><td>117528</td><td>Assoc-voc</td><td>11</td><td>Married-civ-spouse</td><td>?</td><td>Wife</td><td>White</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>United-States</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " age workclass fnlwgt education education_num marital_status \\\n", "0 52 Private 168381 HS-grad 9 Widowed \n", "1 31 Private 134613 Assoc-voc 11 Married-civ-spouse \n", "2 22 Private 68678 HS-grad 9 Married-civ-spouse \n", "3 55 Private 110871 Bachelors 13 Married-civ-spouse \n", "4 35 ? 117528 Assoc-voc 11 Married-civ-spouse \n", "\n", " occupation relationship race sex capital_gain \\\n", "0 Other-service Unmarried Asian-Pac-Islander Female 0 \n", "1 Exec-managerial Wife Black Female 0 \n", "2 Sales Husband Black Male 0 \n", "3 Sales Husband White Male 0 \n", "4 ? Wife White Female 0 \n", "\n", " capital_loss hours_per_week native_country \n", "0 0 40 India \n", "1 0 43 United-States \n", "2 0 40 United-States \n", "3 0 40 United-States \n", "4 0 40 United-States " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_size = 0.1\n", "split_random_state = 1234\n", "df_train, df_test, y_train, y_test = train_test_split(\n", " data, y, test_size=test_size,\n", " random_state=split_random_state, stratify=y)\n", "\n", "df_train = df_train.reset_index(drop=True)\n", "df_test = df_test.reset_index(drop=True)\n", "\n", "print('dimensions:', df_train.shape)\n", "df_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll perform very little feature engineering as that's not our main focus here. The following code chunk only one hot encodes the categorical features. There will be follow up discussions on this in later section." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:47.585465Z", "start_time": "2020-10-27T23:50:47.528126Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "number of one hot encoded categorical columns: 102\n" ] }, { "data": { "text/plain": [ "array(['workclass_?', 'workclass_Federal-gov', 'workclass_Local-gov',\n", "       'workclass_Never-worked', 'workclass_Private'], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "\n", "one_hot_encoder = OneHotEncoder(sparse=False, dtype=np.int32)\n", "one_hot_encoder.fit(df_train[cat_cols])\n", "cat_one_hot_cols = one_hot_encoder.get_feature_names(cat_cols)\n", "\n", "print('number of one hot encoded categorical columns: ', len(cat_one_hot_cols))\n", "cat_one_hot_cols[:5]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:48.602382Z", "start_time": "2020-10-27T23:50:48.575959Z" } }, "outputs": [], "source": [ "def preprocess_one_hot(df, one_hot_encoder, num_cols, cat_cols):\n", "    df = df.copy()\n", "\n", "    cat_one_hot_cols = one_hot_encoder.get_feature_names(cat_cols)\n", "\n", "    # reuse the input's index so the concat below aligns row by row\n", "    df_one_hot = pd.DataFrame(\n", "        one_hot_encoder.transform(df[cat_cols]),\n", "        columns=cat_one_hot_cols,\n", "        index=df.index\n", "    )\n", "    df_preprocessed = pd.concat([\n", "        df[num_cols],\n", "        df_one_hot\n", "    ], axis=1)\n", "    return df_preprocessed" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:50:50.316343Z", "start_time": "2020-10-27T23:50:50.188666Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(29304, 108)\n" ] }, { "data": { "text/plain": [ "age int64\n", "fnlwgt int64\n", "education_num int64\n", "capital_gain int64\n", "capital_loss int64\n", " ...
\n", "native_country_Thailand int32\n", "native_country_Trinadad&Tobago int32\n", "native_country_United-States int32\n", "native_country_Vietnam int32\n", "native_country_Yugoslavia int32\n", "Length: 108, dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train_one_hot = preprocess_one_hot(df_train, one_hot_encoder, num_cols, cat_cols)\n", "df_test_one_hot = preprocess_one_hot(df_test, one_hot_encoder, num_cols, cat_cols)\n", "print(df_train_one_hot.shape)\n", "df_train_one_hot.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmarking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next section compares the XGBoost and LightGBM implementations in terms of both execution time and model performance. There are a bunch of other hyperparameters that we as the end-user can specify, but here we explicitly specify arguably the most important ones." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:51:02.266532Z", "start_time": "2020-10-27T23:50:56.897609Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[LightGBM] [Warning] Accuracy may be bad since you didn't set num_leaves and 2^max_depth > num_leaves\n", "elapse:, 0.327070951461792\n" ] } ], "source": [ "time.sleep(5)\n", "\n", "lgb = LGBMClassifier(\n", "    n_jobs=-1,\n", "    max_depth=6,\n", "    subsample=1,\n", "    n_estimators=100,\n", "    learning_rate=0.1,\n", "    colsample_bytree=1,\n", "    objective='binary',\n", "    boosting_type='gbdt')\n", "\n", "start = time.time()\n", "lgb.fit(df_train_one_hot, y_train)\n", "lgb_elapse = time.time() - start\n", "print('elapse:, ', lgb_elapse)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:51:24.890527Z", "start_time": "2020-10-27T23:51:05.413283Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "elapse:,
14.434311866760254\n" ] } ], "source": [ "time.sleep(5)\n", "\n", "# raw xgboost\n", "xgb = XGBClassifier(\n", "    n_jobs=-1,\n", "    max_depth=6,\n", "    subsample=1,\n", "    n_estimators=100,\n", "    learning_rate=0.1,\n", "    colsample_bytree=1,\n", "    objective='binary:logistic',\n", "    booster='gbtree')\n", "\n", "start = time.time()\n", "xgb.fit(df_train_one_hot, y_train)\n", "xgb_elapse = time.time() - start\n", "print('elapse:, ', xgb_elapse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "XGBoost includes a `tree_method = 'hist'` option that buckets continuous variables into bins to speed up training. We also set `grow_policy = 'lossguide'` to favor splitting at nodes with the highest loss change, which mimics LightGBM." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:51:35.456722Z", "start_time": "2020-10-27T23:51:29.605094Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "elapse:, 0.8119819164276123\n" ] } ], "source": [ "time.sleep(5)\n", "\n", "xgb_hist = XGBClassifier(\n", "    n_jobs=-1,\n", "    max_depth=6,\n", "    subsample=1,\n", "    n_estimators=100,\n", "    learning_rate=0.1,\n", "    colsample_bytree=1,\n", "    objective='binary:logistic',\n", "    booster='gbtree',\n", "    tree_method='hist',\n", "    grow_policy='lossguide')\n", "\n", "start = time.time()\n", "xgb_hist.fit(df_train_one_hot, y_train)\n", "xgb_hist_elapse = time.time() - start\n", "print('elapse:, ', xgb_hist_elapse)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:51:38.783066Z", "start_time": "2020-10-27T23:51:38.655229Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "auc score: 0.9352834593198379\n", "auc score: 0.9348347355521263\n", "auc score: 0.9351145431888891\n" ] } ], "source": [ "# evaluate performance\n", "y_pred = lgb.predict_proba(df_test_one_hot)[:, 1]\n", "lgb_auc = roc_auc_score(y_test, y_pred)\n", "print('auc score: 
', lgb_auc)\n", "\n", "y_pred = xgb.predict_proba(df_test_one_hot)[:, 1]\n", "xgb_auc = roc_auc_score(y_test, y_pred)\n", "print('auc score: ', xgb_auc)\n", "\n", "y_pred = xgb_hist.predict_proba(df_test_one_hot)[:, 1]\n", "xgb_hist_auc = roc_auc_score(y_test, y_pred)\n", "print('auc score: ', xgb_hist_auc)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:51:41.434248Z", "start_time": "2020-10-27T23:51:41.398426Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>elapse_time</th><th>auc_score</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>LightGBM</th><td>0.327071</td><td>0.935283</td></tr>\n",
"<tr><th>XGBoostHist</th><td>0.811982</td><td>0.935115</td></tr>\n",
"<tr><th>XGBoost</th><td>14.434312</td><td>0.934835</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " elapse_time auc_score\n", "LightGBM 0.327071 0.935283\n", "XGBoostHist 0.811982 0.935115\n", "XGBoost 14.434312 0.934835" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# comparison table\n", "results = pd.DataFrame({\n", "    'elapse_time': [lgb_elapse, xgb_hist_elapse, xgb_elapse],\n", "    'auc_score': [lgb_auc, xgb_hist_auc, xgb_auc]})\n", "results.index = ['LightGBM', 'XGBoostHist', 'XGBoost']\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the resulting table, we can see that there isn't a noticeable difference in AUC score between the implementations. On the other hand, there is a significant difference in the time it takes to finish the whole training procedure: LightGBM took about 0.33 seconds, compared with 0.81 seconds for XGBoost's histogram method and 14.4 seconds for the default XGBoost settings. This is a huge advantage and makes LightGBM a much better approach when dealing with large datasets.\n", "\n", "For those interested, the people at Microsoft have a blog post with an even more thorough benchmark on various datasets. The link is included below along with a summary of their results:\n", "\n", "> [Blog: Lessons Learned From Benchmarking Fast Machine Learning Algorithms](https://blogs.technet.microsoft.com/machinelearning/2017/07/25/lessons-learned-benchmarking-fast-machine-learning-algorithms/)\n", ">\n", "> Our results, based on tests on six datasets, are summarized as follows:\n", "\n", "> - XGBoost and LightGBM achieve similar accuracy metrics.\n", "> - LightGBM has lower training time than XGBoost and its histogram-based variant, XGBoost hist, for all test datasets, on both CPU and GPU implementations. The training time difference between the two libraries depends on the dataset, and can be as big as 25 times.\n", "> - XGBoost GPU implementation does not scale well to large datasets and ran out of memory in half of the tests.\n", "> - XGBoost hist may be significantly slower than the original XGBoost when feature dimensionality is high."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical Variables in Tree-based Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many real-world datasets include a mix of continuous and categorical variables. The defining property of the latter is that their values have no inherent ordering. One major advantage of decision tree models and their ensemble counterparts, such as random forests, extra trees and gradient boosted trees, is that they are able to operate on both continuous and categorical variables directly (popular implementations of tree-based models differ as to whether they honor this fact). In contrast, most other popular models (e.g., generalized linear models, neural networks) must instead transform categorical variables into some numerical format, usually by one-hot encoding them to create a new dummy variable for each level of the original variable. e.g.\n", "\n", "\n", "\n", "One drawback of one hot encoding is that it can lead to a huge increase in the dimensionality of the feature representation. For example, one hot encoding U.S. states adds 49 dimensions to our feature representation.\n", "\n", "To understand why we don't need to perform one hot encoding for tree-based models, we need to refer back to the logic of tree-based algorithms. At the heart of the tree-based algorithm is a sub-algorithm that splits the samples into two bins by selecting a feature and a value. This splitting algorithm considers each of the features in turn, and for each feature selects the value of that feature that minimizes the impurity of the bins.\n", "\n", "This means tree-based models are essentially looking for places to split the data; they are not multiplying our inputs by weights. In contrast, most other popular models (e.g., generalized linear models, neural networks) would interpret an integer encoding such as red=1, blue=2 as meaning that blue is twice the amount of red, which is obviously not what we want."
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:51:46.390899Z", "start_time": "2020-10-27T23:51:46.332767Z" } }, "outputs": [ { "data": { "text/plain": [ "OrdinalEncoder(dtype=<class 'numpy.int32'>)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ordinal_encoder = OrdinalEncoder(dtype=np.int32)\n", "ordinal_encoder.fit(df_train[cat_cols])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:51:47.827221Z", "start_time": "2020-10-27T23:51:47.800347Z" } }, "outputs": [], "source": [ "def preprocess_ordinal(df, ordinal_encoder, cat_cols, cat_dtype='int32'):\n", "    df = df.copy()\n", "    df[cat_cols] = ordinal_encoder.transform(df[cat_cols])\n", "    df[cat_cols] = df[cat_cols].astype(cat_dtype)\n", "    return df" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:51:49.413004Z", "start_time": "2020-10-27T23:51:49.257154Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(29304, 14)\n" ] }, { "data": { "text/plain": [ "age int64\n", "workclass int32\n", "fnlwgt int64\n", "education int32\n", "education_num int64\n", "marital_status int32\n", "occupation int32\n", "relationship int32\n", "race int32\n", "sex int32\n", "capital_gain int64\n", "capital_loss int64\n", "hours_per_week int64\n", "native_country int32\n", "dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train_ordinal = preprocess_ordinal(df_train, ordinal_encoder, cat_cols)\n", "df_test_ordinal = preprocess_ordinal(df_test, ordinal_encoder, cat_cols)\n", "print(df_train_ordinal.shape)\n", "df_train_ordinal.dtypes" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:52:00.478855Z", "start_time": "2020-10-27T23:51:55.116271Z" } }, "outputs": [ { "name": "stdout", "output_type":
"stream", "text": [ "[LightGBM] [Warning] Accuracy may be bad since you didn't set num_leaves and 2^max_depth > num_leaves\n", "elapse:, 0.2872467041015625\n", "auc score: 0.9348548507555065\n" ] } ], "source": [ "time.sleep(5)\n", "\n", "lgb = LGBMClassifier(\n", " n_jobs=-1,\n", " max_depth=6,\n", " subsample=1,\n", " n_estimators=100,\n", " learning_rate=0.1,\n", " colsample_bytree=1,\n", " objective='binary',\n", " boosting_type='gbdt')\n", "\n", "start = time.time()\n", "lgb.fit(df_train_ordinal, y_train)\n", "lgb_ordinal_elapse = time.time() - start\n", "print('elapse:, ', lgb_ordinal_elapse)\n", "\n", "y_pred = lgb.predict_proba(df_test_ordinal)[:, 1]\n", "lgb_ordinal_auc = roc_auc_score(y_test, y_pred)\n", "print('auc score: ', lgb_ordinal_auc)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:52:02.600833Z", "start_time": "2020-10-27T23:52:02.552515Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>elapse_time</th><th>auc_score</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>LightGBM Ordinal</th><td>0.287247</td><td>0.934855</td></tr>\n",
"<tr><th>LightGBM</th><td>0.327071</td><td>0.935283</td></tr>\n",
"<tr><th>XGBoostHist</th><td>0.811982</td><td>0.935115</td></tr>\n",
"<tr><th>XGBoost</th><td>14.434312</td><td>0.934835</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " elapse_time auc_score\n", "LightGBM Ordinal 0.287247 0.934855\n", "LightGBM 0.327071 0.935283\n", "XGBoostHist 0.811982 0.935115\n", "XGBoost 14.434312 0.934835" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# comparison table\n", "results = pd.DataFrame({\n", "    'elapse_time': [lgb_ordinal_elapse, lgb_elapse, xgb_hist_elapse, xgb_elapse],\n", "    'auc_score': [lgb_ordinal_auc, lgb_auc, xgb_hist_auc, xgb_auc]})\n", "results.index = ['LightGBM Ordinal', 'LightGBM', 'XGBoostHist', 'XGBoost']\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the result above, we can see that ordinal encoding requires slightly less training time (0.29 versus 0.33 seconds) with essentially the same AUC score (0.9349 versus 0.9353). What's more, we no longer need to one hot encode our categorical features. The code chunk below shows that this is highly advantageous from a memory-usage perspective when we have a bunch of categorical features." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:52:06.165449Z", "start_time": "2020-10-27T23:52:06.120179Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "OneHot Encoding\n", "number of columns: 108\n", "memory usage: 13362752\n", "\n", "Ordinal Encoding\n", "number of columns: 14\n", "memory usage: 2344448\n" ] } ], "source": [ "print('OneHot Encoding')\n", "print('number of columns: ', df_train_one_hot.shape[1])\n", "print('memory usage: ', df_train_one_hot.memory_usage(deep=True).sum())\n", "print()\n", "\n", "print('Ordinal Encoding')\n", "print('number of columns: ', df_train_ordinal.shape[1])\n", "print('memory usage: ', df_train_ordinal.memory_usage(deep=True).sum())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2020-10-27T23:52:09.079253Z", "start_time": "2020-10-27T23:52:08.314572Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "image/png": { "height": 501, "width": 706 }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# plotting the feature importance just out of curiosity\n", "\n", "# change default style figure and font size\n", "plt.rcParams['figure.figsize'] = 10, 8\n", "plt.rcParams['font.size'] = 12\n", "\n", "# like other tree-based models, it can also output the\n", "# feature importance plot\n", "plot_importance(lgb, importance_type='gain')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For tuning LightGBM's hyperparameters, the documentation page has some pretty good suggestions. [LightGBM Documentation: Parameters Tuning](http://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [LightGBM Documentation: Parameters Tuning](http://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)\n", "- [Blog: xgboost’s New Fast Histogram (tree_method = hist)](https://medium.com/data-design/xgboosts-new-fast-histogram-tree-method-hist-a3c08f36234c)\n", "- [Blog: Which algorithm takes the crown: Light GBM vs XGBOOST?](https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/)\n", "- [Blog: Are categorical variables getting lost in your random forests?](http://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)\n", "- [Blog: Lessons Learned From Benchmarking Fast Machine Learning Algorithms](https://blogs.technet.microsoft.com/machinelearning/2017/07/25/lessons-learned-benchmarking-fast-machine-learning-algorithms/)\n", "- [Stackoverflow: Why tree-based model do not need one-hot encoding for nominal data?](https://stackoverflow.com/questions/45139834/why-tree-based-model-do-not-need-one-hot-encoding-for-nominal-data)" ] } ], "metadata": { "kernelspec": { "display_name": "Python
3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": { "height": "12px", "width": "252px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }