{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TODO\n", "\n", "- Update commentary to be more intensive\n", "- Finish hyperparameter tuning\n", "- Add note on overfitting with random forests & gradient boosted trees\n", "\n", "# Ensemble Methods\n", "\n", "One quote I heard at the beginning of my machine learning journey is \"if you don't have a favorite algorithm yet, pick the random forest\". I'm glad I heard that early on, because the random forest has proven itself many times, along with the other algorithms in the ensemble family.\n", "\n", "There are a few good reasons why ensemble methods are my favorite family of models. Not only do they tend to have strong predictive performance (often topping the leaderboards in Kaggle competitions on structured data), but they also retain some semblance of interpretability and can often be parallelized to use all cores of a CPU.\n", "\n", "I recently gave a [talk on ensemble methods](https://github.com/JeffMacaluso/Talks/blob/master/EnsembleMethods/EnsembleMethods.pptx) at a local meetup group, and this is a more fleshed-out version of the hands-on portion that includes a more difficult dataset and more complete hyperparameter tuning.\n", "\n", "## Overview\n", "\n", "In this post, we'll train a few ensemble models on an artificial dataset for binary classification. We'll use scikit-learn to compare a few different types of ensemble methods, and then use XGBoost and LightGBM for more specialized implementations of gradient boosting. Additionally, we'll go over hyperparameter tuning and discuss a few strategies for tuning ensemble models.\n", "\n", "## Setup\n", "\n", "The setup here is largely a series of import statements, creating an artificial classification dataset with [scikit-learn's make_classification function](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html), and then creating a function to train our models and gather various metrics." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T00:39:58.966450Z", "start_time": "2018-07-12T00:39:37.655238Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2018/07/11 19:39\n", "OS: win32\n", "Python: 3.5.5 | packaged by conda-forge | (default, Apr 6 2018, 16:03:44) [MSC v.1900 64 bit (AMD64)]\n", "NumPy: 1.12.1\n", "Pandas: 0.23.1\n" ] } ], "source": [ "import sys\n", "import time\n", "import scipy\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from sklearn import datasets\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import RandomizedSearchCV\n", "from sklearn import ensemble\n", "from sklearn import linear_model\n", "from sklearn import metrics\n", "\n", "print(time.strftime('%Y/%m/%d %H:%M'))\n", "print('OS:', sys.platform)\n", "print('Python:', sys.version)\n", "print('NumPy:', np.__version__)\n", "print('Pandas:', pd.__version__)\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating an artificial data set with [scikit-learn's make_classification function](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html).\n", "\n", "**TODO: Increase the size of the dataset and re-run**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:04:01.008869Z", "start_time": "2018-07-12T23:04:00.696389Z" } }, "outputs": [ { "data": { "text/html": [ "
\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>...</th><th>21</th><th>22</th><th>23</th><th>24</th><th>25</th><th>26</th><th>27</th><th>28</th><th>29</th><th>label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>-2.099007</td><td>0.787135</td><td>-0.686102</td><td>-1.228830</td><td>0.723556</td><td>-0.311079</td><td>-0.952826</td><td>0.867668</td><td>-1.642571</td><td>0.624871</td><td>...</td><td>0.987041</td><td>-0.377789</td><td>0.242981</td><td>-0.792679</td><td>-1.715772</td><td>-0.420792</td><td>-1.729152</td><td>1.308256</td><td>-0.700561</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>-0.801402</td><td>-3.851769</td><td>1.538805</td><td>0.565783</td><td>0.526172</td><td>0.752565</td><td>0.014558</td><td>-0.240075</td><td>-1.479138</td><td>1.819075</td><td>...</td><td>-0.153594</td><td>-0.324068</td><td>-2.060220</td><td>1.541481</td><td>1.297861</td><td>-1.228942</td><td>0.494606</td><td>2.167953</td><td>0.178436</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>-5.662407</td><td>-5.853161</td><td>1.625716</td><td>-0.390593</td><td>1.199284</td><td>1.888906</td><td>-1.019720</td><td>-1.392650</td><td>3.012919</td><td>-1.139037</td><td>...</td><td>-0.335733</td><td>-0.468439</td><td>-1.996023</td><td>2.419778</td><td>-1.558457</td><td>-0.539612</td><td>-1.159566</td><td>3.362889</td><td>-0.891273</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>1.569256</td><td>5.984285</td><td>-1.678201</td><td>-0.601325</td><td>0.470834</td><td>0.688409</td><td>-2.392620</td><td>-0.946743</td><td>-2.713660</td><td>1.422514</td><td>...</td><td>-0.161714</td><td>-1.189745</td><td>-0.837363</td><td>-0.825927</td><td>-1.654660</td><td>-0.339540</td><td>1.216920</td><td>0.145422</td><td>1.459836</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>2.881864</td><td>5.945795</td><td>-1.627379</td><td>-1.361672</td><td>-0.773137</td><td>-0.071754</td><td>-4.075099</td><td>1.901061</td><td>-4.294425</td><td>0.795730</td><td>...</td><td>0.656566</td><td>-0.235963</td><td>-2.146111</td><td>0.594049</td><td>2.290443</td><td>0.330266</td><td>-0.019847</td><td>-5.770743</td><td>0.815581</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 31 columns</p>\n",
"
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 -2.099007 0.787135 -0.686102 -1.228830 0.723556 -0.311079 -0.952826 \n", "1 -0.801402 -3.851769 1.538805 0.565783 0.526172 0.752565 0.014558 \n", "2 -5.662407 -5.853161 1.625716 -0.390593 1.199284 1.888906 -1.019720 \n", "3 1.569256 5.984285 -1.678201 -0.601325 0.470834 0.688409 -2.392620 \n", "4 2.881864 5.945795 -1.627379 -1.361672 -0.773137 -0.071754 -4.075099 \n", "\n", " 7 8 9 ... 21 22 23 \\\n", "0 0.867668 -1.642571 0.624871 ... 0.987041 -0.377789 0.242981 \n", "1 -0.240075 -1.479138 1.819075 ... -0.153594 -0.324068 -2.060220 \n", "2 -1.392650 3.012919 -1.139037 ... -0.335733 -0.468439 -1.996023 \n", "3 -0.946743 -2.713660 1.422514 ... -0.161714 -1.189745 -0.837363 \n", "4 1.901061 -4.294425 0.795730 ... 0.656566 -0.235963 -2.146111 \n", "\n", " 24 25 26 27 28 29 label \n", "0 -0.792679 -1.715772 -0.420792 -1.729152 1.308256 -0.700561 1 \n", "1 1.541481 1.297861 -1.228942 0.494606 2.167953 0.178436 1 \n", "2 2.419778 -1.558457 -0.539612 -1.159566 3.362889 -0.891273 0 \n", "3 -0.825927 -1.654660 -0.339540 1.216920 0.145422 1.459836 1 \n", "4 0.594049 2.290443 0.330266 -0.019847 -5.770743 0.815581 1 \n", "\n", "[5 rows x 31 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Creating an artificial dataset to test algorithms on\n", "data = datasets.make_classification(#n_samples=300000,\n", " n_samples=3000,\n", " n_classes=2,\n", " n_features=30,\n", " n_informative=10,\n", " n_redundant=5, # Superfluous features working as noise for the algorithms\n", " flip_y=0.5, # Introduces additional noise\n", " class_sep=0.7, \n", " n_clusters_per_class=10,\n", " random_state=46)\n", "\n", "# Assigning features/labels to variables for ease of use\n", "X = data[0] # Features\n", "y = data[1] # Label\n", "\n", "# Train/test split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=46)\n", "\n", "# Putting into a dataframe for 
viewing\n", "df = pd.DataFrame(X)\n", "df['label'] = y\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to adhere to the [DRY principle](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself), we'll create a function to train our models and gather the accuracy, [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve), [log loss](https://en.wikipedia.org/wiki/Cross_entropy), and model training time." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:04:07.289737Z", "start_time": "2018-07-12T23:04:07.242865Z" } }, "outputs": [], "source": [
"# Data frame for gathering results\n",
"results = pd.DataFrame(columns=['Accuracy', 'LogLoss', 'AUC', 'TrainingTime'])\n",
"tuned_results = pd.DataFrame(columns=['Accuracy', 'LogLoss', 'AUC', 'TrainingTime', 'NumIterations'])\n",
"\n",
"# Function for training a model and retrieving the results\n",
"def train_model_get_results(model, model_name):\n",
"    '''\n",
"    Trains a model and appends the results to the results dataframe\n",
"    \n",
"    Input:\n",
"        - model: The model with specified hyperparameters to be trained\n",
"        - model_name: The name of the model to be used as the index\n",
"    \n",
"    Output: The results dataframe with the model results added\n",
"    \n",
"    Note: Only works with scikit-learn models and frameworks that integrate \n",
"    with the scikit-learn API\n",
"    '''\n",
"    \n",
"    # Collecting training time for results\n",
"    start_time = time.time()\n",
"    \n",
"    print('Training the model')\n",
"    model.fit(X_train, y_train)\n",
"    \n",
"    end_time = time.time()\n",
"    total_training_time = end_time - start_time\n",
"    print('Completed')\n",
"    \n",
"    # Calculating the testing set accuracy with the score method\n",
"    accuracy = model.score(X_test, y_test)\n",
"    \n",
"    # Calculating the AUC and log loss with predicted probabilities\n",
"    class_probabilities = model.predict_proba(X_test)\n",
"    log_loss = metrics.log_loss(y_test, class_probabilities)\n",
"    auc = metrics.roc_auc_score(y_test, class_probabilities[:, 1])\n",
"    \n",
"    # Adding the model results to the results dataframe\n",
"    model_results = [accuracy, log_loss, auc, total_training_time]\n",
"    results.loc[model_name] = model_results\n",
"    \n",
"    print('\\n', 'Non-tuned results:')\n",
"    return results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Baseline\n", "\n", "It's always useful to have a baseline to compare against, since it tells us roughly how difficult a problem is going to be. I like to use linear or logistic regression because both are extremely fast to train." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:04:10.477054Z", "start_time": "2018-07-12T23:04:10.414547Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training the model\n", "Completed\n", "\n", " Non-tuned results:\n" ] }, { "data": { "text/html": [ "
\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Accuracy</th><th>LogLoss</th><th>AUC</th><th>TrainingTime</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>Logistic Regression</th><td>0.505556</td><td>0.697955</td><td>0.491237</td><td>0.015628</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.505556 0.697955 0.491237 0.015628" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Instantiating the model\n", "logistic_regression = linear_model.LogisticRegression()\n", "\n", "# Using our user defined function to train the model and return the results\n", "train_model_get_results(model=logistic_regression, model_name='Logistic Regression')" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2018-06-30T21:52:49.624668Z", "start_time": "2018-06-30T21:52:49.593513Z" } }, "source": [ "## Bagging\n", "\n", "Bagging (bootstrap aggregating) is the technique that aggregates models built with bootstrapping, or sampling with replacement, via a majority vote or by averaging the predictions. The trees are independent of each other and can be built in parallel. \n", "\n", "Bagging models tend to decrease variance.\n", "\n", "### Random Forest\n", "\n", "The most popular bagging algorithm is the **random forest**. This algorithm works by building a series of decision trees where each tree uses a random selection of variables, and then decision trees vote on the final answer. 
\n", "\n", "More specifically, for each tree:\n", "\n", "- Draw a bootstrap sample (sampling with replacement) of the training data\n", "- At each node, consider a random subset of the features and pick the best split among them\n", "- Leave the tree unpruned, which keeps its bias low\n", "\n", "Once these trees are grown, a majority vote among all of the trees is used to make predictions.\n", "\n", "The main ideas here are that the randomness produces a diverse set of models, which helps improve accuracy, and that considering only a random subset of features at each split makes training more efficient.\n", "\n", "**Advantages:**\n", "- Robustness against overfitting\n", "    - Since each tree sees a different bootstrap sample and different feature subsets, the combined model typically generalizes better, and accuracy usually improves with the number of trees up to a saturation point\n", "- Able to parallelize training multiple trees at once and thus speed up training time" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:04:48.256008Z", "start_time": "2018-07-12T23:04:47.834157Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training the model\n", "Completed\n", "\n", " Non-tuned results:\n" ] }, { "data": { "text/html": [ "
\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Accuracy</th><th>LogLoss</th><th>AUC</th><th>TrainingTime</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>Logistic Regression</th><td>0.505556</td><td>0.697955</td><td>0.491237</td><td>0.015628</td></tr>\n",
"    <tr><th>Random Forest</th><td>0.523333</td><td>0.810627</td><td>0.532343</td><td>0.171863</td></tr>\n",
"  </tbody>\n",
"</table>" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.505556 0.697955 0.491237 0.015628\n", "Random Forest 0.523333 0.810627 0.532343 0.171863" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest = ensemble.RandomForestClassifier(n_jobs=-1)  # n_jobs=-1 uses all available cores\n", "\n", "train_model_get_results(random_forest, model_name='Random Forest')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boosting\n", "\n", "Boosting methods train a sequence of weak learners (learners that are only slightly better than random chance) where each successive model focuses on the parts that the previous models got wrong. The trees have to be built in sequence and generally cannot be built in parallel without clever tricks.\n", "\n", "Boosting models tend to decrease bias.\n", "\n", "### Gradient Boosting\n", "\n", "While there are a few different boosting algorithms, gradient boosting is arguably the most popular. Its main difference from the others is that each new tree is fit to the gradient of the loss function with respect to the current predictions, so the sequence of trees performs a form of gradient descent on the loss. This typically gives it performance advantages over other boosting algorithms.\n", "\n", "*Source: [BigML](https://blog.bigml.com/2017/03/14/introduction-to-boosted-trees/)*\n", "\n", "**Advantages:**\n", "- Can often outperform random forests when properly tuned\n", "\n", "**Disadvantages:**\n", "- Typically overfits more easily than bagging\n", "- Sensitive to noise & extreme values\n", "- Has to be built sequentially, so training cannot be parallelized across trees without tricks" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:04:49.927780Z", "start_time": "2018-07-12T23:04:48.849720Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training the model\n", "Completed\n", "\n", " Non-tuned results:\n" ] }, { "data": { "text/html": [ "
\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Accuracy</th><th>LogLoss</th><th>AUC</th><th>TrainingTime</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>Logistic Regression</th><td>0.505556</td><td>0.697955</td><td>0.491237</td><td>0.015628</td></tr>\n",
"    <tr><th>Random Forest</th><td>0.523333</td><td>0.810627</td><td>0.532343</td><td>0.171863</td></tr>\n",
"    <tr><th>Gradient Boosted Trees</th><td>0.546667</td><td>0.693403</td><td>0.561060</td><td>1.031190</td></tr>\n",
"  </tbody>\n",
"</table>" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.505556 0.697955 0.491237 0.015628\n", "Random Forest 0.523333 0.810627 0.532343 0.171863\n", "Gradient Boosted Trees 0.546667 0.693403 0.561060 1.031190" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gradient_boosting = ensemble.GradientBoostingClassifier()\n", "\n", "train_model_get_results(gradient_boosting, model_name='Gradient Boosted Trees')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Note on interpretability\n", "\n", "It's possible to obtain \"feature importance\" from both bagging and boosting methods. These are not as interpretable as coefficients from linear/logistic regressions, but can still give us an idea of what is happening. \n", "\n", "Note that the usual caveat about multicollinearity applies here: these interpretations will be misleading if the features are heavily correlated with each other." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:04:50.084020Z", "start_time": "2018-07-12T23:04:49.675Z" } }, "outputs": [], "source": [
"def feature_importance(model):\n",
"    '''\n",
"    Plots the feature importance for an ensemble model from scikit-learn\n",
"    '''\n",
"    # Scaling the importances relative to the most important feature\n",
"    importances = model.feature_importances_\n",
"    importances = 100.0 * (importances / importances.max())\n",
"    sorted_idx = np.argsort(importances)\n",
"    pos = np.arange(sorted_idx.shape[0]) + .5\n",
"    plt.figure(figsize=(15, 8))\n",
"    plt.subplot(1, 2, 2)\n",
"    plt.barh(pos, importances[sorted_idx], align='center')\n",
"    plt.yticks(pos, sorted_idx)\n",
"    plt.xlabel('Relative Importance')\n",
"    plt.title('Variable Importance')\n",
"    plt.show()\n",
"\n",
"\n",
"feature_importance(gradient_boosting)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stacking\n", "\n", "Stacking is the final ensemble technique, where we combine several different models into a chain of sorts. It is structured similarly to a neural network, where layers of models provide predictions that the next layer then uses as inputs. Ultimately, the meta-classifier creates a final prediction.\n", "\n", "*Source: [Anshul Joshi](https://www.quora.com/What-is-stacking-in-machine-learning)*\n", "\n", "This is a little more nuanced than blending models (averaging their predictions for a final prediction) because the meta-learner learns how useful each of the models is.\n", "\n", "**Advantages:**\n", "- Can achieve better predictive performance when properly tuned\n", "\n", "**Disadvantages:**\n", "- Much more computationally costly\n", "- More difficult to tune\n", "- Complete loss of interpretability\n", "\n", "We'll need another function that is similar to our previous one for training the models and getting the results. In this case, we'll deal with one layer of classifiers and use a logistic regression for the meta-learner. We'll use five models for the first layer, but this function is designed to accept any number of models." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:04:52.990094Z", "start_time": "2018-07-12T23:04:51.005839Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training the model\n", "Completed\n", "\n", "Coefficients for models\n", "Model 1: -2.9603327700929984\n", "Model 2: 6.827364713943086\n", "Model 3: 6.967087517500586\n", "Model 4: 0.7215877036971426\n", "Model 5: 0.49496178638295857\n" ] }, { "data": { "text/html": [ "
\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Accuracy</th><th>LogLoss</th><th>AUC</th><th>TrainingTime</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>Logistic Regression</th><td>0.505556</td><td>0.697955</td><td>0.491237</td><td>0.015628</td></tr>\n",
"    <tr><th>Random Forest</th><td>0.523333</td><td>0.810627</td><td>0.532343</td><td>0.171863</td></tr>\n",
"    <tr><th>Gradient Boosted Trees</th><td>0.546667</td><td>0.693403</td><td>0.561060</td><td>1.031190</td></tr>\n",
"    <tr><th>Stacking</th><td>0.520000</td><td>0.994001</td><td>0.533464</td><td>1.593773</td></tr>\n",
"  </tbody>\n",
"</table>" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.505556 0.697955 0.491237 0.015628\n", "Random Forest 0.523333 0.810627 0.532343 0.171863\n", "Gradient Boosted Trees 0.546667 0.693403 0.561060 1.031190\n", "Stacking 0.520000 0.994001 0.533464 1.593773" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [
"def train_stacking_get_results(list_of_models):\n",
"    '''\n",
"    Trains a stacking classifier and appends the results to the results dataframe\n",
"    \n",
"    Input: list_of_models: a list of untrained scikit-learn models\n",
"    \n",
"    Output: The results dataframe with the model results added\n",
"    \n",
"    Note: Only works with scikit-learn models and frameworks that integrate \n",
"    with the scikit-learn API\n",
"    '''\n",
"    # The meta learner is the one that takes the outputs from\n",
"    # the other models as input before final classification\n",
"    meta_learner = linear_model.LogisticRegression()\n",
"\n",
"    # Collecting training time for results\n",
"    start_time = time.time()\n",
"    print('Training the model')\n",
"\n",
"    # Fitting the first layer models\n",
"    for model in list_of_models:\n",
"        model.fit(X_train, y_train)\n",
"\n",
"    # Collecting the predictions from the models for training\n",
"    # Note: these are in-sample predictions, so the meta learner sees\n",
"    # overconfident inputs; out-of-fold predictions would avoid this\n",
"    model_output = []\n",
"\n",
"    for model in list_of_models:\n",
"        class_probabilities = model.predict_proba(X_train)[:, 1]\n",
"        model_output.append(class_probabilities)\n",
"\n",
"    # Re-shaping before passing to the meta learner\n",
"    X_train_meta = np.array(model_output).transpose()\n",
"\n",
"    # Fitting the meta learner\n",
"    meta_learner.fit(X_train_meta, y_train)\n",
"\n",
"    end_time = time.time()\n",
"    total_time = end_time - start_time\n",
"    print('Completed')\n",
"\n",
"    # Collecting the predictions from the models for testing\n",
"    model_output = []\n",
"\n",
"    for model in list_of_models:\n",
"        class_probabilities = model.predict_proba(X_test)[:, 1]\n",
"        model_output.append(class_probabilities)\n",
"\n",
"    # Re-shaping before passing to the meta learner\n",
"    X_test_meta = np.array(model_output).transpose()\n",
"\n",
"    # Collecting the accuracy from the meta learner\n",
"    accuracy = meta_learner.score(X_test_meta, y_test)\n",
"\n",
"    # Calculating the log loss and AUC with predicted probabilities\n",
"    class_probabilities = meta_learner.predict_proba(X_test_meta)\n",
"    log_loss = metrics.log_loss(y_test, class_probabilities)\n",
"    auc = metrics.roc_auc_score(y_test, class_probabilities[:, 1])\n",
"\n",
"    # Printing the meta learner's coefficient for each first layer model\n",
"    print()\n",
"    print('Coefficients for models')\n",
"    for i, coef in enumerate(meta_learner.coef_[0]):\n",
"        print('Model {0}: {1}'.format(i + 1, coef))\n",
"\n",
"    model_results = [accuracy, log_loss, auc, total_time]\n",
"    results.loc['Stacking'] = model_results\n",
"\n",
"    return results\n",
"\n",
"\n",
"# Adding extra imports for additional models\n",
"from sklearn import neighbors\n",
"\n",
"# Defining the learners for the first layer\n",
"model_1 = linear_model.LogisticRegression()\n",
"model_2 = ensemble.RandomForestClassifier(n_jobs=-1)\n",
"model_3 = ensemble.RandomForestClassifier(n_jobs=-1)\n",
"model_4 = ensemble.GradientBoostingClassifier()\n",
"model_5 = neighbors.KNeighborsClassifier(n_jobs=-1)\n",
"\n",
"# Putting the models in a list to iterate through in the function\n",
"models = [model_1, model_2, model_3, model_4, model_5]\n",
"\n",
"# Running our function to build a stacking model\n",
"train_stacking_get_results(models)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Hyperparameter Tuning\n", "\n", "There are two main methodologies for hyperparameter tuning: \n", "1. Manually testing hypotheses on how changing certain hyperparameters will impact the performance of the model\n", "2. 
Automatically checking many different combinations of hyperparameters using either a grid search or a randomized search\n", "\n", "For this post, we will discuss a few strategies for the first option, and then go with the second option by using a randomized search. \n", "\n", "Between grid search and random search, grid search generally makes more intuitive sense. However, research from [James Bergstra and Yoshua Bengio](http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf) has shown that random search tends to find good hyperparameters faster than grid search. Here's a graphic from their paper that gives an intuitive example of how random search can cover more ground when some of the hyperparameters aren't as important:\n", "\n", "*Source: [James Bergstra & Yoshua Bengio](http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf)*\n", "\n", "## Hyperparameters & Decision Tree Structure\n", "\n", "Because both random forests and gradient boosted trees use decision trees for their underlying structures, their hyperparameters are largely the same. 
Here's a recap of the decision tree structure and a quick summary of each of the hyperparameters we'll be tuning:\n", "\n", "*Source: [Murtuza Morbiwala](http://insightfromdata.blogspot.com/2012/06/decision-tree-unembellished.html)*\n", "\n", "\n", "### Hyperparameters\n", "\n", "This list is not all-inclusive, but has most of the common hyperparameters:\n", "\n", "- **Number of Estimators:** The number of decision trees to be trained\n", "    - A higher number typically means better predictions (at the cost of computational power) up to a saturation point, beyond which gradient boosting in particular can begin to overfit\n", "- **Max Depth:** How deep a tree can be\n", "    - This should ideally be low for gradient boosting and large (or unlimited) for random forests\n", "- **Minimum Samples per Split:** The minimum number of samples a node must contain to be split\n", "    - A higher number constrains the trees, which reduces variance and training time at the risk of underfitting\n", "- **Minimum Samples per Leaf:** The minimum number of samples required to be a leaf node\n", "    - A lower number could potentially result in more noise being captured\n", "- **Max Features:** The number of features to consider when looking for the best split\n", "    - A lower number typically reduces variance/increases bias and improves computational efficiency\n", "- **Max Leaf Nodes:** The maximum number of leaf nodes for the tree\n", "    - A smaller number could help prevent overfitting\n", "- **Learning Rate (gradient boosting only):** The adjustment/step size for each iteration\n", "    - A larger step size can help get better performance in fewer iterations, but will plateau at a lower performance\n", "    - A smaller step size will require more iterations (number of estimators) but will ultimately achieve a better performance\n", "\n", "Here is an illustration of what a learning rate is and how a learning rate that is too small or too large can have adverse impacts:\n", "\n", "\n", "*Source: [Jeremy 
Jordan](https://www.jeremyjordan.me/nn-learning-rate/)*\n", "\n", "Here is a more visual version of these hyperparameters on a tree: \n", "\n", "\n", "*Source: [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/)*\n", "\n", "### General Strategies\n", "\n", "Most strategies are specific to either random forests or gradient boosting, but there are a few strategies that apply to both.\n", "\n", "- Increase the number of estimators until just before overfitting begins or until the returns in performance diminish severely\n", "    - Compare the performance against the default parameters to see if this helps and how much\n", "- Further adjust the model complexity (starting with tree depth)\n", "    - Decrease the complexity of the trees if you suspect the model is suffering from high variance\n", "    - Increase the complexity of the trees if you suspect the model is suffering from high bias\n", "\n", "Remember that hyperparameter tuning is all about controlling model complexity in order to achieve the optimal state in the bias-variance tradeoff:\n", "\n", "\n", "*Source: [Satya Mallick](https://www.learnopencv.com/bias-variance-tradeoff-in-machine-learning/)*\n", "\n", "### Setup\n", "\n", "In order to do the actual hyperparameter tuning, we need to create our third and final function. This will take a model and a dictionary of parameters, perform a random search for the given number of iterations, and then give us our results." 
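, "\n", "\n",
"As a rough, library-free picture of what a randomized search does with a parameter dictionary, the sketch below draws a fixed number of candidate settings at random and compares them with a grid of the same size. The helper names (`grid_candidates`, `random_candidates`, `distinct_values`) and the two example hyperparameter ranges are assumptions made up for illustration; they are not part of scikit-learn or of the functions in this post.\n",
"\n",
"```python\n",
"import itertools\n",
"import random\n",
"\n",
"\n",
"def grid_candidates(values_per_param):\n",
"    '''Every combination of the listed values (a full grid).'''\n",
"    names = list(values_per_param)\n",
"    combos = itertools.product(*(values_per_param[name] for name in names))\n",
"    return [dict(zip(names, combo)) for combo in combos]\n",
"\n",
"\n",
"def random_candidates(ranges, n_trials, seed=46):\n",
"    '''n_trials settings, each value drawn uniformly from its (low, high) range.'''\n",
"    rng = random.Random(seed)\n",
"    return [{name: rng.uniform(low, high) for name, (low, high) in ranges.items()}\n",
"            for _ in range(n_trials)]\n",
"\n",
"\n",
"def distinct_values(candidates, name):\n",
"    '''Number of different values tried for one hyperparameter.'''\n",
"    return len({candidate[name] for candidate in candidates})\n",
"\n",
"\n",
"# Hypothetical search space with the same budget of fits for both strategies\n",
"grid = grid_candidates({'learning_rate': [0.01, 0.1, 1.0],\n",
"                        'subsample': [0.5, 0.75, 1.0]})\n",
"rand = random_candidates({'learning_rate': (0.01, 1.0),\n",
"                          'subsample': (0.5, 1.0)}, n_trials=len(grid))\n",
"\n",
"print('Fits per strategy:', len(grid), len(rand))\n",
"print('Distinct learning rates (grid):', distinct_values(grid, 'learning_rate'))\n",
"print('Distinct learning rates (random):', distinct_values(rand, 'learning_rate'))\n",
"```\n",
"\n",
"With the same budget of fits, the grid only ever tries three distinct values per hyperparameter, while the random draw tries a different value on (almost surely) every fit, which is the intuition behind the Bergstra and Bengio figure."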
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:04:57.458575Z", "start_time": "2018-07-12T23:04:57.427465Z" } }, "outputs": [], "source": [
"def hyperparameter_tune_get_results(model, parameters, model_name, num_rounds=30):\n",
"    '''\n",
"    Performs a random search to find optimal hyperparameters and appends the results\n",
"    to the tuned_results dataframe\n",
"    \n",
"    Input: \n",
"        - model: A scikit-learn model\n",
"        - parameters: A dictionary of parameters for the model\n",
"        - model_name: A string of the model name for the tuned_results dataframe\n",
"        - num_rounds: The number of rounds to try different hyperparameters\n",
"    \n",
"    Output: The tuned_results dataframe with the results appended\n",
"    '''\n",
"    \n",
"    # Reporting the default parameters before tuning\n",
"    print('Default Parameters:', '\\n')\n",
"    print(model, '\\n')\n",
"    \n",
"    # Defining the random search cross validation\n",
"    random_search = RandomizedSearchCV(model,\n",
"                                       param_distributions=parameters,\n",
"                                       n_iter=num_rounds, n_jobs=-1, cv=3,\n",
"                                       return_train_score=True, random_state=46,\n",
"                                       verbose=10)  # Set to 20 to print the status of each completed fit\n",
"    \n",
"    print('Beginning hyperparameter tuning')\n",
"    start_time = time.time()\n",
"    random_search.fit(X_train, y_train)\n",
"    end_time = time.time()\n",
"    total_training_time = end_time - start_time\n",
"    print('Completed')\n",
"    \n",
"    # Calculating the testing set accuracy on the best estimator with the score method\n",
"    accuracy = random_search.best_estimator_.score(X_test, y_test)\n",
"    \n",
"    # Calculating the log loss and AUC with predicted probabilities\n",
"    class_probabilities = random_search.best_estimator_.predict_proba(X_test)\n",
"    log_loss = metrics.log_loss(y_test, class_probabilities)\n",
"    auc = metrics.roc_auc_score(y_test, class_probabilities[:, 1])\n",
"    \n",
"    # Adding the model results to the results dataframe\n",
"    model_results = [accuracy, log_loss, auc, total_training_time, num_rounds]\n",
"    tuned_results.loc[model_name] = model_results\n",
"    \n",
"    # Plotting the distribution of mean cross-validated test scores across the iterations\n",
"    sns.distplot(random_search.cv_results_['mean_test_score'])\n",
"    plt.title('Mean test score')\n",
"    \n",
"    print('Best estimator:', '\\n')\n",
"    print(random_search.best_estimator_)\n",
"    \n",
"    print()\n",
"    print('Accuracy before tuning:', results.loc[model_name]['Accuracy'])\n",
"    print('Accuracy after tuning:', tuned_results.loc[model_name]['Accuracy'])\n",
"    \n",
"    print('\\n', 'Tuned results:')\n",
"    return tuned_results" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2018-06-30T23:13:57.881984Z", "start_time": "2018-06-30T23:13:57.866361Z" } }, "source": [ "## Baseline\n", "\n", "For our logistic regression model, we're just going to tune the regularization strength and the penalty type. One of the advantages of simpler models like this is that they are easier to tune because we don't have nearly as many hyperparameters to worry about.\n", "\n", "**Note: The number of rounds is being kept small in these examples to keep within time limits for the talk, but increase them in a real-world scenario for more effective hyperparameter tuning**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:05:07.442344Z", "start_time": "2018-07-12T23:04:58.770997Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default Parameters: \n", "\n", "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", " verbose=0, warm_start=False) \n", "\n", "Beginning hyperparameter tuning\n", "Fitting 3 folds for each of 10 candidates, totalling 30 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 7.0s\n", 
"[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 7.1s\n", "[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 7.3s\n", "[Parallel(n_jobs=-1)]: Done 27 out of 30 | elapsed: 7.8s remaining: 0.8s\n", "[Parallel(n_jobs=-1)]: Done 30 out of 30 | elapsed: 7.9s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Completed\n", "Best estimator: \n", "\n", "LogisticRegression(C=6.8421729839625272, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1, max_iter=100,\n", " multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,\n", " solver='liblinear', tol=0.0001, verbose=0, warm_start=False)\n", "\n", "Accuracy before tuning: 0.505555555556\n", "Accuracy after tuning: 0.505555555556\n", "\n", " Tuned results:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Accuracy LogLoss AUC TrainingTime NumIterations
Logistic Regression 0.505556 0.697967 0.491222 8.296371 10.0
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime NumIterations\n", "Logistic Regression 0.505556 0.697967 0.491222 8.296371 10.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "parameters = {'C': scipy.stats.uniform(0, 10), # Uniform distribution between 0 and 10\n", " 'penalty': ['l1', 'l2']\n", " }\n", "\n", "hyperparameter_tune_get_results(model=logistic_regression, parameters=parameters,\n", " model_name='Logistic Regression', num_rounds=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forests\n", "\n", "Because random forests are generally robust to overfitting and there aren't as many parameters to control as there are in gradient boosting, our hyperparameter tuning strategy doesn't have to be as nuanced.\n", "\n", "I've found that increasing the number of trees has the most direct impact on performance. Because adding more trees to a random forest rarely causes overfitting, we can usually keep increasing the count until our models take too long to train or there isn't much of a performance gain from using more trees. Scikit-learn's random forest implementation used only 10 trees by default at the time of writing (the default became 100 in version 0.22), but R's [randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf#page=17) package uses 500 by default.\n", "\n", "The number of trees is the first lever to adjust. After that, look at the max depth, which controls overall model complexity. Whether to raise or lower it depends on whether we need to reduce bias (deeper trees) or variance (shallower trees).\n", "\n", "We can also control a few other components like the number of features considered for each split or the minimum samples required for each split/leaf, but these may not have as large an impact as the number of estimators or max depth." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:10:23.063855Z", "start_time": "2018-07-12T23:05:12.582658Z" }, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default Parameters: \n", "\n", "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False) \n", "\n", "Beginning hyperparameter tuning\n", "Fitting 3 folds for each of 30 candidates, totalling 90 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 16.1s\n", "[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 56.8s\n", "[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 1.2min\n", "[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 1.5min\n", "[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 1.9min\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 2.7min\n", "[Parallel(n_jobs=-1)]: Done 53 tasks | elapsed: 3.2min\n", "[Parallel(n_jobs=-1)]: Done 64 tasks | elapsed: 3.6min\n", "[Parallel(n_jobs=-1)]: Done 77 tasks | elapsed: 4.6min\n", "[Parallel(n_jobs=-1)]: Done 90 out of 90 | elapsed: 5.0min finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Completed\n", "Best estimator: \n", "\n", "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=30, max_features='log2', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=2, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=831, n_jobs=-1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False)\n", "\n", "Accuracy before tuning: 
0.523333333333\n", "Accuracy after tuning: 0.554444444444\n", "\n", " Tuned results:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Accuracy LogLoss AUC TrainingTime NumIterations
Logistic Regression 0.505556 0.697967 0.491222 8.296371 10.0
Random Forest 0.554444 0.683471 0.574590 307.981350 30.0
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime NumIterations\n", "Logistic Regression 0.505556 0.697967 0.491222 8.296371 10.0\n", "Random Forest 0.554444 0.683471 0.574590 307.981350 30.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Creating the dictionary of parameters to use in the search\n", "parameters = {'n_estimators': scipy.stats.randint(low=10, high=1000), # Uniform distribution\n", " 'max_depth': [None, 10, 30], # Maximum number of levels in a tree\n", " 'max_features': ['auto', 'log2', None], # Number of features to consider at each split\n", " 'min_samples_split': [2, 5, 10], # Minimum number of samples required to split a node\n", " 'min_samples_leaf': [1, 2, 4] # Minimum number of samples required at each leaf node\n", " }\n", "\n", "hyperparameter_tune_get_results(random_forest, parameters, 'Random Forest', num_rounds=30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As I mentioned earlier, random forests are relatively robust to overfitting. We can usually find good results by increasing the number of trees until we hit a point with diminishing returns on performance.\n", "\n", "However, let's test overfitting. I'll plot a [validation curve](https://chrisalbon.com/machine_learning/model_evaluation/plot_the_validation_curve/) (using the code from that post) to see how increasing the number of trees affects the cross validation accuracy." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "start_time": "2018-07-12T23:23:01.269Z" } }, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "rf = ensemble.RandomForestClassifier(n_jobs=-1)\n", "\n", "trees_to_try = [10, 30, 60, 100, 300, 600, 1000, 3000, 6000, 10000, 30000]\n", "\n", "# validation_curve returns (train_scores, test_scores), each of shape (n_param_values, n_cv_folds)\n", "train_scores, test_scores = model_selection.validation_curve(\n", " estimator=rf,\n", " X=X,\n", " y=y,\n", " cv=3,\n", " param_name='n_estimators',\n", " param_range=trees_to_try,\n", " n_jobs=-1)\n", "\n", "# TODO: combine these two cells" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2018-07-12T23:21:23.817433Z", "start_time": "2018-07-12T23:21:23.473703Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Calculate mean and standard deviation for training set scores\n", "train_mean = np.mean(train_scores, axis=1)\n", "train_std = np.std(train_scores, axis=1)\n", "\n", "# Calculate mean and standard deviation for test set scores\n", "test_mean = np.mean(test_scores, axis=1)\n", "test_std = np.std(test_scores, axis=1)\n", "\n", "# TODO: Make this plot bigger, add y log scale, add title, despine if possible, add standard deviation?\n", "plt.plot(trees_to_try, train_mean, marker='o', label='Training Accuracy')\n", "plt.plot(trees_to_try, test_mean, marker='o', label='Testing Accuracy')\n", "plt.legend()\n", "# ax.set_yscale('log')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gradient Boosted Trees\n", "\n", "Because gradient boosting is used in Kaggle-style competitions more often than random forests, there are quite a few more established tuning strategies out there. These models are often more difficult to tune than random forests, because tuning them is more nuanced than simply cranking up the number of trees and crossing your fingers.\n", "\n", "One crucial hyperparameter that gradient boosting introduces is the learning rate. As previously mentioned, it controls how drastically each newly built tree adjusts the ensemble's predictions. One peculiarity is that the learning rate suffers pretty heavily from the [Goldilocks principle](https://en.wikipedia.org/wiki/Goldilocks_principle) - it has to be just right for optimal performance. The best value also depends heavily on the number of trees we're training. 
Here is a chart that shows the relationship between the number of trees and the learning rate:\n", "\n", "\n", "*Source: [Synced](https://medium.com/syncedreview/tree-boosting-with-xgboost-why-does-xgboost-win-every-machine-learning-competition-ca8034c0b283)*\n", "\n", "Generally speaking, if we have a low number of trees and a high learning rate, we will reach good performance faster but with a lower top-end performance. Conversely, we can get better performance with a low learning rate and a lot of trees, but it will take much longer to get there.\n", "\n", "Most of the other hyperparameters are either similar to or the same as those in random forests. However, we'll want to use different value ranges for them because the trees in the two algorithms are inherently different. Random forests use larger, relatively unconstrained trees, but boosting methods use weak learners. These weak learners are by definition much less complex, so they are smaller, simpler trees.\n", "\n", "There are a variety of tuning guides (several are listed [here](https://machinelearningmastery.com/configure-gradient-boosting-algorithm/)), but my favorite is this guide from Zhonghua Zhang, the former \\#1 Kaggler in the world:\n", "\n", "\n", "*Source: [Zhonghua Zhang](https://www.slideshare.net/ShangxuanZhang/winning-data-science-competitions-presented-by-owen-zhang)*\n", "\n", "Note that this does include several hyperparameters specifically for XGBoost that are not included in the scikit-learn implementation, but we will ignore those for now." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "start_time": "2018-07-11T00:47:19.836Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default Parameters: \n", "\n", "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n", " learning_rate=0.1, loss='deviance', max_depth=3,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", " warm_start=False) \n", "\n", "Beginning hyperparameter tuning\n", "Fitting 3 folds for each of 30 candidates, totalling 90 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 2.2min\n", "[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 8.0min\n", "[Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 11.0min\n", "[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 15.9min\n", "[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 21.8min\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 24.5min\n", "[Parallel(n_jobs=-1)]: Done 53 tasks | elapsed: 30.9min\n", "[Parallel(n_jobs=-1)]: Done 64 tasks | elapsed: 41.6min\n", "[Parallel(n_jobs=-1)]: Done 77 tasks | elapsed: 54.8min\n", "[Parallel(n_jobs=-1)]: Done 90 out of 90 | elapsed: 66.5min finished\n" ] } ], "source": [ "# Creating the dictionary of parameters to use in the search\n", "parameters = {'n_estimators': scipy.stats.randint(low=100, high=1000), # Uniform distribution between 100 and 1000\n", " 'learning_rate': [0.003, 0.01, 0.03, 0.1, 0.3], # How drastic updates are\n", " 'subsample': [0.5, 0.75, 1.0], # The portion of rows to use in updates\n", " 'max_depth': [3, 6, 8, 10], # Maximum number of levels in a tree\n", " 'min_samples_split': [2, 5, 10], # Minimum number of samples required to split a node\n", " 'min_samples_leaf': [1, 2, 4] # 
Minimum number of samples required at each leaf node\n", " }\n", "\n", "hyperparameter_tune_get_results(gradient_boosting, parameters, 'Gradient Boosted Trees', num_rounds=30)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2018-07-11T01:56:24.482360Z", "start_time": "2018-07-11T01:56:24.451113Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Accuracy LogLoss AUC TrainingTime
Logistic Regression 0.688667 0.621084 0.719362 42.689128
Random Forest 0.713000 0.600420 0.731466 3247.834107
Gradient Boosted Trees 0.712556 0.619908 0.734792 4069.814399
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.688667 0.621084 0.719362 42.689128\n", "Random Forest 0.713000 0.600420 0.731466 3247.834107\n", "Gradient Boosted Trees 0.712556 0.619908 0.734792 4069.814399" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tuned_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stacking\n", "\n", "Stacking is a special case because we also have to tune the hyperparameters of the individual models within the ensemble. We can reuse our tuned ensemble models for part of it, but we will still have to tune the *k*-NN model." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2018-07-11T02:35:46.570323Z", "start_time": "2018-07-11T02:33:04.825669Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training the model\n", "Completed\n", "\n", "Coefficients for models\n", "Model 1: -4.734082188838665\n", "Model 2: 13.070197844351487\n", "Model 3: 13.093441698919188\n", "Model 4: -7.996204564151691\n", "Model 5: 1.098772263424389\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Accuracy LogLoss AUC TrainingTime
Logistic Regression 0.689000 0.621118 0.719324 0.124991
Random Forest 0.670444 0.980089 0.705007 1.015584
Gradient Boosted Trees 0.703333 0.602038 0.729213 11.014955
Stacking 0.708222 0.973627 0.725960 146.808059
LightGBM 0.712556 0.606374 0.734803 0.187471
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.689000 0.621118 0.719324 0.124991\n", "Random Forest 0.670444 0.980089 0.705007 1.015584\n", "Gradient Boosted Trees 0.703333 0.602038 0.729213 11.014955\n", "Stacking 0.708222 0.973627 0.725960 146.808059\n", "LightGBM 0.712556 0.606374 0.734803 0.187471" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Defining the learners for the first layer\n", "model_1 = linear_model.LogisticRegression()\n", "model_2 = ensemble.RandomForestClassifier(n_estimators=620, max_depth=30,\n", " min_samples_split=10, n_jobs=-1)\n", "model_3 = ensemble.RandomForestClassifier(n_estimators=620, max_depth=30,\n", " min_samples_split=10, n_jobs=-1)\n", "model_4 = ensemble.GradientBoostingClassifier()\n", "model_5 = neighbors.KNeighborsClassifier(n_jobs=-1)\n", "\n", "# Putting the models in a list to iterate through in the function\n", "models = [model_1, model_2, model_3, model_4, model_5]\n", "\n", "# Running our function to build a stacking model\n", "train_stacking_get_results(models)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Additional Frameworks\n", "\n", "We have been using scikit-learn up until now for our models, but there are more specialized frameworks for gradient boosting in particular. Scikit-learn's gradient boosting implementation is good, but it lacks some additional optimizations and a few components and options that can be useful.\n", "\n", "Specifically, we're going to focus on **XGBoost** and **LightGBM**. 
We'll go into more specifics for each, but both frameworks are focused on speed and performance and have the following advantages & disadvantages:\n", "\n", "#### Advantages\n", "- Ability to parallelize training\n", "- Ability to use GPUs\n", "- Additional under-the-hood optimization\n", "- Can specify loss functions\n", "- Additional tuning parameters\n", "- Distributed computing options\n", "- Native handling of missing values\n", "\n", "#### Disadvantages\n", "- Relatively difficult to install\n", "- Not as unified integration in older versions\n", "\n", "So generally speaking, XGBoost and LightGBM are able to train better models faster, but can be more difficult to set up and use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### XGBoost\n", "\n", "[XGBoost](https://github.com/dmlc/xgboost) is an extremely popular framework for gradient boosted trees created by Tianqi Chen, a Ph.D. student at the University of Washington. It was initially released in 2014, but did not become popular until it started dominating competitions on Kaggle a few years later. It has implementations in several languages, but we will be focusing on the Python implementation. For more history, Tianqi posted [this blog post](https://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html) about the history, philosophy, and learnings behind creating XGBoost.\n", "\n", "As I mentioned, both XGBoost and LightGBM use a series of clever tricks and under-the-hood optimizations that are not included in the Scikit-Learn implementation that make them train better models faster. One example is that XGBoost uses second derivatives to find the optimal constant in each terminal node, whereas other implementations just use the first derivative. This is nearly impossible to unpack without getting into the math, but it should give an idea of the type of under-the-hood optimization that is happening. 
If you are interested, [here is the XGBoost white paper](https://arxiv.org/abs/1603.02754) that explains a lot of the optimizations.\n", "\n", "#### Installation\n", "\n", "At the time of writing, the [installation guide](https://xgboost.readthedocs.io/en/latest/build.html) states that there is only a wheel file on PyPI for the 64-bit version of Linux, so things get a little more complicated for Windows & OSX users. Specifically, you have to build the library from source.\n", "\n", "However, I do have a workaround for Windows users (sorry OSX users!) that I borrowed from [this blog post](https://medium.com/@rakshithvasudev/how-i-installed-xgboost-after-a-lot-of-hassels-on-my-windows-machine-c53e972e801e). Download the wheel file for your version of Windows and Python [here](https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost) (cp27/35/36/37 indicate the Python version, and win32/win\\_amd64 indicate 32-bit or 64-bit Windows), open a command prompt in the directory where you downloaded it, and run `pip install xgboost-0.72-cp35-cp35m-win_amd64.whl`, substituting the name of whichever wheel file you downloaded.\n", "\n", "If you don't know your version of Windows or Python, run the code block below." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2018-07-07T16:23:15.333691Z", "start_time": "2018-07-07T16:23:15.326695Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python: 3.5.5 | packaged by conda-forge | (default, Apr 6 2018, 16:03:44) [MSC v.1900 64 bit (AMD64)]\n", "('64bit', 'WindowsPE')\n" ] } ], "source": [ "import sys\n", "import platform\n", "\n", "print('Python:', sys.version)\n", "print(platform.architecture())" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2018-07-07T16:23:52.000357Z", "start_time": "2018-07-07T16:23:51.243785Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training the model\n", "Completed\n", "\n", " Non-tuned results:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                        Accuracy   LogLoss       AUC  TrainingTime
Logistic Regression        0.696  0.604543  0.743498      0.023969
Random Forest              0.654  1.473435  0.684939      0.148897
Gradient Boosted Trees     0.724  0.596253  0.733390      0.577691
Stacking                   0.656  1.042905  0.693695      5.193073
XGBoost                    0.722  0.595399  0.736951      0.262850
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.696 0.604543 0.743498 0.023969\n", "Random Forest 0.654 1.473435 0.684939 0.148897\n", "Gradient Boosted Trees 0.724 0.596253 0.733390 0.577691\n", "Stacking 0.656 1.042905 0.693695 5.193073\n", "XGBoost 0.722 0.595399 0.736951 0.262850" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import xgboost as xgb\n", "\n", "xgboost = xgb.XGBClassifier(n_jobs=-1) # n_jobs=-1 uses all available cores\n", "\n", "# Due to the scikit-learn API option, LightGBM works with our function!\n", "train_model_get_results(xgboost, 'XGBoost')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Hyperparameter Tuning\n", "\n", "XGBoost has additional hyperparameters that can be tuned - [here is the full list](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster). For the purposes of this demonstration, we'll stick with mostly the same hyperparameters that we used for our previous gradient boosting example.\n", "\n", "I mentioned this above, but below is the tuning guide from Zhonghua Zhang, the former \\#1 kaggler in the world. 
Additionally, [here](https://machinelearningmastery.com/configure-gradient-boosting-algorithm/) is the blog post containing other tuning strategies that are primarily focused on XGBoost.\n", "\n", "\n", "*Source: [Zhonghua Zhang](https://www.slideshare.net/ShangxuanZhang/winning-data-science-competitions-presented-by-owen-zhang)*\n", "\n", "**TODO: Update these hyperparameters**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2018-07-07T16:25:43.511719Z", "start_time": "2018-07-07T16:24:52.360557Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default Parameters: \n", "\n", "XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", " colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,\n", " max_depth=3, min_child_weight=1, missing=None, n_estimators=100,\n", " n_jobs=-1, nthread=None, objective='binary:logistic',\n", " random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n", " seed=None, silent=True, subsample=1) \n", "\n", "Beginning hyperparameter tuning\n", "Fitting 3 folds for each of 5 candidates, totalling 15 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 7.9s\n", "[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 10.2s\n", "[Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 11.9s\n", "[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 12.7s\n", "[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 14.3s\n", "[Parallel(n_jobs=-1)]: Done 6 tasks | elapsed: 15.4s\n", "[Parallel(n_jobs=-1)]: Done 7 tasks | elapsed: 24.3s\n", "[Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 25.5s\n", "[Parallel(n_jobs=-1)]: Done 9 out of 15 | elapsed: 27.5s remaining: 18.3s\n", "[Parallel(n_jobs=-1)]: Done 10 out of 15 | elapsed: 33.3s remaining: 16.6s\n", "[Parallel(n_jobs=-1)]: Done 11 out of 15 | elapsed: 38.3s remaining: 13.9s\n", "[Parallel(n_jobs=-1)]: Done 12 out of 15 | elapsed: 39.2s remaining: 9.7s\n", 
"[Parallel(n_jobs=-1)]: Done 13 out of 15 | elapsed: 41.3s remaining: 6.3s\n", "[Parallel(n_jobs=-1)]: Done 15 out of 15 | elapsed: 45.2s remaining: 0.0s\n", "[Parallel(n_jobs=-1)]: Done 15 out of 15 | elapsed: 45.2s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Completed\n", "Best estimator: \n", "\n", "XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", " colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,\n", " max_depth=4, min_child_weight=1, missing=None, n_estimators=899,\n", " n_jobs=-1, nthread=None, objective='binary:logistic',\n", " random_state=0, reg_alpha=1, reg_lambda=0, scale_pos_weight=1,\n", " seed=None, silent=True, subsample=1.0)\n", "\n", "Accuracy before tuning: 0.722\n", "Accuracy after tuning: 0.719333333333\n", "\n", " Tuned results:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                        Accuracy   LogLoss       AUC  TrainingTime
Logistic Regression     0.696667  0.604583  0.743372      5.783737
Random Forest           0.720667  0.600349  0.734453     29.119582
Gradient Boosted Trees  0.721333  0.591319  0.742431     68.062629
XGBoost                 0.719333  0.603150  0.732412     50.254662
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.696667 0.604583 0.743372 5.783737\n", "Random Forest 0.720667 0.600349 0.734453 29.119582\n", "Gradient Boosted Trees 0.721333 0.591319 0.742431 68.062629\n", "XGBoost 0.719333 0.603150 0.732412 50.254662" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "parameters = {'n_estimators': scipy.stats.randint(low=100, high=1000), # Uniform distribution between 10 and 1000\n", " 'learning_rate': [0.01, 0.03, 0.1, 0.3],\n", " 'max_depth': [4, 6, 8, 10],\n", " 'subsample': [0.5, 0.75, 1.0],\n", " 'reg_alpha': [0, 1], # L1 regularization\n", " 'reg_lambda': [0, 1] # L2 regularization\n", " }\n", "\n", "hyperparameter_tune_get_results(xgboost, parameters, 'XGBoost', num_rounds=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LightGBM\n", "\n", "[LightGBM](https://github.com/Microsoft/LightGBM) is a project from [Microsoft Research Asia](https://www.microsoft.com/en-us/research/lab/microsoft-research-asia/) that is focused around training gradient boosted trees in a highly efficient and distributed manner. It's generally comparable to XGBoost, but is not as popular because it is much newer. 
More specifically, LightGBM was released in December 2016, after XGBoost had already become the de facto framework for Kaggle competitions.\n", "\n", "One of the fundamental differences between LightGBM and other implementations of gradient boosted trees is that it grows trees leaf-wise rather than level-wise, which reportedly lets it reach a lower loss than level-wise growth:\n", "\n", "\n", "\n", "\n", "\n", "*Source: [LightGBM](https://github.com/Microsoft/LightGBM/blob/master/docs/Features.rst)*\n", "\n", "Additionally, LightGBM uses a histogram-based algorithm that discretizes continuous variables into buckets in order to speed up training and reduce memory requirements. XGBoost has included this in recent versions, but it is not enabled by default.\n", "\n", "There are several other optimizations happening under the hood (listed [here](https://github.com/Microsoft/LightGBM/blob/master/docs/Features.rst)), but those are a few of the main differences from other implementations.\n", "\n", "#### Installation\n", "\n", "[The documentation on GitHub](https://github.com/Microsoft/LightGBM/tree/master/python-package#installation) has installation instructions for LightGBM. It can be installed from PyPI with `pip install lightgbm`, but it has a few prerequisites that depend on your OS, so check the documentation first." ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "ExecuteTime": { "end_time": "2018-07-11T03:03:27.823937Z", "start_time": "2018-07-11T03:03:25.527201Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training the model\n", "Completed\n", "\n", " Non-tuned results:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                     Accuracy   LogLoss       AUC  TrainingTime
Logistic Regression  0.530467  0.690443  0.542193      2.046665
Random Forest        0.576922  0.779316  0.608052     13.342987
LightGBM             0.585022  0.678384  0.616754      1.859263
LightGBM AUC         0.585022  0.678384  0.616754      1.765515
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.530467 0.690443 0.542193 2.046665\n", "Random Forest 0.576922 0.779316 0.608052 13.342987\n", "LightGBM 0.585022 0.678384 0.616754 1.859263\n", "LightGBM AUC 0.585022 0.678384 0.616754 1.765515" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import lightgbm as lgb\n", "\n", "lightGBM = lgb.LGBMClassifier(nthread=-1) # nthread=-1 uses all available cores\n", "\n", "# Due to the scikit-learn API option, LightGBM works with our function!\n", "train_model_get_results(lightGBM, 'LightGBM')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Hyperparameter Tuning\n", "\n", "Because LightGBM is so similar to XGBoost, we can use the same tuning guidelines in principle. However, there are a few additional tuning guidelines noted in the official [LightGBM Parameter Tuning Guide](http://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html).\n", "\n", "**TODO: Adjust these hyperparameters**" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "ExecuteTime": { "end_time": "2018-07-11T03:39:42.598836Z", "start_time": "2018-07-11T03:11:49.527668Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Default Parameters: \n", "\n", "LGBMClassifier(boosting_type='gbdt', colsample_bytree=1, learning_rate=0.1,\n", " max_bin=255, max_depth=-1, min_child_samples=10,\n", " min_child_weight=5, min_split_gain=0, n_estimators=100, nthread=-1,\n", " num_leaves=31, objective='binary', reg_alpha=0, reg_lambda=0,\n", " seed=0, silent=True, subsample=1, subsample_for_bin=50000,\n", " subsample_freq=1) \n", "\n", "Beginning hyperparameter tuning\n", "Fitting 3 folds for each of 30 candidates, totalling 90 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 1.5min\n", "[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 3.4min\n", "[Parallel(n_jobs=-1)]: 
Done 17 tasks | elapsed: 5.6min\n", "[Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 7.9min\n", "[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 9.8min\n", "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 13.4min\n", "[Parallel(n_jobs=-1)]: Done 53 tasks | elapsed: 16.3min\n", "[Parallel(n_jobs=-1)]: Done 64 tasks | elapsed: 19.2min\n", "[Parallel(n_jobs=-1)]: Done 77 tasks | elapsed: 24.3min\n", "[Parallel(n_jobs=-1)]: Done 90 out of 90 | elapsed: 27.2min finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Completed\n", "Best estimator: \n", "\n", "LGBMClassifier(boosting_type='gbdt', colsample_bytree=1, learning_rate=0.03,\n", " max_bin=255, max_depth=6, min_child_samples=10, min_child_weight=5,\n", " min_split_gain=0, n_estimators=935, nthread=-1, num_leaves=31,\n", " objective='binary', reg_alpha=1, reg_lambda=1, seed=0, silent=True,\n", " subsample=0.5, subsample_for_bin=50000, subsample_freq=1)\n", "\n", "Accuracy before tuning: 0.585022222222\n", "Accuracy after tuning: 0.635922222222\n", "\n", " Tuned results:\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
          Accuracy   LogLoss       AUC  TrainingTime
LightGBM  0.635922  0.647806  0.676993   1664.384194
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "LightGBM 0.635922 0.647806 0.676993 1664.384194" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "parameters = {'n_estimators': scipy.stats.randint(low=10, high=1000), # Uniform distribution between 10 and 1000\n", " 'learning_rate': [0.01, 0.03, 0.1, 0.3],\n", " 'max_depth': [4, 6, 8, 10],\n", " 'subsample': [0.5, 0.75, 1.0],\n", " 'reg_alpha': [0, 1], # L1 regularization\n", " 'reg_lambda': [0, 1] # L2 regularization\n", " }\n", "\n", "hyperparameter_tune_get_results(lightGBM, parameters, 'LightGBM', num_rounds=30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Blending\n", "\n", "While not technically not one of the three ensemble methods, blending is a popular technique for combining the predictions of multiple models via averaging. It's easy to program, and typically has good results.\n", "\n", "It has similar downsides as stacking - it requires more computational power, and any semblance of interpretability goes out the window." 
] }, { "cell_type": "code", "execution_count": 115, "metadata": { "ExecuteTime": { "end_time": "2018-07-11T03:51:48.316770Z", "start_time": "2018-07-11T03:45:54.891909Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training model number 1\n", "Training model number 2\n", "Training model number 3\n", "Training model number 4\n", "Training model number 5\n", "Training model number 6\n", "Training model number 7\n", "Training model number 8\n", "Training model number 9\n", "Training model number 10\n", "\n", "Accuracy: 0.633433333333\n", "Log Loss: 0.649578716595\n", "AUC: 0.674961877427\n" ] } ], "source": [ "num_models = 10\n", "\n", "class_probabilities = []\n", "\n", "# Use a separate loop counter so the fitted model does not overwrite it\n", "for model_number in range(1, num_models + 1):\n", "\n", "    # Progress printing for every 10% of completion\n", "    if model_number % max(1, num_models // 10) == 0:\n", "        print('Training model number', model_number)\n", "\n", "    model = lgb.LGBMClassifier(nthread=-1, n_estimators=935, learning_rate=0.03)\n", "    model.fit(X_train, y_train)\n", "    model_prediction = model.predict_proba(X_test)\n", "    class_probabilities.append(model_prediction)\n", "\n", "# Averaging the predicted class probabilities across models\n", "class_probabilities = np.asarray(class_probabilities).mean(axis=0)\n", "# Predicting the positive class when its averaged probability exceeds 0.5\n", "predictions = np.where(class_probabilities[:, 1] > 0.5, 1, 0)\n", "\n", "print()\n", "print('Accuracy:', metrics.accuracy_score(y_test, predictions))\n", "print('Log Loss:', metrics.log_loss(y_test, class_probabilities))\n", "print('AUC:', metrics.roc_auc_score(y_test, class_probabilities[:, 1]))\n", "\n" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "ExecuteTime": { "end_time": "2018-07-11T03:55:11.077734Z", "start_time": "2018-07-11T03:55:11.030862Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
          Accuracy   LogLoss       AUC  TrainingTime
LightGBM  0.635922  0.647806  0.676993   1664.384194
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "LightGBM 0.635922 0.647806 0.676993 1664.384194" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tuned_results" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "ExecuteTime": { "end_time": "2018-07-11T02:47:38.954431Z", "start_time": "2018-07-11T02:47:38.923322Z" } }, "outputs": [ { "data": { "text/plain": [ "0.71255555555555561" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metrics.accuracy_score(y_test, predictions)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "ExecuteTime": { "end_time": "2018-07-11T02:47:40.782445Z", "start_time": "2018-07-11T02:47:40.751205Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                        Accuracy   LogLoss       AUC  TrainingTime
Logistic Regression     0.688667  0.621084  0.719362     42.689128
Random Forest           0.713000  0.600420  0.731466   3247.834107
Gradient Boosted Trees  0.712556  0.619908  0.734792   4069.814399
LightGBM                0.708889  0.598696  0.734768    629.372810
\n", "
" ], "text/plain": [ " Accuracy LogLoss AUC TrainingTime\n", "Logistic Regression 0.688667 0.621084 0.719362 42.689128\n", "Random Forest 0.713000 0.600420 0.731466 3247.834107\n", "Gradient Boosted Trees 0.712556 0.619908 0.734792 4069.814399\n", "LightGBM 0.708889 0.598696 0.734768 629.372810" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tuned_results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary\n", "\n", "And there we have it! We looked at different types of ensemble methods, how to tune them, and a few different frameworks for using them." ] } ], "metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.5" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }