{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification with XGBoost\n", "\n", "> This chapter will introduce you to the fundamental idea behind XGBoost—boosted learners. Once you understand how XGBoost works, you'll apply it to solve a common classification problem found in industry - predicting whether a customer will stop being a customer at some point in the future. This is the Summary of lecture \"Extreme Gradient Boosting with XGBoost\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "- Supervised Learning\n", " - Relies on labeled data\n", " - Have some understanding of past behavior\n", "- AUC: Metric for binary classification models\n", " - Area Under the ROC Curve (AUC)\n", " - Larger area under the ROC curve = better model\n", "- Other supervised learning considerations\n", " - Features can be either numeric or categorical\n", " - Numeric features should be scaled (Z-scored)\n", " - Categorical features should be encoded (one-hot)\n", " \n", "## Introducing XGBoost\n", "- What is XGBoost? (eXtreme Gradient Boosting)\n", " - Optimized gradient-boosting machine learning library\n", " - Originally written in C++\n", " - Has APIs in several languages:\n", " - Python, R, Scala, Julia, Java\n", "- What makes XGBoost so popular?\n", " - Speed and performance\n", " - Core algorithm is parallelizable\n", " - Consistently outperforms single-algorithm methods\n", " - State-of-the-art performance in many ML tasks" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import xgboost as xgb\n", "\n", "plt.rcParams['figure.figsize'] = (10, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### XGBoost - Fit/Predict\n", "It's time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn `.fit()` / `.predict()` paradigm that you are already familiar to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!\n", "\n", "Here, you'll be working with churn data. This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. \n", "\n", "Your goal is to use the first month's worth of data to predict whether the app's users will remain users of the service at the 5 month mark. This is a typical setup for a churn prediction problem. To do this, you'll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "churn_data = pd.read_csv('./dataset/churn_data.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy: 0.758200\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Create arrays for the features and the target: X, y\n", "X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]\n", "\n", "# Create the training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)\n", "\n", "# Instantiate the XGBClassifier: xg_cl\n", "xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)\n", "\n", "# Fit the classifier to the training set\n", "xg_cl.fit(X_train, y_train)\n", "\n", "# Predict the labels of the test set: preds\n", "preds = xg_cl.predict(X_test)\n", "\n", "# Compute the accuracy: accuracy\n", "accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]\n", "print(\"accuracy: %f\" % (accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is a decision tree?\n", "- Decision trees as base learners\n", " - Base learner: the individual learning algorithm used within an ensemble algorithm\n", " - Composed of a series of binary questions\n", " - Predictions happen at the \"leaves\" of the tree\n", "- CART: Classification And Regression Trees\n", " - Each leaf always contains a real-valued score\n", " - Can later be converted into categories (a short aside illustrating this appears just before the exercise code below)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decision trees\n", "Your task in this exercise is to make a simple decision tree using scikit-learn's `DecisionTreeClassifier` on the breast cancer dataset.\n", "\n", "This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant or benign).\n", "\n", "We've preloaded the dataset of samples (measurements) into `X` and the target values per tumor into `y`. Now, you have to split the complete dataset into training and testing sets, and then train a `DecisionTreeClassifier`. You'll specify a parameter called `max_depth`. Many other parameters can be modified within this model, and you can check all of them out in the scikit-learn documentation."
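, "\n", "\n", "Before the exercise itself, here is a small, self-contained aside on the CART point above: each leaf carries a real-valued score that is only later turned into a category. The cell below is a sketch that uses scikit-learn's bundled copy of the breast cancer data (not the CSV files loaded next) and prints the leaf probabilities from `predict_proba` next to the hard class labels from `predict`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aside: real-valued leaf scores vs. hard class labels in a CART-style tree.\n", "# Uses scikit-learn's bundled breast cancer data, not the exercise CSVs.\n", "from sklearn.datasets import load_breast_cancer\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "X_bc, y_bc = load_breast_cancer(return_X_y=True)\n", "tree = DecisionTreeClassifier(max_depth=2, random_state=123).fit(X_bc, y_bc)\n", "\n", "# Real-valued scores stored at the leaves (class probabilities)...\n", "print(tree.predict_proba(X_bc[:3]))\n", "\n", "# ...are converted into categories by .predict()\n", "print(tree.predict(X_bc[:3]))"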
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "X = pd.read_csv('./dataset/xgb_breast_X.csv').to_numpy()\n", "y = pd.read_csv('./dataset/xgb_breast_y.csv').to_numpy().ravel()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.9649122807017544\n" ] } ], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# Create the training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)\n", "\n", "# Instantiate the classifier: dt_clf_4\n", "dt_clf_4 = DecisionTreeClassifier(max_depth=4)\n", "\n", "# Fit the classifier to the training set\n", "dt_clf_4.fit(X_train, y_train)\n", "\n", "# Predict the labels of the test set: y_pred_4\n", "y_pred_4 = dt_clf_4.predict(X_test)\n", "\n", "# Compute the accuracy of the predictions: accuracy\n", "accuracy = float(np.sum(y_pred_4 == y_test)) / y_test.shape[0]\n", "print(\"Accuracy:\", accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is Boosting?\n", "- Boosting overview\n", " - Not a specific machine learning algorithm\n", " - Concept that can be applied to a set of machine learning models\n", " - \"Meta-algorithm\"\n", " - Ensemble meta-algorithm used to convert many weak learners into a strong learner\n", "- Weak learners and strong learners\n", " - Weak learner: ML algorithm that is slightly better than chance\n", " - Boosting converts a collection of weak learners into a strong learner\n", " - Strong learner: Any algorithm that can be tuned to achieve good performance\n", "- How is boosting accomplished?\n", " - Iteratively learning a set of weak models on subsets of the data\n", " - Weighting each weak prediction according to each weak learner's performance\n", " - Combining the weighted predictions to obtain a single prediction that is much better than the individual predictions themselves (a small stand-in illustration using scikit-learn appears just before the cross-validation code below)\n", "- Model evaluation through cross-validation\n", " - Cross-validation: Robust method for estimating the performance of a model on unseen data\n", " - Generates many non-overlapping train/test splits on training data\n", " - Reports the average test set performance across all data splits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Measuring accuracy\n", "You'll now practice using XGBoost's learning API through its baked-in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets called a `DMatrix`.\n", "\n", "In the previous exercise, the input datasets were converted into `DMatrix` data on the fly, but when you use the `xgb.cv()` function, you have to first explicitly convert your data into a `DMatrix`. So, that's what you will do here before running cross-validation on `churn_data`."
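, "\n", "\n", "First, though, a quick hedged illustration of the boosting idea described above. The cell below reuses the breast cancer train/test split from the decision tree exercise and contrasts a single depth-1 decision stump (a weak learner) with scikit-learn's `AdaBoostClassifier`, which fits many such stumps iteratively and combines their weighted votes. AdaBoost is only a stand-in to show the weak-to-strong effect; it is not the exact algorithm XGBoost uses." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustration only: a weak learner (depth-1 stump) vs. a boosted ensemble of stumps.\n", "# Reuses X_train, X_test, y_train, y_test from the decision tree exercise above.\n", "from sklearn.ensemble import AdaBoostClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# A single depth-1 tree is a weak learner\n", "stump = DecisionTreeClassifier(max_depth=1, random_state=123).fit(X_train, y_train)\n", "print('single stump accuracy:    ', stump.score(X_test, y_test))\n", "\n", "# Boosting combines many weighted stumps into a much stronger learner\n", "boosted = AdaBoostClassifier(n_estimators=50, random_state=123).fit(X_train, y_train)\n", "print('boosted ensemble accuracy:', boosted.score(X_test, y_test))"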
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "churn_data = pd.read_csv('./dataset/churn_data.csv')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " train-error-mean train-error-std test-error-mean test-error-std\n", "0 0.28232 0.002366 0.28378 0.001932\n", "1 0.26951 0.001855 0.27190 0.001932\n", "2 0.25605 0.003213 0.25798 0.003963\n", "3 0.25090 0.001845 0.25434 0.003827\n", "4 0.24654 0.001981 0.24852 0.000934\n", "0.75148\n" ] } ], "source": [ "# Create arrays for the features and the target: X, y\n", "X, y = churn_data.iloc[:, :-1], churn_data.iloc[:, -1]\n", "\n", "# Create the DMatrix from X and y: churn_dmatrix\n", "churn_dmatrix = xgb.DMatrix(data=X, label=y)\n", "\n", "# Create the parameter dictionary: params\n", "params = {\"objective\": \"reg:logistic\", \"max_depth\": 3}\n", "\n", "# Perform cross-validation: cv_results\n", "cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,\n", "                    nfold=3, num_boost_round=5,\n", "                    metrics=\"error\", as_pandas=True, seed=123)\n", "\n", "# Print cv_results\n", "print(cv_results)\n", "\n", "# Print the accuracy\n", "print(((1 - cv_results['test-error-mean']).iloc[-1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`cv_results` stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From `cv_results`, the final round `'test-error-mean'` is extracted and converted into an accuracy, where accuracy is `1-error`. The final cross-validated accuracy of around 75% is in line with the single train/test split estimate from earlier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Measuring AUC\n", "Now that you've used cross-validation to compute average out-of-sample accuracy (after converting from an error), it's very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics) in as an argument to the `metrics` parameter of `xgb.cv()`.\n", "\n", "Your job in this exercise is to compute another common metric used in binary classification - the area under the curve (`\"auc\"`)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " train-auc-mean train-auc-std test-auc-mean test-auc-std\n", "0 0.768893 0.001544 0.767863 0.002820\n", "1 0.790864 0.006758 0.789157 0.006846\n", "2 0.815872 0.003900 0.814476 0.005997\n", "3 0.822959 0.002018 0.821682 0.003912\n", "4 0.827528 0.000769 0.826191 0.001937\n", "0.826191\n" ] } ], "source": [ "# Perform cross-validation: cv_results\n", "cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,\n", "                    nfold=3, num_boost_round=5,\n", "                    metrics=\"auc\", as_pandas=True, seed=123)\n", "\n", "# Print cv_results\n", "print(cv_results)\n", "\n", "# Print the AUC\n", "print((cv_results[\"test-auc-mean\"]).iloc[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An AUC of around 0.83 is quite strong. As you have seen, XGBoost's learning API makes it very easy to compute any metric you may be interested in. In Chapter 3, you'll learn about techniques to fine-tune your XGBoost models to improve their performance even further. For now, it's time to learn a little about exactly when to use XGBoost."
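, "\n", "\n", "One small aside before moving on: the point above about passing a list of metrics can be tried directly. The sketch below reuses `churn_dmatrix` and `params` from the earlier cells and asks `xgb.cv()` for both `auc` and `error` in a single run, so the resulting DataFrame contains columns for both metrics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aside: xgb.cv accepts a list of metrics, not just a single string.\n", "# Reuses churn_dmatrix and params defined in the earlier cells.\n", "cv_multi = xgb.cv(dtrain=churn_dmatrix, params=params,\n", "                  nfold=3, num_boost_round=5,\n", "                  metrics=['auc', 'error'], as_pandas=True, seed=123)\n", "\n", "# Columns now cover both metrics (train/test mean and std for each)\n", "print(cv_multi.columns.tolist())"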
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When should I use XGBoost?\n", "- When to use XGBoost\n", " - You have a large number of training samples\n", " - Greater than 1,000 training samples and fewer than 100 features\n", " - The number of features < number of training samples\n", " - You have a mixture of categorical and numeric features\n", " - Or just numeric features\n", "- When NOT to use XGBoost\n", " - Image recognition\n", " - Computer vision\n", " - Natural language processing and understanding problems\n", " - When the number of training samples is significantly smaller than the number of features" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }