{ "cells": [ { "cell_type": "code", "execution_count": 567, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import patsy as pt\n", "from mpl_toolkits.mplot3d import Axes3D\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set()\n", "\n", "from IPython.display import HTML\n", "from ipywidgets import interact\n", "import ipywidgets as widgets\n", "import copy\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "from sklearn import tree\n", "import graphviz \n", "from sklearn import metrics\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn import datasets\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.linear_model import Lasso\n", "import statsmodels.api as sm\n", "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. In the lab, we applied random forests to the Boston data using mtry=6 and using ntree=25 and ntree=500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained." ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \\\n", "0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 \n", "1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 \n", "2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 \n", "3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 \n", "4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 \n", "\n", " PTRATIO B LSTAT Price \n", "0 15.3 396.90 4.98 24.0 \n", "1 17.8 396.90 9.14 21.6 \n", "2 17.8 392.83 4.03 34.7 \n", "3 18.7 394.63 2.94 33.4 \n", "4 18.7 396.90 5.33 36.2 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "boston_df = datasets.load_boston()\n", "boston_df = pd.DataFrame(data=np.c_[boston_df['data'], boston_df['target']], columns= [c for c in boston_df['feature_names']] + ['Price'])\n", "\n", "np.random.seed(1)\n", "train = np.random.rand(len(boston_df)) < 0.5\n", "\n", "display(boston_df.head())\n", "\n", "# Create design and response matrix\n", "f = 'Price ~ ' + ' + '.join(boston_df.columns.drop(['Price']))\n", "y, X = pt.dmatrices(f, boston_df)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Compare test RMSE of Random Forest for various numbers of features considered at each split (mtry)\n", "# and increasing number of trees (ntree)\n", "\n", "max_features = {'p': X.shape[1], \n", " 'p/2': int(np.around(X.shape[1]/2)),\n", " '$\\sqrt{p}$': int(np.around(np.sqrt(X.shape[1]))),\n", " '1': 1} \n", "\n", "results = []\n", "for mtry in max_features:\n", " for tree_count in np.arange(1, 100):\n", " regr = RandomForestRegressor(max_features=max_features[mtry], random_state=0, n_estimators=tree_count)\n", " regr.fit(X[train], y[train])\n", " y_hat = regr.predict(X[~train])\n", " \n", " mse = metrics.mean_squared_error(y[~train], y_hat)\n", " rmse = np.sqrt(mse)\n", " results+= [[tree_count, mtry, rmse]]\n", "\n", "plt.figure(figsize=(10,10))\n", "sns.lineplot(x='Number of Trees', y='RMSE', hue='mtry', \n", " data=pd.DataFrame(results, columns=['Number of Trees', 'mtry', 'RMSE']));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above shows the test RMSE for a 50% holdout set where training set contains 240 observations each with 14 predictors.\n", "\n", "We find that the test RMSE decreases with increasing number of trees for all values of mtry (number of random features considered at each split) \n", "\n", "The optimal value of m is less than p and greater than 1. In this case $m=\\sqrt{p}$ yields the best results for $ntree > ?$ although $p/2$ performs better for some lower tree counts.\n", "\n", "Incredibly the model does not seem prone to overfitting with test RMSE continuing to decrees up to ntree = 100.\n", "\n", "### 8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (a) Split the data set into a training set and a test set.\n", "\n", "#### (b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?" ] }, { "cell_type": "code", "execution_count": 246, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Sales CompPrice Income Advertising Population Price ShelveLoc Age \\\n", "0 9.50 138 73 11 276 120 Bad 42 \n", "1 11.22 111 48 16 260 83 Good 65 \n", "2 10.06 113 35 10 269 80 Medium 59 \n", "3 7.40 117 100 4 466 97 Medium 55 \n", "4 4.15 141 64 3 340 128 Bad 38 \n", "\n", " Education Urban US \n", "0 17 Yes Yes \n", "1 10 Yes Yes \n", "2 12 Yes Yes \n", "3 14 Yes Yes \n", "4 13 Yes No " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "carseats_df = pd.read_csv('./data/Carseats.csv')\n", "\n", "# Check for missing values\n", "assert carseats_df.isnull().sum().sum() == 0\n", "# Drop unused index\n", "carseats_df = carseats_df.drop('Unnamed: 0', axis=1)\n", "\n", "# Create index for training set\n", "np.random.seed(1)\n", "train = np.random.random(len(carseats_df)) > 0.5\n", "\n", "display(carseats_df.head())" ] }, { "cell_type": "code", "execution_count": 247, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test MSE: 5.09\n", "Test RMSE: 2.256\n" ] } ], "source": [ "# Use all features excpet response features\n", "# No intercept\n", "\n", "preds = carseats_df.columns.drop(['Sales'])\n", "#preds_scaled = ['standardize({})'.format(p) for p in preds]\n", "f = 'Sales ~ 0 +' + ' + '.join(preds)\n", "y, X = pt.dmatrices(f, carseats_df)\n", "y = y.flatten()\n", "\n", "# Fit Sklearn's tree regressor\n", "clf = tree.DecisionTreeRegressor(max_depth=5).fit(X[train], y[train])\n", "\n", "# Measure test set MSE\n", "y_hat = clf.predict(X[~train])\n", "mse = metrics.mean_squared_error(y[~train], y_hat)\n", "\n", "# Get proportion of correct classifications on test set\n", "print('Test MSE: {}'.format(np.around(mse, 3)))\n", "print('Test RMSE: {}'.format(np.around(np.sqrt(mse), 3)))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?" 
] }, { "cell_type": "code", "execution_count": 248, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " RMSE upper lower\n", "max_leaf_nodes \n", "31.0 2.188901 2.4201 1.957702" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Test MSE : 4.661\n", "Test RMSE: 2.159\n" ] } ], "source": [ "# How about using CV to compare trees with different number of leaf nodes \n", "# as defined by the max_leaf_nodes parameter?\n", "\n", "cv_folds = 10\n", "tuning_param = 'max_leaf_nodes'\n", "columns=[tuning_param, 'RMSE', 'upper', 'lower']\n", "\n", "results = []\n", "for m in np.arange(2, 100):\n", " regr = tree.DecisionTreeRegressor(max_leaf_nodes=m)\n", " scores = cross_val_score(regr, X[train], y[train], cv=cv_folds, scoring='neg_mean_squared_error')\n", " rmses = np.sqrt(np.absolute(scores))\n", " rmse = np.mean(rmses)\n", " conf_int = np.std(rmses) *2\n", " results += [[m, rmse, rmse+conf_int, rmse-conf_int]]\n", "\n", "\n", "# Plot classification accuracy for each max_depth cv result\n", "plot_df = pd.DataFrame(np.asarray(results), columns=columns).set_index(tuning_param)\n", "plt.figure(figsize=(10,10))\n", "sns.lineplot(data=plot_df)\n", "plt.ylabel('RMSE')\n", "plt.show();\n", "\n", "# Show chosen model\n", "chosen = plot_df[plot_df['RMSE'] == plot_df['RMSE'].min()]\n", "display(chosen)\n", "\n", "# Use chosen model for test prediction\n", "regr = tree.DecisionTreeRegressor(max_leaf_nodes=int(chosen.index[0])).fit(X[train], y[train])\n", "y_hat = regr.predict(X[~train])\n", "mse = metrics.mean_squared_error(y[~train], y_hat)\n", "\n", "# Get proportion of correct classifications on test set\n", "print('Test MSE : {}'.format(np.around(mse, 3)))\n", "print('Test RMSE: {}'.format(np.around(np.sqrt(mse), 3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "10-fold cross validation selects a pruned tree model that achieves test MSE of 4.524, an improvement on the unpruned model (5.066). 
Interestingly, 100-fold, 5-fold and 2-fold CV all failed to select a model that improves on the unpruned model. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important." ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test MSE : 2.615\n", "Test RMSE: 1.617\n" ] } ], "source": [ "# Bagging with 100 trees\n", "# Although this uses RandomForestRegressor, it is bagging because max_features equals the number of predictors\n", "\n", "max_features = X.shape[1]\n", "tree_count = 100\n", "\n", "regr = RandomForestRegressor(max_features=max_features, random_state=0, n_estimators=tree_count)\n", "regr.fit(X[train], y[train])\n", "y_hat = regr.predict(X[~train])\n", "\n", "mse = metrics.mean_squared_error(y[~train], y_hat)\n", "rmse = np.sqrt(mse)\n", "\n", "print('Test MSE : {}'.format(np.around(mse, 3)))\n", "print('Test RMSE: {}'.format(np.around(rmse, 3)))" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot feature by importance in this model\n", "plot_df = pd.DataFrame({'feature': X.design_info.column_names, 'importance': regr.feature_importances_})\n", "\n", "plt.figure(figsize=(10,10))\n", "sns.barplot(x='importance', y='feature', data=plot_df.sort_values('importance', ascending=False),\n", " color='b')\n", "plt.xticks(rotation=90);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bagging yields a significantly improved test MSE of 2.615 compared with 4.524 for the optimal pruned tree.\n", "\n", "The bagging model indicates that instore Shelve Location (Good) and Price of the carseat are the most significant features affecting Sales revenue. This aligns with our observation when performing classification in hte lab." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained." ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE test: 2.544\n", "RMSE test: 1.595\n" ] } ], "source": [ "# Random Forest with 100 trees and 4 features considered at each split\n", "\n", "max_features = 7\n", "tree_count = 100\n", "\n", "regr = RandomForestRegressor(max_features=max_features, random_state=0, n_estimators=tree_count)\n", "regr.fit(X[train], y[train])\n", "y_hat = regr.predict(X[~train])\n", "\n", "mse = metrics.mean_squared_error(y[~train], y_hat)\n", "rmse = np.sqrt(mse)\n", "\n", "print('MSE test: {}'.format(np.around(mse, 3)))\n", "print('RMSE test: {}'.format(np.around(rmse, 3)))" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot feature by importance in this model\n", "plot_df = pd.DataFrame({'feature': X.design_info.column_names, 'importance': regr.feature_importances_})\n", "\n", "plt.figure(figsize=(10,10))\n", "sns.barplot(x='importance', y='feature', data=plot_df.sort_values('importance', ascending=False),\n", " color='b')\n", "plt.xticks(rotation=90);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Random forest with 7 predictors at each split yields a test MSE 2.544 similar to bagging (2.615). A similar feature importance is ascribed." ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Describe the effect of m, the number of variables considered at each split, on the error rate obtained.\n", "\n", "results = []\n", "for max_features in np.arange(1, X.shape[1]):\n", "\n", " tree_count = 100\n", " \n", " regr = RandomForestRegressor(max_features=max_features, random_state=0, n_estimators=tree_count)\n", " regr.fit(X[train], y[train])\n", " y_hat = regr.predict(X[~train])\n", " \n", " mse = metrics.mean_squared_error(y[~train], y_hat)\n", " rmse = np.sqrt(mse)\n", " \n", " results += [[max_features, mse]]\n", "\n", "plt.figure(figsize=(10,5))\n", "sns.lineplot(x='Split Variables', y='MSE', data=pd.DataFrame(results, columns=['Split Variables', 'MSE']));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. This problem involves the OJ data set which is part of the ISLR package.\n", "\n", "### (a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations." ] }, { "cell_type": "code", "execution_count": 251, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM \\\n", "0 CH 237 1 1.75 1.99 0.00 0.0 \n", "1 CH 239 1 1.75 1.99 0.00 0.3 \n", "2 CH 245 1 1.86 2.09 0.17 0.0 \n", "3 MM 227 1 1.69 1.69 0.00 0.0 \n", "4 CH 228 7 1.69 1.69 0.00 0.0 \n", "\n", " SpecialCH SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 \\\n", "0 0 0 0.500000 1.99 1.75 0.24 No \n", "1 0 1 0.600000 1.69 1.75 -0.06 No \n", "2 0 0 0.680000 2.09 1.69 0.40 No \n", "3 0 0 0.400000 1.69 1.69 0.00 No \n", "4 0 0 0.956535 1.69 1.69 0.00 Yes \n", "\n", " PctDiscMM PctDiscCH ListPriceDiff STORE \n", "0 0.000000 0.000000 0.24 1 \n", "1 0.150754 0.000000 0.24 1 \n", "2 0.000000 0.091398 0.23 1 \n", "3 0.000000 0.000000 0.00 1 \n", "4 0.000000 0.000000 0.00 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "oj_df = pd.read_csv('./data/oj.csv')\n", "oj_df = oj_df.drop(oj_df.columns[0], axis=1)\n", "display(oj_df.head())\n", "\n", "# Index for Training set of 800\n", "np.random.seed(1)\n", "train_sample = np.random.choice(np.arange(len(oj_df)), size=800, replace=False)\n", "train = np.asarray([(i in train_sample) for i in oj_df.index])\n", "\n", "#oj_df.Purchase = oj_df.Purchase.map({'CH' : 1, 'MM': 0})\n", "#oj_df.Store7 = oj_df.Store7.map({'Yes' : 1, 'No': 0})\n", "#oj_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?" 
] }, { "cell_type": "code", "execution_count": 286, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training accuracy: 0.99\n", "leaf nodes: 6\n" ] } ], "source": [ "f = 'C(Purchase) ~ ' + ' + '.join(oj_df.columns.drop(['Purchase']))\n", "y, X = pt.dmatrices(f, oj_df)\n", "y = y[:, 0]\n", "\n", "# Fit Sklearns tree classifier\n", "clf = tree.DecisionTreeClassifier().fit(X[train], y[train])\n", "\n", "print('training accuracy: {}'.format(np.around(clf.score(X[train], y[train]), 3)))\n", "print('leaf nodes: 6')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.\n", "\n", "### (d) Create a plot of the tree, and interpret the results." ] }, { "cell_type": "code", "execution_count": 287, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "Tree\n", "\n", "\n", "\n", "0\n", "\n", "LoyalCH <= 0.482\n", "gini = 0.479\n", "samples = 800\n", "value = [319, 481]\n", "\n", "\n", "\n", "1\n", "\n", "PriceDiff <= 0.31\n", "gini = 0.338\n", "samples = 297\n", "value = [233, 64]\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "True\n", "\n", "\n", "\n", "154\n", "\n", "LoyalCH <= 0.765\n", "gini = 0.283\n", "samples = 503\n", "value = [86, 417]\n", "\n", "\n", "\n", "0->154\n", "\n", "\n", "False\n", "\n", "\n", "\n", "2\n", "\n", "StoreID <= 3.5\n", "gini = 0.266\n", "samples = 234\n", "value = [197, 37]\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "107\n", "\n", "LoyalCH <= 0.173\n", "gini = 0.49\n", "samples = 63\n", "value = [36, 27]\n", "\n", "\n", "\n", "1->107\n", "\n", "\n", "\n", "\n", "\n", "3\n", "\n", "LoyalCH <= 0.434\n", "gini = 0.19\n", "samples = 179\n", "value = [160, 19]\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "76\n", "\n", "SalePriceCH <= 1.72\n", "gini = 0.44\n", "samples = 55\n", 
"value = [37, 18]\n", "\n", "\n", "\n", "2->76\n", "\n", "\n", "\n", "\n", "\n", "4\n", "\n", "LoyalCH <= 0.035\n", "gini = 0.156\n", "samples = 164\n", "value = [150, 14]\n", "\n", "\n", "\n", "3->4\n", "\n", "\n", "\n", "\n", "\n", "65\n", "\n", "LoyalCH <= 0.437\n", "gini = 0.444\n", "samples = 15\n", "value = [10, 5]\n", "\n", "\n", "\n", "3->65\n", "\n", "\n", "\n", "\n", "\n", "5\n", "\n", "gini = 0.0\n", "samples = 45\n", "value = [45, 0]\n", "\n", "\n", "\n", "4->5\n", "\n", "\n", "\n", "\n", "\n", "6\n", "\n", "LoyalCH <= 0.039\n", "gini = 0.208\n", "samples = 119\n", "value = [105, 14]\n", "\n", "\n", "\n", "4->6\n", "\n", "\n", "\n", "\n", "\n", "7\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [0, 1]\n", "\n", "\n", "\n", "6->7\n", "\n", "\n", "\n", "\n", "\n", "8\n", "\n", "PriceCH <= 1.755\n", "gini = 0.196\n", "samples = 118\n", "value = [105, 13]\n", "\n", "\n", "\n", "6->8\n", "\n", "\n", "\n", "\n", "\n", "9\n", "\n", "gini = 0.0\n", "samples = 32\n", "value = [32, 0]\n", "\n", "\n", "\n", "8->9\n", "\n", "\n", "\n", "\n", "\n", "10\n", "\n", "DiscCH <= 0.05\n", "gini = 0.257\n", "samples = 86\n", "value = [73, 13]\n", "\n", "\n", "\n", "8->10\n", "\n", "\n", "\n", "\n", "\n", "11\n", "\n", "PctDiscMM <= 0.187\n", "gini = 0.23\n", "samples = 83\n", "value = [72, 11]\n", "\n", "\n", "\n", "10->11\n", "\n", "\n", "\n", "\n", "\n", "62\n", "\n", "DiscCH <= 0.15\n", "gini = 0.444\n", "samples = 3\n", "value = [1, 2]\n", "\n", "\n", "\n", "10->62\n", "\n", "\n", "\n", "\n", "\n", "12\n", "\n", "ListPriceDiff <= 0.215\n", "gini = 0.282\n", "samples = 59\n", "value = [49, 10]\n", "\n", "\n", "\n", "11->12\n", "\n", "\n", "\n", "\n", "\n", "57\n", "\n", "PctDiscMM <= 0.384\n", "gini = 0.08\n", "samples = 24\n", "value = [23, 1]\n", "\n", "\n", "\n", "11->57\n", "\n", "\n", "\n", "\n", "\n", "13\n", "\n", "LoyalCH <= 0.382\n", "gini = 0.159\n", "samples = 23\n", "value = [21, 2]\n", "\n", "\n", "\n", "12->13\n", "\n", "\n", "\n", "\n", "\n", "22\n", 
"\n", "WeekofPurchase <= 263.0\n", "gini = 0.346\n", "samples = 36\n", "value = [28, 8]\n", "\n", "\n", "\n", "12->22\n", "\n", "\n", "\n", "\n", "\n", "14\n", "\n", "SalePriceMM <= 1.74\n", "gini = 0.095\n", "samples = 20\n", "value = [19, 1]\n", "\n", "\n", "\n", "13->14\n", "\n", "\n", "\n", "\n", "\n", "19\n", "\n", "WeekofPurchase <= 230.0\n", "gini = 0.444\n", "samples = 3\n", "value = [2, 1]\n", "\n", "\n", "\n", "13->19\n", "\n", "\n", "\n", "\n", "\n", "15\n", "\n", "LoyalCH <= 0.162\n", "gini = 0.32\n", "samples = 5\n", "value = [4, 1]\n", "\n", "\n", "\n", "14->15\n", "\n", "\n", "\n", "\n", "\n", "18\n", "\n", "gini = 0.0\n", "samples = 15\n", "value = [15, 0]\n", "\n", "\n", "\n", "14->18\n", "\n", "\n", "\n", "\n", "\n", "16\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [0, 1]\n", "\n", "\n", "\n", "15->16\n", "\n", "\n", "\n", "\n", "\n", "17\n", "\n", "gini = 0.0\n", "samples = 4\n", "value = [4, 0]\n", "\n", "\n", "\n", "15->17\n", "\n", "\n", "\n", "\n", "\n", "20\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [0, 1]\n", "\n", "\n", "\n", "19->20\n", "\n", "\n", "\n", "\n", "\n", "21\n", "\n", "gini = 0.0\n", "samples = 2\n", "value = [2, 0]\n", "\n", "\n", "\n", "19->21\n", "\n", "\n", "\n", "\n", "\n", "23\n", "\n", "LoyalCH <= 0.065\n", "gini = 0.305\n", "samples = 32\n", "value = [26, 6]\n", "\n", "\n", "\n", "22->23\n", "\n", "\n", "\n", "\n", "\n", "54\n", "\n", "LoyalCH <= 0.346\n", "gini = 0.5\n", "samples = 4\n", "value = [2, 2]\n", "\n", "\n", "\n", "22->54\n", "\n", "\n", "\n", "\n", "\n", "24\n", "\n", "PriceMM <= 2.19\n", "gini = 0.5\n", "samples = 2\n", "value = [1, 1]\n", "\n", "\n", "\n", "23->24\n", "\n", "\n", "\n", "\n", "\n", "27\n", "\n", "LoyalCH <= 0.118\n", "gini = 0.278\n", "samples = 30\n", "value = [25, 5]\n", "\n", "\n", "\n", "23->27\n", "\n", "\n", "\n", "\n", "\n", "25\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [0, 1]\n", "\n", "\n", "\n", "24->25\n", "\n", "\n", "\n", "\n", "\n", "26\n", "\n", 
"gini = 0.0\n", "samples = 1\n", "value = [1, 0]\n", "\n", "\n", "\n", "24->26\n", "\n", "\n", "\n", "\n", "\n", "28\n", "\n", "gini = 0.0\n", "samples = 6\n", "value = [6, 0]\n", "\n", "\n", "\n", "27->28\n", "\n", "\n", "\n", "\n", "\n", "29\n", "\n", "LoyalCH <= 0.138\n", "gini = 0.33\n", "samples = 24\n", "value = [19, 5]\n", "\n", "\n", "\n", "27->29\n", "\n", "\n", "\n", "\n", "\n", "30\n", "\n", "SalePriceCH <= 1.89\n", "gini = 0.5\n", "samples = 2\n", "value = [1, 1]\n", "\n", "\n", "\n", "29->30\n", "\n", "\n", "\n", "\n", "\n", "33\n", "\n", "ListPriceDiff <= 0.27\n", "gini = 0.298\n", "samples = 22\n", "value = [18, 4]\n", "\n", "\n", "\n", "29->33\n", "\n", "\n", "\n", "\n", "\n", "31\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [1, 0]\n", "\n", "\n", "\n", "30->31\n", "\n", "\n", "\n", "\n", "\n", "32\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [0, 1]\n", "\n", "\n", "\n", "30->32\n", "\n", "\n", "\n", "\n", "\n", "34\n", "\n", "WeekofPurchase <= 246.5\n", "gini = 0.153\n", "samples = 12\n", "value = [11, 1]\n", "\n", "\n", "\n", "33->34\n", "\n", "\n", "\n", "\n", "\n", "41\n", "\n", "LoyalCH <= 0.182\n", "gini = 0.42\n", "samples = 10\n", "value = [7, 3]\n", "\n", "\n", "\n", "33->41\n", "\n", "\n", "\n", "\n", "\n", "35\n", "\n", "gini = 0.0\n", "samples = 7\n", "value = [7, 0]\n", "\n", "\n", "\n", "34->35\n", "\n", "\n", "\n", "\n", "\n", "36\n", "\n", "StoreID <= 2.0\n", "gini = 0.32\n", "samples = 5\n", "value = [4, 1]\n", "\n", "\n", "\n", "34->36\n", "\n", "\n", "\n", "\n", "\n", "37\n", "\n", "SalePriceMM <= 2.04\n", "gini = 0.5\n", "samples = 2\n", "value = [1, 1]\n", "\n", "\n", "\n", "36->37\n", "\n", "\n", "\n", "\n", "\n", "40\n", "\n", "gini = 0.0\n", "samples = 3\n", "value = [3, 0]\n", "\n", "\n", "\n", "36->40\n", "\n", "\n", "\n", "\n", "\n", "38\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [1, 0]\n", "\n", "\n", "\n", "37->38\n", "\n", "\n", "\n", "\n", "\n", "39\n", "\n", "gini = 0.0\n", "samples = 1\n", 
"value = [0, 1]\n", "\n", "\n", "\n", "37->39\n", "\n", "\n", "\n", "\n", "\n", "42\n", "\n", "gini = 0.0\n", "samples = 2\n", "value = [2, 0]\n", "\n", "\n", "\n", "41->42\n", "\n", "\n", "\n", "\n", "\n", "43\n", "\n", "LoyalCH <= 0.228\n", "gini = 0.469\n", "samples = 8\n", "value = [5, 3]\n", "\n", "\n", "\n", "41->43\n", "\n", "\n", "\n", "\n", "\n", "44\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [0, 1]\n", "\n", "\n", "\n", "43->44\n", "\n", "\n", "\n", "\n", "\n", "45\n", "\n", "PriceMM <= 2.235\n", "gini = 0.408\n", "samples = 7\n", "value = [5, 2]\n", "\n", "\n", "\n", "43->45\n", "\n", "\n", "\n", "\n", "\n", "46\n", "\n", "LoyalCH <= 0.278\n", "gini = 0.48\n", "samples = 5\n", "value = [3, 2]\n", "\n", "\n", "\n", "45->46\n", "\n", "\n", "\n", "\n", "\n", "53\n", "\n", "gini = 0.0\n", "samples = 2\n", "value = [2, 0]\n", "\n", "\n", "\n", "45->53\n", "\n", "\n", "\n", "\n", "\n", "47\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [1, 0]\n", "\n", "\n", "\n", "46->47\n", "\n", "\n", "\n", "\n", "\n", "48\n", "\n", "ListPriceDiff <= 0.31\n", "gini = 0.5\n", "samples = 4\n", "value = [2, 2]\n", "\n", "\n", "\n", "46->48\n", "\n", "\n", "\n", "\n", "\n", "49\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [0, 1]\n", "\n", "\n", "\n", "48->49\n", "\n", "\n", "\n", "\n", "\n", "50\n", "\n", "LoyalCH <= 0.36\n", "gini = 0.444\n", "samples = 3\n", "value = [2, 1]\n", "\n", "\n", "\n", "48->50\n", "\n", "\n", "\n", "\n", "\n", "51\n", "\n", "gini = 0.0\n", "samples = 1\n", "value = [1, 0]\n", "\n", "\n", "\n", "50->51\n", "\n", "\n", "\n", "\n", "\n", "52\n", "\n", "gini = 0.5\n", "samples = 2\n", "value = [1, 1]\n", "\n", "\n", "\n", "50->52\n", "\n", "\n", "\n", "\n", "\n", "55\n", "\n", "gini = 0.0\n", "samples = 2\n", "value = [0, 2]\n", "\n", "\n", "\n", "54->55\n", "\n", "\n", "\n", "\n", "\n", "56\n", "\n", "gini = 0.0\n", "samples = 2\n", "value = [2, 0]\n", "\n", "\n", "\n", "54->56\n", "\n", "\n", "\n", "\n", "\n", "58\n", "\n", "gini 
= 0.0\n", "samples = 19\n", "value = [19, 0]\n", "[Figure: GraphViz rendering of the fitted classification tree. Each node shows its split rule, Gini impurity, sample count, and per-class counts; the remaining node and edge labels are omitted here.]\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualise the tree with GraphViz\n", "dot_data = tree.export_graphviz(clf, out_file=None,\n", " feature_names=X.design_info.column_names, \n", " filled=True, rounded=True)\n", "graph = graphviz.Source(dot_data) \n", "display(HTML(graph._repr_svg_()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?" 
] }, { "cell_type": "code", "execution_count": 288, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 74 24]\n", " [ 40 132]]\n", "\n", "test error rate: 0.237\n" ] } ], "source": [ "# Here's the confusion matrix\n", "print(confusion_matrix(y[~train], clf.predict(X[~train])))\n", "\n", "test_err = 1 - clf.score(X[~train], y[~train])\n", "print('\\ntest error rate: {}'.format(np.around(test_err, 3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.\n", "\n", "### (g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.\n", "\n", "### (h) Which tree size corresponds to the lowest cross-validated classification error rate?" ] }, { "cell_type": "code", "execution_count": 308, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

Optimal tree size:

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Leaves | Test
8.0 | 0.811304
\n", "
" ], "text/plain": [ " Test\n", "Leaves \n", "8.0 0.811304" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Estimate optimal tree with cross validation on training set\n", "\n", "cv_folds = 10\n", "\n", "results = []\n", "for mln in np.arange(2, 50):\n", " clf = tree.DecisionTreeClassifier(max_leaf_nodes=mln)\n", " score = cross_val_score(clf, X[train], y[train], cv=cv_folds)\n", " results += [[mln, np.mean(score)]]\n", "\n", "\n", "plt.figure(figsize=(10,5))\n", "plot_df = pd.DataFrame(np.asarray(results), columns=['Leaves', 'Test']).set_index('Leaves')\n", "sns.lineplot(data=plot_df);\n", "plt.ylabel('accuracy')\n", "plt.show();\n", "\n", "display(HTML('

Optimal tree size:

'))\n", "display(plot_df[plot_df['Test'] == plot_df['Test'].max()])" ] }, { "cell_type": "code", "execution_count": 285, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

Optimal tree size:

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Leaves | Train | Test
2.0 | 0.8125 | 0.8
3.0 | 0.8125 | 0.8
\n", "
" ], "text/plain": [ " Train Test\n", "Leaves \n", "2.0 0.8125 0.8\n", "3.0 0.8125 0.8" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Determine actual optimal tree using test set.\n", "\n", "results = []\n", "for mln in np.arange(2, 50):\n", " clf = tree.DecisionTreeClassifier(max_leaf_nodes=mln).fit(X[train], y[train])\n", "\n", " accuracy_train = clf.score(X[train], y[train]) \n", " accuracy_test = clf.score(X[~train], y[~train]) \n", " results += [[mln, accuracy_train, accuracy_test]]\n", "\n", "plt.figure(figsize=(10,5))\n", "plot_df = pd.DataFrame(np.asarray(results), columns=['Leaves', 'Train', 'Test']).set_index('Leaves')\n", "sns.lineplot(data=plot_df);\n", "plt.ylabel('accuracy')\n", "plt.show();\n", "\n", "display(HTML('

Optimal tree size:

'))\n", "display(plot_df[plot_df['Test'] == plot_df['Test'].max()])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.\n", "\n", "### (j) Compare the training error rates between the pruned and unpruned trees. Which is higher?\n", "\n", "### (k) Compare the test error rates between the pruned and unpruned trees. Which is higher?" ] }, { "cell_type": "code", "execution_count": 326, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
(index) | test | error rate
0 | unpruned_train | 0.010000000000000009
1 | pruned_train | 0.15000000000000002
2 | unpruned_test | 0.24444444444444446
3 | pruned_test | 0.21851851851851856
\n", "
" ], "text/plain": [ " test error rate\n", "0 unpruned_train 0.010000000000000009\n", "1 pruned_train 0.15000000000000002\n", "2 unpruned_test 0.24444444444444446\n", "3 pruned_test 0.21851851851851856" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "clf_unpruned = tree.DecisionTreeClassifier().fit(X[train], y[train])\n", "clf_pruned = tree.DecisionTreeClassifier(max_leaf_nodes=8).fit(X[train], y[train])\n", "\n", "scores = [['unpruned_train', 1 - clf_unpruned.score(X[train], y[train])],\n", " ['pruned_train', 1 - clf_pruned.score(X[train], y[train])],\n", " ['unpruned_test', 1 - clf_unpruned.score(X[~train], y[~train])],\n", " ['pruned_test', 1 - clf_pruned.score(X[~train], y[~train])]]\n", "\n", "plot_df = pd.DataFrame(scores, columns=['test', 'error rate'])\n", "\n", "plt.figure(figsize=(10, 6))\n", "sns.barplot(x='test', y='error rate', data=plot_df)\n", "plt.show();\n", "\n", "display(plot_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The unpruned tree performs best in the training setting where as the pruned tree performs best in the test setting. This suggests that the unpruned tree is overfitting the training data leading to poor test score." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. We now use boosting to predict Salary in the Hitters data set.\n", "\n", "### (a) Remove the observations for whom the salary information is unknown, and then log-transform the salaries." ] }, { "cell_type": "code", "execution_count": 341, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
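(index) | Intercept | League[T.N] | Division[T.W] | NewLeague[T.N] | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors
0 | 1.0 | 1.0 | 1.0 | 1.0 | 315.0 | 81.0 | 7.0 | 24.0 | 38.0 | 39.0 | 14.0 | 3449.0 | 835.0 | 69.0 | 321.0 | 414.0 | 375.0 | 632.0 | 43.0 | 10.0
1 | 1.0 | 0.0 | 1.0 | 0.0 | 479.0 | 130.0 | 18.0 | 66.0 | 72.0 | 76.0 | 3.0 | 1624.0 | 457.0 | 63.0 | 224.0 | 266.0 | 263.0 | 880.0 | 82.0 | 14.0
2 | 1.0 | 1.0 | 0.0 | 1.0 | 496.0 | 141.0 | 20.0 | 65.0 | 78.0 | 37.0 | 11.0 | 5628.0 | 1575.0 | 225.0 | 828.0 | 838.0 | 354.0 | 200.0 | 11.0 | 3.0
3 | 1.0 | 1.0 | 0.0 | 1.0 | 321.0 | 87.0 | 10.0 | 39.0 | 42.0 | 30.0 | 2.0 | 396.0 | 101.0 | 12.0 | 48.0 | 46.0 | 33.0 | 805.0 | 40.0 | 4.0
4 | 1.0 | 0.0 | 1.0 | 0.0 | 594.0 | 169.0 | 4.0 | 74.0 | 51.0 | 35.0 | 11.0 | 4408.0 | 1133.0 | 19.0 | 501.0 | 336.0 | 194.0 | 282.0 | 421.0 | 25.0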
\n", "
" ], "text/plain": [ " Intercept League[T.N] Division[T.W] NewLeague[T.N] AtBat Hits HmRun \\\n", "0 1.0 1.0 1.0 1.0 315.0 81.0 7.0 \n", "1 1.0 0.0 1.0 0.0 479.0 130.0 18.0 \n", "2 1.0 1.0 0.0 1.0 496.0 141.0 20.0 \n", "3 1.0 1.0 0.0 1.0 321.0 87.0 10.0 \n", "4 1.0 0.0 1.0 0.0 594.0 169.0 4.0 \n", "\n", " Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks \\\n", "0 24.0 38.0 39.0 14.0 3449.0 835.0 69.0 321.0 414.0 375.0 \n", "1 66.0 72.0 76.0 3.0 1624.0 457.0 63.0 224.0 266.0 263.0 \n", "2 65.0 78.0 37.0 11.0 5628.0 1575.0 225.0 828.0 838.0 354.0 \n", "3 39.0 42.0 30.0 2.0 396.0 101.0 12.0 48.0 46.0 33.0 \n", "4 74.0 51.0 35.0 11.0 4408.0 1133.0 19.0 501.0 336.0 194.0 \n", "\n", " PutOuts Assists Errors \n", "0 632.0 43.0 10.0 \n", "1 880.0 82.0 14.0 \n", "2 200.0 11.0 3.0 \n", "3 805.0 40.0 4.0 \n", "4 282.0 421.0 25.0 " ] }, "execution_count": 341, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hitters_df = pd.read_csv('./data/Hitters.csv')\n", "\n", "# Drop null observations\n", "hitters_df = hitters_df.dropna()\n", "assert hitters_df.isnull().sum().sum() == 0\n", "\n", "f = 'np.log(Salary) ~ ' + ' + '.join(hitters_df.columns.drop(['Salary']))\n", "y, X = pt.dmatrices(f, hitters_df)\n", "\n", "pd.DataFrame(X, columns=X.design_info.column_names).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (b) Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations." ] }, { "cell_type": "code", "execution_count": 346, "metadata": {}, "outputs": [], "source": [ "# Index for Training set of 200\n", "np.random.seed(1)\n", "train_sample = np.random.choice(np.arange(len(hitters_df)), size=200, replace=False)\n", "train = np.asarray([(i in train_sample) for i in hitters_df.index])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (c) Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. 
Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (d) Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis." ] }, { "cell_type": "code", "execution_count": 465, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
learning_rate | mse_train | mse_test
0.00229 | 0.071057 | 0.206531
\n", "
" ], "text/plain": [ " mse_train mse_test\n", "learning_rate \n", "0.00229 0.071057 0.206531" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Gradient boosting\n", "\n", "max_features = 'auto'\n", "tree_count = 1000\n", "\n", "# np.arange(0.0001, 1.2, 0.01)\n", "\n", "results = []\n", "for learning_rate in np.logspace(-10, np.log(1.3), 100): \n", " regr = GradientBoostingRegressor(max_features=max_features, \n", " random_state=1, \n", " n_estimators=tree_count,\n", " learning_rate=learning_rate)\n", " regr = regr.fit(X[train], y[train])\n", " y_hat_train = regr.predict(X[train])\n", " y_hat_test = regr.predict(X[~train])\n", " \n", " mse_train = metrics.mean_squared_error(y[train], y_hat_train)\n", " mse_test = metrics.mean_squared_error(y[~train], y_hat_test)\n", " \n", " results += [[learning_rate, mse_train, mse_test]]\n", "\n", "# Plot\n", "df = pd.DataFrame(np.asarray(results), \n", " columns=['learning_rate', 'mse_train', 'mse_test']).set_index('learning_rate')\n", "plt.figure(figsize=(10,10))\n", "ax = sns.lineplot(data=df)\n", "ax.set_xscale('log')\n", "plt.ylabel('MSE')\n", "plt.show();\n", "\n", "# Show best learning rate\n", "display(df[df['mse_test'] == df['mse_test'].min()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (e) Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6." 
] }, { "cell_type": "code", "execution_count": 446, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE test: 0.455\n" ] } ], "source": [ "# Naive ols\n", "model = sm.OLS(y[train], X[train]).fit()\n", "y_hat = model.predict(X[~train])\n", "\n", "mse_test = metrics.mean_squared_error(y[~train], y_hat)\n", "print('MSE test: {}'.format(np.around(mse_test, 3)))" ] }, { "cell_type": "code", "execution_count": 463, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZYAAAEPCAYAAABhkeIdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3Xt8U/X9P/DXSdKrLTRgUryBt80NVxAU6FDL16JFSi9oQStKvwwpjMuq3ebkNjdxoGVjVRTcGIM+iq3AvgqlA0v94gMvFIUqk/L9MTrxCqVNSiptoZc05/P7oyYkTdomNac5bV/Px4PHI+eS0/c5JOedz/VIQggBIiIiP9EEOgAiIupfmFiIiMivmFiIiMivmFiIiMivmFiIiMivmFiIiMivmFiIiMivmFiIiMivmFiIiMivmFiIiMivmFiIiMivmFiIiMivmFiIiMivdIEOoLfV1V2ELHs3ofPQoRE4f75R4Yh8p9a4AMbWE2qNC1BvbGqNC+hfsWk0EvT6K3z+OwMusciy8Dqx2PdXI7XGBTC2nlBrXIB6Y1NrXABjY1UYERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRNLN7RaDYRWgzZJgtBqoNXykhERdWXATZvvC1kWqLtkxZq8IzDVNcGoD8PyOeOhDw+CzSYHOjwiIlXiz+8uXLjY4kgqAGCqa8KavCNoC3BcRERqpmhiKS4uRmJiIhISElBQUNDpfgcPHkR8fLxj+ciRI5gwYQJSU1ORmpqKZcuWAQDq6+sxf/58TJ06FY8++ijMZrOS4cPaJjuSip2prgk2FT/Eh4go0BSrCqupqUFubi7efPNNBAcHIz09HRMmTMDNN9/ssl9tbS1ycnJc1p04cQJz587FggULXNa/+OKLuOOOO7Bp0ybs3r0bq1evxosvvqjUKSBIp4FRH+aSXIz6MGg1EmBjciEi8kSxEktZWRliY2MRFRWF8PBwTJkyBSUlJW77rVy5EkuWLHFZV1FRgQ8++ADJycn4+c9/jnPnzgFoL9kkJycDAJKSkvDee+/BarUqdQoYfEUIls8ZD6M+DAAcbSxsmCIi6pxi90iTyQSDweBYNhqNOH78uMs++fn5GDlyJEaPHu2yPjIyElOnTkVCQgJef/11ZGdnY/v27S7H1Ol0iIiIgMViQXR0tCLnoNFI0IcH4flFd8ImC2g1EnQAG+6JiLqgWGKRZRmSJDmWhRAuy5WVlSgtLUVeXh6qq6td3rtq1SrH60ceeQTr1q1DQ0OD298QQkCj8a3QNXRohE/7DxlyhU/79xaDITLQIXSKsflOrXEB6o1NrXEBjE2xxDJs2DCUl5c7ls1mM4xGo2O5pKQEZrMZaWlpsFqtMJlMmDVrFl577TX89a9/xfz586HVah37a7VaGI1
G1NbWYtiwYWhra8PFixcRFRXlU1znzzdC9rLx3WCIhNnsntACTa1xAYytJ9QaF6De2NQaF9C/YtNoJJ9/jAMKtrFMnDgRhw8fhsViQVNTE0pLSxEXF+fYnpWVhf3796OoqAibNm2C0WhEYWEhNBoN3n77bezfvx8AsHv3bowePRrh4eGYNGkSdu/eDQDYt28f7rjjDgQFBSl1CkRE1AOKJZbo6GhkZ2cjIyMD06dPR1JSEkaNGoXMzExUVFR0+d6cnBzk5+dj2rRpeOONN/CHP/wBAPDEE0/gX//6F6ZNm4bCwkI888wzSoVPREQ9JAkhBlS/WVaFKYux+U6tcQHqjU2tcQH9KzbVVYUREdHAxMRCRER+xcRCRER+xcRCRER+xcRCRER+xcRCRER+xcRCRER+xcRCRER+xcRCRER+xcRCRER+xcTiBa1WA6HVoE2SILQaaLW8bEREneHDELuh1WpQd8mKNXlHYKprcjxFUh8exAd+ERF5wJ/e3WgDHEkFAEx1TViTdwRtgQ2LiEi1mFi6YZOFI6nYmeqaYPNyhmQiooGGiaUbWo0Eoz7MZZ1RHwatRurkHUREAxsTSzd0AJbPGe9ILvY2FjZOERF5xvtjN2w2GfrwIDy/6E7YZAGtRoLuu/VEROSOicULNpsMCd9dLJuALcDxEBGpGavCiIjIr5hYiIjIr5hYiIjIrxRNLMXFxUhMTERCQgIKCgo63e/gwYOIj493W19dXY3x48fjzJkzAACr1YqxY8ciNTXV8c9mY4sHEZGaKNZ4X1NTg9zcXLz55psIDg5Geno6JkyYgJtvvtllv9raWuTk5Li9X5ZlrFixAlar1bHu1KlTGDNmDP7+978rFTYREX1PipVYysrKEBsbi6ioKISHh2PKlCkoKSlx22/lypVYsmSJ2/rNmzdj4sSJ0Ov1jnUVFRWwWCx48MEH8dBDD+HIkSNKhU9ERD2kWGIxmUwwGAyOZaPRiJqaGpd98vPzMXLkSIwePdpl/YkTJ/Dhhx/iZz/7mct6SZIwefJk7NixA7///e+RnZ0Ni8Wi1CkQEVEPKFYVJssyJOnytCdCCJflyspKlJaWIi8vD9XV1Y71TU1NePbZZ/HSSy9Bo3HNe+np6Y7XI0eOxKhRo/DJJ5/g3nvv9TquoUMjfDoPgyHSp/17i1rjAhhbT6g1LkC9sak1LoCxKZZYhg0bhvLycsey2WyG0Wh0LJeUlMBsNiMtLQ1WqxUmkwmzZs3CwoULcf78eSxcuBBAe8ln/vz5eOWVV3D8+HGMHTsWw4cPB9CerIKCgnyK6/z5RsheTiBpMETCbG7w6fi9Qa1xAYytJ9QaF6De2NQaF9C/YtNoJJ9/jAMKVoVNnDgRhw8fhsViQVNTE0pLSxEXF+fYnpWVhf3796OoqAibNm2C0WhEYWEh7r77brzzzjsoKipCUVERjEYjNm3ahBtvvBGnTp3Cli1bAACff/45Tp48idtvv12pUyAioh5QLLFER0cjOzsbGRkZmD59OpKSkjBq1ChkZmaioqKiR8dcvHgxLBYLkpKS8MQTTyAnJwcREb5nUyIiUo4khBhQDxZhVZiyGJvv1BoXoN7Y1BoX0L9iU11VGBERDUxMLERE5FdMLD7QajUQWg3aJAlCq4FWy8tHRNQRn8fiJa1Wg7pLVqzJOwJTXZPjSZL68CA+9IuIyAl/cnupDXAkFQAw1TVhTd4RtAU2LCIi1WFi8ZJNFo6kYmeqa4LNyx5mREQDBROLl7QaCUZ9mMs6oz4MWo3UyTuIiAYmJhYv6QAsnzPekVzsbSxspCIicsX7opdsNhn68CA8v+hO2GQBrUaC7rv1RER0GROLD2w2GRK+u2g2AT67kojIHavCiIjIr5hYiIjIr5hYiIjIr5hYiIjIr5hY/IzziRHRQMdeYX7E+cSIiJhY/Mp5PrFbhuuRFv8DtFhtaEMwdFoNkwsRDQhMLD7SajVoAzwOkpS/m0/sluF6zE78MdbvOMa
SCxENOGwA8IG9qmvZxkOY//wBLNt4CHWXrNB+15Yi0D7VS1r8DxxJBeBMyEQ0sDCx+KCrqfPbAPx9zwlkPTwGgyOCORMyEQ1YiiaW4uJiJCYmIiEhAQUFBZ3ud/DgQcTHx7utr66uxvjx43HmzBkAgBACOTk5uP/++5GYmIiPP/5Ysdg96WzqfEjt2z76vxps23cSEWHBnAmZiAYsxdpYampqkJubizfffBPBwcFIT0/HhAkTcPPNN7vsV1tbi5ycHLf3y7KMFStWwGq1Otbt378fp0+fxr59+/DVV19hwYIF2LdvH3S63mkqsk+d75xcJtwajYaLVtQ1NMOoD8Opr+uwfscxZD08BsXvn8bkcSMwOCIY+sgQBGslWDnBGBH1c4qVWMrKyhAbG4uoqCiEh4djypQpKCkpcdtv5cqVWLJkidv6zZs3Y+LEidDr9Y517777LhITE6HRaHDDDTfgqquuwrFjx5Q6BTeeps5/POUnWJN3BNtLK5H18BhHcvn4ZDUeSfgRNhdV4OlXPsDKv5ThfEMrx7UQUb+n2E99k8kEg8HgWDYajTh+/LjLPvn5+Rg5ciRGjx7tsv7EiRP48MMPsXnzZpcqNJPJBKPR6Fg2GAyorq5W6AzcOU+dDwmAkGCTZZjqmmCqa8K2fScxLzUGkeFBMOrDsWzjB27tMc8vuhOsECOi/kyxxCLLMiTp8i1UCOGyXFlZidLSUuTl5bkkh6amJjz77LN46aWXoNFouj1mx326M3RohE/7GwyRbutkWeCr6nr8YctH+MVDtzmqx059XYc1eUdg1Idh9cI7O2mPkTwe01f+OIZSGJvv1BoXoN7Y1BoXwNgUSyzDhg1DeXm5Y9lsNruUNkpKSmA2m5GWlgar1QqTyYRZs2Zh4cKFOH/+PBYuXAigvZQyf/58vPLKKxg2bBhMJpPjGLW1tS7H9Mb5842QveydZTBEwmxucFsvtBr8YctH0EeGIixUhyfSx+Cl7ZfHrKyYMx5ayb09xqgPA4TweExfdBaXGjA236k1LkC9sak1LqB/xabRSD7/GAcUTCwTJ07Eyy+/DIvFgrCwMJSWluK5555zbM/KykJWVhYA4MyZM8jIyEBhYSEA4J133nHsFx8fj02bNuHaa69FXFwc3njjDSQlJeHMmTP48ssvERMTo9QpdMreO2xeagzW5pdDHxnqqAJrbm3D4MhgSDaB5XPGu03vogP4gDAi6tcUSyzR0dHIzs5GRkYGrFYrZsyYgVGjRiEzMxNZWVk9Sgj3338/jh8/jpSUFADA6tWrERoa6u/Qu2XvHRYZHuRoX1mTd8SxfdOyydAJwUcZE9GAJAkhBtSoPX9UhdlH4Nc1NOPVN467VXc9v+hOSAomkP5U1O5Nao1NrXEB6o1NrXEB/Su2nlaFse9rD9h7h91w9SC37sf26i4iooGK98AestlkwAZWdxERdcDE8j3ZbDIkfHchbcJjw3xXMyITEfU3TCwK48O/iGigYRuLwrqaEZmIqD9iYlFYZzMicwp9IuqvmFgUZh/z4oxT6BNRf8bEojBPMyKzSzIR9We8vynMeUZk9gojooGAiaUXeNMlmYiov2BVGBER+VWXiaWqqqrTbe+9957fgyEior6vy8SyePFix+tf/OIXLttyc3OViYiIiPq0LhOL88TH33zzTafbqHtarQZCq0GbJEFoNdBqWQtJRP1Tl433zo8Bdn7taZk6x2ldiGgg8brEQj3HaV2IaCDpssQiyzIuXLgAIQRsNpvjNQDYbOw0662upnVhf28i6m+6vK9VVlYiNjbWkUwmTJjg2MaqMO/Zp3Xp+KRJrUYCbCwVElH/0mVi+fe//91bcfRr9mldOrax6AAOliSifqfbmhh7NZhOp0NjYyPKyspwyy23YMSIEb0RX7/AaV2IaCDpsvH+s88+w+TJk/H++++jubkZM2fORG5uLh577DEcOnSot2LsF2w2GZJNhk4ISDaZSYWIFNdxmIPcS4/r6DK
xrF27Fk8++STuuece7N27FwCwd+9e7Ny5Ey+//HK3By8uLkZiYiISEhJQUFDQ6X4HDx5EfHy8Y/mzzz5Deno6UlJSMHv2bJw9exYAcPbsWYwZMwapqalITU3F448/7tVJqgnHsxBRb7APc1i28RDmP38AyzYewlfV9b1yz+myKuzcuXNISUkBAHz00UeYPHkyNBoNrrrqKjQ2NnZ54JqaGuTm5uLNN99EcHAw0tPTMWHCBNx8880u+9XW1iInJ8dl3bPPPotFixYhLi4Or7/+Ov785z9j3bp1OHHiBJKTk7Fq1aqenGvAcTwLEfUWGyS3YQ5/2PIRnl90J5TuetVl6tJoLm8+duwYxo0b51huaWnp8sBlZWWIjY1FVFQUwsPDMWXKFJSUlLjtt3LlSixZssRl3datWxEXFwdZllFVVYVBgwYBACoqKlBZWYnU1FRkZGTg1KlT3Z+hinA8CxEpTavVQBOshdUmB+zptV2WWAYPHox///vfaGxshNlsdiSWTz75BNHR0V0e2GQywWAwOJaNRiOOHz/usk9+fj5GjhyJ0aNHuwal06G+vh6JiYlobm7Gtm3bAAAhISFISUlBeno63n//fSxevBj79u1DcHCw1yc8dGiE1/sCgMEQ6dP+XTHVXfL4Hw1J8vnv+DMuf2NsvlNrXIB6Y1NrXEDgYmtrk/F1TT0s9c0A4HGYQ2iIDvrIUEXj6DKx/PKXv8ScOXPQ2NiIX//61wgPD8ff//53/OUvf8GGDRu6PLAsyy5jXYQQLsuVlZUoLS1FXl4eqqur3d4/aNAgfPDBB3jvvfewcOFCHDhwwGUizEmTJmHdunX4/PPP8aMf/cjrEz5/vtHrBiyDIRJmc4PXx+6WVuPxPxpC+PR3/B6XHzE236k1LkC9sak1LiBwsWm1GrTIAqu3HkH2I2OR98//h6yHx2D9jmOXq95/Nh7CavM6Po1G8vnHONBNYrn++uuxd+9eSJIEjUaDb7/9FqNHj8aWLVtw3XXXdXngYcOGoby83LFsNpthNBodyyUlJTCbzUhLS4PVaoXJZMKsWbNQWFiIffv2YerUqZAkCXFxcWhubsaFCxewd+9eJCUlQa/XA2hPVjpd3xm7bh/P8nrpvzF53AgMjgiGPjIEwVoJVg5oIaIe0mo1aIOEuoZmmOqa0HDJirqGZmzbdxLzUmMQGR6E5tY2XDk4FK1NVsXj6fKuHBsb61bqsJMkCSdPnuz0vRMnTsTLL78Mi8WCsLAwlJaW4rnnnnNsz8rKQlZWFgDgzJkzyMjIQGFhIQBgy5Yt0Ol0SEhIwIcffgi9Xo8hQ4bg6NGjaG5uRmZmJo4cOQJZlnHjjTf27MwDwGaTMTQyGI8k/IgN+ET0vbX3MpVQ19CKFqsNFxpbYdSH4Y13/uMorazJO+K4z0SGh+B8oBPL9OnTcezYMcTHxyMtLc2tR1dXoqOjkZ2djYyMDFitVsyYMQOjRo1CZmYmsrKyEBMT0+l7X3jhBfz2t7/Fhg0bEBkZifXr1wMAVqxYgaVLl6KoqAghISFYt26dSweDvqDVJjw24PdGTw0i6j/svUzrGprx6hvHMS81BgeOfuVIKNv2ncTCtFG4+soIBGk10EJAo+mdu4wkupnCuKmpCaWlpdi9ezcuXbqElJQUJCcnO3pq9TUBbWMB0CZJmP/8Abf1m5ZNhs7L2aRZv9wzao1NrXEB6o1NrXEBvRibVoulGz9A9iNjsWzjIdwyXI/ZiT9G8funXaraw4I0sLbaehRbT9tYuv25HxYWhtTUVGzduhUvvfQSGhsbkZGRgSeffNLnP0aXJ6R05piQkoioGx27EzdcssKoD8Opr+uwbd9JR1K5cnAYQjSSI6n0Jp/qkSwWCywWC+rq6tDQoM5fC2pnb8C3JxfnCSmJiLoSFKxF3SUrvqiqR1Vto0t7ij25bC6qQEiQFjqIgLXbdns/O3fuHPbs2YOioiJotVqkpKRg586d3Y5jIc+cJ6SEBEBIkGU
ZbQB0Wg0b8InII61WgyarjDV57t2JPbWnBPJe0mVimT17Nr744gskJibiT3/6E0aOHNlbcfVrNpsMnVaDuouc3oWIuudtd+KQYA3kVlvAH8fRZVXY0aNH0dDQgH/84x947LHHMHbsWIwdOxZjxozB2LFjeyvGfonTuxCRN+zVX7UXmty6E9c1NGNN3hHkvv4J9JGhkFTy4MAuSywHDrj3XiL/4OOKiagr9jEq9uovb7oTq6W2o8t72DXXXNNbcQw4fFwxEXXGeYxKeGgQTHVNeOOd/zi6E89LjXHrThzo6i9nfWt0YT/i3DvsluF6/G7eBDy3YCIkSHxGC9EAZ5/yPjRY56j+UlN34u7wDhYg9t5ha39xFxamjcKrbxzHghcOYOnGD1B3ycrkQjQAeRqjYq/+UlN34u6wOj+AbDYZAhpO8UJECArW4nxDK+oaLk953131l1rxZ3GAddWIT0T9n72UYm+kDw3WYXtppaPX17Z9J3H/T6/HiGGRMKi4+ssZSywBxkZ8ooHLUyO92seoeIMllgDjFC9EA5N90GPHRnq1j1HxBu9fAebciN9mE5BlAZ1GA6DvfIiIyDf29pQWq82tkV7tY1S8wcSiEt82tHJ6F6J+ztOgR28a6ftC9ZczVoWpAKd3Ier/7O0pX1TVo66hxTHosS830neGiUUF2DOMqP/ratDjvNQYzE25FcOjByFUKwE2W5+urWBiUQE+/Iuo/+ovgx59wTYWFbD3DFuTdwT6yFCkJ/wQV18Z4ZjepT980IgGov406NEXLLGoAKd3Iepf+uOgR18oescqLi5GYmIiEhISUFBQ0Ol+Bw8eRHx8vGP5s88+Q3p6OlJSUjB79mycPXsWANDa2oqnnnoKU6dOxQMPPIDTp08rGX6vstlk2GyCjfhEfZwsC7dG+o6DHucktT80MSRY0+fbUzxRLLHU1NQgNzcXhYWF2L17N3bs2IHPPvvMbb/a2lrk5OS4rHv22WexaNEi7NmzB4mJifjzn/8MANi2bRvCwsLw1ltvYfny5Vi2bJlS4QeEvRH/luF6LJ8zHs8vuhPzUmPAScOI+o5vG1v65aBHXyiWWMrKyhAbG4uoqCiEh4djypQpKCkpcdtv5cqVWLJkicu6rVu3Ii4uDrIso6qqCoMGDQLQXrJJSUkBAIwbNw4WiwVVVVVKnUKv02okTLg1GrMTf4zNRRVYtvEQNhdVoOEiq8OI1M5e/dXc0ubWSG8vrSxMG4W/Lp2MFxbd1a/HqSl2tzKZTDAYDI5lo9GImpoal33y8/MxcuRIjB492mW9TqdDfX094uLi8Prrr+Ohhx7yeEyDwYDq6mqlTqHX6QA8nvITrN9xjNVhRH2I/fHBX1TVo6q20VFKSb77Jkcj/dyUW3GNIQLh/bT6y5livcJkWYYkXa7DEUK4LFdWVqK0tBR5eXkek8OgQYPwwQcf4L333sPChQtx4MABt2MIIaDR+JYbhw6N8Gl/gyHSp/2/L5PlkiOp3DJcj7T4HyAyPAiQJAwdGgHNd12QezsuXzA236k1LkC9sakhLlkWaLjUgkvNbViTdwTZj4xF3j//n8vULPZenqEhOkRFhDi+w4HSG9dNscQybNgwlJeXO5bNZjOMRqNjuaSkBGazGWlpabBarTCZTJg1axYKCwuxb98+TJ06FZIkIS4uDs3Nzbhw4QKio6NhMpkwfPhwAO3tM87H9Mb5842QvRx4aDBEwmxu8On435tWA6M+DPrIUMxO/LGj9OI8zcuQIVf0flxeCsg185JaY1NrXIB6Ywt0XPapWb79riuxNzMT21qsON9iDVjMgO/XTaORfP4xDihYFTZx4kQcPnwYFosFTU1NKC0tRVxcnGN7VlYW9u/fj6KiImzatAlGoxGFhYUAgC1btuDtt98GAHz44YfQ6/UYMmQIJk2ahKKiIgBAeXk5QkJCcPXVVyt1CgFhH9OSnvBDVokRqZDz1CwDvZG+M4qVWKKjo5GdnY2MjAxYrVb
MmDEDo0aNQmZmJrKyshATE9Ppe1944QX89re/xYYNGxAZGYn169cDAGbPno1nnnkG06ZNQ3BwMNauXatU+AFjH9MSHqrjNC9EKmSfmiX7kbH9cmZif5CEEAPqTqX6qrDvCK0GyzYegj4y1NHO0tzahhuuHoShg8NVWT0BBL6KoitqjU2tcQHqjS0Qcdmrv1paZSx44QCWzxmPzUUVjmrr4vdPY/K4EaoeSd9bVWGc0kWldAB+nxmLuoYWvLT9cjvLijnjoY8M6/b9ROQ/nU3N0lkjPdpsqksqvYmDI1TKZpMRGqRzJBWgvSpsdd4RXLjYEuDoiAYGb6ZmWZg2Cr+cNRYjvpuZeMig0AFX9dURSywq1ibLHrseW9tk6Dg5JZGinEsp/el59L2BJRYVs0+nf8twvcto/GUbODklkZK0Wo1LKYW9vnzDEouK2bse1zU0e+x6/PyiOzmNGJGfabUatEFCXUMze331EBOLinXseuxcHdZwycrJKYn8zF791WK1uZRS+tPz6HsD61JUzmaTOTklkcI6NtJfaGx1m0CyPz8/xd94V+oDODklkXKcJ5C0Pz9lIE8g6Q+sCusDbLb2CT099RATAHuIEfWAfcCjvZSS/chYR/XXqa/rsG3fSaTF/wCDI4Jx5eAw6CBYSvESSyx9hKaTHmLLNx5iDzEiH9irvTw95dFe/WVPLpuLKhASpIWODfQ+YYmlj2APMaLvz3lsyqtvHHcppXTXSE/eY2LpI9hDjOj7cR6b0tUEkvapWexdiZlUfMfE0ofYbDK0Wg0m3BqN5Ltv8visFhbXidx5GpvCrsTKYcV8H6MDMC81hj3EiLxk7/VVe6HJ4wh6diX2P5ZY+hibTYZGq2UPMaJudOz1NS81xosR9Cyl+AMTSx8UFOT++GJ9ZKhT3bCW00zQgOZpAklWe/UeJpY+aPAVIS49xJwTDNtcaCDj2BR1YBtLH6TRSNCHB+EaQwRMdU1Ii/+BI8EsnzMe2Y+MRV1DM4SWXcVo4PA0gp5jUwKDJZY+yt5DzKgPQ2R4kMdqsZZWmdVi1O91VUrh2JTAYImlD7MPmmxubUN6wg9dqsVKDn+Jr6obYL7QhBZZIChYG+hwifyuu1IKe30FhqKJpbi4GImJiUhISEBBQUGn+x08eBDx8fGO5dOnT+PRRx9FamoqHn74YZw8eRIAcPbsWYwZMwapqalITU3F448/rmT4qmcfNHnD1YNw9ZWXq8WK3z+N5LtvwuaiCmzZ8384a27EpVYZ0Go59Qv1C7Is3B4Z7FxK4QSSgaVYVVhNTQ1yc3Px5ptvIjg4GOnp6ZgwYQJuvvlml/1qa2uRk5Pjsm7lypVYsGAB/uu//guHDx/G008/jT179uDEiRNITk7GqlWrlAq7z7HZZMAGBGm1jmqxyeNGuJReit8/jcnjRlyuBgjW8hcb9Un2aq8vzl1AXb3rI4M5gl49FPv5WlZWhtjYWERFRSE8PBxTpkxBSUmJ234rV67EkiVLXNbNnDkTd999NwDglltuwblz5wAAFRUVqKysRGpqKjIyMnDq1Cmlwu9ztBCOarHBEcEsvVC/41zttWYrSylqptidxWQywWAwOJaNRiNqampc9snPz8fIkSNJdZlsAAAUhElEQVQxevRol/UPPvggtNr2NoH169fj3nvvBQCEhIQgJSUFu3btwuOPP47FixejtbVVqVPoU5yrxfSRIZ2WXtj2Qn1Nx4dwhQbr2JaicopVhcly+zNE7IQQLsuVlZUoLS1FXl4eqqur3d4vhMDatWvx6aefIj8/HwDwi1/8wrF90qRJWLduHT7//HP86Ec/8jquoUMjfDoPgyHSp/17S1dxtbXJWPGz8bDUNztKL/NSYxyll44DKkNDghAVEQKNxj/dk9V6zQD1xqbWuIDAxtbWJuPrmnpYOlR7ddbja8ig0PbxKbrAlsYH+v+nYoll2LBhKC8vdyybzWYYjUbHcklJCcxmM9LS0mC1WmEymTBr1iw
UFhaira0NTz/9NGpqapCfn4/IyPYLsW3bNiQlJUGv1wNoTz46nW+ncP58I2RZeLWvwRAJs7nBp+P3Bm/iigoLwuCIYLRa5U5LL0oMqFTrNQPUG5ta4wICE5u9HUUAsFplrN7q3oW4q7aUurqLvRpvR/3p/1OjkXz+MQ4omFgmTpyIl19+GRaLBWFhYSgtLcVzzz3n2J6VlYWsrCwAwJkzZ5CRkYHCwkIAQE5ODhobG7FlyxYEBwc73nP06FE0NzcjMzMTR44cgSzLuPHGG5U6hT7N3qgfFqx1jNJ3Lr3YE8y81BhEhgehrqEZUZHB4LwWFCj2hHKxuQ11lha0Wm1dNs5zni/1UiyxREdHIzs7GxkZGbBarZgxYwZGjRqFzMxMZGVlISYmxuP7LBYLCgoKcO2112LmzJmO9UVFRVixYgWWLl2KoqIihISEYN26ddBo2ADdFWurDfrwIERFupZeOKCS1MR5bi8APj+EiwlFXSQhhHf1Qv3EQKkK88TTl7fziSw1PUowar1mgHpjU2tcgLKxdazyWvmXMmQ/MhYA2h+7PWc8Dhz9qpN2QR3Qps7eXv3p/1N1VWGkPs6ll5ZW2a1ajGNeqDd0V+UVpJO6LaVEDQpV7c2bOKXLgGOzyZBbbQhymmeMY15IafYuw5LTWJQayyW8tP2Y23iUyCuC8UQ6uxD3ZbxbDFDOAyo55oWUYk8oF602nDVfxJdV9Y6xKJ2NR/nbrhPQaSWsWXQnfvXo7RgRPQihWokDHfsQJpYBynlAJUfskz+1V3W1JxRPpRN7MrH/qPE0al4fGYoQrQSdkJlQ+iDeIQYwe7UYR+yTv9inXdm0uwJNLbLH0gmrvPo/JhZyJJiwIA1LL+Qz5/YT+7Qrk8eNQP3FFo+lE1Z59X+8K5CDlaUX8oGn9hP7M1Eiw4O6bJBfmDYKv5w1FobB4QgCWOXVzzCxkAuWXqgr3bWf2JMJG+QHNt4JyKOell7a2niT6I+0Tsmkq/YTezKxD2xkg/zAxAGS1Knu5hvzNFuyEOC0MP2EVqtBXUNze0L5bsaGV984jnmpMS7tJ54GNE4eNwIR4UHInB4DIQCN1H6zYYP8wMASC3XL29LLq28cx58LP8FXNfVotglWj/VBHau6NvzjXy6lE2/aT+y9uwaFBSMIgFaWIdlk/tAYQPitJ6901/biqXqsvqkVVgA2TfvNiklG3Tx1Fe7Yu4vtJ+QNftPJJ52VXjo27h84+hUaL1nxt90V+KKqHrUXmtmDTIW66yrcsXTC9hPyBhML+cxT6aVj9djkcSPcepDVNTSjxSbQJmlYTRZAzsmku67CnT3+96qh4Zg/fRRuuGoQrhwcygGN5IKN99RjzrMld2zc79gGk/nAT9DSasPyjYegjwzF3JSRGDo4DLKkgU4jsbFfQVqtBm0AIAFajQYXm62OWYUB92efOCcT+4+DjjMMB2vbE4kWAGx8Phy54k9G+l46mxam4ZLVpQ2m4WIrXtp+Ocm02QSWbzyEdQUfw3zhEqwASzJ+1LERftPuCpyrvYQvqi64jDvprquwc++uG65m6YS8w28w+YU9wRiiwh0PaBp0xeU2GPsNjElGOZ6SiXMj/Pod7snE00SQzsnkpmsGIzIsiL27yCf81pJf6XQa6MODMH96DMJCLrfBOLfDeJtk2G25a/a2ErmLZOLcCO8pmXTXVdg4JJzJhHzGbyv5nc3W/svWuYosekg4nkgf43WScR4XYy/FCJ0WmmAdhFaDNkkaUAnHXhoROg00wTqXObr+1kUycW438ZRM2FWYlDAwvpUUMPYqsiuCtLjGcAWu9zLJODf6t9kE/ra7AnUNzThrbsSyjYewrsBzwrHfeO2vhVYLfLfOZLnUJxKSpyRiL43UNbTgrLnRZY6urpKJc7tJZ+NOOBEk+Zt6v13Ur9gTjPAyyXQsxUweN8JjtZlzwnG+8dobqzftPg7Ld+uWbfzA64TUWXLqbt+eHMP87SXH686SiL00Yr8
Gzm0lXSWTju0maxbdiV8/djuuHBzGZEKKUTSxFBcXIzExEQkJCSgoKOh0v4MHDyI+Pt6xfPr0aTz66KNITU3Fww8/jJMnTwIAWltb8dRTT2Hq1Kl44IEHcPr0aSXDJ4V4k2Q6lmI6qzZzvtl2fG0fT+NrQuoqOXW1b0+PYam/HIdzlZZzEul4DZzbSrzt0cVGeOotiiWWmpoa5ObmorCwELt378aOHTvw2Wefue1XW1uLnJwcl3UrV65EZmYmioqK8OSTT+Lpp58GAGzbtg1hYWF46623sHz5cixbtkyp8KmXdJZkOjb6O99IO0s4HV/3NCF1lZx68tqXYzhXaXV86mJnDe9MJqQ2iiWWsrIyxMbGIioqCuHh4ZgyZQpKSkrc9lu5ciWWLFnism7mzJm4++67AQC33HILzp07B6C9ZJOSkgIAGDduHCwWC6qqqpQ6BeplzkmmY6P/gaNfOW6knSWcjq97mpC6Sk5KH8O5SqvjUxedr4FzW0nm9BhcdeUVjpHwTCYUaIolFpPJBIPB4Fg2Go2oqalx2Sc/Px8jR47E6NGjXdY/+OCD0Grb55Rav3497r33Xo/HNBgMqK6uVuoUKIA6NvpnTo+BPjIU1xgicMPVgz0mnI6vO96MvU1IXSWnnrz2ZV9PcTt3BdZHhOAaQwSeX3SXS8O71GaDZLNBK5hMKPAkIYRQ4sCvvvoqWlpa8OSTTwIAdu7ciRMnTmDVqlUAgMrKSqxatQp5eXmorq5GRkYG3nnnHcf7hRBYu3YtPvzwQ+Tn5yMyMhIJCQnYvHkzhg8fDgBIT0/H0qVLcdtttylxCqRisizQcKkFLVYZkgRoJQltsnB7DQFoNBK+e4lLzVbUNbRgz3unMfPeH6Kl1dbla3sVkzf7+uMYzlVaxiFhiIoIgc0mIAuB4CAtoiJCoNFIAb76RF1TLLHs2rUL5eXlWL16NQBgw4YNEEI4qr3Wr1+PvXv3IjQ0FFarFV9//TVGjRqFwsJCtLW14emnn0ZNTQ1effVVREZGAgBmz56NJ554AnfccQcA4N5770V+fj6uvvpqr+M6f74RsuzdKRsMkTCbG3w57V6h1rgA9cdmsVyE0EposwlIkgStRoLNJr6bR8vzawhAktoTVHf79uQYkgbQSK77yrKARtM+mV8gSx9q/f9Ua1xA/4pNo5EwdGiEz39HsaqwiRMn4vDhw7BYLGhqakJpaSni4uIc27OysrB//34UFRVh06ZNMBqNKCwsBADk5OSgsbERW7ZscSQVAJg0aRKKiooAAOXl5QgJCfEpqRDZq9g0NhlSmw1yaxskm63L15LNBni5b0+OYYgKd9uXVVrUlyk2u3F0dDSys7ORkZEBq9WKGTNmYNSoUcjMzERWVhZiYmI8vs9isaCgoADXXnstZs6c6VhfVFSE2bNn45lnnsG0adMQHByMtWvXKhU+ERH1kGJVYWrFqjBlMTbfqTUuQL2xqTUuoH/FprqqMCIiGpiYWIiIyK+YWIiIyK+YWIiIyK+YWIiIyK+YWIiIyK+YWIiIyK+YWIiIyK+YWIiIyK+YWIiIyK+YWIiIyK8Um4RSrXx9loVan32h1rgAxtYTao0LUG9sao0L6D+x9fQ8BtwklEREpCxWhRERkV8xsRARkV8xsRARkV8xsRARkV8xsRARkV8xsRARkV8xsRARkV8xsRARkV8xsRARkV8xsQAoLi5GYmIiEhISUFBQ4Lb95MmTePDBBzFlyhSsWLECbW1tvRLXK6+8gmnTpmHatGlYu3atx+333HMPUlNTkZqa6jF2pcyePRvTpk1z/O1PP/3UZXtZWRmSk5ORkJCA3NzcXonpH//4hyOe1NRU3H777Vi1apXLPoG4Zo2NjUhKSsKZM2cAeHdtqqqq8Oijj+L+++/HwoULcfHiRcXj2rFjB5KSkpCcnIxly5ahtbXV7T27du3CXXfd5bh+Sv3fdoxt2bJlSEhIcPzdt99+2+09vfU9dY7t3XffdfnMxcb
GYsGCBW7vUfq6ebpXBPRzJga46upqcc8994i6ujpx8eJFkZycLP7zn/+47DNt2jRx7NgxIYQQy5YtEwUFBYrHdejQIfHwww+LlpYW0draKjIyMkRpaanLPgsWLBCffPKJ4rF0JMuyuOuuu4TVavW4vampSUyaNEl8/fXXwmq1irlz54qDBw/2aoyVlZXivvvuE+fPn3dZ39vX7F//+pdISkoSt956q/jmm2+8vjbz588X//znP4UQQrzyyiti7dq1isb1+eefi/vuu080NDQIWZbFb37zG7F161a3961atUoUFxf7NZbuYhNCiKSkJFFTU9Pl+3rje+opNjuTySQmT54svvjiC7f3KXndPN0riouLA/o5G/AllrKyMsTGxiIqKgrh4eGYMmUKSkpKHNvPnj2L5uZm3HbbbQCABx980GW7UgwGA5YuXYrg4GAEBQXhpptuQlVVlcs+J06cwF//+lckJydj1apVaGlpUTwuAPj8888BAHPnzkVKSgpee+01l+3Hjx/HiBEjcN1110Gn0yE5OblXrpmz3//+98jOzsaQIUNc1vf2Ndu5cyd+97vfwWg0AvDu2litVhw9ehRTpkwBoMxnrmNcwcHB+N3vfoeIiAhIkoQf/vCHbp83AKioqMCuXbuQnJyMX//617hw4YJf4/IUW1NTE6qqqrB8+XIkJydj/fr1kGXZ5T299T3tGJuztWvXIj09Hddff73bNiWvm6d7xZdffhnQz9mATywmkwkGg8GxbDQaUVNT0+l2g8Hgsl0pP/jBDxxfki+//BJvvfUWJk2a5Nh+8eJF/PjHP8ZTTz2FXbt2ob6+Hhs3blQ8LgCor6/HT3/6U2zYsAF5eXnYvn07Dh065Nje3TVVWllZGZqbmzF16lSX9YG4ZqtXr8Ydd9zhWPbm2tTV1SEiIgI6Xfvk40p85jrGdc011+DOO+8EAFgsFhQUFGDy5Mlu7zMYDFi0aBH27NmDq666yq2qUYnYamtrERsbizVr1mDnzp0oLy/H//zP/7i8p7e+px1js/vyyy9x5MgRZGRkeHyfktfN071CkqSAfs4GfGKRZRmSdHlqaCGEy3J325X2n//8B3PnzsVvfvMbl19CV1xxBf72t7/hpptugk6nw9y5c/Huu+/2SkxjxozB2rVrERkZiSFDhmDGjBkufzvQ12z79u342c9+5rY+kNfMzptr42ldb12/mpoa/Pd//zfS0tIwYcIEt+0bNmzA7bffDkmSMG/ePLz//vuKx3Tddddhw4YNMBqNCAsLw+zZs93+3wL9mduxYwdmzZqF4OBgj9t747o53yuuu+66gH7OBnxiGTZsGMxms2PZbDa7FHM7bq+trfVYDFbCxx9/jDlz5uBXv/oVHnjgAZdtVVVVLr/ahBCOXx5KKy8vx+HDhzv9291dUyW1trbi6NGjiI+Pd9sWyGtm5821GTJkCBoaGmCz2TrdRwmnT59Geno6HnjgASxevNhte0NDA/Ly8hzLQghotVrF4zp16hT279/v8nc7/r8F8nsKAAcOHEBiYqLHbb1x3TreKwL9ORvwiWXixIk4fPgwLBYLmpqaUFpairi4OMf2a665BiEhIfj4448BAEVFRS7blXLu3DksXrwYf/rTnzBt2jS37aGhofjjH/+Ib775BkIIFBQU4L777lM8LqD9i7J27Vq0tLSgsbERu3btcvnbo0ePxhdffIGvvvoKNpsN//znP3vlmgHtN6Hrr78e4eHhbtsCec3svLk2QUFBuOOOO7Bv3z4AwO7duxW/fo2NjXj88cfxxBNPYO7cuR73CQ8Px+bNmx09AF977bVeuX5CCKxZswYXLlyA1WrFjh073P5uoL6nQHvVYXNzM6677jqP25W+bp7uFQH/nPmlC0Aft2fPHjFt2jSRkJAgNm3aJIQQYt68eeL48eNCCCFOnjwp0tLSxJQpU8Qvf/lL0dLSonhMzz33nLjttttESkqK419hYaFLXCUlJY64ly5d2itx2eX
m5or7779fJCQkiLy8PCGEECkpKaK6uloIIURZWZlITk4WCQkJYvXq1UKW5V6Ja+/eveLJJ590WaeGa3bPPfc4ehF1dm2WL18u/vd//1cIIcSZM2fEY489JqZOnSrmzp0rvv32W0Xj2rp1q7j11ltdPm8vvviiW1xHjx4V06dPF/fff7/4+c9/Lurr6xWJyzk2IYR47bXXxNSpU8V9990n/vjHPzr2CdT31Dm2Tz/9VMycOdNtn966bp3dKwL5OeMTJImIyK8GfFUYERH5FxMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLUQAdOnSo04F1RH0VEwtRAE2cOBGtra0oLy8PdChEfsPEQhRAkiThJz/5CXbs2BHoUIj8pncnSiIiFxcuXEBZWRlkWca3336LqKioQIdE9L1x5D1RAOXl5aGiogIAEBMTgzlz5gQ2ICI/YFUYUQDt2LEDDz/8MGbOnImdO3cGOhwiv2BVGFGAfPTRRxBCYPz48RBCoK2tDeXl5R4fJEXUl7DEQhQg27dvx0MPPQSgvRF/xowZ2L59e4CjIvr+2MZCRER+xRILERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH5FRMLERH51f8H+CGVRYzEXw4AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "RMSE Train CV: 0.6482094156231718\n", "@Lambda: 9.300000000000002\n", "MSE test: 0.469\n", "RMSE test: 0.685\n" ] } ], "source": [ "def lasso_cv(X, y, λ, k):\n", " \"\"\"Perform the lasso with \n", " k-fold cross validation to return mean MSE scores for each fold\"\"\"\n", " # Split dataset into k-folds\n", " # Note: np.array_split doesn't raise excpetion is folds are unequal in size\n", " X_folds = np.array_split(X, k)\n", " y_folds = np.array_split(y, k)\n", " \n", " MSEs = []\n", " for f in np.arange(len(X_folds)):\n", " # Create training and test sets\n", " X_test = X_folds[f]\n", " y_test = y_folds[f]\n", " X_train = X.drop(X_folds[f].index)\n", " y_train = y.drop(y_folds[f].index)\n", " \n", " # Fit model\n", " model = Lasso(alpha=λ, copy_X=True, fit_intercept=True, max_iter=10000,\n", " normalize=False, positive=False, precompute=False, random_state=0,\n", " selection='cyclic', tol=0.0001, warm_start=False).fit(X_train, y_train)\n", " \n", " # Measure MSE\n", " y_hat = model.predict(X_test)\n", " #print(y_test)\n", " MSEs += [metrics.mean_squared_error(y_test, y_hat)]\n", " return MSEs\n", "\n", "X_train = pd.DataFrame(X[train], columns=X.design_info.column_names)\n", "y_train = pd.DataFrame(y[train], columns=['Price'])\n", "\n", "#lambdas = np.arange(.000001, 0.01, .0001)\n", "\n", "lambdas = np.arange(0.2, 20, .1)\n", "\n", "MSEs = [] \n", "for l in lambdas:\n", " MSEs += [np.mean(lasso_cv(X_train, y_train, λ=l, k=10))]\n", "\n", "sns.scatterplot(x='λ', y='MSE', data=pd.DataFrame({'λ': lambdas, 'MSE': MSEs}))\n", "plt.show();\n", "\n", "# Choose model\n", "lamb = min(zip(MSEs, lambdas))\n", "print('RMSE Train CV: {}\\n@Lambda: {}'.format(np.sqrt(lamb[0]), lamb[1]))\n", "\n", "\n", "# Use chosen model on test set prediction\n", "model = Lasso(alpha=lamb[1], copy_X=True, fit_intercept=True, max_iter=10000,\n", " normalize=False, positive=False, 
precompute=False, random_state=0,\n", " selection='cyclic', tol=0.0001, warm_start=False).fit(X[train], y[train])\n", "\n", "y_hat = model.predict(X[~train])\n", "\n", "mse = metrics.mean_squared_error(y[~train], y_hat)\n", "rmse = np.sqrt(mse)\n", "\n", "print('MSE test: {}'.format(np.around(mse, 3)))\n", "print('RMSE test: {}'.format(np.around(rmse, 3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Comment:**\n", "\n", "Boosting yields a test MSE of 0.207, less than half that of naive OLS (0.455) and the lasso (0.469). \n", "\n", "It is interesting that naive OLS with all features slightly outperforms the lasso (0.455 vs 0.469). This suggests that the lasso is unable to identify redundant features in the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (f) Which variables appear to be the most important predictors in the boosted model?" ] }, { "cell_type": "code", "execution_count": 468, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.20666656200483657\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "max_features = 'auto'\n", "tree_count = 1000\n", "learning_rate = 0.00229\n", "\n", "regr = GradientBoostingRegressor(max_features=max_features, \n", " random_state=1, \n", " n_estimators=tree_count,\n", " learning_rate=learning_rate)\n", "\n", "regr = regr.fit(X[train], y[train])\n", "y_hat_test = regr.predict(X[~train])\n", "\n", "mse_test = metrics.mean_squared_error(y[~train], y_hat_test)\n", "print(mse_test)\n", "\n", "# Plot feature by importance in this model\n", "\n", "plot_df = pd.DataFrame({'feature': X.design_info.column_names, 'importance': regr.feature_importances_})\n", "\n", "plt.figure(figsize=(10,10))\n", "sns.barplot(x='importance', y='feature', data=plot_df.sort_values('importance', ascending=False),\n", " color='b')\n", "plt.xticks(rotation=90);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The boosting model suggests that CAtBat (number of times at bat during the player's career) is by far the most important predictor of Salary. \n", "\n", "The number of walks during the player's career (CWalks) is also predictive of Salary, but this feature is likely to be strongly correlated with CAtBat." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (g) Now apply bagging to the training set. What is the test set MSE for this approach?" 
] }, { "cell_type": "code", "execution_count": 471, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE test: 0.208\n", "RMSE test: 0.456\n" ] } ], "source": [ "# Bagging with 1000 trees\n", "# Although RandomForestRegressor is used here, this is bagging because max_features = n_predictors\n", "\n", "max_features = X.shape[1]\n", "tree_count = 1000\n", "\n", "regr = RandomForestRegressor(max_features=max_features, random_state=0, n_estimators=tree_count)\n", "regr.fit(X[train], y[train])\n", "y_hat = regr.predict(X[~train])\n", "\n", "mse = metrics.mean_squared_error(y[~train], y_hat)\n", "rmse = np.sqrt(mse)\n", "\n", "print('MSE test: {}'.format(np.around(mse, 3)))\n", "print('RMSE test: {}'.format(np.around(rmse, 3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bagging achieves a test MSE of 0.208, essentially the same as that achieved by boosting (0.207). Bagging has the advantage here that the result was achieved without needing to tune hyperparameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11. This question uses the Caravan data set.\n", "\n", "### (a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations." ] }, { "cell_type": "code", "execution_count": 518, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Intercept MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR \\\n", "0 1.0 33.0 1.0 3.0 2.0 8.0 0.0 5.0 \n", "1 1.0 37.0 1.0 2.0 2.0 8.0 1.0 4.0 \n", "2 1.0 37.0 1.0 2.0 2.0 8.0 0.0 4.0 \n", "3 1.0 9.0 1.0 3.0 3.0 3.0 2.0 3.0 \n", "4 1.0 40.0 1.0 4.0 2.0 10.0 1.0 4.0 \n", "\n", " MGODOV MGODGE ... ALEVEN APERSONG AGEZONG AWAOREG ABRAND \\\n", "0 1.0 3.0 ... 0.0 0.0 0.0 0.0 1.0 \n", "1 1.0 4.0 ... 0.0 0.0 0.0 0.0 1.0 \n", "2 2.0 4.0 ... 0.0 0.0 0.0 0.0 1.0 \n", "3 2.0 4.0 ... 0.0 0.0 0.0 0.0 1.0 \n", "4 1.0 4.0 ... 0.0 0.0 0.0 0.0 1.0 \n", "\n", " AZEILPL APLEZIER AFIETS AINBOED ABYSTAND \n", "0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 86 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "caravan_df = pd.read_csv('./data/Caravan.csv').drop('Unnamed: 0', axis=1)\n", "\n", "# Patsy feature processing\n", "f = 'C(Purchase) ~ ' + ' + '.join(caravan_df.columns.drop(['Purchase']))\n", "y, X = pt.dmatrices(f, caravan_df)\n", "y = y[:, 1]\n", "\n", "# Display processed features\n", "display(pd.DataFrame(X, columns=X.design_info.column_names).head())\n", "\n", "# Index for Training set of 1000\n", "np.random.seed(1)\n", "train_sample = np.random.choice(np.arange(len(caravan_df)), size=1000, replace=False)\n", "train = np.asarray([(i in train_sample) for i in caravan_df.index])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?" ] }, { "cell_type": "code", "execution_count": 519, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy: 93.26%\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "max_features = 'auto'\n", "tree_count = 1000\n", "learning_rate = 0.01\n", "\n", "model = GradientBoostingClassifier(max_features=max_features, \n", " random_state=1, \n", " n_estimators=tree_count,\n", " learning_rate=learning_rate)\n", "\n", "model = model.fit(X[train], y[train])\n", "#y_hat_test = regr.predict(X[~train])\n", "\n", "accuracy = model.score(X[~train], y[~train])\n", "print('accuracy: {}%'.format(np.around(accuracy*100, 2)))\n", "\n", "# Plot feature by importance in this model\n", "\n", "plot_df = pd.DataFrame({'feature': X.design_info.column_names, 'importance': model.feature_importances_})\n", "\n", "plt.figure(figsize=(10,20))\n", "sns.barplot(x='importance', y='feature', data=plot_df.sort_values('importance', ascending=False),\n", " color='b')\n", "plt.xticks(rotation=90);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use boosting to predict whether someone purchases caravan insurance, a classification problem. The boosting model yields a test set prediction accuracy of 93.26%. Note that always predicting no purchase would score about 93.8% on this test set (4525 of 4822 observations), so accuracy alone says little here. \n", "\n", "The model suggests that the ten most important predictors are:\n", "\n", "- PPLEZIER: Contribution boat policies\n", "- ALEVEN: Number of life insurances\n", "- PBRAND: Contribution fire policies\n", "- MOSTYPE: Customer Subtype; see L0\n", "- MBERARBG: Skilled labourers\n", "- APERSAUT: Number of car policies\n", "- AFIETS: Number of bicycle policies\n", "- MRELGE: Married\n", "- MGEMLEEF: Avg age\n", "- MFWEKIND: Household with children\n", "\n", "Broadly, these predictors indicate whether the customer has other insurance policies and their level of family commitment, e.g. 
married, with children.\n", "\n", "We can't tell from the feature importances the direction in which these predictors are related to the response. **Revision note:** is this possible in the boosting setting?\n", "\n", "From this we might hypothesise that customers who have already purchased insurance of other kinds are more inclined to purchase caravan insurance. Those with more family commitment and responsibility might also be more inclined to purchase insurance. It seems plausible that these two groups are more risk averse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20 %. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?" ] }, { "cell_type": "code", "execution_count": 575, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

BOOSTING: Confusion matrix

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4405 120]\n", " [ 266 31]]\n", "\n", "Positive Predictive Value: 0.2053\n" ] } ], "source": [ "max_features = 'auto'\n", "tree_count = 1000\n", "learning_rate = 0.01\n", "\n", "model = GradientBoostingClassifier(max_features=max_features, \n", " random_state=1, \n", " n_estimators=tree_count,\n", " learning_rate=learning_rate)\n", "\n", "model = model.fit(X[train], y[train])\n", "#y_hat_test = regr.predict(X[~train])\n", "\n", "\n", "# Boosting stats\n", "threshold = 0.2\n", "y_hat_proba = model.predict_proba(X[~train])\n", "y_hat = (y_hat_proba[:, 1] > threshold).astype(np.float64)\n", "confusion_mat = confusion_matrix(y[~train], y_hat)\n", "\n", "# What fraction of the people predicted to make a purchase do in fact make one?\n", "pos_pred_val = np.around(confusion_mat[:, 1][1] / np.sum(confusion_mat[:, 1]), 5)\n", "\n", "display(HTML('BOOSTING: Confusion matrix'))\n", "print(confusion_mat)\n", "\n", "print('\\nPositive Predictive Value: {}'.format(pos_pred_val))" ] }, { "cell_type": "code", "execution_count": 576, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

K=1

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4280 245]\n", " [ 274 23]]\n", "\n", "Positive Predictive Value: 0.08582\n" ] }, { "data": { "text/html": [ "

K=2

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4507 18]\n", " [ 293 4]]\n", "\n", "Positive Predictive Value: 0.18182\n" ] }, { "data": { "text/html": [ "

K=3

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4476 49]\n", " [ 288 9]]\n", "\n", "Positive Predictive Value: 0.15517\n" ] }, { "data": { "text/html": [ "

K=4

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4524 1]\n", " [ 297 0]]\n", "\n", "Positive Predictive Value: 0.0\n" ] }, { "data": { "text/html": [ "

K=5

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4522 3]\n", " [ 297 0]]\n", "\n", "Positive Predictive Value: 0.0\n" ] }, { "data": { "text/html": [ "

K=6

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4525 0]\n", " [ 297 0]]\n", "\n", "Positive Predictive Value: nan\n" ] }, { "data": { "text/html": [ "

K=7

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4525 0]\n", " [ 297 0]]\n", "\n", "Positive Predictive Value: nan\n" ] }, { "data": { "text/html": [ "

K=8

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4525 0]\n", " [ 297 0]]\n", "\n", "Positive Predictive Value: nan\n" ] }, { "data": { "text/html": [ "

K=9

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[[4525 0]\n", " [ 297 0]]\n", "\n", "Positive Predictive Value: nan\n" ] } ], "source": [ "# KNN\n", "\n", "# PREDICT\n", "for K in range(1, 10):\n", " # model\n", " model = KNeighborsClassifier(n_neighbors=K).fit(X[train], y[train])\n", " # Predict\n", " y_pred = model.predict(X[~train])\n", " \n", " # Confusion table\n", " display(HTML('K={}'.format(K)))\n", " confusion_mtx = confusion_matrix(y[~train], y_pred)\n", " print(confusion_mtx)\n", " \n", " ## Classifier stats\n", " pos_pred_val = np.around(confusion_mtx[:, 1][1] / np.sum(confusion_mtx[:, 1]), 5)\n", " print('\\nPositive Predictive Value: {}'.format(pos_pred_val))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Comment:** KNN performs best at K=2, achieving a Positive Predictive Value of 0.182, which is slightly worse than, but not dissimilar to, the boosting result (0.205). The boosting model also produces roughly 8 times as many true positive predictions (31 vs 4), so we would certainly choose the boosting model in this case.\n", "\n", "It is worth noting that this data set is high-dimensional, with over 80 predictors, so the effectiveness of KNN is likely hindered by the curse of dimensionality." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 12. Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?\n", "\n", "# TODO: I'd like to do this using the Kaggle house prices dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }