{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn import tree, model_selection, datasets, metrics, ensemble, linear_model, neighbors\n", "import graphviz as gv\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "7. In the lab, we applied random forests to the Boston data using mtry=6 and using ntree=25 and ntree=500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>CRIM</th>\n", " <th>ZN</th>\n", " <th>INDUS</th>\n", " <th>CHAS</th>\n", " <th>NOX</th>\n", " <th>RM</th>\n", " <th>AGE</th>\n", " <th>DIS</th>\n", " <th>RAD</th>\n", " <th>TAX</th>\n", " <th>PTRATIO</th>\n", " <th>B</th>\n", " <th>LSTAT</th>\n", " <th>Price</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.00632</td>\n", " <td>18.0</td>\n", " <td>2.31</td>\n", " <td>0.0</td>\n", " <td>0.538</td>\n", " <td>6.575</td>\n", " <td>65.2</td>\n", " <td>4.0900</td>\n", " <td>1.0</td>\n", " <td>296.0</td>\n", " <td>15.3</td>\n", " <td>396.90</td>\n", " <td>4.98</td>\n", " <td>24.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.02731</td>\n", " <td>0.0</td>\n", " <td>7.07</td>\n", " <td>0.0</td>\n", " <td>0.469</td>\n", " 
<td>6.421</td>\n", " <td>78.9</td>\n", " <td>4.9671</td>\n", " <td>2.0</td>\n", " <td>242.0</td>\n", " <td>17.8</td>\n", " <td>396.90</td>\n", " <td>9.14</td>\n", " <td>21.6</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.02729</td>\n", " <td>0.0</td>\n", " <td>7.07</td>\n", " <td>0.0</td>\n", " <td>0.469</td>\n", " <td>7.185</td>\n", " <td>61.1</td>\n", " <td>4.9671</td>\n", " <td>2.0</td>\n", " <td>242.0</td>\n", " <td>17.8</td>\n", " <td>392.83</td>\n", " <td>4.03</td>\n", " <td>34.7</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.03237</td>\n", " <td>0.0</td>\n", " <td>2.18</td>\n", " <td>0.0</td>\n", " <td>0.458</td>\n", " <td>6.998</td>\n", " <td>45.8</td>\n", " <td>6.0622</td>\n", " <td>3.0</td>\n", " <td>222.0</td>\n", " <td>18.7</td>\n", " <td>394.63</td>\n", " <td>2.94</td>\n", " <td>33.4</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.06905</td>\n", " <td>0.0</td>\n", " <td>2.18</td>\n", " <td>0.0</td>\n", " <td>0.458</td>\n", " <td>7.147</td>\n", " <td>54.2</td>\n", " <td>6.0622</td>\n", " <td>3.0</td>\n", " <td>222.0</td>\n", " <td>18.7</td>\n", " <td>396.90</td>\n", " <td>5.33</td>\n", " <td>36.2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \\\n", "0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 \n", "1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 \n", "2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 \n", "3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 \n", "4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 \n", "\n", " PTRATIO B LSTAT Price \n", "0 15.3 396.90 4.98 24.0 \n", "1 17.8 396.90 9.14 21.6 \n", "2 17.8 392.83 4.03 34.7 \n", "3 18.7 394.63 2.94 33.4 \n", "4 18.7 396.90 5.33 36.2 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boston_df = datasets.load_boston()\n", "boston_df = pd.DataFrame(data=np.c_[boston_df['data'], boston_df['target']], 
columns= [c for c in boston_df['feature_names']] + ['Price'])\n", "boston_df.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def run_cv(df, n_splits, extract_x, extract_y, fit_model, loss_fn):\n", "    # Return a one-column DataFrame holding the test loss of each K-fold split.\n", "    fold_errors = []\n", "    for train_idx, test_idx in model_selection.KFold(n_splits=n_splits).split(df):\n", "        train, test = df.iloc[train_idx], df.iloc[test_idx]\n", "        # train\n", "        train_X = extract_x(train)\n", "        train_y = extract_y(train)\n", "        model = fit_model(train_X, train_y)\n", "        # test\n", "        test_X = extract_x(test)\n", "        test_y = extract_y(test)\n", "        preds = model.predict(test_X)\n", "        fold_errors.append(loss_fn(preds, test_y))\n", "    # DataFrame.append was removed in pandas 2.0, so build the frame once from the list\n", "    return pd.DataFrame(fold_errors)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "rnge = range(1, 50)\n", "\n", "sqrt_rmses = [run_cv(boston_df,\n", "                     5,\n", "                     lambda df: df.drop('Price', axis=1),\n", "                     lambda df: df.Price,\n", "                     lambda x, y: ensemble.RandomForestRegressor(n_estimators=i, max_features='sqrt').fit(x,y),\n", "                     lambda preds, true: np.sqrt(metrics.mean_squared_error(true, preds))).mean()[0] for i in rnge]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "log2_rmses = [run_cv(boston_df,\n", "                     5,\n", "                     lambda df: df.drop('Price', axis=1),\n", "                     lambda df: df.Price,\n", "                     lambda x, y: ensemble.RandomForestRegressor(n_estimators=i, max_features='log2').fit(x,y),\n", "                     lambda preds, true: np.sqrt(metrics.mean_squared_error(true, preds))).mean()[0] for i in rnge]" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "all_fx_rmses = [run_cv(boston_df,\n", "                       5,\n", "                       lambda df: df.drop('Price', axis=1),\n", "                       lambda df: df.Price,\n", "                       lambda x, y: ensemble.RandomForestRegressor(n_estimators=i, max_features=None).fit(x,y),\n", "                       lambda preds, true: 
np.sqrt(metrics.mean_squared_error(true, preds))).mean()[0] for i in rnge]" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x1128a1278>" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAEKCAYAAAARnO4WAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzs3Xdc1WX7wPHPfdiHPVVAQXHvvUeWuyyztLK9tK1PZetp711aWdrza5dllrPc5R6FCm5FRGRP2euM+/fHURSRmYjB9X69eKnnu64DeHFzf6/7+iqtNUIIIRo+Q30HIIQQ4uKQhC+EEI2EJHwhhGgkJOELIUQjIQlfCCEaCUn4QgjRSEjCF0KIRkISvhBCNBKS8IUQopGwr+8Azubn56dDQ0PrOwwhhPhX2blzZ7rW2r+q/S6phB8aGkp4eHh9hyGEEP8qSqnY6uwnUzpCCNFISMIXQohGQhK+EEI0EpLwhRCikZCEL4QQjYQkfCGEaCQk4QshRCPRIBJ+gamATyM+JSI1or5DEUKIS9YltfCqtgzKwJzIOdgb7Oke0L2+wxFCiEtSgxjhO9s74+7oTlphWn2HIoQQl6wGkfAB/F38SSuQhC+EEBVpOAnf6C8jfCGEqETDSfgu/qQXptd3GEIIcclqUAk/tSAVrXV9hyKEEJekhpPwjf6YrCZySnLqOxQhhLgkNZyE72Lr/S83boUQ4vwaTML3c/EDILUwtZ4jEUKIS1ODSfj+RtsIX27cCiHE+TWchC9TOkIIUakGk/CNDkZcHVxlhC+EEBVoMAkfzpRmCiGEKK9hJXyjLL4SQoiKNKiE7+fiJ+0VhBCiAg0q4Z9uoCarbYUQorwGlfADjAEUWYrIM+XVdyhCCHHJaVAJ//TiK5nWEUKI8hpUwj9di59eIDduhRDiXA0q4fsZpb2CEEJUpEEl/ACXAEBG+EIIcT4NKuG7OrjiYu8ic/hCCHEeDSrhK6VstfjST0cIIcppUAkfTtXiywhfCCHKaXgJX9orCCHEeTW8hC8N1IQQ4rzqPOErpf6jlNqvlNqnlJqvlHKuy+v5G/0pMBdQYCqoy8sIIcS/Tp0mfKVUEPAI0Ftr3RmwA26sy2uWPghF5vGFEKKMizGlYw+4KKXsASOQWJcXK22vIJU6QghRRp0mfK11AvAucAJIArK11qvr8poBRtviKxnhCyFEWXU9peMNXAO0BAIBV6XULefsM1UpFa6UCk9L++dJWkb4QghxfnU9pTMCiNFap2mtTcCvwMCzd9Baz9Na99Za9/b39//HF/Rw9MDR4CilmUIIcY66TvgngP5KKaNSSgFXAAfr8oJKKfyN/tJATQghzlHXc/g7gIXALmDvqevNq8trgq1SRxqoCSFEWfZ1fQGt9QvAC3V9nbP5G/2Jzoq+mJcUQohLXoNbaQtIAzUhhDiPBpnwA4wB5JpyKTIX1XcoQghxyWiQCV+ebSuEEOU1yIRf+mxbKc0UQohSDTPhG20JX7pmCiHEGQ0z4csIXwghymmQCd/LyQt7g71U6gghxFkaZMJXSsmjDoUQ4hwNMuHDqWfbyghfCCFKNdiE7+fiJyN8IYQ4S4NN+P5GmdIRQoizNdyE7+JPdnE2JZaS+g5FCCEuCQ034R
ulNFMIIc7WYBO+tFcQQoiyGmzCL322rVTqCCEE0IATvozwhRCirAab8H2cfbBTdjLCF0KIUxpswjcoA74uvjLCF0KIUxpswgekvYIQQpylwSd8eZi5EELYNOiE72eU9gpCCHFag074AS4BZBZlYrKa6jsUIYSodw064fsZbaWZGYUZ9RyJEELUvwad8OXJV0IIcUbDTvjybFshhCjVsBO+jPCFEKJUg074Ps4+KJRU6gghBA084dsb7G2rbaW9ghBCNOyED7LaVgghTmv4Cd/oLzdthRCCRpDwA10DSchLqO8whBCi3jX4hB/kFkRuSS45JTn1HYoQQtSrBp/wA90CAUjKS6rnSIQQon41+IQf5B4EQHxefD1HIoQQ9avhJ3xXW8JPzEus50iEEKJ+1SrhK6W8lVJdL3QwdcHTyROjvVESvhCi0at2wldKrVdKeSilfIBdwOdKqferOKadUirirI8cpdSMfxp0TSilCHIPkikdIUSjZ1+DfT211jlKqXuAb7TWLyil9lR2gNb6MNAdQCllByQAi2odbS0FuQbJCF8I0ejVZErHXinVDJgMLK/Fta4AorXWsbU49h8JdAskMS8RrfXFvrQQQlwyapLwXwZWYUvafyulWgFRNTj+RmB+TYK7UILcgsgz5UktvhCiUat2wtda/6y17qq1vv/Uv49pra+rzrFKKUfgauDn82ybqpQKV0qFp6XVTc+bIDdbpY6suBVCNGY1uWnbVim1Tim179S/uyqlnq3m4WOBXVrrlHM3aK3naa17a617+/v7VzecGjm9+Erm8YUQjVlNpnQ+B54GTABa6z3Ypmmq4ybqaToHziy+khG+EKIxq0nCN2qt/zrnNXNVBymlXIGRwK81CexC8nD0wN3BXRK+EKJRq0lZZrpSKgzQAEqp64EqG9RorfMB39qFd+GcrtQRQojGqiYJ/0FgHtBeKZUAxAC31ElUdSDQLZC43Lj6DkMIIepNtRO+1voYMOLUFI1Ba51bd2FdeEFuQWxP2o7WGqVUfYcjhBAXXU2qdKYrpTyAAuADpdQupdSougvtwgpyC6LQXEhWcVZ9hyKEEPWiJjdt79Ja5wCjsM3J3wq8WSdR1QEpzRRCNHY1Sfin50HGYeuls/+s1y55pxdfSRM1IURjVZOEv1MptRpbwl+llHIHrHUT1oUnI3whRGNXkyqdu7F1vjymtS5QSvkCd9ZNWBeeu6M7Ho4eUosvhGi0alKlY1VKmYGhSqmzj6u0RfKlJMgtSBK+EKLRqnbCV0p9AXQF9nNmKkdTjytoayrILYhj2cfqOwwhhKgXNZnS6a+17lhnkVwEgW6BbE7YLLX4QohGqSY3bbcppf7VCT/ILYgiSxEZRRn1HYoQQlx0NRnhf4Mt6ScDxdhKMrXW+l/xMHM4U5qZmJeIn4tfPUcjhBAXV00S/v9hW2y1l39ROebZzi7N7Or/r/k5JYQQF0RNEn6a1nppnUXyD5gtVqLT8vE2OhDg4VzhfrL4SgjRmNVkDn+3UuoHpdRNSqmJpz/qLLIayC0yM/rDjSyJqHxRldHBiLeTtyy+EkI0SjUZ4btgm7s/u2HaJVGW6e3qiK+rI0dT86rcV/riCyEaq2olfKWUHbBHa/1BHcdTa2H+bkSnVZ3wg9yCOHLyyEWISAghLi3VmtLRWluwPZf2khUW4MbRtDy01pXuF+QWRGJeIlb9r7zvLIQQtVaTOfwtSqmPlVJDlFI9T3/UWWQ1FObvSlaBicz8kkr3C3QLpMRaQkah1OILIRqXmszhdz/158tnvaaByy9cOLXXOsANgKOpefi6OVW43+nSzIS8BPyN/hclNiGEuBTUpHna8Mq2K6Vu11p//c9Dqp0wf1vCj07Lp1+rip+ZHuwWDNgSfveA7hXuJ4QQDU1NpnSqMv0CnqvGgrxccHYwVFmp08ytGSB98YUQjc+FTPj12o3MYFC08qu6UsfF3gUfZx9pkyyEaHQuZMKvvDzmImgd4FatWv
xgt2BJ+EKIRqfBjPDBNo+fkFVIYYml0v1k8ZUQojG6kAl/ywU8V62crtSpalon0C2QxPxELNbKfzAIIURDUu2Er5T6Vinleda/Q5RS607/W2v90IUOrqaqm/CD3IIwW82kFaZdjLCEEOKSUJMR/mZgh1JqnFLqXmAN8GHdhFU7oX5GDAqiq5jHP7svvhBCNBY1qcOfq5TaD/wJpAM9tNbJdRZZLTjZ29HCx0h0Wn6l+529+Kpnk0tmsbAQQtSpmkzp3Ap8AdwGfAX8rpTqVkdx1VqYf9WVOmcnfCGEaCxq0lrhOmCw1joVmK+UWoQt8feoi8Bqq3WAG5ui0rFYNXaG8xcOOdk54e/iL1M6QohGpcoRvlLqrVN//eFUsgdAa/0X0K+uAqutMH83SixW4jILKt0v0C1QRvhCiEalOlM645RSCnjq3A1a68pbU9aDsBpU6kjCF0I0JtVJ+CuBk0BXpVTOWR+5SqmcOo6vxlr7n+maWZkgtyBS8lMwW80XIywhhKh3VSZ8rfVMrbUX8JvW2uOsD3ettcdFiLFGPI0O+Lk5VWvxlVmbSS1IrXQ/IYRoKKpdpaO1vqay7Uqpbf88nAsjzN+1yhF+sPuZNslCCNEYXMjWCs4X8Fz/SOsAN6LT8it93OHpvvjxufEXKywhhKhXdd4tUynlpZRaqJQ6pJQ6qJQacAGveebiWqPNtvn4MH83sgtNpOdVfE+5qWtT7JQdcblxdRGOEEJcci5kwq/ILGCl1ro90A04eKEvYM7IIGrAQLIWLgSq11PH3mBPU9emxOfJCF8I0TjUaXvkU83WhgL/B7YyTq111gW8JgB2Pj5oi4WiQ4eAM6WZ1ZnHT8iVOXwhRONwIRP+red5rSWQBnyplNqtlPqfUsr17B2UUlOVUuFKqfC0tNp1r1RK4dSuLcWHjwDQzMMZo6NdlZU6wW7BMsIXQjQa1Vlpm3tO/f156/C11vvOc7g90BP4VGvdA8jnnAVcWut5WuveWuve/v7+tX4jzu3aU3z4MNpqtT3usJqVOplFmRSYKl+VK4QQDUF16vDdz6m/r0kdfjwQr7XecerfC7H9ALjgnNq1xVpQgCnBNkXT2t+NY1V0zTxdmimjfCFEY1CdEb5PZR+VHXuqfXKcUqrdqZeuAA5cgLjLcW7fHuDMPP6pxx3mF1e8klZKM4UQjUl1umXuxFZyefZN2dP/1kCrKo5/GPheKeUIHAPurEWcVXJq3RqUss3jjxxZWqkTk55P5yDP8x4jCV8I0ZhUmfC11i1P//3UiL4NNVhkpbWOAHrXKroaMBiNOIaEUHzYNsJvfValTkUJ39PJEzcHN5nSEUI0CtXuh6+UugeYDgQDEUB/YCu2aZpLglP79hQdsM0Yhfi6YmdQlVbqKKVspZnSXkEI0QjUpCxzOtAHiNVaD8f24JPsOomqlpzbtcV04gSWvHwc7Q2E+BirrtRxC5YpHSFEo1CThF+ktS4CUEo5aa0PAe2qOOaicmpnu3FbHGWrx2/l71Z1Lf6pEb5VW+s8PiGEqE81SfjxSikvYDGwRim1BIitm7Bqx7ldWwCKDx8GbPP4Men5mC0VJ/Ngt2CKLcWkF6YDYLVqXl52gL3xl9QvL0II8Y/VpD3ytVrrLK31i8Bz2NolTKirwGrDPjAQg7v7WaWZrpgsmriThRUeE+QeBJyp1NlwJI0vtsTw9bbjdR2uEEJcVLVqraC13qC1XnqpPeLw3BYLravRU6e0NPNUpc43244DsOVoeqXtlYUQ4t/mYnTLvKjObrFQnefbBroFolAk5CYQm5HP+iNptPAxkpRdREx65St1hRDi36TBJfyzWyx4ODsQ4O5U6Qjf0c6RJq5NiM+L57vtsdgpxTvXdwVso/yz7UzZyfq49XUZvhBC1JkGl/DPbbHQpokb4cczMVVx4zY2J46f/o5jdOem9G3pQ5CXC1uOZpTZ7/3w93l528t1F7wQQtShBpfwy7RYAG4bEMrxjAK+3nq8wm
OC3YM5dvIEOUVmbh8QilKKQa192RqdjsVqm8cvthRzIPMAaYVp8uBzIcS/UoNL+Oe2WBjVsQmXtfPnw7VRpOQUnfeYILcgcs0ZtGvqRJ9QbwAGtfYjp8jMvgRbeebBjIOYrbZGbAcy6qT/mxBC1KkGl/ABnNq1o+jUCF8pxUtXd6LEYuW1387/dEVTsS3Jj+vpjFK2HnEDw/wA2BJtm8ePSI2wnQ/F/oz91Y5lf2I2hSWW2r0RIYS4gBpkwndu3660xQLY+urcNyyMpZGJbI1OL7d/eJQtyXdsbip9zd/difZN3Utv3EamRRLkFkRr79bsT69ewt8WncFVH23miV/2/NO3JIQQ/1iDTPjntlgAeOCyMJr7uPD8kv2UmM/cwE3NLWLrYds8fVpRUpnzDAzz4+/jJyksMROZFkk3/2508u3E/oz9VdboZxeaeGxBBAalWBaZSGTcBX+UrxBC1EiDTPjntlgAcHaw48XxnTiamseXW2JKX//xrzhMJa442TmXa5M8uI0vJWYra44cIq0wrTThZxZlklKQUmkMLy7dT0puMd/c1RdfV0de//2gLOQSQtSrBpnwS1ssnJXwAa7o0IQRHQKYtS6KpOxCTBYrP+w4wZA2/jR3L981s29LX+wNihVHbU9o7BZgS/hApdM6y/cksmh3Ag9f3ppBrf2YPqINO2Iy+fOwVPcIIepPg0z4pS0WDh0ut+2F8Z2wWDWvLj/ImgMpJOcUcfuAUILdg8uN8N2c7One3Is9aZE42znT1rstbX3aYq/sK7xxm5xdxH8X7aNbcy8eHN4agJv6tqClnytv/H6o0kZuQghRlxpkwoeyLRbO1tzHyIPDW/Pb3iReXX6AIC8XhrcPKO2Lf+60y6DWfpy0RNHeuyMOBgec7Jxo493mvAnfatXMXBhJidnKhzd0x8HO9ul1sDPwxOh2RKXm8csu6b0vhKgfDTbhn91i4VxTh7Yi1NdIYnYRt/QPwc5ge/JVobmQk8Uny+zbt5U7BudEvO3blr7W0bfjeW/cfr3tOJui0nn2qg609HMts21M56b0bOHF+2uOUFBS8YPVhRCirjTYhH9ui4Uy2xzseGNiV7o39+LGPs2Bih9obu+SgFJWCnKCSl/r5NeJ7OLsMlNAUSm5vLniEJe3D2BK3xblrqmU4plxHUjJKeaLzTHltgshRF1rsAn/3BYL5xoQ5sviBwfh7eoI2NorQPmEfyBzLwBH43xLXyu9cXtqWqfEbGXGTxG4Otnz5nVdShdvnat3qA+jOzXhsw3HSM8r/gfvTgghaq7BJvxzWyxUJdAtEKDcjduI1Ag87ZtyPNVAYpbtQSptvNrgYHDgQPoBLKfm7fcn5vDGxC4EuDtXep0nxrSn0GTho3VRtXhXQghRew024UPZFgtVcbF3wd/Fv8wIX2tNZFokXf27AWfaJTvYOdDOux370vfx1C97WBKRyJNj2jO6U9MqrxPm78ZNfZvz/Y4T0m9fCHFRNeiEf26LhaoEuQWRkHfmJm9CXgIZRRkMad4LPzdHtkafaZfc0bcju1P28fPOE0y/og33XxZW7bimX9EWR3sD764uXzYqhBB1pUEnfKd27YCyLRYqE3zO4qvItEgAugd0Z0CYH5tPPfZQa82xBB/MFDJlkCszRrSpUVz+7k5M6hXMuoMpFJulsZoQ4uJo0Anf+XTCP1y9kXSwezDJBcmYLLYmapFpkbjYu9DGuw2DW/uSllvM0dQ83l19mA17nQAY2LGgwpu0lRnY2o8ik5U98dk1PlYIIWqjQSf8ilosVCTYLRirtpKUb2uiFpkWSWe/ztgb7BnU2tYu+T8LIvjkz2gmd+uNk50TBzJr1xu/X0sflILt0RlV7yyEEBdAg074Simc27WjYPsOtMlU5f5nl2YWmgs5knmE7v7dbdu8jYT4GtmXkMN1PYN5fUI32vu0r3ar5HN5GR1p39SD7TGS8IUQF0eDTvgA3rffRklMDBlffFnlvqWLr/Li2Z
++H7M20+1UhQ7AQ8NbM21YK96+visGg6KTbycOZh7EYq3dPHz/Vj7sjD0p8/hCiIuiwSd8j5EjcR8zhvSPP6Y4OrrSff2N/jgaHInPiy+9YdvVv2vp9km9m/P02A7YGWxz9p38OlFoLuR4zvFaxda/la/M4wshLpoGn/ABmj77XwxGI0n/fRZtqXg0bVAGAt0Cic+1JfwQjxC8nb0r3P/cFbc1JfP4QoiLqVEkfHs/P5r89xkKIyI4+f33le57ujTz9BOuKhPqEYqLvYvM4wsh/hUaRcIH8Bg/HtdhQ0n94ENK4uIq3C/YLZgjJ4+QWZRZZcK3M9jRwadDrUf4IPP4QoiLp9EkfKUUzV58EWUwkPTc8xU+bjDYPRiLtiXfqhI+2ObxD2UewmytXctjmccXQlwsjSbhAzg0a0bAzJkUbN9O1s8/n3ef06WZRnsjrb1aV3nOTr6dKLYUE51V+Q3hisg8vhDiYmlUCR/Aa/IkjH37kvr2O5iSk8ttP12a2cWvC3YGuyrP909v3Mo8/j9jtWr+Pp6JxSoPiBeiKnWe8JVSx5VSe5VSEUqp8Lq+XpXxGAw0e/UVtNlM8osvlZvaCXYPxl7Z07NJz2qdr4VHC9wc3Gp94xZkHv+f+G5HLJM+28bnm47943PlleTx7OZnSS2Qh82LhulijfCHa627a617X6TrVcqxRQv8Z0wnb/16MubOLbPN1cGV78Z9xx2d7qjWuQzKUPrIw9qSefzayS408cGaIygFH62LIjWn6B+db3PCZpZEL+HHQz9eoAiFuLQ0uimd03xuvx2P8eNJ+3AW2UuWlNnWya8TRgdjtc/VybcTh08epsRSUqtYZB6/dj758yhZhSbmTOmJyaJ5c2X1HnZTkd2puwFYfmw5Vm29ECEKcUm5GAlfA6uVUjuVUlMvwvWqRSlF4GuvYuzXj8T/Pkv+tm21Ple3gG6YrWZ2JO2o1fH1MY+/PWk7cbkVl6de6mIz8vlqy3Gu7xnM2C7NuHdoS37dlcDO2JNVH1yBiLQInOycSMpPYmfKzgsYrRCXhouR8AdrrXsCY4EHlVJDz96olJqqlApXSoWnpaVdhHDOurajI8EfzcapZSjxDz9S7adjnWto0FD8XPyYf2h+rWO5mPP4yfnJTFszjRuW3cDWxK11fr268NbKQ9gZFI+PtrXAfuCy1jTxcOLFpfux1uIGboGpgMOZh5ncbjKuDq4sjV56oUMWot7VecLXWiec+jMVWAT0PWf7PK11b611b39//7oOpxw7Dw+az5uHwWgkburU81buVMXBzoFJbSexOWEzJ3JO1CqOizmPv/joYqzaiq+LLw+sfeBfN2f99/FMft+bzH3DwmjiYXuGsKuTPc+M68DehGx+3lnz31z2pe/Doi30b9afUSGjWH18NYXmwgsduhD1qk4TvlLKVSnlfvrvwChgX11eszYcmjWj+efzsOblETd1Gpbc3BqfY1LbSdgpu1qP8i/WPL5VW1l8dDH9mvbjx6t+ZFDQIF7b8Rqv73i91ovHLiSTxVrhojiwlWG+uvwATT2cuXdoyzLbru4WSO8Qb95eeZjswqrbYZ/t9Px9N/9ujA8bT4G5gHUn1tX8DQhxCavrEX4TYLNSKhL4C/hNa72yjq9ZK87t2hE0exbFx46RMH06uqRmN2D9jf6MDB3J4qOLKTAV1Pj6F2se/6/kv0jIS2Bim4m4Orgye/hsbut4G/MPzeehdQ+RW1LzH3YVic3I50BiTrX3LzFbGfH+Bq76aDN74rPOu8/SyEQi47OZObodRkf7MtuUUrx4dScyC0qYtTaqRrFGpEUQ5hmGp5MnvZr0IsgtiGXRy2p0DiEudXWa8LXWx7TW3U59dNJav1aX1/un3AYNotkrr5C/dRvJr9Y81Cntp5BnymP5seW1uv7FmMf/9civeDh60MatP8VmC3YGO2b2mckLA15gR9IObvn9ln98M/doah7/+SmC4e+u57pPt5KZX70fnr/tTS
Q2o4ATmQVM+GQLLy3bT17xmd86CkssvLXyEJ2DPLi2R9B5z9E5yJOb+rbg623HiUqp3g8vq7YSmRZJ9wDbw24MysBVra5ie9J2UvJTqnUOIf4NGm1ZZkW8rp2A77RpZC1YwMmfFtTo2G7+3ejg04H5h+ZXOi1Rkbqcx0/LLeaHvw+y8vgaCjK7MfL9bTz4/e7SOK9vez1zR84lvTCdW36/hZjsmBpf40hKLo/M383IDzawcl8yk3s3p9Bk4dttsVUeq7Xm/zbH0DrAjc1PXs6Ufi34autxRr2/gTUHbEn3/zYfIym7iGev7IhSEJMdw/xD83l1+6tkF5/5nD0+qh2ujna8uGx/tb4Ox7KOkVuSW5rwAcaHjceqrfwe83uNPw9CXKok4Z+H/yMP4zp0CMmvvkrBrt3VPk4pxZQOUziadZS/kv+q8XX7tfRBGUr4Zf/6Wv3AOFdOkYk3VhxkzIcb6fPaWl7441s0Zrp6juLGPs1ZezCFL7ccL92/b7O+fDv2WwDuWX1PtUf6R1JyefD7XYz+cCNrD6YwbWgYm54czpvXdeWK9gF8ve04hSWV/9by9/GT7EvI4c5BoXi6OPDqhC4svG8g7s4O3PtNOFO/CWfOpt307BjNssT3GbFwBFcvvprXd7zOT4d/YmXMmZlCH1dHHhvVji1HM1i1v+oRekRaBAA9AnqUvhbiEUI3/24sjV5a5dfiQnythLgYJOGfh7KzI+idd3Bo1oz46Y9gSqn+UvuxLcfi7eTNDwd/qNE1tdb8nbYBzzYf8Fv6C/zfvv+radhlWKyah37Yzf82xeBtdOTxUW1p03o/HX078fUtE3hjYhdGdGjCGysOlpkvb+XVis9HfU6xpZh7Vt1DUl5SpddZfziV8Z9/w4boKB68rDWbn7ycp8a2x8/NCYBpw8LIzC+psnLmi80xeLo4MLFHcOlrvUK8Wf7IYB4d1YrN2Z9gF/oKUfpzNsRvoLt/d57r/xy/Xfsbga6BbEncUuZ8N/drQSt/V+ZurLqp3e7U3fg4+9DCvUWZ168Ou5qjWUc5lFnxgq6P/4hi1AcbycgrrvI6QtQ3SfgVsPP0JPjjj7DmF5AwfTrWat7EdbJz4rq217E+fj2JeYnVOiY2J5b7197Po+sfxc3BHWt+O2bvms3mhM21jv+tlYfYeCSNV67pzPyp/bmsazFxece4rs1EwPbbyDvXd8XPzYmH5+8mt+hMVUtb77bMHTmX3JJc7ll9T4W9ZRZERvDAugdxbP4ZLq3exTtwM+7Oqsw+fUK96dnCi883HcNsOf/q1bjMAlYfSGZKvxa4OJZtWFdgzmV3ydvYeYQztvmNLLhqARtv2Mh7l73H5HaTaeHRgoFBA/kr+S9M1jPvwd7OwJS+Ldh9IqvKufzTD7tRqmzso0NH42BwqLAmf8exDN5bc4So1Dwe+zmyWvX/WmvmbYzm6V/38sOOE+yJz2rjN31CAAAgAElEQVR0PZQa8yrm/GJzvf5GKAm/Es5t2xL4+msURkSQ8trr1T5uctvJAPx0+KdK9ys0FzJ712yuXXItkWmRPNX3KV7s9T/y424mwDmUJzY+QVxOzW+g/rornnkbj3Fr/xCm9LONWn+N+hVnO2fGthxbup+3qyOzb+pB/MlCnv51b5lvxE6+nZgzYg5phWncu/peMgrPVA8VmYt4fO3bvLzrTuyNMUzr8hCDggbx4a4PuX7Z9fyd/Hfpvkoppg0LIy6zkBX7zr/G4eutxzEoxW0DQsq8Hp8bz60rbiUyLZI3h7zJ25f/lw6+HTCost+2gwIHkW/KJzI1sszrE3oEYW9Q/PR3xZ/DjMIMYnNiy8zfn+bp5MllzS/j95jfy/wwAVsfn0cXRBLiY+Tpse1ZfzitWg3cPv7jKK//foglEQk8s2gvV3+8hc4vrOLK2Zt4cuEeVuyt/Deq6krMKmRJRMIFOdeF9NOhn7h8weUcy/7nze5MFhMHMw5egKgujsz8Evq/sY4PalhBdi
FJwq+Cx5gx+N57L1k//cTJBdW7idvMrRmXN7+cX6J+ochcvqGXxWphRcwKJiyewOd7P2d06GiWTljKzR1u5vJ2TekS6E9e7M0oFI/8+UiNyjwj47J46te99Gvpw/PjOwK2VaQrYlYwKnQU7o7uZfbvE+rDoyPbsnxPEj+ekxi7B3Tnkys+ITEvkalrppJdnM36uPWM/nk8qxK+xc3Sg1/HL+GhntP4cPiHfHz5xxRbirlr1V08velp0gvTARjZoQmt/F35bEN0udFNXrGZn/6OY1yXZjTzdCl9fU/aHm7+/WYyCjOYN3IeV7a6ssL33K9ZP+yUXblVw35uTozo0IRFuxMoMZ9/VHn6YfXd/csnfIDxrcaTWZTJtsSyrTdeWLKP5JwiPrihO1OHtmJs56a8s+owu06cae1w7nv9ZWc87605wsQeQex7cTQbZw5nzs09uWdIK3xcHVl9IJn7v9/1j5O+1prHFkQy/ceIf9Rq4kJLyU/h/Z3vk1GUwcwNMym2/LNpsBe3vcjk5ZNZFLXoAkVYt+b/dYLcIjOfrj/KsbS8eolBEn41+M+YjuvgwSS/8iq569djycqq8teyKR2mkF2czYqYFaWvmawmFh9dzIQlE3hi4xMYHYx8OfpL3hjyBv5G2ypjg0Hx9Lj2JGe6cZn3oxzLPsZzW56r1q+BqTlFTP02HH83J+bc3BMHO9uXd03sGvJMeVzb+trSfYsOHcKSZZu7v39YGEPa+PHi0v0cTi47/dGnaR9mDZ9FTHYMVy66kof/eJj0XCvNCv7Dqilzae17pjxyWPNhLLpmEVO7TmXl8ZVcvehqFhxegFIwbWgr9ifmsOVo2XUGP4fHkVts5q7BZxZRrYtdx12r7sJob+S7cd/Ru2nlTVbdHd3p6t/1vG0ibujTnIz8Ev44dP6btxFpEdgb7Onk1+m82wcHDcbbyZslR8802FsSkcDiiEQeubwNPVp4o5TipQnt8PNLZtriWcxc/zTXLL6GAfMHlNbybz2azpO/7GFgmC9vXtcVg0HRwtfIuC7NeHJMe769ux87nhlBt+ZePPHLHuIyq/4hr7WmwFRQ7ntj1f5kth3LsHUR/aP+RpPnei/8PcxWM0/3fZojJ4/w7t/v1vpc606sY2n0UnycfXh528u17mN1sZgsVr7dFkv35l44O9jxwtLqVZBdaPZV7yKUnR1B775DzKTJxN93v+01JyfsAwJOffhj7+uHcnBA2duBwY4QOwP3RnlxInIWeTO68VvB33yx7wsS8xNp592Od4e9y4gWI877kJWBYX5c3j6Apdszuf/qh/l07yy+2PcFd3e5u8IYi80Wpn23k5xCM7/cPxDfUzdNwTadE+IRQq8mvQDIXbuW+EemY+/vT/CsD3Hp3p33J3dn7KxNPPjDLpY+NKjMoqaBQQP54LIPeGHL65hSB9PWOI5v7xqAp4tDuThc7F14uMfDXNXqKl7b8RqvbH+F3am7earPf3l3tRNzN0YzuI0fYLux/NXW4/Rs4UX35l5YtZWv9n/Fhzs/pItfF2ZfPhtfF99qfY0GBg5kTsQcThadxNvZu/T1IW38aOLhxE9/xzGmc7Nyx0WkRtDRtyNOdk7ltoGtbcbYlmNZeGQh2cXZ5BU68uyS3XQMzSIgaCcvbv2eAxkHiDoZhdnXtmZgXawHA4K7Y7Q38vzW5ykp9uDFBUW08nfl01t64Wh//nGWo72Bj2/qwbjZm3h4/m5+vm8ADnYGCs2FfLDzA2JzYsktySWnJMf2Z3EOZm2mo29H3h76NiEeIRSZLLz2+0HaNnHjqq6BvL/mCJFxWXRr7lWtz2Nd+SvpL1YcX8ED3R5gSocpJOQl8M2Bb+jXrB8jQkbU6FwZhRm8vO1lOvh0YO7Iudy58k7+s/4/fDfuO1p5tqqjd1BWgamAVcdX4efix5DgIVXuv2JfMsk5Rbx2bWfiMgt4cdkBVu5LZmyX8t+TdUldSiVlvXv31uHh9f6MlAqZMzMp2L
EDc2oqptRUzCmpmFNPfWRmos1mMJvRViuYzywYyjEqXp9swL1rd6Z2ncqQoCHlbhCeZkpKInfNGtLbdWPM4nhuGxBCgec3rDy+kk9HfMqgoEHljtFa88TCPfy8M545N/dk3FnfRDHZMVy9+Gpm9JzB3V3upiA8nBN334NT69ZYcnIwJSfT9Jmn8brxRrYczeDWL3bQJcgTb6MjBSVm8oot5BeZaBOzh27HdhJ52UQ+ePQqPJzLJ/tzWbWVeXvmMSdiDu192tPLZQafrj3J8ocH0znIkzUHUrj3m3A+ntKD9s0LeHHbi0SmRTIyZCSvD34dZ3vnan9tTk8BvTXkLca1Gldm2zurDvHp+mi2PnUFTT3PnLPEUsKAHwZwU/ubeLzP4xWee3/Gfm5cfiO9AnqxLzmZIpWEUrYpIk8nTzr4dKCzX2c6+3YmItqVj1an8/I1nZnQy4eblt/CiewUnFKns2TatQR5uVR4ndN+35vEA9/vYtrQVjw9rgP/2/s/Zu2aRSffTng6eeLh6IGHowfuju442jny/cHvMVvNvDDgBY6faMvbKw/z3d396Nbck8Fv/UmfUB/+d3v9PYrCZDVx/dLrKbGUsOiaRTjbO2OymLh1xa2cyD3BwvELCXQLrNa5tNZM/3M6WxK28NNVP9HauzUJeQlM+W0KRnsj31/5PT7OPnX2Xk7knODHwz+yOGoxuaZcvJy8+GPyHzgYKv//cO2cLZzML+GPxy7DqjXjP95CVkEJ6x4bVm7FeG0opXZW63kjWutL5qNXr166obBarTqvMEff8ekIvW1AN72/Wzeds2FDpcfk/PmnPty3nz7Qrr0+0K69/mPMRH3Hra/qg3GJeuKSiXrgDwP1gfQD5Y6b8+dRHfLkcv3eqkPltr0X/p7u9nU3nVaQpgsPHdaH+vTVR8eM1abMTG3OytKxU6fqA+3a64Qnn9KWwkL9xeZj+or31uvxH23SN87dpp9/5Tu9bsTVpTEdGjBQF+zZW6PPxYa4DXrA9wP0oB8G605vzNYP/bBLa631DXO36v6vr9Czds7W3b/prgfPH6yXHF2irVZrjc6vtdZmi1kPmj9IP7PpmXLbYtLydMiTy/XHf0SVeX13ym7d+avOes3xNZWe22q16tt+v033/Wawbv/xdfq+5a/qtcfX6oTchHKxWixWfccXO3SbZ37XO45l6JGzf9Wd/q+fvuKn0TqzMLPa7+eZX/fokCeX6+X7juoBPwzQD619qMJ9E3MT9S2/3aI7f9VZd5p9l77jq82l2z5Yc1iHPLlc70/Irva1L7Qv936pO3/VWa8/sb7M6yeyT+h+3/fTN/92sy6xlFTrXIujFuvOX3XWX+37qszrEakRuuc3PfWtv9+qi8xFNYovr8ikX1y6Tw94fa1eFplQbrvFatEb4jbo+9bcpzt/1Vl3/7q7nrl+pp4TMUd3/qqz3hK/pdLz7z5xUoc8uVx/sflY6Wt/x2TokCeX6zdXHKxRrBUBwnU1cqyM8C8CU2oqcVOnUXz0KM1efQWvCRPKbNcmE2mzZpHxv//DqX17mr34AgXh4aT/8CPWxAQKXD3wnDSOmZ6rOWrMYVDgIG7rdBv9m/bng7VRfPTHUa7s2oyPbuyBwXDmNweT1cTIn0fS1b8r77V7kuM3TQEgdP4POATZ5t611Ur6nE9J/+QTnNq3J/ij2TgGB1N04ACpH35I/sZN2Pv74/fgA7j07En8/Q9gPnmS4A8/wG1omU7XlYrNiWXGnzOIzoqmKHUss8bM4KFffyGw9XKyzAlc1eoqZvaZ+Y9GZ49veJxdKbtYN2ldud+gbpi7jeScIv587LLSz9HX+7/m3fB3+XPyn/i5+FV67v2J2Uz4ZAtXtG/Cp7f0rPA3NLBVY4ydtZHU3GIMSvHfiUbmHHqcjr4d+XzU5xVOH52tyGRhwidbSDIswuq5loXjF9LOp12F+5usJq6b/wIx5mW0cGvF7CveJ8wrjOwCE4Pf+oMhbf2Yc3
OvCo/PLcll1q5ZJOcn08a7DW2929LWuy0hHiHYG2o/Ak3JT2H84vH0a9qPj674qNz2FTEreGLjE9zd+W6uazmVQC8X7Azn/9wm5SUxcelE2nq35YvRX5SbDl15fCUzN8xkXMtxvDnkzUq/RqdtjkrnqV/3kJCbhG/QJvLMOTTzMtDc154SaxGF5kKyirPILMrEz8WPyW0nc33b6/E3+lNsKeayny5jZMhIXh70coXXmP7jbtYdTGXb05fjftZvxo//HMmSiARWTB9K6wC3KmOtTHVH+JLwLxJLXh7xDz9Mwbbt+D/6KL733oNSClNSEgmPPkbh7t143XgDTZ5+GoOTLSFoq5VvP1qAedFC+qceRGlNYRNPYlwLSPAwke3lTbS1A+07jObRu8Zg52YkKiuK8ORwwlPC2ZWyi5PFJ/mk15sEzfwEc3o6Id99h3O7tuXiy9uwgYSZT4BSGPv0Jm/tOgyenvjdew/eN9+MwcU2FWFOS+PEtGkUHz5Cs5dfwuu66877frXFQv727ZTEHMehaRPsmzXD7O/FkxFvsynpD6xFgRicE2lmDOSFgc+fd6qqphZFLeL5rc+fNzn+uiueRxdE8uPU/vRvZbsvMOPPGRzOPMyK61aUO1ex2cLe+Gz+Op5J+PGT/BWTidHRjlUzhuLt6li6nyUnh8KICKzFxeiSEnSJCV1SQkxiJj+Gx9P39uu5flR3Vh1fxeMbHmdsy7G8NeStaiWj8LgT3LF2Al66Bxvu+F+FiRBgT3wWV3+8hfH9c9hT/BkFpgKe6PsEE1tP5IM1R/lk/VFWzxhKmybu5Y6NSI3gyY1PkpSfjKd9IHmWZMzaNiXpaHAkzCuMILcgii3FFJgLKDAVUGgupMBUgMlqYnToaO7rdl+5+y3mkyeZ9cvjJO8PZ5rnlTieSKH42DGwWPB95VWiWnQiPPYkC2PfJ53NFMTdRVuPXjx3ZQcGti77A9iqrUxdPZW96XtZePVCmrs3P+/n4fM9nzN792zu73Y/D3R/oMLPV3ahidd+O8CC8HhC/Yz4tf6CmNwDuBj8ycwDR+VMx6Z+NPXwwGhvZFDQIEa0GIGDXdmpm2c2PcP6+PVsmLyh3DaAlJwiBr35B7cOCOGF8WULA9Lzihn+7nq6BXvx7d19q/U9UZHqJny5aXuR2Lm50WLuXBKffoa099/HnJKC66CBJD39DNpsJuj99/AYV3buWRkMTLzvei7L9WWbYzGvu8VRcjQKz9gThByOxrUgE9gCG7YQNe95jgbbE97SSkQrhSksiCHBQxjq15fmz39DcUICLb74v/MmewC3YcNoufBn4h+ZTv7Wbfjefx++d92FnXvZBGHv70/IN9+SMH06Sf99FlNyMn4PPFD6zVpy/DhZixeTvXgJ5vM8W+ARZ2dudDVyzDeedeNHMnfK6zV6nGRlBgQOAGBr4tZyCX9s52a8sGQ/C/6Oo38rX7TWRKRGMDBwYOk+ecVm5m08xvboDCLis0pLOcP8XRnfrRm39g8tTfbWoiJOfv896fM+x5pdvveRC3AnoA6uIO3o3Yy84w6m95zOrF2zaOHegod6PFTl+1mb9AMGg4WEo0P56I8oZow4/9dOa83Lyw7g5+bI62Oup8g6gqc3Pc3L217mm/3fcHO7u3FxcODjP4/ydm93Mv73f/hOnYpdaHM+3/M5c/fMxc7qTV7sNHIKQ/jf7d1p3iSPIyePEHUyiiNZR4jJjsHZ3hmjg5EAYwAu9i4YHYwUmgr5+cjPLI1eyh0db+dGSy9K/thI7po1mOLiOP0dbTauQoeGkhjaEWvUYYrvu48Fna7kl9bDaN3kWtz8j+ESupCMdDumfJnEiLat+O+VHWjp5wrA/EPz2ZG8gxcGvFBhsge4p8s9HM85zqeRn9LKqxVjQseU22fV/mSeW7yPjPwS7r8sjNYtD/HKlt287nErQ5sN5mh6AR9vOEbS3yaG9wzmxoFhuAW3R50noY9pOYZlx5axLWkbQ4PL/8
b73fZYLFpzx8BQAA5nHibYPRhXB1f83JyYObodzy/Zz+97k7mya93fwJUR/kWmrVZS336HzK++AsCpQweCP3gfx9DQCo/5bnsszy7ex7xbezGkjT/3fbeTDUfSeOay5nQzHmTD3z/jFpVE92NWvE/YSi3t/P1wGzQYU3IyBX/9RfDsWbiPqLoaQpvNWIuKsXNzrXw/k4mkZ58je8kSvCZNwqVbV7IWLaZw504wGHAdPAiva6/FpWcvzGlpmJISMSclYUpMIut4HCWbNuA8YhRtZr1X7c9ddVy75Fr8XPz4fNTn5bY9s2gvv+6KZ+MIN3L+XMFzagnXTnqWyR1vIim7kLu+Cudwcg5dgr3oG+pN71Afeod4l6l40mYz2YsXk/bxJ5iTk3EdOgTfO+/EzssL5eh46sMJ5eiAJTOTtFmzyV29GvuAAPymP8KsJhH8Er2Ylwe+zLVtri0X42mJeYlctegqrg67mpy4CSyKSODW/iFM6tWczkEeZUaDSyMTeWT+bt6c2IUb+9oW2lm1lbWxa/lsz2dEnYzCXTVh0Gp/7tpzAEpKsOvbg1dvtGN3WgQ+uj+xR8bw8vhefLstltwiM6v+M/S8VVjn0hYL0RuWs3vBHJrtPIFvLljt7XDt34+FHkdI8DcwfuD7LIyzsuZgGiaLpncTZ+7b/gNBkVtxHjOOkDdfI7owjttX3E6uyVYWrEv8sBY1p3fT7tzQoysvbH+Kvk378kT3t4mMzyYyLps98Vmk5hYzpI0fYzo3pX8rXxzsDJgsJu5cdSdHs46y4KoFtPBoQUJWIWsPpLBiXxLbj2XSoZkHb1/XlRb+cPuX43jolyKC4iouhTWFhhH6zpt4dOlc9nWLiWELhjG8+XBeG1y2w26RycKgN/+gRwsv/nd7H45lHWPCkgmlU3vuju5YrJqrP95MRp7tBq6rU+3G4DKlc4k7+eOPmBIS8XvowdIpnIqYLVZGfbgRtG117O4TJ3n92jP/uc9mSk0lf8tW8jdtJG/LVqzZ2TR9+SW8J0++4O9Ba2279/DZXAAcW7bEc+K1eF59DQ5NAio9NvWDD8mYO5eQ+T9g7NGj0n1r4p2/3+HHQz+y+abNuNiXrYiJjMvi+Re/4vUdX2Awn1o56+WBdeBw3isOYqdPGLNv7cOwtuWfvKa1Jm/dOlI/+JCS6Gicu3Yl4LHHcO3Xt9y+5yrYtYuUt96iKHIPju3a8uMIJ37yOMTzA57n+rbXn/eY57c8z/Jjy/l94u+42/vx3OJ9LN+bRInZSvum7lzfK5gJPYJwdbTnivfW42V0ZNnDg8tN+1i1lY1//UzBC2/T8ngBu9o44dmlM2G/7uSDm1xJbnEbew6Hlf6w2BOfxbVztnJN90Den3z+xWinJf21i/iHHsYtJxOzvQOJ7duyoW0O60ISwdWDAksuLhn3kpoaho+rIxN7BHFDn+a0aeKO1pqMufNImzUL5w4dCP74I0z+XuzP2M+etD38nRRBeHIExfrUAEa7ohMeJzvX9jV1djDQOdATL6MjW6PTKSix4OFsz4gOTRjduSltAk1M+f0GnJU/xvTpHEi0Pb2slZ8rk3o3554hLXGwMzB33n30+HQD7gYjzZ59DscWzdEWC1itYLUSfiydxWsjuS5iOV7FeUQMGo/f/fcztEtQaXXNc1ueY23sWtbfsL7M/ZkF4XE8sXAP39/Tj0Gt/Xh287OsPL4Si7bQ2bczc0fOxehgZGfsSa77dCvThrXi6bEdqvx+Oh9J+A3M6v3JTP12J452Bj68sXuZ0suKaIsFc1oaDk2b1mlseZs2YefujnO38v1oKmLNzyd67DjsAwIIXfATynBh1gBuTdjKtLXTmHPFnHL10QW7dnPktjs56eHLzmf6ELNjDXck9cNu+1ZczMXg7oHHsKEoR0esBQVY8/NL/7RkZWFOTsaxZUv8/zMD95EjazTnqrUmd+VKUt97H1N8PNFd/XjjspM8MPxpbu5wc5l9Y7JjmLBkAl
PaT+HJvk9SHBND5tdfY/ELINw9hG+zXNmZlI+9QRHm78bhlFx+mtqffq18y10za8HPpLz1FspgYMXoYcwL3oejUzwff2GPVXtyx6DHeP367mUGD++tPsxHfxzl89t6M7Jjk/O+n4QNW0h56CFyHVxY1v86tvi1I9mk0Fpj53YQJ/81WIub0NftYW7s05wRHZqcd/1B7vr1JD4+0/Z86dmzMPY+k7O01myOieLdDWvIz/emT2BnugZ70S3Yi7ZN3LA/tbCwyGRhU1Q6K/cls/ZgCtmFJgwKDK77cWn+LV6my7m5zSOM7NiEMH/bzVFtsbDvnecxfPUrec196PH59xX+ll1itrItMobs996lza71nHAP4JM+N9JsQB8mdA/CzTuaB9bdz+zhsxneYnhp7ONmb8Zq1aycMYSUghRu+GYMz29pgsugAdxv/JXeTfvwyRWf4GzvzBMLIzmeUcD8e/tXeq+mIpLwGxitNXM3HqNHc69y/7H/rbKXLSNx5hM0e+3VCm/+npb+2Vwyv/8OTKfWOVgstpGYxYJydMTnjjvwnTaVEoOVwT8OZlLbSTzZ98nS44sOHCD29jvId3Hn7p73EDjkJ7C4c2D3jfRs6sqssCIMm/4kb+tWlMEOg9GIwdXV9nHq7679++E5YQLKvva3vqwlJWR+/TXpH31MgSN8NNrM4BsfLbOobuaGmWyI38DvVy+DH5eS/tHHAKVPYVOOjuj2HTkU0JplOoBW7Vvy2LAWWAuL0EWFtj+Li8j+7TfyN27COKA/ga+9RoarD0Pf/oPLu9jTPDKKiYs+Ju6Ohxn1VNmbmyVmK9d8soW03GLW/KfsTWqAuBVryHz8MVKN3nh88hn9+tpGpRarJqfQRGZBCSfzSwj0ciGwGusOio8dI/7BhyiJi8NzwjV4X399jQYPZzNZrPwVk8nGqDTC/NzYX/wNv0b/yIfDP+SKFlcAtpvJ8Y8/TuGWrWzvYWTS3NW4eVTv/1T2ho3EP/scKj2NVR0u47NWI2nW1IOCps8zrPkg3hn2NgDbj2Vw47ztvDGxCzf1bcGsFc/T+ZWFBGba8m3usO481GsfPVsNYvbw2VgsdjjZG8pU2dWEJHxxydNaEzvlZkpOnCBs5YpyN4hPy/zue1JefRXXgQNsozCDHcrOYPvT3o6S47HkrlmDY8uWNH3pRR7Ltq1oXjrB1uWy+OhRYm+9DeXijNe8Lxn4dTjOYS9RnH4FI5vdynuTu+HsUH7Fc10qjooi4YknKD54iD+7KuxmTGXqgOkcPnmYScsm8Zj7tQz9bh/FBw7iPnIkTZ9/DuztKdy1i4K/wynYuZOiAwfAUnGnTeXsTMDjj+M95abS36CeW7yPb7fHgtb8fOgbPNMTCVu1Eju3smWBBxJzuOaTzYzp3IyPbjoz5Rb7yxJyn32G416BNPn0M/p0D7sgnw9Lbi6p77xL9rJl6MJCHFuH4XXd9XheczX2PtUv1dUWC4W7dpG3YQPWkhK0gz2LTvxGpiWHm7vdiaeLNxlffklJeirzRmiunjGLK2q40teSl0fqe++RNf9HTE0C+XzgzaxusgVHz0juC/2G2wa04fGfI9kRk8m2p64g7/gBDt16Ax4ldrT5/EsKwneSNns2xc28eWZMFm17j+CdYe9UuXirMpLwxb9C4b79HJ80CZ877qDJk0+U256zchUJ//kPbsOHEzx7VoWj67xNm0l+6SVM8fGkXd6VJ7vs59db1+CbaSL25lvQaEK//RbH0FBu+u5r9lneZYTXc7w3flKtR1X/lC4pIfWTT0ifN490D4h5+Cp2+uYQ8stfXLnNhJ23N02few6P0aPOe7wlL5/CyAisuXkYXJxRzi6n/nTG4OKCnbdPuZvviVmF3P7FX9w9uCXXuORwfNIkfKdNI+A/M8qd/6N1Uby35kjp6u2Yb+ZT8MYrHPZtSci8z+jVseJqmdqy5OWTs+J3shf+QmFkJDg44D58OO5XXI5jaCiOISHYeZVtE6
HNZgrCw8lZtYrcNWuxpKfb2pw4O6NPlcuezRDUjGfHZtO0x0BmXz671uWQ+Tv+Ium//8WUkEDiqKE80WUz2Sm34mLqTn6xmWnDwpjeysCh226iuLgAn08/oE3/MaXHJjz+GKbsLD4bacXlmit5Y/Ab5221Uh2S8MW/RuKzz5K9eAmtli7FqdWZJmoFf//NibvvwblTJ1p88X+lawEqYi0sJH3OHDK++IJsJysFt15Fi2W7sBYU0OKbr3Fu25asoiweX/8kf6VsZ9uUrbg6VF6NdDHk797NgRnTMKbkctIdfHPBc+JEmjwxs1xyu9ASZj5B7urVhK1cgUOzsveFzBYr187ZSkJWIV+7HcV+7mx2N2tPh7lz6NG27ksIi6OiyFr4C9lLl2I5eabrp52nJ4ZzXPMAAAmYSURBVA6hITiGhKDsHcj7808sJ0+iXFxwGzYMj9GjcBs6FIOr7WurrVZWRS3nhfXPcHvYTUTpZLak7mDxNYur3dKhItb8fFLfe5+TP/xAiq8df0zpSbrXdHbFZvHzcE8KZjxEhjWHlTP68/LNX5Q51pyeTsLMmRRs2876LoqU+6/h+eGvlWv/XR2S8MW/hjkjg+jRY3Dp2YMW8+YBUHT4CLG33IK9vz+hP3xfo8RXeOgQfz40mZbxJgxubrT46itcOndibeza0uffzug1g9s73V5Xb6nGLPn5rH3qdpz2RtPlpXfxHXbFRbmuKTGR6DFjcR8zmqC33y63/fC+aNY9+iLDT4SzrXl3es+bTdeW5auY6pI2mSg5cYKS2FhKjsfa/jz1Yc3Lw23oUNxHj8JtyJBKBwUvbXuJhUcW8v/t3Xts1eUdx/H3p7UtxDI6mBpsEQRUYExYzJwGdE4HlsnEOF10QLzMuDhcvOLdkM15YfO2P5BtIko2VBR1EmRxxLmMoWK9bSq6TRF1KFRRx0WU0n73x+9H1iFCgdOenj6fV0LO7/ecX9PnG55+z+885zzfB+CCQy7gzGFnFqyPG55ayisXT6b7+xvoOWkCNUd+g5Xnnc/GHpVceOI6bvn+HIbvNfyzsTU38/5tM3jvtul82LeGry/4M3tU7ngl9tac8K2krLnzLhqnTaPuVzPodtBBrDjlVIj4vzIQO2Pq4qtZu3AhP5s4m08H7st1S6/j0RWPMqTXEK4Zec12yxSkpvGmm1lz++30nzeP7sOy1aAtGzeyZtYs1sy8g+amzSwcegxjb7yaYfu1X2Gy9vbJ5k+Y9IdJCDHnuDm7NWe+LQ3LF7Pkyh9y7PNZTq0YMIBLv/sxX6jtz531d273Z9f/dQmb3llJr138+rQTvpWU2LSJ5eNPIFqaUUUFm1etpt+cOZ+7MnhHttRVOesrZ/HAPx9gXdM6zhl+DmcMO6Pgf+ilrnn9el4fcyxVAwey3+y7WPvIIzTedDObV62iR309e198ERW1tbu19L+zaGppoiVa2lTPaGc1tzQzet5oxq6pY+LqATSMG8SVL09jxrdmMKp2VMF/X2ttTfjeAMU6BVVWss8Vl9P05ls0vfkWddOn73KyBzi8z+GUqYyZL86ktrqW+8fdz9kHn+1kvw3l1dXs9eNz+bihgeXHjeOdKZewR+/e9Pvdb6m79RYq6+q6RLIHqCiraJdkD1BeVs7ofqOZ22MZ1VdcxO1vz2Vwr8GM3Hf360QVimvpWKdRfcQR7D1lClUHHtimFazb07OqJ5NHTKaqvIoJQybsVsXHFNScfDIf3juX5g8+oM/119Nz/PEFWwyXkvr967n71buZ+sRUVqxdwS+O/EWnerH0lI6ZAVlBOJWVocrKHV9s29QSLYyZN4bVH6+mb4++zD9hfofcbHhKx8x2Slm3bk72u6lMZYzpn62bOP3Lp3e6d5adqzdmZiVu0pDsm0DjB40vdlc+wwnfzKyA+lT3YcrXphS7G9vkKR0zs0Q44ZuZJcIJ38wsEU74ZmaJcMI3M0uEE76ZWSKc8M3MEuGEb2aWiE5VS0fSe8CbO7
jsS8D7HdCdzirl+FOOHdKO37FvX7+I2OHONJ0q4beFpGfaUiSoq0o5/pRjh7Tjd+yFid1TOmZmiXDCNzNLRCkm/N8UuwNFlnL8KccOacfv2Aug5Obwzcxs15TiHb6Zme2Ckkr4kuol/UPSa5IuK3Z/2pukWZIaJb3Uqq2XpEWS/pU/frGYfWwvkvpKelzSMkkvSzovb+/y8UvqJulpSX/LY/9J3r6/pKX5+J8rqctuTyWpXNLzkhbk5ynFvkLSi5JekPRM3laQcV8yCV9SOTAdGAsMBU6VNLS4vWp3dwH1W7VdBjwWEQcAj+XnXdFm4KKIGAocBkzO/79TiP9T4OiIGA6MAOolHQZMA26JiEHAh8APitjH9nYe8Eqr85RiB/hmRIxo9XXMgoz7kkn4wKHAaxGxPCI2AfcCnW8PsQKKiL8AH2zVPB6YnR/PBk7o0E51kIh4NyKey4/Xkf3x15JA/JFZn59W5P8COBqYl7d3ydgBJNUBxwEz83ORSOzbUZBxX0oJvxZ4u9X5v/O21OwTEe/mx6uAfYrZmY4gqT/wVWApicSfT2m8ADQCi4DXgY8iYnN+SVce/7cClwAt+Xlv0okdshf3P0p6VtLZeVtBxr33tC1hERGSuvTXrCRVAw8A50fE2uxmL9OV44+IZmCEpBrgIWBwkbvUISSNAxoj4llJRxW7P0UyKiJWStobWCTp1dZP7s64L6U7/JVA31bndXlbalZL6gOQPzYWuT/tRlIFWbKfExEP5s3JxA8QER8BjwOHAzWSttykddXxPxI4XtIKsmnbo4FfkkbsAETEyvyxkezF/lAKNO5LKeE3AAfkn9ZXAqcA84vcp2KYD5yWH58GPFzEvrSbfN72DuCViLi51VNdPn5Je+V39kjqDowm+wzjceCk/LIuGXtEXB4RdRHRn+xv/E8RMYEEYgeQtKekHluOgTHASxRo3JfUwitJ3yab3ysHZkXEtUXuUruSdA9wFFm1vNXAVOD3wH3AfmSVRb8XEVt/sFvyJI0CFgMv8r+53CvI5vG7dPySDib7YK6c7Kbsvoj4qaQBZHe9vYDngYkR8Wnxetq+8imdiyNiXCqx53E+lJ/uAdwdEddK6k0Bxn1JJXwzM9t1pTSlY2Zmu8EJ38wsEU74ZmaJcMI3M0uEE76ZWSKc8C1Jkmok/ajY/TDrSE74lqoa4DMJv9VqTrMuxwnfUnUDMDCvOd4gabGk+cAyAEkT85r0L0j6dV6eG0ljJD0p6TlJ9+e1fpB0Q167/++SbixeWGafzwuvLEl5Bc4FETEsX9H5CDAsIt6QNAT4OXBiRDRJug14ClgIPAiMjYgNki4Fqsj2aXgCGJwXtqrJa+CYdSp++2qWeToi3siPjwEOARry6pzdyYpVHUa2+c6SvL0SeBL4D/AJcEe+Q9OCju26Wds44ZtlNrQ6FjA7Ii5vfYGk7wCLIuLUrX9Y0qFkLxQnAeeSVXk061Q8h2+pWgf0+JznHgNOyuuRb9lPtB/ZtM5ISYPy9j0lHZjP4/eMiIXABcDw9u++2c7zHb4lKSLWSFqibIP4jWTVSLc8t0zSVWS7DpUBTcDkiHhK0unAPZKq8suvInvxeFhSN7J3Bxd2ZCxmbeUPbc3MEuEpHTOzRDjhm5klwgnfzCwRTvhmZolwwjczS4QTvplZIpzwzcwS4YRvZpaI/wJs8xvSKS6URQAAAABJRU5ErkJggg==\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = pd.DataFrame({'sqrt_rmses' : sqrt_rmses,\n", " 'log2_rmses' : log2_rmses,\n", " 'all_fx_rmses': all_fx_rmses,\n", " 'trees' : rnge})\n", 
"sns.lineplot(x='trees', y='sqrt_rmses', data=df, color='tab:blue')\n", "sns.lineplot(x='trees', y='log2_rmses', data=df, color='tab:green')\n", "sns.lineplot(x='trees', y='all_fx_rmses', data=df, color='tab:red')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable." ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Sales</th>\n", " <th>CompPrice</th>\n", " <th>Income</th>\n", " <th>Advertising</th>\n", " <th>Population</th>\n", " <th>Price</th>\n", " <th>ShelveLoc</th>\n", " <th>Age</th>\n", " <th>Education</th>\n", " <th>Urban</th>\n", " <th>US</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>9.50</td>\n", " <td>138</td>\n", " <td>73</td>\n", " <td>11</td>\n", " <td>276</td>\n", " <td>120</td>\n", " <td>0</td>\n", " <td>42</td>\n", " <td>17</td>\n", " <td>True</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>11.22</td>\n", " <td>111</td>\n", " <td>48</td>\n", " <td>16</td>\n", " <td>260</td>\n", " <td>83</td>\n", " <td>2</td>\n", " <td>65</td>\n", " <td>10</td>\n", " <td>True</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>10.06</td>\n", " <td>113</td>\n", " <td>35</td>\n", " <td>10</td>\n", " <td>269</td>\n", " <td>80</td>\n", " <td>1</td>\n", " <td>59</td>\n", " 
<td>12</td>\n", " <td>True</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>7.40</td>\n", " <td>117</td>\n", " <td>100</td>\n", " <td>4</td>\n", " <td>466</td>\n", " <td>97</td>\n", " <td>1</td>\n", " <td>55</td>\n", " <td>14</td>\n", " <td>True</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>4.15</td>\n", " <td>141</td>\n", " <td>64</td>\n", " <td>3</td>\n", " <td>340</td>\n", " <td>128</td>\n", " <td>0</td>\n", " <td>38</td>\n", " <td>13</td>\n", " <td>True</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Sales CompPrice Income Advertising Population Price ShelveLoc Age \\\n", "0 9.50 138 73 11 276 120 0 42 \n", "1 11.22 111 48 16 260 83 2 65 \n", "2 10.06 113 35 10 269 80 1 59 \n", "3 7.40 117 100 4 466 97 1 55 \n", "4 4.15 141 64 3 340 128 0 38 \n", "\n", " Education Urban US \n", "0 17 True True \n", "1 10 True True \n", "2 12 True True \n", "3 14 True True \n", "4 13 True False " ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "car_df = pd.read_csv('carseats.csv')\n", "car_df = car_df.drop(car_df.columns[0], axis=1)\n", "car_df['Urban'] = car_df['Urban'] == 'Yes'\n", "car_df['US'] = car_df['US'] == 'Yes'\n", "car_df['ShelveLoc'] = car_df['ShelveLoc'].map({'Bad' : 0, 'Medium': 1, 'Good' : 2})\n", "car_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(a) Split the data set into a training set and a test set." ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [], "source": [ "train, test = model_selection.train_test_split(car_df, test_size=0.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?" 
] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test rmse: 2.2041731241021694\n" ] } ], "source": [ "reg = tree.DecisionTreeRegressor(max_depth=3) \n", "train_x = train.drop('Sales', axis=1)\n", "train_y = train.Sales\n", "reg.fit(train_x,train_y)\n", "test_x = test.drop('Sales', axis=1)\n", "test_y = test.Sales\n", "preds = reg.predict(test_x)\n", "rmse = np.sqrt(metrics.mean_squared_error(test_y, preds))\n", "print('test rmse: {}'.format(rmse))" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n", "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n", " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n", "<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n", " -->\n", "<!-- Title: Tree Pages: 1 -->\n", "<svg width=\"512pt\" height=\"534pt\"\n", " viewBox=\"0.00 0.00 512.14 534.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n", "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 530)\">\n", "<title>Tree</title>\n", "<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-530 508.1377,-530 508.1377,4 -4,4\"/>\n", "<!-- 0 -->\n", "<g id=\"node1\" class=\"node\">\n", "<title>0</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"96.9985,-322 .0005,-322 .0005,-258 96.9985,-258 96.9985,-322\"/>\n", "<text text-anchor=\"middle\" x=\"48.4995\" y=\"-306.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[5] <= 1.5</text>\n", "<text text-anchor=\"middle\" x=\"48.4995\" y=\"-292.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 8.017</text>\n", "<text text-anchor=\"middle\" x=\"48.4995\" y=\"-278.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 320</text>\n", "<text text-anchor=\"middle\" 
x=\"48.4995\" y=\"-264.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 7.507</text>\n", "</g>\n", "<!-- 1 -->\n", "<g id=\"node2\" class=\"node\">\n", "<title>1</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"231.3545,-356 134.3564,-356 134.3564,-292 231.3545,-292 231.3545,-356\"/>\n", "<text text-anchor=\"middle\" x=\"182.8555\" y=\"-340.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[4] <= 105.5</text>\n", "<text text-anchor=\"middle\" x=\"182.8555\" y=\"-326.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 5.571</text>\n", "<text text-anchor=\"middle\" x=\"182.8555\" y=\"-312.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 247</text>\n", "<text text-anchor=\"middle\" x=\"182.8555\" y=\"-298.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 6.682</text>\n", "</g>\n", "<!-- 0->1 -->\n", "<g id=\"edge1\" class=\"edge\">\n", "<title>0->1</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M97.0051,-302.2748C105.837,-304.5098 115.1367,-306.8631 124.2243,-309.1628\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"123.4167,-312.5688 133.9698,-311.629 125.1341,-305.7827 123.4167,-312.5688\"/>\n", "<text text-anchor=\"middle\" x=\"112.4956\" y=\"-320.2297\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">True</text>\n", "</g>\n", "<!-- 8 -->\n", "<g id=\"node9\" class=\"node\">\n", "<title>8</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"232.5688,-234 133.1422,-234 133.1422,-170 232.5688,-170 232.5688,-234\"/>\n", "<text text-anchor=\"middle\" x=\"182.8555\" y=\"-218.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[4] <= 109.5</text>\n", "<text text-anchor=\"middle\" x=\"182.8555\" y=\"-204.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 6.21</text>\n", "<text text-anchor=\"middle\" x=\"182.8555\" y=\"-190.8\" 
font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 73</text>\n", "<text text-anchor=\"middle\" x=\"182.8555\" y=\"-176.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 10.297</text>\n", "</g>\n", "<!-- 0->8 -->\n", "<g id=\"edge8\" class=\"edge\">\n", "<title>0->8</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M97.0051,-258.23C106.2128,-252.1991 115.929,-245.8353 125.3832,-239.643\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"127.5221,-242.426 133.9698,-234.019 123.6867,-236.5702 127.5221,-242.426\"/>\n", "<text text-anchor=\"middle\" x=\"109.496\" y=\"-224.7167\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">False</text>\n", "</g>\n", "<!-- 2 -->\n", "<g id=\"node3\" class=\"node\">\n", "<title>2</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"364.7817,-465 272.355,-465 272.355,-401 364.7817,-401 364.7817,-465\"/>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-449.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[6] <= 48.5</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-435.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 5.253</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-421.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 89</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-407.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 8.034</text>\n", "</g>\n", "<!-- 1->2 -->\n", "<g id=\"edge2\" class=\"edge\">\n", "<title>1->2</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M222.8398,-356.1141C237.754,-368.0927 254.8425,-381.8176 270.4151,-394.3249\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"268.4125,-397.2056 278.4009,-400.7388 272.7959,-391.748 268.4125,-397.2056\"/>\n", "</g>\n", "<!-- 5 -->\n", "<g id=\"node6\" class=\"node\">\n", "<title>5</title>\n", "<polygon fill=\"none\" 
stroke=\"#000000\" points=\"367.0674,-356 270.0693,-356 270.0693,-292 367.0674,-292 367.0674,-356\"/>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-340.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[5] <= 0.5</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-326.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 4.14</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-312.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 158</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-298.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 5.92</text>\n", "</g>\n", "<!-- 1->5 -->\n", "<g id=\"edge5\" class=\"edge\">\n", "<title>1->5</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M231.4676,-324C240.6278,-324 250.298,-324 259.7314,-324\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"259.8395,-327.5001 269.8395,-324 259.8394,-320.5001 259.8395,-327.5001\"/>\n", "</g>\n", "<!-- 3 -->\n", "<g id=\"node4\" class=\"node\">\n", "<title>3</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"500.4946,-526 408.0679,-526 408.0679,-476 500.4946,-476 500.4946,-526\"/>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-510.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 3.917</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-496.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 30</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-482.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 9.395</text>\n", "</g>\n", "<!-- 2->3 -->\n", "<g id=\"edge3\" class=\"edge\">\n", "<title>2->3</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M364.8935,-456.2116C375.6916,-461.6221 387.3014,-467.4392 398.4622,-473.0315\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"397.1957,-476.3116 407.7042,-477.6622 
400.3316,-470.0533 397.1957,-476.3116\"/>\n", "</g>\n", "<!-- 4 -->\n", "<g id=\"node5\" class=\"node\">\n", "<title>4</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"500.4946,-458 408.0679,-458 408.0679,-408 500.4946,-408 500.4946,-458\"/>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-442.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 4.511</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-428.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 59</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-414.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 7.342</text>\n", "</g>\n", "<!-- 2->4 -->\n", "<g id=\"edge4\" class=\"edge\">\n", "<title>2->4</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M364.8935,-433C375.3677,-433 386.6055,-433 397.4566,-433\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"397.7042,-436.5001 407.7042,-433 397.7041,-429.5001 397.7042,-436.5001\"/>\n", "</g>\n", "<!-- 6 -->\n", "<g id=\"node7\" class=\"node\">\n", "<title>6</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"500.4946,-390 408.0679,-390 408.0679,-340 500.4946,-340 500.4946,-390\"/>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-374.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 3.626</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-360.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 47</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-346.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 4.706</text>\n", "</g>\n", "<!-- 5->6 -->\n", "<g id=\"edge6\" class=\"edge\">\n", "<title>5->6</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M367.1805,-338.6861C377.2094,-341.7159 387.8496,-344.9305 398.1212,-348.0336\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"397.2393,-351.4233 
407.8243,-350.965 399.2638,-344.7225 397.2393,-351.4233\"/>\n", "</g>\n", "<!-- 7 -->\n", "<g id=\"node8\" class=\"node\">\n", "<title>7</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"502.2549,-322 406.3076,-322 406.3076,-272 502.2549,-272 502.2549,-322\"/>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-306.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 3.469</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-292.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 111</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-278.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 6.435</text>\n", "</g>\n", "<!-- 5->7 -->\n", "<g id=\"edge7\" class=\"edge\">\n", "<title>5->7</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M367.1805,-314.3286C376.551,-312.4644 386.4553,-310.4939 396.0938,-308.5764\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"396.8837,-311.9879 406.0085,-306.6038 395.5178,-305.1224 396.8837,-311.9879\"/>\n", "</g>\n", "<!-- 9 -->\n", "<g id=\"node10\" class=\"node\">\n", "<title>9</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"368.2817,-234 268.8551,-234 268.8551,-170 368.2817,-170 368.2817,-234\"/>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-218.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[0] <= 124.0</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-204.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 3.189</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-190.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 26</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-176.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 12.212</text>\n", "</g>\n", "<!-- 8->9 -->\n", "<g id=\"edge9\" class=\"edge\">\n", "<title>8->9</title>\n", "<path fill=\"none\" 
stroke=\"#000000\" d=\"M232.6193,-202C241.1015,-202 249.9905,-202 258.7015,-202\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"258.8106,-205.5001 268.8105,-202 258.8105,-198.5001 258.8106,-205.5001\"/>\n", "</g>\n", "<!-- 12 -->\n", "<g id=\"node13\" class=\"node\">\n", "<title>12</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"364.7817,-125 272.355,-125 272.355,-61 364.7817,-61 364.7817,-125\"/>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-109.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[2] <= 11.5</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-95.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 4.729</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-81.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 47</text>\n", "<text text-anchor=\"middle\" x=\"318.5684\" y=\"-67.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 9.237</text>\n", "</g>\n", "<!-- 8->12 -->\n", "<g id=\"edge12\" class=\"edge\">\n", "<title>8->12</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M222.8398,-169.8859C237.754,-157.9073 254.8425,-144.1824 270.4151,-131.6751\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"272.7959,-134.252 278.4009,-125.2612 268.4125,-128.7944 272.7959,-134.252\"/>\n", "</g>\n", "<!-- 10 -->\n", "<g id=\"node11\" class=\"node\">\n", "<title>10</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"503.9816,-254 404.5809,-254 404.5809,-204 503.9816,-204 503.9816,-254\"/>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-238.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 2.701</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-224.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 18</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-210.8\" font-family=\"Times,serif\" font-size=\"14.00\" 
fill=\"#000000\">value = 11.612</text>\n", "</g>\n", "<!-- 9->10 -->\n", "<g id=\"edge10\" class=\"edge\">\n", "<title>9->10</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M368.3322,-211.9005C376.9066,-213.6064 385.8967,-215.3949 394.6984,-217.146\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"394.0327,-220.5821 404.5234,-219.1007 395.3986,-213.7166 394.0327,-220.5821\"/>\n", "</g>\n", "<!-- 11 -->\n", "<g id=\"node12\" class=\"node\">\n", "<title>11</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"503.9946,-186 404.5679,-186 404.5679,-136 503.9946,-136 503.9946,-186\"/>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-170.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 1.654</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-156.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 8</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-142.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 13.562</text>\n", "</g>\n", "<!-- 9->11 -->\n", "<g id=\"edge11\" class=\"edge\">\n", "<title>9->11</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M368.3322,-186.9659C376.9066,-184.3755 385.8967,-181.6595 394.6984,-179.0005\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"395.9629,-182.2748 404.5234,-176.0323 393.9385,-175.5739 395.9629,-182.2748\"/>\n", "</g>\n", "<!-- 13 -->\n", "<g id=\"node14\" class=\"node\">\n", "<title>13</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"500.4946,-118 408.0679,-118 408.0679,-68 500.4946,-68 500.4946,-118\"/>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-102.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 3.674</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-88.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 34</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-74.8\" 
font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 8.529</text>\n", "</g>\n", "<!-- 12->13 -->\n", "<g id=\"edge13\" class=\"edge\">\n", "<title>12->13</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M364.8935,-93C375.3677,-93 386.6055,-93 397.4566,-93\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"397.7042,-96.5001 407.7042,-93 397.7041,-89.5001 397.7042,-96.5001\"/>\n", "</g>\n", "<!-- 14 -->\n", "<g id=\"node15\" class=\"node\">\n", "<title>14</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"503.9816,-50 404.5809,-50 404.5809,0 503.9816,0 503.9816,-50\"/>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 2.75</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 13</text>\n", "<text text-anchor=\"middle\" x=\"454.2813\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 11.088</text>\n", "</g>\n", "<!-- 12->14 -->\n", "<g id=\"edge14\" class=\"edge\">\n", "<title>12->14</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M364.8935,-69.7884C374.6526,-64.8985 385.0747,-59.6765 395.2291,-54.5885\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"397.0802,-57.5759 404.4528,-49.967 393.9444,-51.3175 397.0802,-57.5759\"/>\n", "</g>\n", "</g>\n", "</svg>\n" ], "text/plain": [ "<graphviz.files.Source at 0x112746588>" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dot_dat = tree.export_graphviz(reg, out_file=None, rotate=True)\n", "graph = gv.Source(dot_dat)\n", "graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?\n", "\n", ".. 
omitted here: tree pruning was not readily available in scikit-learn when this was written (cost-complexity pruning via `ccp_alpha` was added in version 0.22), and in Python variance is more commonly controlled with bagging, random forests and boosting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test rmse: 1.5726550643736208\n" ] }, { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x11289b940>" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 1080x540 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# bagging: a random forest that considers every predictor at each split\n", "reg = ensemble.RandomForestRegressor(n_estimators=100, max_features=None)\n", "train_x = train.drop('Sales', axis=1)\n", "train_y = train.Sales\n", "reg.fit(train_x,train_y)\n", "test_x = test.drop('Sales', axis=1)\n", "test_y = test.Sales\n", "preds = reg.predict(test_x)\n", "rmse = np.sqrt(metrics.mean_squared_error(test_y, preds))\n", "print('test rmse: {}'.format(rmse))\n", "\n", "_,_ = plt.subplots(figsize=(15, 7.5))\n", "bar_df = pd.DataFrame({'predictor': train_x.columns, 'importance' : reg.feature_importances_})\n", "sns.barplot(x='predictor', y='importance', data=bar_df.sort_values('importance', ascending=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained."
] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test rmse: 1.7675543885549883\n" ] }, { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x112b8ca58>" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 1080x540 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "reg = ensemble.RandomForestRegressor(n_estimators=100, max_features='sqrt')\n", "train_x = train.drop('Sales', axis=1)\n", "train_y = train.Sales\n", "reg.fit(train_x,train_y)\n", "test_x = test.drop('Sales', axis=1)\n", "test_y = test.Sales\n", "preds = reg.predict(test_x)\n", "rmse = np.sqrt(metrics.mean_squared_error(test_y, preds))\n", "print('test rmse: {}'.format(rmse))\n", "\n", "_,_ = plt.subplots(figsize=(15, 7.5))\n", "bar_df = pd.DataFrame({'predictor': train_x.columns, 'importance' : reg.feature_importances_})\n", "sns.barplot(x='predictor', y='importance', data=bar_df.sort_values('importance', ascending=False))" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "rnge = np.linspace(0.1, 1, 10)\n", "\n", "sqrt_rmses = [run_cv(car_df,\n", " 10,\n", " lambda df: df.drop('Sales', axis=1),\n", " lambda df: df.Sales,\n", " lambda x, y: ensemble.RandomForestRegressor(n_estimators=100, max_features=i).fit(x,y),\n", " lambda preds, true: np.sqrt(metrics.mean_squared_error(true, preds))).mean()[0] for i in rnge]" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x1132a1940>" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "line_df = 
pd.DataFrame({'m' : rnge * (car_df.shape[1] - 1), 'rmse' : sqrt_rmses})\n", "sns.lineplot(x='m', y='rmse', data=line_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. This problem involves the OJ data set which is part of the ISLR package." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Purchase</th>\n", " <th>WeekofPurchase</th>\n", " <th>StoreID</th>\n", " <th>PriceCH</th>\n", " <th>PriceMM</th>\n", " <th>DiscCH</th>\n", " <th>DiscMM</th>\n", " <th>SpecialCH</th>\n", " <th>SpecialMM</th>\n", " <th>LoyalCH</th>\n", " <th>SalePriceMM</th>\n", " <th>SalePriceCH</th>\n", " <th>PriceDiff</th>\n", " <th>Store7</th>\n", " <th>PctDiscMM</th>\n", " <th>PctDiscCH</th>\n", " <th>ListPriceDiff</th>\n", " <th>STORE</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>237</td>\n", " <td>1</td>\n", " <td>1.75</td>\n", " <td>1.99</td>\n", " <td>0.00</td>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.500000</td>\n", " <td>1.99</td>\n", " <td>1.75</td>\n", " <td>0.24</td>\n", " <td>0</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.24</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1</td>\n", " <td>239</td>\n", " <td>1</td>\n", " <td>1.75</td>\n", " <td>1.99</td>\n", " <td>0.00</td>\n", " <td>0.3</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0.600000</td>\n", " <td>1.69</td>\n", " <td>1.75</td>\n", " <td>-0.06</td>\n", " <td>0</td>\n", " <td>0.150754</td>\n", " <td>0.000000</td>\n", " <td>0.24</td>\n", 
<td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1</td>\n", " <td>245</td>\n", " <td>1</td>\n", " <td>1.86</td>\n", " <td>2.09</td>\n", " <td>0.17</td>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.680000</td>\n", " <td>2.09</td>\n", " <td>1.69</td>\n", " <td>0.40</td>\n", " <td>0</td>\n", " <td>0.000000</td>\n", " <td>0.091398</td>\n", " <td>0.23</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " <td>227</td>\n", " <td>1</td>\n", " <td>1.69</td>\n", " <td>1.69</td>\n", " <td>0.00</td>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.400000</td>\n", " <td>1.69</td>\n", " <td>1.69</td>\n", " <td>0.00</td>\n", " <td>0</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.00</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1</td>\n", " <td>228</td>\n", " <td>7</td>\n", " <td>1.69</td>\n", " <td>1.69</td>\n", " <td>0.00</td>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.956535</td>\n", " <td>1.69</td>\n", " <td>1.69</td>\n", " <td>0.00</td>\n", " <td>1</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.00</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM \\\n", "0 1 237 1 1.75 1.99 0.00 0.0 \n", "1 1 239 1 1.75 1.99 0.00 0.3 \n", "2 1 245 1 1.86 2.09 0.17 0.0 \n", "3 0 227 1 1.69 1.69 0.00 0.0 \n", "4 1 228 7 1.69 1.69 0.00 0.0 \n", "\n", " SpecialCH SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff \\\n", "0 0 0 0.500000 1.99 1.75 0.24 \n", "1 0 1 0.600000 1.69 1.75 -0.06 \n", "2 0 0 0.680000 2.09 1.69 0.40 \n", "3 0 0 0.400000 1.69 1.69 0.00 \n", "4 0 0 0.956535 1.69 1.69 0.00 \n", "\n", " Store7 PctDiscMM PctDiscCH ListPriceDiff STORE \n", "0 0 0.000000 0.000000 0.24 1 \n", "1 0 0.150754 0.000000 0.24 1 \n", "2 0 0.000000 0.091398 0.23 1 \n", "3 0 0.000000 0.000000 0.00 1 \n", "4 1 0.000000 0.000000 0.00 0 " ] 
}, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "oj_df = pd.read_csv('oj.csv')\n", "oj_df = oj_df.drop(oj_df.columns[0], axis=1)\n", "oj_df.Purchase = oj_df.Purchase.map({'CH' : 1, 'MM': 0})\n", "oj_df.Store7 = oj_df.Store7.map({'Yes' : 1, 'No': 0})\n", "oj_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations." ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "train, test = model_selection.train_test_split(oj_df, test_size=oj_df.shape[0] - 800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train error: 0.16249999999999998\n" ] } ], "source": [ "reg = tree.DecisionTreeClassifier(max_depth=3) \n", "train_x = train.drop('Purchase', axis=1)\n", "train_y = train.Purchase\n", "reg.fit(train_x,train_y)\n", "\n", "# train error\n", "preds = reg.predict(train_x)\n", "train_error = 1 - ((train_y == preds).sum() / train_y.shape[0])\n", "print('train error: {}'.format(train_error))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed. (d) Create a plot of the tree, and interpret the results." 
] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n", "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n", " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n", "<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n", " -->\n", "<!-- Title: Tree Pages: 1 -->\n", "<svg width=\"515pt\" height=\"534pt\"\n", " viewBox=\"0.00 0.00 514.87 534.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n", "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 530)\">\n", "<title>Tree</title>\n", "<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-530 510.8721,-530 510.8721,4 -4,4\"/>\n", "<!-- 0 -->\n", "<g id=\"node1\" class=\"node\">\n", "<title>0</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"96.9985,-322 .0005,-322 .0005,-258 96.9985,-258 96.9985,-322\"/>\n", "<text text-anchor=\"middle\" x=\"48.4995\" y=\"-306.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[8] <= 0.479</text>\n", "<text text-anchor=\"middle\" x=\"48.4995\" y=\"-292.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.24</text>\n", "<text text-anchor=\"middle\" x=\"48.4995\" y=\"-278.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 800</text>\n", "<text text-anchor=\"middle\" x=\"48.4995\" y=\"-264.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.6</text>\n", "</g>\n", "<!-- 1 -->\n", "<g id=\"node2\" class=\"node\">\n", "<title>1</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"229.9976,-356 132.9995,-356 132.9995,-292 229.9976,-292 229.9976,-356\"/>\n", "<text text-anchor=\"middle\" x=\"181.4985\" y=\"-340.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[8] <= 0.345</text>\n", "<text text-anchor=\"middle\" x=\"181.4985\" y=\"-326.8\" 
font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.171</text>\n", "<text text-anchor=\"middle\" x=\"181.4985\" y=\"-312.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 288</text>\n", "<text text-anchor=\"middle\" x=\"181.4985\" y=\"-298.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.219</text>\n", "</g>\n", "<!-- 0->1 -->\n", "<g id=\"edge1\" class=\"edge\">\n", "<title>0->1</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M97.2682,-302.4673C105.5808,-304.5923 114.2921,-306.8193 122.8288,-309.0016\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"122.1804,-312.4483 132.7357,-311.5342 123.9142,-305.6664 122.1804,-312.4483\"/>\n", "<text text-anchor=\"middle\" x=\"111.2305\" y=\"-320.0828\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">True</text>\n", "</g>\n", "<!-- 8 -->\n", "<g id=\"node9\" class=\"node\">\n", "<title>8</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"229.9976,-234 132.9995,-234 132.9995,-170 229.9976,-170 229.9976,-234\"/>\n", "<text text-anchor=\"middle\" x=\"181.4985\" y=\"-218.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[8] <= 0.765</text>\n", "<text text-anchor=\"middle\" x=\"181.4985\" y=\"-204.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.151</text>\n", "<text text-anchor=\"middle\" x=\"181.4985\" y=\"-190.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 512</text>\n", "<text text-anchor=\"middle\" x=\"181.4985\" y=\"-176.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.814</text>\n", "</g>\n", "<!-- 0->8 -->\n", "<g id=\"edge8\" class=\"edge\">\n", "<title>0->8</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M96.8914,-257.981C105.8761,-252.0362 115.3382,-245.7756 124.5505,-239.6802\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"126.5113,-242.5796 132.9198,-234.1426 
122.6487,-236.7418 126.5113,-242.5796\"/>\n", "<text text-anchor=\"middle\" x=\"108.4224\" y=\"-224.9545\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">False</text>\n", "</g>\n", "<!-- 2 -->\n", "<g id=\"node3\" class=\"node\">\n", "<title>2</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"368.4346,-465 271.4365,-465 271.4365,-401 368.4346,-401 368.4346,-465\"/>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-449.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[8] <= 0.035</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-435.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.12</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-421.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 200</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-407.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.14</text>\n", "</g>\n", "<!-- 1->2 -->\n", "<g id=\"edge2\" class=\"edge\">\n", "<title>1->2</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M222.2854,-356.1141C237.4991,-368.0927 254.9306,-381.8176 270.8157,-394.3249\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"268.9397,-397.3025 278.9618,-400.7388 273.2701,-391.8027 268.9397,-397.3025\"/>\n", "</g>\n", "<!-- 5 -->\n", "<g id=\"node6\" class=\"node\">\n", "<title>5</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"366.1489,-356 273.7222,-356 273.7222,-292 366.1489,-292 366.1489,-356\"/>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-340.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[9] <= 2.04</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-326.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.24</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-312.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 
88</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-298.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.398</text>\n", "</g>\n", "<!-- 1->5 -->\n", "<g id=\"edge5\" class=\"edge\">\n", "<title>1->5</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M230.3062,-324C240.9248,-324 252.2472,-324 263.1447,-324\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"263.4258,-327.5001 273.4258,-324 263.4258,-320.5001 263.4258,-327.5001\"/>\n", "</g>\n", "<!-- 3 -->\n", "<g id=\"node4\" class=\"node\">\n", "<title>3</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"504.5859,-526 412.1592,-526 412.1592,-476 504.5859,-476 504.5859,-526\"/>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-510.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.018</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-496.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 56</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-482.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.018</text>\n", "</g>\n", "<!-- 2->3 -->\n", "<g id=\"edge3\" class=\"edge\">\n", "<title>2->3</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M368.7433,-456.9743C379.6903,-462.3514 391.3853,-468.096 402.5916,-473.6005\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"401.3441,-476.8871 411.8628,-478.1545 404.4303,-470.6041 401.3441,-476.8871\"/>\n", "</g>\n", "<!-- 4 -->\n", "<g id=\"node5\" class=\"node\">\n", "<title>4</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"506.8716,-458 409.8735,-458 409.8735,-408 506.8716,-408 506.8716,-458\"/>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-442.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.152</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-428.8\" font-family=\"Times,serif\" font-size=\"14.00\" 
fill=\"#000000\">samples = 144</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-414.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.188</text>\n", "</g>\n", "<!-- 2->4 -->\n", "<g id=\"edge4\" class=\"edge\">\n", "<title>2->4</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M368.7433,-433C378.6685,-433 389.2087,-433 399.4421,-433\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"399.539,-436.5001 409.539,-433 399.5389,-429.5001 399.539,-436.5001\"/>\n", "</g>\n", "<!-- 6 -->\n", "<g id=\"node7\" class=\"node\">\n", "<title>6</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"504.5859,-390 412.1592,-390 412.1592,-340 504.5859,-340 504.5859,-390\"/>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-374.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.185</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-360.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 53</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-346.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.245</text>\n", "</g>\n", "<!-- 5->6 -->\n", "<g id=\"edge6\" class=\"edge\">\n", "<title>5->6</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M366.4182,-337.7665C377.8324,-341.1469 390.1746,-344.8022 401.9985,-348.3041\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"401.1947,-351.7162 411.777,-351.2001 403.1826,-345.0044 401.1947,-351.7162\"/>\n", "</g>\n", "<!-- 7 -->\n", "<g id=\"node8\" class=\"node\">\n", "<title>7</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"504.5859,-322 412.1592,-322 412.1592,-272 504.5859,-272 504.5859,-322\"/>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-306.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.233</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-292.8\" font-family=\"Times,serif\" 
font-size=\"14.00\" fill=\"#000000\">samples = 35</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-278.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.629</text>\n", "</g>\n", "<!-- 5->7 -->\n", "<g id=\"edge7\" class=\"edge\">\n", "<title>5->7</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M366.4182,-314.9343C377.7183,-312.7304 389.9278,-310.3491 401.6436,-308.0641\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"402.6319,-311.4374 411.777,-306.0878 401.2919,-304.5668 402.6319,-311.4374\"/>\n", "</g>\n", "<!-- 9 -->\n", "<g id=\"node10\" class=\"node\">\n", "<title>9</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"373.8106,-234 266.0605,-234 266.0605,-170 373.8106,-170 373.8106,-234\"/>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-218.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[11] <= -0.165</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-204.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.219</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-190.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 246</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-176.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.675</text>\n", "</g>\n", "<!-- 8->9 -->\n", "<g id=\"edge9\" class=\"edge\">\n", "<title>8->9</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M230.3062,-202C238.5871,-202 247.2959,-202 255.8974,-202\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"255.9135,-205.5001 265.9135,-202 255.9135,-198.5001 255.9135,-205.5001\"/>\n", "</g>\n", "<!-- 12 -->\n", "<g id=\"node13\" class=\"node\">\n", "<title>12</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"370.3106,-125 269.5605,-125 269.5605,-61 370.3106,-61 370.3106,-125\"/>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-109.8\" 
font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">X[11] <= -0.39</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-95.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.053</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-81.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 266</text>\n", "<text text-anchor=\"middle\" x=\"319.9355\" y=\"-67.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.944</text>\n", "</g>\n", "<!-- 8->12 -->\n", "<g id=\"edge12\" class=\"edge\">\n", "<title>8->12</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M222.2854,-169.8859C237.4991,-157.9073 254.9306,-144.1824 270.8157,-131.6751\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"273.2701,-134.1973 278.9618,-125.2612 268.9397,-128.6975 273.2701,-134.1973\"/>\n", "</g>\n", "<!-- 10 -->\n", "<g id=\"node11\" class=\"node\">\n", "<title>10</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"504.5859,-254 412.1592,-254 412.1592,-204 504.5859,-204 504.5859,-254\"/>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-238.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.167</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-224.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 33</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-210.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.212</text>\n", "</g>\n", "<!-- 9->10 -->\n", "<g id=\"edge10\" class=\"edge\">\n", "<title>9->10</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M373.8553,-212.5162C383.1104,-214.3213 392.7496,-216.2012 402.0653,-218.0181\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"401.5283,-221.4793 412.0134,-219.9584 402.8683,-214.6087 401.5283,-221.4793\"/>\n", "</g>\n", "<!-- 11 -->\n", "<g id=\"node12\" class=\"node\">\n", 
"<title>11</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"506.8716,-186 409.8735,-186 409.8735,-136 506.8716,-136 506.8716,-186\"/>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-170.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.189</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-156.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 213</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-142.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.746</text>\n", "</g>\n", "<!-- 9->11 -->\n", "<g id=\"edge11\" class=\"edge\">\n", "<title>9->11</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M373.8553,-186.0309C382.4603,-183.4824 391.3974,-180.8356 400.0978,-178.2589\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"401.1955,-181.5841 409.7899,-175.3884 399.2077,-174.8723 401.1955,-181.5841\"/>\n", "</g>\n", "<!-- 13 -->\n", "<g id=\"node14\" class=\"node\">\n", "<title>13</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"503.3716,-118 413.3735,-118 413.3735,-68 503.3716,-68 503.3716,-118\"/>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-102.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.21</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-88.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 10</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-74.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.7</text>\n", "</g>\n", "<!-- 12->13 -->\n", "<g id=\"edge13\" class=\"edge\">\n", "<title>12->13</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M370.6983,-93C381.2245,-93 392.3649,-93 403.0452,-93\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"403.109,-96.5001 413.109,-93 403.1089,-89.5001 403.109,-96.5001\"/>\n", "</g>\n", "<!-- 14 -->\n", "<g id=\"node15\" 
class=\"node\">\n", "<title>14</title>\n", "<polygon fill=\"none\" stroke=\"#000000\" points=\"506.8716,-50 409.8735,-50 409.8735,0 506.8716,0 506.8716,-50\"/>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">mse = 0.045</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 256</text>\n", "<text text-anchor=\"middle\" x=\"458.3726\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = 0.953</text>\n", "</g>\n", "<!-- 12->14 -->\n", "<g id=\"edge14\" class=\"edge\">\n", "<title>12->14</title>\n", "<path fill=\"none\" stroke=\"#000000\" d=\"M370.6983,-68.0654C380.2888,-63.3546 390.3892,-58.3933 400.187,-53.5807\"/>\n", "<polygon fill=\"#000000\" stroke=\"#000000\" points=\"402.0288,-56.5754 409.4614,-49.0251 398.9426,-50.2925 402.0288,-56.5754\"/>\n", "</g>\n", "</g>\n", "</svg>\n" ], "text/plain": [ "<graphviz.files.Source at 0x1127460b8>" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dot_dat = tree.export_graphviz(reg, out_file=None, rotate=True)\n", "graph = gv.Source(dot_dat)\n", "graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?" 
] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>75</td>\n", " <td>22</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>27</td>\n", " <td>146</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1\n", "0 75 22\n", "1 27 146" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_x = test.drop('Purchase', axis=1)\n", "test_y = test.Purchase\n", "preds = reg.predict(test_x)\n", "conf = pd.DataFrame(metrics.confusion_matrix(test_y, preds))\n", "conf" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test error rate: 0.18148148148148147\n" ] } ], "source": [ "print('test error rate: {}'.format(1 - ((conf[0][0] + conf[1][1]) / test_y.shape[0])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.\n", "\n", "scikit-learn has no direct equivalent of R's `cv.tree()`, so parts (g) and (h) cross-validate over `max_depth` as a stand-in for tree size." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis. (h) Which tree size corresponds to the lowest cross-validated classification error rate?" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# cross-validated misclassification rate for each max_depth (a stand-in for tree size)\n", "rnge = range(1, 50)\n", "\n", "errors = [run_cv(oj_df,\n", " 5,\n", " lambda df: df.drop('Purchase', axis=1),\n", " lambda df: df.Purchase,\n", " lambda x, y: tree.DecisionTreeClassifier(max_depth=i).fit(x, y),\n", " lambda preds, true: 1 - ((preds == true).sum() / preds.shape[0])).mean()[0] for i in rnge]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x1162366d8>" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "line_df = pd.DataFrame({'max_depth': rnge, 'error': errors})\n", "sns.lineplot(x='max_depth', y='error', data=line_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes. (j) Compare the training error rates between the pruned and unpruned trees. Which is higher?\n", "(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?\n", "\n", "Not attempted: scikit-learn has no direct equivalent of R's tree-pruning functions. Releases 0.22 and later do support cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "10. We now use boosting to predict Salary in the Hitters data set." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>AtBat</th>\n", " <th>Hits</th>\n", " <th>HmRun</th>\n", " <th>Runs</th>\n", " <th>RBI</th>\n", " <th>Walks</th>\n", " <th>Years</th>\n", " <th>CAtBat</th>\n", " <th>CHits</th>\n", " <th>CHmRun</th>\n", " <th>CRuns</th>\n", " <th>CRBI</th>\n", " <th>CWalks</th>\n", " <th>League</th>\n", " <th>Division</th>\n", " <th>PutOuts</th>\n", " <th>Assists</th>\n", " <th>Errors</th>\n", " <th>Salary</th>\n", " <th>NewLeague</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>315</td>\n", " <td>81</td>\n", " <td>7</td>\n", " <td>24</td>\n", " <td>38</td>\n", " <td>39</td>\n", " <td>14</td>\n", " <td>3449</td>\n", " <td>835</td>\n", " <td>69</td>\n", " <td>321</td>\n", " <td>414</td>\n", " <td>375</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>632</td>\n", " <td>43</td>\n", " <td>10</td>\n", " <td>6.163315</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>479</td>\n", " <td>130</td>\n", " <td>18</td>\n", " <td>66</td>\n", " <td>72</td>\n", " <td>76</td>\n", " <td>3</td>\n", " <td>1624</td>\n", " <td>457</td>\n", " <td>63</td>\n", " <td>224</td>\n", " <td>266</td>\n", " <td>263</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>880</td>\n", " <td>82</td>\n", " <td>14</td>\n", " <td>6.173786</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>496</td>\n", " <td>141</td>\n", " <td>20</td>\n", " <td>65</td>\n", " <td>78</td>\n", " <td>37</td>\n", " <td>11</td>\n", " <td>5628</td>\n", " <td>1575</td>\n", " <td>225</td>\n", " 
<td>828</td>\n", " <td>838</td>\n", " <td>354</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>200</td>\n", " <td>11</td>\n", " <td>3</td>\n", " <td>6.214608</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>321</td>\n", " <td>87</td>\n", " <td>10</td>\n", " <td>39</td>\n", " <td>42</td>\n", " <td>30</td>\n", " <td>2</td>\n", " <td>396</td>\n", " <td>101</td>\n", " <td>12</td>\n", " <td>48</td>\n", " <td>46</td>\n", " <td>33</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>805</td>\n", " <td>40</td>\n", " <td>4</td>\n", " <td>4.516339</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>594</td>\n", " <td>169</td>\n", " <td>4</td>\n", " <td>74</td>\n", " <td>51</td>\n", " <td>35</td>\n", " <td>11</td>\n", " <td>4408</td>\n", " <td>1133</td>\n", " <td>19</td>\n", " <td>501</td>\n", " <td>336</td>\n", " <td>194</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>282</td>\n", " <td>421</td>\n", " <td>25</td>\n", " <td>6.620073</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns \\\n", "1 315 81 7 24 38 39 14 3449 835 69 321 \n", "2 479 130 18 66 72 76 3 1624 457 63 224 \n", "3 496 141 20 65 78 37 11 5628 1575 225 828 \n", "4 321 87 10 39 42 30 2 396 101 12 48 \n", "5 594 169 4 74 51 35 11 4408 1133 19 501 \n", "\n", " CRBI CWalks League Division PutOuts Assists Errors Salary \\\n", "1 414 375 1 1 632 43 10 6.163315 \n", "2 266 263 0 1 880 82 14 6.173786 \n", "3 838 354 1 0 200 11 3 6.214608 \n", "4 46 33 1 0 805 40 4 4.516339 \n", "5 336 194 0 1 282 421 25 6.620073 \n", "\n", " NewLeague \n", "1 1 \n", "2 0 \n", "3 1 \n", "4 1 \n", "5 0 " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hit_df = pd.read_csv('hitters.csv')\n", "hit_df = hit_df.drop(hit_df.columns[0], axis = 1)\n", "hit_df = hit_df.dropna()\n", "hit_df.Salary = np.log(hit_df.Salary)\n", "hit_df.League = 
hit_df.League.map({'N':1, 'A':0})\n", "hit_df.Division = hit_df.Division.map({'W':1, 'E':0})\n", "hit_df.NewLeague = hit_df.NewLeague.map({'N':1, 'A':0})\n", "hit_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "train, test = model_selection.train_test_split(hit_df, test_size=hit_df.shape[0] - 200)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(c) Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "def hitters_boosting_test_error(lr):\n", " train_x = train.drop('Salary', axis=1)\n", " train_y = train.Salary\n", " reg = ensemble.GradientBoostingRegressor(n_estimators=1000, learning_rate=lr).fit(train_x, train_y)\n", " test_x = test.drop('Salary', axis=1)\n", " test_y = test.Salary\n", " preds = reg.predict(test_x)\n", " return np.sqrt(metrics.mean_squared_error(test_y, preds))\n", "\n", "lambdas = np.linspace(0.01, 1, 100)\n", "test_errors = [hitters_boosting_test_error(lr) for lr in lambdas]" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "def hitters_boosting_training_error(lr):\n", " train_x = train.drop('Salary', axis=1)\n", " train_y = train.Salary\n", " reg = ensemble.GradientBoostingRegressor(n_estimators=1000, learning_rate=lr).fit(train_x, train_y)\n", " preds = reg.predict(train_x)\n", " return np.sqrt(metrics.mean_squared_error(train_y, preds))\n", "\n", "lambdas = np.linspace(0.01, 1, 100)\n", "train_errors = [hitters_boosting_training_error(lr) for lr in lambdas]" ] }, { "cell_type": "code", 
"execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x1167bed68>" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "line_df = pd.DataFrame({'test_error': test_errors,\n", " 'train_error': train_errors,\n", " 'lambda' : lambdas})\n", "sns.lineplot(x='lambda', y='test_error', data=line_df, color='tab:blue')\n", "sns.lineplot(x='lambda', y='train_error', data=line_df, color='tab:red')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(e) Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5973063683221087" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_x = train.drop('Salary', axis=1)\n", "train_y = train.Salary\n", "reg = linear_model.Ridge(alpha=10000).fit(train_x, train_y)\n", "test_x = test.drop('Salary', axis=1)\n", "test_y = test.Salary\n", "preds = reg.predict(test_x)\n", "np.sqrt(metrics.mean_squared_error(test_y, preds))" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5953299537143887" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_x = train.drop('Salary', axis=1)\n", "train_y = train.Salary\n", "reg = linear_model.Lasso(alpha=0.5).fit(train_x, train_y)\n", "test_x = test.drop('Salary', axis=1)\n", "test_y = test.Salary\n", "preds = reg.predict(test_x)\n", "np.sqrt(metrics.mean_squared_error(test_y, preds))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(f) Which variables appear to be the most important predictors in the 
boosted model?" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x1173d5518>" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 1080x540 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "train_x = train.drop('Salary', axis=1)\n", "train_y = train.Salary\n", "reg = ensemble.GradientBoostingRegressor(n_estimators=1000, learning_rate=0.2).fit(train_x, train_y)\n", "\n", "# plot\n", "_,_ = plt.subplots(figsize=(15, 7.5))\n", "bar_df = pd.DataFrame({'predictor': train_x.columns, 'importance' : reg.feature_importances_})\n", "sns.barplot(x='predictor', y='importance', data=bar_df.sort_values('importance', ascending=False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(g) Now apply bagging to the training set. What is the test set MSE for this approach?" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4969384424303783" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_x = train.drop('Salary', axis=1)\n", "train_y = train.Salary\n", "reg = ensemble.RandomForestRegressor(n_estimators=100, max_features=None).fit(train_x,train_y)\n", "test_x = test.drop('Salary', axis=1)\n", "test_y = test.Salary\n", "preds = reg.predict(test_x)\n", "np.sqrt(metrics.mean_squared_error(test_y, preds))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "11. This question uses the Caravan data set." 
] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>MOSTYPE</th>\n", " <th>MAANTHUI</th>\n", " <th>MGEMOMV</th>\n", " <th>MGEMLEEF</th>\n", " <th>MOSHOOFD</th>\n", " <th>MGODRK</th>\n", " <th>MGODPR</th>\n", " <th>MGODOV</th>\n", " <th>MGODGE</th>\n", " <th>MRELGE</th>\n", " <th>...</th>\n", " <th>APERSONG</th>\n", " <th>AGEZONG</th>\n", " <th>AWAOREG</th>\n", " <th>ABRAND</th>\n", " <th>AZEILPL</th>\n", " <th>APLEZIER</th>\n", " <th>AFIETS</th>\n", " <th>AINBOED</th>\n", " <th>ABYSTAND</th>\n", " <th>Purchase</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>33</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>2</td>\n", " <td>8</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>7</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>37</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>8</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>6</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>37</td>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>2</td>\n", " <td>8</td>\n", " <td>0</td>\n", " <td>4</td>\n", " <td>2</td>\n", " <td>4</td>\n", " 
<td>3</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>9</td>\n", " <td>1</td>\n", " <td>3</td>\n", " <td>3</td>\n", " <td>3</td>\n", " <td>2</td>\n", " <td>3</td>\n", " <td>2</td>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>40</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>2</td>\n", " <td>10</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>1</td>\n", " <td>4</td>\n", " <td>7</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 86 columns</p>\n", "</div>" ], "text/plain": [ " MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR MGODOV \\\n", "0 33 1 3 2 8 0 5 1 \n", "1 37 1 2 2 8 1 4 1 \n", "2 37 1 2 2 8 0 4 2 \n", "3 9 1 3 3 3 2 3 2 \n", "4 40 1 4 2 10 1 4 1 \n", "\n", " MGODGE MRELGE ... APERSONG AGEZONG AWAOREG ABRAND AZEILPL \\\n", "0 3 7 ... 0 0 0 1 0 \n", "1 4 6 ... 0 0 0 1 0 \n", "2 4 3 ... 0 0 0 1 0 \n", "3 4 5 ... 0 0 0 1 0 \n", "4 4 7 ... 
0 0 0 1 0 \n", "\n", " APLEZIER AFIETS AINBOED ABYSTAND Purchase \n", "0 0 0 0 0 0 \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "\n", "[5 rows x 86 columns]" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "van_df = pd.read_csv('caravan.csv')\n", "van_df = van_df.drop(van_df.columns[0], axis=1)\n", "van_df.Purchase = van_df.Purchase.map({'Yes':1, 'No': 0})\n", "van_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations." ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [], "source": [ "train,test = model_selection.train_test_split(van_df, test_size=van_df.shape[0] - 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?" 
] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "classification error: 0.06574035669846534\n" ] }, { "data": { "text/plain": [ "<matplotlib.axes._subplots.AxesSubplot at 0x11b7680f0>" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "<Figure size 1080x540 with 1 Axes>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "train_x = train.drop('Purchase', axis=1)\n", "train_y = train.Purchase\n", "clr = ensemble.GradientBoostingClassifier(n_estimators=1000, learning_rate=0.01).fit(train_x, train_y)\n", "test_x = test.drop('Purchase', axis=1)\n", "test_y = test.Purchase\n", "preds = clr.predict(test_x)\n", "error = 1 - ((preds == test_y).sum() / preds.shape[0])\n", "\n", "print('classification error: {}'.format(error))\n", "\n", "\n", "# plot\n", "_,_ = plt.subplots(figsize=(15, 7.5))\n", "bar_df = pd.DataFrame({'predictor': train_x.columns, 'importance' : clr.feature_importances_})\n", "sns.barplot(x='predictor', y='importance', data=bar_df.sort_values('importance', ascending=False).iloc[0:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict a purchase when the estimated probability exceeds 20%\n", "preds = (clr.predict_proba(test_x)[:,1] > 0.2)\n", "# Rows are actual classes, columns are predicted classes\n", "conf_mtrx = pd.DataFrame(metrics.confusion_matrix(test_y, preds))\n", "\n", "# conf_mtrx[col][row]: column 1 holds everyone predicted to purchase (FP + TP)\n", "predicted_true = conf_mtrx[1][0] + conf_mtrx[1][1]\n", "print('Positive Predictive Value: {}'.format(conf_mtrx[1][1] / predicted_true))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Logistic regression with the same 20% threshold, for comparison\n", "logit = linear_model.LogisticRegression().fit(train_x, train_y)\n", "preds = (logit.predict_proba(test_x)[:,1] > 0.2)\n", "conf_mtrx = pd.DataFrame(metrics.confusion_matrix(test_y, preds))\n", "predicted_true = conf_mtrx[1][0] + conf_mtrx[1][1]\n", "print('Positive Predictive Value: {}'.format(conf_mtrx[1][1] / predicted_true))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# KNN (k=5) with the same 20% threshold, for comparison\n", "knn = neighbors.KNeighborsClassifier(n_neighbors=5).fit(train_x, train_y)\n", "preds = (knn.predict_proba(test_x)[:,1] > 0.2)\n", "conf_mtrx = pd.DataFrame(metrics.confusion_matrix(test_y, preds))\n", "predicted_true = conf_mtrx[1][0] + conf_mtrx[1][1]\n", "print('Positive Predictive Value: {}'.format(conf_mtrx[1][1] / predicted_true))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "12. Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: Come back for this one with a kaggle dataset and take XGBoost out for a spin" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }