{"cells":[{"cell_type":"markdown","metadata":{"_cell_guid":"675c904d-a805-4f18-b346-99edb3fb4259","_uuid":"6a777a9cc9ad77dfb593f31f95ca38eb3532e0a0"},"source":["# Gradient boosting\n"]},{"cell_type":"markdown","metadata":{"_cell_guid":"1c292fca-13b3-4a89-8c96-879c2f149489","_uuid":"8473bd95a68d5a4f4d2086df416ef5687651ce42"},"source":["The logic behind gradient boosting is simple.\n","One of the basic assumptions of linear regression is that the sum of its residuals is 0, i.e. the residuals should be scattered randomly around zero. Tree-based models are not built on any such assumption, but the same logic suggests that if the sum of the residuals is not 0, then there is most probably some pattern left in the residuals of our model that can be leveraged to make the model better.\n","So the intuition behind the gradient boosting algorithm is to leverage the pattern in the residuals to strengthen a weak prediction model, repeating the process until the residuals become randomly distributed. Once we reach a stage where the residuals no longer contain any pattern that could be modeled, we can stop modeling them. 
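\n"]},{"cell_type":"markdown","metadata":{},"source":["The idea can be sketched in a few lines: fit a weak model, fit a second model to its residuals, and add the two predictions. Below is a minimal illustration using scikit-learn's `DecisionTreeRegressor` as the weak learner; the data and models here are placeholders for illustration, not the from-scratch implementation used later in this notebook.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import numpy as np\n","from sklearn.tree import DecisionTreeRegressor\n","\n","rng = np.random.RandomState(0)\n","X = np.linspace(0, 10, 100)[:, None]\n","y_demo = np.sin(X).ravel() + rng.normal(0, 0.1, 100)\n","\n","# round 1: a depth-1 tree (stump) is a weak learner\n","weak = DecisionTreeRegressor(max_depth=1).fit(X, y_demo)\n","pred1 = weak.predict(X)\n","res = y_demo - pred1  # residuals of the weak model\n","\n","# round 2: fit a second stump to the residuals and add its prediction\n","res_model = DecisionTreeRegressor(max_depth=1).fit(X, res)\n","pred2 = pred1 + res_model.predict(X)\n","\n","# the boosted prediction fits better than the weak one alone\n","print(np.mean((y_demo - pred1)**2), np.mean((y_demo - pred2)**2))"]},{"cell_type":"markdown","metadata":{},"source":["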
Algorithmically, we are minimizing the loss function so that the test loss reaches its minimum."]},{"cell_type":"code","execution_count":null,"metadata":{"_cell_guid":"6ff32c7c-669f-4cf6-8bee-8950833f9369","_uuid":"92321ddcadf065a1e813930b77fe3ccc9fe577ef","collapsed":true,"jupyter":{"outputs_hidden":true}},"outputs":[],"source":["%matplotlib inline\n","\n","import math\n","import pandas as pd\n","import numpy as np\n","import matplotlib.pyplot as plt\n","from IPython.display import display\n","from fastai.imports import *\n","from sklearn import metrics"]},{"cell_type":"code","execution_count":null,"metadata":{"_cell_guid":"68c06d01-6cee-4170-8b87-5b13398df4d0","_uuid":"2dcf0abbdc331a221736c0ec67204a6b8001c75c","collapsed":true,"jupyter":{"outputs_hidden":true}},"outputs":[],"source":["def std_agg(cnt, s1, s2):\n","    # standard deviation from a running count, sum and sum of squares\n","    return math.sqrt(max(0., (s2/cnt) - (s1/cnt)**2))\n","\n","class DecisionTree():\n","    def __init__(self, x, y, idxs=None, min_leaf=2):\n","        if idxs is None: idxs = np.arange(len(y))\n","        self.x,self.y,self.idxs,self.min_leaf = x,y,idxs,min_leaf\n","        self.n,self.c = len(idxs), x.shape[1]\n","        self.val = np.mean(y[idxs])  # node prediction = mean of its targets\n","        self.score = float('inf')\n","        self.find_varsplit()\n","\n","    def find_varsplit(self):\n","        for i in range(self.c): self.find_better_split(i)\n","        if self.score == float('inf'): return  # leaf: no split improved the score\n","        x = self.split_col\n","        lhs = np.nonzero(x<=self.split)[0]\n","        rhs = np.nonzero(x>self.split)[0]\n","        self.lhs = DecisionTree(self.x, self.y, self.idxs[lhs], self.min_leaf)\n","        self.rhs = DecisionTree(self.x, self.y, self.idxs[rhs], self.min_leaf)\n","\n","    def find_better_split(self, var_idx):\n","        x,y = self.x.values[self.idxs,var_idx], self.y[self.idxs].ravel()\n","        sort_idx = np.argsort(x)\n","        sort_y,sort_x = y[sort_idx], x[sort_idx]\n","        rhs_cnt,rhs_sum,rhs_sum2 = self.n, sort_y.sum(), (sort_y**2).sum()\n","        lhs_cnt,lhs_sum,lhs_sum2 = 0,0.,0.\n","\n","        for i in range(0,self.n-self.min_leaf-1):\n","            xi,yi = sort_x[i],sort_y[i]\n","            lhs_cnt += 1; rhs_cnt -= 1\n","            lhs_sum += yi; rhs_sum -= yi\n","            lhs_sum2 += yi**2; rhs_sum2 -= yi**2\n","            if i<self.min_leaf or xi==sort_x[i+1]:\n","                continue\n","            lhs_std = std_agg(lhs_cnt, lhs_sum, lhs_sum2)\n","            rhs_std = std_agg(rhs_cnt, rhs_sum, rhs_sum2)\n","            curr_score = lhs_std*lhs_cnt + rhs_std*rhs_cnt  # weighted std measures split quality\n","            if curr_score<self.score:\n","                self.var_idx,self.score,self.split = var_idx,curr_score,xi\n","\n","    @property\n","    def split_col(self): return self.x.values[self.idxs,self.var_idx]"]},{"cell_type":"markdown","metadata":{},"source":["First, simulate one-dimensional data with an obvious pattern: five blocks of uniform noise at different levels."]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["x = np.arange(0,50)\n","x = pd.DataFrame({'x':x})\n","y1 = np.random.uniform(10,15,10)\n","y2 = np.random.uniform(20,25,10)\n","y3 = np.random.uniform(0,5,10)\n","y4 = np.random.uniform(30,32,10)\n","y5 = np.random.uniform(13,17,10)\n","y = np.concatenate((y1,y2,y3,y4,y5))\n","y = y[:,None]  # shape (n,1)"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["xi = x  # initialization of input\n","yi = y  # initialization of target\n","# x,y --> use where no need to change original y\n","ei = 0  # initialization of 
error\n","n = len(yi)  # number of rows\n","predf = 0  # initial prediction is 0\n","\n","for i in range(30):  # like n_estimators\n","    tree = DecisionTree(xi, yi)\n","    tree.find_better_split(0)\n","\n","    left_idx = np.where(xi <= tree.split)[0]\n","    right_idx = np.where(xi > tree.split)[0]\n","\n","    # one-step prediction: mean of y on each side of the split\n","    predi = np.zeros(n)\n","    predi[left_idx] = np.mean(yi[left_idx])  # left side mean y\n","    predi[right_idx] = np.mean(yi[right_idx])  # right side mean y\n","\n","    predi = predi[:,None]  # reshape to (n,1) so it is compatible with y\n","    predf = predf + predi  # final prediction = previous prediction + new prediction of the residual\n","\n","    ei = y - predf  # residuals are always measured against the original y\n","    yi = ei  # the next tree is fit on the residuals\n","\n","    # plotting after prediction\n","    xa = np.array(x.x)  # the column name of x is x\n","    order = np.argsort(xa)\n","    xs = np.array(xa)[order]\n","    ys = np.array(predf)[order]\n","\n","    f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(13,2.5))\n","\n","    ax1.plot(x, y, 'o')\n","    ax1.plot(xs, ys, 'r')\n","    ax1.set_title(f'Prediction (Iteration {i+1})')\n","    ax1.set_xlabel('x')\n","    ax1.set_ylabel('y / y_pred')\n","\n","    ax2.plot(x, ei, 'go')\n","    ax2.set_title(f'Residuals vs. x (Iteration {i+1})')\n","    ax2.set_xlabel('x')\n","    ax2.set_ylabel('Residuals')"]},{"cell_type":"markdown","metadata":{"_cell_guid":"830446c1-b084-4181-a94b-4a73a9fde3d9","_uuid":"2aea693edf995214b9d1b06133355a08559f436f"},"source":["The error stops changing much after roughly the `20th iteration`, and the pattern in the residuals has been removed. 
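\n"]},{"cell_type":"markdown","metadata":{},"source":["In practice, gradient boosting implementations add each residual fit scaled by a learning rate (shrinkage), which typically needs more rounds but generalizes better. Below is a minimal sketch of the same loop with shrinkage, using scikit-learn's `DecisionTreeRegressor` as the base learner; the data, `alpha`, and round count are illustrative assumptions, not values used elsewhere in this notebook.\n"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["import numpy as np\n","from sklearn.tree import DecisionTreeRegressor\n","\n","rng = np.random.RandomState(1)\n","X_s = np.arange(50)[:, None].astype(float)\n","y_s = np.concatenate([rng.uniform(10, 15, 25), rng.uniform(25, 30, 25)])\n","\n","alpha = 0.3  # learning rate / shrinkage\n","pred_s = np.zeros(len(y_s))\n","for _ in range(100):\n","    res_s = y_s - pred_s  # current residuals\n","    stump = DecisionTreeRegressor(max_depth=1).fit(X_s, res_s)\n","    pred_s += alpha * stump.predict(X_s)  # add a scaled residual fit\n","\n","print(np.mean((y_s - pred_s)**2))"]},{"cell_type":"markdown","metadata":{},"source":["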
The residuals now look randomly scattered around zero, with no pattern left to model."]},{"cell_type":"markdown","metadata":{"_cell_guid":"3e91d738-d854-46eb-8b8f-28b0f91498a1","_uuid":"3cf63deb3ca3825a659dd375637cde764b9ffe9a"},"source":["### Maths behind this logic"]},{"cell_type":"markdown","metadata":{"_cell_guid":"e05cba5f-63d1-4788-9fa8-1f7152259868","_uuid":"8bf984fc53ee25f711881ed6a6c12b9370f344d5"},"source":["$ Predictions = y_i^p $ \n","$ Loss = L(y_i, y_i^p) $ \n","$ Loss = MSE = \\sum_i {(y_i - y_i^p)}^2 $ \n","Gradient descent adjusts each prediction in the direction of the negative gradient of the loss:\n","$ y_i^p := y_i^p - \\alpha \\cdot \\partial {L(y_i, y_i^p)}/ \\partial{y_i^p} $ \n","$ y_i^p := y_i^p - \\alpha \\cdot \\partial {\\sum_j {(y_j - y_j^p)}^2}/ \\partial{y_i^p} $ \n","$ y_i^p := y_i^p + 2\\alpha {(y_i - y_i^p)} $ "]},{"cell_type":"markdown","metadata":{"_cell_guid":"10d2ffe3-5fb5-4bc2-811c-afdda6ec9cb2","_uuid":"d19995fecbe2f1b9e1171aff6e8c92790d09d073"},"source":["where $y_i$ is the $i$th target value, $y_i^p$ is the $i$th prediction, $L(y_i, y_i^p)$ is the loss function, and $\\alpha$ is the learning rate. Only the $i$th term of the sum depends on $y_i^p$, and its derivative is $-2(y_i - y_i^p)$, so moving against the gradient means adding a multiple of the residual $y_i - y_i^p$ to the prediction. This is exactly what we did above: we adjusted our predictions using a fit on the residuals, adjusting the value of $\\alpha$ accordingly."]},{"cell_type":"markdown","metadata":{"_cell_guid":"f5c3e03b-b0b3-4a1c-aa7b-cc2e0a63b501","_uuid":"3ed43b846ff5d7b868a0b828a4f0511aa506aeb8","collapsed":true,"jupyter":{"outputs_hidden":true}},"source":["## Acknowledgments\n","\n","Thanks to [Prince Grover](https://www.kaggle.com/grroverpr) for creating the open-source course [gradient-boosting-simplified](https://www.kaggle.com/code/grroverpr/gradient-boosting-simplified/notebook). 
It inspired the majority of the content in this chapter."]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.3"}},"nbformat":4,"nbformat_minor":4}