{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Programming Exercise 2: Logistic Regression\n", "#### Author - Rishabh Jain" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import pandas as pd\n", "import numpy as np\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1    Logistic Regression\n", "\n", "##### Problem Statement\n", "In this part of the exercise, we will build a logistic regression model to predict whether a student is admitted to a university. \n", "\n", "Suppose that we are the administrator of a department and want to determine each applicant's chance of admission based on their results in two exams. We have historical data from previous applicants that we can use as a training set for logistic regression. For each applicant, we have their scores in two exams and their admission decision.\n", "\n", "Our task is to build a classification model that estimates an applicant's probability of admission based on the scores from those two exams." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TRAINING DATASET SHAPE : 100 X 3\n" ] }, { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>exam1</th><th>exam2</th><th>decision</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>3</th><td>60.182599</td><td>86.308552</td><td>1</td></tr>\n",
"    <tr><th>37</th><td>64.176989</td><td>80.908061</td><td>1</td></tr>\n",
"    <tr><th>79</th><td>82.226662</td><td>42.719879</td><td>0</td></tr>\n",
"    <tr><th>85</th><td>68.468522</td><td>85.594307</td><td>1</td></tr>\n",
"    <tr><th>64</th><td>44.668262</td><td>66.450086</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " exam1 exam2 decision\n", "3 60.182599 86.308552 1\n", "37 64.176989 80.908061 1\n", "79 82.226662 42.719879 0\n", "85 68.468522 85.594307 1\n", "64 44.668262 66.450086 0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data=pd.read_csv('./ex2data1.csv')\n", "print(f'TRAINING DATASET SHAPE : {data.shape[0]} X {data.shape[1]}')\n", "data.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1    Visualizing the data\n", "Before starting to implement any learning algorithms, it is always good to visualize the data if possible." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.scatterplot(x='exam1',y='exam2',data=data[data['decision']==0],label='Not Admitted');\n", "sns.scatterplot(x='exam1',y='exam2',data=data[data['decision']==1],label='Admitted');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2    Implementation\n", "We will add a column of ones to the design matrix to accommodate the $\\theta_0$ intercept term. We will also initialize the fitting parameters to 0." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X : (100, 3)\n", "y : (100, 1)\n", "theta : (1, 3)\n" ] } ], "source": [ "m=data.shape[0]\n", "X=np.ones(shape=(m,3))\n", "X[:,1:]=data.values[:,:2]\n", "print(f'X : {X.shape}')\n", "\n", "y=data.values[:,2]\n", "y=y[:,np.newaxis]\n", "print(f'y : {y.shape}')\n", "\n", "theta=np.zeros(shape=(1,X.shape[1]))\n", "print(f'theta : {theta.shape}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 1.2.1    Sigmoid/Logistic Function\n", "\n", "Before starting with the implementation, let's try to understand the sigmoid function. \n", "For large positive values of $x$, the sigmoid function is close to 1, while for large negative values it is close to 0. The sigmoid of 0 is exactly 0.5.\n", "\n", "$$ \\sigma(x)=\\frac{1}{1+e^{-x}} $$" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def sigmoid(z):\n", "    sigma=1/(1+np.exp(-z))\n", "    return sigma" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df=pd.DataFrame({'X':np.linspace(-10,10,100)})\n", "df['Y']=sigmoid(df['X'])\n", "sns.lineplot(x='X',y='Y',data=df,label='SIGMOID CURVE');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Model Representation\n", "\n", "The hypothesis in Linear Regression (i.e. $h_\\theta(x)=\\theta^Tx$) cannot be used for classification problems because its output ranges from $-\\infty$ to $\\infty$, whereas $y$ (i.e. the target variable) in a classification problem is either 0 or 1. Hence the range of the hypothesis output in Logistic Regression should be:\n", "\n", "$$ 0\\le h_\\theta(x) \\le 1 $$\n", "\n", "In order to keep $h_\\theta(x)$ in the range mentioned above, the sigmoid function comes to the rescue:\n", "\n", "$$ h_\\theta(x)=g(\\theta^Tx) $$\n", "$$ g(z)=\\frac{1}{1+e^{-z}} $$\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def predict(X,theta):\n", "    '''Predicts by applying logistic function on linear model'''\n", "    z=np.dot(X,theta.T)\n", "    h=sigmoid(z)\n", "    return h" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 1.2.2    Cost function\n", "We learnt about the cost function $J(\\theta)$ in **Linear Regression**. The cost function represents the optimization objective, i.e. we create a cost function and minimize it to develop an accurate model with minimum error.\n", "$$ J(\\theta)=\\frac{1}{2m}\\sum_{i=1}^m(h_\\theta(x^{(i)})-y^{(i)})^2 $$ \n", "On replacing $h_\\theta(x)$ with $\\frac{1}{1+e^{(-\\theta^Tx)}}$ in the above equation, $J(\\theta)$ becomes '**Non Convex**', so gradient descent is not guaranteed to converge to the global minimum. So we need a new cost function for Logistic Regression. 
\n", "\n", "\n", "\n", "The figure on the left is a '**Convex**' function: it has a single minimum, the global minimum, and gradient descent will always converge to it. The figure on the right is a '**Non Convex**' function: it has multiple minima, called local minima, and there is no guarantee that gradient descent will converge to the global one.\n", "\n", "For **Logistic Regression** the cost function is defined as: \n", "\n", "$$\\begin{equation}\n", " Cost(h_\\theta(x),y)=\\begin{cases}\n", " -\\log(h_\\theta(x)) & \\text{if }y=1 \\\\\n", " -\\log(1-h_\\theta(x)) & \\text{if }y=0\n", " \\end{cases}\n", "\\end{equation}$$\n", "\n", "Let's try to understand the cost function. \n", "\n", "For $y=1$\n", "- $Cost(h_\\theta(x),y)\\to\\infty$ as $h_\\theta(x)\\to0$ (penalty for a confidently wrong prediction)\n", "- $Cost(h_\\theta(x),y)\\to0$ as $h_\\theta(x)\\to1$ \n", "\n", "For $y=0$\n", "- $Cost(h_\\theta(x),y)\\to\\infty$ as $h_\\theta(x)\\to1$ (penalty for a confidently wrong prediction)\n", "- $Cost(h_\\theta(x),y)\\to0$ as $h_\\theta(x)\\to0$ " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# In Logistic Regression, 0" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "alpha=0.3\n", "iterations=1500\n", "\n", "theta,jHistory=gradientDescent(X,y,theta,alpha,iterations)\n", "df=pd.DataFrame({'Iterations':range(iterations),'Cost Function':jHistory})\n", "print('FINAL COST : %.4f'%jHistory[iterations-1])\n", "sns.lineplot(data=df,x='Iterations',y='Cost Function');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Plotting Decision Boundary** \n", "A decision boundary is a hypersurface that partitions the underlying vector space into two sets, one for each class. The model predicts $h_\\theta(x)=0.5$ exactly where $\\theta^Tx=0$, so with two features the boundary is the line $\\theta_0+\\theta_1x_1+\\theta_2x_2=0$.\n", "\n", "General equation of a line: \n", "$$ ax+by=c $$\n", "$$ \\theta_1x_1+\\theta_2x_2=-\\theta_0 $$ \n", "$$ x_2=(\\frac{-1}{\\theta_2})(\\theta_0+\\theta_1x_1)$$" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.scatterplot(x='exam1',y='exam2',data=data[data['decision']==0],label='Not Admitted');\n", "sns.scatterplot(x='exam1',y='exam2',data=data[data['decision']==1],label='Admitted');\n", "# Only need two points to define a line, so choose two endpoints\n", "x1=np.array([min(X[:,1]),max(X[:,1])])\n", "x2=np.multiply((-1/theta[0][2]),(theta[0][0]+np.multiply(theta[0][1],x1)))\n", "# De-normalizing the data\n", "x1=x1*data['exam1'].std()+data['exam1'].mean()\n", "x2=x2*data['exam2'].std()+data['exam2'].mean()\n", "sns.lineplot(x=x1,y=x2,label='Decision Boundary',color='green');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 1.2.4    Evaluating Logistic Regression" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>actual</th><th>prediction</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>64</th><td>0</td><td>0.075530</td></tr>\n",
"    <tr><th>73</th><td>1</td><td>0.878757</td></tr>\n",
"    <tr><th>63</th><td>0</td><td>0.000157</td></tr>\n",
"    <tr><th>10</th><td>0</td><td>0.905205</td></tr>\n",
"    <tr><th>49</th><td>1</td><td>0.999988</td></tr>\n",
"    <tr><th>21</th><td>1</td><td>0.998391</td></tr>\n",
"    <tr><th>40</th><td>1</td><td>0.967775</td></tr>\n",
"    <tr><th>38</th><td>0</td><td>0.209379</td></tr>\n",
"    <tr><th>99</th><td>1</td><td>0.999695</td></tr>\n",
"    <tr><th>6</th><td>1</td><td>0.998787</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " actual prediction\n", "64 0 0.075530\n", "73 1 0.878757\n", "63 0 0.000157\n", "10 0 0.905205\n", "49 1 0.999988\n", "21 1 0.998391\n", "40 1 0.967775\n", "38 0 0.209379\n", "99 1 0.999695\n", "6 1 0.998787" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h=predict(X,theta).reshape(m)\n", "result=pd.DataFrame({'actual':data['decision'],'prediction':h})\n", "result.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating the admission probability of a student with an Exam 1 score of 45 and an Exam 2 score of 85." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ADMISSION PROBABILITY : 0.770\n" ] } ], "source": [ "x=[1,45,85]\n", "# Normalizing the data with the training-set mean and standard deviation\n", "x[1]=(x[1]-data['exam1'].mean())/data['exam1'].std()\n", "x[2]=(x[2]-data['exam2'].mean())/data['exam2'].std()\n", "print('ADMISSION PROBABILITY : %.3f'%predict(x,theta)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating the training accuracy." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TRAINING ACCURACY : 89.0%\n" ] } ], "source": [ "p=np.round(h).reshape(m,1)\n", "accuracy=np.mean(y==p)*100\n", "print(f'TRAINING ACCURACY : {accuracy}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2    Regularized Logistic Regression\n", "\n", "##### Problem Statement\n", "In this part of the exercise, we will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly.\n", "\n", "Suppose we are the product manager of the factory and we have the test results for some microchips on two different tests. 
From these two tests, we would like to determine whether the microchips should be accepted or rejected. To help us make a decision, we have a dataset of test results on past microchips, from which we can build a logistic regression model." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TRAINING DATASET SHAPE : 118 X 3\n" ] }, { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>test1</th><th>test2</th><th>result</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>114</th><td>-0.593890</td><td>0.49488</td><td>0</td></tr>\n",
"    <tr><th>16</th><td>0.166470</td><td>0.53874</td><td>1</td></tr>\n",
"    <tr><th>92</th><td>0.103110</td><td>0.77997</td><td>0</td></tr>\n",
"    <tr><th>38</th><td>0.062788</td><td>-0.16301</td><td>1</td></tr>\n",
"    <tr><th>79</th><td>0.085829</td><td>-0.75512</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " test1 test2 result\n", "114 -0.593890 0.49488 0\n", "16 0.166470 0.53874 1\n", "92 0.103110 0.77997 0\n", "38 0.062788 -0.16301 1\n", "79 0.085829 -0.75512 0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data=pd.read_csv('./ex2data2.csv')\n", "print(f'TRAINING DATASET SHAPE : {data.shape[0]} X {data.shape[1]}')\n", "data.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1    Visualizing the data" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.scatterplot(x='test1',y='test2',data=data[data['result']==0],label='Rejected');\n", "sns.scatterplot(x='test1',y='test2',data=data[data['result']==1],label='Accepted');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The figure above shows that our dataset cannot be separated into two classes by a straight line. Therefore, a straightforward application of logistic regression will not perform well on this dataset, since logistic regression can only find a linear decision boundary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2    Feature Mapping\n", "\n", "One way to fit the data better is to create more features from each data point. We will map the features into all polynomial terms of $x_1$ and $x_2$ up to the $n^{th}$ power. As a result of this mapping, our vector of two features will be transformed into a higher-dimensional feature vector. A logistic regression classifier trained on this higher-dimensional feature vector will have a more complex decision boundary, which will appear non-linear when drawn in the original two-dimensional plot. \n", "\n", "While the feature mapping allows us to build a more expressive classifier and helps us solve the **Underfitting** problem, it also makes the classifier more susceptible to the **Overfitting** problem.\n", "\n", "From $[x_1,x_2]$ features to $[1,x_1,x_2,x_1^2,x_1x_2,x_2^2,x_1^3...x_2^n]$ features." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "def mapFeature(x1,x2,degree):\n", "    '''Maps two features to all polynomial terms up to the given degree'''\n", "    result=np.ones(shape=(x1.shape[0],1))\n", "    for i in range(1,degree+1):\n", "        for j in range(0,i+1):\n", "            column=np.multiply(np.power(x1,i-j),np.power(x2,j))\n", "            result=np.column_stack((result,column))\n", "    return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3    Cost Function and Gradient\n", "\n", "In this part of the exercise, we will implement **regularized** logistic regression to fit the data and also see for ourselves how parameter regularization can help combat the overfitting problem.\n", "\n", " \n", "If we have too many features, the learned hypothesis can take on large parameter values and fit the training set very well, such that $J(\\theta)\\approx0$, but fail to generalize to new examples. What options do we have?\n", "1. Reduce the number of features (at the risk of throwing away important information).\n", "2. **Regularization** - this process keeps all the features and adds a penalty to the **RSS (residual sum of squares)** to reduce the freedom of the model. Hence, the model will be less likely to fit the noise in the training data, which improves its ability to generalize. In layman's terms, regularization tries to prevent the parameters from growing too large.\n", "\n", "##### Types of Regularization\n", "1. **L1 Regularization (used in Lasso Regression)** \n", "LASSO stands for Least Absolute Shrinkage and Selection Operator. \n", "This regularization will shrink some parameters to **Zero**. Hence, this type of regularization can also help with **Feature Selection** in the model. As $\\lambda$ grows bigger, more shrinkage in parameters will occur.\n", "$$ J(\\theta)=RSS+\\lambda \\sum_{j=1}^n|\\theta_j| $$ \n", "2. 
**L2 Regularization (used in Ridge Regression)** \n", "Parameters decrease progressively as $\\lambda$ grows larger, **but are never cut to zero.**\n", "$$ J(\\theta)=RSS+\\lambda \\sum_{j=1}^n(\\theta_j^2) $$ \n", "3. **L1/L2 Regularization (used in Elastic Net Regression)** \n", "Elastic-net is a mix of both L1 and L2 regularization. $\\lambda$ is a shared penalization parameter, while $\\alpha$ sets the ratio between L1 and L2 regularization.\n", "$$ J(\\theta)=RSS+\\lambda[(1-\\alpha)\\sum_{j=1}^n(\\theta_j^2)+\\alpha\\sum_{j=1}^n|\\theta_j|]$$ \n", "\n", "##### Geometric Perspective on Regularization\n", "Lasso, Ridge and Elastic-Net regularization can also be viewed as a **constraint** added to the **optimization** process. Writing $t$ for the constraint bound (a symbol assumed here, since the original bound was lost; it shrinks as $\\lambda$ grows), Lasso error minimization can be written as:\n", "$$ minimize(RSS)\\;\\;\\;\\;subject\\;to\\; \\sum_{j=1}^n|\\theta_j|\\le t $$\n", "\n", "For the L1 regularization, the constraint region ($|\\theta_1|+|\\theta_2|\\le t$) is diamond-shaped." ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "degree=6\n", "lmbdas=[0,1,100]\n", "alphas=[10,0.03,0.03]\n", "iterations=[40000,10000,10000]\n", "\n", "labels=['No Regularization (Overfitting)','Regularization (Just Right)','Too much Regularization (Underfitting)']\n", "\n", "nrows,ncols=2,3\n", "fig,ax=plt.subplots(nrows,ncols,figsize=(18,8))\n", "fig.suptitle('Models with different lambda values',size=16)\n", "\n", "for i in range(ncols):\n", "    X=mapFeature(data['test1'],data['test2'],degree)\n", "    y=data.values[:,2]\n", "    y=y[:,np.newaxis]\n", "    theta=np.zeros(shape=(1,X.shape[1]))\n", "    theta,jHistory=regularizedGradientDescent(X,y,theta,alphas[i],lmbdas[i],iterations[i])\n", "    axis1=ax[0][i]\n", "    axis1.set_title(f'{labels[i]}\\nlambda = {lmbdas[i]} | Cost = {jHistory[iterations[i]-1]:.3f}')\n", "    plotDecisionBoundary(theta,data,y,axis1)\n", "    axis2=ax[1][i]\n", "    c=sns.distplot(theta[0],ax=axis2,kde=False)\n", "    c.set(xlabel=f'Theta{i+1} 
Values',ylabel='Frequency')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }