{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"LogisticRegressionKoban.ipynb","provenance":[]},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"QeAIy7xnSAa1"},"source":["# Logistic Regression Homework\n","\n","- Use the full dataset -- all columns. ✗ ☹️ I couldn't find the data on Google Drive, so I grabbed a different data set.\n","\n","- Try our quick algorithm and see if it's any good ✓ Not bad. With standardized regressors, we outperformed the Towards Data Science post I borrowed from (accuracy 0.88 vs 0.74).\n","\n","- Try SKLearn on the same data, see if you do any better ✓ Sklearn did a little better (accuracy 0.91 vs 0.88; F1 0.523 vs 0.34).\n","\n","- Combine random walk with logistic regression -- does it improve results? ✗ ☹️ I'm not sure what this is asking me to do. I tried the random walk to find optimal weights and it didn't seem to work too well.\n","\n","- How do we optimize the learning rate? -- you can use gradient descent; balance oscillations with speed of convergence\n","\n","**I added scaling on the quantitative variables and WAY outperformed the person who wrote this post - a post that has over 11 thousand claps. I also added a better model evaluation summary / confusion matrix. Lastly, I properly evaluated my model against a true test set, not the synthetic data they came up with.**\n","\n","\n","The majority of my code is taken from this Towards Data Science post: https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8. The analysis interpretation is solely mine. I did my own interpretation and then read the post.\n","\n","\n","**Question:** The comments on the Medium article tell the author she should have used the synthetically generated data to train the classifier and evaluated the model performance using the original test data. The synthetic data set is balanced (i.e., it has an equal number of positive and negative responses). The actual data has about 89% negative and 11% positive cases. What is the correct way to do this? "]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"ekM2kTaJR0of","executionInfo":{"status":"ok","timestamp":1613045322066,"user_tz":300,"elapsed":69995,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"a3bc8f02-a0ff-4eef-b630-0e4cc8d78b5a"},"source":["# Mount data drive\n","from google.colab import drive\n","drive.mount('/data/')\n","data_dir = '/data/My Drive/EMSE 6575/LogisticRegressionHomework'"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Mounted at /data/\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"kK61vDLrT4Qy"},"source":["# Load libraries\n","import pandas as pd\n","import numpy as np\n","from sklearn import preprocessing\n","import matplotlib.pyplot as plt\n","plt.rc(\"font\", size=14)\n","from sklearn.linear_model import LogisticRegression\n","from sklearn.model_selection import train_test_split\n","import seaborn as sns\n","sns.set(style=\"white\")\n","sns.set(style=\"whitegrid\", color_codes=True)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":241},"id":"8Bq83sNIUPTa","executionInfo":{"status":"ok","timestamp":1613045325598,"user_tz":300,"elapsed":73516,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"3f080170-3f26-4f70-fdab-68558c357e48"},"source":["# Read the data\n","# The data come from a bank's marketing campaign; the goal is to predict whether the client will subscribe to a term deposit.\n","data = pd.read_csv(data_dir + '/loan_data.csv', header = 0)\n","data = data.dropna()\n","print(data.shape)\n","data.head()"],"execution_count":null,"outputs":[{"output_type":"stream","text":["(41188, 21)\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>job</th><th>marital</th><th>education</th><th>default</th><th>housing</th><th>loan</th><th>contact</th><th>month</th><th>day_of_week</th><th>duration</th><th>campaign</th><th>pdays</th><th>previous</th><th>outcome</th><th>emp_var_rate</th><th>cons_price_idx</th><th>cons_conf_idx</th><th>euribor3m</th><th>nr_employed</th><th>y</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>44</td><td>blue-collar</td><td>married</td><td>basic.4y</td><td>unknown</td><td>yes</td><td>no</td><td>cellular</td><td>aug</td><td>thu</td><td>210</td><td>1</td><td>999</td><td>0</td><td>nonexistent</td><td>1.4</td><td>93.444</td><td>-36.1</td><td>4.963</td><td>5228.1</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>53</td><td>technician</td><td>married</td><td>unknown</td><td>no</td><td>no</td><td>no</td><td>cellular</td><td>nov</td><td>fri</td><td>138</td><td>1</td><td>999</td><td>0</td><td>nonexistent</td><td>-0.1</td><td>93.200</td><td>-42.0</td><td>4.021</td><td>5195.8</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>28</td><td>management</td><td>single</td><td>university.degree</td><td>no</td><td>yes</td><td>no</td><td>cellular</td><td>jun</td><td>thu</td><td>339</td><td>3</td><td>6</td><td>2</td><td>success</td><td>-1.7</td><td>94.055</td><td>-39.8</td><td>0.729</td><td>4991.6</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>39</td><td>services</td><td>married</td><td>high.school</td><td>no</td><td>no</td><td>no</td><td>cellular</td><td>apr</td><td>fri</td><td>185</td><td>2</td><td>999</td><td>0</td><td>nonexistent</td><td>-1.8</td><td>93.075</td><td>-47.1</td><td>1.405</td><td>5099.1</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>55</td><td>retired</td><td>married</td><td>basic.4y</td><td>no</td><td>yes</td><td>no</td><td>cellular</td><td>aug</td><td>fri</td><td>137</td><td>1</td><td>3</td><td>1</td><td>success</td><td>-2.9</td><td>92.201</td><td>-31.4</td><td>0.869</td><td>5076.2</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"],"text/plain":[" age job marital ... euribor3m nr_employed y\n","0 44 blue-collar married ... 4.963 5228.1 0\n","1 53 technician married ... 4.021 5195.8 0\n","2 28 management single ... 0.729 4991.6 1\n","3 39 services married ... 1.405 5099.1 0\n","4 55 retired married ... 0.869 5076.2 1\n","\n","[5 rows x 21 columns]"]},"metadata":{"tags":[]},"execution_count":3}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"9koIyv0PUkMm","executionInfo":{"status":"ok","timestamp":1613045325599,"user_tz":300,"elapsed":73509,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"da3b3490-1db7-4312-9473-3ac0fd571bcc"},"source":["# Relabel the three basic.* education levels as 'Basic' so there are fewer categories\n","data['education'] = np.where(data['education'] == 'basic.9y', 'Basic', data['education'])\n","data['education'] = np.where(data['education'] == 'basic.6y', 'Basic', data['education'])\n","data['education'] = np.where(data['education'] == 'basic.4y', 'Basic', data['education'])\n","data['education'].value_counts()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["Basic 12513\n","university.degree 12168\n","high.school 9515\n","professional.course 5243\n","unknown 1731\n","illiterate 18\n","Name: education, dtype: int64"]},"metadata":{"tags":[]},"execution_count":4}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":336},"id":"sHmuJ9zLXcx9","executionInfo":{"status":"ok","timestamp":1613045325710,"user_tz":300,"elapsed":73612,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"7107a875-e2b7-47fb-ffec-e845bc688911"},"source":["# Explore response variable to determine how many trues there 
are\n","print(data['y'].value_counts())\n","sns.countplot(x = 'y', data = data, palette = 'hls')\n","plt.show()"],"execution_count":null,"outputs":[{"output_type":"stream","text":["0 36548\n","1 4640\n","Name: y, dtype: int64\n"],"name":"stdout"},{"output_type":"display_data","data":{"image/png":"iVBORw0KGgoAAAANSUhEUgAAAZoAAAEMCAYAAAD9OXA9AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAWHUlEQVR4nO3df2hV9/3H8de9qbnV+ON602rvTaX+KMqld+J2L5N9NzuIk7iRqYOWZKEbmwhVqZU6taLrvRATSmIozCFNN4v+E5p/xoyJzlgrZc5thWRId5fSFNGWLhfFROeveBPvPd8/JBdjNV5z8zknuT4fUFjOJ6f3nZLd5z3nnpzrsizLEgAAhridHgAAkN8IDQDAKEIDADCK0AAAjCI0AACjnnB6gPEmnU7rxo0bmjRpklwul9PjAMCEYFmWBgcHVVRUJLd7+DEMobnHjRs31N3d7fQYADAhLVy4UNOmTRu2jdDcY9KkSZLu/McqLCx0eBoAmBgGBgbU3d2deQ69G6G5x9DpssLCQnk8HoenAYCJ5X5vOXAxAADAKEIDADCK0AAAjCI0AACjCA0AwChCAwAwitAAAIwiNAakBwedHgHjEL8XeFzxB5sGuCdNUsfr650eA+NMZG+j0yMAjuCIBgBgFKEBABhFaAAARtn2Hs3GjRv19ddfy+12a8qUKXrrrbcUDAZVWlo67AaWW7du1bJlyyRJZ86cUTQaVTKZVElJifbs2aPi4uKc1gAA9rLtiKaurk6HDx/WoUOHtHbtWu3cuTOztnfvXrW0tKilpSUTmXQ6rW3btikajaq9vV2RSEQNDQ05rQEA7GdbaO7+IJzr168/9NMr4/G4PB6PIpGIJKmyslLHjh3LaQ0AYD9bL2/etWuXTp8+LcuytH///sz2rVu3yrIshcNhbdmyRdOnT1cikVAgEMh8j8/nUzqd1pUrV0a95vV67flBAQAZtoamtrZWknTo0CHV19frj3/8o5qamuT3+zUwMKDa2lpVV1ePi1Nd8Xh81PuGw+ExnAT5pLOz0+kRANs58geba9asUTQa1eXLl+X3+yXd+UTLqqoqbdiwQZLk9/vV09OT2aevr09ut1ter3fUa48iFArxCZsYc7wIQb5KJpMPfIFuy3s0N27cUCKRyHx98uRJzZgxQx6PR9euXZMkWZalo0ePKhgMSrrzRH/r1i11dHRIkpqbm7Vy5cqc1gAA9rPliKa/v1+bN29Wf3+/3G63ZsyYocbGRvX29mrTpk1KpVJKp9NasGCBYrGYJMntdqu+vl6xWGzYZcq5rAEA7OeyLMtyeojxZOjwL9dTZ9zrDPfiXmfIZyM9d3JnAACAUYQGAGAUoQEAGEVoAABGERoAgFGEBgBgFKEBABhFaAAARhEaAIBRhAYAYBShAQAYRWgAAEYRGgCAUYQGAGAUoQEAGEVoAABGERoAgFGEBgBgFKEBABhFaAAARtkWmo0bN2rVqlVas2aNqqqq9Nlnn0mSzp07p4qKCpWVlamiokLnz5/P7GNiDQBgL9tCU1dXp8OHD+vQoUNau3atdu7cKUmKxWKqqqpSe3u7qqqqFI1GM/uYWAMA2Mu20EybNi3zv69fvy6Xy6Xe3l51dXWpvLxcklReXq6uri719fUZWQMA2O8JOx9s165dOn36tCzL0v79+5VIJDR79mwVFBRIkg
oKCjRr1iwlEglZljXmaz6fL+tZ4/H4qH/OcDg86n2R3zo7O50eAbCdraGpra2VJB06dEj19fXavHmznQ//SEKhkDwej9NjIM/wIgT5KplMPvAFuiNXna1Zs0affPKJnnnmGV24cEGpVEqSlEqldPHiRfn9fvn9/jFfAwDYz5bQ3LhxQ4lEIvP1yZMnNWPGDBUXFysYDKqtrU2S1NbWpmAwKJ/PZ2QNAGA/l2VZlukHuXTpkjZu3Kj+/n653W7NmDFDb775pl544QWdPXtWO3bs0NWrVzV9+nTV1dVp/vz5kmRk7WGGDv9yPXXW8fr6Ue+L/BTZ2+j0CIAxIz132hKaiYTQwBRCg3w20nMndwYAABhFaAAARhEaAIBRhAYAYBShAQAYRWgAAEYRGgCAUYQGAGAUoQEAGEVoAABGERoAgFGEBgBgFKEBABhFaAAARhEaAIBRhAYAYBShAQAYRWgAAEYRGgCAUU/Y8SCXL1/W9u3b9dVXX6mwsFDPPfecqqur5fP5tGjRIi1cuFBu953m1dfXa9GiRZKkkydPqr6+XqlUSi+88ILefvttTZ48Oac1AIC9bDmicblcWrdundrb29Xa2qo5c+aooaEhs97c3KyWlha1tLRkInPjxg299dZbamxs1IcffqiioiK9//77Oa0BAOxnS2i8Xq+WLl2a+XrJkiXq6ekZcZ+//vWvCoVCmjt3riSpsrJSf/nLX3JaAwDYz5ZTZ3dLp9P64IMPVFpamtn2i1/8QqlUSi+++KI2bdqkwsJCJRIJBQKBzPcEAgElEglJGvUaAMB+todm9+7dmjJlil555RVJ0scffyy/36/r169r27Zt2rdvn9544w27x/qGeDw+6n3D4fAYToJ80tnZ6fQIgO1sDU1dXZ2+/PJLNTY2Zt789/v9kqSpU6fq5Zdf1oEDBzLbP/nkk8y+PT09me8d7dqjCIVC8ng8j7wfMBJehCBfJZPJB75At+3y5nfeeUfxeFz79u1TYWGhJOl///ufbt26JUm6ffu22tvbFQwGJUnLli3Tv//9b50/f17SnQsGfvzjH+e0BgCwny1HNF988YXee+89zZ07V5WVlZKkZ599VuvWrVM0GpXL5dLt27f17W9/W5s3b5Z05winurpar776qtLptILBoHbt2pXTGgDAfi7LsiynhxhPhg7/cj111vH6+jGcCvkgsrfR6REAY0Z67uTOAAAAowgNAMAoQgMAMIrQAACMIjQAAKMIDQDAKEIDADCK0AAAjCI0AACjCA0AwChCAwAwitAAAIwiNAAAo7IOzfvvv3/f7UMfVAYAwP1kHZp9+/bdd/u77747ZsMAAPLPQz/47B//+IckKZ1O65///Kfu/viar7/+WkVFReamAwBMeA8NzdCnUyaTSe3cuTOz3eVy6emnn9Zvf/tbc9MBACa8h4bm5MmTkqTt27ervr7e+EAAgPzy0NAMuTsy6XR62JrbzcVrAID7yzo0//nPf1RdXa3PP/9cyWRSkmRZllwulz777LMR9718+bK2b9+ur776SoWFhXruuedUXV0tn8+nM2fOKBqNKplMqqSkRHv27FFxcbEkGVkDANgr60ORHTt2aOnSpfrTn/6kEydO6MSJE/roo4904sSJh+7rcrm0bt06tbe3q7W1VXPmzFFDQ4PS6bS2bdumaDSq9vZ2RSIRNTQ0SJKRNQCA/bIOzX//+1+98cYbWrBggUpKSob98zBer1dLly7NfL1kyRL19PQoHo/L4/EoEolIkiorK3Xs2DFJMrIGALBf1qFZsWKF/va3v+X8gOl0Wh988IFKS0uVSCQUCAQyaz6fT+l0WleuXDGyBgCwX9bv0SSTSb322msKh8N66qmnhq09ytVou3fv1pQpU/TKK6/oww8/zH5Sm8Xj8VHvGw6Hx3AS5JPOzk6nRwBsl3Vonn/+eT3//PM5PVhdXZ2+/PJLNTY2yu12y+/3q6enJ7Pe19cnt9str9drZO1RhEIheTyeHH
5a4Jt4EYJ8lUwmH/gCPevQvPbaazkN8c477ygej+sPf/iDCgsLJd15Mr9165Y6OjoUiUTU3NyslStXGlsDANgv69AM3Yrmfr73ve+NuO8XX3yh9957T3PnzlVlZaUk6dlnn9W+fftUX1+vWCw27FJk6c7f5oz1GgDAfi7r7puXjaC0tHTY15cvX9bg4KBmz56tjz76yMhwThg6/Mv11FnH6+vHcCrkg8jeRqdHAIwZ6bkz6yOaoVvRDEmlUnr33Xe5qSYAYESjvndMQUGB1q9fr/3794/lPACAPJPTTcpOnz4tl8s1VrMAAPJQ1qfOfvjDHw6LSn9/vwYGBhSLxYwMBgDID1mH5t4rtyZPnqx58+Zp6tSpYz4UACB/ZB2a7373u5Lu3ELm0qVLeuqpp/h4AADAQ2VdiuvXr2v79u1avHixXnzxRS1evFhvvvmmrl27ZnI+AMAEl3Voampq1N/fr9bWVn366adqbW1Vf3+/ampqTM4HAJjgsj51durUKZ04cUKTJ0+WJM2bN09vv/22VqxYYWw4AMDEl/URjcfjUV9f37Btly9fzty3DACA+8n6iOall17S2rVr9atf/UqBQEA9PT06ePCgXn75ZZPzAQAmuKxDs2HDBs2ePVutra26ePGiZs2apXXr1hEaAMCIsj51Vltbq3nz5ungwYM6evSoDh48qAULFqi2ttbkfACACS7r0LS1tSkUCg3bFgqF1NbWNuZDAQDyR9ahcblcSqfTw7alUqlvbAMA4G5ZhyYSieh3v/tdJizpdFq///3vFYlEjA0HAJj4sr4YYNeuXXr11Vf1gx/8QIFAQIlEQk8//bQaG/kwJwDAg2UdmmeeeUZ//vOf9emnnyqRSMjv92vx4sXc7wwAMKKsQyNJbrdbS5Ys0ZIlS0zNAwDIMxyOAACMsi00dXV1Ki0t1aJFi9Td3Z3ZXlpaqpUrV2r16tVavXq1Tp06lVk7c+aMVq1apbKyMq1du1a9vb05rwEA7GVbaJYvX66mpiaVlJR8Y23v3r1qaWlRS0uLli1bJunOVW3btm1TNBpVe3u7IpGIGhoacloDANjPttBEIhH5/f6svz8ej8vj8WQun66srNSxY8dyWgMA2O+RLgYwZevWrbIsS+FwWFu2bNH06dOVSCQUCAQy3+Pz+ZROp3XlypVRr3m93qxnisfjo/55wuHwqPdFfuvs7HR6BMB2joemqalJfr9fAwMDqq2tVXV19bg41RUKheTxeJweA3mGFyHIV8lk8oEv0B2/6mzodFphYaGqqqr0r3/9K7O9p6cn8319fX1yu93yer2jXgMA2M/R0Ny8eVPXrl2TJFmWpaNHjyoYDEq6c0Rx69YtdXR0SJKam5u1cuXKnNYAAPaz7dRZTU2Njh8/rkuXLunXv/61vF6vGhsbtWnTpszNORcsWKBYLCbpzh+H1tfXKxaLKZlMqqSkRHv27MlpDQBgP5dlWZbTQ4wnQ+cZc32PpuP19WM4FfJBZC/3BUT+Gum50/H3aAAA+Y3QAACMIjQAAKMIDQDAKEIDADCK0AAAjCI0AACjCA0AwChCAwAwitAAAIwiNAAAowgNAMAoQgMAMIrQAACMIjQAAKMIDQDAKEIDADCK0AAAjCI0AACjbAlNXV2dSktLtWjRInV3d2e2nzt3ThUVFSorK1NFRYXOnz9vdA0AYD9bQrN8+XI1NTWppKRk2PZYLKaqqiq1t7erqqpK0WjU6BoAwH62hCYSicjv9w/b1tvbq66uLpWXl0uSysvL1dXVpb6+PiNrAABnPOHUAycSCc2ePVsFBQWSpIKCAs2aNUuJREKWZY35ms/nc+YHBYDHnGOhGe/i8fio9w2Hw2M4CfJJZ2en0yMAtnMsNH6/XxcuXFAqlVJBQYFSqZQuXrwov98vy7LGfO1RhUIheTweAz85Hme8CEG+SiaTD3yB7tjlzcXFxQoGg2pra5MktbW1KRgMyufzGVkDADjDZVmWZfpBampqdPz4cV
26dEkzZ86U1+vVkSNHdPbsWe3YsUNXr17V9OnTVVdXp/nz50uSkbVsDFU51yOajtfXj3pf5KfI3kanRwCMGem505bQTCSEBqYQGuSzkZ47uTMAAMAoQgMAMIrQAACMIjQAAKMIDQDAKEIDADCK0AAAjCI0AACjCA0AwChCAwAwitAAAIwiNAAAowgNAMAoQgMAMIrQAACMIjQAAKMIDQDAKEIDADCK0AAAjHrC6QEkqbS0VIWFhZnPmd66dauWLVumM2fOKBqNKplMqqSkRHv27FFxcbEkjXoNAGCvcXNEs3fvXrW0tKilpUXLli1TOp3Wtm3bFI1G1d7erkgkooaGBkka9RoAwH7jJjT3isfj8ng8ikQikqTKykodO3YspzUAgP3Gxakz6c7pMsuyFA6HtWXLFiUSCQUCgcy6z+dTOp3WlStXRr3m9Xpt/ZkAAOMkNE1NTfL7/RoYGFBtba2qq6u1YsUKR2eKx+Oj3jccDo/hJMgnnZ2dTo8A2G5chMbv90uSCgsLVVVVpQ0bNuiXv/ylenp6Mt/T19cnt9str9crv98/qrVHEQqFMhcnAGOFFyHIV8lk8oEv0B1/j+bmzZu6du2aJMmyLB09elTBYFChUEi3bt1SR0eHJKm5uVkrV66UpFGvAZAG02mnR8A4ZPL3wvEjmt7eXm3atEmpVErpdFoLFixQLBaT2+1WfX29YrHYsMuUJY16DYA0ye3W+r93OD0GxpnG/4sY+3c7Hpo5c+bo0KFD9137zne+o9bW1jFdAwDYy/FTZwCA/EZoAABGERoAgFGEBgBgFKEBABhFaAAARhEaAIBRhAYAYBShAQAYRWgAAEYRGgCAUYQGAGAUoQEAGEVoAABGERoAgFGEBgBgFKEBABhFaAAARhEaAIBRhAYAYFTehubcuXOqqKhQWVmZKioqdP78eadHAoDHUt6GJhaLqaqqSu3t7aqqqlI0GnV6JAB4LD3h9AAm9Pb2qqurSwcOHJAklZeXa/fu3err65PP5xtxX8uyJEkDAwO5DVE0Nbf9kXeSyaTTI2Tw24l75fr7OfScOfQcere8DE0ikdDs2bNVUFAgSSooKNCsWbOUSCQeGprBwUFJUnd3d04zuF76eU77I//E43GnR8j4ucfl9AgYZ8bq93NwcFBPPvnksG15GZpcFBUVaeHChZo0aZJcLv7PCADZsCxLg4ODKioq+sZaXobG7/frwoULSqVSKigoUCqV0sWLF+X3+x+6r9vt1rRp02yYEgDyy71HMkPy8mKA4uJiBYNBtbW1SZLa2toUDAYfetoMADD2XNb93rnJA2fPntWOHTt09epVTZ8+XXV1dZo/f77TYwHAYydvQwMAGB/y8tQZAGD8IDQAAKMIDQDAKEIDADCK0MAYbmyK8aqurk6lpaVatGhRzncBwcMRGhjDjU0xXi1fvlxNTU0qKSlxepTHAqGBEUM3Ni0vL5d058amXV1d6uvrc3gyQIpEIlndKQRjg9DAiJFubArg8UJoAABGERoYcfeNTSU90o1NAeQXQgMjuLEpgCHc6wzGcGNTjFc1NTU6fvy4Ll26pJkzZ8rr9erIkSNOj5W3CA0AwChOnQEAjCI0AACjCA0AwChCAwAwitAAAIwiNAAAowgNAMAoQgMAMIrQAOPc/v37tWnTpmHbampqVFNT49BEwKMhNMA4t2rVKp06dUpXr16VJN2+fVtHjhzRmjVrHJ4MyA6hAca5WbNmKRKJ6NixY5KkU6dOaebMmQqFQg5PBmSH0AATwM9+9jMdPnxYknT48GGtXr3a4YmA7BEaYAL40Y9+pM8//1zd3d36+OOP9dOf/tTpkYCsERpgAvB4PCorK9NvfvMbfetb31IgEHB6JCBrhAaYINasWaPu7m5Om2HCITTABBEIBPTkk0+qrKzM6VGAR0JogAkgnU7rwIED+slPfqKpU6c6PQ7wSJ5wegAAI7t586a+//3vKxAIaP/+/U6PAzwyPsoZAGAUp84AAE
YRGgCAUYQGAGAUoQEAGEVoAABGERoAgFH/DzkWSBaRuWvWAAAAAElFTkSuQmCC\n","text/plain":[""]},"metadata":{"tags":[]}}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"nuNfl3y9X7Hu","executionInfo":{"status":"ok","timestamp":1613045325711,"user_tz":300,"elapsed":73605,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"6b0d94de-97f4-4a17-87ee-8f94c6fbb86c"},"source":["# Determine if the prediction classes are balanced\n","count_no_sub = len(data[data['y']==0])\n","count_sub = len(data[data['y']==1])\n","pct_of_no_sub = count_no_sub/(count_no_sub+count_sub)\n","print(\"percentage of no subscription is\", pct_of_no_sub*100)\n","pct_of_sub = count_sub/(count_no_sub+count_sub)\n","print(\"percentage of subscription\", pct_of_sub*100)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["percentage of no subscription is 88.73458288821988\n","percentage of subscription 11.265417111780131\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":142},"id":"tmkIozICYFGo","executionInfo":{"status":"ok","timestamp":1613045325711,"user_tz":300,"elapsed":73598,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"6053f5c2-3e95-481e-a51f-7156b2180f09"},"source":["# Examine if there are any obvious differences in the explanatory variables between classes\n","# numeric_only=True keeps this working on newer pandas, which no longer drops text columns silently\n","data.groupby('y').mean(numeric_only=True)"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>duration</th><th>campaign</th><th>pdays</th><th>previous</th><th>emp_var_rate</th><th>cons_price_idx</th><th>cons_conf_idx</th><th>euribor3m</th><th>nr_employed</th></tr>\n",
"    <tr><th>y</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>39.911185</td><td>220.844807</td><td>2.633085</td><td>984.113878</td><td>0.132374</td><td>0.248875</td><td>93.603757</td><td>-40.593097</td><td>3.811491</td><td>5176.166600</td></tr>\n",
"    <tr><th>1</th><td>40.913147</td><td>553.191164</td><td>2.051724</td><td>792.035560</td><td>0.492672</td><td>-1.233448</td><td>93.354386</td><td>-39.789784</td><td>2.123135</td><td>5095.115991</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"],"text/plain":[" age duration campaign ... cons_conf_idx euribor3m nr_employed\n","y ... \n","0 39.911185 220.844807 2.633085 ... -40.593097 3.811491 5176.166600\n","1 40.913147 553.191164 2.051724 ... -39.789784 2.123135 5095.115991\n","\n","[2 rows x 10 columns]"]},"metadata":{"tags":[]},"execution_count":7}]},{"cell_type":"markdown","metadata":{"id":"RinwTz4RZHiE"},"source":["It looks like people who subscribed were on the phone longer (mean `duration` of 553 compared to 221). The `pdays` mean doesn't seem like a good measure, since 999 is a placeholder meaning the client was not previously contacted. However, we can tell that people who said yes had more contacts before this campaign than people who said no (mean `previous` of 0.49 vs 0.13), even though they received slightly fewer contacts during it (mean `campaign` of 2.05 vs 2.63).\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":673},"id":"Ga7y0cCUaSJj","executionInfo":{"status":"ok","timestamp":1613045325813,"user_tz":300,"elapsed":73691,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"4f9f2b35-45b4-486b-c8dd-182ba9694765"},"source":["# Look at trends within categorical variables\n","print(data['job'].value_counts())\n","data.groupby('job').mean(numeric_only=True)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["admin. 10422\n","blue-collar 9254\n","technician 6743\n","services 3969\n","management 2924\n","retired 1720\n","entrepreneur 1456\n","self-employed 1421\n","housemaid 1060\n","unemployed 1014\n","student 875\n","unknown 330\n","Name: job, dtype: int64\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>duration</th><th>campaign</th><th>pdays</th><th>previous</th><th>emp_var_rate</th><th>cons_price_idx</th><th>cons_conf_idx</th><th>euribor3m</th><th>nr_employed</th><th>y</th></tr>\n",
"    <tr><th>job</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>admin.</th><td>38.187296</td><td>254.312128</td><td>2.623489</td><td>954.319229</td><td>0.189023</td><td>0.015563</td><td>93.534054</td><td>-40.245433</td><td>3.550274</td><td>5164.125350</td><td>0.129726</td></tr>\n",
"    <tr><th>blue-collar</th><td>39.555760</td><td>264.542360</td><td>2.558461</td><td>985.160363</td><td>0.122542</td><td>0.248995</td><td>93.656656</td><td>-41.375816</td><td>3.771996</td><td>5175.615150</td><td>0.068943</td></tr>\n",
"    <tr><th>entrepreneur</th><td>41.723214</td><td>263.267857</td><td>2.535714</td><td>981.267170</td><td>0.138736</td><td>0.158723</td><td>93.605372</td><td>-41.283654</td><td>3.791120</td><td>5176.313530</td><td>0.085165</td></tr>\n",
"    <tr><th>housemaid</th><td>45.500000</td><td>250.454717</td><td>2.639623</td><td>960.579245</td><td>0.137736</td><td>0.433396</td><td>93.676576</td><td>-39.495283</td><td>4.009645</td><td>5179.529623</td><td>0.100000</td></tr>\n",
"    <tr><th>management</th><td>42.362859</td><td>257.058140</td><td>2.476060</td><td>962.647059</td><td>0.185021</td><td>-0.012688</td><td>93.522755</td><td>-40.489466</td><td>3.611316</td><td>5166.650513</td><td>0.112175</td></tr>\n",
"    <tr><th>retired</th><td>62.027326</td><td>273.712209</td><td>2.476744</td><td>897.936047</td><td>0.327326</td><td>-0.698314</td><td>93.430786</td><td>-38.573081</td><td>2.770066</td><td>5122.262151</td><td>0.252326</td></tr>\n",
"    <tr><th>self-employed</th><td>39.949331</td><td>264.142153</td><td>2.660802</td><td>976.621393</td><td>0.143561</td><td>0.094159</td><td>93.559982</td><td>-40.488107</td><td>3.689376</td><td>5170.674384</td><td>0.104856</td></tr>\n",
"    <tr><th>services</th><td>37.926430</td><td>258.398085</td><td>2.587805</td><td>979.974049</td><td>0.154951</td><td>0.175359</td><td>93.634659</td><td>-41.290048</td><td>3.699187</td><td>5171.600126</td><td>0.081381</td></tr>\n",
"    <tr><th>student</th><td>25.894857</td><td>283.683429</td><td>2.104000</td><td>840.217143</td><td>0.524571</td><td>-1.408000</td><td>93.331613</td><td>-40.187543</td><td>1.884224</td><td>5085.939086</td><td>0.314286</td></tr>\n",
"    <tr><th>technician</th><td>38.507638</td><td>250.232241</td><td>2.577339</td><td>964.408127</td><td>0.153789</td><td>0.274566</td><td>93.561471</td><td>-39.927569</td><td>3.820401</td><td>5175.648391</td><td>0.108260</td></tr>\n",
"    <tr><th>unemployed</th><td>39.733728</td><td>249.451677</td><td>2.564103</td><td>935.316568</td><td>0.199211</td><td>-0.111736</td><td>93.563781</td><td>-40.007594</td><td>3.466583</td><td>5157.156509</td><td>0.142012</td></tr>\n",
"    <tr><th>unknown</th><td>45.563636</td><td>239.675758</td><td>2.648485</td><td>938.727273</td><td>0.154545</td><td>0.357879</td><td>93.718942</td><td>-38.797879</td><td>3.949033</td><td>5172.931818</td><td>0.112121</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"],"text/plain":[" age duration ... nr_employed y\n","job ... \n","admin. 38.187296 254.312128 ... 5164.125350 0.129726\n","blue-collar 39.555760 264.542360 ... 5175.615150 0.068943\n","entrepreneur 41.723214 263.267857 ... 5176.313530 0.085165\n","housemaid 45.500000 250.454717 ... 5179.529623 0.100000\n","management 42.362859 257.058140 ... 5166.650513 0.112175\n","retired 62.027326 273.712209 ... 5122.262151 0.252326\n","self-employed 39.949331 264.142153 ... 5170.674384 0.104856\n","services 37.926430 258.398085 ... 5171.600126 0.081381\n","student 25.894857 283.683429 ... 5085.939086 0.314286\n","technician 38.507638 250.232241 ... 5175.648391 0.108260\n","unemployed 39.733728 249.451677 ... 5157.156509 0.142012\n","unknown 45.563636 239.675758 ... 5172.931818 0.112121\n","\n","[12 rows x 11 columns]"]},"metadata":{"tags":[]},"execution_count":8}]},{"cell_type":"markdown","metadata":{"id":"rOzb1MPBaqzM"},"source":["Students have the highest subscription rate (31.4%, mean age about 26), followed by retired people (25.2%, mean age about 62). However, students make up only a small share of the sample: 875 of the 41,188 clients, or about 2%."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":289},"id":"sJMFAavxbZkE","executionInfo":{"status":"ok","timestamp":1613045325906,"user_tz":300,"elapsed":73778,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"70b7f7d4-bfba-4b4c-bcb0-c73f8b9b7114"},"source":["print(data['marital'].value_counts())\n","data.groupby('marital').mean(numeric_only=True)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["married 24928\n","single 11568\n","divorced 4612\n","unknown 80\n","Name: marital, dtype: int64\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>duration</th><th>campaign</th><th>pdays</th><th>previous</th><th>emp_var_rate</th><th>cons_price_idx</th><th>cons_conf_idx</th><th>euribor3m</th><th>nr_employed</th><th>y</th></tr>\n",
"    <tr><th>marital</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>divorced</th><td>44.899393</td><td>253.790330</td><td>2.61340</td><td>968.639853</td><td>0.168690</td><td>0.163985</td><td>93.606563</td><td>-40.707069</td><td>3.715603</td><td>5170.878643</td><td>0.103209</td></tr>\n",
"    <tr><th>married</th><td>42.307165</td><td>257.438623</td><td>2.57281</td><td>967.247673</td><td>0.155608</td><td>0.183625</td><td>93.597367</td><td>-40.270659</td><td>3.745832</td><td>5171.848772</td><td>0.101573</td></tr>\n",
"    <tr><th>single</th><td>33.158714</td><td>261.524378</td><td>2.53380</td><td>949.909578</td><td>0.211359</td><td>-0.167989</td><td>93.517300</td><td>-40.918698</td><td>3.317447</td><td>5155.199265</td><td>0.140041</td></tr>\n",
"    <tr><th>unknown</th><td>40.275000</td><td>312.725000</td><td>3.18750</td><td>937.100000</td><td>0.275000</td><td>-0.221250</td><td>93.471250</td><td>-40.820000</td><td>3.313038</td><td>5157.393750</td><td>0.150000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"],"text/plain":[" age duration campaign ... euribor3m nr_employed y\n","marital ... \n","divorced 44.899393 253.790330 2.61340 ... 3.715603 5170.878643 0.103209\n","married 42.307165 257.438623 2.57281 ... 3.745832 5171.848772 0.101573\n","single 33.158714 261.524378 2.53380 ... 3.317447 5155.199265 0.140041\n","unknown 40.275000 312.725000 3.18750 ... 3.313038 5157.393750 0.150000\n","\n","[4 rows x 11 columns]"]},"metadata":{"tags":[]},"execution_count":9}]},{"cell_type":"markdown","metadata":{"id":"TrZKJdxZbq7E"},"source":["Not much to speak of on this one, other than single people have a higher subscription rate (14.0%) than divorced (10.3%) or married (10.2%) people. That makes sense considering how many students sign up. The unknown category also has a high rate (15.0%), but it covers only 80 people in the data."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":385},"id":"pglCHck-cNPr","executionInfo":{"status":"ok","timestamp":1613045325907,"user_tz":300,"elapsed":73773,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"8e7ba066-08d3-4f94-c86b-d0895abd4b82"},"source":["print(data['education'].value_counts())\n","data.groupby('education').mean(numeric_only=True)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Basic 12513\n","university.degree 12168\n","high.school 9515\n","professional.course 5243\n","unknown 1731\n","illiterate 18\n","Name: education, dtype: int64\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>duration</th><th>campaign</th><th>pdays</th><th>previous</th><th>emp_var_rate</th><th>cons_price_idx</th><th>cons_conf_idx</th><th>euribor3m</th><th>nr_employed</th><th>y</th></tr>\n",
"    <tr><th>education</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>Basic</th><td>42.163910</td><td>263.043874</td><td>2.559498</td><td>974.877967</td><td>0.141053</td><td>0.191329</td><td>93.639933</td><td>-40.927595</td><td>3.729654</td><td>5172.014113</td><td>0.087029</td></tr>\n",
"    <tr><th>high.school</th><td>37.998213</td><td>260.886810</td><td>2.568576</td><td>964.358382</td><td>0.185917</td><td>0.032937</td><td>93.584857</td><td>-40.940641</td><td>3.556157</td><td>5164.994735</td><td>0.108355</td></tr>\n",
"    <tr><th>illiterate</th><td>48.500000</td><td>276.777778</td><td>2.277778</td><td>943.833333</td><td>0.111111</td><td>-0.133333</td><td>93.317333</td><td>-39.950000</td><td>3.516556</td><td>5171.777778</td><td>0.222222</td></tr>\n",
"    <tr><th>professional.course</th><td>40.080107</td><td>252.533855</td><td>2.586115</td><td>960.765974</td><td>0.163075</td><td>0.173012</td><td>93.569864</td><td>-40.124108</td><td>3.710457</td><td>5170.155979</td><td>0.113485</td></tr>\n",
"    <tr><th>university.degree</th><td>38.879191</td><td>253.223373</td><td>2.563527</td><td>951.807692</td><td>0.192390</td><td>-0.028090</td><td>93.493466</td><td>-39.975805</td><td>3.529663</td><td>5163.226298</td><td>0.137245</td></tr>\n",
"    <tr><th>unknown</th><td>43.481225</td><td>262.390526</td><td>2.596187</td><td>942.830734</td><td>0.226459</td><td>0.059099</td><td>93.658615</td><td>-39.877816</td><td>3.571098</td><td>5159.549509</td><td>0.145003</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"],"text/plain":[" age duration ... nr_employed y\n","education ... \n","Basic 42.163910 263.043874 ... 5172.014113 0.087029\n","high.school 37.998213 260.886810 ... 5164.994735 0.108355\n","illiterate 48.500000 276.777778 ... 5171.777778 0.222222\n","professional.course 40.080107 252.533855 ... 5170.155979 0.113485\n","university.degree 38.879191 253.223373 ... 5163.226298 0.137245\n","unknown 43.481225 262.390526 ... 5159.549509 0.145003\n","\n","[6 rows x 11 columns]"]},"metadata":{"tags":[]},"execution_count":10}]},{"cell_type":"markdown","metadata":{"id":"gGvlAtazcrlY"},"source":["This one suffers from the same unbalanced-sample issue as the job table. Illiterate people appear to have the highest subscription rate (22.2%), but only 18 people described themselves as illiterate. However, it also looks like people with a university degree are more likely to subscribe (13.7%) than people in the other large categories."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":386},"id":"IC4ZhEiadE3J","executionInfo":{"status":"ok","timestamp":1613045326393,"user_tz":300,"elapsed":74252,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"53d0bc68-2960-41ed-d42e-a7f4c78ef031"},"source":["%matplotlib inline\n","pd.crosstab(data.job,data.y).plot(kind='bar')\n","plt.title('Purchase Frequency for Job Title')\n","plt.xlabel('Job')\n","plt.ylabel('Frequency of Purchase')"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["Text(0, 0.5, 'Frequency of Purchase')"]},"metadata":{"tags":[]},"execution_count":11},{"output_type":"display_data","data":{"image/png":"\n","text/plain":[""]},"metadata":{"tags":[],"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"ET5V-epUdSIh"},"source":["This is a good plot because you can see which categories have high subscription rates (i.e., the blue and orange bars are closer in height for student and retired), but you can also see that they make up a very small percentage of the total subscriptions (i.e., admin is the top job category for generating subscriptions). It also shows that the percentage of subscriptions varies across job categories; hence, knowing the type of job should help us predict whether or not a person subscribes. In other words, if we randomly draw a student, they have a higher chance of being a subscriber (31.4%) than if we draw a blue-collar worker (6.9%).\n"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":358},"id":"O8u85NtKhXK6","executionInfo":{"status":"ok","timestamp":1613045326767,"user_tz":300,"elapsed":74619,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"cbca20c1-f636-4500-8c3d-b81f977013de"},"source":["%matplotlib inline\n","pd.crosstab(data.marital,data.y).plot(kind='bar')\n","plt.title('Purchase Frequency for Marital Status')\n","plt.xlabel('Marital Status')\n","plt.ylabel('Proportion of Customers')"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["Text(0, 0.5, 'Proportion of Customers')"]},"metadata":{"tags":[]},"execution_count":12},{"output_type":"display_data","data":{"image/png":"\n","text/plain":[""]},"metadata":{"tags":[],"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"4R6c3eypiAYx"},"source":["Same as I stated above. Single people have a slightly higher proportion of subscriptions within their category than the other groups."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":415},"id":"FpH8uV0JiSYT","executionInfo":{"status":"ok","timestamp":1613045327093,"user_tz":300,"elapsed":74939,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"34a1bc56-47e5-4854-ef4a-21f4d6cbff2a"},"source":["%matplotlib inline\n","pd.crosstab(data.education,data.y).plot(kind='bar')\n","plt.title('Purchase Frequency for Education Status')\n","plt.xlabel('Education Status')\n","plt.ylabel('Proportion of Customers')"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["Text(0, 0.5, 'Proportion of Customers')"]},"metadata":{"tags":[]},"execution_count":13},{"output_type":"display_data","data":{"image/png":"\n","text/plain":[""]},"metadata":{"tags":[],"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"RNmu8tTSivvS"},"source":["No change from above. A university degree seems to be more predictive than the other categories. However, comparisons across other categories like Basic and high.school seem less useful.\n","\n","**In summary, the diagnostic plots make it seem like newly graduated, single people are more likely to subscribe to a term deposit than other types of people.**"]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":519},"id":"3FXvyViNjn6F","executionInfo":{"status":"ok","timestamp":1613045327398,"user_tz":300,"elapsed":75237,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"84a25046-01d6-4f67-ba2d-d60096709d35"},"source":["print(data['month'].value_counts())\n","pd.crosstab(data.month,data.y).plot(kind='bar')\n","plt.title('Purchase Frequency by Month')\n","plt.xlabel('Month')\n","plt.ylabel('Proportion of Customers')"],"execution_count":null,"outputs":[{"output_type":"stream","text":["may 13769\n","jul 7174\n","aug 6178\n","jun 5318\n","nov 4101\n","apr 2632\n","oct 718\n","sep 570\n","mar 546\n","dec 182\n","Name: month, dtype: int64\n"],"name":"stdout"},{"output_type":"execute_result","data":{"text/plain":["Text(0, 0.5, 'Proportion of Customers')"]},"metadata":{"tags":[]},"execution_count":14},{"output_type":"display_data","data":{"image/png":"\n","text/plain":["
"]},"metadata":{"tags":[],"needs_background":"light"}}]},{"cell_type":"markdown","metadata":{"id":"ataSEKBOjzfv"},"source":["Looks like march, september, and october tend to be more successful months in terms of percentages. No idea why this would be the case other than, the fiscal year starts on October 1st. But for model training, it seems like month is going to be very predictive. If we are predicting something in any of those months, it should have better odds at being a success."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"XccHngLOlHlv","executionInfo":{"status":"ok","timestamp":1613045327512,"user_tz":300,"elapsed":75345,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"1662b212-7503-4a82-96d8-690e2f6ee315"},"source":["# One-hot encode the categorical variables \n","cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','outcome']\n","for var in cat_vars:\n"," cat_list='var'+'_'+var\n"," cat_list = pd.get_dummies(data[var], prefix=var)\n"," data1=data.join(cat_list)\n"," data=data1\n","\n","cat_vars=['job','marital','education','default','housing','loan','contact','month','day_of_week','outcome']\n","data_vars=data.columns.values.tolist()\n","to_keep=[i for i in data_vars if i not in cat_vars]\n","\n","data_final=data[to_keep]\n","data_final.columns.values"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp_var_rate',\n"," 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y',\n"," 'job_admin.', 'job_blue-collar', 'job_entrepreneur',\n"," 'job_housemaid', 'job_management', 'job_retired',\n"," 'job_self-employed', 'job_services', 'job_student',\n"," 'job_technician', 'job_unemployed', 'job_unknown',\n"," 'marital_divorced', 
'marital_married', 'marital_single',\n"," 'marital_unknown', 'education_Basic', 'education_high.school',\n"," 'education_illiterate', 'education_professional.course',\n"," 'education_university.degree', 'education_unknown', 'default_no',\n"," 'default_unknown', 'default_yes', 'housing_no', 'housing_unknown',\n"," 'housing_yes', 'loan_no', 'loan_unknown', 'loan_yes',\n"," 'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug',\n"," 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may',\n"," 'month_nov', 'month_oct', 'month_sep', 'day_of_week_fri',\n"," 'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue',\n"," 'day_of_week_wed', 'outcome_failure', 'outcome_nonexistent',\n"," 'outcome_success'], dtype=object)"]},"metadata":{"tags":[]},"execution_count":15}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":419},"id":"iTl_bnxVcLNR","executionInfo":{"status":"ok","timestamp":1613045327513,"user_tz":300,"elapsed":75340,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"c52cba30-c3fe-46bd-f6a9-378b41b07438"},"source":["import sklearn.preprocessing\n","from sklearn.preprocessing import StandardScaler\n","from sklearn.preprocessing import MinMaxScaler\n","sc = StandardScaler()\n","\n","numeric_vars = ['age', 'duration', 'campaign', 'pdays', 'emp_var_rate', 'cons_price_idx', \n"," 'cons_conf_idx', 'euribor3m', 'nr_employed']\n","\n","#only standardize numerical features\n","features=data_final[numeric_vars]\n","features_standard=StandardScaler().fit_transform(features)# Gaussian Standardisation\n","temp=pd.DataFrame(features_standard,columns=numeric_vars)\n","temp"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>duration</th><th>campaign</th><th>pdays</th><th>emp_var_rate</th><th>cons_price_idx</th><th>cons_conf_idx</th><th>euribor3m</th><th>nr_employed</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0.381527</td><td>-0.186230</td><td>-0.565922</td><td>0.195414</td><td>0.839061</td><td>-0.227465</td><td>0.951267</td><td>0.773575</td><td>0.845170</td></tr>\n",
"    <tr><th>1</th><td>1.245157</td><td>-0.463926</td><td>-0.565922</td><td>0.195414</td><td>-0.115781</td><td>-0.649003</td><td>-0.323542</td><td>0.230456</td><td>0.398115</td></tr>\n",
"    <tr><th>2</th><td>-1.153816</td><td>0.311309</td><td>0.156105</td><td>-5.117342</td><td>-1.134279</td><td>0.828107</td><td>0.151810</td><td>-1.667578</td><td>-2.428157</td></tr>\n",
"    <tr><th>3</th><td>-0.098268</td><td>-0.282652</td><td>-0.204909</td><td>0.195414</td><td>-1.197935</td><td>-0.864955</td><td>-1.425496</td><td>-1.277824</td><td>-0.940281</td></tr>\n",
"    <tr><th>4</th><td>1.437075</td><td>-0.467783</td><td>-0.565922</td><td>-5.133393</td><td>-1.898153</td><td>-2.374889</td><td>1.966794</td><td>-1.586859</td><td>-1.257233</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>41183</th><td>1.820911</td><td>-0.139947</td><td>-0.565922</td><td>0.195414</td><td>0.839061</td><td>1.536429</td><td>-0.280328</td><td>0.717649</td><td>0.845170</td></tr>\n",
"    <tr><th>41184</th><td>-0.865939</td><td>-0.240227</td><td>-0.204909</td><td>0.195414</td><td>0.648092</td><td>0.722722</td><td>0.886447</td><td>0.714190</td><td>0.331680</td></tr>\n",
"    <tr><th>41185</th><td>0.189609</td><td>-0.757050</td><td>0.156105</td><td>0.195414</td><td>0.648092</td><td>0.722722</td><td>0.886447</td><td>0.712460</td><td>0.331680</td></tr>\n",
"    <tr><th>41186</th><td>0.765363</td><td>-0.224799</td><td>-0.204909</td><td>0.195414</td><td>-2.216433</td><td>-1.977538</td><td>2.939106</td><td>-1.660082</td><td>-2.069683</td></tr>\n",
"    <tr><th>41187</th><td>-1.441693</td><td>-0.564206</td><td>0.517118</td><td>0.195414</td><td>0.648092</td><td>0.722722</td><td>0.886447</td><td>0.713613</td><td>0.331680</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>41188 rows × 9 columns</p>\n",
"</div>
"],"text/plain":[" age duration campaign ... cons_conf_idx euribor3m nr_employed\n","0 0.381527 -0.186230 -0.565922 ... 0.951267 0.773575 0.845170\n","1 1.245157 -0.463926 -0.565922 ... -0.323542 0.230456 0.398115\n","2 -1.153816 0.311309 0.156105 ... 0.151810 -1.667578 -2.428157\n","3 -0.098268 -0.282652 -0.204909 ... -1.425496 -1.277824 -0.940281\n","4 1.437075 -0.467783 -0.565922 ... 1.966794 -1.586859 -1.257233\n","... ... ... ... ... ... ... ...\n","41183 1.820911 -0.139947 -0.565922 ... -0.280328 0.717649 0.845170\n","41184 -0.865939 -0.240227 -0.204909 ... 0.886447 0.714190 0.331680\n","41185 0.189609 -0.757050 0.156105 ... 0.886447 0.712460 0.331680\n","41186 0.765363 -0.224799 -0.204909 ... 2.939106 -1.660082 -2.069683\n","41187 -1.441693 -0.564206 0.517118 ... 0.886447 0.713613 0.331680\n","\n","[41188 rows x 9 columns]"]},"metadata":{"tags":[]},"execution_count":16}]},{"cell_type":"code","metadata":{"id":"sGIIVNHcmPnK"},"source":["cat_data = data_final[['job_admin.', 'job_blue-collar', 'job_entrepreneur',\n"," 'job_housemaid', 'job_management', 'job_retired',\n"," 'job_self-employed', 'job_services', 'job_student',\n"," 'job_technician', 'job_unemployed', 'job_unknown',\n"," 'marital_divorced', 'marital_married', 'marital_single',\n"," 'marital_unknown', 'education_Basic', 'education_high.school',\n"," 'education_illiterate', 'education_professional.course',\n"," 'education_university.degree', 'education_unknown', 'default_no',\n"," 'default_unknown', 'default_yes', 'housing_no', 'housing_unknown',\n"," 'housing_yes', 'loan_no', 'loan_unknown', 'loan_yes',\n"," 'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug',\n"," 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may',\n"," 'month_nov', 'month_oct', 'month_sep', 'day_of_week_fri',\n"," 'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue',\n"," 'day_of_week_wed', 'outcome_failure', 'outcome_nonexistent',\n"," 'outcome_success', 'y']]\n","data_final_after_standardizing = 
pd.concat([cat_data.reset_index(drop=True), temp], axis=1)\n","data_final = data_final_after_standardizing"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"0Y_TrOuxmnma"},"source":["# SMOTE\n","\n","With the data prepared, I’ll up-sample the minority class (subscription, `y = 1`) in the training split using the SMOTE algorithm (Synthetic Minority Oversampling Technique). At a high level, SMOTE:\n","\n","1. Creates synthetic samples from the minority class (subscription) instead of duplicating existing observations.\n","2. Builds each synthetic observation by randomly choosing one of a minority observation's k nearest neighbors and using it to create a similar, but randomly tweaked, new observation."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Iq6CmxhhmnFX","executionInfo":{"status":"ok","timestamp":1613045328680,"user_tz":300,"elapsed":76495,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"f074f651-0905-4935-b28b-cd16483d82ad"},"source":["X = data_final.loc[:, data_final.columns != 'y']\n","y = data_final.loc[:, data_final.columns == 'y']\n","from imblearn.over_sampling import SMOTE\n","os = SMOTE(random_state=0)\n","# Split first, then oversample only the training portion so the test set stays untouched\n","X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)\n","columns = X_train.columns\n","# fit_resample replaces the deprecated fit_sample in imbalanced-learn\n","os_data_X,os_data_y=os.fit_resample(X_train, y_train)\n","os_data_X = pd.DataFrame(data=os_data_X,columns=columns )\n","os_data_y= pd.DataFrame(data=os_data_y,columns=['y'])\n","# Check the class counts in the oversampled data\n","print(\"length of oversampled data is \",len(os_data_X))\n","print(\"Number of no subscription in oversampled data\",len(os_data_y[os_data_y['y']==0]))\n","print(\"Number of subscription\",len(os_data_y[os_data_y['y']==1]))\n","print(\"Proportion of no subscription data in oversampled data is \",len(os_data_y[os_data_y['y']==0])/len(os_data_X))\n","print(\"Proportion of subscription data in oversampled data is 
\",len(os_data_y[os_data_y['y']==1])/len(os_data_X))"],"execution_count":null,"outputs":[{"output_type":"stream","text":["/usr/local/lib/python3.6/dist-packages/sklearn/externals/six.py:31: FutureWarning: The module is deprecated in version 0.21 and will be removed in version 0.23 since we've dropped support for Python 2.7. Please rely on the official version of six (https://pypi.org/project/six/).\n"," \"(https://pypi.org/project/six/).\", FutureWarning)\n","/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:144: FutureWarning: The sklearn.neighbors.base module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.neighbors. Anything that cannot be imported from sklearn.neighbors is now part of the private API.\n"," warnings.warn(message, FutureWarning)\n","/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().\n"," y = column_or_1d(y, warn=True)\n","/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function safe_indexing is deprecated; safe_indexing is deprecated in version 0.22 and will be removed in version 0.24.\n"," warnings.warn(msg, category=FutureWarning)\n"],"name":"stderr"},{"output_type":"stream","text":["length of oversampled data is 51134\n","Number of no subscription in oversampled data 25567\n","Number of subscription 25567\n","Proportion of no subscription data in oversampled data is 0.5\n","Proportion of subscription data in oversampled data is 0.5\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"KaUugP4ke19g","executionInfo":{"status":"ok","timestamp":1613045328682,"user_tz":300,"elapsed":76492,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"60bbf716-fcfe-4f28-bb56-fb8c6791efae"},"source":["data_final['y'].value_counts()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 36548\n","1 4640\n","Name: y, dtype: int64"]},"metadata":{"tags":[]},"execution_count":19}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"VPBTRLXRe_6_","executionInfo":{"status":"ok","timestamp":1613045328682,"user_tz":300,"elapsed":76488,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"38583cd7-9865-4a8d-81c7-642167b7f570"},"source":["os_data_y['y'].value_counts()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["1 25567\n","0 25567\n","Name: y, dtype: 
int64"]},"metadata":{"tags":[]},"execution_count":20}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"FrT9knUlfLoB","executionInfo":{"status":"ok","timestamp":1613045328683,"user_tz":300,"elapsed":76485,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"f369b7f3-178e-49a1-8933-9f9ca1f3136c"},"source":["y_test['y'].value_counts()"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["0 10981\n","1 1376\n","Name: y, dtype: int64"]},"metadata":{"tags":[]},"execution_count":21}]},{"cell_type":"markdown","metadata":{"id":"HU2KJhWTnMYv"},"source":["Now we have perfectly balanced training data! Note that I over-sampled only the training data. Because none of the information in the test data is used to create synthetic observations, no information bleeds from the test data into model training."]},{"cell_type":"markdown","metadata":{"id":"_uJzUpvopVjk"},"source":["# Recursive Feature Elimination\n","\n","Recursive Feature Elimination (RFE) repeatedly fits a model, identifies the best or worst performing feature, sets that feature aside, and repeats the process with the remaining features. In scikit-learn's implementation, the least important feature is dropped at each step until the requested number of features remains (25 in the cell below). 
The goal of RFE is to select features by recursively considering smaller and smaller sets of features."]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"PjNuOpFapeqQ","executionInfo":{"status":"ok","timestamp":1613045372388,"user_tz":300,"elapsed":120186,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"10d592c4-cc04-4073-f4a5-63fa434c0ebc"},"source":["import warnings\n","warnings.filterwarnings('ignore')\n","\n","data_final_vars=data_final.columns.values.tolist()\n","y=['y']\n","X=[i for i in data_final_vars if i not in y]\n","from sklearn.feature_selection import RFE\n","from sklearn.linear_model import LogisticRegression\n","logreg = LogisticRegression()\n","# Keep the 25 top-ranked features; the keyword form is required by newer scikit-learn releases\n","rfe = RFE(logreg, n_features_to_select=25)\n","# Rank features on the oversampled training data only\n","rfe = rfe.fit(os_data_X, os_data_y.values.ravel())\n","print(rfe.support_)\n","print(rfe.ranking_)"],"execution_count":null,"outputs":[{"output_type":"stream","text":["[False True False True False True True True True False False False\n"," False False False False False False True False False False False True\n"," False False True False False False True False True False True True\n"," False True True True True False True False False False False False\n"," True True True False True False False True True False True False]\n","[22 1 2 1 9 1 1 1 1 8 21 20 19 32 34 14 11 12 1 33 17 10 31 1\n"," 35 28 1 29 23 6 1 7 1 5 1 1 4 1 1 1 1 16 1 25 26 24 27 13\n"," 1 1 1 36 1 15 18 1 1 30 1 3]\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"CVkoEkDSpyj1"},"source":["from itertools import compress\n","cols = list(compress(os_data_X.columns, 
rfe.support_))\n","X=os_data_X[cols]\n","y=os_data_y['y']"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"RfiR1uafr4Me","executionInfo":{"status":"ok","timestamp":1613045373574,"user_tz":300,"elapsed":121364,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"c8954fdb-e6bd-4a58-9bda-d4a837429527"},"source":["# Implement the model\n","import statsmodels.api as sm\n","logit_model=sm.Logit(y,X)\n","result=logit_model.fit()\n","print(result.summary2())"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Optimization terminated successfully.\n"," Current function value: 0.315088\n"," Iterations 7\n"," Results: Logit\n","=====================================================================\n","Model: Logit Pseudo R-squared: 0.545 \n","Dependent Variable: y AIC: 32273.4687\n","Date: 2021-02-11 12:09 BIC: 32494.5238\n","No. Observations: 51134 Log-Likelihood: -16112. \n","Df Model: 24 LL-Null: -35443. \n","Df Residuals: 51109 LLR p-value: 0.0000 \n","Converged: 1.0000 Scale: 1.0000 \n","No. Iterations: 7.0000 \n","---------------------------------------------------------------------\n"," Coef. Std.Err. 
z P>|z| [0.025 0.975]\n","---------------------------------------------------------------------\n","job_blue-collar -0.3431 0.0411 -8.3412 0.0000 -0.4238 -0.2625\n","job_housemaid -0.4216 0.1155 -3.6492 0.0003 -0.6480 -0.1951\n","job_retired 0.3941 0.0611 6.4492 0.0000 0.2744 0.5139\n","job_self-employed -0.5627 0.0908 -6.2009 0.0000 -0.7406 -0.3849\n","job_services -0.4321 0.0565 -7.6411 0.0000 -0.5429 -0.3212\n","job_student 0.4525 0.0822 5.5062 0.0000 0.2914 0.6135\n","education_illiterate 0.6605 0.6181 1.0685 0.2853 -0.5510 1.8720\n","default_unknown -0.5534 0.0468 -11.8187 0.0000 -0.6452 -0.4616\n","housing_unknown -0.5237 0.1124 -4.6570 0.0000 -0.7441 -0.3033\n","loan_yes -0.3626 0.0444 -8.1660 0.0000 -0.4497 -0.2756\n","contact_telephone -0.6316 0.0515 -12.2732 0.0000 -0.7325 -0.5308\n","month_aug 1.0900 0.0580 18.8032 0.0000 0.9764 1.2037\n","month_dec -0.6926 0.1547 -4.4757 0.0000 -0.9959 -0.3893\n","month_jun -0.7708 0.0579 -13.3160 0.0000 -0.8843 -0.6574\n","month_mar 2.1291 0.0907 23.4618 0.0000 1.9512 2.3069\n","month_may -0.8230 0.0453 -18.1828 0.0000 -0.9117 -0.7343\n","month_nov -0.7804 0.0591 -13.2009 0.0000 -0.8962 -0.6645\n","month_sep 0.3917 0.0927 4.2276 0.0000 0.2101 0.5734\n","outcome_failure -1.4407 0.0539 -26.7484 0.0000 -1.5463 -1.3351\n","outcome_nonexistent -0.8536 0.0336 -25.3834 0.0000 -0.9195 -0.7877\n","outcome_success 0.5635 0.0740 7.6188 0.0000 0.4185 0.7084\n","duration 1.9455 0.0196 99.0136 0.0000 1.9070 1.9840\n","emp_var_rate -3.7658 0.1032 -36.4927 0.0000 -3.9681 -3.5636\n","cons_price_idx 1.3606 0.0420 32.4203 0.0000 1.2784 1.4429\n","euribor3m 1.6083 0.0838 19.1874 0.0000 1.4440 1.7726\n","=====================================================================\n","\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"333QHUW9YMW4"},"source":["Next, I remove the variables whose coefficient estimates have a p-value higher than 0.05. Only one qualifies: `education_illiterate` (p = 0.2853)."]},
{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"XLgpTKXhYo04","executionInfo":{"status":"ok","timestamp":1613045374115,"user_tz":300,"elapsed":121899,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"237f33fb-2229-43dc-e3d4-b2cba724ab2e"},"source":["# Refit the logit model without the non-significant feature\n","remove_cols = ['education_illiterate']\n","new_cols = [x for x in cols if x not in remove_cols]\n","X=os_data_X[new_cols]\n","y=os_data_y['y']\n","logit_model=sm.Logit(y,X)\n","result=logit_model.fit()\n","print(result.summary2())"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Optimization terminated successfully.\n"," Current function value: 0.315100\n"," Iterations 7\n"," Results: Logit\n","====================================================================\n","Model: Logit Pseudo R-squared: 0.545 \n","Dependent Variable: y AIC: 32272.6253\n","Date: 2021-02-11 12:09 BIC: 32484.8382\n","No. Observations: 51134 Log-Likelihood: -16112. \n","Df Model: 23 LL-Null: -35443. \n","Df Residuals: 51110 LLR p-value: 0.0000 \n","Converged: 1.0000 Scale: 1.0000 \n","No. Iterations: 7.0000 \n","--------------------------------------------------------------------\n"," Coef. Std.Err. 
z P>|z| [0.025 0.975]\n","--------------------------------------------------------------------\n","job_blue-collar -0.3427 0.0411 -8.3317 0.0000 -0.4234 -0.2621\n","job_housemaid -0.4215 0.1155 -3.6481 0.0003 -0.6479 -0.1950\n","job_retired 0.3963 0.0611 6.4879 0.0000 0.2766 0.5160\n","job_self-employed -0.5593 0.0907 -6.1678 0.0000 -0.7371 -0.3816\n","job_services -0.4320 0.0565 -7.6395 0.0000 -0.5428 -0.3212\n","job_student 0.4523 0.0822 5.5047 0.0000 0.2913 0.6134\n","default_unknown -0.5531 0.0468 -11.8144 0.0000 -0.6449 -0.4614\n","housing_unknown -0.5243 0.1124 -4.6621 0.0000 -0.7446 -0.3039\n","loan_yes -0.3625 0.0444 -8.1651 0.0000 -0.4496 -0.2755\n","contact_telephone -0.6301 0.0514 -12.2506 0.0000 -0.7309 -0.5293\n","month_aug 1.0910 0.0580 18.8224 0.0000 0.9774 1.2047\n","month_dec -0.6937 0.1547 -4.4829 0.0000 -0.9970 -0.3904\n","month_jun -0.7708 0.0579 -13.3163 0.0000 -0.8843 -0.6574\n","month_mar 2.1281 0.0907 23.4535 0.0000 1.9503 2.3060\n","month_may -0.8236 0.0453 -18.1990 0.0000 -0.9123 -0.7349\n","month_nov -0.7788 0.0591 -13.1781 0.0000 -0.8946 -0.6629\n","month_sep 0.3916 0.0927 4.2260 0.0000 0.2100 0.5732\n","outcome_failure -1.4418 0.0539 -26.7739 0.0000 -1.5474 -1.3363\n","outcome_nonexistent -0.8541 0.0336 -25.3995 0.0000 -0.9200 -0.7882\n","outcome_success 0.5635 0.0740 7.6191 0.0000 0.4185 0.7085\n","duration 1.9455 0.0196 99.0139 0.0000 1.9070 1.9840\n","emp_var_rate -3.7642 0.1032 -36.4837 0.0000 -3.9664 -3.5620\n","cons_price_idx 1.3597 0.0420 32.4081 0.0000 1.2774 1.4419\n","euribor3m 1.6069 0.0838 19.1744 0.0000 1.4426 1.7711\n","====================================================================\n","\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"BJRL96LRbBzy","executionInfo":{"status":"ok","timestamp":1613045374869,"user_tz":300,"elapsed":122647,"user":{"displayName":"Donald 
Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"88c56ded-d82a-48f5-bc1e-9e81cdeadbe6"},"source":["from sklearn.linear_model import LogisticRegression\n","from sklearn import metrics\n","logreg = LogisticRegression()\n","logreg.fit(X_train, y_train)"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n"," intercept_scaling=1, l1_ratio=None, max_iter=100,\n"," multi_class='auto', n_jobs=None, penalty='l2',\n"," random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n"," warm_start=False)"]},"metadata":{"tags":[]},"execution_count":26}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"FhVLLvF_bOQd","executionInfo":{"status":"ok","timestamp":1613045374870,"user_tz":300,"elapsed":122643,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"ff67aeed-897d-41f0-96f0-8bc5070d507a"},"source":["y_pred = logreg.predict(X_test)\n","print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))\n","\n","cm = metrics.confusion_matrix(y_test, y_pred)\n","true_pos = cm[1,1]\n","true_neg = cm[0,0]\n","false_pos = cm[0,1]\n","false_neg = cm[1,0]\n","precision = true_pos/(true_pos + false_pos)\n","recall = true_pos/(true_pos + false_neg)\n","\n","print(\"When we check precision of our model against the test data set, \" + str(len(y_test)) + \" users\")\n","print(\"\")\n","print(\"Precision - We predicted a subscription \" + str(cm[1,1] + cm[0,1]) + \" times and were correct \" + str(cm[1,1]) + \" times: \" + str(cm[1,1]/(cm[1,1] + cm[0,1]))[0:5])\n","print(\"Recall - We predicted \" + str(cm[1,1]) + \" out of the \" + str(cm[1,1] + cm[1,0]) + \" subscriptions: 
\" + str(recall)[0:5])\n","print(\"\")\n","print(\"Our total accuracy was \" + str((cm[0,0] + cm[1,1])/len(y_test))[0:5])\n","print(\"Our F1 score was \" + str(2*(precision*recall)/(precision+recall))[0:5])"],"execution_count":null,"outputs":[{"output_type":"stream","text":["Accuracy of logistic regression classifier on test set: 0.91\n","When we check precision of our model against the test data set, 12357 users\n","\n","Precision - We predicted a subscription 884 times and were correct 592 times: 0.669\n","Recall - We predicted 592 out of the 1376 subscriptions: 0.430\n","\n","Our total accuracy was 0.912\n","Our F1 score was 0.523\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/","height":111},"id":"lRxm6RHDg40s","executionInfo":{"status":"ok","timestamp":1613045374872,"user_tz":300,"elapsed":122640,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"b8e9a1ec-8327-4b89-87a2-2f5d20bad72d"},"source":["metrics.precision_score(y_test, y_pred)\n","cm = metrics.confusion_matrix(y_test, y_pred)\n","\n","true_neg = str(cm[0,0]) + \"/\" + str(cm[0,0] + cm[1,0]) + \" (\" + str(cm[0,0]/(cm[0,0] + cm[1,0]))[0:5] + \")\"\n","false_pos = str(cm[0,1]) + \"/\" + str(cm[0,1] + cm[1,1]) + \" (\" + str(cm[0,1]/(cm[0,1] + cm[1,1]))[0:5] + \")\"\n","false_neg = str(cm[1,0]) + \"/\" + str(cm[1,0] + cm[0,0]) + \" (\" + str(cm[1,0]/(cm[1,0] + cm[0,0]))[0:5] + \")\"\n","true_pos = str(cm[1,1]) + \"/\" + str(cm[1,1] + cm[0,1]) + \" (\" + str(cm[1,1]/(cm[1,1] + cm[0,1]))[0:5] + \")\"\n","\n","conf_matrix = pd.DataFrame({'Not Subscription': [true_neg, false_neg],\n"," 'Subscription': [false_pos, true_pos],\n"," 'Support': [cm[0,0] + cm[0,1], cm[1,0] + cm[1,1]]},\n"," index = ['Not Subscription', 
'Subscription'])\n","conf_matrix"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/html":["
"],"text/plain":[" prediction truth result\n","31880 0 0 None\n","38177 0 0 None\n","2459 0 0 None\n","756 0 0 None\n","11275 0 0 None"]},"metadata":{"tags":[]},"execution_count":37}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"5RTyj28lnD5L","executionInfo":{"status":"ok","timestamp":1613045382170,"user_tz":300,"elapsed":129890,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"be24897b-bdcc-4144-f3a2-632b98d90778"},"source":["prediction_df['result'][(prediction_df['prediction'] == 0) & (prediction_df['truth'] == 0)] = \"true_neg\"\n","prediction_df['result'][(prediction_df['prediction'] == 0) & (prediction_df['truth'] == 1)] = \"false_neg\"\n","prediction_df['result'][(prediction_df['prediction'] == 1) & (prediction_df['truth'] == 0)] = \"false_pos\"\n","prediction_df['result'][(prediction_df['prediction'] == 1) & (prediction_df['truth'] == 1)] = \"true_pos\"\n","results = prediction_df['result'].value_counts()\n","results"],"execution_count":null,"outputs":[{"output_type":"execute_result","data":{"text/plain":["true_neg 24640\n","false_neg 2071\n","true_pos 1193\n","false_pos 927\n","Name: result, dtype: int64"]},"metadata":{"tags":[]},"execution_count":38}]},{"cell_type":"code","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"gRoRQbQCmep6","executionInfo":{"status":"ok","timestamp":1613045382170,"user_tz":300,"elapsed":129887,"user":{"displayName":"Donald Koban","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14GgIl_q-klTdMSVMcpQ2RqU9YBN_aDPqg2-7Pd4=s64","userId":"12205738029019728376"}},"outputId":"245b9634-866f-461b-f0e6-a25319ab33b4"},"source":["\n","true_pos = results[3]\n","true_neg = results[0]\n","false_pos = results[1]\n","false_neg = results[2]\n","precision = true_pos/(true_pos + false_pos)\n","recall = true_pos/(true_pos + false_neg)\n","\n","print(\"When we 
check precision of our model against the test data set, \" + str(len(y)) + \" users\")\n","print(\"\")\n","print(\"Precision - We predicted a subscription \" + str(true_pos + false_pos) + \" times and were correct \" + str(true_pos) + \" times: \" + str(true_pos/(true_pos + false_pos))[0:5])\n","print(\"Recall - We predicted \" + str(true_pos) + \" out of the \" + str(true_pos + false_neg) + \" subscriptions: \" + str(recall)[0:5])\n","print(\"\")\n","print(\"Our total accuracy was \" + str((true_pos + true_neg)/len(y))[0:5])\n","print(\"Our F1 score was \" + str(2*(precision*recall)/(precision+recall))[0:5])"],"execution_count":null,"outputs":[{"output_type":"stream","text":["When we check precision of our model against the test data set, 28831 users\n","\n","Precision - We predicted a subscription 2998 times and were correct 927 times: 0.309\n","Recall - We predicted 927 out of the 2120 subscriptions: 0.437\n","\n","Our total accuracy was 0.886\n","Our F1 score was 0.362\n"],"name":"stdout"}]}]}