{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 1.04\n", "## Creating a simple model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise we will create a simple logistic regression model from the scikit learn package.\n", "We will then create some model evaluation metrics and test the predictions against those model evaluation metrics.\n", "Let's load the feature data from the first excerice.\n", "\n", "We should always approach training any machine learning model training as an iterative approach, beginning first with a simple model, and using model evaluation metrics to evaluate the performance of the models." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "feats = pd.read_csv('../data/OSI_feats_e3.csv')\n", "target = pd.read_csv('../data/OSI_target_e2.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first begin by creating a test and train dataset. We will train the data using the training dataset and evaluate the performance of the model on the test dataset. Later in the lesson we will add validation datasets that will help us tune the hyperparameters.\n", "\n", "We will use a test_size = 0.2 which means that 20% of the data will be reserved for testing" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "test_size = 0.2\n", "random_state = 42\n", "X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size=test_size, random_state=random_state)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make sure our dimensions are correct" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of X_train: (9864, 68)\n", "Shape of y_train: (9864, 1)\n", "Shape of X_test: (2466, 68)\n", "Shape of y_test: (2466, 1)\n" ] } ], "source": [ "print(f'Shape of X_train: {X_train.shape}')\n", "print(f'Shape of y_train: {y_train.shape}')\n", "print(f'Shape of X_test: {X_test.shape}')\n", "print(f'Shape of y_test: {y_test.shape}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We fit our model first by instantiating it, then by fitting the model to the training data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, l1_ratio=None, max_iter=10000,\n", " multi_class='auto', n_jobs=None, penalty='l2',\n", " random_state=42, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "model = LogisticRegression(random_state=42, max_iter=10000)\n", "model.fit(X_train, y_train['Revenue'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test the model performance we will predict the outcome on the test features (X_test), and compare those outcomes to real values (y_test)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "y_pred = model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's compare against the true values. Let's start by using accuracy, accuracy is defined as the propotion of correct predictions out of the total predictions." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy of the model is 87.0641%\n" ] } ], "source": [ "from sklearn import metrics\n", "accuracy = metrics.accuracy_score(y_pred=y_pred, y_true=y_test)\n", "print(f'Accuracy of the model is {accuracy*100:.4f}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "87.0641% - that's not bad with for a simple model with little feature engineering!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other evaluation metrics\n", "\n", "Other common metrics in classification models are precision, recall, and f1-score.\n", "Recall is defined as the proportion of correct positive predictions relative to total true postive values. Precision is defined as the proportion of correct positive predictions relative to total predicted postive values. F1 score is a combination of precision and recall, defined as 2 times the product of precision and recall, divided by the sum of the two.\n", "\n", "It's useful to use these other evaluation metrics other than accuracy when the distribution of true and false values. We want these values to be as close to 1.0 as possible." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.7347\n", "Recall: 0.3504\n", "fscore: 0.4745\n" ] } ], "source": [ "precision, recall, fscore, _ = metrics.precision_recall_fscore_support(y_pred=y_pred, y_true=y_test, average='binary')\n", "print(f'Precision: {precision:.4f}\\nRecall: {recall:.4f}\\nfscore: {fscore:.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see here that while the accuracy is high, the recall is much lower, which means that we're missing most of the true positive values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature importances\n", " \n", "\n", "We can look at which features are important by looking at the magnitude of the coefficients. Those with a larger coefficients will have a greater contribution to the result. Those with a positive value will make the result head toward the true result, that the customer will not subscribe. Features with a negative value for the coefficient will make the result heads towards a false result, that the customer will not subscribe to the product.\n", "\n", "As a note, since the features were not normalized (having the same scale), the values for these coefficients shouls serve as a rough guide as to observe which features add predictive power." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TrafficType_13: -0.9393317018656502\n", "VisitorType_Returning_Visitor: -0.7126379729869377\n", "Month_Dec: -0.6356666079086347\n", "ExitRates: -0.6168306621684505\n", "Month_Mar: -0.5531772345591857\n", "Region_9: -0.5493990371550316\n", "TrafficType_3: -0.5230504004211978\n", "OperatingSystems_3: -0.5047311736766499\n", "SpecialDay: -0.48888883272346506\n", "BounceRates: -0.4573686067908481\n", "Month_May: -0.4436363104925222\n", "Month_June: -0.4225194836012355\n", "OperatingSystems_8: -0.35057329371369783\n", "Browser_6: -0.33033671140440707\n", "TrafficType_6: -0.2572321108188088\n", "TrafficType_1: -0.24969535181259417\n", "Browser_3: -0.23765128996809284\n", "VisitorType_New_Visitor: -0.22945892368475135\n", "Browser_1: -0.22069737949723414\n", "Region_7: -0.21116529737609177\n", "Browser_13: -0.20773332314846657\n", "Region_4: -0.20645936733062473\n", "Browser_4: -0.18452552602906916\n", "OperatingSystems_4: -0.17537032410289136\n", "OperatingSystems_2: -0.17087815382440244\n", "OperatingSystems_1: -0.14530926674716454\n", "TrafficType_15: -0.12601954689866632\n", "TrafficType_4: -0.12551302296797587\n", "Browser_2: -0.12254444691952127\n", "Region_3: -0.116409339032699\n", "TrafficType_9: -0.09345050196986791\n", "Browser_8: -0.07432180699436479\n", "Browser_5: -0.06731941488695285\n", "TrafficType_19: -0.04763319631540111\n", "Browser_10: -0.03030326779492614\n", "TrafficType_14: -0.02486754694456821\n", "Region_1: -0.024392989712640506\n", "TrafficType_18: -0.02222257922449895\n", "TrafficType_20: -0.018331800703584155\n", "OperatingSystems_6: -0.016786449649954342\n", "TrafficType_7: -0.006542353054798274\n", "TrafficType_12: -0.0032342542351401346\n", "Browser_11: -0.002452753984304908\n", "Informational_Duration: -0.00032045144921367014\n", "Administrative_Duration: -0.00010008862449623993\n", "ProductRelated_Duration: 4.6077899325827885e-05\n", "ProductRelated: 0.003291131517956643\n", "Administrative: 0.008809132521965357\n", "TrafficType_2: 0.025894902253396974\n", "Browser_7: 0.028686788285342275\n", "Region_8: 0.029319493036519817\n", "OperatingSystems_7: 0.03298640042309421\n", "TrafficType_16: 0.047341484936212506\n", "Informational: 0.08555002045301442\n", "TrafficType_5: 0.0859889420171317\n", "PageValues: 0.08672528112710322\n", "Region_6: 0.09309020409318655\n", "Month_Aug: 0.09668425308005028\n", "Browser_12: 0.1189651797379178\n", "is_weekend: 0.11966844048422016\n", "Month_Sep: 0.12544889935651957\n", "Region_2: 0.13313545468089413\n", "TrafficType_11: 0.19223716898106263\n", "Month_Jul: 0.21082793061040983\n", "Month_Oct: 0.2715030204884287\n", "TrafficType_10: 0.35298265536282414\n", "TrafficType_8: 0.4020350043660541\n", "Month_Nov: 0.5044070793869467\n" ] } ], "source": [ "coef_list = [f'{feature}: {coef}' for coef, feature in sorted(zip(model.coef_[0], X_train.columns.values.tolist()))]\n", "for item in coef_list:\n", " print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see from the coefficients that a whether or not the traffic type is a key indicator, as well as which month the user browsed in." ] } ], "metadata": { "kernelspec": { "display_name": "py3.7", "language": "python", "name": "py3.7" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }