{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tuning your model\n", "> A Summary of lecture \"Supervised Learning with scikit-learn\", via datacamp\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: images/roc-curve.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How good is your model?\n", "- Classification metrics\n", " - Measuring model performance with accuracy:\n", " - Fraction of correctly classified samples\n", " - Not always a useful metric\n", "- Class imbalance example: Emails\n", " - Spam classification\n", " - 99% of emails are real; 1% of emails are spam\n", " - Could build a classifier that predicts ALL emails as real\n", " - 99% accurate!\n", " - But horrible at actually classifying spam\n", " - Fails at its original purpose\n", "- Diagnosing classification predictions\n", "\n", " - Confusion matrix\n", "![cm](image/confusion_matrix.png)\n", "\n", " - Accuracy:\n", " $$ \\dfrac{tp + tn}{tp + tn + fp + fn} $$\n", " \n", " - Precision (Positive Predictive Value):\n", " $$ \\dfrac{tp}{tp + fp}$$\n", " \n", " - Recall (Sensitivity, hit rate, True Positive Rate):\n", " $$ \\dfrac{tp}{tp + fn}$$\n", " \n", " - F1 score: Harmonic mean of precision and recall\n", " $$ 2 \\cdot \\dfrac{\\text{precision} \\cdot \\text{recall}}{\\text{precision} + \\text{recall}} $$\n", " \n", " - High precision : Not many real emails predicted as spam\n", " - High recall : Predicted most spam emails correctly\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metrics for classification\n", "Accuracy is not always an informative metric. 
In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.\n", "\n", "You may have noticed in the video that the classification report consisted of three rows, and an additional support column. The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.\n", "\n", "Here, you'll work with the [PIMA Indians](https://www.kaggle.com/uciml/pima-indians-diabetes-database) dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
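<table>
<thead><tr><th></th><th>pregnancies</th><th>glucose</th><th>diastolic</th><th>triceps</th><th>insulin</th><th>bmi</th><th>dpf</th><th>age</th><th>diabetes</th></tr></thead>
<tbody>
<tr><th>0</th><td>6</td><td>148</td><td>72</td><td>35</td><td>0</td><td>33.6</td><td>0.627</td><td>50</td><td>1</td></tr>
<tr><th>1</th><td>1</td><td>85</td><td>66</td><td>29</td><td>0</td><td>26.6</td><td>0.351</td><td>31</td><td>0</td></tr>
<tr><th>2</th><td>8</td><td>183</td><td>64</td><td>0</td><td>0</td><td>23.3</td><td>0.672</td><td>32</td><td>1</td></tr>
<tr><th>3</th><td>1</td><td>89</td><td>66</td><td>23</td><td>94</td><td>28.1</td><td>0.167</td><td>21</td><td>0</td></tr>
<tr><th>4</th><td>0</td><td>137</td><td>40</td><td>35</td><td>168</td><td>43.1</td><td>2.288</td><td>33</td><td>1</td></tr>
</tbody>
</table>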
\n", "
" ], "text/plain": [ " pregnancies glucose diastolic triceps insulin bmi dpf age \\\n", "0 6 148 72 35 0 33.6 0.627 50 \n", "1 1 85 66 29 0 26.6 0.351 31 \n", "2 8 183 64 0 0 23.3 0.672 32 \n", "3 1 89 66 23 94 28.1 0.167 21 \n", "4 0 137 40 35 168 43.1 2.288 33 \n", "\n", " diabetes \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 0 \n", "4 1 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./dataset/diabetes.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "X = df.iloc[:, :-1]\n", "y = df.iloc[:, -1]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[176 30]\n", " [ 56 46]]\n", " precision recall f1-score support\n", "\n", " 0 0.76 0.85 0.80 206\n", " 1 0.61 0.45 0.52 102\n", "\n", " accuracy 0.72 308\n", " macro avg 0.68 0.65 0.66 308\n", "weighted avg 0.71 0.72 0.71 308\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report, confusion_matrix\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# Create training and test set\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)\n", "\n", "# Instantiate a k-NN classifier: knn\n", "knn = KNeighborsClassifier(n_neighbors=6)\n", "\n", "# Fit the classifier to the training data\n", "knn.fit(X_train, y_train)\n", "\n", "# Predict the labels of the test data: y_pred\n", "y_pred = knn.predict(X_test)\n", "\n", "# Generate the confusion matrix and classification report\n", "print(confusion_matrix(y_test, y_pred))\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logistic regression and the ROC curve\n", "- Logistic regression for binary classification\n", " - Logistic regression outputs probabilities\n", " - If the 
probability is greater than 0.5:\n", " - The data is labeled '1'\n", " - If the probability is less than 0.5:\n", " - The data is labeled '0'\n", "- Probability thresholds\n", " - By default, logistic regression threshold = 0.5\n", " - Not specific to logistic regression\n", " - k-NN classifiers also have thresholds\n", "- ROC curves (Receiver Operating Characteristic curve)\n", "![roc](./image/roc.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building a logistic regression model\n", "Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. You'll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There's only one way to find out!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[168 38]\n", " [ 36 66]]\n", " precision recall f1-score support\n", "\n", " 0 0.82 0.82 0.82 206\n", " 1 0.63 0.65 0.64 102\n", "\n", " accuracy 0.76 308\n", " macro avg 0.73 0.73 0.73 308\n", "weighted avg 0.76 0.76 0.76 308\n", "\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "# Create training and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)\n", "\n", "# Create the classifier: logreg\n", "logreg = LogisticRegression(max_iter=1000)\n", "\n", "# Fit the classifier to the training data\n", "logreg.fit(X_train, y_train)\n", "\n", "# Predict the labels of the test set: y_pred\n", "y_pred = logreg.predict(X_test)\n", "\n", "# Compute and print the confusion matrix and classification report\n", "print(confusion_matrix(y_test, y_pred))\n", 
"print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting an ROC curve\n", "Great job in the previous exercise - you now have a new addition to your toolbox of classifiers!\n", "\n", "Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. As Hugo demonstrated in the video, most classifiers in scikit-learn have a ```.predict_proba()``` method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, you'll now evaluate its performance by plotting an ROC curve. In doing so, you'll make use of the ```.predict_proba()``` method and become familiar with its functionality." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import roc_curve\n", "\n", "# Compute predicted probabilities: y_pred_prob\n", "y_pred_prob = logreg.predict_proba(X_test)[:, 1]\n", "\n", "# Generate ROC curve values: fpr, tpr, thresholds\n", "fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)\n", "\n", "# Plot ROC curve\n", "plt.plot([0, 1], [0, 1], 'k--')\n", "plt.plot(fpr, tpr)\n", "plt.xlabel('False Positive Rate')\n", "plt.ylabel('True Positive Rate')\n", "plt.title('ROC Curve')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Precision-recall Curve\n", "When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds. As a reminder, precision and recall are defined as:\n", "$$ \\text{Precision} = \\dfrac{TP}{TP + FP} \\\\\n", " \\text{Recall} = \\dfrac{TP}{TP + FN}$$\n", " Study the precision-recall curve. Note that here, the class is positive (1) if the individual has diabetes." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Precision / Recall plot')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import precision_recall_curve\n", "\n", "precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)\n", "\n", "plt.plot(recall, precision)\n", "plt.xlabel('Recall')\n", "plt.ylabel('Precision')\n", "plt.title('Precision / Recall plot')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Area under the ROC curve (AUC)\n", "- Larger area under the ROC curve = better model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### AUC computation\n", "Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!\n", "\n", "In this exercise, you'll calculate AUC scores using the ```roc_auc_score()``` function from ```sklearn.metrics``` as well as by performing cross-validation on the diabetes dataset." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUC: 0.8243384732533791\n", "AUC scores computed using 5-fold cross-validation: [0.81240741 0.80777778 0.82574074 0.87283019 0.84471698]\n" ] } ], "source": [ "from sklearn.metrics import roc_auc_score\n", "from sklearn.model_selection import cross_val_score\n", "\n", "# Compute predicted probabilities: y_pred_prob\n", "y_pred_prob = logreg.predict_proba(X_test)[:, 1]\n", "\n", "# Compute and print AUC score\n", "print(\"AUC: {}\".format(roc_auc_score(y_test, y_pred_prob)))\n", "\n", "# Compute cross-validated AUC scores: cv_auc\n", "cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')\n", "\n", "# Print list of AUC scores\n", "print(\"AUC scores computed using 5-fold cross-validation: {}\".format(cv_auc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hyperparameter tuning\n", "- Linear regression: Choosing parameters\n", "- Ridge/Lasso regression: Choosing alpha\n", "- k-Nearest Neighbors: Choosing n_neighbors\n", "- Hyperparameters: Parameters like alpha and k\n", "- Hyperparameters cannot be learned by fitting the model\n", "- Choosing the correct hyperparameter\n", " - Try a bunch of different hyperparameter values\n", " - Fit all of them separately\n", " - See how well each performs\n", " - Choose the best performing one\n", " - It is essential to use cross-validation\n", "- Grid search cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hyperparameter tuning with GridSearchCV\n", "Like the alpha parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: $C$. $C$ controls the inverse of the regularization strength, and this is what you will tune in this exercise. 
A large $C$ can lead to an overfit model, while a small $C$ can lead to an underfit model.\n", "\n", "The hyperparameter space for $C$ has been set up for you. Your job is to use GridSearchCV and logistic regression to find the optimal $C$ in this hyperparameter space. \n", "\n", "You may be wondering why you aren't asked to split the data into training and test sets. Good observation! Here, we want you to focus on the process of setting up the hyperparameter grid and performing grid-search cross-validation. In practice, you will indeed want to hold out a portion of your data for evaluation purposes, and you will learn all about this in the next video!" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tuned Logistic Regression Parameters: {'C': 0.006105402296585327}\n", "Best score is 0.7734742381801205\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "# Set up the hyperparameter grid\n", "c_space = np.logspace(-5, 8, 15)\n", "param_grid = {'C':c_space}\n", "\n", "# Instantiate a logistic regression classifier: logreg\n", "logreg = LogisticRegression(max_iter=1000)\n", "\n", "# Instantiate the GridSearchCV object: logreg_cv\n", "logreg_cv = GridSearchCV(logreg, param_grid, cv=5)\n", "\n", "# Fit it to the data\n", "logreg_cv.fit(X, y)\n", "\n", "# Print the tuned parameters and score\n", "print(\"Tuned Logistic Regression Parameters: {}\".format(logreg_cv.best_params_))\n", "print(\"Best score is {}\".format(logreg_cv.best_score_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hyperparameter tuning with RandomizedSearchCV\n", "GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. 
A solution to this is to use ```RandomizedSearchCV```, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. You'll practice using ```RandomizedSearchCV``` in this exercise and see how this works.\n", "\n", "Here, you'll also be introduced to a new model: the Decision Tree. Don't worry about the specifics of how this model works. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have ```.fit()``` and ```.predict()``` methods that you can use in exactly the same way as before. Decision trees have many parameters that can be tuned, such as ```max_features```, ```max_depth```, and ```min_samples_leaf```: This makes it an ideal use case for ```RandomizedSearchCV```." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 7, 'min_samples_leaf': 4}\n", "Best score is 0.7448603683897801\n" ] } ], "source": [ "from scipy.stats import randint\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "# Set up the parameters and distributions to sample from: param_dist\n", "param_dist = {\n", " \"max_depth\": [3, None],\n", " \"max_features\": randint(1, 9),\n", " \"min_samples_leaf\": randint(1, 9),\n", " \"criterion\": [\"gini\", \"entropy\"],\n", "}\n", "\n", "# Instantiate a Decision Tree classifier: tree\n", "tree = DecisionTreeClassifier()\n", "\n", "# Instantiate the RandomizedSearchCV object: tree_cv\n", "tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)\n", "\n", "# Fit it to the data\n", "tree_cv.fit(X, y)\n", "\n", "# Print the tuned parameters and score\n", "print(\"Tuned Decision Tree Parameters: {}\".format(tree_cv.best_params_))\n", "print(\"Best score is 
{}\".format(tree_cv.best_score_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hold-out set for final evaluation\n", "- How well can the model perform on never before seen data?\n", "- Using ALL data for cross-validation is not ideal\n", "- Split data into training and hold-out set at the beginning\n", "- Perform grid search cross-validation on training set\n", "- Choose best hyperparameters and evaluate on hold-out set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hold-out set in practice I: Classification\n", "You will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array and target variable array from the diabetes dataset have been pre-loaded as ```X``` and ```y```.\n", "\n", "In addition to $C$, logistic regression has a ```'penalty'``` hyperparameter which specifies whether to use ```'l1'``` or ```'l2'``` regularization. Your job in this exercise is to create a hold-out set, tune the ```'C'``` and ```'penalty'``` hyperparameters of a logistic regression classifier using ```GridSearchCV``` on the training set." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tuned Logistic Regression Parameter: {'C': 3.727593720314938, 'penalty': 'l2'}\n", "Tuned Logistic Regression Accuracy: 0.7608695652173914\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split, GridSearchCV\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# Create the hyperparameter grid\n", "c_space = np.logspace(-5, 8, 15)\n", "param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}\n", "\n", "# Instantiate the logistic regression classifier: logreg\n", "logreg = LogisticRegression(max_iter=1000, solver='liblinear')\n", "\n", "# Create train and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)\n", "\n", "# Instantiate the GridSearchCV object: logreg_cv\n", "logreg_cv = GridSearchCV(logreg, param_grid, cv=5)\n", "\n", "# Fit it to the training data\n", "logreg_cv.fit(X_train, y_train)\n", "\n", "# Print the optimal parameters and best score\n", "print(\"Tuned Logistic Regression Parameter: {}\".format(logreg_cv.best_params_))\n", "print(\"Tuned Logistic Regression Accuracy: {}\".format(logreg_cv.best_score_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hold-out set in practice II: Regression\n", "Remember lasso and ridge regression from the previous chapter? Lasso used the $L1$ penalty to regularize, while ridge used the $L2$ penalty. There is another type of regularized regression known as the elastic net. 
In elastic net regularization, the penalty term is a linear combination of the $L1$ and $L2$ penalties:\n", "$$ a * L1 + b * L2 $$\n", "\n", "In scikit-learn, this term is represented by the ```'l1_ratio'``` parameter: An ```'l1_ratio'``` of 1 corresponds to an $L1$ penalty, and anything lower is a combination of $L1$ and $L2$.\n", "\n", "In this exercise, you will use ```GridSearchCV``` to tune the ```'l1_ratio'``` of an elastic net model trained on the Gapminder data. As in the previous exercise, use a hold-out set to evaluate your model's performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Preprocess" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
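<table>
<thead><tr><th></th><th>population</th><th>fertility</th><th>HIV</th><th>CO2</th><th>BMI_male</th><th>GDP</th><th>BMI_female</th><th>life</th><th>child_mortality</th></tr></thead>
<tbody>
<tr><th>0</th><td>34811059.0</td><td>2.73</td><td>0.1</td><td>3.328945</td><td>24.59620</td><td>12314.0</td><td>129.9049</td><td>75.3</td><td>29.5</td></tr>
<tr><th>1</th><td>19842251.0</td><td>6.43</td><td>2.0</td><td>1.474353</td><td>22.25083</td><td>7103.0</td><td>130.1247</td><td>58.3</td><td>192.0</td></tr>
<tr><th>2</th><td>40381860.0</td><td>2.24</td><td>0.5</td><td>4.785170</td><td>27.50170</td><td>14646.0</td><td>118.8915</td><td>75.5</td><td>15.4</td></tr>
<tr><th>3</th><td>2975029.0</td><td>1.40</td><td>0.1</td><td>1.804106</td><td>25.35542</td><td>7383.0</td><td>132.8108</td><td>72.5</td><td>20.0</td></tr>
<tr><th>4</th><td>21370348.0</td><td>1.96</td><td>0.1</td><td>18.016313</td><td>27.56373</td><td>41312.0</td><td>117.3755</td><td>81.5</td><td>5.2</td></tr>
</tbody>
</table>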
\n", "
" ], "text/plain": [ " population fertility HIV CO2 BMI_male GDP BMI_female life \\\n", "0 34811059.0 2.73 0.1 3.328945 24.59620 12314.0 129.9049 75.3 \n", "1 19842251.0 6.43 2.0 1.474353 22.25083 7103.0 130.1247 58.3 \n", "2 40381860.0 2.24 0.5 4.785170 27.50170 14646.0 118.8915 75.5 \n", "3 2975029.0 1.40 0.1 1.804106 25.35542 7383.0 132.8108 72.5 \n", "4 21370348.0 1.96 0.1 18.016313 27.56373 41312.0 117.3755 81.5 \n", "\n", " child_mortality \n", "0 29.5 \n", "1 192.0 \n", "2 15.4 \n", "3 20.0 \n", "4 5.2 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./dataset/gm_2008_region.csv')\n", "df.drop(labels=['Region'], axis='columns', inplace=True)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "X = df.drop('life', axis='columns').values\n", "y = df['life'].values" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:476: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 282.48621758506584, tolerance: 5.58941590909091\n", " positive)\n", "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:476: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 309.8466391486277, tolerance: 5.893071666666667\n", " positive)\n", "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:476: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. 
Duality gap: 255.50344008061325, tolerance: 5.890250303030303\n", " positive)\n", "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:476: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 287.6728412633782, tolerance: 5.814186865671641\n", " positive)\n", "C:\\Users\\kcsgo\\anaconda3\\lib\\site-packages\\sklearn\\linear_model\\_coordinate_descent.py:476: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 311.1827114768199, tolerance: 5.801944179104479\n", " positive)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Tuned ElasticNet l1 ratio: {'l1_ratio': 0.20689655172413793}\n", "Tuned ElasticNet R squared: 0.8668305372460283\n", "Tuned ElasticNet MSE: 10.057914133398445\n" ] } ], "source": [ "from sklearn.linear_model import ElasticNet\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import GridSearchCV, train_test_split\n", "\n", "# Create train and test sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)\n", "\n", "# Create the hyperparameter grid\n", "l1_space = np.linspace(0, 1, 30)\n", "param_grid = {'l1_ratio': l1_space}\n", "\n", "# Instantiate the ElasticNet regressor: elastic_net\n", "elastic_net = ElasticNet(max_iter=100000, tol=0.001)\n", "\n", "# Set up the GridSearchCV object: gm_cv\n", "gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)\n", "\n", "# Fit it to the training data\n", "gm_cv.fit(X_train, y_train)\n", "\n", "# Predict on the test set and compute metrics\n", "y_pred = gm_cv.predict(X_test)\n", "r2 = gm_cv.score(X_test, y_test)\n", "mse = mean_squared_error(y_test, y_pred)\n", "print(\"Tuned ElasticNet l1 ratio: {}\".format(gm_cv.best_params_))\n", "print(\"Tuned ElasticNet R squared: {}\".format(r2))\n", "print(\"Tuned ElasticNet MSE: {}\".format(mse))" ] } ], "metadata": { 
"kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }