{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "ml_sklearn_breastcancer_cross_validation_standard_error_nov2021.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "jUHPI7fHd-A4" }, "source": [ "In this notebook we compute the Standard Error for Cross Validation based on\n", "\n", "https://www.stat.cmu.edu/~ryantibs/advmethods/notes/errval.pdf\n", "\n", "CMU Notes from Ryan Tibshirani\n", "\n", "https://www.stat.cmu.edu/~ryantibs/\n", "\n", "Note: Tibshirani's father at Stanford worked on 632bootstrap\n", "\n", "- [1] Efron, Bradley, and Robert Tibshirani. 1997.\n", "\"Improvements on Cross-Validation: The .632+ Bootstrap Method.\"\n", "Journal of the American Statistical Association\n", "92 (438): 548. doi:10.2307/2965703.\n", "\n", "The standard error is the standard deviation of the Student t-distribution. T-distributions are slightly different from Gaussian, and vary depending on the size of the sample.\n" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WoCsVVokfONq", "outputId": "c7c401da-7cfb-4088-c2e9-7b1220f93bb2" }, "source": [ "!pip install git+https://github.com/pattersonconsulting/ml_tools.git" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Collecting git+https://github.com/pattersonconsulting/ml_tools.git\n", " Cloning https://github.com/pattersonconsulting/ml_tools.git to /tmp/pip-req-build-4im4qaw3\n", " Running command git clone -q https://github.com/pattersonconsulting/ml_tools.git /tmp/pip-req-build-4im4qaw3\n", "Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from ml-valuation==0.0.1) (1.1.5)\n", "Requirement already satisfied: sklearn in /usr/local/lib/python3.7/dist-packages (from ml-valuation==0.0.1) (0.0)\n", "Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from ml-valuation==0.0.1) (3.2.2)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from ml-valuation==0.0.1) (1.19.5)\n", "Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->ml-valuation==0.0.1) (2.8.2)\n", "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->ml-valuation==0.0.1) (2.4.7)\n", "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->ml-valuation==0.0.1) (0.11.0)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->ml-valuation==0.0.1) (1.3.2)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.1->matplotlib->ml-valuation==0.0.1) (1.15.0)\n", "Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->ml-valuation==0.0.1) (2018.9)\n", "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from sklearn->ml-valuation==0.0.1) (0.22.2.post1)\n", "Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->sklearn->ml-valuation==0.0.1) (1.4.1)\n", "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->sklearn->ml-valuation==0.0.1) 
(1.1.0)\n", "Building wheels for collected packages: ml-valuation\n", " Building wheel for ml-valuation (setup.py) ... \u001b[?25l\u001b[?25hdone\n", " Created wheel for ml-valuation: filename=ml_valuation-0.0.1-py3-none-any.whl size=8800 sha256=7a29616033f2dd858fbc0727f8b761e3d7c60576a22dc427c0d6a5d80e12abd1\n", " Stored in directory: /tmp/pip-ephem-wheel-cache-o04g75e7/wheels/ce/52/e8/5f5de6a3a97eca5d2f9e453ecafb0f88f99054a1f2601f637e\n", "Successfully built ml-valuation\n", "Installing collected packages: ml-valuation\n", "Successfully installed ml-valuation-0.0.1\n" ] } ] }, { "cell_type": "code", "metadata": { "id": "K6qrORWRWL9i" }, "source": [ "import pandas as pd\n", "\n", "import numpy as np\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.metrics import roc_curve, auc\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import make_classification\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "#from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", "from sklearn.model_selection import KFold, cross_val_score, GridSearchCV, train_test_split\n", "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n", "from sklearn.utils import resample\n", "from sklearn.dummy import DummyClassifier\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import load_breast_cancer\n", "\n", "from sklearn.metrics import average_precision_score\n", "from sklearn.metrics import precision_recall_curve\n", "from sklearn.metrics import plot_precision_recall_curve\n", "from sklearn.metrics import roc_curve, auc, confusion_matrix\n", "\n", "from matplotlib import pyplot\n", "\n", "import ml_valuation\n", "\n", "from ml_valuation import model_valuation\n", "from ml_valuation import model_visualization\n", "\n", "\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "XfLSLa0v-6nG" }, "source": [ "arr_X, arr_y = load_breast_cancer(return_X_y=True)\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "S_n1_bDHAqn2", "outputId": "0109e1d6-6c2b-4042-b707-24aa71ef1abc" }, "source": [ "#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n", "\n", "print(\"X: \" + str(arr_X.shape))\n", "#print(\"X_test: \" + str(X.shape))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "X: (569, 30)\n" ] } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LqYSQbL2BkBE", "outputId": "a4023c4d-c7bf-4d2e-ae12-47d34c9d97c8" }, "source": [ "print(arr_X)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]\n", " [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]\n", " [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]\n", " ...\n", " [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]\n", " [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]\n", " [7.760e+00 2.454e+01 4.792e+01 ... 
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LI_guKSLHNML", "outputId": "91c07926-fef1-447d-9d75-c7b5ddb2851f" }, "source": [ "unique, counts = np.unique(arr_y, return_counts=True)\n", "dict(zip(unique, counts))" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{0: 212, 1: 357}" ] }, "metadata": {}, "execution_count": 6 } ] },
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RUhMb8pmkW5w", "outputId": "2fd3e5d5-0d0e-4b0a-ae49-545c3ea20f39" }, "source": [ "from sklearn.model_selection import StratifiedKFold\n", "\n", "# model to evaluate with stratified k-fold cross-validation\n", "classifier_kfold_LR = LogisticRegression(solver='newton-cg')\n", "\n", "k = 10\n", "cv = StratifiedKFold(n_splits=k)\n", "stats = list()\n", "\n", "X = pd.DataFrame(arr_X)\n", "y = pd.DataFrame(arr_y)\n", "\n", "for i, (train_index, test_index) in enumerate(cv.split(X, y)):\n", "\n", "    # slice the training and test folds by index\n", "    Xtrain, Xtest = X.iloc[train_index], X.iloc[test_index]\n", "    ytrain, ytest = y.iloc[train_index], y.iloc[test_index]\n", "\n", "    print(\"Running CV Fold-\" + str(i))\n", "\n", "    # fit the model on the training fold (Xtrain) and its labels (ytrain)\n", "    classifier_kfold_LR.fit(Xtrain, ytrain.values.ravel())\n", "\n", "    # prediction probabilities for the test fold are also available if needed:\n", "    #probas_ = classifier_kfold_LR.predict_proba(Xtest)\n", "    #prediction_est_prob = probas_[:, 1]\n", "\n", "    # score the fold: predicted labels vs. true labels\n", "    y_pred = classifier_kfold_LR.predict(Xtest)\n", "    accuracy_fold = accuracy_score(ytest, y_pred)\n", "\n", "    stats.append(accuracy_fold)\n", "    print(\"Accuracy: \" + str(accuracy_fold))\n", "    print(\"-----\")\n", "\n", "mean_score = np.mean(stats)\n", "print(\"\\n\\nAverage Accuracy Across All Folds: \" + str(\"{:.4f}\".format(mean_score)))\n", "\n", "std_dev_score = np.std(stats)\n", "print(\"\\n\\nSTD DEV: \" + str(std_dev_score))\n", "\n", "# standard error of the mean fold score: std dev / sqrt(k)\n", "standard_error_score = (1/np.sqrt(k)) * std_dev_score\n", "print(\"\\n\\nStandard Error (Accuracy) Across All Folds: ( \" + str(\"{:.4f}\".format(standard_error_score)) + \")\")\n", "\n", "# https://en.wikipedia.org/wiki/Standard_error#:~:text=Assumptions%20and%20usage%5Bedit%5D\n", "# https://en.wikipedia.org/wiki/1.96\n", "# under a normal approximation, ~95% of values lie within ±1.96 standard errors\n", "ci_95 = 1.96 * standard_error_score\n", "print(\"CI Ranges 95%:\")\n", "\n", "low_end_range = mean_score - ci_95\n", "high_end_range = mean_score + ci_95\n", "\n", "print(\"High: \" + str(high_end_range))\n", "print(\"Low : \" + str(low_end_range))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Running CV Fold-0\n", "Accuracy: 0.9824561403508771\n", "-----\n", "Running CV Fold-1\n" ] }, { "output_type": "stream", "name": "stderr", "text": [
"/usr/local/lib/python3.7/dist-packages/scipy/optimize/linesearch.py:314: LineSearchWarning: The line search algorithm did not converge\n", " warn('The line search algorithm did not converge', LineSearchWarning)\n", "/usr/local/lib/python3.7/dist-packages/sklearn/utils/optimize.py:204: UserWarning: Line Search failed\n", " warnings.warn('Line Search failed')\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "Accuracy: 0.9122807017543859\n", "-----\n", "Running CV Fold-2\n", "Accuracy: 0.9298245614035088\n", "-----\n", "Running CV Fold-3\n", "Accuracy: 0.9473684210526315\n", "-----\n", "Running CV Fold-4\n", "Accuracy: 0.9824561403508771\n", "-----\n", "Running CV Fold-5\n", "Accuracy: 0.9824561403508771\n", "-----\n", "Running CV Fold-6\n", "Accuracy: 0.9298245614035088\n", "-----\n", "Running CV Fold-7\n", "Accuracy: 0.9473684210526315\n", "-----\n", "Running CV Fold-8\n", "Accuracy: 0.9649122807017544\n", "-----\n", "Running CV Fold-9\n", "Accuracy: 0.9642857142857143\n", "-----\n", "\n", "\n", "Average Accuracy Across All Folds: 0.9543\n", "\n", "\n", "STD DEV: 0.023770661464399972\n", "\n", "\n", "Standard Error (Accuracy) Across All Folds: ( 0.0075)\n", "CI Ranges 95%:\n", "High: 0.969056516887071\n", "Low : 0.9395900996542824\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "OtWo8JfJgUjS" }, "source": [ "These results are comparable to bootstrap632 results, for reference, on the same dataset / classifier combination\n", "\n", "Notable Links\n", "* https://stats.stackexchange.com/questions/549103/is-bootstrap-variance-based-on-mean-same-as-standard-error\n", "* https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/basic-statistics/inference/how-to/resampling/bootstrapping-for-1-sample-mean/interpret-the-results/key-results/\n", "* https://stats.stackexchange.com/questions/231263/help-understanding-standard-error\n" ] } ] }