{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "ml_sklearn_breastcancer_cross_validation_student_T_CI_nov2021.ipynb", "provenance": [], "collapsed_sections": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "jUHPI7fHd-A4" }, "source": [ "In this notebook we compute 95% confidence intervals for 10-fold cross-validation accuracy, as a comparison to the Standard Error Method\n", "\n", "The standard error of the mean is the estimated standard deviation of the sample mean, s / sqrt(n); the Student t-distribution supplies the multiplier that turns it into a confidence interval. RandomForestClassifier, GradientBoostingClassifier\n", "from sklearn.model_selection import KFold, cross_val_score, GridSearchCV, train_test_split\n", "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n", "from sklearn.utils import resample\n", "from sklearn.dummy import DummyClassifier\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import load_breast_cancer\n", "\n", "from sklearn.metrics import average_precision_score\n", "from sklearn.metrics import precision_recall_curve\n", "from sklearn.metrics import plot_precision_recall_curve\n", "from sklearn.metrics import roc_curve, auc, confusion_matrix\n", "\n", "from matplotlib import pyplot\n", "\n", "import ml_valuation\n", "\n", "from ml_valuation import model_valuation\n", "from ml_valuation import model_visualization\n", "\n", "\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "XfLSLa0v-6nG" }, "source": [ "arr_X, arr_y = load_breast_cancer(return_X_y=True) unique, counts = np.unique(arr_y, return_counts=True)
dict(zip(unique, counts)) StratifiedKFold\n", "\n", "# fit a model\n", "classifier_kfold_LR = LogisticRegression(solver='newton-cg')\n", "\n", "k = 10\n", "cv = StratifiedKFold( n_splits=k )\n", "stats = list()\n", "\n", "X = pd.DataFrame(arr_X)\n", "y = pd.DataFrame(arr_y)\n", "#for train_index, test_index in cv.split(X, y):\n", "for i, (train_index, test_index) in enumerate(cv.split(X, y)):\n", "\n", " # convert the data indexes into references\n", " Xtrain, Xtest = X.iloc[train_index], X.iloc[test_index]\n", " ytrain, ytest = y.iloc[train_index], y.iloc[test_index]\n", "\n", " print(\"Running CV Fold-\" + str(i))\n", " #print(Xtrain.shape)\n", " #print(Xtest.shape)\n", " #x_train_scaled, x_test_scaled, y_train_scaled, y_test_scaled = train_test_split(df_full_scaled, y_train_full, test_size=1 - train_ratio)\n", "\n", " # classifier_kfold_LR.fit( X.iloc[ train_index ], y.iloc[ train_index ])\n", " # fit the model on the training data (Xtrain) and labels (ytrain)\n", " classifier_kfold_LR.fit( Xtrain, ytrain.values.ravel() )\n", "\n", " # now get the probabilities of the predictions for the test input (data: Xtest, labels: ytest)\n", " #probas_ = classifier_kfold_LR.predict_proba( Xtest )\n", "\n", " #print( \"prediction probabilities: \" + str(yhat.shape) )\n", "\n", " \n", " #prediction_est_prob = probas_[:, 1]\n", "\n", " y_pred = classifier_kfold_LR.predict(Xtest)\n", "\n", " accuracy_fold = accuracy_score(ytest, y_pred)\n", "\n", " #scmtrx_lr_full_testset = model_valuation.standard_confusion_matrix_for_top_ranked_percent(y_test_scaled, yhat, 0.5, 1.0)\n", " \n", " \n", " stats.append(accuracy_fold)\n", " print(\"Accuracy: \" + str(accuracy_fold))\n", "\n", "\n", " \n", " print(\"-----\")\n", "\n", "mean_score = np.mean(stats)\n", "print(\"\\n\\nAverage Accuracy Across All Folds: \" + str(\"{:.4f}\".format(mean_score)))\n" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Running CV Fold-0\n", "Accuracy: 
0.9824561403508771\n", "Accuracy: 0.9649122807017544\n", "-----\n", "Running CV Fold-9\n", "Accuracy: 0.9642857142857143\n", "-----\n", "\n", "\n", "Average Accuracy Across All Folds: 0.9543\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "oBdHdRtTgV_d" }, "source": [ "# Calculating a 95% confidence interval for a population mean\n", "\n", "(1) determine if you need to use: \n", "* the normal distribution\n", "* the student's t-distribution\n", "\n", "(2) if we only have the sample standard deviation (the population standard deviation is unknown)\n", "* this is an indication to use the student's t-distribution\n", "\n", "(3) sample size as another indicator\n", "* when n >= 30 then we use the normal distribution\n", "* when n < 30 we use the student's t-distribution\n", "\n", "(4) computing confidence intervals for the population mean\n", "\n", "CI = sample mean +/- t (v, alpha/2) * s / sqrt(n)\n", "\n", "v = n - 1\n", "\n", "alpha / 2 = (1 - CL) / 2 = (1 - 0.95) / 2 = 0.05 / 2 = 0.025\n", "\n", "t = look up in student's t-distribution table with v and alpha/2 values\n", "\n", "t (9, 0.025) = 2.262\n", "\n", "EBM: \"margin of error\"\n", "\n", "ebm = (t) (std / sqrt(n))\n", "\n", "CI = mean +/- (t) (std / sqrt(n))\n", "\n", "CI = mean +/- ebm\n", "\n", "What does this confidence interval mean?\n", "\n", "This confidence interval means that we are 95% confident that \"the average accuracy of this model over the full population of breast cancer data is somewhere between 93.7% and 97.1%\"\n", "\n", "## What does a Confidence Interval Reveal?\n", "\n", "\"A confidence interval is a range of values, bounded above and below the statistic's mean, that likely would contain an unknown population parameter. 
Confidence level refers to the *percentage of probability*, or certainty, that the confidence interval would contain the true population parameter when you draw a random sample many times.\"\n", "\n", "If the confidence level is read this way, then across repeated samples about 95% of the intervals built with this procedure should contain the true mean accuracy. It does not mean that 95% of individual accuracy scores fall between the lower and upper bounds.\n", "\n" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oNC7Xie3fF5x", "outputId": "ea7bd86a-a868-453b-aec4-3a9144d152d1" }, "source": [ "\n", "\n", "t = 2.262 # v = 10 - 1, alpha / 2 = 0.025\n", "\n", "# note: np.std defaults to ddof=0 (population formula); pass ddof=1 for the sample standard deviation s\n", "std_dev_sample = np.std(stats)\n", "print(\"\\n\\nSample STD DEV: \" + str(std_dev_sample))\n", "ebm = (1/np.sqrt(k)) * std_dev_sample * t\n", "\n", "\n", "print(\"\\n\\nEBM (Accuracy) Across All Folds: ( \" + str(\"{:.4f}\".format(ebm)) + \")\")\n", "\n", "print(\"CI Ranges 95%:\")\n", "\n", "low_end_range = mean_score - ebm\n", "high_end_range = mean_score + ebm\n", "\n", "print(\"High: \" + str(high_end_range))\n", "print(\"Low : \" + str(low_end_range))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", "\n", "Sample STD DEV: 0.023770661464399972\n", "\n", "\n", "EBM (Accuracy) Across All Folds: ( 0.0170)\n", "CI Ranges 95%:\n", "High: 0.9713266337249031\n", "Low : 0.9373199828164502\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "OtWo8JfJgUjS" }, "source": [ "These results are comparable to bootstrap632 results, for reference, on the same dataset / classifier combination\n", "\n", "Notable Links\n", "* https://stats.stackexchange.com/questions/549103/is-bootstrap-variance-based-on-mean-same-as-standard-error\n", "* https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/basic-statistics/inference/how-to/resampling/bootstrapping-for-1-sample-mean/interpret-the-results/key-results/\n", "* 
https://stats.stackexchange.com/questions/231263/help-understanding-standard-error\n" ] } ] }