{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Impact of the dependency between samples on cross-validation test score estimates"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "from sklearn.datasets import load_digits"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "digits = load_digits()\n",
      "X, y = digits.data, digits.target"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The digits dataset of scikit-learn is the test set of the [UCI optdigits dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/). Apparently consecutive samples are more likely to stem from the same writer on this dataset. Hence the samples are not independent and identically distributed (iid) as different writing styles grouped togethers effectively introduce a dependency. Unfortunately the exact per-sample authorship metadata has not be kept in the optdigits dataset.\n",
      "\n",
      "This is highlighted by the fact that shuffling the data significantly affects the test score estimated by K-Fold cross-validation. Let us build a model with non-optimal parameters to highlight the impact of dependent samples:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.svm  import SVC\n",
      "\n",
      "model = SVC(C=10, gamma=0.005)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.cross_validation import cross_val_score\n",
      "\n",
      "def print_cv_score_summary(model, X, y, cv):\n",
      "    scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)\n",
      "    print(\"mean: {:3f}, stdev: {:3f}\".format(\n",
      "        np.mean(scores), np.std(scores)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "KFold does not shuffle the data by default hence takes the dependency structure of the dataset into account for small number of folds such as k=5:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.cross_validation import KFold\n",
      "\n",
      "cv = KFold(len(y), 5)\n",
      "print_cv_score_summary(model, X, y, cv)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "mean: 0.901543, stdev: 0.037016\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we shuffle the data, the estimated test score is much higher as we hide the dependency structure to the model hence we cannot detect the overfitting caused by the author writing styles:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cv = KFold(len(y), 5, shuffle=True, random_state=0)\n",
      "print_cv_score_summary(model, X, y, cv)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "mean: 0.968836, stdev: 0.007350\n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cv = KFold(len(y), 5, shuffle=True, random_state=1)\n",
      "print_cv_score_summary(model, X, y, cv)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "mean: 0.967725, stdev: 0.004847\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cv = KFold(len(y), 5, shuffle=True, random_state=2)\n",
      "print_cv_score_summary(model, X, y, cv)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "mean: 0.966622, stdev: 0.010217\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There is almost **7% discrepancy between the estimated score** probably caused by the dependency between samples.\n",
      "\n",
      "Those shuffled KFold cv scores are in-line with equivalent `ShuffleSplit`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.cross_validation import ShuffleSplit\n",
      "\n",
      "cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=0)\n",
      "print_cv_score_summary(model, X, y, cv)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "mean: 0.971667, stdev: 0.007115\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=1)\n",
      "print_cv_score_summary(model, X, y, cv)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "mean: 0.973333, stdev: 0.003333\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cv = ShuffleSplit(len(y), n_iter=5, test_size=0.2, random_state=2)\n",
      "print_cv_score_summary(model, X, y, cv)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "mean: 0.958333, stdev: 0.008784\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note that `StratifiedKFold` sorts the samples by classes prior to computing the folds hence breaks the dependency too (at least in scikit-learn 0.14):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.cross_validation import StratifiedKFold\n",
      "\n",
      "cv = StratifiedKFold(y, 5)\n",
      "print_cv_score_summary(model, X, y, cv)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "mean: 0.969404, stdev: 0.010674\n"
       ]
      }
     ],
     "prompt_number": 12
    }
   ],
   "metadata": {}
  }
 ]
}