{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Appendix A - Cross-Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we split our data into training and test sets, we simply chose the first 80% to be the training set and the remaining 20% to be the test set. However, we would obtain different results if we chose a different split. To get around this, we use cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "df = pd.read_csv('data.csv')\n",
    "\n",
    "df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n",
    "\n",
    "age_mean = df['Age'].mean()\n",
    "df['Age'] = df['Age'].fillna(age_mean)\n",
    "\n",
    "df['Embarked'] = df['Embarked'].fillna('S')\n",
    "\n",
    "df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})\n",
    "df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)\n",
    "\n",
    "df = df.drop(['Embarked'], axis=1)\n",
    "\n",
    "X = df.iloc[:, 2:].values\n",
    "y = df['Survived'].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cross-validation involves splitting the data into five partitions, calculating accuracy once on each split, and then taking the average. We can generate cross-validation folds automatically with Scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "prediction accuracy: 0.776536312849\n",
      "prediction accuracy: 0.814606741573\n",
      "prediction accuracy: 0.876404494382\n",
      "prediction accuracy: 0.76404494382\n",
      "prediction accuracy: 0.837078651685\n",
      "overall prediction accuracy: 0.813734228862\n"
     ]
    }
   ],
   "source": [
    "from sklearn.cross_validation import KFold\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "cv = KFold(n=len(y), n_folds=5)\n",
    "results = []\n",
    "\n",
    "for training_set, test_set in cv:\n",
    "    X_train = X[training_set]\n",
    "    y_train = y[training_set]\n",
    "    X_test = X[test_set]\n",
    "    y_test = y[test_set]\n",
    "    model = RandomForestClassifier(n_estimators=100)\n",
    "    model.fit(X_train, y_train)\n",
    "    y_prediction = model.predict(X_test)\n",
    "    result = np.sum(y_test == y_prediction)*1./len(y_test)\n",
    "    results.append(result)\n",
    "    print \"prediction accuracy:\", result\n",
    "    \n",
    "print \"overall prediction accuracy:\", np.mean(results)    "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}