{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 8.2. Predicting who will survive on the Titanic with logistic regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This recipe is based on a [Kaggle competition](http://www.kaggle.com/c/titanic-gettingStarted) where the goal is to predict survival on the Titanic, based on real data. [Kaggle](http://www.kaggle.com/competitions) hosts machine learning competitions where anyone can download a dataset, train a model, and test the predictions on the website. The author of the best model wins a price. It is a fun way to get started with machine learning.\n",
    "\n",
    "Here, we use this example to introduce logistic regression, a basic classifier. We also show how to perform a grid search with cross-validation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You need to download the Titanic dataset on the book's website (https://ipython-books.github.io)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. We import the standard libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import sklearn\n",
    "import sklearn.linear_model as lm\n",
    "import sklearn.cross_validation as cv\n",
    "import sklearn.grid_search as gs\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. We load the train and test datasets with Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train = pd.read_csv('data/titanic_train.csv')\n",
    "test = pd.read_csv('data/titanic_test.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train[train.columns[[2,4,5,1]]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Let's keep only a few fields for this example. We also convert the `sex` field to a binary variable, so that it can be handled correctly by NumPy and scikit-learn. Finally, we remove the rows containing `NaN` values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data = train[['Sex', 'Age', 'Pclass', 'Survived']].copy()\n",
    "data['Sex'] = data['Sex'] == 'female'\n",
    "data = data.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. Now, we convert this `DataFrame` to a NumPy array, so that we can pass it to scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data_np = data.astype(np.int32).values\n",
    "X = data_np[:,:-1]\n",
    "y = data_np[:,-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. Let's have a look at the survival of male and female passengers, as a function of their age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We define a few boolean vectors.\n",
    "female = X[:,0] == 1\n",
    "survived = y == 1\n",
    "# This vector contains the age of the passengers.\n",
    "age = X[:,1]\n",
    "# We compute a few histograms.\n",
    "bins_ = np.arange(0, 81, 5)\n",
    "S = {'male': np.histogram(age[survived & ~female], \n",
    "                          bins=bins_)[0],\n",
    "     'female': np.histogram(age[survived & female], \n",
    "                            bins=bins_)[0]}\n",
    "D = {'male': np.histogram(age[~survived & ~female], \n",
    "                          bins=bins_)[0],\n",
    "     'female': np.histogram(age[~survived & female], \n",
    "                            bins=bins_)[0]}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We now plot the data.\n",
    "bins = bins_[:-1]\n",
    "plt.figure(figsize=(10,3));\n",
    "for i, sex, color in zip((0, 1),\n",
    "                         ('male', 'female'),\n",
    "                         ('#3345d0', '#cc3dc0')):\n",
    "    plt.subplot(121 + i);\n",
    "    plt.bar(bins, S[sex], bottom=D[sex], color=color,\n",
    "            width=5, label='survived');\n",
    "    plt.bar(bins, D[sex], color='k', width=5, label='died');\n",
    "    plt.xlim(0, 80);\n",
    "    plt.grid(None);\n",
    "    plt.title(sex + \" survival\");\n",
    "    plt.xlabel(\"Age (years)\");\n",
    "    plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6. Let's try to train a `LogisticRegression` classifier. We first need to create a train and a test dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We split X and y into train and test datasets.\n",
    "(X_train, X_test, \n",
    " y_train, y_test) = cv.train_test_split(X, y, test_size=.05)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We instanciate the classifier.\n",
    "logreg = lm.LogisticRegression();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "7. Let's train the model and get the predicted values on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "logreg.fit(X_train, y_train)\n",
    "y_predicted = logreg.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following figure shows the actual and predicted results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8, 3));\n",
    "plt.imshow(np.vstack((y_test, y_predicted)),\n",
    "           interpolation='none', cmap='bone');\n",
    "plt.xticks([]); plt.yticks([]);\n",
    "plt.title((\"Actual and predicted survival outcomes\"\n",
    "          \" on the test set\"));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "8. To get an estimation of the performance of the model, we can use the `cross_val_score` that computes the cross-validation score. This function uses by default a 3-fold stratified cross-validation procedure, but this can be changed with the `cv` keyword argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cv.cross_val_score(logreg, X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function returns, for each pair of train and test set, a prediction score."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "9. The `LogisticRegression` class accepts a `C` hyperparameter as argument. This parameter quantifies the regularization strength. To find a good value, we can perform a grid search with the `GridSearchCV` class. It takes as input an estimator, and a dictionary of parameter values. This new estimator uses cross-validation to select the best parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "grid = gs.GridSearchCV(logreg, {'C': np.logspace(-5, 5, 200)}, n_jobs=4)\n",
    "grid.fit(X_train, y_train);\n",
    "grid.best_params_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the performance of the best estimator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cv.cross_val_score(grid.best_estimator_, X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n",
    "\n",
    "> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}