{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "> This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8.2. Predicting who will survive on the Titanic with logistic regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This recipe is based on a [Kaggle competition](http://www.kaggle.com/c/titanic-gettingStarted) where the goal is to predict survival on the Titanic, based on real data. [Kaggle](http://www.kaggle.com/competitions) hosts machine learning competitions where anyone can download a dataset, train a model, and test the predictions on the website. The author of the best model wins a prize. It is a fun way to get started with machine learning.\n", "\n", "Here, we use this example to introduce logistic regression, a basic classifier. We also show how to perform a grid search with cross-validation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You need to download the Titanic dataset from the book's website (https://ipython-books.github.io)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. We import the standard libraries." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import sklearn\n", "import sklearn.linear_model as lm\n", "# sklearn.cross_validation and sklearn.grid_search were removed in\n", "# scikit-learn 0.20; their contents now live in sklearn.model_selection.\n", "import sklearn.model_selection as cv\n", "import sklearn.model_selection as gs\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. We load the train and test datasets with pandas." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train = pd.read_csv('data/titanic_train.csv')\n", "test = pd.read_csv('data/titanic_test.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train[train.columns[[2,4,5,1]]].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Let's keep only a few fields for this example. We also convert the `Sex` field to a binary variable, so that it can be handled correctly by NumPy and scikit-learn. Finally, we remove the rows containing `NaN` values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = train[['Sex', 'Age', 'Pclass', 'Survived']].copy()\n", "# Encode sex as a boolean: True for female, False for male.\n", "data['Sex'] = data['Sex'] == 'female'\n", "# Drop the rows with missing values.\n", "data = data.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Now, we convert this `DataFrame` to a NumPy array, so that we can pass it to scikit-learn. Casting to `int32` turns the boolean `Sex` column into 0/1 and truncates fractional ages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data_np = data.astype(np.int32).values\n", "# The last column (Survived) is the target; the others are the features.\n", "X = data_np[:,:-1]\n", "y = data_np[:,-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Let's have a look at the survival of male and female passengers, as a function of their age." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We define a few boolean vectors.\n", "female = X[:,0] == 1\n", "survived = y == 1\n", "# This vector contains the age of the passengers.\n", "age = X[:,1]\n", "# We compute a few histograms.\n", "bins_ = np.arange(0, 81, 5)\n", "S = {'male': np.histogram(age[survived & ~female],\n", "                          bins=bins_)[0],\n", "     'female': np.histogram(age[survived & female],\n", "                            bins=bins_)[0]}\n", "D = {'male': np.histogram(age[~survived & ~female],\n", "                          bins=bins_)[0],\n", "     'female': np.histogram(age[~survived & female],\n", "                            bins=bins_)[0]}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We now plot the data.\n", "bins = bins_[:-1]\n", "plt.figure(figsize=(10,3));\n", "for i, sex, color in zip((0, 1),\n", "                         ('male', 'female'),\n", "                         ('#3345d0', '#cc3dc0')):\n", "    plt.subplot(121 + i);\n", "    # align='edge' places each bar on the left edge of its bin.\n", "    plt.bar(bins, S[sex], bottom=D[sex], color=color,\n", "            width=5, align='edge', label='survived');\n", "    plt.bar(bins, D[sex], color='k', width=5,\n", "            align='edge', label='died');\n", "    plt.xlim(0, 80);\n", "    plt.grid(None);\n", "    plt.title(sex + \" survival\");\n", "    plt.xlabel(\"Age (years)\");\n", "    plt.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "6. Let's try to train a `LogisticRegression` classifier. We first need to create a train and a test dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We split X and y into train and test datasets.\n", "(X_train, X_test,\n", " y_train, y_test) = cv.train_test_split(X, y, test_size=.05)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# We instantiate the classifier.\n", "logreg = lm.LogisticRegression()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"7. Let's train the model and get the predicted values on the test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "logreg.fit(X_train, y_train)\n", "y_predicted = logreg.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following figure shows the actual and predicted results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.figure(figsize=(8, 3));\n", "plt.imshow(np.vstack((y_test, y_predicted)),\n", "           interpolation='none', cmap='bone');\n", "plt.xticks([]); plt.yticks([]);\n", "plt.title((\"Actual and predicted survival outcomes\"\n", "           \" on the test set\"));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "8. To estimate the performance of the model, we can use the `cross_val_score` function, which computes the cross-validation score. For a classifier, this function uses a stratified k-fold cross-validation procedure by default (3 folds in older scikit-learn releases, 5 folds since version 0.22); this can be changed with the `cv` keyword argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cv.cross_val_score(logreg, X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function returns, for each pair of train and test set, a prediction score (the mean accuracy, for a classifier)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "9. The `LogisticRegression` class accepts a `C` hyperparameter as argument. This parameter is the inverse of the regularization strength: smaller values mean stronger regularization. To find a good value, we can perform a grid search with the `GridSearchCV` class. It takes as input an estimator and a dictionary of parameter values. The resulting estimator uses cross-validation to select the best parameter." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "grid = gs.GridSearchCV(logreg, {'C': np.logspace(-5, 5, 200)}, n_jobs=4)\n", "grid.fit(X_train, y_train);\n", "grid.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the performance of the best estimator." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cv.cross_val_score(grid.best_estimator_, X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).\n", "\n", "> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.2" } }, "nbformat": 4, "nbformat_minor": 0 }