{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Predictive Modeling with heterogeneous data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import warnings\n", "warnings.simplefilter('ignore', DeprecationWarning)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 0 }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Loading tabular data from the Titanic kaggle challenge in a pandas Data Frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us have a look at the Titanic dataset from the Kaggle Getting Started challenge at:\n", "\n", "https://www.kaggle.com/c/titanic-gettingStarted\n", "\n", "We can load the CSV file as a pandas data frame in one line:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#!curl -s https://dl.dropboxusercontent.com/u/5743203/data/titanic/titanic_train.csv | head -5\n", "!head -5 titanic_train.csv" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\r\n", "1,0,3,\"Braund, Mr. Owen Harris\",male,22,1,0,A/5 21171,7.25,,S\r\n", "2,1,1,\"Cumings, Mrs. John Bradley (Florence Briggs Thayer)\",female,38,1,0,PC 17599,71.2833,C85,C\r\n", "3,1,3,\"Heikkinen, Miss. Laina\",female,26,0,0,STON/O2. 3101282,7.925,,S\r\n", "4,1,1,\"Futrelle, Mrs. Jacques Heath (Lily May Peel)\",female,35,1,0,113803,53.1,C123,S\r\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "#data = pd.read_csv('https://dl.dropboxusercontent.com/u/5743203/data/titanic/titanic_train.csv')\n", "data = pd.read_csv('titanic_train.csv')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas data frames have a HTML table representation in the IPython notebook. Let's have a look at the first 5 rows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
\n", "

5 rows \u00d7 12 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 5, "text": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n", "2 Heikkinen, Miss. Laina female 26 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n", "4 Allen, Mr. William Henry male 35 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", "\n", "[5 rows x 12 columns]" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "data.count()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 6, "text": [ "PassengerId 891\n", "Survived 891\n", "Pclass 891\n", "Name 891\n", "Sex 891\n", "Age 714\n", "SibSp 891\n", "Parch 891\n", "Ticket 891\n", "Fare 891\n", "Cabin 204\n", "Embarked 889\n", "dtype: int64" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data frame has 891 rows. Some passengers have missing information though: in particular Age and Cabin info can be missing. The meaning of the columns is explained on the challenge website:\n", "\n", "https://www.kaggle.com/c/titanic-gettingStarted/data\n", "\n", "A data frame can be converted into a numpy array by calling the `values` attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "list(data.columns)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 7, "text": [ "['PassengerId',\n", " 'Survived',\n", " 'Pclass',\n", " 'Name',\n", " 'Sex',\n", " 'Age',\n", " 'SibSp',\n", " 'Parch',\n", " 'Ticket',\n", " 'Fare',\n", " 'Cabin',\n", " 'Embarked']" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "data.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 8, "text": [ "(891, 12)" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "data.values" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 9, "text": [ "array([[1, 0, 3, ..., 7.25, nan, 'S'],\n", " [2, 1, 1, ..., 71.2833, 'C85', 'C'],\n", " [3, 1, 3, ..., 7.925, nan, 'S'],\n", " ..., \n", " [889, 0, 3, ..., 23.45, nan, 'S'],\n", " [890, 1, 1, ..., 30.0, 'C148', 'C'],\n", " [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "However this cannot be directly fed to a scikit-learn model:\n", "\n", "\n", "- the target variable (survival) is mixed with the input data\n", "\n", "- some attribute such as unique ids have no predictive values for the task\n", "\n", "- the values are heterogeneous (string labels for categories, integers and floating point numbers)\n", "\n", "- some attribute values are missing (nan: \"not a number\")" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Predicting survival" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of the challenge is to predict whether a passenger has survived from others known attribute. 
Let us have a look at the `Survived` column:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "survived_column = data['Survived']\n", "survived_column.dtype" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 10, "text": [ "dtype('int64')" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "`data.Survived` is an instance of the pandas `Series` class with an integer dtype:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "type(survived_column)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 11, "text": [ "pandas.core.series.Series" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `data` object is an instance of the pandas `DataFrame` class:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "type(data)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 12, "text": [ "pandas.core.frame.DataFrame" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Series` can be seen as homogeneous, 1D columns. `DataFrame` instances are heterogeneous collections of columns with the same length.\n", "\n", "The original data frame can be aggregated by counting rows for each possible value of the `Survived` column:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.groupby('Survived').count()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
Survived
0 549 549 549 549 549 424 549 549 549 549 68 549
1 342 342 342 342 342 290 342 342 342 342 136 340
\n", "

2 rows \u00d7 12 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 13, "text": [ " PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \\\n", "Survived \n", "0 549 549 549 549 549 424 549 549 549 \n", "1 342 342 342 342 342 290 342 342 342 \n", "\n", " Fare Cabin Embarked \n", "Survived \n", "0 549 68 549 \n", "1 342 136 340 \n", "\n", "[2 rows x 12 columns]" ] } ], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "np.mean(survived_column == 0)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 14, "text": [ "0.61616161616161613" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this the subset of the full passengers list, about 2/3 perished in the event. So if we are to build a predictive model from this data, a baseline model to compare the performance to would be to always predict death. Such a constant model would reach around 62% predictive accuracy (which is higher than predicting at random):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "pandas `Series` instances can be converted to regular 1D numpy arrays by using the `values` attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "target = survived_column.values" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "type(target)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 16, "text": [ "numpy.ndarray" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "target.dtype" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 17, "text": [ "dtype('int64')" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "target[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 18, "text": [ "array([0, 1, 1, 1, 0])" ] } ], "prompt_number": 16 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Training a predictive model on numerical features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`sklearn` estimators all work with homegeneous numerical feature descriptors passed as a numpy array. Therefore passing the raw data frame will not work out of the box.\n", "\n", "Let us start simple and build a first model that only uses readily available numerical features as input, namely `data.Fare`, `data.Pclass` and `data.Age`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "numerical_features = data.get(['Fare', 'Pclass', 'Age'])\n", "numerical_features.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FarePclassAge
0 7.2500 3 22
1 71.2833 1 38
2 7.9250 3 26
3 53.1000 1 35
4 8.0500 3 35
\n", "

5 rows \u00d7 3 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 19, "text": [ " Fare Pclass Age\n", "0 7.2500 3 22\n", "1 71.2833 1 38\n", "2 7.9250 3 26\n", "3 53.1000 1 35\n", "4 8.0500 3 35\n", "\n", "[5 rows x 3 columns]" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately some passengers do not have age information:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "numerical_features.count()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 20, "text": [ "Fare 891\n", "Pclass 891\n", "Age 714\n", "dtype: int64" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use pandas `fillna` method to input the median age for those passengers:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "median_features = numerical_features.dropna().median()\n", "median_features" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 21, "text": [ "Fare 15.7417\n", "Pclass 2.0000\n", "Age 28.0000\n", "dtype: float64" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "imputed_features = numerical_features.fillna(median_features)\n", "imputed_features.count()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 22, "text": [ "Fare 891\n", "Pclass 891\n", "Age 891\n", "dtype: int64" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "imputed_features.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FarePclassAge
0 7.2500 3 22
1 71.2833 1 38
2 7.9250 3 26
3 53.1000 1 35
4 8.0500 3 35
\n", "

5 rows \u00d7 3 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 23, "text": [ " Fare Pclass Age\n", "0 7.2500 3 22\n", "1 71.2833 1 38\n", "2 7.9250 3 26\n", "3 53.1000 1 35\n", "4 8.0500 3 35\n", "\n", "[5 rows x 3 columns]" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the data frame is clean, we can convert it into an homogeneous numpy array of floating point values:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "features_array = imputed_features.values\n", "features_array" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 24, "text": [ "array([[ 7.25 , 3. , 22. ],\n", " [ 71.2833, 1. , 38. ],\n", " [ 7.925 , 3. , 26. ],\n", " ..., \n", " [ 23.45 , 3. , 28. ],\n", " [ 30. , 1. , 26. ],\n", " [ 7.75 , 3. , 32. ]])" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "features_array.dtype" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 25, "text": [ "dtype('float64')" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take the 80% of the data for training a first model and keep 20% for computing is generalization score:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import train_test_split\n", "\n", "features_train, features_test, target_train, target_test = train_test_split(\n", " features_array, target, test_size=0.20, random_state=0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "features_train.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 27, "text": [ "(712, 3)" ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "features_test.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 28, "text": [ "(179, 3)" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "target_train.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 29, "text": [ "(712,)" ] } ], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "target_test.shape" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 30, "text": [ "(179,)" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start with a simple model from sklearn, namely `LogisticRegression`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "logreg = LogisticRegression(C=1)\n", "logreg.fit(features_train, target_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 31, "text": [ "LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, penalty='l2',\n", " random_state=None, solver='liblinear', tol=0.0001)" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "target_predicted = logreg.predict(features_test)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import accuracy_score\n", "\n", "accuracy_score(target_test, target_predicted)" ], "language": "python", "metadata": {}, 
"outputs": [ { "output_type": "pyout", "prompt_number": 33, "text": [ "0.73184357541899436" ] } ], "prompt_number": 31 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This first model has around 73% accuracy: this is better than our baseline that always predicts death." ] }, { "cell_type": "code", "collapsed": false, "input": [ "logreg.score(features_test, target_test)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 34, "text": [ "0.73184357541899436" ] } ], "prompt_number": 32 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Model evaluation and interpretation" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Interpreting linear model weights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `coef_` attribute of a fitted linear model such as `LogisticRegression` holds the weights of each features:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "feature_names = numerical_features.columns\n", "feature_names" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 35, "text": [ "Index([u'Fare', u'Pclass', u'Age'], dtype='object')" ] } ], "prompt_number": 33 }, { "cell_type": "code", "collapsed": false, "input": [ "logreg.coef_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 36, "text": [ "array([[ 0.0043996 , -0.80916725, -0.03348064]])" ] } ], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.arange(len(feature_names))\n", "plt.bar(x, logreg.coef_.ravel())\n", "_ = plt.xticks(x + 0.5, feature_names, rotation=30)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAXQAAAEHCAYAAAC+1b08AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAADjxJREFUeJzt3X+Q3HV9x/HnJsePg2xCNKKIUutPtKLMVFTIQNYaFYgo\no1QhLa0aq0ZmKMFf05bC4jDKQKXadkTEOBNUUMcThTIe7YibAJUfMyryo502nXYsHUpaKbkbZIJN\ntn+8v5fb7G3uNt/c7t6+7/mY2bnv7ud73/3MfPb72vf38/1+70CSJEmSJEmSJEmSJEmSJKkvKoN6\n4zVr1jS3bt06qLeXpGG1Fah1ahhYoAPNZrM5wLfvrXq9Tr1eH3Q3VIJjN9yyj1+lUoH9ZPeS/nZF\nktQrBrokJWGg90itVht0F1SSYzfcFvP4OYcuSUPEOXRJWgQMdElKwkCXpCQMdElKwkCXpCQMdElK\nwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKomygLwG+BPwD8CPgJW3tZwH3Fe0fLN07SVLXRkr+\n3tnAocApwBuAzxWvARwCXAO8DvgVcDdwC7DjoHoqSZpV2Qp9NTBeLN9LhPeUVwLbgZ3Ar4G7gNPK\ndlCS1J2yFfpyYKLl+W7iy2FP0bazpW0SWNFpI8Xf9VWPVKsrmZh4YtDdkNQnZQN9Aqi2PJ8Kc4gw\nb22rAv9b8n20QC1f/iwmJx3WXurVF7Jj1x/zNX6NRoNGo9HVumVL5HcRJz7fD7wR+HNgXdF2CPAw\nMbf+FHFi9CzgsbZt+B+LhlgcXTl+vVWhF/uIY9cvvRy/ztldtkK/GXgLccITItjPA5YB1wMXA7cT\nlftmZoa5JGme+T9FVYpVXj9YoQ+3/lfo3lgkSUkY6JKUhIEuSUmUPSkqaUhVqyuZnPQekF6rVlf2\n/T09KapSPLHWD705qabh5klRSVoEDHRJSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6Qk\nDHRJSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6QkDHRJ\nSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6QkDHRJSsJAl6QkDHRJSqJMoI8CY8A24DZg\nVYd1NgH3FI9LS/dOktS1MoG+EXgAOA24Abikrf3FwHrgZOCNwFuBEw6ij5KkLpQJ9NXAeLE8Dqxt\na/8F8DagWTw/BHi6VO8kSV0bmaN9A3BR22uPAxPF8iSwoq39/4AngApwNfATYHunjdfr9b3LtVqN\nWq3WRZclafFoNBo0Go2u1q2U2P4YcCVwPxHmdzFzSuVw4KvATuCjTFfrrZrNZqeXNQwqlQqdh1Xz\np4L7iNrFvtc5u8tMudwNnFksn0GcHN3n/YDvAz8j5tv9REpSH5Sp0EeBLcAxwC7iBOgO4sqW7cBS\n4Cbgxy3b/xPiipdWVuhDzAq9H6zQNdNsFXqZQJ8vBvoQM9D7wUDXTPM95SJJWoAMdElKwkCXpCQM\ndElKwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElK\nwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKwkCX\npCQMdElKwkCXpCQMdElKwkCXpCQMdElKwkCXpCQMdElKokygjwJjwDbgNmDVLNv+AfDhcl2TJB2I\nMoG+EXgAOA24AbhkP+tdARwFNMt1TZJ0IMoE+mpgvFgeB9Z2WOccYHfRXinXNUnSgZgr0DcAD7Y9\nVgATRftk8bzVq4HzgEsxzCWpb0bmaN9cPFqNAdViuQo82dZ+PnAscAfwIuAZ4N+Av2vfeL1e37tc\nq9Wo1WpddVqSFotGo0Gj0ehq3TIV9MVEkF8OnAucClywn3UvAx4DvtyhrdlsOr0+rCqVCp4e6bUK\n7iNqF/te5+yeq0Lv5FpgC3AnsAtYX7y+CdgO3Fpim5KkgzTIOW4r9CFmhd4PVuiaabYK3RuLJCkJ\nA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12S\nkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQ\nJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSkJA12SkjDQJSmJMoE+CowB24Db\ngFUd1jkD+HHx+KvSvZMkda1MoG8EHgBOA24ALmlrrwJXAeuAk4H/BJ5zEH2UJHWhTKCvBsaL5XFg\nbVv7KcCDwDVEFf8Y8N9lOyhJ6s7IHO0bgIvaXnscmCiWJ4EVbe2rgDcBrwWeAu4kpl7+5aB6Kkma\n1VyBvrl4tBojplUofj7Z1v4/wP3AjuL5NuBEOgR6vV7fu1yr1ajVal10WZIWj0ajQaPR6GrdSont\nX0wE+eXAucCpwAUt7UcD9wAnATuBu4APAI+0bafZbDZLvL0WgkqlAjh+vVXBfUTtYt/rnN1lAn0U\n2AIcA+wC1hPV+CZgO3Ar8F7gE8X63wKu7rAdA32IGej9YKBrpvkO9PlioA8xA70fDHTNNFuge2OR\nJCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVh\noEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtS\nEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVhoEtSEga6JCVRJtBHgTFg\nG3AbsKrDOhuB+4H7gLNL906S1LVKid+5GFgGfBp4L3AycFFL+zLg58DLi+WfAS/qsJ1ms9ks8fZa\nCCqVCuD49VYF9xG1i32vc3aXqdBXA+PF8jiwtq196hO4DKgCu0u8hyTpAI3M0b6BfatvgMeBiWJ5\nEljR1v4UcBPwCLAU+Mz+Nl6v1/cu12o1arXaXP2VpEWl0WjQaDS6WrfMlMsYcCUxR74CuAs4oaX9\nFOAK4PRi+7cDnyjWb+WUyxBzyqUfnHLRTLNNucxVoXdyN3AmEdBnECdHWx0JPA08Uzx/kplVvIZc\ntbqSycky9YC6Va2uHHQXNGTK7JGjwBbgGGAXsB7YAWwCtgO3AlcBa4j58zuBT3XYjhW6JB2g2Sr0\nQZZYBrokHaD5vspFkrQAGeiSlISBLklJGOiSlISBLklJGOiSlISBLklJGOiSlISBLklJGOiSlISB\nLklJGOiSlISBLklJGOiSlISB3iPd/ssoLTyO3XBbzONnoPfIYv5QDTvHbrgt5vEz0CUpCQNdkpIY\n5L+gaxD/d1SS1L2tQG3QnZAkSZIkSZIkSWq3BK8oysSx
HB5LB90B5dK68x8NLB9UR3TQKuw7nobF\nwtY6VssG1os+8gPZO0uBZvE4DPgc8FHgVOAx4NHBdU0HoQmcBFwOHAU8MNjuaBZN4FXAXxNj1gT+\nvfgplbIU+BBwRfH8G8DNwKqB9UgHor3oeQdwD7AW+ABx1KWFoX2sXgB8DzgFOBu4F3hTvzvVT84F\nzp8l7Huj1onAHcA5wAjwYuDrwC+Bh4Bj+t1BlbK7+Pk6YnzXALcDK4B3A58ixlqDVWF6rM4Ezi2W\njwCeQxRV/0SM29T66Rjo86MC7CEO5SrAbwB/CfwU+BbwOBHo9wAXAs8lDgW18LTvE28mKrsNwOuJ\ncb2XqP7WEWP9m/3soPY6vPhZIfa9tcCNxNTmecBLgAngz4iQ/yrxhTw1HZqOc+jlLSEOufcA/0VU\n4ZcSH6ofAFXiBOh9xJz508BvAx8DfgJc2/8uaxYVYn/Y0/La84FrgIuIqvzNwCPENMsq4PeIcb4J\neKKfnV3kRogpzD8mvlyfICrxzwPXAbcQlfhrge8CLwVOJr6UrwX+tf9d1kJ2DjGdcjVwJDFHdwvw\np8Bm4GvEWfWvAWcw/cX5W0wf8mlhaD/0PpLpig5inC8rfn4WGCdOhp4PfLBPfdS+RoDvEBX3F4AP\nA68E/r5lnXXAGFFEHQec1ec+aggcRVQEY8CrW14/HfgjpsPgGeAE4A+ArxAfKC08ryEqtqlLSX+X\nqMQ/T3wZX9DyOsTh+s146ekgTRVHnyTG6fnANuAPgR3AO4v2C4mx/Ox+fj8l59APzE7icsM7iROb\nLyQOySeJ+bzriSC4jgiKO4rXfjGIzmq/pj73DwHHA28rnj8b+DRwFfAkcST2vOLxt8DHiSCZ6Gdn\ntY+pKbHtwD8Co8QX7JnENNhG4IdExX4rcQR2GNNjvpvERgbdgSHTJA6/rweOJaZQ7gfuJk6M/ZyY\nfjmUuKLlUbzefKFZwnQo7CHmYjcRJ6wfIubJjyPmXo8F/gL4faLa++d+d1YzTJ3MPIS4F6BG7JPf\nB35NXMnyKLCFGLsjgF1976WGyhXAfxDVwZR1wLeBG4iKTgvHUuJcxvHF80OJE9hjwCuI+dgLiSmz\nOjEVcz4xffahPvdV3RklKvHjW147iqjWP0IUWZcNoF8aQkcTJ2BeXzyfOpw7YjDd0SzeTcyxbgYe\nBl4GfBn4IvB2ogo/kbgy6RXEvOyPivWrA+ivunMc8UX8Qqbnxad+PosId6lrG4hLErUwPZc41/Ed\npm/8uY44Sf2NlvWqxJ9juJG4YmI53sU7LDwa1rw5HHgfSe84S+B5xJTKe4rnRwN/A7wBeBBYX7y+\nlvg7O0cAv9PnPkqSunQ68E3gM8SUytSc6luJo6sriSr+IwPpneaDV+pJi8Qyoip/iJl/PvWlxPXl\nL+h3pyRJ5ZxEXMI2dTXEYQPsiyTpIBxKXGd+46A7IvVa6ttgJeLOwF8Sd/M+POC+SJIkSZIkSZIk\nSZIkSZIkSZIkLRj/D7J1VwbqZOyhAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 35 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, survival is slightly positively linked with Fare (the higher the fare, the higher the likelyhood the model will predict survival) while passenger from first class and lower ages are predicted to survive more often than older people from the 3rd class.\n", "\n", "First-class cabins where closer to the lifeboats and children and women reportedly had the priority. Our model seems to capture that historical data. We will see later if the sex of the passenger can be used as an informative predictor to increase the predictive accuracy of the model." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Alternative evaluation metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible to see the details of the false positive and false negative errors by computing the confusion matrix:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import confusion_matrix\n", "\n", "cm = confusion_matrix(target_test, target_predicted)\n", "print(cm)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[98 12]\n", " [36 33]]\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "/home/varoquau/dev/numpy/numpy/core/fromnumeric.py:2499: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. 
To find the rank of a matrix see `numpy.linalg.matrix_rank`.\n", " VisibleDeprecationWarning)\n" ] } ], "prompt_number": 36 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The true labeling are seen as the rows and the predicted labels are the columns:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_confusion(cm):\n", " plt.imshow(cm, interpolation='nearest', cmap=plt.cm.binary)\n", " plt.title('Confusion matrix')\n", " plt.set_cmap('Blues')\n", " plt.colorbar()\n", "\n", " target_names = ['not survived', 'survived']\n", "\n", " tick_marks = np.arange(len(target_names))\n", " plt.xticks(tick_marks, target_names, rotation=60)\n", " plt.yticks(tick_marks, target_names)\n", " plt.ylabel('True label')\n", " plt.xlabel('Predicted label')\n", " # Convenience function to adjust plot parameters for a clear layout.\n", " plt.tight_layout()\n", " \n", "plot_confusion(cm)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAATkAAAEZCAYAAADsTVLHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAG6NJREFUeJzt3XmYXFWZ+PFvVRMICSSELSQsAQIBcSIBIxDApIOIOqAg\nrqgILiCbMIKoIKMZRrYfgqMiKgxLQAYBn5+IqIARsrFkIwQDYQmLIwQQJOwBQtLzx3srXal0VVdV\n15Zb30+e+/StW1X3nuon/dY5555zXpAkSZIkSZIkSZIkSZLUxjqAk4E5wHzgAeBcYN0+nDML/A54\nGDiuive/F7ihD9evtcHA7SWenw8MalBZJFXoEuA6YMPk8QDgt8BVfTjnNsAyINO3orWMbYFXm10I\nSZXbDngN2KDg+FDgkGR/MPAr4K/A/cB5RO0P4E3g+8BM4HHgpORcDwLvAPcC2wMrgY3zzp97vAFR\nY5sPzCMCbgboTK5XzfV78iZwdvL+J4FPAdcDi4C/EIEd4MvAPUm5nwSOSY7fkfd5ssBbxBfDQ0St\ncyWwSVKWu5LXbAE8DUwoUiZJDfAJYFYvr5kM/CjZXxe4Bfh28ngl3c3R3Yna27rACFav+RQLcocD\nf0qOZYkgN5LVg1w11y+0Ejgh2f8W8DIwjAioc4HDgIFEgBqSvG4v4JVkv6fP8/kePk8WmJqU78/A\nd3ooi6QGOoToiyvlOSLw5L9narK/Etgy2c8kj4ewZvOuWJDbFvhfoqZ0OrBL8nwn3UGumusXWkkE\nNYBPEjW/nBvoDpQbA18EzgRuAlYkx3v6PNsU+XxbAUuBW3soh5ok2+wCqGnmAO9izebqlsDNQH/i\n/0d+31oHsE7e42XJz67kZ7F+uNzx/JrWk8AOwDlEx/0UonaZr1bXfytvf3kPz28FLAC2BmYAZ5Q4\nF0QzvycjgDeIz7VRifergQxy7etp4BrgcrpvPAwCLgZeIPqybgWOT55bDziaaIpV4nngfcn+ocnP\nDHAscAVwG9G0uxV4N90BixpdvzcZom/tH8BZyfk/mvfcO3T3A5ayEXA1URv8NXBZjcupKhnk2ttx\nxI2Cu4gbAPcAC4GvJs+fCGxOd8f/IiIQwOrBqPBx/v6JwM+ImwtjgCXJ85OJ4PEgUavcEPgxEVi6\n8t5bzfVLlaun990GPEUMe5lB1PyeIWpkS4ibDg8SzdKe3p8h+hRvJm5mTCKa2ccgSZIkSZIkSZIk\nCcgMHJ67W+fmlsrt/eMndFErHeuVc80Xa3a9RFomUTdLV/8xx/f+qhaz/JnZ9Bu2R7OLUZWlcy5q\ndhEq9oMzJ3HG9yY1uxhVWb9fBmoXJ7r67/b1ki94c/5Pa3k9YPXR45JUX5nG16sMcpIaJ1vO5JHa\nMsi1oewGW/b+ItXM+AmdzS5C68g0fpKVQa4NdWxokGskg1wem6uSUs3mqqRUs7kqKdWsyUlKNfvk\nJKWazVVJqdZhc1VSmlmTk5RqlffJrQv8N7EU/XJiSfzXgSuJTGkLiTwgRRcSMMhJapzK764eRWRA\n2xsYRSQJ+juRxnI68HPgYODGopesppySVJVMtvS2pl2IpOIAjxApM/cjAhxEgvL9S13SICepcbId\npbc13QcclOzvBWwGDMh7/jVgcKlL2lyV1DgFfXIr/rmYlS8uLvWOy4kk6DOAO4m0kZvmPb8h8FKp\nExjkJDVOQZO0Y9NRdGw6atXjFY/dVviOPYDbgZOBscCeRLN1AjAN+AiR67Yog5ykxqn8xsPDwHXE\njYY3icTnWeBS4s7rg8BvSp3AICepcSofJ/ci8MEejneWewKDnKTGcYK+pFRzgr6kVHNal6Q0y2QN\ncpJSLGNzVVKqNSGdvUFOUsNkba5KSjObq5JSLZM1yElKMWtyklLNICcp1bzxICndHEIiKc2qqMll\niUQ2o4jENUcBKzCRjaRWVEWf3AHAQGBfIpfD2UTcMpGNpBaU6WVb0zIih0Mm+fk28F4qSGRjTU5S\nw1TRXL0T6A88BGwCfBQYn/d8r4lsrMlJaphMJlNy68G3iEC3EzAGuArol/e8iWwktY7CQPbWkoW8\nveSBUm8ZCLyS7C8lYtZ8TGQjqRUVTuvqv9Vo+m81etXj1+69vvAt5wNXECkJ+wGnAfMwkY2kVlTF\n3dWXgI/3cLyz3BO0Sp/c0dQ34H6IGF9TrbHEt4mkPshkMyW3emiVmtxpwGTgnTqd/9Y6nVdSBdI0\nd/VI4F+B9YGRwHlEENsN+AkxYvlNonZ1ALAFcC1waN459gEuIMbFvAF8Mtl2IoJif2ARsB0wFXgO\n2Bh4FfgvYhzNWOAM4LfAzsASYAhwJrAecB/wHuAY4DBi1PSvgZ8m17mcGKfzz6QMkvqgGUGuns3V\nQcSYlo8B30mOXUpMwegELgYuBC4DngU+W/D+g4mAM4EY1TyE4lM3uoD/IZLQXgIckRz/UvI452rg\n08n+x4DfAzskx/Yhxt8cQkwhOR/4HjHQcEqZn1lSCdlstuRWD/WqyXUR
tSSAp4haF8Aw4P5kfwZw\nbolznA18l7g9/DQwq+D5wq+Eh5OftxEBaggxFeQEuoPeS8Tt532TYycTY29GALcnr9kI2JGoyc1J\njk0H9u6pkMufmb1qP7vBlnRsuGWJjyS1tunTpjJ92tT6XSBlE/R7qnUtAUYDfyVqaLnAtBIoTK39\nBWIS7qlE8/To5PXDkud3L3K9lcANwC+IZmphOS4FvkEE3keSnw8Q420gAt/9xK3pfYE/AuOKfch+\nw/Yo9pS01hk/oZPxEzpXPT7rP/+jpudP21JLXT3sHwVcRMTz5cBXkuMzgD8A++W9Zzax+sDrRB/e\n0URN7Njk9fOAl4tc7wpgMREgc8/lnp9ONGF/kDy+n6gtziQC3j1EzfEbRJA9haiNrijvY0sqpgld\ncs2oPKZKV/8xxze7DG1l6ZyLml2EtrJ+vwzULk507fDNP5V8weIffqSW1wNaZwiJpDaQNZGNpDRr\nRnPVICepYazJSUo1g5ykVLO5KinV0jZOTpJWU0VN7ghiLjzEXPhdiUH6P6bMbF2tstSSpDZQxfLn\nk4GJyTYX+Doxp/x0Yq55hpjnXpRBTlLDZLOZklsJY4FdiFlQZuuS1Jr6cOPhdCA3kTb/LL1m6zLI\nSWqYwtraK4/fx6tP3Ffk1atsRCx/Ni15vDLvObN1SWodhf1ug0fuxuCRu616vOT2yT29bTyrZ+Qy\nW5ek1lRlc3UU8Fje41MwW5ekVlTljIcfFjx+lAqydRnkJDVMmhLZSNIanNYlKdWc1iUp1azJSUo1\nl1qSlGreeJCUaq3WXP1akeNdrJ6VXpLK0tFizdVhlFijSZIq1WrN1Ul5+/sDI4G7idHGklSxVqvJ\n5ZwDbEms5bQcOA04rJ6FkpROzeiTK2dk3r7AF4FXgcuB7epaIkmplenlXz2UU5PrAPrn7a+oS0kk\npV4zmqvl1OR+BMwD3g3MBi6ua4kkpVYmU3or4jTgLmAOkdhmB2AmsQT6xVC6ClhOkLuBaLIeCHwI\nuKacDyNJhbKZTMmtB53AOGDvZH974AJqnMjmfcAU4Ebgd0SNTpIqVkUimwOAvxLx5/fATdQhkc2P\ngMOJFTjHAD8nIqgkVaSKu6ubAVsDBxG1uN9Th0Q2bxIBDuA+YhiJJFWso/Io9wKwCHgHeISIR1vm\nPd+nRDafSH4uA84ikkbsBfyz0lJKEqw54+G5RXP5x0NzS71lJnAScCEwHBhAJK6pSSKb0cS0rnnJ\nz3HJ8QdKnVCSiinsdhu2y1iG7TJ21eMHbvxl4Vv+QHSPzSbuIRwHPEmNEtlMytsfDvQj2sLDS51Q\nkoqpcj25b/dwrLPcN5fTJ3c50UzdAFgfmEV0AkpSRZoxQb+cISS7Av8C3ELMX32lriWSlFod2UzJ\nrR7Kqcn9E1hJ1OSeB7aoS0kkpV4T5ueXFeTmAacCS4BfE8FOkipWZFZDXZUT5E4jxqIsI27Xzq5r\niSSlVqslsjmnyPFxxLwxSapIq+V4eBiXP5dUQ63WXL2yUYVYm9187aRmF6Gt3PdkyRk8anGt1lyV\npJoqZ8xarRnkJDVMqyay2Qo4F9gcuA5YSMx6kKSKNCHGlVV7vAS4gpgMOwv4SV1LJCm1MplMya0e\nygly6xNLmXQRtbhldSmJpNTryJbe6qGc5uoy4MNEpq5xxKJ1klSxVhtCkvM14IfApsA3gWPrWiJJ\nqdVRXYy7F3g52X+cmKhwJTGnfiFwPCXG9JYT5P4OfKaqoklSnipqcrmczxPzjt1EzLqaTuScOZhI\ndNPzNcu4yLPAM8nPt4GHKi2lJEFVeVd3JZY8v5W4N7AXsDs1ztaVv7TSCFZfMViSyrZO5WNIXgfO\nBy4DdiTWtcxXk2xd+f4GvKvC90gSsGZt7YkFs3hiQclht48Ai5P9R4n1LXfLe75P2bpyrs3bH0Y0\nWyWpYoUVuZFj9mTkmD1XPZ569U8L3/Il4D3EzYXhRFC7jRpl68q5DlhKLOq5DCiZP0ySiqki7+pl\nxGSEXB/cl4jaXE2ydeWcCuxTackkqVAV07reAQ7v4XhnuScoN8fDSXSvL9dFVBclqSKtOkH/RWBM\nsuUY5CRVrNVWBr4e+DRwZGOKIintWm1a12YNK4WktlCvSfillApy2wNns2aqxC5MZCOpCtkmZF4t\nFeTeIG42SFJNtFpN7llgcqMKIin9Wq1Pbl7DSiGpLbTa3dVvNqwUktpCq46Tk6SaMCWhpFRrtT45\nSaopg5ykVGvCfYemNJEltalsNlNyK2FzIt/MKGAHYCax/NLF9BI7DXKSGibby1ZEP+CXxFLoGeBC\nYtbV+OTxwb1dU5IaIpPJlNyKOJ/IyvVM8riiRDYGOUkNk81kSm49OBJ4nu7l3TKs3jyteSIbSapa\nFbWqLxGLguxPrGk5mdVXSKpJIhtJqonC2trCOXexcO5dpd4yIW//DuAYovla00Q2klQThS3S0Xvs\nzeg99l71+LpfXNDbKbqAU6hxIhtJqok+ric3MW+/s9w3GeQkNYwzHiSlWqsttSRJNVVFcuk+M8hJ\nahhrcpJSLdNiiWwkqaaa0Vxth2lduwL/3of3b0oMQpTUR5lM6a0e2qEmtyDZJDVZM5qrrVyTGwXc\nCUwlVhz4AnBt3vO5FQmuBG5KXns58MXk+BbAXGL6x7XAR5Pnc+YRc+A+BdwFzADOSZ4bCkwhanA/\nrNknktpcRyZTcquHVg5y+wP3JD+/T/GVBrqIuWv7EHPajkiOH87qQe0PwDhgAPA+4DHgHWASsB/w\nfmDL5HrfJQLjROCaGn0eqe3ZXF3dZcC3gVuAl+leaiUn/1fycPJzEfGZtgE+DXyAWHsKYCUxx+1Q\nIthdSqwwuhmxJhXABsBIYKfk+hA1PEk14Di51R1MBJgzgcOAo+iueY4ANs57bVfe/mVEje4B4JWC\nc14GXAIMAY4nAtzfidrbCuDLRBN3Z6JmuADYq1QhJ1/0/1bt77rHPozZY58yP57Ueu6dNZN7Z82s\n2/mbkeOhGdcs1/bE2lFvE8HtVOAMoq9tEVEb2xm4gmha5mp6A4CniT64mUSf3NeAzyXP30oEzx8k\njz8PHAd0AE8Q61cNAK4GBgIPEf2D+/VQxq4pi56vxWdVmQau08rfy+kzbschULs40XX34qWlr7dD\nTa9HzU/WhgxyDWaQa6xaB7lZj5Vc35I9R25Uy+sBrd1clZQyVUSvDqL/fBTRLXUM8BYxqmIlsJDo\neuoq8n6DnKTGKZGsppiDiGC2L9H1dHZy/HRiaNnPif77G4udoJWHkEhKmSqGkPyO6FMH2BZYCrwX\ns3VJakWZXrYiVhDN0x8T41bN1iWpNRU2V+fePYO595Q1FPVIYibSbKB/3vFes3V5d7VvvLvaYN5d\nbaxa312d/7fCoaur223EoMLrHQ5sRUy5HATcBzxK9M1NA35BzHi6odg5/R8jqWGqiJa/IZqq04B+\nwEnE2FWzdUlqPVXcXV0GfKa
H453lnsAgJ6lhXP5cUqoZ5CSlmjkeJKWaNTlJqWaQk5RqNlclpVrW\nmpykVDPISUozm6uSUs3mqqR0M8hJSjObq5JSzeaqpHSrPMj1Ay4nci2vR6QSXUQFiWxc/lxSw2Qz\nmZJbDz4PPA+MBz4M/Ay4gEhkM54ImweXvGZNP4EklVBFjocbgO8l+1lgObA7JrKR1IqqyNb1OpGs\nZkMi4J3B6nGr10Q2BjlJDZPJZEpuRWwN3A5cBVxL9MXl9JrIxhsPkhqmMIzdNXMad8+c3uNrE0OB\n24DjgDuSY/OJRNPTgI8QiWzKvqYqY7auBjNbV2PVOlvXMy+9XfIFwzZat/B6PwY+BTycd+wk4Cd0\nJ7I5ihJ3V/0fI6lxKg+XJyVboc5yT2CQk9QwzWg6GuQkNUyRsXB1ZZCT1DhO65KUZjZXJaWazVVJ\nqWa2LkmpZpCTlGoumikp1azJSUq1ZgQ5VyFpQ/fNvrPZRWgr986a2ewitIxML//qwSDXhhYY5BrK\nINctmym91YPNVUmNY5+cpDRrxt1V15Prm6nE4n1SWk2jgmWNelF0zbc8S4GNa3Q9SZIkSVJVHN0g\nNckQIqmvGsO+8TrqaHYB1JJ2Ag4FDiA6gp9ubnFSZ3fg/wP/AB5pcllSzyCnQhngWWAB8HFgPPAe\n4H+BF5tYrjR5Bngc+BYwDvgbEfAgmrDl3IVUmQxyKtRB/JEdTQS8q4jxlEcCw4G5+EfYF+sQyZHf\nAcYCbwBfBHYk8okua17RpPaxIXAXMDJ5vC5wMXBa00qUPr8BDk/2hxDN1z/iAP2a8+6OCmWAV4k/\nuAOSY5sC2wJXNKlMaZIF+gMvAM8B/Yh+z4eAnxM1PNWQ3xrK6QBWAAOA9YDpwIXAQcArwGyir07V\nyRDN/JXAm8Tv9wvEF8gOwPuA05tWuhTz1rUK/QTYnKhpXE/ULP5K1O5UvdyXyAeArYHXicC2LvAo\nMIf4IpFUB7kvu68ClwBbEEMbTqV7bq5fiNXLdQsNJeaC/hswi6jJbdasQrUL++QE3XdLhwK3AUcQ\nf4iLgc8VvEaVW5n8/ArRr3kDsIRotn6H6KOTVCe5L7qNge2IGsaTxI2G24H3N6VU6ZFfkTgQuJGo\nzQ0ETgZ+0IxCtRPHyamLGMIwm2iiTiH+X4wH7gSua17RUiFXAz4R+AMxHu4DRJD7MDEe0TuqUp3k\n97PtC3wX2CB5vB5+CdbCOsn2a+Aiora8G3AIsH0TyyW1jaHEXFWAnxEzHIY2rzip9l3gg80uRLvx\nxkN72gKYmOwfClxK3FWdTzRTD2pSudJkw+TngcSXx2iiRncx8En822sYf9HtaRRxd280MAjYD7iJ\nmIR/dfKcqjeW+PIYQPyN/R04BXgJeItopq4s+m5JNdEP+CxwAfEHOLL0y1WBzYhhIQcSc1K3zXtu\nKPHFogaxY7n95A/8HQvcTAS4/Ym+uXlYy+irN4ha8tbEWMNjgF2J5ZSeAN5uXtHaj83V9pMb0nA3\nsazPQuB84B7iD3B5k8qVNrsA/wrcCpxL9Mf9W/KcA6ulOlsv+XkEMAPYhqjhOXWrbwp/f+cRXQEQ\nYxGHNbY4Apur7SS3GOaewGHAJ4ja2wBiddpbMcj1RW6VkQ5isO9gYvDvN4kW0yzgtaaVro35n7o9\n5P4AIdaJ+y3xB/dRYvmk9xMrYqh6WaIv8+tEH+dY4H7iS+SDRB+dy8dLdZLrez2MWCNOtZX7/W4H\nzCTuXEN8eWwHfKwZhZLazXrEH+D1xDit3B+itfnaOQf4b2JeqlqEfXLplwti2xCzGt4FfI0Ick9g\n4pS+yvV1QtxRfRcxo+T1ZHO4iNQAmwJPAZOBTYCdgT8TGbhUG/9FBLdhwNnEhPxDm1oiqc10AP9J\n5Ps8OTlmU7XvcnlSfklM35pE/F4nAO9tUpmktpALYKNZfeHL/Yk5qk7C75vc7zc/GdQwoob8EpGQ\nWy3AGQ/plesnmgh8nhjaMJhYcvsKYjqXqpf7/X6BWMXlX4BniOEi9xAT8dUCbK6kU27M1kCiD253\nouk0AtibyDUwrWmlW/vlMm+NIILae4jf891EesGlwH80rXRajXdX06eDCHDbAL8i0gneTaxv9udk\nu6NppUuHXC3ucqJp+hCx6shhxN3Ub+Nd1ZZhTS69riAS0XQRsxlWEDceljazUClyIHACMdD3T8R6\nfOsTeWsdltNC7JNLp02I4QwvAp8hanTb0J1DVX33JrGk0lnEcJFZxE0dA1yLWaf3l2gt9BIwHdgK\n+B9iUOpg4HfNLFTKzCBmkSwhFsk8D9MLtiSbq+myDt3p7XKd418l7vzdQ9Q4VFu55eMHEUmAJNXB\ngILH+V9eg3p4Xmob1uTS4UPE+nAZYhmlP9I9p9KlzCWlwlnAK0S2rVyCaIcISVrr5WrjmxA5U79P\nrC5yHjGroX+TyiW1BIeQrN1yTdIxROf3x4i7qZ3Enb9riKEOkrTWye9PvQX4FnAisIDIEiVJqfBl\nVh8aMpFIMehNJQmbq2nwMPB03uPtidkN5vaUtNbKX8usPzFkZD6RqX0OMfhXEjZp1ka5mQw7EKte\n9Ccm3o8npnE9BNzVtNJJLcbm6tpnRfLzQqImtwlxs2EXYukfA5yktVau5j0OuC7v+LuJuamHN7xE\nklQHJxA5VE8Btm5yWaSW5rSftdMiIjPUjsA+wFBgId5RldZgkFs7vQ08SgS74URwu7epJZKkOskA\n/ZpdCEmSJEmSJEmSJElSohP4B3AHkbT6bmJQcjVOAY4AdgX+vcTrPg4MK/OcY4mE2vk6gWtLvOfI\npCzlOBI4p8zXqoWYd1Xl6gKmAJ9LHq9LLPN0FZFbohoLkq2YE4EHgWeqPH9vg6MrGTztQOu1lBP0\nVa4Ma6Y6fIdYMGAqMZf2NiL4XQZMIxIwT0hefwgwL3nNAcmxTrprWl8hlom6F5hErG48BphMjAP8\nOrH4wJ3JPsBOyeMpwKlFypxzAvAXYo7vzck5M0Sms78As4CPJK+dkJR9avJZrAxIbaATeI5orv6F\nWHL9w8lzdwAHJ/vHAucm+5sQ0806gMeBIcnxa4jm6gQiyG0GPELkpQA4GxiYnHcUscLKDCIodSTX\nHwXcBHwgec9RFG+uZoDv0R30bgH2Tsrwq+TY5kkZs0lZNk2On0kk6D4Cm6trJb+hVInbiTXsevJw\n8nM0sC+wZ/K4g5h69jKwNDk2veC92xPB8K3k8el5z2WIRUBHJNcH2IiYt7sTUfvLnXPvImXrApYT\nAe81Yt29fnnvg+hvfIUIblsANyTH1wf+DCwucm61OJurqpVcEutFRDCZSNTurgeeBQYTtSWAvQre\n+xiwM9HUhWj6Dk/OmSUWAn0gOedEIrfs/UR/3b7Je8aVKNvopCyfJfr5snTX6nJl2ZIIaC8ATxGZ\nzyYStdIppT+6Wpk1OZWri/I6338JXEr0Zw0CfkbUoo4llml/iahN5c7VRQSW84h+vC6i
GbqE6IOb\nTHe/2UxiJeR7iLwW3wCuJO6QPkX3gqKFZV4MvE7U2l4g+v2GJ6/ZJDn3QKLJuxI4KSlrlqiBHgFs\nW+bnlyRJkiRJkiRJkiRJkiRJkiTVwv8BlJjSN2TZPjoAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "print(cm)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[98 12]\n", " [36 33]]\n" ] } ], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can normalize the number of prediction by dividing by the total number of true \"survived\" and \"not survived\" to compute false and true positive rates for survival (in the second column of the confusion matrix)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(cm.astype(np.float64) / cm.sum(axis=1))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 0.89090909 0.17391304]\n", " [ 0.32727273 0.47826087]]\n" ] } ], "prompt_number": 39 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can therefore observe that the fact that the target classes are not balanced in the dataset makes the accuracy score not very informative.\n", "\n", "scikit-learn provides alternative classification metrics to evaluate models performance on imbalanced data such as precision, recall and f1 score:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import classification_report\n", "\n", "print(classification_report(target_test, target_predicted,\n", " target_names=['not survived', 'survived']))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " precision recall f1-score support\n", "\n", "not survived 0.73 0.89 0.80 110\n", " survived 0.73 0.48 0.58 69\n", "\n", " avg / total 0.73 0.73 0.72 179\n", "\n" ] } ], "prompt_number": 40 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to quantify the quality of a binary classifier on imbalanced data is to compute the precision, recall and f1-score of a model (at the default fixed decision threshold of 0.5)." 
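 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an added illustration (a minimal sketch, not part of the original analysis), the precision, recall and f1-score can also be computed individually with the corresponding functions from `sklearn.metrics`, applied to the predictions from above:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Added sketch: the individual metric functions behind classification_report\n", "from sklearn.metrics import precision_score, recall_score, f1_score\n", "\n", "print(precision_score(target_test, target_predicted))\n", "print(recall_score(target_test, target_predicted))\n", "print(f1_score(target_test, target_predicted))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The exact numbers depend on the random train / test split used above."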
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Logistic Regression is a probabilistic models: instead of just predicting a binary outcome (survived or not) given the input features it can also estimates the posterior probability of the outcome given the input features using the `predict_proba` method:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "target_predicted_proba = logreg.predict_proba(features_test)\n", "target_predicted_proba[:5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 43, "text": [ "array([[ 0.75263264, 0.24736736],\n", " [ 0.75824771, 0.24175229],\n", " [ 0.58542437, 0.41457563],\n", " [ 0.25224882, 0.74775118],\n", " [ 0.75817844, 0.24182156]])" ] } ], "prompt_number": 41 }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default the decision threshold is 0.5: if we vary the decision threshold from 0 to 1 we could generate a family of binary classifier models that address all the possible trade offs between false positive and false negative prediction errors.\n", "\n", "We can summarize the performance of a binary classifier for all the possible thresholds by plotting the ROC curve and quantifying the Area under the ROC curve:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import roc_curve\n", "from sklearn.metrics import auc\n", "\n", "def plot_roc_curve(target_test, target_predicted_proba):\n", " fpr, tpr, thresholds = roc_curve(target_test, target_predicted_proba[:, 1])\n", " \n", " roc_auc = auc(fpr, tpr)\n", " # Plot ROC curve\n", " plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % roc_auc)\n", " plt.plot([0, 1], [0, 1], 'k--') # random predictions curve\n", " plt.xlim([0.0, 1.0])\n", " plt.ylim([0.0, 1.0])\n", " plt.xlabel('False Positive Rate or (1 - Specifity)')\n", " plt.ylabel('True Positive Rate or (Sensitivity)')\n", " plt.title('Receiver Operating Characteristic')\n", " plt.legend(loc=\"lower right\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 42 }, { "cell_type": "code", "collapsed": false, "input": [ "plot_roc_curve(target_test, target_predicted_proba)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAYIAAAEVCAYAAADtmeJyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8U1XawPFfgJYdKQjIIqKIrC2UnSJYZBFEFNzecQMc\nRJZBUEFQ1BFfBuR1VNAREZFFZ9BxQVQWFaR0QBRRFCwqKqsjiLgAbYFSaM/7x3PTpmmaJiXJzfJ8\nP598ktzc5clNe557zzn3XFBKKaWUUkoppZRSSimllFJKKaWUUkoppZRSSkW4fOAr4EvgC2AnsAXo\nEKTtfQnUCNK6AUYD24CvgR3Ay8D5QdyeuzuAMdbrUcCUAK67PHAv8BmyH78GZgHx1udLgIkB3J6v\nBgKPlmG5R4HbSpnnr8DVfsyvlCqDfKCW27SJwMc2xHK2ngA+ABpa7x1IwfGTy7RgW0LwCuMXgNeA\n6tb7KsByJNkBLA7itr2ZBvwjSOtOB64L0rqVUpZ8oLbL+wrAM8AKl2kPAluRo9DlQH1r+nnA28C3\nyNHpXdb0c5AC8XNgO/AUcjTrur2PKfoPPst6AIywlv0CWAs0t6YvAd5FjvQfc/sejYAsa9vu5gDP\nWq/3AbORo+ofkDMIp0HAZmu7HwFdrenTkASzHSl061rf+2NgD7AeqAMMAX5HEs9YihaQ+4BHgA3W\n6/9z2e79wPfIPp4D7PXwHS4EsoFqbtPrAYOt14uBZcAmYJcVYxXrsz+7fLd9Lt97OLDR2vY6a/6X\ngU+A75Df4RJrXk+/d2fgZ+AwMN2az5ffbxZFk+ajyP79DHjf2tZfkN90t/UdXefvAnwKZFix9/Kw\nz5RSPnJWDW0DDiD/dHOAc63PhwKvUliQ3wmssl6/RWHhXQP5p2wKLALGWdPLA/8E7nPZXi2kAFrh\nMs9/rWUvA/4DVLY+64cUOiAFwZoSvsd1SJWWJ4Os7wdSyC6wXjdACrA2QDNkPyRYn7UGDiIF4zTg\nG6Cc9dl4l+8Dsj/utV4vdnn9CJJUndt93GW7J4ALgCuQgtVZXfYiklw8fb9PS/h+TkuQArySFevn\nwK1AVSRpOb9bVyDTej0cSV7OBHMd8vs7zXP5DiX93q7f05/fz7mvzgeOAnHW9HsprA5aD1zrNn8c\nknwGWNPbI7+dCoEKdgeggiYV+ANoB7yHFCa/WZ9dBXRCChWQQtv5T94bmGS9zgQS3ZYZYb2vjCQA\nV28gVTn1kPaIH5AkNAq4mKJVUwnWwyBH6iWJK2F6Rbftz7WeDyJHn/2AHORMJ81lvjwrFoMcTTvX\n8QzQAymUmiGJZLPLcg63Z6d3XLZ7GDkzuhJ4ncKCeS6yX93lUZiISmKQI/Yc6/0O5OzlOPKbDLK+\nTzskOTh9hZxtgJxR7EWO9i9G/jacv0VJv7eDwu86EP9/v5+Qs4Evkb+/9yj6O7hyWNs9Y80HcuaR\nVML8KsA0EUS/bcA9yFHpZmA/UvjMAuZb88RTWJV0xm35C5Gjy3LA9UjVAkBNpBBwdRxJBjcD3Sg8\nSi+HnEHcb713IEeMR1yW82QzUijXA35x+6wXRQumPJfX5SksZNcBf3L5rDFSSA1x2+7/IYluIVJg\nVaBooe/+XZ1Ous3jAE5TtIB3T5hOnwEtkSP3bJfpDZHf5nrrvetv4oyjEZLcn0eqgd5EEoOT6/rG\nACORKq2lyO/ZxMO6QX7v36ztOLfl7+/nsJa9DDkg6ItU3a0H7qY4g+wz933cCvl7yyu2hAqo0o5G\nVHT4N1JoOKsHPkAKBmcD5TTgJev1h8Dt1utzkIL0YmuZe5F/8nikXWGsh20tsJbvhhyJglQd3ITU\nEWNt21md4H6E7eoAcqT+KlL14nQ7UrXgWic/1HpujBQ8q5ECvR+F9dn9kcRYycN2+yH7Zynwq7UO\nZ9XZGQp78ZQWs0Gqla6jsGpoBJ4TyQFre4so/C1qAM8hhXFOCdtyIAXsYWAGUmc/yPrM0/90P6QK\nZzHSbnE1hQeBnn7vZhT9zmX5/ZKQs5edyEHHHAqP8F3X7Vz+O2Qf9bHet0cSh7d9rQJEzwiik6dC\nZxxSXdAXOTtoiBxxG+QsYZjLfPOQ0/pywEzkNH088LS1jjik8HHWj7tu7wvk6G4ZkGtNW4MU2muR\no+NjyBG5c9mSjrYBpiKNou8gBXhFpF69K9IG4dQYaWCsDExAqqVA2j/+TeGR+iCkLt99u/+LVGtN\nRQrYN5EECFJd4WyYLi1ekAJsAZJ8TyD16SdKmHcs8DBydnPG+n7LkTr6krZnkH36Z6QAPYzsn58p\nrPZyXeYJpHfSUORs4G0K6+JL+r3jkfaDU8j+9Of3M8jfyetI9WO29f3HW5+vsGKKd1k2F0nuc4C/\nW9sdQvEzFqWU8mgv0tMlXHSgsLcVyJnUqzbFolSpgl011AU5OnI3COkN8jFysY5S0eR7pOE5Azky\n7kVhryOlYspk5J/A/SKmOOS0/Rzr9RakF4RSSikbBPOMYBdS5+fe2NPS+uwYUmf7EdAziHEopZTy\nIpiJ4C08N/TUQJKAU0lXjiqllAoBO3oNHaOwqxzW6yPuMzVt2tTs3r07ZEEppVSU2E1hjzef2HEd\nwU6kn3IC0n2sJ9LNrojdu3djjNGHMTzyyCO2xxAuD90Xui90XxQ+MjMzGT16NI0aNWLVqlUsXmxA\nhgjxSyjOCJz9hG9CrqBcgPSg+ABJRAuR/s9KKRUzFi2C998v+/LHjmXw0UdXU7duLzp0yGDJkprs\n9TS0oQ+CnQj2ASnWa9d+1Cuth1JKxaSVK+HCC6FLl7Itf+JEYzp0eJ527a4oMv1//sf/demVxREg\nNTXV7hDChu6LQrovCkXqvujeHa69tvT5PDsHGei2qLIkgnAex8MYU9qV/EopFR5WroQ9ngYbL8GL\nL8K0aWeTCDxzOBzgZ9muiUAppQKgWTPo1g1q1vRtfocDJk6Exo29z5eWlsacOXNYvnw55cuX9z4z\nZUsEWjWklFIB8vDDkhACISsri8mTJ7Ny5Urmz5/vUxIoK00ESinlwfffQ3Z26fM55eSUPo+v0tLS\nGDFiBL169SIjI4Oavp5mlJEmAqWUcpObC61aQZIf90hr2BBq1y59vtJs3ryZYcOGMX/+fK688sqz\nX6EPtI1AKaXc5ORIXX8gj/J9ZYwhOzub6tWrlz6zB2VpI9A7lCmlVBhxOBxlTgJlpYlAKaVscujQ\nIbtDADQRKKVUyGVlZTFmzBhSU1M5c8b+u3FqIlBKqRBKS0sjKSmJU6dOsXnzZipUsL/Pjv0RKKVU\nEHzyCfzv/0Jenv/L5uVBuQAfJrtfFxCqHkG+0ESglIo6X34J11wDM2eWfuVuSWrVCmxMBw4cIC8v\nLyTXBfhLu48qpaLKd99Bair84x9w/fV2RxN6OsSEUiom/PgjLFlSfLoxMs7/zJmxmQTKShuLlVIR\nZ906WLYMzpwp+sjLg6eegttvty+2rKwsFixYYF8AZaBn
BEqpiNS+vTQGhxPXMYKGDx9OXFyc3SH5\nRBOBUlEkLw82bZKj42j27bd2R1BUOPcI8oUmAqWiyPbtMHAgdOpkdyTBd/PNdkcgdu7cyYABA0I2\nUmgwaCJQKork5UGLFpCWZnckseOCCy5gwYIF9OnTx+5Qykwbi5VS6ixUrlw5opMAaCJQSqmYp4lA\nKaV8kJaWRu/evTl16pTdoQScthEopZQX7j2CKlasaHdIAaeJQKkItGMHjBwJ+flFp2dnQ40a9sQU\njUJ972C7aCJQKgLt2yejY86ZU/yz+vVDHk5UysjICPm9g+2iiUCpCJWQAF262B1F9EpMTOT777+n\ncuXKdocSdJoIlIoQ+/fLiJrGwO7ddkcTG2IhCYD2GlIqYmzeDGvXQoMG0KMH3Hef3RFFjx9//NHu\nEGylZwRKRZAWLWDiRLujiB7OHkFr1qzh66+/plKlSnaHZAs9I1BKxSTXewdv3bo1ZpMA6BmBUirG\nRPpIocGgiUApFVOOHj2Kw+GI6usC/KWJQKkwkJ8Pv//ufZ7MzNDEEu3OP/98nnvuObvDCCu+JIJa\nQHegNvALsBHIDmZQSsWaZ5+FKVOgWjXv8/35z6GJR8UWb3e6rws8BrQEdgI/AwlAMpABPIwkhmAx\nxpggrl6p8DFrFhw9Ks8qMLKysli0aBHjx4/H4fBW1EUX67v69YW99Rr6K/B/QArwZ+BBYCzQDZgD\nPFLKep8HPgbWA03dPh8CfAZsAUb7E7BSSpXG2SNo+/btUTlaaKB5qxoaZz2fBxxy++wbJCmUZDAQ\njySRLsCT1jSnp5Azi+PWul4FjvkctVJhbuZMeOkl3+f//XcY6+0/SvlEewSVjS9tBMuAX4EXgdVA\nvvfZAWlTeN96/SnQ0e3z00BNa10OQOuAVFT5+mu44w64+mrfl2ncOHjxxII9e/bQu3fvqB8pNBh8\nSQTdgdbAcOAhYB2wENjjZZkagGsfhzykusiZRJ4EtiJnBMvc5lUqKjRoAM2b2x1F7GjcuDGLFy8m\nNTXV7lAijq/dRw8gBX9HoA0wG2lAnlLC/JlAdZf3rkmgMVLtdAFwAvgXcD3wpvtKpk2bVvA6NTVV\nf2AVdpYsgY0bi0/fvBm0ViK0KlSoEJNlRHp6Ounp6We1Dl9all8HEpECezFw0Jr+OcWrfJyuBQYB\ntwNdkR5GA63PLrHW2QmpIpoD7ECqnlxpryEV9vr2hcREaN266HSHA665BmrXtieuaGeMiameQP4o\nS68hX84IVgA3urxvDnwH9PCyzHKgL7DJen87cBNQDVgAvIT0KMoBdgFL/AlaqXAyYIAkBBUaaWlp\nPPjgg3z44YdUrVrV7nCigrdEkAg0ACZSeL1AeWAW0BY46WVZA4xxm/a9y+vZ1kOpiLBhg+d7ABw4\nEPpYYpV7jyBNAoHjLREkIEfx9axnkHr+ucEOSqlwM24cXHgh1KpVdHq3btCqlT0xxZJYuXewXXyp\nR2oPfBHsQDzQNgIVNpKS4F//kmcVWrt37+byyy9n3rx5el2ADwLdRjAX+AvwHEX7+RvkQjGlosKh\nQ5CV5X0evTjVPk2bNuX777+nYsWKdocStbxljXpI20AzINdt3n1BjMlJzwhUSNSuDTVrQjkvA67E\nxcltIhs2DF1cSpVFWc4IfJn5W6Tn0IsUbfANNk0EKiSqV4eDB+VZ2WvXrl1cfPHFdocR0QI96JxT\nO2SAuKeAD4Fb/Y5MKaW8yMrKYsyYMfTu3Zus0urpVMD5ch3BKeANZBjqe5BhJv4VzKCU8mTHDujX\nD/LyArveU6eggt6iyTauPYK2b99OdT01Czlf/vz/ilxQ9iXwNLAhqBEpVYJff5UunG+9Fdj1VqwI\nlSsHdp2qdDpSaPjwJREcAS4FjgY5FqVKFR8P9erZHYUKhFOnThEfH6/XBYQBbw0KI5HhIB5zm26A\nqUGLyGU72lgcfdLT4R//KNuyhw9L7520tICGpFRUCfR1BD9azzud67eetXRWZfbJJ3Kj9lvL2OXg\nkksCG49Synsi+MB67oxcWOb0T2TQOKXKpEULuO46u6NQoZKVlcWzzz7LfffdRwVtlQ9L3rqPjkN6\nCt1hPf+M3LJSL6lRSvnEee/gH374Qe8dHMa8pednrcdUYGZowlFKRQPtERRZvCWCQcgVxb8Dd1rT\nnPcXfiHIcSmlItRPP/1Ejx49dKTQCOItETgH3K2PNhCrEvz2G+zb5/v8P/0ENWoELRwVBho0aMDS\npUtJSdGxKSOFL12MHMjN6POBIcBK4I9gBmXR7qMRYOhQuWevP7dknDgRbrqp9PmUUv4L1q0q/40U\n/inWyodYD6U4cwZmzICbb7Y7EmUHvXdwdPBl0LkGSJfRlsBoQAcCUUqRlpZGcnIyR44csTsUdZZ8\nOSOIA64FvgbqoIkgZgwZAu+9532e06f1bCDWuPcISkhIsDskdZZ8SQSPA38C7gXuAqYHNSIVNg4d\ngvffh65dvc9XqVJo4lH203sHRydfEsFb1gNkJFIVQypW1IJeiYMHDzJy5Ejmzp2r1wVEGV8SwVRg\nMnDSem+QdgMVJSZMgK+/Lj79m2+gfPnQx6PCU4MGDfjuu+90mIgo5Etz/1dAV+BEkGNxp91HQ+Si\ni+DRR6F+/aLTK1SASy/Vm7YoFUmC1X10D5BTloBU5OjeXRKCUgDffPMNrVq1sjsMFSK+JIKKQIb1\nMNZD+4nY4ORJeOop6akTSNr7Tzk5ewStWrWKL7/8ktr+XCmoIpYv1xH8HzAWmAc8D8wPakSqRLt3\nw9NPB369kydDo0aBX6+KLOvWrSMpKYnc3Fy++uorTQIxxJczgi+QxuIGyCB0GUGNSHlVty5Mm2Z3\nFCqaZGdnc99997Fy5UpeeOEFBgwYYHdIKsR8SQSLgNVAKjIS6ULgsiDGpFycOQMffSTPe/faHY2K\nRsYYqlWrptcFxDBfWpbXA71cnjcCPYIZlEV7DQGffQa9e0PnzvK+fXt4/HF7Y1JKha9g9RoyQAvr\ndSPgjH9hqbORnw8tW8KHH9odiVIqWvnSWDwBWAK0B5YBE4MZkFIqOLKysnjkkUfIydHe4KooXxJB\nBnJBWROgL9J4rEJk/XrQalt1ttatW0diYiI//fQTpwPd/1hFPG+JoD2wDYhHRh/9DvgMuDoEcSlg\n0SKYNw8WLLA7EhWpsrKyGDNmDMOHD2fevHksXLiQ6tV1AGFVlLdE8AQwDMgFZgADgE7A/SGIK+a9\n8QY89BCsXQuNG9sdjYpEhw8fJjExkdzcXDIyMrRbqCqRt8bicsB2oCFQBdhqTc8PdlCxKCMDRoyQ\nxmGAH3+ENWvgkkvsjUtFrjp16vDmm2/SsWNHu0NRYc5bInBWJF4BOPusxAHVghpRjNq3T4Z8nj1b\n3jdsWHwQOKX84XA4NAkon3hLBOuATUBjpF3gImAu8LoP6y0HPAckAaeAO4DdLp93Ap5E+roeAIYi\nVVAxrWZN0P9
bVRb5+fmUK+dL3w+livOWCGYB7wLHkMK6KfACsNyH9Q5GGplTgC5IoT/Y+sxhrec6\nZGTTkcCFSGN0TNmxAxYuBGNgzx67o1GRat26ddx1112sX7+eevXq2R2OikDeEsEQihb6uyl6VH8t\nhXcuc9cdeN96/Sngepx7CTJUxb1AG2AVMZgEvvsO+vWTdoHataFJEz0bUP5xv3ewJgFVVt4SQRWk\nMP8AuTnNL0ACcoTfH3jZy7I1gEyX93lIdVE+cC5ypvAXJLGsBD5HhrCICfv3SxKYOROGD7c7GhWJ\n9N7BKpC8JYKlwNvALcCfkQL8MJCOVPNke1k2E3DtrOxMAiBnA7soPAt4HzljKJYIprkMs5mamkpq\naqqXTUaGw4ehTx+YOFGTgCqb3377jTFjxui9gxUA6enppKenn9U6/BqYyA/XAoOA25Grkh8GBlqf\nxQM7kauUdyPDVrwIvOe2jqgcdG7hQli9GpYtszsSFcny8vIorzeUVh4Ea9C5sliOFPSbrPe3Azch\nXU8XACOAV5BgN1E8CUQtYyAhwe4oVKTTJKACKViJwABj3KZ97/J6PdLWEFXOnIHff/c+z7FjoYlF\nRYdt27bRtm1b51GeUkHhSyJ4FTmaV6WYNQseewyqlXLJ3V13hSYeFbmys7OZPHkyK1asYMuWLdTX\nqwtVEPmSCOKBtkjjrrPBN+Yv/vLk5EmYOhUefNDuSFQk0x5BKtR8SQTNkd5DTga5ylgpFUAnTpxg\n0qRJrFixgvnz52uPIBUyviSCNtZzXaTrZ17wwlEqdpUvX55atWrpWYAKOV8GJ+mFDAWxBunu2S+o\nESkVoypWrMjf/vY3TQIq5Hw5I/gbcClwEBmSejmSFJRSSkUBX84IziBJAGTwuZPBC0ep6Jednc3U\nqVPJzMwsfWalQsCXRJAF3IX0HLoL+COoESkVxdLS0khMTOTQoUN2h6JUAV+qhm4FHkJuV/ktMu6Q\nUsoPrtcFaI8gFW58SQRHgUnBDiQaROHQSCoAjh07RnJyMqmpqdojSIWlcL5uPaIGnfv2W7j8cvjX\nv6B3b7ujUeEmIyODxMREu8NQMaAsg85pIgiAffugRw/4299g2DC7o1FKxbJgjT5aA5gMNABWABnI\n/QQUcPCg3F9gyhRNAkqHh1aRyZdeQ4uAvRTeYnJRUCOKMLNnQ//+MG6c3ZEou6WlpdGyZUv2799v\ndyhK+cWXRFAbWAicBjYQ3tVJIXf6NFx8sd1RKDtlZ2czduxYhg0bxpw5c7jgggvsDkkpv/iSCAzQ\nwnrdCLnATClF4XUBOTk5ZGRkaLdQFZF8SQQTgCVAe+S2khODGVAkmD4d6tSRx/z5UKWK3REpO2Rl\nZXH33Xczd+5cFi1apN1CVcTypZrnKmCly/sbgdeDE04RYdtr6I47oHVruPVWeX/uuaA3kIpNxhi9\ne5gKK4HuNXQV0B24GUixVlwOuIbQJIKwVr26nBGo2KZJQEUDb1VD25G7kp20nr8DdgB/CkFcSoWV\nLVu2kJ+fX/qMSkUgb2cE/0XaBl6m8BaVAHrzVBUzXMcI2rhxI02aNLE7JKUCzpfG4keBX4FMpMfQ\n8qBGpFSYcO8RpElARStfriy+GjgfeMp63B/UiMLUJ5/Axx/L64wM6NrV3nhU8Jw8eZKJEyfqSKEq\nZviSCH4GcpChJnYBMXm1zDPPQGYmtGgBl14KPXvaHZEKlvj4eBo2bKgjhaqY4UuXhxeBT4BOyJDU\n/YF2wQzKElbdR2+6Ca6+Wp6VUipcBWvQuTuRqqE3gOFId9KYcOgQOIeN+e03e2NRSqlg8dZYHAdc\nB1wG7Ecai18HpgU/rPAwdizcdhuMHw9ZWTqmULTJzs5m0qRJHD582O5QlLKVtzOCpchAc/WB1sA+\npJromeCHFR7OnIEnnpAqIRVd0tLSGDFiBL169aJixYp2h6OUrbwlgouAjkA8sBXIBXoh9y1WKiLp\nvYOVKs5b1VCm9ZxrzdeXKEoCffpAfLz3x8qVUKOG3ZGqQDlx4gTt2rXTkUKVcuPtjMC11fkw8EeQ\nYwmpX36BzZuhTZuS53E4IC4udDGp4KpSpQqrVq2iefPmdoeiVFjxlghaA68gCaEV8Ko13RAlPYfi\n4uTIX8UOTQJKFectEdyIFPoOYL7L9PDp3O+DkSNh9+7i0/fsAb21bPQ6ffo0cXo6p5RPwnkM3YBc\nUFa3LsybBwkJRafHxUH37lDOl9GWVERJS0tj5MiRrFq1ihYtWpS+gFJRJFgXlEW8Hj0kIajo5t4j\nSJOAUr6JykTw3HNyVTBAdra9sajQcL0uQMcIUso/vpw+NAJmAXWB15Cb03wazKAsZa4aqloV7rlH\nGoKrVJHX2h4QvXJycrjssst45JFHtEuoinllqRryZebVwJPAw8A4YCHQpZRlygHPAUnAKeAOwEOT\nLS8AvwMPePjsrBLB4cPyrGKD3jtYKVGWROBLU2llYB3SW2gHcuvK0gxGrkhOQe5f8KSHeUYBbYiw\nXkgqPGkSUKrsfEkEJ5Ghp8sD3ZB7E5SmO/C+9fpTZKgKVylAZ6Rbqv4HK59t2rSJ06dP2x2GUlHF\nl0QwCrgdOBeYBIzxYZkaFA5RAZDnsq36wF+RaiZNAson2dnZjB07lj/96U/s2bPH7nCUiiq+9Bq6\nDin8/RliIhOo7vK+HJBvvb4eSSqrgfOAKsgYRi+7r2TatGkFr1NTU0lNTfUjBBUttEeQUiVLT08n\nPT39rNbhyxH5JOAmYCewAPBli9cCg5Azia5IQ/NAD/MNA1qgjcXKg9zcXO6++24dKVQpPwTrgrIn\nrEcn4D6kp88lpSyzHBmtdJP1/nYkmVRDkokrbSxWHsXFxXHxxRfrWYBSQeZL1qiMVOcMteZfSOEA\ndMGkZwRKKeWnYJ0RfAUsQ9oJdvkfllJKqXBW2j2LAZKBR4AfkWsDdOBmFVDZ2dncfffd7Nu3z+5Q\nlIpJ3hKBsxdPBtJQ/J312BnsoFTsSEtLIzExkczMTG0HUMom3qqGbrKebwQ+c5meGrRoVMzQewcr\nFT68JYIeyJ3J7gGesqaVRy4Eax3kuFQUy83NpWPHjqSkpGiPIKXCgLdEcBS5CriS9exArhC+LwRx\nqSgWHx/P+++/T5MmTewORSmFb12MGgAHgx2IB9p9VCml/BTo0UeXWc9fAD+7POxICipC5eT4Mkah\nUspO3hLBddbzeUjVkPPRINhBqeiQlpZGy5Yt2bp1q92hKKW88GX00b7AAGSsoD3ALUGNSEU850ih\nw4YNY+7cuXTo0MHukJRSXviSCGYA3wPjkfsMjA5qRCqiOa8LyMnJISMjQ7uFKhUBfBli4gRwGDiN\ntBHke59dxarTp08zffp05s6dqwlAqQjiS8vyu0Bt5G5i1ZELym4IYkxO2mtIKaX8FKyb11cCLgK+\nQe4x/ANyQ/pg8ysRdOoEf1i3zvnxR8jKgkqVghSZUkqFqWAlgvORK4tb
I2MN3QPs8zO2svArETgc\nsMsaG7VKFahfP0hRKQA2bNhAx44dqVKlit2hKKVcBPo6AqcFwD+RhuKXkPsRhKWmTeWhSSB4nD2C\nbrnlFnbv3m13OEqpAPAlEVRC2gmOAG9TODy1ijHuPYISExPtDkkpFQC+9BoqDyQhN6hJRG8tGXPO\nnDnD+PHjdaRQpaKUL4lgPLAIuar4IDAyqBGpsFOhQgXatm3LzJkzdaRQpaJQaQ0KNYAzyLUEoea1\nsdgYeO016R0EcOedMk0ppWJZoBuLxwHbkSqh/mUPKzhOnIBbb4UtW+QxZYrdESmlVGTyljU+AS5D\nzgr+ReiTgdczguPHoW5deVaBkZ2dzQMPPMDo0aNp3VrvPaRUJAr0GcFJIBf4De0pFPWcPYKOHz9O\nw4YN7Q5HKRVC3hqLXTOKL91MVQTSewcrpbwlgtbAK0hCaAW8ak03wM1BjkuFQF5eHikpKXTs2FHv\nHaxUDPNWj5SKFPru8xjgP8EKyHU72kYQfAcOHNCqIKWiSLDGGrKLJgKllPJTsMYaUlHg+PHjlHVY\nb6VUdNNEEAPS0tJo06YNGzdutDsUpVQY8mWIiUbALKAu8BqwA/g0mEGpwHDvEdSzZ0+7Q1JKhSFf\nzgheABaaovSbAAAW5klEQVQD8UgCeCaoEamA0HsHK6V85UsiqAysQ3oL7UAuNFNhLD8/n9mzZzN3\n7lwWLVqk3UKVUl75UjV0EhleojzQDcgJakTqrJUrV44VK1bYHYZSKkL4ckYwCrgdOBeYBIwJakRK\nKaVCSq8jiHDr16+nXbt2JCQk2B2KUioMBOs6gkPAz9ZzLrDT78hUwDnvHTx06FD27t1rdzhKqQjm\nSyI4D7k72XlAM2R4al/W+zzwMbAeaOr2+U3AZuAjYB7hfWYSdtx7BLVv397ukJRSEcyXxmJX+4GW\nPsw3GOlumgJ0AZ60poH0QpoOtEEanl8BrgK0dbMUxhjGjRvHu+++qyOFKqUCxpdE8KrL6/pIFVFp\nugPvW68/BTq6fJZD0d5HFdAuqT5xOBx069aNGTNmaJdQpVTA+JIIXgOOINU3J4HPfVimBpDp8j4P\nqS7KR65H+NWafhdQFfjQl2AzM6FhQ7lNpTHQoIEvS0WXW2+91e4QlFJRxpdEcB9yhO+PTKC6y3tn\nEnB9/zhwMXBdSSuZNm1awevU1FRatUqlUiU4elSmObRlQSkV49LT00lPTz+rdfhSlL6LXFn8HXI0\nb4A1pSxzLTAIuf6gK/AwMNDl8wVI1dB4a32eFOs+evgwtGkjz9EsOzubKVOmMHToULp06WJ3OEqp\nCBKs7qN/AO2A/wH+hPT4Kc1ypKDfhDQU32MtNxJIBv6MNBanIb2KBnteTexx9gg6efIkzZs3tzsc\npVQM8JY1XgduDFUgHsTUGYHeO1gpFQhlOSPw1kZQ56yiUT4zxnD55ZfTpk0bvXewUirkvGWN/cBS\nD/MYYGrQInLZTiydERw+fJi6devaHYZSKsIF+ozgBNJArEJAk4BSyi7eEsEh4KVQBRIrsrOzqVy5\nMuXLl7c7FKWUArz3GtoasihihLNH0Nq1a+0ORSmlCng7I5gUsiiiXFZWFpMnT2blypXMnz+f/v37\n2x2SUkoV8OU6AnUW0tLSSEpKIjc3V+8drJQKS/6OPqr8YIzhxRdf5LnnnmPAgAF2h6OUUh6F82g9\nMdV9VCmlAiFYQ0wopZSKYpoIAiQtLY2ff/7Z7jCUUspvmgjOUlZWFmPGjGHYsGH897//tTscpZTy\nmyaCs+DeI6hz5852h6SUUn7TXkNlYIxh/PjxvP3227zwwgvaI0gpFdH0jKAMHA4HvXr1IiMjQ5OA\nUiriafdRFRNq1arFkSNH7A5DqYBJSEjgjz/+KDY90KOPKhU1jhw5gvuBhVKRzBHAm7Zr1ZAXzh5B\nH374od2hKKVU0GgiKIFrj6COHTvaHY5SSgWNVg25cR0pVHsEKaVigSYCNwMHDqRZs2Z672ClVMzQ\nqiE377zzDgsXLtQkoEKmXLlyJCUlkZycTPv27WnRogWdO3dm69bCe0MdP36cSZMm0aJFC5KSkmjb\nti0PPfQQOTk5Rdb10ksvkZKSQnJyMq1bt2bUqFEcO3Ys1F/JL/fffz9r1qyxOwyvZs6cScuWLWnW\nrBmPPvqox3kmTJhAcnJywaNOnTq0bdsWgIyMDKpVq1bk8x9++KHgs9TUVNq3b0+nTp344osvAHj3\n3XeZPn16aL5gGDPufvnFmDp1ik1WqlSe/p7ChcPhML///nuRaU888YTp1q2bMcaY06dPm65du5qJ\nEyeakydPGmOMOXHihJkwYYLp2bOnOXPmjDHGmBkzZpgePXqYw4cPFyz3l7/8xfTo0SOE38Y/n3zy\nibn66qvtDsOrVatWmeTkZHPixAmTk5NjLrvsMvP66697XWbv3r3m/PPPN9u3bzfGGPP888+bO++8\ns9h8x48fN+edd5557733jDHGvPPOO+aSSy4p+Lxv375m27ZtHrdR0t80EFXd44p9wUAmgszMTJOT\nkxOYlamw5+nvKVw4HA7z22+/Fbw/ffq0ueuuu8xVV11ljDHmlVdeMV27dvW4bLt27cwbb7xhsrOz\nTbVq1cyuXbuKfH7ixAnzyiuvmNzc3GLLrlixwrRr184kJSWZbt26me3bt5u9e/eaatWqFczj+n7x\n4sXm0ksvNe3btze9evUyKSkp5s033yyYd8qUKWbKlCnGGGNefPFF06FDB5OcnGz69Oljdu7c6TH+\nK664wqxatcoYY0xeXp4ZP3686dKli2nVqpVp2bKl2bRpkzHGmGHDhplBgwaZ1q1bm/vvv9/k5uaa\nu+++27Rv3960bdvWDB8+3GRmZhZ8r5SUFNOxY0fTuHFj8/DDD3vcdkpKimnXrl2Rx7hx44rNd8cd\nd5gnnnii4P2SJUtKTV59+vQxc+bMKXg/dOhQ06NHD9O5c2fTuXNn89ZbbxljjFm+fHmRRJ2fn2++\n+uqrgvf//ve/zZAhQzxuo6S/aTQR+GbdunWmSZMmZtmyZWe/MhURPP09Ff08MI+ycDgcJjEx0bRt\n29Y0aNDAXHTRRWbChAnm119/NcYYM27cODN58mSPy06cONFMmDDBfP7556Zu3bo+b/PQoUOmZs2a\nBUesb731lrnyyivNvn37vCaCWrVqmaysrIL3zmR15swZ06hRI7Nr1y6Tnp5uevbsaU6cOGGMMeaD\nDz4wrVq1KhbDkSNHTNWqVc3p06eNMXJ2cOONNxZ8/thjj5lBgwYZYyQR9O3bt+CzRx991Nx3330F\n7x944AEzduxYY4wxvXr1KkiIBw4cMBUqVCh2xuWP/v37m9dee63g/dq1a0379u1LnH/16tWmRYsW\nJj8/v2Da2LFjzfPPP2+MMebbb7819erVM1u3bjWPP/64uf76682IESNMx44dTZ8+fcwXX3xRsFxm\nZqapUqWKx4PWkv6mKUMiiKn
GYu0RpEpi97Vm6enp1KpVi23btjFgwAC6devGueeeC8iFQ7m5uR6X\ny8nJIT4+nvLly5Ofn+/z9jZt2kSbNm1ISkoCYMiQIQwZMoR9+/Z5XS4pKYlq1aoBcMMNNzBp0iR+\n+eUXtm7dSrNmzWjatCnz589n165dpKSkFCx35MgRjh49WqTtbdeuXdSvX58KFaQY6tq1K9OnT2fe\nvHns2bOH9PR0atSoUbAPLr300oJlV65cybFjx1i7di0Aubm51KtXD4AVK1awYsUKli5dyrfffosx\nhuPHj1OrVq0i3yUlJYWTJ08Wmda9e3eeffbZItM87dfy5cuXuI9mz57NAw88UOSCr7lz5xa8btGi\nBTfeeCPvvvsu8fHxrF69mvT0dDp16sS7777LlVdeyf79+4mPj6d69erUqFGD/fv3c8kll5S4zbMV\nM43F7iOFahJQ4ahdu3bMnj2bO+64g/379wNSOG3YsKHYldH5+fls2LCBlJQUWrVqxenTp9m9e3eR\neXJycrjyyis5dOhQkelxcXHFrkzdsWMHDoejyHbcE5AzCQBUrVqVG264gVdeeYUlS5YwcuTIgrhu\nu+02vvzyS7788ku++OILNm/eXKwDRrly5cjLyyt4v2rVKgYOHEi5cuUYPHgwo0ePLlIIV61atch3\nf+aZZwq28emnn/L6669z/Phx2rVrx7Zt2+jQoQN///vfiYuL83hV+ccff1ywvPPhngQAGjduzMGD\nBwveHzhwgEaNGhWbD+DXX39ly5Yt3HDDDUVinTFjBtnZ2UWmxcfH06BBA1q0aEGnTp0AuPrqq8nL\ny2Pv3r0F8+bl5XlNPNGu2CnP2VQNjRo1yqxevbpsC6uI5+nvKVx4aizu27evGTx4sDFGql169Ohh\nxo8fX6SxeOzYsaZ79+4FjcUzZ840PXv2NL/88osxxpicnBwzatQok5qaWmybhw4dMueee675+uuv\njTFSV52UlGSOHTtm4uLizDfffGOMMWbWrFlFqoacVUFOW7duNYmJiaZRo0bm1KlTxhipCmrcuLH5\n+eefjTHGzJ8/3zRv3rxYDM6qIedyd999t7nnnnuMMcacPHnSDBw4sKD+fNiwYUXq6adOnWoGDhxo\nTp06ZfLy8swtt9xiRo4cabZt22bq1atX0Cbyz3/+0zgcDrNnz57SfoYSrVixwnTs2NEcP37c5OTk\nmF69epmXX37Z47zLly83ffr0KTa9S5cu5sknnzTGGLNv3z5Tv359s3PnTnPo0CFTq1Yts3XrVmOM\nMf/5z39MvXr1CvbJ0aNHTbVq1Ty28ZT0N41WDZXs+eeftzsEpTzyNGbMs88+S1JSEmvXrqVv376s\nWbOG6dOn06FDh4Ij6WuuuYa1a9cWHC0+8MADVK1alSuuuAKQs4FevXrxzjvvFFt/vXr1WLp0KcOG\nDePMmTOcc845vPbaa9SoUYPHH3+cAQMGULduXW644YaC+BwOR7FY27dvT1xcHNdddx3x8fEA9OvX\njylTptC3b1/KlSvHOeecw/Lly4vFULNmTXr06EFaWhr9+/dn9OjR3HzzzSQnJ5OQkMA111zDk08+\niTGm2LYffvhhJk2aRHJyMvn5+SQnJ/PUU09RtWpVrrrqKlq2bEn9+vXp3r07HTt2ZNeuXVx44YVl\n+n2uuuqqgvuN5ObmMnjwYG677TYA5s+fz+eff86CBQsAStzO0qVLGTVqFEuWLCEvL4+nn36a5s2b\nA/D2228zduxYjh8/TqVKlXjrrbcK9uWaNWsYNGgQcXFxZYrdVzr6qIoJ7lUeKjx88sknzJgxg5Ur\nV9odSljq3bs3Tz/9NG3atCn2WUl/01E5+mheHrz8Mpw6BZmZpc+flpbGBRdcQNOmTYMfnFLqrHTr\n1o3mzZvzwQcfFJzJKPH222/Ts2dPj0kg0ML+jGDPHmjbFm65RSZecAE88EDxmV17BL322mtFeiwo\npWcEKtrE1BkBQJ064K2KPy0tjREjRnD55ZfrGEFKKeWniEgE3tx777288cYbel2AUkqVUcRfR9Cv\nXz+9LkAppc5CxJ8R9O/f3+4QVARISEgI6K39lLJbQkJCwNYV8YlAKV94usm3UkoEq2qoHPA88DGw\nHnDvyzkI2GJ9fkdJK0lIgHbtoFIluXfwsmXLghRueEtPT7c7hLCh+6KQ7otCui/OTrASwWAgHkgB\n7geedPksDngK6AtcBtwJ1PW0kj174J//TOPECRkjqHfv3kEKN7zpH3kh3ReFdF8U0n1xdoJVNdQd\neN96/Sngevf3lsAuwHnbpI+AnsCb7iuZOnWMjhSqlFJBFqxEUANwvQ44Dzn7yLc+c713XhZwjqeV\nnDp1Sq8LUEqpCPUkcIPL+/+6vE4EVrm8fwq41sM6diGj6OlDH/rQhz58f+wiTFwLLLZed6VowR8H\nfA8kIO0InwP1QxqdUkqpoHMA84BN1uMS4CZgpPX5VUivoc+BMXYEqJRSSimllAoTAbnmIEqUti9u\nAjYjPa3mEd6jx56t0vaF0wvAY6EKyial7YtOwAZgI/BvpMo1WpW2L4YAnyFlxujQhmaLLsh+cBdx\n5ea1wCLrdRfgbZfP4oAfkF5FccgX83jNQZTwti8qI41Alaz3ryA/drTyti+cRiF/6DNDFZRNvO0L\nB/AlcJH1fiTQPHShhVxpfxd7gZoULTui1WTgK+R/wJXf5WY4DDrn6zUHpym85iBaedsXOUA36xmk\n6+/J0IUWct72BcjFip2B+UT3mRF43xeXAL8D9wLpSCH4XSiDC7HS/i5OI/ugMvJ3YUIXWsjtQhKj\n+9+/3+VmOCSCkq45cH7m0zUHUcLbvjDAr9bru4CqwIehCy3kvO2L+sBfgXFEfxIA7/viXCQp/gPo\nA/QGeoU0utDyti9Auq5vBXYAK9zmjTZvAWc8TPe73AyHRJAJVHd577zwDOTLuH5WHTgSorjs4G1f\nON8/gfyzXxfCuOzgbV9cjxSAq4EpwM3A0JBGF1re9sXvyNHfd0ih8D7Fj5Kjibd90Rg5OLgAaALU\nQ/5WYo3f5WY4JIJNwJXW665InZfTTqAZhdcc9AQ+CWl0oeVtX4BUg1REGsRyiG7e9sU/kMKuFzAL\naS95OaTRhZa3fbEHqEZho2kP5Gg4WnnbF5WQM4RTSHI4jFQTxZqILDf1moNC3vZFMvJHvt7lMdie\nMEOitL8Lp2FEf2NxafuiF1JfvgWYbUeAIVTavrgH6TW0EbmoNdqH2m9CYWNxrJabSimllFJKKaWU\nUkoppZRSSimllFJKKaWUUkop3zRBrtJ0vS7hYS/zLwGuOIvt7QP+A6QhY+IsQy6E8scUZKTNisAI\na9owzm7wPWdc65ERPL8AOpSyzLiz2F5paiOjazpVQfrL+zOQXDnkzn9rkO+2GrgwQPG9ivTNvwi5\ncGkJcu3C+cgFTDd5WbY/8OcAxaGUCoAm+He14WKg31lsby9Fh0eehYyZVBZNCNyVku5x
9UPGqPHm\n5wBt25N5QBvrdUfkgqCDyIVTvroSKbCdrsHz6K1nYygy3ImrVLfterKaosMfqDAQDkNMqPBSDngR\nGbNmOzDd5TMHUiBtQo7qNwCNrM8eQ67m/JiSx3dxuDzXRAbDqgD8y1rnZuBGa56x1vuPgaetaUuQ\ns5IHgVbIGcwjyHDUT1I43tB5SAHqb1wgSeYP6/X1yBnMRuu71ra2XQt41op9IXLUvRG4zMO6J1I4\nLvwsa9o05Gh9E9DCZd4aSOHvHCIiHrl63N/RRA9b67kRGZPpHQq/+8dIUv8IWI6M0hlXwvdwXp36\nGYWjvO5Djv6nIvclH42cTTVH9s3lyNWtPyBnCCBXtt5nvV4NDPfz+yilgqQJMjiVa9VQA2TALme1\nSyUKRzxdjBTCY5FCtwIytEFrYACFR4KVkPHx3Uc73Eth1dA6ZFiI8kg1y5PWPNWQ+1nXRgogZxXN\naGte51nJBRSeETgTQUtrvSCFzlg/4/oU+C+wAKhjffYAUlCCVNfcbL12nhGMobBwr03xMX4SkWRW\n3nq/DBhoxexpKIh+SFJ0tx7/zghAjs5fAQ4hSdE5DPFuKy6QI/p78Pw9yiP75lxr+iQkAexFquZc\nh/ZwxncZhft7GoVDG3xE4T7tiewHFUaifRwO5d03FB+yuAZSD98LaUOo6PKZQY4cpyBnDMeQI8NE\npNB23impAlJYuw+a1xfIdZvWgsLhtLOtmJoCtyOFz4VIoe961O4+9LQBvrW22xg5Eu6NJBB/4pph\nbc+Z/H4FXrLiakHxG4AkApciN0gBKTxrUXhG0RxJBHnW+41I4gRJeO5qA794mF6SFUjyzADGu8W1\nk8LE1Rd4HTlTOmzND1JAX4EM0NbD7Xuch4xY+Zs1zb0ayEHx38T1/SLkbmkbrO/k3KeHkO+pwohW\nDSl3w4GjwK1Ig2MVl88cSH3zRmTs+zeRpPAtUtj2QgqdN5BRMX3xLVIIgdQdJyJHnSORgjwVGXAv\nxSUG1zHoXQughcDfga+RJOZvXA8hZ0VjkTOHacD/WLGcpGjVljP2V631X4MUtq7D/e5ECtfy1jI9\nKUwArsOLO/k7WuYga9vj3ab3Qar0nHF+gyQzkCP8Jtbr7khS2Onhexy0YnFW78xGDhC8cf1dfkT+\njh5EqhqdEpDvqcKIJoLY5unuTR8ivTvWAvcj1QoNXOb/HPhfpBpmFPAMcmSajRz9bUEKuWyKKulO\nUS8gR4gbkUJ7GnL0mGFNW4ccUX7qsp7DSP35LOu9c91vItUrzoLH37gMcn/Xh5Ab/2xCzkaWIwW4\ncz98gwx7PR85U0i3Hj+6rW8HUqhusuLfS2Gjraf9sRlo62G6v55Bvuc2ZB++AtxmfXYGaTf5CBmv\n/wUv32MssMpaRzmkrcAZt+t+d753Vjs5E9MCJMm/7zJfF6L7hkpKKXXW5gHtgrj+jNJnCZjrkcTu\n6j387zaslFIxpQ5ylB4s7u0jwTITORNKcJl2JYUdEZRSSimllFJKKaWUUkoppZRSSimllFJKKaWU\nUvb4fy7knGlPwFNFAAAAAElFTkSuQmCC\n", "text": [ "" ] } ], "prompt_number": 43 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the area under ROC curve is 0.756 which is very similar to the accuracy (0.732). However the ROC-AUC score of a random model is expected to 0.5 on average while the accuracy score of a random model depends on the class imbalance of the data. ROC-AUC can be seen as a way to callibrate the predictive accuracy of a model against class imbalance." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We previously decided to randomly split the data to evaluate the model on 20% of held-out data. 
However the location randomness of the split might have a significant impact in the estimated accuracy:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "features_train, features_test, target_train, target_test = train_test_split(\n", " features_array, target, test_size=0.20, random_state=0)\n", "\n", "logreg.fit(features_train, target_train).score(features_test, target_test)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 46, "text": [ "0.73184357541899436" ] } ], "prompt_number": 44 }, { "cell_type": "code", "collapsed": false, "input": [ "features_train, features_test, target_train, target_test = train_test_split(\n", " features_array, target, test_size=0.20, random_state=1)\n", "\n", "logreg.fit(features_train, target_train).score(features_test, target_test)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 47, "text": [ "0.67039106145251393" ] } ], "prompt_number": 45 }, { "cell_type": "code", "collapsed": false, "input": [ "features_train, features_test, target_train, target_test = train_test_split(\n", " features_array, target, test_size=0.20, random_state=2)\n", "\n", "logreg.fit(features_train, target_train).score(features_test, target_test)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 48, "text": [ "0.66480446927374304" ] } ], "prompt_number": 46 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So instead of using a single train / test split, we can use a group of them and compute the min, max and mean scores as an estimation of the real test score while not underestimating the variability:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import cross_val_score\n", "\n", "scores = cross_val_score(logreg, features_array, target, cv=5)\n", "scores" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 49, "text": [ "array([ 0.63128492, 0.68715084, 0.70224719, 0.73033708, 0.71751412])" ] } ], "prompt_number": 47 }, { "cell_type": "code", "collapsed": true, "input": [ "scores.min(), scores.mean(), scores.max()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 50, "text": [ "(0.63128491620111726, 0.69370682962933028, 0.7303370786516854)" ] } ], "prompt_number": 48 }, { "cell_type": "markdown", "metadata": {}, "source": [ "`cross_val_score` reports accuracy by default be it can also be used to report other performance metrics such as ROC-AUC or f1-score:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "scores = cross_val_score(logreg, features_array, target, cv=5,\n", " scoring='roc_auc')\n", "scores.min(), scores.mean(), scores.max()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 51, "text": [ "(0.61093544137022393, 0.72123181651091728, 0.78776737967914434)" ] } ], "prompt_number": 49 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**:\n", "\n", "- Compute cross-validated scores for other classification metrics ('precision', 'recall', 'f1', 'accuracy'...).\n", "\n", "- Change the number of cross-validation folds between 3 and 10: what is the impact on the mean score? 
on the processing time?\n", "\n", "Hints:\n", "\n", "The list of classification metrics is available in the online documentation:\n", "\n", " http://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values\n", " \n", "You can use the `%%time` cell magic on the first line of an IPython cell to measure the time of the execution of the cell. " ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 50 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "More feature engineering and richer models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now try to build richer models by including more features as potential predictors for our model.\n", "\n", "Categorical variables such as `data.Embarked` or `data.Sex` can be converted as boolean indicators features also known as dummy variables or one-hot-encoded features:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.get_dummies(data.Sex, prefix='Sex').head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sex_femaleSex_male
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
\n", "

5 rows \u00d7 2 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 52, "text": [ " Sex_female Sex_male\n", "0 0 1\n", "1 1 0\n", "2 1 0\n", "3 1 0\n", "4 0 1\n", "\n", "[5 rows x 2 columns]" ] } ], "prompt_number": 51 }, { "cell_type": "code", "collapsed": false, "input": [ "pd.get_dummies(data.Embarked, prefix='Embarked').head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Embarked_CEmbarked_QEmbarked_S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
\n", "

5 rows \u00d7 3 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 53, "text": [ " Embarked_C Embarked_Q Embarked_S\n", "0 0 0 1\n", "1 1 0 0\n", "2 0 0 1\n", "3 0 0 1\n", "4 0 0 1\n", "\n", "[5 rows x 3 columns]" ] } ], "prompt_number": 52 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can combine those new numerical features with the previous features using `pandas.concat` along `axis=1`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "rich_features = pd.concat([data.get(['Fare', 'Pclass', 'Age']),\n", " pd.get_dummies(data.Sex, prefix='Sex'),\n", " pd.get_dummies(data.Embarked, prefix='Embarked')],\n", " axis=1)\n", "rich_features.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FarePclassAgeSex_femaleSex_maleEmbarked_CEmbarked_QEmbarked_S
0 7.2500 3 22 0 1 0 0 1
1 71.2833 1 38 1 0 1 0 0
2 7.9250 3 26 1 0 0 0 1
3 53.1000 1 35 1 0 0 0 1
4 8.0500 3 35 0 1 0 0 1
\n", "

5 rows \u00d7 8 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 54, "text": [ " Fare Pclass Age Sex_female Sex_male Embarked_C Embarked_Q \\\n", "0 7.2500 3 22 0 1 0 0 \n", "1 71.2833 1 38 1 0 1 0 \n", "2 7.9250 3 26 1 0 0 0 \n", "3 53.1000 1 35 1 0 0 0 \n", "4 8.0500 3 35 0 1 0 0 \n", "\n", " Embarked_S \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 1 \n", "4 1 \n", "\n", "[5 rows x 8 columns]" ] } ], "prompt_number": 53 }, { "cell_type": "markdown", "metadata": {}, "source": [ "By construction the new `Sex_male` feature is redundant with `Sex_female`. Let us drop it:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "rich_features_no_male = rich_features.drop('Sex_male', 1)\n", "rich_features_no_male.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FarePclassAgeSex_femaleEmbarked_CEmbarked_QEmbarked_S
0 7.2500 3 22 0 0 0 1
1 71.2833 1 38 1 1 0 0
2 7.9250 3 26 1 0 0 1
3 53.1000 1 35 1 0 0 1
4 8.0500 3 35 0 0 0 1
\n", "

5 rows \u00d7 7 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 55, "text": [ " Fare Pclass Age Sex_female Embarked_C Embarked_Q Embarked_S\n", "0 7.2500 3 22 0 0 0 1\n", "1 71.2833 1 38 1 1 0 0\n", "2 7.9250 3 26 1 0 0 1\n", "3 53.1000 1 35 1 0 0 1\n", "4 8.0500 3 35 0 0 0 1\n", "\n", "[5 rows x 7 columns]" ] } ], "prompt_number": 54 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us not forget to imput the median age for passengers without age information:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "rich_features_final = rich_features_no_male.fillna(rich_features_no_male.dropna().median())\n", "rich_features_final.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FarePclassAgeSex_femaleEmbarked_CEmbarked_QEmbarked_S
0 7.2500 3 22 0 0 0 1
1 71.2833 1 38 1 1 0 0
2 7.9250 3 26 1 0 0 1
3 53.1000 1 35 1 0 0 1
4 8.0500 3 35 0 0 0 1
\n", "

5 rows \u00d7 7 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 56, "text": [ " Fare Pclass Age Sex_female Embarked_C Embarked_Q Embarked_S\n", "0 7.2500 3 22 0 0 0 1\n", "1 71.2833 1 38 1 1 0 0\n", "2 7.9250 3 26 1 0 0 1\n", "3 53.1000 1 35 1 0 0 1\n", "4 8.0500 3 35 0 0 0 1\n", "\n", "[5 rows x 7 columns]" ] } ], "prompt_number": 55 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can finally cross-validate a logistic regression model on this new data an observe that the mean score has significantly increased:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.cross_validation import cross_val_score\n", "\n", "logreg = LogisticRegression(C=1)\n", "scores = cross_val_score(logreg, rich_features_final, target, cv=5, scoring='accuracy')\n", "print(\"Logistic Regression CV scores:\")\n", "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", " scores.min(), scores.mean(), scores.max()))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Logistic Regression CV scores:\n", "min: 0.770, mean: 0.786, max: 0.810\n", "CPU times: user 19.6 ms, sys: 396 \u00b5s, total: 20 ms\n", "Wall time: 19.6 ms\n" ] } ], "prompt_number": 56 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**:\n", "\n", "- change the value of the parameter `C`. Does it have an impact on the score?\n", "\n", "- fit a new instance of the logistic regression model on the full dataset.\n", "\n", "- plot the weights for the features of this newly fitted logistic regression model." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/04A_plot_logistic_regression_weights.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 57 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 58 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Training Non-linear models: ensembles of randomized trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`sklearn` also implement non linear models that are known to perform very well for data-science projects where datasets have not too many features (e.g. 
less than 5000).\n", "\n", "In particular let us have a look at Random Forests and Gradient Boosted Trees:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "rf = RandomForestClassifier(n_estimators=100)\n", "scores = cross_val_score(rf, rich_features_final, target, cv=5, n_jobs=4,\n", " scoring='accuracy')\n", "print(\"Random Forest CV scores:\")\n", "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", " scores.min(), scores.mean(), scores.max()))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Random Forest CV scores:\n", "min: 0.787, mean: 0.812, max: 0.843\n", "CPU times: user 56.9 ms, sys: 17.7 ms, total: 74.7 ms\n", "Wall time: 349 ms\n" ] } ], "prompt_number": 59 }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "\n", "gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,\n", " subsample=.8, max_features=.5)\n", "scores = cross_val_score(gb, rich_features_final, target, cv=5, n_jobs=4,\n", " scoring='accuracy')\n", "print(\"Gradient Boosted Trees CV scores:\")\n", "print(\"min: {:.3f}, mean: {:.3f}, max: {:.3f}\".format(\n", " scores.min(), scores.mean(), scores.max()))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Gradient Boosted Trees CV scores:\n", "min: 0.793, mean: 0.818, max: 0.859\n", "CPU times: user 55.2 ms, sys: 17.7 ms, total: 72.9 ms\n", "Wall time: 347 ms\n" ] } ], "prompt_number": 60 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both models seem to do slightly better than the logistic regression model on this data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**:\n", "\n", "- Change the value of the learning_rate and other `GradientBoostingClassifier` parameter, can you get a better mean score?\n", "\n", "- Would treating the `PClass` variable as categorical improve the models performance?\n", "\n", "- Find out which predictor variables (features) are the most informative for those models.\n", "\n", "Hints:\n", "\n", "Fitted ensembles of trees have `feature_importances_` attribute that can be used similarly to the `coef_` attribute of linear models." 
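, "\n", "\n", "For the last question, here is a minimal sketch of one possible approach (illustrative only, not the contents of the solution files loaded below): refit a random forest on the full feature matrix and inspect its `feature_importances_`, pairing each value with its column name:\n", "\n", "```python\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "# refit on the featured dataset built above (rich_features_final, target)\n", "rf = RandomForestClassifier(n_estimators=100)\n", "rf.fit(rich_features_final, target)\n", "\n", "# one importance score per column, plotted as a bar chart\n", "importances = pd.Series(rf.feature_importances_, index=rich_features_final.columns)\n", "print(importances)\n", "importances.plot(kind='bar')\n", "```"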
] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/04B_more_categorical_variables.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 61 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 62 }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/04C_feature_importance.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 63 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 64 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Automated parameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of changing the value of the learning rate manually and re-running the cross-validation, we can find the best values for the parameters automatically (assuming we are ready to wait):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "\n", "from sklearn.grid_search import GridSearchCV\n", "\n", "gb = GradientBoostingClassifier(n_estimators=100, subsample=.8)\n", "\n", "params = {\n", " 'learning_rate': [0.05, 0.1, 0.5],\n", " 'max_features': [0.5, 1],\n", " 'max_depth': [3, 4, 5],\n", "}\n", "gs = GridSearchCV(gb, params, cv=5, scoring='roc_auc', n_jobs=4)\n", "gs.fit(rich_features_final, target)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 361 ms, sys: 29.6 ms, total: 391 ms\n", "Wall time: 3.86 s\n" ] } ], "prompt_number": 65 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us sort the models by mean validation score:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(gs.grid_scores_, key=lambda x: x.mean_validation_score, reverse=True)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 64, "text": [ "[mean: 0.87602, std: 0.02655, params: {'max_features': 0.5, 'learning_rate': 0.1, 'max_depth': 4},\n", " mean: 0.87335, std: 0.02761, params: {'max_features': 0.5, 'learning_rate': 0.05, 'max_depth': 5},\n", " mean: 0.87144, std: 0.02649, params: {'max_features': 0.5, 'learning_rate': 0.1, 'max_depth': 3},\n", " mean: 0.87117, std: 0.02732, params: {'max_features': 0.5, 'learning_rate': 0.05, 'max_depth': 4},\n", " mean: 0.86713, std: 0.02569, params: {'max_features': 0.5, 'learning_rate': 0.05, 'max_depth': 3},\n", " mean: 0.86712, std: 0.02493, params: {'max_features': 1, 'learning_rate': 0.1, 'max_depth': 4},\n", " mean: 0.86626, std: 0.02438, params: {'max_features': 1, 'learning_rate': 0.05, 'max_depth': 4},\n", " mean: 0.86482, std: 0.02602, params: {'max_features': 1, 'learning_rate': 0.05, 'max_depth': 5},\n", " mean: 0.86472, std: 0.02421, params: {'max_features': 0.5, 'learning_rate': 0.1, 'max_depth': 5},\n", " mean: 0.86404, std: 0.02593, params: {'max_features': 1, 'learning_rate': 0.1, 'max_depth': 3},\n", " mean: 0.86389, std: 0.03037, params: {'max_features': 0.5, 'learning_rate': 0.5, 'max_depth': 3},\n", " mean: 0.86320, std: 0.02511, params: {'max_features': 1, 'learning_rate': 0.5, 'max_depth': 3},\n", " mean: 0.86159, std: 0.02487, params: {'max_features': 1, 'learning_rate': 0.1, 'max_depth': 5},\n", " mean: 0.86108, std: 0.02395, params: {'max_features': 1, 'learning_rate': 0.5, 'max_depth': 4},\n", " mean: 0.86087, std: 0.01712, params: {'max_features': 0.5, 'learning_rate': 0.5, 'max_depth': 4},\n", " mean: 
0.85951, std: 0.01888, params: {'max_features': 1, 'learning_rate': 0.05, 'max_depth': 3},\n", " mean: 0.85709, std: 0.02520, params: {'max_features': 0.5, 'learning_rate': 0.5, 'max_depth': 5},\n", " mean: 0.83781, std: 0.02656, params: {'max_features': 1, 'learning_rate': 0.5, 'max_depth': 5}]" ] } ], "prompt_number": 66 }, { "cell_type": "code", "collapsed": false, "input": [ "gs.best_score_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 65, "text": [ "0.87601626884332873" ] } ], "prompt_number": 67 }, { "cell_type": "code", "collapsed": false, "input": [ "gs.best_params_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 66, "text": [ "{'learning_rate': 0.1, 'max_depth': 4, 'max_features': 0.5}" ] } ], "prompt_number": 68 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We should not that the mean scores are very close to one another and almost always within one standard deviation of one another. This means that all those parameters are quite reasonable. The only parameter of importance seems to be the `learning_rate`: 0.5 seems to be a bit too high." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Avoiding data snooping with pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When doing imputation in pandas, prior to computing the train test split we use data from the test to improve the accuracy of the median value that we impute on the training set. This is actually cheating. To avoid this we should compute the median of the features on the training fold and use that median value to do the imputation both on the training and validation fold for a given CV split.\n", "\n", "To do this we can prepare the features as previously but without the imputation: we just replace missing values by the -1 marker value:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "features = pd.concat([data.get(['Fare', 'Age']),\n", " pd.get_dummies(data.Sex, prefix='Sex'),\n", " pd.get_dummies(data.Pclass, prefix='Pclass'),\n", " pd.get_dummies(data.Embarked, prefix='Embarked')],\n", " axis=1)\n", "features = features.drop('Sex_male', 1)\n", "\n", "# Because of the following bug we cannot use NaN as the missing\n", "# value marker, use a negative value as marker instead:\n", "# https://github.com/scikit-learn/scikit-learn/issues/3044\n", "features = features.fillna(-1)\n", "features.head(5)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
FareAgeSex_femalePclass_1Pclass_2Pclass_3Embarked_CEmbarked_QEmbarked_S
0 7.2500 22 0 0 0 1 0 0 1
1 71.2833 38 1 1 0 0 1 0 0
2 7.9250 26 1 0 0 1 0 0 1
3 53.1000 35 1 1 0 0 0 0 1
4 8.0500 35 0 0 0 1 0 0 1
\n", "

5 rows \u00d7 9 columns

\n", "
" ], "output_type": "pyout", "prompt_number": 67, "text": [ " Fare Age Sex_female Pclass_1 Pclass_2 Pclass_3 Embarked_C \\\n", "0 7.2500 22 0 0 0 1 0 \n", "1 71.2833 38 1 1 0 0 1 \n", "2 7.9250 26 1 0 0 1 0 \n", "3 53.1000 35 1 1 0 0 0 \n", "4 8.0500 35 0 0 0 1 0 \n", "\n", " Embarked_Q Embarked_S \n", "0 0 1 \n", "1 0 0 \n", "2 0 1 \n", "3 0 1 \n", "4 0 1 \n", "\n", "[5 rows x 9 columns]" ] } ], "prompt_number": 69 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use the `Imputer` transformer of scikit-learn to find the median value on the training set and apply it on missing values of both the training set and the test set." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(features.values, target, random_state=0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 70 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.preprocessing import Imputer\n", "\n", "imputer = Imputer(strategy='median', missing_values=-1)\n", "\n", "imputer.fit(X_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 69, "text": [ "Imputer(axis=0, copy=True, missing_values=-1, strategy='median', verbose=0)" ] } ], "prompt_number": 71 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The median age computed on the training set is stored in the `statistics_` attribute." ] }, { "cell_type": "code", "collapsed": false, "input": [ "imputer.statistics_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 70, "text": [ "array([ 14.5, 29. , 0. , 0. , 0. , 1. , 0. , 0. , 1. ])" ] } ], "prompt_number": 72 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imputation can now happen by calling the transform method:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_train_imputed = imputer.transform(X_train)\n", "X_test_imputed = imputer.transform(X_test)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 73 }, { "cell_type": "code", "collapsed": false, "input": [ "np.any(X_train == -1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 72, "text": [ "True" ] } ], "prompt_number": 74 }, { "cell_type": "code", "collapsed": false, "input": [ "np.any(X_train_imputed == -1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 73, "text": [ "False" ] } ], "prompt_number": 75 }, { "cell_type": "code", "collapsed": false, "input": [ "np.any(X_test == -1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 74, "text": [ "True" ] } ], "prompt_number": 76 }, { "cell_type": "code", "collapsed": false, "input": [ "np.any(X_test_imputed == -1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 75, "text": [ "False" ] } ], "prompt_number": 77 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use a pipeline that wraps an imputer transformer and the classifier itself:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.pipeline import Pipeline\n", "\n", "imputer = Imputer(strategy='median', missing_values=-1)\n", "\n", "classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,\n", " subsample=.8, max_features=.5)\n", "\n", "pipeline = Pipeline([\n", " ('imp', 
imputer),\n", " ('clf', classifier),\n", "])\n", "\n", "scores = cross_val_score(pipeline, features.values, target, cv=5, n_jobs=4,\n", " scoring='accuracy', )\n", "print(scores.min(), scores.mean(), scores.max())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(0.77653631284916202, 0.81824552705576692, 0.84745762711864403)\n" ] } ], "prompt_number": 78 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean cross-validation is slightly lower than we used the imputation on the whole data as we did earlier although not by much. This means that in this case the data-snooping was not really helping the model cheat by much.\n", "\n", "Let us re-run the grid search, this time on the pipeline. Note that thanks to the pipeline structure we can optimize the interaction of the imputation method with the parameters of the downstream classifier without cheating:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%time\n", "\n", "params = {\n", " 'imp__strategy': ['mean', 'median'],\n", " 'clf__max_features': [0.5, 1],\n", " 'clf__max_depth': [3, 4, 5],\n", "}\n", "gs = GridSearchCV(pipeline, params, cv=5, scoring='roc_auc', n_jobs=4)\n", "gs.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CPU times: user 244 ms, sys: 25.5 ms, total: 270 ms\n", "Wall time: 2.53 s\n" ] } ], "prompt_number": 79 }, { "cell_type": "code", "collapsed": false, "input": [ "sorted(gs.grid_scores_, key=lambda x: x.mean_validation_score, reverse=True)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 78, "text": [ "[mean: 0.87246, std: 0.03058, params: {'clf__max_features': 0.5, 'clf__max_depth': 3, 'imp__strategy': 'mean'},\n", " mean: 0.86994, std: 0.02982, params: {'clf__max_features': 1, 'clf__max_depth': 4, 'imp__strategy': 'median'},\n", " mean: 0.86891, std: 0.02477, params: {'clf__max_features': 0.5, 'clf__max_depth': 4, 'imp__strategy': 'median'},\n", " mean: 0.86550, std: 0.02562, params: {'clf__max_features': 0.5, 'clf__max_depth': 3, 'imp__strategy': 'median'},\n", " mean: 0.86545, std: 0.02546, params: {'clf__max_features': 0.5, 'clf__max_depth': 4, 'imp__strategy': 'mean'},\n", " mean: 0.86534, std: 0.02843, params: {'clf__max_features': 0.5, 'clf__max_depth': 5, 'imp__strategy': 'median'},\n", " mean: 0.86405, std: 0.02447, params: {'clf__max_features': 0.5, 'clf__max_depth': 5, 'imp__strategy': 'mean'},\n", " mean: 0.86079, std: 0.02559, params: {'clf__max_features': 1, 'clf__max_depth': 3, 'imp__strategy': 'mean'},\n", " mean: 0.85974, std: 0.02794, params: {'clf__max_features': 1, 'clf__max_depth': 4, 'imp__strategy': 'mean'},\n", " mean: 0.85794, std: 0.02373, params: {'clf__max_features': 1, 'clf__max_depth': 3, 'imp__strategy': 'median'},\n", " mean: 0.85752, std: 0.02890, params: {'clf__max_features': 1, 'clf__max_depth': 5, 'imp__strategy': 'mean'},\n", " mean: 0.85312, std: 0.02751, params: {'clf__max_features': 1, 'clf__max_depth': 5, 'imp__strategy': 'median'}]" ] } ], "prompt_number": 80 }, { "cell_type": "code", "collapsed": false, "input": [ "gs.best_score_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 79, "text": [ "0.87245552901481005" ] } ], "prompt_number": 81 }, { "cell_type": "code", "collapsed": false, "input": [ "plot_roc_curve(y_test, gs.predict_proba(X_test))" ], "language": "python", "metadata": {}, "outputs": [ { 
"output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEVCAYAAADtmeJyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xd4VGX6//F3KAGkKNgRAUWkJhQRIQgGFRvigu27rqvA\nIqsgYkNw+amLX74qq9hlEV0Uu2JBBVxFiVkUQVcUBBSUZgERC0ICJIHk+f1xn0kmQzKZSTI1n9d1\nzZUpZ86558zkuc952gERERERERERERERERERERERERERERFJcEXAF8DnwGfAGuAT4IQIbe9zoEmE\n1g1wFbAcWA2sAp4Gjo7g9gJdAYzy7l8JTKjGddcGbgD+i+3H1cAUINV7fRZwYzVuL1QDgdsr8b7b\ngcsqWOY24LwwlheRSigCmgU8dyPwUQxiqaqpwDvAUd7jFKzg+MHvuUibReQK48eAl4DG3uMDgDlY\nsgN4MoLbDmYS8HCE1p0NXBChdYuIpwg42O9xHeAhYK7fc/8PWIYdhc4BjvSePwJ4HfgKOzq9xnv+\nQKxA/BRYAdyHHc36b+8jSv+DT/FuACO8934GvAu0856fBbyJHenfFfA5WgA53rYDPQA84t3fBNyP\nHVV/g51B+AwClnrb/RDo5T0/CUswK7BC9zDvc38EbADeBw4FhgC/YolnNKULyE3A34FF3v1/+G33\nZuBrbB8/AGws4zMcA+QCjQKePxwY7N1/EngVWAys82I8wHvtL36fbZPf5x4GfOBte6G3/NPAEmAt\n9j0c7y1b1vfdE/gR2AZM9pYL5fubQumkeTu2f/8LvO1t62rsO13vfUb/5U8CPgZWerH3L2OfiUiI\nfFVDy4HN2D/dA8Ah3uuXAy9QUpD/FZjv3X+NksK7CfZP2QZ4AhjjPV8beAa4yW97zbACaK7fMt97\n7z0F+A/QwHvtDKzQASsIFpTzOS7AqrTKMsj7fGCF7OPe/eZYAdYZaIvth6bea52ALVjBOAn4Eqjl\nvTbW7/OA7Y8bvPtP+t3/O5ZUfdu922+7u4FWwJlYweqrLvsXllzK+nwfl/P5fGZhBXh9L9ZPgT8D\nDbGk5ftsvYCd3v1hWPLyJZgLsO/fZ7rfZyjv+/b/nOF8f759dTTwO1DXe/4GSqqD3gfOD1i+LpZ8\nzvae7459dxIFdWIdgERMJvAb0BX4N1aY/OK9di5wIlaogBXavn/y04Bx3v2dQFrAe0Z4jxtgCcDf\ny1hVzuFYe8Q3WBK6EjiO0lVTTb2bw47Uy1O3nOfrBWx/mvd3C3b0eQaQh53pZPktV+jF4rCjad86\nHgL6YoVSWyyRLPV7X0rAX583/La7DTszOgeYTUnBPA3br4EKKUlE5XHYEXue93gVdvayC/tOBnmf\npyuWHHy+wM42wM4oNmJH+8dhvw3fd1He951CyWcdSPjf3w/Y2cDn2O/v35T+HvyleNvd5y0HduaR\nXs7yUs2UCJLfcuB67Kh0KfAtVvhMAWZ4y6RSUpW0L+D9x2BHl7WAC7GqBYCDsELA3y4sGfwJ6E3J\nUXot7AziZu9xCnbEuN3vfWVZihXKhwM/BbzWn9IFU6Hf/dqUFLILgT/6vdYSK6SGBGz3H1iim4kV\nWHUoXegHflafPQHLpAB7KV3AByZMn/8CHbAj91y/54/CvpsLvcf+34kvjhZYcn8UqwZ6BUsMPv7r\nGwWMxKq0nsO+z9ZlrBvs+/7F245vW+F+fynee0/BDggGYFV37wPXsT+H7bPAfdwR+70V7vcOqVYV\nHY1IcngRKzR81QPvYAWDr4FyEvCUd/89YLh3/0CsID3Oe88N2D95KtauMLqMbT3uvb83diQKVnVw\nCVZHjLdtX3VC4BG2v83YkfoLWNWLz3CsasG/Tv5y729LrOB5CyvQz6CkPvssLDHWL2O7Z2D75zng\nZ28dvqqzfZT04qkoZodVK11ASdXQCMpOJJu97T1ByXfRBPgnVhjnlbOtFKyA3QbcgdXZD/JeK+t/\n+gysCudJrN3iPEoOAsv6vttS+jNX5vtLx85e1mAHHQ9QcoTvv27f+9di++h073F3LHEE29dSTXRG\nkJzKKnTGYNUFA7Czg6OwI26HnSUM9VtuOnZaXwu4EztNHws86K2jLlb4+OrH/bf3GXZ09ypQ4D23\nACu038WOjndgR+S+95Z3tA0wEWsUfQMrwOth9eq9sDYIn5ZYA2MD4FqsWgqs/eNFSo7UB2F1+YHb\n/V+sWmsiVsC+giVAsOoKX8N0RfGCFWCPY8l3N1afvrucZUcDt2JnN/u8zzcHq6Mvb3sO26d/wQrQ\nbdj++ZGSai//90zFeiddjp0NvE5JXXx533cq1n6Qj+3PcL4/h/1OZmPVj7ne5x/rvT7XiynV770F\nWHJ/ALjH2+4Q9j9jEREp00asp0u8OIGS3lZgZ1IvxCgWkQpFumroJOzoKNAgrDfIR9hgHZFk8jXW\n8LwSOzLuT0mvI5EaZTz2TxA4iKkudtp+oHf/E6wXhIiIxEAkzwjWYXV+gY09HbzXdmB1th8C/SIY\nh4iIBBHJRPAaZTf0NMGSgE95I0dFRCQKYtFraAclXeXw7m8PXKhNmzZu/fr1UQtKRCRJrKekx1tI\nYpEI1mD9lJtiA1H6Yd3FSlm/fj3OVdRLr2aYNGkSkyZNinUYcUH7ooT2RYnq2he//gqrV1e8XLzY\nvTuH6dPH88EH83BuBkuWnEOHDiltwl1PNBKBrzS/BBtB+TjWg+IdrGpqJtb/WUQkpu6+G2bPhqOj\nOcl5Je3atZLVq8/joIP606nTSqZOPYj27Su3rkgngk1Ahnffvx/1PO8mIhI3Cgvh6qth3LiKl421\nHTtasnTpo5x55plVXpdGFieAzMzMWIcQN7QvSsTzvvjuO/jHPyBatbubN2cyuqwJT8L00Ufw5z9X\nfT3RcOCBB1ZLEoD4nsfDqY1AJDG9/jrcdhtcdVXFy8abc8+Fli1jHUXlpaSkQJhlu84IRCQijj2W\najlKr+mysrJ44IEHmDNnDrVr1674DZWgRCBSDXbssCPJROpxEkkFBfCHP8Q6isSWk5PD+PHjmTdv\nHjNmzIhYEgAlApEqy8uDwYMhPR3eeKPi5WuKxo0rXkbKlpWVxYgRI+jfvz8rV67koIMOiuj21EYg\nUgVFRfA//wMpKfDCCxDBgzapIZYuXcpFF13EjBkzOOecc8J+f2XaCJQIRKpg7Vo49VTYsAHq1Yt1\nNJIMnHPk5ubSuJKnVJVJBLpCmUgVFBVBkyZKAlJ9UlJSKp0EKkuJQEQkRrZu3RrrEAAlAhGRqMvJ\nyWHUqFFkZmayb1/sr8apRCAiEkVZWVmkp6eTn5/P0qVLqVMn9p03Yx+BSIx88gmcdx7k5FR+HUVF\n1m1UpCKB4wIq0yMoUpQIpEZau9YGPD36KAwYULV1paZWT0yS3DZv3kxhYWFUxgWES91HJajt2+HD\nD2MdRf
Xatw9uuMHmwhk+PNbRiFQvjSOQavfPf9oc7WlpsY6keg0cmJgToolURJPOSbUrKrI5dB55\nJNaRiCSGnJwcXnzxRUaOHBnrUEKmXkMiItXE1yNoyZIl7N27N9bhhExnBCIiVRTPPYJCoTMCEZEq\nWLNmTfG4gJUrVyZcEgCdEYifL76AE0+067b6FBXBxImxi0kk3rVq1YrHH3+c008/PdahVJoSgRTb\nscMSQXZ26ec1tbJI+Ro0aJDQSQCUCGqMJUtg8+bgy3z1lc2rHwcj3kUkivQvX0Nceim0awcNGwZf\n7vzzoxOPSKLJysrijjvu4K233qJeks07rkRQQzhng8OOOSbWkYgklsAeQcmWBEC9hpLaxo1w8cVw\n4YXw009W7SMiofOfKTRRewSFQmcESWztWvj6a7jlFhg6FFq1inVEIolj5cqVDB06NCHHBYRLiSAJ\nOWddQAsL4Ygj7IxARMKTlpbG119/TYMGDWIdSsSpaigJDR9uUyP/4Q9w6KGxjkYkcdWEJABKBElp\n+3aYM8emW37mmVhHIxL/vvvuu1iHEFOqGkogO3fCvHlW9RPM999HJx6RROfrEbRgwQJWr15N/fr1\nYx1STCgRJJD33oMJE6Bfv+DLdeoEnTtHJyaRRJWVlcWIESPo378/y5Ytq7FJAJQIEopz0LMnPPdc\nrCMRSVyJPlNoJCgRxJnXX4eZM8t+7ccfNSBMpKp+//13UlJS4vLawbESz0OMauSlKm+80Rp7hwwp\n+/UOHeC446Ibk4gkjkhdqrIZ0Ac4GPgJ+ADIDTc4gT17bFrnYPbutTr+QYOiE5OISLDuo4cBM4F5\nwBCgLTAIeBd4DDg84tElkXXroHFjOOyw4LeZM+Fw7VmRKsvJyeHBBx+kJtYshCvYGcFtwD+Ar8t4\nrSPwd2B0Oe+tBfwTSAfygSuA9X6vDwEmAg54Ang0rKgT0K5ddqS/YkWsIxFJfv49gvLz82t0j6BQ\nBEsEY7y/RwBbA177kvKTAMBgIBXIAE4C7vWe87kP6Abs8tb1ArAj5KhFRMqgHkGVE8rI4leB14Fz\nQ1werE3hbe/+x0CPgNf3AgcBDbBGDZ27iUiVbNiwoUbMFBoJoTQW9wE6AcOAW4CFWNvBhiDvaQLs\n9HtciCURX1PpvcAy7Izg1YBlRUTC1rJlS5588kkyMzNjHUrCCXUcwWas4O8BdAbuB9YAE8pZfifQ\n2O+xfxJoiVU7tQJ2A88CFwKvBK5k0qRJxfczMzMT8gt+5hl4+mnIyYFamtlJJGLq1KmTkGVEVWVn\nZ5MdeKHxMIXS13Q2kIYV2E8CW7znP2X/Kh+f87EeRsOBXsCtwEDvteO9dZ6IVRE9AKwC/hWwjqQY\nR3DVVVCvHpx7LrRoYeMARKRqnHO+/vISoDLjCEI5Rp0LdADuwJJAO+/5vkHeMwfIAxZj1UDXA5cA\nI7FeSE8BH2FjEg4EZoUTdKLp0AEGDFASEKkOWVlZZGRksGvXrliHkjSCVQ2lAc2BG7GBZAC1gSlA\nF2BPkPc6YFTAc/7dUO/3biIiIQnsEdSwYcNYh5Q0giWCpthR/OHeX7B6/mmRDkpExJ//uADNEVT9\ngiWCRd6tO/BZdMIRESlt/fr1DB8+nOnTp6tLaIQESwTTgKuxEcL+rbYOGygmIhJxbdq04euvv6Ze\nvXqxDiVpBUsE/+v9vQwoIL5nKhWRJKYkEFnBeg35GojfxM4MUoFN3k2CcA4GD7Yrib3xBtSuHeuI\nRBLDunXrYh1CjRTKUX494DxgKFAf6+r5bARj8knYcQTO2eCx//zHHp9wAqiDg0j5fD2C3nrrLVat\nWkXjxo0rfpOUqTLjCMJZ+GRsPEAnoH04G6mkuEwE+fnw66/Bl3HOBo/FYfgicce/R9B9992nHkFV\nFKkL09wGXAx8DjyI9SSqscaMgZdfhgMOCL6cLh4vEpxmCo0foSSC7djZwO8RjiUh5OXBww/DZZfF\nOhKRxJafn09qaqrGBcSBYKcPI4HHgbsCnnfYRWUiLeZVQxs2WGOvv2efheuuUyIQkfhU3VVD33l/\n1/jW7/2tMTXfr70GL7xgvX98TjkF+vSJXUwiItUtWCJ4x/vbE+s+6vMMNmlcjdC/P0ydGusoRBJX\nTk4OjzzyCDfddBN16oQ6871EU7BxBGOAH7HrDf/o3bYCR0UhrpgaORJOOgkeegj0uxWpvKysLNLT\n0/nmm2/Iz8+PdThSjlDqkSYCd0Y6kDLErI2gTRu4+27rAtquHagdSyQ86hEUO9XdRjAIuxbBr8Bf\nfdvA2ggeq0R8CaVrV0sIIhKeH374gb59+2qm0AQSLBE08/4eSQ1qIBaRqmnevDnPPfccGRmamzJR\nhHL6kIJdjL4IGALMA36LZFCemFYNLVigMwIRSTyRGln8Ilb4Z3grH+LdElZhoQ0K21PONda2b49u\nPCKJStcOTg6hXLO4OdZltANwFZDws0H98gtMnAg7d5Z9u+YaaygWkfJlZWXRrVs3tuvIKeGFckZQ\nFzgfWA0cShIkAoDGjeGuwDHTIlKhwB5BTZs2jXVIUkWhnBHcDfwRm2riGmByRCOKoJdftt5Ap54K\ndevGOhqRxOMbF5Cfn8/KlSvVLTRJxHPlXrU3Ft9+O2zZAqNGQdOm0KpVta5eJKlt2bKFvn378vDD\nDysBxLFINRZPBMYDvqZVh7UbJIytW+H33+Hnn+GII+ysQETC07x5c9auXatpIpJQKN/oH7GCf3eE\nY4mYnj0hNdWmi7jttlhHI5K4lASSUyjf6gYgL9KBRFJBAXzyiZ0NiEjFvvzySzp27BjrMCRKQkkE\n9YCV3s15tz9FMqjqsHUrzJhhl4vMzY11NCKJwdcjaP78+Xz++eccfPDBsQ5JoiCUXkP/AEYD04FH\ngRkRjaiaLFli1xIAuPVWOPTQ2MYjEu8WLlxIeno6BQUFfPHFF0oCNUgoZwSfYY3FzbFJ6FZGNKJq\n1KEDTJoU6yhE4ltubi433XQT8+bN47HHHuPss8+OdUgSZaEkgieAt4BMbCbSmcApEYxJRKLIOUej\nRo00U2gNFkrV0MFYMtgLLArxPSKSIBo3bsw999yjJFCDhVKoO6C9d78FsC9y4YiISLSFkgiuBWYB\n3YFXgRsjGZCIREZOTg5///vfyctL6N7gEgGhJIKVQC+gNTAAazwWkQSycOFC0tLS+OGHH9i7d2+s\nw5E4EywRdAeWA6nY7KNrgf8C50UhLhGpBjk5OYwaNYphw4Yxffp0Zs6cSePGSTGBsFSjYIlgKjAU\nKADuAM4GTgRujkJcIlJF27ZtIy0tjYKCAlauXKluoVKuYN1HawErgKOAA4Bl3vNFkQ5KRKru0EMP\n5ZVXXqFHjx6xDkXiXLAzAl9F4pnAe979ukCjiEYkItUiJSVFSUBCEuyMYCGwGGiJtQscC0wDZoew\n3lrAP4F0IB+4Aljv9/qJwL3YnNmbgcuxKigRqYSioiJq1dIQH6m
cYL+cKcBIrMfQ51ih/RhwZwjr\nHYw1MmdgbQr3+r3mW88woC+WcI4JM24R8SxcuJDOnTvz008/xToUSVDBEsEQ4EvsiB3siH6O3+vn\nB3lvH+Bt7/7HgP/56fHYVBU3ANnAQViPJBEJg3+PoKlTp3L44YfHOiRJUMESwQFYYX49cBrQGTuC\nH4e1GQRrK2gC7PR7XOi3rUOwM4WHgdO9dfevROwiNZauHSzVKVgbwXPA68ClwF+wAnwbdhQ/GAg2\ny/9OwL+zci1Kehv9Cqyj5CzgbeyM4f3AlUzymzo0MzOTzMzMMjeWl2ezjObnlzy3bp1dkUwk2fzy\nyy+MGjWKadOmKQEI2dnZZGdnV2kdkbp4/fnAIGA41sZwKzDQey0VWIONUl6PTVvxL+DfAesI+eL1\nGzZAjx523QF/PXtCnz6V+wAi8aywsJDatWvHOgyJQ5W5eH2kEkEKJb2GwBLCCVh10uNYVdAUb7nF\nWPVToLASwemn218RkZqsMokgUpUnDhgV8NzXfvffB06K0LZFksby5cvp0qWL759bJCJC6Xj8QsSj\nEJFScnNzGT16NIMGDWLr1q2xDkeSXCiJIBXoAtT37qdGNCKRGi4rK4u0tDTy8vJYuXIlRx55ZKxD\nkiQXStVQO6z3kI/DRhmLSDXavXs348aNY+7cucyYMUM9giRqQkkEnb2/h2FdPwsjF45IzVW7dm2a\nNWumawdL1IXSAtUfu2D9TmwU8F+BBZEMylNhr6H//Adeegl27IAlS9RrSEQkUr2G/g84GdiCTUk9\nh+gkggrNnw9btsAZZ8Cll8Y6GhGRxBRKY/E+LAmAzTu0J3LhhC8jA0aPBlWnSqLIzc1l4sSJ7Ny5\ns+KFRaIglESQA1yD9Ry6BvgtohGJJDFfjyB1CZV4EkrV0J+BW7DLVX6FzTskImHIzc1l/Pjx6hEk\ncSmURPA7NuOoiFTCjh076NatG5mZmeoRJHFJ83OKRNiBBx7IG2+8QVpaWqxDESmTrm0nEgVKAhLP\nQjkjaAKMB5oDc4GV2PUERCSApoeWRBTKGcETwEZKLjH5REQjEklQWVlZdOjQgW+//TbWoYiEJZRE\ncDA2sngvsIjIXcNAJCH5ZgodOnQoDzzwAK1atYp1SCJhCSUROKC9d78FNsAspjp2hIYN4f77oUmT\nWEcjNVngTKHqFiqJKJSj+zTsqmIdsEtMjgI+i2RQnnLnGmrcGL75xv4ecADomh0SCzk5OfTp04cp\nU6YoAUjciNSlKs8F5vk9vhiYHc5GKiloItiyxf6KxJJzTlcPk7hS3ZPOnQv0Af4EZHgrrgX8gegk\nApG4pyQgySBYG8EKYC02ydxa77YK+GMU4hKJK5988glFRUWxDkMkIkI5nKkF+P8HHAn8GJlwSlHV\nkMSc/xxBH3zwAa1bt451SCJBVaZqKJReQ7cDP2MXptmHXY9AJOkF9ghSEpBkFUoiOA84GngW60a6\nKqIRicTYnj17iscFTJs2jSeeeEITxUlSC2WKiR+BPGyqiXWARstIUktNTeWoo47STKFSY4RSj/Qv\nYAlwIjYl9VlA10gG5VEbgYhImCI1jqAWVjW0HRgGvAd8GWZslaFEICISpupuLK4LXACcAnyLNRbP\nBiZVLjyR+JKbm8u4cePYtm1brEMRialgieA54HzgVmAMNsBsuXcTSWi+HkG//fYb9erVi3U4IjEV\nrLH4WKAHkAosAwqA/th1i0USkq4dLLK/YIlgp/e3ADtzGAD8FvGIRCJk9+7ddO3alX79+qlHkIif\nYInAv7FhG0oCkuAOOOAA5s+fT7t27WIdikhcCZYIOgHPYwmhI/CC97zDJqKLic+8CbDr1o1VBJLI\nlARE9hcsEVyMFfopwAy/58vu0xkF69fDuefCU09B/fqxikISwd69e6mrowWRkMTzHLqlxhH8/DP0\n7g3jx8Nf/xrDqCTuZWVlMXLkSObPn0/79u0rfoNIEqnu6xHElQULoFMnJQEpX2CPICUBkdCEMulc\nXHBOI4mlfLp2sEjlhXJG0AKYAhwGvITNPvpxJIMSCUdeXh5/+9vfmDZtmhKASCWEckbwGPAkNrDs\nY+ChENf7KPAR8D7QJsi67wphfSLlql+/PkuXLlUSEKmkUBJBA2Ah1ltoFXbpyooMxhJHBnAzcG8Z\ny1wJdKaCXkj79sGuXZCXF8JWpcbStYNFKi+URLAHm3q6NtAbuzZBRfoAb3v3P8amqvCXAfTEuqUG\n/Q8eNAgOPhiuvRaOPDKELUtSW7x4MXv37o11GCJJJZREcCUwHDgEGAeMCuE9TSiZogKg0G9bRwK3\nYRPZVXgYl5MD771nZwX33BPCliUp5ebmMnr0aP74xz+yYcOGWIcjklRCaSy+ACv8w5liYifg38en\nFlDk3b8QSypvAUcAB2AT2T0duJLzzpvEl1/CE0/Avn2ZZGZmhhGCJIusrCxGjBhB//79NUeQSIDs\n7Gyys7OrtI5QKlbHAZcAa4DHgVC2eD4wCDuT6IVNZT2wjOWGYtdB/lsZr7m//MVRuzb83//BYYeF\nsFVJKgUFBVx33XWaKVQkDJEaUDbVu50I3IT19Dm+gvfMwWYrXew9Ho4lk0ZYMvFXbmPxzJkhRCdJ\nq27duhx33HE6CxCJsFCyRgOsOudyb/mZlExAF0nlXqpSRETKFqlrFn8DvIpdxH5d+GFVmhKBiEiY\nInHNYoBuwN+B77CxAamVCU6kPLm5uVx33XVs2rQp1qGI1EjBEoGvF89KrKF4rXdbE+mgpObwzRG0\nc+dOtQOIxEgopw8nAv/1e5xJaD2HqkpVQ0lM1w4WiYzq7jXUF7sy2fXAfd5ztbGBYJ0qEZ8IYN1C\ne/ToQUZGhnoEicSBYIngd2wUcH3vbwo2QvimKMQlSSw1NZW3336b1q1bxzoUESG004fmwJZIB1IG\nVQ2JiISpunsNver9/Qz40e8Wi6QgCSpP08aKxL1gieAC7+8RWNWQ79Y80kFJcsjKyqJDhw4sW7Ys\n1qGISBChzD46ADgbmytoA3BpRCOShOebKXTo0KFMmzaNE044IdYhiUgQoSSCO4CvgbHYdQauimhE\nktB07WCRxBPKpHO7gW3AXqyNoCj44lJT7d27l8mTJ+vawSIJJpSW5TeBg7GriTXGBpRdFMGYfNRr\nSEQkTJGadK4+cCzwJXaN4W+A/HCDqwQlAhGRMFV391GfQ4HbsUQwGes5JDXcokWL2L17d6zDEJFq\nEEoieBx4Bmsofgq7HoHUUL4eQZdeeinr16+PdTgiUg1CSQT1sXaC7cDrlExPLTVMYI+gtLS0WIck\nItUglF5DtYF04AsgjSCXlpTktG/fPsaOHauZQkWSVCiJYCzwBNY2sAUYGdGIJO7UqVOHLl26cOed\nd2qmUJEkVFHLchNgHzaWINrUa0hEJEzV3WtoDLACqxI6q/JhiYhIPAuWCC4F2gG9gOuiE47EUm5u\nLtdccw2rV6+OdSgiEk
XBEsEeoAD4BfUUSnq+HkG7du3iqKOOinU4IhJFwRqL/euYQulmKglI1w4W\nkWCJoBPwPJYQOgIveM874E8RjkuioLCwkIyMDHr06KFrB4vUYMFaljOxQj9wGQf8J1IB+W9HvYYi\nb/PmzaoKEkkikZp0LlaUCEREwhSpSeckCezatQslVhEpixJBDZCVlUXnzp354IMPYh2KiMShUKaY\naAFMAQ4DXgJWAR9HMiipHoE9gvr16xfrkEQkDoVyRvAY8CSQiiWAhyIakVQLXTtYREIVSiJoACzE\negutwgaaSRwrKiri/vvvZ9q0aTzxxBPqFioiQYVSNbQHm2uoNtAbyItoRFJltWrVYu7cubEOQ0QS\nRChdjI4GpmLXIvgKGAdsjGRQHnUfFREJk8YR1EDvv/8+Xbt2pWnTprEORUTiQKTGEWwFfvT+FgBr\nwo5Mqp3v2sGXX345GzdG4wRNRJJVKIngCOzqZEcAbYElIa73UeAj4H2gTcDrlwBLgQ+B6cT3mUnc\nCewR1L1791iHJCIJLJTGYn/fAh1CWG4w1t00AzgJuNd7DqwX0mSgM9bw/DxwLqDWzQo45xgzZgxv\nvvmmZgoVkWoTSiJ4we/+kVgVUUX6AG979z8Gevi9lkfp3kd1UJfUkKSkpNC7d2/uuOMOdQkVkWoT\nSiJ4CdiS+nqUAAAVcElEQVSOVd/sAT4N4T1NgJ1+jwux6qIibDzCz97z1wANgfdCjLfG+/Of/xzr\nEEQkyYSSCG7CjvDDsRNo7PfYlwT8H98NHAdcUN5KJk2aVHw/MzOTzMzMMMMQEUlu2dnZZGdnV2kd\noTTSvomNLF6LHc07YEEF7zkfGAQMx655fCsw0O/1x7GqobHe+spSY7uP5ubmMmHCBC6//HJOOumk\nWIcjIgkkUt1HfwO6Av8D/BHr8VOROVhBvxhrKL7ee99IoBvwF6yxOAvrVTS47NXUPL4eQXv27KFd\nu3axDkdEaoBgWWM2cHG0AilDjToj0LWDRaQ6VOaMIFgbwaFVikZC5pzj1FNPpXPnzrp2sIhEXbCs\n8S3wXBnLOGBixCLy205NOiPYtm0bhx12WKzDEJEEV91nBLuxBmKJAiUBEYmVYIlgK/BUtAKpKXJz\nc2nQoAG1a9eOdSgiIkDwXkPLohZFDeHrEfTuu+/GOhQRkWLxPNlb0rQR5OTkMH78eObNm6ceQSIS\nUZEaRyBVkJWVRXp6OgUFBbp2sIjEJZ0RRJBzjksvvZTLLruMs88+O9bhiEgNoCuUiYjUcKoaEhGR\nsCkRVJOsrCx+/PHHWIchIhI2JYIqysnJYdSoUQwdOpTvv/8+1uGIiIRNiaAKAnsE9ezZM9YhiYiE\nLdxrFgvWG2js2LG8/vrrPPbYY+oRJCIJTb2GKum1117j1FNP1UyhIhJX1H1UpBzNmjVj+/btsQ5D\npNo0bdqU3377bb/nlQhEypGSkoJ+T5JMyvtNaxxBNfP1CHrvvfdiHYqISMQoEZTDv0dQjx49Yh2O\niEjEqNdQAP+ZQtUjSERqAiWCAAMHDqRt27a6drCI1BiqGgrwxhtvMHPmTCUBiZpatWqRnp5Ot27d\n6N69O+3bt6dnz54sW1Zybahdu3Yxbtw42rdvT3p6Ol26dOGWW24hLy+v1LqeeuopMjIy6NatG506\ndeLKK69kx44d0f5IYbn55ptZsGBBrMMI6s4776RDhw60bduW22+/vdzlbr31Vjp27EhaWhrDhg0j\nPz8fgBUrVnDyySeTnp5O7969ef/994vfc+ONN9KqVSu6detGt27duOSSSwB48803mTx5cmQ/WAJw\nItUlnn9PKSkp7tdffy313NSpU13v3r2dc87t3bvX9erVy914441uz549zjnndu/e7a699lrXr18/\nt2/fPuecc3fccYfr27ev27ZtW/H7rr76ate3b98ofprwLFmyxJ133nmxDiOo+fPnu27durndu3e7\nvLw8d8opp7jZs2fvt9y7777r2rdv7/Ly8pxzzg0ZMsTdc889zjnnWrVq5Z599lnnnHMbN250Rx99\ntPvpp5+cc8717t3bLVmypMxtDxgwwC1fvrzM18r7TQNJ1T0uvG8rTDt37iz+wiT5Rfr3VBUpKSnu\nl19+KX68d+9ed80117hzzz3XOefc888/73r16lXme7t27epefvlll5ub6xo1auTWrVtX6vXdu3e7\n559/3hUUFOz33rlz57quXbu69PR017t3b7dixQq3ceNG16hRo+Jl/B8/+eST7uSTT3bdu3d3/fv3\ndxkZGe6VV14pXnbChAluwoQJzjnn/vWvf7kTTjjBdevWzZ1++uluzZo1ZcZ/5plnuvnz5zvnnCss\nLHRjx451J510kuvYsaPr0KGDW7x4sXPOuaFDh7pBgwa5Tp06uZtvvtkVFBS46667znXv3t116dLF\nDRs2zO3cubP4c2VkZLgePXq4li1bultvvbXMbWdkZLiuXbuWuo0ZM2a/5a644go3derU4sezZs0q\nM3ktXbrUHXPMMe63335z+fn57uyzz3aPPPKI+/nnn12dOnVKLdu/f383a9Ysl5eX5+rXr+/OP/98\n16VLF3fBBRe47777rni5F1980Q0ZMqTM+Mv7TaNEEJqFCxe61q1bu1dffTVi25D4UtHvCarnVhkp\nKSkuLS3NdenSxTVv3twde+yx7tprr3U///yzc865MWPGuPHjx5f53htvvNFde+217tNPP3WHHXZY\nyNvcunWrO+igg9yKFSucc8699tpr7pxzznGbNm0KmgiaNWvmcnJyih/7ktW+fftcixYt3Lp161x2\ndrbr16+f2717t3POuXfeecd17Nhxvxi2b9/uGjZs6Pbu3eucs7ODiy++uPj1u+66yw0aNMg5Z4lg\nwIABxa/dfvvt7qabbip+/Le//c2NHj3aOWeFrC8hbt682dWpU2e/M65wnHXWWe6ll14qfvzuu++6\n7t27l7nsiBEjXKNGjVyzZs1cRkZG8Wdr06aNmzVrlnPOudWrV7sDDzzQTZkyxW3cuNENHDjQff31\n18455+655x7XrVu34vXt3LnTHXDAAWUetJb3m6YSiaBGNRarR5CUJ9ZjzbKzs2nWrBnLly/n7LPP\npnfv3hxyyCGADRAqKCgo8315eXmkpqZSu3ZtioqKQt7e4sWL6dy5M+np6QAMGTKEIUOGsGnTpqDv\nS09Pp1GjRgBcdNFFjBs3jp9++olly5bRtm1b2rRpw4wZM1i3bh0ZGRnF79u+fTu///57qba3devW\nceSRR1KnjhVDvXr1YvLkyUyfPp0NGzaQnZ1NkyZNivfBySefXPzeefPmsWPHDt59910ACgoKOPzw\nwwGYO3cuc+fO5bnnnuOrr77COceuXbto1qxZqc+SkZHBnj17Sj3Xp08fHnnkkVLPlbVfa9euvd9z\nd999N5s2bWLr1q3UrVuX4cOHc8MNN/DQQw/xxhtvMG7cOO677z569uzJGWecQWp
qKq1bt2bevHnF\n6xg3bhyTJ0/m22+/pVWrVjRu3JgmTZrw7bffcvzxx5f3tVRZjWksDpwpVElA4lHXrl25//77ueKK\nK/j2228BK5wWLVq03yjSoqIiFi1aREZGBh07dmTv3r2sX7++1DJ5eXmcc845bN26tdTzdevW9Y1A\nLbZq1ar9RqsGJiBfEgBo2LAhF110Ec8//zyzZs1i5MiRxXFddtllfP7553z++ed89tlnLF26dL8O\nGLVq1aKwsLD48fz58xk4cCC1atVi8ODBXHXVVaUK4YYNG5b67A899FDxNj7++GNmz57Nrl276Nq1\nK8uXL+eEE07gnnvuoW7dumWOwP3oo4+K3++7BSYBgJYtW7Jly5bix5s3b6ZFixb7Lbdo0SIuu+wy\nGjZsSGpqKiNHjizVKDxv3jxWrFjB448/zg8//MBxxx3HypUreeaZZ4qXcc7hnKNu3brFzxUWFpaZ\neGqKcM/ggrryyivdW2+9Va3rlMRR3b+n6lRWY/GAAQPc4MGDnXNW7dK3b183duzYUo3Fo0ePdn36\n9CluLL7zzjtdv379ihsh8/Ly3JVXXukyMzP32+bWrVvdIYcc4lavXu2cc27OnDkuPT3d7dixw9Wt\nW9d9+eWXzjnnpkyZUqpqyFcV5LNs2TKXlpbmWrRo4fLz851zVhXUsmVL9+OPPzrnnJsxY4Zr167d\nfjH4qoZ877vuuuvc9ddf75xzbs+ePW7gwIHFDd1Dhw4tVU8/ceJEN3DgQJefn+8KCwvdpZde6kaO\nHOmWL1/uDj/88OI2kWeeecalpKS4DRs2VPQ1lGvu3LmuR48ebteuXS4vL8/179/fPf300/stN3ny\nZDdo0CC3b98+V1RU5K655ho3cuRI55w1CPvaUxYsWOCaN2/udu/e7VauXOmaNm3qNm7c6Jxzbtq0\nae7kk08uXufvv//uGjVqVGYbT3m/adRGIFK2eP491apVa79EsHbtWlevXj23YMEC55wVjBMnTnQd\nO3Z0nTt3dh06dHA333xzcT28z4MPPljc8Nm+fXs3atQot2PHjjK3+84777gePXq4rl27ulNOOcV9\n9dVXzjnn7r//fteqVSt34oknurvvvts1btzYOWeNpL46e3/du3d31157bannpk2b5jp37uzS09Nd\n3759ixNLoLPOOsv9+9//ds45t2bNGte9e3fXtWtX179/f/fAAw+4o48+2hUVFblhw4a5e++9t/h9\ne/bscVdffbXr2LGja9++vbvkkktcTk6OKyoqciNGjHBt2rRxJ598spswYYI78cQTi/djZd15552u\nU6dOrm3btqXaJh599FF3xRVXOOecy8/Pd2PGjHHt2rVzaWlp7rLLLituwF61apXr1auX69y5s+vd\nu7f77LPPitfx7LPPFn+nZ5xxhvv++++LX5s9e7a75JJLyoypvN80lUgEmnROagRNOheflixZwh13\n3FGqnlxKnHbaaTz44IN07tx5v9c06VwQWVlZ+9WTikh86t27N+3ateOdd96JdShx5/XXX6dfv35l\nJoHqljRnBP49gl566aVSPRZEdEYgyUZnBAECewQpCYiIhC7hxxHccMMNvPzyyxoXICJSSQlfNfT2\n22/Tq1cvTRInQalqSJJNdVYNJXwiEAmFrlksyUbXLBYRkTLFU2NxLeBR4CPgfaBNwOuDgE+816+o\naGW+awe/+uqr1R1nQsjOzo51CHFD+6KE9kUJ7YuqiVQiGAykAhnAzcC9fq/VBe4DBgCnAH8FDitv\nRf49gk477bQIhRvf9CMvoX1RQvuihPZF1USq11Af4G3v/seA/9XfOwDrAN9lkz4E+gGvBK5k1KhR\nmilURCTCIpUImgA7/R4XYmcfRd5r/tfOywEOLGsl+fn5unawiEiCuhe4yO/x937304D5fo/vA84v\nYx3rsMmTdNNNN910C/22jjhxPvCkd78XpQv+usDXQFOsHeFT4MioRiciIhGXAkwHFnu344FLgJHe\n6+divYY+BUbFIkAREREREYkT1TrmIMFVtC8uAZZiPa2mE98DAquqon3h8xhwV7SCipGK9sWJwCLg\nA+BFrMo1WVW0L4YA/8XKjKuiG1pMnITth0AJV26eDzzh3T8JeN3vtbrAN1ivorrYByt3zEESCLYv\nGmCNQPW9x89jX3ayCrYvfK7Efuh3RiuoGAm2L1KAz4FjvccjgXbRCy3qKvpdbAQOonTZkazGA19g\n/wP+wi4342Ea6lDHHOylZMxBsgq2L/KA3t5fsK6/e6IXWtQF2xdggxV7AjNI7jMjCL4vjgd+BW4A\nsrFCcG00g4uyin4Xe7F90AD7XbjohRZ167DEGPj7D7vcjIdEUN6YA99rIY05SBLB9oUDfvbuXwM0\nBN6LXmhRF2xfHAncBowh+ZMABN8Xh2BJ8WHgdOA0oH9Uo4uuYPsCrOv6MmAVMDdg2WTzGrCvjOfD\nLjfjIRHsBBr7PfYNPAP7MP6vNQaSeQrJYPvC93gq9s9+QRTjioVg++JCrAB8C5gA/Am4PKrRRVew\nffErdvS3FisU3mb/o+RkEmxftMQODloBrYHDsd9KTRN2uRkPiWAxcI53vxdW5+WzBmhLyZiDfsCS\nqEYXXcH2BVg1SD2sQSyP5BZsXzyMFXb9gSlYe8nTUY0uuoLtiw1AI0oaTftiR8PJKti+qI+dIeRj\nyWEbVk1U0yRkuakxByWC7Ytu2I/8fb/b4NiEGRUV/S58hpL8jcUV7Yv+WH35J8D9sQgwiiraF9dj\nvYY+wAa1JvxVGCvQmpLG4ppaboqIiIiIiIiIiIiIiIiIiIiIiIiIiEhoWmOjNP3HJdwaZPlZwJlV\n2N4m4D9AFjYnzqvYQKhwTMBm2qwHjPCeG0rVJt/zxfU+NoPnZ8AJFbxnTBW2V5GDsdk1fQ7A+suH\nM5FcLezKfwuwz/YWcEw1xfcC1jf/WGzg0ixs7MLR2ACmS4K89yzgL9UUh4hUg9aEN9rwSeCMKmxv\nI6WnR56CzZlUGa2pvpGSgXGdgc1RE8yP1bTtskwHOnv3e2ADgrZgA6dCdQ5WYPv8gbJnb62Ky7Hp\nTvxlBmy3LG9RevoDiQPxMMWExJdawL+wOWtWAJP9XkvBCqTF2FH9IqCF99pd2GjOjyh/fpcUv78H\nYZNh1QGe9da5FLjYW2a09/gj4EHvuVnYWcn/AzpiZzB/x6ajvpeS+YaOwArQcOMCSzK/efcvxM5g\nPvA+68HetpsBj3ixz8SOuj8ATilj3TdSMi/8FO+5SdjR+mKgvd+yTbDC3zdFRCo2ejzc2US3eeu5\nGJuT6Q1KPvtHWFL/EJiDzdJZt5zP4Rud+l9KZnndhB39T8SuS34VdjbVDts3p2KjW7/BzhDARrbe\n5N1/CxgW5ucRkQhpjU1O5V811BybsMtX7VKfkhlPn8QK4dFYoVsHm9qgE3A2JUeC9bH58QNnO9xI\nSdXQQmxaiNpYNcu93jKNsOtZH4wVQL
4qmqu8ZX1nJa0oOSPwJYIO3nrBCp3RYcb1MfA98DhwqPfa\n37CCEqy65k/efd8ZwShKCveD2X+OnzQsmdX2Hr8KDPRiLmsqiDOwpBjofcI7IwA7On8e2IolRd80\nxOu9uMCO6K+n7M9RG9s3h3jPj8MSwEasas5/ag9ffKdQsr8nUTK1wYeU7NN+2H6QOJLs83BIcF+y\n/5TFTbB6+P5YG0I9v9ccduQ4ATtj2IEdGaZhhbbvSkl1sMI6cNK8AUBBwHPtKZlOO9eLqQ0wHCt8\njsEKff+j9sCppx3wlbfdltiR8GlYAgknrju87fmS38/AU15c7dn/AiBpwMnYBVLACs9mlJxRtMMS\nQaH3+AMscYIlvEAHAz+V8Xx55mLJcyUwNiCuNZQkrgHAbOxMaZu3PFgBfSY2QVvfgM9xBDZj5S/e\nc4HVQCns/534P34Cu1raIu8z+fbpVuxzShxR1ZAEGgb8DvwZa3A8wO+1FKy++QNs7vtXsKTwFVbY\n9scKnZexWTFD8RVWCIHVHadhR50jsYI8E5twL8MvBv856P0LoJnAPcBqLImFG9ct2FnRaOzMYRLw\nP14seyhdteWL/QVv/X/AClv/6X7XYIVrbe89/ShJAP7Ti/uEO1vmIG/bYwOePx2r0vPF+SWWzMCO\n8Ft79/tgSWFNGZ9jixeLr3rnfuwAIRj/7+U77Hf0/7CqRp+m2OeUOKJEULOVdfWm97DeHe8CN2PV\nCs39lv8U+F+sGuZK4CHsyDQXO/r7BCvkcimtvCtFPYYdIX6AFdqTsKPHld5zC7Ejyo/91rMNqz+f\n4j32rfsVrHrFV/CEG5fDru96C3bhn8XY2cgcrAD37YcvsWmvZ2BnCtne7buA9a3CCtXFXvwbKWm0\nLWt/LAW6lPF8uB7CPudybB8+D1zmvbYPazf5EJuv/7Egn2M0MN9bRy2srcAXt/9+9z32VTv5EtPj\nWJJ/22+5k0juCyqJiFTZdKBrBNe/suJFqs2FWGL392/C7zYsIlKjHIodpUdKYPtIpNyJnQk19Xvu\nHEo6IoiIiIiIiIiIiIiIiIiIiIiIiIiIiEhs/H+6cLnHafMyWwAAAABJRU5ErkJggg==\n", "text": [ "" ] } ], "prompt_number": 82 }, { "cell_type": "code", "collapsed": false, "input": [ "gs.best_params_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 81, "text": [ "{'clf__max_depth': 3, 'clf__max_features': 0.5, 'imp__strategy': 'mean'}" ] } ], "prompt_number": 83 }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this search we can conclude that the imputation by the 'mean' strategy is generally a slightly better imputation strategy when training a GBRT model on this data." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Further integrating sklearn and pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Helper tool for better sklearn / pandas integration: https://github.com/paulgb/sklearn-pandas by making it possible to embed the feature construction from the raw dataframe directly inside a pipeline." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Credits" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks to:\n", "\n", "- Kaggle for setting up the Titanic challenge.\n", "\n", "- This blog post by Philippe Adjiman for inspiration:\n", "\n", "http://www.philippeadjiman.com/blog/2013/09/12/a-data-science-exploration-from-the-titanic-in-r/" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 84 } ], "metadata": {} } ] }