{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Logistic Regression"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\piers\\Anaconda3\\lib\\site-packages\\sklearn\\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
" \"This module will be removed in 0.20.\", DeprecationWarning)\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sb\n",
"import matplotlib.pyplot as plt\n",
"import sklearn\n",
"\n",
"from pandas import Series, DataFrame\n",
"from pylab import rcParams\n",
"from sklearn import preprocessing\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.cross_validation import train_test_split\n",
"from sklearn import metrics \n",
"from sklearn.metrics import classification_report"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"rcParams['figure.figsize'] = 10, 8\n",
"sb.set_style('whitegrid')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logistic regression on the titanic dataset\n",
"The first thing we are going to do is to read in the dataset using the Pandas' read_csv() function. We will put this data into a Pandas DataFrame, called \"titanic\", and name each of the columns."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" PassengerId | \n",
" Survived | \n",
" Pclass | \n",
" Name | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Ticket | \n",
" Fare | \n",
" Cabin | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
" 0 | \n",
" 3 | \n",
" Braund, Mr. Owen Harris | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" A/5 21171 | \n",
" 7.2500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" | 1 | \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
" Cumings, Mrs. John Bradley (Florence Briggs Th... | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" PC 17599 | \n",
" 71.2833 | \n",
" C85 | \n",
" C | \n",
"
\n",
" \n",
" | 2 | \n",
" 3 | \n",
" 1 | \n",
" 3 | \n",
" Heikkinen, Miss. Laina | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" STON/O2. 3101282 | \n",
" 7.9250 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
" | 3 | \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
" Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 113803 | \n",
" 53.1000 | \n",
" C123 | \n",
" S | \n",
"
\n",
" \n",
" | 4 | \n",
" 5 | \n",
" 0 | \n",
" 3 | \n",
" Allen, Mr. William Henry | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 373450 | \n",
" 8.0500 | \n",
" NaN | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = 'https://raw.githubusercontent.com/BigDataGal/Python-for-Data-Science/master/titanic-train.csv'\n",
"titanic = pd.read_csv(url)\n",
"titanic.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']\n",
"titanic.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just a quick fyi (we will examine these variables more closely in a minute):\n",
"\n",
"##### VARIABLE DESCRIPTIONS\n",
"\n",
"Survived - Survival (0 = No; 1 = Yes)
\n",
"Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
\n",
"Name - Name
\n",
"Sex - Sex
\n",
"Age - Age
\n",
"SibSp - Number of Siblings/Spouses Aboard
\n",
"Parch - Number of Parents/Children Aboard
\n",
"Ticket - Ticket Number
\n",
"Fare - Passenger Fare (British pound)
\n",
"Cabin - Cabin
\n",
"Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking that your target variable is binary\n",
"Since we are building a model to predict survival of passangers from the Titanic, our target is going to be \"Survived\" variable from the titanic dataframe. To make sure that it's a binary variable, let's use Seaborn's countplot() function."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAmIAAAHfCAYAAADz6rTQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGVlJREFUeJzt3X+MXQWd9/HPtMO0OP0FJhhjU6BIBTUjpaTFh6W7LGhh\nFaKIox0zmmiIVAlpVVKU0uLCCg1a/LWK2+UPnAp1BGKI8WcbtKHVUcagobFaG0VFg+CPODPoTNu5\nzx9PHODBlkG48522r9dfc8+555zvbXJu3j135tyWRqPRCAAAE25K9QAAAEcqIQYAUESIAQAUEWIA\nAEWEGABAESEGAFCktXqAf0Z/f3/1CAAA47Zo0aJ/uPyQDLHkwC8IAGAyOdgFJB9NAgAUEWIAAEWE\nGABAESEGAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWE\nGABAESEGAFBEiAEAFBFiAABFhBgAQJHW6gEOBfdfcVn1CHBEOuOTt1SPANBUrogBABQRYgAARYQY\nAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQY\nAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQY\nAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFCktZk7f+Mb35gZM2YkSebOnZvLLrssV111VVpaWnLy\nySdn3bp1mTJlSnp7e7N58+a0trZmxYoVOeecc5o5FgDApNC0EBseHk6j0UhPT8/YsssuuywrV67M\nkiVLsnbt2mzdujWnnXZaenp6ctddd2V4eDhdXV0566yz0tbW1qzRAAAmhaaF2K5du/LXv/4173zn\nO7Nv3768733vy86dO7N48eIkydKlS7N9+/ZMmTIlCxcuTFtbW9ra2jJv3rzs2rUrHR0dzRoNAGBS\naFqITZ8+Pe9617vy5je/Ob/85S9z6aWXptFopKWlJUnS3t6egYGBDA4OZubMmWPbtbe3Z3BwsFlj\nAQBMGk0LsRNPPDHHH398WlpacuKJJ2bOnDnZuXPn2PqhoaHMmjUrM2bMyNDQ0FOWPznMDqS/v78p\ncwOTh/McONw1LcTuvPPO/OxnP8u1116bRx55JIODgznrrLPS19eXJUuWZNu2bTnzzDPT0dGRj3/8\n4xkeHs7IyEj27NmTBQsWPOP+Fy1a1KzRn+b+2zZO2LGAJ0zkeQ7QLAf7T2XTQuySSy7JBz/4wSxf\nvjwtLS35yEc+kmOOOSbXXHNNNmzYkPnz52fZsmWZOnVquru709XVlUajkVWrVmXatGnNGgsAYNJo\naTQajeohnq3+/v6JvSJ2xWUTdizgCWd88pbqEQCes4N1ixu6AgAUEWIAAEWEGABAESEGAFBEiAEA\nFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEA\nFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEA\nFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEA\nFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEA\nFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEA\nFBFiAABFhBgAQBEhBgBQpKkh9oc//CH/+q//mj179uShhx7K8uXL09XVlXXr1mV0dDRJ0tvbm4sv\nvjidnZ259957mzkOAMCk0rQQ27t3b9auXZvp06cnSW644YasXLkyt99+exqNRrZu3ZpHH300PT09\n2bx5c2699dZs2LAhIyMjzRoJAGBSaVqIrV+/Pm9961tz3HHHJUl27tyZxYsXJ0mWLl2aHTt25Mc/\n/nEWLlyYtra2zJw5M/PmzcuuXbuaNRIAwKTS2oyd3n333Tn22GNz9tln53/+53+SJI1GIy0tLUmS\n9vb2DAwMZHBwMDNnzhzbrr29PYODg+M6Rn9///M/ODCpOM+Bw11TQuyuu+5KS0tLvvvd7+YnP/lJ\nVq9enT/+8Y9j64eGhjJr1qzMmDEjQ0NDT1n+5DA7mEWLFj3vcx/I/bdtnLBjAU+YyPMcoFkO9p/K\npnw0+YUvfCGbNm1KT09PTj311Kxfvz5Lly5NX19fkmTbtm0544wz0tHRkf7+/gwPD2dgYCB79uzJ\nggULmjESAMCk05QrYv/I6tWrc80112TDhg2ZP39+li1blqlTp6a7uztdXV1pNBpZtWpVpk2bNlEj\nAQCUanqI9fT0jP28adOmp63v7OxMZ2dns8cAAJh03NAVAKCIEAMAKCLEAACKCDEAgCJCDACgiBAD\nACgixAAAiggxAIAiQgwAoIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBAD\nACgixAAAiggxAIAiQgwAoIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBAD\nACgixAAAiggxAIAiQgwAoIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKtFYPAHCkumzH/dUj\nwBHplv9zRvUIY1wRAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBADACgixAAA\niggxAIAiQgwAoIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBADACgixAAA\niggxAIAiQgwAoIgQAwAoIsQAAIoIMQCAIq3N2vH+/fuzZs2a/OIXv0hLS0s+/OEPZ9q0abnqqqvS\n0tKSk08+OevWrcuUKVPS29ubzZs3p7W1NStWrMg555zTrLEAACaNcV0Ru+666562bPXq1Qfd5t57\n702SbN68OStXrszNN9+cG264IStXrsztt9+eRqORrVu35tFHH01PT082b96cW2+9NRs2bMjIyMg/\n8VIAAA4tB70idvXVV+fXv/51HnzwwezevXts+b59+zIwMHDQHZ933nn5t3/7tyTJb3/728yaNSs7\nduzI4sWLkyRLly7N9u3bM2XKlCxcuDBtbW1pa2vLvHnzsmvXrnR0dDzHlwYAMLkdNMRWrFiRhx9+\nOP/1X/+Vyy+/fGz51KlTc9JJJz3zzltbs3r16nzrW9/KJz/5yWzfvj0tLS1Jkvb29gwMDGRwcDAz\nZ84c26a9vT2Dg4PPuO/+/v5nfA5waHOeA80wmd5bDhpic+fOzdy5c3PPPfdkcHAwAwMDaTQaSZLH\nH388c+bMecYDrF+/Ph/4wAfS2dmZ4eHhseVDQ0OZNWtWZsyYkaGhoacsf3KYHciiRYue8TnPl/tv\n2zhhxwKeMJHneYWNO+6vHgGOSBP93nKw8BvXL+t/7nOfy+c+97mnhFdLS0u2bt16wG2+/OUv55FH\nHsm73/3uHH300WlpackrX/nK9PX1ZcmSJdm2bVvOPPPMdHR05OMf/3iGh4czMjKSPXv2ZMGCBc/i\n5QEAHJrGFWJf+tKXsmXLlhx77LHj3vFrX/vafPCDH8zb3va27Nu3Lx/60Idy0kkn5ZprrsmGDRsy\nf/78LFu2LFOnTk13d3e6urrSaDSyatWqTJs27Z9+QQAAh4pxhdiLX/zizJ49+1nt+AUveEE+8YlP\nPG35pk2bnrass7MznZ2dz2r/AACHunGF2AknnJCurq4sWbIkbW1tY8uf/Av8AAA8O+MKsRe96EV5\n0Yte1OxZAACOKOMKMVe+AACef+MKsVNOOWXs/l9/d9xxx+U73/lOU4YCADgSjCvEdu3aNfbz3r17\ns2XLljzwwANNGwoA4Egwru+afLKjjjoqF1xwQb73ve81Yx4AgCPGuK6IffnLXx77udFoZPfu3Tnq\nqKOaNhQAwJFgXCHW19f3lMfHHHNMbr755qYMBABwpBhXiN1www3Zu3dvfvGLX2T//v05+eST09o6\nrk0BADiAcdXUgw8+mCuuuCJz5szJ6OhoHnvssfz3f/93XvWqVzV7PgCAw9a4Quz666/PzTffPBZe\nDzzwQK677rrceeedTR0OAOBwNq6/mnz88cefcvXrtNNOy/DwcNOGAgA4EowrxGbPnp0tW7aMPd6y\nZUvmzJnTtKEAAI4E4/po8rrrrsu73/3uXH311WPLNm/e3LShAACOBOO6IrZt27YcffTRuffee3Pb\nbbfl2GOPzfe///1mzwYAcFgbV4j19vbmjjvuyAte8IKccsopufvuu7Np06ZmzwYAcFgbV4jt3bv3\nKXfSd1d9AIDnbly/I3beeeflHe94Ry644IIkyTe/+c2ce+65TR0MAOBwN64Qu/LKK/P1r389P/jB\nD9La2pq3v/3tOe+885o9GwDAYW3c31N0/vnn5/zzz2/mLAAAR5Rx/Y4YAADPPyEGAFBEiAEAFBFi\nAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEAFBFi\nAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEAFBFi\nAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABAESEGAFBEiAEAFBFi\nAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEVam7HTvXv35kMf+lAefvjh\njIyMZMWKFXnpS1+aq666Ki0tLTn55JOzbt26TJkyJb29vdm8eXNaW1uzYsWKnHPOOc0YCQBg0mlK\niN1zzz2ZM2dObrrppvz5z3/OG97whpxyyilZuXJllixZkrVr12br1q057bTT0tPTk7vuuivDw8Pp\n6urKWWedlba2tmaMBQAwqTQlxM4///wsW7YsSdJoNDJ16tTs3LkzixcvTpIsXbo027dvz5QpU7Jw\n4cK0tbWlra0t8+bNy65du9LR0dGMsQAAJpWmhFh7e3uSZHBwMFdccUVWrlyZ9evXp6WlZWz9wMBA\nBgcHM3PmzKdsNzg4OK5j9Pf3P/+DA5OK8xxohsn03tKUEEuS3/3ud3nve9+brq6uXHjhhbnpppvG\n1g0NDWXWrFmZMWNGhoaGnrL8yWF2MIsWLXreZz6Q+2/bOGHHAp4wked5hY077q8eAY5IE/3ecrDw\na8pfTT722GN55zvfmSuvvDKXXHJJkuTlL395+vr6kiTbtm3LGWeckY6OjvT392d4eDgDAwPZs2dP\nFixY0IyRAAAmnaZcEbvlllvyl7/8JZ/5zGfymc98Jkly9dVX5/rrr8+GDRsyf/78LFu2LFOnTk13\nd3e6urrSaDSyatWqTJs2rRkjAQBMOk0JsTVr1mTNmjVPW75p06anLevs7ExnZ2czxgAAmNTc0BUA\noIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBADACgixAAAiggxAIAiQgwA\noIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBADACgixAAAiggxAIAiQgwA\noIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBADACgixAAAiggxAIAiQgwA\noIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBADACgixAAAiggxAIAiQgwA\noIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKCDEAgCJCDACgiBADACgixAAAiggxAIAiQgwA\noIgQAwAoIsQAAIoIMQCAIkIMAKCIEAMAKCLEAACKNDXEfvSjH6W7uztJ8tBDD2X58uXp6urKunXr\nMjo6miTp7e3NxRdfnM7Oztx7773NHAcAYFJpWoht3Lgxa9asyfDwcJLkhhtuyMqVK3P77ben0Whk\n69atefTRR9PT05PNmzfn1ltvzYYNGzIyMtKskQAAJpWmhdi8efPyqU99auzxzp07s3jx4iTJ0qVL\ns2PHjvz4xz/OwoUL09bWlpkzZ2bevHnZtWtXs0YCAJhUWpu142XLluU3v/nN2ONGo5GWlpYkSXt7\newYGBjI4OJiZM2eOPae9vT2Dg4Pj2n9/f//zOzAw6TjPgWaYTO8tTQux/9+UKU9cfBsaGsqsWbMy\nY8aMDA0NPWX5k8PsYBYtWvS8z3gg99+2ccKOBTxhIs/zCht33F89AhyRJvq95WDhN2F/Nfnyl788\nfX19SZJt27bljDPOSEdHR/r7+zM8PJyBgYHs2bMnCxYsmKiRAABKTdgVsdWrV+eaa67Jhg0bMn/+\n/CxbtixTp05Nd3d3urq60mg0smrVqkybNm2iRgIAKNXUEJs7d256e3uTJCeeeGI2bdr0tOd0dnam\ns7OzmWMAAExKbugKAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAU\nEWIAAEWEGABAESEGAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAU\nEWIAAEWEGABAESEGAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAU\nEWIAAEWEGABAESEGAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAU\nEWIAAEWEGABAESEGAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAEARIQYAUESIAQAU\nEWIAAEWEGABAESEGAFBEiAEAFBFiAABFhBgAQBEhBgBQRIgBABQRYgAARYQYAECR1uoBkmR0dDTX\nXnttfvrTn6atrS3XX399jj/++OqxAACaalJcEduyZUtGRkbyxS9+Me9///tz4403Vo8EANB0kyLE\n+vv7c/bZZydJTjvttDz44IPFEwEANN+k+GhycHAwM2bMGHs8derU7Nu3L62tBx6vv79/IkZLkrS8\n49IJOxbwhIk8zytcOq2legQ4Ik2m95ZJEWIzZszI0NDQ2OPR0dGDRtiiRYsmYiwAgKaaFB9Nnn76\n6dm2bVuS5IEHHsiCBQuKJwIAaL6WRqPRqB7i7381+bOf/SyNRiMf+chHctJJJ1WPBQDQVJMixAAA\njkST4qNJAIAjkRADACgixDgsjY6OZu3atXnLW96S7u7uPPTQQ9UjAYeRH/3oR+nu7q4eg8PApLh9\nBTzfnvxtDQ888EBuvPHGfPazn60eCzgMbNy4Mffcc0+OPvro6lE4DLgixmHJtzUAzTJv3rx86lOf\nqh6Dw4QQ47B0oG9rAHiuli1bdtCbjsOzIcQ4LD3bb2sAgApCjMOSb2sA4FDgEgGHpde85jXZvn17\n3vrWt459WwMATDburA8AUMRHkwAARYQYAEARIQYAUESIAQAUEWIAAEWEGHDI+vrXv56LL744F110\nUS688ML87//+73Pe5x133JE77rjjOe+nu7s7fX19z3k/wOHNfcSAQ9IjjzyS9evX5+67784xxxyT\noaGhdHd358QTT8y55577T+93+fLlz+OUAAcnxIBD0p/+9Kfs3bs3f/vb35Ik7e3tufHGGzNt2rT8\n+7//ez7/+c9n7ty56evry6c//en09PSku7s7s2fPzu7du3PhhRfmj3/8Y9auXZskWb9+fY477rgM\nDg4mSWbPnp1f/vKXT1vf2dmZ//zP/8zu3buzf//+XHrppXn961+fkZGRXH311XnwwQfzkpe8JH/6\n059q/mGAQ4qPJoFD0imnnJJzzz035513Xi655JLcdNNNGR0dzfHHH3/Q7V72spflG9/4RpYvX54t\nW7Zk//79aTQa+cY3vpHXve51Y8973ete9w/Xf/azn80rXvGK3H333fnCF76QW265Jb/+9a/T09OT\nJPna176WNWvW5Fe/+lVTXz9weHBFDDhkffjDH8573vOe3HfffbnvvvvS2dmZj370owfdpqOjI0ny\nwhe+MKeeemr6+vpy1FFH5YQTTshxxx039rwDrd+xY0f+9re/5a677kqSPP7449m9e3e+//3v5y1v\neUuS5IQTTsjChQub9KqBw4kQAw5J3/72t/P444/nP/7jP/KmN70pb3rTm9Lb25s777wzSfL3b2/b\nt2/fU7abPn362M8XXXRRvvrVr+aoo47KRRdd9LRj/KP1o6Ojuemmm/KKV7wiSfLYY49l9uzZ6e3t\nzejo6Ni2ra3eXoFn5qNJ4JA0ffr0fOxjH8tvfvObJP8vvH7+85/n1FNPzTHHHJOf//znSZKtW7ce\ncB/nnntufvCDH+S+++7La17zmnGtP/PMM8f+qvL3v/99Lrroovzud7/Lq1/96nzlK1/J6OhoHn74\n4fzwhz98vl8ycBjyXzbgkHTmmWfm8ssvz2WXXZa9e/cmSc4+++y8973vzemnn57rrrsun/70p/Mv\n//IvB9zH9OnTc/rpp2dkZCTt7e3jWn/55Zfn2muvzetf//rs378/V155ZebNm5eurq7s3r07F1xw\nQV7ykpdkwYIFzXnhwGGlpfH36/cAAEwoH00CABQRYgAARYQYAEARIQYAUESIAQAUEWIAAEWEGABA\nESEGAFDk/wKEO3lD04KFOAAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sb.countplot(x='Survived',data=titanic, palette='hls')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, so we see that the Survived variable is binary (0 - did not survive / 1 - survived)\n",
"\n",
"### Checking for missing values\n",
"It's easy to check for missing values by calling the isnull() method, and the sum() method off of that, to return a tally of all the True values that are returned by the isnull() method."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Name 0\n",
"Sex 0\n",
"Age 177\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well, how many records are there in the data frame anyway?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 891 entries, 0 to 890\n",
"Data columns (total 12 columns):\n",
"PassengerId 891 non-null int64\n",
"Survived 891 non-null int64\n",
"Pclass 891 non-null int64\n",
"Name 891 non-null object\n",
"Sex 891 non-null object\n",
"Age 714 non-null float64\n",
"SibSp 891 non-null int64\n",
"Parch 891 non-null int64\n",
"Ticket 891 non-null object\n",
"Fare 891 non-null float64\n",
"Cabin 204 non-null object\n",
"Embarked 889 non-null object\n",
"dtypes: float64(2), int64(5), object(5)\n",
"memory usage: 83.6+ KB\n"
]
}
],
"source": [
"titanic.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, so there are only 891 rows in the titanic data frame. Cabin is almost all missing values, so we can drop that variable completely, but what about age? Age seems like a relevant predictor for survival right? We'd want to keep the variables, but it has 177 missing values. Yikes!! We are going to need to find a way to approximate for those missing values!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Taking care of missing values\n",
"##### Dropping missing values\n",
"So let's just go ahead and drop all the variables that aren't relevant for predicting survival. We should at least keep the following:\n",
"- Survived - This variable is obviously relevant.\n",
"- Pclass - Does a passenger's class on the boat affect their survivability?\n",
"- Sex - Could a passenger's gender impact their survival rate?\n",
"- Age - Does a person's age impact their survival rate?\n",
"- SibSp - Does the number of relatives on the boat (that are siblings or a spouse) affect a person survivability? Probability\n",
"- Parch - Does the number of relatives on the boat (that are children or parents) affect a person survivability? Probability\n",
"- Fare - Does the fare a person paid effect his survivability? Maybe - let's keep it.\n",
"- Embarked - Does a person's point of embarkation matter? It depends on how the boat was filled... Let's keep it.\n",
"\n",
"What about a person's name, ticket number, and passenger ID number? They're irrelavant for predicting survivability. And as you recall, the cabin variable is almost all missing values, so we can just drop all of these."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survived | \n",
" Pclass | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Fare | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" S | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
" C | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.9250 | \n",
" S | \n",
"
\n",
" \n",
" | 3 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
" S | \n",
"
\n",
" \n",
" | 4 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survived Pclass Sex Age SibSp Parch Fare Embarked\n",
"0 0 3 male 22.0 1 0 7.2500 S\n",
"1 1 1 female 38.0 1 0 71.2833 C\n",
"2 1 3 female 26.0 0 0 7.9250 S\n",
"3 1 1 female 35.0 1 0 53.1000 S\n",
"4 0 3 male 35.0 0 0 8.0500 S"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_data = titanic.drop(['PassengerId','Name','Ticket','Cabin'], 1)\n",
"titanic_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have the dataframe reduced down to only relevant variables, but now we need to deal with the missing values in the age variable.\n",
"\n",
"#### Imputing missing values\n",
"Let's look at how passenger age is related to their class as a passenger on the boat."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAlwAAAHfCAYAAACF0AZbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X+QVfV9//HXXbYusBQpnWrbOFpQFKxxasFFJwltGglk\nMhkTSqOLQ0z8kSFjm0CjAZEFGxzRUiUJo8WmNXUkYjIVLZ1OzBh03EQ64DDRSclufsyYTv1FRjdB\nWGBd3Pv9o1+ovzENn7137308/jq7sPe893pcnvdzzj1bqVar1QAAUExLrQcAAGh0ggsAoDDBBQBQ\nmOACAChMcAEAFCa4AAAKa631AG9n586dtR4BAOAdmz59+pt+vq6DK3nrwQEA6snbLRQ5pQgAUJjg\nAgAoTHABABQmuAAAChNcAACFCS4AgMIEFwBAYYILAKAwwQUAUJjgAgAoTHABABQmuAAAChNcAACF\nCS4AgMJaSz3w4OBgli1blmeeeSYtLS1ZvXp1Wltbs2zZslQqlUyZMiWrVq1KS4vmAwAaW7HgevTR\nR3Po0KHce++9eeyxx/KlL30pg4ODWbx4cWbOnJmVK1dm69atmT17dqkRAADqQrHlpUmTJuWVV17J\n0NBQ9u3bl9bW1uzatSsdHR1JklmzZmXbtm2ldg8AUDeKrXCNHTs2zzzzTD70oQ/lF7/4RTZs2JDH\nH388lUolSdLe3p69e/eW2n3d27RpU3bs2FHrMV6jv78/yf/8t6k3HR0d6ezsrPUYAPB/Uiy4/vmf\n/znvfe978/nPfz7PPfdcLr300gwODh758/7+/owfP/6oj7Nz585SI9bU7t27MzAwUOsxXuPgwYNJ\nktbWYofF/9nu3bsb9lgAoPEV+5d1/Pjx+Y3f+I0kyfHHH59Dhw7lzDPPzPbt2zNz5sx0d3fnvPPO\nO+rjTJ8+vdSINVWP39eSJUuSJOvWravxJAAw8rzdwkCx4PrkJz+Z5cuXZ8GCBRkcHMySJUty1lln\npaurK7feemsmT56cOXPmlNo9AEDdKBZc7e3t+fKXv/yGz2/cuLHULgEA6pKbYAEAFCa4AAAKE1wA\nAIUJLgCAwgQXAEBhggsAoDDBBQBQmOACAChMcAEAFCa4AAAKE1wAAIUJLgCAwgQXAEBhggsAoDDB\nBQBQmOACAChMcAEAFCa4AAAKE1wAAIUJLgCAwgQXAEBhggsAGkxPT096enpqPQavIrgAoMFs3rw5\nmzdvrvUYvIrgAoAG0tPTk97e3vT29lrlqiOCCwAayKtXtqxy1Q/BBQBQmOACgAYyb968N92mtlpr\nPQAAcOxMmzYtU6dOPbJNfRBcANBgrGzVH8EFAA3Gylb9cQ0XAEBhggsAoDDBBQBQmOACAChMcAEA\nFCa4AAAKE1wAAIUJLgCAwgQXAEBhggsAGkxPT096enpqPQavIrgAoMFs3rw5mzdvrvUYvIrgApqK\nV/40up6envT29qa3t9exXkeKBdfmzZuzcOHCLFy4MB//+Mfz7ne/Oz/4wQ/S2dmZBQsWZNWqVRka\nGiq1e4A35ZU/je7Vx7djvX4UC6558+bl7rvvzt13350//MM/zIoVK3Lbbbdl8eLFueeee1KtVrN1\n69ZSuwd4A6/8gVopfkrxBz/4QX7605/moosuyq5du9LR0ZEkmTVrVrZt21Z69wBHeOVPM5g3b96b\nblNbraV3cMcdd+Sqq65KklSr1VQqlSRJe3t79u7de9Sv37lzZ9H5+F8DAwNJPOc0rlf/zNm7d69j\nnYb1rne9K0myf/9+x3mdKBpcL730Up566qmcd955SZKWlv9dUOvv78/48eOP+hjTp08vNh+vtXHj\nxiSecxrX2LFjc+ONNyZJLr300kybNq3GE0EZY8eOTRLH+DB7u7gtGlyPP/54zj///CMfn3nmmdm+\nfXtmzpyZ7u7uIyEGMBymTZuWqVOnHtmGRuX4rj9Fg+upp57KSSeddOTjpUuXpqurK7feemsmT56c\nOXPmlNw9wBu4pgWohaLBdcUVV7zm40mTJh05bQVQC175A7XgxqcAAIUJLgCAwgQXAEBhggsAoDDB\nBQBQmOACAChMcAEAFCa4AAAKE1wAAIUJLgCAwgQXAEBhggtoKj09Penp6an1GECTEVxAU9m8eXM2\nb95c6zGAJiO4gKbR09OT3t7e9Pb2WuUChpXgAprGq1e2rHIBw0lwAQAUJriApjFv3rw33YZG480h\n9UdwAU1j2rRpmTp1aqZOnZpp06bVehwoxptD6k9rrQcAGE5Wtmh0h98ccnjbi4v6YIULaCrTpk3z\nDxANzZtD6pPgAgAoTHABQAPx5pD65BouAGggh98ccnib+iC4AKDBWNmqP4ILABqMla364xouAIDC\nBBcANBh3mq8/ggsAGow7zdcfwQUADeTwneZ7e3utctURwQUADcSd5uuT4AIAKExwAU3FxcQ0Onea\nr0+CC2gqLiam0R2+0/zUqVPdj6uOuPEp0DQOX0x8eNs/RjQqK1v1xwoX0DRcTEyzmDZtmhcUdUZw\nAQAUJriApuFiYpqFN4fUH8EFNA0XE9MsvDmk/rhoHmgqVrZodN4cUp+scAFNxcXENDpvDqlPggsA\noDDBBTQVFxPT6Lw5pD4VvYbrjjvuyMMPP5zBwcF0dnamo6Mjy5YtS6VSyZQpU7Jq1aq0tGg+YPgc\nPsVy3XXX1XgSKOPwm0MOb1MfitXO9u3b8/3vfz+bNm3K3Xffneeffz5r1qzJ4sWLc88996RarWbr\n1q2ldg/wBocvJu7t7bXKRUObN2+e1a06Uyy4vve97+X000/PVVddlUWLFuVP//RPs2vXrnR0dCRJ\nZs2alW3btpXaPcAbuJiYZuHNIfWn2CnFX/ziF3n22WezYcOGPP300/nMZz6TarWaSqWSJGlvb8/e\nvXuP+jg7d+4sNSKvMzAwkMRzTuN69c+cvXv3OtaBYVMsuCZMmJDJkyfnuOOOy+TJk9PW1pbnn3/+\nyJ/39/dn/PjxR32c6dOnlxqR19m4cWMSzzmNa+zYsbnxxhuTJJdeeqkVABrW4VPmjvHh9XYv4oqd\nUpw+fXq++93vplqtZvfu3Tlw4EDOP//8bN++PUnS3d2dGTNmlNo9wBtMmzYtY8aMyZgxY/xDRENz\np/n6U2yF6/3vf38ef/zxzJ8/P9VqNStXrsxJJ52Urq6u3HrrrZk8eXLmzJlTavcAb9DT05MDBw4c\n2RZdNCJ3mq9PRW8L8YUvfOENnzt82gpguL3+onm3hqAROc7rk5tgAQAUJriApuEO3DQDx3l9KnpK\nEaCeuAM3zcBxXp8EF9BUvOKnGTjO64/gApqKV/w0A8d5/XENFwBAYYILaCoPPvhgHnzwwVqPATQZ\npxSBpnL4HkVz586t8SRAM7HCBTSNBx98MAcOHMiBAwescgHDSnABTeP1d+AGGC6CCwCgMMEFNA13\n4AZqRXABTWPu3LkZM2ZMxowZ46J5YFh5lyLQVKxs0Qx6enqSuAFqPRFcQFOxskUzOPymkOuuu67G\nk3CYU4oA0EB6enrS29ub3t7eIytd1J7gAoAG4vYn9UlwAQAUJrgAoIG4/Ul9ctE8UMymTZuyY8eO\nWo/xGv39/UmS9vb2Gk/yRh0dHens7Kz1GIxw06ZNy9SpU49sUx8EF9BUBgYGktRncMGxYmWr/ggu\noJjOzs66W7FZsmRJkmTdunU1ngTKsbJVf1zDBQBQmOACAChMcAEAFCa4AAAKE1wAAIUJLgCAwgQX\nAEBhggsAoDDBBQBQmOACAChMcAEAFCa4AAAKE1wAAIUJLgCAwgQXAEBhggsAoDDBBQBQWGutBwCA\nkWzTpk3ZsWNHrcd4jf7+/iRJe3t7jSd5o46OjnR2dtZ6jGFnhQsAGszAwEAGBgZqPQavYoULAH4N\nnZ2ddbdis2TJkiTJunXrajwJhxUNro997GMZN25ckuSkk07KokWLsmzZslQqlUyZMiWrVq1KS4tF\nNgCgsRULroGBgVSr1dx9991HPrdo0aIsXrw4M2fOzMqVK7N169bMnj271AgAAHWh2PJSb29vDhw4\nkMsuuyyf+MQn8sQTT2TXrl3p6OhIksyaNSvbtm0rtXsAgLpRbIVr9OjRufzyy/MXf/EX+dnPfpYr\nr7wy1Wo1lUolyf+8c2Lv3r1HfZydO3eWGpHXOXyBpeecRuY4pxk4zutPseCaNGlSTjnllFQqlUya\nNCkTJkzIrl27jvx5f39/xo8ff9THmT59eqkReZ2NGzcm8ZzT2BznNAPHeW28XeAWO6X4L//yL7np\nppuSJLt3786+ffvynve8J9u3b0+SdHd3Z8aMGaV2DwBQN4qtcM2fPz/XXnttOjs7U6lUcuONN+a3\nfuu30tXVlVtvvTWTJ0/OnDlzSu0eAKBuFAuu4447LrfccssbPn94mXM4rV69On19fcO+35Hm8HN0\n+P4tvL2JEyemq6ur1mMAMAI0xY1P+/r68uILL2RcS6XWo9S1UUPVJMlA34s1nqT+7fv/zxUAvBNN\nEVxJMq6lksuOH1vrMWgQd+7ZX+sRABhB3OYdAKAwwQUAUJjgAgAoTHABABQmuAAAChNcAACFCS4A\ngMIEFwBAYYILAKAwwQUAUJjgAgAoTHABABQmuAAAChNcAACFCS4AgMIEFwBAYYILAKAwwQUAUJjg\nAgAoTHABABQmuAAAChNcAACFCS4AgMIEFwBAYYILAKAwwQUAUJjgAgAoTHABABQmuAAAChNcAACF\nCS4AgMIEFwBAYe84uPbs2VNyDgCAhnXU4Orp6cncuXNz4YUXZvfu3Zk9e3Z27do1HLMBADSEowbX\nDTfckNtuuy0TJkzIiSeemOuvvz6rVq0ajtkAABrCUYPrwIEDOfXUU498/J73vCcvv/xy0aEAABrJ\nUYNrwoQJ6e3tTaVSSZJs2bIlxx9/fPHBAAAaRevR/sL111+fpUuX5ic/+UlmzJiRU045JWvXrh2O\n2QAAGsJRg+vkk0/Opk2bsn///gwNDWXcuHHDMRcAQMM4anAtXLjwyOnEJKlUKhk9enQmT56cRYsW\nve3pxRdffDHz5s3LnXfemdbW1ixbtiyVSiVTpkzJqlWr0tLiNmAAQOM7avGcdtppOeOMM7J8+fIs\nX7487373u/Obv/mbOfHEE3Pddde95dcNDg5m5cqVGT16dJJkzZo1Wbx4ce65555Uq9Vs3br12H0X\nAAB17KjB9eSTT+a6667L1KlTM3Xq1Fx99dV56qmn8slPfjJPP/30W37dzTffnIsvvjgnnHBCkmTX\nrl3p6OhIksyaNSvbtm07Rt8CAEB9O+opxcHBwfzkJz/JlClTkiQ//vGPMzQ0lIMHD2ZwcPBNv2bz\n5s2ZOHFi3ve+9+Uf/uEfkiTVavXIqcn29vbs3bv3HQ24c+fOd/T33s7AwMCv/RjwegMDA8fk+GR4\nHf554L8djcxxXn+OGlwrVqzIlVdemd/+7d9OtVrNnj17snbt2qxfvz4XXnjhm37Nfffdl0qlkv/4\nj/9IT09Pli5dmr6+viN/3t/fn/Hjx7+jAadPn/4Ov5W3tnHjxgz07/u1Hwdera2t7ZgcnwyvjRs3\nJjk2P1ugXjnOa+PtAveowTVz5sx85zvfyQ9/+MN0d3fne9/7Xi6//PJ8//vff8uv+frXv35ke+HC\nhbn++uuzdu3abN++PTNnzkx3d3fOO++8X/HbAAAYmY56Ddd///d/50tf+lIWLVqUDRs25L3vfe//\n6YL3pUuXZv369bnooosyODiYOXPm/J8GBgAYad5yheuhhx7Kvffem127dmX27NlZu3Zturq68pd/\n+Ze/0g7uvvvuI9uHlzgBAJrJWwbXX/3VX2Xu3Ln5xje+kVNOOSVJXnM/LgAA3pm3DK4tW7bk/vvv\nz4IFC/Kud70rH/7wh/PKK68M52wAAA3hLa/hOv3007N06dJ0d3fn05/+dHbs2JEXXnghn/70p/Po\no48O54wAACPaUS+aHzVqVC644ILcdttt6e7uzvnnn59bbrllOGYDAGgIv9IvM5w4cWI+9alPZcuW\nLaXmAQBoOH57NABAYYILAKAwwQUAUJjgAgAo7Ki/S7ER9Pf35+BQNXfu2V/rUWgQ+4aqOdTfX+sx\nABghrHABABTWFCtc7e3taR04mMuOH1vrUWgQd+7Zn7b29lqPAcAIYYULAKAwwQUAUFhTnFKEZrB6\n9er09fXVeoy6d/g5WrJkSY0nGRkmTpyYrq6uWo8BI57gggbR19eXF158IZVxFq7fTnXUUJLkxQFx\nejTVfUO1HgEahuCCBlIZ15Kxlx1f6zFoEPvv3FPrEaBheCkMAFCY4AIAKExwAQAUJrgAAAoTXAAA\nhQkuAIDCBBcAQGGCCwCgMMEFAFCY4AIAKExwAQAUJrgAAAoTXAAAhQkuAIDCBBcAQGGCCwCgMMEF\nAFBYa60HAIB3avXq1enr66v1GHXv8HO0ZMmSGk8yMkycODFdXV1F9yG4ABgx+vr68sILL6alMq7W\no9S1oeqoJEnfiwM1nqT+DVX3Dct+BBcAI0pLZVyOH3tZrcegQezZf+ew7Mc1XAAAhQkuAIDCBBcA\nQGGCCwCgMMEFAFBYsXcpvvLKK1mxYkWeeuqpVCqV/M3f/E3a2tqybNmyVCqVTJkyJatWrUpLi+YD\nABpbseB65JFHkiT33ntvtm/fnnXr1qVarWbx4sWZOXNmVq5cma1bt2b27NmlRgAAqAvFlpcuuOCC\nrF69Okny7LPPZvz48dm1a1c6OjqSJLNmzcq2bdtK7R4AoG4UvfFpa2trli5dmoceeihf+cpX8thj\nj6VSqSRJ2tvbs3fv3qM+xs6dO3/tOQYG3GmXY29gYOCYHJ/HiuOcEhznNIPhOM6L32n+5ptvztVX\nX52Pf/zjr/kfpb+/P+PHjz/q10+fPv3XnmHjxo0Z6B+eW/fTPNra2o7J8XmsbNy4MfsG+ms9Bg2m\nHo/z/n2ii2PrWB3nbxdtxU4pPvDAA7njjjuSJGPGjEmlUslZZ52V7du3J0m6u7szY8aMUrsHAKgb\nxVa4PvjBD+baa6/NJZdckkOHDmX58uU59dRT09XVlVtvvTWTJ0/OnDlzSu0eAKBuFAuusWPH5stf\n/vIbPr9x48ZSu4Sm1t/fn+rBoey/c0+tR6FBVPcNpf+Q09RwLLgJFgBAYcUvmgeGR3t7ew62DmTs\nZcfXehQaxP4796S9rb3WY0BDsMIFAFCY4AIAKExwAQAUJrgAAAoTXAAAhQkuAIDCBBcAQGFNcx+u\nfUPV3Llnf63HqGsHh6pJktEtlRpPUv/2DVXTVushABgxmiK4Jk6cWOsRRoT+vr4kSZvn66ja4rgC\n4J1riuDq6uqq9QgjwpIlS5Ik69atq/EkANBYXMMFAFCY4AIAKExwAQAUJrgAAAoTXAAAhQkuAIDC\nBBcAQGGCCwCgMMEFAFCY4AIAKExwAQAUJrgAAApril9eDUBj6O/vz1D1YPbsv7PWo9Aghqr70t9/\nqPh+rHABABRmhQuAEaO9vT0DB1tz/NjLaj0KDWLP/jvT3t5WfD9WuAAAChNcAACFOaUIDaS6byj7\n79xT6zHqWvXgUJKkMtrrzaOp7htKyp9pgaYguKBBTJw4sdYjjAh9/X1Jkoltnq+janNcwbEiuKBB\ndHV11XqEEWHJkiVJknXr1tV4EqCZWFMHAChMcAEAFCa4AAAKE1wAAIUJLgCAwgQXAEBhggsAoDDB\nBQBQmOACAChMcAEAFFbkV/sMDg5m+fLleeaZZ/Lyyy/nM5/5TE477bQsW7YslUolU6ZMyapVq9LS\novcAgMZXJLi2bNmSCRMmZO3atfnlL3+Zj370o5k6dWoWL16cmTNnZuXKldm6dWtmz55dYvcAAHWl\nyBLT3Llz87nPfS5JUq1WM2rUqOzatSsdHR1JklmzZmXbtm0ldg0AUHeKrHC1t7cnSfbt25fPfvaz\nWbx4cW6++eZUKpUjf75379539Fg7d+4sMSJvYmBgIInnnMbmOB/ZDv/3g2NpYGCg+M+EIsGVJM89\n91yuuuqqLFiwIB/5yEeydu3aI3/W39+f8ePHv6PHmT59eqkReZ2NGzcm8ZzT2BznI9vGjRvTv090\ncWy1tbUdk58JbxdtRU4pvvDCC7nssstyzTXXZP78+UmSM888M9u3b0+SdHd3Z8aMGSV2DQBQd4oE\n14YNG/LSSy/l9ttvz8KFC7Nw4cIsXrw469evz0UXXZTBwcHMmTOnxK4BAOpOkVOKK1asyIoVK97w\n+cNL+QAAzcSNsAAAChNcAACFCS4AgMIEFwBAYcXuwwUAJQxV92XP/jtrPUZdG6oeTJK0VEbXeJL6\nN1Tdl6St+H4EFwAjxsSJE2s9wojQ19efJJk4sXxIjHxtw3JcCS4ARoyurq5ajzAiLFmyJEmybt26\nGk/CYa7hAgAoTHABABQmuAAAChNcAACFCS4AgMIEFwBAYYILAKAwwQUAUJjgAgAoTHABABQmuAAA\nChNcAACFCS4AgMIEFwBAYYILAKAwwQUAUJjgAgAoTHABABQmuAAAChNcAACFCS4AgMIEFwBAYYIL\nAKAwwQUAUJjgAgAoTHABABQmuAAAChNcAACFCS4AgMIEFwBAYYILAKAwwQUAUJjgAgAoTHABABRW\nNLiefPLJLFy4MEnyX//1X+ns7MyCBQuyatWqDA0Nldw1AEDdKBZcX/3qV7NixYoMDAwkSdasWZPF\nixfnnnvuSbVazdatW0vtGgCgrhQLrpNPPjnr168/8vGuXbvS0dGRJJk1a1a2bdtWatcAAHWltdQD\nz5kzJ08//fSRj6vVaiqVSpKkvb09e/fufUePs3PnziLz8UaHVyM95zQyxznNwHFef4oF1+u1tPzv\nYlp/f3/Gjx//jr5u+vTppUbidTZu3JjEc05jc5zTDBzntfF2gTts71I888wzs3379iRJd3d3ZsyY\nMVy7BgCoqWELrqVLl2b9+vW56KKLMjg4mDlz5gzXrgEAaqroKcWTTjop3/zmN5MkkyZNOrLECQDQ\nTNz4FACgMMEFAFCY4AIAKExwAQAUJrgAAAoTXAAAhQkuAIDCBBcAQGGCCwCgMMEFAFCY4AIAKExw\nAQAUJrgAAAoTXAAAhQkuAIDCBBcAQGGCCwCgMMEFAFCY4AIAKExwAQAUJrgAAAoTXAAAhQkuAIDC\nBBcAQGGttR4AaFybNm3Kjh07aj3Ga/T19SVJlixZUuNJ3qijoyOdnZ21HgMoQHABTaWtra3WIwBN\nSHABxXR2dtbdis1dd92VJLn00ktrPAnQTFzDBTSVhx9+OA8//HCtxwCajOACmsZdd92VoaGhDA0N\nHVnpAhgOggtoGq9e2bLKBQwnwQUAUJjgAprGn/3Zn73pNkBpggtoGq9+Z6J3KQLDSXABTePBBx98\n022A0gQX0DQ2b978ptsApQkuAIDCBBfQNObNm/em2wClCS6gacydO/dNtwFKE1xA03DRPFArggto\nGt/85jffdBugtNZaD9CsNm3alB07dtR6jNfo6+tLkixZsqTGk7xRR0dHOjs7az0GI9zg4OCbbgOU\nJrg4oq2trdYjQFGVSiXVavXINsBwGdbgGhoayvXXX58f/ehHOe6443LDDTfklFNOGc4R6kZnZ6cV\nGxhmv/M7v5Of//znR7bhWHDG4lfTrGcshvUaru985zt5+eWX841vfCOf//znc9NNNw3n7oEmd8UV\nV7zpNjSatrY2Zy3qzLCucO3cuTPve9/7kiR/9Ed/lP/8z/8czt0DTW7atGk54YQTjmzDseCMBe/E\nsAbXvn37Mm7cuCMfjxo1KocOHUpr61uPsXPnzuEYDWgSh1/0+dkCDKdhDa5x48alv7//yMdDQ0Nv\nG1tJMn369NJjAU3EzxSglLd7ITes13D98R//cbq7u5MkTzzxRE4//fTh3D0AQE0M6wrX7Nmz89hj\nj+Xiiy9OtVrNjTfeOJy7BwCoiWENrpaWlnzxi18czl0CANScX+0DAFCY4AIAKExwAQAUJrgAAAoT\nXAAAhQkuAIDCBBcAQGGCCwCgMMEFAFCY4AIAKExwAQAUJrgAAAoTXAAAhbXWeoCj2blzZ61HAAD4\ntVSq1Wq11kMAADQypxQBAAoTXAAAhQkuAIDCBBcAQGGCCwCgMMHFazz55JNZuHBhrceAIgYHB3PN\nNddkwYIFmT9/frZu3VrrkeCYe+WVV3Lttdfm4osvTmdnZ3784x/XeiQyAu7DxfD56le/mi1btmTM\nmDG1HgWK2LJlSyZMmJC1a9fml7/8ZT760Y/mAx/4QK3HgmPqkUceSZLce++92b59e9atW5e///u/\nr/FUWOHiiJNPPjnr16+v9RhQzNy5c/O5z30uSVKtVjNq1KgaTwTH3gUXXJDVq1cnSZ599tmMHz++\nxhORWOHiVebMmZOnn3661mNAMe3t7UmSffv25bOf/WwWL15c44mgjNbW1ixdujQPPfRQvvKVr9R6\nHGKFC2gyzz33XD7xiU/kwgsvzEc+8pFajwPF3Hzzzfn2t7+drq6u7N+/v9bjND3BBTSNF154IZdd\ndlmuueaazJ8/v9bjQBEPPPBA7rjjjiTJmDFjUqlU0tLin/ta818AaBobNmzISy+9lNtvvz0LFy7M\nwoULc/DgwVqPBcfUBz/4wfzwhz/MJZdckssvvzzLly/P6NGjaz1W0/PLqwEACrPCBQBQmOACAChM\ncAEAFCa4AAAKE1wAAIW50zwwIj399NOZO3duTj311FQqlQwODuaEE07ImjVr8ru/+7tv+PubN2/O\njh07ctNNN9VgWqDZWeECRqwTTjgh//qv/5oHHngg//7v/56zzjrryO+QA6gnVriAhjFjxow8/PDD\n2bZtW2666aZUq9X8/u//fm655ZbX/L1vfetb+drXvpaDBw9mYGAgN9xwQ84999x87Wtfy/3335+W\nlpacffbZ+eIXv5je3t6sXLkyhw4dSltbW9asWZM/+IM/qM03CIxYVriAhjA4OJhvfetbOfvss3P1\n1Vfn5ptvzr/927/ljDPOyP3333/k7w0NDeXee+/Nhg0bsmXLllx55ZX5p3/6pxw6dCh33HFH7rvv\nvmzevDkgomNMAAACCElEQVSVSiW7d+/OXXfdlU996lPZvHlzFi5cmCeeeKKG3yUwUlnhAkasn//8\n57nwwguTJC+//HLOPvvsLFiwIL29vZk2bVqS5K//+q+T/M81XEnS0tKS2267LQ8//HCeeuqp7Nix\nIy0tLWltbc0555yT+fPn5wMf+EAuueSSnHjiifmTP/mTfPGLX8x3v/vdvP/978+cOXNq880CI5rg\nAkasw9dwvVpvb+9rPt67d2/6+/uPfNzf358///M/z4UXXphzzz03Z5xxRr7+9a8nSW6//fY88cQT\n6e7uzhVXXJG/+7u/y9y5c3POOefkkUceyV133ZVHH300N9xwQ/lvDmgoggtoKJMmTUpfX19++tOf\n5rTTTss//uM/JklOOeWUJMnPfvaztLS0ZNGiRUmSFStW5JVXXklfX18WLFiQ++67L+ecc06ef/75\n/OhHP8o999yTD3/4w7n44otz6qmnZs2aNTX73oCRS3ABDaWtrS1r167NF77whQwODubkk0/O3/7t\n3+bb3/52kmTq1KmZNm1aPvShD2X06NE599xz8+yzz2bixIm5+OKLM3/+/IwZMya/93u/l4997GM5\n99xzc9111+X222/PqFGjsmzZshp/h8BIVKlWq9VaDwEA0Mi8SxEAoDDBBQBQmOACAChMcAEAFCa4\nAAAKE1wAAIUJLgCAwgQXAEBh/w9OjsxyWSYgYgAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sb.boxplot(x='Pclass', y='Age', data=titanic_data, palette='hls')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survived | \n",
" Pclass | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Fare | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" S | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
" C | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.9250 | \n",
" S | \n",
"
\n",
" \n",
" | 3 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
" S | \n",
"
\n",
" \n",
" | 4 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survived Pclass Sex Age SibSp Parch Fare Embarked\n",
"0 0 3 male 22.0 1 0 7.2500 S\n",
"1 1 1 female 38.0 1 0 71.2833 C\n",
"2 1 3 female 26.0 0 0 7.9250 S\n",
"3 1 1 female 35.0 1 0 53.1000 S\n",
"4 0 3 male 35.0 0 0 8.0500 S"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Speaking roughly, we could say that the younger a passenger is, the more likely it is for them to be in 3rd class. The older a passenger is, the more likely it is for them to be in 1st class. So there is a loose relationship between these variables. So, let's write a function that approximates a passengers age, based on their class. From the box plot, it looks like the average age of 1st class passengers is about 37, 2nd class passengers is 29, and 3rd class pasengers is 24.\n",
"\n",
"So let's write a function that finds each null value in the Age variable, and for each null, checks the value of the Pclass and assigns an age value according to the average age of passengers in that class."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def age_approx(cols):\n",
" Age = cols[0]\n",
" Pclass = cols[1]\n",
" \n",
" if pd.isnull(Age):\n",
" if Pclass == 1:\n",
" return 37\n",
" elif Pclass == 2:\n",
" return 29\n",
" else:\n",
" return 24\n",
" else:\n",
" return Age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we apply the function and check again for null values, we see that there are no more null values in the age variable."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Survived 0\n",
"Pclass 0\n",
"Sex 0\n",
"Age 0\n",
"SibSp 0\n",
"Parch 0\n",
"Fare 0\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_data['Age'] = titanic_data[['Age', 'Pclass']].apply(age_approx, axis=1)\n",
"titanic_data.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 2 null values in the embarked variable. We can drop those 2 records without loosing too much important information from our dataset, so we will do that."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Survived 0\n",
"Pclass 0\n",
"Sex 0\n",
"Age 0\n",
"SibSp 0\n",
"Parch 0\n",
"Fare 0\n",
"Embarked 0\n",
"dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_data.dropna(inplace=True)\n",
"titanic_data.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting categorical variables to a dummy indicators\n",
"The next thing we need to do is reformat our variables so that they work with the model. Specifically, we need to reformat the Sex and Embarked variables into numeric variables."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" male | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 1 | \n",
" 0 | \n",
"
\n",
" \n",
" | 2 | \n",
" 0 | \n",
"
\n",
" \n",
" | 3 | \n",
" 0 | \n",
"
\n",
" \n",
" | 4 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" male\n",
"0 1\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 1"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gender = pd.get_dummies(titanic_data['Sex'],drop_first=True)\n",
"gender.head()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Q | \n",
" S | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" | 2 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 3 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 4 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Q S\n",
"0 0 1\n",
"1 0 0\n",
"2 0 1\n",
"3 0 1\n",
"4 0 1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embark_location = pd.get_dummies(titanic_data['Embarked'],drop_first=True)\n",
"embark_location.head()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survived | \n",
" Pclass | \n",
" Sex | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Fare | \n",
" Embarked | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" S | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
" C | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 3 | \n",
" female | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.9250 | \n",
" S | \n",
"
\n",
" \n",
" | 3 | \n",
" 1 | \n",
" 1 | \n",
" female | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
" S | \n",
"
\n",
" \n",
" | 4 | \n",
" 0 | \n",
" 3 | \n",
" male | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" S | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survived Pclass Sex Age SibSp Parch Fare Embarked\n",
"0 0 3 male 22.0 1 0 7.2500 S\n",
"1 1 1 female 38.0 1 0 71.2833 C\n",
"2 1 3 female 26.0 0 0 7.9250 S\n",
"3 1 1 female 35.0 1 0 53.1000 S\n",
"4 0 3 male 35.0 0 0 8.0500 S"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survived | \n",
" Pclass | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Fare | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 3 | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 1 | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 3 | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.9250 | \n",
"
\n",
" \n",
" | 3 | \n",
" 1 | \n",
" 1 | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
"
\n",
" \n",
" | 4 | \n",
" 0 | \n",
" 3 | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survived Pclass Age SibSp Parch Fare\n",
"0 0 3 22.0 1 0 7.2500\n",
"1 1 1 38.0 1 0 71.2833\n",
"2 1 3 26.0 0 0 7.9250\n",
"3 1 1 35.0 1 0 53.1000\n",
"4 0 3 35.0 0 0 8.0500"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_data.drop(['Sex', 'Embarked'],axis=1,inplace=True)\n",
"titanic_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survived | \n",
" Pclass | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" Fare | \n",
" male | \n",
" Q | \n",
" S | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 3 | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 7.2500 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 1 | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 71.2833 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 3 | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 7.9250 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 3 | \n",
" 1 | \n",
" 1 | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 53.1000 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 4 | \n",
" 0 | \n",
" 3 | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 8.0500 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survived Pclass Age SibSp Parch Fare male Q S\n",
"0 0 3 22.0 1 0 7.2500 1 0 1\n",
"1 1 1 38.0 1 0 71.2833 0 0 0\n",
"2 1 3 26.0 0 0 7.9250 0 0 1\n",
"3 1 1 35.0 1 0 53.1000 0 0 1\n",
"4 0 3 35.0 0 0 8.0500 1 0 1"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_dmy = pd.concat([titanic_data,gender,embark_location],axis=1)\n",
"titanic_dmy.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have a dataset with all the variables in the correct format!\n",
"\n",
"### Checking for independence between features"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAikAAAHRCAYAAACvuin3AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl0FGW+//FPdRYSSAAhEFwmyCLgxlU4gIwDM7IqKrJK\nAEFEYHTEo8gFcUPIUWBgWAYFHBUYWWQTVATlOuCCgzszqCCCMEgQbhZkTQjZ+vn9wc++kxE6TUh1\ndVW/X+f0OZ2uTtWnoDv55vs8T7VljDECAACIMD6nAwAAAJwNRQoAAIhIFCkAACAiUaQAAICIRJEC\nAAAiEkUKAACISLF27vw+63I7dx8Wv9u2xekIlaJLw4ucjlApYnyW0xEuWGLBT05HqBS+/V85HeGC\nDd5R1+kIlWJB32ucjlApDpwscTpCpbiibnLYjmXX79kXzA+27Pd80UkBAAARydZOCgAAsE+M+5vL\nQdFJAQAAEYlOCgAALhVjebuVQicFAABEJDopAAC4lNfnpFCkAADgUgz3AAAAOIBOCgAALuX14R46\nKQAAICLRSQEAwKW8PieFIgUAAJdiuAcAAMABdFIAAHAprw/30EkBAAARiU4KAAAu5fVOA0UKAAAu\nxXAPAACAA+ikAADgUixBBgAAcACdFAAAXIo5KQAAAA4I2kn54osvzrmtVatWlR4GAACEzutzUoIW\nKcuWLZMkZWZmqri4WNdee62+/fZbVatWTYsXLw5LQAAAcHZeH+4JWqTMmDFDkjRixAjNnTtXsbGx\nKi0t1YgRI8ISDgAARK+QJs7m5uYG7peWlurIkSO2BQIAAKGJ6uGen/Xp00e33nqrmjRpou+//17D\nhw+3OxcAAIhyIRUpAwcO1M0336zMzEzVr19ftWrVsjsXAAAoR1TPSfnZ999/r6efflonTpxQ9+7d\ndcUVV+imm26yOxsAAAjC68M9IV0n5ZlnntHkyZN10UUXqU+fPnruuefszgUAAKJcyFecrV+/vizL\nUq1atVStWjU7MwEAgBDQSZFUo0YNLV++XAUFBVq/fr2qV69udy4AABDlQipSJk2apB9//FEXXXSR\ntm/frmeffdbuXAAAoBwxlmXLLVKENNwze/Zs3XnnnWrcuLHdeQAAQIi8PtwTUpHSsmVLTZs2Tfn5\n+erVq5e6deumhIQEu7MBAIAoFtJwT9euXfWXv/xFM2bM0EcffaTf/OY3ducCAADlYLhH0qFDh/T6\n66/r3Xff1VVXXaWXXnrJ7lwAACDKhVSkPPjgg+rbt6+WLl2qpKQkuzMBAIAQRPWclKysLNWrV0/T\npk2TZVnKzc0NfNhggwYNwhIQAACcXSQNzdghaJGycOFCPfbYY3r66afLPG5ZlhYtWmRrMAAAEJn8\nfr8mTJigXbt2KT4+Xs8884zq168f2L527VotXLhQPp9PvXv31oABAyp0nKBFymOPPSZJuvvuu9Wh\nQwf5fCHNswUAAGHg1HDPxo0bVVRUpBUrVmjbtm2aMmWK5s2bF9g+depUrVu3TlWrVtWtt96qW2+9\nVTVq1Djv44RUdXzyySe64447NHPmTB04cOC8DwIAALxj69atateunSTpuuuu0/bt28tsb9q0qU6e\nPKmioiIZY2RVcFgqpImzTz31lIqKirRp0yZlZGSouLhYf/3rXyt0QAAAUDmcmpOSl5dXZiFNTEyM\nSkpKFBt7pqy44oor1Lt3byUmJqpz584V/jidkMdvvv76a/3973/XTz/9pLZt21boYAAAwP2SkpKU\nn58f+Nrv9wcKlO+++04ffPCBNm3apPfee09HjhzRO++8U6HjhNRJ6datm5o1a6a+ffvyuT0AAEQI\nn0OdlBYtWuj9999Xt27dtG3bNjVp0iSwLTk5WQkJCapSpYpiYmJUq1YtnThxokLHCalI6dWrl4YN\nG1ahAwAAAHtYDs2c7dy5s7Zs2aL09HQZYzRp0iS99dZbOnXqlPr166d+/fppwIABiouLU1pamnr2\n7Fmh44RUpGzevFn33HOPYmJiKnQQAADgHT6fTxkZGWUea9SoUeB+//791b9//ws+TkhFytGjR9Wu\nXTtddtllsixLlmVp+fLlF3xwAABQcT6PX3I2pCLlhRdeqNDOf7dtS4W+L5J8cN2NTkeoFNl/f8/p\nCJXi/qbxTke4YCbx/K8VEIkGflPH6QgXbGmXiq04iDTWvs+cjlAp6tdrUv6TXCHZ6QCeEVKR8vrr\nr//isZEjR1Z6GAAAEDorxtsXWQ2pSElJSZEkGWP07bffyu/32xoKAACUz6mJs+ESUpGSnp5e5mtW\n+gAAALuFVKTs27cvcD8nJ0eHDh2yLRAAAAgNE2cljR8/XpZl6fjx46pZs6bGjRtndy4AABDlgs64\n2bFjh3r06KH58+frrrvuUk5OjrKyslRcXByufAAA4Bwsn8+WW6QI2kmZOnWqpkyZovj4eM2aNUsv\nv/yy6tevr2HDhqljx47hyggAAM4iqod7/H6/mjVrpuzsbBUUFOjqq6+WdOZKcwAAAHYKWqT8/ImG\nH330UeCTj4uLi8t88iEAAHBGVC9Bbtu2rdLT05WVlaV58+YpMzNTGRkZ6tatW7jyAQCAKBW0SBkx\nYoQ6duyopKQkpaamKjMzU/369VPnzp3DlQ8AAJxD1F9x9t8/1TAtLU1paWm2BgIAAJBCvE4KAACI\nPFG9ugcAAEQuy+ftIsXbg1kAAMC16KQAAOBSPo9PnPX22QEAANeikwIAgEtF9cXcAABA5PJ6kcJw\nDwAAiEh0UgAAcCkmzv5/fr9fpaWl+vLLL1VUVGRnJgAAgNA6Kc8++6waNWqkQ4cOaceOHUpJSdEf\n//hHu7MBAIAgmJMi6ZtvvlF6err++c9/av78+crKyrI7FwAAKIfPZ9lyixQhFSl+v1/bt2/XZZdd\npqKiIuXn59udCwAARLmQhnvuuOMOTZw4UZMmTdK0adPUr18/u3MBAIByWB6fOBtSkTJw4EANHDhQ\nkjR06FBdfPHFtoYCAAAIqUh5+eWXVb16dZ04cUJr1qxRu3bt9Nhjj9mdDQAABOFj4qz07rvvqkeP\nHtq8ebPefvttffvtt3bnAgAAUS6kTorP59Phw4eVkpIiSSosLLQ1FAAAKJ/XlyCHVKS0adNGgwYN\n0rRp0zRp0iT99re/tTsXAAAoBxNnJY0aNUqjRo2SJF177bWKi4uzNRQAAEBIRcqmTZv06quvqri4\nWMYYHTt2TG+99Zbd2QAAQBBMnJU0a9YsjRw5UhdffLF69uyppk2b2p0LAABEuZCKlLp16+r666+X\nJPXq1UvZ2dm2hgIAAOWzfJYtt0gR0nBPXFycvvjiC5WUlOijjz7S0aNH7c4FAADK4fP4xNmQzm7i\nxIkqKSnR/fffr5UrV+r++++3OxcAAIhyQTsp+/btC9yvV6+epDMrfSwrclpBAABEq6i+Tsr48eMD\n9y3LkjEmUKAsWrTI3mQAACCqBS1SFi9eLOnMFWb37t2rq666Shs3buRibgAARACvX8wtpLMbM2aM\ndu7cKenMENC4ceNsDQUAAMpn+Xy23CJFSEmys7PVu3dvSdLw4cOVk5NjaygAAICQliBblqV9+/ap\nQYMGyszMlN/vtzsXAAAoh9eXIJdbpOTl5Wn06NEaNWqUDh8+rLp16yojIyMc2QAAQBQLWqQsWbJE\nCxYsUGxsrJ588km1b98+XLkAAEA5onri7Lp167RhwwYtX76cJccAACCsgnZS4uPjFR8fr1q1aqm4\nuDhcmQAAQAi83kkJaeKsJBljznvnXRpedN7fE2my//6e0xEqxc7fdHA6QqX4bO9WpyNcsDql5/9e\nikRLu7r//e2PS3Q6QqU4cmkrpyNUippxTidwn0haLmyHoEXKnj17NHr0aBljAvd/Nn36dNvDAQCA\n6BW0SJk1a1bgfnp6uu1hAABA6KyYGKcj2CpokdK6detw5QAAACgj5DkpAAAgsjBxFgAARCSfxyfO\nevvsAACAa9FJAQDApbw+3OPtswMAAK5FJwUAAJfyeieFIgUAAJfy+hVnvX12AADAteikAADgUl4f\n7vH22QEAANeikwIAgEvRSQEAAHAAnRQAAFzK5/FOCkUKAAAuxRJkAAAAB9BJAQDApZg4CwAA4ICQ\nOik//PCD9u/fr6ZNmyo1NVWWZdmdCwAAlMPrnZRyi5QlS5bob3/7m44fP64ePXooMzNT48ePD0c2\nAAAQRNRPnF2/fr0WLlyo5ORkDRkyRF999VU4cgEAgChXbifFGCPLsgJDPPHx8baHAgAA5fPFxDgd\nwVblFim33nqrBg4cqEOHDmn48OHq1KlTOHIBAIAoV26RMmjQIP3617/W7t271bBhQzVt2jQcuQAA\nQDmifuLsY489Fri/efNmxcXFqV69eho4cKBq1KhhazgAAHBuXi9Syj27wsJC1a1bV926ddOll16q\n7OxsFRUV6dFHHw1HPgAAEKXKLVKOHDmiUaNGqV27dho5cqSKi4v18MMP6+TJk+HIBwAAzsHy+Wy5\nRYpyk+Tl5Wnv3r2SpL179+rUqVM6evSoTp06ZXs4AAAQefx+v8aPH69+/fpp0KBB2r9//1mf99RT\nT+lPf/pThY9T7pyU8ePHa8yYMcrJyVFCQoJ69uypt99+W/fdd1+FDwoAAC6cU3NSNm7cqKKiIq1Y\nsULbtm3TlClTNG/evDLPWb58uXbv3q1WrVpV+Djlnl3z5s01YcIE/frXv1ZBQYF++uknDRw4UF27\ndq3wQQEAgHtt3bpV7dq1kyRdd9112r59e5nt//jHP/TVV1+pX79+F3Scc3ZSioqKtH79ei1dulTx\n8fHKy8vTpk2blJCQcEEHBAAAlcOpTkpeXp6SkpICX8fExKikpESxsbHKycnRnDlz9Pzzz+udd965\noOOcs0jp0KGDbrvtNv3pT3/S5ZdfrmHDhlGgAAAQQZya5JqUlKT8/PzA136/X7GxZ0qKDRs26OjR\noxoxYoRyc3N1+vRpNWzYUL169Trv45yzSLn77rv11ltv6eDBg+rTp4+MMRU4DQAA4DUtWrTQ+++/\nr27dumnbtm1q0qRJYNvgwYM1ePBgSdKaNWv0r3/9q0IFihSkSBk+fLiGDx+uzz//XKtWrdL27ds1\nbdo03XHHHWXCAAAAZ1g+Zz67p3PnztqyZYvS09NljNGkSZP01ltv6dSpUxc8D+Xflbu6p3Xr1mrd\nurVOnDihN998U2PHjtUbb7xRaQEAAIC7+Hw+ZWRklHmsUaNGv3heRTsoPyu3SPlZ9erVNWjQIA0a\nNOiCDggAACqJQ52UcAm5SAEAABEmgq4Oawdvnx0AAHAtOikAALiUFePt4R46KQAAICLRSQEAwK2Y\nOAsAACKSx4sUhnsAAEBEopMCAIBLOfXZPeHi7bMDAACuZWsnJcZn2bn7sLi/abzTESrFZ3u3Oh2h\nUixt1NLpCBds6oLBTkeoFFaH3k5HuGBZVS52OkKluCjOG39v+gqPOx2hclStFr5jMScFAAAg/JiT\nAgCAW3m8k0KRAgCASzFxFgAAwAF0UgAAcCuPD/fQSQEAABGJTgoAAG7l8U4KRQoAAC5lxXi7SGG4\nBwAARCQ6KQAAuBVLkAEAAMKPTgoAAG7FxFkAABCJLI8XKQz3AACAiEQnBQAAt2LiLAAAQPjRSQEA\nwKW8PieFIgUAALfyeJHCcA8AAIhIdFIAAHArJs4CAACEX8idFL/fryNHjqh27dqyLMvOTAAAIAR8\nCrKkd999V506ddKwYcPUpUsXbdmyxe5cAAAgyoXUSZk7d65WrVql2rVr6/Dhw7rvvvt044032p0N\nAAAE4/HVPSEVKTVr1lTt2rUlSSkpKUpKSrI1FAAACAFFilStWjXde++9atWqlbZv367Tp09rxowZ\nkqRHHnnE1oAAACA6hVSkdOrUKXA/NTXVtjAAACB0lseXIJdbpHz33Xfq2bOnioqKtGrVKsXHx6t3\n797yefwfBgAAOCtopbFw4UI99dRTKikp0dSpU7Vlyxbt2rVLkyZNClc+AABwLr4Ye24RImgnZcOG\nDVq+fLksy9K6dev07rvvqnr16kpPTw9XPgAAcC6Wt0c1gp5dtWrVFBMTo507d+pXv/qVqlevLkky\nxoQlHAAAiF5BOymWZWnfvn16/fXX1aFDB0nSDz/8oBiPX+EOAABXiOZOykMPPaSxY8fq4MGDGjx4\nsD7//HPdfffdGjt2bLjyAQCAKBW0k9K8eXOtWrUq8PV1112njRs3Ki4uzvZgAAAgOBPNnZSfffPN\nN+rVq5c6deqkQYMGadeuXXbnAgAA5bF89twiREgXc3v22Wc1depUNW7cWLt27dLEiRP16quv2p0N\nAABEsZCKlCpVqqhx48aSpKZNmzLcAwBAJLAspxPYKmiRsmLFijNPio3VhAkT1KpVK3399dd8wCAA\nALBd0CIlNzdXknT99ddLkvbt26fk5GRdeeWV9icDAADBefwjaoIWKX369FG9evW0b9++cOUBAACQ\nVE6RsnDhQj322GMaP368LMvS8ePHFRMTo6SkJC1atChcGQEAwFlE9RLk7t27q0ePHpo/f77uuusu\n5eTkKD8/X3fffXe48gEAgHPx+BLkoEmmTp2qKVOmKD4+XrNmzdLLL7+s1atX66WXXgpXPgAAEKWC\nDvf4/X41a9ZM2dnZKigo0NVXXy3pzGf6AAAAh0VQ18MOQc8uNvZMDfPRRx+pbdu2kqTi4mKdOnXK\n/mQAACCqBe2ktG3bVunp6crKytK8efOUmZmpjIwMdevWLVz5AADAuXi8kxK0SBkxYoQ6duyopKQk\npaamKjMzU/369VPnzp3DlQ8AAJyD11f3lHtZ/EaNGgXup6WlKS0tzdZAAAAAUoif3VNRiQU/2bn7\nsDCJNZyOUCnqlBqnI1SKqQsGOx3hgo0d6o1rDM3Z6f6O6v7SQqcjVIr42glOR6gUVTzeFbCFx//N\nvH12AADAtWztpAAAABt5/JIgFCkAALgVwz0AAADhRycFAACX8voSZG+fHQAAcC06KQAAuJXP270G\nb58dAABwLTopAAC4lcfnpFCkAADgVh4vUrx9dgAAwLXopAAA4FZ0UgAAAMKPTgoAAC7l9Yu5UaQA\nAOBWHi9SvH12AADAteikAADgVpbldAJb0UkBAAARiU4KAABu5fE5KRQpAAC4lFOre/x+vyZMmKBd\nu3YpPj5ezzzzjOrXrx/Y/t5772nOnDmKjY1V7969deedd1boON4uwQAAQKXbuHGjioqKtGLFCo0e\nPVpTpkwJbCsuLtbkyZO1YMECLV68WCtWrNDhw4crdJyQOik//vij/ud//kcFBQWBx0aOHFmhAwIA\ngEriUCdl69atateunSTpuuuu0/bt2wPb9u7dq7S0NNWoUUOS1LJlS33xxRe65ZZbzvs4IZ3d6NGj\nVVBQoJSUlMANAABEp7y8PCUlJQW+jomJUUlJSWBbcnJyYFu1atWUl5dXoeOE1ElJSEigcwIAQIQx\nDi1BTkpKUn5+fuBrv9+v2NjYs27Lz88vU7Scj6CdlH379mnfvn1KSUnRW2+9pX/961+BxwAAQHRq\n0aKFNm/eLEnatm2bmjRpEtjWqFEj7d+/X8eOHVNRUZG+/PJLXX/99RU6TtBOyvjx4wP3V65cGbhv\nWZYWLVpUoQMCAIDKYYwzx+3cubO2bNmi9PR0GWM0adIkvfXWWzp16pT69euncePG6d5775UxRr17\n91ZqamqFjhO0SFm8eLEkqbCwUHv37tVVV12ljRs36re//W2FDgYAACqP36EqxefzKSMjo8xjjRo1\nCtzv0KGDOnTocOHHCeVJY8aM0c6dOyWdGQIaN27cBR8YAAAgmJCKlOzsbPXu3VuSNHz4cOXk5Nga\nCgAAlM/YdIsUIRUplmUFJstmZmbK7/fbGgoAACCkJciPP/64Ro0apcOHD6tu3bq/GIcCAADh54+k\ntocNQipSvvjiC73xxht2ZwEAAOfBOLW8J0xCGu758MMPVVpaancWAACAgJA6KUePHlW7du102WWX\nybIsWZal5cuX250NAAAEwXCPpBdeeMHuHAAAAGWEVKSUlJRow4YNKi4uliTl5OQweRYAAId5vJES\n+qcgS9I//vEP/fjjjzp27JitoQAAQPn8xp5bpAipSKlatap+//vfKzU1VVOmTNHhw4ftzgUAAKJc\nSMM9lmUpNzdX+fn5OnXqlE6dOmV3LgAAUI6oX4Kcl5enkSNHauPGjbrjjjvUqVMntW3bNhzZAABA\nFAvaSVmyZIkWLFig2NhYPfnkk2rfvr06duwYrmwAACAIr39ITdBOyrp167RhwwYtX75cixYtClcm\nAACA4J2U+Ph4xcfHq1atWoHlxwAAIDJ4fEpKaBNnJe9PzgEAwG0iabmwHYIWKXv27NHo0aNljAnc\n/9n06dNtDwcAAKJX0CJl1qxZgfvp6em2hwEAAKHz+ihH0CKldevW4coBAABQhmVsLMNKtq63a9dh\nM/CbOk5HqBRLu17kdIRKYRV54EKCBSecTlApHrhykNMRLtiDh752OkKlSK4S0sXDI15BiTcW1Dat\nWz1sx8o8kmfLftNqJdmy3/MV8sRZAAAQWTw+2hPaZ/cAAACEG50UAABcyu/xVgqdFAAAEJHopAAA\n4FLe7qNQpAAA4Fpev+Iswz0AACAi0UkBAMClPD5vlk4KAACITHRSAABwKb/Hp87SSQEAABGJTgoA\nAC7l9TkpFCkAALgUS5ABAAAcQCcFAACX8vpwD50UAAAQkeikAADgUl5fgkyRAgCASzHcAwAA4AA6\nKQAAuJTf460UOikAACAihdxJycvL048//qi0tDRVrVrVzkwAACAEpX6nE9grpCJlw4YNeuGFF1Ra\nWqqbb75ZlmXpD3/4g93ZAABAEAz3SPrrX/+qlStXqmbNmvrDH/6gjRs32p0LAABEuZA6KTExMYqP\nj5dlWbIsS4mJiXbnAgAA5SilkyK1bNlSo0ePVnZ2tsaPH69rr73W7lwAACDKhdRJGT58uP75z3/q\nyiuvVMOGDdWhQwe7cwEAgHJ4fU5KSEXKiBEjtGzZMrVv397uPAAAAJJCLFJq1KihV155RQ0aNJDP\nd2aE6De/+Y2twQAAQHAsQZZ00UUX6bvvvtN3330XeIwiBQAAZzHcI2ny5Mllvs7JybElDAAAwM9C\nKlL+/Oc/a9myZSouLtbp06d1+eWXa/369XZnAwAAQbAEWdJ7772nzZs36/bbb9fbb7+t1NRUu3MB\nAIAoF1InpU6dOoqPj1d+fr7q16+v4uJiu3MBAIBy+L3dSAmtSKlXr55ee+01JSYmavr06Tpx4oTd\nuQAAQDlKPV6lBB3umTt3riQpIyNDjRo10tixY1W3bl1Nnz49LOEAAED0ClqkfPrpp2ee5PNp5syZ\nSkpK0qBBg9S4ceOwhAMAAOfmN8aWW6QIWqSYfwtqIig0AADwvqBzUizLOut9AADgvFKP9w+CFik7\nduxQenq6jDHas2dP4L5lWVq+fHm4MgIAgLOIpKEZOwQtUtauXRuuHAAAAGUELVIuvfTScOUAAADn\nKaqXIAMAADglpIu5AQCAyOP1OSl0UgAAQESikwIAgEtF9RJkAAAQubw+3GNrkTJ4R107dx8WS7tU\ndzpCpfDHJTodoVJkVbnY6QgXbH9podMRKsWDh752OsIFe+6S5k5HqBQZx3Y4HaFSXFKY7XSESuKN\n3xuRgE4KAAAu5WcJMgAAQPjRSQEAwKWYOAsAACKS1yfOMtwDAAAiEp0UAABcqpROCgAAQPjRSQEA\nwKW8vgSZIgUAAJfy+uoehnsAAEBEokgBAMCl/MbYcquI06dP68EHH9SAAQM0fPhwHTly5OyZ/X4N\nGzZMy5YtK3efFCkAAOCCLVu2TE2aNNGrr76qHj16aO7cuWd93qxZs3TixImQ9kmRAgCAS5UaY8ut\nIrZu3ap27dpJktq3b69PPvnkF8/ZsGGDLMsKPK88TJwFAMClSh1a3bNq1Sq98sorZR6rXbu2kpOT\nJUnVqlXTyZMny2zfvXu31q1bp9mzZ2vOnDkhHYciBQAAnJe+ffuqb9++ZR4bOXKk8vPzJUn5+fmq\nXr16me1vvPGGsrOzdffdd+vgwYOKi4vTpZdeqvbt25/zOBQpAAC4lFOdlLNp0aKFPvzwQzVv3lyb\nN29Wy5Yty2wfO3Zs4P5zzz2nlJSUoAWKxJwUAABQCfr376/vv/9e/fv314oVKzRy5EhJ0sKFC7Vp\n06YK7ZNOCgAALhVJnZTExETNnj37F4/fc889v3jswQcfDGmfdFIAAEBEopMCAIBLRVInxQ4UKQAA\nuJTXixSGewAAQEQKuZPyww8/aP/+/WratKlSU1NlWZaduQAAQDm83kkJqUhZsmSJ/va3v+n48ePq\n0aOHMjMzNX78eLuzAQCAKBbScM/69eu1cOFCJScna8iQIfrqq6/szgUAAMpR6je23CJFSJ0UY4ws\nywoM8cTHx9saCgAAlC+SCgo7hFSk3HbbbRo4cKAOHTqk4cOHq1OnTnbnAgAAUS6kIuWuu+5S27Zt\ntXv3bjVo0EDNmjWzOxcAAChHVHdSpk+f/otVPDt37tTbb7+tRx55xNZgAAAgugUtUho2bBiuHAAA\n4DxFdSelZ8+ekqSSkhJ98803KikpkTFGOTk5YQkHAADOrSSai5SfjRw5UsXFxcrJyVFpaanq1q2r\n2267ze5sAAAgioV0nZSjR49q/vz5at68udasWaPCwkK7cwEAgHJ4/TopIRUpCQkJkqSCgoLAfQAA\nADuFNNzTpUsXzZkzR82aNVO/fv2UmJhody4AAFCOSOp62CGkIqVevXr6+9//ruLiYiUkJCgmJsbu\nXAAAIMqFVKRMnTpVGRkZqlGjht15AABAiEoNnRRdccUVatOmjd1ZAADAeWC4R1LHjh3Vr1+/Mhd3\nmzx5sm2hAAAAQipSFi9erGHDhik5OdnuPAAAIER0UiSlpKSoW7dudmcBAAAICKlISUhI0L333qur\nrroq8IGDfMAgAADOopMi6aabbrI7BwAAOE+lfr/TEWwVUpHy8wcNAgAAhEtIRQoAAIg8Xh/uCemz\newAAAMIT24A+AAAS3klEQVSNTgoAAC7l9U4KRQoAAC5VQpFScQv6XmPn7sPC2veZ0xEqxZFLWzkd\noVJcFOf+Ecr42glOR6gURaXu/+GYcWyH0xEqxfiaVzsdoVLEvrbW6QiV4vneTifwDjopAAC4lNeH\ne9z/ZykAAPAkOikAALgUnRQAAAAH0EkBAMClvN5JoUgBAMClvF6kMNwDAAAiEp0UAABcik4KAACA\nA+ikAADgUsbjnRSKFAAAXMrv8SKF4R4AABCR6KQAAOBSxtBJAQAACDs6KQAAuBQTZwEAQERi4iwA\nAIAD6KQAAOBSxu90AnvRSQEAABGJTgoAAC7FEmRJRUVFOnjwoE6fPi1JOnHihAoKCmwNBgAAolvQ\nTkpxcbEmT56sDz/8UCkpKfrf//1f/e53v1NxcbHuueceNWnSJFw5AQDAf/D66p6gRcqcOXNUu3Zt\nbdq0SZLk9/v15JNP6qeffqJAAQDAYVF9nZTPPvtMy5YtC3zt8/mUnZ2to0eP2h4MAABEt6BzUny+\nX26eOXOmEhISbAsEAABCY/zGllukCFqkJCQkKDMzs8xjx44dU2Jioq2hAAAAgg73jBo1Svfdd5/u\nvPNOXXbZZTpw4IBee+01TZs2LVz5AADAOfijeQnyNddco4ULF6qoqEibN29WYWGh5s+fr6uuuipc\n+QAAwDl4fbin3Iu5paamasSIEeHIAgAAEMAVZwEAcKlI6nrYgc/uAQAAEYlOCgAALhXVV5wFAACR\niw8YBAAAcACdFAAAXMr4nU5gLzopAAAgItFJAQDApbw+cZZOCgAAiEh0UgAAcCmvX8yNIgUAAJfy\nepHCcA8AAIhIdFIAAHApPxdzAwAACD86KQAAuJTX56RQpAAA4FJeL1IY7gEAABGJTgoAAC7l9SvO\n2lqkHDhZYufuw6J+vSZOR6gUNeOcTlA5fIXHnY5wwapY3mhg7itJcDrCBbukMNvpCJUi9rW1Tkeo\nFCV9ujsdoXKYH5xO4Bl0UgAAcCkTQUuQT58+rTFjxuinn35StWrV9Mc//lG1atUq85wFCxZo3bp1\nsixL9913nzp37hx0n974kw4AgChk/MaWW0UsW7ZMTZo00auvvqoePXpo7ty5ZbafOHFCixYt0vLl\ny7VgwQJNmjSp3H1SpAAAgAu2detWtWvXTpLUvn17ffLJJ2W2JyYm6pJLLlFBQYEKCgpkWVa5+2S4\nBwAAl3Jq4uyqVav0yiuvlHmsdu3aSk5OliRVq1ZNJ0+e/MX3XXzxxbr11ltVWlqq3//+9+UehyIF\nAACcl759+6pv375lHhs5cqTy8/MlSfn5+apevXqZ7Zs3b1ZOTo42bdokSbr33nvVokULNW/e/JzH\noUgBAMCljL/U6QgBLVq00IcffqjmzZtr8+bNatmyZZntNWrUUEJCguLj42VZlpKTk3XixImg+6RI\nAQAAF6x///569NFH1b9/f8XFxWn69OmSpIULFyotLU0dO3bUxx9/rDvvvFM+n08tWrTQjTfeGHSf\nlrFx/dL3Ob8cj3Kb+rHuPwdJ8ifUcDpCpfAVeuD/wyvXSSly/3VSGpV64zopD39c4HSESuGV66S8\nEMbrpNQf+qot+92/YIAt+z1fdFIAAHCpSBrusYM3/qQDAACeQycFAACXMqV0UgAAAMKOTgoAAC7l\n9TkpFCkAALiU14sUhnsAAEBEopMCAIBL0UkBAABwAJ0UAABcyuudFIoUAABcyutFCsM9AAAgIp1X\nJ+XEiRPy+XxKSkqyKw8AAAiRP5o7KTt27FCPHj1UXFysd999V127dlXv3r313nvvhSsfAACIUkE7\nKVOnTtWUKVMUFxenWbNm6eWXX1b9+vU1bNgwdejQIVwZAQDAWXh9TkrQIsXv96tZs2bKzs5WQUGB\nrr76akmSz8dUFgAAYK+gRUps7JnNH330kdq2bStJKi4uVn5+vv3JAABAUFHdSWnbtq3S09OVlZWl\nefPmKTMzUxkZGerWrVu48gEAgHMwpVFcpIwYMUIdO3ZUUlKSUlNTlZmZqX79+qlz587hygcAAKJU\nuUuQGzVqFLiflpamtLQ0WwMBAIDQeH24hxmwAAAgInFZfAAAXMrrnRSKFAAAXMrrRQrDPQAAICLR\nSQEAwKWM3+90BFvRSQEAABGJTgoAAC7l9TkpFCkAALiU14sUhnsAAEBEopMCAIBL+emkAAAAhB+d\nFAAAXMrrn4JMJwUAAEQkOikAALiU11f3UKQAAOBSXi9SGO4BAAARiU4KAAAuRScFAADAAXRSAABw\nKa93UixjjHE6BAAAwH9iuAcAAEQkihQAABCRKFIAAEBEokgBAAARiSIFAABEJIoUAAAQkRy7TsqL\nL76ojz/+WCUlJbIsS48++qiuueaaCu3r2Wef1T333KNLLrmkQt8/atQopaenq02bNhX6/p999tln\nevjhh9W4cWNJUmFhoW6//XYNGjToF88dNGiQJkyYoEaNGl3QMZ3w0ksv6ZVXXtGmTZtUpUoVp+OU\n62yvtTfffFP33HOPVq9erZSUFPXv37/M93z99deaNWuW/H6/8vPzdcstt2jo0KEOncH5vbZCEQmv\nvx9//FHdu3fX1VdfHXisTZs2GjlypGOZ7LZmzRr961//0n//9387HSXqVObvHISPI0XKnj179N57\n72nZsmWyLEs7d+7Uo48+qrVr11Zof0888UQlJ6y4G264QTNnzpQkFRUV6eabb9Ydd9yh6tWrO5ys\n8qxdu1bdunXT+vXr1atXL6fjBFXR11pGRob++Mc/qlGjRiouLlZ6erpuuOEGXXXVVWFK/ktefG01\nbtxYixcvdjoGgti+fbtmzJihgoICGWPUpk0bPfDAA4qPj3c6Wsgq+3cOwseRIiU5OVmHDh3Sa6+9\npvbt2+vKK6/Ua6+9Vuavu2XLlunw4cPq2bOn7r//ftWsWVPt27fXmjVr9Pbbb8uyLGVkZKht27Za\ntGiRJkyYoDFjxmj27Nm67LLLtGHDBn355Zd66KGH9MQTT+jo0aOSpCeffFJNmzbV0qVLtWrVKtWp\nU0c//fSTLeeZl5cnn8+n7777TtOnT5ff71dqaqr+9Kc/BZ6TlZWlCRMmqLCwULm5uXr44YfVqVMn\nzZw5U5999plKSkrUpUsXjRgxQkuXLtUbb7whn8+na6+9Vk8++aQtuYP57LPPlJaWpvT0dI0ZM0a9\nevXS119/rYkTJ6patWqqXbu2qlSpoilTpmjx4sVat26dLMtSt27dNHjw4LDnLe+1JkkbN27UO++8\no9OnT+vJJ59U8+bNlZKSoqVLl6pXr1668sortWzZMsXHx2vNmjXauHGj8vPzdfToUT3wwAPq2rVr\n2M/r319bzz//vIwxys/P1/Tp0xUXF1fmPdO6dWtNmjTpF6+/OXPm6PDhwyooKNCMGTP0q1/9Kuzn\n8Z9KS0s1fvx4ZWVlKScnRx06dNCoUaM0btw4HTt2TMeOHdNf/vIXvfzyy/ryyy/l9/s1ZMgQ3XLL\nLWHPumbNGr3//vs6ffq0cnNzNXjwYG3atEnff/+9xo4dq6ysLL377rsqKCjQRRddpOeff77M90fC\n+6M8WVlZGjNmjObOnasGDRrIGKM5c+Zo8uTJevrpp52OF7Jz/RyACxiHbN++3YwbN8789re/NV27\ndjUbNmwwd911l9mzZ48xxphXX33VzJ492xw4cMC0adPGFBYWGmOMeeihh8znn39uCgsLTbdu3Uxx\ncXHg+5YuXWqee+45Y4wxw4cPN7t27TJTp041S5cuNcYYs2/fPpOenm5yc3NNly5dTGFhoSkqKjK3\n3Xab+fTTTy/4nD799FNzww03mLvuussMGjTIDB061HzwwQeme/fugfNauXKl2b59eyDzli1bAsfe\nunWrGTJkiDHGmJtuuskcOHDAFBYWmmXLlhljjOnVq5f56quvjDHGLF261BQXF19w5vM1evRo8/77\n7xtjjElPTzfbtm0zPXr0MLt37zbGGDNjxgzz6KOPmu+//96kp6ebkpISU1JSYgYNGmT27t0b9rzG\nBH+tzZ492zz11FPGGGN2795tevToYYwx5uTJk+b55583vXv3Nq1btzYZGRmmsLDQrF692gwZMsSU\nlpaa3Nxc87vf/S4s/w/nem0tWbLEZGVlGWOMmTdvnpk7d+4v3jPnev298cYbxhhjZs+ebV588UXb\nz+E/HThwwFx//fXmrrvuCty+/PJLs3LlSmOMMadPnzatW7c2xhjz6KOPmoULFxpjjPnggw/Mww8/\nHHhO9+7dzfHjx8Oef/Xq1eaee+4xxhizbt0606dPH+P3+80nn3xifv/735vnnnvOlJaWGmOMGTp0\nqPnyyy/N6tWrzbRp0yLq/RHMCy+8YObPn1/mMb/fb2666SZTUFDgUKqKOdvPAUQ+Rzop+/fvV1JS\nkiZPnixJ+uabbzR8+HDVqVPn34unwP3LLrss0Fq888479frrrys3N1cdOnRQbOz/ncLtt9+uAQMG\nqG/fvsrLy1OTJk20e/duffrpp3rnnXckScePH1dmZqYaN24c2Gfz5s0r7dz+vSX/s8cffzww9t+3\nb98y2+rUqaN58+bptddek2VZKikpkSRNmzZN06dP1+HDh9WuXTtJ0uTJk7VgwQJNnTpV1113XZl/\no3A4fvy4Nm/erCNHjmjx4sXKy8vTkiVLlJOToyuuuEKS1LJlS7399tvavXu3Dh06pCFDhgS+d//+\n/WrYsGFYM4fyWmvVqpUk6YorrlBubq4KCwu1Y8cOPfDAA3rggQd07NgxPfbYY1qxYoWqVaumVq1a\nyefzKSUlRdWrV9eRI0dUt25d28/lbK+tjRs36tlnn1XVqlWVnZ2tFi1aSCr7njl8+PBZX38/j8en\npKTo8OHDtuc/m/8c7snLy9Obb76pTz/9VElJSSoqKgpsa9CggSRp9+7d2rFjR2A+TklJiQ4ePOjI\nsNeVV14p6cxf6o0aNZJlWapRo4aKi4sVFxenRx55RFWrVlVWVlbgvf3zOUTC+6M8Bw8eDPz8+Zll\nWUpJSVFubm5EdN9Cca6fA23atFHNmjUdTodgHFnds2vXLmVkZAR+ADVo0EDVq1dXzZo1lZubK0n6\n9ttv/y+k7/9itm3bVjt37tTq1at/8Qs/OTlZ11xzjSZPnhyYK9GwYUMNGTJEixcv1qxZs9S9e3dd\nfvnl2rNnj06fPq3S0lLt3LnT1vOtW7eufvjhB0lnJm/97W9/C2z785//rDvuuEPTpk1TmzZtZIxR\nUVGRNmzYoBkzZmjRokV6/fXXdfDgQa1cuVITJ07UkiVLtHPnTv3zn/+0Nfd/Wrt2rXr37q0FCxZo\n/vz5WrlypbZs2aIqVapoz549kqSvvvpK0pl/98aNG2vRokVavHixevXqpaZNm4Y1r3Tu11pMTEzg\nOV9//XXguZdccoksy9KYMWO0b98+SVLNmjV16aWXBn7p79ixQ9KZX/55eXmqXbt2OE+pjKeeekqT\nJk3SlClTVLdu3UDh+u/vmWCvv0izZs0aJScna/r06Ro6dKhOnz4dOCfLsiSdeW21adNGixcv1iuv\nvKJbbrnFsV+WP2f6T8XFxdq4caNmzZqlp556Sn6/v8wfFZHy/ijPxRdfrAMHDpR5zO/369ChQ46+\n7s9XKD8HEJkc6aR06dJFe/fuVZ8+fVS1alUZYzR27FjFxcVp4sSJuuSSS875l6llWeratas+/vhj\npaWl/WJ73759NWzYME2aNEmSdN999+mJJ57QypUrlZeXp5EjR6pWrVoaPny40tPTVatWLSUmJtp6\nvhMnTtTjjz8un8+nOnXqaMiQIVq0aJEk6eabb9bUqVP14osvql69ejp69Kji4+NVo0YN3XnnnUpI\nSNCNN96oSy65RE2bNtWAAQNUrVo1paam6r/+679szf2fVq1apalTpwa+TkxMVJcuXZSSkqLHH39c\nVatWVVxcnFJTU9WsWTO1bdtW/fv3V1FRkZo3b67U1NSw5pXO/Vp75ZVXAs/58ccfNXjwYBUVFSkj\nI0Px8fGaNWuWHn/88cBKgGuvvVa9e/fW2rVrdfjwYd199906efKknn76aUd/0HXv3l0DBw5UYmKi\nUlJSlJOT84vnBHv9RZq2bdtq9OjR2rZtm+Lj41W/fv1fnFOHDh30+eefa8CAATp16pQ6deqkpKQk\nhxKfXWxsrBITE5Weni7pTMf0388jUt4f5enRo4eGDh2qDh06qFatWnr44YeVmpqqm266SVWrVnU6\nXsjO9XMgOTnZ6WgoB5+CjAu2dOlS3XLLLapVq5ZmzpypuLg4zy4jZQkpos327ds1c+ZM5efn6/Tp\n00pJSVFKSorGjRvHUAls59h1UuAdtWvX1tChQ1W1alUlJydrypQpTkcCUEmuueYazZ8/v8xj3333\nneLi4hxKhGhCJwUAAEQkLosPAAAiEkUKAACISBQpAAAgIlGkAACAiESRAgAAIhJFCgAAiEj/D+dK\nvCTma2foAAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sb.heatmap(titanic_dmy.corr()) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fare and Pclass are not independent of each other, so I am going to drop these."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Survived | \n",
" Age | \n",
" SibSp | \n",
" Parch | \n",
" male | \n",
" Q | \n",
" S | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 0 | \n",
" 22.0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 1 | \n",
" 1 | \n",
" 38.0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" | 2 | \n",
" 1 | \n",
" 26.0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 3 | \n",
" 1 | \n",
" 35.0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" | 4 | \n",
" 0 | \n",
" 35.0 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Survived Age SibSp Parch male Q S\n",
"0 0 22.0 1 0 1 0 1\n",
"1 1 38.0 1 0 0 0 0\n",
"2 1 26.0 0 0 0 0 1\n",
"3 1 35.0 1 0 0 0 1\n",
"4 0 35.0 0 0 1 0 1"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titanic_dmy.drop(['Fare', 'Pclass'],axis=1,inplace=True)\n",
"titanic_dmy.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Checking that your dataset size is sufficient\n",
"We have 6 predictive features that remain. The rule of thumb is 50 records per feature... so we need to have at least 300 records in this dataset. Let's check again."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Int64Index: 889 entries, 0 to 890\n",
"Data columns (total 7 columns):\n",
"Survived 889 non-null int64\n",
"Age 889 non-null float64\n",
"SibSp 889 non-null int64\n",
"Parch 889 non-null int64\n",
"male 889 non-null uint8\n",
"Q 889 non-null uint8\n",
"S 889 non-null uint8\n",
"dtypes: float64(1), int64(3), uint8(3)\n",
"memory usage: 37.3 KB\n"
]
}
],
"source": [
"titanic_dmy.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, we have 889 records so we are fine."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\piers\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:1: DeprecationWarning: \n",
".ix is deprecated. Please use\n",
".loc for label based indexing or\n",
".iloc for positional indexing\n",
"\n",
"See the documentation here:\n",
"http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix\n",
" \"\"\"Entry point for launching an IPython kernel.\n"
]
}
],
"source": [
"X = titanic_dmy.ix[:,(1,2,3,4,5,6)].values\n",
"y = titanic_dmy.ix[:,0].values"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state=25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploying and evaluating the model"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
" penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
" verbose=0, warm_start=False)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LogReg = LogisticRegression()\n",
"LogReg.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y_pred = LogReg.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[137, 27],\n",
" [ 34, 69]])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"confusion_matrix = confusion_matrix(y_test, y_pred)\n",
"confusion_matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results from the confusion matrix are telling us that 137 and 69 are the number of correct predictions. 34 and 27 are the number of incorrect predictions."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.80 0.84 0.82 164\n",
" 1 0.72 0.67 0.69 103\n",
"\n",
"avg / total 0.77 0.77 0.77 267\n",
"\n"
]
}
],
"source": [
"print(classification_report(y_test, y_pred))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}