{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://github.com/dc-aihub/dc-aihub.github.io/blob/master/img/ai-logo-transparent-banner.png?raw=true\" \n",
    "alt=\"Ai/Hub Logo\"/>\n",
    "\n",
    "<h1 style=\"text-align:center;color:#0B8261;\"><center>Data Science</center></h1>\n",
    "<h1 style=\"text-align:center;\"><center>Lesson 8</center></h1>\n",
    "<h1 style=\"text-align:center;\"><center>Preparing Data for Consumption</center></h1>\n",
    "\n",
    "<hr />\n",
    "\n",
    "<center><a href=\"#Data-Wrangling\">Data Wrangling</a></center>\n",
    "\n",
    "<center><a href=\"#Import-Required\">Import Required Libraries</a></center>\n",
    "\n",
    "<center><a href=\"#Load-Data-Modelling\">Load Data Modelling Libraries</a></center>\n",
    "\n",
    "<center><a href=\"#Meet-Greet-Data\">Meet and Greet Data</a></center>\n",
    "\n",
    "<center><a href=\"#4-Cs\">The 4 C's of Data Cleaning</a></center>\n",
    "\n",
    "<center><a href=\"#Clean-Data\">Clean the Data</a></center>\n",
    "\n",
    "<center><a href=\"#Convert-Formats\">Convert Formats</a></center>\n",
    "\n",
    "<center><a href=\"#Double-Check\">Double Check Cleaned Data</a></center>\n",
    "\n",
    "<center><a href=\"#Split-Train-Test\">Split Training and Testing Data</a></center>\n",
    "\n",
    "<hr />\n",
    "\n",
    "<center>***Original Tutorial:*** <br/>https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\">\n",
    "OVERVIEW\n",
    "</div>\n",
    "\n",
    "<center style=\"color:#0B8261;\">\n",
    "Usually, there are a few extra steps to preparing data in a real life scenario, however for the purposes of this exercise, we are given the data on a golden plater, so we only really have to do one step before moving forward. Please familiarize yourself with the steps below as you will need to understand what they are and why they would be used in real life.\n",
    "</center>\n",
    "\n",
    "<br/>\n",
    "\n",
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Data-Wrangling\">\n",
    "DATA WRANGLING (a.k.a. data munging)\n",
    "</div>\n",
    "\n",
    "- ** *Data Architecture* **: Data architecture is a set of rules, policies, standards and models that govern and define the type of data collected and how it is used, stored, managed and integrated within an organization and its database systems.\n",
    "\n",
    "\n",
    "- ** *Data Governance* **: Data governance is a defined process an organization follows to ensure that high quality data exists throughout the complete lifecycle. The key focus areas of data governance include availability, usability, integrity and security.\n",
    "\n",
    "\n",
    "- ** *Data Extraction* **: Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).\n",
    "\n",
    "\n",
    "- ** *Data Cleaning (Cleansing)* **: Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.\n",
    "\n",
    "\n",
    "<div style=\"background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;\">\n",
    "NOTE\n",
    "</div>\n",
    ">- There are a few other steps, but we won't touch on them for these lessons.\n",
    ">- The main one we're interested in is the Data Cleaning/Cleansing step.\n",
    "\n",
    "<img src=\"http://i2.wp.com/rocketsci.azurewebsites.net/wp-content/uploads/2015/07/data-wrangling-1.jpg\" alt=\"Data Wrangling\"/>\n",
    "\n",
    "<hr />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Import-Required\">\n",
    "IMPORT REQUIRED LIBRARIES\n",
    "</div>\n",
    "\n",
    "The following code is written in Python 3.x. Libraries provide pre-written functionality to perform necessary tasks. The idea is why write ten lines of code, when you can write one line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python version: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]\n",
      "pandas version: 0.20.3\n",
      "matplotlib version: 2.1.0\n",
      "NumPy version: 1.13.3\n",
      "SciPy version: 0.19.1\n",
      "IPython version: 6.1.0\n",
      "scikit-learn version: 0.19.1\n",
      "-------------------------\n",
      "titanic_test.csv\n",
      "titanic_train.csv\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# This Python 3 environment comes with many helpful analytics libraries installed\n",
    "# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n",
    "\n",
    "#load packages\n",
    "import sys #access to system parameters https://docs.python.org/3/library/sys.html\n",
    "print(\"Python version: {}\". format(sys.version))\n",
    "\n",
    "import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL like features\n",
    "print(\"pandas version: {}\". format(pd.__version__))\n",
    "\n",
    "import matplotlib #collection of functions for scientific and publication-ready visualization\n",
    "print(\"matplotlib version: {}\". format(matplotlib.__version__))\n",
    "\n",
    "import numpy as np #foundational package for scientific computing\n",
    "print(\"NumPy version: {}\". format(np.__version__))\n",
    "\n",
    "import scipy as sp #collection of functions for scientific computing and advance mathematics\n",
    "print(\"SciPy version: {}\". format(sp.__version__)) \n",
    "\n",
    "import IPython\n",
    "from IPython import display #pretty printing of dataframes in Jupyter notebook\n",
    "print(\"IPython version: {}\". format(IPython.__version__)) \n",
    "\n",
    "import sklearn #collection of machine learning algorithms\n",
    "print(\"scikit-learn version: {}\". format(sklearn.__version__))\n",
    "\n",
    "#misc libraries\n",
    "import random\n",
    "import time\n",
    "\n",
    "\n",
    "#ignore warnings\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "print('-'*25)\n",
    "\n",
    "\n",
    "\n",
    "# Input data files are available in the \"../input/\" directory.\n",
    "# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory\n",
    "\n",
    "from subprocess import check_output\n",
    "print(check_output([\"ls\", \"../data\"]).decode(\"utf8\"))\n",
    "\n",
    "# Any results you write to the current directory are saved as output."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Load-Data-Modelling\">\n",
    "LOAD DATA MODELLING LIBRARIES\n",
    "</div>\n",
    "\n",
    "We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and implemented in their own classes. For data visualization, we will use the matplotlib and seaborn library. Below are common classes to load.\n",
    "\n",
    "**Note:** You will have to run the following command in a command prompt WITHIN THIS FOLDER before proceeding. This will install xgboost as it does not exist within the Anaconda package you install initially.\n",
    "\n",
    "```> pip install .\\xgboost-0.7-cp36-cp36m-win_amd64.whl```\n",
    "\n",
    "<p style=\"text-align:right\">*the .whl (wheel) file is contained within this folder*</p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#Common Model Algorithms\n",
    "from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process\n",
    "from xgboost import XGBClassifier\n",
    "\n",
    "#Common Model Helpers\n",
    "from sklearn.preprocessing import OneHotEncoder, LabelEncoder\n",
    "from sklearn import feature_selection\n",
    "from sklearn import model_selection\n",
    "from sklearn import metrics\n",
    "\n",
    "#Visualization\n",
    "import matplotlib as mpl\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.pylab as pylab\n",
    "import seaborn as sns\n",
    "from pandas.tools.plotting import scatter_matrix\n",
    "\n",
    "#Configure Visualization Defaults\n",
    "#%matplotlib inline = show plots in Jupyter Notebook browser\n",
    "%matplotlib inline\n",
    "# The following are just some configuration steps to make sure our graphs look good.\n",
    "mpl.style.use('ggplot')\n",
    "sns.set_style('white')\n",
    "pylab.rcParams['figure.figsize'] = 12,8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Meet-Greet-Data\">\n",
    "MEET &AMP; GREET DATA\n",
    "</div>\n",
    "\n",
    "This is the meet and greet step. Get to know your data by first name and learn a little bit about it. What does it look like (datatype and values), what makes it tick (independent/feature variables(s)), what's its goals in life (dependent/target variable(s)). Think of it like a first date, before you jump in and start poking it in the bedroom.\n",
    "\n",
    "To begin this step, we first import our data. Next we use the info() and sample() function, to get a quick and dirty overview of variable datatypes (i.e. qualitative vs quantitative). Click https://www.kaggle.com/c/titanic/data for the Source Data Dictionary.\n",
    "\n",
    "<br/>\n",
    "\n",
    "<div style=\"\">\n",
    "<ol style=\"max-width:600px\">\n",
    "<li>The Survived variable is our outcome or dependent variable. It is a binary nominal datatype of 1 for survived and 0 for did not survive. All other variables are potential predictor or independent variables. **It's important to note, more predictor variables do not make a better model, but the right variables.**</li><br/>\n",
    "\n",
    "<li>The *PassengerID* and *Ticket*  variables are assumed to be random unique identifiers, that have no impact on the outcome variable. Thus, they will be excluded from analysis.</li><br/>\n",
    "\n",
    "<li>The *Pclass* variable is an ordinal datatype for the ticket class, a proxy for socio-economic status (SES), representing 1 = upper class, 2 = middle class, and 3 = lower class.</li><br/>\n",
    "\n",
    "<li>The *Name* variable is a nominal datatype. It could be used in feature engineering to derive the gender from title, family size from surname, and SES from titles like doctor or master. Since these variables already exist, we'll make use of it to see if title, like master, makes a difference.</li><br/>\n",
    "\n",
    "<li>The *Sex* and *Embarked* variables are a nominal datatype. They will be converted to dummy variables for mathematical calculations.</li><br/>\n",
    "\n",
    "<li>The *Age* and *Fare* variable are continuous quantitative datatypes.</li><br/>\n",
    "\n",
    "<li>The *SibSp* represents number of related siblings/spouse aboard and *Parch* represents number of related parents/children aboard. Both are discrete quantitative datatypes. This can be used for feature engineering to create a family size and is alone variable.</li><br/>\n",
    "\n",
    "<li>The *Cabin* variable is a nominal datatype that can be used in feature engineering for approximate position on ship when the incident occurred and SES from deck levels. However, since there are many null values, it does not add value and thus is excluded from analysis.</li>\n",
    "</ol>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 891 entries, 0 to 890\n",
      "Data columns (total 12 columns):\n",
      "PassengerId    891 non-null int64\n",
      "Survived       891 non-null int64\n",
      "Pclass         891 non-null int64\n",
      "Name           891 non-null object\n",
      "Sex            891 non-null object\n",
      "Age            714 non-null float64\n",
      "SibSp          891 non-null int64\n",
      "Parch          891 non-null int64\n",
      "Ticket         891 non-null object\n",
      "Fare           891 non-null float64\n",
      "Cabin          204 non-null object\n",
      "Embarked       889 non-null object\n",
      "dtypes: float64(2), int64(5), object(5)\n",
      "memory usage: 83.6+ KB\n",
      "None\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>237</th>\n",
       "      <td>238</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>Collyer, Miss. Marjorie \"Lottie\"</td>\n",
       "      <td>female</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>C.A. 31921</td>\n",
       "      <td>26.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>225</th>\n",
       "      <td>226</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Berglund, Mr. Karl Ivar Sven</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>PP 4348</td>\n",
       "      <td>9.3500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>97</th>\n",
       "      <td>98</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Greenfield, Mr. William Bertram</td>\n",
       "      <td>male</td>\n",
       "      <td>23.0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>PC 17759</td>\n",
       "      <td>63.3583</td>\n",
       "      <td>D10 D12</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>808</th>\n",
       "      <td>809</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Meyer, Mr. August</td>\n",
       "      <td>male</td>\n",
       "      <td>39.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>248723</td>\n",
       "      <td>13.0000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>711</th>\n",
       "      <td>712</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Klaber, Mr. Herman</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>113028</td>\n",
       "      <td>26.5500</td>\n",
       "      <td>C124</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>94</th>\n",
       "      <td>95</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Coxon, Mr. Daniel</td>\n",
       "      <td>male</td>\n",
       "      <td>59.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>364500</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>466</th>\n",
       "      <td>467</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Campbell, Mr. William</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>239853</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>388</th>\n",
       "      <td>389</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Sadlier, Mr. Matthew</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>367655</td>\n",
       "      <td>7.7292</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Q</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>594</th>\n",
       "      <td>595</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Chapman, Mr. John Henry</td>\n",
       "      <td>male</td>\n",
       "      <td>37.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>SC/AH 29037</td>\n",
       "      <td>26.0000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>307</th>\n",
       "      <td>308</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Penasco y Castellana, Mrs. Victor de Satode (M...</td>\n",
       "      <td>female</td>\n",
       "      <td>17.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17758</td>\n",
       "      <td>108.9000</td>\n",
       "      <td>C65</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     PassengerId  Survived  Pclass  \\\n",
       "237          238         1       2   \n",
       "225          226         0       3   \n",
       "97            98         1       1   \n",
       "808          809         0       2   \n",
       "711          712         0       1   \n",
       "94            95         0       3   \n",
       "466          467         0       2   \n",
       "388          389         0       3   \n",
       "594          595         0       2   \n",
       "307          308         1       1   \n",
       "\n",
       "                                                  Name     Sex   Age  SibSp  \\\n",
       "237                   Collyer, Miss. Marjorie \"Lottie\"  female   8.0      0   \n",
       "225                       Berglund, Mr. Karl Ivar Sven    male  22.0      0   \n",
       "97                     Greenfield, Mr. William Bertram    male  23.0      0   \n",
       "808                                  Meyer, Mr. August    male  39.0      0   \n",
       "711                                 Klaber, Mr. Herman    male   NaN      0   \n",
       "94                                   Coxon, Mr. Daniel    male  59.0      0   \n",
       "466                              Campbell, Mr. William    male   NaN      0   \n",
       "388                               Sadlier, Mr. Matthew    male   NaN      0   \n",
       "594                            Chapman, Mr. John Henry    male  37.0      1   \n",
       "307  Penasco y Castellana, Mrs. Victor de Satode (M...  female  17.0      1   \n",
       "\n",
       "     Parch       Ticket      Fare    Cabin Embarked  \n",
       "237      2   C.A. 31921   26.2500      NaN        S  \n",
       "225      0      PP 4348    9.3500      NaN        S  \n",
       "97       1     PC 17759   63.3583  D10 D12        C  \n",
       "808      0       248723   13.0000      NaN        S  \n",
       "711      0       113028   26.5500     C124        S  \n",
       "94       0       364500    7.2500      NaN        S  \n",
       "466      0       239853    0.0000      NaN        S  \n",
       "388      0       367655    7.7292      NaN        Q  \n",
       "594      0  SC/AH 29037   26.0000      NaN        S  \n",
       "307      0     PC 17758  108.9000      C65        C  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#import data from file: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html\n",
    "data_raw = pd.read_csv('../data/titanic_train.csv')\n",
    "\n",
    "\n",
    "#a dataset should be broken into 3 splits: train, test, and (final) validation\n",
    "#the test file provided is the validation file for competition submission\n",
    "#we will split the train set into train and test data in future sections\n",
    "data_val  = pd.read_csv('../data/titanic_test.csv')\n",
    "\n",
    "\n",
    "#to play with our data we'll create a copy\n",
    "#remember python assignment or equal passes by reference vs values, so we use the copy function: https://stackoverflow.com/questions/46327494/python-pandas-dataframe-copydeep-false-vs-copydeep-true-vs\n",
    "data1 = data_raw.copy(deep = True)\n",
    "\n",
    "#however passing by reference is convenient, because we can clean both datasets at once\n",
    "data_cleaner = [data1, data_val]\n",
    "\n",
    "\n",
    "#preview data\n",
    "print (data_raw.info()) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html\n",
    "#data_raw.head() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html\n",
    "#data_raw.tail() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html\n",
    "data_raw.sample(10) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"4-Cs\">\n",
    "THE 4 C's OF DATA CLEANING\n",
    "</div>\n",
    "\n",
    "In this stage, we will clean our data by 1) correcting aberrant values and outliers, 2) completing missing information, 3) creating new features for analysis, and 4) converting fields to the correct format for calculations and presentation.\n",
    "\n",
    "<br/>\n",
    "\n",
    "<div style=\"\">\n",
    "<ol style=\"\">\n",
    "<li>**Correcting**: Reviewing the data, there does not appear to be any aberrant or non-acceptable data inputs. In addition, we see we may have potential outliers in age and fare. However, since they are reasonable values, we will wait until after we complete our exploratory analysis to determine if we should include or exclude from the dataset. It should be noted, that if they were unreasonable values, for example age = 800 instead of 80, then it's probably a safe decision to fix now. However, we want to use caution when we modify data from its original value, because it may be necessary to create an accurate model.</li><br/>\n",
    "\n",
    "<li>**Completing**: There are null values or missing data in the age, cabin, and embarked field. Missing values can be bad, because some algorithms don't know how-to handle null values and will fail. While others, like decision trees, can handle null values. Thus, it's important to fix before we start modeling, because we will compare and contrast several models. There are two common methods, either delete the record or populate the missing value using a reasonable input. It is not recommended to delete the record, especially a large percentage of records, unless it truly represents an incomplete record. Instead, it's best to impute missing values. A basic methodology for qualitative data is impute using mode. A basic methodology for quantitative data is impute using mean, median, or mean + randomized standard deviation. An intermediate methodology is to use the basic methodology based on specific criteria; like the average age by class or embark port by fare and SES. There are more complex methodologies, however before deploying, it should be compared to the base model to determine if complexity truly adds value. For this dataset, age will be imputed with the median, the cabin attribute will be dropped, and embark will be imputed with mode. Subsequent model iterations may modify this decision to determine if it improves the model’s accuracy.</li><br/>\n",
    "\n",
    "<li>**Creating**: Feature engineering is when we use existing features to create new features to determine if they provide new signals to predict our outcome. For this dataset, we will create a title feature to determine if it played a role in survival.</li><br/>\n",
    "\n",
    "<li>**Converting**: Last, but certainly not least, we'll deal with formatting. There are no date or currency formats, but datatype formats. Our categorical data imported as objects, which makes it difficult for mathematical calculations. For this dataset, we will convert object datatypes to categorical dummy variables.</li>\n",
    "</ol>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train columns with null values:\n",
      " PassengerId      0\n",
      "Survived         0\n",
      "Pclass           0\n",
      "Name             0\n",
      "Sex              0\n",
      "Age            177\n",
      "SibSp            0\n",
      "Parch            0\n",
      "Ticket           0\n",
      "Fare             0\n",
      "Cabin          687\n",
      "Embarked         2\n",
      "dtype: int64\n",
      "--------------------\n",
      "Test/Validation columns with null values:\n",
      " PassengerId      0\n",
      "Pclass           0\n",
      "Name             0\n",
      "Sex              0\n",
      "Age             86\n",
      "SibSp            0\n",
      "Parch            0\n",
      "Ticket           0\n",
      "Fare             1\n",
      "Cabin          327\n",
      "Embarked         0\n",
      "dtype: int64\n",
      "--------------------\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>891.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>714.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>891</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>204</td>\n",
       "      <td>889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>891</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>681</td>\n",
       "      <td>NaN</td>\n",
       "      <td>147</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Hickman, Mr. Lewis</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>347082</td>\n",
       "      <td>NaN</td>\n",
       "      <td>C23 C25 C27</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>577</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>644</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>446.000000</td>\n",
       "      <td>0.383838</td>\n",
       "      <td>2.308642</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>29.699118</td>\n",
       "      <td>0.523008</td>\n",
       "      <td>0.381594</td>\n",
       "      <td>NaN</td>\n",
       "      <td>32.204208</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>257.353842</td>\n",
       "      <td>0.486592</td>\n",
       "      <td>0.836071</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.526497</td>\n",
       "      <td>1.102743</td>\n",
       "      <td>0.806057</td>\n",
       "      <td>NaN</td>\n",
       "      <td>49.693429</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.420000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>223.500000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.125000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.910400</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>446.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>28.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.454200</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>668.500000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>891.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>80.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>512.329200</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        PassengerId    Survived      Pclass                Name   Sex  \\\n",
       "count    891.000000  891.000000  891.000000                 891   891   \n",
       "unique          NaN         NaN         NaN                 891     2   \n",
       "top             NaN         NaN         NaN  Hickman, Mr. Lewis  male   \n",
       "freq            NaN         NaN         NaN                   1   577   \n",
       "mean     446.000000    0.383838    2.308642                 NaN   NaN   \n",
       "std      257.353842    0.486592    0.836071                 NaN   NaN   \n",
       "min        1.000000    0.000000    1.000000                 NaN   NaN   \n",
       "25%      223.500000    0.000000    2.000000                 NaN   NaN   \n",
       "50%      446.000000    0.000000    3.000000                 NaN   NaN   \n",
       "75%      668.500000    1.000000    3.000000                 NaN   NaN   \n",
       "max      891.000000    1.000000    3.000000                 NaN   NaN   \n",
       "\n",
       "               Age       SibSp       Parch  Ticket        Fare        Cabin  \\\n",
       "count   714.000000  891.000000  891.000000     891  891.000000          204   \n",
       "unique         NaN         NaN         NaN     681         NaN          147   \n",
       "top            NaN         NaN         NaN  347082         NaN  C23 C25 C27   \n",
       "freq           NaN         NaN         NaN       7         NaN            4   \n",
       "mean     29.699118    0.523008    0.381594     NaN   32.204208          NaN   \n",
       "std      14.526497    1.102743    0.806057     NaN   49.693429          NaN   \n",
       "min       0.420000    0.000000    0.000000     NaN    0.000000          NaN   \n",
       "25%      20.125000    0.000000    0.000000     NaN    7.910400          NaN   \n",
       "50%      28.000000    0.000000    0.000000     NaN   14.454200          NaN   \n",
       "75%      38.000000    1.000000    0.000000     NaN   31.000000          NaN   \n",
       "max      80.000000    8.000000    6.000000     NaN  512.329200          NaN   \n",
       "\n",
       "       Embarked  \n",
       "count       889  \n",
       "unique        3  \n",
       "top           S  \n",
       "freq        644  \n",
       "mean        NaN  \n",
       "std         NaN  \n",
       "min         NaN  \n",
       "25%         NaN  \n",
       "50%         NaN  \n",
       "75%         NaN  \n",
       "max         NaN  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('Train columns with null values:\\n', data1.isnull().sum())\n",
    "print(\"-\"*20)\n",
    "\n",
    "print('Test/Validation columns with null values:\\n', data_val.isnull().sum())\n",
    "print(\"-\"*20)\n",
    "\n",
    "data_raw.describe(include = 'all')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Clean-Data\">\n",
    "CLEAN THE DATA\n",
    "</div>\n",
    "\n",
    "Now that we know what to clean, let's execute our code.\n",
    "\n",
    "### Some helpful documentation:\n",
    "\n",
    "<ul>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html\">pandas.DataFrame</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html\">pandas.DataFrame.info</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html\">pandas.DataFrame.describe</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/indexing.html\">Indexing and Selecting Data</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html\">pandas.isnull</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html\">pandas.DataFrame.sum</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html\">pandas.DataFrame.mode</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html\">pandas.DataFrame.copy</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html\">pandas.DataFrame.fillna</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html\">pandas.DataFrame.drop</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html\">pandas.Series.value_counts</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html\">pandas.DataFrame.loc</a></li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Survived    0\n",
      "Pclass      0\n",
      "Name        0\n",
      "Sex         0\n",
      "Age         0\n",
      "SibSp       0\n",
      "Parch       0\n",
      "Fare        0\n",
      "Embarked    0\n",
      "dtype: int64\n",
      "----------\n",
      "PassengerId      0\n",
      "Pclass           0\n",
      "Name             0\n",
      "Sex              0\n",
      "Age              0\n",
      "SibSp            0\n",
      "Parch            0\n",
      "Ticket           0\n",
      "Fare             0\n",
      "Cabin          327\n",
      "Embarked         0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "###COMPLETING: complete or delete missing values in train and test/validation dataset\n",
    "for dataset in data_cleaner:    \n",
    "    #complete missing age with median\n",
    "    dataset['Age'].fillna(dataset['Age'].median(), inplace = True)\n",
    "\n",
    "    #complete embarked with mode\n",
    "    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)\n",
    "\n",
    "    #complete missing fare with median\n",
    "    dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)\n",
    "    \n",
    "###DELETING: delete the cabin feature/column and others previously stated to exclude in train dataset\n",
    "drop_column = ['PassengerId','Cabin', 'Ticket']\n",
    "data1.drop(drop_column, axis=1, inplace = True)\n",
    "\n",
    "print(data1.isnull().sum())\n",
    "print(\"-\"*20)\n",
    "print(data_val.isnull().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mr        517\n",
      "Miss      182\n",
      "Mrs       125\n",
      "Master     40\n",
      "Misc       27\n",
      "Name: Title, dtype: int64\n",
      "--------------------\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 891 entries, 0 to 890\n",
      "Data columns (total 14 columns):\n",
      "Survived      891 non-null int64\n",
      "Pclass        891 non-null int64\n",
      "Name          891 non-null object\n",
      "Sex           891 non-null object\n",
      "Age           891 non-null float64\n",
      "SibSp         891 non-null int64\n",
      "Parch         891 non-null int64\n",
      "Fare          891 non-null float64\n",
      "Embarked      891 non-null object\n",
      "FamilySize    891 non-null int64\n",
      "IsAlone       891 non-null int64\n",
      "Title         891 non-null object\n",
      "FareBin       891 non-null category\n",
      "AgeBin        891 non-null category\n",
      "dtypes: category(2), float64(2), int64(6), object(4)\n",
      "memory usage: 85.5+ KB\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 418 entries, 0 to 417\n",
      "Data columns (total 16 columns):\n",
      "PassengerId    418 non-null int64\n",
      "Pclass         418 non-null int64\n",
      "Name           418 non-null object\n",
      "Sex            418 non-null object\n",
      "Age            418 non-null float64\n",
      "SibSp          418 non-null int64\n",
      "Parch          418 non-null int64\n",
      "Ticket         418 non-null object\n",
      "Fare           418 non-null float64\n",
      "Cabin          91 non-null object\n",
      "Embarked       418 non-null object\n",
      "FamilySize     418 non-null int64\n",
      "IsAlone        418 non-null int64\n",
      "Title          418 non-null object\n",
      "FareBin        418 non-null category\n",
      "AgeBin         418 non-null category\n",
      "dtypes: category(2), float64(2), int64(6), object(6)\n",
      "memory usage: 46.8+ KB\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Embarked</th>\n",
       "      <th>FamilySize</th>\n",
       "      <th>IsAlone</th>\n",
       "      <th>Title</th>\n",
       "      <th>FareBin</th>\n",
       "      <th>AgeBin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>354</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Yousif, Mr. Wazli</td>\n",
       "      <td>male</td>\n",
       "      <td>28.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.2250</td>\n",
       "      <td>C</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(-0.001, 7.91]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>778</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Kilgannon, Mr. Thomas J</td>\n",
       "      <td>male</td>\n",
       "      <td>28.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.7375</td>\n",
       "      <td>Q</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(-0.001, 7.91]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>598</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Boulos, Mr. Hanna</td>\n",
       "      <td>male</td>\n",
       "      <td>28.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.2250</td>\n",
       "      <td>C</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(-0.001, 7.91]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>738</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Ivanoff, Mr. Kanio</td>\n",
       "      <td>male</td>\n",
       "      <td>28.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.8958</td>\n",
       "      <td>S</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(-0.001, 7.91]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>425</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Wiseman, Mr. Phillippe</td>\n",
       "      <td>male</td>\n",
       "      <td>28.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>S</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(-0.001, 7.91]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>729</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Ilmakangas, Miss. Pieta Sofia</td>\n",
       "      <td>female</td>\n",
       "      <td>25.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>S</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>Miss</td>\n",
       "      <td>(7.91, 14.454]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>557</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Robbins, Mr. Victor</td>\n",
       "      <td>male</td>\n",
       "      <td>28.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>227.5250</td>\n",
       "      <td>C</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(31.0, 512.329]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>Andrew, Mr. Edgardo Samuel</td>\n",
       "      <td>male</td>\n",
       "      <td>18.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>11.5000</td>\n",
       "      <td>S</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(7.91, 14.454]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>214</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Kiernan, Mr. Philip</td>\n",
       "      <td>male</td>\n",
       "      <td>28.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>7.7500</td>\n",
       "      <td>Q</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(-0.001, 7.91]</td>\n",
       "      <td>(16.0, 32.0]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>174</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Smith, Mr. James Clinch</td>\n",
       "      <td>male</td>\n",
       "      <td>56.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>30.6958</td>\n",
       "      <td>C</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr</td>\n",
       "      <td>(14.454, 31.0]</td>\n",
       "      <td>(48.0, 64.0]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Survived  Pclass                           Name     Sex   Age  SibSp  \\\n",
       "354         0       3              Yousif, Mr. Wazli    male  28.0      0   \n",
       "778         0       3        Kilgannon, Mr. Thomas J    male  28.0      0   \n",
       "598         0       3              Boulos, Mr. Hanna    male  28.0      0   \n",
       "738         0       3             Ivanoff, Mr. Kanio    male  28.0      0   \n",
       "425         0       3         Wiseman, Mr. Phillippe    male  28.0      0   \n",
       "729         0       3  Ilmakangas, Miss. Pieta Sofia  female  25.0      1   \n",
       "557         0       1            Robbins, Mr. Victor    male  28.0      0   \n",
       "144         0       2     Andrew, Mr. Edgardo Samuel    male  18.0      0   \n",
       "214         0       3            Kiernan, Mr. Philip    male  28.0      1   \n",
       "174         0       1        Smith, Mr. James Clinch    male  56.0      0   \n",
       "\n",
       "     Parch      Fare Embarked  FamilySize  IsAlone Title          FareBin  \\\n",
       "354      0    7.2250        C           1        1    Mr   (-0.001, 7.91]   \n",
       "778      0    7.7375        Q           1        1    Mr   (-0.001, 7.91]   \n",
       "598      0    7.2250        C           1        1    Mr   (-0.001, 7.91]   \n",
       "738      0    7.8958        S           1        1    Mr   (-0.001, 7.91]   \n",
       "425      0    7.2500        S           1        1    Mr   (-0.001, 7.91]   \n",
       "729      0    7.9250        S           2        0  Miss   (7.91, 14.454]   \n",
       "557      0  227.5250        C           1        1    Mr  (31.0, 512.329]   \n",
       "144      0   11.5000        S           1        1    Mr   (7.91, 14.454]   \n",
       "214      0    7.7500        Q           2        0    Mr   (-0.001, 7.91]   \n",
       "174      0   30.6958        C           1        1    Mr   (14.454, 31.0]   \n",
       "\n",
       "           AgeBin  \n",
       "354  (16.0, 32.0]  \n",
       "778  (16.0, 32.0]  \n",
       "598  (16.0, 32.0]  \n",
       "738  (16.0, 32.0]  \n",
       "425  (16.0, 32.0]  \n",
       "729  (16.0, 32.0]  \n",
       "557  (16.0, 32.0]  \n",
       "144  (16.0, 32.0]  \n",
       "214  (16.0, 32.0]  \n",
       "174  (48.0, 64.0]  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "###CREATE: Feature Engineering for train and test/validation dataset\n",
    "for dataset in data_cleaner:    \n",
    "    #Discrete variables\n",
    "    dataset['FamilySize'] = dataset ['SibSp'] + dataset['Parch'] + 1\n",
    "\n",
    "    dataset['IsAlone'] = 1 #initialize to yes/1 is alone\n",
    "    dataset['IsAlone'].loc[dataset['FamilySize'] > 1] = 0 # now update to no/0 if family size is greater than 1\n",
    "\n",
    "    #quick and dirty code split title from name: http://www.pythonforbeginners.com/dictionary/python-split\n",
    "    dataset['Title'] = dataset['Name'].str.split(\", \", expand=True)[1].str.split(\".\", expand=True)[0]\n",
    "\n",
    "\n",
    "    #Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut\n",
    "    #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html\n",
    "    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)\n",
    "\n",
    "    #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html\n",
    "    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)\n",
    "\n",
    "\n",
    "    \n",
    "#cleanup rare title names\n",
    "#print(data1['Title'].value_counts())\n",
    "stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/\n",
    "title_names = (data1['Title'].value_counts() < stat_min) #this will create a true false series with title name as index\n",
    "\n",
    "#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/\n",
    "data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)\n",
    "print(data1['Title'].value_counts())\n",
    "print(\"-\"*20)\n",
    "\n",
    "\n",
    "#preview data again\n",
    "data1.info()\n",
    "data_val.info()\n",
    "data1.sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Convert-Formats\">\n",
    "CONVERT FORMATS\n",
    "</div>\n",
    "\n",
    "We will convert categorical data to dummy variables for mathematical analysis. There are multiple ways to encode categorical variables; we will use the sklearn and pandas functions.\n",
    "\n",
    "In this step, we will also define our x (independent/features/explanatory/predictor/etc.) and y (dependent/target/outcome/response/etc.) variables for data modeling.\n",
    "\n",
    "### Some helpful documentation:\n",
    "\n",
    "<ul>\n",
    "<li><a href=\"http://pbpython.com/categorical-encoding.html\">Categorical Encoding</a></li>\n",
    "<li><a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html\">Sklearn LabelEncoder</a></li>\n",
    "<li><a href=\"http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html\">Sklearn OneHotEncoder</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/categorical.html\">Pandas Categorical dtype</a></li>\n",
    "<li><a href=\"https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html\">pandas.get_dummies</a></li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original X Y:  ['Survived', 'Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] \n",
      "\n",
      "Bin X Y:  ['Survived', 'Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code'] \n",
      "\n",
      "Dummy X Y:  ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_Master', 'Title_Misc', 'Title_Miss', 'Title_Mr', 'Title_Mrs'] \n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Pclass</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "      <th>FamilySize</th>\n",
       "      <th>IsAlone</th>\n",
       "      <th>Sex_female</th>\n",
       "      <th>Sex_male</th>\n",
       "      <th>Embarked_C</th>\n",
       "      <th>Embarked_Q</th>\n",
       "      <th>Embarked_S</th>\n",
       "      <th>Title_Master</th>\n",
       "      <th>Title_Misc</th>\n",
       "      <th>Title_Miss</th>\n",
       "      <th>Title_Mr</th>\n",
       "      <th>Title_Mrs</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Pclass  SibSp  Parch   Age     Fare  FamilySize  IsAlone  Sex_female  \\\n",
       "0       3      1      0  22.0   7.2500           2        0           0   \n",
       "1       1      1      0  38.0  71.2833           2        0           1   \n",
       "2       3      0      0  26.0   7.9250           1        1           1   \n",
       "3       1      1      0  35.0  53.1000           2        0           1   \n",
       "4       3      0      0  35.0   8.0500           1        1           0   \n",
       "\n",
       "   Sex_male  Embarked_C  Embarked_Q  Embarked_S  Title_Master  Title_Misc  \\\n",
       "0         1           0           0           1             0           0   \n",
       "1         0           1           0           0             0           0   \n",
       "2         0           0           0           1             0           0   \n",
       "3         0           0           0           1             0           0   \n",
       "4         1           0           0           1             0           0   \n",
       "\n",
       "   Title_Miss  Title_Mr  Title_Mrs  \n",
       "0           0         1          0  \n",
       "1           0         0          1  \n",
       "2           1         0          0  \n",
       "3           0         0          1  \n",
       "4           0         1          0  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#CONVERT: convert objects to category using Label Encoder for train and test/validation dataset\n",
    "\n",
    "#code categorical data\n",
    "label = LabelEncoder()\n",
    "for dataset in data_cleaner:    \n",
    "    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])\n",
    "    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])\n",
    "    dataset['Title_Code'] = label.fit_transform(dataset['Title'])\n",
    "    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])\n",
    "    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])\n",
    "\n",
    "\n",
    "#define y variable aka target/outcome\n",
    "Target = ['Survived']\n",
    "\n",
    "#define x variables for original features aka feature selection\n",
    "data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts\n",
    "data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation\n",
    "data1_xy =  Target + data1_x\n",
    "print('Original X Y: ', data1_xy, '\\n')\n",
    "\n",
    "\n",
    "#define x variables for original w/bin features to remove continuous variables\n",
    "data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']\n",
    "data1_xy_bin = Target + data1_x_bin\n",
    "print('Bin X Y: ', data1_xy_bin, '\\n')\n",
    "\n",
    "\n",
    "#define x and y variables for dummy features original\n",
    "data1_dummy = pd.get_dummies(data1[data1_x])\n",
    "data1_x_dummy = data1_dummy.columns.tolist()\n",
    "data1_xy_dummy = Target + data1_x_dummy\n",
    "print('Dummy X Y: ', data1_xy_dummy, '\\n')\n",
    "\n",
    "\n",
    "\n",
    "data1_dummy.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;\">\n",
    "STOP\n",
    "</div>\n",
    ">- Please be sure you understand the above code and what each step does. \n",
    ">- Read the comments and if you still don't understand, run each line of code individually."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Double-Check\">\n",
    "DOUBLE CHECK CLEANED DATA\n",
    "</div>\n",
    "\n",
    "Now that we've cleaned our data, let's do a discount da-double check!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train columns with null values: \n",
      " Survived         0\n",
      "Pclass           0\n",
      "Name             0\n",
      "Sex              0\n",
      "Age              0\n",
      "SibSp            0\n",
      "Parch            0\n",
      "Fare             0\n",
      "Embarked         0\n",
      "FamilySize       0\n",
      "IsAlone          0\n",
      "Title            0\n",
      "FareBin          0\n",
      "AgeBin           0\n",
      "Sex_Code         0\n",
      "Embarked_Code    0\n",
      "Title_Code       0\n",
      "AgeBin_Code      0\n",
      "FareBin_Code     0\n",
      "dtype: int64\n",
      "----------\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 891 entries, 0 to 890\n",
      "Data columns (total 19 columns):\n",
      "Survived         891 non-null int64\n",
      "Pclass           891 non-null int64\n",
      "Name             891 non-null object\n",
      "Sex              891 non-null object\n",
      "Age              891 non-null float64\n",
      "SibSp            891 non-null int64\n",
      "Parch            891 non-null int64\n",
      "Fare             891 non-null float64\n",
      "Embarked         891 non-null object\n",
      "FamilySize       891 non-null int64\n",
      "IsAlone          891 non-null int64\n",
      "Title            891 non-null object\n",
      "FareBin          891 non-null category\n",
      "AgeBin           891 non-null category\n",
      "Sex_Code         891 non-null int64\n",
      "Embarked_Code    891 non-null int64\n",
      "Title_Code       891 non-null int64\n",
      "AgeBin_Code      891 non-null int64\n",
      "FareBin_Code     891 non-null int64\n",
      "dtypes: category(2), float64(2), int64(11), object(4)\n",
      "memory usage: 120.3+ KB\n",
      "None\n",
      "----------\n",
      "Test/Validation columns with null values: \n",
      " PassengerId        0\n",
      "Pclass             0\n",
      "Name               0\n",
      "Sex                0\n",
      "Age                0\n",
      "SibSp              0\n",
      "Parch              0\n",
      "Ticket             0\n",
      "Fare               0\n",
      "Cabin            327\n",
      "Embarked           0\n",
      "FamilySize         0\n",
      "IsAlone            0\n",
      "Title              0\n",
      "FareBin            0\n",
      "AgeBin             0\n",
      "Sex_Code           0\n",
      "Embarked_Code      0\n",
      "Title_Code         0\n",
      "AgeBin_Code        0\n",
      "FareBin_Code       0\n",
      "dtype: int64\n",
      "----------\n",
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 418 entries, 0 to 417\n",
      "Data columns (total 21 columns):\n",
      "PassengerId      418 non-null int64\n",
      "Pclass           418 non-null int64\n",
      "Name             418 non-null object\n",
      "Sex              418 non-null object\n",
      "Age              418 non-null float64\n",
      "SibSp            418 non-null int64\n",
      "Parch            418 non-null int64\n",
      "Ticket           418 non-null object\n",
      "Fare             418 non-null float64\n",
      "Cabin            91 non-null object\n",
      "Embarked         418 non-null object\n",
      "FamilySize       418 non-null int64\n",
      "IsAlone          418 non-null int64\n",
      "Title            418 non-null object\n",
      "FareBin          418 non-null category\n",
      "AgeBin           418 non-null category\n",
      "Sex_Code         418 non-null int64\n",
      "Embarked_Code    418 non-null int64\n",
      "Title_Code       418 non-null int64\n",
      "AgeBin_Code      418 non-null int64\n",
      "FareBin_Code     418 non-null int64\n",
      "dtypes: category(2), float64(2), int64(11), object(6)\n",
      "memory usage: 63.1+ KB\n",
      "None\n",
      "----------\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>891.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>891</td>\n",
       "      <td>891</td>\n",
       "      <td>714.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>891</td>\n",
       "      <td>891.000000</td>\n",
       "      <td>204</td>\n",
       "      <td>889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>891</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>681</td>\n",
       "      <td>NaN</td>\n",
       "      <td>147</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Hickman, Mr. Lewis</td>\n",
       "      <td>male</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>347082</td>\n",
       "      <td>NaN</td>\n",
       "      <td>C23 C25 C27</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>577</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "      <td>644</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>446.000000</td>\n",
       "      <td>0.383838</td>\n",
       "      <td>2.308642</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>29.699118</td>\n",
       "      <td>0.523008</td>\n",
       "      <td>0.381594</td>\n",
       "      <td>NaN</td>\n",
       "      <td>32.204208</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>257.353842</td>\n",
       "      <td>0.486592</td>\n",
       "      <td>0.836071</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.526497</td>\n",
       "      <td>1.102743</td>\n",
       "      <td>0.806057</td>\n",
       "      <td>NaN</td>\n",
       "      <td>49.693429</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.420000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>223.500000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>20.125000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.910400</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>446.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>28.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>14.454200</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>668.500000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>38.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>891.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>80.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>512.329200</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        PassengerId    Survived      Pclass                Name   Sex  \\\n",
       "count    891.000000  891.000000  891.000000                 891   891   \n",
       "unique          NaN         NaN         NaN                 891     2   \n",
       "top             NaN         NaN         NaN  Hickman, Mr. Lewis  male   \n",
       "freq            NaN         NaN         NaN                   1   577   \n",
       "mean     446.000000    0.383838    2.308642                 NaN   NaN   \n",
       "std      257.353842    0.486592    0.836071                 NaN   NaN   \n",
       "min        1.000000    0.000000    1.000000                 NaN   NaN   \n",
       "25%      223.500000    0.000000    2.000000                 NaN   NaN   \n",
       "50%      446.000000    0.000000    3.000000                 NaN   NaN   \n",
       "75%      668.500000    1.000000    3.000000                 NaN   NaN   \n",
       "max      891.000000    1.000000    3.000000                 NaN   NaN   \n",
       "\n",
       "               Age       SibSp       Parch  Ticket        Fare        Cabin  \\\n",
       "count   714.000000  891.000000  891.000000     891  891.000000          204   \n",
       "unique         NaN         NaN         NaN     681         NaN          147   \n",
       "top            NaN         NaN         NaN  347082         NaN  C23 C25 C27   \n",
       "freq           NaN         NaN         NaN       7         NaN            4   \n",
       "mean     29.699118    0.523008    0.381594     NaN   32.204208          NaN   \n",
       "std      14.526497    1.102743    0.806057     NaN   49.693429          NaN   \n",
       "min       0.420000    0.000000    0.000000     NaN    0.000000          NaN   \n",
       "25%      20.125000    0.000000    0.000000     NaN    7.910400          NaN   \n",
       "50%      28.000000    0.000000    0.000000     NaN   14.454200          NaN   \n",
       "75%      38.000000    1.000000    0.000000     NaN   31.000000          NaN   \n",
       "max      80.000000    8.000000    6.000000     NaN  512.329200          NaN   \n",
       "\n",
       "       Embarked  \n",
       "count       889  \n",
       "unique        3  \n",
       "top           S  \n",
       "freq        644  \n",
       "mean        NaN  \n",
       "std         NaN  \n",
       "min         NaN  \n",
       "25%         NaN  \n",
       "50%         NaN  \n",
       "75%         NaN  \n",
       "max         NaN  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('Train columns with null values: \\n', data1.isnull().sum())\n",
    "print(\"-\"*10)\n",
    "print (data1.info())\n",
    "print(\"-\"*10)\n",
    "\n",
    "print('Test/Validation columns with null values: \\n', data_val.isnull().sum())\n",
    "print(\"-\"*10)\n",
    "print (data_val.info())\n",
    "print(\"-\"*10)\n",
    "\n",
    "data_raw.describe(include = 'all')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Split-Train-Test\">\n",
    "SPLIT TRAINING AND TESTING DATA\n",
    "</div>\n",
    "\n",
    "As mentioned previously, the test file provided is really validation data for competition submission. So, we will use sklearn function to split the training data in two datasets; 75/25 split. This is important, so we don't overfit our model. Meaning, the algorithm is so specific to a given subset, it cannot accurately generalize another subset, from the same dataset. It's important our algorithm has not seen the subset we will use to test, so it doesn't \"cheat\" by memorizing the answers. We will use sklearn's train_test_split function. In later sections we will also use sklearn's cross validation functions, that splits our dataset into train and test for data modeling comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data1 Shape: (891, 19)\n",
      "Train1 Shape: (668, 8)\n",
      "Test1 Shape: (223, 8)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sex_Code</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Embarked_Code</th>\n",
       "      <th>Title_Code</th>\n",
       "      <th>FamilySize</th>\n",
       "      <th>AgeBin_Code</th>\n",
       "      <th>FareBin_Code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>105</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>68</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>7</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>253</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>320</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>706</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Sex_Code  Pclass  Embarked_Code  Title_Code  FamilySize  AgeBin_Code  \\\n",
       "105         1       3              2           3           1            1   \n",
       "68          0       3              2           2           7            1   \n",
       "253         1       3              2           3           2            1   \n",
       "320         1       3              2           3           1            1   \n",
       "706         0       2              2           4           1            2   \n",
       "\n",
       "     FareBin_Code  \n",
       "105             0  \n",
       "68              1  \n",
       "253             2  \n",
       "320             0  \n",
       "706             1  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#split train and test data with function defaults\n",
    "#random_state -> seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation\n",
    "train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)\n",
    "train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)\n",
    "train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)\n",
    "\n",
    "\n",
    "print(\"Data1 Shape: {}\".format(data1.shape))\n",
    "print(\"Train1 Shape: {}\".format(train1_x.shape))\n",
    "print(\"Test1 Shape: {}\".format(test1_x.shape))\n",
    "\n",
    "train1_x_bin.head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}