{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Titanic: Machine Learning from Disaster\n", "**Predict survival on the Titanic**\n", "
\n", "I barely remember when I first watched the movie Titanic, but the Titanic still remains a subject of discussion in the most diverse areas. The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. \n", "\n", "In this Kaggle challenge, we're asked to complete the analysis of what sorts of people were likely to survive. In particular, we're asked to apply the tools of **machine learning** to predict which passengers survived the tragedy.\n", "\n", "More challenge information and the datasets are available on the [Kaggle Titanic Page](https://www.kaggle.com/c/titanic/data). The data has been split into two groups:\n", "\n", "- training set (train.csv)\n", "- test set (test.csv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Look at the Big Picture \n", "
\n", "\n", "The goal is to build a model that can predict the survival or death of a given passenger based on a set of variables describing them, such as age, sex, or passenger class on the boat.\n", "\n", "### Frame the Problem\n", "\n", "Framing the problem carefully is very important because it determines our problem space: which algorithms we select, which performance measure we use to evaluate our model, and how much effort we spend tweaking it. \n", "\n", "The test set should be used to see how well our model performs on unseen data. For the test set, the ground truth for each passenger is not provided. It is our job to predict these outcomes. For each passenger in the test set, we use the trained model to predict whether or not they survived the sinking of the Titanic. We will use **cross-validation** to evaluate estimator performance.\n", "\n", "Basically, two datasets are available: a `train set` and a `test set`. We'll use the training set to build our predictive model and to estimate its performance through cross-validation; the test set, which has no labels, is used for the predictions we submit. This is a binary classification problem. \n", "\n", "To solve this **ML** problem, we will cover feature analysis, data visualization, missing data imputation, feature engineering, model fine-tuning, and several classification models that are later combined through ensemble modeling.\n", "\n", "### Preprocessing\n", "\n", "In data science and ML, data preprocessing matters a lot: it means cleaning the data and making it usable before we fit a model.\n", "\n", "Real-world data is messy. Typical problems include:\n", "* **inconsistent values**\n", "* **duplicate records**\n", "* **missing values**\n", "* **invalid data**\n", "* **outliers**\n", "\n", "Why does this matter? Many models can't handle missing data, so we need to deal with it ourselves. 
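Why a gap needs explicit handling can be seen even without a model. The following is a minimal, library-free sketch on toy values (the `ages` list and the variable names are assumptions made for illustration, not taken from the notebook): a single NaN turns the mean into NaN, so the missing entry has to be set aside or filled before anything is computed.

```python
import math
import statistics

# Toy ages; float('nan') stands in for a missing entry (illustrative values only).
ages = [22.0, 38.0, float('nan'), 35.0]

# A single NaN propagates through plain arithmetic.
naive_mean = statistics.fmean(ages)

# Handling it explicitly: keep only the observed values before computing.
observed = [a for a in ages if not math.isnan(a)]
clean_mean = statistics.fmean(observed)
missing_share = 1 - len(observed) / len(ages)

print(math.isnan(naive_mean))   # True
print(round(clean_mean, 2))     # 31.67
print(missing_share)            # 0.25
```
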
There are several approaches we can take to handle missing values in our datasets, such as:\n", "\n", "* **Remove observations/records that have missing values.** But:\n", "    - data may be missing at random, so by doing this we may lose a lot of data\n", "    - data may be missing not at random, so by doing this we may lose a lot of data and also introduce potential biases \n", " \n", "* **Imputation**\n", "    - replace missing values with other values \n", "    - strategies: mean, median, or most frequent value of the given feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Table of Contents\n", "
\n", "The steps we will go through:\n", "\n", "[Get The Data](#2-bullet)\n", "\n", "Here we explore what is inside the dataset and make our first observations about it.\n", " \n", "[Feature Analysis To Gain Insights](#5-bullet)\n", "\n", "First we try to find outliers in our datasets. There are many methods to detect outliers; here we will use the Tukey method. Then we will analyze our features individually.\n", "\n", "[Feature Engineering](#9-bullet)\n", "\n", "Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. In our datasets there are a few features we can engineer. I chose two of them:\n", " - Name\n", " - Family Size\n", " \n", "[Predictive Modeling](#10-bullet)\n", "\n", "Here, we will use various classification models and compare the results. We'll use cross-validation to evaluate estimator performance, fine-tune the models, observe the learning curves of the best estimators, and finally build an ensemble of the three best predictive models. \n", "\n", "[Submit Predictor](#11-bullet)\n", "\n", "Create a CSV file and submit it to Kaggle." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Import \n", "
\n", "First we load the libraries we need. The list may look confusing at first sight, but we will see the use of each of them in detail later on." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Data Processing and Visualization Libraries\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "\n", "\n", "# Data Modelling Libraries\n", "from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,\n", "    GradientBoostingClassifier, ExtraTreesClassifier,\n", "    VotingClassifier)\n", "\n", "from sklearn.model_selection import (GridSearchCV, cross_val_score, cross_val_predict,\n", "    StratifiedKFold, learning_curve)\n", "\n", "\n", "from sklearn.metrics import (confusion_matrix, accuracy_score) \n", "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.svm import SVC\n", "\n", "import warnings\n", "from collections import Counter\n", "\n", "sns.set(style = 'white' , context = 'notebook', palette = 'deep')\n", "warnings.filterwarnings('ignore', category = DeprecationWarning)\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Get Data Sets\n", "
\n", "Using pandas, we now load the data. There are two files: one for training and the other for testing." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# load the datasets using pandas's read_csv method\n", "train = pd.read_csv('train.csv')\n", "test = pd.read_csv('test.csv')\n", "\n", "# concatenate the two datasets; this comes in handy while processing the data\n", "dataset = pd.concat(objs=[train, test], axis=0).reset_index(drop=True)\n", "\n", "# store the PassengerId column of the test set separately;\n", "# it is used at the end of the task for the predictions.\n", "TestPassengerID = test['PassengerId']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Look Inside \n", "Let's look at what we've just loaded: the dataset's size, shape, a short description, and a few more details." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(891, 12)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# shape of the data set\n", "train.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the training set has 891 samples and 12 columns (the target plus 11 other attributes). Let's look at its first 5 rows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"<tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"<tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# first 5 records\n", "train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Definitions of each features and quick thoughts:\n", "\n", "- PassengerId. Unique identification of the passenger. It shouldn't be necessary for the machine learning model.\n", "\n", "- Survived. Survival (0 = No, 1 = Yes). Binary variable that will be our target variable.\n", "- Pclass. Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd). Ready to go.\n", "- Name. Name of the passenger. We need to parse before using it.\n", "- Sex. Gender Categorical variable that should be encoded. We can use dummy variable to encode it.\n", "- Age. Age in years.\n", "- SibSp. Siblings / Spouses aboard the Titanic.\n", "- Parch. Parents / Children aboard the Titanic.\n", "- Ticket. Ticket number. Big mess. \n", "- Fare. Passenger fare.\n", "- Cabin. Cabin number.\n", "- Embarked. Port of Embarkation , C = Cherbourg, Q = Queenstown, S = Southampton. Categorical feature that should be encoded. We can use feature mapping or make dummy vairables for it.\n", "\n", "The main conclusion is that we already have a set of features that we can easily use in our machine learning model. But features like Name, Ticket, Cabin require an additional effort before we can integrate them." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 891 entries, 0 to 890\n", "Data columns (total 12 columns):\n", "PassengerId 891 non-null int64\n", "Survived 891 non-null int64\n", "Pclass 891 non-null int64\n", "Name 891 non-null object\n", "Sex 891 non-null object\n", "Age 714 non-null float64\n", "SibSp 891 non-null int64\n", "Parch 891 non-null int64\n", "Ticket 891 non-null object\n", "Fare 891 non-null float64\n", "Cabin 204 non-null object\n", "Embarked 889 non-null object\n", "dtypes: float64(2), int64(5), object(5)\n", "memory usage: 83.6+ KB\n" ] } ], "source": [ "# the info method gives a quick overview of the data set\n", "train.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing to notice: we have 891 samples, but columns like **Age**, **Cabin** and **Embarked** have some missing values. We can't ignore those. First, though, let's generate descriptive statistics to get basic quantitative information about the features of our data set." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>count</th><td>891.000000</td><td>891.000000</td><td>891.000000</td><td>714.000000</td><td>891.000000</td><td>891.000000</td><td>891.000000</td></tr>\n",
"<tr><th>mean</th><td>446.000000</td><td>0.383838</td><td>2.308642</td><td>29.699118</td><td>0.523008</td><td>0.381594</td><td>32.204208</td></tr>\n",
"<tr><th>std</th><td>257.353842</td><td>0.486592</td><td>0.836071</td><td>14.526497</td><td>1.102743</td><td>0.806057</td><td>49.693429</td></tr>\n",
"<tr><th>min</th><td>1.000000</td><td>0.000000</td><td>1.000000</td><td>0.420000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"<tr><th>25%</th><td>223.500000</td><td>0.000000</td><td>2.000000</td><td>20.125000</td><td>0.000000</td><td>0.000000</td><td>7.910400</td></tr>\n",
"<tr><th>50%</th><td>446.000000</td><td>0.000000</td><td>3.000000</td><td>28.000000</td><td>0.000000</td><td>0.000000</td><td>14.454200</td></tr>\n",
"<tr><th>75%</th><td>668.500000</td><td>1.000000</td><td>3.000000</td><td>38.000000</td><td>1.000000</td><td>0.000000</td><td>31.000000</td></tr>\n",
"<tr><th>max</th><td>891.000000</td><td>1.000000</td><td>3.000000</td><td>80.000000</td><td>8.000000</td><td>6.000000</td><td>512.329200</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " PassengerId Survived Pclass Age SibSp \\\n", "count 891.000000 891.000000 891.000000 714.000000 891.000000 \n", "mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n", "std 257.353842 0.486592 0.836071 14.526497 1.102743 \n", "min 1.000000 0.000000 1.000000 0.420000 0.000000 \n", "25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n", "50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n", "75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n", "max 891.000000 1.000000 3.000000 80.000000 8.000000 \n", "\n", " Parch Fare \n", "count 891.000000 891.000000 \n", "mean 0.381594 32.204208 \n", "std 0.806057 49.693429 \n", "min 0.000000 0.000000 \n", "25% 0.000000 7.910400 \n", "50% 0.000000 14.454200 \n", "75% 0.000000 31.000000 \n", "max 6.000000 512.329200 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Descriptive Statistics\n", "train.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three aspects that usually catch my attention when I analyse descriptive statistics:\n", "\n", "- **Min and max values**: This can give us an idea about the range of values and is helpful to detect outliers.\n", "\n", "- **Mean and standard deviation**: The mean shows us the central tendency of the distribution, while the standard deviation quantifies its amount of variation.\n", "- **Count**: Give us a first perception about the volume of missing data. \n", "\n", "\n", "Let's define a function for missing data analysis more in details." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Create table for missing data analysis\n", "def find_missing_data(data):\n", "    # number of missing values per column, largest first\n", "    total = data.isnull().sum().sort_values(ascending=False)\n", "    # fraction of missing values per column (a value between 0 and 1)\n", "    percentage = (data.isnull().sum() / data.isnull().count()).sort_values(ascending=False)\n", "\n", "    return pd.concat([total, percentage], axis=1, keys=['Total', 'Percent'])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>Total</th><th>Percent</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>Cabin</th><td>687</td><td>0.771044</td></tr>\n",
"<tr><th>Age</th><td>177</td><td>0.198653</td></tr>\n",
"<tr><th>Embarked</th><td>2</td><td>0.002245</td></tr>\n",
"<tr><th>Fare</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Ticket</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Parch</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>SibSp</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Sex</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Name</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Pclass</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Survived</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>PassengerId</th><td>0</td><td>0.000000</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Total Percent\n", "Cabin 687 0.771044\n", "Age 177 0.198653\n", "Embarked 2 0.002245\n", "Fare 0 0.000000\n", "Ticket 0 0.000000\n", "Parch 0 0.000000\n", "SibSp 0 0.000000\n", "Sex 0 0.000000\n", "Name 0 0.000000\n", "Pclass 0 0.000000\n", "Survived 0 0.000000\n", "PassengerId 0 0.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_missing_data(train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a heatmap plot to visualize the amount of missing values." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# checking only train set - visualize\n", "sns.heatmap(train.isnull(), cbar = False , \n", " yticklabels = False , cmap = 'viridis')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that, Cabin feature has terrible amount of missing values, around 77% data are missing. Until now, we only see train datasets, now let's see amount of missing values in whole datasets." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>Total</th><th>Percent</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>Cabin</th><td>1014</td><td>0.774637</td></tr>\n",
"<tr><th>Survived</th><td>418</td><td>0.319328</td></tr>\n",
"<tr><th>Age</th><td>263</td><td>0.200917</td></tr>\n",
"<tr><th>Embarked</th><td>2</td><td>0.001528</td></tr>\n",
"<tr><th>Fare</th><td>1</td><td>0.000764</td></tr>\n",
"<tr><th>Ticket</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>SibSp</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Sex</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Pclass</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>PassengerId</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Parch</th><td>0</td><td>0.000000</td></tr>\n",
"<tr><th>Name</th><td>0</td><td>0.000000</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Total Percent\n", "Cabin 1014 0.774637\n", "Survived 418 0.319328\n", "Age 263 0.200917\n", "Embarked 2 0.001528\n", "Fare 1 0.000764\n", "Ticket 0 0.000000\n", "SibSp 0 0.000000\n", "Sex 0 0.000000\n", "Pclass 0 0.000000\n", "PassengerId 0 0.000000\n", "Parch 0 0.000000\n", "Name 0 0.000000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "find_missing_data(dataset)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# checking only datasets set\n", "sns.heatmap(dataset.isnull(), cbar = False , \n", " yticklabels = False , cmap = 'viridis')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As it mentioned earlier, ground truth of test datasets are missing.\n", "\n", "## Problem Spaces \n", "\n", "So, we've this train data set and with a quick analysis we've seen its internal components and find some missing values there. We've also seen many observations with concern attributes. \n", "\n", "**Task**: The goal is to predict the survival or the death of a given passenger based on a set of variables describing their such as age, sex, or passenger class on the boat.\n", "\n", "So, **Survived** is our **target variable**, This is the variable we're going to predict. `1` represent **survived** , `0` represent **not survived**. 
The rest of the attributes are called **feature variables**; based on those, we need to build a model that predicts whether a passenger survived or not.\n", "\n", "\n", "### Preprocessing\n", "In data science and ML contexts, data preprocessing means cleaning the data and making it usable before we fit a model.\n", "\n", "Real-world data is messy. Typical problems include:\n", "* inconsistent values\n", "* duplicate records\n", "* missing values\n", "* invalid data\n", "* outliers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Analysis \n", "
\n", "\n", "### Outlier Detection \n", "\n", "There are many methods to detect outliers. We will use the [Tukey Method](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/) to accomplish it." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Mohammed Innat\\Anaconda3\\lib\\site-packages\\numpy\\lib\\function_base.py:4291: RuntimeWarning: Invalid value encountered in percentile\n", " interpolation=interpolation)\n" ] } ], "source": [ "# Outlier detection \n", "\n", "def detect_outliers(df,n,features):\n", "    \"\"\"\n", "    Takes a dataframe df of features and returns a list of the indices\n", "    corresponding to the observations containing more than n outliers according\n", "    to the Tukey method.\n", "    \"\"\"\n", "    outlier_indices = []\n", "    \n", "    # iterate over features(columns)\n", "    for col in features:\n", "        \n", "        # 1st quartile (25%)\n", "        # note: if df[col] contains NaN (as Age does), np.percentile returns NaN,\n", "        # so no row is flagged as an outlier for that column\n", "        Q1 = np.percentile(df[col], 25)\n", "        \n", "        # 3rd quartile (75%)\n", "        Q3 = np.percentile(df[col],75)\n", "        \n", "        # Interquartile range (IQR)\n", "        IQR = Q3 - Q1\n", "        \n", "        # outlier step\n", "        outlier_step = 1.5 * IQR\n", "        \n", "        # Determine a list of indices of outliers for feature col\n", "        outlier_list_col = df[(df[col] < Q1 - outlier_step) | \n", "                              (df[col] > Q3 + outlier_step )].index\n", "        # append the found outlier indices for col to the list of outlier indices \n", "        outlier_indices.extend(outlier_list_col)\n", "        \n", "    \n", "    # select observations containing more than n outliers\n", "    outlier_indices = Counter(outlier_indices) \n", "\n", "    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )\n", "    return multiple_outliers \n", "\n", "# detect outliers from Age, SibSp , Parch and Fare\n", "Outliers_to_drop = detect_outliers(train,2,[\"Age\",\"SibSp\",\"Parch\",\"Fare\"])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { 
"data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>27</th><td>28</td><td>0</td><td>1</td><td>Fortune, Mr. Charles Alexander</td><td>male</td><td>19.0</td><td>3</td><td>2</td><td>19950</td><td>263.00</td><td>C23 C25 C27</td><td>S</td></tr>\n",
"<tr><th>88</th><td>89</td><td>1</td><td>1</td><td>Fortune, Miss. Mabel Helen</td><td>female</td><td>23.0</td><td>3</td><td>2</td><td>19950</td><td>263.00</td><td>C23 C25 C27</td><td>S</td></tr>\n",
"<tr><th>159</th><td>160</td><td>0</td><td>3</td><td>Sage, Master. Thomas Henry</td><td>male</td><td>NaN</td><td>8</td><td>2</td><td>CA. 2343</td><td>69.55</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>180</th><td>181</td><td>0</td><td>3</td><td>Sage, Miss. Constance Gladys</td><td>female</td><td>NaN</td><td>8</td><td>2</td><td>CA. 2343</td><td>69.55</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>201</th><td>202</td><td>0</td><td>3</td><td>Sage, Mr. Frederick</td><td>male</td><td>NaN</td><td>8</td><td>2</td><td>CA. 2343</td><td>69.55</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>324</th><td>325</td><td>0</td><td>3</td><td>Sage, Mr. George John Jr</td><td>male</td><td>NaN</td><td>8</td><td>2</td><td>CA. 2343</td><td>69.55</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>341</th><td>342</td><td>1</td><td>1</td><td>Fortune, Miss. Alice Elizabeth</td><td>female</td><td>24.0</td><td>3</td><td>2</td><td>19950</td><td>263.00</td><td>C23 C25 C27</td><td>S</td></tr>\n",
"<tr><th>792</th><td>793</td><td>0</td><td>3</td><td>Sage, Miss. Stella Anna</td><td>female</td><td>NaN</td><td>8</td><td>2</td><td>CA. 2343</td><td>69.55</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>846</th><td>847</td><td>0</td><td>3</td><td>Sage, Mr. Douglas Bullen</td><td>male</td><td>NaN</td><td>8</td><td>2</td><td>CA. 2343</td><td>69.55</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>863</th><td>864</td><td>0</td><td>3</td><td>Sage, Miss. Dorothy Edith \"Dolly\"</td><td>female</td><td>NaN</td><td>8</td><td>2</td><td>CA. 2343</td><td>69.55</td><td>NaN</td><td>S</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " PassengerId Survived Pclass Name Sex \\\n", "27 28 0 1 Fortune, Mr. Charles Alexander male \n", "88 89 1 1 Fortune, Miss. Mabel Helen female \n", "159 160 0 3 Sage, Master. Thomas Henry male \n", "180 181 0 3 Sage, Miss. Constance Gladys female \n", "201 202 0 3 Sage, Mr. Frederick male \n", "324 325 0 3 Sage, Mr. George John Jr male \n", "341 342 1 1 Fortune, Miss. Alice Elizabeth female \n", "792 793 0 3 Sage, Miss. Stella Anna female \n", "846 847 0 3 Sage, Mr. Douglas Bullen male \n", "863 864 0 3 Sage, Miss. Dorothy Edith \"Dolly\" female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked \n", "27 19.0 3 2 19950 263.00 C23 C25 C27 S \n", "88 23.0 3 2 19950 263.00 C23 C25 C27 S \n", "159 NaN 8 2 CA. 2343 69.55 NaN S \n", "180 NaN 8 2 CA. 2343 69.55 NaN S \n", "201 NaN 8 2 CA. 2343 69.55 NaN S \n", "324 NaN 8 2 CA. 2343 69.55 NaN S \n", "341 24.0 3 2 19950 263.00 C23 C25 C27 S \n", "792 NaN 8 2 CA. 2343 69.55 NaN S \n", "846 NaN 8 2 CA. 2343 69.55 NaN S \n", "863 NaN 8 2 CA. 2343 69.55 NaN S " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the outliers rows\n", "train.loc[Outliers_to_drop]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Drop outliers\n", "train = train.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)\n", "\n", "# after removing outlier, let's re-concat the data sets\n", "dataset = pd.concat(objs=[train, test], axis=0).reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've removed outlier, let's analysis the various features and in the same time we'll also handle the missing value during analysis.\n", "\n", "- Numerical Analysis\n", "- Categorical Analysi\n", "\n", "# Numerical Analysis \n", "
\n", "\n", "First, let's analyze the correlation of the 'Survived' target with the numerical features 'SibSp', 'Parch', 'Age', and 'Fare'." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Correlation matrix between numerical values (SibSp Parch Age and Fare values) and Survived \n", "corr_numeric = sns.heatmap(dataset[[\"Survived\",\"SibSp\",\"Parch\",\"Age\",\"Fare\"]].corr(),\n", " annot=True, fmt = \".2f\", cmap = \"summer\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only the Fare feature seems to have a noticeable correlation with the survival probability.\n", "\n", "But that doesn't make the other features useless. Subpopulations within these features can still be correlated with survival. To assess this, we need to explore these features in detail." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Age\n", "\n", "Let's first look at the age distribution among passengers who survived and those who did not." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Explore the Age vs Survived features\n", "age_survived = sns.FacetGrid(dataset, col='Survived')\n", "age_survived = age_survived.map(sns.distplot, \"Age\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, it looks like the age distributions are not the same in the survived and not-survived subpopulations. Indeed, there is a peak corresponding to young passengers who survived. We also see that passengers between 60 and 80 survived less often. So, even if \"Age\" is not correlated with \"Survived\", we can see that there are age categories of passengers with a higher or lower chance of survival.\n", "\n", "It seems that very young passengers have a better chance of survival. 
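One way to put a number on this impression is to compare survival rates across age bands. The sketch below is library-free and runs on made-up toy records, not on values from train.csv; the band edges and the helper name `survival_rate_by_band` are assumptions chosen purely for illustration.

```python
import math

def survival_rate_by_band(records, edges):
    # records: iterable of (age, survived) pairs; survived is 0 or 1.
    # edges: ascending band boundaries; band i covers edges[i] <= age < edges[i + 1].
    counts = [[0, 0] for _ in range(len(edges) - 1)]  # [survivors, passengers]
    for age, survived in records:
        if age is None or math.isnan(age):
            continue  # skip passengers whose age is missing
        for i in range(len(edges) - 1):
            if edges[i] <= age < edges[i + 1]:
                counts[i][0] += survived
                counts[i][1] += 1
                break
    # an empty band has no defined rate, so report None for it
    return [s / n if n else None for s, n in counts]

# Toy records (hypothetical values for illustration only).
toy = [(2, 1), (4, 1), (8, 0), (25, 0), (30, 1), (40, 0),
       (62, 0), (70, 0), (float('nan'), 1)]
rates = survival_rate_by_band(toy, [0, 16, 60, 100])
print(rates)
```

With pandas, a similar grouping could be written with `pd.cut` on the Age column followed by a `groupby` on the resulting bands and the mean of Survived; note that `pd.cut` uses right-closed intervals by default, unlike the left-closed bands in this sketch.
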
Let's look one more time." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = sns.FacetGrid(dataset, hue = 'Survived', aspect = 4)\n", "fig.map(sns.kdeplot, 'Age' , shade = True)\n", "fig.set(xlim = (0, dataset['Age'].max()))\n", "fig.add_legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again we see that older passengers, between 65 and 80, survived less often.\n", "\n", "\n", "## Missing Age Value\n", "\n", "We have seen a significant number of missing values in the **Age** column. Missing ages are a big issue; to address this problem, I've looked at the features most correlated with Age. Let's first look at the relationship between the Age and Sex features." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAARgAAAEYCAYAAACHjumMAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAEeNJREFUeJzt3X2QXXV9x/H3JiQsuCgxVKuNioh+bUkKFR/A8BAFRaiA2g52AAUTn5CKdqhMdbBVp5aKgko7SBsT0KnRtlqMVRHaFCzg+FisScEvxeBQHxASEVySFJJs/zh3ZRsXson3e8/eu+/XTGbPOffu+f2Su/ns7/zOw3dobGwMSaowq+0OSBpcBoykMgaMpDIGjKQye7TdgYcTEXsAC4AfZObWtvsjaddN24ChCZfb16xZ03Y/JO3c0GQbPUSSVMaAkVTGgJFUxoCRVMaAkVTGgJFUxoCZBtauXcvatWvb7obUdWXXwUTEHOBjwP7ANuB1wFbgCmAMWAecnZnbq/rQL1atWgXABRdc0HJPpO6qHMGcAOyRmc8H3gO8F7gYOD8zj6S5MOfkwvb7wtq1a1m3bh3r1q1zFKOBUxkwtwJ7RMQs4NHAg8ChwJc7r18FHFvYfl8YH73suCwNgspbBUZpDo++C+wHvBQ4KjPHn3D1c+Axhe1LalnlCOaPgKsz8xnAwTTzMXMnvL4P8LPC9vvCqaeeOumyNAgqA+Ye4N7O8k+BOcBNEbGks+144PrC9vvCokWLWLhwIQsXLmTRokVtd0fqqspDpA8CKyPiepqRyzuAbwLLI2IucAvw6cL2+4YjFw2qsoDJzFHglEleOrqqzX7lyEWDygvtJJUxYCSVMWCkAt7+0TBgpAKrVq3ywkkMGKnrvP3jIQaM1GXe/vEQA0ZSGQNmGnBCcLB4+8dDpnNdpBnD58EMlvHbP8aXZzIDpmXjE4LjyzP9B3JQzPSRyzgPkVrmhOBgWrRokb8sMGAkFTJgWuaEoAaZczAtc0JQg8yAmQYcuWhQeYg0Daxfv57169e33Q2p6xzBTAPjZ49OPnnGV3HRgHEE07LVq1ezadMmNm3axOrVq9vujtRVBkzLvA5Gg6yydOyZwJmd1WHgEGAJ8GGaErLXZOa7q9rvF9u2bZt0WRoEZSOYzLwiM5dk5hLgW8A5wGXAqcARwPMi4llV7feLOXPmTLosDYLyQ6SIeDZwEPApYM/M/F6nuuPVwDHV7U93mzZtmnRZGgS9mIN5B/BumvrU903YbulYYO7cuZMuS4OgNGAiYl/gmZl5LU247DPhZUvHAqeffvqky9IgqB7BHAX8K0Bm3gc8EBFPi4gh4DgsHcvJJ5/M8PAww8PDXgejgVN9oV0AEy9RfSPwCWA2zVmkrxW33xccuWhQlQZMZr5/h/WvAodVttmPDjjggLa7IJXwQrtpwBo6GlQGTMusoaNBZsC0zFsFNMgMGEllDJiW+chMDTKfBzMFK1eu5MYbbyzb/6xZTc5/6EMfKtn/4sWLWbp0acm+pUdiwEwDY2NjbXdBKjE0XX+4I2J/4PY1a9awYMGCtrtTatmyZQCsWLGi5Z5Iu21oso3OwUgqY8BIKuMcjGasysn70dFRAEZGRkr23y8T945gpAJbtmxhy5YtbXejdY5gNGMtXbq0bBTgxH3DEYykMgaMpDIGjKQyBoykMgaMpDIGjKQypaepI+LtwEnAXOBS4MvAFcAYsA44OzO3V/ZBUnvKRjARsQR4PrAYOBp4EnAxcH5mHklzc5R1OqQBVnmIdBywFrgS+Gfg88ChNKMYgKuAYwvbl9SyykOk/YCnAC8Fngp8DpjVqUsNlo6VBl5lwGwEvpuZDwAZEVtoDpPGWTpWGnCVh0g3AC+JiKGIeCLwKGBNZ24G4HgsHSsNtLIRTGZ+PiKOAr5OE2RnA7cDyyNiLnAL8Omq9iW1r7p07HmTbD66sk1J04cX2kkqY8BIKmPASCpjwEg
qY8BIKmPASCpjwEgqY8BIKmPASCpjwEgqY8BIKmPASCpjwEgqY8BIKmPASCpjwEgqY8BIKmPASCpjwEgqU1069ibg3s7q7cDfAB8GtgLXZOa7K9uX1K6ygImIYYDMXDJh27eB3wPWA1+IiGdl5n9U9UFSuypHMAcDe0fENZ123gXsmZnfA4iIq4FjAANGGlCVAbMJ+ADwUeDpNLWoJ1Zy/DlwQGH7klpWGTC3Ard1alHfGhH3Ao+d8LqlY6UBV3kWaSlwEUCndOzewP0R8bSIGAKOw9Kx0kCrHMGsAK6IiBuAMZrA2Q58AphNcxbpa4XtS2pZZW3qB4BTJ3npsKo2JU0vXmgnqYwBI6mMASOpjAEjqYwBI6mMASOpjAEjqYwBI6mMASOpjAEjqYwBI6mMASOpjAEjqcyUAyYi5lV2RNLg2enjGiLiEOBTNM/XPRz4MnCKD+uWtDNTGcFcArwc2JiZPwTOAi4r7ZWkgTCVgNk7M28ZX8nMfwH2rOuSpEExlYD5aUQcTPPYSyLiNOCnpb2SNBCm8sjMs4CPAQdFxM+A/wZOL+2VpIGw04DpFEo7IiIeBczOzPumuvOIeBzwLeBFNOVir6AZCa0Dzs7M7bvTaUn9YSpnka6lc3jUWR8DNgO3AH+Rmfc8zPfNoalFvbmz6WLg/My8LiIuA04GrvzVui9pOpvKHMzNwHeAt3b+fIOmYNqPaEqTPJwP0Jxt+lFn/VCaU9zQVHk8djf6K6mPTGUO5rDMPHTC+nci4huZeXpEvHqyb4iIM4G7M/PqiHh7Z/NQp8ojNGVjH7PbvZbUF6YygpkTEQeNr0TEQmB2ROwFzH2Y71kKvCgirgMOAT4OPG7C65aNlWaAqYxgzgGuioif0ATSPJqzSO+iCY5fkplHjS93QuaNwPsjYklmXgccD1z7q3R8R+eddx4bN27s5i57ZsOGDQAsW7as5Z7snvnz53PhhRe23Q1NQ1M5i3RdRBwA/A5NMBxHU/Z1ZBfbOhdYHhFzaSaIP72rnX0kGzdu5K677mZozl7d3G1PjHUGknffM9pyT3bd2IObd/4mzVhTOYv0VOD1NIc9+wLvpTkDNCWZuWTC6tG72L9dMjRnL0YOPKmyCe1g9LbPle3bUWl7ujUqfdiAiYiXA2+gOftzJc1h0fLMfM+v3Ko0BRs3buSuu+9i1l5lJdTLbJ/VnM/YMNp/F71v37y1a/t6pE/uM8A/AIdn5m0AEeGFceqpWXvtwbyXPLntbswo93zpjq7t65EC5reB1wA3RMT3gU/u5P2S9P887GnqzFyXmecCC4C/BF4APD4ivhARJ/Sqg5L611TOIm0FPgt8NiJ+DXg1cAHwxeK+Sepzu3TIk5l3Axd1/kjSI/Kh35LKGDCSyhgwksoYMJLKGDCSyhgwksoYMJLKGDCSyhgwksoYMJLKGDCSyhgwksoYMJLKlD1AKiJmA8uBALbRPLxqCMvHSjNG5QjmRIDMXAz8KU3p2PHysUfShM2UHx4uqf+UBUxmfpamGgHAU4CfYPlYaUYpnYPJzK0R8THgr2jqIFk+VppByid5M/MM4Bk08zETq6JZPlYacGUBExGvmlD4fhOwHfhmRCzpbDseuL6qfUntqyxD8k/A5RHx78Ac4K00JWPLysdKml7KAiYz7wdOmeSlkvKxo6OjjD24ubSUqX7Z2IObGe2/ktrqES+0k1RmYCo1joyMsPlBGDnwpLa7MqOM3vY5RkZG2u6GpqmBCRgNntHRUbZv3trVWsnaue2btzJKd457PUSSVMYRjKatkZERtvAA817y5La7MqPc86U7unbY6whGUhkDRlIZA0ZSGQNGUhkDRlIZA0ZSGQNGUhkDRlIZA0ZSGQNGUhkDRlIZA0ZSGQNGUpmBupu6Xx+ZObbtAQCGZs9tuSe7buzBzYAPnNLkSgImIuYAK4H9gT2BPwduprBs7Pz587u1q57bsGEDAPvN68f/qCN9/W+vWlUjmNOBjZn5qoiYD9w
EfJumbOx1EXEZTdnYK7vV4IUXXtitXfXcsmXLAFixYkXLPZG6q2oO5h+Bd05Y34plY6UZp2QEk5mjABGxD03to/OBD1g2VppZyiZ5I+JJNIdAl2bmqoiYeAxj2VhNSb8+9Hv7A9sAmDV3dss92XXbN2/t2rx91STv44FrgD/MzDWdzTdFxJLMvI6mbOy1FW1rcPTz5PEvJu5HHttyT3bDSPf+7atGMO8A5gHvjIjxuZi3AJdYNlZT5cR9/6uag3kLTaDsqKRsrKTpySt5JZUxYCSVMWAklTFgJJUxYCSVMWAklTFgJJUxYCSVMWAklTFgJJUxYCSVMWAklTFgJJUxYCSVMWAklTFgJJUxYCSVMWAklTFgJJUprU0dEc8D3peZSyLiQApLx0qafspGMBFxHvBRYLiz6WKa0rFHAkM0pWMlDbDKQ6TvAa+YsG7pWGmGKQuYzPwM8OCETUOWjpVmll5O8k6cb7F0rDQD9DJgboqIJZ3l44Hre9i2pBaUnkXawbnAckvHSjNHacBk5veBwzrLt2LpWGlG8UI7SWUMGEllDBhJZQwYSWUMGEllDBhJZQwYSWUMGEllDBhJZQwYSWUMGEllDBhJZQwYSWUMGEllDBhJZQwYSWUMGEllDBhJZQwYSWV6+dBvImIWcClwMPC/wGsz87Ze9kFS7/Q0YICXAcOZeXhEHAZcRB+UkF25ciU33nhj2f43bNgAwLJly0r2v3jxYpYuXVqy735W+bn6mTZ6fYh0BPAlgMz8KvDsHrc/LQ0PDzM8PLzzN6pv+Jk2hsbGxnb+ri6JiI8Cn8nMqzrrdwAHZObWSd67P3D7mjVrWLBgQc/6KGm3DE22sdcjmPtoysb+ov3JwkXSYOh1wNwInADQmYNZ2+P2JfVQryd5rwReFBFfoRlSvabH7UvqoZ4GTGZuB97YyzYltccL7SSVMWAklTFgJJUxYCSV6fVZpF0xG+DOO+9sux+SduKYY47ZH/jBjte1TeeAeQLAaaed1nY/JO3c7cBTge9P3DidA+YbwJHAj4FtLfdF0s79YMcNPb0XSdLM4iSvpDIGjKQyBoykMgaMpDIGjKQy0/k09YwVEWcCz8zMP2m7LzNdRMwGvgg8CjgxM+/p0n7vzMxf78a+pjMDRnpkTwD2y8xD2+5IPzJginVGIycCe9H8sH6YppLCQuCPgScBrwDmAPd2lid+/5uBU4Ex4FOZeUmv+i4A/hZ4ekRcTvO41/md7edk5tqIuA34CvB04N+AxwDPBTIzXxURC4GLaaYj9u1831fGdx4Ri4BLaB7AthFYmpn39uavVs85mN7YJzNPAN4HnEUTIq8HltH8wB6bmUfShMxzxr8pIn4LeCVNNYYjgJdFRPS47zPdm4CbgbuANZn5AprP7iOd1/cHzgeOAs6hqfv1POCIiNgXOAg4NzOPpQmaHZ/iuBw4OzOX0ByKnVf5l+k1RzC9cVPn68+AWzJzLCLuAeYCDwCfjIhRYAFNyIxbCDwFWNNZnwccCGRPeq2JFgEvjIhXdtbndb5uzMw7ACLi/sy8ubN8LzAM/BB4Z0RsphkB3bfDfn8TuLTze2MOcGvp36LHHMH0xsPdjzEXeFlmvhJ4M83nMbH8QwL/Bbyg8xvuCnxQelu+C3yw8zmcAnyis31n99pcAvxZZp5B89ntWN4jgVd39nse8IVudXg6cATTrq3A/RHxTZpSuj8Gnjj+Ymb+Z0SsAW6IiD2Br9P8RlTvvRdYERGvBx4NvGuK3/d3wOqI+AnNzYD77fD6WcDHO2eroDlsHhje7CipjIdIksoYMJLKGDCSyhgwksoYMJLKeJpaXRURvw+8neZnaxbw8cx8f7u9UlscwahrIuI3gIuAF2fmwcDhwB9ExEnt9kxtcQSjbtqP5nL3vWkuoR+NiDOALRHxHOCDndc2AG/ofF0LLMvMNRFxNbA6My9tp/vqNi+0U1dFxEeA19Lcf3UtsAq4haYMzYmZeUdEHAe8LTOPjYg
X0tw4eAnw0sw8vqWuq4ABo67rHCq9GDiO5tEUF9DcZ3PbhLc9OjMP6Lz/IzSPpHhmZv64x91VIQ+R1DUR8bvASGb+PXA5cHlEvI4mPNZn5iGd980GHt9ZHgIC2NT5asAMECd51U2bgAsiYn/4RXgcAnwVeGxEHNl531KaQydonrcySjPSWR4RIz3tsUp5iKSu6kzqvo2HnmtzdWf9WTRP8xumeSbKGcB2mqfBPTcz/yci/hqYlZlv6nnHVcKAkVTGQyRJZQwYSWUMGEllDBhJZQwYSWUMGEllDBhJZf4PvWrBugf69ooAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# visualize this using box plot\n", "AS = sns.factorplot(y=\"Age\", x=\"Sex\", data = dataset, kind=\"box\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Age distribution seems to be almost same in Male and Female subpopulations, so Sex is not informative to predict Age. Let's explore `age` and `pclass` distribution." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "facet = sns.FacetGrid(dataset, hue=\"Pclass\", aspect=4)\n", "facet.map(sns.kdeplot,'Age',shade= True)\n", "facet.set(xlim=(0, train['Age'].max()))\n", "facet.add_legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, we see there're more young people from class 3. First class passenger seems more aged than second class and third class are following. But we can't get any information to predict age. But let's try an another approach to visualize with the same parameter." 
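] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before drawing the box plot, the same per-class comparison can be sketched numerically. The cell below is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the Titanic data): it prints the quartiles of `Age` within each `Pclass`, which are the quantities a box plot draws." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Hypothetical toy data, only to illustrate the call\n", "toy = pd.DataFrame({\n", "    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],\n", "    'Age': [40.0, 36.0, 52.0, 30.0, 28.0, 24.0, 19.0, 26.0],\n", "})\n", "\n", "# Quartiles of Age per class (25%, median, 75%), one row per class\n", "print(toy.groupby('Pclass')['Age'].quantile([0.25, 0.5, 0.75]).unstack())" 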
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAARgAAAEYCAYAAACHjumMAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAEtxJREFUeJzt3X+QXWV9x/F3EhKSsIgxUactIP6o344SoUJHNAJx0GKsYqXqTAMCbqri4PijjHGg4KijRlHoaK1iMRFQqVUU/IVCS0ExVgV/lKTiF0EsYymGrAiuyRaSTf84d3HBxb17s8859+S+XzPMnHN37z7PXdgPz3nOc57vnF27diFJJcxtugOS9lwGjKRiDBhJxRgwkorZq+kOPJyI2AvYH/h5Zu5ouj+SZq5vA4YqXG67+uqrm+6HpOnNmepFL5EkFWPASCrGgJFUjAEjqRgDRlIxBoykYgwYScUUWwcTEfOBi4CDgJ3Aq4AdwIXALmAzcFpmjpfqg6RmlRzBvADYKzOfBbwDeBdwHnBWZh5JtTDnxQXbl9Swkit5bwb2ioi5wCOA+4EjgK93vv5V4M+Bywr24WFt2LCBjRs39vTe0dFRAIaGhnpuf8WKFQwPD/f8fqkNSgbMKNXl0Y+BZcALgaMyc2KHq18D+xVsv5ixsTFg9wJGGgQlA+ZNwJWZeUZEHAD8O7Bg0tf3BX5VsP3fa3h4uOcRxJo1awBYv379bHZJ2uOUnIO5G7inc/xLYD7wg4hY2XltFXBdwfYlNazkCObvgQ0RcR3VyOVM4AbggohYANwEXFqwfUkNKxYwmTkKvHyKLx1dqk1J/cWFdpKKMWAkFWPASCrGgJFUjAEjqRgDRlIxBoykYgwYqQU2bdrEpk2bmu7GjBkwUgtccsklXHLJJU13Y8YMGKnPbdq0ic2bN7N58+bWjWIMGKnPTR65tG0UY8BIKsaAkfrc6tWrpzxug5LbNUiaBcuXL+fggw9+4LhNDBipBdo2cplgwEgt0LaRywTnYCQVY8BILeBKXmkabf0j6QdtXclbsnTsKcApndOFwKHASuADVCVkr8rMt5dqX/1n4g9k3bp1DfekXSZW8k4ct2k+ptgIJjMvzMyVmbkS+B7weuB8YDXwbOAZEfH0Uu2rv7R5uXvTXMn7e0TE4cBTgU8De2fmrZ3qjlcCx5RuX/2hzX8k6l0dczBnAm+nqk9976TXW1s6VqpTm1fyFg2YiHgk8CeZeQ1VuOw76cuNlo5Vvdr8R9K0iZW8Bx98cKvmX6D8QrujgH8DyMx7I+K+iHgi8FPgWKqRjQZAm5e794O2hnLpgAmqMJlwKvApYB7VXaTvFG5ffaStfyT9oK2hXDRgMvN9Dzn/NnBEyTbVv9r6R6LeudBOUjEGjKRiDBjVxkcFBo8Bo9q09Xka9c6AUS18VGAwGTCqhY8K7J62Xl4aMFILtPXy0oBRLXxUoHdtvrw0YFSLNj9P07Q2X1666bdq48hl8DiCkfpcmy8vDRjVpq0TlU1bvnw5ixcvZvHixa27vDRgVIs2T1Q2bdOmTWzbto1t27a17ndnwKgWbZ6obFqbf3cGjKRiDBjV4sADD5zyWNNzkleaxrXXXjvlsabnJK80jZ07d055rOk5yStNY8mSJVMea3ptnuQtupI3Is4AjgMWAB8Gvg5cCOwCNgOnZeZ4yT6oPyxbtow777zzgWN1b3R0dMrjNig2gomIlcCzgBXA0cABwHnAWZl5JDAHeHGp9tVfnOTt3Zw5c6Y8boOSl0jHApuAy4AvAV8GDqMaxQB8FXhuwfbVR5zk7d0+++wz5XEblAyYZcDhwMv4bT2kuZ261GDpWKkr3qae2ghwZWbel5kJjPHgQLF07ABZuX
LllMfas5UMmG8Cz4+IORHxh8A+wNWduRmAVcB1BdtXH7n99tunPNb0vIs0hcz8ckQcBXyXKshOA24DLoiIBcBNwKWl2pfUvNKlY9dO8fLRJdtUf1q9ejVnnnnmA8fqXpt/d+5op1pMbJk5cazutfl3Z8CoNm37v28/aevvzoDRjGzYsIGNGzf29N6JVahDQ0M9t79ixQqGh4d7fn9btW3kMsFnkVSbsbExxsbGmu5GK7W18JojGM3I8PBwzyOINWvWALB+/frZ7NJAmLg9vW7duoZ7MjOOYKQ+1+b9jA0Yqc+1eaGdASOpGANG6nNtftjRSV6pz7nQTlJRbRu5TDBgpBZo28hlgnMwkooxYCQVY8BIKsaAkVSMASOpGANGUjHeppZqsDv76MDu76XT1D46pUvH/gC4p3N6G/BR4APADuCqzHx7yfalPcXEPjq7s1lXE4oFTEQsBMjMlZNe+yHwV8BPga9ExNMz8/ul+iD1i93ZRwfau5dOyRHMIcDiiLiq087bgL0z81aAiLgSOAboKWDWrl3LyMjILHV1ZrZu3Qr89l96E5YuXco555zTWPtSN0oGzDbg/cDHgD+mqkU9uZLjr4En9PrDR0ZG2LLlLubMX7RbnezFrs7c+F13j9beNsCu+7c30q40UyUD5mbglk4t6psj4h7gUZO+vtulY+fMX8TQk47bnR/RSqO3fLHpLkhdKXmbehg4F6BTOnYx8JuIeGJEzAGOxdKx0h6t5AhmPXBhRHwT2EUVOOPAp4B5VHeRvlOwfUkNK1mb+j5gqk0sjijVpqT+4kpeScUYMJKKMWAkFWPASCrGgJFUjAEjqRgDRlIxBoykYgwYScUYMJKKMWAkFWPASCrGgJFUTNcBExFLSnZE0p5n2u0aIuJQ4NNU++s+E/g68HI365Y0nW5GMB8EXgKMZOb/AK8Fzi/aK0l7hG4CZnFm3jRxkpn/CuxdrkuS9hTdBMwvI+IQqm0viYgTgF8W7ZWkPUI3W2a+FrgIeGpE/Ar4CXBi0V5J2iNMGzCdQmnPjoh9gHmZeW+3PzwiHgN8D3geVbnYC6lGQpuB0zJzvJdOS2qHbu4iXUPn8qhzvgvYDtwEvDsz736Y982nqkU9USXsPOCszLw2Is4HXgxctnvdl9TPupmD+RFwI/DGzj/XUxVMu4OqNMnDeT/V3aY7OueHUd3ihqrK43N76K+kFulmDuaIzDxs0vmNEXF9Zp4YESdN9YaIOAW4KzOvjIgzOi/P6VR5hKps7H4991pSK3QzgpkfEU+dOImIg4F5EbEIWPAw7xkGnhcR1wKHAhcDj5n09d0uGyup/3Uzgnk98NWI+AVVIC2huov0Nqrg+B2ZedTEcSdkTgXeFxErM/NaYBVwze50fHR0lF33bx/IOs277t/O6GjTvZCm181dpGsj4gnAn1IFw7FUZV+HZtjW6cAFEbGAaoL40pl2VlK7dHMX6fHAq6kuex4JvIvqDlBXMnPlpNOjZ9i/hzU0NMT2+2HoScfN1o9sjdFbvsjQ0EzzXarfwwZMRLwEeA3V3Z/LqC6LLsjMd9TUN0kt9/tGMJ8DPgM8MzNvAYgIF8ZJ6trvC5inAa8EvhkRPwP+eZrvl6QHedjb1Jm5OTNPB/YH3gM8B3hsRHwlIl5QVwcltVc3d5F2AJcDl0fEo4GTgHXAFYX7JqnlZnTJk5l3Aed2/lELrV27lpGRkUba3rp1KwBr1qxppH2ApUuXcs455zTW/qBxTmXAjIyMsOWuLcxdVP+/+vG51ZMiW0eb2U5ofPuORtodZAbMAJq7aC+WPP/AprtRu7u/dnvTXRg4li2RVIwBI6kYA0ZSMQaMpGIMGEnFGDCSijFgJBXjOhipC02ugIbmV0H3ugLagJG6MDIywl1btjA0t5lB/7zxaqeU7Z2gqdPoeO+7tBgwUpeG5s7lxP0e1XQ3avfJe3p/tMM5GEnFFBvBRMQ84AIggJ1Um1fNwfKx0sAoOYJ5EUBmrgDeSl
U6dqJ87JFUYdP15uGS2qdYwGTm5VTVCAAeB/wCy8dKA6XoHExm7oiIi4B/oKqDZPlYaYAUv4uUmSdHxFuA7wCLJn3J8rENGB0dZXz7joHcG2V8+w5GsSRmnYqNYCLiFZMK328DxoEbImJl57VVwHWl2pfUvJIjmM8DH4+IbwDzgTdSlYydtfKxTdWm3rXzPgDmzFtQe9tQfW7orbLj0NAQY9w3sDvaWRGzXsUCJjN/A7x8ii/NSvnYpUuXzsaP6cnEsu1lS5r6j3Wo0c8vdau1K3mb3Bl+4nmQ9evXN9YHqQ1cySupGANGUjEGjKRiDBhJxRgwkooxYCQVY8BIKsaAkVSMASOpmNau5JXqNDo6yvbx8d3an7atRsfH2Tna21PojmAkFeMIRurC0NAQ88bGBraqwKIen0J3BCOpGANGUjFeIg2gprbMHL9vJwBzF8yrvW2oPneP+3SpRwbMgOmLjbqGGprHGGr28w8iA2bAuFGX6uQcjKRiioxgImI+sAE4CNgbeCfwIywbKw2UUiOYE4GRTonYVcCHsGysNHBKBcxngbMnne/AsrHSwClyiZSZowARsS9V7aOzgPdbNlYaLCUrOx4AXAN8IjMvoarsOMGysdIAKBIwEfFY4CrgLZm5ofPyDywbKw2WUutgzgSWAGdHxMRczBuAD85W2VhJ/a/UHMwbqALloWalbKzUhNEG94MZG69mGBbOrX/p2uj4OIt6fK8reaUuNP2IwW86j1ksWras9rYX0fvnN2CkLjT5iAW09zELHxWQVIwBI6kYA0ZSMQaMpGIMGEnFGDCSijFgJBVjwEgqxoCRVIwBI6kYA0ZSMQaMpGIMGEnFGDCSijFgJBVjwEgqxoCRVEzRHe0i4hnAezNzZUQ8CUvHSgOlZF2ktcDHgIWdlywdKw2YkpdItwLHTzq3dKw0YIoFTGZ+Drh/0ktzLB0rDZY6J3ktHSsNmDoDxtKx0oCpsy7S6cAFlo6VBkfRgMnMnwFHdI5vxtKx0kBxoZ2kYgwYScUYMJKKMWAkFWPASCrGgJFUjAEjqRgDRlIxBoykYgwYScUYMJKKMWAkFWPASCrGgJFUjAEjqZg6N5zSgLv33nub7oJqZsCoNmNjY013QTXzEkm1+MhHPjLlsfZsAzuC2bBhAxs3buzpvVu3bgVgzZo1Pbe/YsUKhoeHe35/U3r9vW3ZsuWB4yuuuIIbbrihp/bb+nsbVLUGTETMBT4MHAL8H/A3mXlLnX2YDQsXLpz+myQxZ9euXdN/1yyJiOOB4zLzlIg4AjgjM6csIRsRBwG3XX311ey///619VFlnHTSSdx9990ALFmyhIsvvrjhHtVrd0bM8NtR87Jly3p6fw0jvzlTvVj3HMyzga8BZOa3gcNrbl8NefOb3zzlsbqzcOHCVo6c656DeQRwz6TznRGxV2buqLkfqtny5ctZsmTJA8eDZnh4eCDnjuoOmHupysZOmGu4DA5HLoOn7oDZCLwI+ExnDmZTze2rQYM4chl0dQfMZcDzIuJbVJNCr6y5fUk1qjVgMnMcOLXONiU1x5W8kooxYCQVY8BIKsaAkVRMPz/sOA/gzjvvbLofkqZxzDHHHAT8/KHr2vo5YP4A4IQTTmi6H5KmdxvweOBnk1/s54C5HjgS+F9gZ8N9kTS9nz/0hVqfppY0WJzklVSMASOpGANGUjEGjKRiDBhJxfTzbeq+FhHPAN6bmSub7ksbRMR8YANwELA38M7M/GKjnWqBiJgHXAAE1XKNV2bmrc32qnuOYHoQEWuBjwHt2yS1OScCI5l5JLAK+FDD/WmLFwFk5grgrcB5zXZnZgyY3twKHN90J1rms8DZk87dKrULmXk58OrO6eOAXzTYnRnzEqkHmfm5TlkVdSkzRwEiYl/gUuCsZnvUHpm5IyIuAl4CvLTp/syEIxjVJiIOAK4BPpGZlzTdnzbJzJOBJwMXRMQ+TfenW45gVIuIeCxwFfC6zLy66f60RUS8Atg/M9
cB24BxWvRsngGjupwJLAHOjoiJuZhVmbm9wT61weeBj0fEN4D5wBszc6zhPnXNhx0lFeMcjKRiDBhJxRgwkooxYCQVY8BIKsbb1Jqxzirmm4EfAbuABcAdVA/i/c6+rBFxCrAyM0+pr5fqBwaMenVHZh46cRIR5wLvA/66uS6p3xgwmi3XAOsi4rnAuVSX3/8NrJ78TRHxMuB0YBHVtg3DmfmtiPhb4GSqlarfzczXRMTTgH+i+u90jGqE9JO6PpB2n3Mw2m2dvV5eCtwAfAo4OTOXA5uoQmPi++YCpwIvzMxDgHOAMzp7npwBHA4cBiyIiD8C3gScm5mHU+2JckR9n0qzwZW8mrGHzMFANRL5LvCPwPmZ+fSHfP8pdOZgIuIRVHucBLAS2JmZz4mIL1BtR/AF4LOZuTkiXtr5mV8GvgR8KTNb8xyOvERS7x40BwMQEYdQTfpOnO8H7DvpfIgqiD4JfAO4EXhd58t/STVCWQV8LSJOyMxLI+I/gBdSjWb+AnhVsU+kWeclkmZTAo+JiKd0ztdSXRJNeDJVAL2bas7meGBeRDyaajS0KTPfSvXU9dMi4l+AP8vMj1JtVvWgkZH6nwGjWdN5yvdE4OKIuBF4CvCeSd/yn8APgR8D/wXcBTwuM++imsy9PiK+R7UV6QaqIPq7iPg+1XzNa+v6LJodzsFIKsYRjKRiDBhJxRgwkooxYCQVY8BIKsaAkVSMASOpmP8H5VGMAR7ss28AAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# using boxplot \n", "PA = sns.factorplot(data = dataset , x = 'Pclass' , y = 'Age', kind = 'box')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we can get some information, First class passengers are older than 2nd class passengers who are also older than 3rd class passengers. We can easily visaulize that roughly `37, 29, 24` respectively are the median values of each classes. The strategy can be used to fill Age with the median age of similar rows according to Pclass." 
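] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to express this strategy without hard-coding the medians is a grouped `transform`. The cell below is only an alternative sketch on made-up data (`toy` and its values are assumptions, not the Titanic data); the notebook's own implementation follows it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "# Hypothetical toy data with a few missing ages\n", "toy = pd.DataFrame({\n", "    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],\n", "    'Age': [38.0, np.nan, 36.0, 29.0, np.nan, 24.0, 22.0, np.nan],\n", "})\n", "\n", "# Median age within each class, broadcast back to every row of that class\n", "class_median = toy.groupby('Pclass')['Age'].transform('median')\n", "\n", "# Fill only the missing ages with the median of the row's own class\n", "toy['Age'] = toy['Age'].fillna(class_median)\n", "print(toy)" 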
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Custom function for age imputation: fill a missing Age with the approximate\n", "# median age of the passenger's class (values read off the box plot above)\n", "def AgeImpute(row):\n", "    Age = row['Age']\n", "    Pclass = row['Pclass']\n", "\n", "    if pd.isnull(Age):\n", "        if Pclass == 1: return 37\n", "        elif Pclass == 2: return 29\n", "        else: return 24\n", "    else:\n", "        return Age\n", "\n", "# Impute Age row by row\n", "dataset['Age'] = dataset[['Age' , 'Pclass']].apply(AgeImpute, axis = 1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Age is now imputed; no missing Age records remain\n", "sns.heatmap(dataset.isnull(), yticklabels = False, cbar = False, cmap = 'summer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Fare` feature is still missing some values; we will handle it later.\n", "\n", "## SibSp\n", "Now, let's look at the `Survived` and `SibSp` features in detail." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Explore SibSp feature vs Survived\n", "# We'll use factorplot for the analysis\n", "Sib_Sur = sns.factorplot(x=\"SibSp\",y=\"Survived\",data=train,\n", "                         kind=\"bar\", size = 6 , palette = \"Blues\")\n", "\n", "Sib_Sur.despine(left=True)\n", "Sib_Sur = Sib_Sur.set_ylabels(\"survival probability\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems that passengers with many siblings/spouses aboard had a lower chance of survival.\n", "Passengers travelling alone (SibSp 0) or with one or two siblings/spouses (SibSp 1 or 2) had a better chance of survival." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parch\n", "Let's look at the `Survived` and `Parch` features in detail." ] }, { 
"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Explore Parch feature vs Survived\n", "# We'll use factorplot for the analysis\n", "Sur_Par = sns.factorplot(x=\"Parch\",y=\"Survived\",data=train, \n", "                         kind=\"bar\", size = 6 , palette = \"GnBu_d\")\n", "\n", "Sur_Par.despine(left=True)\n", "Sur_Par = Sur_Par.set_ylabels(\"survival probability\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passengers travelling with a small family had a better chance of survival than those with no parents or children aboard." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fare\n", "Let's look at the `Survived` and `Fare` features in detail. We have seen that the `Fare` feature is also missing some values, so let's handle that first." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset[\"Fare\"].isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since only one value is missing, we fill it with the median fare." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "dataset[\"Fare\"] = dataset[\"Fare\"].fillna(dataset[\"Fare\"].median())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Categorical values \n", "
\n", "\n", "We can turn categorical values into numerical values. This is needed because the model can only be trained on numeric data. We can use feature mapping or create dummy variables.\n", "\n", "## Sex\n", "Let's take a quick look at the values in this feature." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 male\n", "1 female\n", "2 female\n", "3 female\n", "4 male\n", "Name: Sex, dtype: object\n", " \n", "1294 male\n", "1295 female\n", "1296 male\n", "1297 male\n", "1298 male\n", "Name: Sex, dtype: object\n" ] } ], "source": [ "print(dataset['Sex'].head()) # top 5\n", "print(' ')\n", "print(dataset['Sex'].tail()) # last 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "The model cannot take such string values. We need to map the `Sex` column to numeric values that the model can digest." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# One-hot encode Sex; drop_first=True drops 'female', leaving one 'male' column (1 = male, 0 = female)\n", "sex = pd.get_dummies(dataset['Sex'], drop_first = True)\n", "dataset = pd.concat([dataset,sex], axis = 1)\n", "\n", "# From now on we no longer need the original Sex column, so we drop it.\n", "dataset.drop(['Sex'] , axis = 1 , inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how many people `survived` based on their `gender`. We can guess that `Female` passengers survived more often than `Male` passengers, though this is only an assumption for now. In the movie, we heard **Women and Children First**." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYIAAAEFCAYAAADuT+DpAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAEuZJREFUeJzt3XuUnXV97/H3TC5cJMSjBwipYm7Hb1yyBMGWUAikFY4LskgsPUsC6aFwQl1U9JSKEoOhGSpaI6ldBeWo5OCUCFknA7G6EFqEcokglyItskq/AU5jBIyGWyaEQDKX/rF3cCdOYELy7D2T3/u1VtZ69u/Z+5nPZE32Z37P8+xf2vr7+5Eklau91QEkSa1lEUhS4SwCSSqcRSBJhRvZ6gC7KiL2AX4b+AXQ2+I4kjRcjAAOBR7KzNcadwy7IqBWAqtaHUKShqnpwI8aB4ZjEfwC4Prrr2fcuHGtziJJw8K6deuYO3cu1N9DGw3HIugFGDduHO9617tanUWShpvfOKXuxWJJKpxFIEmFswgkqXAWgSQVziKQpMJZBJJUOItAkgo3HD9HIEkt1dvby5IlS8hM+vv7GTduHB0dHeyzzz6tjvaWFFsEpy78m1ZHGDJuufzPWx1BGlbuueceXnnlFa699loAFi9ezHe/+13mzJnT4mRvjaeGJGkXHXrooTzyyCPcddddbN68mYsuuoiPfexjfPOb32TOnDnMmTOHhx56iO7ubmbOnMkvf/lLbr75ZhYsWNDq6AMqdkYgSW/V1KlTmT9/Ptdffz0LFizgiCOO4LzzzuMnP/kJy5cvZ+PGjcybN4+uri4+97nPcckll/DSSy9x3XXXtTr6gCwCSdpFmcnUqVO5+uqr6enp4Vvf+hbz58+nv7+fs88+G4Du7m62bNnC9OnTWbx4MTNnzuRtb3tbi5MPzFNDkrSL7rvvPr72ta8BMHLkSCKCCRMmcOSRR7Js2TKWLl3KaaedxujRo7nhhhs49thjuf322/n5z3/e4uQDswgkaRfNnTuXvr4+Zs+ezZw5c7jpppv4yle+woQJEzjrrLM444wzGD9+PGvXrqWrq4vPfOYzLFiwgEsuuYS+vr5Wx/8Nbf39/a3OsEsiYgLwH3fcccduLUPtXUO/5l1D0t7v6aef5sMf/jDAxMxc07jPGYEkFc4ikKTCWQSSVDiLQJIKZxFIUuH8QJmkvdqevkNwb7zLzhmBJLXYypUrWbJkScu+vkUgSYXz1JAk7UErV67kzjvv5NVXX2X9+vWcffbZ3HHHHTzxxBNcfPHFrFu3jttuu42enh7GjBnDVVddtd3rly1bxs0330xbWxunnnrq62sXVckikKQ9bNOmTVx77bX84Ac/oLOzkxUrVvDAAw/Q2dnJ4YcfTmdnJ+3t7cybN4+f/vSnr7/uySef5JZbbuGGG26gra2Nc845h+OPP55JkyZVmtcikKQ97H3vex8AY8aMYfLkybS1tTF27Fi2bt3KqFGj+PSnP83+++/PunXr6Onpef11q1ev5tlnn+Wcc84BYMOGDaxdu9YikKThpq2tbcDxrVu3cvvtt9PV1cXmzZs5/fTTaVzvbdKkSUyZMoWlS5fS1tZGZ2cn733veyvPaxFI2qsNpds9R44cyX777cfpp5/O6NGjOeigg/jVr371+v6pU6dy7LHHcuaZZ7JlyxY+8IEPcMghh1Sey9VHNaT+oUiqhquPSpJ2yiKQpMJZBJJUOItAkgpX6V1DEXEw8DBwMtADdAL9wGPABZnZFxGLgJn1/Rdm5oNVZpIkba+yIoiIUcA3gc31oa8CCzPzroj4BjA7In4GnAgcA7wbuAn47aoySSrPZ7//wz16vCtmnbxHjzcUVHlqaAnwDeDZ+uOjgbvr27cCJwHHA7dlZn9mrgVGRsRBFWaSpEr19vYyb948zjzzTDZs2LDHjnvcccftsWPtqJIiiIhzgPWZ+Y8Nw22Zue1DCxuBscC
BQOPf1LZxSRqW1q9fz4svvsjy5csZO3Z4vJ1VdWrofwH9EXEScCRwHXBww/4xwEtAd317x3FJGpYuvfRS1qxZw4IFC9i0aRMvvvgiAAsXLiQiOPnkk/ngBz/Iz372M6ZNm8bGjRt59NFHmThxIldccQWrV6/my1/+Mn19fXR3d7Nw4UKOOuqo14+fmVx++eUAvP3tb+dLX/oSY8aMGTDLYFUyI8jMEzLzxMycAfwLcDZwa0TMqD/lFGAVcC/wkYhoj4jDgPbMfK6KTJLUDIsWLWLKlCm84x3vYNq0aSxbtowvfOELdHR0APDMM89w4YUX8p3vfIfrrruOs846i66uLh5++GG6u7t58sknmT9/Pp2dnZx77rmsXLlyu+NfeumlLFq0iGXLlnHCCSewdOnS3c7czLWGLgKuiYjRwOPAjZnZGxGrgB9TK6ULmphHkiqzevVq7r//fm699VYAuru7gdpv8ePHjwdg//33Z8qUKUBtpdLXXnuNgw8+mKuvvpp9992XTZs2ccABB2x33KeeeorLLrsMqC1iN3HixN3OWnkR1GcF25w4wP4OoKPqHJLUTJMmTWLWrFmcdtppPP/883R1dQE7X5l0my9+8YssWbKEyZMnc+WVV/LMM89st3/ixIksXryY8ePH8/DDD7N+/frdzurqo5L2aq263fP888/n85//PCtWrODll1/mk5/85KBeN2vWLD7xiU/wzne+k3Hjxr1+jWGbjo4O5s+fT29vL1Arjt3l6qNy9VGpAK4+KknaKYtAkgpnEUhS4SwCSSqcRSBJhbMIJKlwFoEkFc4ikKTCWQSSVDiLQJIKZxFIUuEsAkkqnEUgSYWzCCSpcBaBJBXOIpCkwlkEklQ4i0CSCmcRSFLhLAJJKpxFIEmFswgkqXAWgSQVziKQpMJZBJJUOItAkgpnEUhS4SwCSSqcRSBJhbMIJKlwFoEkFc4ikKTCWQSSVDiLQJIKZxFIUuFGVnXgiBgBXAME0AucC7QBnUA/8BhwQWb2RcQiYCbQA1yYmQ9WlUuStL0qZwSnAWTmccBfAF+t/1mYmdOplcLsiDgKOBE4BpgDfL3CTJKkHVRWBJn598DH6w/fA/wSOBq4uz52K3AScDxwW2b2Z+ZaYGREHFRVLknS9iq9RpCZPRHxd8BVwI1AW2b213dvBMYCBwIbGl62bVyS1ASVXyzOzD8G3kvtesF+DbvGAC8B3fXtHcclSU1QWRFExP+MiAX1h68AfcA/R8SM+tgpwCrgXuAjEdEeEYcB7Zn5XFW5JEnbq+yuIWAl8O2IuAcYBVwIPA5cExGj69s3ZmZvRKwCfkytmC6oMJMkaQeVFUFmbgI+NsCuEwd4bgfQUVUWSdLO+YEySSqcRSBJhbMIJKlwFoEkFc4ikKTCWQSSVDiLQJIKZxFIUuEsAkkq3KCKICKuGmDs7/Z8HElSs73hEhMRsRSYBHwoIt7fsGsULhUtSXuFN1tr6HJgAvC3wGUN4z3UFo2TJA1zb1gEmbkGWAMcEREHUpsFtNV3HwC8UGU4SVL1BrX6aP3/FVgAPN8w3E/ttJEkaRgb7DLU5wGTM3N9lWEkSc032NtH1+JpIEnaKw12RvAE8KOIuBN4ddtgZv5lJakkSU0z2CJ4pv4Hfn2xWJK0FxhUEWTmZW/+LEnScDTYu4b6qN0l1OjZzHz3no8kSWqmwc4IXr+oHBGjgI8Cx1YVSpLUPLu86Fxmbs3MLuD3K8gjSWqywZ4aOrvhYRvwfmBrJYkkSU012LuGfq9hux94Djhjz8eRJDXbYK8RnFu/NhD11zyWmT2VJpMkNcVgTw0dDdxEba2hduCQiPiDzHygynCSyvbZ7/+w1RGGjCtmnVzZsQd7auhK4Ixtb/wRMQ24CvidqoJJkppjsHcNHdD4239m3g/sW00kSVIzDbYIXoiI2dseRMRH2X5JaknSMDXYU0MfB26OiP9L7fbRfuB3K0slSWqawc4ITgFeAd5D7VbS9cCMijJ
JkpposEXwceC4zNyUmY8CRwOfqi6WJKlZBlsEo4AtDY+38JuL0EmShqHBXiP4e+CfImIFtQL4Q+B7laWSJDXNoGYEmTmf2mcJApgMXJmZl1YZTJLUHIOdEZCZNwI3VphFktQCu7wMtSRp7zLoGcGuqC9Qdy0wAdgHuBz4N6CT2jWGx4ALMrMvIhYBM4Ee4MLMfLCKTJKkgVU1I/gj4PnMnE7tMwhfA74KLKyPtQGzI+Io4ETgGGAO8PWK8kiSdqKqIugCGi8m91D77MHd9ce3AicBxwO3ZWZ/Zq4FRkbEQRVlkiQNoJIiyMyXM3NjRIyhdoF5IdCWmds+e7ARGAscCGxoeOm2cUlSk1R2sTgi3g3cCSzLzBuAvobdY4CXgO769o7jkqQmqaQIIuIQ4DZgfmZeWx9+JCJm1LdPAVYB9wIfiYj2iDgMaM/M56rIJEkaWCV3DQGXAP8FuDQitl0r+DPgyogYDTwO3JiZvRGxCvgxtVK6oKI8kqSdqKQIMvPPqL3x7+jEAZ7bAXRUkUOS9Ob8QJkkFc4ikKTCWQSSVDiLQJIKZxFIUuEsAkkqnEUgSYWzCCSpcBaBJBXOIpCkwlkEklQ4i0CSCmcRSFLhLAJJKpxFIEmFswgkqXAWgSQVziKQpMJZBJJUOItAkgpnEUhS4SwCSSqcRSBJhbMIJKlwFoEkFc4ikKTCWQSSVDiLQJIKZxFIUuEsAkkqnEUgSYWzCCSpcBaBJBXOIpCkwlkEklS4ka0OIGl7py78m1ZHGDLe/zuHtzpCESotgog4BlicmTMiYgrQCfQDjwEXZGZfRCwCZgI9wIWZ+WCVmSRJ26vs1FBEXAwsBfatD30VWJiZ04E2YHZEHAWcCBwDzAG+XlUeSdLAqrxG8BRwesPjo4G769u3AicBxwO3ZWZ/Zq4FRkbEQRVmkiTtoLIiyMybgK0NQ22Z2V/f3giMBQ4ENjQ8Z9u4JKlJmnnXUF/D9hjgJaC7vr3juCSpSZpZBI9ExIz69inAKuBe4CMR0R4RhwHtmflcEzNJUvGaefvoRcA1ETEaeBy4MTN7I2IV8GNqpXRBE/NIkqi4CDJzDTCtvr2a2h1COz6nA+ioMockaef8ZLEkFc4ikKTCWQSSVDiLQJIKZxFIUuFcfVR89vs/bHWEIeGKWSe3OoLUEs4IJKlwFoEkFc4ikKTCWQSSVDiLQJIKZxFIUuEsAkkqnEUgSYWzCCSpcBaBJBXOIpCkwlkEklQ4i0CSCmcRSFLhLAJJKpxFIEmFswgkqXAWgSQVziKQpMJZBJJUOItAkgpnEUhS4SwCSSqcRSBJhbMIJKlwFoEkFc4ikKTCWQSSVDiLQJIKZxFIUuEsAkkq3MhWBwCIiHbgauAI4DXgvMx8srWpJKkMQ2VG8FFg38w8Fvgc8NctziNJxRgSMwLgeOAfADLz/oj40Bs8dwTAunXrdusLbt3UvVuv35u88sLzrY4wJDz99NOtjgD4s9nIn81f292fz4b3zBE77hsqRXAgsKHhcW9EjMzMngGeeyjA3LlzmxKsBGtvaXWCoeH2v2p1Au3In81f24M/n4cCTzUODJUi6AbGNDxu30kJADwETAd+AfRWHUyS9hIjqJXAQzvuGCpFcC9wGrAiIqYBP93ZEzPzNeBHzQomSXuRpwYaHCpF8F3g5Ii4D2gDzm1xHkkqRlt/f3+rM0iSWmio3D4qSWoRi0CSCmcRSFLhhsrFYjWZy3poqIuIY4DFmTmj1Vn2ds4IyuWyHhqyIuJiYCmwb6uzlMAiKNd2y3oAb7Ssh9RsTwGntzpEKSyCcg24rEerwkiNMvMmYGurc5TCIijXrizrIWkvZhGU617gVIA3W9ZD0t7NUwHlclkPSYBLTEhS8Tw1JEmFswgkqXAWgSQVziKQpMJZBJJUOG8fVdEi4n8AC6j9W2gHrsvMK3bzmOcDZOY3dvM4dwEdmXnX7hxHejMWgYoVEb9FbbG9ozLz+Yg4ALg7IjIzv/9Wj7u
7BSA1m0Wgkv1XYBSwP/B8Zr4cEX8MvBoRa4AZmbkmImZQ+818Rv239BeA9wPXAwdl5qcAIuKvgaeBsfXjvwD8twH2XwN8HTgcGEFtqeXlEbEPtRU3PwSsqeeTKuc1AhUrM/8V+B7w/yPiwYhYDIwYxP/L8GhmBvB/gD+IiBER0Qb8IbC84XnLd7J/IfBwZh4NnAB8PiImAZ+q53of8L+ByXvsm5XegEWgomXmnwITqL2pvwe4PyLebPnjB+qvXQ/8K/B7wPTaUK5rOPbO9p8EnB8R/wLcA7yN2gxjBrCi/tongPv2yDcpvQlPDalYETETOCAz/x/wbeDbEfEnwDygn9oaTFA7fdRoc8P2MuAMYAvwnQG+zED7RwB/lJk/qec4hNpppI83fE0AV4NVUzgjUMleAf4qIiYA1E/fHAk8AjxH7bd0gNlvcIzvUTu989+pLeQ3mP3/BPxp/WseCjwKHAbcDsyNiPaIeA/wu2/1G5N2hUWgYmXmncBlwM0RkcC/A73AF4BFwN9GxEPAS29wjM3UlvR+MDNfHuT+y4D9IuIxaqVwcWY+Re3/kO4GHqd2QfmxPfKNSm/C1UclqXDOCCSpcBaBJBXOIpCkwlkEklQ4i0CSCmcRSFLhLAJJKtx/AmI6Iamp4KpqAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# using countplot to estimate amount\n", "sns.countplot(data = train , x = 'Survived' , hue = 'Sex', palette = 'GnBu_d')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>Survived</th></tr>\n", " <tr><th>Sex</th><th></th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>female</th><td>0.747573</td></tr>\n", " <tr><th>male</th><td>0.190559</td></tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Survived\n", "Sex \n", "female 0.747573\n", "male 0.190559" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# survival rate by sex\n", "train[[\"Sex\",\"Survived\"]].groupby('Sex').mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Male passengers clearly had a much lower chance of survival than female passengers (about 19% versus 75%). This makes `Sex` an important feature for our prediction task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pclass\n", "Let's explore the passenger class feature together with the age feature. From this we can see how many children, young and older people were in each passenger class." 
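] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "facet = sns.FacetGrid(train, hue=\"Pclass\",aspect=4)\n", "facet.map(sns.kdeplot,'Age',shade= True)\n", "facet.set(xlim=(0, train['Age'].max()))\n", "facet.add_legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, we see that class 3 had more young people, while older passengers were more common in first class, which suggests that they were wealthier.\n", "\n", "Next, let's explore `Pclass` vs `Survived`, split by the `Sex` feature. This gives more information about the survival probability of each class according to gender." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Survived_Pclass = sns.factorplot(x=\"Pclass\", y=\"Survived\", \n", "                                 hue=\"Sex\", data=train,size=6, \n", "                                 kind=\"bar\", palette=\"BuGn_r\")\n", "Survived_Pclass.despine(left=True)\n", "Survived_Pclass = Survived_Pclass.set_ylabels(\"survival probability\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Survival is not the same across classes. First class passengers had a better chance of survival than second and third class passengers, and females survived more often than males in every class.\n", "\n", "## Embarked\n", "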
\n", "Port of Embarkation , C = Cherbourg, Q = Queenstown, S = Southampton. Categorical feature that should be encoded. We can use feature mapping or make dummy vairables for it.\n", "\n", "However, let's explore it combining `Pclass` and `Survivied` features. So that, we can get idea about the classes of passengers and also the concern embarked." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEFCAYAAADqujDUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAEsJJREFUeJzt3X+QXWV9x/H3bggkkqSIWCMgqBW/WoVAm9oEojZdIshowbajLUEkTlQGhCKliraDoFYsNHXcVrSKShXEoVo0QYhIEZFArAVRlPLV1NHKSKRIISBLluxu/zh39eaSbG4WntxsnvdrJpN7znN+fDM7uZ89z3POc/rGxsaQJNWnv9cFSJJ6wwCQpEoZAJJUKQNAkiq1W68L6EZE7AH8HnAPMNLjciRpqpgGPBP4VmZu7GycEgFA8+X/jV4XIUlT1EuBmzpXTpUAuAfgsssuY+7cub2uRZKmhPXr17N06VJofYd2mioBMAIwd+5c9t9//17XIklTzRa7zh0ElqRKGQCSVCkDQJIqZQBIUqUMAEmqlAEgSZUyACSpUgaA9AQMDg4yMDDA4OBgr0uRtpsBIE3S0NAQK1euBGDVqlUMDQ31uCJp+xgA0iQNDw8z/ka90dFRhoeHe1yRtH0MAEmqlAEgSZUyACSpUsVmA42IfuAiYB6wEViemeva2l8JvLu1eBtwamaOlapHkrS5klcAxwEzMnMhcDawYrwhImYDFwKvyswFwI+BfQrWIknqUDIAFgGrATJzLTC/re1w4A5gRUR8A/h5Zv5vwVokSR1KBsAc4MG25ZGIGO9y2gdYDLwDeCVwRkQ8v2AtkqQOJQNgAzC7/VyZuan1+Rc0Lylen5kPAzcChxasRZLUoWQArAGOAYiIBTRdPuNuBV4cEfu0rgoWAHcWrEWS1KHkO4GvBJZExM1AH7AsIs4E1mXmyoh4J/CV1rZXZOb3CtYiSepQLAAycxQ4uWP1XW3tnwM+V+r8kqSJ+SCYJFXKAJCkShkAklQpA0CSKmUASFKlDABJqpQBIEmVMgAkqVIGgCRVygCQpEqVnAtImpSTPvUXvS6hKyMbN222/NbL38W0PabGf6lLln2o1yVoJ+AVgCRVygCQpEoZAJJUKQNAkiplAEhSpQwASaqUASBJlTIAJKlSBoAkVcoAkKRKGQCSVCkDQJIqZQBIUqUMAEmqVLG5ayOiH7gImAdsBJZn5rq29kHgCOCh1qpjM/PBUvVIkjZXcvLy44AZmbkwIhYAK4Bj29p/BzgqM+8rWIMkaStKdgEtAlYDZOZaYP54Q+vq4CDgYxGxJiLeWLAOSdIWlAyAOUB7l85IRIxfcewJ/CNwAnA0cEpEHFKwFklSh5IBsAGY3X6uzBx/h94jwIcy85HMfAi4nmasQJJ2mMHBQQYGBhgcHOx1KT1RMgDWAMcAtMYA7mhrez5wU0RMi4jpNN1FtxWsRZI2MzQ0xMqVKwFYtWoVQ0NDPa5oxys5CHwlsCQibgb6gGURcSawLjNXRsRlwFrgMeDT
mfn9grVI0maGh4cZGxsDYHR0lOHhYWbOnNnjqnasYgGQmaPAyR2r72prvwC4oNT5JUkT80EwSaqUASBJlTIAJKlSBoA0SX39fW0LHcvSFGAASJPUP30as56/NwCzDtqb/unTelyRtH1K3gYq7fKe+pJ9eepL9u11GdKkeAUgSZUyACSpUgaAJFXKAJCkShkAklQpA0CSKmUASFKlDABJqpQBIEmVMgAkqVIGgCRVygCQpEoZAJJUKQNAkiplAEhSpQwASaqUASBJlTIAJKlSBoAkVcoAkKRKFXspfET0AxcB84CNwPLMXLeFbb4MfCkzP1qqFknS45W8AjgOmJGZC4GzgRVb2OZ9wN4Fa5AkbUXJAFgErAbIzLXA/PbGiPhTYBS4pmANkqStKBkAc4AH25ZHImI3gIh4MXA8cE7B80uSJlBsDADYAMxuW+7PzE2tzycC+wHXA88GhiPix5m5umA9kqQ2JQNgDfBq4IqIWADcMd6QmW8f/xwR5wLr/fKXpB2rZABcCSyJiJuBPmBZRJwJrMvMlQXPK6lHrj5xWa9L6NrQyMhmy9edchozp03rUTXb55hPf+pJOc6EARARL5uoPTNvnKBtFDi5Y/VdW9ju3InOIUkqY1tXAOe1/n4a8Dyabp0R4HCaLp0jypUmSSppwruAMnNxZi4G7gYOycwlmXk0cDDw0I4osAaDg4MMDAwwODjY61IkVaTb20AP7HiK93+AAwvUU52hoSFWrmyGRFatWsXQ0FCPK5JUi24HgW+NiH8BrqAZ0F0KfKNYVRUZHh5mbGwMgNHRUYaHh5k5c2aPq5JUg24DYDlwGs2g7hhwHc08P5KkKaqrAMjM4Yj4As1dPF8BntX2UJckaQrqagwgIl4HrAI+RDN52y0RcULJwiRJZXU7CPwOmls/H8rMe4HDgHcWq0qSVFy3ATCSmb+67TMz76GZyVOSNEV1Owj8/Yh4KzA9Ig4FTgFuL1eWJKm0bq8ATqWZvXMI+CTNTJ+nlCpKklTe9twG+sHMtN9fknYR3QbAs4BvRsRdwKXAlZn5SLmyJEmlddUFlJlnZeZzgPcDC4FvR8Sni1YmSSqq6/cBREQfMB3YneZp4OFSRT1Rx7/9sl6X0LXRTY9utvyW8z5P/24zelTN9vnsBUt7XYKkJ6CrAIiIQeA1NHf+XAqcnpmPTryXJGln1u0VwA+BwzLzvpLFSJJ2nG29EezNmfkxmukfTomIzdoz8z0Fa5MkFbStK4C+rXyWJE1xEwZAZv5z6+MDwOWteYAkSbsAnwOQpEr5HIAkVarbuYCm1HMAkqRtm8xzAJ/B5wAkacrrdgzgXnwOQJJ2Kd0GwNLMfN/2HDgi+mleHD8P2Agsz8x1be2nAifRdCe9JzOv2p7jS5KemG4D4M6IOAf4Js07AQDIzBsn2Oc4YEZmLoyIBcAK4FiAiNiH5n0ChwIzWsf/cmaOTeLfMLX1TWtf6FiWpHK6HQTeG1gMnA2c1/pz7jb2WQSsBsjMtcD88YZWV9K8zHwMmAs8UOWXP9A/bTozn/5CAGY+/QX0T5ve44ok1aKrK4DMXDyJY88BHmxbHomI3TJzU+uYm1qvmTwPGJzE8XcZcw5YyJwDFva6DEmV6fYuoK/R9NVvJjP/cILdNgCz25b7x7/82/b/p4j4GHBNRCzOzK91U48k6Ynrdgzg3LbP02n68v9vG/usAV4NXNEaA7hjvCGaWeXOB/4EeIxmkHi0y1okSU+CbruAvt6x6rqI+CZwzgS7XQksiYibaSaSWxYRZwLrMnNlRHwHuIXmyuKaLZxDklRQt11AB7Qt9gEvAp420T6ZOQqc3LH6rrb28cFkSVIPdNsF9HV+PQYwBtwHnFakIknSDrHN20Aj4lXAkZn5XOAvgf8CvgJcV7g2SVJBEwZARJwFvBvYIyIOoZkK+os0zwVcWL48SVIp27oCeD3w8sy8EzgeWJmZF9N0/xxVujhJKmVa369fctjXsVyLbQXAWNuL
Xxbz6yd7q3xqV9KuY/f+fg7dcxYA8/acxe79Xc+Ov8vY1iDwpojYC5gFHAZcCxARBwKbJtpRknZ2A3vtzcBee/e6jJ7ZVuR9gOYdAGuBizPznoh4LfDvwAWli5MklbOtl8J/vvUg1z6Z+d3W6odppna+oXRxkqRytvkcQGb+DPhZ2/LVRSuSJO0Q9Y16SJIAA0CSqmUASFKlDABJqpQBIEmVMgAkqVIGgCRVygCQpEoZAJJUKQNAkiplAEhSpQwASaqUASBJlTIAJKlSBoAkVcoAkKRKbfOFMJMVEf3ARcA8YCPNW8TWtbW/Dfiz1uLVmXleqVokSY9X8grgOGBGZi4EzgZWjDdExHOBpcDhwELgFRFxSMFaJEkdSgbAImA1QGauBea3tf0UODozRzJzFJgOPFqwFklSh2JdQMAc4MG25ZGI2C0zN2XmY8B9EdEHXAh8OzN/ULAWSVKHklcAG4DZ7efKzE3jCxExA7istc0pBeuQJG1ByQBYAxwDEBELgDvGG1q/+X8J+E5mviUzRwrWIUnagpJdQFcCSyLiZqAPWBYRZwLrgGnAy4E9IuKVre3fmZm3FKxHktSmWAC0BndP7lh9V9vnGaXOLUnaNh8Ek6RKGQCSVCkDQJIqZQBIUqUMAEmqlAEgSZUyACSpUgaAJFXKAJCkShkAklQpA0CSKmUASFKlDABJqpQBIEmVMgAkqVIGgCRVygCQpEoZAJJUKQNAkiplAEhSpQwASaqUASBJlTIAJKlSBoAkVcoAkKRK7VbqwBHRD1wEzAM2Asszc13HNk8HbgYOzsxHS9UiSXq8klcAxwEzMnMhcDawor0xIo4CrgWeUbAGSdJWlAyARcBqgMxcC8zvaB8FjgTuL1iDJGkrSgbAHODBtuWRiPhVl1NmfjUzf1Hw/JKkCZQMgA3A7PZzZeamgueTJG2HkgGwBjgGICIWAHcUPJckaTsVuwsIuBJYEhE3A33Asog4E1iXmSsLnleS1IViAZCZo8DJHavv2sJ2zy5VgyRp63wQTJIqZQBIUqUMAEmqlAEgSZUyACSpUgaAJFXKAJCkShkAklQpA0CSKmUASFKlDABJqpQBIEmVMgAkqVIGgCRVygCQpEoZAJJUKQNAkiplAEhSpQwASaqUASBJlTIAJKlSBoAkVcoAkKRKGQCSVCkDQJIqtVupA0dEP3ARMA/YCCzPzHVt7W8C3gJsAt6XmVeVqkWS9HglrwCOA2Zk5kLgbGDFeENEzAVOB44AjgLOj4g9CtYiSepQ7AoAWASsBsjMtRExv63tJcCazNwIbIyIdcAhwLe2cqxpAOvXr+/qxBsfeWCyNWs73H333UWO++gDjxQ5rn6t1M/u/o2PFjmuNtftz6/tO3PaltpLBsAc4MG25ZGI2C0zN22h7SHgNyY41jMBli5d+qQXqckb+Opgr0vQJA18dKDXJegJeO/Adv/8ngn8d+fKkgGwAZjdttzf+vLfUttsYKJf278FvBS4Bxh5MouUpF3YNJov/y32rpQMgDXAq4ErImIBcEdb238AfxsRM4A9gBcC39vagVpdRTcVrFWSdlWP+81/XN/Y2FiRM7bdBXQI0AcsA44B1mXmytZdQG+mGYh+f2Z+oUghkqQtKhYAkqSdmw+CSVKlDABJqpQBIEmVKnkXkLoQEWcDRwKjwBjwrsy8tbdVqRsR8SLgAuApwCzgauDczHRgbQqIiN8Fzqf5+fUDXwPOy8zhnha2A3kF0EMR8dvAHwFLMvMVwDuAT/a2KnUjIvYCPgeckZmLgQXAwTTzW2knFxH7A5cCb83MRTTT0mwEPtjTwnYwA6C37gUOAN4YEftl5u0002Ro53cscH1m/hAgM0eAEzHAp4oTgYsz8wcArau29wLHRMTMnla2AxkAPZSZ99FcARwB3BIRdwGv6m1V6tK+wI/aV2TmwzV1H0xxB/L4n98Y8HNgbk8q6gEDoIci4nnAhsx8Y2YeAJwAfCQi9u5xadq2nwDPal8R
Ec+JiJf1qB5tn58Az21f0Xp49QCaK/MqGAC9dQjNF/6M1vIPaCbJc76jnd9VwNER8VsAETEd+AfgxT2tSt36DLA8Ig6KiL0i4lrgYuCqzPxlj2vbYXwSuMci4q+B1wIP0wTy32XmF3tblbrRuovkQpqf22xgFc1dJP6nmgJaP7/309zB9RRgPU0X0JmZeX8va9tRDABJaomIQ4AfZebDva5lRzAAJKlSjgFIUqUMAEmqlAEgSZUyACSpUk4Gp11eRDyb5hmLOzuaPp6ZH+5i/xtoJnm7YZLnvwS4ITMvmcS+JwF/kJknTebc0kQMANXiZ5l5aK+LkHYmBoCqFhHrgS8Cv0/zINAngdOB/YGTMvPrrU3fHBHjM0W+LTNviIj9gE8Ae9HMDXRJZp7T+q39DcA+NA+HjZ/rKcC1wOWZ+eGIOBE4g6Yr9lbg1Mx8NCJeD/wNsIFmyoIq7knXjucYgGqxb0Tc3vHnYOAZwDWZeRgwA3hNZr4UOJfmy3ncw61t3gBcGhF7AH9O82U+PhX0GRGxT2v7/YHDMvNdreXdgX8DPt/68n8R8Cbg8NaVyb3AWRGxL807Bl4GLKR5wlgqwisA1WKLXUARAXBNa/EnwE1tn5/atuknADLzuxFxL/CCzPz7iFgcEWfRzAG0O7Bna/vbMnNT2/7vpXnpzx+3lhcDBwFrWzXsDtwGHA7cnJk/b9V3KTAw2X+0NBEDQNXrmMJ501Y2a1/fDzwWEStoZpT8LE030pFAX2uboY79L6eZc+Y84K+AacAVmXk6QETMovn/ONB2jInqkZ4wu4Ck7iwFiIj5NN0yPwSWABdm5r8CAexH88W+JbcDbwdOiIhDgRuA10TEb0ZEH/ARmi6nm4CFEbFfa3ri15X7J6l2XgGoFvtGxO0d627cjv1nRcS3aabqPj4zH4uI84HPRMQQ8FPgP4HnbO0AmXl/6x3QH6d5heR5wPU0v4jdDnygNQh8GnAd8Esef+uq9KRxMjhJqpRdQJJUKQNAkiplAEhSpQwASaqUASBJlTIAJKlSBoAkVer/AYgj5v4GJ0eEAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 'Embarked' vs 'Survived'\n", "sns.barplot(dataset['Embarked'], dataset['Survived']);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like, coming from Cherbourg people have more chance to survive. But why? That's weird. Let's compare this feature with other variables." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Embarked\n", "C 270\n", "Q 123\n", "S 904\n", "Name: PassengerId, dtype: int64\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
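<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr><th>Embarked</th><th>Age</th><th>Fare</th><th>Parch</th><th>PassengerId</th><th>Pclass</th><th>SibSp</th><th>Survived</th><th>male</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>C</th><td>31.242296</td><td>62.336267</td><td>0.370370</td><td>690.655556</td><td>1.851852</td><td>0.400000</td><td>0.553571</td><td>0.581481</td></tr>\n",
 "<tr><th>Q</th><td>25.963415</td><td>12.409012</td><td>0.113821</td><td>668.593496</td><td>2.894309</td><td>0.341463</td><td>0.389610</td><td>0.512195</td></tr>\n",
 "<tr><th>S</th><td>28.973175</td><td>26.296450</td><td>0.409292</td><td>645.971239</td><td>2.347345</td><td>0.484513</td><td>0.339117</td><td>0.683628</td></tr>\n",
 "</tbody>\n",
 "</table>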
\n", "
" ], "text/plain": [ " Age Fare Parch PassengerId Pclass SibSp \\\n", "Embarked \n", "C 31.242296 62.336267 0.370370 690.655556 1.851852 0.400000 \n", "Q 25.963415 12.409012 0.113821 668.593496 2.894309 0.341463 \n", "S 28.973175 26.296450 0.409292 645.971239 2.347345 0.484513 \n", "\n", " Survived male \n", "Embarked \n", "C 0.553571 0.581481 \n", "Q 0.389610 0.512195 \n", "S 0.339117 0.683628 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count passengers per port of embarkation\n", "print(dataset.groupby(['Embarked'])['PassengerId'].count())\n", "\n", "# Compare the mean of every other variable per port\n", "dataset.groupby(['Embarked']).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passengers who embarked at C paid higher fares on average and travelled in a better class than those who embarked at Q or S. S has by far the most passengers, yet C has the highest survival rate. \n", "\n", "As we saw earlier, the `Embarked` feature also has some missing values, so we can fill them with its most frequent value, S (904 passengers)." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n" ] } ], "source": [ "# Count missing values\n", "print(dataset[\"Embarked\"].isnull().sum())\n", "\n", "# Fill the missing Embarked values with 'S', the most frequent value\n", "dataset[\"Embarked\"] = dataset[\"Embarked\"].fillna(\"S\")" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Let's visualize it to confirm\n", "sns.heatmap(dataset.isnull(), yticklabels = False, \n", " cbar = False, cmap = 'summer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And there it goes. 
Now there are no missing values in the `Embarked` feature. Let's explore it a little more: we can plot the passenger counts per class at each port of embarkation, split by survival." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Count passengers by Pclass and Embarked, split by Survived\n", "Embarked_Pc = sns.factorplot(\"Pclass\", col=\"Embarked\", data=dataset,\n", " size=5, kind=\"count\", palette=\"muted\", hue = 'Survived')\n", "\n", "Embarked_Pc.despine(left=True)\n", "Embarked_Pc = Embarked_Pc.set_ylabels(\"Count\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, third class is the most common for passengers from Southampton (S) and Queenstown (Q), whereas Cherbourg (C) passengers are mostly in first class. This also hints at the economic conditions of these regions at the time.\n", "\n", "However, we need to map the `Embarked` column to numeric values so that our model can digest it." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "# Create dummy variables; drop_first=True drops the first level (C), leaving Q and S\n", "embarked = pd.get_dummies(dataset['Embarked'], drop_first = True)\n", "dataset = pd.concat([dataset,embarked], axis = 1)\n", "\n", "# From here on we no longer need the Embarked column, so we can drop it\n", "dataset.drop(['Embarked'] , axis = 1 , inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Commitment for Feature Analysis\n", "So far, we've looked at the subpopulations within each feature and filled in the missing values. We've visualized each of them and tried to draw some insight. We could dive deeper, but I'd like to stop here and focus on feature engineering. \n", "\n", "We saw that we have several messy features, such as `Name`, `Ticket` and `Cabin`. 
We could engineer features from each of them and find some meaningful insight, but I'd like to work only on the `Name` variable. `Ticket` is, I think, not very important for the prediction task, and almost 77% of the data is missing in the `Cabin` variable.\n", "\n", "However, let's have a quick look at our dataset." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
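<table border=\"1\" class=\"dataframe\">\n",
 "<thead>\n",
 "<tr><th></th><th>Age</th><th>Cabin</th><th>Fare</th><th>Name</th><th>Parch</th><th>PassengerId</th><th>Pclass</th><th>SibSp</th><th>Survived</th><th>Ticket</th><th>male</th><th>Q</th><th>S</th></tr>\n",
 "</thead>\n",
 "<tbody>\n",
 "<tr><th>0</th><td>22.0</td><td>NaN</td><td>7.2500</td><td>Braund, Mr. Owen Harris</td><td>0</td><td>1</td><td>3</td><td>1</td><td>0.0</td><td>A/5 21171</td><td>1</td><td>0</td><td>1</td></tr>\n",
 "<tr><th>1</th><td>38.0</td><td>C85</td><td>71.2833</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>0</td><td>2</td><td>1</td><td>1</td><td>1.0</td><td>PC 17599</td><td>0</td><td>0</td><td>0</td></tr>\n",
 "<tr><th>2</th><td>26.0</td><td>NaN</td><td>7.9250</td><td>Heikkinen, Miss. Laina</td><td>0</td><td>3</td><td>3</td><td>0</td><td>1.0</td><td>STON/O2. 3101282</td><td>0</td><td>0</td><td>1</td></tr>\n",
 "<tr><th>3</th><td>35.0</td><td>C123</td><td>53.1000</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>0</td><td>4</td><td>1</td><td>1</td><td>1.0</td><td>113803</td><td>0</td><td>0</td><td>1</td></tr>\n",
 "<tr><th>4</th><td>35.0</td><td>NaN</td><td>8.0500</td><td>Allen, Mr. William Henry</td><td>0</td><td>5</td><td>3</td><td>0</td><td>0.0</td><td>373450</td><td>1</td><td>0</td><td>1</td></tr>\n",
 "</tbody>\n",
 "</table>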
\n", "
" ], "text/plain": [ " Age Cabin Fare Name \\\n", "0 22.0 NaN 7.2500 Braund, Mr. Owen Harris \n", "1 38.0 C85 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... \n", "2 26.0 NaN 7.9250 Heikkinen, Miss. Laina \n", "3 35.0 C123 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) \n", "4 35.0 NaN 8.0500 Allen, Mr. William Henry \n", "\n", " Parch PassengerId Pclass SibSp Survived Ticket male Q S \n", "0 0 1 3 1 0.0 A/5 21171 1 0 1 \n", "1 0 2 1 1 1.0 PC 17599 0 0 0 \n", "2 0 3 3 0 1.0 STON/O2. 3101282 0 0 1 \n", "3 0 4 1 1 1.0 113803 0 0 1 \n", "4 0 5 3 0 0.0 373450 1 0 1 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Engineering \n", "\n", "[Feature engineering](https://en.wikipedia.org/wiki/Feature_engineering) is an informal topic, but it is considered essential in applied machine learning. Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. \n", "\n", "Some resources for going deeper -\n", "\n", "- [Data Preprocessing and Feature Exploration](https://www.youtube.com/watch?v=V0u6bxQOUJ8&t=1384s)\n", "- [Makes a Good Feature](https://www.youtube.com/watch?v=N9fDIAflCMY)\n", "- [Feature Engineering](https://www.kdnuggets.com/tag/feature-engineering)\n", "- [Automated Feature Engineering](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219)\n", "\n", "Feature engineering is the art of converting raw data into useful features. There are several feature engineering techniques that you can apply. Some techniques are -\n", "\n", "- Box-Cox transformations\n", "- Polynomial feature generation through non-linear expansions\n", "\n", "But we don't want to go too deep here; we'll simply apply a few feature engineering approaches to extract useful information.\n", "\n", "
\n", "\n", "## Name \n", "\n", "We can assume that people's title influences how they are treated. In our case, we have several titles (like Mr, Mrs, Miss, Master etc ), but only some of them are shared by a significant number of people. Accordingly, it would be interesting if we could group some of the titles and simplify our analysis.\n", "\n", "Let's analyse the 'Name' and see if we can find a sensible way to group them. Then, we test our new groups and, if it works in an acceptable way, we keep it. For now, optimization will not be a goal. The focus is on getting something that can improve our current situation.\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Braund, Mr. Owen Harris\n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th...\n", "2 Heikkinen, Miss. Laina\n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel)\n", "4 Allen, Mr. William Henry\n", "5 Moran, Mr. James\n", "6 McCarthy, Mr. Timothy J\n", "7 Palsson, Master. Gosta Leonard\n", "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)\n", "9 Nasser, Mrs. 
Nicholas (Adele Achem)\n", "Name: Name, dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset['Name'].head(10)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Mr 753\n", "Miss 255\n", "Mrs 197\n", "Master 60\n", "Dr 8\n", "Rev 8\n", "Col 4\n", "Mlle 2\n", "Ms 2\n", "Major 2\n", "Lady 1\n", "Don 1\n", "the Countess 1\n", "Jonkheer 1\n", "Dona 1\n", "Capt 1\n", "Mme 1\n", "Sir 1\n", "Name: Title, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get Title from Name\n", "dataset_title = [i.split(\",\")[1].split(\".\")[0].strip() for i in dataset[\"Name\"]]\n", "\n", "# add dataset_title to the main dataset named 'Title'\n", "dataset[\"Title\"] = pd.Series(dataset_title)\n", "\n", "# count\n", "dataset[\"Title\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAABBoAAAE8CAYAAABuEdLTAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3XmYJVV9N/BvM9OzILi9atBIXNAcFTUgcQFZRgdEjHHcYlzQIFFciOgbjVs0YhISjYqmjbggigtxQ5QZFJeoKO57lBc4xjVqRBFkMw0NdL9/VLX0DNMz3UzVvb18Ps8zz9y6t6bO78y9t6rut05VjUxNTQUAAACgCzsNuwAAAABg6RA0AAAAAJ0RNAAAAACdETQAAAAAnRE0AAAAAJ1ZOewCZlNKWZ3k3kl+keTaIZcDAAAAXGdFklsn+Vqt9aqZLyzYoCFNyHD2sIsAAAAAZnVAks/PfGIhBw2/SJJTTjklu+2227BrAQAAAFoXXHBBnvCEJyTtb/eZFnLQcG2S7LbbbrntbW877FoAAACA67vepQ5cDBIAAADojKABAAAA6IygAQAAAOiMoAEAAADojKABAAAA6IygAQAAAOiMoAEAAADozMq+FlxKOSLJEe3kmiR7JVmX5F+TXJPkE7XWl/fVPgAAADB4vY1oqLWeXGtdV2tdl+QbSY5J8qYkj0+yf5L7llLu1Vf7AAAAwOD1fupEKeWPk+yZ5L1JVtdaf1BrnUry8STr+24fAAAAGJzeTp2Y4cVJXp7kxkkum/H85UnuOID2AQCAZeYL77xw2CXM2/2fdMthlwCd6HVEQynlpknuUmv9TJqQYdcZL++a5JI+2wcAAAAGq+9TJw5M8h9JUmu9LMlEKWWPUspIkkOTnN1z+wAAAMAA9X3qREnywxnTT09ySpIVae468ZWe2wcAAAAGqNegodb6qi2mv5zkfn22CQAAAAxP73edAAAAAJYPQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZQQMAAADQmZV9LryU8qIkD0uyKskJST6b5OQkU0nOSXJ0rXWyzxoAAACAweltREMpZV2S/ZLcP8lBSXZPcnySl9RaD0gykmRDX+0DAAAAg9fnqROHJvlukg8l2ZTkjCT7pBnVkCRnJjm4x/YBAACAAevz1IlbJLldkocmuUOSjUl2qrVOta9fnuQmPbYPAAAADFifQcNFSc6vtU4kqaWUK9OcPjFt1ySX9Ng+AAAAMGB9njrx+SQPLqWMlFJuk+RGST7VXrshSQ5LcnaP7QMAAAAD1tuIhlrrGaWUA5N8NU2gcXSSHyU5sZSyKsl5SU7tq30AAABg8Hq9vWWt9flbefqgPtsEAAAAhqfPUycAAIBlZmxsLOvXr8/Y2NiwSwGGRNAAAAB0Ynx8PBs3bkySbNq0KePj40OuCBgGQQMAANCJiYmJTE01d7OfnJzMxMTEkCsChkHQAAAAAHRG0AAAAAB0RtAAAAAAdEbQAAAAAHRG0AAAAAB0RtAAAAAAdEbQAAAAAHRG0AAAAAB0RtAAAAAAdEbQAAAAAHRG0AAAAAB0RtAAAAAAdEbQAAAAAHRG0AAAAAB0ZuWwCwAAAAbvPR+8sPNlXnnl5ZtNn3bGRVmzZqKz5T/uUbfsbFlAf4xoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAACWnLGxsaxfvz5jY2PDLgUAlh1BAwCwpIyPj2fjxo1Jkk2bNmV8fHzIFQH
A8iJoAACWlImJiUxNTSVJJicnMzExMeSKAGB5ETQAAAAAnRE0AAAAAJ1Z2efCSynfSnJpO/mjJG9O8q9JrknyiVrry/tsHwAAABis3oKGUsqaJKm1rpvx3LeTPCrJD5N8pJRyr1rrN/uqAQAAABisPkc0/FGSnUspn2jbOTbJ6lrrD5KklPLxJOuTCBoAAABgiegzaPjfJK9O8tYkd05yZpJLZrx+eZI79tg+AAAAMGB9Bg3fS/L9WutUku+VUi5NcvMZr++azYMHAAAAYJHr864TRyZ5TZKUUm6TZOckvy2l7FFKGUlyaJKze2wfgJ6NjY1l/fr1GRsbG3YpAAAsEH0GDScluWkp5fNJ3pcmeHhKklOSfDXJt2qtX+mxfYChW8o/xMfHx7Nx48YkyaZNmzI+Pj7kigAYthUrRpOMJElGRkbaaWC56e3UiVrrRJLHb+Wl+/XVJsBCsuUP8ac+9alZu3btkKvqzsTERKamppIkk5OTmZiYWFL9A+bmkR/88rBLmJfTHmVXtE+jo2tyt7sfknPP+UTuuuchGR1dM+ySgCHo8xoNAMuaH+IwNw899ZROlzd15ZWbTT9+06kZWdPtj50zHv2ETpcHS8l+Bx6Z/Q48cthlAEPU56kTAAAAwDJjRAMAQI8eduoZwy5h3jY++qHDLgGARcyIBgAAAKAzggYAAACgM4IGAAAAoDOCBgAAAKAzggYAAACgM4IGAAAAoDOCBgAAAKAzggYAAACgM4IGAAAAoDOCBgAAAKAzggYAYGlZseK6xyMjm08DAL0TNAAAS8rI6GhW3P2uSZIVe94lI6OjQ64IAJaXlcMuAACga6sO2Dc5YN9hlwEAy5IRDQAAAEBnBA0AAABAZwQNAAAAQGcEDQAAAEBnBA0AAABAZwQNAAAAQGfc3hJgmXjyhx7c6fKuHZ/abPpZH3lMVqwd6Wz5b3/ExzpbFgAAg2NEAwAAANAZQQMAAADQGUEDAAAA0BlBAwAAANAZF4MEaL3mPYd2uryJKze/WOIJp/1ZVq3p7mKJSfLcx3280+UBAMCOMqIBAAAA6EyvIxpKKbdK8o0khyS5JsnJSaaSnJPk6FrrZJ/tAwAAAIPV24iGUspokjcnGW+fOj7JS2qtByQZSbKhr7YBAACA4ejz1IlXJ3lTkv9pp/dJ8tn28ZlJDu6xbQAAAGAIejl1opRyRJILa60fL6W8qH16pNY6fWW0y5PcpI+2gf686+RuL5Y4CE88wsUSAQBgkPq6RsORSaZKKQcn2SvJO5Pcasbruya5pKe2AQAAgCHp5dSJWuuBtdaDaq3rknw7yZOSnFlKWdfOcliSs/toGwAAABieOQcNpZSb7WBbz03y8lLKl5KsSnLqDi4PAAAAWGC2e+pEKWWvJO9NsnMpZd80F3R8TK31m3NpoB3VMO2gG1IkAAAALBW/ev1/DLuEebnVs+Z3L4e5jGgYS/KIJBfVWn+e5Blp7iYBAAAAsJm5BA0711rPm56otX4yyer+SgIAAAAWq7ncdeLiUsofJZlKklLKE5Jc3GtVdG5sbCynn356NmzYkGOOOWbY5QAAAMzql6/9zrBLmJff+7/3HHYJC8pcRjQ8I8kbkuxZSrkkyXOSPL3XqujU+Ph4Nm7cmCTZtGlTxsfHh1wRAAAAS9V2RzTUWn+QZP9Syo2SrKi1XtZ/WXRpYmIiU1NTSZLJyclMTExk7dq1Q64KAACApWgud534TNrTJtrpqSTjSc5L8k+11t/0Vx4AAACwmMzlGg3nJrk6ydva6ccnuW2S/0lyUpJH9lMaAAAAsNjMJWi4X611nxnT3ymlfK3Wengp5Ul9FQYAAAAsPnO5GORoKWXP6YlSyt2TrCilrE2yqrfKAAAAgEVnLiMajklyZinll2mCiZslOTzJsUne2V9pAAAAwGIzl7tOnFVKuWOSvZMcluTQJJ+ote7Sd3HL1QUnvKzT5V1+1dWbTf/qba/M+OrRzpa/2zNf3tmyAAAAWNz
mcteJOyQ5KsmRSW6a5LgkG3quCwAAAFiEZg0aSimPSPK0JPsk+VCa0yVOrLX+/YBqA2ABG1kxc2KLaQAAlq1tjWj4YJL3J9m31vr9JCmlTA6kKgAWvJ1WjWSXe+yUK747mV3uvlN2WjUy7JIAAFgAthU03DPJk5N8vpTy4yTv2c78ACwzNz9oRW5+kKEMAABcZ9bbW9Zaz6m1PjfJbZO8IskDkvxeKeUjpZSHDKpAAAAAYPGYy10nrkny4SQfLqXcMsmTkvxzko/2XBsAAACwyMzrVIha64VJXtP+YZEYXbFTRpJMJRlpp4H+7bTFxRJ3coYBAADLgF+cy8CalSty8B1vnSQ5+I63zpqVfu3AIKwcHcnt7tZcIPF2dx3JylEXSwQAYOlzccdl4oi998gRe+8x7DJg2dlzvxXZc79hVwEAAINjRAMAAADQGSMaAAAAFpkfv+6CYZcwb7d/zm7DLoEBMaIBAAAA6IygAQAAAOiMoKE1NjaW9evXZ2xsbNilAAAAwKIlaEgyPj6ejRs3Jkk2bdqU8fHxIVcEAAAAi5OgIcnExESmpqaSJJOTk5mYmBhyRQAAALA4CRoAAACAzizK21te+MZ3d7q8y6+6crPpi97+gUysXtNpG7d8xuGdLg8AAAAWot6ChlLKiiQnJilJrk3y5CQjSU5OMpXknCRH11on+6oBAAAAGKw+T5340ySptd4/yd8lOb7985Ja6wFpQocNPbYPAAAADFhvQUOt9cNJjmonb5fkl0n2SfLZ9rkzkxzcV/vzMbpiRUbaxyMZyeiKFUOtBwAAABarXi8GWWu9ppTyjiSvT3JqkpFa61T78uVJbtJn+3O1ZuVoDtnjLkmSQ/YoWbNydMgVAQAAwOLU+8Uga61/UUp5QZKvJFk746Vdk1zSd/tzdeTe++bIvfcddhkAAACwqPU2oqGU8sRSyovayf9NMpnk66WUde1zhyU5u6/2AQAAgMHrc0TDaUneXkr5XJLRJM9Jcl6SE0spq9rHp/bYPgAAADBgvQUNtdbfJnnMVl46qK82AQAAgOHq9WKQAAAAwPIiaACAZWpsbCzr16/P2NjYsEsBAJYQQQMALEPj4+PZuHFjkmTTpk0ZHx8fckUAwFIhaACAZWhiYiJTU1NJksnJyUxMTAy5IgBgqRA0AAAAAJ0RNAAAAACdETQAAAAAnRE0AAAAAJ0RNAAAAACdETQAAAAAnVk57AIAgG37k9NO6HyZU1detdn04z7ytoysWd3Z8j/yyGd2tiwAYHExogEAAADojKABAAAA6IygAQAAAOiMoAEAAADojKABAAAA6IygAQAAAOiM21uyJIyNjeX000/Phg0bcswxxwy7HGAIHvLhlw67hHn56MP/YbgF7LTiuscjW0wDv3PMh3467BLmZewRuw+7BAAjGlj8xsfHs3HjxiTJpk2bMj4+PuSKABa+kVUrs+Ied0ySrLj7HTOyyrEHAKAb9ipY9CYmJjI1NZUkmZyczMTERNauXTvkqgAWvtED987ogXsPuwwAYIkxogEAAADojKABAAAA6IygAQAAAOiMazQwcN9548M6Xd5vr5rabPrctx+eG60e6Wz593zGxs6WBQAAsNQZ0QAAAAB0RtAAAAAAdEbQAAAAAHRG0AAAAAB0RtAAAAAAdKaXu06UUkaTvC3J7ZOsTvKPSc5NcnKSqSTnJDm61jrZR/ssLytXJCNpPlgj7TQAAADD0deIhsOTXFRrPSDJYUn+LcnxSV7SPjeSZENPbbPMrF45kvvv0WRm999jZVav7O7WlgAAAMxPLyMaknwgyakzpq9Jsk+Sz7bTZyZ5UJIP9dQ+y8yj9l6dR+29ethlAAAALHu9BA211iuSpJSya5rA4SVJXl1rnWpnuTzJTfpoGwAAABie3i4GWUrZPclnkryr1vrvSWZej2HXJJf01TYAAAAwHL0EDaWU30vyiSQvqLW+rX36W6WUde3jw5Kc3UfbAAAAwPD0dY2GFye5WZK
XllJe2j737CRjpZRVSc7L5tdwAAAAAJaAvq7R8Ow0wcKWDuqjPQAAAGBh6O0aDQAAAMDyI2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOiNoAAAAADojaAAAAAA6I2gAAAAAOrOyz4WXUu6b5JW11nWllDslOTnJVJJzkhxda53ss30AAABgsHob0VBKeX6StyZZ0z51fJKX1FoPSDKSZENfbQMAAADD0eepEz9I8sgZ0/sk+Wz7+MwkB/fYNgAAADAEvQUNtdYPJrl6xlMjtdap9vHlSW7SV9sAAADAcAzyYpAzr8ewa5JLBtg2LGpjY2NZv359xsbGhl0KAADANg0yaPhWKWVd+/iwJGcPsG1YtMbHx7Nx48YkyaZNmzI+Pj7kigAAAGbX610ntvDcJCeWUlYlOS/JqQNsGxatiYmJTE01Zx1NTk5mYmIia9euHXJVAAAAW9dr0FBr/XGS+7WPv5fkoD7bAwAAAIZrkKdOAAAAAEucoAEAAADojKABAAAA6IygAQAAAOjMIO86AUvex096SOfLHL9qarPpz5zy2KxdPdLZ8g/9y492tiwAAAAjGgAAAIDOCBoAAACAzggaAAAAgM4IGgAAAIDOCBoAAACAzggaYIFbMeNbOjKy+TQAAMBC4ycLLHCrRkey952br+ped9opq0a7u7UlAABA11YOuwBg+w6592gOufewqwAAANg+IxoAAACAzggaAAAAgM4IGgAAAIDOCBoAAACAzggaAAAAgM4IGgAAAIDOCBoAAACAzggaAAAAgM4IGgAAAIDOCBoAAACAzggaAAAAgM4IGgAAAIDOCBoAAACAzggaAAAAgM4IGgAAAIDOrBxkY6WUnZKckOSPklyV5Cm11u8PsgYAAACgP4Me0fDwJGtqrfsmeWGS1wy4fQAAAKBHgw4a9k/ysSSptX45yR8PuH0AAACgRwM9dSLJjZNcOmP62lLKylrrNVuZd0WSXHDBBdd74eJLL+mnuh5d9bOfzXneCy+7osdKunfNPPqWJL+69OqeKunHz+bRv4sWWd+S+fXvkkuWdv8u/83S7t+VFy+u/s2nb0ly9cW/7amSfsynf1f/5rIeK+nH/Pq3+Lbr8+vfxT1W0o/59G/iNxf2WEn35rtuGb/4+vuiC9nPfjYy53l/syg/m1fNed5fX7K0+3fBZYvru5ckK3+2tZ99W/fry37VYyXdu3qe65aLLruop0r6MbGV/s34rb5iy9dGpqamei7pOqWU45N8udb6/nb6Z7XW284y7/5Jzh5YcQAAAMB8HVBr/fzMJwY9ouELSf40yftLKfdL8t1tzPu1JAck+UWSawdQGwAAADA3K5LcOs1v980MekTD9F0n7plkJMmTa63nD6wAAAAAoFcDDRoAAACApW3Qd50AAAAAljBBAwAAANAZQQMAAADQmUHfdWLBKKWsS/KZJI+ttb5vxvPfSfLNWusRQyrtBluKfUq2368kN661PnJI5c1Ll+9RKeWvaq3/1nmRPWj7/f4k56a5EOxoktdN3+p2sduif1NJbpzkh0meUGudGGJpvVgK7+dSXV9uy1Lp8w3pRynlwUn+oNb6lkHVuSPaPj691vrYef67C2qtu/VTVb9KKS9McnCSyTTr0RcneWKS42ut/z3M2m6IOey7PLjWulsp5aw07/WivTh6KWXPJP+SZOckuyT5aJJja63XuxBcKeXkJO+ttX5soEXOw3z6s41lHJXk7bXWqzuoZ02Sw2utby2lHJvkglrrm27Aco5Kcnia79hokr+ttZ61o/W1y75HkpvVWj/XxfK
209ZZmcd3Zmvr0xu6jh2mxbjvtdxHNJyf5HHTE+2X5EbDK6cTS7FPyTb6tVhChhm6eo9e0llFg/HpWuu6WutBSR6U5AWllL2GXVSHpvv3gFrrPkmuTvKwYRfVo6Xwfi7V9eW2LJU+z6sftdaPLZaQYTkqpdwtzfrykFrrg5K8IMnbaq3PWYwhwwxL5fs2q1LKTZO8N8lzaq0PSHK/JPdI8rShFnYDddifF6e57V8XdkvylB1ZQCnlsUkOSbK+1rouTeDwrlLKLXa8vCTJo5LcraNlMbtFte+1bEc0tP4zyR+WUm5aa70kzZfulCR/UEr5SZoNxHm11ucMs8h5mlOfknwuzYb86iQ/TvKkWuvkcEqek23164L2qMAzk/xFmqT287XWvymlPDILr5/b6stfJXlkmpTy0vbx7ZOcnKYP1yR5UpIjkty8lHJCkmcneVOSO6cJD19Saz2rlHJOku8luarW+rsdnYWg1npFKeXNSR5dSnlikv3bl/691vqv7RGPq9L0/dZJjqi1fnMoxd4ApZRVaer+TSnln5McmOa9OT7NEa6zk9yt1jpVSnlDkv+otX5oaAXvoBnv57+VUkaTTCR5S631XUMubXu29V38fpIvpvlefTrJTZLcJ0mttT6xlLJ7krckWZPkyiRH1Vp/OoxOzNO2+nxykj3S9OnVM4/CLkDzXY8+Psldaq0vLKU8N8lj06xPP1drfUF7lHC/NEcv/7LWet7Ae7QdpZRHJzk6zZGsJHl0kt+k+RzumeQHSVa3txL/XpL71FovLqU8I8kutdZXDaHsufpVkj9IcmQp5WO11m+XUu4zfeQyzfu1oN+fWcz6Od1yxlLKTZKclOT/tE8dU2v97sAqveE2pPnx819JUmu9tpTypCQTpZTXZIvt+7CKnIfZ+nN1KeWtSXZP8x6dWWt9abveHGmf3yXNPtr904QD703y8A5q+tskdyul/N10jaWUP2vreGmtdVM7/ddJrk2zD/zCLZbxtCR/PT3Cotb6o1LKXrXWi0opt0/z2RtNM5romFrrf84cIVVKeW+afc3bJ3lImtEeeyR5ZZJPptkvnSilfDPJ2iTHtbX8oG37Drn+vuxEkvel2T8aTTPCYK6f+ZuWUs5IM4J0ZZp930+3I4Y+m+SebV82TP+DUsrOSU5L8q4kP09y51LKmUlulWRTrfXYNgwcS/OeXpTkyFrrpVvuy9VaP9Cuny5McrMkh9Zar51j7Z24ofvSW9tG9jX6drmPaEiaD9wjSikjaXYiv9g+v3uSxy+ykGHaXPr0uCSvrbXun+QTab6oC91s/Zr25CTPrrXum+SHpZSVWbj93Fpfdkqz0Ti41npAmhXAvdMk0N9IM6T0uDRD045LcnGt9ZlpUu5f11oPTLNCfUPbxi5J/mGhhQwz/DLNzuMd0hwx2D/J49uVfJL8pNZ6aJLXJzlqOCXOywNLKWeVUs5NMyz2Q0lWJblDrfX+SR6QZmfhmiTfSXJAKWV1knVJNg2n5E79MsktkqyptR6wCEKGabOtV26fZtTQgUmOSXJCkvsm2b894vXqJGPtEa9XJ3nFgOveEVvr865pPqOPTHJYujsS16f5rEeT/O6I8mPS/GjdL82O5kPbl8+rte63gH/E/mGSP2mPRtYkh6Z5r9bUWu+X5EVJdm7D9FPSrF+T5vSDdw6+3Lmrtf46zYiG+yf5Uinl/CQP3WK2hf7+zGZ7+y7TXpzkU+065agkbxxQfTvqNmlOFfydWusVaY62zrZ9X8hm68+tk3y53S/ZP8kzZszyg1rrA5Mcm+Rfaq0nJbkg130Hd9RxSc6ttf59O/3zWuv6JM9J8oxSys2TvDzNaIX9k/x+KeWQOfTrovbh9PbswDQHr07aTj03qbU+NM139oW11p+nCRGOT/K1JCem+fF6UJof9EdkK/uyab4Pl6ZZjx2T+e2jPy/JJ9ua/yzJSW3IeuMk75nR9mHt/Luk2dc6odZ6SvvcmjRB0AFJ/qp97sQ
kR7fr2Y8meX4p5bBssS/X7gckzY/6gwcdMswwr33p9v9o1m1k15b7iIYk+fc0K/MfpjnKOO3XM76Ai81c+vTXSV7UHuk4L8mHB1viDTJbv6Y9OcnzSimvTPKlNGnkQu3n1voymSbdfU8p5Yokt02zAjgpzaiMj6VZIb94i2XdI82P1vu20ytLKdNHRGpvPdhxt0vyjiRX1Oa8x6tLKV/OdUPvvtX+/dM0O58L3adrrY9t/+8/meRHad6bfdrUO2nez9ul2ZD9RZojHhtrrdcMod6u3S7Ju9McRVhMZluvXDQ9ZLuU8tta67nt40vT7JzcI8mLSykvSLOuWUzX4thany9Ps6P1ljQ7au8eTmnzMp/16LS7pPmxcHWSlFLOTjMaIFnY68ukOer/jrZfd0mzndszyVeTpNb636WU6VE1JyV5Xynlc2nO5/7lMAqeq1LKnZJcVms9sp3+4zQ7+RfMmG2hvz+z2d6+y7R7pAms/7ydvlnfhXXkJ0nuNfOJUsodkuyT5OxZtu8L2Wz92T3JvUspD0hyWZLVM2b5dPv3F5O8dgA1fqP9+4I0IwvulOSWST5aSkma4PiOW/ybn6Tpw6XTT5RSHpTmwMdd04x0TjuaaPettDky4/G3279/mmZ7ONMt04Qy729rWZvmQN9xuf6+7JlpRg2enmakwz/O1uFSyi5pRuhOX/PiRjNq/nkp5bK27WTz/cfp+g5K8t1s/r6dU2u9ql3+9H7YXZOc0NY+mmZ02Gz7csnw10vz2peutU6WUra1jezUsh/RUGv9YZoP6zHZfMdq2MPrb7A59umoNBe2OSjNyuMRg6vwhtlGv6Y9Nc2wq4OS7J3maNWC7OcsfblxkofXWv88ybPSfD9H0oxSOLtNrz+QZkWdXLfSPz9NersuTXL7gTTDaZMF+jkupeya5v26LO1Qr3bI/X5J/qudbc4XXVpI2jDv8CRvTZM0f6Z9bx6Y5iI+P0zyqTSf0SOz/SMHC96M9/PCLNDP3Gy2sV7Z3ufv/CQvaN/bpyU5tZcCezBLn2+dZJ9a6yOS/EmSf2lHhS1Y81yPTjs/yX1LKSvbI8wHptmRTBbwZ7cdVv/yNEeunpJkPE2/zk+ybzvPbZIcuq/hAAAE1ElEQVT8ftKEDkkuSTOKajGsY+6Z5I2luehd0rwnl6YZej1twb4/2zKHfZdp56cZgbkuzaibU7Yx70JyRpIHl1L2SH63LT8+zX7IbNv3hWy2/uyV5JJa6xOSvCbJzu06JGlClaQ5KPL/2seT6e531pbL2nL79KM0PyQPaT8/r0/ylS3meVuSl06v10spf5hm3TCZ5kDcAe3ze+W6gG+0lLJLaU4H3XPGsra2fZyu8ddJfpZkQ1vLcWlOGd3avuy6JL+ozXVZ/jHJP23j/+AdaUYU7pTmVIdfzKj599MEc9MHU7dW30fS/AY4rl1XzjZfTXOa9bokz2//3fnZ+r7cdL+H4obsS5dS7pltbyM7teyDhtb7kuxea/3edudcPLbXp68m+WQp5dNpjqqeMbDKdsy2+vXdJF9r+/SrNCvZhdzPLftyTZLfllK+nuaI+C/SDHX7epoV49lpzlV9fTv/uaWUdyd5c5K7lFI+myZN/0kd/nUotmb61IJPpXkfXtaer/mjUsqXknw5yal1EV2LYTbt0e+xNEN/r2jfu28kmaq1Xt6mzqcmWVVr/f4QS90R13s/M/xk/4a6IduA5yV5Wfu9e2eao0KLyZZ9viDJbqWUb6VZ/7x6kYy0met6NEnSnv/7/iRfSLN9+HEWzki3LT2olPL1ti+fTrNN+2aao+LjSW5Taz09yU9LKV9J8ro0O/nTTkyzI75gr+4/rdZ6WpKzknyllPKFJB9P8jeZcfR1kZvLOua4JI9pj5p+LMk5gyhsR9VaL0szQu/EtvYvp7k2xeuzCLfv2+jPp5I8pJTyxTQjVP4r161bDmv3M5+f5Lntc2enGWHQxY+4XyVZ1Y7Y3VrNF6YJQz7brgs
Oy3UB6vQ872378vl2pNPb09zJ4ldptmfPap9/Y5K/bP/Z69p/c2qaERHb8o00o+IOSnP6xUfa/6tnpvksb21f9j+TPLX9jLwqyT9vY/mvaef5XJrTNJ6XZj/kc2nW4Udtb5vVjux6Wdv32d6XZyR5Z1vnK9Js2zdlK/ty22qrRzu6L/39bGMb2bWRqalFedAQAGBOSilPTfND7++2O/MSUUp5TJK7L6c+w6CVRXC7ThiWBT0sEgBgR5RSHpLmCNvTh13LoJRS/inNaIYN25sXAPpgRAMAAADQGddoAAAAADojaAAAAAA6I2gAAAAAOuNikADADimlvCHNPeRXJblTknPbl96c5lZgbyqlvD3JsbXWn5RSfpxkXa31x0MoFwDomaABANghtdajk6SUcvskZ9Va99rKbA9I8vJB1gUADIegAQDoRSnl2PbhlUluk+SjpZQDZry+IsmrkqxLsiLJybXW1w64TACgY67RAAD0qtb6iiT/k+QhtdaLZrz01Pb1eyW5T5INM4MIAGBxMqIBABiWg5PsVUp5YDu9S5J7JDl7eCUBADtK0AAADMuKJM+vtZ6WJKWUWyS5YrglAQA7yqkTAMAgXJPrH+D4dJKnllJGSym7JPl8kvsNvDIAoFNGNAAAg3BGmotBHjrjuTcluXOSb6XZJ3l7rfWsIdQGAHRoZGpqatg1AAAAAEuEUycAAACAzggaAAAAgM4IGgAAAIDOCBoAAACAzggaAAAAgM4IGgAAAIDOCBoAAACAzggaAAAAgM78fyaLZLmFqwzoAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot bar plot (titles and Age)\n", "plt.figure(figsize=(18,5))\n", "sns.barplot(x=dataset['Title'], y = dataset['Age'])" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title\n", "Capt 70.000000\n", "Col 54.000000\n", "Don 40.000000\n", "Dona 39.000000\n", "Dr 42.750000\n", "Jonkheer 38.000000\n", "Lady 48.000000\n", "Major 48.500000\n", "Master 7.643000\n", "Miss 22.261137\n", "Mlle 24.000000\n", "Mme 24.000000\n", "Mr 30.926295\n", "Mrs 35.898477\n", "Ms 26.000000\n", "Rev 41.250000\n", "Sir 49.000000\n", "the Countess 33.000000\n", "Name: Age, dtype: float64\n" ] } ], "source": [ "# Means per title\n", "print(dataset.groupby('Title')['Age'].mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is 18 titles in the dataset and most of them are very uncommon so we like to group them in 4 categories." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# Convert to categorical values Title \n", "dataset[\"Title\"] = dataset[\"Title\"].replace(['Lady', 'the Countess',\n", " 'Capt', 'Col','Don', 'Dr', \n", " 'Major', 'Rev', 'Sir', 'Jonkheer',\n", " 'Dona'], 'Rare')\n", "\n", "dataset[\"Title\"] = dataset[\"Title\"].map({\"Master\":0, \"Miss\":1, \"Ms\" : 1 ,\n", " \"Mme\":1, \"Mlle\":1, \"Mrs\":1, \"Mr\":2, \n", " \"Rare\":3})\n", "\n", "dataset[\"Title\"] = dataset[\"Title\"].astype(int)\n", "\n", "# Drop Name variable\n", "dataset.drop(labels = [\"Name\"], axis = 1, inplace = True)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEFCAYAAADuT+DpAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAE0NJREFUeJzt3X2QXXV9x/H3bkKgSoKK2OjwpFi/zlgHDGgCJCRgMA1UgzitCBaJgw4aLbROsWgooZVOWxUHNBQHMNEarRIfWtFgtDx0Ex61WGHALwIC4wMtAUMClITNbv84Z83NspvckD13d/N7v2Yyc87vnHvz3XMfPvd3Hn6nq7+/H0lSubpHuwBJ0ugyCCSpcAaBJBXOIJCkwk0c7QJ2VkTsCbwR+A2wZZTLkaTxYgLwcuD2zNzUumDcBQFVCPSMdhGSNE7NAta0NozHIPgNwIoVK5g6depo1yJJ48IjjzzCaaedBvV3aKvxGARbAKZOncr+++8/2rVI0njznF3qHiyWpMIZBJJUOINAkgpnEEhS4QwCSSqcQSBJhTMIJKlwBoEkFW48XlAmjWvfO33haJcwZpzwpWWjXYKwRyBJxTMIJKlwBoEkFc4gkKTCGQSSVDiDQJIKZxBIUuEMAkkqnEEgSYUzCCSpcAaBJBWusbGGIuIM4Ix6di/gMGAOcAnQC6zOzAsjohu4DDgU2AScmZn3NVWXJGlbjQVBZi4HlgNExFLgC8DlwDuAB4DvRsQ04GBgr8w8MiJmAJ8GFjRVlyRpW43vGoqII4DXAf8K7JmZ92dmP/B94M3ATOBagMy8BTii6ZokSVt14hjBx4ALgSnAhpb2jcA+dfsTLe1bIsLhsSWpQxoNgoh4EfDazLyeKgQmtyyeDKwfor07M3ubrEuStFXTPYJjgB8CZOYGYHNEHBIRXcA8oAdYC5wAUB8juLPhmiRJLZreBRNUB4YHnAWsACZQnTV0a0TcDhwfETcBXYC3b5KkDmo0CDLzk4PmbwFmDGrrowoISdIo8IIySSqcQSBJhTMIJKlwBoEkFc4gkKTCGQSSVDiDQJIKZxBIUuEMAkkqnEEgSYUzCCSpcAaBJBXOIJCkwhkEklQ4g0CSCmcQSFLhDAJJKpxBIEmFMwgkqXCN3rM4Is4D3gZMAi4DbgSWA/3AXcCizOyLiAuAE4Fe4JzMvK3JuiRJWzXWI4iIOcBRwNHAbOAA4GJgcWbOArqABRExrV4+HTgFWN
pUTZKk52py19A84E7gW8B3gGuAw6l6BQCrgLnATGB1ZvZn5sPAxIjYr8G6JEktmtw19FLgIOCPgVcC/w50Z2Z/vXwjsA8wBXis5XED7Y82WJskqdZkEDwG/CwzNwMZEc9Q7R4aMBlYD2yopwe3S5I6oMldQ2uAP4qIroh4BfBC4D/qYwcA84EeYC0wLyK6I+JAql7DugbrkiS1aKxHkJnXRMQxwG1UgbMI+AVwRURMAu4BVmbmlojoAW5uWU+S1CGNnj6amecO0Tx7iPWWAEuarEWSNDQvKJOkwhkEklQ4g0CSCmcQSFLhDAJJKpxBIEmFMwgkqXAGgSQVziCQpMIZBJJUOINAkgpnEEhS4QwCSSqcQSBJhTMIJKlwBoEkFc4gkKTCGQSSVDiDQJIKZxBIUuEavXl9RNwBPFHP/gL4PHAJ0AuszswLI6IbuAw4FNgEnJmZ9zVZlyRpq8aCICL2AsjMOS1tPwHeATwAfDcipgEHA3tl5pERMQP4NLCgqbokSdtqskdwKPCCiFhd/z9LgD0z836AiPg+8Gbg5cC1AJl5S0Qc0WBNkqRBmjxG8DTwKWAecBawrG4bsBHYB5jC1t1HAFsiotFdVpKkrZr8wr0XuC8z+4F7I+IJ4CUtyycD64EX1NMDujOzt8G6JEktmuwRvJdqfz8R8QqqL/ynIuKQiOii6in0AGuBE+r1ZgB3NliTJGmQJnsEVwHLI2IN0E8VDH3ACmAC1VlDt0bE7cDxEXET0AUsbLAmSdIgjQVBZm4GTh1i0YxB6/VRHUOQJI0CLyiTpMIZBJJUOINAkgpnEEhS4QwCSSqcQSBJhXMoB+3QGcvOHu0SxozlCy8Z7RKkEWePQJIKZxBIUuEMAkkqnEEgSYUzCCSpcG0FQUR8doi2L458OZKkTtvu6aMRcSXwKuCIiHhdy6I9qO4uJkka53Z0HcEnqG4ufwlwYUt7L3BPQzVJkjpou0GQmQ8CDwKHRsQUql5AV714b+DxJouTJDWvrSuLI+I84DzgsZbmfqrdRpKkcazdISbOBA7JzEebLEaS1Hntnj76MO4GkqTdUrs9gp8DayLieuCZgcbM/NvtPSgiXgb8GDie6gDzcqpdSncBizKzLyIuAE6sl5+Tmbft7B8hSXr+2u0R/Aq4FthEdbB44N+wImIP4PPA/9VNFwOLM3NW/dgFETENmA1MB04Blu7sHyBJ2jVt9Qgy88Idr/UcnwIupzrIDHA4cGM9vQp4C5DA6szsBx6OiIkRsZ/HIiSpc9o9a6iPapdOq19n5gHDrH8G8Ghmfr8+4wigq/7CB9hIdSrqFLY9E2mg3SCQpA5pt0fwu11I9S6fk4Ajt/OQ9wL9ETEXOAz4EvCyluWTgfXAhnp6cLskqUN2etC5zHw2M68GjtvOOsdk5uzMnAP8BDgdWBURc+pV5gM9wFpgXkR0R8SBQHdmrtvZmiRJz1+7u4ZOb5ntAl4HPLuT/9dHgCsiYhLV8BQrM3NLRPQAN1OF0qKdfE5J0i5q9/TRY1um+4F1wDvbeWDdKxgwe4jlS4AlbdYhSRph7R4jWFgfG4j6MXdlZm+jlUmSOqLd+xEcTnVR2ReBZVSnek5vsjBJUme0u2voUuCdmXkrQETMAD4LvKmpwiRJndHuWUN7D4QAQGbeAuzVTEmSpE5qNwgej4gFAzMRcRLbXggmSRqn2t019H7gmoi4iur00X7gqMaqkiR1TLs9gvnA08BBVKeSPgrMaagmSVIHtRsE7weOzsynMvOnVAPIfbi5siRJndJuEOwBbG6Z38xzB6GTJI1D7R4j+DZwXUR8nSoA3gH8W2NVSZI6pq0eQWZ+lOpaggAOAS7NzPObLEyS1Bnt9gjIzJXAygZrkSSNgp0ehlqStHsxCCSpcAaBJBXOIJCkwhkEklQ4g0CSCmcQSFLhDAJJKlzbF5TtrIiYAFxBdTXyFmAh1RDWy6mGqbgLWJSZfRFxAXAi0Auck5m3NVWXJGlbTfYI3g
qQmUcDfwNcXP9bnJmzqEJhQURMA2YD04FTgKUN1iRJGqSxIMjMb1MNXw3VfQz+h2r46hvrtlXAXGAmsDoz+zPzYWBiROzXVF2SpG01eowgM3sj4otUN7pfCXRl5sDw1RuBfYApwBMtDxtolyR1QOMHizPzPcBrqI4X/F7LosnAemBDPT24XZLUAY0FQUT8WUScV88+DfQBP4qIOXXbfKAHWAvMi4juiDgQ6M7MdU3VJUnaVmNnDQHfBJZFxH9S3eHsHOAe4IqImFRPr8zMLRHRA9xMFUyLGqxJkjRIY0GQmU8BfzrEotlDrLsEWNJULZKk4XlBmSQVziCQpMIZBJJUOINAkgpnEEhS4QwCSSqcQSBJhTMIJKlwBoEkFc4gkKTCGQSSVDiDQJIKZxBIUuEMAkkqnEEgSYUzCCSpcAaBJBXOIJCkwhkEklS4Ru5ZHBF7AF8ADgb2BD4B3A0sB/qBu4BFmdkXERcAJwK9wDmZeVsTNUmShtZUj+DdwGOZOQuYD3wOuBhYXLd1AQsiYhrVzeynA6cASxuqR5I0jKaC4Grg/Jb5XuBw4MZ6fhUwF5gJrM7M/sx8GJgYEfs1VJMkaQiNBEFmPpmZGyNiMrASWAx0ZWZ/vcpGYB9gCvBEy0MH2iVJHdLYweKIOAC4HviXzPwK0NeyeDKwHthQTw9ulyR1SCNBEBG/D6wGPpqZX6ib74iIOfX0fKAHWAvMi4juiDgQ6M7MdU3UJEkaWiNnDQEfA14MnB8RA8cKzgYujYhJwD3AyszcEhE9wM1UobSooXokScNoJAgy82yqL/7BZg+x7hJgSRN1SJJ2zAvKJKlwBoEkFc4gkKTCGQSSVDiDQJIKZxBIUuEMAkkqnEEgSYUzCCSpcAaBJBXOIJCkwhkEklQ4g0CSCmcQSFLhDAJJKpxBIEmFMwgkqXAGgSQVziCQpMIZBJJUuEZuXj8gIqYD/5iZcyLi1cByoB+4C1iUmX0RcQFwItALnJOZtzVZkyRpW431CCLiXOBKYK+66WJgcWbOArqABRExDZgNTAdOAZY2VY8kaWhN7hq6Hzi5Zf5w4MZ6ehUwF5gJrM7M/sx8GJgYEfs1WJMkaZDGgiAzvwE829LUlZn99fRGYB9gCvBEyzoD7ZKkDunkweK+lunJwHpgQz09uF2S1CGdDII7ImJOPT0f6AHWAvMiojsiDgS6M3NdB2uSpOI1etbQIB8BroiIScA9wMrM3BIRPcDNVKG0qIP1SJJoOAgy80FgRj19L9UZQoPXWQIsabIOSdLwvKBMkgpnEEhS4QwCSSqcQSBJhTMIJKlwnTx9tGNOPXfFaJcwZnzln04b7RIkjXH2CCSpcAaBJBXOIJCkwhkEklQ4g0CSCmcQSFLhDAJJKpxBIEmF2y0vKJNUhr//+NWjXcKY8bGL/uR5P9YegSQVziCQpMIZBJJUOINAkgo3Jg4WR0Q3cBlwKLAJODMz7xvdqiSpDGOlR3ASsFdmHgn8NfDpUa5HkooxJnoEwEzgWoDMvCUijtjOuhMAHnnkkWFX2PT0+hEtbjz75S9/ucvP8cz6p0egkt3DSGzPxzc9MwKV7B52dXs++dRvR6iS8W9H27LlO3PC4GVd/f39DZS0cyLiSuAbmbmqnn8YeFVm9g6x7kygp8MlStLuYlZmrmltGCs9gg3A5Jb57qFCoHY7MAv4DbCl6cIkaTcxAXg51XfoNsZKEKwF3gp8PSJmAHcOt2JmbgLWDLdckjSs+4dqHCtB8C3g+Ii4CegCFo5yPZJUjDFxjECSNHrGyumjkqRRYhBIUuEMAkkq3Fg5WDxmRcQc4HrglMz8Wkv7T4H/yswzduK5PpSZnxvxIseoHW07YEpmnrwLz98PXJ6ZH2hpuxR4W2Ye/Hyfd3czku/h0tTb7uvA3UA/MAV4ADgtMzePYmkjyh5Be34GvGtgJiJeD7zweTzP4hGraPwYdtvtSgjUHgNmR8TE+rknANu7Kr1kI/UeLtF1mTknM4
/NzMOBZ4G3jXZRI8keQXv+G3hNRLwoM9cD7wZWAAdGxIeAk4E9gCfq6YOB5VRvmF7gdOAM4CURcRlwNnA58AdUYbw4M2+IiLuAe4FNmfm7D+04t71t90hmTo2IDwLvAfqANZn5VxFxMvBRqm34IHB6ZvYNeu5e4AbgeGAV8Bbgh1Tbm4i4AXgUeDGwCFhGy2uSmb9q6o8eg7b3OjxEFRT3ZOY5o1nkWBcRk6guyvptPSLCAcC+wKrMPD8iltfz+wInAucCx1B9zi/OzDF5SzV7BO37JvD2iOgC3gTcRLX99gXmZuYsqjB4I9UX04+BucBFwIsz8yLg8cz8IHAmsC4zjwEWAEvr/2Nv4O92oxAYMNS2a7UQOLsedPCB+hf+u4DPZOZMYDVVl3woXwFOqadPpfpy22Z5Zs6lei22eU127U8al4Z7HQ4ATjUEhnVcRNwQEXdT7dL8FtWFWbdk5jyqsdI+0LL+dZl5FDADeGVmHg0cC3w8Il7U4drbYhC0b+AL5xi2jnXUB2wGvhoRVwH7U4XBVcA6qoH0PkT1C7TV64ET6l+s3wAmRsS+9bJs8G8YLUNtu1YLgbMi4kbgIKqLCv8SOKZuOwroi4gr6w9k66+qtcAb6u23L/DQoOce2J47ek1KMNzrsC4zHxudksaF6zJzDtXQNpuBXwCPA2+MiBXAZ4A9W9YfeM+9Hji8/pxfS/XdcFCHat4pBkGbMvMBqn2qfw58uW6eApyUme8EPky1PbuofuX3ZOabgaupdnFQL4OqG/7V+s01v15nYBjFwbs/xr1htl2r9wFnZeZs4A1UX/zvB5bUbV3A2zPzzHpf7e/u0p2Z/cD3gH8Gvj3Ecw9sz+Fek2Js53XY7d5zTajD8t3AlcBfAOsz8zSqYfNfUPe0YOv2/Blwff05P47qoPMDHS26TQbBzvkacEBm3lvP9wJPRcSPgB9QDYT3CuBHwEUR0QOcBXy2Xv/uiPgy8HngtfWv3ZuAh4bY/727GbztWt0J3B4R1wH/C9wK3Ab8oG6bClyznedeQfVFv739r8O9JqXZ3uugHcjMu4FLgT+k6tXfRPUj5OdUn/1W3wGerN9zPwb6M3NjJ+ttl0NMSFLh7BFIUuEMAkkqnEEgSYUzCCSpcAaBJBXOISakNkTEUuBoYBLwaqpByKA6Fbg/My+PiGVU1z48FBEPAnMy88FRKFfaKQaB1IbMXAQQEQcDN2TmYUOsdixwYSfrkkaCQSDtgohYUk8+Q3VB0fciYlbL8gnAJ4E5wARgeWZ+psNlStvlMQJpBGTmPwC/Bk4YNG7P++rl06gGelvQGhTSWGCPQGrWXOCwiDiunt+bajCyoQbfk0aFQSA1awJwbmZ+EyAiXgo8ObolSdty15A0cnp57o+r64D3RcQeEbE3sIZqnHppzLBHII2ca6gOFs9raRu4E90dVJ+3ZZl5wyjUJg3L0UclqXDuGpKkwhkEklQ4g0CSCmcQSFLhDAJJKpxBIEmFMwgkqXD/Dx+oLs3K2cYhAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# viz counts the title coloumn\n", "sns.countplot(dataset[\"Title\"]).set_xticklabels([\"Master\",\"Miss-Mrs\",\"Mr\",\"Rare\"]);" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYAAAAEFCAYAAADqujDUAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAExZJREFUeJzt3XuQnXV9x/H37hJIComIOKBFLo7yrVUJOilNIFgxIJKRklrtxag1bVAGlGrqINgqYBUtTKquluJlxBvqIBrdWIhKvQCJsYy3Rmm+7TqD1YFoQUnAbLLspX+cs3Jystk9m+xvT84+79cMw3nO79l9PrMD53Oe2+/pGh0dRZJUPd3tDiBJag8LQJIqygKQpIqyACSpog5pd4BWRMRhwB8A9wPDbY4jSZ2iB3gScHdm7m4e7IgCoPbhf2e7Q0hShzoTuKv5zU4pgPsBbrrpJo499th2Z5GkjrBt2zZWrlwJ9c/QZp1SAMMAxx57LMcdd1y7s0hSpxn30LkngSWpoiwASaooC0CSKqrYOYCI6AauBxYCu4HVmdnfMP4m4C+BEeCazFxXKoskaW8l9wBWAHMzcwlwObB2bCAijgQuBZYALwTeWzCHJGkcJQtgKbABIDM3A4saxn4D/BQ4vP7PSMEckqRxlCyABcD2huXhiGg85PQz4B7ge0BvwRySpHGULIAdwPzGbWXmUP31edRuTz4JOB5YERGnFcyiDtTb28uyZcvo7fX7gVRCyQLYCCwHiIjFwJaGsV8DA8DuzNwFPAQcWTCLOszAwAB9fX0ArF+/noGBgTYnkmafkncCrwPOiYhNQBewKiLWAP2Z2RcRZwObI2KE2hwVXyuYRR1mcHCQsafVjYyMMDg4yLx589qcSppdihVAZo4AFzW9vbVh/ErgylLblyRNzBvBJKmiLABJqigLQJIqygKQpIqyACSpoiwASaooC0CSKsoCkKSKsgCmkXPXSOoknfJQ+INe89w1F154YUdPXfDqG/+2rdsf3j20x/LrPvMWeg5rz3+uH1v1vrZsVyrNPYBpMt7cNZJ0MLMAJKmiLABJqigLQJIqygKQpIqyACSpoiwASaooC0CSKqrYnTUR0Q1cDywEdgOrM7O/PnYq8N6G1RcDKzJzQ6k8kqQ9lby1cgUwNzOXRMRiYC1wAUBm/gB4PkBEvAy4zw9/SZpZJQ8BLQU2AGTmZmBR8woRcThwNXBpwRySpHGULIAFwPaG5eGIaN7j+Bvgc5n5QMEckqRxlDwEtAOY37DcnZlDTeusBF5aMIM6VFd3V8NC07KkaVFyD2AjsBygfg5gS+NgRDwOOCwzf1YwgzpU95wejjj5KACOePpRdM/paXMiafYpuQewDjgnIjYBXcCqiFgD9GdmH3AycG/B7avDPf60J/P4057c7hjSrFWsADJzBLio6e2tDeN3U7tSSJLUBt4IJkkVZQFIUkVZAJJUURaAJFWUBSBJFWUBSFJFWQCSVFEWgCRVlAUgSRVlAUhSRVkAklRRFoAkVZQFIEkVZQFIUkVZAJJUURaAJFWUBSBJFWUBSFJFFXskZER0A9cDC4HdwOrM7G8YPw+4sr74PeCSzBwtlUeStKeSewArgLmZuQS4HFg7NhAR84HrgBdn5mJqD4c/umAWSVKTkgWwFNgAkJmbgUUNY6cDW4C1EXEn8IvM/L+CWSRJTUoWwAJge8PycESMHXI6GjgLeDNwHvCGiDi5YBZJUpOSBbADmN+4rcwcqr9+ELg7M7dl5iPAHcCpBbNIkpoUOwkMbATOB26OiMXUDvmM+S7wrIg4GngIWAx8+EA29vLLbjqQHz9gI0O79lh+7dW30H3I3LZk+fS1K9uyXUmdpWQBrAPOiYhNQBewKiLWAP2Z2RcRVwBfqa97c2b+qGAWSVKTYgWQmSPARU1vb20Y/yzw2VLblyRNzBvBJKmiLABJqigLQJIqygKQpCnq7e1l2bJl9Pb2tjvKAbE
AJGkKBgYG6OvrA2D9+vUMDAy0OdH+swAkaQoGBwcZHa3NWzkyMsLg4GCbE+0/C0CSKsoCkKSKsgAkqaIsAEmqKAtAkirKApCkirIAJKmiLABJqigLQJIqygKQpIqyACSpoiwASaqoYo+EjIhu4HpgIbAbWJ2Z/Q3jvcAZwMP1ty7IzO2l8kiS9lTyofArgLmZuSQiFgNrgQsaxp8LnJuZDxTMIEnah5KHgJYCGwAyczOwaGygvnfwdOBDEbExIv66YA5J0jhKFsACoPGQznBEjO1xHA68H3gF8CLg4og4pWAWSVKTkgWwA5jfuK3MHKq/3gm8LzN3ZubDwNepnSuQJM2QkgWwEVgOUD8HsKVh7GTgrojoiYg51A4Xfa9gFklSkwlPAkfE8yYaz8w7JhheB5wTEZuALmBVRKwB+jOzLyJuAjYDjwKfyMwfTy26JOlATHYV0NX1fz8BeBq1b/XDwOnUvtGfsa8fzMwR4KKmt7c2jF8LXDvFvJKkaTJhAWTmWQARcSvwkrHr+CPiBOCD5eNJkkpp9RzACY03cQH/C5xQII8kaYa0eiPYdyPi48DN1I7nrwTuLJZKklRcqwWwGng9tWP6o8Dt1KZ5kCR1qJYKIDMHI+Lz1E7ifgV4SsM1/ZKkDtTSOYCI+HNgPfA+4Cjg2xHxipLBJElltXoS+M3ULv18ODN/CTwHuKJYKklSca0WwHB9ygYAMvN+YKRMJEnSTGj1JPCPI+J1wJyIOBW4GPhBuViSpNJa3QO4BPhdYAD4KLWJ3i4uFUqSVN5ULgN9T2Z63F+SZolWC+ApwHciYivwKWBdZu4sF0uSVFpLh4Ay802ZeRJwDbAE+H5EfKJosk7T1dO40LQsSQeflp8HEBFdwBzgUGp3Aw+WCtWJunvmMO+JzwBg3hN/j+6eOW1OJEkTa+kQUET0An9C7cqfTwGXZuauksE60YLjl7Dg+CXtjiFJLWn1HMD/AM/JzAdKhpEkzZzJngj2msz8ELXpHy6OiD3GM/PtBbNJkgqabA+gax+vJUkdbrIngo099esh4DP1eYBaEhHd1KaMXgjsBlY3PVRmbJ1/A76UmTdMJbgk6cC0ehXQ2H0At0XEyoj4nRZ+ZgUwNzOXAJcDa8dZ5x3UDi9JkmZYyfsAlgIb6j+/GVjUOBgRL6U2odxtUw0tSTpwJe8DWABsb1gejohD6r/rWcDLgbdNKa0kadrsz30An6S1+wB2APMblrsbniL2KmqTy30dOBEYjIh7M3PDFLJLkg5Aq/cB/JKp3wewETgfuDkiFgNbxgYy87Kx1xFxFbDND39JmlmtHgJauR83ga0DdkXEJuA9wBsjYk1E/PEUf48kqYBW9wDuiYi3Ad+h9kwAADLzjn39QGaOABc1vb11nPWuajGDJGkatVoARwFn1f8ZMwq8YNoTSZJmREsFkJlnTb6WJKmTtHoV0DeofePfQ2a6ByBJHarVQ0BXNbyeA1wA/Hra00iSZkyrh4C+1fTW7RHxHbyRS5I6VquHgI5vWOwCngk8oUgiSdKMaPUQ0Ld47BzAKPAA8PoiiSRJM2LSG8Ei4sXA2Zn5VODvgP8CvgLcXjibJKmgCQsgIt4EXAkcFhGnUHse8Bep3RdwXfl4kqRSJjsE9EpgSWbujIh3A32Z+ZH6zKD3lI8nSXu75u8/17ZtP/ronvNgvvedfcyZM7dNaeAt73zZfv/sZIeARjNzZ/31WTw2v/9e9wRIkjrLZHsAQxFxJHAE8BzgqwARcQIwNNEPSpIObpPtAbyb2jMANgMfycz7I+LPgH8Hri0dTpJUzmQPhb+lPp3z0Zn5n/W3H6H2gPdvlg4nSSpn0vsAMvM+4L6G5VuLJpIkzYiWnwksSZpdLABJqigLQJIqygKQpIpqdTK4KYuIbuB6YCGwm9qVQ/0N45cAr6Y2udzbM/PLpbJIkvZWcg9gBTA3M5cAlwNrxwYi4mjgYuB0YBnwr/XpJSRJM6RkASzlsakjNgOLxgY
y8wFgYWY+ChwLPOT0EpI0s0oWwAJge8PycET89pBTZg5FxOuo3WV8S8EckqRxlCyAHcD8xm1l5h7zB2XmB4AnAc+LiLMKZpEkNSl2EhjYCJwP3BwRi4EtYwMREcC7gD8FHqV2knikYBZJUpOSBbAOOKc+l1AXsCoi1gD9mdkXET8Evk3tKqDbxnnwvCSpoGIFkJkjwEVNb29tGL8auLrU9iVJE/NGMEmqKAtAkirKApCkirIAJKmiLABJqigLQJIqygKQpIqyACSpoiwASaooC0CSKsoCkKSKsgAkqaIsAEmqKAtAkirKApCkirIAJKmiLABJqigLQJIqqtgjISOiG7geWEjtoe+rM7O/YfyNwF/UF2+tPyJSkjRDSu4BrADmZuYS4HJg7dhARDwVWAmcDiwBXhgRpxTMIklqUrIAlgIbADJzM7CoYexnwIsyc7j+8Pg5wK6CWSRJTYodAgIWANsblocj4pDMHMrMR4EHIqILuA74fmb+d8EskqQmJfcAdgDzG7eVmUNjCxExF7ipvs7FBXNIldfb28uyZcvo7e1tdxQdREoWwEZgOUBELAa2jA3Uv/l/CfhhZr42M4cL5pAqbWBggL6+PgDWr1/PwMBAmxPpYFHyENA64JyI2AR0AasiYg3QD/QAfwQcFhHn1de/IjO/XTCPVEmDg4OMjo4CMDIywuDgIPPmzWtzKh0MihVA/eTuRU1vb214PbfUtiVJk/NGMEmqKAtAkqagq7uncalpubNYAJI0BYf0zOG4Y54JwHHH/D6H9Mxpc6L9V/IksCTNSnHSmcRJZ7Y7xgFzD0CSKso9AGkG3PqqVW3b9sDwnrfZ3H7x65nX077j1ss/cWPbtq09uQcgSRVlAUhSRVkAklRRFoAkVZQFIEkVZQFIUkVZAJJUURaAJFWUBSDNcj1dXb993dW0rGqzAKRZ7tDubk49/AgAFh5+BId2+7+9apwKQqqAZUcexbIjj2p3DB1k/CogSRVVbA8gIrqB64GFwG5gdWb2N63zRGAT8OzM3FUqiyRpbyX3AFYAczNzCXA5sLZxMCLOBb4KHFMwgyRpH0oWwFJgA0BmbgYWNY2PAGcDvyqYQZK0DyULYAGwvWF5OCJ+e8gpM7+WmQ8W3L4kaQIlC2AHML9xW5k5VHB7kqQpKFkAG4HlABGxGNhScFuSpCkqeR/AOuCciNhE7QbEVRGxBujPzL6C25UktaBYAWTmCHBR09tbx1nvxFIZJEn75o1gklRRFoAkVZQFIEkVZQFIUkVZAJJUURaAJFWUBSBJFWUBSFJFWQCSVFEWgCRVlAUgSRVlAUhSRVkAklRRFoAkVZQFIEkVZQFIUkVZAJJUURaAJFVUsUdCRkQ3cD2wENgNrM7M/obxC4HXAkPAOzLzy6WySJL2VnIPYAUwNzOXAJcDa8cGIuJY4FLgDOBc4F0RcVjBLJKkJsX2AIClwAaAzNwcEYsaxk4DNmbmbmB3RPQDpwB37+N39QBs27ZtnxvbvfOh6cg8K/z85z8/4N+x66Gd05BkdpiOv+evdu+ahiSzw3T8PR/5za+nIcnsMNHfs+Ezs2e88ZIFsADY3rA8HBGHZObQOGMPA4+b4Hc9CWDlypXTHnI2Wva13nZHmFWW3bCs3RFmlX9c5t9zOn3+trWTr1T7DP1J85slC2AHML9hubv+4T/e2Hxgoq/wdwNnAvcDw9MZUpJmsR5qH/7jHl0pWQAbgfOBmyNiMbClYew/gHdGxFzgMOAZwI/29Yvqh4ruKphVkmarvb75j+kaHR0tssWGq4BOAbqAVcByoD8z++pXAb2G2onoazLz80WCSJLGVawAJEkHN28Ek6SKsgAkqaIsAEmqqJJXAVXKZFNfaOoi4g+Bf8rM57c7SyeLiDnAR4ETqV11947M7GtrqA4WET3Ah4Ggdln6qszc55U2BzP3AKbPPqe+0NRFxGXAR4C57c4yC7wCeDAzzwTOAz7Q5jyd7nyAzDwDeBvwz+2Ns/8
sgOmzx9QXwKKJV9ckfgK8pN0hZonPAW9tWB7a14qaXGZ+kdol7AAnAL9oY5wDYgFMn3GnvmhXmE5Xvy/k0XbnmA0y85HMfDgi5gO3AP/Q7kydLjOHIuLjwPup/U07kgUwfSaa+kJqq4h4CvAN4JOZ+el255kNMvOvgJOBD0fE4e3Osz8sgOmzkdqdzowz9YXUNhFxDPBV4M2Z+dF25+l0EfHKiLiivrgTGKFD5yjzEMX0WQecExGbeGzqC+lg8Bbg8cBbI2LsXMB5mTnQxkyd7AvAjRFxBzAHeENmduR8304FIUkV5SEgSaooC0CSKsoCkKSKsgAkqaIsAEmqKC8DlfYhIv4FOAM4FHgacE996IPAaGbeEBE3Aldl5k8j4l7g+Zl5bxviSlNmAUj7kJmXAETEicA3M/PUcVY7C7h6JnNJ08UCkKYoIq6qv9wFPBm4NSLObBjvAa4Dng/0AB/LzPfMcExpUp4DkPZTZr4buA9YnpkPNgxdWB9/LnAacEFjQUgHC/cApOl3NnBqRLygvnwE8GzgzvZFkvZmAUjTrwe4LDO/ABARRwOPtDeStDcPAUkHZoi9v0h9HbgwIuZExBHAXcDiGU8mTcI9AOnAfJnaSeBzG967AXg68H1q/4/dmJnfbEM2aULOBipJFeUhIEmqKAtAkirKApCkirIAJKmiLABJqigLQJIqygKQpIr6f2s/6fN1bqHXAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Let's see, based on title what's the survival probability\n", "sns.barplot(x='Title', y='Survived', data=dataset);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Catching Aspects:\n", "\n", "- People with the title 'Mr' survived less than people with any other title.\n", "- Titles with a survival rate higher than 70% are those that correspond to female (Miss-Mrs)\n", "\n", "Our new category, 'Rare', should be more discretized. As we can see by the error bar (black line), there is a significant uncertainty around the mean value. Probably, one of the problems is that we are mixing male and female titles in the 'Rare' category. We should proceed with a more detailed analysis to sort this out. Also, the category 'Master' seems to have a similar problem. For now, we will not make any changes, but we will keep these two situations in our mind for future improvement of our data set.\n", "\n", "\n", "From now on, there's no Name features and have Title feature to represent it." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Age</th><th>Cabin</th><th>Fare</th><th>Parch</th><th>PassengerId</th><th>Pclass</th><th>SibSp</th><th>Survived</th><th>Ticket</th><th>male</th><th>Q</th><th>S</th><th>Title</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>22.0</td><td>NaN</td><td>7.2500</td><td>0</td><td>1</td><td>3</td><td>1</td><td>0.0</td><td>A/5 21171</td><td>1</td><td>0</td><td>1</td><td>2</td></tr>\n",
"    <tr><th>1</th><td>38.0</td><td>C85</td><td>71.2833</td><td>0</td><td>2</td><td>1</td><td>1</td><td>1.0</td><td>PC 17599</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>26.0</td><td>NaN</td><td>7.9250</td><td>0</td><td>3</td><td>3</td><td>0</td><td>1.0</td><td>STON/O2. 3101282</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>35.0</td><td>C123</td><td>53.1000</td><td>0</td><td>4</td><td>1</td><td>1</td><td>1.0</td><td>113803</td><td>0</td><td>0</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>35.0</td><td>NaN</td><td>8.0500</td><td>0</td><td>5</td><td>3</td><td>0</td><td>0.0</td><td>373450</td><td>1</td><td>0</td><td>1</td><td>2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"
" ], "text/plain": [ " Age Cabin Fare Parch PassengerId Pclass SibSp Survived \\\n", "0 22.0 NaN 7.2500 0 1 3 1 0.0 \n", "1 38.0 C85 71.2833 0 2 1 1 1.0 \n", "2 26.0 NaN 7.9250 0 3 3 0 1.0 \n", "3 35.0 C123 53.1000 0 4 1 1 1.0 \n", "4 35.0 NaN 8.0500 0 5 3 0 0.0 \n", "\n", " Ticket male Q S Title \n", "0 A/5 21171 1 0 1 2 \n", "1 PC 17599 0 0 0 1 \n", "2 STON/O2. 3101282 0 0 1 1 \n", "3 113803 0 0 1 1 \n", "4 373450 1 0 1 2 " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# viz top 5\n", "dataset.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Family size\n", "\n", "I like to create a `Famize` feature which is the sum of `SibSp` , `Parch`." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "# Create a family size descriptor from SibSp and Parch\n", "dataset[\"Famize\"] = dataset[\"SibSp\"] + dataset[\"Parch\"] + 1\n", "\n", "# Drop SibSp and Parch variables\n", "dataset.drop(labels = [\"SibSp\",'Parch'], axis = 1, inplace = True)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Viz the survival probabily of Famize feature\n", "\n", "facet = sns.FacetGrid(dataset, hue=\"Survived\",aspect=4)\n", "facet.map(sns.kdeplot,'Famize',shade= True)\n", "facet.set(xlim=(0, dataset['Famize'].max()))\n", "facet.add_legend()\n", "plt.xlim(0);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Survival probability is worst for large families." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cabin & Ticket\n", "Now, `Cabin` feature has a huge data missing. So, I like to drop it anyway. Moreover, we also can't get to much information by `Ticket` feature for prediction task." 
] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "# drop some useless features\n", "dataset.drop(labels = [\"Ticket\",'Cabin','PassengerId'], axis = 1, \n", " inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predictive Modeling \n", "
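As a rough illustration of what a stratified K-fold split does, the sketch below deals sample indices into folds class by class, so every fold keeps roughly the same share of each label. The helper `stratified_folds` and the toy labels are hypothetical and written in plain Python; this is not how Scikit-Learn's `StratifiedKFold` is implemented, only the idea behind it.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    # Hypothetical helper: deal the indices of each class round-robin
    # into n_splits folds so class proportions stay roughly equal.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Toy labels (an assumption, not the Titanic data): 12 died, 8 survived.
toy_labels = [0] * 12 + [1] * 8
folds = stratified_folds(toy_labels, n_splits=4)

for fold in folds:
    survivors = sum(toy_labels[i] for i in fold)
    print(len(fold), 'samples,', survivors, 'survivors')
```

Each fold then serves once as the evaluation set while the remaining folds are used for training.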
\n", "\n", "Here we split the combined dataset back into the training and test sets at their original sizes. To guard against overfitting we could hold out a separate validation set, but with this little data that would waste training examples. Instead we use [**K-Fold**](http://scikit-learn.org/stable/modules/cross_validation.html#k-fold) cross-validation, specifically [**StratifiedKFold**](http://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold), to split the training set into 10 folds (`n_splits=10`)." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Mohammed Innat\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:4: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " after removing the cwd from sys.path.\n" ] } ], "source": [ "# Separate train dataset and test dataset\n", "train = dataset[:len(train)]\n", "test = dataset[len(train):]\n", "test.drop(labels=[\"Survived\"],axis = 1,inplace = True)\n", "\n", "## Separate train features and label \n", "Y_train = train[\"Survived\"].astype(int)\n", "X_train = train.drop(labels = [\"Survived\"],axis = 1)\n", "\n", "# Cross validate model with Kfold stratified cross val\n", "K_fold = StratifiedKFold(n_splits=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classifier \n", "
\n", "\n", "I compared 10 popular classifiers and evaluated the mean accuracy of each with a stratified K-fold cross-validation procedure.\n", "\n", "- KNN\n", "- AdaBoost\n", "- Decision Tree\n", "- Random Forest\n", "- Extra Trees\n", "- Support Vector Machine\n", "- Gradient Boosting\n", "- Logistic regression\n", "- Linear Discriminant Analysis\n", "- Multi-layer perceptron\n", "\n", "## Evaluation using Cross Validation\n", "Rather than holding out a single validation set, we use Scikit-Learn's cross-validation utilities. The following performs stratified **K-fold** cross-validation: it splits the training set into 10 distinct subsets called folds, then trains and evaluates each model 10 times, picking a different fold for evaluation every time and training on the other 9 folds." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Modeling step: test different algorithms \n", "random_state = 2\n", "\n", "models = [] # the candidate predictive models \n", "cv_results = [] # cross validation result\n", "cv_means = [] # cross validation mean value\n", "cv_std = [] # cross validation standard deviation\n", "\n", "models.append(KNeighborsClassifier())\n", "models.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))\n", "models.append(DecisionTreeClassifier(random_state=random_state))\n", "models.append(RandomForestClassifier(random_state=random_state))\n", "models.append(ExtraTreesClassifier(random_state=random_state))\n", "models.append(SVC(random_state=random_state))\n", "models.append(GradientBoostingClassifier(random_state=random_state))\n", "models.append(LogisticRegression(random_state = random_state))\n", "models.append(LinearDiscriminantAnalysis())\n", "models.append(MLPClassifier(random_state=random_state))\n", "\n", "\n", "for model in models :\n", "    
cv_results.append(cross_val_score(model, X_train, Y_train, \n", "                                      scoring = \"accuracy\", cv = K_fold, n_jobs=4))\n", "\n", "for cv_result in cv_results:\n", "    cv_means.append(cv_result.mean())\n", "    cv_std.append(cv_result.std())\n", "\n", "cv_frame = pd.DataFrame(\n", "    {\n", "        \"CrossValMeans\":cv_means,\n", "        \"CrossValErrors\": cv_std,\n", "        \"Algorithms\":[\n", "            \"KNeighbors\",\n", "            \"AdaBoost\", \n", "            \"DecisionTree\", \n", "            \"RandomForest\",\n", "            \"ExtraTrees\",\n", "            \"SVC\",\n", "            \"GradientBoosting\", \n", "            \"LogisticRegression\",\n", "            \"LinearDiscriminantAnalysis\",\n", "            \"MultiLayerPerceptron\"]\n", "    })\n", "\n", "# Pass x and y by keyword so the call does not rely on positional order\n", "cv_plot = sns.barplot(x=\"CrossValMeans\", y=\"Algorithms\", data = cv_frame,\n", "                palette=\"husl\", orient = \"h\", **{'xerr':cv_std})\n", "\n", "cv_plot.set_xlabel(\"Mean Accuracy\")\n", "cv_plot = cv_plot.set_title(\"CV Scores\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore the following models separately:\n", "\n", "- GBC Classifier\n", "- Linear Discriminant Analysis \n", "- Logistic Regression\n", "- Random Forest Classifier\n", "- Gaussian Naive Bayes\n", "- Support Vector Machine" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.83146067 0.82954545 0.76136364 0.89772727 0.90909091 0.875\n", " 0.84090909 0.79545455 0.84090909 0.82954545]\n" ] }, { "data": { "text/plain": [ "84.11" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# GBC Classifier\n", "GBC_Model = GradientBoostingClassifier()\n", "\n", "scores = cross_val_score(GBC_Model, X_train, Y_train, cv = K_fold,\n", "                       n_jobs = 4, scoring = 'accuracy')\n", "\n", "print(scores)\n", "round(np.mean(scores)*100, 2)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.84269663 0.82954545 0.76136364 0.88636364 0.81818182 
0.80681818\n", " 0.79545455 0.78409091 0.86363636 0.84090909]\n" ] }, { "data": { "text/plain": [ "82.29" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Linear Discriminant Analysis \n", "LDA_Model= LinearDiscriminantAnalysis()\n", "\n", "scores = cross_val_score(LDA_Model, X_train, Y_train, cv = K_fold,\n", " n_jobs = 4, scoring = 'accuracy')\n", "\n", "print(scores)\n", "round(np.mean(scores)*100, 2)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.83146067 0.81818182 0.76136364 0.875 0.81818182 0.77272727\n", " 0.79545455 0.79545455 0.84090909 0.84090909]\n" ] }, { "data": { "text/plain": [ "81.5" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Logistic Regression\n", "#\n", "Log_Model = LogisticRegression(C=1)\n", "scores = cross_val_score(Log_Model, X_train, Y_train, cv=K_fold, \n", " n_jobs=4, scoring='accuracy')\n", "\n", "print(scores)\n", "round(np.mean(scores)*100, 2)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.79775281 0.88636364 0.73863636 0.80681818 0.86363636 0.79545455\n", " 0.82954545 0.76136364 0.84090909 0.82954545]\n" ] }, { "data": { "text/plain": [ "81.5" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Random Forest Classifier Model\n", "#\n", "RFC_model = RandomForestClassifier(n_estimators=10)\n", "scores = cross_val_score(RFC_model, X_train, Y_train, cv=K_fold, \n", " n_jobs=4, scoring='accuracy')\n", "\n", "print(scores)\n", "round(np.mean(scores)*100, 2)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.78651685 0.81818182 0.75 0.86363636 0.77272727 0.79545455\n", " 0.80681818 0.78409091 0.85227273 0.84090909]\n" ] }, { 
"data": { "text/plain": [ "80.71" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Gaussian Naive Bayes\n", "GNB_Model = GaussianNB()\n", "\n", "scores = cross_val_score(GNB_Model, X_train, Y_train, cv=K_fold, \n", " n_jobs=4, scoring='accuracy')\n", "\n", "print(scores)\n", "round(np.mean(scores)*100, 2)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.69662921 0.65909091 0.64772727 0.72727273 0.76136364 0.70454545\n", " 0.76136364 0.73863636 0.72727273 0.78409091]\n" ] }, { "data": { "text/plain": [ "72.08" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Support Vector Machine\n", "SVM_Model = SVC()\n", "\n", "scores = cross_val_score(SVM_Model, X_train, Y_train, cv=K_fold, \n", " n_jobs=4, scoring='accuracy')\n", "\n", "print(scores)\n", "round(np.mean(scores)*100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Hyperparameter Tuning \n", "
\n", "\n", "I chose the most promising models, GradientBoosting, Linear Discriminant Analysis, RandomForest, Logistic Regression and SVM, for the ensemble modeling. Now we need to fine-tune them.\n", "\n", "One way to do that would be to fiddle with the hyperparameters manually until we find a great combination of hyperparameter values. This would be very tedious work, and we may not have time to explore many combinations. Instead we should get Scikit-Learn's `GridSearchCV` to search for us. All we need to do is tell it which hyperparameters we want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values using **cross-validation**.\n", "\n", "Here we perform grid search optimization for the GradientBoosting, RandomForest, Linear Discriminant Analysis, Logistic Regression and SVC classifiers." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 216 candidates, totalling 2160 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 2.6s\n", "[Parallel(n_jobs=4)]: Done 626 tasks | elapsed: 12.9s\n", "[Parallel(n_jobs=4)]: Done 1626 tasks | elapsed: 30.5s\n", "[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed: 41.0s finished\n" ] }, { "data": { "text/plain": [ "0.8365493757094211" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Gradient boosting tuning\n", "GBC = GradientBoostingClassifier()\n", "gb_param_grid = {\n", "              'loss' : [\"deviance\"],\n", "              'n_estimators' : [100,200,300],\n", "              'learning_rate': [0.1, 0.05, 0.01, 0.001],\n", "              'max_depth': [4, 8,16],\n", "              'min_samples_leaf': [100,150,250],\n", "              'max_features': [0.3, 0.1]\n", "              }\n", "\n", "gsGBC = GridSearchCV(GBC, param_grid = gb_param_grid, cv=K_fold, \n", "                     scoring=\"accuracy\", n_jobs= 4, verbose = 1)\n", 
"\n", "gsGBC.fit(X_train,Y_train)\n", "GBC_best = gsGBC.best_estimator_\n", "\n", "# Best score\n", "gsGBC.best_score_" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 36 candidates, totalling 360 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 5.5s\n", "[Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 18.7s\n", "[Parallel(n_jobs=4)]: Done 360 out of 360 | elapsed: 32.5s finished\n" ] }, { "data": { "text/plain": [ "0.8422247446083996" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# RFC Parameters tunning \n", "RFC = RandomForestClassifier()\n", "\n", "## Search grid for optimal parameters\n", "rf_param_grid = {\"max_depth\": [None],\n", " \"min_samples_split\": [2, 6, 20],\n", " \"min_samples_leaf\": [1, 4, 16],\n", " \"n_estimators\" :[100,200,300,400],\n", " \"criterion\": [\"gini\"]}\n", "\n", "\n", "gsRFC = GridSearchCV(RFC, param_grid = rf_param_grid, cv=K_fold,\n", " scoring=\"accuracy\", n_jobs= 4, verbose = 1)\n", "\n", "gsRFC.fit(X_train,Y_train)\n", "RFC_best = gsRFC.best_estimator_\n", "\n", "# Best score\n", "gsRFC.best_score_" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 180 candidates, totalling 1800 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Done 351 tasks | elapsed: 2.6s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[LibLinear]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Done 1800 out of 1800 | elapsed: 4.4s finished\n" ] }, { "data": { "text/plain": [ "0.8240635641316686" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# LogisticRegression Parameters 
tunning \n", "LRM = LogisticRegression()\n", "\n", "## Search grid for optimal parameters\n", "lr_param_grid = {\"penalty\" : [\"l2\"],\n", " \"tol\" : [0.0001,0.0002,0.0003],\n", " \"max_iter\": [100,200,300],\n", " \"C\" :[0.01, 0.1, 1, 10, 100],\n", " \"intercept_scaling\": [1, 2, 3, 4],\n", " \"solver\":['liblinear'],\n", " \"verbose\":[1]}\n", "\n", "\n", "gsLRM = GridSearchCV(LRM, param_grid = lr_param_grid, cv=K_fold,\n", " scoring=\"accuracy\", n_jobs= 4, verbose = 1)\n", "\n", "gsLRM.fit(X_train,Y_train)\n", "LRM_best = gsLRM.best_estimator_\n", "\n", "# Best score\n", "gsLRM.best_score_" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 3 candidates, totalling 30 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Done 23 out of 30 | elapsed: 1.9s remaining: 0.5s\n", "[Parallel(n_jobs=4)]: Done 30 out of 30 | elapsed: 1.9s finished\n" ] }, { "data": { "text/plain": [ "0.8229284903518729" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Linear Discriminant Analysis - Parameter Tuning\n", "LDA = LinearDiscriminantAnalysis()\n", "\n", "## Search grid for optimal parameters\n", "lda_param_grid = {\"solver\" : [\"svd\"],\n", " \"tol\" : [0.0001,0.0002,0.0003]}\n", "\n", "\n", "gsLDA = GridSearchCV(LDA, param_grid = lda_param_grid, cv=K_fold,\n", " scoring=\"accuracy\", n_jobs= 4, verbose = 1)\n", "\n", "gsLDA.fit(X_train,Y_train)\n", "LDA_best = gsLDA.best_estimator_\n", "\n", "# Best score\n", "gsLDA.best_score_" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 10 folds for each of 30 candidates, totalling 300 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 50 tasks | elapsed: 3.2s\n", "[Parallel(n_jobs=-1)]: Done 300 
out of 300 | elapsed: 17.3s finished\n" ] }, { "data": { "text/plain": [ "0.8161180476730987" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "### SVC classifier\n", "SVMC = SVC(probability=True)\n", "svc_param_grid = {'kernel': ['rbf'], \n", " 'gamma': [0.0001, 0.001, 0.01, 0.1, 1],\n", " 'C': [1, 10, 50, 100, 200, 300]}\n", "\n", "gsSVMC = GridSearchCV(SVMC, param_grid = svc_param_grid, cv = K_fold,\n", " scoring=\"accuracy\", n_jobs= -1, verbose = 1)\n", "\n", "gsSVMC.fit(X_train,Y_train)\n", "\n", "SVMC_best = gsSVMC.best_estimator_\n", "\n", "# Best score\n", "gsSVMC.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot Learning Curves \n", "**Diagnose Bias and Variance to Reduce Error**\n", "
\n", "Learning curves are a good way to see whether a model overfits or underfits and how the training set size affects accuracy. A learning curve plots the model's performance on the training set and on the validation set as a function of training set size. To generate the plots, we simply train the model several times on differently sized subsets of the training set. In a nutshell, a learning curve shows how the error changes as the training set size increases.\n", "\n", "If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, it is overfitting. If it performs poorly on both, it is underfitting.\n", "\n", "When the model is trained on very few training instances, it cannot generalize properly, which is why the validation error is initially quite large.\n", "\n", "> **Underfitting**: If a model is underfitting the training data, adding more training examples will not help. We need to use a more complex model or come up with better features.\n", "\n", "> **Overfitting**: One way to improve an overfitting model is to feed it more training data until the validation error approaches the training error.\n", "\n", "**Resource**\n", "- [Learning Curves for Machine Learning](https://www.dataquest.io/blog/learning-curves-machine-learning/)\n", "\n", "\n", "## Bias-Variance Trade-Off\n", "
\n", "A model's generalization error can be expressed as the sum of three very different errors.\n", "\n", "- Bias\n", "- Variance\n", "- Irreducible Error\n", "\n", "#### Bias Error in Learning Curve\n", "This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic.\n", "\n", "- **A high bias model is most likely to underfit the training data**\n", "\n", "\n", "#### Variance Error in Learning Curve\n", "This part of the generalization error is due to the model's excessive sensitivity to small variations in the training data.\n", "\n", "- **A high variance model is most likely to overfit the training data**\n", "\n", "#### Irreducible Error in Learning Curve\n", "This part is due to the noisiness of the data itself. It is not a major concern here, because we have already cleaned the data sets.\n", "\n", "
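The split between bias and variance can be made concrete with a small simulation. The sketch below is a toy setup, not part of the Titanic analysis: the true relation is assumed to be `y = x**2` plus Gaussian noise, and two deliberately extreme models are refit on many fresh training sets, a constant (always predict the mean of `y`) and a 1-nearest-neighbour lookup. All names (`true_f`, `fit_constant`, `fit_nearest`) and the numbers chosen are assumptions for illustration.

```python
import random
from statistics import mean, pvariance

random.seed(0)


def true_f(x):
    # Assumed ground truth for this toy example.
    return x ** 2


def make_training_set(n=20, noise=0.3):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, noise) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    # High-bias model: ignores x and always predicts the mean of y.
    level = mean(ys)
    return lambda x: level


def fit_nearest(xs, ys):
    # High-variance model: copies the label of the closest training point.
    def predict(x):
        closest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[closest]
    return predict


def bias_and_variance(fit, x0=0.9, repeats=500):
    # Refit on fresh training sets and look at the spread of predictions at x0.
    predictions = []
    for _ in range(repeats):
        xs, ys = make_training_set()
        predictions.append(fit(xs, ys)(x0))
    bias_squared = (mean(predictions) - true_f(x0)) ** 2
    return bias_squared, pvariance(predictions)


bias_const, var_const = bias_and_variance(fit_constant)
bias_near, var_near = bias_and_variance(fit_nearest)
print('constant model  : bias^2 =', round(bias_const, 3), 'variance =', round(var_const, 3))
print('nearest neighbor: bias^2 =', round(bias_near, 3), 'variance =', round(var_near, 3))
```

The constant model should show the larger squared bias and the nearest-neighbour model the larger variance, which is the underfitting versus overfitting pattern described above.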
\n", "\n", "Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance.\n", "\n", "\n", "Now we'll define a learning-curve plotting function where the `x` and `y` axes are the training set size and the scores (not errors), respectively. So the higher the score, the better the performance of the model." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "# Plot learning curve\n", "def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,\n", "                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):\n", "    \"\"\"\n", "    Generate a simple plot of the test and training learning curve.\n", "\n", "    Parameters\n", "    ----------\n", "    estimator : object type that implements the \"fit\" and \"predict\" methods\n", "        An object of that type which is cloned for each validation.\n", "\n", "    title : string\n", "        Title for the chart.\n", "\n", "    X : array-like, shape (n_samples, n_features)\n", "        Training vector, where n_samples is the number of samples and\n", "        n_features is the number of features.\n", "\n", "    y : array-like, shape (n_samples) or (n_samples, n_features), optional\n", "        Target relative to X for classification or regression;\n", "        None for unsupervised learning.\n", "\n", "    ylim : tuple, shape (ymin, ymax), optional\n", "        Defines minimum and maximum yvalues plotted.\n", "\n", "    cv : integer, cross-validation generator, optional\n", "        If an integer is passed, it is the number of folds.\n", "        Specific cross-validation objects can be passed, see\n", "        sklearn.model_selection module for the list of possible objects\n", "\n", "    n_jobs : integer, optional\n", "        Number of jobs to run in parallel (default -1, i.e. all cores).\n", "    \n", "    train_sizes : array-like, optional\n", "        Fractions of the training set to use; the default\n", "        np.linspace(.1, 1.0, 5) gives 5 evenly spaced values from 0.1 to 1.0.\n", "    \"\"\"\n", "    \n", "    \n", "    plt.figure()\n", "    plt.title(title)\n", "    if ylim is not 
None:\n", "        plt.ylim(*ylim)\n", "    \n", "    plt.xlabel(\"Training examples\")\n", "    plt.ylabel(\"Score\")\n", "    train_sizes, train_scores, test_scores = learning_curve(\n", "        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)\n", "    train_scores_mean = np.mean(train_scores, axis=1)\n", "    train_scores_std = np.std(train_scores, axis=1)\n", "    test_scores_mean = np.mean(test_scores, axis=1)\n", "    test_scores_std = np.std(test_scores, axis=1)\n", "    plt.grid()\n", "\n", "    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,\n", "                     train_scores_mean + train_scores_std, alpha=0.1,\n", "                     color=\"r\")\n", "    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,\n", "                     test_scores_mean + test_scores_std, alpha=0.1, color=\"g\")\n", "    plt.plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n", "             label=\"Training score\")\n", "    plt.plot(train_sizes, test_scores_mean, 'o-', color=\"g\",\n", "             label=\"Cross-validation score\")\n", "\n", "    plt.legend(loc=\"best\")\n", "    return plt" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Gradient boosting - Learning Curve \n", "plot_learning_curve(estimator = gsGBC.best_estimator_,title = \"GBC learning curve\",\n", "                    X = X_train, y = Y_train, cv = K_fold);" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Random Forest - Learning Curve\n", "plot_learning_curve(estimator = gsRFC.best_estimator_ ,title = \"RF learning curve\",\n", "                    X = X_train, y = Y_train, cv = K_fold);" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Logistic Regression - Learning 
Curve gsLRM.best_estimator_\n", "plot_learning_curve(estimator = Log_Model, title = \"Logistic Regression - Learning Curve\",\n", "                    X = X_train, y = Y_train, cv = K_fold);" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Linear Discriminant Analysis - Learning Curve\n", "plot_learning_curve(estimator = gsLDA.best_estimator_, title = \"Linear Discriminant - Learning Curve\",\n", "                    X = X_train, y = Y_train, cv = K_fold);" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Support Vector Machine - Learning Curve\n", "plot_learning_curve(estimator = gsSVMC.best_estimator_, title = \"SVC learning curve\",\n", "                    X = X_train, y = Y_train, cv = K_fold);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SVC seems to generalize best, since its training and cross-validation curves are close together. Again, the Random Forest and Gradient Boosting classifiers tend to overfit the training set. One way to improve an overfitting model is to feed it more training data until the validation error approaches the training error." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Ensemble modeling \n", "
\n", "\n", "Another way to fine-tune our system is to combine the models that perform best. The group will often perform better than the best individual model, especially if the individual models make very different types of errors.\n", "\n", "Building a model on top of many other models is called Ensemble Learning. It is often a great way to push ML algorithms even further.\n", "\n", "I used a **voting classifier** to combine the predictions of the tuned classifiers. The cell below combines Random Forest and Gradient Boosting; a variant that uses all 5 classifiers is kept commented out because it scored slightly lower. I passed `voting='soft'` so that the predicted class probabilities of the models are averaged instead of counting hard votes." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.79775281 0.84090909 0.72727273 0.90909091 0.90909091 0.85227273\n", " 0.85227273 0.77272727 0.88636364 0.84090909]\n", "83.89\n" ] } ], "source": [ "# Two-model ensemble (RF + GBC): about 84% cross-validated accuracy\n", "VotingPredictor = VotingClassifier(estimators =\n", "                           [('rfc', RFC_best),\n", "                            ('gbc', GBC_best)],\n", "                           voting='soft', n_jobs = 4)\n", "\n", "# Five-model ensemble, kept for reference: 82.97%\n", "# VotingPredictor = VotingClassifier(estimators =\n", "#                            [('rfc', RFC_best),\n", "#                             ('svc', SVMC_best),\n", "#                             ('gbc', GBC_best),\n", "#                             ('lda', LDA_best),\n", "#                             ('lrm', LRM_best)],\n", "#                            voting='soft', n_jobs = 4)\n", "\n", "# Fit on the full training set; this fitted model is used for the submission\n", "VotingPredictor = VotingPredictor.fit(X_train, Y_train)\n", "\n", "# cross_val_score clones and refits the estimator on each fold\n", "scores = cross_val_score(VotingPredictor, X_train, Y_train, cv = K_fold,\n", "                         n_jobs = 4, scoring = 'accuracy')\n", "\n", "print(scores)\n", "print(round(np.mean(scores)*100, 2))" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Voting Predictor - Learning Curve\n", "plot_learning_curve(estimator = VotingPredictor, title = \"VP learning curve\",\n", "                    X = X_train, y = Y_train, cv = K_fold);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submit Predictor \n", 
"**Kaggle : Titanic Competition**" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "Predictive_Model = pd.DataFrame({\n", " \"PassengerId\": TestPassengerID,\n", " \"Survived\": VotingPredictor.predict(test)})\n", "\n", "Predictive_Model.to_csv('titanic_model.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
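<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>PassengerId</th>\n", "      <th>Survived</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>892</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>893</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>894</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>895</td>\n", "      <td>0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>896</td>\n", "      <td>0</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>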
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 0" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look inside\n", "submission = pd.read_csv('titanic_model.csv')\n", "submission.head()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }