{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Background\n", "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n", "\n", "One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.\n", "\n", "Complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Suppress Future Warnings\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Importing" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "************************\n", " Versions \n", "************************\n", "Scikit-learn version=0.21.3\n", "Numpy version=1.16.5\n", "Pandas version=0.25.1\n", "Matplotlib version=3.1.1\n", "Python version=3.7.4\n" ] } ], "source": [ "import sklearn\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib\n", "import platform\n", "%config IPCompleter.greedy=True #autocomplete code\n", "message=\" Versions \"\n", "print(\"*\"*len(message))\n", "print(message)\n", "print(\"*\"*len(message))\n", "print(\"Scikit-learn version={}\".format(sklearn.__version__))\n", "print(\"Numpy version={}\".format(np.__version__))\n", "print(\"Pandas 
version={}\".format(pd.__version__))\n", "print(\"Matplotlib version={}\".format(matplotlib.__version__))\n", "print(\"Python version={}\".format(platform.python_version()))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets \n", "from matplotlib import pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Import our libraries\n", "import pandas as pd\n", "import numpy as np\n", "# Import sklearn libraries\n", "from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score\n", "from sklearn.model_selection import cross_validate\n", "from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, precision_recall_curve, auc, make_scorer, confusion_matrix, f1_score, fbeta_score\n", "# Import the Naive Bayes, logistic regression, Bagging, RandomForest, AdaBoost, GradientBoost, Decision Trees and SVM Classifier\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn import svm\n", "from xgboost import XGBClassifier\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "plt.style.use('seaborn-notebook')\n", "from matplotlib.ticker import StrMethodFormatter\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelBinarizer" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "5 6 0 3 \n", "6 7 0 1 \n", "7 8 0 3 \n", "8 9 1 3 \n", "9 10 1 2 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "5 Moran, Mr. James male NaN 0 \n", "6 McCarthy, Mr. Timothy J male 54.0 0 \n", "7 Palsson, Master. Gosta Leonard male 2.0 3 \n", "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n", "9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", "5 0 330877 8.4583 NaN Q \n", "6 0 17463 51.8625 E46 S \n", "7 1 349909 21.0750 NaN S \n", "8 2 347742 11.1333 NaN S \n", "9 0 237736 30.0708 NaN C " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic = pd.read_csv(\"TitanicDataset_train.csv\")\n", "titanic.head(10) # display the first 10 rows of the training CSV" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Pclass Name Sex \\\n", "0 892 3 Kelly, Mr. James male \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n", "2 894 2 Myles, Mr. Thomas Francis male \n", "3 895 3 Wirz, Mr. Albert male \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n", "\n", " Age SibSp Parch Ticket Fare Cabin Embarked \n", "0 34.5 0 0 330911 7.8292 NaN Q \n", "1 47.0 1 0 363272 7.0000 NaN S \n", "2 62.0 0 0 240276 9.6875 NaN Q \n", "3 27.0 0 0 315154 8.6625 NaN S \n", "4 22.0 1 1 3101298 12.2875 NaN S " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanictest = pd.read_csv(\"TitanicDataset_test.csv\")\n", "titanictest.head(5) # display the first 5 rows of the test CSV" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 1\n", "2 894 0\n", "3 895 0\n", "4 896 1" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kaggle_testset = pd.read_csv(\"100-Acc.csv\")\n", "kaggle_testset.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploratory Data Analysis" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 891 entries, 0 to 890\n", "Data columns (total 12 columns):\n", "PassengerId 891 non-null int64\n", "Survived 891 non-null int64\n", "Pclass 891 non-null int64\n", "Name 891 non-null object\n", "Sex 891 non-null object\n", "Age 714 non-null float64\n", "SibSp 891 non-null int64\n", "Parch 891 non-null int64\n", "Ticket 891 non-null object\n", "Fare 891 non-null float64\n", "Cabin 204 non-null object\n", "Embarked 889 non-null object\n", "dtypes: float64(2), int64(5), object(5)\n", "memory usage: 83.7+ KB\n" ] } ], "source": [ "titanic.info()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PassengerId 0\n", "Survived 0\n", "Pclass 0\n", "Name 0\n", "Sex 0\n", "Age 177\n", "SibSp 0\n", "Parch 0\n", "Ticket 0\n", "Fare 0\n", "Cabin 687\n", "Embarked 2\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List out all variables with nulls/missing values\n", "titanic.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']\n", "['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']\n" ] } ], "source": [ "# Get list of numeric and nonnumeric variables\n", "numvars = list(titanic.columns[titanic.dtypes != \"object\"])\n", "nonnumvars = list(titanic.columns[titanic.dtypes == 
\"object\"])\n", "print(numvars)\n", "print(nonnumvars)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']\n", "['Sex', 'Cabin', 'Embarked']\n" ] } ], "source": [ "# Do some further exploration on list to get list of features used\n", "numvars.remove('PassengerId')\n", "numvars.remove('Survived')\n", "numfeats = numvars\n", "print(numfeats)\n", "\n", "#nonnumvars.remove('Cabin')\n", "nonnumvars.remove('Name')\n", "nonnumvars.remove('Ticket')\n", "nonnumfeats = nonnumvars\n", "print(nonnumfeats)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import seaborn as sns\n", "sns.set(style=\"whitegrid\")\n", "ax = sns.boxplot(y=titanic[\"Age\"])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "titanic['Age'].hist(bins=10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handle missing values" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Age SibSp \\\n", "count 891.000000 891.000000 891.000000 891.000000 891.000000 \n", "mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n", "std 257.353842 0.486592 0.836071 13.002015 1.102743 \n", "min 1.000000 0.000000 1.000000 0.420000 0.000000 \n", "25% 223.500000 0.000000 2.000000 22.000000 0.000000 \n", "50% 446.000000 0.000000 3.000000 29.699118 0.000000 \n", "75% 668.500000 1.000000 3.000000 35.000000 1.000000 \n", "max 891.000000 1.000000 3.000000 80.000000 8.000000 \n", "\n", " Parch Fare \n", "count 891.000000 891.000000 \n", "mean 0.381594 32.204208 \n", "std 0.806057 49.693429 \n", "min 0.000000 0.000000 \n", "25% 0.000000 7.910400 \n", "50% 0.000000 14.454200 \n", "75% 0.000000 31.000000 \n", "max 6.000000 512.329200 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Age has missing values; fill them with the column mean\n", "titanic['Age'].fillna(titanic['Age'].mean(), inplace=True)\n", "model_mean_age = titanic['Age'].mean()\n", "titanic.describe()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 No S \n", "1 0 PC 17599 71.2833 Yes C " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For the cabin parameter there are over 600 missing values\n", "# We will replace the Cabin value with No if missing and Yes if there is a cabin number\n", "titanic['Cabin'].fillna('No', inplace=True)\n", "titanic['Cabin'].replace(regex=r'^((?!No).)*$',value='Yes',inplace=True)\n", "titanic.head(2)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 No S \n", "1 0 PC 17599 71.2833 Yes C " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Embarked has 2 missing values\n", "# Fill them with the mode (the most frequent value)\n", "titanic['Embarked'].fillna(titanic['Embarked'].mode()[0], inplace=True)\n", "model_embarked_mode = titanic['Embarked'].mode()[0]\n", "titanic.head(2)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked\n", "0 0 3 male 22.0 1 0 7.2500 No S\n", "1 1 1 female 38.0 1 0 71.2833 Yes C" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop the PassengerId, Name, and Ticket columns\n", "titanic = titanic.drop([\"PassengerId\",\"Name\",\"Ticket\"],axis=1)\n", "titanic.head(2)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass Age SibSp Parch Fare Sex_female Sex_male \\\n", "0 0 3 22.0 1 0 7.2500 0 1 \n", "1 1 1 38.0 1 0 71.2833 1 0 \n", "\n", " Cabin_No Cabin_Yes Embarked_C Embarked_Q Embarked_S \n", "0 1 0 0 0 1 \n", "1 0 1 1 0 0 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Encode all the categorical variables\n", "titanicdf = pd.get_dummies(titanic,columns=nonnumfeats)\n", "titanicdf.head(2)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Survived int64\n", "Pclass int64\n", "Age float64\n", "SibSp int64\n", "Parch int64\n", "Fare float64\n", "Sex_female uint8\n", "Sex_male uint8\n", "Cabin_No uint8\n", "Cabin_Yes uint8\n", "Embarked_C uint8\n", "Embarked_Q uint8\n", "Embarked_S uint8\n", "dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanicdf.dtypes" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Survived Pclass Age SibSp Parch Fare \\\n", "count 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 \n", "mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208 \n", "std 0.486592 0.836071 13.002015 1.102743 0.806057 49.693429 \n", "min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000 \n", "25% 0.000000 2.000000 22.000000 0.000000 0.000000 7.910400 \n", "50% 0.000000 3.000000 29.699118 0.000000 0.000000 14.454200 \n", "75% 1.000000 3.000000 35.000000 1.000000 0.000000 31.000000 \n", "max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200 \n", "\n", " Sex_female Sex_male Cabin_No Cabin_Yes Embarked_C Embarked_Q \\\n", "count 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 \n", "mean 0.352413 0.647587 0.771044 0.228956 0.188552 0.086420 \n", "std 0.477990 0.477990 0.420397 0.420397 0.391372 0.281141 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 \n", "50% 0.000000 1.000000 1.000000 0.000000 0.000000 0.000000 \n", "75% 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 \n", "max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 \n", "\n", " Embarked_S \n", "count 891.000000 \n", "mean 0.725028 \n", "std 0.446751 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 1.000000 \n", "75% 1.000000 \n", "max 1.000000 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanicdf.describe()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Survived 1.000000\n", "Sex_female 0.543351\n", "Cabin_Yes 0.316912\n", "Fare 0.257307\n", "Embarked_C 0.168240\n", "Parch 0.081629\n", "Embarked_Q 0.003650\n", "SibSp -0.035322\n", "Age -0.069809\n", "Embarked_S -0.149683\n", "Cabin_No -0.316912\n", "Pclass -0.338481\n", "Sex_male -0.543351\n", "Name: Survived, dtype: float64\n" ] } ], "source": [ "# Since 
all values are numeric, compute the correlation matrix and sort it to find the features most strongly correlated with Survived\n", "corr = titanicdf.corr()\n", "corr.sort_values([\"Survived\"], ascending = False, inplace = True)\n", "print(corr.Survived)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split Data into Train and Test Sets" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "y = titanicdf[\"Survived\"].values\n", "X = titanicdf.drop([\"Survived\"],axis=1).values\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: Logistic Regression" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Train Model with Logistic Regression\n", "from sklearn.linear_model import LogisticRegression\n", "LogisticRegression_Model = LogisticRegression()\n", "LogisticRegression_Model.fit(X_train,y_train)\n", "Y_prediction = LogisticRegression_Model.predict(X_test)\n", "LogisticRegression_Model.score(X_train, y_train)\n", "# Accuracy on the training set, in percent\n", "acc_LR = round(LogisticRegression_Model.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: CART Classification Tree" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "CART_Model = DecisionTreeClassifier()\n", "CART_Model.fit(X_train,y_train)\n", "Y_prediction = CART_Model.predict(X_test)\n", "CART_Model.score(X_train, y_train)\n", "acc_CART = round(CART_Model.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: SVM" ] }, { "cell_type": "code", "execution_count": 25, 
"metadata": {}, "outputs": [], "source": [ "from sklearn.svm import SVC\n", "SVM_Model = SVC()\n", "SVM_Model.fit(X_train, y_train)\n", "Y_prediction = SVM_Model.predict(X_test)\n", "SVM_Model.score(X_train, y_train)\n", "acc_SVM = round(SVM_Model.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: K-Nearest Neighbour" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "K_NearestNeighbour = np.arange(1,20)\n", "train_accuracy = np.empty(len(K_NearestNeighbour))\n", "test_accuracy = np.empty(len(K_NearestNeighbour))\n", "#Loop over the different values of K\n", "for i,k in enumerate(K_NearestNeighbour):\n", " #Setup K-NN classifier\n", " knn_classifier = KNeighborsClassifier(n_neighbors = k)\n", " #train the model\n", " knn_classifier.fit(X_train, y_train)\n", " #Compute accuracy of training set\n", " train_accuracy[i] = knn_classifier.score(X_train, y_train)\n", " #Compute accuracy of test set\n", " test_accuracy[i] = knn_classifier.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Checking accuracy of K**" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.title('K-NN varying number of neighbors')\n", "plt.plot(K_NearestNeighbour, test_accuracy, label = 'Testing Accuracy')\n", "plt.plot(K_NearestNeighbour, train_accuracy, label = 'Training Accuracy')\n", "plt.legend()\n", "plt.xlabel('No. of neighbors')\n", "plt.ylabel('Accuracy')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the above graph, I will re-run the test, however I will limit it down to below 9" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "K_NearestNeighbour_run2 = np.arange(1,9)\n", "train_accuracy = np.empty(len(K_NearestNeighbour_run2))\n", "test_accuracy = np.empty(len(K_NearestNeighbour_run2))\n", "#Loop over the different values of K\n", "for i,k in enumerate(K_NearestNeighbour_run2):\n", " #Setup K-NN classifier\n", " knn_classifier = KNeighborsClassifier(n_neighbors = k)\n", " #train the model\n", " knn_classifier.fit(X_train, y_train)\n", " #Compute accuracy of training set\n", " train_accuracy[i] = knn_classifier.score(X_train, y_train)\n", " #Compute accuracy of test set\n", " test_accuracy[i] = knn_classifier.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.title('K-NN varying number of neighbors')\n", "plt.plot(K_NearestNeighbour_run2 , test_accuracy, label = 'Testing Accuracy')\n", "plt.plot(K_NearestNeighbour_run2 , train_accuracy, label = 'Training Accuracy')\n", "plt.legend()\n", "plt.xlabel('No. of neighbors')\n", "plt.ylabel('Accuracy')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems like we have a good accuracy with 2 so I will take that as my K value" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "K_NearestNeighbour_Model = KNeighborsClassifier(n_neighbors = 2)\n", "K_NearestNeighbour_Model.fit(X_train, y_train)\n", "Y_prediction = K_NearestNeighbour_Model.predict(X_test)\n", "K_NearestNeighbour_Model.score(X_train, y_train)\n", "acc_KNN = round(K_NearestNeighbour_Model.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: Naive Bayes" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from sklearn import metrics\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.metrics import confusion_matrix\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import roc_curve\n", "from sklearn.metrics import roc_auc_score\n", "NaiveBayes_Model = GaussianNB()\n", "NaiveBayes_Model.fit(X_train, y_train)\n", "Y_prediction = NaiveBayes_Model.predict(X_test)\n", "NaiveBayes_Model.score(X_train, y_train)\n", "acc_NB = round(NaiveBayes_Model.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: Stochastic Gradient Descent" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import SGDClassifier\n", "SGD_Model = SGDClassifier()\n", "SGD_Model.fit(X_train, y_train)\n", 
"Y_prediction = SGD_Model.predict(X_test)\n", "SGD_Model.score(X_train, y_train)\n", "acc_SGD = round(SGD_Model.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: Gradient Boosting Classifier" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Learning rate: 0.1\n", "Accuracy score (training): 0.905\n", "Accuracy score (validation): 0.811\n", "\n", "Learning rate: 0.2\n", "Accuracy score (training): 0.933\n", "Accuracy score (validation): 0.811\n", "\n", "Learning rate: 0.3\n", "Accuracy score (training): 0.950\n", "Accuracy score (validation): 0.811\n", "\n", "Learning rate: 0.4\n", "Accuracy score (training): 0.964\n", "Accuracy score (validation): 0.811\n", "\n", "Learning rate: 0.45\n", "Accuracy score (training): 0.970\n", "Accuracy score (validation): 0.756\n", "\n", "Learning rate: 0.5\n", "Accuracy score (training): 0.969\n", "Accuracy score (validation): 0.778\n", "\n", "Learning rate: 0.6\n", "Accuracy score (training): 0.975\n", "Accuracy score (validation): 0.756\n", "\n", "Learning rate: 0.75\n", "Accuracy score (training): 0.981\n", "Accuracy score (validation): 0.789\n", "\n", "Learning rate: 0.8\n", "Accuracy score (training): 0.978\n", "Accuracy score (validation): 0.811\n", "\n", "Learning rate: 1\n", "Accuracy score (training): 0.954\n", "Accuracy score (validation): 0.822\n", "\n" ] } ], "source": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc\n", "learning_rates = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.75, 0.8, 1]\n", "for learning_rate in learning_rates:\n", " gb = GradientBoostingClassifier(n_estimators=63, learning_rate = learning_rate, max_features=2, max_depth = 5, random_state = 0)\n", " gb.fit(X_train, y_train)\n", " print(\"Learning rate: \", learning_rate)\n", " print(\"Accuracy 
score (training): {0:.3f}\".format(gb.score(X_train, y_train)))\n", " print(\"Accuracy score (validation): {0:.3f}\".format(gb.score(X_test, y_test)))\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the output above, the validation accuracy is highest at a learning rate of 1 (0.822), while a learning rate of 0.5 gives 0.778. I will use a learning rate of 0.5 to train the model." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "GB_Model = GradientBoostingClassifier(n_estimators=63, learning_rate = 0.5, max_features=2, max_depth = 5, random_state = 0)\n", "GB_Model.fit(X_train, y_train)\n", "Y_prediction = GB_Model.predict(X_test)\n", "GB_Model.score(X_train, y_train)\n", "acc_GBC = round(GB_Model.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: Random Forest" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "random_forest = RandomForestClassifier(n_estimators=100)\n", "random_forest.fit(X_train, y_train)\n", "Y_prediction = random_forest.predict(X_test)\n", "random_forest.score(X_train, y_train)\n", "acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: Perceptron" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Perceptron\n", "perceptron = Perceptron(max_iter=5)\n", "perceptron.fit(X_train, y_train)\n", "\n", "Y_pred = perceptron.predict(X_test)\n", "\n", "acc_perceptron = round(perceptron.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model with Algorithm: Linear Support Vector Machine" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "from sklearn.svm import LinearSVC\n", "linear_svc = LinearSVC()\n", "linear_svc.fit(X_train, y_train)\n", "\n", "Y_pred = 
linear_svc.predict(X_test)\n", "\n", "acc_linear_svc = round(linear_svc.score(X_train, y_train) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Score and Evaluate Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: Logistic Regression" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Model
Score
98.63CART
98.63Random Forest
96.88GBC
88.51SVM
84.64KNN
79.15Logistic Regression
77.28Naive Bayes
76.78LSVM
75.66SGD
71.16Perceptron
\n", "
" ], "text/plain": [ " Model\n", "Score \n", "98.63 CART\n", "98.63 Random Forest\n", "96.88 GBC\n", "88.51 SVM\n", "84.64 KNN\n", "79.15 Linear Regression\n", "77.28 Naive Bayes\n", "76.78 LSVM\n", "75.66 SGD\n", "71.16 Perceptron" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = pd.DataFrame({\n", " 'Model': ['Linear Regression', 'CART','SVM', 'KNN', 'Naive Bayes','SGD','GBC', 'Random Forest','Perceptron','LSVM'],\n", " 'Score': [acc_LR,acc_CART,acc_SVM,acc_KNN,acc_NB,acc_SGD, acc_GBC, acc_random_forest, acc_perceptron, acc_linear_svc]})\n", "result_df = results.sort_values(by='Score', ascending=False)\n", "result_df = result_df.set_index('Score')\n", "result_df.head(10)\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "# score model for test set\n", "y_hat_LogisticRegression_Model = LogisticRegression_Model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_LogisticRegression_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1.4)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with Logistics Regression=86.67%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_LogisticRegression_Model)\n", "print(\"Accuracy score for the test set with Logistics Regression={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: CART" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# score model for test set\n", "y_hat_CART_Model = CART_Model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_CART_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with CART=80.00%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_CART_Model)\n", "print(\"Accuracy score for the test set with CART={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: SVM" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "# score model for test set\n", "y_hat_SVM_Model = SVM_Model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_SVM_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with SVM=77.78%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_SVM_Model)\n", "print(\"Accuracy score for the test set with SVM={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: K-Nearest Neighbour" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "y_hat_KNN_Model = K_NearestNeighbour_Model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_KNN_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with KNN=71.11%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_KNN_Model)\n", "print(\"Accuracy score for the test set with KNN={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: Naive Bayes" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "y_hat_NB_Model = NaiveBayes_Model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_NB_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with NB=83.33%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_NB_Model)\n", "print(\"Accuracy score for the test set with NB={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: Stochastic Gradient Descent" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "y_hat_SGD_Model = SGD_Model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_SGD_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with SGD=78.89%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_SGD_Model)\n", "print(\"Accuracy score for the test set with SGD={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: Gradient Boosting Classifier" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "y_hat_GB_Model = GB_Model.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_GB_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with GB=77.78%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_GB_Model)\n", "print(\"Accuracy score for the test set with GB={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: Random Forest" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "y_hat_RF_Model = random_forest.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_RF_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with RF=76.67%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_RF_Model)\n", "print(\"Accuracy score for the test set with RF={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with: Perceptron" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "y_hat_Per_Model = perceptron.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_Per_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with Per=75.56%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_Per_Model)\n", "print(\"Accuracy score for the test set with Per={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Score Model and Evaluate Model with:LinearSVM" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "y_hat_LSVM_Model = linear_svc.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXYAAAFKCAYAAAAJ5nSzAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3deVxU9f4/8NdhEQgEUcEtJBUVQkMtM8nke10QZXVLM9G+3ZtZgqVeXAiX65IoplclK29+DUlNLBAlxQXLckm9ZRmlXSiBUGRJYEBkGZjfH97mFyEHBzhz5hxfzx7n8WDOzDmfN4ov3n3Omc8IOp1OByIiUg0zuQsgIqLWxWAnIlIZBjsRkcow2ImIVIbBTkSkMgx2IiKVYbArSFpaGgYOHFhvX1lZGVasWIFhw4Zh4MCBeOWVV5CTkyNThSSHe/1cfP/99+jbt2+Dbd26dTJVScZkIXcBdH+++eYbRERENNi/YMEC/Pjjj/j73/8OR0dHbNu2DaGhofj0009hZ2cnQ6VkTI39XPz000946KGHsHPnznr7nZ2djVUayYjBbuKqq6sRFxeHzZs346GHHkJNTY3+uczMTJw6dQpbt26Fr68vAMDNzQ0jRoxAWloagoOD5SqbJCb2cwHcDfbevXtjwIABMlVIcuJUjIn74osvsH37dixcuBDTp0+v95yLiwsSEhLg4+Oj32dpaQng7j98Ui+xnwvgbrD37dtXhsrIFDDYTVz//v2RlpaGGTNmQBCEes9ZWVnBy8sLVlZW0Gq1yMzMRGRkJDp27IhRo0bJVDEZg9jPBQD85z//QV5eHoKDg9GvXz+MHj0aSUlJMlRKcuBUjInr1KnTfb1u6dKlSExMhJmZGd588004OjpKXBnJSeznIj8/H8XFxcjOzsb8+fPh4OCAlJQULF68GIIgICQkxIiVkhwkCfaLFy+KPj948GAphn2gPffccwgODsaJEyewePFiaLVaTJ48We6ySAb29vZ4//330bdvX/3FUm9vbxQUFCA2NpbB/gCQJNi3bNkCACgpKUFOTg4GDRoEMzMzXLp0CX369MFHH33U7HP/cP12a5WpOAWaatTV6e75Z2DeoRfadgDG/29/ZGRdx9a338Wj3uNkqFIeVdpauUuQTV5JFWrrdPgmW6PfZ9vdC7l3gNw/7OvRbzC+/PJLnL16E9Y2D8lRqtENcrVv8TlsBoYZfMydS7EtHrclJJljj4+PR3x8PDp37oyDBw9i586d2LFjBw4dOgRbW1sphnwg3byRi7QjB/DnlZd7uLnjVlGBTFWR3PJys3E85RPU/OkCenVVFdpYWcHK2kamyshYJL14euPGDbi6uuofd+3aFTdu3JByyAfKjdxsvB2zEunf/lu/T6fT4buvv0L3nm4yVkZyulVUiP/bEo1vL5zR79PpdLh45jO49xt4z4utJEIwM3yTmaQXTz09PbFo0SKMHTsWOp0Ohw4dwhNPPCHlkA8Ur8eHoM+j/bF13XI8/9c5aOvQDmmHk3E1/TtERW+VuzySiUf/gejbbwDe3xKN8vIyOLbvgLRPk5DzSyaWb/qX3OUpjwJ/EUoa7KtXr8aHH36on1P39vbGtGnTpBzygWJuboE31mzGh+9vRfz2LSgv06BnHw8sj9mG/gN5gfpBZWZujr+v2ICPdm7Dx7veQ5mmFD3c+iIyOha9+jwqd3nKYwIduKEEqT8aLzc3F5mZmRg2bBjy8vLg4uLSovM9yBdPqXEP8sVTalyrXDwdPN/gY+5c3NjicVtC0l9Fhw8fxiuvvII1a9agtLQUU6dORXJyspRDEhG1LgXOsUtawb/+9S/s3bsXtra26NChA5KSkrB9+3YphyQial2CYPgmM0nn2M3MzOqtMOjs7AwzM/l/mxER3TcT6MANJWmw9+7dGx9++CG0Wi2uXLmCPXv2wN3dXcohiYhalwl04IaS9FfRsmXLkJ+fDysrK0RGRsLOzg7Lly+
XckgiotalwDl2STv2/fv344UXXsCCBQukHIaISDoK7NglDfabN29i8uTJ6NmzJ4KCgjB69GjY2PDtzESkICbQgRtK0ooXLVqEkydPYvbs2fj2228REhKChQsXSjkkEVHr4l0xDel0OtTU1KCmpgaCIOg/4YeISBEU2LFLvqTA8ePH4eHhgaCgIERFRcHKykrKIYmIWheDvT5XV1ckJSWhffv2Ug5DRCQdM/mnVgwlSbDv27cPU6ZMQWlpKfbs2dPg+bAwwxeuJyKShQI7dkkqlnhdMSIiEiFJxz516lQAQNu2bREQEIAOHTpIMQwRkfRM4C4XQ/E+diIiMZyKqY/3sROR4vE+9oZ4HzsRKZoCO3bex05EJMYEOnBDSRrs7du3533sRKRsCuzYJa340KFDDHUiUjbOsdfn5uaG2NhYeHl5wdraWr9/8ODBUg5LRNR6FNixSxrsJSUlOH/+PM6fP6/fJwgCdu3aJeWwREStxwQ6cENJGuzx8fFSnp6ISHrs2OsLDQ2FcI/fduzYiUgxGOz1hYeH67/WarVIS0uDvb29lEMSEbUuTsXU9+STT9Z77O3tjcmTJ+O1116TclgiotbDjr2+Gzdu6L/W6XTIzMxESUmJlEMSEbUuduz1TZ8+XT/HLggCHB0dERUVJeWQREStix37//fZZ5/hgw8+QPfu3XH8+HF8/PHHePTRR+Ht7S3VkERErU+BHbskv4p27NiB2NhYVFdX4+rVq4iIiMCoUaNQWlqK9evXSzEkERH9lyQde3JyMvbt2wcbGxts2LABI0aMwOTJk6HT6TBu3DgphiQiksS9btk2dZJ07IIg6D9Q4/z583jmmWf0+4mIlEQQBIM3uUnSsZubm0Oj0aCiogJXrlzB008/DQC4fv06LCwkXwKeiKj1yJ/TBpMkZWfNmoWQkBBotVpMmjQJzs7OOHz4MDZt2oQ5c+ZIMSQRkSRMoQM3lCTB7ufnh4EDB6K4uBju7u4AAFtbW6xevRpDhgyRYkgiIkkYO9hDQ0Nx69Yt/ezGypUrkZOTg3feeQdarRYzZ87E888/L3oOyeZFOnXqhE6dOukf+/j4SDUUEZFkjBnsOp0OWVlZ+Oyzz/TBnp+fj3nz5iExMRFt2rTB1KlTMWTIELi5uTV6Hk54ExGJMGaw//LLLwCAF198ESUlJXj22Wdha2uLp556Cu3atQMAjBkzBqmpqQgLC2v0PAx2IiIxzch1jUYDjUbTYL+9vb3oQogajQZDhw7F0qVLUVNTgxkzZmDs2LFwcnLSv8bZ2RmXL18WHZ/BTkQkojkde1xcHGJjYxvsDwsLq7fq7Z8NHDgQAwcO1D+eNGkS1q5di1deeUW/T6fTNVkTg52ISERzgn3mzJkYP358g/1NLVv+73//GzU1NRg6dCiAuyHerVs3FBYW6l9TWFgIZ2dn0fMob3UbIiIjas4blOzt7fHwww832JoK9rKyMqxfvx5VVVUoLy9HUlISYmJicO7cOdy6dQt37tzBsWPHMHz4cNHzsGMnIhJhzIunf/nLX/Ddd98hJCQEdXV1mDZtGh5//HHMmzcPM2bMQE1NDSZNmoTHHntM9DyCTqfTGanmVvHD9dtyl0AmqEpbK3cJZIIGubb8E9s6zNxr8DG/xT3X4nFbgh07EZEIvvOUiEhlGOxERCqjxGDnXTFERCrDjp2ISIzyGnYGOxGRGCVOxTDYiYhEMNiJiFSGwU5EpDIMdiIitVFerjPYiYjEsGMnIlIZBjsRkcow2ImI1EZ5uc5gJyISw46diEhlGOxERCrDYCciUhkGOxGR2igv1xnsRERilNix84M2iIhUhh07EZEIJXbsDHYiIhEKzHUGOxGRGHbsREQqo8BcZ7ATEYlhx05EpDIKzHUGOxGRGDMz5SU7g52ISAQ7diIileEcOxGRyigw1xnsRERi2LETEakMg52ISGUUmOsMdiIiMezYiYhURoG5zmAnIhLDjp2ISGUUmOv8BCUiIrVhx05
EJEKJUzHs2ImIRAiC4VtrWLduHRYvXgwAuHLlCiZMmIAxY8bgjTfegFarFT2WwU5EJEIQBIO3ljp37hySkpL0jyMiIrBs2TIcPXoUOp0OCQkJoscz2ImIRDSnY9doNMjNzW2waTSaJscrKSnBpk2bMHv2bADA9evXUVlZiQEDBgAAJkyYgNTUVNFzcI6diEhEczrwuLg4xMbGNtgfFhaG8PBw0WOXLVuGefPmIS8vDwBQUFAAJycn/fNOTk7Iz88XPQeDnYhIRHNmVmbOnInx48c32G9vby963P79+9GlSxcMHToUiYmJAIC6urp6v1x0Ol2Tv2wY7EREIprTsdvb2zcZ4vdy+PBhFBYWIjg4GKWlpaioqIAgCCgsLNS/pqioCM7OzqLnYbATEYkw5t2OO3fu1H+dmJiICxcuYO3atQgICMDXX3+Nxx9/HMnJyRg+fLjoeRjsREQiTOE+9g0bNiAqKgrl5eXw9PTEjBkzRF8v6HQ6nZFqaxU/XL8tdwlkgqq0tXKXQCZokKvh0yF/NnzjGYOP+WL+0y0etyXYsRMRiTCBht1gDHYiIhGmMBVjKAY7EZEIBeY6g52ISAw7diIilVFgrjPYiYjEmCkw2bkIGBGRyrBjJyISocCGncFORCSGF0+JiFTGTHm5zmAnIhLDjp2ISGUUmOsMdiIiMQKUl+wMdiIiEZxjJyJSGc6xExGpjAJzncFORCRGiUsKMNiJiEQoMNcZ7EREYjjHTkSkMgrMdQY7EZEYzrETEamM8mKdwU5EJEpVc+yrV68WPTAqKqrViyEiopZrNNjbtWtnzDqIiEySqpYUCAsLa/SgiooKSYohIjI1qpqK+d2JEyewZcsWVFRUQKfToa6uDiUlJbh06ZIx6iMikpUCc73pYF+/fj1ef/117N27Fy+99BJOnDgBW1tbY9RGRCQ7JXbsZk29wMbGBuPGjcOAAQNgZWWFFStW4PPPPzdCaURE8jMTDN/k1mSwW1lZobq6Gt27d8eVK1dgZmamyN9gRETNIQiCwZvcmpyKGTFiBGbNmoV169ZhypQp+Prrr+Ho6GiM2oiIZCd/TBuuyWCfPXs2goKC0KlTJ2zbtg0XL15EQECAMWojIpKdKpcU+OGHHwAAxcXFAIAnnngCN2/eRIcOHaStjIjIBCgw15sO9vDwcP3XNTU1KCoqgqenJz7++GNJCyMiMgWmMGduqCaD/eTJk/Uenz9/HocOHZKsICIiU6LAXG/6rpg/GzJkiH56hohI7cwEweBNbvc9xw4AOp0O6enpqKyslLQoIiJTYQI5bTCD5tgFQUCHDh2wYsUKKWsS1asT3/VKDTkObnxtI3pw3bkU2+JzqHKOfc+ePejcuXO9fZmZmZIVRERkSgyerzYBjdZcUlKCkpISzJo1C6WlpSgpKUFpaSmKiopEV34kIlITY7/zdPPmzRg3bhz8/f2xc+dOAMDZs2cRGBgIX19fbNq0qclzNNqxL1iwAGfOnAFw94Lp78zNzeHn59eiwomIqKELFy7gq6++wsGDB6HVajFu3DgMHToUkZGRiI+PR5cuXfDyyy/j1KlT8PHxafQ8jQb7jh07AABLlizB2rVrW/87ICJSAGMu6vXkk09i165dsLCwQH5+Pmpra6HRaODq6goXFxcAQGBgIFJTU0WDvcnpo9dee01/sfSXX37Bq6++iqKiotb5LoiITFxzVnfUaDTIzc1tsGk0mibHs7S0xJYtW+Dv74+hQ4eioKAATk5O+uednZ2Rn58vXnNTgyxevBg9e/YEAHTr1g1PPvkklixZ0mRxRERq0Jw59ri4OIwcObLBFhcXd19jzp07F+fOnUNeXh6ysrLqzdvrdLom5/GbvCumuLgYM2bMAHB3Cd8XXngBBw4cuK/iiIiUrjlTMTNnzsT48eMb7Le3txc97ueff0Z1dTU8PDxgY2MDX19fpKamwtzcXP+awsJCODs7i9fcVIG1tbX12v6ioiLodLqmDiMiUgVBMHy
zt7fHww8/3GBrKthzc3MRFRWF6upqVFdXIy0tDVOnTsW1a9eQnZ2N2tpapKSkYPjw4aLnabJjf+GFFxASEoJnnnkGAHDu3DksXLjQgD8WIiLlMuYSAT4+Prh8+TJCQkJgbm4OX19f+Pv7o3379ggPD0dVVRV8fHyavDNR0N1H+3316lV89dVXMDc3R2lpKU6dOoX9+/e32jdjiEqtLMOSieM7T+leWuOdp5GH/2PwMW+O69PicVuiyY4dALp06YLq6mrs3r0bFRUVCA0NlbouIiKToMAVBcSD/ZdffkFcXBwOHjyIbt26obKyEidPnkTbtm2NVR8RkaxMYbVGQzV68XTWrFmYPn06LC0tsWvXLqSkpMDW1pahTkQPlOZcPJVbox37jz/+CE9PT/Tu3Ruurq4AlLnKGRFRSxjznaetpdGO/fPPP8f48eORkpKCYcOGYe7cuaiqqjJmbUREslPiB200GuwWFhYYN24c4uPjkZiYCGdnZ1RVVcHX1xd79+41Zo1ERLJR4lTMfS017ObmhqioKHzxxRf461//ioSEBKnrIiIyCc1ZK0Zu93W74+9sbGwwZcoUTJkyRap6iIhMigATSGoDGRTsREQPGlPowA2lxE99IiIiEezYiYhEKLFjZ7ATEYlQ4vt3GOxERCLYsRMRqYwCG3YGOxGRGFN4J6mhGOxERCI4FUNEpDIKbNgZ7EREYsz4zlMiInVhx05EpDKcYyciUhneFUNEpDIKzHUGOxGRGHbsREQqo8BcZ7ATEYlR4trmDHYiIhFc3ZGISGWUF+vK/L8MIiISwY6diEgE74ohIlIZ5cU6g52ISJQCG3YGOxGRGN4VQ0SkMkq8w4TBTkQkgh07EZHKKC/WGexERKLYsRMRqQzn2ImIVIYdOxGRyigv1hnsRESiFNiwK3L6iIjIaMwgGLy1RGxsLPz9/eHv74/169cDAM6ePYvAwED4+vpi06ZN91EzERE1ShAM35rr7NmzOH36NJKSknDgwAH88MMPSElJQWRkJLZt24bDhw8jPT0dp06dEj0Pp2KIiEQIzejANRoNNBpNg/329vawt7dv9DgnJycsXrwYbdq0AQD06tULWVlZcHV1hYuLCwAgMDAQqamp8PHxafQ8DHYiIhHN6cDj4uIQGxvbYH9YWBjCw8MbPa537976r7OysnDkyBFMnz4dTk5O+v3Ozs7Iz88XHZ/BTkTUymbOnInx48c32C/Wrf9RRkYGXn75ZSxcuBDm5ubIysrSP6fT6Zq8BZPBTkQkojkXQ5uachHz9ddfY+7cuYiMjIS/vz8uXLiAwsJC/fOFhYVwdnYWPQcvnhIRiTDmxdO8vDzMmTMHGzZsgL+/PwDAy8sL165dQ3Z2Nmpra5GSkoLhw4eLnocdOxGRCGPex75jxw5UVVUhOjpav2/q1KmIjo5GeHg4qqqq4OPjAz8/P9HzCDqdTid1sa2pUit3BWSKHAeHyV0CmaA7lxpewDTU8StFBh8z2qNji8dtCXbsREQizBT4zlMGOxGRiObcxy43BjsRkQglrhXDYCciEsGOnYhIZTjHTkSkMuzYiYhUhnPsREQqo8BcZ7ATEYkxU2DLzmAnIhKhvFhnsBMRiVNgsjPYiYhEKPGuGC7bS0SkMuzYiYhEKPDaKYOdiEiMAnOdwU5EJEqByc5gJyISocSLpwx2IiIRnGMnIlIZBeY6g52ISJQCk53BTkQkgnPsREQqwzl2IiKVUWCuM9iJiEQpMNkZ7EREIjjHTkSkMpxjJyJSGQXmOoOdiEiUApOdwU5EJEKJc+z8oA0iIpVhx05EJIIXT4mIVEaBuc5gJyISpcBkZ7ATEYlQ4sVTBjsRkQjOsRMRqYwCc53BTkQkSoHJzmAnIhLBOXYiIpXhHDsRkcooMNe5pAARkSihGVsLlZeXIyAgALm5uQCAs2fPIjAwEL6+vti0aVOTxzPYiYhECM34ryW+++47PPfcc8jKygIAVFZWIjIyEtu2bcPhw4e
Rnp6OU6dOiZ6DwU5EJEIQDN80Gg1yc3MbbBqNpsnxEhISsHz5cjg7OwMALl++DFdXV7i4uMDCwgKBgYFITU0VPQfn2ImIRDSn/46Li0NsbGyD/WFhYQgPDxc9ds2aNfUeFxQUwMnJSf/Y2dkZ+fn5oudgsBMRiWlGss+cORPjx49vsN/e3t7gc9XV1UH4w605Op2u3uN7YbATEYlozpy5vb19s0L8Xjp37ozCwkL948LCQv00TWM4x05EZMK8vLxw7do1ZGdno7a2FikpKRg+fLjoMezYiYhEyP0GJSsrK0RHRyM8PBxVVVXw8fGBn5+f6DGCTqfTGam+VlGplbsCMkWOg8PkLoFM0J1LDS9gGurXW1UGH+PS3qrF47YEO3YiIhFyd+zNwWAnIhKlvGRnsBMRiWDHTkSkMgrMdQY7EZEYduxERCrDD9ogIlIb5eU6g52ISIwCc53BTkQkhnPsREQqwzl2IiK1UV6uM9iJiMQoMNcZ7EREYjjHTkSkMpxjJyJSGSV27PwEJSIilWGwExGpDKdiiIhEKHEqhsFORCSCF0+JiFSGHTsRkcooMNcZ7EpTU12N9955GymHDqK4pBj9+z+GBRGL4PGop9ylkRG1d7DF9c/XNdifdOISpkXsgJOjHdYtmAC/Z/oBAD6/8BMWb0xCTt4tY5eqfApMdga7wsSsW4uUQ8l4ff7f8bBLd+zZHY+//e8M7E86iK5du8ldHhlJ/z53/64DXolF2e1K/f7fSm/D0sIch9+bi04d2yJqczJ+vXkLrz73P/jsg/kY/Oxa3Cq9LVfZisQ5dpJUWVkZPvl4P16btwDPTp0GABj0+BPweXoIUg4mY9bsV2WukIylf++uuFmkQdpXVxs8FzJyAPr17orAV9/GiXNXAACnLmbg8oGlWPDCKLyxOdnY5SqaEufYeR+7gtjY2ODDjxIQMn6Cfp+FhQUgCKiurpaxMjK2fr27IT3j+j2fc3N1hlZbi88u/KTfV12jxdc/ZGO096PGKlE1hGZscmPHriAWFhbw8Lj7D7Ourg43blzHO29vhQABAYFBMldHxtSvTzdUVdXgsw/mY4C7C34rKce2vaewMe4Ecm8Ww8LCHF2dHPDrzWL9Ma7dOsK1a3sZq1YoU0hqAzHYFWr7u9vwzttbAQCvhs3FIz16ylwRGYsgCPDo0Rm3K6uwZNMB/Jp3C37DPLEyPAhWVhbYnvAlCovLsGP1DISv+QiFt8rxylQfePbqAksLc7nLVxzOsf+Xu7s7hD9MTFlYWMDc3BxVVVWws7PDxYsXpRj2gTJi5Cg8MfhJXLxwHtvf3YaamhqEzX1d7rLICAQBmPDau/j15i388msRAOCLf2fA9iErLHhhNDZ+cAJT5/8LO1bPwLeJSwEAn576HjuTzmJ64BA5S1ckJc6xCzqdTifVyZcvX45BgwYhKCgIgiDg6NGj+PLLL7F69WqphnwgRUdHY/fu3fjmm29gaWkpdzlEJDNJL55evnwZwcHB+u59zJgxSE9Pl3JIVSssLMQnn3yC8vLyevs9PDxQXV2NkpISmSojIlMiabDb2Njgk08+QUVFBcrLy7F79244ODhIOaSqaTQaREZG4ujRo/X2nzlzBh06dECHDh1kqoyITImkF09jYmKwatUqrF69GmZmZvD29sb69eulHFLVevXqhTFjxmDdunWoqamBi4sLjh07huTkZLz55pswM+Pdq0Qk8Rz770pKStCuXTuph3kg3LlzB7GxsThy5AgKCgrg5uaG2bNnw8/PT+7SiMhESBrsV65cwbx581BZWYl9+/Zh+vTp+Oc//wlPT65rQkQkFUn/33316tV4++230a5dO3Tq1AkrVqzA8uXLpRySiOiBJ2mw37lzB7169dI/fvrpp/nWdyIiiUka7O3atcPVq1f1tzsePHiQd8UQEUlM0jn2nJwcLFq0CN9//z2sra3h6uqKmJgY9OzJt78TEUnFKHfFVFRUoK6uDnZ2dlIPRUT0wJP
kPvalS5di1apVCA0NrbdmzO927dolxbBERASJgn3KlCkAAF9fXzg5OcHKygq3bt2Ci4uLFMMREdEfSBLsXbp0wfPPP4+MjAw88sgjAIBr165hwIAB2LhxoxRDEhHRf0kyxx4ZGYmOHTsiPDxcv9pgdXU1tm7disLCQkRHR7f2kIqQmpqK7du3Q6vVQqfTITg4GH/7299adM69e/cCAJ577rkWnSc0NBRhYWEYMoTLupqq3Nxc+Pn56W8hrqysxKBBg7BgwQLk5eXho48+wpo1a+77fH379sVPP/3U9AtJcSTp2C9duoQjR47U29emTRvMnz8fwcHBUgxp8vLz87Fu3TokJibC0dERt2/fRmhoKHr06IGRI0c2+7wtDXRSFmdnZyQn3/3MUp1Oh40bN2Lu3LnYs2cP+vfvL3N1ZCokCXYrK6t77hcE4YFdqKq4uBg1NTWorLz7ifK2traIjo6GlZUVRowYgV27duHhhx/G+fPnERsbi/j4eISGhsLBwQEZGRkIDAxEcXExli69+8EJ0dHR6Ny5M8rKygAADg4OyM7ObvD85MmTsXLlSmRkZKC2thYvvfQSAgICUF1djTfeeAPp6eno1q0biouL7104mSxBEBAeHo6nn34au3btwvHjxxEfH4/s7GysWLECJSUlsLa2xtKlS/Hoo48iNzcXERERqKiogJeXl9zlk4QkSdl73QlzP8+pmbu7O0aOHIlRo0Zh0qRJiImJQV1dHVxdXUWP69u3L44ePYpp06bh+PHjqK2thU6nw7Fjx+Dv769/XUBAwD2ff+edd+Dp6YnExETs3r0b7777Ln799VfEx8cDAI4cOYKoqCjk5ORI+v2TNNq0aQNXV1d07NhRv2/RokWIiIhAUlISVq1ahXnz5gEAVq1ahQkTJiA5ORmDBg2Sq2QyAkk69oyMjHtOL+h0OhQWFkoxpCL84x//wKuvvorTp0/j9OnTePbZZ7FhwwbRYx577DEAQPv27eHu7o7z58/D0tISPXr0gJOTk/51jT1/9uxZVFZW4pNPPgFw9z0FGRkZuHDhgv7upUceeQQDBw6U6LsmqQmCAGtrawDA7du3kZ6ejiVLluifr6ioQHFxMS5cuIC33rKkNbgAAAV2SURBVHoLABAUFISoqChZ6iXpSRLsf/4gCAI+//xzVFRUYNy4cZg4cSImTpyIhIQEfPzxxwDu/tIDAK1WW++43//BAkBwcDAOHz4MS0tLBAYGNhjjXs/X1dUhJiZGv6JmUVERHBwckJCQgD9eN7ew4OeaK1F1dTWuXbuG3377DcDdv+82bdro5+EB4ObNm/pls3//O3+Qp0UfBJL8zXbr1k10exBZW1vjrbfeQm5uLoC7/8CuXLkCDw8PODo6IjMzEwCQlpbW6DlGjhyJixcv4syZMxg9evR9Pf/UU0/p75wpKChAUFAQ8vLyMHToUBw6dAh1dXW4fv06vvnmm9b+lklidXV12Lp1K7y8vNC9e3cAQNu2bfHII4/og/3MmTN4/vnnAQDe3t44ePAgAODYsWOoqqqSp3CSHNs0I3nqqacQFhaG2bNno6amBgDwzDPPYM6cORg0aBBWrVqF2NhYDBs2rNFzWFtbY9CgQaiuroatre19PR8WFoYVK1YgICAAtbW1iIiIQPfu3TFt2jRkZGRg7Nix6NatG/r06SPNN06tqqCgQH9nWV1dHTw8PLBx40ZcvXpV/5qYmBisWLEC77//PiwtLbFp0yYIgoBly5YhIiIC+/btQ79+/e75M0TqYJS1YoiIyHg4yUZEpDIMdiIilWGwExGpDIOdiEhlGOxERCrDYCejyM3NhYeHB4KDg/VbUFCQ/g1azfXyyy8jMTERwN03aGk0mkZfW1ZWhhkzZhg8RmpqKkJDQ5tdI5Gx8T52Mhpra+t674jMz89HQEAA+vXrB3d39xaf/4/nvpfS0lJ8//33LR6HyNQx2Ek2nTp1gqurK86cOYOVK1fizp07sLOzQ3x8PPbv34+
9e/eirq4O7dq1w9KlS9GrVy/k5+dj8eLFKCgoQNeuXfVvpQfuLph27tw5tG/fHu+99x6SkpJgYWEBV1dXREdHY8mSJaisrERwcDASExORlZWFNWvWoKSkBLW1tQgNDcWkSZMAAJs3b8ahQ4fQrl27JhdqIzI1DHaSzaVLl5CTk4PKykpkZmbi5MmTsLOzw4ULF3DgwAHs3r0bNjY2OH36NMLCwnDkyBGsXLkSXl5eeP3115GdnY2QkJAG501LS0NiYiISEhLg4OCAtWvX4sMPP8TatWsRGBiI5ORkaLVazJ07F+vXr4enpyfKysowZcoUuLm5oaioCMeOHcOBAwdgbW2NOXPmyPCnQ9R8DHYymt+7ZQCora2Fo6MjYmJi8Ntvv6Fv376ws7MDcHfBtOzsbEydOlV/rEajQUlJCc6ePYtFixYBAFxdXe/5iU/nzp2Dn58fHBwcAEC/0uHv6/QAQFZWFnJychAZGVmvvh9//BE///wzRo8era9n4sSJ+mWOiZSAwU5G8+c59t8lJibioYce0j+uq6tDcHAwIiIi9I8LCgrg4OAAQRCaXJXS3Ny83rr/Go2mwUXV2tpatG3btl49RUVFaNu2LdavX19vDHNz82Z8t0Ty4V0xZHKGDRuGTz/9FAUFBQDufq7rzJkzAdxdOG3fvn0AgBs3buD8+fMNjvf29sbx48dRXl4OANi6dSs++OADWFhY6D+IpEePHvV+0eTl5SEgIADp6ekYPnw4UlNTodFoUFdX1+RFWSJTw46dTM6wYcPw0ksv4cUXX4QgCLCzs0NsbCwEQcDy5cuxZMkSjB07Fp07d77n3TQ+Pj7IzMzUfx6sm5sbVq1aBRsbGzz22GPw9/fH7t27sW3bNqxZswbvv/8+tFotXnvtNTz++OMAgJ9++gkTJ06Evb093N3d+dGBpChc3ZGISGU4FUNEpDIMdiIilWGwExGpDIOdiEhlGOxERCrDYCciUhkGOxGRyvw/yjLqxL/BCUwAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# evaluate model for test set\n", "class_names=[\"Survived\",\"Died\"]\n", "cm = confusion_matrix(y_test, y_hat_LSVM_Model, labels=[1,0])\n", "df_cm = pd.DataFrame(cm, columns=class_names, index = class_names)\n", "df_cm.index.name = 'Actual'\n", "df_cm.columns.name = 'Predicted'\n", "plt.figure(figsize = (6,5))\n", "sns.set(font_scale=1)#for label size\n", "sns.heatmap(df_cm, cmap=\"Blues\", annot=True,annot_kws={\"size\": 16})# font size" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy score for the test set with LSVM=80.00%\n" ] } ], "source": [ "# Accuracy score for test set\n", "from sklearn.metrics import accuracy_score\n", "score = accuracy_score(y_test, y_hat_LSVM_Model)\n", "print(\"Accuracy score for the test set with LSVM={:.2f}%\".format(score*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Submitting to Kaggle" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stupid Baseline (Everyone Dies)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The stupid baseline is based on the majority of *Survived* status. In which case, we will have a rule which states that everybody died in the Titanic. " ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48960
\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 0" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = 0\n", "dfout[:5]\n" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"stupidbaseline.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation for the test.csv" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PassengerId int64\n", "Pclass int64\n", "Name object\n", "Sex object\n", "Age float64\n", "SibSp int64\n", "Parch int64\n", "Ticket object\n", "Fare float64\n", "Cabin object\n", "Embarked object\n", "dtype: object" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanictest.dtypes" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PassengerId int64\n", "Survived int64\n", "dtype: object" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kaggle_testset.dtypes" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PclassSexAgeSibSpParchFareCabinEmbarked
03male34.5007.8292NaNQ
13female47.0107.0000NaNS
\n", "
" ], "text/plain": [ " Pclass Sex Age SibSp Parch Fare Cabin Embarked\n", "0 3 male 34.5 0 0 7.8292 NaN Q\n", "1 3 female 47.0 1 0 7.0000 NaN S" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanictestdf = titanictest.drop([\"PassengerId\",\"Name\",\"Ticket\"],axis=1)\n", "titanictestdf.head(2)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18931
\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 1" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kaggle_testset.head(2)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pclass 0\n", "Sex 0\n", "Age 86\n", "SibSp 0\n", "Parch 0\n", "Fare 1\n", "Cabin 327\n", "Embarked 0\n", "dtype: int64" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List out all variables with nulls/missing values\n", "titanictestdf.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "# fill in the missing age\n", "titanictestdf['Age'].fillna(titanic['Age'].mean(), inplace=True)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PclassSexAgeSibSpParchFareCabinEmbarked
03male34.5007.8292NoQ
13female47.0107.0000NoS
\n", "
" ], "text/plain": [ " Pclass Sex Age SibSp Parch Fare Cabin Embarked\n", "0 3 male 34.5 0 0 7.8292 No Q\n", "1 3 female 47.0 1 0 7.0000 No S" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# fill in the missing cabins\n", "# We will replace the Cabin value with No if missing and Yes if there is a cabin number\n", "titanictestdf['Cabin'].fillna('No', inplace=True)\n", "titanictestdf['Cabin'].replace(regex=r'^((?!No).)*$',value='Yes',inplace=True)\n", "titanictestdf.head(2)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "# fill in the missing fare with the mean fare\n", "titanictestdf['Fare'].fillna(titanic['Fare'].mean(), inplace=True)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pclass 0\n", "Sex 0\n", "Age 0\n", "SibSp 0\n", "Parch 0\n", "Fare 0\n", "Cabin 0\n", "Embarked 0\n", "dtype: int64" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List out all variables with nulls/missing values\n", "titanictestdf.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PclassAgeSibSpParchFareSex_femaleSex_maleCabin_NoCabin_YesEmbarked_CEmbarked_QEmbarked_S
0334.5007.82920110010
1347.0107.00001010001
\n", "
" ], "text/plain": [ " Pclass Age SibSp Parch Fare Sex_female Sex_male Cabin_No \\\n", "0 3 34.5 0 0 7.8292 0 1 1 \n", "1 3 47.0 1 0 7.0000 1 0 1 \n", "\n", " Cabin_Yes Embarked_C Embarked_Q Embarked_S \n", "0 0 0 1 0 \n", "1 0 0 0 1 " ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Encode all the categorical variables\n", "predictdf = pd.get_dummies(titanictestdf,columns=nonnumfeats)\n", "predictdf.head(2)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 3. , 34.5 , 0. , 0. , 7.8292, 0. , 1. ,\n", " 1. , 0. , 0. , 1. , 0. ],\n", " [ 3. , 47. , 1. , 0. , 7. , 1. , 0. ,\n", " 1. , 0. , 0. , 0. , 1. ]])" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Xp = predictdf.values\n", "Xp[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction with Logistics Regression Trained Model" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [], "source": [ "yp_hat_LGR = LogisticRegression_Model.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48961
.........
41313050
41413061
41513070
41613080
41713090
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 1\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 0\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_LGR\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_LogisticRegression.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction with CART Trained Model" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "yp_hat_CART = CART_Model.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28941
38951
48961
.........
41313050
41413061
41513070
41613080
41713091
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 1\n", "3 895 1\n", "4 896 1\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 1\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_CART\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_CART.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction with SVM Trained Model" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "yp_hat_SVM = SVM_Model.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48960
.........
41313050
41413061
41513070
41613080
41713090
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 0\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 0\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_SVM\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_SVM.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction with K-Nearest Neighbour Regression Trained Model" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "yp_hat_KNN = K_NearestNeighbour_Model.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48960
.........
41313050
41413061
41513070
41613080
41713090
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 0\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 0\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_KNN\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_KNN.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction with Naive Bayes Trained Model" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "yp_hat_NB = NaiveBayes_Model.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18931
28940
38950
48961
.........
41313050
41413061
41513070
41613080
41713090
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 1\n", "2 894 0\n", "3 895 0\n", "4 896 1\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 0\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_NB\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_NB.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction with Stochastic Gradient Descent Trained Model" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "yp_hat_SGD = SGD_Model.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48961
.........
41313050
41413061
41513070
41613080
41713090
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 1\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 0\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_SGD\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_SGD.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting with Gradient Boosting Classifier" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "yp_hat_GBC = GB_Model.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48960
.........
41313050
41413061
41513070
41613080
41713091
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 0\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 1\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_GBC\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_GBC.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting with Random Forest" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "yp_hat_RF = random_forest.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38951
48960
.........
41313050
41413061
41513070
41613080
41713091
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 1\n", "4 896 0\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 1\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_RF\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_RF.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting with Perceptron" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [], "source": [ "yp_hat_Per = perceptron.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48960
.........
41313050
41413061
41513070
41613080
41713090
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 0\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 0\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_Per\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_Perceptron.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting with Linear SVM" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "yp_hat_LSVM = linear_svc.predict(Xp)" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvived
08920
18930
28940
38950
48960
.........
41313050
41413061
41513070
41613080
41713090
\n", "

418 rows × 2 columns

\n", "
" ], "text/plain": [ " PassengerId Survived\n", "0 892 0\n", "1 893 0\n", "2 894 0\n", "3 895 0\n", "4 896 0\n", ".. ... ...\n", "413 1305 0\n", "414 1306 1\n", "415 1307 0\n", "416 1308 0\n", "417 1309 0\n", "\n", "[418 rows x 2 columns]" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfout = pd.DataFrame() \n", "dfout[[\"PassengerId\"]] = titanictest[[\"PassengerId\"]]\n", "dfout[\"Survived\"] = yp_hat_LSVM\n", "dfout[:418]" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "dfout.to_csv(\"Prediction_LSVM.csv\",index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This project is to predict the who would survive in the titanic, as the ouput is either survived or died, it is a classification task. A regression task would be if the output is scalable, like prices or percentage, where data is on a scale rather than like a yes or no. So unless this task is changed in such a way to predict the chances of survivalbility then, it would be a regression task. Otherwise, it is a classification task.\n", "\n", "After running different models, the better performing ones are Linear Regression and Gradient Boosting Classifier\n", "On Kaggle, my highest score was a 0.77 which means my prediction are 77% correct.\n", "\n", "For the data, I removed the names, id and ticket when training as they should not be the factors of survivability. I did not do any special modifications to the features as I did not see a point in doing so.\n", "\n", "For learning algorithms, I felt that the best way to go about it was to use as many as possible and then sifting out the better ones to use. So for example, in my tests, Gradient Boosting Classifier and Logistic Regression were among the top when it come's to accuracy of prediction and hence, these were my choices if I needed to make a prediction. 
However, I trained this many models so that I could see which are more accurate.\n", "\n", "As for hyperparameters, I did not do very detailed tuning; for example, I only tuned the K value of the K-NN model and changed the learning rate for the Gradient Boosting Classifier.\n", "\n", "I checked my results against Kaggle because the accuracy check I implemented in the Jupyter notebook is not as reliable as the one on Kaggle. Compared to a baseline assumption that everyone died, my best model was about 15 percentage points more accurate. The baseline got a score of 0.62 whereas mine was 0.77.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "202px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }