{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section 1-1 - Filling-in Missing Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous section, we ended up with a smaller set of predictions because we chose to throw away rows with missing values. We build on this approach in this section by filling in the missing data with an educated guess."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will only provide detailed descriptions on new concepts introduced."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas - Extracting data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "df = pd.read_csv('data.csv')\n",
    "\n",
    "df_train = df.iloc[:712, :]\n",
    "df_test = df.iloc[712:, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas - Cleaning data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to the previous section, we review the data type and value counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 712 entries, 0 to 711\n",
      "Data columns (total 9 columns):\n",
      "PassengerId    712 non-null int64\n",
      "Survived       712 non-null int64\n",
      "Pclass         712 non-null int64\n",
      "Sex            712 non-null object\n",
      "Age            565 non-null float64\n",
      "SibSp          712 non-null int64\n",
      "Parch          712 non-null int64\n",
      "Fare           712 non-null float64\n",
      "Embarked       711 non-null object\n",
      "dtypes: float64(2), int64(5), object(2)\n",
      "memory usage: 55.6+ KB\n"
     ]
    }
   ],
   "source": [
    "df_train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a number of ways that we could fill in the NaN values of the column Age. For simplicity, we'll do so by taking the average, or mean, of values of each column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "age_mean = df_train['Age'].mean()\n",
    "df_train['Age'] = df_train['Age'].fillna(age_mean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise**\n",
    "\n",
    "- Write the code to replace the NaN values by the median, instead of the mean."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Taking the average does not make sense for the column Embarked, as it is a categorical value. Instead, we shall replace the NaN values by the mode, or most frequently occurring value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({nan: 1, 'C': 138, 'Q': 64, 'S': 509})"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from collections import Counter\n",
    "\n",
    "Counter(df_train['Embarked'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df_train['Embarked'] = df_train['Embarked'].fillna('S')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})\n",
    "df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now review details of our training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 712 entries, 0 to 711\n",
      "Data columns (total 9 columns):\n",
      "PassengerId    712 non-null int64\n",
      "Survived       712 non-null int64\n",
      "Pclass         712 non-null int64\n",
      "Sex            712 non-null int64\n",
      "Age            712 non-null float64\n",
      "SibSp          712 non-null int64\n",
      "Parch          712 non-null int64\n",
      "Fare           712 non-null float64\n",
      "Embarked       712 non-null int64\n",
      "dtypes: float64(2), int64(7)\n",
      "memory usage: 55.6 KB\n"
     ]
    }
   ],
   "source": [
    "df_train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hence have we have preserved all the rows of our data set, and proceed to create a numerical array for Scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train = df_train.iloc[:, 2:].values\n",
    "y_train = df_train['Survived']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn - Training the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "model = RandomForestClassifier(n_estimators = 100, random_state=0)\n",
    "model = model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn - Making predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now review what needs to be cleaned in the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 179 entries, 712 to 890\n",
      "Data columns (total 12 columns):\n",
      "PassengerId    179 non-null int64\n",
      "Survived       179 non-null int64\n",
      "Pclass         179 non-null int64\n",
      "Name           179 non-null object\n",
      "Sex            179 non-null object\n",
      "Age            149 non-null float64\n",
      "SibSp          179 non-null int64\n",
      "Parch          179 non-null int64\n",
      "Ticket         179 non-null object\n",
      "Fare           179 non-null float64\n",
      "Cabin          42 non-null object\n",
      "Embarked       178 non-null object\n",
      "dtypes: float64(2), int64(5), object(5)\n",
      "memory usage: 18.2+ KB\n"
     ]
    }
   ],
   "source": [
    "df_test.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As per our previous approach, we fill in the NaN values in the column Age and Embarked with the mean and mode respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df_test['Age'] = df_test['Age'].fillna(age_mean)\n",
    "df_test['Embarked'] = df_test['Embarked'].fillna('S')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 179 entries, 712 to 890\n",
      "Data columns (total 9 columns):\n",
      "PassengerId    179 non-null int64\n",
      "Survived       179 non-null int64\n",
      "Pclass         179 non-null int64\n",
      "Sex            179 non-null object\n",
      "Age            179 non-null float64\n",
      "SibSp          179 non-null int64\n",
      "Parch          179 non-null int64\n",
      "Fare           179 non-null float64\n",
      "Embarked       179 non-null object\n",
      "dtypes: float64(2), int64(5), object(2)\n",
      "memory usage: 14.0+ KB\n"
     ]
    }
   ],
   "source": [
    "df_test.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)\n",
    "df_test['Embarked'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})\n",
    "\n",
    "X_test = df_test.iloc[:, 2:]\n",
    "y_test = df_test['Survived']\n",
    "\n",
    "y_prediction = model.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, we calculate the model's accuracy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.81564245810055869"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum(y_prediction == y_test) / float(len(y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While this is slightly less than our previous approach, our current approach preserves the number of predictions to be made."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "179"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More importantly, all the training data was used to train our model. By ignoring rows with missing value, we are essentially throwing away information that can be used."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}