{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Section 1-1 - Filling in Missing Values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous section, we ended up with a smaller set of predictions because we chose to throw away rows with missing values. We build on this approach in this section by filling in the missing data with an educated guess." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We provide detailed descriptions only for newly introduced concepts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas - Extracting data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv('data.csv')\n", "\n", "df_train = df.iloc[:712, :]\n", "df_test = df.iloc[712:, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas - Cleaning data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As in the previous section, we review the data types and non-null counts."
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 712 entries, 0 to 711\n", "Data columns (total 9 columns):\n", "PassengerId 712 non-null int64\n", "Survived 712 non-null int64\n", "Pclass 712 non-null int64\n", "Sex 712 non-null object\n", "Age 565 non-null float64\n", "SibSp 712 non-null int64\n", "Parch 712 non-null int64\n", "Fare 712 non-null float64\n", "Embarked 711 non-null object\n", "dtypes: float64(2), int64(5), object(2)\n", "memory usage: 55.6+ KB\n" ] } ], "source": [ "df_train.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a number of ways to fill in the NaN values of the column Age. For simplicity, we replace them with the average, or mean, of the column's known values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "age_mean = df_train['Age'].mean()\n", "df_train['Age'] = df_train['Age'].fillna(age_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**\n", "\n", "- Write the code to replace the NaN values with the median instead of the mean." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Taking the average does not make sense for the column Embarked, as it is a categorical variable. Instead, we replace the NaN values with the mode, or most frequently occurring value."
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Counter({nan: 1, 'C': 138, 'Q': 64, 'S': 509})" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "Counter(df_train['Embarked'])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train['Embarked'] = df_train['Embarked'].fillna('S')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})\n", "df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now review details of our training data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 712 entries, 0 to 711\n", "Data columns (total 9 columns):\n", "PassengerId 712 non-null int64\n", "Survived 712 non-null int64\n", "Pclass 712 non-null int64\n", "Sex 712 non-null int64\n", "Age 712 non-null float64\n", "SibSp 712 non-null int64\n", "Parch 712 non-null int64\n", "Fare 712 non-null float64\n", "Embarked 712 non-null int64\n", "dtypes: float64(2), int64(7)\n", "memory usage: 55.6 KB\n" ] } ], "source": [ "df_train.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have thus preserved all the rows of our data set, and we proceed to create a numerical array for Scikit-learn."
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train = df_train.iloc[:, 2:].values\n", "y_train = df_train['Survived']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn - Training the model" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "model = RandomForestClassifier(n_estimators=100, random_state=0)\n", "model = model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn - Making predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now review what needs to be cleaned in the test data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 179 entries, 712 to 890\n", "Data columns (total 12 columns):\n", "PassengerId 179 non-null int64\n", "Survived 179 non-null int64\n", "Pclass 179 non-null int64\n", "Name 179 non-null object\n", "Sex 179 non-null object\n", "Age 149 non-null float64\n", "SibSp 179 non-null int64\n", "Parch 179 non-null int64\n", "Ticket 179 non-null object\n", "Fare 179 non-null float64\n", "Cabin 42 non-null object\n", "Embarked 178 non-null object\n", "dtypes: float64(2), int64(5), object(5)\n", "memory usage: 18.2+ KB\n" ] } ], "source": [ "df_test.info()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As per our previous approach, we fill in the NaN values in the columns Age and Embarked with the mean and mode, respectively, both computed from the training data."
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_test['Age'] = df_test['Age'].fillna(age_mean)\n", "df_test['Embarked'] = df_test['Embarked'].fillna('S')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 179 entries, 712 to 890\n", "Data columns (total 9 columns):\n", "PassengerId 179 non-null int64\n", "Survived 179 non-null int64\n", "Pclass 179 non-null int64\n", "Sex 179 non-null object\n", "Age 179 non-null float64\n", "SibSp 179 non-null int64\n", "Parch 179 non-null int64\n", "Fare 179 non-null float64\n", "Embarked 179 non-null object\n", "dtypes: float64(2), int64(5), object(2)\n", "memory usage: 14.0+ KB\n" ] } ], "source": [ "df_test.info()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)\n", "df_test['Embarked'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})\n", "\n", "X_test = df_test.iloc[:, 2:]\n", "y_test = df_test['Survived']\n", "\n", "y_prediction = model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we calculate the model's accuracy:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.81564245810055869" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(y_prediction == y_test) / float(len(y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While this accuracy is slightly lower than that of our previous approach, our current approach keeps every row of the test data, so no predictions are lost."
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "179" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More importantly, all the training data was used to train our model. By ignoring rows with missing values, we would essentially be throwing away information that could be used." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }