{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Section 1-2 - Creating Dummy Variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In previous sections, we replaced the categorical values {C, S, Q} in the column Embarked with the numerical values {1, 2, 3}. The numbers, however, carry a notion of ordering that the port codes do not have: the codes are merely labels with no natural order. To get around this problem, we introduce the concept of dummy variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas - Extracting data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv('data.csv')\n", "\n", "# Use the first 712 rows for training and the remaining rows for testing\n", "df_train = df.iloc[:712, :]\n", "df_test = df.iloc[712:, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas - Cleaning data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n", "\n", "# Fill missing ages with the mean age of the training set\n", "age_mean = df_train['Age'].mean()\n", "df_train['Age'] = df_train['Age'].fillna(age_mean)\n", "\n", "# Fill missing embarkation ports with 'S'\n", "df_train['Embarked'] = df_train['Embarked'].fillna('S')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the column Sex has only two unique values, encoding it as 0 and 1 introduces no ordering problem." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the column Embarked, however, replacing {C, S, Q} with {1, 2, 3} would seem to imply the ordering C < S < Q, when in fact the three ports have no natural order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To avoid this problem, we create dummy variables. 
Essentially, this involves creating one new column per port: the column for C, for example, holds 1 if the passenger embarked at C and 0 otherwise. Pandas has a built-in function, get_dummies, that creates these columns automatically." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Embarked_C</th><th>Embarked_Q</th><th>Embarked_S</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>5</th><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>6</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>7</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>8</th><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>9</th><td>1</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Embarked_C Embarked_Q Embarked_S\n", "0 0 0 1\n", "1 1 0 0\n", "2 0 0 1\n", "3 0 0 1\n", "4 0 0 1\n", "5 0 1 0\n", "6 0 0 1\n", "7 0 0 1\n", "8 0 0 1\n", "9 1 0 0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(df_train['Embarked'], prefix='Embarked').head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now concatenate the columns containing the dummy variables to our main dataframe." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train = pd.concat([df_train, pd.get_dummies(df_train['Embarked'], prefix='Embarked')], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**\n", "\n", "- Write the code to create dummy variables for the column Sex." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train = df_train.drop(['Embarked'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We review our processed training data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th><th>Embarked_C</th><th>Embarked_Q</th><th>Embarked_S</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>1</td><td>22.000000</td><td>1</td><td>0</td><td>7.2500</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>0</td><td>38.000000</td><td>1</td><td>0</td><td>71.2833</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>0</td><td>26.000000</td><td>0</td><td>0</td><td>7.9250</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>0</td><td>35.000000</td><td>1</td><td>0</td><td>53.1000</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>1</td><td>35.000000</td><td>0</td><td>0</td><td>8.0500</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>5</th><td>6</td><td>0</td><td>3</td><td>1</td><td>30.030531</td><td>0</td><td>0</td><td>8.4583</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>6</th><td>7</td><td>0</td><td>1</td><td>1</td><td>54.000000</td><td>0</td><td>0</td><td>51.8625</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>7</th><td>8</td><td>0</td><td>3</td><td>1</td><td>2.000000</td><td>3</td><td>1</td><td>21.0750</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>8</th><td>9</td><td>1</td><td>3</td><td>0</td><td>27.000000</td><td>0</td><td>2</td><td>11.1333</td><td>0</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>9</th><td>10</td><td>1</td><td>2</td><td>0</td><td>14.000000</td><td>1</td><td>0</td><td>30.0708</td><td>1</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " PassengerId Survived Pclass Sex Age SibSp Parch Fare \\\n", "0 1 0 3 1 22.000000 1 0 7.2500 \n", "1 2 1 1 0 38.000000 1 0 71.2833 \n", "2 3 1 3 0 26.000000 0 0 7.9250 \n", "3 4 1 1 0 35.000000 1 0 53.1000 \n", "4 5 0 3 1 35.000000 0 0 8.0500 \n", "5 6 0 3 1 30.030531 0 0 8.4583 \n", "6 7 0 1 1 54.000000 0 0 51.8625 \n", "7 8 0 3 1 2.000000 3 1 21.0750 \n", "8 9 1 3 0 27.000000 0 2 11.1333 \n", "9 10 1 2 0 14.000000 1 0 30.0708 \n", "\n", " Embarked_C Embarked_Q Embarked_S \n", "0 0 0 1 \n", "1 1 0 0 \n", "2 0 0 1 \n", "3 0 0 1 \n", "4 0 0 1 \n", "5 0 1 0 \n", "6 0 0 1 \n", "7 0 0 1 \n", "8 0 0 1 \n", "9 1 0 0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head(10)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train = df_train.iloc[:, 2:].values\n", "y_train = df_train['Survived']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn - Training the model" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "model = RandomForestClassifier(n_estimators=100, random_state=0)\n", "model = model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn - Making predictions" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n", "\n", "df_test['Age'] = df_test['Age'].fillna(age_mean)\n", "df_test['Embarked'] = df_test['Embarked'].fillna('S')\n", "\n", "df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly we create dummy variables for the test data." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')], axis=1)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_test = df_test.drop(['Embarked'], axis=1)\n", "\n", "# Use the same column slice and array format as X_train\n", "X_test = df_test.iloc[:, 2:].values\n", "y_test = df_test['Survived']\n", "\n", "y_prediction = model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.83798882681564246" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fraction of test passengers whose survival was predicted correctly\n", "np.sum(y_prediction == y_test) / float(len(y_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }