# Section 1-2 - Creating Dummy Variables

In previous sections, we replaced the categorical values {C, S, Q} in the column Embarked with the numerical values {1, 2, 3}. The numerical values, however, carry a notion of ordering that the categories do not have: the ports of embarkation have no natural order. To get around this problem, we introduce the concept of dummy variables.

## Pandas - Extracting data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]

## Pandas - Cleaning data

In [2]:
df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)

age_mean = df_train['Age'].mean()
df_train['Age'] = df_train['Age'].fillna(age_mean)

df_train['Embarked'] = df_train['Embarked'].fillna('S')

As the column Sex has only two unique values, mapping them to 0 and 1 raises no problem of ordering.

In [3]:
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
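One thing to watch with `map` is that any value missing from the dictionary silently becomes NaN instead of raising an error. A minimal sketch on a toy Series (the values are illustrative and not taken from data.csv):

```python
import pandas as pd

# Toy column; 'Male' with a capital letter is a deliberate typo for illustration.
sex = pd.Series(['male', 'female', 'Male'])

mapping = {'female': 0, 'male': 1}
encoded = sex.map(mapping)

# Values not found in the dictionary are mapped to NaN.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print(n_unmapped)
```

Counting the NaN values after mapping is a cheap check that every label was covered.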

For the column Embarked, however, replacing {C, S, Q} with {1, 2, 3} would imply the ordering C < S < Q, when in fact the three ports have no natural order.
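To make the issue concrete, the sketch below compares distances between ports under the integer codes with distances under 0/1 indicator vectors of the kind introduced next. The dictionaries are illustrative and are not part of the notebook's pipeline.

```python
import numpy as np

# Integer codes as described in the text: C -> 1, S -> 2, Q -> 3.
codes = {'C': 1, 'S': 2, 'Q': 3}

# One 0/1 indicator vector per port, in the column order C, Q, S.
one_hot = {
    'C': np.array([1, 0, 0]),
    'Q': np.array([0, 1, 0]),
    'S': np.array([0, 0, 1]),
}

# With integer codes, C and Q appear twice as far apart as C and S.
dist_codes_CS = abs(codes['C'] - codes['S'])
dist_codes_CQ = abs(codes['C'] - codes['Q'])

# With indicator vectors, every pair of ports is equally far apart.
dist_hot_CS = np.linalg.norm(one_hot['C'] - one_hot['S'])
dist_hot_CQ = np.linalg.norm(one_hot['C'] - one_hot['Q'])

print(dist_codes_CS, dist_codes_CQ)
print(dist_hot_CS, dist_hot_CQ)
```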

To avoid this problem, we create dummy variables. This involves creating one new column per category: for example, a column Embarked_C that takes the value 1 if the passenger embarked at C and 0 otherwise. Pandas has a built-in function to create these columns automatically.

In [4]:
pd.get_dummies(df_train['Embarked'], prefix='Embarked').head(10)

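|   | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 |
| 8 | 0 | 0 | 1 |
| 9 | 1 | 0 | 0 |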
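As a self-contained illustration of what `get_dummies` returns, here is a small sketch on a toy Series. In recent pandas versions the dummy columns are boolean by default, so the sketch passes `dtype=int` to obtain 0/1 values like those shown above; that argument is an addition and does not appear in the original call.

```python
import pandas as pd

# Toy ports of embarkation (illustrative values).
embarked = pd.Series(['S', 'C', 'S', 'Q'])

# dtype=int requests 0/1 columns instead of booleans.
dummies = pd.get_dummies(embarked, prefix='Embarked', dtype=int)

# One column per category, sorted by category label.
print(dummies.columns.tolist())

# Each row has exactly one indicator set to 1.
row_sums = dummies.sum(axis=1).tolist()
print(row_sums)
```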


We now concatenate the columns containing the dummy variables to our main dataframe.

In [5]:
df_train = pd.concat([df_train, pd.get_dummies(df_train['Embarked'], prefix='Embarked')], axis=1)

**Exercise**

- Write the code to create dummy variables for the column Sex.

The dummy columns now carry all the information in the original Embarked column, so we drop it.

In [6]:
df_train = df_train.drop(['Embarked'], axis=1)

We review our processed training data.

In [7]:
df_train.head(10)

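|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 |
| 5 | 6 | 0 | 3 | 1 | 30.030531 | 0 | 0 | 8.4583 | 0 | 1 | 0 |
| 6 | 7 | 0 | 1 | 1 | 54.0 | 0 | 0 | 51.8625 | 0 | 0 | 1 |
| 7 | 8 | 0 | 3 | 1 | 2.0 | 3 | 1 | 21.075 | 0 | 0 | 1 |
| 8 | 9 | 1 | 3 | 0 | 27.0 | 0 | 2 | 11.1333 | 0 | 0 | 1 |
| 9 | 10 | 1 | 2 | 0 | 14.0 | 1 | 0 | 30.0708 | 1 | 0 | 0 |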


Finally, we separate the feature columns (Pclass onwards) from the target column Survived.

In [8]:
X_train = df_train.iloc[:, 2:].values
y_train = df_train['Survived']

## Scikit-learn - Training the model

In [9]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
model = model.fit(X_train, y_train)

## Scikit-learn - Making predictions

In [10]:
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

df_test['Age'] = df_test['Age'].fillna(age_mean)
df_test['Embarked'] = df_test['Embarked'].fillna('S')

df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})

Similarly, we create dummy variables for the test data.

In [11]:
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')], axis=1)
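One caveat: `get_dummies` only creates columns for categories that actually occur in the data, so a test set lacking one port would end up with fewer columns than the training set. A possible safeguard, sketched here on toy data with hypothetical names (`train_dummies`, `test_dummies`, `test_aligned`), is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Toy training data containing all three ports.
train_dummies = pd.get_dummies(pd.Series(['C', 'Q', 'S']), prefix='Embarked', dtype=int)

# This toy test set has no passenger who embarked at Q.
test_dummies = pd.get_dummies(pd.Series(['S', 'C', 'S']), prefix='Embarked', dtype=int)

# Add any missing columns (filled with 0) and use the training column order.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies.columns.tolist())
print(test_aligned.columns.tolist())
```

Whether this step is needed depends on the data; when every port appears in both splits, the columns already match.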

In [12]:
df_test = df_test.drop(['Embarked'], axis=1)

X_test = df_test.iloc[:, 2:].values  # NumPy array, matching the format of X_train
y_test = df_test['Survived']

y_prediction = model.predict(X_test)

## Evaluation

In [13]:
np.sum(y_prediction == y_test) / float(len(y_test))

0.83798882681564246
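The same fraction of correct predictions can also be computed with scikit-learn's `accuracy_score`. A small sketch with made-up labels (not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Manual computation, in the same form as the cell above.
manual = np.sum(y_pred == y_true) / float(len(y_true))

# Equivalent computation using scikit-learn.
via_sklearn = accuracy_score(y_true, y_pred)

print(manual, via_sklearn)
```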