{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Section 1-2 - Creating Dummy Variables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In previous sections, we replaced the categorical values {C, S, Q} in the column Embarked with the numerical values {1, 2, 3}. The numbers, however, carry an ordering that the categories do not have: the ports are just labels, and nothing makes one port greater than another. To get around this problem, we introduce dummy variables."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas - Extracting data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"df = pd.read_csv('data.csv')\n",
"\n",
"# The first 712 rows form the training set; the remaining rows form the test set\n",
"df_train = df.iloc[:712, :]\n",
"df_test = df.iloc[712:, :]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas - Cleaning data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Drop columns that are not used as features\n",
"df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n",
"\n",
"# Fill missing ages with the mean age of the training set\n",
"age_mean = df_train['Age'].mean()\n",
"df_train['Age'] = df_train['Age'].fillna(age_mean)\n",
"\n",
"# Fill missing embarkation ports with 'S'\n",
"df_train['Embarked'] = df_train['Embarked'].fillna('S')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the column Sex has only two unique values, encoding them as 0 and 1 introduces no spurious ordering."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the column Embarked, however, replacing {C, S, Q} with {1, 2, 3} would imply the ordering C < S < Q, when in fact the three ports have no natural order."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To avoid this problem, we create dummy variables: one new column per category, holding 1 if the passenger belongs to that category and 0 otherwise. For example, the column Embarked_C is 1 if the passenger embarked at C and 0 if not. Pandas has a built-in function, `get_dummies`, that creates these columns automatically."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"   Embarked_C  Embarked_Q  Embarked_S\n",
"0           0           0           1\n",
"1           1           0           0\n",
"2           0           0           1\n",
"3           0           0           1\n",
"4           0           0           1\n",
"5           0           1           0\n",
"6           0           0           1\n",
"7           0           0           1\n",
"8           0           0           1\n",
"9           1           0           0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.get_dummies(df_train['Embarked'], prefix='Embarked').head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now concatenate the columns containing the dummy variables to our main dataframe."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df_train = pd.concat([df_train, pd.get_dummies(df_train['Embarked'], prefix='Embarked')], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**\n",
"\n",
"- Write the code to create dummy variables for the column Sex."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df_train = df_train.drop(['Embarked'], axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We review our processed training data."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"   PassengerId  Survived  Pclass  Sex        Age  SibSp  Parch     Fare  \\\n",
"0            1         0       3    1  22.000000      1      0   7.2500  \n",
"1            2         1       1    0  38.000000      1      0  71.2833  \n",
"2            3         1       3    0  26.000000      0      0   7.9250  \n",
"3            4         1       1    0  35.000000      1      0  53.1000  \n",
"4            5         0       3    1  35.000000      0      0   8.0500  \n",
"5            6         0       3    1  30.030531      0      0   8.4583  \n",
"6            7         0       1    1  54.000000      0      0  51.8625  \n",
"7            8         0       3    1   2.000000      3      1  21.0750  \n",
"8            9         1       3    0  27.000000      0      2  11.1333  \n",
"9           10         1       2    0  14.000000      1      0  30.0708  \n",
"\n",
"   Embarked_C  Embarked_Q  Embarked_S  \n",
"0           0           0           1  \n",
"1           1           0           0  \n",
"2           0           0           1  \n",
"3           0           0           1  \n",
"4           0           0           1  \n",
"5           0           1           0  \n",
"6           0           0           1  \n",
"7           0           0           1  \n",
"8           0           0           1  \n",
"9           1           0           0  "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Features: every column after PassengerId and Survived\n",
"X_train = df_train.iloc[:, 2:].values\n",
"# Target: the Survived label\n",
"y_train = df_train['Survived']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-learn - Training the model"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"model = RandomForestClassifier(n_estimators=100, random_state=0)\n",
"model = model.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scikit-learn - Making predictions"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n",
"\n",
"# Reuse the mean age computed from the training set\n",
"df_test['Age'] = df_test['Age'].fillna(age_mean)\n",
"df_test['Embarked'] = df_test['Embarked'].fillna('S')\n",
"\n",
"df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly, we create dummy variables for the test data."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')], axis=1)"
]
},
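{
"cell_type": "markdown",
"metadata": {},
"source": [
"A caveat worth sketching: `pd.get_dummies` only creates columns for the categories it actually sees, so a test split that happened to lack one port would end up with fewer columns than the training data. The toy cell that follows is an illustrative addition, not part of the original workflow. The Series `train_port` and `test_port` are made-up data, and reindexing against the training columns is one possible safeguard, assuming a missing category should be encoded as a column of zeros."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up data for illustration: the toy test split contains no 'Q'\n",
"train_port = pd.Series(['S', 'C', 'Q', 'S'])\n",
"test_port = pd.Series(['S', 'S', 'C'])\n",
"\n",
"train_dummies = pd.get_dummies(train_port, prefix='Embarked')\n",
"test_dummies = pd.get_dummies(test_port, prefix='Embarked')\n",
"\n",
"# Align the test columns with the training columns; absent categories become all-zero columns\n",
"test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)\n",
"list(test_dummies.columns)"
]
},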
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df_test = df_test.drop(['Embarked'], axis=1)\n",
"\n",
"# Convert to a NumPy array, matching how X_train was built\n",
"X_test = df_test.iloc[:, 2:].values\n",
"y_test = df_test['Survived']\n",
"\n",
"y_prediction = model.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.83798882681564246"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sum(y_prediction == y_test) / float(len(y_test))"
]
}
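,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The expression above is the classification accuracy: the fraction of test passengers whose predicted label matches the true one. As a cross-check added to the original notebook, scikit-learn's `accuracy_score` should give the same value, assuming `y_test` and `y_prediction` are defined as in the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Same quantity as the manual calculation: number of matches divided by number of test rows\n",
"accuracy_score(y_test, y_prediction)"
]
}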
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}